[mm-unstable,v7,00/18] mm: userspace hugepage collapse

Message ID: 20220706235936.2197195-1-zokeefe@google.com

Message

Zach O'Keefe July 6, 2022, 11:59 p.m. UTC
v7 Forward
--------------------------------

The major changes to v7 over v6[1] are:

1.  mm_find_pmd() refactoring has been extended, and it now returns the raw
pmd_t* without additional checks (which was its original behavior).  For
MADV_COLLAPSE, we've tightened up our use of it, and now check whether we've
raced with khugepaged when collapsing (Yang Shi).

2.  errno return values have been changed, and now deviate from madvise
convention in some places.  Most notably, this is to allow ENOMEM to mean
"memory allocation failed" to the user - the most important case being THP
allocation failure.

3.  We no longer do lru_add_drain() + lru_add_drain_all() if collapse fails
because pages aren't found on the LRU.  This has been simplified, and we now
just do a single lru_add_drain_all() upfront (Yang Shi).

4.  struct collapse_control has been further simplified, and all flags
controlling collapse behavior are now squashed into a single .is_hugepaged
flag.  We also now kmalloc() this structure in MADV_COLLAPSE context.

5.  Rebased on top of Yang Shi's "Cleanup transhuge_xxx helpers" series
[2] as well as Miaohe Lin's "A few cleanup patches for khugepaged" series
[3] which caused some refactoring and allowed for some nice
simplifications - most notably the VMA (re)validation checks.

6.  A new /proc/<pid>/smaps field, PMDMappable, has been added to inform
userspace which VMAs are eligible for MADV_COLLAPSE.

7.  A tracepoint was added to assist with MADV_COLLAPSE debugging.

8.  Selftest coverage has been tightened up and now covers collapsing
multiple hugepage-sized regions.

See the Changelog for more details.

v6 Forward
--------------------------------

v6 improves on v5[4] in 3 major ways:

1.  Changed MADV_COLLAPSE eligibility semantics.  In v5, MADV_COLLAPSE
ignored khugepaged max_ptes_* sysfs settings, as well as all sysfs defrag
settings.  v6 takes this further by also decoupling MADV_COLLAPSE from
sysfs enabled setting.  MADV_COLLAPSE can now initiate a collapse of memory
into THPs in "madvise" and "never" mode, and doesn't ever require
VM_HUGEPAGE.  MADV_COLLAPSE retains its adherence to not operating on
VM_NOHUGEPAGE-marked VMAs.

2.  Thanks to a patch by Yang Shi to remove UMA hugepage preallocation,
hugepage allocation in khugepaged is independent of CONFIG_NUMA.  This
allows us to reuse all the allocation codepaths between collapse contexts,
greatly simplifying struct collapse_control.  Redundant khugepaged
heuristic flags have also been merged into a new enforce_page_heuristics
flag.

3.  Using MADV_COLLAPSE's new eligibility semantics, the hacks in the
selftests to disable khugepaged are no longer necessary, since we can test
MADV_COLLAPSE in "never" THP mode to prevent khugepaged interaction.

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was introduced by David Rientjes[5].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

process_madvise(2)

	Performs a synchronous collapse of the native pages
	mapped by the list of iovecs into transparent hugepages.

	This operation is independent of the system THP sysfs settings,
	but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.

	THP allocation may enter direct reclaim and/or compaction.

	When a range spans multiple VMAs, the semantics of the collapse
	over each VMA are independent of the others.

	Caller must have CAP_SYS_ADMIN if not acting on self.

	Return value follows existing process_madvise(2) conventions.  A
	“success” indicates that all hugepage-sized/aligned regions
	covered by the provided range were either successfully
	collapsed, or were already pmd-mapped THPs.

madvise(2)

	Equivalent to process_madvise(2) on self, with 0 returned on
	“success”.
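
For illustration only, here is a hedged userspace sketch of the madvise(2)
form; the MADV_COLLAPSE value and the 2MiB hugepage size below are
assumptions, since installed uapi headers may not yet carry the definition
added by this series:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumed value; check uapi mman-common.h */
#endif

#define HPAGE_SIZE (2UL << 20)	/* assumed 2MiB PMD-sized THPs */

int main(void)
{
	void *buf;

	/* Hugepage-aligned, hugepage-sized anonymous allocation. */
	if (posix_memalign(&buf, HPAGE_SIZE, HPAGE_SIZE))
		return 1;

	/* Touch the range so at least one page is resident. */
	memset(buf, 1, HPAGE_SIZE);

	/* Synchronously collapse the range into a PMD-mapped THP. */
	if (madvise(buf, HPAGE_SIZE, MADV_COLLAPSE))
		fprintf(stderr, "MADV_COLLAPSE: %s\n", strerror(errno));

	free(buf);
	return 0;
}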

Current Use-Cases
--------------------------------

(1)	Immediately back executable text by THPs.  With the current support
	provided by CONFIG_READ_ONLY_THP_FOR_FS, this may take a long time
	on a large system, which might keep services from serving at their
	full rated load after (re)starting.  Tricks like mremap(2)'ing text
	onto anonymous memory to immediately realize iTLB performance
	prevent page sharing and demand paging, both of which increase
	steady-state memory footprint.  With MADV_COLLAPSE, we get the best
	of both worlds: peak upfront performance and a lower RAM footprint.
	Note that subsequent support for file-backed memory is required
	here.

(2)	malloc() implementations that manage memory in hugepage-sized
	chunks, but sometimes subrelease memory back to the system in
	native-sized chunks via MADV_DONTNEED, zapping the pmd.  Later,
	when the memory is hot, the implementation could
	madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
	hugepage coverage and dTLB performance.  TCMalloc is such an
	implementation that could benefit from this[6] (a sketch of this
	pattern appears after this list).  A prior study of Google internal
	workloads during evaluation of Temeraire, a hugepage-aware
	enhancement to TCMalloc, showed that nearly 20% of all cpu cycles
	were spent in dTLB stalls, and that increasing hugepage coverage by
	even a small amount can help with that[7].

(3)	userfaultfd-based live migration of virtual machines satisfies UFFD
	faults by fetching native-sized pages over the network (to avoid the
	latency of transferring an entire hugepage).  However, after guest
	memory has been fully copied to the new host, MADV_COLLAPSE can
	be used to immediately increase guest performance.  Note that
	subsequent support for file/shmem-backed memory is required here.

(4)	HugeTLB high-granularity mapping allows a HugeTLB page to be
	mapped at different levels in the page tables[8].  As it's not
	"transparent" like THP, HugeTLB high-granularity mappings require
	an explicit user API. It is intended that MADV_COLLAPSE be co-opted
	for this use case[9].  Note that subsequent support for HugeTLB
	memory is required here.
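
As a concrete illustration of use-case (2), here is a hedged sketch (not
TCMalloc code; the MADV_COLLAPSE value and the page/hugepage sizes are
assumptions) of the subrelease/recollapse pattern an allocator might use:

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumed value; check uapi mman-common.h */
#endif

#define HPAGE_SIZE	(2UL << 20)	/* assumed 2MiB PMD-sized THPs */
#define NATIVE_PAGE_SIZE	4096UL	/* assumed native page size */

/* Return one cold native page inside a hugepage-sized chunk to the kernel. */
static void subrelease_page(void *chunk, size_t page_off)
{
	/* Zaps the PTEs for that page (splitting any existing PMD mapping). */
	madvise((char *)chunk + page_off, NATIVE_PAGE_SIZE, MADV_DONTNEED);
}

/* Once the chunk is hot again, ask the kernel to re-back it with a THP. */
static int recollapse_chunk(void *chunk)
{
	/* chunk is assumed hugepage-aligned and hugepage-sized. */
	return madvise(chunk, HPAGE_SIZE, MADV_COLLAPSE);
}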

Future work
--------------------------------

Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.

One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps.  For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[10].

Sequence of Patches
--------------------------------
* Patch 1 is a cleanup patch.

* Patch 2 (Yang Shi) removes UMA hugepage preallocation and makes
  khugepaged hugepage allocation independent of CONFIG_NUMA

* Patches 3-8 perform refactoring of collapse logic within khugepaged.c
  and introduce the notion of a collapse context.

* Patch 9 introduces MADV_COLLAPSE and is the main patch in this series.

* Patches 10-13 add additional support: tracepoints, clean-ups,
  process_madvise(2), and /proc/<pid>/smaps output

* Patches 14-18 add selftests.

Applies against mm-unstable

Changelog
--------------------------------
v6 -> v7:
* Added 'mm/khugepaged: remove redundant transhuge_vma_suitable() check'
* 'mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA'
  -> Open-coded khugepaged_alloc_sleep() logic (Peter Xu)
* 'mm/khugepaged: pipe enum scan_result codes back to callers'
  -> Refactored __collapse_huge_page_swapin() to return enum scan_result
  -> A few small cleanups (Yang Shi)
* 'mm/khugepaged: add flag to predicate khugepaged-only behavior'
  -> Renamed from 'mm/khugepaged: add flag to ignore khugepaged heuristics'
  -> The flag is now ".is_hugepaged" (Peter Xu)
* 'mm/khugepaged: add flag to ignore THP sysfs enabled'
  -> Refactored to pass flag to hugepage_vma_check(), and to reuse
     .is_khugepaged flag (Peter Xu)
* 'mm/khugepaged: make allocation semantics context-specific'
  -> !CONFIG_SHMEM bugfix and minor changes (Yang Shi)
  -> Squashed into 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
  -> Removed .gfp member of struct collapse_control.  Instead, use the
     .is_khugepaged member to decide what gfp flags to use.
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Replaced multiple mm_find_pmd() callsites with
     find_pmd_or_thp_or_none() to make sure khugepaged doesn't collapse
     out from under us (Yang Shi)
  -> Added check_pmd_still_valid() helper
  -> Return SCAN_PMD_NULL if pmd_bad() (Yang Shi)
  -> Renamed mm_find_pmd() -> mm_find_pte_pmd()
  -> Renamed mm_find_pmd_raw() -> mm_find_pmd()
  -> Add mm_find_pmd() to split_huge_pmd_address()
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Replace SCAN_PAGE_LRU + lru_add_drain_all() retry logic with single
     lru_add_drain_all() upfront.
  -> errno mapping changes.  Most notably, use ENOMEM when memory
     allocation (most notably, THP allocation) fails.
  -> When !THP, madvise_collapse() and hugepage_madvise() return -EINVAL
     instead of BUG(). (Yang Shi)
* 'tools headers uapi: add MADV_COLLAPSE madvise mode to tools'
  -> Squashed into 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse' (Yang Shi)
* 'mm/khugepaged: rename prefix of shared collapse functions'
  -> Revert change to huge_memory:mm_khugepaged_scan_pmd tracepoint to
     retain ABI. (Yang Shi)
* Added 'mm/madvise: add huge_memory:mm_madvise_collapse tracepoint'
* Added 'proc/smaps: add PMDMappable field to smaps'
* Added 'selftests/vm: dedup hugepage allocation logic'
* Added 'selftests/vm: add selftest to verify multi THP collapse'
* Collected review tags
* Rebased on ??

v5 -> v6:
* Added 'mm: khugepaged: don't carry huge page to the next loop for
  !CONFIG_NUMA'
  (Yang Shi)
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Add a pmd_bad() check for nonhuge pmds (Peter Xu)
* 'mm/khugepaged: dedup and simplify hugepage alloc and charging'
  -> Remove dependency on 'mm/khugepaged: sched to numa node when collapse
     huge page'
  -> No more !NUMA casing
* 'mm/khugepaged: make allocation semantics context-specific'
  -> Renamed from 'mm/khugepaged: make hugepage allocation
     context-specific'
  -> Removed function pointer hooks. (David Rientjes)
  -> Added gfp_t member to control allocation semantics.
* 'mm/khugepaged: add flag to ignore khugepaged heuristics'
  -> Squashed from
     'mm/khugepaged: add flag to ignore khugepaged_max_ptes_*' and
     'mm/khugepaged: add flag to ignore page young/referenced requirement'.
     (David Rientjes)
* Added 'mm/khugepaged: add flag to ignore THP sysfs enabled'
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Use hugepage_vma_check() instead of transparent_hugepage_active()
     to determine vma eligibility.
  -> Only retry collapse once per hugepage if pages aren't found on LRU
  -> Save last failed result for more accurate errno
  -> Refactored loop structure
  -> Renamed labels
* 'selftests/vm: modularize collapse selftests'
  -> Refactored into straightline code and removed loop over contexts.
* 'selftests/vm: add MADV_COLLAPSE collapse context to selftests;
  -> Removed ->init() and ->cleanup() hooks from struct collapse_context()
     (David Rientjes)
  -> MADV_COLLAPSE operates in "never" THP mode to prevent khugepaged
     interaction. Removed all the previous khugepaged hacks.
* Added 'tools headers uapi: add MADV_COLLAPSE madvise mode to tools'
* Rebased on next-20220603

v4 -> v5:
* Fix kernel test robot <lkp@intel.com> errors
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fix khugepaged_alloc_page() UMA definition
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Add "fallthrough" pseudo keyword to fix -Wimplicit-fallthrough

v3 -> v4:
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Dropped pmd_none() check from find_pmd_or_thp_or_none()
  -> Moved SCAN_PMD_MAPPED after SCAN_PMD_NULL
  -> Dropped <lkp@intel.com> from sign-offs
* 'mm/khugepaged: add struct collapse_control'
  -> Updated commit description and some code comments
  -> Removed extra brackets added in khugepaged_find_target_node()
* Added 'mm/khugepaged: dedup hugepage allocation and charging code'
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Has been majorly reworked to replace ->gfp() and ->alloc_hpage()
     struct collapse_control hooks with a ->alloc_charge_hpage() hook
     which makes node-allocation, gfp flags, node scheduling, hpage
     allocation, and accounting/charging context-specific.
  -> Dropped <lkp@intel.com> from sign-offs
* Added 'mm/khugepaged: pipe enum scan_result codes back to callers'
  -> Replaces 'mm/khugepaged: add struct collapse_result'
* Dropped 'mm/khugepaged: add struct collapse_result'
* 'mm/khugepaged: add flag to ignore khugepaged_max_ptes_*'
  -> Moved before 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
* 'mm/khugepaged: add flag to ignore page young/referenced requirement'
  -> Moved before 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Moved struct collapse_control* argument to end of alloc_hpage()
  -> Some refactoring to rebase on top changes to struct
     collapse_control hook changes and other previous commits.
  -> Reworded commit description
  -> Dropped <lkp@intel.com> from sign-offs
* 'mm/khugepaged: rename prefix of shared collapse functions'
  -> Renamed from 'mm/khugepaged: remove khugepaged prefix from shared
     collapse functions'
  -> Instead of dropping "khugepaged_" prefix, replace with
     "hpage_collapse_"
  -> Dropped <lkp@intel.com> from sign-offs
* Rebased onto next-20220502

v2 -> v3:
* Collapse semantics have changed: the gfp flags used for hugepage
  allocation now are independent of khugepaged.
* Cover-letter: add primary use-cases and update description of collapse
  semantics.
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Added .gfp operation to struct collapse_control
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added madvise context .gfp implementation.
  -> Set scan_result appropriately on early exit due to mm exit or vma
     vma revalidation.
  -> Reword patch description
* Rebased onto next-20220426

v1 -> v2:
* Cover-letter clarification and added RFC -> v1 notes
* Fixes issues reported by kernel test robot <lkp@intel.com>
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Fixed mixed code/declarations
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fixed bad function signature in !NUMA && TRANSPARENT_HUGEPAGE configs
  -> Added doc comment to retract_page_tables() for "cc"
* 'mm/khugepaged: add struct collapse_result'
  -> Added doc comment to retract_page_tables() for "cr"
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added MADV_COLLAPSE definitions for alpha, mips, parisc, xtensa
  -> Moved an "#ifdef NUMA" so that khugepaged_find_target_node() is
     defined in !NUMA && TRANSPARENT_HUGEPAGE configs.
* 'mm/khugepaged: remove khugepaged prefix from shared collapse'
  functions
  -> Removed khugepaged prefix from khugepaged_find_target_node on L914
* Rebased onto next-20220414

RFC -> v1:
* The series was significantly reworked from RFC and most patches are
  entirely new or reworked.
* Collapse eligibility criteria has changed: MADV_COLLAPSE now respects
  VM_NOHUGEPAGE.
* Collapse semantics have changed: the gfp flags used for hugepage
  allocation now match those of khugepaged for the same VMA, instead of the
  gfp flags used at fault time by the calling process for that VMA.
* Collapse semantics have changed: The collapse semantics for multiple VMAs
  spanning a single MADV_COLLAPSE call are now independent, whereas before
  the idea was to allow direct reclaim/compaction if any spanned VMA
  permitted it.
* The process_madvise(2) flags, MADV_F_COLLAPSE_LIMITS and
  MADV_F_COLLAPSE_DEFRAG have been removed.
* Implementation change: the RFC implemented collapse over a range of
  hugepages in a batched-fashion with the aim of doing multiple page table
  updates inside a single mmap_lock write.  This has been changed, and the
  implementation now collapses each hugepage-aligned/sized region
  iteratively.  This was motivated by an experiment which showed that, when
  multiple threads were concurrently faulting during a MADV_COLLAPSE
  operation, mean and tail latency to acquire mmap_lock in read for threads
  in the fault path were improved by using a batch size of 1 (batch sizes
  of 1, 8, 16, 32 were tested)[11].
* Added: If a collapse operation fails because a page isn't found on the
  LRU, do a lru_add_drain_all() and retry.
* Added: selftests

[1] https://lore.kernel.org/linux-mm/20220604004004.954674-1-zokeefe@google.com/
[2] https://lore.kernel.org/linux-mm/YrJJoP5vrZflvwd0@google.com/
[3] https://lore.kernel.org/linux-mm/20220625092816.4856-1-linmiaohe@huawei.com/
[4] https://lore.kernel.org/linux-mm/20220504214437.2850685-1-zokeefe@google.com/
[5] https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[6] https://github.com/google/tcmalloc/tree/master/tcmalloc
[7] https://research.google/pubs/pub50370/
[8] https://lore.kernel.org/linux-mm/CAHS8izPnJd5EQjUi9cOk=03u3X1rk0PexTQZi+bEE4VMtFfksQ@mail.gmail.com/
[9] https://lore.kernel.org/linux-mm/20220624173656.2033256-23-jthoughton@google.com/
[10] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/
[11] https://lore.kernel.org/linux-mm/CAAa6QmRc76n-dspGT7UK8DkaqZAOz-CkCsME1V7KGtQ6Yt2FqA@mail.gmail.com/


Zach O'Keefe (18):
  mm/khugepaged: remove redundant transhuge_vma_suitable() check
  mm: khugepaged: don't carry huge page to the next loop for
    !CONFIG_NUMA
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: dedup and simplify hugepage alloc and charging
  mm/khugepaged: pipe enum scan_result codes back to callers
  mm/khugepaged: add flag to predicate khugepaged-only behavior
  mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()
  mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/khugepaged: rename prefix of shared collapse functions
  mm/madvise: add huge_memory:mm_madvise_collapse tracepoint
  mm/madvise: add MADV_COLLAPSE to process_madvise()
  proc/smaps: add PMDMappable field to smaps
  selftests/vm: modularize collapse selftests
  selftests/vm: dedup hugepage allocation logic
  selftests/vm: add MADV_COLLAPSE collapse context to selftests
  selftests/vm: add selftest to verify recollapse of THPs
  selftests/vm: add selftest to verify multi THP collapse

 Documentation/filesystems/proc.rst           |  10 +-
 arch/alpha/include/uapi/asm/mman.h           |   2 +
 arch/mips/include/uapi/asm/mman.h            |   2 +
 arch/parisc/include/uapi/asm/mman.h          |   2 +
 arch/xtensa/include/uapi/asm/mman.h          |   2 +
 fs/proc/task_mmu.c                           |   4 +-
 include/linux/huge_mm.h                      |  23 +-
 include/trace/events/huge_memory.h           |  23 +
 include/uapi/asm-generic/mman-common.h       |   2 +
 mm/huge_memory.c                             |  32 +-
 mm/internal.h                                |   2 +-
 mm/khugepaged.c                              | 745 +++++++++++--------
 mm/ksm.c                                     |  10 +
 mm/madvise.c                                 |  11 +-
 mm/memory.c                                  |   4 +-
 mm/rmap.c                                    |  15 +-
 tools/include/uapi/asm-generic/mman-common.h |   2 +
 tools/testing/selftests/vm/khugepaged.c      | 563 ++++++++------
 18 files changed, 845 insertions(+), 609 deletions(-)

Comments

Zach O'Keefe July 14, 2022, 6:55 p.m. UTC | #1
Hey All,

There are still a couple of interface topics (capabilities for
process_madvise(2), errnos) to iron out, but for the most part the behavior
and semantics of MADV_COLLAPSE on anonymous memory seem to be settled.
Thanks for everyone's time and effort in getting there.

Looking forward, I'd like to align on the semantics of file/shmem to seal
MADV_COLLAPSE behavior.  This is what I'd propose for an initial
man-page-like description of MADV_COLLAPSE for madvise(2), to paint a
full-picture view:

---8<---
Perform a best-effort synchronous collapse of the native pages mapped by the
memory range into Transparent Hugepages (THPs). MADV_COLLAPSE operates on the
current state of memory for the specified process and makes no persistent
changes or guarantees on how pages will be mapped, constructed, or faulted in
the future. However, for file/shmem memory, other mappings of this file extent
may be queued and processed later by khugepaged to attempt to update their
pagetables to map the hugepage by a PMD.

If the ranges provided span multiple VMAs, the semantics of the collapse over
each VMA is independent from the others. This implies a hugepage cannot cross a
VMA boundary. If collapse of a given hugepage-aligned/sized region fails, the
operation may continue to attempt collapsing the remainder of the specified
memory.

All non-resident pages covered by the range will first be swapped/faulted-in,
before being copied onto a freshly allocated hugepage. If the native pages
compose the same PTE-mapped hugepage, and are suitably aligned, the collapse
may happen in-place. Unmapped pages will have their data directly initialized
to 0 in the new hugepage.  However, for every eligible hugepage-aligned/sized
region to be collapsed, at least one page must currently be backed by memory.

MADV_COLLAPSE is independent of any THP sysfs setting, both in terms of
determining THP eligibility, and allocation semantics. The VMA must not be
marked VM_NOHUGEPAGE, VM_HUGETLB**, VM_IO, VM_DONTEXPAND, VM_MIXEDMAP, or
VM_PFNMAP, nor can it be stack memory or DAX-backed. The process must not have
PR_SET_THP_DISABLE set. For file-backed memory, the file must either be (1) not
open for write, and the mapping must be executable, or (2) the backing
filesystem must support large pages. Allocation for the new hugepage may enter
direct reclaim and/or compaction, regardless of VMA flags.  When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing the
most native pages.

If all hugepage-sized/aligned regions covered by the provided range were either
successfully collapsed, or were already PMD-mapped THPs, this operation will be
deemed successful. On successful return, all hugepage-aligned/sized memory
regions provided will be mapped by PMDs. Note that this doesn’t guarantee
anything about other possible mappings of the memory.  Note that many failures
might have occurred, since the operation may continue collapsing the remainder
of the range even if collapse of a single hugepage-sized/aligned region fails.

MADV_COLLAPSE is only available if the kernel was configured with
CONFIG_TRANSPARENT_HUGEPAGE; file and shmem support additionally require
CONFIG_READ_ONLY_THP_FOR_FS and CONFIG_SHMEM, respectively.
---8<---

** Might change with HugeTLB high-granularity mappings[1].
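
To make the "collapse may continue past per-region failures" behavior
concrete, here is a hedged userspace sketch (again, the MADV_COLLAPSE value
and the 2MiB hugepage size are assumptions) that retries one hugepage-aligned
region at a time to localize failures after a bulk call fails:

#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumed value; check uapi mman-common.h */
#endif

#define HPAGE_SIZE (2UL << 20)	/* assumed 2MiB PMD-sized THPs */

/* start is assumed hugepage-aligned; len a multiple of HPAGE_SIZE. */
static void collapse_and_report(char *start, size_t len)
{
	size_t off;

	if (!madvise(start, len, MADV_COLLAPSE))
		return;	/* every covered region is now PMD-mapped */

	/* The bulk call failed somewhere; localize per-region failures. */
	for (off = 0; off < len; off += HPAGE_SIZE)
		if (madvise(start + off, HPAGE_SIZE, MADV_COLLAPSE))
			fprintf(stderr, "collapse of %p failed: %s\n",
				(void *)(start + off), strerror(errno));
}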


There are a few new items of note here:

1) PMD-mapped on success

MADV_COLLAPSE ultimately wants memory mapped by PMDs, and so I propose we
should always try to actually do the page table updates. For file/shmem, this
means two things: (a) adding support to handle compound pages (both pte-mapped
hugepages and non-HPAGE_PMD_ORDER compound pages), and (b) doing a final PMD
install before returning, and not relying on subsequent fault. This makes the
semantics of file/shmem the same as anonymous. I call out (a), since there was
an existing debate about this, and so I want to ensure we are aligned[1]. Note
that (b), along with presenting a consistent interface to users, also has
real-world use cases where relying on fault is difficult (for example,
shmem + UFFDIO_REGISTER_MODE_MINOR-managed memory).  Also note that for (b), I'm
proposing to only do the synchronous PMD install for the memory range provided
- the page table collapse of other mappings of the memory can be deferred until
later (by khugepaged).

2) folio timing && file non-writable, executable mapping

I just want to align on some timing due to ongoing folio work. Currently, the
requirement to be able to collapse file/shmem memory is that the file not be
opened for write anywhere, and that the mapping is executable, but we'd
eventually like to support filesystems that claim
mapping_large_folio_support()[2]. Is it acceptable that future MADV_COLLAPSE
works for either mapping_large_folio_support() or the old conditions?
Alternatively, should MADV_COLLAPSE only support mapping_large_folio_support()
filesystems from the onset? (I believe shmem and xfs are the only current
users)

3) (shmem) sysfs settings and huge= tmpfs mount

Should we ignore /sys/kernel/mm/transparent_hugepage/shmem_enabled, similar to
how we ignore /sys/kernel/mm/transparent_hugepage/enabled for anon/file? Does
that include "deny"? This choice is (partially) coupled with tmpfs huge= mount
option. I think today, things work if we ignore this. However, I don't want to
back us into a corner if we ever want to allow MADV_COLLAPSE to work on
writeable shmem mappings one day (or any other incompatibility I'm unaware of).
One option, if in (2) we chose to allow the old conditions, then we could
ignore shmem_enabled in the non-writable, executable case - otherwise defer to
"if the filesystem supports it", where we would then respect huge=.

I think those are the important points. Am I missing anything?

Thanks again everyone for taking the time to read and discuss,

Best,
Zach


[1] https://lore.kernel.org/linux-mm/20220624173656.2033256-23-jthoughton@google.com/
[2] https://lore.kernel.org/linux-mm/YpGbnbi44JqtRg+n@casper.infradead.org/