mbox series

[v2,00/12] mm: userspace hugepage collapse

Message ID 20220414180612.3844426-1-zokeefe@google.com (mailing list archive)
Headers show
Series mm: userspace hugepage collapse | expand

Message

Zach O'Keefe April 14, 2022, 6:06 p.m. UTC
Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was introduced by David Rientjes[1], and the semantics and
implementation were introduced and discussed in a previous PATCH RFC[2].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

process_madvise(2)

	Performs a synchronous collapse of the native pages
	mapped by the list of iovecs into transparent hugepages.

	Allocation semantics are the same as khugepaged, and depend on
	(1) the active sysfs settings
	/sys/kernel/mm/transparent_hugepage/enabled and
	/sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
	the VMA flags of the memory range being collapsed.

	Collapse eligibility criteria differs from khugepaged in that
	the sysfs files
	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
	are ignored.

	When a range spans multiple VMAs, the semantics of the collapse over
	of each VMA is independent from the others.

	Caller must have CAP_SYS_ADMIN if not acting on self.

	Return value follows existing process_madvise(2) conventions.  A
	“success” indicates that all hugepage-sized/aligned regions
	covered by the provided range were either successfully
	collapsed, or were already pmd-mapped THPs.

madvise(2)

	Equivalent to process_madvise(2) on self, with 0 returned on
	“success”.

Future work
--------------------------------

Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.

One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps.  For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[3].

Sequence of Patches
--------------------------------

Patches 1-4 perform refactoring of collapse logic within khugepaged.c
and introduce the notion of a collapse context.

Patches 5-9 introduces MADV_COLLAPSE, does some renaming, adds support
so that MADV_COLLAPSE context has the eligibility and allocation
semantics referenced above, and adds process_madivse(2) support.

Patches 10-12 add selftests to test the new functionality.

Applies against next-20220414.

v1 -> v2:
* Cover-letter clarification and added RFC -> v1 notes
* Fixes issues reported by kernel test robot <lkp@intel.com>
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Fixed mixed code/declarations
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fixed bad function signature in !NUMA && TRANSPARENT_HUGEPAGE configs
  -> Added doc comment to retract_page_tables() for "cc"
* 'mm/khugepaged: add struct collapse_result'
  -> Added doc comment to retract_page_tables() for "cr"
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added MADV_COLLAPSE definitions for alpha, mips, parisc, xtensa
  -> Moved an "#ifdef NUMA" so that khugepaged_find_target_node() is
     defined in !NUMA && TRANSPARENT_HUGEPAGE configs.
* 'mm/khugepaged: remove khugepaged prefix from shared collapse'
  functions
  -> Removed khugepaged prefix from khugepaged_find_target_node on L914
* Rebased onto next-20220414

RFC -> v1:
* The series was significantly reworked from RFC and most patches are entirely
  new or reworked.
* Collapse eligibility criteria has changed: MADV_COLLAPSE now respects
  VM_NOHUGEPAGE.
* Collapse semantics have changed: the gfp flags used for hugepage allocation
  now match that of khugepaged for the same VMA, instead of the gfp flags used
  at-fault for calling process for the VMA.
* Collapse semantics have changed: The collapse semantics for multiple VMAs
  spanning a single MADV_COLLAPSE call are now independent, whereas before the
  idea was to allow direct reclaim/compaction if any spanned VMA permitted so.
* The process_madvise(2) flags, MADV_F_COLLAPSE_LIMITS and
  MADV_F_COLLAPSE_DEFRAG have been removed.
* Implementation change: the RFC implemented collapse over a range of hugepages
  in a batched-fashion with the aim of doing multiple page table updates inside
  a single mmap_lock write.  This has been changed, and the implementation now
  collapses each hugepage-aligned/sized region iteratively.  This was motivated
  by an experiment which showed that, when multiple threads were concurrently
  faulting during a MADV_COLLAPSE operation, mean and tail latency to acquire
  mmap_lock in read for threads in the fault patch was improved by using a batch
  size of 1 (batch sizes of 1, 8, 16, 32 were tested)[4].
* Added: If a collapse operation fails because a page isn't found on the LRU, do
  a lru_add_drain_all() and retry.
* Added: selftests

[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20220308213417.1407042-1-zokeefe@google.com/
[3] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/
[4] https://lore.kernel.org/linux-mm/CAAa6QmRc76n-dspGT7UK8DkaqZAOz-CkCsME1V7KGtQ6Yt2FqA@mail.gmail.com/

Zach O'Keefe (13):
  mm/khugepaged: separate hugepage preallocation and cleanup
  mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: make hugepage allocation context-specific
  mm/khugepaged: add struct collapse_result
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/khugepaged: remove khugepaged prefix from shared collapse functions
  mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
  mm/khugepaged: add flag to ignore page young/referenced requirement
  mm/madvise: add MADV_COLLAPSE to process_madvise()
  selftests/vm: modularize collapse selftests
  selftests/vm: add MADV_COLLAPSE collapse context to selftests
  selftests/vm: add test to verify recollapse of THPs

 include/linux/huge_mm.h                 |  12 +
 include/trace/events/huge_memory.h      |   5 +-
 include/uapi/asm-generic/mman-common.h  |   2 +
 mm/internal.h                           |   1 +
 mm/khugepaged.c                         | 598 ++++++++++++++++--------
 mm/madvise.c                            |  11 +-
 mm/rmap.c                               |  15 +-
 tools/testing/selftests/vm/khugepaged.c | 417 +++++++++++------
 8 files changed, 702 insertions(+), 359 deletions(-)

Comments

Peter Xu April 15, 2022, 12:04 a.m. UTC | #1
Hi, Zach,

On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> process_madvise(2)
> 
> 	Performs a synchronous collapse of the native pages
> 	mapped by the list of iovecs into transparent hugepages.
> 
> 	Allocation semantics are the same as khugepaged, and depend on
> 	(1) the active sysfs settings
> 	/sys/kernel/mm/transparent_hugepage/enabled and
> 	/sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> 	the VMA flags of the memory range being collapsed.
> 
> 	Collapse eligibility criteria differs from khugepaged in that
> 	the sysfs files
> 	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> 	are ignored.

The userspace khugepaged idea definitely makes sense to me, though I'm
curious how the line is drown on the different behaviors here by explicitly
ignoring the max_ptes_* entries.

Let's assume the initiative is to duplicate a more data-aware khugepaged in
the userspace, then IMHO it makes more sense to start with all the policies
that applies to khugepaged already, including max_pte_*.

I can understand the willingness to provide even stronger semantics here
than khugepaged since the userspace could have very clear knowledge of how
to provision the memories (better than a kernel scanner).  It's just that
IMHO it could be slightly confusing if the new interface only partially
apply the khugepaged rules.

No strong opinion here.  It could already been a trade-off after the
discussion from the RFC with Michal which I read..  Just curious about how
you made that design decision so feel free to read it as a pure question.

Thanks,
Zach O'Keefe April 15, 2022, 12:52 a.m. UTC | #2
Hey Peter,

Thanks for taking the time to review!

On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Zach,
>
> On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > process_madvise(2)
> >
> >       Performs a synchronous collapse of the native pages
> >       mapped by the list of iovecs into transparent hugepages.
> >
> >       Allocation semantics are the same as khugepaged, and depend on
> >       (1) the active sysfs settings
> >       /sys/kernel/mm/transparent_hugepage/enabled and
> >       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> >       the VMA flags of the memory range being collapsed.
> >
> >       Collapse eligibility criteria differs from khugepaged in that
> >       the sysfs files
> >       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> >       are ignored.
>
> The userspace khugepaged idea definitely makes sense to me, though I'm
> curious how the line is drown on the different behaviors here by explicitly
> ignoring the max_ptes_* entries.
>
> Let's assume the initiative is to duplicate a more data-aware khugepaged in
> the userspace, then IMHO it makes more sense to start with all the policies
> that applies to khugepaged already, including max_pte_*.
>
> I can understand the willingness to provide even stronger semantics here
> than khugepaged since the userspace could have very clear knowledge of how
> to provision the memories (better than a kernel scanner).  It's just that
> IMHO it could be slightly confusing if the new interface only partially
> apply the khugepaged rules.
>
> No strong opinion here.  It could already been a trade-off after the
> discussion from the RFC with Michal which I read..  Just curious about how
> you made that design decision so feel free to read it as a pure question.
>

Understand your point here. The allocation and max_pte_* semantics are
split between khugepaged-like and fault-like, respectively - which
could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
flag to control the former's behavior, but agreed to keep things
simple to start, and expand the interface if/when necessary. I opted
to ignore max_ptes_* as the default since I envisioned that early
adopters would "just want it to work". One such example would be
backing executable text by hugepages on program load when many pages
haven't been demand-paged in yet.

What do you think?

Thanks,
Zach

> Thanks,
>
> --
> Peter Xu
>
Peter Xu April 15, 2022, 1:39 p.m. UTC | #3
On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> Hey Peter,
> 
> Thanks for taking the time to review!
> 
> On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Zach,
> >
> > On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > > process_madvise(2)
> > >
> > >       Performs a synchronous collapse of the native pages
> > >       mapped by the list of iovecs into transparent hugepages.
> > >
> > >       Allocation semantics are the same as khugepaged, and depend on
> > >       (1) the active sysfs settings
> > >       /sys/kernel/mm/transparent_hugepage/enabled and
> > >       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> > >       the VMA flags of the memory range being collapsed.
> > >
> > >       Collapse eligibility criteria differs from khugepaged in that
> > >       the sysfs files
> > >       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> > >       are ignored.
> >
> > The userspace khugepaged idea definitely makes sense to me, though I'm
> > curious how the line is drown on the different behaviors here by explicitly
> > ignoring the max_ptes_* entries.
> >
> > Let's assume the initiative is to duplicate a more data-aware khugepaged in
> > the userspace, then IMHO it makes more sense to start with all the policies
> > that applies to khugepaged already, including max_pte_*.
> >
> > I can understand the willingness to provide even stronger semantics here
> > than khugepaged since the userspace could have very clear knowledge of how
> > to provision the memories (better than a kernel scanner).  It's just that
> > IMHO it could be slightly confusing if the new interface only partially
> > apply the khugepaged rules.
> >
> > No strong opinion here.  It could already been a trade-off after the
> > discussion from the RFC with Michal which I read..  Just curious about how
> > you made that design decision so feel free to read it as a pure question.
> >
> 
> Understand your point here. The allocation and max_pte_* semantics are
> split between khugepaged-like and fault-like, respectively - which
> could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> flag to control the former's behavior, but agreed to keep things
> simple to start, and expand the interface if/when necessary. I opted
> to ignore max_ptes_* as the default since I envisioned that early
> adopters would "just want it to work". One such example would be
> backing executable text by hugepages on program load when many pages
> haven't been demand-paged in yet.
> 
> What do you think?

I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
blurred.

To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
khugepaged on this range, and with current thread context".  IMHO any
feature bits then can be supplementing special needs, and I'll take the thp
backing executable example to be one of the (good?) reason we'd need an
extra flag for ignoring the max_ptes_* knobs.

So personally if I were you maybe I'll start with the simple scheme of that
(even if it won't immediately service a thing) but then add either the
defrag or ignore_max_ptes_* as feature bits later on, with clear use case
descriptions about why we need each of the feature flags.  IMHO numbers
would be even more helpful when there's specific use cases on the show.

Or, perhaps you think all potential MADV_COLLAPSE users should literally
skip max_ptes_* limitations always?

Anyway, I won't pretend I am an expert in this area. :) So please take that
with a grain of salt.

Thanks,
Zach O'Keefe April 15, 2022, 8:04 p.m. UTC | #4
On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> > Hey Peter,
> >
> > Thanks for taking the time to review!
> >
> > On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, Zach,
> > >
> > > On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > > > process_madvise(2)
> > > >
> > > >       Performs a synchronous collapse of the native pages
> > > >       mapped by the list of iovecs into transparent hugepages.
> > > >
> > > >       Allocation semantics are the same as khugepaged, and depend on
> > > >       (1) the active sysfs settings
> > > >       /sys/kernel/mm/transparent_hugepage/enabled and
> > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> > > >       the VMA flags of the memory range being collapsed.
> > > >
> > > >       Collapse eligibility criteria differs from khugepaged in that
> > > >       the sysfs files
> > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> > > >       are ignored.
> > >
> > > The userspace khugepaged idea definitely makes sense to me, though I'm
> > > curious how the line is drown on the different behaviors here by explicitly
> > > ignoring the max_ptes_* entries.
> > >
> > > Let's assume the initiative is to duplicate a more data-aware khugepaged in
> > > the userspace, then IMHO it makes more sense to start with all the policies
> > > that applies to khugepaged already, including max_pte_*.
> > >
> > > I can understand the willingness to provide even stronger semantics here
> > > than khugepaged since the userspace could have very clear knowledge of how
> > > to provision the memories (better than a kernel scanner).  It's just that
> > > IMHO it could be slightly confusing if the new interface only partially
> > > apply the khugepaged rules.
> > >
> > > No strong opinion here.  It could already been a trade-off after the
> > > discussion from the RFC with Michal which I read..  Just curious about how
> > > you made that design decision so feel free to read it as a pure question.
> > >
> >
> > Understand your point here. The allocation and max_pte_* semantics are
> > split between khugepaged-like and fault-like, respectively - which
> > could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> > flag to control the former's behavior, but agreed to keep things
> > simple to start, and expand the interface if/when necessary. I opted
> > to ignore max_ptes_* as the default since I envisioned that early
> > adopters would "just want it to work". One such example would be
> > backing executable text by hugepages on program load when many pages
> > haven't been demand-paged in yet.
> >
> > What do you think?
>
> I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
> blurred.
>
> To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
> khugepaged on this range, and with current thread context".  IMHO any
> feature bits then can be supplementing special needs, and I'll take the thp
> backing executable example to be one of the (good?) reason we'd need an
> extra flag for ignoring the max_ptes_* knobs.
>
> So personally if I were you maybe I'll start with the simple scheme of that
> (even if it won't immediately service a thing) but then add either the
> defrag or ignore_max_ptes_* as feature bits later on, with clear use case
> descriptions about why we need each of the feature flags.  IMHO numbers
> would be even more helpful when there's specific use cases on the show.
>
> Or, perhaps you think all potential MADV_COLLAPSE users should literally
> skip max_ptes_* limitations always?
>

Thanks for your time and valuable feedback here, Peter. I had a response typed
up, but after a few iterations became increasingly unsatisfied with my
own response.

I think this feature should be able to stand on its own without
consideration of a userspace khugepaged, as we have existing concrete
examples where it would be useful. In these cases, and I assume almost
all other use-cases outside userspace khugepaged, max_ptes_* should be
ignored as the fundamental assumption of MADV_COLLAPSE is that the
user knows better, and IMHO, khugepaged heuristics shouldn't tell
users they are wrong.

But this, as you mention, unsatisfactorily blurs the semantics of
MADV_COLLAPSE: "act like khugepaged here, but not here".

As such, WDYT about the reverse-side of the coin of what you proposed:
to not couple the default behavior of MADV_COLLAPSE with khugepaged at
all? I.e. Not tie the allocation semantics to
/sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
flags as necessary when/if a reimplementation of khugepaged in
userspace proves fruitful.

Thanks for your time and input,
Zach



> Anyway, I won't pretend I am an expert in this area. :) So please take that
> with a grain of salt.
>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu April 16, 2022, 7:26 p.m. UTC | #5
Hi, Zach,

On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote:
> On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> > > Hey Peter,
> > >
> > > Thanks for taking the time to review!
> > >
> > > On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Hi, Zach,
> > > >
> > > > On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > > > > process_madvise(2)
> > > > >
> > > > >       Performs a synchronous collapse of the native pages
> > > > >       mapped by the list of iovecs into transparent hugepages.
> > > > >
> > > > >       Allocation semantics are the same as khugepaged, and depend on
> > > > >       (1) the active sysfs settings
> > > > >       /sys/kernel/mm/transparent_hugepage/enabled and
> > > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> > > > >       the VMA flags of the memory range being collapsed.
> > > > >
> > > > >       Collapse eligibility criteria differs from khugepaged in that
> > > > >       the sysfs files
> > > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> > > > >       are ignored.
> > > >
> > > > The userspace khugepaged idea definitely makes sense to me, though I'm
> > > > curious how the line is drown on the different behaviors here by explicitly
> > > > ignoring the max_ptes_* entries.
> > > >
> > > > Let's assume the initiative is to duplicate a more data-aware khugepaged in
> > > > the userspace, then IMHO it makes more sense to start with all the policies
> > > > that applies to khugepaged already, including max_pte_*.
> > > >
> > > > I can understand the willingness to provide even stronger semantics here
> > > > than khugepaged since the userspace could have very clear knowledge of how
> > > > to provision the memories (better than a kernel scanner).  It's just that
> > > > IMHO it could be slightly confusing if the new interface only partially
> > > > apply the khugepaged rules.
> > > >
> > > > No strong opinion here.  It could already been a trade-off after the
> > > > discussion from the RFC with Michal which I read..  Just curious about how
> > > > you made that design decision so feel free to read it as a pure question.
> > > >
> > >
> > > Understand your point here. The allocation and max_pte_* semantics are
> > > split between khugepaged-like and fault-like, respectively - which
> > > could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> > > flag to control the former's behavior, but agreed to keep things
> > > simple to start, and expand the interface if/when necessary. I opted
> > > to ignore max_ptes_* as the default since I envisioned that early
> > > adopters would "just want it to work". One such example would be
> > > backing executable text by hugepages on program load when many pages
> > > haven't been demand-paged in yet.
> > >
> > > What do you think?
> >
> > I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
> > blurred.
> >
> > To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
> > khugepaged on this range, and with current thread context".  IMHO any
> > feature bits then can be supplementing special needs, and I'll take the thp
> > backing executable example to be one of the (good?) reason we'd need an
> > extra flag for ignoring the max_ptes_* knobs.
> >
> > So personally if I were you maybe I'll start with the simple scheme of that
> > (even if it won't immediately service a thing) but then add either the
> > defrag or ignore_max_ptes_* as feature bits later on, with clear use case
> > descriptions about why we need each of the feature flags.  IMHO numbers
> > would be even more helpful when there's specific use cases on the show.
> >
> > Or, perhaps you think all potential MADV_COLLAPSE users should literally
> > skip max_ptes_* limitations always?
> >
> 
> Thanks for your time and valuable feedback here, Peter. I had a response typed
> up, but after a few iterations became increasingly unsatisfied with my
> own response.
> 
> I think this feature should be able to stand on its own without
> consideration of a userspace khugepaged, as we have existing concrete
> examples where it would be useful. In these cases, and I assume almost
> all other use-cases outside userspace khugepaged, max_ptes_* should be
> ignored as the fundamental assumption of MADV_COLLAPSE is that the
> user knows better, and IMHO, khugepaged heuristics shouldn't tell
> users they are wrong.

Valid point.  And actually right after I replied I thought similarly on
whether we need to connect the two interfaces at all..

It's just that it's very easy to go think like that after reading the cover
letter since that's exactly what it is comparing to. :)

There's definitely a difference view on user/kernel level of things, then
it sounds reasonable to me if we add a new interface it by default has a
stronger semantics otherwise we may not bother if with MADV_HUGEPAGE's
existance.

So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases
as you said.  And that's a real pure question I asked above, and I feel
like your answer is actually "yes" we should always ignore the max_ptes_*
fields until there's a proof that it'll be helpful.

> 
> But this, as you mention, unsatisfactorily blurs the semantics of
> MADV_COLLAPSE: "act like khugepaged here, but not here".
> 
> As such, WDYT about the reverse-side of the coin of what you proposed:
> to not couple the default behavior of MADV_COLLAPSE with khugepaged at
> all? I.e. Not tie the allocation semantics to
> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
> flags as necessary when/if a reimplementation of khugepaged in
> userspace proves fruitful.

Let's see whether others have thoughts, but what you proposed here makes
sense to me.

Thanks,
Zach O'Keefe April 17, 2022, 5:23 p.m. UTC | #6
On Sat, Apr 16, 2022 at 12:26 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Zach,
>
> On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote:
> > On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> > > > Hey Peter,
> > > >
> > > > Thanks for taking the time to review!
> > > >
> > > > On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > Hi, Zach,
> > > > >
> > > > > On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > > > > > process_madvise(2)
> > > > > >
> > > > > >       Performs a synchronous collapse of the native pages
> > > > > >       mapped by the list of iovecs into transparent hugepages.
> > > > > >
> > > > > >       Allocation semantics are the same as khugepaged, and depend on
> > > > > >       (1) the active sysfs settings
> > > > > >       /sys/kernel/mm/transparent_hugepage/enabled and
> > > > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> > > > > >       the VMA flags of the memory range being collapsed.
> > > > > >
> > > > > >       Collapse eligibility criteria differs from khugepaged in that
> > > > > >       the sysfs files
> > > > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> > > > > >       are ignored.
> > > > >
> > > > > The userspace khugepaged idea definitely makes sense to me, though I'm
> > > > > curious how the line is drown on the different behaviors here by explicitly
> > > > > ignoring the max_ptes_* entries.
> > > > >
> > > > > Let's assume the initiative is to duplicate a more data-aware khugepaged in
> > > > > the userspace, then IMHO it makes more sense to start with all the policies
> > > > > that applies to khugepaged already, including max_pte_*.
> > > > >
> > > > > I can understand the willingness to provide even stronger semantics here
> > > > > than khugepaged since the userspace could have very clear knowledge of how
> > > > > to provision the memories (better than a kernel scanner).  It's just that
> > > > > IMHO it could be slightly confusing if the new interface only partially
> > > > > apply the khugepaged rules.
> > > > >
> > > > > No strong opinion here.  It could already been a trade-off after the
> > > > > discussion from the RFC with Michal which I read..  Just curious about how
> > > > > you made that design decision so feel free to read it as a pure question.
> > > > >
> > > >
> > > > Understand your point here. The allocation and max_pte_* semantics are
> > > > split between khugepaged-like and fault-like, respectively - which
> > > > could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> > > > flag to control the former's behavior, but agreed to keep things
> > > > simple to start, and expand the interface if/when necessary. I opted
> > > > to ignore max_ptes_* as the default since I envisioned that early
> > > > adopters would "just want it to work". One such example would be
> > > > backing executable text by hugepages on program load when many pages
> > > > haven't been demand-paged in yet.
> > > >
> > > > What do you think?
> > >
> > > I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
> > > blurred.
> > >
> > > To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
> > > khugepaged on this range, and with current thread context".  IMHO any
> > > feature bits then can be supplementing special needs, and I'll take the thp
> > > backing executable example to be one of the (good?) reason we'd need an
> > > extra flag for ignoring the max_ptes_* knobs.
> > >
> > > So personally if I were you maybe I'll start with the simple scheme of that
> > > (even if it won't immediately service a thing) but then add either the
> > > defrag or ignore_max_ptes_* as feature bits later on, with clear use case
> > > descriptions about why we need each of the feature flags.  IMHO numbers
> > > would be even more helpful when there's specific use cases on the show.
> > >
> > > Or, perhaps you think all potential MADV_COLLAPSE users should literally
> > > skip max_ptes_* limitations always?
> > >
> >
> > Thanks for your time and valuable feedback here, Peter. I had a response typed
> > up, but after a few iterations became increasingly unsatisfied with my
> > own response.
> >
> > I think this feature should be able to stand on its own without
> > consideration of a userspace khugepaged, as we have existing concrete
> > examples where it would be useful. In these cases, and I assume almost
> > all other use-cases outside userspace khugepaged, max_ptes_* should be
> > ignored as the fundamental assumption of MADV_COLLAPSE is that the
> > user knows better, and IMHO, khugepaged heuristics shouldn't tell
> > users they are wrong.
>
> Valid point.  And actually right after I replied I thought similarly on
> whether we need to connect the two interfaces at all..
>
> It's just that it's very easy to go think like that after reading the cover
> letter since that's exactly what it is comparing to. :)
>

Yes, this is my fault :) After others have had a chance to review /
align, I'll include the immediate use cases
in the v3 cover letter as well, rather than deep in individual patch
messages.

> There's definitely a difference view on user/kernel level of things, then
> it sounds reasonable to me if we add a new interface it by default has a
> stronger semantics otherwise we may not bother if with MADV_HUGEPAGE's
> existance.
>

Yes, good point.

> So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases
> as you said.  And that's a real pure question I asked above, and I feel
> like your answer is actually "yes" we should always ignore the max_ptes_*
> fields until there's a proof that it'll be helpful.
>
> >
> > But this, as you mention, unsatisfactorily blurs the semantics of
> > MADV_COLLAPSE: "act like khugepaged here, but not here".
> >
> > As such, WDYT about the reverse-side of the coin of what you proposed:
> > to not couple the default behavior of MADV_COLLAPSE with khugepaged at
> > all? I.e. Not tie the allocation semantics to
> > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
> > flags as necessary when/if a reimplementation of khugepaged in
> > userspace proves fruitful.
>
> Let's see whether others have thoughts, but what you proposed here makes
> sense to me.
>

Great! Sounds good to me. Thank you again for your time, questions,
and feedback!

Best,

Zach

> Thanks,
>
> --
> Peter Xu
>
David Hildenbrand April 19, 2022, 2:33 p.m. UTC | #7
On 16.04.22 21:26, Peter Xu wrote:
> Hi, Zach,
> 
> On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote:
>> On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
>>>
>>> On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
>>>> Hey Peter,
>>>>
>>>> Thanks for taking the time to review!
>>>>
>>>> On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>
>>>>> Hi, Zach,
>>>>>
>>>>> On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
>>>>>> process_madvise(2)
>>>>>>
>>>>>>       Performs a synchronous collapse of the native pages
>>>>>>       mapped by the list of iovecs into transparent hugepages.
>>>>>>
>>>>>>       Allocation semantics are the same as khugepaged, and depend on
>>>>>>       (1) the active sysfs settings
>>>>>>       /sys/kernel/mm/transparent_hugepage/enabled and
>>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
>>>>>>       the VMA flags of the memory range being collapsed.
>>>>>>
>>>>>>       Collapse eligibility criteria differs from khugepaged in that
>>>>>>       the sysfs files
>>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
>>>>>>       are ignored.
>>>>>
>>>>> The userspace khugepaged idea definitely makes sense to me, though I'm
>>>>> curious how the line is drown on the different behaviors here by explicitly
>>>>> ignoring the max_ptes_* entries.
>>>>>
>>>>> Let's assume the initiative is to duplicate a more data-aware khugepaged in
>>>>> the userspace, then IMHO it makes more sense to start with all the policies
>>>>> that applies to khugepaged already, including max_pte_*.
>>>>>
>>>>> I can understand the willingness to provide even stronger semantics here
>>>>> than khugepaged since the userspace could have very clear knowledge of how
>>>>> to provision the memories (better than a kernel scanner).  It's just that
>>>>> IMHO it could be slightly confusing if the new interface only partially
>>>>> apply the khugepaged rules.
>>>>>
>>>>> No strong opinion here.  It could already been a trade-off after the
>>>>> discussion from the RFC with Michal which I read..  Just curious about how
>>>>> you made that design decision so feel free to read it as a pure question.
>>>>>
>>>>
>>>> Understand your point here. The allocation and max_pte_* semantics are
>>>> split between khugepaged-like and fault-like, respectively - which
>>>> could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
>>>> flag to control the former's behavior, but agreed to keep things
>>>> simple to start, and expand the interface if/when necessary. I opted
>>>> to ignore max_ptes_* as the default since I envisioned that early
>>>> adopters would "just want it to work". One such example would be
>>>> backing executable text by hugepages on program load when many pages
>>>> haven't been demand-paged in yet.
>>>>
>>>> What do you think?
>>>
>>> I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
>>> blurred.
>>>
>>> To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
>>> khugepaged on this range, and with current thread context".  IMHO any
>>> feature bits then can be supplementing special needs, and I'll take the thp
>>> backing executable example to be one of the (good?) reason we'd need an
>>> extra flag for ignoring the max_ptes_* knobs.
>>>
>>> So personally if I were you maybe I'll start with the simple scheme of that
>>> (even if it won't immediately service a thing) but then add either the
>>> defrag or ignore_max_ptes_* as feature bits later on, with clear use case
>>> descriptions about why we need each of the feature flags.  IMHO numbers
>>> would be even more helpful when there's specific use cases on the show.
>>>
>>> Or, perhaps you think all potential MADV_COLLAPSE users should literally
>>> skip max_ptes_* limitations always?
>>>
>>
>> Thanks for your time and valuable feedback here, Peter. I had a response typed
>> up, but after a few iterations became increasingly unsatisfied with my
>> own response.
>>
>> I think this feature should be able to stand on its own without
>> consideration of a userspace khugepaged, as we have existing concrete
>> examples where it would be useful. In these cases, and I assume almost
>> all other use-cases outside userspace khugepaged, max_ptes_* should be
>> ignored as the fundamental assumption of MADV_COLLAPSE is that the
>> user knows better, and IMHO, khugepaged heuristics shouldn't tell
>> users they are wrong.
> 
> Valid point.  And actually right after I replied I thought similarly on
> whether we need to connect the two interfaces at all..
> 
> It's just that it's very easy to go think like that after reading the cover
> letter since that's exactly what it is comparing to. :)
> 
> There's definitely a difference view on user/kernel level of things, then
> it sounds reasonable to me if we add a new interface it by default has a
> stronger semantics otherwise we may not bother if with MADV_HUGEPAGE's
> existance.
> 
> So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases
> as you said.  And that's a real pure question I asked above, and I feel
> like your answer is actually "yes" we should always ignore the max_ptes_*
> fields until there's a proof that it'll be helpful.
> 
>>
>> But this, as you mention, unsatisfactorily blurs the semantics of
>> MADV_COLLAPSE: "act like khugepaged here, but not here".
>>
>> As such, WDYT about the reverse-side of the coin of what you proposed:
>> to not couple the default behavior of MADV_COLLAPSE with khugepaged at
>> all? I.e. Not tie the allocation semantics to
>> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
>> flags as necessary when/if a reimplementation of khugepaged in
>> userspace proves fruitful.
> 
> Let's see whether others have thoughts, but what you proposed here makes
> sense to me.


Makes sense to me. IIUC, the whole handling of max_ptes_* is in place as
 tunable because we don't know what user space is up to.

E.g., have with a very sparse memory layout, we don't want to waste
memory by allocating memory where we actually have no page populated yet
-- could be user space won't reuse that memory in the foreseeable
future. With too many swap entries, we don't want to trigger an
eventually unnecessary overhead of swapping in entries if user space
won't access them in the foreseeable future. Something similar applies
to max_ptes_shared, where one might just end up wasting a lot of memory
eventually in some applications.

So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
try figuring out what user space might be doing. We know exactly what
user space asks for -- and that can be documented properly.
Zach O'Keefe April 19, 2022, 3:55 p.m. UTC | #8
On Tue, Apr 19, 2022 at 7:33 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 16.04.22 21:26, Peter Xu wrote:
> > Hi, Zach,
> >
> > On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote:
> >> On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
> >>>
> >>> On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> >>>> Hey Peter,
> >>>>
> >>>> Thanks for taking the time to review!
> >>>>
> >>>> On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>
> >>>>> Hi, Zach,
> >>>>>
> >>>>> On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> >>>>>> process_madvise(2)
> >>>>>>
> >>>>>>       Performs a synchronous collapse of the native pages
> >>>>>>       mapped by the list of iovecs into transparent hugepages.
> >>>>>>
> >>>>>>       Allocation semantics are the same as khugepaged, and depend on
> >>>>>>       (1) the active sysfs settings
> >>>>>>       /sys/kernel/mm/transparent_hugepage/enabled and
> >>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> >>>>>>       the VMA flags of the memory range being collapsed.
> >>>>>>
> >>>>>>       Collapse eligibility criteria differs from khugepaged in that
> >>>>>>       the sysfs files
> >>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> >>>>>>       are ignored.
> >>>>>
> >>>>> The userspace khugepaged idea definitely makes sense to me, though I'm
> >>>>> curious how the line is drown on the different behaviors here by explicitly
> >>>>> ignoring the max_ptes_* entries.
> >>>>>
> >>>>> Let's assume the initiative is to duplicate a more data-aware khugepaged in
> >>>>> the userspace, then IMHO it makes more sense to start with all the policies
> >>>>> that applies to khugepaged already, including max_pte_*.
> >>>>>
> >>>>> I can understand the willingness to provide even stronger semantics here
> >>>>> than khugepaged since the userspace could have very clear knowledge of how
> >>>>> to provision the memories (better than a kernel scanner).  It's just that
> >>>>> IMHO it could be slightly confusing if the new interface only partially
> >>>>> apply the khugepaged rules.
> >>>>>
> >>>>> No strong opinion here.  It could already been a trade-off after the
> >>>>> discussion from the RFC with Michal which I read..  Just curious about how
> >>>>> you made that design decision so feel free to read it as a pure question.
> >>>>>
> >>>>
> >>>> Understand your point here. The allocation and max_pte_* semantics are
> >>>> split between khugepaged-like and fault-like, respectively - which
> >>>> could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> >>>> flag to control the former's behavior, but agreed to keep things
> >>>> simple to start, and expand the interface if/when necessary. I opted
> >>>> to ignore max_ptes_* as the default since I envisioned that early
> >>>> adopters would "just want it to work". One such example would be
> >>>> backing executable text by hugepages on program load when many pages
> >>>> haven't been demand-paged in yet.
> >>>>
> >>>> What do you think?
> >>>
> >>> I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
> >>> blurred.
> >>>
> >>> To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
> >>> khugepaged on this range, and with current thread context".  IMHO any
> >>> feature bits then can be supplementing special needs, and I'll take the thp
> >>> backing executable example to be one of the (good?) reason we'd need an
> >>> extra flag for ignoring the max_ptes_* knobs.
> >>>
> >>> So personally if I were you maybe I'll start with the simple scheme of that
> >>> (even if it won't immediately service a thing) but then add either the
> >>> defrag or ignore_max_ptes_* as feature bits later on, with clear use case
> >>> descriptions about why we need each of the feature flags.  IMHO numbers
> >>> would be even more helpful when there's specific use cases on the show.
> >>>
> >>> Or, perhaps you think all potential MADV_COLLAPSE users should literally
> >>> skip max_ptes_* limitations always?
> >>>
> >>
> >> Thanks for your time and valuable feedback here, Peter. I had a response typed
> >> up, but after a few iterations became increasingly unsatisfied with my
> >> own response.
> >>
> >> I think this feature should be able to stand on its own without
> >> consideration of a userspace khugepaged, as we have existing concrete
> >> examples where it would be useful. In these cases, and I assume almost
> >> all other use-cases outside userspace khugepaged, max_ptes_* should be
> >> ignored as the fundamental assumption of MADV_COLLAPSE is that the
> >> user knows better, and IMHO, khugepaged heuristics shouldn't tell
> >> users they are wrong.
> >
> > Valid point.  And actually right after I replied I thought similarly on
> > whether we need to connect the two interfaces at all..
> >
> > It's just that it's very easy to go think like that after reading the cover
> > letter since that's exactly what it is comparing to. :)
> >
> > There's definitely a difference view on user/kernel level of things, then
> > it sounds reasonable to me if we add a new interface it by default has a
> > stronger semantics otherwise we may not bother if with MADV_HUGEPAGE's
> > existance.
> >
> > So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases
> > as you said.  And that's a real pure question I asked above, and I feel
> > like your answer is actually "yes" we should always ignore the max_ptes_*
> > fields until there's a proof that it'll be helpful.
> >
> >>
> >> But this, as you mention, unsatisfactorily blurs the semantics of
> >> MADV_COLLAPSE: "act like khugepaged here, but not here".
> >>
> >> As such, WDYT about the reverse-side of the coin of what you proposed:
> >> to not couple the default behavior of MADV_COLLAPSE with khugepaged at
> >> all? I.e. Not tie the allocation semantics to
> >> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
> >> flags as necessary when/if a reimplementation of khugepaged in
> >> userspace proves fruitful.
> >
> > Let's see whether others have thoughts, but what you proposed here makes
> > sense to me.
>
>
Hey David - thanks again for taking the time to read and comment.

> Makes sense to me. IIUC, the whole handling of max_ptes_* is in place as
>  tunable because we don't know what user space is up to.
>

Agreed.

> E.g., have with a very sparse memory layout, we don't want to waste
> memory by allocating memory where we actually have no page populated yet
> -- could be user space won't reuse that memory in the foreseeable
> future. With too many swap entries, we don't want to trigger an
> eventually unnecessary overhead of swapping in entries if user space
> won't access them in the foreseeable future. Something similar applies
> to max_ptes_shared, where one might just end up wasting a lot of memory
> eventually in some applications.
>
> So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
> try figuring out what user space might be doing. We know exactly what
> user space asks for -- and that can be documented properly.
>

Sounds good to me. Would you also be in favor of decoupling allocation
semantics from khugepaged? I.e. we'll pick some default gfp flags and
not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?

Thanks again,

Zach

> --
> Thanks,
>
> David / dhildenb
>
David Hildenbrand April 19, 2022, 8:02 p.m. UTC | #9
>> E.g., have with a very sparse memory layout, we don't want to waste
>> memory by allocating memory where we actually have no page populated yet
>> -- could be user space won't reuse that memory in the foreseeable
>> future. With too many swap entries, we don't want to trigger an
>> eventually unnecessary overhead of swapping in entries if user space
>> won't access them in the foreseeable future. Something similar applies
>> to max_ptes_shared, where one might just end up wasting a lot of memory
>> eventually in some applications.
>>
>> So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
>> try figuring out what user space might be doing. We know exactly what
>> user space asks for -- and that can be documented properly.
>>

Just a thought, if we ever want to implement khugepaged in user space,
it could theoretically obtain similar information using e.g., the
pagemap. It wouldn't be race-free, but the question is if it would matter.

I consider the primary use case of giving an application more precise
control over actual THP placement.

> 
> Sounds good to me. Would you also be in favor of decoupling allocation
> semantics from khugepaged? I.e. we'll pick some default gfp flags and
> not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?

Good question. It's not really a heuristic like that other stuff.

Easy answer: we're not dealing with khugepaged, so anything in
/sys/kernel/mm/transparent_hugepage/khugepaged/ shouldn't apply?

Sure, we could have a separate toggles for MADV_COLLAPSE.

Maybe we simply want a dedicated syscall where we can specify additional
options ... but maybe that simply over-complicates the problem.
Zach O'Keefe April 19, 2022, 10:42 p.m. UTC | #10
On Tue, Apr 19, 2022 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> E.g., have with a very sparse memory layout, we don't want to waste
> >> memory by allocating memory where we actually have no page populated yet
> >> -- could be user space won't reuse that memory in the foreseeable
> >> future. With too many swap entries, we don't want to trigger an
> >> eventually unnecessary overhead of swapping in entries if user space
> >> won't access them in the foreseeable future. Something similar applies
> >> to max_ptes_shared, where one might just end up wasting a lot of memory
> >> eventually in some applications.
> >>
> >> So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
> >> try figuring out what user space might be doing. We know exactly what
> >> user space asks for -- and that can be documented properly.
> >>
>
> Just a thought, if we ever want to implement khugepaged in user space,
> it could theoretically obtain similar information using e.g., the
> pagemap. It wouldn't be race-free, but the question is if it would matter.
>
> I consider the primary use case of giving an application more precise
> control over actual THP placement.
>

Good point about the pagemap and agree about the primary use case -
I'll make that clear in v3 cover letter.

> >
> > Sounds good to me. Would you also be in favor of decoupling allocation
> > semantics from khugepaged? I.e. we'll pick some default gfp flags and
> > not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?
>
> Good question. It's not really a heuristic like that other stuff.
>
> Easy answer: we're not dealing with khugepaged, so anything in
> /sys/kernel/mm/transparent_hugepage/khugepaged/ shouldn't apply?
>

That's what I'm thinking now too. If there's no objections, I'll
proceed in that direction for v3.

> Sure, we could have a separate toggles for MADV_COLLAPSE.
>
> Maybe we simply want a dedicated syscall where we can specify additional
> options ... but maybe that simply over-complicates the problem.
>

Thankfully process_madvise(2) has flags, and madvise(2) users can
always migrate to using process_madvise(2) on self. Piggy-backing off
madvise infrastructure for these "non-advice actions" (e.g.
MADV_PAGEOUT) seems to be the norm.

Thanks as always for your time and thoughts!

Zach

> --
> Thanks,
>
> David / dhildenb
>
Yang Shi April 21, 2022, 12:56 a.m. UTC | #11
On Tue, Apr 19, 2022 at 3:43 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Tue, Apr 19, 2022 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > >> E.g., have with a very sparse memory layout, we don't want to waste
> > >> memory by allocating memory where we actually have no page populated yet
> > >> -- could be user space won't reuse that memory in the foreseeable
> > >> future. With too many swap entries, we don't want to trigger an
> > >> eventually unnecessary overhead of swapping in entries if user space
> > >> won't access them in the foreseeable future. Something similar applies
> > >> to max_ptes_shared, where one might just end up wasting a lot of memory
> > >> eventually in some applications.
> > >>
> > >> So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
> > >> try figuring out what user space might be doing. We know exactly what
> > >> user space asks for -- and that can be documented properly.
> > >>
> >
> > Just a thought, if we ever want to implement khugepaged in user space,
> > it could theoretically obtain similar information using e.g., the
> > pagemap. It wouldn't be race-free, but the question is if it would matter.
> >
> > I consider the primary use case of giving an application more precise
> > control over actual THP placement.
> >
>
> Good point about the pagemap and agree about the primary use case -
> I'll make that clear in v3 cover letter.
>
> > >
> > > Sounds good to me. Would you also be in favor of decoupling allocation
> > > semantics from khugepaged? I.e. we'll pick some default gfp flags and
> > > not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?
> >
> > Good question. It's not really a heuristic like that other stuff.
> >
> > Easy answer: we're not dealing with khugepaged, so anything in
> > /sys/kernel/mm/transparent_hugepage/khugepaged/ shouldn't apply?
> >
>
> That's what I'm thinking now too. If there's no objections, I'll
> proceed in that direction for v3.

I agree, we should not treat MADV_COLLAPSE as "userspace khugepaged"
IMHO. It is still best effort though, but it is requested by the users
explicitly so kernel should trust the users' judgement and ignore
those max_ptes_* since we should assume the users know what they are
doing and the cost.

>
> > Sure, we could have a separate toggles for MADV_COLLAPSE.
> >
> > Maybe we simply want a dedicated syscall where we can specify additional
> > options ... but maybe that simply over-complicates the problem.
> >
>
> Thankfully process_madvise(2) has flags, and madvise(2) users can
> always migrate to using process_madvise(2) on self. Piggy-backing off
> madvise infrastructure for these "non-advice actions" (e.g.
> MADV_PAGEOUT) seems to be the norm.
>
> Thanks as always for your time and thoughts!
>
> Zach
>
> > --
> > Thanks,
> >
> > David / dhildenb
> >
Zach O'Keefe April 21, 2022, 1:02 a.m. UTC | #12
On Wed, Apr 20, 2022 at 5:57 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Apr 19, 2022 at 3:43 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Tue, Apr 19, 2022 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > >> E.g., have with a very sparse memory layout, we don't want to waste
> > > >> memory by allocating memory where we actually have no page populated yet
> > > >> -- could be user space won't reuse that memory in the foreseeable
> > > >> future. With too many swap entries, we don't want to trigger an
> > > >> eventually unnecessary overhead of swapping in entries if user space
> > > >> won't access them in the foreseeable future. Something similar applies
> > > >> to max_ptes_shared, where one might just end up wasting a lot of memory
> > > >> eventually in some applications.
> > > >>
> > > >> So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
> > > >> try figuring out what user space might be doing. We know exactly what
> > > >> user space asks for -- and that can be documented properly.
> > > >>
> > >
> > > Just a thought, if we ever want to implement khugepaged in user space,
> > > it could theoretically obtain similar information using e.g., the
> > > pagemap. It wouldn't be race-free, but the question is if it would matter.
> > >
> > > I consider the primary use case of giving an application more precise
> > > control over actual THP placement.
> > >
> >
> > Good point about the pagemap and agree about the primary use case -
> > I'll make that clear in v3 cover letter.
> >
> > > >
> > > > Sounds good to me. Would you also be in favor of decoupling allocation
> > > > semantics from khugepaged? I.e. we'll pick some default gfp flags and
> > > > not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?
> > >
> > > Good question. It's not really a heuristic like that other stuff.
> > >
> > > Easy answer: we're not dealing with khugepaged, so anything in
> > > /sys/kernel/mm/transparent_hugepage/khugepaged/ shouldn't apply?
> > >
> >
> > That's what I'm thinking now too. If there's no objections, I'll
> > proceed in that direction for v3.
>
> I agree, we should not treat MADV_COLLAPSE as "userspace khugepaged"
> IMHO. It is still best effort though, but it is requested by the users
> explicitly so kernel should trust the users' judgement and ignore
> those max_ptes_* since we should assume the users know what they are
> doing and the cost.
>

Thanks for reading and giving your thoughts, Yang. Glad to hear we are
aligned here!

I'll send out a v3 early next week. Only real change is the gfp flags,
but I want to avoid spamming folks so soon since v2.

Thanks,
Zach

> >
> > > Sure, we could have a separate toggles for MADV_COLLAPSE.
> > >
> > > Maybe we simply want a dedicated syscall where we can specify additional
> > > options ... but maybe that simply over-complicates the problem.
> > >
> >
> > Thankfully process_madvise(2) has flags, and madvise(2) users can
> > always migrate to using process_madvise(2) on self. Piggy-backing off
> > madvise infrastructure for these "non-advice actions" (e.g.
> > MADV_PAGEOUT) seems to be the norm.
> >
> > Thanks as always for your time and thoughts!
> >
> > Zach
> >
> > > --
> > > Thanks,
> > >
> > > David / dhildenb
> > >