[v2,1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise()

Message ID 20240118120347.61817-1-ioworker0@gmail.com (mailing list archive)
State New

Commit Message

Lance Yang Jan. 18, 2024, 12:03 p.m. UTC
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it avoids direct reclaim and/or compaction, failing quickly on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA are independent
of the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of the memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.
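
The clamping rule above can be sketched in userspace C. This is only an
illustration, not the kernel code: the 2 MiB hugepage size is an assumption
(it matches x86-64 with 4 KiB base pages), and struct hrange and
clamp_to_hugepages() are hypothetical names introduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 2 MiB PMD-sized huge pages, as on x86-64 with 4 KiB pages. */
#define HPAGE_SIZE ((uintptr_t)2 << 20)
#define HPAGE_MASK (~(HPAGE_SIZE - 1))

/* Hypothetical helper type: a hugepage-aligned [start, end) range. */
struct hrange {
	uintptr_t start;	/* first hugepage-aligned address in the range */
	uintptr_t end;		/* hugepage-aligned exclusive upper bound */
};

/*
 * Round start up and end down to hugepage boundaries.  Returns false when
 * the clamped range does not contain one full hugepage-sized region.
 * Overflow of start near the top of the address space is not handled.
 */
static bool clamp_to_hugepages(uintptr_t start, uintptr_t end,
			       struct hrange *out)
{
	out->start = (start + HPAGE_SIZE - 1) & HPAGE_MASK;
	out->end = end & HPAGE_MASK;
	return out->start < out->end;
}
```

Under these assumptions, a page-aligned request for [0x201000, 0x601000)
would be clamped to [0x400000, 0x600000), while [0x201000, 0x3ff000) covers
no complete hugepage and would be rejected.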

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, and fails quickly if the allocation cannot be satisfied. When
the system has multiple NUMA nodes, the hugepage will be allocated from the
node providing the most native pages. This operation acts on the current
state of the specified process and makes no persistent changes or guarantees
on how pages will be mapped, constructed, or faulted in the future.
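
As a rough illustration of "the node providing the most native pages", the
selection amounts to an argmax over per-node page counts.  The sketch below
is simplified userspace C, not the kernel's
hpage_collapse_find_target_node(): pick_target_node() is a hypothetical
name, and ties are resolved to the lowest-numbered node here, which does not
reproduce the kernel's own tie handling.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the node that contributed the most scanned pages.
 * Ties go to the lowest-numbered node.  Returns -1 if nr_nodes is 0.
 */
static int pick_target_node(const uint32_t *node_load, size_t nr_nodes)
{
	int target = -1;
	uint32_t max_load = 0;

	for (size_t nid = 0; nid < nr_nodes; nid++) {
		/* Take the first node, then any node with a strictly higher load. */
		if (target < 0 || node_load[nid] > max_load) {
			max_load = node_load[nid];
			target = (int)nid;
		}
	}
	return target;
}
```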

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
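
A minimal sketch of how a userspace allocator might issue such a request on
its own memory, assuming this patch is applied.  MADV_F_COLLAPSE_LIGHT and
its value 26 come from this proposal and are not in released kernel
headers; collapse_light_self() is a hypothetical helper name.  The sketch
also assumes the toolchain headers define SYS_pidfd_open and
SYS_process_madvise.  On kernels without the patch the call is expected to
fail, most likely with EINVAL.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Value proposed by this patch; not part of any released kernel header. */
#ifndef MADV_F_COLLAPSE_LIGHT
#define MADV_F_COLLAPSE_LIGHT 26
#endif

/*
 * Ask the kernel to collapse [addr, addr + len) in the calling process
 * without entering direct reclaim or compaction.  Returns the number of
 * bytes advised on success, or -1 with errno set on failure.
 */
static ssize_t collapse_light_self(void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };
	int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
	ssize_t ret;
	int saved;

	if (pidfd < 0)
		return -1;

	ret = syscall(SYS_process_madvise, pidfd, &iov, (size_t)1,
		      MADV_F_COLLAPSE_LIGHT, 0U);

	/* Preserve the errno of process_madvise() across close(). */
	saved = errno;
	close(pidfd);
	errno = saved;
	return ret;
}
```

Because the request is opportunistic, a caller would typically treat any
failure as "no huge page this time" and carry on with base pages rather
than retry in a loop.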

[1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
[2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
[3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
[4] https://github.com/golang/go/issues/63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
---
V1 -> V2: Treat process_madvise(MADV_F_COLLAPSE_LIGHT) as the lighter-weight alternative 
	to madvise(MADV_COLLAPSE)

 arch/alpha/include/uapi/asm/mman.h           |  1 +
 arch/mips/include/uapi/asm/mman.h            |  1 +
 arch/parisc/include/uapi/asm/mman.h          |  1 +
 arch/xtensa/include/uapi/asm/mman.h          |  1 +
 include/linux/huge_mm.h                      |  5 +--
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/khugepaged.c                              | 15 ++++++--
 mm/madvise.c                                 | 36 +++++++++++++++++---
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 9 files changed, 52 insertions(+), 10 deletions(-)

Comments

Michal Hocko Jan. 18, 2024, 1:28 p.m. UTC | #1
[CC linux-api]

On Thu 18-01-24 20:03:46, Lance Yang wrote:
> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> 
> Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
> has CAP_SYS_ADMIN or is requesting the collapse of its own memory.
> 
> The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
> it  avoids direct reclaim and/or compaction, quickly failing on allocation
> errors.
> 
> This change enables a more flexible and efficient usage of memory collapse
> operations, providing additional control to userspace applications for
> system-wide THP optimization.
> 
> Semantics
> 
> This call is independent of the system-wide THP sysfs settings, but will
> fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
> multiple VMAs, the semantics of the collapse over each VMA is independent
> from the others.  This implies a hugepage cannot cross a VMA boundary.  If
> collapse of a given hugepage-aligned/sized region fails, the operation may
> continue to attempt collapsing the remainder of memory specified.
> 
> The memory ranges provided must be page-aligned, but are not required to
> be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
> start/end of the range will be clamped to the first/last hugepage-aligned
> address covered by said range.  The memory ranges must span at least one
> hugepage-sized region.
> 
> All non-resident pages covered by the range will first be
> swapped/faulted-in, before being internally copied onto a freshly
> allocated hugepage.  Unmapped pages will have their data directly
> initialized to 0 in the new hugepage.  However, for every eligible
> hugepage aligned/sized region to-be collapsed, at least one page must
> currently be backed by memory (a PMD covering the address range must
> already exist).
> 
> Allocation for the new hugepage will not enter direct reclaim and/or
> compaction, quickly failing if allocation fails. When the system has
> multiple NUMA nodes, the hugepage will be allocated from the node providing
> the most native pages. This operation operates on the current state of the
> specified process and makes no persistent changes or guarantees on how pages
> will be mapped, constructed, or faulted in the future.
> 
> Use Cases
> 
> An immediate user of this new functionality is the Go runtime heap allocator
> that manages memory in hugepage-sized chunks. In the past, whether it was a
> newly allocated chunk through mmap() or a reused chunk released by
> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> respectively. However, both approaches resulted in performance issues; for
> both scenarios, there could be entries into direct reclaim and/or compaction,
> leading to unpredictable stalls[4]. Now, the allocator can confidently use
> process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
> 
> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> [4] https://github.com/golang/go/issues/63334
> 
> [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
> 
> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> Suggested-by: Zach O'Keefe <zokeefe@google.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> ---
> V1 -> V2: Treat process_madvise(MADV_F_COLLAPSE_LIGHT) as the lighter-weight alternative 
> 	to madvise(MADV_COLLAPSE)
> 
>  arch/alpha/include/uapi/asm/mman.h           |  1 +
>  arch/mips/include/uapi/asm/mman.h            |  1 +
>  arch/parisc/include/uapi/asm/mman.h          |  1 +
>  arch/xtensa/include/uapi/asm/mman.h          |  1 +
>  include/linux/huge_mm.h                      |  5 +--
>  include/uapi/asm-generic/mman-common.h       |  1 +
>  mm/khugepaged.c                              | 15 ++++++--
>  mm/madvise.c                                 | 36 +++++++++++++++++---
>  tools/include/uapi/asm-generic/mman-common.h |  1 +
>  9 files changed, 52 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 763929e814e9..22f23ca04f1a 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -77,6 +77,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index c6e1fc77c996..acec0b643e9c 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -104,6 +104,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 68c44f99bc93..812029c98cd7 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -71,6 +71,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  #define MADV_HWPOISON     100		/* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 1ff0c858544f..52ef463dd5b6 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -112,6 +112,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..075fdb5d481a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>  		     int advice);
>  int madvise_collapse(struct vm_area_struct *vma,
>  		     struct vm_area_struct **prev,
> -		     unsigned long start, unsigned long end);
> +		     unsigned long start, unsigned long end, int behavior);
>  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
>  			   unsigned long end, long adjust_next);
>  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
>  
>  static inline int madvise_collapse(struct vm_area_struct *vma,
>  				   struct vm_area_struct **prev,
> -				   unsigned long start, unsigned long end)
> +				   unsigned long start, unsigned long end,
> +				   int behavior)
>  {
>  	return -EINVAL;
>  }
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..92c67bc755da 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -78,6 +78,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  /* compatibility flags */
>  #define MAP_FILE	0
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2b219acb528e..2840051c0ae2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -97,6 +97,8 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
>  struct collapse_control {
>  	bool is_khugepaged;
>  
> +	int behavior;
> +
>  	/* Num pages scanned per node */
>  	u32 node_load[MAX_NUMNODES];
>  
> @@ -1058,10 +1060,16 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
>  			      struct collapse_control *cc)
>  {
> -	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> -		     GFP_TRANSHUGE);
>  	int node = hpage_collapse_find_target_node(cc);
>  	struct folio *folio;
> +	gfp_t gfp;
> +
> +	if (cc->is_khugepaged)
> +		gfp = alloc_hugepage_khugepaged_gfpmask();
> +	else
> +		gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ?
> +			       GFP_TRANSHUGE_LIGHT :
> +			       GFP_TRANSHUGE);
>  
>  	if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
>  		*hpage = NULL;
> @@ -2697,7 +2705,7 @@ static int madvise_collapse_errno(enum scan_result r)
>  }
>  
>  int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> -		     unsigned long start, unsigned long end)
> +		     unsigned long start, unsigned long end, int behavior)
>  {
>  	struct collapse_control *cc;
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -2718,6 +2726,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  	if (!cc)
>  		return -ENOMEM;
>  	cc->is_khugepaged = false;
> +	cc->behavior = behavior;
>  
>  	mmgrab(mm);
>  	lru_add_drain_all();
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 912155a94ed5..9c40226505aa 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_POPULATE_READ:
>  	case MADV_POPULATE_WRITE:
>  	case MADV_COLLAPSE:
> +	case MADV_F_COLLAPSE_LIGHT:
>  		return 0;
>  	default:
>  		/* be safe, default to 1. list exceptions explicitly */
> @@ -1082,8 +1083,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>  		if (error)
>  			goto out;
>  		break;
> +	case MADV_F_COLLAPSE_LIGHT:
>  	case MADV_COLLAPSE:
> -		return madvise_collapse(vma, prev, start, end);
> +		return madvise_collapse(vma, prev, start, end, behavior);
>  	}
>  
>  	anon_name = anon_vma_name(vma);
> @@ -1178,6 +1180,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_HUGEPAGE:
>  	case MADV_NOHUGEPAGE:
>  	case MADV_COLLAPSE:
> +	case MADV_F_COLLAPSE_LIGHT:
>  #endif
>  	case MADV_DONTDUMP:
>  	case MADV_DODUMP:
> @@ -1194,6 +1197,17 @@ madvise_behavior_valid(int behavior)
>  	}
>  }
>  
> +
> +static bool process_madvise_behavior_only(int behavior)
> +{
> +	switch (behavior) {
> +	case MADV_F_COLLAPSE_LIGHT:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
>  static bool process_madvise_behavior_valid(int behavior)
>  {
>  	switch (behavior) {
> @@ -1201,6 +1215,7 @@ static bool process_madvise_behavior_valid(int behavior)
>  	case MADV_PAGEOUT:
>  	case MADV_WILLNEED:
>  	case MADV_COLLAPSE:
> +	case MADV_F_COLLAPSE_LIGHT:
>  		return true;
>  	default:
>  		return false;
> @@ -1368,6 +1383,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *		transparent huge pages so the existing pages will not be
>   *		coalesced into THP and new pages will not be allocated as THP.
>   *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> + *  MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or
> + *		compaction.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *		from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> @@ -1394,7 +1411,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *  -EBADF  - map exists, but area maps something that isn't a file.
>   *  -EAGAIN - a kernel resource was temporarily unavailable.
>   */
> -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> +int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> +		int behavior, bool is_process_madvise)
>  {
>  	unsigned long end;
>  	int error;
> @@ -1405,6 +1423,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>  	if (!madvise_behavior_valid(behavior))
>  		return -EINVAL;
>  
> +	if (!is_process_madvise && process_madvise_behavior_only(behavior))
> +		return -EINVAL;
> +
>  	if (!PAGE_ALIGNED(start))
>  		return -EINVAL;
>  	len = PAGE_ALIGN(len_in);
> @@ -1448,9 +1469,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>  	return error;
>  }
>  
> +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> +{
> +	return _do_madvise(mm, start, len_in, behavior, false);
> +}
> +
>  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
> -	return do_madvise(current->mm, start, len_in, behavior);
> +	return _do_madvise(current->mm, start, len_in, behavior, false);
>  }
>  
>  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> @@ -1504,8 +1530,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>  	total_len = iov_iter_count(&iter);
>  
>  	while (iov_iter_count(&iter)) {
> -		ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> -					iter_iov_len(&iter), behavior);
> +		ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> +					iter_iov_len(&iter), behavior, true);
>  		if (ret < 0)
>  			break;
>  		iov_iter_advance(&iter, iter_iov_len(&iter));
> diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..92c67bc755da 100644
> --- a/tools/include/uapi/asm-generic/mman-common.h
> +++ b/tools/include/uapi/asm-generic/mman-common.h
> @@ -78,6 +78,7 @@
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>  
>  /* compatibility flags */
>  #define MAP_FILE	0
> -- 
> 2.33.1
Michal Hocko Jan. 18, 2024, 1:40 p.m. UTC | #2
On Thu 18-01-24 20:03:46, Lance Yang wrote:
[...]

Before we discuss the semantics, let's focus on the use case.

> Use Cases
> 
> An immediate user of this new functionality is the Go runtime heap allocator
> that manages memory in hugepage-sized chunks. In the past, whether it was a
> newly allocated chunk through mmap() or a reused chunk released by
> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> respectively. However, both approaches resulted in performance issues; for
> both scenarios, there could be entries into direct reclaim and/or compaction,
> leading to unpredictable stalls[4]. Now, the allocator can confidently use
> process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

IIUC the primary reason is the cost of the huge page allocation which
can be really high if the memory is heavily fragmented and it is called
synchronously from the process directly, correct? Can that be worked
around by process_madvise and performing the operation from a different
context? Are there any other reasons to have a different mode?

I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
e.g. non blocking one to make sure that the caller doesn't really block
on resource contention (be it locks or memory availability) because that
matches our non-blocking interface in other areas but having a LIGHT
operation sounds really vague and the exact semantic would be
implementation specific and might change over time. Non-blocking has a
clear semantic but it is not really clear whether that is what you
really need/want.

> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> [4] https://github.com/golang/go/issues/63334
> 
> [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
Michal Hocko Jan. 18, 2024, 1:43 p.m. UTC | #3
Dang, forgot to cc linux-api...

On Thu 18-01-24 14:40:19, Michal Hocko wrote:
> On Thu 18-01-24 20:03:46, Lance Yang wrote:
> [...]
> 
> before we discuss the semantic, let's focus on the usecase.
> 
> > Use Cases
> > 
> > An immediate user of this new functionality is the Go runtime heap allocator
> > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > newly allocated chunk through mmap() or a reused chunk released by
> > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > respectively. However, both approaches resulted in performance issues; for
> > both scenarios, there could be entries into direct reclaim and/or compaction,
> > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
> 
> IIUC the primary reason is the cost of the huge page allocation which
> can be really high if the memory is heavily fragmented and it is called
> synchronously from the process directly, correct? Can that be worked
> around by process_madvise and performing the operation from a different
> context? Are there any other reasons to have a different mode?
> 
> I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
> e.g. non blocking one to make sure that the caller doesn't really block
> on resource contention (be it locks or memory availability) because that
> matches our non-blocking interface in other areas but having a LIGHT
> operation sounds really vague and the exact semantic would be
> implementation specific and might change over time. Non-blocking has a
> clear semantic but it is not really clear whether that is what you
> really need/want.
> 
> > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > [4] https://github.com/golang/go/issues/63334
> > 
> > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
> -- 
> Michal Hocko
> SUSE Labs
Zach O'Keefe Jan. 18, 2024, 2:58 p.m. UTC | #4
On Thu, Jan 18, 2024 at 5:43 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Dang, forgot to cc linux-api...
>
> On Thu 18-01-24 14:40:19, Michal Hocko wrote:
> > On Thu 18-01-24 20:03:46, Lance Yang wrote:
> > [...]
> >
> > before we discuss the semantic, let's focus on the usecase.
> >
> > > Use Cases
> > >
> > > An immediate user of this new functionality is the Go runtime heap allocator
> > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > newly allocated chunk through mmap() or a reused chunk released by
> > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > respectively. However, both approaches resulted in performance issues; for
> > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

Aside: The thought was a MADV_F_COLLAPSE_LIGHT _flag_; so it'd be
process_madvise(..., MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT)
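
For illustration, the flag-style call suggested above might look like the
following sketch.  The flag name MADV_F_COLLAPSE_LIGHT_FLAG and its bit
value 0x1 are purely hypothetical (the thread assigns no value), and
collapse_with_light_flag() is a made-up helper.  Kernels current at the
time of this thread reject any nonzero flags argument to process_madvise(),
so this is expected to fail there.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* matches the uapi headers quoted in the patch */
#endif

/* Hypothetical flag bit; the thread does not assign it a value. */
#define MADV_F_COLLAPSE_LIGHT_FLAG 0x1U

/*
 * Flag-style request: the advice stays MADV_COLLAPSE and the "light"
 * modifier travels in the flags argument.  Returns -1 with errno set on
 * failure.
 */
static ssize_t collapse_with_light_flag(int pidfd, void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };

	return syscall(SYS_process_madvise, pidfd, &iov, (size_t)1,
		       MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT_FLAG);
}
```

The difference from the v2 patch is only where the modifier lives: v2
defines a new advice value, whereas this form keeps the advice unchanged
and uses the otherwise unused flags parameter.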

> > IIUC the primary reason is the cost of the huge page allocation which
> > can be really high if the memory is heavily fragmented and it is called
> > synchronously from the process directly, correct? Can that be worked
> > around by process_madvise and performing the operation from a different
> > context? Are there any other reasons to have a different mode?
> >
> > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
> > e.g. non blocking one to make sure that the caller doesn't really block
> > on resource contention (be it locks or memory availability) because that
> > matches our non-blocking interface in other areas but having a LIGHT
> > operation sounds really vague and the exact semantic would be
> > implementation specific and might change over time. Non-blocking has a
> > clear semantic but it is not really clear whether that is what you
> > really need/want.

IIUC, usecase from Go is unbounded latency due to sync compaction in a
context where the latency is unacceptable. Working w/ them to
understand how things can be improved -- it's possible the changes can
occur entirely on their side, w/o any additional kernel support.

The non-blocking case awkwardly sits between MADV_COLLAPSE today, and
khugepaged; esp when common case is that the allocation can probably
be satisfied in fast path.

The suggestion for something like "LIGHT" was intentionally vague
because it could allow for other optimizations / changes down the
line, as you point out. I think that might be a win, vs tying to a
specific optimization (e.g. like a MADV_F_COLLAPSE_NODEFRAG). But I
could be alone on that front, given the design of
/sys/kernel/mm/transparent_hugepage.

But circling back, I agree w/ you that the first order of business is to
iron out a real usecase. As of right now, it's not clear something
like this is required or helpful.

Thanks,
Zach




> > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > > [4] https://github.com/golang/go/issues/63334
> > >
> > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
> > --
> > Michal Hocko
> > SUSE Labs
>
> --
> Michal Hocko
> SUSE Labs
Yang Shi Jan. 18, 2024, 7 p.m. UTC | #5
On Thu, Jan 18, 2024 at 6:59 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Thu, Jan 18, 2024 at 5:43 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > Dang, forgot to cc linux-api...
> >
> > On Thu 18-01-24 14:40:19, Michal Hocko wrote:
> > > On Thu 18-01-24 20:03:46, Lance Yang wrote:
> > > [...]
> > >
> > > before we discuss the semantic, let's focus on the usecase.
> > >
> > > > Use Cases
> > > >
> > > > An immediate user of this new functionality is the Go runtime heap allocator
> > > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > > newly allocated chunk through mmap() or a reused chunk released by
> > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > > respectively. However, both approaches resulted in performance issues; for
> > > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
>
> Aside: The thought was a MADV_F_COLLAPSE_LIGHT _flag_; so it'd be
> process_madvise(..., MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT)
>
> > > IIUC the primary reason is the cost of the huge page allocation which
> > > can be really high if the memory is heavily fragmented and it is called
> > > synchronously from the process directly, correct? Can that be worked
> > > around by process_madvise and performing the operation from a different
> > > context? Are there any other reasons to have a different mode?
> > >
> > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
> > > e.g. non blocking one to make sure that the caller doesn't really block
> > > on resource contention (be it locks or memory availability) because that
> > > matches our non-blocking interface in other areas but having a LIGHT
> > > operation sounds really vague and the exact semantic would be
> > > implementation specific and might change over time. Non-blocking has a
> > > clear semantic but it is not really clear whether that is what you
> > > really need/want.
>
> IIUC, usecase from Go is unbounded latency due to sync compaction in a
> context where the latency is unacceptable. Working w/ them to
> understand how things can be improved -- it's possible the changes can
> occur entirely on their side, w/o any additional kernel support.
>
> The non-blocking case awkwardly sits between MADV_COLLAPSE today, and
> khugepaged; esp when common case is that the allocation can probably
> be satisfied in fast path.
>
> The suggestion for something like "LIGHT" was intentionally vague
> because it could allow for other optimizations / changes down the
> line, as you point out. I think that might be a win, vs tying to a
> specific optimization (e.g. like a MADV_F_COLLAPSE_NODEFRAG). But I
> could be alone on that front, given the design of
> /sys/kernel/mm/transparent_hugepage.

Per the description, Go marks the address space with MADV_HUGEPAGE. It
means the application really wants huge pages to back the address
space, so the kernel will try as hard as possible to get huge pages. This
is the default behavior of MADV_HUGEPAGE. If they don't want to enter
direct reclaim, they can configure the defrag mode to "defer", which
means no direct reclaim; instead kswapd and kcompactd are woken up and
khugepaged installs huge pages later on. But this mode is not
supported by khugepaged defrag, so MADV_COLLAPSE may not support it
(IIRC MADV_COLLAPSE uses khugepaged defrag mode). Or they can just not
call MADV_HUGEPAGE and leave the decision to the users; IIRC Java does
so (the user specifies a flag indicating whether to use huge pages).
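
To check which defrag mode is active, a program can look for the bracketed
token in /sys/kernel/mm/transparent_hugepage/defrag.  The sketch below only
parses a line already read from that file; thp_active_mode() is a
hypothetical helper name, and reading the file is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) token of a THP sysfs setting line, such as
 * "always defer defer+madvise [madvise] never", into out.
 * Returns 0 on success, -1 if there is no bracketed token or it does not
 * fit in out.
 */
static int thp_active_mode(const char *line, char *out, size_t outsz)
{
	const char *lb = strchr(line, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;
	size_t n;

	if (!rb)
		return -1;

	n = (size_t)(rb - lb - 1);
	if (n >= outsz)
		return -1;

	memcpy(out, lb + 1, n);
	out[n] = '\0';
	return 0;
}
```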

>
> But circling back, I agree w/ you that the first order of business is to
> iron out a real usecase. As of right now, it's not clear something
> like this is required or helpful.
>
> Thanks,
> Zach
>
>
>
>
> > > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > > > [4] https://github.com/golang/go/issues/63334
> > > >
> > > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
> > > --
> > > Michal Hocko
> > > SUSE Labs
> >
> > --
> > Michal Hocko
> > SUSE Labs
Lance Yang Jan. 19, 2024, 1:46 a.m. UTC | #6
On Thu, Jan 18, 2024 at 10:59 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Thu, Jan 18, 2024 at 5:43 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > Dang, forgot to cc linux-api...
> >
> > On Thu 18-01-24 14:40:19, Michal Hocko wrote:
> > > On Thu 18-01-24 20:03:46, Lance Yang wrote:
> > > [...]
> > >
> > > before we discuss the semantic, let's focus on the usecase.
> > >
> > > > Use Cases
> > > >
> > > > An immediate user of this new functionality is the Go runtime heap allocator
> > > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > > newly allocated chunk through mmap() or a reused chunk released by
> > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > > respectively. However, both approaches resulted in performance issues; for
> > > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
>
> Aside: The thought was a MADV_F_COLLAPSE_LIGHT _flag_; so it'd be
> process_madvise(..., MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT)

I apologize for the misunderstanding. I will provide the correct implementation
in version 3.
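
For illustration, the call shape Zach describes (MADV_COLLAPSE as the advice,
the new bit in the flags argument) might look roughly like the sketch below.
It is not taken from any posted version of the patch: the flag value 0x1 is a
placeholder, since the thread does not settle on one, and the wrapper name is
hypothetical. MADV_COLLAPSE is defined as 25 only as a fallback for older
headers, matching the generic value visible in the patch context.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25    /* fallback for older libc headers */
#endif

/* HYPOTHETICAL value: no flag bit has been assigned in the thread. */
#define MADV_F_COLLAPSE_LIGHT 0x1

/*
 * Request a collapse of [addr, addr + len) in the process behind pidfd,
 * passing the proposed bit as a flag instead of as a separate advice.
 * Returns the number of bytes advised, or -1 with errno set.
 */
static long collapse_light(int pidfd, void *addr, size_t len)
{
#ifdef SYS_process_madvise
    struct iovec iov = { .iov_base = addr, .iov_len = len };

    return syscall(SYS_process_madvise, (long)pidfd, &iov, 1UL,
                   (long)MADV_COLLAPSE, (long)MADV_F_COLLAPSE_LIGHT);
#else
    (void)pidfd;
    (void)addr;
    (void)len;
    errno = ENOSYS;         /* headers too old to know the syscall */
    return -1;
#endif
}
```

Kernels that require the flags argument of process_madvise() to be 0 are
expected to reject this call with EINVAL, so the sketch only shows the
intended interface, not something that works today.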

BR,
Lance

Lance Yang Jan. 19, 2024, 2:03 a.m. UTC | #7
Hey Michal,

Thanks for taking the time to review!

On Thu, Jan 18, 2024 at 9:40 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 18-01-24 20:03:46, Lance Yang wrote:
> [...]
>
> before we discuss the semantic, let's focus on the usecase.
>
> > Use Cases
> >
> > An immediate user of this new functionality is the Go runtime heap allocator
> > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > newly allocated chunk through mmap() or a reused chunk released by
> > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > respectively. However, both approaches resulted in performance issues; for
> > both scenarios, there could be entries into direct reclaim and/or compaction,
> > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
>
> IIUC the primary reason is the cost of the huge page allocation which
> can be really high if the memory is heavily fragmented and it is called
> synchronously from the process directly, correct? Can that be worked

Yes, that's correct.

> around by process_madvise and performing the operation from a different
> context? Are there any other reasons to have a different mode?

In latency-sensitive scenarios, some applications aim to enhance performance
by utilizing huge pages as much as possible. At the same time, in case of
allocation failure, they prefer a quick return without triggering direct memory
reclamation and compaction.

>
> I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
> e.g. non blocking one to make sure that the caller doesn't really block
> on resource contention (be it locks or memory availability) because that
> matches our non-blocking interface in other areas but having a LIGHT
> operation sounds really vague and the exact semantic would be
> implementation specific and might change over time. Non-blocking has a
> clear semantic but it is not really clear whether that is what you
> really need/want.

Could you provide me with some suggestions regarding the naming of a
more relaxed (opportunistic) MADV_COLLAPSE?

Thanks again for your review and your suggestion!
Lance

Lance Yang Jan. 19, 2024, 2:37 a.m. UTC | #8
Hey Yang,

Thanks for taking the time to review!

On Fri, Jan 19, 2024 at 3:00 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 6:59 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Thu, Jan 18, 2024 at 5:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > Dang, forgot to cc linux-api...
> > >
> > > On Thu 18-01-24 14:40:19, Michal Hocko wrote:
> > > > On Thu 18-01-24 20:03:46, Lance Yang wrote:
> > > > [...]
> > > >
> > > > before we discuss the semantic, let's focus on the usecase.
> > > >
> > > > > Use Cases
> > > > >
> > > > > An immediate user of this new functionality is the Go runtime heap allocator
> > > > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > > > newly allocated chunk through mmap() or a reused chunk released by
> > > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > > > respectively. However, both approaches resulted in performance issues; for
> > > > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
> >
> > Aside: The thought was a MADV_F_COLLAPSE_LIGHT _flag_; so it'd be
> > process_madvise(..., MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT)
> >
> > > > IIUC the primary reason is the cost of the huge page allocation which
> > > > can be really high if the memory is heavily fragmented and it is called
> > > > synchronously from the process directly, correct? Can that be worked
> > > > around by process_madvise and performing the operation from a different
> > > > context? Are there any other reasons to have a different mode?
> > > >
> > > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
> > > > e.g. non blocking one to make sure that the caller doesn't really block
> > > > on resource contention (be it locks or memory availability) because that
> > > > matches our non-blocking interface in other areas but having a LIGHT
> > > > operation sounds really vague and the exact semantic would be
> > > > implementation specific and might change over time. Non-blocking has a
> > > > clear semantic but it is not really clear whether that is what you
> > > > really need/want.
> >
> > IIUC, usecase from Go is unbounded latency due to sync compaction in a
> > context where the latency is unacceptable. Working w/ them to
> > understand how things can be improved -- it's possible the changes can
> > occur entirely on their side, w/o any additional kernel support.
> >
> > The non-blocking case awkwardly sits between MADV_COLLAPSE today, and
> > khugepaged; esp when common case is that the allocation can probably
> > be satisfied in fast path.
> >
> > The suggestion for something like "LIGHT" was intentionally vague
> > because it could allow for other optimizations / changes down the
> > line, as you point out. I think that might be a win, vs tying to a
> > specific optimization (e.g. like a MADV_F_COLLAPSE_NODEFRAG). But I
> > could be alone on that front, given the design of
> > /sys/kernel/mm/transparent_hugepage.
>
> Per the description, Go marks the address space with MADV_HUGEPAGE. That
> means the application really wants huge pages to back the address space,
> so the kernel will try as hard as possible to get huge pages. This is
> the default behavior of MADV_HUGEPAGE. If they don't want to enter
> direct reclaim, they can set the defrag mode to "defer", which means no
> direct reclaim: the kernel wakes up kswapd and kcompactd and relies on
> khugepaged to install huge pages later on. But this mode is not
> supported by the khugepaged defrag setting, so MADV_COLLAPSE may not
> support it (IIRC MADV_COLLAPSE uses the khugepaged defrag mode). Or they
> can just not call MADV_HUGEPAGE and leave the decision to the users; IIRC
> Java does so (a flag lets users specify whether to use huge pages).

Thank you for providing insights into the Go use cases with MADV_HUGEPAGE
and the configuration options for defrag mode.

Considering the limitations with the "defer" mode, it becomes apparent that
there is a gap in addressing scenarios where an application desires a
lighter-weight alternative to MADV_HUGEPAGE.

MADV_F_COLLAPSE_LIGHT aims to fill this gap by providing a more flexible and
opportunistic approach, catering to applications in latency-sensitive
environments that seek performance improvements with huge pages but prefer to
avoid direct reclaim and compaction. This option can serve as a valuable
addition for users who want more control over the behavior without the
constraints of existing configurations.

In the era of cloud-native computing, it's challenging for users to be aware
of the THP configurations on all nodes in a cluster, let alone have
fine-grained control over them. Simply disabling the use of huge pages due to
concerns about potential direct reclamation and compaction may be regrettable,
as users are deprived of the opportunity to experiment with large page
allocations. However, relying solely on MADV_HUGEPAGE introduces the risk of
unpredictable stalls, making it a trade-off that users must carefully
consider.

By introducing MADV_F_COLLAPSE_LIGHT, we offer users a more flexible and
controllable solution in cloud-native environments, allowing them to better
balance performance requirements and resource management. This selectively
lightweight alternative is designed to provide users with more choices to
better meet the diverse needs of different scenarios.

Thanks again for your review and your suggestion!
Lance

Michal Hocko Jan. 19, 2024, 12:51 p.m. UTC | #9
On Fri 19-01-24 10:03:05, Lance Yang wrote:
> Hey Michal,
> 
> Thanks for taking the time to review!
> 
> On Thu, Jan 18, 2024 at 9:40 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 18-01-24 20:03:46, Lance Yang wrote:
> > [...]
> >
> > before we discuss the semantic, let's focus on the usecase.
> >
> > > Use Cases
> > >
> > > An immediate user of this new functionality is the Go runtime heap allocator
> > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > newly allocated chunk through mmap() or a reused chunk released by
> > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > respectively. However, both approaches resulted in performance issues; for
> > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
> >
> > IIUC the primary reason is the cost of the huge page allocation which
> > can be really high if the memory is heavily fragmented and it is called
> > synchronously from the process directly, correct? Can that be worked
> 
> Yes, that's correct.
> 
> > around by process_madvise and performing the operation from a different
> > context? Are there any other reasons to have a different mode?
> 
> In latency-sensitive scenarios, some applications aim to enhance performance
> by utilizing huge pages as much as possible. At the same time, in case of
> allocation failure, they prefer a quick return without triggering direct memory
> reclamation and compaction.

Could you elaborate some more on why?

> > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
> > e.g. non blocking one to make sure that the caller doesn't really block
> > on resource contention (be it locks or memory availability) because that
> > matches our non-blocking interface in other areas but having a LIGHT
> > operation sounds really vague and the exact semantic would be
> > implementation specific and might change over time. Non-blocking has a
> > clear semantic but it is not really clear whether that is what you
> > really need/want.
> 
> Could you provide me with some suggestions regarding the naming of a
> more relaxed (opportunistic) MADV_COLLAPSE?

Naming is not all that important at this stage (it could be
MADV_COLLAPSE_NOBLOCK for example). The primary question is whether
non-blocking in general is the desired behavior or the implementation
should try but not too hard.
Lance Yang Jan. 19, 2024, 2:08 p.m. UTC | #10
On Fri, Jan 19, 2024 at 8:51 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 19-01-24 10:03:05, Lance Yang wrote:
> > Hey Michal,
> >
> > Thanks for taking the time to review!
> >
> > On Thu, Jan 18, 2024 at 9:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 18-01-24 20:03:46, Lance Yang wrote:
> > > [...]
> > >
> > > before we discuss the semantic, let's focus on the usecase.
> > >
> > > > Use Cases
> > > >
> > > > An immediate user of this new functionality is the Go runtime heap allocator
> > > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > > newly allocated chunk through mmap() or a reused chunk released by
> > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > > respectively. However, both approaches resulted in performance issues; for
> > > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
> > >
> > > IIUC the primary reason is the cost of the huge page allocation which
> > > can be really high if the memory is heavily fragmented and it is called
> > > synchronously from the process directly, correct? Can that be worked
> >
> > Yes, that's correct.
> >
> > > around by process_madvise and performing the operation from a different
> > > context? Are there any other reasons to have a different mode?
> >
> > In latency-sensitive scenarios, some applications aim to enhance performance
> > by utilizing huge pages as much as possible. At the same time, in case of
> > allocation failure, they prefer a quick return without triggering direct memory
> > reclamation and compaction.
>
> Could you elaborate some more on why?

Previously, the Go runtime attempted to mark all new memory as MADV_HUGEPAGE
on Linux and to manage its hugepage eligibility status. Unfortunately, the
default THP behavior on most Linux distros is that MADV_HUGEPAGE blocks while
the kernel eagerly reclaims and compacts memory to allocate a hugepage.
This direct reclaim and compaction is unbounded, and may result in significant
application thread stalls. In really bad cases, these can exceed 100s of ms or
even seconds.
The overall strategy of trying to keep hugepages for the heap unbroken,
however, is sound. So, the Go runtime used MADV_COLLAPSE as an alternative.
See https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af

Later, a Google production service experienced a performance regression with
the Go runtime's use of MADV_COLLAPSE. For now, the Go runtime has rolled back
the usage of MADV_COLLAPSE.
See https://github.com/golang/go/issues/63334

If there were a more relaxed (opportunistic) MADV_COLLAPSE, it would avoid
direct reclaim and/or compaction and quickly fail on allocation errors. This
could be beneficial for similar use cases.
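
To make "quickly fail" concrete, a caller of an opportunistic collapse would
need to decide what to do when the call returns an error. The sketch below is
one possible policy, not anything the Go runtime or the patch implements; the
enum and function names are hypothetical, and the errno interpretation follows
a reading of madvise(2) for MADV_COLLAPSE (ENOMEM for a failed hugepage
allocation, EBUSY for a cgroup charge failure, EAGAIN for a temporarily
unavailable resource), which should be checked against the target kernel.

```c
#include <errno.h>

/* What a latency-sensitive allocator might do after a collapse attempt. */
enum collapse_next {
    COLLAPSE_DONE,          /* range is now backed by huge pages */
    COLLAPSE_RETRY_LATER,   /* transient condition, try again off the hot path */
    COLLAPSE_USE_SMALL,     /* no huge page available, keep base pages */
    COLLAPSE_GIVE_UP,       /* bad range or unsupported request */
};

/*
 * Classify the result of a collapse call.
 * ret is the syscall return value, err is errno saved right after the call:
 *
 *     int ret = madvise(addr, len, MADV_COLLAPSE);
 *     int err = errno;
 *     next = collapse_classify(ret, err);
 */
static enum collapse_next collapse_classify(int ret, int err)
{
    if (ret == 0)
        return COLLAPSE_DONE;

    switch (err) {
    case EAGAIN:
        return COLLAPSE_RETRY_LATER;
    case ENOMEM:
    case EBUSY:
        return COLLAPSE_USE_SMALL;
    default:
        return COLLAPSE_GIVE_UP;
    }
}
```

With a variant that skips direct reclaim and compaction, ENOMEM would
presumably become the common fast-failure result, and the caller would simply
continue with base pages.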

BR,
Lance

Lance Yang Jan. 20, 2024, 2:09 a.m. UTC | #11
On Fri, Jan 19, 2024 at 8:51 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 19-01-24 10:03:05, Lance Yang wrote:
> > Hey Michal,
> >
> > Thanks for taking the time to review!
> >
> > On Thu, Jan 18, 2024 at 9:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 18-01-24 20:03:46, Lance Yang wrote:
> > > [...]
> > >
> > > before we discuss the semantic, let's focus on the usecase.
> > >
> > > > Use Cases
> > > >
> > > > An immediate user of this new functionality is the Go runtime heap allocator
> > > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > > newly allocated chunk through mmap() or a reused chunk released by
> > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > > respectively. However, both approaches resulted in performance issues; for
> > > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
> > >
> > > IIUC the primary reason is the cost of the huge page allocation which
> > > can be really high if the memory is heavily fragmented and it is called
> > > synchronously from the process directly, correct? Can that be worked
> >
> > Yes, that's correct.
> >
> > > around by process_madvise and performing the operation from a different
> > > context? Are there any other reasons to have a different mode?
> >
> > In latency-sensitive scenarios, some applications aim to enhance performance
> > by utilizing huge pages as much as possible. At the same time, in case of
> > allocation failure, they prefer a quick return without triggering direct memory
> > reclamation and compaction.
>
> Could you elaborate some more on why?
>
> > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
> > > e.g. non blocking one to make sure that the caller doesn't really block
> > > on resource contention (be it locks or memory availability) because that
> > > matches our non-blocking interface in other areas but having a LIGHT
> > > operation sounds really vague and the exact semantic would be
> > > implementation specific and might change over time. Non-blocking has a
> > > clear semantic but it is not really clear whether that is what you
> > > really need/want.
> >
> > Could you provide me with some suggestions regarding the naming of a
> > more relaxed (opportunistic) MADV_COLLAPSE?
>
> Naming is not all that important at this stage (it could be
> MADV_COLLAPSE_NOBLOCK for example). The primary question is whether
> non-blocking in general is the desired behavior or the implementation
> should try but not too hard.

Hey Michal,

Thanks for your suggestion!

It seems that "try, but not too hard" aligns well with my desired behavior.
Non-blocking in general is also a great idea. Perhaps in the future, we can
add a MADV_F_COLLAPSE_NOBLOCK flag for scenarios where latency is extremely
critical.

Thanks again,
Lance
Lance Yang Jan. 21, 2024, 3:12 a.m. UTC | #12
Hello Everyone,

For applications actively utilizing THP, the defrag mode may
not be a very user-friendly design. Here are the reasons:
1. Before marking the address space with
    MADV_HUGEPAGE, it is necessary to check if
    the current configuration of the defrag mode aligns with
    the application's preferences.
2. Once the defrag mode configuration changes, these
    applications may face the risk of unpredictable stalls.
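
The check described in point 1 could look roughly like the following sketch.
It assumes the usual sysfs convention of marking the active choice with square
brackets, e.g. "always defer defer+madvise [madvise] never"; the function
names are hypothetical. It also shows the problem from point 2: the result is
only a snapshot and can change right after it is read.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) token of a sysfs choice line
 * into out. Returns 0 on success, -1 if there is no complete "[...]"
 * token or it does not fit into out.
 */
static int sysfs_selected(const char *line, char *out, size_t outlen)
{
    const char *lb, *rb;
    size_t n;

    assert(line && out);

    lb = strchr(line, '[');
    rb = lb ? strchr(lb, ']') : NULL;
    if (!rb)
        return -1;

    n = (size_t)(rb - lb - 1);
    if (n + 1 > outlen)
        return -1;

    memcpy(out, lb + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the live defrag mode; returns -1 where THP sysfs is unavailable. */
static int thp_defrag_mode(char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "r");
    int ret = -1;

    if (!f)
        return -1;
    if (fgets(line, sizeof(line), f))
        ret = sysfs_selected(line, out, outlen);
    fclose(f);
    return ret;
}
```

An application would call thp_defrag_mode() before deciding whether to use
MADV_HUGEPAGE, and would have to repeat the check to notice later changes.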

THP is an important feature of the Linux kernel that can
significantly enhance memory access performance.
However, due to the lack of fine-grained control over
the huge page allocation strategy, many applications
default to not using huge pages and even recommend
users to disable THP. This situation is regrettable.

MADV_COLLAPSE, now in the kernel, is not affected by
the defrag mode. It offers the potential for
fine-grained synchronous control over the huge page
allocation mechanism, marking a significant enhancement
for THP.

By adding flags to MADV_COLLAPSE, different
synchronous allocation strategies can be provided to
applications. This can instill confidence in them, allowing
them to reconsider using THP and allocate huge pages
according to their desired synchronous allocation strategy,
without worrying about the defrag mode configuration.

BR,
Lance


On Thu, Jan 18, 2024 at 8:03 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
>
> Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
> has CAP_SYS_ADMIN or is requesting the collapse of its own memory.
>
> The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
> it avoids direct reclaim and/or compaction, quickly failing on allocation
> errors.
>
> This change enables a more flexible and efficient usage of memory collapse
> operations, providing additional control to userspace applications for
> system-wide THP optimization.
>
> Semantics
>
> This call is independent of the system-wide THP sysfs settings, but will
> fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
> multiple VMAs, the semantics of the collapse over each VMA is independent
> from the others.  This implies a hugepage cannot cross a VMA boundary.  If
> collapse of a given hugepage-aligned/sized region fails, the operation may
> continue to attempt collapsing the remainder of memory specified.
>
> The memory ranges provided must be page-aligned, but are not required to
> be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
> start/end of the range will be clamped to the first/last hugepage-aligned
> address covered by said range.  The memory ranges must span at least one
> hugepage-sized region.
>
> All non-resident pages covered by the range will first be
> swapped/faulted-in, before being internally copied onto a freshly
> allocated hugepage.  Unmapped pages will have their data directly
> initialized to 0 in the new hugepage.  However, for every eligible
> hugepage aligned/sized region to-be collapsed, at least one page must
> currently be backed by memory (a PMD covering the address range must
> already exist).
>
> Allocation for the new hugepage will not enter direct reclaim and/or
> compaction, quickly failing if allocation fails. When the system has
> multiple NUMA nodes, the hugepage will be allocated from the node providing
> the most native pages. This operation operates on the current state of the
> specified process and makes no persistent changes or guarantees on how pages
> will be mapped, constructed, or faulted in the future.
>
> Use Cases
>
> An immediate user of this new functionality is the Go runtime heap allocator
> that manages memory in hugepage-sized chunks. In the past, whether it was a
> newly allocated chunk through mmap() or a reused chunk released by
> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> respectively. However, both approaches resulted in performance issues; for
> both scenarios, there could be entries into direct reclaim and/or compaction,
> leading to unpredictable stalls[4]. Now, the allocator can confidently use
> process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
>
> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> [4] https://github.com/golang/go/issues/63334
>
> [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
>
> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> Suggested-by: Zach O'Keefe <zokeefe@google.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> ---
> V1 -> V2: Treat process_madvise(MADV_F_COLLAPSE_LIGHT) as the lighter-weight alternative
>         to madvise(MADV_COLLAPSE)
>
>  arch/alpha/include/uapi/asm/mman.h           |  1 +
>  arch/mips/include/uapi/asm/mman.h            |  1 +
>  arch/parisc/include/uapi/asm/mman.h          |  1 +
>  arch/xtensa/include/uapi/asm/mman.h          |  1 +
>  include/linux/huge_mm.h                      |  5 +--
>  include/uapi/asm-generic/mman-common.h       |  1 +
>  mm/khugepaged.c                              | 15 ++++++--
>  mm/madvise.c                                 | 36 +++++++++++++++++---
>  tools/include/uapi/asm-generic/mman-common.h |  1 +
>  9 files changed, 52 insertions(+), 10 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 763929e814e9..22f23ca04f1a 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -77,6 +77,7 @@
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  /* compatibility flags */
>  #define MAP_FILE       0
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index c6e1fc77c996..acec0b643e9c 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -104,6 +104,7 @@
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  /* compatibility flags */
>  #define MAP_FILE       0
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 68c44f99bc93..812029c98cd7 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -71,6 +71,7 @@
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  #define MADV_HWPOISON     100          /* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 1ff0c858544f..52ef463dd5b6 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -112,6 +112,7 @@
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  /* compatibility flags */
>  #define MAP_FILE       0
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..075fdb5d481a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>                      int advice);
>  int madvise_collapse(struct vm_area_struct *vma,
>                      struct vm_area_struct **prev,
> -                    unsigned long start, unsigned long end);
> +                    unsigned long start, unsigned long end, int behavior);
>  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
>                            unsigned long end, long adjust_next);
>  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
>
>  static inline int madvise_collapse(struct vm_area_struct *vma,
>                                    struct vm_area_struct **prev,
> -                                  unsigned long start, unsigned long end)
> +                                  unsigned long start, unsigned long end,
> +                                  int behavior)
>  {
>         return -EINVAL;
>  }
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..92c67bc755da 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -78,6 +78,7 @@
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  /* compatibility flags */
>  #define MAP_FILE       0
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2b219acb528e..2840051c0ae2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -97,6 +97,8 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
>  struct collapse_control {
>         bool is_khugepaged;
>
> +       int behavior;
> +
>         /* Num pages scanned per node */
>         u32 node_load[MAX_NUMNODES];
>
> @@ -1058,10 +1060,16 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
>                               struct collapse_control *cc)
>  {
> -       gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> -                    GFP_TRANSHUGE);
>         int node = hpage_collapse_find_target_node(cc);
>         struct folio *folio;
> +       gfp_t gfp;
> +
> +       if (cc->is_khugepaged)
> +               gfp = alloc_hugepage_khugepaged_gfpmask();
> +       else
> +               gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ?
> +                              GFP_TRANSHUGE_LIGHT :
> +                              GFP_TRANSHUGE);
>
>         if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
>                 *hpage = NULL;
> @@ -2697,7 +2705,7 @@ static int madvise_collapse_errno(enum scan_result r)
>  }
>
>  int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> -                    unsigned long start, unsigned long end)
> +                    unsigned long start, unsigned long end, int behavior)
>  {
>         struct collapse_control *cc;
>         struct mm_struct *mm = vma->vm_mm;
> @@ -2718,6 +2726,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>         if (!cc)
>                 return -ENOMEM;
>         cc->is_khugepaged = false;
> +       cc->behavior = behavior;
>
>         mmgrab(mm);
>         lru_add_drain_all();
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 912155a94ed5..9c40226505aa 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
>         case MADV_POPULATE_READ:
>         case MADV_POPULATE_WRITE:
>         case MADV_COLLAPSE:
> +       case MADV_F_COLLAPSE_LIGHT:
>                 return 0;
>         default:
>                 /* be safe, default to 1. list exceptions explicitly */
> @@ -1082,8 +1083,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>                 if (error)
>                         goto out;
>                 break;
> +       case MADV_F_COLLAPSE_LIGHT:
>         case MADV_COLLAPSE:
> -               return madvise_collapse(vma, prev, start, end);
> +               return madvise_collapse(vma, prev, start, end, behavior);
>         }
>
>         anon_name = anon_vma_name(vma);
> @@ -1178,6 +1180,7 @@ madvise_behavior_valid(int behavior)
>         case MADV_HUGEPAGE:
>         case MADV_NOHUGEPAGE:
>         case MADV_COLLAPSE:
> +       case MADV_F_COLLAPSE_LIGHT:
>  #endif
>         case MADV_DONTDUMP:
>         case MADV_DODUMP:
> @@ -1194,6 +1197,17 @@ madvise_behavior_valid(int behavior)
>         }
>  }
>
> +
> +static bool process_madvise_behavior_only(int behavior)
> +{
> +       switch (behavior) {
> +       case MADV_F_COLLAPSE_LIGHT:
> +               return true;
> +       default:
> +               return false;
> +       }
> +}
> +
>  static bool process_madvise_behavior_valid(int behavior)
>  {
>         switch (behavior) {
> @@ -1201,6 +1215,7 @@ static bool process_madvise_behavior_valid(int behavior)
>         case MADV_PAGEOUT:
>         case MADV_WILLNEED:
>         case MADV_COLLAPSE:
> +       case MADV_F_COLLAPSE_LIGHT:
>                 return true;
>         default:
>                 return false;
> @@ -1368,6 +1383,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *             transparent huge pages so the existing pages will not be
>   *             coalesced into THP and new pages will not be allocated as THP.
>   *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> + *  MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or
> + *             compaction.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *             from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> @@ -1394,7 +1411,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *  -EBADF  - map exists, but area maps something that isn't a file.
>   *  -EAGAIN - a kernel resource was temporarily unavailable.
>   */
> -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> +int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> +               int behavior, bool is_process_madvise)
>  {
>         unsigned long end;
>         int error;
> @@ -1405,6 +1423,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>         if (!madvise_behavior_valid(behavior))
>                 return -EINVAL;
>
> +       if (!is_process_madvise && process_madvise_behavior_only(behavior))
> +               return -EINVAL;
> +
>         if (!PAGE_ALIGNED(start))
>                 return -EINVAL;
>         len = PAGE_ALIGN(len_in);
> @@ -1448,9 +1469,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>         return error;
>  }
>
> +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> +{
> +       return _do_madvise(mm, start, len_in, behavior, false);
> +}
> +
>  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
> -       return do_madvise(current->mm, start, len_in, behavior);
> +       return _do_madvise(current->mm, start, len_in, behavior, false);
>  }
>
>  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> @@ -1504,8 +1530,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>         total_len = iov_iter_count(&iter);
>
>         while (iov_iter_count(&iter)) {
> -               ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> -                                       iter_iov_len(&iter), behavior);
> +               ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> +                                       iter_iov_len(&iter), behavior, true);
>                 if (ret < 0)
>                         break;
>                 iov_iter_advance(&iter, iter_iov_len(&iter));
> diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..92c67bc755da 100644
> --- a/tools/include/uapi/asm-generic/mman-common.h
> +++ b/tools/include/uapi/asm-generic/mman-common.h
> @@ -78,6 +78,7 @@
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
>  /* compatibility flags */
>  #define MAP_FILE       0
> --
> 2.33.1
>
Michal Hocko Jan. 22, 2024, 1:50 p.m. UTC | #13
On Sat 20-01-24 10:09:32, Lance Yang wrote:
[...]
> Hey Michal,
> 
> Thanks for your suggestion!
> 
> It seems that "the implementation should try, but not too hard" aligns
> well with my desired behavior.

The problem I have with this semantic is that it is really hard to
define and then stick with. Our implementation might change over time
and what somebody considers good ATM might turn into "trying harder than
I wanted" later on.

> Non-blocking in general is also a great idea.
> Perhaps in the future, we can add a MADV_F_COLLAPSE_NOBLOCK
> flag for scenarios where latency is extremely critical.

Non-blocking semantics are much easier to define and maintain. The actual
allocation/compaction implementation might change as well over time, but
userspace at least knows that the request will not block waiting for
any required resources.
Lance Yang Jan. 22, 2024, 2:14 p.m. UTC | #14
On Mon, Jan 22, 2024 at 9:50 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sat 20-01-24 10:09:32, Lance Yang wrote:
> [...]
> > Hey Michal,
> >
> > Thanks for your suggestion!
> >
> > It seems that "the implementation should try, but not too hard" aligns
> > well with my desired behavior.
>
> The problem I have with this semantic is that it is really hard to
> define and then stick with. Our implementation might change over time
> and what somebody considers good ATM might turn into "trying harder than
> I wanted" later on.
>
> > Non-blocking in general is also a great idea.
> > Perhaps in the future, we can add a MADV_F_COLLAPSE_NOBLOCK
> > flag for scenarios where latency is extremely critical.
>
> Non-blocking semantics are much easier to define and maintain. The actual
> allocation/compaction implementation might change as well over time, but
> userspace at least knows that the request will not block waiting for
> any required resources.

I appreciate your insights!

It makes sense that a non-blocking semantic is easier to define and maintain,
providing userspace with the certainty that requests won’t be blocked.

Thanks,
Lance

>
> --
> Michal Hocko
> SUSE Labs
Lance Yang Jan. 22, 2024, 2:34 p.m. UTC | #15
Hey Zach,

What do you think about the semantic?

Thanks,
Lance

On Mon, Jan 22, 2024 at 10:14 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> On Mon, Jan 22, 2024 at 9:50 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sat 20-01-24 10:09:32, Lance Yang wrote:
> > [...]
> > > Hey Michal,
> > >
> > > Thanks for your suggestion!
> > >
> > > It seems that "the implementation should try, but not too hard" aligns
> > > well with my desired behavior.
> >
> > The problem I have with this semantic is that it is really hard to
> > define and then stick with. Our implementation might change over time
> > and what somebody considers good ATM might turn into "trying harder than
> > I wanted" later on.
> >
> > > Non-blocking in general is also a great idea.
> > > Perhaps in the future, we can add a MADV_F_COLLAPSE_NOBLOCK
> > > flag for scenarios where latency is extremely critical.
> >
> > Non-blocking semantics are much easier to define and maintain. The actual
> > allocation/compaction implementation might change as well over time, but
> > userspace at least knows that the request will not block waiting for
> > any required resources.
>
> I appreciate your insights!
>
> It makes sense that a non-blocking semantic is easier to define and maintain,
> providing userspace with the certainty that requests won’t be blocked.
>
> Thanks,
> Lance
>
> >
> > --
> > Michal Hocko
> > SUSE Labs
Lance Yang Jan. 26, 2024, 6:16 a.m. UTC | #16
I’d like to add another real use case.

In our company, we deploy applications using offline-online
hybrid deployment. This approach leverages the distinctive
resource utilization patterns of online services, utilizing idle
resources during various time periods by filling them with
offline jobs. This helps reduce the growing cost expenditures
for the enterprise.

Whether for online services or offline jobs, their requirements
for THP can be roughly categorized into three types:

* The first type aims to use huge pages as much as possible
and tolerates unpredictable stalls caused by direct reclaim
and/or compaction.
* The second type attempts to use huge pages but is relatively
latency-sensitive and cannot tolerate unpredictable stalls.
* The third type prefers not to use huge pages at all and is
extremely latency-sensitive.

After careful consideration, we decided to prioritize the
requirements of the first type and modify the THP settings
as follows:

echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo defer >/sys/kernel/mm/transparent_hugepage/defrag
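
Because these knobs are system-wide and can be changed at any time, a
service may want to confirm which mode is active before relying on it.
The following is a minimal sketch, assuming the usual sysfs output
format in which the active value is enclosed in square brackets (for
example "always defer defer+madvise [madvise] never"). The helper names
thp_selected_mode() and thp_read_mode() are made up for illustration and
are not part of any kernel or libc API.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed token of a THP sysfs line into out.
 * Returns 0 on success, -1 if there is no bracketed token
 * or it does not fit into out (including the trailing NUL).
 */
int thp_selected_mode(const char *line, char *out, size_t outlen)
{
	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	size_t len;

	if (!open || !close)
		return -1;
	len = (size_t)(close - open - 1);
	if (len == 0 || len >= outlen)
		return -1;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}

/* Read one THP sysfs file and report its active mode; -1 on failure. */
int thp_read_mode(const char *path, char *out, size_t outlen)
{
	char line[128];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = thp_selected_mode(line, out, outlen);
	fclose(f);
	return ret;
}
```

A caller would pass "/sys/kernel/mm/transparent_hugepage/defrag" to
thp_read_mode() and compare the result against the mode it expects.
Note that the value can still change between the check and the next
allocation, so this only narrows the window rather than closing it.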

With the introduction of MADV_COLLAPSE into the kernel,
it is no longer dependent on any sysfs setting under
/sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE
offers the potential for fine-grained synchronous control over
the huge page allocation mechanism, marking a significant
enhancement for THP.

If the kernel supports a more relaxed (opportunistic)
MADV_COLLAPSE, we will modify the THP settings as follows:

echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag

Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag)
to address the requirements of the second type.
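
For reference, the call shape for collapsing a range of the caller's own
memory through process_madvise(2) could look roughly like the sketch
below. The relaxed flag discussed here is only a proposal, so the sketch
simply passes a flags argument through; to my knowledge kernels without
such support reject anything other than 0. The function name
collapse_own_range() is hypothetical, and the fallback definitions
(MADV_COLLAPSE as 25, matching the headers quoted in this thread, and
the syscall numbers 434 and 440 used on x86-64 and asm-generic
architectures) are assumptions that only apply when the installed
headers lack them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumption: value from the quoted uapi headers */
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* assumption: x86-64 / asm-generic number */
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440	/* assumption: x86-64 / asm-generic number */
#endif

/*
 * Ask the kernel to collapse [addr, addr + len) of the calling process.
 * Returns 0 on success or a negative errno value.
 */
int collapse_own_range(void *addr, size_t len, unsigned int flags)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };
	int pidfd = (int)syscall(SYS_pidfd_open, (long)getpid(), 0L);
	long ret;
	int err;

	if (pidfd < 0)
		return -errno;

	/* process_madvise() returns the number of bytes advised, or -1. */
	ret = syscall(SYS_process_madvise, (long)pidfd, &iov, 1UL,
		      (long)MADV_COLLAPSE, (unsigned long)flags);
	err = ret < 0 ? -errno : 0;	/* save errno before close() */
	close(pidfd);
	return err;
}
```

Depending on kernel version and privileges, the call can fail with,
for instance, ENOSYS, EPERM or EINVAL, so a latency-sensitive service
would treat any negative return as "no huge page this time" and carry
on with base pages.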

Why don't we favor madvise(MADV_COLLAPSE) for the first type
of requirements?
The main reason is that these requirements are typically for offline
jobs in the Hadoop ecosystem, such as MapReduce and Spark,
which run primarily on the JVM. IIRC, the JVM currently does not
support madvise(MADV_COLLAPSE). The second type of
requirements is all for our in-house developed online services.
For us, integrating a more relaxed (opportunistic)
MADV_COLLAPSE into our online services is relatively
straightforward.

By introducing various flags to MADV_COLLAPSE, we can offer
multiple synchronous allocation strategies for applications. This
fine-grained control may be more suitable for cloud-native
environments than the widespread settings under
/sys/kernel/mm/transparent_hugepage in sysfs.

Thanks for your time!
Lance

On Sun, Jan 21, 2024 at 11:12 AM Lance Yang <ioworker0@gmail.com> wrote:
>
> Hello Everyone,
>
> For applications actively utilizing THP, the defrag mode may
> not be a very user-friendly design. Here are the reasons:
> 1. Before marking the address space with
>     MADV_HUGEPAGE, it is necessary to check if
>     the current configuration of the defrag mode aligns with
>     their preferences.
> 2. Once the defrag mode configuration changes, these
>     applications may face the risk of unpredictable stalls.
>
> THP is an important feature of the Linux kernel that can
> significantly enhance memory access performance.
> However, due to the lack of fine-grained control over
> the huge page allocation strategy, many applications
> default to not using huge pages and even recommend that
> users disable THP. This situation is regrettable.
>
> With the introduction of MADV_COLLAPSE into the kernel,
> it is not affected by the defrag mode.
> MADV_COLLAPSE offers the potential for
> fine-grained synchronous control over the huge page
> allocation mechanism, marking a significant enhancement
> for THP.
>
> By adding flags to MADV_COLLAPSE, different
> synchronous allocation strategies can be provided to
> applications. This can instill confidence in them, allowing
> them to reconsider using THP and allocate huge pages
> according to their desired synchronous allocation strategy,
> without worrying about the defrag mode configuration.
>
> BR,
> Lance
>
>
> On Thu, Jan 18, 2024 at 8:03 PM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> >
> > Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
> > has CAP_SYS_ADMIN or is requesting the collapse of its own memory.
> >
> > The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
> > it avoids direct reclaim and/or compaction, quickly failing on allocation
> > errors.
> >
> > This change enables a more flexible and efficient usage of memory collapse
> > operations, providing additional control to userspace applications for
> > system-wide THP optimization.
> >
> > Semantics
> >
> > This call is independent of the system-wide THP sysfs settings, but will
> > fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
> > multiple VMAs, the semantics of the collapse over each VMA is independent
> > from the others.  This implies a hugepage cannot cross a VMA boundary.  If
> > collapse of a given hugepage-aligned/sized region fails, the operation may
> > continue to attempt collapsing the remainder of memory specified.
> >
> > The memory ranges provided must be page-aligned, but are not required to
> > be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
> > start/end of the range will be clamped to the first/last hugepage-aligned
> > address covered by said range.  The memory ranges must span at least one
> > hugepage-sized region.
> >
> > All non-resident pages covered by the range will first be
> > swapped/faulted-in, before being internally copied onto a freshly
> > allocated hugepage.  Unmapped pages will have their data directly
> > initialized to 0 in the new hugepage.  However, for every eligible
> > hugepage aligned/sized region to-be collapsed, at least one page must
> > currently be backed by memory (a PMD covering the address range must
> > already exist).
> >
> > Allocation for the new hugepage will not enter direct reclaim and/or
> > compaction, quickly failing if allocation fails. When the system has
> > multiple NUMA nodes, the hugepage will be allocated from the node providing
> > the most native pages. This operation operates on the current state of the
> > specified process and makes no persistent changes or guarantees on how pages
> > will be mapped, constructed, or faulted in the future.
> >
> > Use Cases
> >
> > An immediate user of this new functionality is the Go runtime heap allocator
> > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > newly allocated chunk through mmap() or a reused chunk released by
> > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > respectively. However, both approaches resulted in performance issues; for
> > both scenarios, there could be entries into direct reclaim and/or compaction,
> > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
> >
> > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > [4] https://github.com/golang/go/issues/63334
> >
> > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
> >
> > Signed-off-by: Lance Yang <ioworker0@gmail.com>
> > Suggested-by: Zach O'Keefe <zokeefe@google.com>
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > ---
> > V1 -> V2: Treat process_madvise(MADV_F_COLLAPSE_LIGHT) as the lighter-weight alternative
> >         to madvise(MADV_COLLAPSE)
> >
> >  arch/alpha/include/uapi/asm/mman.h           |  1 +
> >  arch/mips/include/uapi/asm/mman.h            |  1 +
> >  arch/parisc/include/uapi/asm/mman.h          |  1 +
> >  arch/xtensa/include/uapi/asm/mman.h          |  1 +
> >  include/linux/huge_mm.h                      |  5 +--
> >  include/uapi/asm-generic/mman-common.h       |  1 +
> >  mm/khugepaged.c                              | 15 ++++++--
> >  mm/madvise.c                                 | 36 +++++++++++++++++---
> >  tools/include/uapi/asm-generic/mman-common.h |  1 +
> >  9 files changed, 52 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 763929e814e9..22f23ca04f1a 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -77,6 +77,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index c6e1fc77c996..acec0b643e9c 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -104,6 +104,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index 68c44f99bc93..812029c98cd7 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -71,6 +71,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  #define MADV_HWPOISON     100          /* poison a page for testing */
> >  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index 1ff0c858544f..52ef463dd5b6 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -112,6 +112,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 5adb86af35fc..075fdb5d481a 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> >                      int advice);
> >  int madvise_collapse(struct vm_area_struct *vma,
> >                      struct vm_area_struct **prev,
> > -                    unsigned long start, unsigned long end);
> > +                    unsigned long start, unsigned long end, int behavior);
> >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> >                            unsigned long end, long adjust_next);
> >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> >
> >  static inline int madvise_collapse(struct vm_area_struct *vma,
> >                                    struct vm_area_struct **prev,
> > -                                  unsigned long start, unsigned long end)
> > +                                  unsigned long start, unsigned long end,
> > +                                  int behavior)
> >  {
> >         return -EINVAL;
> >  }
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..92c67bc755da 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -78,6 +78,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 2b219acb528e..2840051c0ae2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -97,6 +97,8 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
> >  struct collapse_control {
> >         bool is_khugepaged;
> >
> > +       int behavior;
> > +
> >         /* Num pages scanned per node */
> >         u32 node_load[MAX_NUMNODES];
> >
> > @@ -1058,10 +1060,16 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> >                               struct collapse_control *cc)
> >  {
> > -       gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> > -                    GFP_TRANSHUGE);
> >         int node = hpage_collapse_find_target_node(cc);
> >         struct folio *folio;
> > +       gfp_t gfp;
> > +
> > +       if (cc->is_khugepaged)
> > +               gfp = alloc_hugepage_khugepaged_gfpmask();
> > +       else
> > +               gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ?
> > +                              GFP_TRANSHUGE_LIGHT :
> > +                              GFP_TRANSHUGE);
> >
> >         if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
> >                 *hpage = NULL;
> > @@ -2697,7 +2705,7 @@ static int madvise_collapse_errno(enum scan_result r)
> >  }
> >
> >  int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > -                    unsigned long start, unsigned long end)
> > +                    unsigned long start, unsigned long end, int behavior)
> >  {
> >         struct collapse_control *cc;
> >         struct mm_struct *mm = vma->vm_mm;
> > @@ -2718,6 +2726,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >         if (!cc)
> >                 return -ENOMEM;
> >         cc->is_khugepaged = false;
> > +       cc->behavior = behavior;
> >
> >         mmgrab(mm);
> >         lru_add_drain_all();
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..9c40226505aa 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
> >         case MADV_POPULATE_READ:
> >         case MADV_POPULATE_WRITE:
> >         case MADV_COLLAPSE:
> > +       case MADV_F_COLLAPSE_LIGHT:
> >                 return 0;
> >         default:
> >                 /* be safe, default to 1. list exceptions explicitly */
> > @@ -1082,8 +1083,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >                 if (error)
> >                         goto out;
> >                 break;
> > +       case MADV_F_COLLAPSE_LIGHT:
> >         case MADV_COLLAPSE:
> > -               return madvise_collapse(vma, prev, start, end);
> > +               return madvise_collapse(vma, prev, start, end, behavior);
> >         }
> >
> >         anon_name = anon_vma_name(vma);
> > @@ -1178,6 +1180,7 @@ madvise_behavior_valid(int behavior)
> >         case MADV_HUGEPAGE:
> >         case MADV_NOHUGEPAGE:
> >         case MADV_COLLAPSE:
> > +       case MADV_F_COLLAPSE_LIGHT:
> >  #endif
> >         case MADV_DONTDUMP:
> >         case MADV_DODUMP:
> > @@ -1194,6 +1197,17 @@ madvise_behavior_valid(int behavior)
> >         }
> >  }
> >
> > +
> > +static bool process_madvise_behavior_only(int behavior)
> > +{
> > +       switch (behavior) {
> > +       case MADV_F_COLLAPSE_LIGHT:
> > +               return true;
> > +       default:
> > +               return false;
> > +       }
> > +}
> > +
> >  static bool process_madvise_behavior_valid(int behavior)
> >  {
> >         switch (behavior) {
> > @@ -1201,6 +1215,7 @@ static bool process_madvise_behavior_valid(int behavior)
> >         case MADV_PAGEOUT:
> >         case MADV_WILLNEED:
> >         case MADV_COLLAPSE:
> > +       case MADV_F_COLLAPSE_LIGHT:
> >                 return true;
> >         default:
> >                 return false;
> > @@ -1368,6 +1383,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   *             transparent huge pages so the existing pages will not be
> >   *             coalesced into THP and new pages will not be allocated as THP.
> >   *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> > + *  MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or
> > + *             compaction.
> >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> >   *             from being included in its core dump.
> >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > @@ -1394,7 +1411,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   *  -EBADF  - map exists, but area maps something that isn't a file.
> >   *  -EAGAIN - a kernel resource was temporarily unavailable.
> >   */
> > -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> > +int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> > +               int behavior, bool is_process_madvise)
> >  {
> >         unsigned long end;
> >         int error;
> > @@ -1405,6 +1423,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> >         if (!madvise_behavior_valid(behavior))
> >                 return -EINVAL;
> >
> > +       if (!is_process_madvise && process_madvise_behavior_only(behavior))
> > +               return -EINVAL;
> > +
> >         if (!PAGE_ALIGNED(start))
> >                 return -EINVAL;
> >         len = PAGE_ALIGN(len_in);
> > @@ -1448,9 +1469,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> >         return error;
> >  }
> >
> > +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> > +{
> > +       return _do_madvise(mm, start, len_in, behavior, false);
> > +}
> > +
> >  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  {
> > -       return do_madvise(current->mm, start, len_in, behavior);
> > +       return _do_madvise(current->mm, start, len_in, behavior, false);
> >  }
> >
> >  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > @@ -1504,8 +1530,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> >         total_len = iov_iter_count(&iter);
> >
> >         while (iov_iter_count(&iter)) {
> > -               ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> > -                                       iter_iov_len(&iter), behavior);
> > +               ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> > +                                       iter_iov_len(&iter), behavior, true);
> >                 if (ret < 0)
> >                         break;
> >                 iov_iter_advance(&iter, iter_iov_len(&iter));
> > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..92c67bc755da 100644
> > --- a/tools/include/uapi/asm-generic/mman-common.h
> > +++ b/tools/include/uapi/asm-generic/mman-common.h
> > @@ -78,6 +78,7 @@
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> > --
> > 2.33.1
> >
Lance Yang Jan. 26, 2024, 10:15 a.m. UTC | #17
I would like to correct the information provided in my previous
email and also provide some additional information.

On Fri, Jan 26, 2024 at 2:16 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> I’d like to add another real use case.
>
> In our company, we deploy applications using offline-online
> hybrid deployment. This approach leverages the distinctive
> resource utilization patterns of online services, utilizing idle
> resources during various time periods by filling them with
> offline jobs. This helps reduce the growing cost expenditures
> for the enterprise.
>
> Whether for online services or offline jobs, their requirements
> for THP can be roughly categorized into three types:
>
> * The first type aims to use huge pages as much as possible
> and tolerates unpredictable stalls caused by direct reclaim
> and/or compaction.
> * The second type attempts to use huge pages but is relatively
> latency-sensitive and cannot tolerate unpredictable stalls.
> * The third type prefers not to use huge pages at all and is
> extremely latency-sensitive.
>
> After careful consideration, we decided to prioritize the
> requirements of the first type and modify the THP settings
> as follows:
>
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo defer >/sys/kernel/mm/transparent_hugepage/defrag
>
> With the introduction of MADV_COLLAPSE into the kernel,
> it is no longer dependent on any sysfs setting under
> /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE
> offers the potential for fine-grained synchronous control over
> the huge page allocation mechanism, marking a significant
> enhancement for THP.
>
> If the kernel supports a more relaxed (opportunistic)
> MADV_COLLAPSE, we will modify the THP settings as follows:
>
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo madvise >/sys/kernel/mm/transparent_hugepage/defrag

The correct THP settings should be:
echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag

>
> Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag)
> to address the requirements of the second type.
>
> Why don't we favor madvise(MADV_COLLAPSE) for the first type
> of requirements?
> The main reason is that these requirements are typically for offline
> jobs in the Hadoop ecosystem, such as MapReduce and Spark,
> which run primarily on the JVM. IIRC, the JVM currently does not
> support madvise(MADV_COLLAPSE). The second type of

To add, there are also some offline jobs that rely on PyTorch for
machine learning model training tasks. IIRC, PyTorch also does
not support madvise(MADV_COLLAPSE).

Thanks,
Lance

> requirements is all for our in-house developed online services.
> For us, integrating a more relaxed (opportunistic)
> MADV_COLLAPSE into our online services is relatively
> straightforward.
>
> By introducing various flags to MADV_COLLAPSE, we can offer
> multiple synchronous allocation strategies for applications. This
> fine-grained control may be more suitable for cloud-native
> environments than the widespread settings under
> /sys/kernel/mm/transparent_hugepage in sysfs.
>
> Thanks for your time!
> Lance
>
> On Sun, Jan 21, 2024 at 11:12 AM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > Hello Everyone,
> >
> > For applications actively utilizing THP, the defrag mode may
> > not be a very user-friendly design. Here are the reasons:
> > 1. Before marking the address space with
> >     MADV_HUGEPAGE, it is necessary to check if
> >     the current configuration of the defrag mode aligns with
> >     their preferences.
> > 2. Once the defrag mode configuration changes, these
> >     applications may face the risk of unpredictable stalls.
> >
> > THP is an important feature of the Linux kernel that can
> > significantly enhance memory access performance.
> > However, due to the lack of fine-grained control over
> > the huge page allocation strategy, many applications
> > default to not using huge pages and even recommend
> > users to disable THP. This situation is regrettable.
> >
> > With the introduction of MADV_COLLAPSE into the kernel,
> > it is not affected by the defrag mode.
> > MADV_COLLAPSE offers the potential for
> > fine-grained synchronous control over the huge page
> > allocation mechanism, marking a significant enhancement
> > for THP.
> >
> > By adding flags to MADV_COLLAPSE, different
> > synchronous allocation strategies can be provided to
> > applications. This can instill confidence in them, allowing
> > them to reconsider using THP and allocate huge pages
> > according to their desired synchronous allocation strategy,
> > without worrying about the defrag mode configuration.
> >
> > BR,
> > Lance
> >
> >
> > On Thu, Jan 18, 2024 at 8:03 PM Lance Yang <ioworker0@gmail.com> wrote:
> > >
> > > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> > >
> > > Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
> > > has CAP_SYS_ADMIN or is requesting the collapse of its own memory.
> > >
> > > The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
> > > it  avoids direct reclaim and/or compaction, quickly failing on allocation
> > > errors.
> > >
> > > This change enables a more flexible and efficient usage of memory collapse
> > > operations, providing additional control to userspace applications for
> > > system-wide THP optimization.
> > >
> > > Semantics
> > >
> > > This call is independent of the system-wide THP sysfs settings, but will
> > > fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
> > > multiple VMAs, the semantics of the collapse over each VMA is independent
> > > from the others.  This implies a hugepage cannot cross a VMA boundary.  If
> > > collapse of a given hugepage-aligned/sized region fails, the operation may
> > > continue to attempt collapsing the remainder of memory specified.
> > >
> > > The memory ranges provided must be page-aligned, but are not required to
> > > be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
> > > start/end of the range will be clamped to the first/last hugepage-aligned
> > > address covered by said range.  The memory ranges must span at least one
> > > hugepage-sized region.
> > >
> > > All non-resident pages covered by the range will first be
> > > swapped/faulted-in, before being internally copied onto a freshly
> > > allocated hugepage.  Unmapped pages will have their data directly
> > > initialized to 0 in the new hugepage.  However, for every eligible
> > > hugepage aligned/sized region to-be collapsed, at least one page must
> > > currently be backed by memory (a PMD covering the address range must
> > > already exist).
> > >
> > > Allocation for the new hugepage will not enter direct reclaim and/or
> > > compaction, quickly failing if allocation fails. When the system has
> > > multiple NUMA nodes, the hugepage will be allocated from the node providing
> > > the most native pages. This operation operates on the current state of the
> > > specified process and makes no persistent changes or guarantees on how pages
> > > will be mapped, constructed, or faulted in the future.
> > >
> > > Use Cases
> > >
> > > An immediate user of this new functionality is the Go runtime heap allocator
> > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > newly allocated chunk through mmap() or a reused chunk released by
> > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > respectively. However, both approaches resulted in performance issues; for
> > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
> > >
> > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > > [4] https://github.com/golang/go/issues/63334
> > >
> > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
> > >
> > > Signed-off-by: Lance Yang <ioworker0@gmail.com>
> > > Suggested-by: Zach O'Keefe <zokeefe@google.com>
> > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > ---
> > > V1 -> V2: Treat process_madvise(MADV_F_COLLAPSE_LIGHT) as the lighter-weight alternative
> > >         to madvise(MADV_COLLAPSE)
> > >
> > >  arch/alpha/include/uapi/asm/mman.h           |  1 +
> > >  arch/mips/include/uapi/asm/mman.h            |  1 +
> > >  arch/parisc/include/uapi/asm/mman.h          |  1 +
> > >  arch/xtensa/include/uapi/asm/mman.h          |  1 +
> > >  include/linux/huge_mm.h                      |  5 +--
> > >  include/uapi/asm-generic/mman-common.h       |  1 +
> > >  mm/khugepaged.c                              | 15 ++++++--
> > >  mm/madvise.c                                 | 36 +++++++++++++++++---
> > >  tools/include/uapi/asm-generic/mman-common.h |  1 +
> > >  9 files changed, 52 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > > index 763929e814e9..22f23ca04f1a 100644
> > > --- a/arch/alpha/include/uapi/asm/mman.h
> > > +++ b/arch/alpha/include/uapi/asm/mman.h
> > > @@ -77,6 +77,7 @@
> > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > >
> > >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > >  /* compatibility flags */
> > >  #define MAP_FILE       0
> > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > > index c6e1fc77c996..acec0b643e9c 100644
> > > --- a/arch/mips/include/uapi/asm/mman.h
> > > +++ b/arch/mips/include/uapi/asm/mman.h
> > > @@ -104,6 +104,7 @@
> > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > >
> > >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > >  /* compatibility flags */
> > >  #define MAP_FILE       0
> > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > > index 68c44f99bc93..812029c98cd7 100644
> > > --- a/arch/parisc/include/uapi/asm/mman.h
> > > +++ b/arch/parisc/include/uapi/asm/mman.h
> > > @@ -71,6 +71,7 @@
> > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > >
> > >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > >  #define MADV_HWPOISON     100          /* poison a page for testing */
> > >  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > > index 1ff0c858544f..52ef463dd5b6 100644
> > > --- a/arch/xtensa/include/uapi/asm/mman.h
> > > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > > @@ -112,6 +112,7 @@
> > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > >
> > >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > >  /* compatibility flags */
> > >  #define MAP_FILE       0
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 5adb86af35fc..075fdb5d481a 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> > >                      int advice);
> > >  int madvise_collapse(struct vm_area_struct *vma,
> > >                      struct vm_area_struct **prev,
> > > -                    unsigned long start, unsigned long end);
> > > +                    unsigned long start, unsigned long end, int behavior);
> > >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> > >                            unsigned long end, long adjust_next);
> > >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > > @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> > >
> > >  static inline int madvise_collapse(struct vm_area_struct *vma,
> > >                                    struct vm_area_struct **prev,
> > > -                                  unsigned long start, unsigned long end)
> > > +                                  unsigned long start, unsigned long end,
> > > +                                  int behavior)
> > >  {
> > >         return -EINVAL;
> > >  }
> > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > index 6ce1f1ceb432..92c67bc755da 100644
> > > --- a/include/uapi/asm-generic/mman-common.h
> > > +++ b/include/uapi/asm-generic/mman-common.h
> > > @@ -78,6 +78,7 @@
> > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > >
> > >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > >  /* compatibility flags */
> > >  #define MAP_FILE       0
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 2b219acb528e..2840051c0ae2 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -97,6 +97,8 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
> > >  struct collapse_control {
> > >         bool is_khugepaged;
> > >
> > > +       int behavior;
> > > +
> > >         /* Num pages scanned per node */
> > >         u32 node_load[MAX_NUMNODES];
> > >
> > > @@ -1058,10 +1060,16 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> > >  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> > >                               struct collapse_control *cc)
> > >  {
> > > -       gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> > > -                    GFP_TRANSHUGE);
> > >         int node = hpage_collapse_find_target_node(cc);
> > >         struct folio *folio;
> > > +       gfp_t gfp;
> > > +
> > > +       if (cc->is_khugepaged)
> > > +               gfp = alloc_hugepage_khugepaged_gfpmask();
> > > +       else
> > > +               gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ?
> > > +                              GFP_TRANSHUGE_LIGHT :
> > > +                              GFP_TRANSHUGE);
> > >
> > >         if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
> > >                 *hpage = NULL;
> > > @@ -2697,7 +2705,7 @@ static int madvise_collapse_errno(enum scan_result r)
> > >  }
> > >
> > >  int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > -                    unsigned long start, unsigned long end)
> > > +                    unsigned long start, unsigned long end, int behavior)
> > >  {
> > >         struct collapse_control *cc;
> > >         struct mm_struct *mm = vma->vm_mm;
> > > @@ -2718,6 +2726,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > >         if (!cc)
> > >                 return -ENOMEM;
> > >         cc->is_khugepaged = false;
> > > +       cc->behavior = behavior;
> > >
> > >         mmgrab(mm);
> > >         lru_add_drain_all();
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 912155a94ed5..9c40226505aa 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
> > >         case MADV_POPULATE_READ:
> > >         case MADV_POPULATE_WRITE:
> > >         case MADV_COLLAPSE:
> > > +       case MADV_F_COLLAPSE_LIGHT:
> > >                 return 0;
> > >         default:
> > >                 /* be safe, default to 1. list exceptions explicitly */
> > > @@ -1082,8 +1083,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > >                 if (error)
> > >                         goto out;
> > >                 break;
> > > +       case MADV_F_COLLAPSE_LIGHT:
> > >         case MADV_COLLAPSE:
> > > -               return madvise_collapse(vma, prev, start, end);
> > > +               return madvise_collapse(vma, prev, start, end, behavior);
> > >         }
> > >
> > >         anon_name = anon_vma_name(vma);
> > > @@ -1178,6 +1180,7 @@ madvise_behavior_valid(int behavior)
> > >         case MADV_HUGEPAGE:
> > >         case MADV_NOHUGEPAGE:
> > >         case MADV_COLLAPSE:
> > > +       case MADV_F_COLLAPSE_LIGHT:
> > >  #endif
> > >         case MADV_DONTDUMP:
> > >         case MADV_DODUMP:
> > > @@ -1194,6 +1197,17 @@ madvise_behavior_valid(int behavior)
> > >         }
> > >  }
> > >
> > > +
> > > +static bool process_madvise_behavior_only(int behavior)
> > > +{
> > > +       switch (behavior) {
> > > +       case MADV_F_COLLAPSE_LIGHT:
> > > +               return true;
> > > +       default:
> > > +               return false;
> > > +       }
> > > +}
> > > +
> > >  static bool process_madvise_behavior_valid(int behavior)
> > >  {
> > >         switch (behavior) {
> > > @@ -1201,6 +1215,7 @@ static bool process_madvise_behavior_valid(int behavior)
> > >         case MADV_PAGEOUT:
> > >         case MADV_WILLNEED:
> > >         case MADV_COLLAPSE:
> > > +       case MADV_F_COLLAPSE_LIGHT:
> > >                 return true;
> > >         default:
> > >                 return false;
> > > @@ -1368,6 +1383,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > >   *             transparent huge pages so the existing pages will not be
> > >   *             coalesced into THP and new pages will not be allocated as THP.
> > >   *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> > > + *  MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or
> > > + *             compaction.
> > >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> > >   *             from being included in its core dump.
> > >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > > @@ -1394,7 +1411,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > >   *  -EBADF  - map exists, but area maps something that isn't a file.
> > >   *  -EAGAIN - a kernel resource was temporarily unavailable.
> > >   */
> > > -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> > > +int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> > > +               int behavior, bool is_process_madvise)
> > >  {
> > >         unsigned long end;
> > >         int error;
> > > @@ -1405,6 +1423,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> > >         if (!madvise_behavior_valid(behavior))
> > >                 return -EINVAL;
> > >
> > > +       if (!is_process_madvise && process_madvise_behavior_only(behavior))
> > > +               return -EINVAL;
> > > +
> > >         if (!PAGE_ALIGNED(start))
> > >                 return -EINVAL;
> > >         len = PAGE_ALIGN(len_in);
> > > @@ -1448,9 +1469,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> > >         return error;
> > >  }
> > >
> > > +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> > > +{
> > > +       return _do_madvise(mm, start, len_in, behavior, false);
> > > +}
> > > +
> > >  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > >  {
> > > -       return do_madvise(current->mm, start, len_in, behavior);
> > > +       return _do_madvise(current->mm, start, len_in, behavior, false);
> > >  }
> > >
> > >  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > > @@ -1504,8 +1530,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > >         total_len = iov_iter_count(&iter);
> > >
> > >         while (iov_iter_count(&iter)) {
> > > -               ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> > > -                                       iter_iov_len(&iter), behavior);
> > > +               ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> > > +                                       iter_iov_len(&iter), behavior, true);
> > >                 if (ret < 0)
> > >                         break;
> > >                 iov_iter_advance(&iter, iter_iov_len(&iter));
> > > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> > > index 6ce1f1ceb432..92c67bc755da 100644
> > > --- a/tools/include/uapi/asm-generic/mman-common.h
> > > +++ b/tools/include/uapi/asm-generic/mman-common.h
> > > @@ -78,6 +78,7 @@
> > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > >
> > >  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT  26      /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > >  /* compatibility flags */
> > >  #define MAP_FILE       0
> > > --
> > > 2.33.1
> > >
Lance Yang Jan. 26, 2024, 12:52 p.m. UTC | #18
On Fri, Jan 26, 2024 at 6:15 PM Lance Yang <ioworker0@gmail.com> wrote:
[...]
> > If the kernel supports a more relaxed (opportunistic)
> > MADV_COLLAPSE, we will modify the THP settings as follows:
> >
> > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>
> The correct THP settings should be:
> echo always >/sys/kernel/mm/transparent_hugepage/enabled
> echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>

Apologies for the confusion in my previous email.

The third type of requirements prefers not to use huge pages at all.
The correct THP settings should be:
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
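To confirm which mode is actually in effect after writing these files, note that each file lists every accepted value and puts the active one in square brackets. The following is a minimal sketch in C, assuming the usual single-line format such as `always defer defer+madvise [madvise] never`; `thp_active_mode` and `thp_read_mode` are hypothetical helper names, not kernel or libc APIs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) token of a THP sysfs line into out.
 * Example input: "always defer defer+madvise [madvise] never\n".
 * Returns 0 on success, -1 if no well-formed "[token]" fits in out.
 */
static int thp_active_mode(const char *line, char *out, size_t outsz)
{
    assert(line != NULL && out != NULL);

    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (!open || !close)
        return -1;

    size_t n = (size_t)(close - open - 1);
    if (n == 0 || n >= outsz)
        return -1;

    memcpy(out, open + 1, n);
    out[n] = '\0';
    return 0;
}

/*
 * Read one sysfs file (for example
 * /sys/kernel/mm/transparent_hugepage/defrag) and extract its active mode.
 * Returns 0 on success, -1 if the file cannot be read or parsed.
 */
static int thp_read_mode(const char *path, char *out, size_t outsz)
{
    char line[128];
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *ok = fgets(line, sizeof(line), f);
    fclose(f);
    return ok ? thp_active_mode(line, out, outsz) : -1;
}
```

A deployment script could call `thp_read_mode()` on both the `enabled` and `defrag` files and refuse to start a latency-sensitive service if the result is not the expected `madvise` / `defer+madvise` pair.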

> >
> > Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag)
> > to address the requirements of the second type.
[...]
Zach O'Keefe Jan. 26, 2024, 11:26 p.m. UTC | #19
On Mon, Jan 22, 2024 at 6:35 AM Lance Yang <ioworker0@gmail.com> wrote:
>
> Hey Zach,
>
> What do you think about the semantic?

Hey Lance,

Sorry for the late reply.

I can see both sides of the argument, though I would argue that
"non-blocking" is equally vague in this context. E.g. we'll "block" on
acquiring a number of different locks along the collapse path.

If you really want to talk about not entering direct reclaim /
compaction, then keeping with the /sys/kernel/mm/transparent_hugepage
notion of "defrag" would be better, IMO. I don't feel that strongly
about it though.

But I see you've provided some more use cases in another mail, so let
me pick up my thoughts over there.

Best,
Zach



> Thanks,
> Lance
>
> On Mon, Jan 22, 2024 at 10:14 PM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > On Mon, Jan 22, 2024 at 9:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sat 20-01-24 10:09:32, Lance Yang wrote:
> > > [...]
> > > > Hey Michal,
> > > >
> > > > Thanks for your suggestion!
> > > >
> > > > It seems that the implementation should try but not too hard aligns well
> > > > with my desired behavior.
> > >
> > > The problem I have with this semantic is that it is really hard to
> > > define and then stick with. Our implementation might change over time
> > > and what somebody considers good ATM might turn into "trying harder than
> > > I wanted" later on.
> > >
> > > > Non-blocking in general is also a great idea.
> > > > Perhaps in the future, we can add a MADV_F_COLLAPSE_NOBLOCK
> > > > flag for scenarios where latency is extremely critical.
> > >
> > > Non blocking semantic is much easier to define and maintain. The actual
> > > allocation/compaction implementation might change as well over time but
> > > the userspace at least knows that the request will not block waiting for
> > > any required resources.
> >
> > I appreciate your insights!
> >
> > It makes sense that a non-blocking semantic is easier to define and maintain,
> > providing userspace with the certainty that requests won’t be blocked.
> >
> > Thanks,
> > Lance
> >
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs
Zach O'Keefe Jan. 26, 2024, 11:46 p.m. UTC | #20
> I’d like to add another real use case.
>
> In our company, we deploy applications using offline-online
> hybrid deployment. This approach leverages the distinctive
> resource utilization patterns of online services, utilizing idle
> resources during various time periods by filling them with
> offline jobs. This helps reduce the growing cost expenditures
> for the enterprise.
>
> Whether for online services or offline jobs, their requirements
> for THP can be roughly categorized into three types:
>
> * The first type aims to use huge pages as much as possible
> and tolerates unpredictable stalls caused by direct reclaim
> and/or compaction.
> * The second type attempts to use huge pages but is relatively
> latency-sensitive and cannot tolerate unpredictable stalls.
> * The third type prefers not to use huge pages at all and is
> extremely latency-sensitive.
>
> After careful consideration, we decided to prioritize the
> requirements of the first type and modify the THP settings
> as follows:
>
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo defer >/sys/kernel/mm/transparent_hugepage/defrag
>
> With the introduction of MADV_COLLAPSE into the kernel,
> it is no longer dependent on any sysfs setting under
> /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE
> offers the potential for fine-grained synchronous control over
> the huge page allocation mechanism, marking a significant
> enhancement for THP.
>
> If the kernel supports a more relaxed (opportunistic)
> MADV_COLLAPSE, we will modify the THP settings as follows:
>
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo madvise >/sys/kernel/mm/transparent_hugepage/defrag

[corrected, via 2 previous mails, to:
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag]


> Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag)
> to address the requirements of the second type.
>
> Why don't we favor madvise(MADV_COLLAPSE) for the first type
> of requirements?
> The main reason is that these requirements are typically for offline
> jobs in the Hadoop ecosystem, such as MapReduce and Spark,
> which run primarily on the JVM. [..]

Hey Lance,

Thanks for providing this context, it's very helpful.

Though, couldn't you use enabled=always, defrag=defer+madvise, then
just use prctl(PR_SET_THP_DISABLE) on type-3 workloads to get the
behaviour you want? i.e.

type 1: apply MADV_HUGEPAGE -> sync defrag to get THP
type 2: don't apply MADV_HUGEPAGE -> use THP if available, kick
kswapd+kcompactd otherwise
type 3: use prctl(PR_SET_THP_DISABLE) (or MADV_NOHUGEPAGE) -> no THPs

Or am I missing something? It sounds like a confounding issue is that
these are external workloads that you don't have the ability to modify?
But that would preclude MADV_COLLAPSE (unless you're using
process_madvise()).
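For the type-3 case, a minimal sketch of the prctl() opt-out in C follows. The wrapper names `thp_set_disable` and `thp_get_disable` are hypothetical; the option values are the ones from the Linux UAPI header and are only defined here as a fallback for older libc headers.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Fallback definitions (values from <linux/prctl.h>) for older headers. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/*
 * Set (1) or clear (0) the per-process "THP disable" flag.
 * Returns 0 on success or -errno on failure.
 */
static int thp_set_disable(int disable)
{
    assert(disable == 0 || disable == 1);
    if (prctl(PR_SET_THP_DISABLE, (unsigned long)disable, 0UL, 0UL, 0UL) != 0)
        return -errno;
    return 0;
}

/* Returns the current flag value (0 or 1), or -errno on failure. */
static int thp_get_disable(void)
{
    int r = prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
    return r < 0 ? -errno : r;
}
```

Because the flag is inherited across fork(2) and preserved across execve(2), a small launcher can call `thp_set_disable(1)` and then exec an unmodified workload, which is what makes this usable for binaries that cannot be changed.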

Appreciate the help understanding the use case. I'm not opposed to the
idea in general, but IMO it would be great to have a clear need for it
(and right now, we don't have alignment with the original motivating
use case (Go) w.r.t. their plans).

Thanks,
Zach
Lance Yang Jan. 27, 2024, 8:03 a.m. UTC | #21
Hey Zach,

Thanks for taking the time to look into this!

On Sat, Jan 27, 2024 at 7:47 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> > I’d like to add another real use case.
> >
> > In our company, we deploy applications using offline-online
> > hybrid deployment. This approach leverages the distinctive
> > resource utilization patterns of online services, utilizing idle
> > resources during various time periods by filling them with
> > offline jobs. This helps reduce the growing cost expenditures
> > for the enterprise.
> >
> > Whether for online services or offline jobs, their requirements
> > for THP can be roughly categorized into three types:
> >
> > * The first type aims to use huge pages as much as possible
> > and tolerates unpredictable stalls caused by direct reclaim
> > and/or compaction.
> > * The second type attempts to use huge pages but is relatively
> > latency-sensitive and cannot tolerate unpredictable stalls.
> > * The third type prefers not to use huge pages at all and is
> > extremely latency-sensitive.
> >
> > After careful consideration, we decided to prioritize the
> > requirements of the first type and modify the THP settings
> > as follows:
> >
> > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > echo defer >/sys/kernel/mm/transparent_hugepage/defrag
> >
> > With the introduction of MADV_COLLAPSE into the kernel,
> > it is no longer dependent on any sysfs setting under
> > /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE
> > offers the potential for fine-grained synchronous control over
> > the huge page allocation mechanism, marking a significant
> > enhancement for THP.
> >
> > If the kernel supports a more relaxed (opportunistic)
> > MADV_COLLAPSE, we will modify the THP settings as follows:
> >
> > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>
> [corrected, via 2 previous mails, to: echo madvise
> >/sys/kernel/mm/transparent_hugepage/enabled
> echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag]
>
>
> > Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag)
> > to address the requirements of the second type.
> >
> > Why don't we favor madvise(MADV_COLLAPSE) for the first type
> > of requirements?
> > The main reason is that these requirements are typically for offline
> > jobs in the Hadoop ecosystem, such as MapReduce and Spark,
> > which run primarily on the JVM. [..]
>
> Hey Lance,
>
> Thanks for proving this context, it's very helpful.
>
> Though, couldn't you use enabled=always, defrag=defer+madvise, then
> just use prctl(PR_SET_THP_DISABLE) on type-3 workloads to get the
> behaviour you want? i.e.

prctl(PR_SET_THP_DISABLE) is a good choice that can fully meet
the needs of type-3 workloads.

I might prefer using enabled=madvise, as this would allow
applications to request huge pages selectively through specific
madvise() calls. If we set enabled=always, applications that are
not optimized for huge pages, or that do not benefit from them,
would still get them for all allocations, which could lead to
suboptimal performance.

>
> type 1: apply MADV_HUGEPAGE -> sync defrag to get THP
> type 2: don't apply MADV_HUGEPAGE -> use THP if available, kick
> kswapd+kcompactd otherwise

Sorry, I did not express myself clearly. The type 2 requirements
should be:
type 2: apply MADV_HUGEPAGE with defrag=defer, or use a more
relaxed (opportunistic) MADV_COLLAPSE.
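The first half of that option (MADV_HUGEPAGE under defrag=defer) can be sketched in C as below. This is an illustration under stated assumptions: 2 MiB PMD-sized huge pages, and a hypothetical helper name `map_thp_hinted`. Whether the hint leads to a stall depends on the defrag sysfs setting, which this code does not check.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14   /* asm-generic value */
#endif

/* Assumption: 2 MiB huge pages (x86-64 default); other arches differ. */
#define HPAGE_SIZE ((size_t)2 << 20)

/*
 * Map len bytes of anonymous memory aligned to HPAGE_SIZE and mark the
 * range with MADV_HUGEPAGE. len must be a non-zero multiple of HPAGE_SIZE.
 * Returns NULL if the mapping fails. *advice_err receives 0 or the errno
 * from madvise(); a rejected hint is not treated as fatal.
 */
static void *map_thp_hinted(size_t len, int *advice_err)
{
    assert(advice_err != NULL);
    assert(len != 0 && len % HPAGE_SIZE == 0);

    size_t span = len + HPAGE_SIZE;   /* over-allocate so we can align */
    uint8_t *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t up = ((uintptr_t)raw + HPAGE_SIZE - 1) & ~(uintptr_t)(HPAGE_SIZE - 1);
    uint8_t *aligned = (uint8_t *)up;

    /* Trim the unaligned head and the unused tail of the raw mapping. */
    size_t head = (size_t)(aligned - raw);
    if (head)
        munmap(raw, head);
    size_t tail = span - head - len;
    if (tail)
        munmap(aligned + len, tail);

    *advice_err = madvise(aligned, len, MADV_HUGEPAGE) ? errno : 0;
    return aligned;
}
```

The alignment step matters because a huge page cannot back a range that is not hugepage-aligned; without it, the first and last partial chunks of the mapping would stay on base pages.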

> type 3: use prctl(PR_SET_THP_DISABLE) (or MADV_NOHUGEPAGE) -> no THPs
>
> Or am I missing something? It sounds like a confounding issue is that
> these are external workloads, or you don't have ability to modify? But
> that would preclude MADV_COLLAPSE (unless you're using
> process_madvise()).

Sorry, my previous explanation was unclear. What I meant is that
the requirements of type-1 workloads can be met independently of
any sysfs setting by using madvise(MADV_COLLAPSE).
So why haven't I used it? Because I currently cannot modify the
JVM or PyTorch to make them call madvise(MADV_COLLAPSE).
Therefore, type-1 workloads still rely on the sysfs settings.
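
For an application that can be modified, the in-process call is
small. The sketch below assumes a 2 MiB PMD-sized hugepage (true on
x86-64 with 4 KiB base pages, but not everywhere); hpage_align_up()
and try_collapse() are illustrative names. The fallback value 25 for
MADV_COLLAPSE matches the asm-generic uapi header touched by this
patch.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* Linux 6.1+, asm-generic value */
#endif

/* Assumption: 2 MiB hugepages; other configurations differ. */
#define HPAGE_SIZE ((uintptr_t)2 << 20)

/* Round addr up to the next hugepage boundary. */
static uintptr_t hpage_align_up(uintptr_t addr)
{
	return (addr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
}

/*
 * Synchronously collapse [addr, addr + len) of the caller's own
 * memory. The range must be page-aligned and cover at least one
 * hugepage-aligned, hugepage-sized region.
 * Returns 0 on success, -1 with errno set otherwise.
 */
static int try_collapse(void *addr, size_t len)
{
	return madvise(addr, len, MADV_COLLAPSE);
}
```

A caller would over-allocate with mmap(), align the start with
hpage_align_up(), touch the pages, and then call try_collapse() on the
aligned region.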

>
> Appreciate the help understanding the use case. I'm not opposed to the
> idea in general, but IMO would be great to have a clear need for it

I appreciate your perspective!

Thanks again for your valuable insights and your suggestions!
Lance

> (and right now, we don't currently have alignment with the original
> motivating usecase (Go) in that regard w.r.t their plans).
>
> Thanks,
> Zach
Lance Yang Jan. 27, 2024, 8:06 a.m. UTC | #22
How about MADV_F_COLLAPSE_NODEFRAG?

On Sat, Jan 27, 2024 at 7:27 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Mon, Jan 22, 2024 at 6:35 AM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > Hey Zach,
> >
> > What do you think about the semantic?
>
> Hey Lance,
>
> Sorry for the late reply.
>
> I can see both sides of the argument; though I would argue that
> "non-blocking" is equally vague in this context. E.g. we'll "block" on
> acquiring a number of different locks along the collapse path.
>
> If you really want to talk about not entering direct reclaim /
> compaction, then keeping with the sys/kernel/vm/thp notion of "defrag"
> would be better, IMO. I don't feel that strongly about it though.
>
> But I see you've provided some more use cases in another mail, so let
> me pick up my thoughts over there.
>
> Best,
> Zach
>
>
>
> > Thanks,
> > Lance
> >
> > On Mon, Jan 22, 2024 at 10:14 PM Lance Yang <ioworker0@gmail.com> wrote:
> > >
> > > On Mon, Jan 22, 2024 at 9:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sat 20-01-24 10:09:32, Lance Yang wrote:
> > > > [...]
> > > > > Hey Michal,
> > > > >
> > > > > Thanks for your suggestion!
> > > > >
> > > > > It seems that "the implementation should try, but not too hard"
> > > > > aligns well with my desired behavior.
> > > >
> > > > The problem I have with this semantic is that it is really hard to
> > > > define and then stick with. Our implementation might change over time
> > > > and what somebody considers good ATM might turn into "trying harder than
> > > > I wanted" later on.
> > > >
> > > > > Non-blocking in general is also a great idea.
> > > > > Perhaps in the future, we can add a MADV_F_COLLAPSE_NOBLOCK
> > > > > flag for scenarios where latency is extremely critical.
> > > >
> > > > Non blocking semantic is much easier to define and maintain. The actual
> > > > allocation/compaction implementation might change as well over time but
> > > > the userspace at least knows that the request will not block waiting for
> > > > any required resources.
> > >
> > > I appreciate your insights!
> > >
> > > It makes sense that a non-blocking semantic is easier to define and maintain,
> > > providing userspace with the certainty that requests won’t be blocked.
> > >
> > > Thanks,
> > > Lance
> > >
> > > >
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs

Patch

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..22f23ca04f1a 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -77,6 +77,7 @@ 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index c6e1fc77c996..acec0b643e9c 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -104,6 +104,7 @@ 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..812029c98cd7 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -71,6 +71,7 @@ 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..52ef463dd5b6 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -112,6 +112,7 @@ 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..075fdb5d481a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -303,7 +303,7 @@  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
 int madvise_collapse(struct vm_area_struct *vma,
 		     struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end);
+		     unsigned long start, unsigned long end, int behavior);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -450,7 +450,8 @@  static inline int hugepage_madvise(struct vm_area_struct *vma,
 
 static inline int madvise_collapse(struct vm_area_struct *vma,
 				   struct vm_area_struct **prev,
-				   unsigned long start, unsigned long end)
+				   unsigned long start, unsigned long end,
+				   int behavior)
 {
 	return -EINVAL;
 }
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..92c67bc755da 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -78,6 +78,7 @@ 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2b219acb528e..2840051c0ae2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -97,6 +97,8 @@  static struct kmem_cache *mm_slot_cache __ro_after_init;
 struct collapse_control {
 	bool is_khugepaged;
 
+	int behavior;
+
 	/* Num pages scanned per node */
 	u32 node_load[MAX_NUMNODES];
 
@@ -1058,10 +1060,16 @@  static int __collapse_huge_page_swapin(struct mm_struct *mm,
 static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 			      struct collapse_control *cc)
 {
-	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
-		     GFP_TRANSHUGE);
 	int node = hpage_collapse_find_target_node(cc);
 	struct folio *folio;
+	gfp_t gfp;
+
+	if (cc->is_khugepaged)
+		gfp = alloc_hugepage_khugepaged_gfpmask();
+	else
+		gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ?
+			       GFP_TRANSHUGE_LIGHT :
+			       GFP_TRANSHUGE);
 
 	if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
 		*hpage = NULL;
@@ -2697,7 +2705,7 @@  static int madvise_collapse_errno(enum scan_result r)
 }
 
 int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end)
+		     unsigned long start, unsigned long end, int behavior)
 {
 	struct collapse_control *cc;
 	struct mm_struct *mm = vma->vm_mm;
@@ -2718,6 +2726,7 @@  int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	if (!cc)
 		return -ENOMEM;
 	cc->is_khugepaged = false;
+	cc->behavior = behavior;
 
 	mmgrab(mm);
 	lru_add_drain_all();
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..9c40226505aa 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -60,6 +60,7 @@  static int madvise_need_mmap_write(int behavior)
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
+	case MADV_F_COLLAPSE_LIGHT:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1082,8 +1083,9 @@  static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_F_COLLAPSE_LIGHT:
 	case MADV_COLLAPSE:
-		return madvise_collapse(vma, prev, start, end);
+		return madvise_collapse(vma, prev, start, end, behavior);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1178,6 +1180,7 @@  madvise_behavior_valid(int behavior)
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
 	case MADV_COLLAPSE:
+	case MADV_F_COLLAPSE_LIGHT:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1194,6 +1197,17 @@  madvise_behavior_valid(int behavior)
 	}
 }
 
+
+static bool process_madvise_behavior_only(int behavior)
+{
+	switch (behavior) {
+	case MADV_F_COLLAPSE_LIGHT:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static bool process_madvise_behavior_valid(int behavior)
 {
 	switch (behavior) {
@@ -1201,6 +1215,7 @@  static bool process_madvise_behavior_valid(int behavior)
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
 	case MADV_COLLAPSE:
+	case MADV_F_COLLAPSE_LIGHT:
 		return true;
 	default:
 		return false;
@@ -1368,6 +1383,8 @@  int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
  *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
+ *  MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or
+ *		compaction.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
@@ -1394,7 +1411,8 @@  int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  -EBADF  - map exists, but area maps something that isn't a file.
  *  -EAGAIN - a kernel resource was temporarily unavailable.
  */
-int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
+int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
+		int behavior, bool is_process_madvise)
 {
 	unsigned long end;
 	int error;
@@ -1405,6 +1423,9 @@  int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	if (!madvise_behavior_valid(behavior))
 		return -EINVAL;
 
+	if (!is_process_madvise && process_madvise_behavior_only(behavior))
+		return -EINVAL;
+
 	if (!PAGE_ALIGNED(start))
 		return -EINVAL;
 	len = PAGE_ALIGN(len_in);
@@ -1448,9 +1469,14 @@  int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	return error;
 }
 
+int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
+{
+	return _do_madvise(mm, start, len_in, behavior, false);
+}
+
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	return do_madvise(current->mm, start, len_in, behavior);
+	return _do_madvise(current->mm, start, len_in, behavior, false);
 }
 
 SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
@@ -1504,8 +1530,8 @@  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	total_len = iov_iter_count(&iter);
 
 	while (iov_iter_count(&iter)) {
-		ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
-					iter_iov_len(&iter), behavior);
+		ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
+					iter_iov_len(&iter), behavior, true);
 		if (ret < 0)
 			break;
 		iov_iter_advance(&iter, iter_iov_len(&iter));
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..92c67bc755da 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -78,6 +78,7 @@ 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0