Message ID | 20240118120347.61817-1-ioworker0@gmail.com (mailing list archive) |
---|---|
State | New |
Series | [v2,1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise() |
[CC linux-api] On Thu 18-01-24 20:03:46, Lance Yang wrote: > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1]. > > Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller > has CAP_SYS_ADMIN or is requesting the collapse of its own memory. > > The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but > it avoids direct reclaim and/or compaction, quickly failing on allocation > errors. > > This change enables a more flexible and efficient usage of memory collapse > operations, providing additional control to userspace applications for > system-wide THP optimization. > > Semantics > > This call is independent of the system-wide THP sysfs settings, but will > fail for memory marked VM_NOHUGEPAGE. If the ranges provided span > multiple VMAs, the semantics of the collapse over each VMA is independent > from the others. This implies a hugepage cannot cross a VMA boundary. If > collapse of a given hugepage-aligned/sized region fails, the operation may > continue to attempt collapsing the remainder of memory specified. > > The memory ranges provided must be page-aligned, but are not required to > be hugepage-aligned. If the memory ranges are not hugepage-aligned, the > start/end of the range will be clamped to the first/last hugepage-aligned > address covered by said range. The memory ranges must span at least one > hugepage-sized region. > > All non-resident pages covered by the range will first be > swapped/faulted-in, before being internally copied onto a freshly > allocated hugepage. Unmapped pages will have their data directly > initialized to 0 in the new hugepage. However, for every eligible > hugepage aligned/sized region to-be collapsed, at least one page must > currently be backed by memory (a PMD covering the address range must > already exist). > > Allocation for the new hugepage will not enter direct reclaim and/or > compaction, quickly failing if allocation fails. When the system has > multiple NUMA nodes, the hugepage will be allocated from the node providing > the most native pages. This operation operates on the current state of the > specified process and makes no persistent changes or guarantees on how pages > will be mapped, constructed, or faulted in the future. > > Use Cases > > An immediate user of this new functionality is the Go runtime heap allocator > that manages memory in hugepage-sized chunks. In the past, whether it was a > newly allocated chunk through mmap() or a reused chunk released by > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > respectively. However, both approaches resulted in performance issues; for > both scenarios, there could be entries into direct reclaim and/or compaction, > leading to unpredictable stalls[4]. Now, the allocator can confidently use > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. 
> > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77 > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af > [4] https://github.com/golang/go/issues/63334 > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/ > > Signed-off-by: Lance Yang <ioworker0@gmail.com> > Suggested-by: Zach O'Keefe <zokeefe@google.com> > Suggested-by: David Hildenbrand <david@redhat.com> > --- > V1 -> V2: Treat process_madvise(MADV_F_COLLAPSE_LIGHT) as the lighter-weight alternative > to madvise(MADV_COLLAPSE) > > arch/alpha/include/uapi/asm/mman.h | 1 + > arch/mips/include/uapi/asm/mman.h | 1 + > arch/parisc/include/uapi/asm/mman.h | 1 + > arch/xtensa/include/uapi/asm/mman.h | 1 + > include/linux/huge_mm.h | 5 +-- > include/uapi/asm-generic/mman-common.h | 1 + > mm/khugepaged.c | 15 ++++++-- > mm/madvise.c | 36 +++++++++++++++++--- > tools/include/uapi/asm-generic/mman-common.h | 1 + > 9 files changed, 52 insertions(+), 10 deletions(-) > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h > index 763929e814e9..22f23ca04f1a 100644 > --- a/arch/alpha/include/uapi/asm/mman.h > +++ b/arch/alpha/include/uapi/asm/mman.h > @@ -77,6 +77,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > /* compatibility flags */ > #define MAP_FILE 0 > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h > index c6e1fc77c996..acec0b643e9c 100644 > --- a/arch/mips/include/uapi/asm/mman.h > +++ b/arch/mips/include/uapi/asm/mman.h > @@ -104,6 +104,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > /* compatibility flags */ > #define MAP_FILE 0 > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h > index 68c44f99bc93..812029c98cd7 100644 > --- a/arch/parisc/include/uapi/asm/mman.h > +++ b/arch/parisc/include/uapi/asm/mman.h > @@ -71,6 +71,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > #define MADV_HWPOISON 100 /* poison a page for testing */ > #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */ > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h > index 1ff0c858544f..52ef463dd5b6 100644 > --- a/arch/xtensa/include/uapi/asm/mman.h > +++ b/arch/xtensa/include/uapi/asm/mman.h > @@ -112,6 +112,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > /* compatibility flags */ > #define MAP_FILE 0 > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 5adb86af35fc..075fdb5d481a 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -303,7 +303,7 @@ int 
hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags, > int advice); > int madvise_collapse(struct vm_area_struct *vma, > struct vm_area_struct **prev, > - unsigned long start, unsigned long end); > + unsigned long start, unsigned long end, int behavior); > void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start, > unsigned long end, long adjust_next); > spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma); > @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma, > > static inline int madvise_collapse(struct vm_area_struct *vma, > struct vm_area_struct **prev, > - unsigned long start, unsigned long end) > + unsigned long start, unsigned long end, > + int behavior) > { > return -EINVAL; > } > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > index 6ce1f1ceb432..92c67bc755da 100644 > --- a/include/uapi/asm-generic/mman-common.h > +++ b/include/uapi/asm-generic/mman-common.h > @@ -78,6 +78,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > /* compatibility flags */ > #define MAP_FILE 0 > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 2b219acb528e..2840051c0ae2 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -97,6 +97,8 @@ static struct kmem_cache *mm_slot_cache __ro_after_init; > struct collapse_control { > bool is_khugepaged; > > + int behavior; > + > /* Num pages scanned per node */ > u32 node_load[MAX_NUMNODES]; > > @@ -1058,10 +1060,16 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm, > static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm, > struct collapse_control *cc) > { > - gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() : > - GFP_TRANSHUGE); > int node = hpage_collapse_find_target_node(cc); > struct folio *folio; > + gfp_t gfp; > + > + if (cc->is_khugepaged) > + gfp = alloc_hugepage_khugepaged_gfpmask(); > + else > + gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ? > + GFP_TRANSHUGE_LIGHT : > + GFP_TRANSHUGE); > > if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) { > *hpage = NULL; > @@ -2697,7 +2705,7 @@ static int madvise_collapse_errno(enum scan_result r) > } > > int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, > - unsigned long start, unsigned long end) > + unsigned long start, unsigned long end, int behavior) > { > struct collapse_control *cc; > struct mm_struct *mm = vma->vm_mm; > @@ -2718,6 +2726,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, > if (!cc) > return -ENOMEM; > cc->is_khugepaged = false; > + cc->behavior = behavior; > > mmgrab(mm); > lru_add_drain_all(); > diff --git a/mm/madvise.c b/mm/madvise.c > index 912155a94ed5..9c40226505aa 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior) > case MADV_POPULATE_READ: > case MADV_POPULATE_WRITE: > case MADV_COLLAPSE: > + case MADV_F_COLLAPSE_LIGHT: > return 0; > default: > /* be safe, default to 1. 
list exceptions explicitly */ > @@ -1082,8 +1083,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, > if (error) > goto out; > break; > + case MADV_F_COLLAPSE_LIGHT: > case MADV_COLLAPSE: > - return madvise_collapse(vma, prev, start, end); > + return madvise_collapse(vma, prev, start, end, behavior); > } > > anon_name = anon_vma_name(vma); > @@ -1178,6 +1180,7 @@ madvise_behavior_valid(int behavior) > case MADV_HUGEPAGE: > case MADV_NOHUGEPAGE: > case MADV_COLLAPSE: > + case MADV_F_COLLAPSE_LIGHT: > #endif > case MADV_DONTDUMP: > case MADV_DODUMP: > @@ -1194,6 +1197,17 @@ madvise_behavior_valid(int behavior) > } > } > > + > +static bool process_madvise_behavior_only(int behavior) > +{ > + switch (behavior) { > + case MADV_F_COLLAPSE_LIGHT: > + return true; > + default: > + return false; > + } > +} > + > static bool process_madvise_behavior_valid(int behavior) > { > switch (behavior) { > @@ -1201,6 +1215,7 @@ static bool process_madvise_behavior_valid(int behavior) > case MADV_PAGEOUT: > case MADV_WILLNEED: > case MADV_COLLAPSE: > + case MADV_F_COLLAPSE_LIGHT: > return true; > default: > return false; > @@ -1368,6 +1383,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, > * transparent huge pages so the existing pages will not be > * coalesced into THP and new pages will not be allocated as THP. > * MADV_COLLAPSE - synchronously coalesce pages into new THP. > + * MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or > + * compaction. > * MADV_DONTDUMP - the application wants to prevent pages in the given range > * from being included in its core dump. > * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. > @@ -1394,7 +1411,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, > * -EBADF - map exists, but area maps something that isn't a file. > * -EAGAIN - a kernel resource was temporarily unavailable. 
> */ > -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) > +int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, > + int behavior, bool is_process_madvise) > { > unsigned long end; > int error; > @@ -1405,6 +1423,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh > if (!madvise_behavior_valid(behavior)) > return -EINVAL; > > + if (!is_process_madvise && process_madvise_behavior_only(behavior)) > + return -EINVAL; > + > if (!PAGE_ALIGNED(start)) > return -EINVAL; > len = PAGE_ALIGN(len_in); > @@ -1448,9 +1469,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh > return error; > } > > +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) > +{ > + return _do_madvise(mm, start, len_in, behavior, false); > +} > + > SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > { > - return do_madvise(current->mm, start, len_in, behavior); > + return _do_madvise(current->mm, start, len_in, behavior, false); > } > > SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > @@ -1504,8 +1530,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > total_len = iov_iter_count(&iter); > > while (iov_iter_count(&iter)) { > - ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter), > - iter_iov_len(&iter), behavior); > + ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter), > + iter_iov_len(&iter), behavior, true); > if (ret < 0) > break; > iov_iter_advance(&iter, iter_iov_len(&iter)); > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h > index 6ce1f1ceb432..92c67bc755da 100644 > --- a/tools/include/uapi/asm-generic/mman-common.h > +++ b/tools/include/uapi/asm-generic/mman-common.h > @@ -78,6 +78,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > /* compatibility flags */ > #define MAP_FILE 0 > -- > 2.33.1
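As a concrete illustration of the interface proposed above, the sketch below shows how an allocator might issue the v2 advice value against its own heap through process_madvise(2). The MADV_F_COLLAPSE_LIGHT definition (value 26) comes only from this series and exists nowhere upstream; the raw-syscall plumbing is used because libc may not provide a process_madvise() wrapper, and a libc that exposes SYS_pidfd_open/SYS_process_madvise is assumed.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_F_COLLAPSE_LIGHT
#define MADV_F_COLLAPSE_LIGHT 26	/* value proposed by this series only */
#endif

/* Try a light-weight collapse of [addr, addr + len) in the caller's own mm. */
static int collapse_light_self(void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };
	int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
	ssize_t ret;

	if (pidfd < 0)
		return -1;

	/* raw syscall: libc may not wrap process_madvise() */
	ret = syscall(SYS_process_madvise, pidfd, &iov, 1,
		      MADV_F_COLLAPSE_LIGHT, 0);
	close(pidfd);

	if (ret < 0) {
		/* per the patch, this should fail fast instead of compacting */
		perror("process_madvise(MADV_F_COLLAPSE_LIGHT)");
		return -1;
	}
	return 0;
}
```

Under memory pressure the call is expected to fail quickly rather than enter direct reclaim or compaction, which is the point of the proposed mode; the caller can simply retry later or fall back to base pages.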
On Thu 18-01-24 20:03:46, Lance Yang wrote: [...] before we discuss the semantic, let's focus on the usecase. > Use Cases > > An immediate user of this new functionality is the Go runtime heap allocator > that manages memory in hugepage-sized chunks. In the past, whether it was a > newly allocated chunk through mmap() or a reused chunk released by > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > respectively. However, both approaches resulted in performance issues; for > both scenarios, there could be entries into direct reclaim and/or compaction, > leading to unpredictable stalls[4]. Now, the allocator can confidently use > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. IIUC the primary reason is the cost of the huge page allocation which can be really high if the memory is heavily fragmented and it is called synchronously from the process directly, correct? Can that be worked around by process_madvise and performing the operation from a different context? Are there any other reasons to have a different mode? I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE - e.g. non blocking one to make sure that the caller doesn't really block on resource contention (be it locks or memory availability) because that matches our non-blocking interface in other areas but having a LIGHT operation sounds really vague and the exact semantic would be implementation specific and might change over time. Non-blocking has a clear semantic but it is not really clear whether that is what you really need/want. > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77 > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af > [4] https://github.com/golang/go/issues/63334 > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
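To make the workaround Michal asks about concrete: the existing process_madvise(MADV_COLLAPSE) can already be driven from a separate agent context, so any reclaim/compaction cost is paid outside the latency-sensitive process. A minimal sketch, with target_pid, start and len as placeholders:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value from the headers touched above */
#endif

/* Issue a synchronous collapse of another process's range from an agent. */
static int collapse_from_agent(pid_t target_pid, void *start, size_t len)
{
	struct iovec iov = { .iov_base = start, .iov_len = len };
	int pidfd = syscall(SYS_pidfd_open, target_pid, 0);
	ssize_t ret;

	if (pidfd < 0)
		return -1;

	/* needs sufficient privilege over the target process */
	ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLLAPSE, 0);
	close(pidfd);
	return ret < 0 ? -1 : 0;
}
```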
Dang, forgot to cc linux-api... On Thu 18-01-24 14:40:19, Michal Hocko wrote: > On Thu 18-01-24 20:03:46, Lance Yang wrote: > [...] > > before we discuss the semantic, let's focus on the usecase. > > > Use Cases > > > > An immediate user of this new functionality is the Go runtime heap allocator > > that manages memory in hugepage-sized chunks. In the past, whether it was a > > newly allocated chunk through mmap() or a reused chunk released by > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > > respectively. However, both approaches resulted in performance issues; for > > both scenarios, there could be entries into direct reclaim and/or compaction, > > leading to unpredictable stalls[4]. Now, the allocator can confidently use > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. > > IIUC the primary reason is the cost of the huge page allocation which > can be really high if the memory is heavily fragmented and it is called > synchronously from the process directly, correct? Can that be worked > around by process_madvise and performing the operation from a different > context? Are there any other reasons to have a different mode? > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE - > e.g. non blocking one to make sure that the caller doesn't really block > on resource contention (be it locks or memory availability) because that > matches our non-blocking interface in other areas but having a LIGHT > operation sounds really vague and the exact semantic would be > implementation specific and might change over time. Non-blocking has a > clear semantic but it is not really clear whether that is what you > really need/want. > > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77 > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af > > [4] https://github.com/golang/go/issues/63334 > > > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/ > -- > Michal Hocko > SUSE Labs
On Thu, Jan 18, 2024 at 5:43 AM Michal Hocko <mhocko@suse.com> wrote: > > Dang, forgot to cc linux-api... > > On Thu 18-01-24 14:40:19, Michal Hocko wrote: > > On Thu 18-01-24 20:03:46, Lance Yang wrote: > > [...] > > > > before we discuss the semantic, let's focus on the usecase. > > > > > Use Cases > > > > > > An immediate user of this new functionality is the Go runtime heap allocator > > > that manages memory in hugepage-sized chunks. In the past, whether it was a > > > newly allocated chunk through mmap() or a reused chunk released by > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > > > respectively. However, both approaches resulted in performance issues; for > > > both scenarios, there could be entries into direct reclaim and/or compaction, > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. Aside: The thought was a MADV_F_COLLAPSE_LIGHT _flag_; so it'd be process_madvise(..., MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT) > > IIUC the primary reason is the cost of the huge page allocation which > > can be really high if the memory is heavily fragmented and it is called > > synchronously from the process directly, correct? Can that be worked > > around by process_madvise and performing the operation from a different > > context? Are there any other reasons to have a different mode? > > > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE - > > e.g. non blocking one to make sure that the caller doesn't really block > > on resource contention (be it locks or memory availability) because that > > matches our non-blocking interface in other areas but having a LIGHT > > operation sounds really vague and the exact semantic would be > > implementation specific and might change over time. Non-blocking has a > > clear semantic but it is not really clear whether that is what you > > really need/want. IIUC, usecase from Go is unbounded latency due to sync compaction in a context where the latency is unacceptable. Working w/ them to understand how things can be improved -- it's possible the changes can occur entirely on their side, w/o any additional kernel support. The non-blocking case awkwardly sits between MADV_COLLAPSE today, and khugepaged; esp when common case is that the allocation can probably be satisfied in fast path. The suggestion for something like "LIGHT" was intentionally vague because it could allow for other optimizations / changes down the line, as you point out. I think that might be a win, vs tying to a specific optimization (e.g. like a MADV_F_COLLAPSE_NODEFRAG). But I could be alone on that front, given the design of /sys/kernel/mm/transparent_hugepage. But circling back, I agree w/ you that the first order of business is to iron out a real usecase. As of right now, it's not clear something like this is required or helpful. Thanks, Zach > > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77 > > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a > > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af > > > [4] https://github.com/golang/go/issues/63334 > > > > > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/ > > -- > > Michal Hocko > > SUSE Labs > > -- > Michal Hocko > SUSE Labs
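For reference, the call shape of the flag form Zach describes (the advice stays MADV_COLLAPSE and the lighter-weight behavior rides in process_madvise()'s currently reserved flags argument) might look like the fragment below. The flag bit is purely hypothetical, and current kernels reject any non-zero flags, so this only sketches the proposal.

```c
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif
/* hypothetical flag bit; nothing like this exists upstream today */
#define MADV_F_COLLAPSE_LIGHT	(1U << 0)

static ssize_t collapse_light_flag_form(int pidfd, struct iovec *iov, size_t n)
{
	/*
	 * Current kernels require flags == 0 and would return -EINVAL here;
	 * this only sketches the call shape being discussed.
	 */
	return syscall(SYS_process_madvise, pidfd, iov, n,
		       MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT);
}
```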
On Thu, Jan 18, 2024 at 6:59 AM Zach O'Keefe <zokeefe@google.com> wrote: > > On Thu, Jan 18, 2024 at 5:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > Dang, forgot to cc linux-api... > > > > On Thu 18-01-24 14:40:19, Michal Hocko wrote: > > > On Thu 18-01-24 20:03:46, Lance Yang wrote: > > > [...] > > > > > > before we discuss the semantic, let's focus on the usecase. > > > > > > > Use Cases > > > > > > > > An immediate user of this new functionality is the Go runtime heap allocator > > > > that manages memory in hugepage-sized chunks. In the past, whether it was a > > > > newly allocated chunk through mmap() or a reused chunk released by > > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > > > > respectively. However, both approaches resulted in performance issues; for > > > > both scenarios, there could be entries into direct reclaim and/or compaction, > > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use > > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. > > Aside: The thought was a MADV_F_COLLAPSE_LIGHT _flag_; so it'd be > process_madvise(..., MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT) > > > > IIUC the primary reason is the cost of the huge page allocation which > > > can be really high if the memory is heavily fragmented and it is called > > > synchronously from the process directly, correct? Can that be worked > > > around by process_madvise and performing the operation from a different > > > context? Are there any other reasons to have a different mode? > > > > > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE - > > > e.g. non blocking one to make sure that the caller doesn't really block > > > on resource contention (be it locks or memory availability) because that > > > matches our non-blocking interface in other areas but having a LIGHT > > > operation sounds really vague and the exact semantic would be > > > implementation specific and might change over time. Non-blocking has a > > > clear semantic but it is not really clear whether that is what you > > > really need/want. > > IIUC, usecase from Go is unbounded latency due to sync compaction in a > context where the latency is unacceptable. Working w/ them to > understand how things can be improved -- it's possible the changes can > occur entirely on their side, w/o any additional kernel support. > > The non-blocking case awkwardly sits between MADV_COLLAPSE today, and > khugepaged; esp when common case is that the allocation can probably > be satisfied in fast path. > > The suggestion for something like "LIGHT" was intentionally vague > because it could allow for other optimizations / changes down the > line, as you point out. I think that might be a win, vs tying to a > specific optimization (e.g. like a MADV_F_COLLAPSE_NODEFRAG). But I > could be alone on that front, given the design of > /sys/kernel/mm/transparent_hugepage. Per the description Go marks the address spaces with MADV_HUGEPAGE. It means the application really wants to have huge page back the address space so kernel will try as hard as possible to get huge page. This is the default behavior of MADV_HUGEPAGE. If they don't want to enter direct reclaim, they can configure the defrag mode to "defer", which means no direct reclaim and wakeup kswapd and kcompactd, and rely on khugepaged to install huge page later on. 
But this mode is not supported by khugepaged defrag, so MADV_COLLAPSE may not support it (IIRC MADV_COLLAPSE uses khugepaged defrag mode). Or they can just not call MADV_HUGEPAGE and leave the decision to the users, IIRC Java does so (specifying a flag to indicate use huge page or not by the users). > > But circling back, I agree w/ you that the first order of business is to > iron out a real usecase. As of right now, it's not clear something > like this is required or helpful. > > Thanks, > Zach > > > > > > > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77 > > > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a > > > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af > > > > [4] https://github.com/golang/go/issues/63334 > > > > > > > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/ > > > -- > > > Michal Hocko > > > SUSE Labs > > > > -- > > Michal Hocko > > SUSE Labs
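For completeness, the defrag policy Yang refers to is the global /sys/kernel/mm/transparent_hugepage/defrag knob, whose active value is shown in brackets (e.g. "always defer defer+madvise [madvise] never"). A small helper along these lines lets an application check whether MADV_HUGEPAGE faults may enter direct reclaim/compaction before opting in; it is a sketch, not part of any proposed kernel change.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if the active THP defrag mode (shown in brackets) equals "mode". */
static int thp_defrag_is(const char *mode)
{
	char buf[128], want[64];
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "r");

	if (!f)
		return 0;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return 0;
	}
	fclose(f);

	snprintf(want, sizeof(want), "[%s]", mode);
	return strstr(buf, want) != NULL;
}

/*
 * e.g. thp_defrag_is("defer") -> MADV_HUGEPAGE faults wake kswapd/kcompactd
 * instead of compacting synchronously, as described above.
 */
```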
On Thu, Jan 18, 2024 at 10:59 PM Zach O'Keefe <zokeefe@google.com> wrote: > > On Thu, Jan 18, 2024 at 5:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > Dang, forgot to cc linux-api... > > > > On Thu 18-01-24 14:40:19, Michal Hocko wrote: > > > On Thu 18-01-24 20:03:46, Lance Yang wrote: > > > [...] > > > > > > before we discuss the semantic, let's focus on the usecase. > > > > > > > Use Cases > > > > > > > > An immediate user of this new functionality is the Go runtime heap allocator > > > > that manages memory in hugepage-sized chunks. In the past, whether it was a > > > > newly allocated chunk through mmap() or a reused chunk released by > > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > > > > respectively. However, both approaches resulted in performance issues; for > > > > both scenarios, there could be entries into direct reclaim and/or compaction, > > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use > > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. > > Aside: The thought was a MADV_F_COLLAPSE_LIGHT _flag_; so it'd be > process_madvise(..., MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT) I apologize for the misunderstanding. I will provide the correct implementation in version 3. BR, Lance > > > > IIUC the primary reason is the cost of the huge page allocation which > > > can be really high if the memory is heavily fragmented and it is called > > > synchronously from the process directly, correct? Can that be worked > > > around by process_madvise and performing the operation from a different > > > context? Are there any other reasons to have a different mode? > > > > > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE - > > > e.g. non blocking one to make sure that the caller doesn't really block > > > on resource contention (be it locks or memory availability) because that > > > matches our non-blocking interface in other areas but having a LIGHT > > > operation sounds really vague and the exact semantic would be > > > implementation specific and might change over time. Non-blocking has a > > > clear semantic but it is not really clear whether that is what you > > > really need/want. > > IIUC, usecase from Go is unbounded latency due to sync compaction in a > context where the latency is unacceptable. Working w/ them to > understand how things can be improved -- it's possible the changes can > occur entirely on their side, w/o any additional kernel support. > > The non-blocking case awkwardly sits between MADV_COLLAPSE today, and > khugepaged; esp when common case is that the allocation can probably > be satisfied in fast path. > > The suggestion for something like "LIGHT" was intentionally vague > because it could allow for other optimizations / changes down the > line, as you point out. I think that might be a win, vs tying to a > specific optimization (e.g. like a MADV_F_COLLAPSE_NODEFRAG). But I > could be alone on that front, given the design of > /sys/kernel/mm/transparent_hugepage. > > But circling back, I agree w/ you that the first order of business is to > iron out a real usecase. As of right now, it's not clear something > like this is required or helpful. 
> > Thanks, > Zach > > > > > > > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77 > > > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a > > > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af > > > > [4] https://github.com/golang/go/issues/63334 > > > > > > > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/ > > > -- > > > Michal Hocko > > > SUSE Labs > > > > -- > > Michal Hocko > > SUSE Labs
Hey Michal, Thanks for taking the time to review! On Thu, Jan 18, 2024 at 9:40 PM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 18-01-24 20:03:46, Lance Yang wrote: > [...] > > before we discuss the semantic, let's focus on the usecase. > > > Use Cases > > > > An immediate user of this new functionality is the Go runtime heap allocator > > that manages memory in hugepage-sized chunks. In the past, whether it was a > > newly allocated chunk through mmap() or a reused chunk released by > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > > respectively. However, both approaches resulted in performance issues; for > > both scenarios, there could be entries into direct reclaim and/or compaction, > > leading to unpredictable stalls[4]. Now, the allocator can confidently use > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. > > IIUC the primary reason is the cost of the huge page allocation which > can be really high if the memory is heavily fragmented and it is called > synchronously from the process directly, correct? Can that be worked Yes, that's correct. > around by process_madvise and performing the operation from a different > context? Are there any other reasons to have a different mode? In latency-sensitive scenarios, some applications aim to enhance performance by utilizing huge pages as much as possible. At the same time, in case of allocation failure, they prefer a quick return without triggering direct memory reclamation and compaction. > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE - > e.g. non blocking one to make sure that the caller doesn't really block > on resource contention (be it locks or memory availability) because that > matches our non-blocking interface in other areas but having a LIGHT > operation sounds really vague and the exact semantic would be > implementation specific and might change over time. Non-blocking has a > clear semantic but it is not really clear whether that is what you > really need/want. Could you provide me with some suggestions regarding the naming of a more relaxed (opportunistic) MADV_COLLAPSE? Thanks again for your review and your suggestion! Lance > > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77 > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af > > [4] https://github.com/golang/go/issues/63334 > > > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/ > -- > Michal Hocko > SUSE Labs
Hey Yang, Thanks for taking the time to review! On Fri, Jan 19, 2024 at 3:00 AM Yang Shi <shy828301@gmail.com> wrote: > > On Thu, Jan 18, 2024 at 6:59 AM Zach O'Keefe <zokeefe@google.com> wrote: > > > > On Thu, Jan 18, 2024 at 5:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > Dang, forgot to cc linux-api... > > > > > > On Thu 18-01-24 14:40:19, Michal Hocko wrote: > > > > On Thu 18-01-24 20:03:46, Lance Yang wrote: > > > > [...] > > > > > > > > before we discuss the semantic, let's focus on the usecase. > > > > > > > > > Use Cases > > > > > > > > > > An immediate user of this new functionality is the Go runtime heap allocator > > > > > that manages memory in hugepage-sized chunks. In the past, whether it was a > > > > > newly allocated chunk through mmap() or a reused chunk released by > > > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > > > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > > > > > respectively. However, both approaches resulted in performance issues; for > > > > > both scenarios, there could be entries into direct reclaim and/or compaction, > > > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use > > > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. > > > > Aside: The thought was a MADV_F_COLLAPSE_LIGHT _flag_; so it'd be > > process_madvise(..., MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT) > > > > > > IIUC the primary reason is the cost of the huge page allocation which > > > > can be really high if the memory is heavily fragmented and it is called > > > > synchronously from the process directly, correct? Can that be worked > > > > around by process_madvise and performing the operation from a different > > > > context? Are there any other reasons to have a different mode? > > > > > > > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE - > > > > e.g. non blocking one to make sure that the caller doesn't really block > > > > on resource contention (be it locks or memory availability) because that > > > > matches our non-blocking interface in other areas but having a LIGHT > > > > operation sounds really vague and the exact semantic would be > > > > implementation specific and might change over time. Non-blocking has a > > > > clear semantic but it is not really clear whether that is what you > > > > really need/want. > > > > IIUC, usecase from Go is unbounded latency due to sync compaction in a > > context where the latency is unacceptable. Working w/ them to > > understand how things can be improved -- it's possible the changes can > > occur entirely on their side, w/o any additional kernel support. > > > > The non-blocking case awkwardly sits between MADV_COLLAPSE today, and > > khugepaged; esp when common case is that the allocation can probably > > be satisfied in fast path. > > > > The suggestion for something like "LIGHT" was intentionally vague > > because it could allow for other optimizations / changes down the > > line, as you point out. I think that might be a win, vs tying to a > > specific optimization (e.g. like a MADV_F_COLLAPSE_NODEFRAG). But I > > could be alone on that front, given the design of > > /sys/kernel/mm/transparent_hugepage. > > Per the description Go marks the address spaces with MADV_HUGEPAGE. It > means the application really wants to have huge page back the address > space so kernel will try as hard as possible to get huge page. This is > the default behavior of MADV_HUGEPAGE. 
If they don't want to enter > direct reclaim, they can configure the defrag mode to "defer", which > means no direct reclaim and wakeup kswapd and kcompactd, and rely on > khugepaged to install huge page later on. But this mode is not > supported by khugepaged defrag, so MADV_COLLAPSE may not support it > (IIRC MADV_COLLAPSE uses khugepaged defrag mode). Or they can just not > call MADV_HUGEPAGE and leave the decision to the users, IIRC Java does > so (specifying a flag to indicate use huge page or not by the users). Thank you for providing insights into the Go use cases with MADV_HUGEPAGE and the configuration options for defrag mode. Considering the limitations with the "defer" mode, it becomes apparent that there is a gap in addressing scenarios where an application desires a lighter-weight alternative to MADV_HUGEPAGE. MADV_F_COLLAPSE_LIGHT aims to fill this gap by providing a more flexible and opportunistic approach, catering to applications in latency-sensitive environments that seek performance improvements with huge pages but prefer to avoid direct reclaim and compaction. This option can serve as a valuable addition for users who want more control over the behavior without the constraints of existing configurations. In the era of cloud-native computing, it's challenging for users to be aware of the THP configurations on all nodes in a cluster, let alone have fine-grained control over them. Simply disabling the use of huge pages due to concerns about potential direct reclamation and compaction may be regrettable, as users are deprived of the opportunity to experiment with large page allocations. However, relying solely on MADV_HUGEPAGE introduces the risk of unpredictable stalls, making it a trade-off that users must carefully consider. By introducing MADV_F_COLLAPSE_LIGHT, we offer users a more flexible and controllable solution in cloud-native environments, allowing them to better balance performance requirements and resource management. This selectively lightweight alternative is designed to provide users with more choices to better meet the diverse needs of different scenarios. Thanks again for your review and your suggestion! Lance > > > > > But circling back, I agree w/ you that the first order of business is to > > iron out a real usecase. As of right now, it's not clear something > > like this is required or helpful. > > > > Thanks, > > Zach > > > > > > > > > > > > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77 > > > > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a > > > > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af > > > > > [4] https://github.com/golang/go/issues/63334 > > > > > > > > > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/ > > > > -- > > > > Michal Hocko > > > > SUSE Labs > > > > > > -- > > > Michal Hocko > > > SUSE Labs
On Fri 19-01-24 10:03:05, Lance Yang wrote: > Hey Michal, > > Thanks for taking the time to review! > > On Thu, Jan 18, 2024 at 9:40 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 18-01-24 20:03:46, Lance Yang wrote: > > [...] > > > > before we discuss the semantic, let's focus on the usecase. > > > > > Use Cases > > > > > > An immediate user of this new functionality is the Go runtime heap allocator > > > that manages memory in hugepage-sized chunks. In the past, whether it was a > > > newly allocated chunk through mmap() or a reused chunk released by > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > > > respectively. However, both approaches resulted in performance issues; for > > > both scenarios, there could be entries into direct reclaim and/or compaction, > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. > > > > IIUC the primary reason is the cost of the huge page allocation which > > can be really high if the memory is heavily fragmented and it is called > > synchronously from the process directly, correct? Can that be worked > > Yes, that's correct. > > > around by process_madvise and performing the operation from a different > > context? Are there any other reasons to have a different mode? > > In latency-sensitive scenarios, some applications aim to enhance performance > by utilizing huge pages as much as possible. At the same time, in case of > allocation failure, they prefer a quick return without triggering direct memory > reclamation and compaction. Could you elaborate some more on why? > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE - > > e.g. non blocking one to make sure that the caller doesn't really block > > on resource contention (be it locks or memory availability) because that > > matches our non-blocking interface in other areas but having a LIGHT > > operation sounds really vague and the exact semantic would be > > implementation specific and might change over time. Non-blocking has a > > clear semantic but it is not really clear whether that is what you > > really need/want. > > Could you provide me with some suggestions regarding the naming of a > more relaxed (opportunistic) MADV_COLLAPSE? Naming is not all that important at this stage (it could be MADV_COLLAPSE_NOBLOCK for example). The primary question is whether non-blocking in general is the desired behavior or the implementation should try but not too hard.
On Fri, Jan 19, 2024 at 8:51 PM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 19-01-24 10:03:05, Lance Yang wrote: > > Hey Michal, > > > > Thanks for taking the time to review! > > > > On Thu, Jan 18, 2024 at 9:40 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 18-01-24 20:03:46, Lance Yang wrote: > > > [...] > > > > > > before we discuss the semantic, let's focus on the usecase. > > > > > > > Use Cases > > > > > > > > An immediate user of this new functionality is the Go runtime heap allocator > > > > that manages memory in hugepage-sized chunks. In the past, whether it was a > > > > newly allocated chunk through mmap() or a reused chunk released by > > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > > > > respectively. However, both approaches resulted in performance issues; for > > > > both scenarios, there could be entries into direct reclaim and/or compaction, > > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use > > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. > > > > > > IIUC the primary reason is the cost of the huge page allocation which > > > can be really high if the memory is heavily fragmented and it is called > > > synchronously from the process directly, correct? Can that be worked > > > > Yes, that's correct. > > > > > around by process_madvise and performing the operation from a different > > > context? Are there any other reasons to have a different mode? > > > > In latency-sensitive scenarios, some applications aim to enhance performance > > by utilizing huge pages as much as possible. At the same time, in case of > > allocation failure, they prefer a quick return without triggering direct memory > > reclamation and compaction. > > Could you elaborate some more on why? Previously, the Go runtime attempted to marks all new memory as MADV_HUGEPAGE on Linux and manages its hugepage eligibility status. Unfortunately, the default THP behavior on most Linux distros is that MADV_HUGEPAGE blocks while the kernel eagerly reclaims and compacts memory to allocate a hugepage. This direct reclaim and compaction is unbounded, and may result in significant application thread stalls. In really bad cases, this can exceed 100s of ms or even seconds. The overall strategy of trying to keep hugepages for the heap unbroken however is sound. So, the Go runtime uses MADV_COLLAPSE as an alternative. See https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af Later, a Google production service experienced a performance regression with the Go runtime's use of MADV_COLLAPSE. For now, the Go runtime has rolled back the usage of MADV_COLLAPSE. See https://github.com/golang/go/issues/63334 If there were a more relaxed (opportunistic) MADV_COLLAPSE, it would avoid direct reclaim and/or compaction and quickly fail on allocation errors. This could be beneficial for similar use cases. BR, Lance > > > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE - > > > e.g. non blocking one to make sure that the caller doesn't really block > > > on resource contention (be it locks or memory availability) because that > > > matches our non-blocking interface in other areas but having a LIGHT > > > operation sounds really vague and the exact semantic would be > > > implementation specific and might change over time. 
Non-blocking has a > > > clear semantic but it is not really clear whether that is what you > > > really need/want. > > > > Could you provide me with some suggestions regarding the naming of a > > more relaxed (opportunistic) MADV_COLLAPSE? > > Naming is not all that important at this stage (it could be > MADV_COLLAPSE_NOBLOCK for example). The primary question is whether > non-blocking in general is the desired behavior or the implementation > should try but not too hard. > > -- > Michal Hocko > SUSE Labs
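For context, the chunk lifecycle described above (eagerly back a hugepage-sized chunk, release it when idle, re-collapse it on reuse) maps roughly onto the following sequence of existing madvise calls. Sizes, alignment and error handling are heavily simplified, and the MADV_COLLAPSE fallback value (25) is taken from the headers touched by this patch.

```c
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value from the headers touched by this patch */
#endif

#define CHUNK_SIZE (2UL * 1024 * 1024)	/* assumes 2MiB PMD-sized THP */

static void *chunk_alloc(void)
{
	void *p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	/* eager THP backing; may stall if the defrag policy allows it [2] */
	madvise(p, CHUNK_SIZE, MADV_HUGEPAGE);
	return p;
}

static void chunk_release(void *p)
{
	/* return the memory; this breaks up the hugepage backing the chunk */
	madvise(p, CHUNK_SIZE, MADV_DONTNEED);
}

static void chunk_reuse(void *p)
{
	/*
	 * Called once the chunk is populated again (collapse needs at least
	 * one resident page in the region). Plain MADV_COLLAPSE may enter
	 * direct reclaim/compaction, which is the source of the stalls in [4].
	 */
	madvise(p, CHUNK_SIZE, MADV_COLLAPSE);
}
```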
On Fri, Jan 19, 2024 at 8:51 PM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 19-01-24 10:03:05, Lance Yang wrote: > > Hey Michal, > > > > Thanks for taking the time to review! > > > > On Thu, Jan 18, 2024 at 9:40 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 18-01-24 20:03:46, Lance Yang wrote: > > > [...] > > > > > > before we discuss the semantic, let's focus on the usecase. > > > > > > > Use Cases > > > > > > > > An immediate user of this new functionality is the Go runtime heap allocator > > > > that manages memory in hugepage-sized chunks. In the past, whether it was a > > > > newly allocated chunk through mmap() or a reused chunk released by > > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > > > > respectively. However, both approaches resulted in performance issues; for > > > > both scenarios, there could be entries into direct reclaim and/or compaction, > > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use > > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. > > > > > > IIUC the primary reason is the cost of the huge page allocation which > > > can be really high if the memory is heavily fragmented and it is called > > > synchronously from the process directly, correct? Can that be worked > > > > Yes, that's correct. > > > > > around by process_madvise and performing the operation from a different > > > context? Are there any other reasons to have a different mode? > > > > In latency-sensitive scenarios, some applications aim to enhance performance > > by utilizing huge pages as much as possible. At the same time, in case of > > allocation failure, they prefer a quick return without triggering direct memory > > reclamation and compaction. > > Could you elaborate some more on why? > > > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE - > > > e.g. non blocking one to make sure that the caller doesn't really block > > > on resource contention (be it locks or memory availability) because that > > > matches our non-blocking interface in other areas but having a LIGHT > > > operation sounds really vague and the exact semantic would be > > > implementation specific and might change over time. Non-blocking has a > > > clear semantic but it is not really clear whether that is what you > > > really need/want. > > > > Could you provide me with some suggestions regarding the naming of a > > more relaxed (opportunistic) MADV_COLLAPSE? > > Naming is not all that important at this stage (it could be > MADV_COLLAPSE_NOBLOCK for example). The primary question is whether > non-blocking in general is the desired behavior or the implementation > should try but not too hard. Hey Michal, Thanks for your suggestion! It seems that the implementation should try but not too hard aligns well with my desired behavior. Non-blocking in general is also a great idea. Perhaps in the future, we can add a MADV_F_COLLAPSE_NOBLOCK flag for scenarios where latency is extremely critical. Thanks again, Lance > > -- > Michal Hocko > SUSE Labs
Hello Everyone, For applications actively utilizing THP, the defrag mode may not be a very user-friendly design. Here are the reasons: 1. Before marking the address space with MADV_HUGEPAGE,it is necessary to check if the current configuration of the defrag mode aligns with their preferences. 2. Once the defrag mode configuration changes, these applications may face the risk of unpredictable stalls. THP is an important feature of the Linux kernel that can significantly enhance memory access performance. However, due to the lack of fine-grained control over the huge page allocation strategy, many applications default to not using huge pages and even recommend users to disable THP. This situation is regrettable. With the introduction of MADV_COLLAPSE into the kernel, it is not affected by the defrag mode. MADV_COLLAPSE offers the potential for fine-grained synchronous control over the huge page allocation mechanism, marking a significant enhancement for THP. By adding flags to MADV_COLLAPSE, different synchronous allocation strategies can be provided to applications. This can instill confidence in them, allowing them to reconsider using THP and allocate huge pages according to their desired synchronous allocation strategy, without worrying about the defrag mode configuration. BR, Lance On Thu, Jan 18, 2024 at 8:03 PM Lance Yang <ioworker0@gmail.com> wrote: > > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1]. > > Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller > has CAP_SYS_ADMIN or is requesting the collapse of its own memory. > > The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but > it avoids direct reclaim and/or compaction, quickly failing on allocation > errors. > > This change enables a more flexible and efficient usage of memory collapse > operations, providing additional control to userspace applications for > system-wide THP optimization. > > Semantics > > This call is independent of the system-wide THP sysfs settings, but will > fail for memory marked VM_NOHUGEPAGE. If the ranges provided span > multiple VMAs, the semantics of the collapse over each VMA is independent > from the others. This implies a hugepage cannot cross a VMA boundary. If > collapse of a given hugepage-aligned/sized region fails, the operation may > continue to attempt collapsing the remainder of memory specified. > > The memory ranges provided must be page-aligned, but are not required to > be hugepage-aligned. If the memory ranges are not hugepage-aligned, the > start/end of the range will be clamped to the first/last hugepage-aligned > address covered by said range. The memory ranges must span at least one > hugepage-sized region. > > All non-resident pages covered by the range will first be > swapped/faulted-in, before being internally copied onto a freshly > allocated hugepage. Unmapped pages will have their data directly > initialized to 0 in the new hugepage. However, for every eligible > hugepage aligned/sized region to-be collapsed, at least one page must > currently be backed by memory (a PMD covering the address range must > already exist). > > Allocation for the new hugepage will not enter direct reclaim and/or > compaction, quickly failing if allocation fails. When the system has > multiple NUMA nodes, the hugepage will be allocated from the node providing > the most native pages. 
This operation operates on the current state of the > specified process and makes no persistent changes or guarantees on how pages > will be mapped, constructed, or faulted in the future. > > Use Cases > > An immediate user of this new functionality is the Go runtime heap allocator > that manages memory in hugepage-sized chunks. In the past, whether it was a > newly allocated chunk through mmap() or a reused chunk released by > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3] > respectively. However, both approaches resulted in performance issues; for > both scenarios, there could be entries into direct reclaim and/or compaction, > leading to unpredictable stalls[4]. Now, the allocator can confidently use > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages. > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77 > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af > [4] https://github.com/golang/go/issues/63334 > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/ > > Signed-off-by: Lance Yang <ioworker0@gmail.com> > Suggested-by: Zach O'Keefe <zokeefe@google.com> > Suggested-by: David Hildenbrand <david@redhat.com> > --- > V1 -> V2: Treat process_madvise(MADV_F_COLLAPSE_LIGHT) as the lighter-weight alternative > to madvise(MADV_COLLAPSE) > > arch/alpha/include/uapi/asm/mman.h | 1 + > arch/mips/include/uapi/asm/mman.h | 1 + > arch/parisc/include/uapi/asm/mman.h | 1 + > arch/xtensa/include/uapi/asm/mman.h | 1 + > include/linux/huge_mm.h | 5 +-- > include/uapi/asm-generic/mman-common.h | 1 + > mm/khugepaged.c | 15 ++++++-- > mm/madvise.c | 36 +++++++++++++++++--- > tools/include/uapi/asm-generic/mman-common.h | 1 + > 9 files changed, 52 insertions(+), 10 deletions(-) > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h > index 763929e814e9..22f23ca04f1a 100644 > --- a/arch/alpha/include/uapi/asm/mman.h > +++ b/arch/alpha/include/uapi/asm/mman.h > @@ -77,6 +77,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > /* compatibility flags */ > #define MAP_FILE 0 > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h > index c6e1fc77c996..acec0b643e9c 100644 > --- a/arch/mips/include/uapi/asm/mman.h > +++ b/arch/mips/include/uapi/asm/mman.h > @@ -104,6 +104,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > /* compatibility flags */ > #define MAP_FILE 0 > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h > index 68c44f99bc93..812029c98cd7 100644 > --- a/arch/parisc/include/uapi/asm/mman.h > +++ b/arch/parisc/include/uapi/asm/mman.h > @@ -71,6 +71,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim 
and/or compaction */ > > #define MADV_HWPOISON 100 /* poison a page for testing */ > #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */ > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h > index 1ff0c858544f..52ef463dd5b6 100644 > --- a/arch/xtensa/include/uapi/asm/mman.h > +++ b/arch/xtensa/include/uapi/asm/mman.h > @@ -112,6 +112,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > /* compatibility flags */ > #define MAP_FILE 0 > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 5adb86af35fc..075fdb5d481a 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags, > int advice); > int madvise_collapse(struct vm_area_struct *vma, > struct vm_area_struct **prev, > - unsigned long start, unsigned long end); > + unsigned long start, unsigned long end, int behavior); > void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start, > unsigned long end, long adjust_next); > spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma); > @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma, > > static inline int madvise_collapse(struct vm_area_struct *vma, > struct vm_area_struct **prev, > - unsigned long start, unsigned long end) > + unsigned long start, unsigned long end, > + int behavior) > { > return -EINVAL; > } > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > index 6ce1f1ceb432..92c67bc755da 100644 > --- a/include/uapi/asm-generic/mman-common.h > +++ b/include/uapi/asm-generic/mman-common.h > @@ -78,6 +78,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > /* compatibility flags */ > #define MAP_FILE 0 > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 2b219acb528e..2840051c0ae2 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -97,6 +97,8 @@ static struct kmem_cache *mm_slot_cache __ro_after_init; > struct collapse_control { > bool is_khugepaged; > > + int behavior; > + > /* Num pages scanned per node */ > u32 node_load[MAX_NUMNODES]; > > @@ -1058,10 +1060,16 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm, > static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm, > struct collapse_control *cc) > { > - gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() : > - GFP_TRANSHUGE); > int node = hpage_collapse_find_target_node(cc); > struct folio *folio; > + gfp_t gfp; > + > + if (cc->is_khugepaged) > + gfp = alloc_hugepage_khugepaged_gfpmask(); > + else > + gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ? 
> + GFP_TRANSHUGE_LIGHT : > + GFP_TRANSHUGE); > > if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) { > *hpage = NULL; > @@ -2697,7 +2705,7 @@ static int madvise_collapse_errno(enum scan_result r) > } > > int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, > - unsigned long start, unsigned long end) > + unsigned long start, unsigned long end, int behavior) > { > struct collapse_control *cc; > struct mm_struct *mm = vma->vm_mm; > @@ -2718,6 +2726,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, > if (!cc) > return -ENOMEM; > cc->is_khugepaged = false; > + cc->behavior = behavior; > > mmgrab(mm); > lru_add_drain_all(); > diff --git a/mm/madvise.c b/mm/madvise.c > index 912155a94ed5..9c40226505aa 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior) > case MADV_POPULATE_READ: > case MADV_POPULATE_WRITE: > case MADV_COLLAPSE: > + case MADV_F_COLLAPSE_LIGHT: > return 0; > default: > /* be safe, default to 1. list exceptions explicitly */ > @@ -1082,8 +1083,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, > if (error) > goto out; > break; > + case MADV_F_COLLAPSE_LIGHT: > case MADV_COLLAPSE: > - return madvise_collapse(vma, prev, start, end); > + return madvise_collapse(vma, prev, start, end, behavior); > } > > anon_name = anon_vma_name(vma); > @@ -1178,6 +1180,7 @@ madvise_behavior_valid(int behavior) > case MADV_HUGEPAGE: > case MADV_NOHUGEPAGE: > case MADV_COLLAPSE: > + case MADV_F_COLLAPSE_LIGHT: > #endif > case MADV_DONTDUMP: > case MADV_DODUMP: > @@ -1194,6 +1197,17 @@ madvise_behavior_valid(int behavior) > } > } > > + > +static bool process_madvise_behavior_only(int behavior) > +{ > + switch (behavior) { > + case MADV_F_COLLAPSE_LIGHT: > + return true; > + default: > + return false; > + } > +} > + > static bool process_madvise_behavior_valid(int behavior) > { > switch (behavior) { > @@ -1201,6 +1215,7 @@ static bool process_madvise_behavior_valid(int behavior) > case MADV_PAGEOUT: > case MADV_WILLNEED: > case MADV_COLLAPSE: > + case MADV_F_COLLAPSE_LIGHT: > return true; > default: > return false; > @@ -1368,6 +1383,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, > * transparent huge pages so the existing pages will not be > * coalesced into THP and new pages will not be allocated as THP. > * MADV_COLLAPSE - synchronously coalesce pages into new THP. > + * MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or > + * compaction. > * MADV_DONTDUMP - the application wants to prevent pages in the given range > * from being included in its core dump. > * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. > @@ -1394,7 +1411,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, > * -EBADF - map exists, but area maps something that isn't a file. > * -EAGAIN - a kernel resource was temporarily unavailable. 
> */ > -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) > +int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, > + int behavior, bool is_process_madvise) > { > unsigned long end; > int error; > @@ -1405,6 +1423,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh > if (!madvise_behavior_valid(behavior)) > return -EINVAL; > > + if (!is_process_madvise && process_madvise_behavior_only(behavior)) > + return -EINVAL; > + > if (!PAGE_ALIGNED(start)) > return -EINVAL; > len = PAGE_ALIGN(len_in); > @@ -1448,9 +1469,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh > return error; > } > > +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) > +{ > + return _do_madvise(mm, start, len_in, behavior, false); > +} > + > SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > { > - return do_madvise(current->mm, start, len_in, behavior); > + return _do_madvise(current->mm, start, len_in, behavior, false); > } > > SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > @@ -1504,8 +1530,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > total_len = iov_iter_count(&iter); > > while (iov_iter_count(&iter)) { > - ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter), > - iter_iov_len(&iter), behavior); > + ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter), > + iter_iov_len(&iter), behavior, true); > if (ret < 0) > break; > iov_iter_advance(&iter, iter_iov_len(&iter)); > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h > index 6ce1f1ceb432..92c67bc755da 100644 > --- a/tools/include/uapi/asm-generic/mman-common.h > +++ b/tools/include/uapi/asm-generic/mman-common.h > @@ -78,6 +78,7 @@ > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */ > > /* compatibility flags */ > #define MAP_FILE 0 > -- > 2.33.1 >
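For reference, a minimal userspace sketch of how a runtime could request a light collapse of its own heap through the proposed interface. This is illustrative only: MADV_F_COLLAPSE_LIGHT is not in mainline (the value 26 is taken from the patch above), raw syscall(2) wrappers are used because libc does not know the new advice, and it assumes kernel headers new enough to define SYS_pidfd_open and SYS_process_madvise.

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_F_COLLAPSE_LIGHT
#define MADV_F_COLLAPSE_LIGHT 26        /* value proposed by this patch */
#endif

/* Best-effort collapse of one range of the caller's own address space. */
static int collapse_light_self(void *addr, size_t len)
{
        struct iovec iov = { .iov_base = addr, .iov_len = len };
        int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
        long ret;

        if (pidfd < 0)
                return -1;

        /* process_madvise(pidfd, iovec, vlen, advice, flags); flags must be 0 */
        ret = syscall(SYS_process_madvise, pidfd, &iov, 1,
                      MADV_F_COLLAPSE_LIGHT, 0);
        close(pidfd);

        /* A failure here just means "no cheap hugepage right now". */
        return ret < 0 ? -1 : 0;
}

On an unpatched kernel the call fails with EINVAL; with the patch applied, a failed light collapse is expected to return quickly rather than stalling in reclaim/compaction, which is the point of the flag.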
On Sat 20-01-24 10:09:32, Lance Yang wrote: [...] > Hey Michal, > > Thanks for your suggestion! > > It seems that the implementation should try but not too hard aligns well > with my desired behavior. The problem I have with this semantic is that it is really hard to define and then stick with. Our implementation might change over time and what somebody considers good ATM might turn into "trying harder than I wanted" later on. > Non-blocking in general is also a great idea. > Perhaps in the future, we can add a MADV_F_COLLAPSE_NOBLOCK > flag for scenarios where latency is extremely critical. A non-blocking semantic is much easier to define and maintain. The actual allocation/compaction implementation might change as well over time, but userspace at least knows that the request will not block waiting for any required resources.
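To make the reclaim/compaction distinction concrete: with this patch, a light collapse allocates its hugepage with GFP_TRANSHUGE_LIGHT instead of GFP_TRANSHUGE. Paraphrasing include/linux/gfp_types.h from the kernels this patch targets, the two masks differ only in the reclaim bits:

/* Paraphrased from include/linux/gfp_types.h */
#define GFP_TRANSHUGE_LIGHT     ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                                  __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
#define GFP_TRANSHUGE           (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)

/* where __GFP_RECLAIM == __GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM */

Because ~__GFP_RECLAIM also clears __GFP_KSWAPD_RECLAIM, the light variant neither reclaims/compacts synchronously nor wakes kswapd/kcompactd, which is at least a well-defined starting point for the non-blocking semantic discussed here (locking in the collapse path aside, as comes up later in the thread).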
On Mon, Jan 22, 2024 at 9:50 PM Michal Hocko <mhocko@suse.com> wrote: > > On Sat 20-01-24 10:09:32, Lance Yang wrote: > [...] > > Hey Michal, > > > > Thanks for your suggestion! > > > > It seems that the implementation should try but not too hard aligns well > > with my desired behavior. > > The problem I have with this semantic is that it is really hard to > define and then stick with. Our implementation might change over time > and what somebody considers good ATM might turn int "trying harder than > I wanted" later on. > > > Non-blocking in general is also a great idea. > > Perhaps in the future, we can add a MADV_F_COLLAPSE_NOBLOCK > > flag for scenarios where latency is extremely critical. > > Non blocking semantic is much easier to define and maintain. The actual > allocation/compaction implementation might change as well over time but > the userspace at least knows that the request will not block waiting for > any required resources. I appreciate your insights! It makes sense that a non-blocking semantic is easier to define and maintain, providing userspace with the certainty that requests won’t be blocked. Thanks, Lance > > -- > Michal Hocko > SUSE Labs
Hey Zach, What do you think about the semantic? Thanks, Lance On Mon, Jan 22, 2024 at 10:14 PM Lance Yang <ioworker0@gmail.com> wrote: > > On Mon, Jan 22, 2024 at 9:50 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Sat 20-01-24 10:09:32, Lance Yang wrote: > > [...] > > > Hey Michal, > > > > > > Thanks for your suggestion! > > > > > > It seems that the implementation should try but not too hard aligns well > > > with my desired behavior. > > > > The problem I have with this semantic is that it is really hard to > > define and then stick with. Our implementation might change over time > > and what somebody considers good ATM might turn int "trying harder than > > I wanted" later on. > > > > > Non-blocking in general is also a great idea. > > > Perhaps in the future, we can add a MADV_F_COLLAPSE_NOBLOCK > > > flag for scenarios where latency is extremely critical. > > > > Non blocking semantic is much easier to define and maintain. The actual > > allocation/compaction implementation might change as well over time but > > the userspace at least knows that the request will not block waiting for > > any required resources. > > I appreciate your insights! > > It makes sense that a non-blocking semantic is easier to define and maintain, > providing userspace with the certainty that requests won’t be blocked. > > Thanks, > Lance > > > > > -- > > Michal Hocko > > SUSE Labs
I’d like to add another real use case. In our company, we deploy applications using offline-online hybrid deployment. This approach leverages the distinctive resource utilization patterns of online services, utilizing idle resources during various time periods by filling them with offline jobs. This helps reduce the growing cost expenditures for the enterprise. Whether for online services or offline jobs, their requirements for THP can be roughly categorized into three types: * The first type aims to use huge pages as much as possible and tolerates unpredictable stalls caused by direct reclaim and/or compaction. * The second type attempts to use huge pages but is relatively latency-sensitive and cannot tolerate unpredictable stalls. * The third type prefers not to use huge pages at all and is extremely latency-sensitive. After careful consideration, we decided to prioritize the requirements of the first type and modify the THP settings as follows: echo madvise >/sys/kernel/mm/transparent_hugepage/enabled echo defer >/sys/kernel/mm/transparent_hugepage/defrag With the introduction of MADV_COLLAPSE into the kernel, it is no longer dependent on any sysfs setting under /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE offers the potential for fine-grained synchronous control over the huge page allocation mechanism, marking a significant enhancement for THP. If the kernel supports a more relaxed (opportunistic) MADV_COLLAPSE, we will modify the THP settings as follows: echo madvise >/sys/kernel/mm/transparent_hugepage/enabled echo madvise >/sys/kernel/mm/transparent_hugepage/defrag Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag) to address the requirements of the second type. Why don't we favor madvise(MADV_COLLAPSE) for the first type of requirements? The main reason is that these requirements are typically for offline jobs in the Hadoop ecosystem, such as MapReduce and Spark, which run primarily on the JVM. IIRC, the JVM currently does not support madvise(MADV_COLLAPSE). The second type of requirements is all for our in-house developed online services. For us, integrating a more relaxed (opportunistic) MADV_COLLAPSE into our online services is relatively straightforward. By introducing various flags to MADV_COLLAPSE, we can offer multiple synchronous allocation strategies for applications. This fine-grained control may be more suitable for cloud-native environments than the widespread settings under /sys/kernel/mm/transparent_hugepage in sysfs. Thanks for your time! Lance On Sun, Jan 21, 2024 at 11:12 AM Lance Yang <ioworker0@gmail.com> wrote: > > Hello Everyone, > > For applications actively utilizing THP, the defrag mode may > not be a very user-friendly design. Here are the reasons: > 1. Before marking the address space with > MADV_HUGEPAGE,it is necessary to check if > the current configuration of the defrag mode aligns with > their preferences. > 2. Once the defrag mode configuration changes, these > applications may face the risk of unpredictable stalls. > > THP is an important feature of the Linux kernel that can > significantly enhance memory access performance. > However, due to the lack of fine-grained control over > the huge page allocation strategy, many applications > default to not using huge pages and even recommend > users to disable THP. This situation is regrettable. > > With the introduction of MADV_COLLAPSE into the kernel, > it is not affected by the defrag mode. 
> MADV_COLLAPSE offers the potential for > fine-grained synchronous control over the huge page > allocation mechanism, marking a significant enhancement > for THP. > > By adding flags to MADV_COLLAPSE, different > synchronous allocation strategies can be provided to > applications. This can instill confidence in them, allowing > them to reconsider using THP and allocate huge pages > according to their desired synchronous allocation strategy, > without worrying about the defrag mode configuration. > > BR, > Lance > > > On Thu, Jan 18, 2024 at 8:03 PM Lance Yang <ioworker0@gmail.com> wrote: > > [...]
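For readers less familiar with the sysfs knobs referenced above, the defrag mode only changes the gfp mask used at THP fault time. A rough sketch of the mapping (paraphrasing vma_thp_gfp_mask() in mm/huge_memory.c from kernels of this era; the enum is a local stand-in for the transparent_hugepage_flags bit tests, not the literal source):

enum thp_defrag { DEFRAG_ALWAYS, DEFRAG_DEFER, DEFRAG_DEFER_MADVISE,
                  DEFRAG_MADVISE, DEFRAG_NEVER };

/* What each /sys/kernel/mm/transparent_hugepage/defrag mode means at fault time. */
static gfp_t thp_fault_gfp(enum thp_defrag mode, bool vma_madvised)
{
        switch (mode) {
        case DEFRAG_ALWAYS:             /* sync reclaim/compaction for everyone */
                return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
        case DEFRAG_DEFER:              /* never stall, just kick kswapd/kcompactd */
                return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
        case DEFRAG_DEFER_MADVISE:      /* stall only for MADV_HUGEPAGE VMAs */
                return GFP_TRANSHUGE_LIGHT | (vma_madvised ?
                        __GFP_DIRECT_RECLAIM : __GFP_KSWAPD_RECLAIM);
        case DEFRAG_MADVISE:            /* stall only for MADV_HUGEPAGE VMAs, no kswapd kick otherwise */
                return GFP_TRANSHUGE_LIGHT | (vma_madvised ?
                        __GFP_DIRECT_RECLAIM : 0);
        case DEFRAG_NEVER:
        default:
                return GFP_TRANSHUGE_LIGHT;
        }
}

This mapping is what the three workload types above are being matched against: "always" and the madvised side of "defer+madvise" can stall, while "defer" and the relaxed/opportunistic collapse proposed here never enter direct reclaim or compaction.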
I would like to correct the information provided in my previous email and also provide some additional information. On Fri, Jan 26, 2024 at 2:16 PM Lance Yang <ioworker0@gmail.com> wrote: > > I’d like to add another real use case. > > In our company, we deploy applications using offline-online > hybrid deployment. This approach leverages the distinctive > resource utilization patterns of online services, utilizing idle > resources during various time periods by filling them with > offline jobs. This helps reduce the growing cost expenditures > for the enterprise. > > Whether for online services or offline jobs, their requirements > for THP can be roughly categorized into three types: > > * The first type aims to use huge pages as much as possible > and tolerates unpredictable stalls caused by direct reclaim > and/or compaction. > * The second type attempts to use huge pages but is relatively > latency-sensitive and cannot tolerate unpredictable stalls. > * The third type prefers not to use huge pages at all and is > extremely latency-sensitive. > > After careful consideration, we decided to prioritize the > requirements of the first type and modify the THP settings > as follows: > > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled > echo defer >/sys/kernel/mm/transparent_hugepage/defrag > > With the introduction of MADV_COLLAPSE into the kernel, > it is no longer dependent on any sysfs setting under > /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE > offers the potential for fine-grained synchronous control over > the huge page allocation mechanism, marking a significant > enhancement for THP. > > If the kernel supports a more relaxed (opportunistic) > MADV_COLLAPSE, we will modify the THP settings as follows: > > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled > echo madvise >/sys/kernel/mm/transparent_hugepage/defrag The correct THP settings should be: echo always >/sys/kernel/mm/transparent_hugepage/enabled echo madvise >/sys/kernel/mm/transparent_hugepage/defrag > > Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag) > to address the requirements of the second type. > > Why don't we favor madvise(MADV_COLLAPSE) for the first type > of requirements? > The main reason is that these requirements are typically for offline > jobs in the Hadoop ecosystem, such as MapReduce and Spark, > which run primarily on the JVM. IIRC, the JVM currently does not > support madvise(MADV_COLLAPSE). The second type of To add, there are also some offline jobs that rely on PyTorch for machine learning model training tasks. IIRC, PyTorch also does not support madvise(MADV_COLLAPSE). Thanks, Lance > requirements is all for our in-house developed online services. > For us, integrating a more relaxed (opportunistic) > MADV_COLLAPSE into our online services is relatively > straightforward. > > By introducing various flags to MADV_COLLAPSE, we can offer > multiple synchronous allocation strategies for applications. This > fine-grained control may be more suitable for cloud-native > environments than the widespread settings under > /sys/kernel/mm/transparent_hugepage in sysfs. > > Thanks for your time! > Lance > > On Sun, Jan 21, 2024 at 11:12 AM Lance Yang <ioworker0@gmail.com> wrote: > > > > Hello Everyone, > > > > For applications actively utilizing THP, the defrag mode may > > not be a very user-friendly design. Here are the reasons: > > 1. 
Before marking the address space with > > MADV_HUGEPAGE,it is necessary to check if > > the current configuration of the defrag mode aligns with > > their preferences. > > 2. Once the defrag mode configuration changes, these > > applications may face the risk of unpredictable stalls. > > [...]
On Fri, Jan 26, 2024 at 6:15 PM Lance Yang <ioworker0@gmail.com> wrote: [...] > > If the kernel supports a more relaxed (opportunistic) > > MADV_COLLAPSE, we will modify the THP settings as follows: > > > > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled > > echo madvise >/sys/kernel/mm/transparent_hugepage/defrag > > The correct THP settings should be: > echo always >/sys/kernel/mm/transparent_hugepage/enabled > echo madvise >/sys/kernel/mm/transparent_hugepage/defrag > Apologies for the confusion in my previous email. The third type of requirement prefers not to use huge pages at all, so the correct THP settings should be: echo madvise >/sys/kernel/mm/transparent_hugepage/enabled echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag > > > > Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag) > > to address the requirements of the second type. [...]
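A small clarification of the process_madvise(MADV_COLLAPSE, xx_relaxed_flag) shorthand used in the quoted text, since the syscall already carries a flags argument; the xx_relaxed_flag name is purely hypothetical:

/*
 * Current shape of the syscall (see process_madvise(2)):
 *
 *   ssize_t process_madvise(int pidfd, const struct iovec *iovec,
 *                           size_t vlen, int advice, unsigned int flags);
 *
 * @flags is reserved and must be 0 today, so a relaxed/opportunistic
 * collapse could be expressed either as a new advice value (the route this
 * patch takes with MADV_F_COLLAPSE_LIGHT) or as a newly defined bit in
 * @flags. Neither form exists in mainline yet.
 */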
On Mon, Jan 22, 2024 at 6:35 AM Lance Yang <ioworker0@gmail.com> wrote: > > Hey Zach, > > What do you think about the semantic? Hey Lance, Sorry for the late reply. I can see both sides of the argument; though I would argue that "non-blocking" is equally vague in this context. E.g. we'll "block" on acquiring a number of different locks along the collapse path. If you really want to talk about not entering direct reclaim / compaction, then keeping with the /sys/kernel/mm/transparent_hugepage notion of "defrag" would be better, IMO. I don't feel that strongly about it though. But I see you've provided some more use cases in another mail, so let me pick up my thoughts over there. Best, Zach > Thanks, > Lance > > On Mon, Jan 22, 2024 at 10:14 PM Lance Yang <ioworker0@gmail.com> wrote: > > > > On Mon, Jan 22, 2024 at 9:50 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Sat 20-01-24 10:09:32, Lance Yang wrote: > > > [...] > > > > Hey Michal, > > > > > > > > Thanks for your suggestion! > > > > > > > > It seems that the implementation should try but not too hard aligns well > > > > with my desired behavior. > > > > > > The problem I have with this semantic is that it is really hard to > > > define and then stick with. Our implementation might change over time > > > and what somebody considers good ATM might turn int "trying harder than > > > I wanted" later on. > > > > > > > Non-blocking in general is also a great idea. > > > > Perhaps in the future, we can add a MADV_F_COLLAPSE_NOBLOCK > > > > flag for scenarios where latency is extremely critical. > > > > > > Non blocking semantic is much easier to define and maintain. The actual > > > allocation/compaction implementation might change as well over time but > > > the userspace at least knows that the request will not block waiting for > > > any required resources. > > > > I appreciate your insights! > > > > It makes sense that a non-blocking semantic is easier to define and maintain, > > providing userspace with the certainty that requests won’t be blocked. > > > > Thanks, > > Lance > > > > > > > > -- > > > Michal Hocko > > > SUSE Labs
> I’d like to add another real use case. > > In our company, we deploy applications using offline-online > hybrid deployment. This approach leverages the distinctive > resource utilization patterns of online services, utilizing idle > resources during various time periods by filling them with > offline jobs. This helps reduce the growing cost expenditures > for the enterprise. > > Whether for online services or offline jobs, their requirements > for THP can be roughly categorized into three types: > > * The first type aims to use huge pages as much as possible > and tolerates unpredictable stalls caused by direct reclaim > and/or compaction. > * The second type attempts to use huge pages but is relatively > latency-sensitive and cannot tolerate unpredictable stalls. > * The third type prefers not to use huge pages at all and is > extremely latency-sensitive. > > After careful consideration, we decided to prioritize the > requirements of the first type and modify the THP settings > as follows: > > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled > echo defer >/sys/kernel/mm/transparent_hugepage/defrag > > With the introduction of MADV_COLLAPSE into the kernel, > it is no longer dependent on any sysfs setting under > /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE > offers the potential for fine-grained synchronous control over > the huge page allocation mechanism, marking a significant > enhancement for THP. > > If the kernel supports a more relaxed (opportunistic) > MADV_COLLAPSE, we will modify the THP settings as follows: > > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled > echo madvise >/sys/kernel/mm/transparent_hugepage/defrag [corrected, via 2 previous mails, to: echo madvise >/sys/kernel/mm/transparent_hugepage/enabled and echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag] > Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag) > to address the requirements of the second type. > > Why don't we favor madvise(MADV_COLLAPSE) for the first type > of requirements? > The main reason is that these requirements are typically for offline > jobs in the Hadoop ecosystem, such as MapReduce and Spark, > which run primarily on the JVM. [..] Hey Lance, Thanks for providing this context; it's very helpful. Though, couldn't you use enabled=always, defrag=defer+madvise, then just use prctl(PR_SET_THP_DISABLE) on type-3 workloads to get the behaviour you want? i.e. type 1: apply MADV_HUGEPAGE -> sync defrag to get THP type 2: don't apply MADV_HUGEPAGE -> use THP if available, kick kswapd+kcompactd otherwise type 3: use prctl(PR_SET_THP_DISABLE) (or MADV_NOHUGEPAGE) -> no THPs Or am I missing something? It sounds like a confounding issue is that these are external workloads, or ones you don't have the ability to modify? But that would preclude MADV_COLLAPSE (unless you're using process_madvise()). Appreciate the help understanding the use case. I'm not opposed to the idea in general, but IMO it would be great to have a clear need for it (and right now, we don't currently have alignment with the original motivating use case (Go) in that regard w.r.t. their plans). Thanks, Zach
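The two mainline knobs Zach points at for type-3 workloads already compose with an enabled=always policy; a minimal sketch of a wrapper that disables THP for a latency-critical binary before exec'ing it (per prctl(2), PR_SET_THP_DISABLE is inherited by children and preserved across execve, so the workload itself needs no changes; the wrapper is illustrative only):

#define _GNU_SOURCE
#include <sys/prctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 2)
                return 2;

        /* Type 3: no THPs for this process and whatever it execs. */
        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                return 1;

        /*
         * Finer-grained alternative, from inside the workload itself:
         *   madvise(addr, len, MADV_NOHUGEPAGE);
         */
        execvp(argv[1], argv + 1);
        return 127;
}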
Hey Zach, Thanks for taking time to look into this! On Sat, Jan 27, 2024 at 7:47 AM Zach O'Keefe <zokeefe@google.com> wrote: > > > I’d like to add another real use case. > > > > In our company, we deploy applications using offline-online > > hybrid deployment. This approach leverages the distinctive > > resource utilization patterns of online services, utilizing idle > > resources during various time periods by filling them with > > offline jobs. This helps reduce the growing cost expenditures > > for the enterprise. > > > > Whether for online services or offline jobs, their requirements > > for THP can be roughly categorized into three types: > > > > * The first type aims to use huge pages as much as possible > > and tolerates unpredictable stalls caused by direct reclaim > > and/or compaction. > > * The second type attempts to use huge pages but is relatively > > latency-sensitive and cannot tolerate unpredictable stalls. > > * The third type prefers not to use huge pages at all and is > > extremely latency-sensitive. > > > > After careful consideration, we decided to prioritize the > > requirements of the first type and modify the THP settings > > as follows: > > > > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled > > echo defer >/sys/kernel/mm/transparent_hugepage/defrag > > > > With the introduction of MADV_COLLAPSE into the kernel, > > it is no longer dependent on any sysfs setting under > > /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE > > offers the potential for fine-grained synchronous control over > > the huge page allocation mechanism, marking a significant > > enhancement for THP. > > > > If the kernel supports a more relaxed (opportunistic) > > MADV_COLLAPSE, we will modify the THP settings as follows: > > > > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled > > echo madvise >/sys/kernel/mm/transparent_hugepage/defrag > > [corrected, via 2 previous mails, to: echo madvise > >/sys/kernel/mm/transparent_hugepage/enabled > echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag] > > > > Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag) > > to address the requirements of the second type. > > > > Why don't we favor madvise(MADV_COLLAPSE) for the first type > > of requirements? > > The main reason is that these requirements are typically for offline > > jobs in the Hadoop ecosystem, such as MapReduce and Spark, > > which run primarily on the JVM. [..] > > Hey Lance, > > Thanks for proving this context, it's very helpful. > > Though, couldn't you use enabled=always, defrag=defer+madvise, then > just use prctl(PR_SET_THP_DISABLE) on type-3 workloads to get the > behaviour you want? i.e. prctl(PR_SET_THP_DISABLE) is a good choice that can fully meet the needs of type-3 workloads. I might prefer using enabled=madvise, as this would allow applications to implement specific calls to madvise to request huge pages selectively. If we set enabled=always, some applications may not be optimized for or may not benefit from huge pages. In such cases, using huge pages for all allocations could lead to suboptimal performance. > > type 1: apply MADV_HUGEPAGE -> sync defrag to get THP > type 2: don't apply MADV_HUGEPAGE -> use THP if available, kick > kswapd+kcompactd otherwise Sorry, I did not express myself clearly. The type 2 of requirements should be: type 2: apply MADV_HUGEPAGE with defrag=defer, or use a more relaxed (opportunistic) MADV_COLLAPSE. 
> type 3: use prctl(PR_SET_THP_DISABLE) (or MADV_NOHUGEPAGE) -> no THPs
>
> Or am I missing something? It sounds like a confounding issue is that these are external workloads, or ones you don't have the ability to modify? But that would preclude MADV_COLLAPSE (unless you're using process_madvise()).

Sorry, my previous explanation was unclear. What I meant is that the requirements of type-1 workloads can be independent of any sysfs setting and can be addressed using madvise(MADV_COLLAPSE). In this scenario, why haven't I utilized it? The reason is that I currently lack the ability to modify the JVM or PyTorch to make them use madvise(MADV_COLLAPSE). Therefore, the needs of type-1 workloads still rely on the sysfs settings.

> Appreciate the help understanding the use case. I'm not opposed to the idea in general, but IMO it would be great to have a clear need for it

I appreciate your perspective! Thanks again for your valuable insights and suggestions!

Lance

> (and right now, we don't have alignment with the original motivating use case (Go) in that regard w.r.t. their plans).
>
> Thanks,
> Zach
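Zach's "unless you're using process_madvise()" remark is worth spelling out: an external agent can, in principle, collapse memory of an unmodified JVM or PyTorch process using the already-merged MADV_COLLAPSE (Linux 6.1+) via process_madvise(). The sketch below is hypothetical and not part of this series; the target address and length are placeholders that a real agent would derive from /proc/<pid>/maps, the usual process_madvise() permission rules (ptrace-level access to the target plus CAP_SYS_NICE; see process_madvise(2)) apply, and glibc provides no wrappers, hence raw syscall():

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* uapi value, in case libc headers predate it */
#endif

/* Ask the kernel to collapse [addr, addr+len) in the target process. */
static int collapse_range(pid_t pid, void *addr, size_t len)
{
        int pidfd = syscall(SYS_pidfd_open, pid, 0);
        if (pidfd < 0)
                return -1;

        struct iovec iov = { .iov_base = addr, .iov_len = len };
        long ret = syscall(SYS_process_madvise, pidfd, &iov, 1UL,
                           MADV_COLLAPSE, 0UL);
        close(pidfd);
        return ret < 0 ? -1 : 0;
}

int main(int argc, char **argv)
{
        if (argc != 4) {
                fprintf(stderr, "usage: %s <pid> <hex-addr> <len>\n", argv[0]);
                return 1;
        }
        pid_t pid = (pid_t)strtol(argv[1], NULL, 10);
        void *addr = (void *)strtoul(argv[2], NULL, 16);
        size_t len = strtoul(argv[3], NULL, 0);

        if (collapse_range(pid, addr, len)) {
                perror("process_madvise(MADV_COLLAPSE)");
                return 1;
        }
        return 0;
}

Of course, plain MADV_COLLAPSE may still enter direct reclaim/compaction, which is exactly the behaviour the proposed flag is meant to avoid.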
How about MADV_F_COLLAPSE_NODEFRAG?

On Sat, Jan 27, 2024 at 7:27 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Mon, Jan 22, 2024 at 6:35 AM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > Hey Zach,
> >
> > What do you think about the semantics?
>
> Hey Lance,
>
> Sorry for the late reply.
>
> I can see both sides of the argument, though I would argue that "non-blocking" is equally vague in this context; e.g. we'll "block" on acquiring a number of different locks along the collapse path.
>
> If you really want to talk about not entering direct reclaim / compaction, then keeping with the /sys/kernel/mm/transparent_hugepage notion of "defrag" would be better, IMO. I don't feel that strongly about it, though.
>
> But I see you've provided some more use cases in another mail, so let me pick up my thoughts over there.
>
> Best,
> Zach
>
> >
> > Thanks,
> > Lance
> >
> > On Mon, Jan 22, 2024 at 10:14 PM Lance Yang <ioworker0@gmail.com> wrote:
> > >
> > > On Mon, Jan 22, 2024 at 9:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sat 20-01-24 10:09:32, Lance Yang wrote:
> > > > [...]
> > > > > Hey Michal,
> > > > >
> > > > > Thanks for your suggestion!
> > > > >
> > > > > It seems that an implementation that tries, but not too hard, aligns well with my desired behavior.
> > > >
> > > > The problem I have with this semantic is that it is really hard to define and then stick with. Our implementation might change over time, and what somebody considers good ATM might turn into "trying harder than I wanted" later on.
> > > >
> > > > > Non-blocking in general is also a great idea. Perhaps in the future, we can add a MADV_F_COLLAPSE_NOBLOCK flag for scenarios where latency is extremely critical.
> > > >
> > > > A non-blocking semantic is much easier to define and maintain. The actual allocation/compaction implementation might change over time as well, but userspace at least knows that the request will not block waiting for any required resources.
> > >
> > > I appreciate your insights!
> > >
> > > It makes sense that a non-blocking semantic is easier to define and maintain, providing userspace with the certainty that requests won't be blocked.
> > >
> > > Thanks,
> > > Lance
> > >
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs
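For context on the "defrag" framing in this naming debate: the patch below simply picks between two existing GFP masks in alloc_charge_hpage(), and in recent kernels those masks differ only in reclaim behaviour. The excerpt is paraphrased from include/linux/gfp_types.h and should be double-checked against the exact tree in use:

/*
 *   #define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 *                                 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 *   #define GFP_TRANSHUGE       (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
 *
 * Since __GFP_RECLAIM is __GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM, the
 * "light" mask neither enters direct reclaim/compaction nor wakes
 * kswapd/kcompactd, while GFP_TRANSHUGE adds back direct reclaim only.
 */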
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..22f23ca04f1a 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -77,6 +77,7 @@
#define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */

#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */

/* compatibility flags */
#define MAP_FILE 0
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index c6e1fc77c996..acec0b643e9c 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -104,6 +104,7 @@
#define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */

#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */

/* compatibility flags */
#define MAP_FILE 0
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..812029c98cd7 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -71,6 +71,7 @@
#define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */

#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */

#define MADV_HWPOISON 100 /* poison a page for testing */
#define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..52ef463dd5b6 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -112,6 +112,7 @@
#define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */

#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */

/* compatibility flags */
#define MAP_FILE 0
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..075fdb5d481a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
                     int advice);
int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
-                    unsigned long start, unsigned long end);
+                    unsigned long start, unsigned long end, int behavior);
void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
                           unsigned long end, long adjust_next);
spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
static inline int madvise_collapse(struct vm_area_struct *vma,
                                   struct vm_area_struct **prev,
-                                  unsigned long start, unsigned long end)
+                                  unsigned long start, unsigned long end,
+                                  int behavior)
{
        return -EINVAL;
}
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..92c67bc755da 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -78,6 +78,7 @@
#define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */

#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */

/* compatibility flags */
#define MAP_FILE 0
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2b219acb528e..2840051c0ae2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -97,6 +97,8 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;

struct collapse_control {
        bool is_khugepaged;
+       int behavior;
+
        /* Num pages scanned per node */
        u32 node_load[MAX_NUMNODES];
@@ -1058,10 +1060,16 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
                              struct collapse_control *cc)
{
-       gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
-                    GFP_TRANSHUGE);
        int node = hpage_collapse_find_target_node(cc);
        struct folio *folio;
+       gfp_t gfp;
+
+       if (cc->is_khugepaged)
+               gfp = alloc_hugepage_khugepaged_gfpmask();
+       else
+               gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ?
+                      GFP_TRANSHUGE_LIGHT :
+                      GFP_TRANSHUGE);

        if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
                *hpage = NULL;
@@ -2697,7 +2705,7 @@ static int madvise_collapse_errno(enum scan_result r)
}

int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
-                    unsigned long start, unsigned long end)
+                    unsigned long start, unsigned long end, int behavior)
{
        struct collapse_control *cc;
        struct mm_struct *mm = vma->vm_mm;
@@ -2718,6 +2726,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
        if (!cc)
                return -ENOMEM;
        cc->is_khugepaged = false;
+       cc->behavior = behavior;
        mmgrab(mm);
        lru_add_drain_all();
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..9c40226505aa 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
        case MADV_POPULATE_READ:
        case MADV_POPULATE_WRITE:
        case MADV_COLLAPSE:
+       case MADV_F_COLLAPSE_LIGHT:
                return 0;
        default:
                /* be safe, default to 1. list exceptions explicitly */
@@ -1082,8 +1083,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
                if (error)
                        goto out;
                break;
+       case MADV_F_COLLAPSE_LIGHT:
        case MADV_COLLAPSE:
-               return madvise_collapse(vma, prev, start, end);
+               return madvise_collapse(vma, prev, start, end, behavior);
        }

        anon_name = anon_vma_name(vma);
@@ -1178,6 +1180,7 @@ madvise_behavior_valid(int behavior)
        case MADV_HUGEPAGE:
        case MADV_NOHUGEPAGE:
        case MADV_COLLAPSE:
+       case MADV_F_COLLAPSE_LIGHT:
#endif
        case MADV_DONTDUMP:
        case MADV_DODUMP:
@@ -1194,6 +1197,17 @@ madvise_behavior_valid(int behavior)
        }
}

+
+static bool process_madvise_behavior_only(int behavior)
+{
+       switch (behavior) {
+       case MADV_F_COLLAPSE_LIGHT:
+               return true;
+       default:
+               return false;
+       }
+}
+
static bool process_madvise_behavior_valid(int behavior)
{
        switch (behavior) {
@@ -1201,6 +1215,7 @@ static bool process_madvise_behavior_valid(int behavior)
        case MADV_PAGEOUT:
        case MADV_WILLNEED:
        case MADV_COLLAPSE:
+       case MADV_F_COLLAPSE_LIGHT:
                return true;
        default:
                return false;
@@ -1368,6 +1383,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 *  transparent huge pages so the existing pages will not be
 *  coalesced into THP and new pages will not be allocated as THP.
 *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
+ *  MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or
+ *  compaction.
 *  MADV_DONTDUMP - the application wants to prevent pages in the given range
 *  from being included in its core dump.
 *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
@@ -1394,7 +1411,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 *  -EBADF  - map exists, but area maps something that isn't a file.
 *  -EAGAIN - a kernel resource was temporarily unavailable.
 */
-int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
+int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
+               int behavior, bool is_process_madvise)
{
        unsigned long end;
        int error;
@@ -1405,6 +1423,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
        if (!madvise_behavior_valid(behavior))
                return -EINVAL;

+       if (!is_process_madvise && process_madvise_behavior_only(behavior))
+               return -EINVAL;
+
        if (!PAGE_ALIGNED(start))
                return -EINVAL;
        len = PAGE_ALIGN(len_in);
@@ -1448,9 +1469,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
        return error;
}

+int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
+{
+       return _do_madvise(mm, start, len_in, behavior, false);
+}
+
SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
-       return do_madvise(current->mm, start, len_in, behavior);
+       return _do_madvise(current->mm, start, len_in, behavior, false);
}

SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
@@ -1504,8 +1530,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
        total_len = iov_iter_count(&iter);

        while (iov_iter_count(&iter)) {
-               ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
-                                iter_iov_len(&iter), behavior);
+               ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
+                                 iter_iov_len(&iter), behavior, true);
                if (ret < 0)
                        break;
                iov_iter_advance(&iter, iter_iov_len(&iter));
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..92c67bc755da 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -78,6 +78,7 @@
#define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */

#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */

/* compatibility flags */
#define MAP_FILE 0
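To make the proposed semantics concrete, here is a hypothetical userspace sketch of how a runtime might issue the new hint on its own heap, assuming this series is applied as posted (MADV_F_COLLAPSE_LIGHT = 26, accepted only via process_madvise(), gated on CAP_SYS_ADMIN or collapsing one's own memory); none of this is upstream, and glibc has no process_madvise() wrapper, hence raw syscall():

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_F_COLLAPSE_LIGHT
#define MADV_F_COLLAPSE_LIGHT 26        /* value from the posted patch, not upstream */
#endif

int main(void)
{
        size_t len = 2UL << 20; /* one PMD-sized chunk; 2 MiB is an x86-64 assumption */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(buf, 0, len);    /* ensure the range is backed so there is something to collapse */

        int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
        if (pidfd < 0) {
                perror("pidfd_open");
                return 1;
        }

        struct iovec iov = { .iov_base = buf, .iov_len = len };
        /* Best effort: the request fails quickly instead of entering direct
         * reclaim/compaction, and the caller just keeps using 4K pages. */
        if (syscall(SYS_process_madvise, pidfd, &iov, 1UL,
                    MADV_F_COLLAPSE_LIGHT, 0UL) < 0)
                perror("process_madvise(MADV_F_COLLAPSE_LIGHT)");

        close(pidfd);
        munmap(buf, len);
        return 0;
}

If the light collapse fails because a hugepage cannot be allocated without reclaim or compaction, the caller can simply retry later, which is the predictable-latency behaviour the use cases above are asking for.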