Message ID | 20231221065943.2803551-2-shy828301@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [1/2] mm: mmap: no need to call khugepaged_enter_vma() for stack | expand |
On 2023/12/21 14:59, Yang Shi wrote: > From: Yang Shi <yang@os.amperecomputing.com> > > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP > boundaries") incured regression for stress-ng pthread benchmark [1]. > It is because THP get allocated to pthread's stack area much more possible > than before. Pthread's stack area is allocated by mmap without VM_GROWSDOWN > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not. > > The MAP_STACK flag is used to mark the stack area, but it is a no-op on > Linux. Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating > THP for such stack area. > > With this change the stack area looks like: > > fffd18e10000-fffd19610000 rw-p 00000000 00:00 0 > Size: 8192 kB > KernelPageSize: 4 kB > MMUPageSize: 4 kB > Rss: 12 kB > Pss: 12 kB > Pss_Dirty: 12 kB > Shared_Clean: 0 kB > Shared_Dirty: 0 kB > Private_Clean: 0 kB > Private_Dirty: 12 kB > Referenced: 12 kB > Anonymous: 12 kB > KSM: 0 kB > LazyFree: 0 kB > AnonHugePages: 0 kB > ShmemPmdMapped: 0 kB > FilePmdMapped: 0 kB > Shared_Hugetlb: 0 kB > Private_Hugetlb: 0 kB > Swap: 0 kB > SwapPss: 0 kB > Locked: 0 kB > THPeligible: 0 > VmFlags: rd wr mr mw me ac nh > > The "nh" flag is set. > > [1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com/ > > Reported-by: kernel test robot <oliver.sang@intel.com> > Tested-by: Oliver Sang <oliver.sang@intel.com> > Cc: Yin Fengwei <fengwei.yin@intel.com> > Cc: Rik van Riel <riel@surriel.com> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Christopher Lameter <cl@linux.com> > Cc: Huang, Ying <ying.huang@intel.com> > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> > --- > include/linux/mman.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/include/linux/mman.h b/include/linux/mman.h > index 40d94411d492..dc7048824be8 100644 > --- a/include/linux/mman.h > +++ b/include/linux/mman.h > @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags) > return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | > _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) | > _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) | > + _calc_vm_trans(flags, MAP_STACK, VM_NOHUGEPAGE) | > arch_calc_vm_flag_bits(flags); > } >
Thanks Yang, Should this be marked for stable? Given how easily it is for pthreads to allocate hugepages w/o this change, it can easily cause memory bloat on larger systems and/or users with high thread counts. I don't think that will be welcomed, and seems odd that just 6.7 should suffer this. Thanks, Zach On Tue, Jan 9, 2024 at 5:36 PM Yin Fengwei <fengwei.yin@intel.com> wrote: > > > > On 2023/12/21 14:59, Yang Shi wrote: > > From: Yang Shi <yang@os.amperecomputing.com> > > > > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP > > boundaries") incured regression for stress-ng pthread benchmark [1]. > > It is because THP get allocated to pthread's stack area much more possible > > than before. Pthread's stack area is allocated by mmap without VM_GROWSDOWN > > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not. > > > > The MAP_STACK flag is used to mark the stack area, but it is a no-op on > > Linux. Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating > > THP for such stack area. > > > > With this change the stack area looks like: > > > > fffd18e10000-fffd19610000 rw-p 00000000 00:00 0 > > Size: 8192 kB > > KernelPageSize: 4 kB > > MMUPageSize: 4 kB > > Rss: 12 kB > > Pss: 12 kB > > Pss_Dirty: 12 kB > > Shared_Clean: 0 kB > > Shared_Dirty: 0 kB > > Private_Clean: 0 kB > > Private_Dirty: 12 kB > > Referenced: 12 kB > > Anonymous: 12 kB > > KSM: 0 kB > > LazyFree: 0 kB > > AnonHugePages: 0 kB > > ShmemPmdMapped: 0 kB > > FilePmdMapped: 0 kB > > Shared_Hugetlb: 0 kB > > Private_Hugetlb: 0 kB > > Swap: 0 kB > > SwapPss: 0 kB > > Locked: 0 kB > > THPeligible: 0 > > VmFlags: rd wr mr mw me ac nh > > > > The "nh" flag is set. > > > > [1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com/ > > > > Reported-by: kernel test robot <oliver.sang@intel.com> > > Tested-by: Oliver Sang <oliver.sang@intel.com> > > Cc: Yin Fengwei <fengwei.yin@intel.com> > > Cc: Rik van Riel <riel@surriel.com> > > Cc: Matthew Wilcox <willy@infradead.org> > > Cc: Christopher Lameter <cl@linux.com> > > Cc: Huang, Ying <ying.huang@intel.com> > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> > > Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> > > > --- > > include/linux/mman.h | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/include/linux/mman.h b/include/linux/mman.h > > index 40d94411d492..dc7048824be8 100644 > > --- a/include/linux/mman.h > > +++ b/include/linux/mman.h > > @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags) > > return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | > > _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) | > > _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) | > > + _calc_vm_trans(flags, MAP_STACK, VM_NOHUGEPAGE) | > > arch_calc_vm_flag_bits(flags); > > } > > >
On Tue, Jan 16, 2024 at 11:22 AM Zach O'Keefe <zokeefe@google.com> wrote: > > Thanks Yang, > > Should this be marked for stable? Given how easily it is for pthreads > to allocate hugepages w/o this change, it can easily cause memory > bloat on larger systems and/or users with high thread counts. I don't > think that will be welcomed, and seems odd that just 6.7 should suffer > this. Thanks for the suggestion, fine to me. > > Thanks, > Zach > > On Tue, Jan 9, 2024 at 5:36 PM Yin Fengwei <fengwei.yin@intel.com> wrote: > > > > > > > > On 2023/12/21 14:59, Yang Shi wrote: > > > From: Yang Shi <yang@os.amperecomputing.com> > > > > > > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP > > > boundaries") incured regression for stress-ng pthread benchmark [1]. > > > It is because THP get allocated to pthread's stack area much more possible > > > than before. Pthread's stack area is allocated by mmap without VM_GROWSDOWN > > > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not. > > > > > > The MAP_STACK flag is used to mark the stack area, but it is a no-op on > > > Linux. Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating > > > THP for such stack area. > > > > > > With this change the stack area looks like: > > > > > > fffd18e10000-fffd19610000 rw-p 00000000 00:00 0 > > > Size: 8192 kB > > > KernelPageSize: 4 kB > > > MMUPageSize: 4 kB > > > Rss: 12 kB > > > Pss: 12 kB > > > Pss_Dirty: 12 kB > > > Shared_Clean: 0 kB > > > Shared_Dirty: 0 kB > > > Private_Clean: 0 kB > > > Private_Dirty: 12 kB > > > Referenced: 12 kB > > > Anonymous: 12 kB > > > KSM: 0 kB > > > LazyFree: 0 kB > > > AnonHugePages: 0 kB > > > ShmemPmdMapped: 0 kB > > > FilePmdMapped: 0 kB > > > Shared_Hugetlb: 0 kB > > > Private_Hugetlb: 0 kB > > > Swap: 0 kB > > > SwapPss: 0 kB > > > Locked: 0 kB > > > THPeligible: 0 > > > VmFlags: rd wr mr mw me ac nh > > > > > > The "nh" flag is set. > > > > > > [1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com/ > > > > > > Reported-by: kernel test robot <oliver.sang@intel.com> > > > Tested-by: Oliver Sang <oliver.sang@intel.com> > > > Cc: Yin Fengwei <fengwei.yin@intel.com> > > > Cc: Rik van Riel <riel@surriel.com> > > > Cc: Matthew Wilcox <willy@infradead.org> > > > Cc: Christopher Lameter <cl@linux.com> > > > Cc: Huang, Ying <ying.huang@intel.com> > > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> > > > > Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> > > > > > --- > > > include/linux/mman.h | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/include/linux/mman.h b/include/linux/mman.h > > > index 40d94411d492..dc7048824be8 100644 > > > --- a/include/linux/mman.h > > > +++ b/include/linux/mman.h > > > @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags) > > > return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | > > > _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) | > > > _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) | > > > + _calc_vm_trans(flags, MAP_STACK, VM_NOHUGEPAGE) | > > > arch_calc_vm_flag_bits(flags); > > > } > > > > >
On Tue, 16 Jan 2024 12:57:41 -0800 Yang Shi <shy828301@gmail.com> wrote: > > Should this be marked for stable? Given how easily it is for pthreads > > to allocate hugepages w/o this change, it can easily cause memory > > bloat on larger systems and/or users with high thread counts. I don't > > think that will be welcomed, and seems odd that just 6.7 should suffer > > this. > > Thanks for the suggestion, fine to me. > Thanks, added, along with Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
* Yang Shi: > From: Yang Shi <yang@os.amperecomputing.com> > > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP > boundaries") incured regression for stress-ng pthread benchmark [1]. > It is because THP get allocated to pthread's stack area much more possible > than before. Pthread's stack area is allocated by mmap without VM_GROWSDOWN > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not. > > The MAP_STACK flag is used to mark the stack area, but it is a no-op on > Linux. Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating > THP for such stack area. Doesn't this introduce a regression in the other direction, where workloads expect to use a hugepage TLB entry for the stack? It's seems an odd approach to fixing the stress-ng regression. Isn't it very much coding to the benchmark? Thanks, Florian
On Tue, Jan 30, 2024 at 11:53 PM Florian Weimer <fweimer@redhat.com> wrote: > > * Yang Shi: > > > From: Yang Shi <yang@os.amperecomputing.com> > > > > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP > > boundaries") incured regression for stress-ng pthread benchmark [1]. > > It is because THP get allocated to pthread's stack area much more possible > > than before. Pthread's stack area is allocated by mmap without VM_GROWSDOWN > > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not. > > > > The MAP_STACK flag is used to mark the stack area, but it is a no-op on > > Linux. Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating > > THP for such stack area. > > Doesn't this introduce a regression in the other direction, where > workloads expect to use a hugepage TLB entry for the stack? Maybe, it is theoretically possible. But AFAICT, the real life workloads performance usually gets hurt if THP is used for stack. Willy has an example: https://lore.kernel.org/linux-mm/ZYPDwCcAjX+r+g6s@casper.infradead.org/#t And avoiding THP on stack is not new, VM_GROWSDOWN | VM_GROWSUP areas have been applied before, this patch just extends this to MAP_STACK. > > It's seems an odd approach to fixing the stress-ng regression. Isn't it > very much coding to the benchmark? > > Thanks, > Florian >
* Yang Shi: > On Tue, Jan 30, 2024 at 11:53 PM Florian Weimer <fweimer@redhat.com> wrote: >> >> * Yang Shi: >> >> > From: Yang Shi <yang@os.amperecomputing.com> >> > >> > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP >> > boundaries") incured regression for stress-ng pthread benchmark [1]. >> > It is because THP get allocated to pthread's stack area much more possible >> > than before. Pthread's stack area is allocated by mmap without VM_GROWSDOWN >> > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not. >> > >> > The MAP_STACK flag is used to mark the stack area, but it is a no-op on >> > Linux. Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating >> > THP for such stack area. >> >> Doesn't this introduce a regression in the other direction, where >> workloads expect to use a hugepage TLB entry for the stack? > > Maybe, it is theoretically possible. But AFAICT, the real life > workloads performance usually gets hurt if THP is used for stack. > Willy has an example: > > https://lore.kernel.org/linux-mm/ZYPDwCcAjX+r+g6s@casper.infradead.org/#t > > And avoiding THP on stack is not new, VM_GROWSDOWN | VM_GROWSUP areas > have been applied before, this patch just extends this to MAP_STACK. If it's *always* beneficial then we should help it along in glibc as well. We've started to offer a tunable in response to this observation (also paper over in OpenJDK): Make thread stacks not use huge pages <https://bugs.openjdk.org/browse/JDK-8303215> But this is specifically about RSS usage, and not directly about reducing TLB misses etc. Thanks, Florian
On Thu, Feb 1, 2024 at 7:34 AM Florian Weimer <fweimer@redhat.com> wrote: > > * Yang Shi: > > > On Tue, Jan 30, 2024 at 11:53 PM Florian Weimer <fweimer@redhat.com> wrote: > >> > >> * Yang Shi: > >> > >> > From: Yang Shi <yang@os.amperecomputing.com> > >> > > >> > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP > >> > boundaries") incured regression for stress-ng pthread benchmark [1]. > >> > It is because THP get allocated to pthread's stack area much more possible > >> > than before. Pthread's stack area is allocated by mmap without VM_GROWSDOWN > >> > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not. > >> > > >> > The MAP_STACK flag is used to mark the stack area, but it is a no-op on > >> > Linux. Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating > >> > THP for such stack area. > >> > >> Doesn't this introduce a regression in the other direction, where > >> workloads expect to use a hugepage TLB entry for the stack? > > > > Maybe, it is theoretically possible. But AFAICT, the real life > > workloads performance usually gets hurt if THP is used for stack. > > Willy has an example: > > > > https://lore.kernel.org/linux-mm/ZYPDwCcAjX+r+g6s@casper.infradead.org/#t > > > > And avoiding THP on stack is not new, VM_GROWSDOWN | VM_GROWSUP areas > > have been applied before, this patch just extends this to MAP_STACK. > > If it's *always* beneficial then we should help it along in glibc as > well. We've started to offer a tunable in response to this observation > (also paper over in OpenJDK): > > Make thread stacks not use huge pages > <https://bugs.openjdk.org/browse/JDK-8303215> > > But this is specifically about RSS usage, and not directly about > reducing TLB misses etc. Thanks for the data point. Out of curiosity, what mmap flags are used by JVM to indicate a stack? MAP_STACK? If so it should get VM_NOHUGEPAGE due to this patch (of course, on older kernel MADV_NOHUGEPAGE must be called by JVM). Letting others, for example, glibc, call MADV_NOHUGEPAGE explicitly on stack area is fine too, but it may take some time to get there... > > Thanks, > Florian >
diff --git a/include/linux/mman.h b/include/linux/mman.h index 40d94411d492..dc7048824be8 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags) return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) | _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) | + _calc_vm_trans(flags, MAP_STACK, VM_NOHUGEPAGE) | arch_calc_vm_flag_bits(flags); }