Message ID | 1556037781-57869-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] mm: thp: fix false negative of shmem vma's THP eligibility | expand |
On Wed 24-04-19 00:43:01, Yang Shi wrote: > The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each > vma") introduced THPeligible bit for processes' smaps. But, when checking > the eligibility for shmem vma, __transparent_hugepage_enabled() is > called to override the result from shmem_huge_enabled(). It may result > in the anonymous vma's THP flag override shmem's. For example, running a > simple test which create THP for shmem, but with anonymous THP disabled, > when reading the process's smaps, it may show: > > 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test > Size: 4096 kB > ... > [snip] > ... > ShmemPmdMapped: 4096 kB > ... > [snip] > ... > THPeligible: 0 > > And, /proc/meminfo does show THP allocated and PMD mapped too: > > ShmemHugePages: 4096 kB > ShmemPmdMapped: 4096 kB > > This doesn't make too much sense. The anonymous THP flag should not > intervene shmem THP. Calling shmem_huge_enabled() with checking > MMF_DISABLE_THP sounds good enough. And, we could skip stack and > dax vma check since we already checked if the vma is shmem already. Kirill, can we get a confirmation that this is really intended behavior rather than an omission please? Is this documented? What is a global knob to simply disable THP system wise? I have to say that the THP tuning API is one giant mess :/ Btw. this patch also seem to fix khugepaged behavior because it previously ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP. > Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") > Cc: Michal Hocko <mhocko@suse.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: David Rientjes <rientjes@google.com> > Cc: Kirill A. Shutemov <kirill@shutemov.name> > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> > --- > v2: Check VM_NOHUGEPAGE per Michal Hocko > > mm/huge_memory.c | 4 ++-- > mm/shmem.c | 3 +++ > 2 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 165ea46..5881e82 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma) > { > if (vma_is_anonymous(vma)) > return __transparent_hugepage_enabled(vma); > - if (vma_is_shmem(vma) && shmem_huge_enabled(vma)) > - return __transparent_hugepage_enabled(vma); > + if (vma_is_shmem(vma)) > + return shmem_huge_enabled(vma); > > return false; > } > diff --git a/mm/shmem.c b/mm/shmem.c > index 2275a0f..6f09a31 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma) > loff_t i_size; > pgoff_t off; > > + if ((vma->vm_flags & VM_NOHUGEPAGE) || > + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) > + return false; > if (shmem_huge == SHMEM_HUGE_FORCE) > return true; > if (shmem_huge == SHMEM_HUGE_DENY) > -- > 1.8.3.1 >
On 4/23/19 10:52 AM, Michal Hocko wrote: > On Wed 24-04-19 00:43:01, Yang Shi wrote: >> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each >> vma") introduced THPeligible bit for processes' smaps. But, when checking >> the eligibility for shmem vma, __transparent_hugepage_enabled() is >> called to override the result from shmem_huge_enabled(). It may result >> in the anonymous vma's THP flag override shmem's. For example, running a >> simple test which create THP for shmem, but with anonymous THP disabled, >> when reading the process's smaps, it may show: >> >> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test >> Size: 4096 kB >> ... >> [snip] >> ... >> ShmemPmdMapped: 4096 kB >> ... >> [snip] >> ... >> THPeligible: 0 >> >> And, /proc/meminfo does show THP allocated and PMD mapped too: >> >> ShmemHugePages: 4096 kB >> ShmemPmdMapped: 4096 kB >> >> This doesn't make too much sense. The anonymous THP flag should not >> intervene shmem THP. Calling shmem_huge_enabled() with checking >> MMF_DISABLE_THP sounds good enough. And, we could skip stack and >> dax vma check since we already checked if the vma is shmem already. > Kirill, can we get a confirmation that this is really intended behavior > rather than an omission please? Is this documented? What is a global > knob to simply disable THP system wise? > > I have to say that the THP tuning API is one giant mess :/ > > Btw. this patch also seem to fix khugepaged behavior because it previously > ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP. Aha, I didn't notice this. It looks we need separate the patch to fix that khugepaged problem for both 5.1-rc and LTS. > >> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") >> Cc: Michal Hocko <mhocko@suse.com> >> Cc: Vlastimil Babka <vbabka@suse.cz> >> Cc: David Rientjes <rientjes@google.com> >> Cc: Kirill A. Shutemov <kirill@shutemov.name> >> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> >> --- >> v2: Check VM_NOHUGEPAGE per Michal Hocko >> >> mm/huge_memory.c | 4 ++-- >> mm/shmem.c | 3 +++ >> 2 files changed, 5 insertions(+), 2 deletions(-) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 165ea46..5881e82 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma) >> { >> if (vma_is_anonymous(vma)) >> return __transparent_hugepage_enabled(vma); >> - if (vma_is_shmem(vma) && shmem_huge_enabled(vma)) >> - return __transparent_hugepage_enabled(vma); >> + if (vma_is_shmem(vma)) >> + return shmem_huge_enabled(vma); >> >> return false; >> } >> diff --git a/mm/shmem.c b/mm/shmem.c >> index 2275a0f..6f09a31 100644 >> --- a/mm/shmem.c >> +++ b/mm/shmem.c >> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma) >> loff_t i_size; >> pgoff_t off; >> >> + if ((vma->vm_flags & VM_NOHUGEPAGE) || >> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) >> + return false; >> if (shmem_huge == SHMEM_HUGE_FORCE) >> return true; >> if (shmem_huge == SHMEM_HUGE_DENY) >> -- >> 1.8.3.1 >>
On 4/23/19 11:34 AM, Yang Shi wrote: > > > On 4/23/19 10:52 AM, Michal Hocko wrote: >> On Wed 24-04-19 00:43:01, Yang Shi wrote: >>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for >>> each >>> vma") introduced THPeligible bit for processes' smaps. But, when >>> checking >>> the eligibility for shmem vma, __transparent_hugepage_enabled() is >>> called to override the result from shmem_huge_enabled(). It may result >>> in the anonymous vma's THP flag override shmem's. For example, >>> running a >>> simple test which create THP for shmem, but with anonymous THP >>> disabled, >>> when reading the process's smaps, it may show: >>> >>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test >>> Size: 4096 kB >>> ... >>> [snip] >>> ... >>> ShmemPmdMapped: 4096 kB >>> ... >>> [snip] >>> ... >>> THPeligible: 0 >>> >>> And, /proc/meminfo does show THP allocated and PMD mapped too: >>> >>> ShmemHugePages: 4096 kB >>> ShmemPmdMapped: 4096 kB >>> >>> This doesn't make too much sense. The anonymous THP flag should not >>> intervene shmem THP. Calling shmem_huge_enabled() with checking >>> MMF_DISABLE_THP sounds good enough. And, we could skip stack and >>> dax vma check since we already checked if the vma is shmem already. >> Kirill, can we get a confirmation that this is really intended behavior >> rather than an omission please? Is this documented? What is a global >> knob to simply disable THP system wise? >> >> I have to say that the THP tuning API is one giant mess :/ >> >> Btw. this patch also seem to fix khugepaged behavior because it >> previously >> ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP. Second look shows this is not ignored. hugepage_vma_check() would check this for both anonymous vma and shmem vma before scanning. It is called before shmem_huge_enabled(). > > Aha, I didn't notice this. It looks we need separate the patch to fix > that khugepaged problem for both 5.1-rc and LTS. > >> >>> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each >>> vma") >>> Cc: Michal Hocko <mhocko@suse.com> >>> Cc: Vlastimil Babka <vbabka@suse.cz> >>> Cc: David Rientjes <rientjes@google.com> >>> Cc: Kirill A. Shutemov <kirill@shutemov.name> >>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> >>> --- >>> v2: Check VM_NOHUGEPAGE per Michal Hocko >>> >>> mm/huge_memory.c | 4 ++-- >>> mm/shmem.c | 3 +++ >>> 2 files changed, 5 insertions(+), 2 deletions(-) >>> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>> index 165ea46..5881e82 100644 >>> --- a/mm/huge_memory.c >>> +++ b/mm/huge_memory.c >>> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct >>> vm_area_struct *vma) >>> { >>> if (vma_is_anonymous(vma)) >>> return __transparent_hugepage_enabled(vma); >>> - if (vma_is_shmem(vma) && shmem_huge_enabled(vma)) >>> - return __transparent_hugepage_enabled(vma); >>> + if (vma_is_shmem(vma)) >>> + return shmem_huge_enabled(vma); >>> return false; >>> } >>> diff --git a/mm/shmem.c b/mm/shmem.c >>> index 2275a0f..6f09a31 100644 >>> --- a/mm/shmem.c >>> +++ b/mm/shmem.c >>> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct >>> *vma) >>> loff_t i_size; >>> pgoff_t off; >>> + if ((vma->vm_flags & VM_NOHUGEPAGE) || >>> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) >>> + return false; >>> if (shmem_huge == SHMEM_HUGE_FORCE) >>> return true; >>> if (shmem_huge == SHMEM_HUGE_DENY) >>> -- >>> 1.8.3.1 >>> >
On Tue 23-04-19 17:22:36, Yang Shi wrote: > > > On 4/23/19 11:34 AM, Yang Shi wrote: > > > > > > On 4/23/19 10:52 AM, Michal Hocko wrote: > > > On Wed 24-04-19 00:43:01, Yang Shi wrote: > > > > The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility > > > > for each > > > > vma") introduced THPeligible bit for processes' smaps. But, when > > > > checking > > > > the eligibility for shmem vma, __transparent_hugepage_enabled() is > > > > called to override the result from shmem_huge_enabled(). It may result > > > > in the anonymous vma's THP flag override shmem's. For example, > > > > running a > > > > simple test which create THP for shmem, but with anonymous THP > > > > disabled, > > > > when reading the process's smaps, it may show: > > > > > > > > 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test > > > > Size: 4096 kB > > > > ... > > > > [snip] > > > > ... > > > > ShmemPmdMapped: 4096 kB > > > > ... > > > > [snip] > > > > ... > > > > THPeligible: 0 > > > > > > > > And, /proc/meminfo does show THP allocated and PMD mapped too: > > > > > > > > ShmemHugePages: 4096 kB > > > > ShmemPmdMapped: 4096 kB > > > > > > > > This doesn't make too much sense. The anonymous THP flag should not > > > > intervene shmem THP. Calling shmem_huge_enabled() with checking > > > > MMF_DISABLE_THP sounds good enough. And, we could skip stack and > > > > dax vma check since we already checked if the vma is shmem already. > > > Kirill, can we get a confirmation that this is really intended behavior > > > rather than an omission please? Is this documented? What is a global > > > knob to simply disable THP system wise? > > > > > > I have to say that the THP tuning API is one giant mess :/ > > > > > > Btw. this patch also seem to fix khugepaged behavior because it > > > previously > > > ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP. > > Second look shows this is not ignored. hugepage_vma_check() would check this > for both anonymous vma and shmem vma before scanning. It is called before > shmem_huge_enabled(). Right. I have missed the earlier check. The main question remains though.
On 4/23/19 6:43 PM, Yang Shi wrote: > The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each > vma") introduced THPeligible bit for processes' smaps. But, when checking > the eligibility for shmem vma, __transparent_hugepage_enabled() is > called to override the result from shmem_huge_enabled(). It may result > in the anonymous vma's THP flag override shmem's. For example, running a > simple test which create THP for shmem, but with anonymous THP disabled, > when reading the process's smaps, it may show: > > 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test > Size: 4096 kB > ... > [snip] > ... > ShmemPmdMapped: 4096 kB But how does this happen in the first place? In __handle_mm_fault() we do: if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) { ret = create_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; And __transparent_hugepage_enabled() checks the global THP settings. If THP is not enabled / is only for madvise and the vma is not madvised, then this should fail, and also khugepaged shouldn't either run at all, or don't do its job for such non-madvised vma. What am I missing? > ... > [snip] > ... > THPeligible: 0 > > And, /proc/meminfo does show THP allocated and PMD mapped too: > > ShmemHugePages: 4096 kB > ShmemPmdMapped: 4096 kB > > This doesn't make too much sense. The anonymous THP flag should not > intervene shmem THP. Calling shmem_huge_enabled() with checking > MMF_DISABLE_THP sounds good enough. And, we could skip stack and > dax vma check since we already checked if the vma is shmem already. > > Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") > Cc: Michal Hocko <mhocko@suse.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: David Rientjes <rientjes@google.com> > Cc: Kirill A. Shutemov <kirill@shutemov.name> > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> > --- > v2: Check VM_NOHUGEPAGE per Michal Hocko > > mm/huge_memory.c | 4 ++-- > mm/shmem.c | 3 +++ > 2 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 165ea46..5881e82 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma) > { > if (vma_is_anonymous(vma)) > return __transparent_hugepage_enabled(vma); > - if (vma_is_shmem(vma) && shmem_huge_enabled(vma)) > - return __transparent_hugepage_enabled(vma); > + if (vma_is_shmem(vma)) > + return shmem_huge_enabled(vma); > > return false; > } > diff --git a/mm/shmem.c b/mm/shmem.c > index 2275a0f..6f09a31 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma) > loff_t i_size; > pgoff_t off; > > + if ((vma->vm_flags & VM_NOHUGEPAGE) || > + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) > + return false; > if (shmem_huge == SHMEM_HUGE_FORCE) > return true; > if (shmem_huge == SHMEM_HUGE_DENY) >
On 4/24/19 6:10 AM, Vlastimil Babka wrote: > On 4/23/19 6:43 PM, Yang Shi wrote: >> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each >> vma") introduced THPeligible bit for processes' smaps. But, when checking >> the eligibility for shmem vma, __transparent_hugepage_enabled() is >> called to override the result from shmem_huge_enabled(). It may result >> in the anonymous vma's THP flag override shmem's. For example, running a >> simple test which create THP for shmem, but with anonymous THP disabled, >> when reading the process's smaps, it may show: >> >> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test >> Size: 4096 kB >> ... >> [snip] >> ... >> ShmemPmdMapped: 4096 kB > But how does this happen in the first place? > In __handle_mm_fault() we do: > > if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) { > ret = create_huge_pmd(&vmf); > if (!(ret & VM_FAULT_FALLBACK)) > return ret; > > And __transparent_hugepage_enabled() checks the global THP settings. > If THP is not enabled / is only for madvise and the vma is not madvised, > then this should fail, and also khugepaged shouldn't either run at all, > or don't do its job for such non-madvised vma. If __transparent_hugepage_enabled() returns false, the code will not reach create_huge_pmd() at all. If it returns true, create_huge_pmd() actually will return VM_FAULT_FALLBACK for shmem since shmem doesn't have huge_fault (or pmd_fault in earlier versions) method. Then it will get into handle_pte_fault(), finally shmem_fault() is called, which allocates THP by checking some global flag (i.e. VM_NOHUGEPAGE and MMF_DISABLE_THP) and shmem THP knobs. 4.8 (the first version has shmem THP merged) behaves exactly in the same way. So, I suspect this may be intended behavior. > > What am I missing? > >> ... >> [snip] >> ... >> THPeligible: 0 >> >> And, /proc/meminfo does show THP allocated and PMD mapped too: >> >> ShmemHugePages: 4096 kB >> ShmemPmdMapped: 4096 kB >> >> This doesn't make too much sense. The anonymous THP flag should not >> intervene shmem THP. Calling shmem_huge_enabled() with checking >> MMF_DISABLE_THP sounds good enough. And, we could skip stack and >> dax vma check since we already checked if the vma is shmem already. >> >> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") >> Cc: Michal Hocko <mhocko@suse.com> >> Cc: Vlastimil Babka <vbabka@suse.cz> >> Cc: David Rientjes <rientjes@google.com> >> Cc: Kirill A. Shutemov <kirill@shutemov.name> >> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> >> --- >> v2: Check VM_NOHUGEPAGE per Michal Hocko >> >> mm/huge_memory.c | 4 ++-- >> mm/shmem.c | 3 +++ >> 2 files changed, 5 insertions(+), 2 deletions(-) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 165ea46..5881e82 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma) >> { >> if (vma_is_anonymous(vma)) >> return __transparent_hugepage_enabled(vma); >> - if (vma_is_shmem(vma) && shmem_huge_enabled(vma)) >> - return __transparent_hugepage_enabled(vma); >> + if (vma_is_shmem(vma)) >> + return shmem_huge_enabled(vma); >> >> return false; >> } >> diff --git a/mm/shmem.c b/mm/shmem.c >> index 2275a0f..6f09a31 100644 >> --- a/mm/shmem.c >> +++ b/mm/shmem.c >> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma) >> loff_t i_size; >> pgoff_t off; >> >> + if ((vma->vm_flags & VM_NOHUGEPAGE) || >> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) >> + return false; >> if (shmem_huge == SHMEM_HUGE_FORCE) >> return true; >> if (shmem_huge == SHMEM_HUGE_DENY) >>
On 4/24/19 5:47 PM, Yang Shi wrote: > > > On 4/24/19 6:10 AM, Vlastimil Babka wrote: >> On 4/23/19 6:43 PM, Yang Shi wrote: >>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each >>> vma") introduced THPeligible bit for processes' smaps. But, when checking >>> the eligibility for shmem vma, __transparent_hugepage_enabled() is >>> called to override the result from shmem_huge_enabled(). It may result >>> in the anonymous vma's THP flag override shmem's. For example, running a >>> simple test which create THP for shmem, but with anonymous THP disabled, >>> when reading the process's smaps, it may show: >>> >>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test >>> Size: 4096 kB >>> ... >>> [snip] >>> ... >>> ShmemPmdMapped: 4096 kB >> But how does this happen in the first place? >> In __handle_mm_fault() we do: >> >> if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) { >> ret = create_huge_pmd(&vmf); >> if (!(ret & VM_FAULT_FALLBACK)) >> return ret; >> >> And __transparent_hugepage_enabled() checks the global THP settings. >> If THP is not enabled / is only for madvise and the vma is not madvised, >> then this should fail, and also khugepaged shouldn't either run at all, >> or don't do its job for such non-madvised vma. > > If __transparent_hugepage_enabled() returns false, the code will not > reach create_huge_pmd() at all. If it returns true, create_huge_pmd() > actually will return VM_FAULT_FALLBACK for shmem since shmem doesn't > have huge_fault (or pmd_fault in earlier versions) method. > > Then it will get into handle_pte_fault(), finally shmem_fault() is > called, which allocates THP by checking some global flag (i.e. > VM_NOHUGEPAGE and MMF_DISABLE_THP) and shmem THP knobs. Aha, thanks! What a mess... > > 4.8 (the first version has shmem THP merged) behaves exactly in the same > way. So, I suspect this may be intended behavior. Still looks like an oversight to me. And it's inconsistent... it might fault huge shmem pages when THPs are globally disabled, but khugepaged is still not running. I think it should just check the global THP flags as well...
On 4/24/19 9:17 AM, Vlastimil Babka wrote: > On 4/24/19 5:47 PM, Yang Shi wrote: >> >> On 4/24/19 6:10 AM, Vlastimil Babka wrote: >>> On 4/23/19 6:43 PM, Yang Shi wrote: >>>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each >>>> vma") introduced THPeligible bit for processes' smaps. But, when checking >>>> the eligibility for shmem vma, __transparent_hugepage_enabled() is >>>> called to override the result from shmem_huge_enabled(). It may result >>>> in the anonymous vma's THP flag override shmem's. For example, running a >>>> simple test which create THP for shmem, but with anonymous THP disabled, >>>> when reading the process's smaps, it may show: >>>> >>>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test >>>> Size: 4096 kB >>>> ... >>>> [snip] >>>> ... >>>> ShmemPmdMapped: 4096 kB >>> But how does this happen in the first place? >>> In __handle_mm_fault() we do: >>> >>> if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) { >>> ret = create_huge_pmd(&vmf); >>> if (!(ret & VM_FAULT_FALLBACK)) >>> return ret; >>> >>> And __transparent_hugepage_enabled() checks the global THP settings. >>> If THP is not enabled / is only for madvise and the vma is not madvised, >>> then this should fail, and also khugepaged shouldn't either run at all, >>> or don't do its job for such non-madvised vma. >> If __transparent_hugepage_enabled() returns false, the code will not >> reach create_huge_pmd() at all. If it returns true, create_huge_pmd() >> actually will return VM_FAULT_FALLBACK for shmem since shmem doesn't >> have huge_fault (or pmd_fault in earlier versions) method. >> >> Then it will get into handle_pte_fault(), finally shmem_fault() is >> called, which allocates THP by checking some global flag (i.e. >> VM_NOHUGEPAGE and MMF_DISABLE_THP) and shmem THP knobs. > Aha, thanks! What a mess... Yes, it does look convoluted. I'm wondering we may consider refactor the shmem THP fault handling. > >> 4.8 (the first version has shmem THP merged) behaves exactly in the same >> way. So, I suspect this may be intended behavior. > Still looks like an oversight to me. And it's inconsistent... it might > fault huge shmem pages when THPs are globally disabled, but khugepaged > is still not running. I think it should just check the global THP flags > as well... It does looks inconsistent, particularly for the khugepaged part.
On 4/23/19 10:52 AM, Michal Hocko wrote: > On Wed 24-04-19 00:43:01, Yang Shi wrote: >> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each >> vma") introduced THPeligible bit for processes' smaps. But, when checking >> the eligibility for shmem vma, __transparent_hugepage_enabled() is >> called to override the result from shmem_huge_enabled(). It may result >> in the anonymous vma's THP flag override shmem's. For example, running a >> simple test which create THP for shmem, but with anonymous THP disabled, >> when reading the process's smaps, it may show: >> >> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test >> Size: 4096 kB >> ... >> [snip] >> ... >> ShmemPmdMapped: 4096 kB >> ... >> [snip] >> ... >> THPeligible: 0 >> >> And, /proc/meminfo does show THP allocated and PMD mapped too: >> >> ShmemHugePages: 4096 kB >> ShmemPmdMapped: 4096 kB >> >> This doesn't make too much sense. The anonymous THP flag should not >> intervene shmem THP. Calling shmem_huge_enabled() with checking >> MMF_DISABLE_THP sounds good enough. And, we could skip stack and >> dax vma check since we already checked if the vma is shmem already. > Kirill, can we get a confirmation that this is really intended behavior > rather than an omission please? Is this documented? What is a global > knob to simply disable THP system wise? Hi Kirill, Ping. Any comment? Thanks, Yang > > I have to say that the THP tuning API is one giant mess :/ > > Btw. this patch also seem to fix khugepaged behavior because it previously > ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP. > >> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") >> Cc: Michal Hocko <mhocko@suse.com> >> Cc: Vlastimil Babka <vbabka@suse.cz> >> Cc: David Rientjes <rientjes@google.com> >> Cc: Kirill A. Shutemov <kirill@shutemov.name> >> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> >> --- >> v2: Check VM_NOHUGEPAGE per Michal Hocko >> >> mm/huge_memory.c | 4 ++-- >> mm/shmem.c | 3 +++ >> 2 files changed, 5 insertions(+), 2 deletions(-) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 165ea46..5881e82 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma) >> { >> if (vma_is_anonymous(vma)) >> return __transparent_hugepage_enabled(vma); >> - if (vma_is_shmem(vma) && shmem_huge_enabled(vma)) >> - return __transparent_hugepage_enabled(vma); >> + if (vma_is_shmem(vma)) >> + return shmem_huge_enabled(vma); >> >> return false; >> } >> diff --git a/mm/shmem.c b/mm/shmem.c >> index 2275a0f..6f09a31 100644 >> --- a/mm/shmem.c >> +++ b/mm/shmem.c >> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma) >> loff_t i_size; >> pgoff_t off; >> >> + if ((vma->vm_flags & VM_NOHUGEPAGE) || >> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) >> + return false; >> if (shmem_huge == SHMEM_HUGE_FORCE) >> return true; >> if (shmem_huge == SHMEM_HUGE_DENY) >> -- >> 1.8.3.1 >>
On 4/28/19 12:13 PM, Yang Shi wrote: > > > On 4/23/19 10:52 AM, Michal Hocko wrote: >> On Wed 24-04-19 00:43:01, Yang Shi wrote: >>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for >>> each >>> vma") introduced THPeligible bit for processes' smaps. But, when >>> checking >>> the eligibility for shmem vma, __transparent_hugepage_enabled() is >>> called to override the result from shmem_huge_enabled(). It may result >>> in the anonymous vma's THP flag override shmem's. For example, >>> running a >>> simple test which create THP for shmem, but with anonymous THP >>> disabled, >>> when reading the process's smaps, it may show: >>> >>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test >>> Size: 4096 kB >>> ... >>> [snip] >>> ... >>> ShmemPmdMapped: 4096 kB >>> ... >>> [snip] >>> ... >>> THPeligible: 0 >>> >>> And, /proc/meminfo does show THP allocated and PMD mapped too: >>> >>> ShmemHugePages: 4096 kB >>> ShmemPmdMapped: 4096 kB >>> >>> This doesn't make too much sense. The anonymous THP flag should not >>> intervene shmem THP. Calling shmem_huge_enabled() with checking >>> MMF_DISABLE_THP sounds good enough. And, we could skip stack and >>> dax vma check since we already checked if the vma is shmem already. >> Kirill, can we get a confirmation that this is really intended behavior >> rather than an omission please? Is this documented? What is a global >> knob to simply disable THP system wise? > > Hi Kirill, > > Ping. Any comment? Talked with Kirill at LSFMM, it sounds this is kind of intended behavior according to him. But, we all agree it looks inconsistent. So, we may have two options: - Just fix the false negative issue as what the patch does - Change the behavior to make it more consistent I'm not sure whether anyone relies on the behavior explicitly or implicitly or not. If we would like to change the behavior, I may consider to take a step further to refactor the code a little bit to use huge_fault() to handle THP fault instead of falling back to handle_pte_fault() in the current implementation. This may make adding THP for other filesystems easier. > > Thanks, > Yang > >> >> I have to say that the THP tuning API is one giant mess :/ >> >> Btw. this patch also seem to fix khugepaged behavior because it >> previously >> ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP. >> >>> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each >>> vma") >>> Cc: Michal Hocko <mhocko@suse.com> >>> Cc: Vlastimil Babka <vbabka@suse.cz> >>> Cc: David Rientjes <rientjes@google.com> >>> Cc: Kirill A. Shutemov <kirill@shutemov.name> >>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> >>> --- >>> v2: Check VM_NOHUGEPAGE per Michal Hocko >>> >>> mm/huge_memory.c | 4 ++-- >>> mm/shmem.c | 3 +++ >>> 2 files changed, 5 insertions(+), 2 deletions(-) >>> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>> index 165ea46..5881e82 100644 >>> --- a/mm/huge_memory.c >>> +++ b/mm/huge_memory.c >>> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct >>> vm_area_struct *vma) >>> { >>> if (vma_is_anonymous(vma)) >>> return __transparent_hugepage_enabled(vma); >>> - if (vma_is_shmem(vma) && shmem_huge_enabled(vma)) >>> - return __transparent_hugepage_enabled(vma); >>> + if (vma_is_shmem(vma)) >>> + return shmem_huge_enabled(vma); >>> return false; >>> } >>> diff --git a/mm/shmem.c b/mm/shmem.c >>> index 2275a0f..6f09a31 100644 >>> --- a/mm/shmem.c >>> +++ b/mm/shmem.c >>> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct >>> *vma) >>> loff_t i_size; >>> pgoff_t off; >>> + if ((vma->vm_flags & VM_NOHUGEPAGE) || >>> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) >>> + return false; >>> if (shmem_huge == SHMEM_HUGE_FORCE) >>> return true; >>> if (shmem_huge == SHMEM_HUGE_DENY) >>> -- >>> 1.8.3.1 >>> >
[Hmm, I thought, Hugh was CCed] On Mon 06-05-19 16:37:42, Yang Shi wrote: > > > On 4/28/19 12:13 PM, Yang Shi wrote: > > > > > > On 4/23/19 10:52 AM, Michal Hocko wrote: > > > On Wed 24-04-19 00:43:01, Yang Shi wrote: > > > > The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility > > > > for each > > > > vma") introduced THPeligible bit for processes' smaps. But, when > > > > checking > > > > the eligibility for shmem vma, __transparent_hugepage_enabled() is > > > > called to override the result from shmem_huge_enabled(). It may result > > > > in the anonymous vma's THP flag override shmem's. For example, > > > > running a > > > > simple test which create THP for shmem, but with anonymous THP > > > > disabled, > > > > when reading the process's smaps, it may show: > > > > > > > > 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test > > > > Size: 4096 kB > > > > ... > > > > [snip] > > > > ... > > > > ShmemPmdMapped: 4096 kB > > > > ... > > > > [snip] > > > > ... > > > > THPeligible: 0 > > > > > > > > And, /proc/meminfo does show THP allocated and PMD mapped too: > > > > > > > > ShmemHugePages: 4096 kB > > > > ShmemPmdMapped: 4096 kB > > > > > > > > This doesn't make too much sense. The anonymous THP flag should not > > > > intervene shmem THP. Calling shmem_huge_enabled() with checking > > > > MMF_DISABLE_THP sounds good enough. And, we could skip stack and > > > > dax vma check since we already checked if the vma is shmem already. > > > Kirill, can we get a confirmation that this is really intended behavior > > > rather than an omission please? Is this documented? What is a global > > > knob to simply disable THP system wise? > > > > Hi Kirill, > > > > Ping. Any comment? > > Talked with Kirill at LSFMM, it sounds this is kind of intended behavior > according to him. But, we all agree it looks inconsistent. > > So, we may have two options: > - Just fix the false negative issue as what the patch does > - Change the behavior to make it more consistent > > I'm not sure whether anyone relies on the behavior explicitly or implicitly > or not. Well, I would be certainly more happy with a more consistent behavior. Talked to Hugh at LSFMM about this and he finds treating shmem objects separately from the anonymous memory. And that is already the case partially when each mount point might have its own setup. So the primary question is whether we need a one global knob to controll all THP allocations. One argument to have that is that it might be helpful to for an admin to simply disable source of THP at a single place rather than crawling over all shmem mount points and remount them. Especially in environments where shmem points are mounted in a container by a non-root. Why would somebody wanted something like that? One example would be to temporarily workaround high order allocations issues which we have seen non trivial amount of in the past and we are likely not at the end of the tunel. That being said I would be in favor of treating the global sysfs knob to be global for all THP allocations. I will not push back on that if there is a general consensus that shmem and fs in general are a different class of objects and a single global control is not desirable for whatever reasons. Kirill, Hugh othe folks?
On 5/7/19 3:47 AM, Michal Hocko wrote: > [Hmm, I thought, Hugh was CCed] > > On Mon 06-05-19 16:37:42, Yang Shi wrote: >> >> On 4/28/19 12:13 PM, Yang Shi wrote: >>> >>> On 4/23/19 10:52 AM, Michal Hocko wrote: >>>> On Wed 24-04-19 00:43:01, Yang Shi wrote: >>>>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility >>>>> for each >>>>> vma") introduced THPeligible bit for processes' smaps. But, when >>>>> checking >>>>> the eligibility for shmem vma, __transparent_hugepage_enabled() is >>>>> called to override the result from shmem_huge_enabled(). It may result >>>>> in the anonymous vma's THP flag override shmem's. For example, >>>>> running a >>>>> simple test which create THP for shmem, but with anonymous THP >>>>> disabled, >>>>> when reading the process's smaps, it may show: >>>>> >>>>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test >>>>> Size: 4096 kB >>>>> ... >>>>> [snip] >>>>> ... >>>>> ShmemPmdMapped: 4096 kB >>>>> ... >>>>> [snip] >>>>> ... >>>>> THPeligible: 0 >>>>> >>>>> And, /proc/meminfo does show THP allocated and PMD mapped too: >>>>> >>>>> ShmemHugePages: 4096 kB >>>>> ShmemPmdMapped: 4096 kB >>>>> >>>>> This doesn't make too much sense. The anonymous THP flag should not >>>>> intervene shmem THP. Calling shmem_huge_enabled() with checking >>>>> MMF_DISABLE_THP sounds good enough. And, we could skip stack and >>>>> dax vma check since we already checked if the vma is shmem already. >>>> Kirill, can we get a confirmation that this is really intended behavior >>>> rather than an omission please? Is this documented? What is a global >>>> knob to simply disable THP system wise? >>> Hi Kirill, >>> >>> Ping. Any comment? >> Talked with Kirill at LSFMM, it sounds this is kind of intended behavior >> according to him. But, we all agree it looks inconsistent. >> >> So, we may have two options: >> - Just fix the false negative issue as what the patch does >> - Change the behavior to make it more consistent >> >> I'm not sure whether anyone relies on the behavior explicitly or implicitly >> or not. > Well, I would be certainly more happy with a more consistent behavior. > Talked to Hugh at LSFMM about this and he finds treating shmem objects > separately from the anonymous memory. And that is already the case > partially when each mount point might have its own setup. So the primary > question is whether we need a one global knob to controll all THP > allocations. One argument to have that is that it might be helpful to > for an admin to simply disable source of THP at a single place rather > than crawling over all shmem mount points and remount them. Especially > in environments where shmem points are mounted in a container by a > non-root. Why would somebody wanted something like that? One example > would be to temporarily workaround high order allocations issues which > we have seen non trivial amount of in the past and we are likely not at > the end of the tunel. Shmem has a global control for such use. Setting shmem_enabled to "force" or "deny" would enable or disable THP for shmem globally, including non-fs objects, i.e. memfd, SYS V shmem, etc. > > That being said I would be in favor of treating the global sysfs knob to > be global for all THP allocations. I will not push back on that if there > is a general consensus that shmem and fs in general are a different > class of objects and a single global control is not desirable for > whatever reasons. OK, we need more inputs from Kirill, Hugh and other folks. > > Kirill, Hugh othe folks?
On 5/7/19 10:10 AM, Yang Shi wrote: > > > On 5/7/19 3:47 AM, Michal Hocko wrote: >> [Hmm, I thought, Hugh was CCed] >> >> On Mon 06-05-19 16:37:42, Yang Shi wrote: >>> >>> On 4/28/19 12:13 PM, Yang Shi wrote: >>>> >>>> On 4/23/19 10:52 AM, Michal Hocko wrote: >>>>> On Wed 24-04-19 00:43:01, Yang Shi wrote: >>>>>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility >>>>>> for each >>>>>> vma") introduced THPeligible bit for processes' smaps. But, when >>>>>> checking >>>>>> the eligibility for shmem vma, __transparent_hugepage_enabled() is >>>>>> called to override the result from shmem_huge_enabled(). It may >>>>>> result >>>>>> in the anonymous vma's THP flag override shmem's. For example, >>>>>> running a >>>>>> simple test which create THP for shmem, but with anonymous THP >>>>>> disabled, >>>>>> when reading the process's smaps, it may show: >>>>>> >>>>>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test >>>>>> Size: 4096 kB >>>>>> ... >>>>>> [snip] >>>>>> ... >>>>>> ShmemPmdMapped: 4096 kB >>>>>> ... >>>>>> [snip] >>>>>> ... >>>>>> THPeligible: 0 >>>>>> >>>>>> And, /proc/meminfo does show THP allocated and PMD mapped too: >>>>>> >>>>>> ShmemHugePages: 4096 kB >>>>>> ShmemPmdMapped: 4096 kB >>>>>> >>>>>> This doesn't make too much sense. The anonymous THP flag should not >>>>>> intervene shmem THP. Calling shmem_huge_enabled() with checking >>>>>> MMF_DISABLE_THP sounds good enough. And, we could skip stack and >>>>>> dax vma check since we already checked if the vma is shmem already. >>>>> Kirill, can we get a confirmation that this is really intended >>>>> behavior >>>>> rather than an omission please? Is this documented? What is a global >>>>> knob to simply disable THP system wise? >>>> Hi Kirill, >>>> >>>> Ping. Any comment? >>> Talked with Kirill at LSFMM, it sounds this is kind of intended >>> behavior >>> according to him. But, we all agree it looks inconsistent. >>> >>> So, we may have two options: >>> - Just fix the false negative issue as what the patch does >>> - Change the behavior to make it more consistent >>> >>> I'm not sure whether anyone relies on the behavior explicitly or >>> implicitly >>> or not. >> Well, I would be certainly more happy with a more consistent behavior. >> Talked to Hugh at LSFMM about this and he finds treating shmem objects >> separately from the anonymous memory. And that is already the case >> partially when each mount point might have its own setup. So the primary >> question is whether we need a one global knob to controll all THP >> allocations. One argument to have that is that it might be helpful to >> for an admin to simply disable source of THP at a single place rather >> than crawling over all shmem mount points and remount them. Especially >> in environments where shmem points are mounted in a container by a >> non-root. Why would somebody wanted something like that? One example >> would be to temporarily workaround high order allocations issues which >> we have seen non trivial amount of in the past and we are likely not at >> the end of the tunel. > > Shmem has a global control for such use. Setting shmem_enabled to > "force" or "deny" would enable or disable THP for shmem globally, > including non-fs objects, i.e. memfd, SYS V shmem, etc. > >> >> That being said I would be in favor of treating the global sysfs knob to >> be global for all THP allocations. I will not push back on that if there >> is a general consensus that shmem and fs in general are a different >> class of objects and a single global control is not desirable for >> whatever reasons. > > OK, we need more inputs from Kirill, Hugh and other folks. [Forgot cc to mailing lists] Hi guys, How should we move forward for this one? Make the sysfs knob (/sys/kernel/mm/transparent_hugepage/enabled) to be global for both anonymous and tmpfs? Or just treat shmem objects separately from anon memory then fix the false-negative of THP eligibility by this patch? > >> >> Kirill, Hugh othe folks? >
On Thu, 6 Jun 2019, Yang Shi wrote: > On 5/7/19 10:10 AM, Yang Shi wrote: > > On 5/7/19 3:47 AM, Michal Hocko wrote: > > > [Hmm, I thought, Hugh was CCed] > > > > > > On Mon 06-05-19 16:37:42, Yang Shi wrote: > > > > > > > > On 4/28/19 12:13 PM, Yang Shi wrote: > > > > > > > > > > On 4/23/19 10:52 AM, Michal Hocko wrote: > > > > > > On Wed 24-04-19 00:43:01, Yang Shi wrote: > > > > > > > The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility > > > > > > > for each > > > > > > > vma") introduced THPeligible bit for processes' smaps. But, when > > > > > > > checking > > > > > > > the eligibility for shmem vma, __transparent_hugepage_enabled() > > > > > > > is > > > > > > > called to override the result from shmem_huge_enabled(). It may > > > > > > > result > > > > > > > in the anonymous vma's THP flag override shmem's. For example, > > > > > > > running a > > > > > > > simple test which create THP for shmem, but with anonymous THP > > > > > > > disabled, > > > > > > > when reading the process's smaps, it may show: > > > > > > > > > > > > > > 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test > > > > > > > Size: 4096 kB > > > > > > > ... > > > > > > > [snip] > > > > > > > ... > > > > > > > ShmemPmdMapped: 4096 kB > > > > > > > ... > > > > > > > [snip] > > > > > > > ... > > > > > > > THPeligible: 0 > > > > > > > > > > > > > > And, /proc/meminfo does show THP allocated and PMD mapped too: > > > > > > > > > > > > > > ShmemHugePages: 4096 kB > > > > > > > ShmemPmdMapped: 4096 kB > > > > > > > > > > > > > > This doesn't make too much sense. The anonymous THP flag should > > > > > > > not > > > > > > > intervene shmem THP. Calling shmem_huge_enabled() with checking > > > > > > > MMF_DISABLE_THP sounds good enough. And, we could skip stack and > > > > > > > dax vma check since we already checked if the vma is shmem > > > > > > > already. > > > > > > Kirill, can we get a confirmation that this is really intended > > > > > > behavior > > > > > > rather than an omission please? Is this documented? What is a > > > > > > global > > > > > > knob to simply disable THP system wise? > > > > > Hi Kirill, > > > > > > > > > > Ping. Any comment? > > > > Talked with Kirill at LSFMM, it sounds this is kind of intended > > > > behavior > > > > according to him. But, we all agree it looks inconsistent. > > > > > > > > So, we may have two options: > > > > - Just fix the false negative issue as what the patch does > > > > - Change the behavior to make it more consistent > > > > > > > > I'm not sure whether anyone relies on the behavior explicitly or > > > > implicitly > > > > or not. > > > Well, I would be certainly more happy with a more consistent behavior. > > > Talked to Hugh at LSFMM about this and he finds treating shmem objects > > > separately from the anonymous memory. And that is already the case > > > partially when each mount point might have its own setup. So the primary > > > question is whether we need a one global knob to controll all THP > > > allocations. One argument to have that is that it might be helpful to > > > for an admin to simply disable source of THP at a single place rather > > > than crawling over all shmem mount points and remount them. Especially > > > in environments where shmem points are mounted in a container by a > > > non-root. Why would somebody wanted something like that? One example > > > would be to temporarily workaround high order allocations issues which > > > we have seen non trivial amount of in the past and we are likely not at > > > the end of the tunel. > > > > Shmem has a global control for such use. Setting shmem_enabled to "force" > > or "deny" would enable or disable THP for shmem globally, including non-fs > > objects, i.e. memfd, SYS V shmem, etc. > > > > > > > > That being said I would be in favor of treating the global sysfs knob to > > > be global for all THP allocations. I will not push back on that if there > > > is a general consensus that shmem and fs in general are a different > > > class of objects and a single global control is not desirable for > > > whatever reasons. > > > > OK, we need more inputs from Kirill, Hugh and other folks. > > [Forgot cc to mailing lists] > > Hi guys, > > How should we move forward for this one? Make the sysfs knob > (/sys/kernel/mm/transparent_hugepage/enabled) to be global for both anonymous > and tmpfs? Or just treat shmem objects separately from anon memory then fix > the false-negative of THP eligibility by this patch? Sorry for not getting back to you sooner on this. I don't like to drive design by smaps. I agree with the word "mess" used several times of THP tunings in this thread, but it's too easy to make that mess worse by unnecessary changes, so I'm very cautious here. The addition of "THPeligible" without an "Anon" in its name was unfortunate. I suppose we're two releases too late to change that. Applying process (PR_SET_THP_DISABLE) and mm (MADV_*HUGEPAGE) limitations to shared filesystem objects doesn't work all that well. I recommend that you continue to treat shmem objects separately from anon memory, and just make the smaps "THPeligible" more often accurate. Is your v2 patch earlier in this thread the best for that? No answer tonight, I'll re-examine later in the day. Hugh
On Fri 07-06-19 03:57:18, Hugh Dickins wrote: [...] > The addition of "THPeligible" without an "Anon" in its name was > unfortunate. I suppose we're two releases too late to change that. Well, I do not really see any reason why THPeligible should be Anon specific at all. Even if ... > Applying process (PR_SET_THP_DISABLE) and mm (MADV_*HUGEPAGE) > limitations to shared filesystem objects doesn't work all that well. ... this is what we are going with then it is really important to have a single place to query the eligibility IMHO. > I recommend that you continue to treat shmem objects separately from > anon memory, and just make the smaps "THPeligible" more often accurate. Agreed on this.
On 6/7/19 3:57 AM, Hugh Dickins wrote: > On Thu, 6 Jun 2019, Yang Shi wrote: >> On 5/7/19 10:10 AM, Yang Shi wrote: >>> On 5/7/19 3:47 AM, Michal Hocko wrote: >>>> [Hmm, I thought, Hugh was CCed] >>>> >>>> On Mon 06-05-19 16:37:42, Yang Shi wrote: >>>>> On 4/28/19 12:13 PM, Yang Shi wrote: >>>>>> On 4/23/19 10:52 AM, Michal Hocko wrote: >>>>>>> On Wed 24-04-19 00:43:01, Yang Shi wrote: >>>>>>>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility >>>>>>>> for each >>>>>>>> vma") introduced THPeligible bit for processes' smaps. But, when >>>>>>>> checking >>>>>>>> the eligibility for shmem vma, __transparent_hugepage_enabled() >>>>>>>> is >>>>>>>> called to override the result from shmem_huge_enabled(). It may >>>>>>>> result >>>>>>>> in the anonymous vma's THP flag override shmem's. For example, >>>>>>>> running a >>>>>>>> simple test which create THP for shmem, but with anonymous THP >>>>>>>> disabled, >>>>>>>> when reading the process's smaps, it may show: >>>>>>>> >>>>>>>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test >>>>>>>> Size: 4096 kB >>>>>>>> ... >>>>>>>> [snip] >>>>>>>> ... >>>>>>>> ShmemPmdMapped: 4096 kB >>>>>>>> ... >>>>>>>> [snip] >>>>>>>> ... >>>>>>>> THPeligible: 0 >>>>>>>> >>>>>>>> And, /proc/meminfo does show THP allocated and PMD mapped too: >>>>>>>> >>>>>>>> ShmemHugePages: 4096 kB >>>>>>>> ShmemPmdMapped: 4096 kB >>>>>>>> >>>>>>>> This doesn't make too much sense. The anonymous THP flag should >>>>>>>> not >>>>>>>> intervene shmem THP. Calling shmem_huge_enabled() with checking >>>>>>>> MMF_DISABLE_THP sounds good enough. And, we could skip stack and >>>>>>>> dax vma check since we already checked if the vma is shmem >>>>>>>> already. >>>>>>> Kirill, can we get a confirmation that this is really intended >>>>>>> behavior >>>>>>> rather than an omission please? Is this documented? What is a >>>>>>> global >>>>>>> knob to simply disable THP system wise? >>>>>> Hi Kirill, >>>>>> >>>>>> Ping. Any comment? >>>>> Talked with Kirill at LSFMM, it sounds this is kind of intended >>>>> behavior >>>>> according to him. But, we all agree it looks inconsistent. >>>>> >>>>> So, we may have two options: >>>>> - Just fix the false negative issue as what the patch does >>>>> - Change the behavior to make it more consistent >>>>> >>>>> I'm not sure whether anyone relies on the behavior explicitly or >>>>> implicitly >>>>> or not. >>>> Well, I would be certainly more happy with a more consistent behavior. >>>> Talked to Hugh at LSFMM about this and he finds treating shmem objects >>>> separately from the anonymous memory. And that is already the case >>>> partially when each mount point might have its own setup. So the primary >>>> question is whether we need a one global knob to controll all THP >>>> allocations. One argument to have that is that it might be helpful to >>>> for an admin to simply disable source of THP at a single place rather >>>> than crawling over all shmem mount points and remount them. Especially >>>> in environments where shmem points are mounted in a container by a >>>> non-root. Why would somebody wanted something like that? One example >>>> would be to temporarily workaround high order allocations issues which >>>> we have seen non trivial amount of in the past and we are likely not at >>>> the end of the tunel. >>> Shmem has a global control for such use. Setting shmem_enabled to "force" >>> or "deny" would enable or disable THP for shmem globally, including non-fs >>> objects, i.e. memfd, SYS V shmem, etc. >>> >>>> That being said I would be in favor of treating the global sysfs knob to >>>> be global for all THP allocations. I will not push back on that if there >>>> is a general consensus that shmem and fs in general are a different >>>> class of objects and a single global control is not desirable for >>>> whatever reasons. >>> OK, we need more inputs from Kirill, Hugh and other folks. >> [Forgot cc to mailing lists] >> >> Hi guys, >> >> How should we move forward for this one? Make the sysfs knob >> (/sys/kernel/mm/transparent_hugepage/enabled) to be global for both anonymous >> and tmpfs? Or just treat shmem objects separately from anon memory then fix >> the false-negative of THP eligibility by this patch? > Sorry for not getting back to you sooner on this. > > I don't like to drive design by smaps. I agree with the word "mess" used > several times of THP tunings in this thread, but it's too easy to make > that mess worse by unnecessary changes, so I'm very cautious here. > > The addition of "THPeligible" without an "Anon" in its name was > unfortunate. I suppose we're two releases too late to change that. The smaps shows it is anon vma or shmem vma for the most cases. > > Applying process (PR_SET_THP_DISABLE) and mm (MADV_*HUGEPAGE) > limitations to shared filesystem objects doesn't work all that well. The THP eligibility indicator is per vma, it just reports whether THP is eligible for a specific vma. So, I'm supposed it should keep consistent with MMF_DISABLE_THP and MADV_*HUGEPAGE setting. The current implementation in shmem and kuhugepaged also checks these. > > I recommend that you continue to treat shmem objects separately from > anon memory, and just make the smaps "THPeligible" more often accurate. > > Is your v2 patch earlier in this thread the best for that? The v2 patch treats shmem objects separately from anon memory and it makes the "THPeligible" more often accurate. > No answer tonight, I'll re-examine later in the day. > > Hugh
On Wed, 24 Apr 2019, Yang Shi wrote: > The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each > vma") introduced THPeligible bit for processes' smaps. But, when checking > the eligibility for shmem vma, __transparent_hugepage_enabled() is > called to override the result from shmem_huge_enabled(). It may result > in the anonymous vma's THP flag override shmem's. For example, running a > simple test which create THP for shmem, but with anonymous THP disabled, > when reading the process's smaps, it may show: > > 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test > Size: 4096 kB > ... > [snip] > ... > ShmemPmdMapped: 4096 kB > ... > [snip] > ... > THPeligible: 0 > > And, /proc/meminfo does show THP allocated and PMD mapped too: > > ShmemHugePages: 4096 kB > ShmemPmdMapped: 4096 kB > > This doesn't make too much sense. The anonymous THP flag should not > intervene shmem THP. Calling shmem_huge_enabled() with checking > MMF_DISABLE_THP sounds good enough. And, we could skip stack and > dax vma check since we already checked if the vma is shmem already. > > Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") > Cc: Michal Hocko <mhocko@suse.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: David Rientjes <rientjes@google.com> > Cc: Kirill A. Shutemov <kirill@shutemov.name> > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> > --- > v2: Check VM_NOHUGEPAGE per Michal Hocko > > mm/huge_memory.c | 4 ++-- > mm/shmem.c | 3 +++ > 2 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 165ea46..5881e82 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma) > { > if (vma_is_anonymous(vma)) > return __transparent_hugepage_enabled(vma); > - if (vma_is_shmem(vma) && shmem_huge_enabled(vma)) > - return __transparent_hugepage_enabled(vma); > + if (vma_is_shmem(vma)) > + return shmem_huge_enabled(vma); > > return false; > } > diff --git a/mm/shmem.c b/mm/shmem.c > index 2275a0f..6f09a31 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma) > loff_t i_size; > pgoff_t off; > > + if ((vma->vm_flags & VM_NOHUGEPAGE) || > + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) > + return false; Yes, that is correct; and correctly placed. But a little more is needed: see how mm/memory.c's transhuge_vma_suitable() will only allow a pmd to be used instead of a pte if the vma offset and size permit. smaps should not report a shmem vma as THPeligible if its offset or size prevent it. And I see that should also be fixed on anon vmas: at present smaps reports even a 4kB anon vma as THPeligible, which is not right. Maybe a test like transhuge_vma_suitable() can be added into transparent_hugepage_enabled(), to handle anon and shmem together. I say "like transhuge_vma_suitable()", because that function needs an address, which here you don't have. The anon offset situation is interesting: usually anon vm_pgoff is initialized to fit with its vm_start, so the anon offset check passes; but I wonder what happens after mremap to a different address - does transhuge_vma_suitable() then prevent the use of pmds where they could actually be used? Not a Number#1 priority to investigate or fix here! but a curiosity someone might want to look into. > if (shmem_huge == SHMEM_HUGE_FORCE) > return true; > if (shmem_huge == SHMEM_HUGE_DENY) > -- > 1.8.3.1 Even with your changes ShmemPmdMapped: 4096 kB THPeligible: 0 will easily be seen: THPeligible reflects whether a huge page can be allocated and mapped by pmd in that vma; but if something else already allocated the huge page earlier, it will be mapped by pmd in this vma if offset and size allow, whatever THPeligible says. We could change transhuge_vma_suitable() to force ptes in that case, but it would be a silly change, just to make what smaps shows easier to explain. Hugh
On 6/7/19 8:58 PM, Hugh Dickins wrote: > On Wed, 24 Apr 2019, Yang Shi wrote: > >> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each >> vma") introduced THPeligible bit for processes' smaps. But, when checking >> the eligibility for shmem vma, __transparent_hugepage_enabled() is >> called to override the result from shmem_huge_enabled(). It may result >> in the anonymous vma's THP flag override shmem's. For example, running a >> simple test which create THP for shmem, but with anonymous THP disabled, >> when reading the process's smaps, it may show: >> >> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test >> Size: 4096 kB >> ... >> [snip] >> ... >> ShmemPmdMapped: 4096 kB >> ... >> [snip] >> ... >> THPeligible: 0 >> >> And, /proc/meminfo does show THP allocated and PMD mapped too: >> >> ShmemHugePages: 4096 kB >> ShmemPmdMapped: 4096 kB >> >> This doesn't make too much sense. The anonymous THP flag should not >> intervene shmem THP. Calling shmem_huge_enabled() with checking >> MMF_DISABLE_THP sounds good enough. And, we could skip stack and >> dax vma check since we already checked if the vma is shmem already. >> >> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") >> Cc: Michal Hocko <mhocko@suse.com> >> Cc: Vlastimil Babka <vbabka@suse.cz> >> Cc: David Rientjes <rientjes@google.com> >> Cc: Kirill A. Shutemov <kirill@shutemov.name> >> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> >> --- >> v2: Check VM_NOHUGEPAGE per Michal Hocko >> >> mm/huge_memory.c | 4 ++-- >> mm/shmem.c | 3 +++ >> 2 files changed, 5 insertions(+), 2 deletions(-) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 165ea46..5881e82 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma) >> { >> if (vma_is_anonymous(vma)) >> return __transparent_hugepage_enabled(vma); >> - if (vma_is_shmem(vma) && shmem_huge_enabled(vma)) >> - return __transparent_hugepage_enabled(vma); >> + if (vma_is_shmem(vma)) >> + return shmem_huge_enabled(vma); >> >> return false; >> } >> diff --git a/mm/shmem.c b/mm/shmem.c >> index 2275a0f..6f09a31 100644 >> --- a/mm/shmem.c >> +++ b/mm/shmem.c >> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma) >> loff_t i_size; >> pgoff_t off; >> >> + if ((vma->vm_flags & VM_NOHUGEPAGE) || >> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) >> + return false; > Yes, that is correct; and correctly placed. But a little more is needed: > see how mm/memory.c's transhuge_vma_suitable() will only allow a pmd to > be used instead of a pte if the vma offset and size permit. smaps should > not report a shmem vma as THPeligible if its offset or size prevent it. > > And I see that should also be fixed on anon vmas: at present smaps > reports even a 4kB anon vma as THPeligible, which is not right. > Maybe a test like transhuge_vma_suitable() can be added into > transparent_hugepage_enabled(), to handle anon and shmem together. > I say "like transhuge_vma_suitable()", because that function needs > an address, which here you don't have. Thanks for the remind. Since we don't have an address I'm supposed we just need check if the vma's size is big enough or not other than other alignment check. And, I'm wondering whether we could reuse transhuge_vma_suitable() by passing in an impossible address, i.e. -1 since it is not a valid userspace address. It can be used as and indicator that this call is from THPeligible context. > > The anon offset situation is interesting: usually anon vm_pgoff is > initialized to fit with its vm_start, so the anon offset check passes; > but I wonder what happens after mremap to a different address - does > transhuge_vma_suitable() then prevent the use of pmds where they could > actually be used? Not a Number#1 priority to investigate or fix here! > but a curiosity someone might want to look into. Will mark on my TODO list. > >> if (shmem_huge == SHMEM_HUGE_FORCE) >> return true; >> if (shmem_huge == SHMEM_HUGE_DENY) >> -- >> 1.8.3.1 > > Even with your changes > ShmemPmdMapped: 4096 kB > THPeligible: 0 > will easily be seen: THPeligible reflects whether a huge page can be > allocated and mapped by pmd in that vma; but if something else already > allocated the huge page earlier, it will be mapped by pmd in this vma > if offset and size allow, whatever THPeligible says. We could change > transhuge_vma_suitable() to force ptes in that case, but it would be > a silly change, just to make what smaps shows easier to explain. Where did this come from? From the commit log? If so it is the example for the wrong smap output. If that case really happens, I think we could document it since THPeligible should just show the current status. > > Hugh
On Mon, 10 Jun 2019, Yang Shi wrote: > On 6/7/19 8:58 PM, Hugh Dickins wrote: > > Yes, that is correct; and correctly placed. But a little more is needed: > > see how mm/memory.c's transhuge_vma_suitable() will only allow a pmd to > > be used instead of a pte if the vma offset and size permit. smaps should > > not report a shmem vma as THPeligible if its offset or size prevent it. > > > > And I see that should also be fixed on anon vmas: at present smaps > > reports even a 4kB anon vma as THPeligible, which is not right. > > Maybe a test like transhuge_vma_suitable() can be added into > > transparent_hugepage_enabled(), to handle anon and shmem together. > > I say "like transhuge_vma_suitable()", because that function needs > > an address, which here you don't have. > > Thanks for the remind. Since we don't have an address I'm supposed we just > need check if the vma's size is big enough or not other than other alignment > check. > > And, I'm wondering whether we could reuse transhuge_vma_suitable() by passing > in an impossible address, i.e. -1 since it is not a valid userspace address. > It can be used as and indicator that this call is from THPeligible context. Perhaps, but sounds like it will abuse and uglify transhuge_vma_suitable() just for smaps. Would passing transhuge_vma_suitable() the address ((vma->vm_end & HPAGE_PMD_MASK) - HPAGE_PMD_SIZE) give the the correct answer in all cases? > > > > The anon offset situation is interesting: usually anon vm_pgoff is > > initialized to fit with its vm_start, so the anon offset check passes; > > but I wonder what happens after mremap to a different address - does > > transhuge_vma_suitable() then prevent the use of pmds where they could > > actually be used? Not a Number#1 priority to investigate or fix here! > > but a curiosity someone might want to look into. > > Will mark on my TODO list. > > > Even with your changes > > ShmemPmdMapped: 4096 kB > > THPeligible: 0 > > will easily be seen: THPeligible reflects whether a huge page can be > > allocated and mapped by pmd in that vma; but if something else already > > allocated the huge page earlier, it will be mapped by pmd in this vma > > if offset and size allow, whatever THPeligible says. We could change > > transhuge_vma_suitable() to force ptes in that case, but it would be > > a silly change, just to make what smaps shows easier to explain. > > Where did this come from? From the commit log? If so it is the example for > the wrong smap output. If that case really happens, I think we could document > it since THPeligible should just show the current status. Please read again what I explained there: it's not necessarily an example of wrong smaps output, it's reasonable smaps output for a reasonable case. Yes, maybe Documentation/filesystems/proc.txt should explain "THPeligble" a little better - "eligible for allocating THP pages" rather than just "eligible for THP pages" would be good enough? we don't want to write a book about the various cases. Oh, and the "THPeligible" output lines up very nicely there in proc.txt: could the actual alignment of that 0 or 1 be fixed in smaps itself too? Thanks, Hugh
On 6/12/19 11:44 AM, Hugh Dickins wrote: > On Mon, 10 Jun 2019, Yang Shi wrote: >> On 6/7/19 8:58 PM, Hugh Dickins wrote: >>> Yes, that is correct; and correctly placed. But a little more is needed: >>> see how mm/memory.c's transhuge_vma_suitable() will only allow a pmd to >>> be used instead of a pte if the vma offset and size permit. smaps should >>> not report a shmem vma as THPeligible if its offset or size prevent it. >>> >>> And I see that should also be fixed on anon vmas: at present smaps >>> reports even a 4kB anon vma as THPeligible, which is not right. >>> Maybe a test like transhuge_vma_suitable() can be added into >>> transparent_hugepage_enabled(), to handle anon and shmem together. >>> I say "like transhuge_vma_suitable()", because that function needs >>> an address, which here you don't have. >> Thanks for the remind. Since we don't have an address I'm supposed we just >> need check if the vma's size is big enough or not other than other alignment >> check. >> >> And, I'm wondering whether we could reuse transhuge_vma_suitable() by passing >> in an impossible address, i.e. -1 since it is not a valid userspace address. >> It can be used as and indicator that this call is from THPeligible context. > Perhaps, but sounds like it will abuse and uglify transhuge_vma_suitable() > just for smaps. Would passing transhuge_vma_suitable() the address > ((vma->vm_end & HPAGE_PMD_MASK) - HPAGE_PMD_SIZE) > give the the correct answer in all cases? Yes, it looks better. > >>> The anon offset situation is interesting: usually anon vm_pgoff is >>> initialized to fit with its vm_start, so the anon offset check passes; >>> but I wonder what happens after mremap to a different address - does >>> transhuge_vma_suitable() then prevent the use of pmds where they could >>> actually be used? Not a Number#1 priority to investigate or fix here! >>> but a curiosity someone might want to look into. >> Will mark on my TODO list. >> >>> Even with your changes >>> ShmemPmdMapped: 4096 kB >>> THPeligible: 0 >>> will easily be seen: THPeligible reflects whether a huge page can be >>> allocated and mapped by pmd in that vma; but if something else already >>> allocated the huge page earlier, it will be mapped by pmd in this vma >>> if offset and size allow, whatever THPeligible says. We could change >>> transhuge_vma_suitable() to force ptes in that case, but it would be >>> a silly change, just to make what smaps shows easier to explain. >> Where did this come from? From the commit log? If so it is the example for >> the wrong smap output. If that case really happens, I think we could document >> it since THPeligible should just show the current status. > Please read again what I explained there: it's not necessarily an example > of wrong smaps output, it's reasonable smaps output for a reasonable case. > > Yes, maybe Documentation/filesystems/proc.txt should explain "THPeligble" > a little better - "eligible for allocating THP pages" rather than just > "eligible for THP pages" would be good enough? we don't want to write > a book about the various cases. Yes, I agree. > > Oh, and the "THPeligible" output lines up very nicely there in proc.txt: > could the actual alignment of that 0 or 1 be fixed in smaps itself too? Sure. Thanks, Yang > > Thanks, > Hugh
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 165ea46..5881e82 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma) { if (vma_is_anonymous(vma)) return __transparent_hugepage_enabled(vma); - if (vma_is_shmem(vma) && shmem_huge_enabled(vma)) - return __transparent_hugepage_enabled(vma); + if (vma_is_shmem(vma)) + return shmem_huge_enabled(vma); return false; } diff --git a/mm/shmem.c b/mm/shmem.c index 2275a0f..6f09a31 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma) loff_t i_size; pgoff_t off; + if ((vma->vm_flags & VM_NOHUGEPAGE) || + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) + return false; if (shmem_huge == SHMEM_HUGE_FORCE) return true; if (shmem_huge == SHMEM_HUGE_DENY)
The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") introduced THPeligible bit for processes' smaps. But, when checking the eligibility for shmem vma, __transparent_hugepage_enabled() is called to override the result from shmem_huge_enabled(). It may result in the anonymous vma's THP flag override shmem's. For example, running a simple test which create THP for shmem, but with anonymous THP disabled, when reading the process's smaps, it may show: 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test Size: 4096 kB ... [snip] ... ShmemPmdMapped: 4096 kB ... [snip] ... THPeligible: 0 And, /proc/meminfo does show THP allocated and PMD mapped too: ShmemHugePages: 4096 kB ShmemPmdMapped: 4096 kB This doesn't make too much sense. The anonymous THP flag should not intervene shmem THP. Calling shmem_huge_enabled() with checking MMF_DISABLE_THP sounds good enough. And, we could skip stack and dax vma check since we already checked if the vma is shmem already. Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> --- v2: Check VM_NOHUGEPAGE per Michal Hocko mm/huge_memory.c | 4 ++-- mm/shmem.c | 3 +++ 2 files changed, 5 insertions(+), 2 deletions(-)