Message ID | alpine.DEB.2.21.1809241227370.241621@chino.kir.corp.google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] mm, thp: always specify ineligible vmas as nh in smaps | expand |
On Mon 24-09-18 12:30:07, David Rientjes wrote: > Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") > introduced a regression in that userspace cannot always determine the set > of vmas where thp is ineligible. > > Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps > to determine if a vma is eligible to be backed by hugepages. I was under impression that nh resp hg flags only tell about the madvise status. How do you exactly use these flags in an application? Your eligible rules as defined here: > + [*] A process mapping is eligible to be backed by transparent hugepages (thp) > + depending on system-wide settings and the mapping itself. See > + Documentation/admin-guide/mm/transhuge.rst for default behavior. If a > + mapping has a flag of "nh", it is not eligible to be backed by hugepages > + in any condition, either because of prctl(PR_SET_THP_DISABLE) or > + madvise(MADV_NOHUGEPAGE). PR_SET_THP_DISABLE takes precedence over any > + MADV_HUGEPAGE. doesn't seem to match the reality. I do not see all the file backed mappings to be nh marked. So is this really about eligibility rather than the madvise status? Maybe it is just the above documentation that needs to be updated. That being said, I do not object to the patch, I am just trying to understand what is the intended usage for the flag that does try to say more than the madvise status.
On Mon 24-09-18 21:56:03, Michal Hocko wrote: > On Mon 24-09-18 12:30:07, David Rientjes wrote: > > Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") > > introduced a regression in that userspace cannot always determine the set > > of vmas where thp is ineligible. > > > > Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps > > to determine if a vma is eligible to be backed by hugepages. > > I was under impression that nh resp hg flags only tell about the madvise > status. How do you exactly use these flags in an application? > > Your eligible rules as defined here: > > > + [*] A process mapping is eligible to be backed by transparent hugepages (thp) > > + depending on system-wide settings and the mapping itself. See > > + Documentation/admin-guide/mm/transhuge.rst for default behavior. If a > > + mapping has a flag of "nh", it is not eligible to be backed by hugepages > > + in any condition, either because of prctl(PR_SET_THP_DISABLE) or > > + madvise(MADV_NOHUGEPAGE). PR_SET_THP_DISABLE takes precedence over any > > + MADV_HUGEPAGE. > > doesn't seem to match the reality. I do not see all the file backed > mappings to be nh marked. So is this really about eligibility rather > than the madvise status? Maybe it is just the above documentation that > needs to be updated. > > That being said, I do not object to the patch, I am just trying to > understand what is the intended usage for the flag that does try to say > more than the madvise status. And moreover, how is the PR_SET_THP_DISABLE any different from the global THP disabled case. Do we want to set all vmas to nh as well?
On 9/24/18 10:02 PM, Michal Hocko wrote: > On Mon 24-09-18 21:56:03, Michal Hocko wrote: >> On Mon 24-09-18 12:30:07, David Rientjes wrote: >>> Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") >>> introduced a regression in that userspace cannot always determine the set >>> of vmas where thp is ineligible. >>> >>> Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps >>> to determine if a vma is eligible to be backed by hugepages. >> >> I was under impression that nh resp hg flags only tell about the madvise >> status. How do you exactly use these flags in an application? >> >> Your eligible rules as defined here: >> >>> + [*] A process mapping is eligible to be backed by transparent hugepages (thp) >>> + depending on system-wide settings and the mapping itself. See >>> + Documentation/admin-guide/mm/transhuge.rst for default behavior. If a >>> + mapping has a flag of "nh", it is not eligible to be backed by hugepages >>> + in any condition, either because of prctl(PR_SET_THP_DISABLE) or >>> + madvise(MADV_NOHUGEPAGE). PR_SET_THP_DISABLE takes precedence over any >>> + MADV_HUGEPAGE. >> >> doesn't seem to match the reality. I do not see all the file backed >> mappings to be nh marked. So is this really about eligibility rather >> than the madvise status? Maybe it is just the above documentation that >> needs to be updated. Yeah the change from madvise to eligibility in the doc seems to go too far. >> That being said, I do not object to the patch, I am just trying to >> understand what is the intended usage for the flag that does try to say >> more than the madvise status. > > And moreover, how is the PR_SET_THP_DISABLE any different from the > global THP disabled case. Do we want to set all vmas to nh as well? Probably not. It's easy to check the global status, but is it possible to query for the prctl flags of a process? We are looking at process or even vma-specific flags here. If the prctl was historically implemented via VM_NOHUGEPAGE and thus reported as such in smaps, it makes sense to do so even with the MMF_ flag IMHO?
On Mon 24-09-18 22:43:49, Vlastimil Babka wrote: > On 9/24/18 10:02 PM, Michal Hocko wrote: > > On Mon 24-09-18 21:56:03, Michal Hocko wrote: [...] > >> That being said, I do not object to the patch, I am just trying to > >> understand what is the intended usage for the flag that does try to say > >> more than the madvise status. > > > > And moreover, how is the PR_SET_THP_DISABLE any different from the > > global THP disabled case. Do we want to set all vmas to nh as well? > > Probably not. It's easy to check the global status, but is it possible > to query for the prctl flags of a process? Dunno but I suspect there is no way to check for this. > We are looking at process or > even vma-specific flags here. If the prctl was historically implemented > via VM_NOHUGEPAGE and thus reported as such in smaps, it makes sense to > do so even with the MMF_ flag IMHO? Yes if this breaks some userspace which relied on the previous behavior. But if nothing really broke then I guess it would be better to have the semantic as clear as possible. Go and check the global status to make the whole picture doesn't look very sound to me. On the other hand this VMA has a madvise flag on it sounds quite clear and you know what to expect at least. Sure the hint might be ignored in the end but well, these are hints they do not guarantee anything after all.
On Mon, 24 Sep 2018, Vlastimil Babka wrote: > On 9/24/18 10:02 PM, Michal Hocko wrote: > > On Mon 24-09-18 21:56:03, Michal Hocko wrote: > >> On Mon 24-09-18 12:30:07, David Rientjes wrote: > >>> Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") > >>> introduced a regression in that userspace cannot always determine the set > >>> of vmas where thp is ineligible. > >>> > >>> Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps > >>> to determine if a vma is eligible to be backed by hugepages. > >> > >> I was under impression that nh resp hg flags only tell about the madvise > >> status. How do you exactly use these flags in an application? > >> This is used to identify heap mappings that should be able to fault thp but do not, and they normally point to a low-on-memory or fragmentation issue. After commit 1860033237d4, our users of PR_SET_THP_DISABLE no longer show "nh" for their heap mappings so they get reported as having a low thp ratio when in reality it is disabled. It is also used in automated testing to ensure that vmas get disabled for thp appropriately and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced this, and those tests now break. > >> Your eligible rules as defined here: > >> > >>> + [*] A process mapping is eligible to be backed by transparent hugepages (thp) > >>> + depending on system-wide settings and the mapping itself. See > >>> + Documentation/admin-guide/mm/transhuge.rst for default behavior. If a > >>> + mapping has a flag of "nh", it is not eligible to be backed by hugepages > >>> + in any condition, either because of prctl(PR_SET_THP_DISABLE) or > >>> + madvise(MADV_NOHUGEPAGE). PR_SET_THP_DISABLE takes precedence over any > >>> + MADV_HUGEPAGE. > >> > >> doesn't seem to match the reality. I do not see all the file backed > >> mappings to be nh marked. So is this really about eligibility rather > >> than the madvise status? Maybe it is just the above documentation that > >> needs to be updated. > > Yeah the change from madvise to eligibility in the doc seems to go too far. > I'll reword this to explicitly state that "hg" and "nh" mappings either allow or disallow thp backing.
On Tue 25-09-18 12:52:09, David Rientjes wrote: > On Mon, 24 Sep 2018, Vlastimil Babka wrote: > > > On 9/24/18 10:02 PM, Michal Hocko wrote: > > > On Mon 24-09-18 21:56:03, Michal Hocko wrote: > > >> On Mon 24-09-18 12:30:07, David Rientjes wrote: > > >>> Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") > > >>> introduced a regression in that userspace cannot always determine the set > > >>> of vmas where thp is ineligible. > > >>> > > >>> Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps > > >>> to determine if a vma is eligible to be backed by hugepages. > > >> > > >> I was under impression that nh resp hg flags only tell about the madvise > > >> status. How do you exactly use these flags in an application? > > >> > > This is used to identify heap mappings that should be able to fault thp > but do not, and they normally point to a low-on-memory or fragmentation > issue. After commit 1860033237d4, our users of PR_SET_THP_DISABLE no > longer show "nh" for their heap mappings so they get reported as having a > low thp ratio when in reality it is disabled. I am still not sure I understand the issue completely. How are PR_SET_THP_DISABLE users any different from the global THP disabled case? Is this only about the scope? E.g the one who checks for the state cannot check the PR_SET_THP_DISABLE state? Besides that what are consequences of the low ratio? Is this an example of somebody using the prctl and still complaining or an external observer trying to do something useful which ends up doing contrary? > It is also used in > automated testing to ensure that vmas get disabled for thp appropriately > and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced > this, and those tests now break. This sounds like a bit of an abuse to me. It shows how an internal implementation detail leaks out to the userspace which is something we should try to avoid. > > >> Your eligible rules as defined here: > > >> > > >>> + [*] A process mapping is eligible to be backed by transparent hugepages (thp) > > >>> + depending on system-wide settings and the mapping itself. See > > >>> + Documentation/admin-guide/mm/transhuge.rst for default behavior. If a > > >>> + mapping has a flag of "nh", it is not eligible to be backed by hugepages > > >>> + in any condition, either because of prctl(PR_SET_THP_DISABLE) or > > >>> + madvise(MADV_NOHUGEPAGE). PR_SET_THP_DISABLE takes precedence over any > > >>> + MADV_HUGEPAGE. > > >> > > >> doesn't seem to match the reality. I do not see all the file backed > > >> mappings to be nh marked. So is this really about eligibility rather > > >> than the madvise status? Maybe it is just the above documentation that > > >> needs to be updated. > > > > Yeah the change from madvise to eligibility in the doc seems to go too far. > > > > I'll reword this to explicitly state that "hg" and "nh" mappings either > allow or disallow thp backing. How are you going to distinguish a regular THP-able mapping then? I am still not sure how this is supposed to work. Could you be more specific. Let's say I have a THP-able mapping (shmem resp. anon for the current implementation). What is the the matrix for hg/nh wrt. madvice/nomadvise PR_SET_THP_DISABLE and global THP enabled/disable.
On Tue, 25 Sep 2018, Michal Hocko wrote: > > This is used to identify heap mappings that should be able to fault thp > > but do not, and they normally point to a low-on-memory or fragmentation > > issue. After commit 1860033237d4, our users of PR_SET_THP_DISABLE no > > longer show "nh" for their heap mappings so they get reported as having a > > low thp ratio when in reality it is disabled. > > I am still not sure I understand the issue completely. How are PR_SET_THP_DISABLE > users any different from the global THP disabled case? Is this only > about the scope? E.g the one who checks for the state cannot check the > PR_SET_THP_DISABLE state? Besides that what are consequences of the > low ratio? Is this an example of somebody using the prctl and still > complaining or an external observer trying to do something useful which > ends up doing contrary? > Yes, that is how I found out about this. The system-wide policy can be determined from /sys/kernel/mm/transparent_hugepage/enabled. If it is "always" and heap mappings are not being backed by hugepages and lack the "nh" flag, it was considered as a likely fragmentation issue before commit 1860033237d4. After commit 1860033237d4, the heap mapping for PR_SET_THP_DISABLE users was not showing it actually is prevented from faulting thp because of policy, not because of fragmentation. > > It is also used in > > automated testing to ensure that vmas get disabled for thp appropriately > > and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced > > this, and those tests now break. > > This sounds like a bit of an abuse to me. It shows how an internal > implementation detail leaks out to the userspace which is something we > should try to avoid. > Well, it's already how this has worked for years before commit 1860033237d4 broke it. Changing the implementation in the kernel is fine as long as you don't break userspace who relies on what is exported to it and is the only way to determine if MADV_NOHUGEPAGE is preventing it from being backed by hugepages. > > I'll reword this to explicitly state that "hg" and "nh" mappings either > > allow or disallow thp backing. > > How are you going to distinguish a regular THP-able mapping then? I am > still not sure how this is supposed to work. Could you be more specific. You look for "[heap]" in smaps to determine where the heap mapping is.
On Tue, 25 Sep 2018 14:45:19 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > > > It is also used in > > > automated testing to ensure that vmas get disabled for thp appropriately > > > and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced > > > this, and those tests now break. > > > > This sounds like a bit of an abuse to me. It shows how an internal > > implementation detail leaks out to the userspace which is something we > > should try to avoid. > > > > Well, it's already how this has worked for years before commit > 1860033237d4 broke it. Changing the implementation in the kernel is fine > as long as you don't break userspace who relies on what is exported to it > and is the only way to determine if MADV_NOHUGEPAGE is preventing it from > being backed by hugepages. 1860033237d4 was over a year ago so perhaps we don't need to be too worried about restoring the old interface. In which case we have an opportunity to make improvements such as that suggested by Michal?
On Tue, 25 Sep 2018, Andrew Morton wrote: > > > > It is also used in > > > > automated testing to ensure that vmas get disabled for thp appropriately > > > > and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced > > > > this, and those tests now break. > > > > > > This sounds like a bit of an abuse to me. It shows how an internal > > > implementation detail leaks out to the userspace which is something we > > > should try to avoid. > > > > > > > Well, it's already how this has worked for years before commit > > 1860033237d4 broke it. Changing the implementation in the kernel is fine > > as long as you don't break userspace who relies on what is exported to it > > and is the only way to determine if MADV_NOHUGEPAGE is preventing it from > > being backed by hugepages. > > 1860033237d4 was over a year ago so perhaps we don't need to be > too worried about restoring the old interface. In which case > we have an opportunity to make improvements such as that suggested > by Michal? > The only way to determine if a vma was thp disabled prior to this commit was parsing VmFlags from /proc/pid/smaps. That was possible either through MADV_NOHUGEPAGE or PR_SET_THP_DISABLE. It is perfectly legitimate for a test case to check if either are being set correctly through userspace libraries or through the kernel itself in the manner in which the kernel exports this information. It is also perfectly legitimate for userspace to cull through information in the only way it is exported by the kernel to identify reasons for why applications are not having their heap backed by transparent hugepages: the mapping is disabled, the application is hitting the limit for its mem cgroup, we are low on memory, or there are fragmentation issues. Differentiating between those is something our userspace does, and was broken by 1860033237d4.
On Tue 25-09-18 15:04:06, Andrew Morton wrote: > On Tue, 25 Sep 2018 14:45:19 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > > > > > It is also used in > > > > automated testing to ensure that vmas get disabled for thp appropriately > > > > and we used "nh" since that is how PR_SET_THP_DISABLE previously enforced > > > > this, and those tests now break. > > > > > > This sounds like a bit of an abuse to me. It shows how an internal > > > implementation detail leaks out to the userspace which is something we > > > should try to avoid. > > > > > > > Well, it's already how this has worked for years before commit > > 1860033237d4 broke it. Changing the implementation in the kernel is fine > > as long as you don't break userspace who relies on what is exported to it > > and is the only way to determine if MADV_NOHUGEPAGE is preventing it from > > being backed by hugepages. > > 1860033237d4 was over a year ago so perhaps we don't need to be > too worried about restoring the old interface. In which case > we have an opportunity to make improvements such as that suggested > by Michal? Yeah, can we add a way to export PR_SET_THP_DISABLE to userspace somehow? E.g. /proc/<pid>/status. It is a process wide thing so reporting it per VMA sounds strange at best. This would also keep a sane (and currently documented) semantic for the smaps flag to be really hg - huge page advise flag nh - no-huge page advise flag
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -490,10 +490,18 @@ manner. The codes are the following: dd - do not include area into core dump sd - soft-dirty flag mm - mixed map area - hg - huge page advise flag - nh - no-huge page advise flag + hg - eligible for transparent hugepages [*] + nh - not eligible for transparent hugepages [*] mg - mergable advise flag + [*] A process mapping is eligible to be backed by transparent hugepages (thp) + depending on system-wide settings and the mapping itself. See + Documentation/admin-guide/mm/transhuge.rst for default behavior. If a + mapping has a flag of "nh", it is not eligible to be backed by hugepages + in any condition, either because of prctl(PR_SET_THP_DISABLE) or + madvise(MADV_NOHUGEPAGE). PR_SET_THP_DISABLE takes precedence over any + MADV_HUGEPAGE. + Note that there is no guarantee that every flag and associated mnemonic will be present in all further kernel releases. Things get changed, the flags may be vanished or the reverse -- new added. diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -653,13 +653,25 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) #endif #endif /* CONFIG_ARCH_HAS_PKEYS */ }; + unsigned long flags = vma->vm_flags; size_t i; + /* + * Disabling thp is possible through both MADV_NOHUGEPAGE and + * PR_SET_THP_DISABLE. Both historically used VM_NOHUGEPAGE. Since + * the introduction of MMF_DISABLE_THP, however, userspace needs the + * ability to detect vmas where thp is not eligible in the same manner. + */ + if (vma->vm_mm && test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) { + flags &= ~VM_HUGEPAGE; + flags |= VM_NOHUGEPAGE; + } + seq_puts(m, "VmFlags: "); for (i = 0; i < BITS_PER_LONG; i++) { if (!mnemonics[i][0]) continue; - if (vma->vm_flags & (1UL << i)) { + if (flags & (1UL << i)) { seq_putc(m, mnemonics[i][0]); seq_putc(m, mnemonics[i][1]); seq_putc(m, ' ');
Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") introduced a regression in that userspace cannot always determine the set of vmas where thp is ineligible. Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps to determine if a vma is eligible to be backed by hugepages. Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to be disabled and emit "nh" as a flag for the corresponding vmas as part of /proc/pid/smaps. After the commit, thp is disabled by means of an mm flag and "nh" is not emitted. This causes smaps parsing libraries to assume a vma is eligible for thp and ends up puzzling the user on why its memory is not backed by thp. This also clears the "hg" flag to make the behavior of MADV_HUGEPAGE and PR_SET_THP_DISABLE definitive. Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") Signed-off-by: David Rientjes <rientjes@google.com> --- v2: - clear VM_HUGEPAGE per Vlastimil - update Documentation/filesystems/proc.txt to be explicit Documentation/filesystems/proc.txt | 12 ++++++++++-- fs/proc/task_mmu.c | 14 +++++++++++++- 2 files changed, 23 insertions(+), 3 deletions(-)