Message ID | 20240110173203.3419437-1-ryan.roberts@arm.com (mailing list archive) |
---|---|
State | New |
Series | [v2] tools/mm: Add thpmaps script to dump THP usage info |
On 1/10/24 09:32, Ryan Roberts wrote:
...
> options:
>   -h, --help      show this help message and exit
>   --pid pid       Process id of the target process. Exactly one
>                   of --pid and --cgroup must be provided.
>   --cgroup path   Path to the target cgroup in sysfs. Iterates
>                   over every pid in the cgroup and its children.
>                   Get global stats by passing in the root cgroup

Hi Ryan,

Yes, this version is fairly effective at getting global stats now.

I've got some proposed minor tweaks below, and a few questions. Let me
start with the questions:

1) When I run this on an older 6.4.8-based kernel:

    # ./thpmaps --cgroup /sys/fs/cgroup --cont 128K --cont 512K --cont 1M \
          --cont 2M --cont 512M --summary

, I get this output:

    file-thp-aligned-524288kB:      36175872 kB (95%)
    file-thp-partial:                 856640 kB ( 2%)
    file-cont-aligned-128kB:        37032320 kB (97%)
    file-cont-aligned-512kB:        36597760 kB (96%)
    file-cont-aligned-1024kB:       36597760 kB (96%)
    file-cont-aligned-2048kB:       36595712 kB (96%)
    file-cont-aligned-524288kB:     36175872 kB (95%)

Is it true that the above is basically "normal" 512MB THP in action?
And all of the "cont" entries are just that way because we can't
really tell mTHP/cont apart from normal THP?

2) On an mTHP kernel with the latest patchsets (arm64, 64K page size), I
*think* I cannot turn off mTHP. I'm still teasing apart how much of this
is an instrumentation error, and how much is a measurement problem (with
the test suite). And maybe I'm wrong entirely. But the "never" option
doesn't seem to have an effect. Unless the latest version of the testsuite
is doing something new, sigh.
$ for f in $(find /sys/kernel/mm/transparent_hugepage/ -name enabled); do echo "$f: $(cat $f)"; done
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/enabled: always inherit madvise [never]
/sys/kernel/mm/transparent_hugepage/enabled: always madvise [never]
/sys/kernel/mm/transparent_hugepage/hugepages-262144kB/enabled: always inherit madvise [never]
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled: always inherit madvise [never]
/sys/kernel/mm/transparent_hugepage/hugepages-32768kB/enabled: always inherit madvise [never]
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled: always inherit madvise [never]
/sys/kernel/mm/transparent_hugepage/hugepages-16384kB/enabled: always inherit madvise [never]
/sys/kernel/mm/transparent_hugepage/hugepages-524288kB/enabled: always inherit madvise [never]
/sys/kernel/mm/transparent_hugepage/hugepages-8192kB/enabled: always inherit madvise [never]
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/enabled: always inherit madvise [never]
/sys/kernel/mm/transparent_hugepage/hugepages-65536kB/enabled: always inherit madvise [never]
/sys/kernel/mm/transparent_hugepage/hugepages-131072kB/enabled: always inherit madvise [never]
/sys/kernel/mm/transparent_hugepage/hugepages-4096kB/enabled: always inherit madvise [never]

Any quick thoughts? Don't waste any time on this, it's probably
operator error. Just in case, though.

> (e.g. /sys/fs/cgroup for cgroup-v2 or
> /sys/fs/cgroup/pids for cgroup-v1). Exactly one
> of --pid and --cgroup must be provided.

Maybe we could add "--global" to that list. That would look, in order,
inside cgroups2 and cgroups, for a list of pids, and then run as if
--cgroup /sys/fs/cgroup or --cgroup /sys/fs/cgroup/pids were specified.

It's nicer than failing out. And it's also directly useful. I would be
running my above command like this, instead:

    # ./thpmaps --global --cont 128K --cont 512K --cont 1M \
          --cont 2M --cont 512M --summary

thanks,
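For reference, the sysfs listing above can be interpreted programmatically. The following is a minimal sketch, not part of thpmaps: the function names `selected_mode` and `effective_modes` are hypothetical, and the handling of "inherit" assumes it defers to the top-level `enabled` setting, as the kernel's transparent hugepage documentation describes.

```python
import re


def selected_mode(enabled_text: str) -> str:
    """Return the bracketed (currently active) token of an 'enabled' file."""
    match = re.search(r"\[(\w+)\]", enabled_text)
    if match is None:
        raise ValueError(f"no selected mode in {enabled_text!r}")
    return match.group(1)


def effective_modes(top_level: str, per_size: dict[str, str]) -> dict[str, str]:
    """Resolve each per-size setting, mapping 'inherit' to the top-level mode.

    top_level is the content of .../transparent_hugepage/enabled; per_size
    maps a size label (e.g. '512kB') to the content of its 'enabled' file.
    """
    top = selected_mode(top_level)
    resolved = {}
    for size, text in per_size.items():
        mode = selected_mode(text)
        resolved[size] = top if mode == "inherit" else mode
    return resolved
```

With the values shown in the listing, every size resolves to "never", which matches the later finding that the memory being measured was hugetlb rather than THP.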
On 1/10/24 3:21 PM, John Hubbard wrote:
...
> 2) On an mTHP kernel with the latest patchsets (arm64, 64K page size), I
> *think* I cannot turn off mTHP. I'm still teasing apart how much of this
> is an instrumentation error, and how much is a measurement problem (with
> the test suite). And maybe I'm wrong entirely. But the "never" option
> doesn't seem to have an effect. Unless the latest version of the testsuite
> is doing something new, sigh.

OK, it turns out that the test suite changed over to use hugetlbfs
on this version. So the performance remained high, with or without
THP settings. heh. Please disregard this part. :)

thanks,
On 1/10/24 16:11, John Hubbard wrote:
> On 1/10/24 3:21 PM, John Hubbard wrote:
> ...
>> 2) On an mTHP kernel with the latest patchsets (arm64, 64K page size), I
>> *think* I cannot turn off mTHP. I'm still teasing apart how much of this
>> is an instrumentation error, and how much is a measurement problem (with
>> the test suite). And maybe I'm wrong entirely. But the "never" option
>> doesn't seem to have an effect. Unless the latest version of the testsuite
>> is doing something new, sigh.
>
> OK, it turns out that the test suite changed over to use hugetlbfs
> on this version. So the performance remained high, with or without
> THP settings. heh. Please disregard this part. :)
>

...although, this does turn out to be informative with respect to thpmaps.
Because, look at this case:

Container app does this:

    umount /dev/shm
    mount -t tmpfs tmpfs -o huge=always /dev/shm; findmnt /dev/shm

...and then monitoring it:

    # ./thpmaps --cgroup /sys/fs/cgroup --cont 128K --cont 512K --cont 1M \
          --cont 2M --cont 512M --summary

    file-thp-aligned-524288kB:     105906176 kB (97%)
    file-thp-partial:                1415680 kB ( 1%)
    file-cont-aligned-128kB:       107321600 kB (99%)
    file-cont-aligned-512kB:       106886656 kB (98%)
    file-cont-aligned-1024kB:      106886144 kB (98%)
    file-cont-aligned-2048kB:      106883072 kB (98%)
    file-cont-aligned-524288kB:    105906176 kB (97%)

So what I'm trying to say here is, it is still difficult to tell what's
going on. In particular, hugetlbfs vs. [m]THP is not yet clear.

thanks,
On 10/01/2024 23:21, John Hubbard wrote:
> On 1/10/24 09:32, Ryan Roberts wrote:
> ...
>> options:
>>   -h, --help      show this help message and exit
>>   --pid pid       Process id of the target process. Exactly one
>>                   of --pid and --cgroup must be provided.
>>   --cgroup path   Path to the target cgroup in sysfs. Iterates
>>                   over every pid in the cgroup and its children.
>>                   Get global stats by passing in the root cgroup
>
> Hi Ryan,
>
> Yes, this version is fairly effective at getting global stats now.
>
> I've got some proposed minor tweaks below, and a few questions. Let me
> start with the questions:
>
> 1) When I run this on an older 6.4.8-based kernel:
>
>     # ./thpmaps --cgroup /sys/fs/cgroup --cont 128K --cont 512K --cont 1M \
>           --cont 2M --cont 512M --summary
>
> , I get this output:
>
>     file-thp-aligned-524288kB:      36175872 kB (95%)
>     file-thp-partial:                 856640 kB ( 2%)
>     file-cont-aligned-128kB:        37032320 kB (97%)
>     file-cont-aligned-512kB:        36597760 kB (96%)
>     file-cont-aligned-1024kB:       36597760 kB (96%)
>     file-cont-aligned-2048kB:       36595712 kB (96%)
>     file-cont-aligned-524288kB:     36175872 kB (95%)
>
>
> Is it true that the above is basically "normal" 512MB THP in action?

No: the "file" part of the counter name means it is file (not anon). So this is
not mTHP, which would always be anon (e.g. "anon-cont-aligned-128kB"). Based on
your follow-up mail, I would guess this is mostly hugetlb memory rather than
actual page cache memory, but they are both getting lumped into those "file"
labels.

> And all of the "cont" entries are just that way because we can't
> really tell mTHP/cont apart from normal THP?

I'm not sure exactly what you are asking. The "cont" counters are counting
blocks of contiguous, naturally aligned physical memory, which are also mapped
contiguously and aligned. So a smaller --cont would always include all the
memory captured in a larger --cont. In this case, it's all the *file-backed*
memory (as highlighted in the label name) so nothing to do with (m)THP.
But where you have THP, --cont doesn't care what the underlying THP size is as
long as its requirements are met, so PMD-sized THPs would be included in e.g.
*anon*-cont-aligned-128kB.

Note that the "--cont" counters don't directly count memory that is PTE-mapped
with the contiguous bit set in the page table; it just counts memory that meets
the alignment, size and mapping requirements. On arm64 systems with the contpte
series, the contiguous bit would be used here, but it's not a part of what's
getting measured.

>
> 2) On an mTHP kernel with the latest patchsets (arm64, 64K page size), I
> *think* I cannot turn off mTHP. I'm still teasing apart how much of this
> is an instrumentation error, and how much is a measurement problem (with
> the test suite). And maybe I'm wrong entirely. But the "never" option
> doesn't seem to have an effect. Unless the latest version of the testsuite
> is doing something new, sigh.
>
> $ for f in $(find /sys/kernel/mm/transparent_hugepage/ -name enabled); do echo
> "$f: $(cat $f)"; done
> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/enabled: always inherit
> madvise [never]
> /sys/kernel/mm/transparent_hugepage/enabled: always madvise [never]
> /sys/kernel/mm/transparent_hugepage/hugepages-262144kB/enabled: always inherit
> madvise [never]
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled: always inherit
> madvise [never]
> /sys/kernel/mm/transparent_hugepage/hugepages-32768kB/enabled: always inherit
> madvise [never]
> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled: always inherit
> madvise [never]
> /sys/kernel/mm/transparent_hugepage/hugepages-16384kB/enabled: always inherit
> madvise [never]
> /sys/kernel/mm/transparent_hugepage/hugepages-524288kB/enabled: always inherit
> madvise [never]
> /sys/kernel/mm/transparent_hugepage/hugepages-8192kB/enabled: always inherit
> madvise [never]
> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/enabled: always inherit
> madvise [never]
> /sys/kernel/mm/transparent_hugepage/hugepages-65536kB/enabled: always inherit
> madvise [never]
> /sys/kernel/mm/transparent_hugepage/hugepages-131072kB/enabled: always inherit
> madvise [never]
> /sys/kernel/mm/transparent_hugepage/hugepages-4096kB/enabled: always inherit
> madvise [never]
>
> Any quick thoughts? Don't waste any time on this, it's probably
> operator error. Just in case, though.

As per your email, you're looking at hugetlb memory (as per counter label).

I have all the information to create a hugetlb-specific set of counters, so it's
not lumped in with page cache memory. You would then have counter sets of
"anon", "file" and "htlb". Would that be useful?

>
>
>> (e.g. /sys/fs/cgroup for cgroup-v2 or
>> /sys/fs/cgroup/pids for cgroup-v1). Exactly one
>> of --pid and --cgroup must be provided.
>
> Maybe we could add "--global" to that list. That would look, in order,
> inside cgroups2 and cgroups, for a list of pids, and then run as if
> --cgroup /sys/fs/cgroup or --cgroup /sys/fs/cgroup/pids were specified.

I think actually it might be better just to make global the default when neither
--pid nor --cgroup is provided? And in this case, I'll just grab all the pids
from /proc rather than traverse the cgroup hierarchy; that way it will work on
systems without cgroups. Does that work for you?

>
> It's nicer than failing out. And it's also directly useful. I would be
> running my above command like this, instead:
>
> # ./thpmaps --global --cont 128K --cont 512K --cont 1M \
> --cont 2M --cont 512M --summary
>
> thanks,
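The "grab all the pids from /proc" idea can be sketched as follows. This is only an illustration of the approach described in the mail, not the code that went into v3; the function name `list_pids` and the `proc_root` parameter (added so the function can be pointed at a test directory) are assumptions.

```python
import os


def list_pids(proc_root: str = "/proc") -> list[int]:
    """Return every pid visible under proc_root, in ascending order.

    Process directories in procfs are the entries whose names consist
    only of decimal digits; everything else (self, meminfo, ...) is skipped.
    """
    return sorted(
        int(name) for name in os.listdir(proc_root) if name.isdecimal()
    )
```

Because this never touches the cgroup hierarchy, it behaves the same on cgroup-v1, cgroup-v2 and systems built without cgroups. A process may exit between enumeration and inspection, so a caller would need to tolerate a pid disappearing.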
On 11/01/2024 11:54, Ryan Roberts wrote: > On 10/01/2024 23:21, John Hubbard wrote: >> On 1/10/24 09:32, Ryan Roberts wrote: >> ... >>> options: >>> -h, --help show this help message and exit >>> --pid pid Process id of the target process. Exactly one >>> of --pid and --cgroup must be provided. >>> --cgroup path Path to the target cgroup in sysfs. Iterates >>> over every pid in the cgroup and its children. >>> Get global stats by passing in the root cgroup >> >> Hi Ryan, >> >> Yes, this version is fairly effective at getting global stats now. >> >> I've got some proposed minor tweaks below, and a few questions. Let me >> start with the questions: >> >> 1) When I run this on an older 6.4.8-based kernel: >> >> # ./thpmaps --cgroup /sys/fs/cgroup --cont 128K --cont 512K --cont 1M \ >> --cont 2M --cont 512M --summary >> >> , I get this output: >> >> file-thp-aligned-524288kB: 36175872 kB (95%) >> file-thp-partial: 856640 kB ( 2%) >> file-cont-aligned-128kB: 37032320 kB (97%) >> file-cont-aligned-512kB: 36597760 kB (96%) >> file-cont-aligned-1024kB: 36597760 kB (96%) >> file-cont-aligned-2048kB: 36595712 kB (96%) >> file-cont-aligned-524288kB: 36175872 kB (95%) >> >> >> Is it true that the above is basically "normal" 512MB THP in action? > > No: the "file" part of the counter name means it is file (not anon). So this is > not mTHP, which would always be anon (e.g. "anon-cont-aligned-128kB"). Based on > your follow-up mail, I would guess this is mostly hugetlb memory rather than > actual page cache memory, but they are both getting lumped into those "file" labels. > >> And all of the "cont" entries are just that way because we can't >> really tell mTHP/cont apart from normal THP? > > I'm not sure exectly what you are asking. The "cont" counters are counting > blocks of contiguous, naturally aligned physical memory, which are also mapped > contiguously and aligned. So a smaller --cont would always include all the > memory captured in a larger --cont. 
In this case, its all the *file-backed* > memory (as highighted in the label name) so nothing to do with (m)THP. But where > you have THP, --cont doesn't care what the underlying THP size is as long as its > requirements are met, so PMD-sized THPs would be included in e.g. > *anon*-cont-aligned-128kB. > > Note the the "--cont" counters don't directly count memory that is PTE-mapped > with the contiguous bit set in the page table; it just counts memory that meets > the alignment, size and mapping requirements. On arm64 systems with the contpte > series, the contiguous bit would be used here, but its not a part of what's > getting measured. > >> >> 2) On an mTHP kernel with the latest patchsets (arm64, 64K page size), I >> *think* I cannot turn off mTHP. I'm still teasing apart how much of this >> is an instrumentation error, and how much is a measurement problem (with >> the test suite). And maybe I'm wrong entirely. But the "never" option >> doesn't seem to have an effect. Unless the latest version of the testsuite >> is doing something new, sigh. 
>> >> $ for f in $(find /sys/kernel/mm/transparent_hugepage/ -name enabled); do echo >> "$f: $(cat $f)"; done >> /sys/kernel/mm/transparent_hugepage/hugepages-512kB/enabled: always inherit >> madvise [never] >> /sys/kernel/mm/transparent_hugepage/enabled: always madvise [never] >> /sys/kernel/mm/transparent_hugepage/hugepages-262144kB/enabled: always inherit >> madvise [never] >> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled: always inherit >> madvise [never] >> /sys/kernel/mm/transparent_hugepage/hugepages-32768kB/enabled: always inherit >> madvise [never] >> /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled: always inherit >> madvise [never] >> /sys/kernel/mm/transparent_hugepage/hugepages-16384kB/enabled: always inherit >> madvise [never] >> /sys/kernel/mm/transparent_hugepage/hugepages-524288kB/enabled: always inherit >> madvise [never] >> /sys/kernel/mm/transparent_hugepage/hugepages-8192kB/enabled: always inherit >> madvise [never] >> /sys/kernel/mm/transparent_hugepage/hugepages-256kB/enabled: always inherit >> madvise [never] >> /sys/kernel/mm/transparent_hugepage/hugepages-65536kB/enabled: always inherit >> madvise [never] >> /sys/kernel/mm/transparent_hugepage/hugepages-131072kB/enabled: always inherit >> madvise [never] >> /sys/kernel/mm/transparent_hugepage/hugepages-4096kB/enabled: always inherit >> madvise [never] >> >> Any quick thoughts? Don't waste any time on this, it's probably >> operator error. Just in case, though. > > As per your email, you're looking at hugetlb memory (as per counter label). > > I have all the information to create a hugetlb-specific set of counters, so its > not lumped in with page cache memory. You would then have counter sets of > "anon", "file" and "htlb". Would that be useful? Or I could just filter out hugetlb memory so it doesn't appear in this tool at all? 
That would be easier implementation-wise, and probably more in line with the original intention of the tool (it's called thpmaps, after all). > >> >> >>> (e.g. /sys/fs/cgroup for cgroup-v2 or >>> /sys/fs/cgroup/pids for cgroup-v1). Exactly one >>> of --pid and --cgroup must be provided. >> >> Maybe we could add "--global" to that list. That would look, in order, >> inside cgroups2 and cgroups, for a list of pids, and then run as if >> --cgroup /sys/fs/cgroup or --cgroup /sys/fs/cgroup/pids were specified. > > I think actually it might be better just to make global the default when neither > --pid nor --cgroup are provided? And in this case, I'll just grab all the pids > from /proc rather than traverse the cgroup hierachy, that way it will work on > systems without cgroups. Does that work for you? > >> >> It's nicer than failing out. And it's also directly useful. I would be >> running my above command like this, instead: >> >> # ./thpmaps --global --cont 128K --cont 512K --cont 1M \ >> --cont 2M --cont 512M --summary >> >> thanks, >
>> As per your email, you're looking at hugetlb memory (as per counter label).
>>
>> I have all the information to create a hugetlb-specific set of counters, so it's
>> not lumped in with page cache memory. You would then have counter sets of
>> "anon", "file" and "htlb". Would that be useful?
>
> Or I could just filter out hugetlb memory so it doesn't appear in this tool at
> all? That would be easier implementation-wise, and probably more in line with
> the original intention of the tool (it's called thpmaps, after all).

+1
On 1/11/24 09:32, Ryan Roberts wrote:
...
>> I have all the information to create a hugetlb-specific set of counters, so it's
>> not lumped in with page cache memory. You would then have counter sets of
>> "anon", "file" and "htlb". Would that be useful?
>
> Or I could just filter out hugetlb memory so it doesn't appear in this tool at
> all? That would be easier implementation-wise, and probably more in line with
> the original intention of the tool (it's called thpmaps, after all).
>

That does seem better. And I spend a fair amount of time explaining to
end users that hugetlbfs != THP, so that would also avoid aggravating that
problem.

thanks,
On 1/11/24 03:54, Ryan Roberts wrote: ... > I'm not sure exectly what you are asking. The "cont" counters are counting > blocks of contiguous, naturally aligned physical memory, which are also mapped > contiguously and aligned. So a smaller --cont would always include all the > memory captured in a larger --cont. In this case, its all the *file-backed* > memory (as highighted in the label name) so nothing to do with (m)THP. But where > you have THP, --cont doesn't care what the underlying THP size is as long as its > requirements are met, so PMD-sized THPs would be included in e.g. > *anon*-cont-aligned-128kB. > > Note the the "--cont" counters don't directly count memory that is PTE-mapped > with the contiguous bit set in the page table; it just counts memory that meets > the alignment, size and mapping requirements. On arm64 systems with the contpte > series, the contiguous bit would be used here, but its not a part of what's > getting measured. > The "cont" and "naturally aligned" terms are difficult here, even though I'm familiar with the implementation. But putting on my systems monitoring hat, these terms are not helping people as much as I'd like, because: a) "Contiguous" is not really a unique situation, so measuring large pages that are "contiguous" is confusing. All folios are contiguous, and anything a pte points to is contiguous as well. So --cont really throws off the user/reader. b) "Naturally aligned" is also tricky. Because "natural" is not explained. Here it means NAPOT (naturally aligned power of two, I saw that in the riscv docs). After spending a day or two exploring running systems with this, I'd like to suggest: 1) measure "native PMD THPs" vs. pte-mapped mTHPs. This provides a lot of information: mTHP is configured as expected, and is helping or not, etc. 2) Not having to list out all the mTHP sizes would be nice. Instead, just use the possible sizes from /sys/kernel/mm/transparent_hugepage/* , unless the user specifies sizes. ... (e.g. 
/sys/fs/cgroup for cgroup-v2 or >>> /sys/fs/cgroup/pids for cgroup-v1). Exactly one >>> of --pid and --cgroup must be provided. >> >> Maybe we could add "--global" to that list. That would look, in order, >> inside cgroups2 and cgroups, for a list of pids, and then run as if >> --cgroup /sys/fs/cgroup or --cgroup /sys/fs/cgroup/pids were specified. > > I think actually it might be better just to make global the default when neither > --pid nor --cgroup are provided? And in this case, I'll just grab all the pids > from /proc rather than traverse the cgroup hierachy, that way it will work on > systems without cgroups. Does that work for you? Yes! That was my initial idea, in fact, and after over-thinking it for a while, it turned into the above. haha :) thanks,
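Suggestion 2) above (take the default sizes from /sys/kernel/mm/transparent_hugepage/*) can be illustrated with a short sketch. It assumes the per-size directories follow the `hugepages-<N>kB` naming visible in the sysfs listing earlier in the thread; the function name `thp_sizes_kb` is hypothetical and this is not taken from the thpmaps source.

```python
import re


def thp_sizes_kb(entries: list[str]) -> list[int]:
    """Extract THP sizes in kB from sysfs directory entry names.

    entries is a directory listing of the transparent_hugepage sysfs
    directory; names not of the form 'hugepages-<N>kB' are ignored.
    """
    sizes = []
    for name in entries:
        match = re.fullmatch(r"hugepages-(\d+)kB", name)
        if match:
            sizes.append(int(match.group(1)))
    return sorted(sizes)
```

On a live system the input would be something like `os.listdir("/sys/kernel/mm/transparent_hugepage")`; a user-supplied list of sizes would simply replace the result.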
On 11/01/2024 18:17, John Hubbard wrote: > On 1/11/24 03:54, Ryan Roberts wrote: > ... >> I'm not sure exectly what you are asking. The "cont" counters are counting >> blocks of contiguous, naturally aligned physical memory, which are also mapped >> contiguously and aligned. So a smaller --cont would always include all the >> memory captured in a larger --cont. In this case, its all the *file-backed* >> memory (as highighted in the label name) so nothing to do with (m)THP. But where >> you have THP, --cont doesn't care what the underlying THP size is as long as its >> requirements are met, so PMD-sized THPs would be included in e.g. >> *anon*-cont-aligned-128kB. >> >> Note the the "--cont" counters don't directly count memory that is PTE-mapped >> with the contiguous bit set in the page table; it just counts memory that meets >> the alignment, size and mapping requirements. On arm64 systems with the contpte >> series, the contiguous bit would be used here, but its not a part of what's >> getting measured. >> > > The "cont" and "naturally aligned" terms are difficult here, even though > I'm familiar with the implementation. But putting on my systems > monitoring hat, these terms are not helping people as much as I'd like, > because: > > a) "Contiguous" is not really a unique situation, so measuring large pages > that are "contiguous" is confusing. All folios are contiguous, and > anything a pte points to is contiguous as well. So --cont really > throws off the user/reader. > > b) "Naturally aligned" is also tricky. Because "natural" is not explained. > Here it means NAPOT (naturally aligned power of two, I saw that in the > riscv docs). > > After spending a day or two exploring running systems with this, I'd > like to suggest: > > 1) measure "native PMD THPs" vs. pte-mapped mTHPs. This provides a lot > of information: mTHP is configured as expected, and is helping or not, > etc. There is a difference between how a THP is mapped (PTE vs PMD) and its size. 
A PMD-sized THP can still be mapped with PTEs. So I'd rather not completely
filter out PMD-sized THPs, if that's your suggestion. But we could make a
distinction between THPs mapped by PTE and those mapped by PMD; the kernel
interface doesn't directly give us this, but we can infer it from the
AnonHugePages and *PmdMapped stats in smaps.

>
> 2) Not having to list out all the mTHP sizes would be nice. Instead,
> just use the possible sizes from /sys/kernel/mm/transparent_hugepage/* ,
> unless the user specifies sizes.

This is exactly what the tool already does. Perhaps you haven't fully understood
the counters that it outputs?

You *always* get the following counters (although note the tool *hides* all
counters whose value is 0 by default - show them with --inc-empty). This example
is for a system with 4K base pages:

    # thpmaps --pid 1 --summary --inc-empty

    anon-thp-aligned-16kB:
    anon-thp-aligned-32kB:
    anon-thp-aligned-64kB:
    anon-thp-aligned-128kB:
    anon-thp-aligned-256kB:
    anon-thp-aligned-512kB:
    anon-thp-aligned-1024kB:
    anon-thp-aligned-2048kB:
    anon-thp-unaligned-16kB:
    anon-thp-unaligned-32kB:
    anon-thp-unaligned-64kB:
    anon-thp-unaligned-128kB:
    anon-thp-unaligned-256kB:
    anon-thp-unaligned-512kB:
    anon-thp-unaligned-1024kB:
    anon-thp-unaligned-2048kB:
    anon-thp-partial:
    file-thp-aligned-16kB:
    file-thp-aligned-32kB:
    file-thp-aligned-64kB:
    file-thp-aligned-128kB:
    file-thp-aligned-256kB:
    file-thp-aligned-512kB:
    file-thp-aligned-1024kB:
    file-thp-aligned-2048kB:
    file-thp-unaligned-16kB:
    file-thp-unaligned-32kB:
    file-thp-unaligned-64kB:
    file-thp-unaligned-128kB:
    file-thp-unaligned-256kB:
    file-thp-unaligned-512kB:
    file-thp-unaligned-1024kB:
    file-thp-unaligned-2048kB:
    file-thp-partial:

So you have counters for every supported THP size in the system - they will be
different for a 64K base page system.

anon vs file: hopefully obvious

aligned vs unaligned: In both cases the THP is mapped fully and contiguously. In
the aligned cases it is mapped so that it is naturally aligned.
So a 16K THP is mapped into VA space on a 16K boundary, a 32K THP on a 32K
boundary, etc.

partial: Parts of THPs that are partially mapped into VA space.

Note this does not draw a distinction between PMD-mapped and PTE-mapped THPs.
But a THP can only be PMD-mapped if it is both PMD-aligned and PMD-sized. So
only 2 counters can include PMD-mappings; anon-thp-aligned-2048kB and
file-thp-aligned-2048kB. We can filter that out by subtracting the relevant
smaps counters from them. I could add a --ignore-pmd-mapped flag to do that? Or
I could rename all the existing counters to include "pte" and introduce 2 new
counters: anon-thp-aligned-pmd-2048kB and file-thp-aligned-pmd-2048kB?

The --cont option will add *additional* special counters, if specified. The idea
here is to provide a view on what percentage of memory is getting
contpte-mapped. So if you provide "--cont 64K" it will give you a counter
showing how much memory is in 64K, naturally aligned blocks (actually 2
counters; file and anon). Those blocks can come from fully mapped and aligned
64K THPs. But they can also come from bigger THPs - for example, if a 128K THP
is aligned on a 64K boundary (but not a 128K boundary), then it will provide 2
64K cont blocks, but it will be counted as unaligned in
anon-thp-unaligned-128kB. Or if a 2M THP is partially mapped so that only its
first 1M is mapped and aligned on a 64K boundary, then it will be counted in the
*-thp-partial counter and would add 1M to the *-cont-aligned-64kB counter.

Sorry if I've labored the point here. But I think the only thing the tool
doesn't already do that you are asking for is to differentiate PTE- vs PMD-
mappings?

>
> ...
> (e.g. /sys/fs/cgroup for cgroup-v2 or
>>>> /sys/fs/cgroup/pids for cgroup-v1). Exactly one
>>>> of --pid and --cgroup must be provided.
>>>
>>> Maybe we could add "--global" to that list.
That would look, in order, >>> inside cgroups2 and cgroups, for a list of pids, and then run as if >>> --cgroup /sys/fs/cgroup or --cgroup /sys/fs/cgroup/pids were specified. >> >> I think actually it might be better just to make global the default when neither >> --pid nor --cgroup are provided? And in this case, I'll just grab all the pids >> from /proc rather than traverse the cgroup hierachy, that way it will work on >> systems without cgroups. Does that work for you? > > Yes! That was my initial idea, in fact, and after over-thinking it for > a while, it turned into the above. haha :) OK great - implemented for v3. > > > thanks,
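The counter rules described in this mail (aligned / unaligned / partial, plus the --cont blocks) can be checked against the two worked examples with a small model. This is a sketch under stated assumptions, not the thpmaps implementation: `classify_thp` and `cont_bytes` are hypothetical names, and each mapping is modelled as a single run that is contiguous in both virtual (va) and physical (pa) address space.

```python
KB = 1024


def classify_thp(va: int, thp_size: int, mapped_len: int) -> str:
    """Pick the *-thp-* counter class for one THP mapping.

    'partial' if only part of the THP is mapped; otherwise 'aligned'
    when the virtual start address is a multiple of the THP size.
    """
    if mapped_len < thp_size:
        return "partial"
    return "aligned" if va % thp_size == 0 else "unaligned"


def cont_bytes(va: int, pa: int, length: int, block: int) -> int:
    """Bytes of the run [va, va+length) made of whole block-sized chunks
    that start on a block boundary in both virtual and physical space."""
    if (va - pa) % block != 0:
        return 0  # VA and PA can never be block-aligned at the same time
    first = -(-va // block) * block  # round va up to the next block boundary
    end = va + length
    if end <= first:
        return 0
    return (end - first) // block * block
```

For the 128K THP mapped on a 64K (but not 128K) boundary, `classify_thp(192 * KB, 128 * KB, 128 * KB)` gives "unaligned" while `cont_bytes(192 * KB, 1024 * KB, 128 * KB, 64 * KB)` gives 128K, i.e. two 64K blocks. For the 2M THP with only its first 1M mapped on a 64K boundary, the class is "partial" and the 64K cont contribution is 1M.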
On 11/01/2024 18:04, John Hubbard wrote:
> On 1/11/24 09:32, Ryan Roberts wrote:
> ...
>>> I have all the information to create a hugetlb-specific set of counters, so it's
>>> not lumped in with page cache memory. You would then have counter sets of
>>> "anon", "file" and "htlb". Would that be useful?
>>
>> Or I could just filter out hugetlb memory so it doesn't appear in this tool at
>> all? That would be easier implementation-wise, and probably more in line with
>> the original intention of the tool (it's called thpmaps, after all).
>>
>
> That does seem better. And I spend a fair amount of time explaining to
> end users that hugetlbfs != THP, so that would also avoid aggravating that
> problem as well.
>

Implemented for v3.

>
> thanks,
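How v3 filters hugetlb memory is not shown in this thread. One possible way, sketched here purely as an assumption, is to skip any VMA whose `VmFlags` line in /proc/<pid>/smaps carries the `ht` flag, which the kernel's procfs documentation lists as "area uses huge tlb pages". The function name `is_hugetlb_vma` is hypothetical.

```python
def is_hugetlb_vma(vmflags_line: str) -> bool:
    """Return True if an smaps 'VmFlags:' line carries the 'ht' flag.

    Flags are whitespace-separated two-letter codes, so the check is on
    whole tokens; 'hg' (MADV_HUGEPAGE advice) must not match.
    """
    key, _, flags = vmflags_line.partition(":")
    return key.strip() == "VmFlags" and "ht" in flags.split()
```

A scanner built this way would read each VMA's smaps block, call this on its `VmFlags` line, and drop the VMA before any THP accounting.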
On 1/12/24 02:00, Ryan Roberts wrote:
>> ...
>> After spending a day or two exploring running systems with this, I'd
>> like to suggest:
>>
>> 1) measure "native PMD THPs" vs. pte-mapped mTHPs. This provides a lot
>> of information: mTHP is configured as expected, and is helping or not,
>> etc.
>
> There is a difference between how a THP is mapped (PTE vs PMD) and its size. A
> PMD-sized THP can still be mapped with PTEs. So I'd rather not completely filter
> out PMD-sized THPs, if that's your suggestion. But we could make a distinction

It's not...

> between THPs mapped by PTE and those mapped by PMD; the kernel interface doesn't
> directly give us this, but we can infer it from the AnonHugePages and *PmdMapped
> stats in smaps.

Yes, that would be excellent!

>
>>
>> 2) Not having to list out all the mTHP sizes would be nice. Instead,
>> just use the possible sizes from /sys/kernel/mm/transparent_hugepage/* ,
>> unless the user specifies sizes.
>
> This is exactly what the tool already does. Perhaps you haven't fully understood
> the counters that it outputs?

Oh yes, we are in perfect agreement about my not understanding these
counters. :)

I'd even expound upon that a bit: despite having a fairly good working
understanding of the mTHP implementation in the kernel; despite reading and
re-reading the thpmaps documentation and peeking a number of times at
the thpmaps script; and despite poring over the thpmaps output, I am still
having a rough time with these counters.

Mainly because there is a set of hidden assumptions, many of which are
revealed below. But it's actually just a few key points that were
missing from the documentation, plus the ability to clearly see the
pte-mapped parts. And your proposed changes below look great; I've got
a few more to add and that should finish the job.

>
> You *always* get the following counters (although note the tool *hides* all

Good. It was not clear that these counters were always active.
The --cont documentation misleads the reader a bit on that matter. > counters whose value is 0 by default - show them with --inc-empty). This example > is for a system with 4K base pages: > > # thpmaps --pid 1 --summary --inc-empty > > anon-thp-aligned-16kB: > anon-thp-aligned-32kB: > anon-thp-aligned-64kB: > anon-thp-aligned-128kB: > anon-thp-aligned-256kB: > anon-thp-aligned-512kB: > anon-thp-aligned-1024kB: > anon-thp-aligned-2048kB: > anon-thp-unaligned-16kB: > anon-thp-unaligned-32kB: > anon-thp-unaligned-64kB: > anon-thp-unaligned-128kB: > anon-thp-unaligned-256kB: > anon-thp-unaligned-512kB: > anon-thp-unaligned-1024kB: > anon-thp-unaligned-2048kB: > anon-thp-partial: > file-thp-aligned-16kB: > file-thp-aligned-32kB: > file-thp-aligned-64kB: > file-thp-aligned-128kB: > file-thp-aligned-256kB: > file-thp-aligned-512kB: > file-thp-aligned-1024kB: > file-thp-aligned-2048kB: > file-thp-unaligned-16kB: > file-thp-unaligned-32kB: > file-thp-unaligned-64kB: > file-thp-unaligned-128kB: > file-thp-unaligned-256kB: > file-thp-unaligned-512kB: > file-thp-unaligned-1024kB: > file-thp-unaligned-2048kB: > file-thp-partial: > > So you have counters for every supported THP size in the system - they will be > different for a 64K base page system. > > anon vs file: hopefully obvious > > aligned vs unaligned: In both cases the THP is mapped fully and contiguously. In > the aligned cases it is mapped so that it is naturally aligned. So a 16K THP is I think we should use "aligned" or "aligned to <size>", and stop saying "naturally aligned", throughout. "Natural" adds no additional information, and it makes the reader wonder if there is some other aspect to the alignment (does natural imply PMD-mapped? etc) that they are unaware of. > mapped into VA space on a 16K boundary, a 32K THP on a 32K boundary, etc. > > partial: Parts of THPs that are partially mapped into VA space. > > Note this does not draw a distinction between PMD-mapped and PTE-mapped THPs. 
> But a THP can only be PMD-mapped if it is both PMD-aligned and PMD-sized. So > only 2 counters can include PMD-mappings; anon-thp-aligned-2048kB and > file-thp-aligned-2048kB. We can filter that out by subtracting the relevant > smaps counters from them. I could add a --ignore-pmd-mapped flag to do that? Or That would work but is relatively awkward, but...1 > I could rename all the existing counters to include "pte" and introduce 2 new > counters: anon-thp-aligned-pmd-2048kB and file-thp-aligned-pmd-2048kB? ...this would be perfect, I think. The "pte" would help self-document, and separately things out allows for a clearer view into the behavior. > > The --cont option will add *additional* special counters, if specified. The idea > here is to provide a view on what percentage of memory is getting > contpte-mapped. So if you provide "--cont 64K" it will give you a counter > showing how much memory is in 64K, naturally aligned blocks (actually 2 > counters; file and anon). Those blocks can come from fully mapped and aligned > 64K THPs. But they can also come from bigger THPs - for example, if a 128K THP > is aligned on a 64K boundary (but not a 128K boundary), then it will provide 2 > 64K cont blocks, but it will be counted as unaligned in > anon-thp-unaligned-128kB. Or if a 2M THP is partially mapped so that only it's > first 1M is mapped and aligned on a 64K boundary, then it will be counted in the > *-thp-partial counter and would add 1M to the *-cont-aligned-64kB counter. > Interesting, and completely undocumented until now. Let's add this to the tool's help output! In fact, all of the above. > > Sorry if I've labored the point here. But I think the only thing the tool > doesn't already do that you are asking for is to differentiate PTE- vs PMD- > mappings? That, plus explain itself, yes. :) > >> >> ... >> (e.g. /sys/fs/cgroup for cgroup-v2 or >>>>> /sys/fs/cgroup/pids for cgroup-v1). Exactly one >>>>> of --pid and --cgroup must be provided. 
>>>> >>>> Maybe we could add "--global" to that list. That would look, in order, >>>> inside cgroups2 and cgroups, for a list of pids, and then run as if >>>> --cgroup /sys/fs/cgroup or --cgroup /sys/fs/cgroup/pids were specified. >>> >>> I think actually it might be better just to make global the default when neither >>> --pid nor --cgroup are provided? And in this case, I'll just grab all the pids >>> from /proc rather than traverse the cgroup hierachy, that way it will work on >>> systems without cgroups. Does that work for you? >> >> Yes! That was my initial idea, in fact, and after over-thinking it for >> a while, it turned into the above. haha :) > > OK great - implemented for v3. > Sweet! thanks,
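The PTE-vs-PMD split discussed above (inferring PMD mappings from smaps and subtracting them from the PMD-sized aligned counters) could be sketched roughly as below. This is an illustration only, not code from thpmaps: the helper names pmd_mapped_kb and split_pmd_pte are hypothetical, and the sketch assumes the smaps fields AnonHugePages, ShmemPmdMapped and FilePmdMapped are reported in kB, one value per VMA.

```python
import re

# smaps fields that report PMD-mapped THP memory (assumed to be in kB).
PMD_FIELDS = ("AnonHugePages", "ShmemPmdMapped", "FilePmdMapped")


def pmd_mapped_kb(smaps_text):
    """Sum each PMD-mapped field (kB) over all VMAs in smaps-formatted text."""
    totals = dict.fromkeys(PMD_FIELDS, 0)
    for line in smaps_text.splitlines():
        m = re.match(r"(\w+):\s+(\d+) kB", line)
        if m and m.group(1) in totals:
            totals[m.group(1)] += int(m.group(2))
    return totals


def split_pmd_pte(aligned_pmd_size_kb, pmd_kb):
    """Split a PMD-sized aligned counter into (pte-mapped kB, pmd-mapped kB).

    Hypothetical helper: assumes pmd_kb was measured over the same VMAs as
    the aligned counter, so the remainder must be PTE-mapped.
    """
    return aligned_pmd_size_kb - pmd_kb, pmd_kb
```

For example, if the anon aligned 2048kB counter reads 8192 kB and smaps reports 6144 kB of AnonHugePages, the sketch attributes 2048 kB to PTE mappings and 6144 kB to PMD mappings.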
On 12/01/2024 19:14, John Hubbard wrote:
> On 1/12/24 02:00, Ryan Roberts wrote:
>>> ...
>>> After spending a day or two exploring running systems with this, I'd
>>> like to suggest:
>>>
>>> 1) measure "native PMD THPs" vs. pte-mapped mTHPs. This provides a lot
>>> of information: mTHP is configured as expected, and is helping or not,
>>> etc.
>>
>> There is a difference between how a THP is mapped (PTE vs PMD) and its size. A
>> PMD-sized THP can still be mapped with PTEs. So I'd rather not completely filter
>> out PMD-sized THPs, if that's your suggestion. But we could make a distinction
>
> It's not...
>
>> between THPs mapped by PTE and those mapped by PMD; the kernel interface doesn't
>> directly give us this, but we can infer it from the AnonHugePages and *PmdMapped
>> stats in smaps.
>
> Yes, that would be excellent!
>
>>
>>>
>>> 2) Not having to list out all the mTHP sizes would be nice. Instead,
>>> just use the possible sizes from /sys/kernel/mm/transparent_hugepage/* ,
>>> unless the user specifies sizes.
>>
>> This is exactly what the tool already does. Perhaps you haven't fully understood
>> the counters that it outputs?
>
> Oh yes, we are in perfect agreement about my not understanding these
> counters. :) I'd even expound upon that a bit: despite having a fairly
> good working understanding of the mTHP implementation in the kernel;
> despite reading and re-reading the thpmaps documentation and peeking a
> number of times at the thpmaps script; and despite poring over the
> thpmaps output, I am still having a rough time with these counters.
> Mainly because there is a set of hidden assumptions, many of which are
> revealed below.

Oh dear sorry about that. Thanks for sticking with it and helping me get it right...

>
> But it's actually just a few key points that were missing from the
> documentation, plus the ability to clearly see the pte-mapped parts. And
> your proposed changes below look great; I've got a few more to add and
> that should finish the job.

OK good!

>
>>
>> You *always* get the following counters (although note the tool *hides* all
>
> Good. It was not clear that these counters were always active. The --cont
> documentation misleads the reader a bit on that matter.
>
>> counters whose value is 0 by default - show them with --inc-empty). This example
>> is for a system with 4K base pages:
>>
>> # thpmaps --pid 1 --summary --inc-empty
>>
>> anon-thp-aligned-16kB:
>> anon-thp-aligned-32kB:
>> anon-thp-aligned-64kB:
>> anon-thp-aligned-128kB:
>> anon-thp-aligned-256kB:
>> anon-thp-aligned-512kB:
>> anon-thp-aligned-1024kB:
>> anon-thp-aligned-2048kB:
>> anon-thp-unaligned-16kB:
>> anon-thp-unaligned-32kB:
>> anon-thp-unaligned-64kB:
>> anon-thp-unaligned-128kB:
>> anon-thp-unaligned-256kB:
>> anon-thp-unaligned-512kB:
>> anon-thp-unaligned-1024kB:
>> anon-thp-unaligned-2048kB:
>> anon-thp-partial:
>> file-thp-aligned-16kB:
>> file-thp-aligned-32kB:
>> file-thp-aligned-64kB:
>> file-thp-aligned-128kB:
>> file-thp-aligned-256kB:
>> file-thp-aligned-512kB:
>> file-thp-aligned-1024kB:
>> file-thp-aligned-2048kB:
>> file-thp-unaligned-16kB:
>> file-thp-unaligned-32kB:
>> file-thp-unaligned-64kB:
>> file-thp-unaligned-128kB:
>> file-thp-unaligned-256kB:
>> file-thp-unaligned-512kB:
>> file-thp-unaligned-1024kB:
>> file-thp-unaligned-2048kB:
>> file-thp-partial:
>>
>> So you have counters for every supported THP size in the system - they will be
>> different for a 64K base page system.
>>
>> anon vs file: hopefully obvious
>>
>> aligned vs unaligned: In both cases the THP is mapped fully and contiguously. In
>> the aligned cases it is mapped so that it is naturally aligned. So a 16K THP is
>
> I think we should use "aligned" or "aligned to <size>", and stop saying
> "naturally aligned", throughout. "Natural" adds no additional
> information, and it makes the reader wonder if there is some other
> aspect to the alignment (does natural imply PMD-mapped? etc) that they
> are unaware of.

OK. I thought "naturally aligned" was a fairly standard and well-understood
term. Google says "We call a datum naturally aligned if its address is aligned
to its size". But I'm happy to use the phrase "aligned to <size>" if that's clearer.

>
>
>> mapped into VA space on a 16K boundary, a 32K THP on a 32K boundary, etc.
>>
>> partial: Parts of THPs that are partially mapped into VA space.
>>
>> Note this does not draw a distinction between PMD-mapped and PTE-mapped THPs.
>> But a THP can only be PMD-mapped if it is both PMD-aligned and PMD-sized. So
>> only 2 counters can include PMD-mappings; anon-thp-aligned-2048kB and
>> file-thp-aligned-2048kB. We can filter that out by subtracting the relevant
>> smaps counters from them. I could add a --ignore-pmd-mapped flag to do that? Or
>
> That would work but is relatively awkward, but...
>
>> I could rename all the existing counters to include "pte" and introduce 2 new
>> counters: anon-thp-aligned-pmd-2048kB and file-thp-aligned-pmd-2048kB?
>
> ...this would be perfect, I think. The "pte" would help self-document, and
> separating things out allows for a clearer view into the behavior.
>
>>
>> The --cont option will add *additional* special counters, if specified. The idea
>> here is to provide a view on what percentage of memory is getting
>> contpte-mapped. So if you provide "--cont 64K" it will give you a counter
>> showing how much memory is in 64K, naturally aligned blocks (actually 2
>> counters; file and anon). Those blocks can come from fully mapped and aligned
>> 64K THPs. But they can also come from bigger THPs - for example, if a 128K THP
>> is aligned on a 64K boundary (but not a 128K boundary), then it will provide 2
>> 64K cont blocks, but it will be counted as unaligned in
>> anon-thp-unaligned-128kB. Or if a 2M THP is partially mapped so that only its
>> first 1M is mapped and aligned on a 64K boundary, then it will be counted in the
>> *-thp-partial counter and would add 1M to the *-cont-aligned-64kB counter.
>>
>
> Interesting, and completely undocumented until now. Let's add this to the
> tool's help output! In fact, all of the above.

Well it already has this, which I intended to convey the same info:

  --cont size[KMG]      Adds anon and file stats for naturally aligned,
                        contiguously mapped blocks of the specified size. May be
                        issued multiple times to track multiple sized blocks.
                        Useful to infer e.g. arm64 contpte and hpa mappings. Size
                        must be a power-of-2 number of pages.

But yes, let me work up some improved documentation and send it out for your
review. The reason it's a bit terse at the moment is that I'm using Python's
ArgumentParser for the documentation, and it removes all line breaks from the
description which makes it hard to format longer form docs. Anyway, that's a bad
excuse for bad docs so I'll figure out a solution.

>
>>
>> Sorry if I've labored the point here. But I think the only thing the tool
>> doesn't already do that you are asking for is to differentiate PTE- vs PMD-
>> mappings?
>
> That, plus explain itself, yes. :)

Excellent! I'll post a follow up shortly.

>
>>
>>>
>>> ...
>>> (e.g. /sys/fs/cgroup for cgroup-v2 or
>>>>>> /sys/fs/cgroup/pids for cgroup-v1). Exactly one
>>>>>> of --pid and --cgroup must be provided.
>>>>>
>>>>> Maybe we could add "--global" to that list. That would look, in order,
>>>>> inside cgroups2 and cgroups, for a list of pids, and then run as if
>>>>> --cgroup /sys/fs/cgroup or --cgroup /sys/fs/cgroup/pids were specified.
>>>>
>>>> I think actually it might be better just to make global the default when neither
>>>> --pid nor --cgroup are provided? And in this case, I'll just grab all the pids
>>>> from /proc rather than traverse the cgroup hierarchy, that way it will work on
>>>> systems without cgroups. Does that work for you?
>>>
>>> Yes! That was my initial idea, in fact, and after over-thinking it for
>>> a while, it turned into the above. haha :)
>>
>> OK great - implemented for v3.
>>
>
> Sweet!
>
>
> thanks,
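On the ArgumentParser point above: one possible way to keep line breaks in a long description is argparse's RawDescriptionHelpFormatter. The sketch below only illustrates that option and is not necessarily the solution adopted in thpmaps; build_parser and the abbreviated description text are placeholders.

```python
import argparse
import textwrap


def build_parser():
    """Build a parser whose description keeps its own line breaks."""
    description = textwrap.dedent("""\
        Prints information about how transparent huge pages are mapped.

        THP Statistics
        --------------
        (longer, pre-formatted documentation would go here)
        """)
    return argparse.ArgumentParser(
        prog="thpmaps",
        description=description,
        # Print the description verbatim instead of re-wrapping it.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )


if __name__ == "__main__":
    build_parser().print_help()
```

With the default formatter the heading and its underline would be folded into a single paragraph. With the raw formatter they stay on separate lines, at the cost of having to wrap the text by hand.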
On 15/01/2024 09:48, Ryan Roberts wrote:
> On 12/01/2024 19:14, John Hubbard wrote:
>> On 1/12/24 02:00, Ryan Roberts wrote:
>>>> ...
>>>> After spending a day or two exploring running systems with this, I'd
>>>> like to suggest:
>>>>
>>>> 1) measure "native PMD THPs" vs. pte-mapped mTHPs. This provides a lot
>>>> of information: mTHP is configured as expected, and is helping or not,
>>>> etc.
>>>
>>> There is a difference between how a THP is mapped (PTE vs PMD) and its size. A
>>> PMD-sized THP can still be mapped with PTEs. So I'd rather not completely filter
>>> out PMD-sized THPs, if that's your suggestion. But we could make a distinction
>>
>> It's not...
>>
>>> between THPs mapped by PTE and those mapped by PMD; the kernel interface doesn't
>>> directly give us this, but we can infer it from the AnonHugePages and *PmdMapped
>>> stats in smaps.
>>
>> Yes, that would be excellent!
>>
>>>
>>>>
>>>> 2) Not having to list out all the mTHP sizes would be nice. Instead,
>>>> just use the possible sizes from /sys/kernel/mm/transparent_hugepage/* ,
>>>> unless the user specifies sizes.
>>>
>>> This is exactly what the tool already does. Perhaps you haven't fully understood
>>> the counters that it outputs?
>>
>> Oh yes, we are in perfect agreement about my not understanding these
>> counters. :) I'd even expound upon that a bit: despite having a fairly
>> good working understanding of the mTHP implementation in the kernel;
>> despite reading and re-reading the thpmaps documentation and peeking a
>> number of times at the thpmaps script; and despite poring over the
>> thpmaps output, I am still having a rough time with these counters.
>> Mainly because there is a set of hidden assumptions, many of which are
>> revealed below.
>
> Oh dear sorry about that. Thanks for sticking with it and helping me get it right...
>
>>
>> But it's actually just a few key points that were missing from the
>> documentation, plus the ability to clearly see the pte-mapped parts. And
>> your proposed changes below look great; I've got a few more to add and
>> that should finish the job.
>
> OK good!
>
>>
>>>
>>> You *always* get the following counters (although note the tool *hides* all
>>
>> Good. It was not clear that these counters were always active. The --cont
>> documentation misleads the reader a bit on that matter.
>>
>>> counters whose value is 0 by default - show them with --inc-empty). This example
>>> is for a system with 4K base pages:
>>>
>>> # thpmaps --pid 1 --summary --inc-empty
>>>
>>> anon-thp-aligned-16kB:
>>> anon-thp-aligned-32kB:
>>> anon-thp-aligned-64kB:
>>> anon-thp-aligned-128kB:
>>> anon-thp-aligned-256kB:
>>> anon-thp-aligned-512kB:
>>> anon-thp-aligned-1024kB:
>>> anon-thp-aligned-2048kB:
>>> anon-thp-unaligned-16kB:
>>> anon-thp-unaligned-32kB:
>>> anon-thp-unaligned-64kB:
>>> anon-thp-unaligned-128kB:
>>> anon-thp-unaligned-256kB:
>>> anon-thp-unaligned-512kB:
>>> anon-thp-unaligned-1024kB:
>>> anon-thp-unaligned-2048kB:
>>> anon-thp-partial:
>>> file-thp-aligned-16kB:
>>> file-thp-aligned-32kB:
>>> file-thp-aligned-64kB:
>>> file-thp-aligned-128kB:
>>> file-thp-aligned-256kB:
>>> file-thp-aligned-512kB:
>>> file-thp-aligned-1024kB:
>>> file-thp-aligned-2048kB:
>>> file-thp-unaligned-16kB:
>>> file-thp-unaligned-32kB:
>>> file-thp-unaligned-64kB:
>>> file-thp-unaligned-128kB:
>>> file-thp-unaligned-256kB:
>>> file-thp-unaligned-512kB:
>>> file-thp-unaligned-1024kB:
>>> file-thp-unaligned-2048kB:
>>> file-thp-partial:
>>>
>>> So you have counters for every supported THP size in the system - they will be
>>> different for a 64K base page system.
>>>
>>> anon vs file: hopefully obvious
>>>
>>> aligned vs unaligned: In both cases the THP is mapped fully and contiguously. In
>>> the aligned cases it is mapped so that it is naturally aligned. So a 16K THP is
>>
>> I think we should use "aligned" or "aligned to <size>", and stop saying
>> "naturally aligned", throughout. "Natural" adds no additional
>> information, and it makes the reader wonder if there is some other
>> aspect to the alignment (does natural imply PMD-mapped? etc) that they
>> are unaware of.
>
> OK. I thought "naturally aligned" was a fairly standard and well-understood
> term. Google says "We call a datum naturally aligned if its address is aligned
> to its size". But I'm happy to use the phrase "aligned to <size>" if that's clearer.
>
>>
>>
>>> mapped into VA space on a 16K boundary, a 32K THP on a 32K boundary, etc.
>>>
>>> partial: Parts of THPs that are partially mapped into VA space.
>>>
>>> Note this does not draw a distinction between PMD-mapped and PTE-mapped THPs.
>>> But a THP can only be PMD-mapped if it is both PMD-aligned and PMD-sized. So
>>> only 2 counters can include PMD-mappings; anon-thp-aligned-2048kB and
>>> file-thp-aligned-2048kB. We can filter that out by subtracting the relevant
>>> smaps counters from them. I could add a --ignore-pmd-mapped flag to do that? Or
>>
>> That would work but is relatively awkward, but...
>>
>>> I could rename all the existing counters to include "pte" and introduce 2 new
>>> counters: anon-thp-aligned-pmd-2048kB and file-thp-aligned-pmd-2048kB?
>>
>> ...this would be perfect, I think. The "pte" would help self-document, and
>> separating things out allows for a clearer view into the behavior.
>>
>>>
>>> The --cont option will add *additional* special counters, if specified. The idea
>>> here is to provide a view on what percentage of memory is getting
>>> contpte-mapped. So if you provide "--cont 64K" it will give you a counter
>>> showing how much memory is in 64K, naturally aligned blocks (actually 2
>>> counters; file and anon). Those blocks can come from fully mapped and aligned
>>> 64K THPs. But they can also come from bigger THPs - for example, if a 128K THP
>>> is aligned on a 64K boundary (but not a 128K boundary), then it will provide 2
>>> 64K cont blocks, but it will be counted as unaligned in
>>> anon-thp-unaligned-128kB. Or if a 2M THP is partially mapped so that only its
>>> first 1M is mapped and aligned on a 64K boundary, then it will be counted in the
>>> *-thp-partial counter and would add 1M to the *-cont-aligned-64kB counter.
>>>
>>
>> Interesting, and completely undocumented until now. Let's add this to the
>> tool's help output! In fact, all of the above.
>
> Well it already has this, which I intended to convey the same info:
>
>   --cont size[KMG]      Adds anon and file stats for naturally aligned,
>                         contiguously mapped blocks of the specified size. May be
>                         issued multiple times to track multiple sized blocks.
>                         Useful to infer e.g. arm64 contpte and hpa mappings. Size
>                         must be a power-of-2 number of pages.
>
> But yes, let me work up some improved documentation and send it out for your
> review. The reason it's a bit terse at the moment is that I'm using Python's
> ArgumentParser for the documentation, and it removes all line breaks from the
> description which makes it hard to format longer form docs. Anyway, that's a bad
> excuse for bad docs so I'll figure out a solution.

Here is my proposed documentation. If you could take a look and let me know if
it makes sense, then I'll modify the tool to conform:

--8<--

$ ./thpmaps --help

usage: thpmaps [-h] [--pid pid | --cgroup path] [--rollup] [--cont size[KMG]]
               [--inc-smaps] [--inc-empty] [--periodic sleep_ms]

Prints information about how transparent huge pages are mapped, either system-
wide, or for a specified process or cgroup.

A default set of statistics is always generated for THP mappings. However, it is
also possible to generate additional statistics for "contiguous block mappings"
where the block size is user-defined.

Statistics are maintained independently for anonymous and file-backed
(pagecache) memory and are shown both in kB and as a percentage of either total
anonymous or total file-backed memory as appropriate.

THP Statistics
--------------

Statistics are always generated for fully- and contiguously-mapped THPs whose
mapping address is aligned to their size, for each <size> supported by the
system. Separate counters describe THPs mapped by PTE vs those mapped by PMD.
(Although note a THP can only be mapped by PMD if it is PMD-sized):

- anon-thp-pte-aligned-<size>kB
- file-thp-pte-aligned-<size>kB
- anon-thp-pmd-aligned-<size>kB
- file-thp-pmd-aligned-<size>kB

Similarly, statistics are always generated for fully- and contiguously-mapped
THPs whose mapping address is *not* aligned to their size, for each <size>
supported by the system. Due to the unaligned mapping, it is impossible to map
by PMD, so there are only PTE counters for this case:

- anon-thp-pte-unaligned-<size>kB
- file-thp-pte-unaligned-<size>kB

Statistics are also always generated for mapped pages that belong to a THP but
where the THP is *not* fully- and contiguously-mapped. These "partial"
mappings are all counted in the same counter regardless of the size of the THP
that is partially mapped:

- anon-thp-pte-partial
- file-thp-pte-partial

Contiguous Block Statistics
---------------------------

An optional, additional set of statistics is generated for every contiguous
block size specified with `--cont <size>`. These statistics show how much memory
is mapped in contiguous blocks of <size> and also aligned to <size>. A given
contiguous block must all belong to the same THP, but there is no requirement
for it to be the *whole* THP. Separate counters describe contiguous blocks
mapped by PTE vs those mapped by PMD:

- anon-cont-pte-aligned-<size>kB
- file-cont-pte-aligned-<size>kB
- anon-cont-pmd-aligned-<size>kB
- file-cont-pmd-aligned-<size>kB

As an example, if montiroing 64K contiguous blocks (--cont 64K), there are a
number of sources that could provide such blocks: a fully- and contiguously-
mapped 64K THP that is aligned to a 64K boundary would provide 1 block. A fully-
and contiguously-mapped 128K THP that is aligned to at least a 64K boundary
would provide 2 blocks. Or a 128K THP that maps its first 100K, but contiguously
and starting at a 64K boundary would provide 1 block. A fully- and contiguously-
mapped 2M THP would provide 32 blocks. There are many other possible
permutations.

optional arguments:
  -h, --help           show this help message and exit
  --pid pid            Process id of the target process. --pid and --cgroup are
                       mutually exclusive. If neither are provided, all
                       processes are scanned to provide system-wide information.
  --cgroup path        Path to the target cgroup in sysfs. Iterates over every
                       pid in the cgroup and its children. --pid and --cgroup
                       are mutually exclusive. If neither are provided, all
                       processes are scanned to provide system-wide information.
  --rollup             Sum the per-vma statistics to provide a summary over the
                       whole system, process or cgroup.
  --cont size[KMG]     Adds stats for memory that is mapped in contiguous blocks
                       of <size> and also aligned to <size>. May be issued
                       multiple times to track multiple sized blocks. Useful to
                       infer e.g. arm64 contpte and hpa mappings. Size must be a
                       power-of-2 number of pages.
  --inc-smaps          Include all numerical, additive /proc/<pid>/smaps stats
                       in the output.
  --inc-empty          Show all statistics including those whose value is 0.
  --periodic sleep_ms  Run in a loop, polling every sleep_ms milliseconds.

Requires root privilege to access pagemap and kpageflags.

--8<--

Thanks,
Ryan

>
>
>>
>>>
>>> Sorry if I've labored the point here. But I think the only thing the tool
>>> doesn't already do that you are asking for is to differentiate PTE- vs PMD-
>>> mappings?
>>
>> That, plus explain itself, yes. :)
>
> Excellent! I'll post a follow up shortly.
>
>>
>>>
>>>>
>>>> ...
>>>> (e.g. /sys/fs/cgroup for cgroup-v2 or
>>>>>>> /sys/fs/cgroup/pids for cgroup-v1). Exactly one
>>>>>>> of --pid and --cgroup must be provided.
>>>>>>
>>>>>> Maybe we could add "--global" to that list. That would look, in order,
>>>>>> inside cgroups2 and cgroups, for a list of pids, and then run as if
>>>>>> --cgroup /sys/fs/cgroup or --cgroup /sys/fs/cgroup/pids were specified.
>>>>>
>>>>> I think actually it might be better just to make global the default when neither
>>>>> --pid nor --cgroup are provided? And in this case, I'll just grab all the pids
>>>>> from /proc rather than traverse the cgroup hierarchy, that way it will work on
>>>>> systems without cgroups. Does that work for you?
>>>>
>>>> Yes! That was my initial idea, in fact, and after over-thinking it for
>>>> a while, it turned into the above. haha :)
>>>
>>> OK great - implemented for v3.
>>>
>>
>> Sweet!
>>
>>
>> thanks,
>
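The worked example in the proposed "Contiguous Block Statistics" text can be checked with a small sketch. It counts how many <size>-sized, <size>-aligned blocks fit inside one contiguously mapped virtual range. The function name cont_blocks is hypothetical, and the sketch assumes the whole range belongs to a single THP; it is not the accounting code from thpmaps.

```python
K = 1024
M = 1024 * K


def cont_blocks(start, length, block):
    """Count block-sized, block-aligned chunks inside [start, start + length).

    Assumes the range is mapped contiguously and belongs to one THP.
    """
    first = -(-start // block) * block  # round start up to a block boundary
    end = start + length
    if first >= end:
        return 0
    return (end - first) // block


if __name__ == "__main__":
    print(cont_blocks(0, 64 * K, 64 * K))        # aligned 64K THP
    print(cont_blocks(64 * K, 128 * K, 64 * K))  # 128K THP on a 64K boundary
    print(cont_blocks(64 * K, 100 * K, 64 * K))  # first 100K of a 128K THP
    print(cont_blocks(0, 2 * M, 64 * K))         # fully mapped 2M THP
```

These four calls correspond to the four cases in the help text, which gives 1, 2, 1 and 32 blocks respectively.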
On 1/15/24 07:56, Ryan Roberts wrote:
...
>> But yes, let me work up some improved documentation and send it out for your
>> review. The reason it's a bit terse at the moment is that I'm using Python's
>> ArgumentParser for the documentation, and it removes all line breaks from the
>> description which makes it hard to format longer form docs. Anyway, that's a bad
>> excuse for bad docs so I'll figure out a solution.
>
> Here is my proposed documentation. If you could take a look and let me know if
> it makes sense, then I'll modify the tool to conform:
>

Looks great. One typo fix and a note, below.

> --8<--
>
> $ ./thpmaps --help
>
> usage: thpmaps [-h] [--pid pid | --cgroup path] [--rollup] [--cont size[KMG]]
>                [--inc-smaps] [--inc-empty] [--periodic sleep_ms]
>
> Prints information about how transparent huge pages are mapped, either system-
> wide, or for a specified process or cgroup.
>
> A default set of statistics is always generated for THP mappings. However, it is

The way this is done is sufficiently interesting to the sysadmin to say a
few words about it. Something along these lines, approximately:

-----
When run without options, cgroups v1 or v2 (depending on what is active
on the system) is used in order to get a listing of all user space pids.
That pid list is passed into the core script, as if the user had provided
"--pids pid1 pid2 ...".
-----

This reminds me that maybe a --pids option is helpful, what do you think?

> also possible to generate additional statistics for "contiguous block mappings"
> where the block size is user-defined.
>
> Statistics are maintained independently for anonymous and file-backed
> (pagecache) memory and are shown both in kB and as a percentage of either total
> anonymous or total file-backed memory as appropriate.
>
> THP Statistics
> --------------
>
> Statistics are always generated for fully- and contiguously-mapped THPs whose
> mapping address is aligned to their size, for each <size> supported by the
> system. Separate counters describe THPs mapped by PTE vs those mapped by PMD.
> (Although note a THP can only be mapped by PMD if it is PMD-sized):
>
> - anon-thp-pte-aligned-<size>kB
> - file-thp-pte-aligned-<size>kB
> - anon-thp-pmd-aligned-<size>kB
> - file-thp-pmd-aligned-<size>kB
>
> Similarly, statistics are always generated for fully- and contiguously-mapped
> THPs whose mapping address is *not* aligned to their size, for each <size>
> supported by the system. Due to the unaligned mapping, it is impossible to map
> by PMD, so there are only PTE counters for this case:
>
> - anon-thp-pte-unaligned-<size>kB
> - file-thp-pte-unaligned-<size>kB
>
> Statistics are also always generated for mapped pages that belong to a THP but
> where the THP is *not* fully- and contiguously-mapped. These "partial"
> mappings are all counted in the same counter regardless of the size of the THP
> that is partially mapped:
>
> - anon-thp-pte-partial
> - file-thp-pte-partial
>
> Contiguous Block Statistics
> ---------------------------
>
> An optional, additional set of statistics is generated for every contiguous
> block size specified with `--cont <size>`. These statistics show how much memory
> is mapped in contiguous blocks of <size> and also aligned to <size>. A given
> contiguous block must all belong to the same THP, but there is no requirement
> for it to be the *whole* THP. Separate counters describe contiguous blocks
> mapped by PTE vs those mapped by PMD:
>
> - anon-cont-pte-aligned-<size>kB
> - file-cont-pte-aligned-<size>kB
> - anon-cont-pmd-aligned-<size>kB
> - file-cont-pmd-aligned-<size>kB
>
> As an example, if montiroing 64K contiguous blocks (--cont 64K), there are a

typo: "monitoring"

> number of sources that could provide such blocks: a fully- and contiguously-
> mapped 64K THP that is aligned to a 64K boundary would provide 1 block. A fully-
> and contiguously-mapped 128K THP that is aligned to at least a 64K boundary
> would provide 2 blocks. Or a 128K THP that maps its first 100K, but contiguously
> and starting at a 64K boundary would provide 1 block. A fully- and contiguously-
> mapped 2M THP would provide 32 blocks. There are many other possible
> permutations.
>
> optional arguments:
>   -h, --help           show this help message and exit
>   --pid pid            Process id of the target process. --pid and --cgroup are
>                        mutually exclusive. If neither are provided, all
>                        processes are scanned to provide system-wide information.
>   --cgroup path        Path to the target cgroup in sysfs. Iterates over every
>                        pid in the cgroup and its children. --pid and --cgroup
>                        are mutually exclusive. If neither are provided, all
>                        processes are scanned to provide system-wide information.
>   --rollup             Sum the per-vma statistics to provide a summary over the
>                        whole system, process or cgroup.
>   --cont size[KMG]     Adds stats for memory that is mapped in contiguous blocks
>                        of <size> and also aligned to <size>. May be issued
>                        multiple times to track multiple sized blocks. Useful to
>                        infer e.g. arm64 contpte and hpa mappings. Size must be a
>                        power-of-2 number of pages.
>   --inc-smaps          Include all numerical, additive /proc/<pid>/smaps stats
>                        in the output.
>   --inc-empty          Show all statistics including those whose value is 0.
>   --periodic sleep_ms  Run in a loop, polling every sleep_ms milliseconds.
>
> Requires root privilege to access pagemap and kpageflags.
>
> --8<--

It's all looking much more understandable now, very nice.

thanks,
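The "--cont size[KMG]" constraint quoted above (size must be a power-of-2 number of pages) could be validated along the lines of the sketch below. The function name parse_size, the default 4K page size, and the acceptance of lower-case suffixes are all assumptions made for illustration; the real option parser in thpmaps may behave differently.

```python
import re

# Shift amounts for the optional size suffix.
_SHIFT = {"": 0, "K": 10, "M": 20, "G": 30}


def parse_size(text, page_size=4096):
    """Parse 'N', 'NK', 'NM' or 'NG' into bytes.

    Raises ValueError unless the result is a power-of-2 number of pages.
    """
    m = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if not m:
        raise ValueError(f"cannot parse size: {text!r}")
    size = int(m.group(1)) << _SHIFT[m.group(2)]
    pages, rem = divmod(size, page_size)
    # A power of two has exactly one bit set, so pages & (pages - 1) == 0.
    if rem or pages == 0 or pages & (pages - 1):
        raise ValueError(f"{text!r} is not a power-of-2 number of pages")
    return size
```

On a 4K base page system this accepts "64K" (16 pages) and "2M" (512 pages) but rejects "96K" (24 pages). On a 64K base page system the caller would pass page_size=65536.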
On 15/01/2024 21:30, John Hubbard wrote:
> On 1/15/24 07:56, Ryan Roberts wrote:
> ...
>>> But yes, let me work up some improved documentation and send it out for your
>>> review. The reason it's a bit terse at the moment is that I'm using Python's
>>> ArgumentParser for the documentation, and it removes all line breaks from the
>>> description which makes it hard to format longer form docs. Anyway, that's a bad
>>> excuse for bad docs so I'll figure out a solution.
>>
>> Here is my proposed documentation. If you could take a look and let me know if
>> it makes sense, then I'll modify the tool to conform:
>>
>
> Looks great. One typo fix and a note, below.
>
>> --8<--
>>
>> $ ./thpmaps --help
>>
>> usage: thpmaps [-h] [--pid pid | --cgroup path] [--rollup] [--cont size[KMG]]
>>                [--inc-smaps] [--inc-empty] [--periodic sleep_ms]
>>
>> Prints information about how transparent huge pages are mapped, either system-
>> wide, or for a specified process or cgroup.
>>
>> A default set of statistics is always generated for THP mappings. However, it is
>
> The way this is done is sufficiently interesting to the sysadmin to say a
> few words about it. Something along these lines, approximately:
>
> -----
> When run without options, cgroups v1 or v2 (depending on what is active
> on the system) is used in order to get a listing of all user space pids.
> That pid list is passed into the core script, as if the user had provided
> "--pids pid1 pid2 ...".
> -----

Agree with the sentiment; I'll add something similar. Although, I'm no longer
using cgroups to get all the pids - I'm grabbing them from /proc.

--8<--
When run with --pid, the user explicitly specifies the set of pids to scan. e.g.
"--pid 10 [--pid 134 ...]". When run with --cgroup, the user passes either a v1
or v2 cgroup and all pids that belong to the cgroup subtree are scanned. When
run with neither --pid nor --cgroup, the full set of pids on the system is
gathered from /proc and scanned as if the user had provided
"--pid 1 --pid 2 ...".
--8<--

>
> This reminds me that maybe a --pids option is helpful, what do you think?

How about I allow --pid to be specified multiple times? That will make the
parsing easier (and be consistent with the way it works for --cont):

--pid 1 --pid 2 --pid 3 ...

>
>
>> also possible to generate additional statistics for "contiguous block mappings"
>> where the block size is user-defined.
>>
>> Statistics are maintained independently for anonymous and file-backed
>> (pagecache) memory and are shown both in kB and as a percentage of either total
>> anonymous or total file-backed memory as appropriate.
>>
>> THP Statistics
>> --------------
>>
>> Statistics are always generated for fully- and contiguously-mapped THPs whose
>> mapping address is aligned to their size, for each <size> supported by the
>> system. Separate counters describe THPs mapped by PTE vs those mapped by PMD.
>> (Although note a THP can only be mapped by PMD if it is PMD-sized):
>>
>> - anon-thp-pte-aligned-<size>kB
>> - file-thp-pte-aligned-<size>kB
>> - anon-thp-pmd-aligned-<size>kB
>> - file-thp-pmd-aligned-<size>kB
>>
>> Similarly, statistics are always generated for fully- and contiguously-mapped
>> THPs whose mapping address is *not* aligned to their size, for each <size>
>> supported by the system. Due to the unaligned mapping, it is impossible to map
>> by PMD, so there are only PTE counters for this case:
>>
>> - anon-thp-pte-unaligned-<size>kB
>> - file-thp-pte-unaligned-<size>kB
>>
>> Statistics are also always generated for mapped pages that belong to a THP but
>> where the THP is *not* fully- and contiguously-mapped. These "partial"
>> mappings are all counted in the same counter regardless of the size of the THP
>> that is partially mapped:
>>
>> - anon-thp-pte-partial
>> - file-thp-pte-partial
>>
>> Contiguous Block Statistics
>> ---------------------------
>>
>> An optional, additional set of statistics is generated for every contiguous
>> block size specified with `--cont <size>`. These statistics show how much memory
>> is mapped in contiguous blocks of <size> and also aligned to <size>. A given
>> contiguous block must all belong to the same THP, but there is no requirement
>> for it to be the *whole* THP. Separate counters describe contiguous blocks
>> mapped by PTE vs those mapped by PMD:
>>
>> - anon-cont-pte-aligned-<size>kB
>> - file-cont-pte-aligned-<size>kB
>> - anon-cont-pmd-aligned-<size>kB
>> - file-cont-pmd-aligned-<size>kB
>>
>> As an example, if montiroing 64K contiguous blocks (--cont 64K), there are a
>
> typo: "monitoring"
>
>> number of sources that could provide such blocks: a fully- and contiguously-
>> mapped 64K THP that is aligned to a 64K boundary would provide 1 block. A fully-
>> and contiguously-mapped 128K THP that is aligned to at least a 64K boundary
>> would provide 2 blocks. Or a 128K THP that maps its first 100K, but contiguously
>> and starting at a 64K boundary would provide 1 block. A fully- and contiguously-
>> mapped 2M THP would provide 32 blocks. There are many other possible
>> permutations.
>>
>> optional arguments:
>>   -h, --help           show this help message and exit
>>   --pid pid            Process id of the target process. --pid and --cgroup are
>>                        mutually exclusive. If neither are provided, all
>>                        processes are scanned to provide system-wide information.
>>   --cgroup path        Path to the target cgroup in sysfs. Iterates over every
>>                        pid in the cgroup and its children. --pid and --cgroup
>>                        are mutually exclusive. If neither are provided, all
>>                        processes are scanned to provide system-wide information.
>>   --rollup             Sum the per-vma statistics to provide a summary over the
>>                        whole system, process or cgroup.
>>   --cont size[KMG]     Adds stats for memory that is mapped in contiguous blocks
>>                        of <size> and also aligned to <size>. May be issued
>>                        multiple times to track multiple sized blocks. Useful to
>>                        infer e.g. arm64 contpte and hpa mappings. Size must be a
>>                        power-of-2 number of pages.
>>   --inc-smaps          Include all numerical, additive /proc/<pid>/smaps stats
>>                        in the output.
>>   --inc-empty          Show all statistics including those whose value is 0.
>>   --periodic sleep_ms  Run in a loop, polling every sleep_ms milliseconds.
>>
>> Requires root privilege to access pagemap and kpageflags.
>>
>> --8<--
>
> It's all looking much more understandable now, very nice.

Great - thanks for the review. I'll get this straightened out and post later
today.

>
> thanks,
On 1/16/24 00:53, Ryan Roberts wrote:
> On 15/01/2024 21:30, John Hubbard wrote:
>> On 1/15/24 07:56, Ryan Roberts wrote:
>> ...
>> -----
>> When run without options, cgroups v1 or v2 (depending on what is active
>> on the system) is used in order to get a listing of all user space pids.
>> That pid list is passed into the core script, as if the user had provided
>> "--pids pid1 pid2 ...".
>> -----
>
> Agree with the sentiment; I'll add something similar. Although, I'm no longer
> using cgroups to get all the pids - I'm grabbing them from /proc.
>
> --8<--
> When run with --pid, the user explicitly specifies the set of pids to scan. e.g.
> "--pid 10 [--pid 134 ...]". When run with --cgroup, the user passes either a v1
> or v2 cgroup and all pids that belong to the cgroup subtree are scanned. When
> run with neither --pid nor --cgroup, the full set of pids on the system is
> gathered from /proc and scanned as if the user had provided "--pid 1 --pid 2 ...".
> --8<--
>

Sounds good.

>>
>> This reminds me that maybe a --pids option is helpful, what do you think?
>
> How about I allow --pid to be specified multiple times? That will make the
> parsing easier (and be consistent with the way it works for --cont):
>
> --pid 1 --pid 2 --pid 3 ...
>

Sure, that works nicely.

thanks,
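
For readers following the thread, the repeated --pid scheme agreed above,
together with the /proc fallback described in the proposed help text, could
look roughly like the sketch below. It is an illustration only and not code
from the patch: parse_pids and all_pids are hypothetical helper names, and the
proc parameter exists only so the directory can be swapped out when
experimenting.

```python
import argparse
import os


def parse_pids(argv):
    # Hypothetical helper: collect every "--pid N" occurrence into a list,
    # the same way the patch already handles repeated --cont options.
    parser = argparse.ArgumentParser()
    parser.add_argument('--pid', metavar='pid', type=int,
                        action='append', default=[])
    return parser.parse_args(argv).pid


def all_pids(proc='/proc'):
    # Hypothetical helper: every all-digit entry under /proc names a process.
    return sorted(int(name) for name in os.listdir(proc) if name.isdigit())


def pids_to_scan(argv, proc='/proc'):
    # With no --pid given, fall back to scanning everything found in /proc.
    pids = parse_pids(argv)
    return pids if pids else all_pids(proc)
```

Using action='append' keeps the command line symmetric with --cont, which is
the consistency argument made in the reply above.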
diff --git a/tools/mm/Makefile b/tools/mm/Makefile
index 1c5606cc3334..7bb03606b9ea 100644
--- a/tools/mm/Makefile
+++ b/tools/mm/Makefile
@@ -3,7 +3,8 @@
 #
 include ../scripts/Makefile.include
 
-TARGETS=page-types slabinfo page_owner_sort
+BUILD_TARGETS=page-types slabinfo page_owner_sort
+INSTALL_TARGETS = $(BUILD_TARGETS) thpmaps
 
 LIB_DIR = ../lib/api
 LIBS = $(LIB_DIR)/libapi.a
@@ -11,9 +12,9 @@ LIBS = $(LIB_DIR)/libapi.a
 CFLAGS += -Wall -Wextra -I../lib/ -pthread
 LDFLAGS += $(LIBS) -pthread
 
-all: $(TARGETS)
+all: $(BUILD_TARGETS)
 
-$(TARGETS): $(LIBS)
+$(BUILD_TARGETS): $(LIBS)
 
 $(LIBS):
 	make -C $(LIB_DIR)
@@ -29,4 +30,4 @@ sbindir ?= /usr/sbin
 
 install: all
 	install -d $(DESTDIR)$(sbindir)
-	install -m 755 -p $(TARGETS) $(DESTDIR)$(sbindir)
+	install -m 755 -p $(INSTALL_TARGETS) $(DESTDIR)$(sbindir)
diff --git a/tools/mm/thpmaps b/tools/mm/thpmaps
new file mode 100755
index 000000000000..8ac8579d7aa6
--- /dev/null
+++ b/tools/mm/thpmaps
@@ -0,0 +1,553 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0-only
+# Copyright (C) 2024 ARM Ltd.
+#
+# Utility providing smaps-like output detailing transparent hugepage usage.
+# For more info, run:
+# ./thpmaps --help
+#
+# Requires numpy:
+# pip3 install numpy
+
+
+import argparse
+import collections
+import math
+import os
+import resource
+import shutil
+import sys
+import time
+import numpy as np
+
+
+with open('/sys/kernel/mm/transparent_hugepage/hpage_pmd_size') as f:
+    PAGE_SIZE = resource.getpagesize()
+    PAGE_SHIFT = int(math.log2(PAGE_SIZE))
+    PMD_SIZE = int(f.read())
+    PMD_ORDER = int(math.log2(PMD_SIZE / PAGE_SIZE))
+
+
+def align_forward(v, a):
+    return (v + (a - 1)) & ~(a - 1)
+
+
+def align_offset(v, a):
+    return v & (a - 1)
+
+
+def nrkb(nr):
+    # Convert number of pages to KB.
+    return (nr << PAGE_SHIFT) >> 10
+
+
+def odkb(order):
+    # Convert page order to KB.
+    return (PAGE_SIZE << order) >> 10
+
+
+def cont_ranges_all(search, index):
+    # Given a list of arrays, find the ranges for which values are monotonically
+    # incrementing in all arrays. all arrays in search and index must be the
+    # same size.
+    sz = len(search[0])
+    r = np.full(sz, 2)
+    d = np.diff(search[0]) == 1
+    for dd in [np.diff(arr) == 1 for arr in search[1:]]:
+        d &= dd
+    r[1:] -= d
+    r[:-1] -= d
+    return [np.repeat(arr, r).reshape(-1, 2) for arr in index]
+
+
+class ArgException(Exception):
+    pass
+
+
+class FileIOException(Exception):
+    pass
+
+
+class BinArrayFile:
+    # Base class used to read /proc/<pid>/pagemap and /proc/kpageflags into a
+    # numpy array. Use inherited class in a with clause to ensure file is
+    # closed when it goes out of scope.
+    def __init__(self, filename, element_size):
+        self.element_size = element_size
+        self.filename = filename
+        self.fd = os.open(self.filename, os.O_RDONLY)
+
+    def cleanup(self):
+        os.close(self.fd)
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.cleanup()
+
+    def _readin(self, offset, buffer):
+        length = os.preadv(self.fd, (buffer,), offset)
+        if len(buffer) != length:
+            raise FileIOException('error: {} failed to read {} bytes at {:x}'
+                            .format(self.filename, len(buffer), offset))
+
+    def _toarray(self, buf):
+        assert(self.element_size == 8)
+        return np.frombuffer(buf, dtype=np.uint64)
+
+    def getv(self, vec):
+        vec *= self.element_size
+        offsets = vec[:, 0]
+        lengths = (np.diff(vec) + self.element_size).reshape(len(vec))
+        buf = bytearray(int(np.sum(lengths)))
+        view = memoryview(buf)
+        pos = 0
+        for offset, length in zip(offsets, lengths):
+            offset = int(offset)
+            length = int(length)
+            self._readin(offset, view[pos:pos+length])
+            pos += length
+        return self._toarray(buf)
+
+    def get(self, index, nr=1):
+        offset = index * self.element_size
+        length = nr * self.element_size
+        buf = bytearray(length)
+        self._readin(offset, buf)
+        return self._toarray(buf)
+
+
+PM_PAGE_PRESENT = 1 << 63
+PM_PFN_MASK = (1 << 55) - 1
+
+class PageMap(BinArrayFile):
+    # Read ranges of a given pid's pagemap into a numpy array.
+    def __init__(self, pid='self'):
+        super().__init__(f'/proc/{pid}/pagemap', 8)
+
+
+KPF_ANON = 1 << 12
+KPF_COMPOUND_HEAD = 1 << 15
+KPF_COMPOUND_TAIL = 1 << 16
+
+class KPageFlags(BinArrayFile):
+    # Read ranges of /proc/kpageflags into a numpy array.
+    def __init__(self):
+        super().__init__(f'/proc/kpageflags', 8)
+
+
+VMA = collections.namedtuple('VMA', [
+    'name',
+    'start',
+    'end',
+    'read',
+    'write',
+    'execute',
+    'private',
+    'pgoff',
+    'major',
+    'minor',
+    'inode',
+    'stats',
+])
+
+class VMAList:
+    # A container for VMAs, parsed from /proc/<pid>/smaps. Iterate over the
+    # instance to receive VMAs.
+    exclude = ['KernelPageSize', 'MMUPageSize']
+
+    def __init__(self, pid='self', stats=False):
+        self.vmas = []
+        with open(f'/proc/{pid}/smaps', 'r') as file:
+            for line in file:
+                elements = line.split()
+                if '-' in elements[0]:
+                    start, end = map(lambda x: int(x, 16), elements[0].split('-'))
+                    major, minor = map(lambda x: int(x, 16), elements[3].split(':'))
+                    self.vmas.append(VMA(
+                        name=elements[5] if len(elements) == 6 else '',
+                        start=start,
+                        end=end,
+                        read=elements[1][0] == 'r',
+                        write=elements[1][1] == 'w',
+                        execute=elements[1][2] == 'x',
+                        private=elements[1][3] == 'p',
+                        pgoff=int(elements[2], 16),
+                        major=major,
+                        minor=minor,
+                        inode=int(elements[4], 16),
+                        stats={},
+                    ))
+                    got_stats = False
+                elif not got_stats:
+                    # If stats were not requested, only save Rss, since we use
+                    # Rss==0 as an optimization to avoid reading the pagemap. If
+                    # stats were requested, currently only handle the KB stats
+                    # because they are summed for --summary. Core code doesn't
+                    # know how to combine other stats. The got_stats guard is a
+                    # performance optimization for the stats=False (common)
+                    # case.
+                    if stats:
+                        param = elements[0][:-1]
+                        if len(elements) == 3 and elements[2] == 'kB' and param not in self.exclude:
+                            value = int(elements[1])
+                            self.vmas[-1].stats[param] = {'type': None, 'value': value}
+                    elif elements[0] == 'Rss:':
+                        value = int(elements[1])
+                        self.vmas[-1].stats['Rss'] = {'type': None, 'value': value}
+                        got_stats = True
+
+
+    def __iter__(self):
+        yield from self.vmas
+
+
+def thp_parse(max_order, kpageflags, ranges, indexes, vfns, pfns, anons, heads):
+    # Given 4 same-sized arrays representing a range within a page table backed
+    # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons:
+    # True if page is anonymous, heads: True if page is head of a THP), return a
+    # dictionary of statistics describing the mapped THPs.
+    stats = {
+        'file': {
+            'partial': 0,
+            'aligned': [0] * (max_order + 1),
+            'unaligned': [0] * (max_order + 1),
+        },
+        'anon': {
+            'partial': 0,
+            'aligned': [0] * (max_order + 1),
+            'unaligned': [0] * (max_order + 1),
+        },
+    }
+
+    for rindex, rpfn in zip(ranges[0], ranges[2]):
+        index_next = int(rindex[0])
+        index_end = int(rindex[1]) + 1
+        pfn_end = int(rpfn[1]) + 1
+
+        folios = indexes[index_next:index_end][heads[index_next:index_end]]
+
+        # Account pages for any partially mapped THP at the front. In that case,
+        # the first page of the range is a tail.
+        nr = (int(folios[0]) if len(folios) else index_end) - index_next
+        stats['anon' if anons[index_next] else 'file']['partial'] += nr
+
+        # Account pages for any partially mapped THP at the back. In that case,
+        # the next page after the range is a tail.
+        if len(folios):
+            flags = int(kpageflags.get(pfn_end)[0])
+            if flags & KPF_COMPOUND_TAIL:
+                nr = index_end - int(folios[-1])
+                folios = folios[:-1]
+                index_end -= nr
+                stats['anon' if anons[index_end - 1] else 'file']['partial'] += nr
+
+        # Account fully mapped THPs in the middle of the range.
+        if len(folios):
+            folio_nrs = np.append(np.diff(folios), np.uint64(index_end - folios[-1]))
+            folio_orders = np.log2(folio_nrs).astype(np.uint64)
+            for index, order in zip(folios, folio_orders):
+                index = int(index)
+                order = int(order)
+                nr = 1 << order
+                vfn = int(vfns[index])
+                align = 'aligned' if align_forward(vfn, nr) == vfn else 'unaligned'
+                anon = 'anon' if anons[index] else 'file'
+                stats[anon][align][order] += nr
+
+    rstats = {}
+
+    def flatten_sub(type, subtype, stats):
+        param = f"{type}-thp-{subtype}-{{}}kB"
+        for od, nr in enumerate(stats[2:], 2):
+            rstats[param.format(odkb(od))] = {'type': type, 'value': nrkb(nr)}
+
+    def flatten_type(type, stats):
+        flatten_sub(type, 'aligned', stats['aligned'])
+        flatten_sub(type, 'unaligned', stats['unaligned'])
+        rstats[f"{type}-thp-partial"] = {'type': type, 'value': nrkb(stats['partial'])}
+
+    flatten_type('anon', stats['anon'])
+    flatten_type('file', stats['file'])
+
+    return rstats
+
+
+def cont_parse(order, ranges, anons, heads):
+    # Given 4 same-sized arrays representing a range within a page table backed
+    # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons:
+    # True if page is anonymous, heads: True if page is head of a THP), return a
+    # dictionary of statistics describing the contiguous blocks.
+    nr_cont = 1 << order
+    nr_anon = 0
+    nr_file = 0
+
+    for rindex, rvfn, rpfn in zip(*ranges):
+        index_next = int(rindex[0])
+        index_end = int(rindex[1]) + 1
+        vfn_start = int(rvfn[0])
+        pfn_start = int(rpfn[0])
+
+        if align_offset(pfn_start, nr_cont) != align_offset(vfn_start, nr_cont):
+            continue
+
+        off = align_forward(vfn_start, nr_cont) - vfn_start
+        index_next += off
+
+        while index_next + nr_cont <= index_end:
+            folio_boundary = heads[index_next+1:index_next+nr_cont].any()
+            if not folio_boundary:
+                if anons[index_next]:
+                    nr_anon += nr_cont
+                else:
+                    nr_file += nr_cont
+            index_next += nr_cont
+
+    return {
+        f"anon-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'anon', 'value': nrkb(nr_anon)},
+        f"file-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'file', 'value': nrkb(nr_file)},
+    }
+
+
+def vma_print(vma, pid):
+    # Prints a VMA instance in a format similar to smaps. The main difference is
+    # that the pid is included as the first value.
+    print("{:010d}: {:016x}-{:016x} {}{}{}{} {:08x} {:02x}:{:02x} {:08x} {}"
+        .format(
+            pid, vma.start, vma.end,
+            'r' if vma.read else '-', 'w' if vma.write else '-',
+            'x' if vma.execute else '-', 'p' if vma.private else 's',
+            vma.pgoff, vma.major, vma.minor, vma.inode, vma.name
+        ))
+
+
+def stats_print(stats, tot_anon, tot_file, inc_empty):
+    # Print a statistics dictionary.
+    label_field = 32
+    for label, stat in stats.items():
+        type = stat['type']
+        value = stat['value']
+        if value or inc_empty:
+            pad = max(0, label_field - len(label) - 1)
+            if type == 'anon' and tot_anon > 0:
+                percent = f' ({value / tot_anon:3.0%})'
+            elif type == 'file' and tot_file > 0:
+                percent = f' ({value / tot_file:3.0%})'
+            else:
+                percent = ''
+            print(f"{label}:{' ' * pad}{value:8} kB{percent}")
+
+
+def vma_parse(vma, pagemap, kpageflags, contorders):
+    # Generate thp and cont statistics for a single VMA.
+    start = vma.start >> PAGE_SHIFT
+    end = vma.end >> PAGE_SHIFT
+
+    pmes = pagemap.get(start, end - start)
+    present = pmes & PM_PAGE_PRESENT != 0
+    pfns = pmes & PM_PFN_MASK
+    pfns = pfns[present]
+    vfns = np.arange(start, end, dtype=np.uint64)
+    vfns = vfns[present]
+
+    pfn_vec = cont_ranges_all([pfns], [pfns])[0]
+    flags = kpageflags.getv(pfn_vec)
+    anons = flags & KPF_ANON != 0
+    heads = flags & KPF_COMPOUND_HEAD != 0
+    tails = flags & KPF_COMPOUND_TAIL != 0
+    thps = heads | tails
+
+    tot_anon = np.count_nonzero(anons)
+    tot_file = np.size(anons) - tot_anon
+    tot_anon = nrkb(tot_anon)
+    tot_file = nrkb(tot_file)
+
+    vfns = vfns[thps]
+    pfns = pfns[thps]
+    anons = anons[thps]
+    heads = heads[thps]
+
+    indexes = np.arange(len(vfns), dtype=np.uint64)
+    ranges = cont_ranges_all([vfns, pfns], [indexes, vfns, pfns])
+
+    thpstats = thp_parse(PMD_ORDER, kpageflags, ranges, indexes, vfns, pfns, anons, heads)
+    contstats = [cont_parse(order, ranges, anons, heads) for order in contorders]
+
+    return {
+        **thpstats,
+        **{k: v for s in contstats for k, v in s.items()}
+    }, tot_anon, tot_file
+
+
+def do_main(args):
+    pids = set()
+    summary = {}
+    summary_anon = 0
+    summary_file = 0
+
+    if args.cgroup:
+        for walk_info in os.walk(args.cgroup):
+            cgroup = walk_info[0]
+            with open(f'{cgroup}/cgroup.procs') as pidfile:
+                for line in pidfile.readlines():
+                    pids.add(int(line.strip()))
+    else:
+        pids.add(args.pid)
+
+    if not args.summary:
+        print(" PID START END PROT OFFSET DEV INODE OBJECT")
+
+    for pid in pids:
+        try:
+            with PageMap(pid) as pagemap:
+                with KPageFlags() as kpageflags:
+                    for vma in VMAList(pid, args.inc_smaps):
+                        if (vma.read or vma.write or vma.execute) and vma.stats['Rss']['value'] > 0:
+                            stats, vma_anon, vma_file = vma_parse(vma, pagemap, kpageflags, args.cont)
+                        else:
+                            stats = {}
+                            vma_anon = 0
+                            vma_file = 0
+                        if args.inc_smaps:
+                            stats = {**vma.stats, **stats}
+                        if args.summary:
+                            for k, v in stats.items():
+                                if k in summary:
+                                    assert(summary[k]['type'] == v['type'])
+                                    summary[k]['value'] += v['value']
+                                else:
+                                    summary[k] = v
+                            summary_anon += vma_anon
+                            summary_file += vma_file
+                        else:
+                            vma_print(vma, pid)
+                            stats_print(stats, vma_anon, vma_file, args.inc_empty)
+        except FileNotFoundError:
+            if not args.cgroup:
+                raise
+        except ProcessLookupError:
+            if not args.cgroup:
+                raise
+        except FileIOException:
+            if not args.cgroup:
+                raise
+
+    if args.summary:
+        stats_print(summary, summary_anon, summary_file, args.inc_empty)
+
+
+def main():
+    def formatter(prog):
+        width = shutil.get_terminal_size().columns
+        width -= 2
+        width = min(80, width)
+        return argparse.HelpFormatter(prog, width=width)
+
+    def size2order(human):
+        units = {
+            "K": 2**10, "M": 2**20, "G": 2**30,
+            "k": 2**10, "m": 2**20, "g": 2**30,
+        }
+        unit = 1
+        if human[-1] in units:
+            unit = units[human[-1]]
+            human = human[:-1]
+        try:
+            size = int(human)
+        except ValueError:
+            raise ArgException('error: --cont value must be integer size with optional KMG unit')
+        size *= unit
+        order = int(math.log2(size / PAGE_SIZE))
+        if order < 1:
+            raise ArgException('error: --cont value must be size of at least 2 pages')
+        if (1 << order) * PAGE_SIZE != size:
+            raise ArgException('error: --cont value must be size of power-of-2 pages')
+        return order
+
+    parser = argparse.ArgumentParser(formatter_class=formatter,
+        description="""Prints information about how transparent huge pages are
+                    mapped to a specified process or cgroup.
+
+                    Shows statistics for fully-mapped THPs of every size, mapped
+                    both naturally aligned and unaligned for both file and
+                    anonymous memory. See
+                    [anon|file]-thp-[aligned|unaligned]-<size>kB keys.
+
+                    Shows statistics for mapped pages that belong to a THP but
+                    which are not fully mapped. See [anon|file]-thp-partial
+                    keys.
+
+                    Optionally shows statistics for naturally aligned,
+                    contiguous blocks of memory of a specified size (when --cont
+                    is provided). See [anon|file]-cont-aligned-<size>kB keys.
+
+                    Statistics are shown in kB and as a percentage of either
+                    total anon or file memory as appropriate.""",
+        epilog="""Requires root privilege to access pagemap and kpageflags.""")
+
+    parser.add_argument('--pid',
+        metavar='pid', required=False, type=int,
+        help="""Process id of the target process. Exactly one of --pid and
+            --cgroup must be provided.""")
+
+    parser.add_argument('--cgroup',
+        metavar='path', required=False,
+        help="""Path to the target cgroup in sysfs. Iterates over every pid in
+            the cgroup and its children. Get global stats by passing in the root
+            cgroup (e.g. /sys/fs/cgroup for cgroup-v2 or /sys/fs/cgroup/pids for
+            cgroup-v1). Exactly one of --pid and --cgroup must be provided.""")
+
+    parser.add_argument('--summary',
+        required=False, default=False, action='store_true',
+        help="""Sum the per-vma statistics to provide a summary over the whole
+            process or cgroup.""")
+
+    parser.add_argument('--cont',
+        metavar='size[KMG]', required=False, default=[], action='append',
+        help="""Adds anon and file stats for naturally aligned, contiguously
+            mapped blocks of the specified size. May be issued multiple times to
+            track multiple sized blocks. Useful to infer e.g. arm64 contpte and
+            hpa mappings. Size must be a power-of-2 number of pages.""")
+
+    parser.add_argument('--inc-smaps',
+        required=False, default=False, action='store_true',
+        help="""Include all numerical, additive /proc/<pid>/smaps stats in the
+            output.""")
+
+    parser.add_argument('--inc-empty',
+        required=False, default=False, action='store_true',
+        help="""Show all statistics including those whose value is 0.""")
+
+    parser.add_argument('--periodic',
+        metavar='sleep_ms', required=False, type=int,
+        help="""Run in a loop, polling every sleep_ms milliseconds.""")
+
+    args = parser.parse_args()
+
+    try:
+        if (args.pid and args.cgroup) or \
+           (not args.pid and not args.cgroup):
+            raise ArgException("error: Exactly one of --pid and --cgroup must be provided.")
+
+        args.cont = [size2order(cont) for cont in args.cont]
+    except ArgException as e:
+        parser.print_usage()
+        raise
+
+    if args.periodic:
+        while True:
+            do_main(args)
+            print()
+            time.sleep(args.periodic / 1000)
+    else:
+        do_main(args)
+
+
+if __name__ == "__main__":
+    try:
+        main()
+    except Exception as e:
+        prog = os.path.basename(sys.argv[0])
+        print(f'{prog}: {e}')
+        exit(1)
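
To make the --cont semantics easier to check by hand, here is a small
standalone sketch of the two ideas involved: turning a size string into a page
order, and counting how many size-aligned blocks fit in a contiguously mapped
range. It is an approximation written for illustration, not the patch's code:
size_to_order and count_aligned_blocks are hypothetical names, the page size
is passed in as a parameter (assumed 4K by default) instead of being read from
the system, and the counting helper ignores the THP-boundary check that
cont_parse() performs.

```python
def size_to_order(human, page_size=4096):
    # Parse "64K", "2M", ... into a page order (log2 of the number of pages).
    # Integer arithmetic is used throughout to avoid float rounding.
    units = {"K": 2**10, "M": 2**20, "G": 2**30}
    unit = 1
    if human and human[-1].upper() in units:
        unit = units[human[-1].upper()]
        human = human[:-1]
    size = int(human) * unit  # int() raises ValueError on non-integer input
    pages, rem = divmod(size, page_size)
    if rem or pages < 2 or pages & (pages - 1):
        raise ValueError('size must be a power-of-2 number of pages, at least 2')
    return pages.bit_length() - 1


def count_aligned_blocks(start, length, block):
    # Count block-sized, block-aligned chunks inside [start, start + length).
    # All values are in bytes; block must be a power of 2.
    first = (start + block - 1) & ~(block - 1)  # round start up to a boundary
    end = start + length
    if first >= end:
        return 0
    return (end - first) // block


K = 1024
print(size_to_order("64K"))                          # order of a 64K block with 4K pages
print(count_aligned_blocks(0, 2 * K * K, 64 * K))    # blocks provided by an aligned 2M THP
```

The counting helper reproduces the cases from the proposed help text: an
aligned 128K mapping yields two 64K blocks, while a mapping that covers only
the first 100K of a 128K THP yields one.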