Message ID | 20240102153828.1002295-1-ryan.roberts@arm.com (mailing list archive) |
---|---|
State | New |
Series | [RFC,v1] tools/mm: Add thpmaps script to dump THP usage info |
On Wed, Jan 3, 2024 at 4:38 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > With the proliferation of large folios for file-backed memory, and more > recently the introduction of multi-size THP for anonymous memory, it is > becoming useful to be able to see exactly how large folios are mapped > into processes. For some architectures (e.g. arm64), if most memory is > mapped using contpte-sized and -aligned blocks, TLB usage can be > optimized so it's useful to see where these requirements are and are not > being met. > > thpmaps is a Python utility that reads /proc/<pid>/smaps, > /proc/<pid>/pagemap and /proc/kpageflags to print information about how > transparent huge pages (both file and anon) are mapped to a specified > process or cgroup. It aims to help users debug and optimize their > workloads. In future we may wish to introduce stats directly into the > kernel (e.g. smaps or similar), but for now this provides a short term > solution without the need to introduce any new ABI. > > Run with help option for a full listing of the arguments: > > # thpmaps --help > > --8<-- > usage: thpmaps [-h] [--pid pid] [--cgroup path] [--summary] > [--cont size[KMG]] [--inc-smaps] [--inc-empty] > [--periodic sleep_ms] > > Prints information about how transparent huge pages are mapped to a > specified process or cgroup. Shows statistics for fully-mapped THPs of > every size, mapped both naturally aligned and unaligned for both file > and anonymous memory. See [anon|file]-thp-[aligned|unaligned]-<size>kB > keys. Shows statistics for mapped pages that belong to a THP but which > are not fully mapped. See [anon|file]-thp-partial keys. Optionally > shows statistics for naturally aligned, contiguous blocks of memory of > a specified size (when --cont is provided). See [anon|file]-cont- > aligned-<size>kB keys. Statistics are shown in kB and as a percentage > of either total anon or file memory as appropriate. > > options: > -h, --help show this help message and exit > --pid pid Process id of the target process. Exactly one of > --pid and --cgroup must be provided. > --cgroup path Path to the target cgroup in sysfs. Iterates > over every pid in the cgroup. Exactly one of > --pid and --cgroup must be provided. > --summary Sum the per-vma statistics to provide a summary > over the whole process or cgroup. > --cont size[KMG] Adds anon and file stats for naturally aligned, > contiguously mapped blocks of the specified > size. May be issued multiple times to track > multiple sized blocks. Useful to infer e.g. > arm64 contpte and hpa mappings. Size must be a > power-of-2 number of pages. > --inc-smaps Include all numerical, additive > /proc/<pid>/smaps stats in the output. > --inc-empty Show all statistics including those whose value > is 0. > --periodic sleep_ms Run in a loop, polling every sleep_ms > milliseconds. > > Requires root privilege to access pagemap and kpageflags. 
> --8<-- > > Example command to summarise fully and partially mapped THPs and 64K > contiguous blocks over all VMAs in a single process (--inc-empty forces > printing stats that are 0): > > # ./thpmaps --pid 10837 --cont 64K --summary --inc-empty > > --8<-- > anon-thp-aligned-16kB: 16 kB ( 0%) > anon-thp-aligned-32kB: 0 kB ( 0%) > anon-thp-aligned-64kB: 4194304 kB (100%) > anon-thp-aligned-128kB: 0 kB ( 0%) > anon-thp-aligned-256kB: 0 kB ( 0%) > anon-thp-aligned-512kB: 0 kB ( 0%) > anon-thp-aligned-1024kB: 0 kB ( 0%) > anon-thp-aligned-2048kB: 0 kB ( 0%) > anon-thp-unaligned-16kB: 0 kB ( 0%) > anon-thp-unaligned-32kB: 0 kB ( 0%) > anon-thp-unaligned-64kB: 0 kB ( 0%) > anon-thp-unaligned-128kB: 0 kB ( 0%) > anon-thp-unaligned-256kB: 0 kB ( 0%) > anon-thp-unaligned-512kB: 0 kB ( 0%) > anon-thp-unaligned-1024kB: 0 kB ( 0%) > anon-thp-unaligned-2048kB: 0 kB ( 0%) > anon-thp-partial: 0 kB ( 0%) > file-thp-aligned-16kB: 16 kB ( 1%) > file-thp-aligned-32kB: 64 kB ( 5%) > file-thp-aligned-64kB: 640 kB (50%) > file-thp-aligned-128kB: 128 kB (10%) > file-thp-aligned-256kB: 0 kB ( 0%) > file-thp-aligned-512kB: 0 kB ( 0%) > file-thp-aligned-1024kB: 0 kB ( 0%) > file-thp-aligned-2048kB: 0 kB ( 0%) > file-thp-unaligned-16kB: 16 kB ( 1%) > file-thp-unaligned-32kB: 32 kB ( 3%) > file-thp-unaligned-64kB: 64 kB ( 5%) > file-thp-unaligned-128kB: 0 kB ( 0%) > file-thp-unaligned-256kB: 0 kB ( 0%) > file-thp-unaligned-512kB: 0 kB ( 0%) > file-thp-unaligned-1024kB: 0 kB ( 0%) > file-thp-unaligned-2048kB: 0 kB ( 0%) > file-thp-partial: 12 kB ( 1%) > anon-cont-aligned-64kB: 4194304 kB (100%) > file-cont-aligned-64kB: 768 kB (61%) > --8<-- > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- Hi Ryan, I ran a couple of test cases with different parameters, it seems to work correctly. just i don't understand the below, what is the meaning of 000000ce at the beginning of each line? 
/thpmaps --pid 206 --cont 64K 000000ce 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 /root/a.out 000000ce 0000aaaadbb3f000-0000aaaadbb40000 r--p 0000f000 fe:00 00426969 /root/a.out 000000ce 0000aaaadbb40000-0000aaaadbb41000 rw-p 00010000 fe:00 00426969 /root/a.out 000000ce 0000ffff702c0000-0000ffffb02c0000 rw-p 00000000 00:00 00000000 anon-thp-aligned-64kB: 473920 kB (100%) anon-cont-aligned-64kB: 473920 kB (100%) 000000ce 0000ffffb02c0000-0000ffffb044c000 r-xp 00000000 fe:00 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 000000ce 0000ffffb044c000-0000ffffb045d000 ---p 0018c000 fe:00 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 000000ce 0000ffffb045d000-0000ffffb0460000 r--p 0018d000 fe:00 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 000000ce 0000ffffb0460000-0000ffffb0462000 rw-p 00190000 fe:00 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 000000ce 0000ffffb0462000-0000ffffb046f000 rw-p 00000000 00:00 00000000 000000ce 0000ffffb0477000-0000ffffb049d000 r-xp 00000000 fe:00 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 000000ce 0000ffffb04b0000-0000ffffb04b2000 rw-p 00000000 00:00 00000000 000000ce 0000ffffb04b2000-0000ffffb04b4000 r--p 00000000 00:00 00000000 [vvar] 000000ce 0000ffffb04b4000-0000ffffb04b5000 r-xp 00000000 00:00 00000000 [vdso] 000000ce 0000ffffb04b5000-0000ffffb04b7000 r--p 0002e000 fe:00 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 000000ce 0000ffffb04b7000-0000ffffb04b9000 rw-p 00030000 fe:00 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 000000ce 0000ffffdaba4000-0000ffffdabc5000 rw-p 00000000 00:00 00000000 [stack] > > I've found this very useful for debugging, and I know others have requested a > way to check if mTHP and contpte is working, so thought this might a good short > term solution until we figure out how best to add stats in the kernel? > > Thanks, > Ryan > > tools/mm/Makefile | 9 +- > tools/mm/thpmaps | 573 ++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 578 insertions(+), 4 deletions(-) > create mode 100755 tools/mm/thpmaps > > diff --git a/tools/mm/Makefile b/tools/mm/Makefile > index 1c5606cc3334..7bb03606b9ea 100644 > --- a/tools/mm/Makefile > +++ b/tools/mm/Makefile > @@ -3,7 +3,8 @@ > # > include ../scripts/Makefile.include > > -TARGETS=page-types slabinfo page_owner_sort > +BUILD_TARGETS=page-types slabinfo page_owner_sort > +INSTALL_TARGETS = $(BUILD_TARGETS) thpmaps > > LIB_DIR = ../lib/api > LIBS = $(LIB_DIR)/libapi.a > @@ -11,9 +12,9 @@ LIBS = $(LIB_DIR)/libapi.a > CFLAGS += -Wall -Wextra -I../lib/ -pthread > LDFLAGS += $(LIBS) -pthread > > -all: $(TARGETS) > +all: $(BUILD_TARGETS) > > -$(TARGETS): $(LIBS) > +$(BUILD_TARGETS): $(LIBS) > > $(LIBS): > make -C $(LIB_DIR) > @@ -29,4 +30,4 @@ sbindir ?= /usr/sbin > > install: all > install -d $(DESTDIR)$(sbindir) > - install -m 755 -p $(TARGETS) $(DESTDIR)$(sbindir) > + install -m 755 -p $(INSTALL_TARGETS) $(DESTDIR)$(sbindir) > diff --git a/tools/mm/thpmaps b/tools/mm/thpmaps > new file mode 100755 > index 000000000000..af9b19f63eb4 > --- /dev/null > +++ b/tools/mm/thpmaps > @@ -0,0 +1,573 @@ > +#!/usr/bin/env python3 > +# SPDX-License-Identifier: GPL-2.0-only > +# Copyright (C) 2024 ARM Ltd. > +# > +# Utility providing smaps-like output detailing transparent hugepage usage. 
> +# For more info, run: > +# ./thpmaps --help > +# > +# Requires numpy: > +# pip3 install numpy > + > + > +import argparse > +import collections > +import math > +import os > +import re > +import resource > +import shutil > +import sys > +import time > +import numpy as np > + > + > +with open('/sys/kernel/mm/transparent_hugepage/hpage_pmd_size') as f: > + PAGE_SIZE = resource.getpagesize() > + PAGE_SHIFT = int(math.log2(PAGE_SIZE)) > + PMD_SIZE = int(f.read()) > + PMD_ORDER = int(math.log2(PMD_SIZE / PAGE_SIZE)) > + > + > +def align_forward(v, a): > + return (v + (a - 1)) & ~(a - 1) > + > + > +def align_offset(v, a): > + return v & (a - 1) > + > + > +def nrkb(nr): > + # Convert number of pages to KB. > + return (nr << PAGE_SHIFT) >> 10 > + > + > +def odkb(order): > + # Convert page order to KB. > + return nrkb(1 << order) > + > + > +def cont_ranges_all(arrs): > + # Given a list of arrays, find the ranges for which values are monotonically > + # incrementing in all arrays. > + assert(len(arrs) > 0) > + sz = len(arrs[0]) > + for arr in arrs: > + assert(arr.shape == (sz,)) > + r = np.full(sz, 2) > + d = np.diff(arrs[0]) == 1 > + for dd in [np.diff(arr) == 1 for arr in arrs[1:]]: > + d &= dd > + r[1:] -= d > + r[:-1] -= d > + return [np.repeat(arr, r).reshape(-1, 2) for arr in arrs] > + > + > +class ArgException(Exception): > + pass > + > + > +class FileIOException(Exception): > + pass > + > + > +class BinArrayFile: > + # Base class used to read /proc/<pid>/pagemap and /proc/kpageflags into a > + # numpy array. Use inherrited class in a with clause to ensure file is > + # closed when it goes out of scope. > + def __init__(self, filename, element_size): > + self.element_size = element_size > + self.filename = filename > + self.fd = os.open(self.filename, os.O_RDONLY) > + > + def cleanup(self): > + os.close(self.fd) > + > + def __enter__(self): > + return self > + > + def __exit__(self, exc_type, exc_val, exc_tb): > + self.cleanup() > + > + def _readin(self, offset, buffer): > + length = os.preadv(self.fd, (buffer,), offset) > + if len(buffer) != length: > + raise FileIOException('error: {} failed to read {} bytes at {:x}' > + .format(self.filename, len(buffer), offset)) > + > + def _toarray(self, buf): > + assert(self.element_size == 8) > + return np.frombuffer(buf, dtype=np.uint64) > + > + def getv(self, vec): > + sz = 0 > + for region in vec: > + sz += int(region[1] - region[0] + 1) * self.element_size > + buf = bytearray(sz) > + view = memoryview(buf) > + pos = 0 > + for region in vec: > + offset = int(region[0]) * self.element_size > + length = int(region[1] - region[0] + 1) * self.element_size > + self._readin(offset, view[pos:pos+length]) > + pos += length > + return self._toarray(buf) > + > + def get(self, index, nr=1): > + offset = index * self.element_size > + length = nr * self.element_size > + buf = bytearray(length) > + self._readin(offset, buf) > + return self._toarray(buf) > + > + > +PM_PAGE_PRESENT = 1 << 63 > +PM_PFN_MASK = (1 << 55) - 1 > + > +class PageMap(BinArrayFile): > + # Read ranges of a given pid's pagemap into a numpy array. > + def __init__(self, pid='self'): > + super().__init__(f'/proc/{pid}/pagemap', 8) > + > + > +KPF_ANON = 1 << 12 > +KPF_COMPOUND_HEAD = 1 << 15 > +KPF_COMPOUND_TAIL = 1 << 16 > + > +class KPageFlags(BinArrayFile): > + # Read ranges of /proc/kpageflags into a numpy array. 
> + def __init__(self): > + super().__init__(f'/proc/kpageflags', 8) > + > + > +VMA = collections.namedtuple('VMA', [ > + 'name', > + 'start', > + 'end', > + 'read', > + 'write', > + 'execute', > + 'private', > + 'pgoff', > + 'major', > + 'minor', > + 'inode', > + 'stats', > +]) > + > +class VMAList: > + # A container for VMAs, parsed from /proc/<pid>/smaps. Iterate over the > + # instance to receive VMAs. > + head_regex = re.compile(r"^([\da-f]+)-([\da-f]+) ([r-])([w-])([x-])([ps]) ([\da-f]+) ([\da-f]+):([\da-f]+) ([\da-f]+)\s*(.*)$") > + kb_item_regex = re.compile(r"(\w+):\s*(\d+)\s*kB") > + > + def __init__(self, pid='self'): > + def is_vma(line): > + return self.head_regex.search(line) != None > + > + def get_vma(line): > + m = self.head_regex.match(line) > + if m is None: > + return None > + return VMA( > + name=m.group(11), > + start=int(m.group(1), 16), > + end=int(m.group(2), 16), > + read=m.group(3) == 'r', > + write=m.group(4) == 'w', > + execute=m.group(5) == 'x', > + private=m.group(6) == 'p', > + pgoff=int(m.group(7), 16), > + major=int(m.group(8), 16), > + minor=int(m.group(9), 16), > + inode=int(m.group(10), 16), > + stats={}, > + ) > + > + def get_value(line): > + # Currently only handle the KB stats because they are summed for > + # --summary. Core code doesn't know how to combine other stats. > + exclude = ['KernelPageSize', 'MMUPageSize'] > + m = self.kb_item_regex.search(line) > + if m: > + param = m.group(1) > + if param not in exclude: > + value = int(m.group(2)) > + return param, value > + return None, None > + > + def parse_smaps(file): > + vmas = [] > + i = 0 > + > + line = file.readline() > + > + while True: > + if not line: > + break > + line = line.strip() > + > + i += 1 > + > + vma = get_vma(line) > + if vma is None: > + raise FileIOException(f'error: could not parse line {i}: "{line}"') > + > + while True: > + line = file.readline() > + if not line: > + break > + line = line.strip() > + if is_vma(line): > + break > + > + i += 1 > + > + param, value = get_value(line) > + if param: > + vma.stats[param] = {'type': None, 'value': value} > + > + vmas.append(vma) > + > + return vmas > + > + with open(f'/proc/{pid}/smaps', 'r') as file: > + self.vmas = parse_smaps(file) > + > + def __iter__(self): > + yield from self.vmas > + > + > +def thp_parse(max_order, kpageflags, vfns, pfns, anons, heads): > + # Given 4 same-sized arrays representing a range within a page table backed > + # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons: > + # True if page is anonymous, heads: True if page is head of a THP), return a > + # dictionary of statistics describing the mapped THPs. > + stats = { > + 'file': { > + 'partial': 0, > + 'aligned': [0] * (max_order + 1), > + 'unaligned': [0] * (max_order + 1), > + }, > + 'anon': { > + 'partial': 0, > + 'aligned': [0] * (max_order + 1), > + 'unaligned': [0] * (max_order + 1), > + }, > + } > + > + indexes = np.arange(len(vfns), dtype=np.uint64) > + ranges = cont_ranges_all([indexes, vfns, pfns]) > + for rindex, rpfn in zip(ranges[0], ranges[2]): > + index_next = int(rindex[0]) > + index_end = int(rindex[1]) + 1 > + pfn_end = int(rpfn[1]) + 1 > + > + folios = indexes[index_next:index_end][heads[index_next:index_end]] > + > + # Account pages for any partially mapped THP at the front. In that case, > + # the first page of the range is a tail. 
> + nr = (int(folios[0]) if len(folios) else index_end) - index_next > + stats['anon' if anons[index_next] else 'file']['partial'] += nr > + > + # Account pages for any partially mapped THP at the back. In that case, > + # the next page after the range is a tail. > + if len(folios): > + flags = int(kpageflags.get(pfn_end)[0]) > + if flags & KPF_COMPOUND_TAIL: > + nr = index_end - int(folios[-1]) > + folios = folios[:-1] > + index_end -= nr > + stats['anon' if anons[index_end - 1] else 'file']['partial'] += nr > + > + # Account fully mapped THPs in the middle of the range. > + if len(folios): > + folio_nrs = np.append(np.diff(folios), np.uint64(index_end - folios[-1])) > + folio_orders = np.log2(folio_nrs).astype(np.uint64) > + for index, order in zip(folios, folio_orders): > + index = int(index) > + order = int(order) > + nr = 1 << order > + vfn = int(vfns[index]) > + align = 'aligned' if align_forward(vfn, nr) == vfn else 'unaligned' > + anon = 'anon' if anons[index] else 'file' > + stats[anon][align][order] += nr > + > + rstats = {} > + > + def flatten_sub(type, subtype, stats): > + for od, nr in enumerate(stats[2:], 2): > + rstats[f"{type}-thp-{subtype}-{odkb(od)}kB"] = {'type': type, 'value': nrkb(nr)} > + > + def flatten_type(type, stats): > + flatten_sub(type, 'aligned', stats['aligned']) > + flatten_sub(type, 'unaligned', stats['unaligned']) > + rstats[f"{type}-thp-partial"] = {'type': type, 'value': nrkb(stats['partial'])} > + > + flatten_type('anon', stats['anon']) > + flatten_type('file', stats['file']) > + > + return rstats > + > + > +def cont_parse(order, vfns, pfns, anons, heads): > + # Given 4 same-sized arrays representing a range within a page table backed > + # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons: > + # True if page is anonymous, heads: True if page is head of a THP), return a > + # dictionary of statistics describing the contiguous blocks. > + nr_cont = 1 << order > + nr_anon = 0 > + nr_file = 0 > + > + ranges = cont_ranges_all([np.arange(len(vfns), dtype=np.uint64), vfns, pfns]) > + for rindex, rvfn, rpfn in zip(*ranges): > + index_next = int(rindex[0]) > + index_end = int(rindex[1]) + 1 > + vfn_start = int(rvfn[0]) > + pfn_start = int(rpfn[0]) > + > + if align_offset(pfn_start, nr_cont) != align_offset(vfn_start, nr_cont): > + continue > + > + off = align_forward(vfn_start, nr_cont) - vfn_start > + index_next += off > + > + while index_next + nr_cont <= index_end: > + folio_boundary = heads[index_next+1:index_next+nr_cont].any() > + if not folio_boundary: > + if anons[index_next]: > + nr_anon += nr_cont > + else: > + nr_file += nr_cont > + index_next += nr_cont > + > + return { > + f"anon-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'anon', 'value': nrkb(nr_anon)}, > + f"file-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'file', 'value': nrkb(nr_file)}, > + } > + > + > +def vma_print(vma, pid): > + # Prints a VMA instance in a format similar to smaps. The main difference is > + # that the pid is included as the first value. > + print("{:08x} {:016x}-{:016x} {}{}{}{} {:08x} {:02x}:{:02x} {:08x} {}" > + .format( > + pid, vma.start, vma.end, > + 'r' if vma.read else '-', 'w' if vma.write else '-', > + 'x' if vma.execute else '-', 'p' if vma.private else 's', > + vma.pgoff, vma.major, vma.minor, vma.inode, vma.name > + )) > + > + > +def stats_print(stats, tot_anon, tot_file, inc_empty): > + # Print a statistics dictionary. 
> + label_field = 32 > + for label, stat in stats.items(): > + type = stat['type'] > + value = stat['value'] > + if value or inc_empty: > + pad = max(0, label_field - len(label) - 1) > + if type == 'anon': > + percent = f' ({value / tot_anon:3.0%})' > + elif type == 'file': > + percent = f' ({value / tot_file:3.0%})' > + else: > + percent = '' > + print(f"{label}:{' ' * pad}{value:8} kB{percent}") > + > + > +def vma_parse(vma, pagemap, kpageflags, contorders): > + # Generate thp and cont statistics for a single VMA. > + start = vma.start >> PAGE_SHIFT > + end = vma.end >> PAGE_SHIFT > + > + pmes = pagemap.get(start, end - start) > + present = pmes & PM_PAGE_PRESENT != 0 > + pfns = pmes & PM_PFN_MASK > + pfns = pfns[present] > + vfns = np.arange(start, end, dtype=np.uint64) > + vfns = vfns[present] > + > + flags = kpageflags.getv(cont_ranges_all([pfns])[0]) > + anons = flags & KPF_ANON != 0 > + heads = flags & KPF_COMPOUND_HEAD != 0 > + tails = flags & KPF_COMPOUND_TAIL != 0 > + thps = heads | tails > + > + tot_anon = np.count_nonzero(anons) > + tot_file = np.size(anons) - tot_anon > + tot_anon = nrkb(tot_anon) > + tot_file = nrkb(tot_file) > + > + vfns = vfns[thps] > + pfns = pfns[thps] > + anons = anons[thps] > + heads = heads[thps] > + > + thpstats = thp_parse(PMD_ORDER, kpageflags, vfns, pfns, anons, heads) > + contstats = [cont_parse(order, vfns, pfns, anons, heads) for order in contorders] > + > + return { > + **thpstats, > + **{k: v for s in contstats for k, v in s.items()} > + }, tot_anon, tot_file > + > + > +def do_main(args): > + pids = set() > + summary = {} > + summary_anon = 0 > + summary_file = 0 > + > + if args.cgroup: > + with open(f'{args.cgroup}/cgroup.procs') as pidfile: > + for line in pidfile.readlines(): > + pids.add(int(line.strip())) > + else: > + pids.add(args.pid) > + > + for pid in pids: > + try: > + with PageMap(pid) as pagemap: > + with KPageFlags() as kpageflags: > + for vma in VMAList(pid): > + if (vma.read or vma.write or vma.execute) and vma.stats['Rss']['value'] > 0: > + stats, vma_anon, vma_file = vma_parse(vma, pagemap, kpageflags, args.cont) > + else: > + stats = {} > + vma_anon = 0 > + vma_file = 0 > + if args.inc_smaps: > + stats = {**vma.stats, **stats} > + if args.summary: > + for k, v in stats.items(): > + if k in summary: > + assert(summary[k]['type'] == v['type']) > + summary[k]['value'] += v['value'] > + else: > + summary[k] = v > + summary_anon += vma_anon > + summary_file += vma_file > + else: > + vma_print(vma, pid) > + stats_print(stats, vma_anon, vma_file, args.inc_empty) > + except FileNotFoundError: > + if not args.cgroup: > + raise > + except ProcessLookupError: > + if not args.cgroup: > + raise > + > + if args.summary: > + stats_print(summary, summary_anon, summary_file, args.inc_empty) > + > + > +def main(): > + def formatter(prog): > + width = shutil.get_terminal_size().columns > + width -= 2 > + width = min(80, width) > + return argparse.HelpFormatter(prog, width=width) > + > + def size2order(human): > + units = {"K": 2**10, "M": 2**20, "G": 2**30} > + unit = 1 > + if human[-1] in units: > + unit = units[human[-1]] > + human = human[:-1] > + try: > + size = int(human) > + except ValueError: > + raise ArgException('error: --cont value must be integer size with optional KMG unit') > + size *= unit > + order = int(math.log2(size / PAGE_SIZE)) > + if order < 1: > + raise ArgException('error: --cont value must be size of at least 2 pages') > + if (1 << order) * PAGE_SIZE != size: > + raise ArgException('error: --cont value must be size of 
power-of-2 pages') > + return order > + > + parser = argparse.ArgumentParser(formatter_class=formatter, > + description="""Prints information about how transparent huge pages are > + mapped to a specified process or cgroup. > + > + Shows statistics for fully-mapped THPs of every size, mapped > + both naturally aligned and unaligned for both file and > + anonymous memory. See > + [anon|file]-thp-[aligned|unaligned]-<size>kB keys. > + > + Shows statistics for mapped pages that belong to a THP but > + which are not fully mapped. See [anon|file]-thp-partial > + keys. > + > + Optionally shows statistics for naturally aligned, > + contiguous blocks of memory of a specified size (when --cont > + is provided). See [anon|file]-cont-aligned-<size>kB keys. > + > + Statistics are shown in kB and as a percentage of either > + total anon or file memory as appropriate.""", > + epilog="""Requires root privilege to access pagemap and kpageflags.""") > + > + parser.add_argument('--pid', > + metavar='pid', required=False, type=int, > + help="""Process id of the target process. Exactly one of --pid and > + --cgroup must be provided.""") > + > + parser.add_argument('--cgroup', > + metavar='path', required=False, > + help="""Path to the target cgroup in sysfs. Iterates over every pid in > + the cgroup. Exactly one of --pid and --cgroup must be provided.""") > + > + parser.add_argument('--summary', > + required=False, default=False, action='store_true', > + help="""Sum the per-vma statistics to provide a summary over the whole > + process or cgroup.""") > + > + parser.add_argument('--cont', > + metavar='size[KMG]', required=False, default=[], action='append', > + help="""Adds anon and file stats for naturally aligned, contiguously > + mapped blocks of the specified size. May be issued multiple times to > + track multiple sized blocks. Useful to infer e.g. arm64 contpte and > + hpa mappings. Size must be a power-of-2 number of pages.""") > + > + parser.add_argument('--inc-smaps', > + required=False, default=False, action='store_true', > + help="""Include all numerical, additive /proc/<pid>/smaps stats in the > + output.""") > + > + parser.add_argument('--inc-empty', > + required=False, default=False, action='store_true', > + help="""Show all statistics including those whose value is 0.""") > + > + parser.add_argument('--periodic', > + metavar='sleep_ms', required=False, type=int, > + help="""Run in a loop, polling every sleep_ms milliseconds.""") > + > + args = parser.parse_args() > + > + try: > + if (args.pid and args.cgroup) or \ > + (not args.pid and not args.cgroup): > + raise ArgException("error: Exactly one of --pid and --cgroup must be provided.") > + > + args.cont = [size2order(cont) for cont in args.cont] > + except ArgException as e: > + parser.print_usage() > + raise > + > + if args.periodic: > + while True: > + do_main(args) > + print() > + time.sleep(args.periodic / 1000) > + else: > + do_main(args) > + > + > +if __name__ == "__main__": > + try: > + main() > + except Exception as e: > + prog = os.path.basename(sys.argv[0]) > + print(f'{prog}: {e}') > + exit(1) > -- > 2.25.1 >
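The script's per-page data comes from two procfs interfaces quoted above: /proc/<pid>/pagemap (one 64-bit record per virtual page; bit 63 says whether the page is present and bits 0-54 hold the PFN) and /proc/kpageflags (one 64-bit flags word per PFN, where bits 12, 15 and 16 are KPF_ANON, KPF_COMPOUND_HEAD and KPF_COMPOUND_TAIL). The sketch below is not part of the patch; it shows, assuming root privilege, how a single virtual address could be resolved to its PFN and folio flags. The lookup() helper and the ctypes demo are illustrative only; the real script batches these reads through numpy for speed.

--8<--
import resource
import struct

PAGE_SIZE = resource.getpagesize()
PM_PAGE_PRESENT = 1 << 63
PM_PFN_MASK = (1 << 55) - 1          # PFN occupies bits 0-54 of a pagemap entry
KPF_ANON = 1 << 12
KPF_COMPOUND_HEAD = 1 << 15
KPF_COMPOUND_TAIL = 1 << 16

def lookup(vaddr, pid='self'):
    # Hypothetical helper, for illustration only: resolve one virtual address
    # to its PFN and folio-related kpageflags. Needs root to see real PFNs
    # and to read /proc/kpageflags.
    vfn = vaddr // PAGE_SIZE
    with open(f'/proc/{pid}/pagemap', 'rb') as f:
        f.seek(vfn * 8)
        pme, = struct.unpack('=Q', f.read(8))
    if not pme & PM_PAGE_PRESENT:
        return None                  # not present; the real script skips these
    pfn = pme & PM_PFN_MASK
    with open('/proc/kpageflags', 'rb') as f:
        f.seek(pfn * 8)
        flags, = struct.unpack('=Q', f.read(8))
    return {
        'pfn': pfn,
        'anon': bool(flags & KPF_ANON),
        'head': bool(flags & KPF_COMPOUND_HEAD),
        'tail': bool(flags & KPF_COMPOUND_TAIL),
    }

if __name__ == '__main__':
    import ctypes
    buf = ctypes.create_string_buffer(PAGE_SIZE)
    buf[0] = b'x'                    # touch the page so it is faulted in
    print(lookup(ctypes.addressof(buf)))
--8<--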
> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote: > > Hi Ryan, > > I ran a couple of test cases with different parameters, it seems to > work correctly. > just i don't understand the below, what is the meaning of 000000ce at > the beginning of > each line? It's the pid; 0xce is the specified pid, 206. Perhaps the pid should be printed in decimal? -- William Kucharski > /thpmaps --pid 206 --cont 64K > 000000ce 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 > 00426969 /root/a.out > 000000ce 0000aaaadbb3f000-0000aaaadbb40000 r--p 0000f000 fe:00 > 00426969 /root/a.out > 000000ce 0000aaaadbb40000-0000aaaadbb41000 rw-p 00010000 fe:00 > 00426969 /root/a.out > 000000ce 0000ffff702c0000-0000ffffb02c0000 rw-p 00000000 00:00 00000000 > anon-thp-aligned-64kB: 473920 kB (100%) > anon-cont-aligned-64kB: 473920 kB (100%) > 000000ce 0000ffffb02c0000-0000ffffb044c000 r-xp 00000000 fe:00 > 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 > 000000ce 0000ffffb044c000-0000ffffb045d000 ---p 0018c000 fe:00 > 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 > 000000ce 0000ffffb045d000-0000ffffb0460000 r--p 0018d000 fe:00 > 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 > 000000ce 0000ffffb0460000-0000ffffb0462000 rw-p 00190000 fe:00 > 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 > 000000ce 0000ffffb0462000-0000ffffb046f000 rw-p 00000000 00:00 00000000 > 000000ce 0000ffffb0477000-0000ffffb049d000 r-xp 00000000 fe:00 > 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 > 000000ce 0000ffffb04b0000-0000ffffb04b2000 rw-p 00000000 00:00 00000000 > 000000ce 0000ffffb04b2000-0000ffffb04b4000 r--p 00000000 00:00 00000000 [vvar] > 000000ce 0000ffffb04b4000-0000ffffb04b5000 r-xp 00000000 00:00 00000000 [vdso] > 000000ce 0000ffffb04b5000-0000ffffb04b7000 r--p 0002e000 fe:00 > 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 > 000000ce 0000ffffb04b7000-0000ffffb04b9000 rw-p 00030000 fe:00 > 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 > 000000ce 0000ffffdaba4000-0000ffffdabc5000 rw-p 00000000 00:00 00000000 [stack]
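In other words, 0xce and 206 are the same pid in two renderings; the open question in this sub-thread is only which rendering the tool should print. A trivial illustration (not from the script):

--8<--
pid = 206
print("{:08x}".format(pid))    # '000000ce'  - v1's smaps-style zero-padded hex
print("{:8d}".format(pid))     # '     206'  - the decimal rendering suggested here
--8<--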
On 03/01/2024 08:07, William Kucharski wrote: > >> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote: >> >> Hi Ryan, >> >> I ran a couple of test cases with different parameters, it seems to >> work correctly. >> just i don't understand the below, what is the meaning of 000000ce at >> the beginning of >> each line? > > It's the pid; 0xce is the specified pid, 206. Yes indeed. I added the pid to the front for the case where you are using --cgroup without --summary; in that case, each vma will be printed for each pid in the cgroup and it seemed sensible to be able to see which pid each vma belonged to. > > Perhaps the pid should be printed in decimal? I thought about printing in decimal, but every other value in the vma is in hex without a leading "0x" (I'm trying to follow the smaps convention). So I thought it could be more confusing in decimal. I'm happy to change it to decimal if that's the preference though? Although I'd like to continue to present it in a fixed width field, padded with 0s on the left so that everything lines up. > > -- William Kucharski > >> /thpmaps --pid 206 --cont 64K >> 000000ce 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 >> 00426969 /root/a.out >> 000000ce 0000aaaadbb3f000-0000aaaadbb40000 r--p 0000f000 fe:00 >> 00426969 /root/a.out >> 000000ce 0000aaaadbb40000-0000aaaadbb41000 rw-p 00010000 fe:00 >> 00426969 /root/a.out >> 000000ce 0000ffff702c0000-0000ffffb02c0000 rw-p 00000000 00:00 00000000 >> anon-thp-aligned-64kB: 473920 kB (100%) >> anon-cont-aligned-64kB: 473920 kB (100%) >> 000000ce 0000ffffb02c0000-0000ffffb044c000 r-xp 00000000 fe:00 >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 >> 000000ce 0000ffffb044c000-0000ffffb045d000 ---p 0018c000 fe:00 >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 >> 000000ce 0000ffffb045d000-0000ffffb0460000 r--p 0018d000 fe:00 >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 >> 000000ce 0000ffffb0460000-0000ffffb0462000 rw-p 00190000 fe:00 >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 >> 000000ce 0000ffffb0462000-0000ffffb046f000 rw-p 00000000 00:00 00000000 >> 000000ce 0000ffffb0477000-0000ffffb049d000 r-xp 00000000 fe:00 >> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 >> 000000ce 0000ffffb04b0000-0000ffffb04b2000 rw-p 00000000 00:00 00000000 >> 000000ce 0000ffffb04b2000-0000ffffb04b4000 r--p 00000000 00:00 00000000 [vvar] >> 000000ce 0000ffffb04b4000-0000ffffb04b5000 r-xp 00000000 00:00 00000000 [vdso] >> 000000ce 0000ffffb04b5000-0000ffffb04b7000 r--p 0002e000 fe:00 >> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 >> 000000ce 0000ffffb04b7000-0000ffffb04b9000 rw-p 00030000 fe:00 >> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 >> 000000ce 0000ffffdaba4000-0000ffffdabc5000 rw-p 00000000 00:00 00000000 [stack] >
On Wed, Jan 3, 2024 at 9:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 03/01/2024 08:07, William Kucharski wrote: > > > >> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote: > >> > >> Hi Ryan, > >> > >> I ran a couple of test cases with different parameters, it seems to > >> work correctly. > >> just i don't understand the below, what is the meaning of 000000ce at > >> the beginning of > >> each line? > > > > It's the pid; 0xce is the specified pid, 206. > > Yes indeed. I added the pid to the front for the case where you are using > --cgroup without --summary; in that case, each vma will be printed for each pid > in the cgroup and it seemed sensible to be able to see which pid each vma > belonged to. I don't understand why we have to add the pid before each line as this tool already has pid in the parameter :-) this seems like duplicated information to me. but it doesn't matter too much as this tool is really nice though it is not so easy to deploy on Android. Please feel free to add, Tested-by: Barry Song <v-songbaohua@oppo.com> > > > > > Perhaps the pid should be printed in decimal? > > I thought about printing in decimal, but every other value in the vma is in hex > without a leading "0x" (I'm trying to follow the smaps convention). So I thought > it could be more confusing in decimal. > > I'm happy to change it to decimal if that's the preference though? Although I'd > like to continue to present it in a fixed width field, padded with 0s on the > left so that everything lines up. > > > > > -- William Kucharski > > > >> /thpmaps --pid 206 --cont 64K > >> 000000ce 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 > >> 00426969 /root/a.out > >> 000000ce 0000aaaadbb3f000-0000aaaadbb40000 r--p 0000f000 fe:00 > >> 00426969 /root/a.out > >> 000000ce 0000aaaadbb40000-0000aaaadbb41000 rw-p 00010000 fe:00 > >> 00426969 /root/a.out > >> 000000ce 0000ffff702c0000-0000ffffb02c0000 rw-p 00000000 00:00 00000000 > >> anon-thp-aligned-64kB: 473920 kB (100%) > >> anon-cont-aligned-64kB: 473920 kB (100%) > >> 000000ce 0000ffffb02c0000-0000ffffb044c000 r-xp 00000000 fe:00 > >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 > >> 000000ce 0000ffffb044c000-0000ffffb045d000 ---p 0018c000 fe:00 > >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 > >> 000000ce 0000ffffb045d000-0000ffffb0460000 r--p 0018d000 fe:00 > >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 > >> 000000ce 0000ffffb0460000-0000ffffb0462000 rw-p 00190000 fe:00 > >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6 > >> 000000ce 0000ffffb0462000-0000ffffb046f000 rw-p 00000000 00:00 00000000 > >> 000000ce 0000ffffb0477000-0000ffffb049d000 r-xp 00000000 fe:00 > >> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 > >> 000000ce 0000ffffb04b0000-0000ffffb04b2000 rw-p 00000000 00:00 00000000 > >> 000000ce 0000ffffb04b2000-0000ffffb04b4000 r--p 00000000 00:00 00000000 [vvar] > >> 000000ce 0000ffffb04b4000-0000ffffb04b5000 r-xp 00000000 00:00 00000000 [vdso] > >> 000000ce 0000ffffb04b5000-0000ffffb04b7000 r--p 0002e000 fe:00 > >> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 > >> 000000ce 0000ffffb04b7000-0000ffffb04b9000 rw-p 00030000 fe:00 > >> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1 > >> 000000ce 0000ffffdaba4000-0000ffffdabc5000 rw-p 00000000 00:00 00000000 [stack] > > Thanks Barry
On 03/01/2024 09:16, Barry Song wrote: > On Wed, Jan 3, 2024 at 9:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 03/01/2024 08:07, William Kucharski wrote: >>> >>>> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote: >>>> >>>> Hi Ryan, >>>> >>>> I ran a couple of test cases with different parameters, it seems to >>>> work correctly. >>>> just i don't understand the below, what is the meaning of 000000ce at >>>> the beginning of >>>> each line? >>> >>> It's the pid; 0xce is the specified pid, 206. >> >> Yes indeed. I added the pid to the front for the case where you are using >> --cgroup without --summary; in that case, each vma will be printed for each pid >> in the cgroup and it seemed sensible to be able to see which pid each vma >> belonged to. > > I don't understand why we have to add the pid before each line as this tool > already has pid in the parameter :-) The reason is that it is also possible to invoke the tool with --cgroup instead of --pid. In this case, the tool will iterate over all the pids in the cgroup so (when --summary is not specified) having the pid associated with each vma is useful. I could change it to conditionally output the pid only when --cgroup is specified? > this seems like duplicated information > to me. but it doesn't matter too much as this tool is really nice though it is > not so easy to deploy on Android. Hmm. I've seen tutorials where people have Python running under Android, but I agree its not zero effort. Perhaps it would be better in C. Unfortuantely, I can't commit to doing a port at this point. > > Please feel free to add, > > Tested-by: Barry Song <v-songbaohua@oppo.com> Thanks! > >> >>> >>> Perhaps the pid should be printed in decimal? >> >> I thought about printing in decimal, but every other value in the vma is in hex >> without a leading "0x" (I'm trying to follow the smaps convention). So I thought >> it could be more confusing in decimal. >> >> I'm happy to change it to decimal if that's the preference though? Although I'd >> like to continue to present it in a fixed width field, padded with 0s on the >> left so that everything lines up. 
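For reference, the conditional output floated above would be a small tweak to vma_print() in the posted script. The sketch below is one possible shape; the show_pid parameter is hypothetical and not from any posted version, and the caller would pass something like show_pid=args.cgroup is not None.

--8<--
def vma_print(vma, pid, show_pid=True):
    # Sketch only: emit the pid column (zero-padded hex, as in v1) only when
    # requested, e.g. when iterating a cgroup and vmas from several pids are
    # interleaved in the output.
    prefix = '{:08x} '.format(pid) if show_pid else ''
    print('{}{:016x}-{:016x} {}{}{}{} {:08x} {:02x}:{:02x} {:08x} {}'
          .format(
              prefix, vma.start, vma.end,
              'r' if vma.read else '-', 'w' if vma.write else '-',
              'x' if vma.execute else '-', 'p' if vma.private else 's',
              vma.pgoff, vma.major, vma.minor, vma.inode, vma.name
          ))
--8<--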
> On Jan 3, 2024, at 02:35, Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 03/01/2024 09:16, Barry Song wrote: >> On Wed, Jan 3, 2024 at 9:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>> >>> On 03/01/2024 08:07, William Kucharski wrote: >>>> >>>>> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote: >>>>> >>>>> Hi Ryan, >>>>> >>>>> I ran a couple of test cases with different parameters, it seems to >>>>> work correctly. >>>>> just i don't understand the below, what is the meaning of 000000ce at >>>>> the beginning of >>>>> each line? >>>> >>>> It's the pid; 0xce is the specified pid, 206. >>> >>> Yes indeed. I added the pid to the front for the case where you are using >>> --cgroup without --summary; in that case, each vma will be printed for each pid >>> in the cgroup and it seemed sensible to be able to see which pid each vma >>> belonged to. >> >> I don't understand why we have to add the pid before each line as this tool >> already has pid in the parameter :-) > > The reason is that it is also possible to invoke the tool with --cgroup instead > of --pid. In this case, the tool will iterate over all the pids in the cgroup so > (when --summary is not specified) having the pid associated with each vma is useful. > > I could change it to conditionally output the pid only when --cgroup is specified? You could, or perhaps emit a colon after the pid to delineate it, e.g.: > 000000ce: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:0000426969 /root/a.out but then some people would probably read it as a memory address, so who knows. -- William Kucharski > >> this seems like duplicated information >> to me. but it doesn't matter too much as this tool is really nice though it is >> not so easy to deploy on Android. > > Hmm. I've seen tutorials where people have Python running under Android, but I > agree its not zero effort. Perhaps it would be better in C. Unfortuantely, I > can't commit to doing a port at this point. > >> >> Please feel free to add, >> >> Tested-by: Barry Song <v-songbaohua@oppo.com> > > Thanks! > >> >>> >>>> >>>> Perhaps the pid should be printed in decimal? >>> >>> I thought about printing in decimal, but every other value in the vma is in hex >>> without a leading "0x" (I'm trying to follow the smaps convention). So I thought >>> it could be more confusing in decimal. >>> >>> I'm happy to change it to decimal if that's the preference though? Although I'd >>> like to continue to present it in a fixed width field, padded with 0s on the >>> left so that everything lines up. 
On 03/01/2024 10:09, William Kucharski wrote: > > >> On Jan 3, 2024, at 02:35, Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 03/01/2024 09:16, Barry Song wrote: >>> On Wed, Jan 3, 2024 at 9:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> On 03/01/2024 08:07, William Kucharski wrote: >>>>> >>>>>> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote: >>>>>> >>>>>> Hi Ryan, >>>>>> >>>>>> I ran a couple of test cases with different parameters, it seems to >>>>>> work correctly. >>>>>> just i don't understand the below, what is the meaning of 000000ce at >>>>>> the beginning of >>>>>> each line? >>>>> >>>>> It's the pid; 0xce is the specified pid, 206. >>>> >>>> Yes indeed. I added the pid to the front for the case where you are using >>>> --cgroup without --summary; in that case, each vma will be printed for each pid >>>> in the cgroup and it seemed sensible to be able to see which pid each vma >>>> belonged to. >>> >>> I don't understand why we have to add the pid before each line as this tool >>> already has pid in the parameter :-) >> >> The reason is that it is also possible to invoke the tool with --cgroup instead >> of --pid. In this case, the tool will iterate over all the pids in the cgroup so >> (when --summary is not specified) having the pid associated with each vma is useful. >> >> I could change it to conditionally output the pid only when --cgroup is specified? > > You could, or perhaps emit a colon after the pid to delineate it, e.g.: > >> 000000ce: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:0000426969 /root/a.out Yeah that sounds like the least worst option. Let's go with that. > > but then some people would probably read it as a memory address, so who knows. > > -- William Kucharski > >> >>> this seems like duplicated information >>> to me. but it doesn't matter too much as this tool is really nice though it is >>> not so easy to deploy on Android. >> >> Hmm. I've seen tutorials where people have Python running under Android, but I >> agree its not zero effort. Perhaps it would be better in C. Unfortuantely, I >> can't commit to doing a port at this point. >> >>> >>> Please feel free to add, >>> >>> Tested-by: Barry Song <v-songbaohua@oppo.com> >> >> Thanks! >> >>> >>>> >>>>> >>>>> Perhaps the pid should be printed in decimal? >>>> >>>> I thought about printing in decimal, but every other value in the vma is in hex >>>> without a leading "0x" (I'm trying to follow the smaps convention). So I thought >>>> it could be more confusing in decimal. >>>> >>>> I'm happy to change it to decimal if that's the preference though? Although I'd >>>> like to continue to present it in a fixed width field, padded with 0s on the >>>> left so that everything lines up. 
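The change agreed here comes down to roughly one character in vma_print()'s format string, along these lines (a sketch, not the final v2 code):

--8<--
# v1 as posted:
fmt = '{:08x} {:016x}-{:016x} {}{}{}{} {:08x} {:02x}:{:02x} {:08x} {}'
# with the suggested colon after the zero-padded hex pid:
fmt = '{:08x}: {:016x}-{:016x} {}{}{}{} {:08x} {:02x}:{:02x} {:08x} {}'
--8<--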
On 1/3/24 02:20, Ryan Roberts wrote:
> On 03/01/2024 10:09, William Kucharski wrote:
...
>>> The reason is that it is also possible to invoke the tool with --cgroup instead
>>> of --pid. In this case, the tool will iterate over all the pids in the cgroup so
>>> (when --summary is not specified) having the pid associated with each vma is useful.
>>>
>>> I could change it to conditionally output the pid only when --cgroup is specified?
>>
>> You could, or perhaps emit a colon after the pid to delineate it, e.g.:
>>
>>> 000000ce: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:0000426969 /root/a.out
>
> Yeah that sounds like the least worst option. Let's go with that.

I'm trying this out and had the exact same issue with pid. I'd suggest:

a) pid should always be printed in decimal, because that's what ps(1) uses
   and no one expects to see it in other formats such as hex.

b) In fact, perhaps a header row would help. There could be a --no-header-row
   option for cases that want to feed this to other scripts, but the default
   would be to include a human-friendly header.

c) pid should probably be suppressed if --pid is specified, but that's
   less important than the other points.

In a day or two I'll get a chance to run this on something that allocates
lots of mTHPs, and give a closer look.

thanks,
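As an aside, suggestion (b) would amount to one more store_true flag in the script's existing argparse setup plus a guarded print. A minimal sketch, assuming the flag name proposed in this mail; the column titles are made up and nothing here comes from a posted version:

--8<--
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--no-header-row',
                    required=False, default=False, action='store_true',
                    help='Suppress the human-friendly header row, e.g. when '
                         'feeding the output to other scripts.')
args = parser.parse_args()

if not args.no_header_row:
    # Column titles are illustrative; widths would need to match vma_print().
    print('PID       START            END              PROT OFF      MJ:MN INODE    FILE')
--8<--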
On 04/01/2024 22:48, John Hubbard wrote:
> On 1/3/24 02:20, Ryan Roberts wrote:
>> On 03/01/2024 10:09, William Kucharski wrote:
> ...
>>>> The reason is that it is also possible to invoke the tool with --cgroup instead
>>>> of --pid. In this case, the tool will iterate over all the pids in the
>>>> cgroup so
>>>> (when --summary is not specified) having the pid associated with each vma is
>>>> useful.
>>>>
>>>> I could change it to conditionally output the pid only when --cgroup is
>>>> specified?
>>>
>>> You could, or perhaps emit a colon after the pid to delineate it, e.g.:
>>>
>>>> 000000ce: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:0000426969
>>>> /root/a.out
>>
>> Yeah that sounds like the least worst option. Let's go with that.
>
> I'm trying this out and had the exact same issue with pid. I'd suggest:
>
> a) pid should always be printed in decimal, because that's what ps(1) uses
> and no one expects to see it in other formats such as hex.

Right aligned with 0 or ' ' as the pad? I guess ' ' if you want it to look
like ps? But given pid is the first column, I think it will look weird right
aligned. Perhaps left aligned, followed by colon, followed by pad? Here are
the 3 options:

00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
     206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
206:      0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969

My personal preference is the first option; right aligned with 0 pad.

> b) In fact, perhaps a header row would help. There could be a --no-header-row
> option for cases that want to feed this to other scripts, but the default
> would be to include a human-friendly header.

How about this for a header (with example first data row):

PID       START            END              PROT OFF      MJ:MN INODE    FILE
00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969

Personally I wouldn't bother with a --no-header option; just keep it always on.

> c) pid should probably be suppressed if --pid is specified, but that's
> less important than the other points.

If we have the header then I think it's clear what it is and I'd prefer to
keep the data format consistent between --pid and --cgroup. So prefer to
leave pid in always.

> In a day or two I'll get a chance to run this on something that allocates
> lots of mTHPs, and give a closer look.

Thanks - it would be great to get some feedback on the usefulness of the
actual counters! :)

I'm considering adding an --ignore-folio-boundaries option, which would modify
the way the cont counters work, to only look for contiguity and alignment and
ignore any folio boundaries. At the moment, if you have multiple contiguous
folios, they don't count, because the memory doesn't all belong to the same
folio. I think this could be useful in some (limited) circumstances.

>
> thanks,
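The --ignore-folio-boundaries idea mentioned at the end would hook into the inner loop of cont_parse() in the posted script, which currently refuses to count a block if any page after the first is a compound head. Below is a hedged sketch of that loop with a hypothetical ignore_boundaries parameter added; it is a fragment of the existing function, not standalone code, and the option itself would presumably be one more store_true argument threaded through from main().

--8<--
    # Fragment of cont_parse() as posted, with a hypothetical knob added.
    while index_next + nr_cont <= index_end:
        if ignore_boundaries:
            # Count any naturally aligned, contiguous block, even if it is
            # assembled from several smaller folios.
            folio_boundary = False
        else:
            # v1 behaviour: a compound head anywhere after the first page
            # means the block straddles a folio boundary, so skip it.
            folio_boundary = heads[index_next+1:index_next+nr_cont].any()
        if not folio_boundary:
            if anons[index_next]:
                nr_anon += nr_cont
            else:
                nr_file += nr_cont
        index_next += nr_cont
--8<--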
On 02/01/2024 15:38, Ryan Roberts wrote: > With the proliferation of large folios for file-backed memory, and more > recently the introduction of multi-size THP for anonymous memory, it is > becoming useful to be able to see exactly how large folios are mapped > into processes. For some architectures (e.g. arm64), if most memory is > mapped using contpte-sized and -aligned blocks, TLB usage can be > optimized so it's useful to see where these requirements are and are not > being met. > > thpmaps is a Python utility that reads /proc/<pid>/smaps, > /proc/<pid>/pagemap and /proc/kpageflags to print information about how > transparent huge pages (both file and anon) are mapped to a specified > process or cgroup. It aims to help users debug and optimize their > workloads. In future we may wish to introduce stats directly into the > kernel (e.g. smaps or similar), but for now this provides a short term > solution without the need to introduce any new ABI. > > Run with help option for a full listing of the arguments: > > # thpmaps --help > > --8<-- > usage: thpmaps [-h] [--pid pid] [--cgroup path] [--summary] > [--cont size[KMG]] [--inc-smaps] [--inc-empty] > [--periodic sleep_ms] > > Prints information about how transparent huge pages are mapped to a > specified process or cgroup. Shows statistics for fully-mapped THPs of > every size, mapped both naturally aligned and unaligned for both file > and anonymous memory. See [anon|file]-thp-[aligned|unaligned]-<size>kB > keys. Shows statistics for mapped pages that belong to a THP but which > are not fully mapped. See [anon|file]-thp-partial keys. Optionally > shows statistics for naturally aligned, contiguous blocks of memory of > a specified size (when --cont is provided). See [anon|file]-cont- > aligned-<size>kB keys. Statistics are shown in kB and as a percentage > of either total anon or file memory as appropriate. > > options: > -h, --help show this help message and exit > --pid pid Process id of the target process. Exactly one of > --pid and --cgroup must be provided. > --cgroup path Path to the target cgroup in sysfs. Iterates > over every pid in the cgroup. Exactly one of > --pid and --cgroup must be provided. > --summary Sum the per-vma statistics to provide a summary > over the whole process or cgroup. > --cont size[KMG] Adds anon and file stats for naturally aligned, > contiguously mapped blocks of the specified > size. May be issued multiple times to track > multiple sized blocks. Useful to infer e.g. > arm64 contpte and hpa mappings. Size must be a > power-of-2 number of pages. > --inc-smaps Include all numerical, additive > /proc/<pid>/smaps stats in the output. > --inc-empty Show all statistics including those whose value > is 0. > --periodic sleep_ms Run in a loop, polling every sleep_ms > milliseconds. > > Requires root privilege to access pagemap and kpageflags. 
> --8<-- > > Example command to summarise fully and partially mapped THPs and 64K > contiguous blocks over all VMAs in a single process (--inc-empty forces > printing stats that are 0): > > # ./thpmaps --pid 10837 --cont 64K --summary --inc-empty > > --8<-- > anon-thp-aligned-16kB: 16 kB ( 0%) > anon-thp-aligned-32kB: 0 kB ( 0%) > anon-thp-aligned-64kB: 4194304 kB (100%) > anon-thp-aligned-128kB: 0 kB ( 0%) > anon-thp-aligned-256kB: 0 kB ( 0%) > anon-thp-aligned-512kB: 0 kB ( 0%) > anon-thp-aligned-1024kB: 0 kB ( 0%) > anon-thp-aligned-2048kB: 0 kB ( 0%) > anon-thp-unaligned-16kB: 0 kB ( 0%) > anon-thp-unaligned-32kB: 0 kB ( 0%) > anon-thp-unaligned-64kB: 0 kB ( 0%) > anon-thp-unaligned-128kB: 0 kB ( 0%) > anon-thp-unaligned-256kB: 0 kB ( 0%) > anon-thp-unaligned-512kB: 0 kB ( 0%) > anon-thp-unaligned-1024kB: 0 kB ( 0%) > anon-thp-unaligned-2048kB: 0 kB ( 0%) > anon-thp-partial: 0 kB ( 0%) > file-thp-aligned-16kB: 16 kB ( 1%) > file-thp-aligned-32kB: 64 kB ( 5%) > file-thp-aligned-64kB: 640 kB (50%) > file-thp-aligned-128kB: 128 kB (10%) > file-thp-aligned-256kB: 0 kB ( 0%) > file-thp-aligned-512kB: 0 kB ( 0%) > file-thp-aligned-1024kB: 0 kB ( 0%) > file-thp-aligned-2048kB: 0 kB ( 0%) > file-thp-unaligned-16kB: 16 kB ( 1%) > file-thp-unaligned-32kB: 32 kB ( 3%) > file-thp-unaligned-64kB: 64 kB ( 5%) > file-thp-unaligned-128kB: 0 kB ( 0%) > file-thp-unaligned-256kB: 0 kB ( 0%) > file-thp-unaligned-512kB: 0 kB ( 0%) > file-thp-unaligned-1024kB: 0 kB ( 0%) > file-thp-unaligned-2048kB: 0 kB ( 0%) > file-thp-partial: 12 kB ( 1%) > anon-cont-aligned-64kB: 4194304 kB (100%) > file-cont-aligned-64kB: 768 kB (61%) > --8<-- > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > > I've found this very useful for debugging, and I know others have requested a > way to check if mTHP and contpte is working, so thought this might a good short > term solution until we figure out how best to add stats in the kernel? > > Thanks, > Ryan I found a minor bug and a change I plan to make in the next version. Just FYI: > > tools/mm/Makefile | 9 +- > tools/mm/thpmaps | 573 ++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 578 insertions(+), 4 deletions(-) > create mode 100755 tools/mm/thpmaps > > diff --git a/tools/mm/Makefile b/tools/mm/Makefile > index 1c5606cc3334..7bb03606b9ea 100644 > --- a/tools/mm/Makefile > +++ b/tools/mm/Makefile > @@ -3,7 +3,8 @@ > # > include ../scripts/Makefile.include > > -TARGETS=page-types slabinfo page_owner_sort > +BUILD_TARGETS=page-types slabinfo page_owner_sort > +INSTALL_TARGETS = $(BUILD_TARGETS) thpmaps > > LIB_DIR = ../lib/api > LIBS = $(LIB_DIR)/libapi.a > @@ -11,9 +12,9 @@ LIBS = $(LIB_DIR)/libapi.a > CFLAGS += -Wall -Wextra -I../lib/ -pthread > LDFLAGS += $(LIBS) -pthread > > -all: $(TARGETS) > +all: $(BUILD_TARGETS) > > -$(TARGETS): $(LIBS) > +$(BUILD_TARGETS): $(LIBS) > > $(LIBS): > make -C $(LIB_DIR) > @@ -29,4 +30,4 @@ sbindir ?= /usr/sbin > > install: all > install -d $(DESTDIR)$(sbindir) > - install -m 755 -p $(TARGETS) $(DESTDIR)$(sbindir) > + install -m 755 -p $(INSTALL_TARGETS) $(DESTDIR)$(sbindir) > diff --git a/tools/mm/thpmaps b/tools/mm/thpmaps > new file mode 100755 > index 000000000000..af9b19f63eb4 > --- /dev/null > +++ b/tools/mm/thpmaps > @@ -0,0 +1,573 @@ > +#!/usr/bin/env python3 > +# SPDX-License-Identifier: GPL-2.0-only > +# Copyright (C) 2024 ARM Ltd. > +# > +# Utility providing smaps-like output detailing transparent hugepage usage. 
> +# For more info, run: > +# ./thpmaps --help > +# > +# Requires numpy: > +# pip3 install numpy > + > + > +import argparse > +import collections > +import math > +import os > +import re > +import resource > +import shutil > +import sys > +import time > +import numpy as np > + > + > +with open('/sys/kernel/mm/transparent_hugepage/hpage_pmd_size') as f: > + PAGE_SIZE = resource.getpagesize() > + PAGE_SHIFT = int(math.log2(PAGE_SIZE)) > + PMD_SIZE = int(f.read()) > + PMD_ORDER = int(math.log2(PMD_SIZE / PAGE_SIZE)) > + > + > +def align_forward(v, a): > + return (v + (a - 1)) & ~(a - 1) > + > + > +def align_offset(v, a): > + return v & (a - 1) > + > + > +def nrkb(nr): > + # Convert number of pages to KB. > + return (nr << PAGE_SHIFT) >> 10 > + > + > +def odkb(order): > + # Convert page order to KB. > + return nrkb(1 << order) > + > + > +def cont_ranges_all(arrs): > + # Given a list of arrays, find the ranges for which values are monotonically > + # incrementing in all arrays. > + assert(len(arrs) > 0) > + sz = len(arrs[0]) > + for arr in arrs: > + assert(arr.shape == (sz,)) > + r = np.full(sz, 2) > + d = np.diff(arrs[0]) == 1 > + for dd in [np.diff(arr) == 1 for arr in arrs[1:]]: > + d &= dd > + r[1:] -= d > + r[:-1] -= d > + return [np.repeat(arr, r).reshape(-1, 2) for arr in arrs] > + > + > +class ArgException(Exception): > + pass > + > + > +class FileIOException(Exception): > + pass > + > + > +class BinArrayFile: > + # Base class used to read /proc/<pid>/pagemap and /proc/kpageflags into a > + # numpy array. Use inherrited class in a with clause to ensure file is > + # closed when it goes out of scope. > + def __init__(self, filename, element_size): > + self.element_size = element_size > + self.filename = filename > + self.fd = os.open(self.filename, os.O_RDONLY) > + > + def cleanup(self): > + os.close(self.fd) > + > + def __enter__(self): > + return self > + > + def __exit__(self, exc_type, exc_val, exc_tb): > + self.cleanup() > + > + def _readin(self, offset, buffer): > + length = os.preadv(self.fd, (buffer,), offset) > + if len(buffer) != length: > + raise FileIOException('error: {} failed to read {} bytes at {:x}' > + .format(self.filename, len(buffer), offset)) > + > + def _toarray(self, buf): > + assert(self.element_size == 8) > + return np.frombuffer(buf, dtype=np.uint64) > + > + def getv(self, vec): > + sz = 0 > + for region in vec: > + sz += int(region[1] - region[0] + 1) * self.element_size > + buf = bytearray(sz) > + view = memoryview(buf) > + pos = 0 > + for region in vec: > + offset = int(region[0]) * self.element_size > + length = int(region[1] - region[0] + 1) * self.element_size > + self._readin(offset, view[pos:pos+length]) > + pos += length > + return self._toarray(buf) > + > + def get(self, index, nr=1): > + offset = index * self.element_size > + length = nr * self.element_size > + buf = bytearray(length) > + self._readin(offset, buf) > + return self._toarray(buf) > + > + > +PM_PAGE_PRESENT = 1 << 63 > +PM_PFN_MASK = (1 << 55) - 1 > + > +class PageMap(BinArrayFile): > + # Read ranges of a given pid's pagemap into a numpy array. > + def __init__(self, pid='self'): > + super().__init__(f'/proc/{pid}/pagemap', 8) > + > + > +KPF_ANON = 1 << 12 > +KPF_COMPOUND_HEAD = 1 << 15 > +KPF_COMPOUND_TAIL = 1 << 16 > + > +class KPageFlags(BinArrayFile): > + # Read ranges of /proc/kpageflags into a numpy array. 
> + def __init__(self): > + super().__init__(f'/proc/kpageflags', 8) > + > + > +VMA = collections.namedtuple('VMA', [ > + 'name', > + 'start', > + 'end', > + 'read', > + 'write', > + 'execute', > + 'private', > + 'pgoff', > + 'major', > + 'minor', > + 'inode', > + 'stats', > +]) > + > +class VMAList: > + # A container for VMAs, parsed from /proc/<pid>/smaps. Iterate over the > + # instance to receive VMAs. > + head_regex = re.compile(r"^([\da-f]+)-([\da-f]+) ([r-])([w-])([x-])([ps]) ([\da-f]+) ([\da-f]+):([\da-f]+) ([\da-f]+)\s*(.*)$") > + kb_item_regex = re.compile(r"(\w+):\s*(\d+)\s*kB") > + > + def __init__(self, pid='self'): > + def is_vma(line): > + return self.head_regex.search(line) != None > + > + def get_vma(line): > + m = self.head_regex.match(line) > + if m is None: > + return None > + return VMA( > + name=m.group(11), > + start=int(m.group(1), 16), > + end=int(m.group(2), 16), > + read=m.group(3) == 'r', > + write=m.group(4) == 'w', > + execute=m.group(5) == 'x', > + private=m.group(6) == 'p', > + pgoff=int(m.group(7), 16), > + major=int(m.group(8), 16), > + minor=int(m.group(9), 16), > + inode=int(m.group(10), 16), > + stats={}, > + ) > + > + def get_value(line): > + # Currently only handle the KB stats because they are summed for > + # --summary. Core code doesn't know how to combine other stats. > + exclude = ['KernelPageSize', 'MMUPageSize'] > + m = self.kb_item_regex.search(line) > + if m: > + param = m.group(1) > + if param not in exclude: > + value = int(m.group(2)) > + return param, value > + return None, None > + > + def parse_smaps(file): > + vmas = [] > + i = 0 > + > + line = file.readline() > + > + while True: > + if not line: > + break > + line = line.strip() > + > + i += 1 > + > + vma = get_vma(line) > + if vma is None: > + raise FileIOException(f'error: could not parse line {i}: "{line}"') > + > + while True: > + line = file.readline() > + if not line: > + break > + line = line.strip() > + if is_vma(line): > + break > + > + i += 1 > + > + param, value = get_value(line) > + if param: > + vma.stats[param] = {'type': None, 'value': value} > + > + vmas.append(vma) > + > + return vmas > + > + with open(f'/proc/{pid}/smaps', 'r') as file: > + self.vmas = parse_smaps(file) > + > + def __iter__(self): > + yield from self.vmas > + > + > +def thp_parse(max_order, kpageflags, vfns, pfns, anons, heads): > + # Given 4 same-sized arrays representing a range within a page table backed > + # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons: > + # True if page is anonymous, heads: True if page is head of a THP), return a > + # dictionary of statistics describing the mapped THPs. > + stats = { > + 'file': { > + 'partial': 0, > + 'aligned': [0] * (max_order + 1), > + 'unaligned': [0] * (max_order + 1), > + }, > + 'anon': { > + 'partial': 0, > + 'aligned': [0] * (max_order + 1), > + 'unaligned': [0] * (max_order + 1), > + }, > + } > + > + indexes = np.arange(len(vfns), dtype=np.uint64) > + ranges = cont_ranges_all([indexes, vfns, pfns]) > + for rindex, rpfn in zip(ranges[0], ranges[2]): > + index_next = int(rindex[0]) > + index_end = int(rindex[1]) + 1 > + pfn_end = int(rpfn[1]) + 1 > + > + folios = indexes[index_next:index_end][heads[index_next:index_end]] > + > + # Account pages for any partially mapped THP at the front. In that case, > + # the first page of the range is a tail. 
> + nr = (int(folios[0]) if len(folios) else index_end) - index_next > + stats['anon' if anons[index_next] else 'file']['partial'] += nr > + > + # Account pages for any partially mapped THP at the back. In that case, > + # the next page after the range is a tail. > + if len(folios): > + flags = int(kpageflags.get(pfn_end)[0]) > + if flags & KPF_COMPOUND_TAIL: > + nr = index_end - int(folios[-1]) > + folios = folios[:-1] > + index_end -= nr > + stats['anon' if anons[index_end - 1] else 'file']['partial'] += nr > + > + # Account fully mapped THPs in the middle of the range. > + if len(folios): > + folio_nrs = np.append(np.diff(folios), np.uint64(index_end - folios[-1])) > + folio_orders = np.log2(folio_nrs).astype(np.uint64) > + for index, order in zip(folios, folio_orders): > + index = int(index) > + order = int(order) > + nr = 1 << order > + vfn = int(vfns[index]) > + align = 'aligned' if align_forward(vfn, nr) == vfn else 'unaligned' > + anon = 'anon' if anons[index] else 'file' > + stats[anon][align][order] += nr > + > + rstats = {} > + > + def flatten_sub(type, subtype, stats): > + for od, nr in enumerate(stats[2:], 2): > + rstats[f"{type}-thp-{subtype}-{odkb(od)}kB"] = {'type': type, 'value': nrkb(nr)} > + > + def flatten_type(type, stats): > + flatten_sub(type, 'aligned', stats['aligned']) > + flatten_sub(type, 'unaligned', stats['unaligned']) > + rstats[f"{type}-thp-partial"] = {'type': type, 'value': nrkb(stats['partial'])} > + > + flatten_type('anon', stats['anon']) > + flatten_type('file', stats['file']) > + > + return rstats > + > + > +def cont_parse(order, vfns, pfns, anons, heads): > + # Given 4 same-sized arrays representing a range within a page table backed > + # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons: > + # True if page is anonymous, heads: True if page is head of a THP), return a > + # dictionary of statistics describing the contiguous blocks. > + nr_cont = 1 << order > + nr_anon = 0 > + nr_file = 0 > + > + ranges = cont_ranges_all([np.arange(len(vfns), dtype=np.uint64), vfns, pfns]) > + for rindex, rvfn, rpfn in zip(*ranges): > + index_next = int(rindex[0]) > + index_end = int(rindex[1]) + 1 > + vfn_start = int(rvfn[0]) > + pfn_start = int(rpfn[0]) > + > + if align_offset(pfn_start, nr_cont) != align_offset(vfn_start, nr_cont): > + continue > + > + off = align_forward(vfn_start, nr_cont) - vfn_start > + index_next += off > + > + while index_next + nr_cont <= index_end: > + folio_boundary = heads[index_next+1:index_next+nr_cont].any() > + if not folio_boundary: > + if anons[index_next]: > + nr_anon += nr_cont > + else: > + nr_file += nr_cont > + index_next += nr_cont > + > + return { > + f"anon-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'anon', 'value': nrkb(nr_anon)}, > + f"file-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'file', 'value': nrkb(nr_file)}, > + } > + > + > +def vma_print(vma, pid): > + # Prints a VMA instance in a format similar to smaps. The main difference is > + # that the pid is included as the first value. > + print("{:08x} {:016x}-{:016x} {}{}{}{} {:08x} {:02x}:{:02x} {:08x} {}" > + .format( > + pid, vma.start, vma.end, > + 'r' if vma.read else '-', 'w' if vma.write else '-', > + 'x' if vma.execute else '-', 'p' if vma.private else 's', > + vma.pgoff, vma.major, vma.minor, vma.inode, vma.name > + )) > + > + > +def stats_print(stats, tot_anon, tot_file, inc_empty): > + # Print a statistics dictionary. 
> + label_field = 32 > + for label, stat in stats.items(): > + type = stat['type'] > + value = stat['value'] > + if value or inc_empty: > + pad = max(0, label_field - len(label) - 1) > + if type == 'anon': > + percent = f' ({value / tot_anon:3.0%})' > + elif type == 'file': > + percent = f' ({value / tot_file:3.0%})' > + else: > + percent = '' > + print(f"{label}:{' ' * pad}{value:8} kB{percent}") > + > + > +def vma_parse(vma, pagemap, kpageflags, contorders): > + # Generate thp and cont statistics for a single VMA. > + start = vma.start >> PAGE_SHIFT > + end = vma.end >> PAGE_SHIFT > + > + pmes = pagemap.get(start, end - start) > + present = pmes & PM_PAGE_PRESENT != 0 > + pfns = pmes & PM_PFN_MASK > + pfns = pfns[present] > + vfns = np.arange(start, end, dtype=np.uint64) > + vfns = vfns[present] > + > + flags = kpageflags.getv(cont_ranges_all([pfns])[0]) > + anons = flags & KPF_ANON != 0 > + heads = flags & KPF_COMPOUND_HEAD != 0 > + tails = flags & KPF_COMPOUND_TAIL != 0 > + thps = heads | tails > + > + tot_anon = np.count_nonzero(anons) > + tot_file = np.size(anons) - tot_anon > + tot_anon = nrkb(tot_anon) > + tot_file = nrkb(tot_file) > + > + vfns = vfns[thps] > + pfns = pfns[thps] > + anons = anons[thps] > + heads = heads[thps] > + > + thpstats = thp_parse(PMD_ORDER, kpageflags, vfns, pfns, anons, heads) > + contstats = [cont_parse(order, vfns, pfns, anons, heads) for order in contorders] > + > + return { > + **thpstats, > + **{k: v for s in contstats for k, v in s.items()} > + }, tot_anon, tot_file > + > + > +def do_main(args): > + pids = set() > + summary = {} > + summary_anon = 0 > + summary_file = 0 > + > + if args.cgroup: > + with open(f'{args.cgroup}/cgroup.procs') as pidfile: > + for line in pidfile.readlines(): > + pids.add(int(line.strip())) > + else: > + pids.add(args.pid) > + > + for pid in pids: > + try: > + with PageMap(pid) as pagemap: > + with KPageFlags() as kpageflags: > + for vma in VMAList(pid): > + if (vma.read or vma.write or vma.execute) and vma.stats['Rss']['value'] > 0: > + stats, vma_anon, vma_file = vma_parse(vma, pagemap, kpageflags, args.cont) > + else: > + stats = {} > + vma_anon = 0 > + vma_file = 0 > + if args.inc_smaps: > + stats = {**vma.stats, **stats} > + if args.summary: > + for k, v in stats.items(): > + if k in summary: > + assert(summary[k]['type'] == v['type']) > + summary[k]['value'] += v['value'] > + else: > + summary[k] = v > + summary_anon += vma_anon > + summary_file += vma_file > + else: > + vma_print(vma, pid) > + stats_print(stats, vma_anon, vma_file, args.inc_empty) > + except FileNotFoundError: > + if not args.cgroup: > + raise > + except ProcessLookupError: > + if not args.cgroup: > + raise It turns out that reading pagemap will return 0 bytes if the process goes away if the process exits after the file is opened. So need to add handler here to recover from the race of the --cgroup case: except FileIOException: if not args.cgroup: raise > + > + if args.summary: > + stats_print(summary, summary_anon, summary_file, args.inc_empty) > + > + > +def main(): > + def formatter(prog): > + width = shutil.get_terminal_size().columns > + width -= 2 > + width = min(80, width) > + return argparse.HelpFormatter(prog, width=width) > + > + def size2order(human): > + units = {"K": 2**10, "M": 2**20, "G": 2**30} nit: Linux convention seems to be case-invariant, so kmg are equivalent to KMG. Will do the same. 
Thanks, Ryan > + unit = 1 > + if human[-1] in units: > + unit = units[human[-1]] > + human = human[:-1] > + try: > + size = int(human) > + except ValueError: > + raise ArgException('error: --cont value must be integer size with optional KMG unit') > + size *= unit > + order = int(math.log2(size / PAGE_SIZE)) > + if order < 1: > + raise ArgException('error: --cont value must be size of at least 2 pages') > + if (1 << order) * PAGE_SIZE != size: > + raise ArgException('error: --cont value must be size of power-of-2 pages') > + return order > + > + parser = argparse.ArgumentParser(formatter_class=formatter, > + description="""Prints information about how transparent huge pages are > + mapped to a specified process or cgroup. > + > + Shows statistics for fully-mapped THPs of every size, mapped > + both naturally aligned and unaligned for both file and > + anonymous memory. See > + [anon|file]-thp-[aligned|unaligned]-<size>kB keys. > + > + Shows statistics for mapped pages that belong to a THP but > + which are not fully mapped. See [anon|file]-thp-partial > + keys. > + > + Optionally shows statistics for naturally aligned, > + contiguous blocks of memory of a specified size (when --cont > + is provided). See [anon|file]-cont-aligned-<size>kB keys. > + > + Statistics are shown in kB and as a percentage of either > + total anon or file memory as appropriate.""", > + epilog="""Requires root privilege to access pagemap and kpageflags.""") > + > + parser.add_argument('--pid', > + metavar='pid', required=False, type=int, > + help="""Process id of the target process. Exactly one of --pid and > + --cgroup must be provided.""") > + > + parser.add_argument('--cgroup', > + metavar='path', required=False, > + help="""Path to the target cgroup in sysfs. Iterates over every pid in > + the cgroup. Exactly one of --pid and --cgroup must be provided.""") > + > + parser.add_argument('--summary', > + required=False, default=False, action='store_true', > + help="""Sum the per-vma statistics to provide a summary over the whole > + process or cgroup.""") > + > + parser.add_argument('--cont', > + metavar='size[KMG]', required=False, default=[], action='append', > + help="""Adds anon and file stats for naturally aligned, contiguously > + mapped blocks of the specified size. May be issued multiple times to > + track multiple sized blocks. Useful to infer e.g. arm64 contpte and > + hpa mappings. 
Size must be a power-of-2 number of pages.""") > + > + parser.add_argument('--inc-smaps', > + required=False, default=False, action='store_true', > + help="""Include all numerical, additive /proc/<pid>/smaps stats in the > + output.""") > + > + parser.add_argument('--inc-empty', > + required=False, default=False, action='store_true', > + help="""Show all statistics including those whose value is 0.""") > + > + parser.add_argument('--periodic', > + metavar='sleep_ms', required=False, type=int, > + help="""Run in a loop, polling every sleep_ms milliseconds.""") > + > + args = parser.parse_args() > + > + try: > + if (args.pid and args.cgroup) or \ > + (not args.pid and not args.cgroup): > + raise ArgException("error: Exactly one of --pid and --cgroup must be provided.") > + > + args.cont = [size2order(cont) for cont in args.cont] > + except ArgException as e: > + parser.print_usage() > + raise > + > + if args.periodic: > + while True: > + do_main(args) > + print() > + time.sleep(args.periodic / 1000) > + else: > + do_main(args) > + > + > +if __name__ == "__main__": > + try: > + main() > + except Exception as e: > + prog = os.path.basename(sys.argv[0]) > + print(f'{prog}: {e}') > + exit(1) > -- > 2.25.1 >
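A minimal sketch of how those two follow-ups might look (an assumption about the next version, not the posted code). The first fragment is the extra handler for the --cgroup race described above; the second makes the size unit case-insensitive. page_size is passed in here only to keep the sketch self-contained; the script itself derives it from resource.getpagesize().

--8<--
import math

# 1) In do_main(), next to the FileNotFoundError/ProcessLookupError handlers,
#    recover from a process exiting after its pagemap was opened:
#
#            except FileIOException:
#                if not args.cgroup:
#                    raise

# 2) In size2order(), accept lower-case units so "64k" parses like "64K".
#    (Validation from the original function is omitted for brevity.)
def size2order(human, page_size=4096):
    units = {"K": 2**10, "M": 2**20, "G": 2**30}
    unit = 1
    if human[-1].upper() in units:
        unit = units[human[-1].upper()]
        human = human[:-1]
    size = int(human) * unit
    return int(math.log2(size / page_size))
--8<--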
Personally I like either of these: 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 With a header that looks something like this; I suspect the formatting will get mangled in email anyway: PID START END PROT OFF MJ:MN INODE FILE 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 -- William Kucharski > On Jan 5, 2024, at 01:35, Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 04/01/2024 22:48, John Hubbard wrote: >> On 1/3/24 02:20, Ryan Roberts wrote: >>> On 03/01/2024 10:09, William Kucharski wrote: >> ... >>>>> The reason is that it is also possible to invoke the tool with --cgroup instead >>>>> of --pid. In this case, the tool will iterate over all the pids in the >>>>> cgroup so >>>>> (when --summary is not specified) having the pid associated with each vma is >>>>> useful. >>>>> >>>>> I could change it to conditionally output the pid only when --cgroup is >>>>> specified? >>>> >>>> You could, or perhaps emit a colon after the pid to delineate it, e.g.: >>>> >>>>> 000000ce: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:0000426969 >>>>> /root/a.out >>> >>> Yeah that sounds like the least worst option. Let's go with that. >> >> I'm trying this out and had the exact same issue with pid. I'd suggest: >> >> a) pid should always be printed in decimal, because that's what ps(1) uses >> and no one expects to see it in other formats such as hex. > > right aligned with 0 or ' ' as the pad? I guess ' ' if you want it to look like > ps? But given pid is the first column, I think it will look weird right aligned. > Perhaps left aligned, followed by colon, followed by pad? Here are the 3 options: > > 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 > 206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 > 206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 > > My personal preference is the first option; right aligned with 0 pad. > >> >> b) In fact, perhaps a header row would help. There could be a --no-header-row >> option for cases that want to feed this to other scripts, but the default >> would be to include a human-friendly header. > > How about this for a header (with example first data row): > > PID START END PROT OFF MJ:MN INODE FILE > 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 > > Personally I wouldn't bother with a --no-header option; just keep it always on. > >> >> c) pid should probably be suppressed if --pid is specified, but that's >> less important than the other points. > > If we have the header then I think its clear what it is and I'd prefer to keep > the data format consistent between --pid and --cgroup. So prefer to leave pid in > always. > >> >> In a day or two I'll get a chance to run this on something that allocates >> lots of mTHPs, and give a closer look. > > Thanks - it would be great to get some feedback on the usefulness of the actual > counters! :) > > I'm considering adding an --ignore-folio-boundaries option, which would modify > the way the cont counters work, to only look for contiguity and alignment and > ignore any folio boundaries. At the moment, if you have multiple contiguous > folios, they don't count, because the memory doesn't all belong to the same > folio. I think this could be useful in some (limited) circumstances. > >> >> >> thanks,
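For reference, the three pid formats under discussion reduce to standard Python format specifiers; a tiny illustration, with the field width of 8 simply taken from the examples above:

--8<--
pid = 206
print(f"{pid:08}: ...")           # "00000206: ..."  right aligned, zero pad
print(f"{pid:8}: ...")            # "     206: ..."  right aligned, space pad
print(f"{pid}:".ljust(9) + "...") # "206:     ..."   left aligned, colon then pad
--8<--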
On 1/5/24 03:30, William Kucharski wrote: > Personally I like either of these: > > 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 > 206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 I like that second one, because that's how pids are often printed. > > With a header that looks something like this; I suspect the formatting will get > mangled in email anyway: > > PID START END PROT OFF MJ:MN INODE FILE > 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 The MJ:MN is mysterious, but this looks good otherwise. thanks,
On 1/5/24 00:35, Ryan Roberts wrote: > right aligned with 0 or ' ' as the pad? I guess ' ' if you want it to look like > ps? But given pid is the first column, I think it will look weird right aligned. > Perhaps left aligned, followed by colon, followed by pad? Here are the 3 options: I will leave all of the alignment to your judgment and good taste. I'm sure it will be fine. (I'm not trying to make the output look like ps(1). I'm trying to make the pid look like it "often" looks, and I used ps(1) as an example.) > > 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 > 206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 Sure. > 206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 > > My personal preference is the first option; right aligned with 0 pad. > >> >> b) In fact, perhaps a header row would help. There could be a --no-header-row >> option for cases that want to feed this to other scripts, but the default >> would be to include a human-friendly header. > > How about this for a header (with example first data row): > > PID START END PROT OFF MJ:MN INODE FILE I need to go look up with the MJ:MN means, and then see if there is a less mysterious column name. > 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 > > Personally I wouldn't bother with a --no-header option; just keep it always on. > >> >> c) pid should probably be suppressed if --pid is specified, but that's >> less important than the other points. > > If we have the header then I think its clear what it is and I'd prefer to keep > the data format consistent between --pid and --cgroup. So prefer to leave pid in > always. > That sounds reasonable to me. >> >> In a day or two I'll get a chance to run this on something that allocates >> lots of mTHPs, and give a closer look. > > Thanks - it would be great to get some feedback on the usefulness of the actual > counters! :) Working on it! > > I'm considering adding an --ignore-folio-boundaries option, which would modify > the way the cont counters work, to only look for contiguity and alignment and > ignore any folio boundaries. At the moment, if you have multiple contiguous > folios, they don't count, because the memory doesn't all belong to the same > folio. I think this could be useful in some (limited) circumstances. > This sounds both potentially useful, and yet obscure, so I'd suggest waiting until you see a usecase. And then include the usecase (even if just a comment), so that it explains both how to use it, and why it's useful. thanks,
On 1/2/24 07:38, Ryan Roberts wrote: > With the proliferation of large folios for file-backed memory, and more > recently the introduction of multi-size THP for anonymous memory, it is > becoming useful to be able to see exactly how large folios are mapped > into processes. For some architectures (e.g. arm64), if most memory is > mapped using contpte-sized and -aligned blocks, TLB usage can be > optimized so it's useful to see where these requirements are and are not > being met. > > thpmaps is a Python utility that reads /proc/<pid>/smaps, > /proc/<pid>/pagemap and /proc/kpageflags to print information about how > transparent huge pages (both file and anon) are mapped to a specified > process or cgroup. It aims to help users debug and optimize their > workloads. In future we may wish to introduce stats directly into the > kernel (e.g. smaps or similar), but for now this provides a short term > solution without the need to introduce any new ABI. > ... > I've found this very useful for debugging, and I know others have requested a > way to check if mTHP and contpte is working, so thought this might a good short > term solution until we figure out how best to add stats in the kernel? > Hi Ryan, One thing that immediately came up during some recent testing of mTHP on arm64: the pid requirement is sometimes a little awkward. I'm running tests on a machine at a time for now, inside various containers and such, and it would be nice if there were an easy way to get some numbers for the mTHPs across the whole machine. I'm not sure if that changes anything about thpmaps here. Probably this is fine as-is. But I wanted to give some initial reactions from just some quick runs: the global state would be convenient. thanks,
On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: > > On 1/2/24 07:38, Ryan Roberts wrote: > > With the proliferation of large folios for file-backed memory, and more > > recently the introduction of multi-size THP for anonymous memory, it is > > becoming useful to be able to see exactly how large folios are mapped > > into processes. For some architectures (e.g. arm64), if most memory is > > mapped using contpte-sized and -aligned blocks, TLB usage can be > > optimized so it's useful to see where these requirements are and are not > > being met. > > > > thpmaps is a Python utility that reads /proc/<pid>/smaps, > > /proc/<pid>/pagemap and /proc/kpageflags to print information about how > > transparent huge pages (both file and anon) are mapped to a specified > > process or cgroup. It aims to help users debug and optimize their > > workloads. In future we may wish to introduce stats directly into the > > kernel (e.g. smaps or similar), but for now this provides a short term > > solution without the need to introduce any new ABI. > > > ... > > I've found this very useful for debugging, and I know others have requested a > > way to check if mTHP and contpte is working, so thought this might a good short > > term solution until we figure out how best to add stats in the kernel? > > > > Hi Ryan, > > One thing that immediately came up during some recent testing of mTHP > on arm64: the pid requirement is sometimes a little awkward. I'm running > tests on a machine at a time for now, inside various containers and > such, and it would be nice if there were an easy way to get some numbers > for the mTHPs across the whole machine. > > I'm not sure if that changes anything about thpmaps here. Probably > this is fine as-is. But I wanted to give some initial reactions from > just some quick runs: the global state would be convenient. +1. but this seems to be impossible by scanning pagemap? so may we add this statistics information in kernel just like /proc/meminfo or a separate /proc/mthp_info? > > thanks, > -- > John Hubbard > NVIDIA Thanks barry
On 1/9/24 19:51, Barry Song wrote: > On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: ... >> Hi Ryan, >> >> One thing that immediately came up during some recent testing of mTHP >> on arm64: the pid requirement is sometimes a little awkward. I'm running >> tests on a machine at a time for now, inside various containers and >> such, and it would be nice if there were an easy way to get some numbers >> for the mTHPs across the whole machine. >> >> I'm not sure if that changes anything about thpmaps here. Probably >> this is fine as-is. But I wanted to give some initial reactions from >> just some quick runs: the global state would be convenient. > > +1. but this seems to be impossible by scanning pagemap? > so may we add this statistics information in kernel just like > /proc/meminfo or a separate /proc/mthp_info? > Yes. From my perspective, it looks like the global stats are more useful initially, and the more detailed per-pid or per-cgroup stats are the next level of investigation. So feels odd to start with the more detailed stats. However, Ryan did clearly say, above, "In future we may wish to introduce stats directly into the kernel (e.g. smaps or similar)". And earlier he ran into some pushback on trying to set up /proc or /sys values because this is still such an early feature. I wonder if we could put the global stats in debugfs for now? That's specifically supposed to be a "we promise *not* to keep this ABI stable" location. thanks,
On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > > On 1/9/24 19:51, Barry Song wrote: > > On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: > ... > >> Hi Ryan, > >> > >> One thing that immediately came up during some recent testing of mTHP > >> on arm64: the pid requirement is sometimes a little awkward. I'm running > >> tests on a machine at a time for now, inside various containers and > >> such, and it would be nice if there were an easy way to get some numbers > >> for the mTHPs across the whole machine. > >> > >> I'm not sure if that changes anything about thpmaps here. Probably > >> this is fine as-is. But I wanted to give some initial reactions from > >> just some quick runs: the global state would be convenient. > > > > +1. but this seems to be impossible by scanning pagemap? > > so may we add this statistics information in kernel just like > > /proc/meminfo or a separate /proc/mthp_info? > > > > Yes. From my perspective, it looks like the global stats are more useful > initially, and the more detailed per-pid or per-cgroup stats are the > next level of investigation. So feels odd to start with the more > detailed stats. > probably because this can be done without the modification of the kernel. The detailed per-pid or per-cgroup is still quite useful to my case in which we set mTHP enabled/disabled and allowed sizes according to vma types, eg. libc_malloc, java heaps etc. Different vma types can have different anon_name. So I can use the detailed info to find out if specific VMAs have gotten mTHP properly and how many they have gotten. > However, Ryan did clearly say, above, "In future we may wish to > introduce stats directly into the kernel (e.g. smaps or similar)". And > earlier he ran into some pushback on trying to set up /proc or /sys > values because this is still such an early feature. > > I wonder if we could put the global stats in debugfs for now? That's > specifically supposed to be a "we promise *not* to keep this ABI stable" > location. +1. > > > thanks, > -- > John Hubbard > NVIDIA > Thanks Barry
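Since the per-VMA use case above keys off anon VMA names, a rough, hypothetical helper along these lines could narrow inspection to just those VMAs. It assumes names set via PR_SET_VMA_ANON_NAME, which smaps shows as "[anon:<name>]"; the posted script has no such filter.

--8<--
#!/usr/bin/env python3
# Hypothetical helper, not part of the posted patch: list VMA ranges whose
# smaps name matches "[anon:<tag>]", e.g. "[anon:libc_malloc]".
import re
import sys

def anon_vmas(pid, tag):
    head = re.compile(r"^([\da-f]+)-([\da-f]+) \S+ \S+ \S+ \S+\s*(.*)$")
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            m = head.match(line)
            if m and m.group(3) == f"[anon:{tag}]":
                yield int(m.group(1), 16), int(m.group(2), 16)

if __name__ == "__main__":
    for start, end in anon_vmas(sys.argv[1], sys.argv[2]):
        print(f"{start:016x}-{end:016x}")
--8<--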
On 05/01/2024 23:18, John Hubbard wrote: > On 1/5/24 00:35, Ryan Roberts wrote: >> right aligned with 0 or ' ' as the pad? I guess ' ' if you want it to look like >> ps? But given pid is the first column, I think it will look weird right aligned. >> Perhaps left aligned, followed by colon, followed by pad? Here are the 3 options: > > I will leave all of the alignment to your judgment and good taste. I'm sure > it will be fine. > > (I'm not trying to make the output look like ps(1). I'm trying to make the pid > look like it "often" looks, and I used ps(1) as an example.) > >> >> 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 I'm going to go with this version ^ >> 206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 > > Sure. > >> 206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 >> >> My personal preference is the first option; right aligned with 0 pad. >> >>> >>> b) In fact, perhaps a header row would help. There could be a --no-header-row >>> option for cases that want to feed this to other scripts, but the default >>> would be to include a human-friendly header. >> >> How about this for a header (with example first data row): >> >> PID START END PROT OFF MJ:MN INODE FILE > > I need to go look up with the MJ:MN means, and then see if there is a > less mysterious column name. Its the device major/minor number. I could just call it DEV (DEVICE is too long) > >> 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 >> >> Personally I wouldn't bother with a --no-header option; just keep it always on. >> >>> >>> c) pid should probably be suppressed if --pid is specified, but that's >>> less important than the other points. >> >> If we have the header then I think its clear what it is and I'd prefer to keep >> the data format consistent between --pid and --cgroup. So prefer to leave pid in >> always. >> > > That sounds reasonable to me. > >>> >>> In a day or two I'll get a chance to run this on something that allocates >>> lots of mTHPs, and give a closer look. >> >> Thanks - it would be great to get some feedback on the usefulness of the actual >> counters! :) > > Working on it! > >> >> I'm considering adding an --ignore-folio-boundaries option, which would modify >> the way the cont counters work, to only look for contiguity and alignment and >> ignore any folio boundaries. At the moment, if you have multiple contiguous >> folios, they don't count, because the memory doesn't all belong to the same >> folio. I think this could be useful in some (limited) circumstances. >> > > This sounds both potentially useful, and yet obscure, so I'd suggest waiting > until you see a usecase. And then include the usecase (even if just a comment), > so that it explains both how to use it, and why it's useful. > > thanks,
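Putting the choices above together, the per-VMA header and row might end up looking something like the sketch below; the column widths and the header labels (MJ:MN vs DEV) are guesses, not the final v2 format.

--8<--
# Sketch of the agreed format: zero-padded decimal pid, always printed,
# header always on. Widths and labels are assumptions, not the final output.
def vma_print_header():
    print("PID       START            END              PROT OFF      MJ:MN INODE    FILE")

def vma_print(pid, start, end, prot, pgoff, major, minor, inode, name):
    print("{:08d}: {:016x}-{:016x} {} {:08x} {:02x}:{:02x} {:08x} {}"
          .format(pid, start, end, prot, pgoff, major, minor, inode, name))

# Example:
#   vma_print_header()
#   vma_print(206, 0xaaaadbb20000, 0xaaaadbb21000, 'r-xp', 0, 0xfe, 0, 0x426969, '/root/a.out')
# PID       START            END              PROT OFF      MJ:MN INODE    FILE
# 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969 /root/a.out
--8<--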
On 10/01/2024 08:02, Barry Song wrote: > On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >> >> On 1/9/24 19:51, Barry Song wrote: >>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: >> ... >>>> Hi Ryan, >>>> >>>> One thing that immediately came up during some recent testing of mTHP >>>> on arm64: the pid requirement is sometimes a little awkward. I'm running >>>> tests on a machine at a time for now, inside various containers and >>>> such, and it would be nice if there were an easy way to get some numbers >>>> for the mTHPs across the whole machine. Just to confirm, you're expecting these "global" stats be truely global and not per-container? (asking because you exploicitly mentioned being in a container). If you want per-container, then you can probably just create the container in a cgroup? >>>> >>>> I'm not sure if that changes anything about thpmaps here. Probably >>>> this is fine as-is. But I wanted to give some initial reactions from >>>> just some quick runs: the global state would be convenient. Thanks for taking this for a spin! Appreciate the feedback. >>> >>> +1. but this seems to be impossible by scanning pagemap? >>> so may we add this statistics information in kernel just like >>> /proc/meminfo or a separate /proc/mthp_info? >>> >> >> Yes. From my perspective, it looks like the global stats are more useful >> initially, and the more detailed per-pid or per-cgroup stats are the >> next level of investigation. So feels odd to start with the more >> detailed stats. >> > > probably because this can be done without the modification of the kernel. Yes indeed, as John said in an earlier thread, my previous attempts to add stats directly in the kernel got pushback; DavidH was concerned that we don't really know exectly how to account mTHPs yet (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding the wrong ABI and having to maintain it forever. There has also been some pushback regarding adding more values to multi-value files in sysfs, so David was suggesting coming up with a whole new scheme at some point (I know /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups do live in sysfs). Anyway, this script was my attempt to 1) provide a short term solution to the "we need some stats" request and 2) provide a context in which to explore what the right stats are - this script can evolve without the ABI problem. > The detailed per-pid or per-cgroup is still quite useful to my case in which > we set mTHP enabled/disabled and allowed sizes according to vma types, > eg. libc_malloc, java heaps etc. > > Different vma types can have different anon_name. So I can use the detailed > info to find out if specific VMAs have gotten mTHP properly and how many > they have gotten. > >> However, Ryan did clearly say, above, "In future we may wish to >> introduce stats directly into the kernel (e.g. smaps or similar)". And >> earlier he ran into some pushback on trying to set up /proc or /sys >> values because this is still such an early feature. >> >> I wonder if we could put the global stats in debugfs for now? That's >> specifically supposed to be a "we promise *not* to keep this ABI stable" >> location. Now that I think about it, I wonder if we can add a --global mode to the script (or just infer global when neither --pid nor --cgroup are provided). I think I should be able to determine all the physical memory ranges from /proc/iomem, then grab all the info we need from /proc/kpageflags. 
We should then be able to process it all in much the same way as for --pid/--cgroup and provide the same stats, but it will apply globally. What do you think? If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script for now. > > +1. > >> >> >> thanks, >> -- >> John Hubbard >> NVIDIA >> > > Thanks > Barry
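For what it's worth, the /proc/iomem half of that idea might look roughly like the sketch below. It assumes a hypothetical --global mode; reading real addresses from /proc/iomem requires root, and PAGE_SHIFT is hard-coded only to keep the sketch standalone.

--8<--
import re

PAGE_SHIFT = 12  # assumption for the sketch; the script computes this from the page size

def system_ram_pfn_ranges():
    # Collect inclusive [first_pfn, last_pfn] ranges for every "System RAM"
    # region, in the same shape as the regions fed to KPageFlags.getv().
    ranges = []
    line_re = re.compile(r'^([0-9a-f]+)-([0-9a-f]+) : System RAM$')
    with open('/proc/iomem') as f:
        for line in f:
            m = line_re.match(line.strip())
            if m:
                ranges.append((int(m.group(1), 16) >> PAGE_SHIFT,
                               int(m.group(2), 16) >> PAGE_SHIFT))
    return ranges
--8<--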
On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 10/01/2024 08:02, Barry Song wrote: > > On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > >> > >> On 1/9/24 19:51, Barry Song wrote: > >>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: > >> ... > >>>> Hi Ryan, > >>>> > >>>> One thing that immediately came up during some recent testing of mTHP > >>>> on arm64: the pid requirement is sometimes a little awkward. I'm running > >>>> tests on a machine at a time for now, inside various containers and > >>>> such, and it would be nice if there were an easy way to get some numbers > >>>> for the mTHPs across the whole machine. > > Just to confirm, you're expecting these "global" stats be truely global and not > per-container? (asking because you exploicitly mentioned being in a container). > If you want per-container, then you can probably just create the container in a > cgroup? > > >>>> > >>>> I'm not sure if that changes anything about thpmaps here. Probably > >>>> this is fine as-is. But I wanted to give some initial reactions from > >>>> just some quick runs: the global state would be convenient. > > Thanks for taking this for a spin! Appreciate the feedback. > > >>> > >>> +1. but this seems to be impossible by scanning pagemap? > >>> so may we add this statistics information in kernel just like > >>> /proc/meminfo or a separate /proc/mthp_info? > >>> > >> > >> Yes. From my perspective, it looks like the global stats are more useful > >> initially, and the more detailed per-pid or per-cgroup stats are the > >> next level of investigation. So feels odd to start with the more > >> detailed stats. > >> > > > > probably because this can be done without the modification of the kernel. > > Yes indeed, as John said in an earlier thread, my previous attempts to add stats > directly in the kernel got pushback; DavidH was concerned that we don't really > know exectly how to account mTHPs yet > (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding > the wrong ABI and having to maintain it forever. There has also been some > pushback regarding adding more values to multi-value files in sysfs, so David > was suggesting coming up with a whole new scheme at some point (I know > /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups > do live in sysfs). > > Anyway, this script was my attempt to 1) provide a short term solution to the > "we need some stats" request and 2) provide a context in which to explore what > the right stats are - this script can evolve without the ABI problem. > > > The detailed per-pid or per-cgroup is still quite useful to my case in which > > we set mTHP enabled/disabled and allowed sizes according to vma types, > > eg. libc_malloc, java heaps etc. > > > > Different vma types can have different anon_name. So I can use the detailed > > info to find out if specific VMAs have gotten mTHP properly and how many > > they have gotten. > > > >> However, Ryan did clearly say, above, "In future we may wish to > >> introduce stats directly into the kernel (e.g. smaps or similar)". And > >> earlier he ran into some pushback on trying to set up /proc or /sys > >> values because this is still such an early feature. > >> > >> I wonder if we could put the global stats in debugfs for now? That's > >> specifically supposed to be a "we promise *not* to keep this ABI stable" > >> location. 
> > Now that I think about it, I wonder if we can add a --global mode to the script > (or just infer global when neither --pid nor --cgroup are provided). I think I > should be able to determine all the physical memory ranges from /proc/iomem, > then grab all the info we need from /proc/kpageflags. We should then be able to > process it all in much the same way as for --pid/--cgroup and provide the same > stats, but it will apply globally. What do you think? For debug purposes it should be good, but imagining a health monitor which needs to sample the stats of large folios online and periodically, this might be too expensive. > > If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script > for now. > > > > > +1. > > > >> > >> > >> thanks, > >> -- > >> John Hubbard > >> NVIDIA > >> > > Thanks Barry
On 10/01/2024 09:09, Barry Song wrote: > On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 10/01/2024 08:02, Barry Song wrote: >>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>> >>>> On 1/9/24 19:51, Barry Song wrote: >>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: >>>> ... >>>>>> Hi Ryan, >>>>>> >>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running >>>>>> tests on a machine at a time for now, inside various containers and >>>>>> such, and it would be nice if there were an easy way to get some numbers >>>>>> for the mTHPs across the whole machine. >> >> Just to confirm, you're expecting these "global" stats be truely global and not >> per-container? (asking because you exploicitly mentioned being in a container). >> If you want per-container, then you can probably just create the container in a >> cgroup? >> >>>>>> >>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>> just some quick runs: the global state would be convenient. >> >> Thanks for taking this for a spin! Appreciate the feedback. >> >>>>> >>>>> +1. but this seems to be impossible by scanning pagemap? >>>>> so may we add this statistics information in kernel just like >>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>> >>>> >>>> Yes. From my perspective, it looks like the global stats are more useful >>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>> next level of investigation. So feels odd to start with the more >>>> detailed stats. >>>> >>> >>> probably because this can be done without the modification of the kernel. >> >> Yes indeed, as John said in an earlier thread, my previous attempts to add stats >> directly in the kernel got pushback; DavidH was concerned that we don't really >> know exectly how to account mTHPs yet >> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding >> the wrong ABI and having to maintain it forever. There has also been some >> pushback regarding adding more values to multi-value files in sysfs, so David >> was suggesting coming up with a whole new scheme at some point (I know >> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups >> do live in sysfs). >> >> Anyway, this script was my attempt to 1) provide a short term solution to the >> "we need some stats" request and 2) provide a context in which to explore what >> the right stats are - this script can evolve without the ABI problem. >> >>> The detailed per-pid or per-cgroup is still quite useful to my case in which >>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>> eg. libc_malloc, java heaps etc. >>> >>> Different vma types can have different anon_name. So I can use the detailed >>> info to find out if specific VMAs have gotten mTHP properly and how many >>> they have gotten. >>> >>>> However, Ryan did clearly say, above, "In future we may wish to >>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>> values because this is still such an early feature. >>>> >>>> I wonder if we could put the global stats in debugfs for now? That's >>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>> location. 
>> >> Now that I think about it, I wonder if we can add a --global mode to the script >> (or just infer global when neither --pid nor --cgroup are provided). I think I >> should be able to determine all the physical memory ranges from /proc/iomem, >> then grab all the info we need from /proc/kpageflags. We should then be able to >> process it all in much the same way as for --pid/--cgroup and provide the same >> stats, but it will apply globally. What do you think? > > for debug purposes, it should be good. imaging there is a health > monitor which needs > to sample the stats of large folios online and periodically, this > might be too expensive. Yes, understood - the long term aim needs to be to get stats into the kernel. This is intended as a step to help make that happen. > >> >> If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script >> for now. >> >>> >>> +1. >>> >>>> >>>> >>>> thanks, >>>> -- >>>> John Hubbard >>>> NVIDIA >>>> >>> > > Thanks > Barry
On 10/01/2024 09:09, Barry Song wrote: > On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 10/01/2024 08:02, Barry Song wrote: >>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>> >>>> On 1/9/24 19:51, Barry Song wrote: >>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: >>>> ... >>>>>> Hi Ryan, >>>>>> >>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running >>>>>> tests on a machine at a time for now, inside various containers and >>>>>> such, and it would be nice if there were an easy way to get some numbers >>>>>> for the mTHPs across the whole machine. >> >> Just to confirm, you're expecting these "global" stats be truely global and not >> per-container? (asking because you exploicitly mentioned being in a container). >> If you want per-container, then you can probably just create the container in a >> cgroup? >> >>>>>> >>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>> just some quick runs: the global state would be convenient. >> >> Thanks for taking this for a spin! Appreciate the feedback. >> >>>>> >>>>> +1. but this seems to be impossible by scanning pagemap? >>>>> so may we add this statistics information in kernel just like >>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>> >>>> >>>> Yes. From my perspective, it looks like the global stats are more useful >>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>> next level of investigation. So feels odd to start with the more >>>> detailed stats. >>>> >>> >>> probably because this can be done without the modification of the kernel. >> >> Yes indeed, as John said in an earlier thread, my previous attempts to add stats >> directly in the kernel got pushback; DavidH was concerned that we don't really >> know exectly how to account mTHPs yet >> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding >> the wrong ABI and having to maintain it forever. There has also been some >> pushback regarding adding more values to multi-value files in sysfs, so David >> was suggesting coming up with a whole new scheme at some point (I know >> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups >> do live in sysfs). >> >> Anyway, this script was my attempt to 1) provide a short term solution to the >> "we need some stats" request and 2) provide a context in which to explore what >> the right stats are - this script can evolve without the ABI problem. >> >>> The detailed per-pid or per-cgroup is still quite useful to my case in which >>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>> eg. libc_malloc, java heaps etc. >>> >>> Different vma types can have different anon_name. So I can use the detailed >>> info to find out if specific VMAs have gotten mTHP properly and how many >>> they have gotten. >>> >>>> However, Ryan did clearly say, above, "In future we may wish to >>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>> values because this is still such an early feature. >>>> >>>> I wonder if we could put the global stats in debugfs for now? That's >>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>> location. 
>> >> Now that I think about it, I wonder if we can add a --global mode to the script >> (or just infer global when neither --pid nor --cgroup are provided). I think I >> should be able to determine all the physical memory ranges from /proc/iomem, >> then grab all the info we need from /proc/kpageflags. We should then be able to >> process it all in much the same way as for --pid/--cgroup and provide the same >> stats, but it will apply globally. What do you think? Having now thought about this for a few mins (in the shower, if anyone wants the complete picture :) ), this won't quite work. This approach doesn't have the virtual mapping information so the best it can do is tell us "how many of each size of THP are allocated?" - it doesn't tell us anything about whether they are fully or partially mapped or what their alignment is (all necessary if we want to know if they are contpte-mapped). So I don't think this approach is going to be particularly useful. And this is also the big problem if we want to gather stats inside the kernel; if we want something equivalant to /proc/meminfo's AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the allocation of the THP but also whether it is mapped. That's easy for PMD-mappings, because there is only one entry to consider - when you set it, you increment the number of PMD-mapped THPs, when you clear it, you decrement. But for PTE-mappings it's harder; you know the size when you are mapping so its easy to increment, but you can do a partial unmap, so you would need to scan the PTEs to figure out if we are unmapping the first page of a previously fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to determine "is this folio fully and contiguously mapped in at least one process?". So depending on what global stats you actually need, the route to getting them cheaply may not be easy. (My previous attempt to add stats cheated and didn't try to track "fully mapped" vs "partially mapped" - instead it just counted the number of pages belonging to a THP (of any size) that were mapped. If you need the global mapping state, then the short term way to do this would be to provide the root cgroup, then have the script recurse through all child cgroups; That would pick up all the processes and iterate through them: $ thpmaps --cgroup /sys/fs/cgroup --summary ... This won't quite work with the current version because it doesn't recurse through the cgroup children currently, but that would be easy to add. > > for debug purposes, it should be good. imaging there is a health > monitor which needs > to sample the stats of large folios online and periodically, this > might be too expensive. > >> >> If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script >> for now. >> >>> >>> +1. >>> >>>> >>>> >>>> thanks, >>>> -- >>>> John Hubbard >>>> NVIDIA >>>> >>> > > Thanks > Barry
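A minimal sketch of that recursion, assuming a cgroup-v2 style hierarchy where every descendant directory carries its own cgroup.procs (the posted script only reads the file at the top level):

--8<--
import os

def cgroup_pids_recursive(cgroup_path):
    # Gather pids from the given cgroup and all of its descendants, so that
    # "--cgroup /sys/fs/cgroup --summary" would cover every process.
    pids = set()
    for dirpath, _dirnames, filenames in os.walk(cgroup_path):
        if 'cgroup.procs' not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, 'cgroup.procs')) as f:
                pids.update(int(line) for line in f if line.strip())
        except OSError:
            continue  # cgroup may have been removed while walking
    return pids
--8<--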
On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 10/01/2024 09:09, Barry Song wrote: > > On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> On 10/01/2024 08:02, Barry Song wrote: > >>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>> > >>>> On 1/9/24 19:51, Barry Song wrote: > >>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: > >>>> ... > >>>>>> Hi Ryan, > >>>>>> > >>>>>> One thing that immediately came up during some recent testing of mTHP > >>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running > >>>>>> tests on a machine at a time for now, inside various containers and > >>>>>> such, and it would be nice if there were an easy way to get some numbers > >>>>>> for the mTHPs across the whole machine. > >> > >> Just to confirm, you're expecting these "global" stats be truely global and not > >> per-container? (asking because you exploicitly mentioned being in a container). > >> If you want per-container, then you can probably just create the container in a > >> cgroup? > >> > >>>>>> > >>>>>> I'm not sure if that changes anything about thpmaps here. Probably > >>>>>> this is fine as-is. But I wanted to give some initial reactions from > >>>>>> just some quick runs: the global state would be convenient. > >> > >> Thanks for taking this for a spin! Appreciate the feedback. > >> > >>>>> > >>>>> +1. but this seems to be impossible by scanning pagemap? > >>>>> so may we add this statistics information in kernel just like > >>>>> /proc/meminfo or a separate /proc/mthp_info? > >>>>> > >>>> > >>>> Yes. From my perspective, it looks like the global stats are more useful > >>>> initially, and the more detailed per-pid or per-cgroup stats are the > >>>> next level of investigation. So feels odd to start with the more > >>>> detailed stats. > >>>> > >>> > >>> probably because this can be done without the modification of the kernel. > >> > >> Yes indeed, as John said in an earlier thread, my previous attempts to add stats > >> directly in the kernel got pushback; DavidH was concerned that we don't really > >> know exectly how to account mTHPs yet > >> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding > >> the wrong ABI and having to maintain it forever. There has also been some > >> pushback regarding adding more values to multi-value files in sysfs, so David > >> was suggesting coming up with a whole new scheme at some point (I know > >> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups > >> do live in sysfs). > >> > >> Anyway, this script was my attempt to 1) provide a short term solution to the > >> "we need some stats" request and 2) provide a context in which to explore what > >> the right stats are - this script can evolve without the ABI problem. > >> > >>> The detailed per-pid or per-cgroup is still quite useful to my case in which > >>> we set mTHP enabled/disabled and allowed sizes according to vma types, > >>> eg. libc_malloc, java heaps etc. > >>> > >>> Different vma types can have different anon_name. So I can use the detailed > >>> info to find out if specific VMAs have gotten mTHP properly and how many > >>> they have gotten. > >>> > >>>> However, Ryan did clearly say, above, "In future we may wish to > >>>> introduce stats directly into the kernel (e.g. smaps or similar)". 
And > >>>> earlier he ran into some pushback on trying to set up /proc or /sys > >>>> values because this is still such an early feature. > >>>> > >>>> I wonder if we could put the global stats in debugfs for now? That's > >>>> specifically supposed to be a "we promise *not* to keep this ABI stable" > >>>> location. > >> > >> Now that I think about it, I wonder if we can add a --global mode to the script > >> (or just infer global when neither --pid nor --cgroup are provided). I think I > >> should be able to determine all the physical memory ranges from /proc/iomem, > >> then grab all the info we need from /proc/kpageflags. We should then be able to > >> process it all in much the same way as for --pid/--cgroup and provide the same > >> stats, but it will apply globally. What do you think? > > Having now thought about this for a few mins (in the shower, if anyone wants the > complete picture :) ), this won't quite work. This approach doesn't have the > virtual mapping information so the best it can do is tell us "how many of each > size of THP are allocated?" - it doesn't tell us anything about whether they are > fully or partially mapped or what their alignment is (all necessary if we want > to know if they are contpte-mapped). So I don't think this approach is going to > be particularly useful. > > And this is also the big problem if we want to gather stats inside the kernel; > if we want something equivalant to /proc/meminfo's > AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the > allocation of the THP but also whether it is mapped. That's easy for > PMD-mappings, because there is only one entry to consider - when you set it, you > increment the number of PMD-mapped THPs, when you clear it, you decrement. But > for PTE-mappings it's harder; you know the size when you are mapping so its easy > to increment, but you can do a partial unmap, so you would need to scan the PTEs > to figure out if we are unmapping the first page of a previously > fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to > determine "is this folio fully and contiguously mapped in at least one process?". as OPPO's approach I shared to you before is maintaining two mapcount 1. entire map 2. subpage's map 3. if 1 and 2 both exist, it is DoubleMapped. This isn't a problem for us. and everytime if we do a partial unmap, we have an explicit cont_pte split which will decrease the entire map and increase the subpage's mapcount. but its downside is that we expose this info to mm-core. > > So depending on what global stats you actually need, the route to getting them > cheaply may not be easy. (My previous attempt to add stats cheated and didn't > try to track "fully mapped" vs "partially mapped" - instead it just counted the > number of pages belonging to a THP (of any size) that were mapped. > > If you need the global mapping state, then the short term way to do this would > be to provide the root cgroup, then have the script recurse through all child > cgroups; That would pick up all the processes and iterate through them: > > $ thpmaps --cgroup /sys/fs/cgroup --summary ... > > This won't quite work with the current version because it doesn't recurse > through the cgroup children currently, but that would be easy to add. > > > > > > for debug purposes, it should be good. imaging there is a health > > monitor which needs > > to sample the stats of large folios online and periodically, this > > might be too expensive. 
> > > >> > >> If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script > >> for now. > >> > >>> > >>> +1. > >>> > >>>> > >>>> > >>>> thanks, > >>>> -- > >>>> John Hubbard > >>>> NVIDIA > >>>> > >>> > > Thanks Barry
On 10/01/2024 10:30, Barry Song wrote: > On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 10/01/2024 09:09, Barry Song wrote: >>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> On 10/01/2024 08:02, Barry Song wrote: >>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>> >>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>> ... >>>>>>>> Hi Ryan, >>>>>>>> >>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running >>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>> such, and it would be nice if there were an easy way to get some numbers >>>>>>>> for the mTHPs across the whole machine. >>>> >>>> Just to confirm, you're expecting these "global" stats be truely global and not >>>> per-container? (asking because you exploicitly mentioned being in a container). >>>> If you want per-container, then you can probably just create the container in a >>>> cgroup? >>>> >>>>>>>> >>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>> just some quick runs: the global state would be convenient. >>>> >>>> Thanks for taking this for a spin! Appreciate the feedback. >>>> >>>>>>> >>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>> so may we add this statistics information in kernel just like >>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>> >>>>>> >>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>> next level of investigation. So feels odd to start with the more >>>>>> detailed stats. >>>>>> >>>>> >>>>> probably because this can be done without the modification of the kernel. >>>> >>>> Yes indeed, as John said in an earlier thread, my previous attempts to add stats >>>> directly in the kernel got pushback; DavidH was concerned that we don't really >>>> know exectly how to account mTHPs yet >>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding >>>> the wrong ABI and having to maintain it forever. There has also been some >>>> pushback regarding adding more values to multi-value files in sysfs, so David >>>> was suggesting coming up with a whole new scheme at some point (I know >>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups >>>> do live in sysfs). >>>> >>>> Anyway, this script was my attempt to 1) provide a short term solution to the >>>> "we need some stats" request and 2) provide a context in which to explore what >>>> the right stats are - this script can evolve without the ABI problem. >>>> >>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which >>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>> eg. libc_malloc, java heaps etc. >>>>> >>>>> Different vma types can have different anon_name. So I can use the detailed >>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>> they have gotten. >>>>> >>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". 
And >>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>> values because this is still such an early feature. >>>>>> >>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>> location. >>>> >>>> Now that I think about it, I wonder if we can add a --global mode to the script >>>> (or just infer global when neither --pid nor --cgroup are provided). I think I >>>> should be able to determine all the physical memory ranges from /proc/iomem, >>>> then grab all the info we need from /proc/kpageflags. We should then be able to >>>> process it all in much the same way as for --pid/--cgroup and provide the same >>>> stats, but it will apply globally. What do you think? >> >> Having now thought about this for a few mins (in the shower, if anyone wants the >> complete picture :) ), this won't quite work. This approach doesn't have the >> virtual mapping information so the best it can do is tell us "how many of each >> size of THP are allocated?" - it doesn't tell us anything about whether they are >> fully or partially mapped or what their alignment is (all necessary if we want >> to know if they are contpte-mapped). So I don't think this approach is going to >> be particularly useful. >> >> And this is also the big problem if we want to gather stats inside the kernel; >> if we want something equivalant to /proc/meminfo's >> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >> allocation of the THP but also whether it is mapped. That's easy for >> PMD-mappings, because there is only one entry to consider - when you set it, you >> increment the number of PMD-mapped THPs, when you clear it, you decrement. But >> for PTE-mappings it's harder; you know the size when you are mapping so its easy >> to increment, but you can do a partial unmap, so you would need to scan the PTEs >> to figure out if we are unmapping the first page of a previously >> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >> determine "is this folio fully and contiguously mapped in at least one process?". > > as OPPO's approach I shared to you before is maintaining two mapcount > 1. entire map > 2. subpage's map > 3. if 1 and 2 both exist, it is DoubleMapped. > > This isn't a problem for us. and everytime if we do a partial unmap, > we have an explicit > cont_pte split which will decrease the entire map and increase the > subpage's mapcount. > > but its downside is that we expose this info to mm-core. OK, but I think we have a slightly more generic situation going on with the upstream; If I've understood correctly, you are using the PTE_CONT bit in the PTE to determne if its fully mapped? That works for your case where you only have 1 size of THP that you care about (contpte-size). But for the upstream, we have multi-size THP so we can't use the PTE_CONT bit to determine if its fully mapped because we can only use that bit if the THP is at least 64K and aligned, and only on arm64. We would need a SW bit for this purpose, and the mm would need to update that SW bit for every PTE one the full -> partial map transition. > >> >> So depending on what global stats you actually need, the route to getting them >> cheaply may not be easy. (My previous attempt to add stats cheated and didn't >> try to track "fully mapped" vs "partially mapped" - instead it just counted the >> number of pages belonging to a THP (of any size) that were mapped. 
>> >> If you need the global mapping state, then the short term way to do this would >> be to provide the root cgroup, then have the script recurse through all child >> cgroups; That would pick up all the processes and iterate through them: >> >> $ thpmaps --cgroup /sys/fs/cgroup --summary ... >> >> This won't quite work with the current version because it doesn't recurse >> through the cgroup children currently, but that would be easy to add. >> >> >>> >>> for debug purposes, it should be good. imaging there is a health >>> monitor which needs >>> to sample the stats of large folios online and periodically, this >>> might be too expensive. >>> >>>> >>>> If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script >>>> for now. >>>> >>>>> >>>>> +1. >>>>> >>>>>> >>>>>> >>>>>> thanks, >>>>>> -- >>>>>> John Hubbard >>>>>> NVIDIA >>>>>> >>>>> >>> > > Thanks > Barry
On 10.01.24 11:38, Ryan Roberts wrote: > On 10/01/2024 10:30, Barry Song wrote: >> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>> >>> On 10/01/2024 09:09, Barry Song wrote: >>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>> >>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>> >>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>> ... >>>>>>>>> Hi Ryan, >>>>>>>>> >>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running >>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>> such, and it would be nice if there were an easy way to get some numbers >>>>>>>>> for the mTHPs across the whole machine. >>>>> >>>>> Just to confirm, you're expecting these "global" stats be truely global and not >>>>> per-container? (asking because you exploicitly mentioned being in a container). >>>>> If you want per-container, then you can probably just create the container in a >>>>> cgroup? >>>>> >>>>>>>>> >>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>> just some quick runs: the global state would be convenient. >>>>> >>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>> >>>>>>>> >>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>> so may we add this statistics information in kernel just like >>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>> >>>>>>> >>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>> next level of investigation. So feels odd to start with the more >>>>>>> detailed stats. >>>>>>> >>>>>> >>>>>> probably because this can be done without the modification of the kernel. >>>>> >>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add stats >>>>> directly in the kernel got pushback; DavidH was concerned that we don't really >>>>> know exectly how to account mTHPs yet >>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding >>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>> pushback regarding adding more values to multi-value files in sysfs, so David >>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups >>>>> do live in sysfs). >>>>> >>>>> Anyway, this script was my attempt to 1) provide a short term solution to the >>>>> "we need some stats" request and 2) provide a context in which to explore what >>>>> the right stats are - this script can evolve without the ABI problem. >>>>> >>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which >>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>> eg. libc_malloc, java heaps etc. >>>>>> >>>>>> Different vma types can have different anon_name. So I can use the detailed >>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>> they have gotten. 
>>>>>> >>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>> values because this is still such an early feature. >>>>>>> >>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>> location. >>>>> >>>>> Now that I think about it, I wonder if we can add a --global mode to the script >>>>> (or just infer global when neither --pid nor --cgroup are provided). I think I >>>>> should be able to determine all the physical memory ranges from /proc/iomem, >>>>> then grab all the info we need from /proc/kpageflags. We should then be able to >>>>> process it all in much the same way as for --pid/--cgroup and provide the same >>>>> stats, but it will apply globally. What do you think? >>> >>> Having now thought about this for a few mins (in the shower, if anyone wants the >>> complete picture :) ), this won't quite work. This approach doesn't have the >>> virtual mapping information so the best it can do is tell us "how many of each >>> size of THP are allocated?" - it doesn't tell us anything about whether they are >>> fully or partially mapped or what their alignment is (all necessary if we want >>> to know if they are contpte-mapped). So I don't think this approach is going to >>> be particularly useful. >>> >>> And this is also the big problem if we want to gather stats inside the kernel; >>> if we want something equivalant to /proc/meminfo's >>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>> allocation of the THP but also whether it is mapped. That's easy for >>> PMD-mappings, because there is only one entry to consider - when you set it, you >>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But >>> for PTE-mappings it's harder; you know the size when you are mapping so its easy >>> to increment, but you can do a partial unmap, so you would need to scan the PTEs >>> to figure out if we are unmapping the first page of a previously >>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>> determine "is this folio fully and contiguously mapped in at least one process?". >> >> as OPPO's approach I shared to you before is maintaining two mapcount >> 1. entire map >> 2. subpage's map >> 3. if 1 and 2 both exist, it is DoubleMapped. >> >> This isn't a problem for us. and everytime if we do a partial unmap, >> we have an explicit >> cont_pte split which will decrease the entire map and increase the >> subpage's mapcount. >> >> but its downside is that we expose this info to mm-core. > > OK, but I think we have a slightly more generic situation going on with the > upstream; If I've understood correctly, you are using the PTE_CONT bit in the > PTE to determne if its fully mapped? That works for your case where you only > have 1 size of THP that you care about (contpte-size). But for the upstream, we > have multi-size THP so we can't use the PTE_CONT bit to determine if its fully > mapped because we can only use that bit if the THP is at least 64K and aligned, > and only on arm64. We would need a SW bit for this purpose, and the mm would > need to update that SW bit for every PTE one the full -> partial map transition. Oh no. Let's not make everything more complicated for the purpose of some stats.
On Wed, Jan 10, 2024 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 10/01/2024 10:30, Barry Song wrote: > > On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> On 10/01/2024 09:09, Barry Song wrote: > >>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>> > >>>> On 10/01/2024 08:02, Barry Song wrote: > >>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>>>> > >>>>>> On 1/9/24 19:51, Barry Song wrote: > >>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: > >>>>>> ... > >>>>>>>> Hi Ryan, > >>>>>>>> > >>>>>>>> One thing that immediately came up during some recent testing of mTHP > >>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running > >>>>>>>> tests on a machine at a time for now, inside various containers and > >>>>>>>> such, and it would be nice if there were an easy way to get some numbers > >>>>>>>> for the mTHPs across the whole machine. > >>>> > >>>> Just to confirm, you're expecting these "global" stats be truely global and not > >>>> per-container? (asking because you exploicitly mentioned being in a container). > >>>> If you want per-container, then you can probably just create the container in a > >>>> cgroup? > >>>> > >>>>>>>> > >>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably > >>>>>>>> this is fine as-is. But I wanted to give some initial reactions from > >>>>>>>> just some quick runs: the global state would be convenient. > >>>> > >>>> Thanks for taking this for a spin! Appreciate the feedback. > >>>> > >>>>>>> > >>>>>>> +1. but this seems to be impossible by scanning pagemap? > >>>>>>> so may we add this statistics information in kernel just like > >>>>>>> /proc/meminfo or a separate /proc/mthp_info? > >>>>>>> > >>>>>> > >>>>>> Yes. From my perspective, it looks like the global stats are more useful > >>>>>> initially, and the more detailed per-pid or per-cgroup stats are the > >>>>>> next level of investigation. So feels odd to start with the more > >>>>>> detailed stats. > >>>>>> > >>>>> > >>>>> probably because this can be done without the modification of the kernel. > >>>> > >>>> Yes indeed, as John said in an earlier thread, my previous attempts to add stats > >>>> directly in the kernel got pushback; DavidH was concerned that we don't really > >>>> know exectly how to account mTHPs yet > >>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding > >>>> the wrong ABI and having to maintain it forever. There has also been some > >>>> pushback regarding adding more values to multi-value files in sysfs, so David > >>>> was suggesting coming up with a whole new scheme at some point (I know > >>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups > >>>> do live in sysfs). > >>>> > >>>> Anyway, this script was my attempt to 1) provide a short term solution to the > >>>> "we need some stats" request and 2) provide a context in which to explore what > >>>> the right stats are - this script can evolve without the ABI problem. > >>>> > >>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which > >>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, > >>>>> eg. libc_malloc, java heaps etc. > >>>>> > >>>>> Different vma types can have different anon_name. So I can use the detailed > >>>>> info to find out if specific VMAs have gotten mTHP properly and how many > >>>>> they have gotten. 
> >>>>> > >>>>>> However, Ryan did clearly say, above, "In future we may wish to > >>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And > >>>>>> earlier he ran into some pushback on trying to set up /proc or /sys > >>>>>> values because this is still such an early feature. > >>>>>> > >>>>>> I wonder if we could put the global stats in debugfs for now? That's > >>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" > >>>>>> location. > >>>> > >>>> Now that I think about it, I wonder if we can add a --global mode to the script > >>>> (or just infer global when neither --pid nor --cgroup are provided). I think I > >>>> should be able to determine all the physical memory ranges from /proc/iomem, > >>>> then grab all the info we need from /proc/kpageflags. We should then be able to > >>>> process it all in much the same way as for --pid/--cgroup and provide the same > >>>> stats, but it will apply globally. What do you think? > >> > >> Having now thought about this for a few mins (in the shower, if anyone wants the > >> complete picture :) ), this won't quite work. This approach doesn't have the > >> virtual mapping information so the best it can do is tell us "how many of each > >> size of THP are allocated?" - it doesn't tell us anything about whether they are > >> fully or partially mapped or what their alignment is (all necessary if we want > >> to know if they are contpte-mapped). So I don't think this approach is going to > >> be particularly useful. > >> > >> And this is also the big problem if we want to gather stats inside the kernel; > >> if we want something equivalant to /proc/meminfo's > >> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the > >> allocation of the THP but also whether it is mapped. That's easy for > >> PMD-mappings, because there is only one entry to consider - when you set it, you > >> increment the number of PMD-mapped THPs, when you clear it, you decrement. But > >> for PTE-mappings it's harder; you know the size when you are mapping so its easy > >> to increment, but you can do a partial unmap, so you would need to scan the PTEs > >> to figure out if we are unmapping the first page of a previously > >> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to > >> determine "is this folio fully and contiguously mapped in at least one process?". > > > > as OPPO's approach I shared to you before is maintaining two mapcount > > 1. entire map > > 2. subpage's map > > 3. if 1 and 2 both exist, it is DoubleMapped. > > > > This isn't a problem for us. and everytime if we do a partial unmap, > > we have an explicit > > cont_pte split which will decrease the entire map and increase the > > subpage's mapcount. > > > > but its downside is that we expose this info to mm-core. > > OK, but I think we have a slightly more generic situation going on with the > upstream; If I've understood correctly, you are using the PTE_CONT bit in the > PTE to determne if its fully mapped? That works for your case where you only > have 1 size of THP that you care about (contpte-size). But for the upstream, we > have multi-size THP so we can't use the PTE_CONT bit to determine if its fully > mapped because we can only use that bit if the THP is at least 64K and aligned, > and only on arm64. We would need a SW bit for this purpose, and the mm would > need to update that SW bit for every PTE one the full -> partial map transition. 
My current implementation does use cont_pte but i don't think it is a must-have. we don't need a bit in PTE to know if we are partially unmapping a large folio at all. as long as we are unmapping a part of a large folio, we do know what we are doing. if a large folio is mapped entirely in a process, we get only entire_map +1, if we are unmapping a subpage of it, we get entire_map -1 and remained subpage's mapcount + 1. if we are only mapping a part of this large folio, we only increase its subpages' mapcount. > > > > >> > >> So depending on what global stats you actually need, the route to getting them > >> cheaply may not be easy. (My previous attempt to add stats cheated and didn't > >> try to track "fully mapped" vs "partially mapped" - instead it just counted the > >> number of pages belonging to a THP (of any size) that were mapped. > >> > >> If you need the global mapping state, then the short term way to do this would > >> be to provide the root cgroup, then have the script recurse through all child > >> cgroups; That would pick up all the processes and iterate through them: > >> > >> $ thpmaps --cgroup /sys/fs/cgroup --summary ... > >> > >> This won't quite work with the current version because it doesn't recurse > >> through the cgroup children currently, but that would be easy to add. > >> > >> > >>> > >>> for debug purposes, it should be good. imaging there is a health > >>> monitor which needs > >>> to sample the stats of large folios online and periodically, this > >>> might be too expensive. > >>> > >>>> > >>>> If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script > >>>> for now. > >>>> > >>>>> > >>>>> +1. > >>>>> > >>>>>> > >>>>>> > >>>>>> thanks, > >>>>>> -- > >>>>>> John Hubbard > >>>>>> NVIDIA > >>>>>> > >>>>> > >>> > > Thanks Barry
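The bookkeeping Barry describes can be written out as a small model. The sketch below is purely illustrative (Python rather than kernel code, with invented names) and only encodes the rules stated above; David's reply that follows points out where this gets harder once further subpages are unmapped:

--8<--
class FolioMapModel:
    """Toy model of the two-counter scheme described above:
    one entire-map count plus per-subpage map counts."""

    def __init__(self, nr_pages):
        self.nr_pages = nr_pages
        self.entire_mapcount = 0                 # folio mapped as one unit
        self.subpage_mapcount = [0] * nr_pages   # per-subpage mappings

    def map_entire(self):
        # A process maps the whole folio contiguously.
        self.entire_mapcount += 1

    def map_subpage(self, idx):
        # A process maps only one subpage of the folio.
        self.subpage_mapcount[idx] += 1

    def partial_unmap_of_entire(self, idx):
        # First partial unmap of a fully-mapped folio: drop the entire
        # count and transfer the remaining pages to subpage counts.
        self.entire_mapcount -= 1
        for i in range(self.nr_pages):
            if i != idx:
                self.subpage_mapcount[i] += 1

    def fully_mapped_somewhere(self):
        # The question the thread is asking; cheap under this scheme.
        return self.entire_mapcount > 0
--8<--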
On 10.01.24 11:48, Barry Song wrote: > On Wed, Jan 10, 2024 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 10/01/2024 10:30, Barry Song wrote: >>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> On 10/01/2024 09:09, Barry Song wrote: >>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>> >>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>> >>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>> ... >>>>>>>>>> Hi Ryan, >>>>>>>>>> >>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running >>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers >>>>>>>>>> for the mTHPs across the whole machine. >>>>>> >>>>>> Just to confirm, you're expecting these "global" stats be truely global and not >>>>>> per-container? (asking because you exploicitly mentioned being in a container). >>>>>> If you want per-container, then you can probably just create the container in a >>>>>> cgroup? >>>>>> >>>>>>>>>> >>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>> >>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>> >>>>>>>>> >>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>> >>>>>>>> >>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>> detailed stats. >>>>>>>> >>>>>>> >>>>>>> probably because this can be done without the modification of the kernel. >>>>>> >>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add stats >>>>>> directly in the kernel got pushback; DavidH was concerned that we don't really >>>>>> know exectly how to account mTHPs yet >>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding >>>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>>> pushback regarding adding more values to multi-value files in sysfs, so David >>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups >>>>>> do live in sysfs). >>>>>> >>>>>> Anyway, this script was my attempt to 1) provide a short term solution to the >>>>>> "we need some stats" request and 2) provide a context in which to explore what >>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>> >>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which >>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>> eg. libc_malloc, java heaps etc. >>>>>>> >>>>>>> Different vma types can have different anon_name. 
So I can use the detailed >>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>> they have gotten. >>>>>>> >>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>> values because this is still such an early feature. >>>>>>>> >>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>> location. >>>>>> >>>>>> Now that I think about it, I wonder if we can add a --global mode to the script >>>>>> (or just infer global when neither --pid nor --cgroup are provided). I think I >>>>>> should be able to determine all the physical memory ranges from /proc/iomem, >>>>>> then grab all the info we need from /proc/kpageflags. We should then be able to >>>>>> process it all in much the same way as for --pid/--cgroup and provide the same >>>>>> stats, but it will apply globally. What do you think? >>>> >>>> Having now thought about this for a few mins (in the shower, if anyone wants the >>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>> virtual mapping information so the best it can do is tell us "how many of each >>>> size of THP are allocated?" - it doesn't tell us anything about whether they are >>>> fully or partially mapped or what their alignment is (all necessary if we want >>>> to know if they are contpte-mapped). So I don't think this approach is going to >>>> be particularly useful. >>>> >>>> And this is also the big problem if we want to gather stats inside the kernel; >>>> if we want something equivalant to /proc/meminfo's >>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>> allocation of the THP but also whether it is mapped. That's easy for >>>> PMD-mappings, because there is only one entry to consider - when you set it, you >>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But >>>> for PTE-mappings it's harder; you know the size when you are mapping so its easy >>>> to increment, but you can do a partial unmap, so you would need to scan the PTEs >>>> to figure out if we are unmapping the first page of a previously >>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>> determine "is this folio fully and contiguously mapped in at least one process?". >>> >>> as OPPO's approach I shared to you before is maintaining two mapcount >>> 1. entire map >>> 2. subpage's map >>> 3. if 1 and 2 both exist, it is DoubleMapped. >>> >>> This isn't a problem for us. and everytime if we do a partial unmap, >>> we have an explicit >>> cont_pte split which will decrease the entire map and increase the >>> subpage's mapcount. >>> >>> but its downside is that we expose this info to mm-core. >> >> OK, but I think we have a slightly more generic situation going on with the >> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >> PTE to determne if its fully mapped? That works for your case where you only >> have 1 size of THP that you care about (contpte-size). But for the upstream, we >> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >> mapped because we can only use that bit if the THP is at least 64K and aligned, >> and only on arm64. 
We would need a SW bit for this purpose, and the mm would >> need to update that SW bit for every PTE one the full -> partial map transition. > > My current implementation does use cont_pte but i don't think it is a must-have. > we don't need a bit in PTE to know if we are partially unmapping a large folio > at all. > > as long as we are unmapping a part of a large folio, we do know what we are > doing. if a large folio is mapped entirely in a process, we get only > entire_map +1, > if we are unmapping a subpage of it, we get entire_map -1 and remained subpage's > mapcount + 1. if we are only mapping a part of this large folio, we > only increase > its subpages' mapcount. That doesn't work as soon as you unmap a second subpage. Not to mention that people ( :) ) are working on removing the subpage mapcounts. I'm going to propose that as a topic for LSF/MM soon, once I get to it.
On 10/01/2024 10:42, David Hildenbrand wrote: > On 10.01.24 11:38, Ryan Roberts wrote: >> On 10/01/2024 10:30, Barry Song wrote: >>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> On 10/01/2024 09:09, Barry Song wrote: >>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>> >>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>> >>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>> ... >>>>>>>>>> Hi Ryan, >>>>>>>>>> >>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running >>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers >>>>>>>>>> for the mTHPs across the whole machine. >>>>>> >>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>> and not >>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>> container). >>>>>> If you want per-container, then you can probably just create the container >>>>>> in a >>>>>> cgroup? >>>>>> >>>>>>>>>> >>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>> >>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>> >>>>>>>>> >>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>> >>>>>>>> >>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>> detailed stats. >>>>>>>> >>>>>>> >>>>>>> probably because this can be done without the modification of the kernel. >>>>>> >>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add >>>>>> stats >>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>> really >>>>>> know exectly how to account mTHPs yet >>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>> adding >>>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>>> pushback regarding adding more values to multi-value files in sysfs, so David >>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>> cgroups >>>>>> do live in sysfs). >>>>>> >>>>>> Anyway, this script was my attempt to 1) provide a short term solution to the >>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>> what >>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>> >>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which >>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>> eg. libc_malloc, java heaps etc. >>>>>>> >>>>>>> Different vma types can have different anon_name. 
So I can use the detailed >>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>> they have gotten. >>>>>>> >>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>> values because this is still such an early feature. >>>>>>>> >>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>> location. >>>>>> >>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>> script >>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>> think I >>>>>> should be able to determine all the physical memory ranges from /proc/iomem, >>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>> able to >>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>> same >>>>>> stats, but it will apply globally. What do you think? >>>> >>>> Having now thought about this for a few mins (in the shower, if anyone wants >>>> the >>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>> virtual mapping information so the best it can do is tell us "how many of each >>>> size of THP are allocated?" - it doesn't tell us anything about whether they >>>> are >>>> fully or partially mapped or what their alignment is (all necessary if we want >>>> to know if they are contpte-mapped). So I don't think this approach is going to >>>> be particularly useful. >>>> >>>> And this is also the big problem if we want to gather stats inside the kernel; >>>> if we want something equivalant to /proc/meminfo's >>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>> allocation of the THP but also whether it is mapped. That's easy for >>>> PMD-mappings, because there is only one entry to consider - when you set it, >>>> you >>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But >>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>> easy >>>> to increment, but you can do a partial unmap, so you would need to scan the >>>> PTEs >>>> to figure out if we are unmapping the first page of a previously >>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>> determine "is this folio fully and contiguously mapped in at least one >>>> process?". >>> >>> as OPPO's approach I shared to you before is maintaining two mapcount >>> 1. entire map >>> 2. subpage's map >>> 3. if 1 and 2 both exist, it is DoubleMapped. >>> >>> This isn't a problem for us. and everytime if we do a partial unmap, >>> we have an explicit >>> cont_pte split which will decrease the entire map and increase the >>> subpage's mapcount. >>> >>> but its downside is that we expose this info to mm-core. >> >> OK, but I think we have a slightly more generic situation going on with the >> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >> PTE to determne if its fully mapped? That works for your case where you only >> have 1 size of THP that you care about (contpte-size). But for the upstream, we >> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >> mapped because we can only use that bit if the THP is at least 64K and aligned, >> and only on arm64. 
We would need a SW bit for this purpose, and the mm would >> need to update that SW bit for every PTE one the full -> partial map transition. > > Oh no. Let's not make everything more complicated for the purpose of some stats. > Indeed, I was intending to argue *against* doing it this way. Fundamentally, if we want to know what's fully mapped and what's not, then I don't see any way other than by scanning the page tables and we might as well do that in user space with this script. Although, I expect you will shortly make a proposal that is simple to implement and prove me wrong ;-)
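To make the "scan the page tables in user space" option concrete, here is a rough sketch of the kind of check such a scan can do against /proc/<pid>/pagemap: is a naturally aligned block of virtual memory fully present, physically contiguous and physically aligned (the contpte-style condition)? The helper names and the fixed 4K page size are assumptions, not the actual thpmaps code:

--8<--
import struct

PAGE_SIZE = 4096
PM_ENTRY_BYTES = 8
PM_PRESENT = 1 << 63
PM_PFN_MASK = (1 << 55) - 1        # bits 0-54 hold the PFN when present

def read_pfns(pid, vaddr, nr_pages):
    """Return the PFN of each virtual page starting at vaddr, or None for
    pages that are not present. Requires root, like pagemap itself."""
    with open(f'/proc/{pid}/pagemap', 'rb') as f:
        f.seek((vaddr // PAGE_SIZE) * PM_ENTRY_BYTES)
        data = f.read(nr_pages * PM_ENTRY_BYTES)
    return [entry & PM_PFN_MASK if entry & PM_PRESENT else None
            for (entry,) in struct.iter_unpack('<Q', data)]

def is_cont_block(pid, vaddr, block_pages=16):
    """Check whether the naturally aligned block containing vaddr (16 pages
    = 64K by default) is fully present, physically contiguous and starts on
    a suitably aligned PFN."""
    start = vaddr & ~(block_pages * PAGE_SIZE - 1)
    pfns = read_pfns(pid, start, block_pages)
    if None in pfns or pfns[0] % block_pages != 0:
        return False
    return all(pfns[i] == pfns[0] + i for i in range(block_pages))
--8<--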
On 10/01/2024 10:54, David Hildenbrand wrote: > On 10.01.24 11:48, Barry Song wrote: >> On Wed, Jan 10, 2024 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>> >>> On 10/01/2024 10:30, Barry Song wrote: >>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>> >>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>> >>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>> >>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> >>>>>>>>>> wrote: >>>>>>>>> ... >>>>>>>>>>> Hi Ryan, >>>>>>>>>>> >>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running >>>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers >>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>> >>>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>>> and not >>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>> container). >>>>>>> If you want per-container, then you can probably just create the >>>>>>> container in a >>>>>>> cgroup? >>>>>>> >>>>>>>>>>> >>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>> >>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>> >>>>>>>>>> >>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>> >>>>>>>>> >>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>> detailed stats. >>>>>>>>> >>>>>>>> >>>>>>>> probably because this can be done without the modification of the kernel. >>>>>>> >>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to >>>>>>> add stats >>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>>> really >>>>>>> know exectly how to account mTHPs yet >>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>>> adding >>>>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>>>> pushback regarding adding more values to multi-value files in sysfs, so >>>>>>> David >>>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>>> cgroups >>>>>>> do live in sysfs). >>>>>>> >>>>>>> Anyway, this script was my attempt to 1) provide a short term solution to >>>>>>> the >>>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>>> what >>>>>>> the right stats are - this script can evolve without the ABI problem. 
>>>>>>> >>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in >>>>>>>> which >>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>> >>>>>>>> Different vma types can have different anon_name. So I can use the detailed >>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>>> they have gotten. >>>>>>>> >>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>> values because this is still such an early feature. >>>>>>>>> >>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>>> location. >>>>>>> >>>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>>> script >>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>>> think I >>>>>>> should be able to determine all the physical memory ranges from /proc/iomem, >>>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>>> able to >>>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>>> same >>>>>>> stats, but it will apply globally. What do you think? >>>>> >>>>> Having now thought about this for a few mins (in the shower, if anyone >>>>> wants the >>>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>>> virtual mapping information so the best it can do is tell us "how many of each >>>>> size of THP are allocated?" - it doesn't tell us anything about whether >>>>> they are >>>>> fully or partially mapped or what their alignment is (all necessary if we want >>>>> to know if they are contpte-mapped). So I don't think this approach is >>>>> going to >>>>> be particularly useful. >>>>> >>>>> And this is also the big problem if we want to gather stats inside the kernel; >>>>> if we want something equivalant to /proc/meminfo's >>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>> PMD-mappings, because there is only one entry to consider - when you set >>>>> it, you >>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But >>>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>>> easy >>>>> to increment, but you can do a partial unmap, so you would need to scan the >>>>> PTEs >>>>> to figure out if we are unmapping the first page of a previously >>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>>> determine "is this folio fully and contiguously mapped in at least one >>>>> process?". >>>> >>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>> 1. entire map >>>> 2. subpage's map >>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>> >>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>> we have an explicit >>>> cont_pte split which will decrease the entire map and increase the >>>> subpage's mapcount. >>>> >>>> but its downside is that we expose this info to mm-core. 
>>> >>> OK, but I think we have a slightly more generic situation going on with the >>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >>> PTE to determne if its fully mapped? That works for your case where you only >>> have 1 size of THP that you care about (contpte-size). But for the upstream, we >>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >>> mapped because we can only use that bit if the THP is at least 64K and aligned, >>> and only on arm64. We would need a SW bit for this purpose, and the mm would >>> need to update that SW bit for every PTE one the full -> partial map transition. >> >> My current implementation does use cont_pte but i don't think it is a must-have. >> we don't need a bit in PTE to know if we are partially unmapping a large folio >> at all. >> >> as long as we are unmapping a part of a large folio, we do know what we are >> doing. if a large folio is mapped entirely in a process, we get only >> entire_map +1, >> if we are unmapping a subpage of it, we get entire_map -1 and remained subpage's >> mapcount + 1. if we are only mapping a part of this large folio, we >> only increase >> its subpages' mapcount. > > That doesn't work as soon as you unmap a second subpage. Not to mention that > people ( :) ) are working on removing the subpage mapcounts. Yes, that was my point - Oppo's implementation relies on the bit in the PTE to tell the difference between unmapping the first subpage and unmapping the others. We don't have that luxury here. > > I'm going propose that as a topic for LSF/MM soon, once I get to it. >
On 10.01.24 11:55, Ryan Roberts wrote: > On 10/01/2024 10:42, David Hildenbrand wrote: >> On 10.01.24 11:38, Ryan Roberts wrote: >>> On 10/01/2024 10:30, Barry Song wrote: >>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>> >>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>> >>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>> >>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>> ... >>>>>>>>>>> Hi Ryan, >>>>>>>>>>> >>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running >>>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers >>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>> >>>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>>> and not >>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>> container). >>>>>>> If you want per-container, then you can probably just create the container >>>>>>> in a >>>>>>> cgroup? >>>>>>> >>>>>>>>>>> >>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>> >>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>> >>>>>>>>>> >>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>> >>>>>>>>> >>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>> detailed stats. >>>>>>>>> >>>>>>>> >>>>>>>> probably because this can be done without the modification of the kernel. >>>>>>> >>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add >>>>>>> stats >>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>>> really >>>>>>> know exectly how to account mTHPs yet >>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>>> adding >>>>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>>>> pushback regarding adding more values to multi-value files in sysfs, so David >>>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>>> cgroups >>>>>>> do live in sysfs). >>>>>>> >>>>>>> Anyway, this script was my attempt to 1) provide a short term solution to the >>>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>>> what >>>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>>> >>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which >>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>>> eg. libc_malloc, java heaps etc. 
>>>>>>>> >>>>>>>> Different vma types can have different anon_name. So I can use the detailed >>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>>> they have gotten. >>>>>>>> >>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>> values because this is still such an early feature. >>>>>>>>> >>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>>> location. >>>>>>> >>>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>>> script >>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>>> think I >>>>>>> should be able to determine all the physical memory ranges from /proc/iomem, >>>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>>> able to >>>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>>> same >>>>>>> stats, but it will apply globally. What do you think? >>>>> >>>>> Having now thought about this for a few mins (in the shower, if anyone wants >>>>> the >>>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>>> virtual mapping information so the best it can do is tell us "how many of each >>>>> size of THP are allocated?" - it doesn't tell us anything about whether they >>>>> are >>>>> fully or partially mapped or what their alignment is (all necessary if we want >>>>> to know if they are contpte-mapped). So I don't think this approach is going to >>>>> be particularly useful. >>>>> >>>>> And this is also the big problem if we want to gather stats inside the kernel; >>>>> if we want something equivalant to /proc/meminfo's >>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>> PMD-mappings, because there is only one entry to consider - when you set it, >>>>> you >>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But >>>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>>> easy >>>>> to increment, but you can do a partial unmap, so you would need to scan the >>>>> PTEs >>>>> to figure out if we are unmapping the first page of a previously >>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>>> determine "is this folio fully and contiguously mapped in at least one >>>>> process?". >>>> >>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>> 1. entire map >>>> 2. subpage's map >>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>> >>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>> we have an explicit >>>> cont_pte split which will decrease the entire map and increase the >>>> subpage's mapcount. >>>> >>>> but its downside is that we expose this info to mm-core. >>> >>> OK, but I think we have a slightly more generic situation going on with the >>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >>> PTE to determne if its fully mapped? That works for your case where you only >>> have 1 size of THP that you care about (contpte-size). 
But for the upstream, we >>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >>> mapped because we can only use that bit if the THP is at least 64K and aligned, >>> and only on arm64. We would need a SW bit for this purpose, and the mm would >>> need to update that SW bit for every PTE one the full -> partial map transition. >> >> Oh no. Let's not make everything more complicated for the purpose of some stats. >> > > Indeed, I was intending to argue *against* doing it this way. Fundamentally, if > we want to know what's fully mapped and what's not, then I don't see any way > other than by scanning the page tables and we might as well do that in user > space with this script. > > Although, I expect you will shortly make a proposal that is simple to implement > and prove me wrong ;-) Unlikely :) As you said, once you have multiple folio sizes, it stops really making sense. Assume you have a 128 kiB pagecache folio, and half of that is mapped. You can set cont-pte bits on that half and all is fine. Or AMD can benefit from its optimizations without the cont-pte bit and everything is fine. We want simple stats that tell us which folio sizes are actually allocated. For everything else, just scan the process to figure out what exactly is going on.
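For the "which folio sizes are actually allocated" number, something close to it can already be derived in user space from /proc/kpageflags by counting each THP compound head together with the run of tail pages that follows it. A rough sketch, using the documented kpageflags bits (COMPOUND_HEAD=15, COMPOUND_TAIL=16, THP=22) and assuming 4K pages; the function name is invented, and the result deliberately says nothing about how, or whether, those folios are mapped:

--8<--
import struct
from collections import Counter

KPF_COMPOUND_HEAD = 1 << 15
KPF_COMPOUND_TAIL = 1 << 16
KPF_THP = 1 << 22
PAGE_KB = 4

def thp_alloc_sizes(path='/proc/kpageflags'):
    """Histogram of allocated THP folio sizes in kB: each THP compound
    head starts a folio, each following compound tail extends it."""
    sizes = Counter()
    run = 0                 # pages in the folio currently being counted
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(8 * 4096)
            if not chunk:
                break
            for (flags,) in struct.iter_unpack('<Q', chunk):
                if flags & KPF_THP and flags & KPF_COMPOUND_HEAD:
                    if run:
                        sizes[run * PAGE_KB] += 1
                    run = 1
                elif run and flags & KPF_COMPOUND_TAIL:
                    run += 1
                elif run:
                    sizes[run * PAGE_KB] += 1
                    run = 0
    if run:
        sizes[run * PAGE_KB] += 1
    return sizes
--8<--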
On 10.01.24 11:58, Ryan Roberts wrote: > On 10/01/2024 10:54, David Hildenbrand wrote: >> On 10.01.24 11:48, Barry Song wrote: >>> On Wed, Jan 10, 2024 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> On 10/01/2024 10:30, Barry Song wrote: >>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>> >>>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>> >>>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>>> >>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> >>>>>>>>>>> wrote: >>>>>>>>>> ... >>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>> >>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running >>>>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers >>>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>>> >>>>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>>>> and not >>>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>>> container). >>>>>>>> If you want per-container, then you can probably just create the >>>>>>>> container in a >>>>>>>> cgroup? >>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>>> >>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>>> >>>>>>>>>>> >>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>>> detailed stats. >>>>>>>>>> >>>>>>>>> >>>>>>>>> probably because this can be done without the modification of the kernel. >>>>>>>> >>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to >>>>>>>> add stats >>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>>>> really >>>>>>>> know exectly how to account mTHPs yet >>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>>>> adding >>>>>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so >>>>>>>> David >>>>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>>>> cgroups >>>>>>>> do live in sysfs). >>>>>>>> >>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution to >>>>>>>> the >>>>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>>>> what >>>>>>>> the right stats are - this script can evolve without the ABI problem. 
>>>>>>>> >>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in >>>>>>>>> which >>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>>> >>>>>>>>> Different vma types can have different anon_name. So I can use the detailed >>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>>>> they have gotten. >>>>>>>>> >>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>>> values because this is still such an early feature. >>>>>>>>>> >>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>>>> location. >>>>>>>> >>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>>>> script >>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>>>> think I >>>>>>>> should be able to determine all the physical memory ranges from /proc/iomem, >>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>>>> able to >>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>>>> same >>>>>>>> stats, but it will apply globally. What do you think? >>>>>> >>>>>> Having now thought about this for a few mins (in the shower, if anyone >>>>>> wants the >>>>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>>>> virtual mapping information so the best it can do is tell us "how many of each >>>>>> size of THP are allocated?" - it doesn't tell us anything about whether >>>>>> they are >>>>>> fully or partially mapped or what their alignment is (all necessary if we want >>>>>> to know if they are contpte-mapped). So I don't think this approach is >>>>>> going to >>>>>> be particularly useful. >>>>>> >>>>>> And this is also the big problem if we want to gather stats inside the kernel; >>>>>> if we want something equivalant to /proc/meminfo's >>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>>> PMD-mappings, because there is only one entry to consider - when you set >>>>>> it, you >>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But >>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>>>> easy >>>>>> to increment, but you can do a partial unmap, so you would need to scan the >>>>>> PTEs >>>>>> to figure out if we are unmapping the first page of a previously >>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>>>> determine "is this folio fully and contiguously mapped in at least one >>>>>> process?". >>>>> >>>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>>> 1. entire map >>>>> 2. subpage's map >>>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>>> >>>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>>> we have an explicit >>>>> cont_pte split which will decrease the entire map and increase the >>>>> subpage's mapcount. >>>>> >>>>> but its downside is that we expose this info to mm-core. 
>>>> >>>> OK, but I think we have a slightly more generic situation going on with the >>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >>>> PTE to determne if its fully mapped? That works for your case where you only >>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we >>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >>>> mapped because we can only use that bit if the THP is at least 64K and aligned, >>>> and only on arm64. We would need a SW bit for this purpose, and the mm would >>>> need to update that SW bit for every PTE one the full -> partial map transition. >>> >>> My current implementation does use cont_pte but i don't think it is a must-have. >>> we don't need a bit in PTE to know if we are partially unmapping a large folio >>> at all. >>> >>> as long as we are unmapping a part of a large folio, we do know what we are >>> doing. if a large folio is mapped entirely in a process, we get only >>> entire_map +1, >>> if we are unmapping a subpage of it, we get entire_map -1 and remained subpage's >>> mapcount + 1. if we are only mapping a part of this large folio, we >>> only increase >>> its subpages' mapcount. >> >> That doesn't work as soon as you unmap a second subpage. Not to mention that >> people ( :) ) are working on removing the subpage mapcounts. > > Yes, that was my point - Oppo's implementation relies on the bit in the PTE to > tell the difference between unmapping the first subpage and unmapping the > others. We don't have that luxury here. Yes, and once we're thinking of bigger folios that eventually span multiple page tables, these PTE-bit games won't scale.
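The aligned-and-contiguous question behind the PTE_CONT discussion above (the precondition for a contpte mapping on arm64) is cheap to answer from userspace once the PFNs are in hand. Below is a hedged sketch of the kind of check the script's --cont option implies, operating on the (vaddr, pfn, flags) tuples produced by the earlier scan_range() sketch; the function name is illustrative, not the script's actual code.

def find_cont_blocks(pages, block_pages, page_size=4096):
    # pages: list of (vaddr, pfn, flags) for the present pages of a range.
    # Returns the start vaddr of every naturally aligned block of
    # 'block_pages' pages that is fully populated with contiguous PFNs
    # at the same alignment, i.e. a candidate for a contpte mapping.
    pfn_by_vaddr = {v: p for v, p, _ in pages}
    blocks = []
    for vaddr, pfn, _ in pages:
        if (vaddr // page_size) % block_pages or pfn % block_pages:
            continue                       # VA or PA not naturally aligned
        if all(pfn_by_vaddr.get(vaddr + i * page_size) == pfn + i
               for i in range(block_pages)):
            blocks.append(vaddr)
    return blocks

With 4 kB pages, find_cont_blocks(pages, 16) reports 64 kB contpte candidates; other block sizes work the same way.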
On Wed, Jan 10, 2024 at 6:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 10/01/2024 10:54, David Hildenbrand wrote: > > On 10.01.24 11:48, Barry Song wrote: > >> On Wed, Jan 10, 2024 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>> > >>> On 10/01/2024 10:30, Barry Song wrote: > >>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>> > >>>>> On 10/01/2024 09:09, Barry Song wrote: > >>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>> > >>>>>>> On 10/01/2024 08:02, Barry Song wrote: > >>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>>>>>>> > >>>>>>>>> On 1/9/24 19:51, Barry Song wrote: > >>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> > >>>>>>>>>> wrote: > >>>>>>>>> ... > >>>>>>>>>>> Hi Ryan, > >>>>>>>>>>> > >>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP > >>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running > >>>>>>>>>>> tests on a machine at a time for now, inside various containers and > >>>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers > >>>>>>>>>>> for the mTHPs across the whole machine. > >>>>>>> > >>>>>>> Just to confirm, you're expecting these "global" stats be truely global > >>>>>>> and not > >>>>>>> per-container? (asking because you exploicitly mentioned being in a > >>>>>>> container). > >>>>>>> If you want per-container, then you can probably just create the > >>>>>>> container in a > >>>>>>> cgroup? > >>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably > >>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from > >>>>>>>>>>> just some quick runs: the global state would be convenient. > >>>>>>> > >>>>>>> Thanks for taking this for a spin! Appreciate the feedback. > >>>>>>> > >>>>>>>>>> > >>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? > >>>>>>>>>> so may we add this statistics information in kernel just like > >>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful > >>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the > >>>>>>>>> next level of investigation. So feels odd to start with the more > >>>>>>>>> detailed stats. > >>>>>>>>> > >>>>>>>> > >>>>>>>> probably because this can be done without the modification of the kernel. > >>>>>>> > >>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to > >>>>>>> add stats > >>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't > >>>>>>> really > >>>>>>> know exectly how to account mTHPs yet > >>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up > >>>>>>> adding > >>>>>>> the wrong ABI and having to maintain it forever. There has also been some > >>>>>>> pushback regarding adding more values to multi-value files in sysfs, so > >>>>>>> David > >>>>>>> was suggesting coming up with a whole new scheme at some point (I know > >>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and > >>>>>>> cgroups > >>>>>>> do live in sysfs). 
> >>>>>>> > >>>>>>> Anyway, this script was my attempt to 1) provide a short term solution to > >>>>>>> the > >>>>>>> "we need some stats" request and 2) provide a context in which to explore > >>>>>>> what > >>>>>>> the right stats are - this script can evolve without the ABI problem. > >>>>>>> > >>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in > >>>>>>>> which > >>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, > >>>>>>>> eg. libc_malloc, java heaps etc. > >>>>>>>> > >>>>>>>> Different vma types can have different anon_name. So I can use the detailed > >>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many > >>>>>>>> they have gotten. > >>>>>>>> > >>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to > >>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And > >>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys > >>>>>>>>> values because this is still such an early feature. > >>>>>>>>> > >>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's > >>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" > >>>>>>>>> location. > >>>>>>> > >>>>>>> Now that I think about it, I wonder if we can add a --global mode to the > >>>>>>> script > >>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I > >>>>>>> think I > >>>>>>> should be able to determine all the physical memory ranges from /proc/iomem, > >>>>>>> then grab all the info we need from /proc/kpageflags. We should then be > >>>>>>> able to > >>>>>>> process it all in much the same way as for --pid/--cgroup and provide the > >>>>>>> same > >>>>>>> stats, but it will apply globally. What do you think? > >>>>> > >>>>> Having now thought about this for a few mins (in the shower, if anyone > >>>>> wants the > >>>>> complete picture :) ), this won't quite work. This approach doesn't have the > >>>>> virtual mapping information so the best it can do is tell us "how many of each > >>>>> size of THP are allocated?" - it doesn't tell us anything about whether > >>>>> they are > >>>>> fully or partially mapped or what their alignment is (all necessary if we want > >>>>> to know if they are contpte-mapped). So I don't think this approach is > >>>>> going to > >>>>> be particularly useful. > >>>>> > >>>>> And this is also the big problem if we want to gather stats inside the kernel; > >>>>> if we want something equivalant to /proc/meminfo's > >>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the > >>>>> allocation of the THP but also whether it is mapped. That's easy for > >>>>> PMD-mappings, because there is only one entry to consider - when you set > >>>>> it, you > >>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But > >>>>> for PTE-mappings it's harder; you know the size when you are mapping so its > >>>>> easy > >>>>> to increment, but you can do a partial unmap, so you would need to scan the > >>>>> PTEs > >>>>> to figure out if we are unmapping the first page of a previously > >>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to > >>>>> determine "is this folio fully and contiguously mapped in at least one > >>>>> process?". > >>>> > >>>> as OPPO's approach I shared to you before is maintaining two mapcount > >>>> 1. entire map > >>>> 2. subpage's map > >>>> 3. if 1 and 2 both exist, it is DoubleMapped. 
> >>>> > >>>> This isn't a problem for us. and everytime if we do a partial unmap, > >>>> we have an explicit > >>>> cont_pte split which will decrease the entire map and increase the > >>>> subpage's mapcount. > >>>> > >>>> but its downside is that we expose this info to mm-core. > >>> > >>> OK, but I think we have a slightly more generic situation going on with the > >>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the > >>> PTE to determne if its fully mapped? That works for your case where you only > >>> have 1 size of THP that you care about (contpte-size). But for the upstream, we > >>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully > >>> mapped because we can only use that bit if the THP is at least 64K and aligned, > >>> and only on arm64. We would need a SW bit for this purpose, and the mm would > >>> need to update that SW bit for every PTE one the full -> partial map transition. > >> > >> My current implementation does use cont_pte but i don't think it is a must-have. > >> we don't need a bit in PTE to know if we are partially unmapping a large folio > >> at all. > >> > >> as long as we are unmapping a part of a large folio, we do know what we are > >> doing. if a large folio is mapped entirely in a process, we get only > >> entire_map +1, > >> if we are unmapping a subpage of it, we get entire_map -1 and remained subpage's > >> mapcount + 1. if we are only mapping a part of this large folio, we > >> only increase > >> its subpages' mapcount. > > > > That doesn't work as soon as you unmap a second subpage. Not to mention that > > people ( :) ) are working on removing the subpage mapcounts. > > Yes, that was my point - Oppo's implementation relies on the bit in the PTE to > tell the difference between unmapping the first subpage and unmapping the > others. We don't have that luxury here. right. The devil is in the details :-) > > > > > I'm going propose that as a topic for LSF/MM soon, once I get to it. > > >
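To make the disagreement above easier to follow, here is a toy Python model (emphatically not kernel code) of the two-counter scheme described: an entire mapcount plus per-subpage mapcounts. The was_fully_mapped argument is exactly the piece of information said to be missing upstream; Oppo recovers it from the PTE_CONT bit, whereas a generic implementation would have to scan the remaining PTEs or carry a software bit.

class FolioMapModel:
    # Toy model of the bookkeeping discussed above, for one large folio.
    def __init__(self, nr_pages):
        self.entire = 0                 # times the folio is mapped in full
        self.sub = [0] * nr_pages       # per-subpage mapcounts

    def map_entire(self):
        self.entire += 1

    def map_subpage(self, idx):
        self.sub[idx] += 1

    def unmap_subpage(self, idx, was_fully_mapped):
        # The caller must already know whether this unmap breaks up a
        # fully mapped folio; obtaining that knowledge cheaply is the
        # hard part without a per-PTE marker.
        if was_fully_mapped:
            self.entire -= 1
            for i in range(len(self.sub)):
                if i != idx:
                    self.sub[i] += 1    # remaining subpages stay PTE-mapped
        else:
            self.sub[idx] -= 1

With the flag available the accounting is trivial; without it, the unmap path cannot tell "first partial unmap of a fully mapped folio" apart from "one more unmap of an already partial mapping", which is the objection raised in the exchange above.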
On 10/01/2024 11:00, David Hildenbrand wrote: > On 10.01.24 11:55, Ryan Roberts wrote: >> On 10/01/2024 10:42, David Hildenbrand wrote: >>> On 10.01.24 11:38, Ryan Roberts wrote: >>>> On 10/01/2024 10:30, Barry Song wrote: >>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>> >>>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>> >>>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>>> >>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> >>>>>>>>>>> wrote: >>>>>>>>>> ... >>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>> >>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm >>>>>>>>>>>> running >>>>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>>>> such, and it would be nice if there were an easy way to get some >>>>>>>>>>>> numbers >>>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>>> >>>>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>>>> and not >>>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>>> container). >>>>>>>> If you want per-container, then you can probably just create the container >>>>>>>> in a >>>>>>>> cgroup? >>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>>> >>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>>> >>>>>>>>>>> >>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>>> detailed stats. >>>>>>>>>> >>>>>>>>> >>>>>>>>> probably because this can be done without the modification of the kernel. >>>>>>>> >>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add >>>>>>>> stats >>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>>>> really >>>>>>>> know exectly how to account mTHPs yet >>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>>>> adding >>>>>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so >>>>>>>> David >>>>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>>>> cgroups >>>>>>>> do live in sysfs). >>>>>>>> >>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution >>>>>>>> to the >>>>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>>>> what >>>>>>>> the right stats are - this script can evolve without the ABI problem. 
>>>>>>>> >>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in >>>>>>>>> which >>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>>> >>>>>>>>> Different vma types can have different anon_name. So I can use the >>>>>>>>> detailed >>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>>>> they have gotten. >>>>>>>>> >>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>>> values because this is still such an early feature. >>>>>>>>>> >>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>>>> location. >>>>>>>> >>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>>>> script >>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>>>> think I >>>>>>>> should be able to determine all the physical memory ranges from >>>>>>>> /proc/iomem, >>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>>>> able to >>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>>>> same >>>>>>>> stats, but it will apply globally. What do you think? >>>>>> >>>>>> Having now thought about this for a few mins (in the shower, if anyone wants >>>>>> the >>>>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>>>> virtual mapping information so the best it can do is tell us "how many of >>>>>> each >>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they >>>>>> are >>>>>> fully or partially mapped or what their alignment is (all necessary if we >>>>>> want >>>>>> to know if they are contpte-mapped). So I don't think this approach is >>>>>> going to >>>>>> be particularly useful. >>>>>> >>>>>> And this is also the big problem if we want to gather stats inside the >>>>>> kernel; >>>>>> if we want something equivalant to /proc/meminfo's >>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>>> PMD-mappings, because there is only one entry to consider - when you set it, >>>>>> you >>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. >>>>>> But >>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>>>> easy >>>>>> to increment, but you can do a partial unmap, so you would need to scan the >>>>>> PTEs >>>>>> to figure out if we are unmapping the first page of a previously >>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>>>> determine "is this folio fully and contiguously mapped in at least one >>>>>> process?". >>>>> >>>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>>> 1. entire map >>>>> 2. subpage's map >>>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>>> >>>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>>> we have an explicit >>>>> cont_pte split which will decrease the entire map and increase the >>>>> subpage's mapcount. >>>>> >>>>> but its downside is that we expose this info to mm-core. 
>>>> >>>> OK, but I think we have a slightly more generic situation going on with the >>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >>>> PTE to determne if its fully mapped? That works for your case where you only >>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we >>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >>>> mapped because we can only use that bit if the THP is at least 64K and aligned, >>>> and only on arm64. We would need a SW bit for this purpose, and the mm would >>>> need to update that SW bit for every PTE one the full -> partial map >>>> transition. >>> >>> Oh no. Let's not make everything more complicated for the purpose of some stats. >>> >> >> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if >> we want to know what's fully mapped and what's not, then I don't see any way >> other than by scanning the page tables and we might as well do that in user >> space with this script. >> >> Although, I expect you will shortly make a proposal that is simple to implement >> and prove me wrong ;-) > > Unlikely :) As you said, once you have multiple folio sizes, it stops really > making sense. > > Assume you have a 128 kiB pageache folio, and half of that is mapped. You can > set cont-pte bits on that half and all is fine. Or AMD can benefit from it's > optimizations without the cont-pte bit and everything is fine. Yes, but for debug and optimization, its useful to know when THPs are fully/partially mapped, when they are unaligned etc. Anyway, the script does that for us, and I think we are tending towards agreement that there are unlikely to be any cost benefits by moving it into the kernel. > > We want simple stats that tell us which folio sizes are actually allocated. For > everything else, just scan the process to figure out what exactly is going on. > Certainly that's much easier to do. But is it valuable? It might be if we also keep stats for the number of failures to allocate the various sizes - then we can see what percentage of high order allocation attempts are successful, which is probably useful.
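On the "which folio sizes are actually allocated" side, the --global idea discussed earlier in the thread can indeed be prototyped from userspace, with the caveat already given: it yields allocation counts only, nothing about how (or whether) the folios are mapped. A rough sketch, assuming 4 kB pages and the documented kpageflags bit numbers (a compound head carrying the THP flag starts a folio; its size is the head plus the run of tails that follows), requiring root:

import struct
from collections import Counter

PAGE_SIZE = 4096
KPF_COMPOUND_HEAD, KPF_COMPOUND_TAIL, KPF_THP = 15, 16, 22

def system_ram_pfns():
    # Yield (first_pfn, nr_pages) for each "System RAM" range in /proc/iomem.
    with open('/proc/iomem') as f:
        for line in f:
            if line.strip().endswith('System RAM'):
                span = line.split(':')[0].strip()
                start, end = (int(x, 16) for x in span.split('-'))
                yield start // PAGE_SIZE, (end - start + 1) // PAGE_SIZE

def thp_alloc_histogram():
    # Map folio size in kB -> number of THP folios of that size allocated.
    # A real implementation would chunk the reads for large RAM ranges.
    sizes = Counter()
    with open('/proc/kpageflags', 'rb') as f:
        for first_pfn, nr in system_ram_pfns():
            f.seek(first_pfn * 8)
            data = f.read(nr * 8)
            flags = struct.unpack(f'<{len(data) // 8}Q', data)
            i, nr = 0, len(flags)
            while i < nr:
                fl = flags[i]
                if fl & (1 << KPF_THP) and fl & (1 << KPF_COMPOUND_HEAD):
                    n = 1
                    while i + n < nr and flags[i + n] & (1 << KPF_COMPOUND_TAIL):
                        n += 1
                    sizes[n * PAGE_SIZE // 1024] += 1
                    i += n
                    continue
                i += 1
    return sizes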
On 10.01.24 12:20, Ryan Roberts wrote: > On 10/01/2024 11:00, David Hildenbrand wrote: >> On 10.01.24 11:55, Ryan Roberts wrote: >>> On 10/01/2024 10:42, David Hildenbrand wrote: >>>> On 10.01.24 11:38, Ryan Roberts wrote: >>>>> On 10/01/2024 10:30, Barry Song wrote: >>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>> >>>>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>> >>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>>>> >>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> >>>>>>>>>>>> wrote: >>>>>>>>>>> ... >>>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>>> >>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm >>>>>>>>>>>>> running >>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some >>>>>>>>>>>>> numbers >>>>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>>>> >>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>>>>> and not >>>>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>>>> container). >>>>>>>>> If you want per-container, then you can probably just create the container >>>>>>>>> in a >>>>>>>>> cgroup? >>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>>>> >>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>>>> detailed stats. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> probably because this can be done without the modification of the kernel. >>>>>>>>> >>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add >>>>>>>>> stats >>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>>>>> really >>>>>>>>> know exectly how to account mTHPs yet >>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>>>>> adding >>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so >>>>>>>>> David >>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>>>>> cgroups >>>>>>>>> do live in sysfs). 
>>>>>>>>> >>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution >>>>>>>>> to the >>>>>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>>>>> what >>>>>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>>>>> >>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in >>>>>>>>>> which >>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>>>> >>>>>>>>>> Different vma types can have different anon_name. So I can use the >>>>>>>>>> detailed >>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>>>>> they have gotten. >>>>>>>>>> >>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>>>> values because this is still such an early feature. >>>>>>>>>>> >>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>>>>> location. >>>>>>>>> >>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>>>>> script >>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>>>>> think I >>>>>>>>> should be able to determine all the physical memory ranges from >>>>>>>>> /proc/iomem, >>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>>>>> able to >>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>>>>> same >>>>>>>>> stats, but it will apply globally. What do you think? >>>>>>> >>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants >>>>>>> the >>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>>>>> virtual mapping information so the best it can do is tell us "how many of >>>>>>> each >>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they >>>>>>> are >>>>>>> fully or partially mapped or what their alignment is (all necessary if we >>>>>>> want >>>>>>> to know if they are contpte-mapped). So I don't think this approach is >>>>>>> going to >>>>>>> be particularly useful. >>>>>>> >>>>>>> And this is also the big problem if we want to gather stats inside the >>>>>>> kernel; >>>>>>> if we want something equivalant to /proc/meminfo's >>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, >>>>>>> you >>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. >>>>>>> But >>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>>>>> easy >>>>>>> to increment, but you can do a partial unmap, so you would need to scan the >>>>>>> PTEs >>>>>>> to figure out if we are unmapping the first page of a previously >>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>>>>> determine "is this folio fully and contiguously mapped in at least one >>>>>>> process?". >>>>>> >>>>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>>>> 1. entire map >>>>>> 2. subpage's map >>>>>> 3. 
if 1 and 2 both exist, it is DoubleMapped. >>>>>> >>>>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>>>> we have an explicit >>>>>> cont_pte split which will decrease the entire map and increase the >>>>>> subpage's mapcount. >>>>>> >>>>>> but its downside is that we expose this info to mm-core. >>>>> >>>>> OK, but I think we have a slightly more generic situation going on with the >>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >>>>> PTE to determne if its fully mapped? That works for your case where you only >>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we >>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, >>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would >>>>> need to update that SW bit for every PTE one the full -> partial map >>>>> transition. >>>> >>>> Oh no. Let's not make everything more complicated for the purpose of some stats. >>>> >>> >>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if >>> we want to know what's fully mapped and what's not, then I don't see any way >>> other than by scanning the page tables and we might as well do that in user >>> space with this script. >>> >>> Although, I expect you will shortly make a proposal that is simple to implement >>> and prove me wrong ;-) >> >> Unlikely :) As you said, once you have multiple folio sizes, it stops really >> making sense. >> >> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can >> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's >> optimizations without the cont-pte bit and everything is fine. > > Yes, but for debug and optimization, its useful to know when THPs are > fully/partially mapped, when they are unaligned etc. Anyway, the script does > that for us, and I think we are tending towards agreement that there are > unlikely to be any cost benefits by moving it into the kernel. Agreed. And just adding: while one process might map a folio unaligned/partial/ ... another one might map it aligned/fully. So this per-process scanning is really required (because per process stats per folio are pretty much out of scope :) ). > >> >> We want simple stats that tell us which folio sizes are actually allocated. For >> everything else, just scan the process to figure out what exactly is going on. >> > > Certainly that's much easier to do. But is it valuable? It might be if we also > keep stats for the number of failures to allocate the various sizes - then we > can see what percentage of high order allocation attempts are successful, which > is probably useful. Agreed.
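Since the same folio can be mapped aligned and in full by one process but partially by another, the per-process scan really is the unit of work; aggregating over a container is then just a loop over the pids in its cgroup. A minimal sketch, assuming cgroup v2 mounted at /sys/fs/cgroup and reusing the earlier scan_range() example (the cgroup path is a placeholder):

def pids_in_cgroup(cgroup_path):
    # cgroup v2: every pid attached to the cgroup, one per line.
    with open(f'{cgroup_path}/cgroup.procs') as f:
        return [int(line) for line in f if line.strip()]

def vma_ranges(pid):
    # (start, end) of every VMA, parsed from /proc/<pid>/maps.
    with open(f'/proc/{pid}/maps') as f:
        for line in f:
            start, end = (int(x, 16) for x in line.split()[0].split('-'))
            yield start, end

# Illustrative per-cgroup aggregation:
# for pid in pids_in_cgroup('/sys/fs/cgroup/mygroup'):
#     for start, end in vma_ranges(pid):
#         pages = scan_range(pid, start, end - start)
#         ...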
On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 10/01/2024 11:00, David Hildenbrand wrote: > > On 10.01.24 11:55, Ryan Roberts wrote: > >> On 10/01/2024 10:42, David Hildenbrand wrote: > >>> On 10.01.24 11:38, Ryan Roberts wrote: > >>>> On 10/01/2024 10:30, Barry Song wrote: > >>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>> > >>>>>> On 10/01/2024 09:09, Barry Song wrote: > >>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>> > >>>>>>>> On 10/01/2024 08:02, Barry Song wrote: > >>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: > >>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> > >>>>>>>>>>> wrote: > >>>>>>>>>> ... > >>>>>>>>>>>> Hi Ryan, > >>>>>>>>>>>> > >>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP > >>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm > >>>>>>>>>>>> running > >>>>>>>>>>>> tests on a machine at a time for now, inside various containers and > >>>>>>>>>>>> such, and it would be nice if there were an easy way to get some > >>>>>>>>>>>> numbers > >>>>>>>>>>>> for the mTHPs across the whole machine. > >>>>>>>> > >>>>>>>> Just to confirm, you're expecting these "global" stats be truely global > >>>>>>>> and not > >>>>>>>> per-container? (asking because you exploicitly mentioned being in a > >>>>>>>> container). > >>>>>>>> If you want per-container, then you can probably just create the container > >>>>>>>> in a > >>>>>>>> cgroup? > >>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably > >>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from > >>>>>>>>>>>> just some quick runs: the global state would be convenient. > >>>>>>>> > >>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. > >>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? > >>>>>>>>>>> so may we add this statistics information in kernel just like > >>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful > >>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the > >>>>>>>>>> next level of investigation. So feels odd to start with the more > >>>>>>>>>> detailed stats. > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> probably because this can be done without the modification of the kernel. > >>>>>>>> > >>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add > >>>>>>>> stats > >>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't > >>>>>>>> really > >>>>>>>> know exectly how to account mTHPs yet > >>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up > >>>>>>>> adding > >>>>>>>> the wrong ABI and having to maintain it forever. There has also been some > >>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so > >>>>>>>> David > >>>>>>>> was suggesting coming up with a whole new scheme at some point (I know > >>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and > >>>>>>>> cgroups > >>>>>>>> do live in sysfs). 
> >>>>>>>> > >>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution > >>>>>>>> to the > >>>>>>>> "we need some stats" request and 2) provide a context in which to explore > >>>>>>>> what > >>>>>>>> the right stats are - this script can evolve without the ABI problem. > >>>>>>>> > >>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in > >>>>>>>>> which > >>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, > >>>>>>>>> eg. libc_malloc, java heaps etc. > >>>>>>>>> > >>>>>>>>> Different vma types can have different anon_name. So I can use the > >>>>>>>>> detailed > >>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many > >>>>>>>>> they have gotten. > >>>>>>>>> > >>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to > >>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And > >>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys > >>>>>>>>>> values because this is still such an early feature. > >>>>>>>>>> > >>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's > >>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" > >>>>>>>>>> location. > >>>>>>>> > >>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the > >>>>>>>> script > >>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I > >>>>>>>> think I > >>>>>>>> should be able to determine all the physical memory ranges from > >>>>>>>> /proc/iomem, > >>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be > >>>>>>>> able to > >>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the > >>>>>>>> same > >>>>>>>> stats, but it will apply globally. What do you think? > >>>>>> > >>>>>> Having now thought about this for a few mins (in the shower, if anyone wants > >>>>>> the > >>>>>> complete picture :) ), this won't quite work. This approach doesn't have the > >>>>>> virtual mapping information so the best it can do is tell us "how many of > >>>>>> each > >>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they > >>>>>> are > >>>>>> fully or partially mapped or what their alignment is (all necessary if we > >>>>>> want > >>>>>> to know if they are contpte-mapped). So I don't think this approach is > >>>>>> going to > >>>>>> be particularly useful. > >>>>>> > >>>>>> And this is also the big problem if we want to gather stats inside the > >>>>>> kernel; > >>>>>> if we want something equivalant to /proc/meminfo's > >>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the > >>>>>> allocation of the THP but also whether it is mapped. That's easy for > >>>>>> PMD-mappings, because there is only one entry to consider - when you set it, > >>>>>> you > >>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. > >>>>>> But > >>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its > >>>>>> easy > >>>>>> to increment, but you can do a partial unmap, so you would need to scan the > >>>>>> PTEs > >>>>>> to figure out if we are unmapping the first page of a previously > >>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to > >>>>>> determine "is this folio fully and contiguously mapped in at least one > >>>>>> process?". 
> >>>>> > >>>>> as OPPO's approach I shared to you before is maintaining two mapcount > >>>>> 1. entire map > >>>>> 2. subpage's map > >>>>> 3. if 1 and 2 both exist, it is DoubleMapped. > >>>>> > >>>>> This isn't a problem for us. and everytime if we do a partial unmap, > >>>>> we have an explicit > >>>>> cont_pte split which will decrease the entire map and increase the > >>>>> subpage's mapcount. > >>>>> > >>>>> but its downside is that we expose this info to mm-core. > >>>> > >>>> OK, but I think we have a slightly more generic situation going on with the > >>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the > >>>> PTE to determne if its fully mapped? That works for your case where you only > >>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we > >>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully > >>>> mapped because we can only use that bit if the THP is at least 64K and aligned, > >>>> and only on arm64. We would need a SW bit for this purpose, and the mm would > >>>> need to update that SW bit for every PTE one the full -> partial map > >>>> transition. > >>> > >>> Oh no. Let's not make everything more complicated for the purpose of some stats. > >>> > >> > >> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if > >> we want to know what's fully mapped and what's not, then I don't see any way > >> other than by scanning the page tables and we might as well do that in user > >> space with this script. > >> > >> Although, I expect you will shortly make a proposal that is simple to implement > >> and prove me wrong ;-) > > > > Unlikely :) As you said, once you have multiple folio sizes, it stops really > > making sense. > > > > Assume you have a 128 kiB pageache folio, and half of that is mapped. You can > > set cont-pte bits on that half and all is fine. Or AMD can benefit from it's > > optimizations without the cont-pte bit and everything is fine. > > Yes, but for debug and optimization, its useful to know when THPs are > fully/partially mapped, when they are unaligned etc. Anyway, the script does > that for us, and I think we are tending towards agreement that there are > unlikely to be any cost benefits by moving it into the kernel. frequent partial unmap can defeat all purpose for us to use large folios. just imagine a large folio can soon be splitted after it is formed. we lose the performance gain and might get regression instead. and this can be very frequent, for example, one userspace heap management is releasing memory page by page. In our real product deployment, we might not care about the second partial unmapped, we do care about the first partial unmapped as we can use this to know if split has ever happened on this large folios. an partial unmapped subpage can be unlikely re-mapped back. so i guess 1st unmap is probably enough, at least for my product. I mean we care about if partial unmap has ever happened on a large folio more than how they are exactly partially unmapped :-) > > > > > We want simple stats that tell us which folio sizes are actually allocated. For > > everything else, just scan the process to figure out what exactly is going on. > > > > Certainly that's much easier to do. But is it valuable? It might be if we also > keep stats for the number of failures to allocate the various sizes - then we > can see what percentage of high order allocation attempts are successful, which > is probably useful. > Thanks Barry
On 10/01/2024 11:38, Barry Song wrote: > On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 10/01/2024 11:00, David Hildenbrand wrote: >>> On 10.01.24 11:55, Ryan Roberts wrote: >>>> On 10/01/2024 10:42, David Hildenbrand wrote: >>>>> On 10.01.24 11:38, Ryan Roberts wrote: >>>>>> On 10/01/2024 10:30, Barry Song wrote: >>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>> >>>>>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>>> >>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> >>>>>>>>>>>>> wrote: >>>>>>>>>>>> ... >>>>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>>>> >>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm >>>>>>>>>>>>>> running >>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some >>>>>>>>>>>>>> numbers >>>>>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>>>>> >>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>>>>>> and not >>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>>>>> container). >>>>>>>>>> If you want per-container, then you can probably just create the container >>>>>>>>>> in a >>>>>>>>>> cgroup? >>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>>>>> >>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>>>>> detailed stats. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> probably because this can be done without the modification of the kernel. >>>>>>>>>> >>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add >>>>>>>>>> stats >>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>>>>>> really >>>>>>>>>> know exectly how to account mTHPs yet >>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>>>>>> adding >>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so >>>>>>>>>> David >>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>>>>>> cgroups >>>>>>>>>> do live in sysfs). 
>>>>>>>>>> >>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution >>>>>>>>>> to the >>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>>>>>> what >>>>>>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>>>>>> >>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in >>>>>>>>>>> which >>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>>>>> >>>>>>>>>>> Different vma types can have different anon_name. So I can use the >>>>>>>>>>> detailed >>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>>>>>> they have gotten. >>>>>>>>>>> >>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>>>>> values because this is still such an early feature. >>>>>>>>>>>> >>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>>>>>> location. >>>>>>>>>> >>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>>>>>> script >>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>>>>>> think I >>>>>>>>>> should be able to determine all the physical memory ranges from >>>>>>>>>> /proc/iomem, >>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>>>>>> able to >>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>>>>>> same >>>>>>>>>> stats, but it will apply globally. What do you think? >>>>>>>> >>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants >>>>>>>> the >>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>>>>>> virtual mapping information so the best it can do is tell us "how many of >>>>>>>> each >>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they >>>>>>>> are >>>>>>>> fully or partially mapped or what their alignment is (all necessary if we >>>>>>>> want >>>>>>>> to know if they are contpte-mapped). So I don't think this approach is >>>>>>>> going to >>>>>>>> be particularly useful. >>>>>>>> >>>>>>>> And this is also the big problem if we want to gather stats inside the >>>>>>>> kernel; >>>>>>>> if we want something equivalant to /proc/meminfo's >>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>>>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, >>>>>>>> you >>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. >>>>>>>> But >>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>>>>>> easy >>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the >>>>>>>> PTEs >>>>>>>> to figure out if we are unmapping the first page of a previously >>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>>>>>> determine "is this folio fully and contiguously mapped in at least one >>>>>>>> process?". 
>>>>>>> >>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>>>>> 1. entire map >>>>>>> 2. subpage's map >>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>>>>> >>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>>>>> we have an explicit >>>>>>> cont_pte split which will decrease the entire map and increase the >>>>>>> subpage's mapcount. >>>>>>> >>>>>>> but its downside is that we expose this info to mm-core. >>>>>> >>>>>> OK, but I think we have a slightly more generic situation going on with the >>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >>>>>> PTE to determne if its fully mapped? That works for your case where you only >>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we >>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, >>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would >>>>>> need to update that SW bit for every PTE one the full -> partial map >>>>>> transition. >>>>> >>>>> Oh no. Let's not make everything more complicated for the purpose of some stats. >>>>> >>>> >>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if >>>> we want to know what's fully mapped and what's not, then I don't see any way >>>> other than by scanning the page tables and we might as well do that in user >>>> space with this script. >>>> >>>> Although, I expect you will shortly make a proposal that is simple to implement >>>> and prove me wrong ;-) >>> >>> Unlikely :) As you said, once you have multiple folio sizes, it stops really >>> making sense. >>> >>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can >>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's >>> optimizations without the cont-pte bit and everything is fine. >> >> Yes, but for debug and optimization, its useful to know when THPs are >> fully/partially mapped, when they are unaligned etc. Anyway, the script does >> that for us, and I think we are tending towards agreement that there are >> unlikely to be any cost benefits by moving it into the kernel. > > frequent partial unmap can defeat all purpose for us to use large folios. > just imagine a large folio can soon be splitted after it is formed. we lose > the performance gain and might get regression instead. nit: just because a THP gets partially unmapped in a process doesn't mean it gets split into order-0 pages. If the folio still has all its pages mapped at least once then no further action is taken. If the page being unmapped was the last mapping of that page, then the THP is put on the deferred split queue, so that it can be split in future if needed. > > and this can be very frequent, for example, one userspace heap management > is releasing memory page by page. > > In our real product deployment, we might not care about the second partial > unmapped, we do care about the first partial unmapped as we can use this > to know if split has ever happened on this large folios. an partial unmapped > subpage can be unlikely re-mapped back. > > so i guess 1st unmap is probably enough, at least for my product. I mean we > care about if partial unmap has ever happened on a large folio more than how > they are exactly partially unmapped :-) I'm not sure what you are suggesting here? 
A global boolean that tells you if any folio in the system has ever been partially unmapped? That will almost certainly always be true, even for a very well tuned system. > >> >>> >>> We want simple stats that tell us which folio sizes are actually allocated. For >>> everything else, just scan the process to figure out what exactly is going on. >>> >> >> Certainly that's much easier to do. But is it valuable? It might be if we also >> keep stats for the number of failures to allocate the various sizes - then we >> can see what percentage of high order allocation attempts are successful, which >> is probably useful. >> > > Thanks > Barry
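For the "has a split ever happened" angle, the closest existing global signal is the set of thp_* event counters in /proc/vmstat (for example thp_split_page and thp_deferred_split_page); note these were designed around PMD-sized THPs, so whether and how they cover the smaller mTHP sizes depends on the kernel in question. Reading them is trivial:

def thp_vmstat():
    # Return all thp_* event counters from /proc/vmstat as a dict.
    with open('/proc/vmstat') as f:
        return {key: int(val)
                for key, val in (line.split() for line in f)
                if key.startswith('thp_')}

Polling this alongside the script's per-process output gives a crude before/after view of split activity without introducing any new ABI.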
On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 10/01/2024 11:38, Barry Song wrote: > > On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> On 10/01/2024 11:00, David Hildenbrand wrote: > >>> On 10.01.24 11:55, Ryan Roberts wrote: > >>>> On 10/01/2024 10:42, David Hildenbrand wrote: > >>>>> On 10.01.24 11:38, Ryan Roberts wrote: > >>>>>> On 10/01/2024 10:30, Barry Song wrote: > >>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>> > >>>>>>>> On 10/01/2024 09:09, Barry Song wrote: > >>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: > >>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: > >>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>> ... > >>>>>>>>>>>>>> Hi Ryan, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP > >>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm > >>>>>>>>>>>>>> running > >>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and > >>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some > >>>>>>>>>>>>>> numbers > >>>>>>>>>>>>>> for the mTHPs across the whole machine. > >>>>>>>>>> > >>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global > >>>>>>>>>> and not > >>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a > >>>>>>>>>> container). > >>>>>>>>>> If you want per-container, then you can probably just create the container > >>>>>>>>>> in a > >>>>>>>>>> cgroup? > >>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably > >>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from > >>>>>>>>>>>>>> just some quick runs: the global state would be convenient. > >>>>>>>>>> > >>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. > >>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? > >>>>>>>>>>>>> so may we add this statistics information in kernel just like > >>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful > >>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the > >>>>>>>>>>>> next level of investigation. So feels odd to start with the more > >>>>>>>>>>>> detailed stats. > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> probably because this can be done without the modification of the kernel. > >>>>>>>>>> > >>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add > >>>>>>>>>> stats > >>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't > >>>>>>>>>> really > >>>>>>>>>> know exectly how to account mTHPs yet > >>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up > >>>>>>>>>> adding > >>>>>>>>>> the wrong ABI and having to maintain it forever. 
There has also been some > >>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so > >>>>>>>>>> David > >>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know > >>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and > >>>>>>>>>> cgroups > >>>>>>>>>> do live in sysfs). > >>>>>>>>>> > >>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution > >>>>>>>>>> to the > >>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore > >>>>>>>>>> what > >>>>>>>>>> the right stats are - this script can evolve without the ABI problem. > >>>>>>>>>> > >>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in > >>>>>>>>>>> which > >>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, > >>>>>>>>>>> eg. libc_malloc, java heaps etc. > >>>>>>>>>>> > >>>>>>>>>>> Different vma types can have different anon_name. So I can use the > >>>>>>>>>>> detailed > >>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many > >>>>>>>>>>> they have gotten. > >>>>>>>>>>> > >>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to > >>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And > >>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys > >>>>>>>>>>>> values because this is still such an early feature. > >>>>>>>>>>>> > >>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's > >>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" > >>>>>>>>>>>> location. > >>>>>>>>>> > >>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the > >>>>>>>>>> script > >>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I > >>>>>>>>>> think I > >>>>>>>>>> should be able to determine all the physical memory ranges from > >>>>>>>>>> /proc/iomem, > >>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be > >>>>>>>>>> able to > >>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the > >>>>>>>>>> same > >>>>>>>>>> stats, but it will apply globally. What do you think? > >>>>>>>> > >>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants > >>>>>>>> the > >>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the > >>>>>>>> virtual mapping information so the best it can do is tell us "how many of > >>>>>>>> each > >>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they > >>>>>>>> are > >>>>>>>> fully or partially mapped or what their alignment is (all necessary if we > >>>>>>>> want > >>>>>>>> to know if they are contpte-mapped). So I don't think this approach is > >>>>>>>> going to > >>>>>>>> be particularly useful. > >>>>>>>> > >>>>>>>> And this is also the big problem if we want to gather stats inside the > >>>>>>>> kernel; > >>>>>>>> if we want something equivalant to /proc/meminfo's > >>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the > >>>>>>>> allocation of the THP but also whether it is mapped. That's easy for > >>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, > >>>>>>>> you > >>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. 
> >>>>>>>> But > >>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its > >>>>>>>> easy > >>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the > >>>>>>>> PTEs > >>>>>>>> to figure out if we are unmapping the first page of a previously > >>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to > >>>>>>>> determine "is this folio fully and contiguously mapped in at least one > >>>>>>>> process?". > >>>>>>> > >>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount > >>>>>>> 1. entire map > >>>>>>> 2. subpage's map > >>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. > >>>>>>> > >>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, > >>>>>>> we have an explicit > >>>>>>> cont_pte split which will decrease the entire map and increase the > >>>>>>> subpage's mapcount. > >>>>>>> > >>>>>>> but its downside is that we expose this info to mm-core. > >>>>>> > >>>>>> OK, but I think we have a slightly more generic situation going on with the > >>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the > >>>>>> PTE to determne if its fully mapped? That works for your case where you only > >>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we > >>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully > >>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, > >>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would > >>>>>> need to update that SW bit for every PTE one the full -> partial map > >>>>>> transition. > >>>>> > >>>>> Oh no. Let's not make everything more complicated for the purpose of some stats. > >>>>> > >>>> > >>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if > >>>> we want to know what's fully mapped and what's not, then I don't see any way > >>>> other than by scanning the page tables and we might as well do that in user > >>>> space with this script. > >>>> > >>>> Although, I expect you will shortly make a proposal that is simple to implement > >>>> and prove me wrong ;-) > >>> > >>> Unlikely :) As you said, once you have multiple folio sizes, it stops really > >>> making sense. > >>> > >>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can > >>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's > >>> optimizations without the cont-pte bit and everything is fine. > >> > >> Yes, but for debug and optimization, its useful to know when THPs are > >> fully/partially mapped, when they are unaligned etc. Anyway, the script does > >> that for us, and I think we are tending towards agreement that there are > >> unlikely to be any cost benefits by moving it into the kernel. > > > > frequent partial unmap can defeat all purpose for us to use large folios. > > just imagine a large folio can soon be splitted after it is formed. we lose > > the performance gain and might get regression instead. > > nit: just because a THP gets partially unmapped in a process doesn't mean it > gets split into order-0 pages. If the folio still has all its pages mapped at > least once then no further action is taken. If the page being unmapped was the > last mapping of that page, then the THP is put on the deferred split queue, so > that it can be split in future if needed. yes. 
That is exactly what the kernel is doing, but this is not so important for us to resolve performance issues. > > > > and this can be very frequent, for example, one userspace heap management > > is releasing memory page by page. > > > > In our real product deployment, we might not care about the second partial > > unmapped, we do care about the first partial unmapped as we can use this > > to know if split has ever happened on this large folios. an partial unmapped > > subpage can be unlikely re-mapped back. > > > > so i guess 1st unmap is probably enough, at least for my product. I mean we > > care about if partial unmap has ever happened on a large folio more than how > > they are exactly partially unmapped :-) > > I'm not sure what you are suggesting here? A global boolean that tells you if > any folio in the system has ever been partially unmapped? That will almost > certainly always be true, even for a very well tuned system. > > > > >> > >>> > >>> We want simple stats that tell us which folio sizes are actually allocated. For > >>> everything else, just scan the process to figure out what exactly is going on. > >>> > >> > >> Certainly that's much easier to do. But is it valuable? It might be if we also > >> keep stats for the number of failures to allocate the various sizes - then we > >> can see what percentage of high order allocation attempts are successful, which > >> is probably useful. My point is that we split large folios into two simple categories, 1. large folios which have never been partially unmapped 2. large folios which have ever been partially unmapped. we can totally ignore all details except "never" and "ever". it won't be accurate, but it has been useful enough at least for my product and according to our experiences deploying large folios on millions of real android phones. In real product deployment, we modified userspace a lot to decrease 2 as much as possible. so we did observe 2 very often in our past debugging. Thanks Barry
On 10.01.24 13:05, Barry Song wrote:
> My point is that we split large folios into two simple categories,
> 1. large folios which have never been partially unmapped
> 2. large folios which have ever been partially unmapped.
>

With the rmap batching stuff I am working on, you get the complete thing unmapped in most cases (as long as they are in one VMA) -- for example during munmap()/exit()/etc. Only when multiple VMAs were involved, or when someone COWs / PTE_DONTNEEDs / munmaps some subpages, you get a single page of a large folio.

That could be used to simply flag the folio in your case. But not sure if that has to be handled on the rmap level. Could be handled higher up in the callchain (esp. pte-dontneed).
On 10 Jan 2024, at 7:12, David Hildenbrand wrote:
>> My point is that we split large folios into two simple categories,
>> 1. large folios which have never been partially unmapped
>> 2. large folios which have ever been partially unmapped.
>>
>
> With the rmap batching stuff I am working on, you get the complete thing unmapped in most cases (as long as they are in one VMA) -- for example during munmap()/exit()/etc.

IIUC, there are two cases:

1. munmap() a range within a VMA: the rmap batching can avoid temporarily partially unmapped folios, since it does the range operations as a whole.

2. Barry has a case where userspace, e.g. the heap management, releases memory page by page, which rmap batching cannot help, unless either userspace batches memory releases or the kernel delays and aggregates these memory-releasing syscalls.

--
Best Regards,
Yan, Zi
On 10.01.24 16:19, Zi Yan wrote:
>> With the rmap batching stuff I am working on, you get the complete thing unmapped in most cases (as long as they are in one VMA) -- for example during munmap()/exit()/etc.
>
> IIUC, there are two cases:
>
> 1. munmap() a range within a VMA: the rmap batching can avoid temporarily partially unmapped folios, since it does the range operations as a whole.
>
> 2. Barry has a case where userspace, e.g. the heap management, releases
> memory page by page, which rmap batching cannot help, unless either userspace
> batches memory releases or the kernel delays and aggregates these
> memory-releasing syscalls.

Exactly. And for 2. you immediately know that someone is partially unmapping a large folio. At least temporarily. Compared to doing a MADV_DONTNEED that covers a whole large folio (e.g., THP).
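Related to the point about MADV_DONTNEED covering a whole large folio: the userspace-side mitigation Zi Yan alludes to (batching releases) amounts to only handing memory back to the kernel in folio-aligned, folio-sized chunks. The sketch below is a hypothetical illustration of that idea, not anything from this thread's patch; FOLIO_SIZE and the helper name are invented, 64K matches the arm64 contpte case discussed here, and mmap.madvise() needs Python >= 3.8 on Linux.

--8<--
import mmap

FOLIO_SIZE = 64 * 1024                # assumed mTHP/contpte folio size
REGION_SIZE = 8 * 1024 * 1024

region = mmap.mmap(-1, REGION_SIZE)   # anonymous private mapping

def release(start, length):
    # Give [start, start + length) back to the kernel, but only the part that
    # covers whole, naturally aligned folios; the ragged edges stay mapped so
    # that no large folio is partially unmapped.
    lo = (start + FOLIO_SIZE - 1) & ~(FOLIO_SIZE - 1)   # round start up
    hi = (start + length) & ~(FOLIO_SIZE - 1)           # round end down
    if hi > lo:
        region.madvise(mmap.MADV_DONTNEED, lo, hi - lo)

# A free of 160K starting 4K into the region only drops the one 64K folio it
# fully covers (at offset 64K); the partially covered folios at each end are
# left alone.
release(4 * 1024, 160 * 1024)
--8<--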
On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 10/01/2024 11:38, Barry Song wrote: > > On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> On 10/01/2024 11:00, David Hildenbrand wrote: > >>> On 10.01.24 11:55, Ryan Roberts wrote: > >>>> On 10/01/2024 10:42, David Hildenbrand wrote: > >>>>> On 10.01.24 11:38, Ryan Roberts wrote: > >>>>>> On 10/01/2024 10:30, Barry Song wrote: > >>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>> > >>>>>>>> On 10/01/2024 09:09, Barry Song wrote: > >>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: > >>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: > >>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>> ... > >>>>>>>>>>>>>> Hi Ryan, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP > >>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm > >>>>>>>>>>>>>> running > >>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and > >>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some > >>>>>>>>>>>>>> numbers > >>>>>>>>>>>>>> for the mTHPs across the whole machine. > >>>>>>>>>> > >>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global > >>>>>>>>>> and not > >>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a > >>>>>>>>>> container). > >>>>>>>>>> If you want per-container, then you can probably just create the container > >>>>>>>>>> in a > >>>>>>>>>> cgroup? > >>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably > >>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from > >>>>>>>>>>>>>> just some quick runs: the global state would be convenient. > >>>>>>>>>> > >>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. > >>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? > >>>>>>>>>>>>> so may we add this statistics information in kernel just like > >>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful > >>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the > >>>>>>>>>>>> next level of investigation. So feels odd to start with the more > >>>>>>>>>>>> detailed stats. > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> probably because this can be done without the modification of the kernel. > >>>>>>>>>> > >>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add > >>>>>>>>>> stats > >>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't > >>>>>>>>>> really > >>>>>>>>>> know exectly how to account mTHPs yet > >>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up > >>>>>>>>>> adding > >>>>>>>>>> the wrong ABI and having to maintain it forever. 
There has also been some > >>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so > >>>>>>>>>> David > >>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know > >>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and > >>>>>>>>>> cgroups > >>>>>>>>>> do live in sysfs). > >>>>>>>>>> > >>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution > >>>>>>>>>> to the > >>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore > >>>>>>>>>> what > >>>>>>>>>> the right stats are - this script can evolve without the ABI problem. > >>>>>>>>>> > >>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in > >>>>>>>>>>> which > >>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, > >>>>>>>>>>> eg. libc_malloc, java heaps etc. > >>>>>>>>>>> > >>>>>>>>>>> Different vma types can have different anon_name. So I can use the > >>>>>>>>>>> detailed > >>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many > >>>>>>>>>>> they have gotten. > >>>>>>>>>>> > >>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to > >>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And > >>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys > >>>>>>>>>>>> values because this is still such an early feature. > >>>>>>>>>>>> > >>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's > >>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" > >>>>>>>>>>>> location. > >>>>>>>>>> > >>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the > >>>>>>>>>> script > >>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I > >>>>>>>>>> think I > >>>>>>>>>> should be able to determine all the physical memory ranges from > >>>>>>>>>> /proc/iomem, > >>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be > >>>>>>>>>> able to > >>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the > >>>>>>>>>> same > >>>>>>>>>> stats, but it will apply globally. What do you think? > >>>>>>>> > >>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants > >>>>>>>> the > >>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the > >>>>>>>> virtual mapping information so the best it can do is tell us "how many of > >>>>>>>> each > >>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they > >>>>>>>> are > >>>>>>>> fully or partially mapped or what their alignment is (all necessary if we > >>>>>>>> want > >>>>>>>> to know if they are contpte-mapped). So I don't think this approach is > >>>>>>>> going to > >>>>>>>> be particularly useful. > >>>>>>>> > >>>>>>>> And this is also the big problem if we want to gather stats inside the > >>>>>>>> kernel; > >>>>>>>> if we want something equivalant to /proc/meminfo's > >>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the > >>>>>>>> allocation of the THP but also whether it is mapped. That's easy for > >>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, > >>>>>>>> you > >>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. 
> >>>>>>>> But > >>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its > >>>>>>>> easy > >>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the > >>>>>>>> PTEs > >>>>>>>> to figure out if we are unmapping the first page of a previously > >>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to > >>>>>>>> determine "is this folio fully and contiguously mapped in at least one > >>>>>>>> process?". > >>>>>>> > >>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount > >>>>>>> 1. entire map > >>>>>>> 2. subpage's map > >>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. > >>>>>>> > >>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, > >>>>>>> we have an explicit > >>>>>>> cont_pte split which will decrease the entire map and increase the > >>>>>>> subpage's mapcount. > >>>>>>> > >>>>>>> but its downside is that we expose this info to mm-core. > >>>>>> > >>>>>> OK, but I think we have a slightly more generic situation going on with the > >>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the > >>>>>> PTE to determne if its fully mapped? That works for your case where you only > >>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we > >>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully > >>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, > >>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would > >>>>>> need to update that SW bit for every PTE one the full -> partial map > >>>>>> transition. > >>>>> > >>>>> Oh no. Let's not make everything more complicated for the purpose of some stats. > >>>>> > >>>> > >>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if > >>>> we want to know what's fully mapped and what's not, then I don't see any way > >>>> other than by scanning the page tables and we might as well do that in user > >>>> space with this script. > >>>> > >>>> Although, I expect you will shortly make a proposal that is simple to implement > >>>> and prove me wrong ;-) > >>> > >>> Unlikely :) As you said, once you have multiple folio sizes, it stops really > >>> making sense. > >>> > >>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can > >>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's > >>> optimizations without the cont-pte bit and everything is fine. > >> > >> Yes, but for debug and optimization, its useful to know when THPs are > >> fully/partially mapped, when they are unaligned etc. Anyway, the script does > >> that for us, and I think we are tending towards agreement that there are > >> unlikely to be any cost benefits by moving it into the kernel. > > > > frequent partial unmap can defeat all purpose for us to use large folios. > > just imagine a large folio can soon be splitted after it is formed. we lose > > the performance gain and might get regression instead. > > nit: just because a THP gets partially unmapped in a process doesn't mean it > gets split into order-0 pages. If the folio still has all its pages mapped at > least once then no further action is taken. If the page being unmapped was the > last mapping of that page, then the THP is put on the deferred split queue, so > that it can be split in future if needed. 
> > > > and this can be very frequent, for example, one userspace heap management > > is releasing memory page by page. > > > > In our real product deployment, we might not care about the second partial > > unmapped, we do care about the first partial unmapped as we can use this > > to know if split has ever happened on this large folios. an partial unmapped > > subpage can be unlikely re-mapped back. > > > > so i guess 1st unmap is probably enough, at least for my product. I mean we > > care about if partial unmap has ever happened on a large folio more than how > > they are exactly partially unmapped :-) > > I'm not sure what you are suggesting here? A global boolean that tells you if > any folio in the system has ever been partially unmapped? That will almost > certainly always be true, even for a very well tuned system. not a global boolean but a per-folio boolean. in case userspace maps a region and has no userspace management, then we are fine as it is unlikely to have partial unmap/map things; in case userspace maps a region, but manages it by itself, such as heap things, we might result in lots of partial map/unmap, which can lead to 3 problems: 1. potential memory footprint increase, for example, while userspace releases some pages in a folio, we might still keep it as frequent splitting folio into basepages and releasing the unmapped subpage might be too expensive. 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown might happen. 3. other maintenance overhead such as splitting large folios etc. We'd like to know how serious partial map things are happening. so either we will disable mTHP in this kind of VMAs, or optimize userspace to do some alignment according to the size of large folios. in android phones, we detect lots of apps, and also found some apps might do things like 1. mprotect on some pages within a large folio 2. mlock on some pages within a large folio 3. madv_free on some pages within a large folio 4. madv_pageout on some pages within a large folio. it would be good if we have a per-folio boolean to know how serious userspace is breaking the large folios. for example, if more than 50% folios in a vma has this problem, we can find it out and take some action. Thanks Barry
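A minimal sketch of the idea above, written as plain userspace C rather than kernel code (struct folio_state, note_partial_unmap() and the 8-folio VMA are all invented names for illustration): keep a sticky per-folio boolean that is set on the first partial unmap, then apply the "more than 50% of folios in a VMA" rule Barry describes.

/*
 * Toy model, not kernel code: remember whether each large folio in a VMA
 * was ever partially unmapped, and decide whether the VMA is "breaking"
 * large folios badly enough to stop giving it mTHP.
 */
#include <stdbool.h>
#include <stdio.h>

#define FOLIOS_PER_VMA 8		/* model a VMA backed by 8 large folios */

struct folio_state {
	bool ever_partially_unmapped;	/* sticky, set on the first partial unmap */
};

struct vma_model {
	struct folio_state folio[FOLIOS_PER_VMA];
};

/* Called from the model's unmap path when a range covers only part of a folio. */
static void note_partial_unmap(struct folio_state *f)
{
	f->ever_partially_unmapped = true;
}

/* The 50% rule: should we stop giving this VMA mTHP? */
static bool vma_breaks_large_folios(const struct vma_model *vma)
{
	int broken = 0;

	for (int i = 0; i < FOLIOS_PER_VMA; i++)
		broken += vma->folio[i].ever_partially_unmapped;

	return broken * 2 > FOLIOS_PER_VMA;
}

int main(void)
{
	struct vma_model heap = { 0 };

	/* e.g. a userspace allocator MADV_FREEs pages inside 5 of the folios */
	for (int i = 0; i < 5; i++)
		note_partial_unmap(&heap.folio[i]);

	printf("disable mTHP for this VMA: %s\n",
	       vma_breaks_large_folios(&heap) ? "yes" : "no");	/* prints "yes" */
	return 0;
}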
On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 10/01/2024 11:00, David Hildenbrand wrote: > > On 10.01.24 11:55, Ryan Roberts wrote: > >> On 10/01/2024 10:42, David Hildenbrand wrote: > >>> On 10.01.24 11:38, Ryan Roberts wrote: > >>>> On 10/01/2024 10:30, Barry Song wrote: > >>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>> > >>>>>> On 10/01/2024 09:09, Barry Song wrote: > >>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>> > >>>>>>>> On 10/01/2024 08:02, Barry Song wrote: > >>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: > >>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> > >>>>>>>>>>> wrote: > >>>>>>>>>> ... > >>>>>>>>>>>> Hi Ryan, > >>>>>>>>>>>> > >>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP > >>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm > >>>>>>>>>>>> running > >>>>>>>>>>>> tests on a machine at a time for now, inside various containers and > >>>>>>>>>>>> such, and it would be nice if there were an easy way to get some > >>>>>>>>>>>> numbers > >>>>>>>>>>>> for the mTHPs across the whole machine. > >>>>>>>> > >>>>>>>> Just to confirm, you're expecting these "global" stats be truely global > >>>>>>>> and not > >>>>>>>> per-container? (asking because you exploicitly mentioned being in a > >>>>>>>> container). > >>>>>>>> If you want per-container, then you can probably just create the container > >>>>>>>> in a > >>>>>>>> cgroup? > >>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably > >>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from > >>>>>>>>>>>> just some quick runs: the global state would be convenient. > >>>>>>>> > >>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. > >>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? > >>>>>>>>>>> so may we add this statistics information in kernel just like > >>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful > >>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the > >>>>>>>>>> next level of investigation. So feels odd to start with the more > >>>>>>>>>> detailed stats. > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> probably because this can be done without the modification of the kernel. > >>>>>>>> > >>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add > >>>>>>>> stats > >>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't > >>>>>>>> really > >>>>>>>> know exectly how to account mTHPs yet > >>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up > >>>>>>>> adding > >>>>>>>> the wrong ABI and having to maintain it forever. There has also been some > >>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so > >>>>>>>> David > >>>>>>>> was suggesting coming up with a whole new scheme at some point (I know > >>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and > >>>>>>>> cgroups > >>>>>>>> do live in sysfs). 
> >>>>>>>> > >>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution > >>>>>>>> to the > >>>>>>>> "we need some stats" request and 2) provide a context in which to explore > >>>>>>>> what > >>>>>>>> the right stats are - this script can evolve without the ABI problem. > >>>>>>>> > >>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in > >>>>>>>>> which > >>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, > >>>>>>>>> eg. libc_malloc, java heaps etc. > >>>>>>>>> > >>>>>>>>> Different vma types can have different anon_name. So I can use the > >>>>>>>>> detailed > >>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many > >>>>>>>>> they have gotten. > >>>>>>>>> > >>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to > >>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And > >>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys > >>>>>>>>>> values because this is still such an early feature. > >>>>>>>>>> > >>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's > >>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" > >>>>>>>>>> location. > >>>>>>>> > >>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the > >>>>>>>> script > >>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I > >>>>>>>> think I > >>>>>>>> should be able to determine all the physical memory ranges from > >>>>>>>> /proc/iomem, > >>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be > >>>>>>>> able to > >>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the > >>>>>>>> same > >>>>>>>> stats, but it will apply globally. What do you think? > >>>>>> > >>>>>> Having now thought about this for a few mins (in the shower, if anyone wants > >>>>>> the > >>>>>> complete picture :) ), this won't quite work. This approach doesn't have the > >>>>>> virtual mapping information so the best it can do is tell us "how many of > >>>>>> each > >>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they > >>>>>> are > >>>>>> fully or partially mapped or what their alignment is (all necessary if we > >>>>>> want > >>>>>> to know if they are contpte-mapped). So I don't think this approach is > >>>>>> going to > >>>>>> be particularly useful. > >>>>>> > >>>>>> And this is also the big problem if we want to gather stats inside the > >>>>>> kernel; > >>>>>> if we want something equivalant to /proc/meminfo's > >>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the > >>>>>> allocation of the THP but also whether it is mapped. That's easy for > >>>>>> PMD-mappings, because there is only one entry to consider - when you set it, > >>>>>> you > >>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. > >>>>>> But > >>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its > >>>>>> easy > >>>>>> to increment, but you can do a partial unmap, so you would need to scan the > >>>>>> PTEs > >>>>>> to figure out if we are unmapping the first page of a previously > >>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to > >>>>>> determine "is this folio fully and contiguously mapped in at least one > >>>>>> process?". 
> >>>>> > >>>>> as OPPO's approach I shared to you before is maintaining two mapcount > >>>>> 1. entire map > >>>>> 2. subpage's map > >>>>> 3. if 1 and 2 both exist, it is DoubleMapped. > >>>>> > >>>>> This isn't a problem for us. and everytime if we do a partial unmap, > >>>>> we have an explicit > >>>>> cont_pte split which will decrease the entire map and increase the > >>>>> subpage's mapcount. > >>>>> > >>>>> but its downside is that we expose this info to mm-core. > >>>> > >>>> OK, but I think we have a slightly more generic situation going on with the > >>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the > >>>> PTE to determne if its fully mapped? That works for your case where you only > >>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we > >>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully > >>>> mapped because we can only use that bit if the THP is at least 64K and aligned, > >>>> and only on arm64. We would need a SW bit for this purpose, and the mm would > >>>> need to update that SW bit for every PTE one the full -> partial map > >>>> transition. > >>> > >>> Oh no. Let's not make everything more complicated for the purpose of some stats. > >>> > >> > >> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if > >> we want to know what's fully mapped and what's not, then I don't see any way > >> other than by scanning the page tables and we might as well do that in user > >> space with this script. > >> > >> Although, I expect you will shortly make a proposal that is simple to implement > >> and prove me wrong ;-) > > > > Unlikely :) As you said, once you have multiple folio sizes, it stops really > > making sense. > > > > Assume you have a 128 kiB pageache folio, and half of that is mapped. You can > > set cont-pte bits on that half and all is fine. Or AMD can benefit from it's > > optimizations without the cont-pte bit and everything is fine. > > Yes, but for debug and optimization, its useful to know when THPs are > fully/partially mapped, when they are unaligned etc. Anyway, the script does > that for us, and I think we are tending towards agreement that there are > unlikely to be any cost benefits by moving it into the kernel. > > > > > We want simple stats that tell us which folio sizes are actually allocated. For > > everything else, just scan the process to figure out what exactly is going on. > > > > Certainly that's much easier to do. But is it valuable? It might be if we also > keep stats for the number of failures to allocate the various sizes - then we > can see what percentage of high order allocation attempts are successful, which > is probably useful. +1 this is perfectly useful especially after memory is fragmented. In an embedded device, this can be absolutely true after the system runs for a while. That's why we have to set a large folios pool in products. otherwise, large folios only have a positive impact in the first hour. for your reference, i am posting some stats i have on my phone using OPPO's large folios approach, :/ # cat /proc/<oppo's large folios>/stat pool_size eg. <4GB> ---- we have a pool to help the success of large folios allocation pool_low ----- watermarks for the pool, we may begin to reclaim large folios when the pool has limited free memory pool_high thp_cow 1011488 ----- we are doing CoW for large folios thp_cow_fallback 584494 ------ we fail to allocate large folios for CoW, then fallback to normal page ... 
madv_free_unaligned 9159                 ------ userspace unaligned madv_free
madv_dont_need_unaligned 11358           ------ userspace unaligned madv_dontneed
....
thp_do_anon_pages 131289845
thp_do_anon_pages_fallback 88911215      ----- fallback to normal pages in do_anon_pages
thp_swpin_no_swapcache_entry             ----- swapin large folios
thp_swpin_no_swapcache_fallback_entry    ----- swapin large folios fallback...
thp_swpin_swapcache_entry                ----- swapin large folios/swapcache case
thp_swpin_swapcache_fallback_entry       ----- swapin large folios/swapcache fallback to normal pages
thp_file_entry 28998
thp_file_alloc_success 27334
thp_file_alloc_fail 1664
....
PartialMappedTHP: 29312 kB               ---- these are folios which have ever been not entirely mapped.
                                         ----- this is also what i am suggesting to have in recent several replies :-)
....

Thanks
Barry
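As a rough worked example of the success percentage Ryan asks about, and assuming thp_do_anon_pages counts allocation attempts while thp_do_anon_pages_fallback counts the attempts that fell back to base pages (an assumption about these out-of-tree counters, based only on the names above):

/* Rough success-rate calculation from the counters in the dump above. */
#include <stdio.h>

int main(void)
{
	unsigned long attempts = 131289845;	/* thp_do_anon_pages */
	unsigned long fallback = 88911215;	/* thp_do_anon_pages_fallback */

	printf("large-folio allocation success: %.1f%%\n",
	       100.0 * (attempts - fallback) / attempts);	/* ~32.3% */
	return 0;
}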
On 10/01/2024 22:14, Barry Song wrote: > On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 10/01/2024 11:38, Barry Song wrote: >>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> On 10/01/2024 11:00, David Hildenbrand wrote: >>>>> On 10.01.24 11:55, Ryan Roberts wrote: >>>>>> On 10/01/2024 10:42, David Hildenbrand wrote: >>>>>>> On 10.01.24 11:38, Ryan Roberts wrote: >>>>>>>> On 10/01/2024 10:30, Barry Song wrote: >>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>>> >>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm >>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some >>>>>>>>>>>>>>>> numbers >>>>>>>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>>>>>>> >>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>>>>>>>> and not >>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>>>>>>> container). >>>>>>>>>>>> If you want per-container, then you can probably just create the container >>>>>>>>>>>> in a >>>>>>>>>>>> cgroup? >>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>>>>>>> >>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>>>>>>> detailed stats. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> probably because this can be done without the modification of the kernel. >>>>>>>>>>>> >>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add >>>>>>>>>>>> stats >>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>>>>>>>> really >>>>>>>>>>>> know exectly how to account mTHPs yet >>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>>>>>>>> adding >>>>>>>>>>>> the wrong ABI and having to maintain it forever. 
There has also been some >>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so >>>>>>>>>>>> David >>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>>>>>>>> cgroups >>>>>>>>>>>> do live in sysfs). >>>>>>>>>>>> >>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution >>>>>>>>>>>> to the >>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>>>>>>>> what >>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>>>>>>>> >>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in >>>>>>>>>>>>> which >>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>>>>>>> >>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the >>>>>>>>>>>>> detailed >>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>>>>>>>> they have gotten. >>>>>>>>>>>>> >>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>>>>>>> values because this is still such an early feature. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>>>>>>>> location. >>>>>>>>>>>> >>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>>>>>>>> script >>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>>>>>>>> think I >>>>>>>>>>>> should be able to determine all the physical memory ranges from >>>>>>>>>>>> /proc/iomem, >>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>>>>>>>> able to >>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>>>>>>>> same >>>>>>>>>>>> stats, but it will apply globally. What do you think? >>>>>>>>>> >>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants >>>>>>>>>> the >>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of >>>>>>>>>> each >>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they >>>>>>>>>> are >>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we >>>>>>>>>> want >>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is >>>>>>>>>> going to >>>>>>>>>> be particularly useful. >>>>>>>>>> >>>>>>>>>> And this is also the big problem if we want to gather stats inside the >>>>>>>>>> kernel; >>>>>>>>>> if we want something equivalant to /proc/meminfo's >>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, >>>>>>>>>> you >>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. 
>>>>>>>>>> But >>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>>>>>>>> easy >>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the >>>>>>>>>> PTEs >>>>>>>>>> to figure out if we are unmapping the first page of a previously >>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one >>>>>>>>>> process?". >>>>>>>>> >>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>>>>>>> 1. entire map >>>>>>>>> 2. subpage's map >>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>>>>>>> >>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>>>>>>> we have an explicit >>>>>>>>> cont_pte split which will decrease the entire map and increase the >>>>>>>>> subpage's mapcount. >>>>>>>>> >>>>>>>>> but its downside is that we expose this info to mm-core. >>>>>>>> >>>>>>>> OK, but I think we have a slightly more generic situation going on with the >>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >>>>>>>> PTE to determne if its fully mapped? That works for your case where you only >>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we >>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, >>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would >>>>>>>> need to update that SW bit for every PTE one the full -> partial map >>>>>>>> transition. >>>>>>> >>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats. >>>>>>> >>>>>> >>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if >>>>>> we want to know what's fully mapped and what's not, then I don't see any way >>>>>> other than by scanning the page tables and we might as well do that in user >>>>>> space with this script. >>>>>> >>>>>> Although, I expect you will shortly make a proposal that is simple to implement >>>>>> and prove me wrong ;-) >>>>> >>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really >>>>> making sense. >>>>> >>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can >>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's >>>>> optimizations without the cont-pte bit and everything is fine. >>>> >>>> Yes, but for debug and optimization, its useful to know when THPs are >>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does >>>> that for us, and I think we are tending towards agreement that there are >>>> unlikely to be any cost benefits by moving it into the kernel. >>> >>> frequent partial unmap can defeat all purpose for us to use large folios. >>> just imagine a large folio can soon be splitted after it is formed. we lose >>> the performance gain and might get regression instead. >> >> nit: just because a THP gets partially unmapped in a process doesn't mean it >> gets split into order-0 pages. If the folio still has all its pages mapped at >> least once then no further action is taken. If the page being unmapped was the >> last mapping of that page, then the THP is put on the deferred split queue, so >> that it can be split in future if needed. 
>>> >>> and this can be very frequent, for example, one userspace heap management >>> is releasing memory page by page. >>> >>> In our real product deployment, we might not care about the second partial >>> unmapped, we do care about the first partial unmapped as we can use this >>> to know if split has ever happened on this large folios. an partial unmapped >>> subpage can be unlikely re-mapped back. >>> >>> so i guess 1st unmap is probably enough, at least for my product. I mean we >>> care about if partial unmap has ever happened on a large folio more than how >>> they are exactly partially unmapped :-) >> >> I'm not sure what you are suggesting here? A global boolean that tells you if >> any folio in the system has ever been partially unmapped? That will almost >> certainly always be true, even for a very well tuned system. > > not a global boolean but a per-folio boolean. in case userspace maps a region > and has no userspace management, then we are fine as it is unlikely to have > partial unmap/map things; in case userspace maps a region, but manages it > by itself, such as heap things, we might result in lots of partial map/unmap, > which can lead to 3 problems: > 1. potential memory footprint increase, for example, while userspace releases > some pages in a folio, we might still keep it as frequent splitting folio into > basepages and releasing the unmapped subpage might be too expensive. > 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown > might happen. > 3. other maintenance overhead such as splitting large folios etc. > > We'd like to know how serious partial map things are happening. so either > we will disable mTHP in this kind of VMAs, or optimize userspace to do > some alignment according to the size of large folios. > > in android phones, we detect lots of apps, and also found some apps might > do things like > 1. mprotect on some pages within a large folio > 2. mlock on some pages within a large folio > 3. madv_free on some pages within a large folio > 4. madv_pageout on some pages within a large folio. > > it would be good if we have a per-folio boolean to know how serious userspace > is breaking the large folios. for example, if more than 50% folios in a vma has > this problem, we can find it out and take some action. The high level value of these stats seems clear - I agree we need to be able to get these insights. I think the issues are more around the implementation though. I'm struggling to understand exactly how we could implement a lot of these things cheaply (either in the kernel or in user space). Let me try to work though what I think you are suggesting: - every THP is initially fully mapped - when an operation causes a partial unmap, mark the folio as having at least one partial mapping - on transition from "no partial mappings" to "at least one partial mapping" increment a "anon-partial-<size>kB" (one for each supported folio size) counter by the folio size - on transition from "at least one partial mapping" to "fully unampped everywhere" decrement the counter by the folio size I think the issue with this is that a folio that is fully mapped in a process that gets forked, then is partially unmapped in 1 process, will be accounted as partially mapped even after the process that partially unmapped it exits, even though that folio is now fully mapped in all processes that map it. Is that a problem, perhaps not? I'm not sure. > > Thanks > Barry
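To make the fork corner case concrete, here is a small self-contained C model of the transition counting sketched above (not kernel code; NR_PAGES, map_all(), unmap_range() and partial_kb are invented names). After the child partially unmaps the folio and then exits, the parent maps the folio fully again, yet the counter keeps reporting it as partially mapped:

/*
 * Userspace model of the proposed accounting: flip a per-folio "partially
 * mapped" state on the first partial unmap, and only flip it back when the
 * folio is fully unmapped everywhere.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES 16			/* model a 64kB folio with 4kB pages */

static int mapcount[NR_PAGES];		/* per-page mapcount across processes */
static bool partially_mapped;		/* state tracked by the proposal */
static long partial_kb;			/* "anon-partial-64kB" style counter */

static void map_all(void)		/* fully map the folio in one process */
{
	for (int i = 0; i < NR_PAGES; i++)
		mapcount[i]++;
}

static void unmap_range(int start, int n)	/* unmap n pages in one process */
{
	for (int i = start; i < start + n; i++)
		mapcount[i]--;

	if (n < NR_PAGES && !partially_mapped) {
		partially_mapped = true;		/* no partial -> partial */
		partial_kb += NR_PAGES * 4;
	}

	int total = 0;
	for (int i = 0; i < NR_PAGES; i++)
		total += mapcount[i];
	if (total == 0 && partially_mapped) {
		partially_mapped = false;		/* fully unmapped everywhere */
		partial_kb -= NR_PAGES * 4;
	}
}

int main(void)
{
	map_all();			/* parent maps the THP */
	map_all();			/* fork: child maps it too */
	unmap_range(0, 4);		/* child partially unmaps 16kB */
	unmap_range(4, 12);		/* child exits, unmapping the rest */

	/* Parent still maps the folio fully, yet it stays accounted as partial. */
	printf("partial_kb = %ld\n", partial_kb);	/* prints 64 */
	return 0;
}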
On 11.01.24 13:25, Ryan Roberts wrote: > On 10/01/2024 22:14, Barry Song wrote: >> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>> >>> On 10/01/2024 11:38, Barry Song wrote: >>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>> >>>>> On 10/01/2024 11:00, David Hildenbrand wrote: >>>>>> On 10.01.24 11:55, Ryan Roberts wrote: >>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote: >>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote: >>>>>>>>> On 10/01/2024 10:30, Barry Song wrote: >>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>>>> >>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm >>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some >>>>>>>>>>>>>>>>> numbers >>>>>>>>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>>>>>>>> >>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>>>>>>>>> and not >>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>>>>>>>> container). >>>>>>>>>>>>> If you want per-container, then you can probably just create the container >>>>>>>>>>>>> in a >>>>>>>>>>>>> cgroup? >>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>>>>>>>> detailed stats. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> probably because this can be done without the modification of the kernel. >>>>>>>>>>>>> >>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add >>>>>>>>>>>>> stats >>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>>>>>>>>> really >>>>>>>>>>>>> know exectly how to account mTHPs yet >>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>>>>>>>>> adding >>>>>>>>>>>>> the wrong ABI and having to maintain it forever. 
There has also been some >>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so >>>>>>>>>>>>> David >>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>>>>>>>>> cgroups >>>>>>>>>>>>> do live in sysfs). >>>>>>>>>>>>> >>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution >>>>>>>>>>>>> to the >>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>>>>>>>>> what >>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>>>>>>>>> >>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in >>>>>>>>>>>>>> which >>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>>>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the >>>>>>>>>>>>>> detailed >>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>>>>>>>>> they have gotten. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>>>>>>>> values because this is still such an early feature. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>>>>>>>>> location. >>>>>>>>>>>>> >>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>>>>>>>>> script >>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>>>>>>>>> think I >>>>>>>>>>>>> should be able to determine all the physical memory ranges from >>>>>>>>>>>>> /proc/iomem, >>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>>>>>>>>> able to >>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>>>>>>>>> same >>>>>>>>>>>>> stats, but it will apply globally. What do you think? >>>>>>>>>>> >>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants >>>>>>>>>>> the >>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of >>>>>>>>>>> each >>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they >>>>>>>>>>> are >>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we >>>>>>>>>>> want >>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is >>>>>>>>>>> going to >>>>>>>>>>> be particularly useful. >>>>>>>>>>> >>>>>>>>>>> And this is also the big problem if we want to gather stats inside the >>>>>>>>>>> kernel; >>>>>>>>>>> if we want something equivalant to /proc/meminfo's >>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, >>>>>>>>>>> you >>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. 
>>>>>>>>>>> But >>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>>>>>>>>> easy >>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the >>>>>>>>>>> PTEs >>>>>>>>>>> to figure out if we are unmapping the first page of a previously >>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one >>>>>>>>>>> process?". >>>>>>>>>> >>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>>>>>>>> 1. entire map >>>>>>>>>> 2. subpage's map >>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>>>>>>>> >>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>>>>>>>> we have an explicit >>>>>>>>>> cont_pte split which will decrease the entire map and increase the >>>>>>>>>> subpage's mapcount. >>>>>>>>>> >>>>>>>>>> but its downside is that we expose this info to mm-core. >>>>>>>>> >>>>>>>>> OK, but I think we have a slightly more generic situation going on with the >>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only >>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we >>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, >>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would >>>>>>>>> need to update that SW bit for every PTE one the full -> partial map >>>>>>>>> transition. >>>>>>>> >>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats. >>>>>>>> >>>>>>> >>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if >>>>>>> we want to know what's fully mapped and what's not, then I don't see any way >>>>>>> other than by scanning the page tables and we might as well do that in user >>>>>>> space with this script. >>>>>>> >>>>>>> Although, I expect you will shortly make a proposal that is simple to implement >>>>>>> and prove me wrong ;-) >>>>>> >>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really >>>>>> making sense. >>>>>> >>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can >>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's >>>>>> optimizations without the cont-pte bit and everything is fine. >>>>> >>>>> Yes, but for debug and optimization, its useful to know when THPs are >>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does >>>>> that for us, and I think we are tending towards agreement that there are >>>>> unlikely to be any cost benefits by moving it into the kernel. >>>> >>>> frequent partial unmap can defeat all purpose for us to use large folios. >>>> just imagine a large folio can soon be splitted after it is formed. we lose >>>> the performance gain and might get regression instead. >>> >>> nit: just because a THP gets partially unmapped in a process doesn't mean it >>> gets split into order-0 pages. If the folio still has all its pages mapped at >>> least once then no further action is taken. If the page being unmapped was the >>> last mapping of that page, then the THP is put on the deferred split queue, so >>> that it can be split in future if needed. 
>>>> >>>> and this can be very frequent, for example, one userspace heap management >>>> is releasing memory page by page. >>>> >>>> In our real product deployment, we might not care about the second partial >>>> unmapped, we do care about the first partial unmapped as we can use this >>>> to know if split has ever happened on this large folios. an partial unmapped >>>> subpage can be unlikely re-mapped back. >>>> >>>> so i guess 1st unmap is probably enough, at least for my product. I mean we >>>> care about if partial unmap has ever happened on a large folio more than how >>>> they are exactly partially unmapped :-) >>> >>> I'm not sure what you are suggesting here? A global boolean that tells you if >>> any folio in the system has ever been partially unmapped? That will almost >>> certainly always be true, even for a very well tuned system. >> >> not a global boolean but a per-folio boolean. in case userspace maps a region >> and has no userspace management, then we are fine as it is unlikely to have >> partial unmap/map things; in case userspace maps a region, but manages it >> by itself, such as heap things, we might result in lots of partial map/unmap, >> which can lead to 3 problems: >> 1. potential memory footprint increase, for example, while userspace releases >> some pages in a folio, we might still keep it as frequent splitting folio into >> basepages and releasing the unmapped subpage might be too expensive. >> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown >> might happen. >> 3. other maintenance overhead such as splitting large folios etc. >> >> We'd like to know how serious partial map things are happening. so either >> we will disable mTHP in this kind of VMAs, or optimize userspace to do >> some alignment according to the size of large folios. >> >> in android phones, we detect lots of apps, and also found some apps might >> do things like >> 1. mprotect on some pages within a large folio >> 2. mlock on some pages within a large folio >> 3. madv_free on some pages within a large folio >> 4. madv_pageout on some pages within a large folio. >> >> it would be good if we have a per-folio boolean to know how serious userspace >> is breaking the large folios. for example, if more than 50% folios in a vma has >> this problem, we can find it out and take some action. > > The high level value of these stats seems clear - I agree we need to be able to > get these insights. I think the issues are more around the implementation > though. I'm struggling to understand exactly how we could implement a lot of > these things cheaply (either in the kernel or in user space). > > Let me try to work though what I think you are suggesting: > > - every THP is initially fully mapped Not for pagecache folios. > - when an operation causes a partial unmap, mark the folio as having at least > one partial mapping > - on transition from "no partial mappings" to "at least one partial mapping" > increment a "anon-partial-<size>kB" (one for each supported folio size) > counter by the folio size > - on transition from "at least one partial mapping" to "fully unampped > everywhere" decrement the counter by the folio size > > I think the issue with this is that a folio that is fully mapped in a process > that gets forked, then is partially unmapped in 1 process, will be accounted as > partially mapped even after the process that partially unmapped it exits, even > though that folio is now fully mapped in all processes that map it. Is that a > problem, perhaps not? I'm not sure. 
What I can offer with the total mapcount I am working on (plus the entire/PMD mapcount, but let's put that aside):

1) total_mapcount is not a multiple of folio_nr_pages() -> at least one process currently maps the folio partially

2) total_mapcount is less than folio_nr_pages() -> surely partially mapped

I think that for most anon memory (note that most folios are always exclusive in our system, not CoW-shared), 2) would already be sufficient.
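Spelled out as plain C over a (total_mapcount, nr_pages) pair (just the arithmetic, not the kernel's folio API), the two checks would look roughly like:

#include <stdbool.h>
#include <stdio.h>

/* Check 1: not a multiple of nr_pages => at least one process maps the folio partially. */
static bool has_partial_mapper(long total_mapcount, long nr_pages)
{
	return total_mapcount % nr_pages != 0;
}

/* Check 2: fewer mappings than pages => the folio is surely partially mapped. */
static bool surely_partially_mapped(long total_mapcount, long nr_pages)
{
	return total_mapcount < nr_pages;
}

int main(void)
{
	/* 16-page folio mapped fully in one process and half-mapped in another */
	printf("%d %d\n", has_partial_mapper(24, 16),
	       surely_partially_mapped(24, 16));	/* prints "1 0" */

	/* 16-page folio with only 10 pages mapped, in a single process */
	printf("%d %d\n", has_partial_mapper(10, 16),
	       surely_partially_mapped(10, 16));	/* prints "1 1" */
	return 0;
}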
On Fri, Jan 12, 2024 at 2:18 AM David Hildenbrand <david@redhat.com> wrote: > > On 11.01.24 13:25, Ryan Roberts wrote: > > On 10/01/2024 22:14, Barry Song wrote: > >> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>> > >>> On 10/01/2024 11:38, Barry Song wrote: > >>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>> > >>>>> On 10/01/2024 11:00, David Hildenbrand wrote: > >>>>>> On 10.01.24 11:55, Ryan Roberts wrote: > >>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote: > >>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote: > >>>>>>>>> On 10/01/2024 10:30, Barry Song wrote: > >>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>>>>> > >>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote: > >>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: > >>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: > >>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> > >>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> ... > >>>>>>>>>>>>>>>>> Hi Ryan, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP > >>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm > >>>>>>>>>>>>>>>>> running > >>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and > >>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some > >>>>>>>>>>>>>>>>> numbers > >>>>>>>>>>>>>>>>> for the mTHPs across the whole machine. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global > >>>>>>>>>>>>> and not > >>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a > >>>>>>>>>>>>> container). > >>>>>>>>>>>>> If you want per-container, then you can probably just create the container > >>>>>>>>>>>>> in a > >>>>>>>>>>>>> cgroup? > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably > >>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from > >>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? > >>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like > >>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful > >>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the > >>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more > >>>>>>>>>>>>>>> detailed stats. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> probably because this can be done without the modification of the kernel. 
> >>>>>>>>>>>>> > >>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add > >>>>>>>>>>>>> stats > >>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't > >>>>>>>>>>>>> really > >>>>>>>>>>>>> know exectly how to account mTHPs yet > >>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up > >>>>>>>>>>>>> adding > >>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some > >>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so > >>>>>>>>>>>>> David > >>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know > >>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and > >>>>>>>>>>>>> cgroups > >>>>>>>>>>>>> do live in sysfs). > >>>>>>>>>>>>> > >>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution > >>>>>>>>>>>>> to the > >>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore > >>>>>>>>>>>>> what > >>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem. > >>>>>>>>>>>>> > >>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in > >>>>>>>>>>>>>> which > >>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, > >>>>>>>>>>>>>> eg. libc_malloc, java heaps etc. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the > >>>>>>>>>>>>>> detailed > >>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many > >>>>>>>>>>>>>> they have gotten. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to > >>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And > >>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys > >>>>>>>>>>>>>>> values because this is still such an early feature. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's > >>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" > >>>>>>>>>>>>>>> location. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the > >>>>>>>>>>>>> script > >>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I > >>>>>>>>>>>>> think I > >>>>>>>>>>>>> should be able to determine all the physical memory ranges from > >>>>>>>>>>>>> /proc/iomem, > >>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be > >>>>>>>>>>>>> able to > >>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the > >>>>>>>>>>>>> same > >>>>>>>>>>>>> stats, but it will apply globally. What do you think? > >>>>>>>>>>> > >>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants > >>>>>>>>>>> the > >>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the > >>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of > >>>>>>>>>>> each > >>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they > >>>>>>>>>>> are > >>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we > >>>>>>>>>>> want > >>>>>>>>>>> to know if they are contpte-mapped). 
So I don't think this approach is > >>>>>>>>>>> going to > >>>>>>>>>>> be particularly useful. > >>>>>>>>>>> > >>>>>>>>>>> And this is also the big problem if we want to gather stats inside the > >>>>>>>>>>> kernel; > >>>>>>>>>>> if we want something equivalant to /proc/meminfo's > >>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the > >>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for > >>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, > >>>>>>>>>>> you > >>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. > >>>>>>>>>>> But > >>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its > >>>>>>>>>>> easy > >>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the > >>>>>>>>>>> PTEs > >>>>>>>>>>> to figure out if we are unmapping the first page of a previously > >>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to > >>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one > >>>>>>>>>>> process?". > >>>>>>>>>> > >>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount > >>>>>>>>>> 1. entire map > >>>>>>>>>> 2. subpage's map > >>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. > >>>>>>>>>> > >>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, > >>>>>>>>>> we have an explicit > >>>>>>>>>> cont_pte split which will decrease the entire map and increase the > >>>>>>>>>> subpage's mapcount. > >>>>>>>>>> > >>>>>>>>>> but its downside is that we expose this info to mm-core. > >>>>>>>>> > >>>>>>>>> OK, but I think we have a slightly more generic situation going on with the > >>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the > >>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only > >>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we > >>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully > >>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, > >>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would > >>>>>>>>> need to update that SW bit for every PTE one the full -> partial map > >>>>>>>>> transition. > >>>>>>>> > >>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats. > >>>>>>>> > >>>>>>> > >>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if > >>>>>>> we want to know what's fully mapped and what's not, then I don't see any way > >>>>>>> other than by scanning the page tables and we might as well do that in user > >>>>>>> space with this script. > >>>>>>> > >>>>>>> Although, I expect you will shortly make a proposal that is simple to implement > >>>>>>> and prove me wrong ;-) > >>>>>> > >>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really > >>>>>> making sense. > >>>>>> > >>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can > >>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's > >>>>>> optimizations without the cont-pte bit and everything is fine. > >>>>> > >>>>> Yes, but for debug and optimization, its useful to know when THPs are > >>>>> fully/partially mapped, when they are unaligned etc. 
Anyway, the script does > >>>>> that for us, and I think we are tending towards agreement that there are > >>>>> unlikely to be any cost benefits by moving it into the kernel. > >>>> > >>>> frequent partial unmap can defeat all purpose for us to use large folios. > >>>> just imagine a large folio can soon be splitted after it is formed. we lose > >>>> the performance gain and might get regression instead. > >>> > >>> nit: just because a THP gets partially unmapped in a process doesn't mean it > >>> gets split into order-0 pages. If the folio still has all its pages mapped at > >>> least once then no further action is taken. If the page being unmapped was the > >>> last mapping of that page, then the THP is put on the deferred split queue, so > >>> that it can be split in future if needed. > >>>> > >>>> and this can be very frequent, for example, one userspace heap management > >>>> is releasing memory page by page. > >>>> > >>>> In our real product deployment, we might not care about the second partial > >>>> unmapped, we do care about the first partial unmapped as we can use this > >>>> to know if split has ever happened on this large folios. an partial unmapped > >>>> subpage can be unlikely re-mapped back. > >>>> > >>>> so i guess 1st unmap is probably enough, at least for my product. I mean we > >>>> care about if partial unmap has ever happened on a large folio more than how > >>>> they are exactly partially unmapped :-) > >>> > >>> I'm not sure what you are suggesting here? A global boolean that tells you if > >>> any folio in the system has ever been partially unmapped? That will almost > >>> certainly always be true, even for a very well tuned system. > >> > >> not a global boolean but a per-folio boolean. in case userspace maps a region > >> and has no userspace management, then we are fine as it is unlikely to have > >> partial unmap/map things; in case userspace maps a region, but manages it > >> by itself, such as heap things, we might result in lots of partial map/unmap, > >> which can lead to 3 problems: > >> 1. potential memory footprint increase, for example, while userspace releases > >> some pages in a folio, we might still keep it as frequent splitting folio into > >> basepages and releasing the unmapped subpage might be too expensive. > >> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown > >> might happen. > >> 3. other maintenance overhead such as splitting large folios etc. > >> > >> We'd like to know how serious partial map things are happening. so either > >> we will disable mTHP in this kind of VMAs, or optimize userspace to do > >> some alignment according to the size of large folios. > >> > >> in android phones, we detect lots of apps, and also found some apps might > >> do things like > >> 1. mprotect on some pages within a large folio > >> 2. mlock on some pages within a large folio > >> 3. madv_free on some pages within a large folio > >> 4. madv_pageout on some pages within a large folio. > >> > >> it would be good if we have a per-folio boolean to know how serious userspace > >> is breaking the large folios. for example, if more than 50% folios in a vma has > >> this problem, we can find it out and take some action. > > > > The high level value of these stats seems clear - I agree we need to be able to > > get these insights. I think the issues are more around the implementation > > though. I'm struggling to understand exactly how we could implement a lot of > > these things cheaply (either in the kernel or in user space). 
> > > > Let me try to work though what I think you are suggesting: > > > > - every THP is initially fully mapped > > Not for pagecache folios. > > > - when an operation causes a partial unmap, mark the folio as having at least > > one partial mapping > > - on transition from "no partial mappings" to "at least one partial mapping" > > increment a "anon-partial-<size>kB" (one for each supported folio size) > > counter by the folio size > > - on transition from "at least one partial mapping" to "fully unampped > > everywhere" decrement the counter by the folio size > > > > I think the issue with this is that a folio that is fully mapped in a process > > that gets forked, then is partially unmapped in 1 process, will be accounted as > > partially mapped even after the process that partially unmapped it exits, even > > though that folio is now fully mapped in all processes that map it. Is that a > > problem, perhaps not? I'm not sure. > > What I can offer with my total mapcount I am working on (+ entire/pmd > mapcount, but let's put that aside): > > 1) total_mapcount not multiples of folio_nr_page -> at least one process > currently maps the folio partially > > 2) total_mapcount is less than folio_nr_page -> surely partially mapped > > I think for most of anon memory (note that most folios are always > exclusive in our system, not cow-shared) 2) would already be sufficient. if we can improve Ryan's "mm: Batch-copy PTE ranges during fork()" to add nr_pages in copy_pte_range for rmap. copy_pte_range() { folio_try_dup_anon_rmap_ptes(...nr_pages....) } and at the same time, in zap_pte_range(), we remove the whole anon_rmap if the zapped-range covers the whole folio. Replace the for-loop for (i = 0; i < nr; i++, page++) { add_rmap(1); } for (i = 0; i < nr; i++, page++) { remove_rmap(1); } by always using add_rmap(nr_pages) and remove_rmap(nr_pages) if we are doing the entire mapping/unmapping. then we might be able to TestAndSetPartialMapped flag for this folio anywhile 1. someone is adding rmap with a number not equal nr_pages 2. someone is removing rmap with a number not equal nr_pages That means we are doing partial mapping or unmapping. and we increment partialmap_count by 1, let debugfs or somewhere present this count. while the folio is released to buddy and splitted into normal pages, we remove this flag and decrease partialmap_count by 1. > > -- > Cheers, > > David / dhildenb > Thanks Barry
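A standalone sketch of the batching idea above, reusing Barry's add_rmap()/remove_rmap() names but otherwise invented types (struct folio_model, partialmap_count); any rmap batch that does not cover the whole folio sets the sticky flag once, and freeing the folio back to the buddy clears it and drops the counter:

#include <stdbool.h>
#include <stdio.h>

struct folio_model {
	unsigned int nr_pages;
	bool partially_mapped;		/* the proposed sticky flag */
};

static unsigned long partialmap_count;	/* e.g. exposed via debugfs */

static void note_partial(struct folio_model *f)
{
	if (!f->partially_mapped) {	/* the TestAndSet in the proposal */
		f->partially_mapped = true;
		partialmap_count++;
	}
}

/* Batched rmap helpers: a batch smaller than the folio means partial map/unmap. */
static void add_rmap(struct folio_model *f, unsigned int nr)
{
	if (nr != f->nr_pages)
		note_partial(f);
}

static void remove_rmap(struct folio_model *f, unsigned int nr)
{
	if (nr != f->nr_pages)
		note_partial(f);
}

/* Called when the folio is freed back to the buddy (or split into base pages). */
static void free_folio(struct folio_model *f)
{
	if (f->partially_mapped) {
		f->partially_mapped = false;
		partialmap_count--;
	}
}

int main(void)
{
	struct folio_model f = { .nr_pages = 16 };

	add_rmap(&f, 16);	/* entire mapping, e.g. at fault or fork */
	remove_rmap(&f, 4);	/* partial unmap: counted once */
	remove_rmap(&f, 12);	/* remainder unmapped: no double count */
	printf("partialmap_count = %lu\n", partialmap_count);	/* prints 1 */

	free_folio(&f);
	printf("partialmap_count = %lu\n", partialmap_count);	/* prints 0 */
	return 0;
}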
On 11.01.24 21:21, Barry Song wrote: > On Fri, Jan 12, 2024 at 2:18 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 11.01.24 13:25, Ryan Roberts wrote: >>> On 10/01/2024 22:14, Barry Song wrote: >>>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>> >>>>> On 10/01/2024 11:38, Barry Song wrote: >>>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>> >>>>>>> On 10/01/2024 11:00, David Hildenbrand wrote: >>>>>>>> On 10.01.24 11:55, Ryan Roberts wrote: >>>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote: >>>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote: >>>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote: >>>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm >>>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some >>>>>>>>>>>>>>>>>>> numbers >>>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>>>>>>>>>>> and not >>>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>>>>>>>>>> container). >>>>>>>>>>>>>>> If you want per-container, then you can probably just create the container >>>>>>>>>>>>>>> in a >>>>>>>>>>>>>>> cgroup? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>>>>>>>>>> detailed stats. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> probably because this can be done without the modification of the kernel. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add >>>>>>>>>>>>>>> stats >>>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>>>>>>>>>>> really >>>>>>>>>>>>>>> know exectly how to account mTHPs yet >>>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>>>>>>>>>>> adding >>>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so >>>>>>>>>>>>>>> David >>>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>>>>>>>>>>> cgroups >>>>>>>>>>>>>>> do live in sysfs). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution >>>>>>>>>>>>>>> to the >>>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>>>>>>>>>>> what >>>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in >>>>>>>>>>>>>>>> which >>>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the >>>>>>>>>>>>>>>> detailed >>>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>>>>>>>>>>> they have gotten. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>>>>>>>>>> values because this is still such an early feature. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>>>>>>>>>>> location. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>>>>>>>>>>> script >>>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>>>>>>>>>>> think I >>>>>>>>>>>>>>> should be able to determine all the physical memory ranges from >>>>>>>>>>>>>>> /proc/iomem, >>>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>>>>>>>>>>> able to >>>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>>>>>>>>>>> same >>>>>>>>>>>>>>> stats, but it will apply globally. What do you think? >>>>>>>>>>>>> >>>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants >>>>>>>>>>>>> the >>>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of >>>>>>>>>>>>> each >>>>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they >>>>>>>>>>>>> are >>>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we >>>>>>>>>>>>> want >>>>>>>>>>>>> to know if they are contpte-mapped). 
So I don't think this approach is >>>>>>>>>>>>> going to >>>>>>>>>>>>> be particularly useful. >>>>>>>>>>>>> >>>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the >>>>>>>>>>>>> kernel; >>>>>>>>>>>>> if we want something equivalant to /proc/meminfo's >>>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, >>>>>>>>>>>>> you >>>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. >>>>>>>>>>>>> But >>>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>>>>>>>>>>> easy >>>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the >>>>>>>>>>>>> PTEs >>>>>>>>>>>>> to figure out if we are unmapping the first page of a previously >>>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one >>>>>>>>>>>>> process?". >>>>>>>>>>>> >>>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>>>>>>>>>> 1. entire map >>>>>>>>>>>> 2. subpage's map >>>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>>>>>>>>>> >>>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>>>>>>>>>> we have an explicit >>>>>>>>>>>> cont_pte split which will decrease the entire map and increase the >>>>>>>>>>>> subpage's mapcount. >>>>>>>>>>>> >>>>>>>>>>>> but its downside is that we expose this info to mm-core. >>>>>>>>>>> >>>>>>>>>>> OK, but I think we have a slightly more generic situation going on with the >>>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >>>>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only >>>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we >>>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >>>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, >>>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would >>>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map >>>>>>>>>>> transition. >>>>>>>>>> >>>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if >>>>>>>>> we want to know what's fully mapped and what's not, then I don't see any way >>>>>>>>> other than by scanning the page tables and we might as well do that in user >>>>>>>>> space with this script. >>>>>>>>> >>>>>>>>> Although, I expect you will shortly make a proposal that is simple to implement >>>>>>>>> and prove me wrong ;-) >>>>>>>> >>>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really >>>>>>>> making sense. >>>>>>>> >>>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can >>>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's >>>>>>>> optimizations without the cont-pte bit and everything is fine. >>>>>>> >>>>>>> Yes, but for debug and optimization, its useful to know when THPs are >>>>>>> fully/partially mapped, when they are unaligned etc. 
Anyway, the script does >>>>>>> that for us, and I think we are tending towards agreement that there are >>>>>>> unlikely to be any cost benefits by moving it into the kernel. >>>>>> >>>>>> frequent partial unmap can defeat all purpose for us to use large folios. >>>>>> just imagine a large folio can soon be splitted after it is formed. we lose >>>>>> the performance gain and might get regression instead. >>>>> >>>>> nit: just because a THP gets partially unmapped in a process doesn't mean it >>>>> gets split into order-0 pages. If the folio still has all its pages mapped at >>>>> least once then no further action is taken. If the page being unmapped was the >>>>> last mapping of that page, then the THP is put on the deferred split queue, so >>>>> that it can be split in future if needed. >>>>>> >>>>>> and this can be very frequent, for example, one userspace heap management >>>>>> is releasing memory page by page. >>>>>> >>>>>> In our real product deployment, we might not care about the second partial >>>>>> unmapped, we do care about the first partial unmapped as we can use this >>>>>> to know if split has ever happened on this large folios. an partial unmapped >>>>>> subpage can be unlikely re-mapped back. >>>>>> >>>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we >>>>>> care about if partial unmap has ever happened on a large folio more than how >>>>>> they are exactly partially unmapped :-) >>>>> >>>>> I'm not sure what you are suggesting here? A global boolean that tells you if >>>>> any folio in the system has ever been partially unmapped? That will almost >>>>> certainly always be true, even for a very well tuned system. >>>> >>>> not a global boolean but a per-folio boolean. in case userspace maps a region >>>> and has no userspace management, then we are fine as it is unlikely to have >>>> partial unmap/map things; in case userspace maps a region, but manages it >>>> by itself, such as heap things, we might result in lots of partial map/unmap, >>>> which can lead to 3 problems: >>>> 1. potential memory footprint increase, for example, while userspace releases >>>> some pages in a folio, we might still keep it as frequent splitting folio into >>>> basepages and releasing the unmapped subpage might be too expensive. >>>> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown >>>> might happen. >>>> 3. other maintenance overhead such as splitting large folios etc. >>>> >>>> We'd like to know how serious partial map things are happening. so either >>>> we will disable mTHP in this kind of VMAs, or optimize userspace to do >>>> some alignment according to the size of large folios. >>>> >>>> in android phones, we detect lots of apps, and also found some apps might >>>> do things like >>>> 1. mprotect on some pages within a large folio >>>> 2. mlock on some pages within a large folio >>>> 3. madv_free on some pages within a large folio >>>> 4. madv_pageout on some pages within a large folio. >>>> >>>> it would be good if we have a per-folio boolean to know how serious userspace >>>> is breaking the large folios. for example, if more than 50% folios in a vma has >>>> this problem, we can find it out and take some action. >>> >>> The high level value of these stats seems clear - I agree we need to be able to >>> get these insights. I think the issues are more around the implementation >>> though. I'm struggling to understand exactly how we could implement a lot of >>> these things cheaply (either in the kernel or in user space). 
>>> >>> Let me try to work though what I think you are suggesting: >>> >>> - every THP is initially fully mapped >> >> Not for pagecache folios. >> >>> - when an operation causes a partial unmap, mark the folio as having at least >>> one partial mapping >>> - on transition from "no partial mappings" to "at least one partial mapping" >>> increment a "anon-partial-<size>kB" (one for each supported folio size) >>> counter by the folio size >>> - on transition from "at least one partial mapping" to "fully unampped >>> everywhere" decrement the counter by the folio size >>> >>> I think the issue with this is that a folio that is fully mapped in a process >>> that gets forked, then is partially unmapped in 1 process, will be accounted as >>> partially mapped even after the process that partially unmapped it exits, even >>> though that folio is now fully mapped in all processes that map it. Is that a >>> problem, perhaps not? I'm not sure. >> >> What I can offer with my total mapcount I am working on (+ entire/pmd >> mapcount, but let's put that aside): >> >> 1) total_mapcount not multiples of folio_nr_page -> at least one process >> currently maps the folio partially >> >> 2) total_mapcount is less than folio_nr_page -> surely partially mapped >> >> I think for most of anon memory (note that most folios are always >> exclusive in our system, not cow-shared) 2) would already be sufficient. > > if we can improve Ryan's "mm: Batch-copy PTE ranges during fork()" to > add nr_pages in copy_pte_range for rmap. > copy_pte_range() > { > folio_try_dup_anon_rmap_ptes(...nr_pages....) > } > and at the same time, in zap_pte_range(), we remove the whole anon_rmap > if the zapped-range covers the whole folio. > > Replace the for-loop > for (i = 0; i < nr; i++, page++) { > add_rmap(1); > } > for (i = 0; i < nr; i++, page++) { > remove_rmap(1); > } > by always using add_rmap(nr_pages) and remove_rmap(nr_pages) if we > are doing the entire mapping/unmapping That's precisely what I have already running as protoypes :) And I promised Ryan to get to this soon, clean it up and sent it out. . > > then we might be able to TestAndSetPartialMapped flag for this folio anywhile > 1. someone is adding rmap with a number not equal nr_pages > 2. someone is removing rmap with a number not equal nr_pages > That means we are doing partial mapping or unmapping. > and we increment partialmap_count by 1, let debugfs or somewhere present > this count. Yes. The only "ugly" corner case if you have a split VMA. We're not batching rmap exceeding that.
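For reference, the two total-mapcount heuristics David offers above boil down to a pair of one-line tests. The sketch below writes them out against a modelled folio; the struct and field names are illustrative only (this is not the kernel's struct folio), and test 1 is deliberately one-directional: a multiple of nr_pages does not prove the folio is fully mapped everywhere.

/*
 * The two heuristics over a folio's total mapcount, as described above.
 * Modelled in plain C; "folio_model" is not the kernel's struct folio.
 */
#include <stdbool.h>
#include <stdio.h>

struct folio_model {
    unsigned long nr_pages;         /* folio_nr_pages() */
    unsigned long total_mapcount;   /* sum of all per-page mapcounts */
};

/*
 * 1) Not a multiple of nr_pages: at least one mapping of this folio must be
 *    partial. (The converse does not hold: two processes each mapping half
 *    also give a multiple of nr_pages.)
 */
static bool has_partial_mapping(const struct folio_model *f)
{
    return f->total_mapcount % f->nr_pages != 0;
}

/* 2) Less than nr_pages: the folio cannot be fully mapped anywhere. */
static bool surely_partially_mapped(const struct folio_model *f)
{
    return f->total_mapcount < f->nr_pages;
}

int main(void)
{
    /* 16-page folio fully mapped in two processes: neither test fires. */
    struct folio_model shared = { .nr_pages = 16, .total_mapcount = 32 };
    /* 16-page folio with only 5 pages mapped once: both tests fire. */
    struct folio_model torn = { .nr_pages = 16, .total_mapcount = 5 };

    printf("%d %d\n", has_partial_mapping(&shared), surely_partially_mapped(&shared));
    printf("%d %d\n", has_partial_mapping(&torn), surely_partially_mapped(&torn));
    return 0;
}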
On Fri, Jan 12, 2024 at 1:25 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 10/01/2024 22:14, Barry Song wrote: > > On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> On 10/01/2024 11:38, Barry Song wrote: > >>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>> > >>>> On 10/01/2024 11:00, David Hildenbrand wrote: > >>>>> On 10.01.24 11:55, Ryan Roberts wrote: > >>>>>> On 10/01/2024 10:42, David Hildenbrand wrote: > >>>>>>> On 10.01.24 11:38, Ryan Roberts wrote: > >>>>>>>> On 10/01/2024 10:30, Barry Song wrote: > >>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote: > >>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: > >>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: > >>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>> ... > >>>>>>>>>>>>>>>> Hi Ryan, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP > >>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm > >>>>>>>>>>>>>>>> running > >>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and > >>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some > >>>>>>>>>>>>>>>> numbers > >>>>>>>>>>>>>>>> for the mTHPs across the whole machine. > >>>>>>>>>>>> > >>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global > >>>>>>>>>>>> and not > >>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a > >>>>>>>>>>>> container). > >>>>>>>>>>>> If you want per-container, then you can probably just create the container > >>>>>>>>>>>> in a > >>>>>>>>>>>> cgroup? > >>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably > >>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from > >>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient. > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. > >>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? > >>>>>>>>>>>>>>> so may we add this statistics information in kernel just like > >>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful > >>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the > >>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more > >>>>>>>>>>>>>> detailed stats. > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> probably because this can be done without the modification of the kernel. 
> >>>>>>>>>>>> > >>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add > >>>>>>>>>>>> stats > >>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't > >>>>>>>>>>>> really > >>>>>>>>>>>> know exectly how to account mTHPs yet > >>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up > >>>>>>>>>>>> adding > >>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some > >>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so > >>>>>>>>>>>> David > >>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know > >>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and > >>>>>>>>>>>> cgroups > >>>>>>>>>>>> do live in sysfs). > >>>>>>>>>>>> > >>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution > >>>>>>>>>>>> to the > >>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore > >>>>>>>>>>>> what > >>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem. > >>>>>>>>>>>> > >>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in > >>>>>>>>>>>>> which > >>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, > >>>>>>>>>>>>> eg. libc_malloc, java heaps etc. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the > >>>>>>>>>>>>> detailed > >>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many > >>>>>>>>>>>>> they have gotten. > >>>>>>>>>>>>> > >>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to > >>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And > >>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys > >>>>>>>>>>>>>> values because this is still such an early feature. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's > >>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" > >>>>>>>>>>>>>> location. > >>>>>>>>>>>> > >>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the > >>>>>>>>>>>> script > >>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I > >>>>>>>>>>>> think I > >>>>>>>>>>>> should be able to determine all the physical memory ranges from > >>>>>>>>>>>> /proc/iomem, > >>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be > >>>>>>>>>>>> able to > >>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the > >>>>>>>>>>>> same > >>>>>>>>>>>> stats, but it will apply globally. What do you think? > >>>>>>>>>> > >>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants > >>>>>>>>>> the > >>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the > >>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of > >>>>>>>>>> each > >>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they > >>>>>>>>>> are > >>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we > >>>>>>>>>> want > >>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is > >>>>>>>>>> going to > >>>>>>>>>> be particularly useful. 
> >>>>>>>>>> > >>>>>>>>>> And this is also the big problem if we want to gather stats inside the > >>>>>>>>>> kernel; > >>>>>>>>>> if we want something equivalant to /proc/meminfo's > >>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the > >>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for > >>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, > >>>>>>>>>> you > >>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. > >>>>>>>>>> But > >>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its > >>>>>>>>>> easy > >>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the > >>>>>>>>>> PTEs > >>>>>>>>>> to figure out if we are unmapping the first page of a previously > >>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to > >>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one > >>>>>>>>>> process?". > >>>>>>>>> > >>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount > >>>>>>>>> 1. entire map > >>>>>>>>> 2. subpage's map > >>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. > >>>>>>>>> > >>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, > >>>>>>>>> we have an explicit > >>>>>>>>> cont_pte split which will decrease the entire map and increase the > >>>>>>>>> subpage's mapcount. > >>>>>>>>> > >>>>>>>>> but its downside is that we expose this info to mm-core. > >>>>>>>> > >>>>>>>> OK, but I think we have a slightly more generic situation going on with the > >>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the > >>>>>>>> PTE to determne if its fully mapped? That works for your case where you only > >>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we > >>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully > >>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, > >>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would > >>>>>>>> need to update that SW bit for every PTE one the full -> partial map > >>>>>>>> transition. > >>>>>>> > >>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats. > >>>>>>> > >>>>>> > >>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if > >>>>>> we want to know what's fully mapped and what's not, then I don't see any way > >>>>>> other than by scanning the page tables and we might as well do that in user > >>>>>> space with this script. > >>>>>> > >>>>>> Although, I expect you will shortly make a proposal that is simple to implement > >>>>>> and prove me wrong ;-) > >>>>> > >>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really > >>>>> making sense. > >>>>> > >>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can > >>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's > >>>>> optimizations without the cont-pte bit and everything is fine. > >>>> > >>>> Yes, but for debug and optimization, its useful to know when THPs are > >>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does > >>>> that for us, and I think we are tending towards agreement that there are > >>>> unlikely to be any cost benefits by moving it into the kernel. 
> >>> > >>> frequent partial unmap can defeat all purpose for us to use large folios. > >>> just imagine a large folio can soon be splitted after it is formed. we lose > >>> the performance gain and might get regression instead. > >> > >> nit: just because a THP gets partially unmapped in a process doesn't mean it > >> gets split into order-0 pages. If the folio still has all its pages mapped at > >> least once then no further action is taken. If the page being unmapped was the > >> last mapping of that page, then the THP is put on the deferred split queue, so > >> that it can be split in future if needed. > >>> > >>> and this can be very frequent, for example, one userspace heap management > >>> is releasing memory page by page. > >>> > >>> In our real product deployment, we might not care about the second partial > >>> unmapped, we do care about the first partial unmapped as we can use this > >>> to know if split has ever happened on this large folios. an partial unmapped > >>> subpage can be unlikely re-mapped back. > >>> > >>> so i guess 1st unmap is probably enough, at least for my product. I mean we > >>> care about if partial unmap has ever happened on a large folio more than how > >>> they are exactly partially unmapped :-) > >> > >> I'm not sure what you are suggesting here? A global boolean that tells you if > >> any folio in the system has ever been partially unmapped? That will almost > >> certainly always be true, even for a very well tuned system. > > > > not a global boolean but a per-folio boolean. in case userspace maps a region > > and has no userspace management, then we are fine as it is unlikely to have > > partial unmap/map things; in case userspace maps a region, but manages it > > by itself, such as heap things, we might result in lots of partial map/unmap, > > which can lead to 3 problems: > > 1. potential memory footprint increase, for example, while userspace releases > > some pages in a folio, we might still keep it as frequent splitting folio into > > basepages and releasing the unmapped subpage might be too expensive. > > 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown > > might happen. > > 3. other maintenance overhead such as splitting large folios etc. > > > > We'd like to know how serious partial map things are happening. so either > > we will disable mTHP in this kind of VMAs, or optimize userspace to do > > some alignment according to the size of large folios. > > > > in android phones, we detect lots of apps, and also found some apps might > > do things like > > 1. mprotect on some pages within a large folio > > 2. mlock on some pages within a large folio > > 3. madv_free on some pages within a large folio > > 4. madv_pageout on some pages within a large folio. > > > > it would be good if we have a per-folio boolean to know how serious userspace > > is breaking the large folios. for example, if more than 50% folios in a vma has > > this problem, we can find it out and take some action. > > The high level value of these stats seems clear - I agree we need to be able to > get these insights. I think the issues are more around the implementation > though. I'm struggling to understand exactly how we could implement a lot of > these things cheaply (either in the kernel or in user space). 
> > Let me try to work though what I think you are suggesting: > > - every THP is initially fully mapped > - when an operation causes a partial unmap, mark the folio as having at least > one partial mapping > - on transition from "no partial mappings" to "at least one partial mapping" > increment a "anon-partial-<size>kB" (one for each supported folio size) > counter by the folio size > - on transition from "at least one partial mapping" to "fully unampped > everywhere" decrement the counter by the folio size > > I think the issue with this is that a folio that is fully mapped in a process > that gets forked, then is partially unmapped in 1 process, will be accounted as > partially mapped even after the process that partially unmapped it exits, even > though that folio is now fully mapped in all processes that map it. Is that a > problem, perhaps not? I'm not sure. I don't think this is a problem, because what we really care about is whether some "bad" behaviour has ever happened in userspace. Even though the "bad" guy has exited or been killed, we still need the record to find out. Besides the global count, if we can reflect the partially mapped count per-vma, for example in smaps, that would help even more to locate the problematic userspace code. Thanks Barry
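The transition-based accounting Ryan restates above (increment a per-size counter when a folio first becomes partially mapped, decrement when it is finally unmapped everywhere) can be modelled in a few lines. The sketch below is only an illustration of that scheme with hypothetical names such as anon_partial_kb[]; as Barry notes, his use case may be better served by a latched record that is never decremented.

/*
 * Model of the transition-based per-size accounting restated above.
 * Plain C sketch; anon_partial_kb[] mirrors the hypothetical
 * "anon-partial-<size>kB" stat and is not an existing kernel counter.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER   9   /* track folio orders 0..9 (4K..2M, assuming 4K pages) */

static unsigned long anon_partial_kb[MAX_ORDER + 1];

struct folio_model {
    unsigned int order;     /* folio is 2^order base pages */
    bool partially_mapped;
};

static unsigned long folio_kb(const struct folio_model *f)
{
    return 4UL << f->order;     /* assuming 4K base pages */
}

/* transition: "no partial mappings" -> "at least one partial mapping" */
static void on_first_partial_unmap(struct folio_model *f)
{
    if (!f->partially_mapped) {
        f->partially_mapped = true;
        anon_partial_kb[f->order] += folio_kb(f);
    }
}

/* transition: folio finally fully unmapped everywhere (e.g. freed) */
static void on_fully_unmapped(struct folio_model *f)
{
    if (f->partially_mapped) {
        f->partially_mapped = false;
        anon_partial_kb[f->order] -= folio_kb(f);
    }
}

int main(void)
{
    struct folio_model f = { .order = 4 };  /* 64K folio */

    on_first_partial_unmap(&f);     /* counter += 64 */
    on_first_partial_unmap(&f);     /* no double counting */
    printf("anon-partial-64kB: %lu kB\n", anon_partial_kb[4]);
    on_fully_unmapped(&f);          /* counter -= 64 */
    printf("anon-partial-64kB: %lu kB\n", anon_partial_kb[4]);
    return 0;
}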
On Fri, Jan 12, 2024 at 9:28 AM David Hildenbrand <david@redhat.com> wrote: > > On 11.01.24 21:21, Barry Song wrote: > > On Fri, Jan 12, 2024 at 2:18 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 11.01.24 13:25, Ryan Roberts wrote: > >>> On 10/01/2024 22:14, Barry Song wrote: > >>>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>> > >>>>> On 10/01/2024 11:38, Barry Song wrote: > >>>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>> > >>>>>>> On 10/01/2024 11:00, David Hildenbrand wrote: > >>>>>>>> On 10.01.24 11:55, Ryan Roberts wrote: > >>>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote: > >>>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote: > >>>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote: > >>>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote: > >>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: > >>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: > >>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> > >>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> ... > >>>>>>>>>>>>>>>>>>> Hi Ryan, > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP > >>>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm > >>>>>>>>>>>>>>>>>>> running > >>>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and > >>>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some > >>>>>>>>>>>>>>>>>>> numbers > >>>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global > >>>>>>>>>>>>>>> and not > >>>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a > >>>>>>>>>>>>>>> container). > >>>>>>>>>>>>>>> If you want per-container, then you can probably just create the container > >>>>>>>>>>>>>>> in a > >>>>>>>>>>>>>>> cgroup? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably > >>>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from > >>>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? > >>>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like > >>>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful > >>>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the > >>>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more > >>>>>>>>>>>>>>>>> detailed stats. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> probably because this can be done without the modification of the kernel. 
> >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add > >>>>>>>>>>>>>>> stats > >>>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't > >>>>>>>>>>>>>>> really > >>>>>>>>>>>>>>> know exectly how to account mTHPs yet > >>>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up > >>>>>>>>>>>>>>> adding > >>>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some > >>>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so > >>>>>>>>>>>>>>> David > >>>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know > >>>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and > >>>>>>>>>>>>>>> cgroups > >>>>>>>>>>>>>>> do live in sysfs). > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution > >>>>>>>>>>>>>>> to the > >>>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore > >>>>>>>>>>>>>>> what > >>>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in > >>>>>>>>>>>>>>>> which > >>>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, > >>>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the > >>>>>>>>>>>>>>>> detailed > >>>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many > >>>>>>>>>>>>>>>> they have gotten. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to > >>>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And > >>>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys > >>>>>>>>>>>>>>>>> values because this is still such an early feature. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's > >>>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" > >>>>>>>>>>>>>>>>> location. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the > >>>>>>>>>>>>>>> script > >>>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I > >>>>>>>>>>>>>>> think I > >>>>>>>>>>>>>>> should be able to determine all the physical memory ranges from > >>>>>>>>>>>>>>> /proc/iomem, > >>>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be > >>>>>>>>>>>>>>> able to > >>>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the > >>>>>>>>>>>>>>> same > >>>>>>>>>>>>>>> stats, but it will apply globally. What do you think? > >>>>>>>>>>>>> > >>>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants > >>>>>>>>>>>>> the > >>>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the > >>>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of > >>>>>>>>>>>>> each > >>>>>>>>>>>>> size of THP are allocated?" 
- it doesn't tell us anything about whether they > >>>>>>>>>>>>> are > >>>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we > >>>>>>>>>>>>> want > >>>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is > >>>>>>>>>>>>> going to > >>>>>>>>>>>>> be particularly useful. > >>>>>>>>>>>>> > >>>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the > >>>>>>>>>>>>> kernel; > >>>>>>>>>>>>> if we want something equivalant to /proc/meminfo's > >>>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the > >>>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for > >>>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, > >>>>>>>>>>>>> you > >>>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. > >>>>>>>>>>>>> But > >>>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its > >>>>>>>>>>>>> easy > >>>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the > >>>>>>>>>>>>> PTEs > >>>>>>>>>>>>> to figure out if we are unmapping the first page of a previously > >>>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to > >>>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one > >>>>>>>>>>>>> process?". > >>>>>>>>>>>> > >>>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount > >>>>>>>>>>>> 1. entire map > >>>>>>>>>>>> 2. subpage's map > >>>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. > >>>>>>>>>>>> > >>>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, > >>>>>>>>>>>> we have an explicit > >>>>>>>>>>>> cont_pte split which will decrease the entire map and increase the > >>>>>>>>>>>> subpage's mapcount. > >>>>>>>>>>>> > >>>>>>>>>>>> but its downside is that we expose this info to mm-core. > >>>>>>>>>>> > >>>>>>>>>>> OK, but I think we have a slightly more generic situation going on with the > >>>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the > >>>>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only > >>>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we > >>>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully > >>>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, > >>>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would > >>>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map > >>>>>>>>>>> transition. > >>>>>>>>>> > >>>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats. > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if > >>>>>>>>> we want to know what's fully mapped and what's not, then I don't see any way > >>>>>>>>> other than by scanning the page tables and we might as well do that in user > >>>>>>>>> space with this script. > >>>>>>>>> > >>>>>>>>> Although, I expect you will shortly make a proposal that is simple to implement > >>>>>>>>> and prove me wrong ;-) > >>>>>>>> > >>>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really > >>>>>>>> making sense. 
> >>>>>>>> > >>>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can > >>>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's > >>>>>>>> optimizations without the cont-pte bit and everything is fine. > >>>>>>> > >>>>>>> Yes, but for debug and optimization, its useful to know when THPs are > >>>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does > >>>>>>> that for us, and I think we are tending towards agreement that there are > >>>>>>> unlikely to be any cost benefits by moving it into the kernel. > >>>>>> > >>>>>> frequent partial unmap can defeat all purpose for us to use large folios. > >>>>>> just imagine a large folio can soon be splitted after it is formed. we lose > >>>>>> the performance gain and might get regression instead. > >>>>> > >>>>> nit: just because a THP gets partially unmapped in a process doesn't mean it > >>>>> gets split into order-0 pages. If the folio still has all its pages mapped at > >>>>> least once then no further action is taken. If the page being unmapped was the > >>>>> last mapping of that page, then the THP is put on the deferred split queue, so > >>>>> that it can be split in future if needed. > >>>>>> > >>>>>> and this can be very frequent, for example, one userspace heap management > >>>>>> is releasing memory page by page. > >>>>>> > >>>>>> In our real product deployment, we might not care about the second partial > >>>>>> unmapped, we do care about the first partial unmapped as we can use this > >>>>>> to know if split has ever happened on this large folios. an partial unmapped > >>>>>> subpage can be unlikely re-mapped back. > >>>>>> > >>>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we > >>>>>> care about if partial unmap has ever happened on a large folio more than how > >>>>>> they are exactly partially unmapped :-) > >>>>> > >>>>> I'm not sure what you are suggesting here? A global boolean that tells you if > >>>>> any folio in the system has ever been partially unmapped? That will almost > >>>>> certainly always be true, even for a very well tuned system. > >>>> > >>>> not a global boolean but a per-folio boolean. in case userspace maps a region > >>>> and has no userspace management, then we are fine as it is unlikely to have > >>>> partial unmap/map things; in case userspace maps a region, but manages it > >>>> by itself, such as heap things, we might result in lots of partial map/unmap, > >>>> which can lead to 3 problems: > >>>> 1. potential memory footprint increase, for example, while userspace releases > >>>> some pages in a folio, we might still keep it as frequent splitting folio into > >>>> basepages and releasing the unmapped subpage might be too expensive. > >>>> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown > >>>> might happen. > >>>> 3. other maintenance overhead such as splitting large folios etc. > >>>> > >>>> We'd like to know how serious partial map things are happening. so either > >>>> we will disable mTHP in this kind of VMAs, or optimize userspace to do > >>>> some alignment according to the size of large folios. > >>>> > >>>> in android phones, we detect lots of apps, and also found some apps might > >>>> do things like > >>>> 1. mprotect on some pages within a large folio > >>>> 2. mlock on some pages within a large folio > >>>> 3. madv_free on some pages within a large folio > >>>> 4. madv_pageout on some pages within a large folio. 
> >>>> > >>>> it would be good if we have a per-folio boolean to know how serious userspace > >>>> is breaking the large folios. for example, if more than 50% folios in a vma has > >>>> this problem, we can find it out and take some action. > >>> > >>> The high level value of these stats seems clear - I agree we need to be able to > >>> get these insights. I think the issues are more around the implementation > >>> though. I'm struggling to understand exactly how we could implement a lot of > >>> these things cheaply (either in the kernel or in user space). > >>> > >>> Let me try to work though what I think you are suggesting: > >>> > >>> - every THP is initially fully mapped > >> > >> Not for pagecache folios. > >> > >>> - when an operation causes a partial unmap, mark the folio as having at least > >>> one partial mapping > >>> - on transition from "no partial mappings" to "at least one partial mapping" > >>> increment a "anon-partial-<size>kB" (one for each supported folio size) > >>> counter by the folio size > >>> - on transition from "at least one partial mapping" to "fully unampped > >>> everywhere" decrement the counter by the folio size > >>> > >>> I think the issue with this is that a folio that is fully mapped in a process > >>> that gets forked, then is partially unmapped in 1 process, will be accounted as > >>> partially mapped even after the process that partially unmapped it exits, even > >>> though that folio is now fully mapped in all processes that map it. Is that a > >>> problem, perhaps not? I'm not sure. > >> > >> What I can offer with my total mapcount I am working on (+ entire/pmd > >> mapcount, but let's put that aside): > >> > >> 1) total_mapcount not multiples of folio_nr_page -> at least one process > >> currently maps the folio partially > >> > >> 2) total_mapcount is less than folio_nr_page -> surely partially mapped > >> > >> I think for most of anon memory (note that most folios are always > >> exclusive in our system, not cow-shared) 2) would already be sufficient. > > > > if we can improve Ryan's "mm: Batch-copy PTE ranges during fork()" to > > add nr_pages in copy_pte_range for rmap. > > copy_pte_range() > > { > > folio_try_dup_anon_rmap_ptes(...nr_pages....) > > } > > and at the same time, in zap_pte_range(), we remove the whole anon_rmap > > if the zapped-range covers the whole folio. > > > > Replace the for-loop > > for (i = 0; i < nr; i++, page++) { > > add_rmap(1); > > } > > for (i = 0; i < nr; i++, page++) { > > remove_rmap(1); > > } > > by always using add_rmap(nr_pages) and remove_rmap(nr_pages) if we > > are doing the entire mapping/unmapping > > That's precisely what I have already running as protoypes :) And I > promised Ryan to get to this soon, clean it up and sent it out. Cool. Glad we'll have it soon. > > . > > > > then we might be able to TestAndSetPartialMapped flag for this folio anywhile > > 1. someone is adding rmap with a number not equal nr_pages > > 2. someone is removing rmap with a number not equal nr_pages > > That means we are doing partial mapping or unmapping. > > and we increment partialmap_count by 1, let debugfs or somewhere present > > this count. > > Yes. The only "ugly" corner case if you have a split VMA. We're not > batching rmap exceeding that. I am sorry I don't quite get what the problem is. Do you mean splitting vma is crossing a PTE-mapped mTHP or a PMD-mapped THP? 
For the latter, I see __split_huge_pmd_locked() does have some mapcount handling, but it is batched via folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR, vma, haddr, rmap_flags); for the former, I don't see that any special mapcount handling is needed. Am I missing something? > > -- > Cheers, > > David / dhildenb Thanks Barry
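One way to read the split-VMA corner case David refers to is that rmap batching stops at VMA boundaries, so a folio that is still fully mapped but straddles a VMA split is seen as two smaller batches, and a naive "batch size != folio_nr_pages" check would misfire on it. The arithmetic sketch below illustrates that reading with made-up numbers; it is an interpretation of the remark, not a statement of how the prototype actually behaves.

/*
 * One reading of the split-VMA corner case: rmap batches do not cross VMA
 * boundaries, so a fully mapped folio straddling a VMA split is handled as
 * two batches, neither equal to the folio size. Hypothetical numbers only.
 */
#include <stdio.h>

#define PAGE_SIZE   4096UL

int main(void)
{
    unsigned long folio_start = 0x7f0000040000UL;  /* 64K-aligned folio */
    unsigned long folio_pages = 16;                /* 64K folio */
    /* e.g. mprotect() split the VMA 16K into the folio */
    unsigned long vma_boundary = folio_start + 4 * PAGE_SIZE;

    unsigned long first_batch = (vma_boundary - folio_start) / PAGE_SIZE;
    unsigned long second_batch = folio_pages - first_batch;

    printf("batches: %lu + %lu pages (folio has %lu)\n",
           first_batch, second_batch, folio_pages);
    /* Neither 4 nor 12 equals 16, yet the folio is still fully mapped. */
    return 0;
}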
On 11/01/2024 13:18, David Hildenbrand wrote: > On 11.01.24 13:25, Ryan Roberts wrote: >> On 10/01/2024 22:14, Barry Song wrote: >>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> On 10/01/2024 11:38, Barry Song wrote: >>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>> >>>>>> On 10/01/2024 11:00, David Hildenbrand wrote: >>>>>>> On 10.01.24 11:55, Ryan Roberts wrote: >>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote: >>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote: >>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote: >>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard >>>>>>>>>>>>>>> <jhubbard@nvidia.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard >>>>>>>>>>>>>>>>> <jhubbard@nvidia.com> >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing >>>>>>>>>>>>>>>>>> of mTHP >>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm >>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various >>>>>>>>>>>>>>>>>> containers and >>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some >>>>>>>>>>>>>>>>>> numbers >>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely >>>>>>>>>>>>>> global >>>>>>>>>>>>>> and not >>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>>>>>>>>> container). >>>>>>>>>>>>>> If you want per-container, then you can probably just create the >>>>>>>>>>>>>> container >>>>>>>>>>>>>> in a >>>>>>>>>>>>>> cgroup? >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. >>>>>>>>>>>>>>>>>> Probably >>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial >>>>>>>>>>>>>>>>>> reactions from >>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are >>>>>>>>>>>>>>>> more useful >>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>>>>>>>>> detailed stats. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> probably because this can be done without the modification of the >>>>>>>>>>>>>>> kernel. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous >>>>>>>>>>>>>> attempts to add >>>>>>>>>>>>>> stats >>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we >>>>>>>>>>>>>> don't >>>>>>>>>>>>>> really >>>>>>>>>>>>>> know exectly how to account mTHPs yet >>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to >>>>>>>>>>>>>> end up >>>>>>>>>>>>>> adding >>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also >>>>>>>>>>>>>> been some >>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in >>>>>>>>>>>>>> sysfs, so >>>>>>>>>>>>>> David >>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I >>>>>>>>>>>>>> know >>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes >>>>>>>>>>>>>> and >>>>>>>>>>>>>> cgroups >>>>>>>>>>>>>> do live in sysfs). >>>>>>>>>>>>>> >>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term >>>>>>>>>>>>>> solution >>>>>>>>>>>>>> to the >>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to >>>>>>>>>>>>>> explore >>>>>>>>>>>>>> what >>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my >>>>>>>>>>>>>>> case in >>>>>>>>>>>>>>> which >>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma >>>>>>>>>>>>>>> types, >>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the >>>>>>>>>>>>>>> detailed >>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and >>>>>>>>>>>>>>> how many >>>>>>>>>>>>>>> they have gotten. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or >>>>>>>>>>>>>>>> similar)". And >>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>>>>>>>>> values because this is still such an early feature. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? >>>>>>>>>>>>>>>> That's >>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI >>>>>>>>>>>>>>>> stable" >>>>>>>>>>>>>>>> location. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode >>>>>>>>>>>>>> to the >>>>>>>>>>>>>> script >>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are >>>>>>>>>>>>>> provided). I >>>>>>>>>>>>>> think I >>>>>>>>>>>>>> should be able to determine all the physical memory ranges from >>>>>>>>>>>>>> /proc/iomem, >>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should >>>>>>>>>>>>>> then be >>>>>>>>>>>>>> able to >>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and >>>>>>>>>>>>>> provide the >>>>>>>>>>>>>> same >>>>>>>>>>>>>> stats, but it will apply globally. What do you think? >>>>>>>>>>>> >>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if >>>>>>>>>>>> anyone wants >>>>>>>>>>>> the >>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't >>>>>>>>>>>> have the >>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how >>>>>>>>>>>> many of >>>>>>>>>>>> each >>>>>>>>>>>> size of THP are allocated?" 
- it doesn't tell us anything about >>>>>>>>>>>> whether they >>>>>>>>>>>> are >>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary >>>>>>>>>>>> if we >>>>>>>>>>>> want >>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is >>>>>>>>>>>> going to >>>>>>>>>>>> be particularly useful. >>>>>>>>>>>> >>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the >>>>>>>>>>>> kernel; >>>>>>>>>>>> if we want something equivalant to /proc/meminfo's >>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not >>>>>>>>>>>> just the >>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you >>>>>>>>>>>> set it, >>>>>>>>>>>> you >>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you >>>>>>>>>>>> decrement. >>>>>>>>>>>> But >>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping >>>>>>>>>>>> so its >>>>>>>>>>>> easy >>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to >>>>>>>>>>>> scan the >>>>>>>>>>>> PTEs >>>>>>>>>>>> to figure out if we are unmapping the first page of a previously >>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap >>>>>>>>>>>> mechanism to >>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one >>>>>>>>>>>> process?". >>>>>>>>>>> >>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>>>>>>>>> 1. entire map >>>>>>>>>>> 2. subpage's map >>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>>>>>>>>> >>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>>>>>>>>> we have an explicit >>>>>>>>>>> cont_pte split which will decrease the entire map and increase the >>>>>>>>>>> subpage's mapcount. >>>>>>>>>>> >>>>>>>>>>> but its downside is that we expose this info to mm-core. >>>>>>>>>> >>>>>>>>>> OK, but I think we have a slightly more generic situation going on >>>>>>>>>> with the >>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit >>>>>>>>>> in the >>>>>>>>>> PTE to determne if its fully mapped? That works for your case where >>>>>>>>>> you only >>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the >>>>>>>>>> upstream, we >>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if >>>>>>>>>> its fully >>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and >>>>>>>>>> aligned, >>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm >>>>>>>>>> would >>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map >>>>>>>>>> transition. >>>>>>>>> >>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of >>>>>>>>> some stats. >>>>>>>>> >>>>>>>> >>>>>>>> Indeed, I was intending to argue *against* doing it this way. >>>>>>>> Fundamentally, if >>>>>>>> we want to know what's fully mapped and what's not, then I don't see any >>>>>>>> way >>>>>>>> other than by scanning the page tables and we might as well do that in user >>>>>>>> space with this script. >>>>>>>> >>>>>>>> Although, I expect you will shortly make a proposal that is simple to >>>>>>>> implement >>>>>>>> and prove me wrong ;-) >>>>>>> >>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really >>>>>>> making sense. 
>>>>>>> >>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You >>>>>>> can >>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's >>>>>>> optimizations without the cont-pte bit and everything is fine. >>>>>> >>>>>> Yes, but for debug and optimization, its useful to know when THPs are >>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does >>>>>> that for us, and I think we are tending towards agreement that there are >>>>>> unlikely to be any cost benefits by moving it into the kernel. >>>>> >>>>> frequent partial unmap can defeat all purpose for us to use large folios. >>>>> just imagine a large folio can soon be splitted after it is formed. we lose >>>>> the performance gain and might get regression instead. >>>> >>>> nit: just because a THP gets partially unmapped in a process doesn't mean it >>>> gets split into order-0 pages. If the folio still has all its pages mapped at >>>> least once then no further action is taken. If the page being unmapped was the >>>> last mapping of that page, then the THP is put on the deferred split queue, so >>>> that it can be split in future if needed. >>>>> >>>>> and this can be very frequent, for example, one userspace heap management >>>>> is releasing memory page by page. >>>>> >>>>> In our real product deployment, we might not care about the second partial >>>>> unmapped, we do care about the first partial unmapped as we can use this >>>>> to know if split has ever happened on this large folios. an partial unmapped >>>>> subpage can be unlikely re-mapped back. >>>>> >>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we >>>>> care about if partial unmap has ever happened on a large folio more than how >>>>> they are exactly partially unmapped :-) >>>> >>>> I'm not sure what you are suggesting here? A global boolean that tells you if >>>> any folio in the system has ever been partially unmapped? That will almost >>>> certainly always be true, even for a very well tuned system. >>> >>> not a global boolean but a per-folio boolean. in case userspace maps a region >>> and has no userspace management, then we are fine as it is unlikely to have >>> partial unmap/map things; in case userspace maps a region, but manages it >>> by itself, such as heap things, we might result in lots of partial map/unmap, >>> which can lead to 3 problems: >>> 1. potential memory footprint increase, for example, while userspace releases >>> some pages in a folio, we might still keep it as frequent splitting folio into >>> basepages and releasing the unmapped subpage might be too expensive. >>> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown >>> might happen. >>> 3. other maintenance overhead such as splitting large folios etc. >>> >>> We'd like to know how serious partial map things are happening. so either >>> we will disable mTHP in this kind of VMAs, or optimize userspace to do >>> some alignment according to the size of large folios. >>> >>> in android phones, we detect lots of apps, and also found some apps might >>> do things like >>> 1. mprotect on some pages within a large folio >>> 2. mlock on some pages within a large folio >>> 3. madv_free on some pages within a large folio >>> 4. madv_pageout on some pages within a large folio. >>> >>> it would be good if we have a per-folio boolean to know how serious userspace >>> is breaking the large folios. 
for example, if more than 50% folios in a vma has >>> this problem, we can find it out and take some action. >> >> The high level value of these stats seems clear - I agree we need to be able to >> get these insights. I think the issues are more around the implementation >> though. I'm struggling to understand exactly how we could implement a lot of >> these things cheaply (either in the kernel or in user space). >> >> Let me try to work though what I think you are suggesting: >> >> - every THP is initially fully mapped > > Not for pagecache folios. > >> - when an operation causes a partial unmap, mark the folio as having at least >> one partial mapping >> - on transition from "no partial mappings" to "at least one partial mapping" >> increment a "anon-partial-<size>kB" (one for each supported folio size) >> counter by the folio size >> - on transition from "at least one partial mapping" to "fully unampped >> everywhere" decrement the counter by the folio size >> >> I think the issue with this is that a folio that is fully mapped in a process >> that gets forked, then is partially unmapped in 1 process, will be accounted as >> partially mapped even after the process that partially unmapped it exits, even >> though that folio is now fully mapped in all processes that map it. Is that a >> problem, perhaps not? I'm not sure. > > What I can offer with my total mapcount I am working on (+ entire/pmd mapcount, > but let's put that aside): Is "total mapcount" bound up as part of your "precise shared vs exclusive" work or is it separate? If separate, do you have any ballpark feel for how likely it is to land and if so, when? > > 1) total_mapcount not multiples of folio_nr_page -> at least one process > currently maps the folio partially > > 2) total_mapcount is less than folio_nr_page -> surely partially mapped > > I think for most of anon memory (note that most folios are always exclusive in > our system, not cow-shared) 2) would already be sufficient. >
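[To make the two checks above concrete, here is a minimal userspace sketch in Python of the conditions David describes. total_mapcount, nr_pages and the example numbers are illustrative stand-ins, not kernel API.]

```python
# Minimal sketch of the two checks described above, assuming a per-folio
# total mapcount were available. Names and numbers are illustrative only.

def maybe_partially_mapped(total_mapcount: int, nr_pages: int) -> bool:
    # 1) Not a multiple of nr_pages: at least one process currently maps
    #    the folio partially.
    return total_mapcount % nr_pages != 0

def surely_partially_mapped(total_mapcount: int, nr_pages: int) -> bool:
    # 2) Fewer mappings than pages: the folio cannot be fully mapped anywhere.
    return total_mapcount < nr_pages

# Example: a 64K folio (16 x 4K pages), fully mapped in a parent and with
# 4 of its pages still mapped in a forked child, has total_mapcount == 20.
assert maybe_partially_mapped(20, 16)       # 20 % 16 != 0: partial somewhere
assert not surely_partially_mapped(20, 16)  # check 2 proves nothing here; it
                                            # is still fully mapped in the parent
```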
On 11/01/2024 20:45, Barry Song wrote: > On Fri, Jan 12, 2024 at 1:25 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 10/01/2024 22:14, Barry Song wrote: >>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> On 10/01/2024 11:38, Barry Song wrote: >>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>> >>>>>> On 10/01/2024 11:00, David Hildenbrand wrote: >>>>>>> On 10.01.24 11:55, Ryan Roberts wrote: >>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote: >>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote: >>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote: >>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm >>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some >>>>>>>>>>>>>>>>>> numbers >>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>>>>>>>>>> and not >>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>>>>>>>>> container). >>>>>>>>>>>>>> If you want per-container, then you can probably just create the container >>>>>>>>>>>>>> in a >>>>>>>>>>>>>> cgroup? >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>>>>>>>>> detailed stats. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> probably because this can be done without the modification of the kernel. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add >>>>>>>>>>>>>> stats >>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>>>>>>>>>> really >>>>>>>>>>>>>> know exectly how to account mTHPs yet >>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>>>>>>>>>> adding >>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so >>>>>>>>>>>>>> David >>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>>>>>>>>>> cgroups >>>>>>>>>>>>>> do live in sysfs). >>>>>>>>>>>>>> >>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution >>>>>>>>>>>>>> to the >>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>>>>>>>>>> what >>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in >>>>>>>>>>>>>>> which >>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the >>>>>>>>>>>>>>> detailed >>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>>>>>>>>>> they have gotten. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>>>>>>>>> values because this is still such an early feature. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>>>>>>>>>> location. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>>>>>>>>>> script >>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>>>>>>>>>> think I >>>>>>>>>>>>>> should be able to determine all the physical memory ranges from >>>>>>>>>>>>>> /proc/iomem, >>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>>>>>>>>>> able to >>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>>>>>>>>>> same >>>>>>>>>>>>>> stats, but it will apply globally. What do you think? >>>>>>>>>>>> >>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants >>>>>>>>>>>> the >>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of >>>>>>>>>>>> each >>>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they >>>>>>>>>>>> are >>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we >>>>>>>>>>>> want >>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is >>>>>>>>>>>> going to >>>>>>>>>>>> be particularly useful. 
>>>>>>>>>>>> >>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the >>>>>>>>>>>> kernel; >>>>>>>>>>>> if we want something equivalant to /proc/meminfo's >>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, >>>>>>>>>>>> you >>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. >>>>>>>>>>>> But >>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>>>>>>>>>> easy >>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the >>>>>>>>>>>> PTEs >>>>>>>>>>>> to figure out if we are unmapping the first page of a previously >>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one >>>>>>>>>>>> process?". >>>>>>>>>>> >>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>>>>>>>>> 1. entire map >>>>>>>>>>> 2. subpage's map >>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>>>>>>>>> >>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>>>>>>>>> we have an explicit >>>>>>>>>>> cont_pte split which will decrease the entire map and increase the >>>>>>>>>>> subpage's mapcount. >>>>>>>>>>> >>>>>>>>>>> but its downside is that we expose this info to mm-core. >>>>>>>>>> >>>>>>>>>> OK, but I think we have a slightly more generic situation going on with the >>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >>>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only >>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we >>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, >>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would >>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map >>>>>>>>>> transition. >>>>>>>>> >>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats. >>>>>>>>> >>>>>>>> >>>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if >>>>>>>> we want to know what's fully mapped and what's not, then I don't see any way >>>>>>>> other than by scanning the page tables and we might as well do that in user >>>>>>>> space with this script. >>>>>>>> >>>>>>>> Although, I expect you will shortly make a proposal that is simple to implement >>>>>>>> and prove me wrong ;-) >>>>>>> >>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really >>>>>>> making sense. >>>>>>> >>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can >>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's >>>>>>> optimizations without the cont-pte bit and everything is fine. >>>>>> >>>>>> Yes, but for debug and optimization, its useful to know when THPs are >>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does >>>>>> that for us, and I think we are tending towards agreement that there are >>>>>> unlikely to be any cost benefits by moving it into the kernel. 
>>>>> >>>>> frequent partial unmap can defeat all purpose for us to use large folios. >>>>> just imagine a large folio can soon be splitted after it is formed. we lose >>>>> the performance gain and might get regression instead. >>>> >>>> nit: just because a THP gets partially unmapped in a process doesn't mean it >>>> gets split into order-0 pages. If the folio still has all its pages mapped at >>>> least once then no further action is taken. If the page being unmapped was the >>>> last mapping of that page, then the THP is put on the deferred split queue, so >>>> that it can be split in future if needed. >>>>> >>>>> and this can be very frequent, for example, one userspace heap management >>>>> is releasing memory page by page. >>>>> >>>>> In our real product deployment, we might not care about the second partial >>>>> unmapped, we do care about the first partial unmapped as we can use this >>>>> to know if split has ever happened on this large folios. an partial unmapped >>>>> subpage can be unlikely re-mapped back. >>>>> >>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we >>>>> care about if partial unmap has ever happened on a large folio more than how >>>>> they are exactly partially unmapped :-) >>>> >>>> I'm not sure what you are suggesting here? A global boolean that tells you if >>>> any folio in the system has ever been partially unmapped? That will almost >>>> certainly always be true, even for a very well tuned system. >>> >>> not a global boolean but a per-folio boolean. in case userspace maps a region >>> and has no userspace management, then we are fine as it is unlikely to have >>> partial unmap/map things; in case userspace maps a region, but manages it >>> by itself, such as heap things, we might result in lots of partial map/unmap, >>> which can lead to 3 problems: >>> 1. potential memory footprint increase, for example, while userspace releases >>> some pages in a folio, we might still keep it as frequent splitting folio into >>> basepages and releasing the unmapped subpage might be too expensive. >>> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown >>> might happen. >>> 3. other maintenance overhead such as splitting large folios etc. >>> >>> We'd like to know how serious partial map things are happening. so either >>> we will disable mTHP in this kind of VMAs, or optimize userspace to do >>> some alignment according to the size of large folios. >>> >>> in android phones, we detect lots of apps, and also found some apps might >>> do things like >>> 1. mprotect on some pages within a large folio >>> 2. mlock on some pages within a large folio >>> 3. madv_free on some pages within a large folio >>> 4. madv_pageout on some pages within a large folio. >>> >>> it would be good if we have a per-folio boolean to know how serious userspace >>> is breaking the large folios. for example, if more than 50% folios in a vma has >>> this problem, we can find it out and take some action. >> >> The high level value of these stats seems clear - I agree we need to be able to >> get these insights. I think the issues are more around the implementation >> though. I'm struggling to understand exactly how we could implement a lot of >> these things cheaply (either in the kernel or in user space). 
>> >> Let me try to work though what I think you are suggesting: >> >> - every THP is initially fully mapped >> - when an operation causes a partial unmap, mark the folio as having at least >> one partial mapping >> - on transition from "no partial mappings" to "at least one partial mapping" >> increment a "anon-partial-<size>kB" (one for each supported folio size) >> counter by the folio size >> - on transition from "at least one partial mapping" to "fully unampped >> everywhere" decrement the counter by the folio size >> >> I think the issue with this is that a folio that is fully mapped in a process >> that gets forked, then is partially unmapped in 1 process, will be accounted as >> partially mapped even after the process that partially unmapped it exits, even >> though that folio is now fully mapped in all processes that map it. Is that a >> problem, perhaps not? I'm not sure. > > I don't think this is a problem as what we really care about is if some "bad" > behaviour has ever happened in userspace. Though the "bad" guy has exited > or been killed, we still need the record to find out. > > except the global count, if we can reflect the partially mapped count in vma > such as smaps, this will help even more to locate the problematic userspace > code. Right. Although note that smaps is already scanning the page table, so for the smaps case we could do it precisely - it's already slow. The thpmaps script already gives a precise account of partially mapped THPs, FYI. > > Thanks > Barry
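[As a side note, the transition-based accounting sketched in the quoted proposal, and its fork caveat, can be modelled in a few lines. This is a toy userspace illustration under assumed names, not a kernel design.]

```python
class PartialMapCounter:
    # Toy model of the transition-based accounting quoted above. Userspace
    # Python illustration only; folio IDs and sizes are stand-ins.
    def __init__(self):
        self.partial_kb = {}   # folio size (kB) -> kB currently counted
        self.flagged = set()   # folios recorded as having >=1 partial mapping

    def on_partial_unmap(self, folio, size_kb):
        # Transition "no partial mappings" -> "at least one partial mapping".
        if folio not in self.flagged:
            self.flagged.add(folio)
            self.partial_kb[size_kb] = self.partial_kb.get(size_kb, 0) + size_kb

    def on_fully_unmapped_everywhere(self, folio, size_kb):
        # Transition "at least one partial mapping" -> "fully unmapped everywhere".
        if folio in self.flagged:
            self.flagged.remove(folio)
            self.partial_kb[size_kb] -= size_kb


c = PartialMapCounter()
c.on_partial_unmap('folio-A', 64)   # e.g. a forked child partially unmaps folio-A
print(c.partial_kb)                 # {64: 64}
# If the child now exits, nothing triggers on_fully_unmapped_everywhere()
# because the folio is still (fully) mapped in the parent, so folio-A keeps
# being reported as partially mapped - exactly the caveat discussed above.
```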
On 12/01/2024 06:03, Barry Song wrote: > On Fri, Jan 12, 2024 at 9:28 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 11.01.24 21:21, Barry Song wrote: >>> On Fri, Jan 12, 2024 at 2:18 AM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 11.01.24 13:25, Ryan Roberts wrote: >>>>> On 10/01/2024 22:14, Barry Song wrote: >>>>>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>> >>>>>>> On 10/01/2024 11:38, Barry Song wrote: >>>>>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>> >>>>>>>>> On 10/01/2024 11:00, David Hildenbrand wrote: >>>>>>>>>> On 10.01.24 11:55, Ryan Roberts wrote: >>>>>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote: >>>>>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote: >>>>>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote: >>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> >>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP >>>>>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm >>>>>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and >>>>>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some >>>>>>>>>>>>>>>>>>>>> numbers >>>>>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global >>>>>>>>>>>>>>>>> and not >>>>>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>>>>>>>>>>>> container). >>>>>>>>>>>>>>>>> If you want per-container, then you can probably just create the container >>>>>>>>>>>>>>>>> in a >>>>>>>>>>>>>>>>> cgroup? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably >>>>>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from >>>>>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful >>>>>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the >>>>>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>>>>>>>>>>>> detailed stats. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> probably because this can be done without the modification of the kernel. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add >>>>>>>>>>>>>>>>> stats >>>>>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't >>>>>>>>>>>>>>>>> really >>>>>>>>>>>>>>>>> know exectly how to account mTHPs yet >>>>>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up >>>>>>>>>>>>>>>>> adding >>>>>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some >>>>>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so >>>>>>>>>>>>>>>>> David >>>>>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know >>>>>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and >>>>>>>>>>>>>>>>> cgroups >>>>>>>>>>>>>>>>> do live in sysfs). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution >>>>>>>>>>>>>>>>> to the >>>>>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore >>>>>>>>>>>>>>>>> what >>>>>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in >>>>>>>>>>>>>>>>>> which >>>>>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types, >>>>>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the >>>>>>>>>>>>>>>>>> detailed >>>>>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many >>>>>>>>>>>>>>>>>> they have gotten. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And >>>>>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>>>>>>>>>>>> values because this is still such an early feature. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's >>>>>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable" >>>>>>>>>>>>>>>>>>> location. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the >>>>>>>>>>>>>>>>> script >>>>>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I >>>>>>>>>>>>>>>>> think I >>>>>>>>>>>>>>>>> should be able to determine all the physical memory ranges from >>>>>>>>>>>>>>>>> /proc/iomem, >>>>>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be >>>>>>>>>>>>>>>>> able to >>>>>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the >>>>>>>>>>>>>>>>> same >>>>>>>>>>>>>>>>> stats, but it will apply globally. What do you think? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the >>>>>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of >>>>>>>>>>>>>>> each >>>>>>>>>>>>>>> size of THP are allocated?" 
- it doesn't tell us anything about whether they >>>>>>>>>>>>>>> are >>>>>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we >>>>>>>>>>>>>>> want >>>>>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is >>>>>>>>>>>>>>> going to >>>>>>>>>>>>>>> be particularly useful. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the >>>>>>>>>>>>>>> kernel; >>>>>>>>>>>>>>> if we want something equivalant to /proc/meminfo's >>>>>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the >>>>>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it, >>>>>>>>>>>>>>> you >>>>>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. >>>>>>>>>>>>>>> But >>>>>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its >>>>>>>>>>>>>>> easy >>>>>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the >>>>>>>>>>>>>>> PTEs >>>>>>>>>>>>>>> to figure out if we are unmapping the first page of a previously >>>>>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to >>>>>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one >>>>>>>>>>>>>>> process?". >>>>>>>>>>>>>> >>>>>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>>>>>>>>>>>> 1. entire map >>>>>>>>>>>>>> 2. subpage's map >>>>>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>>>>>>>>>>>> we have an explicit >>>>>>>>>>>>>> cont_pte split which will decrease the entire map and increase the >>>>>>>>>>>>>> subpage's mapcount. >>>>>>>>>>>>>> >>>>>>>>>>>>>> but its downside is that we expose this info to mm-core. >>>>>>>>>>>>> >>>>>>>>>>>>> OK, but I think we have a slightly more generic situation going on with the >>>>>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the >>>>>>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only >>>>>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we >>>>>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully >>>>>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned, >>>>>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would >>>>>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map >>>>>>>>>>>>> transition. >>>>>>>>>>>> >>>>>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if >>>>>>>>>>> we want to know what's fully mapped and what's not, then I don't see any way >>>>>>>>>>> other than by scanning the page tables and we might as well do that in user >>>>>>>>>>> space with this script. >>>>>>>>>>> >>>>>>>>>>> Although, I expect you will shortly make a proposal that is simple to implement >>>>>>>>>>> and prove me wrong ;-) >>>>>>>>>> >>>>>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really >>>>>>>>>> making sense. 
>>>>>>>>>> >>>>>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can >>>>>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's >>>>>>>>>> optimizations without the cont-pte bit and everything is fine. >>>>>>>>> >>>>>>>>> Yes, but for debug and optimization, its useful to know when THPs are >>>>>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does >>>>>>>>> that for us, and I think we are tending towards agreement that there are >>>>>>>>> unlikely to be any cost benefits by moving it into the kernel. >>>>>>>> >>>>>>>> frequent partial unmap can defeat all purpose for us to use large folios. >>>>>>>> just imagine a large folio can soon be splitted after it is formed. we lose >>>>>>>> the performance gain and might get regression instead. >>>>>>> >>>>>>> nit: just because a THP gets partially unmapped in a process doesn't mean it >>>>>>> gets split into order-0 pages. If the folio still has all its pages mapped at >>>>>>> least once then no further action is taken. If the page being unmapped was the >>>>>>> last mapping of that page, then the THP is put on the deferred split queue, so >>>>>>> that it can be split in future if needed. >>>>>>>> >>>>>>>> and this can be very frequent, for example, one userspace heap management >>>>>>>> is releasing memory page by page. >>>>>>>> >>>>>>>> In our real product deployment, we might not care about the second partial >>>>>>>> unmapped, we do care about the first partial unmapped as we can use this >>>>>>>> to know if split has ever happened on this large folios. an partial unmapped >>>>>>>> subpage can be unlikely re-mapped back. >>>>>>>> >>>>>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we >>>>>>>> care about if partial unmap has ever happened on a large folio more than how >>>>>>>> they are exactly partially unmapped :-) >>>>>>> >>>>>>> I'm not sure what you are suggesting here? A global boolean that tells you if >>>>>>> any folio in the system has ever been partially unmapped? That will almost >>>>>>> certainly always be true, even for a very well tuned system. >>>>>> >>>>>> not a global boolean but a per-folio boolean. in case userspace maps a region >>>>>> and has no userspace management, then we are fine as it is unlikely to have >>>>>> partial unmap/map things; in case userspace maps a region, but manages it >>>>>> by itself, such as heap things, we might result in lots of partial map/unmap, >>>>>> which can lead to 3 problems: >>>>>> 1. potential memory footprint increase, for example, while userspace releases >>>>>> some pages in a folio, we might still keep it as frequent splitting folio into >>>>>> basepages and releasing the unmapped subpage might be too expensive. >>>>>> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown >>>>>> might happen. >>>>>> 3. other maintenance overhead such as splitting large folios etc. >>>>>> >>>>>> We'd like to know how serious partial map things are happening. so either >>>>>> we will disable mTHP in this kind of VMAs, or optimize userspace to do >>>>>> some alignment according to the size of large folios. >>>>>> >>>>>> in android phones, we detect lots of apps, and also found some apps might >>>>>> do things like >>>>>> 1. mprotect on some pages within a large folio >>>>>> 2. mlock on some pages within a large folio >>>>>> 3. madv_free on some pages within a large folio >>>>>> 4. madv_pageout on some pages within a large folio. 
>>>>>> >>>>>> it would be good if we have a per-folio boolean to know how serious userspace >>>>>> is breaking the large folios. for example, if more than 50% folios in a vma has >>>>>> this problem, we can find it out and take some action. >>>>> >>>>> The high level value of these stats seems clear - I agree we need to be able to >>>>> get these insights. I think the issues are more around the implementation >>>>> though. I'm struggling to understand exactly how we could implement a lot of >>>>> these things cheaply (either in the kernel or in user space). >>>>> >>>>> Let me try to work though what I think you are suggesting: >>>>> >>>>> - every THP is initially fully mapped >>>> >>>> Not for pagecache folios. >>>> >>>>> - when an operation causes a partial unmap, mark the folio as having at least >>>>> one partial mapping >>>>> - on transition from "no partial mappings" to "at least one partial mapping" >>>>> increment a "anon-partial-<size>kB" (one for each supported folio size) >>>>> counter by the folio size >>>>> - on transition from "at least one partial mapping" to "fully unampped >>>>> everywhere" decrement the counter by the folio size >>>>> >>>>> I think the issue with this is that a folio that is fully mapped in a process >>>>> that gets forked, then is partially unmapped in 1 process, will be accounted as >>>>> partially mapped even after the process that partially unmapped it exits, even >>>>> though that folio is now fully mapped in all processes that map it. Is that a >>>>> problem, perhaps not? I'm not sure. >>>> >>>> What I can offer with my total mapcount I am working on (+ entire/pmd >>>> mapcount, but let's put that aside): >>>> >>>> 1) total_mapcount not multiples of folio_nr_page -> at least one process >>>> currently maps the folio partially >>>> >>>> 2) total_mapcount is less than folio_nr_page -> surely partially mapped >>>> >>>> I think for most of anon memory (note that most folios are always >>>> exclusive in our system, not cow-shared) 2) would already be sufficient. >>> >>> if we can improve Ryan's "mm: Batch-copy PTE ranges during fork()" to >>> add nr_pages in copy_pte_range for rmap. >>> copy_pte_range() >>> { >>> folio_try_dup_anon_rmap_ptes(...nr_pages....) >>> } >>> and at the same time, in zap_pte_range(), we remove the whole anon_rmap >>> if the zapped-range covers the whole folio. >>> >>> Replace the for-loop >>> for (i = 0; i < nr; i++, page++) { >>> add_rmap(1); >>> } >>> for (i = 0; i < nr; i++, page++) { >>> remove_rmap(1); >>> } >>> by always using add_rmap(nr_pages) and remove_rmap(nr_pages) if we >>> are doing the entire mapping/unmapping >> >> That's precisely what I have already running as protoypes :) And I >> promised Ryan to get to this soon, clean it up and sent it out. > > Cool. Glad we'll have it soon. > >> >> . >>> >>> then we might be able to TestAndSetPartialMapped flag for this folio anywhile >>> 1. someone is adding rmap with a number not equal nr_pages >>> 2. someone is removing rmap with a number not equal nr_pages >>> That means we are doing partial mapping or unmapping. >>> and we increment partialmap_count by 1, let debugfs or somewhere present >>> this count. >> >> Yes. The only "ugly" corner case if you have a split VMA. We're not >> batching rmap exceeding that. > > I am sorry I don't quite get what the problem is. Do you mean splitting > vma is crossing a PTE-mapped mTHP or a PMD-mapped THP? 
>
> for the latter, I see __split_huge_pmd_locked() does have some mapcount
> operation but it is batched by
> folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>                          vma, haddr, rmap_flags);
>
> for the former, I don't find any special mapcount thing is needed.
> Do I miss something?

I think the case that David is describing is when you have a THP that is
contpte-mapped in a VMA, then you do an operation that causes the VMA to be
split but which doesn't cause the PTEs to be remapped (e.g. MADV_HUGEPAGE
covering part of the mTHP). In this case you end up with 2 VMAs and a THP
straddling both, which is contpte-mapped. So in this case there would not be a
single batch rmap call when unmapping it. At best there would be 2; one in the
context of each VMA, covering the respective parts of the THP.

I don't think this is a big problem though; it would be counted as a partial
unmap, but it's a corner case that is unlikely to happen in practice.

I just want to try to summarise the counters we have discussed in this thread
to check my understanding:

  1. global mTHP successful allocation counter, per mTHP size (inc only)
  2. global mTHP failed allocation counter, per mTHP size (inc only)
  3. global mTHP currently allocated counter, per mTHP size (inc and dec)
  4. global "mTHP became partially mapped in 1 or more processes" counter
     (inc only)

I guess the above should apply to both page cache and anon? Do we want
separate counters for each?

I'm not sure if we would want 4. to be per mTHP size or a single counter for
all? Probably the former if it provides a bit more info for negligible cost.

Where should these be exposed? I guess /proc/vmstat is the obvious place, but
I don't think there is any precedent for per-size counters (especially where
the sizes will change depending on the system). Perhaps it would be better to
expose them in their per-size directories in
/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB ?

In addition to the above global counters, there is a case for adding a
per-process version of 4. to smaps.

Is that accurate?

>
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>
> Thanks
> Barry
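[If per-size counters were eventually exposed as files under the existing /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB directories, collecting them from userspace could look roughly like the sketch below. The directory glob matches the existing per-size layout; the counter file names ('alloc_success', 'alloc_fail') are purely hypothetical placeholders for whatever ABI gets agreed, so this is a sketch, not a description of a real interface.]

```python
#!/usr/bin/env python3
# Hedged sketch: read hypothetical per-size mTHP counters, if they existed
# as files inside the per-size sysfs directories. The directories are real
# on recent kernels; the counter file names are placeholders.

import glob
import os
import re


def read_mthp_counters(counter_names=('alloc_success', 'alloc_fail')):
    base = '/sys/kernel/mm/transparent_hugepage'
    stats = {}
    for d in sorted(glob.glob(os.path.join(base, 'hugepages-*kB'))):
        m = re.search(r'hugepages-(\d+)kB$', d)
        if not m:
            continue
        size_kb = int(m.group(1))
        per_size = {}
        for name in counter_names:
            path = os.path.join(d, name)
            if os.path.exists(path):  # tolerate counters that don't exist yet
                with open(path) as f:
                    per_size[name] = int(f.read())
        stats[size_kb] = per_size
    return stats


if __name__ == '__main__':
    print(read_mthp_counters())
```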
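[And going back to the batched-rmap detection idea quoted near the top of this message (flag a folio as partially mapped whenever an rmap add/remove batch covers fewer pages than the whole folio), a toy model of that rule might look like the following. It is illustrative Python only, not kernel code.]

```python
# Toy illustration of the detection rule discussed above: treat any rmap
# add/remove batch that covers fewer pages than the whole folio as evidence
# of a partial map/unmap, and latch a per-folio flag plus a global counter.

partially_mapped_ever = set()
partialmap_count = 0


def rmap_batch(folio, nr_pages_in_batch, folio_nr_pages):
    global partialmap_count
    if nr_pages_in_batch != folio_nr_pages and folio not in partially_mapped_ever:
        partially_mapped_ever.add(folio)
        partialmap_count += 1


# A 64K folio (16 pages) is mapped/unmapped whole: no flag.
rmap_batch('A', 16, 16)
# The same folio later has 4 of its pages zapped: flagged once, counted once.
rmap_batch('A', 4, 16)
assert partialmap_count == 1
```

[Note that the split-VMA corner case described above would register as a partial unmap under this rule, since each VMA's batch covers only part of the folio.]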
On 12.01.24 11:18, Ryan Roberts wrote: > On 11/01/2024 13:18, David Hildenbrand wrote: >> On 11.01.24 13:25, Ryan Roberts wrote: >>> On 10/01/2024 22:14, Barry Song wrote: >>>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>> >>>>> On 10/01/2024 11:38, Barry Song wrote: >>>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>> >>>>>>> On 10/01/2024 11:00, David Hildenbrand wrote: >>>>>>>> On 10.01.24 11:55, Ryan Roberts wrote: >>>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote: >>>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote: >>>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote: >>>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote: >>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote: >>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard >>>>>>>>>>>>>>>> <jhubbard@nvidia.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote: >>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard >>>>>>>>>>>>>>>>>> <jhubbard@nvidia.com> >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>>>>> Hi Ryan, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing >>>>>>>>>>>>>>>>>>> of mTHP >>>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm >>>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various >>>>>>>>>>>>>>>>>>> containers and >>>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some >>>>>>>>>>>>>>>>>>> numbers >>>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely >>>>>>>>>>>>>>> global >>>>>>>>>>>>>>> and not >>>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a >>>>>>>>>>>>>>> container). >>>>>>>>>>>>>>> If you want per-container, then you can probably just create the >>>>>>>>>>>>>>> container >>>>>>>>>>>>>>> in a >>>>>>>>>>>>>>> cgroup? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. >>>>>>>>>>>>>>>>>>> Probably >>>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial >>>>>>>>>>>>>>>>>>> reactions from >>>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap? >>>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like >>>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are >>>>>>>>>>>>>>>>> more useful >>>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more >>>>>>>>>>>>>>>>> detailed stats. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> probably because this can be done without the modification of the >>>>>>>>>>>>>>>> kernel. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous >>>>>>>>>>>>>>> attempts to add >>>>>>>>>>>>>>> stats >>>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we >>>>>>>>>>>>>>> don't >>>>>>>>>>>>>>> really >>>>>>>>>>>>>>> know exectly how to account mTHPs yet >>>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to >>>>>>>>>>>>>>> end up >>>>>>>>>>>>>>> adding >>>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also >>>>>>>>>>>>>>> been some >>>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in >>>>>>>>>>>>>>> sysfs, so >>>>>>>>>>>>>>> David >>>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I >>>>>>>>>>>>>>> know >>>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes >>>>>>>>>>>>>>> and >>>>>>>>>>>>>>> cgroups >>>>>>>>>>>>>>> do live in sysfs). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term >>>>>>>>>>>>>>> solution >>>>>>>>>>>>>>> to the >>>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to >>>>>>>>>>>>>>> explore >>>>>>>>>>>>>>> what >>>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my >>>>>>>>>>>>>>>> case in >>>>>>>>>>>>>>>> which >>>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma >>>>>>>>>>>>>>>> types, >>>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the >>>>>>>>>>>>>>>> detailed >>>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and >>>>>>>>>>>>>>>> how many >>>>>>>>>>>>>>>> they have gotten. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to >>>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or >>>>>>>>>>>>>>>>> similar)". And >>>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys >>>>>>>>>>>>>>>>> values because this is still such an early feature. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? >>>>>>>>>>>>>>>>> That's >>>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI >>>>>>>>>>>>>>>>> stable" >>>>>>>>>>>>>>>>> location. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode >>>>>>>>>>>>>>> to the >>>>>>>>>>>>>>> script >>>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are >>>>>>>>>>>>>>> provided). I >>>>>>>>>>>>>>> think I >>>>>>>>>>>>>>> should be able to determine all the physical memory ranges from >>>>>>>>>>>>>>> /proc/iomem, >>>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should >>>>>>>>>>>>>>> then be >>>>>>>>>>>>>>> able to >>>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and >>>>>>>>>>>>>>> provide the >>>>>>>>>>>>>>> same >>>>>>>>>>>>>>> stats, but it will apply globally. What do you think? >>>>>>>>>>>>> >>>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if >>>>>>>>>>>>> anyone wants >>>>>>>>>>>>> the >>>>>>>>>>>>> complete picture :) ), this won't quite work. 
This approach doesn't >>>>>>>>>>>>> have the >>>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how >>>>>>>>>>>>> many of >>>>>>>>>>>>> each >>>>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about >>>>>>>>>>>>> whether they >>>>>>>>>>>>> are >>>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary >>>>>>>>>>>>> if we >>>>>>>>>>>>> want >>>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is >>>>>>>>>>>>> going to >>>>>>>>>>>>> be particularly useful. >>>>>>>>>>>>> >>>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the >>>>>>>>>>>>> kernel; >>>>>>>>>>>>> if we want something equivalant to /proc/meminfo's >>>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not >>>>>>>>>>>>> just the >>>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for >>>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you >>>>>>>>>>>>> set it, >>>>>>>>>>>>> you >>>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you >>>>>>>>>>>>> decrement. >>>>>>>>>>>>> But >>>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping >>>>>>>>>>>>> so its >>>>>>>>>>>>> easy >>>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to >>>>>>>>>>>>> scan the >>>>>>>>>>>>> PTEs >>>>>>>>>>>>> to figure out if we are unmapping the first page of a previously >>>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap >>>>>>>>>>>>> mechanism to >>>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one >>>>>>>>>>>>> process?". >>>>>>>>>>>> >>>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount >>>>>>>>>>>> 1. entire map >>>>>>>>>>>> 2. subpage's map >>>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped. >>>>>>>>>>>> >>>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap, >>>>>>>>>>>> we have an explicit >>>>>>>>>>>> cont_pte split which will decrease the entire map and increase the >>>>>>>>>>>> subpage's mapcount. >>>>>>>>>>>> >>>>>>>>>>>> but its downside is that we expose this info to mm-core. >>>>>>>>>>> >>>>>>>>>>> OK, but I think we have a slightly more generic situation going on >>>>>>>>>>> with the >>>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit >>>>>>>>>>> in the >>>>>>>>>>> PTE to determne if its fully mapped? That works for your case where >>>>>>>>>>> you only >>>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the >>>>>>>>>>> upstream, we >>>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if >>>>>>>>>>> its fully >>>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and >>>>>>>>>>> aligned, >>>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm >>>>>>>>>>> would >>>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map >>>>>>>>>>> transition. >>>>>>>>>> >>>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of >>>>>>>>>> some stats. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Indeed, I was intending to argue *against* doing it this way. >>>>>>>>> Fundamentally, if >>>>>>>>> we want to know what's fully mapped and what's not, then I don't see any >>>>>>>>> way >>>>>>>>> other than by scanning the page tables and we might as well do that in user >>>>>>>>> space with this script. 
>>>>>>>>> >>>>>>>>> Although, I expect you will shortly make a proposal that is simple to >>>>>>>>> implement >>>>>>>>> and prove me wrong ;-) >>>>>>>> >>>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really >>>>>>>> making sense. >>>>>>>> >>>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You >>>>>>>> can >>>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's >>>>>>>> optimizations without the cont-pte bit and everything is fine. >>>>>>> >>>>>>> Yes, but for debug and optimization, its useful to know when THPs are >>>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does >>>>>>> that for us, and I think we are tending towards agreement that there are >>>>>>> unlikely to be any cost benefits by moving it into the kernel. >>>>>> >>>>>> frequent partial unmap can defeat all purpose for us to use large folios. >>>>>> just imagine a large folio can soon be splitted after it is formed. we lose >>>>>> the performance gain and might get regression instead. >>>>> >>>>> nit: just because a THP gets partially unmapped in a process doesn't mean it >>>>> gets split into order-0 pages. If the folio still has all its pages mapped at >>>>> least once then no further action is taken. If the page being unmapped was the >>>>> last mapping of that page, then the THP is put on the deferred split queue, so >>>>> that it can be split in future if needed. >>>>>> >>>>>> and this can be very frequent, for example, one userspace heap management >>>>>> is releasing memory page by page. >>>>>> >>>>>> In our real product deployment, we might not care about the second partial >>>>>> unmapped, we do care about the first partial unmapped as we can use this >>>>>> to know if split has ever happened on this large folios. an partial unmapped >>>>>> subpage can be unlikely re-mapped back. >>>>>> >>>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we >>>>>> care about if partial unmap has ever happened on a large folio more than how >>>>>> they are exactly partially unmapped :-) >>>>> >>>>> I'm not sure what you are suggesting here? A global boolean that tells you if >>>>> any folio in the system has ever been partially unmapped? That will almost >>>>> certainly always be true, even for a very well tuned system. >>>> >>>> not a global boolean but a per-folio boolean. in case userspace maps a region >>>> and has no userspace management, then we are fine as it is unlikely to have >>>> partial unmap/map things; in case userspace maps a region, but manages it >>>> by itself, such as heap things, we might result in lots of partial map/unmap, >>>> which can lead to 3 problems: >>>> 1. potential memory footprint increase, for example, while userspace releases >>>> some pages in a folio, we might still keep it as frequent splitting folio into >>>> basepages and releasing the unmapped subpage might be too expensive. >>>> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown >>>> might happen. >>>> 3. other maintenance overhead such as splitting large folios etc. >>>> >>>> We'd like to know how serious partial map things are happening. so either >>>> we will disable mTHP in this kind of VMAs, or optimize userspace to do >>>> some alignment according to the size of large folios. >>>> >>>> in android phones, we detect lots of apps, and also found some apps might >>>> do things like >>>> 1. mprotect on some pages within a large folio >>>> 2. 
mlock on some pages within a large folio >>>> 3. madv_free on some pages within a large folio >>>> 4. madv_pageout on some pages within a large folio. >>>> >>>> it would be good if we have a per-folio boolean to know how serious userspace >>>> is breaking the large folios. for example, if more than 50% folios in a vma has >>>> this problem, we can find it out and take some action. >>> >>> The high level value of these stats seems clear - I agree we need to be able to >>> get these insights. I think the issues are more around the implementation >>> though. I'm struggling to understand exactly how we could implement a lot of >>> these things cheaply (either in the kernel or in user space). >>> >>> Let me try to work though what I think you are suggesting: >>> >>> - every THP is initially fully mapped >> >> Not for pagecache folios. >> >>> - when an operation causes a partial unmap, mark the folio as having at least >>> one partial mapping >>> - on transition from "no partial mappings" to "at least one partial mapping" >>> increment a "anon-partial-<size>kB" (one for each supported folio size) >>> counter by the folio size >>> - on transition from "at least one partial mapping" to "fully unampped >>> everywhere" decrement the counter by the folio size >>> >>> I think the issue with this is that a folio that is fully mapped in a process >>> that gets forked, then is partially unmapped in 1 process, will be accounted as >>> partially mapped even after the process that partially unmapped it exits, even >>> though that folio is now fully mapped in all processes that map it. Is that a >>> problem, perhaps not? I'm not sure. >> >> What I can offer with my total mapcount I am working on (+ entire/pmd mapcount, >> but let's put that aside): > > Is "total mapcount" bound up as part of your "precise shared vs exclusive" work > or is it separate? If separate, do you have any ballpark feel for how likely it > is to land and if so, when? You could have an expensive total mapcount via folio_mapcount() today. The fast version is currently part of "precise shared vs exclusive", but with most RMAP batching in place we might want to consider adding it ahead of time, because the overhead of maintaining it will reduce drastically in the cases we care about. My current plan is: (1) RMAP batching when remapping a PMD-mapped THP. Upstream. (2) Fork batching. I have that prototype you already saw, will work on this next. (3) Zap batching. I also have a prototype now and will polish that as well. (4) Total mapcount (5) Shared vs. Exclusive (6) Subpage mapcounts / PageAnonExclusive fun I'll try getting a per-folio PageAnonExclusive bit implemented ahead of time. I think I know roughly what there is to do, but some corner cases are ugly to handle and I avoided messing with them in the past by making the PAE bit per-subpage. We'll see. Now, no idea how long that all will take. I have decent prototypes at this point for most stuff.
diff --git a/tools/mm/Makefile b/tools/mm/Makefile
index 1c5606cc3334..7bb03606b9ea 100644
--- a/tools/mm/Makefile
+++ b/tools/mm/Makefile
@@ -3,7 +3,8 @@
 #
 include ../scripts/Makefile.include
 
-TARGETS=page-types slabinfo page_owner_sort
+BUILD_TARGETS=page-types slabinfo page_owner_sort
+INSTALL_TARGETS = $(BUILD_TARGETS) thpmaps
 
 LIB_DIR = ../lib/api
 LIBS = $(LIB_DIR)/libapi.a
@@ -11,9 +12,9 @@ LIBS = $(LIB_DIR)/libapi.a
 CFLAGS += -Wall -Wextra -I../lib/ -pthread
 LDFLAGS += $(LIBS) -pthread
 
-all: $(TARGETS)
+all: $(BUILD_TARGETS)
 
-$(TARGETS): $(LIBS)
+$(BUILD_TARGETS): $(LIBS)
 
 $(LIBS):
 	make -C $(LIB_DIR)
@@ -29,4 +30,4 @@ sbindir ?= /usr/sbin
 
 install: all
 	install -d $(DESTDIR)$(sbindir)
-	install -m 755 -p $(TARGETS) $(DESTDIR)$(sbindir)
+	install -m 755 -p $(INSTALL_TARGETS) $(DESTDIR)$(sbindir)
diff --git a/tools/mm/thpmaps b/tools/mm/thpmaps
new file mode 100755
index 000000000000..af9b19f63eb4
--- /dev/null
+++ b/tools/mm/thpmaps
@@ -0,0 +1,573 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0-only
+# Copyright (C) 2024 ARM Ltd.
+#
+# Utility providing smaps-like output detailing transparent hugepage usage.
+# For more info, run:
+# ./thpmaps --help
+#
+# Requires numpy:
+# pip3 install numpy
+
+
+import argparse
+import collections
+import math
+import os
+import re
+import resource
+import shutil
+import sys
+import time
+import numpy as np
+
+
+with open('/sys/kernel/mm/transparent_hugepage/hpage_pmd_size') as f:
+    PAGE_SIZE = resource.getpagesize()
+    PAGE_SHIFT = int(math.log2(PAGE_SIZE))
+    PMD_SIZE = int(f.read())
+    PMD_ORDER = int(math.log2(PMD_SIZE / PAGE_SIZE))
+
+
+def align_forward(v, a):
+    return (v + (a - 1)) & ~(a - 1)
+
+
+def align_offset(v, a):
+    return v & (a - 1)
+
+
+def nrkb(nr):
+    # Convert number of pages to KB.
+    return (nr << PAGE_SHIFT) >> 10
+
+
+def odkb(order):
+    # Convert page order to KB.
+    return nrkb(1 << order)
+
+
+def cont_ranges_all(arrs):
+    # Given a list of arrays, find the ranges for which values are monotonically
+    # incrementing in all arrays.
+    assert(len(arrs) > 0)
+    sz = len(arrs[0])
+    for arr in arrs:
+        assert(arr.shape == (sz,))
+    r = np.full(sz, 2)
+    d = np.diff(arrs[0]) == 1
+    for dd in [np.diff(arr) == 1 for arr in arrs[1:]]:
+        d &= dd
+    r[1:] -= d
+    r[:-1] -= d
+    return [np.repeat(arr, r).reshape(-1, 2) for arr in arrs]
+
+
+class ArgException(Exception):
+    pass
+
+
+class FileIOException(Exception):
+    pass
+
+
+class BinArrayFile:
+    # Base class used to read /proc/<pid>/pagemap and /proc/kpageflags into a
+    # numpy array. Use inherrited class in a with clause to ensure file is
+    # closed when it goes out of scope.
+    def __init__(self, filename, element_size):
+        self.element_size = element_size
+        self.filename = filename
+        self.fd = os.open(self.filename, os.O_RDONLY)
+
+    def cleanup(self):
+        os.close(self.fd)
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.cleanup()
+
+    def _readin(self, offset, buffer):
+        length = os.preadv(self.fd, (buffer,), offset)
+        if len(buffer) != length:
+            raise FileIOException('error: {} failed to read {} bytes at {:x}'
+                            .format(self.filename, len(buffer), offset))
+
+    def _toarray(self, buf):
+        assert(self.element_size == 8)
+        return np.frombuffer(buf, dtype=np.uint64)
+
+    def getv(self, vec):
+        sz = 0
+        for region in vec:
+            sz += int(region[1] - region[0] + 1) * self.element_size
+        buf = bytearray(sz)
+        view = memoryview(buf)
+        pos = 0
+        for region in vec:
+            offset = int(region[0]) * self.element_size
+            length = int(region[1] - region[0] + 1) * self.element_size
+            self._readin(offset, view[pos:pos+length])
+            pos += length
+        return self._toarray(buf)
+
+    def get(self, index, nr=1):
+        offset = index * self.element_size
+        length = nr * self.element_size
+        buf = bytearray(length)
+        self._readin(offset, buf)
+        return self._toarray(buf)
+
+
+PM_PAGE_PRESENT = 1 << 63
+PM_PFN_MASK = (1 << 55) - 1
+
+class PageMap(BinArrayFile):
+    # Read ranges of a given pid's pagemap into a numpy array.
+    def __init__(self, pid='self'):
+        super().__init__(f'/proc/{pid}/pagemap', 8)
+
+
+KPF_ANON = 1 << 12
+KPF_COMPOUND_HEAD = 1 << 15
+KPF_COMPOUND_TAIL = 1 << 16
+
+class KPageFlags(BinArrayFile):
+    # Read ranges of /proc/kpageflags into a numpy array.
+    def __init__(self):
+        super().__init__(f'/proc/kpageflags', 8)
+
+
+VMA = collections.namedtuple('VMA', [
+    'name',
+    'start',
+    'end',
+    'read',
+    'write',
+    'execute',
+    'private',
+    'pgoff',
+    'major',
+    'minor',
+    'inode',
+    'stats',
+])
+
+class VMAList:
+    # A container for VMAs, parsed from /proc/<pid>/smaps. Iterate over the
+    # instance to receive VMAs.
+    head_regex = re.compile(r"^([\da-f]+)-([\da-f]+) ([r-])([w-])([x-])([ps]) ([\da-f]+) ([\da-f]+):([\da-f]+) ([\da-f]+)\s*(.*)$")
+    kb_item_regex = re.compile(r"(\w+):\s*(\d+)\s*kB")
+
+    def __init__(self, pid='self'):
+        def is_vma(line):
+            return self.head_regex.search(line) != None
+
+        def get_vma(line):
+            m = self.head_regex.match(line)
+            if m is None:
+                return None
+            return VMA(
+                name=m.group(11),
+                start=int(m.group(1), 16),
+                end=int(m.group(2), 16),
+                read=m.group(3) == 'r',
+                write=m.group(4) == 'w',
+                execute=m.group(5) == 'x',
+                private=m.group(6) == 'p',
+                pgoff=int(m.group(7), 16),
+                major=int(m.group(8), 16),
+                minor=int(m.group(9), 16),
+                inode=int(m.group(10), 16),
+                stats={},
+            )
+
+        def get_value(line):
+            # Currently only handle the KB stats because they are summed for
+            # --summary. Core code doesn't know how to combine other stats.
+            exclude = ['KernelPageSize', 'MMUPageSize']
+            m = self.kb_item_regex.search(line)
+            if m:
+                param = m.group(1)
+                if param not in exclude:
+                    value = int(m.group(2))
+                    return param, value
+            return None, None
+
+        def parse_smaps(file):
+            vmas = []
+            i = 0
+
+            line = file.readline()
+
+            while True:
+                if not line:
+                    break
+                line = line.strip()
+
+                i += 1
+
+                vma = get_vma(line)
+                if vma is None:
+                    raise FileIOException(f'error: could not parse line {i}: "{line}"')
+
+                while True:
+                    line = file.readline()
+                    if not line:
+                        break
+                    line = line.strip()
+                    if is_vma(line):
+                        break
+
+                    i += 1
+
+                    param, value = get_value(line)
+                    if param:
+                        vma.stats[param] = {'type': None, 'value': value}
+
+                vmas.append(vma)
+
+            return vmas
+
+        with open(f'/proc/{pid}/smaps', 'r') as file:
+            self.vmas = parse_smaps(file)
+
+    def __iter__(self):
+        yield from self.vmas
+
+
+def thp_parse(max_order, kpageflags, vfns, pfns, anons, heads):
+    # Given 4 same-sized arrays representing a range within a page table backed
+    # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons:
+    # True if page is anonymous, heads: True if page is head of a THP), return a
+    # dictionary of statistics describing the mapped THPs.
+    stats = {
+        'file': {
+            'partial': 0,
+            'aligned': [0] * (max_order + 1),
+            'unaligned': [0] * (max_order + 1),
+        },
+        'anon': {
+            'partial': 0,
+            'aligned': [0] * (max_order + 1),
+            'unaligned': [0] * (max_order + 1),
+        },
+    }
+
+    indexes = np.arange(len(vfns), dtype=np.uint64)
+    ranges = cont_ranges_all([indexes, vfns, pfns])
+    for rindex, rpfn in zip(ranges[0], ranges[2]):
+        index_next = int(rindex[0])
+        index_end = int(rindex[1]) + 1
+        pfn_end = int(rpfn[1]) + 1
+
+        folios = indexes[index_next:index_end][heads[index_next:index_end]]
+
+        # Account pages for any partially mapped THP at the front. In that case,
+        # the first page of the range is a tail.
+        nr = (int(folios[0]) if len(folios) else index_end) - index_next
+        stats['anon' if anons[index_next] else 'file']['partial'] += nr
+
+        # Account pages for any partially mapped THP at the back. In that case,
+        # the next page after the range is a tail.
+        if len(folios):
+            flags = int(kpageflags.get(pfn_end)[0])
+            if flags & KPF_COMPOUND_TAIL:
+                nr = index_end - int(folios[-1])
+                folios = folios[:-1]
+                index_end -= nr
+                stats['anon' if anons[index_end - 1] else 'file']['partial'] += nr
+
+        # Account fully mapped THPs in the middle of the range.
+        if len(folios):
+            folio_nrs = np.append(np.diff(folios), np.uint64(index_end - folios[-1]))
+            folio_orders = np.log2(folio_nrs).astype(np.uint64)
+            for index, order in zip(folios, folio_orders):
+                index = int(index)
+                order = int(order)
+                nr = 1 << order
+                vfn = int(vfns[index])
+                align = 'aligned' if align_forward(vfn, nr) == vfn else 'unaligned'
+                anon = 'anon' if anons[index] else 'file'
+                stats[anon][align][order] += nr
+
+    rstats = {}
+
+    def flatten_sub(type, subtype, stats):
+        for od, nr in enumerate(stats[2:], 2):
+            rstats[f"{type}-thp-{subtype}-{odkb(od)}kB"] = {'type': type, 'value': nrkb(nr)}
+
+    def flatten_type(type, stats):
+        flatten_sub(type, 'aligned', stats['aligned'])
+        flatten_sub(type, 'unaligned', stats['unaligned'])
+        rstats[f"{type}-thp-partial"] = {'type': type, 'value': nrkb(stats['partial'])}
+
+    flatten_type('anon', stats['anon'])
+    flatten_type('file', stats['file'])
+
+    return rstats
+
+
+def cont_parse(order, vfns, pfns, anons, heads):
+    # Given 4 same-sized arrays representing a range within a page table backed
+    # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons:
+    # True if page is anonymous, heads: True if page is head of a THP), return a
+    # dictionary of statistics describing the contiguous blocks.
+    nr_cont = 1 << order
+    nr_anon = 0
+    nr_file = 0
+
+    ranges = cont_ranges_all([np.arange(len(vfns), dtype=np.uint64), vfns, pfns])
+    for rindex, rvfn, rpfn in zip(*ranges):
+        index_next = int(rindex[0])
+        index_end = int(rindex[1]) + 1
+        vfn_start = int(rvfn[0])
+        pfn_start = int(rpfn[0])
+
+        if align_offset(pfn_start, nr_cont) != align_offset(vfn_start, nr_cont):
+            continue
+
+        off = align_forward(vfn_start, nr_cont) - vfn_start
+        index_next += off
+
+        while index_next + nr_cont <= index_end:
+            folio_boundary = heads[index_next+1:index_next+nr_cont].any()
+            if not folio_boundary:
+                if anons[index_next]:
+                    nr_anon += nr_cont
+                else:
+                    nr_file += nr_cont
+            index_next += nr_cont
+
+    return {
+        f"anon-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'anon', 'value': nrkb(nr_anon)},
+        f"file-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'file', 'value': nrkb(nr_file)},
+    }
+
+
+def vma_print(vma, pid):
+    # Prints a VMA instance in a format similar to smaps. The main difference is
+    # that the pid is included as the first value.
+    print("{:08x} {:016x}-{:016x} {}{}{}{} {:08x} {:02x}:{:02x} {:08x} {}"
+            .format(
+                pid, vma.start, vma.end,
+                'r' if vma.read else '-', 'w' if vma.write else '-',
+                'x' if vma.execute else '-', 'p' if vma.private else 's',
+                vma.pgoff, vma.major, vma.minor, vma.inode, vma.name
+            ))
+
+
+def stats_print(stats, tot_anon, tot_file, inc_empty):
+    # Print a statistics dictionary.
+    label_field = 32
+    for label, stat in stats.items():
+        type = stat['type']
+        value = stat['value']
+        if value or inc_empty:
+            pad = max(0, label_field - len(label) - 1)
+            if type == 'anon':
+                percent = f' ({value / tot_anon:3.0%})'
+            elif type == 'file':
+                percent = f' ({value / tot_file:3.0%})'
+            else:
+                percent = ''
+            print(f"{label}:{' ' * pad}{value:8} kB{percent}")
+
+
+def vma_parse(vma, pagemap, kpageflags, contorders):
+    # Generate thp and cont statistics for a single VMA.
+    start = vma.start >> PAGE_SHIFT
+    end = vma.end >> PAGE_SHIFT
+
+    pmes = pagemap.get(start, end - start)
+    present = pmes & PM_PAGE_PRESENT != 0
+    pfns = pmes & PM_PFN_MASK
+    pfns = pfns[present]
+    vfns = np.arange(start, end, dtype=np.uint64)
+    vfns = vfns[present]
+
+    flags = kpageflags.getv(cont_ranges_all([pfns])[0])
+    anons = flags & KPF_ANON != 0
+    heads = flags & KPF_COMPOUND_HEAD != 0
+    tails = flags & KPF_COMPOUND_TAIL != 0
+    thps = heads | tails
+
+    tot_anon = np.count_nonzero(anons)
+    tot_file = np.size(anons) - tot_anon
+    tot_anon = nrkb(tot_anon)
+    tot_file = nrkb(tot_file)
+
+    vfns = vfns[thps]
+    pfns = pfns[thps]
+    anons = anons[thps]
+    heads = heads[thps]
+
+    thpstats = thp_parse(PMD_ORDER, kpageflags, vfns, pfns, anons, heads)
+    contstats = [cont_parse(order, vfns, pfns, anons, heads) for order in contorders]
+
+    return {
+        **thpstats,
+        **{k: v for s in contstats for k, v in s.items()}
+    }, tot_anon, tot_file
+
+
+def do_main(args):
+    pids = set()
+    summary = {}
+    summary_anon = 0
+    summary_file = 0
+
+    if args.cgroup:
+        with open(f'{args.cgroup}/cgroup.procs') as pidfile:
+            for line in pidfile.readlines():
+                pids.add(int(line.strip()))
+    else:
+        pids.add(args.pid)
+
+    for pid in pids:
+        try:
+            with PageMap(pid) as pagemap:
+                with KPageFlags() as kpageflags:
+                    for vma in VMAList(pid):
+                        if (vma.read or vma.write or vma.execute) and vma.stats['Rss']['value'] > 0:
+                            stats, vma_anon, vma_file = vma_parse(vma, pagemap, kpageflags, args.cont)
+                        else:
+                            stats = {}
+                            vma_anon = 0
+                            vma_file = 0
+                        if args.inc_smaps:
+                            stats = {**vma.stats, **stats}
+                        if args.summary:
+                            for k, v in stats.items():
+                                if k in summary:
+                                    assert(summary[k]['type'] == v['type'])
+                                    summary[k]['value'] += v['value']
+                                else:
+                                    summary[k] = v
+                            summary_anon += vma_anon
+                            summary_file += vma_file
+                        else:
+                            vma_print(vma, pid)
+                            stats_print(stats, vma_anon, vma_file, args.inc_empty)
+        except FileNotFoundError:
+            if not args.cgroup:
+                raise
+        except ProcessLookupError:
+            if not args.cgroup:
+                raise
+
+    if args.summary:
+        stats_print(summary, summary_anon, summary_file, args.inc_empty)
+
+
+def main():
+    def formatter(prog):
+        width = shutil.get_terminal_size().columns
+        width -= 2
+        width = min(80, width)
+        return argparse.HelpFormatter(prog, width=width)
+
+    def size2order(human):
+        units = {"K": 2**10, "M": 2**20, "G": 2**30}
+        unit = 1
+        if human[-1] in units:
+            unit = units[human[-1]]
+            human = human[:-1]
+        try:
+            size = int(human)
+        except ValueError:
+            raise ArgException('error: --cont value must be integer size with optional KMG unit')
+        size *= unit
+        order = int(math.log2(size / PAGE_SIZE))
+        if order < 1:
+            raise ArgException('error: --cont value must be size of at least 2 pages')
+        if (1 << order) * PAGE_SIZE != size:
+            raise ArgException('error: --cont value must be size of power-of-2 pages')
+        return order
+
+    parser = argparse.ArgumentParser(formatter_class=formatter,
+        description="""Prints information about how transparent huge pages are
+                    mapped to a specified process or cgroup.
+
+                    Shows statistics for fully-mapped THPs of every size, mapped
+                    both naturally aligned and unaligned for both file and
+                    anonymous memory. See
+                    [anon|file]-thp-[aligned|unaligned]-<size>kB keys.
+
+                    Shows statistics for mapped pages that belong to a THP but
+                    which are not fully mapped. See [anon|file]-thp-partial
+                    keys.
+
+                    Optionally shows statistics for naturally aligned,
+                    contiguous blocks of memory of a specified size (when --cont
+                    is provided). See [anon|file]-cont-aligned-<size>kB keys.
+
+                    Statistics are shown in kB and as a percentage of either
+                    total anon or file memory as appropriate.""",
+        epilog="""Requires root privilege to access pagemap and kpageflags.""")
+
+    parser.add_argument('--pid',
+        metavar='pid', required=False, type=int,
+        help="""Process id of the target process. Exactly one of --pid and
+            --cgroup must be provided.""")
+
+    parser.add_argument('--cgroup',
+        metavar='path', required=False,
+        help="""Path to the target cgroup in sysfs. Iterates over every pid in
+            the cgroup. Exactly one of --pid and --cgroup must be provided.""")
+
+    parser.add_argument('--summary',
+        required=False, default=False, action='store_true',
+        help="""Sum the per-vma statistics to provide a summary over the whole
+            process or cgroup.""")
+
+    parser.add_argument('--cont',
+        metavar='size[KMG]', required=False, default=[], action='append',
+        help="""Adds anon and file stats for naturally aligned, contiguously
+            mapped blocks of the specified size. May be issued multiple times to
+            track multiple sized blocks. Useful to infer e.g. arm64 contpte and
+            hpa mappings. Size must be a power-of-2 number of pages.""")
+
+    parser.add_argument('--inc-smaps',
+        required=False, default=False, action='store_true',
+        help="""Include all numerical, additive /proc/<pid>/smaps stats in the
+            output.""")
+
+    parser.add_argument('--inc-empty',
+        required=False, default=False, action='store_true',
+        help="""Show all statistics including those whose value is 0.""")
+
+    parser.add_argument('--periodic',
+        metavar='sleep_ms', required=False, type=int,
+        help="""Run in a loop, polling every sleep_ms milliseconds.""")
+
+    args = parser.parse_args()
+
+    try:
+        if (args.pid and args.cgroup) or \
+            (not args.pid and not args.cgroup):
+            raise ArgException("error: Exactly one of --pid and --cgroup must be provided.")
+
+        args.cont = [size2order(cont) for cont in args.cont]
+    except ArgException as e:
+        parser.print_usage()
+        raise
+
+    if args.periodic:
+        while True:
+            do_main(args)
+            print()
+            time.sleep(args.periodic / 1000)
+    else:
+        do_main(args)
+
+
+if __name__ == "__main__":
+    try:
+        main()
+    except Exception as e:
+        prog = os.path.basename(sys.argv[0])
+        print(f'{prog}: {e}')
+        exit(1)
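For readers who want to poke at the underlying interfaces directly, the decoding
the script performs can be reproduced stand-alone. The sketch below is not part
of the patch (the helper name is made up); it uses the same pagemap/kpageflags
bit definitions as the script above and, like the script, needs root to read
kpageflags:

--8<--
#!/usr/bin/env python3
# Stand-alone sketch: decode one page's pagemap and kpageflags entries for the
# calling process.
import ctypes
import resource
import struct

PAGE_SIZE = resource.getpagesize()
PM_PAGE_PRESENT = 1 << 63
PM_PFN_MASK = (1 << 55) - 1
KPF_ANON = 1 << 12
KPF_COMPOUND_HEAD = 1 << 15
KPF_COMPOUND_TAIL = 1 << 16

def describe(vaddr):
    vfn = vaddr // PAGE_SIZE
    with open('/proc/self/pagemap', 'rb') as f:
        f.seek(vfn * 8)                           # one 64-bit entry per page
        pme = struct.unpack('Q', f.read(8))[0]
    if not pme & PM_PAGE_PRESENT:
        return 'not present'
    pfn = pme & PM_PFN_MASK
    with open('/proc/kpageflags', 'rb') as f:
        f.seek(pfn * 8)                           # one 64-bit entry per pfn
        flags = struct.unpack('Q', f.read(8))[0]
    kind = 'anon' if flags & KPF_ANON else 'file'
    if flags & KPF_COMPOUND_HEAD:
        pos = 'THP head'
    elif flags & KPF_COMPOUND_TAIL:
        pos = 'THP tail'
    else:
        pos = 'small page'
    return f'pfn={pfn:#x} {kind} {pos}'

# Touch some anonymous memory and describe its first page.
buf = bytearray(PAGE_SIZE)
buf[0] = 1
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
print(describe(addr))
--8<--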
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
I've found this very useful for debugging, and I know others have requested a
way to check whether mTHP and contpte are working, so I thought this might be a
good short-term solution until we figure out how best to add stats in the
kernel?

Thanks,
Ryan

 tools/mm/Makefile |   9 +-
 tools/mm/thpmaps  | 573 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 578 insertions(+), 4 deletions(-)
 create mode 100755 tools/mm/thpmaps

--
2.25.1
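As a quick sanity check of the range-finding helper at the heart of the script,
here is a toy example; the function body is copied from the patch above and the
input arrays are made up. It shows that a run only continues while the index,
vfn and pfn arrays all increment by one together:

--8<--
import numpy as np

def cont_ranges_all(arrs):  # copied from tools/mm/thpmaps above
    assert(len(arrs) > 0)
    sz = len(arrs[0])
    for arr in arrs:
        assert(arr.shape == (sz,))
    r = np.full(sz, 2)
    d = np.diff(arrs[0]) == 1
    for dd in [np.diff(arr) == 1 for arr in arrs[1:]]:
        d &= dd
    r[1:] -= d
    r[:-1] -= d
    return [np.repeat(arr, r).reshape(-1, 2) for arr in arrs]

idx = np.arange(5, dtype=np.uint64)
vfns = np.array([100, 101, 102, 200, 201], dtype=np.uint64)  # gap after 102
pfns = np.array([50, 51, 52, 53, 54], dtype=np.uint64)       # fully contiguous

ranges = cont_ranges_all([idx, vfns, pfns])
print(ranges[0])  # [[0 2] [3 4]]: runs are indexes 0..2 and 3..4
print(ranges[1])  # [[100 102] [200 201]]
print(ranges[2])  # [[50 52] [53 54]]: split even though pfns are contiguous,
                  # because a run must be contiguous in every array at once
--8<--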