
[RFC,v1] tools/mm: Add thpmaps script to dump THP usage info

Message ID 20240102153828.1002295-1-ryan.roberts@arm.com (mailing list archive)
State New
Series [RFC,v1] tools/mm: Add thpmaps script to dump THP usage info

Commit Message

Ryan Roberts Jan. 2, 2024, 3:38 p.m. UTC
With the proliferation of large folios for file-backed memory, and more
recently the introduction of multi-size THP for anonymous memory, it is
becoming useful to be able to see exactly how large folios are mapped
into processes. For some architectures (e.g. arm64), if most memory is
mapped using contpte-sized and -aligned blocks, TLB usage can be
optimized, so it's useful to see where these requirements are and are not
being met.

thpmaps is a Python utility that reads /proc/<pid>/smaps,
/proc/<pid>/pagemap and /proc/kpageflags to print information about how
transparent huge pages (both file and anon) are mapped to a specified
process or cgroup. It aims to help users debug and optimize their
workloads. In future we may wish to introduce stats directly into the
kernel (e.g. smaps or similar), but for now this provides a short term
solution without the need to introduce any new ABI.
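
For reference, the core of the approach is just reading 64-bit entries
from pagemap (indexed by virtual page number) and kpageflags (indexed by
PFN); a minimal, illustrative sketch using the same bit definitions as
the script (page_info() below is not part of the tool, it only shows the
idea):

    import os
    import struct

    PAGE_SIZE = os.sysconf('SC_PAGE_SIZE')
    PM_PAGE_PRESENT = 1 << 63      # pagemap: page is present in RAM
    PM_PFN_MASK = (1 << 55) - 1    # pagemap: PFN of the page, if present
    KPF_ANON = 1 << 12             # kpageflags: anonymous page
    KPF_COMPOUND_HEAD = 1 << 15    # kpageflags: head page of a compound page
    KPF_COMPOUND_TAIL = 1 << 16    # kpageflags: tail page of a compound page

    def page_info(pid, vaddr):
        # pagemap has one 8-byte entry per virtual page, indexed by VFN.
        with open(f'/proc/{pid}/pagemap', 'rb') as f:
            f.seek((vaddr // PAGE_SIZE) * 8)
            pme, = struct.unpack('<Q', f.read(8))
        if not pme & PM_PAGE_PRESENT:
            return None
        pfn = pme & PM_PFN_MASK
        # kpageflags has one 8-byte entry per physical page, indexed by PFN.
        with open('/proc/kpageflags', 'rb') as f:
            f.seek(pfn * 8)
            flags, = struct.unpack('<Q', f.read(8))
        return {'pfn': pfn,
                'anon': bool(flags & KPF_ANON),
                'thp': bool(flags & (KPF_COMPOUND_HEAD | KPF_COMPOUND_TAIL))}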

Run with help option for a full listing of the arguments:

    # thpmaps --help

--8<--
usage: thpmaps [-h] [--pid pid] [--cgroup path] [--summary]
               [--cont size[KMG]] [--inc-smaps] [--inc-empty]
               [--periodic sleep_ms]

Prints information about how transparent huge pages are mapped to a
specified process or cgroup. Shows statistics for fully-mapped THPs of
every size, mapped both naturally aligned and unaligned for both file
and anonymous memory. See [anon|file]-thp-[aligned|unaligned]-<size>kB
keys. Shows statistics for mapped pages that belong to a THP but which
are not fully mapped. See [anon|file]-thp-partial keys. Optionally
shows statistics for naturally aligned, contiguous blocks of memory of
a specified size (when --cont is provided). See [anon|file]-cont-
aligned-<size>kB keys. Statistics are shown in kB and as a percentage
of either total anon or file memory as appropriate.

options:
  -h, --help           show this help message and exit
  --pid pid            Process id of the target process. Exactly one of
                       --pid and --cgroup must be provided.
  --cgroup path        Path to the target cgroup in sysfs. Iterates
                       over every pid in the cgroup. Exactly one of
                       --pid and --cgroup must be provided.
  --summary            Sum the per-vma statistics to provide a summary
                       over the whole process or cgroup.
  --cont size[KMG]     Adds anon and file stats for naturally aligned,
                       contiguously mapped blocks of the specified
                       size. May be issued multiple times to track
                       multiple sized blocks. Useful to infer e.g.
                       arm64 contpte and hpa mappings. Size must be a
                       power-of-2 number of pages.
  --inc-smaps          Include all numerical, additive
                       /proc/<pid>/smaps stats in the output.
  --inc-empty          Show all statistics including those whose value
                       is 0.
  --periodic sleep_ms  Run in a loop, polling every sleep_ms
                       milliseconds.

Requires root privilege to access pagemap and kpageflags.
--8<--

Example command to summarise fully and partially mapped THPs and 64K
contiguous blocks over all VMAs in a single process (--inc-empty forces
printing stats that are 0):

    # ./thpmaps --pid 10837 --cont 64K --summary --inc-empty

--8<--
anon-thp-aligned-16kB:                16 kB ( 0%)
anon-thp-aligned-32kB:                 0 kB ( 0%)
anon-thp-aligned-64kB:           4194304 kB (100%)
anon-thp-aligned-128kB:                0 kB ( 0%)
anon-thp-aligned-256kB:                0 kB ( 0%)
anon-thp-aligned-512kB:                0 kB ( 0%)
anon-thp-aligned-1024kB:               0 kB ( 0%)
anon-thp-aligned-2048kB:               0 kB ( 0%)
anon-thp-unaligned-16kB:               0 kB ( 0%)
anon-thp-unaligned-32kB:               0 kB ( 0%)
anon-thp-unaligned-64kB:               0 kB ( 0%)
anon-thp-unaligned-128kB:              0 kB ( 0%)
anon-thp-unaligned-256kB:              0 kB ( 0%)
anon-thp-unaligned-512kB:              0 kB ( 0%)
anon-thp-unaligned-1024kB:             0 kB ( 0%)
anon-thp-unaligned-2048kB:             0 kB ( 0%)
anon-thp-partial:                      0 kB ( 0%)
file-thp-aligned-16kB:                16 kB ( 1%)
file-thp-aligned-32kB:                64 kB ( 5%)
file-thp-aligned-64kB:               640 kB (50%)
file-thp-aligned-128kB:              128 kB (10%)
file-thp-aligned-256kB:                0 kB ( 0%)
file-thp-aligned-512kB:                0 kB ( 0%)
file-thp-aligned-1024kB:               0 kB ( 0%)
file-thp-aligned-2048kB:               0 kB ( 0%)
file-thp-unaligned-16kB:              16 kB ( 1%)
file-thp-unaligned-32kB:              32 kB ( 3%)
file-thp-unaligned-64kB:              64 kB ( 5%)
file-thp-unaligned-128kB:              0 kB ( 0%)
file-thp-unaligned-256kB:              0 kB ( 0%)
file-thp-unaligned-512kB:              0 kB ( 0%)
file-thp-unaligned-1024kB:             0 kB ( 0%)
file-thp-unaligned-2048kB:             0 kB ( 0%)
file-thp-partial:                     12 kB ( 1%)
anon-cont-aligned-64kB:          4194304 kB (100%)
file-cont-aligned-64kB:              768 kB (61%)
--8<--
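
The same stats can be summarised over every process in a cgroup, for
example (the cgroup path here is hypothetical):

    # ./thpmaps --cgroup /sys/fs/cgroup/mygroup --cont 64K --summary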

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---

I've found this very useful for debugging, and I know others have requested a
way to check whether mTHP and contpte are working, so I thought this might be a
good short-term solution until we figure out how best to add stats in the
kernel?

Thanks,
Ryan

 tools/mm/Makefile |   9 +-
 tools/mm/thpmaps  | 573 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 578 insertions(+), 4 deletions(-)
 create mode 100755 tools/mm/thpmaps

--
2.25.1

Comments

Barry Song Jan. 3, 2024, 6:44 a.m. UTC | #1
On Wed, Jan 3, 2024 at 4:38 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> With the proliferation of large folios for file-backed memory, and more
> recently the introduction of multi-size THP for anonymous memory, it is
> becoming useful to be able to see exactly how large folios are mapped
> into processes. For some architectures (e.g. arm64), if most memory is
> mapped using contpte-sized and -aligned blocks, TLB usage can be
> optimized so it's useful to see where these requirements are and are not
> being met.
>
> thpmaps is a Python utility that reads /proc/<pid>/smaps,
> /proc/<pid>/pagemap and /proc/kpageflags to print information about how
> transparent huge pages (both file and anon) are mapped to a specified
> process or cgroup. It aims to help users debug and optimize their
> workloads. In future we may wish to introduce stats directly into the
> kernel (e.g. smaps or similar), but for now this provides a short term
> solution without the need to introduce any new ABI.
>
> Run with help option for a full listing of the arguments:
>
>     # thpmaps --help
>
> --8<--
> usage: thpmaps [-h] [--pid pid] [--cgroup path] [--summary]
>                [--cont size[KMG]] [--inc-smaps] [--inc-empty]
>                [--periodic sleep_ms]
>
> Prints information about how transparent huge pages are mapped to a
> specified process or cgroup. Shows statistics for fully-mapped THPs of
> every size, mapped both naturally aligned and unaligned for both file
> and anonymous memory. See [anon|file]-thp-[aligned|unaligned]-<size>kB
> keys. Shows statistics for mapped pages that belong to a THP but which
> are not fully mapped. See [anon|file]-thp-partial keys. Optionally
> shows statistics for naturally aligned, contiguous blocks of memory of
> a specified size (when --cont is provided). See [anon|file]-cont-
> aligned-<size>kB keys. Statistics are shown in kB and as a percentage
> of either total anon or file memory as appropriate.
>
> options:
>   -h, --help           show this help message and exit
>   --pid pid            Process id of the target process. Exactly one of
>                        --pid and --cgroup must be provided.
>   --cgroup path        Path to the target cgroup in sysfs. Iterates
>                        over every pid in the cgroup. Exactly one of
>                        --pid and --cgroup must be provided.
>   --summary            Sum the per-vma statistics to provide a summary
>                        over the whole process or cgroup.
>   --cont size[KMG]     Adds anon and file stats for naturally aligned,
>                        contiguously mapped blocks of the specified
>                        size. May be issued multiple times to track
>                        multiple sized blocks. Useful to infer e.g.
>                        arm64 contpte and hpa mappings. Size must be a
>                        power-of-2 number of pages.
>   --inc-smaps          Include all numerical, additive
>                        /proc/<pid>/smaps stats in the output.
>   --inc-empty          Show all statistics including those whose value
>                        is 0.
>   --periodic sleep_ms  Run in a loop, polling every sleep_ms
>                        milliseconds.
>
> Requires root privilege to access pagemap and kpageflags.
> --8<--
>
> Example command to summarise fully and partially mapped THPs and 64K
> contiguous blocks over all VMAs in a single process (--inc-empty forces
> printing stats that are 0):
>
>     # ./thpmaps --pid 10837 --cont 64K --summary --inc-empty
>
> --8<--
> anon-thp-aligned-16kB:                16 kB ( 0%)
> anon-thp-aligned-32kB:                 0 kB ( 0%)
> anon-thp-aligned-64kB:           4194304 kB (100%)
> anon-thp-aligned-128kB:                0 kB ( 0%)
> anon-thp-aligned-256kB:                0 kB ( 0%)
> anon-thp-aligned-512kB:                0 kB ( 0%)
> anon-thp-aligned-1024kB:               0 kB ( 0%)
> anon-thp-aligned-2048kB:               0 kB ( 0%)
> anon-thp-unaligned-16kB:               0 kB ( 0%)
> anon-thp-unaligned-32kB:               0 kB ( 0%)
> anon-thp-unaligned-64kB:               0 kB ( 0%)
> anon-thp-unaligned-128kB:              0 kB ( 0%)
> anon-thp-unaligned-256kB:              0 kB ( 0%)
> anon-thp-unaligned-512kB:              0 kB ( 0%)
> anon-thp-unaligned-1024kB:             0 kB ( 0%)
> anon-thp-unaligned-2048kB:             0 kB ( 0%)
> anon-thp-partial:                      0 kB ( 0%)
> file-thp-aligned-16kB:                16 kB ( 1%)
> file-thp-aligned-32kB:                64 kB ( 5%)
> file-thp-aligned-64kB:               640 kB (50%)
> file-thp-aligned-128kB:              128 kB (10%)
> file-thp-aligned-256kB:                0 kB ( 0%)
> file-thp-aligned-512kB:                0 kB ( 0%)
> file-thp-aligned-1024kB:               0 kB ( 0%)
> file-thp-aligned-2048kB:               0 kB ( 0%)
> file-thp-unaligned-16kB:              16 kB ( 1%)
> file-thp-unaligned-32kB:              32 kB ( 3%)
> file-thp-unaligned-64kB:              64 kB ( 5%)
> file-thp-unaligned-128kB:              0 kB ( 0%)
> file-thp-unaligned-256kB:              0 kB ( 0%)
> file-thp-unaligned-512kB:              0 kB ( 0%)
> file-thp-unaligned-1024kB:             0 kB ( 0%)
> file-thp-unaligned-2048kB:             0 kB ( 0%)
> file-thp-partial:                     12 kB ( 1%)
> anon-cont-aligned-64kB:          4194304 kB (100%)
> file-cont-aligned-64kB:              768 kB (61%)
> --8<--
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---

Hi Ryan,

I ran a couple of test cases with different parameters and it seems to
work correctly. I just don't understand the output below: what is the
meaning of 000000ce at the beginning of each line?

/thpmaps  --pid 206 --cont 64K
000000ce 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00
00426969 /root/a.out
000000ce 0000aaaadbb3f000-0000aaaadbb40000 r--p 0000f000 fe:00
00426969 /root/a.out
000000ce 0000aaaadbb40000-0000aaaadbb41000 rw-p 00010000 fe:00
00426969 /root/a.out
000000ce 0000ffff702c0000-0000ffffb02c0000 rw-p 00000000 00:00 00000000
anon-thp-aligned-64kB:            473920 kB (100%)
anon-cont-aligned-64kB:           473920 kB (100%)
000000ce 0000ffffb02c0000-0000ffffb044c000 r-xp 00000000 fe:00
00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
000000ce 0000ffffb044c000-0000ffffb045d000 ---p 0018c000 fe:00
00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
000000ce 0000ffffb045d000-0000ffffb0460000 r--p 0018d000 fe:00
00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
000000ce 0000ffffb0460000-0000ffffb0462000 rw-p 00190000 fe:00
00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
000000ce 0000ffffb0462000-0000ffffb046f000 rw-p 00000000 00:00 00000000
000000ce 0000ffffb0477000-0000ffffb049d000 r-xp 00000000 fe:00
00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
000000ce 0000ffffb04b0000-0000ffffb04b2000 rw-p 00000000 00:00 00000000
000000ce 0000ffffb04b2000-0000ffffb04b4000 r--p 00000000 00:00 00000000 [vvar]
000000ce 0000ffffb04b4000-0000ffffb04b5000 r-xp 00000000 00:00 00000000 [vdso]
000000ce 0000ffffb04b5000-0000ffffb04b7000 r--p 0002e000 fe:00
00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
000000ce 0000ffffb04b7000-0000ffffb04b9000 rw-p 00030000 fe:00
00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
000000ce 0000ffffdaba4000-0000ffffdabc5000 rw-p 00000000 00:00 00000000 [stack]


>
> I've found this very useful for debugging, and I know others have requested a
> way to check if mTHP and contpte is working, so thought this might a good short
> term solution until we figure out how best to add stats in the kernel?
>
> Thanks,
> Ryan
>
>  tools/mm/Makefile |   9 +-
>  tools/mm/thpmaps  | 573 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 578 insertions(+), 4 deletions(-)
>  create mode 100755 tools/mm/thpmaps
>
> diff --git a/tools/mm/Makefile b/tools/mm/Makefile
> index 1c5606cc3334..7bb03606b9ea 100644
> --- a/tools/mm/Makefile
> +++ b/tools/mm/Makefile
> @@ -3,7 +3,8 @@
>  #
>  include ../scripts/Makefile.include
>
> -TARGETS=page-types slabinfo page_owner_sort
> +BUILD_TARGETS=page-types slabinfo page_owner_sort
> +INSTALL_TARGETS = $(BUILD_TARGETS) thpmaps
>
>  LIB_DIR = ../lib/api
>  LIBS = $(LIB_DIR)/libapi.a
> @@ -11,9 +12,9 @@ LIBS = $(LIB_DIR)/libapi.a
>  CFLAGS += -Wall -Wextra -I../lib/ -pthread
>  LDFLAGS += $(LIBS) -pthread
>
> -all: $(TARGETS)
> +all: $(BUILD_TARGETS)
>
> -$(TARGETS): $(LIBS)
> +$(BUILD_TARGETS): $(LIBS)
>
>  $(LIBS):
>         make -C $(LIB_DIR)
> @@ -29,4 +30,4 @@ sbindir ?= /usr/sbin
>
>  install: all
>         install -d $(DESTDIR)$(sbindir)
> -       install -m 755 -p $(TARGETS) $(DESTDIR)$(sbindir)
> +       install -m 755 -p $(INSTALL_TARGETS) $(DESTDIR)$(sbindir)
> diff --git a/tools/mm/thpmaps b/tools/mm/thpmaps
> new file mode 100755
> index 000000000000..af9b19f63eb4
> --- /dev/null
> +++ b/tools/mm/thpmaps
> @@ -0,0 +1,573 @@
> +#!/usr/bin/env python3
> +# SPDX-License-Identifier: GPL-2.0-only
> +# Copyright (C) 2024 ARM Ltd.
> +#
> +# Utility providing smaps-like output detailing transparent hugepage usage.
> +# For more info, run:
> +# ./thpmaps --help
> +#
> +# Requires numpy:
> +# pip3 install numpy
> +
> +
> +import argparse
> +import collections
> +import math
> +import os
> +import re
> +import resource
> +import shutil
> +import sys
> +import time
> +import numpy as np
> +
> +
> +with open('/sys/kernel/mm/transparent_hugepage/hpage_pmd_size') as f:
> +    PAGE_SIZE = resource.getpagesize()
> +    PAGE_SHIFT = int(math.log2(PAGE_SIZE))
> +    PMD_SIZE = int(f.read())
> +    PMD_ORDER = int(math.log2(PMD_SIZE / PAGE_SIZE))
> +
> +
> +def align_forward(v, a):
> +    return (v + (a - 1)) & ~(a - 1)
> +
> +
> +def align_offset(v, a):
> +    return v & (a - 1)
> +
> +
> +def nrkb(nr):
> +    # Convert number of pages to KB.
> +    return (nr << PAGE_SHIFT) >> 10
> +
> +
> +def odkb(order):
> +    # Convert page order to KB.
> +    return nrkb(1 << order)
> +
> +
> +def cont_ranges_all(arrs):
> +    # Given a list of arrays, find the ranges for which values are monotonically
> +    # incrementing in all arrays.
> +    assert(len(arrs) > 0)
> +    sz = len(arrs[0])
> +    for arr in arrs:
> +        assert(arr.shape == (sz,))
> +    r = np.full(sz, 2)
> +    d = np.diff(arrs[0]) == 1
> +    for dd in [np.diff(arr) == 1 for arr in arrs[1:]]:
> +        d &= dd
> +    r[1:] -= d
> +    r[:-1] -= d
> +    return [np.repeat(arr, r).reshape(-1, 2) for arr in arrs]
> +
> +
> +class ArgException(Exception):
> +    pass
> +
> +
> +class FileIOException(Exception):
> +    pass
> +
> +
> +class BinArrayFile:
> +    # Base class used to read /proc/<pid>/pagemap and /proc/kpageflags into a
> +    # numpy array. Use inherited class in a with clause to ensure file is
> +    # closed when it goes out of scope.
> +    def __init__(self, filename, element_size):
> +        self.element_size = element_size
> +        self.filename = filename
> +        self.fd = os.open(self.filename, os.O_RDONLY)
> +
> +    def cleanup(self):
> +        os.close(self.fd)
> +
> +    def __enter__(self):
> +        return self
> +
> +    def __exit__(self, exc_type, exc_val, exc_tb):
> +        self.cleanup()
> +
> +    def _readin(self, offset, buffer):
> +        length = os.preadv(self.fd, (buffer,), offset)
> +        if len(buffer) != length:
> +            raise FileIOException('error: {} failed to read {} bytes at {:x}'
> +                            .format(self.filename, len(buffer), offset))
> +
> +    def _toarray(self, buf):
> +        assert(self.element_size == 8)
> +        return np.frombuffer(buf, dtype=np.uint64)
> +
> +    def getv(self, vec):
> +        sz = 0
> +        for region in vec:
> +            sz += int(region[1] - region[0] + 1) * self.element_size
> +        buf = bytearray(sz)
> +        view = memoryview(buf)
> +        pos = 0
> +        for region in vec:
> +            offset = int(region[0]) * self.element_size
> +            length = int(region[1] - region[0] + 1) * self.element_size
> +            self._readin(offset, view[pos:pos+length])
> +            pos += length
> +        return self._toarray(buf)
> +
> +    def get(self, index, nr=1):
> +        offset = index * self.element_size
> +        length = nr * self.element_size
> +        buf = bytearray(length)
> +        self._readin(offset, buf)
> +        return self._toarray(buf)
> +
> +
> +PM_PAGE_PRESENT = 1 << 63
> +PM_PFN_MASK = (1 << 55) - 1
> +
> +class PageMap(BinArrayFile):
> +    # Read ranges of a given pid's pagemap into a numpy array.
> +    def __init__(self, pid='self'):
> +        super().__init__(f'/proc/{pid}/pagemap', 8)
> +
> +
> +KPF_ANON = 1 << 12
> +KPF_COMPOUND_HEAD = 1 << 15
> +KPF_COMPOUND_TAIL = 1 << 16
> +
> +class KPageFlags(BinArrayFile):
> +    # Read ranges of /proc/kpageflags into a numpy array.
> +    def __init__(self):
> +        super().__init__('/proc/kpageflags', 8)
> +
> +
> +VMA = collections.namedtuple('VMA', [
> +    'name',
> +    'start',
> +    'end',
> +    'read',
> +    'write',
> +    'execute',
> +    'private',
> +    'pgoff',
> +    'major',
> +    'minor',
> +    'inode',
> +    'stats',
> +])
> +
> +class VMAList:
> +    # A container for VMAs, parsed from /proc/<pid>/smaps. Iterate over the
> +    # instance to receive VMAs.
> +    head_regex = re.compile(r"^([\da-f]+)-([\da-f]+) ([r-])([w-])([x-])([ps]) ([\da-f]+) ([\da-f]+):([\da-f]+) ([\da-f]+)\s*(.*)$")
> +    kb_item_regex = re.compile(r"(\w+):\s*(\d+)\s*kB")
> +
> +    def __init__(self, pid='self'):
> +        def is_vma(line):
> +            return self.head_regex.search(line) != None
> +
> +        def get_vma(line):
> +            m = self.head_regex.match(line)
> +            if m is None:
> +                return None
> +            return VMA(
> +                name=m.group(11),
> +                start=int(m.group(1), 16),
> +                end=int(m.group(2), 16),
> +                read=m.group(3) == 'r',
> +                write=m.group(4) == 'w',
> +                execute=m.group(5) == 'x',
> +                private=m.group(6) == 'p',
> +                pgoff=int(m.group(7), 16),
> +                major=int(m.group(8), 16),
> +                minor=int(m.group(9), 16),
> +                inode=int(m.group(10), 16),
> +                stats={},
> +            )
> +
> +        def get_value(line):
> +            # Currently only handle the KB stats because they are summed for
> +            # --summary. Core code doesn't know how to combine other stats.
> +            exclude = ['KernelPageSize', 'MMUPageSize']
> +            m = self.kb_item_regex.search(line)
> +            if m:
> +                param = m.group(1)
> +                if param not in exclude:
> +                    value = int(m.group(2))
> +                    return param, value
> +            return None, None
> +
> +        def parse_smaps(file):
> +            vmas = []
> +            i = 0
> +
> +            line = file.readline()
> +
> +            while True:
> +                if not line:
> +                    break
> +                line = line.strip()
> +
> +                i += 1
> +
> +                vma = get_vma(line)
> +                if vma is None:
> +                    raise FileIOException(f'error: could not parse line {i}: "{line}"')
> +
> +                while True:
> +                    line = file.readline()
> +                    if not line:
> +                        break
> +                    line = line.strip()
> +                    if is_vma(line):
> +                        break
> +
> +                    i += 1
> +
> +                    param, value = get_value(line)
> +                    if param:
> +                        vma.stats[param] = {'type': None, 'value': value}
> +
> +                vmas.append(vma)
> +
> +            return vmas
> +
> +        with open(f'/proc/{pid}/smaps', 'r') as file:
> +            self.vmas = parse_smaps(file)
> +
> +    def __iter__(self):
> +        yield from self.vmas
> +
> +
> +def thp_parse(max_order, kpageflags, vfns, pfns, anons, heads):
> +    # Given 4 same-sized arrays representing a range within a page table backed
> +    # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons:
> +    # True if page is anonymous, heads: True if page is head of a THP), return a
> +    # dictionary of statistics describing the mapped THPs.
> +    stats = {
> +        'file': {
> +            'partial': 0,
> +            'aligned': [0] * (max_order + 1),
> +            'unaligned': [0] * (max_order + 1),
> +        },
> +        'anon': {
> +            'partial': 0,
> +            'aligned': [0] * (max_order + 1),
> +            'unaligned': [0] * (max_order + 1),
> +        },
> +    }
> +
> +    indexes = np.arange(len(vfns), dtype=np.uint64)
> +    ranges = cont_ranges_all([indexes, vfns, pfns])
> +    for rindex, rpfn in zip(ranges[0], ranges[2]):
> +        index_next = int(rindex[0])
> +        index_end = int(rindex[1]) + 1
> +        pfn_end = int(rpfn[1]) + 1
> +
> +        folios = indexes[index_next:index_end][heads[index_next:index_end]]
> +
> +        # Account pages for any partially mapped THP at the front. In that case,
> +        # the first page of the range is a tail.
> +        nr = (int(folios[0]) if len(folios) else index_end) - index_next
> +        stats['anon' if anons[index_next] else 'file']['partial'] += nr
> +
> +        # Account pages for any partially mapped THP at the back. In that case,
> +        # the next page after the range is a tail.
> +        if len(folios):
> +            flags = int(kpageflags.get(pfn_end)[0])
> +            if flags & KPF_COMPOUND_TAIL:
> +                nr = index_end - int(folios[-1])
> +                folios = folios[:-1]
> +                index_end -= nr
> +                stats['anon' if anons[index_end - 1] else 'file']['partial'] += nr
> +
> +        # Account fully mapped THPs in the middle of the range.
> +        if len(folios):
> +            folio_nrs = np.append(np.diff(folios), np.uint64(index_end - folios[-1]))
> +            folio_orders = np.log2(folio_nrs).astype(np.uint64)
> +            for index, order in zip(folios, folio_orders):
> +                index = int(index)
> +                order = int(order)
> +                nr = 1 << order
> +                vfn = int(vfns[index])
> +                align = 'aligned' if align_forward(vfn, nr) == vfn else 'unaligned'
> +                anon = 'anon' if anons[index] else 'file'
> +                stats[anon][align][order] += nr
> +
> +    rstats = {}
> +
> +    def flatten_sub(type, subtype, stats):
> +        for od, nr in enumerate(stats[2:], 2):
> +            rstats[f"{type}-thp-{subtype}-{odkb(od)}kB"] = {'type': type, 'value': nrkb(nr)}
> +
> +    def flatten_type(type, stats):
> +        flatten_sub(type, 'aligned', stats['aligned'])
> +        flatten_sub(type, 'unaligned', stats['unaligned'])
> +        rstats[f"{type}-thp-partial"] = {'type': type, 'value': nrkb(stats['partial'])}
> +
> +    flatten_type('anon', stats['anon'])
> +    flatten_type('file', stats['file'])
> +
> +    return rstats
> +
> +
> +def cont_parse(order, vfns, pfns, anons, heads):
> +    # Given 4 same-sized arrays representing a range within a page table backed
> +    # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons:
> +    # True if page is anonymous, heads: True if page is head of a THP), return a
> +    # dictionary of statistics describing the contiguous blocks.
> +    nr_cont = 1 << order
> +    nr_anon = 0
> +    nr_file = 0
> +
> +    ranges = cont_ranges_all([np.arange(len(vfns), dtype=np.uint64), vfns, pfns])
> +    for rindex, rvfn, rpfn in zip(*ranges):
> +        index_next = int(rindex[0])
> +        index_end = int(rindex[1]) + 1
> +        vfn_start = int(rvfn[0])
> +        pfn_start = int(rpfn[0])
> +
> +        if align_offset(pfn_start, nr_cont) != align_offset(vfn_start, nr_cont):
> +            continue
> +
> +        off = align_forward(vfn_start, nr_cont) - vfn_start
> +        index_next += off
> +
> +        while index_next + nr_cont <= index_end:
> +            folio_boundary = heads[index_next+1:index_next+nr_cont].any()
> +            if not folio_boundary:
> +                if anons[index_next]:
> +                    nr_anon += nr_cont
> +                else:
> +                    nr_file += nr_cont
> +            index_next += nr_cont
> +
> +    return {
> +        f"anon-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'anon', 'value': nrkb(nr_anon)},
> +        f"file-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'file', 'value': nrkb(nr_file)},
> +    }
> +
> +
> +def vma_print(vma, pid):
> +    # Prints a VMA instance in a format similar to smaps. The main difference is
> +    # that the pid is included as the first value.
> +    print("{:08x} {:016x}-{:016x} {}{}{}{} {:08x} {:02x}:{:02x} {:08x} {}"
> +        .format(
> +            pid, vma.start, vma.end,
> +            'r' if vma.read else '-', 'w' if vma.write else '-',
> +            'x' if vma.execute else '-', 'p' if vma.private else 's',
> +            vma.pgoff, vma.major, vma.minor, vma.inode, vma.name
> +        ))
> +
> +
> +def stats_print(stats, tot_anon, tot_file, inc_empty):
> +    # Print a statistics dictionary.
> +    label_field = 32
> +    for label, stat in stats.items():
> +        type = stat['type']
> +        value = stat['value']
> +        if value or inc_empty:
> +            pad = max(0, label_field - len(label) - 1)
> +            if type == 'anon':
> +                percent = f' ({value / tot_anon:3.0%})'
> +            elif type == 'file':
> +                percent = f' ({value / tot_file:3.0%})'
> +            else:
> +                percent = ''
> +            print(f"{label}:{' ' * pad}{value:8} kB{percent}")
> +
> +
> +def vma_parse(vma, pagemap, kpageflags, contorders):
> +    # Generate thp and cont statistics for a single VMA.
> +    start = vma.start >> PAGE_SHIFT
> +    end = vma.end >> PAGE_SHIFT
> +
> +    pmes = pagemap.get(start, end - start)
> +    present = pmes & PM_PAGE_PRESENT != 0
> +    pfns = pmes & PM_PFN_MASK
> +    pfns = pfns[present]
> +    vfns = np.arange(start, end, dtype=np.uint64)
> +    vfns = vfns[present]
> +
> +    flags = kpageflags.getv(cont_ranges_all([pfns])[0])
> +    anons = flags & KPF_ANON != 0
> +    heads = flags & KPF_COMPOUND_HEAD != 0
> +    tails = flags & KPF_COMPOUND_TAIL != 0
> +    thps = heads | tails
> +
> +    tot_anon = np.count_nonzero(anons)
> +    tot_file = np.size(anons) - tot_anon
> +    tot_anon = nrkb(tot_anon)
> +    tot_file = nrkb(tot_file)
> +
> +    vfns = vfns[thps]
> +    pfns = pfns[thps]
> +    anons = anons[thps]
> +    heads = heads[thps]
> +
> +    thpstats = thp_parse(PMD_ORDER, kpageflags, vfns, pfns, anons, heads)
> +    contstats = [cont_parse(order, vfns, pfns, anons, heads) for order in contorders]
> +
> +    return {
> +        **thpstats,
> +        **{k: v for s in contstats for k, v in s.items()}
> +    }, tot_anon, tot_file
> +
> +
> +def do_main(args):
> +    pids = set()
> +    summary = {}
> +    summary_anon = 0
> +    summary_file = 0
> +
> +    if args.cgroup:
> +        with open(f'{args.cgroup}/cgroup.procs') as pidfile:
> +            for line in pidfile.readlines():
> +                pids.add(int(line.strip()))
> +    else:
> +        pids.add(args.pid)
> +
> +    for pid in pids:
> +        try:
> +            with PageMap(pid) as pagemap:
> +                with KPageFlags() as kpageflags:
> +                    for vma in VMAList(pid):
> +                        if (vma.read or vma.write or vma.execute) and vma.stats['Rss']['value'] > 0:
> +                            stats, vma_anon, vma_file = vma_parse(vma, pagemap, kpageflags, args.cont)
> +                        else:
> +                            stats = {}
> +                            vma_anon = 0
> +                            vma_file = 0
> +                        if args.inc_smaps:
> +                            stats = {**vma.stats, **stats}
> +                        if args.summary:
> +                            for k, v in stats.items():
> +                                if k in summary:
> +                                    assert(summary[k]['type'] == v['type'])
> +                                    summary[k]['value'] += v['value']
> +                                else:
> +                                    summary[k] = v
> +                            summary_anon += vma_anon
> +                            summary_file += vma_file
> +                        else:
> +                            vma_print(vma, pid)
> +                            stats_print(stats, vma_anon, vma_file, args.inc_empty)
> +        except FileNotFoundError:
> +            if not args.cgroup:
> +                raise
> +        except ProcessLookupError:
> +            if not args.cgroup:
> +                raise
> +
> +    if args.summary:
> +        stats_print(summary, summary_anon, summary_file, args.inc_empty)
> +
> +
> +def main():
> +    def formatter(prog):
> +        width = shutil.get_terminal_size().columns
> +        width -= 2
> +        width = min(80, width)
> +        return argparse.HelpFormatter(prog, width=width)
> +
> +    def size2order(human):
> +        units = {"K": 2**10, "M": 2**20, "G": 2**30}
> +        unit = 1
> +        if human[-1] in units:
> +            unit = units[human[-1]]
> +            human = human[:-1]
> +        try:
> +            size = int(human)
> +        except ValueError:
> +            raise ArgException('error: --cont value must be integer size with optional KMG unit')
> +        size *= unit
> +        order = int(math.log2(size / PAGE_SIZE))
> +        if order < 1:
> +            raise ArgException('error: --cont value must be size of at least 2 pages')
> +        if (1 << order) * PAGE_SIZE != size:
> +            raise ArgException('error: --cont value must be size of power-of-2 pages')
> +        return order
> +
> +    parser = argparse.ArgumentParser(formatter_class=formatter,
> +        description="""Prints information about how transparent huge pages are
> +                    mapped to a specified process or cgroup.
> +
> +                    Shows statistics for fully-mapped THPs of every size, mapped
> +                    both naturally aligned and unaligned for both file and
> +                    anonymous memory. See
> +                    [anon|file]-thp-[aligned|unaligned]-<size>kB keys.
> +
> +                    Shows statistics for mapped pages that belong to a THP but
> +                    which are not fully mapped. See [anon|file]-thp-partial
> +                    keys.
> +
> +                    Optionally shows statistics for naturally aligned,
> +                    contiguous blocks of memory of a specified size (when --cont
> +                    is provided). See [anon|file]-cont-aligned-<size>kB keys.
> +
> +                    Statistics are shown in kB and as a percentage of either
> +                    total anon or file memory as appropriate.""",
> +        epilog="""Requires root privilege to access pagemap and kpageflags.""")
> +
> +    parser.add_argument('--pid',
> +        metavar='pid', required=False, type=int,
> +        help="""Process id of the target process. Exactly one of --pid and
> +            --cgroup must be provided.""")
> +
> +    parser.add_argument('--cgroup',
> +        metavar='path', required=False,
> +        help="""Path to the target cgroup in sysfs. Iterates over every pid in
> +            the cgroup. Exactly one of --pid and --cgroup must be provided.""")
> +
> +    parser.add_argument('--summary',
> +        required=False, default=False, action='store_true',
> +        help="""Sum the per-vma statistics to provide a summary over the whole
> +            process or cgroup.""")
> +
> +    parser.add_argument('--cont',
> +        metavar='size[KMG]', required=False, default=[], action='append',
> +        help="""Adds anon and file stats for naturally aligned, contiguously
> +            mapped blocks of the specified size. May be issued multiple times to
> +            track multiple sized blocks. Useful to infer e.g. arm64 contpte and
> +            hpa mappings. Size must be a power-of-2 number of pages.""")
> +
> +    parser.add_argument('--inc-smaps',
> +        required=False, default=False, action='store_true',
> +        help="""Include all numerical, additive /proc/<pid>/smaps stats in the
> +            output.""")
> +
> +    parser.add_argument('--inc-empty',
> +        required=False, default=False, action='store_true',
> +        help="""Show all statistics including those whose value is 0.""")
> +
> +    parser.add_argument('--periodic',
> +        metavar='sleep_ms', required=False, type=int,
> +        help="""Run in a loop, polling every sleep_ms milliseconds.""")
> +
> +    args = parser.parse_args()
> +
> +    try:
> +        if (args.pid and args.cgroup) or \
> +        (not args.pid and not args.cgroup):
> +            raise ArgException("error: Exactly one of --pid and --cgroup must be provided.")
> +
> +        args.cont = [size2order(cont) for cont in args.cont]
> +    except ArgException as e:
> +        parser.print_usage()
> +        raise
> +
> +    if args.periodic:
> +        while True:
> +            do_main(args)
> +            print()
> +            time.sleep(args.periodic / 1000)
> +    else:
> +        do_main(args)
> +
> +
> +if __name__ == "__main__":
> +    try:
> +        main()
> +    except Exception as e:
> +        prog = os.path.basename(sys.argv[0])
> +        print(f'{prog}: {e}')
> +        exit(1)
> --
> 2.25.1
>
William Kucharski Jan. 3, 2024, 8:07 a.m. UTC | #2
> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote:
> 
> Hi Ryan,
> 
> I ran a couple of test cases with different parameters, it seems to
> work correctly.
> just i don't understand the below, what is the meaning of 000000ce at
> the beginning of
> each line?

It's the pid; 0xce is the specified pid, 206.

Perhaps the pid should be printed in decimal?

    -- William Kucharski

> /thpmaps  --pid 206 --cont 64K
> 000000ce 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00
> 00426969 /root/a.out
> 000000ce 0000aaaadbb3f000-0000aaaadbb40000 r--p 0000f000 fe:00
> 00426969 /root/a.out
> 000000ce 0000aaaadbb40000-0000aaaadbb41000 rw-p 00010000 fe:00
> 00426969 /root/a.out
> 000000ce 0000ffff702c0000-0000ffffb02c0000 rw-p 00000000 00:00 00000000
> anon-thp-aligned-64kB:            473920 kB (100%)
> anon-cont-aligned-64kB:           473920 kB (100%)
> 000000ce 0000ffffb02c0000-0000ffffb044c000 r-xp 00000000 fe:00
> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
> 000000ce 0000ffffb044c000-0000ffffb045d000 ---p 0018c000 fe:00
> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
> 000000ce 0000ffffb045d000-0000ffffb0460000 r--p 0018d000 fe:00
> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
> 000000ce 0000ffffb0460000-0000ffffb0462000 rw-p 00190000 fe:00
> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
> 000000ce 0000ffffb0462000-0000ffffb046f000 rw-p 00000000 00:00 00000000
> 000000ce 0000ffffb0477000-0000ffffb049d000 r-xp 00000000 fe:00
> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> 000000ce 0000ffffb04b0000-0000ffffb04b2000 rw-p 00000000 00:00 00000000
> 000000ce 0000ffffb04b2000-0000ffffb04b4000 r--p 00000000 00:00 00000000 [vvar]
> 000000ce 0000ffffb04b4000-0000ffffb04b5000 r-xp 00000000 00:00 00000000 [vdso]
> 000000ce 0000ffffb04b5000-0000ffffb04b7000 r--p 0002e000 fe:00
> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> 000000ce 0000ffffb04b7000-0000ffffb04b9000 rw-p 00030000 fe:00
> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> 000000ce 0000ffffdaba4000-0000ffffdabc5000 rw-p 00000000 00:00 00000000 [stack]
Ryan Roberts Jan. 3, 2024, 8:24 a.m. UTC | #3
On 03/01/2024 08:07, William Kucharski wrote:
> 
>> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote:
>>
>> Hi Ryan,
>>
>> I ran a couple of test cases with different parameters, it seems to
>> work correctly.
>> just i don't understand the below, what is the meaning of 000000ce at
>> the beginning of
>> each line?
> 
> It's the pid; 0xce is the specified pid, 206.

Yes indeed. I added the pid to the front for the case where you are using
--cgroup without --summary; in that case, each vma will be printed for each pid
in the cgroup and it seemed sensible to be able to see which pid each vma
belonged to.

> 
> Perhaps the pid should be printed in decimal?

I thought about printing in decimal, but every other value in the vma is in hex
without a leading "0x" (I'm trying to follow the smaps convention). So I thought
it could be more confusing in decimal.

I'm happy to change it to decimal if that's the preference though? Although I'd
like to continue to present it in a fixed width field, padded with 0s on the
left so that everything lines up.

> 
>     -- William Kucharski
> 
>> /thpmaps  --pid 206 --cont 64K
>> 000000ce 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00
>> 00426969 /root/a.out
>> 000000ce 0000aaaadbb3f000-0000aaaadbb40000 r--p 0000f000 fe:00
>> 00426969 /root/a.out
>> 000000ce 0000aaaadbb40000-0000aaaadbb41000 rw-p 00010000 fe:00
>> 00426969 /root/a.out
>> 000000ce 0000ffff702c0000-0000ffffb02c0000 rw-p 00000000 00:00 00000000
>> anon-thp-aligned-64kB:            473920 kB (100%)
>> anon-cont-aligned-64kB:           473920 kB (100%)
>> 000000ce 0000ffffb02c0000-0000ffffb044c000 r-xp 00000000 fe:00
>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>> 000000ce 0000ffffb044c000-0000ffffb045d000 ---p 0018c000 fe:00
>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>> 000000ce 0000ffffb045d000-0000ffffb0460000 r--p 0018d000 fe:00
>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>> 000000ce 0000ffffb0460000-0000ffffb0462000 rw-p 00190000 fe:00
>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>> 000000ce 0000ffffb0462000-0000ffffb046f000 rw-p 00000000 00:00 00000000
>> 000000ce 0000ffffb0477000-0000ffffb049d000 r-xp 00000000 fe:00
>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>> 000000ce 0000ffffb04b0000-0000ffffb04b2000 rw-p 00000000 00:00 00000000
>> 000000ce 0000ffffb04b2000-0000ffffb04b4000 r--p 00000000 00:00 00000000 [vvar]
>> 000000ce 0000ffffb04b4000-0000ffffb04b5000 r-xp 00000000 00:00 00000000 [vdso]
>> 000000ce 0000ffffb04b5000-0000ffffb04b7000 r--p 0002e000 fe:00
>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>> 000000ce 0000ffffb04b7000-0000ffffb04b9000 rw-p 00030000 fe:00
>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>> 000000ce 0000ffffdaba4000-0000ffffdabc5000 rw-p 00000000 00:00 00000000 [stack]
>
Barry Song Jan. 3, 2024, 9:16 a.m. UTC | #4
On Wed, Jan 3, 2024 at 9:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 03/01/2024 08:07, William Kucharski wrote:
> >
> >> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> Hi Ryan,
> >>
> >> I ran a couple of test cases with different parameters, it seems to
> >> work correctly.
> >> just i don't understand the below, what is the meaning of 000000ce at
> >> the beginning of
> >> each line?
> >
> > It's the pid; 0xce is the specified pid, 206.
>
> Yes indeed. I added the pid to the front for the case where you are using
> --cgroup without --summary; in that case, each vma will be printed for each pid
> in the cgroup and it seemed sensible to be able to see which pid each vma
> belonged to.

I don't understand why we have to add the pid before each line, as this tool
already takes the pid as a parameter :-) It seems like duplicated information
to me. But it doesn't matter too much, as this tool is really nice, though it
is not so easy to deploy on Android.

Please feel free to add,

Tested-by: Barry Song <v-songbaohua@oppo.com>

>
> >
> > Perhaps the pid should be printed in decimal?
>
> I thought about printing in decimal, but every other value in the vma is in hex
> without a leading "0x" (I'm trying to follow the smaps convention). So I thought
> it could be more confusing in decimal.
>
> I'm happy to change it to decimal if that's the preference though? Although I'd
> like to continue to present it in a fixed width field, padded with 0s on the
> left so that everything lines up.
>
> >
> >     -- William Kucharski
> >
> >> /thpmaps  --pid 206 --cont 64K
> >> 000000ce 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00
> >> 00426969 /root/a.out
> >> 000000ce 0000aaaadbb3f000-0000aaaadbb40000 r--p 0000f000 fe:00
> >> 00426969 /root/a.out
> >> 000000ce 0000aaaadbb40000-0000aaaadbb41000 rw-p 00010000 fe:00
> >> 00426969 /root/a.out
> >> 000000ce 0000ffff702c0000-0000ffffb02c0000 rw-p 00000000 00:00 00000000
> >> anon-thp-aligned-64kB:            473920 kB (100%)
> >> anon-cont-aligned-64kB:           473920 kB (100%)
> >> 000000ce 0000ffffb02c0000-0000ffffb044c000 r-xp 00000000 fe:00
> >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
> >> 000000ce 0000ffffb044c000-0000ffffb045d000 ---p 0018c000 fe:00
> >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
> >> 000000ce 0000ffffb045d000-0000ffffb0460000 r--p 0018d000 fe:00
> >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
> >> 000000ce 0000ffffb0460000-0000ffffb0462000 rw-p 00190000 fe:00
> >> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
> >> 000000ce 0000ffffb0462000-0000ffffb046f000 rw-p 00000000 00:00 00000000
> >> 000000ce 0000ffffb0477000-0000ffffb049d000 r-xp 00000000 fe:00
> >> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> >> 000000ce 0000ffffb04b0000-0000ffffb04b2000 rw-p 00000000 00:00 00000000
> >> 000000ce 0000ffffb04b2000-0000ffffb04b4000 r--p 00000000 00:00 00000000 [vvar]
> >> 000000ce 0000ffffb04b4000-0000ffffb04b5000 r-xp 00000000 00:00 00000000 [vdso]
> >> 000000ce 0000ffffb04b5000-0000ffffb04b7000 r--p 0002e000 fe:00
> >> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> >> 000000ce 0000ffffb04b7000-0000ffffb04b9000 rw-p 00030000 fe:00
> >> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
> >> 000000ce 0000ffffdaba4000-0000ffffdabc5000 rw-p 00000000 00:00 00000000 [stack]
> >

Thanks
Barry
Ryan Roberts Jan. 3, 2024, 9:35 a.m. UTC | #5
On 03/01/2024 09:16, Barry Song wrote:
> On Wed, Jan 3, 2024 at 9:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 03/01/2024 08:07, William Kucharski wrote:
>>>
>>>> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> Hi Ryan,
>>>>
>>>> I ran a couple of test cases with different parameters, it seems to
>>>> work correctly.
>>>> just i don't understand the below, what is the meaning of 000000ce at
>>>> the beginning of
>>>> each line?
>>>
>>> It's the pid; 0xce is the specified pid, 206.
>>
>> Yes indeed. I added the pid to the front for the case where you are using
>> --cgroup without --summary; in that case, each vma will be printed for each pid
>> in the cgroup and it seemed sensible to be able to see which pid each vma
>> belonged to.
> 
> I don't understand why we have to add the pid before each line as this tool
> already has pid in the parameter :-) 

The reason is that it is also possible to invoke the tool with --cgroup instead
of --pid. In this case, the tool will iterate over all the pids in the cgroup,
so (when --summary is not specified) having the pid associated with each vma is
useful.

I could change it to conditionally output the pid only when --cgroup is specified?
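
For example, something like this in vma_print() (just a sketch; the
show_pid parameter is invented here and would really be derived from
whether --cgroup was given):

    def vma_print(vma, pid, show_pid=True):
        prefix = f'{pid:08x} ' if show_pid else ''
        print(prefix + f'{vma.start:016x}-{vma.end:016x} {vma.name}')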

> this seems like duplicated information
> to me. but it doesn't matter too much as this tool is really nice though it is
> not so easy to deploy on Android.

Hmm. I've seen tutorials where people have Python running under Android, but I
agree it's not zero effort. Perhaps it would be better in C. Unfortunately, I
can't commit to doing a port at this point.

> 
> Please feel free to add,
> 
> Tested-by: Barry Song <v-songbaohua@oppo.com>

Thanks!

> 
>>
>>>
>>> Perhaps the pid should be printed in decimal?
>>
>> I thought about printing in decimal, but every other value in the vma is in hex
>> without a leading "0x" (I'm trying to follow the smaps convention). So I thought
>> it could be more confusing in decimal.
>>
>> I'm happy to change it to decimal if that's the preference though? Although I'd
>> like to continue to present it in a fixed width field, padded with 0s on the
>> left so that everything lines up.
>>
>>>
>>>     -- William Kucharski
>>>
>>>> /thpmaps  --pid 206 --cont 64K
>>>> 000000ce 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00
>>>> 00426969 /root/a.out
>>>> 000000ce 0000aaaadbb3f000-0000aaaadbb40000 r--p 0000f000 fe:00
>>>> 00426969 /root/a.out
>>>> 000000ce 0000aaaadbb40000-0000aaaadbb41000 rw-p 00010000 fe:00
>>>> 00426969 /root/a.out
>>>> 000000ce 0000ffff702c0000-0000ffffb02c0000 rw-p 00000000 00:00 00000000
>>>> anon-thp-aligned-64kB:            473920 kB (100%)
>>>> anon-cont-aligned-64kB:           473920 kB (100%)
>>>> 000000ce 0000ffffb02c0000-0000ffffb044c000 r-xp 00000000 fe:00
>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>> 000000ce 0000ffffb044c000-0000ffffb045d000 ---p 0018c000 fe:00
>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>> 000000ce 0000ffffb045d000-0000ffffb0460000 r--p 0018d000 fe:00
>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>> 000000ce 0000ffffb0460000-0000ffffb0462000 rw-p 00190000 fe:00
>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>> 000000ce 0000ffffb0462000-0000ffffb046f000 rw-p 00000000 00:00 00000000
>>>> 000000ce 0000ffffb0477000-0000ffffb049d000 r-xp 00000000 fe:00
>>>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>>>> 000000ce 0000ffffb04b0000-0000ffffb04b2000 rw-p 00000000 00:00 00000000
>>>> 000000ce 0000ffffb04b2000-0000ffffb04b4000 r--p 00000000 00:00 00000000 [vvar]
>>>> 000000ce 0000ffffb04b4000-0000ffffb04b5000 r-xp 00000000 00:00 00000000 [vdso]
>>>> 000000ce 0000ffffb04b5000-0000ffffb04b7000 r--p 0002e000 fe:00
>>>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>>>> 000000ce 0000ffffb04b7000-0000ffffb04b9000 rw-p 00030000 fe:00
>>>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>>>> 000000ce 0000ffffdaba4000-0000ffffdabc5000 rw-p 00000000 00:00 00000000 [stack]
>>>
> 
> Thanks
> Barry
William Kucharski Jan. 3, 2024, 10:09 a.m. UTC | #6
> On Jan 3, 2024, at 02:35, Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
> On 03/01/2024 09:16, Barry Song wrote:
>> On Wed, Jan 3, 2024 at 9:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>> 
>>> On 03/01/2024 08:07, William Kucharski wrote:
>>>> 
>>>>> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote:
>>>>> 
>>>>> Hi Ryan,
>>>>> 
>>>>> I ran a couple of test cases with different parameters, it seems to
>>>>> work correctly.
>>>>> just i don't understand the below, what is the meaning of 000000ce at
>>>>> the beginning of
>>>>> each line?
>>>> 
>>>> It's the pid; 0xce is the specified pid, 206.
>>> 
>>> Yes indeed. I added the pid to the front for the case where you are using
>>> --cgroup without --summary; in that case, each vma will be printed for each pid
>>> in the cgroup and it seemed sensible to be able to see which pid each vma
>>> belonged to.
>> 
>> I don't understand why we have to add the pid before each line as this tool
>> already has pid in the parameter :-)
> 
> The reason is that it is also possible to invoke the tool with --cgroup instead
> of --pid. In this case, the tool will iterate over all the pids in the cgroup so
> (when --summary is not specified) having the pid associated with each vma is useful.
> 
> I could change it to conditionally output the pid only when --cgroup is specified?

You could, or perhaps emit a colon after the pid to delineate it, e.g.:

> 000000ce: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:0000426969 /root/a.out

but then some people would probably read it as a memory address, so who knows.

    -- William Kucharski

> 
>> this seems like duplicated information
>> to me. but it doesn't matter too much as this tool is really nice though it is
>> not so easy to deploy on Android.
> 
> Hmm. I've seen tutorials where people have Python running under Android, but I
> agree its not zero effort. Perhaps it would be better in C. Unfortuantely, I
> can't commit to doing a port at this point.
> 
>> 
>> Please feel free to add,
>> 
>> Tested-by: Barry Song <v-songbaohua@oppo.com>
> 
> Thanks!
> 
>> 
>>> 
>>>> 
>>>> Perhaps the pid should be printed in decimal?
>>> 
>>> I thought about printing in decimal, but every other value in the vma is in hex
>>> without a leading "0x" (I'm trying to follow the smaps convention). So I thought
>>> it could be more confusing in decimal.
>>> 
>>> I'm happy to change it to decimal if that's the preference though? Although I'd
>>> like to continue to present it in a fixed width field, padded with 0s on the
>>> left so that everything lines up.
>>> 
>>>> 
>>>>    -- William Kucharski
>>>> 
>>>>> /thpmaps  --pid 206 --cont 64K
>>>>> 000000ce 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00
>>>>> 00426969 /root/a.out
>>>>> 000000ce 0000aaaadbb3f000-0000aaaadbb40000 r--p 0000f000 fe:00
>>>>> 00426969 /root/a.out
>>>>> 000000ce 0000aaaadbb40000-0000aaaadbb41000 rw-p 00010000 fe:00
>>>>> 00426969 /root/a.out
>>>>> 000000ce 0000ffff702c0000-0000ffffb02c0000 rw-p 00000000 00:00 00000000
>>>>> anon-thp-aligned-64kB:            473920 kB (100%)
>>>>> anon-cont-aligned-64kB:           473920 kB (100%)
>>>>> 000000ce 0000ffffb02c0000-0000ffffb044c000 r-xp 00000000 fe:00
>>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>>> 000000ce 0000ffffb044c000-0000ffffb045d000 ---p 0018c000 fe:00
>>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>>> 000000ce 0000ffffb045d000-0000ffffb0460000 r--p 0018d000 fe:00
>>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>>> 000000ce 0000ffffb0460000-0000ffffb0462000 rw-p 00190000 fe:00
>>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>>> 000000ce 0000ffffb0462000-0000ffffb046f000 rw-p 00000000 00:00 00000000
>>>>> 000000ce 0000ffffb0477000-0000ffffb049d000 r-xp 00000000 fe:00
>>>>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>>>>> 000000ce 0000ffffb04b0000-0000ffffb04b2000 rw-p 00000000 00:00 00000000
>>>>> 000000ce 0000ffffb04b2000-0000ffffb04b4000 r--p 00000000 00:00 00000000 [vvar]
>>>>> 000000ce 0000ffffb04b4000-0000ffffb04b5000 r-xp 00000000 00:00 00000000 [vdso]
>>>>> 000000ce 0000ffffb04b5000-0000ffffb04b7000 r--p 0002e000 fe:00
>>>>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>>>>> 000000ce 0000ffffb04b7000-0000ffffb04b9000 rw-p 00030000 fe:00
>>>>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>>>>> 000000ce 0000ffffdaba4000-0000ffffdabc5000 rw-p 00000000 00:00 00000000 [stack]
>>>> 
>> 
>> Thanks
>> Barry
> 
>
Ryan Roberts Jan. 3, 2024, 10:20 a.m. UTC | #7
On 03/01/2024 10:09, William Kucharski wrote:
> 
> 
>> On Jan 3, 2024, at 02:35, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 03/01/2024 09:16, Barry Song wrote:
>>> On Wed, Jan 3, 2024 at 9:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 03/01/2024 08:07, William Kucharski wrote:
>>>>>
>>>>>> On Jan 2, 2024, at 23:44, Barry Song <21cnbao@gmail.com> wrote:
>>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> I ran a couple of test cases with different parameters, it seems to
>>>>>> work correctly.
>>>>>> just i don't understand the below, what is the meaning of 000000ce at
>>>>>> the beginning of
>>>>>> each line?
>>>>>
>>>>> It's the pid; 0xce is the specified pid, 206.
>>>>
>>>> Yes indeed. I added the pid to the front for the case where you are using
>>>> --cgroup without --summary; in that case, each vma will be printed for each pid
>>>> in the cgroup and it seemed sensible to be able to see which pid each vma
>>>> belonged to.
>>>
>>> I don't understand why we have to add the pid before each line as this tool
>>> already has pid in the parameter :-)
>>
>> The reason is that it is also possible to invoke the tool with --cgroup instead
>> of --pid. In this case, the tool will iterate over all the pids in the cgroup so
>> (when --summary is not specified) having the pid associated with each vma is useful.
>>
>> I could change it to conditionally output the pid only when --cgroup is specified?
> 
> You could, or perhaps emit a colon after the pid to delineate it, e.g.:
> 
>> 000000ce: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:0000426969 /root/a.out

Yeah that sounds like the least worst option. Let's go with that.

> 
> but then some people would probably read it as a memory address, so who knows.
> 
>     -- William Kucharski
> 
>>
>>> this seems like duplicated information
>>> to me. but it doesn't matter too much as this tool is really nice though it is
>>> not so easy to deploy on Android.
>>
>> Hmm. I've seen tutorials where people have Python running under Android, but I
>> agree its not zero effort. Perhaps it would be better in C. Unfortuantely, I
>> can't commit to doing a port at this point.
>>
>>>
>>> Please feel free to add,
>>>
>>> Tested-by: Barry Song <v-songbaohua@oppo.com>
>>
>> Thanks!
>>
>>>
>>>>
>>>>>
>>>>> Perhaps the pid should be printed in decimal?
>>>>
>>>> I thought about printing in decimal, but every other value in the vma is in hex
>>>> without a leading "0x" (I'm trying to follow the smaps convention). So I thought
>>>> it could be more confusing in decimal.
>>>>
>>>> I'm happy to change it to decimal if that's the preference though? Although I'd
>>>> like to continue to present it in a fixed width field, padded with 0s on the
>>>> left so that everything lines up.
>>>>
>>>>>
>>>>>    -- William Kucharski
>>>>>
>>>>>> /thpmaps  --pid 206 --cont 64K
>>>>>> 000000ce 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00
>>>>>> 00426969 /root/a.out
>>>>>> 000000ce 0000aaaadbb3f000-0000aaaadbb40000 r--p 0000f000 fe:00
>>>>>> 00426969 /root/a.out
>>>>>> 000000ce 0000aaaadbb40000-0000aaaadbb41000 rw-p 00010000 fe:00
>>>>>> 00426969 /root/a.out
>>>>>> 000000ce 0000ffff702c0000-0000ffffb02c0000 rw-p 00000000 00:00 00000000
>>>>>> anon-thp-aligned-64kB:            473920 kB (100%)
>>>>>> anon-cont-aligned-64kB:           473920 kB (100%)
>>>>>> 000000ce 0000ffffb02c0000-0000ffffb044c000 r-xp 00000000 fe:00
>>>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>>>> 000000ce 0000ffffb044c000-0000ffffb045d000 ---p 0018c000 fe:00
>>>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>>>> 000000ce 0000ffffb045d000-0000ffffb0460000 r--p 0018d000 fe:00
>>>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>>>> 000000ce 0000ffffb0460000-0000ffffb0462000 rw-p 00190000 fe:00
>>>>>> 00395429 /usr/lib/aarch64-linux-gnu/libc.so.6
>>>>>> 000000ce 0000ffffb0462000-0000ffffb046f000 rw-p 00000000 00:00 00000000
>>>>>> 000000ce 0000ffffb0477000-0000ffffb049d000 r-xp 00000000 fe:00
>>>>>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>>>>>> 000000ce 0000ffffb04b0000-0000ffffb04b2000 rw-p 00000000 00:00 00000000
>>>>>> 000000ce 0000ffffb04b2000-0000ffffb04b4000 r--p 00000000 00:00 00000000 [vvar]
>>>>>> 000000ce 0000ffffb04b4000-0000ffffb04b5000 r-xp 00000000 00:00 00000000 [vdso]
>>>>>> 000000ce 0000ffffb04b5000-0000ffffb04b7000 r--p 0002e000 fe:00
>>>>>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>>>>>> 000000ce 0000ffffb04b7000-0000ffffb04b9000 rw-p 00030000 fe:00
>>>>>> 00393893 /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
>>>>>> 000000ce 0000ffffdaba4000-0000ffffdabc5000 rw-p 00000000 00:00 00000000 [stack]
>>>>>
>>>
>>> Thanks
>>> Barry
>>
>>
>
John Hubbard Jan. 4, 2024, 10:48 p.m. UTC | #8
On 1/3/24 02:20, Ryan Roberts wrote:
> On 03/01/2024 10:09, William Kucharski wrote:
...
>>> The reason is that it is also possible to invoke the tool with --cgroup instead
>>> of --pid. In this case, the tool will iterate over all the pids in the cgroup so
>>> (when --summary is not specified) having the pid associated with each vma is useful.
>>>
>>> I could change it to conditionally output the pid only when --cgroup is specified?
>>
>> You could, or perhaps emit a colon after the pid to delineate it, e.g.:
>>
>>> 000000ce: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:0000426969 /root/a.out
> 
> Yeah that sounds like the least worst option. Let's go with that.

I'm trying this out and had the exact same issue with pid. I'd suggest:

a) pid should always be printed in decimal, because that's what ps(1) uses
    and no one expects to see it in other formats such as hex.

b) In fact, perhaps a header row would help. There could be a --no-header-row
    option for cases that want to feed this to other scripts, but the default
    would be to include a human-friendly header.

c) pid should probably be suppressed if --pid is specified, but that's
    less important than the other points.

In a day or two I'll get a chance to run this on something that allocates
lots of mTHPs, and give a closer look.


thanks,
Ryan Roberts Jan. 5, 2024, 8:35 a.m. UTC | #9
On 04/01/2024 22:48, John Hubbard wrote:
> On 1/3/24 02:20, Ryan Roberts wrote:
>> On 03/01/2024 10:09, William Kucharski wrote:
> ...
>>>> The reason is that it is also possible to invoke the tool with --cgroup instead
>>>> of --pid. In this case, the tool will iterate over all the pids in the
>>>> cgroup so
>>>> (when --summary is not specified) having the pid associated with each vma is
>>>> useful.
>>>>
>>>> I could change it to conditionally output the pid only when --cgroup is
>>>> specified?
>>>
>>> You could, or perhaps emit a colon after the pid to delineate it, e.g.:
>>>
>>>> 000000ce: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:0000426969
>>>> /root/a.out
>>
>> Yeah that sounds like the least worst option. Let's go with that.
> 
> I'm trying this out and had the exact same issue with pid. I'd suggest:
> 
> a) pid should always be printed in decimal, because that's what ps(1) uses
>    and no one expects to see it in other formats such as hex.

right aligned with 0 or ' ' as the pad? I guess ' ' if you want it to look like
ps? But given pid is the first column, I think it will look weird right aligned.
Perhaps left aligned, followed by colon, followed by pad? Here are the 3 options:

00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
     206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
206:      0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969

My personal preference is the first option; right aligned with 0 pad.
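
(For the record, a rough sketch of the format specs that would produce the three
variants above; illustrative only, not what the script currently does:)

    f'{pid:08}: '        # 00000206: right aligned, 0 pad
    f'{pid:8}: '         #      206: right aligned, ' ' pad
    f'{pid}:'.ljust(10)  # 206:      left aligned, pad after the colon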

> 
> b) In fact, perhaps a header row would help. There could be a --no-header-row
>    option for cases that want to feed this to other scripts, but the default
>    would be to include a human-friendly header.

How about this for a header (with example first data row):

     PID             START              END PROT      OFF MJ:MN    INODE FILE
00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969

Personally I wouldn't bother with a --no-header option; just keep it always on.

> 
> c) pid should probably be suppressed if --pid is specified, but that's
>    less important than the other points.

If we have the header then I think it's clear what it is, and I'd prefer to keep
the data format consistent between --pid and --cgroup. So I prefer to leave the
pid in always.

> 
> In a day or two I'll get a chance to run this on something that allocates
> lots of mTHPs, and give a closer look.

Thanks - it would be great to get some feedback on the usefulness of the actual
counters! :)

I'm considering adding an --ignore-folio-boundaries option, which would modify
the way the cont counters work, to only look for contiguity and alignment and
ignore any folio boundaries. At the moment, if you have multiple contiguous
folios, they don't count, because the memory doesn't all belong to the same
folio. I think this could be useful in some (limited) circumstances.
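
(Very roughly - a sketch only, with ignore_folio_boundaries being the
hypothetical new flag - the heads[] test in cont_parse would just become
conditional:)

    while index_next + nr_cont <= index_end:
        folio_boundary = heads[index_next+1:index_next+nr_cont].any()
        if ignore_folio_boundaries or not folio_boundary:
            if anons[index_next]:
                nr_anon += nr_cont
            else:
                nr_file += nr_cont
        index_next += nr_cont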

> 
> 
> thanks,
Ryan Roberts Jan. 5, 2024, 8:40 a.m. UTC | #10
On 02/01/2024 15:38, Ryan Roberts wrote:
> With the proliferation of large folios for file-backed memory, and more
> recently the introduction of multi-size THP for anonymous memory, it is
> becoming useful to be able to see exactly how large folios are mapped
> into processes. For some architectures (e.g. arm64), if most memory is
> mapped using contpte-sized and -aligned blocks, TLB usage can be
> optimized so it's useful to see where these requirements are and are not
> being met.
> 
> thpmaps is a Python utility that reads /proc/<pid>/smaps,
> /proc/<pid>/pagemap and /proc/kpageflags to print information about how
> transparent huge pages (both file and anon) are mapped to a specified
> process or cgroup. It aims to help users debug and optimize their
> workloads. In future we may wish to introduce stats directly into the
> kernel (e.g. smaps or similar), but for now this provides a short term
> solution without the need to introduce any new ABI.
> 
> Run with help option for a full listing of the arguments:
> 
>     # thpmaps --help
> 
> --8<--
> usage: thpmaps [-h] [--pid pid] [--cgroup path] [--summary]
>                [--cont size[KMG]] [--inc-smaps] [--inc-empty]
>                [--periodic sleep_ms]
> 
> Prints information about how transparent huge pages are mapped to a
> specified process or cgroup. Shows statistics for fully-mapped THPs of
> every size, mapped both naturally aligned and unaligned for both file
> and anonymous memory. See [anon|file]-thp-[aligned|unaligned]-<size>kB
> keys. Shows statistics for mapped pages that belong to a THP but which
> are not fully mapped. See [anon|file]-thp-partial keys. Optionally
> shows statistics for naturally aligned, contiguous blocks of memory of
> a specified size (when --cont is provided). See [anon|file]-cont-
> aligned-<size>kB keys. Statistics are shown in kB and as a percentage
> of either total anon or file memory as appropriate.
> 
> options:
>   -h, --help           show this help message and exit
>   --pid pid            Process id of the target process. Exactly one of
>                        --pid and --cgroup must be provided.
>   --cgroup path        Path to the target cgroup in sysfs. Iterates
>                        over every pid in the cgroup. Exactly one of
>                        --pid and --cgroup must be provided.
>   --summary            Sum the per-vma statistics to provide a summary
>                        over the whole process or cgroup.
>   --cont size[KMG]     Adds anon and file stats for naturally aligned,
>                        contiguously mapped blocks of the specified
>                        size. May be issued multiple times to track
>                        multiple sized blocks. Useful to infer e.g.
>                        arm64 contpte and hpa mappings. Size must be a
>                        power-of-2 number of pages.
>   --inc-smaps          Include all numerical, additive
>                        /proc/<pid>/smaps stats in the output.
>   --inc-empty          Show all statistics including those whose value
>                        is 0.
>   --periodic sleep_ms  Run in a loop, polling every sleep_ms
>                        milliseconds.
> 
> Requires root privilege to access pagemap and kpageflags.
> --8<--
> 
> Example command to summarise fully and partially mapped THPs and 64K
> contiguous blocks over all VMAs in a single process (--inc-empty forces
> printing stats that are 0):
> 
>     # ./thpmaps --pid 10837 --cont 64K --summary --inc-empty
> 
> --8<--
> anon-thp-aligned-16kB:                16 kB ( 0%)
> anon-thp-aligned-32kB:                 0 kB ( 0%)
> anon-thp-aligned-64kB:           4194304 kB (100%)
> anon-thp-aligned-128kB:                0 kB ( 0%)
> anon-thp-aligned-256kB:                0 kB ( 0%)
> anon-thp-aligned-512kB:                0 kB ( 0%)
> anon-thp-aligned-1024kB:               0 kB ( 0%)
> anon-thp-aligned-2048kB:               0 kB ( 0%)
> anon-thp-unaligned-16kB:               0 kB ( 0%)
> anon-thp-unaligned-32kB:               0 kB ( 0%)
> anon-thp-unaligned-64kB:               0 kB ( 0%)
> anon-thp-unaligned-128kB:              0 kB ( 0%)
> anon-thp-unaligned-256kB:              0 kB ( 0%)
> anon-thp-unaligned-512kB:              0 kB ( 0%)
> anon-thp-unaligned-1024kB:             0 kB ( 0%)
> anon-thp-unaligned-2048kB:             0 kB ( 0%)
> anon-thp-partial:                      0 kB ( 0%)
> file-thp-aligned-16kB:                16 kB ( 1%)
> file-thp-aligned-32kB:                64 kB ( 5%)
> file-thp-aligned-64kB:               640 kB (50%)
> file-thp-aligned-128kB:              128 kB (10%)
> file-thp-aligned-256kB:                0 kB ( 0%)
> file-thp-aligned-512kB:                0 kB ( 0%)
> file-thp-aligned-1024kB:               0 kB ( 0%)
> file-thp-aligned-2048kB:               0 kB ( 0%)
> file-thp-unaligned-16kB:              16 kB ( 1%)
> file-thp-unaligned-32kB:              32 kB ( 3%)
> file-thp-unaligned-64kB:              64 kB ( 5%)
> file-thp-unaligned-128kB:              0 kB ( 0%)
> file-thp-unaligned-256kB:              0 kB ( 0%)
> file-thp-unaligned-512kB:              0 kB ( 0%)
> file-thp-unaligned-1024kB:             0 kB ( 0%)
> file-thp-unaligned-2048kB:             0 kB ( 0%)
> file-thp-partial:                     12 kB ( 1%)
> anon-cont-aligned-64kB:          4194304 kB (100%)
> file-cont-aligned-64kB:              768 kB (61%)
> --8<--
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> 
> I've found this very useful for debugging, and I know others have requested a
> way to check if mTHP and contpte are working, so thought this might be a good short
> term solution until we figure out how best to add stats in the kernel?
> 
> Thanks,
> Ryan

I found a minor bug and a change I plan to make in the next version. Just FYI:


> 
>  tools/mm/Makefile |   9 +-
>  tools/mm/thpmaps  | 573 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 578 insertions(+), 4 deletions(-)
>  create mode 100755 tools/mm/thpmaps
> 
> diff --git a/tools/mm/Makefile b/tools/mm/Makefile
> index 1c5606cc3334..7bb03606b9ea 100644
> --- a/tools/mm/Makefile
> +++ b/tools/mm/Makefile
> @@ -3,7 +3,8 @@
>  #
>  include ../scripts/Makefile.include
> 
> -TARGETS=page-types slabinfo page_owner_sort
> +BUILD_TARGETS=page-types slabinfo page_owner_sort
> +INSTALL_TARGETS = $(BUILD_TARGETS) thpmaps
> 
>  LIB_DIR = ../lib/api
>  LIBS = $(LIB_DIR)/libapi.a
> @@ -11,9 +12,9 @@ LIBS = $(LIB_DIR)/libapi.a
>  CFLAGS += -Wall -Wextra -I../lib/ -pthread
>  LDFLAGS += $(LIBS) -pthread
> 
> -all: $(TARGETS)
> +all: $(BUILD_TARGETS)
> 
> -$(TARGETS): $(LIBS)
> +$(BUILD_TARGETS): $(LIBS)
> 
>  $(LIBS):
>  	make -C $(LIB_DIR)
> @@ -29,4 +30,4 @@ sbindir ?= /usr/sbin
> 
>  install: all
>  	install -d $(DESTDIR)$(sbindir)
> -	install -m 755 -p $(TARGETS) $(DESTDIR)$(sbindir)
> +	install -m 755 -p $(INSTALL_TARGETS) $(DESTDIR)$(sbindir)
> diff --git a/tools/mm/thpmaps b/tools/mm/thpmaps
> new file mode 100755
> index 000000000000..af9b19f63eb4
> --- /dev/null
> +++ b/tools/mm/thpmaps
> @@ -0,0 +1,573 @@
> +#!/usr/bin/env python3
> +# SPDX-License-Identifier: GPL-2.0-only
> +# Copyright (C) 2024 ARM Ltd.
> +#
> +# Utility providing smaps-like output detailing transparent hugepage usage.
> +# For more info, run:
> +# ./thpmaps --help
> +#
> +# Requires numpy:
> +# pip3 install numpy
> +
> +
> +import argparse
> +import collections
> +import math
> +import os
> +import re
> +import resource
> +import shutil
> +import sys
> +import time
> +import numpy as np
> +
> +
> +with open('/sys/kernel/mm/transparent_hugepage/hpage_pmd_size') as f:
> +    PAGE_SIZE = resource.getpagesize()
> +    PAGE_SHIFT = int(math.log2(PAGE_SIZE))
> +    PMD_SIZE = int(f.read())
> +    PMD_ORDER = int(math.log2(PMD_SIZE / PAGE_SIZE))
> +
> +
> +def align_forward(v, a):
> +    return (v + (a - 1)) & ~(a - 1)
> +
> +
> +def align_offset(v, a):
> +    return v & (a - 1)
> +
> +
> +def nrkb(nr):
> +    # Convert number of pages to KB.
> +    return (nr << PAGE_SHIFT) >> 10
> +
> +
> +def odkb(order):
> +    # Convert page order to KB.
> +    return nrkb(1 << order)
> +
> +
> +def cont_ranges_all(arrs):
> +    # Given a list of arrays, find the ranges for which values are monotonically
> +    # incrementing in all arrays.
> +    assert(len(arrs) > 0)
> +    sz = len(arrs[0])
> +    for arr in arrs:
> +        assert(arr.shape == (sz,))
> +    r = np.full(sz, 2)
> +    d = np.diff(arrs[0]) == 1
> +    for dd in [np.diff(arr) == 1 for arr in arrs[1:]]:
> +        d &= dd
> +    r[1:] -= d
> +    r[:-1] -= d
> +    return [np.repeat(arr, r).reshape(-1, 2) for arr in arrs]
> +
> +
> +class ArgException(Exception):
> +    pass
> +
> +
> +class FileIOException(Exception):
> +    pass
> +
> +
> +class BinArrayFile:
> +    # Base class used to read /proc/<pid>/pagemap and /proc/kpageflags into a
> +    # numpy array. Use an inherited class in a with clause to ensure the file is
> +    # closed when it goes out of scope.
> +    def __init__(self, filename, element_size):
> +        self.element_size = element_size
> +        self.filename = filename
> +        self.fd = os.open(self.filename, os.O_RDONLY)
> +
> +    def cleanup(self):
> +        os.close(self.fd)
> +
> +    def __enter__(self):
> +        return self
> +
> +    def __exit__(self, exc_type, exc_val, exc_tb):
> +        self.cleanup()
> +
> +    def _readin(self, offset, buffer):
> +        length = os.preadv(self.fd, (buffer,), offset)
> +        if len(buffer) != length:
> +            raise FileIOException('error: {} failed to read {} bytes at {:x}'
> +                            .format(self.filename, len(buffer), offset))
> +
> +    def _toarray(self, buf):
> +        assert(self.element_size == 8)
> +        return np.frombuffer(buf, dtype=np.uint64)
> +
> +    def getv(self, vec):
> +        sz = 0
> +        for region in vec:
> +            sz += int(region[1] - region[0] + 1) * self.element_size
> +        buf = bytearray(sz)
> +        view = memoryview(buf)
> +        pos = 0
> +        for region in vec:
> +            offset = int(region[0]) * self.element_size
> +            length = int(region[1] - region[0] + 1) * self.element_size
> +            self._readin(offset, view[pos:pos+length])
> +            pos += length
> +        return self._toarray(buf)
> +
> +    def get(self, index, nr=1):
> +        offset = index * self.element_size
> +        length = nr * self.element_size
> +        buf = bytearray(length)
> +        self._readin(offset, buf)
> +        return self._toarray(buf)
> +
> +
> +PM_PAGE_PRESENT = 1 << 63
> +PM_PFN_MASK = (1 << 55) - 1
> +
> +class PageMap(BinArrayFile):
> +    # Read ranges of a given pid's pagemap into a numpy array.
> +    def __init__(self, pid='self'):
> +        super().__init__(f'/proc/{pid}/pagemap', 8)
> +
> +
> +KPF_ANON = 1 << 12
> +KPF_COMPOUND_HEAD = 1 << 15
> +KPF_COMPOUND_TAIL = 1 << 16
> +
> +class KPageFlags(BinArrayFile):
> +    # Read ranges of /proc/kpageflags into a numpy array.
> +    def __init__(self):
> +         super().__init__(f'/proc/kpageflags', 8)
> +
> +
> +VMA = collections.namedtuple('VMA', [
> +    'name',
> +    'start',
> +    'end',
> +    'read',
> +    'write',
> +    'execute',
> +    'private',
> +    'pgoff',
> +    'major',
> +    'minor',
> +    'inode',
> +    'stats',
> +])
> +
> +class VMAList:
> +    # A container for VMAs, parsed from /proc/<pid>/smaps. Iterate over the
> +    # instance to receive VMAs.
> +    head_regex = re.compile(r"^([\da-f]+)-([\da-f]+) ([r-])([w-])([x-])([ps]) ([\da-f]+) ([\da-f]+):([\da-f]+) ([\da-f]+)\s*(.*)$")
> +    kb_item_regex = re.compile(r"(\w+):\s*(\d+)\s*kB")
> +
> +    def __init__(self, pid='self'):
> +        def is_vma(line):
> +            return self.head_regex.search(line) != None
> +
> +        def get_vma(line):
> +            m = self.head_regex.match(line)
> +            if m is None:
> +                return None
> +            return VMA(
> +                name=m.group(11),
> +                start=int(m.group(1), 16),
> +                end=int(m.group(2), 16),
> +                read=m.group(3) == 'r',
> +                write=m.group(4) == 'w',
> +                execute=m.group(5) == 'x',
> +                private=m.group(6) == 'p',
> +                pgoff=int(m.group(7), 16),
> +                major=int(m.group(8), 16),
> +                minor=int(m.group(9), 16),
> +                inode=int(m.group(10), 16),
> +                stats={},
> +            )
> +
> +        def get_value(line):
> +            # Currently only handle the KB stats because they are summed for
> +            # --summary. Core code doesn't know how to combine other stats.
> +            exclude = ['KernelPageSize', 'MMUPageSize']
> +            m = self.kb_item_regex.search(line)
> +            if m:
> +                param = m.group(1)
> +                if param not in exclude:
> +                    value = int(m.group(2))
> +                    return param, value
> +            return None, None
> +
> +        def parse_smaps(file):
> +            vmas = []
> +            i = 0
> +
> +            line = file.readline()
> +
> +            while True:
> +                if not line:
> +                    break
> +                line = line.strip()
> +
> +                i += 1
> +
> +                vma = get_vma(line)
> +                if vma is None:
> +                    raise FileIOException(f'error: could not parse line {i}: "{line}"')
> +
> +                while True:
> +                    line = file.readline()
> +                    if not line:
> +                        break
> +                    line = line.strip()
> +                    if is_vma(line):
> +                        break
> +
> +                    i += 1
> +
> +                    param, value = get_value(line)
> +                    if param:
> +                        vma.stats[param] = {'type': None, 'value': value}
> +
> +                vmas.append(vma)
> +
> +            return vmas
> +
> +        with open(f'/proc/{pid}/smaps', 'r') as file:
> +            self.vmas = parse_smaps(file)
> +
> +    def __iter__(self):
> +        yield from self.vmas
> +
> +
> +def thp_parse(max_order, kpageflags, vfns, pfns, anons, heads):
> +    # Given 4 same-sized arrays representing a range within a page table backed
> +    # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons:
> +    # True if page is anonymous, heads: True if page is head of a THP), return a
> +    # dictionary of statistics describing the mapped THPs.
> +    stats = {
> +        'file': {
> +            'partial': 0,
> +            'aligned': [0] * (max_order + 1),
> +            'unaligned': [0] * (max_order + 1),
> +        },
> +        'anon': {
> +            'partial': 0,
> +            'aligned': [0] * (max_order + 1),
> +            'unaligned': [0] * (max_order + 1),
> +        },
> +    }
> +
> +    indexes = np.arange(len(vfns), dtype=np.uint64)
> +    ranges = cont_ranges_all([indexes, vfns, pfns])
> +    for rindex, rpfn in zip(ranges[0], ranges[2]):
> +        index_next = int(rindex[0])
> +        index_end = int(rindex[1]) + 1
> +        pfn_end = int(rpfn[1]) + 1
> +
> +        folios = indexes[index_next:index_end][heads[index_next:index_end]]
> +
> +        # Account pages for any partially mapped THP at the front. In that case,
> +        # the first page of the range is a tail.
> +        nr = (int(folios[0]) if len(folios) else index_end) - index_next
> +        stats['anon' if anons[index_next] else 'file']['partial'] += nr
> +
> +        # Account pages for any partially mapped THP at the back. In that case,
> +        # the next page after the range is a tail.
> +        if len(folios):
> +            flags = int(kpageflags.get(pfn_end)[0])
> +            if flags & KPF_COMPOUND_TAIL:
> +                nr = index_end - int(folios[-1])
> +                folios = folios[:-1]
> +                index_end -= nr
> +                stats['anon' if anons[index_end - 1] else 'file']['partial'] += nr
> +
> +        # Account fully mapped THPs in the middle of the range.
> +        if len(folios):
> +            folio_nrs = np.append(np.diff(folios), np.uint64(index_end - folios[-1]))
> +            folio_orders = np.log2(folio_nrs).astype(np.uint64)
> +            for index, order in zip(folios, folio_orders):
> +                index = int(index)
> +                order = int(order)
> +                nr = 1 << order
> +                vfn = int(vfns[index])
> +                align = 'aligned' if align_forward(vfn, nr) == vfn else 'unaligned'
> +                anon = 'anon' if anons[index] else 'file'
> +                stats[anon][align][order] += nr
> +
> +    rstats = {}
> +
> +    def flatten_sub(type, subtype, stats):
> +        for od, nr in enumerate(stats[2:], 2):
> +            rstats[f"{type}-thp-{subtype}-{odkb(od)}kB"] = {'type': type, 'value': nrkb(nr)}
> +
> +    def flatten_type(type, stats):
> +        flatten_sub(type, 'aligned', stats['aligned'])
> +        flatten_sub(type, 'unaligned', stats['unaligned'])
> +        rstats[f"{type}-thp-partial"] = {'type': type, 'value': nrkb(stats['partial'])}
> +
> +    flatten_type('anon', stats['anon'])
> +    flatten_type('file', stats['file'])
> +
> +    return rstats
> +
> +
> +def cont_parse(order, vfns, pfns, anons, heads):
> +    # Given 4 same-sized arrays representing a range within a page table backed
> +    # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons:
> +    # True if page is anonymous, heads: True if page is head of a THP), return a
> +    # dictionary of statistics describing the contiguous blocks.
> +    nr_cont = 1 << order
> +    nr_anon = 0
> +    nr_file = 0
> +
> +    ranges = cont_ranges_all([np.arange(len(vfns), dtype=np.uint64), vfns, pfns])
> +    for rindex, rvfn, rpfn in zip(*ranges):
> +        index_next = int(rindex[0])
> +        index_end = int(rindex[1]) + 1
> +        vfn_start = int(rvfn[0])
> +        pfn_start = int(rpfn[0])
> +
> +        if align_offset(pfn_start, nr_cont) != align_offset(vfn_start, nr_cont):
> +            continue
> +
> +        off = align_forward(vfn_start, nr_cont) - vfn_start
> +        index_next += off
> +
> +        while index_next + nr_cont <= index_end:
> +            folio_boundary = heads[index_next+1:index_next+nr_cont].any()
> +            if not folio_boundary:
> +                if anons[index_next]:
> +                    nr_anon += nr_cont
> +                else:
> +                    nr_file += nr_cont
> +            index_next += nr_cont
> +
> +    return {
> +        f"anon-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'anon', 'value': nrkb(nr_anon)},
> +        f"file-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'file', 'value': nrkb(nr_file)},
> +    }
> +
> +
> +def vma_print(vma, pid):
> +    # Prints a VMA instance in a format similar to smaps. The main difference is
> +    # that the pid is included as the first value.
> +    print("{:08x} {:016x}-{:016x} {}{}{}{} {:08x} {:02x}:{:02x} {:08x} {}"
> +        .format(
> +            pid, vma.start, vma.end,
> +            'r' if vma.read else '-', 'w' if vma.write else '-',
> +            'x' if vma.execute else '-', 'p' if vma.private else 's',
> +            vma.pgoff, vma.major, vma.minor, vma.inode, vma.name
> +        ))
> +
> +
> +def stats_print(stats, tot_anon, tot_file, inc_empty):
> +    # Print a statistics dictionary.
> +    label_field = 32
> +    for label, stat in stats.items():
> +        type = stat['type']
> +        value = stat['value']
> +        if value or inc_empty:
> +            pad = max(0, label_field - len(label) - 1)
> +            if type == 'anon':
> +                percent = f' ({value / tot_anon:3.0%})'
> +            elif type == 'file':
> +                percent = f' ({value / tot_file:3.0%})'
> +            else:
> +                percent = ''
> +            print(f"{label}:{' ' * pad}{value:8} kB{percent}")
> +
> +
> +def vma_parse(vma, pagemap, kpageflags, contorders):
> +    # Generate thp and cont statistics for a single VMA.
> +    start = vma.start >> PAGE_SHIFT
> +    end = vma.end >> PAGE_SHIFT
> +
> +    pmes = pagemap.get(start, end - start)
> +    present = pmes & PM_PAGE_PRESENT != 0
> +    pfns = pmes & PM_PFN_MASK
> +    pfns = pfns[present]
> +    vfns = np.arange(start, end, dtype=np.uint64)
> +    vfns = vfns[present]
> +
> +    flags = kpageflags.getv(cont_ranges_all([pfns])[0])
> +    anons = flags & KPF_ANON != 0
> +    heads = flags & KPF_COMPOUND_HEAD != 0
> +    tails = flags & KPF_COMPOUND_TAIL != 0
> +    thps = heads | tails
> +
> +    tot_anon = np.count_nonzero(anons)
> +    tot_file = np.size(anons) - tot_anon
> +    tot_anon = nrkb(tot_anon)
> +    tot_file = nrkb(tot_file)
> +
> +    vfns = vfns[thps]
> +    pfns = pfns[thps]
> +    anons = anons[thps]
> +    heads = heads[thps]
> +
> +    thpstats = thp_parse(PMD_ORDER, kpageflags, vfns, pfns, anons, heads)
> +    contstats = [cont_parse(order, vfns, pfns, anons, heads) for order in contorders]
> +
> +    return {
> +        **thpstats,
> +        **{k: v for s in contstats for k, v in s.items()}
> +    }, tot_anon, tot_file
> +
> +
> +def do_main(args):
> +    pids = set()
> +    summary = {}
> +    summary_anon = 0
> +    summary_file = 0
> +
> +    if args.cgroup:
> +        with open(f'{args.cgroup}/cgroup.procs') as pidfile:
> +            for line in pidfile.readlines():
> +                pids.add(int(line.strip()))
> +    else:
> +        pids.add(args.pid)
> +
> +    for pid in pids:
> +        try:
> +            with PageMap(pid) as pagemap:
> +                with KPageFlags() as kpageflags:
> +                    for vma in VMAList(pid):
> +                        if (vma.read or vma.write or vma.execute) and vma.stats['Rss']['value'] > 0:
> +                            stats, vma_anon, vma_file = vma_parse(vma, pagemap, kpageflags, args.cont)
> +                        else:
> +                            stats = {}
> +                            vma_anon = 0
> +                            vma_file = 0
> +                        if args.inc_smaps:
> +                            stats = {**vma.stats, **stats}
> +                        if args.summary:
> +                            for k, v in stats.items():
> +                                if k in summary:
> +                                    assert(summary[k]['type'] == v['type'])
> +                                    summary[k]['value'] += v['value']
> +                                else:
> +                                    summary[k] = v
> +                            summary_anon += vma_anon
> +                            summary_file += vma_file
> +                        else:
> +                            vma_print(vma, pid)
> +                            stats_print(stats, vma_anon, vma_file, args.inc_empty)
> +        except FileNotFoundError:
> +            if not args.cgroup:
> +                raise
> +        except ProcessLookupError:
> +            if not args.cgroup:
> +                raise

It turns out that reading pagemap will return 0 bytes if the process exits after
the file is opened. So we need to add a handler here to recover from the race in
the --cgroup case:

        except FileIOException:
            if not args.cgroup:
                raise
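
(Or equivalently, since all three cases want the same recovery, the handlers
could be collapsed into one; just a sketch:)

        except (FileNotFoundError, ProcessLookupError, FileIOException):
            if not args.cgroup:
                raise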

> +
> +    if args.summary:
> +        stats_print(summary, summary_anon, summary_file, args.inc_empty)
> +
> +
> +def main():
> +    def formatter(prog):
> +        width = shutil.get_terminal_size().columns
> +        width -= 2
> +        width = min(80, width)
> +        return argparse.HelpFormatter(prog, width=width)
> +
> +    def size2order(human):
> +        units = {"K": 2**10, "M": 2**20, "G": 2**30}

nit: Linux convention seems to be case-invariant, so kmg are equivalent to KMG.
Will do the same.
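
(i.e. roughly this in size2order; a sketch only:)

        units = {"k": 2**10, "m": 2**20, "g": 2**30}
        unit = 1
        if human[-1].lower() in units:
            unit = units[human[-1].lower()]
            human = human[:-1]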

Thanks,
Ryan

> +        unit = 1
> +        if human[-1] in units:
> +            unit = units[human[-1]]
> +            human = human[:-1]
> +        try:
> +            size = int(human)
> +        except ValueError:
> +            raise ArgException('error: --cont value must be integer size with optional KMG unit')
> +        size *= unit
> +        order = int(math.log2(size / PAGE_SIZE))
> +        if order < 1:
> +            raise ArgException('error: --cont value must be size of at least 2 pages')
> +        if (1 << order) * PAGE_SIZE != size:
> +            raise ArgException('error: --cont value must be size of power-of-2 pages')
> +        return order
> +
> +    parser = argparse.ArgumentParser(formatter_class=formatter,
> +        description="""Prints information about how transparent huge pages are
> +                    mapped to a specified process or cgroup.
> +
> +                    Shows statistics for fully-mapped THPs of every size, mapped
> +                    both naturally aligned and unaligned for both file and
> +                    anonymous memory. See
> +                    [anon|file]-thp-[aligned|unaligned]-<size>kB keys.
> +
> +                    Shows statistics for mapped pages that belong to a THP but
> +                    which are not fully mapped. See [anon|file]-thp-partial
> +                    keys.
> +
> +                    Optionally shows statistics for naturally aligned,
> +                    contiguous blocks of memory of a specified size (when --cont
> +                    is provided). See [anon|file]-cont-aligned-<size>kB keys.
> +
> +                    Statistics are shown in kB and as a percentage of either
> +                    total anon or file memory as appropriate.""",
> +        epilog="""Requires root privilege to access pagemap and kpageflags.""")
> +
> +    parser.add_argument('--pid',
> +        metavar='pid', required=False, type=int,
> +        help="""Process id of the target process. Exactly one of --pid and
> +            --cgroup must be provided.""")
> +
> +    parser.add_argument('--cgroup',
> +        metavar='path', required=False,
> +        help="""Path to the target cgroup in sysfs. Iterates over every pid in
> +            the cgroup. Exactly one of --pid and --cgroup must be provided.""")
> +
> +    parser.add_argument('--summary',
> +        required=False, default=False, action='store_true',
> +        help="""Sum the per-vma statistics to provide a summary over the whole
> +            process or cgroup.""")
> +
> +    parser.add_argument('--cont',
> +        metavar='size[KMG]', required=False, default=[], action='append',
> +        help="""Adds anon and file stats for naturally aligned, contiguously
> +            mapped blocks of the specified size. May be issued multiple times to
> +            track multiple sized blocks. Useful to infer e.g. arm64 contpte and
> +            hpa mappings. Size must be a power-of-2 number of pages.""")
> +
> +    parser.add_argument('--inc-smaps',
> +        required=False, default=False, action='store_true',
> +        help="""Include all numerical, additive /proc/<pid>/smaps stats in the
> +            output.""")
> +
> +    parser.add_argument('--inc-empty',
> +        required=False, default=False, action='store_true',
> +        help="""Show all statistics including those whose value is 0.""")
> +
> +    parser.add_argument('--periodic',
> +        metavar='sleep_ms', required=False, type=int,
> +        help="""Run in a loop, polling every sleep_ms milliseconds.""")
> +
> +    args = parser.parse_args()
> +
> +    try:
> +        if (args.pid and args.cgroup) or \
> +        (not args.pid and not args.cgroup):
> +            raise ArgException("error: Exactly one of --pid and --cgroup must be provided.")
> +
> +        args.cont = [size2order(cont) for cont in args.cont]
> +    except ArgException as e:
> +        parser.print_usage()
> +        raise
> +
> +    if args.periodic:
> +        while True:
> +            do_main(args)
> +            print()
> +            time.sleep(args.periodic / 1000)
> +    else:
> +        do_main(args)
> +
> +
> +if __name__ == "__main__":
> +    try:
> +        main()
> +    except Exception as e:
> +        prog = os.path.basename(sys.argv[0])
> +        print(f'{prog}: {e}')
> +        exit(1)
> --
> 2.25.1
>
William Kucharski Jan. 5, 2024, 11:30 a.m. UTC | #11
Personally I like either of these:

00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
     206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969

With a header that looks something like this; I suspect the formatting will get
mangled in email anyway:

   PID         START             END        PROT   OFF    MJ:MN  INODE      FILE
00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969

    -- William Kucharski

> On Jan 5, 2024, at 01:35, Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
> On 04/01/2024 22:48, John Hubbard wrote:
>> On 1/3/24 02:20, Ryan Roberts wrote:
>>> On 03/01/2024 10:09, William Kucharski wrote:
>> ...
>>>>> The reason is that it is also possible to invoke the tool with --cgroup instead
>>>>> of --pid. In this case, the tool will iterate over all the pids in the
>>>>> cgroup so
>>>>> (when --summary is not specified) having the pid associated with each vma is
>>>>> useful.
>>>>> 
>>>>> I could change it to conditionally output the pid only when --cgroup is
>>>>> specified?
>>>> 
>>>> You could, or perhaps emit a colon after the pid to delineate it, e.g.:
>>>> 
>>>>> 000000ce: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:0000426969
>>>>> /root/a.out
>>> 
>>> Yeah that sounds like the least worst option. Let's go with that.
>> 
>> I'm trying this out and had the exact same issue with pid. I'd suggest:
>> 
>> a) pid should always be printed in decimal, because that's what ps(1) uses
>>    and no one expects to see it in other formats such as hex.
> 
> right aligned with 0 or ' ' as the pad? I guess ' ' if you want it to look like
> ps? But given pid is the first column, I think it will look weird right aligned.
> Perhaps left aligned, followed by colon, followed by pad? Here are the 3 options:
> 
> 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
>     206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
> 206:      0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
> 
> My personal preference is the first option; right aligned with 0 pad.
> 
>> 
>> b) In fact, perhaps a header row would help. There could be a --no-header-row
>>    option for cases that want to feed this to other scripts, but the default
>>    would be to include a human-friendly header.
> 
> How about this for a header (with example first data row):
> 
>     PID             START              END PROT      OFF MJ:MN    INODE FILE
> 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
> 
> Personally I wouldn't bother with a --no-header option; just keep it always on.
> 
>> 
>> c) pid should probably be suppressed if --pid is specified, but that's
>>    less important than the other points.
> 
> If we have the header then I think its clear what it is and I'd prefer to keep
> the data format consistent between --pid and --cgroup. So prefer to leave pid in
> always.
> 
>> 
>> In a day or two I'll get a chance to run this on something that allocates
>> lots of mTHPs, and give a closer look.
> 
> Thanks - it would be great to get some feedback on the usefulness of the actual
> counters! :)
> 
> I'm considering adding an --ignore-folio-boundaries option, which would modify
> the way the cont counters work, to only look for contiguity and alignment and
> ignore any folio boundaries. At the moment, if you have multiple contiguous
> folios, they don't count, because the memory doesn't all belong to the same
> folio. I think this could be useful in some (limited) circumstances.
> 
>> 
>> 
>> thanks,
John Hubbard Jan. 5, 2024, 11:07 p.m. UTC | #12
On 1/5/24 03:30, William Kucharski wrote:
> Personally I like either of these:
> 
> 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
>       206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969

I like that second one, because that's how pids are often printed.

> 
> With a header that looks something like this; I suspect the formatting will get
> mangled in email anyway:
> 
>     PID         START             END        PROT   OFF    MJ:MN  INODE      FILE
> 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969

The MJ:MN is mysterious, but this looks good otherwise.

thanks,
John Hubbard Jan. 5, 2024, 11:18 p.m. UTC | #13
On 1/5/24 00:35, Ryan Roberts wrote:
> right aligned with 0 or ' ' as the pad? I guess ' ' if you want it to look like
> ps? But given pid is the first column, I think it will look weird right aligned.
> Perhaps left aligned, followed by colon, followed by pad? Here are the 3 options:

I will leave all of the alignment to your judgment and good taste. I'm sure
it will be fine.

(I'm not trying to make the output look like ps(1). I'm trying to make the pid
look like it "often" looks, and I used ps(1) as an example.)

> 
> 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
>       206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969

Sure.

> 206:      0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
> 
> My personal preference is the first option; right aligned with 0 pad.
> 
>>
>> b) In fact, perhaps a header row would help. There could be a --no-header-row
>>     option for cases that want to feed this to other scripts, but the default
>>     would be to include a human-friendly header.
> 
> How about this for a header (with example first data row):
> 
>       PID             START              END PROT      OFF MJ:MN    INODE FILE

I need to go look up what MJ:MN means, and then see if there is a
less mysterious column name.

> 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
> 
> Personally I wouldn't bother with a --no-header option; just keep it always on.
> 
>>
>> c) pid should probably be suppressed if --pid is specified, but that's
>>     less important than the other points.
> 
> If we have the header then I think its clear what it is and I'd prefer to keep
> the data format consistent between --pid and --cgroup. So prefer to leave pid in
> always.
>

That sounds reasonable to me.
  
>>
>> In a day or two I'll get a chance to run this on something that allocates
>> lots of mTHPs, and give a closer look.
> 
> Thanks - it would be great to get some feedback on the usefulness of the actual
> counters! :)

Working on it!

> 
> I'm considering adding an --ignore-folio-boundaries option, which would modify
> the way the cont counters work, to only look for contiguity and alignment and
> ignore any folio boundaries. At the moment, if you have multiple contiguous
> folios, they don't count, because the memory doesn't all belong to the same
> folio. I think this could be useful in some (limited) circumstances.
> 

This sounds both potentially useful, and yet obscure, so I'd suggest waiting
until you see a usecase. And then include the usecase (even if just a comment),
so that it explains both how to use it, and why it's useful.

thanks,
John Hubbard Jan. 10, 2024, 3:34 a.m. UTC | #14
On 1/2/24 07:38, Ryan Roberts wrote:
> With the proliferation of large folios for file-backed memory, and more
> recently the introduction of multi-size THP for anonymous memory, it is
> becoming useful to be able to see exactly how large folios are mapped
> into processes. For some architectures (e.g. arm64), if most memory is
> mapped using contpte-sized and -aligned blocks, TLB usage can be
> optimized so it's useful to see where these requirements are and are not
> being met.
> 
> thpmaps is a Python utility that reads /proc/<pid>/smaps,
> /proc/<pid>/pagemap and /proc/kpageflags to print information about how
> transparent huge pages (both file and anon) are mapped to a specified
> process or cgroup. It aims to help users debug and optimize their
> workloads. In future we may wish to introduce stats directly into the
> kernel (e.g. smaps or similar), but for now this provides a short term
> solution without the need to introduce any new ABI.
> 
...
> I've found this very useful for debugging, and I know others have requested a
> way to check if mTHP and contpte are working, so thought this might be a good short
> term solution until we figure out how best to add stats in the kernel?
> 

Hi Ryan,

One thing that immediately came up during some recent testing of mTHP
on arm64: the pid requirement is sometimes a little awkward. I'm running
tests on a machine at a time for now, inside various containers and
such, and it would be nice if there were an easy way to get some numbers
for the mTHPs across the whole machine.

I'm not sure if that changes anything about thpmaps here. Probably
this is fine as-is. But I wanted to give some initial reactions from
just some quick runs: the global state would be convenient.

thanks,
Barry Song Jan. 10, 2024, 3:51 a.m. UTC | #15
On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 1/2/24 07:38, Ryan Roberts wrote:
> > With the proliferation of large folios for file-backed memory, and more
> > recently the introduction of multi-size THP for anonymous memory, it is
> > becoming useful to be able to see exactly how large folios are mapped
> > into processes. For some architectures (e.g. arm64), if most memory is
> > mapped using contpte-sized and -aligned blocks, TLB usage can be
> > optimized so it's useful to see where these requirements are and are not
> > being met.
> >
> > thpmaps is a Python utility that reads /proc/<pid>/smaps,
> > /proc/<pid>/pagemap and /proc/kpageflags to print information about how
> > transparent huge pages (both file and anon) are mapped to a specified
> > process or cgroup. It aims to help users debug and optimize their
> > workloads. In future we may wish to introduce stats directly into the
> > kernel (e.g. smaps or similar), but for now this provides a short term
> > solution without the need to introduce any new ABI.
> >
> ...
> > I've found this very useful for debugging, and I know others have requested a
> > way to check if mTHP and contpte are working, so thought this might be a good short
> > term solution until we figure out how best to add stats in the kernel?
> >
>
> Hi Ryan,
>
> One thing that immediately came up during some recent testing of mTHP
> on arm64: the pid requirement is sometimes a little awkward. I'm running
> tests on a machine at a time for now, inside various containers and
> such, and it would be nice if there were an easy way to get some numbers
> for the mTHPs across the whole machine.
>
> I'm not sure if that changes anything about thpmaps here. Probably
> this is fine as-is. But I wanted to give some initial reactions from
> just some quick runs: the global state would be convenient.

+1. But this seems to be impossible by scanning pagemap?
So may we add these statistics in the kernel, just like
/proc/meminfo, or in a separate /proc/mthp_info?

>
> thanks,
> --
> John Hubbard
> NVIDIA

Thanks
barry
John Hubbard Jan. 10, 2024, 4:15 a.m. UTC | #16
On 1/9/24 19:51, Barry Song wrote:
> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
...
>> Hi Ryan,
>>
>> One thing that immediately came up during some recent testing of mTHP
>> on arm64: the pid requirement is sometimes a little awkward. I'm running
>> tests on a machine at a time for now, inside various containers and
>> such, and it would be nice if there were an easy way to get some numbers
>> for the mTHPs across the whole machine.
>>
>> I'm not sure if that changes anything about thpmaps here. Probably
>> this is fine as-is. But I wanted to give some initial reactions from
>> just some quick runs: the global state would be convenient.
> 
> +1. but this seems to be impossible by scanning pagemap?
> so may we add this statistics information in kernel just like
> /proc/meminfo or a separate /proc/mthp_info?
> 

Yes. From my perspective, it looks like the global stats are more useful
initially, and the more detailed per-pid or per-cgroup stats are the
next level of investigation. So feels odd to start with the more
detailed stats.

However, Ryan did clearly say, above, "In future we may wish to
introduce stats directly into the kernel (e.g. smaps or similar)". And
earlier he ran into some pushback on trying to set up /proc or /sys
values because this is still such an early feature.

I wonder if we could put the global stats in debugfs for now? That's
specifically supposed to be a "we promise *not* to keep this ABI stable"
location.


thanks,
Barry Song Jan. 10, 2024, 8:02 a.m. UTC | #17
On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 1/9/24 19:51, Barry Song wrote:
> > On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
> ...
> >> Hi Ryan,
> >>
> >> One thing that immediately came up during some recent testing of mTHP
> >> on arm64: the pid requirement is sometimes a little awkward. I'm running
> >> tests on a machine at a time for now, inside various containers and
> >> such, and it would be nice if there were an easy way to get some numbers
> >> for the mTHPs across the whole machine.
> >>
> >> I'm not sure if that changes anything about thpmaps here. Probably
> >> this is fine as-is. But I wanted to give some initial reactions from
> >> just some quick runs: the global state would be convenient.
> >
> > +1. but this seems to be impossible by scanning pagemap?
> > so may we add this statistics information in kernel just like
> > /proc/meminfo or a separate /proc/mthp_info?
> >
>
> Yes. From my perspective, it looks like the global stats are more useful
> initially, and the more detailed per-pid or per-cgroup stats are the
> next level of investigation. So feels odd to start with the more
> detailed stats.
>

Probably because this can be done without modification of the kernel.
The detailed per-pid or per-cgroup info is still quite useful in my case, in which
we set mTHP enabled/disabled and allowed sizes according to vma types,
e.g. libc_malloc, java heaps etc.

Different vma types can have different anon_name. So I can use the detailed
info to find out if specific VMAs have gotten mTHP properly and how many
they have gotten.

> However, Ryan did clearly say, above, "In future we may wish to
> introduce stats directly into the kernel (e.g. smaps or similar)". And
> earlier he ran into some pushback on trying to set up /proc or /sys
> values because this is still such an early feature.
>
> I wonder if we could put the global stats in debugfs for now? That's
> specifically supposed to be a "we promise *not* to keep this ABI stable"
> location.

+1.

>
>
> thanks,
> --
> John Hubbard
> NVIDIA
>

Thanks
Barry
Ryan Roberts Jan. 10, 2024, 8:43 a.m. UTC | #18
On 05/01/2024 23:18, John Hubbard wrote:
> On 1/5/24 00:35, Ryan Roberts wrote:
>> right aligned with 0 or ' ' as the pad? I guess ' ' if you want it to look like
>> ps? But given pid is the first column, I think it will look weird right aligned.
>> Perhaps left aligned, followed by colon, followed by pad? Here are the 3 options:
> 
> I will leave all of the alignment to your judgment and good taste. I'm sure
> it will be fine.
> 
> (I'm not trying to make the output look like ps(1). I'm trying to make the pid
> look like it "often" looks, and I used ps(1) as an example.)
> 
>>
>> 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969

I'm going to go with this version ^

>>       206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
> 
> Sure.
> 
>> 206:      0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
>>
>> My personal preference is the first option; right aligned with 0 pad.
>>
>>>
>>> b) In fact, perhaps a header row would help. There could be a --no-header-row
>>>     option for cases that want to feed this to other scripts, but the default
>>>     would be to include a human-friendly header.
>>
>> How about this for a header (with example first data row):
>>
>>       PID             START              END PROT      OFF MJ:MN    INODE FILE
> 
> I need to go look up with the MJ:MN means, and then see if there is a
> less mysterious column name.

It's the device major/minor number. I could just call it DEV (DEVICE is too long).
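
e.g. the header would then read something like:

     PID             START              END PROT      OFF   DEV    INODE FILE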

> 
>> 00000206: 0000aaaadbb20000-0000aaaadbb21000 r-xp 00000000 fe:00 00426969
>>
>> Personally I wouldn't bother with a --no-header option; just keep it always on.
>>
>>>
>>> c) pid should probably be suppressed if --pid is specified, but that's
>>>     less important than the other points.
>>
>> If we have the header then I think its clear what it is and I'd prefer to keep
>> the data format consistent between --pid and --cgroup. So prefer to leave pid in
>> always.
>>
> 
> That sounds reasonable to me.
>  
>>>
>>> In a day or two I'll get a chance to run this on something that allocates
>>> lots of mTHPs, and give a closer look.
>>
>> Thanks - it would be great to get some feedback on the usefulness of the actual
>> counters! :)
> 
> Working on it!
> 
>>
>> I'm considering adding an --ignore-folio-boundaries option, which would modify
>> the way the cont counters work, to only look for contiguity and alignment and
>> ignore any folio boundaries. At the moment, if you have multiple contiguous
>> folios, they don't count, because the memory doesn't all belong to the same
>> folio. I think this could be useful in some (limited) circumstances.
>>
> 
> This sounds both potentially useful, and yet obscure, so I'd suggest waiting
> until you see a usecase. And then include the usecase (even if just a comment),
> so that it explains both how to use it, and why it's useful.
> 
> thanks,
Ryan Roberts Jan. 10, 2024, 8:58 a.m. UTC | #19
On 10/01/2024 08:02, Barry Song wrote:
> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 1/9/24 19:51, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
>> ...
>>>> Hi Ryan,
>>>>
>>>> One thing that immediately came up during some recent testing of mTHP
>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
>>>> tests on a machine at a time for now, inside various containers and
>>>> such, and it would be nice if there were an easy way to get some numbers
>>>> for the mTHPs across the whole machine.

Just to confirm, you're expecting these "global" stats to be truly global and not
per-container? (asking because you explicitly mentioned being in a container).
If you want per-container, then you can probably just create the container in a
cgroup?
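
e.g. something along these lines (with a made-up cgroup path):

    # ./thpmaps --cgroup /sys/fs/cgroup/<container> --summary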

>>>>
>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>> just some quick runs: the global state would be convenient.

Thanks for taking this for a spin! Appreciate the feedback.

>>>
>>> +1. but this seems to be impossible by scanning pagemap?
>>> so may we add this statistics information in kernel just like
>>> /proc/meminfo or a separate /proc/mthp_info?
>>>
>>
>> Yes. From my perspective, it looks like the global stats are more useful
>> initially, and the more detailed per-pid or per-cgroup stats are the
>> next level of investigation. So feels odd to start with the more
>> detailed stats.
>>
> 
> probably because this can be done without the modification of the kernel.

Yes indeed, as John said in an earlier thread, my previous attempts to add stats
directly in the kernel got pushback; DavidH was concerned that we don't really
know exactly how to account mTHPs yet
(whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding
the wrong ABI and having to maintain it forever. There has also been some
pushback regarding adding more values to multi-value files in sysfs, so David
was suggesting coming up with a whole new scheme at some point (I know
/proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups
do live in sysfs).

Anyway, this script was my attempt to 1) provide a short term solution to the
"we need some stats" request and 2) provide a context in which to explore what
the right stats are - this script can evolve without the ABI problem.

> The detailed per-pid or per-cgroup is still quite useful to my case in which
> we set mTHP enabled/disabled and allowed sizes according to vma types,
> eg. libc_malloc, java heaps etc.
> 
> Different vma types can have different anon_name. So I can use the detailed
> info to find out if specific VMAs have gotten mTHP properly and how many
> they have gotten.
> 
>> However, Ryan did clearly say, above, "In future we may wish to
>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>> earlier he ran into some pushback on trying to set up /proc or /sys
>> values because this is still such an early feature.
>>
>> I wonder if we could put the global stats in debugfs for now? That's
>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>> location.

Now that I think about it, I wonder if we can add a --global mode to the script
(or just infer global when neither --pid nor --cgroup are provided). I think I
should be able to determine all the physical memory ranges from /proc/iomem,
then grab all the info we need from /proc/kpageflags. We should then be able to
process it all in much the same way as for --pid/--cgroup and provide the same
stats, but it will apply globally. What do you think?
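
(Rough, untested sketch of the /proc/iomem part; it needs root anyway, since the
addresses read as zero for non-root users:)

    def system_ram_pfn_ranges():
        # Yield (start_pfn, end_pfn_exclusive) for each "System RAM" range.
        with open('/proc/iomem') as f:
            for line in f:
                if line.strip().endswith('System RAM'):
                    span = line.split(':')[0].strip()
                    start, end = (int(x, 16) for x in span.split('-'))
                    yield (start >> PAGE_SHIFT, (end + 1) >> PAGE_SHIFT)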

If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script
for now.

> 
> +1.
> 
>>
>>
>> thanks,
>> --
>> John Hubbard
>> NVIDIA
>>
> 
> Thanks
> Barry
Barry Song Jan. 10, 2024, 9:09 a.m. UTC | #20
On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 10/01/2024 08:02, Barry Song wrote:
> > On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>
> >> On 1/9/24 19:51, Barry Song wrote:
> >>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >> ...
> >>>> Hi Ryan,
> >>>>
> >>>> One thing that immediately came up during some recent testing of mTHP
> >>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
> >>>> tests on a machine at a time for now, inside various containers and
> >>>> such, and it would be nice if there were an easy way to get some numbers
> >>>> for the mTHPs across the whole machine.
>
> Just to confirm, you're expecting these "global" stats to be truly global and not
> per-container? (asking because you explicitly mentioned being in a container).
> If you want per-container, then you can probably just create the container in a
> cgroup?
>
> >>>>
> >>>> I'm not sure if that changes anything about thpmaps here. Probably
> >>>> this is fine as-is. But I wanted to give some initial reactions from
> >>>> just some quick runs: the global state would be convenient.
>
> Thanks for taking this for a spin! Appreciate the feedback.
>
> >>>
> >>> +1. but this seems to be impossible by scanning pagemap?
> >>> so may we add this statistics information in kernel just like
> >>> /proc/meminfo or a separate /proc/mthp_info?
> >>>
> >>
> >> Yes. From my perspective, it looks like the global stats are more useful
> >> initially, and the more detailed per-pid or per-cgroup stats are the
> >> next level of investigation. So feels odd to start with the more
> >> detailed stats.
> >>
> >
> > probably because this can be done without the modification of the kernel.
>
> Yes indeed, as John said in an earlier thread, my previous attempts to add stats
> directly in the kernel got pushback; DavidH was concerned that we don't really
> know exactly how to account mTHPs yet
> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding
> the wrong ABI and having to maintain it forever. There has also been some
> pushback regarding adding more values to multi-value files in sysfs, so David
> was suggesting coming up with a whole new scheme at some point (I know
> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups
> do live in sysfs).
>
> Anyway, this script was my attempt to 1) provide a short term solution to the
> "we need some stats" request and 2) provide a context in which to explore what
> the right stats are - this script can evolve without the ABI problem.
>
> > The detailed per-pid or per-cgroup is still quite useful to my case in which
> > we set mTHP enabled/disabled and allowed sizes according to vma types,
> > eg. libc_malloc, java heaps etc.
> >
> > Different vma types can have different anon_name. So I can use the detailed
> > info to find out if specific VMAs have gotten mTHP properly and how many
> > they have gotten.
> >
> >> However, Ryan did clearly say, above, "In future we may wish to
> >> introduce stats directly into the kernel (e.g. smaps or similar)". And
> >> earlier he ran into some pushback on trying to set up /proc or /sys
> >> values because this is still such an early feature.
> >>
> >> I wonder if we could put the global stats in debugfs for now? That's
> >> specifically supposed to be a "we promise *not* to keep this ABI stable"
> >> location.
>
> Now that I think about it, I wonder if we can add a --global mode to the script
> (or just infer global when neither --pid nor --cgroup are provided). I think I
> should be able to determine all the physical memory ranges from /proc/iomem,
> then grab all the info we need from /proc/kpageflags. We should then be able to
> process it all in much the same way as for --pid/--cgroup and provide the same
> stats, but it will apply globally. What do you think?

For debug purposes it should be good. But imagine there is a health monitor
which needs to sample the stats of large folios online and periodically; this
might be too expensive.

>
> If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script
> for now.
>
> >
> > +1.
> >
> >>
> >>
> >> thanks,
> >> --
> >> John Hubbard
> >> NVIDIA
> >>
> >

Thanks
Barry
Ryan Roberts Jan. 10, 2024, 9:20 a.m. UTC | #21
On 10/01/2024 09:09, Barry Song wrote:
> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 10/01/2024 08:02, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>
>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>>> ...
>>>>>> Hi Ryan,
>>>>>>
>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>> such, and it would be nice if there were an easy way to get some numbers
>>>>>> for the mTHPs across the whole machine.
>>
>> Just to confirm, you're expecting these "global" stats be truely global and not
>> per-container? (asking because you exploicitly mentioned being in a container).
>> If you want per-container, then you can probably just create the container in a
>> cgroup?
>>
>>>>>>
>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>> just some quick runs: the global state would be convenient.
>>
>> Thanks for taking this for a spin! Appreciate the feedback.
>>
>>>>>
>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>> so may we add this statistics information in kernel just like
>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>
>>>>
>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>> next level of investigation. So feels odd to start with the more
>>>> detailed stats.
>>>>
>>>
>>> probably because this can be done without the modification of the kernel.
>>
>> Yes indeed, as John said in an earlier thread, my previous attempts to add stats
>> directly in the kernel got pushback; DavidH was concerned that we don't really
>> know exectly how to account mTHPs yet
>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding
>> the wrong ABI and having to maintain it forever. There has also been some
>> pushback regarding adding more values to multi-value files in sysfs, so David
>> was suggesting coming up with a whole new scheme at some point (I know
>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups
>> do live in sysfs).
>>
>> Anyway, this script was my attempt to 1) provide a short term solution to the
>> "we need some stats" request and 2) provide a context in which to explore what
>> the right stats are - this script can evolve without the ABI problem.
>>
>>> The detailed per-pid or per-cgroup is still quite useful to my case in which
>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>> eg. libc_malloc, java heaps etc.
>>>
>>> Different vma types can have different anon_name. So I can use the detailed
>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>> they have gotten.
>>>
>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>> values because this is still such an early feature.
>>>>
>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>> location.
>>
>> Now that I think about it, I wonder if we can add a --global mode to the script
>> (or just infer global when neither --pid nor --cgroup are provided). I think I
>> should be able to determine all the physical memory ranges from /proc/iomem,
>> then grab all the info we need from /proc/kpageflags. We should then be able to
>> process it all in much the same way as for --pid/--cgroup and provide the same
>> stats, but it will apply globally. What do you think?
> 
> for debug purposes, it should be good. imaging there is a health
> monitor which needs
> to sample the stats of large folios online and periodically, this
> might be too expensive.

Yes, understood - the long term aim needs to be to get stats into the kernel.
This is intended as a step to help make that happen.

> 
>>
>> If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script
>> for now.
>>
>>>
>>> +1.
>>>
>>>>
>>>>
>>>> thanks,
>>>> --
>>>> John Hubbard
>>>> NVIDIA
>>>>
>>>
> 
> Thanks
> Barry
Ryan Roberts Jan. 10, 2024, 10:23 a.m. UTC | #22
On 10/01/2024 09:09, Barry Song wrote:
> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 10/01/2024 08:02, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>
>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>>> ...
>>>>>> Hi Ryan,
>>>>>>
>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>> such, and it would be nice if there were an easy way to get some numbers
>>>>>> for the mTHPs across the whole machine.
>>
>> Just to confirm, you're expecting these "global" stats be truely global and not
>> per-container? (asking because you exploicitly mentioned being in a container).
>> If you want per-container, then you can probably just create the container in a
>> cgroup?
>>
>>>>>>
>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>> just some quick runs: the global state would be convenient.
>>
>> Thanks for taking this for a spin! Appreciate the feedback.
>>
>>>>>
>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>> so may we add this statistics information in kernel just like
>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>
>>>>
>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>> next level of investigation. So feels odd to start with the more
>>>> detailed stats.
>>>>
>>>
>>> probably because this can be done without the modification of the kernel.
>>
>> Yes indeed, as John said in an earlier thread, my previous attempts to add stats
>> directly in the kernel got pushback; DavidH was concerned that we don't really
>> know exectly how to account mTHPs yet
>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding
>> the wrong ABI and having to maintain it forever. There has also been some
>> pushback regarding adding more values to multi-value files in sysfs, so David
>> was suggesting coming up with a whole new scheme at some point (I know
>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups
>> do live in sysfs).
>>
>> Anyway, this script was my attempt to 1) provide a short term solution to the
>> "we need some stats" request and 2) provide a context in which to explore what
>> the right stats are - this script can evolve without the ABI problem.
>>
>>> The detailed per-pid or per-cgroup is still quite useful to my case in which
>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>> eg. libc_malloc, java heaps etc.
>>>
>>> Different vma types can have different anon_name. So I can use the detailed
>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>> they have gotten.
>>>
>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>> values because this is still such an early feature.
>>>>
>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>> location.
>>
>> Now that I think about it, I wonder if we can add a --global mode to the script
>> (or just infer global when neither --pid nor --cgroup are provided). I think I
>> should be able to determine all the physical memory ranges from /proc/iomem,
>> then grab all the info we need from /proc/kpageflags. We should then be able to
>> process it all in much the same way as for --pid/--cgroup and provide the same
>> stats, but it will apply globally. What do you think?

Having now thought about this for a few mins (in the shower, if anyone wants the
complete picture :) ), this won't quite work. This approach doesn't have the
virtual mapping information so the best it can do is tell us "how many of each
size of THP are allocated?" - it doesn't tell us anything about whether they are
fully or partially mapped or what their alignment is (all necessary if we want
to know if they are contpte-mapped). So I don't think this approach is going to
be particularly useful.

And this is also the big problem if we want to gather stats inside the kernel;
if we want something equivalent to /proc/meminfo's
AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
allocation of the THP but also whether it is mapped. That's easy for
PMD-mappings, because there is only one entry to consider - when you set it, you
increment the number of PMD-mapped THPs, when you clear it, you decrement. But
for PTE-mappings it's harder; you know the size when you are mapping so it's easy
to increment, but you can do a partial unmap, so you would need to scan the PTEs
to figure out if we are unmapping the first page of a previously
fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
determine "is this folio fully and contiguously mapped in at least one process?".

So depending on what global stats you actually need, the route to getting them
cheaply may not be easy. (My previous attempt to add stats cheated and didn't
try to track "fully mapped" vs "partially mapped" - instead it just counted the
number of pages belonging to a THP (of any size) that were mapped.)

If you need the global mapping state, then the short-term way to do this would
be to provide the root cgroup, then have the script recurse through all child
cgroups; that would pick up all the processes and iterate through them:

  $ thpmaps --cgroup /sys/fs/cgroup --summary ...

This won't quite work with the current version because it doesn't yet recurse
through the cgroup children, but that would be easy to add.
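
Something like this hypothetical helper (assuming a cgroup v2 hierarchy) would
do it, feeding the pids into the script's existing per-pid path:

    import os

    def cgroup_pids(root='/sys/fs/cgroup'):
        # Walk the hierarchy; every directory containing a cgroup.procs
        # file is a (possibly nested) cgroup. Yield each member pid.
        for dirpath, _dirs, files in os.walk(root):
            if 'cgroup.procs' not in files:
                continue
            try:
                with open(os.path.join(dirpath, 'cgroup.procs')) as f:
                    for line in f:
                        yield int(line)
            except OSError:
                # e.g. threaded v2 cgroups refuse reads of cgroup.procs
                continue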


> 
> for debug purposes, it should be good. imaging there is a health
> monitor which needs
> to sample the stats of large folios online and periodically, this
> might be too expensive.
> 
>>
>> If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script
>> for now.
>>
>>>
>>> +1.
>>>
>>>>
>>>>
>>>> thanks,
>>>> --
>>>> John Hubbard
>>>> NVIDIA
>>>>
>>>
> 
> Thanks
> Barry
Barry Song Jan. 10, 2024, 10:30 a.m. UTC | #23
On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 10/01/2024 09:09, Barry Song wrote:
> > On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 10/01/2024 08:02, Barry Song wrote:
> >>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>>
> >>>> On 1/9/24 19:51, Barry Song wrote:
> >>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>> ...
> >>>>>> Hi Ryan,
> >>>>>>
> >>>>>> One thing that immediately came up during some recent testing of mTHP
> >>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
> >>>>>> tests on a machine at a time for now, inside various containers and
> >>>>>> such, and it would be nice if there were an easy way to get some numbers
> >>>>>> for the mTHPs across the whole machine.
> >>
> >> Just to confirm, you're expecting these "global" stats be truely global and not
> >> per-container? (asking because you exploicitly mentioned being in a container).
> >> If you want per-container, then you can probably just create the container in a
> >> cgroup?
> >>
> >>>>>>
> >>>>>> I'm not sure if that changes anything about thpmaps here. Probably
> >>>>>> this is fine as-is. But I wanted to give some initial reactions from
> >>>>>> just some quick runs: the global state would be convenient.
> >>
> >> Thanks for taking this for a spin! Appreciate the feedback.
> >>
> >>>>>
> >>>>> +1. but this seems to be impossible by scanning pagemap?
> >>>>> so may we add this statistics information in kernel just like
> >>>>> /proc/meminfo or a separate /proc/mthp_info?
> >>>>>
> >>>>
> >>>> Yes. From my perspective, it looks like the global stats are more useful
> >>>> initially, and the more detailed per-pid or per-cgroup stats are the
> >>>> next level of investigation. So feels odd to start with the more
> >>>> detailed stats.
> >>>>
> >>>
> >>> probably because this can be done without the modification of the kernel.
> >>
> >> Yes indeed, as John said in an earlier thread, my previous attempts to add stats
> >> directly in the kernel got pushback; DavidH was concerned that we don't really
> >> know exectly how to account mTHPs yet
> >> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding
> >> the wrong ABI and having to maintain it forever. There has also been some
> >> pushback regarding adding more values to multi-value files in sysfs, so David
> >> was suggesting coming up with a whole new scheme at some point (I know
> >> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups
> >> do live in sysfs).
> >>
> >> Anyway, this script was my attempt to 1) provide a short term solution to the
> >> "we need some stats" request and 2) provide a context in which to explore what
> >> the right stats are - this script can evolve without the ABI problem.
> >>
> >>> The detailed per-pid or per-cgroup is still quite useful to my case in which
> >>> we set mTHP enabled/disabled and allowed sizes according to vma types,
> >>> eg. libc_malloc, java heaps etc.
> >>>
> >>> Different vma types can have different anon_name. So I can use the detailed
> >>> info to find out if specific VMAs have gotten mTHP properly and how many
> >>> they have gotten.
> >>>
> >>>> However, Ryan did clearly say, above, "In future we may wish to
> >>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
> >>>> earlier he ran into some pushback on trying to set up /proc or /sys
> >>>> values because this is still such an early feature.
> >>>>
> >>>> I wonder if we could put the global stats in debugfs for now? That's
> >>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
> >>>> location.
> >>
> >> Now that I think about it, I wonder if we can add a --global mode to the script
> >> (or just infer global when neither --pid nor --cgroup are provided). I think I
> >> should be able to determine all the physical memory ranges from /proc/iomem,
> >> then grab all the info we need from /proc/kpageflags. We should then be able to
> >> process it all in much the same way as for --pid/--cgroup and provide the same
> >> stats, but it will apply globally. What do you think?
>
> Having now thought about this for a few mins (in the shower, if anyone wants the
> complete picture :) ), this won't quite work. This approach doesn't have the
> virtual mapping information so the best it can do is tell us "how many of each
> size of THP are allocated?" - it doesn't tell us anything about whether they are
> fully or partially mapped or what their alignment is (all necessary if we want
> to know if they are contpte-mapped). So I don't think this approach is going to
> be particularly useful.
>
> And this is also the big problem if we want to gather stats inside the kernel;
> if we want something equivalant to /proc/meminfo's
> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
> allocation of the THP but also whether it is mapped. That's easy for
> PMD-mappings, because there is only one entry to consider - when you set it, you
> increment the number of PMD-mapped THPs, when you clear it, you decrement. But
> for PTE-mappings it's harder; you know the size when you are mapping so its easy
> to increment, but you can do a partial unmap, so you would need to scan the PTEs
> to figure out if we are unmapping the first page of a previously
> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
> determine "is this folio fully and contiguously mapped in at least one process?".

As in OPPO's approach I shared with you before, we maintain two mapcounts:
1. entire map
2. subpage's map
3. if 1 and 2 both exist, it is DoubleMapped.

This isn't a problem for us, and every time we do a partial unmap we have an
explicit cont_pte split which decreases the entire map and increases the
subpage's mapcount.

But its downside is that we expose this info to mm-core.
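
A toy model of that bookkeeping (illustrative only, made-up names, not the
actual kernel code):

    class LargeFolioMapcount:
        # Two-counter scheme: one "entire" mapcount for full maps plus a
        # per-subpage mapcount; DoubleMapped means both are non-zero.
        def __init__(self, nr_pages):
            self.entire = 0
            self.subpage = [0] * nr_pages

        def map_entire(self):
            self.entire += 1

        def map_partial(self, indices):
            for idx in indices:
                self.subpage[idx] += 1

        def partial_unmap(self, still_mapped):
            # Explicit split: drop the entire map, keep per-subpage counts
            # for the pages that remain mapped.
            self.entire -= 1
            for idx in still_mapped:
                self.subpage[idx] += 1

        def double_mapped(self):
            return self.entire > 0 and any(self.subpage)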

>
> So depending on what global stats you actually need, the route to getting them
> cheaply may not be easy. (My previous attempt to add stats cheated and didn't
> try to track "fully mapped" vs "partially mapped" - instead it just counted the
> number of pages belonging to a THP (of any size) that were mapped.
>
> If you need the global mapping state, then the short term way to do this would
> be to provide the root cgroup, then have the script recurse through all child
> cgroups; That would pick up all the processes and iterate through them:
>
>   $ thpmaps --cgroup /sys/fs/cgroup --summary ...
>
> This won't quite work with the current version because it doesn't recurse
> through the cgroup children currently, but that would be easy to add.
>
>
> >
> > for debug purposes, it should be good. imaging there is a health
> > monitor which needs
> > to sample the stats of large folios online and periodically, this
> > might be too expensive.
> >
> >>
> >> If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script
> >> for now.
> >>
> >>>
> >>> +1.
> >>>
> >>>>
> >>>>
> >>>> thanks,
> >>>> --
> >>>> John Hubbard
> >>>> NVIDIA
> >>>>
> >>>
> >

Thanks
Barry
Ryan Roberts Jan. 10, 2024, 10:38 a.m. UTC | #24
On 10/01/2024 10:30, Barry Song wrote:
> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 10/01/2024 09:09, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>
>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>> ...
>>>>>>>> Hi Ryan,
>>>>>>>>
>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>> such, and it would be nice if there were an easy way to get some numbers
>>>>>>>> for the mTHPs across the whole machine.
>>>>
>>>> Just to confirm, you're expecting these "global" stats be truely global and not
>>>> per-container? (asking because you exploicitly mentioned being in a container).
>>>> If you want per-container, then you can probably just create the container in a
>>>> cgroup?
>>>>
>>>>>>>>
>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>> just some quick runs: the global state would be convenient.
>>>>
>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>
>>>>>>>
>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>> so may we add this statistics information in kernel just like
>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>
>>>>>>
>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>> next level of investigation. So feels odd to start with the more
>>>>>> detailed stats.
>>>>>>
>>>>>
>>>>> probably because this can be done without the modification of the kernel.
>>>>
>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add stats
>>>> directly in the kernel got pushback; DavidH was concerned that we don't really
>>>> know exectly how to account mTHPs yet
>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding
>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>> pushback regarding adding more values to multi-value files in sysfs, so David
>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups
>>>> do live in sysfs).
>>>>
>>>> Anyway, this script was my attempt to 1) provide a short term solution to the
>>>> "we need some stats" request and 2) provide a context in which to explore what
>>>> the right stats are - this script can evolve without the ABI problem.
>>>>
>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which
>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>> eg. libc_malloc, java heaps etc.
>>>>>
>>>>> Different vma types can have different anon_name. So I can use the detailed
>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>> they have gotten.
>>>>>
>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>> values because this is still such an early feature.
>>>>>>
>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>> location.
>>>>
>>>> Now that I think about it, I wonder if we can add a --global mode to the script
>>>> (or just infer global when neither --pid nor --cgroup are provided). I think I
>>>> should be able to determine all the physical memory ranges from /proc/iomem,
>>>> then grab all the info we need from /proc/kpageflags. We should then be able to
>>>> process it all in much the same way as for --pid/--cgroup and provide the same
>>>> stats, but it will apply globally. What do you think?
>>
>> Having now thought about this for a few mins (in the shower, if anyone wants the
>> complete picture :) ), this won't quite work. This approach doesn't have the
>> virtual mapping information so the best it can do is tell us "how many of each
>> size of THP are allocated?" - it doesn't tell us anything about whether they are
>> fully or partially mapped or what their alignment is (all necessary if we want
>> to know if they are contpte-mapped). So I don't think this approach is going to
>> be particularly useful.
>>
>> And this is also the big problem if we want to gather stats inside the kernel;
>> if we want something equivalant to /proc/meminfo's
>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>> allocation of the THP but also whether it is mapped. That's easy for
>> PMD-mappings, because there is only one entry to consider - when you set it, you
>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But
>> for PTE-mappings it's harder; you know the size when you are mapping so its easy
>> to increment, but you can do a partial unmap, so you would need to scan the PTEs
>> to figure out if we are unmapping the first page of a previously
>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>> determine "is this folio fully and contiguously mapped in at least one process?".
> 
> as OPPO's approach I shared to you before is maintaining two mapcount
> 1. entire map
> 2. subpage's map
> 3. if 1 and 2 both exist, it is DoubleMapped.
> 
> This isn't a problem for us. and everytime if we do a partial unmap,
> we have an explicit
> cont_pte split which will decrease the entire map and increase the
> subpage's mapcount.
> 
> but its downside is that we expose this info to mm-core.

OK, but I think we have a slightly more generic situation going on with the
upstream; if I've understood correctly, you are using the PTE_CONT bit in the
PTE to determine if it's fully mapped? That works for your case where you only
have 1 size of THP that you care about (contpte-size). But for the upstream, we
have multi-size THP so we can't use the PTE_CONT bit to determine if it's fully
mapped, because we can only use that bit if the THP is at least 64K and aligned,
and only on arm64. We would need a SW bit for this purpose, and the mm would
need to update that SW bit for every PTE on the full -> partial map transition.

> 
>>
>> So depending on what global stats you actually need, the route to getting them
>> cheaply may not be easy. (My previous attempt to add stats cheated and didn't
>> try to track "fully mapped" vs "partially mapped" - instead it just counted the
>> number of pages belonging to a THP (of any size) that were mapped.
>>
>> If you need the global mapping state, then the short term way to do this would
>> be to provide the root cgroup, then have the script recurse through all child
>> cgroups; That would pick up all the processes and iterate through them:
>>
>>   $ thpmaps --cgroup /sys/fs/cgroup --summary ...
>>
>> This won't quite work with the current version because it doesn't recurse
>> through the cgroup children currently, but that would be easy to add.
>>
>>
>>>
>>> for debug purposes, it should be good. imaging there is a health
>>> monitor which needs
>>> to sample the stats of large folios online and periodically, this
>>> might be too expensive.
>>>
>>>>
>>>> If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script
>>>> for now.
>>>>
>>>>>
>>>>> +1.
>>>>>
>>>>>>
>>>>>>
>>>>>> thanks,
>>>>>> --
>>>>>> John Hubbard
>>>>>> NVIDIA
>>>>>>
>>>>>
>>>
> 
> Thanks
> Barry
David Hildenbrand Jan. 10, 2024, 10:42 a.m. UTC | #25
On 10.01.24 11:38, Ryan Roberts wrote:
> On 10/01/2024 10:30, Barry Song wrote:
>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 10/01/2024 09:09, Barry Song wrote:
>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>
>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>> ...
>>>>>>>>> Hi Ryan,
>>>>>>>>>
>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers
>>>>>>>>> for the mTHPs across the whole machine.
>>>>>
>>>>> Just to confirm, you're expecting these "global" stats be truely global and not
>>>>> per-container? (asking because you exploicitly mentioned being in a container).
>>>>> If you want per-container, then you can probably just create the container in a
>>>>> cgroup?
>>>>>
>>>>>>>>>
>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>
>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>
>>>>>>>>
>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>
>>>>>>>
>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>> detailed stats.
>>>>>>>
>>>>>>
>>>>>> probably because this can be done without the modification of the kernel.
>>>>>
>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add stats
>>>>> directly in the kernel got pushback; DavidH was concerned that we don't really
>>>>> know exectly how to account mTHPs yet
>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding
>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>> pushback regarding adding more values to multi-value files in sysfs, so David
>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups
>>>>> do live in sysfs).
>>>>>
>>>>> Anyway, this script was my attempt to 1) provide a short term solution to the
>>>>> "we need some stats" request and 2) provide a context in which to explore what
>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>
>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which
>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>
>>>>>> Different vma types can have different anon_name. So I can use the detailed
>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>> they have gotten.
>>>>>>
>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>> values because this is still such an early feature.
>>>>>>>
>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>> location.
>>>>>
>>>>> Now that I think about it, I wonder if we can add a --global mode to the script
>>>>> (or just infer global when neither --pid nor --cgroup are provided). I think I
>>>>> should be able to determine all the physical memory ranges from /proc/iomem,
>>>>> then grab all the info we need from /proc/kpageflags. We should then be able to
>>>>> process it all in much the same way as for --pid/--cgroup and provide the same
>>>>> stats, but it will apply globally. What do you think?
>>>
>>> Having now thought about this for a few mins (in the shower, if anyone wants the
>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>> virtual mapping information so the best it can do is tell us "how many of each
>>> size of THP are allocated?" - it doesn't tell us anything about whether they are
>>> fully or partially mapped or what their alignment is (all necessary if we want
>>> to know if they are contpte-mapped). So I don't think this approach is going to
>>> be particularly useful.
>>>
>>> And this is also the big problem if we want to gather stats inside the kernel;
>>> if we want something equivalant to /proc/meminfo's
>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>> allocation of the THP but also whether it is mapped. That's easy for
>>> PMD-mappings, because there is only one entry to consider - when you set it, you
>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But
>>> for PTE-mappings it's harder; you know the size when you are mapping so its easy
>>> to increment, but you can do a partial unmap, so you would need to scan the PTEs
>>> to figure out if we are unmapping the first page of a previously
>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>> determine "is this folio fully and contiguously mapped in at least one process?".
>>
>> as OPPO's approach I shared to you before is maintaining two mapcount
>> 1. entire map
>> 2. subpage's map
>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>
>> This isn't a problem for us. and everytime if we do a partial unmap,
>> we have an explicit
>> cont_pte split which will decrease the entire map and increase the
>> subpage's mapcount.
>>
>> but its downside is that we expose this info to mm-core.
> 
> OK, but I think we have a slightly more generic situation going on with the
> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
> PTE to determne if its fully mapped? That works for your case where you only
> have 1 size of THP that you care about (contpte-size). But for the upstream, we
> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
> mapped because we can only use that bit if the THP is at least 64K and aligned,
> and only on arm64. We would need a SW bit for this purpose, and the mm would
> need to update that SW bit for every PTE one the full -> partial map transition.

Oh no. Let's not make everything more complicated for the purpose of 
some stats.
Barry Song Jan. 10, 2024, 10:48 a.m. UTC | #26
On Wed, Jan 10, 2024 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 10/01/2024 10:30, Barry Song wrote:
> > On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 10/01/2024 09:09, Barry Song wrote:
> >>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 10/01/2024 08:02, Barry Song wrote:
> >>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>>>>
> >>>>>> On 1/9/24 19:51, Barry Song wrote:
> >>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>>>> ...
> >>>>>>>> Hi Ryan,
> >>>>>>>>
> >>>>>>>> One thing that immediately came up during some recent testing of mTHP
> >>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
> >>>>>>>> tests on a machine at a time for now, inside various containers and
> >>>>>>>> such, and it would be nice if there were an easy way to get some numbers
> >>>>>>>> for the mTHPs across the whole machine.
> >>>>
> >>>> Just to confirm, you're expecting these "global" stats be truely global and not
> >>>> per-container? (asking because you exploicitly mentioned being in a container).
> >>>> If you want per-container, then you can probably just create the container in a
> >>>> cgroup?
> >>>>
> >>>>>>>>
> >>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
> >>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
> >>>>>>>> just some quick runs: the global state would be convenient.
> >>>>
> >>>> Thanks for taking this for a spin! Appreciate the feedback.
> >>>>
> >>>>>>>
> >>>>>>> +1. but this seems to be impossible by scanning pagemap?
> >>>>>>> so may we add this statistics information in kernel just like
> >>>>>>> /proc/meminfo or a separate /proc/mthp_info?
> >>>>>>>
> >>>>>>
> >>>>>> Yes. From my perspective, it looks like the global stats are more useful
> >>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
> >>>>>> next level of investigation. So feels odd to start with the more
> >>>>>> detailed stats.
> >>>>>>
> >>>>>
> >>>>> probably because this can be done without the modification of the kernel.
> >>>>
> >>>> Yes indeed, as John said in an earlier thread, my previous attempts to add stats
> >>>> directly in the kernel got pushback; DavidH was concerned that we don't really
> >>>> know exectly how to account mTHPs yet
> >>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding
> >>>> the wrong ABI and having to maintain it forever. There has also been some
> >>>> pushback regarding adding more values to multi-value files in sysfs, so David
> >>>> was suggesting coming up with a whole new scheme at some point (I know
> >>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups
> >>>> do live in sysfs).
> >>>>
> >>>> Anyway, this script was my attempt to 1) provide a short term solution to the
> >>>> "we need some stats" request and 2) provide a context in which to explore what
> >>>> the right stats are - this script can evolve without the ABI problem.
> >>>>
> >>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which
> >>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
> >>>>> eg. libc_malloc, java heaps etc.
> >>>>>
> >>>>> Different vma types can have different anon_name. So I can use the detailed
> >>>>> info to find out if specific VMAs have gotten mTHP properly and how many
> >>>>> they have gotten.
> >>>>>
> >>>>>> However, Ryan did clearly say, above, "In future we may wish to
> >>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
> >>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
> >>>>>> values because this is still such an early feature.
> >>>>>>
> >>>>>> I wonder if we could put the global stats in debugfs for now? That's
> >>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
> >>>>>> location.
> >>>>
> >>>> Now that I think about it, I wonder if we can add a --global mode to the script
> >>>> (or just infer global when neither --pid nor --cgroup are provided). I think I
> >>>> should be able to determine all the physical memory ranges from /proc/iomem,
> >>>> then grab all the info we need from /proc/kpageflags. We should then be able to
> >>>> process it all in much the same way as for --pid/--cgroup and provide the same
> >>>> stats, but it will apply globally. What do you think?
> >>
> >> Having now thought about this for a few mins (in the shower, if anyone wants the
> >> complete picture :) ), this won't quite work. This approach doesn't have the
> >> virtual mapping information so the best it can do is tell us "how many of each
> >> size of THP are allocated?" - it doesn't tell us anything about whether they are
> >> fully or partially mapped or what their alignment is (all necessary if we want
> >> to know if they are contpte-mapped). So I don't think this approach is going to
> >> be particularly useful.
> >>
> >> And this is also the big problem if we want to gather stats inside the kernel;
> >> if we want something equivalant to /proc/meminfo's
> >> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
> >> allocation of the THP but also whether it is mapped. That's easy for
> >> PMD-mappings, because there is only one entry to consider - when you set it, you
> >> increment the number of PMD-mapped THPs, when you clear it, you decrement. But
> >> for PTE-mappings it's harder; you know the size when you are mapping so its easy
> >> to increment, but you can do a partial unmap, so you would need to scan the PTEs
> >> to figure out if we are unmapping the first page of a previously
> >> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
> >> determine "is this folio fully and contiguously mapped in at least one process?".
> >
> > as OPPO's approach I shared to you before is maintaining two mapcount
> > 1. entire map
> > 2. subpage's map
> > 3. if 1 and 2 both exist, it is DoubleMapped.
> >
> > This isn't a problem for us. and everytime if we do a partial unmap,
> > we have an explicit
> > cont_pte split which will decrease the entire map and increase the
> > subpage's mapcount.
> >
> > but its downside is that we expose this info to mm-core.
>
> OK, but I think we have a slightly more generic situation going on with the
> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
> PTE to determne if its fully mapped? That works for your case where you only
> have 1 size of THP that you care about (contpte-size). But for the upstream, we
> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
> mapped because we can only use that bit if the THP is at least 64K and aligned,
> and only on arm64. We would need a SW bit for this purpose, and the mm would
> need to update that SW bit for every PTE one the full -> partial map transition.

My current implementation does use cont_pte, but I don't think it is a
must-have. We don't need a bit in the PTE to know if we are partially unmapping
a large folio at all.

As long as we are unmapping a part of a large folio, we know what we are doing.
If a large folio is mapped entirely in a process, we only get entire_map +1. If
we are unmapping a subpage of it, we get entire_map -1 and the remaining
subpages' mapcount +1. If we are only mapping a part of this large folio, we
only increase its subpages' mapcount.

>
> >
> >>
> >> So depending on what global stats you actually need, the route to getting them
> >> cheaply may not be easy. (My previous attempt to add stats cheated and didn't
> >> try to track "fully mapped" vs "partially mapped" - instead it just counted the
> >> number of pages belonging to a THP (of any size) that were mapped.
> >>
> >> If you need the global mapping state, then the short term way to do this would
> >> be to provide the root cgroup, then have the script recurse through all child
> >> cgroups; That would pick up all the processes and iterate through them:
> >>
> >>   $ thpmaps --cgroup /sys/fs/cgroup --summary ...
> >>
> >> This won't quite work with the current version because it doesn't recurse
> >> through the cgroup children currently, but that would be easy to add.
> >>
> >>
> >>>
> >>> for debug purposes, it should be good. imaging there is a health
> >>> monitor which needs
> >>> to sample the stats of large folios online and periodically, this
> >>> might be too expensive.
> >>>
> >>>>
> >>>> If we can possibly avoid sysfs/debugfs I would prefer to keep it all in a script
> >>>> for now.
> >>>>
> >>>>>
> >>>>> +1.
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> thanks,
> >>>>>> --
> >>>>>> John Hubbard
> >>>>>> NVIDIA
> >>>>>>
> >>>>>
> >>>
> >

Thanks
Barry
David Hildenbrand Jan. 10, 2024, 10:54 a.m. UTC | #27
On 10.01.24 11:48, Barry Song wrote:
> On Wed, Jan 10, 2024 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 10/01/2024 10:30, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>
>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>> ...
>>>>>>>>>> Hi Ryan,
>>>>>>>>>>
>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers
>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>
>>>>>> Just to confirm, you're expecting these "global" stats be truely global and not
>>>>>> per-container? (asking because you exploicitly mentioned being in a container).
>>>>>> If you want per-container, then you can probably just create the container in a
>>>>>> cgroup?
>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>
>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>
>>>>>>>>>
>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>> detailed stats.
>>>>>>>>
>>>>>>>
>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>
>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add stats
>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't really
>>>>>> know exectly how to account mTHPs yet
>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up adding
>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>> pushback regarding adding more values to multi-value files in sysfs, so David
>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups
>>>>>> do live in sysfs).
>>>>>>
>>>>>> Anyway, this script was my attempt to 1) provide a short term solution to the
>>>>>> "we need some stats" request and 2) provide a context in which to explore what
>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>
>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which
>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>
>>>>>>> Different vma types can have different anon_name. So I can use the detailed
>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>> they have gotten.
>>>>>>>
>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>> values because this is still such an early feature.
>>>>>>>>
>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>> location.
>>>>>>
>>>>>> Now that I think about it, I wonder if we can add a --global mode to the script
>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I think I
>>>>>> should be able to determine all the physical memory ranges from /proc/iomem,
>>>>>> then grab all the info we need from /proc/kpageflags. We should then be able to
>>>>>> process it all in much the same way as for --pid/--cgroup and provide the same
>>>>>> stats, but it will apply globally. What do you think?
>>>>
>>>> Having now thought about this for a few mins (in the shower, if anyone wants the
>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>> virtual mapping information so the best it can do is tell us "how many of each
>>>> size of THP are allocated?" - it doesn't tell us anything about whether they are
>>>> fully or partially mapped or what their alignment is (all necessary if we want
>>>> to know if they are contpte-mapped). So I don't think this approach is going to
>>>> be particularly useful.
>>>>
>>>> And this is also the big problem if we want to gather stats inside the kernel;
>>>> if we want something equivalant to /proc/meminfo's
>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>> PMD-mappings, because there is only one entry to consider - when you set it, you
>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But
>>>> for PTE-mappings it's harder; you know the size when you are mapping so its easy
>>>> to increment, but you can do a partial unmap, so you would need to scan the PTEs
>>>> to figure out if we are unmapping the first page of a previously
>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>> determine "is this folio fully and contiguously mapped in at least one process?".
>>>
>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>> 1. entire map
>>> 2. subpage's map
>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>
>>> This isn't a problem for us. and everytime if we do a partial unmap,
>>> we have an explicit
>>> cont_pte split which will decrease the entire map and increase the
>>> subpage's mapcount.
>>>
>>> but its downside is that we expose this info to mm-core.
>>
>> OK, but I think we have a slightly more generic situation going on with the
>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>> PTE to determne if its fully mapped? That works for your case where you only
>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>> need to update that SW bit for every PTE one the full -> partial map transition.
> 
> My current implementation does use cont_pte but i don't think it is a must-have.
> we don't need a bit in PTE to know if we are partially unmapping a large folio
> at all.
> 
> as long as we are unmapping a part of a large folio, we do know what we are
> doing. if a large folio is mapped entirely in a process, we get only
> entire_map +1,
> if we are unmapping a subpage of it, we get entire_map -1 and remained subpage's
> mapcount + 1. if we are only mapping a part of this large folio, we
> only increase
> its subpages' mapcount.

That doesn't work as soon as you unmap a second subpage. Not to mention 
that people ( :) ) are working on removing the subpage mapcounts.

I'm going to propose that as a topic for LSF/MM soon, once I get to it.
Ryan Roberts Jan. 10, 2024, 10:55 a.m. UTC | #28
On 10/01/2024 10:42, David Hildenbrand wrote:
> On 10.01.24 11:38, Ryan Roberts wrote:
>> On 10/01/2024 10:30, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>
>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>> ...
>>>>>>>>>> Hi Ryan,
>>>>>>>>>>
>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers
>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>
>>>>>> Just to confirm, you're expecting these "global" stats be truely global
>>>>>> and not
>>>>>> per-container? (asking because you exploicitly mentioned being in a
>>>>>> container).
>>>>>> If you want per-container, then you can probably just create the container
>>>>>> in a
>>>>>> cgroup?
>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>
>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>
>>>>>>>>>
>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>> detailed stats.
>>>>>>>>
>>>>>>>
>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>
>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>> stats
>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>> really
>>>>>> know exectly how to account mTHPs yet
>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>> adding
>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>> pushback regarding adding more values to multi-value files in sysfs, so David
>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>> cgroups
>>>>>> do live in sysfs).
>>>>>>
>>>>>> Anyway, this script was my attempt to 1) provide a short term solution to the
>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>> what
>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>
>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which
>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>
>>>>>>> Different vma types can have different anon_name. So I can use the detailed
>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>> they have gotten.
>>>>>>>
>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>> values because this is still such an early feature.
>>>>>>>>
>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>> location.
>>>>>>
>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>> script
>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>> think I
>>>>>> should be able to determine all the physical memory ranges from /proc/iomem,
>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>> able to
>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>> same
>>>>>> stats, but it will apply globally. What do you think?
>>>>
>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>> the
>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>> virtual mapping information so the best it can do is tell us "how many of each
>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>> are
>>>> fully or partially mapped or what their alignment is (all necessary if we want
>>>> to know if they are contpte-mapped). So I don't think this approach is going to
>>>> be particularly useful.
>>>>
>>>> And this is also the big problem if we want to gather stats inside the kernel;
>>>> if we want something equivalent to /proc/meminfo's
>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>> you
>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But
>>>> for PTE-mappings it's harder; you know the size when you are mapping so it's
>>>> easy
>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>> PTEs
>>>> to figure out if we are unmapping the first page of a previously
>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>> determine "is this folio fully and contiguously mapped in at least one
>>>> process?".
>>>
>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>> 1. entire map
>>> 2. subpage's map
>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>
>>> This isn't a problem for us. And every time we do a partial unmap,
>>> we have an explicit
>>> cont_pte split which will decrease the entire map and increase the
>>> subpage's mapcount.
>>>
>>> but its downside is that we expose this info to mm-core.
>>
>> OK, but I think we have a slightly more generic situation going on with the
>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>> PTE to determine if it's fully mapped? That works for your case where you only
>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>> have multi-size THP so we can't use the PTE_CONT bit to determine if it's fully
>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>> need to update that SW bit for every PTE on the full -> partial map transition.
> 
> Oh no. Let's not make everything more complicated for the purpose of some stats.
> 

Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
we want to know what's fully mapped and what's not, then I don't see any way
other than by scanning the page tables and we might as well do that in user
space with this script.

Although, I expect you will shortly make a proposal that is simple to implement
and prove me wrong ;-)
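
To make the user-space approach concrete, here is a minimal, hypothetical sketch
(not taken from the thpmaps patch; the function name and structure are
illustrative only) of how a scanner can decide whether a folio is fully and
contiguously mapped in a process by reading /proc/<pid>/pagemap. It assumes the
caller already knows the folio's head PFN and number of pages (e.g. from
/proc/kpageflags) plus the virtual address of the first mapped page, and relies
only on the documented pagemap layout: 64-bit entries, bit 63 = present,
bits 0-54 = PFN (PFNs are only visible to root).

--8<--
import struct

PAGE_SIZE = 4096
PM_ENTRY_BYTES = 8
PM_PFN_MASK = (1 << 55) - 1        # bits 0-54: PFN (requires root)
PM_PRESENT = 1 << 63               # bit 63: page present in RAM

def folio_fully_mapped(pid, vaddr, head_pfn, nr_pages):
    """Return True if all nr_pages of the folio starting at head_pfn are
    present and mapped contiguously at vaddr."""
    with open(f'/proc/{pid}/pagemap', 'rb') as pm:
        pm.seek((vaddr // PAGE_SIZE) * PM_ENTRY_BYTES)
        data = pm.read(nr_pages * PM_ENTRY_BYTES)
    if len(data) != nr_pages * PM_ENTRY_BYTES:
        return False
    for i, entry in enumerate(struct.unpack(f'<{nr_pages}Q', data)):
        if not (entry & PM_PRESENT):
            return False
        if (entry & PM_PFN_MASK) != head_pfn + i:
            return False
    return True
--8<--

A real implementation would batch reads per VMA rather than seeking per folio,
but the check itself is just "present, and PFNs increase by one".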
Ryan Roberts Jan. 10, 2024, 10:58 a.m. UTC | #29
On 10/01/2024 10:54, David Hildenbrand wrote:
> On 10.01.24 11:48, Barry Song wrote:
>> On Wed, Jan 10, 2024 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 10/01/2024 10:30, Barry Song wrote:
>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>
>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>> wrote:
>>>>>>>>> ...
>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>
>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers
>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>
>>>>>>> Just to confirm, you're expecting these "global" stats to be truly global
>>>>>>> and not
>>>>>>> per-container? (asking because you explicitly mentioned being in a
>>>>>>> container).
>>>>>>> If you want per-container, then you can probably just create the
>>>>>>> container in a
>>>>>>> cgroup?
>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>
>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>> detailed stats.
>>>>>>>>>
>>>>>>>>
>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>
>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to
>>>>>>> add stats
>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>> really
>>>>>>> know exactly how to account mTHPs yet
>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>> adding
>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>> David
>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>> cgroups
>>>>>>> do live in sysfs).
>>>>>>>
>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution to
>>>>>>> the
>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>> what
>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>
>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>> which
>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>
>>>>>>>> Different vma types can have different anon_name. So I can use the detailed
>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>> they have gotten.
>>>>>>>>
>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>
>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>> location.
>>>>>>>
>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>> script
>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>> think I
>>>>>>> should be able to determine all the physical memory ranges from /proc/iomem,
>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>> able to
>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>> same
>>>>>>> stats, but it will apply globally. What do you think?
>>>>>
>>>>> Having now thought about this for a few mins (in the shower, if anyone
>>>>> wants the
>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>> virtual mapping information so the best it can do is tell us "how many of each
>>>>> size of THP are allocated?" - it doesn't tell us anything about whether
>>>>> they are
>>>>> fully or partially mapped or what their alignment is (all necessary if we want
>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>> going to
>>>>> be particularly useful.
>>>>>
>>>>> And this is also the big problem if we want to gather stats inside the kernel;
>>>>> if we want something equivalent to /proc/meminfo's
>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>> PMD-mappings, because there is only one entry to consider - when you set
>>>>> it, you
>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But
>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>> easy
>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>> PTEs
>>>>> to figure out if we are unmapping the first page of a previously
>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>> process?".
>>>>
>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>> 1. entire map
>>>> 2. subpage's map
>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>
>>>> This isn't a problem for us. And every time we do a partial unmap,
>>>> we have an explicit
>>>> cont_pte split which will decrease the entire map and increase the
>>>> subpage's mapcount.
>>>>
>>>> but its downside is that we expose this info to mm-core.
>>>
>>> OK, but I think we have a slightly more generic situation going on with the
>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>> PTE to determine if it's fully mapped? That works for your case where you only
>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>> have multi-size THP so we can't use the PTE_CONT bit to determine if it's fully
>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>> need to update that SW bit for every PTE on the full -> partial map transition.
>>
>> My current implementation does use cont_pte but I don't think it is a must-have.
>> we don't need a bit in PTE to know if we are partially unmapping a large folio
>> at all.
>>
>> as long as we are unmapping a part of a large folio, we do know what we are
>> doing. if a large folio is mapped entirely in a process, we get only
>> entire_map +1,
>> if we are unmapping a subpage of it, we get entire_map -1 and the remaining
>> subpages' mapcount + 1. If we are only mapping a part of this large folio, we
>> only increase
>> its subpages' mapcount.
> 
> That doesn't work as soon as you unmap a second subpage. Not to mention that
> people ( :) ) are working on removing the subpage mapcounts.

Yes, that was my point - Oppo's implementation relies on the bit in the PTE to
tell the difference between unmapping the first subpage and unmapping the
others. We don't have that luxury here.

> 
> I'm going to propose that as a topic for LSF/MM soon, once I get to it.
>
David Hildenbrand Jan. 10, 2024, 11 a.m. UTC | #30
On 10.01.24 11:55, Ryan Roberts wrote:
> On 10/01/2024 10:42, David Hildenbrand wrote:
>> On 10.01.24 11:38, Ryan Roberts wrote:
>>> On 10/01/2024 10:30, Barry Song wrote:
>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>
>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>> ...
>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>
>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers
>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>
>>>>>>> Just to confirm, you're expecting these "global" stats to be truly global
>>>>>>> and not
>>>>>>> per-container? (asking because you explicitly mentioned being in a
>>>>>>> container).
>>>>>>> If you want per-container, then you can probably just create the container
>>>>>>> in a
>>>>>>> cgroup?
>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>
>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>> detailed stats.
>>>>>>>>>
>>>>>>>>
>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>
>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>>> stats
>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>> really
>>>>>>> know exactly how to account mTHPs yet
>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>> adding
>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so David
>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>> cgroups
>>>>>>> do live in sysfs).
>>>>>>>
>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution to the
>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>> what
>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>
>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in which
>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>
>>>>>>>> Different vma types can have different anon_name. So I can use the detailed
>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>> they have gotten.
>>>>>>>>
>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>
>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>> location.
>>>>>>>
>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>> script
>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>> think I
>>>>>>> should be able to determine all the physical memory ranges from /proc/iomem,
>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>> able to
>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>> same
>>>>>>> stats, but it will apply globally. What do you think?
>>>>>
>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>>> the
>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>> virtual mapping information so the best it can do is tell us "how many of each
>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>>> are
>>>>> fully or partially mapped or what their alignment is (all necessary if we want
>>>>> to know if they are contpte-mapped). So I don't think this approach is going to
>>>>> be particularly useful.
>>>>>
>>>>> And this is also the big problem if we want to gather stats inside the kernel;
>>>>> if we want something equivalent to /proc/meminfo's
>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>>> you
>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But
>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>> easy
>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>> PTEs
>>>>> to figure out if we are unmapping the first page of a previously
>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>> process?".
>>>>
>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>> 1. entire map
>>>> 2. subpage's map
>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>
>>>> This isn't a problem for us. And every time we do a partial unmap,
>>>> we have an explicit
>>>> cont_pte split which will decrease the entire map and increase the
>>>> subpage's mapcount.
>>>>
>>>> but its downside is that we expose this info to mm-core.
>>>
>>> OK, but I think we have a slightly more generic situation going on with the
>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>> PTE to determine if it's fully mapped? That works for your case where you only
>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>> have multi-size THP so we can't use the PTE_CONT bit to determine if it's fully
>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>> need to update that SW bit for every PTE on the full -> partial map transition.
>>
>> Oh no. Let's not make everything more complicated for the purpose of some stats.
>>
> 
> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
> we want to know what's fully mapped and what's not, then I don't see any way
> other than by scanning the page tables and we might as well do that in user
> space with this script.
> 
> Although, I expect you will shortly make a proposal that is simple to implement
> and prove me wrong ;-)

Unlikely :) As you said, once you have multiple folio sizes, it stops 
really making sense.

Assume you have a 128 kiB pagecache folio, and half of that is mapped.
You can set cont-pte bits on that half and all is fine. Or AMD can
benefit from its optimizations without the cont-pte bit and everything
is fine.

We want simple stats that tell us which folio sizes are actually 
allocated. For everything else, just scan the process to figure out what 
exactly is going on.
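
As a rough illustration of the kind of "which folio sizes are actually
allocated" summary being discussed, a hypothetical user-space sketch (not part
of the patch) could take the physical "System RAM" ranges from /proc/iomem and
count compound folios from /proc/kpageflags using the flag bits documented in
Documentation/admin-guide/mm/pagemap.rst (COMPOUND_HEAD=15, COMPOUND_TAIL=16,
THP=22). As noted above, this says nothing about how, or whether, the folios
are mapped, and chunked reads and folios straddling a range boundary are
ignored for brevity.

--8<--
import collections

PAGE_SIZE = 4096
KPF_COMPOUND_HEAD = 1 << 15
KPF_COMPOUND_TAIL = 1 << 16
KPF_THP = 1 << 22

def system_ram_pfn_ranges():
    # Physical RAM ranges; /proc/iomem shows real addresses only to root.
    ranges = []
    with open('/proc/iomem') as f:
        for line in f:
            if 'System RAM' not in line:
                continue
            span = line.split(':')[0].strip()
            start, end = (int(x, 16) for x in span.split('-'))
            ranges.append((start // PAGE_SIZE, (end + 1) // PAGE_SIZE))
    return ranges

def thp_alloc_histogram():
    # Map folio size in kB -> number of allocated THP folios of that size.
    hist = collections.Counter()
    with open('/proc/kpageflags', 'rb') as kpf:
        for start_pfn, end_pfn in system_ram_pfn_ranges():
            kpf.seek(start_pfn * 8)
            buf = kpf.read((end_pfn - start_pfn) * 8)
            flags = [int.from_bytes(buf[i:i + 8], 'little')
                     for i in range(0, len(buf), 8)]
            i = 0
            while i < len(flags):
                if flags[i] & KPF_THP and flags[i] & KPF_COMPOUND_HEAD:
                    nr = 1
                    while i + nr < len(flags) and flags[i + nr] & KPF_COMPOUND_TAIL:
                        nr += 1
                    hist[nr * PAGE_SIZE // 1024] += 1
                    i += nr
                else:
                    i += 1
    return hist
--8<--

The in-kernel equivalent of this would simply be a per-order counter adjusted
at folio allocation and free time, which is the "simple stats" idea here.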
David Hildenbrand Jan. 10, 2024, 11:02 a.m. UTC | #31
On 10.01.24 11:58, Ryan Roberts wrote:
> On 10/01/2024 10:54, David Hildenbrand wrote:
>> On 10.01.24 11:48, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>> wrote:
>>>>>>>>>> ...
>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>
>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers
>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>
>>>>>>>> Just to confirm, you're expecting these "global" stats to be truly global
>>>>>>>> and not
>>>>>>>> per-container? (asking because you explicitly mentioned being in a
>>>>>>>> container).
>>>>>>>> If you want per-container, then you can probably just create the
>>>>>>>> container in a
>>>>>>>> cgroup?
>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>
>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>> detailed stats.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>>
>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to
>>>>>>>> add stats
>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>>> really
>>>>>>>> know exactly how to account mTHPs yet
>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>>> adding
>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>>> David
>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>>> cgroups
>>>>>>>> do live in sysfs).
>>>>>>>>
>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution to
>>>>>>>> the
>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>>> what
>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>
>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>>> which
>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>
>>>>>>>>> Different vma types can have different anon_name. So I can use the detailed
>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>>> they have gotten.
>>>>>>>>>
>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>
>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>>> location.
>>>>>>>>
>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>>> script
>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>>> think I
>>>>>>>> should be able to determine all the physical memory ranges from /proc/iomem,
>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>>> able to
>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>>> same
>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>
>>>>>> Having now thought about this for a few mins (in the shower, if anyone
>>>>>> wants the
>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>>> virtual mapping information so the best it can do is tell us "how many of each
>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether
>>>>>> they are
>>>>>> fully or partially mapped or what their alignment is (all necessary if we want
>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>> going to
>>>>>> be particularly useful.
>>>>>>
>>>>>> And this is also the big problem if we want to gather stats inside the kernel;
>>>>>> if we want something equivalent to /proc/meminfo's
>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>> PMD-mappings, because there is only one entry to consider - when you set
>>>>>> it, you
>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But
>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>>> easy
>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>>> PTEs
>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>> process?".
>>>>>
>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>> 1. entire map
>>>>> 2. subpage's map
>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>
>>>>> This isn't a problem for us. And every time we do a partial unmap,
>>>>> we have an explicit
>>>>> cont_pte split which will decrease the entire map and increase the
>>>>> subpage's mapcount.
>>>>>
>>>>> but its downside is that we expose this info to mm-core.
>>>>
>>>> OK, but I think we have a slightly more generic situation going on with the
>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>>> PTE to determine if it's fully mapped? That works for your case where you only
>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if it's fully
>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>>> need to update that SW bit for every PTE on the full -> partial map transition.
>>>
>>> My current implementation does use cont_pte but i don't think it is a must-have.
>>> we don't need a bit in PTE to know if we are partially unmapping a large folio
>>> at all.
>>>
>>> as long as we are unmapping a part of a large folio, we do know what we are
>>> doing. if a large folio is mapped entirely in a process, we get only
>>> entire_map +1,
>>> if we are unmapping a subpage of it, we get entire_map -1 and remained subpage's
>>> mapcount + 1. if we are only mapping a part of this large folio, we
>>> only increase
>>> its subpages' mapcount.
>>
>> That doesn't work as soon as you unmap a second subpage. Not to mention that
>> people ( :) ) are working on removing the subpage mapcounts.
> 
> Yes, that was my point - Oppo's implementation relies on the bit in the PTE to
> tell the difference between unmapping the first subpage and unmapping the
> others. We don't have that luxury here.

Yes, and once we're thinking of bigger folios that eventually span 
multiple page tables, these PTE-bit games won't scale.
Barry Song Jan. 10, 2024, 11:07 a.m. UTC | #32
On Wed, Jan 10, 2024 at 6:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 10/01/2024 10:54, David Hildenbrand wrote:
> > On 10.01.24 11:48, Barry Song wrote:
> >> On Wed, Jan 10, 2024 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>
> >>> On 10/01/2024 10:30, Barry Song wrote:
> >>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>
> >>>>> On 10/01/2024 09:09, Barry Song wrote:
> >>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>
> >>>>>>> On 10/01/2024 08:02, Barry Song wrote:
> >>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
> >>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
> >>>>>>>>>> wrote:
> >>>>>>>>> ...
> >>>>>>>>>>> Hi Ryan,
> >>>>>>>>>>>
> >>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
> >>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm running
> >>>>>>>>>>> tests on a machine at a time for now, inside various containers and
> >>>>>>>>>>> such, and it would be nice if there were an easy way to get some numbers
> >>>>>>>>>>> for the mTHPs across the whole machine.
> >>>>>>>
> >>>>>>> Just to confirm, you're expecting these "global" stats to be truly global
> >>>>>>> and not
> >>>>>>> per-container? (asking because you explicitly mentioned being in a
> >>>>>>> container).
> >>>>>>> If you want per-container, then you can probably just create the
> >>>>>>> container in a
> >>>>>>> cgroup?
> >>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
> >>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
> >>>>>>>>>>> just some quick runs: the global state would be convenient.
> >>>>>>>
> >>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
> >>>>>>>>>> so may we add this statistics information in kernel just like
> >>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
> >>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
> >>>>>>>>> next level of investigation. So feels odd to start with the more
> >>>>>>>>> detailed stats.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> probably because this can be done without the modification of the kernel.
> >>>>>>>
> >>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to
> >>>>>>> add stats
> >>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
> >>>>>>> really
> >>>>>>> know exactly how to account mTHPs yet
> >>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
> >>>>>>> adding
> >>>>>>> the wrong ABI and having to maintain it forever. There has also been some
> >>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
> >>>>>>> David
> >>>>>>> was suggesting coming up with a whole new scheme at some point (I know
> >>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
> >>>>>>> cgroups
> >>>>>>> do live in sysfs).
> >>>>>>>
> >>>>>>> Anyway, this script was my attempt to 1) provide a short term solution to
> >>>>>>> the
> >>>>>>> "we need some stats" request and 2) provide a context in which to explore
> >>>>>>> what
> >>>>>>> the right stats are - this script can evolve without the ABI problem.
> >>>>>>>
> >>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
> >>>>>>>> which
> >>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
> >>>>>>>> eg. libc_malloc, java heaps etc.
> >>>>>>>>
> >>>>>>>> Different vma types can have different anon_name. So I can use the detailed
> >>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
> >>>>>>>> they have gotten.
> >>>>>>>>
> >>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
> >>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
> >>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
> >>>>>>>>> values because this is still such an early feature.
> >>>>>>>>>
> >>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
> >>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
> >>>>>>>>> location.
> >>>>>>>
> >>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
> >>>>>>> script
> >>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
> >>>>>>> think I
> >>>>>>> should be able to determine all the physical memory ranges from /proc/iomem,
> >>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
> >>>>>>> able to
> >>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
> >>>>>>> same
> >>>>>>> stats, but it will apply globally. What do you think?
> >>>>>
> >>>>> Having now thought about this for a few mins (in the shower, if anyone
> >>>>> wants the
> >>>>> complete picture :) ), this won't quite work. This approach doesn't have the
> >>>>> virtual mapping information so the best it can do is tell us "how many of each
> >>>>> size of THP are allocated?" - it doesn't tell us anything about whether
> >>>>> they are
> >>>>> fully or partially mapped or what their alignment is (all necessary if we want
> >>>>> to know if they are contpte-mapped). So I don't think this approach is
> >>>>> going to
> >>>>> be particularly useful.
> >>>>>
> >>>>> And this is also the big problem if we want to gather stats inside the kernel;
> >>>>> if we want something equivalent to /proc/meminfo's
> >>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
> >>>>> allocation of the THP but also whether it is mapped. That's easy for
> >>>>> PMD-mappings, because there is only one entry to consider - when you set
> >>>>> it, you
> >>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement. But
> >>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
> >>>>> easy
> >>>>> to increment, but you can do a partial unmap, so you would need to scan the
> >>>>> PTEs
> >>>>> to figure out if we are unmapping the first page of a previously
> >>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
> >>>>> determine "is this folio fully and contiguously mapped in at least one
> >>>>> process?".
> >>>>
> >>>> as OPPO's approach I shared to you before is maintaining two mapcount
> >>>> 1. entire map
> >>>> 2. subpage's map
> >>>> 3. if 1 and 2 both exist, it is DoubleMapped.
> >>>>
> >>>> This isn't a problem for us. And every time we do a partial unmap,
> >>>> we have an explicit
> >>>> cont_pte split which will decrease the entire map and increase the
> >>>> subpage's mapcount.
> >>>>
> >>>> but its downside is that we expose this info to mm-core.
> >>>
> >>> OK, but I think we have a slightly more generic situation going on with the
> >>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
> >>> PTE to determine if it's fully mapped? That works for your case where you only
> >>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
> >>> have multi-size THP so we can't use the PTE_CONT bit to determine if it's fully
> >>> mapped because we can only use that bit if the THP is at least 64K and aligned,
> >>> and only on arm64. We would need a SW bit for this purpose, and the mm would
> >>> need to update that SW bit for every PTE on the full -> partial map transition.
> >>
> >> My current implementation does use cont_pte but i don't think it is a must-have.
> >> we don't need a bit in PTE to know if we are partially unmapping a large folio
> >> at all.
> >>
> >> as long as we are unmapping a part of a large folio, we do know what we are
> >> doing. if a large folio is mapped entirely in a process, we get only
> >> entire_map +1,
> >> if we are unmapping a subpage of it, we get entire_map -1 and remained subpage's
> >> mapcount + 1. if we are only mapping a part of this large folio, we
> >> only increase
> >> its subpages' mapcount.
> >
> > That doesn't work as soon as you unmap a second subpage. Not to mention that
> > people ( :) ) are working on removing the subpage mapcounts.
>
> Yes, that was my point - Oppo's implementation relies on the bit in the PTE to
> tell the difference between unmapping the first subpage and unmapping the
> others. We don't have that luxury here.

right. The devil is in the details :-)

>
> >
> > I'm going to propose that as a topic for LSF/MM soon, once I get to it.
> >
>
Ryan Roberts Jan. 10, 2024, 11:20 a.m. UTC | #33
On 10/01/2024 11:00, David Hildenbrand wrote:
> On 10.01.24 11:55, Ryan Roberts wrote:
>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>> wrote:
>>>>>>>>>> ...
>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>
>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>> running
>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>> numbers
>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>
>>>>>>>> Just to confirm, you're expecting these "global" stats to be truly global
>>>>>>>> and not
>>>>>>>> per-container? (asking because you explicitly mentioned being in a
>>>>>>>> container).
>>>>>>>> If you want per-container, then you can probably just create the container
>>>>>>>> in a
>>>>>>>> cgroup?
>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>
>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>> detailed stats.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>>
>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>>>> stats
>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>>> really
>>>>>>>> know exactly how to account mTHPs yet
>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>>> adding
>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>>> David
>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>>> cgroups
>>>>>>>> do live in sysfs).
>>>>>>>>
>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
>>>>>>>> to the
>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>>> what
>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>
>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>>> which
>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>
>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>> detailed
>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>>> they have gotten.
>>>>>>>>>
>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>
>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>>> location.
>>>>>>>>
>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>>> script
>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>>> think I
>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>> /proc/iomem,
>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>>> able to
>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>>> same
>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>
>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>>>> the
>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>>> virtual mapping information so the best it can do is tell us "how many of
>>>>>> each
>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>>>> are
>>>>>> fully or partially mapped or what their alignment is (all necessary if we
>>>>>> want
>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>> going to
>>>>>> be particularly useful.
>>>>>>
>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>> kernel;
>>>>>> if we want something equivalent to /proc/meminfo's
>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>>>> you
>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
>>>>>> But
>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>>> easy
>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>>> PTEs
>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>> process?".
>>>>>
>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>> 1. entire map
>>>>> 2. subpage's map
>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>
>>>>> This isn't a problem for us. And every time we do a partial unmap,
>>>>> we have an explicit
>>>>> cont_pte split which will decrease the entire map and increase the
>>>>> subpage's mapcount.
>>>>>
>>>>> but its downside is that we expose this info to mm-core.
>>>>
>>>> OK, but I think we have a slightly more generic situation going on with the
>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>>> PTE to determine if it's fully mapped? That works for your case where you only
>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if it's fully
>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>>> need to update that SW bit for every PTE on the full -> partial map
>>>> transition.
>>>
>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
>>>
>>
>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
>> we want to know what's fully mapped and what's not, then I don't see any way
>> other than by scanning the page tables and we might as well do that in user
>> space with this script.
>>
>> Although, I expect you will shortly make a proposal that is simple to implement
>> and prove me wrong ;-)
> 
> Unlikely :) As you said, once you have multiple folio sizes, it stops really
> making sense.
> 
> Assume you have a 128 kiB pagecache folio, and half of that is mapped. You can
> set cont-pte bits on that half and all is fine. Or AMD can benefit from its
> optimizations without the cont-pte bit and everything is fine.

Yes, but for debug and optimization, it's useful to know when THPs are
fully/partially mapped, when they are unaligned, etc. Anyway, the script does
that for us, and I think we are tending towards agreement that there are
unlikely to be any cost benefits in moving it into the kernel.

> 
> We want simple stats that tell us which folio sizes are actually allocated. For
> everything else, just scan the process to figure out what exactly is going on.
> 

Certainly that's much easier to do. But is it valuable? It might be if we also
keep stats for the number of failures to allocate the various sizes - then we
can see what percentage of high order allocation attempts are successful, which
is probably useful.
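
For what it's worth, if per-size allocation and fallback counters did exist
(say in debugfs, as suggested earlier in the thread), turning them into a
success rate is trivial. The file names below are purely invented for
illustration; no such files exist today:

--8<--
def alloc_success_pct(size_kb):
    # Hypothetical counter files -- paths are made up to illustrate the
    # calculation, not a real ABI.
    base = f'/sys/kernel/debug/mthp/hugepages-{size_kb}kB'
    with open(f'{base}/alloc_success') as f:
        ok = int(f.read())
    with open(f'{base}/alloc_fallback') as f:
        fallback = int(f.read())
    attempts = ok + fallback
    return 100.0 * ok / attempts if attempts else 0.0
--8<--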
David Hildenbrand Jan. 10, 2024, 11:24 a.m. UTC | #34
On 10.01.24 12:20, Ryan Roberts wrote:
> On 10/01/2024 11:00, David Hildenbrand wrote:
>> On 10.01.24 11:55, Ryan Roberts wrote:
>>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>> ...
>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>>> running
>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>>> numbers
>>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>>
>>>>>>>>> Just to confirm, you're expecting these "global" stats to be truly global
>>>>>>>>> and not
>>>>>>>>> per-container? (asking because you explicitly mentioned being in a
>>>>>>>>> container).
>>>>>>>>> If you want per-container, then you can probably just create the container
>>>>>>>>> in a
>>>>>>>>> cgroup?
>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>>
>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>>> detailed stats.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>>>
>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>>>>> stats
>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>>>> really
>>>>>>>>> know exactly how to account mTHPs yet
>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>>>> adding
>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>>>> David
>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>>>> cgroups
>>>>>>>>> do live in sysfs).
>>>>>>>>>
>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
>>>>>>>>> to the
>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>>>> what
>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>>
>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>>>> which
>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>>
>>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>>> detailed
>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>>>> they have gotten.
>>>>>>>>>>
>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>>
>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>>>> location.
>>>>>>>>>
>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>>>> script
>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>>>> think I
>>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>>> /proc/iomem,
>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>>>> able to
>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>>>> same
>>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>>
>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>>>>> the
>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>>>> virtual mapping information so the best it can do is tell us "how many of
>>>>>>> each
>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>>>>> are
>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
>>>>>>> want
>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>>> going to
>>>>>>> be particularly useful.
>>>>>>>
>>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>>> kernel;
>>>>>>> if we want something equivalent to /proc/meminfo's
>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>>>>> you
>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
>>>>>>> But
>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>>>> easy
>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>>>> PTEs
>>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>>> process?".
>>>>>>
>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>>> 1. entire map
>>>>>> 2. subpage's map
>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>>
>>>>>> This isn't a problem for us. And every time we do a partial unmap,
>>>>>> we have an explicit
>>>>>> cont_pte split which will decrease the entire map and increase the
>>>>>> subpage's mapcount.
>>>>>>
>>>>>> but its downside is that we expose this info to mm-core.
>>>>>
>>>>> OK, but I think we have a slightly more generic situation going on with the
>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>>>> PTE to determine if it's fully mapped? That works for your case where you only
>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if it's fully
>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>>>> need to update that SW bit for every PTE on the full -> partial map
>>>>> transition.
>>>>
>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
>>>>
>>>
>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
>>> we want to know what's fully mapped and what's not, then I don't see any way
>>> other than by scanning the page tables and we might as well do that in user
>>> space with this script.
>>>
>>> Although, I expect you will shortly make a proposal that is simple to implement
>>> and prove me wrong ;-)
>>
>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
>> making sense.
>>
>> Assume you have a 128 kiB pagecache folio, and half of that is mapped. You can
>> set cont-pte bits on that half and all is fine. Or AMD can benefit from its
>> optimizations without the cont-pte bit and everything is fine.
> 
> Yes, but for debug and optimization, it's useful to know when THPs are
> fully/partially mapped, when they are unaligned, etc. Anyway, the script does
> that for us, and I think we are tending towards agreement that there are
> unlikely to be any cost benefits in moving it into the kernel.

Agreed. And just adding: while one process might map a folio 
unaligned/partial/ ... another one might map it aligned/fully. So this 
per-process scanning is really required (because per process stats per 
folio are pretty much out of scope :) ).

> 
>>
>> We want simple stats that tell us which folio sizes are actually allocated. For
>> everything else, just scan the process to figure out what exactly is going on.
>>
> 
> Certainly that's much easier to do. But is it valuable? It might be if we also
> keep stats for the number of failures to allocate the various sizes - then we
> can see what percentage of high order allocation attempts are successful, which
> is probably useful.

Agreed.
Barry Song Jan. 10, 2024, 11:38 a.m. UTC | #35
On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 10/01/2024 11:00, David Hildenbrand wrote:
> > On 10.01.24 11:55, Ryan Roberts wrote:
> >> On 10/01/2024 10:42, David Hildenbrand wrote:
> >>> On 10.01.24 11:38, Ryan Roberts wrote:
> >>>> On 10/01/2024 10:30, Barry Song wrote:
> >>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> On 10/01/2024 09:09, Barry Song wrote:
> >>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>
> >>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
> >>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
> >>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>> ...
> >>>>>>>>>>>> Hi Ryan,
> >>>>>>>>>>>>
> >>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
> >>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
> >>>>>>>>>>>> running
> >>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
> >>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
> >>>>>>>>>>>> numbers
> >>>>>>>>>>>> for the mTHPs across the whole machine.
> >>>>>>>>
> >>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
> >>>>>>>> and not
> >>>>>>>> per-container? (asking because you exploicitly mentioned being in a
> >>>>>>>> container).
> >>>>>>>> If you want per-container, then you can probably just create the container
> >>>>>>>> in a
> >>>>>>>> cgroup?
> >>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
> >>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
> >>>>>>>>>>>> just some quick runs: the global state would be convenient.
> >>>>>>>>
> >>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
> >>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
> >>>>>>>>>>> so may we add this statistics information in kernel just like
> >>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
> >>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
> >>>>>>>>>> next level of investigation. So feels odd to start with the more
> >>>>>>>>>> detailed stats.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> probably because this can be done without the modification of the kernel.
> >>>>>>>>
> >>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
> >>>>>>>> stats
> >>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
> >>>>>>>> really
> >>>>>>>> know exectly how to account mTHPs yet
> >>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
> >>>>>>>> adding
> >>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
> >>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
> >>>>>>>> David
> >>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
> >>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
> >>>>>>>> cgroups
> >>>>>>>> do live in sysfs).
> >>>>>>>>
> >>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
> >>>>>>>> to the
> >>>>>>>> "we need some stats" request and 2) provide a context in which to explore
> >>>>>>>> what
> >>>>>>>> the right stats are - this script can evolve without the ABI problem.
> >>>>>>>>
> >>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
> >>>>>>>>> which
> >>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
> >>>>>>>>> eg. libc_malloc, java heaps etc.
> >>>>>>>>>
> >>>>>>>>> Different vma types can have different anon_name. So I can use the
> >>>>>>>>> detailed
> >>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
> >>>>>>>>> they have gotten.
> >>>>>>>>>
> >>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
> >>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
> >>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
> >>>>>>>>>> values because this is still such an early feature.
> >>>>>>>>>>
> >>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
> >>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
> >>>>>>>>>> location.
> >>>>>>>>
> >>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
> >>>>>>>> script
> >>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
> >>>>>>>> think I
> >>>>>>>> should be able to determine all the physical memory ranges from
> >>>>>>>> /proc/iomem,
> >>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
> >>>>>>>> able to
> >>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
> >>>>>>>> same
> >>>>>>>> stats, but it will apply globally. What do you think?
> >>>>>>
> >>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
> >>>>>> the
> >>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
> >>>>>> virtual mapping information so the best it can do is tell us "how many of
> >>>>>> each
> >>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
> >>>>>> are
> >>>>>> fully or partially mapped or what their alignment is (all necessary if we
> >>>>>> want
> >>>>>> to know if they are contpte-mapped). So I don't think this approach is
> >>>>>> going to
> >>>>>> be particularly useful.
> >>>>>>
> >>>>>> And this is also the big problem if we want to gather stats inside the
> >>>>>> kernel;
> >>>>>> if we want something equivalant to /proc/meminfo's
> >>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
> >>>>>> allocation of the THP but also whether it is mapped. That's easy for
> >>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
> >>>>>> you
> >>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
> >>>>>> But
> >>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
> >>>>>> easy
> >>>>>> to increment, but you can do a partial unmap, so you would need to scan the
> >>>>>> PTEs
> >>>>>> to figure out if we are unmapping the first page of a previously
> >>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
> >>>>>> determine "is this folio fully and contiguously mapped in at least one
> >>>>>> process?".
> >>>>>
> >>>>> as OPPO's approach I shared to you before is maintaining two mapcount
> >>>>> 1. entire map
> >>>>> 2. subpage's map
> >>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
> >>>>>
> >>>>> This isn't a problem for us. and everytime if we do a partial unmap,
> >>>>> we have an explicit
> >>>>> cont_pte split which will decrease the entire map and increase the
> >>>>> subpage's mapcount.
> >>>>>
> >>>>> but its downside is that we expose this info to mm-core.
> >>>>
> >>>> OK, but I think we have a slightly more generic situation going on with the
> >>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
> >>>> PTE to determne if its fully mapped? That works for your case where you only
> >>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
> >>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
> >>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
> >>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
> >>>> need to update that SW bit for every PTE one the full -> partial map
> >>>> transition.
> >>>
> >>> Oh no. Let's not make everything more complicated for the purpose of some stats.
> >>>
> >>
> >> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
> >> we want to know what's fully mapped and what's not, then I don't see any way
> >> other than by scanning the page tables and we might as well do that in user
> >> space with this script.
> >>
> >> Although, I expect you will shortly make a proposal that is simple to implement
> >> and prove me wrong ;-)
> >
> > Unlikely :) As you said, once you have multiple folio sizes, it stops really
> > making sense.
> >
> > Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
> > set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
> > optimizations without the cont-pte bit and everything is fine.
>
> Yes, but for debug and optimization, its useful to know when THPs are
> fully/partially mapped, when they are unaligned etc. Anyway, the script does
> that for us, and I think we are tending towards agreement that there are
> unlikely to be any cost benefits by moving it into the kernel.

Frequent partial unmap can defeat the whole purpose of using large folios:
just imagine a large folio being split soon after it is formed; we lose
the performance gain and might even see a regression instead.

And this can be very frequent, for example when a userspace heap manager
releases memory page by page.

In our real product deployment, we might not care about the second partial
unmap, but we do care about the first partial unmap, as we can use it to
know whether a split has ever happened on a large folio. A partially
unmapped subpage is unlikely to be re-mapped back.

So I guess tracking the 1st unmap is probably enough, at least for my
product. I mean we care more about whether a partial unmap has ever happened
on a large folio than about exactly how it was partially unmapped :-)

>
> >
> > We want simple stats that tell us which folio sizes are actually allocated. For
> > everything else, just scan the process to figure out what exactly is going on.
> >
>
> Certainly that's much easier to do. But is it valuable? It might be if we also
> keep stats for the number of failures to allocate the various sizes - then we
> can see what percentage of high order allocation attempts are successful, which
> is probably useful.
>

Thanks
Barry
Ryan Roberts Jan. 10, 2024, 11:59 a.m. UTC | #36
On 10/01/2024 11:38, Barry Song wrote:
> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 10/01/2024 11:00, David Hildenbrand wrote:
>>> On 10.01.24 11:55, Ryan Roberts wrote:
>>>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>> ...
>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>>>> running
>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>>>> numbers
>>>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>>>
>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
>>>>>>>>>> and not
>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
>>>>>>>>>> container).
>>>>>>>>>> If you want per-container, then you can probably just create the container
>>>>>>>>>> in a
>>>>>>>>>> cgroup?
>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>>>
>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>>>> detailed stats.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>>>>
>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>>>>>> stats
>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>>>>> really
>>>>>>>>>> know exectly how to account mTHPs yet
>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>>>>> adding
>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>>>>> David
>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>>>>> cgroups
>>>>>>>>>> do live in sysfs).
>>>>>>>>>>
>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
>>>>>>>>>> to the
>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>>>>> what
>>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>>>
>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>>>>> which
>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>>>
>>>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>>>> detailed
>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>>>>> they have gotten.
>>>>>>>>>>>
>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>>>
>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>>>>> location.
>>>>>>>>>>
>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>>>>> script
>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>>>>> think I
>>>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>>>> /proc/iomem,
>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>>>>> able to
>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>>>>> same
>>>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>>>
>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>>>>>> the
>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>>>>> virtual mapping information so the best it can do is tell us "how many of
>>>>>>>> each
>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>>>>>> are
>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
>>>>>>>> want
>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>>>> going to
>>>>>>>> be particularly useful.
>>>>>>>>
>>>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>>>> kernel;
>>>>>>>> if we want something equivalant to /proc/meminfo's
>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>>>>>> you
>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
>>>>>>>> But
>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>>>>> easy
>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>>>>> PTEs
>>>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>>>> process?".
>>>>>>>
>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>>>> 1. entire map
>>>>>>> 2. subpage's map
>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>>>
>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
>>>>>>> we have an explicit
>>>>>>> cont_pte split which will decrease the entire map and increase the
>>>>>>> subpage's mapcount.
>>>>>>>
>>>>>>> but its downside is that we expose this info to mm-core.
>>>>>>
>>>>>> OK, but I think we have a slightly more generic situation going on with the
>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>>>>> PTE to determne if its fully mapped? That works for your case where you only
>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>>>>> need to update that SW bit for every PTE one the full -> partial map
>>>>>> transition.
>>>>>
>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
>>>>>
>>>>
>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
>>>> we want to know what's fully mapped and what's not, then I don't see any way
>>>> other than by scanning the page tables and we might as well do that in user
>>>> space with this script.
>>>>
>>>> Although, I expect you will shortly make a proposal that is simple to implement
>>>> and prove me wrong ;-)
>>>
>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
>>> making sense.
>>>
>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
>>> optimizations without the cont-pte bit and everything is fine.
>>
>> Yes, but for debug and optimization, its useful to know when THPs are
>> fully/partially mapped, when they are unaligned etc. Anyway, the script does
>> that for us, and I think we are tending towards agreement that there are
>> unlikely to be any cost benefits by moving it into the kernel.
> 
> frequent partial unmap can defeat all purpose for us to use large folios.
> just imagine a large folio can soon be splitted after it is formed. we lose
> the performance gain and might get regression instead.

nit: just because a THP gets partially unmapped in a process doesn't mean it
gets split into order-0 pages. If the folio still has all its pages mapped at
least once then no further action is taken. If the page being unmapped was the
last mapping of that page, then the THP is put on the deferred split queue, so
that it can be split in future if needed.
> 
> and this can be very frequent, for example, one userspace heap management
> is releasing memory page by page.
> 
> In our real product deployment, we might not care about the second partial
> unmapped,  we do care about the first partial unmapped as we can use this
> to know if split has ever happened on this large folios. an partial unmapped
> subpage can be unlikely re-mapped back.
> 
> so i guess 1st unmap is probably enough, at least for my product. I mean we
> care about if partial unmap has ever happened on a large folio more than how
> they are exactly partially unmapped :-)

I'm not sure what you are suggesting here? A global boolean that tells you if
any folio in the system has ever been partially unmapped? That will almost
certainly always be true, even for a very well tuned system.

> 
>>
>>>
>>> We want simple stats that tell us which folio sizes are actually allocated. For
>>> everything else, just scan the process to figure out what exactly is going on.
>>>
>>
>> Certainly that's much easier to do. But is it valuable? It might be if we also
>> keep stats for the number of failures to allocate the various sizes - then we
>> can see what percentage of high order allocation attempts are successful, which
>> is probably useful.
>>
> 
> Thanks
> Barry
Barry Song Jan. 10, 2024, 12:05 p.m. UTC | #37
On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 10/01/2024 11:38, Barry Song wrote:
> > On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 10/01/2024 11:00, David Hildenbrand wrote:
> >>> On 10.01.24 11:55, Ryan Roberts wrote:
> >>>> On 10/01/2024 10:42, David Hildenbrand wrote:
> >>>>> On 10.01.24 11:38, Ryan Roberts wrote:
> >>>>>> On 10/01/2024 10:30, Barry Song wrote:
> >>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>
> >>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
> >>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
> >>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
> >>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>> ...
> >>>>>>>>>>>>>> Hi Ryan,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
> >>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
> >>>>>>>>>>>>>> running
> >>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
> >>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
> >>>>>>>>>>>>>> numbers
> >>>>>>>>>>>>>> for the mTHPs across the whole machine.
> >>>>>>>>>>
> >>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
> >>>>>>>>>> and not
> >>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
> >>>>>>>>>> container).
> >>>>>>>>>> If you want per-container, then you can probably just create the container
> >>>>>>>>>> in a
> >>>>>>>>>> cgroup?
> >>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
> >>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
> >>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
> >>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
> >>>>>>>>>>>>> so may we add this statistics information in kernel just like
> >>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
> >>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
> >>>>>>>>>>>> next level of investigation. So feels odd to start with the more
> >>>>>>>>>>>> detailed stats.
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> probably because this can be done without the modification of the kernel.
> >>>>>>>>>>
> >>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
> >>>>>>>>>> stats
> >>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
> >>>>>>>>>> really
> >>>>>>>>>> know exectly how to account mTHPs yet
> >>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
> >>>>>>>>>> adding
> >>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
> >>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
> >>>>>>>>>> David
> >>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
> >>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
> >>>>>>>>>> cgroups
> >>>>>>>>>> do live in sysfs).
> >>>>>>>>>>
> >>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
> >>>>>>>>>> to the
> >>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
> >>>>>>>>>> what
> >>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
> >>>>>>>>>>
> >>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
> >>>>>>>>>>> which
> >>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
> >>>>>>>>>>> eg. libc_malloc, java heaps etc.
> >>>>>>>>>>>
> >>>>>>>>>>> Different vma types can have different anon_name. So I can use the
> >>>>>>>>>>> detailed
> >>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
> >>>>>>>>>>> they have gotten.
> >>>>>>>>>>>
> >>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
> >>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
> >>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
> >>>>>>>>>>>> values because this is still such an early feature.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
> >>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
> >>>>>>>>>>>> location.
> >>>>>>>>>>
> >>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
> >>>>>>>>>> script
> >>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
> >>>>>>>>>> think I
> >>>>>>>>>> should be able to determine all the physical memory ranges from
> >>>>>>>>>> /proc/iomem,
> >>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
> >>>>>>>>>> able to
> >>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
> >>>>>>>>>> same
> >>>>>>>>>> stats, but it will apply globally. What do you think?
> >>>>>>>>
> >>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
> >>>>>>>> the
> >>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
> >>>>>>>> virtual mapping information so the best it can do is tell us "how many of
> >>>>>>>> each
> >>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
> >>>>>>>> are
> >>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
> >>>>>>>> want
> >>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
> >>>>>>>> going to
> >>>>>>>> be particularly useful.
> >>>>>>>>
> >>>>>>>> And this is also the big problem if we want to gather stats inside the
> >>>>>>>> kernel;
> >>>>>>>> if we want something equivalant to /proc/meminfo's
> >>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
> >>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
> >>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
> >>>>>>>> you
> >>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
> >>>>>>>> But
> >>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
> >>>>>>>> easy
> >>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
> >>>>>>>> PTEs
> >>>>>>>> to figure out if we are unmapping the first page of a previously
> >>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
> >>>>>>>> determine "is this folio fully and contiguously mapped in at least one
> >>>>>>>> process?".
> >>>>>>>
> >>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
> >>>>>>> 1. entire map
> >>>>>>> 2. subpage's map
> >>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
> >>>>>>>
> >>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
> >>>>>>> we have an explicit
> >>>>>>> cont_pte split which will decrease the entire map and increase the
> >>>>>>> subpage's mapcount.
> >>>>>>>
> >>>>>>> but its downside is that we expose this info to mm-core.
> >>>>>>
> >>>>>> OK, but I think we have a slightly more generic situation going on with the
> >>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
> >>>>>> PTE to determne if its fully mapped? That works for your case where you only
> >>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
> >>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
> >>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
> >>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
> >>>>>> need to update that SW bit for every PTE one the full -> partial map
> >>>>>> transition.
> >>>>>
> >>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
> >>>>>
> >>>>
> >>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
> >>>> we want to know what's fully mapped and what's not, then I don't see any way
> >>>> other than by scanning the page tables and we might as well do that in user
> >>>> space with this script.
> >>>>
> >>>> Although, I expect you will shortly make a proposal that is simple to implement
> >>>> and prove me wrong ;-)
> >>>
> >>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
> >>> making sense.
> >>>
> >>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
> >>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
> >>> optimizations without the cont-pte bit and everything is fine.
> >>
> >> Yes, but for debug and optimization, its useful to know when THPs are
> >> fully/partially mapped, when they are unaligned etc. Anyway, the script does
> >> that for us, and I think we are tending towards agreement that there are
> >> unlikely to be any cost benefits by moving it into the kernel.
> >
> > frequent partial unmap can defeat all purpose for us to use large folios.
> > just imagine a large folio can soon be splitted after it is formed. we lose
> > the performance gain and might get regression instead.
>
> nit: just because a THP gets partially unmapped in a process doesn't mean it
> gets split into order-0 pages. If the folio still has all its pages mapped at
> least once then no further action is taken. If the page being unmapped was the
> last mapping of that page, then the THP is put on the deferred split queue, so
> that it can be split in future if needed.

Yes, that is exactly what the kernel is doing, but it is not that
important for resolving our performance issues.

> >
> > and this can be very frequent, for example, one userspace heap management
> > is releasing memory page by page.
> >
> > In our real product deployment, we might not care about the second partial
> > unmapped,  we do care about the first partial unmapped as we can use this
> > to know if split has ever happened on this large folios. an partial unmapped
> > subpage can be unlikely re-mapped back.
> >
> > so i guess 1st unmap is probably enough, at least for my product. I mean we
> > care about if partial unmap has ever happened on a large folio more than how
> > they are exactly partially unmapped :-)
>
> I'm not sure what you are suggesting here? A global boolean that tells you if
> any folio in the system has ever been partially unmapped? That will almost
> certainly always be true, even for a very well tuned system.
>
> >
> >>
> >>>
> >>> We want simple stats that tell us which folio sizes are actually allocated. For
> >>> everything else, just scan the process to figure out what exactly is going on.
> >>>
> >>
> >> Certainly that's much easier to do. But is it valuable? It might be if we also
> >> keep stats for the number of failures to allocate the various sizes - then we
> >> can see what percentage of high order allocation attempts are successful, which
> >> is probably useful.

My point is that we split large folios into two simple categories:
1. large folios which have never been partially unmapped;
2. large folios which have ever been partially unmapped.

We can totally ignore all details except "never" and "ever". It won't be
perfectly accurate, but it has been useful enough, at least for my product,
based on our experience deploying large folios on millions of real Android
phones.

In real product deployments we modified userspace a lot to decrease
category 2 as much as possible, so we did observe category 2 very often in
our past debugging.

 Thanks
 Barry
David Hildenbrand Jan. 10, 2024, 12:12 p.m. UTC | #38
On 10.01.24 13:05, Barry Song wrote:
> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 10/01/2024 11:38, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 10/01/2024 11:00, David Hildenbrand wrote:
>>>>> On 10.01.24 11:55, Ryan Roberts wrote:
>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>>>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>>>>>> numbers
>>>>>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>>>>>
>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
>>>>>>>>>>>> and not
>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
>>>>>>>>>>>> container).
>>>>>>>>>>>> If you want per-container, then you can probably just create the container
>>>>>>>>>>>> in a
>>>>>>>>>>>> cgroup?
>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>>>>>> detailed stats.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>>>>>>>> stats
>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>>>>>>> really
>>>>>>>>>>>> know exectly how to account mTHPs yet
>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>>>>>>> adding
>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>>>>>>> David
>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>>>>>>> cgroups
>>>>>>>>>>>> do live in sysfs).
>>>>>>>>>>>>
>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
>>>>>>>>>>>> to the
>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>>>>>>> what
>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>>>>>
>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>>>>>>> which
>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>>>>>> detailed
>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>>>>>>> they have gotten.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>>>>>>> location.
>>>>>>>>>>>>
>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>>>>>>> script
>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>>>>>>> think I
>>>>>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>>>>>> /proc/iomem,
>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>>>>>>> able to
>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>>>>>>> same
>>>>>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>>>>>
>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>>>>>>>> the
>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of
>>>>>>>>>> each
>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>>>>>>>> are
>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
>>>>>>>>>> want
>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>>>>>> going to
>>>>>>>>>> be particularly useful.
>>>>>>>>>>
>>>>>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>>>>>> kernel;
>>>>>>>>>> if we want something equivalant to /proc/meminfo's
>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>>>>>>>> you
>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
>>>>>>>>>> But
>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>>>>>>> easy
>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>>>>>>> PTEs
>>>>>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>>>>>> process?".
>>>>>>>>>
>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>>>>>> 1. entire map
>>>>>>>>> 2. subpage's map
>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>>>>>
>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
>>>>>>>>> we have an explicit
>>>>>>>>> cont_pte split which will decrease the entire map and increase the
>>>>>>>>> subpage's mapcount.
>>>>>>>>>
>>>>>>>>> but its downside is that we expose this info to mm-core.
>>>>>>>>
>>>>>>>> OK, but I think we have a slightly more generic situation going on with the
>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only
>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>>>>>>> need to update that SW bit for every PTE one the full -> partial map
>>>>>>>> transition.
>>>>>>>
>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
>>>>>>>
>>>>>>
>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
>>>>>> we want to know what's fully mapped and what's not, then I don't see any way
>>>>>> other than by scanning the page tables and we might as well do that in user
>>>>>> space with this script.
>>>>>>
>>>>>> Although, I expect you will shortly make a proposal that is simple to implement
>>>>>> and prove me wrong ;-)
>>>>>
>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
>>>>> making sense.
>>>>>
>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
>>>>> optimizations without the cont-pte bit and everything is fine.
>>>>
>>>> Yes, but for debug and optimization, its useful to know when THPs are
>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does
>>>> that for us, and I think we are tending towards agreement that there are
>>>> unlikely to be any cost benefits by moving it into the kernel.
>>>
>>> frequent partial unmap can defeat all purpose for us to use large folios.
>>> just imagine a large folio can soon be splitted after it is formed. we lose
>>> the performance gain and might get regression instead.
>>
>> nit: just because a THP gets partially unmapped in a process doesn't mean it
>> gets split into order-0 pages. If the folio still has all its pages mapped at
>> least once then no further action is taken. If the page being unmapped was the
>> last mapping of that page, then the THP is put on the deferred split queue, so
>> that it can be split in future if needed.
> 
> yes. That is exactly what the kernel is doing, but this is not so
> important for us
> to resolve performance issues.
> 
>>>
>>> and this can be very frequent, for example, one userspace heap management
>>> is releasing memory page by page.
>>>
>>> In our real product deployment, we might not care about the second partial
>>> unmapped,  we do care about the first partial unmapped as we can use this
>>> to know if split has ever happened on this large folios. an partial unmapped
>>> subpage can be unlikely re-mapped back.
>>>
>>> so i guess 1st unmap is probably enough, at least for my product. I mean we
>>> care about if partial unmap has ever happened on a large folio more than how
>>> they are exactly partially unmapped :-)
>>
>> I'm not sure what you are suggesting here? A global boolean that tells you if
>> any folio in the system has ever been partially unmapped? That will almost
>> certainly always be true, even for a very well tuned system.
>>
>>>
>>>>
>>>>>
>>>>> We want simple stats that tell us which folio sizes are actually allocated. For
>>>>> everything else, just scan the process to figure out what exactly is going on.
>>>>>
>>>>
>>>> Certainly that's much easier to do. But is it valuable? It might be if we also
>>>> keep stats for the number of failures to allocate the various sizes - then we
>>>> can see what percentage of high order allocation attempts are successful, which
>>>> is probably useful.
> 
> My point is that we split large folios into two simple categories,
> 1. large folios which have never been partially unmapped
> 2. large folios which have ever been partially unmapped.
> 

With the rmap batching stuff I am working on, you get the complete thing
unmapped in most cases (as long as it sits within one VMA) -- for example
during munmap()/exit()/etc.

Only when multiple VMAs are involved, or when someone COWs /
MADV_DONTNEEDs / munmaps some subpages, do you get a single page of a
large folio unmapped.

That could be used to simply flag the folio in your case.

But I'm not sure that has to be handled at the rmap level. It could be
handled higher up in the call chain (esp. for MADV_DONTNEED).
Zi Yan Jan. 10, 2024, 3:19 p.m. UTC | #39
On 10 Jan 2024, at 7:12, David Hildenbrand wrote:

> On 10.01.24 13:05, Barry Song wrote:
>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 10/01/2024 11:38, Barry Song wrote:
>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 10/01/2024 11:00, David Hildenbrand wrote:
>>>>>> On 10.01.24 11:55, Ryan Roberts wrote:
>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>>>>>>> numbers
>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
>>>>>>>>>>>>> and not
>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
>>>>>>>>>>>>> container).
>>>>>>>>>>>>> If you want per-container, then you can probably just create the container
>>>>>>>>>>>>> in a
>>>>>>>>>>>>> cgroup?
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>>>>>>> detailed stats.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>>>>>>>>> stats
>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>>>>>>>> really
>>>>>>>>>>>>> know exectly how to account mTHPs yet
>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>>>>>>>> adding
>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>>>>>>>> David
>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>>>>>>>> cgroups
>>>>>>>>>>>>> do live in sysfs).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
>>>>>>>>>>>>> to the
>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>>>>>>>> what
>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>>>>>>>> which
>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>>>>>>> detailed
>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>>>>>>>> they have gotten.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>>>>>>>> location.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>>>>>>>> script
>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>>>>>>>> think I
>>>>>>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>>>>>>> /proc/iomem,
>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>>>>>>>> able to
>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>>>>>>>> same
>>>>>>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>>>>>>
>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>>>>>>>>> the
>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of
>>>>>>>>>>> each
>>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>>>>>>>>> are
>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
>>>>>>>>>>> want
>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>>>>>>> going to
>>>>>>>>>>> be particularly useful.
>>>>>>>>>>>
>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>>>>>>> kernel;
>>>>>>>>>>> if we want something equivalant to /proc/meminfo's
>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>>>>>>>>> you
>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
>>>>>>>>>>> But
>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>>>>>>>> easy
>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>>>>>>>> PTEs
>>>>>>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>>>>>>> process?".
>>>>>>>>>>
>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>>>>>>> 1. entire map
>>>>>>>>>> 2. subpage's map
>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>>>>>>
>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
>>>>>>>>>> we have an explicit
>>>>>>>>>> cont_pte split which will decrease the entire map and increase the
>>>>>>>>>> subpage's mapcount.
>>>>>>>>>>
>>>>>>>>>> but its downside is that we expose this info to mm-core.
>>>>>>>>>
>>>>>>>>> OK, but I think we have a slightly more generic situation going on with the
>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only
>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map
>>>>>>>>> transition.
>>>>>>>>
>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
>>>>>>>>
>>>>>>>
>>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
>>>>>>> we want to know what's fully mapped and what's not, then I don't see any way
>>>>>>> other than by scanning the page tables and we might as well do that in user
>>>>>>> space with this script.
>>>>>>>
>>>>>>> Although, I expect you will shortly make a proposal that is simple to implement
>>>>>>> and prove me wrong ;-)
>>>>>>
>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
>>>>>> making sense.
>>>>>>
>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
>>>>>> optimizations without the cont-pte bit and everything is fine.
>>>>>
>>>>> Yes, but for debug and optimization, its useful to know when THPs are
>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does
>>>>> that for us, and I think we are tending towards agreement that there are
>>>>> unlikely to be any cost benefits by moving it into the kernel.
>>>>
>>>> frequent partial unmap can defeat all purpose for us to use large folios.
>>>> just imagine a large folio can soon be splitted after it is formed. we lose
>>>> the performance gain and might get regression instead.
>>>
>>> nit: just because a THP gets partially unmapped in a process doesn't mean it
>>> gets split into order-0 pages. If the folio still has all its pages mapped at
>>> least once then no further action is taken. If the page being unmapped was the
>>> last mapping of that page, then the THP is put on the deferred split queue, so
>>> that it can be split in future if needed.
>>
>> yes. That is exactly what the kernel is doing, but this is not so
>> important for us
>> to resolve performance issues.
>>
>>>>
>>>> and this can be very frequent, for example, one userspace heap management
>>>> is releasing memory page by page.
>>>>
>>>> In our real product deployment, we might not care about the second partial
>>>> unmapped,  we do care about the first partial unmapped as we can use this
>>>> to know if split has ever happened on this large folios. an partial unmapped
>>>> subpage can be unlikely re-mapped back.
>>>>
>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we
>>>> care about if partial unmap has ever happened on a large folio more than how
>>>> they are exactly partially unmapped :-)
>>>
>>> I'm not sure what you are suggesting here? A global boolean that tells you if
>>> any folio in the system has ever been partially unmapped? That will almost
>>> certainly always be true, even for a very well tuned system.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> We want simple stats that tell us which folio sizes are actually allocated. For
>>>>>> everything else, just scan the process to figure out what exactly is going on.
>>>>>>
>>>>>
>>>>> Certainly that's much easier to do. But is it valuable? It might be if we also
>>>>> keep stats for the number of failures to allocate the various sizes - then we
>>>>> can see what percentage of high order allocation attempts are successful, which
>>>>> is probably useful.
>>
>> My point is that we split large folios into two simple categories,
>> 1. large folios which have never been partially unmapped
>> 2. large folios which have ever been partially unmapped.
>>
>
> With the rmap batching stuff I am working on, you get the complete thing unmapped in most cases (as long as they are in one VMA) -- for example during munmap()/exit()/etc.

IIUC, there are two cases:

1. munmap() of a range within a VMA: rmap batching can avoid temporary partially unmapped folios, since it performs the range operation as a whole.

2. Barry has a case where userspace, e.g., the heap management, releases
memory page by page; rmap batching cannot help there unless either userspace
batches its memory releases or the kernel delays and aggregates these
memory-releasing syscalls.
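
To make the two cases concrete, here is a rough userspace sketch of the two
release patterns (the 64K folio size is only an assumed example for
illustration, not something the kernel guarantees):

/*
 * Illustrative only: contrasts the two release patterns discussed above.
 * FOLIO_SIZE is an assumption for the example, not a kernel constant.
 */
#include <sys/mman.h>
#include <unistd.h>

#define FOLIO_SIZE (64 * 1024)	/* assume a 64K mTHP backs this range */

static void release_whole_range(void *buf)
{
	/* Case 1: one call covering the folio; rmap batching can unmap it as a unit. */
	munmap(buf, FOLIO_SIZE);
}

static void release_page_by_page(void *buf)
{
	/* Case 2: heap-style page-by-page release; each call partially unmaps the folio. */
	long page = sysconf(_SC_PAGESIZE);

	for (long off = 0; off < FOLIO_SIZE; off += page)
		munmap((char *)buf + off, page);
}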



--
Best Regards,
Yan, Zi
David Hildenbrand Jan. 10, 2024, 3:27 p.m. UTC | #40
On 10.01.24 16:19, Zi Yan wrote:
> On 10 Jan 2024, at 7:12, David Hildenbrand wrote:
> 
>> On 10.01.24 13:05, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 10/01/2024 11:38, Barry Song wrote:
>>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 10/01/2024 11:00, David Hildenbrand wrote:
>>>>>>> On 10.01.24 11:55, Ryan Roberts wrote:
>>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>>>>>>>> numbers
>>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
>>>>>>>>>>>>>> and not
>>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
>>>>>>>>>>>>>> container).
>>>>>>>>>>>>>> If you want per-container, then you can probably just create the container
>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>> cgroup?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>>>>>>>> detailed stats.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>>>>>>>>>> stats
>>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>>>>>>>>> really
>>>>>>>>>>>>>> know exectly how to account mTHPs yet
>>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>>>>>>>>> David
>>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>>>>>>>>> cgroups
>>>>>>>>>>>>>> do live in sysfs).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>>>>>>>>> what
>>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>>>>>>>> detailed
>>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>>>>>>>>> they have gotten.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>>>>>>>>> location.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>>>>>>>>> script
>>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>>>>>>>>> think I
>>>>>>>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>>>>>>>> /proc/iomem,
>>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>>>>>>>>> able to
>>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>>>>>>>>> same
>>>>>>>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>>>>>>>>>> the
>>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of
>>>>>>>>>>>> each
>>>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>>>>>>>>>> are
>>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
>>>>>>>>>>>> want
>>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>>>>>>>> going to
>>>>>>>>>>>> be particularly useful.
>>>>>>>>>>>>
>>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>>>>>>>> kernel;
>>>>>>>>>>>> if we want something equivalant to /proc/meminfo's
>>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>>>>>>>>>> you
>>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
>>>>>>>>>>>> But
>>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>>>>>>>>> easy
>>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>>>>>>>>> PTEs
>>>>>>>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>>>>>>>> process?".
>>>>>>>>>>>
>>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>>>>>>>> 1. entire map
>>>>>>>>>>> 2. subpage's map
>>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>>>>>>>
>>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
>>>>>>>>>>> we have an explicit
>>>>>>>>>>> cont_pte split which will decrease the entire map and increase the
>>>>>>>>>>> subpage's mapcount.
>>>>>>>>>>>
>>>>>>>>>>> but its downside is that we expose this info to mm-core.
>>>>>>>>>>
>>>>>>>>>> OK, but I think we have a slightly more generic situation going on with the
>>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only
>>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
>>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map
>>>>>>>>>> transition.
>>>>>>>>>
>>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
>>>>>>>> we want to know what's fully mapped and what's not, then I don't see any way
>>>>>>>> other than by scanning the page tables and we might as well do that in user
>>>>>>>> space with this script.
>>>>>>>>
>>>>>>>> Although, I expect you will shortly make a proposal that is simple to implement
>>>>>>>> and prove me wrong ;-)
>>>>>>>
>>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
>>>>>>> making sense.
>>>>>>>
>>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
>>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
>>>>>>> optimizations without the cont-pte bit and everything is fine.
>>>>>>
>>>>>> Yes, but for debug and optimization, its useful to know when THPs are
>>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does
>>>>>> that for us, and I think we are tending towards agreement that there are
>>>>>> unlikely to be any cost benefits by moving it into the kernel.
>>>>>
>>>>> frequent partial unmap can defeat all purpose for us to use large folios.
>>>>> just imagine a large folio can soon be splitted after it is formed. we lose
>>>>> the performance gain and might get regression instead.
>>>>
>>>> nit: just because a THP gets partially unmapped in a process doesn't mean it
>>>> gets split into order-0 pages. If the folio still has all its pages mapped at
>>>> least once then no further action is taken. If the page being unmapped was the
>>>> last mapping of that page, then the THP is put on the deferred split queue, so
>>>> that it can be split in future if needed.
>>>
>>> yes. That is exactly what the kernel is doing, but this is not so
>>> important for us
>>> to resolve performance issues.
>>>
>>>>>
>>>>> and this can be very frequent, for example, one userspace heap management
>>>>> is releasing memory page by page.
>>>>>
>>>>> In our real product deployment, we might not care about the second partial
>>>>> unmapped,  we do care about the first partial unmapped as we can use this
>>>>> to know if split has ever happened on this large folios. an partial unmapped
>>>>> subpage can be unlikely re-mapped back.
>>>>>
>>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we
>>>>> care about if partial unmap has ever happened on a large folio more than how
>>>>> they are exactly partially unmapped :-)
>>>>
>>>> I'm not sure what you are suggesting here? A global boolean that tells you if
>>>> any folio in the system has ever been partially unmapped? That will almost
>>>> certainly always be true, even for a very well tuned system.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> We want simple stats that tell us which folio sizes are actually allocated. For
>>>>>>> everything else, just scan the process to figure out what exactly is going on.
>>>>>>>
>>>>>>
>>>>>> Certainly that's much easier to do. But is it valuable? It might be if we also
>>>>>> keep stats for the number of failures to allocate the various sizes - then we
>>>>>> can see what percentage of high order allocation attempts are successful, which
>>>>>> is probably useful.
>>>
>>> My point is that we split large folios into two simple categories,
>>> 1. large folios which have never been partially unmapped
>>> 2. large folios which have ever been partially unmapped.
>>>
>>
>> With the rmap batching stuff I am working on, you get the complete thing unmapped in most cases (as long as they are in one VMA) -- for example during munmap()/exit()/etc.
> 
> IIUC, there are two cases:
> 
> 1. munmap() of a range within a VMA: rmap batching can avoid temporary partially unmapped folios, since it performs the range operation as a whole.
> 
> 2. Barry has a case where userspace, e.g., the heap management, releases
> memory page by page; rmap batching cannot help there unless either userspace
> batches its memory releases or the kernel delays and aggregates these
> memory-releasing syscalls.

Exactly. And for 2., you immediately know that someone is partially
unmapping a large folio, at least temporarily - compared to doing a
MADV_DONTNEED that covers a whole large folio (e.g., a THP).
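
FWIW, a userspace allocator could avoid the page-by-page pattern by rounding
its MADV_DONTNEED calls to whole, naturally aligned folio-sized blocks. A
minimal sketch, assuming a 64K folio size (the size and the helper name are
purely illustrative):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define FOLIO_SIZE 0x10000UL	/* assumed 64K folio size, illustrative only */

/* Only MADV_DONTNEED the whole, aligned folios inside [addr, addr + len). */
static void dontneed_whole_folios(void *addr, size_t len)
{
	uintptr_t start = ((uintptr_t)addr + FOLIO_SIZE - 1) & ~(FOLIO_SIZE - 1);
	uintptr_t end = ((uintptr_t)addr + len) & ~(FOLIO_SIZE - 1);

	if (end > start)
		madvise((void *)start, end - start, MADV_DONTNEED);
}

Partial folios at either end are simply left mapped here; whether that is
acceptable is an allocator policy decision.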
Barry Song Jan. 10, 2024, 10:14 p.m. UTC | #41
On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 10/01/2024 11:38, Barry Song wrote:
> > On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 10/01/2024 11:00, David Hildenbrand wrote:
> >>> On 10.01.24 11:55, Ryan Roberts wrote:
> >>>> On 10/01/2024 10:42, David Hildenbrand wrote:
> >>>>> On 10.01.24 11:38, Ryan Roberts wrote:
> >>>>>> On 10/01/2024 10:30, Barry Song wrote:
> >>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>
> >>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
> >>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
> >>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
> >>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>> ...
> >>>>>>>>>>>>>> Hi Ryan,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
> >>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
> >>>>>>>>>>>>>> running
> >>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
> >>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
> >>>>>>>>>>>>>> numbers
> >>>>>>>>>>>>>> for the mTHPs across the whole machine.
> >>>>>>>>>>
> >>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
> >>>>>>>>>> and not
> >>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
> >>>>>>>>>> container).
> >>>>>>>>>> If you want per-container, then you can probably just create the container
> >>>>>>>>>> in a
> >>>>>>>>>> cgroup?
> >>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
> >>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
> >>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
> >>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
> >>>>>>>>>>>>> so may we add this statistics information in kernel just like
> >>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
> >>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
> >>>>>>>>>>>> next level of investigation. So feels odd to start with the more
> >>>>>>>>>>>> detailed stats.
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> probably because this can be done without the modification of the kernel.
> >>>>>>>>>>
> >>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
> >>>>>>>>>> stats
> >>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
> >>>>>>>>>> really
> >>>>>>>>>> know exectly how to account mTHPs yet
> >>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
> >>>>>>>>>> adding
> >>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
> >>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
> >>>>>>>>>> David
> >>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
> >>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
> >>>>>>>>>> cgroups
> >>>>>>>>>> do live in sysfs).
> >>>>>>>>>>
> >>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
> >>>>>>>>>> to the
> >>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
> >>>>>>>>>> what
> >>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
> >>>>>>>>>>
> >>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
> >>>>>>>>>>> which
> >>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
> >>>>>>>>>>> eg. libc_malloc, java heaps etc.
> >>>>>>>>>>>
> >>>>>>>>>>> Different vma types can have different anon_name. So I can use the
> >>>>>>>>>>> detailed
> >>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
> >>>>>>>>>>> they have gotten.
> >>>>>>>>>>>
> >>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
> >>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
> >>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
> >>>>>>>>>>>> values because this is still such an early feature.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
> >>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
> >>>>>>>>>>>> location.
> >>>>>>>>>>
> >>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
> >>>>>>>>>> script
> >>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
> >>>>>>>>>> think I
> >>>>>>>>>> should be able to determine all the physical memory ranges from
> >>>>>>>>>> /proc/iomem,
> >>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
> >>>>>>>>>> able to
> >>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
> >>>>>>>>>> same
> >>>>>>>>>> stats, but it will apply globally. What do you think?
> >>>>>>>>
> >>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
> >>>>>>>> the
> >>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
> >>>>>>>> virtual mapping information so the best it can do is tell us "how many of
> >>>>>>>> each
> >>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
> >>>>>>>> are
> >>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
> >>>>>>>> want
> >>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
> >>>>>>>> going to
> >>>>>>>> be particularly useful.
> >>>>>>>>
> >>>>>>>> And this is also the big problem if we want to gather stats inside the
> >>>>>>>> kernel;
> >>>>>>>> if we want something equivalant to /proc/meminfo's
> >>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
> >>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
> >>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
> >>>>>>>> you
> >>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
> >>>>>>>> But
> >>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
> >>>>>>>> easy
> >>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
> >>>>>>>> PTEs
> >>>>>>>> to figure out if we are unmapping the first page of a previously
> >>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
> >>>>>>>> determine "is this folio fully and contiguously mapped in at least one
> >>>>>>>> process?".
> >>>>>>>
> >>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
> >>>>>>> 1. entire map
> >>>>>>> 2. subpage's map
> >>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
> >>>>>>>
> >>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
> >>>>>>> we have an explicit
> >>>>>>> cont_pte split which will decrease the entire map and increase the
> >>>>>>> subpage's mapcount.
> >>>>>>>
> >>>>>>> but its downside is that we expose this info to mm-core.
> >>>>>>
> >>>>>> OK, but I think we have a slightly more generic situation going on with the
> >>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
> >>>>>> PTE to determne if its fully mapped? That works for your case where you only
> >>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
> >>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
> >>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
> >>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
> >>>>>> need to update that SW bit for every PTE one the full -> partial map
> >>>>>> transition.
> >>>>>
> >>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
> >>>>>
> >>>>
> >>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
> >>>> we want to know what's fully mapped and what's not, then I don't see any way
> >>>> other than by scanning the page tables and we might as well do that in user
> >>>> space with this script.
> >>>>
> >>>> Although, I expect you will shortly make a proposal that is simple to implement
> >>>> and prove me wrong ;-)
> >>>
> >>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
> >>> making sense.
> >>>
> >>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
> >>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
> >>> optimizations without the cont-pte bit and everything is fine.
> >>
> >> Yes, but for debug and optimization, its useful to know when THPs are
> >> fully/partially mapped, when they are unaligned etc. Anyway, the script does
> >> that for us, and I think we are tending towards agreement that there are
> >> unlikely to be any cost benefits by moving it into the kernel.
> >
> > frequent partial unmap can defeat all purpose for us to use large folios.
> > just imagine a large folio can soon be splitted after it is formed. we lose
> > the performance gain and might get regression instead.
>
> nit: just because a THP gets partially unmapped in a process doesn't mean it
> gets split into order-0 pages. If the folio still has all its pages mapped at
> least once then no further action is taken. If the page being unmapped was the
> last mapping of that page, then the THP is put on the deferred split queue, so
> that it can be split in future if needed.
> >
> > and this can be very frequent, for example, one userspace heap management
> > is releasing memory page by page.
> >
> > In our real product deployment, we might not care about the second partial
> > unmapped,  we do care about the first partial unmapped as we can use this
> > to know if split has ever happened on this large folios. an partial unmapped
> > subpage can be unlikely re-mapped back.
> >
> > so i guess 1st unmap is probably enough, at least for my product. I mean we
> > care about if partial unmap has ever happened on a large folio more than how
> > they are exactly partially unmapped :-)
>
> I'm not sure what you are suggesting here? A global boolean that tells you if
> any folio in the system has ever been partially unmapped? That will almost
> certainly always be true, even for a very well tuned system.

Not a global boolean but a per-folio boolean. If userspace maps a region and
does not manage it itself, then we are fine, as partial unmap/map is unlikely;
if userspace maps a region but manages it by itself, such as a heap, we can
end up with lots of partial map/unmap, which can lead to 3 problems:
1. a potential memory footprint increase: for example, while userspace
releases some pages in a folio, we might still keep the whole folio, as
frequently splitting the folio into base pages and releasing the unmapped
subpages might be too expensive.
2. if cont-pte is involved, frequent dropping of cont-pte / TLB shootdowns
might happen.
3. other maintenance overhead, such as splitting large folios etc.

We'd like to know how much of this partial mapping is happening, so that we
can either disable mTHP for these kinds of VMAs, or optimize userspace to
align its operations to the size of large folios.

On Android phones we monitor lots of apps, and found that some apps might do
things like:
1. mprotect on some pages within a large folio
2. mlock on some pages within a large folio
3. madv_free on some pages within a large folio
4. madv_pageout on some pages within a large folio.

It would be good to have a per-folio boolean so we can see how badly
userspace is breaking up large folios; for example, if more than 50% of the
folios in a VMA have this problem, we can detect that and take some action.
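
A very rough sketch of the kind of per-folio bookkeeping I have in mind -
everything below is hypothetical (the struct, the sticky flag and the counter
are illustrative, not existing kernel code):

#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the real folio; only the fields the idea needs. */
struct folio_stub {
	unsigned int order;		/* folio size = base page size << order */
	atomic_bool partially_unmapped;	/* sticky: set on the first partial unmap */
};

/* Aggregate "kB of large folios that have ever been partially unmapped". */
static atomic_long partially_unmapped_kb;

/* Called from the (hypothetical) partial-unmap path. */
static void note_partial_unmap(struct folio_stub *folio, unsigned long base_page_kb)
{
	/* Only the first partial unmap of this folio bumps the counter. */
	if (!atomic_exchange(&folio->partially_unmapped, true))
		atomic_fetch_add(&partially_unmapped_kb, base_page_kb << folio->order);
}

A per-VMA ratio like the "more than 50% of folios" check above could then be
derived by walking the folios mapped by that VMA.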

Thanks
Barry
Barry Song Jan. 10, 2024, 11:34 p.m. UTC | #42
On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 10/01/2024 11:00, David Hildenbrand wrote:
> > On 10.01.24 11:55, Ryan Roberts wrote:
> >> On 10/01/2024 10:42, David Hildenbrand wrote:
> >>> On 10.01.24 11:38, Ryan Roberts wrote:
> >>>> On 10/01/2024 10:30, Barry Song wrote:
> >>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> On 10/01/2024 09:09, Barry Song wrote:
> >>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>
> >>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
> >>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
> >>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>> ...
> >>>>>>>>>>>> Hi Ryan,
> >>>>>>>>>>>>
> >>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
> >>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
> >>>>>>>>>>>> running
> >>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
> >>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
> >>>>>>>>>>>> numbers
> >>>>>>>>>>>> for the mTHPs across the whole machine.
> >>>>>>>>
> >>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
> >>>>>>>> and not
> >>>>>>>> per-container? (asking because you exploicitly mentioned being in a
> >>>>>>>> container).
> >>>>>>>> If you want per-container, then you can probably just create the container
> >>>>>>>> in a
> >>>>>>>> cgroup?
> >>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
> >>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
> >>>>>>>>>>>> just some quick runs: the global state would be convenient.
> >>>>>>>>
> >>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
> >>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
> >>>>>>>>>>> so may we add this statistics information in kernel just like
> >>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
> >>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
> >>>>>>>>>> next level of investigation. So feels odd to start with the more
> >>>>>>>>>> detailed stats.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> probably because this can be done without the modification of the kernel.
> >>>>>>>>
> >>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
> >>>>>>>> stats
> >>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
> >>>>>>>> really
> >>>>>>>> know exectly how to account mTHPs yet
> >>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
> >>>>>>>> adding
> >>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
> >>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
> >>>>>>>> David
> >>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
> >>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
> >>>>>>>> cgroups
> >>>>>>>> do live in sysfs).
> >>>>>>>>
> >>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
> >>>>>>>> to the
> >>>>>>>> "we need some stats" request and 2) provide a context in which to explore
> >>>>>>>> what
> >>>>>>>> the right stats are - this script can evolve without the ABI problem.
> >>>>>>>>
> >>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
> >>>>>>>>> which
> >>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
> >>>>>>>>> eg. libc_malloc, java heaps etc.
> >>>>>>>>>
> >>>>>>>>> Different vma types can have different anon_name. So I can use the
> >>>>>>>>> detailed
> >>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
> >>>>>>>>> they have gotten.
> >>>>>>>>>
> >>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
> >>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
> >>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
> >>>>>>>>>> values because this is still such an early feature.
> >>>>>>>>>>
> >>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
> >>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
> >>>>>>>>>> location.
> >>>>>>>>
> >>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
> >>>>>>>> script
> >>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
> >>>>>>>> think I
> >>>>>>>> should be able to determine all the physical memory ranges from
> >>>>>>>> /proc/iomem,
> >>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
> >>>>>>>> able to
> >>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
> >>>>>>>> same
> >>>>>>>> stats, but it will apply globally. What do you think?
> >>>>>>
> >>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
> >>>>>> the
> >>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
> >>>>>> virtual mapping information so the best it can do is tell us "how many of
> >>>>>> each
> >>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
> >>>>>> are
> >>>>>> fully or partially mapped or what their alignment is (all necessary if we
> >>>>>> want
> >>>>>> to know if they are contpte-mapped). So I don't think this approach is
> >>>>>> going to
> >>>>>> be particularly useful.
> >>>>>>
> >>>>>> And this is also the big problem if we want to gather stats inside the
> >>>>>> kernel;
> >>>>>> if we want something equivalant to /proc/meminfo's
> >>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
> >>>>>> allocation of the THP but also whether it is mapped. That's easy for
> >>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
> >>>>>> you
> >>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
> >>>>>> But
> >>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
> >>>>>> easy
> >>>>>> to increment, but you can do a partial unmap, so you would need to scan the
> >>>>>> PTEs
> >>>>>> to figure out if we are unmapping the first page of a previously
> >>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
> >>>>>> determine "is this folio fully and contiguously mapped in at least one
> >>>>>> process?".
> >>>>>
> >>>>> as OPPO's approach I shared to you before is maintaining two mapcount
> >>>>> 1. entire map
> >>>>> 2. subpage's map
> >>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
> >>>>>
> >>>>> This isn't a problem for us. and everytime if we do a partial unmap,
> >>>>> we have an explicit
> >>>>> cont_pte split which will decrease the entire map and increase the
> >>>>> subpage's mapcount.
> >>>>>
> >>>>> but its downside is that we expose this info to mm-core.
> >>>>
> >>>> OK, but I think we have a slightly more generic situation going on with the
> >>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
> >>>> PTE to determne if its fully mapped? That works for your case where you only
> >>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
> >>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
> >>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
> >>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
> >>>> need to update that SW bit for every PTE one the full -> partial map
> >>>> transition.
> >>>
> >>> Oh no. Let's not make everything more complicated for the purpose of some stats.
> >>>
> >>
> >> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
> >> we want to know what's fully mapped and what's not, then I don't see any way
> >> other than by scanning the page tables and we might as well do that in user
> >> space with this script.
> >>
> >> Although, I expect you will shortly make a proposal that is simple to implement
> >> and prove me wrong ;-)
> >
> > Unlikely :) As you said, once you have multiple folio sizes, it stops really
> > making sense.
> >
> > Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
> > set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
> > optimizations without the cont-pte bit and everything is fine.
>
> Yes, but for debug and optimization, its useful to know when THPs are
> fully/partially mapped, when they are unaligned etc. Anyway, the script does
> that for us, and I think we are tending towards agreement that there are
> unlikely to be any cost benefits by moving it into the kernel.
>
> >
> > We want simple stats that tell us which folio sizes are actually allocated. For
> > everything else, just scan the process to figure out what exactly is going on.
> >
>
> Certainly that's much easier to do. But is it valuable? It might be if we also
> keep stats for the number of failures to allocate the various sizes - then we
> can see what percentage of high order allocation attempts are successful, which
> is probably useful.

+1, this is definitely useful, especially once memory is fragmented. On an
embedded device that is very much the case after the system has been running
for a while. That's why we have to set up a large-folio pool in our products;
otherwise large folios only have a positive impact in the first hour.

For your reference, here are some stats from my phone using OPPO's large
folios approach:

:/ # cat /proc/<oppo's large folios>/stat
pool_size   eg. <4GB>                  ---- we have a pool to help the success of large folios allocation
pool_low                               ---- watermarks for the pool; we may begin to reclaim large folios when the pool has limited free memory
pool_high
thp_cow 1011488                        ---- we are doing CoW for large folios
thp_cow_fallback 584494                ---- we fail to allocate large folios for CoW, then fallback to normal page
...
madv_free_unaligned 9159               ---- userspace unaligned madv_free
madv_dont_need_unaligned 11358         ---- userspace unaligned madv_dontneed
....
thp_do_anon_pages 131289845
thp_do_anon_pages_fallback 88911215    ---- fallback to normal pages in do_anon_pages
thp_swpin_no_swapcache_entry           ---- swapin large folios
thp_swpin_no_swapcache_fallback_entry  ---- swapin large folios, fallback
thp_swpin_swapcache_entry              ---- swapin large folios, swapcache case
thp_swpin_swapcache_fallback_entry     ---- swapin large folios, swapcache case, fallback to normal pages
thp_file_entry 28998
  thp_file_alloc_success 27334
  thp_file_alloc_fail 1664
....
PartialMappedTHP:      29312 kB        ---- these are folios which have ever been not entirely mapped
                                            (this is also what i am suggesting to have in recent several replies :-))
....

Thanks
Barry
Ryan Roberts Jan. 11, 2024, 12:25 p.m. UTC | #43
On 10/01/2024 22:14, Barry Song wrote:
> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 10/01/2024 11:38, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 10/01/2024 11:00, David Hildenbrand wrote:
>>>>> On 10.01.24 11:55, Ryan Roberts wrote:
>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>>>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>>>>>> numbers
>>>>>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>>>>>
>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
>>>>>>>>>>>> and not
>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
>>>>>>>>>>>> container).
>>>>>>>>>>>> If you want per-container, then you can probably just create the container
>>>>>>>>>>>> in a
>>>>>>>>>>>> cgroup?
>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>>>>>> detailed stats.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>>>>>>>> stats
>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>>>>>>> really
>>>>>>>>>>>> know exectly how to account mTHPs yet
>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>>>>>>> adding
>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>>>>>>> David
>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>>>>>>> cgroups
>>>>>>>>>>>> do live in sysfs).
>>>>>>>>>>>>
>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
>>>>>>>>>>>> to the
>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>>>>>>> what
>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>>>>>
>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>>>>>>> which
>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>>>>>> detailed
>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>>>>>>> they have gotten.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>>>>>>> location.
>>>>>>>>>>>>
>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>>>>>>> script
>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>>>>>>> think I
>>>>>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>>>>>> /proc/iomem,
>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>>>>>>> able to
>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>>>>>>> same
>>>>>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>>>>>
>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>>>>>>>> the
>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of
>>>>>>>>>> each
>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>>>>>>>> are
>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
>>>>>>>>>> want
>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>>>>>> going to
>>>>>>>>>> be particularly useful.
>>>>>>>>>>
>>>>>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>>>>>> kernel;
>>>>>>>>>> if we want something equivalant to /proc/meminfo's
>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>>>>>>>> you
>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
>>>>>>>>>> But
>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>>>>>>> easy
>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>>>>>>> PTEs
>>>>>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>>>>>> process?".
>>>>>>>>>
>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>>>>>> 1. entire map
>>>>>>>>> 2. subpage's map
>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>>>>>
>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
>>>>>>>>> we have an explicit
>>>>>>>>> cont_pte split which will decrease the entire map and increase the
>>>>>>>>> subpage's mapcount.
>>>>>>>>>
>>>>>>>>> but its downside is that we expose this info to mm-core.
>>>>>>>>
>>>>>>>> OK, but I think we have a slightly more generic situation going on with the
>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only
>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>>>>>>> need to update that SW bit for every PTE one the full -> partial map
>>>>>>>> transition.
>>>>>>>
>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
>>>>>>>
>>>>>>
>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
>>>>>> we want to know what's fully mapped and what's not, then I don't see any way
>>>>>> other than by scanning the page tables and we might as well do that in user
>>>>>> space with this script.
>>>>>>
>>>>>> Although, I expect you will shortly make a proposal that is simple to implement
>>>>>> and prove me wrong ;-)
>>>>>
>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
>>>>> making sense.
>>>>>
>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
>>>>> optimizations without the cont-pte bit and everything is fine.
>>>>
>>>> Yes, but for debug and optimization, its useful to know when THPs are
>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does
>>>> that for us, and I think we are tending towards agreement that there are
>>>> unlikely to be any cost benefits by moving it into the kernel.
>>>
>>> frequent partial unmap can defeat all purpose for us to use large folios.
>>> just imagine a large folio can soon be splitted after it is formed. we lose
>>> the performance gain and might get regression instead.
>>
>> nit: just because a THP gets partially unmapped in a process doesn't mean it
>> gets split into order-0 pages. If the folio still has all its pages mapped at
>> least once then no further action is taken. If the page being unmapped was the
>> last mapping of that page, then the THP is put on the deferred split queue, so
>> that it can be split in future if needed.
>>>
>>> and this can be very frequent, for example, one userspace heap management
>>> is releasing memory page by page.
>>>
>>> In our real product deployment, we might not care about the second partial
>>> unmapped,  we do care about the first partial unmapped as we can use this
>>> to know if split has ever happened on this large folios. an partial unmapped
>>> subpage can be unlikely re-mapped back.
>>>
>>> so i guess 1st unmap is probably enough, at least for my product. I mean we
>>> care about if partial unmap has ever happened on a large folio more than how
>>> they are exactly partially unmapped :-)
>>
>> I'm not sure what you are suggesting here? A global boolean that tells you if
>> any folio in the system has ever been partially unmapped? That will almost
>> certainly always be true, even for a very well tuned system.
> 
> Not a global boolean but a per-folio boolean. If userspace maps a region and
> does not manage it itself, then we are fine, as partial unmap/map is unlikely;
> if userspace maps a region but manages it by itself, such as a heap, we can
> end up with lots of partial map/unmap, which can lead to 3 problems:
> 1. a potential memory footprint increase: for example, while userspace
> releases some pages in a folio, we might still keep the whole folio, as
> frequently splitting the folio into base pages and releasing the unmapped
> subpages might be too expensive.
> 2. if cont-pte is involved, frequent dropping of cont-pte / TLB shootdowns
> might happen.
> 3. other maintenance overhead, such as splitting large folios etc.
> 
> We'd like to know how much of this partial mapping is happening, so that we
> can either disable mTHP for these kinds of VMAs, or optimize userspace to
> align its operations to the size of large folios.
> 
> On Android phones we monitor lots of apps, and found that some apps might do
> things like:
> 1. mprotect on some pages within a large folio
> 2. mlock on some pages within a large folio
> 3. madv_free on some pages within a large folio
> 4. madv_pageout on some pages within a large folio.
> 
> It would be good to have a per-folio boolean so we can see how badly
> userspace is breaking up large folios; for example, if more than 50% of the
> folios in a VMA have this problem, we can detect that and take some action.

The high-level value of these stats seems clear - I agree we need to be able to
get these insights. I think the issues are more around the implementation
though; I'm struggling to understand exactly how we could implement a lot of
these things cheaply (either in the kernel or in user space).

Let me try to work through what I think you are suggesting:

 - every THP is initially fully mapped
 - when an operation causes a partial unmap, mark the folio as having at least
   one partial mapping
 - on transition from "no partial mappings" to "at least one partial mapping"
   increment an "anon-partial-<size>kB" counter (one for each supported folio
   size) by the folio size
 - on transition from "at least one partial mapping" to "fully unmapped
   everywhere" decrement the counter by the folio size

I think the issue with this is that a folio that is fully mapped in a process
that gets forked, then partially unmapped in one process, will still be
accounted as partially mapped even after the process that partially unmapped
it exits, even though that folio is now fully mapped in all processes that map
it. Is that a problem? Perhaps not - I'm not sure.
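
For the sake of discussion, a toy model of that accounting (all names are made
up; none of this is existing kernel code):

#include <stdbool.h>

#define MAX_ORDER	10
#define PAGE_KB		4UL	/* assumed base page size in kB */

/* One counter per folio order: "kB of THP currently partially mapped somewhere". */
static unsigned long anon_partial_kb[MAX_ORDER + 1];

struct folio_model {
	unsigned int order;
	bool has_partial_mapping;	/* the "at least one partial mapping" state */
};

/* Transition: "no partial mappings" -> "at least one partial mapping". */
static void on_partial_unmap(struct folio_model *f)
{
	if (!f->has_partial_mapping) {
		f->has_partial_mapping = true;
		anon_partial_kb[f->order] += PAGE_KB << f->order;
	}
}

/* Transition: "at least one partial mapping" -> "fully unmapped everywhere". */
static void on_fully_unmapped_everywhere(struct folio_model *f)
{
	if (f->has_partial_mapping) {
		f->has_partial_mapping = false;
		anon_partial_kb[f->order] -= PAGE_KB << f->order;
	}
}

The hard parts are exactly the ones already discussed: detecting those
transitions cheaply without rescanning the PTEs, and deciding what the right
answer is for the fork case above.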

> 
> Thanks
> Barry
David Hildenbrand Jan. 11, 2024, 1:18 p.m. UTC | #44
On 11.01.24 13:25, Ryan Roberts wrote:
> On 10/01/2024 22:14, Barry Song wrote:
>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 10/01/2024 11:38, Barry Song wrote:
>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 10/01/2024 11:00, David Hildenbrand wrote:
>>>>>> On 10.01.24 11:55, Ryan Roberts wrote:
>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>>>>>>> numbers
>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
>>>>>>>>>>>>> and not
>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
>>>>>>>>>>>>> container).
>>>>>>>>>>>>> If you want per-container, then you can probably just create the container
>>>>>>>>>>>>> in a
>>>>>>>>>>>>> cgroup?
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>>>>>>> detailed stats.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>>>>>>>>> stats
>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>>>>>>>> really
>>>>>>>>>>>>> know exectly how to account mTHPs yet
>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>>>>>>>> adding
>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>>>>>>>> David
>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>>>>>>>> cgroups
>>>>>>>>>>>>> do live in sysfs).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
>>>>>>>>>>>>> to the
>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>>>>>>>> what
>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>>>>>>>> which
>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>>>>>>> detailed
>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>>>>>>>> they have gotten.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>>>>>>>> location.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>>>>>>>> script
>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>>>>>>>> think I
>>>>>>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>>>>>>> /proc/iomem,
>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>>>>>>>> able to
>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>>>>>>>> same
>>>>>>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>>>>>>
>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>>>>>>>>> the
>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of
>>>>>>>>>>> each
>>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>>>>>>>>> are
>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
>>>>>>>>>>> want
>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>>>>>>> going to
>>>>>>>>>>> be particularly useful.
>>>>>>>>>>>
>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>>>>>>> kernel;
>>>>>>>>>>> if we want something equivalent to /proc/meminfo's
>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>>>>>>>>> you
>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
>>>>>>>>>>> But
>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so it's
>>>>>>>>>>> easy
>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>>>>>>>> PTEs
>>>>>>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>>>>>>> process?".
>>>>>>>>>>
>>>>>>>>>> As in OPPO's approach I shared with you before, we maintain two mapcounts:
>>>>>>>>>> 1. entire map
>>>>>>>>>> 2. subpage's map
>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>>>>>>
>>>>>>>>>> This isn't a problem for us, and every time we do a partial unmap,
>>>>>>>>>> we have an explicit cont_pte split which will decrease the entire mapcount
>>>>>>>>>> and increase the subpage's mapcount.
>>>>>>>>>>
>>>>>>>>>> But its downside is that we expose this info to mm-core.
>>>>>>>>>
>>>>>>>>> OK, but I think we have a slightly more generic situation going on with the
>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>>>>>>>> PTE to determine if it's fully mapped? That works for your case where you only
>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if it's fully
>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>>>>>>>> need to update that SW bit for every PTE on the full -> partial map
>>>>>>>>> transition.
>>>>>>>>
>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
>>>>>>>>
>>>>>>>
>>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
>>>>>>> we want to know what's fully mapped and what's not, then I don't see any way
>>>>>>> other than by scanning the page tables and we might as well do that in user
>>>>>>> space with this script.
>>>>>>>
>>>>>>> Although, I expect you will shortly make a proposal that is simple to implement
>>>>>>> and prove me wrong ;-)
>>>>>>
>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
>>>>>> making sense.
>>>>>>
>>>>>> Assume you have a 128 kiB pagecache folio, and half of that is mapped. You can
>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from its
>>>>>> optimizations without the cont-pte bit and everything is fine.
>>>>>
>>>>> Yes, but for debug and optimization, it's useful to know when THPs are
>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does
>>>>> that for us, and I think we are tending towards agreement that there are
>>>>> unlikely to be any cost benefits by moving it into the kernel.
>>>>
>>>> Frequent partial unmap can defeat the whole purpose of using large folios:
>>>> just imagine a large folio being split soon after it is formed. We lose
>>>> the performance gain and might get a regression instead.
>>>
>>> nit: just because a THP gets partially unmapped in a process doesn't mean it
>>> gets split into order-0 pages. If the folio still has all its pages mapped at
>>> least once then no further action is taken. If the page being unmapped was the
>>> last mapping of that page, then the THP is put on the deferred split queue, so
>>> that it can be split in future if needed.
>>>>
>>>> and this can be very frequent, for example, a userspace heap manager
>>>> releasing memory page by page.
>>>>
>>>> In our real product deployment, we might not care about the second partial
>>>> unmap, but we do care about the first one, as we can use it to know whether a
>>>> split has ever happened on this large folio. A partially unmapped subpage is
>>>> unlikely to be re-mapped back.
>>>>
>>>> So I guess the 1st unmap is probably enough, at least for my product. I mean we
>>>> care more about whether a partial unmap has ever happened on a large folio than
>>>> about how exactly it is partially unmapped :-)
>>>
>>> I'm not sure what you are suggesting here? A global boolean that tells you if
>>> any folio in the system has ever been partially unmapped? That will almost
>>> certainly always be true, even for a very well tuned system.
>>
>> not a global boolean but a per-folio boolean. If userspace maps a region
>> and does no userspace management, then we are fine, as it is unlikely to do
>> partial unmap/map things; if userspace maps a region but manages it
>> by itself, such as a heap, we might end up with lots of partial map/unmap,
>> which can lead to 3 problems:
>> 1. potential memory footprint increase: for example, while userspace releases
>> some pages in a folio, we might still keep the whole folio, as frequently
>> splitting it into base pages and releasing the unmapped subpages might be too
>> expensive.
>> 2. if cont-pte is involved, frequent cont-pte dropping / TLB shootdowns
>> might happen.
>> 3. other maintenance overhead such as splitting large folios etc.
>>
>> We'd like to know how seriously this partial-map behaviour is happening, so that
>> either we can disable mTHP for this kind of VMA, or optimize userspace to do
>> some alignment according to the size of large folios.
>>
>> in android phones, we have checked lots of apps, and also found some apps might
>> do things like
>> 1. mprotect on some pages within a large folio
>> 2. mlock on some pages within a large folio
>> 3. madv_free on some pages within a large folio
>> 4. madv_pageout on some pages within a large folio.
>>
>> it would be good if we had a per-folio boolean to know how seriously userspace
>> is breaking the large folios. For example, if more than 50% of the folios in a vma
>> have this problem, we can find it out and take some action.
> 
> The high level value of these stats seems clear - I agree we need to be able to
> get these insights. I think the issues are more around the implementation
> though. I'm struggling to understand exactly how we could implement a lot of
> these things cheaply (either in the kernel or in user space).
> 
> Let me try to work through what I think you are suggesting:
> 
>   - every THP is initially fully mapped

Not for pagecache folios.

>   - when an operation causes a partial unmap, mark the folio as having at least
>     one partial mapping
>   - on transition from "no partial mappings" to "at least one partial mapping"
>     increment an "anon-partial-<size>kB" (one for each supported folio size)
>     counter by the folio size
>   - on transition from "at least one partial mapping" to "fully unmapped
>     everywhere" decrement the counter by the folio size
> 
> I think the issue with this is that a folio that is fully mapped in a process
> that gets forked, then is partially unmapped in 1 process, will be accounted as
> partially mapped even after the process that partially unmapped it exits, even
> though that folio is now fully mapped in all processes that map it. Is that a
> problem, perhaps not? I'm not sure.

What I can offer with my total mapcount I am working on (+ entire/pmd 
mapcount, but let's put that aside):

1) total_mapcount is not a multiple of folio_nr_pages -> at least one process
currently maps the folio partially

2) total_mapcount is less than folio_nr_pages -> surely partially mapped

I think for most anon memory (note that most folios are always
exclusive in our system, not cow-shared) 2) would already be sufficient.
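
To make 1) and 2) concrete, a minimal sketch of such a check (assuming a
folio_total_mapcount() helper with exactly these semantics, and ignoring the
races discussed above) could look like:

static inline bool folio_maybe_partially_mapped(struct folio *folio)
{
        long nr = folio_nr_pages(folio);
        long total = folio_total_mapcount(folio);       /* assumed helper */

        if (total < nr)
                return true;    /* case 2): surely partially mapped */
        if (total % nr)
                return true;    /* case 1): someone maps it partially */
        return false;           /* may still be fully mapped by every mapper */
}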
Barry Song Jan. 11, 2024, 8:21 p.m. UTC | #45
On Fri, Jan 12, 2024 at 2:18 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.01.24 13:25, Ryan Roberts wrote:
> > [...]
> >
> > The high level value of these stats seems clear - I agree we need to be able to
> > get these insights. I think the issues are more around the implementation
> > though. I'm struggling to understand exactly how we could implement a lot of
> > these things cheaply (either in the kernel or in user space).
> >
> > Let me try to work through what I think you are suggesting:
> >
> >   - every THP is initially fully mapped
>
> Not for pagecache folios.
>
> >   - when an operation causes a partial unmap, mark the folio as having at least
> >     one partial mapping
> >   - on transition from "no partial mappings" to "at least one partial mapping"
> >     increment an "anon-partial-<size>kB" (one for each supported folio size)
> >     counter by the folio size
> >   - on transition from "at least one partial mapping" to "fully unmapped
> >     everywhere" decrement the counter by the folio size
> >
> > I think the issue with this is that a folio that is fully mapped in a process
> > that gets forked, then is partially unmapped in 1 process, will be accounted as
> > partially mapped even after the process that partially unmapped it exits, even
> > though that folio is now fully mapped in all processes that map it. Is that a
> > problem, perhaps not? I'm not sure.
>
> What I can offer with my total mapcount I am working on (+ entire/pmd
> mapcount, but let's put that aside):
>
> 1) total_mapcount is not a multiple of folio_nr_pages -> at least one process
> currently maps the folio partially
>
> 2) total_mapcount is less than folio_nr_pages -> surely partially mapped
>
> I think for most of anon memory (note that most folios are always
> exclusive in our system, not cow-shared) 2) would already be sufficient.

If we can improve Ryan's "mm: Batch-copy PTE ranges during fork()" to
pass nr_pages to the rmap code in copy_pte_range():
copy_pte_range()
{
           folio_try_dup_anon_rmap_ptes(...nr_pages...)
}
and, at the same time, have zap_pte_range() remove the whole anon rmap
if the zapped range covers the whole folio,

then we can replace the for-loops
for (i = 0; i < nr; i++, page++) {
        add_rmap(1);
}
for (i = 0; i < nr; i++, page++) {
        remove_rmap(1);
}
by always using add_rmap(nr_pages) and remove_rmap(nr_pages) when we
are doing an entire mapping/unmapping.

Then we might be able to test-and-set a PartialMapped flag on the folio whenever
1. someone is adding an rmap with a count not equal to nr_pages, or
2. someone is removing an rmap with a count not equal to nr_pages.
Either case means we are doing a partial mapping or unmapping, so we
increment partialmap_count by 1 and let debugfs (or somewhere similar)
present this count.

When the folio is released to the buddy allocator and split into normal pages,
we clear this flag and decrease partialmap_count by 1.
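
Roughly, and with invented names for the flag, counter and helpers (ignoring
locking and the pagecache case), that bookkeeping could look like:

static atomic_long_t partialmap_count = ATOMIC_LONG_INIT(0);

/* Called from the batched rmap add/remove paths; nr is the number of
 * pages being mapped or unmapped in this operation. */
static void folio_note_partial_map(struct folio *folio, int nr)
{
        /* Whole-folio operations pass nr == folio_nr_pages(). */
        if (nr == folio_nr_pages(folio))
                return;

        /* First partial map/unmap of this folio: count it once. */
        if (!folio_test_set_partially_mapped(folio))
                atomic_long_inc(&partialmap_count);
}

/* Called when the folio is freed to buddy or split into base pages. */
static void folio_forget_partial_map(struct folio *folio)
{
        if (folio_test_clear_partially_mapped(folio))
                atomic_long_dec(&partialmap_count);
}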

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
David Hildenbrand Jan. 11, 2024, 8:28 p.m. UTC | #46
On 11.01.24 21:21, Barry Song wrote:
> On Fri, Jan 12, 2024 at 2:18 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> [...]
>>
>> What I can offer with my total mapcount I am working on (+ entire/pmd
>> mapcount, but let's put that aside):
>>
>> 1) total_mapcount is not a multiple of folio_nr_pages -> at least one process
>> currently maps the folio partially
>>
>> 2) total_mapcount is less than folio_nr_pages -> surely partially mapped
>>
>> I think for most of anon memory (note that most folios are always
>> exclusive in our system, not cow-shared) 2) would already be sufficient.
> 
> if we can improve Ryan's "mm: Batch-copy PTE ranges during fork()" to
> add nr_pages in copy_pte_range for rmap.
> copy_pte_range()
> {
>             folio_try_dup_anon_rmap_ptes(...nr_pages....)
> }
> and at the same time, in zap_pte_range(), we remove the whole anon_rmap
> if the zapped-range covers the whole folio.
> 
> Replace the for-loop
> for (i = 0; i < nr; i++, page++) {
>          add_rmap(1);
> }
> for (i = 0; i < nr; i++, page++) {
>          remove_rmap(1);
> }
> by always using add_rmap(nr_pages) and remove_rmap(nr_pages) if we
> are doing the entire mapping/unmapping

That's precisely what I already have running as prototypes :) And I
promised Ryan to get to this soon, clean it up and send it out.

> 
> then we might be able to TestAndSetPartialMapped flag for this folio anywhile
> 1. someone is adding rmap with a number not equal nr_pages
> 2. someone is removing rmap with a number not equal nr_pages
> That means we are doing partial mapping or unmapping.
> and we increment partialmap_count by 1, let debugfs or somewhere present
> this count.

Yes. The only "ugly" corner case is if you have a split VMA; we're not
batching rmap operations across that.
Barry Song Jan. 11, 2024, 8:45 p.m. UTC | #47
On Fri, Jan 12, 2024 at 1:25 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> [...]
>
> The high level value of these stats seems clear - I agree we need to be able to
> get these insights. I think the issues are more around the implementation
> though. I'm struggling to understand exactly how we could implement a lot of
> these things cheaply (either in the kernel or in user space).
>
> Let me try to work though what I think you are suggesting:
>
>  - every THP is initially fully mapped
>  - when an operation causes a partial unmap, mark the folio as having at least
>    one partial mapping
>  - on transition from "no partial mappings" to "at least one partial mapping"
>    increment a "anon-partial-<size>kB" (one for each supported folio size)
>    counter by the folio size
>  - on transition from "at least one partial mapping" to "fully unampped
>    everywhere" decrement the counter by the folio size
>
> I think the issue with this is that a folio that is fully mapped in a process
> that gets forked, then is partially unmapped in 1 process, will be accounted as
> partially mapped even after the process that partially unmapped it exits, even
> though that folio is now fully mapped in all processes that map it. Is that a
> problem, perhaps not? I'm not sure.

I don't think this is a problem, as what we really care about is whether some "bad"
behaviour has ever happened in userspace. Even though the "bad" guy has exited
or been killed, we still need the record to find that out.

Besides the global count, if we can reflect the partially mapped count per VMA,
for example in smaps, this will help even more to locate the problematic userspace
code.
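
For illustration, the transition-based accounting sketched in the quote above
could be modelled in userspace roughly like this (a toy sketch only: the
struct, flag and counter names are invented here and are not kernel code or a
proposed ABI):

/* Toy model of the quoted scheme: a sticky per-folio "has been partially
 * unmapped" flag, plus one anon-partial-<size>kB counter per folio size.
 * All names are invented for this sketch. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER 10

struct folio_model {
        int order;                  /* folio size = 4K << order */
        bool partially_mapped;      /* sticky: at least one partial mapping seen */
};

static long anon_partial_kb[MAX_ORDER + 1];

static void on_partial_unmap(struct folio_model *f)
{
        /* "no partial mappings" -> "at least one partial mapping" */
        if (!f->partially_mapped) {
                f->partially_mapped = true;
                anon_partial_kb[f->order] += 4L << f->order;
        }
}

static void on_fully_unmapped_everywhere(struct folio_model *f)
{
        /* "at least one partial mapping" -> "fully unmapped everywhere" */
        if (f->partially_mapped) {
                f->partially_mapped = false;
                anon_partial_kb[f->order] -= 4L << f->order;
        }
}

int main(void)
{
        struct folio_model f = { .order = 4 };      /* 64K folio */

        on_partial_unmap(&f);
        on_partial_unmap(&f);                       /* second partial unmap: counted once */
        printf("anon-partial-64kB: %ld kB\n", anon_partial_kb[4]);
        on_fully_unmapped_everywhere(&f);
        printf("anon-partial-64kB: %ld kB\n", anon_partial_kb[4]);
        return 0;
}

Note it has exactly the property discussed above: once a folio has been
partially unmapped anywhere, it stays accounted as partial until it is fully
unmapped everywhere, regardless of what the remaining mappings look like.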

Thanks
Barry
Barry Song Jan. 12, 2024, 6:03 a.m. UTC | #48
On Fri, Jan 12, 2024 at 9:28 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.01.24 21:21, Barry Song wrote:
> > On Fri, Jan 12, 2024 at 2:18 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 11.01.24 13:25, Ryan Roberts wrote:
> >>> On 10/01/2024 22:14, Barry Song wrote:
> >>>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>
> >>>>> On 10/01/2024 11:38, Barry Song wrote:
> >>>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>
> >>>>>>> On 10/01/2024 11:00, David Hildenbrand wrote:
> >>>>>>>> On 10.01.24 11:55, Ryan Roberts wrote:
> >>>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote:
> >>>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote:
> >>>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote:
> >>>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
> >>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
> >>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
> >>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>>>>> Hi Ryan,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
> >>>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
> >>>>>>>>>>>>>>>>>>> running
> >>>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
> >>>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
> >>>>>>>>>>>>>>>>>>> numbers
> >>>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
> >>>>>>>>>>>>>>> and not
> >>>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
> >>>>>>>>>>>>>>> container).
> >>>>>>>>>>>>>>> If you want per-container, then you can probably just create the container
> >>>>>>>>>>>>>>> in a
> >>>>>>>>>>>>>>> cgroup?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
> >>>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
> >>>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
> >>>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like
> >>>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
> >>>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
> >>>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more
> >>>>>>>>>>>>>>>>> detailed stats.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> probably because this can be done without the modification of the kernel.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
> >>>>>>>>>>>>>>> stats
> >>>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
> >>>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>> know exectly how to account mTHPs yet
> >>>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
> >>>>>>>>>>>>>>> adding
> >>>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
> >>>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
> >>>>>>>>>>>>>>> David
> >>>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
> >>>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
> >>>>>>>>>>>>>>> cgroups
> >>>>>>>>>>>>>>> do live in sysfs).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
> >>>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
> >>>>>>>>>>>>>>> what
> >>>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
> >>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
> >>>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the
> >>>>>>>>>>>>>>>> detailed
> >>>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
> >>>>>>>>>>>>>>>> they have gotten.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
> >>>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
> >>>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
> >>>>>>>>>>>>>>>>> values because this is still such an early feature.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
> >>>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
> >>>>>>>>>>>>>>>>> location.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
> >>>>>>>>>>>>>>> script
> >>>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
> >>>>>>>>>>>>>>> think I
> >>>>>>>>>>>>>>> should be able to determine all the physical memory ranges from
> >>>>>>>>>>>>>>> /proc/iomem,
> >>>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
> >>>>>>>>>>>>>>> able to
> >>>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
> >>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>> stats, but it will apply globally. What do you think?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
> >>>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of
> >>>>>>>>>>>>> each
> >>>>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
> >>>>>>>>>>>>> are
> >>>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
> >>>>>>>>>>>>> want
> >>>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
> >>>>>>>>>>>>> going to
> >>>>>>>>>>>>> be particularly useful.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the
> >>>>>>>>>>>>> kernel;
> >>>>>>>>>>>>> if we want something equivalant to /proc/meminfo's
> >>>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
> >>>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
> >>>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
> >>>>>>>>>>>>> you
> >>>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
> >>>>>>>>>>>>> But
> >>>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
> >>>>>>>>>>>>> easy
> >>>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
> >>>>>>>>>>>>> PTEs
> >>>>>>>>>>>>> to figure out if we are unmapping the first page of a previously
> >>>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
> >>>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one
> >>>>>>>>>>>>> process?".
> >>>>>>>>>>>>
> >>>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
> >>>>>>>>>>>> 1. entire map
> >>>>>>>>>>>> 2. subpage's map
> >>>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
> >>>>>>>>>>>> we have an explicit
> >>>>>>>>>>>> cont_pte split which will decrease the entire map and increase the
> >>>>>>>>>>>> subpage's mapcount.
> >>>>>>>>>>>>
> >>>>>>>>>>>> but its downside is that we expose this info to mm-core.
> >>>>>>>>>>>
> >>>>>>>>>>> OK, but I think we have a slightly more generic situation going on with the
> >>>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
> >>>>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only
> >>>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
> >>>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
> >>>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
> >>>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
> >>>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map
> >>>>>>>>>>> transition.
> >>>>>>>>>>
> >>>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
> >>>>>>>>> we want to know what's fully mapped and what's not, then I don't see any way
> >>>>>>>>> other than by scanning the page tables and we might as well do that in user
> >>>>>>>>> space with this script.
> >>>>>>>>>
> >>>>>>>>> Although, I expect you will shortly make a proposal that is simple to implement
> >>>>>>>>> and prove me wrong ;-)
> >>>>>>>>
> >>>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
> >>>>>>>> making sense.
> >>>>>>>>
> >>>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
> >>>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
> >>>>>>>> optimizations without the cont-pte bit and everything is fine.
> >>>>>>>
> >>>>>>> Yes, but for debug and optimization, its useful to know when THPs are
> >>>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does
> >>>>>>> that for us, and I think we are tending towards agreement that there are
> >>>>>>> unlikely to be any cost benefits by moving it into the kernel.
> >>>>>>
> >>>>>> frequent partial unmap can defeat all purpose for us to use large folios.
> >>>>>> just imagine a large folio can soon be splitted after it is formed. we lose
> >>>>>> the performance gain and might get regression instead.
> >>>>>
> >>>>> nit: just because a THP gets partially unmapped in a process doesn't mean it
> >>>>> gets split into order-0 pages. If the folio still has all its pages mapped at
> >>>>> least once then no further action is taken. If the page being unmapped was the
> >>>>> last mapping of that page, then the THP is put on the deferred split queue, so
> >>>>> that it can be split in future if needed.
> >>>>>>
> >>>>>> and this can be very frequent, for example, one userspace heap management
> >>>>>> is releasing memory page by page.
> >>>>>>
> >>>>>> In our real product deployment, we might not care about the second partial
> >>>>>> unmapped,  we do care about the first partial unmapped as we can use this
> >>>>>> to know if split has ever happened on this large folios. an partial unmapped
> >>>>>> subpage can be unlikely re-mapped back.
> >>>>>>
> >>>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we
> >>>>>> care about if partial unmap has ever happened on a large folio more than how
> >>>>>> they are exactly partially unmapped :-)
> >>>>>
> >>>>> I'm not sure what you are suggesting here? A global boolean that tells you if
> >>>>> any folio in the system has ever been partially unmapped? That will almost
> >>>>> certainly always be true, even for a very well tuned system.
> >>>>
> >>>> not a global boolean but a per-folio boolean. in case userspace maps a region
> >>>> and has no userspace management, then we are fine as it is unlikely to have
> >>>> partial unmap/map things; in case userspace maps a region, but manages it
> >>>> by itself, such as heap things, we might result in lots of partial map/unmap,
> >>>> which can lead to 3 problems:
> >>>> 1. potential memory footprint increase, for example, while userspace releases
> >>>> some pages in a folio, we might still keep it as frequent splitting folio into
> >>>> basepages and releasing the unmapped subpage might be too expensive.
> >>>> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown
> >>>> might happen.
> >>>> 3. other maintenance overhead such as splitting large folios etc.
> >>>>
> >>>> We'd like to know how serious partial map things are happening. so either
> >>>> we will disable mTHP in this kind of VMAs, or optimize userspace to do
> >>>> some alignment according to the size of large folios.
> >>>>
> >>>> in android phones, we detect lots of apps, and also found some apps might
> >>>> do things like
> >>>> 1. mprotect on some pages within a large folio
> >>>> 2. mlock on some pages within a large folio
> >>>> 3. madv_free on some pages within a large folio
> >>>> 4. madv_pageout on some pages within a large folio.
> >>>>
> >>>> it would be good if we have a per-folio boolean to know how serious userspace
> >>>> is breaking the large folios. for example, if more than 50% folios in a vma has
> >>>> this problem, we can find it out and take some action.
> >>>
> >>> The high level value of these stats seems clear - I agree we need to be able to
> >>> get these insights. I think the issues are more around the implementation
> >>> though. I'm struggling to understand exactly how we could implement a lot of
> >>> these things cheaply (either in the kernel or in user space).
> >>>
> >>> Let me try to work though what I think you are suggesting:
> >>>
> >>>    - every THP is initially fully mapped
> >>
> >> Not for pagecache folios.
> >>
> >>>    - when an operation causes a partial unmap, mark the folio as having at least
> >>>      one partial mapping
> >>>    - on transition from "no partial mappings" to "at least one partial mapping"
> >>>      increment a "anon-partial-<size>kB" (one for each supported folio size)
> >>>      counter by the folio size
> >>>    - on transition from "at least one partial mapping" to "fully unampped
> >>>      everywhere" decrement the counter by the folio size
> >>>
> >>> I think the issue with this is that a folio that is fully mapped in a process
> >>> that gets forked, then is partially unmapped in 1 process, will be accounted as
> >>> partially mapped even after the process that partially unmapped it exits, even
> >>> though that folio is now fully mapped in all processes that map it. Is that a
> >>> problem, perhaps not? I'm not sure.
> >>
> >> What I can offer with my total mapcount I am working on (+ entire/pmd
> >> mapcount, but let's put that aside):
> >>
> >> 1) total_mapcount not multiples of folio_nr_page -> at least one process
> >> currently maps the folio partially
> >>
> >> 2) total_mapcount is less than folio_nr_page -> surely partially mapped
> >>
> >> I think for most of anon memory (note that most folios are always
> >> exclusive in our system, not cow-shared) 2) would already be sufficient.
> >
> > if we can improve Ryan's "mm: Batch-copy PTE ranges during fork()" to
> > add nr_pages in copy_pte_range for rmap.
> > copy_pte_range()
> > {
> >             folio_try_dup_anon_rmap_ptes(...nr_pages....)
> > }
> > and at the same time, in zap_pte_range(), we remove the whole anon_rmap
> > if the zapped-range covers the whole folio.
> >
> > Replace the for-loop
> > for (i = 0; i < nr; i++, page++) {
> >          add_rmap(1);
> > }
> > for (i = 0; i < nr; i++, page++) {
> >          remove_rmap(1);
> > }
> > by always using add_rmap(nr_pages) and remove_rmap(nr_pages) if we
> > are doing the entire mapping/unmapping
>
> That's precisely what I have already running as prototypes :) And I
> promised Ryan to get to this soon, clean it up and send it out.

Cool. Glad we'll have it soon.

>
> .
> >
> > then we might be able to test-and-set a PartialMapped flag for this folio whenever
> > 1. someone is adding rmap with a number not equal to nr_pages
> > 2. someone is removing rmap with a number not equal to nr_pages
> > That means we are doing a partial mapping or unmapping,
> > and we increment partialmap_count by 1 and let debugfs or somewhere present
> > this count.
>
> Yes. The only "ugly" corner case if you have a split VMA. We're not
> batching rmap exceeding that.

I'm sorry, I don't quite get what the problem is. Do you mean a VMA split
crossing a PTE-mapped mTHP, or a PMD-mapped THP?

For the latter, I see __split_huge_pmd_locked() does have some mapcount
operation, but it is batched by
                        folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
                                                 vma, haddr, rmap_flags);

For the former, I don't see that any special mapcount handling is needed.
Am I missing something?

>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry
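
For illustration, the batched add_rmap()/remove_rmap() accounting discussed in
the message above might be modelled like this (a toy userspace sketch with
invented names; it is not the kernel's rmap code and makes no claim about the
final upstream interface):

/* Toy model of the batched-rmap idea: when a map/unmap covers fewer pages
 * than the whole folio, set a sticky flag on the folio and bump a global
 * counter once. Names are invented for this sketch. */
#include <stdbool.h>
#include <stdio.h>

struct folio_model {
        int nr_pages;            /* pages covered by this (m)THP */
        int mapcount;            /* simplified total mapcount */
        bool partially_mapped;   /* sticky flag, set on first partial op */
};

static long partialmap_count;    /* what debugfs might expose */

static void note_partial(struct folio_model *f, int nr)
{
        if (nr != f->nr_pages && !f->partially_mapped) {
                f->partially_mapped = true;
                partialmap_count++;
        }
}

static void add_rmap(struct folio_model *f, int nr)
{
        f->mapcount += nr;
        note_partial(f, nr);
}

static void remove_rmap(struct folio_model *f, int nr)
{
        f->mapcount -= nr;
        note_partial(f, nr);
}

int main(void)
{
        struct folio_model f = { .nr_pages = 16 };   /* e.g. 64K folio of 4K pages */

        add_rmap(&f, 16);        /* fully mapped in one go: no flag */
        remove_rmap(&f, 1);      /* partial unmap: flag set, counter bumped */
        remove_rmap(&f, 15);     /* later ops don't double count */
        printf("partialmap_count = %ld\n", partialmap_count);
        return 0;
}

The key point is that the "partially mapped" state is derived from a single
test on the batched nr argument, so the common fully-mapped path pays nothing
beyond that comparison.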
Ryan Roberts Jan. 12, 2024, 10:18 a.m. UTC | #49
On 11/01/2024 13:18, David Hildenbrand wrote:
> On 11.01.24 13:25, Ryan Roberts wrote:
>> On 10/01/2024 22:14, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 10/01/2024 11:38, Barry Song wrote:
>>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 10/01/2024 11:00, David Hildenbrand wrote:
>>>>>>> On 10.01.24 11:55, Ryan Roberts wrote:
>>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard
>>>>>>>>>>>>>>> <jhubbard@nvidia.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard
>>>>>>>>>>>>>>>>> <jhubbard@nvidia.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing
>>>>>>>>>>>>>>>>>> of mTHP
>>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various
>>>>>>>>>>>>>>>>>> containers and
>>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>>>>>>>> numbers
>>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely
>>>>>>>>>>>>>> global
>>>>>>>>>>>>>> and not
>>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
>>>>>>>>>>>>>> container).
>>>>>>>>>>>>>> If you want per-container, then you can probably just create the
>>>>>>>>>>>>>> container
>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>> cgroup?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here.
>>>>>>>>>>>>>>>>>> Probably
>>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial
>>>>>>>>>>>>>>>>>> reactions from
>>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are
>>>>>>>>>>>>>>>> more useful
>>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>>>>>>>> detailed stats.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> probably because this can be done without the modification of the
>>>>>>>>>>>>>>> kernel.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous
>>>>>>>>>>>>>> attempts to add
>>>>>>>>>>>>>> stats
>>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we
>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>> really
>>>>>>>>>>>>>> know exectly how to account mTHPs yet
>>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to
>>>>>>>>>>>>>> end up
>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also
>>>>>>>>>>>>>> been some
>>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in
>>>>>>>>>>>>>> sysfs, so
>>>>>>>>>>>>>> David
>>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I
>>>>>>>>>>>>>> know
>>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> cgroups
>>>>>>>>>>>>>> do live in sysfs).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term
>>>>>>>>>>>>>> solution
>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to
>>>>>>>>>>>>>> explore
>>>>>>>>>>>>>> what
>>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my
>>>>>>>>>>>>>>> case in
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma
>>>>>>>>>>>>>>> types,
>>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>>>>>>>> detailed
>>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and
>>>>>>>>>>>>>>> how many
>>>>>>>>>>>>>>> they have gotten.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or
>>>>>>>>>>>>>>>> similar)". And
>>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now?
>>>>>>>>>>>>>>>> That's
>>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI
>>>>>>>>>>>>>>>> stable"
>>>>>>>>>>>>>>>> location.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode
>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>> script
>>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are
>>>>>>>>>>>>>> provided). I
>>>>>>>>>>>>>> think I
>>>>>>>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>>>>>>>> /proc/iomem,
>>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should
>>>>>>>>>>>>>> then be
>>>>>>>>>>>>>> able to
>>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and
>>>>>>>>>>>>>> provide the
>>>>>>>>>>>>>> same
>>>>>>>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if
>>>>>>>>>>>> anyone wants
>>>>>>>>>>>> the
>>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't
>>>>>>>>>>>> have the
>>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how
>>>>>>>>>>>> many of
>>>>>>>>>>>> each
>>>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about
>>>>>>>>>>>> whether they
>>>>>>>>>>>> are
>>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary
>>>>>>>>>>>> if we
>>>>>>>>>>>> want
>>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>>>>>>>> going to
>>>>>>>>>>>> be particularly useful.
>>>>>>>>>>>>
>>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>>>>>>>> kernel;
>>>>>>>>>>>> if we want something equivalant to /proc/meminfo's
>>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not
>>>>>>>>>>>> just the
>>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you
>>>>>>>>>>>> set it,
>>>>>>>>>>>> you
>>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you
>>>>>>>>>>>> decrement.
>>>>>>>>>>>> But
>>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping
>>>>>>>>>>>> so its
>>>>>>>>>>>> easy
>>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to
>>>>>>>>>>>> scan the
>>>>>>>>>>>> PTEs
>>>>>>>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap
>>>>>>>>>>>> mechanism to
>>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>>>>>>>> process?".
>>>>>>>>>>>
>>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>>>>>>>> 1. entire map
>>>>>>>>>>> 2. subpage's map
>>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>>>>>>>
>>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
>>>>>>>>>>> we have an explicit
>>>>>>>>>>> cont_pte split which will decrease the entire map and increase the
>>>>>>>>>>> subpage's mapcount.
>>>>>>>>>>>
>>>>>>>>>>> but its downside is that we expose this info to mm-core.
>>>>>>>>>>
>>>>>>>>>> OK, but I think we have a slightly more generic situation going on
>>>>>>>>>> with the
>>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit
>>>>>>>>>> in the
>>>>>>>>>> PTE to determne if its fully mapped? That works for your case where
>>>>>>>>>> you only
>>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the
>>>>>>>>>> upstream, we
>>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if
>>>>>>>>>> its fully
>>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and
>>>>>>>>>> aligned,
>>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm
>>>>>>>>>> would
>>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map
>>>>>>>>>> transition.
>>>>>>>>>
>>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of
>>>>>>>>> some stats.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Indeed, I was intending to argue *against* doing it this way.
>>>>>>>> Fundamentally, if
>>>>>>>> we want to know what's fully mapped and what's not, then I don't see any
>>>>>>>> way
>>>>>>>> other than by scanning the page tables and we might as well do that in user
>>>>>>>> space with this script.
>>>>>>>>
>>>>>>>> Although, I expect you will shortly make a proposal that is simple to
>>>>>>>> implement
>>>>>>>> and prove me wrong ;-)
>>>>>>>
>>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
>>>>>>> making sense.
>>>>>>>
>>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You
>>>>>>> can
>>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
>>>>>>> optimizations without the cont-pte bit and everything is fine.
>>>>>>
>>>>>> Yes, but for debug and optimization, its useful to know when THPs are
>>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does
>>>>>> that for us, and I think we are tending towards agreement that there are
>>>>>> unlikely to be any cost benefits by moving it into the kernel.
>>>>>
>>>>> frequent partial unmap can defeat all purpose for us to use large folios.
>>>>> just imagine a large folio can soon be splitted after it is formed. we lose
>>>>> the performance gain and might get regression instead.
>>>>
>>>> nit: just because a THP gets partially unmapped in a process doesn't mean it
>>>> gets split into order-0 pages. If the folio still has all its pages mapped at
>>>> least once then no further action is taken. If the page being unmapped was the
>>>> last mapping of that page, then the THP is put on the deferred split queue, so
>>>> that it can be split in future if needed.
>>>>>
>>>>> and this can be very frequent, for example, one userspace heap management
>>>>> is releasing memory page by page.
>>>>>
>>>>> In our real product deployment, we might not care about the second partial
>>>>> unmapped,  we do care about the first partial unmapped as we can use this
>>>>> to know if split has ever happened on this large folios. an partial unmapped
>>>>> subpage can be unlikely re-mapped back.
>>>>>
>>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we
>>>>> care about if partial unmap has ever happened on a large folio more than how
>>>>> they are exactly partially unmapped :-)
>>>>
>>>> I'm not sure what you are suggesting here? A global boolean that tells you if
>>>> any folio in the system has ever been partially unmapped? That will almost
>>>> certainly always be true, even for a very well tuned system.
>>>
>>> not a global boolean but a per-folio boolean. in case userspace maps a region
>>> and has no userspace management, then we are fine as it is unlikely to have
>>> partial unmap/map things; in case userspace maps a region, but manages it
>>> by itself, such as heap things, we might result in lots of partial map/unmap,
>>> which can lead to 3 problems:
>>> 1. potential memory footprint increase, for example, while userspace releases
>>> some pages in a folio, we might still keep it as frequent splitting folio into
>>> basepages and releasing the unmapped subpage might be too expensive.
>>> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown
>>> might happen.
>>> 3. other maintenance overhead such as splitting large folios etc.
>>>
>>> We'd like to know how serious partial map things are happening. so either
>>> we will disable mTHP in this kind of VMAs, or optimize userspace to do
>>> some alignment according to the size of large folios.
>>>
>>> in android phones, we detect lots of apps, and also found some apps might
>>> do things like
>>> 1. mprotect on some pages within a large folio
>>> 2. mlock on some pages within a large folio
>>> 3. madv_free on some pages within a large folio
>>> 4. madv_pageout on some pages within a large folio.
>>>
>>> it would be good if we have a per-folio boolean to know how serious userspace
>>> is breaking the large folios. for example, if more than 50% folios in a vma has
>>> this problem, we can find it out and take some action.
>>
>> The high level value of these stats seems clear - I agree we need to be able to
>> get these insights. I think the issues are more around the implementation
>> though. I'm struggling to understand exactly how we could implement a lot of
>> these things cheaply (either in the kernel or in user space).
>>
>> Let me try to work though what I think you are suggesting:
>>
>>   - every THP is initially fully mapped
> 
> Not for pagecache folios.
> 
>>   - when an operation causes a partial unmap, mark the folio as having at least
>>     one partial mapping
>>   - on transition from "no partial mappings" to "at least one partial mapping"
>>     increment a "anon-partial-<size>kB" (one for each supported folio size)
>>     counter by the folio size
>>   - on transition from "at least one partial mapping" to "fully unampped
>>     everywhere" decrement the counter by the folio size
>>
>> I think the issue with this is that a folio that is fully mapped in a process
>> that gets forked, then is partially unmapped in 1 process, will be accounted as
>> partially mapped even after the process that partially unmapped it exits, even
>> though that folio is now fully mapped in all processes that map it. Is that a
>> problem, perhaps not? I'm not sure.
> 
> What I can offer with my total mapcount I am working on (+ entire/pmd mapcount,
> but let's put that aside):

Is "total mapcount" bound up as part of your "precise shared vs exclusive" work
or is it separate? If separate, do you have any ballpark feel for how likely it
is to land and if so, when?

> 
> 1) total_mapcount not multiples of folio_nr_page -> at least one process
> currently maps the folio partially
> 
> 2) total_mapcount is less than folio_nr_page -> surely partially mapped
> 
> I think for most of anon memory (note that most folios are always exclusive in
> our system, not cow-shared) 2) would already be sufficient.
>
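
For illustration, the two total-mapcount checks quoted at the end of the
message above can be written as simple predicates (a sketch only; it assumes
some per-folio total mapcount is available and the function names are
invented):

#include <stdbool.h>
#include <stdio.h>

/* Check 1: a total mapcount that is not a multiple of the folio's page count
 * means at least one process currently maps the folio partially. */
static bool partial_mapping_exists(long total_mapcount, long nr_pages)
{
        return total_mapcount % nr_pages != 0;
}

/* Check 2: a total mapcount below the folio's page count means no process can
 * be mapping it fully, so (if mapped at all) it is surely partially mapped. */
static bool surely_partially_mapped(long total_mapcount, long nr_pages)
{
        return total_mapcount < nr_pages;
}

int main(void)
{
        /* 64K folio of 4K pages, mapped fully in one process and partially
         * (4 pages) in another: total mapcount 16 + 4 = 20. */
        printf("check1: %d, check2: %d\n",
               partial_mapping_exists(20, 16), surely_partially_mapped(20, 16));
        return 0;
}

Neither check is exhaustive: two processes each mapping half of a 16-page
folio give a total of 16, so neither predicate fires; that is presumably why
check 2 is described as "already sufficient" mainly for anon folios that are
exclusive to one process.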
Ryan Roberts Jan. 12, 2024, 10:25 a.m. UTC | #50
On 11/01/2024 20:45, Barry Song wrote:
> On Fri, Jan 12, 2024 at 1:25 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 10/01/2024 22:14, Barry Song wrote:
>>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 10/01/2024 11:38, Barry Song wrote:
>>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 10/01/2024 11:00, David Hildenbrand wrote:
>>>>>>> On 10.01.24 11:55, Ryan Roberts wrote:
>>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>>>>>>>> numbers
>>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
>>>>>>>>>>>>>> and not
>>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
>>>>>>>>>>>>>> container).
>>>>>>>>>>>>>> If you want per-container, then you can probably just create the container
>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>> cgroup?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>>>>>>>> detailed stats.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>>>>>>>>>> stats
>>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>>>>>>>>> really
>>>>>>>>>>>>>> know exectly how to account mTHPs yet
>>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>>>>>>>>> David
>>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>>>>>>>>> cgroups
>>>>>>>>>>>>>> do live in sysfs).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>>>>>>>>> what
>>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>>>>>>>> detailed
>>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>>>>>>>>> they have gotten.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>>>>>>>>> location.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>>>>>>>>> script
>>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>>>>>>>>> think I
>>>>>>>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>>>>>>>> /proc/iomem,
>>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>>>>>>>>> able to
>>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>>>>>>>>> same
>>>>>>>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>>>>>>>>>> the
>>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of
>>>>>>>>>>>> each
>>>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>>>>>>>>>> are
>>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
>>>>>>>>>>>> want
>>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>>>>>>>> going to
>>>>>>>>>>>> be particularly useful.
>>>>>>>>>>>>
>>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>>>>>>>> kernel;
>>>>>>>>>>>> if we want something equivalant to /proc/meminfo's
>>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>>>>>>>>>> you
>>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
>>>>>>>>>>>> But
>>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>>>>>>>>> easy
>>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>>>>>>>>> PTEs
>>>>>>>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>>>>>>>> process?".
>>>>>>>>>>>
>>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>>>>>>>> 1. entire map
>>>>>>>>>>> 2. subpage's map
>>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>>>>>>>
>>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
>>>>>>>>>>> we have an explicit
>>>>>>>>>>> cont_pte split which will decrease the entire map and increase the
>>>>>>>>>>> subpage's mapcount.
>>>>>>>>>>>
>>>>>>>>>>> but its downside is that we expose this info to mm-core.
>>>>>>>>>>
>>>>>>>>>> OK, but I think we have a slightly more generic situation going on with the
>>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only
>>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
>>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map
>>>>>>>>>> transition.
>>>>>>>>>
>>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
>>>>>>>> we want to know what's fully mapped and what's not, then I don't see any way
>>>>>>>> other than by scanning the page tables and we might as well do that in user
>>>>>>>> space with this script.
>>>>>>>>
>>>>>>>> Although, I expect you will shortly make a proposal that is simple to implement
>>>>>>>> and prove me wrong ;-)
>>>>>>>
>>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
>>>>>>> making sense.
>>>>>>>
>>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
>>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
>>>>>>> optimizations without the cont-pte bit and everything is fine.
>>>>>>
>>>>>> Yes, but for debug and optimization, its useful to know when THPs are
>>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does
>>>>>> that for us, and I think we are tending towards agreement that there are
>>>>>> unlikely to be any cost benefits by moving it into the kernel.
>>>>>
>>>>> frequent partial unmap can defeat all purpose for us to use large folios.
>>>>> just imagine a large folio can soon be splitted after it is formed. we lose
>>>>> the performance gain and might get regression instead.
>>>>
>>>> nit: just because a THP gets partially unmapped in a process doesn't mean it
>>>> gets split into order-0 pages. If the folio still has all its pages mapped at
>>>> least once then no further action is taken. If the page being unmapped was the
>>>> last mapping of that page, then the THP is put on the deferred split queue, so
>>>> that it can be split in future if needed.
>>>>>
>>>>> and this can be very frequent, for example, one userspace heap management
>>>>> is releasing memory page by page.
>>>>>
>>>>> In our real product deployment, we might not care about the second partial
>>>>> unmapped,  we do care about the first partial unmapped as we can use this
>>>>> to know if split has ever happened on this large folios. an partial unmapped
>>>>> subpage can be unlikely re-mapped back.
>>>>>
>>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we
>>>>> care about if partial unmap has ever happened on a large folio more than how
>>>>> they are exactly partially unmapped :-)
>>>>
>>>> I'm not sure what you are suggesting here? A global boolean that tells you if
>>>> any folio in the system has ever been partially unmapped? That will almost
>>>> certainly always be true, even for a very well tuned system.
>>>
>>> not a global boolean but a per-folio boolean. in case userspace maps a region
>>> and has no userspace management, then we are fine as it is unlikely to have
>>> partial unmap/map things; in case userspace maps a region, but manages it
>>> by itself, such as heap things, we might result in lots of partial map/unmap,
>>> which can lead to 3 problems:
>>> 1. potential memory footprint increase, for example, while userspace releases
>>> some pages in a folio, we might still keep it as frequent splitting folio into
>>> basepages and releasing the unmapped subpage might be too expensive.
>>> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown
>>> might happen.
>>> 3. other maintenance overhead such as splitting large folios etc.
>>>
>>> We'd like to know how serious partial map things are happening. so either
>>> we will disable mTHP in this kind of VMAs, or optimize userspace to do
>>> some alignment according to the size of large folios.
>>>
>>> in android phones, we detect lots of apps, and also found some apps might
>>> do things like
>>> 1. mprotect on some pages within a large folio
>>> 2. mlock on some pages within a large folio
>>> 3. madv_free on some pages within a large folio
>>> 4. madv_pageout on some pages within a large folio.
>>>
>>> it would be good if we have a per-folio boolean to know how serious userspace
>>> is breaking the large folios. for example, if more than 50% folios in a vma has
>>> this problem, we can find it out and take some action.
>>
>> The high level value of these stats seems clear - I agree we need to be able to
>> get these insights. I think the issues are more around the implementation
>> though. I'm struggling to understand exactly how we could implement a lot of
>> these things cheaply (either in the kernel or in user space).
>>
>> Let me try to work though what I think you are suggesting:
>>
>>  - every THP is initially fully mapped
>>  - when an operation causes a partial unmap, mark the folio as having at least
>>    one partial mapping
>>  - on transition from "no partial mappings" to "at least one partial mapping"
>>    increment a "anon-partial-<size>kB" (one for each supported folio size)
>>    counter by the folio size
>>  - on transition from "at least one partial mapping" to "fully unampped
>>    everywhere" decrement the counter by the folio size
>>
>> I think the issue with this is that a folio that is fully mapped in a process
>> that gets forked, then is partially unmapped in 1 process, will be accounted as
>> partially mapped even after the process that partially unmapped it exits, even
>> though that folio is now fully mapped in all processes that map it. Is that a
>> problem, perhaps not? I'm not sure.
> 
> I don't think this is a problem, as what we really care about is whether some "bad"
> behaviour has ever happened in userspace. Even though the "bad" guy has exited
> or been killed, we still need the record to find that out.
> 
> Besides the global count, if we can reflect the partially mapped count per VMA,
> for example in smaps, this will help even more to locate the problematic userspace
> code.

Right. Although note that smaps is already scanning the page table, so for the
smaps case we could do it precisely - it's already slow. The thpmaps script
already gives a precise account of partially mapped THPs, FYI.
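
For reference, the kind of pagemap/kpageflags correlation the script relies on
looks roughly like the following minimal C sketch (illustrative only: it just
counts THP-backed pages in a virtual range and does none of the
fully/partially-mapped or alignment accounting the Python script performs; the
program name and arguments are invented, while the bit definitions follow the
kernel's pagemap documentation):

/* Count present pages in [start, end) of a process that are backed by THP
 * pages, by correlating /proc/<pid>/pagemap with /proc/kpageflags.
 * Requires root. Error handling is trimmed for brevity. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PM_PRESENT      (1ULL << 63)            /* pagemap: page present */
#define PM_PFN_MASK     ((1ULL << 55) - 1)      /* pagemap: PFN in bits 0-54 */
#define KPF_THP         (1ULL << 22)            /* kpageflags: part of a THP */

static uint64_t read_entry(int fd, uint64_t index)
{
        uint64_t val = 0;

        if (pread(fd, &val, sizeof(val), index * sizeof(val)) != (ssize_t)sizeof(val))
                return 0;
        return val;
}

int main(int argc, char **argv)
{
        if (argc != 4) {
                fprintf(stderr, "usage: %s <pid> <start> <end>\n", argv[0]);
                return 1;
        }

        long page_size = sysconf(_SC_PAGESIZE);
        uint64_t start = strtoull(argv[2], NULL, 0);
        uint64_t end = strtoull(argv[3], NULL, 0);
        uint64_t thp_pages = 0;
        char path[64];

        snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
        int pagemap = open(path, O_RDONLY);
        int kpageflags = open("/proc/kpageflags", O_RDONLY);
        if (pagemap < 0 || kpageflags < 0) {
                perror("open");
                return 1;
        }

        for (uint64_t va = start; va < end; va += page_size) {
                uint64_t ent = read_entry(pagemap, va / page_size);

                if (!(ent & PM_PRESENT))
                        continue;
                if (read_entry(kpageflags, ent & PM_PFN_MASK) & KPF_THP)
                        thp_pages++;
        }

        printf("THP-backed pages in range: %" PRIu64 "\n", thp_pages);
        close(pagemap);
        close(kpageflags);
        return 0;
}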

> 
> Thanks
> Barry
Ryan Roberts Jan. 12, 2024, 10:44 a.m. UTC | #51
On 12/01/2024 06:03, Barry Song wrote:
> On Fri, Jan 12, 2024 at 9:28 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 11.01.24 21:21, Barry Song wrote:
>>> On Fri, Jan 12, 2024 at 2:18 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 11.01.24 13:25, Ryan Roberts wrote:
>>>>> On 10/01/2024 22:14, Barry Song wrote:
>>>>>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 10/01/2024 11:38, Barry Song wrote:
>>>>>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 10/01/2024 11:00, David Hildenbrand wrote:
>>>>>>>>>> On 10.01.24 11:55, Ryan Roberts wrote:
>>>>>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>>>>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>>>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@nvidia.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP
>>>>>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various containers and
>>>>>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>>>>>>>>>>> numbers
>>>>>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely global
>>>>>>>>>>>>>>>>> and not
>>>>>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
>>>>>>>>>>>>>>>>> container).
>>>>>>>>>>>>>>>>> If you want per-container, then you can probably just create the container
>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>> cgroup?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably
>>>>>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial reactions from
>>>>>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful
>>>>>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are the
>>>>>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>>>>>>>>>>> detailed stats.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> probably because this can be done without the modification of the kernel.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous attempts to add
>>>>>>>>>>>>>>>>> stats
>>>>>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we don't
>>>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>>>> know exectly how to account mTHPs yet
>>>>>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to end up
>>>>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also been some
>>>>>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in sysfs, so
>>>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I know
>>>>>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and
>>>>>>>>>>>>>>>>> cgroups
>>>>>>>>>>>>>>>>> do live in sysfs).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term solution
>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to explore
>>>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my case in
>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma types,
>>>>>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>>>>>>>>>>> detailed
>>>>>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and how many
>>>>>>>>>>>>>>>>>> they have gotten.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or similar)". And
>>>>>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's
>>>>>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI stable"
>>>>>>>>>>>>>>>>>>> location.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the
>>>>>>>>>>>>>>>>> script
>>>>>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are provided). I
>>>>>>>>>>>>>>>>> think I
>>>>>>>>>>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>>>>>>>>>>> /proc/iomem,
>>>>>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should then be
>>>>>>>>>>>>>>>>> able to
>>>>>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and provide the
>>>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if anyone wants
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't have the
>>>>>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how many of
>>>>>>>>>>>>>>> each
>>>>>>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about whether they
>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary if we
>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>>>>>>>>>>> going to
>>>>>>>>>>>>>>> be particularly useful.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>>>>>>>>>>> kernel;
>>>>>>>>>>>>>>> if we want something equivalant to /proc/meminfo's
>>>>>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the
>>>>>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you set it,
>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you decrement.
>>>>>>>>>>>>>>> But
>>>>>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping so its
>>>>>>>>>>>>>>> easy
>>>>>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to scan the
>>>>>>>>>>>>>>> PTEs
>>>>>>>>>>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to
>>>>>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>>>>>>>>>>> process?".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>>>>>>>>>>> 1. entire map
>>>>>>>>>>>>>> 2. subpage's map
>>>>>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
>>>>>>>>>>>>>> we have an explicit
>>>>>>>>>>>>>> cont_pte split which will decrease the entire map and increase the
>>>>>>>>>>>>>> subpage's mapcount.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> but its downside is that we expose this info to mm-core.
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK, but I think we have a slightly more generic situation going on with the
>>>>>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit in the
>>>>>>>>>>>>> PTE to determne if its fully mapped? That works for your case where you only
>>>>>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the upstream, we
>>>>>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if its fully
>>>>>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and aligned,
>>>>>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm would
>>>>>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map
>>>>>>>>>>>>> transition.
>>>>>>>>>>>>
>>>>>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of some stats.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if
>>>>>>>>>>> we want to know what's fully mapped and what's not, then I don't see any way
>>>>>>>>>>> other than by scanning the page tables and we might as well do that in user
>>>>>>>>>>> space with this script.
>>>>>>>>>>>
>>>>>>>>>>> Although, I expect you will shortly make a proposal that is simple to implement
>>>>>>>>>>> and prove me wrong ;-)
>>>>>>>>>>
>>>>>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
>>>>>>>>>> making sense.
>>>>>>>>>>
>>>>>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You can
>>>>>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
>>>>>>>>>> optimizations without the cont-pte bit and everything is fine.
>>>>>>>>>
>>>>>>>>> Yes, but for debug and optimization, its useful to know when THPs are
>>>>>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does
>>>>>>>>> that for us, and I think we are tending towards agreement that there are
>>>>>>>>> unlikely to be any cost benefits by moving it into the kernel.
>>>>>>>>
>>>>>>>> frequent partial unmap can defeat all purpose for us to use large folios.
>>>>>>>> just imagine a large folio can soon be splitted after it is formed. we lose
>>>>>>>> the performance gain and might get regression instead.
>>>>>>>
>>>>>>> nit: just because a THP gets partially unmapped in a process doesn't mean it
>>>>>>> gets split into order-0 pages. If the folio still has all its pages mapped at
>>>>>>> least once then no further action is taken. If the page being unmapped was the
>>>>>>> last mapping of that page, then the THP is put on the deferred split queue, so
>>>>>>> that it can be split in future if needed.
>>>>>>>>
>>>>>>>> and this can be very frequent, for example, one userspace heap management
>>>>>>>> is releasing memory page by page.
>>>>>>>>
>>>>>>>> In our real product deployment, we might not care about the second partial
>>>>>>>> unmapped,  we do care about the first partial unmapped as we can use this
>>>>>>>> to know if split has ever happened on this large folios. an partial unmapped
>>>>>>>> subpage can be unlikely re-mapped back.
>>>>>>>>
>>>>>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we
>>>>>>>> care about if partial unmap has ever happened on a large folio more than how
>>>>>>>> they are exactly partially unmapped :-)
>>>>>>>
>>>>>>> I'm not sure what you are suggesting here? A global boolean that tells you if
>>>>>>> any folio in the system has ever been partially unmapped? That will almost
>>>>>>> certainly always be true, even for a very well tuned system.
>>>>>>
>>>>>> not a global boolean but a per-folio boolean. in case userspace maps a region
>>>>>> and has no userspace management, then we are fine as it is unlikely to have
>>>>>> partial unmap/map things; in case userspace maps a region, but manages it
>>>>>> by itself, such as heap things, we might result in lots of partial map/unmap,
>>>>>> which can lead to 3 problems:
>>>>>> 1. potential memory footprint increase, for example, while userspace releases
>>>>>> some pages in a folio, we might still keep it as frequent splitting folio into
>>>>>> basepages and releasing the unmapped subpage might be too expensive.
>>>>>> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown
>>>>>> might happen.
>>>>>> 3. other maintenance overhead such as splitting large folios etc.
>>>>>>
>>>>>> We'd like to know how serious partial map things are happening. so either
>>>>>> we will disable mTHP in this kind of VMAs, or optimize userspace to do
>>>>>> some alignment according to the size of large folios.
>>>>>>
>>>>>> in android phones, we detect lots of apps, and also found some apps might
>>>>>> do things like
>>>>>> 1. mprotect on some pages within a large folio
>>>>>> 2. mlock on some pages within a large folio
>>>>>> 3. madv_free on some pages within a large folio
>>>>>> 4. madv_pageout on some pages within a large folio.
>>>>>>
>>>>>> it would be good if we have a per-folio boolean to know how serious userspace
>>>>>> is breaking the large folios. for example, if more than 50% folios in a vma has
>>>>>> this problem, we can find it out and take some action.
>>>>>
>>>>> The high level value of these stats seems clear - I agree we need to be able to
>>>>> get these insights. I think the issues are more around the implementation
>>>>> though. I'm struggling to understand exactly how we could implement a lot of
>>>>> these things cheaply (either in the kernel or in user space).
>>>>>
>>>>> Let me try to work though what I think you are suggesting:
>>>>>
>>>>>    - every THP is initially fully mapped
>>>>
>>>> Not for pagecache folios.
>>>>
>>>>>    - when an operation causes a partial unmap, mark the folio as having at least
>>>>>      one partial mapping
>>>>>    - on transition from "no partial mappings" to "at least one partial mapping"
>>>>>      increment a "anon-partial-<size>kB" (one for each supported folio size)
>>>>>      counter by the folio size
>>>>>    - on transition from "at least one partial mapping" to "fully unampped
>>>>>      everywhere" decrement the counter by the folio size
>>>>>
>>>>> I think the issue with this is that a folio that is fully mapped in a process
>>>>> that gets forked, then is partially unmapped in 1 process, will be accounted as
>>>>> partially mapped even after the process that partially unmapped it exits, even
>>>>> though that folio is now fully mapped in all processes that map it. Is that a
>>>>> problem, perhaps not? I'm not sure.
>>>>
>>>> What I can offer with my total mapcount I am working on (+ entire/pmd
>>>> mapcount, but let's put that aside):
>>>>
>>>> 1) total_mapcount not multiples of folio_nr_page -> at least one process
>>>> currently maps the folio partially
>>>>
>>>> 2) total_mapcount is less than folio_nr_page -> surely partially mapped
>>>>
>>>> I think for most of anon memory (note that most folios are always
>>>> exclusive in our system, not cow-shared) 2) would already be sufficient.
>>>
>>> if we can improve Ryan's "mm: Batch-copy PTE ranges during fork()" to
>>> add nr_pages in copy_pte_range for rmap.
>>> copy_pte_range()
>>> {
>>>             folio_try_dup_anon_rmap_ptes(...nr_pages....)
>>> }
>>> and at the same time, in zap_pte_range(), we remove the whole anon_rmap
>>> if the zapped-range covers the whole folio.
>>>
>>> Replace the for-loop
>>> for (i = 0; i < nr; i++, page++) {
>>>          add_rmap(1);
>>> }
>>> for (i = 0; i < nr; i++, page++) {
>>>          remove_rmap(1);
>>> }
>>> by always using add_rmap(nr_pages) and remove_rmap(nr_pages) if we
>>> are doing the entire mapping/unmapping
>>
>> That's precisely what I have already running as prototypes :) And I
>> promised Ryan to get to this soon, clean it up and sent it out.
> 
> Cool. Glad we'll have it soon.
> 
>>
>> .
>>>
>>> then we might be able to TestAndSetPartialMapped flag for this folio anywhile
>>> 1. someone is adding rmap with a number not equal nr_pages
>>> 2. someone is removing rmap with a number not equal nr_pages
>>> That means we are doing partial mapping or unmapping.
>>> and we increment partialmap_count by 1, let debugfs or somewhere present
>>> this count.
>>
>> Yes. The only "ugly" corner case is if you have a split VMA. We're not
>> batching rmap exceeding that.
> 
> I am sorry I don't quite get what the problem is. Do you mean splitting
> vma is crossing a PTE-mapped mTHP or a PMD-mapped THP?
> 
> for the latter, I see __split_huge_pmd_locked() does have some mapcount
> operation but it is batched by
>                         folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>                                                  vma, haddr, rmap_flags);
> 
> for the former, I don't find any special mapcount thing is needed.
> Do I miss something?

I think the case that David is describing is when you have a THP that is
contpte-mapped in a VMA, then you do an operation that causes the VMA to be
split but which doesn't cause the PTEs to be remapped (e.g. MADV_HUGEPAGE
covering part of the mTHP). In this case you end up with 2 VMAs and a THP
straddling both, which is contpte-mapped.

So in this case there would not be a single batch rmap call when unmapping it.
At best there would be 2; one in the context of each VMA, covering the
respective parts of the THP. I don't think this is a big problem though; it
would be counted as a partial unmap, but it's a corner case that is unlikely to
happen in practice.
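
(For anyone who wants to see that case, it's easy to reproduce from userspace.
Here is a rough sketch in Python, using only the stdlib mmap module; it assumes
a Linux kernel with THP/mTHP enabled and Python >= 3.8 for mmap.madvise(), and
the sizes are arbitrary:)

import mmap

SZ_2M = 2 * 1024 * 1024

# Map and touch 2M of private anonymous memory; with THP/mTHP enabled
# this may end up backed by one or more large folios.
m = mmap.mmap(-1, SZ_2M, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
m[:] = b'\x55' * SZ_2M

# Changing the THP hint on only the first half splits the VMA but does
# not unmap anything, so a large folio straddling the split point stays
# mapped across both resulting VMAs.
m.madvise(mmap.MADV_NOHUGEPAGE, 0, SZ_2M // 2)

# /proc/self/maps now shows two adjacent VMAs for the region.
with open('/proc/self/maps') as maps:
    print(maps.read())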


I just want to try to summarise the counters we have discussed in this thread to
check my understanding:

1. global mTHP successful allocation counter, per mTHP size (inc only)
2. global mTHP failed allocation counter, per mTHP size (inc only)
3. global mTHP currently allocated counter, per mTHP size (inc and dec)
4. global "mTHP became partially mapped in 1 or more processes" counter (inc only)

I guess the above should apply to both page cache and anon? Do we want separate
counters for each?

I'm not sure if we would want 4. to be per mTHP size or a single counter for all?
Probably the former, if it provides a bit more info for negligible cost.

Where should these be exposed? I guess /proc/vmstat is the obvious place, but I
don't think there is any precedent for per-size counters (especially since the
sizes will change depending on the system). Perhaps it would be better to expose
them in their per-size directories in
/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB ?


In addition to the above global counters, there is also a case for adding a
per-process version of 4. to smaps.

Is that accurate?
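
(Purely as an illustration of how userspace might consume such counters if they
did land as files under the per-size directories: the sketch below just polls
them. The file names alloc_success, alloc_fail and partially_mapped are
hypothetical placeholders, not a proposal for the ABI.)

import os

THP_SYSFS = '/sys/kernel/mm/transparent_hugepage'

# Hypothetical per-size counter file names; whatever names we settle on
# would go here.
COUNTERS = ['alloc_success', 'alloc_fail', 'partially_mapped']

def read_mthp_counters():
    stats = {}
    for entry in os.listdir(THP_SYSFS):
        if not entry.startswith('hugepages-'):
            continue
        size = entry[len('hugepages-'):]  # e.g. "64kB"
        for name in COUNTERS:
            path = os.path.join(THP_SYSFS, entry, name)
            try:
                with open(path) as f:
                    stats[f'{size}-{name}'] = int(f.read())
            except FileNotFoundError:
                # Counter not implemented (yet) for this size.
                pass
    return stats

print(read_mthp_counters())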


> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
> 
> Thanks
> Barry
David Hildenbrand Jan. 17, 2024, 3:49 p.m. UTC | #52
On 12.01.24 11:18, Ryan Roberts wrote:
> On 11/01/2024 13:18, David Hildenbrand wrote:
>> On 11.01.24 13:25, Ryan Roberts wrote:
>>> On 10/01/2024 22:14, Barry Song wrote:
>>>> On Wed, Jan 10, 2024 at 7:59 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 10/01/2024 11:38, Barry Song wrote:
>>>>>> On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 10/01/2024 11:00, David Hildenbrand wrote:
>>>>>>>> On 10.01.24 11:55, Ryan Roberts wrote:
>>>>>>>>> On 10/01/2024 10:42, David Hildenbrand wrote:
>>>>>>>>>> On 10.01.24 11:38, Ryan Roberts wrote:
>>>>>>>>>>> On 10/01/2024 10:30, Barry Song wrote:
>>>>>>>>>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/01/2024 09:09, Barry Song wrote:
>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard
>>>>>>>>>>>>>>>> <jhubbard@nvidia.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
>>>>>>>>>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard
>>>>>>>>>>>>>>>>>> <jhubbard@nvidia.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> One thing that immediately came up during some recent testing
>>>>>>>>>>>>>>>>>>> of mTHP
>>>>>>>>>>>>>>>>>>> on arm64: the pid requirement is sometimes a little awkward. I'm
>>>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>>>> tests on a machine at a time for now, inside various
>>>>>>>>>>>>>>>>>>> containers and
>>>>>>>>>>>>>>>>>>> such, and it would be nice if there were an easy way to get some
>>>>>>>>>>>>>>>>>>> numbers
>>>>>>>>>>>>>>>>>>> for the mTHPs across the whole machine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just to confirm, you're expecting these "global" stats be truely
>>>>>>>>>>>>>>> global
>>>>>>>>>>>>>>> and not
>>>>>>>>>>>>>>> per-container? (asking because you exploicitly mentioned being in a
>>>>>>>>>>>>>>> container).
>>>>>>>>>>>>>>> If you want per-container, then you can probably just create the
>>>>>>>>>>>>>>> container
>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>> cgroup?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here.
>>>>>>>>>>>>>>>>>>> Probably
>>>>>>>>>>>>>>>>>>> this is fine as-is. But I wanted to give some initial
>>>>>>>>>>>>>>>>>>> reactions from
>>>>>>>>>>>>>>>>>>> just some quick runs: the global state would be convenient.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +1. but this seems to be impossible by scanning pagemap?
>>>>>>>>>>>>>>>>>> so may we add this statistics information in kernel just like
>>>>>>>>>>>>>>>>>> /proc/meminfo or a separate /proc/mthp_info?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes. From my perspective, it looks like the global stats are
>>>>>>>>>>>>>>>>> more useful
>>>>>>>>>>>>>>>>> initially, and the more detailed per-pid or per-cgroup stats are
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> next level of investigation. So feels odd to start with the more
>>>>>>>>>>>>>>>>> detailed stats.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> probably because this can be done without the modification of the
>>>>>>>>>>>>>>>> kernel.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes indeed, as John said in an earlier thread, my previous
>>>>>>>>>>>>>>> attempts to add
>>>>>>>>>>>>>>> stats
>>>>>>>>>>>>>>> directly in the kernel got pushback; DavidH was concerned that we
>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>> know exectly how to account mTHPs yet
>>>>>>>>>>>>>>> (whole/partial/aligned/unaligned/per-size/etc) so didn't want to
>>>>>>>>>>>>>>> end up
>>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>>> the wrong ABI and having to maintain it forever. There has also
>>>>>>>>>>>>>>> been some
>>>>>>>>>>>>>>> pushback regarding adding more values to multi-value files in
>>>>>>>>>>>>>>> sysfs, so
>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>> was suggesting coming up with a whole new scheme at some point (I
>>>>>>>>>>>>>>> know
>>>>>>>>>>>>>>> /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> cgroups
>>>>>>>>>>>>>>> do live in sysfs).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Anyway, this script was my attempt to 1) provide a short term
>>>>>>>>>>>>>>> solution
>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>> "we need some stats" request and 2) provide a context in which to
>>>>>>>>>>>>>>> explore
>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>> the right stats are - this script can evolve without the ABI problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The detailed per-pid or per-cgroup is still quite useful to my
>>>>>>>>>>>>>>>> case in
>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>> we set mTHP enabled/disabled and allowed sizes according to vma
>>>>>>>>>>>>>>>> types,
>>>>>>>>>>>>>>>> eg. libc_malloc, java heaps etc.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Different vma types can have different anon_name. So I can use the
>>>>>>>>>>>>>>>> detailed
>>>>>>>>>>>>>>>> info to find out if specific VMAs have gotten mTHP properly and
>>>>>>>>>>>>>>>> how many
>>>>>>>>>>>>>>>> they have gotten.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to
>>>>>>>>>>>>>>>>> introduce stats directly into the kernel (e.g. smaps or
>>>>>>>>>>>>>>>>> similar)". And
>>>>>>>>>>>>>>>>> earlier he ran into some pushback on trying to set up /proc or /sys
>>>>>>>>>>>>>>>>> values because this is still such an early feature.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I wonder if we could put the global stats in debugfs for now?
>>>>>>>>>>>>>>>>> That's
>>>>>>>>>>>>>>>>> specifically supposed to be a "we promise *not* to keep this ABI
>>>>>>>>>>>>>>>>> stable"
>>>>>>>>>>>>>>>>> location.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Now that I think about it, I wonder if we can add a --global mode
>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>> script
>>>>>>>>>>>>>>> (or just infer global when neither --pid nor --cgroup are
>>>>>>>>>>>>>>> provided). I
>>>>>>>>>>>>>>> think I
>>>>>>>>>>>>>>> should be able to determine all the physical memory ranges from
>>>>>>>>>>>>>>> /proc/iomem,
>>>>>>>>>>>>>>> then grab all the info we need from /proc/kpageflags. We should
>>>>>>>>>>>>>>> then be
>>>>>>>>>>>>>>> able to
>>>>>>>>>>>>>>> process it all in much the same way as for --pid/--cgroup and
>>>>>>>>>>>>>>> provide the
>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>> stats, but it will apply globally. What do you think?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Having now thought about this for a few mins (in the shower, if
>>>>>>>>>>>>> anyone wants
>>>>>>>>>>>>> the
>>>>>>>>>>>>> complete picture :) ), this won't quite work. This approach doesn't
>>>>>>>>>>>>> have the
>>>>>>>>>>>>> virtual mapping information so the best it can do is tell us "how
>>>>>>>>>>>>> many of
>>>>>>>>>>>>> each
>>>>>>>>>>>>> size of THP are allocated?" - it doesn't tell us anything about
>>>>>>>>>>>>> whether they
>>>>>>>>>>>>> are
>>>>>>>>>>>>> fully or partially mapped or what their alignment is (all necessary
>>>>>>>>>>>>> if we
>>>>>>>>>>>>> want
>>>>>>>>>>>>> to know if they are contpte-mapped). So I don't think this approach is
>>>>>>>>>>>>> going to
>>>>>>>>>>>>> be particularly useful.
>>>>>>>>>>>>>
>>>>>>>>>>>>> And this is also the big problem if we want to gather stats inside the
>>>>>>>>>>>>> kernel;
>>>>>>>>>>>>> if we want something equivalant to /proc/meminfo's
>>>>>>>>>>>>> AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not
>>>>>>>>>>>>> just the
>>>>>>>>>>>>> allocation of the THP but also whether it is mapped. That's easy for
>>>>>>>>>>>>> PMD-mappings, because there is only one entry to consider - when you
>>>>>>>>>>>>> set it,
>>>>>>>>>>>>> you
>>>>>>>>>>>>> increment the number of PMD-mapped THPs, when you clear it, you
>>>>>>>>>>>>> decrement.
>>>>>>>>>>>>> But
>>>>>>>>>>>>> for PTE-mappings it's harder; you know the size when you are mapping
>>>>>>>>>>>>> so its
>>>>>>>>>>>>> easy
>>>>>>>>>>>>> to increment, but you can do a partial unmap, so you would need to
>>>>>>>>>>>>> scan the
>>>>>>>>>>>>> PTEs
>>>>>>>>>>>>> to figure out if we are unmapping the first page of a previously
>>>>>>>>>>>>> fully-PTE-mapped THP, which is expensive. We would need a cheap
>>>>>>>>>>>>> mechanism to
>>>>>>>>>>>>> determine "is this folio fully and contiguously mapped in at least one
>>>>>>>>>>>>> process?".
>>>>>>>>>>>>
>>>>>>>>>>>> as OPPO's approach I shared to you before is maintaining two mapcount
>>>>>>>>>>>> 1. entire map
>>>>>>>>>>>> 2. subpage's map
>>>>>>>>>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
>>>>>>>>>>>>
>>>>>>>>>>>> This isn't a problem for us. and everytime if we do a partial unmap,
>>>>>>>>>>>> we have an explicit
>>>>>>>>>>>> cont_pte split which will decrease the entire map and increase the
>>>>>>>>>>>> subpage's mapcount.
>>>>>>>>>>>>
>>>>>>>>>>>> but its downside is that we expose this info to mm-core.
>>>>>>>>>>>
>>>>>>>>>>> OK, but I think we have a slightly more generic situation going on
>>>>>>>>>>> with the
>>>>>>>>>>> upstream; If I've understood correctly, you are using the PTE_CONT bit
>>>>>>>>>>> in the
>>>>>>>>>>> PTE to determne if its fully mapped? That works for your case where
>>>>>>>>>>> you only
>>>>>>>>>>> have 1 size of THP that you care about (contpte-size). But for the
>>>>>>>>>>> upstream, we
>>>>>>>>>>> have multi-size THP so we can't use the PTE_CONT bit to determine if
>>>>>>>>>>> its fully
>>>>>>>>>>> mapped because we can only use that bit if the THP is at least 64K and
>>>>>>>>>>> aligned,
>>>>>>>>>>> and only on arm64. We would need a SW bit for this purpose, and the mm
>>>>>>>>>>> would
>>>>>>>>>>> need to update that SW bit for every PTE one the full -> partial map
>>>>>>>>>>> transition.
>>>>>>>>>>
>>>>>>>>>> Oh no. Let's not make everything more complicated for the purpose of
>>>>>>>>>> some stats.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Indeed, I was intending to argue *against* doing it this way.
>>>>>>>>> Fundamentally, if
>>>>>>>>> we want to know what's fully mapped and what's not, then I don't see any
>>>>>>>>> way
>>>>>>>>> other than by scanning the page tables and we might as well do that in user
>>>>>>>>> space with this script.
>>>>>>>>>
>>>>>>>>> Although, I expect you will shortly make a proposal that is simple to
>>>>>>>>> implement
>>>>>>>>> and prove me wrong ;-)
>>>>>>>>
>>>>>>>> Unlikely :) As you said, once you have multiple folio sizes, it stops really
>>>>>>>> making sense.
>>>>>>>>
>>>>>>>> Assume you have a 128 kiB pageache folio, and half of that is mapped. You
>>>>>>>> can
>>>>>>>> set cont-pte bits on that half and all is fine. Or AMD can benefit from it's
>>>>>>>> optimizations without the cont-pte bit and everything is fine.
>>>>>>>
>>>>>>> Yes, but for debug and optimization, its useful to know when THPs are
>>>>>>> fully/partially mapped, when they are unaligned etc. Anyway, the script does
>>>>>>> that for us, and I think we are tending towards agreement that there are
>>>>>>> unlikely to be any cost benefits by moving it into the kernel.
>>>>>>
>>>>>> frequent partial unmap can defeat all purpose for us to use large folios.
>>>>>> just imagine a large folio can soon be splitted after it is formed. we lose
>>>>>> the performance gain and might get regression instead.
>>>>>
>>>>> nit: just because a THP gets partially unmapped in a process doesn't mean it
>>>>> gets split into order-0 pages. If the folio still has all its pages mapped at
>>>>> least once then no further action is taken. If the page being unmapped was the
>>>>> last mapping of that page, then the THP is put on the deferred split queue, so
>>>>> that it can be split in future if needed.
>>>>>>
>>>>>> and this can be very frequent, for example, one userspace heap management
>>>>>> is releasing memory page by page.
>>>>>>
>>>>>> In our real product deployment, we might not care about the second partial
>>>>>> unmapped,  we do care about the first partial unmapped as we can use this
>>>>>> to know if split has ever happened on this large folios. an partial unmapped
>>>>>> subpage can be unlikely re-mapped back.
>>>>>>
>>>>>> so i guess 1st unmap is probably enough, at least for my product. I mean we
>>>>>> care about if partial unmap has ever happened on a large folio more than how
>>>>>> they are exactly partially unmapped :-)
>>>>>
>>>>> I'm not sure what you are suggesting here? A global boolean that tells you if
>>>>> any folio in the system has ever been partially unmapped? That will almost
>>>>> certainly always be true, even for a very well tuned system.
>>>>
>>>> not a global boolean but a per-folio boolean. in case userspace maps a region
>>>> and has no userspace management, then we are fine as it is unlikely to have
>>>> partial unmap/map things; in case userspace maps a region, but manages it
>>>> by itself, such as heap things, we might result in lots of partial map/unmap,
>>>> which can lead to 3 problems:
>>>> 1. potential memory footprint increase, for example, while userspace releases
>>>> some pages in a folio, we might still keep it as frequent splitting folio into
>>>> basepages and releasing the unmapped subpage might be too expensive.
>>>> 2. if cont-pte is involved, frequent dropping cont-pte/tlb shootdown
>>>> might happen.
>>>> 3. other maintenance overhead such as splitting large folios etc.
>>>>
>>>> We'd like to know how serious partial map things are happening. so either
>>>> we will disable mTHP in this kind of VMAs, or optimize userspace to do
>>>> some alignment according to the size of large folios.
>>>>
>>>> in android phones, we detect lots of apps, and also found some apps might
>>>> do things like
>>>> 1. mprotect on some pages within a large folio
>>>> 2. mlock on some pages within a large folio
>>>> 3. madv_free on some pages within a large folio
>>>> 4. madv_pageout on some pages within a large folio.
>>>>
>>>> it would be good if we have a per-folio boolean to know how serious userspace
>>>> is breaking the large folios. for example, if more than 50% folios in a vma has
>>>> this problem, we can find it out and take some action.
>>>
>>> The high level value of these stats seems clear - I agree we need to be able to
>>> get these insights. I think the issues are more around the implementation
>>> though. I'm struggling to understand exactly how we could implement a lot of
>>> these things cheaply (either in the kernel or in user space).
>>>
>>> Let me try to work though what I think you are suggesting:
>>>
>>>    - every THP is initially fully mapped
>>
>> Not for pagecache folios.
>>
>>>    - when an operation causes a partial unmap, mark the folio as having at least
>>>      one partial mapping
>>>    - on transition from "no partial mappings" to "at least one partial mapping"
>>>      increment a "anon-partial-<size>kB" (one for each supported folio size)
>>>      counter by the folio size
>>>    - on transition from "at least one partial mapping" to "fully unampped
>>>      everywhere" decrement the counter by the folio size
>>>
>>> I think the issue with this is that a folio that is fully mapped in a process
>>> that gets forked, then is partially unmapped in 1 process, will be accounted as
>>> partially mapped even after the process that partially unmapped it exits, even
>>> though that folio is now fully mapped in all processes that map it. Is that a
>>> problem, perhaps not? I'm not sure.
>>
>> What I can offer with my total mapcount I am working on (+ entire/pmd mapcount,
>> but let's put that aside):
> 
> Is "total mapcount" bound up as part of your "precise shared vs exclusive" work
> or is it separate? If separate, do you have any ballpark feel for how likely it
> is to land and if so, when?

You could have an expensive total mapcount via folio_mapcount() today.

The fast version is currently part of "precise shared vs exclusive", but 
with most RMAP batching in place we might want to consider adding it 
ahead of time, because the overhead of maintaining it will reduce 
drastically in the cases we care about.

My current plan is:

(1) RMAP batching when remapping a PMD-mapped THP. Upstream.
(2) Fork batching. I have that prototype you already saw, will work
     on this next.
(3) Zap batching. I also have a prototype now and will polish that
     as well.
(4) Total mapcount
(5) Shared vs. Exclusive
(6) Subpage mapcounts / PageAnonExclusive fun

I'll try getting a per-folio PageAnonExclusive bit implemented ahead of 
time. I think I know roughly what there is to do, but some corner cases 
are ugly to handle and I avoided messing with them in the past by making 
the PAE bit per-subpage. We'll see.

Now, no idea how long that all will take. I have decent prototypes at 
this point for most stuff.
diff mbox series

Patch

diff --git a/tools/mm/Makefile b/tools/mm/Makefile
index 1c5606cc3334..7bb03606b9ea 100644
--- a/tools/mm/Makefile
+++ b/tools/mm/Makefile
@@ -3,7 +3,8 @@ 
 #
 include ../scripts/Makefile.include

-TARGETS=page-types slabinfo page_owner_sort
+BUILD_TARGETS=page-types slabinfo page_owner_sort
+INSTALL_TARGETS = $(BUILD_TARGETS) thpmaps

 LIB_DIR = ../lib/api
 LIBS = $(LIB_DIR)/libapi.a
@@ -11,9 +12,9 @@  LIBS = $(LIB_DIR)/libapi.a
 CFLAGS += -Wall -Wextra -I../lib/ -pthread
 LDFLAGS += $(LIBS) -pthread

-all: $(TARGETS)
+all: $(BUILD_TARGETS)

-$(TARGETS): $(LIBS)
+$(BUILD_TARGETS): $(LIBS)

 $(LIBS):
 	make -C $(LIB_DIR)
@@ -29,4 +30,4 @@  sbindir ?= /usr/sbin

 install: all
 	install -d $(DESTDIR)$(sbindir)
-	install -m 755 -p $(TARGETS) $(DESTDIR)$(sbindir)
+	install -m 755 -p $(INSTALL_TARGETS) $(DESTDIR)$(sbindir)
diff --git a/tools/mm/thpmaps b/tools/mm/thpmaps
new file mode 100755
index 000000000000..af9b19f63eb4
--- /dev/null
+++ b/tools/mm/thpmaps
@@ -0,0 +1,573 @@ 
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0-only
+# Copyright (C) 2024 ARM Ltd.
+#
+# Utility providing smaps-like output detailing transparent hugepage usage.
+# For more info, run:
+# ./thpmaps --help
+#
+# Requires numpy:
+# pip3 install numpy
+
+
+import argparse
+import collections
+import math
+import os
+import re
+import resource
+import shutil
+import sys
+import time
+import numpy as np
+
+
+with open('/sys/kernel/mm/transparent_hugepage/hpage_pmd_size') as f:
+    PAGE_SIZE = resource.getpagesize()
+    PAGE_SHIFT = int(math.log2(PAGE_SIZE))
+    PMD_SIZE = int(f.read())
+    PMD_ORDER = int(math.log2(PMD_SIZE / PAGE_SIZE))
+
+
+def align_forward(v, a):
+    return (v + (a - 1)) & ~(a - 1)
+
+
+def align_offset(v, a):
+    return v & (a - 1)
+
+
+def nrkb(nr):
+    # Convert number of pages to KB.
+    return (nr << PAGE_SHIFT) >> 10
+
+
+def odkb(order):
+    # Convert page order to KB.
+    return nrkb(1 << order)
+
+
+def cont_ranges_all(arrs):
+    # Given a list of arrays, find the ranges for which values are monotonically
+    # incrementing in all arrays.
+    assert(len(arrs) > 0)
+    sz = len(arrs[0])
+    for arr in arrs:
+        assert(arr.shape == (sz,))
+    r = np.full(sz, 2)
+    d = np.diff(arrs[0]) == 1
+    for dd in [np.diff(arr) == 1 for arr in arrs[1:]]:
+        d &= dd
+    r[1:] -= d
+    r[:-1] -= d
+    return [np.repeat(arr, r).reshape(-1, 2) for arr in arrs]
+
+
+class ArgException(Exception):
+    pass
+
+
+class FileIOException(Exception):
+    pass
+
+
+class BinArrayFile:
+    # Base class used to read /proc/<pid>/pagemap and /proc/kpageflags into a
+    # numpy array. Use inherited class in a with clause to ensure file is
+    # closed when it goes out of scope.
+    def __init__(self, filename, element_size):
+        self.element_size = element_size
+        self.filename = filename
+        self.fd = os.open(self.filename, os.O_RDONLY)
+
+    def cleanup(self):
+        os.close(self.fd)
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.cleanup()
+
+    def _readin(self, offset, buffer):
+        length = os.preadv(self.fd, (buffer,), offset)
+        if len(buffer) != length:
+            raise FileIOException('error: {} failed to read {} bytes at {:x}'
+                            .format(self.filename, len(buffer), offset))
+
+    def _toarray(self, buf):
+        assert(self.element_size == 8)
+        return np.frombuffer(buf, dtype=np.uint64)
+
+    def getv(self, vec):
+        sz = 0
+        for region in vec:
+            sz += int(region[1] - region[0] + 1) * self.element_size
+        buf = bytearray(sz)
+        view = memoryview(buf)
+        pos = 0
+        for region in vec:
+            offset = int(region[0]) * self.element_size
+            length = int(region[1] - region[0] + 1) * self.element_size
+            self._readin(offset, view[pos:pos+length])
+            pos += length
+        return self._toarray(buf)
+
+    def get(self, index, nr=1):
+        offset = index * self.element_size
+        length = nr * self.element_size
+        buf = bytearray(length)
+        self._readin(offset, buf)
+        return self._toarray(buf)
+
+
+PM_PAGE_PRESENT = 1 << 63
+PM_PFN_MASK = (1 << 55) - 1
+
+class PageMap(BinArrayFile):
+    # Read ranges of a given pid's pagemap into a numpy array.
+    def __init__(self, pid='self'):
+        super().__init__(f'/proc/{pid}/pagemap', 8)
+
+
+KPF_ANON = 1 << 12
+KPF_COMPOUND_HEAD = 1 << 15
+KPF_COMPOUND_TAIL = 1 << 16
+
+class KPageFlags(BinArrayFile):
+    # Read ranges of /proc/kpageflags into a numpy array.
+    def __init__(self):
+        super().__init__('/proc/kpageflags', 8)
+
+
+VMA = collections.namedtuple('VMA', [
+    'name',
+    'start',
+    'end',
+    'read',
+    'write',
+    'execute',
+    'private',
+    'pgoff',
+    'major',
+    'minor',
+    'inode',
+    'stats',
+])
+
+class VMAList:
+    # A container for VMAs, parsed from /proc/<pid>/smaps. Iterate over the
+    # instance to receive VMAs.
+    head_regex = re.compile(r"^([\da-f]+)-([\da-f]+) ([r-])([w-])([x-])([ps]) ([\da-f]+) ([\da-f]+):([\da-f]+) ([\da-f]+)\s*(.*)$")
+    kb_item_regex = re.compile(r"(\w+):\s*(\d+)\s*kB")
+
+    def __init__(self, pid='self'):
+        def is_vma(line):
+            return self.head_regex.search(line) is not None
+
+        def get_vma(line):
+            m = self.head_regex.match(line)
+            if m is None:
+                return None
+            return VMA(
+                name=m.group(11),
+                start=int(m.group(1), 16),
+                end=int(m.group(2), 16),
+                read=m.group(3) == 'r',
+                write=m.group(4) == 'w',
+                execute=m.group(5) == 'x',
+                private=m.group(6) == 'p',
+                pgoff=int(m.group(7), 16),
+                major=int(m.group(8), 16),
+                minor=int(m.group(9), 16),
+                inode=int(m.group(10), 16),
+                stats={},
+            )
+
+        def get_value(line):
+            # Currently only handle the KB stats because they are summed for
+            # --summary. Core code doesn't know how to combine other stats.
+            exclude = ['KernelPageSize', 'MMUPageSize']
+            m = self.kb_item_regex.search(line)
+            if m:
+                param = m.group(1)
+                if param not in exclude:
+                    value = int(m.group(2))
+                    return param, value
+            return None, None
+
+        def parse_smaps(file):
+            vmas = []
+            i = 0
+
+            line = file.readline()
+
+            while True:
+                if not line:
+                    break
+                line = line.strip()
+
+                i += 1
+
+                vma = get_vma(line)
+                if vma is None:
+                    raise FileIOException(f'error: could not parse line {i}: "{line}"')
+
+                while True:
+                    line = file.readline()
+                    if not line:
+                        break
+                    line = line.strip()
+                    if is_vma(line):
+                        break
+
+                    i += 1
+
+                    param, value = get_value(line)
+                    if param:
+                        vma.stats[param] = {'type': None, 'value': value}
+
+                vmas.append(vma)
+
+            return vmas
+
+        with open(f'/proc/{pid}/smaps', 'r') as file:
+            self.vmas = parse_smaps(file)
+
+    def __iter__(self):
+        yield from self.vmas
+
+
+def thp_parse(max_order, kpageflags, vfns, pfns, anons, heads):
+    # Given 4 same-sized arrays representing a range within a page table backed
+    # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons:
+    # True if page is anonymous, heads: True if page is head of a THP), return a
+    # dictionary of statistics describing the mapped THPs.
+    stats = {
+        'file': {
+            'partial': 0,
+            'aligned': [0] * (max_order + 1),
+            'unaligned': [0] * (max_order + 1),
+        },
+        'anon': {
+            'partial': 0,
+            'aligned': [0] * (max_order + 1),
+            'unaligned': [0] * (max_order + 1),
+        },
+    }
+
+    indexes = np.arange(len(vfns), dtype=np.uint64)
+    ranges = cont_ranges_all([indexes, vfns, pfns])
+    for rindex, rpfn in zip(ranges[0], ranges[2]):
+        index_next = int(rindex[0])
+        index_end = int(rindex[1]) + 1
+        pfn_end = int(rpfn[1]) + 1
+
+        folios = indexes[index_next:index_end][heads[index_next:index_end]]
+
+        # Account pages for any partially mapped THP at the front. In that case,
+        # the first page of the range is a tail.
+        nr = (int(folios[0]) if len(folios) else index_end) - index_next
+        stats['anon' if anons[index_next] else 'file']['partial'] += nr
+
+        # Account pages for any partially mapped THP at the back. In that case,
+        # the next page after the range is a tail.
+        if len(folios):
+            flags = int(kpageflags.get(pfn_end)[0])
+            if flags & KPF_COMPOUND_TAIL:
+                nr = index_end - int(folios[-1])
+                folios = folios[:-1]
+                index_end -= nr
+                stats['anon' if anons[index_end - 1] else 'file']['partial'] += nr
+
+        # Account fully mapped THPs in the middle of the range.
+        if len(folios):
+            folio_nrs = np.append(np.diff(folios), np.uint64(index_end - folios[-1]))
+            folio_orders = np.log2(folio_nrs).astype(np.uint64)
+            for index, order in zip(folios, folio_orders):
+                index = int(index)
+                order = int(order)
+                nr = 1 << order
+                vfn = int(vfns[index])
+                align = 'aligned' if align_forward(vfn, nr) == vfn else 'unaligned'
+                anon = 'anon' if anons[index] else 'file'
+                stats[anon][align][order] += nr
+
+    rstats = {}
+
+    def flatten_sub(type, subtype, stats):
+        for od, nr in enumerate(stats[2:], 2):
+            rstats[f"{type}-thp-{subtype}-{odkb(od)}kB"] = {'type': type, 'value': nrkb(nr)}
+
+    def flatten_type(type, stats):
+        flatten_sub(type, 'aligned', stats['aligned'])
+        flatten_sub(type, 'unaligned', stats['unaligned'])
+        rstats[f"{type}-thp-partial"] = {'type': type, 'value': nrkb(stats['partial'])}
+
+    flatten_type('anon', stats['anon'])
+    flatten_type('file', stats['file'])
+
+    return rstats
+
+
+def cont_parse(order, vfns, pfns, anons, heads):
+    # Given 4 same-sized arrays representing a range within a page table backed
+    # by THPs (vfns: virtual frame numbers, pfns: physical frame numbers, anons:
+    # True if page is anonymous, heads: True if page is head of a THP), return a
+    # dictionary of statistics describing the contiguous blocks.
+    nr_cont = 1 << order
+    nr_anon = 0
+    nr_file = 0
+
+    ranges = cont_ranges_all([np.arange(len(vfns), dtype=np.uint64), vfns, pfns])
+    for rindex, rvfn, rpfn in zip(*ranges):
+        index_next = int(rindex[0])
+        index_end = int(rindex[1]) + 1
+        vfn_start = int(rvfn[0])
+        pfn_start = int(rpfn[0])
+
+        if align_offset(pfn_start, nr_cont) != align_offset(vfn_start, nr_cont):
+            continue
+
+        off = align_forward(vfn_start, nr_cont) - vfn_start
+        index_next += off
+
+        while index_next + nr_cont <= index_end:
+            folio_boundary = heads[index_next+1:index_next+nr_cont].any()
+            if not folio_boundary:
+                if anons[index_next]:
+                    nr_anon += nr_cont
+                else:
+                    nr_file += nr_cont
+            index_next += nr_cont
+
+    return {
+        f"anon-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'anon', 'value': nrkb(nr_anon)},
+        f"file-cont-aligned-{nrkb(nr_cont)}kB": {'type': 'file', 'value': nrkb(nr_file)},
+    }
+
+
+def vma_print(vma, pid):
+    # Prints a VMA instance in a format similar to smaps. The main difference is
+    # that the pid is included as the first value.
+    print("{:08x} {:016x}-{:016x} {}{}{}{} {:08x} {:02x}:{:02x} {:08x} {}"
+        .format(
+            pid, vma.start, vma.end,
+            'r' if vma.read else '-', 'w' if vma.write else '-',
+            'x' if vma.execute else '-', 'p' if vma.private else 's',
+            vma.pgoff, vma.major, vma.minor, vma.inode, vma.name
+        ))
+
+
+def stats_print(stats, tot_anon, tot_file, inc_empty):
+    # Print a statistics dictionary.
+    label_field = 32
+    for label, stat in stats.items():
+        type = stat['type']
+        value = stat['value']
+        if value or inc_empty:
+            pad = max(0, label_field - len(label) - 1)
+            if type == 'anon':
+                percent = f' ({value / tot_anon:3.0%})'
+            elif type == 'file':
+                percent = f' ({value / tot_file:3.0%})'
+            else:
+                percent = ''
+            print(f"{label}:{' ' * pad}{value:8} kB{percent}")
+
+
+def vma_parse(vma, pagemap, kpageflags, contorders):
+    # Generate thp and cont statistics for a single VMA.
+    start = vma.start >> PAGE_SHIFT
+    end = vma.end >> PAGE_SHIFT
+
+    pmes = pagemap.get(start, end - start)
+    present = pmes & PM_PAGE_PRESENT != 0
+    pfns = pmes & PM_PFN_MASK
+    pfns = pfns[present]
+    vfns = np.arange(start, end, dtype=np.uint64)
+    vfns = vfns[present]
+
+    flags = kpageflags.getv(cont_ranges_all([pfns])[0])
+    anons = flags & KPF_ANON != 0
+    heads = flags & KPF_COMPOUND_HEAD != 0
+    tails = flags & KPF_COMPOUND_TAIL != 0
+    thps = heads | tails
+
+    tot_anon = np.count_nonzero(anons)
+    tot_file = np.size(anons) - tot_anon
+    tot_anon = nrkb(tot_anon)
+    tot_file = nrkb(tot_file)
+
+    vfns = vfns[thps]
+    pfns = pfns[thps]
+    anons = anons[thps]
+    heads = heads[thps]
+
+    thpstats = thp_parse(PMD_ORDER, kpageflags, vfns, pfns, anons, heads)
+    contstats = [cont_parse(order, vfns, pfns, anons, heads) for order in contorders]
+
+    return {
+        **thpstats,
+        **{k: v for s in contstats for k, v in s.items()}
+    }, tot_anon, tot_file
+
+
+def do_main(args):
+    pids = set()
+    summary = {}
+    summary_anon = 0
+    summary_file = 0
+
+    if args.cgroup:
+        with open(f'{args.cgroup}/cgroup.procs') as pidfile:
+            for line in pidfile.readlines():
+                pids.add(int(line.strip()))
+    else:
+        pids.add(args.pid)
+
+    for pid in pids:
+        try:
+            with PageMap(pid) as pagemap:
+                with KPageFlags() as kpageflags:
+                    for vma in VMAList(pid):
+                        if (vma.read or vma.write or vma.execute) and vma.stats['Rss']['value'] > 0:
+                            stats, vma_anon, vma_file = vma_parse(vma, pagemap, kpageflags, args.cont)
+                        else:
+                            stats = {}
+                            vma_anon = 0
+                            vma_file = 0
+                        if args.inc_smaps:
+                            stats = {**vma.stats, **stats}
+                        if args.summary:
+                            for k, v in stats.items():
+                                if k in summary:
+                                    assert(summary[k]['type'] == v['type'])
+                                    summary[k]['value'] += v['value']
+                                else:
+                                    summary[k] = v
+                            summary_anon += vma_anon
+                            summary_file += vma_file
+                        else:
+                            vma_print(vma, pid)
+                            stats_print(stats, vma_anon, vma_file, args.inc_empty)
+        except FileNotFoundError:
+            if not args.cgroup:
+                raise
+        except ProcessLookupError:
+            if not args.cgroup:
+                raise
+
+    if args.summary:
+        stats_print(summary, summary_anon, summary_file, args.inc_empty)
+
+
+def main():
+    def formatter(prog):
+        width = shutil.get_terminal_size().columns
+        width -= 2
+        width = min(80, width)
+        return argparse.HelpFormatter(prog, width=width)
+
+    def size2order(human):
+        units = {"K": 2**10, "M": 2**20, "G": 2**30}
+        unit = 1
+        if human[-1] in units:
+            unit = units[human[-1]]
+            human = human[:-1]
+        try:
+            size = int(human)
+        except ValueError:
+            raise ArgException('error: --cont value must be integer size with optional KMG unit')
+        size *= unit
+        order = int(math.log2(size / PAGE_SIZE))
+        if order < 1:
+            raise ArgException('error: --cont value must be size of at least 2 pages')
+        if (1 << order) * PAGE_SIZE != size:
+            raise ArgException('error: --cont value must be size of power-of-2 pages')
+        return order
+
+    parser = argparse.ArgumentParser(formatter_class=formatter,
+        description="""Prints information about how transparent huge pages are
+                    mapped to a specified process or cgroup.
+
+                    Shows statistics for fully-mapped THPs of every size, mapped
+                    both naturally aligned and unaligned for both file and
+                    anonymous memory. See
+                    [anon|file]-thp-[aligned|unaligned]-<size>kB keys.
+
+                    Shows statistics for mapped pages that belong to a THP but
+                    which are not fully mapped. See [anon|file]-thp-partial
+                    keys.
+
+                    Optionally shows statistics for naturally aligned,
+                    contiguous blocks of memory of a specified size (when --cont
+                    is provided). See [anon|file]-cont-aligned-<size>kB keys.
+
+                    Statistics are shown in kB and as a percentage of either
+                    total anon or file memory as appropriate.""",
+        epilog="""Requires root privilege to access pagemap and kpageflags.""")
+
+    parser.add_argument('--pid',
+        metavar='pid', required=False, type=int,
+        help="""Process id of the target process. Exactly one of --pid and
+            --cgroup must be provided.""")
+
+    parser.add_argument('--cgroup',
+        metavar='path', required=False,
+        help="""Path to the target cgroup in sysfs. Iterates over every pid in
+            the cgroup. Exactly one of --pid and --cgroup must be provided.""")
+
+    parser.add_argument('--summary',
+        required=False, default=False, action='store_true',
+        help="""Sum the per-vma statistics to provide a summary over the whole
+            process or cgroup.""")
+
+    parser.add_argument('--cont',
+        metavar='size[KMG]', required=False, default=[], action='append',
+        help="""Adds anon and file stats for naturally aligned, contiguously
+            mapped blocks of the specified size. May be issued multiple times to
+            track multiple sized blocks. Useful to infer e.g. arm64 contpte and
+            hpa mappings. Size must be a power-of-2 number of pages.""")
+
+    parser.add_argument('--inc-smaps',
+        required=False, default=False, action='store_true',
+        help="""Include all numerical, additive /proc/<pid>/smaps stats in the
+            output.""")
+
+    parser.add_argument('--inc-empty',
+        required=False, default=False, action='store_true',
+        help="""Show all statistics including those whose value is 0.""")
+
+    parser.add_argument('--periodic',
+        metavar='sleep_ms', required=False, type=int,
+        help="""Run in a loop, polling every sleep_ms milliseconds.""")
+
+    args = parser.parse_args()
+
+    try:
+        if (args.pid and args.cgroup) or \
+        (not args.pid and not args.cgroup):
+            raise ArgException("error: Exactly one of --pid and --cgroup must be provided.")
+
+        args.cont = [size2order(cont) for cont in args.cont]
+    except ArgException as e:
+        parser.print_usage()
+        raise
+
+    if args.periodic:
+        while True:
+            do_main(args)
+            print()
+            time.sleep(args.periodic / 1000)
+    else:
+        do_main(args)
+
+
+if __name__ == "__main__":
+    try:
+        main()
+    except Exception as e:
+        prog = os.path.basename(sys.argv[0])
+        print(f'{prog}: {e}')
+        exit(1)
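
For reference, the range-finding helper at the heart of the script,
cont_ranges_all(), can be exercised standalone. Given a set of parallel arrays,
it returns, for each array, the [start, end] bounds of every run in which all
of the arrays increment by exactly 1 together. A minimal example (needs only
numpy; the helper body is copied from the patch, minus its asserts):

import numpy as np

def cont_ranges_all(arrs):
    # For each array, return [start, end] pairs of the runs where every
    # array increments by exactly 1 at the same positions.
    sz = len(arrs[0])
    r = np.full(sz, 2)
    d = np.diff(arrs[0]) == 1
    for dd in [np.diff(arr) == 1 for arr in arrs[1:]]:
        d &= dd
    r[1:] -= d
    r[:-1] -= d
    return [np.repeat(arr, r).reshape(-1, 2) for arr in arrs]

vfns = np.array([100, 101, 102, 200, 201], dtype=np.uint64)
pfns = np.array([5, 6, 7, 50, 51], dtype=np.uint64)
idx = np.arange(len(vfns), dtype=np.uint64)

for rng in cont_ranges_all([idx, vfns, pfns]):
    print(rng.tolist())
# [[0, 2], [3, 4]]           <- indexes
# [[100, 102], [200, 201]]   <- vfns
# [[5, 7], [50, 51]]         <- pfns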