[v5,1/1] mm: report per-page metadata information

Message ID	20231101230816.1459373-2-souravpanda@google.com (mailing list archive)
State	New, archived
Headers	Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C1C081BDEF for <linux-fsdevel@vger.kernel.org>; Wed, 1 Nov 2023 23:08:23 +0000 (UTC)
	Date: Wed, 1 Nov 2023 16:08:16 -0700
	In-Reply-To: <20231101230816.1459373-1-souravpanda@google.com>
	Precedence: bulk
	Mime-Version: 1.0
	References: <20231101230816.1459373-1-souravpanda@google.com>
	Message-ID: <20231101230816.1459373-2-souravpanda@google.com>
	Subject: [PATCH v5 1/1] mm: report per-page metadata information
	From: Sourav Panda <souravpanda@google.com>
	To: corbet@lwn.net, gregkh@linuxfoundation.org, rafael@kernel.org, akpm@linux-foundation.org, mike.kravetz@oracle.com, muchun.song@linux.dev, rppt@kernel.org, david@redhat.com, rdunlap@infradead.org, chenlinxuan@uniontech.com, yang.yang29@zte.com.cn, souravpanda@google.com, tomas.mudrunka@gmail.com, bhelgaas@google.com, ivan@cloudflare.com, pasha.tatashin@soleen.com, yosryahmed@google.com, hannes@cmpxchg.org, shakeelb@google.com, kirill.shutemov@linux.intel.com, wangkefeng.wang@huawei.com, adobriyan@gmail.com, vbabka@suse.cz, Liam.Howlett@Oracle.com, surenb@google.com, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, willy@infradead.org, weixugc@google.com
	Content-Type: text/plain; charset="UTF-8"

Sourav Panda Nov. 1, 2023, 11:08 p.m. UTC

Add a new per-node PageMetadata field to
/sys/devices/system/node/nodeN/meminfo
and a global PageMetadata field to /proc/meminfo. These fields let
users see how much memory is used by per-page metadata, which can vary
depending on build configuration, machine architecture, and system use.
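
As an illustration of how a user-space tool might consume the new
field, here is a minimal sketch that looks up a key in
meminfo-formatted text. The function name meminfo_lookup_kb() is
hypothetical and not part of the patch; the sketch assumes the
"Key:   value kB" layout shown in the proc.rst hunk and matches keys at
the start of a line only, so the per-node file (whose lines carry a
"Node N " prefix) would need extra handling.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find "key" (for example "PageMetadata:") at the
 * start of a line in meminfo-formatted text and store its value, in kB,
 * in *kb. Returns 0 on success and -1 if the key is not present.
 */
int meminfo_lookup_kb(const char *text, const char *key, unsigned long *kb)
{
    size_t keylen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* A match needs the key at line start followed by a number. */
        if (strncmp(line, key, keylen) == 0 &&
            sscanf(line + keylen, "%lu", kb) == 1)
            return 0;

        /* Advance to the character after the next newline, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

A caller would read /proc/meminfo into a buffer and pass it as "text";
on kernels without this patch the lookup for "PageMetadata:" simply
returns -1.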

Per-page metadata is the memory that Linux needs in order to manage
memory at page granularity. Most of it is used by the "struct page"
and "page_ext" data structures. In contrast to most other memory
consumption statistics, per-page metadata might not be included in
MemTotal: MemTotal includes buddy allocations but not memblock
allocations, whereas per-page metadata includes both.

This memory depends on the build configuration, the machine
architecture, and the way the system is used:

- Build configuration may add extra fields to "struct page", and
  enable or disable "page_ext".
- Machine architecture defines the base page size, for example 4K on
  x86, 8K on SPARC, and (optionally) 64K on ARM64. The per-page
  metadata overhead is smaller on machines with larger page sizes.
- System use can change the per-page overhead through vmemmap
  optimizations with hugetlb pages and emulated pmem devdax pages.
  Boot parameters can also determine whether page_ext needs to be
  allocated.

This memory can be inside or outside MemTotal, depending on whether
the memory was hot-plugged, present at boot, or is hugetlb memory that
was returned to the system.
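
The effect of the base page size on the overhead can be worked through
with a small user-space sketch. The function name metadata_pages() and
the 64-byte entry size used in the note below are assumptions for
illustration only; the real size of "struct page" depends on the build
configuration, and page_ext would add to it.

```c
#include <stdint.h>

/*
 * Number of base pages needed to hold one metadata entry of
 * entry_bytes for every base page in mem_bytes of memory.
 * Hypothetical helper, not kernel code.
 */
uint64_t metadata_pages(uint64_t mem_bytes, uint64_t page_size,
                        uint64_t entry_bytes)
{
    uint64_t nr_pages = mem_bytes / page_size;
    uint64_t bytes = nr_pages * entry_bytes;

    /* Round up to whole pages, in the spirit of DIV_ROUND_UP(). */
    return (bytes + page_size - 1) / page_size;
}
```

With an assumed 64-byte entry, 1 GiB of memory needs 4096 pages
(16 MiB, about 1.56%) of metadata with 4K pages, 1024 pages (8 MiB)
with 8K pages, and 16 pages (1 MiB) with 64K pages.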

Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Sourav Panda <souravpanda@google.com>
---
 Documentation/filesystems/proc.rst |  3 +++
 drivers/base/node.c                |  2 ++
 fs/proc/meminfo.c                  |  7 +++++++
 include/linux/mmzone.h             |  3 +++
 include/linux/vmstat.h             |  4 ++++
 mm/hugetlb.c                       | 11 ++++++++--
 mm/hugetlb_vmemmap.c               | 12 +++++++++--
 mm/mm_init.c                       |  3 +++
 mm/page_alloc.c                    |  1 +
 mm/page_ext.c                      | 32 +++++++++++++++++++++---------
 mm/sparse-vmemmap.c                |  3 +++
 mm/sparse.c                        |  7 ++++++-
 mm/vmstat.c                        | 24 ++++++++++++++++++++++
 13 files changed, 98 insertions(+), 14 deletions(-)

Wei Xu Nov. 1, 2023, 11:40 p.m. UTC | #1

On Wed, Nov 1, 2023 at 4:08 PM Sourav Panda <souravpanda@google.com> wrote:
>
> Adds a new per-node PageMetadata field to
> /sys/devices/system/node/nodeN/meminfo
> and a global PageMetadata field to /proc/meminfo. This information can
> be used by users to see how much memory is being used by per-page
> metadata, which can vary depending on build configuration, machine
> architecture, and system use.
>
> Per-page metadata is the amount of memory that Linux needs in order to
> manage memory at the page granularity. The majority of such memory is
> used by "struct page" and "page_ext" data structures. In contrast to
> most other memory consumption statistics, per-page metadata might not
> be included in MemTotal. For example, MemTotal does not include memblock
> allocations but includes buddy allocations. While on the other hand,
> per-page metadata would include both memblock and buddy allocations.

I expect that the new PageMetadata field in meminfo should help break
down the memory usage of a system (MemUsed, or MemTotal - MemFree),
similar to the other fields in meminfo.

However, PageMetadata includes per-page metadata allocated not only
from the buddy allocator but also from memblock, and MemTotal doesn't
include memory reserved by memblock allocations, so I wonder how a
user can actually use this new PageMetadata field to break down the
system memory usage.  BTW, it is not robust to assume that all
memblock allocations are for per-page metadata.

Here are some ideas to address this problem:

- Only report the buddy allocations for per-page metadata in PageMetadata, or
- Report per-page metadata in two separate fields in meminfo, one for
buddy allocations and another for memblock allocations, or
- Change MemTotal/MemUsed to include the memblock reserved memory as well.
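
To make the second option concrete, here is a rough user-space sketch
of what a split report could look like. The struct and function names
are hypothetical and do not appear in the patch. The sketch assumes,
as the commit message states, that buddy allocations are counted in
MemTotal while memblock allocations are not.

```c
/* Hypothetical split of per-page metadata by the allocator it came from. */
struct page_metadata_split {
    unsigned long buddy_kb;     /* from the buddy allocator, inside MemTotal */
    unsigned long memblock_kb;  /* from memblock at boot, outside MemTotal */
};

/* The value a single combined PageMetadata field would report. */
unsigned long page_metadata_total_kb(const struct page_metadata_split *s)
{
    return s->buddy_kb + s->memblock_kb;
}

/* The part that can be attributed when breaking down MemTotal - MemFree. */
unsigned long page_metadata_in_memtotal_kb(const struct page_metadata_split *s)
{
    return s->buddy_kb;
}
```

Under these assumptions only buddy_kb takes part in a breakdown of
MemTotal - MemFree, which is the mismatch described above.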

Wei Xu

> This memory depends on build configurations, machine architectures, and
> the way system is used:
>
> Build configuration may include extra fields into "struct page",
> and enable / disable "page_ext"
> Machine architecture defines base page sizes. For example 4K x86,
> 8K SPARC, 64K ARM64 (optionally), etc. The per-page metadata
> overhead is smaller on machines with larger page sizes.
> System use can change per-page overhead by using vmemmap
> optimizations with hugetlb pages, and emulated pmem devdax pages.
> Also, boot parameters can determine whether page_ext is needed
> to be allocated. This memory can be part of MemTotal or be outside
> MemTotal depending on whether the memory was hot-plugged, booted with,
> or hugetlb memory was returned back to the system.
>
> Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Sourav Panda <souravpanda@google.com>
> ---
>  Documentation/filesystems/proc.rst |  3 +++
>  drivers/base/node.c                |  2 ++
>  fs/proc/meminfo.c                  |  7 +++++++
>  include/linux/mmzone.h             |  3 +++
>  include/linux/vmstat.h             |  4 ++++
>  mm/hugetlb.c                       | 11 ++++++++--
>  mm/hugetlb_vmemmap.c               | 12 +++++++++--
>  mm/mm_init.c                       |  3 +++
>  mm/page_alloc.c                    |  1 +
>  mm/page_ext.c                      | 32 +++++++++++++++++++++---------
>  mm/sparse-vmemmap.c                |  3 +++
>  mm/sparse.c                        |  7 ++++++-
>  mm/vmstat.c                        | 24 ++++++++++++++++++++++
>  13 files changed, 98 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 2b59cff8be17..c121f2ef9432 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -987,6 +987,7 @@ Example output. You may not have all of these fields.
>      AnonPages:       4654780 kB
>      Mapped:           266244 kB
>      Shmem:              9976 kB
> +    PageMetadata:     513419 kB
>      KReclaimable:     517708 kB
>      Slab:             660044 kB
>      SReclaimable:     517708 kB
> @@ -1089,6 +1090,8 @@ Mapped
>                files which have been mmapped, such as libraries
>  Shmem
>                Total memory used by shared memory (shmem) and tmpfs
> +PageMetadata
> +              Memory used for per-page metadata
>  KReclaimable
>                Kernel allocations that the kernel will attempt to reclaim
>                under memory pressure. Includes SReclaimable (below), and other
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 493d533f8375..da728542265f 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -428,6 +428,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>                              "Node %d Mapped:         %8lu kB\n"
>                              "Node %d AnonPages:      %8lu kB\n"
>                              "Node %d Shmem:          %8lu kB\n"
> +                            "Node %d PageMetadata:   %8lu kB\n"
>                              "Node %d KernelStack:    %8lu kB\n"
>  #ifdef CONFIG_SHADOW_CALL_STACK
>                              "Node %d ShadowCallStack:%8lu kB\n"
> @@ -458,6 +459,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>                              nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
>                              nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
>                              nid, K(i.sharedram),
> +                            nid, K(node_page_state(pgdat, NR_PAGE_METADATA)),
>                              nid, node_page_state(pgdat, NR_KERNEL_STACK_KB),
>  #ifdef CONFIG_SHADOW_CALL_STACK
>                              nid, node_page_state(pgdat, NR_KERNEL_SCS_KB),
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 45af9a989d40..f141bb2a550d 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -39,7 +39,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>         long available;
>         unsigned long pages[NR_LRU_LISTS];
>         unsigned long sreclaimable, sunreclaim;
> +       unsigned long nr_page_metadata;
>         int lru;
> +       int nid;
>
>         si_meminfo(&i);
>         si_swapinfo(&i);
> @@ -57,6 +59,10 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>         sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
>         sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
>
> +       nr_page_metadata = 0;
> +       for_each_online_node(nid)
> +               nr_page_metadata += node_page_state(NODE_DATA(nid), NR_PAGE_METADATA);
> +
>         show_val_kb(m, "MemTotal:       ", i.totalram);
>         show_val_kb(m, "MemFree:        ", i.freeram);
>         show_val_kb(m, "MemAvailable:   ", available);
> @@ -104,6 +110,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>         show_val_kb(m, "Mapped:         ",
>                     global_node_page_state(NR_FILE_MAPPED));
>         show_val_kb(m, "Shmem:          ", i.sharedram);
> +       show_val_kb(m, "PageMetadata:   ", nr_page_metadata);
>         show_val_kb(m, "KReclaimable:   ", sreclaimable +
>                     global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE));
>         show_val_kb(m, "Slab:           ", sreclaimable + sunreclaim);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4106fbc5b4b3..dda1ad522324 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -207,6 +207,9 @@ enum node_stat_item {
>         PGPROMOTE_SUCCESS,      /* promote successfully */
>         PGPROMOTE_CANDIDATE,    /* candidate pages to promote */
>  #endif
> +       NR_PAGE_METADATA,       /* Page metadata size (struct page and page_ext)
> +                                * in pages
> +                                */
>         NR_VM_NODE_STAT_ITEMS
>  };
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index fed855bae6d8..af096a881f03 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -656,4 +656,8 @@ static inline void lruvec_stat_sub_folio(struct folio *folio,
>  {
>         lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
>  }
> +
> +void __init mod_node_early_perpage_metadata(int nid, long delta);
> +void __init store_early_perpage_metadata(void);
> +
>  #endif /* _LINUX_VMSTAT_H */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1301ba7b2c9a..1778e02ed583 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1790,6 +1790,9 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>                 destroy_compound_gigantic_folio(folio, huge_page_order(h));
>                 free_gigantic_folio(folio, huge_page_order(h));
>         } else {
> +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> +               __node_stat_sub_folio(folio, NR_PAGE_METADATA);
> +#endif
>                 __free_pages(&folio->page, huge_page_order(h));
>         }
>  }
> @@ -2125,6 +2128,7 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
>         struct page *page;
>         bool alloc_try_hard = true;
>         bool retry = true;
> +       struct folio *folio;
>
>         /*
>          * By default we always try hard to allocate the page with
> @@ -2175,9 +2179,12 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
>                 __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
>                 return NULL;
>         }
> -
> +       folio = page_folio(page);
> +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> +       __node_stat_add_folio(folio, NR_PAGE_METADATA);
> +#endif
>         __count_vm_event(HTLB_BUDDY_PGALLOC);
> -       return page_folio(page);
> +       return folio;
>  }
>
>  /*
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 4b9734777f69..f7ca5d4dd583 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -214,6 +214,7 @@ static inline void free_vmemmap_page(struct page *page)
>                 free_bootmem_page(page);
>         else
>                 __free_page(page);
> +       __mod_node_page_state(page_pgdat(page), NR_PAGE_METADATA, -1);
>  }
>
>  /* Free a list of the vmemmap pages */
> @@ -335,6 +336,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>                 copy_page(page_to_virt(walk.reuse_page),
>                           (void *)walk.reuse_addr);
>                 list_add(&walk.reuse_page->lru, &vmemmap_pages);
> +               __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, 1);
>         }
>
>         /*
> @@ -384,14 +386,20 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
>         unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
>         int nid = page_to_nid((struct page *)start);
>         struct page *page, *next;
> +       int i;
>
> -       while (nr_pages--) {
> +       for (i = 0; i < nr_pages; i++) {
>                 page = alloc_pages_node(nid, gfp_mask, 0);
> -               if (!page)
> +               if (!page) {
> +                       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> +                                             i);
>                         goto out;
> +               }
>                 list_add_tail(&page->lru, list);
>         }
>
> +       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, nr_pages);
> +
>         return 0;
>  out:
>         list_for_each_entry_safe(page, next, list, lru)
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 50f2f34745af..6997bf00945b 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -26,6 +26,7 @@
>  #include <linux/pgtable.h>
>  #include <linux/swap.h>
>  #include <linux/cma.h>
> +#include <linux/vmstat.h>
>  #include "internal.h"
>  #include "slab.h"
>  #include "shuffle.h"
> @@ -1656,6 +1657,8 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat)
>                         panic("Failed to allocate %ld bytes for node %d memory map\n",
>                               size, pgdat->node_id);
>                 pgdat->node_mem_map = map + offset;
> +               mod_node_early_perpage_metadata(pgdat->node_id,
> +                                               DIV_ROUND_UP(size, PAGE_SIZE));
>         }
>         pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n",
>                                 __func__, pgdat->node_id, (unsigned long)pgdat,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 85741403948f..522dc0c52610 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5443,6 +5443,7 @@ void __init setup_per_cpu_pageset(void)
>         for_each_online_pgdat(pgdat)
>                 pgdat->per_cpu_nodestats =
>                         alloc_percpu(struct per_cpu_nodestat);
> +       store_early_perpage_metadata();
>  }
>
>  __meminit void zone_pcp_init(struct zone *zone)
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 4548fcc66d74..d8d6db9c3d75 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -201,6 +201,8 @@ static int __init alloc_node_page_ext(int nid)
>                 return -ENOMEM;
>         NODE_DATA(nid)->node_page_ext = base;
>         total_usage += table_size;
> +       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> +                             DIV_ROUND_UP(table_size, PAGE_SIZE));
>         return 0;
>  }
>
> @@ -255,12 +257,15 @@ static void *__meminit alloc_page_ext(size_t size, int nid)
>         void *addr = NULL;
>
>         addr = alloc_pages_exact_nid(nid, size, flags);
> -       if (addr) {
> +       if (addr)
>                 kmemleak_alloc(addr, size, 1, flags);
> -               return addr;
> -       }
> +       else
> +               addr = vzalloc_node(size, nid);
>
> -       addr = vzalloc_node(size, nid);
> +       if (addr) {
> +               mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> +                                   DIV_ROUND_UP(size, PAGE_SIZE));
> +       }
>
>         return addr;
>  }
> @@ -303,18 +308,27 @@ static int __meminit init_section_page_ext(unsigned long pfn, int nid)
>
>  static void free_page_ext(void *addr)
>  {
> +       size_t table_size;
> +       struct page *page;
> +       struct pglist_data *pgdat;
> +
> +       table_size = page_ext_size * PAGES_PER_SECTION;
> +
>         if (is_vmalloc_addr(addr)) {
> +               page = vmalloc_to_page(addr);
> +               pgdat = page_pgdat(page);
>                 vfree(addr);
>         } else {
> -               struct page *page = virt_to_page(addr);
> -               size_t table_size;
> -
> -               table_size = page_ext_size * PAGES_PER_SECTION;
> -
> +               page = virt_to_page(addr);
> +               pgdat = page_pgdat(page);
>                 BUG_ON(PageReserved(page));
>                 kmemleak_free(addr);
>                 free_pages_exact(addr, table_size);
>         }
> +
> +       __mod_node_page_state(pgdat, NR_PAGE_METADATA,
> +                             -1L * (DIV_ROUND_UP(table_size, PAGE_SIZE)));
> +
>  }
>
>  static void __free_page_ext(unsigned long pfn)
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index a2cbe44c48e1..2bc67b2c2aa2 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -469,5 +469,8 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>         if (r < 0)
>                 return NULL;
>
> +       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> +                             DIV_ROUND_UP(end - start, PAGE_SIZE));
> +
>         return pfn_to_page(pfn);
>  }
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 77d91e565045..7f67b5486cd1 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -14,7 +14,7 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  #include <linux/bootmem_info.h>
> -
> +#include <linux/vmstat.h>
>  #include "internal.h"
>  #include <asm/dma.h>
>
> @@ -465,6 +465,9 @@ static void __init sparse_buffer_init(unsigned long size, int nid)
>          */
>         sparsemap_buf = memmap_alloc(size, section_map_size(), addr, nid, true);
>         sparsemap_buf_end = sparsemap_buf + size;
> +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> +       mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(size, PAGE_SIZE));
> +#endif
>  }
>
>  static void __init sparse_buffer_fini(void)
> @@ -641,6 +644,8 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>         unsigned long end = start + nr_pages * sizeof(struct page);
>
> +       __mod_node_page_state(page_pgdat(pfn_to_page(pfn)), NR_PAGE_METADATA,
> +                             -1L * (DIV_ROUND_UP(end - start, PAGE_SIZE)));
>         vmemmap_free(start, end, altmap);
>  }
>  static void free_map_bootmem(struct page *memmap)
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 00e81e99c6ee..070d2b3d2bcc 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1245,6 +1245,7 @@ const char * const vmstat_text[] = {
>         "pgpromote_success",
>         "pgpromote_candidate",
>  #endif
> +       "nr_page_metadata",
>
>         /* enum writeback_stat_item counters */
>         "nr_dirty_threshold",
> @@ -2274,4 +2275,27 @@ static int __init extfrag_debug_init(void)
>  }
>
>  module_init(extfrag_debug_init);
> +
>  #endif
> +
> +/*
> + * Page metadata size (struct page and page_ext) in pages
> + */
> +static unsigned long early_perpage_metadata[MAX_NUMNODES] __initdata;
> +
> +void __init mod_node_early_perpage_metadata(int nid, long delta)
> +{
> +       early_perpage_metadata[nid] += delta;
> +}
> +
> +void __init store_early_perpage_metadata(void)
> +{
> +       int nid;
> +       struct pglist_data *pgdat;
> +
> +       for_each_online_pgdat(pgdat) {
> +               nid = pgdat->node_id;
> +               __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> +                                     early_perpage_metadata[nid]);
> +       }
> +}
> --
> 2.42.0.820.g83a721a137-goog
>
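
The boot-time accounting in the mm/vmstat.c hunk above can be modeled
in user space roughly as follows. Early allocations are recorded in a
plain per-node array, and the totals are folded into the per-node
counters later, from setup_per_cpu_pageset() after the per-CPU node
stats have been allocated. All names prefixed with model_ or MODEL_ are
stand-ins invented for this sketch, not kernel symbols.

```c
#define MODEL_MAX_NODES 4   /* stand-in for MAX_NUMNODES */

/* Pages recorded during early boot, before the real counters exist. */
unsigned long model_early_pages[MODEL_MAX_NODES];

/* Stand-in for each node's NR_PAGE_METADATA counter. */
long model_node_stat[MODEL_MAX_NODES];

/* Round a byte count up to whole pages, like DIV_ROUND_UP(size, PAGE_SIZE). */
unsigned long model_bytes_to_pages(unsigned long bytes, unsigned long page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Early phase: only remember the delta for the node. */
void model_mod_early(int nid, long delta)
{
    model_early_pages[nid] += delta;
}

/* Later phase: fold the early totals into the per-node counters. */
void model_store_early(void)
{
    for (int nid = 0; nid < MODEL_MAX_NODES; nid++)
        model_node_stat[nid] += (long)model_early_pages[nid];
}
```

In this model the counters stay at zero until model_store_early() runs,
which mirrors why the patch defers the update instead of calling
__mod_node_page_state() directly from the early allocation paths.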

Pasha Tatashin Nov. 2, 2023, 2:57 a.m. UTC | #2

On Wed, Nov 1, 2023 at 7:40 PM Wei Xu <weixugc@google.com> wrote:
>
> On Wed, Nov 1, 2023 at 4:08 PM Sourav Panda <souravpanda@google.com> wrote:
> >
> > Adds a new per-node PageMetadata field to
> > /sys/devices/system/node/nodeN/meminfo
> > and a global PageMetadata field to /proc/meminfo. This information can
> > be used by users to see how much memory is being used by per-page
> > metadata, which can vary depending on build configuration, machine
> > architecture, and system use.
> >
> > Per-page metadata is the amount of memory that Linux needs in order to
> > manage memory at the page granularity. The majority of such memory is
> > used by "struct page" and "page_ext" data structures. In contrast to
> > most other memory consumption statistics, per-page metadata might not
> > be included in MemTotal. For example, MemTotal does not include memblock
> > allocations but includes buddy allocations. While on the other hand,
> > per-page metadata would include both memblock and buddy allocations.
>
> I expect that the new PageMetadata field in meminfo should help break
> down the memory usage of a system (MemUsed, or MemTotal - MemFree),
> similar to the other fields in meminfo.
>
> However, given that PageMetadata includes per-page metadata allocated
> from not only the buddy allocator, but also the memblock allocations,
> and MemTotal doesn't include memory reserved by memblock allocations,
> I wonder how a user can actually use this new PageMetadata to break
> down the system memory usage.  BTW, it is not robust to assume that
> all memblock allocations are for per-page metadata.
>

Hi Wei,

> Here are some ideas to address this problem:
>
> - Only report the buddy allocations for per-page metadata in PageMetadata, or

Having PageMetadata cover only part of the per-page memory is
confusing. Right after boot it would always be 0, because all struct
pages come from memblock during boot, yet we know we have allocated
tons of memory for struct pages.

> - Report per-page metadata in two separate fields in meminfo, one for
> buddy allocations and another for memblock allocations, or

This is also going to be confusing for users. Which allocator was used
to allocate struct pages is really an implementation detail, and
having two trackers is not going to improve things.

> - Change MemTotal/MemUsed to include the memblock reserved memory as well.

I think this is the right solution for an existing bug: MemTotal
should really include memblock reserved memory.

Pasha

>
> Wei Xu
>
> > This memory depends on build configurations, machine architectures, and
> > the way system is used:
> >
> > Build configuration may include extra fields into "struct page",
> > and enable / disable "page_ext"
> > Machine architecture defines base page sizes. For example 4K x86,
> > 8K SPARC, 64K ARM64 (optionally), etc. The per-page metadata
> > overhead is smaller on machines with larger page sizes.
> > System use can change per-page overhead by using vmemmap
> > optimizations with hugetlb pages, and emulated pmem devdax pages.
> > Also, boot parameters can determine whether page_ext is needed
> > to be allocated. This memory can be part of MemTotal or be outside
> > MemTotal depending on whether the memory was hot-plugged, booted with,
> > or hugetlb memory was returned back to the system.
> >
> > Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > Signed-off-by: Sourav Panda <souravpanda@google.com>
> > ---
> >  Documentation/filesystems/proc.rst |  3 +++
> >  drivers/base/node.c                |  2 ++
> >  fs/proc/meminfo.c                  |  7 +++++++
> >  include/linux/mmzone.h             |  3 +++
> >  include/linux/vmstat.h             |  4 ++++
> >  mm/hugetlb.c                       | 11 ++++++++--
> >  mm/hugetlb_vmemmap.c               | 12 +++++++++--
> >  mm/mm_init.c                       |  3 +++
> >  mm/page_alloc.c                    |  1 +
> >  mm/page_ext.c                      | 32 +++++++++++++++++++++---------
> >  mm/sparse-vmemmap.c                |  3 +++
> >  mm/sparse.c                        |  7 ++++++-
> >  mm/vmstat.c                        | 24 ++++++++++++++++++++++
> >  13 files changed, 98 insertions(+), 14 deletions(-)
> >
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index 2b59cff8be17..c121f2ef9432 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -987,6 +987,7 @@ Example output. You may not have all of these fields.
> >      AnonPages:       4654780 kB
> >      Mapped:           266244 kB
> >      Shmem:              9976 kB
> > +    PageMetadata:     513419 kB
> >      KReclaimable:     517708 kB
> >      Slab:             660044 kB
> >      SReclaimable:     517708 kB
> > @@ -1089,6 +1090,8 @@ Mapped
> >                files which have been mmapped, such as libraries
> >  Shmem
> >                Total memory used by shared memory (shmem) and tmpfs
> > +PageMetadata
> > +              Memory used for per-page metadata
> >  KReclaimable
> >                Kernel allocations that the kernel will attempt to reclaim
> >                under memory pressure. Includes SReclaimable (below), and other
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index 493d533f8375..da728542265f 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -428,6 +428,7 @@ static ssize_t node_read_meminfo(struct device *dev,
> >                              "Node %d Mapped:         %8lu kB\n"
> >                              "Node %d AnonPages:      %8lu kB\n"
> >                              "Node %d Shmem:          %8lu kB\n"
> > +                            "Node %d PageMetadata:   %8lu kB\n"
> >                              "Node %d KernelStack:    %8lu kB\n"
> >  #ifdef CONFIG_SHADOW_CALL_STACK
> >                              "Node %d ShadowCallStack:%8lu kB\n"
> > @@ -458,6 +459,7 @@ static ssize_t node_read_meminfo(struct device *dev,
> >                              nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
> >                              nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
> >                              nid, K(i.sharedram),
> > +                            nid, K(node_page_state(pgdat, NR_PAGE_METADATA)),
> >                              nid, node_page_state(pgdat, NR_KERNEL_STACK_KB),
> >  #ifdef CONFIG_SHADOW_CALL_STACK
> >                              nid, node_page_state(pgdat, NR_KERNEL_SCS_KB),
> > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > index 45af9a989d40..f141bb2a550d 100644
> > --- a/fs/proc/meminfo.c
> > +++ b/fs/proc/meminfo.c
> > @@ -39,7 +39,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >         long available;
> >         unsigned long pages[NR_LRU_LISTS];
> >         unsigned long sreclaimable, sunreclaim;
> > +       unsigned long nr_page_metadata;
> >         int lru;
> > +       int nid;
> >
> >         si_meminfo(&i);
> >         si_swapinfo(&i);
> > @@ -57,6 +59,10 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >         sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
> >         sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
> >
> > +       nr_page_metadata = 0;
> > +       for_each_online_node(nid)
> > +               nr_page_metadata += node_page_state(NODE_DATA(nid), NR_PAGE_METADATA);
> > +
> >         show_val_kb(m, "MemTotal:       ", i.totalram);
> >         show_val_kb(m, "MemFree:        ", i.freeram);
> >         show_val_kb(m, "MemAvailable:   ", available);
> > @@ -104,6 +110,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >         show_val_kb(m, "Mapped:         ",
> >                     global_node_page_state(NR_FILE_MAPPED));
> >         show_val_kb(m, "Shmem:          ", i.sharedram);
> > +       show_val_kb(m, "PageMetadata:   ", nr_page_metadata);
> >         show_val_kb(m, "KReclaimable:   ", sreclaimable +
> >                     global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE));
> >         show_val_kb(m, "Slab:           ", sreclaimable + sunreclaim);
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 4106fbc5b4b3..dda1ad522324 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -207,6 +207,9 @@ enum node_stat_item {
> >         PGPROMOTE_SUCCESS,      /* promote successfully */
> >         PGPROMOTE_CANDIDATE,    /* candidate pages to promote */
> >  #endif
> > +       NR_PAGE_METADATA,       /* Page metadata size (struct page and page_ext)
> > +                                * in pages
> > +                                */
> >         NR_VM_NODE_STAT_ITEMS
> >  };
> >
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index fed855bae6d8..af096a881f03 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -656,4 +656,8 @@ static inline void lruvec_stat_sub_folio(struct folio *folio,
> >  {
> >         lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
> >  }
> > +
> > +void __init mod_node_early_perpage_metadata(int nid, long delta);
> > +void __init store_early_perpage_metadata(void);
> > +
> >  #endif /* _LINUX_VMSTAT_H */
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 1301ba7b2c9a..1778e02ed583 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1790,6 +1790,9 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> >                 destroy_compound_gigantic_folio(folio, huge_page_order(h));
> >                 free_gigantic_folio(folio, huge_page_order(h));
> >         } else {
> > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > +               __node_stat_sub_folio(folio, NR_PAGE_METADATA);
> > +#endif
> >                 __free_pages(&folio->page, huge_page_order(h));
> >         }
> >  }
> > @@ -2125,6 +2128,7 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
> >         struct page *page;
> >         bool alloc_try_hard = true;
> >         bool retry = true;
> > +       struct folio *folio;
> >
> >         /*
> >          * By default we always try hard to allocate the page with
> > @@ -2175,9 +2179,12 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
> >                 __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> >                 return NULL;
> >         }
> > -
> > +       folio = page_folio(page);
> > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > +       __node_stat_add_folio(folio, NR_PAGE_METADATA);
> > +#endif
> >         __count_vm_event(HTLB_BUDDY_PGALLOC);
> > -       return page_folio(page);
> > +       return folio;
> >  }
> >
> >  /*
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 4b9734777f69..f7ca5d4dd583 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -214,6 +214,7 @@ static inline void free_vmemmap_page(struct page *page)
> >                 free_bootmem_page(page);
> >         else
> >                 __free_page(page);
> > +       __mod_node_page_state(page_pgdat(page), NR_PAGE_METADATA, -1);
> >  }
> >
> >  /* Free a list of the vmemmap pages */
> > @@ -335,6 +336,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >                 copy_page(page_to_virt(walk.reuse_page),
> >                           (void *)walk.reuse_addr);
> >                 list_add(&walk.reuse_page->lru, &vmemmap_pages);
> > +               __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, 1);
> >         }
> >
> >         /*
> > @@ -384,14 +386,20 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> >         unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
> >         int nid = page_to_nid((struct page *)start);
> >         struct page *page, *next;
> > +       int i;
> >
> > -       while (nr_pages--) {
> > +       for (i = 0; i < nr_pages; i++) {
> >                 page = alloc_pages_node(nid, gfp_mask, 0);
> > -               if (!page)
> > +               if (!page) {
> > +                       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > +                                             i);
> >                         goto out;
> > +               }
> >                 list_add_tail(&page->lru, list);
> >         }
> >
> > +       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, nr_pages);
> > +
> >         return 0;
> >  out:
> >         list_for_each_entry_safe(page, next, list, lru)
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index 50f2f34745af..6997bf00945b 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -26,6 +26,7 @@
> >  #include <linux/pgtable.h>
> >  #include <linux/swap.h>
> >  #include <linux/cma.h>
> > +#include <linux/vmstat.h>
> >  #include "internal.h"
> >  #include "slab.h"
> >  #include "shuffle.h"
> > @@ -1656,6 +1657,8 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat)
> >                         panic("Failed to allocate %ld bytes for node %d memory map\n",
> >                               size, pgdat->node_id);
> >                 pgdat->node_mem_map = map + offset;
> > +               mod_node_early_perpage_metadata(pgdat->node_id,
> > +                                               DIV_ROUND_UP(size, PAGE_SIZE));
> >         }
> >         pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n",
> >                                 __func__, pgdat->node_id, (unsigned long)pgdat,
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 85741403948f..522dc0c52610 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5443,6 +5443,7 @@ void __init setup_per_cpu_pageset(void)
> >         for_each_online_pgdat(pgdat)
> >                 pgdat->per_cpu_nodestats =
> >                         alloc_percpu(struct per_cpu_nodestat);
> > +       store_early_perpage_metadata();
> >  }
> >
> >  __meminit void zone_pcp_init(struct zone *zone)
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index 4548fcc66d74..d8d6db9c3d75 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -201,6 +201,8 @@ static int __init alloc_node_page_ext(int nid)
> >                 return -ENOMEM;
> >         NODE_DATA(nid)->node_page_ext = base;
> >         total_usage += table_size;
> > +       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > +                             DIV_ROUND_UP(table_size, PAGE_SIZE));
> >         return 0;
> >  }
> >
> > @@ -255,12 +257,15 @@ static void *__meminit alloc_page_ext(size_t size, int nid)
> >         void *addr = NULL;
> >
> >         addr = alloc_pages_exact_nid(nid, size, flags);
> > -       if (addr) {
> > +       if (addr)
> >                 kmemleak_alloc(addr, size, 1, flags);
> > -               return addr;
> > -       }
> > +       else
> > +               addr = vzalloc_node(size, nid);
> >
> > -       addr = vzalloc_node(size, nid);
> > +       if (addr) {
> > +               mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > +                                   DIV_ROUND_UP(size, PAGE_SIZE));
> > +       }
> >
> >         return addr;
> >  }
> > @@ -303,18 +308,27 @@ static int __meminit init_section_page_ext(unsigned long pfn, int nid)
> >
> >  static void free_page_ext(void *addr)
> >  {
> > +       size_t table_size;
> > +       struct page *page;
> > +       struct pglist_data *pgdat;
> > +
> > +       table_size = page_ext_size * PAGES_PER_SECTION;
> > +
> >         if (is_vmalloc_addr(addr)) {
> > +               page = vmalloc_to_page(addr);
> > +               pgdat = page_pgdat(page);
> >                 vfree(addr);
> >         } else {
> > -               struct page *page = virt_to_page(addr);
> > -               size_t table_size;
> > -
> > -               table_size = page_ext_size * PAGES_PER_SECTION;
> > -
> > +               page = virt_to_page(addr);
> > +               pgdat = page_pgdat(page);
> >                 BUG_ON(PageReserved(page));
> >                 kmemleak_free(addr);
> >                 free_pages_exact(addr, table_size);
> >         }
> > +
> > +       __mod_node_page_state(pgdat, NR_PAGE_METADATA,
> > +                             -1L * (DIV_ROUND_UP(table_size, PAGE_SIZE)));
> > +
> >  }
> >
> >  static void __free_page_ext(unsigned long pfn)
> > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > index a2cbe44c48e1..2bc67b2c2aa2 100644
> > --- a/mm/sparse-vmemmap.c
> > +++ b/mm/sparse-vmemmap.c
> > @@ -469,5 +469,8 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
> >         if (r < 0)
> >                 return NULL;
> >
> > +       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > +                             DIV_ROUND_UP(end - start, PAGE_SIZE));
> > +
> >         return pfn_to_page(pfn);
> >  }
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index 77d91e565045..7f67b5486cd1 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -14,7 +14,7 @@
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> >  #include <linux/bootmem_info.h>
> > -
> > +#include <linux/vmstat.h>
> >  #include "internal.h"
> >  #include <asm/dma.h>
> >
> > @@ -465,6 +465,9 @@ static void __init sparse_buffer_init(unsigned long size, int nid)
> >          */
> >         sparsemap_buf = memmap_alloc(size, section_map_size(), addr, nid, true);
> >         sparsemap_buf_end = sparsemap_buf + size;
> > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > +       mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(size, PAGE_SIZE));
> > +#endif
> >  }
> >
> >  static void __init sparse_buffer_fini(void)
> > @@ -641,6 +644,8 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
> >         unsigned long start = (unsigned long) pfn_to_page(pfn);
> >         unsigned long end = start + nr_pages * sizeof(struct page);
> >
> > +       __mod_node_page_state(page_pgdat(pfn_to_page(pfn)), NR_PAGE_METADATA,
> > +                             -1L * (DIV_ROUND_UP(end - start, PAGE_SIZE)));
> >         vmemmap_free(start, end, altmap);
> >  }
> >  static void free_map_bootmem(struct page *memmap)
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 00e81e99c6ee..070d2b3d2bcc 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1245,6 +1245,7 @@ const char * const vmstat_text[] = {
> >         "pgpromote_success",
> >         "pgpromote_candidate",
> >  #endif
> > +       "nr_page_metadata",
> >
> >         /* enum writeback_stat_item counters */
> >         "nr_dirty_threshold",
> > @@ -2274,4 +2275,27 @@ static int __init extfrag_debug_init(void)
> >  }
> >
> >  module_init(extfrag_debug_init);
> > +
> >  #endif
> > +
> > +/*
> > + * Page metadata size (struct page and page_ext) in pages
> > + */
> > +static unsigned long early_perpage_metadata[MAX_NUMNODES] __initdata;
> > +
> > +void __init mod_node_early_perpage_metadata(int nid, long delta)
> > +{
> > +       early_perpage_metadata[nid] += delta;
> > +}
> > +
> > +void __init store_early_perpage_metadata(void)
> > +{
> > +       int nid;
> > +       struct pglist_data *pgdat;
> > +
> > +       for_each_online_pgdat(pgdat) {
> > +               nid = pgdat->node_id;
> > +               __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > +                                     early_perpage_metadata[nid]);
> > +       }
> > +}
> > --
> > 2.42.0.820.g83a721a137-goog
> >
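
The boot-time path in the mm/vmstat.c hunk above stages page counts in a plain per-node array and folds them into the node counters later, once the per-cpu node stats have been allocated. A minimal userspace sketch of that pattern follows. MAX_NODES, the 4K PAGE_SIZE, node_stat[], early_account() and flush_early() are illustrative stand-ins, not kernel symbols, and unlike the patch this version zeroes the staging array after the flush.

```c
#include <assert.h>

#define MAX_NODES 4              /* stand-in for MAX_NUMNODES */
#define PAGE_SIZE 4096UL         /* assumed 4K base pages */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Boot-time staging area, filled before the real counters are usable. */
static unsigned long early_pages[MAX_NODES];
/* Stand-in for the per-node NR_PAGE_METADATA counter (in pages). */
static long node_stat[MAX_NODES];

/* Record an early allocation of `bytes` on node `nid`, rounded up to pages. */
static void early_account(int nid, unsigned long bytes)
{
	assert(nid >= 0 && nid < MAX_NODES);
	early_pages[nid] += DIV_ROUND_UP(bytes, PAGE_SIZE);
}

/* Fold the staged totals into the node counters and clear the staging area. */
static void flush_early(void)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		node_stat[nid] += (long)early_pages[nid];
		early_pages[nid] = 0;
	}
}
```

Nothing is visible in node_stat[] until flush_early() runs, which mirrors why the patch calls store_early_perpage_metadata() from setup_per_cpu_pageset().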

Greg KH Nov. 2, 2023, 5:42 a.m. UTC | #3

On Wed, Nov 01, 2023 at 04:08:16PM -0700, Sourav Panda wrote:
> Adds a new per-node PageMetadata field to
> /sys/devices/system/node/nodeN/meminfo

No, this file is already an abuse of sysfs and we need to get rid of it
(it has multiple values in one file.)  Please do not add to the
nightmare by adding new values.

Also, even if you did want to do this, you didn't document it properly
in Documentation/ABI/ :(

thanks,

greg k-h

Alexey Dobriyan Nov. 2, 2023, 10:19 a.m. UTC | #4

On Wed, Nov 01, 2023 at 04:08:16PM -0700, Sourav Panda wrote:
> +void __init mod_node_early_perpage_metadata(int nid, long delta);
> +void __init store_early_perpage_metadata(void);

Section markers are useless with prototypes.
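
A rough illustration of the convention being asked for: the header carries a plain prototype, and the section marker appears only on the definition. This is a userspace sketch, so __init is defined as empty here (in the kernel it places the function in an init section), and early_counter_add() / early_counter_read() are hypothetical names, not the functions from the patch.

```c
/* Empty in this sketch so it builds in userspace; a section marker in the kernel. */
#define __init

/* What the header would carry: prototypes only, no section marker. */
void early_counter_add(long delta);
long early_counter_read(void);

static long early_counter;

/* The marker sits on the definition, where the function body is emitted. */
void __init early_counter_add(long delta)
{
	early_counter += delta;
}

long __init early_counter_read(void)
{
	return early_counter;
}
```

Callers that include only the prototypes behave the same either way, so dropping the marker from the header loses nothing.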

Pasha Tatashin Nov. 2, 2023, 2:24 p.m. UTC | #5

On Thu, Nov 2, 2023 at 1:42 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Wed, Nov 01, 2023 at 04:08:16PM -0700, Sourav Panda wrote:
> > Adds a new per-node PageMetadata field to
> > /sys/devices/system/node/nodeN/meminfo
>
> No, this file is already an abuse of sysfs and we need to get rid of it
> (it has multiple values in one file.)  Please do not add to the
> nightmare by adding new values.

Hi Greg,

Today, nodeN/meminfo is a counterpart of /proc/meminfo, they contain
almost identical fields, but show node-wide and system-wide views.

Since per-page metadata is added to /proc/meminfo, it is logical to
add it to nodeN/meminfo as well. Some nodes can have more or less
struct page data depending on the size of the node and on the way
memory is configured (for example, use of the vmemmap optimization),
so this information is useful to users.
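
As a back-of-the-envelope illustration of how the figure scales with node size and base page size, the sketch below counts the pages needed to hold one metadata entry per base page, using the same DIV_ROUND_UP rounding as the patch. The 64-byte entry size used in the usage note is an assumption (a common sizeof(struct page) on 64-bit builds, but configuration dependent), and metadata_pages() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Pages needed to store one metadata entry of `entry_size` bytes for
 * every base page of a node that spans `node_bytes` bytes.
 */
static uint64_t metadata_pages(uint64_t node_bytes, uint64_t page_size,
			       uint64_t entry_size)
{
	uint64_t nr_pages = node_bytes / page_size;

	return DIV_ROUND_UP(nr_pages * entry_size, page_size);
}
```

Under these assumptions a 16 GiB node with 4K pages needs metadata_pages(16ULL << 30, 4096, 64) = 65536 pages (256 MiB), a 64 GiB node needs four times that, and the same 16 GiB node with 64K pages needs only 256 pages (16 MiB).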

I am not aware of any example of where a system-wide field from
/proc/meminfo is represented as a separate sysfs file under node0/. If
nodeN/meminfo is ever broken down into separate files it will affect
all the fields in it the same way with or without per-page metadata.

> Also, even if you did want to do this, you didn't document it properly
> in Documentation/ABI/ :(

The documentation for the fields in nodeN/meminfo is only specified
in Documentation/filesystems/proc.rst; there is no separate sysfs
documentation for the fields in this file. We could certainly add
that.

Thank you,
Pasha

Greg KH Nov. 2, 2023, 2:28 p.m. UTC | #6

On Thu, Nov 02, 2023 at 10:24:04AM -0400, Pasha Tatashin wrote:
> On Thu, Nov 2, 2023 at 1:42 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Wed, Nov 01, 2023 at 04:08:16PM -0700, Sourav Panda wrote:
> > > Adds a new per-node PageMetadata field to
> > > /sys/devices/system/node/nodeN/meminfo
> >
> > No, this file is already an abuse of sysfs and we need to get rid of it
> > (it has multiple values in one file.)  Please do not add to the
> > nightmare by adding new values.
> 
> Hi Greg,
> 
> Today, nodeN/meminfo is a counterpart of /proc/meminfo, they contain
> almost identical fields, but show node-wide and system-wide views.

And that is wrong, and again, an abuse of sysfs, please do not continue
to add to it, that will only cause problems.

> Since per-page metadata is added into /proc/meminfo, it is logical to
> add into nodeN/meminfo, some nodes can have more or less struct page
> data based on size of the node, and also the way memory is configured,
> such as use of vmemmap optimization etc, therefore this information is
> useful to users.
> 
> I am not aware of any example of where a system-wide field from
> /proc/meminfo is represented as a separate sysfs file under node0/. If
> nodeN/meminfo is ever broken down into separate files it will affect
> all the fields in it the same way with or without per-page metadata

All of the fields should be individual files, please start adding them
if you want to add new items, I do not want to see additional abuse here
as that will cause problems (as you are seeing with the proc file.)

> > Also, even if you did want to do this, you didn't document it properly
> > in Documentation/ABI/ :(
> 
>  The documentation for the fields in nodeN/meminfo is only specified
> in  Documentation/filesystems/proc.rst, there is no separate sysfs
> Documentation for the fields in this file, we could certainly add
> that.

All sysfs files need to be documented in Documentation/ABI/ otherwise
you should get a warning when running our testing scripts.

thanks,

greg k-h

Pasha Tatashin Nov. 2, 2023, 3:11 p.m. UTC | #7

On Thu, Nov 2, 2023 at 10:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Nov 02, 2023 at 10:24:04AM -0400, Pasha Tatashin wrote:
> > On Thu, Nov 2, 2023 at 1:42 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Wed, Nov 01, 2023 at 04:08:16PM -0700, Sourav Panda wrote:
> > > > Adds a new per-node PageMetadata field to
> > > > /sys/devices/system/node/nodeN/meminfo
> > >
> > > No, this file is already an abuse of sysfs and we need to get rid of it
> > > (it has multiple values in one file.)  Please do not add to the
> > > nightmare by adding new values.
> >
> > Hi Greg,
> >
> > Today, nodeN/meminfo is a counterpart of /proc/meminfo, they contain
> > almost identical fields, but show node-wide and system-wide views.
>
> And that is wrong, and again, an abuse of sysfs, please do not continue
> to add to it, that will only cause problems.
>
> > Since per-page metadata is added into /proc/meminfo, it is logical to
> > add into nodeN/meminfo, some nodes can have more or less struct page
> > data based on size of the node, and also the way memory is configured,
> > such as use of vmemmap optimization etc, therefore this information is
> > useful to users.
> >
> > I am not aware of any example of where a system-wide field from
> > /proc/meminfo is represented as a separate sysfs file under node0/. If
> > nodeN/meminfo is ever broken down into separate files it will affect
> > all the fields in it the same way with or without per-page metadata
>
> All of the fields should be individual files, please start adding them
> if you want to add new items, I do not want to see additional abuse here

Sounds good, in our next patch version we will create a new file under
nodeN/ to contain per-page metadata overhead, and add an ABI doc file
for it.

Thanks,
Pasha
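
To make the difference concrete for a userspace consumer, here is a small sketch contrasting the two read paths: picking one field out of a multi-value meminfo buffer versus reading a file that holds a single number. Both helpers are hypothetical, and the single-value file's name and exact format are not decided in this thread, so the plain-number format is an assumption.

```c
#include <stdio.h>
#include <string.h>

/*
 * Multi-value file: locate `label` in the buffer, then parse the number
 * that follows it. The label must be distinctive enough not to match
 * inside another field name. Returns 0 on success, -1 on failure.
 */
static int read_multi_value(const char *buf, const char *label,
			    unsigned long *kb)
{
	const char *p = strstr(buf, label);

	if (!p)
		return -1;
	return sscanf(p + strlen(label), "%lu", kb) == 1 ? 0 : -1;
}

/* Single-value file: the whole content is the number. */
static int read_single_value(const char *buf, unsigned long *val)
{
	return sscanf(buf, "%lu", val) == 1 ? 0 : -1;
}
```

The single-value reader needs no knowledge of labels, units, or field order, which is the property the one-value-per-file rule is meant to protect.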

Wei Xu Nov. 2, 2023, 3:43 p.m. UTC | #8

On Wed, Nov 1, 2023 at 7:58 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Wed, Nov 1, 2023 at 7:40 PM Wei Xu <weixugc@google.com> wrote:
> >
> > On Wed, Nov 1, 2023 at 4:08 PM Sourav Panda <souravpanda@google.com> wrote:
> > >
> > > Adds a new per-node PageMetadata field to
> > > /sys/devices/system/node/nodeN/meminfo
> > > and a global PageMetadata field to /proc/meminfo. This information can
> > > be used by users to see how much memory is being used by per-page
> > > metadata, which can vary depending on build configuration, machine
> > > architecture, and system use.
> > >
> > > Per-page metadata is the amount of memory that Linux needs in order to
> > > manage memory at the page granularity. The majority of such memory is
> > > used by "struct page" and "page_ext" data structures. In contrast to
> > > most other memory consumption statistics, per-page metadata might not
> > > be included in MemTotal. For example, MemTotal does not include memblock
> > > allocations but includes buddy allocations. While on the other hand,
> > > per-page metadata would include both memblock and buddy allocations.
> >
> > I expect that the new PageMetadata field in meminfo should help break
> > down the memory usage of a system (MemUsed, or MemTotal - MemFree),
> > similar to the other fields in meminfo.
> >
> > However, given that PageMetadata includes per-page metadata allocated
> > from not only the buddy allocator, but also the memblock allocations,
> > and MemTotal doesn't include memory reserved by memblock allocations,
> > I wonder how a user can actually use this new PageMetadata to break
> > down the system memory usage.  BTW, it is not robust to assume that
> > all memblock allocations are for per-page metadata.
> >
>
> Hi Wei,
>
> > Here are some ideas to address this problem:
> >
> > - Only report the buddy allocations for per-page metadata in PageMetadata, or
>
> Making PageMetadata contain only some of the per-page memory rather
> than all of it is confusing. Right after boot it would always be 0,
> since all struct pages come from memblock during boot, yet we know we
> have allocated tons of memory for struct pages.
>
> > - Report per-page metadata in two separate fields in meminfo, one for
> > buddy allocations and another for memblock allocations, or
>
> This is also going to be confusing for users: which allocator was
> used to allocate struct pages is really an implementation detail,
> and having two trackers is not going to improve things.
>
> > - Change MemTotal/MemUsed to include the memblock reserved memory as well.
>
> I think this is the right solution for an existing bug: MemTotal
> should really include memblock reserved memory.

Adding reserved memory to MemTotal is a cleaner approach IMO as well.
But it changes the semantics of MemTotal, which may have compatibility
issues.

I think the MemTotal change should be part of this patch series, too.
If it doesn't get accepted, then we need to take one of the first two
approaches (reporting only buddy allocations of per-page metadata or
reporting per-page metadata separately for buddy/memblock allocations)
at least for the Google use cases such that we can use the new
PageMetadata to improve the breakdown of runtime kernel memory
overheads (excluding the boot-time memblock allocations).

> Pasha
>
> >
> > Wei Xu
> >
> > > This memory depends on build configurations, machine architectures, and
> > > the way system is used:
> > >
> > > Build configuration may include extra fields into "struct page",
> > > and enable / disable "page_ext"
> > > Machine architecture defines base page sizes. For example 4K x86,
> > > 8K SPARC, 64K ARM64 (optionally), etc. The per-page metadata
> > > overhead is smaller on machines with larger page sizes.
> > > System use can change per-page overhead by using vmemmap
> > > optimizations with hugetlb pages, and emulated pmem devdax pages.
> > > Also, boot parameters can determine whether page_ext is needed
> > > to be allocated. This memory can be part of MemTotal or be outside
> > > MemTotal depending on whether the memory was hot-plugged, booted with,
> > > or hugetlb memory was returned back to the system.
> > >
> > > Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > > Signed-off-by: Sourav Panda <souravpanda@google.com>
> > > ---
> > >  Documentation/filesystems/proc.rst |  3 +++
> > >  drivers/base/node.c                |  2 ++
> > >  fs/proc/meminfo.c                  |  7 +++++++
> > >  include/linux/mmzone.h             |  3 +++
> > >  include/linux/vmstat.h             |  4 ++++
> > >  mm/hugetlb.c                       | 11 ++++++++--
> > >  mm/hugetlb_vmemmap.c               | 12 +++++++++--
> > >  mm/mm_init.c                       |  3 +++
> > >  mm/page_alloc.c                    |  1 +
> > >  mm/page_ext.c                      | 32 +++++++++++++++++++++---------
> > >  mm/sparse-vmemmap.c                |  3 +++
> > >  mm/sparse.c                        |  7 ++++++-
> > >  mm/vmstat.c                        | 24 ++++++++++++++++++++++
> > >  13 files changed, 98 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > index 2b59cff8be17..c121f2ef9432 100644
> > > --- a/Documentation/filesystems/proc.rst
> > > +++ b/Documentation/filesystems/proc.rst
> > > @@ -987,6 +987,7 @@ Example output. You may not have all of these fields.
> > >      AnonPages:       4654780 kB
> > >      Mapped:           266244 kB
> > >      Shmem:              9976 kB
> > > +    PageMetadata:     513419 kB
> > >      KReclaimable:     517708 kB
> > >      Slab:             660044 kB
> > >      SReclaimable:     517708 kB
> > > @@ -1089,6 +1090,8 @@ Mapped
> > >                files which have been mmapped, such as libraries
> > >  Shmem
> > >                Total memory used by shared memory (shmem) and tmpfs
> > > +PageMetadata
> > > +              Memory used for per-page metadata
> > >  KReclaimable
> > >                Kernel allocations that the kernel will attempt to reclaim
> > >                under memory pressure. Includes SReclaimable (below), and other
> > > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > > index 493d533f8375..da728542265f 100644
> > > --- a/drivers/base/node.c
> > > +++ b/drivers/base/node.c
> > > @@ -428,6 +428,7 @@ static ssize_t node_read_meminfo(struct device *dev,
> > >                              "Node %d Mapped:         %8lu kB\n"
> > >                              "Node %d AnonPages:      %8lu kB\n"
> > >                              "Node %d Shmem:          %8lu kB\n"
> > > +                            "Node %d PageMetadata:   %8lu kB\n"
> > >                              "Node %d KernelStack:    %8lu kB\n"
> > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > >                              "Node %d ShadowCallStack:%8lu kB\n"
> > > @@ -458,6 +459,7 @@ static ssize_t node_read_meminfo(struct device *dev,
> > >                              nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
> > >                              nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
> > >                              nid, K(i.sharedram),
> > > +                            nid, K(node_page_state(pgdat, NR_PAGE_METADATA)),
> > >                              nid, node_page_state(pgdat, NR_KERNEL_STACK_KB),
> > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > >                              nid, node_page_state(pgdat, NR_KERNEL_SCS_KB),
> > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > index 45af9a989d40..f141bb2a550d 100644
> > > --- a/fs/proc/meminfo.c
> > > +++ b/fs/proc/meminfo.c
> > > @@ -39,7 +39,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > >         long available;
> > >         unsigned long pages[NR_LRU_LISTS];
> > >         unsigned long sreclaimable, sunreclaim;
> > > +       unsigned long nr_page_metadata;
> > >         int lru;
> > > +       int nid;
> > >
> > >         si_meminfo(&i);
> > >         si_swapinfo(&i);
> > > @@ -57,6 +59,10 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > >         sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
> > >         sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
> > >
> > > +       nr_page_metadata = 0;
> > > +       for_each_online_node(nid)
> > > +               nr_page_metadata += node_page_state(NODE_DATA(nid), NR_PAGE_METADATA);
> > > +
> > >         show_val_kb(m, "MemTotal:       ", i.totalram);
> > >         show_val_kb(m, "MemFree:        ", i.freeram);
> > >         show_val_kb(m, "MemAvailable:   ", available);
> > > @@ -104,6 +110,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > >         show_val_kb(m, "Mapped:         ",
> > >                     global_node_page_state(NR_FILE_MAPPED));
> > >         show_val_kb(m, "Shmem:          ", i.sharedram);
> > > +       show_val_kb(m, "PageMetadata:   ", nr_page_metadata);
> > >         show_val_kb(m, "KReclaimable:   ", sreclaimable +
> > >                     global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE));
> > >         show_val_kb(m, "Slab:           ", sreclaimable + sunreclaim);
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 4106fbc5b4b3..dda1ad522324 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -207,6 +207,9 @@ enum node_stat_item {
> > >         PGPROMOTE_SUCCESS,      /* promote successfully */
> > >         PGPROMOTE_CANDIDATE,    /* candidate pages to promote */
> > >  #endif
> > > +       NR_PAGE_METADATA,       /* Page metadata size (struct page and page_ext)
> > > +                                * in pages
> > > +                                */
> > >         NR_VM_NODE_STAT_ITEMS
> > >  };
> > >
> > > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > > index fed855bae6d8..af096a881f03 100644
> > > --- a/include/linux/vmstat.h
> > > +++ b/include/linux/vmstat.h
> > > @@ -656,4 +656,8 @@ static inline void lruvec_stat_sub_folio(struct folio *folio,
> > >  {
> > >         lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
> > >  }
> > > +
> > > +void __init mod_node_early_perpage_metadata(int nid, long delta);
> > > +void __init store_early_perpage_metadata(void);
> > > +
> > >  #endif /* _LINUX_VMSTAT_H */
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 1301ba7b2c9a..1778e02ed583 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -1790,6 +1790,9 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> > >                 destroy_compound_gigantic_folio(folio, huge_page_order(h));
> > >                 free_gigantic_folio(folio, huge_page_order(h));
> > >         } else {
> > > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > > +               __node_stat_sub_folio(folio, NR_PAGE_METADATA);
> > > +#endif
> > >                 __free_pages(&folio->page, huge_page_order(h));
> > >         }
> > >  }
> > > @@ -2125,6 +2128,7 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
> > >         struct page *page;
> > >         bool alloc_try_hard = true;
> > >         bool retry = true;
> > > +       struct folio *folio;
> > >
> > >         /*
> > >          * By default we always try hard to allocate the page with
> > > @@ -2175,9 +2179,12 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
> > >                 __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> > >                 return NULL;
> > >         }
> > > -
> > > +       folio = page_folio(page);
> > > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > > +       __node_stat_add_folio(folio, NR_PAGE_METADATA);
> > > +#endif
> > >         __count_vm_event(HTLB_BUDDY_PGALLOC);
> > > -       return page_folio(page);
> > > +       return folio;
> > >  }
> > >
> > >  /*
> > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > > index 4b9734777f69..f7ca5d4dd583 100644
> > > --- a/mm/hugetlb_vmemmap.c
> > > +++ b/mm/hugetlb_vmemmap.c
> > > @@ -214,6 +214,7 @@ static inline void free_vmemmap_page(struct page *page)
> > >                 free_bootmem_page(page);
> > >         else
> > >                 __free_page(page);
> > > +       __mod_node_page_state(page_pgdat(page), NR_PAGE_METADATA, -1);
> > >  }
> > >
> > >  /* Free a list of the vmemmap pages */
> > > @@ -335,6 +336,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> > >                 copy_page(page_to_virt(walk.reuse_page),
> > >                           (void *)walk.reuse_addr);
> > >                 list_add(&walk.reuse_page->lru, &vmemmap_pages);
> > > +               __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, 1);
> > >         }
> > >
> > >         /*
> > > @@ -384,14 +386,20 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> > >         unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
> > >         int nid = page_to_nid((struct page *)start);
> > >         struct page *page, *next;
> > > +       int i;
> > >
> > > -       while (nr_pages--) {
> > > +       for (i = 0; i < nr_pages; i++) {
> > >                 page = alloc_pages_node(nid, gfp_mask, 0);
> > > -               if (!page)
> > > +               if (!page) {
> > > +                       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > > +                                             i);
> > >                         goto out;
> > > +               }
> > >                 list_add_tail(&page->lru, list);
> > >         }
> > >
> > > +       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, nr_pages);
> > > +
> > >         return 0;
> > >  out:
> > >         list_for_each_entry_safe(page, next, list, lru)
> > > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > > index 50f2f34745af..6997bf00945b 100644
> > > --- a/mm/mm_init.c
> > > +++ b/mm/mm_init.c
> > > @@ -26,6 +26,7 @@
> > >  #include <linux/pgtable.h>
> > >  #include <linux/swap.h>
> > >  #include <linux/cma.h>
> > > +#include <linux/vmstat.h>
> > >  #include "internal.h"
> > >  #include "slab.h"
> > >  #include "shuffle.h"
> > > @@ -1656,6 +1657,8 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat)
> > >                         panic("Failed to allocate %ld bytes for node %d memory map\n",
> > >                               size, pgdat->node_id);
> > >                 pgdat->node_mem_map = map + offset;
> > > +               mod_node_early_perpage_metadata(pgdat->node_id,
> > > +                                               DIV_ROUND_UP(size, PAGE_SIZE));
> > >         }
> > >         pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n",
> > >                                 __func__, pgdat->node_id, (unsigned long)pgdat,
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 85741403948f..522dc0c52610 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -5443,6 +5443,7 @@ void __init setup_per_cpu_pageset(void)
> > >         for_each_online_pgdat(pgdat)
> > >                 pgdat->per_cpu_nodestats =
> > >                         alloc_percpu(struct per_cpu_nodestat);
> > > +       store_early_perpage_metadata();
> > >  }
> > >
> > >  __meminit void zone_pcp_init(struct zone *zone)
> > > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > > index 4548fcc66d74..d8d6db9c3d75 100644
> > > --- a/mm/page_ext.c
> > > +++ b/mm/page_ext.c
> > > @@ -201,6 +201,8 @@ static int __init alloc_node_page_ext(int nid)
> > >                 return -ENOMEM;
> > >         NODE_DATA(nid)->node_page_ext = base;
> > >         total_usage += table_size;
> > > +       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > > +                             DIV_ROUND_UP(table_size, PAGE_SIZE));
> > >         return 0;
> > >  }
> > >
> > > @@ -255,12 +257,15 @@ static void *__meminit alloc_page_ext(size_t size, int nid)
> > >         void *addr = NULL;
> > >
> > >         addr = alloc_pages_exact_nid(nid, size, flags);
> > > -       if (addr) {
> > > +       if (addr)
> > >                 kmemleak_alloc(addr, size, 1, flags);
> > > -               return addr;
> > > -       }
> > > +       else
> > > +               addr = vzalloc_node(size, nid);
> > >
> > > -       addr = vzalloc_node(size, nid);
> > > +       if (addr) {
> > > +               mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > > +                                   DIV_ROUND_UP(size, PAGE_SIZE));
> > > +       }
> > >
> > >         return addr;
> > >  }
> > > @@ -303,18 +308,27 @@ static int __meminit init_section_page_ext(unsigned long pfn, int nid)
> > >
> > >  static void free_page_ext(void *addr)
> > >  {
> > > +       size_t table_size;
> > > +       struct page *page;
> > > +       struct pglist_data *pgdat;
> > > +
> > > +       table_size = page_ext_size * PAGES_PER_SECTION;
> > > +
> > >         if (is_vmalloc_addr(addr)) {
> > > +               page = vmalloc_to_page(addr);
> > > +               pgdat = page_pgdat(page);
> > >                 vfree(addr);
> > >         } else {
> > > -               struct page *page = virt_to_page(addr);
> > > -               size_t table_size;
> > > -
> > > -               table_size = page_ext_size * PAGES_PER_SECTION;
> > > -
> > > +               page = virt_to_page(addr);
> > > +               pgdat = page_pgdat(page);
> > >                 BUG_ON(PageReserved(page));
> > >                 kmemleak_free(addr);
> > >                 free_pages_exact(addr, table_size);
> > >         }
> > > +
> > > +       __mod_node_page_state(pgdat, NR_PAGE_METADATA,
> > > +                             -1L * (DIV_ROUND_UP(table_size, PAGE_SIZE)));
> > > +
> > >  }
> > >
> > >  static void __free_page_ext(unsigned long pfn)
> > > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > > index a2cbe44c48e1..2bc67b2c2aa2 100644
> > > --- a/mm/sparse-vmemmap.c
> > > +++ b/mm/sparse-vmemmap.c
> > > @@ -469,5 +469,8 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
> > >         if (r < 0)
> > >                 return NULL;
> > >
> > > +       __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > > +                             DIV_ROUND_UP(end - start, PAGE_SIZE));
> > > +
> > >         return pfn_to_page(pfn);
> > >  }
> > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > index 77d91e565045..7f67b5486cd1 100644
> > > --- a/mm/sparse.c
> > > +++ b/mm/sparse.c
> > > @@ -14,7 +14,7 @@
> > >  #include <linux/swap.h>
> > >  #include <linux/swapops.h>
> > >  #include <linux/bootmem_info.h>
> > > -
> > > +#include <linux/vmstat.h>
> > >  #include "internal.h"
> > >  #include <asm/dma.h>
> > >
> > > @@ -465,6 +465,9 @@ static void __init sparse_buffer_init(unsigned long size, int nid)
> > >          */
> > >         sparsemap_buf = memmap_alloc(size, section_map_size(), addr, nid, true);
> > >         sparsemap_buf_end = sparsemap_buf + size;
> > > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > > +       mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(size, PAGE_SIZE));
> > > +#endif
> > >  }
> > >
> > >  static void __init sparse_buffer_fini(void)
> > > @@ -641,6 +644,8 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
> > >         unsigned long start = (unsigned long) pfn_to_page(pfn);
> > >         unsigned long end = start + nr_pages * sizeof(struct page);
> > >
> > > +       __mod_node_page_state(page_pgdat(pfn_to_page(pfn)), NR_PAGE_METADATA,
> > > +                             -1L * (DIV_ROUND_UP(end - start, PAGE_SIZE)));
> > >         vmemmap_free(start, end, altmap);
> > >  }
> > >  static void free_map_bootmem(struct page *memmap)
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index 00e81e99c6ee..070d2b3d2bcc 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -1245,6 +1245,7 @@ const char * const vmstat_text[] = {
> > >         "pgpromote_success",
> > >         "pgpromote_candidate",
> > >  #endif
> > > +       "nr_page_metadata",
> > >
> > >         /* enum writeback_stat_item counters */
> > >         "nr_dirty_threshold",
> > > @@ -2274,4 +2275,27 @@ static int __init extfrag_debug_init(void)
> > >  }
> > >
> > >  module_init(extfrag_debug_init);
> > > +
> > >  #endif
> > > +
> > > +/*
> > > + * Page metadata size (struct page and page_ext) in pages
> > > + */
> > > +static unsigned long early_perpage_metadata[MAX_NUMNODES] __initdata;
> > > +
> > > +void __init mod_node_early_perpage_metadata(int nid, long delta)
> > > +{
> > > +       early_perpage_metadata[nid] += delta;
> > > +}
> > > +
> > > +void __init store_early_perpage_metadata(void)
> > > +{
> > > +       int nid;
> > > +       struct pglist_data *pgdat;
> > > +
> > > +       for_each_online_pgdat(pgdat) {
> > > +               nid = pgdat->node_id;
> > > +               __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> > > +                                     early_perpage_metadata[nid]);
> > > +       }
> > > +}
> > > --
> > > 2.42.0.820.g83a721a137-goog
> > >

David Hildenbrand Nov. 2, 2023, 3:47 p.m. UTC | #9

On 02.11.23 16:43, Wei Xu wrote:
> On Wed, Nov 1, 2023 at 7:58 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>>
>> On Wed, Nov 1, 2023 at 7:40 PM Wei Xu <weixugc@google.com> wrote:
>>>
>>> On Wed, Nov 1, 2023 at 4:08 PM Sourav Panda <souravpanda@google.com> wrote:
>>>>
>>>> Adds a new per-node PageMetadata field to
>>>> /sys/devices/system/node/nodeN/meminfo
>>>> and a global PageMetadata field to /proc/meminfo. This information can
>>>> be used by users to see how much memory is being used by per-page
>>>> metadata, which can vary depending on build configuration, machine
>>>> architecture, and system use.
>>>>
>>>> Per-page metadata is the amount of memory that Linux needs in order to
>>>> manage memory at the page granularity. The majority of such memory is
>>>> used by "struct page" and "page_ext" data structures. In contrast to
>>>> most other memory consumption statistics, per-page metadata might not
>>>> be included in MemTotal. For example, MemTotal does not include memblock
>>>> allocations but includes buddy allocations. While on the other hand,
>>>> per-page metadata would include both memblock and buddy allocations.
>>>
>>> I expect that the new PageMetadata field in meminfo should help break
>>> down the memory usage of a system (MemUsed, or MemTotal - MemFree),
>>> similar to the other fields in meminfo.
>>>
>>> However, given that PageMetadata includes per-page metadata allocated
>>> from not only the buddy allocator, but also the memblock allocations,
>>> and MemTotal doesn't include memory reserved by memblock allocations,
>>> I wonder how a user can actually use this new PageMetadata to break
>>> down the system memory usage.  BTW, it is not robust to assume that
>>> all memblock allocations are for per-page metadata.
>>>
>>
>> Hi Wei,
>>
>>> Here are some ideas to address this problem:
>>>
>>> - Only report the buddy allocations for per-page metadata in PageMetadata, or
>>
>> Making PageMetadata not to contain all per-page memory but just some
>> is confusing, especially right after boot it would always be 0, as all
>> struct pages are all coming from memblock during boot, yet we know we
>> have allocated tons of memory for struct pages.
>>
>>> - Report per-page metadata in two separate fields in meminfo, one for
>>> buddy allocations and another for memblock allocations, or
>>
>> This is also going to be confusing for the users, it is really an
>> implementation detail which allocator was used to allocate struct
>> pages, and having two trackers is not going to improve things.
>>
>>> - Change MemTotal/MemUsed to include the memblock reserved memory as well.
>>
>> I think this is the right solution for an existing bug: MemTotal
>> should really include memblock reserved memory.
> 
> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
> But it changes the semantics of MemTotal, which may have compatibility
> issues.

I object.

Pasha Tatashin Nov. 2, 2023, 3:50 p.m. UTC | #10

> > Adding reserved memory to MemTotal is a cleaner approach IMO as well.
> > But it changes the semantics of MemTotal, which may have compatibility
> > issues.
>
> I object.

Could you please elaborate on what you object to (and why): do you
object because it will have compatibility issues, or do you object to
including memblock reserves in MemTotal?

Thanks,
Pasha

David Hildenbrand Nov. 2, 2023, 3:53 p.m. UTC | #11

On 02.11.23 16:50, Pasha Tatashin wrote:
>>> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
>>> But it changes the semantics of MemTotal, which may have compatibility
>>> issues.
>>
>> I object.
> 
> Could you please elaborate what you object (and why): you object that
> it will have compatibility issues, or  you object to include memblock
> reserves into MemTotal?

Sorry, I object to changing the semantics of MemTotal. MemTotal is 
traditionally the memory managed by the buddy, not all memory in the 
system. I know people/scripts that are relying on that [although it's
been a source of confusion a couple of times].

Pasha Tatashin Nov. 2, 2023, 4:02 p.m. UTC | #12

On Thu, Nov 2, 2023 at 11:53 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.11.23 16:50, Pasha Tatashin wrote:
> >>> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
> >>> But it changes the semantics of MemTotal, which may have compatibility
> >>> issues.
> >>
> >> I object.
> >
> > Could you please elaborate what you object (and why): you object that
> > it will have compatibility issues, or  you object to include memblock
> > reserves into MemTotal?
>
> Sorry, I object to changing the semantics of MemTotal. MemTotal is
> traditionally the memory managed by the buddy, not all memory in the
> system. I know people/scripts that are relying on that [although it's
> been source of confusion a couple of times].

What if one day we change things so that struct pages are allocated from
the buddy allocator (i.e. allocate deferred struct pages from buddy)?
Will that break those MemTotal scripts? What if the size of struct pages
changes significantly, but the overhead comes from other metadata
(e.g. memdesc)? Will that break those scripts? I feel that struct page
memory should really be included in MemTotal; otherwise we will have
this struggle again in the future when we try to optimize struct page
memory.

David Hildenbrand Nov. 2, 2023, 4:09 p.m. UTC | #13

On 02.11.23 17:02, Pasha Tatashin wrote:
> On Thu, Nov 2, 2023 at 11:53 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 02.11.23 16:50, Pasha Tatashin wrote:
>>>>> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
>>>>> But it changes the semantics of MemTotal, which may have compatibility
>>>>> issues.
>>>>
>>>> I object.
>>>
>>> Could you please elaborate what you object (and why): you object that
>>> it will have compatibility issues, or  you object to include memblock
>>> reserves into MemTotal?
>>
>> Sorry, I object to changing the semantics of MemTotal. MemTotal is
>> traditionally the memory managed by the buddy, not all memory in the
>> system. I know people/scripts that are relying on that [although it's
>> been source of confusion a couple of times].
> 
> What if one day we change so that struct pages are allocated from
> buddy allocator (i.e. allocate deferred struct pages from buddy) will

It does on memory hotplug. But things like crashkernel size
detection don't really care about that.

> it break those MemTotal scripts? What if the size of struct pages
> changes significantly, but the overhead will come from other metadata
> (i.e. memdesc) will that break those scripts? I feel like struct page

Probably; but ideally the metadata overhead will be smaller with 
memdesc. And we'll talk about that once it gets real ;)

> memory should really be included into MemTotal, otherwise we will have
> this struggle in the future when we try to optimize struct page
> memory.
How far do we want to go? Do we want to include crashkernel reserved
memory in MemTotal because it is system memory? Only metadata? What else
that is allocated using memblock?

Again, right now it's simple: MemTotal is memory managed by the buddy.

The spirit of this patch set is good, but modifying existing counters
needs good justification.

Pasha Tatashin Nov. 2, 2023, 4:43 p.m. UTC | #14

On Thu, Nov 2, 2023 at 12:09 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.11.23 17:02, Pasha Tatashin wrote:
> > On Thu, Nov 2, 2023 at 11:53 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 02.11.23 16:50, Pasha Tatashin wrote:
> >>>>> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
> >>>>> But it changes the semantics of MemTotal, which may have compatibility
> >>>>> issues.
> >>>>
> >>>> I object.
> >>>
> >>> Could you please elaborate what you object (and why): you object that
> >>> it will have compatibility issues, or  you object to include memblock
> >>> reserves into MemTotal?
> >>
> >> Sorry, I object to changing the semantics of MemTotal. MemTotal is
> >> traditionally the memory managed by the buddy, not all memory in the
> >> system. I know people/scripts that are relying on that [although it's
> >> been source of confusion a couple of times].
> >
> > What if one day we change so that struct pages are allocated from
> > buddy allocator (i.e. allocate deferred struct pages from buddy) will
>
> It does on memory hotplug. But for things like crashkernel size
> detection doesn't really care about that.

"Crash kernel" is a different case: it is memory external to this kernel,
similar to limiting the amount of physical memory via mem=/memmap=. It
sets aside memory that cannot be used by this kernel, only by the crash
kernel. Also, the crash kernel reserve is exposed in /proc/iomem via the
"Crash kernel" range.

Page metadata memory, on the other hand, is used by this kernel, and
its size can also be changed by this kernel depending on how the memory
is used: memdesc, hotplug, THP, emulated pmem, etc.

> > it break those MemTotal scripts? What if the size of struct pages
> > changes significantly, but the overhead will come from other metadata
> > (i.e. memdesc) will that break those scripts? I feel like struct page
>
> Probably; but ideally the metadata overhead will be smaller with
> memdesc. And we'll talk about that once it gets real ;)

The size and allocation of struct pages already change MemTotal today,
at runtime, even without memdesc. I just brought it up to emphasize that
this is something we should resolve now, before it gets worse.

> > memory should really be included into MemTotal, otherwise we will have
> > this struggle in the future when we try to optimize struct page
> > memory.
> How far do we want to go, do we want to include crashkernel reserved
> memory in MemTotal because it is system memory? Only metadata? what else
> allocated using memblock?
>
> Again, right now it's simple: MemTotal is memory managed by the buddy.
>
> The spirit of this patch set is good, modifying existing counters needs
> good justification.

Wei noticed that all other fields in /proc/meminfo are part of
MemTotal, but this new field may not be (depending on where struct pages
are allocated). So what would be the best way to export page metadata
without redefining MemTotal? Keep the new field in /proc/meminfo and
accept that it is not part of MemTotal, or do two counters? If we do two
counters, would we still keep the one for buddy allocations in
/proc/meminfo and put the other one somewhere outside?

Pasha

David Hildenbrand Nov. 2, 2023, 4:58 p.m. UTC | #15

On 02.11.23 17:43, Pasha Tatashin wrote:
> On Thu, Nov 2, 2023 at 12:09 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 02.11.23 17:02, Pasha Tatashin wrote:
>>> On Thu, Nov 2, 2023 at 11:53 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 02.11.23 16:50, Pasha Tatashin wrote:
>>>>>>> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
>>>>>>> But it changes the semantics of MemTotal, which may have compatibility
>>>>>>> issues.
>>>>>>
>>>>>> I object.
>>>>>
>>>>> Could you please elaborate what you object (and why): you object that
>>>>> it will have compatibility issues, or  you object to include memblock
>>>>> reserves into MemTotal?
>>>>
>>>> Sorry, I object to changing the semantics of MemTotal. MemTotal is
>>>> traditionally the memory managed by the buddy, not all memory in the
>>>> system. I know people/scripts that are relying on that [although it's
>>>> been source of confusion a couple of times].
>>>
>>> What if one day we change so that struct pages are allocated from
>>> buddy allocator (i.e. allocate deferred struct pages from buddy) will
>>
>> It does on memory hotplug. But for things like crashkernel size
>> detection doesn't really care about that.
> 
> "Crash kernel" is a different case: it is kernel external memory,
> similar to limiting the amount of physical memory via mem=/memmap=, it
> sets memory that cannot be used by this kernel, but only by the crash
> kernel. Also, the crash kernel reserve is exposed in /proc/iomem via
> "Crash kernel" range.

Agreed.

> 
> Page metadata memory on the other hand, is used by this kernel, and
> also can be changed by this kernel depending on how the memory is
> used: memdesc, hotplug, THP, emulated pmem etc.

And then, there is the "altmap" for dax, where the metadata is placed on 
the dax memory itself. I mean, it's system RAM (or NVDIMM or whatever) 
used for metadata, but not managed by the buddy.

There is now also the "memmap_on_memory" feature for memory hotplug,
where we do the same for ordinary hotplug memory (put some memory aside
for the memmap and do not allocate it from the buddy). We'd have to
account that one as metadata as well, I think. I don't think it would get
accounted under MemTotal (because it is not managed by the buddy) as of now.
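To give a feeling for the sizes involved, here is a rough sketch of the memmap arithmetic, following the nr_pages * sizeof(struct page) calculation used in depopulate_section_memmap() in the patch. The helper names are hypothetical, and the page size and struct page size are parameters because neither is fixed by this thread; 4 KiB pages and a 64-byte struct page are merely common values on x86-64.

```c
#include <stdint.h>

/* Bytes of memmap needed to describe `mem_bytes` of memory. */
static uint64_t memmap_bytes(uint64_t mem_bytes, uint64_t page_size,
                             uint64_t struct_page_size)
{
    uint64_t nr_pages = mem_bytes / page_size;

    return nr_pages * struct_page_size;
}

/* The same quantity expressed in whole pages, rounded up. */
static uint64_t memmap_pages(uint64_t mem_bytes, uint64_t page_size,
                             uint64_t struct_page_size)
{
    uint64_t bytes = memmap_bytes(mem_bytes, page_size, struct_page_size);

    return (bytes + page_size - 1) / page_size;
}
```

Under those assumed values, a 128 MiB hotplugged block needs 2 MiB (512 pages) of memmap, i.e. roughly 1.6% of the block, whether that memmap is carved out of the block itself or allocated from the buddy.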

> 
>>> it break those MemTotal scripts? What if the size of struct pages
>>> changes significantly, but the overhead will come from other metadata
>>> (i.e. memdesc) will that break those scripts? I feel like struct page
>>
>> Probably; but ideally the metadata overhead will be smaller with
>> memdesc. And we'll talk about that once it gets real ;)
> 
> The size and allocation of struct pages change MemTotal today, during
> runtime, even without memdesc, I just brought it up, to emphasize that
> this is something that we should resolve now before it gets worse.

I don't quite see the immediate need for action, but I get what you are 
saying. It's a historical mess, but if we want to tackle it, we should 
tackle it completely and not only sort out the metadata accounting.

> 
>>> memory should really be included into MemTotal, otherwise we will have
>>> this struggle in the future when we try to optimize struct page
>>> memory.
>> How far do we want to go, do we want to include crashkernel reserved
>> memory in MemTotal because it is system memory? Only metadata? what else
>> allocated using memblock?
>>
>> Again, right now it's simple: MemTotal is memory managed by the buddy.
>>
>> The spirit of this patch set is good, modifying existing counters needs
>> good justification.
> 
> Wei, noticed that all other fields in /proc/meminfo are part of
> MemTotal, but this new field may be not (depending where struct pages

I could have sworn that I pointed that out in a previous version and 
requested to document that special case in the patch description. :)

> are allocated), so what would be the best way to export page metadata
> without redefining MemTotal? Keep the new field in /proc/meminfo but
> be ok that it is not part of MemTotal or do two counters? If we do two
> counters, we will still need to keep one that is a buddy allocator in
> /proc/meminfo and the other one somewhere outside?

IMHO, we should just leave MemTotal alone ("memory managed by the buddy 
that could actually mostly get freed up and reused -- although that's 
not completely true") and have a new counter that includes any system 
memory (MemSystem? but as we learned, as separate files), including most 
memblock allocations/reservations as well (metadata, early pagetables, 
initrd, kernel, ...).

Then you would actually know how much memory the system is using
(excluding things like crashmem, mem=, ...).

That part is tricky, though -- I recall there are memblock reservations
that are similar to the crashkernel -- which is why the current state is
to account memory under MemTotal when it's handed to the buddy -- which
is straightforward and simple.

I'm happy to discuss this further, if that direction is worth exploring.

Pasha Tatashin Nov. 2, 2023, 5:11 p.m. UTC | #16

> > Wei, noticed that all other fields in /proc/meminfo are part of
> > MemTotal, but this new field may be not (depending where struct pages
>
> I could have sworn that I pointed that out in a previous version and
> requested to document that special case in the patch description. :)

Sounds good, we will document that parts of the per-page metadata may
not be part of MemTotal.

> > are allocated), so what would be the best way to export page metadata
> > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > be ok that it is not part of MemTotal or do two counters? If we do two
> > counters, we will still need to keep one that is a buddy allocator in
> > /proc/meminfo and the other one somewhere outside?
>
> IMHO, we should just leave MemTotal alone ("memory managed by the buddy
> that could actually mostly get freed up and reused -- although that's
> not completely true") and have a new counter that includes any system
> memory (MemSystem? but as we learned, as separate files), including most
> memblock allocations/reservations as well (metadata, early pagetables,
> initrd, kernel, ...).
>
> Then you would actually know how much memory the system is using
> (excluding things like crashmem, mem=, ...).
>
> That part is tricky, though -- I recall there are memblock reservations
> that are similar to the crashkernel -- which is why the current state is
> to account memory when it's handed to the buddy under MemTotal -- which
> is straightforward and simple.

It may be simplified if we define MemSystem as all the usable memory
provided by firmware to the Linux kernel.
For BIOS it would be the "usable" ranges in the original e820 memory
list before it's been modified by the kernel based on the parameters.

For device-tree architectures, it would be the memory binding provided
by the original device tree from the firmware.
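As a sketch of that definition, MemSystem would simply be the sum of the ranges the firmware marks as usable, taken before any kernel parameter trims the map. The struct and enum below are hypothetical stand-ins for an e820-style entry, not the kernel's actual types.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range types; a real firmware map has more of them. */
enum fw_range_type { FW_USABLE, FW_RESERVED, FW_ACPI };

/* Hypothetical stand-in for one entry of the original firmware memory map. */
struct fw_range {
    uint64_t start;
    uint64_t size;
    enum fw_range_type type;
};

/* Sum the sizes of all usable ranges in the unmodified firmware map. */
static uint64_t mem_system_bytes(const struct fw_range *map, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
        if (map[i].type == FW_USABLE)
            total += map[i].size;
    }
    return total;
}
```

With this definition the value would not change with mem=, crashkernel= or later memblock reservations, which is what makes it simple, but also means it says nothing about how much of that memory this kernel can actually use.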

Pasha

Wei Xu Nov. 2, 2023, 6:06 p.m. UTC | #17

On Thu, Nov 2, 2023 at 10:12 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> > > Wei, noticed that all other fields in /proc/meminfo are part of
> > > MemTotal, but this new field may be not (depending where struct pages
> >
> > I could have sworn that I pointed that out in a previous version and
> > requested to document that special case in the patch description. :)
>
> Sounds, good we will document that parts of per-page may not be part
> of MemTotal.

But this still doesn't answer how we can use the new PageMetadata
field to help break down the runtime kernel overhead within MemUsed
(MemTotal - MemFree).
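To make the concern concrete, here is a small userspace sketch that reads fields out of meminfo-style text and computes MemUsed. The helper names are hypothetical; PageMetadata is the field name proposed by this patch.

```c
#include <stdio.h>
#include <string.h>

/* Return the value (in kB) of `field` in meminfo-style text, or -1 if absent. */
static long meminfo_kb(const char *text, const char *field)
{
    size_t flen = strlen(field);
    const char *line = text;

    while (line && *line) {
        /* Match "Field:" at the start of a line only. */
        if (strncmp(line, field, flen) == 0 && line[flen] == ':') {
            long kb;

            if (sscanf(line + flen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* MemUsed as used in this thread: MemTotal - MemFree. */
static long mem_used_kb(const char *text)
{
    long total = meminfo_kb(text, "MemTotal");
    long mfree = meminfo_kb(text, "MemFree");

    if (total < 0 || mfree < 0)
        return -1;
    return total - mfree;
}
```

A tool would be tempted to subtract meminfo_kb(text, "PageMetadata") from mem_used_kb(text) to see what else the kernel is using, but that is only valid for the part of PageMetadata that came from the buddy allocator, and the single counter does not say how large that part is.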

> > > are allocated), so what would be the best way to export page metadata
> > > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > > be ok that it is not part of MemTotal or do two counters? If we do two
> > > counters, we will still need to keep one that is a buddy allocator in
> > > /proc/meminfo and the other one somewhere outside?
> >

I think the simplest thing to do now is to only report the buddy
allocations of per-page metadata in meminfo.  The meaning of the new
counter is easier to understand and consistent with MemTotal and other
fields in meminfo. Its implementation can also be greatly simplified
and we don't need to handle the other special cases, either, e.g.
pagemeta allocated from DAX devices.

> > IMHO, we should just leave MemTotal alone ("memory managed by the buddy
> > that could actually mostly get freed up and reused -- although that's
> > not completely true") and have a new counter that includes any system
> > memory (MemSystem? but as we learned, as separate files), including most
> > memblock allocations/reservations as well (metadata, early pagetables,
> > initrd, kernel, ...).
> >
> > Then you would actually know how much memory the system is using
> > (excluding things like crashmem, mem=, ...).
> >
> > That part is tricky, though -- I recall there are memblock reservations
> > that are similar to the crashkernel -- which is why the current state is
> > to account memory when it's handed to the buddy under MemTotal -- which
> > is straightforward and simple.
>
> It may be simplified if we define MemSystem as all the usable memory
> provided by firmware to Linux kernel.
> For BIOS it would be the "usable" ranges in the original e820 memory
> list before it's been modified by the kernel based on the parameters.
>
> For device-tree architectures, it would be the memory binding provided
> by the original device tree from the firmware.
>
> Pasha

Pasha Tatashin Nov. 2, 2023, 6:33 p.m. UTC | #18

> > > I could have sworn that I pointed that out in a previous version and
> > > requested to document that special case in the patch description. :)
> >
> > Sounds, good we will document that parts of per-page may not be part
> > of MemTotal.
>
> But this still doesn't answer how we can use the new PageMetadata
> field to help break down the runtime kernel overhead within MemUsed
> (MemTotal - MemFree).

I am not sure it matters to the end users: they look at PageMetadata
with or without Page Owner, page_table_check, HugeTLB and it shows
exactly how much per-page overhead changed. Where the kernel allocated
that memory is not that important to the end user as long as that
memory became available to them.

In addition, it is still possible to estimate the actual memblock part
of Per-page metadata by looking at /proc/zoneinfo:

Memblock reserved per-page metadata: "present_pages - managed_pages"
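A rough sketch of that estimate: sum "present" minus "managed" over all zones in /proc/zoneinfo-style text. The function name is hypothetical. Note that the difference counts every page that was never handed to the buddy, whatever it is used for, so it is at best an upper bound on the memblock-backed part of the per-page metadata.

```c
#include <stdio.h>
#include <string.h>

/* Sum of (present - managed) pages over all zones in zoneinfo-style text. */
static long zoneinfo_reserved_pages(const char *text)
{
    long present = 0, managed = 0;
    const char *line = text;

    while (line && *line) {
        char key[32];
        long val;

        /* Lines of interest look like "        present  1048576". */
        if (sscanf(line, " %31s %ld", key, &val) == 2) {
            if (strcmp(key, "present") == 0)
                present += val;
            else if (strcmp(key, "managed") == 0)
                managed += val;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return present - managed;
}
```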

If there is something big that we will allocate in that range, we
should probably also export it in some form.

If this field does not fit in /proc/meminfo due to not fully being
part of MemTotal, we could just keep it under nodeN/, as a separate
file, as suggested by Greg.

However, I think it is useful enough to have an easy system wide view
for Per-page metadata.

> > > > are allocated), so what would be the best way to export page metadata
> > > > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > > > be ok that it is not part of MemTotal or do two counters? If we do two
> > > > counters, we will still need to keep one that is a buddy allocator in
> > > > /proc/meminfo and the other one somewhere outside?
> > >
>
> I think the simplest thing to do now is to only report the buddy
> allocations of per-page metadata in meminfo.  The meaning of the new

This will cause PageMetadata to be 0 on 99% of the systems, and
essentially become useless to the vast majority of users.

Wei Xu Nov. 2, 2023, 8:22 p.m. UTC | #19

On Thu, Nov 2, 2023 at 11:34 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> > > > I could have sworn that I pointed that out in a previous version and
> > > > requested to document that special case in the patch description. :)
> > >
> > > Sounds, good we will document that parts of per-page may not be part
> > > of MemTotal.
> >
> > But this still doesn't answer how we can use the new PageMetadata
> > field to help break down the runtime kernel overhead within MemUsed
> > (MemTotal - MemFree).
>
> I am not sure it matters to the end users: they look at PageMetadata
> with or without Page Owner, page_table_check, HugeTLB and it shows
> exactly how much per-page overhead changed. Where the kernel allocated
> that memory is not that important to the end user as long as that
> memory became available to them.
>
> In addition, it is still possible to estimate the actual memblock part
> of Per-page metadata by looking at /proc/zoneinfo:
>
> Memblock reserved per-page metadata: "present_pages - managed_pages"

This assumes that all reserved memblocks are per-page metadata. As I
mentioned earlier, it is not a robust approach.

> If there is something big that we will allocate in that range, we
> should probably also export it in some form.
>
> If this field does not fit in /proc/meminfo due to not fully being
> part of MemTotal, we could just keep it under nodeN/, as a separate
> file, as suggested by Greg.
>
> However, I think it is useful enough to have an easy system wide view
> for Per-page metadata.

It is fine to have this as a separate, informational sysfs file under
nodeN/, outside of meminfo. I just don't think that the current
implementation (where PageMetadata is a mixture of buddy and memblock
allocations) can help with the use case that motivates this
change, i.e. improving the breakdown of the kernel overhead.

> > > > > are allocated), so what would be the best way to export page metadata
> > > > > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > > > > be ok that it is not part of MemTotal or do two counters? If we do two
> > > > > counters, we will still need to keep one that is a buddy allocator in
> > > > > /proc/meminfo and the other one somewhere outside?
> > > >
> >
> > I think the simplest thing to do now is to only report the buddy
> > allocations of per-page metadata in meminfo.  The meaning of the new
>
> This will cause PageMetadata to be 0 on 99% of the systems, and
> essentially become useless to the vast majority of users.

I don't think it is a major issue. There are other fields (e.g. Zswap)
in meminfo that remain 0 when the feature is not used.

David Hildenbrand Nov. 2, 2023, 8:28 p.m. UTC | #20

On 02.11.23 18:11, Pasha Tatashin wrote:
>>> Wei, noticed that all other fields in /proc/meminfo are part of
>>> MemTotal, but this new field may be not (depending where struct pages
>>
>> I could have sworn that I pointed that out in a previous version and
>> requested to document that special case in the patch description. :)
> 
> Sounds, good we will document that parts of per-page may not be part
> of MemTotal.
> 
>>> are allocated), so what would be the best way to export page metadata
>>> without redefining MemTotal? Keep the new field in /proc/meminfo but
>>> be ok that it is not part of MemTotal or do two counters? If we do two
>>> counters, we will still need to keep one that is a buddy allocator in
>>> /proc/meminfo and the other one somewhere outside?
>>
>> IMHO, we should just leave MemTotal alone ("memory managed by the buddy
>> that could actually mostly get freed up and reused -- although that's
>> not completely true") and have a new counter that includes any system
>> memory (MemSystem? but as we learned, as separate files), including most
>> memblock allocations/reservations as well (metadata, early pagetables,
>> initrd, kernel, ...).
>>
>> Then you would actually know how much memory the system is using
>> (excluding things like crashmem, mem=, ...).
>>
>> That part is tricky, though -- I recall there are memblock reservations
>> that are similar to the crashkernel -- which is why the current state is
>> to account memory when it's handed to the buddy under MemTotal -- which
>> is straightforward and simple.
> 
> It may be simplified if we define MemSystem as all the usable memory
> provided by firmware to Linux kernel.
> For BIOS it would be the "usable" ranges in the original e820 memory
> list before it's been modified by the kernel based on the parameters.

There are some cases to consider, like "mem=", crashkernel, and some 
more odd things (I believe there are some on ppc at least for hw tracing 
buffers).

All the information should be in the memblock allocator; maybe we just
have to find ways to better tell it what an allocation is for (e.g.,
memmap), and what the other reasons to exclude memory are (crash kernel,
mem=, ACPI tables, odd memory holes, ...).

Pasha Tatashin Nov. 3, 2023, 1:06 a.m. UTC | #21

On Thu, Nov 2, 2023 at 4:22 PM Wei Xu <weixugc@google.com> wrote:
>
> On Thu, Nov 2, 2023 at 11:34 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > > > > I could have sworn that I pointed that out in a previous version and
> > > > > requested to document that special case in the patch description. :)
> > > >
> > > > Sounds, good we will document that parts of per-page may not be part
> > > > of MemTotal.
> > >
> > > But this still doesn't answer how we can use the new PageMetadata
> > > field to help break down the runtime kernel overhead within MemUsed
> > > (MemTotal - MemFree).
> >
> > I am not sure it matters to the end users: they look at PageMetadata
> > with or without Page Owner, page_table_check, HugeTLB and it shows
> > exactly how much per-page overhead changed. Where the kernel allocated
> > that memory is not that important to the end user as long as that
> > memory became available to them.
> >
> > In addition, it is still possible to estimate the actual memblock part
> > of Per-page metadata by looking at /proc/zoneinfo:
> >
> > Memblock reserved per-page metadata: "present_pages - managed_pages"
>
> This assumes that all reserved memblocks are per-page metadata. As I

Right after boot, when all Per-page metadata is still from memblocks,
we could determine what part of the zone reserved memory is not
per-page, and use it later in our calculations.

> mentioned earlier, it is not a robust approach.
> > If there is something big that we will allocate in that range, we
> > should probably also export it in some form.
> >
> > If this field does not fit in /proc/meminfo due to not fully being
> > part of MemTotal, we could just keep it under nodeN/, as a separate
> > file, as suggested by Greg.
> >
> > However, I think it is useful enough to have an easy system wide view
> > for Per-page metadata.
>
> It is fine to have this as a separate, informational sysfs file under
> nodeN/, outside of meminfo. I just don't think that, in the current
> implementation (where PageMetadata is a mixture of buddy and memblock
> allocations), it can help with the use case that motivates this
> change, i.e. to improve the breakdown of the kernel overhead.
> > > > > > are allocated), so what would be the best way to export page metadata
> > > > > > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > > > > > be ok that it is not part of MemTotal or do two counters? If we do two
> > > > > > counters, we will still need to keep one that is a buddy allocator in
> > > > > > /proc/meminfo and the other one somewhere outside?
> > > > >
> > >
> > > I think the simplest thing to do now is to only report the buddy
> > > allocations of per-page metadata in meminfo.  The meaning of the new
> >
> > This will cause PageMetadata to be 0 on 99% of the systems, and
> > essentially become useless to the vast majority of users.
>
> I don't think it is a major issue. There are other fields (e.g. Zswap)
> in meminfo that remain 0 when the feature is not used.

Since we are going to use two independent interfaces,
/proc/meminfo/PageMetadata and nodeN/page_metadata (in a separate file,
as requested by Greg), how about if in /proc/meminfo we provide only
the buddy allocator part, and in nodeN/page_metadata we provide the
total per-page overhead in the given node, which includes memblock
reserves and buddy allocator memory?

Pasha
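
For reference, the "present_pages - managed_pages" estimate mentioned
above can be computed from /proc/zoneinfo. The sketch below is
illustrative only and not part of the patch: the function name
reserved_pages_per_zone is made up here, and it assumes the usual
zoneinfo layout (a "Node N, zone NAME" header followed by indented
"present" and "managed" counts in pages). As Wei points out, the result
is only an upper bound, because not every memblock reservation is
per-page metadata.

```python
import re
from typing import Dict, Tuple

# Assumed /proc/zoneinfo layout: a header per zone, then indented counters.
ZONE_HEADER = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")
PAGE_FIELD = re.compile(r"^\s+(present|managed)\s+(\d+)\s*$")


def reserved_pages_per_zone(zoneinfo_text: str) -> Dict[Tuple[int, str], int]:
    """Return present - managed (in pages) for every (node, zone) pair.

    This is an upper bound on memblock-reserved per-page metadata, not an
    exact figure: other boot-time reservations land in the same gap.
    """
    fields: Dict[Tuple[int, str], Dict[str, int]] = {}
    current = None
    for line in zoneinfo_text.splitlines():
        header = ZONE_HEADER.match(line)
        if header:
            current = (int(header.group(1)), header.group(2))
            fields[current] = {}
            continue
        field = PAGE_FIELD.match(line)
        if field and current is not None:
            fields[current][field.group(1)] = int(field.group(2))
    return {
        zone: counts["present"] - counts["managed"]
        for zone, counts in fields.items()
        if "present" in counts and "managed" in counts
    }
```

On a live system the input would be the contents of /proc/zoneinfo,
e.g. reserved_pages_per_zone(open("/proc/zoneinfo").read()); multiply by
the page size to get bytes.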

Wei Xu Nov. 3, 2023, 4:27 a.m. UTC | #22

On Thu, Nov 2, 2023 at 6:07 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Thu, Nov 2, 2023 at 4:22 PM Wei Xu <weixugc@google.com> wrote:
> >
> > On Thu, Nov 2, 2023 at 11:34 AM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > > > > I could have sworn that I pointed that out in a previous version and
> > > > > > requested to document that special case in the patch description. :)
> > > > >
> > > > > Sounds good, we will document that parts of per-page may not be part
> > > > > of MemTotal.
> > > >
> > > > But this still doesn't answer how we can use the new PageMetadata
> > > > field to help break down the runtime kernel overhead within MemUsed
> > > > (MemTotal - MemFree).
> > >
> > > I am not sure it matters to the end users: they look at PageMetadata
> > > with or without Page Owner, page_table_check, HugeTLB and it shows
> > > exactly how much per-page overhead changed. Where the kernel allocated
> > > that memory is not that important to the end user as long as that
> > > memory became available to them.
> > >
> > > In addition, it is still possible to estimate the actual memblock part
> > > of Per-page metadata by looking at /proc/zoneinfo:
> > >
> > > Memblock reserved per-page metadata: "present_pages - managed_pages"
> >
> > This assumes that all reserved memblocks are per-page metadata. As I
>
> Right after boot, when all Per-page metadata is still from memblocks,
> we could determine what part of the zone reserved memory is not
> per-page, and use it later in our calculations.
>
> > mentioned earlier, it is not a robust approach.
> > > If there is something big that we will allocate in that range, we
> > > should probably also export it in some form.
> > >
> > > If this field does not fit in /proc/meminfo due to not fully being
> > > part of MemTotal, we could just keep it under nodeN/, as a separate
> > > file, as suggested by Greg.
> > >
> > > However, I think it is useful enough to have an easy system wide view
> > > for Per-page metadata.
> >
> > It is fine to have this as a separate, informational sysfs file under
> > nodeN/, outside of meminfo. I just don't think that, in the current
> > implementation (where PageMetadata is a mixture of buddy and memblock
> > allocations), it can help with the use case that motivates this
> > change, i.e. to improve the breakdown of the kernel overhead.
> > > > > > > are allocated), so what would be the best way to export page metadata
> > > > > > > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > > > > > > be ok that it is not part of MemTotal or do two counters? If we do two
> > > > > > > counters, we will still need to keep one that is a buddy allocator in
> > > > > > > /proc/meminfo and the other one somewhere outside?
> > > > > >
> > > >
> > > > I think the simplest thing to do now is to only report the buddy
> > > > allocations of per-page metadata in meminfo.  The meaning of the new
> > >
> > > This will cause PageMetadata to be 0 on 99% of the systems, and
> > > essentially become useless to the vast majority of users.
> >
> > I don't think it is a major issue. There are other fields (e.g. Zswap)
> > in meminfo that remain 0 when the feature is not used.
>
> Since we are going to use two independent interfaces,
> /proc/meminfo/PageMetadata and nodeN/page_metadata (in a separate file,
> as requested by Greg), how about if in /proc/meminfo we provide only
> the buddy allocator part, and in nodeN/page_metadata we provide the
> total per-page overhead in the given node, which includes memblock
> reserves and buddy allocator memory?

What we want is the system-wide breakdown of kernel memory usage. For
this use case, it works to have the new PageMetadata counter in
/proc/meminfo report only buddy-allocated per-page metadata.

> Pasha

Pasha Tatashin Nov. 3, 2023, 3:18 p.m. UTC | #23

> > Since we are going to use two independent interfaces,
> > /proc/meminfo/PageMetadata and nodeN/page_metadata (in a separate file,
> > as requested by Greg), how about if in /proc/meminfo we provide only
> > the buddy allocator part, and in nodeN/page_metadata we provide the
> > total per-page overhead in the given node, which includes memblock
> > reserves and buddy allocator memory?
>
> What we want is the system-wide breakdown of kernel memory usage. For
> this use case, it works to have the new PageMetadata counter in
> /proc/meminfo report only buddy-allocated per-page metadata.

We want to report all PageMetadata, otherwise this effort is going to
be useless for the majority of users.

As you noted, /proc/meminfo allows us to report only the part of
per-page metadata that was allocated by the buddy allocator because of
an existing MemTotal bug that does not include memblock reserves.
However, we do not have this limitation when we create a new
nodeN/page_metadata interface, and we can document that in the sysfs
ABI documentation: sum(nodeN/page_metadata) contains all per-page
metadata and is a superset of the value in /proc/meminfo.

The only question is how to name the PageMetadata field in /proc/meminfo
so that users understand that not all page metadata is included.
(Of course, we will also document that only the MemTotal part of page
metadata is reported in /proc/meminfo.)

Pasha
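
To make the proposed relationship concrete, here is a rough sketch of
how a user could derive the memblock-only part from the two interfaces.
It is hypothetical: nodeN/page_metadata does not exist yet, its format
is not defined in this thread, and the helper names below are invented
for illustration. The per-node values are therefore passed in as plain
numbers (assumed to be in kB), and only the standard "Name:  value kB"
layout of /proc/meminfo is parsed.

```python
import re
from typing import Iterable, Optional


def meminfo_field_kb(meminfo_text: str, name: str) -> Optional[int]:
    """Return the value in kB of a 'Name:   value kB' line, or None."""
    pattern = re.compile(
        rf"^{re.escape(name)}:\s+(\d+)\s+kB\s*$", re.MULTILINE
    )
    match = pattern.search(meminfo_text)
    return int(match.group(1)) if match else None


def memblock_metadata_kb(meminfo_text: str, per_node_kb: Iterable[int]) -> int:
    """Memblock-only per-page metadata under the proposal in this thread.

    Assumes /proc/meminfo's PageMetadata holds only the buddy-allocated
    part and the (hypothetical) per-node files hold the node totals.
    """
    buddy_kb = meminfo_field_kb(meminfo_text, "PageMetadata")
    if buddy_kb is None:
        raise KeyError("PageMetadata field not found in meminfo text")
    total_kb = sum(per_node_kb)
    if total_kb < buddy_kb:
        # Would contradict the documented "superset" relationship.
        raise ValueError("per-node total is smaller than the meminfo value")
    return total_kb - buddy_kb
```

The ValueError check encodes the documented invariant that the per-node
sum is a superset of the /proc/meminfo value; whether the final
interface keeps that invariant depends on the next version of the patch.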

kernel test robot Nov. 17, 2023, 2:42 a.m. UTC | #24

Hi Sourav Panda,

We are not sure whether this patch has been NACKed, given
https://lore.kernel.org/all/2023110205-enquirer-sponge-4f35@gregkh/

but it seems you still plan to send a next version:
https://lore.kernel.org/all/CA+CK2bCFgwLXp=pUTKezWtRoCKiDC41DqGXx_kahg0UcB53sPw@mail.gmail.com/

so we are still sending the report below, FYI, about what we observed in our tests.


Hello,

kernel test robot noticed "WARNING:at_mm/vmstat.c:#__mod_node_page_state" on:

commit: 77348e22542ef30ac2e12e111fdbe2debe4c8bf7 ("[PATCH v5 1/1] mm: report per-page metadata information")
url: https://github.com/intel-lab-lkp/linux/commits/Sourav-Panda/mm-report-per-page-metadata-information/20231102-071047
base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git effd7c70eaa0440688b60b9d419243695ede3c45
patch link: https://lore.kernel.org/all/20231101230816.1459373-2-souravpanda@google.com/
patch subject: [PATCH v5 1/1] mm: report per-page metadata information

in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:

	sc_nr_hugepages: 2
	group: mm



compiler: gcc-12
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202311171013.fb3e52d3-oliver.sang@intel.com


kern  :warn  : [  625.944628] ------------[ cut here ]------------
kern :warn : [  625.945623] WARNING: CPU: 30 PID: 16422 at mm/vmstat.c:393 __mod_node_page_state (mm/vmstat.c:393) 
kern  :warn  : [  625.946550] Modules linked in: test_hmm(+) netconsole openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 intel_rapl_msr intel_rapl_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp btrfs blake2b_generic xor coretemp kvm_intel raid6_pq zstd_compress kvm libcrc32c irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 rapl intel_cstate nvme nvme_core ahci t10_pi ipmi_devintf libahci ipmi_msghandler wmi_bmof mxm_wmi intel_wmi_thunderbolt crc64_rocksoft_generic i2c_i801 crc64_rocksoft intel_uncore wdat_wdt crc64 libata mei_me i2c_smbus ioatdma mei dca wmi binfmt_misc fuse drm ip_tables
kern  :warn  : [  625.951800] CPU: 30 PID: 16422 Comm: modprobe Not tainted 6.6.0-rc4-00022-g77348e22542e #1
kern  :warn  : [  625.952689] Hardware name: Gigabyte Technology Co., Ltd. X299 UD4 Pro/X299 UD4 Pro-CF, BIOS F8a 04/27/2021
kern :warn : [  625.953692] RIP: 0010:__mod_node_page_state (mm/vmstat.c:393) 
kern :warn : [ 625.954310] Code: 1c 24 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 65 8b 05 78 ad 77 7e a9 ff ff ff 7f 75 bb 65 8b 05 9e 79 76 7e 85 c0 74 b0 <0f> 0b eb ac 49 83 fd 2c 77 7b 4e 8d 34 ed c8 a5 02 00 be 08 00 00
All code
========
   0:	1c 24                	sbb    $0x24,%al
   2:	48 83 c4 08          	add    $0x8,%rsp
   6:	5b                   	pop    %rbx
   7:	5d                   	pop    %rbp
   8:	41 5c                	pop    %r12
   a:	41 5d                	pop    %r13
   c:	41 5e                	pop    %r14
   e:	41 5f                	pop    %r15
  10:	c3                   	retq   
  11:	65 8b 05 78 ad 77 7e 	mov    %gs:0x7e77ad78(%rip),%eax        # 0x7e77ad90
  18:	a9 ff ff ff 7f       	test   $0x7fffffff,%eax
  1d:	75 bb                	jne    0xffffffffffffffda
  1f:	65 8b 05 9e 79 76 7e 	mov    %gs:0x7e76799e(%rip),%eax        # 0x7e7679c4
  26:	85 c0                	test   %eax,%eax
  28:	74 b0                	je     0xffffffffffffffda
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	eb ac                	jmp    0xffffffffffffffda
  2e:	49 83 fd 2c          	cmp    $0x2c,%r13
  32:	77 7b                	ja     0xaf
  34:	4e 8d 34 ed c8 a5 02 	lea    0x2a5c8(,%r13,8),%r14
  3b:	00 
  3c:	be                   	.byte 0xbe
  3d:	08 00                	or     %al,(%rax)
	...

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	eb ac                	jmp    0xffffffffffffffb0
   4:	49 83 fd 2c          	cmp    $0x2c,%r13
   8:	77 7b                	ja     0x85
   a:	4e 8d 34 ed c8 a5 02 	lea    0x2a5c8(,%r13,8),%r14
  11:	00 
  12:	be                   	.byte 0xbe
  13:	08 00                	or     %al,(%rax)
	...
kern  :warn  : [  625.956115] RSP: 0018:ffffc90000d7f548 EFLAGS: 00010202
kern  :warn  : [  625.956726] RAX: 0000000000000001 RBX: 00000003ffff8000 RCX: 1ffffffff0aeddef
kern  :warn  : [  625.957526] RDX: 0000000000000000 RSI: 0000000000000026 RDI: ffff88889fffe5c0
kern  :warn  : [  625.958414] RBP: ffff88889ffd4000 R08: 0000000000000007 R09: fffffbfff091ebd4
kern  :warn  : [  625.959207] R10: ffffffff848f5ea3 R11: 0000000000000001 R12: 00000000000427ec
kern  :warn  : [  625.960008] R13: 000000000000002b R14: 0000000000000200 R15: 00000000000427c0
kern  :warn  : [  625.960786] FS:  00007fca350f5740(0000) GS:ffff88880f100000(0000) knlGS:0000000000000000
kern  :warn  : [  625.961664] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern  :warn  : [  625.962342] CR2: 00007f643c75d000 CR3: 00000002c7c44003 CR4: 00000000003706e0
kern  :warn  : [  625.963132] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kern  :warn  : [  625.963923] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kern  :warn  : [  625.964702] Call Trace:
kern  :warn  : [  625.965089]  <TASK>
kern :warn : [  625.965436] ? __warn (kernel/panic.c:673) 
kern :warn : [  625.965898] ? __mod_node_page_state (mm/vmstat.c:393) 
kern :warn : [  625.966450] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
kern :warn : [  625.966947] ? handle_bug (arch/x86/kernel/traps.c:237) 
kern :warn : [  625.967409] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) 
kern :warn : [  625.967914] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) 
kern :warn : [  625.968445] ? __mod_node_page_state (mm/vmstat.c:393) 
kern :warn : [  625.969014] __populate_section_memmap (mm/sparse-vmemmap.c:475) 
kern :warn : [  625.969591] ? kasan_set_track (mm/kasan/common.c:52) 
kern :warn : [  625.970103] sparse_add_section (mm/sparse.c:867 mm/sparse.c:907) 
kern :warn : [  625.970628] ? sparse_buffer_alloc (mm/sparse.c:897) 
kern :warn : [  625.971177] __add_pages (mm/memory_hotplug.c:403) 
kern :warn : [  625.971650] add_pages (arch/x86/mm/init_64.c:956) 
kern :warn : [  625.972113] pagemap_range (mm/memremap.c:250) 
kern :warn : [  625.972609] ? memremap_compat_align (mm/memremap.c:163) 
kern :warn : [  625.973162] ? percpu_ref_init (arch/x86/include/asm/atomic64_64.h:20 include/linux/atomic/atomic-arch-fallback.h:2602 include/linux/atomic/atomic-long.h:79 include/linux/atomic/atomic-instrumented.h:3196 lib/percpu-refcount.c:98) 
kern :warn : [  625.973678] memremap_pages (mm/memremap.c:367) 
kern :warn : [  625.974187] ? pagemap_range (mm/memremap.c:292) 
kern :warn : [  625.974697] ? kasan_set_track (mm/kasan/common.c:52) 
kern :warn : [  625.975209] ? __kmalloc_node_track_caller (include/trace/events/kmem.h:54 include/trace/events/kmem.h:54 mm/slab_common.c:1024 mm/slab_common.c:1043) 
kern :warn : [  625.975802] dmirror_allocate_chunk (include/linux/err.h:72 lib/test_hmm.c:552) test_hmm
kern :warn : [  625.976483] hmm_dmirror_init (lib/test_hmm.c:267) test_hmm
kern  :warn  : [  625.977092]  ? 0xffffffffc14b1000
kern :warn : [  625.977539] do_one_initcall (init/main.c:1232) 
kern :warn : [  625.978044] ? trace_event_raw_event_initcall_level (init/main.c:1223) 
kern :warn : [  625.978718] ? kasan_unpoison (mm/kasan/shadow.c:160 mm/kasan/shadow.c:194) 
kern :warn : [  625.979261] do_init_module (kernel/module/main.c:2530) 
kern :warn : [  625.979761] load_module (kernel/module/main.c:2981) 
kern :warn : [  625.980267] ? post_relocation (kernel/module/main.c:2830) 
kern :warn : [  625.980782] ? kernel_read_file (arch/x86/include/asm/atomic.h:53 include/linux/atomic/atomic-arch-fallback.h:979 include/linux/atomic/atomic-instrumented.h:436 include/linux/fs.h:2740 fs/kernel_read_file.c:122) 
kern :warn : [  625.981318] ? __x64_sys_fspick (fs/kernel_read_file.c:38) 
kern :warn : [  625.981858] ? init_module_from_file (kernel/module/main.c:3148) 
kern :warn : [  625.982408] init_module_from_file (kernel/module/main.c:3148) 
kern :warn : [  625.982959] ? __ia32_sys_init_module (kernel/module/main.c:3124) 
kern :warn : [  625.983508] ? __lock_release+0x111/0x440 
kern :warn : [  625.984078] ? idempotent_init_module (kernel/module/main.c:3094 kernel/module/main.c:3159) 
kern :warn : [  625.984743] ? idempotent_init_module (kernel/module/main.c:3094 kernel/module/main.c:3159) 
kern :warn : [  625.985347] ? do_raw_spin_unlock (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:444 include/linux/atomic/atomic-instrumented.h:33 include/asm-generic/qspinlock.h:57 kernel/locking/spinlock_debug.c:100 kernel/locking/spinlock_debug.c:140) 
kern :warn : [  625.985895] idempotent_init_module (kernel/module/main.c:3165) 
kern :warn : [  625.986448] ? init_module_from_file (kernel/module/main.c:3152) 
kern :warn : [  625.987029] ? security_capable (security/security.c:946 (discriminator 13)) 
kern :warn : [  625.987540] __x64_sys_finit_module (include/linux/file.h:45 kernel/module/main.c:3187 kernel/module/main.c:3169 kernel/module/main.c:3169) 
kern :warn : [  625.988090] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) 
kern :warn : [  625.988576] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) 
kern  :warn  : [  625.989174] RIP: 0033:0x7fca352005a9
kern :warn : [ 625.989645] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 08 0d 00 f7 d8 64 89 01 48
All code
========
   0:	08 89 e8 5b 5d c3    	or     %cl,-0x3ca2a418(%rcx)
   6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
   d:	00 00 00 
  10:	90                   	nop
  11:	48 89 f8             	mov    %rdi,%rax
  14:	48 89 f7             	mov    %rsi,%rdi
  17:	48 89 d6             	mov    %rdx,%rsi
  1a:	48 89 ca             	mov    %rcx,%rdx
  1d:	4d 89 c2             	mov    %r8,%r10
  20:	4d 89 c8             	mov    %r9,%r8
  23:	4c 8b 4c 24 08       	mov    0x8(%rsp),%r9
  28:	0f 05                	syscall 
  2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
  30:	73 01                	jae    0x33
  32:	c3                   	retq   
  33:	48 8b 0d 27 08 0d 00 	mov    0xd0827(%rip),%rcx        # 0xd0861
  3a:	f7 d8                	neg    %eax
  3c:	64 89 01             	mov    %eax,%fs:(%rcx)
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   6:	73 01                	jae    0x9
   8:	c3                   	retq   
   9:	48 8b 0d 27 08 0d 00 	mov    0xd0827(%rip),%rcx        # 0xd0837
  10:	f7 d8                	neg    %eax
  12:	64 89 01             	mov    %eax,%fs:(%rcx)
  15:	48                   	rex.W


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231117/202311171013.fb3e52d3-oliver.sang@intel.com
