Message ID | 20240212213922.783301-1-surenb@google.com (mailing list archive)
---|---
Series | Memory allocation profiling
On Mon, Feb 12, 2024 at 4:39 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Memory allocation profiling, v3 and final:
>
> Overview:
> Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> kernels, overhead low enough to be deployed in production.
>
> We're aiming to get this in the next merge window, for 6.9. The feedback
> we've gotten has been that even out of tree this patchset has already
> been useful, and there's a significant amount of other work gated on the
> code tagging functionality included in this patchset [2].
>
> Example output:
> root@moria-kvm:~# sort -h /proc/allocinfo|tail
>   3.11MiB     2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode
>   3.52MiB      225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node
>   3.75MiB      960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
>   4.00MiB        2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio
>   10.5MiB      168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs
>   14.0MiB     3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof
>   26.8MiB     6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof
>   64.5MiB    98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init
>   98.7MiB    25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof
>    125MiB     7357 mm/slub.c:2201 module:slub func:alloc_slab_page

This kind of memory profiling would be an incredible asset in cloud
environments. Over the past year, we've encountered several kernel
memory overhead issues. Two particularly severe cases involved
excessively large IOMMU page tables (20GB per machine) and IOVA
magazines (up to 8GB). Considering thousands of machines were affected,
the cumulative memory waste was huge.

While we eventually resolved these issues with custom kernel profiling
hacks (some based on this series) and kdump analysis, comprehensive
memory profiling would have significantly accelerated the diagnostic
process, pinpointing the precise source of the allocations.
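[Each /proc/allocinfo line pairs a total size and a call count with one
call site, so it can be aggregated further with ordinary text tools. A
minimal sketch, assuming the v3-era format shown above (size in column 1,
call count in column 2, and a "module:<name>" tag present); field
positions may differ in other versions of the series:

	# Sum allocation call counts per module from /proc/allocinfo.
	awk '{
		for (i = 3; i <= NF; i++)
			if ($i ~ /^module:/)
				count[$i] += $2
	}
	END {
		for (m in count)
			print count[m], m
	}' /proc/allocinfo | sort -rn | head
]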
On Mon, Feb 12, 2024 at 01:38:46PM -0800, Suren Baghdasaryan wrote:
> Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> kernels, overhead low enough to be deployed in production.

What's the plan for things like devm_kmalloc() and similar relatively
simple wrappers? I was thinking it would be possible to reimplement at
least devm_kmalloc() with a size- and flags-changing helper a while back:

https://lore.kernel.org/all/202309111428.6F36672F57@keescook/

I suspect it could be possible to adapt the alloc_hooks wrapper in this
series similarly:

#define alloc_hooks_prep(_do_alloc, _do_prepare, _do_finish,	\
			 ctx, size, flags)			\
({								\
	DEFINE_ALLOC_TAG(_alloc_tag, _old);			\
	ssize_t _size = (size);					\
	size_t _usable = _size;					\
	gfp_t _flags = (flags);					\
	typeof(_do_alloc(_size, _flags)) _res;			\
								\
	_res = _do_prepare(ctx, &_size, &_flags);		\
	if (!IS_ERR_OR_NULL(_res))				\
		_res = _do_alloc(_size, _flags);		\
	if (!IS_ERR_OR_NULL(_res))				\
		_res = _do_finish(ctx, _usable, _size, _flags, _res); \
	_res;							\
})

#define devm_kmalloc(dev, size, flags)				\
	alloc_hooks_prep(kmalloc, devm_alloc_prep, devm_alloc_finish, \
			 dev, size, flags)

And devm_alloc_prep() and devm_alloc_finish() adapted from the URL
above.

And _do_finish instances could be marked with __realloc_size(2)

-Kees
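[To make the last point concrete, the annotation might look as follows
on a finish helper. The prototype is hypothetical, inferred only from
how the macro above invokes _do_finish(ctx, _usable, _size, _flags,
_res); it is not code from the series or from the linked patch:

	/*
	 * Hypothetical prototype.  The point is __realloc_size(2): it
	 * tells the compiler that the returned object's usable size is
	 * given by argument 2 (_usable), so dynamic object-size checks
	 * keep working through the wrapper.
	 */
	void *devm_alloc_finish(struct device *dev, size_t usable,
				ssize_t size, gfp_t flags, void *p)
				__realloc_size(2);
]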
On Mon, Feb 12, 2024 at 4:29 PM Kees Cook <keescook@chromium.org> wrote:
>
> What's the plan for things like devm_kmalloc() and similar relatively
> simple wrappers? I was thinking it would be possible to reimplement at
> least devm_kmalloc() with a size- and flags-changing helper a while back:
>
> https://lore.kernel.org/all/202309111428.6F36672F57@keescook/
>
> [...]
>
> And _do_finish instances could be marked with __realloc_size(2)

devm_kmalloc() is definitely a great candidate to account separately.
Looks like it's currently using alloc_dr()->kmalloc_node_track_caller(),
so this series will account the internal kmalloc_node_track_caller()
allocation. We can easily apply alloc_hooks to devm_kmalloc() and
friends and replace the kmalloc_node_track_caller() call inside
alloc_dr() with kmalloc_node_track_caller_noprof(). That will move
accounting directly to devm_kmalloc().
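[A sketch of the change described above, following the series'
convention of renaming the callee to a _noprof variant and accounting at
the outermost wrapper. This is illustrative only, not a patch from the
series, and the alloc_dr() body is abbreviated:

	/* include/linux/device.h: account at the devm_kmalloc() call site. */
	void *devm_kmalloc_noprof(struct device *dev, size_t size, gfp_t gfp);
	#define devm_kmalloc(...) alloc_hooks(devm_kmalloc_noprof(__VA_ARGS__))

	/* drivers/base/devres.c */
	static __always_inline struct devres *alloc_dr(dr_release_t release,
						       size_t size, gfp_t gfp,
						       int nid)
	{
		size_t tot_size;
		struct devres *dr;

		if (unlikely(check_add_overflow(sizeof(*dr), size, &tot_size)))
			return NULL;

		/*
		 * _noprof: this internal call site no longer defines its
		 * own tag, so the allocation is charged to the tag set by
		 * alloc_hooks() at the devm_kmalloc() caller instead.
		 */
		dr = kmalloc_node_track_caller_noprof(tot_size, gfp, nid);
		if (unlikely(!dr))
			return NULL;

		/* ... initialize the devres node as before ... */
		return dr;
	}
]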
On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
[...]
> We're aiming to get this in the next merge window, for 6.9. The feedback
> we've gotten has been that even out of tree this patchset has already
> been useful, and there's a significant amount of other work gated on the
> code tagging functionality included in this patchset [2].

I suspect it will not come as a surprise that I really dislike the
implementation proposed here. I will not repeat my arguments, I have
done so on several occasions already.

Anyway, I didn't go as far as to nak it even though I _strongly_ believe
this debugging feature will add a maintenance overhead for a very long
time. I can live with all the downsides of the proposed implementation
_as long as_ there is a wider agreement from the MM community, as this
is where the maintenance cost will be paid. So far I have not seen
(m)any acks by MM developers, so aiming for the next merge window is
more than a little rushed.

> 81 files changed, 2126 insertions(+), 695 deletions(-)
On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Anyway, I didn't go as far as to nak it even though I _strongly_ believe
> this debugging feature will add a maintenance overhead for a very long
> time. I can live with all the downsides of the proposed implementation
> _as long as_ there is a wider agreement from the MM community, as this
> is where the maintenance cost will be paid. So far I have not seen
> (m)any acks by MM developers, so aiming for the next merge window is
> more than a little rushed.

We tried other previously proposed approaches and all have their
downsides without making maintenance much easier. Your position is
understandable and I think it's fair. Let's see if others see more
benefit than cost here.
Thanks,
Suren.
On 13.02.24 22:58, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <mhocko@suse.com> wrote:
>> [...]
>> So far I have not seen (m)any acks by MM developers, so aiming for the
>> next merge window is more than a little rushed.
>
> We tried other previously proposed approaches and all have their
> downsides without making maintenance much easier. Your position is
> understandable and I think it's fair. Let's see if others see more
> benefit than cost here.

Would it make sense to discuss that at LSF/MM once again, especially
covering why proposed alternatives did not work out? LSF/MM is not "too
far" away (May).

I recall that the last LSF/MM session on this topic was a bit
unfortunate (IMHO not as productive as it could have been). Maybe we can
finally reach a consensus on this.
On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
> Would it make sense to discuss that at LSF/MM once again, especially
> covering why proposed alternatives did not work out? LSF/MM is not "too
> far" away (May).
>
> I recall that the last LSF/MM session on this topic was a bit
> unfortunate (IMHO not as productive as it could have been). Maybe we can
> finally reach a consensus on this.

I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
need to see a serious proposal - what we had at the last LSF was people
jumping in with half-baked alternative proposals that very much hadn't
been thought through, and I see no need to repeat that.

Like I mentioned, there's other work gated on this patchset; if people
want to hold this up for more discussion they'd better be putting forth
something to discuss.
On 13.02.24 23:09, Kent Overstreet wrote:
> I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> need to see a serious proposal - what we had at the last LSF was people
> jumping in with half-baked alternative proposals that very much hadn't
> been thought through, and I see no need to repeat that.
>
> Like I mentioned, there's other work gated on this patchset; if people
> want to hold this up for more discussion they'd better be putting forth
> something to discuss.

I'm thinking of ways on how to achieve Michal's request: "as long as
there is a wider agreement from the MM community". If we can achieve
that without LSF, great! (A bi-weekly MM meeting might also be an
option.)
On Tue, Feb 13, 2024 at 11:17:32PM +0100, David Hildenbrand wrote:
> I'm thinking of ways on how to achieve Michal's request: "as long as
> there is a wider agreement from the MM community". If we can achieve
> that without LSF, great! (A bi-weekly MM meeting might also be an
> option.)

A meeting wouldn't be out of the question, _if_ there is an agenda, but:
what's that coffee mug say? I just survived another meeting that
could've been an email?

What exactly is the outcome we're looking for?

Is there info that people are looking for? I think we summed things up
pretty well in the cover letter; if there are specifics that people want
to discuss, that's why we emailed the series out.

There's people in this thread who've used this patchset in production
and diagnosed real issues (gigabytes of memory gone missing, I heard the
other day); I'm personally looking for them to chime in on this thread
(Johannes, Pasha).
If it's just grumbling about "maintenance overhead" we need to get past - well, people are going to have to accept that we can't deliver features without writing code, and I'm confident that the hooking in particular is about as clean as it's going to get, _regardless_ of toolchain support; and moreover it addresses what's been historically a pretty gaping hole in our ability to profile and understand the code we write.
On Tue, Feb 13, 2024 at 2:17 PM David Hildenbrand <david@redhat.com> wrote:
>
> I'm thinking of ways on how to achieve Michal's request: "as long as
> there is a wider agreement from the MM community". If we can achieve
> that without LSF, great! (A bi-weekly MM meeting might also be an
> option.)

There will be a maintenance burden even with the cleanest proposed
approach. We worked hard to make the patchset as clean as possible, and
if the benefits still don't outweigh the maintenance cost then we should
probably stop trying. At LSF/MM I would rather discuss functional
issues/requirements/improvements than alternative approaches to
instrumenting allocators. I'm happy to arrange a separate meeting with
MM folks if that would help to progress on the cost/benefit decision.
On 13.02.24 23:30, Suren Baghdasaryan wrote:
> There will be a maintenance burden even with the cleanest proposed
> approach.

Yes.

> We worked hard to make the patchset as clean as possible, and if the
> benefits still don't outweigh the maintenance cost then we should
> probably stop trying.

Indeed.

> At LSF/MM I would rather discuss functional
> issues/requirements/improvements than alternative approaches to
> instrumenting allocators. I'm happy to arrange a separate meeting with
> MM folks if that would help to progress on the cost/benefit decision.

Note that I am only proposing ways forward.

If you think you can easily achieve what Michal requested without all
that, good.

My past experience was that LSF/MM / bi-weekly MM meetings were
extremely helpful to reach consensus.
On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
> On 13.02.24 23:30, Suren Baghdasaryan wrote:
>> [...]
>
> Note that I am only proposing ways forward.
>
> If you think you can easily achieve what Michal requested without all
> that, good.

He requested something?
On 13.02.24 23:50, Kent Overstreet wrote:
> On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
>> [...]
>>
>> If you think you can easily achieve what Michal requested without all
>> that, good.
>
> He requested something?

This won't get merged without acks from MM people.
On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
>> [...]
>>
>> If you think you can easily achieve what Michal requested without all
>> that, good.
>
> He requested something?

Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
possible until the compiler feature is developed and deployed. And it
would still require changes to the headers, so I don't think it's worth
delaying the feature for years.
On 13.02.24 23:59, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
>> [...]
>>
>> He requested something?
>
> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> would still require changes to the headers, so I don't think it's worth
> delaying the feature for years.

I was talking about this: "I can live with all the downsides of the
proposed implementation _as long as_ there is a wider agreement from the
MM community as this is where the maintenance cost will be paid. So far
I have not seen (m)any acks by MM developers."

I certainly cannot be motivated at this point to review and ack this;
unfortunately, too much negative energy around here.
On Tue, Feb 13, 2024 at 02:59:11PM -0800, Suren Baghdasaryan wrote:
> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> would still require changes to the headers, so I don't think it's worth
> delaying the feature for years.

Hang on, let's look at the actual code. This is what instrumenting an
allocation function looks like:

#define krealloc_array(...) alloc_hooks(krealloc_array_noprof(__VA_ARGS__))

IOW, we have to:
- rename krealloc_array to krealloc_array_noprof
- replace krealloc_array with a one-line wrapper macro

Is this really all we're getting worked up over?

The renaming we need regardless, because the thing that makes this
approach efficient enough to run in production is that we account at
_one_ point in the callstack; we don't save entire backtraces. And thus
we need to explicitly annotate which one that is, which means we need
_noprof() versions of functions for when the accounting is done by an
outer wrapper (e.g. mempool).

And, as I keep saying: that alloc_hooks() macro will also get us _per
callsite fault injection points_, and we really need that because - if
you guys have been paying attention to other threads - whenever moving
more stuff to PF_MEMALLOC_* flags comes up (including adding
PF_MEMALLOC_NORECLAIM), the issue of small allocations not failing and
not being testable keeps coming up.
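[For context, the alloc_hooks() wrapper referred to here has roughly the
following shape. This is a simplified sketch; the exact macro differs
between versions of the series:

	/*
	 * DEFINE_ALLOC_TAG() is assumed to emit one static struct
	 * alloc_tag per call site into a dedicated section and make it
	 * the current tag (saving the previous one in _old); the
	 * slab/page allocator core then charges the allocation to the
	 * current tag.
	 */
	#define alloc_hooks(_do_alloc)					\
	({								\
		DEFINE_ALLOC_TAG(_alloc_tag, _old); /* per-callsite tag */ \
		typeof(_do_alloc) _res = _do_alloc; /* real allocation */  \
		alloc_tag_restore(&_alloc_tag, _old); /* pop the tag */    \
		_res;							\
	})

Because every wrapped call site owns a static tag, the same spot is a
natural place to hang a per-callsite fault injection knob, which is the
point made above.]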
On Tue, Feb 13, 2024 at 05:29:03PM -0500, Kent Overstreet wrote:
> On Tue, Feb 13, 2024 at 11:17:32PM +0100, David Hildenbrand wrote:
>> Would it make sense to discuss that at LSF/MM once again, especially
>> covering why proposed alternatives did not work out? LSF/MM is not
>> "too far" away (May).

You want to stall this effort for *three months* to schedule a meeting?

I would love to have better profiling of memory allocations inside XFS
so that we can answer questions like "What's going on with these
allocation stalls?" or "What memory is getting used, and where?" more
quickly than we can now.

Right now I get to stare at tracepoints and printk crap until my eyes
bleed, and maybe dchinner comes to my rescue and figures out what's
going on sooner than I do. More often we just never figure it out
because only the customer can reproduce the problem, the reams of data
produced by ftrace are unmanageable, and BPF isn't always available.

I'm not thrilled by the large increase in macro crap in the allocation
paths, but I don't know of a better way to instrument things. Our
attempts to use _RET_IP_ in XFS to instrument its code paths have never
worked quite right w.r.t. inlined functions and whatnot.

>> I recall that the last LSF/MM session on this topic was a bit
>> unfortunate (IMHO not as productive as it could have been). Maybe we
>> can finally reach a consensus on this.

From my outsider's perspective, nine months have gone by since the last
LSF. Who has come up with a cleaner/better/faster way to do what Suren
and Kent have done? Were those code changes integrated into this
patchset? Or why not? Most of what I saw in 2023 involved compiler
changes (great; now I have to wait until RHEL 11/Debian 14 to use it)
and/or still utilized fugly macros.

Recalling all the way back to suggestions made in 2022, who wrote the
prototype for doing this via ftrace? Or BPF? How well did that go for
counting allocation events and the like? I saw Tejun saying something
about how they use BPF aggressively inside Meta, but that was about it.
Were any of those solutions significantly better than what's in front
of us here?

I get it, a giant patch forcing everyone to know the difference between
alloc_foo and alloc_foo_noprof offends my (your?) stylistic
sensibilities. On the other side, making analysis easier during customer
escalations means we kernel people get data, answers, and solutions
sooner instead of watching all our time get eaten up on L4 support and
backporting hell.

> A meeting wouldn't be out of the question, _if_ there is an agenda,
> but: what's that coffee mug say? I just survived another meeting that
> could've been an email?

I congratulate you on your memory of my kitchen mug. Yes, that's what it
says.

> What exactly is the outcome we're looking for?
>
> Is there info that people are looking for? I think we summed things up
> pretty well in the cover letter; if there are specifics that people
> want to discuss, that's why we emailed the series out.
>
> There's people in this thread who've used this patchset in production
> and diagnosed real issues (gigabytes of memory gone missing, I heard
> the other day); I'm personally looking for them to chime in on this
> thread (Johannes, Pasha).
>
> If it's just grumbling about "maintenance overhead" we need to get past
> - well, people are going to have to accept that we can't deliver
> features without writing code, and I'm confident that the hooking in
> particular is about as clean as it's going to get, _regardless_ of
> toolchain support; and moreover it addresses what's been historically a
> pretty gaping hole in our ability to profile and understand the code we
> write.

Are you and Suren willing to pay whatever maintenance overhead there is?

--D
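[To illustrate the _RET_IP_ problem mentioned above: _RET_IP_ is just
__builtin_return_address(0), so what it identifies depends on the
compiler's inlining decisions. A minimal sketch; the helper below is
hypothetical, not actual XFS code:

	#include <linux/kernel.h> /* _RET_IP_, trace_printk() */

	/* Hypothetical tracing helper -- not from the XFS tree. */
	static inline void trace_my_alloc(size_t size)
	{
		/*
		 * When this helper is a real call, _RET_IP_ points into
		 * its immediate caller, which is the desired call site.
		 * But if the compiler inlines it into foo(), _RET_IP_
		 * becomes foo()'s own return address -- i.e. foo()'s
		 * caller -- and the attribution silently jumps a frame
		 * depending on optimization.  A static per-callsite tag,
		 * as in this series, has no such ambiguity.
		 */
		trace_printk("alloc %zu from %pS\n", size, (void *)_RET_IP_);
	}
]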
On Wed, Feb 14, 2024 at 12:02:30AM +0100, David Hildenbrand wrote:
> On 13.02.24 23:59, Suren Baghdasaryan wrote:
>> [...]
>
> I was talking about this: "I can live with all the downsides of the
> proposed implementation _as long as_ there is a wider agreement from
> the MM community as this is where the maintenance cost will be paid. So
> far I have not seen (m)any acks by MM developers."
>
> I certainly cannot be motivated at this point to review and ack this;
> unfortunately, too much negative energy around here.

David, this kind of reaction is exactly why I was telling Andrew I was
going to submit this as a direct pull request to Linus.

This is an important feature; if we can't stay focused on the technical
side and get it done, that's what I'll do.
On 14.02.24 00:12, Kent Overstreet wrote:
> On Wed, Feb 14, 2024 at 12:02:30AM +0100, David Hildenbrand wrote:
>> [...]
>>
>> I certainly cannot be motivated at this point to review and ack this;
>> unfortunately, too much negative energy around here.
>
> David, this kind of reaction is exactly why I was telling Andrew I was
> going to submit this as a direct pull request to Linus.
>
> This is an important feature; if we can't stay focused on the technical
> side and get it done, that's what I'll do.

Kent, I started this with "Would it make sense" in an attempt to help
Suren and you to finally make progress with this, one way or the other.
I know that there were ways in the past to get the MM community to agree
on such things.

I tried to be helpful, finding ways *not having to* bypass the MM
community to get MM stuff merged. The reply I got is mostly negative
energy.

So you don't need my help here, understood. But I will fight against any
attempts to bypass the MM community.
On Tue, Feb 13, 2024 at 03:11:15PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 13, 2024 at 05:29:03PM -0500, Kent Overstreet wrote:
> > On Tue, Feb 13, 2024 at 11:17:32PM +0100, David Hildenbrand wrote:
[...]
> > > > > Would it make sense to discuss that at LSF/MM once again, especially
> > > > > covering why proposed alternatives did not work out? LSF/MM is not "too far"
> > > > > away (May).
>
> You want to stall this effort for *three months* to schedule a meeting?
>
> I would love to have better profiling of memory allocations inside XFS
> so that we can answer questions like "What's going on with these
> allocation stalls?" or "What memory is getting used, and where?" more
> quickly than we can now.
>
> Right now I get to stare at tracepoints and printk crap until my eyes
> bleed, and maybe dchinner comes to my rescue and figures out what's
> going on sooner than I do. More often we just never figure it out
> because only the customer can reproduce the problem, the reams of data
> produced by ftrace is unmanageable, and BPF isn't always available.
>
> I'm not thrilled by the large increase in macro crap in the allocation
> paths, but I don't know of a better way to instrument things. Our
> attempts to use _RET_IP in XFS to instrument its code paths have never
> worked quite right w.r.t. inlined functions and whatnot.
>
> > > > > I recall that the last LSF/MM session on this topic was a bit unfortunate
> > > > > (IMHO not as productive as it could have been). Maybe we can finally reach a
> > > > > consensus on this.
>
> From my outsider's perspective, nine months have gone by since the last
> LSF. Who has come up with a cleaner/better/faster way to do what Suren
> and Kent have done? Were those code changes integrated into this
> patchset? Or why not?
>
> Most of what I saw in 2023 involved compiler changes (great; now I have
> to wait until RHEL 11/Debian 14 to use it) and/or still utilize fugly
> macros.
>
> Recalling all the way back to suggestions made in 2022, who wrote the
> prototype for doing this via ftrace? Or BPF? How well did that go for
> counting allocation events and the like? I saw Tejun saying something
> about how they use BPF aggressively inside Meta, but that was about it.
>
> Were any of those solutions significantly better than what's in front of
> us here?
>
> I get it, a giant patch forcing everyone to know the difference between
> alloc_foo and alloc_foo_noprof offends my (yours?) stylistic
> sensibilities. On the other side, making analysis easier during
> customer escalations means we kernel people get data, answers, and
> solutions sooner instead of watching all our time get eaten up on L4
> support and backporting hell.
>
> > > > I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> > > > need to see a serious proposal - what we had at the last LSF was people
> > > > jumping in with half baked alternative proposals that very much hadn't
> > > > been thought through, and I see no need to repeat that.
> > > >
> > > > Like I mentioned, there's other work gated on this patchset; if people
> > > > want to hold this up for more discussion they better be putting forth
> > > > something to discuss.
> > >
> > > I'm thinking of ways on how to achieve Michal's request: "as long as there
> > > is a wider agreement from the MM community". If we can achieve that without
> > > LSF, great! (a bi-weekly MM meeting might also be an option)
> >
> > A meeting wouldn't be out of the question, _if_ there is an agenda, but:
> >
> > What's that coffee mug say? I just survived another meeting that
> > could've been an email?
>
> I congratulate you on your memory of my kitchen mug. Yes, that's what
> it says.
>
> > What exactly is the outcome we're looking for?
> >
> > Is there info that people are looking for? I think we summed things up
> > pretty well in the cover letter; if there are specifics that people
> > want to discuss, that's why we emailed the series out.
> >
> > There's people in this thread who've used this patchset in production
> > and diagnosed real issues (gigabytes of memory gone missing, I heard the
> > other day); I'm personally looking for them to chime in on this thread
> > (Johannes, Pasha).
> >
> > If it's just grumbling about "maintenance overhead" we need to get past
> > - well, people are going to have to accept that we can't deliver
> > features without writing code, and I'm confident that the hooking in
> > particular is about as clean as it's going to get, _regardless_ of
> > toolchain support; and moreover it addresses what's been historically a
> > pretty gaping hole in our ability to profile and understand the code we
> > write.
>
> Are you and Suren willing to pay whatever maintenance overhead there is?

I'm still wondering what this supposed "maintenance overhead" is going
to be...

As I use this patch series I occasionally notice places where a bunch of
memory is being accounted to one line of code when it would be better
accounted to a caller - but then it's just a couple lines of code to fix
that. You switch that callsite to the _noprof() version of whatever
allocation it's doing, then add an alloc_hooks() wrapper at the place
you do want it accounted.

That doesn't really feel like overhead to me, just the normal tweaking
of your tools to get the most out of them.
I will continue to do that for the code I'm looking at, yes. If other people are doing that too, it'll be because they're also using memory allocation profiling and finding it valuable. I did notice earlier that we're still lacking documentation in the Documentation/ directory; the workflow for "how you shift accounting to the right spot" is something that should go in there.
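To make that workflow concrete, here is a minimal sketch of the fixup Kent
describes. The helper foo_create() and struct foo are made up for
illustration; kmalloc_noprof() and alloc_hooks() follow the naming pattern
used by the series:

	/* Hypothetical helper: before the fixup, every allocation it makes
	 * is accounted to the single kmalloc() line inside it, hiding the
	 * real consumers in /proc/allocinfo. */

	/* Step 1: switch the internal callsite to the _noprof() variant and
	 * rename the helper, so it no longer claims the allocation itself. */
	static inline struct foo *foo_create_noprof(gfp_t gfp)
	{
		return kmalloc_noprof(sizeof(struct foo), gfp);
	}

	/* Step 2: re-add foo_create() as an alloc_hooks() wrapper, so a
	 * separate codetag is created at - and the memory accounted to -
	 * each caller of foo_create(). */
	#define foo_create(_gfp)	alloc_hooks(foo_create_noprof(_gfp))

Each callsite of foo_create() then shows up as its own file:line entry in
/proc/allocinfo.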
On Tue, Feb 13, 2024 at 3:22 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.02.24 00:12, Kent Overstreet wrote:
[...]
>
> Kent, I started this with "Would it make sense" in an attempt to help
> Suren and you to finally make progress with this, one way or the other.
> I know that there were ways in the past to get the MM community to agree
> on such things.
>
> I tried to be helpful, finding ways *not having to* bypass the MM
> community to get MM stuff merged.
>
> The reply I got is mostly negative energy.
>
> So you don't need my help here, understood.
>
> But I will fight against any attempts to bypass the MM community.

Well, I'm definitely not trying to bypass the MM community, that's why
this patchset is posted. Not sure why people can't voice their opinion
on the benefit/cost balance of the patchset over the email... But if a
meeting would be more productive I'm happy to set it up.

>
> --
> Cheers,
>
> David / dhildenb
>
> > I tried to be helpful, finding ways *not having to* bypass the MM > > community to get MM stuff merged. > > > > The reply I got is mostly negative energy. > > > > So you don't need my help here, understood. > > > > But I will fight against any attempts to bypass the MM community. > > Well, I'm definitely not trying to bypass the MM community, that's why > this patchset is posted. Not sure why people can't voice their opinion > on the benefit/cost balance of the patchset over the email... But if a > meeting would be more productive I'm happy to set it up. Discussing these concerns during the next available MM Alignment session makes sense. At a minimum, Suren and Kent can present their reasons for believing the current approach is superior to the previously proposed alternatives. However, delaying the discussion and feature merge until after LSF/MM seems unnecessary. As I mentioned earlier in this thread, we've already leveraged the concepts within this feature to debug unexplained memory overhead, saving us many terabytes of memory. This was just the initial benefit; we haven't even explored its full potential to track every allocation path. Pasha
On Tue, Feb 13, 2024 at 06:54:09PM -0500, Pasha Tatashin wrote:
> > > I tried to be helpful, finding ways *not having to* bypass the MM
> > > community to get MM stuff merged.
> > >
> > > The reply I got is mostly negative energy.
> > >
> > > So you don't need my help here, understood.
> > >
> > > But I will fight against any attempts to bypass the MM community.
> >
> > Well, I'm definitely not trying to bypass the MM community, that's why
> > this patchset is posted. Not sure why people can't voice their opinion
> > on the benefit/cost balance of the patchset over the email... But if a
> > meeting would be more productive I'm happy to set it up.
>
> Discussing these concerns during the next available MM Alignment
> session makes sense. At a minimum, Suren and Kent can present their
> reasons for believing the current approach is superior to the
> previously proposed alternatives.

Hang on though: I believe we did so adequately within this thread -
both in the cover letter, and where I further outlined exactly what the
hooks need to do and cited the exact code.

Nobody seems to be complaining about the specifics, so I'm not sure
what would be on the agenda?
I'll do a more thorough code review, but before the discussion gets too
sidetracked, I wanted to add my POV on the overall merit of the
direction that is being proposed here.

I have backported and used this code for debugging production issues
before. Logging into a random host with an unfamiliar workload and
being able to get a reliable, comprehensive list of kernel memory
consumers is one of the coolest things I have seen in a long time.
This is a huge improvement to sysadmin quality of life.

It's also a huge improvement for MM developers. We're the first points
of contact for memory regressions that can be caused by pretty much any
driver or subsystem in the kernel.

I encourage anybody who is undecided on whether this is worth doing to
build a kernel with these patches applied and run it on their own
machine. I think you'll be surprised what you'll find - and how myopic
and uninformative /proc/meminfo feels in comparison to this. Did you
know there is a lot more to modern filesystems than the VFS objects we
are currently tracking? :) Then imagine what this looks like on a
production host running a complex mix of filesystems, enterprise
networking, bpf programs, gpus and accelerators etc.

Backporting the code to a slightly older production kernel wasn't too
difficult. The instrumentation layering is explicit, clean, and fairly
centralized, so resolving minor conflicts around the _noprof renames
and the wrappers was pretty straight-forward.

When we talk about maintenance cost, a fair shake would be to weigh it
against the cost and reliability of our current method: evaluating
consumers in the kernel on a case-by-case basis and annotating the
alloc/free sites by hand; then quibbling with the MM community about
whether that consumer is indeed significant enough to warrant an entry
in /proc/meminfo, and what the catchiest name for the stat would be.

I think we can agree that this is vastly less scalable and more
burdensome than central annotations around a handful of mostly static
allocator entry points. Especially considering the rate of change in
the kernel as a whole, and that not everybody will think of the
comprehensive MM picture when writing a random driver. And I think
that's generous - we don't even have the network stack in meminfo.

So I think what we do now isn't working. In the Meta fleet, at any
given time the p50 for unaccounted kernel memory is several gigabytes
per host. The p99 is between 15% and 30% of total memory. That's a
looot of opaque resource usage we have to accept on faith.

For hunting down regressions, all it takes is one untracked consumer in
the kernel to really throw a wrench into things. It's difficult to find
in the noise with tracing, and if it's not growing after an initial
allocation spike, you're pretty much out of luck finding it at all.
Raise your hand if you've written a drgn script to walk pfns and try to
guess consumers from the state of struct page :)

I agree we should discuss how the annotations are implemented on a
technical basis, but my take is that we need something like this. In a
codebase of our size, I don't think the allocator should be handing out
memory without some basic implied tracking of where it's going. It's a
liability for production environments, and it can hide bad memory
management decisions in drivers and other subsystems for a very long
time.
On 14.02.24 00:28, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 3:22 PM David Hildenbrand <david@redhat.com> wrote:
[...]
>
> Well, I'm definitely not trying to bypass the MM community, that's why
> this patchset is posted. Not sure why people can't voice their opinion
> on the benefit/cost balance of the patchset over the email... But if a
> meeting would be more productive I'm happy to set it up.

If you can get the acks without any additional meetings, great. The
replies from Pasha and Johannes are encouraging, let's hope core
memory-allocator people will voice their opinion here.

If you come to the conclusion that another meeting would help get
maintainers' attention and sort out some of the remaining concerns,
feel free to schedule a meeting with Dave R. I suspect only the slot
next week is already taken. In the past, we also had "special" meetings
just for things to make progress faster. If you're looking for ideas on
what the agenda of such a meeting could look like, I'll happily discuss
that with you off-list.

v2 was more than 3 months ago. If it's really about minor details here,
waiting another 3 months for LSF/MM is indeed not reasonable.

Myself, I'll be happy not having to sit through another LSF/MM session
of that kind.
The level of drama is exceptional and I'm hoping it won't be the new norm in the MM space. Good luck!
On 2/14/24 00:08, Kent Overstreet wrote:
> On Tue, Feb 13, 2024 at 02:59:11PM -0800, Suren Baghdasaryan wrote:
>> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
>> <kent.overstreet@linux.dev> wrote:
[...]
>> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
>> possible until the compiler feature is developed and deployed. And it
>> still would require changes to the headers, so don't think it's worth
>> delaying the feature for years.
>
> Hang on, let's look at the actual code.
>
> This is what instrumenting an allocation function looks like:
>
> #define krealloc_array(...)	alloc_hooks(krealloc_array_noprof(__VA_ARGS__))
>
> IOW, we have to:
>  - rename krealloc_array to krealloc_array_noprof
>  - replace krealloc_array with a one wrapper macro call
>
> Is this really all we're getting worked up over?
>
> The renaming we need regardless, because the thing that makes this
> approach efficient enough to run in production is that we account at
> _one_ point in the callstack, we don't save entire backtraces.
>
> And thus we need to explicitly annotate which one that is; which means
> we need _noprof() versions of functions for when the accounting is done
> by an outer wrapper (e.g. mempool).
>
> And, as I keep saying: that alloc_hooks() macro will also get us _per
> callsite fault injection points_, and we really need that because - if
> you guys have been paying attention to other threads - whenever moving
> more stuff to PF_MEMALLOC_* flags comes up (including adding
> PF_MEMALLOC_NORECLAIM), the issue of small allocations not failing and
> not being testable keeps coming up.

How exactly do you envision the fault injection to help here? The proposals
are about scoping via a process flag, and the process may then call just
about anything under that scope. So if our tool is per callsite fault
injection points, how do we know which callsites to enable to focus the
fault injection on the particular scope?
On Tue 13-02-24 14:59:11, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
[...]
> > > If you think you can easily achieve what Michal requested without all that,
> > > good.
> >
> > He requested something?
>
> Yes, a cleaner instrumentation.

Nope, not really. You have indicated you want to target this version
for the _next_ merge window without any acks, really. If you want to go
forward with this then you should gain support from the MM community at
least.

Why? Because the whole macro layering is adding maintenance cost for MM
people. I have expressed why I absolutely hate the additional macro
layer. We have been through similar layers of macros in other areas
(not to mention the page allocator interface itself) and it has
_always_ turned out a bad idea long term. I do not see why this case
should be any different.

The whole kernel is moving to a dynamic tracing realm and now we are
going to build a statically macro-based tracing infrastructure which
will need tweaking anytime real memory consumers are one layer up the
existing macro infrastructure (do not forget quite a lot of allocations
are in library functions) and/or we need to modify the allocator API in
some way. Call me unimpressed!

Now, I fully recognize that the solution doesn't really have to be
perfect in order to be useful. Hence I never NAKed it even though I
really _dislike_ the approach. I expected you would grow community
support over time if this is indeed the only feasible approach, but
that is not reflected in the series posted here. If you find that
support I will not stand in the way.

> Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> still would require changes to the headers, so don't think it's worth
> delaying the feature for years.

I am pretty sure you have invested non-trivial time into evaluating
other ways, yet your cover letter is rather modest about any details:

: - looked at alternate hooking methods.
:   There were suggestions on alternate methods (compiler attribute,
:   trampolines), but they wouldn't have made the patchset any cleaner
:   (we still need to have different function versions for accounting vs. no
:   accounting to control at which point in a call chain the accounting
:   happens), and they would have added a dependency on toolchain
:   support.

First immediate question would be: What about page_owner? I do remember
the runtime overhead being discussed but I do not really remember any
actual numbers outside of artificial workloads. Has this been
investigated? Is our stack unwinder the problem? Etc.

Also, what are the biggest obstacles to efficiently tracking
allocations via our tracing infrastructure? Has this been investigated?
What were the conclusions?
On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
[...]
> I agree we should discuss how the annotations are implemented on a
> technical basis, but my take is that we need something like this.

I do not think there is any disagreement on usefulness of a better
memory allocation tracking. At least for me the primary problem is the
implementation. At LSF/MM last year we heard that existing tracing
infrastructure hasn't really been explored much. Cover letter doesn't
really talk much about those alternatives so it is really hard to
evaluate whether the proposed solution is indeed our best way to
approach this.

> In a codebase of our size, I don't think the allocator should be
> handing out memory without some basic implied tracking of where it's
> going. It's a liability for production environments, and it can hide
> bad memory management decisions in drivers and other subsystems for a
> very long time.

Fully agreed! It is quite common to see oom reports with a large
portion of memory unaccounted, and this really presents an additional
cost on the debugging side.
On Tue, Feb 13, 2024 at 06:08:45PM -0500, Kent Overstreet wrote: > This is what instrumenting an allocation function looks like: > > #define krealloc_array(...) alloc_hooks(krealloc_array_noprof(__VA_ARGS__)) > > IOW, we have to: > - rename krealloc_array to krealloc_array_noprof > - replace krealloc_array with a one wrapper macro call > > Is this really all we're getting worked up over? > > The renaming we need regardless, because the thing that makes this > approach efficient enough to run in production is that we account at > _one_ point in the callstack, we don't save entire backtraces. I'm probably going to regret getting involved in this thread, but since Suren already decided to put me on the cc ... There might be a way to do it without renaming. We have a bit of the linker script called SCHED_TEXT which lets us implement in_sched_functions(). ie we could have the equivalent of include/linux/sched/debug.h:#define __sched __section(".sched.text") perhaps #define __memalloc __section(".memalloc.text") which would do all the necessary magic to know where the backtrace should stop.
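For what it's worth, the predicate Matthew is alluding to could look
roughly like the sketch below, modeled on in_sched_functions(). The
.memalloc.text section and its boundary symbols are hypothetical and would
need matching vmlinux.lds.h plumbing:

	/* Hypothetical boundary symbols the linker script would emit
	 * around a new .memalloc.text section (by analogy with SCHED_TEXT). */
	extern char __memalloc_text_start[], __memalloc_text_end[];

	#define __memalloc __section(".memalloc.text")

	/* Analogue of in_sched_functions(): a backtrace walker could use
	 * this to skip frames belonging to annotated allocator wrappers
	 * and stop at the real caller. */
	static inline bool in_memalloc_functions(unsigned long addr)
	{
		return addr >= (unsigned long)__memalloc_text_start &&
		       addr <  (unsigned long)__memalloc_text_end;
	}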
On Wed, Feb 14, 2024 at 03:46:33PM +0100, Michal Hocko wrote:
> On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
> [...]
> > I agree we should discuss how the annotations are implemented on a
> > technical basis, but my take is that we need something like this.
>
> I do not think there is any disagreement on usefulness of a better
> memory allocation tracking. At least for me the primary problem is the
> implementation. At LSF/MM last year we heard that existing tracing
> infrastructure hasn't really been explored much. Cover letter doesn't
> really talk much about those alternatives so it is really hard to
> evaluate whether the proposed solution is indeed our best way to
> approach this.

Michal, we covered this before.

To do this with tracing you'd have to build up data structures
separately, in userspace, that would mirror the allocator's data
structures; you would have to track every single allocation so that you
could match up the free event to the place it was allocated.

Even if it could be built, which I doubt, it'd be completely non-viable
because the performance would be terrible.

Like I said, we covered all this before; if you're going to spend so
much time on these threads you really should be making a better attempt
at keeping up with what's been talked about.
On Wed, Feb 14, 2024 at 03:00:00PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 13, 2024 at 06:08:45PM -0500, Kent Overstreet wrote:
> > This is what instrumenting an allocation function looks like:
> >
> > #define krealloc_array(...)	alloc_hooks(krealloc_array_noprof(__VA_ARGS__))
> >
> > IOW, we have to:
> >  - rename krealloc_array to krealloc_array_noprof
> >  - replace krealloc_array with a one wrapper macro call
> >
> > Is this really all we're getting worked up over?
> >
> > The renaming we need regardless, because the thing that makes this
> > approach efficient enough to run in production is that we account at
> > _one_ point in the callstack, we don't save entire backtraces.
>
> I'm probably going to regret getting involved in this thread, but since
> Suren already decided to put me on the cc ...
>
> There might be a way to do it without renaming. We have a bit of the
> linker script called SCHED_TEXT which lets us implement
> in_sched_functions(). ie we could have the equivalent of
>
> include/linux/sched/debug.h:#define __sched	__section(".sched.text")
>
> perhaps #define __memalloc __section(".memalloc.text")
> which would do all the necessary magic to know where the backtrace
> should stop.

Could we please try to get through the cover letter before proposing
alternatives? I already explained there why we need the renaming.

In addition, you can't create the per-callsite codetag with linker
magic; you need the macro for that.

Instead of citing myself again, I'm just going to post what I was
working on last night for the documentation directory:

.. SPDX-License-Identifier: GPL-2.0

===========================
MEMORY ALLOCATION PROFILING
===========================

Low overhead (suitable for production) accounting of all memory allocations,
tracked by file and line number.

Usage:
kconfig options:
 - CONFIG_MEM_ALLOC_PROFILING
 - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
 - CONFIG_MEM_ALLOC_PROFILING_DEBUG
   adds warnings for allocations that weren't accounted because of a
   missing annotation

sysctl:
  /proc/sys/vm/mem_profiling

Runtime info:
  /proc/allocinfo

Example output:
  root@moria-kvm:~# sort -h /proc/allocinfo|tail
   3.11MiB     2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode
   3.52MiB      225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node
   3.75MiB      960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
   4.00MiB        2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio
   10.5MiB      168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs
   14.0MiB     3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof
   26.8MiB     6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof
   64.5MiB    98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init
   98.7MiB    25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof
    125MiB     7357 mm/slub.c:2201 module:slub func:alloc_slab_page

Theory of operation:

Memory allocation profiling builds off of code tagging, which is a library
for declaring static structs (that typically describe a file and line number
in some way, hence code tagging) and then finding and operating on them at
runtime - i.e. iterating over them to print them in debugfs/procfs.

To add accounting for an allocation call, we replace it with a macro
invocation, alloc_hooks(), that
 - declares a code tag
 - stashes a pointer to it in task_struct
 - calls the real allocation function
 - and finally, restores the task_struct alloc tag pointer to its previous
   value.
This allows for alloc_hooks() calls to be nested, with the most recent
one taking effect. This is important for allocations internal to the
mm/ code that do not properly belong to the outer allocation context
and should be counted separately: for example, slab object extension
vectors, or when the slab allocates pages from the page allocator.

Thus, proper usage requires determining which function in an allocation
call stack should be tagged. There are many helper functions that
essentially wrap e.g. kmalloc() and do a little more work, then are
called in multiple places; we'll generally want the accounting to
happen in the callers of these helpers, not in the helpers themselves.

To fix up a given helper, for example foo(), do the following:
 - switch its allocation call to the _noprof() version, e.g. kmalloc_noprof()
 - rename it to foo_noprof()
 - define a macro version of foo() like so:

   #define foo(...)	alloc_hooks(foo_noprof(__VA_ARGS__))
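For readers following along, a simplified sketch of what an alloc_hooks()
invocation expands to, paraphrased from the series (the helper names track
the patchset, but the details are abbreviated):

	/* Simplified from the series: declares the static codetag for this
	 * file:line and swaps it into current->alloc_tag around the real
	 * allocation call. */
	#define alloc_hooks(_do_alloc)					\
	({								\
		typeof(_do_alloc) _res;					\
		/* static struct alloc_tag in the "alloc_tags" section,	\
		 * plus _old = alloc_tag_save(&_alloc_tag) to stash the	\
		 * outer tag */						\
		DEFINE_ALLOC_TAG(_alloc_tag, _old);			\
									\
		_res = _do_alloc;	/* e.g. kmalloc_noprof(...) */	\
		/* restore the outer context so nested alloc_hooks()	\
		 * calls keep accounting to their own callsites */	\
		alloc_tag_restore(&_alloc_tag, _old);			\
		_res;							\
	})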
On Wed 14-02-24 10:01:14, Kent Overstreet wrote:
> On Wed, Feb 14, 2024 at 03:46:33PM +0100, Michal Hocko wrote:
> > On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
> > [...]
> > > I agree we should discuss how the annotations are implemented on a
> > > technical basis, but my take is that we need something like this.
> >
> > I do not think there is any disagreement on usefulness of a better
> > memory allocation tracking. At least for me the primary problem is the
> > implementation. At LSF/MM last year we heard that existing tracing
> > infrastructure hasn't really been explored much. Cover letter doesn't
> > really talk much about those alternatives so it is really hard to
> > evaluate whether the proposed solution is indeed our best way to
> > approach this.
>
> Michal, we covered this before.

It is a good practice to summarize previous discussions in the cover
letter. Especially when there are different approaches discussed over a
longer time period or when the topic is controversial.

I do not see anything like that here. Neither for the existing tracing
infrastructure, page_owner, nor the performance concerns discussed
before, etc. Look, I do not want to nitpick or insist on formalisms,
but having those data points laid out would make any further discussion
much smoother.
On Wed, Feb 14, 2024 at 05:02:28PM +0100, Michal Hocko wrote:
> On Wed 14-02-24 10:01:14, Kent Overstreet wrote:
> > On Wed, Feb 14, 2024 at 03:46:33PM +0100, Michal Hocko wrote:
> > > On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
> > > [...]
> > > > I agree we should discuss how the annotations are implemented on a
> > > > technical basis, but my take is that we need something like this.
> > >
> > > I do not think there is any disagreement on usefulness of a better
> > > memory allocation tracking. At least for me the primary problem is the
> > > implementation. At LSF/MM last year we heard that existing tracing
> > > infrastructure hasn't really been explored much. Cover letter doesn't
> > > really talk much about those alternatives so it is really hard to
> > > evaluate whether the proposed solution is indeed our best way to
> > > approach this.
> >
> > Michal, we covered this before.
>
> It is a good practice to summarize previous discussions in the cover
> letter. Especially when there are different approaches discussed over a
> longer time period or when the topic is controversial.
>
> I do not see anything like that here. Neither for the existing tracing
> infrastructure, page_owner, nor the performance concerns discussed
> before, etc. Look, I do not want to nitpick or insist on formalisms,
> but having those data points laid out would make any further discussion
> much smoother.

You don't want to nitpick???

Look, you've been consistently sidestepping the technical discussion;
it seems all you want to talk about is process or "your nack". If we're
going to have a technical discussion, it's incumbent upon all of us to
/keep the focus on the technical/; that is everyone's responsibility.

I'm not going to write a 20 page cover letter and recap every dead end
that was proposed. That would be a lot of useless crap for everyone to
wade through. I'm going to summarize the important stuff, and keep the
focus on what we're doing and documenting it.

If you want to take part in a discussion, it's your responsibility to
be reading with comprehension and finding useful things to say.

You gotta stop with this derailing garbage.
On Wed 14-02-24 11:17:20, Kent Overstreet wrote:
[...]
> You gotta stop with this derailing garbage.
It is always a pleasure talking to you Kent, but let me give you some
advice (free of charge, of course). Let Suren talk; the chances for a
civilized and productive discussion are much higher!
I do not have much more to add to the discussion. My point stands: find
support from the MM community if you want to proceed with this work.
On Wed, Feb 14, 2024 at 11:20:26AM +0100, Vlastimil Babka wrote:
> On 2/14/24 00:08, Kent Overstreet wrote:
> > And, as I keep saying: that alloc_hooks() macro will also get us _per
> > callsite fault injection points_, and we really need that because - if
> > you guys have been paying attention to other threads - whenever moving
> > more stuff to PF_MEMALLOC_* flags comes up (including adding
> > PF_MEMALLOC_NORECLAIM), the issue of small allocations not failing and
> > not being testable keeps coming up.
>
> How exactly do you envision the fault injection to help here? The proposals
> are about scoping via a process flag, and the process may then call just
> about anything under that scope. So if our tool is per callsite fault
> injection points, how do we know which callsites to enable to focus the
> fault injection on the particular scope?

So the question with fault injection is - how do we integrate it into
our existing tests? We need fault injection that we can integrate into
our existing tests because that's the only way to get the code coverage
we need - writing new tests that cover all the error paths isn't going
to happen, and wouldn't work as well anyways.

But the trouble with injecting memory allocation failures is that
they'll result in errors bubbling up to userspace, and in unpredictable
ways.

We _definitely_ cannot enable random memory allocation faults for the
entire kernel at runtime - or rather we _could_, and that would
actually be great to do as a side project; but that's not something we
can do in our existing automated tests because the results will be
completely unpredictable. If we did that the goal would be to just make
sure the kernel doesn't explode - but what we actually want is for our
automated pass/fail tests to still pass; we need to constrain what will
fail.

So we need at a minimum to be able to only enable memory allocation
failures for the code we're interested in testing (file/module) -
enabling memory allocation failures in some other random subsystem
we're not developing or looking at isn't what we want.

Beyond that, it's very much subsystem dependent. For bcachefs, my main
strategy has been to flip on random (1%) memory allocation failures
after the filesystem has mounted. During startup, we do a ton of
allocations (I cover those with separate tests), but after startup we
should be able to run normally in the presence of allocation failures
without ever returning an error to userspace - so that's what I'm
trying to test.
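Per-callsite fault injection isn't part of this series yet; purely as a
sketch of the idea Kent describes, the existing hook point could gate the
real allocation on a per-codetag knob. Every name below beyond the
alloc_hooks() machinery itself is hypothetical:

	/* Hypothetical: each codetag grows a fault-injection knob, toggled
	 * via debugfs per file/module, so failures can be constrained to
	 * the code under test rather than the whole kernel. */
	struct alloc_tag_fault {
		bool	enabled;	/* enabled for this callsite? */
		u8	probability;	/* fail this percentage of calls */
	};

	#define alloc_hooks_inject(_do_alloc)				\
	({								\
		typeof(_do_alloc) _res;					\
		DEFINE_ALLOC_TAG(_alloc_tag, _old);			\
									\
		if (should_fail_alloc_tag(&_alloc_tag))	/* hypothetical */ \
			_res = NULL;	/* inject failure here */	\
		else							\
			_res = _do_alloc;				\
		alloc_tag_restore(&_alloc_tag, _old);			\
		_res;							\
	})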
On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:

> > > If you think you can easily achieve what Michal requested without all that,
> > > good.
> >
> > He requested something?
>
> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> still would require changes to the headers, so don't think it's worth
> delaying the feature for years.

Can we please be told much more about this compiler feature?
Description of what it is, what it does, how it will affect this kernel
feature, etc.

Who is developing it and when can we expect it to become available?

Will we be able to migrate to it without back-compatibility concerns?
(I think "you need quite recent gcc for memory profiling" is
reasonable).

Because: if the maintainability issues which Michal describes will be
significantly addressed with the gcc support then we're kinda reviewing
the wrong patchset. Yes, it may be a maintenance burden initially, but
at some (yet to be revealed) time in the future, this will be addressed
with the gcc support?
On Wed, Feb 14, 2024 at 8:31 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 14-02-24 11:17:20, Kent Overstreet wrote:
> [...]
> > You gotta stop with this derailing garbage.
>
> It is always a pleasure talking to you Kent, but let me give you some
> advice (free of charge, of course). Let Suren talk; the chances for a
> civilized and productive discussion are much higher!

Every time I wake up to a new drama... Sorry, I won't follow up on this
one, so as not to feed the fire.

>
> I do not have much more to add to the discussion. My point stands: find
> support from the MM community if you want to proceed with this work.
> --
> Michal Hocko
> SUSE Labs
On Wed, Feb 14, 2024 at 8:55 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > > If you think you can easily achieve what Michal requested without all that,
> > > > good.
> > >
> > > He requested something?
> >
> > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> > possible until the compiler feature is developed and deployed. And it
> > still would require changes to the headers, so don't think it's worth
> > delaying the feature for years.
>
> Can we please be told much more about this compiler feature?
> Description of what it is, what it does, how it will affect this kernel
> feature, etc.

Sure. The compiler support will be in the form of a new __attribute__;
a simplified example:

// generate data for the wrapper
static void _alloc_tag() {
    static struct alloc_tag _alloc_tag __section ("alloc_tags") = {
        .ct = CODE_TAG_INIT,
        .counter = 0 };
}

static inline int wrapper (const char *name, int x,
                           int (*callee) (const char *, int),
                           struct alloc_tag *callsite_data)
{
    callsite_data->counter++;
    printf ("Call #%d from %s:%d (%s)\n", callsite_data->counter,
            callsite_data->ct.filename, callsite_data->ct.lineno,
            callsite_data->ct.function);
    int ret = callee (name, x);
    printf ("Returned: %d\n", ret);
    return ret;
}

__attribute__((annotate("callsite_wrapped_by", wrapper, _alloc_tag)))
int foo(const char* name, int x);

int foo(const char* name, int x)
{
    printf ("Hello %s, %d!\n", name, x);
    return x;
}

This lets us attach the wrapper to a function without changing its
name, while preserving the namespace (it applies only to functions with
that name, not everything else). Note that we will still need _noprof
versions of the allocators.

> Who is developing it and when can we expect it to become available?

Aleksei Vetrov (google) with the help of Nick Desaulniers (google).
Both are CC'ed on this email. After several iterations Aleksei has a
POC which we are evaluating
(https://github.com/llvm/llvm-project/compare/main...noxwell:llvm-project:callsite-wrapper-tree-transform).
Once it's in good shape we are going to engage with the Clang and GCC
communities to get it upstreamed. When it will become available and
when the distributions will pick it up is anybody's guess. Upstreaming
is usually a lengthy process.

> Will we be able to migrate to it without back-compatibility concerns?
> (I think "you need quite recent gcc for memory profiling" is
> reasonable).

The migration should be quite straight-forward, replacing the macros
with functions with that attribute.

> Because: if the maintainability issues which Michal describes will be
> significantly addressed with the gcc support then we're kinda reviewing
> the wrong patchset. Yes, it may be a maintenance burden initially, but
> at some (yet to be revealed) time in the future, this will be addressed
> with the gcc support?

That's what I'm aiming for. I just don't want this placed on hold until
the compiler support is widely available, which might take years.
On Wed, Feb 14, 2024 at 08:55:48AM -0800, Andrew Morton wrote:
> On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > > If you think you can easily achieve what Michal requested without all that,
> > > > good.
> > >
> > > He requested something?
> >
> > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> > possible until the compiler feature is developed and deployed. And it
> > still would require changes to the headers, so don't think it's worth
> > delaying the feature for years.
>
> Can we please be told much more about this compiler feature?
> Description of what it is, what it does, how it will affect this kernel
> feature, etc.
>
> Who is developing it and when can we expect it to become available?
>
> Will we be able to migrate to it without back-compatibility concerns?
> (I think "you need quite recent gcc for memory profiling" is
> reasonable).
>
> Because: if the maintainability issues which Michal describes will be
> significantly addressed with the gcc support then we're kinda reviewing
> the wrong patchset. Yes, it may be a maintenance burden initially, but
> at some (yet to be revealed) time in the future, this will be addressed
> with the gcc support?

Even if we had compiler magic, after considering it more I don't think
the patchset would be improved by it - I would still prefer to stick
with the macro approach.

There's also a lot of unresolved questions about whether the compiler
approach would even end up being what we need; we need macro expansion
to happen in the caller of the allocation function, and that's another
level of hooking that I don't think the compiler people are even
considering yet, since cpp runs before the main part of the compiler;
if C macros worked and were implemented more like Rust macros I'm sure
it could be done - in fact, I think this could all be done in Rust
_without_ any new compiler support - but in C, this is a lot to ask.

Let's look at the instrumentation again. There's two steps:
 - Renaming the original function to _noprof
 - Adding a hooked version of the original function.

We need to do the renaming regardless of what approach we take in order
to correctly handle allocations that happen inside the context of an
existing alloc tag hook but should not be accounted to the outer
context; we do that by selecting the alloc_foo() or alloc_foo_noprof()
version as appropriate. It's important to get this right; consider slab
object extension vectors or the slab allocator allocating pages from
the page allocator.

Second step, adding a hooked version of the original function. We do
that with

  #define alloc_foo(...) alloc_hooks(alloc_foo_noprof(__VA_ARGS__))

That's pretty clean, if you ask me. The only way to make it more
succinct would be if it were possible for a C macro to define a new
macro; then it could be just

  alloc_fn(alloc_foo);

But honestly, the former is probably preferable anyways from a
ctags/cscope POV.
On Mon, Feb 12, 2024 at 01:38:46PM -0800, Suren Baghdasaryan wrote:
> Memory allocation, v3 and final:
Would be nice to have --base added to cover letter. The very first patch
can't be applied on today's Linux Next.
On Wed, Feb 14, 2024 at 10:44 AM Andy Shevchenko <andy@black.fi.intel.com> wrote:
>
> On Mon, Feb 12, 2024 at 01:38:46PM -0800, Suren Baghdasaryan wrote:
> > Memory allocation, v3 and final:
>
> Would be nice to have --base added to cover letter. The very first patch
> can't be applied on today's Linux Next.

Sorry about that. It was based on Linus' ToT at the time of posting
(7521f258ea30 Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm). I also applied
it to mm-unstable with only one trivial merge conflict in one of the
patches.

>
> --
> With Best Regards,
> Andy Shevchenko
>
>
On Mon, 2024-02-12 at 13:38 -0800, Suren Baghdasaryan wrote:
> Memory allocation, v3 and final:
>
> Overview:
> Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> kernels, overhead low enough to be deployed in production.
>
[...]
>
> Notes:
>
> [1]: Overhead
> To measure the overhead we are comparing the following configurations:
> (1) Baseline with CONFIG_MEMCG_KMEM=n
> (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
> CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)
> (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
> CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y)
> (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
> CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
> (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
>

Thanks for the work on this patchset; it is quite useful.
A clarification question on the data:

I assume Config (2), (3) and (4) have CONFIG_MEMCG_KMEM=n, right?
If so, do you have similar data for config (2), (3) and (4) but with
CONFIG_MEMCG_KMEM=y for comparison with (5)?

Tim

> Performance overhead:
> To evaluate performance we implemented an in-kernel test executing
> multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> affinity set to a specific CPU to minimize the noise.
Below are results > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on > 56 core Intel Xeon: > > kmalloc pgalloc > (1 baseline) 6.764s 16.902s > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) >
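The benchmark module itself is not included in the series; a rough sketch of what the kmalloc/kfree half of such an in-kernel test might look like, based purely on the description above (the function name, iteration count and reporting are illustrative guesses, not the actual test code):

    static noinline void bench_kmalloc(unsigned long iters)
    {
    	u64 start = ktime_get_ns();
    	unsigned long i;
    	size_t size;

    	for (i = 0; i < iters; i++) {
    		/* allocation sizes growing from 8 to 240 bytes */
    		for (size = 8; size <= 240; size += 8) {
    			void *p = kmalloc(size, GFP_KERNEL);

    			if (WARN_ON(!p))
    				return;
    			kfree(p);
    		}
    	}
    	pr_info("kmalloc/kfree: %llu ns\n", ktime_get_ns() - start);
    }

Per the setup described above, this would run pinned to a single CPU with the frequency fixed at maximum, so the deltas between configurations are attributable to the hooks rather than to scheduler or cpufreq noise.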
On Wed, Feb 14, 2024 at 10:54 AM Tim Chen <tim.c.chen@linux.intel.com> wrote: > > On Mon, 2024-02-12 at 13:38 -0800, Suren Baghdasaryan wrote: > > Memory allocation, v3 and final: > > > > Overview: > > Low overhead [1] per-callsite memory allocation profiling. Not just for debug > > kernels, overhead low enough to be deployed in production. > > > > We're aiming to get this in the next merge window, for 6.9. The feedback > > we've gotten has been that even out of tree this patchset has already > > been useful, and there's a significant amount of other work gated on the > > code tagging functionality included in this patchset [2]. > > > > Example output: > > root@moria-kvm:~# sort -h /proc/allocinfo|tail > > 3.11MiB 2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode > > 3.52MiB 225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node > > 3.75MiB 960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext > > 4.00MiB 2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio > > 10.5MiB 168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs > > 14.0MiB 3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof > > 26.8MiB 6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof > > 64.5MiB 98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init > > 98.7MiB 25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof > > 125MiB 7357 mm/slub.c:2201 module:slub func:alloc_slab_page > > > > Since v2: > > - tglx noticed a circular header dependency between sched.h and percpu.h; > > a bunch of header cleanups were merged into 6.8 to ameliorate this [3]. > > > > - a number of improvements, moving alloc_hooks() annotations to the > > correct place for better tracking (mempool), and bugfixes. > > > > - looked at alternate hooking methods. > > There were suggestions on alternate methods (compiler attribute, > > trampolines), but they wouldn't have made the patchset any cleaner > > (we still need to have different function versions for accounting vs. no > > accounting to control at which point in a call chain the accounting > > happens), and they would have added a dependency on toolchain > > support. > > > > Usage: > > kconfig options: > > - CONFIG_MEM_ALLOC_PROFILING > > - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT > > - CONFIG_MEM_ALLOC_PROFILING_DEBUG > > adds warnings for allocations that weren't accounted because of a > > missing annotation > > > > sysctl: > > /proc/sys/vm/mem_profiling > > > > Runtime info: > > /proc/allocinfo > > > > Notes: > > > > [1]: Overhead > > To measure the overhead we are comparing the following configurations: > > (1) Baseline with CONFIG_MEMCG_KMEM=n > > (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && > > CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) > > (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && > > CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) > > (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && > > CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) > > (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT > > > > Thanks for the work on this patchset; it is quite useful. > A clarification question on the data: > > I assume Config (2), (3) and (4) have CONFIG_MEMCG_KMEM=n, right? Yes, correct. > If so, do you have similar data for config (2), (3) and (4) but with > CONFIG_MEMCG_KMEM=y for comparison with (5)? 
I have data for these additional configs (I didn't think they were that important): (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y > > Tim > > > Performance overhead: > > To evaluate performance we implemented an in-kernel test executing > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU > > affinity set to a specific CPU to minimize the noise. Below are results > > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on > > 56 core Intel Xeon: > > > > kmalloc pgalloc > > (1 baseline) 6.764s 16.902s > > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) > > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) > > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) > > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 default disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 default enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) (6) shows a bit better performance than (5) but it's probably noise. I would expect them to be roughly the same. Hope this helps. > > > >
On Wed, Feb 14, 2024 at 9:52 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Wed, Feb 14, 2024 at 08:55:48AM -0800, Andrew Morton wrote: > > On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > If you think you can easily achieve what Michal requested without all that, > > > > > good. > > > > > > > > He requested something? > > > > > > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not > > > possible until the compiler feature is developed and deployed. And it > > > still would require changes to the headers, so don't think it's worth > > > delaying the feature for years. > > > > Can we please be told much more about this compiler feature? > > Description of what it is, what it does, how it will affect this kernel > > feature, etc. > > > > Who is developing it and when can we expect it to become available? > > > > Will we be able to migrate to it without back-compatibility concerns? > > (I think "you need quite recent gcc for memory profiling" is > > reasonable). > > > > > > > > Because: if the maintainability issues which Michal describes will be > > significantly addressed with the gcc support then we're kinda reviewing > > the wrong patchset. Yes, it may be a maintenance burden initially, but > > at some (yet to be revealed) time in the future, this will be addressed > > with the gcc support? > > Even if we had compiler magic, after considering it more I don't think > the patchset would be improved by it - I would still prefer to stick > with the macro approach. > > There's also a lot of unresolved questions about whether the compiler > approach would even end up being what we need; we need macro expansion to > happen in the caller of the allocation function For the record, that's what this attribute will be doing. So it should cover our use case. > , and that's another > level of hooking that I don't think the compiler people are even > considering yet, since cpp runs before the main part of the compiler; if > C macros worked and were implemented more like Rust macros I'm sure it > could be done - in fact, I think this could all be done in Rust > _without_ any new compiler support - but in C, this is a lot to ask. > > Let's look at the instrumentation again. There's two steps: > > - Renaming the original function to _noprof > - Adding a hooked version of the original function. > > We need to do the renaming regardless of what approach we take in order > to correctly handle allocations that happen inside the context of an > existing alloc tag hook but should not be accounted to the outer > context; we do that by selecting the alloc_foo() or alloc_foo_noprof() > version as appropriate. > > It's important to get this right; consider slab object extension > vectors or the slab allocator allocating pages from the page allocator. > > Second step, adding a hooked version of the original function. We do > that with > > #define alloc_foo(...) alloc_hooks(alloc_foo_noprof(__VA_ARGS__)) > > That's pretty clean, if you ask me. The only way to make it more succinct > would be if it were possible for a C macro to define a new macro, then it > could be just > > alloc_fn(alloc_foo); > > But honestly, the former is probably preferable anyways from a ctags/cscope POV.
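To picture what the alloc_hooks() wrapper in that second step does, here is a simplified sketch based on the theory-of-operation description later in this thread; alloc_tag_restore() stands in for the series' actual save/restore plumbing and is illustrative, not the in-tree definition:

    /*
     * Sketch only: declare a static code tag for this call site, make it
     * the task's current tag around the real allocation, then restore
     * the previously active (outer) tag.
     */
    #define alloc_hooks(_do_alloc)					\
    ({									\
    	DEFINE_ALLOC_TAG(_alloc_tag, _old);	/* tag + saved pointer */\
    	typeof(_do_alloc) _res = _do_alloc;	/* tag active here */	\
    	alloc_tag_restore(&_alloc_tag, _old);	/* back to outer ctx */	\
    	_res;								\
    })

Because the expansion happens at the call site of alloc_foo(), the tag's file:line identifies the caller, which is exactly what /proc/allocinfo reports.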
On Wed, Feb 14, 2024 at 11:24:23AM -0800, Suren Baghdasaryan wrote: > On Wed, Feb 14, 2024 at 9:52 AM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Wed, Feb 14, 2024 at 08:55:48AM -0800, Andrew Morton wrote: > > > On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > If you think you can easily achieve what Michal requested without all that, > > > > > > good. > > > > > > > > > > He requested something? > > > > > > > > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not > > > > possible until the compiler feature is developed and deployed. And it > > > > still would require changes to the headers, so don't think it's worth > > > > delaying the feature for years. > > > > > > Can we please be told much more about this compiler feature? > > > Description of what it is, what it does, how it will affect this kernel > > > feature, etc. > > > > > > Who is developing it and when can we expect it to become available? > > > > > > Will we be able to migrate to it without back-compatibility concerns? > > > (I think "you need quite recent gcc for memory profiling" is > > > reasonable). > > > > > > > > > > > > Because: if the maintainability issues which Michal describes will be > > > significantly addressed with the gcc support then we're kinda reviewing > > > the wrong patchset. Yes, it may be a maintenance burden initially, but > > > at some (yet to be revealed) time in the future, this will be addressed > > > with the gcc support? > > > > Even if we had compiler magic, after considering it more I don't think > > the patchset would be improved by it - I would still prefer to stick > > with the macro approach. > > > > There's also a lot of unresolved questions about whether the compiler > > approach would even end up being what we need; we need macro expansion to > > happen in the caller of the allocation function > > For the record, that's what this attribute will be doing. So it should > cover our use case. That wasn't clear in the meeting we had the other day; all that was discussed there was the attribute syntax, as I recall. So say that does work out (and I don't think that's a given; if I were a compiler person I don't think I'd be interested in this strange half macro, half inline function beast); all that would accomplish is getting rid of the need for the renaming - the _noprof() versions of functions. So then how do you distinguish where in the callstack the accounting happens? If you say "it happens at the outermost wrapper", then what happens is - Extra overhead for all the inner wrapper invocations, where they have to now check "actually, we already have an alloc tag, don't do anything". That's a cost, and given how much time we spent shaving cycles and branches during development it's not one we want. - Inner allocations that shouldn't be accounted to the outer context are now a major problem, because they will silently be accounted there and never noticed. With our approach, inner allocations are by default (i.e. when we haven't switched them to the _noprof() variant) accounted to their own alloc tag; that way, when we're reading the /proc/allocinfo output, we can examine them and check if they should be collapsed to the outer context. With the compiler approach they won't be seen. So no, we still don't want the compiler approach.
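To make the attribution rules concrete, consider a hypothetical inner helper (the names here are illustrative, not from the series):

    void *inner_alloc(size_t n)
    {
    	/* Plain kmalloc(): this call site gets its own tag and its own
    	 * row in /proc/allocinfo - the safe, reviewable default. */
    	return kmalloc(n, GFP_KERNEL);
    }

    void *inner_alloc_collapsed(size_t n)
    {
    	/* _noprof() variant: accounted to whatever outer alloc_hooks()
    	 * context is active, i.e. collapsed into the caller's tag. */
    	return kmalloc_noprof(n, GFP_KERNEL);
    }

With the macro approach a reviewer chooses between these two lines explicitly; the objection above is that an attribute which always hooks the outermost call would make the second behavior the silent default.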
> > > Performance overhead: > > > To evaluate performance we implemented an in-kernel test executing > > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation > > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU > > > affinity set to a specific CPU to minimize the noise. Below are results > > > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on > > > 56 core Intel Xeon: > > > > > > kmalloc pgalloc > > > (1 baseline) 6.764s 16.902s > > > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) > > > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) > > > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) > > > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) > > (6 default disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) > (7 default enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) I think these numbers are very interesting for folks that already use memcg. Specifically, the difference between 6 & 7, which seems to be ~0.85% and ~14.25%. IIUC, this means that the extra overhead is relatively much lower if someone is already using memcgs. > > (6) shows a bit better performance than (5) but it's probably noise. I > would expect them to be roughly the same. Hope this helps. > > > > > > > >
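For reference, those figures follow from the table above: (13.446 - 13.332) / 13.332 is roughly 0.86% for kmalloc and (54.963 - 48.105) / 48.105 is roughly 14.3% for pgalloc.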
On Wed, Feb 14, 2024 at 12:17 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > Performance overhead: > > > > To evaluate performance we implemented an in-kernel test executing > > > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation > > > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU > > > > affinity set to a specific CPU to minimize the noise. Below are results > > > > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on > > > > 56 core Intel Xeon: > > > > > > > > kmalloc pgalloc > > > > (1 baseline) 6.764s 16.902s > > > > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) > > > > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) > > > > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) > > > > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) > > > > (6 default disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) > > (7 default enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) > > I think these numbers are very interesting for folks that already use > memcg. Specifically, the difference between 6 & 7, which seems to be > ~0.85% and ~14.25%. IIUC, this means that the extra overhead is > relatively much lower if someone is already using memcgs. Well, yes, percentage-wise it's much lower. If you look at the absolute difference between 6 & 7 vs 2 & 3, it's quite close. > > > > > (6) shows a bit better performance than (5) but it's probably noise. I > > would expect them to be roughly the same. Hope this helps. > > > > > > > > > > > >
On Wed, 2024-02-14 at 12:30 -0800, Suren Baghdasaryan wrote: > On Wed, Feb 14, 2024 at 12:17 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > Performance overhead: > > > > > To evaluate performance we implemented an in-kernel test executing > > > > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation > > > > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU > > > > > affinity set to a specific CPU to minimize the noise. Below are results > > > > > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on > > > > > 56 core Intel Xeon: > > > > > > > > > > kmalloc pgalloc > > > > > (1 baseline) 6.764s 16.902s > > > > > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) > > > > > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) > > > > > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) > > > > > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) > > > > > > (6 default disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) > > > (7 default enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) > > > > I think these numbers are very interesting for folks that already use > > memcg. Specifically, the difference between 6 & 7, which seems to be > > ~0.85% and ~14.25%. IIUC, this means that the extra overhead is > > relatively much lower if someone is already using memcgs. > > Well, yes, percentage-wise it's much lower. If you look at the > absolute difference between 6 & 7 vs 2 & 3, it's quite close. > > > > > > > > > (6) shows a bit better performance than (5) but it's probably noise. I > > > would expect them to be roughly the same. Hope this helps. > > > > Thanks for the data. It does show that when memcg is already enabled, turning on profiling does not cost much extra overhead percentage-wise. Tim
On Mon, 12 Feb 2024, Suren Baghdasaryan <surenb@google.com> wrote: > Memory allocation, v3 and final: > > Overview: > Low overhead [1] per-callsite memory allocation profiling. Not just for debug > kernels, overhead low enough to be deployed in production. > > We're aiming to get this in the next merge window, for 6.9. The feedback > we've gotten has been that even out of tree this patchset has already > been useful, and there's a significant amount of other work gated on the > code tagging functionality included in this patchset [2]. I wonder if it wouldn't be too much trouble to write at least a brief overview document under Documentation/ describing what this is all about? Even as follow-up. People seeing the patch series have the benefit of the cover letter and the commit messages, but that's hardly documentation. We have all these great frameworks and tools but their discoverability to kernel developers isn't always all that great. BR, Jani.
On Fri, Feb 16, 2024 at 10:38:00AM +0200, Jani Nikula wrote: > On Mon, 12 Feb 2024, Suren Baghdasaryan <surenb@google.com> wrote: > > Memory allocation, v3 and final: > > > > Overview: > > Low overhead [1] per-callsite memory allocation profiling. Not just for debug > > kernels, overhead low enough to be deployed in production. > > > > We're aiming to get this in the next merge window, for 6.9. The feedback > > we've gotten has been that even out of tree this patchset has already > > been useful, and there's a significant amount of other work gated on the > > code tagging functionality included in this patchset [2]. > > I wonder if it wouldn't be too much trouble to write at least a brief > overview document under Documentation/ describing what this is all > about? Even as follow-up. People seeing the patch series have the > benefit of the cover letter and the commit messages, but that's hardly > documentation. > > We have all these great frameworks and tools but their discoverability > to kernel developers isn't always all that great. commit f589b48789de4b8f77bfc70b9f3ab2013c01eaf2 Author: Kent Overstreet <kent.overstreet@linux.dev> Date: Wed Feb 14 01:13:04 2024 -0500 memprofiling: Documentation Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> diff --git a/Documentation/mm/allocation-profiling.rst b/Documentation/mm/allocation-profiling.rst new file mode 100644 index 000000000000..d906e9360279 --- /dev/null +++ b/Documentation/mm/allocation-profiling.rst @@ -0,0 +1,68 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +MEMORY ALLOCATION PROFILING +=========================== + +Low overhead (suitable for production) accounting of all memory allocations, +tracked by file and line number. + +Usage: +kconfig options: + - CONFIG_MEM_ALLOC_PROFILING + - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT + - CONFIG_MEM_ALLOC_PROFILING_DEBUG + adds warnings for allocations that weren't accounted because of a + missing annotation + +sysctl: + /proc/sys/vm/mem_profiling + +Runtime info: + /proc/allocinfo + +Example output: + root@moria-kvm:~# sort -h /proc/allocinfo|tail + 3.11MiB 2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode + 3.52MiB 225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node + 3.75MiB 960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext + 4.00MiB 2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio + 10.5MiB 168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs + 14.0MiB 3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof + 26.8MiB 6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof + 64.5MiB 98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init + 98.7MiB 25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof + 125MiB 7357 mm/slub.c:2201 module:slub func:alloc_slab_page + + +Theory of operation: + +Memory allocation profiling builds off of code tagging, which is a library for +declaring static structs (that typically describe a file and line number in +some way, hence code tagging) and then finding and operating on them at runtime +- i.e. iterating over them to print them in debugfs/procfs. + +To add accounting for an allocation call, we replace it with a macro +invocation, alloc_hooks(), that + - declares a code tag + - stashes a pointer to it in task_struct + - calls the real allocation function + - and finally, restores the task_struct alloc tag pointer to its previous value. 
+ +This allows for alloc_hooks() calls to be nested, with the most recent one +taking effect. This is important for allocations internal to the mm/ code that +do not properly belong to the outer allocation context and should be counted +separately: for example, slab object extension vectors, or when the slab +allocates pages from the page allocator. + +Thus, proper usage requires determining which function in an allocation call +stack should be tagged. There are many helper functions that essentially wrap +e.g. kmalloc() and do a little more work, then are called in multiple places; +we'll generally want the accounting to happen in the callers of these helpers, +not in the helpers themselves. + +To fix up a given helper, for example foo(), do the following: + - switch its allocation call to the _noprof() version, e.g. kmalloc_noprof() + - rename it to foo_noprof() + - define a macro version of foo() like so: + #define foo(...) alloc_hooks(foo_noprof(__VA_ARGS__))
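Putting the document's three steps together for a small made-up helper (widget_alloc() and struct widget are purely illustrative; only the pattern comes from the documentation above):

    /* Before: every caller is accounted to the kmalloc() line inside the
     * helper, so all callers share a single /proc/allocinfo row. */
    struct widget *widget_alloc(gfp_t flags)
    {
    	return kmalloc(sizeof(struct widget), flags);
    }

    /* Steps 1 and 2: switch to the _noprof() allocator and rename. */
    struct widget *widget_alloc_noprof(gfp_t flags)
    {
    	return kmalloc_noprof(sizeof(struct widget), flags);
    }

    /* Step 3: the old name becomes the hooked macro, so each caller of
     * widget_alloc() is now tagged and accounted individually. */
    #define widget_alloc(...) alloc_hooks(widget_alloc_noprof(__VA_ARGS__))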
On Fri, 16 Feb 2024, Kent Overstreet <kent.overstreet@linux.dev> wrote: > On Fri, Feb 16, 2024 at 10:38:00AM +0200, Jani Nikula wrote: >> I wonder if it wouldn't be too much trouble to write at least a brief >> overview document under Documentation/ describing what this is all >> about? Even as follow-up. People seeing the patch series have the >> benefit of the cover letter and the commit messages, but that's hardly >> documentation. >> >> We have all these great frameworks and tools but their discoverability >> to kernel developers isn't always all that great. > > commit f589b48789de4b8f77bfc70b9f3ab2013c01eaf2 > Author: Kent Overstreet <kent.overstreet@linux.dev> > Date: Wed Feb 14 01:13:04 2024 -0500 > > memprofiling: Documentation > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Thanks! Wasn't part of this series and I wasn't aware it existed. BR, Jani.