Message ID: 20220422202644.799732-1-roman.gushchin@linux.dev (mailing list archive)
Series: mm: introduce shrinker debugfs interface
On Fri, Apr 22, 2022 at 01:26:37PM -0700, Roman Gushchin wrote:
> There are 50+ different shrinkers in the kernel, many with their own bells and whistles. Under the memory pressure the kernel applies some pressure on each of them in the order of which they were created/registered in the system. Some of them can contain only few objects, some can be quite large. Some can be effective at reclaiming memory, some not.
>
> The only existing debugging mechanism is a couple of tracepoints in do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't covering everything though: shrinkers which report 0 objects will never show up, there is no support for memcg-aware shrinkers. Shrinkers are identified by their scan function, which is not always enough (e.g. hard to guess which super block's shrinker it is having only "super_cache_scan").

In general, I've had no trouble identifying individual shrinker instances because I'm always looking at individual subsystem shrinker tracepoints, too. Hence I've almost always got the identification information in the traces I need: trace just the individual shrinker tracepoints, add a bit of sed/grep/awk, and I've got something I can feed to gnuplot or a python script to graph...

> They are a passive mechanism: there is no way to call into counting and scanning of an individual shrinker and profile it.

IDGI. Profiling shrinkers under ideal conditions when there isn't memory pressure is largely a useless exercise because execution patterns under memory pressure are vastly different.

All the problems with shrinkers show up when progress cannot be made as fast as memory reclaim wants memory to be reclaimed. How do you trigger priority windup causing large amounts of deferred processing because shrinkers are running in GFP_NOFS/GFP_NOIO context? How do you simulate objects getting dirtied in memory so they can't be immediately reclaimed so the shrinker can't make any progress at all until IO completes? How do you simulate the unbound concurrency that direct reclaim can drive into the shrinkers that causes massive lock contention on shared structures and locks that need to be accessed to free objects?

IOWs, if all you want to do is profile shrinkers running in the absence of memory pressure, then you can do that perfectly well with the existing 'echo 2 > /proc/sys/vm/drop_caches' mechanism. We don't need some complex debugfs API just to profile the shrinker behaviour.

So why do we need any of the complexity and potential for abuse that comes from exposing control of shrinkers directly to userspace like these patches do?

> To provide a better visibility and debug options for memory shrinkers this patchset introduces a /sys/kernel/debug/shrinker interface, to some extent similar to /sys/kernel/slab.

/sys/kernel/slab contains read-only usage information - it is analogous for visibility arguments, but it is not equivalent for the rest of the "active" functionality you want to add here....

> For each shrinker registered in the system a directory is created. The directory contains "count" and "scan" files, which allow to trigger count_objects() and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers count_memcg, scan_memcg, count_node, scan_node, count_memcg_node and scan_memcg_node are additionally provided. They allow to get per-memcg and/or per-node object count and shrink only a specific memcg/node.

Great, but why does the shrinker introspection interface need active scan control functions like these?
> To make debugging more pleasant, the patchset also names all shrinkers, so that debugfs entries can have more meaningful names.
>
> Usage examples:
>
> 1) List registered shrinkers:
> $ cd /sys/kernel/debug/shrinker/
> $ ls
> dqcache-16 sb-cgroup2-30 sb-hugetlbfs-33 sb-proc-41 sb-selinuxfs-22 sb-tmpfs-40 sb-zsmalloc-19
> kfree_rcu-0 sb-configfs-23 sb-iomem-12 sb-proc-44 sb-sockfs-8 sb-tmpfs-42 shadow-18
> sb-aio-20 sb-dax-11 sb-mqueue-21 sb-proc-45 sb-sysfs-26 sb-tmpfs-43 thp_deferred_split-10
> sb-anon_inodefs-15 sb-debugfs-7 sb-nsfs-4 sb-proc-47 sb-tmpfs-1 sb-tmpfs-46 thp_zero-9
> sb-bdev-3 sb-devpts-28 sb-pipefs-14 sb-pstore-31 sb-tmpfs-27 sb-tmpfs-49 xfs_buf-37
> sb-bpf-32 sb-devtmpfs-5 sb-proc-25 sb-rootfs-2 sb-tmpfs-29 sb-tracefs-13 xfs_inodegc-38
> sb-btrfs-24 sb-hugetlbfs-17 sb-proc-39 sb-securityfs-6 sb-tmpfs-35 sb-xfs-36 zspool-34

Ouch. That's not going to be useful for humans debugging a system as there's no way to cross-reference a "superblock" with an actual filesystem mount point. Nor is there any way to really know that all the shrinkers in one filesystem are related.

We normally solve this by ensuring that the fs-related object has the short bdev name appended to it. e.g:

$ pgrep xfs
1 I root 36 2 0 60 -20 - 0 - Apr19 ? 00:00:10 [kworker/0:1H-xfs-log/dm-3]
1 I root 679 2 0 60 -20 - 0 - Apr19 ? 00:00:00 [xfsalloc]
1 I root 680 2 0 60 -20 - 0 - Apr19 ? 00:00:00 [xfs_mru_cache]
1 I root 681 2 0 60 -20 - 0 - Apr19 ? 00:00:00 [xfs-buf/dm-1]
.....

Here we have a kworker process running log IO completion work on dm-3, two global workqueue rescuer tasks (alloc, mru) and a rescuer task for the xfs-buf workqueue on dm-1.

We need the same name discrimination for shrinker information here, too - just saying "this is an XFS superblock shrinker" is just not sufficient when there are hundreds of XFS mount points with a handful of shrinkers each.

> 2) Get information about a specific shrinker:
> $ cd sb-btrfs-24/
> $ ls
> count count_memcg count_memcg_node count_node scan scan_memcg scan_memcg_node scan_node
>
> 3) Count objects on the system/root cgroup level
> $ cat count
> 212
>
> 4) Count objects on the system/root cgroup level per numa node (on a 2-node machine)
> $ cat count_node
> 209 3

So a single space-separated line with a number per node?

When you have a few hundred nodes and hundreds of thousands of objects per node, we overrun the 4kB page size with a single line. What then?

> 5) Count objects for each memcg (output format: cgroup inode, count)
> $ cat count_memcg
> 1 212
> 20 96
> 53 817
> 2297 2
> 218 13
> 581 30
> 911 124
> <CUT>

What does "<CUT>" mean?

Also, this now iterates a separate memcg per line. A parser now needs to know the difference between count/count_node and count_memcg/count_memcg_node because they are subtly different file formats. These files should have the same format, otherwise it just creates needless complexity.

Indeed, why do we even need count/count_node? They are just the "index 1" memcg output, so are totally redundant.

> 6) Same but with a per-node output
> $ cat count_memcg_node
> 1 209 3
> 20 96 0
> 53 810 7
> 2297 2 0
> 218 13 0
> 581 30 0
> 911 124 0
> <CUT>

So now we have a hundred nodes in the machine and thousands of memcgs. And the information we want is in the numerically largest memcg that is last in the list. And we want to graph its behaviour over time at high resolution (say 1Hz). Now we burn huge amounts of CPU counting memcgs that we don't care about and then throw away most of the information.
That's highly inefficient and really doesn't scale.

[snip active scan interface]

This just seems like a solution looking for a problem to solve. Can you please describe the problem this infrastructure is going to solve?

Cheers,

Dave.
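For reference, the "count" and "scan" files under discussion map directly onto the two callbacks every shrinker already implements. A minimal sketch — with hypothetical "demo" names, and the pre-naming-patch register_shrinker() signature of the v5.18 era that this series modifies — looks roughly like this:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/shrinker.h>
#include <linux/atomic.h>

/* A bare counter standing in for a real LRU of cached objects. */
static atomic_long_t demo_nr_cached = ATOMIC_LONG_INIT(0);

/* The "count" side: how many objects could be freed right now? */
static unsigned long demo_count_objects(struct shrinker *shrink,
					struct shrink_control *sc)
{
	unsigned long nr = atomic_long_read(&demo_nr_cached);

	/* sc->nid and sc->memcg narrow the question for aware shrinkers */
	return nr ? nr : SHRINK_EMPTY;
}

/* The "scan" side: free up to sc->nr_to_scan objects, return number freed. */
static unsigned long demo_scan_objects(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	unsigned long freed;

	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;	/* wrong context: defer the work */

	freed = min_t(unsigned long, sc->nr_to_scan,
		      atomic_long_read(&demo_nr_cached));
	atomic_long_sub(freed, &demo_nr_cached);
	return freed;
}

static struct shrinker demo_shrinker = {
	.count_objects	= demo_count_objects,
	.scan_objects	= demo_scan_objects,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE,
};

static int __init demo_init(void)
{
	return register_shrinker(&demo_shrinker);
}

static void __exit demo_exit(void)
{
	unregister_shrinker(&demo_shrinker);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

The disagreement above is essentially about whether userspace should be able to invoke these two callbacks directly via debugfs, rather than only indirectly through memory pressure or drop_caches.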
On Tue, Apr 26, 2022 at 04:02:19PM +1000, Dave Chinner wrote:
> This just seems like a solution looking for a problem to solve. Can you please describe the problem this infrastructure is going to solve?

A point I was making over VC is that memcg is completely irrelevant to debugging most of these issues; all the issues we've been talking about can be easily reproduced in a single test VM without memcg. Yet we don't even have the tooling to debug the simple stuff. Why are we trying to make big and complicated stuff when we can't even debug the simple cases?

And I've been getting _really_ tired of the stock answer of "that use case isn't interesting to the big cloud providers".

A: If you're a Linux kernel developer at this level, you have earned a great deal of trust and it is incumbent upon you to be a good steward of the code you have been entrusted with, instead of just spending all your time chasing fat bonuses from your employer while ignoring what's good for the codebase as a whole. That's pissing all over the commons that came long before you and will hopefully still be around long after you.

B: Even aside from that, it's incredibly shortsighted and a poor use of time and resources. When I was at Google I saw, over and over again, people rushing to do something big and complicated and new because that was how they could get a promotion, instead of working on basic stuff like refactoring core IO paths (and it's been my experience over and over again that when you just try to make code saner and more understandable, you almost always find big performance improvements along the way... but that's not as exciting as rushing to find the biggest coolest optimization or all-the-bells-and-whistles interface). So yeah, this patchset screams of someone looking for a promotion to me.

Meanwhile, the status of visibility into the _basics_ of what goes on in MM is utter dogshit. There's just too many _basic_ questions that are a pain in the ass to answer - even just profiling memory usage by file:line number is a shitshow.

One thing that I run into a lot is that people rush to say "tracepoints!" for a lot of problems - but tracepoints aren't a good answer for a lot of problems because having them on all the time is problematic. What I would like to see is lighter-weight collection of statistics, and some basic library code for things like latency measurements of important operations broken out by quantiles, with rate & frequency - this is something that's helped in bcachefs. If anyone's interested, the code for that starts here: https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/bcachefs.h#n322

Specifically for shrinkers, I'd like it if we had rolling averages over the past few seconds for e.g. the _rate_ of objects requested to be freed vs. actually freed. If we collect those kinds of rate measurements (and perhaps latency too, to show stalls) at various places in the MM code, perhaps we'd be able to see what's getting stuck when we OOM. We should have the rate of objects getting added, too, and we should be collecting data from the list_lru code as well, like you were mentioning the other night.

And if we collect this data in such a way that it can be displayed in sysfs, but done with the to_text() methods I've been talking about, it'll also be trivial to include that in the show_mem() report when we OOM.

Anyways, that's my two cents....
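The shape of such a counter can be sketched in a few lines. The names below are hypothetical (this is not the bcachefs code linked above), and a per-sample decaying average stands in for the true time-windowed average a real version would want:

#include <linux/types.h>
#include <linux/spinlock.h>

/* Decaying averages of "requested to free" vs "actually freed" per scan. */
struct reclaim_rate_stats {
	spinlock_t	lock;
	u64		ewma_requested;	/* fixed point, 8 fractional bits */
	u64		ewma_freed;
};

/* new = old + (sample - old)/8: each sample carries a weight of 1/8 */
static inline u64 ewma_update(u64 old, u64 sample_fp)
{
	return old + (((s64)(sample_fp - old)) >> 3);
}

/* Call once per do_shrink_slab() invocation (or similar). */
static void reclaim_rate_account(struct reclaim_rate_stats *s,
				 u64 requested, u64 freed)
{
	spin_lock(&s->lock);
	s->ewma_requested = ewma_update(s->ewma_requested, requested << 8);
	s->ewma_freed     = ewma_update(s->ewma_freed, freed << 8);
	spin_unlock(&s->lock);
}

A widening gap between ewma_requested and ewma_freed is exactly the "shrinker can't keep up" signal being discussed, and it is cheap enough to leave on all the time, unlike tracepoints.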
I can't claim to have any brilliant insights here, but I hope Roman will start taking ideas from more people (and Dave's been a real wealth of information on this topic! I'd pick his brain if I were you, Roman).
On Tue, 26 Apr 2022 16:02:19 +1000 Dave Chinner wrote:

[quote of the cover letter and of Dave's opening remarks trimmed]

> IDGI. Profiling shrinkers under ideal conditions when there isn't memory pressure is largely a useless exercise because execution patterns under memory pressure are vastly different.

Well how many minutes, two or ten, does it take for kswapd to reclaim 100 xfs objects at DEF_PRIORITY-3?

> All the problems with shrinkers show up when progress cannot be made as fast as memory reclaim wants memory to be reclaimed. How do you trigger priority windup causing large amounts of deferred processing because shrinkers are running in GFP_NOFS/GFP_NOIO context? How do you simulate objects getting dirtied in memory so they can't be immediately reclaimed so the shrinker can't make any progress at all until IO completes? How do you simulate the unbound concurrency that direct reclaim can drive into the shrinkers that causes massive lock contention on shared structures and locks that need to be accessed to free objects?
>
> IOWs, if all you want to do is profile shrinkers running in the absence of memory pressure, then you can do that perfectly well with the existing 'echo 2 > /proc/sys/vm/drop_caches' mechanism. We don't need some complex debugfs API just to profile the shrinker behaviour.

Hm ... given ext4, what sense does xfs make? Or vice versa? Or given wine, why Coke? I want to see the minutes recycling ten ext4 objects with xfs intact before waking kswapd up.

Hillf
[remainder of quoted message trimmed]
On Tue, Apr 26, 2022 at 04:02:19PM +1000, Dave Chinner wrote:

[full quote trimmed down to the tail below]
> > 6) Same but with a per-node output
> > $ cat count_memcg_node
> > 1 209 3
> > 20 96 0
> > 53 810 7
> > 2297 2 0
> > 218 13 0
> > 581 30 0
> > 911 124 0
> > <CUT>
>
> So now we have a hundred nodes in the machine and thousands of memcgs. And the information we want is in the numerically largest memcg that is last in the list. And we want to graph its behaviour over time at high resolution (say 1Hz). Now we burn huge amounts of CPU counting memcgs that we don't care about and then throw away most of the information. That's highly inefficient and really doesn't scale.
>
> [snip active scan interface]
>
> This just seems like a solution looking for a problem to solve. Can you please describe the problem this infrastructure is going to solve?

Hi Dave!

Thank you for taking a look.

Can you please summarize your position? It's a bit unclear. You made a lot of good points about some details (e.g. shrinker naming, where I totally agree; machines with hundreds of nodes, etc.), then you said the active scanning is useless, and then that the whole thing is useless and we're fine with what we already have for debugging shrinkers.

My plan is to work on converting the shrinker API to bytes and to experiment with different LRU implementations. I find the ability to easily export statistics and other data (which doesn't exist now) via debugfs useful (and way more convenient than changing existing tracepoints), as well as the ability to trigger scanning of individual shrinkers. If nobody else sees any value here, I'm fine keeping these patches private; no reason to argue about the output format then.

If you (or somebody else) see some value in at least the "count" part, I'm happy to answer all questions and incorporate the feedback in the next version.

Thank you!
On Tue, Apr 26, 2022 at 09:41:34AM -0700, Roman Gushchin wrote:
> My plan is to work on converting the shrinker API to bytes and to experiment with different LRU implementations. I find the ability to easily export statistics and other data (which doesn't exist now) via debugfs useful (and way more convenient than changing existing tracepoints), as well as the ability to trigger scanning of individual shrinkers. If nobody else sees any value here, I'm fine keeping these patches private; no reason to argue about the output format then.

I don't think converting the shrinker API to bytes instead of object counts is such a great idea - that's going to introduce new rounding errors and new corner cases when we can't free the exact # of bytes requested. I was thinking along the lines of adding reporting for memory usage in bytes as either an additional thing .count_objects reports, or a new callback.
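A sketch of what that split could look like — purely hypothetical, not an existing kernel interface: object counts keep driving the reclaim arithmetic, and bytes are reported only for introspection:

/* Hypothetical sketch only - not an existing kernel interface. */
struct shrinker;
struct shrink_control;

struct shrinker_ops_sketch {
	/* unchanged: reclaim accounting stays in whole objects */
	unsigned long (*count_objects)(struct shrinker *s,
				       struct shrink_control *sc);
	unsigned long (*scan_objects)(struct shrinker *s,
				      struct shrink_control *sc);
	/*
	 * New and optional: approximate bytes pinned by this shrinker's
	 * objects. Introspection only - never used to decide how much
	 * to scan, so it introduces no new rounding corner cases.
	 */
	unsigned long (*count_bytes)(struct shrinker *s,
				     struct shrink_control *sc);
};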
On Tue, Apr 26, 2022 at 04:02:19PM +1000, Dave Chinner wrote:

[quote of the cover letter trimmed]

> In general, I've had no trouble identifying individual shrinker instances because I'm always looking at individual subsystem shrinker tracepoints, too. Hence I've almost always got the identification information in the traces I need: trace just the individual shrinker tracepoints, add a bit of sed/grep/awk, and I've got something I can feed to gnuplot or a python script to graph...

You spent a lot of time working on shrinkers in general and xfs-specific shrinkers in particular, no question there. But imagine someone who's not a core-mm developer and is adding a new shrinker.

> > They are a passive mechanism: there is no way to call into counting and scanning of an individual shrinker and profile it.
>
> IDGI. Profiling shrinkers under ideal conditions when there isn't memory pressure is largely a useless exercise because execution patterns under memory pressure are vastly different.
>
> All the problems with shrinkers show up when progress cannot be made as fast as memory reclaim wants memory to be reclaimed. How do you trigger priority windup causing large amounts of deferred processing because shrinkers are running in GFP_NOFS/GFP_NOIO context? How do you simulate objects getting dirtied in memory so they can't be immediately reclaimed so the shrinker can't make any progress at all until IO completes? How do you simulate the unbound concurrency that direct reclaim can drive into the shrinkers that causes massive lock contention on shared structures and locks that need to be accessed to free objects?

These are valid points and I assume we can find ways to emulate some of these conditions, e.g. by allowing scanning to run in a GFP_NOFS context. I thought about it but decided to leave it for further improvements.

> IOWs, if all you want to do is profile shrinkers running in the absence of memory pressure, then you can do that perfectly well with the existing 'echo 2 > /proc/sys/vm/drop_caches' mechanism. We don't need some complex debugfs API just to profile the shrinker behaviour.

And then we need to somehow separate the shrinkers in the result?

> So why do we need any of the complexity and potential for abuse that comes from exposing control of shrinkers directly to userspace like these patches do?

I feel like the added complexity is minimal (unlike slab's sysfs, for example). If the config option is off (by default), there is no additional risk or overhead either.
> > To provide a better visibility and debug options for memory shrinkers this patchset introduces a /sys/kernel/debug/shrinker interface, to some extent similar to /sys/kernel/slab.
>
> /sys/kernel/slab contains read-only usage information - it is analogous for visibility arguments, but it is not equivalent for the rest of the "active" functionality you want to add here....
>
> > For each shrinker registered in the system a directory is created. The directory contains "count" and "scan" files, which allow to trigger count_objects() and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers count_memcg, scan_memcg, count_node, scan_node, count_memcg_node and scan_memcg_node are additionally provided. They allow to get per-memcg and/or per-node object count and shrink only a specific memcg/node.
>
> Great, but why does the shrinker introspection interface need active scan control functions like these?

It makes testing of (new) shrinkers easier, for example. For instance, the shadow entries shrinker hides its associated objects by returning a 0 count most of the time (unless the total consumed memory exceeds a certain fraction of total memory). echo 2 > /proc/sys/vm/drop_caches won't even trigger the scanning.

> > To make debugging more pleasant, the patchset also names all shrinkers, so that debugfs entries can have more meaningful names.
> >
> > Usage examples:
> >
> > 1) List registered shrinkers:
> > $ cd /sys/kernel/debug/shrinker/
> > $ ls
> > [shrinker listing trimmed]
>
> Ouch. That's not going to be useful for humans debugging a system as there's no way to cross-reference a "superblock" with an actual filesystem mount point. Nor is there any way to really know that all the shrinkers in one filesystem are related.
>
> We normally solve this by ensuring that the fs-related object has the short bdev name appended to it. e.g:
>
> $ pgrep xfs
> [pgrep listing trimmed]
>
> Here we have a kworker process running log IO completion work on dm-3, two global workqueue rescuer tasks (alloc, mru) and a rescuer task for the xfs-buf workqueue on dm-1.
>
> We need the same name discrimination for shrinker information here, too - just saying "this is an XFS superblock shrinker" is just not sufficient when there are hundreds of XFS mount points with a handful of shrinkers each.

Good point, I think it's doable, and I really like it.
> > 2) Get information about a specific shrinker:
> > $ cd sb-btrfs-24/
> > $ ls
> > count count_memcg count_memcg_node count_node scan scan_memcg scan_memcg_node scan_node
> >
> > 3) Count objects on the system/root cgroup level
> > $ cat count
> > 212
> >
> > 4) Count objects on the system/root cgroup level per numa node (on a 2-node machine)
> > $ cat count_node
> > 209 3
>
> So a single space-separated line with a number per node?
>
> When you have a few hundred nodes and hundreds of thousands of objects per node, we overrun the 4kB page size with a single line. What then?

With the seq_buf API we don't have the 4kB limit, do we?

> > 5) Count objects for each memcg (output format: cgroup inode, count)
> > $ cat count_memcg
> > 1 212
> > 20 96
> > 53 817
> > 2297 2
> > 218 13
> > 581 30
> > 911 124
> > <CUT>
>
> What does "<CUT>" mean?

I've just shortened the lengthy output; it's not part of the original output.

> Also, this now iterates a separate memcg per line. A parser now needs to know the difference between count/count_node and count_memcg/count_memcg_node because they are subtly different file formats. These files should have the same format, otherwise it just creates needless complexity.
>
> Indeed, why do we even need count/count_node? They are just the "index 1" memcg output, so are totally redundant.

Ok, but then we need a flag to indicate that a shrinker is memcg-aware? I got your point, though, and I (partially) agree. But do you think we're fine with just one interface and don't need an aggregation over nodes? So just count_memcg_node?

> > 6) Same but with a per-node output
> > $ cat count_memcg_node
> > 1 209 3
> > 20 96 0
> > 53 810 7
> > 2297 2 0
> > 218 13 0
> > 581 30 0
> > 911 124 0
> > <CUT>
>
> So now we have a hundred nodes in the machine and thousands of memcgs. And the information we want is in the numerically largest memcg that is last in the list. And we want to graph its behaviour over time at high resolution (say 1Hz). Now we burn huge amounts of CPU counting memcgs that we don't care about and then throw away most of the information.

For this case we can provide an interface which allows specifying both node and memcg and getting the count. Personally I don't have a machine with hundreds of nodes, so it's not on my radar. If you find it useful, I'm happy to add it.

Thanks!

Roman
On Tue, Apr 26, 2022 at 12:05:30PM -0700, Roman Gushchin wrote:
> On Tue, Apr 26, 2022 at 04:02:19PM +1000, Dave Chinner wrote:
>
> [quote of the cover letter trimmed]
>
> > In general, I've had no trouble identifying individual shrinker instances because I'm always looking at individual subsystem shrinker tracepoints, too. Hence I've almost always got the identification information in the traces I need: trace just the individual shrinker tracepoints, add a bit of sed/grep/awk, and I've got something I can feed to gnuplot or a python script to graph...
>
> You spent a lot of time working on shrinkers in general and xfs-specific shrinkers in particular, no question there. But imagine someone who's not a core-mm developer and is adding a new shrinker.

At which point, they add their own subsystem introspection to understand what their shrinker implementation is doing.

You keep talking about shrinkers as if they exist in isolation from the actual subsystems that implement shrinkers. I think that is a fundamental mistake - you cannot understand how a shrinker is actually working without understanding something about what the subsystem that implements the shrinker actually does.

That is, the tracepoints in the shrinker code are largely supplemental to the subsystem introspection that is really determining the behaviour of the system. The shrinker infrastructure is only providing a measure of memory pressure - most shrinker implementations just don't care about what happens in the shrinker infrastructure - they just count and scan objects for reclaim, and mostly that just works for them.

> > > They are a passive mechanism: there is no way to call into counting and scanning of an individual shrinker and profile it.
> >
> > IDGI. Profiling shrinkers under ideal conditions when there isn't memory pressure is largely a useless exercise because execution patterns under memory pressure are vastly different.
> >
> > All the problems with shrinkers show up when progress cannot be made as fast as memory reclaim wants memory to be reclaimed. How do you trigger priority windup causing large amounts of deferred processing because shrinkers are running in GFP_NOFS/GFP_NOIO context? How do you simulate objects getting dirtied in memory so they can't be immediately reclaimed so the shrinker can't make any progress at all until IO completes? How do you simulate the unbound concurrency that direct reclaim can drive into the shrinkers that causes massive lock contention on shared structures and locks that need to be accessed to free objects?
> These are valid points and I assume we can find ways to emulate some of these conditions, e.g. by allowing scanning to run in a GFP_NOFS context. I thought about it but decided to leave it for further improvements.
>
> > IOWs, if all you want to do is profile shrinkers running in the absence of memory pressure, then you can do that perfectly well with the existing 'echo 2 > /proc/sys/vm/drop_caches' mechanism. We don't need some complex debugfs API just to profile the shrinker behaviour.
>
> And then we need to somehow separate the shrinkers in the result?

How do you profile a shrinker in the first place? You have to load up the slab cache/LRU before you have something you can actually profile. So it's as simple as 'drop caches, load up cache to be profiled, drop caches'. It's trivial to isolate the specific cache that got loaded up from the tracepoints, and then with other tracepoints and/or perf profiling, you can extract the profile of the shrinker that is doing all the reclaim work. Indeed, you can point perf at the specific task that drops the caches, and that is all you'll get in the profile.

If you can't isolate the specific shrinker profile from the output of such a simple test setup, then you should hand in your Kernel Developer badge....

> > So why do we need any of the complexity and potential for abuse that comes from exposing control of shrinkers directly to userspace like these patches do?
>
> I feel like the added complexity is minimal (unlike slab's sysfs, for example). If the config option is off (by default), there is no additional risk or overhead either.

No. The argument that "if we turn it off there's no overhead" means one of two things:

1. nobody turns it on and it never gets tested and so bitrots and is useless, or
2. distros all turn it on because some tool they ship or customer they ship to wants it.

Either way, hiding it behind a config option is not an acceptable solution for merging poorly thought out infrastructure.

> > > To provide a better visibility and debug options for memory shrinkers this patchset introduces a /sys/kernel/debug/shrinker interface, to some extent similar to /sys/kernel/slab.
> >
> > /sys/kernel/slab contains read-only usage information - it is analogous for visibility arguments, but it is not equivalent for the rest of the "active" functionality you want to add here....
> >
> > > For each shrinker registered in the system a directory is created. The directory contains "count" and "scan" files, which allow to trigger count_objects() and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers count_memcg, scan_memcg, count_node, scan_node, count_memcg_node and scan_memcg_node are additionally provided. They allow to get per-memcg and/or per-node object count and shrink only a specific memcg/node.
> >
> > Great, but why does the shrinker introspection interface need active scan control functions like these?
>
> It makes testing of (new) shrinkers easier, for example. For instance, the shadow entries shrinker hides its associated objects by returning a 0 count most of the time (unless the total consumed memory exceeds a certain fraction of total memory). echo 2 > /proc/sys/vm/drop_caches won't even trigger the scanning.

And that's exactly my point above: you cannot test shrinkers in isolation from the subsystem that loads them up.
In this case, you *aren't testing the shrinker*, you are testing how the shadow entry subsystem manages the working set. The shrinker is an integrated part of that subsystem, so any test hooks needed to trigger the reclaim of shadow entries belong in the ->count method of the shrinker implementation, such that it runs whenever the shrinker is called rather than only when the memory usage threshold is triggered. At that point, drop_caches then does exactly what you need.

Shrinkers cannot be tested in isolation from the subsystem they act on!

> > > 2) Get information about a specific shrinker:
> > > $ cd sb-btrfs-24/
> > > $ ls
> > > count count_memcg count_memcg_node count_node scan scan_memcg scan_memcg_node scan_node
> > >
> > > 3) Count objects on the system/root cgroup level
> > > $ cat count
> > > 212
> > >
> > > 4) Count objects on the system/root cgroup level per numa node (on a 2-node machine)
> > > $ cat count_node
> > > 209 3
> >
> > So a single space-separated line with a number per node?
> >
> > When you have a few hundred nodes and hundreds of thousands of objects per node, we overrun the 4kB page size with a single line. What then?
>
> With the seq_buf API we don't have the 4kB limit, do we?

No idea. Never cared enough about sysfs to need to know. But that doesn't avoid the issue: verbosity and overhead to create/parse this information.

> > Also, this now iterates a separate memcg per line. A parser now needs to know the difference between count/count_node and count_memcg/count_memcg_node because they are subtly different file formats. These files should have the same format, otherwise it just creates needless complexity.
> >
> > Indeed, why do we even need count/count_node? They are just the "index 1" memcg output, so are totally redundant.
>
> Ok, but then we need a flag to indicate that a shrinker is memcg-aware? I got your point, though, and I (partially) agree. But do you think we're fine with just one interface and don't need an aggregation over nodes? So just count_memcg_node?

/me puts on the broken record

Shrinker infrastructure needs to stop treating memcgs as something special and off to the side. We need to integrate the code so there is a single scan loop that simply treats the "no memcg" case as the root memcg.

Bleeding architectural/implementation deficiencies into user-visible APIs is even worse than just having to put up with them in the implementation....

> > > 6) Same but with a per-node output
> > > $ cat count_memcg_node
> > > 1 209 3
> > > 20 96 0
> > > 53 810 7
> > > 2297 2 0
> > > 218 13 0
> > > 581 30 0
> > > 911 124 0
> > > <CUT>
> >
> > So now we have a hundred nodes in the machine and thousands of memcgs. And the information we want is in the numerically largest memcg that is last in the list. And we want to graph its behaviour over time at high resolution (say 1Hz). Now we burn huge amounts of CPU counting memcgs that we don't care about and then throw away most of the information.
>
> For this case we can provide an interface which allows specifying both node and memcg and getting the count. Personally I don't have a machine with hundreds of nodes, so it's not on my radar.

Yup, but there are people who do have this sort of machine, who do use memcgs (in their thousands) and do have many, many superblocks (in their thousands). Just because you personally don't have such machines does not mean you don't have to design for such machines.
Saying "I don't care other people's requirements" is exactly what Kent had a rant about in the other leg of this thread. We know that we have these scalability issues in generic infrastructure, and therefore generic infrastructure has to handle these issues at a architecture and design level. We don't need the initial implementation to work well at such levels of scalability, but we sure as hell need the design, APIs and file formats to scale out because if it doesn't scale there is no question that *we will have to fix it*. So, yeah, you need to think about how to do fine-grained access to shrinker stats effectively. That might require a complete change of presentation API. For example, changing the filesystem layout to be memcg centric rather than shrinker instance centric would make an awful lot of this file parsing problem go away. e.g: /sys/kernel/debug/mm/memcg/<memcg instance>/shrinker/<shrinker instance>/stats Cheers, Dave.
On Tue, Apr 26, 2022 at 09:41:34AM -0700, Roman Gushchin wrote:
> Can you please summarize your position? It's a bit unclear. You made a lot of good points about some details (e.g. shrinker naming, where I totally agree; machines with hundreds of nodes, etc.), then you said the active scanning is useless, and then that the whole thing is useless and we're fine with what we already have for debugging shrinkers.

Better introspection is the first thing we need. Work on improving that. I've been making suggestions to help improve introspection infrastructure.

Before anything else, we need to improve introspection so we can gain better insight into the problems we have. Once we understand the problems better and have evidence to back up where the problems lie and we have a plan to solve them, then we can talk about whether we need other user-accessible shrinker APIs.

For the moment, exposing shrinker control interfaces to userspace could potentially be very bad because it exposes internal architectural and implementation details to a user API. Just because it is in /sys/kernel/debug it doesn't mean applications won't start to use it and build dependencies on it.

That doesn't mean I'm opposed to exposing a shrinker control mechanism to debugfs - I'm still on the fence on that one. However, I definitely think that an API that directly exposes the internal implementation to userspace is the wrong way to go about this.

Fine-grained shrinker control is not necessary to improve shrinker introspection and OOM debugging capability, so if you want/need control interfaces then I think you should separate those out into a separate line of development where it doesn't derail the discussion on how to improve shrinker/OOM introspection.

-Dave.
On Wed, Apr 27, 2022 at 11:22:55AM +1000, Dave Chinner wrote:
> Better introspection is the first thing we need. Work on improving that. I've been making suggestions to help improve introspection infrastructure.
>
> Before anything else, we need to improve introspection so we can gain better insight into the problems we have. Once we understand the problems better and have evidence to back up where the problems lie and we have a plan to solve them, then we can talk about whether we need other user-accessible shrinker APIs.

Ok, at least we do agree here. This is exactly why I've started with this debugfs stuff.

> For the moment, exposing shrinker control interfaces to userspace could potentially be very bad because it exposes internal architectural and implementation details to a user API. Just because it is in /sys/kernel/debug it doesn't mean applications won't start to use it and build dependencies on it.
>
> That doesn't mean I'm opposed to exposing a shrinker control mechanism to debugfs - I'm still on the fence on that one. However, I definitely think that an API that directly exposes the internal implementation to userspace is the wrong way to go about this.

Ok, if it's about having memcg-aware and other interfaces, I can agree here as well. I actually made an attempt to unify memcg-aware and system-wide shrinker scanning; not very successful yet, but it's definitely on my todo list. I'm pretty sure we're iterating over and over some empty root-level shrinkers without benefiting from the bitmap infrastructure which works for memory cgroups.

> Fine-grained shrinker control is not necessary to improve shrinker introspection and OOM debugging capability, so if you want/need control interfaces then I think you should separate those out into a separate line of development where it doesn't derail the discussion on how to improve shrinker/OOM introspection.

Ok, no problems here. Btw, the OOM debugging is a separate topic brought in by Kent; I'd keep it separate too, as it comes with many OOM-specific complications.

From another email of yours:

> So, yeah, you need to think about how to do fine-grained access to shrinker stats effectively. That might require a complete change of presentation API. For example, changing the filesystem layout to be memcg-centric rather than shrinker-instance-centric would make an awful lot of this file parsing problem go away.
>
> e.g:
>
> /sys/kernel/debug/mm/memcg/<memcg instance>/shrinker/<shrinker instance>/stats

The problem with this approach (I thought about it) is that it comes with a high memory overhead, especially on machines with thousands of cgroups and mount points. And besides the memory overhead, it's really expensive to collect system-wide data and get a big picture, as it requires opening and reading thousands of files.

Actually, you wrote recently:

"I've thought about it, too, and can see where it could be useful. However, when I consider the list_lru memcg integration, I suspect it becomes a "can't see the forest for the trees" problem.
We're going to end up with millions of sysfs objects with no obvious way to navigate, iterate or search them if we just take the naive "sysfs object + stats per list_lru instance" approach."

It all makes me think we need both: a way to iterate over all memcgs and dump all the numbers at once, and a way to get a specific per-memcg (per-node) count.

Thanks!
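A sketch of the "dump everything at once" half — assuming a hypothetical per-memcg counting hook standing in for a ->count_objects() call — is a single seq_file read that walks the memcg tree with mem_cgroup_iter():

#include <linux/cgroup.h>
#include <linux/memcontrol.h>
#include <linux/seq_file.h>

/* Hypothetical per-memcg counting hook, e.g. wrapping ->count_objects(). */
static unsigned long demo_count_one(struct mem_cgroup *memcg);

/* One read walks the whole memcg tree, printing "<cgroup ino> <count>". */
static int demo_counts_show(struct seq_file *m, void *v)
{
	struct mem_cgroup *memcg = NULL;

	while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)))
		seq_printf(m, "%lu %lu\n",
			   (unsigned long)cgroup_ino(memcg->css.cgroup),
			   demo_count_one(memcg));
	return 0;
}

The fine-grained "one memcg, one node" query would then be the separate per-instance file discussed above, so neither use case has to pay for the other.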