[rfc,0/5] mm: introduce shrinker sysfs interface

Message ID 20220416002756.4087977-1-roman.gushchin@linux.dev (mailing list archive)

Message

Roman Gushchin April 16, 2022, 12:27 a.m. UTC
There are 50+ different shrinkers in the kernel, many with their own bells and
whistles. Under memory pressure the kernel applies some pressure to each of
them in the order in which they were created/registered in the system. Some
of them can contain only a few objects, some can be quite large. Some can be
effective at reclaiming memory, some not.

The only existing debugging mechanism is a couple of tracepoints in
do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They don't
cover everything though: shrinkers which report 0 objects will never show up,
and there is no support for memcg-aware shrinkers. Shrinkers are identified by
their scan function, which is not always enough (e.g. it's hard to guess which
super block's shrinker it is when all you have is "super_cache_scan"). They
are a passive mechanism: there is no way to call into the counting and
scanning of an individual shrinker and profile it.

To provide better visibility and debugging options for memory shrinkers,
this patchset introduces a /sys/kernel/shrinker interface, to some extent
similar to /sys/kernel/slab.

For each shrinker registered in the system a folder is created. The folder
contains "count" and "scan" files, which allow triggering the count_objects()
and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers,
count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
and scan_memcg_node files are additionally provided. They allow getting a
per-memcg and/or per-node object count and shrinking only a specific
memcg/node.
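
For reference, the new files map directly onto the existing shrinker
callbacks, abridged below from include/linux/shrinker.h (the exact field
layout may vary between kernel versions):

  struct shrink_control {
          gfp_t gfp_mask;
          int nid;                    /* node to scan (numa-aware shrinkers) */
          unsigned long nr_to_scan;   /* objects to scan and try to reclaim */
          unsigned long nr_scanned;   /* set by the shrinker: objects scanned */
          struct mem_cgroup *memcg;   /* memcg to scan (memcg-aware shrinkers) */
  };

  struct shrinker {
          unsigned long (*count_objects)(struct shrinker *shrinker,
                                         struct shrink_control *sc);
          unsigned long (*scan_objects)(struct shrinker *shrinker,
                                        struct shrink_control *sc);
          /* batch, seeks, flags and internal fields omitted */
  };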

To make debugging more pleasant, the patchset also names all shrinkers,
so that sysfs entries can have more meaningful names.

Usage examples:

1) List registered shrinkers:
  $ cd /sys/kernel/shrinker/
  $ ls
    dqcache-16          sb-cgroup2-30    sb-hugetlbfs-33  sb-proc-41       sb-selinuxfs-22  sb-tmpfs-40    sb-zsmalloc-19
    kfree_rcu-0         sb-configfs-23   sb-iomem-12      sb-proc-44       sb-sockfs-8      sb-tmpfs-42    shadow-18
    sb-aio-20           sb-dax-11        sb-mqueue-21     sb-proc-45       sb-sysfs-26      sb-tmpfs-43    thp_deferred_split-10
    sb-anon_inodefs-15  sb-debugfs-7     sb-nsfs-4        sb-proc-47       sb-tmpfs-1       sb-tmpfs-46    thp_zero-9
    sb-bdev-3           sb-devpts-28     sb-pipefs-14     sb-pstore-31     sb-tmpfs-27      sb-tmpfs-49    xfs_buf-37
    sb-bpf-32           sb-devtmpfs-5    sb-proc-25       sb-rootfs-2      sb-tmpfs-29      sb-tracefs-13  xfs_inodegc-38
    sb-btrfs-24         sb-hugetlbfs-17  sb-proc-39       sb-securityfs-6  sb-tmpfs-35      sb-xfs-36      zspool-34

2) Get information about a specific shrinker:
  $ cd sb-btrfs-24/
  $ ls
    count  count_memcg  count_memcg_node  count_node  scan  scan_memcg  scan_memcg_node  scan_node

3) Count objects on the system/root cgroup level:
  $ cat count
    212

4) Count objects on the system/root cgroup level per numa node (on a 2-node machine):
  $ cat count_node
    209 3

5) Count objects for each memcg (output format: cgroup inode, count):
  $ cat count_memcg
    1 212
    20 96
    53 817
    2297 2
    218 13
    581 30
    911 124
    <CUT>

6) Same but with a per-node output:
  $ cat count_memcg_node
    1 209 3
    20 96 0
    53 810 7
    2297 2 0
    218 13 0
    581 30 0
    911 124 0
    <CUT>

7) Don't display cgroups with less than 500 attached objects:
  $ echo 500 > count_memcg
  $ cat count_memcg
    53 817
    1868 886
    2396 799
    2462 861

8) Don't display cgroups with less than 500 attached objects (sum over all nodes):
  $ echo "500" > count_memcg_node
  $ cat count_memcg_node
    53 810 7
    1868 886 0
    2396 799 0
    2462 861 0

9) Scan system/root shrinker:
  $ cat count
    212
  $ echo 100 > scan
  $ cat scan
    97
  $ cat count
    115

10) Scan individual memcg:
  $ echo "1868 500" > scan_memcg
  $ cat scan_memcg
    193

11) Scan individual node:
  $ echo "1 200" > scan_node
  $ cat scan_node
    2

12) Scan individual memcg and node:
  $ echo "1868 0 500" > scan_memcg_node
  $ cat scan_memcg_node
    435

If the output doesn't fit into a single page, "...\n" is printed at the end
of the output.
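
The cgroup inode in the first column of the count_memcg/scan_memcg output can
be mapped back to a cgroup path from userspace. A minimal sketch (a
hypothetical helper, assuming cgroup2 is mounted at /sys/fs/cgroup):

  #define _XOPEN_SOURCE 500
  #include <ftw.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>
  #include <sys/types.h>

  static ino_t target;

  /* nftw() callback: print the cgroup directory whose inode matches. */
  static int visit(const char *path, const struct stat *st, int type,
                   struct FTW *ftwbuf)
  {
          if (type == FTW_D && st->st_ino == target) {
                  printf("%s\n", path);
                  return 1;       /* a non-zero return stops the walk */
          }
          return 0;
  }

  int main(int argc, char **argv)
  {
          if (argc != 2)
                  return 1;
          target = strtoul(argv[1], NULL, 10);
          return nftw("/sys/fs/cgroup", visit, 16, FTW_PHYS) == 1 ? 0 : 1;
  }

E.g. "./ino2cgroup 1868" would print the path of the cgroup counted on the
"1868 886" line above.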


Roman Gushchin (5):
  mm: introduce sysfs interface for debugging kernel shrinker
  mm: memcontrol: introduce mem_cgroup_ino() and
    mem_cgroup_get_from_ino()
  mm: introduce memcg interfaces for shrinker sysfs
  mm: introduce numa interfaces for shrinker sysfs
  mm: provide shrinkers with names

 arch/x86/kvm/mmu/mmu.c                        |   2 +-
 drivers/android/binder_alloc.c                |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |   3 +-
 drivers/gpu/drm/msm/msm_gem_shrinker.c        |   2 +-
 .../gpu/drm/panfrost/panfrost_gem_shrinker.c  |   2 +-
 drivers/gpu/drm/ttm/ttm_pool.c                |   2 +-
 drivers/md/bcache/btree.c                     |   2 +-
 drivers/md/dm-bufio.c                         |   2 +-
 drivers/md/dm-zoned-metadata.c                |   2 +-
 drivers/md/raid5.c                            |   2 +-
 drivers/misc/vmw_balloon.c                    |   2 +-
 drivers/virtio/virtio_balloon.c               |   2 +-
 drivers/xen/xenbus/xenbus_probe_backend.c     |   2 +-
 fs/erofs/utils.c                              |   2 +-
 fs/ext4/extents_status.c                      |   3 +-
 fs/f2fs/super.c                               |   2 +-
 fs/gfs2/glock.c                               |   2 +-
 fs/gfs2/main.c                                |   2 +-
 fs/jbd2/journal.c                             |   2 +-
 fs/mbcache.c                                  |   2 +-
 fs/nfs/nfs42xattr.c                           |   7 +-
 fs/nfs/super.c                                |   2 +-
 fs/nfsd/filecache.c                           |   2 +-
 fs/nfsd/nfscache.c                            |   2 +-
 fs/quota/dquot.c                              |   2 +-
 fs/super.c                                    |   2 +-
 fs/ubifs/super.c                              |   2 +-
 fs/xfs/xfs_buf.c                              |   2 +-
 fs/xfs/xfs_icache.c                           |   2 +-
 fs/xfs/xfs_qm.c                               |   2 +-
 include/linux/memcontrol.h                    |   9 +
 include/linux/shrinker.h                      |  25 +-
 kernel/rcu/tree.c                             |   2 +-
 lib/Kconfig.debug                             |   9 +
 mm/Makefile                                   |   1 +
 mm/huge_memory.c                              |   4 +-
 mm/memcontrol.c                               |  23 +
 mm/shrinker_debug.c                           | 792 ++++++++++++++++++
 mm/vmscan.c                                   |  66 +-
 mm/workingset.c                               |   2 +-
 mm/zsmalloc.c                                 |   2 +-
 net/sunrpc/auth.c                             |   2 +-
 42 files changed, 957 insertions(+), 47 deletions(-)
 create mode 100644 mm/shrinker_debug.c

Comments

Mike Rapoport April 18, 2022, 9:27 a.m. UTC | #1
On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> There are 50+ different shrinkers in the kernel, many with their own bells and
> whistles. Under the memory pressure the kernel applies some pressure on each of
> them in the order of which they were created/registered in the system. Some
> of them can contain only few objects, some can be quite large. Some can be
> effective at reclaiming memory, some not.
> 
> The only existing debugging mechanism is a couple of tracepoints in
> do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> covering everything though: shrinkers which report 0 objects will never show up,
> there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> scan function, which is not always enough (e.g. hard to guess which super
> block's shrinker it is having only "super_cache_scan"). They are a passive
> mechanism: there is no way to call into counting and scanning of an individual
> shrinker and profile it.
> 
> To provide a better visibility and debug options for memory shrinkers
> this patchset introduces a /sys/kernel/shrinker interface, to some extent
> similar to /sys/kernel/slab.

Wouldn't debugfs better fit the purpose of shrinker debugging?
Roman Gushchin April 18, 2022, 5:27 p.m. UTC | #2
On Mon, Apr 18, 2022 at 12:27:36PM +0300, Mike Rapoport wrote:
> On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> > There are 50+ different shrinkers in the kernel, many with their own bells and
> > whistles. Under the memory pressure the kernel applies some pressure on each of
> > them in the order of which they were created/registered in the system. Some
> > of them can contain only few objects, some can be quite large. Some can be
> > effective at reclaiming memory, some not.
> > 
> > The only existing debugging mechanism is a couple of tracepoints in
> > do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> > covering everything though: shrinkers which report 0 objects will never show up,
> > there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> > scan function, which is not always enough (e.g. hard to guess which super
> > block's shrinker it is having only "super_cache_scan"). They are a passive
> > mechanism: there is no way to call into counting and scanning of an individual
> > shrinker and profile it.
> > 
> > To provide a better visibility and debug options for memory shrinkers
> > this patchset introduces a /sys/kernel/shrinker interface, to some extent
> > similar to /sys/kernel/slab.
> 
> Wouldn't debugfs better fit the purpose of shrinker debugging?

I think sysfs fits better, but it's not a very strong opinion.

Even though the interface is likely not very useful for the general
public, big cloud instances might want to enable it to gather statistics
(and it's certainly what we're going to do at Facebook) and to provide
additional data when something is off.  They might not have debugfs
mounted. And it's really similar to /sys/kernel/slab.

Are there any reasons why debugfs is preferable?

Thanks!
Andrew Morton April 19, 2022, 4:27 a.m. UTC | #3
On Fri, 15 Apr 2022 17:27:51 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:

> There are 50+ different shrinkers in the kernel, many with their own bells and
> whistles. Under the memory pressure the kernel applies some pressure on each of
> them in the order of which they were created/registered in the system. Some
> of them can contain only few objects, some can be quite large. Some can be
> effective at reclaiming memory, some not.
> 
> The only existing debugging mechanism is a couple of tracepoints in
> do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> covering everything though: shrinkers which report 0 objects will never show up,
> there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> scan function, which is not always enough (e.g. hard to guess which super
> block's shrinker it is having only "super_cache_scan"). They are a passive
> mechanism: there is no way to call into counting and scanning of an individual
> shrinker and profile it.
> 
> To provide a better visibility and debug options for memory shrinkers
> this patchset introduces a /sys/kernel/shrinker interface, to some extent
> similar to /sys/kernel/slab.
> 
> For each shrinker registered in the system a folder is created.

Please, "directory".

> The folder
> contains "count" and "scan" files, which allow to trigger count_objects()
> and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> and scan_memcg_node are additionally provided. They allow to get per-memcg
> and/or per-node object count and shrink only a specific memcg/node.
> 
> To make debugging more pleasant, the patchset also names all shrinkers,
> so that sysfs entries can have more meaningful names.

I also was wondering "why not debugfs".

> Usage examples:
> 
> ...
>
> If the output doesn't fit into a single page, "...\n" is printed at the end of
> output.

Unclear.  At the end of what output?

> 
> Roman Gushchin (5):
>   mm: introduce sysfs interface for debugging kernel shrinker
>   mm: memcontrol: introduce mem_cgroup_ino() and
>     mem_cgroup_get_from_ino()
>   mm: introduce memcg interfaces for shrinker sysfs
>   mm: introduce numa interfaces for shrinker sysfs
>   mm: provide shrinkers with names
> 
>  arch/x86/kvm/mmu/mmu.c                        |   2 +-
>  ...
>

Nothing under Documentation/!
Mike Rapoport April 19, 2022, 6:33 a.m. UTC | #4
On Mon, Apr 18, 2022 at 10:27:34AM -0700, Roman Gushchin wrote:
> On Mon, Apr 18, 2022 at 12:27:36PM +0300, Mike Rapoport wrote:
> > On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> > > There are 50+ different shrinkers in the kernel, many with their own bells and
> > > whistles. Under the memory pressure the kernel applies some pressure on each of
> > > them in the order of which they were created/registered in the system. Some
> > > of them can contain only few objects, some can be quite large. Some can be
> > > effective at reclaiming memory, some not.
> > > 
> > > The only existing debugging mechanism is a couple of tracepoints in
> > > do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> > > covering everything though: shrinkers which report 0 objects will never show up,
> > > there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> > > scan function, which is not always enough (e.g. hard to guess which super
> > > block's shrinker it is having only "super_cache_scan"). They are a passive
> > > mechanism: there is no way to call into counting and scanning of an individual
> > > shrinker and profile it.
> > > 
> > > To provide a better visibility and debug options for memory shrinkers
> > > this patchset introduces a /sys/kernel/shrinker interface, to some extent
> > > similar to /sys/kernel/slab.
> > 
> > Wouldn't debugfs better fit the purpose of shrinker debugging?
> 
> I think sysfs fits better, but not a very strong opinion.
> 
> Even though the interface is likely not very useful for the general
> public, big cloud instances might wanna enable it to gather statistics
> (and it's certainly what we gonna do at Facebook) and to provide
> additional data when something is off.  They might not have debugfs
> mounted. And it's really similar to /sys/kernel/slab.

And there is also the similar /proc/vmallocinfo, so why not /proc/shrinker? ;-)

I suspect slab ended up in sysfs because nobody suggested using debugfs
back then. I've been able to track the transition from /proc/slabinfo to
/proc/slubinfo to /sys/kernel/slab, but could not find why Christoph chose
sysfs in the end.

> Are there any reasons why debugfs is preferable?

debugfs is more flexible because it's not a stable kernel ABI, so if there
is a need/desire to change the layout and content of the files, it can be
done more easily with debugfs.

Is this a real problem for Facebook to mount debugfs? ;-)
 
> Thanks!
Roman Gushchin April 19, 2022, 5:52 p.m. UTC | #5
On Mon, Apr 18, 2022 at 09:27:09PM -0700, Andrew Morton wrote:
> On Fri, 15 Apr 2022 17:27:51 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:
> 
> > There are 50+ different shrinkers in the kernel, many with their own bells and
> > whistles. Under the memory pressure the kernel applies some pressure on each of
> > them in the order of which they were created/registered in the system. Some
> > of them can contain only few objects, some can be quite large. Some can be
> > effective at reclaiming memory, some not.
> > 
> > The only existing debugging mechanism is a couple of tracepoints in
> > do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> > covering everything though: shrinkers which report 0 objects will never show up,
> > there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> > scan function, which is not always enough (e.g. hard to guess which super
> > block's shrinker it is having only "super_cache_scan"). They are a passive
> > mechanism: there is no way to call into counting and scanning of an individual
> > shrinker and profile it.
> > 
> > To provide a better visibility and debug options for memory shrinkers
> > this patchset introduces a /sys/kernel/shrinker interface, to some extent
> > similar to /sys/kernel/slab.
> > 
> > For each shrinker registered in the system a folder is created.
> 
> Please, "directory".

Of course, sorry :)

> 
> > The folder
> > contains "count" and "scan" files, which allow to trigger count_objects()
> > and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> > count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> > and scan_memcg_node are additionally provided. They allow to get per-memcg
> > and/or per-node object count and shrink only a specific memcg/node.
> > 
> > To make debugging more pleasant, the patchset also names all shrinkers,
> > so that sysfs entries can have more meaningful names.
> 
> I also was wondering "why not debugfs".

Fair enough, moving to debugfs in v1.

> 
> > Usage examples:
> > 
> > ...
> >
> > If the output doesn't fit into a single page, "...\n" is printed at the end of
> > output.
> 
> Unclear.  At the end of what output?

This is how it looks when the output is too long:

[root@eth50-1 sb-btrfs-24]# cat count_memcg
1 226
20 96
53 811
2429 2
218 13
581 29
911 124
1010 3
1043 1
1076 1
1241 60
1274 7
1307 39
1340 3
1406 14
1439 63
1472 54
1505 8
1538 1
1571 6
1604 39
1637 9
1670 8
1703 4
1736 1094
1802 2
1868 2
1901 52
1934 592
1967 32
			< CUT >
18797 1
18830 1
18863 1
18896 1
18929 1
18962 1
18995 1
19028 1
19061 1
19094 1
19127 1
19160 1
19193 1
...

I'll try to make it more obvious in the description.

> 
> > 
> > Roman Gushchin (5):
> >   mm: introduce sysfs interface for debugging kernel shrinker
> >   mm: memcontrol: introduce mem_cgroup_ino() and
> >     mem_cgroup_get_from_ino()
> >   mm: introduce memcg interfaces for shrinker sysfs
> >   mm: introduce numa interfaces for shrinker sysfs
> >   mm: provide shrinkers with names
> > 
> >  arch/x86/kvm/mmu/mmu.c                        |   2 +-
> >  ...
> >
> 
> Nothing under Documentation/!

I planned to add it after the rfc version. Will do.

Thank you for taking a look!
Roman Gushchin April 19, 2022, 5:58 p.m. UTC | #6
On Tue, Apr 19, 2022 at 09:33:48AM +0300, Mike Rapoport wrote:
> On Mon, Apr 18, 2022 at 10:27:34AM -0700, Roman Gushchin wrote:
> > On Mon, Apr 18, 2022 at 12:27:36PM +0300, Mike Rapoport wrote:
> > > On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> > > > There are 50+ different shrinkers in the kernel, many with their own bells and
> > > > whistles. Under the memory pressure the kernel applies some pressure on each of
> > > > them in the order of which they were created/registered in the system. Some
> > > > of them can contain only few objects, some can be quite large. Some can be
> > > > effective at reclaiming memory, some not.
> > > > 
> > > > The only existing debugging mechanism is a couple of tracepoints in
> > > > do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> > > > covering everything though: shrinkers which report 0 objects will never show up,
> > > > there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> > > > scan function, which is not always enough (e.g. hard to guess which super
> > > > block's shrinker it is having only "super_cache_scan"). They are a passive
> > > > mechanism: there is no way to call into counting and scanning of an individual
> > > > shrinker and profile it.
> > > > 
> > > > To provide a better visibility and debug options for memory shrinkers
> > > > this patchset introduces a /sys/kernel/shrinker interface, to some extent
> > > > similar to /sys/kernel/slab.
> > > 
> > > Wouldn't debugfs better fit the purpose of shrinker debugging?
> > 
> > I think sysfs fits better, but not a very strong opinion.
> > 
> > Even though the interface is likely not very useful for the general
> > public, big cloud instances might wanna enable it to gather statistics
> > (and it's certainly what we gonna do at Facebook) and to provide
> > additional data when something is off.  They might not have debugfs
> > mounted. And it's really similar to /sys/kernel/slab.
> 
> And there is also similar /proc/vmallocinfo so why not /proc/shrinker? ;-)
> 
> I suspect slab ended up in sysfs because nobody suggested to use debugfs
> back then. I've been able to track the transition from /proc/slabinfo to
> /proc/slubinfo to /sys/kernel/slab, but could not find why Christoph chose
> sysfs in the end.
>
> > Are there any reasons why debugfs is preferable?
> 
> debugfs is more flexible because it's not stable kernel ABI so if there
> will be need/desire to change the layout and content of the files with
> debugfs it can be done more easily.
> 
> Is this a real problem for Facebook to mount debugfs? ;-)

Fair enough, switching to debugfs in the next version.
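
The registration side should be a mostly mechanical change. A minimal sketch
of what it could look like (hypothetical code, assuming the name/id fields
this series adds; count_fops/scan_fops stand in for the real file operations):

  static struct dentry *shrinker_debugfs_root;

  void shrinker_debugfs_init(void)
  {
          shrinker_debugfs_root = debugfs_create_dir("shrinker", NULL);
  }

  void shrinker_debugfs_add(struct shrinker *shrinker)
  {
          char name[64];
          struct dentry *dir;

          snprintf(name, sizeof(name), "%s-%d", shrinker->name, shrinker->id);
          dir = debugfs_create_dir(name, shrinker_debugfs_root);
          debugfs_create_file("count", 0440, dir, shrinker, &count_fops);
          debugfs_create_file("scan", 0220, dir, shrinker, &scan_fops);
  }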

Thanks!
Kent Overstreet April 19, 2022, 6:20 p.m. UTC | #7
On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> There are 50+ different shrinkers in the kernel, many with their own bells and
> whistles. Under the memory pressure the kernel applies some pressure on each of
> them in the order of which they were created/registered in the system. Some
> of them can contain only few objects, some can be quite large. Some can be
> effective at reclaiming memory, some not.
> 
> The only existing debugging mechanism is a couple of tracepoints in
> do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> covering everything though: shrinkers which report 0 objects will never show up,
> there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> scan function, which is not always enough (e.g. hard to guess which super
> block's shrinker it is having only "super_cache_scan"). They are a passive
> mechanism: there is no way to call into counting and scanning of an individual
> shrinker and profile it.
> 
> To provide a better visibility and debug options for memory shrinkers
> this patchset introduces a /sys/kernel/shrinker interface, to some extent
> similar to /sys/kernel/slab.
> 
> For each shrinker registered in the system a folder is created. The folder
> contains "count" and "scan" files, which allow to trigger count_objects()
> and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> and scan_memcg_node are additionally provided. They allow to get per-memcg
> and/or per-node object count and shrink only a specific memcg/node.

Cool!

I've been starting to sketch out some shrinker improvements of my own; perhaps
we could combine efforts. The issue I've been targeting is that when we hit an
OOM, we currently don't get a lot of useful information - shrinkers ought to be
included, and we really want information on a shrinker's internal state (e.g.
object dirtiness) if we're to have a chance at understanding why memory isn't
getting reclaimed.

https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text

This adds a .to_text() method - a pretty-printer - that shrinkers can
implement, and then on OOM we report on the top 10 shrinkers by memory usage, in
sorted order.
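
A rough sketch of the shape of such a method, using the super block shrinker
as an example (illustrative only - printbuf/pr_buf come from the WIP branch
above):

  static void super_cache_to_text(struct printbuf *out, struct shrinker *shrink)
  {
          struct super_block *sb = container_of(shrink, struct super_block,
                                                s_shrink);

          /* report internal state, not just an opaque object count */
          pr_buf(out, "super block %s:\n", sb->s_id);
          pr_buf(out, "  dentries: %lu\n", list_lru_count(&sb->s_dentry_lru));
          pr_buf(out, "  inodes:   %lu\n", list_lru_count(&sb->s_inode_lru));
  }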

Another thing I'd like to do is have shrinkers report usage not just in object
counts but in bytes; I think it should be obvious why that's desirable.

Maybe we could have a memory-reporting-and-shrinker-improvements session at LSF?
I'd love to do some collective brainstorming and get some real momentum going
in this area.
Andrew Morton April 19, 2022, 6:25 p.m. UTC | #8
On Tue, 19 Apr 2022 10:52:44 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:

> > Unclear.  At the end of what output?
> 
> This is how it looks like when the output is too long:
> 
> [root@eth50-1 sb-btrfs-24]# cat count_memcg
> 1 226
> 20 96
> 53 811
> 2429 2
> 218 13
> 581 29
> 911 124
> 1010 3
> 1043 1
> 1076 1
> 1241 60
> 1274 7
> 1307 39
> 1340 3
> 1406 14
> 1439 63
> 1472 54
> 1505 8
> 1538 1
> 1571 6
> 1604 39
> 1637 9
> 1670 8
> 1703 4
> 1736 1094
> 1802 2
> 1868 2
> 1901 52
> 1934 592
> 1967 32
> 			< CUT >
> 18797 1
> 18830 1

We do that in-kernel?  Why?  That just makes parsers harder to write?
If someone has issues then direct them at /usr/bin/less?
Greg KH April 19, 2022, 6:33 p.m. UTC | #9
On Tue, Apr 19, 2022 at 10:52:44AM -0700, Roman Gushchin wrote:
> On Mon, Apr 18, 2022 at 09:27:09PM -0700, Andrew Morton wrote:
> > > The folder
> > > contains "count" and "scan" files, which allow to trigger count_objects()
> > > and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> > > count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> > > and scan_memcg_node are additionally provided. They allow to get per-memcg
> > > and/or per-node object count and shrink only a specific memcg/node.
> > > 
> > > To make debugging more pleasant, the patchset also names all shrinkers,
> > > so that sysfs entries can have more meaningful names.
> > 
> > I also was wondering "why not debugfs".
> 
> Fair enough, moving to debugfs in v1.

Thank you, that keeps me from complaining about how badly you were
abusing sysfs in this patchset :)
Kent Overstreet April 19, 2022, 6:36 p.m. UTC | #10
On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> 7) Don't display cgroups with less than 500 attached objects
>   $ echo 500 > count_memcg
>   $ cat count_memcg
>     53 817
>     1868 886
>     2396 799
>     2462 861
> 
> 8) Don't display cgroups with less than 500 attached objects (sum over all nodes)
>   $ echo "500" > count_memcg_node
>   $ cat count_memcg_node
>     53 810 7
>     1868 886 0
>     2396 799 0
>     2462 861 0
> 
> 9) Scan system/root shrinker
>   $ cat count
>     212
>   $ echo 100 > scan
>   $ cat scan
>     97
>   $ cat count
>     115

This part seems entirely overengineered though, and a really bad idea - can we
please _not_ store query state in the kernel? It's not thread-safe, and it seems
like overengineering before we've done the basics (just getting this stuff in
sysfs is a major improvement!).

I know kmemleak does something kinda sorta like this, but that's a
special-purpose debugging tool, and this looks to be something more
general-purpose that'll get used in production.
Roman Gushchin April 19, 2022, 6:43 p.m. UTC | #11
On Tue, Apr 19, 2022 at 11:25:49AM -0700, Andrew Morton wrote:
> On Tue, 19 Apr 2022 10:52:44 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:
> 
> > > Unclear.  At the end of what output?
> > 
> > This is how it looks like when the output is too long:
> > 
> > [root@eth50-1 sb-btrfs-24]# cat count_memcg
> > 1 226
> > 20 96
> > 53 811
> > 2429 2
> > 218 13
> > 581 29
> > 911 124
> > 1010 3
> > 1043 1
> > 1076 1
> > 1241 60
> > 1274 7
> > 1307 39
> > 1340 3
> > 1406 14
> > 1439 63
> > 1472 54
> > 1505 8
> > 1538 1
> > 1571 6
> > 1604 39
> > 1637 9
> > 1670 8
> > 1703 4
> > 1736 1094
> > 1802 2
> > 1868 2
> > 1901 52
> > 1934 592
> > 1967 32
> > 			< CUT >
> > 18797 1
> > 18830 1
> 
> We do that in-kernel?  Why?  That just makes parsers harder to write?
> If someone has issues then direct them at /usr/bin/less?

It comes from a sysfs limitation: it expects the output to fit into
PAGE_SIZE. If the number of cgroups (and nodes) is large, that's not
always possible. In theory something like the seq_file API should be used, but
I don't know how hard it is to mix it with the sysfs/debugfs API. I'll try to
figure this out.
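
For illustration, the constraint looks roughly like this in a sysfs ->show()
handler (hypothetical code, not the actual patch; next_memcg_count() stands in
for the real iteration):

  static ssize_t count_memcg_show(struct kobject *kobj,
                                  struct kobj_attribute *attr, char *buf)
  {
          unsigned long ino, count;
          ssize_t len = 0;

          while (next_memcg_count(&ino, &count)) {
                  /* buf is a single page: leave room for the "...\n" marker */
                  if (len >= PAGE_SIZE - 64) {
                          len += sysfs_emit_at(buf, len, "...\n");
                          break;
                  }
                  len += sysfs_emit_at(buf, len, "%lu %lu\n", ino, count);
          }
          return len;
  }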
Roman Gushchin April 19, 2022, 6:50 p.m. UTC | #12
On Tue, Apr 19, 2022 at 02:36:54PM -0400, Kent Overstreet wrote:
> On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> > 7) Don't display cgroups with less than 500 attached objects
> >   $ echo 500 > count_memcg
> >   $ cat count_memcg
> >     53 817
> >     1868 886
> >     2396 799
> >     2462 861
> > 
> > 8) Don't display cgroups with less than 500 attached objects (sum over all nodes)
> >   $ echo "500" > count_memcg_node
> >   $ cat count_memcg_node
> >     53 810 7
> >     1868 886 0
> >     2396 799 0
> >     2462 861 0
> > 
> > 9) Scan system/root shrinker
> >   $ cat count
> >     212
> >   $ echo 100 > scan
> >   $ cat scan
> >     97
> >   $ cat count
> >     115
> 
> This part seems entirely overengineered though and a really bad idea - can we
> please _not_ store query state in the kernel? It's not thread safe, and it seems
> like overengineering before we've done the basics (just getting this stuff in
> sysfs is a major improvement!).

Yes, it's not great, but I don't have a better idea yet. How else can we return
the number of freed objects? Do you suggest dropping this functionality
altogether, or are there other options I'm not seeing?

Counting again isn't a good option either: new objects could have been added to
the list during the scan.

Thanks!
Roman Gushchin April 19, 2022, 6:58 p.m. UTC | #13
On Tue, Apr 19, 2022 at 02:20:30PM -0400, Kent Overstreet wrote:
> On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> > There are 50+ different shrinkers in the kernel, many with their own bells and
> > whistles. Under the memory pressure the kernel applies some pressure on each of
> > them in the order of which they were created/registered in the system. Some
> > of them can contain only few objects, some can be quite large. Some can be
> > effective at reclaiming memory, some not.
> > 
> > The only existing debugging mechanism is a couple of tracepoints in
> > do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> > covering everything though: shrinkers which report 0 objects will never show up,
> > there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> > scan function, which is not always enough (e.g. hard to guess which super
> > block's shrinker it is having only "super_cache_scan"). They are a passive
> > mechanism: there is no way to call into counting and scanning of an individual
> > shrinker and profile it.
> > 
> > To provide a better visibility and debug options for memory shrinkers
> > this patchset introduces a /sys/kernel/shrinker interface, to some extent
> > similar to /sys/kernel/slab.
> > 
> > For each shrinker registered in the system a folder is created. The folder
> > contains "count" and "scan" files, which allow to trigger count_objects()
> > and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> > count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> > and scan_memcg_node are additionally provided. They allow to get per-memcg
> > and/or per-node object count and shrink only a specific memcg/node.
> 
> Cool!
> 
> I've been starting to sketch out some shrinker improvements of my own, perhaps
> we could combine efforts.

Thanks! Absolutely!

> The issue I've been targeting is that when we hit an
> OOM, we currently don't get a lot of useful information - shrinkers ought to be
> included, and we really want information on shrinker's internal state (e.g.
> object dirtyness) if we're to have a chance at understanding why memory isn't
> getting reclaimed.
> 
> https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text
> 
> This adds a .to_text() method - a pretty-printer - that shrinkers can
> implement, and then on OOM we report on the top 10 shrinkers by memory usage, in
> sorted order.

We must be really careful about describing what's allowed and not allowed
in these callbacks. In-kernel OOM handling is a last-resort mechanism and it
should be able to make forward progress in really nasty circumstances. So
there are significant (and not very well described) limitations on what can
be done from the OOM context.

> 
> Another thing I'd like to do is have shrinkers report usage not just in object
> counts but in bytes; I think it should be obvious why that's desirable.

I totally agree, it's actually on my short-term todo list.

> 
> Maybe we could have a memory-reporting-and-shrinker-improvements session at LSF?
> I'd love to do some collective brainstorming and get some real momementum going
> in this area.

Would be really nice! I'm planning to work on improving shrinkers and gathering
ideas and problems, so having a discussion would be really great.

Thanks!
Kent Overstreet April 19, 2022, 7:46 p.m. UTC | #14
On Tue, Apr 19, 2022 at 11:58:00AM -0700, Roman Gushchin wrote:
> On Tue, Apr 19, 2022 at 02:20:30PM -0400, Kent Overstreet wrote:
> > On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> > > There are 50+ different shrinkers in the kernel, many with their own bells and
> > > whistles. Under the memory pressure the kernel applies some pressure on each of
> > > them in the order of which they were created/registered in the system. Some
> > > of them can contain only few objects, some can be quite large. Some can be
> > > effective at reclaiming memory, some not.
> > > 
> > > The only existing debugging mechanism is a couple of tracepoints in
> > > do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> > > covering everything though: shrinkers which report 0 objects will never show up,
> > > there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> > > scan function, which is not always enough (e.g. hard to guess which super
> > > block's shrinker it is having only "super_cache_scan"). They are a passive
> > > mechanism: there is no way to call into counting and scanning of an individual
> > > shrinker and profile it.
> > > 
> > > To provide a better visibility and debug options for memory shrinkers
> > > this patchset introduces a /sys/kernel/shrinker interface, to some extent
> > > similar to /sys/kernel/slab.
> > > 
> > > For each shrinker registered in the system a folder is created. The folder
> > > contains "count" and "scan" files, which allow to trigger count_objects()
> > > and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> > > count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> > > and scan_memcg_node are additionally provided. They allow to get per-memcg
> > > and/or per-node object count and shrink only a specific memcg/node.
> > 
> > Cool!
> > 
> > I've been starting to sketch out some shrinker improvements of my own, perhaps
> > we could combine efforts.
> 
> Thanks! Absolutely!
> 
> > The issue I've been targeting is that when we hit an
> > OOM, we currently don't get a lot of useful information - shrinkers ought to be
> > included, and we really want information on shrinker's internal state (e.g.
> > object dirtyness) if we're to have a chance at understanding why memory isn't
> > getting reclaimed.
> > 
> > https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text
> > 
> > This adds a .to_text() method - a pretty-printer - that shrinkers can
> > implement, and then on OOM we report on the top 10 shrinkers by memory usage, in
> > sorted order.
> 
> We must be really careful with describing what's allowed and not allowed
> by these callbacks. In-kernel OOM is the last-resort mechanism and it should
> be able to make forward progress in really nasty circumstances. So there are
> significant (and not very well described) limitations on what can be done
> from the oom context.

Yep. The only "interesting" thing my patches add is that we heap-allocate the
strings the .to_text methods generate (which is good! it means they can be used
both for printing to the console and by the sysfs code). Memory allocation
failure here is hardly the end of the world; those messages will just get
truncated, and I'm also going to mempool-ify printbufs (might do that today).

> > Another thing I'd like to do is have shrinkers report usage not just in object
> > counts but in bytes; I think it should be obvious why that's desirable.
> 
> I totally agree, it's actually on my short-term todo list.

Wonderful. A request I often get is for bcachefs's caches to show up as cached
memory via the free command - a perfectly reasonable request - and reporting
byte counts would make this possible.
Kent Overstreet April 19, 2022, 9:10 p.m. UTC | #15
On Tue, Apr 19, 2022 at 11:50:45AM -0700, Roman Gushchin wrote:
> On Tue, Apr 19, 2022 at 02:36:54PM -0400, Kent Overstreet wrote:
> > On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> > > 7) Don't display cgroups with less than 500 attached objects
> > >   $ echo 500 > count_memcg
> > >   $ cat count_memcg
> > >     53 817
> > >     1868 886
> > >     2396 799
> > >     2462 861
> > > 
> > > 8) Don't display cgroups with less than 500 attached objects (sum over all nodes)
> > >   $ echo "500" > count_memcg_node
> > >   $ cat count_memcg_node
> > >     53 810 7
> > >     1868 886 0
> > >     2396 799 0
> > >     2462 861 0
> > > 
> > > 9) Scan system/root shrinker
> > >   $ cat count
> > >     212
> > >   $ echo 100 > scan
> > >   $ cat scan
> > >     97
> > >   $ cat count
> > >     115
> > 
> > This part seems entirely overengineered though and a really bad idea - can we
> > please _not_ store query state in the kernel? It's not thread safe, and it seems
> > like overengineering before we've done the basics (just getting this stuff in
> > sysfs is a major improvement!).
> 
> Yes, it's not great, but I don't have a better idea yet. How to return the number
> of freed objects? Do you suggest to drop this functionality at all or there are
> other options I'm not seeing?

I'd just drop all of the stateful stuff - or add an ioctl interface.
Yang Shi April 20, 2022, 10:24 p.m. UTC | #16
On Fri, Apr 15, 2022 at 5:28 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> There are 50+ different shrinkers in the kernel, many with their own bells and
> whistles. Under the memory pressure the kernel applies some pressure on each of
> them in the order of which they were created/registered in the system. Some
> of them can contain only few objects, some can be quite large. Some can be
> effective at reclaiming memory, some not.
>
> The only existing debugging mechanism is a couple of tracepoints in
> do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> covering everything though: shrinkers which report 0 objects will never show up,
> there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> scan function, which is not always enough (e.g. hard to guess which super
> block's shrinker it is having only "super_cache_scan"). They are a passive
> mechanism: there is no way to call into counting and scanning of an individual
> shrinker and profile it.
>
> To provide a better visibility and debug options for memory shrinkers
> this patchset introduces a /sys/kernel/shrinker interface, to some extent
> similar to /sys/kernel/slab.
>
> For each shrinker registered in the system a folder is created. The folder
> contains "count" and "scan" files, which allow to trigger count_objects()
> and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> and scan_memcg_node are additionally provided. They allow to get per-memcg
> and/or per-node object count and shrink only a specific memcg/node.
>
> To make debugging more pleasant, the patchset also names all shrinkers,
> so that sysfs entries can have more meaningful names.
>
> Usage examples:

Thanks, Roman. A follow-up question: why do we have to implement this
in the kernel if we just count the objects? It seems userspace tools could
achieve it too, for example drgn :-). Actually I did write a drgn
script for debugging a problem a few months ago, which iterates over a
specific memcg's lru_list to count the objects by their state.

>
> 1) List registered shrinkers:
>   $ cd /sys/kernel/shrinker/
>   $ ls
>     dqcache-16          sb-cgroup2-30    sb-hugetlbfs-33  sb-proc-41       sb-selinuxfs-22  sb-tmpfs-40    sb-zsmalloc-19
>     kfree_rcu-0         sb-configfs-23   sb-iomem-12      sb-proc-44       sb-sockfs-8      sb-tmpfs-42    shadow-18
>     sb-aio-20           sb-dax-11        sb-mqueue-21     sb-proc-45       sb-sysfs-26      sb-tmpfs-43    thp_deferred_split-10
>     sb-anon_inodefs-15  sb-debugfs-7     sb-nsfs-4        sb-proc-47       sb-tmpfs-1       sb-tmpfs-46    thp_zero-9
>     sb-bdev-3           sb-devpts-28     sb-pipefs-14     sb-pstore-31     sb-tmpfs-27      sb-tmpfs-49    xfs_buf-37
>     sb-bpf-32           sb-devtmpfs-5    sb-proc-25       sb-rootfs-2      sb-tmpfs-29      sb-tracefs-13  xfs_inodegc-38
>     sb-btrfs-24         sb-hugetlbfs-17  sb-proc-39       sb-securityfs-6  sb-tmpfs-35      sb-xfs-36      zspool-34
>
> 2) Get information about a specific shrinker:
>   $ cd sb-btrfs-24/
>   $ ls
>     count  count_memcg  count_memcg_node  count_node  scan  scan_memcg  scan_memcg_node  scan_node
>
> 3) Count objects on the system/root cgroup level
>   $ cat count
>     212
>
> 4) Count objects on the system/root cgroup level per numa node (on a 2-node machine)
>   $ cat count_node
>     209 3
>
> 5) Count objects for each memcg (output format: cgroup inode, count)
>   $ cat count_memcg
>     1 212
>     20 96
>     53 817
>     2297 2
>     218 13
>     581 30
>     911 124
>     <CUT>
>
> 6) Same but with a per-node output
>   $ cat count_memcg_node
>     1 209 3
>     20 96 0
>     53 810 7
>     2297 2 0
>     218 13 0
>     581 30 0
>     911 124 0
>     <CUT>
>
> 7) Don't display cgroups with less than 500 attached objects
>   $ echo 500 > count_memcg
>   $ cat count_memcg
>     53 817
>     1868 886
>     2396 799
>     2462 861
>
> 8) Don't display cgroups with less than 500 attached objects (sum over all nodes)
>   $ echo "500" > count_memcg_node
>   $ cat count_memcg_node
>     53 810 7
>     1868 886 0
>     2396 799 0
>     2462 861 0
>
> 9) Scan system/root shrinker
>   $ cat count
>     212
>   $ echo 100 > scan
>   $ cat scan
>     97
>   $ cat count
>     115
>
> 10) Scan individual memcg
>   $ echo "1868 500" > scan_memcg
>   $ cat scan_memcg
>     193
>
> 11) Scan individual node
>   $ echo "1 200" > scan_node
>   $ cat scan_node
>     2
>
> 12) Scan individual memcg and node
>   $ echo "1868 0 500" > scan_memcg_node
>   $ cat scan_memcg_node
>     435
>
> If the output doesn't fit into a single page, "...\n" is printed at the end of
> output.
>
>
> Roman Gushchin (5):
>   mm: introduce sysfs interface for debugging kernel shrinker
>   mm: memcontrol: introduce mem_cgroup_ino() and
>     mem_cgroup_get_from_ino()
>   mm: introduce memcg interfaces for shrinker sysfs
>   mm: introduce numa interfaces for shrinker sysfs
>   mm: provide shrinkers with names
>
>  arch/x86/kvm/mmu/mmu.c                        |   2 +-
>  drivers/android/binder_alloc.c                |   2 +-
>  drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |   3 +-
>  drivers/gpu/drm/msm/msm_gem_shrinker.c        |   2 +-
>  .../gpu/drm/panfrost/panfrost_gem_shrinker.c  |   2 +-
>  drivers/gpu/drm/ttm/ttm_pool.c                |   2 +-
>  drivers/md/bcache/btree.c                     |   2 +-
>  drivers/md/dm-bufio.c                         |   2 +-
>  drivers/md/dm-zoned-metadata.c                |   2 +-
>  drivers/md/raid5.c                            |   2 +-
>  drivers/misc/vmw_balloon.c                    |   2 +-
>  drivers/virtio/virtio_balloon.c               |   2 +-
>  drivers/xen/xenbus/xenbus_probe_backend.c     |   2 +-
>  fs/erofs/utils.c                              |   2 +-
>  fs/ext4/extents_status.c                      |   3 +-
>  fs/f2fs/super.c                               |   2 +-
>  fs/gfs2/glock.c                               |   2 +-
>  fs/gfs2/main.c                                |   2 +-
>  fs/jbd2/journal.c                             |   2 +-
>  fs/mbcache.c                                  |   2 +-
>  fs/nfs/nfs42xattr.c                           |   7 +-
>  fs/nfs/super.c                                |   2 +-
>  fs/nfsd/filecache.c                           |   2 +-
>  fs/nfsd/nfscache.c                            |   2 +-
>  fs/quota/dquot.c                              |   2 +-
>  fs/super.c                                    |   2 +-
>  fs/ubifs/super.c                              |   2 +-
>  fs/xfs/xfs_buf.c                              |   2 +-
>  fs/xfs/xfs_icache.c                           |   2 +-
>  fs/xfs/xfs_qm.c                               |   2 +-
>  include/linux/memcontrol.h                    |   9 +
>  include/linux/shrinker.h                      |  25 +-
>  kernel/rcu/tree.c                             |   2 +-
>  lib/Kconfig.debug                             |   9 +
>  mm/Makefile                                   |   1 +
>  mm/huge_memory.c                              |   4 +-
>  mm/memcontrol.c                               |  23 +
>  mm/shrinker_debug.c                           | 792 ++++++++++++++++++
>  mm/vmscan.c                                   |  66 +-
>  mm/workingset.c                               |   2 +-
>  mm/zsmalloc.c                                 |   2 +-
>  net/sunrpc/auth.c                             |   2 +-
>  42 files changed, 957 insertions(+), 47 deletions(-)
>  create mode 100644 mm/shrinker_debug.c
>
> --
> 2.35.1
>
Roman Gushchin April 20, 2022, 11:23 p.m. UTC | #17
On Wed, Apr 20, 2022 at 03:24:49PM -0700, Yang Shi wrote:
> On Fri, Apr 15, 2022 at 5:28 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > There are 50+ different shrinkers in the kernel, many with their own bells and
> > whistles. Under the memory pressure the kernel applies some pressure on each of
> > them in the order of which they were created/registered in the system. Some
> > of them can contain only few objects, some can be quite large. Some can be
> > effective at reclaiming memory, some not.
> >
> > The only existing debugging mechanism is a couple of tracepoints in
> > do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't
> > covering everything though: shrinkers which report 0 objects will never show up,
> > there is no support for memcg-aware shrinkers. Shrinkers are identified by their
> > scan function, which is not always enough (e.g. hard to guess which super
> > block's shrinker it is having only "super_cache_scan"). They are a passive
> > mechanism: there is no way to call into counting and scanning of an individual
> > shrinker and profile it.
> >
> > To provide a better visibility and debug options for memory shrinkers
> > this patchset introduces a /sys/kernel/shrinker interface, to some extent
> > similar to /sys/kernel/slab.
> >
> > For each shrinker registered in the system a folder is created. The folder
> > contains "count" and "scan" files, which allow to trigger count_objects()
> > and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> > count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> > and scan_memcg_node are additionally provided. They allow to get per-memcg
> > and/or per-node object count and shrink only a specific memcg/node.
> >
> > To make debugging more pleasant, the patchset also names all shrinkers,
> > so that sysfs entries can have more meaningful names.
> >
> > Usage examples:
> 
> Thanks, Roman. A follow-up question, why do we have to implement this
> in kernel if we just count the objects? It seems userspace tools could
> achieve it too, for example, drgn :-). Actually I did write a drgn
> script for debugging a problem a few months ago, which iterates
> specific memcg's lru_list to count the objects by their state.

Good question! It's because not all shrinkers are lru_list-based,
and even some lru_list-based ones implement custom logic on top of it,
e.g. shadow nodes. So there is no simple way to get the count from
a generic shrinker.

Also I want to be able to reclaim individual shrinkers from userspace
(e.g. to profile how effective the shrinking is).
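
For instance, a trivial userspace profiler could look like this (a
hypothetical sketch; it assumes the RFC's sysfs layout and that writing to
"scan" performs the scan synchronously):

  #include <fcntl.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/sys/kernel/shrinker/sb-btrfs-24/scan", O_WRONLY);
          struct timespec t0, t1;

          if (fd < 0)
                  return 1;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          write(fd, "100", 3);            /* try to reclaim 100 objects */
          clock_gettime(CLOCK_MONOTONIC, &t1);

          printf("scan took %.3f ms\n",
                 (t1.tv_sec - t0.tv_sec) * 1e3 +
                 (t1.tv_nsec - t0.tv_nsec) / 1e6);
          close(fd);
          return 0;
  }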

Thanks!