Message ID: 20220416002756.4087977-1-roman.gushchin@linux.dev (mailing list archive)
Series: mm: introduce shrinker sysfs interface
On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> There are 50+ different shrinkers in the kernel, many with their own bells and
> whistles. Under memory pressure the kernel applies some pressure on each of
> them in the order in which they were created/registered in the system. Some
> of them can contain only a few objects, some can be quite large. Some can be
> effective at reclaiming memory, some not.
>
> The only existing debugging mechanism is a couple of tracepoints in
> do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They don't
> cover everything though: shrinkers which report 0 objects never show up, and
> there is no support for memcg-aware shrinkers. Shrinkers are identified by
> their scan function, which is not always enough (e.g. it's hard to guess which
> super block's shrinker it is when all you have is "super_cache_scan"). They
> are a passive mechanism: there is no way to call into the counting and
> scanning of an individual shrinker and profile it.
>
> To provide better visibility and debug options for memory shrinkers
> this patchset introduces a /sys/kernel/shrinker interface, to some extent
> similar to /sys/kernel/slab.

Wouldn't debugfs better fit the purpose of shrinker debugging?
On Mon, Apr 18, 2022 at 12:27:36PM +0300, Mike Rapoport wrote:
> On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> > There are 50+ different shrinkers in the kernel, many with their own bells and
> > whistles.
[...]
> > To provide better visibility and debug options for memory shrinkers
> > this patchset introduces a /sys/kernel/shrinker interface, to some extent
> > similar to /sys/kernel/slab.
>
> Wouldn't debugfs better fit the purpose of shrinker debugging?

I think sysfs fits better, but it's not a very strong opinion.

Even though the interface is likely not very useful for the general
public, big cloud instances might want to enable it to gather statistics
(it's certainly what we're going to do at Facebook) and to provide
additional data when something is off. They might not have debugfs
mounted. And it's really similar to /sys/kernel/slab.

Are there any reasons why debugfs is preferable?

Thanks!
On Fri, 15 Apr 2022 17:27:51 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:

[...]
> For each shrinker registered in the system a folder is created.

Please, "directory".

> The folder
> contains "count" and "scan" files, which allow triggering the count_objects()
> and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> and scan_memcg_node are additionally provided. They allow getting per-memcg
> and/or per-node object counts and shrinking only a specific memcg/node.
>
> To make debugging more pleasant, the patchset also names all shrinkers,
> so that sysfs entries can have more meaningful names.

I also was wondering "why not debugfs".

> Usage examples:
>
> ...
>
> If the output doesn't fit into a single page, "...\n" is printed at the end of
> the output.

Unclear. At the end of what output?

> Roman Gushchin (5):
>   mm: introduce sysfs interface for debugging kernel shrinker
>   mm: memcontrol: introduce mem_cgroup_ino() and
>     mem_cgroup_get_from_ino()
>   mm: introduce memcg interfaces for shrinker sysfs
>   mm: introduce numa interfaces for shrinker sysfs
>   mm: provide shrinkers with names
>
>  arch/x86/kvm/mmu/mmu.c | 2 +-
>  ...

Nothing under Documentation/!
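For context, the files discussed above map directly onto the two callbacks every
shrinker implements. Below is a minimal sketch of a named, memcg- and NUMA-aware
shrinker; my_cache_count() and my_cache_reclaim() are hypothetical helpers, and
the name argument to register_shrinker() assumes the naming patch in this series
(the mainline API at the time took no name):

#include <linux/init.h>
#include <linux/shrinker.h>

static unsigned long my_count(struct shrinker *s, struct shrink_control *sc)
{
	/* How many objects could be freed for this node/memcg right now? */
	return my_cache_count(sc->nid, sc->memcg);	/* hypothetical helper */
}

static unsigned long my_scan(struct shrinker *s, struct shrink_control *sc)
{
	/* Try to free up to sc->nr_to_scan objects; report how many were freed. */
	return my_cache_reclaim(sc->nr_to_scan, sc->nid, sc->memcg); /* hypothetical */
}

static struct shrinker my_shrinker = {
	.count_objects	= my_count,
	.scan_objects	= my_scan,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE,
};

static int __init my_cache_init(void)
{
	/* With this patchset, the entry would show up as e.g. "my-cache-<id>". */
	return register_shrinker(&my_shrinker, "my-cache");
}

Reading the proposed "count" file would then invoke my_count(), and writing a
number to "scan" would invoke my_scan() with that nr_to_scan.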
On Mon, Apr 18, 2022 at 10:27:34AM -0700, Roman Gushchin wrote:
> On Mon, Apr 18, 2022 at 12:27:36PM +0300, Mike Rapoport wrote:
> > Wouldn't debugfs better fit the purpose of shrinker debugging?
>
> I think sysfs fits better, but it's not a very strong opinion.
>
> Even though the interface is likely not very useful for the general
> public, big cloud instances might want to enable it to gather statistics
> (it's certainly what we're going to do at Facebook) and to provide
> additional data when something is off. They might not have debugfs
> mounted. And it's really similar to /sys/kernel/slab.

And there is also the similar /proc/vmallocinfo, so why not /proc/shrinker? ;-)

I suspect slab ended up in sysfs because nobody suggested using debugfs
back then. I've been able to track the transition from /proc/slabinfo to
/proc/slubinfo to /sys/kernel/slab, but could not find why Christoph chose
sysfs in the end.

> Are there any reasons why debugfs is preferable?

debugfs is more flexible because it's not a stable kernel ABI, so if there
is ever a need/desire to change the layout and content of the files, with
debugfs it can be done more easily.

Is it a real problem for Facebook to mount debugfs? ;-)

> Thanks!
On Mon, Apr 18, 2022 at 09:27:09PM -0700, Andrew Morton wrote:
> On Fri, 15 Apr 2022 17:27:51 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> > For each shrinker registered in the system a folder is created.
>
> Please, "directory".

Of course, sorry :)

> > The folder
> > contains "count" and "scan" files, which allow triggering the count_objects()
> > and scan_objects() callbacks.
[...]
> > To make debugging more pleasant, the patchset also names all shrinkers,
> > so that sysfs entries can have more meaningful names.
>
> I also was wondering "why not debugfs".

Fair enough, moving to debugfs in v1.

> > If the output doesn't fit into a single page, "...\n" is printed at the end of
> > the output.
>
> Unclear. At the end of what output?

This is how it looks when the output is too long:

[root@eth50-1 sb-btrfs-24]# cat count_memcg
1 226
20 96
53 811
2429 2
218 13
581 29
911 124
1010 3
1043 1
1076 1
1241 60
1274 7
1307 39
1340 3
1406 14
1439 63
1472 54
1505 8
1538 1
1571 6
1604 39
1637 9
1670 8
1703 4
1736 1094
1802 2
1868 2
1901 52
1934 592
1967 32
< CUT >
18797 1
18830 1
18863 1
18896 1
18929 1
18962 1
18995 1
19028 1
19061 1
19094 1
19127 1
19160 1
19193 1
...

I'll try to make it more obvious from the description.

> > Roman Gushchin (5):
> >   mm: introduce sysfs interface for debugging kernel shrinker
[...]
>
> Nothing under Documentation/!

I planned to add it after the rfc version. Will do.

Thank you for taking a look!
On Tue, Apr 19, 2022 at 09:33:48AM +0300, Mike Rapoport wrote:
> On Mon, Apr 18, 2022 at 10:27:34AM -0700, Roman Gushchin wrote:
> > Are there any reasons why debugfs is preferable?
>
> debugfs is more flexible because it's not a stable kernel ABI, so if there
> is ever a need/desire to change the layout and content of the files, with
> debugfs it can be done more easily.
>
> Is it a real problem for Facebook to mount debugfs? ;-)

Fair enough, switching to debugfs in the next version.

Thanks!
On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
[...]
> For each shrinker registered in the system a folder is created. The folder
> contains "count" and "scan" files, which allow triggering the count_objects()
> and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers
> count_memcg, scan_memcg, count_node, scan_node, count_memcg_node
> and scan_memcg_node are additionally provided. They allow getting per-memcg
> and/or per-node object counts and shrinking only a specific memcg/node.

Cool!

I've been starting to sketch out some shrinker improvements of my own; perhaps
we could combine efforts. The issue I've been targeting is that when we hit an
OOM, we currently don't get a lot of useful information - shrinkers ought to be
included, and we really want information on a shrinker's internal state (e.g.
object dirtiness) if we're to have a chance at understanding why memory isn't
getting reclaimed.

https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text

This adds a .to_text() method - a pretty-printer - that shrinkers can
implement; on OOM we then report on the top 10 shrinkers by memory usage, in
sorted order.

Another thing I'd like to do is have shrinkers report usage not just in object
counts but in bytes; I think it should be obvious why that's desirable.

Maybe we could have a memory-reporting-and-shrinker-improvements session at LSF?
I'd love to do some collective brainstorming and get some real momentum going
in this area.
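A rough sketch of the .to_text() idea, based only on the description above: the
linked branch uses its own "printbuf" strings, so a kernel seq_buf stands in
here, and the callback signature plus the my_cache fields are assumptions, not
the API from the linked tree:

#include <linux/seq_buf.h>
#include <linux/shrinker.h>

struct my_cache {
	struct shrinker	shrinker;
	unsigned long	nr_objects;
	unsigned long	nr_dirty;
	unsigned long	obj_size;
};

/* Hypothetical pretty-printer; emits internal state for OOM reports. */
static void my_shrinker_to_text(struct seq_buf *out, struct shrinker *s)
{
	struct my_cache *c = container_of(s, struct my_cache, shrinker);

	seq_buf_printf(out, "objects: %lu\n", c->nr_objects);
	seq_buf_printf(out, "dirty:   %lu\n", c->nr_dirty);
	seq_buf_printf(out, "bytes:   %lu\n", c->nr_objects * c->obj_size);
}

The key design point is that the output is a string buffer rather than direct
printk() calls, so the same callback can feed both the OOM report and a
sysfs/debugfs file.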
On Tue, 19 Apr 2022 10:52:44 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:

> > Unclear. At the end of what output?
>
> This is how it looks when the output is too long:
>
> [root@eth50-1 sb-btrfs-24]# cat count_memcg
> 1 226
> 20 96
> 53 811
[...]
> < CUT >
> 18797 1
> 18830 1

We do that in-kernel? Why? That just makes parsers harder to write?
If someone has issues then direct them at /usr/bin/less?
On Tue, Apr 19, 2022 at 10:52:44AM -0700, Roman Gushchin wrote:
> On Mon, Apr 18, 2022 at 09:27:09PM -0700, Andrew Morton wrote:
> > I also was wondering "why not debugfs".
>
> Fair enough, moving to debugfs in v1.

Thank you, that keeps me from complaining about how badly you were
abusing sysfs in this patchset :)
On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> 7) Don't display cgroups with less than 500 attached objects
> $ echo 500 > count_memcg
> $ cat count_memcg
> 53 817
> 1868 886
> 2396 799
> 2462 861
>
> 8) Don't display cgroups with less than 500 attached objects (sum over all nodes)
> $ echo "500" > count_memcg_node
> $ cat count_memcg_node
> 53 810 7
> 1868 886 0
> 2396 799 0
> 2462 861 0
>
> 9) Scan system/root shrinker
> $ cat count
> 212
> $ echo 100 > scan
> $ cat scan
> 97
> $ cat count
> 115

This part seems entirely overengineered, though, and a really bad idea - can we
please _not_ store query state in the kernel? It's not thread-safe, and it seems
like overengineering before we've done the basics (just getting this stuff into
sysfs is a major improvement!).

I know kmemleak does something kinda sorta like this, but that's a
special-purpose debugging tool, and this looks to be something more
general-purpose that'll get used in production.
On Tue, Apr 19, 2022 at 11:25:49AM -0700, Andrew Morton wrote:
> On Tue, 19 Apr 2022 10:52:44 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > This is how it looks when the output is too long:
[...]
> We do that in-kernel? Why? That just makes parsers harder to write?
> If someone has issues then direct them at /usr/bin/less?

It comes from a sysfs limitation: it expects the output to fit into a single
page (PAGE_SIZE). If the number of cgroups (and nodes) is large, that's not
always possible.

In theory something like the seq_file API should be used, but I don't know how
hard it is to mix it with the sysfs/debugfs API. I'll try to figure this out.
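With debugfs the seq_file combination is straightforward. A minimal sketch,
using a hypothetical static my_counts[] array in place of real per-memcg data:
the seq_file core re-invokes the show callback with a doubled buffer until the
output fits, so there is no one-page limit and no need for a "...\n" truncation
marker:

#include <linux/debugfs.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/seq_file.h>

struct memcg_count { unsigned long ino, count; };
static struct memcg_count my_counts[] = { {1, 212}, {20, 96}, {53, 817} };

static int count_memcg_show(struct seq_file *m, void *v)
{
	int i;

	/* One "cgroup-ino count" pair per line; the buffer grows as needed. */
	for (i = 0; i < ARRAY_SIZE(my_counts); i++)
		seq_printf(m, "%lu %lu\n", my_counts[i].ino, my_counts[i].count);
	return 0;
}
DEFINE_SHOW_ATTRIBUTE(count_memcg);

static int __init my_debugfs_init(void)
{
	struct dentry *dir = debugfs_create_dir("shrinker-example", NULL);

	debugfs_create_file("count_memcg", 0444, dir, NULL, &count_memcg_fops);
	return 0;
}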
On Tue, Apr 19, 2022 at 02:36:54PM -0400, Kent Overstreet wrote:
> On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
> > 9) Scan system/root shrinker
> > $ cat count
> > 212
> > $ echo 100 > scan
> > $ cat scan
> > 97
> > $ cat count
> > 115
>
> This part seems entirely overengineered, though, and a really bad idea - can we
> please _not_ store query state in the kernel? It's not thread-safe, and it seems
> like overengineering before we've done the basics (just getting this stuff into
> sysfs is a major improvement!).

Yes, it's not great, but I don't have a better idea yet. How else can we return
the number of freed objects? Do you suggest dropping this functionality
entirely, or are there other options I'm not seeing?

Counting again isn't a good option either: new objects could have been added to
the list during the scan.

Thanks!
On Tue, Apr 19, 2022 at 02:20:30PM -0400, Kent Overstreet wrote:
> On Fri, Apr 15, 2022 at 05:27:51PM -0700, Roman Gushchin wrote:
[...]
> Cool!
>
> I've been starting to sketch out some shrinker improvements of my own; perhaps
> we could combine efforts.

Thanks! Absolutely!

> The issue I've been targeting is that when we hit an
> OOM, we currently don't get a lot of useful information - shrinkers ought to be
> included, and we really want information on a shrinker's internal state (e.g.
> object dirtiness) if we're to have a chance at understanding why memory isn't
> getting reclaimed.
>
> https://evilpiepirate.org/git/bcachefs.git/log/?h=shrinker_to_text
>
> This adds a .to_text() method - a pretty-printer - that shrinkers can
> implement; on OOM we then report on the top 10 shrinkers by memory usage, in
> sorted order.

We must be really careful about describing what's allowed and not allowed
in these callbacks. The in-kernel OOM killer is a last-resort mechanism and it
should be able to make forward progress in really nasty circumstances. So there
are significant (and not very well described) limitations on what can be done
from the OOM context.

> Another thing I'd like to do is have shrinkers report usage not just in object
> counts but in bytes; I think it should be obvious why that's desirable.

I totally agree; it's actually on my short-term todo list.

> Maybe we could have a memory-reporting-and-shrinker-improvements session at LSF?
> I'd love to do some collective brainstorming and get some real momentum going
> in this area.

Would be really nice! I'm planning to work on improving shrinkers and gathering
ideas and problems, so having a discussion would be really great.

Thanks!
On Tue, Apr 19, 2022 at 11:58:00AM -0700, Roman Gushchin wrote:
> On Tue, Apr 19, 2022 at 02:20:30PM -0400, Kent Overstreet wrote:
> > This adds a .to_text() method - a pretty-printer - that shrinkers can
> > implement; on OOM we then report on the top 10 shrinkers by memory usage, in
> > sorted order.
>
> We must be really careful about describing what's allowed and not allowed
> in these callbacks. The in-kernel OOM killer is a last-resort mechanism and it
> should be able to make forward progress in really nasty circumstances. So there
> are significant (and not very well described) limitations on what can be done
> from the OOM context.

Yep. The only "interesting" thing my patches add is that we heap-allocate the
strings the .to_text() methods generate (which is good! it means they can be
used both for printing to the console and by sysfs code). Memory allocation
failure here is hardly the end of the world; those messages will just get
truncated, and I'm also going to mempool-ify printbufs (might do that today).

> > Another thing I'd like to do is have shrinkers report usage not just in object
> > counts but in bytes; I think it should be obvious why that's desirable.
>
> I totally agree; it's actually on my short-term todo list.

Wonderful.
A request I often get is for bcachefs's caches to show up as cached memory via the free command - a perfectly reasonable request - and reporting byte counts would make this possible.
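Neither the RFC nor the linked branch defines byte-based reporting, so the
following is purely a hypothetical sketch of what it could look like as an
optional extra callback; my_count() and struct my_cached_object are invented
stand-ins:

/*
 * Hypothetical, not part of this patchset: an optional ->count_bytes()
 * callback in struct shrinker, reporting freeable memory in bytes so
 * that tools like free(1) could account it as reclaimable cache.
 */
static unsigned long my_count_bytes(struct shrinker *s,
				    struct shrink_control *sc)
{
	/* With fixed-size objects, bytes are simply count * object size. */
	return my_count(s, sc) * sizeof(struct my_cached_object);
}

Caches with variable-sized objects (e.g. slab-backed dentries vs. large folios)
would instead track a byte counter directly, which is presumably why this has
to be a per-shrinker callback rather than generic arithmetic in vmscan.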
On Tue, Apr 19, 2022 at 11:50:45AM -0700, Roman Gushchin wrote:
> On Tue, Apr 19, 2022 at 02:36:54PM -0400, Kent Overstreet wrote:
> > This part seems entirely overengineered, though, and a really bad idea - can we
> > please _not_ store query state in the kernel? It's not thread-safe, and it seems
> > like overengineering before we've done the basics (just getting this stuff into
> > sysfs is a major improvement!).
>
> Yes, it's not great, but I don't have a better idea yet. How else can we return
> the number of freed objects? Do you suggest dropping this functionality
> entirely, or are there other options I'm not seeing?

I'd just drop all of the stateful stuff - or add an ioctl interface.
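To make the ioctl alternative concrete, here is a minimal sketch of how it
would avoid the stored-state problem: the freed count travels back in the same
call that triggered the scan, so concurrent callers can't clobber each other's
results. The structure layout, ioctl number, and do_shrink() helper are all
invented for illustration:

#include <linux/fs.h>
#include <linux/ioctl.h>
#include <linux/types.h>
#include <linux/uaccess.h>

struct shrinker_scan_req {
	__u64 memcg_ino;	/* 0 = root cgroup */
	__u32 nid;		/* NUMA node */
	__u64 nr_to_scan;	/* in */
	__u64 nr_freed;		/* out */
};

#define SHRINKER_IOC_SCAN _IOWR('S', 0x01, struct shrinker_scan_req)

static long shrinker_ioctl(struct file *file, unsigned int cmd,
			   unsigned long arg)
{
	struct shrinker_scan_req req;

	if (cmd != SHRINKER_IOC_SCAN)
		return -ENOTTY;
	if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
		return -EFAULT;

	/* Scan and report the freed count in one atomic round trip. */
	req.nr_freed = do_shrink(file->private_data, req.memcg_ino,
				 req.nid, req.nr_to_scan); /* hypothetical */

	if (copy_to_user((void __user *)arg, &req, sizeof(req)))
		return -EFAULT;
	return 0;
}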
On Fri, Apr 15, 2022 at 5:28 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> There are 50+ different shrinkers in the kernel, many with their own bells and
> whistles.
[...]
> To make debugging more pleasant, the patchset also names all shrinkers,
> so that sysfs entries can have more meaningful names.
>
> Usage examples:

Thanks, Roman. A follow-up question: why do we have to implement this in the
kernel if we just want to count objects? It seems userspace tools could achieve
it too, for example drgn :-). I actually did write a drgn script for debugging
a problem a few months ago, which iterates over a specific memcg's lru_list to
count the objects by their state.
> 1) List registered shrinkers:
>
> $ cd /sys/kernel/shrinker/
> $ ls
> dqcache-16          sb-cgroup2-30   sb-hugetlbfs-33  sb-proc-41       sb-selinuxfs-22  sb-tmpfs-40    sb-zsmalloc-19
> kfree_rcu-0         sb-configfs-23  sb-iomem-12      sb-proc-44       sb-sockfs-8      sb-tmpfs-42    shadow-18
> sb-aio-20           sb-dax-11       sb-mqueue-21     sb-proc-45       sb-sysfs-26      sb-tmpfs-43    thp_deferred_split-10
> sb-anon_inodefs-15  sb-debugfs-7    sb-nsfs-4        sb-proc-47       sb-tmpfs-1       sb-tmpfs-46    thp_zero-9
> sb-bdev-3           sb-devpts-28    sb-pipefs-14     sb-pstore-31     sb-tmpfs-27      sb-tmpfs-49    xfs_buf-37
> sb-bpf-32           sb-devtmpfs-5   sb-proc-25       sb-rootfs-2      sb-tmpfs-29      sb-tracefs-13  xfs_inodegc-38
> sb-btrfs-24         sb-hugetlbfs-17 sb-proc-39       sb-securityfs-6  sb-tmpfs-35      sb-xfs-36      zspool-34
>
> 2) Get information about a specific shrinker:
> $ cd sb-btrfs-24/
> $ ls
> count  count_memcg  count_memcg_node  count_node  scan  scan_memcg  scan_memcg_node  scan_node
>
> 3) Count objects on the system/root cgroup level
> $ cat count
> 212
>
> 4) Count objects on the system/root cgroup level per numa node (on a 2-node machine)
> $ cat count_node
> 209 3
>
> 5) Count objects for each memcg (output format: cgroup inode, count)
> $ cat count_memcg
> 1 212
> 20 96
> 53 817
> 2297 2
> 218 13
> 581 30
> 911 124
> <CUT>
>
> 6) Same but with a per-node output
> $ cat count_memcg_node
> 1 209 3
> 20 96 0
> 53 810 7
> 2297 2 0
> 218 13 0
> 581 30 0
> 911 124 0
> <CUT>
>
> 7) Don't display cgroups with less than 500 attached objects
> $ echo 500 > count_memcg
> $ cat count_memcg
> 53 817
> 1868 886
> 2396 799
> 2462 861
>
> 8) Don't display cgroups with less than 500 attached objects (sum over all nodes)
> $ echo "500" > count_memcg_node
> $ cat count_memcg_node
> 53 810 7
> 1868 886 0
> 2396 799 0
> 2462 861 0
>
> 9) Scan system/root shrinker
> $ cat count
> 212
> $ echo 100 > scan
> $ cat scan
> 97
> $ cat count
> 115
>
> 10) Scan individual memcg
> $ echo "1868 500" > scan_memcg
> $ cat scan_memcg
> 193
>
> 11) Scan individual node
> $ echo "1 200" > scan_node
> $ cat scan_node
> 2
>
> 12) Scan individual memcg and node
> $ echo "1868 0 500" > scan_memcg_node
> $ cat scan_memcg_node
> 435
>
> If the output doesn't fit into a single page, "...\n" is printed at the end of
> the output.
>
> Roman Gushchin (5):
>   mm: introduce sysfs interface for debugging kernel shrinker
>   mm: memcontrol: introduce mem_cgroup_ino() and
>     mem_cgroup_get_from_ino()
>   mm: introduce memcg interfaces for shrinker sysfs
>   mm: introduce numa interfaces for shrinker sysfs
>   mm: provide shrinkers with names
>
>  arch/x86/kvm/mmu/mmu.c                           |   2 +-
>  drivers/android/binder_alloc.c                   |   2 +-
>  drivers/gpu/drm/i915/gem/i915_gem_shrinker.c     |   3 +-
>  drivers/gpu/drm/msm/msm_gem_shrinker.c           |   2 +-
>  .../gpu/drm/panfrost/panfrost_gem_shrinker.c     |   2 +-
>  drivers/gpu/drm/ttm/ttm_pool.c                   |   2 +-
>  drivers/md/bcache/btree.c                        |   2 +-
>  drivers/md/dm-bufio.c                            |   2 +-
>  drivers/md/dm-zoned-metadata.c                   |   2 +-
>  drivers/md/raid5.c                               |   2 +-
>  drivers/misc/vmw_balloon.c                       |   2 +-
>  drivers/virtio/virtio_balloon.c                  |   2 +-
>  drivers/xen/xenbus/xenbus_probe_backend.c        |   2 +-
>  fs/erofs/utils.c                                 |   2 +-
>  fs/ext4/extents_status.c                         |   3 +-
>  fs/f2fs/super.c                                  |   2 +-
>  fs/gfs2/glock.c                                  |   2 +-
>  fs/gfs2/main.c                                   |   2 +-
>  fs/jbd2/journal.c                                |   2 +-
>  fs/mbcache.c                                     |   2 +-
>  fs/nfs/nfs42xattr.c                              |   7 +-
>  fs/nfs/super.c                                   |   2 +-
>  fs/nfsd/filecache.c                              |   2 +-
>  fs/nfsd/nfscache.c                               |   2 +-
>  fs/quota/dquot.c                                 |   2 +-
>  fs/super.c                                       |   2 +-
>  fs/ubifs/super.c                                 |   2 +-
>  fs/xfs/xfs_buf.c                                 |   2 +-
>  fs/xfs/xfs_icache.c                              |   2 +-
>  fs/xfs/xfs_qm.c                                  |   2 +-
>  include/linux/memcontrol.h                       |   9 +
>  include/linux/shrinker.h                         |  25 +-
>  kernel/rcu/tree.c                                |   2 +-
>  lib/Kconfig.debug                                |   9 +
>  mm/Makefile                                      |   1 +
>  mm/huge_memory.c                                 |   4 +-
>  mm/memcontrol.c                                  |  23 +
>  mm/shrinker_debug.c                              | 792 ++++++++++++++++++
>  mm/vmscan.c                                      |  66 +-
>  mm/workingset.c                                  |   2 +-
>  mm/zsmalloc.c                                    |   2 +-
>  net/sunrpc/auth.c                                |   2 +-
>  42 files changed, 957 insertions(+), 47 deletions(-)
>  create mode 100644 mm/shrinker_debug.c
>
> --
> 2.35.1
>
On Wed, Apr 20, 2022 at 03:24:49PM -0700, Yang Shi wrote:
> On Fri, Apr 15, 2022 at 5:28 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
[...]
> Thanks, Roman. A follow-up question: why do we have to implement this in the
> kernel if we just want to count objects? It seems userspace tools could achieve
> it too, for example drgn :-). I actually did write a drgn script for debugging
> a problem a few months ago, which iterates over a specific memcg's lru_list to
> count the objects by their state.

Good question! It's because not all shrinkers are lru_list-based, and even some
lru_list-based ones implement custom logic on top of it, e.g. shadow nodes. So
there is no simple way to get the count from a generic shrinker.

Also, I want to be able to reclaim individual shrinkers from userspace (e.g. to
profile how effective the shrinking is).

Thanks!
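This last point is the crux of the in-kernel argument: a "count" or "scan" read
ultimately has to call the shrinker's own callback, which a debugger-style tool
like drgn cannot do. A minimal sketch of what such a file's handler boils down
to; shrinker lookup and locking are omitted, and this is an illustration rather
than the patchset's actual mm/shrinker_debug.c code:

#include <linux/memcontrol.h>
#include <linux/shrinker.h>

/*
 * Build a shrink_control for the requested node/memcg and ask the
 * shrinker itself how many objects it could free. Custom shrinkers
 * (shadow nodes, zsmalloc pools, ...) all answer through this same
 * callback, so no lru_list walking is assumed.
 */
static unsigned long shrinker_count_one(struct shrinker *shrinker,
					struct mem_cgroup *memcg, int nid)
{
	struct shrink_control sc = {
		.gfp_mask = GFP_KERNEL,
		.nid	  = nid,
		.memcg	  = memcg,
	};

	return shrinker->count_objects(shrinker, &sc);
}

The scan files work the same way, additionally setting sc.nr_to_scan from the
value written by the user and calling ->scan_objects() instead.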