Message ID: 20220422202644.799732-1-roman.gushchin@linux.dev (mailing list archive)
Series: mm: introduce shrinker debugfs interface
On Fri, Apr 22, 2022 at 01:26:37PM -0700, Roman Gushchin wrote:
> There are 50+ different shrinkers in the kernel, many with their own bells and whistles. Under the memory pressure the kernel applies some pressure on each of them in the order of which they were created/registered in the system. Some of them can contain only few objects, some can be quite large. Some can be effective at reclaiming memory, some not.
>
> The only existing debugging mechanism is a couple of tracepoints in do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end. They aren't covering everything though: shrinkers which report 0 objects will never show up, there is no support for memcg-aware shrinkers. Shrinkers are identified by their scan function, which is not always enough (e.g. hard to guess which super block's shrinker it is having only "super_cache_scan").

In general, I've had no trouble identifying individual shrinker instances because I'm always looking at individual subsystem shrinker tracepoints, too. Hence I've almost always got the identification information in the traces I need: trace just the individual shrinker tracepoints, add a bit of sed/grep/awk, and I've got something I can feed to gnuplot or a python script to graph...

> They are a passive mechanism: there is no way to call into counting and scanning of an individual shrinker and profile it.

IDGI. Profiling shrinkers under ideal conditions when there isn't memory pressure is largely a useless exercise because execution patterns under memory pressure are vastly different.

All the problems with shrinkers show up when progress cannot be made as fast as memory reclaim wants memory to be reclaimed. How do you trigger priority windup causing large amounts of deferred processing because shrinkers are running in GFP_NOFS/GFP_NOIO context? How do you simulate objects getting dirtied in memory so they can't be immediately reclaimed so the shrinker can't make any progress at all until IO completes? How do you simulate the unbound concurrency that direct reclaim can drive into the shrinkers that causes massive lock contention on shared structures and locks that need to be accessed to free objects?

IOWs, if all you want to do is profile shrinkers running in the absence of memory pressure, then you can do that perfectly well with the existing 'echo 2 > /proc/sys/vm/drop_caches' mechanism. We don't need some complex debugfs API just to profile the shrinker behaviour.

So why do we need any of the complexity and potential for abuse that comes from exposing control of shrinkers directly to userspace like these patches do?

> To provide a better visibility and debug options for memory shrinkers this patchset introduces a /sys/kernel/debug/shrinker interface, to some extent similar to /sys/kernel/slab.

/sys/kernel/slab contains read-only usage information - it is analogous for visibility arguments, but it is not equivalent for the rest of the "active" functionality you want to add here....

> For each shrinker registered in the system a directory is created. The directory contains "count" and "scan" files, which allow to trigger count_objects() and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers count_memcg, scan_memcg, count_node, scan_node, count_memcg_node and scan_memcg_node are additionally provided. They allow to get per-memcg and/or per-node object count and shrink only a specific memcg/node.

Great, but why does the shrinker introspection interface need active scan control functions like these?
> To make debugging more pleasant, the patchset also names all shrinkers, so that debugfs entries can have more meaningful names.
>
> Usage examples:
>
> 1) List registered shrinkers:
> $ cd /sys/kernel/debug/shrinker/
> $ ls
> dqcache-16 sb-cgroup2-30 sb-hugetlbfs-33 sb-proc-41 sb-selinuxfs-22 sb-tmpfs-40 sb-zsmalloc-19
> kfree_rcu-0 sb-configfs-23 sb-iomem-12 sb-proc-44 sb-sockfs-8 sb-tmpfs-42 shadow-18
> sb-aio-20 sb-dax-11 sb-mqueue-21 sb-proc-45 sb-sysfs-26 sb-tmpfs-43 thp_deferred_split-10
> sb-anon_inodefs-15 sb-debugfs-7 sb-nsfs-4 sb-proc-47 sb-tmpfs-1 sb-tmpfs-46 thp_zero-9
> sb-bdev-3 sb-devpts-28 sb-pipefs-14 sb-pstore-31 sb-tmpfs-27 sb-tmpfs-49 xfs_buf-37
> sb-bpf-32 sb-devtmpfs-5 sb-proc-25 sb-rootfs-2 sb-tmpfs-29 sb-tracefs-13 xfs_inodegc-38
> sb-btrfs-24 sb-hugetlbfs-17 sb-proc-39 sb-securityfs-6 sb-tmpfs-35 sb-xfs-36 zspool-34

Ouch. That's not going to be useful for humans debugging a system as there's no way to cross-reference a "superblock" with an actual filesystem mount point. Nor is there any way to really know that all the shrinkers in one filesystem are related.

We normally solve this by ensuring that the fs-related object has the short bdev name appended to it. e.g:

$ pgrep xfs
1 I root 36 2 0 60 -20 - 0 - Apr19 ? 00:00:10 [kworker/0:1H-xfs-log/dm-3]
1 I root 679 2 0 60 -20 - 0 - Apr19 ? 00:00:00 [xfsalloc]
1 I root 680 2 0 60 -20 - 0 - Apr19 ? 00:00:00 [xfs_mru_cache]
1 I root 681 2 0 60 -20 - 0 - Apr19 ? 00:00:00 [xfs-buf/dm-1]
.....

Here we have a kworker process running log IO completion work on dm-3, two global workqueue rescuer tasks (alloc, mru) and a rescuer task for the xfs-buf workqueue on dm-1.

We need the same name discrimination for shrinker information here, too - just saying "this is an XFS superblock shrinker" is just not sufficient when there are hundreds of XFS mount points with a handful of shrinkers each.

> 2) Get information about a specific shrinker:
> $ cd sb-btrfs-24/
> $ ls
> count count_memcg count_memcg_node count_node scan scan_memcg scan_memcg_node scan_node
>
> 3) Count objects on the system/root cgroup level
> $ cat count
> 212
>
> 4) Count objects on the system/root cgroup level per numa node (on a 2-node machine)
> $ cat count_node
> 209 3

So a single space-separated line with a number per node?

When you have a few hundred nodes and hundreds of thousands of objects per node, we overrun the 4kB page size with a single line. What then?

> 5) Count objects for each memcg (output format: cgroup inode, count)
> $ cat count_memcg
> 1 212
> 20 96
> 53 817
> 2297 2
> 218 13
> 581 30
> 911 124
> <CUT>

What does "<CUT>" mean?

Also, this now iterates a separate memcg per line. A parser now needs to know the difference between count/count_node and count_memcg/count_memcg_node because they are subtly different file formats. These files should have the same format, otherwise it just creates needless complexity.

Indeed, why do we even need count/count_node? They are just the "index 1" memcg output, so are totally redundant.

> 6) Same but with a per-node output
> $ cat count_memcg_node
> 1 209 3
> 20 96 0
> 53 810 7
> 2297 2 0
> 218 13 0
> 581 30 0
> 911 124 0
> <CUT>

So now we have a hundred nodes in the machine and thousands of memcgs. And the information we want is in the numerically largest memcg that is last in the list. And we want to graph its behaviour over time at high resolution (say 1Hz). Now we burn huge amounts of CPU counting memcgs that we don't care about and then throw away most of the information.
That's highly inefficient and really doesn't scale.

[snip active scan interface]

This just seems like a solution looking for a problem to solve. Can you please describe the problem this infrastructure is going to solve?

Cheers,

Dave.
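For reference, the "count" and "scan" files under discussion map directly onto the two callbacks every shrinker already implements. A minimal sketch — with hypothetical "demo" names, and the pre-naming-patch register_shrinker() signature of the v5.18 era that this series modifies — looks roughly like this:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/shrinker.h>
#include <linux/atomic.h>

/* A bare counter standing in for a real LRU of cached objects. */
static atomic_long_t demo_nr_cached = ATOMIC_LONG_INIT(0);

/* The "count" side: how many objects could be freed right now? */
static unsigned long demo_count_objects(struct shrinker *shrink,
					struct shrink_control *sc)
{
	unsigned long nr = atomic_long_read(&demo_nr_cached);

	/* sc->nid and sc->memcg narrow the question for aware shrinkers */
	return nr ? nr : SHRINK_EMPTY;
}

/* The "scan" side: free up to sc->nr_to_scan objects, return number freed. */
static unsigned long demo_scan_objects(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	unsigned long freed;

	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;	/* wrong context: defer the work */

	freed = min_t(unsigned long, sc->nr_to_scan,
		      atomic_long_read(&demo_nr_cached));
	atomic_long_sub(freed, &demo_nr_cached);
	return freed;
}

static struct shrinker demo_shrinker = {
	.count_objects	= demo_count_objects,
	.scan_objects	= demo_scan_objects,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE,
};

static int __init demo_init(void)
{
	return register_shrinker(&demo_shrinker);
}

static void __exit demo_exit(void)
{
	unregister_shrinker(&demo_shrinker);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

The disagreement above is essentially about whether userspace should be able to invoke these two callbacks directly via debugfs, rather than only indirectly through memory pressure or drop_caches.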
On Tue, Apr 26, 2022 at 04:02:19PM +1000, Dave Chinner wrote:
> This just seems like a solution looking for a problem to solve. Can you please describe the problem this infrastructure is going to solve?

A point I was making over VC is that memcg is completely irrelevant to debugging most of these issues; all the issues we've been talking about can be easily reproduced in a single test VM without memcg. Yet we don't even have the tooling to debug the simple stuff. Why are we trying to make big and complicated stuff when we can't even debug the simple cases?

And I've been getting _really_ tired of the stock answer of "that use case isn't interesting to the big cloud providers".

A: If you're a Linux kernel developer at this level, you have earned a great deal of trust and it is incumbent upon you to be a good steward of the code you have been entrusted with, instead of just spending all your time chasing fat bonuses from your employer while ignoring what's good for the codebase as a whole. That's pissing all over the commons that came long before you and will hopefully still be around long after you.

B: Even aside from that, it's incredibly shortsighted and a poor use of time and resources. When I was at Google I saw, over and over again, people rushing to do something big and complicated and new because that was how they could get a promotion, instead of working on basic stuff like refactoring core IO paths (and it's been my experience over and over again that when you just try to make code saner and more understandable, you almost always find big performance improvements along the way... but that's not as exciting as rushing to find the biggest coolest optimization or all-the-bells-and-whistles interface). So yeah, this patchset screams of someone looking for a promotion to me.

Meanwhile, the status of visibility into the _basics_ of what goes on in MM is utter dogshit. There's just too many _basic_ questions that are a pain in the ass to answer - even just profiling memory usage by file:line number is a shitshow.

One thing that I run into a lot is that people rush to say "tracepoints!" for a lot of problems - but tracepoints aren't a good answer for a lot of problems because having them on all the time is problematic. What I would like to see is lighter-weight collection of statistics, and some basic library code for things like latency measurements of important operations broken out by quantiles, with rate & frequency - this is something that's helped in bcachefs. If anyone's interested, the code for that starts here: https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/bcachefs.h#n322

Specifically for shrinkers, I'd like it if we had rolling averages over the past few seconds for e.g. the _rate_ of objects requested to be freed vs. actually freed. If we collect those kinds of rate measurements (and perhaps latency too, to show stalls) at various places in the MM code, perhaps we'd be able to see what's getting stuck when we OOM. We should have the rate of objects getting added, too, and we should be collecting data from the list_lru code as well, like you were mentioning the other night.

And if we collect this data in such a way that it can be displayed in sysfs, but done with the to_text() methods I've been talking about, it'll also be trivial to include that in the show_mem() report when we OOM.

Anyways, that's my two cents....
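The shape of such a counter can be sketched in a few lines. The names below are hypothetical (this is not the bcachefs code linked above), and a per-sample decaying average stands in for the true time-windowed average a real version would want:

#include <linux/types.h>
#include <linux/spinlock.h>

/* Decaying averages of "requested to free" vs "actually freed" per scan. */
struct reclaim_rate_stats {
	spinlock_t	lock;
	u64		ewma_requested;	/* fixed point, 8 fractional bits */
	u64		ewma_freed;
};

/* new = old + (sample - old)/8: each sample carries a weight of 1/8 */
static inline u64 ewma_update(u64 old, u64 sample_fp)
{
	return old + (((s64)(sample_fp - old)) >> 3);
}

/* Call once per do_shrink_slab() invocation (or similar). */
static void reclaim_rate_account(struct reclaim_rate_stats *s,
				 u64 requested, u64 freed)
{
	spin_lock(&s->lock);
	s->ewma_requested = ewma_update(s->ewma_requested, requested << 8);
	s->ewma_freed     = ewma_update(s->ewma_freed, freed << 8);
	spin_unlock(&s->lock);
}

A widening gap between ewma_requested and ewma_freed is exactly the "shrinker can't keep up" signal being discussed, and it is cheap enough to leave on all the time, unlike tracepoints.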
I can't claim to have any brilliant insights here, but I hope Roman will start taking ideas from more people (and Dave's been a real wealth of information on this topic! I'd pick his brain if I were you, Roman).
On Tue, 26 Apr 2022 16:02:19 +1000 Dave Chinner wrote:

[quote of the cover letter and of Dave's opening remarks trimmed]

> IDGI. Profiling shrinkers under ideal conditions when there isn't memory pressure is largely a useless exercise because execution patterns under memory pressure are vastly different.

Well how many minutes, two or ten, does it take for kswapd to reclaim 100 xfs objects at DEF_PRIORITY-3?

> All the problems with shrinkers show up when progress cannot be made as fast as memory reclaim wants memory to be reclaimed. How do you trigger priority windup causing large amounts of deferred processing because shrinkers are running in GFP_NOFS/GFP_NOIO context? How do you simulate objects getting dirtied in memory so they can't be immediately reclaimed so the shrinker can't make any progress at all until IO completes? How do you simulate the unbound concurrency that direct reclaim can drive into the shrinkers that causes massive lock contention on shared structures and locks that need to be accessed to free objects?
>
> IOWs, if all you want to do is profile shrinkers running in the absence of memory pressure, then you can do that perfectly well with the existing 'echo 2 > /proc/sys/vm/drop_caches' mechanism. We don't need some complex debugfs API just to profile the shrinker behaviour.

Hm ... given ext4, what sense does xfs make? Or vice versa? Or given wine, why Coke? I want to see the minutes recycling ten ext4 objects with xfs intact before waking kswapd up.

Hillf
[remainder of quoted message trimmed]
On Tue, Apr 26, 2022 at 04:02:19PM +1000, Dave Chinner wrote:

[full quote trimmed down to the tail below]
> > 6) Same but with a per-node output
> > $ cat count_memcg_node
> > 1 209 3
> > 20 96 0
> > 53 810 7
> > 2297 2 0
> > 218 13 0
> > 581 30 0
> > 911 124 0
> > <CUT>
>
> So now we have a hundred nodes in the machine and thousands of memcgs. And the information we want is in the numerically largest memcg that is last in the list. And we want to graph its behaviour over time at high resolution (say 1Hz). Now we burn huge amounts of CPU counting memcgs that we don't care about and then throw away most of the information. That's highly inefficient and really doesn't scale.
>
> [snip active scan interface]
>
> This just seems like a solution looking for a problem to solve. Can you please describe the problem this infrastructure is going to solve?

Hi Dave!

Thank you for taking a look.

Can you please summarize your position? It's a bit unclear. You made a lot of good points about some details (e.g. shrinker naming, where I totally agree; machines with hundreds of nodes, etc.), then you said the active scanning is useless, and then that the whole thing is useless and we're fine with what we already have for debugging shrinkers.

My plan is to work on converting the shrinker API to bytes and to experiment with different LRU implementations. I find the ability to easily export statistics and other data (which doesn't exist now) via debugfs useful (and way more convenient than changing existing tracepoints), as well as the ability to trigger scanning of individual shrinkers. If nobody else sees any value here, I'm fine keeping these patches private; no reason to argue about the output format then.

If you (or somebody else) see some value in at least the "count" part, I'm happy to answer all questions and incorporate the feedback in the next version.

Thank you!
On Tue, Apr 26, 2022 at 09:41:34AM -0700, Roman Gushchin wrote:
> My plan is to work on converting the shrinker API to bytes and to experiment with different LRU implementations. I find the ability to easily export statistics and other data (which doesn't exist now) via debugfs useful (and way more convenient than changing existing tracepoints), as well as the ability to trigger scanning of individual shrinkers. If nobody else sees any value here, I'm fine keeping these patches private; no reason to argue about the output format then.

I don't think converting the shrinker API to bytes instead of object counts is such a great idea - that's going to introduce new rounding errors and new corner cases when we can't free the exact # of bytes requested. I was thinking along the lines of adding reporting for memory usage in bytes as either an additional thing .count_objects reports, or a new callback.
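A sketch of what that split could look like — purely hypothetical, not an existing kernel interface: object counts keep driving the reclaim arithmetic, and bytes are reported only for introspection:

/* Hypothetical sketch only - not an existing kernel interface. */
struct shrinker;
struct shrink_control;

struct shrinker_ops_sketch {
	/* unchanged: reclaim accounting stays in whole objects */
	unsigned long (*count_objects)(struct shrinker *s,
				       struct shrink_control *sc);
	unsigned long (*scan_objects)(struct shrinker *s,
				      struct shrink_control *sc);
	/*
	 * New and optional: approximate bytes pinned by this shrinker's
	 * objects. Introspection only - never used to decide how much
	 * to scan, so it introduces no new rounding corner cases.
	 */
	unsigned long (*count_bytes)(struct shrinker *s,
				     struct shrink_control *sc);
};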
On Tue, Apr 26, 2022 at 04:02:19PM +1000, Dave Chinner wrote:

[quote of the cover letter trimmed]

> In general, I've had no trouble identifying individual shrinker instances because I'm always looking at individual subsystem shrinker tracepoints, too. Hence I've almost always got the identification information in the traces I need: trace just the individual shrinker tracepoints, add a bit of sed/grep/awk, and I've got something I can feed to gnuplot or a python script to graph...

You spent a lot of time working on shrinkers in general and xfs-specific shrinkers in particular, no question there. But imagine someone who's not a core-mm developer and is adding a new shrinker.

> > They are a passive mechanism: there is no way to call into counting and scanning of an individual shrinker and profile it.
>
> IDGI. Profiling shrinkers under ideal conditions when there isn't memory pressure is largely a useless exercise because execution patterns under memory pressure are vastly different.
>
> All the problems with shrinkers show up when progress cannot be made as fast as memory reclaim wants memory to be reclaimed. How do you trigger priority windup causing large amounts of deferred processing because shrinkers are running in GFP_NOFS/GFP_NOIO context? How do you simulate objects getting dirtied in memory so they can't be immediately reclaimed so the shrinker can't make any progress at all until IO completes? How do you simulate the unbound concurrency that direct reclaim can drive into the shrinkers that causes massive lock contention on shared structures and locks that need to be accessed to free objects?

These are valid points and I assume we can find ways to emulate some of these conditions, e.g. by allowing scanning to run in a GFP_NOFS context. I thought about it but decided to leave it for further improvements.

> IOWs, if all you want to do is profile shrinkers running in the absence of memory pressure, then you can do that perfectly well with the existing 'echo 2 > /proc/sys/vm/drop_caches' mechanism. We don't need some complex debugfs API just to profile the shrinker behaviour.

And then we need to somehow separate the shrinkers in the result?

> So why do we need any of the complexity and potential for abuse that comes from exposing control of shrinkers directly to userspace like these patches do?

I feel like the added complexity is minimal (unlike slab's sysfs, for example). If the config option is off (by default), there is no additional risk or overhead either.
> > To provide a better visibility and debug options for memory shrinkers this patchset introduces a /sys/kernel/debug/shrinker interface, to some extent similar to /sys/kernel/slab.
>
> /sys/kernel/slab contains read-only usage information - it is analogous for visibility arguments, but it is not equivalent for the rest of the "active" functionality you want to add here....
>
> > For each shrinker registered in the system a directory is created. The directory contains "count" and "scan" files, which allow to trigger count_objects() and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers count_memcg, scan_memcg, count_node, scan_node, count_memcg_node and scan_memcg_node are additionally provided. They allow to get per-memcg and/or per-node object count and shrink only a specific memcg/node.
>
> Great, but why does the shrinker introspection interface need active scan control functions like these?

It makes testing of (new) shrinkers easier, for example. For instance, the shadow entries shrinker hides its associated objects by returning a 0 count most of the time (unless the total consumed memory exceeds a certain fraction of total memory). echo 2 > /proc/sys/vm/drop_caches won't even trigger the scanning.

> > To make debugging more pleasant, the patchset also names all shrinkers, so that debugfs entries can have more meaningful names.
> >
> > Usage examples:
> >
> > 1) List registered shrinkers:
> > $ cd /sys/kernel/debug/shrinker/
> > $ ls
> > [shrinker listing trimmed]
>
> Ouch. That's not going to be useful for humans debugging a system as there's no way to cross-reference a "superblock" with an actual filesystem mount point. Nor is there any way to really know that all the shrinkers in one filesystem are related.
>
> We normally solve this by ensuring that the fs-related object has the short bdev name appended to it. e.g:
>
> $ pgrep xfs
> [pgrep listing trimmed]
>
> Here we have a kworker process running log IO completion work on dm-3, two global workqueue rescuer tasks (alloc, mru) and a rescuer task for the xfs-buf workqueue on dm-1.
>
> We need the same name discrimination for shrinker information here, too - just saying "this is an XFS superblock shrinker" is just not sufficient when there are hundreds of XFS mount points with a handful of shrinkers each.

Good point, I think it's doable, and I really like it.
> > 2) Get information about a specific shrinker:
> > $ cd sb-btrfs-24/
> > $ ls
> > count count_memcg count_memcg_node count_node scan scan_memcg scan_memcg_node scan_node
> >
> > 3) Count objects on the system/root cgroup level
> > $ cat count
> > 212
> >
> > 4) Count objects on the system/root cgroup level per numa node (on a 2-node machine)
> > $ cat count_node
> > 209 3
>
> So a single space-separated line with a number per node?
>
> When you have a few hundred nodes and hundreds of thousands of objects per node, we overrun the 4kB page size with a single line. What then?

With the seq_buf API we don't have the 4kB limit, do we?

> > 5) Count objects for each memcg (output format: cgroup inode, count)
> > $ cat count_memcg
> > 1 212
> > 20 96
> > 53 817
> > 2297 2
> > 218 13
> > 581 30
> > 911 124
> > <CUT>
>
> What does "<CUT>" mean?

I've just shortened the lengthy output; it's not part of the original output.

> Also, this now iterates a separate memcg per line. A parser now needs to know the difference between count/count_node and count_memcg/count_memcg_node because they are subtly different file formats. These files should have the same format, otherwise it just creates needless complexity.
>
> Indeed, why do we even need count/count_node? They are just the "index 1" memcg output, so are totally redundant.

Ok, but then we need a flag to indicate that a shrinker is memcg-aware? I got your point, though, and I (partially) agree. But do you think we're fine with just one interface and don't need an aggregation over nodes? So just count_memcg_node?

> > 6) Same but with a per-node output
> > $ cat count_memcg_node
> > 1 209 3
> > 20 96 0
> > 53 810 7
> > 2297 2 0
> > 218 13 0
> > 581 30 0
> > 911 124 0
> > <CUT>
>
> So now we have a hundred nodes in the machine and thousands of memcgs. And the information we want is in the numerically largest memcg that is last in the list. And we want to graph its behaviour over time at high resolution (say 1Hz). Now we burn huge amounts of CPU counting memcgs that we don't care about and then throw away most of the information.

For this case we can provide an interface which allows specifying both node and memcg and getting the count. Personally I don't have a machine with hundreds of nodes, so it's not on my radar. If you find it useful, I'm happy to add it.

Thanks!

Roman
On Tue, Apr 26, 2022 at 12:05:30PM -0700, Roman Gushchin wrote:
> On Tue, Apr 26, 2022 at 04:02:19PM +1000, Dave Chinner wrote:
>
> [quote of the cover letter trimmed]
>
> > In general, I've had no trouble identifying individual shrinker instances because I'm always looking at individual subsystem shrinker tracepoints, too. Hence I've almost always got the identification information in the traces I need: trace just the individual shrinker tracepoints, add a bit of sed/grep/awk, and I've got something I can feed to gnuplot or a python script to graph...
>
> You spent a lot of time working on shrinkers in general and xfs-specific shrinkers in particular, no question there. But imagine someone who's not a core-mm developer and is adding a new shrinker.

At which point, they add their own subsystem introspection to understand what their shrinker implementation is doing.

You keep talking about shrinkers as if they exist in isolation from the actual subsystems that implement shrinkers. I think that is a fundamental mistake - you cannot understand how a shrinker is actually working without understanding something about what the subsystem that implements the shrinker actually does.

That is, the tracepoints in the shrinker code are largely supplemental to the subsystem introspection that is really determining the behaviour of the system. The shrinker infrastructure is only providing a measure of memory pressure - most shrinker implementations just don't care about what happens in the shrinker infrastructure - they just count and scan objects for reclaim, and mostly that just works for them.

> > > They are a passive mechanism: there is no way to call into counting and scanning of an individual shrinker and profile it.
> >
> > IDGI. Profiling shrinkers under ideal conditions when there isn't memory pressure is largely a useless exercise because execution patterns under memory pressure are vastly different.
> >
> > All the problems with shrinkers show up when progress cannot be made as fast as memory reclaim wants memory to be reclaimed. How do you trigger priority windup causing large amounts of deferred processing because shrinkers are running in GFP_NOFS/GFP_NOIO context? How do you simulate objects getting dirtied in memory so they can't be immediately reclaimed so the shrinker can't make any progress at all until IO completes? How do you simulate the unbound concurrency that direct reclaim can drive into the shrinkers that causes massive lock contention on shared structures and locks that need to be accessed to free objects?
> These are valid points and I assume we can find ways to emulate some of these conditions, e.g. by allowing scanning to run in a GFP_NOFS context. I thought about it but decided to leave it for further improvements.
>
> > IOWs, if all you want to do is profile shrinkers running in the absence of memory pressure, then you can do that perfectly well with the existing 'echo 2 > /proc/sys/vm/drop_caches' mechanism. We don't need some complex debugfs API just to profile the shrinker behaviour.
>
> And then we need to somehow separate the shrinkers in the result?

How do you profile a shrinker in the first place? You have to load up the slab cache/LRU before you have something you can actually profile. So it's as simple as 'drop caches, load up cache to be profiled, drop caches'. It's trivial to isolate the specific cache that got loaded up from the tracepoints, and then with other tracepoints and/or perf profiling, you can extract the profile of the shrinker that is doing all the reclaim work. Indeed, you can point perf at the specific task that drops the caches, and that is all you'll get in the profile.

If you can't isolate the specific shrinker profile from the output of such a simple test setup, then you should hand in your Kernel Developer badge....

> > So why do we need any of the complexity and potential for abuse that comes from exposing control of shrinkers directly to userspace like these patches do?
>
> I feel like the added complexity is minimal (unlike slab's sysfs, for example). If the config option is off (by default), there is no additional risk or overhead either.

No. The argument that "if we turn it off there's no overhead" means one of two things:

1. nobody turns it on and it never gets tested and so bitrots and is useless, or
2. distros all turn it on because some tool they ship or customer they ship to wants it.

Either way, hiding it behind a config option is not an acceptable solution for merging poorly thought out infrastructure.

> > > To provide a better visibility and debug options for memory shrinkers this patchset introduces a /sys/kernel/debug/shrinker interface, to some extent similar to /sys/kernel/slab.
> >
> > /sys/kernel/slab contains read-only usage information - it is analogous for visibility arguments, but it is not equivalent for the rest of the "active" functionality you want to add here....
> >
> > > For each shrinker registered in the system a directory is created. The directory contains "count" and "scan" files, which allow to trigger count_objects() and scan_objects() callbacks. For memcg-aware and numa-aware shrinkers count_memcg, scan_memcg, count_node, scan_node, count_memcg_node and scan_memcg_node are additionally provided. They allow to get per-memcg and/or per-node object count and shrink only a specific memcg/node.
> >
> > Great, but why does the shrinker introspection interface need active scan control functions like these?
>
> It makes testing of (new) shrinkers easier, for example. For instance, the shadow entries shrinker hides its associated objects by returning a 0 count most of the time (unless the total consumed memory exceeds a certain fraction of total memory). echo 2 > /proc/sys/vm/drop_caches won't even trigger the scanning.

And that's exactly my point above: you cannot test shrinkers in isolation from the subsystem that loads them up.
In this case, you *aren't testing the shrinker*, you are testing how the shadow entry subsystem manages the working set. The shrinker is an integrated part of that subsystem, so any test hooks needed to trigger the reclaim of shadow entries belong in the ->count method of the shrinker implementation, such that it runs whenever the shrinker is called rather than only when the memory usage threshold is triggered. At that point, drop_caches then does exactly what you need.

Shrinkers cannot be tested in isolation from the subsystem they act on!

> > > 2) Get information about a specific shrinker:
> > > $ cd sb-btrfs-24/
> > > $ ls
> > > count count_memcg count_memcg_node count_node scan scan_memcg scan_memcg_node scan_node
> > >
> > > 3) Count objects on the system/root cgroup level
> > > $ cat count
> > > 212
> > >
> > > 4) Count objects on the system/root cgroup level per numa node (on a 2-node machine)
> > > $ cat count_node
> > > 209 3
> >
> > So a single space-separated line with a number per node?
> >
> > When you have a few hundred nodes and hundreds of thousands of objects per node, we overrun the 4kB page size with a single line. What then?
>
> With the seq_buf API we don't have the 4kB limit, do we?

No idea. Never cared enough about sysfs to need to know. But that doesn't avoid the issue: verbosity and overhead to create/parse this information.

> > Also, this now iterates a separate memcg per line. A parser now needs to know the difference between count/count_node and count_memcg/count_memcg_node because they are subtly different file formats. These files should have the same format, otherwise it just creates needless complexity.
> >
> > Indeed, why do we even need count/count_node? They are just the "index 1" memcg output, so are totally redundant.
>
> Ok, but then we need a flag to indicate that a shrinker is memcg-aware? I got your point, though, and I (partially) agree. But do you think we're fine with just one interface and don't need an aggregation over nodes? So just count_memcg_node?

/me puts on the broken record

Shrinker infrastructure needs to stop treating memcgs as something special and off to the side. We need to integrate the code so there is a single scan loop that simply treats the "no memcg" case as the root memcg.

Bleeding architectural/implementation deficiencies into user-visible APIs is even worse than just having to put up with them in the implementation....

> > > 6) Same but with a per-node output
> > > $ cat count_memcg_node
> > > 1 209 3
> > > 20 96 0
> > > 53 810 7
> > > 2297 2 0
> > > 218 13 0
> > > 581 30 0
> > > 911 124 0
> > > <CUT>
> >
> > So now we have a hundred nodes in the machine and thousands of memcgs. And the information we want is in the numerically largest memcg that is last in the list. And we want to graph its behaviour over time at high resolution (say 1Hz). Now we burn huge amounts of CPU counting memcgs that we don't care about and then throw away most of the information.
>
> For this case we can provide an interface which allows specifying both node and memcg and getting the count. Personally I don't have a machine with hundreds of nodes, so it's not on my radar.

Yup, but there are people who do have this sort of machine, who do use memcgs (in their thousands) and do have many, many superblocks (in their thousands). Just because you personally don't have such machines does not mean you don't have to design for such machines.
Saying "I don't care other people's requirements" is exactly what Kent had a rant about in the other leg of this thread. We know that we have these scalability issues in generic infrastructure, and therefore generic infrastructure has to handle these issues at a architecture and design level. We don't need the initial implementation to work well at such levels of scalability, but we sure as hell need the design, APIs and file formats to scale out because if it doesn't scale there is no question that *we will have to fix it*. So, yeah, you need to think about how to do fine-grained access to shrinker stats effectively. That might require a complete change of presentation API. For example, changing the filesystem layout to be memcg centric rather than shrinker instance centric would make an awful lot of this file parsing problem go away. e.g: /sys/kernel/debug/mm/memcg/<memcg instance>/shrinker/<shrinker instance>/stats Cheers, Dave.
On Tue, Apr 26, 2022 at 09:41:34AM -0700, Roman Gushchin wrote:
> Can you please summarize your position? It's a bit unclear. You made a lot of good points about some details (e.g. shrinker naming, where I totally agree; machines with hundreds of nodes, etc.), then you said the active scanning is useless, and then that the whole thing is useless and we're fine with what we already have for debugging shrinkers.

Better introspection is the first thing we need. Work on improving that. I've been making suggestions to help improve introspection infrastructure.

Before anything else, we need to improve introspection so we can gain better insight into the problems we have. Once we understand the problems better and have evidence to back up where the problems lie and we have a plan to solve them, then we can talk about whether we need other user-accessible shrinker APIs.

For the moment, exposing shrinker control interfaces to userspace could potentially be very bad because it exposes internal architectural and implementation details to a user API. Just because it is in /sys/kernel/debug it doesn't mean applications won't start to use it and build dependencies on it.

That doesn't mean I'm opposed to exposing a shrinker control mechanism to debugfs - I'm still on the fence on that one. However, I definitely think that an API that directly exposes the internal implementation to userspace is the wrong way to go about this.

Fine-grained shrinker control is not necessary to improve shrinker introspection and OOM debugging capability, so if you want/need control interfaces then I think you should separate those out into a separate line of development where it doesn't derail the discussion on how to improve shrinker/OOM introspection.

-Dave.
On Wed, Apr 27, 2022 at 11:22:55AM +1000, Dave Chinner wrote:
> Better introspection is the first thing we need. Work on improving that. I've been making suggestions to help improve introspection infrastructure.
>
> Before anything else, we need to improve introspection so we can gain better insight into the problems we have. Once we understand the problems better and have evidence to back up where the problems lie and we have a plan to solve them, then we can talk about whether we need other user-accessible shrinker APIs.

Ok, at least we do agree here. This is exactly why I've started with this debugfs stuff.

> For the moment, exposing shrinker control interfaces to userspace could potentially be very bad because it exposes internal architectural and implementation details to a user API. Just because it is in /sys/kernel/debug it doesn't mean applications won't start to use it and build dependencies on it.
>
> That doesn't mean I'm opposed to exposing a shrinker control mechanism to debugfs - I'm still on the fence on that one. However, I definitely think that an API that directly exposes the internal implementation to userspace is the wrong way to go about this.

Ok, if it's about having memcg-aware and other interfaces, I can agree here as well. I actually made an attempt to unify memcg-aware and system-wide shrinker scanning; not very successful yet, but it's definitely on my todo list. I'm pretty sure we're iterating over and over some empty root-level shrinkers without benefiting from the bitmap infrastructure which works for memory cgroups.

> Fine-grained shrinker control is not necessary to improve shrinker introspection and OOM debugging capability, so if you want/need control interfaces then I think you should separate those out into a separate line of development where it doesn't derail the discussion on how to improve shrinker/OOM introspection.

Ok, no problems here. Btw, the OOM debugging is a separate topic brought in by Kent; I'd keep it separate too, as it comes with many OOM-specific complications.

From another email of yours:

> So, yeah, you need to think about how to do fine-grained access to shrinker stats effectively. That might require a complete change of presentation API. For example, changing the filesystem layout to be memcg-centric rather than shrinker-instance-centric would make an awful lot of this file parsing problem go away.
>
> e.g:
>
> /sys/kernel/debug/mm/memcg/<memcg instance>/shrinker/<shrinker instance>/stats

The problem with this approach (I thought about it) is that it comes with a high memory overhead, especially on machines with thousands of cgroups and mount points. And besides the memory overhead, it's really expensive to collect system-wide data and get a big picture, as it requires opening and reading thousands of files.

Actually, you wrote recently:

"I've thought about it, too, and can see where it could be useful. However, when I consider the list_lru memcg integration, I suspect it becomes a "can't see the forest for the trees" problem.
We're going to end up with millions of sysfs objects with no obvious way to navigate, iterate or search them if we just take the naive "sysfs object + stats per list_lru instance" approach."

It all makes me think we need both: a way to iterate over all memcgs and dump all the numbers at once, and a way to get a specific per-memcg (per-node) count.

Thanks!
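A sketch of the "dump everything at once" half — assuming a hypothetical per-memcg counting hook standing in for a ->count_objects() call — is a single seq_file read that walks the memcg tree with mem_cgroup_iter():

#include <linux/cgroup.h>
#include <linux/memcontrol.h>
#include <linux/seq_file.h>

/* Hypothetical per-memcg counting hook, e.g. wrapping ->count_objects(). */
static unsigned long demo_count_one(struct mem_cgroup *memcg);

/* One read walks the whole memcg tree, printing "<cgroup ino> <count>". */
static int demo_counts_show(struct seq_file *m, void *v)
{
	struct mem_cgroup *memcg = NULL;

	while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)))
		seq_printf(m, "%lu %lu\n",
			   (unsigned long)cgroup_ino(memcg->css.cgroup),
			   demo_count_one(memcg));
	return 0;
}

The fine-grained "one memcg, one node" query would then be the separate per-instance file discussed above, so neither use case has to pay for the other.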