Message ID: 20250319221634.71128-1-inwardvessel@gmail.com (mailing list archive)
Series: cgroup: separate rstat trees
Please disregard this thread. It was sent with an incorrect version. A
revised series labeled "v3" will be sent shortly.

On 3/19/25 3:16 PM, JP Kobryn wrote:
> The current design of rstat takes the approach that if one subsystem
> is to be flushed, all other subsystems with pending updates should
> also be flushed. A flush may be initiated by reading specific stat
> files (like cpu.stat), and other subsystems are flushed alongside.
> The complexity of flushing some subsystems has grown to the extent
> that the overhead of these side flushes unnecessarily delays the
> fetching of the desired stats.
>
> One big area where the issue comes up is system telemetry, where
> programs periodically sample cpu stats while the memory controller is
> enabled. Programs sampling cpu.stat would benefit if the overhead of
> having to flush memory (and also io) stats were eliminated. It would
> save cpu cycles for existing stat reader programs and improve
> scalability in terms of sampling frequency and host volume.
>
> This series changes the approach from "flush all subsystems" to
> "flush only the requested subsystem". The core design change is
> moving from a unified model, where rstat trees are shared by
> subsystems, to separate trees for each subsystem. On a per-cpu basis,
> there will be a separate tree for each enabled subsystem that
> implements css_rstat_flush(), plus one tree for the base stats. To
> make this possible, the rstat list pointers were moved off of the
> cgroup and onto the css, and their types were changed from cgroup to
> cgroup_subsys_state. Finally, the updated/flush API was changed to
> accept a reference to a css instead of a cgroup, so that a specific
> subsystem can be associated with a given update or flush. The result
> is that rstat trees are now made up of css nodes, and a given tree
> only contains nodes associated with a specific subsystem.
>
> Since separate trees are now in use, the locking scheme was adjusted.
> The global locks were split up so that there are separate locks for
> the base stats (cgroup::self) and for each subsystem (memory, io,
> etc). This allows different subsystems (and the base stats) to use
> rstat in parallel with no contention.
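To make the lock split above concrete, here is a rough sketch of the
shape it could take; treat it as illustrative only, not the patch
itself. css_rstat_flush() and css_is_cgroup() are names taken from this
series, css->ss is the existing subsystem back-pointer, and the
rstat_ss_lock field and ss_rstat_lock() helper are invented for the
sketch:

  /* sketch; assumes kernel context: <linux/spinlock.h>, <linux/cgroup-defs.h> */

  static spinlock_t rstat_base_lock;      /* base stats, i.e. cgroup::self */

  static bool css_is_cgroup(struct cgroup_subsys_state *css)
  {
          return !css->ss;        /* cgroup::self has no subsystem attached */
  }

  static spinlock_t *ss_rstat_lock(struct cgroup_subsys_state *css)
  {
          if (css_is_cgroup(css))
                  return &rstat_base_lock;
          /* assumed: a per-subsystem lock added to struct cgroup_subsys */
          return &css->ss->rstat_ss_lock;
  }

  void css_rstat_flush(struct cgroup_subsys_state *css)
  {
          spinlock_t *lock = ss_rstat_lock(css);

          spin_lock_irq(lock);
          /*
           * Walk and flush only this css's subsystem tree; a reader of
           * cpu.stat no longer contends with a memory or io flush.
           */
          spin_unlock_irq(lock);
  }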
> Breaking up the unified tree into separate trees eliminates the
> overhead and scalability issue explained in the first section, but
> comes at the expense of additional memory. In an effort to minimize
> this overhead, new rstat structs are introduced and a conditional
> allocation is performed. The struct cgroup_rstat_cpu, which
> originally contained both the rstat list pointers and the base stat
> entities, was renamed cgroup_rstat_base_cpu; it is now only allocated
> when the associated css is cgroup::self. For non-self csses, a new
> compact struct was added that contains only the rstat list pointers,
> and it is allocated during initialization when the given css is
> associated with an actual subsystem (not cgroup::self). With this
> conditional allocation, the change in per-cpu memory overhead
> before/after is shown below.
>
> memory overhead before:
> sizeof(struct cgroup_rstat_cpu) =~ 144 bytes /* can vary based on config */
>
> nr_cgroups * sizeof(struct cgroup_rstat_cpu)
> nr_cgroups * 144 bytes
>
> memory overhead after:
> sizeof(struct cgroup_rstat_cpu) == 16 bytes
> sizeof(struct cgroup_rstat_base_cpu) =~ 144 bytes
>
> nr_cgroups * (
>         sizeof(struct cgroup_rstat_base_cpu) +
>         sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
> )
>
> nr_cgroups * (144 + 16 * nr_rstat_controllers)
>
> ... where nr_rstat_controllers is the number of enabled cgroup
> controllers that implement css_rstat_flush(). On a host where both
> memory and io are enabled:
>
> nr_cgroups * (144 + 16 * 2)
> nr_cgroups * 176 bytes
>
> This leaves us with an increase in memory overhead of 32 bytes per
> cgroup per cpu.
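Along the same lines, a rough sketch of the conditional allocation just
described. The two struct names and the 16-byte/144-byte split are from
the cover letter; the field names, css_rstat_init(), and the css rstat
pointer members are invented here for illustration:

  /* sketch; assumes kernel context: <linux/percpu.h>, <linux/cgroup-defs.h> */

  struct cgroup_rstat_cpu {
          /* rstat tree pointers only: two pointers == 16 bytes on 64-bit */
          struct cgroup_subsys_state *updated_children;
          struct cgroup_subsys_state *updated_next;
  };

  struct cgroup_rstat_base_cpu {
          struct cgroup_rstat_cpu rstat;          /* illustrative layout */
          struct cgroup_base_stat bstat;          /* base stat entities... */
          struct cgroup_base_stat last_bstat;     /* ...~144 bytes in total */
  };

  static int css_rstat_init(struct cgroup_subsys_state *css)
  {
          if (css_is_cgroup(css)) {
                  /* cgroup::self carries the base stats */
                  css->rstat_base_cpu =
                          alloc_percpu(struct cgroup_rstat_base_cpu);
                  if (!css->rstat_base_cpu)
                          return -ENOMEM;
          } else {
                  /* subsystem css: compact struct only, saving memory */
                  css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
                  if (!css->rstat_cpu)
                          return -ENOMEM;
          }
          return 0;
  }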
> Validation was performed by reading some *.stat files of a target
> parent cgroup while the system was under different workloads. A test
> program was made to loop 1M times, reading the files cgroup.stat,
> cpu.stat, io.stat, and memory.stat of the parent cgroup each
> iteration. Using a non-patched kernel as control and this series as
> experimental, the findings show perf gains when reading stats with
> this series.
>
> The first experiment consisted of a parent cgroup with
> memory.swap.max=0 and memory.max=1G. On a 52-cpu machine, 26 child
> cgroups were created, and within each child cgroup a process was
> spawned to frequently update the memory cgroup stats by creating and
> then reading a file of size 1T (encouraging reclaim). The test
> program was run alongside these 26 tasks in parallel. The results
> showed time and perf gains for the reader test program.
>
> test program elapsed time
> control:
> real    1m13.663s
> user    0m0.948s
> sys     1m12.356s
>
> experiment:
> real    0m42.498s
> user    0m0.764s
> sys     0m41.546s
>
> test program perf
> control:
> 31.75% mem_cgroup_css_rstat_flush
>  5.49% __blkcg_rstat_flush
>  0.10% cpu_stat_show
>  0.05% cgroup_base_stat_cputime_show
>
> experiment:
>  8.60% mem_cgroup_css_rstat_flush
>  0.15% blkcg_print_stat
>  0.12% cgroup_base_stat_cputime_show
>
> It's worth noting that memcg uses heuristics to optimize flushing.
> Depending on the state of updated stats at a given time, a memcg
> flush may be considered unnecessary and skipped as a result. This
> opportunity to skip a flush is bypassed when memcg is flushed as a
> consequence of sharing the tree with another controller.
>
> A second experiment was set up on the same host, using a parent
> cgroup with two child cgroups. The same swap and memory max settings
> were used as in the previous experiment. In the two child cgroups,
> kernel builds were done in parallel, each using "-j 20". The perf
> comparison is shown below.
>
> test program elapsed time
> control:
> real    1m22.809s
> user    0m1.142s
> sys     1m21.138s
>
> experiment:
> real    0m42.504s
> user    0m1.000s
> sys     0m41.220s
>
> test program perf
> control:
> 37.16% mem_cgroup_css_rstat_flush
>  3.68% __blkcg_rstat_flush
>  0.09% cpu_stat_show
>  0.06% cgroup_base_stat_cputime_show
>
> experiment:
>  2.02% mem_cgroup_css_rstat_flush
>  0.20% blkcg_print_stat
>  0.14% cpu_stat_show
>  0.08% cgroup_base_stat_cputime_show
>
> The final experiment differs from the previous two in that it
> measures performance from the stat updater's perspective. A kernel
> build was run in a child node with -j 20 on the same host and cgroup
> setup. A baseline was established by having the build run while no
> stats were read. The builds were then repeated while stats were
> constantly being read. In all cases, perf appeared similar in cycles
> spent on cgroup_rstat_updated() (insignificant compared to the other
> recorded events). As for the elapsed build times, the results of the
> different scenarios are shown below, indicating no significant
> drawbacks of the split-tree approach.
>
> control with no readers
> real    5m12.307s
> user    84m52.037s
> sys     3m54.000s
>
> control with constant readers of {memory,io,cpu,cgroup}.stat
> real    5m13.209s
> user    84m47.949s
> sys     4m9.260s
>
> experiment with no readers
> real    5m11.961s
> user    84m41.750s
> sys     3m54.058s
>
> experiment with constant readers of {memory,io,cpu,cgroup}.stat
> real    5m12.626s
> user    85m0.323s
> sys     3m56.167s
>
> changelog
> v3:
>   new bpf kfunc api for updated/flush
>   rename cgroup_rstat_{updated,flush} and related to "css_rstat_*"
>   check for ss->css_rstat_flush existence where applicable
>   rename locks for base stats
>   move subsystem locks to cgroup_subsys struct
>   change cgroup_rstat_boot() to ss_rstat_init(ss) and init locks within
>   change lock helpers to accept css and perform lock selection within
>   fix comments that had outdated lock names
>   add open css_is_cgroup() helper
>   rename rstatc to rstatbc to reflect base stats in use
>   rename cgroup_dfl_root_rstat_cpu to root_self_rstat_cpu
>   add comments in early init code to explain deferred allocation
>   misc formatting fixes
>
> v2:
>   drop the patch creating a new cgroup_rstat struct and related code
>   drop bpf-specific patches. instead just use cgroup::self in bpf progs
>   drop the cpu lock patches. instead select cpu lock in updated_list func
>   relocate the cgroup_rstat_init() call to inside css_create()
>   relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
>     to css_free_rwork_fn()
>
> v1:
>   https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/
>
> JP Kobryn (4):
>   cgroup: separate rstat api for bpf programs
>   cgroup: use separate rstat trees for each subsystem
>   cgroup: use subsystem-specific rstat locks to avoid contention
>   cgroup: split up cgroup_rstat_cpu into base stat and non base stat
>     versions
>
>  block/blk-cgroup.c                            |   6 +-
>  include/linux/cgroup-defs.h                   |  80 ++--
>  include/linux/cgroup.h                        |  16 +-
>  include/trace/events/cgroup.h                 |  10 +-
>  kernel/cgroup/cgroup-internal.h               |   6 +-
>  kernel/cgroup/cgroup.c                        |  69 +--
>  kernel/cgroup/rstat.c                         | 412 +++++++++++-------
>  mm/memcontrol.c                               |   4 +-
>  .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
>  .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
>  10 files changed, 363 insertions(+), 253 deletions(-)