From patchwork Wed Mar 19 22:21:46 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: JP Kobryn <inwardvessel@gmail.com>
X-Patchwork-Id: 14023232
From: JP Kobryn <inwardvessel@gmail.com>
To: tj@kernel.org, shakeel.butt@linux.dev, yosryahmed@google.com,
	mkoutny@suse.com, hannes@cmpxchg.org, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kernel-team@meta.com
Subject: [PATCH 0/4 v3] cgroup: separate rstat trees
Date: Wed, 19 Mar 2025 15:21:46 -0700
Message-ID: <20250319222150.71813-1-inwardvessel@gmail.com>
The current design of rstat takes the approach that if one subsystem is
to be flushed, all other subsystems with pending updates should be
flushed as well. A flush may be initiated by reading a specific stat
file (like cpu.stat), and the other subsystems are then flushed
alongside it. The complexity of flushing some subsystems has grown to
the extent that the overhead of these side flushes unnecessarily delays
the fetching of the desired stats.

One big area where the issue comes up is system telemetry, where
programs periodically sample cpu stats while the memory controller is
enabled. Programs sampling cpu.stat would benefit if the overhead of
flushing memory (and also io) stats were eliminated: it would save cpu
cycles for existing stat reader programs and improve scalability in
terms of sampling frequency and host volume.

This series changes the approach from "flush all subsystems" to "flush
only the requested subsystem". The core design change is moving from a
unified model, where rstat trees are shared by all subsystems, to
separate trees for each subsystem. On a per-cpu basis, there is one
tree for each enabled subsystem that implements css_rstat_flush() and
one tree for the base stats. To make this possible, the rstat list
pointers were moved off the cgroup and onto the css, and their types
were changed to cgroup_subsys_state. Finally, the updated/flush API was
changed to accept a reference to a css instead of a cgroup, so that a
specific subsystem can be associated with a given update or flush. The
result is that rstat trees are now made up of css nodes, and a given
tree only contains nodes associated with one specific subsystem.

Since separate trees are now in use, the locking scheme was adjusted
accordingly. The global locks were split so that there are separate
locks for the base stats (cgroup::self) and for each subsystem (memory,
io, etc.). This allows different subsystems (and the base stats) to use
rstat in parallel with no contention; see the lock-selection sketch
further below.

Breaking up the unified tree into separate trees eliminates the
overhead and scalability issue explained above, but comes at the
expense of additional memory. In an effort to minimize this overhead,
new rstat structs are introduced and allocation is made conditional.
The cgroup_rstat_cpu struct, which originally contained both the rstat
list pointers and the base stat entities, was renamed
cgroup_rstat_base_cpu; it is only allocated when the associated css is
cgroup::self. For non-self csses, a new compact struct was added that
contains only the rstat list pointers, and it is allocated during
initialization when the given css is associated with an actual
subsystem (not cgroup::self).
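To illustrate, below is a minimal sketch of the resulting per-cpu
structs and the css-based API, assuming field names implied by the
description above; the exact layouts in the patches may differ.

/*
 * Compact per-cpu struct allocated for every non-self css: rstat list
 * pointers only, i.e. two pointers = 16 bytes on a 64-bit kernel.
 */
struct cgroup_rstat_cpu {
	struct cgroup_subsys_state *updated_children;
	struct cgroup_subsys_state *updated_next;
};

/*
 * Larger per-cpu struct allocated only for cgroup::self; it carries
 * the base stat entities in addition to the list pointers (~144 bytes,
 * varying with config). The embedding shown here is an assumption.
 */
struct cgroup_rstat_base_cpu {
	struct cgroup_rstat_cpu self;
	struct cgroup_base_stat bstat;
	struct cgroup_base_stat last_bstat;
};

/* before: one shared tree per cpu, keyed by cgroup */
void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
void cgroup_rstat_flush(struct cgroup *cgrp);

/* after: one tree per subsystem per cpu, keyed by css */
void css_rstat_updated(struct cgroup_subsys_state *css, int cpu);
void css_rstat_flush(struct cgroup_subsys_state *css);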
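The lock split can likewise be pictured with a small hedged sketch. The
css_is_cgroup() helper is named in the changelog below, though its body
here is assumed; the rstat_ss_lock field on struct cgroup_subsys and
the lock type are illustrative, not taken from the patches.

/* assumed semantics: cgroup::self is the css with no subsystem */
static inline bool css_is_cgroup(struct cgroup_subsys_state *css)
{
	return !css->ss;
}

static spinlock_t rstat_base_lock;	/* serializes base stat flushes */

static spinlock_t *ss_rstat_lock(struct cgroup_subsys_state *css)
{
	if (css_is_cgroup(css))
		return &rstat_base_lock;	/* base stats (cgroup::self) */
	return &css->ss->rstat_ss_lock;		/* per-subsystem lock */
}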
With this conditional allocation, the change in per-cpu memory overhead
before/after is shown below.

memory overhead before:

  sizeof(struct cgroup_rstat_cpu) =~ 144 bytes /* can vary based on config */

  nr_cgroups * sizeof(struct cgroup_rstat_cpu)
  nr_cgroups * 144 bytes

memory overhead after:

  sizeof(struct cgroup_rstat_cpu) == 16 bytes
  sizeof(struct cgroup_rstat_base_cpu) =~ 144 bytes

  nr_cgroups * (
      sizeof(struct cgroup_rstat_base_cpu) +
      sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
  )
  nr_cgroups * (144 + 16 * nr_rstat_controllers)

...where nr_rstat_controllers is the number of enabled cgroup
controllers that implement css_rstat_flush(). On a host where both
memory and io are enabled:

  nr_cgroups * (144 + 16 * 2)
  nr_cgroups * 176 bytes

This leaves an increase in memory overhead of 32 bytes per cgroup per
cpu.

Validation was performed by reading the *.stat files of a target parent
cgroup while the system was under different workloads. A test program
was made to loop one million times, reading the files cgroup.stat,
cpu.stat, io.stat, and memory.stat of the parent cgroup on each
iteration; a sketch of such a reader follows this paragraph. Using an
unpatched kernel as control and this series as experiment, the results
show performance gains when reading stats with this series.
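A reader in the spirit of that test program could look like the sketch
below; the parent cgroup path is a placeholder assumption, since the
actual test program is not part of this posting.

/* reader.c: read four stat files of a parent cgroup, 1M times */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	static const char * const files[] = {
		"/sys/fs/cgroup/parent/cgroup.stat",	/* placeholder path */
		"/sys/fs/cgroup/parent/cpu.stat",
		"/sys/fs/cgroup/parent/io.stat",
		"/sys/fs/cgroup/parent/memory.stat",
	};
	char buf[64 * 1024];

	for (int i = 0; i < 1000000; i++) {
		for (unsigned int j = 0; j < 4; j++) {
			int fd = open(files[j], O_RDONLY);

			if (fd < 0) {
				perror(files[j]);
				return 1;
			}
			/* reading a *.stat file triggers the rstat flush */
			while (read(fd, buf, sizeof(buf)) > 0)
				;
			close(fd);
		}
	}
	return 0;
}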
The first experiment consisted of a parent cgroup with
memory.swap.max=0 and memory.max=1G. On a 52-cpu machine, 26 child
cgroups were created, and within each child cgroup a process was
spawned to frequently update the memory cgroup stats by creating and
then reading a file of size 1T (encouraging reclaim). The test program
was run alongside these 26 tasks in parallel. The results showed both
time and perf gains for the reader test program.

test program elapsed time:

  control:
  real    1m13.663s
  user    0m0.948s
  sys     1m12.356s

  experiment:
  real    0m42.498s
  user    0m0.764s
  sys     0m41.546s

test program perf:

  control:
  31.75% mem_cgroup_css_rstat_flush
   5.49% __blkcg_rstat_flush
   0.10% cpu_stat_show
   0.05% cgroup_base_stat_cputime_show

  experiment:
   8.60% mem_cgroup_css_rstat_flush
   0.15% blkcg_print_stat
   0.12% cgroup_base_stat_cputime_show

It is worth noting that memcg uses heuristics to optimize flushing:
depending on the state of the updated stats at a given time, a memcg
flush may be deemed unnecessary and skipped. This opportunity to skip a
flush is lost when memcg is flushed as a side effect of sharing a tree
with another controller.

A second experiment was set up on the same host, using a parent cgroup
with two child cgroups and the same swap and memory limits as before.
In the two child cgroups, kernel builds were run in parallel, each
using "-j 20". The comparison is shown below.

test program elapsed time:

  control:
  real    1m22.809s
  user    0m1.142s
  sys     1m21.138s

  experiment:
  real    0m42.504s
  user    0m1.000s
  sys     0m41.220s

test program perf:

  control:
  37.16% mem_cgroup_css_rstat_flush
   3.68% __blkcg_rstat_flush
   0.09% cpu_stat_show
   0.06% cgroup_base_stat_cputime_show

  experiment:
   2.02% mem_cgroup_css_rstat_flush
   0.20% blkcg_print_stat
   0.14% cpu_stat_show
   0.08% cgroup_base_stat_cputime_show

The final experiment differs from the previous two in that it measures
performance from the stat updater's perspective. A kernel build was run
with -j 20 in a child cgroup on the same host and cgroup setup. A
baseline was established by running the build while no stats were read;
the builds were then repeated while stats were constantly being read.
In all cases, the cycles perf attributed to cgroup_rstat_updated()
appeared similar and were insignificant compared to the other recorded
events.

As for the elapsed build times, the results of the different scenarios
are shown below, indicating no significant drawbacks to the split-tree
approach.

  control with no readers:
  real    5m12.307s
  user    84m52.037s
  sys     3m54.000s

  control with constant readers of {memory,io,cpu,cgroup}.stat:
  real    5m13.209s
  user    84m47.949s
  sys     4m9.260s

  experiment with no readers:
  real    5m11.961s
  user    84m41.750s
  sys     3m54.058s

  experiment with constant readers of {memory,io,cpu,cgroup}.stat:
  real    5m12.626s
  user    85m0.323s
  sys     3m56.167s

changelog
v3:
  new bpf kfunc api for updated/flush
  rename cgroup_rstat_{updated,flush} and related to "css_rstat_*"
  check for ss->css_rstat_flush existence where applicable
  rename locks for base stats
  move subsystem locks to cgroup_subsys struct
  change cgroup_rstat_boot() to ss_rstat_init(ss) and init locks within
  change lock helpers to accept css and perform lock selection within
  fix comments that had outdated lock names
  add css_is_cgroup() helper
  rename rstatc to rstatbc to reflect base stats in use
  rename cgroup_dfl_root_rstat_cpu to root_self_rstat_cpu
  add comments in early init code to explain deferred allocation
  misc formatting fixes
v2:
  drop the patch creating a new cgroup_rstat struct and related code
  drop bpf-specific patches; instead just use cgroup::self in bpf progs
  drop the cpu lock patches; instead select cpu lock in updated_list func
  relocate the cgroup_rstat_init() call to inside css_create()
  relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
    to css_free_rwork_fn()
v1: https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/

JP Kobryn (4):
  cgroup: separate rstat api for bpf programs
  cgroup: use separate rstat trees for each subsystem
  cgroup: use subsystem-specific rstat locks to avoid contention
  cgroup: split up cgroup_rstat_cpu into base stat and non base stat
    versions

 block/blk-cgroup.c                            |   6 +-
 include/linux/cgroup-defs.h                   |  80 ++--
 include/linux/cgroup.h                        |  16 +-
 include/trace/events/cgroup.h                 |  10 +-
 kernel/cgroup/cgroup-internal.h               |   6 +-
 kernel/cgroup/cgroup.c                        |  69 +--
 kernel/cgroup/rstat.c                         | 412 +++++++++++-------
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
 10 files changed, 363 insertions(+), 253 deletions(-)