| Message ID | 20250218031448.46951-1-inwardvessel@gmail.com (mailing list archive) |
|---|---|
| Series | cgroup: separate rstat trees |
Hello,

On Mon, Feb 17, 2025 at 07:14:37PM -0800, JP Kobryn wrote:
...
> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created
> and within each child cgroup a process was spawned to encourage the
> updating of memory cgroup stats by creating and then reading a file of
> size 1T (encouraging reclaim). These 26 tasks were run in parallel.
> While this was going on, a custom program was used to open the cpu.stat
> file of the parent cgroup, read the entire file 1M times, then close it.
> The perf report for the task performing the reading showed that on the
> control side most of the cycles (42%) were spent in
> mem_cgroup_css_rstat_flush(). It also showed a smaller but significant
> number of cycles spent in __blkcg_rstat_flush(). The perf report for the
> patched kernel differed in that no cycles were spent in these functions.
> Instead, most cycles were spent in cgroup_base_stat_flush(). Aside from
> the perf reports, the time spent running the program reading cpu.stat
> showed a gain when comparing the control to the experimental kernel. The
> time in kernel mode was reduced.
>
> before:
> real 0m18.449s
> user 0m0.209s
> sys 0m18.165s
>
> after:
> real 0m6.080s
> user 0m0.170s
> sys 0m5.890s
>
> Another experiment on the same host was set up using a parent cgroup
> with two child cgroups. The same swap and memory limits were used as in
> the previous experiment. In the two child cgroups, kernel builds were
> done in parallel, each using "-j 20". The program from the previous
> experiment was used to perform 1M reads of the parent cpu.stat file. The
> perf comparison showed similar results to the previous experiment. On
> the control side, a majority of cycles (42%) were spent in
> mem_cgroup_css_rstat_flush() and significant cycles in
> __blkcg_rstat_flush(). On the experimental side, most cycles were spent
> in cgroup_base_stat_flush() and no cycles were spent flushing memory or
> io. As for the time taken by the program reading cpu.stat, measurements
> are shown below.
>
> before:
> real 0m17.223s
> user 0m0.259s
> sys 0m16.871s
>
> after:
> real 0m6.498s
> user 0m0.237s
> sys 0m6.220s
>
> For the final experiment, perf events were recorded during a kernel
> build with the same host and cgroup setup. The builds took place in the
> child cgroups. Control and experimental sides both showed similar cycles
> spent in cgroup_rstat_updated(), which appeared insignificant compared
> to the other events recorded with the workload.

One of the reasons why the original design used one rstat tree is that
readers, in addition to writers, can often be correlated too - e.g. you'd
often have periodic monitoring tools which poll all the major stat files
periodically. Splitting the trees will likely make those at least a bit
worse. Can you test how much worse that'd be? i.e. repeat the above tests
but read all the major stat files - cgroup.stat, cpu.stat, memory.stat
and io.stat.

Thanks.
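A minimal sketch of the correlated-reader test being asked for here -
open and read all four major stat files in a loop - might look like the
following. The cgroup path, buffer size, and iteration count are
assumptions; the original test program was not posted.

```c
/* Sketch of a correlated-reader benchmark: read all the major stat files
 * of one cgroup NR_READS times. Path and sizes are illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_READS 1000000

static void read_whole_file(const char *path)
{
	char buf[64 * 1024];
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror(path);
		exit(1);
	}
	/* each read of a stat file triggers an rstat flush */
	while (read(fd, buf, sizeof(buf)) > 0)
		;
	close(fd);
}

int main(void)
{
	static const char * const files[] = {
		"/sys/fs/cgroup/parent/cgroup.stat",
		"/sys/fs/cgroup/parent/cpu.stat",
		"/sys/fs/cgroup/parent/memory.stat",
		"/sys/fs/cgroup/parent/io.stat",
	};

	for (int i = 0; i < NR_READS; i++)
		for (size_t j = 0; j < sizeof(files) / sizeof(files[0]); j++)
			read_whole_file(files[j]);
	return 0;
}
```

Timing this with time(1) on control and patched kernels would show how
much the split trees cost readers that poll everything.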
On Mon, Feb 17, 2025 at 07:14:37PM -0800, JP Kobryn wrote:
> The current design of rstat takes the approach that if one subsystem is
> to be flushed, all other subsystem controllers with pending updates
> should also be flushed. It seems that over time, the stat-keeping of
> some subsystems has grown in size to the extent that they are noticeably
> slowing down others. This has been most observable in situations where
> the memory controller is enabled. One big area where the issue comes up
> is system telemetry, where programs periodically sample cpu stats. It
> would be a benefit for programs like this if the overhead of flushing
> memory stats (and others) could be eliminated. It would save cpu cycles
> for existing cpu-based telemetry programs and improve scalability in
> terms of increasing sampling frequency and the volume of hosts the
> program is running on - the cpu cycles saved help free up the budget for
> cycles to be spent on the desired stats.
>
> This series changes the approach of "flush all subsystems" to "flush
> only the requested subsystem". The core design change is moving from a
> single unified rstat tree to having separate trees: one for each enabled
> subsystem controller (if it implements css_rstat_flush()), one for the
> base stats (cgroup::self) subsystem state, and one dedicated to
> bpf-based cgroups (if enabled). A layer of indirection was introduced to
> rstat. Where a cgroup reference was previously used as a parameter, the
> updated/flush families of functions were changed to instead accept a
> reference to a new cgroup_rstat struct, along with a new interface
> containing ops that perform type-specific pointer offsets for accessing
> common types. The ops allow rstat routines to work only with common
> types while hiding away any unique types. Together, the new struct and
> interface allow for extending the type of entity that can participate in
> rstat. In this series, the cgroup_subsys_state and the cgroup_bpf are
> now participants. For both of these structs, the cgroup_rstat struct was
> added as a new field and ops were statically defined for both types in
> order to provide access to related objects. To illustrate, the
> cgroup_subsys_state was given an rstat_struct field and one of its ops
> was defined to get a pointer to the rstat_struct of its parent. Public
> APIs were changed. In order for clients to be specific about which stats
> are being updated or flushed, a reference to the given
> cgroup_subsys_state is passed instead of the cgroup. For the bpf APIs, a
> cgroup is still passed as an argument since there is no subsystem state
> associated with these custom cgroups. However, the names of the API
> calls were changed and now have a "bpf_" prefix. Since separate trees
> are in use, the locking scheme was adjusted to prevent any contention.
> Separate locks exist for the three categories: base stats
> (cgroup::self), formal subsystem controllers (memory, io, etc.), and
> bpf-based cgroups. Where applicable, the functions for lock management
> were adjusted to accept parameters instead of globals.
>
> Breaking up the unified tree into separate trees eliminates the overhead
> and scalability issue explained in the first section, but comes at the
> expense of using additional memory. In an effort to minimize the
> additional memory overhead, a conditional allocation is performed. The
> cgroup_rstat_cpu originally contained the rstat list pointers and the
> base stat entities. The struct was reduced to only contain the list
> pointers.
> For the single case where base stats are participating in rstat, a new
> struct cgroup_rstat_base_cpu was created that contains the list pointers
> and the base stat entities. The conditional allocation is done only when
> the cgroup::self subsys_state is initialized. Since the list pointers
> exist at the beginning of both the cgroup_rstat_cpu and
> cgroup_rstat_base_cpu structs, a union is used to access one type of
> pointer or the other depending on whether cgroup::self is detected; it
> is the one subsystem state where the subsystem pointer is NULL. With
> this change, the total memory overhead on a per-cpu basis is:
>
> nr_cgroups * (
>     sizeof(struct cgroup_rstat_base_cpu) +
>     sizeof(struct cgroup_rstat_cpu) * nr_controllers
> )
>
> if bpf-based cgroups are enabled:
>
> nr_cgroups * (
>     sizeof(struct cgroup_rstat_base_cpu) +
>     sizeof(struct cgroup_rstat_cpu) * (nr_controllers + 1)
> )

Adding actual numbers here as examples would help, instead of people
having to look at the size of these structs at the end of the series. You
can calculate the per-CPU per-cgroup size on a popular arch (e.g. x86_64)
and add that here.

Also, it would be useful to specify the change in memory overhead. IIUC,
sizeof(struct cgroup_rstat_base_cpu) is irrelevant because we are already
using that today anyway, so the actual increase is
sizeof(struct cgroup_rstat_cpu) * (nr_controllers - 1).

Another question is, does it make sense to keep BPF flushing in the
"self" css with base stats flushing for now? IIUC BPF flushing is not
very popular now anyway, and doing so will remove the need to support
flushing and updating things that are not css's. Just food for thought.

>
> ... where nr_controllers is the number of enabled cgroup controllers
> that implement css_rstat_flush().
>
> With regard to validation, there is a measurable benefit when reading a
> specific set of stats. Using the cpu stats as a basis for flushing, some
> experiments were set up to measure the perf and time differences.
>
> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created
> and within each child cgroup a process was spawned to encourage the
> updating of memory cgroup stats by creating and then reading a file of
> size 1T (encouraging reclaim). These 26 tasks were run in parallel.
> While this was going on, a custom program was used to open the cpu.stat
> file of the parent cgroup, read the entire file 1M times, then close it.
> The perf report for the task performing the reading showed that on the
> control side most of the cycles (42%) were spent in
> mem_cgroup_css_rstat_flush(). It also showed a smaller but significant
> number of cycles spent in __blkcg_rstat_flush(). The perf report for the
> patched kernel differed in that no cycles were spent in these functions.
> Instead, most cycles were spent in cgroup_base_stat_flush(). Aside from
> the perf reports, the time spent running the program reading cpu.stat
> showed a gain when comparing the control to the experimental kernel. The
> time in kernel mode was reduced.
>
> before:
> real 0m18.449s
> user 0m0.209s
> sys 0m18.165s
>
> after:
> real 0m6.080s
> user 0m0.170s
> sys 0m5.890s
>
> Another experiment on the same host was set up using a parent cgroup
> with two child cgroups. The same swap and memory limits were used as in
> the previous experiment. In the two child cgroups, kernel builds were
> done in parallel, each using "-j 20".
> The program from the previous experiment was used to perform 1M reads of
> the parent cpu.stat file. The perf comparison showed similar results to
> the previous experiment. On the control side, a majority of cycles (42%)
> were spent in mem_cgroup_css_rstat_flush() and significant cycles in
> __blkcg_rstat_flush(). On the experimental side, most cycles were spent
> in cgroup_base_stat_flush() and no cycles were spent flushing memory or
> io. As for the time taken by the program reading cpu.stat, measurements
> are shown below.
>
> before:
> real 0m17.223s
> user 0m0.259s
> sys 0m16.871s
>
> after:
> real 0m6.498s
> user 0m0.237s
> sys 0m6.220s
>
> For the final experiment, perf events were recorded during a kernel
> build with the same host and cgroup setup. The builds took place in the
> child cgroups. Control and experimental sides both showed similar cycles
> spent in cgroup_rstat_updated(), which appeared insignificant compared
> to the other events recorded with the workload.
>
> JP Kobryn (11):
>   cgroup: move rstat pointers into struct of their own
>   cgroup: add level of indirection for cgroup_rstat struct
>   cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
>   cgroup: introduce cgroup_rstat_ops
>   cgroup: separate rstat for bpf cgroups
>   cgroup: rstat lock indirection
>   cgroup: fetch cpu-specific lock in rstat cpu lock helpers
>   cgroup: rstat cpu lock indirection
>   cgroup: separate rstat locks for bpf cgroups
>   cgroup: separate rstat locks for subsystems
>   cgroup: separate rstat list pointers from base stats
>
>  block/blk-cgroup.c                            |   4 +-
>  include/linux/bpf-cgroup-defs.h               |   3 +
>  include/linux/cgroup-defs.h                   |  98 +--
>  include/linux/cgroup.h                        |  11 +-
>  include/linux/cgroup_rstat.h                  |  97 +++
>  kernel/bpf/cgroup.c                           |   6 +
>  kernel/cgroup/cgroup-internal.h               |   9 +-
>  kernel/cgroup/cgroup.c                        |  65 +-
>  kernel/cgroup/rstat.c                         | 556 +++++++++++++-----
>  mm/memcontrol.c                               |   4 +-
>  .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
>  .../bpf/progs/cgroup_hierarchical_stats.c     |   6 +-
>  12 files changed, 594 insertions(+), 270 deletions(-)
>  create mode 100644 include/linux/cgroup_rstat.h
>
> --
> 2.43.5
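For reference, the layout the cover letter describes - the reduced
per-cpu struct, the base-stat union, and the ops indirection - would look
roughly like the sketch below. This is a reconstruction from the prose
above, not taken from the patches; member names beyond those mentioned in
the text are assumptions.

```c
/* Illustrative reconstruction of the structures the cover letter
 * describes; not the actual patch code. */

/* per-cpu node kept for every (cgroup, controller) rstat participant;
 * reduced to only the rstat tree list pointers */
struct cgroup_rstat_cpu {
	struct cgroup_rstat *updated_children;
	struct cgroup_rstat *updated_next;
};

/* base stats (cgroup::self) additionally carry the stat entities; the
 * list pointers must stay at the top so the union below works */
struct cgroup_rstat_base_cpu {
	struct cgroup_rstat *updated_children;
	struct cgroup_rstat *updated_next;
	struct cgroup_base_stat bstat;		/* assumed member */
	struct cgroup_base_stat last_bstat;	/* assumed member */
};

struct cgroup_rstat {
	union {
		/* allocated for controllers implementing css_rstat_flush() */
		struct cgroup_rstat_cpu __percpu *rstat_cpu;
		/* allocated only for cgroup::self, the one subsystem state
		 * whose subsystem pointer is NULL */
		struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
	};
	const struct cgroup_rstat_ops *ops;
};

/* type-specific accessors so the rstat core never needs to know whether
 * it is walking a css or a cgroup_bpf (op names assumed) */
struct cgroup_rstat_ops {
	struct cgroup_rstat *(*parent_fn)(struct cgroup_rstat *rstat);
	struct cgroup *(*cgroup_fn)(struct cgroup_rstat *rstat);
	void (*flush_fn)(struct cgroup_rstat *rstat, int cpu);
};
```

Under this sketch on x86_64, sizeof(struct cgroup_rstat_cpu) would be 16
bytes, so with e.g. three flushing controllers the increase works out to
roughly 16 * (3 - 1) = 32 bytes per cgroup per cpu over today's
allocation - the kind of concrete number being asked for above. The real
struct sizes in the series may of course differ.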
On Thu, Feb 20, 2025 at 05:26:04PM +0000, Yosry Ahmed wrote:
>
> Another question is, does it make sense to keep BPF flushing in the
> "self" css with base stats flushing for now? IIUC BPF flushing is not
> very popular now anyway, and doing so will remove the need to support
> flushing and updating things that are not css's. Just food for thought.
>

Oh if this simplifies the code, I would say go for it.
On Thu, Feb 20, 2025 at 09:53:33AM -0800, Shakeel Butt wrote:
> On Thu, Feb 20, 2025 at 05:26:04PM +0000, Yosry Ahmed wrote:
> >
> > Another question is, does it make sense to keep BPF flushing in the
> > "self" css with base stats flushing for now? IIUC BPF flushing is not
> > very popular now anyway, and doing so will remove the need to support
> > flushing and updating things that are not css's. Just food for
> > thought.
> >
>
> Oh if this simplifies the code, I would say go for it.

I think we wouldn't need cgroup_rstat_ops and some of the refactoring may
not be needed. It will also reduce the memory overhead, and keep it
constant regardless of using BPF, which is nice.
On 2/20/25 9:59 AM, Yosry Ahmed wrote:
> On Thu, Feb 20, 2025 at 09:53:33AM -0800, Shakeel Butt wrote:
>> On Thu, Feb 20, 2025 at 05:26:04PM +0000, Yosry Ahmed wrote:
>>>
>>> Another question is, does it make sense to keep BPF flushing in the
>>> "self" css with base stats flushing for now? IIUC BPF flushing is not
>>> very popular now anyway, and doing so will remove the need to support
>>> flushing and updating things that are not css's. Just food for
>>> thought.
>>>
>>
>> Oh if this simplifies the code, I would say go for it.
>
> I think we wouldn't need cgroup_rstat_ops and some of the refactoring
> may not be needed. It will also reduce the memory overhead, and keep it
> constant regardless of using BPF, which is nice.

Yes, this is true. cgroup_rstat_ops was only added to allow cgroup_bpf to
make use of rstat. If the bpf flushing remains tied to
cgroup_subsys_state::self, then the ops interface and supporting code can
be removed. Probably stating the obvious, but the trade-off would be that
if bpf cgroups are in use, they would account for some extra overhead
while flushing the base stats. Is Google making use of bpf-based cgroups?
On Thu, Feb 20, 2025 at 10:14:45AM -0800, JP Kobryn wrote:
> On 2/20/25 9:59 AM, Yosry Ahmed wrote:
> > On Thu, Feb 20, 2025 at 09:53:33AM -0800, Shakeel Butt wrote:
> > > On Thu, Feb 20, 2025 at 05:26:04PM +0000, Yosry Ahmed wrote:
> > > >
> > > > Another question is, does it make sense to keep BPF flushing in
> > > > the "self" css with base stats flushing for now? IIUC BPF flushing
> > > > is not very popular now anyway, and doing so will remove the need
> > > > to support flushing and updating things that are not css's. Just
> > > > food for thought.
> > > >
> > >
> > > Oh if this simplifies the code, I would say go for it.
> >
> > I think we wouldn't need cgroup_rstat_ops and some of the refactoring
> > may not be needed. It will also reduce the memory overhead, and keep
> > it constant regardless of using BPF, which is nice.
>
> Yes, this is true. cgroup_rstat_ops was only added to allow cgroup_bpf
> to make use of rstat. If the bpf flushing remains tied to
> cgroup_subsys_state::self, then the ops interface and supporting code
> can be removed. Probably stating the obvious, but the trade-off would be
> that if bpf cgroups are in use, they would account for some extra
> overhead while flushing the base stats. Is Google making use of
> bpf-based cgroups?

Ironically I don't know, but I don't expect the BPF flushing to be
expensive enough to affect this. If someone has a use case that loads
enough BPF programs to cause a noticeable impact, we can address it then.

This series will still be an improvement anyway.
> > Yes, this is true. cgroup_rstat_ops was only added to allow
> > cgroup_bpf to make use of rstat. If the bpf flushing remains tied to
> > cgroup_subsys_state::self, then the ops interface and supporting code
> > can be removed. Probably stating the obvious, but the trade-off would
> > be that if bpf cgroups are in use, they would account for some extra
> > overhead while flushing the base stats. Is Google making use of
> > bpf-based cgroups?
> >
>
> Ironically I don't know, but I don't expect the BPF flushing to be
> expensive enough to affect this. If someone has a use case that loads
> enough BPF programs to cause a noticeable impact, we can address it
> then.
>
> This series will still be an improvement anyway.

I actually just remembered. Using BPF programs to collect stats using
rstat is currently independent of the CGROUP_BPF stuff. So I think this
approach breaks that anyway.
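As background for that last point, BPF rstat collection goes through
rstat kfuncs plus an fentry attach point in kernel/cgroup/rstat.c - this
is what tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
(touched in the diffstat above) exercises - with no dependency on struct
cgroup_bpf. A minimal sketch, assuming the current kfunc signatures and
attach point:

```c
/* Sketch of a BPF rstat flusher, modeled on the selftest
 * cgroup_hierarchical_stats.c. Nothing here touches CGROUP_BPF. */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* rstat kfuncs exported for BPF by kernel/cgroup/rstat.c */
void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;

/* invoked for each (cgroup, cpu) pair while the rstat tree is flushed */
SEC("fentry/bpf_rstat_flush")
int BPF_PROG(flusher, struct cgroup *cgrp, struct cgroup *parent, int cpu)
{
	/* a real program would propagate its per-cpu counters from cgrp
	 * to parent here */
	return 0;
}

char _license[] SEC("license") = "GPL";
```

Since this interface never goes through struct cgroup_bpf, a bpf rstat
tree keyed off CGROUP_BPF would miss programs like this one.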