[0/9,RFC] cgroup: separate rstat trees

Message ID 20241224011402.134009-1-inwardvessel@gmail.com (mailing list archive)
Series cgroup: separate rstat trees

Message

JP Kobryn Dec. 24, 2024, 1:13 a.m. UTC
I've been experimenting with these changes to allow for separate
updating/flushing of cgroup stats per-subsystem. The idea is that instead of
having a single per-cpu rstat tree managing stats across all subsystems, each
subsystem gets its own per-cpu tree. Subsystems can then update and flush
their stats without contending with one another, i.e. the io subsystem does
not have to wait for an in-progress memory subsystem flush to finish. The core
change is moving the rstat entities off of the cgroup struct and onto the
cgroup_subsys_state struct. Every patch revolves around that concept.

I reached a point where this started to feel stable in my local testing, so I
wanted to share and get feedback on this approach.
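
To make the core change concrete, here is a rough sketch of the interface
change (the exact signatures are in patches 1-4); callers such as memcg and
blk-cgroup would pass their css instead of the cgroup:

  /* current mainline */
  void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
  void cgroup_rstat_flush(struct cgroup *cgrp);

  /* with this series (sketch) */
  void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu);
  void cgroup_rstat_flush(struct cgroup_subsys_state *css);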

JP Kobryn (8):
  change cgroup to css in rstat updated and flush api
  change cgroup to css in rstat internal flush and lock funcs
  change cgroup to css in rstat init and exit api
  split rstat from cgroup into separate css
  separate locking between base css and others
  isolate base stat flush
  remove unneeded rcu list
  remove bpf rstat flush from css generic flush

 block/blk-cgroup.c              |   4 +-
 include/linux/cgroup-defs.h     |  35 ++---
 include/linux/cgroup.h          |   8 +-
 kernel/cgroup/cgroup-internal.h |   4 +-
 kernel/cgroup/cgroup.c          |  79 ++++++-----
 kernel/cgroup/rstat.c           | 225 +++++++++++++++++++-------------
 mm/memcontrol.c                 |   4 +-
 7 files changed, 203 insertions(+), 156 deletions(-)

Comments

Shakeel Butt Dec. 24, 2024, 4:57 a.m. UTC | #1
Thanks JP for the awesome work. Please CC Tejun and Michal in the
followup revisions.

On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn wrote:
> I've been experimenting with these changes to allow for separate
> updating/flushing of cgroup stats per-subsystem. The idea is that instead of
> having a single per-cpu rstat tree managing stats across all subsystems, each
> subsystem gets its own per-cpu tree. Subsystems can then update and flush
> their stats without contending with one another, i.e. the io subsystem does
> not have to wait for an in-progress memory subsystem flush to finish. The core
> change is moving the rstat entities off of the cgroup struct and onto the
> cgroup_subsys_state struct. Every patch revolves around that concept.
> 
> I reached a point where this started to feel stable in my local testing, so I
> wanted to share and get feedback on this approach.

For the cover letter, please start by describing the problem you are
trying to solve and then possibly give an overview of the solution you
are proposing.
Michal Koutný Jan. 8, 2025, 6:16 p.m. UTC | #2
Hello JP.

On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> I've been experimenting with these changes to allow for separate
> updating/flushing of cgroup stats per-subsystem.

Nice.

> I reached a point where this started to feel stable in my local testing, so I
> wanted to share and get feedback on this approach.

The split is not straightforwardly an improvement -- there is at least a
higher memory footprint, and flushing effectiveness depends on how
individual readers are correlated; OTOH, writer correlation affects
updaters when extending the update tree. So a workload-dependent effect
can (in my theory) go both ways.
There are also in-kernel consumers of stats, namely the memory controller,
which has been optimized over the years to balance the tradeoff between
precision and latency.

So do you have any measurements (or expectations) that show how readers
or writers are affected?

Thanks,
Michal
Shakeel Butt Jan. 13, 2025, 6:25 p.m. UTC | #3
On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
> Hello JP.
> 
> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> > I've been experimenting with these changes to allow for separate
> > updating/flushing of cgroup stats per-subsystem.
> 
> Nice.
> 
> > I reached a point where this started to feel stable in my local testing, so I
> > wanted to share and get feedback on this approach.
> 
> The split is not straight-forwardly an improvement --

The major improvement in my opinion is the performance isolation for
stats readers i.e. cpu stats readers do not need to flush memory stats.

> there's at least
> higher memory footprint 

Yes, this is indeed the case. JP, can you please give a ballpark on
the memory overhead?

> and flushing effectiveness depends on how
> individual readers are correlated, 

Sorry I am confused by the above statement, can you please expand on
what you meant by it?

> OTOH writer correlation affects
> updaters when extending the update tree.

Here I am confused about the difference between writer and updater.

> So a workload dependent effect
> can go (in my theory) both sides.
> There are also in-kernel consumers of stats, namely memory controller
> that's been optimized over the years to balance the tradeoff between
> precision and latency.

In-kernel memcg stats readers will be unaffected most of the time with
this change. The only difference is that when they flush, they will only
flush memcg stats.
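
For example (sketch), the in-kernel flush path in mm/memcontrol.c would only
change the argument it passes, roughly:

  /* today */
  cgroup_rstat_flush(memcg->css.cgroup);

  /* with this series (sketch) */
  cgroup_rstat_flush(&memcg->css);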

> 
> So do you have any measurements (or expectations) that show how readers
> or writers are affected?
> 

Here I am assuming you meant measurements in terms of cpu cost, or do you
have something else in mind?


Thanks a lot Michal for taking a look.
JP Kobryn Jan. 15, 2025, 1:33 a.m. UTC | #4
Hi Michal,

On 1/13/25 10:25 AM, Shakeel Butt wrote:
> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
>> Hello JP.
>>
>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
>>> I've been experimenting with these changes to allow for separate
>>> updating/flushing of cgroup stats per-subsystem.
>>
>> Nice.
>>
>>> I reached a point where this started to feel stable in my local testing, so I
>>> wanted to share and get feedback on this approach.
>>
>> The split is not straight-forwardly an improvement --
> 
> The major improvement in my opinion is the performance isolation for
> stats readers i.e. cpu stats readers do not need to flush memory stats.
> 
>> there's at least
>> higher memory footprint
> 
> Yes this is indeed the case and JP, can you please give a ballpark on
> the memory overhead?

Yes, the trade-off is using more memory to allow for separate trees.
With these patches the changes in allocated memory for the 
cgroup_rstat_cpu instances and their associated locks are:
static
	reduced by 58%
dynamic
	increased by 344%

The threefold increase on the dynamic side is attributed to now having 3 
rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io), 
instead of originally just 1. The number will change if more subsystems 
start or stop using rstat in the future. Feel free to let me know if you 
would like to see the detailed breakdown of these values.
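
For context, the dynamic side is roughly the difference between one per-cpu
allocation per cgroup and one per rstat-using css (sketch; the exact code is
in the init/exit and split patches):

  /* before: one tree per cgroup */
  cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);

  /* after (sketch): one tree per rstat-using css, i.e. base + memory + io */
  css->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);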

> 
>> and flushing effectiveness depends on how
>> individual readers are correlated,
> 
> Sorry I am confused by the above statement, can you please expand on
> what you meant by it?
> 
>> OTOH writer correlation affects
>> updaters when extending the update tree.
> 
> Here I am confused about the difference between writer and updater.
> 
>> So a workload dependent effect
>> can go (in my theory) both sides.
>> There are also in-kernel consumers of stats, namely memory controller
>> that's been optimized over the years to balance the tradeoff between
>> precision and latency.
> 
> In-kernel memcg stats readers will be unaffected most of the time with
> this change. The only difference will be when they flush, they will only
> flush memcg stats.
> 
>>
>> So do you have any measurements (or expectations) that show how readers
>> or writers are affected?
>>
> 
> Here I am assuming you meant measurements in terms of cpu cost or do you
> have something else in mind?
> 
> 
> Thanks a lot Michal for taking a look.
Yosry Ahmed Jan. 15, 2025, 1:39 a.m. UTC | #5
On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>
> Hi Michal,
>
> On 1/13/25 10:25 AM, Shakeel Butt wrote:
> > On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
> >> Hello JP.
> >>
> >> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> >>> I've been experimenting with these changes to allow for separate
> >>> updating/flushing of cgroup stats per-subsystem.
> >>
> >> Nice.
> >>
> >>> I reached a point where this started to feel stable in my local testing, so I
> >>> wanted to share and get feedback on this approach.
> >>
> >> The split is not straight-forwardly an improvement --
> >
> > The major improvement in my opinion is the performance isolation for
> > stats readers i.e. cpu stats readers do not need to flush memory stats.
> >
> >> there's at least
> >> higher memory footprint
> >
> > Yes this is indeed the case and JP, can you please give a ballpark on
> > the memory overhead?
>
> Yes, the trade-off is using more memory to allow for separate trees.
> With these patches the changes in allocated memory for the
> cgroup_rstat_cpu instances and their associated locks are:
> static
>         reduced by 58%
> dynamic
>         increased by 344%
>
> The threefold increase on the dynamic side is attributed to now having 3
> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
> instead of originally just 1. The number will change if more subsystems
> start or stop using rstat in the future. Feel free to let me know if you
> would like to see the detailed breakdown of these values.

What is the absolute per-CPU memory usage?
JP Kobryn Jan. 15, 2025, 7:38 p.m. UTC | #6
Hi Yosry,

On 1/14/25 5:39 PM, Yosry Ahmed wrote:
> On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> Hi Michal,
>>
>> On 1/13/25 10:25 AM, Shakeel Butt wrote:
>>> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
>>>> Hello JP.
>>>>
>>>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>> I've been experimenting with these changes to allow for separate
>>>>> updating/flushing of cgroup stats per-subsystem.
>>>>
>>>> Nice.
>>>>
>>>>> I reached a point where this started to feel stable in my local testing, so I
>>>>> wanted to share and get feedback on this approach.
>>>>
>>>> The split is not straight-forwardly an improvement --
>>>
>>> The major improvement in my opinion is the performance isolation for
>>> stats readers i.e. cpu stats readers do not need to flush memory stats.
>>>
>>>> there's at least
>>>> higher memory footprint
>>>
> >>> Yes this is indeed the case and JP, can you please give a ballpark on
>>> the memory overhead?
>>
>> Yes, the trade-off is using more memory to allow for separate trees.
>> With these patches the changes in allocated memory for the
>> cgroup_rstat_cpu instances and their associated locks are:
>> static
>>          reduced by 58%
>> dynamic
>>          increased by 344%
>>
>> The threefold increase on the dynamic side is attributed to now having 3
>> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
>> instead of originally just 1. The number will change if more subsystems
>> start or stop using rstat in the future. Feel free to let me know if you
>> would like to see the detailed breakdown of these values.
> 
> What is the absolute per-CPU memory usage?

This is what I calculate as the combined per-cpu usage.
before:
	one cgroup_rstat_cpu instance for every cgroup
	sizeof(cgroup_rstat_cpu) * nr_cgroups
after:
	three cgroup_rstat_cpu instances for every cgroup + updater lock for 
every subsystem plus one for base stats
	sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
		sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)

Note that "every cgroup" includes the root cgroup. Also, 3 represents 
the number of current rstat clients: base stats, memory, and io 
(assuming all enabled).

As I'm writing this, I realize I might need to include the bpf cgroups 
as a fourth client and include this in my testing.
Yosry Ahmed Jan. 15, 2025, 9:36 p.m. UTC | #7
On Wed, Jan 15, 2025 at 11:39 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>
> Hi Yosry,
>
> On 1/14/25 5:39 PM, Yosry Ahmed wrote:
> > On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
> >>
> >> Hi Michal,
> >>
> >> On 1/13/25 10:25 AM, Shakeel Butt wrote:
> >>> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
> >>>> Hello JP.
> >>>>
> >>>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> >>>>> I've been experimenting with these changes to allow for separate
> >>>>> updating/flushing of cgroup stats per-subsystem.
> >>>>
> >>>> Nice.
> >>>>
> >>>>> I reached a point where this started to feel stable in my local testing, so I
> >>>>> wanted to share and get feedback on this approach.
> >>>>
> >>>> The split is not straight-forwardly an improvement --
> >>>
> >>> The major improvement in my opinion is the performance isolation for
> >>> stats readers i.e. cpu stats readers do not need to flush memory stats.
> >>>
> >>>> there's at least
> >>>> higher memory footprint
> >>>
> > >>> Yes this is indeed the case and JP, can you please give a ballpark on
> >>> the memory overhead?
> >>
> >> Yes, the trade-off is using more memory to allow for separate trees.
> >> With these patches the changes in allocated memory for the
> >> cgroup_rstat_cpu instances and their associated locks are:
> >> static
> >>          reduced by 58%
> >> dynamic
> >>          increased by 344%
> >>
> >> The threefold increase on the dynamic side is attributed to now having 3
> >> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
> >> instead of originally just 1. The number will change if more subsystems
> >> start or stop using rstat in the future. Feel free to let me know if you
> >> would like to see the detailed breakdown of these values.
> >
> > What is the absolute per-CPU memory usage?
>
> This is what I calculate as the combined per-cpu usage.
> before:
>         one cgroup_rstat_cpu instance for every cgroup
>         sizeof(cgroup_rstat_cpu) * nr_cgroups
> after:
>         three cgroup_rstat_cpu instances for every cgroup + updater lock for
> every subsystem plus one for base stats
>         sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
>                 sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)
>
> Note that "every cgroup" includes the root cgroup. Also, 3 represents
> the number of current rstat clients: base stats, memory, and io
> (assuming all enabled).

On a config I have at hand sizeof(cgroup_rstat_cpu) is 160 bytes.
Ignoring the spinlock for a second because it doesn't scale with
cgroups, that'd be an extra 320 * nr_cgroups * nr_cpus bytes. On a
moderately large machine with 256 CPUs and 100 cgroups for example
that's ~8MB.
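
(That is roughly 2 extra trees * 160 bytes * 100 cgroups * 256 CPUs =
8,192,000 bytes.)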

>
> As I'm writing this, I realize I might need to include the bpf cgroups
> as a fourth client and include this in my testing.
Michal Koutný Jan. 16, 2025, 3:19 p.m. UTC | #8
Hello.

On Mon, Jan 13, 2025 at 10:25:34AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > and flushing effectiveness depends on how individual readers are
> > correlated, 
> 
> Sorry I am confused by the above statement, can you please expand on
> what you meant by it?
> 
> > OTOH writer correlation affects
> > updaters when extending the update tree.
> 
> Here I am confused about the difference between writer and updater.

reader -- a call site that'd need to call cgroup_rstat_flush() to
	calculate aggregated stats
writer (or updater) -- a call site that calls cgroup_rstat_updated()
	when it modifies whatever datum

By correlated readers I meant that stats for multiple controllers are
read close to each other (time-wise). The first such reader does the heavy
lifting, and subsequent readers enjoy quick access.
(With per-controller flushing, each reader would need to do its own flush,
and I suspect the total time is non-linear wrt the parts.)

Similarly for writers, if multiple controllers' data change in a short
window, only the first one has to construct the rstat tree from top down
to self; the others are updating the same tree.
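
As a toy illustration (standalone userspace code with made-up names, not the
kernel implementation) of why only the first writer pays for the path and
later writers of a shared tree are nearly free:

  #include <stdbool.h>
  #include <stdio.h>

  struct node {
          struct node *parent;
          bool on_tree;   /* stands in for updated_next != NULL */
  };

  /* link node..root into the update tree, stopping once an ancestor
   * is already linked (the rstat update path works similarly) */
  static int mark_updated(struct node *n)
  {
          int linked = 0;

          for (; n && !n->on_tree; n = n->parent) {
                  n->on_tree = true;
                  linked++;
          }
          return linked;
  }

  int main(void)
  {
          struct node root = { .parent = NULL };
          struct node parent = { .parent = &root };
          struct node child = { .parent = &parent };

          /* the first updater (say memory) pays for the whole path... */
          printf("first update linked %d nodes\n", mark_updated(&child));
          /* ...a second updater of the same cgroup (say io) is nearly free
           * on a shared tree, but pays the full path again if each
           * controller keeps its own tree */
          printf("second update linked %d nodes\n", mark_updated(&child));
          return 0;
  }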

> In-kernel memcg stats readers will be unaffected most of the time with
> this change. The only difference will be when they flush, they will only
> flush memcg stats.

That "most of the time" is what depends on how other controller's
readers are active.

> Here I am assuming you meant measurements in terms of cpu cost or do you
> have something else in mind?

I have in mind something like Tejun's point 2:
| 2. It has noticeable benefits in the targeted use cases.

The cover letter mentions some old problems (which may not be problems
nowadays with memcg flushing reworks) and it's not clear how the
separation into per-controller trees impacts (today's) problems.

(I can imagine that if the problem is stated like: io.stat readers are
unnecessarily waiting for memory.stat flushing, then the benefit can be shown
(unless io.stat readers could benefit from flushing triggered by e.g.
memory). But I didn't get whether _that_ is the problem.)

Thanks,
Michal
Yosry Ahmed Jan. 16, 2025, 3:35 p.m. UTC | #9
On Thu, Jan 16, 2025 at 7:19 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Mon, Jan 13, 2025 at 10:25:34AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > and flushing effectiveness depends on how individual readers are
> > > correlated,
> >
> > Sorry I am confused by the above statement, can you please expand on
> > what you meant by it?
> >
> > > OTOH writer correlation affects
> > > updaters when extending the update tree.
> >
> > Here I am confused about the difference between writer and updater.
>
> reader -- a call site that'd need to call cgroup_rstat_flush() to
>         calculate aggregated stats
> writer (or updater) -- a call site that calls cgroup_rstat_updated()
>         when it modifies whatever datum
>
> By correlated readers I meant that stats for multiple controllers are
> read close to each other (time-wise). First such a reader does the heavy
> lifting, consequent readers enjoy quick access.
> (With per-controller flushing, each reader would need to do the flush
> and I'm suspecting the total time non-linear wrt parts.)

In this case, I actually think it's better if every reader pays for
the flush they asked for (and only that). There is a bit of repeated
work if we read memory stats then io stats right after, but in cases
where we don't, paying to flush all subsystems because they are likely
to be flushed soon is not necessarily a good thing imo.

>
> Similarly for writers, if multiple controller's data change in short
> window, only the first one has to construct the rstat tree from top down
> to self, the other are updating the same tree.

This I agree about. If we have consecutive updates from two different
subsystems to the same cgroup, almost all the work is repeated.
Whether that causes a tangible performance difference or not is
something the numbers should show. In my experience, real regressions
on the update side are usually caught by LKP and are somewhat easy to
surface in benchmarks (I used netperf in the past).

>
> > In-kernel memcg stats readers will be unaffected most of the time with
> > this change. The only difference will be when they flush, they will only
> > flush memcg stats.
>
> That "most of the time" is what depends on how other controller's
> readers are active.

Since readers of other controllers are only in userspace (AFAICT), I
think it's unlikely that they are correlated with in-kernel memcg stat
readers in general.

>
> > Here I am assuming you meant measurements in terms of cpu cost or do you
> > have something else in mind?
>
> I have in mind something like Tejun's point 2:
> | 2. It has noticeable benefits in the targeted use cases.
>
> The cover letter mentions some old problems (which may not be problems
> nowadays with memcg flushing reworks) and it's not clear how the
> separation into per-controller trees impacts (today's) problems.
>
> (I can imagine if the problem is stated like: io.stat readers are
> unnecessarily waiting for memory.stat flushing, the benefit can be shown
> (unless io.stat readers could benefit from flushing triggered by e.g.
> memory).  But I didn't get if _that_ is the problem.)

Yeah I hope/expect that numbers will show that reading memcg stats (in
userspace or the kernel) becomes a bit faster, while reading other
subsystem stats should be significantly faster (at least in some
cases). We will see how that turns out.
JP Kobryn Jan. 16, 2025, 6:20 p.m. UTC | #10
On 1/15/25 1:36 PM, Yosry Ahmed wrote:
> On Wed, Jan 15, 2025 at 11:39 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> Hi Yosry,
>>
>> On 1/14/25 5:39 PM, Yosry Ahmed wrote:
>>> On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>
>>>> Hi Michal,
>>>>
>>>> On 1/13/25 10:25 AM, Shakeel Butt wrote:
>>>>> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
>>>>>> Hello JP.
>>>>>>
>>>>>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>>>> I've been experimenting with these changes to allow for separate
>>>>>>> updating/flushing of cgroup stats per-subsystem.
>>>>>>
>>>>>> Nice.
>>>>>>
>>>>>>> I reached a point where this started to feel stable in my local testing, so I
>>>>>>> wanted to share and get feedback on this approach.
>>>>>>
>>>>>> The split is not straight-forwardly an improvement --
>>>>>
>>>>> The major improvement in my opinion is the performance isolation for
>>>>> stats readers i.e. cpu stats readers do not need to flush memory stats.
>>>>>
>>>>>> there's at least
>>>>>> higher memory footprint
>>>>>
> >>>>> Yes this is indeed the case and JP, can you please give a ballpark on
>>>>> the memory overhead?
>>>>
>>>> Yes, the trade-off is using more memory to allow for separate trees.
>>>> With these patches the changes in allocated memory for the
>>>> cgroup_rstat_cpu instances and their associated locks are:
>>>> static
>>>>           reduced by 58%
>>>> dynamic
>>>>           increased by 344%
>>>>
>>>> The threefold increase on the dynamic side is attributed to now having 3
>>>> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
>>>> instead of originally just 1. The number will change if more subsystems
>>>> start or stop using rstat in the future. Feel free to let me know if you
>>>> would like to see the detailed breakdown of these values.
>>>
>>> What is the absolute per-CPU memory usage?
>>
>> This is what I calculate as the combined per-cpu usage.
>> before:
>>          one cgroup_rstat_cpu instance for every cgroup
>>          sizeof(cgroup_rstat_cpu) * nr_cgroups
>> after:
>>          three cgroup_rstat_cpu instances for every cgroup + updater lock for
>> every subsystem plus one for base stats
>>          sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
>>                  sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)
>>
>> Note that "every cgroup" includes the root cgroup. Also, 3 represents
>> the number of current rstat clients: base stats, memory, and io
>> (assuming all enabled).
> 
> On a config I have at hand sizeof(cgroup_rstat_cpu) is 160 bytes.
> Ignoring the spinlock for a second because it doesn't scale with
> cgroups, that'd be an extra 320 * nr_cgroups * nr_cpus bytes. On a
> moderately large machine with 256 CPUs and 100 cgroups for example
> that's ~8MB.
> 

Revisiting the cgroup_rstat_cpu struct, there might be an opportunity to
save some memory here. This struct has several cgroup_base_stat fields
that are not useful to the actual subsystems (memory, io). So I'm
considering a scenario where the existing cgroup_rstat_cpu is used for
the base stats while a new, lighter struct that maintains compatibility
with the rstat infrastructure is used for the others.
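
Roughly what I have in mind (the names here are placeholders, not final):
keep only the tree linkage in a generic per-cpu node and let the base stats
layer the heavier bookkeeping on top of it:

  /* hypothetical lighter node shared by all rstat-using subsystems */
  struct css_rstat_cpu {
          struct cgroup_subsys_state *updated_children;  /* terminated by self */
          struct cgroup_subsys_state *updated_next;      /* NULL if not on list */
  };

  /* hypothetical container keeping today's base-stat bookkeeping */
  struct cgroup_rstat_base_cpu {
          struct css_rstat_cpu rstatc;
          struct u64_stats_sync bsync;
          struct cgroup_base_stat bstat;
          struct cgroup_base_stat last_bstat;
          /* ...plus the subtree snapshots as today... */
  };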

>>
>> As I'm writing this, I realize I might need to include the bpf cgroups
>> as a fourth client and include this in my testing.
Shakeel Butt Jan. 16, 2025, 7:03 p.m. UTC | #11
Hi Michal,

On Thu, Jan 16, 2025 at 04:19:07PM +0100, Michal Koutný wrote:
> Hello.
> 
> On Mon, Jan 13, 2025 at 10:25:34AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > and flushing effectiveness depends on how
> > > correlated, 
> > 
> > Sorry I am confused by the above statement, can you please expand on
> > what you meant by it?
> > 
> > > OTOH writer correlation affects
> > > updaters when extending the update tree.
> > 
> > Here I am confused about the difference between writer and updater.
> 
> reader -- a call site that'd need to call cgroup_rstat_flush() to
> 	calculate aggregated stats
> writer (or updater) -- a call site that calls cgroup_rstat_updated()
> 	when it modifies whatever datum
>

Ah, so writer and updater are the same.

> By correlated readers I meant that stats for multiple controllers are
> read close to each other (time-wise). First such a reader does the heavy
> lifting, consequent readers enjoy quick access.
> (With per-controller flushing, each reader would need to do the flush
> and I'm suspecting the total time non-linear wrt parts.)

This is a good point, and actually in prod we are observing that on machines
with very active workloads, readers of different stats are flushing all the
subsystems even when the reads are very close-by time-wise. In addition, there
are users who are only reading non-memory stats and still paying the cost of
memory flushing. Please note that memory stats flushing is the most expensive
one at the moment, and it does not make sense to flush memory stats in the
above two cases.

> 
> Similarly for writers, if multiple controller's data change in short
> window, only the first one has to construct the rstat tree from top down
> to self, the other are updating the same tree.

Another good point. From my observation, the cost of rstat tree
insertion is very low and gets amortized, i.e. there are a lot of updates
per flush, such that the insertion cost is not noticeable at all, at
least in the perf traces.

> 
> > In-kernel memcg stats readers will be unaffected most of the time with
> > this change. The only difference will be when they flush, they will only
> > flush memcg stats.
> 
> That "most of the time" is what depends on how other controller's
> readers are active.
> 
> > Here I am assuming you meant measurements in terms of cpu cost or do you
> > have something else in mind?
> 
> I have in mind something like Tejun's point 2:
> | 2. It has noticeable benefits in the targeted use cases.
> 
> The cover letter mentions some old problems (which may not be problems
> nowadays with memcg flushing reworks) and it's not clear how the
> separation into per-controller trees impacts (today's) problems.
> 
> (I can imagine if the problem is stated like: io.stat readers are
> unnecessarily waiting for memory.stat flushing, the benefit can be shown
> (unless io.stat readers could benefit from flushing triggered by e.g.
> memory).  But I didn't get if _that_ is the problem.)
> 

The cover letter of v2 has more information on the motivation. The
main motivation is the same as I described above, i.e. many applications
(stat monitors) have to flush all the subsystems even when they are
reading different subsystems' stats close-by, and then there are
applications that are reading just cpu or io stats and still have to
pay for memcg stat flushing. This series is targeting these two very
common scenarios.

Thanks Michal for your time and comments.

Shakeel