Message ID | 20241224011402.134009-1-inwardvessel@gmail.com (mailing list archive) |
---|---|
Series | cgroup: separate rstat trees |
Thanks JP for the awesome work. Please CC Tejun and Michal in the followup
revisions.

On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn wrote:
> I've been experimenting with these changes to allow for separate
> updating/flushing of cgroup stats per-subsystem. The idea was instead of having
> a single per-cpu rstat tree for managing stats across all subsystems, we could
> maybe split up the rstat trees into separate trees for each subsystem. So each
> cpu would have individual trees for each subsystem. It would allow subsystems
> to update and flush their stats without having any contention with others, i.e.
> the io subsystem would not have to wait for an in progress memory subsystem
> flush to finish. The core change is moving the rstat entities off of the cgroup
> struct and onto the cgroup_subsystem_state struct. Every patch revolves around
> that concept.
>
> I reached a point where this started to feel stable in my local testing, so I
> wanted to share and get feedback on this approach.

For the cover letter, please start with describing the problem you are trying
to solve and then possibly give an overview of the solution you are proposing.
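As a minimal sketch of the core change described in the cover letter (simplified; the real structs in include/linux/cgroup-defs.h carry more fields, and the exact field layout in the series may differ), the per-cpu rstat node pointer moves from struct cgroup to struct cgroup_subsys_state:

```c
/* Simplified sketch only, not the actual patch. */

/* Before: one per-cpu rstat node per cgroup, shared by all subsystems. */
struct cgroup {
	/* ... */
	struct cgroup_rstat_cpu __percpu *rstat_cpu;
	/* ... */
};

/* After (proposed): each subsystem's css owns its own per-cpu rstat node,
 * so e.g. io can be updated/flushed without touching memory's tree. */
struct cgroup_subsys_state {
	/* ... */
	struct cgroup_rstat_cpu __percpu *rstat_cpu;
	/* ... */
};
```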
Hello JP.

On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> I've been experimenting with these changes to allow for separate
> updating/flushing of cgroup stats per-subsystem.

Nice.

> I reached a point where this started to feel stable in my local testing, so I
> wanted to share and get feedback on this approach.

The split is not straightforwardly an improvement -- there's at least a higher
memory footprint, and flushing effectiveness depends on how individual readers
are correlated; OTOH, writer correlation affects updaters when extending the
update tree. So a workload-dependent effect can (in my theory) go both ways.

There are also in-kernel consumers of stats, namely the memory controller,
which has been optimized over the years to balance the tradeoff between
precision and latency.

So do you have any measurements (or expectations) that show how readers
or writers are affected?

Thanks,
Michal
On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
> Hello JP.
>
> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> > I've been experimenting with these changes to allow for separate
> > updating/flushing of cgroup stats per-subsystem.
>
> Nice.
>
> > I reached a point where this started to feel stable in my local testing, so I
> > wanted to share and get feedback on this approach.
>
> The split is not straightforwardly an improvement --

The major improvement in my opinion is the performance isolation for
stats readers, i.e. cpu stats readers do not need to flush memory stats.

> there's at least
> a higher memory footprint

Yes, this is indeed the case, and JP, can you please give a ballpark on
the memory overhead?

> and flushing effectiveness depends on how
> individual readers are correlated,

Sorry, I am confused by the above statement; can you please expand on
what you meant by it?

> OTOH writer correlation affects
> updaters when extending the update tree.

Here I am confused about the difference between writer and updater.

> So a workload-dependent effect
> can (in my theory) go both ways.
> There are also in-kernel consumers of stats, namely the memory controller,
> which has been optimized over the years to balance the tradeoff between
> precision and latency.

In-kernel memcg stats readers will be unaffected most of the time with
this change. The only difference will be that when they flush, they will only
flush memcg stats.

>
> So do you have any measurements (or expectations) that show how readers
> or writers are affected?
>

Here I am assuming you meant measurements in terms of cpu cost, or do you
have something else in mind?

Thanks a lot Michal for taking a look.
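To illustrate the isolation being described, a hypothetical per-subsystem flush path might look like the sketch below; css_rstat_flush() is an assumed name for the proposed interface (today's kernel only exposes cgroup_rstat_flush(cgrp), which flushes all subsystems together):

```c
/* Hypothetical reader path under the proposed split (names assumed). */
static void read_io_stats(struct cgroup_subsys_state *io_css)
{
	/*
	 * Flush only the io css's per-cpu tree. A concurrent memcg flush
	 * no longer serializes with this reader, which is the isolation
	 * benefit discussed above.
	 */
	css_rstat_flush(io_css);

	/* ... report the aggregated io counters ... */
}
```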
Hi Michal,

On 1/13/25 10:25 AM, Shakeel Butt wrote:
> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
>> Hello JP.
>>
>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
>>> I've been experimenting with these changes to allow for separate
>>> updating/flushing of cgroup stats per-subsystem.
>>
>> Nice.
>>
>>> I reached a point where this started to feel stable in my local testing, so I
>>> wanted to share and get feedback on this approach.
>>
>> The split is not straightforwardly an improvement --
>
> The major improvement in my opinion is the performance isolation for
> stats readers, i.e. cpu stats readers do not need to flush memory stats.
>
>> there's at least
>> a higher memory footprint
>
> Yes, this is indeed the case, and JP, can you please give a ballpark on
> the memory overhead?

Yes, the trade-off is using more memory to allow for separate trees.
With these patches the changes in allocated memory for the
cgroup_rstat_cpu instances and their associated locks are:
  static
    reduced by 58%
  dynamic
    increased by 344%

The threefold increase on the dynamic side is attributed to now having 3
rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
instead of originally just 1. The number will change if more subsystems
start or stop using rstat in the future. Feel free to let me know if you
would like to see the detailed breakdown of these values.

>
>> and flushing effectiveness depends on how
>> individual readers are correlated,
>
> Sorry, I am confused by the above statement; can you please expand on
> what you meant by it?
>
>> OTOH writer correlation affects
>> updaters when extending the update tree.
>
> Here I am confused about the difference between writer and updater.
>
>> So a workload-dependent effect
>> can (in my theory) go both ways.
>> There are also in-kernel consumers of stats, namely the memory controller,
>> which has been optimized over the years to balance the tradeoff between
>> precision and latency.
>
> In-kernel memcg stats readers will be unaffected most of the time with
> this change. The only difference will be that when they flush, they will only
> flush memcg stats.
>
>>
>> So do you have any measurements (or expectations) that show how readers
>> or writers are affected?
>>
>
> Here I am assuming you meant measurements in terms of cpu cost, or do you
> have something else in mind?
>
> Thanks a lot Michal for taking a look.
On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>
> Hi Michal,
>
> On 1/13/25 10:25 AM, Shakeel Butt wrote:
> > On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
> >> Hello JP.
> >>
> >> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> >>> I've been experimenting with these changes to allow for separate
> >>> updating/flushing of cgroup stats per-subsystem.
> >>
> >> Nice.
> >>
> >>> I reached a point where this started to feel stable in my local testing, so I
> >>> wanted to share and get feedback on this approach.
> >>
> >> The split is not straightforwardly an improvement --
> >
> > The major improvement in my opinion is the performance isolation for
> > stats readers, i.e. cpu stats readers do not need to flush memory stats.
> >
> >> there's at least
> >> a higher memory footprint
> >
> > Yes, this is indeed the case, and JP, can you please give a ballpark on
> > the memory overhead?
>
> Yes, the trade-off is using more memory to allow for separate trees.
> With these patches the changes in allocated memory for the
> cgroup_rstat_cpu instances and their associated locks are:
>   static
>     reduced by 58%
>   dynamic
>     increased by 344%
>
> The threefold increase on the dynamic side is attributed to now having 3
> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
> instead of originally just 1. The number will change if more subsystems
> start or stop using rstat in the future. Feel free to let me know if you
> would like to see the detailed breakdown of these values.

What is the absolute per-CPU memory usage?
Hi Yosry,

On 1/14/25 5:39 PM, Yosry Ahmed wrote:
> On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> Hi Michal,
>>
>> On 1/13/25 10:25 AM, Shakeel Butt wrote:
>>> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
>>>> Hello JP.
>>>>
>>>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>> I've been experimenting with these changes to allow for separate
>>>>> updating/flushing of cgroup stats per-subsystem.
>>>>
>>>> Nice.
>>>>
>>>>> I reached a point where this started to feel stable in my local testing, so I
>>>>> wanted to share and get feedback on this approach.
>>>>
>>>> The split is not straightforwardly an improvement --
>>>
>>> The major improvement in my opinion is the performance isolation for
>>> stats readers, i.e. cpu stats readers do not need to flush memory stats.
>>>
>>>> there's at least
>>>> a higher memory footprint
>>>
>>> Yes, this is indeed the case, and JP, can you please give a ballpark on
>>> the memory overhead?
>>
>> Yes, the trade-off is using more memory to allow for separate trees.
>> With these patches the changes in allocated memory for the
>> cgroup_rstat_cpu instances and their associated locks are:
>>   static
>>     reduced by 58%
>>   dynamic
>>     increased by 344%
>>
>> The threefold increase on the dynamic side is attributed to now having 3
>> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
>> instead of originally just 1. The number will change if more subsystems
>> start or stop using rstat in the future. Feel free to let me know if you
>> would like to see the detailed breakdown of these values.
>
> What is the absolute per-CPU memory usage?

This is what I calculate as the combined per-cpu usage.
before:
  one cgroup_rstat_cpu instance for every cgroup
    sizeof(cgroup_rstat_cpu) * nr_cgroups
after:
  three cgroup_rstat_cpu instances for every cgroup + updater lock for
  every subsystem plus one for base stats
    sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
      sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)

Note that "every cgroup" includes the root cgroup. Also, 3 represents
the number of current rstat clients: base stats, memory, and io
(assuming all enabled).

As I'm writing this, I realize I might need to include the bpf cgroups
as a fourth client and include this in my testing.
On Wed, Jan 15, 2025 at 11:39 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>
> Hi Yosry,
>
> On 1/14/25 5:39 PM, Yosry Ahmed wrote:
> > On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
> >>
> >> Hi Michal,
> >>
> >> On 1/13/25 10:25 AM, Shakeel Butt wrote:
> >>> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
> >>>> Hello JP.
> >>>>
> >>>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> >>>>> I've been experimenting with these changes to allow for separate
> >>>>> updating/flushing of cgroup stats per-subsystem.
> >>>>
> >>>> Nice.
> >>>>
> >>>>> I reached a point where this started to feel stable in my local testing, so I
> >>>>> wanted to share and get feedback on this approach.
> >>>>
> >>>> The split is not straightforwardly an improvement --
> >>>
> >>> The major improvement in my opinion is the performance isolation for
> >>> stats readers, i.e. cpu stats readers do not need to flush memory stats.
> >>>
> >>>> there's at least
> >>>> a higher memory footprint
> >>>
> >>> Yes, this is indeed the case, and JP, can you please give a ballpark on
> >>> the memory overhead?
> >>
> >> Yes, the trade-off is using more memory to allow for separate trees.
> >> With these patches the changes in allocated memory for the
> >> cgroup_rstat_cpu instances and their associated locks are:
> >>   static
> >>     reduced by 58%
> >>   dynamic
> >>     increased by 344%
> >>
> >> The threefold increase on the dynamic side is attributed to now having 3
> >> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
> >> instead of originally just 1. The number will change if more subsystems
> >> start or stop using rstat in the future. Feel free to let me know if you
> >> would like to see the detailed breakdown of these values.
> >
> > What is the absolute per-CPU memory usage?
>
> This is what I calculate as the combined per-cpu usage.
> before:
>   one cgroup_rstat_cpu instance for every cgroup
>     sizeof(cgroup_rstat_cpu) * nr_cgroups
> after:
>   three cgroup_rstat_cpu instances for every cgroup + updater lock for
>   every subsystem plus one for base stats
>     sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
>       sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)
>
> Note that "every cgroup" includes the root cgroup. Also, 3 represents
> the number of current rstat clients: base stats, memory, and io
> (assuming all enabled).

On a config I have at hand, sizeof(cgroup_rstat_cpu) is 160 bytes.
Ignoring the spinlock for a second because it doesn't scale with
cgroups, that'd be an extra 320 * nr_cgroups * nr_cpus bytes. On a
moderately large machine with 256 CPUs and 100 cgroups, for example,
that's ~8MB.

>
> As I'm writing this, I realize I might need to include the bpf cgroups
> as a fourth client and include this in my testing.
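Spelling out the arithmetic behind Yosry's estimate (the 160-byte struct size, 256 CPUs, and 100 cgroups are the example values from this thread, not new measurements):

```c
/* Rough per-cpu overhead estimate using the example numbers above. */
#define RSTAT_CPU_SIZE	160UL	/* sizeof(struct cgroup_rstat_cpu), example config */
#define NR_TREES	3UL	/* base stats + memory + io */
#define NR_CGROUPS	100UL
#define NR_CPUS_EX	256UL

/* before: RSTAT_CPU_SIZE * NR_CGROUPS * NR_CPUS_EX            ~=  4.1 MB */
/* after:  RSTAT_CPU_SIZE * NR_TREES * NR_CGROUPS * NR_CPUS_EX ~= 12.3 MB */
/* extra:  (NR_TREES - 1) * RSTAT_CPU_SIZE * NR_CGROUPS * NR_CPUS_EX
 *         = 320 * 100 * 256 = 8,192,000 bytes, i.e. the ~8MB above       */
```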
Hello.

On Mon, Jan 13, 2025 at 10:25:34AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > and flushing effectiveness depends on how individual readers are
> > correlated,
>
> Sorry, I am confused by the above statement; can you please expand on
> what you meant by it?
>
> > OTOH writer correlation affects
> > updaters when extending the update tree.
>
> Here I am confused about the difference between writer and updater.

reader -- a call site that'd need to call cgroup_rstat_flush() to
calculate aggregated stats
writer (or updater) -- a call site that calls cgroup_rstat_updated()
when it modifies whatever datum

By correlated readers I meant that stats for multiple controllers are
read close to each other (time-wise). The first such reader does the heavy
lifting, subsequent readers enjoy quick access.
(With per-controller flushing, each reader would need to do the flush,
and I'm suspecting the total time is non-linear wrt the parts.)

Similarly for writers: if multiple controllers' data change in a short
window, only the first one has to construct the rstat tree from the top down
to self; the others are updating the same tree.

> In-kernel memcg stats readers will be unaffected most of the time with
> this change. The only difference will be that when they flush, they will only
> flush memcg stats.

That "most of the time" is what depends on how active other controllers'
readers are.

> Here I am assuming you meant measurements in terms of cpu cost, or do you
> have something else in mind?

I have in mind something like Tejun's point 2:
| 2. It has noticeable benefits in the targeted use cases.

The cover letter mentions some old problems (which may not be problems
nowadays with the memcg flushing reworks), and it is not clear how the
separation into per-controller trees impacts (today's) problems.

(I can imagine that if the problem is stated like: io.stat readers are
unnecessarily waiting for memory.stat flushing, then the benefit can be
shown (unless io.stat readers could benefit from flushing triggered by
e.g. memory). But I did not get whether _that_ is the problem.)

Thanks,
Michal
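For readers unfamiliar with the two call sites Michal is distinguishing, a minimal sketch using the current (pre-series) API follows; the wrapper functions and the stat being bumped are illustrative only, while cgroup_rstat_updated() and cgroup_rstat_flush() are the existing kernel entry points:

```c
#include <linux/cgroup.h>

/* Writer/updater: record that this cgroup has pending per-cpu deltas.
 * This is what links the cgroup into the per-cpu updated tree. */
static void example_stat_add(struct cgroup *cgrp, long delta)
{
	/* ... bump a per-cpu counter owned by the caller ... */
	cgroup_rstat_updated(cgrp, smp_processor_id());
}

/* Reader: propagate all pending per-cpu deltas up the hierarchy before
 * reporting. Today this flushes every rstat user, not just the caller's. */
static void example_stat_read(struct cgroup *cgrp)
{
	cgroup_rstat_flush(cgrp);
	/* ... read the now up-to-date aggregated values ... */
}
```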
On Thu, Jan 16, 2025 at 7:19 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Mon, Jan 13, 2025 at 10:25:34AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > and flushing effectiveness depends on how individual readers are
> > > correlated,
> >
> > Sorry, I am confused by the above statement; can you please expand on
> > what you meant by it?
> >
> > > OTOH writer correlation affects
> > > updaters when extending the update tree.
> >
> > Here I am confused about the difference between writer and updater.
>
> reader -- a call site that'd need to call cgroup_rstat_flush() to
> calculate aggregated stats
> writer (or updater) -- a call site that calls cgroup_rstat_updated()
> when it modifies whatever datum
>
> By correlated readers I meant that stats for multiple controllers are
> read close to each other (time-wise). The first such reader does the heavy
> lifting, subsequent readers enjoy quick access.
> (With per-controller flushing, each reader would need to do the flush,
> and I'm suspecting the total time is non-linear wrt the parts.)

In this case, I actually think it's better if every reader pays for the
flush they asked for (and only that). There is a bit of repeated work if
we read memory stats then io stats right after, but in cases where we
don't, paying to flush all subsystems because they are likely to be
flushed soon is not necessarily a good thing imo.

>
> Similarly for writers: if multiple controllers' data change in a short
> window, only the first one has to construct the rstat tree from the top down
> to self; the others are updating the same tree.

This I agree about. If we have consecutive updates from two different
subsystems to the same cgroup, almost all the work is repeated. Whether
that causes a tangible performance difference or not is something the
numbers should show. In my experience, real regressions on the update
side are usually caught by LKP and are somewhat easy to surface in
benchmarks (I used netperf in the past).

>
> > In-kernel memcg stats readers will be unaffected most of the time with
> > this change. The only difference will be that when they flush, they will only
> > flush memcg stats.
>
> That "most of the time" is what depends on how active other controllers'
> readers are.

Since readers of other controllers are only in userspace (AFAICT), I
think it's unlikely that they are correlated with in-kernel memcg stat
readers in general.

>
> > Here I am assuming you meant measurements in terms of cpu cost, or do you
> > have something else in mind?
>
> I have in mind something like Tejun's point 2:
> | 2. It has noticeable benefits in the targeted use cases.
>
> The cover letter mentions some old problems (which may not be problems
> nowadays with the memcg flushing reworks), and it is not clear how the
> separation into per-controller trees impacts (today's) problems.
>
> (I can imagine that if the problem is stated like: io.stat readers are
> unnecessarily waiting for memory.stat flushing, then the benefit can be
> shown (unless io.stat readers could benefit from flushing triggered by
> e.g. memory). But I did not get whether _that_ is the problem.)

Yeah, I hope/expect that the numbers will show that reading memcg stats
(in userspace or the kernel) becomes a bit faster, while reading other
subsystem stats should be significantly faster (at least in some cases).
We will see how that turns out.
On 1/15/25 1:36 PM, Yosry Ahmed wrote:
> On Wed, Jan 15, 2025 at 11:39 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> Hi Yosry,
>>
>> On 1/14/25 5:39 PM, Yosry Ahmed wrote:
>>> On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>
>>>> Hi Michal,
>>>>
>>>> On 1/13/25 10:25 AM, Shakeel Butt wrote:
>>>>> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
>>>>>> Hello JP.
>>>>>>
>>>>>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>>>> I've been experimenting with these changes to allow for separate
>>>>>>> updating/flushing of cgroup stats per-subsystem.
>>>>>>
>>>>>> Nice.
>>>>>>
>>>>>>> I reached a point where this started to feel stable in my local testing, so I
>>>>>>> wanted to share and get feedback on this approach.
>>>>>>
>>>>>> The split is not straightforwardly an improvement --
>>>>>
>>>>> The major improvement in my opinion is the performance isolation for
>>>>> stats readers, i.e. cpu stats readers do not need to flush memory stats.
>>>>>
>>>>>> there's at least
>>>>>> a higher memory footprint
>>>>>
>>>>> Yes, this is indeed the case, and JP, can you please give a ballpark on
>>>>> the memory overhead?
>>>>
>>>> Yes, the trade-off is using more memory to allow for separate trees.
>>>> With these patches the changes in allocated memory for the
>>>> cgroup_rstat_cpu instances and their associated locks are:
>>>>   static
>>>>     reduced by 58%
>>>>   dynamic
>>>>     increased by 344%
>>>>
>>>> The threefold increase on the dynamic side is attributed to now having 3
>>>> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
>>>> instead of originally just 1. The number will change if more subsystems
>>>> start or stop using rstat in the future. Feel free to let me know if you
>>>> would like to see the detailed breakdown of these values.
>>>
>>> What is the absolute per-CPU memory usage?
>>
>> This is what I calculate as the combined per-cpu usage.
>> before:
>>   one cgroup_rstat_cpu instance for every cgroup
>>     sizeof(cgroup_rstat_cpu) * nr_cgroups
>> after:
>>   three cgroup_rstat_cpu instances for every cgroup + updater lock for
>>   every subsystem plus one for base stats
>>     sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
>>       sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)
>>
>> Note that "every cgroup" includes the root cgroup. Also, 3 represents
>> the number of current rstat clients: base stats, memory, and io
>> (assuming all enabled).
>
> On a config I have at hand, sizeof(cgroup_rstat_cpu) is 160 bytes.
> Ignoring the spinlock for a second because it doesn't scale with
> cgroups, that'd be an extra 320 * nr_cgroups * nr_cpus bytes. On a
> moderately large machine with 256 CPUs and 100 cgroups, for example,
> that's ~8MB.
>

Revisiting the cgroup_rstat_cpu struct, there might be an opportunity
to save some memory here. This struct has several cgroup_base_stat
fields that are not useful to the actual subsystems (memory, io). So
I'm considering a scenario where the existing cgroup_rstat_cpu is used
for the base stats while a new, lighter struct is used for the others,
one that maintains compatibility with the rstat infrastructure.

>>
>> As I'm writing this, I realize I might need to include the bpf cgroups
>> as a fourth client and include this in my testing.
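A rough sketch of the split JP is considering; the struct names are assumptions and the fields are simplified from today's struct cgroup_rstat_cpu, where the tree-linkage pointers are needed by every rstat user but the cgroup_base_stat copies matter only to the base stats:

```c
/* Sketch (assumed names): the part every per-(css, cpu) node needs --
 * just the linkage into the per-cpu updated tree. */
struct css_rstat_cpu {
	struct cgroup_subsys_state *updated_children;	/* terminated by self */
	struct cgroup_subsys_state *updated_next;	/* NULL when not on the list */
};

/* Sketch: only the base-stat tree keeps the cputime snapshots that the
 * memory and io controllers have no use for. */
struct cgroup_rstat_base_cpu {
	struct css_rstat_cpu	rstatc;		/* common linkage embedded */
	struct cgroup_base_stat	bstat;
	struct cgroup_base_stat	last_bstat;
	struct cgroup_base_stat	subtree_bstat;
	struct cgroup_base_stat	last_subtree_bstat;
};
```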
Hi Michal,

On Thu, Jan 16, 2025 at 04:19:07PM +0100, Michal Koutný wrote:
> Hello.
>
> On Mon, Jan 13, 2025 at 10:25:34AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > and flushing effectiveness depends on how individual readers are
> > > correlated,
> >
> > Sorry, I am confused by the above statement; can you please expand on
> > what you meant by it?
> >
> > > OTOH writer correlation affects
> > > updaters when extending the update tree.
> >
> > Here I am confused about the difference between writer and updater.
>
> reader -- a call site that'd need to call cgroup_rstat_flush() to
> calculate aggregated stats
> writer (or updater) -- a call site that calls cgroup_rstat_updated()
> when it modifies whatever datum
>

Ah, so writer and updater are the same.

> By correlated readers I meant that stats for multiple controllers are
> read close to each other (time-wise). The first such reader does the heavy
> lifting, subsequent readers enjoy quick access.
> (With per-controller flushing, each reader would need to do the flush,
> and I'm suspecting the total time is non-linear wrt the parts.)

This is a good point. Actually, in prod we are observing that on machines
with very active workloads, close readers of different stats are flushing
all the subsystems even when the reads are very close-by time-wise. In
addition, there are users who are only reading non-memory stats and still
paying the cost of memory flushing. Please note that memory stats flushing
is the most expensive one at the moment, and it does not make sense to
flush memory stats for the above two cases.

>
> Similarly for writers: if multiple controllers' data change in a short
> window, only the first one has to construct the rstat tree from the top down
> to self; the others are updating the same tree.

Another good point. From my observation, the cost of rstat tree insertion
is very cheap and gets amortized, i.e. there are a lot of updates in
between flushes, such that the insertion cost is not noticeable at all,
at least in the perf traces.

>
> > In-kernel memcg stats readers will be unaffected most of the time with
> > this change. The only difference will be that when they flush, they will only
> > flush memcg stats.
>
> That "most of the time" is what depends on how active other controllers'
> readers are.
>
> > Here I am assuming you meant measurements in terms of cpu cost, or do you
> > have something else in mind?
>
> I have in mind something like Tejun's point 2:
> | 2. It has noticeable benefits in the targeted use cases.
>
> The cover letter mentions some old problems (which may not be problems
> nowadays with the memcg flushing reworks), and it is not clear how the
> separation into per-controller trees impacts (today's) problems.
>
> (I can imagine that if the problem is stated like: io.stat readers are
> unnecessarily waiting for memory.stat flushing, then the benefit can be
> shown (unless io.stat readers could benefit from flushing triggered by
> e.g. memory). But I did not get whether _that_ is the problem.)
>

The cover letter of v2 has more information on the motivation. The main
motivation is the same as I described above, i.e. many applications (stat
monitors) have to flush all the subsystems even when they are reading
different subsystem stats close by, and then there are applications that
are reading just cpu or io stats and still have to pay for memcg stat
flushing. This series is targeting these two very common scenarios.

Thanks Michal for your time and comments.
Shakeel