Message ID | 20220822001737.4120417-4-shakeelb@google.com |
---|---|
State | New |
Series | memcg: optimizatize charge codepath |
On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and the network intensive workloads requiring throughput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance into 6.0. We will need to
> revisit this in the future for the ever-increasing demand for higher
> performance.
>
> Please note that the memcg charge path drains the per-cpu memcg charge
> stock, so there should not be any oom behavior change.
>
> To evaluate the impact of this optimization, on a 72 CPU machine, we
> ran the following workload in a three-level cgroup hierarchy with the
> top level having min and low set up appropriately. More specifically,
> memory.min equal to the size of the netperf binary and memory.low
> double that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)   10482.7 Mbps
> With patch          17064.7 Mbps (62.7% improvement)
>
> With the patch, the throughput improved by 62.7%.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Nice!

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

> ---
>  include/linux/memcontrol.h | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d31ce55b1c0..70ae91188e16 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -354,10 +354,11 @@ struct mem_cgroup {
>  };
>
>  /*
> - * size of first charge trial. "32" comes from vmscan.c's magic value.
> - * TODO: maybe necessary to use big numbers in big irons.
> + * size of first charge trial.
> + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> + * workload.
>   */
> -#define MEMCG_CHARGE_BATCH 32U
> +#define MEMCG_CHARGE_BATCH 64U
>
>  extern struct mem_cgroup *root_mem_cgroup;
>
> --
> 2.37.1.595.g718a3a8f04-goog
>
On Mon, Aug 22, 2022 at 08:17:37AM +0800, Shakeel Butt wrote:
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and the network intensive workloads requiring throughput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance into 6.0. We will need to
> revisit this in the future for the ever-increasing demand for higher
> performance.
>
> Please note that the memcg charge path drains the per-cpu memcg charge
> stock, so there should not be any oom behavior change.
>
> To evaluate the impact of this optimization, on a 72 CPU machine, we
> ran the following workload in a three-level cgroup hierarchy with the
> top level having min and low set up appropriately. More specifically,
> memory.min equal to the size of the netperf binary and memory.low
> double that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)   10482.7 Mbps
> With patch          17064.7 Mbps (62.7% improvement)
>
> With the patch, the throughput improved by 62.7%.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

This batch number has long been a pain point :) Thanks for the work!

Reviewed-by: Feng Tang <feng.tang@intel.com>

- Feng

> ---
>  include/linux/memcontrol.h | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d31ce55b1c0..70ae91188e16 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -354,10 +354,11 @@ struct mem_cgroup {
>  };
>
>  /*
> - * size of first charge trial. "32" comes from vmscan.c's magic value.
> - * TODO: maybe necessary to use big numbers in big irons.
> + * size of first charge trial.
> + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> + * workload.
>   */
> -#define MEMCG_CHARGE_BATCH 32U
> +#define MEMCG_CHARGE_BATCH 64U
>
>  extern struct mem_cgroup *root_mem_cgroup;
>
> --
> 2.37.1.595.g718a3a8f04-goog
>
On Mon 22-08-22 00:17:37, Shakeel Butt wrote:
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and the network intensive workloads requiring throughput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance into 6.0. We will need to
> revisit this in the future for the ever-increasing demand for higher
> performance.

Yes, the batch size has always been an arbitrary number. I do not think
there have ever been any solid grounds for the value we have now except
we need something and SWAP_CLUSTER_MAX was a good enough template.
Increasing it to 64 sounds like a reasonable step. It would be great to
have it scale based on the number of CPUs and potentially other factors
but that would be hard to get right and actually hard to evaluate
because it will depend on the specific workload.

> Please note that the memcg charge path drains the per-cpu memcg charge
> stock, so there should not be any oom behavior change.

It will have an effect on other stuff as well like high limit reclaim
backoff and stats flushing.

> To evaluate the impact of this optimization, on a 72 CPU machine, we
> ran the following workload in a three-level cgroup hierarchy with the
> top level having min and low set up appropriately. More specifically,
> memory.min equal to the size of the netperf binary and memory.low
> double that.

a similar feedback to the test case description as with other patches.

> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)   10482.7 Mbps
> With patch          17064.7 Mbps (62.7% improvement)
>
> With the patch, the throughput improved by 62.7%.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Anyway
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  include/linux/memcontrol.h | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d31ce55b1c0..70ae91188e16 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -354,10 +354,11 @@ struct mem_cgroup {
>  };
>
>  /*
> - * size of first charge trial. "32" comes from vmscan.c's magic value.
> - * TODO: maybe necessary to use big numbers in big irons.
> + * size of first charge trial.
> + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> + * workload.
>   */
> -#define MEMCG_CHARGE_BATCH 32U
> +#define MEMCG_CHARGE_BATCH 64U
>
>  extern struct mem_cgroup *root_mem_cgroup;
>
> --
> 2.37.1.595.g718a3a8f04-goog
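A back-of-the-envelope illustration of why the batch size shows up so strongly here (added for context; this arithmetic is mine, not part of the thread): a charge that misses the per-cpu stock refills it by charging a whole batch, and that refill costs one atomic page_counter update per level of the cgroup hierarchy. For the three-level hierarchy used in the test,

    shared-counter updates/sec ≈ 3 x (pages charged/sec) / MEMCG_CHARGE_BATCH

so doubling the batch from 32 to 64 roughly halves the contended atomic traffic on the charge path.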
On Mon, Aug 22, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote:
>
[...]
>
> > To evaluate the impact of this optimization, on a 72 CPU machine, we
> > ran the following workload in a three-level cgroup hierarchy with the
> > top level having min and low set up appropriately. More specifically,
> > memory.min equal to the size of the netperf binary and memory.low
> > double that.
>
> a similar feedback to the test case description as with other patches.

What more info should I add to the description? Why did I set up min
and low or something else?

> > $ netserver -6
> > # 36 instances of netperf with following params
> > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1)   10482.7 Mbps
> > With patch          17064.7 Mbps (62.7% improvement)
> >
> > With the patch, the throughput improved by 62.7%.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Reported-by: kernel test robot <oliver.sang@intel.com>
>
> Anyway
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks
On Mon, Aug 22, 2022 at 8:22 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-08-22 08:09:01, Shakeel Butt wrote:
> > On Mon, Aug 22, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > [...]
> > >
> > > > To evaluate the impact of this optimization, on a 72 CPU machine, we
> > > > ran the following workload in a three-level cgroup hierarchy with the
> > > > top level having min and low set up appropriately. More specifically,
> > > > memory.min equal to the size of the netperf binary and memory.low
> > > > double that.
> > >
> > > a similar feedback to the test case description as with other patches.
> >
> > What more info should I add to the description? Why did I set up min
> > and low or something else?
>
> I do see why you wanted to keep the test consistent over those three
> patches. I would just drop the reference to the protection configuration
> because it likely doesn't make much of an impact, does it? It is the
> multi cpu setup and false sharing that makes the real difference. Or am
> I wrong in assuming that?

No, you are correct. I will clean up the commit message in the next
version.
On Mon, Aug 22, 2022 at 12:17:37AM +0000, Shakeel Butt wrote:
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and the network intensive workloads requiring throughput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance into 6.0. We will need to
> revisit this in the future for the ever-increasing demand for higher
> performance.
>
> Please note that the memcg charge path drains the per-cpu memcg charge
> stock, so there should not be any oom behavior change.
>
> To evaluate the impact of this optimization, on a 72 CPU machine, we
> ran the following workload in a three-level cgroup hierarchy with the
> top level having min and low set up appropriately. More specifically,
> memory.min equal to the size of the netperf binary and memory.low
> double that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)   10482.7 Mbps
> With patch          17064.7 Mbps (62.7% improvement)
>
> With the patch, the throughput improved by 62.7%.

This is pretty significant!

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

I wonder only if we want to make it configurable (Idk a sysctl or maybe
a config option) and close the topic.

Thanks!
On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
[...]
> I wonder only if we want to make it configurable (Idk a sysctl or maybe
> a config option) and close the topic.

I do not think this is a good idea. We have other examples where we have
outsourced internal tuning to the userspace and it has mostly proven
impractical and long term more problematic than useful (e.g.
lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness just to
name some that come to my mind). I have seen these used incorrectly more
often than usefully.

In this case, I guess we should consider either moving to per memcg
charge batching and see whether the pcp overhead x memcg_count is worth
that or some automagic tuning of the batch size depending on how
effectively the batch is used. Certainly a lot of room for
experimenting.
On Mon, Aug 22, 2022 at 09:34:59PM +0200, Michal Hocko wrote:
> On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
> [...]
> > I wonder only if we want to make it configurable (Idk a sysctl or maybe
> > a config option) and close the topic.
>
> I do not think this is a good idea. We have other examples where we have
> outsourced internal tuning to the userspace and it has mostly proven
> impractical and long term more problematic than useful (e.g.
> lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness just to
> name some that come to my mind). I have seen these used incorrectly more
> often than usefully.

I agree, not a strong opinion here. But I wonder if somebody will
complain about Shakeel's change because of the reduced accuracy.
I know some users are using memory cgroups to track the size of various
workloads (including relatively small ones) and the 32->64 pages per cpu
change can be noticeable for them. But we can wait for an actual bug
report :)

> In this case, I guess we should consider either moving to per memcg
> charge batching and see whether the pcp overhead x memcg_count is worth
> that or some automagic tuning of the batch size depending on how
> effectively the batch is used. Certainly a lot of room for
> experimenting.

I'm not a big believer in automagic tuning here because it's a
fundamental trade-off of accuracy vs performance and various users might
make a different choice depending on their needs, not on the cpu count
or something else.

Per-memcg batching sounds interesting though. For example, we can likely
batch updates on leaf cgroups and have a single atomic update instead of
multiple most of the time. Or do you mean something different?

Thanks!
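To put a rough number on the accuracy concern (my own arithmetic, not from the thread): the per-CPU stock can hold up to one batch of pre-charged pages per CPU for a memcg, so the worst-case slack scales with the CPU count. A minimal sketch, assuming 4 KiB pages and the 72 CPU machine from the commit message:

/*
 * Back-of-the-envelope estimate of the worst-case slack sitting in per-CPU
 * charge stocks: up to one batch of pre-charged pages per CPU for a memcg.
 * Assumes 4 KiB pages and the 72 CPU machine from the commit message; this is
 * an illustrative userspace calculation, not kernel code.
 */
#include <stdio.h>

int main(void)
{
	const unsigned int cpus = 72, page_kib = 4;
	unsigned int batch;

	for (batch = 32; batch <= 64; batch *= 2)
		printf("MEMCG_CHARGE_BATCH=%u: up to %u KiB (~%u MiB) of slack\n",
		       batch, cpus * batch * page_kib,
		       cpus * batch * page_kib / 1024);
	return 0;
}

Roughly 9 MiB versus 18 MiB of possible slack is irrelevant for large workloads but visible for cgroups limited to a few tens of megabytes, which is presumably the kind of setup being worried about here.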
On Mon 22-08-22 19:22:26, Roman Gushchin wrote:
> On Mon, Aug 22, 2022 at 09:34:59PM +0200, Michal Hocko wrote:
> > On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
> > [...]
> > > I wonder only if we want to make it configurable (Idk a sysctl or maybe
> > > a config option) and close the topic.
> >
> > I do not think this is a good idea. We have other examples where we have
> > outsourced internal tuning to the userspace and it has mostly proven
> > impractical and long term more problematic than useful (e.g.
> > lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness just to
> > name some that come to my mind). I have seen these used incorrectly more
> > often than usefully.
>
> I agree, not a strong opinion here. But I wonder if somebody will
> complain about Shakeel's change because of the reduced accuracy.
> I know some users are using memory cgroups to track the size of various
> workloads (including relatively small ones) and the 32->64 pages per cpu
> change can be noticeable for them. But we can wait for an actual bug
> report :)

Yes, that would be my approach. I have seen reports like that already
but that was mostly because of heavy caching on the SLUB side on older
kernels. So there surely are workloads with small limits configured
(e.g. 20MB). On the other hand those users were receptive to adapting
their limits as they were kinda arbitrary anyway.

> > In this case, I guess we should consider either moving to per memcg
> > charge batching and see whether the pcp overhead x memcg_count is worth
> > that or some automagic tuning of the batch size depending on how
> > effectively the batch is used. Certainly a lot of room for
> > experimenting.
>
> I'm not a big believer in automagic tuning here because it's a
> fundamental trade-off of accuracy vs performance and various users might
> make a different choice depending on their needs, not on the cpu count
> or something else.

Yes, this is not an easy thing to get right. I was mostly thinking of
some auto scaling based on the limit size, or growing the stock when
cache hits are common and shrinking it when stocks get flushed often
because multiple memcgs compete over the same pcp stock. But to me it
seems like a per-memcg approach might lead to better results without too
many heuristics (albeit more memory hungry).

> Per-memcg batching sounds interesting though. For example, we can likely
> batch updates on leaf cgroups and have a single atomic update instead of
> multiple most of the time. Or do you mean something different?

No, that was exactly my thinking as well.
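To make the per-memcg batching discussion a bit more concrete, here is a small sketch added for illustration only; it is not taken from the thread or from mm/memcontrol.c, and the kernel's real fast path differs in detail:

/*
 * Toy model (not kernel code) of the behaviour discussed above: a per-CPU
 * stock caches headroom for a single memcg, so when several memcgs interleave
 * their charges on one CPU the stock keeps getting flushed and every charge
 * falls back to the shared counter. Names and details are simplified.
 */
#include <stdio.h>

#define BATCH 64

struct memcg { long usage; };			/* stands in for the atomic page_counter */
struct stock { struct memcg *cached; long nr; };

static long shared_counter_updates;

static void charge(struct stock *stock, struct memcg *memcg, long nr)
{
	if (stock->cached == memcg && stock->nr >= nr) {
		stock->nr -= nr;		/* fast path: purely CPU-local */
		return;
	}
	if (stock->cached)			/* switching memcgs flushes the old headroom */
		stock->cached->usage -= stock->nr;
	memcg->usage += BATCH;			/* would be an atomic RMW per hierarchy level */
	shared_counter_updates++;
	stock->cached = memcg;
	stock->nr = BATCH - nr;
}

int main(void)
{
	struct memcg a = { 0 }, b = { 0 };
	struct stock s = { 0 };
	int i;

	/* two memcgs alternating on one CPU defeat the batching entirely */
	for (i = 0; i < 1000; i++)
		charge(&s, (i & 1) ? &a : &b, 1);
	printf("%ld shared-counter updates for 1000 single-page charges\n",
	       shared_counter_updates);
	return 0;
}

With two cgroups interleaving on a CPU every charge misses the stock and hits the shared counter; a per-memcg (or per-memcg, per-CPU) stock would keep such a pattern on the fast path, at the memory cost of pcp state multiplied by the number of memcgs that is mentioned above.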
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d31ce55b1c0..70ae91188e16 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -354,10 +354,11 @@ struct mem_cgroup {
 };

 /*
- * size of first charge trial. "32" comes from vmscan.c's magic value.
- * TODO: maybe necessary to use big numbers in big irons.
+ * size of first charge trial.
+ * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
+ * workload.
  */
-#define MEMCG_CHARGE_BATCH 32U
+#define MEMCG_CHARGE_BATCH 64U

 extern struct mem_cgroup *root_mem_cgroup;
For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
machines and the network intensive workloads requiring throughput in
Gbps, 32 is too small and makes the memcg charging path a bottleneck.
For now, increase it to 64 for easy acceptance into 6.0. We will need to
revisit this in the future for the ever-increasing demand for higher
performance.

Please note that the memcg charge path drains the per-cpu memcg charge
stock, so there should not be any oom behavior change.

To evaluate the impact of this optimization, on a 72 CPU machine, we
ran the following workload in a three-level cgroup hierarchy with the
top level having min and low set up appropriately. More specifically,
memory.min equal to the size of the netperf binary and memory.low
double that.

$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)   10482.7 Mbps
With patch          17064.7 Mbps (62.7% improvement)

With the patch, the throughput improved by 62.7%.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
---
 include/linux/memcontrol.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)