Message ID | 20220822001737.4120417-2-shakeelb@google.com (mailing list archive) |
---|---|
State | New |
Series | memcg: optimizatize charge codepath |
On Sun, Aug 21, 2022 at 8:17 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> irrespectively. It only needs to do that operation if the new value of
> protection is different from older one. This patch does that.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)       10482.7 Mbps
> With patch              14542.5 Mbps (38.7% improvement)
>
> With the patch, the throughput improved by 38.7%
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Nice speed up!

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

> ---
>  mm/page_counter.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
>                                        unsigned long usage)
>  {
>          unsigned long protected, old_protected;
> -        unsigned long low, min;
>          long delta;
>
>          if (!c->parent)
>                  return;
>
> -        min = READ_ONCE(c->min);
> -        if (min || atomic_long_read(&c->min_usage)) {
> -                protected = min(usage, min);
> +        protected = min(usage, READ_ONCE(c->min));
> +        old_protected = atomic_long_read(&c->min_usage);
> +        if (protected != old_protected) {
>                  old_protected = atomic_long_xchg(&c->min_usage, protected);
>                  delta = protected - old_protected;
>                  if (delta)
>                          atomic_long_add(delta, &c->parent->children_min_usage);
>          }
>
> -        low = READ_ONCE(c->low);
> -        if (low || atomic_long_read(&c->low_usage)) {
> -                protected = min(usage, low);
> +        protected = min(usage, READ_ONCE(c->low));
> +        old_protected = atomic_long_read(&c->low_usage);
> +        if (protected != old_protected) {
>                  old_protected = atomic_long_xchg(&c->low_usage, protected);
>                  delta = protected - old_protected;
>                  if (delta)
> --
> 2.37.1.595.g718a3a8f04-goog
>
On Mon, Aug 22, 2022 at 08:17:35AM +0800, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> irrespectively. It only needs to do that operation if the new value of
> protection is different from older one. This patch does that.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)       10482.7 Mbps
> With patch              14542.5 Mbps (38.7% improvement)
>
> With the patch, the throughput improved by 38.7%
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Reviewed-by: Feng Tang <feng.tang@intel.com>

Thanks!

- Feng

> ---
>  mm/page_counter.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
>                                        unsigned long usage)
>  {
>          unsigned long protected, old_protected;
> -        unsigned long low, min;
>          long delta;
>
>          if (!c->parent)
>                  return;
>
> -        min = READ_ONCE(c->min);
> -        if (min || atomic_long_read(&c->min_usage)) {
> -                protected = min(usage, min);
> +        protected = min(usage, READ_ONCE(c->min));
> +        old_protected = atomic_long_read(&c->min_usage);
> +        if (protected != old_protected) {
>                  old_protected = atomic_long_xchg(&c->min_usage, protected);
>                  delta = protected - old_protected;
>                  if (delta)
>                          atomic_long_add(delta, &c->parent->children_min_usage);
>          }
>
> -        low = READ_ONCE(c->low);
> -        if (low || atomic_long_read(&c->low_usage)) {
> -                protected = min(usage, low);
> +        protected = min(usage, READ_ONCE(c->low));
> +        old_protected = atomic_long_read(&c->low_usage);
> +        if (protected != old_protected) {
>                  old_protected = atomic_long_xchg(&c->low_usage, protected);
>                  delta = protected - old_protected;
>                  if (delta)
> --
> 2.37.1.595.g718a3a8f04-goog
>
On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
[...]
> > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > index eb156ff5d603..47711aa28161 100644
> > --- a/mm/page_counter.c
> > +++ b/mm/page_counter.c
> > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> >                                        unsigned long usage)
> >  {
> >          unsigned long protected, old_protected;
> > -        unsigned long low, min;
> >          long delta;
> >
> >          if (!c->parent)
> >                  return;
> >
> > -        min = READ_ONCE(c->min);
> > -        if (min || atomic_long_read(&c->min_usage)) {
> > -                protected = min(usage, min);
> > +        protected = min(usage, READ_ONCE(c->min));
> > +        old_protected = atomic_long_read(&c->min_usage);
> > +        if (protected != old_protected) {
>
> I have to cache that code back into brain. It is really subtle thing and
> it is not really obvious why this is still correct. I will think about
> that some more but the changelog could help with that a lot.

OK, so this patch will be most useful when the min > 0 && min <
usage because then the protection doesn't really change since the last
call. In other words when the usage grows above the protection and your
workload benefits from this change because that happens a lot as only a
part of the workload is protected. Correct?

Unless I have missed anything this shouldn't break the correctness but I
still have to think about the proportional distribution of the
protection because that adds to the complexity here.
On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> [...]
> > > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > > index eb156ff5d603..47711aa28161 100644
> > > --- a/mm/page_counter.c
> > > +++ b/mm/page_counter.c
> > > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > >                                        unsigned long usage)
> > >  {
> > >          unsigned long protected, old_protected;
> > > -        unsigned long low, min;
> > >          long delta;
> > >
> > >          if (!c->parent)
> > >                  return;
> > >
> > > -        min = READ_ONCE(c->min);
> > > -        if (min || atomic_long_read(&c->min_usage)) {
> > > -                protected = min(usage, min);
> > > +        protected = min(usage, READ_ONCE(c->min));
> > > +        old_protected = atomic_long_read(&c->min_usage);
> > > +        if (protected != old_protected) {
> >
> > I have to cache that code back into brain. It is really subtle thing and
> > it is not really obvious why this is still correct. I will think about
> > that some more but the changelog could help with that a lot.
>
> OK, so this patch will be most useful when the min > 0 && min <
> usage because then the protection doesn't really change since the last
> call. In other words when the usage grows above the protection and your
> workload benefits from this change because that happens a lot as only a
> part of the workload is protected. Correct?

Yes, that is correct. I hope the experiment setup is clear now.

> Unless I have missed anything this shouldn't break the correctness but I
> still have to think about the proportional distribution of the
> protection because that adds to the complexity here.

The patch is not changing any semantics. It is just removing an
unnecessary atomic xchg() for a specific scenario (min > 0 && min <
usage). I don't think there will be any change related to proportional
distribution of the protection.
On Mon 22-08-22 07:55:58, Shakeel Butt wrote:
> On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> > [...]
> > > > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > > > index eb156ff5d603..47711aa28161 100644
> > > > --- a/mm/page_counter.c
> > > > +++ b/mm/page_counter.c
> > > > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > > >                                        unsigned long usage)
> > > >  {
> > > >          unsigned long protected, old_protected;
> > > > -        unsigned long low, min;
> > > >          long delta;
> > > >
> > > >          if (!c->parent)
> > > >                  return;
> > > >
> > > > -        min = READ_ONCE(c->min);
> > > > -        if (min || atomic_long_read(&c->min_usage)) {
> > > > -                protected = min(usage, min);
> > > > +        protected = min(usage, READ_ONCE(c->min));
> > > > +        old_protected = atomic_long_read(&c->min_usage);
> > > > +        if (protected != old_protected) {
> > >
> > > I have to cache that code back into brain. It is really subtle thing and
> > > it is not really obvious why this is still correct. I will think about
> > > that some more but the changelog could help with that a lot.
> >
> > OK, so this patch will be most useful when the min > 0 && min <
> > usage because then the protection doesn't really change since the last
> > call. In other words when the usage grows above the protection and your
> > workload benefits from this change because that happens a lot as only a
> > part of the workload is protected. Correct?
>
> Yes, that is correct. I hope the experiment setup is clear now.

Maybe it is just me that it took a bit to grasp but maybe we want to
save our future selves from going through that mental process again. So
please just be explicit about that in the changelog. It is really the
part that workloads exceeding the protection will benefit the most that
would help to understand this patch.

> > Unless I have missed anything this shouldn't break the correctness but I
> > still have to think about the proportional distribution of the
> > protection because that adds to the complexity here.
>
> The patch is not changing any semantics. It is just removing an
> unnecessary atomic xchg() for a specific scenario (min > 0 && min <
> usage). I don't think there will be any change related to proportional
> distribution of the protection.

Yes, I suspect you are right. I just remembered previous fixes
like 503970e42325 ("mm: memcontrol: fix memory.low proportional
distribution") which just made me nervous that this is a tricky area.

I will have another look tomorrow with a fresh brain and send an ack.
On Mon, Aug 22, 2022 at 8:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-08-22 07:55:58, Shakeel Butt wrote:
> > On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > > > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> > > [...]
> > > > > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > > > > index eb156ff5d603..47711aa28161 100644
> > > > > --- a/mm/page_counter.c
> > > > > +++ b/mm/page_counter.c
> > > > > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > > > >                                        unsigned long usage)
> > > > >  {
> > > > >          unsigned long protected, old_protected;
> > > > > -        unsigned long low, min;
> > > > >          long delta;
> > > > >
> > > > >          if (!c->parent)
> > > > >                  return;
> > > > >
> > > > > -        min = READ_ONCE(c->min);
> > > > > -        if (min || atomic_long_read(&c->min_usage)) {
> > > > > -                protected = min(usage, min);
> > > > > +        protected = min(usage, READ_ONCE(c->min));
> > > > > +        old_protected = atomic_long_read(&c->min_usage);
> > > > > +        if (protected != old_protected) {
> > > >
> > > > I have to cache that code back into brain. It is really subtle thing and
> > > > it is not really obvious why this is still correct. I will think about
> > > > that some more but the changelog could help with that a lot.
> > >
> > > OK, so this patch will be most useful when the min > 0 && min <
> > > usage because then the protection doesn't really change since the last
> > > call. In other words when the usage grows above the protection and your
> > > workload benefits from this change because that happens a lot as only a
> > > part of the workload is protected. Correct?
> >
> > Yes, that is correct. I hope the experiment setup is clear now.
>
> Maybe it is just me that it took a bit to grasp but maybe we want to
> save our future selves from going through that mental process again. So
> please just be explicit about that in the changelog. It is really the
> part that workloads exceeding the protection will benefit the most that
> would help to understand this patch.
>

I will add more detail in the commit message in the next version.

> > > Unless I have missed anything this shouldn't break the correctness but I
> > > still have to think about the proportional distribution of the
> > > protection because that adds to the complexity here.
> >
> > The patch is not changing any semantics. It is just removing an
> > unnecessary atomic xchg() for a specific scenario (min > 0 && min <
> > usage). I don't think there will be any change related to proportional
> > distribution of the protection.
>
> Yes, I suspect you are right. I just remembered previous fixes
> like 503970e42325 ("mm: memcontrol: fix memory.low proportional
> distribution") which just made me nervous that this is a tricky area.
>
> I will have another look tomorrow with a fresh brain and send an ack.

I will wait for your ack before sending the next version.
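The case the thread converges on (min > 0 && min < usage) can be seen in a small standalone model. The sketch below is illustrative only and is not the kernel code: it uses C11 atomics in place of atomic_long_t, keeps only the memory.min side of the function, and the struct and function names (counter, propagate_min) are made up for the example.

```c
/* Illustrative userspace sketch of the patched check; not kernel code. */
#include <stdatomic.h>
#include <stdio.h>

struct counter {
        unsigned long min;        /* memory.min, set by the admin */
        atomic_ulong min_usage;   /* last propagated protection */
};

/* Returns 1 if the xchg path runs, 0 if the patched check skips it. */
static int propagate_min(struct counter *c, unsigned long usage)
{
        unsigned long protected = usage < c->min ? usage : c->min;
        unsigned long old_protected = atomic_load(&c->min_usage);

        if (protected == old_protected)
                return 0;        /* patched fast path: no atomic xchg */
        atomic_exchange(&c->min_usage, protected);
        return 1;
}

int main(void)
{
        struct counter c = { .min = 100 };

        atomic_init(&c.min_usage, 0);
        /* usage below min: the protection tracks usage, so the xchg runs */
        printf("usage=40:  xchg=%d\n", propagate_min(&c, 40));
        /* usage crosses min: the protection caps at 100, xchg runs once more */
        printf("usage=150: xchg=%d\n", propagate_min(&c, 150));
        /* usage keeps growing above an unchanged min: protected stays equal
         * to min_usage (100), so every further call skips the xchg */
        printf("usage=200: xchg=%d\n", propagate_min(&c, 200));
        printf("usage=300: xchg=%d\n", propagate_min(&c, 300));
        return 0;
}
```

Built with e.g. `cc -std=c11`, the first two calls print xchg=1 and the last two print xchg=0: once usage stays above an unchanged memory.min, every call takes the early-exit path, which is the pattern the netperf runs above keep hitting.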
On Mon, Aug 22, 2022 at 12:17:35AM +0000, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> irrespectively. It only needs to do that operation if the new value of
> protection is different from older one. This patch does that.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)       10482.7 Mbps
> With patch              14542.5 Mbps (38.7% improvement)
>
> With the patch, the throughput improved by 38.7%

Nice savings!

>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> ---
>  mm/page_counter.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
>                                        unsigned long usage)
>  {
>          unsigned long protected, old_protected;
> -        unsigned long low, min;
>          long delta;
>
>          if (!c->parent)
>                  return;
>
> -        min = READ_ONCE(c->min);
> -        if (min || atomic_long_read(&c->min_usage)) {
> -                protected = min(usage, min);
> +        protected = min(usage, READ_ONCE(c->min));
> +        old_protected = atomic_long_read(&c->min_usage);
> +        if (protected != old_protected) {
>                  old_protected = atomic_long_xchg(&c->min_usage, protected);
>                  delta = protected - old_protected;
>                  if (delta)
>                          atomic_long_add(delta, &c->parent->children_min_usage);

What if there is a concurrent update of c->min_usage? Then the patched
version can miss an update. I can't imagine a case when it will lead to
bad consequences, so probably it's ok. But not super obvious.
I think the way to think of it is that a missed update will be fixed by
the next one, so it's ok to run some time with old numbers.

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!
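Roman's missed-update scenario can be made concrete with a deterministic toy interleaving. This is again only a sketch under stated assumptions, not kernel code: the two "CPUs" are plain statements ordered by hand to mimic one possible race, the xchg is modeled as a plain store, and all names and numbers are invented for the example.

```c
/* Illustrative model of one racy interleaving; not kernel code and not
 * actually concurrent -- the ordering below is fixed on purpose. */
#include <stdio.h>

static unsigned long min = 100;        /* memory.min */
static unsigned long min_usage = 100;  /* last propagated protection */

static unsigned long compute_protected(unsigned long usage)
{
        return usage < min ? usage : min;
}

int main(void)
{
        /* CPU A sees usage drop to 80; CPU B sees usage grow to 150. */
        unsigned long a_prot = compute_protected(80);   /* 80  */
        unsigned long b_prot = compute_protected(150);  /* 100 */

        /* Both read min_usage before either writes it back. */
        unsigned long a_old = min_usage;                /* 100 */
        unsigned long b_old = min_usage;                /* 100 */

        /* B: protected == stale old_protected, so the patched check skips
         * the update that the unpatched code would have done unconditionally. */
        if (b_prot != b_old)
                min_usage = b_prot;                     /* not executed */

        /* A: 80 != 100, so A publishes 80. */
        if (a_prot != a_old)
                min_usage = a_prot;

        printf("min_usage after the race: %lu\n", min_usage);

        /* The next call with usage >= min recomputes and repairs the value. */
        if (compute_protected(150) != min_usage)
                min_usage = compute_protected(150);
        printf("min_usage after the next call: %lu\n", min_usage);
        return 0;
}
```

It prints 80 after the race and 100 after the follow-up call: the skipped update leaves a stale value only until the next caller recomputes the protection, which is why the thread treats the race as benign.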
diff --git a/mm/page_counter.c b/mm/page_counter.c
index eb156ff5d603..47711aa28161 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
                                       unsigned long usage)
 {
         unsigned long protected, old_protected;
-        unsigned long low, min;
         long delta;

         if (!c->parent)
                 return;

-        min = READ_ONCE(c->min);
-        if (min || atomic_long_read(&c->min_usage)) {
-                protected = min(usage, min);
+        protected = min(usage, READ_ONCE(c->min));
+        old_protected = atomic_long_read(&c->min_usage);
+        if (protected != old_protected) {
                 old_protected = atomic_long_xchg(&c->min_usage, protected);
                 delta = protected - old_protected;
                 if (delta)
                         atomic_long_add(delta, &c->parent->children_min_usage);
         }

-        low = READ_ONCE(c->low);
-        if (low || atomic_long_read(&c->low_usage)) {
-                protected = min(usage, low);
+        protected = min(usage, READ_ONCE(c->low));
+        old_protected = atomic_long_read(&c->low_usage);
+        if (protected != old_protected) {
                 old_protected = atomic_long_xchg(&c->low_usage, protected);
                 delta = protected - old_protected;
                 if (delta)
For cgroups using low or min protections, the function
propagate_protected_usage() was doing an atomic xchg() operation
irrespectively. It only needs to do that operation if the new value of
protection is different from older one. This patch does that.

To evaluate the impact of this optimization, on a 72 CPUs machine, we
ran the following workload in a three level of cgroup hierarchy with top
level having min and low setup appropriately. More specifically
memory.min equal to size of netperf binary and memory.low double of
that.

$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)       10482.7 Mbps
With patch              14542.5 Mbps (38.7% improvement)

With the patch, the throughput improved by 38.7%

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
---
 mm/page_counter.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)
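As a quick arithmetic cross-check of the quoted result, the improvement figure follows directly from the two throughput numbers in the changelog; the snippet below is only an illustrative calculation, not part of the patch.

```c
/* Recompute the reported improvement from the two quoted throughputs. */
#include <stdio.h>

int main(void)
{
        double base = 10482.7;     /* Mbps, 6.0-rc1 without the patch */
        double patched = 14542.5;  /* Mbps, with the patch */

        /* (14542.5 - 10482.7) / 10482.7 = 0.3873 -> prints 38.7% */
        printf("improvement: %.1f%%\n", (patched - base) / base * 100.0);
        return 0;
}
```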