| Message ID | 20210320221432.924-1-song.bao.hua@hisilicon.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | sched/fair: remove redundant test_idle_cores for non-smt |
Hi Barry,

On 2021/3/21 6:14, Barry Song wrote:
> update_idle_core() is only done for the case of sched_smt_present.
> but test_idle_cores() is done for all machines even those without
> smt.

The patch looks good to me.
May I know for what case we need to keep CONFIG_SCHED_SMT for non-smt
machines?

Thanks,
-Aubrey

> this could contribute to up 8%+ hackbench performance loss on a
> machine like kunpeng 920 which has no smt. this patch removes the
> redundant test_idle_cores() for non-smt machines.
>
> [...]
> -----Original Message-----
> From: Li, Aubrey [mailto:aubrey.li@linux.intel.com]
> Sent: Monday, March 22, 2021 5:37 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> vincent.guittot@linaro.org; mingo@redhat.com; peterz@infradead.org;
> juri.lelli@redhat.com; dietmar.eggemann@arm.com; rostedt@goodmis.org;
> bsegall@google.com; mgorman@suse.de
> Cc: valentin.schneider@arm.com; linux-arm-kernel@lists.infradead.org;
> linux-kernel@vger.kernel.org; xuwei (O) <xuwei5@huawei.com>; Zengtao (B)
> <prime.zeng@hisilicon.com>; guodong.xu@linaro.org; yangyicong
> <yangyicong@huawei.com>; Liguozhu (Kenneth) <liguozhu@hisilicon.com>;
> linuxarm@openeuler.org
> Subject: [Linuxarm] Re: [PATCH] sched/fair: remove redundant test_idle_cores
> for non-smt
>
> Hi Barry,
>
> On 2021/3/21 6:14, Barry Song wrote:
> > update_idle_core() is only done for the case of sched_smt_present.
> > but test_idle_cores() is done for all machines even those without
> > smt.
>
> The patch looks good to me.
> May I know for what case we need to keep CONFIG_SCHED_SMT for non-smt
> machines?

Hi Aubrey,

I think the defconfig of arm64 has always enabled CONFIG_SCHED_SMT:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/configs/defconfig
It is probably true for x86 as well. I don't think Linux distributions will
build a separate kernel for machines without SMT, so the kernel depends on
parsing the topology at runtime to figure out whether SMT is present, rather
than on a rebuild.

> Thanks,
> -Aubrey
>
> [...]

Thanks,
Barry
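Barry's point about runtime detection can also be observed from userspace: recent kernels (4.18+) expose the current SMT state under /sys/devices/system/cpu/smt/. A minimal sketch of such a check; the sysfs path is the standard one, but its presence depends on kernel version and config, and the optional path argument to `smt_active` is only an illustration hook, not a real interface:

```shell
# Userspace view of the same runtime SMT state that the scheduler's
# sched_smt_present static key tracks. The sysfs file below exists on
# Linux >= 4.18; the optional path argument exists only so the logic
# can be exercised against a stand-in file.
smt_active() {
    f=${1:-/sys/devices/system/cpu/smt/active}
    [ -r "$f" ] && [ "$(cat "$f")" = "1" ]
}

if smt_active; then
    echo "SMT active"
else
    echo "SMT not active (or state file absent)"
fi
```

On a Kunpeng 920, the file reads 0 at runtime even though CONFIG_SCHED_SMT was built in, which is exactly the case the patch optimizes.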
On Sun, Mar 21, 2021 at 11:14:32AM +1300, Barry Song wrote:
> update_idle_core() is only done for the case of sched_smt_present.
> but test_idle_cores() is done for all machines even those without
> smt.
> this could contribute to up 8%+ hackbench performance loss on a
> machine like kunpeng 920 which has no smt. this patch removes the
> redundant test_idle_cores() for non-smt machines.
>
> [...]
>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>

Acked-by: Mel Gorman <mgorman@suse.de>

That said, the numa_idle_core() function then becomes slightly redundant.
A possible follow-up is to move the "idle_core >= 0" check from
numa_idle_core() to its caller in update_numa_stats(), and then remove the
now-redundant !static_branch_likely(&sched_smt_present) check in
numa_idle_core().
On Sun, Mar 21, 2021 at 11:14:32AM +1300, Barry Song wrote:
> update_idle_core() is only done for the case of sched_smt_present.
> but test_idle_cores() is done for all machines even those without
> smt.
> this could contribute to up 8%+ hackbench performance loss on a
> machine like kunpeng 920 which has no smt. this patch removes the
> redundant test_idle_cores() for non-smt machines.
>
> [...]

The patch looks obvious, but the Changelog and Subject needed a lot of help.
I've changed it like so:

---
Subject: sched/fair: Optimize test_idle_cores() for !SMT
From: Barry Song <song.bao.hua@hisilicon.com>
Date: Sun, 21 Mar 2021 11:14:32 +1300

update_idle_core() is only done for the case of sched_smt_present.
But test_idle_cores() is done for all machines even those without
SMT.

This can contribute to up to 8%+ hackbench performance loss on a
machine like Kunpeng 920 which has no SMT. This patch removes the
redundant test_idle_cores() for !SMT machines.

Hackbench is run with -g {2..14}; for each g it is run 10 times to
get an average:

  $ numactl -N 0 hackbench -p -T -l 20000 -g $1

The below is the result of hackbench w/ and w/o this patch:

  g=      2      4      6      8     10      12      14
  w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
  w/ : 1.8428 3.7436 5.4501 6.9522 8.2882  9.9535 11.3367
                            +4.1%  +8.3%   +7.3%   +6.3%

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20210320221432.924-1-song.bao.hua@hisilicon.com
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6038,9 +6038,11 @@ static inline bool test_idle_cores(int c
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		return READ_ONCE(sds->has_idle_cores);
+	if (static_branch_likely(&sched_smt_present)) {
+		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+		if (sds)
+			return READ_ONCE(sds->has_idle_cores);
+	}
 
 	return def;
 }
On Sat, 20 Mar 2021 at 23:21, Barry Song <song.bao.hua@hisilicon.com> wrote:
>
> update_idle_core() is only done for the case of sched_smt_present.
> but test_idle_cores() is done for all machines even those without
> smt.
> this could contribute to up 8%+ hackbench performance loss on a
> machine like kunpeng 920 which has no smt. this patch removes the
> redundant test_idle_cores() for non-smt machines.
>
> [...]
>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e2ab1e..de42a32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6038,9 +6038,11 @@ static inline bool test_idle_cores(int cpu, bool def)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		return READ_ONCE(sds->has_idle_cores);
+	if (static_branch_likely(&sched_smt_present)) {
+		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+		if (sds)
+			return READ_ONCE(sds->has_idle_cores);
+	}
 
 	return def;
 }
update_idle_core() is only done for the case of sched_smt_present.
but test_idle_cores() is done for all machines even those without
smt.

this could contribute to up 8%+ hackbench performance loss on a
machine like kunpeng 920 which has no smt. this patch removes the
redundant test_idle_cores() for non-smt machines.

we run the below hackbench with different -g parameter from 2 to
14, for each different g, we run the command 10 times and get the
average time:

  $ numactl -N 0 hackbench -p -T -l 20000 -g $1

hackbench will report the time which is needed to complete a certain
number of messages transmissions between a certain number of tasks,
for example:

  $ numactl -N 0 hackbench -p -T -l 20000 -g 10
  Running in threaded mode with 10 groups using 40 file descriptors each
  (== 400 tasks)
  Each sender will pass 20000 messages of 100 bytes

The below is the result of hackbench w/ and w/o this patch:

  g=      2      4      6      8     10      12      14
  w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
  w/ : 1.8428 3.7436 5.4501 6.9522 8.2882  9.9535 11.3367
                            +4.1%  +8.3%   +7.3%   +6.3%

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
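The measurement procedure above (10 runs per -g value, averaged) can be sketched as a small shell loop. This is a reconstruction, not the author's actual script: the hackbench invocation is the one quoted in the commit message, the awk filter assumes hackbench's usual "Time: <seconds>" summary line, and `run_bench`/`avg` are names invented here.

```shell
#!/bin/sh
# avg: mean of whitespace-separated numbers on stdin, 4 decimal places.
avg() {
    awk '{ for (i = 1; i <= NF; i++) { s += $i; n++ } } END { printf "%.4f", s / n }'
}

# Sketch of the measurement loop from the commit message: for each
# group count g, run hackbench 10 times pinned to NUMA node 0 and
# average the "Time:" summary lines.
run_bench() {
    for g in 2 4 6 8 10 12 14; do
        times=$(for i in $(seq 1 10); do
            numactl -N 0 hackbench -p -T -l 20000 -g "$g" | awk '/^Time:/ {print $2}'
        done)
        echo "g=$g avg=$(echo $times | avg)s"
    done
}

# run_bench   # uncomment on a machine with numactl and hackbench installed
```

Pinning to a single NUMA node with `numactl -N 0` keeps all tasks inside one LLC domain, which is where the has_idle_cores lookup being removed sits on the hot wakeup path.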