From patchwork Fri Mar 19 04:16:17 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Song Bao Hua (Barry Song)"
 <song.bao.hua@hisilicon.com>
X-Patchwork-Id: 12149977
Return-Path: 
 <SRS0=5UrE=IR=lists.infradead.org=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-17.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH,
	DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,
	INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,
	USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id AB58CC433E6
	for <linux-arm-kernel@archiver.kernel.org>;
 Fri, 19 Mar 2021 04:25:06 +0000 (UTC)
Received: from desiato.infradead.org (desiato.infradead.org [90.155.92.199])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPS id 29D0864EFD
	for <linux-arm-kernel@archiver.kernel.org>;
 Fri, 19 Mar 2021 04:25:06 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 29D0864EFD
Authentication-Results: mail.kernel.org;
 dmarc=fail (p=none dis=none) header.from=hisilicon.com
Authentication-Results: mail.kernel.org;
 spf=none
 smtp.mailfrom=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=lists.infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding
	:Content-Type:List-Subscribe:List-Help:List-Post:List-Archive:
	List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To:Message-ID:Date:
	Subject:CC:To:From:Reply-To:Content-ID:Content-Description:Resent-Date:
	Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner;
	 bh=gYEfE4oOuu6gLTFAnDUv70xtCjne/4RtP5L/5qMtJYo=; b=iDHwkREm9++gNXkQGsVXGyqc6
	Ny7Rc5aNajqKhLDl+cD/TeoRfLe2hrN4uF94ZRCNz7ieVPIAY1WlllI+UhVNzlgrVrCyCnjq0i0m/
	hOJ/9CqCuDUT16pM+pOiYY7WCgDqCATAFZ5xEXsgWnJGFh53bfHiBD65oKbu1RnOTgyx3x7M9XfoY
	0Qd/1CCqmmS3iBsu8+lGnsjsPPGg5cTNCeXLy0Kx0gKx6Heb3ZsvbK0NAY1yqtQ/Y9SmoBDI/xMWa
	ZeWGki98XXl1jVgUs8QdBfemWZXQl7umIntb8e2iq1HTdLYFf7zX5b0SDKI9/CPHr6tU00iu9LD9x
	W/KarhO1w==;
Received: from localhost ([::1] helo=desiato.infradead.org)
	by desiato.infradead.org with esmtp (Exim 4.94 #2 (Red Hat Linux))
	id 1lN6fl-006Swe-JM; Fri, 19 Mar 2021 04:23:41 +0000
Received: from szxga05-in.huawei.com ([45.249.212.191])
 by desiato.infradead.org with esmtps (Exim 4.94 #2 (Red Hat Linux))
 id 1lN6fW-006Sus-Fj
 for linux-arm-kernel@lists.infradead.org; Fri, 19 Mar 2021 04:23:29 +0000
Received: from DGGEMS405-HUB.china.huawei.com (unknown [172.30.72.60])
 by szxga05-in.huawei.com (SkyGuard) with ESMTP id 4F1rLc4ztPzrXP7;
 Fri, 19 Mar 2021 12:21:28 +0800 (CST)
Received: from SWX921481.china.huawei.com (10.126.203.211) by
 DGGEMS405-HUB.china.huawei.com (10.3.19.205) with Microsoft SMTP Server id
 14.3.498.0; Fri, 19 Mar 2021 12:23:15 +0800
From: Barry Song <song.bao.hua@hisilicon.com>
To: <tim.c.chen@linux.intel.com>, <catalin.marinas@arm.com>,
 <will@kernel.org>, <rjw@rjwysocki.net>, <vincent.guittot@linaro.org>,
 <bp@alien8.de>, <tglx@linutronix.de>, <mingo@redhat.com>, <lenb@kernel.org>,
 <peterz@infradead.org>, <dietmar.eggemann@arm.com>, <rostedt@goodmis.org>,
 <bsegall@google.com>, <mgorman@suse.de>
CC: <msys.mizuma@gmail.com>, <valentin.schneider@arm.com>,
 <gregkh@linuxfoundation.org>, <jonathan.cameron@huawei.com>,
 <juri.lelli@redhat.com>, <mark.rutland@arm.com>, <sudeep.holla@arm.com>,
 <aubrey.li@linux.intel.com>, <linux-arm-kernel@lists.infradead.org>,
 <linux-kernel@vger.kernel.org>, <linux-acpi@vger.kernel.org>,
 <x86@kernel.org>, <xuwei5@huawei.com>, <prime.zeng@hisilicon.com>,
 <guodong.xu@linaro.org>, <yangyicong@huawei.com>, <liguozhu@hisilicon.com>,
 <linuxarm@openeuler.org>, <hpa@zytor.com>, Barry Song
 <song.bao.hua@hisilicon.com>
Subject: [RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before
 scanning the whole llc
Date: Fri, 19 Mar 2021 17:16:17 +1300
Message-ID: <20210319041618.14316-4-song.bao.hua@hisilicon.com>
X-Mailer: git-send-email 2.21.0.windows.1
In-Reply-To: <20210319041618.14316-1-song.bao.hua@hisilicon.com>
References: <20210319041618.14316-1-song.bao.hua@hisilicon.com>
MIME-Version: 1.0
X-Originating-IP: [10.126.203.211]
X-CFilter-Loop: Reflected
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 
X-CRM114-CacheID: sfid-20210319_042327_102889_30D7E4B8 
X-CRM114-Status: GOOD (  20.64  )
X-BeenThere: linux-arm-kernel@lists.infradead.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: <linux-arm-kernel.lists.infradead.org>
List-Unsubscribe: 
 <http://lists.infradead.org/mailman/options/linux-arm-kernel>,
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=unsubscribe>
List-Archive: <http://lists.infradead.org/pipermail/linux-arm-kernel/>
List-Post: <mailto:linux-arm-kernel@lists.infradead.org>
List-Help: <mailto:linux-arm-kernel-request@lists.infradead.org?subject=help>
List-Subscribe: 
 <http://lists.infradead.org/mailman/listinfo/linux-arm-kernel>,
 <mailto:linux-arm-kernel-request@lists.infradead.org?subject=subscribe>
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: 
 linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org

On kunpeng920, cpus within one cluster can communicate with each other
much faster than cpus in different clusters. A simple hackbench run
demonstrates this: running hackbench on 4 cpus within a single cluster
versus 4 cpus in different clusters shows a large contrast:
(1) within a cluster:
root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 4.285

(2) across clusters:
root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 5.524

This inspires us to change the wake_affine path to scan the cluster
before scanning the whole LLC, in order to gather related tasks in one
cluster, which is what this patch does.
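
The scan order described above can be sketched in userspace Python. This
is only an illustrative model, not the kernel code: pick_idle_cpu(),
wrap_from() and the fixed cluster_size of 4 are assumptions made for the
example (the patch itself uses cpu_cluster_mask()), and the SIS_PROP scan
limit and SMT handling are ignored.

```python
def wrap_from(cpus, start):
    """Return cpus in ascending order starting at `start`, wrapping around."""
    ordered = sorted(cpus)
    return [c for c in ordered if c >= start] + [c for c in ordered if c < start]


def pick_idle_cpu(target, llc, allowed, idle, cluster_size=4):
    """Model of the two-stage scan: target's cluster first, then the rest of the LLC.

    cluster_size=4 is an assumption for this sketch only.
    Returns the chosen cpu, or -1 if no idle cpu is found.
    """
    candidates = set(llc) & set(allowed)
    base = target - target % cluster_size
    cluster = {c for c in candidates if base <= c < base + cluster_size}

    # Stage 1: scan only the cpus sharing a cluster with target.
    for cpu in wrap_from(cluster, target):
        if cpu in idle:
            return cpu

    # Stage 2: scan the remaining cpus of the LLC, excluding the cluster.
    for cpu in wrap_from(candidates - cluster, target):
        if cpu in idle:
            return cpu

    return -1


if __name__ == "__main__":
    llc = range(24)
    # cpu 7 is in target's cluster (4-7), so it wins over cpu 8.
    print(pick_idle_cpu(6, llc, llc, idle={7, 8}))
```

The second stage excludes the cluster cpus that were already scanned, so
no cpu is visited twice.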

To evaluate the performance impact on related tasks talking with each
other, we run the hackbench command below with the -g parameter ranging
from 2 to 14. For each value of g, we run the command 10 times and take
the average time:
$ numactl -N 0 hackbench -p -T -l 20000 -g $1

hackbench reports the time needed to complete a certain number of
message transmissions between a certain number of tasks, for example:
$ numactl -N 0 hackbench -p -T -l 20000 -g 10
Running in threaded mode with 10 groups using 40 file descriptors each
(== 400 tasks)
Each sender will pass 20000 messages of 100 bytes

The hackbench results with (w/) and without (w/o) the cluster patch are
below (time in seconds, lower is better):
g=        2      4      6      8     10      12      14
w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
w/ : 1.7881 3.7371 5.3301 6.9747 8.6909  9.9235 11.2608
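
To make the size of the gain easier to read, the relative reduction in
run time can be computed from the numbers above. This is a small helper
sketch, not part of the patch; the variable names are made up for the
example and the values are copied from the table.

```python
# Average hackbench times in seconds, copied from the table above.
groups = [2, 4, 6, 8, 10, 12, 14]
without_patch = [1.8151, 3.8499, 5.5142, 7.2491, 9.0340, 10.7345, 12.0929]
with_patch = [1.7881, 3.7371, 5.3301, 6.9747, 8.6909, 9.9235, 11.2608]

# Relative reduction in run time, in percent, keyed by the -g value.
gains = {
    g: (base - new) / base * 100.0
    for g, base, new in zip(groups, without_patch, with_patch)
}

if __name__ == "__main__":
    for g in groups:
        print("g=%2d: %.2f%% less time" % (g, gains[g]))
```

The reduction ranges from roughly 1.5% at g=2 to roughly 7.5% at g=12.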

Some recent commits have evidently improved hackbench on the baseline,
so the change in the wake_affine path brings a smaller gain on hackbench
than what we got in RFC v4.
Leveraging wake_affine is also much trickier than leveraging the
scattering of tasks in the previous patch, because load balancing might
pull apart tasks which have been packed into a cluster. Alternative
suggestions are welcome.

In order to figure out how many times a cpu is picked from within the
cluster and how many times from outside it, this patch adds a tracepoint
for debugging purposes. The following userspace bcc script prints a
histogram of the results of select_idle_cpu():
#!/usr/bin/python
#
# selectidlecpu.py	select idle cpu histogram.
#
# A Ctrl-C will print the gathered histogram then exit.
#
# 18-March-2021 Barry Song Created this.

from __future__ import print_function
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""

BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
{
	u32 e;
	/* check -1 first: -1 / 4 == 0 would match targets 0-3 */
	if (args->idle == -1)
		e = 2; /* no idle cpu */
	else if (args->idle / 4 == args->target / 4)
		e = 0; /* idle cpu from same cluster (4 cpus per cluster) */
	else
		e = 1; /* idle cpu from different clusters */

	dist.increment(e);
	return 0;
}
""")

# header
print("Tracing... Hit Ctrl-C to end.")

# trace until Ctrl-C
try:
	sleep(99999999)
except KeyboardInterrupt:
	print()

# output

print("\nlinear histogram")
print("~~~~~~~~~~~~~~~~")
b["dist"].print_linear_hist("idle")

Even with g=14, when the system is quite busy, an idle cpu is still
sometimes picked from the local cluster:
linear histogram
~~~~~~~~~~~~~~
     idle          : count     distribution
        0          : 15234281 |***********                             |
        1          : 18494    |                                        |
        2          : 53066152 |****************************************|

0: local cluster
1: out of the cluster
2: select_idle_cpu() returns -1
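
The raw counts are easier to interpret as percentages. The snippet below
is an illustrative post-processing sketch, not part of the patch or of
the bcc script; the counts are copied from the histogram above and the
names are made up for the example.

```python
# Bucket counts copied from the histogram above.
counts = {0: 15234281, 1: 18494, 2: 53066152}
labels = {0: "local cluster", 1: "out of the cluster", 2: "no idle cpu"}

total = sum(counts.values())

# Share of all select_idle_cpu() calls per bucket, in percent.
share = {k: v / total * 100.0 for k, v in counts.items()}

# Among calls that did find an idle cpu, how often was it a local one?
found = counts[0] + counts[1]
local_of_found = counts[0] / found * 100.0

if __name__ == "__main__":
    for k in sorted(counts):
        print("%-20s %6.2f%%" % (labels[k], share[k]))
    print("local among successful picks: %.2f%%" % local_of_found)
```

By these counts, roughly 22% of the calls found an idle cpu in the local
cluster, and when an idle cpu was found it was almost always a local one.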

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 include/trace/events/sched.h | 22 ++++++++++++++++++++++
 kernel/sched/fair.c          | 32 +++++++++++++++++++++++++++++++-
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index cbe3e15..86608cf 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -136,6 +136,28 @@
 );
 
 /*
+ * Tracepoint for select_idle_cpu:
+ */
+TRACE_EVENT(sched_select_idle_cpu,
+
+	TP_PROTO(int target, int idle),
+
+	TP_ARGS(target, idle),
+
+	TP_STRUCT__entry(
+		__field(	int,	target			)
+		__field(	int,	idle			)
+	),
+
+	TP_fast_assign(
+		__entry->target	= target;
+		__entry->idle = idle;
+	),
+
+	TP_printk("target=%d idle=%d", __entry->target, __entry->idle)
+);
+
+/*
  * Tracepoint for waking up a task:
  */
 DECLARE_EVENT_CLASS(sched_wakeup_template,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c92ad9f2..3892d42 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6150,7 +6150,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	if (!sched_cluster_active())
+		cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+#ifdef CONFIG_SCHED_CLUSTER
+	if (sched_cluster_active())
+		cpumask_and(cpus, cpu_cluster_mask(target), p->cpus_ptr);
+#endif
 
 	if (sched_feat(SIS_PROP) && !smt) {
 		u64 avg_cost, avg_idle, span_avg;
@@ -6171,6 +6176,29 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		time = cpu_clock(this);
 	}
 
+#ifdef CONFIG_SCHED_CLUSTER
+	if (sched_cluster_active()) {
+		for_each_cpu_wrap(cpu, cpus, target) {
+			if (smt) {
+				i = select_idle_core(p, cpu, cpus, &idle_cpu);
+				if ((unsigned int)i < nr_cpumask_bits)
+					return i;
+
+			} else {
+				if (!--nr)
+					return -1;
+				idle_cpu = __select_idle_cpu(cpu);
+				if ((unsigned int)idle_cpu < nr_cpumask_bits) {
+					goto done;
+				}
+			}
+		}
+
+		cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+		cpumask_andnot(cpus, cpus, cpu_cluster_mask(target));
+	}
+#endif
+
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (smt) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -6186,6 +6214,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		}
 	}
 
+done:
 	if (smt)
 		set_idle_cores(this, false);
 
@@ -6324,6 +6353,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		return target;
 
 	i = select_idle_cpu(p, sd, target);
+	trace_sched_select_idle_cpu(target, i);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;