From: Barry Song <song.bao.hua@hisilicon.com>
Subject: [RFC PATCH v2 0/2] scheduler: expose the topology of clusters and add cluster scheduler
Date: Tue, 1 Dec 2020 15:59:42 +1300
Message-ID: <20201201025944.18260-1-song.bao.hua@hisilicon.com>
ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data, while each cluster
has a local L3 tag. The clusters within a NUMA node also share some
internal system bus. This means cache is much more affine inside one
cluster than across clusters.

+-----------------------------------+                          +---------+
|  +------+  +------+               +--------------------------+         |
|  | CPU0 |  | CPU1 |               |    +-----------+         |         |
|  +------+  +------+               |    |           |         |         |
|                        cluster    +----+  L3 tag   |         |         |
|  +------+  +------+               |    |           |         |         |
|  | CPU2 |  | CPU3 |               |    +-----------+         |         |
|  +------+  +------+               |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+  +------+               +--------------------------+   L3    |
|  | CPU4 |  | CPU5 |               |    +-----------+         |  data   |
|  +------+  +------+               |    |           |         |         |
|                        cluster    +----+  L3 tag   |         |         |
|  +------+  +------+               |    |           |         |         |
|  | CPU6 |  | CPU7 |               |    +-----------+         |         |
|  +------+  +------+               |                          |         |
+-----------------------------------+                          |         |
                                                               |         |
   ... 4 more clusters (cpu8-cpu23),                           |         |
       each with its own L3 tag ...                            |         |
                                                               |         |
                                                               +---------+

The illustration above is still a simplification of what is actually
going on, but it is a more accurate model than the flat topology the
current kernel presents. Through the following small program, you can
see the performance impact of running it in one cluster versus across
two clusters (the two threads keep bouncing the cache line holding f.x
and f.y between their CPUs):

#include <pthread.h>

struct foo {
	int x;
	int y;
} f;				/* x and y share one cache line */

void *thread1_fun(void *param)
{
	int s = 0;

	for (int i = 0; i < 0xfffffff; i++)
		s += f.x;	/* keep reading the shared line */

	return NULL;
}

void *thread2_fun(void *param)
{
	for (int i = 0; i < 0xfffffff; i++)
		f.y++;		/* keep dirtying the shared line */

	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid1, tid2;

	pthread_create(&tid1, NULL, thread1_fun, NULL);
	pthread_create(&tid2, NULL, thread2_fun, NULL);
	pthread_join(tid1, NULL);
	pthread_join(tid2, NULL);

	return 0;
}

While running this program in one cluster, it takes:

$ time taskset -c 0,1 ./a.out
real	0m0.832s
user	0m1.649s
sys	0m0.004s

As a contrast, it takes much more time if we run the same program
across two clusters:

$ time taskset -c 0,4 ./a.out
real	0m1.133s
user	0m1.960s
sys	0m0.000s

0.832/1.133 = 73%, i.e. the cross-cluster run takes roughly 36% longer,
which is a huge difference. This implies that we should let the Linux
scheduler use the cluster topology to make better load-balancing and
WAKE_AFFINE decisions. Unfortunately, all of cpu0-23 are treated
equally in the current kernel running on Kunpeng 920.
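As a side note, the same comparison can be driven from inside the
program instead of via taskset. The helper below is only an
illustrative sketch (pin_to_cpu is a name made up here, and it relies
on the GNU pthread_setaffinity_np extension); the cpu numbers follow
the Kunpeng 920 layout above, where cpu0-3 form one cluster and
cpu4-7 the next:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin an already-created thread to a single cpu. pin_to_cpu() is a
 * made-up helper for illustration only; it uses the GNU
 * pthread_setaffinity_np extension rather than anything added by
 * this series. */
static void pin_to_cpu(pthread_t tid, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(tid, sizeof(set), &set);
}

Calling pin_to_cpu(tid1, 0) and pin_to_cpu(tid2, 1) after the two
pthread_create() calls reproduces the intra-cluster case, while
pin_to_cpu(tid2, 4) reproduces the cross-cluster case.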
This patchset first exposes the topology, then adds a new sched_domain
between smt and mc. The new sched_domain will influence the load
balancing and wake_affine decisions of the scheduler.

The code is still pretty much a proof of concept and needs lots of
benchmarking and tuning. However, a rough hackbench result shows:

While running hackbench on one numa node (cpu0-cpu23), we may achieve
5%+ performance improvement with the new sched_domain.

While running hackbench on two numa nodes (cpu0-cpu47), we may achieve
49%+ performance improvement with the new sched_domain.

Although I believe there is still a lot to do, sending an RFC to get
feedback from community experts might be helpful for the next step.

Barry Song (1):
  scheduler: add scheduler level for clusters

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die.

 Documentation/admin-guide/cputopology.rst | 26 +++++++++++---
 arch/arm64/Kconfig                        |  7 ++++
 arch/arm64/kernel/smp.c                   | 17 +++++++++
 arch/arm64/kernel/topology.c              |  2 ++
 drivers/acpi/pptt.c                       | 60 +++++++++++++++++++++++++++++++
 drivers/base/arch_topology.c              | 14 ++++++++
 drivers/base/topology.c                   | 10 ++++++
 include/linux/acpi.h                      |  5 +++
 include/linux/arch_topology.h             |  5 +++
 include/linux/topology.h                  | 13 +++++++
 kernel/sched/fair.c                       | 35 ++++++++++++++++++
 11 files changed, 190 insertions(+), 4 deletions(-)
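For completeness, below is a rough user-space sketch of how the exposed
topology could be inspected once the series is applied. The attribute
name cluster_cpus_list is an assumption based on the cputopology.rst
change in this series and may not match the final sysfs layout:

#include <stdio.h>

/*
 * Minimal sketch: print the cluster siblings of each cpu using the
 * sysfs attributes added by this series. cluster_cpus_list is an
 * assumed name; adjust it to whatever the merged patches expose.
 */
int main(void)
{
	char path[128], buf[256];
	FILE *fp;

	for (int cpu = 0; ; cpu++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/cluster_cpus_list",
			 cpu);
		fp = fopen(path, "r");
		if (!fp)
			break;	/* no more cpus, or attribute missing */
		if (fgets(buf, sizeof(buf), fp))
			printf("cpu%d cluster siblings: %s", cpu, buf);
		fclose(fp);
	}
	return 0;
}

On Kunpeng 920 this should report cpu0-3 as one cluster, cpu4-7 as the
next, and so on, instead of the flat view the kernel gives today.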