[v7,0/2] sched/fair: Scan cluster before scanning LLC in wake-up path

Message ID: 20220822073610.27205-1-yangyicong@huawei.com

Yicong Yang Aug. 22, 2022, 7:36 a.m. UTC
From: Yicong Yang <yangyicong@hisilicon.com>

This is the follow-up work to support the cluster scheduler. Previously
we added a cluster level in the scheduler for both ARM64 [1] and
x86 [2] to support load balancing between clusters, bringing more memory
bandwidth and reducing cache contention. This patchset, in turn,
addresses the wake-up path by giving CPUs within the same cluster
a try before scanning the whole LLC, to benefit tasks that communicate
with each other.

[1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
[2] 66558b730f25 ("sched: Add cluster scheduler level for x86")
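
For readers skimming the archive, the rough shape of the fair.c change
described above is sketched below. This is a hand-written illustration
based only on the cover letter and changelog (wrap iteration, the
sched_cluster_active key), not the patch itself; the helper name and
the per-CPU sd_cluster pointer are assumptions made for the sketch.

/*
 * Sketch only: probe the CPUs sharing the wakeup target's cluster
 * (typically an L2 complex) before falling back to the rest of the
 * LLC.  "cpus" is the allowed-CPU mask already limited to the LLC.
 */
static int scan_cluster_then_llc(struct task_struct *p,
				 struct cpumask *cpus, int target)
{
	int cpu;

	if (static_branch_unlikely(&sched_cluster_active)) {
		struct sched_domain *sdc =
			rcu_dereference(per_cpu(sd_cluster, target));

		if (sdc) {
			/* Wrap iteration spreads wakees within the cluster. */
			for_each_cpu_wrap(cpu, sched_domain_span(sdc), target + 1) {
				if (!cpumask_test_cpu(cpu, cpus))
					continue;
				if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
					return cpu;
			}
			/* Don't rescan cluster CPUs in the LLC pass below. */
			cpumask_andnot(cpus, cpus, sched_domain_span(sdc));
		}
	}

	/* Fall back to the remaining CPUs sharing the LLC. */
	for_each_cpu_wrap(cpu, cpus, target + 1) {
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			return cpu;
	}

	return -1;
}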

Changes since v6:
- rebase on 6.0-rc1
Link: https://lore.kernel.org/lkml/20220726074758.46686-1-yangyicong@huawei.com/

Changes since v5:
- Improve patch 2 according to Peter's suggestion:
  - use sched_cluster_active to indicate whether cluster is active
  - consider SMT case and use wrap iteration when scanning cluster
- Add Vincent's tag
Thanks.
Link: https://lore.kernel.org/lkml/20220720081150.22167-1-yangyicong@hisilicon.com/

Changes since v4:
- rename cpus_share_resources to cpus_share_lowest_cache to be more informative, per Tim
- return -1 when nr==0 in scan_cluster(), per Abel
Thanks!
Link: https://lore.kernel.org/lkml/20220609120622.47724-1-yangyicong@hisilicon.com/

Changes since v3:
- fix a compile error when !CONFIG_SCHED_CLUSTER, reported by the lkp test robot.
Link: https://lore.kernel.org/lkml/20220608095758.60504-1-yangyicong@hisilicon.com/

Changes since v2:
- leverage SIS_PROP to suspend redundant scanning when the LLC is overloaded
- remove the ping-pong suppression
- address the comment from Tim, thanks.
Link: https://lore.kernel.org/lkml/20220126080947.4529-1-yangyicong@hisilicon.com/

Changes since v1:
- re-collect the performance data based on v5.17-rc1
- rename cpus_share_cluster to cpus_share_resources per Vincent and Gautham, thanks!
Link: https://lore.kernel.org/lkml/20211215041149.73171-1-yangyicong@hisilicon.com/


Barry Song (2):
  sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
  sched/fair: Scan cluster before scanning LLC in wake-up path

 include/linux/sched/sd_flags.h |  7 +++++++
 include/linux/sched/topology.h |  8 +++++++-
 kernel/sched/core.c            | 12 ++++++++++++
 kernel/sched/fair.c            | 30 +++++++++++++++++++++++++++---
 kernel/sched/sched.h           |  4 ++++
 kernel/sched/topology.c        | 25 +++++++++++++++++++++++++
 6 files changed, 82 insertions(+), 4 deletions(-)
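
Patch 1's API is small. Below is a sketch of what it plausibly looks
like, modelled on the existing cpus_share_cache() in
kernel/sched/core.c. The per-CPU id name and the !CONFIG_SCHED_CLUSTER
stub are assumptions for illustration (the v3 changelog above notes a
build fix for exactly that configuration); see the patch for the real
definitions.

#ifdef CONFIG_SCHED_CLUSTER
/*
 * Sketch: two CPUs share their "lowest" cache when they share the
 * cluster (e.g. an L2) if a cluster level exists.  The assumed
 * sd_lowest_cache_id would fall back to the LLC id otherwise.
 */
bool cpus_share_lowest_cache(int this_cpu, int that_cpu)
{
	if (this_cpu == that_cpu)
		return true;

	return per_cpu(sd_lowest_cache_id, this_cpu) ==
	       per_cpu(sd_lowest_cache_id, that_cpu);
}
#else
/* Without clusters, the lowest shared cache is simply the LLC. */
static inline bool cpus_share_lowest_cache(int this_cpu, int that_cpu)
{
	return cpus_share_cache(this_cpu, that_cpu);
}
#endif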

Comments

Yicong Yang Sept. 5, 2022, 12:37 p.m. UTC | #1
a friendly ping...

Thanks.

On 2022/8/22 15:36, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@hisilicon.com>
> 
> [..snip..]
>
K Prateek Nayak Sept. 6, 2022, 5:28 a.m. UTC | #2
Hello Yicong,

We've tested the series on a dual socket Zen3 system (2 x 64C/128T).

tl;dr

- The results look good and the changes do not affect the Zen3 machine,
  which doesn't contain any sched domain with the SD_CLUSTER flag set.

- With the latest BIOS, I don't see any regression due to the addition
  of the new per-CPU variables.
  We had observed a regression in tbench previously when testing v4 of
  the series on the system with a slightly outdated BIOS
  (https://lore.kernel.org/lkml/e000b124-afd4-28e1-fde2-393b0e38ce19@amd.com/),
  but that doesn't seem to be the case with the latest BIOS :)

Detailed results from the standard benchmarks are reported below.

On 8/22/2022 1:06 PM, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@hisilicon.com>
> 
> [..snip..]
>

Discussed below are the results from running standard benchmarks on
a dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
The following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over the 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over the 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255

Benchmark Results:

Kernel versions:
- tip:      5.19.0 tip sched/core
- cluster:  5.19.0 tip sched/core + both the patches of the series

When we started testing, the tip was at:
commit: 5531ecffa4b9 "sched: Add update_current_exec_runtime helper"

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

NPS1

Test:		      tip                    cluster
 1-groups:	   4.31 (0.00 pct)	   4.31 (0.00 pct)
 2-groups:	   4.93 (0.00 pct)	   4.86 (1.41 pct)
 4-groups:	   5.38 (0.00 pct)	   5.36 (0.37 pct)
 8-groups:	   5.59 (0.00 pct)	   5.54 (0.89 pct)
16-groups:	   7.18 (0.00 pct)	   7.47 (-4.03 pct)

NPS2

Test:		      tip                     cluster
 1-groups:	   4.25 (0.00 pct)	   4.40 (-3.52 pct)
 2-groups:	   4.83 (0.00 pct)	   4.73 (2.07 pct)
 4-groups:	   5.25 (0.00 pct)	   5.18 (1.33 pct)
 8-groups:	   5.56 (0.00 pct)	   5.45 (1.97 pct)
16-groups:	   6.72 (0.00 pct)	   6.63 (1.33 pct)

NPS4

Test:		      tip                     cluster
 1-groups:	   4.24 (0.00 pct)	   4.23 (0.23 pct)
 2-groups:	   4.88 (0.00 pct)	   4.78 (2.04 pct)
 4-groups:	   5.30 (0.00 pct)	   5.25 (0.94 pct)
 8-groups:	   5.66 (0.00 pct)	   5.61 (0.88 pct)
16-groups:	   6.79 (0.00 pct)	   7.05 (-3.82 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

NPS1

#workers:     tip                       cluster
  1:	  37.00 (0.00 pct)	     22.00 (40.54 pct)
  2:	  39.00 (0.00 pct)	     23.00 (41.02 pct)
  4:	  41.00 (0.00 pct)	     30.00 (26.82 pct)
  8:	  53.00 (0.00 pct)	     47.00 (11.32 pct)
 16:	  73.00 (0.00 pct)	     73.00 (0.00 pct)
 32:	 116.00 (0.00 pct)	    117.00 (-0.86 pct)
 64:	 217.00 (0.00 pct)	    221.00 (-1.84 pct)
128:	 477.00 (0.00 pct)	    444.00 (6.91 pct)
256:	1062.00 (0.00 pct)	   1050.00 (1.12 pct)
512:   47552.00 (0.00 pct)	  48576.00 (-2.15 pct)

NPS2

#workers:     tip                       cluster
  1:	  20.00 (0.00 pct)	     20.00 (0.00 pct)
  2:	  22.00 (0.00 pct)	     23.00 (-4.54 pct)
  4:	  30.00 (0.00 pct)	     31.00 (-3.33 pct)
  8:	  46.00 (0.00 pct)	     49.00 (-6.52 pct)
 16:	  70.00 (0.00 pct)	     72.00 (-2.85 pct)
 32:	 120.00 (0.00 pct)	    118.00 (1.66 pct)
 64:	 215.00 (0.00 pct)	    216.00 (-0.46 pct)
128:	 482.00 (0.00 pct)	    449.00 (6.84 pct)
256:	1042.00 (0.00 pct)	    995.00 (4.51 pct)
512:   47552.00 (0.00 pct)	  47296.00 (0.53 pct)

NPS4

#workers:     tip                       cluster
  1:	  18.00 (0.00 pct)	     20.00 (-11.11 pct)
  2:	  23.00 (0.00 pct)	     22.00 (4.34 pct)
  4:	  27.00 (0.00 pct)	     30.00 (-11.11 pct)
  8:	  57.00 (0.00 pct)	     60.00 (-5.26 pct)
 16:	  76.00 (0.00 pct)	     84.00 (-10.52 pct)
 32:	 120.00 (0.00 pct)	    115.00 (4.16 pct)
 64:	 219.00 (0.00 pct)	    212.00 (3.19 pct)
128:	 459.00 (0.00 pct)	    442.00 (3.70 pct)
256:	1078.00 (0.00 pct)	    983.00 (8.81 pct)
512:   47040.00 (0.00 pct)	  48192.00 (-2.44 pct)

Note: schbench displays a lot of run-to-run variance for
low worker counts. This behavior is due to the timing of
new-idle balance, which is not consistent across runs.

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

NPS1

Clients:      tip            	      cluster
    1	   573.26 (0.00 pct)	   572.61 (-0.11 pct)
    2	  1131.19 (0.00 pct)	  1122.41 (-0.77 pct)
    4	  2100.07 (0.00 pct)	  2081.74 (-0.87 pct)
    8	  3809.88 (0.00 pct)	  3732.14 (-2.04 pct)
   16	  6560.72 (0.00 pct)	  6289.22 (-4.13 pct)
   32	 12203.23 (0.00 pct)	 11811.74 (-3.20 pct)
   64	 22389.81 (0.00 pct)	 21587.79 (-3.58 pct)
  128	 32449.37 (0.00 pct)	 32967.15 (1.59 pct)
  256	 58962.40 (0.00 pct)	 56604.63 (-3.99 pct)
  512	 59608.71 (0.00 pct)	 56529.95 (-5.16 pct) * (Machine Overloaded)
  512	 57925.05 (0.00 pct)	 56697.38 (-2.11 pct) [Verification Run]
 1024	 58037.02 (0.00 pct)	 55751.53 (-3.93 pct)

NPS2

Clients:      tip                     cluster
    1	   574.20 (0.00 pct)	   572.49 (-0.29 pct)
    2	  1131.56 (0.00 pct)	  1149.53 (1.58 pct)
    4	  2132.26 (0.00 pct)	  2084.18 (-2.25 pct)
    8	  3812.20 (0.00 pct)	  3683.04 (-3.38 pct)
   16	  6457.61 (0.00 pct)	  6340.70 (-1.81 pct)
   32	 12263.82 (0.00 pct)	 11714.15 (-4.48 pct)
   64	 22224.11 (0.00 pct)	 21226.34 (-4.48 pct)
  128	 33040.38 (0.00 pct)	 32478.99 (-1.69 pct)
  256	 56547.25 (0.00 pct)	 52915.71 (-6.42 pct) * (Machine Overloaded)
  256    55631.80 (0.00 pct)     52905.99 (-4.89 pct) [Verification Run]
  512	 56220.67 (0.00 pct)	 54735.69 (-2.64 pct)
 1024	 56048.88 (0.00 pct)	 54426.63 (-2.89 pct)

NPS4

Clients:     tip                      cluster
    1	   575.50 (0.00 pct)	   570.65 (-0.84 pct)
    2	  1138.70 (0.00 pct)	  1137.75 (-0.08 pct)
    4	  2070.66 (0.00 pct)	  2103.18 (1.57 pct)
    8	  3811.70 (0.00 pct)	  3573.52 (-6.24 pct) *
    8	  3769.53 (0.00 pct)      3653.05 (-3.09 pct) [Verification Run]
   16	  6312.80 (0.00 pct)	  6212.41 (-1.59 pct)
   32	 11418.14 (0.00 pct)	 11721.01 (2.65 pct)
   64	 19671.16 (0.00 pct)	 20053.77 (1.94 pct)
  128	 30258.53 (0.00 pct)	 32585.15 (7.68 pct)
  256	 55838.10 (0.00 pct)	 51318.64 (-8.09 pct) * (Machine Overloaded)
  256	 54291.03 (0.00 pct)     54379.80 (0.16 pct)  [Verification Run]
  512	 55586.44 (0.00 pct)	 51538.93 (-7.28 pct) * (Machine Overloaded)
  512	 54190.04 (0.00 pct)     54096.16 (-0.17 pct) [Verification Run]
 1024	 56370.35 (0.00 pct)	 50768.68 (-9.93 pct) * (Machine Overloaded)
 1024    56498.36 (0.00 pct)     54661.85 (-3.25 pct) [Verification Run]

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

NPS1

- 10 Runs:

Test:	      tip                  cluster
 Copy:	 332237.51 (0.00 pct)	 338085.24 (1.76 pct)
Scale:	 215236.94 (0.00 pct)	 214179.72 (-0.49 pct)
  Add:	 250753.67 (0.00 pct)	 251181.86 (0.17 pct)
Triad:	 259467.60 (0.00 pct)	 262541.92 (1.18 pct)

- 100 Runs:

Test:	      tip                  cluster
 Copy:	 329320.65 (0.00 pct)	 336947.39 (2.31 pct)
Scale:	 218102.78 (0.00 pct)	 219617.85 (0.69 pct)
  Add:	 251283.30 (0.00 pct)	 251918.03 (0.25 pct)
Triad:	 258044.33 (0.00 pct)	 261512.99 (1.34 pct)

NPS2

- 10 Runs:

Test:	      tip                  cluster
 Copy:	 336926.24 (0.00 pct)	 324310.01 (-3.74 pct)
Scale:	 220120.41 (0.00 pct)	 212795.43 (-3.32 pct)
  Add:	 252428.34 (0.00 pct)	 254355.80 (0.76 pct)
Triad:	 274268.23 (0.00 pct)	 261777.03 (-4.55 pct)

- 100 Runs:

Test:	      tip                  cluster
 Copy:   338126.49 (0.00 pct)    338947.03 (0.24 pct)
Scale:   230229.59 (0.00 pct)    229991.65 (-0.10 pct)
  Add:   253964.25 (0.00 pct)    264374.57 (4.09 pct)
Triad:   272176.19 (0.00 pct)    274587.35 (0.88 pct)

NPS4

- 10 Runs:

Test:	      tip                  cluster
 Copy:   367144.56 (0.00 pct)    375452.26 (2.26 pct)
Scale:   246928.04 (0.00 pct)    243651.53 (-1.32 pct)
  Add:   272096.30 (0.00 pct)    272845.33 (0.27 pct)
Triad:   286644.55 (0.00 pct)    290925.20 (1.49 pct)

- 100 Runs:

Test:	      tip                  cluster
 Copy:	 351980.15 (0.00 pct)	 375854.72 (6.78 pct)
Scale:	 254918.41 (0.00 pct)	 255904.90 (0.38 pct)
  Add:	 272722.89 (0.00 pct)	 274075.11 (0.49 pct)
Triad:   283340.94 (0.00 pct)	 287608.77 (1.50 pct)

~~~~~~~~~~~~~~~~~~~~
~ Additional notes ~
~~~~~~~~~~~~~~~~~~~~

- schbench is known to have noticeable run-to-run variation for lower
  worker counts, so any improvements or regressions observed there can
  be safely ignored. The results are included to make sure there are
  no unusually large regressions as a result of task pileup.

- tbench shows slight run-to-run variation with larger numbers of
  clients on both the tip and patched kernels. This is expected, as the
  machine is overloaded at that point (the equivalent of two or more
  tasks per CPU). The "Verification Run" rows show none of these
  regressions are persistent.

>
> [..snip..]
> 

Overall, the changes look good and do not affect systems without an
SD_CLUSTER domain, like the Zen3 system used during testing.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

--
Thanks and Regards,
Prateek
Yicong Yang Sept. 6, 2022, 8:46 a.m. UTC | #3
On 2022/9/6 13:28, K Prateek Nayak wrote:
> Hello Yicong,
> 
> We've tested the series on a dual socket Zen3 system (2 x 64C/128T).
> 
> [..snip..]
> 
> Overall, the changes look good and do not affect systems without an
> SD_CLUSTER domain, like the Zen3 system used during testing.
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 
Thanks a lot for the testing and verification on the Zen3 system.

Regards,
Yicong
Barry Song Sept. 7, 2022, 11:52 a.m. UTC | #4
> From: Yicong Yang <yangyicong@hisilicon.com>
> 
> [..snip..]
> 
> Barry Song (2):
>   sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
>   sched/fair: Scan cluster before scanning LLC in wake-up path

Hi Peter,
I believe this one is ready. It has been widely reviewed and
tested on platforms w/ and w/o clusters.
Could you please pick it up?

Thanks
Barry