Message ID: 20210301225940.16728-4-song.bao.hua@hisilicon.com
State:      New, archived
Series:     scheduler: expose the topology of clusters and add cluster scheduler
On Tue, Mar 02, 2021 at 11:59:40AM +1300, Barry Song wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
>
> There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
> is shared among a cluster of cores instead of being exclusive
> to one single core.

Isn't that most Atoms one way or another? Tremont seems to have it per 4
cores, but earlier it was per 2 cores.
On 3/2/21 2:30 AM, Peter Zijlstra wrote:
> On Tue, Mar 02, 2021 at 11:59:40AM +1300, Barry Song wrote:
>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>
>> There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
>> is shared among a cluster of cores instead of being exclusive
>> to one single core.
>
> Isn't that most Atoms one way or another? Tremont seems to have it per 4
> cores, but earlier it was per 2 cores.

Yes, older Atoms have 2 cores sharing L2. I probably should rephrase my
comments so they don't leave the impression that sharing L2 among cores is
new for Atoms.

Tremont-based Atom CPUs increase the possible load imbalance further, with
4 cores per L2 instead of 2. And with more overall cores on a die, the
chance increases of packing running tasks onto a few clusters while leaving
others empty on lightly and moderately loaded systems. We did see this
effect on Jacobsville.

So load balancing between the L2 clusters is more useful on Tremont-based
Atom CPUs compared to the older Atoms.

Tim
> -----Original Message-----
> From: Tim Chen [mailto:tim.c.chen@linux.intel.com]
> Sent: Thursday, March 4, 2021 7:34 AM
> To: Peter Zijlstra <peterz@infradead.org>; Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com>
> Cc: catalin.marinas@arm.com; will@kernel.org; rjw@rjwysocki.net;
> vincent.guittot@linaro.org; bp@alien8.de; tglx@linutronix.de;
> mingo@redhat.com; lenb@kernel.org; dietmar.eggemann@arm.com;
> rostedt@goodmis.org; bsegall@google.com; mgorman@suse.de;
> msys.mizuma@gmail.com; valentin.schneider@arm.com;
> gregkh@linuxfoundation.org; Jonathan Cameron <jonathan.cameron@huawei.com>;
> juri.lelli@redhat.com; mark.rutland@arm.com; sudeep.holla@arm.com;
> aubrey.li@linux.intel.com; linux-arm-kernel@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-acpi@vger.kernel.org; x86@kernel.org;
> xuwei (O) <xuwei5@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> guodong.xu@linaro.org; yangyicong <yangyicong@huawei.com>; Liguozhu (Kenneth)
> <liguozhu@hisilicon.com>; linuxarm@openeuler.org; hpa@zytor.com
> Subject: [Linuxarm] Re: [RFC PATCH v4 3/3] scheduler: Add cluster scheduler
> level for x86
>
> On 3/2/21 2:30 AM, Peter Zijlstra wrote:
> > On Tue, Mar 02, 2021 at 11:59:40AM +1300, Barry Song wrote:
> >> From: Tim Chen <tim.c.chen@linux.intel.com>
> >>
> >> There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
> >> is shared among a cluster of cores instead of being exclusive
> >> to one single core.
> >
> > Isn't that most Atoms one way or another? Tremont seems to have it per 4
> > cores, but earlier it was per 2 cores.
> >
> Yes, older Atoms have 2 cores sharing L2. I probably should rephrase my
> comments so they don't leave the impression that sharing L2 among cores is
> new for Atoms.
>
> Tremont-based Atom CPUs increase the possible load imbalance further, with
> 4 cores per L2 instead of 2. And with more overall cores on a die, the
> chance increases of packing running tasks onto a few clusters while leaving
> others empty on lightly and moderately loaded systems. We did see this
> effect on Jacobsville.
>
> So load balancing between the L2 clusters is more useful on Tremont-based
> Atom CPUs compared to the older Atoms.

It seems sensible that the more CPUs we get in a cluster, the more we need
the kernel to be aware of its existence.

Tim, would it be possible for you to bring up cpu_cluster_mask and
cluster_sibling for x86 so that the topology can be represented in sysfs
and used by the scheduler? It seems your patch lacks this part.

BTW, I wonder if x86 could also improve your KMP_AFFINITY by leveraging the
cluster topology level:
https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html

KMP_AFFINITY has thread affinity modes like compact and scatter; it seems
"compact" and "scatter" could also use the cluster information, as we are
struggling with the same "compact" vs. "scatter" trade-off in this
patchset :-)

Thanks
Barry
> It seems sensible that the more CPUs we get in a cluster, the more we
> need the kernel to be aware of its existence.
>
> Tim, would it be possible for you to bring up cpu_cluster_mask and
> cluster_sibling for x86 so that the topology can be represented in sysfs
> and used by the scheduler? It seems your patch lacks this part.

You mean having something in /sys/devices/system/cpu/cpu0/topology that
describes the cluster, so that an external program can affinitize to a
cluster if it prefers to do so?

Tim

> BTW, I wonder if x86 could also improve your KMP_AFFINITY by leveraging the
> cluster topology level:
> https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html
>
> KMP_AFFINITY has thread affinity modes like compact and scatter; it seems
> "compact" and "scatter" could also use the cluster information, as we are
> struggling with the same "compact" vs. "scatter" trade-off in this
> patchset :-)
>
> Thanks
> Barry
>
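[Editor's illustration] As a sketch of what such a sysfs interface would enable, the
userspace program below reads cpu0's cluster siblings and pins the calling process to
them with sched_setaffinity(). The attribute name "cluster_cpus_list" and its
"first-last" range format are assumptions for illustration only; this patch does not
define the sysfs file.

/*
 * Illustrative only: pin the current process to cpu0's L2 cluster.
 * The sysfs attribute name and "first-last" format are hypothetical.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *path = "/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list";
	char buf[64];
	int first, last;
	cpu_set_t set;
	FILE *f = fopen(path, "r");

	if (!f || !fgets(buf, sizeof(buf), f)) {
		perror("read cluster topology");
		return 1;
	}
	fclose(f);

	/* Accept either a range ("0-3") or a single CPU ("0"). */
	if (sscanf(buf, "%d-%d", &first, &last) != 2)
		first = last = atoi(buf);

	CPU_ZERO(&set);
	for (int cpu = first; cpu <= last; cpu++)
		CPU_SET(cpu, &set);

	/* Affinitize this process to the CPUs sharing cpu0's L2. */
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}
	printf("affinitized to CPUs %d-%d\n", first, last);
	return 0;
}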
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d3338a8..40110de 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1009,6 +1009,14 @@ config NR_CPUS
 	  This is purely to save memory: each supported CPU adds about 8KB
 	  to the kernel image.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	default n
+	help
+	  Cluster scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have clusters of CPUs
+	  sharing L2 cache. If unsure say N here.
+
 config SCHED_SMT
 	def_bool y if SMP
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index c0538f8..9cbc4ae 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -16,7 +16,9 @@
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
+DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
 
 static inline struct cpumask *cpu_llc_shared_mask(int cpu)
@@ -24,6 +26,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
 	return per_cpu(cpu_llc_shared_map, cpu);
 }
 
+static inline struct cpumask *cpu_l2c_shared_mask(int cpu)
+{
+	return per_cpu(cpu_l2c_shared_map, cpu);
+}
+
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_acpiid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9239399..2a11ccc 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -103,6 +103,7 @@ static inline void setup_node_to_cpumask_map(void) { }
 #include <asm-generic/topology.h>
 
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
+extern const struct cpumask *cpu_clustergroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)	(cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index 3ca9be4..0d03a71 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -846,6 +846,7 @@ void init_intel_cacheinfo(struct cpuinfo_x86 *c)
 		l2 = new_l2;
 #ifdef CONFIG_SMP
 		per_cpu(cpu_llc_id, cpu) = l2_id;
+		per_cpu(cpu_l2c_id, cpu) = l2_id;
 #endif
 	}
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 35ad848..fb08c73 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -78,6 +78,9 @@
 /* Last level cache ID of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;
 
+/* L2 cache ID of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id) = BAD_APICID;
+
 /* correctly size the local cpu masks */
 void __init setup_cpu_local_masks(void)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 02813a7..c85ffa8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,8 @@
 
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
+
 /* Per CPU bogomips and other parameters */
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
@@ -501,6 +503,21 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	return topology_sane(c, o, "llc");
 }
 
+static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+	/* Do not match if we do not have a valid APICID for cpu: */
+	if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
+		return false;
+
+	/* Do not match if L2 cache id does not match: */
+	if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))
+		return false;
+
+	return topology_sane(c, o, "l2c");
+}
+
 /*
  * Unlike the other levels, we do not enforce keeping a
  * multicore group inside a NUMA node.  If this happens, we will
@@ -522,7 +539,7 @@ static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 }
 
 
-#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC)
+#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_CLUSTER) || defined(CONFIG_SCHED_MC)
 static inline int x86_sched_itmt_flags(void)
 {
 	return sysctl_sched_itmt_enabled ? SD_ASYM_PACKING : 0;
@@ -540,12 +557,21 @@ static int x86_smt_flags(void)
 	return cpu_smt_flags() | x86_sched_itmt_flags();
 }
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+static int x86_cluster_flags(void)
+{
+	return cpu_cluster_flags() | x86_sched_itmt_flags();
+}
+#endif
 #endif
 
 static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
@@ -556,6 +582,9 @@ static int x86_smt_flags(void)
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
@@ -583,6 +612,7 @@ void set_cpu_sibling_map(int cpu)
 	if (!has_mp) {
 		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
+		cpumask_set_cpu(cpu, cpu_l2c_shared_mask(cpu));
 		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
 		cpumask_set_cpu(cpu, topology_die_cpumask(cpu));
 		c->booted_cores = 1;
@@ -598,6 +628,8 @@ void set_cpu_sibling_map(int cpu)
 		if ((i == cpu) || (has_mp && match_llc(c, o)))
 			link_mask(cpu_llc_shared_mask, cpu, i);
 
+		if ((i == cpu) || (has_mp && match_l2c(c, o)))
+			link_mask(cpu_l2c_shared_mask, cpu, i);
 	}
 
 	/*
@@ -649,6 +681,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return cpu_llc_shared_mask(cpu);
 }
 
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return cpu_l2c_shared_mask(cpu);
+}
+
 static void impress_friends(void)
 {
 	int cpu;
@@ -1332,6 +1369,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 		zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_die_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+		zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
 	}
 
 	/*
@@ -1556,7 +1594,10 @@ static void remove_siblinginfo(int cpu)
 		cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
 	for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+	for_each_cpu(sibling, cpu_l2c_shared_mask(cpu))
+		cpumask_clear_cpu(cpu, cpu_l2c_shared_mask(sibling));
 	cpumask_clear(cpu_llc_shared_mask(cpu));
+	cpumask_clear(cpu_l2c_shared_mask(cpu));
 	cpumask_clear(topology_sibling_cpumask(cpu));
 	cpumask_clear(topology_core_cpumask(cpu));
 	cpumask_clear(topology_die_cpumask(cpu));
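[Editor's illustration] To make the cluster sibling-map construction above easier to
follow, here is a small self-contained userspace sketch, not kernel code: the CPU-to-L2-id
table is made up, but the grouping rule mirrors match_l2c() above, i.e. CPUs with a valid,
identical L2 cache id land in the same cluster mask.

#include <stdio.h>
#include <stdint.h>

#define BAD_APICID	0xFFFFU
#define NR_CPUS		8

/*
 * Hypothetical per-CPU L2 cache ids, standing in for per_cpu(cpu_l2c_id, cpu):
 * two 4-core clusters, each sharing one L2, as on a Tremont-style part.
 */
static const uint16_t cpu_l2c_id[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Same test as match_l2c(): a valid and equal L2 id means same cluster. */
static int match_l2c(int cpu1, int cpu2)
{
	if (cpu_l2c_id[cpu1] == BAD_APICID)
		return 0;
	return cpu_l2c_id[cpu1] == cpu_l2c_id[cpu2];
}

int main(void)
{
	/* Build the equivalent of cpu_l2c_shared_mask for each CPU. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		printf("cpu%d cluster siblings:", cpu);
		for (int i = 0; i < NR_CPUS; i++)
			if (i == cpu || match_l2c(cpu, i))
				printf(" %d", i);
		printf("\n");
	}
	return 0;
}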