Message ID | 20220525081416.3306043-16-sudeep.holla@arm.com (mailing list archive) |
---|---|
State | New, archived |
Series | arch_topology: Updates to add socket support and fix cluster ids |
On 25/05/2022 10:14, Sudeep Holla wrote: > Let us set the cluster identifier as parsed from the device tree > cluster nodes within /cpu-map. > > We don't support nesting of clusters yet as there are no real hardware > to support clusters of clusters. > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> > --- > drivers/base/arch_topology.c | 13 ++++++++----- > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c > index b8f0d72908c8..5f4f148a7769 100644 > --- a/drivers/base/arch_topology.c > +++ b/drivers/base/arch_topology.c > @@ -492,7 +492,7 @@ static int __init get_cpu_for_node(struct device_node *node) > } > > static int __init parse_core(struct device_node *core, int package_id, > - int core_id) > + int cluster_id, int core_id) > { > char name[20]; > bool leaf = true; > @@ -508,6 +508,7 @@ static int __init parse_core(struct device_node *core, int package_id, > cpu = get_cpu_for_node(t); > if (cpu >= 0) { > cpu_topology[cpu].package_id = package_id; > + cpu_topology[cpu].cluster_id = cluster_id; > cpu_topology[cpu].core_id = core_id; > cpu_topology[cpu].thread_id = i; > } else if (cpu != -ENODEV) { > @@ -529,6 +530,7 @@ static int __init parse_core(struct device_node *core, int package_id, > } > > cpu_topology[cpu].package_id = package_id; > + cpu_topology[cpu].cluster_id = cluster_id; I'm still not convinced that this is the right thing to do. Let's take the juno board as an example here. And I guess our assumption should be that we want to make CONFIG_SCHED_CLUSTER a default option, like CONFIG_SCHED_MC is. Simply to avoid a unmanageable zoo of config-option combinations. (1) Scheduler Domains (SDs) w/o CONFIG_SCHED_CLUSTER: MC <-- !!! DIE (2) SDs w/ CONFIG_SCHED_CLUSTER: CLS <-- !!! DIE In (2) MC gets degenerated in sd_parent_degenerate() since CLS and MC cpumasks are equal and MC does not have any additional flags compared to CLS. I'm not convinced that we can change the degeneration rules without destroying other scenarios of the scheduler so that here MC stays and CLS gets removed instead. Even though MC and CLS are doing the same right now from the perspective of the scheduler, we should also see MC and not CLS under (2). CLS only makes sense longer term if the scheduler also makes use of it (next to MC) in the wakeup-path for instance. Especially when this happens, a platform should always construct the same scheduler domain hierarchy, no matter which CONFIG_SCHED_XXX options are enabled. You can see this in update_siblings_masks() if (last_level_cache_is_shared) set llc_sibling if (cpuid_topo->package_id != cpu_topo->package_id) continue set core_sibling If llc cache and socket boundaries are congruent, llc_sibling and core_sibling are the same. if (cpuid_topo->cluster_id != cpu_topo->cluster_id) continue set cluster_sibling Now we potentially set clusters. Since socket=0 is by default and we use the existing juno.dts, the cluster nodes end up being congruent to the llc cache cpumasks as well. The problem is that we code `llc cache` and `DT cluster nodes` as the same thing in juno.dts. `Cluster0/1` are congruent with the llc information, although they should be actually `socket0/1` right now. But we can't set-up a cpu-map with a `socketX` containing `coreY` directly. And then we use llc_sibling and cluster_sibling in two different SD cpumask functions (cpu_coregroup_mask() and cpu_clustergroup_mask()). 
Remember, CONFIG_SCHED_CLUSTER was introduced in ACPI/PPTT as a cpumask which is a subset of the cpumasks of CONFIG_SCHED_MC.

---

IMHO we probably could just introduce your changes w/o setting `cpu-map cluster nodes` in DT for now. We would just have to make sure that for all `*.dts` affected, the `llc cache` info can take over the old role of the `cluster nodes`. In this case e.g. Juno ends up with MC, DIE no matter if CONFIG_SCHED_CLUSTER is set or not.

[...]
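For readers following the pseudocode above, here is a rough C sketch of that sibling-mask flow (paraphrased from the description in this mail rather than copied from a particular kernel tree; llc_is_shared() is a stand-in name for the real last-level-cache check, and thread siblings are omitted):

/* Sketch only: how package_id/cluster_id feed the sibling cpumasks. */
static void update_siblings_masks_sketch(unsigned int cpuid)
{
	struct cpu_topology *cpuid_topo = &cpu_topology[cpuid];
	struct cpu_topology *cpu_topo;
	int cpu;

	for_each_online_cpu(cpu) {
		cpu_topo = &cpu_topology[cpu];

		/* last level cache is shared -> llc_sibling */
		if (llc_is_shared(cpu, cpuid)) {
			cpumask_set_cpu(cpu, &cpuid_topo->llc_sibling);
			cpumask_set_cpu(cpuid, &cpu_topo->llc_sibling);
		}

		/* different package -> stop here */
		if (cpuid_topo->package_id != cpu_topo->package_id)
			continue;

		/* same package -> core_sibling */
		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

		/* different cluster -> stop here */
		if (cpuid_topo->cluster_id != cpu_topo->cluster_id)
			continue;

		/* same (valid) cluster -> cluster_sibling */
		if (cpuid_topo->cluster_id >= 0) {
			cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
			cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
		}
	}
}

With this flow, if the DT cluster nodes coincide with the LLC (as on Juno), cluster_sibling, llc_sibling and core_sibling all end up describing the same span, which is exactly what triggers the CLS/MC degeneration discussed above.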
On Fri, Jun 03, 2022 at 02:30:04PM +0200, Dietmar Eggemann wrote: > On 25/05/2022 10:14, Sudeep Holla wrote: > > Let us set the cluster identifier as parsed from the device tree > > cluster nodes within /cpu-map. > > > > We don't support nesting of clusters yet as there are no real hardware > > to support clusters of clusters. > > > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> > > --- > > drivers/base/arch_topology.c | 13 ++++++++----- > > 1 file changed, 8 insertions(+), 5 deletions(-) > > > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c > > index b8f0d72908c8..5f4f148a7769 100644 > > --- a/drivers/base/arch_topology.c > > +++ b/drivers/base/arch_topology.c > > @@ -492,7 +492,7 @@ static int __init get_cpu_for_node(struct device_node *node) > > } > > > > static int __init parse_core(struct device_node *core, int package_id, > > - int core_id) > > + int cluster_id, int core_id) > > { > > char name[20]; > > bool leaf = true; > > @@ -508,6 +508,7 @@ static int __init parse_core(struct device_node *core, int package_id, > > cpu = get_cpu_for_node(t); > > if (cpu >= 0) { > > cpu_topology[cpu].package_id = package_id; > > + cpu_topology[cpu].cluster_id = cluster_id; > > cpu_topology[cpu].core_id = core_id; > > cpu_topology[cpu].thread_id = i; > > } else if (cpu != -ENODEV) { > > @@ -529,6 +530,7 @@ static int __init parse_core(struct device_node *core, int package_id, > > } > > > > cpu_topology[cpu].package_id = package_id; > > + cpu_topology[cpu].cluster_id = cluster_id; > > I'm still not convinced that this is the right thing to do. Let's take > the juno board as an example here. And I guess our assumption should be > that we want to make CONFIG_SCHED_CLUSTER a default option, like > CONFIG_SCHED_MC is. Simply to avoid a unmanageable zoo of config-option > combinations. > Agreed on the config part. > (1) Scheduler Domains (SDs) w/o CONFIG_SCHED_CLUSTER: > > MC <-- !!! > DIE > > (2) SDs w/ CONFIG_SCHED_CLUSTER: > > CLS <-- !!! > DIE > Yes I have seen this. > In (2) MC gets degenerated in sd_parent_degenerate() since CLS and MC > cpumasks are equal and MC does not have any additional flags compared to > CLS. > I'm not convinced that we can change the degeneration rules without > destroying other scenarios of the scheduler so that here MC stays and > CLS gets removed instead. > Why ? Are you suggesting that we shouldn't present the hardware cluster to the topology because of the above reason ? If so, sorry that is not a valid reason. We could add login to return NULL or appropriate value needed in cpu_clustergroup_mask id it matches MC level mask if we can't deal that in generic scheduler code. But the topology code can't be compromised for that reason as it is user visible. > Even though MC and CLS are doing the same right now from the perspective > of the scheduler, we should also see MC and not CLS under (2). CLS only > makes sense longer term if the scheduler also makes use of it (next to > MC) in the wakeup-path for instance. Especially when this happens, a > platform should always construct the same scheduler domain hierarchy, no > matter which CONFIG_SCHED_XXX options are enabled. > > > You can see this in update_siblings_masks() > > if (last_level_cache_is_shared) > set llc_sibling > > if (cpuid_topo->package_id != cpu_topo->package_id) > continue > > set core_sibling > > If llc cache and socket boundaries are congruent, llc_sibling and > core_sibling are the same. 
> > if (cpuid_topo->cluster_id != cpu_topo->cluster_id) > continue > > set cluster_sibling > > Now we potentially set clusters. Since socket=0 is by default and we > use the existing juno.dts, the cluster nodes end up being congruent to > the llc cache cpumasks as well. > Correct and I see no problems as it matches what the hardware is. So I am not expecting any change in any cpumasks there as they all are aligned with the hardware. > The problem is that we code `llc cache` and `DT cluster nodes` as the > same thing in juno.dts. Why is that a problem ? If so, blame hardware and deal with it as we have to
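A minimal sketch of the fallback hinted at above ("return NULL or appropriate value ... if it matches MC level mask"), assuming cluster_sibling is already populated; this is only an illustration of the idea, not part of the series under review:

/* Illustration: collapse CLS when it would not be a strict subset of MC. */
const struct cpumask *cpu_clustergroup_mask(int cpu)
{
	/* Cluster span covers the whole coregroup -> nothing to gain. */
	if (cpumask_subset(cpu_coregroup_mask(cpu),
			   &cpu_topology[cpu].cluster_sibling))
		return get_cpu_mask(cpu);

	return &cpu_topology[cpu].cluster_sibling;
}

Such a check would keep the user-visible cluster masks intact while preventing the scheduler from ever seeing a CLS level identical to MC.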
On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote: > > On Fri, Jun 03, 2022 at 02:30:04PM +0200, Dietmar Eggemann wrote: > > On 25/05/2022 10:14, Sudeep Holla wrote: > > > Let us set the cluster identifier as parsed from the device tree > > > cluster nodes within /cpu-map. > > > > > > We don't support nesting of clusters yet as there are no real hardware > > > to support clusters of clusters. > > > > > > Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> > > > --- > > > drivers/base/arch_topology.c | 13 ++++++++----- > > > 1 file changed, 8 insertions(+), 5 deletions(-) > > > > > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c > > > index b8f0d72908c8..5f4f148a7769 100644 > > > --- a/drivers/base/arch_topology.c > > > +++ b/drivers/base/arch_topology.c > > > @@ -492,7 +492,7 @@ static int __init get_cpu_for_node(struct device_node *node) > > > } > > > > > > static int __init parse_core(struct device_node *core, int package_id, > > > - int core_id) > > > + int cluster_id, int core_id) > > > { > > > char name[20]; > > > bool leaf = true; > > > @@ -508,6 +508,7 @@ static int __init parse_core(struct device_node *core, int package_id, > > > cpu = get_cpu_for_node(t); > > > if (cpu >= 0) { > > > cpu_topology[cpu].package_id = package_id; > > > + cpu_topology[cpu].cluster_id = cluster_id; > > > cpu_topology[cpu].core_id = core_id; > > > cpu_topology[cpu].thread_id = i; > > > } else if (cpu != -ENODEV) { > > > @@ -529,6 +530,7 @@ static int __init parse_core(struct device_node *core, int package_id, > > > } > > > > > > cpu_topology[cpu].package_id = package_id; > > > + cpu_topology[cpu].cluster_id = cluster_id; > > > > I'm still not convinced that this is the right thing to do. Let's take > > the juno board as an example here. And I guess our assumption should be > > that we want to make CONFIG_SCHED_CLUSTER a default option, like > > CONFIG_SCHED_MC is. Simply to avoid a unmanageable zoo of config-option > > combinations. > > > > Agreed on the config part. > > > (1) Scheduler Domains (SDs) w/o CONFIG_SCHED_CLUSTER: > > > > MC <-- !!! > > DIE > > > > (2) SDs w/ CONFIG_SCHED_CLUSTER: > > > > CLS <-- !!! > > DIE > > > > Yes I have seen this. > > > In (2) MC gets degenerated in sd_parent_degenerate() since CLS and MC > > cpumasks are equal and MC does not have any additional flags compared to > > CLS. > > I'm not convinced that we can change the degeneration rules without > > destroying other scenarios of the scheduler so that here MC stays and > > CLS gets removed instead. > > > > Why ? Are you suggesting that we shouldn't present the hardware cluster > to the topology because of the above reason ? If so, sorry that is not a > valid reason. We could add login to return NULL or appropriate value > needed in cpu_clustergroup_mask id it matches MC level mask if we can't > deal that in generic scheduler code. But the topology code can't be > compromised for that reason as it is user visible. I tend to agree with Dietmar. The legacy use of cluster node in DT refers to the dynamiQ or legacy b.L cluster which is also aligned to the LLC and the MC scheduling level. The new cluster level that has been introduced recently does not target this level but some intermediate levels either inside like for the kupeng920 or the v9 complex or outside like for the ampere altra. So I would say that there is one cluster node level in DT that refers to the same MC/LLC level and only an additional child/parent cluster node should be used to fill the clustergroup_mask. 
IIUC, we don't describe the dynamiQ level in ACPI which uses cache topology instead to define cpu_coregroup_mask whereas DT described the dynamiQ instead of using cache topology. If you use cache topology now, then you should skip the dynamiQ Finally, even if CLS and MC have the same scheduling behavior for now, they might ends up with different scheduling properties which would mean that replacing MC level by CLS one for current SoC would become wrong > > > Even though MC and CLS are doing the same right now from the perspective > > of the scheduler, we should also see MC and not CLS under (2). CLS only > > makes sense longer term if the scheduler also makes use of it (next to > > MC) in the wakeup-path for instance. Especially when this happens, a > > platform should always construct the same scheduler domain hierarchy, no > > matter which CONFIG_SCHED_XXX options are enabled. > > > > > > You can see this in update_siblings_masks() > > > > if (last_level_cache_is_shared) > > set llc_sibling > > > > if (cpuid_topo->package_id != cpu_topo->package_id) > > continue > > > > set core_sibling > > > > If llc cache and socket boundaries are congruent, llc_sibling and > > core_sibling are the same. > > > > if (cpuid_topo->cluster_id != cpu_topo->cluster_id) > > continue > > > > set cluster_sibling > > > > Now we potentially set clusters. Since socket=0 is by default and we > > use the existing juno.dts, the cluster nodes end up being congruent to > > the llc cache cpumasks as well. > > > > Correct and I see no problems as it matches what the hardware is. So I am > not expecting any change in any cpumasks there as they all are aligned with > the hardware. > > > The problem is that we code `llc cache` and `DT cluster nodes` as the > > same thing in juno.dts. > > Why is that a problem ? If so, blame hardware and deal with it as we have to >
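For context on the cpu_coregroup_mask() point above, its rough shape at the time (paraphrased from memory of drivers/base/arch_topology.c, so treat the details as approximate) is: start from the NUMA node span and narrow it to the package and, where cacheinfo provides one, to the LLC siblings. That is why an ACPI system ends up with an LLC-based MC level even though no dynamiQ level is described anywhere:

/* Approximate shape of cpu_coregroup_mask(), for illustration only. */
const struct cpumask *cpu_coregroup_mask(int cpu)
{
	const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));

	/* Package span, if it fits inside the NUMA node. */
	if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask))
		core_mask = &cpu_topology[cpu].core_sibling;

	/* LLC span from cacheinfo, if known and smaller still. */
	if (cpu_topology[cpu].llc_id != -1 &&
	    cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
		core_mask = &cpu_topology[cpu].llc_sibling;

	return core_mask;
}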
On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote: > On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote: > > [...] > > Why ? Are you suggesting that we shouldn't present the hardware cluster > > to the topology because of the above reason ? If so, sorry that is not a > > valid reason. We could add login to return NULL or appropriate value > > needed in cpu_clustergroup_mask id it matches MC level mask if we can't > > deal that in generic scheduler code. But the topology code can't be > > compromised for that reason as it is user visible. > > I tend to agree with Dietmar. The legacy use of cluster node in DT > refers to the dynamiQ or legacy b.L cluster which is also aligned to > the LLC and the MC scheduling level. The new cluster level that has > been introduced recently does not target this level but some > intermediate levels either inside like for the kupeng920 or the v9 > complex or outside like for the ampere altra. So I would say that > there is one cluster node level in DT that refers to the same MC/LLC > level and only an additional child/parent cluster node should be used > to fill the clustergroup_mask. > Again I completely disagree. Let us look at the problems separately. The hardware topology that some of the tools like lscpu and lstopo expects what the hardware looks like and not the scheduler's view of the hardware. So the topology masks that gets exposed to the user-space needs fixing even today. I have reports from various tooling people about the same. E.g. Juno getting exposed as dual socket system is utter non-sense. Yes scheduler uses most of the topology masks as is but that is not a must. There are these *group_mask functions that can implement what scheduler needs to be fed. I am not sure why the 2 issues are getting mixed up and that is the main reason why I jumped into this to make sure the topology masks are not tampered based on the way it needs to be used for scheduler. Both ACPI and DT on a platform must present exact same hardware topology to the user-space, there is no space for argument there. > IIUC, we don't describe the dynamiQ level in ACPI which uses cache > topology instead to define cpu_coregroup_mask whereas DT described the > dynamiQ instead of using cache topology. If you use cache topology > now, then you should skip the dynamiQ > Yes, unless someone can work out a binding to represent that and convince DT maintainers ;). > Finally, even if CLS and MC have the same scheduling behavior for now, > they might ends up with different scheduling properties which would > mean that replacing MC level by CLS one for current SoC would become > wrong > Again as I mentioned to Dietmar, that is something we can and must deal with in those *group_mask and not expect topology mask to be altered to meet CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion as the topology is already user-space visible and (tooling) people are complaining that DT systems are broken and doesn't match ACPI systems. So unless someone gives me non-scheduler and topology specific reasons to change that, sorry but my opinion on this matter is not going to change ;). You will get this view of topology, find a way to manage with all those *group_mask functions. By the way it is already handled for ACPI systems, so if you are not happy with that, then that needs fixing as this change set just aligns the behaviour on similar ACPI system. 
So the Juno example is not valid, because the scheduler behaviour there already differs between DT and ACPI.

--
Regards,
Sudeep
On 10/06/2022 12:27, Sudeep Holla wrote: > On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote: >> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote: >>> > > [...] > >>> Why ? Are you suggesting that we shouldn't present the hardware cluster >>> to the topology because of the above reason ? If so, sorry that is not a >>> valid reason. We could add login to return NULL or appropriate value >>> needed in cpu_clustergroup_mask id it matches MC level mask if we can't >>> deal that in generic scheduler code. But the topology code can't be >>> compromised for that reason as it is user visible. >> >> I tend to agree with Dietmar. The legacy use of cluster node in DT >> refers to the dynamiQ or legacy b.L cluster which is also aligned to >> the LLC and the MC scheduling level. The new cluster level that has >> been introduced recently does not target this level but some >> intermediate levels either inside like for the kupeng920 or the v9 >> complex or outside like for the ampere altra. So I would say that >> there is one cluster node level in DT that refers to the same MC/LLC >> level and only an additional child/parent cluster node should be used >> to fill the clustergroup_mask. >> > > Again I completely disagree. Let us look at the problems separately. > The hardware topology that some of the tools like lscpu and lstopo expects > what the hardware looks like and not the scheduler's view of the hardware. > So the topology masks that gets exposed to the user-space needs fixing > even today. I have reports from various tooling people about the same. > E.g. Juno getting exposed as dual socket system is utter non-sense. > > Yes scheduler uses most of the topology masks as is but that is not a must. > There are these *group_mask functions that can implement what scheduler > needs to be fed. > > I am not sure why the 2 issues are getting mixed up and that is the main > reason why I jumped into this to make sure the topology masks are > not tampered based on the way it needs to be used for scheduler. I'm all in favor of not mixing up those 2 issues. But I don't understand why you have to glue them together. (1) DT systems broken in userspace (lstopo shows Juno with 2 Packages) (2) Introduce CONFIG_SCHED_CLUSTER for DT systems (1) This can be solved with your patch-set w/o setting `(1. level) cpu-map cluster nodes`. The `socket nodes` taking over the functionality of the `cluster nodes` sorts out the `Juno is seen as having 2 packages`. This will make core_sibling not suitable for cpu_coregroup_mask() anymore. But this is OK since llc from cacheinfo (i.e. llc_sibling) takes over here. There is no need to involve `cluster nodes` anymore. (2) This will only make sense for Armv9 L2 complexes if we connect `2. level cpu-map cluster nodes` with cluster_id and cluster_sibling. And only then clusters would mean the same thing in ACPI and DT. I guess this was mentioned already a couple of times. > Both ACPI and DT on a platform must present exact same hardware topology > to the user-space, there is no space for argument there. > >> IIUC, we don't describe the dynamiQ level in ACPI which uses cache >> topology instead to define cpu_coregroup_mask whereas DT described the >> dynamiQ instead of using cache topology. If you use cache topology >> now, then you should skip the dynamiQ >> > > Yes, unless someone can work out a binding to represent that and convince > DT maintainers ;). 
> >> Finally, even if CLS and MC have the same scheduling behavior for now, >> they might ends up with different scheduling properties which would >> mean that replacing MC level by CLS one for current SoC would become >> wrong >> > > Again as I mentioned to Dietmar, that is something we can and must deal with > in those *group_mask and not expect topology mask to be altered to meet > CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion > as the topology is already user-space visible and (tooling) people are > complaining that DT systems are broken and doesn't match ACPI systems. > > So unless someone gives me non-scheduler and topology specific reasons > to change that, sorry but my opinion on this matter is not going to change ;). `lstopo` is fine with a now correct /sys/.../topology/package_cpus (or core_siblings (old filename). It's not reading /sys/.../topology/cluster_cpus (yet) so why set it (wrongly) to 0x39 for CPU0 on Juno when it can stay 0x01? > You will get this view of topology, find a way to manage with all those > *group_mask functions. By the way it is already handled for ACPI systems, > so if you are not happy with that, then that needs fixing as this change > set just aligns the behaviour on similar ACPI system. So the Juno example > is incorrect for the reason that the behaviour of scheduler there is different > with DT and ACPI. [...]
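To decode the mask values above: on Juno, CPUs 1-2 form one cluster and CPUs 0,3-5 the other, so 0x39 is binary 111001, i.e. the {0,3,4,5} span, while 0x01 is CPU0 alone. The question is therefore whether cluster_cpus for CPU0 should report the whole 4-CPU cluster (0x39) or stay collapsed to the CPU itself (0x01).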
On Mon, Jun 13, 2022 at 11:19:36AM +0200, Dietmar Eggemann wrote: > On 10/06/2022 12:27, Sudeep Holla wrote: > > On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote: > >> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote: > >>> > > > > [...] > > > >>> Why ? Are you suggesting that we shouldn't present the hardware cluster > >>> to the topology because of the above reason ? If so, sorry that is not a > >>> valid reason. We could add login to return NULL or appropriate value > >>> needed in cpu_clustergroup_mask id it matches MC level mask if we can't > >>> deal that in generic scheduler code. But the topology code can't be > >>> compromised for that reason as it is user visible. > >> > >> I tend to agree with Dietmar. The legacy use of cluster node in DT > >> refers to the dynamiQ or legacy b.L cluster which is also aligned to > >> the LLC and the MC scheduling level. The new cluster level that has > >> been introduced recently does not target this level but some > >> intermediate levels either inside like for the kupeng920 or the v9 > >> complex or outside like for the ampere altra. So I would say that > >> there is one cluster node level in DT that refers to the same MC/LLC > >> level and only an additional child/parent cluster node should be used > >> to fill the clustergroup_mask. > >> > > > > Again I completely disagree. Let us look at the problems separately. > > The hardware topology that some of the tools like lscpu and lstopo expects > > what the hardware looks like and not the scheduler's view of the hardware. > > So the topology masks that gets exposed to the user-space needs fixing > > even today. I have reports from various tooling people about the same. > > E.g. Juno getting exposed as dual socket system is utter non-sense. > > > > Yes scheduler uses most of the topology masks as is but that is not a must. > > There are these *group_mask functions that can implement what scheduler > > needs to be fed. > > > > I am not sure why the 2 issues are getting mixed up and that is the main > > reason why I jumped into this to make sure the topology masks are > > not tampered based on the way it needs to be used for scheduler. > > I'm all in favor of not mixing up those 2 issues. But I don't understand > why you have to glue them together. > What are you referring as 'glue them together'. As I said this series just address the hardware topology and if there is any impact on sched domains then it is do with alignment with ACPI and DT platform behaviour. I am not adding anything more to glue topology and info needed for sched domains. > (1) DT systems broken in userspace (lstopo shows Juno with 2 Packages) > Correct. > (2) Introduce CONFIG_SCHED_CLUSTER for DT systems > If that is a problem, you need to disable it for DT platforms. Not supporting proper hardware topology is not the way to workaround the issues enabling CONFIG_SCHED_CLUSTER for DT systems IMO. > > (1) This can be solved with your patch-set w/o setting `(1. level) > cpu-map cluster nodes`. The `socket nodes` taking over the > functionality of the `cluster nodes` sorts out the `Juno is seen as > having 2 packages`. > This will make core_sibling not suitable for cpu_coregroup_mask() > anymore. But this is OK since llc from cacheinfo (i.e. llc_sibling) > takes over here. > There is no need to involve `cluster nodes` anymore. > Again you are just deferring introducing CONFIG_SCHED_CLUSTER on DT which is fine but I don't agree with your approach. 
> (2) This will only make sense for Armv9 L2 complexes if we connect `2. > level cpu-map cluster nodes` with cluster_id and cluster_sibling. > And only then clusters would mean the same thing in ACPI and DT. > I guess this was mentioned already a couple of times. > Indeed. But I don't get what you mean by 2 level here. ACPI puts 1st level cpu nodes in cluster mask. So we are just aligning to it on DT platforms here. So if you are saying that is an issue, please fix that for ACPI too. > > Both ACPI and DT on a platform must present exact same hardware topology > > to the user-space, there is no space for argument there. > > > >> IIUC, we don't describe the dynamiQ level in ACPI which uses cache > >> topology instead to define cpu_coregroup_mask whereas DT described the > >> dynamiQ instead of using cache topology. If you use cache topology > >> now, then you should skip the dynamiQ > >> > > > > Yes, unless someone can work out a binding to represent that and convince > > DT maintainers ;). > > > >> Finally, even if CLS and MC have the same scheduling behavior for now, > >> they might ends up with different scheduling properties which would > >> mean that replacing MC level by CLS one for current SoC would become > >> wrong > >> > > > > Again as I mentioned to Dietmar, that is something we can and must deal with > > in those *group_mask and not expect topology mask to be altered to meet > > CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion > > as the topology is already user-space visible and (tooling) people are > > complaining that DT systems are broken and doesn't match ACPI systems. > > > > So unless someone gives me non-scheduler and topology specific reasons > > to change that, sorry but my opinion on this matter is not going to change ;). > > `lstopo` is fine with a now correct /sys/.../topology/package_cpus (or > core_siblings (old filename). It's not reading > /sys/.../topology/cluster_cpus (yet) so why set it (wrongly) to 0x39 for > CPU0 on Juno when it can stay 0x01? > On ACPI ? If so, it could be the package ID in the ACPI table which can be invalid and kernel use the table offset as ID. It is not ideal but doesn't affect the masks. The current value 0 or 1 on Juno is cluster ID and they contribute to wrong package cpu masks. And yes lstopo doesn't read cluster IDs. But we expose them in ACPI system and not on DT which was my main point. As pointed out earlier, have you checked ACPI on Juno and with CONFIG_SCHED_CLUSTER ? If the behaviour with my series on DT and ACPI differs, then it is an issue. But AFAIU, it doesn't and that is my main argument. You are just assuming what we have on Juno with DT is correct which may be w.r.t to scheduler but definitely not with respect to the hardware topology exposed to the users. So my aim is to get that fixed. If you are not happy with that, then how can be be happy with what is the current behaviour on ACPI + and - CONFIG_SCHED_CLUSTER. I haven't got your opinion yet on that matter. -- Regards, Sudeep
On Fri, 10 Jun 2022 at 12:27, Sudeep Holla <sudeep.holla@arm.com> wrote: > > On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote: > > On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > > [...] > > > > Why ? Are you suggesting that we shouldn't present the hardware cluster > > > to the topology because of the above reason ? If so, sorry that is not a > > > valid reason. We could add login to return NULL or appropriate value > > > needed in cpu_clustergroup_mask id it matches MC level mask if we can't > > > deal that in generic scheduler code. But the topology code can't be > > > compromised for that reason as it is user visible. > > > > I tend to agree with Dietmar. The legacy use of cluster node in DT > > refers to the dynamiQ or legacy b.L cluster which is also aligned to > > the LLC and the MC scheduling level. The new cluster level that has > > been introduced recently does not target this level but some > > intermediate levels either inside like for the kupeng920 or the v9 > > complex or outside like for the ampere altra. So I would say that > > there is one cluster node level in DT that refers to the same MC/LLC > > level and only an additional child/parent cluster node should be used > > to fill the clustergroup_mask. > > > > Again I completely disagree. Let us look at the problems separately. > The hardware topology that some of the tools like lscpu and lstopo expects > what the hardware looks like and not the scheduler's view of the hardware. > So the topology masks that gets exposed to the user-space needs fixing > even today. I have reports from various tooling people about the same. > E.g. Juno getting exposed as dual socket system is utter non-sense. > > Yes scheduler uses most of the topology masks as is but that is not a must. > There are these *group_mask functions that can implement what scheduler > needs to be fed. > > I am not sure why the 2 issues are getting mixed up and that is the main > reason why I jumped into this to make sure the topology masks are > not tampered based on the way it needs to be used for scheduler. > > Both ACPI and DT on a platform must present exact same hardware topology > to the user-space, there is no space for argument there. But that's exactly my point there: ACPI doesn't show the dynamiQ level anywhere but only the llc which are the same and your patch makes the dynamiQ level visible for DT in addition to llc > > > IIUC, we don't describe the dynamiQ level in ACPI which uses cache > > topology instead to define cpu_coregroup_mask whereas DT described the > > dynamiQ instead of using cache topology. If you use cache topology > > now, then you should skip the dynamiQ > > > > Yes, unless someone can work out a binding to represent that and convince > DT maintainers ;). > > > Finally, even if CLS and MC have the same scheduling behavior for now, > > they might ends up with different scheduling properties which would > > mean that replacing MC level by CLS one for current SoC would become > > wrong > > > > Again as I mentioned to Dietmar, that is something we can and must deal with > in those *group_mask and not expect topology mask to be altered to meet > CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion > as the topology is already user-space visible and (tooling) people are > complaining that DT systems are broken and doesn't match ACPI systems. 
again, your proposal doesn't help here because the DT will show a level that doesn't appear in ACPI > > So unless someone gives me non-scheduler and topology specific reasons > to change that, sorry but my opinion on this matter is not going to change ;). > > You will get this view of topology, find a way to manage with all those > *group_mask functions. By the way it is already handled for ACPI systems, AFAICT, no it's not: the cluster described in ACPI is not the dynamiQ level that you make now visible to DT. > so if you are not happy with that, then that needs fixing as this change set just aligns the behaviour on similar ACPI system. So the Juno example > is incorrect for the reason that the behaviour of scheduler there is different > with DT and ACPI. > > -- > Regards, > Sudeep
Please note until we agree on unified view for hardware topology, I will temporarily ignore any scheduler domain related issues/concerns as this thread/discussion is mixing up too much IMO. I am not ignoring sched_domain concerns, but deferring it until we agree on the hardware topology view which is user visible and how that impacts sched domain topology can be considered soon following that. On Tue, Jun 14, 2022 at 07:59:23PM +0200, Vincent Guittot wrote: > On Fri, 10 Jun 2022 at 12:27, Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote: > > > On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > > > > > [...] > > > > > > Why ? Are you suggesting that we shouldn't present the hardware cluster > > > > to the topology because of the above reason ? If so, sorry that is not a > > > > valid reason. We could add login to return NULL or appropriate value > > > > needed in cpu_clustergroup_mask id it matches MC level mask if we can't > > > > deal that in generic scheduler code. But the topology code can't be > > > > compromised for that reason as it is user visible. > > > > > > I tend to agree with Dietmar. The legacy use of cluster node in DT > > > refers to the dynamiQ or legacy b.L cluster which is also aligned to > > > the LLC and the MC scheduling level. The new cluster level that has > > > been introduced recently does not target this level but some > > > intermediate levels either inside like for the kupeng920 or the v9 > > > complex or outside like for the ampere altra. So I would say that > > > there is one cluster node level in DT that refers to the same MC/LLC > > > level and only an additional child/parent cluster node should be used > > > to fill the clustergroup_mask. > > > > > > > Again I completely disagree. Let us look at the problems separately. > > The hardware topology that some of the tools like lscpu and lstopo expects > > what the hardware looks like and not the scheduler's view of the hardware. > > So the topology masks that gets exposed to the user-space needs fixing > > even today. I have reports from various tooling people about the same. > > E.g. Juno getting exposed as dual socket system is utter non-sense. > > > > Yes scheduler uses most of the topology masks as is but that is not a must. > > There are these *group_mask functions that can implement what scheduler > > needs to be fed. > > > > I am not sure why the 2 issues are getting mixed up and that is the main > > reason why I jumped into this to make sure the topology masks are > > not tampered based on the way it needs to be used for scheduler. > > > > Both ACPI and DT on a platform must present exact same hardware topology > > to the user-space, there is no space for argument there. > > But that's exactly my point there: > ACPI doesn't show the dynamiQ level anywhere but only the llc which > are the same and your patch makes the dynamiQ level visible for DT in > addition to llc > Sorry if I am missing something obvious here, but both ACPI and DT has no special representation for dynamiQ clusters and hence it is impossible to deduce the same from either DT or ACPI. Can you provide some details or example as what you are referring as dynamiQ. Also what you mean by dynamiQ not shown on ACPI while shown with DT systems. If there is any discrepancies, we need to fix. 
Now, what I refer as discrepancy for example on Juno is below: (value read from a subset of per cpu sysfs files) cpu 0 1 2 3 4 5 cluster_id -1 -1 -1 -1 -1 -1 physical_package_id 1 0 0 1 1 1 cluster_cpus_list 0 1 2 3 4 5 package_cpus_list 0,3-5 1-2 1-2 0,3-5 0,3-5 0,3-5 The above one is for DT which is wrong in all the 4 entries above. The below one is on ACPI and after applying my series on Juno. cpu 0 1 2 3 4 5 cluster_id 1 0 0 1 1 1 physical_package_id 0 0 0 0 0 0 cluster_cpus_list 0,3-5 1-2 1-2 0,3-5 0,3-5 0,3-5 package_cpus_list 0-5 0-5 0-5 0-5 0-5 0-5 This matches the expectation from the various userspace tools like lscpu, lstopo,..etc. > > > > > IIUC, we don't describe the dynamiQ level in ACPI which uses cache > > > topology instead to define cpu_coregroup_mask whereas DT described the > > > dynamiQ instead of using cache topology. If you use cache topology > > > now, then you should skip the dynamiQ > > > > > > > Yes, unless someone can work out a binding to represent that and convince > > DT maintainers ;). > > > > > Finally, even if CLS and MC have the same scheduling behavior for now, > > > they might ends up with different scheduling properties which would > > > mean that replacing MC level by CLS one for current SoC would become > > > wrong > > > > > > > Again as I mentioned to Dietmar, that is something we can and must deal with > > in those *group_mask and not expect topology mask to be altered to meet > > CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion > > as the topology is already user-space visible and (tooling) people are > > complaining that DT systems are broken and doesn't match ACPI systems. > > again, your proposal doesn't help here because the DT will show a > level that doesn't appears in ACPI > Which level exactly ? It matches exactly for Juno, the sysfs files are exact match after my changes. Again don't mix the scheduler domains for arguments here. > > > > So unless someone gives me non-scheduler and topology specific reasons > > to change that, sorry but my opinion on this matter is not going to change ;). > > > > You will get this view of topology, find a way to manage with all those > > *group_mask functions. By the way it is already handled for ACPI systems, > > AFAICT, no it's not, the cluster described in ACPI is not the dynamiQ > level that you make now visible to DT Again, no. There is no binding for dynamiQ level either in DT or ACPI and hence there is no way it can become visible on DT. So I have no idea why there is a thought process or assumption about existence of dynamiQ level in the DT. It doesn't exist. If that is wrong, can you point me to the bindings as well as existing device tree ? If you are referring to the phantom domains Dietmar mentioned in earlier threads, then they don't exist. It is made up and one need to get the bindings pushed before we can address such a system.
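The numbers above come straight from sysfs; for completeness, a throwaway userspace sketch that dumps the same four files (illustrative only; six CPUs assumed, as on Juno):

#include <stdio.h>

static void dump(int cpu, const char *file)
{
	char path[128], buf[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, file);
	f = fopen(path, "r");
	if (!f)
		return;
	if (fgets(buf, sizeof(buf), f))
		printf("cpu%d %-20s %s", cpu, file, buf);	/* buf keeps its newline */
	fclose(f);
}

int main(void)
{
	const char *files[] = { "cluster_id", "physical_package_id",
				"cluster_cpus_list", "package_cpus_list" };

	for (int cpu = 0; cpu < 6; cpu++)
		for (unsigned int i = 0; i < sizeof(files) / sizeof(files[0]); i++)
			dump(cpu, files[i]);
	return 0;
}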
On Wed, 15 Jun 2022 at 19:01, Sudeep Holla <sudeep.holla@arm.com> wrote: > > Please note until we agree on unified view for hardware topology, I will > temporarily ignore any scheduler domain related issues/concerns as this > thread/discussion is mixing up too much IMO. I am not ignoring sched_domain > concerns, but deferring it until we agree on the hardware topology view > which is user visible and how that impacts sched domain topology can be > considered soon following that. > > On Tue, Jun 14, 2022 at 07:59:23PM +0200, Vincent Guittot wrote: > > On Fri, 10 Jun 2022 at 12:27, Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > > > On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote: > > > > On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > > > > > > > > [...] > > > > > > > > Why ? Are you suggesting that we shouldn't present the hardware cluster > > > > > to the topology because of the above reason ? If so, sorry that is not a > > > > > valid reason. We could add login to return NULL or appropriate value > > > > > needed in cpu_clustergroup_mask id it matches MC level mask if we can't > > > > > deal that in generic scheduler code. But the topology code can't be > > > > > compromised for that reason as it is user visible. > > > > > > > > I tend to agree with Dietmar. The legacy use of cluster node in DT > > > > refers to the dynamiQ or legacy b.L cluster which is also aligned to > > > > the LLC and the MC scheduling level. The new cluster level that has > > > > been introduced recently does not target this level but some > > > > intermediate levels either inside like for the kupeng920 or the v9 > > > > complex or outside like for the ampere altra. So I would say that > > > > there is one cluster node level in DT that refers to the same MC/LLC > > > > level and only an additional child/parent cluster node should be used > > > > to fill the clustergroup_mask. > > > > > > > > > > Again I completely disagree. Let us look at the problems separately. > > > The hardware topology that some of the tools like lscpu and lstopo expects > > > what the hardware looks like and not the scheduler's view of the hardware. > > > So the topology masks that gets exposed to the user-space needs fixing > > > even today. I have reports from various tooling people about the same. > > > E.g. Juno getting exposed as dual socket system is utter non-sense. > > > > > > Yes scheduler uses most of the topology masks as is but that is not a must. > > > There are these *group_mask functions that can implement what scheduler > > > needs to be fed. > > > > > > I am not sure why the 2 issues are getting mixed up and that is the main > > > reason why I jumped into this to make sure the topology masks are > > > not tampered based on the way it needs to be used for scheduler. > > > > > > Both ACPI and DT on a platform must present exact same hardware topology > > > to the user-space, there is no space for argument there. > > > > But that's exactly my point there: > > ACPI doesn't show the dynamiQ level anywhere but only the llc which > > are the same and your patch makes the dynamiQ level visible for DT in > > addition to llc > > > > Sorry if I am missing something obvious here, but both ACPI and DT has no > special representation for dynamiQ clusters and hence it is impossible to > deduce the same from either DT or ACPI. Can you provide some details > or example as what you are referring as dynamiQ. Also what you mean by > dynamiQ not shown on ACPI while shown with DT systems. 
If there is any > discrepancies, we need to fix. The cpu-map node in DT is following the dynamiQ or the legacy big.LITTLE topology. As an example the hikey6220 has 2 clusters, the hikey960 has 2 clusters that reflect big.LITTLE and the sdm845 has one cluster that reflects the dynamiQ cluster. now my mistake is to have made the assumption that core_sibling is arch_topology was used to reflect the cores sharing cache but after looking more deeply in the code it seems to be a lucky coincidence > > Now, what I refer as discrepancy for example on Juno is below: > (value read from a subset of per cpu sysfs files) > cpu 0 1 2 3 4 5 > cluster_id -1 -1 -1 -1 -1 -1 > physical_package_id 1 0 0 1 1 1 > cluster_cpus_list 0 1 2 3 4 5 > package_cpus_list 0,3-5 1-2 1-2 0,3-5 0,3-5 0,3-5 > > The above one is for DT which is wrong in all the 4 entries above. > The below one is on ACPI and after applying my series on Juno. > > cpu 0 1 2 3 4 5 > cluster_id 1 0 0 1 1 1 > physical_package_id 0 0 0 0 0 0 > cluster_cpus_list 0,3-5 1-2 1-2 0,3-5 0,3-5 0,3-5 > package_cpus_list 0-5 0-5 0-5 0-5 0-5 0-5 > > This matches the expectation from the various userspace tools like lscpu, > lstopo,..etc. > > > > > > > > IIUC, we don't describe the dynamiQ level in ACPI which uses cache > > > > topology instead to define cpu_coregroup_mask whereas DT described the > > > > dynamiQ instead of using cache topology. If you use cache topology > > > > now, then you should skip the dynamiQ > > > > > > > > > > Yes, unless someone can work out a binding to represent that and convince > > > DT maintainers ;). > > > > > > > Finally, even if CLS and MC have the same scheduling behavior for now, > > > > they might ends up with different scheduling properties which would > > > > mean that replacing MC level by CLS one for current SoC would become > > > > wrong > > > > > > > > > > Again as I mentioned to Dietmar, that is something we can and must deal with > > > in those *group_mask and not expect topology mask to be altered to meet > > > CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion > > > as the topology is already user-space visible and (tooling) people are > > > complaining that DT systems are broken and doesn't match ACPI systems. > > > > again, your proposal doesn't help here because the DT will show a > > level that doesn't appears in ACPI > > > > Which level exactly ? It matches exactly for Juno, the sysfs files are > exact match after my changes. Again don't mix the scheduler domains for > arguments here. > > > > > > > So unless someone gives me non-scheduler and topology specific reasons > > > to change that, sorry but my opinion on this matter is not going to change ;). > > > > > > You will get this view of topology, find a way to manage with all those > > > *group_mask functions. By the way it is already handled for ACPI systems, > > > > AFAICT, no it's not, the cluster described in ACPI is not the dynamiQ > > level that you make now visible to DT > > Again, no. There is no binding for dynamiQ level either in DT or ACPI and > hence there is no way it can become visible on DT. So I have no idea why > there is a thought process or assumption about existence of dynamiQ level > in the DT. It doesn't exist. If that is wrong, can you point me to the > bindings as well as existing device tree ? If you are referring to the > phantom domains Dietmar mentioned in earlier threads, then they don't exist. > It is made up and one need to get the bindings pushed before we can address > such a system. 
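For reference, the per-CPU record all of these IDs and masks live in looks roughly like this (paraphrased from include/linux/arch_topology.h of that era; exact layout and naming may differ):

/* Rough shape of the per-CPU topology record discussed in this thread. */
struct cpu_topology {
	int thread_id;
	int core_id;
	int cluster_id;		/* what this patch fills from the DT cpu-map */
	int package_id;
	int llc_id;		/* from cacheinfo, -1 if unknown */
	cpumask_t thread_sibling;
	cpumask_t core_sibling;		/* CPUs with the same package_id */
	cpumask_t cluster_sibling;	/* CPUs with the same cluster_id */
	cpumask_t llc_sibling;		/* CPUs with the same llc_id */
};

Vincent's point above is essentially that core_sibling was never defined as "CPUs sharing a cache"; it only happened to coincide with that on existing DT systems.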
On Wed, 15 Jun 2022 at 19:01, Sudeep Holla <sudeep.holla@arm.com> wrote: > > Please note until we agree on unified view for hardware topology, I will > temporarily ignore any scheduler domain related issues/concerns as this > thread/discussion is mixing up too much IMO. I am not ignoring sched_domain > concerns, but deferring it until we agree on the hardware topology view > which is user visible and how that impacts sched domain topology can be > considered soon following that. On my side, what i'm really interested in, it's the hardware topology reported to the scheduler > > On Tue, Jun 14, 2022 at 07:59:23PM +0200, Vincent Guittot wrote: > > On Fri, 10 Jun 2022 at 12:27, Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > > > On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote: > > > > On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote: > > > > > > > > > > > [...] > > > > > > > > Why ? Are you suggesting that we shouldn't present the hardware cluster > > > > > to the topology because of the above reason ? If so, sorry that is not a > > > > > valid reason. We could add login to return NULL or appropriate value > > > > > needed in cpu_clustergroup_mask id it matches MC level mask if we can't > > > > > deal that in generic scheduler code. But the topology code can't be > > > > > compromised for that reason as it is user visible. > > > > > > > > I tend to agree with Dietmar. The legacy use of cluster node in DT > > > > refers to the dynamiQ or legacy b.L cluster which is also aligned to > > > > the LLC and the MC scheduling level. The new cluster level that has > > > > been introduced recently does not target this level but some > > > > intermediate levels either inside like for the kupeng920 or the v9 > > > > complex or outside like for the ampere altra. So I would say that > > > > there is one cluster node level in DT that refers to the same MC/LLC > > > > level and only an additional child/parent cluster node should be used > > > > to fill the clustergroup_mask. > > > > > > > > > > Again I completely disagree. Let us look at the problems separately. > > > The hardware topology that some of the tools like lscpu and lstopo expects > > > what the hardware looks like and not the scheduler's view of the hardware. > > > So the topology masks that gets exposed to the user-space needs fixing > > > even today. I have reports from various tooling people about the same. > > > E.g. Juno getting exposed as dual socket system is utter non-sense. > > > > > > Yes scheduler uses most of the topology masks as is but that is not a must. > > > There are these *group_mask functions that can implement what scheduler > > > needs to be fed. > > > > > > I am not sure why the 2 issues are getting mixed up and that is the main > > > reason why I jumped into this to make sure the topology masks are > > > not tampered based on the way it needs to be used for scheduler. > > > > > > Both ACPI and DT on a platform must present exact same hardware topology > > > to the user-space, there is no space for argument there. > > > > But that's exactly my point there: > > ACPI doesn't show the dynamiQ level anywhere but only the llc which > > are the same and your patch makes the dynamiQ level visible for DT in > > addition to llc > > > > Sorry if I am missing something obvious here, but both ACPI and DT has no > special representation for dynamiQ clusters and hence it is impossible to > deduce the same from either DT or ACPI. 
Can you provide some details > or example as what you are referring as dynamiQ. Also what you mean by > dynamiQ not shown on ACPI while shown with DT systems. If there is any > discrepancies, we need to fix. > > Now, what I refer as discrepancy for example on Juno is below: > (value read from a subset of per cpu sysfs files) > cpu 0 1 2 3 4 5 > cluster_id -1 -1 -1 -1 -1 -1 > physical_package_id 1 0 0 1 1 1 > cluster_cpus_list 0 1 2 3 4 5 > package_cpus_list 0,3-5 1-2 1-2 0,3-5 0,3-5 0,3-5 > > The above one is for DT which is wrong in all the 4 entries above. > The below one is on ACPI and after applying my series on Juno. > > cpu 0 1 2 3 4 5 > cluster_id 1 0 0 1 1 1 > physical_package_id 0 0 0 0 0 0 > cluster_cpus_list 0,3-5 1-2 1-2 0,3-5 0,3-5 0,3-5 > package_cpus_list 0-5 0-5 0-5 0-5 0-5 0-5 > > This matches the expectation from the various userspace tools like lscpu, > lstopo,..etc. > > > > > > > > IIUC, we don't describe the dynamiQ level in ACPI which uses cache > > > > topology instead to define cpu_coregroup_mask whereas DT described the > > > > dynamiQ instead of using cache topology. If you use cache topology > > > > now, then you should skip the dynamiQ > > > > > > > > > > Yes, unless someone can work out a binding to represent that and convince > > > DT maintainers ;). > > > > > > > Finally, even if CLS and MC have the same scheduling behavior for now, > > > > they might ends up with different scheduling properties which would > > > > mean that replacing MC level by CLS one for current SoC would become > > > > wrong > > > > > > > > > > Again as I mentioned to Dietmar, that is something we can and must deal with > > > in those *group_mask and not expect topology mask to be altered to meet > > > CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion > > > as the topology is already user-space visible and (tooling) people are > > > complaining that DT systems are broken and doesn't match ACPI systems. > > > > again, your proposal doesn't help here because the DT will show a > > level that doesn't appears in ACPI > > > > Which level exactly ? It matches exactly for Juno, the sysfs files are > exact match after my changes. Again don't mix the scheduler domains for > arguments here. > > > > > > > So unless someone gives me non-scheduler and topology specific reasons > > > to change that, sorry but my opinion on this matter is not going to change ;). > > > > > > You will get this view of topology, find a way to manage with all those > > > *group_mask functions. By the way it is already handled for ACPI systems, > > > > AFAICT, no it's not, the cluster described in ACPI is not the dynamiQ > > level that you make now visible to DT > > Again, no. There is no binding for dynamiQ level either in DT or ACPI and > hence there is no way it can become visible on DT. So I have no idea why > there is a thought process or assumption about existence of dynamiQ level > in the DT. It doesn't exist. If that is wrong, can you point me to the > bindings as well as existing device tree ? If you are referring to the > phantom domains Dietmar mentioned in earlier threads, then they don't exist. > It is made up and one need to get the bindings pushed before we can address > such a system. > > -- > Regards, > Sudeep
On 13/06/2022 12:17, Sudeep Holla wrote:
> On Mon, Jun 13, 2022 at 11:19:36AM +0200, Dietmar Eggemann wrote:
>> On 10/06/2022 12:27, Sudeep Holla wrote:
>>> On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
>>>> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote:

[...]

>>> Again I completely disagree. Let us look at the problems separately.
>>> The hardware topology that tools like lscpu and lstopo expect is what the hardware looks like, not the scheduler's view of the hardware.
>>> So the topology masks that get exposed to user-space need fixing even today. I have reports from various tooling people about the same.
>>> E.g. Juno getting exposed as a dual-socket system is utter nonsense.
>>>
>>> Yes, the scheduler uses most of the topology masks as is, but that is not a must. There are these *group_mask functions that can implement what the scheduler needs to be fed.
>>>
>>> I am not sure why the 2 issues are getting mixed up, and that is the main reason why I jumped into this: to make sure the topology masks are not tampered with based on the way they need to be used for the scheduler.
>>
>> I'm all in favor of not mixing up those 2 issues. But I don't understand why you have to glue them together.
>
> What are you referring to as 'glue them together'? As I said, this series just addresses the hardware topology, and if there is any impact on sched domains then it is to do with alignment between ACPI and DT platform behaviour. I am not adding anything more to glue topology and the info needed for sched domains.

You can fix (1) without (2) parsing 1. level cluster nodes as cluster_siblings.

>> (1) DT systems broken in userspace (lstopo shows Juno with 2 Packages)
>
> Correct.
>
>> (2) Introduce CONFIG_SCHED_CLUSTER for DT systems
>
> If that is a problem, you need to disable it for DT platforms. Not supporting proper hardware topology is not the way to work around the issues of enabling CONFIG_SCHED_CLUSTER for DT systems IMO.
>
>> (1) This can be solved with your patch-set w/o setting `(1. level) cpu-map cluster nodes`. The `socket nodes` taking over the functionality of the `cluster nodes` sorts out the `Juno is seen as having 2 packages`.
>> This will make core_sibling not suitable for cpu_coregroup_mask() anymore. But this is OK since llc from cacheinfo (i.e. llc_sibling) takes over here.
>> There is no need to involve `cluster nodes` anymore.
>
> Again you are just deferring introducing CONFIG_SCHED_CLUSTER on DT, which is fine, but I don't agree with your approach.
>
>> (2) This will only make sense for Armv9 L2 complexes if we connect `2. level cpu-map cluster nodes` with cluster_id and cluster_sibling. And only then would clusters mean the same thing in ACPI and DT.
>> I guess this was mentioned already a couple of times.
>
> Indeed. But I don't get what you mean by 2 level here. ACPI puts 1st level

cpu_map {
    socket0 {
        cluster0 {            <-- 1. level cluster
            cluster0 {        <-- 2. level cluster (3 -->)
                core0 {

                };
                core1 {

                };
            cluster1 {
            ...

Armv9 L2 complexes: e.g. QC SM8450:

        .---------------.
CPU     |0 1 2 3 4 5 6 7|
        +---------------+
uarch   |l l l l m m m b|   (so called tri-gear: little, medium, big)
        +---------------+
L2      |   |   | | | | |   <-- 2. level cluster, Armv9 L2 complexes (<-- 3)
        +---------------+
L3      |<--         -->|
        +---------------+
        |<-- cluster -->|
        +---------------+
        |<--   DSU   -->|
        '---------------'

Only if we map (i) into cluster_sibling, we get the same hardware representation (for the task scheduler) for ACPI (4) and DT (5) systems.

(4) examples:

Kunpeng920 - 24 CPUs sharing LLC (cpu_coregroup_mask()), 4 CPUs sharing L3-tag (cpu_clustergroup_mask()).

X86 Jacobsville - 24 CPUs sharing LLC (L3), 4 CPUs sharing L2.

Armv9 L2 complexes: e.g. QC SM8450 - 8 CPUs sharing LLC (L3), (for A510 (little CPUs)) 2 CPUs sharing L2.

> cpu nodes in cluster mask. So we are just aligning to it on DT platforms here. So if you are saying that is an issue, please fix that for ACPI too.
>
> [...]
>
>>> So unless someone gives me non-scheduler and topology specific reasons to change that, sorry but my opinion on this matter is not going to change ;).
>>
>> `lstopo` is fine with a now correct /sys/.../topology/package_cpus (or core_siblings, the old filename). It's not reading /sys/.../topology/cluster_cpus (yet), so why set it (wrongly) to 0x39 for CPU0 on Juno when it can stay 0x01?
>
> On ACPI? If so, it could be the package ID in the ACPI table, which can be invalid, in which case the kernel uses the table offset as the ID. It is not ideal but doesn't affect the masks. The current value 0 or 1 on Juno is the cluster ID, and they contribute to wrong package cpu masks.
>
> And yes, lstopo doesn't read cluster IDs. But we expose them on ACPI systems and not on DT, which was my main point.

Understood. But a Kunpeng920 `cluster_cpus_list` file would contain logically different information than a Juno `cluster_cpus_list` file.

A Kunpeng920 `cluster_cpus_list` contains 4 CPUs sharing the L3-tag (less than the LLC), whereas a Juno `cluster_cpus_list` contains 2 or 4 CPUs (which is the LLC).

I think the key is to agree on what a CLUSTER actually represents, especially in the case when `the level of topology above CPUs` is congruent with the LLC boundaries. Because my feeling is that people didn't pay attention to this detail when they introduced CONFIG_SCHED_CLUSTER. A related issue is the Ampere Altra hack in cpu_coregroup_mask().

> As pointed out earlier, have you checked ACPI on Juno and with CONFIG_SCHED_CLUSTER? If the behaviour with my series on DT and ACPI differs, then it is an issue. But AFAIU it doesn't, and that is my main argument. You are just assuming that what we have on Juno with DT is correct, which may be w.r.t. the scheduler but definitely not with respect to the hardware topology exposed to the users. So my aim is to get that fixed.

I never ran Juno w/ ACPI. Are you saying that find_acpi_cpu_topology_cluster() returns cpu_topology[cpu].cluster_id's which match the `1. level cluster nodes`?

The function header of find_acpi_cpu_topology_cluster() says that `... the cluster, if present is the level of topology above CPUs. ...`.

From this perspective I can see your point. But this is then still clearly poorly designed. How would we ever support CONFIG_SCHED_CLUSTER in DT when it really (potentially) would bring a benefit (i.e. in the Armv9 L2-complex case) and not only create trouble for the task scheduler to set up its domains correctly?

Also, in case we stick to setting CONFIG_SCHED_CLUSTER=1 by default, CLS should be the default LLC sched domain and MC the exceptional one, just to respect the way the task scheduler removes not-useful domains today.

> If you are not happy with that, then how can we be happy with the current behaviour on ACPI + and - CONFIG_SCHED_CLUSTER? I haven't got your opinion on that matter yet.

I guess it's clear now that ACPI + CONFIG_SCHED_CLUSTER with ``the level of topology above CPUs` is congruent with LLC` creates trouble for the scheduler. So I don't see why we should replicate this for DT.

Let's discuss further tomorrow in person.

[...]
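For readers following the thread: the two mask functions named above live in drivers/base/arch_topology.c. The sketch below is simplified (the real cpu_coregroup_mask() also handles NUMA-in-package and, later, the Ampere Altra special case); it only shows where each mask sources its data, which is the crux of the disagreement:

/*
 * Simplified sketch (not the verbatim upstream code) of the two mask
 * functions from drivers/base/arch_topology.c discussed above.
 */
const struct cpumask *cpu_coregroup_mask(int cpu)
{
	/* Start from the package span parsed out of DT/ACPI ... */
	const cpumask_t *core_mask = &cpu_topology[cpu].core_sibling;

	/*
	 * ... but if cacheinfo reported a last-level cache whose span is
	 * a subset of it, the LLC siblings take over.  This is why the MC
	 * level keeps working once core_sibling becomes the whole socket
	 * on Juno.
	 */
	if (cpu_topology[cpu].llc_id != -1 &&
	    cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
		core_mask = &cpu_topology[cpu].llc_sibling;

	return core_mask;
}

const struct cpumask *cpu_clustergroup_mask(int cpu)
{
	/* Purely the DT cluster-node / ACPI PPTT cluster information. */
	return &cpu_topology[cpu].cluster_sibling;
}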
On Thu, Jun 16, 2022 at 05:02:28PM +0100, Dietmar Eggemann wrote:
> On 13/06/2022 12:17, Sudeep Holla wrote:
> > On Mon, Jun 13, 2022 at 11:19:36AM +0200, Dietmar Eggemann wrote:
> >> On 10/06/2022 12:27, Sudeep Holla wrote:
> >>> On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
> >>>> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote:
>
> [...]
>
> >>> Again I completely disagree. Let us look at the problems separately.
> >>> The hardware topology that tools like lscpu and lstopo expect is what the hardware looks like, not the scheduler's view of the hardware.
> >>> So the topology masks that get exposed to user-space need fixing even today. I have reports from various tooling people about the same.
> >>> E.g. Juno getting exposed as a dual-socket system is utter nonsense.
> >>>
> >>> Yes, the scheduler uses most of the topology masks as is, but that is not a must. There are these *group_mask functions that can implement what the scheduler needs to be fed.
> >>>
> >>> I am not sure why the 2 issues are getting mixed up, and that is the main reason why I jumped into this: to make sure the topology masks are not tampered with based on the way they need to be used for the scheduler.
> >>
> >> I'm all in favor of not mixing up those 2 issues. But I don't understand why you have to glue them together.
> >
> > What are you referring to as 'glue them together'? As I said, this series just addresses the hardware topology, and if there is any impact on sched domains then it is to do with alignment between ACPI and DT platform behaviour. I am not adding anything more to glue topology and the info needed for sched domains.
>
> You can fix (1) without (2) parsing 1. level cluster nodes as cluster_siblings.
>

Technically yes, but I see no point in delaying it, as it has been broken since the moment ACPI started exposing the correct value while DT ended up exposing an incorrect one. I am referring to the same change that introduced SCHED_CLUSTER. The damage is done and it needs repairing ASAP.

> > Indeed. But I don't get what you mean by 2 level here. ACPI puts 1st level
>
> cpu_map {
>     socket0 {
>         cluster0 {            <-- 1. level cluster
>             cluster0 {        <-- 2. level cluster (3 -->)

Oh, I had misunderstood this level nomenclature; I refer to it as the leaf cluster node, which is the 2. level cluster in this DT snippet.

>                 core0 {
>
>                 };
>                 core1 {
>
>                 };
>             cluster1 {
>             ...
>
> Armv9 L2 complexes: e.g. QC SM8450:
>
>         .---------------.
> CPU     |0 1 2 3 4 5 6 7|
>         +---------------+
> uarch   |l l l l m m m b|   (so called tri-gear: little, medium, big)
>         +---------------+
> L2      |   |   | | | | |   <-- 2. level cluster, Armv9 L2 complexes (<-- 3)

Again, before I assume, what exactly does `<-- 3` here and in the above snippet mean?

>         +---------------+
> L3      |<--         -->|
>         +---------------+
>         |<-- cluster -->|

I think this is the 2. level cluster, or the only cluster in this system w.r.t. the hardware. So let's not confuse things with multi-level if not necessary.

>         +---------------+
>         |<--   DSU   -->|
>         '---------------'
>
> Only if we map (i) into cluster_sibling, we get the same hardware representation (for the task scheduler) for ACPI (4) and DT (5) systems.
>

What is (i) above?

> (4) examples:
>
> Kunpeng920 - 24 CPUs sharing LLC (cpu_coregroup_mask()), 4 CPUs sharing L3-tag (cpu_clustergroup_mask()).
>

Again, decouple the cache info and the cluster info from the h/w and you have all the info. You can couple them together if that helps when you feed the sched_domains.

> X86 Jacobsville - 24 CPUs sharing LLC (L3), 4 CPUs sharing L2.
>
> Armv9 L2 complexes: e.g. QC SM8450 - 8 CPUs sharing LLC (L3), (for A510 (little CPUs)) 2 CPUs sharing L2.

[...]

> > And yes, lstopo doesn't read cluster IDs. But we expose them on ACPI systems and not on DT, which was my main point.
>
> Understood. But a Kunpeng920 `cluster_cpus_list` file would contain logically different information than a Juno `cluster_cpus_list` file.
>

And why is that?

> A Kunpeng920 `cluster_cpus_list` contains 4 CPUs sharing the L3-tag (less than the LLC), whereas a Juno `cluster_cpus_list` contains 2 or 4 CPUs (which is the LLC).
>

Correct, because that is how the hardware clusters are designed on those SoCs. The cache topology is different.

> I think the key is to agree on what a CLUSTER actually represents, especially in the case when `the level of topology above CPUs` is congruent with the LLC boundaries. Because my feeling is that people didn't pay attention to this detail when they introduced CONFIG_SCHED_CLUSTER. A related issue is the Ampere Altra hack in cpu_coregroup_mask().
>

This is defined by the hardware, and DT/ACPI have bindings for it. We can't redefine CLUSTER in a way that breaks those definitions. Again, the cluster is part of the CPU topology; the cache topology can be different and need not be congruent with the CPU topology in some ways. Of course they are in some other ways, but there will be no single line of alignment across SoCs, which is quite evident from the examples you have listed. Just add Ampere Altra to make it more fun.

> > As pointed out earlier, have you checked ACPI on Juno and with CONFIG_SCHED_CLUSTER? If the behaviour with my series on DT and ACPI differs, then it is an issue. But AFAIU it doesn't, and that is my main argument. You are just assuming that what we have on Juno with DT is correct, which may be w.r.t. the scheduler but definitely not with respect to the hardware topology exposed to the users. So my aim is to get that fixed.
>
> I never ran Juno w/ ACPI. Are you saying that find_acpi_cpu_topology_cluster() returns cpu_topology[cpu].cluster_id's which match the `1. level cluster nodes`?
>

Again, I am totally confused as to why this is now a 1. level cluster whereas above the SDM was a 2. level cluster. Both SoCs have only 1 level of cluster. While the SDM has 1 DSU cluster, Juno has 2 clusters.

> The function header of find_acpi_cpu_topology_cluster() says that `... the cluster, if present is the level of topology above CPUs. ...`.
>

Exactly, and that's how sysfs is also defined, and we can't go back and change that now. I don't see any issue TBH.

> From this perspective I can see your point. But this is then still clearly poorly designed.

Not really, as per the definition.

> How would we ever support CONFIG_SCHED_CLUSTER in DT when it really (potentially) would bring a benefit (i.e. in the Armv9 L2-complex case) and not only create trouble for the task scheduler to set up its domains correctly?

Indeed, that is the next problem once we get all these aligned across DT and ACPI. They have diverged, and I prefer not to allow that anymore by adding more divergence, e.g. to support the Armv9 L2-complex case. Please consider that on top of these; I am not addressing that at the moment. In fact, I am not addressing any sched_domain topics or issues. I have made that clear
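The 'decouple cache info and cluster info' argument maps directly onto the per-CPU topology structure: cluster and cache data already live in separate fields (include/linux/arch_topology.h, slightly abridged here), and sysfs reports them through separate attributes:

struct cpu_topology {
	int thread_id;
	int core_id;
	int cluster_id;			/* DT /cpu-map cluster node or ACPI PPTT cluster */
	int package_id;			/* socket */
	int llc_id;			/* last-level cache, from cacheinfo */
	cpumask_t thread_sibling;
	cpumask_t core_sibling;
	cpumask_t cluster_sibling;	/* backs cluster_cpus(_list) in sysfs */
	cpumask_t llc_sibling;
};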
On 17/06/2022 13:16, Sudeep Holla wrote:
> On Thu, Jun 16, 2022 at 05:02:28PM +0100, Dietmar Eggemann wrote:
>> On 13/06/2022 12:17, Sudeep Holla wrote:
>>> On Mon, Jun 13, 2022 at 11:19:36AM +0200, Dietmar Eggemann wrote:
>>>> On 10/06/2022 12:27, Sudeep Holla wrote:
>>>>> On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
>>>>>> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote:

[...]

>>> What are you referring to as 'glue them together'? As I said, this series just addresses the hardware topology, and if there is any impact on sched domains then it is to do with alignment between ACPI and DT platform behaviour. I am not adding anything more to glue topology and the info needed for sched domains.
>>
>> You can fix (1) without (2) parsing 1. level cluster nodes as cluster_siblings.
>>
>
> Technically yes, but I see no point in delaying it, as it has been broken since the moment ACPI started exposing the correct value while DT ended up exposing an incorrect one. I am referring to the same change that introduced SCHED_CLUSTER. The damage is done and it needs repairing ASAP.

OK, then let `/sys/.../topology/cluster_cpus` refer to pure topology-based cluster information. This can be DT cluster-node information or ACPI L3-tag information, e.g. for KP920.

>>> Indeed. But I don't get what you mean by 2 level here. ACPI puts 1st level
>>
>> cpu_map {
>>     socket0 {
>>         cluster0 {            <-- 1. level cluster
>>             cluster0 {        <-- 2. level cluster (3 -->)
>
> Oh, I had misunderstood this level nomenclature; I refer to it as the leaf cluster node, which is the 2. level cluster in this DT snippet.
>
>>                 core0 {
>>
>>                 };
>>                 core1 {
>>
>>                 };
>>             cluster1 {
>>             ...
>>
>> Armv9 L2 complexes: e.g. QC SM8450:
>>
>>         .---------------.
>> CPU     |0 1 2 3 4 5 6 7|
>>         +---------------+
>> uarch   |l l l l m m m b|   (so called tri-gear: little, medium, big)
>>         +---------------+
>> L2      |   |   | | | | |   <-- 2. level cluster, Armv9 L2 complexes (<-- 3)
>
> Again, before I assume, what exactly does `<-- 3` here and in the above snippet mean?

I wanted to show that we could encode `2. level cluster` as `Armv9 L2 complexes`. But since we agreed (after the email was sent) not to support `nested cluster-nodes`, this idea does not hold anymore.

>>         +---------------+
>> L3      |<--         -->|
>>         +---------------+
>>         |<-- cluster -->|
>
> I think this is the 2. level cluster, or the only cluster in this system w.r.t. the hardware. So let's not confuse things with multi-level if not necessary.

No need, we said no `nested cluster-node` support in DT.

>>         +---------------+
>>         |<--   DSU   -->|
>>         '---------------'
>>
>> Only if we map (i) into cluster_sibling, we get the same hardware representation (for the task scheduler) for ACPI (4) and DT (5) systems.
>>
>
> What is (i) above?

Sorry, (i) was meant to be `3 -->`.

>> (4) examples:
>>
>> Kunpeng920 - 24 CPUs sharing LLC (cpu_coregroup_mask()), 4 CPUs sharing L3-tag (cpu_clustergroup_mask()).
>>
>
> Again, decouple the cache info and the cluster info from the h/w and you have all the info. You can couple them together if that helps when you feed the sched_domains.

OK, this is what we finally agreed.

>> X86 Jacobsville - 24 CPUs sharing LLC (L3), 4 CPUs sharing L2.
>>
>> Armv9 L2 complexes: e.g. QC SM8450 - 8 CPUs sharing LLC (L3), (for A510 (little CPUs)) 2 CPUs sharing L2.
>
> [...]
>
>>> And yes, lstopo doesn't read cluster IDs. But we expose them on ACPI systems and not on DT, which was my main point.

OK, no further discussion needed here.

`/sys/.../topology/cluster_cpus` shows L2 (the LLC) on Juno, the L3-tag on KP920, or L2 on X86 Jacobsville.

The cpu_xxx_mask()s of (e.g.) default_topology[] have to make sure that the scheduler sees the correct information (the hierarchy of cpumasks).

>> Understood. But a Kunpeng920 `cluster_cpus_list` file would contain logically different information than a Juno `cluster_cpus_list` file.
>
> And why is that?

If we see it from the angle of the definition of CONFIG_SCHED_CLUSTER (`... the level of topology above CPUs ...`), then I can see that both definitions are the same. (CPUs should rather be `cores` here, I guess?)

[...]

>>> As pointed out earlier, have you checked ACPI on Juno and with CONFIG_SCHED_CLUSTER? If the behaviour with my series on DT and ACPI differs, then it is an issue. But AFAIU it doesn't, and that is my main argument. You are just assuming that what we have on Juno with DT is correct, which may be w.r.t. the scheduler but definitely not with respect to the hardware topology exposed to the users. So my aim is to get that fixed.
>>
>> I never ran Juno w/ ACPI. Are you saying that find_acpi_cpu_topology_cluster() returns cpu_topology[cpu].cluster_id's which match the `1. level cluster nodes`?
>
> Again, I am totally confused as to why this is now a 1. level cluster whereas above the SDM was a 2. level cluster. Both SoCs have only 1 level of cluster. While the SDM has 1 DSU cluster, Juno has 2 clusters.

No need to agree on what `level` (could) stand(s) for here. We said `no nested cluster-node`.

>> The function header of find_acpi_cpu_topology_cluster() says that `... the cluster, if present is the level of topology above CPUs. ...`.
>
> Exactly, and that's how sysfs is also defined, and we can't go back and change that now. I don't see any issue TBH.

OK.

>> From this perspective I can see your point. But this is then still clearly poorly designed.
>
> Not really, as per the definition.

Not from the viewpoint of topology and cacheinfo. But from the scheduler's viewpoint it is, and the whole thing gets activated by a scheduler CONFIG option.

>> How would we ever support CONFIG_SCHED_CLUSTER in DT when it really (potentially) would bring a benefit (i.e. in the Armv9 L2-complex case) and not only create trouble for the task scheduler to set up its domains correctly?
>
> Indeed, that is the next problem once we get all these aligned across DT and ACPI. They have diverged, and I prefer not to allow that anymore by adding more divergence, e.g. to support the Armv9 L2-complex case. Please consider that on top of these; I am not addressing that at the moment. In fact, I am not addressing any sched_domain topics or issues. I have made that clear
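For reference, default_topology[] mentioned above is the table in kernel/sched/topology.c from which the CLS/MC/DIE levels are built (abridged here); whatever the cpu_*_mask() functions return per CPU is exactly what the scheduler sees:

/* Abridged from kernel/sched/topology.c */
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};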
On Mon, Jun 20, 2022 at 03:27:33PM +0200, Dietmar Eggemann wrote:
> On 17/06/2022 13:16, Sudeep Holla wrote:
> > On Thu, Jun 16, 2022 at 05:02:28PM +0100, Dietmar Eggemann wrote:
> >> On 13/06/2022 12:17, Sudeep Holla wrote:
> >>> On Mon, Jun 13, 2022 at 11:19:36AM +0200, Dietmar Eggemann wrote:
> >>>> On 10/06/2022 12:27, Sudeep Holla wrote:
> >>>>> On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
> >>>>>> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@arm.com> wrote:
>
> [...]
>
> >>> What are you referring to as 'glue them together'? As I said, this series just addresses the hardware topology, and if there is any impact on sched domains then it is to do with alignment between ACPI and DT platform behaviour. I am not adding anything more to glue topology and the info needed for sched domains.
> >>
> >> You can fix (1) without (2) parsing 1. level cluster nodes as cluster_siblings.
> >>
> >
> > Technically yes, but I see no point in delaying it, as it has been broken since the moment ACPI started exposing the correct value while DT ended up exposing an incorrect one. I am referring to the same change that introduced SCHED_CLUSTER. The damage is done and it needs repairing ASAP.
>
> OK, then let `/sys/.../topology/cluster_cpus` refer to pure topology-based cluster information. This can be DT cluster-node information or ACPI L3-tag information, e.g. for KP920.
>

Agreed. All the information under /sys/.../topology/ is the hardware view, not the scheduler's view of the hardware for the purpose of building sched domains.

> >>> Indeed. But I don't get what you mean by 2 level here. ACPI puts 1st level
> >>
> >> cpu_map {
> >>     socket0 {
> >>         cluster0 {            <-- 1. level cluster
> >>             cluster0 {        <-- 2. level cluster (3 -->)
> >
> > Oh, I had misunderstood this level nomenclature; I refer to it as the leaf cluster node, which is the 2. level cluster in this DT snippet.
> >
> >>                 core0 {
> >>
> >>                 };
> >>                 core1 {
> >>
> >>                 };
> >>             cluster1 {
> >>             ...
> >>
> >> Armv9 L2 complexes: e.g. QC SM8450:
> >>
> >>         .---------------.
> >> CPU     |0 1 2 3 4 5 6 7|
> >>         +---------------+
> >> uarch   |l l l l m m m b|   (so called tri-gear: little, medium, big)
> >>         +---------------+
> >> L2      |   |   | | | | |   <-- 2. level cluster, Armv9 L2 complexes (<-- 3)
> >
> > Again, before I assume, what exactly does `<-- 3` here and in the above snippet mean?
>
> I wanted to show that we could encode `2. level cluster` as `Armv9 L2 complexes`. But since we agreed (after the email was sent) not to support `nested cluster-nodes`, this idea does not hold anymore.
>

Yes, I plan to throw a warning if we encounter nested clusters; that will be part of the next version (https://git.kernel.org/sudeep.holla/c/94fae12d64bb).

> >>         +---------------+
> >> L3      |<--         -->|
> >>         +---------------+
> >>         |<-- cluster -->|
> >
> > I think this is the 2. level cluster, or the only cluster in this system w.r.t. the hardware. So let's not confuse things with multi-level if not necessary.
>
> No need, we said no `nested cluster-node` support in DT.
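The linked commit is not reproduced in this thread; as an illustration only, a check for nested cluster nodes could look roughly like the helper below, called from parse_cluster() when a `clusterN` child is found while already inside a cluster node (the helper name and message text are made up for this sketch):

#include <linux/of.h>
#include <linux/printk.h>

/* Illustrative sketch only -- not the commit referenced above. */
static void __init warn_on_nested_cluster(struct device_node *cluster,
					  int depth)
{
	/*
	 * /cpu-map itself is parsed at depth 0, so depth > 0 means the
	 * "clusterN" child found here is nested inside another cluster.
	 */
	if (depth > 0)
		pr_warn_once("%pOF: nested clusters are not supported\n",
			     cluster);
}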
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index b8f0d72908c8..5f4f148a7769 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -492,7 +492,7 @@ static int __init get_cpu_for_node(struct device_node *node)
 }
 
 static int __init parse_core(struct device_node *core, int package_id,
-			     int core_id)
+			     int cluster_id, int core_id)
 {
 	char name[20];
 	bool leaf = true;
@@ -508,6 +508,7 @@ static int __init parse_core(struct device_node *core, int package_id,
 		cpu = get_cpu_for_node(t);
 		if (cpu >= 0) {
 			cpu_topology[cpu].package_id = package_id;
+			cpu_topology[cpu].cluster_id = cluster_id;
 			cpu_topology[cpu].core_id = core_id;
 			cpu_topology[cpu].thread_id = i;
 		} else if (cpu != -ENODEV) {
@@ -529,6 +530,7 @@ static int __init parse_core(struct device_node *core, int package_id,
 		}
 
 		cpu_topology[cpu].package_id = package_id;
+		cpu_topology[cpu].cluster_id = cluster_id;
 		cpu_topology[cpu].core_id = core_id;
 	} else if (leaf && cpu != -ENODEV) {
 		pr_err("%pOF: Can't get CPU for leaf core\n", core);
@@ -538,7 +540,8 @@ static int __init parse_core(struct device_node *core, int package_id,
 	return 0;
 }
 
-static int __init parse_cluster(struct device_node *cluster, int depth)
+static int __init
+parse_cluster(struct device_node *cluster, int cluster_id, int depth)
 {
 	char name[20];
 	bool leaf = true;
@@ -558,7 +561,7 @@ static int __init parse_cluster(struct device_node *cluster, int depth)
 		c = of_get_child_by_name(cluster, name);
 		if (c) {
 			leaf = false;
-			ret = parse_cluster(c, depth + 1);
+			ret = parse_cluster(c, i, depth + 1);
 			of_node_put(c);
 			if (ret != 0)
 				return ret;
@@ -582,7 +585,7 @@ static int __init parse_cluster(struct device_node *cluster, int depth)
 		}
 
 		if (leaf) {
-			ret = parse_core(c, 0, core_id++);
+			ret = parse_core(c, 0, cluster_id, core_id++);
 		} else {
 			pr_err("%pOF: Non-leaf cluster with core %s\n",
 			       cluster, name);
@@ -621,7 +624,7 @@ static int __init parse_dt_topology(void)
 	if (!map)
 		goto out;
 
-	ret = parse_cluster(map, 0);
+	ret = parse_cluster(map, -1, 0);
 	if (ret != 0)
 		goto out_map;
 
Let us set the cluster identifier as parsed from the device tree
cluster nodes within /cpu-map.

We don't support nesting of clusters yet as there is no real hardware
to support clusters of clusters.

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
---
 drivers/base/arch_topology.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)
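A quick way to confirm what this patch stores, e.g. on a single-socket, two-cluster (Juno-like) DT where CPUs below cluster0 should end up with cluster_id 0 and those below cluster1 with cluster_id 1, is a throw-away debug helper along these lines (illustrative only, not part of the patch):

#include <linux/arch_topology.h>
#include <linux/cpumask.h>
#include <linux/printk.h>

/* Illustrative debug aid, not part of the patch. */
static void __init dump_parsed_topology(void)
{
	unsigned int cpu;

	for_each_possible_cpu(cpu)
		pr_info("CPU%u: package_id %d cluster_id %d core_id %d\n",
			cpu, cpu_topology[cpu].package_id,
			cpu_topology[cpu].cluster_id,
			cpu_topology[cpu].core_id);
}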