diff mbox series

[v5,09/19] arch_topology: Use the last level cache information from the cacheinfo

Message ID 20220627165047.336669-10-sudeep.holla@arm.com (mailing list archive)
State New, archived
Headers show
Series arch_topology: Updates to add socket support and fix cluster ids | expand

Commit Message

Sudeep Holla June 27, 2022, 4:50 p.m. UTC
The cacheinfo is now initialised early along with the CPU topology
initialisation. Instead of relying on the LLC ID information parsed
separately only with ACPI PPTT elsewhere, migrate to use the similar
information from the cacheinfo.

This is generic for both DT and ACPI systems. The ACPI LLC ID information
parsed separately can now be removed from arch specific code.

Link: https://lore.kernel.org/r/20220621192034.3332546-10-sudeep.holla@arm.com
Tested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
---
 drivers/base/arch_topology.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Conor Dooley June 29, 2022, 5:49 p.m. UTC | #1
On 27/06/2022 17:50, Sudeep Holla wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> The cacheinfo is now initialised early along with the CPU topology
> initialisation. Instead of relying on the LLC ID information parsed
> separately only with ACPI PPTT elsewhere, migrate to use the similar
> information from the cacheinfo.
> 
> This is generic for both DT and ACPI systems. The ACPI LLC ID information
> parsed separately can now be removed from arch specific code.

Hey Sudeep,
I bisected broken boot on PolarFire SoC to this patch in next-20220629 :/
I suspect the issue is a missing "next-level-cache" in the the dt:
arch/riscv/boot/dts/microchip/mpfs.dtsi

Adding next-level-cache = <&cctrllr> fixes the boot.
Not sure what the resolution here is, old devicetrees are meant to keep
booting, right?

Thanks,
Conor.

> 
> Link: https://lore.kernel.org/r/20220621192034.3332546-10-sudeep.holla@arm.com

btw, why is this link in the patch? Why is a link to v4 relevant?
Links to both v4 and v5 exist in your for-linux-next branch.

Log:
git bisect start
# bad: [c4ef528bd006febc7de444d9775b28706d924f78] Add linux-next specific files for 20220629
git bisect bad c4ef528bd006febc7de444d9775b28706d924f78
# good: [b13baccc3850ca8b8cccbf8ed9912dbaa0fdf7f3] Linux 5.19-rc2
git bisect good b13baccc3850ca8b8cccbf8ed9912dbaa0fdf7f3
# bad: [95c758a8899c4e8825a35a62a6f31667991217f9] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git
git bisect bad 95c758a8899c4e8825a35a62a6f31667991217f9
# bad: [5cbb9aeefe0070b627cd5c5528e6e63701561d57] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git
git bisect bad 5cbb9aeefe0070b627cd5c5528e6e63701561d57
# good: [2e6556bae3e453cf27f3fb9c6144080e2a61707e] Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git
git bisect good 2e6556bae3e453cf27f3fb9c6144080e2a61707e
# good: [17efe76af33f6af09a821acce2e2e4e84819d381] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap.git
git bisect good 17efe76af33f6af09a821acce2e2e4e84819d381
# good: [5aeeaf40d31288e8efa6ff2cbd952b13de077aa9] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip.git
git bisect good 5aeeaf40d31288e8efa6ff2cbd952b13de077aa9
# bad: [f64dfa36b325d107d8aca9727410343bd86d37dc] Merge branch 'stm32-next' of git://git.kernel.org/pub/scm/linux/kernel/git/atorgue/stm32.git
git bisect bad f64dfa36b325d107d8aca9727410343bd86d37dc
# good: [89459a2aef8832f044c8fbbec54b46cec05156c8] Merge branch 'next/dt' into for-next
git bisect good 89459a2aef8832f044c8fbbec54b46cec05156c8
# bad: [24cdefc96973ff1a1f6702470ad91ab019e5fedd] Merge branch 'arch_topology' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into for-linux-next
git bisect bad 24cdefc96973ff1a1f6702470ad91ab019e5fedd
# bad: [0d71f236f0a1067aba7660d056a9061b5877bf52] arch_topology: Avoid parsing through all the CPUs once a outlier CPU is found
git bisect bad 0d71f236f0a1067aba7660d056a9061b5877bf52
# good: [be6ab2e822888b8d9983d670fdabc09d753fd24f] cacheinfo: Use cache identifiers to check if the caches are shared if available
git bisect good be6ab2e822888b8d9983d670fdabc09d753fd24f
# bad: [854a3115f9ec0b889015c6854fbc0c1d69a46e4a] arm64: topology: Remove redundant setting of llc_id in CPU topology
git bisect bad 854a3115f9ec0b889015c6854fbc0c1d69a46e4a
# bad: [3b23bb2573e65b11be8f4b89023296dee7f06c0b] arch_topology: Use the last level cache information from the cacheinfo
git bisect bad 3b23bb2573e65b11be8f4b89023296dee7f06c0b
# good: [2f7b757eb69df296554bd39b0b2b2f4da678c736] arch_topology: Add support to parse and detect cache attributes
git bisect good 2f7b757eb69df296554bd39b0b2b2f4da678c736
# first bad commit: [3b23bb2573e65b11be8f4b89023296dee7f06c0b] arch_topology: Use the last level cache information from the cacheinfo
Conor Dooley June 29, 2022, 6:18 p.m. UTC | #2
On 29/06/2022 18:49, Conor.Dooley@microchip.com wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On 27/06/2022 17:50, Sudeep Holla wrote:
>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>
>> The cacheinfo is now initialised early along with the CPU topology
>> initialisation. Instead of relying on the LLC ID information parsed
>> separately only with ACPI PPTT elsewhere, migrate to use the similar
>> information from the cacheinfo.
>>
>> This is generic for both DT and ACPI systems. The ACPI LLC ID information
>> parsed separately can now be removed from arch specific code.
> 
> Hey Sudeep,
> I bisected broken boot on PolarFire SoC to this patch in next-20220629 :/
> I suspect the issue is a missing "next-level-cache" in the the dt:
> arch/riscv/boot/dts/microchip/mpfs.dtsi
> 
> Adding next-level-cache = <&cctrllr> fixes the boot.

No, no it doesn't. Not sure what I was thinking there.
Prob tested that on the the last commit that bisect tested
rather than the one it pointed out the problem was with.

Either way, boot is broken in -next.

> Not sure what the resolution here is, old devicetrees are meant to keep
> booting, right?
> 
> Thanks,
> Conor.
> 
>>
>> Link: https://lore.kernel.org/r/20220621192034.3332546-10-sudeep.holla@arm.com
> 
> btw, why is this link in the patch? Why is a link to v4 relevant?
> Links to both v4 and v5 exist in your for-linux-next branch.
> 
> Log:
> git bisect start
> # bad: [c4ef528bd006febc7de444d9775b28706d924f78] Add linux-next specific files for 20220629
> git bisect bad c4ef528bd006febc7de444d9775b28706d924f78
> # good: [b13baccc3850ca8b8cccbf8ed9912dbaa0fdf7f3] Linux 5.19-rc2
> git bisect good b13baccc3850ca8b8cccbf8ed9912dbaa0fdf7f3
> # bad: [95c758a8899c4e8825a35a62a6f31667991217f9] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git
> git bisect bad 95c758a8899c4e8825a35a62a6f31667991217f9
> # bad: [5cbb9aeefe0070b627cd5c5528e6e63701561d57] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git
> git bisect bad 5cbb9aeefe0070b627cd5c5528e6e63701561d57
> # good: [2e6556bae3e453cf27f3fb9c6144080e2a61707e] Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git
> git bisect good 2e6556bae3e453cf27f3fb9c6144080e2a61707e
> # good: [17efe76af33f6af09a821acce2e2e4e84819d381] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap.git
> git bisect good 17efe76af33f6af09a821acce2e2e4e84819d381
> # good: [5aeeaf40d31288e8efa6ff2cbd952b13de077aa9] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip.git
> git bisect good 5aeeaf40d31288e8efa6ff2cbd952b13de077aa9
> # bad: [f64dfa36b325d107d8aca9727410343bd86d37dc] Merge branch 'stm32-next' of git://git.kernel.org/pub/scm/linux/kernel/git/atorgue/stm32.git
> git bisect bad f64dfa36b325d107d8aca9727410343bd86d37dc
> # good: [89459a2aef8832f044c8fbbec54b46cec05156c8] Merge branch 'next/dt' into for-next
> git bisect good 89459a2aef8832f044c8fbbec54b46cec05156c8
> # bad: [24cdefc96973ff1a1f6702470ad91ab019e5fedd] Merge branch 'arch_topology' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into for-linux-next
> git bisect bad 24cdefc96973ff1a1f6702470ad91ab019e5fedd
> # bad: [0d71f236f0a1067aba7660d056a9061b5877bf52] arch_topology: Avoid parsing through all the CPUs once a outlier CPU is found
> git bisect bad 0d71f236f0a1067aba7660d056a9061b5877bf52
> # good: [be6ab2e822888b8d9983d670fdabc09d753fd24f] cacheinfo: Use cache identifiers to check if the caches are shared if available
> git bisect good be6ab2e822888b8d9983d670fdabc09d753fd24f
> # bad: [854a3115f9ec0b889015c6854fbc0c1d69a46e4a] arm64: topology: Remove redundant setting of llc_id in CPU topology
> git bisect bad 854a3115f9ec0b889015c6854fbc0c1d69a46e4a
> # bad: [3b23bb2573e65b11be8f4b89023296dee7f06c0b] arch_topology: Use the last level cache information from the cacheinfo
> git bisect bad 3b23bb2573e65b11be8f4b89023296dee7f06c0b
> # good: [2f7b757eb69df296554bd39b0b2b2f4da678c736] arch_topology: Add support to parse and detect cache attributes
> git bisect good 2f7b757eb69df296554bd39b0b2b2f4da678c736
> # first bad commit: [3b23bb2573e65b11be8f4b89023296dee7f06c0b] arch_topology: Use the last level cache information from the cacheinfo
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Sudeep Holla June 29, 2022, 6:29 p.m. UTC | #3
On Wed, Jun 29, 2022 at 05:49:20PM +0000, Conor.Dooley@microchip.com wrote:
> > 
> > Link: https://lore.kernel.org/r/20220621192034.3332546-10-sudeep.holla@arm.com
> 
> btw, why is this link in the patch? Why is a link to v4 relevant?
> Links to both v4 and v5 exist in your for-linux-next branch.
> 
Yes I noticed that and fixed it today.
Sudeep Holla June 29, 2022, 6:33 p.m. UTC | #4
On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
> On 29/06/2022 18:49, Conor.Dooley@microchip.com wrote:
> > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> > 
> > On 27/06/2022 17:50, Sudeep Holla wrote:
> >> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> >>
> >> The cacheinfo is now initialised early along with the CPU topology
> >> initialisation. Instead of relying on the LLC ID information parsed
> >> separately only with ACPI PPTT elsewhere, migrate to use the similar
> >> information from the cacheinfo.
> >>
> >> This is generic for both DT and ACPI systems. The ACPI LLC ID information
> >> parsed separately can now be removed from arch specific code.
> > 
> > Hey Sudeep,
> > I bisected broken boot on PolarFire SoC to this patch in next-20220629 :/
> > I suspect the issue is a missing "next-level-cache" in the the dt:
> > arch/riscv/boot/dts/microchip/mpfs.dtsi
> > 
> > Adding next-level-cache = <&cctrllr> fixes the boot.
>

If the above is missing, even the existing cacheinfo will be incorrect on
that system. But we must end up with L1 as LLC, I need to check if that
breaks the boot.

> No, no it doesn't. Not sure what I was thinking there.
> Prob tested that on the the last commit that bisect tested
> rather than the one it pointed out the problem was with.
>

So can I assume the pointed commit is where the boot breaks ?

> Either way, boot is broken in -next.
>

OK, that's bad. 

> > Not sure what the resolution here is, old devicetrees are meant to keep
> > booting, right?
> >

Ideally yes. But with this, I assume the cacheinfo to userspace is also broken
on this platform and I guess that needs fixing which can happen with DT update
only.

--
Regards,
Sudeep
Sudeep Holla June 29, 2022, 6:42 p.m. UTC | #5
On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
> 
> No, no it doesn't. Not sure what I was thinking there.
> Prob tested that on the the last commit that bisect tested
> rather than the one it pointed out the problem was with.
> 
> Either way, boot is broken in -next.
>

Can you check if the below fixes the issue ? Assuming presenting L1 as
LLC might be causing issue.

Regards,
Sudeep

-->8
diff --git i/drivers/base/cacheinfo.c w/drivers/base/cacheinfo.c
index 167abfa6f37d..a691317f7fdd 100644
--- i/drivers/base/cacheinfo.c
+++ w/drivers/base/cacheinfo.c
@@ -60,7 +60,8 @@ bool last_level_cache_is_valid(unsigned int cpu)

        llc = per_cpu_cacheinfo_idx(cpu, cache_leaves(cpu) - 1);

-       return (llc->attributes & CACHE_ID) || !!llc->fw_token;
+       return (llc->type == CACHE_TYPE_UNIFIED) &&
+              ((llc->attributes & CACHE_ID) || !!llc->fw_token);

 }
Sudeep Holla June 29, 2022, 6:47 p.m. UTC | #6
On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
> On 29/06/2022 18:49, Conor.Dooley@microchip.com wrote:
> > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> > 
> > On 27/06/2022 17:50, Sudeep Holla wrote:
> >> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> >>
> >> The cacheinfo is now initialised early along with the CPU topology
> >> initialisation. Instead of relying on the LLC ID information parsed
> >> separately only with ACPI PPTT elsewhere, migrate to use the similar
> >> information from the cacheinfo.
> >>
> >> This is generic for both DT and ACPI systems. The ACPI LLC ID information
> >> parsed separately can now be removed from arch specific code.
> > 
> > Hey Sudeep,
> > I bisected broken boot on PolarFire SoC to this patch in next-20220629 :/
> > I suspect the issue is a missing "next-level-cache" in the the dt:
> > arch/riscv/boot/dts/microchip/mpfs.dtsi

Good that I included this in -next, I had not received any feedback from
RISC-V even after 5 iterations. I also see this DTS is very odd. It also
states CPU0 doesn't have L1-D$ while the other 4 CPUs have L1-D$. Is that
a mistake or is it the reality ? Another breakage in userspace cacheinfo
sysfs entry of cpu0 has both I$ and D$.
Conor Dooley June 29, 2022, 6:56 p.m. UTC | #7
On 29/06/2022 19:47, Sudeep Holla wrote:
> On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
>> On 29/06/2022 18:49, Conor.Dooley@microchip.com wrote:
>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>
>>> On 27/06/2022 17:50, Sudeep Holla wrote:
>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>>
>>>> The cacheinfo is now initialised early along with the CPU topology
>>>> initialisation. Instead of relying on the LLC ID information parsed
>>>> separately only with ACPI PPTT elsewhere, migrate to use the similar
>>>> information from the cacheinfo.
>>>>
>>>> This is generic for both DT and ACPI systems. The ACPI LLC ID information
>>>> parsed separately can now be removed from arch specific code.
>>>
>>> Hey Sudeep,
>>> I bisected broken boot on PolarFire SoC to this patch in next-20220629 :/
>>> I suspect the issue is a missing "next-level-cache" in the the dt:
>>> arch/riscv/boot/dts/microchip/mpfs.dtsi
> 
> Good that I included this in -next, I had not received any feedback from
> RISC-V even after 5 iterations.

I'll be honest, I saw the titles and CC list and made some incorrect
assumptions as to whether looking at it was worthwhile! I am not at
this all too long and what is/isn't important to look at often is not
obvious to me. But hey, our CI boots -next every day for a reason ;)

> I also see this DTS is very odd. It also
> states CPU0 doesn't have L1-D$ while the other 4 CPUs have L1-D$. Is that
> a mistake or is it the reality ?

AFAIK, reality. It's the same for the SiFive fu540 (with which this shares
a core complex. See page 12:
https://static.dev.sifive.com/FU540-C000-v1.0.pdf

> Another breakage in userspace cacheinfo
> sysfs entry of cpu0 has both I$ and D$.

Could you clarify what this means please?
Thanks,
Conor.
Sudeep Holla June 29, 2022, 7:12 p.m. UTC | #8
On Wed, Jun 29, 2022 at 06:56:29PM +0000, Conor.Dooley@microchip.com wrote:
> On 29/06/2022 19:47, Sudeep Holla wrote:
> > On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
> >> On 29/06/2022 18:49, Conor.Dooley@microchip.com wrote:
> >>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> >>>
> >>> On 27/06/2022 17:50, Sudeep Holla wrote:
> >>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> >>>>
> >>>> The cacheinfo is now initialised early along with the CPU topology
> >>>> initialisation. Instead of relying on the LLC ID information parsed
> >>>> separately only with ACPI PPTT elsewhere, migrate to use the similar
> >>>> information from the cacheinfo.
> >>>>
> >>>> This is generic for both DT and ACPI systems. The ACPI LLC ID information
> >>>> parsed separately can now be removed from arch specific code.
> >>>
> >>> Hey Sudeep,
> >>> I bisected broken boot on PolarFire SoC to this patch in next-20220629 :/
> >>> I suspect the issue is a missing "next-level-cache" in the the dt:
> >>> arch/riscv/boot/dts/microchip/mpfs.dtsi
> > 
> > Good that I included this in -next, I had not received any feedback from
> > RISC-V even after 5 iterations.
> 
> I'll be honest, I saw the titles and CC list and made some incorrect
> assumptions as to whether looking at it was worthwhile! I am not at
> this all too long and what is/isn't important to look at often is not
> obvious to me.

No worries, that's why I thought better to include in -next to get some
attention and I did get it this time, hurray! 
Conor Dooley June 29, 2022, 7:25 p.m. UTC | #9
On 29/06/2022 20:12, Sudeep Holla wrote:
> On Wed, Jun 29, 2022 at 06:56:29PM +0000, Conor.Dooley@microchip.com wrote:
>> On 29/06/2022 19:47, Sudeep Holla wrote:
>>> On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
>>>> On 29/06/2022 18:49, Conor.Dooley@microchip.com wrote:
>>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>>>
>>>>> On 27/06/2022 17:50, Sudeep Holla wrote:
>>>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>>>>
>>>>>> The cacheinfo is now initialised early along with the CPU topology
>>>>>> initialisation. Instead of relying on the LLC ID information parsed
>>>>>> separately only with ACPI PPTT elsewhere, migrate to use the similar
>>>>>> information from the cacheinfo.
>>>>>>
>>>>>> This is generic for both DT and ACPI systems. The ACPI LLC ID information
>>>>>> parsed separately can now be removed from arch specific code.
>>>>>
>>>>> Hey Sudeep,
>>>>> I bisected broken boot on PolarFire SoC to this patch in next-20220629 :/
>>>>> I suspect the issue is a missing "next-level-cache" in the the dt:
>>>>> arch/riscv/boot/dts/microchip/mpfs.dtsi
>>>
>>> Good that I included this in -next, I had not received any feedback from
>>> RISC-V even after 5 iterations.
>>
>> I'll be honest, I saw the titles and CC list and made some incorrect
>> assumptions as to whether looking at it was worthwhile! I am not at
>> this all too long and what is/isn't important to look at often is not
>> obvious to me.
> 
> No worries, that's why I thought better to include in -next to get some
> attention and I did get it this time, hurray! 
Conor Dooley June 29, 2022, 7:39 p.m. UTC | #10
On 29/06/2022 19:42, Sudeep Holla wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
>>
>> No, no it doesn't. Not sure what I was thinking there.
>> Prob tested that on the the last commit that bisect tested
>> rather than the one it pointed out the problem was with.
>>
>> Either way, boot is broken in -next.
>>
> 
> Can you check if the below fixes the issue?

Unfortunately, no joy.
Applied to a HEAD of 3b23bb2573e6 ("arch_topology: Use the
last level cache information from the cacheinfo").
Thanks,
Conor.

> Assuming presenting L1 as
> LLC might be causing issue.
> 
> Regards,
> Sudeep
> 
> -->8
> diff --git i/drivers/base/cacheinfo.c w/drivers/base/cacheinfo.c
> index 167abfa6f37d..a691317f7fdd 100644
> --- i/drivers/base/cacheinfo.c
> +++ w/drivers/base/cacheinfo.c
> @@ -60,7 +60,8 @@ bool last_level_cache_is_valid(unsigned int cpu)
> 
>         llc = per_cpu_cacheinfo_idx(cpu, cache_leaves(cpu) - 1);
> 
> -       return (llc->attributes & CACHE_ID) || !!llc->fw_token;
> +       return (llc->type == CACHE_TYPE_UNIFIED) &&
> +              ((llc->attributes & CACHE_ID) || !!llc->fw_token);
> 
>  }
>
Sudeep Holla June 29, 2022, 7:43 p.m. UTC | #11
On Wed, Jun 29, 2022 at 07:25:41PM +0000, Conor.Dooley@microchip.com wrote:
> 
> 
> On 29/06/2022 20:12, Sudeep Holla wrote:
> > On Wed, Jun 29, 2022 at 06:56:29PM +0000, Conor.Dooley@microchip.com wrote:
> >> On 29/06/2022 19:47, Sudeep Holla wrote:
> >>> On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
> >>>> On 29/06/2022 18:49, Conor.Dooley@microchip.com wrote:
> >>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> >>>>>
> >>>>> On 27/06/2022 17:50, Sudeep Holla wrote:
> >>>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> >>>>>>
> >>>>>> The cacheinfo is now initialised early along with the CPU topology
> >>>>>> initialisation. Instead of relying on the LLC ID information parsed
> >>>>>> separately only with ACPI PPTT elsewhere, migrate to use the similar
> >>>>>> information from the cacheinfo.
> >>>>>>
> >>>>>> This is generic for both DT and ACPI systems. The ACPI LLC ID information
> >>>>>> parsed separately can now be removed from arch specific code.
> >>>>>
> >>>>> Hey Sudeep,
> >>>>> I bisected broken boot on PolarFire SoC to this patch in next-20220629 :/
> >>>>> I suspect the issue is a missing "next-level-cache" in the the dt:
> >>>>> arch/riscv/boot/dts/microchip/mpfs.dtsi
> >>>
> >>> Good that I included this in -next, I had not received any feedback from
> >>> RISC-V even after 5 iterations.
> >>
> >> I'll be honest, I saw the titles and CC list and made some incorrect
> >> assumptions as to whether looking at it was worthwhile! I am not at
> >> this all too long and what is/isn't important to look at often is not
> >> obvious to me.
> > 
> > No worries, that's why I thought better to include in -next to get some
> > attention and I did get it this time, hurray! 
Conor Dooley June 29, 2022, 7:52 p.m. UTC | #12
On 29/06/2022 20:43, Sudeep Holla wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On Wed, Jun 29, 2022 at 07:25:41PM +0000, Conor.Dooley@microchip.com wrote:
>>
>>
>> On 29/06/2022 20:12, Sudeep Holla wrote:
>>> On Wed, Jun 29, 2022 at 06:56:29PM +0000, Conor.Dooley@microchip.com wrote:
>>>> On 29/06/2022 19:47, Sudeep Holla wrote:
>>>>> On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
>>>>>> On 29/06/2022 18:49, Conor.Dooley@microchip.com wrote:
>>>>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>>>>>
>>>>>>> On 27/06/2022 17:50, Sudeep Holla wrote:
>>>>>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>>>>>>
>>>>>>>> The cacheinfo is now initialised early along with the CPU topology
>>>>>>>> initialisation. Instead of relying on the LLC ID information parsed
>>>>>>>> separately only with ACPI PPTT elsewhere, migrate to use the similar
>>>>>>>> information from the cacheinfo.
>>>>>>>>
>>>>>>>> This is generic for both DT and ACPI systems. The ACPI LLC ID information
>>>>>>>> parsed separately can now be removed from arch specific code.
>>>>>>>
>>>>>>> Hey Sudeep,
>>>>>>> I bisected broken boot on PolarFire SoC to this patch in next-20220629 :/
>>>>>>> I suspect the issue is a missing "next-level-cache" in the the dt:
>>>>>>> arch/riscv/boot/dts/microchip/mpfs.dtsi
>>>>>
>>>>> Good that I included this in -next, I had not received any feedback from
>>>>> RISC-V even after 5 iterations.
>>>>
>>>> I'll be honest, I saw the titles and CC list and made some incorrect
>>>> assumptions as to whether looking at it was worthwhile! I am not at
>>>> this all too long and what is/isn't important to look at often is not
>>>> obvious to me.
>>>
>>> No worries, that's why I thought better to include in -next to get some
>>> attention and I did get it this time, hurray! 
Sudeep Holla June 29, 2022, 7:54 p.m. UTC | #13
On Wed, Jun 29, 2022 at 07:39:43PM +0000, Conor.Dooley@microchip.com wrote:
> On 29/06/2022 19:42, Sudeep Holla wrote:
> > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> > 
> > On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
> >>
> >> No, no it doesn't. Not sure what I was thinking there.
> >> Prob tested that on the the last commit that bisect tested
> >> rather than the one it pointed out the problem was with.
> >>
> >> Either way, boot is broken in -next.
> >>
> > 
> > Can you check if the below fixes the issue?
> 
> Unfortunately, no joy.
> Applied to a HEAD of 3b23bb2573e6 ("arch_topology: Use the
> last level cache information from the cacheinfo").

That's bad. Does the system boot with
Commit 2f7b757eb69d ("arch_topology: Add support to parse and detect cache
attributes") ?
Conor Dooley June 29, 2022, 8:32 p.m. UTC | #14
On 29/06/2022 20:54, Sudeep Holla wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On Wed, Jun 29, 2022 at 07:39:43PM +0000, Conor.Dooley@microchip.com wrote:
>> On 29/06/2022 19:42, Sudeep Holla wrote:
>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>
>>> On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
>>>>
>>>> No, no it doesn't. Not sure what I was thinking there.
>>>> Prob tested that on the the last commit that bisect tested
>>>> rather than the one it pointed out the problem was with.
>>>>
>>>> Either way, boot is broken in -next.
>>>>
>>>
>>> Can you check if the below fixes the issue?
>>
>> Unfortunately, no joy.
>> Applied to a HEAD of 3b23bb2573e6 ("arch_topology: Use the
>> last level cache information from the cacheinfo").
> 
> That's bad. Does the system boot with
> Commit 2f7b757eb69d ("arch_topology: Add support to parse and detect cache
> attributes") ?

It does.

> 
> --
> Regards,
> Sudeep
Conor Dooley June 29, 2022, 11:25 p.m. UTC | #15
On 29/06/2022 21:32, Conor.Dooley@microchip.com wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On 29/06/2022 20:54, Sudeep Holla wrote:
>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>
>> On Wed, Jun 29, 2022 at 07:39:43PM +0000, Conor.Dooley@microchip.com wrote:
>>> On 29/06/2022 19:42, Sudeep Holla wrote:
>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>>
>>>> On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
>>>>>
>>>>> No, no it doesn't. Not sure what I was thinking there.
>>>>> Prob tested that on the the last commit that bisect tested
>>>>> rather than the one it pointed out the problem was with.
>>>>>
>>>>> Either way, boot is broken in -next.
>>>>>
>>>>
>>>> Can you check if the below fixes the issue?
>>>
>>> Unfortunately, no joy.
>>> Applied to a HEAD of 3b23bb2573e6 ("arch_topology: Use the
>>> last level cache information from the cacheinfo").
>>
>> That's bad. Does the system boot with
>> Commit 2f7b757eb69d ("arch_topology: Add support to parse and detect cache
>> attributes") ?
> 
> It does.

FWIW boot log of the failure:

[    0.000000] Linux version 5.19.0-rc4-00009-g3b23bb2573e6-dirty (conor@) (riscv64-unknown-linux-gnu-gcc (g5964b5cd727) 11.1.0, GNU ld (GNU Binutils) 2.37) #1 SMP Thu Jun 30 00:22:42 IST 2022
[    0.000000] OF: fdt: Ignoring memory range 0x80000000 - 0x80200000
[    0.000000] Machine model: Microchip PolarFire-SoC Icicle Kit
[    0.000000] earlycon: ns16550a0 at MMIO32 0x0000000020100000 (options '115200n8')
[    0.000000] printk: bootconsole [ns16550a0] enabled
[    0.000000] printk: debug: skip boot console de-registration.
[    0.000000] efi: UEFI not found.
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000080200000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x000000103fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080200000-0x00000000adffffff]
[    0.000000]   node   0: [mem 0x0000001000000000-0x000000103fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080200000-0x000000103fffffff]
[    0.000000] On node 0, zone Normal: 16064512 pages in unavailable ranges
[    0.000000] SBI specification v0.3 detected
[    0.000000] SBI implementation ID=0x1 Version=0x9
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] CPU with hartid=0 is not available
[    0.000000] CPU with hartid=0 is not available
[    0.000000] riscv: base ISA extensions acdfim
[    0.000000] riscv: ELF capabilities acdfim
[    0.000000] percpu: Embedded 18 pages/cpu s34104 r8192 d31432 u73728
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 224263
[    0.000000] Kernel command line: earlycon keep_bootcon
[    0.000000] Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
[    0.000000] Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] software IO TLB: mapped [mem 0x00000000aa000000-0x00000000ae000000] (64MB)
[    0.000000] Virtual kernel memory layout:
[    0.000000]       fixmap : 0xffffffc6fee00000 - 0xffffffc6ff000000   (2048 kB)
[    0.000000]       pci io : 0xffffffc6ff000000 - 0xffffffc700000000   (  16 MB)
[    0.000000]      vmemmap : 0xffffffc700000000 - 0xffffffc800000000   (4096 MB)
[    0.000000]      vmalloc : 0xffffffc800000000 - 0xffffffd800000000   (  64 GB)
[    0.000000]       lowmem : 0xffffffd800000000 - 0xffffffe7bfe00000   (  62 GB)
[    0.000000]       kernel : 0xffffffff80000000 - 0xffffffffffffffff   (2047 MB)
[    0.000000] Memory: 807748K/1800192K available (6523K kernel code, 4857K rwdata, 2048K rodata, 2171K init, 396K bss, 992444K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu:     RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4.
[    0.000000] rcu:     RCU debug extended QS entry/exit.
[    0.000000]  Tracing variant of Tasks RCU enabled.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] CPU with hartid=0 is not available
[    0.000000] riscv-intc: unable to find hart id for /cpus/cpu@0/interrupt-controller
[    0.000000] riscv-intc: 64 local interrupts mapped
[    0.000000] plic: interrupt-controller@c000000: mapped 186 interrupts with 4 handlers for 9 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [4]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x1d854df40, max_idle_ns: 3526361616960 ns
[    0.000003] sched_clock: 64 bits at 1000kHz, resolution 1000ns, wraps every 2199023255500ns
[    0.009639] Console: colour dummy device 80x25
[    0.014583] printk: console [tty0] enabled
[    0.019220] Calibrating delay loop (skipped), value calculated using timer frequency.. 2.00 BogoMIPS (lpj=4000)
[    0.030377] pid_max: default: 32768 minimum: 301
[    0.035905] Mount-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
[    0.044169] Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
[    0.057018] cblist_init_generic: Setting adjustable number of callback queues.
[    0.065031] cblist_init_generic: Setting shift to 2 and lim to 1.
[    0.072068] riscv: ELF compat mode failed
[    0.072084] ASID allocator disabled (0 bits)
[    0.081558] rcu: Hierarchical SRCU implementation.
[    0.087482] EFI services will not be available.
[    0.093816] smp: Bringing up secondary CPUs ...
Sudeep Holla June 30, 2022, 10:39 a.m. UTC | #16
On Wed, Jun 29, 2022 at 11:25:41PM +0000, Conor.Dooley@microchip.com wrote:
> On 29/06/2022 21:32, Conor.Dooley@microchip.com wrote:
> > EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> > 
> > On 29/06/2022 20:54, Sudeep Holla wrote:
> >> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> >>
> >> On Wed, Jun 29, 2022 at 07:39:43PM +0000, Conor.Dooley@microchip.com wrote:
> >>> On 29/06/2022 19:42, Sudeep Holla wrote:
> >>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> >>>>
> >>>> On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
> >>>>>
> >>>>> No, no it doesn't. Not sure what I was thinking there.
> >>>>> Prob tested that on the the last commit that bisect tested
> >>>>> rather than the one it pointed out the problem was with.
> >>>>>
> >>>>> Either way, boot is broken in -next.
> >>>>>
> >>>>
> >>>> Can you check if the below fixes the issue?
> >>>
> >>> Unfortunately, no joy.
> >>> Applied to a HEAD of 3b23bb2573e6 ("arch_topology: Use the
> >>> last level cache information from the cacheinfo").
> >>
> >> That's bad. Does the system boot with
> >> Commit 2f7b757eb69d ("arch_topology: Add support to parse and detect cache
> >> attributes") ?
> > 
> > It does.
> 

I can't think of any reason for that to happen unless detect_cache_attributes
is failing from init_cpu_topology and we are ignoring that.

Are all RISC-V platforms failing on -next or is it just this platform ?
We may have to try with some logs in detect_cache_attributes,
last_level_cache_is_valid and last_level_cache_is_shared to check where it
is going wrong.

It must be crashing in smp_callin->update_siblings_masks->last_level_cache_is_shared
Conor Dooley June 30, 2022, 4:37 p.m. UTC | #17
On 30/06/2022 11:39, Sudeep Holla wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On Wed, Jun 29, 2022 at 11:25:41PM +0000, Conor.Dooley@microchip.com wrote:
>> On 29/06/2022 21:32, Conor.Dooley@microchip.com wrote:
>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>
>>> On 29/06/2022 20:54, Sudeep Holla wrote:
>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>>
>>>> On Wed, Jun 29, 2022 at 07:39:43PM +0000, Conor.Dooley@microchip.com wrote:
>>>>> On 29/06/2022 19:42, Sudeep Holla wrote:
>>>>>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>>>>>
>>>>>> On Wed, Jun 29, 2022 at 06:18:25PM +0000, Conor.Dooley@microchip.com wrote:
>>>>>>>
>>>>>>> No, no it doesn't. Not sure what I was thinking there.
>>>>>>> Prob tested that on the the last commit that bisect tested
>>>>>>> rather than the one it pointed out the problem was with.
>>>>>>>
>>>>>>> Either way, boot is broken in -next.
>>>>>>>
>>>>>>
>>>>>> Can you check if the below fixes the issue?
>>>>>
>>>>> Unfortunately, no joy.
>>>>> Applied to a HEAD of 3b23bb2573e6 ("arch_topology: Use the
>>>>> last level cache information from the cacheinfo").
>>>>
>>>> That's bad. Does the system boot with
>>>> Commit 2f7b757eb69d ("arch_topology: Add support to parse and detect cache
>>>> attributes") ?
>>>
>>> It does.
>>
> 
> I can't think of any reason for that to happen unless detect_cache_attributes
> is failing from init_cpu_topology and we are ignoring that.
> 
> Are all RISC-V platforms failing on -next or is it just this platform ?

I don't know. I only have SoCs with this core complex & one that does not
work with upstream. I can try my other board with this SoC - but I am on
leave at the moment w/ a computer or internet during the day so it may be
a few days before I can try it.

However, Niklas Cassel has tried to use the Canaan K210 on next-20220630
but had issues with RCU stalling:
https://lore.kernel.org/linux-riscv/Yr3PKR0Uj1bE5Y6O@x1-carbon/T/#m52016996fcf5fa0501066d73352ed8e806803e06
Not going to claim any relation, but that's minus 1 to the platforms that
can be used to test this on upstream RISC-V.

> We may have to try with some logs in detect_cache_attributes,
> last_level_cache_is_valid and last_level_cache_is_shared to check where it
> is going wrong.
> 
> It must be crashing in smp_callin->update_siblings_masks->last_level_cache_is_shared

Yeah, I was playing around last night for a while but didn't figure out the
root cause. I'll try again tonight.

In the meantime, would you mind taking the patches out of -next?
FWIW I repro'd the failure on next-20220630.

Thanks,
Conor.
Sudeep Holla June 30, 2022, 5:35 p.m. UTC | #18
On Thu, Jun 30, 2022 at 04:37:50PM +0000, Conor.Dooley@microchip.com wrote:
> On 30/06/2022 11:39, Sudeep Holla wrote:
> >
> > I can't think of any reason for that to happen unless detect_cache_attributes
> > is failing from init_cpu_topology and we are ignoring that.
> >
> > Are all RISC-V platforms failing on -next or is it just this platform ?
>
> I don't know. I only have SoCs with this core complex & one that does not
> work with upstream. I can try my other board with this SoC - but I am on
> leave at the moment w/ a computer or internet during the day so it may be
> a few days before I can try it.
>

Sure, no worries.

> However, Niklas Cassel has tried to use the Canaan K210 on next-20220630
> but had issues with RCU stalling:
> https://lore.kernel.org/linux-riscv/Yr3PKR0Uj1bE5Y6O@x1-carbon/T/#m52016996fcf5fa0501066d73352ed8e806803e06
> Not going to claim any relation, but that's minus 1 to the platforms that
> can be used to test this on upstream RISC-V.
>

Ah OK, will check and ask full logs to see if there is any relation.

> > We may have to try with some logs in detect_cache_attributes,
> > last_level_cache_is_valid and last_level_cache_is_shared to check where it
> > is going wrong.
> >
> > It must be crashing in smp_callin->update_siblings_masks->last_level_cache_is_shared
>
> Yeah, I was playing around last night for a while but didn't figure out the
> root cause. I'll try again tonight.
>

OK, thanks for that. I tried qemu, but doesn't have any cache info in DT
provided by qemu itself. The other sifive_u machine didn't work on qemu,
only virt booted with mainline as well.

> In the meantime, would you mind taking the patches out of -next?

I don't want to take out as we will loose all the test coverage.
I would like to know if any other RISC-V platform is affected or not
before removing it.

> FWIW I repro'd the failure on next-20220630.

Yes, nothing has changed yet.

--
Regards,
Sudeep
Conor Dooley June 30, 2022, 7:20 p.m. UTC | #19
On 30/06/2022 18:35, Sudeep Holla wrote:
> On Thu, Jun 30, 2022 at 04:37:50PM +0000, Conor.Dooley@microchip.com wrote:
>> On 30/06/2022 11:39, Sudeep Holla wrote:
>>>
>>> I can't think of any reason for that to happen unless detect_cache_attributes
>>> is failing from init_cpu_topology and we are ignoring that.
>>>
>>> Are all RISC-V platforms failing on -next or is it just this platform ?
>>
>> I don't know. I only have SoCs with this core complex & one that does not
>> work with upstream. I can try my other board with this SoC - but I am on
>> leave at the moment w/ a computer or internet during the day so it may be
>> a few days before I can try it.
>>
> 
> Sure, no worries.
> 
>> However, Niklas Cassel has tried to use the Canaan K210 on next-20220630
>> but had issues with RCU stalling:
>> https://lore.kernel.org/linux-riscv/Yr3PKR0Uj1bE5Y6O@x1-carbon/T/#m52016996fcf5fa0501066d73352ed8e806803e06
>> Not going to claim any relation, but that's minus 1 to the platforms that
>> can be used to test this on upstream RISC-V.
>>
> 
> Ah OK, will check and ask full logs to see if there is any relation.
> 
>>> We may have to try with some logs in detect_cache_attributes,
>>> last_level_cache_is_valid and last_level_cache_is_shared to check where it
>>> is going wrong.
>>>
>>> It must be crashing in smp_callin->update_siblings_masks->last_level_cache_is_shared


So, looks like there's a problem in cache_leaves_are_shared() which is hit
by the above path. Both of the if clauses are false, and the function falls
through to return sib_leaf->fw_token == this_leaf->fw_token;
Both sib_leaf & this_leaf seem to be null.

static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf,
					   struct cacheinfo *sib_leaf)
{
	/*
	 * For non DT/ACPI systems, assume unique level 1 caches,
	 * system-wide shared caches for all other levels. This will be used
	 * only if arch specific code has not populated shared_cpu_map
	 */
	if (!(IS_ENABLED(CONFIG_OF) || IS_ENABLED(CONFIG_ACPI)))
		return !(this_leaf->level == 1);

	if ((sib_leaf->attributes & CACHE_ID) &&
	    (this_leaf->attributes & CACHE_ID))
		return sib_leaf->id == this_leaf->id;

	return sib_leaf->fw_token == this_leaf->fw_token;
}

Any ideas what to look at next?
Sudeep Holla June 30, 2022, 8:07 p.m. UTC | #20
On Thu, Jun 30, 2022 at 07:20:04PM +0000, Conor.Dooley@microchip.com wrote:
> 
> 
> On 30/06/2022 18:35, Sudeep Holla wrote:
> > On Thu, Jun 30, 2022 at 04:37:50PM +0000, Conor.Dooley@microchip.com wrote:
> >> On 30/06/2022 11:39, Sudeep Holla wrote:
> >>>
> >>> I can't think of any reason for that to happen unless detect_cache_attributes
> >>> is failing from init_cpu_topology and we are ignoring that.
> >>>
> >>> Are all RISC-V platforms failing on -next or is it just this platform ?
> >>
> >> I don't know. I only have SoCs with this core complex & one that does not
> >> work with upstream. I can try my other board with this SoC - but I am on
> >> leave at the moment w/ a computer or internet during the day so it may be
> >> a few days before I can try it.
> >>
> > 
> > Sure, no worries.
> > 
> >> However, Niklas Cassel has tried to use the Canaan K210 on next-20220630
> >> but had issues with RCU stalling:
> >> https://lore.kernel.org/linux-riscv/Yr3PKR0Uj1bE5Y6O@x1-carbon/T/#m52016996fcf5fa0501066d73352ed8e806803e06
> >> Not going to claim any relation, but that's minus 1 to the platforms that
> >> can be used to test this on upstream RISC-V.
> >>
> > 
> > Ah OK, will check and ask full logs to see if there is any relation.
> > 
> >>> We may have to try with some logs in detect_cache_attributes,
> >>> last_level_cache_is_valid and last_level_cache_is_shared to check where it
> >>> is going wrong.
> >>>
> >>> It must be crashing in smp_callin->update_siblings_masks->last_level_cache_is_shared
> 
> 
> So, looks like there's a problem in cache_leaves_are_shared() which is hit
> by the above path. Both of the if clauses are false, and the function falls
> through to return sib_leaf->fw_token == this_leaf->fw_token;

Both if() failing is expected and that statement
	return sib_leaf->fw_token == this_leaf->fw_token;
execution is correct.

> Both sib_leaf & this_leaf seem to be null.
>

But this is wrong as last_level_cache_is_shared checks for
last_level_cache_is_valid which must return false if the fw_token = NULL
So we must not hit the above return statement with NULL fw_token.

> static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf,
> 					   struct cacheinfo *sib_leaf)
> {
> 	/*
> 	 * For non DT/ACPI systems, assume unique level 1 caches,
> 	 * system-wide shared caches for all other levels. This will be used
> 	 * only if arch specific code has not populated shared_cpu_map
> 	 */
> 	if (!(IS_ENABLED(CONFIG_OF) || IS_ENABLED(CONFIG_ACPI)))
> 		return !(this_leaf->level == 1);
> 
> 	if ((sib_leaf->attributes & CACHE_ID) &&
> 	    (this_leaf->attributes & CACHE_ID))
> 		return sib_leaf->id == this_leaf->id;
> 
> 	return sib_leaf->fw_token == this_leaf->fw_token;
> }
> 
> Any ideas what to look at next?

I wonder how did we not get last_level_cache_is_valid as false if the
fw_node is NULL. But it should not be NULL at the first place.
Conor Dooley June 30, 2022, 8:13 p.m. UTC | #21
On 30/06/2022 21:07, Sudeep Holla wrote:
> On Thu, Jun 30, 2022 at 07:20:04PM +0000, Conor.Dooley@microchip.com wrote:
>>
>>
>> On 30/06/2022 18:35, Sudeep Holla wrote:
>>> On Thu, Jun 30, 2022 at 04:37:50PM +0000, Conor.Dooley@microchip.com wrote:
>>>> On 30/06/2022 11:39, Sudeep Holla wrote:
>>>>>
>>>>> I can't think of any reason for that to happen unless detect_cache_attributes
>>>>> is failing from init_cpu_topology and we are ignoring that.
>>>>>
>>>>> Are all RISC-V platforms failing on -next or is it just this platform ?
>>>>
>>>> I don't know. I only have SoCs with this core complex & one that does not
>>>> work with upstream. I can try my other board with this SoC - but I am on
>>>> leave at the moment w/ a computer or internet during the day so it may be
>>>> a few days before I can try it.
>>>>
>>>
>>> Sure, no worries.
>>>
>>>> However, Niklas Cassel has tried to use the Canaan K210 on next-20220630
>>>> but had issues with RCU stalling:
>>>> https://lore.kernel.org/linux-riscv/Yr3PKR0Uj1bE5Y6O@x1-carbon/T/#m52016996fcf5fa0501066d73352ed8e806803e06
>>>> Not going to claim any relation, but that's minus 1 to the platforms that
>>>> can be used to test this on upstream RISC-V.
>>>>
>>>
>>> Ah OK, will check and ask full logs to see if there is any relation.
>>>
>>>>> We may have to try with some logs in detect_cache_attributes,
>>>>> last_level_cache_is_valid and last_level_cache_is_shared to check where it
>>>>> is going wrong.
>>>>>
>>>>> It must be crashing in smp_callin->update_siblings_masks->last_level_cache_is_shared
>>
>>
>> So, looks like there's a problem in cache_leaves_are_shared() which is hit
>> by the above path. Both of the if clauses are false, and the function falls
>> through to return sib_leaf->fw_token == this_leaf->fw_token;
> 
> Both if() failing is expected and that statement
> 	return sib_leaf->fw_token == this_leaf->fw_token;
> execution is correct.
> 
>> Both sib_leaf & this_leaf seem to be null.
>>
> 
> But this is wrong as last_level_cache_is_shared checks for
> last_level_cache_is_valid which must return false if the fw_token = NULL
> So we must not hit the above return statement with NULL fw_token.
> 
>> static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf,
>> 					   struct cacheinfo *sib_leaf)
>> {
>> 	/*
>> 	 * For non DT/ACPI systems, assume unique level 1 caches,
>> 	 * system-wide shared caches for all other levels. This will be used
>> 	 * only if arch specific code has not populated shared_cpu_map
>> 	 */
>> 	if (!(IS_ENABLED(CONFIG_OF) || IS_ENABLED(CONFIG_ACPI)))
>> 		return !(this_leaf->level == 1);
>>
>> 	if ((sib_leaf->attributes & CACHE_ID) &&
>> 	    (this_leaf->attributes & CACHE_ID))
>> 		return sib_leaf->id == this_leaf->id;
>>
>> 	return sib_leaf->fw_token == this_leaf->fw_token;
>> }
>>
>> Any ideas what to look at next?
> 
> I wonder how did we not get last_level_cache_is_valid as false if the
> fw_node is NULL. But it should not be NULL at the first place.
> 

I didn't have the time to go digging into things, but the following
macro looked odd:
#define per_cpu_cacheinfo_idx(cpu, idx)		\
				(per_cpu_cacheinfo(cpu) + (idx))
Maybe it is just badly named, but is this getting the per_cpu_cacheinfo
and then incrementing intentionally, or is it meant to get the
per_cpu_cacheinfo of cpu + idx?
Sudeep Holla June 30, 2022, 8:21 p.m. UTC | #22
On Thu, Jun 30, 2022 at 08:13:55PM +0000, Conor.Dooley@microchip.com wrote:
> 
> I didn't have the time to go digging into things, but the following
> macro looked odd:
> #define per_cpu_cacheinfo_idx(cpu, idx)		\
> 				(per_cpu_cacheinfo(cpu) + (idx))
> Maybe it is just badly named, but is this getting the per_cpu_cacheinfo
> and then incrementing intentionally, or is it meant to get the
> per_cpu_cacheinfo of cpu + idx?

OK, basically per_cpu_cacheinfo(cpu) get the information for a cpu
while per_cpu_cacheinfo_idx(cpu, idx) will fetch the information for a
given cpu and given index within the cpu. So we are incrementing the
pointer by the index. These work just fine on arm64 platform.

Not sure if compiler is optimising something as I still can't understand
how we can end up with valid llc but fw_token as NULL.

--
Regards,
Sudeep
Conor Dooley June 30, 2022, 10:07 p.m. UTC | #23
On 30/06/2022 21:21, Sudeep Holla wrote:
> On Thu, Jun 30, 2022 at 08:13:55PM +0000, Conor.Dooley@microchip.com wrote:
>>
>> I didn't have the time to go digging into things, but the following
>> macro looked odd:
>> #define per_cpu_cacheinfo_idx(cpu, idx)		\
>> 				(per_cpu_cacheinfo(cpu) + (idx))
>> Maybe it is just badly named, but is this getting the per_cpu_cacheinfo
>> and then incrementing intentionally, or is it meant to get the
>> per_cpu_cacheinfo of cpu + idx?
> 
> OK, basically per_cpu_cacheinfo(cpu) get the information for a cpu
> while per_cpu_cacheinfo_idx(cpu, idx) will fetch the information for a
> given cpu and given index within the cpu. So we are incrementing the
> pointer by the index. These work just fine on arm64 platform.

Right, that's what I figured but wanted to be sure.

> 
> Not sure if compiler is optimising something as I still can't understand
> how we can end up with valid llc but fw_token as NULL.
See idk about that. The following fails to boot.
index 167abfa6f37d..9d45c37fb004 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -36,6 +36,8 @@ struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu)
 static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf,
                                           struct cacheinfo *sib_leaf)
 {
+       if (!this_leaf || !sib_leaf)
+               return false;
        /*
         * For non DT/ACPI systems, assume unique level 1 caches,
         * system-wide shared caches for all other levels. This will be used
@@ -74,8 +76,12 @@ bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y)
 
        llc_x = per_cpu_cacheinfo_idx(cpu_x, cache_leaves(cpu_x) - 1);
        llc_y = per_cpu_cacheinfo_idx(cpu_y, cache_leaves(cpu_y) - 1);
+       if (!llc_x || !llc_y){
+               printk("llc was null\n");
+               return false;
+       }
 
-       return cache_leaves_are_shared(llc_x, llc_y);
+       return false; //cache_leaves_are_shared(llc_x, llc_y);
 }
 
 #ifdef CONFIG_OF

and this boots:

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index 167abfa6f37d..01900908fe31 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -36,6 +36,8 @@ struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu)
 static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf,
                                           struct cacheinfo *sib_leaf)
 {
+       if (!this_leaf || !sib_leaf)
+               return false;
        /*
         * For non DT/ACPI systems, assume unique level 1 caches,
         * system-wide shared caches for all other levels. This will be used
@@ -75,7 +77,7 @@ bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y)
        llc_x = per_cpu_cacheinfo_idx(cpu_x, cache_leaves(cpu_x) - 1);
        llc_y = per_cpu_cacheinfo_idx(cpu_y, cache_leaves(cpu_y) - 1);
 
-       return cache_leaves_are_shared(llc_x, llc_y);
+       return false; //cache_leaves_are_shared(llc_x, llc_y);
 }
 
 #ifdef CONFIG_OF
Sudeep Holla July 1, 2022, 11:11 a.m. UTC | #24
On Thu, Jun 30, 2022 at 10:07:49PM +0000, Conor.Dooley@microchip.com wrote:
> 
> 
> On 30/06/2022 21:21, Sudeep Holla wrote:
> > On Thu, Jun 30, 2022 at 08:13:55PM +0000, Conor.Dooley@microchip.com wrote:
> >>
> >> I didn't have the time to go digging into things, but the following
> >> macro looked odd:
> >> #define per_cpu_cacheinfo_idx(cpu, idx)		\
> >> 				(per_cpu_cacheinfo(cpu) + (idx))
> >> Maybe it is just badly named, but is this getting the per_cpu_cacheinfo
> >> and then incrementing intentionally, or is it meant to get the
> >> per_cpu_cacheinfo of cpu + idx?
> > 
> > OK, basically per_cpu_cacheinfo(cpu) get the information for a cpu
> > while per_cpu_cacheinfo_idx(cpu, idx) will fetch the information for a
> > given cpu and given index within the cpu. So we are incrementing the
> > pointer by the index. These work just fine on arm64 platform.
> 
> Right, that's what I figured but wanted to be sure.
>

OK

> > 
> > Not sure if compiler is optimising something as I still can't understand
> > how we can end up with valid llc but fw_token as NULL.
> See idk about that. The following fails to boot.
> index 167abfa6f37d..9d45c37fb004 100644
> --- a/drivers/base/cacheinfo.c
> +++ b/drivers/base/cacheinfo.c
> @@ -36,6 +36,8 @@ struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu)
>  static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf,
>                                            struct cacheinfo *sib_leaf)
>  {
> +       if (!this_leaf || !sib_leaf)
> +               return false;

Did you hit this ?

>         /*
>          * For non DT/ACPI systems, assume unique level 1 caches,
>          * system-wide shared caches for all other levels. This will be used
> @@ -74,8 +76,12 @@ bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y)
>  
>         llc_x = per_cpu_cacheinfo_idx(cpu_x, cache_leaves(cpu_x) - 1);
>         llc_y = per_cpu_cacheinfo_idx(cpu_y, cache_leaves(cpu_y) - 1);
> +       if (!llc_x || !llc_y){
> +               printk("llc was null\n");

Or this ?

> +               return false;
> +       }
>  
> -       return cache_leaves_are_shared(llc_x, llc_y);
> +       return false; //cache_leaves_are_shared(llc_x, llc_y);

Even the above change fails to boot ? Coz you are always returning false here
too.

>  }
>  
>  #ifdef CONFIG_OF
> 
> and this boots:
> 
> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> index 167abfa6f37d..01900908fe31 100644
> --- a/drivers/base/cacheinfo.c
> +++ b/drivers/base/cacheinfo.c
> @@ -36,6 +36,8 @@ struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu)
>  static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf,
>                                            struct cacheinfo *sib_leaf)
>  {
> +       if (!this_leaf || !sib_leaf)
> +               return false;
>         /*
>          * For non DT/ACPI systems, assume unique level 1 caches,
>          * system-wide shared caches for all other levels. This will be used
> @@ -75,7 +77,7 @@ bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y)
>         llc_x = per_cpu_cacheinfo_idx(cpu_x, cache_leaves(cpu_x) - 1);
>         llc_y = per_cpu_cacheinfo_idx(cpu_y, cache_leaves(cpu_y) - 1);
>

You are just missing the checks for llc_x and llc_y and it works which
means llc_x and llc_y is where things are going wrong.
Conor Dooley July 1, 2022, 2:47 p.m. UTC | #25
On 01/07/2022 12:11, Sudeep Holla wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On Thu, Jun 30, 2022 at 10:07:49PM +0000, Conor.Dooley@microchip.com wrote:
>>
>>
>> On 30/06/2022 21:21, Sudeep Holla wrote:
>>> On Thu, Jun 30, 2022 at 08:13:55PM +0000, Conor.Dooley@microchip.com wrote:
>>>>
>>>> I didn't have the time to go digging into things, but the following
>>>> macro looked odd:
>>>> #define per_cpu_cacheinfo_idx(cpu, idx)            \
>>>>                             (per_cpu_cacheinfo(cpu) + (idx))
>>>> Maybe it is just badly named, but is this getting the per_cpu_cacheinfo
>>>> and then incrementing intentionally, or is it meant to get the
>>>> per_cpu_cacheinfo of cpu + idx?
>>>
>>> OK, basically per_cpu_cacheinfo(cpu) get the information for a cpu
>>> while per_cpu_cacheinfo_idx(cpu, idx) will fetch the information for a
>>> given cpu and given index within the cpu. So we are incrementing the
>>> pointer by the index. These work just fine on arm64 platform.
>>
>> Right, that's what I figured but wanted to be sure.
>>
> 
> OK
> 
>>>
>>> Not sure if compiler is optimising something as I still can't understand
>>> how we can end up with valid llc but fw_token as NULL.
>> See idk about that. The following fails to boot.
>> index 167abfa6f37d..9d45c37fb004 100644
>> --- a/drivers/base/cacheinfo.c
>> +++ b/drivers/base/cacheinfo.c
>> @@ -36,6 +36,8 @@ struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu)
>>   static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf,
>>                                             struct cacheinfo *sib_leaf)
>>   {
>> +       if (!this_leaf || !sib_leaf)
>> +               return false;
> 
> Did you hit this ?

Ah I forgot to remove this, I had added it (since I knew a return value
of false was correct) but it was still failing to boot. It was my step
prior to adding the if(!llc...

> 
>>          /*
>>           * For non DT/ACPI systems, assume unique level 1 caches,
>>           * system-wide shared caches for all other levels. This will be used
>> @@ -74,8 +76,12 @@ bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y)
>>
>>          llc_x = per_cpu_cacheinfo_idx(cpu_x, cache_leaves(cpu_x) - 1);
>>          llc_y = per_cpu_cacheinfo_idx(cpu_y, cache_leaves(cpu_y) - 1);
>> +       if (!llc_x || !llc_y){
>> +               printk("llc was null\n");
> 
> Or this ?

This printk never prints out.

> 
>> +               return false;
>> +       }
>>
>> -       return cache_leaves_are_shared(llc_x, llc_y);
>> +       return false; //cache_leaves_are_shared(llc_x, llc_y);
> 
> Even the above change fails to boot ? Coz you are always returning false here
> too.

Correct, fails to boot.

> 
>>   }
>>
>>   #ifdef CONFIG_OF
>>
>> and this boots:
>>
>> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
>> index 167abfa6f37d..01900908fe31 100644
>> --- a/drivers/base/cacheinfo.c
>> +++ b/drivers/base/cacheinfo.c
>> @@ -36,6 +36,8 @@ struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu)
>>   static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf,
>>                                             struct cacheinfo *sib_leaf)
>>   {
>> +       if (!this_leaf || !sib_leaf)
>> +               return false;
>>          /*
>>           * For non DT/ACPI systems, assume unique level 1 caches,
>>           * system-wide shared caches for all other levels. This will be used
>> @@ -75,7 +77,7 @@ bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y)
>>          llc_x = per_cpu_cacheinfo_idx(cpu_x, cache_leaves(cpu_x) - 1);
>>          llc_y = per_cpu_cacheinfo_idx(cpu_y, cache_leaves(cpu_y) - 1);
>>
> 
> You are just missing the checks for llc_x and llc_y and it works which
> means llc_x and llc_y is where things are going wrong.
> 

Sounds about right to me.
diff mbox series

Patch

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 23cb52180ce3..c314c7064397 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -668,7 +668,8 @@  const struct cpumask *cpu_coregroup_mask(int cpu)
 		/* not numa in package, lets use the package siblings */
 		core_mask = &cpu_topology[cpu].core_sibling;
 	}
-	if (cpu_topology[cpu].llc_id != -1) {
+
+	if (last_level_cache_is_valid(cpu)) {
 		if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
 			core_mask = &cpu_topology[cpu].llc_sibling;
 	}
@@ -699,7 +700,7 @@  void update_siblings_masks(unsigned int cpuid)
 	for_each_online_cpu(cpu) {
 		cpu_topo = &cpu_topology[cpu];
 
-		if (cpu_topo->llc_id != -1 && cpuid_topo->llc_id == cpu_topo->llc_id) {
+		if (last_level_cache_is_shared(cpu, cpuid)) {
 			cpumask_set_cpu(cpu, &cpuid_topo->llc_sibling);
 			cpumask_set_cpu(cpuid, &cpu_topo->llc_sibling);
 		}