mbox series

[V7,0/2] qcom: x1e80100: Enable CPUFreq

Message ID 20241030130840.2890904-1-quic_sibis@quicinc.com (mailing list archive)
Headers show
Series qcom: x1e80100: Enable CPUFreq | expand

Message

Sibi Sankar Oct. 30, 2024, 1:08 p.m. UTC
This series enables CPUFreq support on the X1E SoC using the SCMI perf
protocol. This was originally part of the RFC: firmware: arm_scmi:
Qualcomm Vendor Protocol [1]. I've split it up so that this part can
land earlier. Warnings Introduced by the series are fixed by [2]

V6:
* Drop patches that are already picked up.
* Rebase to the latest next-20241029.

V5:
* Fix build error reported by kernel test robot by adding 64BIT requirement
  to COMPILE_TEST
* Pick Rbs

V4:
* Move val, flag and chan to local loop variables. [Jassi]
* Add cpucp mailbox to the MAINTAINERS file. [Jassi]
* Move to core_initcall. [Konrad]
* Skip explicitly setting txdone_irq/txdone_poll to zero. [Konrad]

V3:
* Fix Maintainer info in cpucp mbox bindings. [Bjorn]
* Fix copyright info in cpucp driver. [Bjorn]
* Drop unused APSS_CPUCP_TX_MBOX_IDR, value init and drv_data. [Bjorn/Dmitry]
* Convert to lower case hex. [Bjorn]
* Convert irq and dev to local variables. [Bjorn]
* Replace for and if with for_each_set_bit. [Bjorn]
* Document the need for spinlock. [Bjorn]
* Add space after " for aesthetics. [Bjorn]
* Fix err in calc and add fixes tag. [Bjorn]
* Include io.h and re-order platform_device.h
* Use GENMASK_ULL to generate APSS_CPUCP_RX_MBOX_CMD_MASK.

V2:
* Fix series version number [Rob]
* Pickup Rbs from Dimitry and Rob.
* Use power-domain instead of clocks. [Sudeep/Ulf]
* Rename sram sub-nodes according to schema. [Dmitry]
* Use BIT() instead of manual shift. [Dmitry]
* Define RX_MBOX_CMD to account for chan calculation. [Dmitry]
* Clear the bit instead of the entire status within the spinlock. [Dmitry]
* Use dev_err_probe instead. [Dmitry]
* Drop superfluous error message while handling errors from get_irq. [Dmitry]
* Use devm_mbox_controller_register and drop remove path. [Dmitry]
* Define TX_MBOX_CMD to account for chan calculation.
* Use cpucp->dev in probe path for conformity.

RFC V1:
* Use x1e80100 as the fallback for future SoCs using the cpucp-mbox
  controller. [Krzysztoff/Konrad/Rob]
* Use chan->lock and chan->cl to detect if the channel is no longer
  Available. [Dmitry]
* Use BIT() instead of using manual shifts. [Dmitry]
* Don't use integer as a pointer value. [Dmitry]
* Allow it to default to of_mbox_index_xlate. [Dmitry]
* Use devm_of_iomap. [Dmitry]
* Use module_platform_driver instead of module init/exit. [Dmitry]
* Get channel number using mailbox core (like other drivers) and
  further simplify the driver by dropping setup_mbox func.

[1]: https://lore.kernel.org/lkml/20240117173458.2312669-1-quic_sibis@quicinc.com/#r
[2]: https://lore.kernel.org/lkml/20241030125512.2884761-1-quic_sibis@quicinc.com/

Other relevant Links:
https://lore.kernel.org/lkml/be2e475a-349f-4e98-b238-262dd7117a4e@linaro.org/

base: next-20241029

Sibi Sankar (2):
  arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes
  arm64: dts: qcom: x1e80100: Enable cpufreq

 arch/arm64/boot/dts/qcom/x1e80100.dtsi | 89 +++++++++++++++++++-------
 1 file changed, 65 insertions(+), 24 deletions(-)

Comments

Johan Hovold Nov. 1, 2024, 1 p.m. UTC | #1
[ +CC: Marc, who I think I saw reporting something similar even if I can
  seem to find where right now ]

On Wed, Oct 30, 2024 at 06:38:38PM +0530, Sibi Sankar wrote:
> This series enables CPUFreq support on the X1E SoC using the SCMI perf
> protocol. This was originally part of the RFC: firmware: arm_scmi:
> Qualcomm Vendor Protocol [1]. I've split it up so that this part can
> land earlier. Warnings Introduced by the series are fixed by [2]

 Sibi Sankar (2):
>   arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes
>   arm64: dts: qcom: x1e80100: Enable cpufreq

I've been running with v6 of these for a while now, without noticing any
issues, and just updated to v7 to be able to provide a Tested-by tag.

I wanted to run a compilation and see how the frequencies varied, but
before I got around to that I just grepped the cpufreq sysfs attributes
for CPU0 four times. And this triggered a reset of the machine (x1e80100
CRD).

The last values output were:

	affected_cpus:0 1 2 3
	cpuinfo_cur_freq:<unknown>
	cpuinfo_max_freq:3417600
	cpuinfo_min_freq:710400
	cpuinfo_transition_latency:30000
	related_cpus:0 1 2 3
	scaling_available_frequencies:710400 806400 998400 1190400 1440000 1670400 1920000 2188800 2515200 2707200 2976000 320
	scaling_available_governors:ondemand userspace performance schedutil
	scaling_cur_freq:806400
	scaling_driver:scmi
	scaling_governor:schedutil
	scaling_max_freq:3417600
	scaling_min_freq:710400
	scaling_setspeed:<unsupported>

Notice the <unknown> current frequency (the previous greps said 710400
and 2515200).

The last thing I see on the serial console, presumably just before
the reset, is:

	[  196.268025] arm-scmi arm-scmi.0.auto: timed out in resp(caller: do_xfer+0x164/0x564)

I just rebooted and grepped again and it triggered on the first attempt
(cur_freq also said '<unknown>'). Same error in the log, printed when
grepping.

Johan
Marc Zyngier Nov. 1, 2024, 2:08 p.m. UTC | #2
On Fri, 01 Nov 2024 13:00:37 +0000,
Johan Hovold <johan@kernel.org> wrote:
> 
> [ +CC: Marc, who I think I saw reporting something similar even if I can
>   seem to find where right now ]

It was on IRC.

> 
> On Wed, Oct 30, 2024 at 06:38:38PM +0530, Sibi Sankar wrote:
> > This series enables CPUFreq support on the X1E SoC using the SCMI perf
> > protocol. This was originally part of the RFC: firmware: arm_scmi:
> > Qualcomm Vendor Protocol [1]. I've split it up so that this part can
> > land earlier. Warnings Introduced by the series are fixed by [2]
> 
>  Sibi Sankar (2):
> >   arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes
> >   arm64: dts: qcom: x1e80100: Enable cpufreq
> 
> I've been running with v6 of these for a while now, without noticing any
> issues, and just updated to v7 to be able to provide a Tested-by tag.
> 
> I wanted to run a compilation and see how the frequencies varied, but
> before I got around to that I just grepped the cpufreq sysfs attributes
> for CPU0 four times. And this triggered a reset of the machine (x1e80100
> CRD).
> 
> The last values output were:
> 
> 	affected_cpus:0 1 2 3
> 	cpuinfo_cur_freq:<unknown>
> 	cpuinfo_max_freq:3417600
> 	cpuinfo_min_freq:710400
> 	cpuinfo_transition_latency:30000
> 	related_cpus:0 1 2 3
> 	scaling_available_frequencies:710400 806400 998400 1190400 1440000 1670400 1920000 2188800 2515200 2707200 2976000 320
> 	scaling_available_governors:ondemand userspace performance schedutil
> 	scaling_cur_freq:806400
> 	scaling_driver:scmi
> 	scaling_governor:schedutil
> 	scaling_max_freq:3417600
> 	scaling_min_freq:710400
> 	scaling_setspeed:<unsupported>
> 
> Notice the <unknown> current frequency (the previous greps said 710400
> and 2515200).
> 
> The last thing I see on the serial console, presumably just before
> the reset, is:
> 
> 	[  196.268025] arm-scmi arm-scmi.0.auto: timed out in resp(caller: do_xfer+0x164/0x564)
> 
> I just rebooted and grepped again and it triggered on the first attempt
> (cur_freq also said '<unknown>'). Same error in the log, printed when
> grepping.

I'm seeing similar things indeed. Randomly grepping in cpufreq/policy*
results in hard resets, although I don't get much on the serial
console when that happens. Interestingly, I also see some errors in
dmesg at boot time:

maz@semi-fraudulent:~$ dmesg| grep -i scmi
[    0.966175] scmi_core: SCMI protocol bus registered
[    7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport
[    7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms
[    7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0
[    7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled !
[    7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled.
[    7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000
[    8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[    8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[    8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[    8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[    8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains

All these "Failed" are a bit worrying. Happy to put any theory to the
test.

Thanks,

	M.
Johan Hovold Nov. 1, 2024, 2:19 p.m. UTC | #3
On Fri, Nov 01, 2024 at 02:08:24PM +0000, Marc Zyngier wrote:

> I'm seeing similar things indeed. Randomly grepping in cpufreq/policy*
> results in hard resets, although I don't get much on the serial
> console when that happens. Interestingly, I also see some errors in
> dmesg at boot time:
> 
> maz@semi-fraudulent:~$ dmesg| grep -i scmi
> [    0.966175] scmi_core: SCMI protocol bus registered
> [    7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport
> [    7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms
> [    7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0
> [    7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled !
> [    7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled.
> [    7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000
> [    8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [    8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [    8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [    8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [    8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains
> 
> All these "Failed" are a bit worrying. Happy to put any theory to the
> test.

Yes, those warnings indeed look troubling. Fortunately they appear to be
mostly benign and only indicate that the firmware is reporting duplicate
OPPs, which the kernel is now ignoring without any other side effects
than the warnings.

The side-effects and these remaining warnings are addressed by this
series:

	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/

but I think we should try to make the warnings a bit more informative
(and less scary) by printing something along the lines of:

	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC

instead.

Johan
Marc Zyngier Nov. 1, 2024, 2:43 p.m. UTC | #4
On Fri, 01 Nov 2024 14:19:54 +0000,
Johan Hovold <johan@kernel.org> wrote:
> 
> On Fri, Nov 01, 2024 at 02:08:24PM +0000, Marc Zyngier wrote:
> 
> > I'm seeing similar things indeed. Randomly grepping in cpufreq/policy*
> > results in hard resets, although I don't get much on the serial
> > console when that happens. Interestingly, I also see some errors in
> > dmesg at boot time:
> > 
> > maz@semi-fraudulent:~$ dmesg| grep -i scmi
> > [    0.966175] scmi_core: SCMI protocol bus registered
> > [    7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport
> > [    7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms
> > [    7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0
> > [    7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled !
> > [    7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled.
> > [    7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000
> > [    8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [    8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [    8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [    8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [    8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains
> > 
> > All these "Failed" are a bit worrying. Happy to put any theory to the
> > test.
> 
> Yes, those warnings indeed look troubling. Fortunately they appear to be
> mostly benign and only indicate that the firmware is reporting duplicate
> OPPs, which the kernel is now ignoring without any other side effects
> than the warnings.

Right. Not something that would explain the hard reset behaviour then.

> 
> The side-effects and these remaining warnings are addressed by this
> series:
> 
> 	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/
> 
> but I think we should try to make the warnings a bit more informative
> (and less scary) by printing something along the lines of:
> 
> 	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC
> 
> instead.

Indeed. Seeing [Firmware Bug] has a comforting feeling of
familiarity... :)

I wonder whether the same sort of reset happen on more "commercial"
systems (such as some of the laptops). You expect that people look at
the cpufreq stuff closely, and don't see things exploding like we are.

	M.
Johan Hovold Nov. 5, 2024, 4:57 p.m. UTC | #5
On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote:
> On Fri, 01 Nov 2024 14:19:54 +0000,
> Johan Hovold <johan@kernel.org> wrote:

> > The side-effects and these remaining warnings are addressed by this
> > series:
> > 
> > 	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/
> > 
> > but I think we should try to make the warnings a bit more informative
> > (and less scary) by printing something along the lines of:
> > 
> > 	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC
> > 
> > instead.
> 
> Indeed. Seeing [Firmware Bug] has a comforting feeling of
> familiarity... :)
> 
> I wonder whether the same sort of reset happen on more "commercial"
> systems (such as some of the laptops). You expect that people look at
> the cpufreq stuff closely, and don't see things exploding like we are.

I finally got around to getting my Lenovo ThinkPad T14s to boot (it
refuses to start the kernel when using GRUB, and it's not due to the
known 64 GB memory issue as it only has 32 GB) and can confirm that it
hard resets when accessing the cpufreq sysfs attributes as well.

On the bright side, at least I don't see any warnings due to duplicate
OPPs on this machine (x1e78100, latest UEFI fw).

Johan
Marc Zyngier Nov. 5, 2024, 6:12 p.m. UTC | #6
On Tue, 05 Nov 2024 16:57:07 +0000,
Johan Hovold <johan@kernel.org> wrote:
> 
> On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote:
> > On Fri, 01 Nov 2024 14:19:54 +0000,
> > Johan Hovold <johan@kernel.org> wrote:
> 
> > > The side-effects and these remaining warnings are addressed by this
> > > series:
> > > 
> > > 	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/
> > > 
> > > but I think we should try to make the warnings a bit more informative
> > > (and less scary) by printing something along the lines of:
> > > 
> > > 	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC
> > > 
> > > instead.
> > 
> > Indeed. Seeing [Firmware Bug] has a comforting feeling of
> > familiarity... :)
> > 
> > I wonder whether the same sort of reset happen on more "commercial"
> > systems (such as some of the laptops). You expect that people look at
> > the cpufreq stuff closely, and don't see things exploding like we are.
> 
> I finally got around to getting my Lenovo ThinkPad T14s to boot (it
> refuses to start the kernel when using GRUB, and it's not due to the
> known 64 GB memory issue as it only has 32 GB)

<cry>
I know the feeling. My devkit can't use GRUB either, so I added a
hook to the GRUB config to generate EFI scripts that directly execute
the kernel with initrd, dtb, and command line.

This is probably the worse firmware I've seen in a very long while.
</cry>

> and can confirm that it
> hard resets when accessing the cpufreq sysfs attributes as well.

Right. So this also happens on non-abandonware machines.

> On the bright side, at least I don't see any warnings due to duplicate
> OPPs on this machine (x1e78100, latest UEFI fw).

One bug fixed...

	M.