Message ID | 20241030130840.2890904-1-quic_sibis@quicinc.com (mailing list archive) |
---|---|
Series | qcom: x1e80100: Enable CPUFreq |
[ +CC: Marc, who I think I saw reporting something similar even if I
can't seem to find where right now ]

On Wed, Oct 30, 2024 at 06:38:38PM +0530, Sibi Sankar wrote:
> This series enables CPUFreq support on the X1E SoC using the SCMI perf
> protocol. This was originally part of the RFC: firmware: arm_scmi:
> Qualcomm Vendor Protocol [1]. I've split it up so that this part can
> land earlier. Warnings introduced by the series are fixed by [2]
>
> Sibi Sankar (2):
>   arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes
>   arm64: dts: qcom: x1e80100: Enable cpufreq

I've been running with v6 of these for a while now, without noticing any
issues, and just updated to v7 to be able to provide a Tested-by tag.

I wanted to run a compilation and see how the frequencies varied, but
before I got around to that I just grepped the cpufreq sysfs attributes
for CPU0 four times. And this triggered a reset of the machine (x1e80100
CRD).

The last values output were:

	affected_cpus:0 1 2 3
	cpuinfo_cur_freq:<unknown>
	cpuinfo_max_freq:3417600
	cpuinfo_min_freq:710400
	cpuinfo_transition_latency:30000
	related_cpus:0 1 2 3
	scaling_available_frequencies:710400 806400 998400 1190400 1440000 1670400 1920000 2188800 2515200 2707200 2976000 320
	scaling_available_governors:ondemand userspace performance schedutil
	scaling_cur_freq:806400
	scaling_driver:scmi
	scaling_governor:schedutil
	scaling_max_freq:3417600
	scaling_min_freq:710400
	scaling_setspeed:<unsupported>

Notice the <unknown> current frequency (the previous greps said 710400
and 2515200).

The last thing I see on the serial console, presumably just before the
reset, is:

	[ 196.268025] arm-scmi arm-scmi.0.auto: timed out in resp(caller: do_xfer+0x164/0x564)

I just rebooted and grepped again and it triggered on the first attempt
(cur_freq also said '<unknown>'). Same error in the log, printed when
grepping.

Johan
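
For anyone wanting to repeat the read described above without a shell
one-liner, here is a minimal sketch that dumps every attribute in a
cpufreq policy directory in the same `name:value` form. The helper name
`read_cpufreq_attrs` and the default path are assumptions (the report
only says "cpufreq sysfs attributes for CPU0"), and on an affected
machine running this may well trigger the same reset.

```python
from pathlib import Path


def read_cpufreq_attrs(policy_dir):
    """Return {attribute name: contents} for each regular file in policy_dir.

    Hypothetical helper, not part of any kernel tooling.
    """
    attrs = {}
    for entry in sorted(Path(policy_dir).iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories such as per-policy stats
        try:
            attrs[entry.name] = entry.read_text().strip()
        except OSError as err:
            # Unreadable attributes (permissions, failed query) end up here.
            attrs[entry.name] = f"<error: {err.strerror}>"
    return attrs


if __name__ == "__main__":
    # Assumed location of the CPU0 policy attributes.
    policy = Path("/sys/devices/system/cpu/cpu0/cpufreq")
    if policy.is_dir():
        for name, value in read_cpufreq_attrs(policy).items():
            print(f"{name}:{value}")
```

Taking the directory as a parameter keeps the helper usable against a
copy of the attributes, so it can be tried without touching the real
sysfs files.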
On Fri, 01 Nov 2024 13:00:37 +0000, Johan Hovold <johan@kernel.org> wrote:
>
> [ +CC: Marc, who I think I saw reporting something similar even if I
> can't seem to find where right now ]

It was on IRC.

>
> On Wed, Oct 30, 2024 at 06:38:38PM +0530, Sibi Sankar wrote:
> > This series enables CPUFreq support on the X1E SoC using the SCMI perf
> > protocol. This was originally part of the RFC: firmware: arm_scmi:
> > Qualcomm Vendor Protocol [1]. I've split it up so that this part can
> > land earlier. Warnings introduced by the series are fixed by [2]
> >
> > Sibi Sankar (2):
> >   arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes
> >   arm64: dts: qcom: x1e80100: Enable cpufreq
>
> I've been running with v6 of these for a while now, without noticing any
> issues, and just updated to v7 to be able to provide a Tested-by tag.
>
> I wanted to run a compilation and see how the frequencies varied, but
> before I got around to that I just grepped the cpufreq sysfs attributes
> for CPU0 four times. And this triggered a reset of the machine (x1e80100
> CRD).
>
> The last values output were:
>
> 	affected_cpus:0 1 2 3
> 	cpuinfo_cur_freq:<unknown>
> 	cpuinfo_max_freq:3417600
> 	cpuinfo_min_freq:710400
> 	cpuinfo_transition_latency:30000
> 	related_cpus:0 1 2 3
> 	scaling_available_frequencies:710400 806400 998400 1190400 1440000 1670400 1920000 2188800 2515200 2707200 2976000 320
> 	scaling_available_governors:ondemand userspace performance schedutil
> 	scaling_cur_freq:806400
> 	scaling_driver:scmi
> 	scaling_governor:schedutil
> 	scaling_max_freq:3417600
> 	scaling_min_freq:710400
> 	scaling_setspeed:<unsupported>
>
> Notice the <unknown> current frequency (the previous greps said 710400
> and 2515200).
>
> The last thing I see on the serial console, presumably just before
> the reset, is:
>
> 	[ 196.268025] arm-scmi arm-scmi.0.auto: timed out in resp(caller: do_xfer+0x164/0x564)
>
> I just rebooted and grepped again and it triggered on the first attempt
> (cur_freq also said '<unknown>'). Same error in the log, printed when
> grepping.

I'm seeing similar things indeed. Randomly grepping in cpufreq/policy*
results in hard resets, although I don't get much on the serial
console when that happens. Interestingly, I also see some errors in
dmesg at boot time:

maz@semi-fraudulent:~$ dmesg| grep -i scmi
[ 0.966175] scmi_core: SCMI protocol bus registered
[ 7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport
[ 7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms
[ 7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0
[ 7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled !
[ 7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled.
[ 7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000
[ 8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[ 8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[ 8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[ 8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[ 8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains

All these "Failed" are a bit worrying. Happy to put any theory to the
test.

Thanks,

	M.
On Fri, Nov 01, 2024 at 02:08:24PM +0000, Marc Zyngier wrote:

> I'm seeing similar things indeed. Randomly grepping in cpufreq/policy*
> results in hard resets, although I don't get much on the serial
> console when that happens. Interestingly, I also see some errors in
> dmesg at boot time:
>
> maz@semi-fraudulent:~$ dmesg| grep -i scmi
> [ 0.966175] scmi_core: SCMI protocol bus registered
> [ 7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport
> [ 7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms
> [ 7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0
> [ 7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled !
> [ 7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled.
> [ 7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000
> [ 8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [ 8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [ 8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [ 8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [ 8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains
>
> All these "Failed" are a bit worrying. Happy to put any theory to the
> test.

Yes, those warnings indeed look troubling. Fortunately they appear to be
mostly benign and only indicate that the firmware is reporting duplicate
OPPs, which the kernel is now ignoring without any other side effects
than the warnings.

The side-effects and these remaining warnings are addressed by this
series:

	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/

but I think we should try to make the warnings a bit more informative
(and less scary) by printing something along the lines of:

	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC

instead.

Johan
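
To check how many of these warnings a given boot produced, and for
which frequency, something like the sketch below could be run over
saved dmesg output. It only assumes the message format shown in the
log above; the names `OPP_FAIL` and `summarize_opp_failures` are made
up for illustration. On Linux, errno 16 is EBUSY, which would fit the
duplicate-entry explanation, though the thread itself does not spell
that out.

```python
import re
from collections import Counter

# Matches the "Failed to add opps_by_lvl" message format from the log above.
OPP_FAIL = re.compile(
    r"Failed to add opps_by_lvl at (?P<freq>\d+) "
    r"for (?P<domain>\S+) - ret:(?P<ret>-?\d+)"
)


def summarize_opp_failures(lines):
    """Count failure messages per (domain, frequency, return code)."""
    counts = Counter()
    for line in lines:
        m = OPP_FAIL.search(line)
        if m:
            key = (m["domain"], int(m["freq"]), int(m["ret"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Sample input copied from the dmesg excerpt in this thread.
    sample = [
        "[ 8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16",
        "[ 8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16",
        "[ 8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains",
    ]
    for (domain, freq, ret), n in summarize_opp_failures(sample).items():
        print(f"{domain} {freq} ret={ret} x{n}")
```

Repeated messages for the same domain and frequency collapse into a
single entry with a count, which makes it easy to see whether more
than one OPP is affected.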
On Fri, 01 Nov 2024 14:19:54 +0000, Johan Hovold <johan@kernel.org> wrote:
>
> On Fri, Nov 01, 2024 at 02:08:24PM +0000, Marc Zyngier wrote:
>
> > I'm seeing similar things indeed. Randomly grepping in cpufreq/policy*
> > results in hard resets, although I don't get much on the serial
> > console when that happens. Interestingly, I also see some errors in
> > dmesg at boot time:
> >
> > maz@semi-fraudulent:~$ dmesg| grep -i scmi
> > [ 0.966175] scmi_core: SCMI protocol bus registered
> > [ 7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport
> > [ 7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms
> > [ 7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0
> > [ 7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled !
> > [ 7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled.
> > [ 7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000
> > [ 8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [ 8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [ 8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [ 8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [ 8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains
> >
> > All these "Failed" are a bit worrying. Happy to put any theory to the
> > test.
>
> Yes, those warnings indeed look troubling. Fortunately they appear to be
> mostly benign and only indicate that the firmware is reporting duplicate
> OPPs, which the kernel is now ignoring without any other side effects
> than the warnings.

Right. Not something that would explain the hard reset behaviour then.

>
> The side-effects and these remaining warnings are addressed by this
> series:
>
> 	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/
>
> but I think we should try to make the warnings a bit more informative
> (and less scary) by printing something along the lines of:
>
> 	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC
>
> instead.

Indeed. Seeing [Firmware Bug] has a comforting feeling of
familiarity... :)

I wonder whether the same sort of reset happens on more "commercial"
systems (such as some of the laptops). You'd expect that people look at
the cpufreq stuff closely, and don't see things exploding like we are.

	M.
On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote:
> On Fri, 01 Nov 2024 14:19:54 +0000,
> Johan Hovold <johan@kernel.org> wrote:

> > The side-effects and these remaining warnings are addressed by this
> > series:
> >
> > 	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/
> >
> > but I think we should try to make the warnings a bit more informative
> > (and less scary) by printing something along the lines of:
> >
> > 	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC
> >
> > instead.
>
> Indeed. Seeing [Firmware Bug] has a comforting feeling of
> familiarity... :)
>
> I wonder whether the same sort of reset happens on more "commercial"
> systems (such as some of the laptops). You'd expect that people look at
> the cpufreq stuff closely, and don't see things exploding like we are.

I finally got around to getting my Lenovo ThinkPad T14s to boot (it
refuses to start the kernel when using GRUB, and it's not due to the
known 64 GB memory issue as it only has 32 GB) and can confirm that it
hard resets when accessing the cpufreq sysfs attributes as well.

On the bright side, at least I don't see any warnings due to duplicate
OPPs on this machine (x1e78100, latest UEFI fw).

Johan
On Tue, 05 Nov 2024 16:57:07 +0000, Johan Hovold <johan@kernel.org> wrote:
>
> On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote:
> > On Fri, 01 Nov 2024 14:19:54 +0000,
> > Johan Hovold <johan@kernel.org> wrote:
>
> > > The side-effects and these remaining warnings are addressed by this
> > > series:
> > >
> > > 	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/
> > >
> > > but I think we should try to make the warnings a bit more informative
> > > (and less scary) by printing something along the lines of:
> > >
> > > 	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC
> > >
> > > instead.
> >
> > Indeed. Seeing [Firmware Bug] has a comforting feeling of
> > familiarity... :)
> >
> > I wonder whether the same sort of reset happens on more "commercial"
> > systems (such as some of the laptops). You'd expect that people look at
> > the cpufreq stuff closely, and don't see things exploding like we are.
>
> I finally got around to getting my Lenovo ThinkPad T14s to boot (it
> refuses to start the kernel when using GRUB, and it's not due to the
> known 64 GB memory issue as it only has 32 GB)

<cry>
I know the feeling. My devkit can't use GRUB either, so I added a hook
to the GRUB config to generate EFI scripts that directly execute the
kernel with initrd, dtb, and command line.

This is probably the worst firmware I've seen in a very long while.
</cry>

> and can confirm that it
> hard resets when accessing the cpufreq sysfs attributes as well.

Right. So this also happens on non-abandonware machines.

> On the bright side, at least I don't see any warnings due to duplicate
> OPPs on this machine (x1e78100, latest UEFI fw).

One bug fixed...

	M.