Message ID | 20200629122925.21729-1-andrzej.p@collabora.com (mailing list archive) |
---|---|
Headers | show |
Series | Stop monitoring disabled devices | expand |
Hi Andrzej, I've tested your series with kernelci and there are 3 regressions for the imx6. https://kernelci.org/test/job/thermal/branch/testing/kernel/v5.8-rc3-11-gf5e50bf4d3ef/plan/baseline/ On 29/06/2020 14:29, Andrzej Pietrasiewicz wrote: > A respin of v6 with these changes: > > v6..v7: > - removed duplicate S-o-b line from patch 6/11 > > v5..v6: > - staticized thermal_zone_device_set_mode() (kbuild test robot) > > v4..v5: > > - EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL (Daniel) > - dropped unnecessary thermal_zone_device_enable() in int3400_thermal.c > and in thermal_of.c (Bartlomiej) > > Andrzej Pietrasiewicz (11): > acpi: thermal: Fix error handling in the register function > thermal: Store thermal mode in a dedicated enum > thermal: Add current mode to thermal zone device > thermal: Store device mode in struct thermal_zone_device > thermal: remove get_mode() operation of drivers > thermal: Add mode helpers > thermal: Use mode helpers in drivers > thermal: Explicitly enable non-changing thermal zone devices > thermal: core: Stop polling DISABLED thermal devices > thermal: Simplify or eliminate unnecessary set_mode() methods > thermal: Rename set_mode() to change_mode() > > drivers/acpi/thermal.c | 75 +++++---------- > .../ethernet/chelsio/cxgb4/cxgb4_thermal.c | 8 ++ > .../ethernet/mellanox/mlxsw/core_thermal.c | 91 ++++--------------- > drivers/net/wireless/intel/iwlwifi/mvm/tt.c | 9 +- > drivers/platform/x86/acerhdf.c | 33 +++---- > drivers/platform/x86/intel_mid_thermal.c | 6 ++ > drivers/power/supply/power_supply_core.c | 9 +- > drivers/thermal/armada_thermal.c | 6 ++ > drivers/thermal/da9062-thermal.c | 16 +--- > drivers/thermal/dove_thermal.c | 6 ++ > drivers/thermal/hisi_thermal.c | 6 +- > drivers/thermal/imx_thermal.c | 57 ++++-------- > .../intel/int340x_thermal/int3400_thermal.c | 38 ++------ > .../int340x_thermal/int340x_thermal_zone.c | 5 + > drivers/thermal/intel/intel_pch_thermal.c | 5 + > .../thermal/intel/intel_quark_dts_thermal.c | 34 ++----- > drivers/thermal/intel/intel_soc_dts_iosf.c | 3 + > drivers/thermal/intel/x86_pkg_temp_thermal.c | 6 ++ > drivers/thermal/kirkwood_thermal.c | 7 ++ > drivers/thermal/rcar_thermal.c | 9 +- > drivers/thermal/rockchip_thermal.c | 6 +- > drivers/thermal/spear_thermal.c | 7 ++ > drivers/thermal/sprd_thermal.c | 6 +- > drivers/thermal/st/st_thermal.c | 5 + > drivers/thermal/thermal_core.c | 76 ++++++++++++++-- > drivers/thermal/thermal_of.c | 41 +-------- > drivers/thermal/thermal_sysfs.c | 37 +------- > include/linux/thermal.h | 19 +++- > 28 files changed, 275 insertions(+), 351 deletions(-) > > > base-commit: 9ebcfadb0610322ac537dd7aa5d9cbc2b2894c68 >
Hi Daniel, I am reading the logs and can't find anything specific to thermal. What I can see is "random: crng init done" with large times (~200s) and then e.g. 'auto-login-action timed out after 283 seconds' I'm looking at e.g. https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html Is there anywhere else I can look at? Andrzej W dniu 30.06.2020 o 14:57, Daniel Lezcano pisze: > > Hi Andrzej, > > I've tested your series with kernelci and there are 3 regressions for > the imx6. > > https://kernelci.org/test/job/thermal/branch/testing/kernel/v5.8-rc3-11-gf5e50bf4d3ef/plan/baseline/ > > > On 29/06/2020 14:29, Andrzej Pietrasiewicz wrote: >> A respin of v6 with these changes: >> >> v6..v7: >> - removed duplicate S-o-b line from patch 6/11 >> >> v5..v6: >> - staticized thermal_zone_device_set_mode() (kbuild test robot) >> >> v4..v5: >> >> - EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL (Daniel) >> - dropped unnecessary thermal_zone_device_enable() in int3400_thermal.c >> and in thermal_of.c (Bartlomiej) >> >> Andrzej Pietrasiewicz (11): >> acpi: thermal: Fix error handling in the register function >> thermal: Store thermal mode in a dedicated enum >> thermal: Add current mode to thermal zone device >> thermal: Store device mode in struct thermal_zone_device >> thermal: remove get_mode() operation of drivers >> thermal: Add mode helpers >> thermal: Use mode helpers in drivers >> thermal: Explicitly enable non-changing thermal zone devices >> thermal: core: Stop polling DISABLED thermal devices >> thermal: Simplify or eliminate unnecessary set_mode() methods >> thermal: Rename set_mode() to change_mode() >> >> drivers/acpi/thermal.c | 75 +++++---------- >> .../ethernet/chelsio/cxgb4/cxgb4_thermal.c | 8 ++ >> .../ethernet/mellanox/mlxsw/core_thermal.c | 91 ++++--------------- >> drivers/net/wireless/intel/iwlwifi/mvm/tt.c | 9 +- >> drivers/platform/x86/acerhdf.c | 33 +++---- >> drivers/platform/x86/intel_mid_thermal.c | 6 ++ >> drivers/power/supply/power_supply_core.c | 9 +- >> drivers/thermal/armada_thermal.c | 6 ++ >> drivers/thermal/da9062-thermal.c | 16 +--- >> drivers/thermal/dove_thermal.c | 6 ++ >> drivers/thermal/hisi_thermal.c | 6 +- >> drivers/thermal/imx_thermal.c | 57 ++++-------- >> .../intel/int340x_thermal/int3400_thermal.c | 38 ++------ >> .../int340x_thermal/int340x_thermal_zone.c | 5 + >> drivers/thermal/intel/intel_pch_thermal.c | 5 + >> .../thermal/intel/intel_quark_dts_thermal.c | 34 ++----- >> drivers/thermal/intel/intel_soc_dts_iosf.c | 3 + >> drivers/thermal/intel/x86_pkg_temp_thermal.c | 6 ++ >> drivers/thermal/kirkwood_thermal.c | 7 ++ >> drivers/thermal/rcar_thermal.c | 9 +- >> drivers/thermal/rockchip_thermal.c | 6 +- >> drivers/thermal/spear_thermal.c | 7 ++ >> drivers/thermal/sprd_thermal.c | 6 +- >> drivers/thermal/st/st_thermal.c | 5 + >> drivers/thermal/thermal_core.c | 76 ++++++++++++++-- >> drivers/thermal/thermal_of.c | 41 +-------- >> drivers/thermal/thermal_sysfs.c | 37 +------- >> include/linux/thermal.h | 19 +++- >> 28 files changed, 275 insertions(+), 351 deletions(-) >> >> >> base-commit: 9ebcfadb0610322ac537dd7aa5d9cbc2b2894c68 >> > >
On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote: > Hi Daniel, > > I am reading the logs and can't find anything specific to thermal. > > What I can see is > > "random: crng init done" > > with large times (~200s) and then e.g. > > 'auto-login-action timed out after 283 seconds' > > I'm looking at e.g. > https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html > > > Is there anywhere else I can look at? No unfortunately :/ I have a Meerkat96 which uses the same sensor as the imx6q. I'll give a try to see if I can reproduce and let you know.
Hi Daniel, W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze: > On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote: >> Hi Daniel, >> >> I am reading the logs and can't find anything specific to thermal. >> >> What I can see is >> >> "random: crng init done" >> >> with large times (~200s) and then e.g. >> >> 'auto-login-action timed out after 283 seconds' >> >> I'm looking at e.g. >> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html >> f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11? PATCH 11/11 renames a method and the code compiles, so it seems unlikely that this is causing problems. One should never say never, though ;) The reported failure is not due to some test failing but rather due to timeout logging into the test system. Could it be that there is some other problem? Andrzej
On 30/06/2020 17:29, Andrzej Pietrasiewicz wrote: > Hi Daniel, > > W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze: >> On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote: >>> Hi Daniel, >>> >>> I am reading the logs and can't find anything specific to thermal. >>> >>> What I can see is >>> >>> "random: crng init done" >>> >>> with large times (~200s) and then e.g. >>> >>> 'auto-login-action timed out after 283 seconds' >>> >>> I'm looking at e.g. >>> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html >>> >>> > > f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11? > PATCH 11/11 renames a method and the code compiles, so it seems > unlikely that this is causing problems. One should never say never, > though ;) The sha1 is just the HEAD for the kernel reference. The regression happens with your series, somewhere. > The reported failure is not due to some test failing but rather due > to timeout logging into the test system. Could it be that there is > some other problem? I did reproduce: v5.8-rc3 + series => imx6 hang at boot time v5.8-rc3 => imx6 boots correctly
Hi, W dniu 30.06.2020 o 17:53, Daniel Lezcano pisze: > On 30/06/2020 17:29, Andrzej Pietrasiewicz wrote: >> Hi Daniel, >> >> W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze: >>> On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote: >>>> Hi Daniel, >>>> >>>> I am reading the logs and can't find anything specific to thermal. >>>> >>>> What I can see is >>>> >>>> "random: crng init done" >>>> >>>> with large times (~200s) and then e.g. >>>> >>>> 'auto-login-action timed out after 283 seconds' >>>> >>>> I'm looking at e.g. >>>> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html >>>> >>>> >> >> f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11? >> PATCH 11/11 renames a method and the code compiles, so it seems >> unlikely that this is causing problems. One should never say never, >> though ;) > > The sha1 is just the HEAD for the kernel reference. The regression > happens with your series, somewhere. > >> The reported failure is not due to some test failing but rather due >> to timeout logging into the test system. Could it be that there is >> some other problem? > > I did reproduce: > > v5.8-rc3 + series => imx6 hang at boot time > v5.8-rc3 => imx6 boots correctly > I kindly ask for a bisect. Andrzej
On 30/06/2020 18:56, Andrzej Pietrasiewicz wrote: > Hi, > > W dniu 30.06.2020 o 17:53, Daniel Lezcano pisze: >> On 30/06/2020 17:29, Andrzej Pietrasiewicz wrote: >>> Hi Daniel, >>> >>> W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze: >>>> On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote: >>>>> Hi Daniel, >>>>> >>>>> I am reading the logs and can't find anything specific to thermal. >>>>> >>>>> What I can see is >>>>> >>>>> "random: crng init done" >>>>> >>>>> with large times (~200s) and then e.g. >>>>> >>>>> 'auto-login-action timed out after 283 seconds' >>>>> >>>>> I'm looking at e.g. >>>>> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html >>>>> >>>>> >>>>> >>> >>> f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11? >>> PATCH 11/11 renames a method and the code compiles, so it seems >>> unlikely that this is causing problems. One should never say never, >>> though ;) >> >> The sha1 is just the HEAD for the kernel reference. The regression >> happens with your series, somewhere. >> >>> The reported failure is not due to some test failing but rather due >>> to timeout logging into the test system. Could it be that there is >>> some other problem? >> >> I did reproduce: >> >> v5.8-rc3 + series => imx6 hang at boot time >> v5.8-rc3 => imx6 boots correctly >> > > I kindly ask for a bisect. I will give a try but it is a very long process as the board is running on kernelci. I was not able to reproduce it on imx7 despite it is the same sensor :/
Hi, W dniu 30.06.2020 o 20:33, Daniel Lezcano pisze: > On 30/06/2020 18:56, Andrzej Pietrasiewicz wrote: >> Hi, >> >> W dniu 30.06.2020 o 17:53, Daniel Lezcano pisze: >>> On 30/06/2020 17:29, Andrzej Pietrasiewicz wrote: >>>> Hi Daniel, >>>> >>>> W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze: >>>>> On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote: >>>>>> Hi Daniel, >>>>>> >>>>>> I am reading the logs and can't find anything specific to thermal. >>>>>> >>>>>> What I can see is >>>>>> >>>>>> "random: crng init done" >>>>>> >>>>>> with large times (~200s) and then e.g. >>>>>> >>>>>> 'auto-login-action timed out after 283 seconds' >>>>>> >>>>>> I'm looking at e.g. >>>>>> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html >>>>>> >>>>>> >>>>>> >>>> >>>> f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11? >>>> PATCH 11/11 renames a method and the code compiles, so it seems >>>> unlikely that this is causing problems. One should never say never, >>>> though ;) >>> >>> The sha1 is just the HEAD for the kernel reference. The regression >>> happens with your series, somewhere. >>> >>>> The reported failure is not due to some test failing but rather due >>>> to timeout logging into the test system. Could it be that there is >>>> some other problem? >>> >>> I did reproduce: >>> >>> v5.8-rc3 + series => imx6 hang at boot time >>> v5.8-rc3 => imx6 boots correctly >>> What did you reproduce? Timeout logging in to the test system or a "real" failure of a test? >> >> I kindly ask for a bisect. > > I will give a try but it is a very long process as the board is running > on kernelci. > > I was not able to reproduce it on imx7 despite it is the same sensor :/ > > Could it be that the thermal sensors somehow contribute to entropy and after the series is applied on some machines it takes more time to gather enough entropy? Andrzej
On 01/07/2020 12:23, Andrzej Pietrasiewicz wrote: > Hi, > [ ... ] >>>> >>>> I did reproduce: >>>> >>>> v5.8-rc3 + series => imx6 hang at boot time >>>> v5.8-rc3 => imx6 boots correctly >>>> > > What did you reproduce? Timeout logging in to the test system or a > "real" failure of a test? Timeout logging. Boot hangs. >>> I kindly ask for a bisect. >> >> I will give a try but it is a very long process as the board is running >> on kernelci. >> >> I was not able to reproduce it on imx7 despite it is the same sensor :/ >> >> > > Could it be that the thermal sensors somehow contribute to entropy and > after > the series is applied on some machines it takes more time to gather enough > entropy? I assume you are talking about the entropy for random? It would be really surprising if it is the case. The message appears asynchronously, I believe the boot flow is stuck in a mutex.
On 01/07/2020 12:23, Andrzej Pietrasiewicz wrote: > Hi, > > W dniu 30.06.2020 o 20:33, Daniel Lezcano pisze: >> On 30/06/2020 18:56, Andrzej Pietrasiewicz wrote: >>> Hi, >>> >>> W dniu 30.06.2020 o 17:53, Daniel Lezcano pisze: >>>> On 30/06/2020 17:29, Andrzej Pietrasiewicz wrote: >>>>> Hi Daniel, >>>>> >>>>> W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze: >>>>>> On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote: >>>>>>> Hi Daniel, >>>>>>> >>>>>>> I am reading the logs and can't find anything specific to thermal. >>>>>>> >>>>>>> What I can see is >>>>>>> >>>>>>> "random: crng init done" >>>>>>> >>>>>>> with large times (~200s) and then e.g. >>>>>>> >>>>>>> 'auto-login-action timed out after 283 seconds' >>>>>>> >>>>>>> I'm looking at e.g. >>>>>>> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11? >>>>> PATCH 11/11 renames a method and the code compiles, so it seems >>>>> unlikely that this is causing problems. One should never say never, >>>>> though ;) >>>> >>>> The sha1 is just the HEAD for the kernel reference. The regression >>>> happens with your series, somewhere. >>>> >>>>> The reported failure is not due to some test failing but rather due >>>>> to timeout logging into the test system. Could it be that there is >>>>> some other problem? >>>> >>>> I did reproduce: >>>> >>>> v5.8-rc3 + series => imx6 hang at boot time >>>> v5.8-rc3 => imx6 boots correctly So finally I succeeded to reproduce it on my imx7 locally. The sensor was failing to initialize for another reason related to the legacy cooling device, this is why it is not appearing on the imx7. I can now git-bisect :)
Hi Daniel, <snip> >>>>> >>>>> I did reproduce: >>>>> >>>>> v5.8-rc3 + series => imx6 hang at boot time >>>>> v5.8-rc3 => imx6 boots correctly > > So finally I succeeded to reproduce it on my imx7 locally. The sensor > was failing to initialize for another reason related to the legacy > cooling device, this is why it is not appearing on the imx7. > > I can now git-bisect :) > That would be very kind of you, thank you! Andrzej
On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote: > Hi Daniel, > > <snip> > >>>>>> >>>>>> I did reproduce: >>>>>> >>>>>> v5.8-rc3 + series => imx6 hang at boot time >>>>>> v5.8-rc3 => imx6 boots correctly >> >> So finally I succeeded to reproduce it on my imx7 locally. The sensor >> was failing to initialize for another reason related to the legacy >> cooling device, this is why it is not appearing on the imx7. >> >> I can now git-bisect :) >> > > That would be very kind of you, thank you! Author: Andrzej Pietrasiewicz <andrzej.p@collabora.com> Date: Mon Jun 29 14:29:21 2020 +0200 thermal: Use mode helpers in drivers Use thermal_zone_device_{en|dis}able() and thermal_zone_device_is_enabled().
On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote: > Hi Daniel, > > <snip> > >>>>>> >>>>>> I did reproduce: >>>>>> >>>>>> v5.8-rc3 + series => imx6 hang at boot time >>>>>> v5.8-rc3 => imx6 boots correctly >> >> So finally I succeeded to reproduce it on my imx7 locally. The sensor >> was failing to initialize for another reason related to the legacy >> cooling device, this is why it is not appearing on the imx7. >> >> I can now git-bisect :) >> > > That would be very kind of you, thank you! With the lock correctness option enabled: [ 4.179223] imx_thermal tempmon: Extended Commercial CPU temperature grade - max:105C critical:100C passive:95C [ 4.189557] [ 4.191060] ============================================ [ 4.196378] WARNING: possible recursive locking detected [ 4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted [ 4.207102] -------------------------------------------- [ 4.212421] kworker/0:3/54 is trying to acquire lock: [ 4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: thermal_zone_device_is_enabled+0x18/0x34 [ 4.225777] [ 4.225777] but task is already holding lock: [ 4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: thermal_zone_get_temp+0x38/0x6c [ 4.239121] [ 4.239121] other info that might help us debug this: [ 4.245655] Possible unsafe locking scenario: [ 4.245655] [ 4.251579] CPU0 [ 4.254031] ---- [ 4.256481] lock(&tz->lock); [ 4.259544] lock(&tz->lock); [ 4.262608] [ 4.262608] *** DEADLOCK *** [ 4.262608] [ 4.268533] May be due to missing lock nesting notation [ 4.268533] [ 4.275329] 4 locks held by kworker/0:3/54: [ 4.279517] #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x224/0x808 [ 4.288241] #1: ca075f10 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work+0x224/0x808 [ 4.296787] #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x30/0x140 [ 4.304468] #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: thermal_zone_get_temp+0x38/0x6c [ 4.312408] [ 4.312408] stack backtrace: [ 4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted 5.8.0-rc3-00011-gf5e50bf4d3ef #42 [ 4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree) [ 4.330809] Workqueue: events deferred_probe_work_func [ 4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>] (show_stack+0x10/0x14) [ 4.343734] [<c030c580>] (show_stack) from [<c079d7d8>] (dump_stack+0xe8/0x114) [ 4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>] (__lock_acquire+0xbfc/0x2cb4) [ 4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>] (lock_acquire+0xf4/0x4e4) [ 4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>] (__mutex_lock+0xb0/0xaa8) [ 4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>] (mutex_lock_nested+0x1c/0x24) [ 4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>] (thermal_zone_device_is_enabled+0x18/0x34) [ 4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from [<c0d9da90>] (imx_get_temp+0x30/0x208) [ 4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>] (thermal_zone_get_temp+0x4c/0x6c) [ 4.409640] [<c0d97484>] (thermal_zone_get_temp) from [<c0d93df0>] (thermal_zone_device_update+0x8c/0x258) [ 4.419310] [<c0d93df0>] (thermal_zone_device_update) from [<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88) [ 4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from [<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578) [ 4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>] (platform_drv_probe+0x48/0x98) [ 4.447622] [<c0a78388>] (platform_drv_probe) from [<c0a7603c>] (really_probe+0x218/0x348) [ 4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>] (driver_probe_device+0x5c/0xb4) [ 4.464098] [<c0a76278>] (driver_probe_device) from [<c0a743bc>] (bus_for_each_drv+0x58/0xb8) [ 4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>] (__device_attach+0xd4/0x140) [ 4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>] (bus_probe_device+0x88/0x90) [ 4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>] (deferred_probe_work_func+0x68/0x98) [ 4.498088] [<c0a75564>] (deferred_probe_work_func) from [<c0369988>] (process_one_work+0x2e0/0x808) [ 4.507237] [<c0369988>] (process_one_work) from [<c036a150>] (worker_thread+0x2a0/0x59c) [ 4.515432] [<c036a150>] (worker_thread) from [<c0372208>] (kthread+0x16c/0x178) [ 4.522843] [<c0372208>] (kthread) from [<c0300174>] (ret_from_fork+0x14/0x20) [ 4.530074] Exception stack(0xca075fb0 to 0xca075ff8) [ 4.535138] 5fa0: 00000000 00000000 00000000 00000000 [ 4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
Hi, W dniu 02.07.2020 o 19:01, Daniel Lezcano pisze: > On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote: >> Hi Daniel, >> >> <snip> >> >>>>>>> >>>>>>> I did reproduce: >>>>>>> >>>>>>> v5.8-rc3 + series => imx6 hang at boot time >>>>>>> v5.8-rc3 => imx6 boots correctly >>> >>> So finally I succeeded to reproduce it on my imx7 locally. The sensor >>> was failing to initialize for another reason related to the legacy >>> cooling device, this is why it is not appearing on the imx7. >>> >>> I can now git-bisect :) >>> >> >> That would be very kind of you, thank you! > > With the lock correctness option enabled: > > [ 4.179223] imx_thermal tempmon: Extended Commercial CPU temperature > grade - max:105C critical:100C passive:95C > [ 4.189557] > [ 4.191060] ============================================ > [ 4.196378] WARNING: possible recursive locking detected > [ 4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted > [ 4.207102] -------------------------------------------- > [ 4.212421] kworker/0:3/54 is trying to acquire lock: > [ 4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: > thermal_zone_device_is_enabled+0x18/0x34 > [ 4.225777] > [ 4.225777] but task is already holding lock: > [ 4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: > thermal_zone_get_temp+0x38/0x6c > [ 4.239121] > [ 4.239121] other info that might help us debug this: > [ 4.245655] Possible unsafe locking scenario: > [ 4.245655] > [ 4.251579] CPU0 > [ 4.254031] ---- > [ 4.256481] lock(&tz->lock); > [ 4.259544] lock(&tz->lock); > [ 4.262608] > [ 4.262608] *** DEADLOCK *** > [ 4.262608] > [ 4.268533] May be due to missing lock nesting notation > [ 4.268533] > [ 4.275329] 4 locks held by kworker/0:3/54: > [ 4.279517] #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, at: > process_one_work+0x224/0x808 > [ 4.288241] #1: ca075f10 (deferred_probe_work){+.+.}-{0:0}, at: > process_one_work+0x224/0x808 > [ 4.296787] #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at: > __device_attach+0x30/0x140 > [ 4.304468] #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: > thermal_zone_get_temp+0x38/0x6c > [ 4.312408] > [ 4.312408] stack backtrace: > [ 4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted > 5.8.0-rc3-00011-gf5e50bf4d3ef #42 > [ 4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree) > [ 4.330809] Workqueue: events deferred_probe_work_func > [ 4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>] > (show_stack+0x10/0x14) > [ 4.343734] [<c030c580>] (show_stack) from [<c079d7d8>] > (dump_stack+0xe8/0x114) > [ 4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>] > (__lock_acquire+0xbfc/0x2cb4) > [ 4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>] > (lock_acquire+0xf4/0x4e4) > [ 4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>] > (__mutex_lock+0xb0/0xaa8) > [ 4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>] > (mutex_lock_nested+0x1c/0x24) > [ 4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>] > (thermal_zone_device_is_enabled+0x18/0x34) > [ 4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from > [<c0d9da90>] (imx_get_temp+0x30/0x208) > [ 4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>] > (thermal_zone_get_temp+0x4c/0x6c) > [ 4.409640] [<c0d97484>] (thermal_zone_get_temp) from [<c0d93df0>] > (thermal_zone_device_update+0x8c/0x258) > [ 4.419310] [<c0d93df0>] (thermal_zone_device_update) from > [<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88) > [ 4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from > [<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578) > [ 4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>] > (platform_drv_probe+0x48/0x98) > [ 4.447622] [<c0a78388>] (platform_drv_probe) from [<c0a7603c>] > (really_probe+0x218/0x348) > [ 4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>] > (driver_probe_device+0x5c/0xb4) > [ 4.464098] [<c0a76278>] (driver_probe_device) from [<c0a743bc>] > (bus_for_each_drv+0x58/0xb8) > [ 4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>] > (__device_attach+0xd4/0x140) > [ 4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>] > (bus_probe_device+0x88/0x90) > [ 4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>] > (deferred_probe_work_func+0x68/0x98) > [ 4.498088] [<c0a75564>] (deferred_probe_work_func) from [<c0369988>] > (process_one_work+0x2e0/0x808) > [ 4.507237] [<c0369988>] (process_one_work) from [<c036a150>] > (worker_thread+0x2a0/0x59c) > [ 4.515432] [<c036a150>] (worker_thread) from [<c0372208>] > (kthread+0x16c/0x178) > [ 4.522843] [<c0372208>] (kthread) from [<c0300174>] > (ret_from_fork+0x14/0x20) > [ 4.530074] Exception stack(0xca075fb0 to 0xca075ff8) > [ 4.535138] 5fa0: 00000000 > 00000000 00000000 00000000 > [ 4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000 > 00000000 00000000 00000000 > [ 4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 > > > Thanks! That confirms your suspicions. So the reason is that ->get_temp() is called while the mutex is held and thermal_zone_device_is_enabled() wants to take the same mutex. Is adding a comment to thermal_zone_device_is_enabled() to never call it while the mutex is held and adding another version of it which does not take the mutex ok? Andrzej
On 02/07/2020 19:19, Andrzej Pietrasiewicz wrote: > Hi, > > W dniu 02.07.2020 o 19:01, Daniel Lezcano pisze: >> On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote: >>> Hi Daniel, >>> >>> <snip> >>> >>>>>>>> >>>>>>>> I did reproduce: >>>>>>>> >>>>>>>> v5.8-rc3 + series => imx6 hang at boot time >>>>>>>> v5.8-rc3 => imx6 boots correctly >>>> >>>> So finally I succeeded to reproduce it on my imx7 locally. The sensor >>>> was failing to initialize for another reason related to the legacy >>>> cooling device, this is why it is not appearing on the imx7. >>>> >>>> I can now git-bisect :) >>>> >>> >>> That would be very kind of you, thank you! >> >> With the lock correctness option enabled: >> >> [ 4.179223] imx_thermal tempmon: Extended Commercial CPU temperature >> grade - max:105C critical:100C passive:95C >> [ 4.189557] >> [ 4.191060] ============================================ >> [ 4.196378] WARNING: possible recursive locking detected >> [ 4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted >> [ 4.207102] -------------------------------------------- >> [ 4.212421] kworker/0:3/54 is trying to acquire lock: >> [ 4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: >> thermal_zone_device_is_enabled+0x18/0x34 >> [ 4.225777] >> [ 4.225777] but task is already holding lock: >> [ 4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: >> thermal_zone_get_temp+0x38/0x6c >> [ 4.239121] >> [ 4.239121] other info that might help us debug this: >> [ 4.245655] Possible unsafe locking scenario: >> [ 4.245655] >> [ 4.251579] CPU0 >> [ 4.254031] ---- >> [ 4.256481] lock(&tz->lock); >> [ 4.259544] lock(&tz->lock); >> [ 4.262608] >> [ 4.262608] *** DEADLOCK *** >> [ 4.262608] >> [ 4.268533] May be due to missing lock nesting notation >> [ 4.268533] >> [ 4.275329] 4 locks held by kworker/0:3/54: >> [ 4.279517] #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, at: >> process_one_work+0x224/0x808 >> [ 4.288241] #1: ca075f10 (deferred_probe_work){+.+.}-{0:0}, at: >> process_one_work+0x224/0x808 >> [ 4.296787] #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at: >> __device_attach+0x30/0x140 >> [ 4.304468] #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: >> thermal_zone_get_temp+0x38/0x6c >> [ 4.312408] >> [ 4.312408] stack backtrace: >> [ 4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted >> 5.8.0-rc3-00011-gf5e50bf4d3ef #42 >> [ 4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree) >> [ 4.330809] Workqueue: events deferred_probe_work_func >> [ 4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>] >> (show_stack+0x10/0x14) >> [ 4.343734] [<c030c580>] (show_stack) from [<c079d7d8>] >> (dump_stack+0xe8/0x114) >> [ 4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>] >> (__lock_acquire+0xbfc/0x2cb4) >> [ 4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>] >> (lock_acquire+0xf4/0x4e4) >> [ 4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>] >> (__mutex_lock+0xb0/0xaa8) >> [ 4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>] >> (mutex_lock_nested+0x1c/0x24) >> [ 4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>] >> (thermal_zone_device_is_enabled+0x18/0x34) >> [ 4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from >> [<c0d9da90>] (imx_get_temp+0x30/0x208) >> [ 4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>] >> (thermal_zone_get_temp+0x4c/0x6c) >> [ 4.409640] [<c0d97484>] (thermal_zone_get_temp) from [<c0d93df0>] >> (thermal_zone_device_update+0x8c/0x258) >> [ 4.419310] [<c0d93df0>] (thermal_zone_device_update) from >> [<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88) >> [ 4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from >> [<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578) >> [ 4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>] >> (platform_drv_probe+0x48/0x98) >> [ 4.447622] [<c0a78388>] (platform_drv_probe) from [<c0a7603c>] >> (really_probe+0x218/0x348) >> [ 4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>] >> (driver_probe_device+0x5c/0xb4) >> [ 4.464098] [<c0a76278>] (driver_probe_device) from [<c0a743bc>] >> (bus_for_each_drv+0x58/0xb8) >> [ 4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>] >> (__device_attach+0xd4/0x140) >> [ 4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>] >> (bus_probe_device+0x88/0x90) >> [ 4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>] >> (deferred_probe_work_func+0x68/0x98) >> [ 4.498088] [<c0a75564>] (deferred_probe_work_func) from [<c0369988>] >> (process_one_work+0x2e0/0x808) >> [ 4.507237] [<c0369988>] (process_one_work) from [<c036a150>] >> (worker_thread+0x2a0/0x59c) >> [ 4.515432] [<c036a150>] (worker_thread) from [<c0372208>] >> (kthread+0x16c/0x178) >> [ 4.522843] [<c0372208>] (kthread) from [<c0300174>] >> (ret_from_fork+0x14/0x20) >> [ 4.530074] Exception stack(0xca075fb0 to 0xca075ff8) >> [ 4.535138] 5fa0: 00000000 >> 00000000 00000000 00000000 >> [ 4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000 >> 00000000 00000000 00000000 >> [ 4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013 >> 00000000 >> >> >> > > Thanks! > > That confirms your suspicions. > > So the reason is that ->get_temp() is called while the mutex is held and > thermal_zone_device_is_enabled() wants to take the same mutex. Yes, that's correct. > Is adding a comment to thermal_zone_device_is_enabled() to never call > it while the mutex is held and adding another version of it which does > not take the mutex ok? The thermal_zone_device_is_enabled() is only used in two places, acpi and this imx driver, and given: 1. as soon as the mutex is released, there is no guarantee the thermal zone won't be changed right after, the lock is pointless, thus the information also. 2. from a design point of view, I don't see why a driver should know if a thermal zone is disabled or not It would make sense to end with this function and do not give the different drivers an opportunity to access this information. Why not add change_mode for the acpi in order to enable or disable the events and for imx_thermal use irq_enabled flag instead of the thermal zone mode? Moreover it is very unclear why this function is needed in imx_get_temp(), and I suspect we should be able to get rid of it.
On 02/07/2020 19:49, Daniel Lezcano wrote: [ ... ] >> Thanks! >> >> That confirms your suspicions. >> >> So the reason is that ->get_temp() is called while the mutex is held and >> thermal_zone_device_is_enabled() wants to take the same mutex. > > Yes, that's correct. > >> Is adding a comment to thermal_zone_device_is_enabled() to never call >> it while the mutex is held and adding another version of it which does >> not take the mutex ok? > > The thermal_zone_device_is_enabled() is only used in two places, acpi > and this imx driver, and given: > > 1. as soon as the mutex is released, there is no guarantee the thermal > zone won't be changed right after, the lock is pointless, thus the > information also. > > 2. from a design point of view, I don't see why a driver should know if > a thermal zone is disabled or not > > It would make sense to end with this function and do not give the > different drivers an opportunity to access this information. > > Why not add change_mode for the acpi in order to enable or disable the > events and for imx_thermal use irq_enabled flag instead of the thermal > zone mode? Moreover it is very unclear why this function is needed in > imx_get_temp(), and I suspect we should be able to get rid of it. If you agree with that you can send a patch on top of your series so I can test it fixes the imx platform.
On Thu, 2020-07-02 at 19:49 +0200, Daniel Lezcano wrote: > On 02/07/2020 19:19, Andrzej Pietrasiewicz wrote: > > Hi, > > > > W dniu 02.07.2020 o 19:01, Daniel Lezcano pisze: > > > On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote: > > > > Hi Daniel, > > > > > > > > <snip> > > > > > > > > > > > > > > > > > > > > > > I did reproduce: > > > > > > > > > > > > > > > > > > v5.8-rc3 + series => imx6 hang at boot time > > > > > > > > > v5.8-rc3 => imx6 boots correctly > > > > > > > > > > So finally I succeeded to reproduce it on my imx7 locally. > > > > > The sensor > > > > > was failing to initialize for another reason related to the > > > > > legacy > > > > > cooling device, this is why it is not appearing on the imx7. > > > > > > > > > > I can now git-bisect :) > > > > > > > > > > > > > That would be very kind of you, thank you! > > > > > > With the lock correctness option enabled: > > > > > > [ 4.179223] imx_thermal tempmon: Extended Commercial CPU > > > temperature > > > grade - max:105C critical:100C passive:95C > > > [ 4.189557] > > > [ 4.191060] ============================================ > > > [ 4.196378] WARNING: possible recursive locking detected > > > [ 4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted > > > [ 4.207102] -------------------------------------------- > > > [ 4.212421] kworker/0:3/54 is trying to acquire lock: > > > [ 4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: > > > thermal_zone_device_is_enabled+0x18/0x34 > > > [ 4.225777] > > > [ 4.225777] but task is already holding lock: > > > [ 4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: > > > thermal_zone_get_temp+0x38/0x6c > > > [ 4.239121] > > > [ 4.239121] other info that might help us debug this: > > > [ 4.245655] Possible unsafe locking scenario: > > > [ 4.245655] > > > [ 4.251579] CPU0 > > > [ 4.254031] ---- > > > [ 4.256481] lock(&tz->lock); > > > [ 4.259544] lock(&tz->lock); > > > [ 4.262608] > > > [ 4.262608] *** DEADLOCK *** > > > [ 4.262608] > > > [ 4.268533] May be due to missing lock nesting notation > > > [ 4.268533] > > > [ 4.275329] 4 locks held by kworker/0:3/54: > > > [ 4.279517] #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, > > > at: > > > process_one_work+0x224/0x808 > > > [ 4.288241] #1: ca075f10 (deferred_probe_work){+.+.}-{0:0}, > > > at: > > > process_one_work+0x224/0x808 > > > [ 4.296787] #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at: > > > __device_attach+0x30/0x140 > > > [ 4.304468] #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at: > > > thermal_zone_get_temp+0x38/0x6c > > > [ 4.312408] > > > [ 4.312408] stack backtrace: > > > [ 4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted > > > 5.8.0-rc3-00011-gf5e50bf4d3ef #42 > > > [ 4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree) > > > [ 4.330809] Workqueue: events deferred_probe_work_func > > > [ 4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>] > > > (show_stack+0x10/0x14) > > > [ 4.343734] [<c030c580>] (show_stack) from [<c079d7d8>] > > > (dump_stack+0xe8/0x114) > > > [ 4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>] > > > (__lock_acquire+0xbfc/0x2cb4) > > > [ 4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>] > > > (lock_acquire+0xf4/0x4e4) > > > [ 4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>] > > > (__mutex_lock+0xb0/0xaa8) > > > [ 4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>] > > > (mutex_lock_nested+0x1c/0x24) > > > [ 4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>] > > > (thermal_zone_device_is_enabled+0x18/0x34) > > > [ 4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from > > > [<c0d9da90>] (imx_get_temp+0x30/0x208) > > > [ 4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>] > > > (thermal_zone_get_temp+0x4c/0x6c) > > > [ 4.409640] [<c0d97484>] (thermal_zone_get_temp) from > > > [<c0d93df0>] > > > (thermal_zone_device_update+0x8c/0x258) > > > [ 4.419310] [<c0d93df0>] (thermal_zone_device_update) from > > > [<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88) > > > [ 4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from > > > [<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578) > > > [ 4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>] > > > (platform_drv_probe+0x48/0x98) > > > [ 4.447622] [<c0a78388>] (platform_drv_probe) from > > > [<c0a7603c>] > > > (really_probe+0x218/0x348) > > > [ 4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>] > > > (driver_probe_device+0x5c/0xb4) > > > [ 4.464098] [<c0a76278>] (driver_probe_device) from > > > [<c0a743bc>] > > > (bus_for_each_drv+0x58/0xb8) > > > [ 4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>] > > > (__device_attach+0xd4/0x140) > > > [ 4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>] > > > (bus_probe_device+0x88/0x90) > > > [ 4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>] > > > (deferred_probe_work_func+0x68/0x98) > > > [ 4.498088] [<c0a75564>] (deferred_probe_work_func) from > > > [<c0369988>] > > > (process_one_work+0x2e0/0x808) > > > [ 4.507237] [<c0369988>] (process_one_work) from [<c036a150>] > > > (worker_thread+0x2a0/0x59c) > > > [ 4.515432] [<c036a150>] (worker_thread) from [<c0372208>] > > > (kthread+0x16c/0x178) > > > [ 4.522843] [<c0372208>] (kthread) from [<c0300174>] > > > (ret_from_fork+0x14/0x20) > > > [ 4.530074] Exception stack(0xca075fb0 to 0xca075ff8) > > > [ 4.535138] 5fa0: 00000000 > > > 00000000 00000000 00000000 > > > [ 4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000 > > > 00000000 00000000 00000000 > > > [ 4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013 > > > 00000000 > > > > > > > > > > > > > Thanks! > > > > That confirms your suspicions. > > > > So the reason is that ->get_temp() is called while the mutex is > > held and > > thermal_zone_device_is_enabled() wants to take the same mutex. > > Yes, that's correct. > > > Is adding a comment to thermal_zone_device_is_enabled() to never > > call > > it while the mutex is held and adding another version of it which > > does > > not take the mutex ok? > > The thermal_zone_device_is_enabled() is only used in two places, acpi > and this imx driver, and given: > > 1. as soon as the mutex is released, there is no guarantee the > thermal > zone won't be changed right after, the lock is pointless, thus the > information also. > > 2. from a design point of view, I don't see why a driver should know > if > a thermal zone is disabled or not > > It would make sense to end with this function and do not give the > different drivers an opportunity to access this information. I agree. > > Why not add change_mode for the acpi in order to enable or disable > the > events thermal_zone_device_is_enabled() is invoked in acpi thermal driver because we only want to do thermal_zone_device_update() when the acpi thermal zone is enabled. As thermal_zone_device_update() can handle a disabled thermal zone now, we can just remove the check. thanks, rui > and for imx_thermal use irq_enabled flag instead of the thermal > zone mode? Moreover it is very unclear why this function is needed in > imx_get_temp(), and I suspect we should be able to get rid of it. > >
On 03/07/2020 03:49, Zhang Rui wrote: > On Thu, 2020-07-02 at 19:49 +0200, Daniel Lezcano wrote: [ ... ] >>> So the reason is that ->get_temp() is called while the mutex is >>> held and >>> thermal_zone_device_is_enabled() wants to take the same mutex. >> >> Yes, that's correct. >> >>> Is adding a comment to thermal_zone_device_is_enabled() to never >>> call >>> it while the mutex is held and adding another version of it which >>> does >>> not take the mutex ok? >> >> The thermal_zone_device_is_enabled() is only used in two places, acpi >> and this imx driver, and given: >> >> 1. as soon as the mutex is released, there is no guarantee the >> thermal >> zone won't be changed right after, the lock is pointless, thus the >> information also. >> >> 2. from a design point of view, I don't see why a driver should know >> if >> a thermal zone is disabled or not >> >> It would make sense to end with this function and do not give the >> different drivers an opportunity to access this information. > > I agree. >> >> Why not add change_mode for the acpi in order to enable or disable >> the >> events > > thermal_zone_device_is_enabled() is invoked in acpi thermal driver > because we only want to do thermal_zone_device_update() when the acpi > thermal zone is enabled. > > As thermal_zone_device_update() can handle a disabled thermal zone now, > we can just remove the check. Ah yes, good point!
Hi, W dniu 03.07.2020 o 08:38, Daniel Lezcano pisze: > On 03/07/2020 03:49, Zhang Rui wrote: >> On Thu, 2020-07-02 at 19:49 +0200, Daniel Lezcano wrote: > > [ ... ] > >>>> So the reason is that ->get_temp() is called while the mutex is >>>> held and >>>> thermal_zone_device_is_enabled() wants to take the same mutex. >>> >>> Yes, that's correct. >>> >>>> Is adding a comment to thermal_zone_device_is_enabled() to never >>>> call >>>> it while the mutex is held and adding another version of it which >>>> does >>>> not take the mutex ok? >>> >>> The thermal_zone_device_is_enabled() is only used in two places, acpi >>> and this imx driver, and given: >>> >>> 1. as soon as the mutex is released, there is no guarantee the >>> thermal >>> zone won't be changed right after, the lock is pointless, thus the >>> information also. >>> >>> 2. from a design point of view, I don't see why a driver should know >>> if >>> a thermal zone is disabled or not >>> >>> It would make sense to end with this function and do not give the >>> different drivers an opportunity to access this information. >> >> I agree. >>> >>> Why not add change_mode for the acpi in order to enable or disable >>> the >>> events >> >> thermal_zone_device_is_enabled() is invoked in acpi thermal driver >> because we only want to do thermal_zone_device_update() when the acpi >> thermal zone is enabled. >> >> As thermal_zone_device_update() can handle a disabled thermal zone now, >> we can just remove the check. > > Ah yes, good point! > > > I sent a short series with fixes. Daniel, can you kindly test it? Regards, Andrzej
On 03/07/2020 12:45, Andrzej Pietrasiewicz wrote: > Hi, > > W dniu 03.07.2020 o 08:38, Daniel Lezcano pisze: >> On 03/07/2020 03:49, Zhang Rui wrote: >>> On Thu, 2020-07-02 at 19:49 +0200, Daniel Lezcano wrote: >> >> [ ... ] >> >>>>> So the reason is that ->get_temp() is called while the mutex is >>>>> held and >>>>> thermal_zone_device_is_enabled() wants to take the same mutex. >>>> >>>> Yes, that's correct. >>>> >>>>> Is adding a comment to thermal_zone_device_is_enabled() to never >>>>> call >>>>> it while the mutex is held and adding another version of it which >>>>> does >>>>> not take the mutex ok? >>>> >>>> The thermal_zone_device_is_enabled() is only used in two places, acpi >>>> and this imx driver, and given: >>>> >>>> 1. as soon as the mutex is released, there is no guarantee the >>>> thermal >>>> zone won't be changed right after, the lock is pointless, thus the >>>> information also. >>>> >>>> 2. from a design point of view, I don't see why a driver should know >>>> if >>>> a thermal zone is disabled or not >>>> >>>> It would make sense to end with this function and do not give the >>>> different drivers an opportunity to access this information. >>> >>> I agree. >>>> >>>> Why not add change_mode for the acpi in order to enable or disable >>>> the >>>> events >>> >>> thermal_zone_device_is_enabled() is invoked in acpi thermal driver >>> because we only want to do thermal_zone_device_update() when the acpi >>> thermal zone is enabled. >>> >>> As thermal_zone_device_update() can handle a disabled thermal zone now, >>> we can just remove the check. >> >> Ah yes, good point! >> >> >> > > I sent a short series with fixes. Daniel, can you kindly test it? I confirm the i.MX is now correctly booting with the thermal zone temperature available.