mbox series

[v7,00/11] Stop monitoring disabled devices

Message ID 20200629122925.21729-1-andrzej.p@collabora.com (mailing list archive)
Headers show
Series Stop monitoring disabled devices | expand

Message

Andrzej Pietrasiewicz June 29, 2020, 12:29 p.m. UTC
A respin of v6 with these changes:

v6..v7:
- removed duplicate S-o-b line from patch 6/11

v5..v6:
- staticized thermal_zone_device_set_mode() (kbuild test robot)

v4..v5:

- EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL (Daniel)
- dropped unnecessary thermal_zone_device_enable() in int3400_thermal.c
and in thermal_of.c (Bartlomiej)

Andrzej Pietrasiewicz (11):
  acpi: thermal: Fix error handling in the register function
  thermal: Store thermal mode in a dedicated enum
  thermal: Add current mode to thermal zone device
  thermal: Store device mode in struct thermal_zone_device
  thermal: remove get_mode() operation of drivers
  thermal: Add mode helpers
  thermal: Use mode helpers in drivers
  thermal: Explicitly enable non-changing thermal zone devices
  thermal: core: Stop polling DISABLED thermal devices
  thermal: Simplify or eliminate unnecessary set_mode() methods
  thermal: Rename set_mode() to change_mode()

 drivers/acpi/thermal.c                        | 75 +++++----------
 .../ethernet/chelsio/cxgb4/cxgb4_thermal.c    |  8 ++
 .../ethernet/mellanox/mlxsw/core_thermal.c    | 91 ++++---------------
 drivers/net/wireless/intel/iwlwifi/mvm/tt.c   |  9 +-
 drivers/platform/x86/acerhdf.c                | 33 +++----
 drivers/platform/x86/intel_mid_thermal.c      |  6 ++
 drivers/power/supply/power_supply_core.c      |  9 +-
 drivers/thermal/armada_thermal.c              |  6 ++
 drivers/thermal/da9062-thermal.c              | 16 +---
 drivers/thermal/dove_thermal.c                |  6 ++
 drivers/thermal/hisi_thermal.c                |  6 +-
 drivers/thermal/imx_thermal.c                 | 57 ++++--------
 .../intel/int340x_thermal/int3400_thermal.c   | 38 ++------
 .../int340x_thermal/int340x_thermal_zone.c    |  5 +
 drivers/thermal/intel/intel_pch_thermal.c     |  5 +
 .../thermal/intel/intel_quark_dts_thermal.c   | 34 ++-----
 drivers/thermal/intel/intel_soc_dts_iosf.c    |  3 +
 drivers/thermal/intel/x86_pkg_temp_thermal.c  |  6 ++
 drivers/thermal/kirkwood_thermal.c            |  7 ++
 drivers/thermal/rcar_thermal.c                |  9 +-
 drivers/thermal/rockchip_thermal.c            |  6 +-
 drivers/thermal/spear_thermal.c               |  7 ++
 drivers/thermal/sprd_thermal.c                |  6 +-
 drivers/thermal/st/st_thermal.c               |  5 +
 drivers/thermal/thermal_core.c                | 76 ++++++++++++++--
 drivers/thermal/thermal_of.c                  | 41 +--------
 drivers/thermal/thermal_sysfs.c               | 37 +-------
 include/linux/thermal.h                       | 19 +++-
 28 files changed, 275 insertions(+), 351 deletions(-)


base-commit: 9ebcfadb0610322ac537dd7aa5d9cbc2b2894c68

Comments

Daniel Lezcano June 30, 2020, 12:57 p.m. UTC | #1
Hi Andrzej,

I've tested your series with kernelci and there are 3 regressions for
the imx6.

https://kernelci.org/test/job/thermal/branch/testing/kernel/v5.8-rc3-11-gf5e50bf4d3ef/plan/baseline/


On 29/06/2020 14:29, Andrzej Pietrasiewicz wrote:
> A respin of v6 with these changes:
> 
> v6..v7:
> - removed duplicate S-o-b line from patch 6/11
> 
> v5..v6:
> - staticized thermal_zone_device_set_mode() (kbuild test robot)
> 
> v4..v5:
> 
> - EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL (Daniel)
> - dropped unnecessary thermal_zone_device_enable() in int3400_thermal.c
> and in thermal_of.c (Bartlomiej)
> 
> Andrzej Pietrasiewicz (11):
>   acpi: thermal: Fix error handling in the register function
>   thermal: Store thermal mode in a dedicated enum
>   thermal: Add current mode to thermal zone device
>   thermal: Store device mode in struct thermal_zone_device
>   thermal: remove get_mode() operation of drivers
>   thermal: Add mode helpers
>   thermal: Use mode helpers in drivers
>   thermal: Explicitly enable non-changing thermal zone devices
>   thermal: core: Stop polling DISABLED thermal devices
>   thermal: Simplify or eliminate unnecessary set_mode() methods
>   thermal: Rename set_mode() to change_mode()
> 
>  drivers/acpi/thermal.c                        | 75 +++++----------
>  .../ethernet/chelsio/cxgb4/cxgb4_thermal.c    |  8 ++
>  .../ethernet/mellanox/mlxsw/core_thermal.c    | 91 ++++---------------
>  drivers/net/wireless/intel/iwlwifi/mvm/tt.c   |  9 +-
>  drivers/platform/x86/acerhdf.c                | 33 +++----
>  drivers/platform/x86/intel_mid_thermal.c      |  6 ++
>  drivers/power/supply/power_supply_core.c      |  9 +-
>  drivers/thermal/armada_thermal.c              |  6 ++
>  drivers/thermal/da9062-thermal.c              | 16 +---
>  drivers/thermal/dove_thermal.c                |  6 ++
>  drivers/thermal/hisi_thermal.c                |  6 +-
>  drivers/thermal/imx_thermal.c                 | 57 ++++--------
>  .../intel/int340x_thermal/int3400_thermal.c   | 38 ++------
>  .../int340x_thermal/int340x_thermal_zone.c    |  5 +
>  drivers/thermal/intel/intel_pch_thermal.c     |  5 +
>  .../thermal/intel/intel_quark_dts_thermal.c   | 34 ++-----
>  drivers/thermal/intel/intel_soc_dts_iosf.c    |  3 +
>  drivers/thermal/intel/x86_pkg_temp_thermal.c  |  6 ++
>  drivers/thermal/kirkwood_thermal.c            |  7 ++
>  drivers/thermal/rcar_thermal.c                |  9 +-
>  drivers/thermal/rockchip_thermal.c            |  6 +-
>  drivers/thermal/spear_thermal.c               |  7 ++
>  drivers/thermal/sprd_thermal.c                |  6 +-
>  drivers/thermal/st/st_thermal.c               |  5 +
>  drivers/thermal/thermal_core.c                | 76 ++++++++++++++--
>  drivers/thermal/thermal_of.c                  | 41 +--------
>  drivers/thermal/thermal_sysfs.c               | 37 +-------
>  include/linux/thermal.h                       | 19 +++-
>  28 files changed, 275 insertions(+), 351 deletions(-)
> 
> 
> base-commit: 9ebcfadb0610322ac537dd7aa5d9cbc2b2894c68
>
Andrzej Pietrasiewicz June 30, 2020, 1:43 p.m. UTC | #2
Hi Daniel,

I am reading the logs and can't find anything specific to thermal.

What I can see is

"random: crng init done"

with large times (~200s) and then e.g.

'auto-login-action timed out after 283 seconds'

I'm looking at e.g. 
https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html

Is there anywhere else I can look at?

Andrzej

W dniu 30.06.2020 o 14:57, Daniel Lezcano pisze:
> 
> Hi Andrzej,
> 
> I've tested your series with kernelci and there are 3 regressions for
> the imx6.
> 
> https://kernelci.org/test/job/thermal/branch/testing/kernel/v5.8-rc3-11-gf5e50bf4d3ef/plan/baseline/
> 
> 
> On 29/06/2020 14:29, Andrzej Pietrasiewicz wrote:
>> A respin of v6 with these changes:
>>
>> v6..v7:
>> - removed duplicate S-o-b line from patch 6/11
>>
>> v5..v6:
>> - staticized thermal_zone_device_set_mode() (kbuild test robot)
>>
>> v4..v5:
>>
>> - EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL (Daniel)
>> - dropped unnecessary thermal_zone_device_enable() in int3400_thermal.c
>> and in thermal_of.c (Bartlomiej)
>>
>> Andrzej Pietrasiewicz (11):
>>    acpi: thermal: Fix error handling in the register function
>>    thermal: Store thermal mode in a dedicated enum
>>    thermal: Add current mode to thermal zone device
>>    thermal: Store device mode in struct thermal_zone_device
>>    thermal: remove get_mode() operation of drivers
>>    thermal: Add mode helpers
>>    thermal: Use mode helpers in drivers
>>    thermal: Explicitly enable non-changing thermal zone devices
>>    thermal: core: Stop polling DISABLED thermal devices
>>    thermal: Simplify or eliminate unnecessary set_mode() methods
>>    thermal: Rename set_mode() to change_mode()
>>
>>   drivers/acpi/thermal.c                        | 75 +++++----------
>>   .../ethernet/chelsio/cxgb4/cxgb4_thermal.c    |  8 ++
>>   .../ethernet/mellanox/mlxsw/core_thermal.c    | 91 ++++---------------
>>   drivers/net/wireless/intel/iwlwifi/mvm/tt.c   |  9 +-
>>   drivers/platform/x86/acerhdf.c                | 33 +++----
>>   drivers/platform/x86/intel_mid_thermal.c      |  6 ++
>>   drivers/power/supply/power_supply_core.c      |  9 +-
>>   drivers/thermal/armada_thermal.c              |  6 ++
>>   drivers/thermal/da9062-thermal.c              | 16 +---
>>   drivers/thermal/dove_thermal.c                |  6 ++
>>   drivers/thermal/hisi_thermal.c                |  6 +-
>>   drivers/thermal/imx_thermal.c                 | 57 ++++--------
>>   .../intel/int340x_thermal/int3400_thermal.c   | 38 ++------
>>   .../int340x_thermal/int340x_thermal_zone.c    |  5 +
>>   drivers/thermal/intel/intel_pch_thermal.c     |  5 +
>>   .../thermal/intel/intel_quark_dts_thermal.c   | 34 ++-----
>>   drivers/thermal/intel/intel_soc_dts_iosf.c    |  3 +
>>   drivers/thermal/intel/x86_pkg_temp_thermal.c  |  6 ++
>>   drivers/thermal/kirkwood_thermal.c            |  7 ++
>>   drivers/thermal/rcar_thermal.c                |  9 +-
>>   drivers/thermal/rockchip_thermal.c            |  6 +-
>>   drivers/thermal/spear_thermal.c               |  7 ++
>>   drivers/thermal/sprd_thermal.c                |  6 +-
>>   drivers/thermal/st/st_thermal.c               |  5 +
>>   drivers/thermal/thermal_core.c                | 76 ++++++++++++++--
>>   drivers/thermal/thermal_of.c                  | 41 +--------
>>   drivers/thermal/thermal_sysfs.c               | 37 +-------
>>   include/linux/thermal.h                       | 19 +++-
>>   28 files changed, 275 insertions(+), 351 deletions(-)
>>
>>
>> base-commit: 9ebcfadb0610322ac537dd7aa5d9cbc2b2894c68
>>
> 
>
Daniel Lezcano June 30, 2020, 2:53 p.m. UTC | #3
On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote:
> Hi Daniel,
> 
> I am reading the logs and can't find anything specific to thermal.
> 
> What I can see is
> 
> "random: crng init done"
> 
> with large times (~200s) and then e.g.
> 
> 'auto-login-action timed out after 283 seconds'
> 
> I'm looking at e.g.
> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html
> 
> 
> Is there anywhere else I can look at?

No unfortunately :/

I have a Meerkat96 which uses the same sensor as the imx6q.

I'll give a try to see if I can reproduce and let you know.
Andrzej Pietrasiewicz June 30, 2020, 3:29 p.m. UTC | #4
Hi Daniel,

W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze:
> On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote:
>> Hi Daniel,
>>
>> I am reading the logs and can't find anything specific to thermal.
>>
>> What I can see is
>>
>> "random: crng init done"
>>
>> with large times (~200s) and then e.g.
>>
>> 'auto-login-action timed out after 283 seconds'
>>
>> I'm looking at e.g.
>> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html
>>

f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11?
PATCH 11/11 renames a method and the code compiles, so it seems
unlikely that this is causing problems. One should never say never,
though ;)

The reported failure is not due to some test failing but rather due
to timeout logging into the test system. Could it be that there is
some other problem?

Andrzej
Daniel Lezcano June 30, 2020, 3:53 p.m. UTC | #5
On 30/06/2020 17:29, Andrzej Pietrasiewicz wrote:
> Hi Daniel,
> 
> W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze:
>> On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote:
>>> Hi Daniel,
>>>
>>> I am reading the logs and can't find anything specific to thermal.
>>>
>>> What I can see is
>>>
>>> "random: crng init done"
>>>
>>> with large times (~200s) and then e.g.
>>>
>>> 'auto-login-action timed out after 283 seconds'
>>>
>>> I'm looking at e.g.
>>> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html
>>>
>>>
> 
> f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11?
> PATCH 11/11 renames a method and the code compiles, so it seems
> unlikely that this is causing problems. One should never say never,
> though ;)

The sha1 is just the HEAD for the kernel reference. The regression
happens with your series, somewhere.

> The reported failure is not due to some test failing but rather due
> to timeout logging into the test system. Could it be that there is
> some other problem?

I did reproduce:

v5.8-rc3 + series => imx6 hang at boot time
v5.8-rc3 => imx6 boots correctly
Andrzej Pietrasiewicz June 30, 2020, 4:56 p.m. UTC | #6
Hi,

W dniu 30.06.2020 o 17:53, Daniel Lezcano pisze:
> On 30/06/2020 17:29, Andrzej Pietrasiewicz wrote:
>> Hi Daniel,
>>
>> W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze:
>>> On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote:
>>>> Hi Daniel,
>>>>
>>>> I am reading the logs and can't find anything specific to thermal.
>>>>
>>>> What I can see is
>>>>
>>>> "random: crng init done"
>>>>
>>>> with large times (~200s) and then e.g.
>>>>
>>>> 'auto-login-action timed out after 283 seconds'
>>>>
>>>> I'm looking at e.g.
>>>> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html
>>>>
>>>>
>>
>> f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11?
>> PATCH 11/11 renames a method and the code compiles, so it seems
>> unlikely that this is causing problems. One should never say never,
>> though ;)
> 
> The sha1 is just the HEAD for the kernel reference. The regression
> happens with your series, somewhere.
> 
>> The reported failure is not due to some test failing but rather due
>> to timeout logging into the test system. Could it be that there is
>> some other problem?
> 
> I did reproduce:
> 
> v5.8-rc3 + series => imx6 hang at boot time
> v5.8-rc3 => imx6 boots correctly
> 

I kindly ask for a bisect.

Andrzej
Daniel Lezcano June 30, 2020, 6:33 p.m. UTC | #7
On 30/06/2020 18:56, Andrzej Pietrasiewicz wrote:
> Hi,
> 
> W dniu 30.06.2020 o 17:53, Daniel Lezcano pisze:
>> On 30/06/2020 17:29, Andrzej Pietrasiewicz wrote:
>>> Hi Daniel,
>>>
>>> W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze:
>>>> On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> I am reading the logs and can't find anything specific to thermal.
>>>>>
>>>>> What I can see is
>>>>>
>>>>> "random: crng init done"
>>>>>
>>>>> with large times (~200s) and then e.g.
>>>>>
>>>>> 'auto-login-action timed out after 283 seconds'
>>>>>
>>>>> I'm looking at e.g.
>>>>> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html
>>>>>
>>>>>
>>>>>
>>>
>>> f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11?
>>> PATCH 11/11 renames a method and the code compiles, so it seems
>>> unlikely that this is causing problems. One should never say never,
>>> though ;)
>>
>> The sha1 is just the HEAD for the kernel reference. The regression
>> happens with your series, somewhere.
>>
>>> The reported failure is not due to some test failing but rather due
>>> to timeout logging into the test system. Could it be that there is
>>> some other problem?
>>
>> I did reproduce:
>>
>> v5.8-rc3 + series => imx6 hang at boot time
>> v5.8-rc3 => imx6 boots correctly
>>
> 
> I kindly ask for a bisect.

I will give a try but it is a very long process as the board is running
on kernelci.

I was not able to reproduce it on imx7 despite it is the same sensor :/
Andrzej Pietrasiewicz July 1, 2020, 10:23 a.m. UTC | #8
Hi,

W dniu 30.06.2020 o 20:33, Daniel Lezcano pisze:
> On 30/06/2020 18:56, Andrzej Pietrasiewicz wrote:
>> Hi,
>>
>> W dniu 30.06.2020 o 17:53, Daniel Lezcano pisze:
>>> On 30/06/2020 17:29, Andrzej Pietrasiewicz wrote:
>>>> Hi Daniel,
>>>>
>>>> W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze:
>>>>> On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> I am reading the logs and can't find anything specific to thermal.
>>>>>>
>>>>>> What I can see is
>>>>>>
>>>>>> "random: crng init done"
>>>>>>
>>>>>> with large times (~200s) and then e.g.
>>>>>>
>>>>>> 'auto-login-action timed out after 283 seconds'
>>>>>>
>>>>>> I'm looking at e.g.
>>>>>> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11?
>>>> PATCH 11/11 renames a method and the code compiles, so it seems
>>>> unlikely that this is causing problems. One should never say never,
>>>> though ;)
>>>
>>> The sha1 is just the HEAD for the kernel reference. The regression
>>> happens with your series, somewhere.
>>>
>>>> The reported failure is not due to some test failing but rather due
>>>> to timeout logging into the test system. Could it be that there is
>>>> some other problem?
>>>
>>> I did reproduce:
>>>
>>> v5.8-rc3 + series => imx6 hang at boot time
>>> v5.8-rc3 => imx6 boots correctly
>>>

What did you reproduce? Timeout logging in to the test system or a "real" 
failure of a test?

>>
>> I kindly ask for a bisect.
> 
> I will give a try but it is a very long process as the board is running
> on kernelci.
> 
> I was not able to reproduce it on imx7 despite it is the same sensor :/
> 
> 

Could it be that the thermal sensors somehow contribute to entropy and after
the series is applied on some machines it takes more time to gather enough
entropy?

Andrzej
Daniel Lezcano July 1, 2020, 2:25 p.m. UTC | #9
On 01/07/2020 12:23, Andrzej Pietrasiewicz wrote:
> Hi,
> 

[ ... ]

>>>>
>>>> I did reproduce:
>>>>
>>>> v5.8-rc3 + series => imx6 hang at boot time
>>>> v5.8-rc3 => imx6 boots correctly
>>>>
> 
> What did you reproduce? Timeout logging in to the test system or a
> "real" failure of a test?

Timeout logging. Boot hangs.

>>> I kindly ask for a bisect.
>>
>> I will give a try but it is a very long process as the board is running
>> on kernelci.
>>
>> I was not able to reproduce it on imx7 despite it is the same sensor :/
>>
>>
> 
> Could it be that the thermal sensors somehow contribute to entropy and
> after
> the series is applied on some machines it takes more time to gather enough
> entropy?

I assume you are talking about the entropy for random?

It would be really surprising if it is the case. The message appears
asynchronously, I believe the boot flow is stuck in a mutex.
Daniel Lezcano July 2, 2020, 1:47 p.m. UTC | #10
On 01/07/2020 12:23, Andrzej Pietrasiewicz wrote:
> Hi,
> 
> W dniu 30.06.2020 o 20:33, Daniel Lezcano pisze:
>> On 30/06/2020 18:56, Andrzej Pietrasiewicz wrote:
>>> Hi,
>>>
>>> W dniu 30.06.2020 o 17:53, Daniel Lezcano pisze:
>>>> On 30/06/2020 17:29, Andrzej Pietrasiewicz wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> W dniu 30.06.2020 o 16:53, Daniel Lezcano pisze:
>>>>>> On 30/06/2020 15:43, Andrzej Pietrasiewicz wrote:
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> I am reading the logs and can't find anything specific to thermal.
>>>>>>>
>>>>>>> What I can see is
>>>>>>>
>>>>>>> "random: crng init done"
>>>>>>>
>>>>>>> with large times (~200s) and then e.g.
>>>>>>>
>>>>>>> 'auto-login-action timed out after 283 seconds'
>>>>>>>
>>>>>>> I'm looking at e.g.
>>>>>>> https://storage.kernelci.org/thermal/testing/v5.8-rc3-11-gf5e50bf4d3ef/arm/multi_v7_defconfig/gcc-8/lab-baylibre/baseline-imx6q-sabrelite.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> f5e50bf4d3ef is PATCH 11/11. Does the problem happen at PATCH 1-10/11?
>>>>> PATCH 11/11 renames a method and the code compiles, so it seems
>>>>> unlikely that this is causing problems. One should never say never,
>>>>> though ;)
>>>>
>>>> The sha1 is just the HEAD for the kernel reference. The regression
>>>> happens with your series, somewhere.
>>>>
>>>>> The reported failure is not due to some test failing but rather due
>>>>> to timeout logging into the test system. Could it be that there is
>>>>> some other problem?
>>>>
>>>> I did reproduce:
>>>>
>>>> v5.8-rc3 + series => imx6 hang at boot time
>>>> v5.8-rc3 => imx6 boots correctly

So finally I succeeded to reproduce it on my imx7 locally. The sensor
was failing to initialize for another reason related to the legacy
cooling device, this is why it is not appearing on the imx7.

I can now git-bisect :)
Andrzej Pietrasiewicz July 2, 2020, 1:53 p.m. UTC | #11
Hi Daniel,

<snip>

>>>>>
>>>>> I did reproduce:
>>>>>
>>>>> v5.8-rc3 + series => imx6 hang at boot time
>>>>> v5.8-rc3 => imx6 boots correctly
> 
> So finally I succeeded to reproduce it on my imx7 locally. The sensor
> was failing to initialize for another reason related to the legacy
> cooling device, this is why it is not appearing on the imx7.
> 
> I can now git-bisect :)
> 

That would be very kind of you, thank you!

Andrzej
Daniel Lezcano July 2, 2020, 2:58 p.m. UTC | #12
On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote:
> Hi Daniel,
> 
> <snip>
> 
>>>>>>
>>>>>> I did reproduce:
>>>>>>
>>>>>> v5.8-rc3 + series => imx6 hang at boot time
>>>>>> v5.8-rc3 => imx6 boots correctly
>>
>> So finally I succeeded to reproduce it on my imx7 locally. The sensor
>> was failing to initialize for another reason related to the legacy
>> cooling device, this is why it is not appearing on the imx7.
>>
>> I can now git-bisect :)
>>
> 
> That would be very kind of you, thank you!

Author: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
Date:   Mon Jun 29 14:29:21 2020 +0200

    thermal: Use mode helpers in drivers

    Use thermal_zone_device_{en|dis}able() and
thermal_zone_device_is_enabled().
Daniel Lezcano July 2, 2020, 5:01 p.m. UTC | #13
On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote:
> Hi Daniel,
> 
> <snip>
> 
>>>>>>
>>>>>> I did reproduce:
>>>>>>
>>>>>> v5.8-rc3 + series => imx6 hang at boot time
>>>>>> v5.8-rc3 => imx6 boots correctly
>>
>> So finally I succeeded to reproduce it on my imx7 locally. The sensor
>> was failing to initialize for another reason related to the legacy
>> cooling device, this is why it is not appearing on the imx7.
>>
>> I can now git-bisect :)
>>
> 
> That would be very kind of you, thank you!

With the lock correctness option enabled:

[    4.179223] imx_thermal tempmon: Extended Commercial CPU temperature
grade - max:105C critical:100C passive:95C
[    4.189557]
[    4.191060] ============================================
[    4.196378] WARNING: possible recursive locking detected
[    4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted
[    4.207102] --------------------------------------------
[    4.212421] kworker/0:3/54 is trying to acquire lock:
[    4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
thermal_zone_device_is_enabled+0x18/0x34
[    4.225777]
[    4.225777] but task is already holding lock:
[    4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
thermal_zone_get_temp+0x38/0x6c
[    4.239121]
[    4.239121] other info that might help us debug this:
[    4.245655]  Possible unsafe locking scenario:
[    4.245655]
[    4.251579]        CPU0
[    4.254031]        ----
[    4.256481]   lock(&tz->lock);
[    4.259544]   lock(&tz->lock);
[    4.262608]
[    4.262608]  *** DEADLOCK ***
[    4.262608]
[    4.268533]  May be due to missing lock nesting notation
[    4.268533]
[    4.275329] 4 locks held by kworker/0:3/54:
[    4.279517]  #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, at:
process_one_work+0x224/0x808
[    4.288241]  #1: ca075f10 (deferred_probe_work){+.+.}-{0:0}, at:
process_one_work+0x224/0x808
[    4.296787]  #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at:
__device_attach+0x30/0x140
[    4.304468]  #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
thermal_zone_get_temp+0x38/0x6c
[    4.312408]
[    4.312408] stack backtrace:
[    4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted
5.8.0-rc3-00011-gf5e50bf4d3ef #42
[    4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree)
[    4.330809] Workqueue: events deferred_probe_work_func
[    4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>]
(show_stack+0x10/0x14)
[    4.343734] [<c030c580>] (show_stack) from [<c079d7d8>]
(dump_stack+0xe8/0x114)
[    4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>]
(__lock_acquire+0xbfc/0x2cb4)
[    4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>]
(lock_acquire+0xf4/0x4e4)
[    4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>]
(__mutex_lock+0xb0/0xaa8)
[    4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>]
(mutex_lock_nested+0x1c/0x24)
[    4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>]
(thermal_zone_device_is_enabled+0x18/0x34)
[    4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from
[<c0d9da90>] (imx_get_temp+0x30/0x208)
[    4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>]
(thermal_zone_get_temp+0x4c/0x6c)
[    4.409640] [<c0d97484>] (thermal_zone_get_temp) from [<c0d93df0>]
(thermal_zone_device_update+0x8c/0x258)
[    4.419310] [<c0d93df0>] (thermal_zone_device_update) from
[<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88)
[    4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from
[<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578)
[    4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>]
(platform_drv_probe+0x48/0x98)
[    4.447622] [<c0a78388>] (platform_drv_probe) from [<c0a7603c>]
(really_probe+0x218/0x348)
[    4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>]
(driver_probe_device+0x5c/0xb4)
[    4.464098] [<c0a76278>] (driver_probe_device) from [<c0a743bc>]
(bus_for_each_drv+0x58/0xb8)
[    4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>]
(__device_attach+0xd4/0x140)
[    4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>]
(bus_probe_device+0x88/0x90)
[    4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>]
(deferred_probe_work_func+0x68/0x98)
[    4.498088] [<c0a75564>] (deferred_probe_work_func) from [<c0369988>]
(process_one_work+0x2e0/0x808)
[    4.507237] [<c0369988>] (process_one_work) from [<c036a150>]
(worker_thread+0x2a0/0x59c)
[    4.515432] [<c036a150>] (worker_thread) from [<c0372208>]
(kthread+0x16c/0x178)
[    4.522843] [<c0372208>] (kthread) from [<c0300174>]
(ret_from_fork+0x14/0x20)
[    4.530074] Exception stack(0xca075fb0 to 0xca075ff8)
[    4.535138] 5fa0:                                     00000000
00000000 00000000 00000000
[    4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[    4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
Andrzej Pietrasiewicz July 2, 2020, 5:19 p.m. UTC | #14
Hi,

W dniu 02.07.2020 o 19:01, Daniel Lezcano pisze:
> On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote:
>> Hi Daniel,
>>
>> <snip>
>>
>>>>>>>
>>>>>>> I did reproduce:
>>>>>>>
>>>>>>> v5.8-rc3 + series => imx6 hang at boot time
>>>>>>> v5.8-rc3 => imx6 boots correctly
>>>
>>> So finally I succeeded to reproduce it on my imx7 locally. The sensor
>>> was failing to initialize for another reason related to the legacy
>>> cooling device, this is why it is not appearing on the imx7.
>>>
>>> I can now git-bisect :)
>>>
>>
>> That would be very kind of you, thank you!
> 
> With the lock correctness option enabled:
> 
> [    4.179223] imx_thermal tempmon: Extended Commercial CPU temperature
> grade - max:105C critical:100C passive:95C
> [    4.189557]
> [    4.191060] ============================================
> [    4.196378] WARNING: possible recursive locking detected
> [    4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted
> [    4.207102] --------------------------------------------
> [    4.212421] kworker/0:3/54 is trying to acquire lock:
> [    4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
> thermal_zone_device_is_enabled+0x18/0x34
> [    4.225777]
> [    4.225777] but task is already holding lock:
> [    4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
> thermal_zone_get_temp+0x38/0x6c
> [    4.239121]
> [    4.239121] other info that might help us debug this:
> [    4.245655]  Possible unsafe locking scenario:
> [    4.245655]
> [    4.251579]        CPU0
> [    4.254031]        ----
> [    4.256481]   lock(&tz->lock);
> [    4.259544]   lock(&tz->lock);
> [    4.262608]
> [    4.262608]  *** DEADLOCK ***
> [    4.262608]
> [    4.268533]  May be due to missing lock nesting notation
> [    4.268533]
> [    4.275329] 4 locks held by kworker/0:3/54:
> [    4.279517]  #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, at:
> process_one_work+0x224/0x808
> [    4.288241]  #1: ca075f10 (deferred_probe_work){+.+.}-{0:0}, at:
> process_one_work+0x224/0x808
> [    4.296787]  #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at:
> __device_attach+0x30/0x140
> [    4.304468]  #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
> thermal_zone_get_temp+0x38/0x6c
> [    4.312408]
> [    4.312408] stack backtrace:
> [    4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted
> 5.8.0-rc3-00011-gf5e50bf4d3ef #42
> [    4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree)
> [    4.330809] Workqueue: events deferred_probe_work_func
> [    4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>]
> (show_stack+0x10/0x14)
> [    4.343734] [<c030c580>] (show_stack) from [<c079d7d8>]
> (dump_stack+0xe8/0x114)
> [    4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>]
> (__lock_acquire+0xbfc/0x2cb4)
> [    4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>]
> (lock_acquire+0xf4/0x4e4)
> [    4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>]
> (__mutex_lock+0xb0/0xaa8)
> [    4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>]
> (mutex_lock_nested+0x1c/0x24)
> [    4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>]
> (thermal_zone_device_is_enabled+0x18/0x34)
> [    4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from
> [<c0d9da90>] (imx_get_temp+0x30/0x208)
> [    4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>]
> (thermal_zone_get_temp+0x4c/0x6c)
> [    4.409640] [<c0d97484>] (thermal_zone_get_temp) from [<c0d93df0>]
> (thermal_zone_device_update+0x8c/0x258)
> [    4.419310] [<c0d93df0>] (thermal_zone_device_update) from
> [<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88)
> [    4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from
> [<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578)
> [    4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>]
> (platform_drv_probe+0x48/0x98)
> [    4.447622] [<c0a78388>] (platform_drv_probe) from [<c0a7603c>]
> (really_probe+0x218/0x348)
> [    4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>]
> (driver_probe_device+0x5c/0xb4)
> [    4.464098] [<c0a76278>] (driver_probe_device) from [<c0a743bc>]
> (bus_for_each_drv+0x58/0xb8)
> [    4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>]
> (__device_attach+0xd4/0x140)
> [    4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>]
> (bus_probe_device+0x88/0x90)
> [    4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>]
> (deferred_probe_work_func+0x68/0x98)
> [    4.498088] [<c0a75564>] (deferred_probe_work_func) from [<c0369988>]
> (process_one_work+0x2e0/0x808)
> [    4.507237] [<c0369988>] (process_one_work) from [<c036a150>]
> (worker_thread+0x2a0/0x59c)
> [    4.515432] [<c036a150>] (worker_thread) from [<c0372208>]
> (kthread+0x16c/0x178)
> [    4.522843] [<c0372208>] (kthread) from [<c0300174>]
> (ret_from_fork+0x14/0x20)
> [    4.530074] Exception stack(0xca075fb0 to 0xca075ff8)
> [    4.535138] 5fa0:                                     00000000
> 00000000 00000000 00000000
> [    4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [    4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> 
> 
> 

Thanks!

That confirms your suspicions.

So the reason is that ->get_temp() is called while the mutex is held and
thermal_zone_device_is_enabled() wants to take the same mutex.

Is adding a comment to thermal_zone_device_is_enabled() to never call
it while the mutex is held and adding another version of it which does
not take the mutex ok?

Andrzej
Daniel Lezcano July 2, 2020, 5:49 p.m. UTC | #15
On 02/07/2020 19:19, Andrzej Pietrasiewicz wrote:
> Hi,
> 
> W dniu 02.07.2020 o 19:01, Daniel Lezcano pisze:
>> On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote:
>>> Hi Daniel,
>>>
>>> <snip>
>>>
>>>>>>>>
>>>>>>>> I did reproduce:
>>>>>>>>
>>>>>>>> v5.8-rc3 + series => imx6 hang at boot time
>>>>>>>> v5.8-rc3 => imx6 boots correctly
>>>>
>>>> So finally I succeeded to reproduce it on my imx7 locally. The sensor
>>>> was failing to initialize for another reason related to the legacy
>>>> cooling device, this is why it is not appearing on the imx7.
>>>>
>>>> I can now git-bisect :)
>>>>
>>>
>>> That would be very kind of you, thank you!
>>
>> With the lock correctness option enabled:
>>
>> [    4.179223] imx_thermal tempmon: Extended Commercial CPU temperature
>> grade - max:105C critical:100C passive:95C
>> [    4.189557]
>> [    4.191060] ============================================
>> [    4.196378] WARNING: possible recursive locking detected
>> [    4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted
>> [    4.207102] --------------------------------------------
>> [    4.212421] kworker/0:3/54 is trying to acquire lock:
>> [    4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
>> thermal_zone_device_is_enabled+0x18/0x34
>> [    4.225777]
>> [    4.225777] but task is already holding lock:
>> [    4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
>> thermal_zone_get_temp+0x38/0x6c
>> [    4.239121]
>> [    4.239121] other info that might help us debug this:
>> [    4.245655]  Possible unsafe locking scenario:
>> [    4.245655]
>> [    4.251579]        CPU0
>> [    4.254031]        ----
>> [    4.256481]   lock(&tz->lock);
>> [    4.259544]   lock(&tz->lock);
>> [    4.262608]
>> [    4.262608]  *** DEADLOCK ***
>> [    4.262608]
>> [    4.268533]  May be due to missing lock nesting notation
>> [    4.268533]
>> [    4.275329] 4 locks held by kworker/0:3/54:
>> [    4.279517]  #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, at:
>> process_one_work+0x224/0x808
>> [    4.288241]  #1: ca075f10 (deferred_probe_work){+.+.}-{0:0}, at:
>> process_one_work+0x224/0x808
>> [    4.296787]  #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at:
>> __device_attach+0x30/0x140
>> [    4.304468]  #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
>> thermal_zone_get_temp+0x38/0x6c
>> [    4.312408]
>> [    4.312408] stack backtrace:
>> [    4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted
>> 5.8.0-rc3-00011-gf5e50bf4d3ef #42
>> [    4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree)
>> [    4.330809] Workqueue: events deferred_probe_work_func
>> [    4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>]
>> (show_stack+0x10/0x14)
>> [    4.343734] [<c030c580>] (show_stack) from [<c079d7d8>]
>> (dump_stack+0xe8/0x114)
>> [    4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>]
>> (__lock_acquire+0xbfc/0x2cb4)
>> [    4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>]
>> (lock_acquire+0xf4/0x4e4)
>> [    4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>]
>> (__mutex_lock+0xb0/0xaa8)
>> [    4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>]
>> (mutex_lock_nested+0x1c/0x24)
>> [    4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>]
>> (thermal_zone_device_is_enabled+0x18/0x34)
>> [    4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from
>> [<c0d9da90>] (imx_get_temp+0x30/0x208)
>> [    4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>]
>> (thermal_zone_get_temp+0x4c/0x6c)
>> [    4.409640] [<c0d97484>] (thermal_zone_get_temp) from [<c0d93df0>]
>> (thermal_zone_device_update+0x8c/0x258)
>> [    4.419310] [<c0d93df0>] (thermal_zone_device_update) from
>> [<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88)
>> [    4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from
>> [<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578)
>> [    4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>]
>> (platform_drv_probe+0x48/0x98)
>> [    4.447622] [<c0a78388>] (platform_drv_probe) from [<c0a7603c>]
>> (really_probe+0x218/0x348)
>> [    4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>]
>> (driver_probe_device+0x5c/0xb4)
>> [    4.464098] [<c0a76278>] (driver_probe_device) from [<c0a743bc>]
>> (bus_for_each_drv+0x58/0xb8)
>> [    4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>]
>> (__device_attach+0xd4/0x140)
>> [    4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>]
>> (bus_probe_device+0x88/0x90)
>> [    4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>]
>> (deferred_probe_work_func+0x68/0x98)
>> [    4.498088] [<c0a75564>] (deferred_probe_work_func) from [<c0369988>]
>> (process_one_work+0x2e0/0x808)
>> [    4.507237] [<c0369988>] (process_one_work) from [<c036a150>]
>> (worker_thread+0x2a0/0x59c)
>> [    4.515432] [<c036a150>] (worker_thread) from [<c0372208>]
>> (kthread+0x16c/0x178)
>> [    4.522843] [<c0372208>] (kthread) from [<c0300174>]
>> (ret_from_fork+0x14/0x20)
>> [    4.530074] Exception stack(0xca075fb0 to 0xca075ff8)
>> [    4.535138] 5fa0:                                     00000000
>> 00000000 00000000 00000000
>> [    4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000
>> 00000000 00000000 00000000
>> [    4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013
>> 00000000
>>
>>
>>
> 
> Thanks!
> 
> That confirms your suspicions.
> 
> So the reason is that ->get_temp() is called while the mutex is held and
> thermal_zone_device_is_enabled() wants to take the same mutex.

Yes, that's correct.

> Is adding a comment to thermal_zone_device_is_enabled() to never call
> it while the mutex is held and adding another version of it which does
> not take the mutex ok?

The thermal_zone_device_is_enabled() is only used in two places, acpi
and this imx driver, and given:

1. as soon as the mutex is released, there is no guarantee the thermal
zone won't be changed right after, the lock is pointless, thus the
information also.

2. from a design point of view, I don't see why a driver should know if
a thermal zone is disabled or not

It would make sense to end with this function and do not give the
different drivers an opportunity to access this information.

Why not add change_mode for the acpi in order to enable or disable the
events and for imx_thermal use irq_enabled flag instead of the thermal
zone mode? Moreover it is very unclear why this function is needed in
imx_get_temp(), and I suspect we should be able to get rid of it.
Daniel Lezcano July 2, 2020, 5:52 p.m. UTC | #16
On 02/07/2020 19:49, Daniel Lezcano wrote:

[ ... ]

>> Thanks!
>>
>> That confirms your suspicions.
>>
>> So the reason is that ->get_temp() is called while the mutex is held and
>> thermal_zone_device_is_enabled() wants to take the same mutex.
> 
> Yes, that's correct.
> 
>> Is adding a comment to thermal_zone_device_is_enabled() to never call
>> it while the mutex is held and adding another version of it which does
>> not take the mutex ok?
> 
> The thermal_zone_device_is_enabled() is only used in two places, acpi
> and this imx driver, and given:
> 
> 1. as soon as the mutex is released, there is no guarantee the thermal
> zone won't be changed right after, the lock is pointless, thus the
> information also.
> 
> 2. from a design point of view, I don't see why a driver should know if
> a thermal zone is disabled or not
> 
> It would make sense to end with this function and do not give the
> different drivers an opportunity to access this information.
> 
> Why not add change_mode for the acpi in order to enable or disable the
> events and for imx_thermal use irq_enabled flag instead of the thermal
> zone mode? Moreover it is very unclear why this function is needed in
> imx_get_temp(), and I suspect we should be able to get rid of it.

If you agree with that you can send a patch on top of your series so I
can test it fixes the imx platform.
Zhang Rui July 3, 2020, 1:49 a.m. UTC | #17
On Thu, 2020-07-02 at 19:49 +0200, Daniel Lezcano wrote:
> On 02/07/2020 19:19, Andrzej Pietrasiewicz wrote:
> > Hi,
> > 
> > W dniu 02.07.2020 o 19:01, Daniel Lezcano pisze:
> > > On 02/07/2020 15:53, Andrzej Pietrasiewicz wrote:
> > > > Hi Daniel,
> > > > 
> > > > <snip>
> > > > 
> > > > > > > > > 
> > > > > > > > > I did reproduce:
> > > > > > > > > 
> > > > > > > > > v5.8-rc3 + series => imx6 hang at boot time
> > > > > > > > > v5.8-rc3 => imx6 boots correctly
> > > > > 
> > > > > So finally I succeeded to reproduce it on my imx7 locally.
> > > > > The sensor
> > > > > was failing to initialize for another reason related to the
> > > > > legacy
> > > > > cooling device, this is why it is not appearing on the imx7.
> > > > > 
> > > > > I can now git-bisect :)
> > > > > 
> > > > 
> > > > That would be very kind of you, thank you!
> > > 
> > > With the lock correctness option enabled:
> > > 
> > > [    4.179223] imx_thermal tempmon: Extended Commercial CPU
> > > temperature
> > > grade - max:105C critical:100C passive:95C
> > > [    4.189557]
> > > [    4.191060] ============================================
> > > [    4.196378] WARNING: possible recursive locking detected
> > > [    4.201699] 5.8.0-rc3-00011-gf5e50bf4d3ef #42 Not tainted
> > > [    4.207102] --------------------------------------------
> > > [    4.212421] kworker/0:3/54 is trying to acquire lock:
> > > [    4.217480] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
> > > thermal_zone_device_is_enabled+0x18/0x34
> > > [    4.225777]
> > > [    4.225777] but task is already holding lock:
> > > [    4.231615] ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
> > > thermal_zone_get_temp+0x38/0x6c
> > > [    4.239121]
> > > [    4.239121] other info that might help us debug this:
> > > [    4.245655]  Possible unsafe locking scenario:
> > > [    4.245655]
> > > [    4.251579]        CPU0
> > > [    4.254031]        ----
> > > [    4.256481]   lock(&tz->lock);
> > > [    4.259544]   lock(&tz->lock);
> > > [    4.262608]
> > > [    4.262608]  *** DEADLOCK ***
> > > [    4.262608]
> > > [    4.268533]  May be due to missing lock nesting notation
> > > [    4.268533]
> > > [    4.275329] 4 locks held by kworker/0:3/54:
> > > [    4.279517]  #0: cb0066a8 ((wq_completion)events){+.+.}-{0:0}, 
> > > at:
> > > process_one_work+0x224/0x808
> > > [    4.288241]  #1: ca075f10 (deferred_probe_work){+.+.}-{0:0},
> > > at:
> > > process_one_work+0x224/0x808
> > > [    4.296787]  #2: cb1a48d8 (&dev->mutex){....}-{3:3}, at:
> > > __device_attach+0x30/0x140
> > > [    4.304468]  #3: ca09a3e4 (&tz->lock){+.+.}-{3:3}, at:
> > > thermal_zone_get_temp+0x38/0x6c
> > > [    4.312408]
> > > [    4.312408] stack backtrace:
> > > [    4.316778] CPU: 0 PID: 54 Comm: kworker/0:3 Not tainted
> > > 5.8.0-rc3-00011-gf5e50bf4d3ef #42
> > > [    4.325048] Hardware name: Freescale i.MX7 Dual (Device Tree)
> > > [    4.330809] Workqueue: events deferred_probe_work_func
> > > [    4.335973] [<c0312384>] (unwind_backtrace) from [<c030c580>]
> > > (show_stack+0x10/0x14)
> > > [    4.343734] [<c030c580>] (show_stack) from [<c079d7d8>]
> > > (dump_stack+0xe8/0x114)
> > > [    4.351062] [<c079d7d8>] (dump_stack) from [<c03abf78>]
> > > (__lock_acquire+0xbfc/0x2cb4)
> > > [    4.358909] [<c03abf78>] (__lock_acquire) from [<c03ae9c4>]
> > > (lock_acquire+0xf4/0x4e4)
> > > [    4.366758] [<c03ae9c4>] (lock_acquire) from [<c10630fc>]
> > > (__mutex_lock+0xb0/0xaa8)
> > > [    4.374431] [<c10630fc>] (__mutex_lock) from [<c1063b10>]
> > > (mutex_lock_nested+0x1c/0x24)
> > > [    4.382452] [<c1063b10>] (mutex_lock_nested) from [<c0d932c0>]
> > > (thermal_zone_device_is_enabled+0x18/0x34)
> > > [    4.392036] [<c0d932c0>] (thermal_zone_device_is_enabled) from
> > > [<c0d9da90>] (imx_get_temp+0x30/0x208)
> > > [    4.401271] [<c0d9da90>] (imx_get_temp) from [<c0d97484>]
> > > (thermal_zone_get_temp+0x4c/0x6c)
> > > [    4.409640] [<c0d97484>] (thermal_zone_get_temp) from
> > > [<c0d93df0>]
> > > (thermal_zone_device_update+0x8c/0x258)
> > > [    4.419310] [<c0d93df0>] (thermal_zone_device_update) from
> > > [<c0d9401c>] (thermal_zone_device_set_mode+0x60/0x88)
> > > [    4.429500] [<c0d9401c>] (thermal_zone_device_set_mode) from
> > > [<c0d9e1d4>] (imx_thermal_probe+0x3e4/0x578)
> > > [    4.439082] [<c0d9e1d4>] (imx_thermal_probe) from [<c0a78388>]
> > > (platform_drv_probe+0x48/0x98)
> > > [    4.447622] [<c0a78388>] (platform_drv_probe) from
> > > [<c0a7603c>]
> > > (really_probe+0x218/0x348)
> > > [    4.455903] [<c0a7603c>] (really_probe) from [<c0a76278>]
> > > (driver_probe_device+0x5c/0xb4)
> > > [    4.464098] [<c0a76278>] (driver_probe_device) from
> > > [<c0a743bc>]
> > > (bus_for_each_drv+0x58/0xb8)
> > > [    4.472638] [<c0a743bc>] (bus_for_each_drv) from [<c0a75db0>]
> > > (__device_attach+0xd4/0x140)
> > > [    4.480919] [<c0a75db0>] (__device_attach) from [<c0a750b0>]
> > > (bus_probe_device+0x88/0x90)
> > > [    4.489112] [<c0a750b0>] (bus_probe_device) from [<c0a75564>]
> > > (deferred_probe_work_func+0x68/0x98)
> > > [    4.498088] [<c0a75564>] (deferred_probe_work_func) from
> > > [<c0369988>]
> > > (process_one_work+0x2e0/0x808)
> > > [    4.507237] [<c0369988>] (process_one_work) from [<c036a150>]
> > > (worker_thread+0x2a0/0x59c)
> > > [    4.515432] [<c036a150>] (worker_thread) from [<c0372208>]
> > > (kthread+0x16c/0x178)
> > > [    4.522843] [<c0372208>] (kthread) from [<c0300174>]
> > > (ret_from_fork+0x14/0x20)
> > > [    4.530074] Exception stack(0xca075fb0 to 0xca075ff8)
> > > [    4.535138] 5fa0:                                     00000000
> > > 00000000 00000000 00000000
> > > [    4.543328] 5fc0: 00000000 00000000 00000000 00000000 00000000
> > > 00000000 00000000 00000000
> > > [    4.551516] 5fe0: 00000000 00000000 00000000 00000000 00000013
> > > 00000000
> > > 
> > > 
> > > 
> > 
> > Thanks!
> > 
> > That confirms your suspicions.
> > 
> > So the reason is that ->get_temp() is called while the mutex is
> > held and
> > thermal_zone_device_is_enabled() wants to take the same mutex.
> 
> Yes, that's correct.
> 
> > Is adding a comment to thermal_zone_device_is_enabled() to never
> > call
> > it while the mutex is held and adding another version of it which
> > does
> > not take the mutex ok?
> 
> The thermal_zone_device_is_enabled() is only used in two places, acpi
> and this imx driver, and given:
> 
> 1. as soon as the mutex is released, there is no guarantee the
> thermal
> zone won't be changed right after, the lock is pointless, thus the
> information also.
> 
> 2. from a design point of view, I don't see why a driver should know
> if
> a thermal zone is disabled or not
> 
> It would make sense to end with this function and do not give the
> different drivers an opportunity to access this information.

I agree.
> 
> Why not add change_mode for the acpi in order to enable or disable
> the
> events

thermal_zone_device_is_enabled() is invoked in acpi thermal driver
because we only want to do thermal_zone_device_update() when the acpi
thermal zone is enabled.

As thermal_zone_device_update() can handle a disabled thermal zone now,
we can just remove the check.

thanks,
rui

>  and for imx_thermal use irq_enabled flag instead of the thermal
> zone mode? Moreover it is very unclear why this function is needed in
> imx_get_temp(), and I suspect we should be able to get rid of it.
> 
>
Daniel Lezcano July 3, 2020, 6:38 a.m. UTC | #18
On 03/07/2020 03:49, Zhang Rui wrote:
> On Thu, 2020-07-02 at 19:49 +0200, Daniel Lezcano wrote:

[ ... ]

>>> So the reason is that ->get_temp() is called while the mutex is
>>> held and
>>> thermal_zone_device_is_enabled() wants to take the same mutex.
>>
>> Yes, that's correct.
>>
>>> Is adding a comment to thermal_zone_device_is_enabled() to never
>>> call
>>> it while the mutex is held and adding another version of it which
>>> does
>>> not take the mutex ok?
>>
>> The thermal_zone_device_is_enabled() is only used in two places, acpi
>> and this imx driver, and given:
>>
>> 1. as soon as the mutex is released, there is no guarantee the
>> thermal
>> zone won't be changed right after, the lock is pointless, thus the
>> information also.
>>
>> 2. from a design point of view, I don't see why a driver should know
>> if
>> a thermal zone is disabled or not
>>
>> It would make sense to end with this function and do not give the
>> different drivers an opportunity to access this information.
> 
> I agree.
>>
>> Why not add change_mode for the acpi in order to enable or disable
>> the
>> events
> 
> thermal_zone_device_is_enabled() is invoked in acpi thermal driver
> because we only want to do thermal_zone_device_update() when the acpi
> thermal zone is enabled.
> 
> As thermal_zone_device_update() can handle a disabled thermal zone now,
> we can just remove the check.

Ah yes, good point!
Andrzej Pietrasiewicz July 3, 2020, 10:45 a.m. UTC | #19
Hi,

W dniu 03.07.2020 o 08:38, Daniel Lezcano pisze:
> On 03/07/2020 03:49, Zhang Rui wrote:
>> On Thu, 2020-07-02 at 19:49 +0200, Daniel Lezcano wrote:
> 
> [ ... ]
> 
>>>> So the reason is that ->get_temp() is called while the mutex is
>>>> held and
>>>> thermal_zone_device_is_enabled() wants to take the same mutex.
>>>
>>> Yes, that's correct.
>>>
>>>> Is adding a comment to thermal_zone_device_is_enabled() to never
>>>> call
>>>> it while the mutex is held and adding another version of it which
>>>> does
>>>> not take the mutex ok?
>>>
>>> The thermal_zone_device_is_enabled() is only used in two places, acpi
>>> and this imx driver, and given:
>>>
>>> 1. as soon as the mutex is released, there is no guarantee the
>>> thermal
>>> zone won't be changed right after, the lock is pointless, thus the
>>> information also.
>>>
>>> 2. from a design point of view, I don't see why a driver should know
>>> if
>>> a thermal zone is disabled or not
>>>
>>> It would make sense to end with this function and do not give the
>>> different drivers an opportunity to access this information.
>>
>> I agree.
>>>
>>> Why not add change_mode for the acpi in order to enable or disable
>>> the
>>> events
>>
>> thermal_zone_device_is_enabled() is invoked in acpi thermal driver
>> because we only want to do thermal_zone_device_update() when the acpi
>> thermal zone is enabled.
>>
>> As thermal_zone_device_update() can handle a disabled thermal zone now,
>> we can just remove the check.
> 
> Ah yes, good point!
> 
> 
> 

I sent a short series with fixes. Daniel, can you kindly test it?

Regards,

Andrzej
Daniel Lezcano July 3, 2020, 11:05 a.m. UTC | #20
On 03/07/2020 12:45, Andrzej Pietrasiewicz wrote:
> Hi,
> 
> W dniu 03.07.2020 o 08:38, Daniel Lezcano pisze:
>> On 03/07/2020 03:49, Zhang Rui wrote:
>>> On Thu, 2020-07-02 at 19:49 +0200, Daniel Lezcano wrote:
>>
>> [ ... ]
>>
>>>>> So the reason is that ->get_temp() is called while the mutex is
>>>>> held and
>>>>> thermal_zone_device_is_enabled() wants to take the same mutex.
>>>>
>>>> Yes, that's correct.
>>>>
>>>>> Is adding a comment to thermal_zone_device_is_enabled() to never
>>>>> call
>>>>> it while the mutex is held and adding another version of it which
>>>>> does
>>>>> not take the mutex ok?
>>>>
>>>> The thermal_zone_device_is_enabled() is only used in two places, acpi
>>>> and this imx driver, and given:
>>>>
>>>> 1. as soon as the mutex is released, there is no guarantee the
>>>> thermal
>>>> zone won't be changed right after, the lock is pointless, thus the
>>>> information also.
>>>>
>>>> 2. from a design point of view, I don't see why a driver should know
>>>> if
>>>> a thermal zone is disabled or not
>>>>
>>>> It would make sense to end with this function and do not give the
>>>> different drivers an opportunity to access this information.
>>>
>>> I agree.
>>>>
>>>> Why not add change_mode for the acpi in order to enable or disable
>>>> the
>>>> events
>>>
>>> thermal_zone_device_is_enabled() is invoked in acpi thermal driver
>>> because we only want to do thermal_zone_device_update() when the acpi
>>> thermal zone is enabled.
>>>
>>> As thermal_zone_device_update() can handle a disabled thermal zone now,
>>> we can just remove the check.
>>
>> Ah yes, good point!
>>
>>
>>
> 
> I sent a short series with fixes. Daniel, can you kindly test it?

I confirm the i.MX is now correctly booting with the thermal zone
temperature available.