Message ID | 20201230084338.19410-1-tony@atomide.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/3] thermal: ti-soc-thermal: Fix stuck sensor with continuous mode for 4430 | expand |
On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <tony@atomide.com> wrote: > > At least for 4430, trying to use the single conversion mode eventually > hangs the thermal sensor. This can be quite easily seen with errors: > > thermal thermal_zone0: failed to read out thermal zone (-5) > > Also, trying to read the temperature shows a stuck value with: > > $ while true; do cat /sys/class/thermal/thermal_zone0/temp; done > > Where the temperature is not rising at all with the busy loop. > > Additionally, the EOCZ (end of conversion) bit is not rising on 4430 in > single conversion mode while it works fine in continuous conversion mode. > It is also possible that the hung temperature sensor can affect the > thermal shutdown alert too. > > Let's fix the issue by adding TI_BANDGAP_FEATURE_CONT_MODE_ONLY flag and > use it for 4430. > > Note that we also need to add udelay to for the EOCZ (end of conversion) > bit polling as otherwise we have it time out too early on 4430. We'll be > changing the loop to use iopoll in the following clean-up patch. > > Cc: Adam Ford <aford173@gmail.com> I don't have an OMAP4, but if you want, I can test a DM3730. adam > Cc: Carl Philipp Klemm <philipp@uvos.xyz> > Cc: Eduardo Valentin <edubezval@gmail.com> > Cc: Merlijn Wajer <merlijn@wizzup.org> > Cc: Pavel Machek <pavel@ucw.cz> > Cc: Peter Ujfalusi <peter.ujfalusi@gmail.com> > Cc: Sebastian Reichel <sre@kernel.org> > Signed-off-by: Tony Lindgren <tony@atomide.com> > --- > drivers/thermal/ti-soc-thermal/omap4-thermal-data.c | 3 ++- > drivers/thermal/ti-soc-thermal/ti-bandgap.c | 9 +++++++-- > drivers/thermal/ti-soc-thermal/ti-bandgap.h | 2 ++ > 3 files changed, 11 insertions(+), 3 deletions(-) > > diff --git a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c > --- a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c > +++ b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c > @@ -58,7 +58,8 @@ omap4430_adc_to_temp[OMAP4430_ADC_END_VALUE - OMAP4430_ADC_START_VALUE + 1] = { > const struct ti_bandgap_data omap4430_data = { > .features = TI_BANDGAP_FEATURE_MODE_CONFIG | > TI_BANDGAP_FEATURE_CLK_CTRL | > - TI_BANDGAP_FEATURE_POWER_SWITCH, > + TI_BANDGAP_FEATURE_POWER_SWITCH | > + TI_BANDGAP_FEATURE_CONT_MODE_ONLY, > .fclock_name = "bandgap_fclk", > .div_ck_name = "bandgap_fclk", > .conv_table = omap4430_adc_to_temp, > diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.c b/drivers/thermal/ti-soc-thermal/ti-bandgap.c > --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.c > +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.c > @@ -15,6 +15,7 @@ > #include <linux/kernel.h> > #include <linux/interrupt.h> > #include <linux/clk.h> > +#include <linux/delay.h> > #include <linux/gpio/consumer.h> > #include <linux/platform_device.h> > #include <linux/err.h> > @@ -605,8 +606,10 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) > u32 counter = 1000; > struct temp_sensor_registers *tsr; > > - /* Select single conversion mode */ > - if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) > + /* Select continuous or single conversion mode */ > + if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY)) > + RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1); > + else if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) > RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0); > > /* Start of Conversion = 1 */ > @@ -619,6 +622,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) > if (ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) & > tsr->bgap_eocz_mask) > break; > + udelay(1); > } > > /* Start of Conversion = 0 */ > @@ -630,6 +634,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) > if (!(ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) & > tsr->bgap_eocz_mask)) > break; > + udelay(1); > } > > return 0; > diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.h b/drivers/thermal/ti-soc-thermal/ti-bandgap.h > --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.h > +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.h > @@ -280,6 +280,7 @@ struct ti_temp_sensor { > * has Errata 814 > * TI_BANDGAP_FEATURE_UNRELIABLE - used when the sensor readings are too > * inaccurate. > + * TI_BANDGAP_FEATURE_CONT_MODE_ONLY - used when single mode hangs the sensor > * TI_BANDGAP_HAS(b, f) - macro to check if a bandgap device is capable of a > * specific feature (above) or not. Return non-zero, if yes. > */ > @@ -295,6 +296,7 @@ struct ti_temp_sensor { > #define TI_BANDGAP_FEATURE_HISTORY_BUFFER BIT(9) > #define TI_BANDGAP_FEATURE_ERRATA_814 BIT(10) > #define TI_BANDGAP_FEATURE_UNRELIABLE BIT(11) > +#define TI_BANDGAP_FEATURE_CONT_MODE_ONLY BIT(12) > #define TI_BANDGAP_HAS(b, f) \ > ((b)->conf->features & TI_BANDGAP_FEATURE_ ## f) > > -- > 2.29.2
Hi Adam and Tony, > Am 30.12.2020 um 13:55 schrieb Adam Ford <aford173@gmail.com>: > > On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <tony@atomide.com> wrote: >> >> At least for 4430, trying to use the single conversion mode eventually >> hangs the thermal sensor. This can be quite easily seen with errors: >> >> thermal thermal_zone0: failed to read out thermal zone (-5) >> >> Also, trying to read the temperature shows a stuck value with: >> >> $ while true; do cat /sys/class/thermal/thermal_zone0/temp; done >> >> Where the temperature is not rising at all with the busy loop. >> >> Additionally, the EOCZ (end of conversion) bit is not rising on 4430 in >> single conversion mode while it works fine in continuous conversion mode. >> It is also possible that the hung temperature sensor can affect the >> thermal shutdown alert too. >> >> Let's fix the issue by adding TI_BANDGAP_FEATURE_CONT_MODE_ONLY flag and >> use it for 4430. >> >> Note that we also need to add udelay to for the EOCZ (end of conversion) >> bit polling as otherwise we have it time out too early on 4430. We'll be >> changing the loop to use iopoll in the following clean-up patch. >> >> Cc: Adam Ford <aford173@gmail.com> > > I don't have an OMAP4, but if you want, I can test a DM3730. Indeed I remember a similar discussion from the DM3730 [1]. temp values were always those from the last measurement. E.g. the first one was done during (cold) boot and the first request after 10 minutes did show a quite cold system... The next one did show a hot system independent of what had been between (suspend or high activity). It seems as if it was even reproducible with a very old kernel on a BeagleBoard. So it is quite fundamental. We tried to fix it but did not come to a solution [2]. So we opened an issue in our tracker [3] and decided to stay with continuous conversion although this raises idle mode processor load. BR, Nikolaus [1]: https://lists.goldelico.com/pipermail/letux-kernel/2019-September/003958.html [2]: https://lists.goldelico.com/pipermail/letux-kernel/2019-September/003975.html [3]: https://projects.goldelico.com/p/gta04-kernel/issues/928/ > adam > >> Cc: Carl Philipp Klemm <philipp@uvos.xyz> >> Cc: Eduardo Valentin <edubezval@gmail.com> >> Cc: Merlijn Wajer <merlijn@wizzup.org> >> Cc: Pavel Machek <pavel@ucw.cz> >> Cc: Peter Ujfalusi <peter.ujfalusi@gmail.com> >> Cc: Sebastian Reichel <sre@kernel.org> >> Signed-off-by: Tony Lindgren <tony@atomide.com> >> --- >> drivers/thermal/ti-soc-thermal/omap4-thermal-data.c | 3 ++- >> drivers/thermal/ti-soc-thermal/ti-bandgap.c | 9 +++++++-- >> drivers/thermal/ti-soc-thermal/ti-bandgap.h | 2 ++ >> 3 files changed, 11 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c >> --- a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c >> +++ b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c >> @@ -58,7 +58,8 @@ omap4430_adc_to_temp[OMAP4430_ADC_END_VALUE - OMAP4430_ADC_START_VALUE + 1] = { >> const struct ti_bandgap_data omap4430_data = { >> .features = TI_BANDGAP_FEATURE_MODE_CONFIG | >> TI_BANDGAP_FEATURE_CLK_CTRL | >> - TI_BANDGAP_FEATURE_POWER_SWITCH, >> + TI_BANDGAP_FEATURE_POWER_SWITCH | >> + TI_BANDGAP_FEATURE_CONT_MODE_ONLY, >> .fclock_name = "bandgap_fclk", >> .div_ck_name = "bandgap_fclk", >> .conv_table = omap4430_adc_to_temp, >> diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.c b/drivers/thermal/ti-soc-thermal/ti-bandgap.c >> --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.c >> +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.c >> @@ -15,6 +15,7 @@ >> #include <linux/kernel.h> >> #include <linux/interrupt.h> >> #include <linux/clk.h> >> +#include <linux/delay.h> >> #include <linux/gpio/consumer.h> >> #include <linux/platform_device.h> >> #include <linux/err.h> >> @@ -605,8 +606,10 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) >> u32 counter = 1000; >> struct temp_sensor_registers *tsr; >> >> - /* Select single conversion mode */ >> - if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) >> + /* Select continuous or single conversion mode */ >> + if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY)) >> + RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1); >> + else if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) >> RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0); >> >> /* Start of Conversion = 1 */ >> @@ -619,6 +622,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) >> if (ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) & >> tsr->bgap_eocz_mask) >> break; >> + udelay(1); >> } >> >> /* Start of Conversion = 0 */ >> @@ -630,6 +634,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) >> if (!(ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) & >> tsr->bgap_eocz_mask)) >> break; >> + udelay(1); >> } >> >> return 0; >> diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.h b/drivers/thermal/ti-soc-thermal/ti-bandgap.h >> --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.h >> +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.h >> @@ -280,6 +280,7 @@ struct ti_temp_sensor { >> * has Errata 814 >> * TI_BANDGAP_FEATURE_UNRELIABLE - used when the sensor readings are too >> * inaccurate. >> + * TI_BANDGAP_FEATURE_CONT_MODE_ONLY - used when single mode hangs the sensor >> * TI_BANDGAP_HAS(b, f) - macro to check if a bandgap device is capable of a >> * specific feature (above) or not. Return non-zero, if yes. >> */ >> @@ -295,6 +296,7 @@ struct ti_temp_sensor { >> #define TI_BANDGAP_FEATURE_HISTORY_BUFFER BIT(9) >> #define TI_BANDGAP_FEATURE_ERRATA_814 BIT(10) >> #define TI_BANDGAP_FEATURE_UNRELIABLE BIT(11) >> +#define TI_BANDGAP_FEATURE_CONT_MODE_ONLY BIT(12) >> #define TI_BANDGAP_HAS(b, f) \ >> ((b)->conf->features & TI_BANDGAP_FEATURE_ ## f) >> >> -- >> 2.29.2
Hi Tony, On 12/30/20 10:43 AM, Tony Lindgren wrote: > At least for 4430, trying to use the single conversion mode eventually > hangs the thermal sensor. This can be quite easily seen with errors: > > thermal thermal_zone0: failed to read out thermal zone (-5) > > Also, trying to read the temperature shows a stuck value with: > > $ while true; do cat /sys/class/thermal/thermal_zone0/temp; done > > Where the temperature is not rising at all with the busy loop. > > Additionally, the EOCZ (end of conversion) bit is not rising on 4430 in > single conversion mode while it works fine in continuous conversion mode. > It is also possible that the hung temperature sensor can affect the > thermal shutdown alert too. > > Let's fix the issue by adding TI_BANDGAP_FEATURE_CONT_MODE_ONLY flag and > use it for 4430. > > Note that we also need to add udelay to for the EOCZ (end of conversion) > bit polling as otherwise we have it time out too early on 4430. We'll be > changing the loop to use iopoll in the following clean-up patch. I don't yet have my setup in working condition, so I can not test these. > Cc: Adam Ford <aford173@gmail.com> > Cc: Carl Philipp Klemm <philipp@uvos.xyz> > Cc: Eduardo Valentin <edubezval@gmail.com> > Cc: Merlijn Wajer <merlijn@wizzup.org> > Cc: Pavel Machek <pavel@ucw.cz> > Cc: Peter Ujfalusi <peter.ujfalusi@gmail.com> > Cc: Sebastian Reichel <sre@kernel.org> > Signed-off-by: Tony Lindgren <tony@atomide.com> > --- > drivers/thermal/ti-soc-thermal/omap4-thermal-data.c | 3 ++- > drivers/thermal/ti-soc-thermal/ti-bandgap.c | 9 +++++++-- > drivers/thermal/ti-soc-thermal/ti-bandgap.h | 2 ++ > 3 files changed, 11 insertions(+), 3 deletions(-) > > diff --git a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c > --- a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c > +++ b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c > @@ -58,7 +58,8 @@ omap4430_adc_to_temp[OMAP4430_ADC_END_VALUE - OMAP4430_ADC_START_VALUE + 1] = { > const struct ti_bandgap_data omap4430_data = { > .features = TI_BANDGAP_FEATURE_MODE_CONFIG | > TI_BANDGAP_FEATURE_CLK_CTRL | > - TI_BANDGAP_FEATURE_POWER_SWITCH, > + TI_BANDGAP_FEATURE_POWER_SWITCH | > + TI_BANDGAP_FEATURE_CONT_MODE_ONLY, Can we add a comment with the observations? > .fclock_name = "bandgap_fclk", > .div_ck_name = "bandgap_fclk", > .conv_table = omap4430_adc_to_temp, > diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.c b/drivers/thermal/ti-soc-thermal/ti-bandgap.c > --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.c > +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.c > @@ -15,6 +15,7 @@ > #include <linux/kernel.h> > #include <linux/interrupt.h> > #include <linux/clk.h> > +#include <linux/delay.h> > #include <linux/gpio/consumer.h> > #include <linux/platform_device.h> > #include <linux/err.h> > @@ -605,8 +606,10 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) > u32 counter = 1000; > struct temp_sensor_registers *tsr; > > - /* Select single conversion mode */ > - if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) > + /* Select continuous or single conversion mode */ > + if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY)) > + RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1); > + else if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) > RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0); Would not be better to: if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) { if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY)) RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1); else RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0); } One can only switch to cont/single mode if the mode config is possible. > > /* Start of Conversion = 1 */ > @@ -619,6 +622,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) > if (ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) & > tsr->bgap_eocz_mask) > break; > + udelay(1); > } > > /* Start of Conversion = 0 */ > @@ -630,6 +634,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) > if (!(ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) & > tsr->bgap_eocz_mask)) > break; > + udelay(1); > } > > return 0; > diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.h b/drivers/thermal/ti-soc-thermal/ti-bandgap.h > --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.h > +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.h > @@ -280,6 +280,7 @@ struct ti_temp_sensor { > * has Errata 814 > * TI_BANDGAP_FEATURE_UNRELIABLE - used when the sensor readings are too > * inaccurate. > + * TI_BANDGAP_FEATURE_CONT_MODE_ONLY - used when single mode hangs the sensor > * TI_BANDGAP_HAS(b, f) - macro to check if a bandgap device is capable of a > * specific feature (above) or not. Return non-zero, if yes. > */ > @@ -295,6 +296,7 @@ struct ti_temp_sensor { > #define TI_BANDGAP_FEATURE_HISTORY_BUFFER BIT(9) > #define TI_BANDGAP_FEATURE_ERRATA_814 BIT(10) > #define TI_BANDGAP_FEATURE_UNRELIABLE BIT(11) > +#define TI_BANDGAP_FEATURE_CONT_MODE_ONLY BIT(12) > #define TI_BANDGAP_HAS(b, f) \ > ((b)->conf->features & TI_BANDGAP_FEATURE_ ## f) > >
* Péter Ujfalusi <peter.ujfalusi@gmail.com> [201231 12:55]: > On 12/30/20 10:43 AM, Tony Lindgren wrote: > > @@ -58,7 +58,8 @@ omap4430_adc_to_temp[OMAP4430_ADC_END_VALUE - OMAP4430_ADC_START_VALUE + 1] = { > > const struct ti_bandgap_data omap4430_data = { > > .features = TI_BANDGAP_FEATURE_MODE_CONFIG | > > TI_BANDGAP_FEATURE_CLK_CTRL | > > - TI_BANDGAP_FEATURE_POWER_SWITCH, > > + TI_BANDGAP_FEATURE_POWER_SWITCH | > > + TI_BANDGAP_FEATURE_CONT_MODE_ONLY, > > Can we add a comment with the observations? Sure, and I also noticed that the timeout triggers also on dra7 too. I need to recheck what all are affected.. At least we now see warnings on the SoCs affected. > > @@ -605,8 +606,10 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) > > u32 counter = 1000; > > struct temp_sensor_registers *tsr; > > > > - /* Select single conversion mode */ > > - if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) > > + /* Select continuous or single conversion mode */ > > + if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY)) > > + RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1); > > + else if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) > > RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0); > > Would not be better to: > if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) { > if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY)) > RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1); > else > RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0); > } > > One can only switch to cont/single mode if the mode config is possible. Yup makes sense thanks for spotting that. Regards, Tony
* H. Nikolaus Schaller <hns@goldelico.com> [201230 13:29]: > > Am 30.12.2020 um 13:55 schrieb Adam Ford <aford173@gmail.com>: > > On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <tony@atomide.com> wrote: > >> > >> At least for 4430, trying to use the single conversion mode eventually > >> hangs the thermal sensor. This can be quite easily seen with errors: > >> > >> thermal thermal_zone0: failed to read out thermal zone (-5) ... > > I don't have an OMAP4, but if you want, I can test a DM3730. > > Indeed I remember a similar discussion from the DM3730 [1]. temp values were > always those from the last measurement. E.g. the first one was done > during (cold) boot and the first request after 10 minutes did show a > quite cold system... The next one did show a hot system independent > of what had been between (suspend or high activity). > > It seems as if it was even reproducible with a very old kernel on a BeagleBoard. > So it is quite fundamental. > > We tried to fix it but did not come to a solution [2]. So we opened an issue > in our tracker [3] and decided to stay with continuous conversion although this > raises idle mode processor load. Hmm so maybe eocz high always times out in single mode since it also triggers at least on dra7? Yes it would be great if you guys can the $subject patch a try at least on your omap36xx and omap5 boards and see if you see eocz time out warnings in dmesg. Regards, Tony
On Fri, Jan 8, 2021 at 1:22 AM Tony Lindgren <tony@atomide.com> wrote: > > * H. Nikolaus Schaller <hns@goldelico.com> [201230 13:29]: > > > Am 30.12.2020 um 13:55 schrieb Adam Ford <aford173@gmail.com>: > > > On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <tony@atomide.com> wrote: > > >> > > >> At least for 4430, trying to use the single conversion mode eventually > > >> hangs the thermal sensor. This can be quite easily seen with errors: > > >> > > >> thermal thermal_zone0: failed to read out thermal zone (-5) > ... > > > > I don't have an OMAP4, but if you want, I can test a DM3730. > > > > Indeed I remember a similar discussion from the DM3730 [1]. temp values were > > always those from the last measurement. E.g. the first one was done > > during (cold) boot and the first request after 10 minutes did show a > > quite cold system... The next one did show a hot system independent > > of what had been between (suspend or high activity). > > > > It seems as if it was even reproducible with a very old kernel on a BeagleBoard. > > So it is quite fundamental. > > > > We tried to fix it but did not come to a solution [2]. So we opened an issue > > in our tracker [3] and decided to stay with continuous conversion although this > > raises idle mode processor load. > > Hmm so maybe eocz high always times out in single mode since it also > triggers at least on dra7? > > Yes it would be great if you guys can the $subject patch a try at > least on your omap36xx and omap5 boards and see if you see eocz > time out warnings in dmesg. I should be able to try it on the dm3730 logicpd-torpedo kit this weekend. adam > > Regards, > > Tony
On Fri, Jan 8, 2021 at 7:45 AM Adam Ford <aford173@gmail.com> wrote: > > On Fri, Jan 8, 2021 at 1:22 AM Tony Lindgren <tony@atomide.com> wrote: > > > > * H. Nikolaus Schaller <hns@goldelico.com> [201230 13:29]: > > > > Am 30.12.2020 um 13:55 schrieb Adam Ford <aford173@gmail.com>: > > > > On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <tony@atomide.com> wrote: > > > >> > > > >> At least for 4430, trying to use the single conversion mode eventually > > > >> hangs the thermal sensor. This can be quite easily seen with errors: > > > >> > > > >> thermal thermal_zone0: failed to read out thermal zone (-5) > > ... > > > > > > I don't have an OMAP4, but if you want, I can test a DM3730. > > > > > > Indeed I remember a similar discussion from the DM3730 [1]. temp values were > > > always those from the last measurement. E.g. the first one was done > > > during (cold) boot and the first request after 10 minutes did show a > > > quite cold system... The next one did show a hot system independent > > > of what had been between (suspend or high activity). > > > > > > It seems as if it was even reproducible with a very old kernel on a BeagleBoard. > > > So it is quite fundamental. > > > > > > We tried to fix it but did not come to a solution [2]. So we opened an issue > > > in our tracker [3] and decided to stay with continuous conversion although this > > > raises idle mode processor load. > > > > Hmm so maybe eocz high always times out in single mode since it also > > triggers at least on dra7? > > > > Yes it would be great if you guys can the $subject patch a try at > > least on your omap36xx and omap5 boards and see if you see eocz > > time out warnings in dmesg. > > I should be able to try it on the dm3730 logicpd-torpedo kit this weekend. I am going to be a bit delayed testing this. I cannot boot omap2plus using Linux version 5.11.0-rc2. [ 2.666748] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xbc [ 2.673309] nand: Micron MT29F4G16ABBDA3W [ 2.677368] nand: 512 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64 [ 2.685119] nand: using OMAP_ECC_BCH8_CODE_HW_DETECTION_SW [ 2.693237] Invalid ECC layout [ 2.696350] omap2-nand 30000000.nand: unable to use BCH library [ 2.702575] omap2-nand: probe of 30000000.nand failed with error -22 [ 2.716094] 8<--- cut here --- [ 2.719207] Unable to handle kernel NULL pointer dereference at virtual address 00000018 [ 2.727600] pgd = (ptrval) ... [ 3.050933] ---[ end trace 59640c7399a80a07 ]--- [ 3.055603] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 3.063323] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- Once I get past this, I'll try to test the thermal stuff. adam > > adam > > > > Regards, > > > > Tony
On Fri, Jan 8, 2021 at 12:31 PM Adam Ford <aford173@gmail.com> wrote: > > On Fri, Jan 8, 2021 at 7:45 AM Adam Ford <aford173@gmail.com> wrote: > > > > On Fri, Jan 8, 2021 at 1:22 AM Tony Lindgren <tony@atomide.com> wrote: > > > > > > * H. Nikolaus Schaller <hns@goldelico.com> [201230 13:29]: > > > > > Am 30.12.2020 um 13:55 schrieb Adam Ford <aford173@gmail.com>: > > > > > On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <tony@atomide.com> wrote: > > > > >> > > > > >> At least for 4430, trying to use the single conversion mode eventually > > > > >> hangs the thermal sensor. This can be quite easily seen with errors: > > > > >> > > > > >> thermal thermal_zone0: failed to read out thermal zone (-5) > > > ... > > > > > > > > I don't have an OMAP4, but if you want, I can test a DM3730. > > > > > > > > Indeed I remember a similar discussion from the DM3730 [1]. temp values were > > > > always those from the last measurement. E.g. the first one was done > > > > during (cold) boot and the first request after 10 minutes did show a > > > > quite cold system... The next one did show a hot system independent > > > > of what had been between (suspend or high activity). > > > > > > > > It seems as if it was even reproducible with a very old kernel on a BeagleBoard. > > > > So it is quite fundamental. > > > > > > > > We tried to fix it but did not come to a solution [2]. So we opened an issue > > > > in our tracker [3] and decided to stay with continuous conversion although this > > > > raises idle mode processor load. > > > > > > Hmm so maybe eocz high always times out in single mode since it also > > > triggers at least on dra7? > > > > > > Yes it would be great if you guys can the $subject patch a try at > > > least on your omap36xx and omap5 boards and see if you see eocz > > > time out warnings in dmesg. I do see chatter. [ 15.531005] ti-soc-thermal 48002524.bandgap: eocz timed out waiting low [ 16.571075] ti-soc-thermal 48002524.bandgap: eocz timed out waiting low [ 17.610961] ti-soc-thermal 48002524.bandgap: eocz timed out waiting low and it repeats quite often. I would say this patch series would cause a regression on the DM3730. adam > > > > I should be able to try it on the dm3730 logicpd-torpedo kit this weekend. > > I am going to be a bit delayed testing this. I cannot boot omap2plus > using Linux version 5.11.0-rc2. > > [ 2.666748] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xbc > [ 2.673309] nand: Micron MT29F4G16ABBDA3W > [ 2.677368] nand: 512 MiB, SLC, erase size: 128 KiB, page size: > 2048, OOB size: 64 > [ 2.685119] nand: using OMAP_ECC_BCH8_CODE_HW_DETECTION_SW > [ 2.693237] Invalid ECC layout > [ 2.696350] omap2-nand 30000000.nand: unable to use BCH library > [ 2.702575] omap2-nand: probe of 30000000.nand failed with error -22 > [ 2.716094] 8<--- cut here --- > [ 2.719207] Unable to handle kernel NULL pointer dereference at > virtual address 00000018 > [ 2.727600] pgd = (ptrval) > ... > [ 3.050933] ---[ end trace 59640c7399a80a07 ]--- > [ 3.055603] Kernel panic - not syncing: Attempted to kill init! > exitcode=0x0000000b > [ 3.063323] ---[ end Kernel panic - not syncing: Attempted to kill > init! exitcode=0x0000000b ]--- > > Once I get past this, I'll try to test the thermal stuff. > > adam > > > > > adam > > > > > > Regards, > > > > > > Tony
diff --git a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c --- a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c +++ b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c @@ -58,7 +58,8 @@ omap4430_adc_to_temp[OMAP4430_ADC_END_VALUE - OMAP4430_ADC_START_VALUE + 1] = { const struct ti_bandgap_data omap4430_data = { .features = TI_BANDGAP_FEATURE_MODE_CONFIG | TI_BANDGAP_FEATURE_CLK_CTRL | - TI_BANDGAP_FEATURE_POWER_SWITCH, + TI_BANDGAP_FEATURE_POWER_SWITCH | + TI_BANDGAP_FEATURE_CONT_MODE_ONLY, .fclock_name = "bandgap_fclk", .div_ck_name = "bandgap_fclk", .conv_table = omap4430_adc_to_temp, diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.c b/drivers/thermal/ti-soc-thermal/ti-bandgap.c --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.c +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.c @@ -15,6 +15,7 @@ #include <linux/kernel.h> #include <linux/interrupt.h> #include <linux/clk.h> +#include <linux/delay.h> #include <linux/gpio/consumer.h> #include <linux/platform_device.h> #include <linux/err.h> @@ -605,8 +606,10 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) u32 counter = 1000; struct temp_sensor_registers *tsr; - /* Select single conversion mode */ - if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) + /* Select continuous or single conversion mode */ + if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY)) + RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1); + else if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0); /* Start of Conversion = 1 */ @@ -619,6 +622,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) if (ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) & tsr->bgap_eocz_mask) break; + udelay(1); } /* Start of Conversion = 0 */ @@ -630,6 +634,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id) if (!(ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) & tsr->bgap_eocz_mask)) break; + udelay(1); } return 0; diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.h b/drivers/thermal/ti-soc-thermal/ti-bandgap.h --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.h +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.h @@ -280,6 +280,7 @@ struct ti_temp_sensor { * has Errata 814 * TI_BANDGAP_FEATURE_UNRELIABLE - used when the sensor readings are too * inaccurate. + * TI_BANDGAP_FEATURE_CONT_MODE_ONLY - used when single mode hangs the sensor * TI_BANDGAP_HAS(b, f) - macro to check if a bandgap device is capable of a * specific feature (above) or not. Return non-zero, if yes. */ @@ -295,6 +296,7 @@ struct ti_temp_sensor { #define TI_BANDGAP_FEATURE_HISTORY_BUFFER BIT(9) #define TI_BANDGAP_FEATURE_ERRATA_814 BIT(10) #define TI_BANDGAP_FEATURE_UNRELIABLE BIT(11) +#define TI_BANDGAP_FEATURE_CONT_MODE_ONLY BIT(12) #define TI_BANDGAP_HAS(b, f) \ ((b)->conf->features & TI_BANDGAP_FEATURE_ ## f)
At least for 4430, trying to use the single conversion mode eventually hangs the thermal sensor. This can be quite easily seen with errors: thermal thermal_zone0: failed to read out thermal zone (-5) Also, trying to read the temperature shows a stuck value with: $ while true; do cat /sys/class/thermal/thermal_zone0/temp; done Where the temperature is not rising at all with the busy loop. Additionally, the EOCZ (end of conversion) bit is not rising on 4430 in single conversion mode while it works fine in continuous conversion mode. It is also possible that the hung temperature sensor can affect the thermal shutdown alert too. Let's fix the issue by adding TI_BANDGAP_FEATURE_CONT_MODE_ONLY flag and use it for 4430. Note that we also need to add udelay to for the EOCZ (end of conversion) bit polling as otherwise we have it time out too early on 4430. We'll be changing the loop to use iopoll in the following clean-up patch. Cc: Adam Ford <aford173@gmail.com> Cc: Carl Philipp Klemm <philipp@uvos.xyz> Cc: Eduardo Valentin <edubezval@gmail.com> Cc: Merlijn Wajer <merlijn@wizzup.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Ujfalusi <peter.ujfalusi@gmail.com> Cc: Sebastian Reichel <sre@kernel.org> Signed-off-by: Tony Lindgren <tony@atomide.com> --- drivers/thermal/ti-soc-thermal/omap4-thermal-data.c | 3 ++- drivers/thermal/ti-soc-thermal/ti-bandgap.c | 9 +++++++-- drivers/thermal/ti-soc-thermal/ti-bandgap.h | 2 ++ 3 files changed, 11 insertions(+), 3 deletions(-)