Message ID | 20230314155010.3692869-1-idosch@nvidia.com (mailing list archive) |
---|---|
State | Mainlined, archived |
Headers | show |
Series | thermal: core: Restore behavior regarding invalid trip points | expand |
On Tue, Mar 14, 2023 at 4:50 PM Ido Schimmel <idosch@nvidia.com> wrote: > > Cited commit stopped marking trip points with a zero temperature as > disabled, behavior that was originally introduced in commit 81ad4276b505 > ("Thermal: Ignore invalid trip points"). > > When using the mlxsw driver we see that when such trip points are not > disabled, the thermal subsystem repeatedly tries to set the state of the > associated cooling devices to the maximum state. > > Fix this by restoring the original behavior and mark trip points with a > zero temperature as disabled. What if the temperature is negative? I think that you'd still want to disable the trip in that case, wouldn't you? Daniel, what's your take on this? > Fixes: 7c3d5c20dc16 ("thermal/core: Add a generic thermal_zone_get_trip() function") > Signed-off-by: Ido Schimmel <idosch@nvidia.com> > --- > drivers/thermal/thermal_core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c > index 5ae72f314683..63583df4498d 100644 > --- a/drivers/thermal/thermal_core.c > +++ b/drivers/thermal/thermal_core.c > @@ -1309,7 +1309,7 @@ thermal_zone_device_register_with_trips(const char *type, struct thermal_trip *t > struct thermal_trip trip; > > result = thermal_zone_get_trip(tz, count, &trip); > - if (result) > + if (result || !trip.temperature) > set_bit(count, &tz->trips_disabled); > } > > --
Hi Rafael, On Tue, Mar 14, 2023 at 07:03:07PM +0100, Rafael J. Wysocki wrote: > What if the temperature is negative? I think that you'd still want to > disable the trip in that case, wouldn't you? Personally, no. This patch merely restores the behavior that was inadvertently removed by 7c3d5c20dc16. Specifically by this hunk: ``` @@ -1252,9 +1319,10 @@ thermal_zone_device_register_with_trips(const char *type, struct thermal_trip *t goto release_device; for (count = 0; count < num_trips; count++) { - if (tz->ops->get_trip_type(tz, count, &trip_type) || - tz->ops->get_trip_temp(tz, count, &trip_temp) || - !trip_temp) + struct thermal_trip trip; + + result = thermal_zone_get_trip(tz, count, &trip); + if (result) set_bit(count, &tz->trips_disabled); } ``` > Daniel, what's your take on this? Discussed this with Daniel yesterday: https://lore.kernel.org/linux-pm/ZA8TPDpEVanOpjEp@shredder/ We agreed to rework mlxsw to not rely on the fact that trip points with a zero temperature are disabled, but it's not rc material, unlike this patch. Thanks
On Tue, Mar 14, 2023 at 7:35 PM Ido Schimmel <idosch@nvidia.com> wrote: > > Hi Rafael, > > On Tue, Mar 14, 2023 at 07:03:07PM +0100, Rafael J. Wysocki wrote: > > What if the temperature is negative? I think that you'd still want to > > disable the trip in that case, wouldn't you? > > Personally, no. This patch merely restores the behavior that was > inadvertently removed by 7c3d5c20dc16. Specifically by this hunk: > > ``` > @@ -1252,9 +1319,10 @@ thermal_zone_device_register_with_trips(const char *type, struct thermal_trip *t > goto release_device; > > for (count = 0; count < num_trips; count++) { > - if (tz->ops->get_trip_type(tz, count, &trip_type) || > - tz->ops->get_trip_temp(tz, count, &trip_temp) || > - !trip_temp) > + struct thermal_trip trip; > + > + result = thermal_zone_get_trip(tz, count, &trip); > + if (result) > set_bit(count, &tz->trips_disabled); > } > ``` > > > Daniel, what's your take on this? > > Discussed this with Daniel yesterday: > https://lore.kernel.org/linux-pm/ZA8TPDpEVanOpjEp@shredder/ > > We agreed to rework mlxsw to not rely on the fact that trip points with > a zero temperature are disabled, but it's not rc material, unlike this > patch. So I've applied this as 6.3-rc material for the sake of avoiding a driver regression in 6.3, under the assumption that we'll get the mlxsw driver update on time for the 6.4 merge window. Thanks!
On Wed, Mar 22, 2023 at 08:04:54PM +0100, Rafael J. Wysocki wrote: > So I've applied this as 6.3-rc material for the sake of avoiding a > driver regression in 6.3, under the assumption that we'll get the > mlxsw driver update on time for the 6.4 merge window. Correct. Sent the patches for internal review and will submit them to netdev (where this driver is maintained) next week assuming review is OK and nothing explodes in our regression. Will copy Daniel and you. Thanks
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index 5ae72f314683..63583df4498d 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -1309,7 +1309,7 @@ thermal_zone_device_register_with_trips(const char *type, struct thermal_trip *t struct thermal_trip trip; result = thermal_zone_get_trip(tz, count, &trip); - if (result) + if (result || !trip.temperature) set_bit(count, &tz->trips_disabled); }
Cited commit stopped marking trip points with a zero temperature as disabled, behavior that was originally introduced in commit 81ad4276b505 ("Thermal: Ignore invalid trip points"). When using the mlxsw driver we see that when such trip points are not disabled, the thermal subsystem repeatedly tries to set the state of the associated cooling devices to the maximum state. Fix this by restoring the original behavior and mark trip points with a zero temperature as disabled. Fixes: 7c3d5c20dc16 ("thermal/core: Add a generic thermal_zone_get_trip() function") Signed-off-by: Ido Schimmel <idosch@nvidia.com> --- drivers/thermal/thermal_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)