diff mbox series

[v3,12/14] thermal/of: Use the .should_bind() thermal zone callback

Message ID 2236794.NgBsaNRSFp@rjwysocki.net (mailing list archive)
State Mainlined, archived
Headers show
Series thermal: Rework binding cooling devices to trip points | expand

Commit Message

Rafael J. Wysocki Aug. 19, 2024, 4:30 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Make the thermal_of driver use the .should_bind() thermal zone callback
to provide the thermal core with the information on whether or not to
bind the given cooling device to the given trip point in the given
thermal zone.  If it returns 'true', the thermal core will bind the
cooling device to the trip and the corresponding unbinding will be
taken care of automatically by the core on the removal of the involved
thermal zone or cooling device.

This replaces the .bind() and .unbind() thermal zone callbacks which
assumed the same trip points ordering in the driver and in the thermal
core (that may not be true any more in the future).  The .bind()
callback would walk the given thermal zone's cooling maps to find all
of the valid trip point combinations with the given cooling device and
it would call thermal_zone_bind_cooling_device() for all of them using
trip point indices reflecting the ordering of the trips in the DT.

The .should_bind() callback still walks the thermal zone's cooling maps,
but it can use the trip object passed to it by the thermal core to find
the trip in question in the first place and then it uses the
corresponding 'cooling-device' entries to look up the given cooling
device.  To be able to match the trip object provided by the thermal
core to a specific device node, the driver sets the 'priv' field of each
trip to the corresponding device node pointer during initialization.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v2 -> v3: Reorder (previously [14/17])

v1 -> v2:
   * Fix a build issue (undefined symbol)

This patch only depends on the [06/14] introducing the .should_bind()
thermal zone callback:

https://lore.kernel.org/linux-pm/9334403.CDJkKcVGEf@rjwysocki.net/

---
 drivers/thermal/thermal_of.c |  171 ++++++++++---------------------------------
 1 file changed, 41 insertions(+), 130 deletions(-)

Comments

Daniel Lezcano Aug. 21, 2024, 2:20 p.m. UTC | #1
On 19/08/2024 18:30, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Make the thermal_of driver use the .should_bind() thermal zone callback
> to provide the thermal core with the information on whether or not to
> bind the given cooling device to the given trip point in the given
> thermal zone.  If it returns 'true', the thermal core will bind the
> cooling device to the trip and the corresponding unbinding will be
> taken care of automatically by the core on the removal of the involved
> thermal zone or cooling device.
> 
> This replaces the .bind() and .unbind() thermal zone callbacks which
> assumed the same trip points ordering in the driver and in the thermal
> core (that may not be true any more in the future).  The .bind()
> callback would walk the given thermal zone's cooling maps to find all
> of the valid trip point combinations with the given cooling device and
> it would call thermal_zone_bind_cooling_device() for all of them using
> trip point indices reflecting the ordering of the trips in the DT.
> 
> The .should_bind() callback still walks the thermal zone's cooling maps,
> but it can use the trip object passed to it by the thermal core to find
> the trip in question in the first place and then it uses the
> corresponding 'cooling-device' entries to look up the given cooling
> device.  To be able to match the trip object provided by the thermal
> core to a specific device node, the driver sets the 'priv' field of each
> trip to the corresponding device node pointer during initialization.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---

Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org> # rk3399-rock960
Marek Szyprowski Aug. 26, 2024, 11:31 a.m. UTC | #2
On 19.08.2024 18:30, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Make the thermal_of driver use the .should_bind() thermal zone callback
> to provide the thermal core with the information on whether or not to
> bind the given cooling device to the given trip point in the given
> thermal zone.  If it returns 'true', the thermal core will bind the
> cooling device to the trip and the corresponding unbinding will be
> taken care of automatically by the core on the removal of the involved
> thermal zone or cooling device.
>
> This replaces the .bind() and .unbind() thermal zone callbacks which
> assumed the same trip points ordering in the driver and in the thermal
> core (that may not be true any more in the future).  The .bind()
> callback would walk the given thermal zone's cooling maps to find all
> of the valid trip point combinations with the given cooling device and
> it would call thermal_zone_bind_cooling_device() for all of them using
> trip point indices reflecting the ordering of the trips in the DT.
>
> The .should_bind() callback still walks the thermal zone's cooling maps,
> but it can use the trip object passed to it by the thermal core to find
> the trip in question in the first place and then it uses the
> corresponding 'cooling-device' entries to look up the given cooling
> device.  To be able to match the trip object provided by the thermal
> core to a specific device node, the driver sets the 'priv' field of each
> trip to the corresponding device node pointer during initialization.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

This patch landed recently in linux-next as commit 6d71d55c3b12 
("thermal/of: Use the .should_bind() thermal zone callback"). In my 
tests I found that it breaks booting some on my test boars: Exynos-based 
(OdroidXU4 with ARM32 bit kernel from multi_v7_defconfig) and Amlogic 
Meson based boards (OdroidC4, VIM3 with ARM64 defconfig+some debug 
options). Reverting $subject on top of next-20240823 together with 
c1ee6e1f68f5 ("thermal: core: Clean up trip bind/unbind functions") and 
526954900465 ("thermal: core: Drop unused bind/unbind functions and 
callbacks") due to compile dependencies fixes the issue.

On Odroid C4 I see the following warnings before the boards hangs:

BUG: sleeping function called from invalid context at 
kernel/locking/rwsem.c:1578
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 263, name: 
systemd-udevd
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
4 locks held by systemd-udevd/263:
  #0: ffff0000013768f8 (&dev->mutex){....}-{3:3}, at: 
__driver_attach+0x90/0x1ac
  #1: ffff80008349e1a0 (thermal_list_lock){+.+.}-{3:3}, at: 
__thermal_cooling_device_register.part.0+0x154/0x2f4
  #2: ffff000000988700 (&tz->lock){+.+.}-{3:3}, at: 
thermal_zone_cdev_binding+0x84/0x1e4
  #3: ffff8000834b8a98 (devtree_lock){....}-{2:2}, at: 
of_get_next_child+0x2c/0x80
irq event stamp: 7936
hardirqs last  enabled at (7935): [<ffff8000812b1700>] 
_raw_spin_unlock_irqrestore+0x74/0x78
hardirqs last disabled at (7936): [<ffff8000812b0b14>] 
_raw_spin_lock_irqsave+0x84/0x88
softirqs last  enabled at (7302): [<ffff8000800b13dc>] 
handle_softirqs+0x4cc/0x4e4
softirqs last disabled at (7295): [<ffff8000800105b0>] 
__do_softirq+0x14/0x20
CPU: 3 UID: 0 PID: 263 Comm: systemd-udevd Not tainted 6.11.0-rc3+ #15264
Hardware name: Hardkernel ODROID-C4 (DT)
Call trace:
  dump_backtrace+0x94/0xec
  show_stack+0x18/0x24
  dump_stack_lvl+0x90/0xd0
  dump_stack+0x18/0x24
  __might_resched+0x144/0x248
  __might_sleep+0x48/0x98
  down_write+0x28/0xe8
  kernfs_remove+0x34/0x58
  sysfs_remove_dir+0x54/0x70
  __kobject_del+0x40/0xb8
  kobject_put+0x104/0x124
  of_node_put+0x18/0x28
  of_get_next_child+0x4c/0x80
  thermal_of_should_bind+0xec/0x28c
  thermal_zone_cdev_binding+0x104/0x1e4
  __thermal_cooling_device_register.part.0+0x194/0x2f4
  thermal_of_cooling_device_register+0x3c/0x54
  of_devfreq_cooling_register_power+0x220/0x298
  devfreq_cooling_em_register+0x48/0xa8
  panfrost_devfreq_init+0x294/0x320 [panfrost]
  panfrost_device_init+0x16c/0x5c8 [panfrost]
  panfrost_probe+0xbc/0x194 [panfrost]
  platform_probe+0x68/0xdc
  really_probe+0xbc/0x298
  __driver_probe_device+0x78/0x12c
  driver_probe_device+0x40/0x164
  __driver_attach+0x9c/0x1ac
  bus_for_each_dev+0x74/0xd4
  driver_attach+0x24/0x30
  bus_add_driver+0xe4/0x208
  driver_register+0x60/0x128
  __platform_driver_register+0x28/0x34
  panfrost_driver_init+0x20/0x1000 [panfrost]
  do_one_initcall+0x68/0x300
  do_init_module+0x60/0x224
  load_module+0x1b0c/0x1cb0
  init_module_from_file+0x84/0xc4
  idempotent_init_module+0x18c/0x284
  __arm64_sys_finit_module+0x64/0xa0
  invoke_syscall+0x48/0x110
  el0_svc_common.constprop.0+0x40/0xe8
  do_el0_svc_compat+0x20/0x3c
  el0_svc_compat+0x44/0xe0
  el0t_32_sync_handler+0x98/0x148
  el0t_32_sync+0x194/0x198

=============================
[ BUG: Invalid wait context ]
6.11.0-rc3+ #15264 Tainted: G        W
-----------------------------
systemd-udevd/263 is trying to lock:
ffff0000000e5948 (&root->kernfs_rwsem){++++}-{3:3}, at: 
kernfs_remove+0x34/0x58
other info that might help us debug this:
context-{4:4}
4 locks held by systemd-udevd/263:
  #0: ffff0000013768f8 (&dev->mutex){....}-{3:3}, at: 
__driver_attach+0x90/0x1ac
  #1: ffff80008349e1a0 (thermal_list_lock){+.+.}-{3:3}, at: 
__thermal_cooling_device_register.part.0+0x154/0x2f4
  #2: ffff000000988700 (&tz->lock){+.+.}-{3:3}, at: 
thermal_zone_cdev_binding+0x84/0x1e4
  #3: ffff8000834b8a98 (devtree_lock){....}-{2:2}, at: 
of_get_next_child+0x2c/0x80
stack backtrace:
CPU: 3 UID: 0 PID: 263 Comm: systemd-udevd Tainted: G W          
6.11.0-rc3+ #15264
Tainted: [W]=WARN
Hardware name: Hardkernel ODROID-C4 (DT)
Call trace:
  dump_backtrace+0x94/0xec
  show_stack+0x18/0x24
  dump_stack_lvl+0x90/0xd0
  dump_stack+0x18/0x24
  __lock_acquire+0x9fc/0x21a0
  lock_acquire+0x200/0x340
  down_write+0x50/0xe8
  kernfs_remove+0x34/0x58
  sysfs_remove_dir+0x54/0x70
  __kobject_del+0x40/0xb8
  kobject_put+0x104/0x124
  of_node_put+0x18/0x28
  of_get_next_child+0x4c/0x80
  thermal_of_should_bind+0xec/0x28c
  thermal_zone_cdev_binding+0x104/0x1e4
  __thermal_cooling_device_register.part.0+0x194/0x2f4
  thermal_of_cooling_device_register+0x3c/0x54
  of_devfreq_cooling_register_power+0x220/0x298
  devfreq_cooling_em_register+0x48/0xa8
  panfrost_devfreq_init+0x294/0x320 [panfrost]
  panfrost_device_init+0x16c/0x5c8 [panfrost]
  panfrost_probe+0xbc/0x194 [panfrost]
  platform_probe+0x68/0xdc
  really_probe+0xbc/0x298
  __driver_probe_device+0x78/0x12c
  driver_probe_device+0x40/0x164
  __driver_attach+0x9c/0x1ac
  bus_for_each_dev+0x74/0xd4
  driver_attach+0x24/0x30
  bus_add_driver+0xe4/0x208
  driver_register+0x60/0x128
  __platform_driver_register+0x28/0x34
  panfrost_driver_init+0x20/0x1000 [panfrost]
  do_one_initcall+0x68/0x300
  do_init_module+0x60/0x224
  load_module+0x1b0c/0x1cb0
  init_module_from_file+0x84/0xc4
  idempotent_init_module+0x18c/0x284
  __arm64_sys_finit_module+0x64/0xa0
  invoke_syscall+0x48/0x110
  el0_svc_common.constprop.0+0x40/0xe8
  do_el0_svc_compat+0x20/0x3c
  el0_svc_compat+0x44/0xe0
  el0t_32_sync_handler+0x98/0x148
  el0t_32_sync+0x194/0x198
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:     2-...!: (0 ticks this GP) idle=2aac/1/0x4000000000000000 
softirq=798/798 fqs=4
rcu:     3-...!: (0 ticks this GP) idle=28a4/1/0x4000000000000000 
softirq=1007/1007 fqs=4
rcu:     (detected by 0, t=6505 jiffies, g=349, q=46 ncpus=4)
Sending NMI from CPU 0 to CPUs 2:
Sending NMI from CPU 0 to CPUs 3:
rcu: rcu_preempt kthread timer wakeup didn't happen for 6483 jiffies! 
g349 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
rcu:     Possible timer handling issue on cpu=1 timer-softirq=260
rcu: rcu_preempt kthread starved for 6484 jiffies! g349 f0x0 
RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now 
expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt     state:I stack:0     pid:16    tgid:16 ppid:2      
flags:0x00000008
Call trace:
  __switch_to+0xe0/0x124
  __schedule+0x318/0xc30
  schedule+0x50/0x15c
  schedule_timeout+0xac/0x134
  rcu_gp_fqs_loop+0x16c/0x8b4
  rcu_gp_kthread+0x280/0x314
  kthread+0x124/0x128
  ret_from_fork+0x10/0x20
rcu: Stack dump where RCU GP kthread last ran:
Sending NMI from CPU 0 to CPUs 1:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:     2-...!: (0 ticks this GP) idle=2aac/1/0x4000000000000000 
softirq=798/798 fqs=4
rcu:     3-...!: (0 ticks this GP) idle=28a4/1/0x4000000000000000 
softirq=1007/1007 fqs=4
rcu:     (detected by 0, t=26013 jiffies, g=349, q=46 ncpus=4)
Sending NMI from CPU 0 to CPUs 2:
Sending NMI from CPU 0 to CPUs 3:

Let me know if I can help debugging this issue further.

> ---
>
> v2 -> v3: Reorder (previously [14/17])
>
> v1 -> v2:
>     * Fix a build issue (undefined symbol)
>
> This patch only depends on the [06/14] introducing the .should_bind()
> thermal zone callback:
>
> https://lore.kernel.org/linux-pm/9334403.CDJkKcVGEf@rjwysocki.net/
>
> ---
>   drivers/thermal/thermal_of.c |  171 ++++++++++---------------------------------
>   1 file changed, 41 insertions(+), 130 deletions(-)
>
> Index: linux-pm/drivers/thermal/thermal_of.c
> ===================================================================
> --- linux-pm.orig/drivers/thermal/thermal_of.c
> +++ linux-pm/drivers/thermal/thermal_of.c
> @@ -20,37 +20,6 @@
>   
>   /***   functions parsing device tree nodes   ***/
>   
> -static int of_find_trip_id(struct device_node *np, struct device_node *trip)
> -{
> -	struct device_node *trips;
> -	struct device_node *t;
> -	int i = 0;
> -
> -	trips = of_get_child_by_name(np, "trips");
> -	if (!trips) {
> -		pr_err("Failed to find 'trips' node\n");
> -		return -EINVAL;
> -	}
> -
> -	/*
> -	 * Find the trip id point associated with the cooling device map
> -	 */
> -	for_each_child_of_node(trips, t) {
> -
> -		if (t == trip) {
> -			of_node_put(t);
> -			goto out;
> -		}
> -		i++;
> -	}
> -
> -	i = -ENXIO;
> -out:
> -	of_node_put(trips);
> -
> -	return i;
> -}
> -
>   /*
>    * It maps 'enum thermal_trip_type' found in include/linux/thermal.h
>    * into the device tree binding of 'trip', property type.
> @@ -119,6 +88,8 @@ static int thermal_of_populate_trip(stru
>   
>   	trip->flags = THERMAL_TRIP_FLAG_RW_TEMP;
>   
> +	trip->priv = np;
> +
>   	return 0;
>   }
>   
> @@ -290,39 +261,9 @@ static struct device_node *thermal_of_zo
>   	return tz_np;
>   }
>   
> -static int __thermal_of_unbind(struct device_node *map_np, int index, int trip_id,
> -			       struct thermal_zone_device *tz, struct thermal_cooling_device *cdev)
> -{
> -	struct of_phandle_args cooling_spec;
> -	int ret;
> -
> -	ret = of_parse_phandle_with_args(map_np, "cooling-device", "#cooling-cells",
> -					 index, &cooling_spec);
> -
> -	if (ret < 0) {
> -		pr_err("Invalid cooling-device entry\n");
> -		return ret;
> -	}
> -
> -	of_node_put(cooling_spec.np);
> -
> -	if (cooling_spec.args_count < 2) {
> -		pr_err("wrong reference to cooling device, missing limits\n");
> -		return -EINVAL;
> -	}
> -
> -	if (cooling_spec.np != cdev->np)
> -		return 0;
> -
> -	ret = thermal_zone_unbind_cooling_device(tz, trip_id, cdev);
> -	if (ret)
> -		pr_err("Failed to unbind '%s' with '%s': %d\n", tz->type, cdev->type, ret);
> -
> -	return ret;
> -}
> -
> -static int __thermal_of_bind(struct device_node *map_np, int index, int trip_id,
> -			     struct thermal_zone_device *tz, struct thermal_cooling_device *cdev)
> +static bool thermal_of_get_cooling_spec(struct device_node *map_np, int index,
> +					struct thermal_cooling_device *cdev,
> +					struct cooling_spec *c)
>   {
>   	struct of_phandle_args cooling_spec;
>   	int ret, weight = THERMAL_WEIGHT_DEFAULT;
> @@ -334,104 +275,75 @@ static int __thermal_of_bind(struct devi
>   
>   	if (ret < 0) {
>   		pr_err("Invalid cooling-device entry\n");
> -		return ret;
> +		return false;
>   	}
>   
>   	of_node_put(cooling_spec.np);
>   
>   	if (cooling_spec.args_count < 2) {
>   		pr_err("wrong reference to cooling device, missing limits\n");
> -		return -EINVAL;
> +		return false;
>   	}
>   
>   	if (cooling_spec.np != cdev->np)
> -		return 0;
> -
> -	ret = thermal_zone_bind_cooling_device(tz, trip_id, cdev, cooling_spec.args[1],
> -					       cooling_spec.args[0],
> -					       weight);
> -	if (ret)
> -		pr_err("Failed to bind '%s' with '%s': %d\n", tz->type, cdev->type, ret);
> -
> -	return ret;
> -}
> -
> -static int thermal_of_for_each_cooling_device(struct device_node *tz_np, struct device_node *map_np,
> -					      struct thermal_zone_device *tz, struct thermal_cooling_device *cdev,
> -					      int (*action)(struct device_node *, int, int,
> -							    struct thermal_zone_device *, struct thermal_cooling_device *))
> -{
> -	struct device_node *tr_np;
> -	int count, i, trip_id;
> -
> -	tr_np = of_parse_phandle(map_np, "trip", 0);
> -	if (!tr_np)
> -		return -ENODEV;
> -
> -	trip_id = of_find_trip_id(tz_np, tr_np);
> -	if (trip_id < 0)
> -		return trip_id;
> -
> -	count = of_count_phandle_with_args(map_np, "cooling-device", "#cooling-cells");
> -	if (count <= 0) {
> -		pr_err("Add a cooling_device property with at least one device\n");
> -		return -ENOENT;
> -	}
> +		return false;
>   
> -	/*
> -	 * At this point, we don't want to bail out when there is an
> -	 * error, we will try to bind/unbind as many as possible
> -	 * cooling devices
> -	 */
> -	for (i = 0; i < count; i++)
> -		action(map_np, i, trip_id, tz, cdev);
> +	c->lower = cooling_spec.args[0];
> +	c->upper = cooling_spec.args[1];
> +	c->weight = weight;
>   
> -	return 0;
> +	return true;
>   }
>   
> -static int thermal_of_for_each_cooling_maps(struct thermal_zone_device *tz,
> -					    struct thermal_cooling_device *cdev,
> -					    int (*action)(struct device_node *, int, int,
> -							  struct thermal_zone_device *, struct thermal_cooling_device *))
> +static bool thermal_of_should_bind(struct thermal_zone_device *tz,
> +				   const struct thermal_trip *trip,
> +				   struct thermal_cooling_device *cdev,
> +				   struct cooling_spec *c)
>   {
>   	struct device_node *tz_np, *cm_np, *child;
> -	int ret = 0;
> +	bool result = false;
>   
>   	tz_np = thermal_of_zone_get_by_name(tz);
>   	if (IS_ERR(tz_np)) {
>   		pr_err("Failed to get node tz by name\n");
> -		return PTR_ERR(tz_np);
> +		return false;
>   	}
>   
>   	cm_np = of_get_child_by_name(tz_np, "cooling-maps");
>   	if (!cm_np)
>   		goto out;
>   
> +	/* Look up the trip and the cdev in the cooling maps. */
>   	for_each_child_of_node(cm_np, child) {
> -		ret = thermal_of_for_each_cooling_device(tz_np, child, tz, cdev, action);
> -		if (ret) {
> +		struct device_node *tr_np;
> +		int count, i;
> +
> +		tr_np = of_parse_phandle(child, "trip", 0);
> +		if (tr_np != trip->priv) {
>   			of_node_put(child);
> -			break;
> +			continue;
> +		}
> +
> +		/* The trip has been found, look up the cdev. */
> +		count = of_count_phandle_with_args(child, "cooling-device", "#cooling-cells");
> +		if (count <= 0)
> +			pr_err("Add a cooling_device property with at least one device\n");
> +
> +		for (i = 0; i < count; i++) {
> +			result = thermal_of_get_cooling_spec(child, i, cdev, c);
> +			if (result)
> +				break;
>   		}
> +
> +		of_node_put(child);
> +		break;
>   	}
>   
>   	of_node_put(cm_np);
>   out:
>   	of_node_put(tz_np);
>   
> -	return ret;
> -}
> -
> -static int thermal_of_bind(struct thermal_zone_device *tz,
> -			   struct thermal_cooling_device *cdev)
> -{
> -	return thermal_of_for_each_cooling_maps(tz, cdev, __thermal_of_bind);
> -}
> -
> -static int thermal_of_unbind(struct thermal_zone_device *tz,
> -			     struct thermal_cooling_device *cdev)
> -{
> -	return thermal_of_for_each_cooling_maps(tz, cdev, __thermal_of_unbind);
> +	return result;
>   }
>   
>   /**
> @@ -502,8 +414,7 @@ static struct thermal_zone_device *therm
>   
>   	thermal_of_parameters_init(np, &tzp);
>   
> -	of_ops.bind = thermal_of_bind;
> -	of_ops.unbind = thermal_of_unbind;
> +	of_ops.should_bind = thermal_of_should_bind;
>   
>   	ret = of_property_read_string(np, "critical-action", &action);
>   	if (!ret)
>
>
>
>
Best regards
Rafael J. Wysocki Aug. 26, 2024, 12:14 p.m. UTC | #3
On Mon, Aug 26, 2024 at 1:32 PM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> On 19.08.2024 18:30, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Make the thermal_of driver use the .should_bind() thermal zone callback
> > to provide the thermal core with the information on whether or not to
> > bind the given cooling device to the given trip point in the given
> > thermal zone.  If it returns 'true', the thermal core will bind the
> > cooling device to the trip and the corresponding unbinding will be
> > taken care of automatically by the core on the removal of the involved
> > thermal zone or cooling device.
> >
> > This replaces the .bind() and .unbind() thermal zone callbacks which
> > assumed the same trip points ordering in the driver and in the thermal
> > core (that may not be true any more in the future).  The .bind()
> > callback would walk the given thermal zone's cooling maps to find all
> > of the valid trip point combinations with the given cooling device and
> > it would call thermal_zone_bind_cooling_device() for all of them using
> > trip point indices reflecting the ordering of the trips in the DT.
> >
> > The .should_bind() callback still walks the thermal zone's cooling maps,
> > but it can use the trip object passed to it by the thermal core to find
> > the trip in question in the first place and then it uses the
> > corresponding 'cooling-device' entries to look up the given cooling
> > device.  To be able to match the trip object provided by the thermal
> > core to a specific device node, the driver sets the 'priv' field of each
> > trip to the corresponding device node pointer during initialization.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> This patch landed recently in linux-next as commit 6d71d55c3b12
> ("thermal/of: Use the .should_bind() thermal zone callback")

It has been fixed since and it is commit  94c6110b0b13c6416146 now.

Bottom line is that it was calling of_node_put() too many times due to
a coding mistake.

> In my tests I found that it breaks booting some on my test boars: Exynos-based
> (OdroidXU4 with ARM32 bit kernel from multi_v7_defconfig) and Amlogic
> Meson based boards (OdroidC4, VIM3 with ARM64 defconfig+some debug
> options). Reverting $subject on top of next-20240823 together with
> c1ee6e1f68f5 ("thermal: core: Clean up trip bind/unbind functions") and
> 526954900465 ("thermal: core: Drop unused bind/unbind functions and
> callbacks") due to compile dependencies fixes the issue.

Thanks for the report!
Marek Szyprowski Aug. 26, 2024, 8:49 p.m. UTC | #4
On 26.08.2024 14:14, Rafael J. Wysocki wrote:
> On Mon, Aug 26, 2024 at 1:32 PM Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>> On 19.08.2024 18:30, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Make the thermal_of driver use the .should_bind() thermal zone callback
>>> to provide the thermal core with the information on whether or not to
>>> bind the given cooling device to the given trip point in the given
>>> thermal zone.  If it returns 'true', the thermal core will bind the
>>> cooling device to the trip and the corresponding unbinding will be
>>> taken care of automatically by the core on the removal of the involved
>>> thermal zone or cooling device.
>>>
>>> This replaces the .bind() and .unbind() thermal zone callbacks which
>>> assumed the same trip points ordering in the driver and in the thermal
>>> core (that may not be true any more in the future).  The .bind()
>>> callback would walk the given thermal zone's cooling maps to find all
>>> of the valid trip point combinations with the given cooling device and
>>> it would call thermal_zone_bind_cooling_device() for all of them using
>>> trip point indices reflecting the ordering of the trips in the DT.
>>>
>>> The .should_bind() callback still walks the thermal zone's cooling maps,
>>> but it can use the trip object passed to it by the thermal core to find
>>> the trip in question in the first place and then it uses the
>>> corresponding 'cooling-device' entries to look up the given cooling
>>> device.  To be able to match the trip object provided by the thermal
>>> core to a specific device node, the driver sets the 'priv' field of each
>>> trip to the corresponding device node pointer during initialization.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> This patch landed recently in linux-next as commit 6d71d55c3b12
>> ("thermal/of: Use the .should_bind() thermal zone callback")
> It has been fixed since and it is commit  94c6110b0b13c6416146 now.


Confirmed. Thanks for fixing it and sorry for the noise.


> Bottom line is that it was calling of_node_put() too many times due to
> a coding mistake.
>
>> In my tests I found that it breaks booting some on my test boars: Exynos-based
>> (OdroidXU4 with ARM32 bit kernel from multi_v7_defconfig) and Amlogic
>> Meson based boards (OdroidC4, VIM3 with ARM64 defconfig+some debug
>> options). Reverting $subject on top of next-20240823 together with
>> c1ee6e1f68f5 ("thermal: core: Clean up trip bind/unbind functions") and
>> 526954900465 ("thermal: core: Drop unused bind/unbind functions and
>> callbacks") due to compile dependencies fixes the issue.
> Thanks for the report!
>
Best regards
Rafael J. Wysocki Aug. 27, 2024, 11:39 a.m. UTC | #5
On Mon, Aug 26, 2024 at 10:49 PM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> On 26.08.2024 14:14, Rafael J. Wysocki wrote:
> > On Mon, Aug 26, 2024 at 1:32 PM Marek Szyprowski
> > <m.szyprowski@samsung.com> wrote:
> >> On 19.08.2024 18:30, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> Make the thermal_of driver use the .should_bind() thermal zone callback
> >>> to provide the thermal core with the information on whether or not to
> >>> bind the given cooling device to the given trip point in the given
> >>> thermal zone.  If it returns 'true', the thermal core will bind the
> >>> cooling device to the trip and the corresponding unbinding will be
> >>> taken care of automatically by the core on the removal of the involved
> >>> thermal zone or cooling device.
> >>>
> >>> This replaces the .bind() and .unbind() thermal zone callbacks which
> >>> assumed the same trip points ordering in the driver and in the thermal
> >>> core (that may not be true any more in the future).  The .bind()
> >>> callback would walk the given thermal zone's cooling maps to find all
> >>> of the valid trip point combinations with the given cooling device and
> >>> it would call thermal_zone_bind_cooling_device() for all of them using
> >>> trip point indices reflecting the ordering of the trips in the DT.
> >>>
> >>> The .should_bind() callback still walks the thermal zone's cooling maps,
> >>> but it can use the trip object passed to it by the thermal core to find
> >>> the trip in question in the first place and then it uses the
> >>> corresponding 'cooling-device' entries to look up the given cooling
> >>> device.  To be able to match the trip object provided by the thermal
> >>> core to a specific device node, the driver sets the 'priv' field of each
> >>> trip to the corresponding device node pointer during initialization.
> >>>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> This patch landed recently in linux-next as commit 6d71d55c3b12
> >> ("thermal/of: Use the .should_bind() thermal zone callback")
> > It has been fixed since and it is commit  94c6110b0b13c6416146 now.
>
>
> Confirmed. Thanks for fixing it and sorry for the noise.

Thank you!

And it wasn't noise.  You reported the problem as soon as you saw it
and before you could see the fix.  Somebody else saw it earlier, but
there's nothing wrong with that.
diff mbox series

Patch

Index: linux-pm/drivers/thermal/thermal_of.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_of.c
+++ linux-pm/drivers/thermal/thermal_of.c
@@ -20,37 +20,6 @@ 
 
 /***   functions parsing device tree nodes   ***/
 
-static int of_find_trip_id(struct device_node *np, struct device_node *trip)
-{
-	struct device_node *trips;
-	struct device_node *t;
-	int i = 0;
-
-	trips = of_get_child_by_name(np, "trips");
-	if (!trips) {
-		pr_err("Failed to find 'trips' node\n");
-		return -EINVAL;
-	}
-
-	/*
-	 * Find the trip id point associated with the cooling device map
-	 */
-	for_each_child_of_node(trips, t) {
-
-		if (t == trip) {
-			of_node_put(t);
-			goto out;
-		}
-		i++;
-	}
-
-	i = -ENXIO;
-out:
-	of_node_put(trips);
-
-	return i;
-}
-
 /*
  * It maps 'enum thermal_trip_type' found in include/linux/thermal.h
  * into the device tree binding of 'trip', property type.
@@ -119,6 +88,8 @@  static int thermal_of_populate_trip(stru
 
 	trip->flags = THERMAL_TRIP_FLAG_RW_TEMP;
 
+	trip->priv = np;
+
 	return 0;
 }
 
@@ -290,39 +261,9 @@  static struct device_node *thermal_of_zo
 	return tz_np;
 }
 
-static int __thermal_of_unbind(struct device_node *map_np, int index, int trip_id,
-			       struct thermal_zone_device *tz, struct thermal_cooling_device *cdev)
-{
-	struct of_phandle_args cooling_spec;
-	int ret;
-
-	ret = of_parse_phandle_with_args(map_np, "cooling-device", "#cooling-cells",
-					 index, &cooling_spec);
-
-	if (ret < 0) {
-		pr_err("Invalid cooling-device entry\n");
-		return ret;
-	}
-
-	of_node_put(cooling_spec.np);
-
-	if (cooling_spec.args_count < 2) {
-		pr_err("wrong reference to cooling device, missing limits\n");
-		return -EINVAL;
-	}
-
-	if (cooling_spec.np != cdev->np)
-		return 0;
-
-	ret = thermal_zone_unbind_cooling_device(tz, trip_id, cdev);
-	if (ret)
-		pr_err("Failed to unbind '%s' with '%s': %d\n", tz->type, cdev->type, ret);
-
-	return ret;
-}
-
-static int __thermal_of_bind(struct device_node *map_np, int index, int trip_id,
-			     struct thermal_zone_device *tz, struct thermal_cooling_device *cdev)
+static bool thermal_of_get_cooling_spec(struct device_node *map_np, int index,
+					struct thermal_cooling_device *cdev,
+					struct cooling_spec *c)
 {
 	struct of_phandle_args cooling_spec;
 	int ret, weight = THERMAL_WEIGHT_DEFAULT;
@@ -334,104 +275,75 @@  static int __thermal_of_bind(struct devi
 
 	if (ret < 0) {
 		pr_err("Invalid cooling-device entry\n");
-		return ret;
+		return false;
 	}
 
 	of_node_put(cooling_spec.np);
 
 	if (cooling_spec.args_count < 2) {
 		pr_err("wrong reference to cooling device, missing limits\n");
-		return -EINVAL;
+		return false;
 	}
 
 	if (cooling_spec.np != cdev->np)
-		return 0;
-
-	ret = thermal_zone_bind_cooling_device(tz, trip_id, cdev, cooling_spec.args[1],
-					       cooling_spec.args[0],
-					       weight);
-	if (ret)
-		pr_err("Failed to bind '%s' with '%s': %d\n", tz->type, cdev->type, ret);
-
-	return ret;
-}
-
-static int thermal_of_for_each_cooling_device(struct device_node *tz_np, struct device_node *map_np,
-					      struct thermal_zone_device *tz, struct thermal_cooling_device *cdev,
-					      int (*action)(struct device_node *, int, int,
-							    struct thermal_zone_device *, struct thermal_cooling_device *))
-{
-	struct device_node *tr_np;
-	int count, i, trip_id;
-
-	tr_np = of_parse_phandle(map_np, "trip", 0);
-	if (!tr_np)
-		return -ENODEV;
-
-	trip_id = of_find_trip_id(tz_np, tr_np);
-	if (trip_id < 0)
-		return trip_id;
-
-	count = of_count_phandle_with_args(map_np, "cooling-device", "#cooling-cells");
-	if (count <= 0) {
-		pr_err("Add a cooling_device property with at least one device\n");
-		return -ENOENT;
-	}
+		return false;
 
-	/*
-	 * At this point, we don't want to bail out when there is an
-	 * error, we will try to bind/unbind as many as possible
-	 * cooling devices
-	 */
-	for (i = 0; i < count; i++)
-		action(map_np, i, trip_id, tz, cdev);
+	c->lower = cooling_spec.args[0];
+	c->upper = cooling_spec.args[1];
+	c->weight = weight;
 
-	return 0;
+	return true;
 }
 
-static int thermal_of_for_each_cooling_maps(struct thermal_zone_device *tz,
-					    struct thermal_cooling_device *cdev,
-					    int (*action)(struct device_node *, int, int,
-							  struct thermal_zone_device *, struct thermal_cooling_device *))
+static bool thermal_of_should_bind(struct thermal_zone_device *tz,
+				   const struct thermal_trip *trip,
+				   struct thermal_cooling_device *cdev,
+				   struct cooling_spec *c)
 {
 	struct device_node *tz_np, *cm_np, *child;
-	int ret = 0;
+	bool result = false;
 
 	tz_np = thermal_of_zone_get_by_name(tz);
 	if (IS_ERR(tz_np)) {
 		pr_err("Failed to get node tz by name\n");
-		return PTR_ERR(tz_np);
+		return false;
 	}
 
 	cm_np = of_get_child_by_name(tz_np, "cooling-maps");
 	if (!cm_np)
 		goto out;
 
+	/* Look up the trip and the cdev in the cooling maps. */
 	for_each_child_of_node(cm_np, child) {
-		ret = thermal_of_for_each_cooling_device(tz_np, child, tz, cdev, action);
-		if (ret) {
+		struct device_node *tr_np;
+		int count, i;
+
+		tr_np = of_parse_phandle(child, "trip", 0);
+		if (tr_np != trip->priv) {
 			of_node_put(child);
-			break;
+			continue;
+		}
+
+		/* The trip has been found, look up the cdev. */
+		count = of_count_phandle_with_args(child, "cooling-device", "#cooling-cells");
+		if (count <= 0)
+			pr_err("Add a cooling_device property with at least one device\n");
+
+		for (i = 0; i < count; i++) {
+			result = thermal_of_get_cooling_spec(child, i, cdev, c);
+			if (result)
+				break;
 		}
+
+		of_node_put(child);
+		break;
 	}
 
 	of_node_put(cm_np);
 out:
 	of_node_put(tz_np);
 
-	return ret;
-}
-
-static int thermal_of_bind(struct thermal_zone_device *tz,
-			   struct thermal_cooling_device *cdev)
-{
-	return thermal_of_for_each_cooling_maps(tz, cdev, __thermal_of_bind);
-}
-
-static int thermal_of_unbind(struct thermal_zone_device *tz,
-			     struct thermal_cooling_device *cdev)
-{
-	return thermal_of_for_each_cooling_maps(tz, cdev, __thermal_of_unbind);
+	return result;
 }
 
 /**
@@ -502,8 +414,7 @@  static struct thermal_zone_device *therm
 
 	thermal_of_parameters_init(np, &tzp);
 
-	of_ops.bind = thermal_of_bind;
-	of_ops.unbind = thermal_of_unbind;
+	of_ops.should_bind = thermal_of_should_bind;
 
 	ret = of_property_read_string(np, "critical-action", &action);
 	if (!ret)