diff mbox series

[5/5] tracing/hwlat: Fix deadlock in cpuhp processing

Message ID 20240924094515.3561410-6-liwei391@huawei.com (mailing list archive)
State New
Delegated to: Steven Rostedt
Headers show
Series tracing: Fix several deadlock/race issues in timerlat and hwlat tracer | expand

Commit Message

Wei Li Sept. 24, 2024, 9:45 a.m. UTC
Another "hung task" error was reported during the test, and i figured out
the deadlock scenario is as follows:

T1 [BP]               | T2 [AP]                     | T3 [hwlatd/1]                  | T4
work_for_cpu_fn()     | cpuhp_thread_fun()          | kthread_fn()                   | hwlat_hotplug_workfn()
  _cpu_down()         |   stop_cpu_kthread()        |                                |   mutex_lock(&hwlat_data.lock)
    cpus_write_lock() |     kthread_stop(hwlatd/1)  |   mutex_lock(&hwlat_data.lock) |
    __cpuhp_kick_ap() |       wait_for_completion() |                                |   cpus_read_lock()

It constitutes ABBA deadlock indirectly between "cpu_hotplug_lock" and
"hwlat_data.lock", make the mutex obtaining in kthread_fn() interruptible
to fix this.

Fixes: ba998f7d9531 ("trace/hwlat: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
---
 kernel/trace/trace_hwlat.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Steven Rostedt Oct. 3, 2024, 8:19 p.m. UTC | #1
On Tue, 24 Sep 2024 17:45:15 +0800
Wei Li <liwei391@huawei.com> wrote:

> Another "hung task" error was reported during the test, and i figured out
> the deadlock scenario is as follows:
> 
> T1 [BP]               | T2 [AP]                     | T3 [hwlatd/1]                  | T4
> work_for_cpu_fn()     | cpuhp_thread_fun()          | kthread_fn()                   | hwlat_hotplug_workfn()
>   _cpu_down()         |   stop_cpu_kthread()        |                                |   mutex_lock(&hwlat_data.lock)
>     cpus_write_lock() |     kthread_stop(hwlatd/1)  |   mutex_lock(&hwlat_data.lock) |
>     __cpuhp_kick_ap() |       wait_for_completion() |                                |   cpus_read_lock()
> 
> It constitutes ABBA deadlock indirectly between "cpu_hotplug_lock" and
> "hwlat_data.lock", make the mutex obtaining in kthread_fn() interruptible
> to fix this.
> 
> Fixes: ba998f7d9531 ("trace/hwlat: Support hotplug operations")
> Signed-off-by: Wei Li <liwei391@huawei.com>
> ---
>  kernel/trace/trace_hwlat.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
> index 3bd6071441ad..4c228ccb8a38 100644
> --- a/kernel/trace/trace_hwlat.c
> +++ b/kernel/trace/trace_hwlat.c
> @@ -370,7 +370,8 @@ static int kthread_fn(void *data)
>  		get_sample();
>  		local_irq_enable();
>  
> -		mutex_lock(&hwlat_data.lock);
> +		if (mutex_lock_interruptible(&hwlat_data.lock))
> +			break;

So basically this requires as signal to break it out of the loop?

But if it receives a signal for any other reason, it breaks out of the loop
too. Which is not what we want. If anything, it should be:

		if (mutex_lock_interruptible(&hwlat_data.lock))
			continue;

But I still don't really like this solution, as it will still report a
	deadlock.

Is it possible to switch the cpu_read_lock() to be taken before the
hwlat_data.lock?

-- Steve


>  		interval = hwlat_data.sample_window - hwlat_data.sample_width;
>  		mutex_unlock(&hwlat_data.lock);
>
Wei Li Oct. 9, 2024, 7:47 a.m. UTC | #2
On 2024/10/4 4:19, Steven Rostedt wrote:
> On Tue, 24 Sep 2024 17:45:15 +0800
> Wei Li <liwei391@huawei.com> wrote:
> 
>> Another "hung task" error was reported during the test, and i figured out
>> the deadlock scenario is as follows:
>>
>> T1 [BP]               | T2 [AP]                     | T3 [hwlatd/1]                  | T4
>> work_for_cpu_fn()     | cpuhp_thread_fun()          | kthread_fn()                   | hwlat_hotplug_workfn()
>>   _cpu_down()         |   stop_cpu_kthread()        |                                |   mutex_lock(&hwlat_data.lock)
>>     cpus_write_lock() |     kthread_stop(hwlatd/1)  |   mutex_lock(&hwlat_data.lock) |
>>     __cpuhp_kick_ap() |       wait_for_completion() |                                |   cpus_read_lock()
>>
>> It constitutes ABBA deadlock indirectly beAs it calls msleep_interruptible() and 'break' if signal pending below, i choosed 'break' here too.tween "cpu_hotplug_lock" and
>> "hwlat_data.lock", make the mutex obtaining in kthread_fn() interruptible
>> to fix this.
>>
>> Fixes: ba998f7d9531 ("trace/hwlat: Support hotplug operations")
>> Signed-off-by: Wei Li <liwei391@huawei.com>
>> ---
>>  kernel/trace/trace_hwlat.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
>> index 3bd6071441ad..4c228ccb8a38 100644
>> --- a/kernel/trace/trace_hwlat.c
>> +++ b/kernel/trace/trace_hwlat.c
>> @@ -370,7 +370,8 @@ static int kthread_fn(void *data)
>>  		get_sample();
>>  		local_irq_enable();
>>  
>> -		mutex_lock(&hwlat_data.lock);
>> +		if (mutex_lock_interruptible(&hwlat_data.lock))
>> +			break;
> 
> So basically this requires as signal to break it out of the loop?
> 
> But if it receives a signal for any other reason, it breaks out of the loop
> too. Which is not what we want. If anything, it should be:
> 
> 		if (mutex_lock_interruptible(&hwlat_data.lock))
> 			continue;

As it calls msleep_interruptible() below and 'break' if signal pending, i
choosed 'break' here too.

> But I still don't really like this solution, as it will still report a
> 	deadlock.
> 
> Is it possible to switch the cpu_read_lock() to be taken before the
> hwlat_data.lock?

It's a little hard to change the sequence of these two locks, we'll hold
"cpu_hotplug_lock" for longer unnecessarily if we do that.

But maybe we can remove the "hwlat_data.lock" in kthread_fn(), let me try
another modification.


Thanks,
Wei
Steven Rostedt Nov. 12, 2024, 11:50 p.m. UTC | #3
On Wed, 9 Oct 2024 15:47:23 +0800
"liwei (GF)" <liwei391@huawei.com> wrote:

> On 2024/10/4 4:19, Steven Rostedt wrote:
> > On Tue, 24 Sep 2024 17:45:15 +0800
> > Wei Li <liwei391@huawei.com> wrote:
> >   
> >> Another "hung task" error was reported during the test, and i figured out
> >> the deadlock scenario is as follows:
> >>
> >> T1 [BP]               | T2 [AP]                     | T3 [hwlatd/1]                  | T4
> >> work_for_cpu_fn()     | cpuhp_thread_fun()          | kthread_fn()                   | hwlat_hotplug_workfn()
> >>   _cpu_down()         |   stop_cpu_kthread()        |                                |   mutex_lock(&hwlat_data.lock)
> >>     cpus_write_lock() |     kthread_stop(hwlatd/1)  |   mutex_lock(&hwlat_data.lock) |
> >>     __cpuhp_kick_ap() |       wait_for_completion() |                                |   cpus_read_lock()

So, if we can make T3 not take the mutex_lock then that should be a
solution, right?

> >>
> >> It constitutes ABBA deadlock indirectly beAs it calls msleep_interruptible() and 'break' if signal pending below, i choosed 'break' here too.tween "cpu_hotplug_lock" and
> >> "hwlat_data.lock", make the mutex obtaining in kthread_fn() interruptible
> >> to fix this.
> >>
> >> Fixes: ba998f7d9531 ("trace/hwlat: Support hotplug operations")
> >> Signed-off-by: Wei Li <liwei391@huawei.com>
> >> ---
> >>  kernel/trace/trace_hwlat.c | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
> >> index 3bd6071441ad..4c228ccb8a38 100644
> >> --- a/kernel/trace/trace_hwlat.c
> >> +++ b/kernel/trace/trace_hwlat.c
> >> @@ -370,7 +370,8 @@ static int kthread_fn(void *data)
> >>  		get_sample();
> >>  		local_irq_enable();
> >>  
> >> -		mutex_lock(&hwlat_data.lock);
> >> +		if (mutex_lock_interruptible(&hwlat_data.lock))
> >> +			break;  
> > 
> > So basically this requires as signal to break it out of the loop?
> > 
> > But if it receives a signal for any other reason, it breaks out of the loop
> > too. Which is not what we want. If anything, it should be:
> > 
> > 		if (mutex_lock_interruptible(&hwlat_data.lock))
> > 			continue;  
> 
> As it calls msleep_interruptible() below and 'break' if signal pending, i
> choosed 'break' here too.
> 
> > But I still don't really like this solution, as it will still report a
> > 	deadlock.
> > 
> > Is it possible to switch the cpu_read_lock() to be taken before the
> > hwlat_data.lock?  
> 
> It's a little hard to change the sequence of these two locks, we'll hold
> "cpu_hotplug_lock" for longer unnecessarily if we do that.
> 
> But maybe we can remove the "hwlat_data.lock" in kthread_fn(), let me try
> another modification.

Have you found something yet? Looking at the code we have:

		mutex_lock(&hwlat_data.lock);
		interval = hwlat_data.sample_window - hwlat_data.sample_width;
		mutex_unlock(&hwlat_data.lock);

Where the lock is only there to synchronize the calculation of the
interval. We could add a counter for when sample_window and sample_width
are updated, and we could simply do:

 again:
		counter = atomic_read(&hwlat_data.counter);
		smp_rmb();
		if (!(counter & 1)) {
			new_interval = hwlat_data.sample_window - hwlat_data.sample_width;
			smp_rmb();
			if (counter == atomic_read(&hwlat_data.counter))
				interval = new_interval;
		}

Then we could do something like:

	atomic_inc(&hwlat_data.counter);
	smp_wmb();
	/* update sample_window or sample_width */
	smp_wmb();
	atomic_inc(&hwlat_data.counter);

And then the interval will only be updated if the values are not being
updated. Otherwise it just keeps the previous value.

-- Steve
Wei Li Nov. 14, 2024, 2:06 a.m. UTC | #4
Hi Steven,

On 2024/11/13 7:50, Steven Rostedt wrote:
>>>> Another "hung task" error was reported during the test, and i figured out
>>>> the deadlock scenario is as follows:
>>>>
>>>> T1 [BP]               | T2 [AP]                     | T3 [hwlatd/1]                  | T4
>>>> work_for_cpu_fn()     | cpuhp_thread_fun()          | kthread_fn()                   | hwlat_hotplug_workfn()
>>>>   _cpu_down()         |   stop_cpu_kthread()        |                                |   mutex_lock(&hwlat_data.lock)
>>>>     cpus_write_lock() |     kthread_stop(hwlatd/1)  |   mutex_lock(&hwlat_data.lock) |
>>>>     __cpuhp_kick_ap() |       wait_for_completion() |                                |   cpus_read_lock()
> 
> So, if we can make T3 not take the mutex_lock then that should be a
> solution, right?
> 
>>>>
>>>> It constitutes ABBA deadlock indirectly beAs it calls msleep_interruptible() and 'break' if signal pending below, i choosed 'break' here too.tween "cpu_hotplug_lock" and
>>>> "hwlat_data.lock", make the mutex obtaining in kthread_fn() interruptible
>>>> to fix this.
>>>>
>>>> Fixes: ba998f7d9531 ("trace/hwlat: Support hotplug operations")
>>>> Signed-off-by: Wei Li <liwei391@huawei.com>
>>>> ---
>>>>  kernel/trace/trace_hwlat.c | 3 ++-
>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
>>>> index 3bd6071441ad..4c228ccb8a38 100644
>>>> --- a/kernel/trace/trace_hwlat.c
>>>> +++ b/kernel/trace/trace_hwlat.c
>>>> @@ -370,7 +370,8 @@ static int kthread_fn(void *data)
>>>>  		get_sample();
>>>>  		local_irq_enable();
>>>>  
>>>> -		mutex_lock(&hwlat_data.lock);
>>>> +		if (mutex_lock_interruptible(&hwlat_data.lock))
>>>> +			break;  
>>>
>>> So basically this requires as signal to break it out of the loop?
>>>
>>> But if it receives a signal for any other reason, it breaks out of the loop
>>> too. Which is not what we want. If anything, it should be:
>>>
>>> 		if (mutex_lock_interruptible(&hwlat_data.lock))
>>> 			continue;  
>>
>> As it calls msleep_interruptible() below and 'break' if signal pending, i
>> choosed 'break' here too.
>>
>>> But I still don't really like this solution, as it will still report a
>>> 	deadlock.
>>>
>>> Is it possible to switch the cpu_read_lock() to be taken before the
>>> hwlat_data.lock?  
>>
>> It's a little hard to change the sequence of these two locks, we'll hold
>> "cpu_hotplug_lock" for longer unnecessarily if we do that.
>>
>> But maybe we can remove the "hwlat_data.lock" in kthread_fn(), let me try
>> another modification.
> 
> Have you found something yet? Looking at the code we have:
> 
> 		mutex_lock(&hwlat_data.lock);
> 		interval = hwlat_data.sample_window - hwlat_data.sample_width;
> 		mutex_unlock(&hwlat_data.lock);
> 
> Where the lock is only there to synchronize the calculation of the
> interval. We could add a counter for when sample_window and sample_width
> are updated, and we could simply do:
> 
>  again:
> 		counter = atomic_read(&hwlat_data.counter);
> 		smp_rmb();
> 		if (!(counter & 1)) {
> 			new_interval = hwlat_data.sample_window - hwlat_data.sample_width;
> 			smp_rmb();
> 			if (counter == atomic_read(&hwlat_data.counter))
> 				interval = new_interval;
> 		}
> 
> Then we could do something like:
> 
> 	atomic_inc(&hwlat_data.counter);
> 	smp_wmb();
> 	/* update sample_window or sample_width */
> 	smp_wmb();
> 	atomic_inc(&hwlat_data.counter);
> 
> And then the interval will only be updated if the values are not being
> updated. Otherwise it just keeps the previous value.
> 

Your seqlock-like solution seems to be able to solve this issue, but the difficulty is that
the current updates of sample_window and sample_width are implemented using the framework
of 'trace_min_max_fops'. We cannot add 'atomic_inc(&hwlat_data.counter)' into the update
processes for sample_window and sample_width directly. If we want to remove the mutex_lock
here, maybe we need to break the application of trace_min_max_write(). However, if we do
that, we can add a 'hwlat_data.sample_interval' and update it at the same time as updating
sample_window and sample_width. I didn't figure out an elegant solution yet.

Thanks,
Wei
diff mbox series

Patch

diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index 3bd6071441ad..4c228ccb8a38 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -370,7 +370,8 @@  static int kthread_fn(void *data)
 		get_sample();
 		local_irq_enable();
 
-		mutex_lock(&hwlat_data.lock);
+		if (mutex_lock_interruptible(&hwlat_data.lock))
+			break;
 		interval = hwlat_data.sample_window - hwlat_data.sample_width;
 		mutex_unlock(&hwlat_data.lock);