diff mbox series

[1/2] accel/ivpu: Fix deadlock in ivpu_ms_cleanup()

Message ID 20250325114306.3740022-2-maciej.falkowski@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series PM fixes in Metric Steamer code | expand

Commit Message

Maciej Falkowski March 25, 2025, 11:43 a.m. UTC
From: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>

Fix deadlock in ivpu_ms_cleanup() by preventing runtime resume after
file_priv->ms_lock is acquired.

During a failure in runtime resume, a cold boot is executed, which
calls ivpu_ms_cleanup_all(). This function calls ivpu_ms_cleanup()
that acquires file_priv->ms_lock and causes the deadlock.

Fixes: cdfad4db7756 ("accel/ivpu: Add NPU profiling support")
Cc: <stable@vger.kernel.org> # v6.11+
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
---
 drivers/accel/ivpu/ivpu_ms.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Lizhi Hou March 25, 2025, 8:50 p.m. UTC | #1
On 3/25/25 04:43, Maciej Falkowski wrote:
> From: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>
> Fix deadlock in ivpu_ms_cleanup() by preventing runtime resume after
> file_priv->ms_lock is acquired.
>
> During a failure in runtime resume, a cold boot is executed, which
> calls ivpu_ms_cleanup_all(). This function calls ivpu_ms_cleanup()
> that acquires file_priv->ms_lock and causes the deadlock.
>
> Fixes: cdfad4db7756 ("accel/ivpu: Add NPU profiling support")
> Cc: <stable@vger.kernel.org> # v6.11+
> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
> ---
>   drivers/accel/ivpu/ivpu_ms.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/drivers/accel/ivpu/ivpu_ms.c b/drivers/accel/ivpu/ivpu_ms.c
> index ffe7b10f8a76..eb485cf15ad6 100644
> --- a/drivers/accel/ivpu/ivpu_ms.c
> +++ b/drivers/accel/ivpu/ivpu_ms.c
> @@ -4,6 +4,7 @@
>    */
>   
>   #include <drm/drm_file.h>
> +#include <linux/pm_runtime.h>
>   
>   #include "ivpu_drv.h"
>   #include "ivpu_gem.h"
> @@ -281,6 +282,9 @@ int ivpu_ms_get_info_ioctl(struct drm_device *dev, void *data, struct drm_file *
>   void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
>   {
>   	struct ivpu_ms_instance *ms, *tmp;
> +	struct ivpu_device *vdev = file_priv->vdev;
> +
> +	pm_runtime_get_sync(vdev->drm.dev);

Could get_sync() be failed here? Maybe it is better to add warning for 
failure?


Lizhi

>   
>   	mutex_lock(&file_priv->ms_lock);
>   
> @@ -293,6 +297,8 @@ void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
>   		free_instance(file_priv, ms);
>   
>   	mutex_unlock(&file_priv->ms_lock);
> +
> +	pm_runtime_put_autosuspend(vdev->drm.dev);
>   }
>   
>   void ivpu_ms_cleanup_all(struct ivpu_device *vdev)
Jacek Lawrynowicz March 26, 2025, 8:06 a.m. UTC | #2
Hi,

On 3/25/2025 9:50 PM, Lizhi Hou wrote:
> 
> On 3/25/25 04:43, Maciej Falkowski wrote:
>> From: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>>
>> Fix deadlock in ivpu_ms_cleanup() by preventing runtime resume after
>> file_priv->ms_lock is acquired.
>>
>> During a failure in runtime resume, a cold boot is executed, which
>> calls ivpu_ms_cleanup_all(). This function calls ivpu_ms_cleanup()
>> that acquires file_priv->ms_lock and causes the deadlock.
>>
>> Fixes: cdfad4db7756 ("accel/ivpu: Add NPU profiling support")
>> Cc: <stable@vger.kernel.org> # v6.11+
>> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>> Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
>> ---
>>   drivers/accel/ivpu/ivpu_ms.c | 6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/accel/ivpu/ivpu_ms.c b/drivers/accel/ivpu/ivpu_ms.c
>> index ffe7b10f8a76..eb485cf15ad6 100644
>> --- a/drivers/accel/ivpu/ivpu_ms.c
>> +++ b/drivers/accel/ivpu/ivpu_ms.c
>> @@ -4,6 +4,7 @@
>>    */
>>     #include <drm/drm_file.h>
>> +#include <linux/pm_runtime.h>
>>     #include "ivpu_drv.h"
>>   #include "ivpu_gem.h"
>> @@ -281,6 +282,9 @@ int ivpu_ms_get_info_ioctl(struct drm_device *dev, void *data, struct drm_file *
>>   void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
>>   {
>>       struct ivpu_ms_instance *ms, *tmp;
>> +    struct ivpu_device *vdev = file_priv->vdev;
>> +
>> +    pm_runtime_get_sync(vdev->drm.dev);
> 
> Could get_sync() be failed here? Maybe it is better to add warning for failure?

Yes, this could fail but we already have detailed warnings in runtime resume callback (ivpu_pm_runtime_resume_cb()).
> 
>>         mutex_lock(&file_priv->ms_lock);
>>   @@ -293,6 +297,8 @@ void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
>>           free_instance(file_priv, ms);
>>         mutex_unlock(&file_priv->ms_lock);
>> +
>> +    pm_runtime_put_autosuspend(vdev->drm.dev);
>>   }
>>     void ivpu_ms_cleanup_all(struct ivpu_device *vdev)

Regards,
Jacek
Lizhi Hou March 27, 2025, 5:38 p.m. UTC | #3
On 3/26/25 01:06, Jacek Lawrynowicz wrote:
> Hi,
>
> On 3/25/2025 9:50 PM, Lizhi Hou wrote:
>> On 3/25/25 04:43, Maciej Falkowski wrote:
>>> From: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>>>
>>> Fix deadlock in ivpu_ms_cleanup() by preventing runtime resume after
>>> file_priv->ms_lock is acquired.
>>>
>>> During a failure in runtime resume, a cold boot is executed, which
>>> calls ivpu_ms_cleanup_all(). This function calls ivpu_ms_cleanup()
>>> that acquires file_priv->ms_lock and causes the deadlock.
>>>
>>> Fixes: cdfad4db7756 ("accel/ivpu: Add NPU profiling support")
>>> Cc: <stable@vger.kernel.org> # v6.11+
>>> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>>> Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
>>> ---
>>>    drivers/accel/ivpu/ivpu_ms.c | 6 ++++++
>>>    1 file changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/accel/ivpu/ivpu_ms.c b/drivers/accel/ivpu/ivpu_ms.c
>>> index ffe7b10f8a76..eb485cf15ad6 100644
>>> --- a/drivers/accel/ivpu/ivpu_ms.c
>>> +++ b/drivers/accel/ivpu/ivpu_ms.c
>>> @@ -4,6 +4,7 @@
>>>     */
>>>      #include <drm/drm_file.h>
>>> +#include <linux/pm_runtime.h>
>>>      #include "ivpu_drv.h"
>>>    #include "ivpu_gem.h"
>>> @@ -281,6 +282,9 @@ int ivpu_ms_get_info_ioctl(struct drm_device *dev, void *data, struct drm_file *
>>>    void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
>>>    {
>>>        struct ivpu_ms_instance *ms, *tmp;
>>> +    struct ivpu_device *vdev = file_priv->vdev;
>>> +
>>> +    pm_runtime_get_sync(vdev->drm.dev);
>> Could get_sync() be failed here? Maybe it is better to add warning for failure?
> Yes, this could fail but we already have detailed warnings in runtime resume callback (ivpu_pm_runtime_resume_cb()).

Will the deadlock still happens if this function fails?


Lizhi

>>>          mutex_lock(&file_priv->ms_lock);
>>>    @@ -293,6 +297,8 @@ void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
>>>            free_instance(file_priv, ms);
>>>          mutex_unlock(&file_priv->ms_lock);
>>> +
>>> +    pm_runtime_put_autosuspend(vdev->drm.dev);
>>>    }
>>>      void ivpu_ms_cleanup_all(struct ivpu_device *vdev)
> Regards,
> Jacek
>
Jacek Lawrynowicz March 28, 2025, 8:23 a.m. UTC | #4
On 3/27/2025 6:38 PM, Lizhi Hou wrote:
> 
> On 3/26/25 01:06, Jacek Lawrynowicz wrote:
>> Hi,
>>
>> On 3/25/2025 9:50 PM, Lizhi Hou wrote:
>>> On 3/25/25 04:43, Maciej Falkowski wrote:
>>>> From: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>>>>
>>>> Fix deadlock in ivpu_ms_cleanup() by preventing runtime resume after
>>>> file_priv->ms_lock is acquired.
>>>>
>>>> During a failure in runtime resume, a cold boot is executed, which
>>>> calls ivpu_ms_cleanup_all(). This function calls ivpu_ms_cleanup()
>>>> that acquires file_priv->ms_lock and causes the deadlock.
>>>>
>>>> Fixes: cdfad4db7756 ("accel/ivpu: Add NPU profiling support")
>>>> Cc: <stable@vger.kernel.org> # v6.11+
>>>> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>>>> Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
>>>> ---
>>>>    drivers/accel/ivpu/ivpu_ms.c | 6 ++++++
>>>>    1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/drivers/accel/ivpu/ivpu_ms.c b/drivers/accel/ivpu/ivpu_ms.c
>>>> index ffe7b10f8a76..eb485cf15ad6 100644
>>>> --- a/drivers/accel/ivpu/ivpu_ms.c
>>>> +++ b/drivers/accel/ivpu/ivpu_ms.c
>>>> @@ -4,6 +4,7 @@
>>>>     */
>>>>      #include <drm/drm_file.h>
>>>> +#include <linux/pm_runtime.h>
>>>>      #include "ivpu_drv.h"
>>>>    #include "ivpu_gem.h"
>>>> @@ -281,6 +282,9 @@ int ivpu_ms_get_info_ioctl(struct drm_device *dev, void *data, struct drm_file *
>>>>    void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
>>>>    {
>>>>        struct ivpu_ms_instance *ms, *tmp;
>>>> +    struct ivpu_device *vdev = file_priv->vdev;
>>>> +
>>>> +    pm_runtime_get_sync(vdev->drm.dev);
>>> Could get_sync() be failed here? Maybe it is better to add warning for failure?
>> Yes, this could fail but we already have detailed warnings in runtime resume callback (ivpu_pm_runtime_resume_cb()).
> 
> Will the deadlock still happens if this function fails?

No. The deadlock was caused by runtime resume in free_instance().
pm_runtime_get_sync() will always bump PM usage counter, so there will be no resume regardless if it fails or not.

>>>>          mutex_lock(&file_priv->ms_lock);
>>>>    @@ -293,6 +297,8 @@ void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
>>>>            free_instance(file_priv, ms);
>>>>          mutex_unlock(&file_priv->ms_lock);
>>>> +
>>>> +    pm_runtime_put_autosuspend(vdev->drm.dev);
>>>>    }
>>>>      void ivpu_ms_cleanup_all(struct ivpu_device *vdev)
>> Regards,
>> Jacek
>>
Lizhi Hou March 28, 2025, 3:29 p.m. UTC | #5
On 3/28/25 01:23, Jacek Lawrynowicz wrote:
>
> On 3/27/2025 6:38 PM, Lizhi Hou wrote:
>> On 3/26/25 01:06, Jacek Lawrynowicz wrote:
>>> Hi,
>>>
>>> On 3/25/2025 9:50 PM, Lizhi Hou wrote:
>>>> On 3/25/25 04:43, Maciej Falkowski wrote:
>>>>> From: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>>>>>
>>>>> Fix deadlock in ivpu_ms_cleanup() by preventing runtime resume after
>>>>> file_priv->ms_lock is acquired.
>>>>>
>>>>> During a failure in runtime resume, a cold boot is executed, which
>>>>> calls ivpu_ms_cleanup_all(). This function calls ivpu_ms_cleanup()
>>>>> that acquires file_priv->ms_lock and causes the deadlock.
>>>>>
>>>>> Fixes: cdfad4db7756 ("accel/ivpu: Add NPU profiling support")
>>>>> Cc: <stable@vger.kernel.org> # v6.11+
>>>>> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>>>>> Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
>>>>> ---
>>>>>     drivers/accel/ivpu/ivpu_ms.c | 6 ++++++
>>>>>     1 file changed, 6 insertions(+)
>>>>>
>>>>> diff --git a/drivers/accel/ivpu/ivpu_ms.c b/drivers/accel/ivpu/ivpu_ms.c
>>>>> index ffe7b10f8a76..eb485cf15ad6 100644
>>>>> --- a/drivers/accel/ivpu/ivpu_ms.c
>>>>> +++ b/drivers/accel/ivpu/ivpu_ms.c
>>>>> @@ -4,6 +4,7 @@
>>>>>      */
>>>>>       #include <drm/drm_file.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>       #include "ivpu_drv.h"
>>>>>     #include "ivpu_gem.h"
>>>>> @@ -281,6 +282,9 @@ int ivpu_ms_get_info_ioctl(struct drm_device *dev, void *data, struct drm_file *
>>>>>     void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
>>>>>     {
>>>>>         struct ivpu_ms_instance *ms, *tmp;
>>>>> +    struct ivpu_device *vdev = file_priv->vdev;
>>>>> +
>>>>> +    pm_runtime_get_sync(vdev->drm.dev);
>>>> Could get_sync() be failed here? Maybe it is better to add warning for failure?
>>> Yes, this could fail but we already have detailed warnings in runtime resume callback (ivpu_pm_runtime_resume_cb()).
>> Will the deadlock still happens if this function fails?
> No. The deadlock was caused by runtime resume in free_instance().
> pm_runtime_get_sync() will always bump PM usage counter, so there will be no resume regardless if it fails or not.
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
>>>>>           mutex_lock(&file_priv->ms_lock);
>>>>>     @@ -293,6 +297,8 @@ void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
>>>>>             free_instance(file_priv, ms);
>>>>>           mutex_unlock(&file_priv->ms_lock);
>>>>> +
>>>>> +    pm_runtime_put_autosuspend(vdev->drm.dev);
>>>>>     }
>>>>>       void ivpu_ms_cleanup_all(struct ivpu_device *vdev)
>>> Regards,
>>> Jacek
>>>
diff mbox series

Patch

diff --git a/drivers/accel/ivpu/ivpu_ms.c b/drivers/accel/ivpu/ivpu_ms.c
index ffe7b10f8a76..eb485cf15ad6 100644
--- a/drivers/accel/ivpu/ivpu_ms.c
+++ b/drivers/accel/ivpu/ivpu_ms.c
@@ -4,6 +4,7 @@ 
  */
 
 #include <drm/drm_file.h>
+#include <linux/pm_runtime.h>
 
 #include "ivpu_drv.h"
 #include "ivpu_gem.h"
@@ -281,6 +282,9 @@  int ivpu_ms_get_info_ioctl(struct drm_device *dev, void *data, struct drm_file *
 void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
 {
 	struct ivpu_ms_instance *ms, *tmp;
+	struct ivpu_device *vdev = file_priv->vdev;
+
+	pm_runtime_get_sync(vdev->drm.dev);
 
 	mutex_lock(&file_priv->ms_lock);
 
@@ -293,6 +297,8 @@  void ivpu_ms_cleanup(struct ivpu_file_priv *file_priv)
 		free_instance(file_priv, ms);
 
 	mutex_unlock(&file_priv->ms_lock);
+
+	pm_runtime_put_autosuspend(vdev->drm.dev);
 }
 
 void ivpu_ms_cleanup_all(struct ivpu_device *vdev)