drm/amdgpu: Fix recursive locking warning

Message ID 20220204031139.24717-1-rajneesh.bhardwaj@amd.com (mailing list archive)
State New, archived

Commit Message

Rajneesh Bhardwaj Feb. 4, 2022, 3:11 a.m. UTC
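Lockdep reported the warning below while running a PyTorch workload on
Vega10 GPUs. amdgpu_bo_release_notify() takes the reservation lock of the
BO being released while the task can already hold another reservation
lock, so use dma_resv_trylock() instead of dma_resv_lock() to avoid
conflicts with already held reservation locks.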

[  +0.000003] WARNING: possible recursive locking detected
[  +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
[  +0.000004] --------------------------------------------
[  +0.000002] python/4822 is trying to acquire lock:
[  +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000203]
              but task is already holding lock:
[  +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[  +0.000017]
              other info that might help us debug this:
[  +0.000002]  Possible unsafe locking scenario:

[  +0.000003]        CPU0
[  +0.000002]        ----
[  +0.000002]   lock(reservation_ww_class_mutex);
[  +0.000004]   lock(reservation_ww_class_mutex);
[  +0.000003]
               *** DEADLOCK ***

[  +0.000002]  May be due to missing lock nesting notation

[  +0.000003] 7 locks held by python/4822:
[  +0.000003]  #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at: kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
[  +0.000232]  #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
[  +0.000241]  #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
[  +0.000236]  #3: ffffb2b35606fd28 (reservation_ww_class_acquire){+.+.}-{0:0}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
[  +0.000235]  #4: ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[  +0.000015]  #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at: drm_dev_enter+0x5/0xa0 [drm]
[  +0.000038]  #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3}, at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
[  +0.000195]
              stack backtrace:
[  +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted 5.13.0-kfd-rajneesh #1030
[  +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02 08/29/2018
[  +0.000003] Call Trace:
[  +0.000003]  dump_stack+0x6d/0x89
[  +0.000010]  __lock_acquire+0xb93/0x1a90
[  +0.000009]  lock_acquire+0x25d/0x2d0
[  +0.000005]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000184]  ? lock_is_held_type+0xa2/0x110
[  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000184]  __ww_mutex_lock.constprop.17+0xca/0x1060
[  +0.000007]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000183]  ? lock_release+0x13f/0x270
[  +0.000005]  ? lock_is_held_type+0xa2/0x110
[  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000183]  amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000185]  ttm_bo_release+0x4c6/0x580 [ttm]
[  +0.000010]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[  +0.000183]  amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
[  +0.000189]  amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
[  +0.000189]  amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
[  +0.000191]  amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
[  +0.000191]  amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
[  +0.000191]  update_gpuvm_pte+0xcc/0x290 [amdgpu]
[  +0.000229]  ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
[  +0.000190]  amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60 [amdgpu]
[  +0.000234]  kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
[  +0.000218]  kfd_ioctl+0x2b9/0x600 [amdgpu]
[  +0.000216]  ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
[  +0.000216]  ? lock_release+0x13f/0x270
[  +0.000006]  ? __fget_files+0x107/0x1e0
[  +0.000007]  __x64_sys_ioctl+0x8b/0xd0
[  +0.000007]  do_syscall_64+0x36/0x70
[  +0.000004]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000007] RIP: 0033:0x7fbff90a7317
[  +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
[  +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX: 00007fbff90a7317
[  +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI: 0000000000000004
[  +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09: 00007fbcc402d880
[  +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12: 00000000c0184b18
[  +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15: 00007fbcc402d820

Cc: Christian König <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>

Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is enabled")
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
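
The report above shows two different lock addresses that share one lock
class (reservation_ww_class_mutex). As a rough illustration of why a
class-based checker flags that, and why a trylock does not trip the same
check, here is a toy userspace model. It is not lockdep's implementation:
every toy_* name is hypothetical, and it ignores nesting annotations,
acquire contexts and everything else the real checker tracks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_HELD 8

/* Hypothetical lock classes; many lock instances can share one class. */
enum toy_class { TOY_CLASS_PROCESS, TOY_CLASS_RESV };

/* Per-task stack of the classes of the locks currently held. */
struct toy_task {
	enum toy_class held[TOY_MAX_HELD];
	size_t depth;
};

/*
 * Record an acquisition and return true if this toy checker would report
 * "possible recursive locking". A blocking acquire of a class the task
 * already holds is flagged. A trylock is recorded but never flagged,
 * because it cannot wait and therefore cannot deadlock on its own.
 */
static bool toy_acquire(struct toy_task *t, enum toy_class c, bool trylock)
{
	bool warn = false;

	if (!trylock) {
		for (size_t i = 0; i < t->depth; i++)
			if (t->held[i] == c)
				warn = true;
	}

	assert(t->depth < TOY_MAX_HELD);
	t->held[t->depth++] = c;
	return warn;
}

/* Drop the most recently acquired lock. */
static void toy_release(struct toy_task *t)
{
	if (t->depth > 0)
		t->depth--;
}
```

In this model, a task that already holds one TOY_CLASS_RESV lock gets a
warning on a second blocking acquire of that class, but not when the
second acquire is a trylock.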

Comments

Christian König Feb. 4, 2022, 7:13 a.m. UTC | #1
Am 04.02.22 um 04:11 schrieb Rajneesh Bhardwaj:
> Noticed the below warning while running a pytorch workload on vega10
> GPUs. Change to trylock to avoid conflicts with already held reservation
> locks.
>
> [...]
>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Alex Deucher <Alexander.Deucher@amd.com>
>
> Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
> enabled")
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>

The fixes tag is not necessarily correct, I would remove that.

But apart from that the patch is Reviewed-by: Christian König 
<christian.koenig@amd.com>.

Thanks,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 36bb41b027ec..6ccd2be685f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -1306,7 +1306,8 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
>   	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>   		return;
>   
> -	dma_resv_lock(bo->base.resv, NULL);
> +	if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
> +		return;
>   
>   	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
>   	if (!WARN_ON(r)) {

Felix Kuehling Feb. 4, 2022, 4:23 p.m. UTC | #2
Am 2022-02-04 um 02:13 schrieb Christian König:
> Am 04.02.22 um 04:11 schrieb Rajneesh Bhardwaj:
>> Noticed the below warning while running a pytorch workload on vega10
>> GPUs. Change to trylock to avoid conflicts with already held reservation
>> locks.
>>
>> [...]
>>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>> Cc: Alex Deucher <Alexander.Deucher@amd.com>
>>
>> Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
>> enabled")
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>
> The fixes tag is not necessarily correct, I would remove that.
>
> But apart from that the patch is Reviewed-by: Christian König 
> <christian.koenig@amd.com>.

I suggested the Fixes tag since it was my patch that introduced the 
problem. Without my patch, page table BOs wouldn't be cleared here, and 
it wouldn't get that recursive lock warning.

Either way, the patch is also

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


Christian König Feb. 4, 2022, 4:25 p.m. UTC | #3
Am 04.02.22 um 17:23 schrieb Felix Kuehling:
>
> Am 2022-02-04 um 02:13 schrieb Christian König:
>> Am 04.02.22 um 04:11 schrieb Rajneesh Bhardwaj:
>>> Noticed the below warning while running a pytorch workload on vega10
>>> GPUs. Change to trylock to avoid conflicts with already held
>>> reservation locks.
>>>
>>> [...]
>>>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Cc: Alex Deucher <Alexander.Deucher@amd.com>
>>>
>>> Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
>>> enabled")
>>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>>
>> The fixes tag is not necessarily correct, I would remove that.
>>
>> But apart from that the patch is Reviewed-by: Christian König 
>> <christian.koenig@amd.com>.
>
> I suggested the Fixes tag since it was my patch that introduced the 
> problem. Without my patch, page table BOs wouldn't be cleared here, 
> and it wouldn't get that recursive lock warning.

Yeah, but the problem existed before that. E.g. it can happen that we 
drop the last reference during validation as well.

So this is valuable to backport even without your patch.

Regards,
Christian.
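
One way to picture the point about dropping the last reference: the
release hook runs synchronously in whatever context performs the final
put, so it inherits every lock that caller already holds. The sketch
below is a single-threaded userspace toy, not TTM or amdgpu code; all
toy_* names are hypothetical and the refcount is deliberately
non-atomic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object with a release hook. */
struct toy_obj {
	unsigned int refcount;
	void (*release)(struct toy_obj *obj);
};

/* Number of locks the (single) caller holds right now. */
static int toy_locks_held;

/* What the release hook observed; -1 means it has not run yet. */
static int toy_locks_seen_in_release = -1;

/* Release hook: just record the locking context it was called in. */
static void toy_record_release(struct toy_obj *obj)
{
	(void)obj;
	toy_locks_seen_in_release = toy_locks_held;
}

/*
 * Drop one reference. When the count reaches zero the release hook is
 * called directly from here, i.e. in the caller's locking context.
 * Returns true if this call released the object.
 */
static bool toy_obj_put(struct toy_obj *obj)
{
	assert(obj->refcount > 0);
	if (--obj->refcount != 0)
		return false;

	obj->release(obj);
	return true;
}
```

If the final toy_obj_put() happens while toy_locks_held is nonzero, the
hook sees that same nonzero value, which is the situation a blocking
lock in the release path has to cope with.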

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 36bb41b027ec..6ccd2be685f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1306,7 +1306,8 @@  void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
 	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
 		return;
 
-	dma_resv_lock(bo->base.resv, NULL);
+	if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
+		return;
 
 	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
 	if (!WARN_ON(r)) {
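
The behavioural change in the hunk above can be sketched in userspace as
follows: the release path only wipes the buffer when the lock can be
taken without waiting, and otherwise returns early. This is a minimal
model under stated assumptions, not the driver code: the toy_* names,
the atomic_bool used as a lock, the 16-byte buffer and the TOY_POISON
value are all hypothetical, and the WARN_ON_ONCE() from the patch is
reduced to a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define TOY_POISON 0xA5 /* arbitrary value, not the driver's AMDGPU_POISON */

/* Hypothetical stand-in for a buffer object and its reservation lock. */
struct toy_bo {
	atomic_bool locked;     /* stand-in for bo->base.resv */
	unsigned char data[16]; /* stand-in for the buffer contents */
};

static void toy_bo_init(struct toy_bo *bo)
{
	atomic_init(&bo->locked, false);
	memset(bo->data, 0x11, sizeof(bo->data));
}

/* Try to take the lock; returns true on success and never waits. */
static bool toy_resv_trylock(struct toy_bo *bo)
{
	return !atomic_exchange(&bo->locked, true);
}

static void toy_resv_unlock(struct toy_bo *bo)
{
	bool was_locked = atomic_exchange(&bo->locked, false);

	assert(was_locked);
	(void)was_locked;
}

/*
 * Release hook modelled on the patched flow: wipe the buffer only if the
 * lock is free. Returns true if the wipe happened.
 */
static bool toy_release_notify(struct toy_bo *bo)
{
	if (!toy_resv_trylock(bo))
		return false; /* the patch warns once here, then returns */

	memset(bo->data, TOY_POISON, sizeof(bo->data));
	toy_resv_unlock(bo);
	return true;
}
```

The trade-off this makes visible is that a contended lock now means the
wipe is skipped rather than waited for.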