Message ID | 20220204031139.24717-1-rajneesh.bhardwaj@amd.com (mailing list archive) |
---|---|
State | New, archived |
Series | drm/amdgpu: Fix recursive locking warning |
On 04.02.22 at 04:11, Rajneesh Bhardwaj wrote:
> Noticed the below warning while running a pytorch workload on vega10
> GPUs. Change to trylock to avoid conflicts with already held reservation
> locks.
>
> [ +0.000003] WARNING: possible recursive locking detected
> [ +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
> [ +0.000004] --------------------------------------------
> [ +0.000002] python/4822 is trying to acquire lock:
> [ +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3},
> at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
> [ +0.000203]
> but task is already holding lock:
> [ +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3},
> at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
> [ +0.000017]
> other info that might help us debug this:
> [ +0.000002] Possible unsafe locking scenario:
>
> [ +0.000003] CPU0
> [ +0.000002] ----
> [ +0.000002] lock(reservation_ww_class_mutex);
> [ +0.000004] lock(reservation_ww_class_mutex);
> [ +0.000003]
> *** DEADLOCK ***
>
> [ +0.000002] May be due to missing lock nesting notation
>
> [ +0.000003] 7 locks held by python/4822:
> [ +0.000003] #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
> kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
> [ +0.000232] #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
> [ +0.000241] #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
> [ +0.000236] #3: ffffb2b35606fd28
> (reservation_ww_class_acquire){+.+.}-{0:0}, at:
> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
> [ +0.000235] #4: ffff932cbb7181f8
> (reservation_ww_class_mutex){+.+.}-{3:3}, at:
> ttm_eu_reserve_buffers+0x270/0x470 [ttm]
> [ +0.000015] #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
> drm_dev_enter+0x5/0xa0 [drm]
> [ +0.000038] #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
> at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
> [ +0.000195]
> stack backtrace:
> [ +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
> 5.13.0-kfd-rajneesh #1030
> [ +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
> 08/29/2018
> [ +0.000003] Call Trace:
> [ +0.000003] dump_stack+0x6d/0x89
> [ +0.000010] __lock_acquire+0xb93/0x1a90
> [ +0.000009] lock_acquire+0x25d/0x2d0
> [ +0.000005] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
> [ +0.000184] ? lock_is_held_type+0xa2/0x110
> [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
> [ +0.000184] __ww_mutex_lock.constprop.17+0xca/0x1060
> [ +0.000007] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
> [ +0.000183] ? lock_release+0x13f/0x270
> [ +0.000005] ? lock_is_held_type+0xa2/0x110
> [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
> [ +0.000183] amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
> [ +0.000185] ttm_bo_release+0x4c6/0x580 [ttm]
> [ +0.000010] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> [ +0.000183] amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
> [ +0.000189] amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
> [ +0.000189] amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
> [ +0.000191] amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
> [ +0.000191] amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
> [ +0.000191] update_gpuvm_pte+0xcc/0x290 [amdgpu]
> [ +0.000229] ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
> [ +0.000190] amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
> [amdgpu]
> [ +0.000234] kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
> [ +0.000218] kfd_ioctl+0x2b9/0x600 [amdgpu]
> [ +0.000216] ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
> [ +0.000216] ? lock_release+0x13f/0x270
> [ +0.000006] ? __fget_files+0x107/0x1e0
> [ +0.000007] __x64_sys_ioctl+0x8b/0xd0
> [ +0.000007] do_syscall_64+0x36/0x70
> [ +0.000004] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ +0.000007] RIP: 0033:0x7fbff90a7317
> [ +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
> 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
> [ +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
> 00007fbff90a7317
> [ +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
> 0000000000000004
> [ +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
> 00007fbcc402d880
> [ +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
> 00000000c0184b18
> [ +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
> 00007fbcc402d820
>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Alex Deucher <Alexander.Deucher@amd.com>
>
> Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
> enabled")
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>

The fixes tag is not necessarily correct, I would remove that.

But apart from that the patch is Reviewed-by: Christian König
<christian.koenig@amd.com>.

Thanks,
Christian.

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 36bb41b027ec..6ccd2be685f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -1306,7 +1306,8 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
>  	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>  		return;
>
> -	dma_resv_lock(bo->base.resv, NULL);
> +	if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
> +		return;
>
>  	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
>  	if (!WARN_ON(r)) {

On 2022-02-04 at 02:13, Christian König wrote:
> On 04.02.22 at 04:11, Rajneesh Bhardwaj wrote:
>> Noticed the below warning while running a pytorch workload on vega10
>> GPUs. Change to trylock to avoid conflicts with already held reservation
>> locks.
>>
>> [ +0.000003] WARNING: possible recursive locking detected
>> [ +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
>> [ +0.000004] --------------------------------------------
>> [ +0.000002] python/4822 is trying to acquire lock:
>> [ +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3},
>> at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000203]
>> but task is already holding lock:
>> [ +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3},
>> at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
>> [ +0.000017]
>> other info that might help us debug this:
>> [ +0.000002] Possible unsafe locking scenario:
>>
>> [ +0.000003] CPU0
>> [ +0.000002] ----
>> [ +0.000002] lock(reservation_ww_class_mutex);
>> [ +0.000004] lock(reservation_ww_class_mutex);
>> [ +0.000003]
>> *** DEADLOCK ***
>>
>> [ +0.000002] May be due to missing lock nesting notation
>>
>> [ +0.000003] 7 locks held by python/4822:
>> [ +0.000003] #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
>> kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
>> [ +0.000232] #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
>> [ +0.000241] #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
>> [ +0.000236] #3: ffffb2b35606fd28
>> (reservation_ww_class_acquire){+.+.}-{0:0}, at:
>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
>> [ +0.000235] #4: ffff932cbb7181f8
>> (reservation_ww_class_mutex){+.+.}-{3:3}, at:
>> ttm_eu_reserve_buffers+0x270/0x470 [ttm]
>> [ +0.000015] #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
>> drm_dev_enter+0x5/0xa0 [drm]
>> [ +0.000038] #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
>> at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
>> [ +0.000195]
>> stack backtrace:
>> [ +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
>> 5.13.0-kfd-rajneesh #1030
>> [ +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
>> 08/29/2018
>> [ +0.000003] Call Trace:
>> [ +0.000003] dump_stack+0x6d/0x89
>> [ +0.000010] __lock_acquire+0xb93/0x1a90
>> [ +0.000009] lock_acquire+0x25d/0x2d0
>> [ +0.000005] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000184] ? lock_is_held_type+0xa2/0x110
>> [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000184] __ww_mutex_lock.constprop.17+0xca/0x1060
>> [ +0.000007] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000183] ? lock_release+0x13f/0x270
>> [ +0.000005] ? lock_is_held_type+0xa2/0x110
>> [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000183] amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>> [ +0.000185] ttm_bo_release+0x4c6/0x580 [ttm]
>> [ +0.000010] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
>> [ +0.000183] amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
>> [ +0.000189] amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
>> [ +0.000189] amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
>> [ +0.000191] amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
>> [ +0.000191] amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
>> [ +0.000191] update_gpuvm_pte+0xcc/0x290 [amdgpu]
>> [ +0.000229] ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
>> [ +0.000190] amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
>> [amdgpu]
>> [ +0.000234] kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
>> [ +0.000218] kfd_ioctl+0x2b9/0x600 [amdgpu]
>> [ +0.000216] ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
>> [ +0.000216] ? lock_release+0x13f/0x270
>> [ +0.000006] ? __fget_files+0x107/0x1e0
>> [ +0.000007] __x64_sys_ioctl+0x8b/0xd0
>> [ +0.000007] do_syscall_64+0x36/0x70
>> [ +0.000004] entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [ +0.000007] RIP: 0033:0x7fbff90a7317
>> [ +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
>> 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
>> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
>> [ +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
>> 0000000000000010
>> [ +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
>> 00007fbff90a7317
>> [ +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
>> 0000000000000004
>> [ +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
>> 00007fbcc402d880
>> [ +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
>> 00000000c0184b18
>> [ +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
>> 00007fbcc402d820
>>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>> Cc: Alex Deucher <Alexander.Deucher@amd.com>
>>
>> Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
>> enabled")
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>
> The fixes tag is not necessarily correct, I would remove that.
>
> But apart from that the patch is Reviewed-by: Christian König
> <christian.koenig@amd.com>.

I suggested the Fixes tag since it was my patch that introduced the
problem. Without my patch, page table BOs wouldn't be cleared here,
and it wouldn't get that recursive lock warning.

Either way, the patch is also

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

>
> Thanks,
> Christian.
>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> index 36bb41b027ec..6ccd2be685f5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> @@ -1306,7 +1306,8 @@ void amdgpu_bo_release_notify(struct
>> ttm_buffer_object *bo)
>>  	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>>  		return;
>> -	dma_resv_lock(bo->base.resv, NULL);
>> +	if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
>> +		return;
>>  	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv,
>> &fence);
>>  	if (!WARN_ON(r)) {
>

On 04.02.22 at 17:23, Felix Kuehling wrote:
>
> On 2022-02-04 at 02:13, Christian König wrote:
>> On 04.02.22 at 04:11, Rajneesh Bhardwaj wrote:
>>> Noticed the below warning while running a pytorch workload on vega10
>>> GPUs. Change to trylock to avoid conflicts with already held
>>> reservation
>>> locks.
>>>
>>> [ +0.000003] WARNING: possible recursive locking detected
>>> [ +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
>>> [ +0.000004] --------------------------------------------
>>> [ +0.000002] python/4822 is trying to acquire lock:
>>> [ +0.000004] ffff932cd9a259f8
>>> (reservation_ww_class_mutex){+.+.}-{3:3},
>>> at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000203]
>>> but task is already holding lock:
>>> [ +0.000003] ffff932cbb7181f8
>>> (reservation_ww_class_mutex){+.+.}-{3:3},
>>> at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
>>> [ +0.000017]
>>> other info that might help us debug this:
>>> [ +0.000002] Possible unsafe locking scenario:
>>>
>>> [ +0.000003] CPU0
>>> [ +0.000002] ----
>>> [ +0.000002] lock(reservation_ww_class_mutex);
>>> [ +0.000004] lock(reservation_ww_class_mutex);
>>> [ +0.000003]
>>> *** DEADLOCK ***
>>>
>>> [ +0.000002] May be due to missing lock nesting notation
>>>
>>> [ +0.000003] 7 locks held by python/4822:
>>> [ +0.000003] #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
>>> kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
>>> [ +0.000232] #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
>>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
>>> [ +0.000241] #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
>>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
>>> [ +0.000236] #3: ffffb2b35606fd28
>>> (reservation_ww_class_acquire){+.+.}-{0:0}, at:
>>> amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
>>> [ +0.000235] #4: ffff932cbb7181f8
>>> (reservation_ww_class_mutex){+.+.}-{3:3}, at:
>>> ttm_eu_reserve_buffers+0x270/0x470 [ttm]
>>> [ +0.000015] #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
>>> drm_dev_enter+0x5/0xa0 [drm]
>>> [ +0.000038] #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
>>> at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
>>> [ +0.000195]
>>> stack backtrace:
>>> [ +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
>>> 5.13.0-kfd-rajneesh #1030
>>> [ +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
>>> 08/29/2018
>>> [ +0.000003] Call Trace:
>>> [ +0.000003] dump_stack+0x6d/0x89
>>> [ +0.000010] __lock_acquire+0xb93/0x1a90
>>> [ +0.000009] lock_acquire+0x25d/0x2d0
>>> [ +0.000005] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000184] ? lock_is_held_type+0xa2/0x110
>>> [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000184] __ww_mutex_lock.constprop.17+0xca/0x1060
>>> [ +0.000007] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000183] ? lock_release+0x13f/0x270
>>> [ +0.000005] ? lock_is_held_type+0xa2/0x110
>>> [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000183] amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
>>> [ +0.000185] ttm_bo_release+0x4c6/0x580 [ttm]
>>> [ +0.000010] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
>>> [ +0.000183] amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
>>> [ +0.000189] amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
>>> [ +0.000189] amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
>>> [ +0.000191] amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
>>> [ +0.000191] amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
>>> [ +0.000191] update_gpuvm_pte+0xcc/0x290 [amdgpu]
>>> [ +0.000229] ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
>>> [ +0.000190] amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
>>> [amdgpu]
>>> [ +0.000234] kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
>>> [ +0.000218] kfd_ioctl+0x2b9/0x600 [amdgpu]
>>> [ +0.000216] ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
>>> [ +0.000216] ? lock_release+0x13f/0x270
>>> [ +0.000006] ? __fget_files+0x107/0x1e0
>>> [ +0.000007] __x64_sys_ioctl+0x8b/0xd0
>>> [ +0.000007] do_syscall_64+0x36/0x70
>>> [ +0.000004] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [ +0.000007] RIP: 0033:0x7fbff90a7317
>>> [ +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
>>> 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
>>> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
>>> [ +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
>>> 0000000000000010
>>> [ +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
>>> 00007fbff90a7317
>>> [ +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
>>> 0000000000000004
>>> [ +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
>>> 00007fbcc402d880
>>> [ +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
>>> 00000000c0184b18
>>> [ +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
>>> 00007fbcc402d820
>>>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Cc: Alex Deucher <Alexander.Deucher@amd.com>
>>>
>>> Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
>>> enabled")
>>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>>
>> The fixes tag is not necessarily correct, I would remove that.
>>
>> But apart from that the patch is Reviewed-by: Christian König
>> <christian.koenig@amd.com>.
>
> I suggested the Fixes tag since it was my patch that introduced the
> problem. Without my patch, page table BOs wouldn't be cleared here,
> and it wouldn't get that recursive lock warning.

Yeah, but the problem existed before that. E.g. it can happen that we
drop the last reference during validation as well.

So this is valuable to backport even without your patch.

Regards,
Christian.

>
> Either way, the patch is also
>
> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
>
>
>>
>> Thanks,
>> Christian.
>>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> index 36bb41b027ec..6ccd2be685f5 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> @@ -1306,7 +1306,8 @@ void amdgpu_bo_release_notify(struct
>>> ttm_buffer_object *bo)
>>>  	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>>>  		return;
>>> -	dma_resv_lock(bo->base.resv, NULL);
>>> +	if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
>>> +		return;
>>>  	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv,
>>> &fence);
>>>  	if (!WARN_ON(r)) {
>>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 36bb41b027ec..6ccd2be685f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1306,7 +1306,8 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
 	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
 		return;
 
-	dma_resv_lock(bo->base.resv, NULL);
+	if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
+		return;
 
 	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
 	if (!WARN_ON(r)) {

Noticed the below warning while running a pytorch workload on vega10
GPUs. Change to trylock to avoid conflicts with already held reservation
locks.

[ +0.000003] WARNING: possible recursive locking detected
[ +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
[ +0.000004] --------------------------------------------
[ +0.000002] python/4822 is trying to acquire lock:
[ +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3},
at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000203]
but task is already holding lock:
[ +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3},
at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[ +0.000017]
other info that might help us debug this:
[ +0.000002] Possible unsafe locking scenario:

[ +0.000003] CPU0
[ +0.000002] ----
[ +0.000002] lock(reservation_ww_class_mutex);
[ +0.000004] lock(reservation_ww_class_mutex);
[ +0.000003]
*** DEADLOCK ***

[ +0.000002] May be due to missing lock nesting notation

[ +0.000003] 7 locks held by python/4822:
[ +0.000003] #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
[ +0.000232] #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
[ +0.000241] #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
[ +0.000236] #3: ffffb2b35606fd28
(reservation_ww_class_acquire){+.+.}-{0:0}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
[ +0.000235] #4: ffff932cbb7181f8
(reservation_ww_class_mutex){+.+.}-{3:3}, at:
ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[ +0.000015] #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
drm_dev_enter+0x5/0xa0 [drm]
[ +0.000038] #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
[ +0.000195]
stack backtrace:
[ +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
5.13.0-kfd-rajneesh #1030
[ +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
08/29/2018
[ +0.000003] Call Trace:
[ +0.000003] dump_stack+0x6d/0x89
[ +0.000010] __lock_acquire+0xb93/0x1a90
[ +0.000009] lock_acquire+0x25d/0x2d0
[ +0.000005] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000184] ? lock_is_held_type+0xa2/0x110
[ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000184] __ww_mutex_lock.constprop.17+0xca/0x1060
[ +0.000007] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000183] ? lock_release+0x13f/0x270
[ +0.000005] ? lock_is_held_type+0xa2/0x110
[ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000183] amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000185] ttm_bo_release+0x4c6/0x580 [ttm]
[ +0.000010] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[ +0.000183] amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
[ +0.000189] amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
[ +0.000189] amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
[ +0.000191] amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
[ +0.000191] amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
[ +0.000191] update_gpuvm_pte+0xcc/0x290 [amdgpu]
[ +0.000229] ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
[ +0.000190] amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
[amdgpu]
[ +0.000234] kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
[ +0.000218] kfd_ioctl+0x2b9/0x600 [amdgpu]
[ +0.000216] ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
[ +0.000216] ? lock_release+0x13f/0x270
[ +0.000006] ? __fget_files+0x107/0x1e0
[ +0.000007] __x64_sys_ioctl+0x8b/0xd0
[ +0.000007] do_syscall_64+0x36/0x70
[ +0.000004] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ +0.000007] RIP: 0033:0x7fbff90a7317
[ +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
[ +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
00007fbff90a7317
[ +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
0000000000000004
[ +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
00007fbcc402d880
[ +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
00000000c0184b18
[ +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
00007fbcc402d820

Cc: Christian König <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>

Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
enabled")
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
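
The lock-to-trylock change can be illustrated outside the kernel. The sketch below is a userspace analogy only, not the amdgpu code: a pthread mutex stands in for the reservation lock, and `struct fake_bo`, `fake_bo_init` and `fake_bo_release_notify` are hypothetical names made up for this illustration. It assumes a default (non-recursive) mutex, for which `pthread_mutex_trylock` returns immediately with a nonzero value when the lock is already held instead of blocking. Note one simplification: the trace involves two different lock instances of the same lock class, while the sketch uses a single lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a buffer object guarded by a reservation lock. */
struct fake_bo {
	pthread_mutex_t resv;	/* plays the role of bo->base.resv */
	bool wiped;		/* set once the wipe-on-release step has run */
};

void fake_bo_init(struct fake_bo *bo)
{
	int rc = pthread_mutex_init(&bo->resv, NULL);

	assert(rc == 0);
	(void)rc;		/* keep the variable used when NDEBUG is set */
	bo->wiped = false;
}

/*
 * Release callback that may run while the caller already holds the lock.
 * A blocking pthread_mutex_lock() here could hang in that case, so the
 * callback only tries the lock and skips the wipe when it is busy.
 * Returns true if the wipe was performed.
 */
bool fake_bo_release_notify(struct fake_bo *bo)
{
	if (pthread_mutex_trylock(&bo->resv) != 0) {
		fprintf(stderr, "release_notify: lock busy, skipping wipe\n");
		return false;
	}

	bo->wiped = true;	/* stands in for the buffer fill */
	pthread_mutex_unlock(&bo->resv);
	return true;
}
```

Called with the lock free, `fake_bo_release_notify` performs the wipe; called while the same thread holds `resv`, it reports the busy lock and returns without blocking. The cost of this design is that the wipe is skipped in the busy case, which is the same trade-off the patch makes by returning early after `WARN_ON_ONCE`.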