diff mbox series

[stable,v6.7] drm/nouveau: don't fini scheduler before entity flush

Message ID 20240304170158.4206-1-dakr@redhat.com (mailing list archive)
State New, archived
Headers show
Series [stable,v6.7] drm/nouveau: don't fini scheduler before entity flush | expand

Commit Message

Danilo Krummrich March 4, 2024, 5:01 p.m. UTC
This bug is present in v6.7 only, since the scheduler design has been
re-worked in v6.8.

Client scheduler entities must be flushed before an associated GPU
scheduler is teared down. Otherwise the entitiy might still hold a
pointer to the scheduler's runqueue which is freed at scheduler tear
down already.

[  305.224293] ==================================================================
[  305.224297] BUG: KASAN: slab-use-after-free in drm_sched_entity_flush+0x6c4/0x7b0 [gpu_sched]
[  305.224310] Read of size 8 at addr ffff8881440a8f48 by task rmmod/4436

[  305.224317] CPU: 10 PID: 4436 Comm: rmmod Tainted: G     U             6.7.6-100.fc38.x86_64+debug #1
[  305.224321] Hardware name: Dell Inc. Precision 7550/01PXFR, BIOS 1.27.0 11/08/2023
[  305.224324] Call Trace:
[  305.224327]  <TASK>
[  305.224329]  dump_stack_lvl+0x76/0xd0
[  305.224336]  print_report+0xcf/0x670
[  305.224342]  ? drm_sched_entity_flush+0x6c4/0x7b0 [gpu_sched]
[  305.224352]  ? __virt_addr_valid+0x215/0x410
[  305.224359]  ? drm_sched_entity_flush+0x6c4/0x7b0 [gpu_sched]
[  305.224368]  kasan_report+0xa6/0xe0
[  305.224373]  ? drm_sched_entity_flush+0x6c4/0x7b0 [gpu_sched]
[  305.224385]  drm_sched_entity_flush+0x6c4/0x7b0 [gpu_sched]
[  305.224395]  ? __pfx_drm_sched_entity_flush+0x10/0x10 [gpu_sched]
[  305.224406]  ? rcu_is_watching+0x15/0xb0
[  305.224413]  drm_sched_entity_destroy+0x17/0x20 [gpu_sched]
[  305.224422]  nouveau_cli_fini+0x6c/0x120 [nouveau]
[  305.224658]  nouveau_drm_device_fini+0x2ac/0x490 [nouveau]
[  305.224871]  nouveau_drm_remove+0x18e/0x220 [nouveau]
[  305.225082]  ? __pfx_nouveau_drm_remove+0x10/0x10 [nouveau]
[  305.225290]  ? rcu_is_watching+0x15/0xb0
[  305.225295]  ? _raw_spin_unlock_irqrestore+0x66/0x80
[  305.225299]  ? trace_hardirqs_on+0x16/0x100
[  305.225304]  ? _raw_spin_unlock_irqrestore+0x4f/0x80
[  305.225310]  pci_device_remove+0xa3/0x1d0
[  305.225316]  device_release_driver_internal+0x379/0x540
[  305.225322]  driver_detach+0xc5/0x180
[  305.225327]  bus_remove_driver+0x11e/0x2a0
[  305.225333]  pci_unregister_driver+0x2a/0x250
[  305.225339]  nouveau_drm_exit+0x1f/0x970 [nouveau]
[  305.225548]  __do_sys_delete_module+0x350/0x580
[  305.225554]  ? __pfx___do_sys_delete_module+0x10/0x10
[  305.225562]  ? syscall_enter_from_user_mode+0x26/0x90
[  305.225567]  ? rcu_is_watching+0x15/0xb0
[  305.225571]  ? syscall_enter_from_user_mode+0x26/0x90
[  305.225575]  ? trace_hardirqs_on+0x16/0x100
[  305.225580]  do_syscall_64+0x61/0xe0
[  305.225584]  ? rcu_is_watching+0x15/0xb0
[  305.225587]  ? syscall_exit_to_user_mode+0x1f/0x50
[  305.225592]  ? trace_hardirqs_on_prepare+0xe3/0x100
[  305.225596]  ? do_syscall_64+0x70/0xe0
[  305.225600]  ? trace_hardirqs_on_prepare+0xe3/0x100
[  305.225604]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  305.225609] RIP: 0033:0x7f6148f3592b
[  305.225650] Code: 73 01 c3 48 8b 0d dd 04 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad 04 0c 00 f7 d8 64 89 01 48
[  305.225653] RSP: 002b:00007ffe89986f08 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  305.225659] RAX: ffffffffffffffda RBX: 000055cbb036e900 RCX: 00007f6148f3592b
[  305.225662] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055cbb036e968
[  305.225664] RBP: 00007ffe89986f30 R08: 1999999999999999 R09: 0000000000000000
[  305.225667] R10: 00007f6148fa6ac0 R11: 0000000000000206 R12: 0000000000000000
[  305.225670] R13: 00007ffe89987190 R14: 000055cbb036e900 R15: 0000000000000000
[  305.225678]  </TASK>

[  305.225683] Allocated by task 484:
[  305.225685]  kasan_save_stack+0x33/0x60
[  305.225690]  kasan_set_track+0x25/0x30
[  305.225693]  __kasan_kmalloc+0x8f/0xa0
[  305.225696]  drm_sched_init+0x3c7/0xce0 [gpu_sched]
[  305.225705]  nouveau_sched_init+0xd2/0x110 [nouveau]
[  305.225913]  nouveau_drm_device_init+0x130/0x3290 [nouveau]
[  305.226121]  nouveau_drm_probe+0x1ab/0x6b0 [nouveau]
[  305.226329]  local_pci_probe+0xda/0x190
[  305.226333]  pci_device_probe+0x23a/0x780
[  305.226337]  really_probe+0x3df/0xb80
[  305.226341]  __driver_probe_device+0x18c/0x450
[  305.226345]  driver_probe_device+0x4a/0x120
[  305.226348]  __driver_attach+0x1e5/0x4a0
[  305.226351]  bus_for_each_dev+0x106/0x190
[  305.226355]  bus_add_driver+0x2a1/0x570
[  305.226358]  driver_register+0x134/0x460
[  305.226361]  do_one_initcall+0xd3/0x430
[  305.226366]  do_init_module+0x238/0x770
[  305.226370]  load_module+0x5581/0x6f10
[  305.226374]  __do_sys_init_module+0x1f2/0x220
[  305.226377]  do_syscall_64+0x61/0xe0
[  305.226381]  entry_SYSCALL_64_after_hwframe+0x6e/0x76

[  305.226387] Freed by task 4436:
[  305.226389]  kasan_save_stack+0x33/0x60
[  305.226392]  kasan_set_track+0x25/0x30
[  305.226396]  kasan_save_free_info+0x2b/0x50
[  305.226399]  __kasan_slab_free+0x10b/0x1a0
[  305.226402]  slab_free_freelist_hook+0x12b/0x1e0
[  305.226406]  __kmem_cache_free+0xd4/0x1d0
[  305.226410]  drm_sched_fini+0x178/0x320 [gpu_sched]
[  305.226418]  nouveau_drm_device_fini+0x2a0/0x490 [nouveau]
[  305.226624]  nouveau_drm_remove+0x18e/0x220 [nouveau]
[  305.226832]  pci_device_remove+0xa3/0x1d0
[  305.226836]  device_release_driver_internal+0x379/0x540
[  305.226840]  driver_detach+0xc5/0x180
[  305.226843]  bus_remove_driver+0x11e/0x2a0
[  305.226847]  pci_unregister_driver+0x2a/0x250
[  305.226850]  nouveau_drm_exit+0x1f/0x970 [nouveau]
[  305.227056]  __do_sys_delete_module+0x350/0x580
[  305.227060]  do_syscall_64+0x61/0xe0
[  305.227064]  entry_SYSCALL_64_after_hwframe+0x6e/0x76

[  305.227070] The buggy address belongs to the object at ffff8881440a8f00
                which belongs to the cache kmalloc-128 of size 128
[  305.227073] The buggy address is located 72 bytes inside of
                freed 128-byte region [ffff8881440a8f00, ffff8881440a8f80)

[  305.227078] The buggy address belongs to the physical page:
[  305.227081] page:00000000627efa0a refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1440a8
[  305.227085] head:00000000627efa0a order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  305.227088] flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
[  305.227093] page_type: 0xffffffff()
[  305.227097] raw: 0017ffffc0000840 ffff8881000428c0 ffffea0005b33500 dead000000000002
[  305.227100] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
[  305.227102] page dumped because: kasan: bad access detected

[  305.227106] Memory state around the buggy address:
[  305.227109]  ffff8881440a8e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  305.227112]  ffff8881440a8e80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  305.227114] >ffff8881440a8f00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  305.227117]                                               ^
[  305.227120]  ffff8881440a8f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  305.227122]  ffff8881440a9000: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
[  305.227125] ==================================================================

Cc: <stable@vger.kernel.org> # v6.7 only
Reported-by: Karol Herbst <kherbst@redhat.com>
Closes: https://gist.githubusercontent.com/karolherbst/a20eb0f937a06ed6aabe2ac2ca3d11b5/raw/9cd8b1dc5894872d0eeebbee3dd0fdd28bb576bc/gistfile1.txt
Fixes: b88baab82871 ("drm/nouveau: implement new VM_BIND uAPI")
Signed-off-by: Danilo Krummrich <dakr@redhat.com>
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Greg KH March 4, 2024, 5:54 p.m. UTC | #1
On Mon, Mar 04, 2024 at 06:01:46PM +0100, Danilo Krummrich wrote:
> This bug is present in v6.7 only, since the scheduler design has been
> re-worked in v6.8.

Now queued up, thanks.

greg k-h
Greg KH March 4, 2024, 5:55 p.m. UTC | #2
On Mon, Mar 04, 2024 at 06:01:46PM +0100, Danilo Krummrich wrote:
> This bug is present in v6.7 only, since the scheduler design has been
> re-worked in v6.8.
> 
> Client scheduler entities must be flushed before an associated GPU
> scheduler is teared down. Otherwise the entitiy might still hold a
> pointer to the scheduler's runqueue which is freed at scheduler tear
> down already.
> 
> [  305.224293] ==================================================================
> [  305.224297] BUG: KASAN: slab-use-after-free in drm_sched_entity_flush+0x6c4/0x7b0 [gpu_sched]
> [  305.224310] Read of size 8 at addr ffff8881440a8f48 by task rmmod/4436
> 
> [  305.224317] CPU: 10 PID: 4436 Comm: rmmod Tainted: G     U             6.7.6-100.fc38.x86_64+debug #1
> [  305.224321] Hardware name: Dell Inc. Precision 7550/01PXFR, BIOS 1.27.0 11/08/2023
> [  305.224324] Call Trace:
> [  305.224327]  <TASK>
> [  305.224329]  dump_stack_lvl+0x76/0xd0
> [  305.224336]  print_report+0xcf/0x670
> [  305.224342]  ? drm_sched_entity_flush+0x6c4/0x7b0 [gpu_sched]
> [  305.224352]  ? __virt_addr_valid+0x215/0x410
> [  305.224359]  ? drm_sched_entity_flush+0x6c4/0x7b0 [gpu_sched]
> [  305.224368]  kasan_report+0xa6/0xe0
> [  305.224373]  ? drm_sched_entity_flush+0x6c4/0x7b0 [gpu_sched]
> [  305.224385]  drm_sched_entity_flush+0x6c4/0x7b0 [gpu_sched]
> [  305.224395]  ? __pfx_drm_sched_entity_flush+0x10/0x10 [gpu_sched]
> [  305.224406]  ? rcu_is_watching+0x15/0xb0
> [  305.224413]  drm_sched_entity_destroy+0x17/0x20 [gpu_sched]
> [  305.224422]  nouveau_cli_fini+0x6c/0x120 [nouveau]
> [  305.224658]  nouveau_drm_device_fini+0x2ac/0x490 [nouveau]
> [  305.224871]  nouveau_drm_remove+0x18e/0x220 [nouveau]
> [  305.225082]  ? __pfx_nouveau_drm_remove+0x10/0x10 [nouveau]
> [  305.225290]  ? rcu_is_watching+0x15/0xb0
> [  305.225295]  ? _raw_spin_unlock_irqrestore+0x66/0x80
> [  305.225299]  ? trace_hardirqs_on+0x16/0x100
> [  305.225304]  ? _raw_spin_unlock_irqrestore+0x4f/0x80
> [  305.225310]  pci_device_remove+0xa3/0x1d0
> [  305.225316]  device_release_driver_internal+0x379/0x540
> [  305.225322]  driver_detach+0xc5/0x180
> [  305.225327]  bus_remove_driver+0x11e/0x2a0
> [  305.225333]  pci_unregister_driver+0x2a/0x250
> [  305.225339]  nouveau_drm_exit+0x1f/0x970 [nouveau]
> [  305.225548]  __do_sys_delete_module+0x350/0x580
> [  305.225554]  ? __pfx___do_sys_delete_module+0x10/0x10
> [  305.225562]  ? syscall_enter_from_user_mode+0x26/0x90
> [  305.225567]  ? rcu_is_watching+0x15/0xb0
> [  305.225571]  ? syscall_enter_from_user_mode+0x26/0x90
> [  305.225575]  ? trace_hardirqs_on+0x16/0x100
> [  305.225580]  do_syscall_64+0x61/0xe0
> [  305.225584]  ? rcu_is_watching+0x15/0xb0
> [  305.225587]  ? syscall_exit_to_user_mode+0x1f/0x50
> [  305.225592]  ? trace_hardirqs_on_prepare+0xe3/0x100
> [  305.225596]  ? do_syscall_64+0x70/0xe0
> [  305.225600]  ? trace_hardirqs_on_prepare+0xe3/0x100
> [  305.225604]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [  305.225609] RIP: 0033:0x7f6148f3592b
> [  305.225650] Code: 73 01 c3 48 8b 0d dd 04 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad 04 0c 00 f7 d8 64 89 01 48
> [  305.225653] RSP: 002b:00007ffe89986f08 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [  305.225659] RAX: ffffffffffffffda RBX: 000055cbb036e900 RCX: 00007f6148f3592b
> [  305.225662] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055cbb036e968
> [  305.225664] RBP: 00007ffe89986f30 R08: 1999999999999999 R09: 0000000000000000
> [  305.225667] R10: 00007f6148fa6ac0 R11: 0000000000000206 R12: 0000000000000000
> [  305.225670] R13: 00007ffe89987190 R14: 000055cbb036e900 R15: 0000000000000000
> [  305.225678]  </TASK>
> 
> [  305.225683] Allocated by task 484:
> [  305.225685]  kasan_save_stack+0x33/0x60
> [  305.225690]  kasan_set_track+0x25/0x30
> [  305.225693]  __kasan_kmalloc+0x8f/0xa0
> [  305.225696]  drm_sched_init+0x3c7/0xce0 [gpu_sched]
> [  305.225705]  nouveau_sched_init+0xd2/0x110 [nouveau]
> [  305.225913]  nouveau_drm_device_init+0x130/0x3290 [nouveau]
> [  305.226121]  nouveau_drm_probe+0x1ab/0x6b0 [nouveau]
> [  305.226329]  local_pci_probe+0xda/0x190
> [  305.226333]  pci_device_probe+0x23a/0x780
> [  305.226337]  really_probe+0x3df/0xb80
> [  305.226341]  __driver_probe_device+0x18c/0x450
> [  305.226345]  driver_probe_device+0x4a/0x120
> [  305.226348]  __driver_attach+0x1e5/0x4a0
> [  305.226351]  bus_for_each_dev+0x106/0x190
> [  305.226355]  bus_add_driver+0x2a1/0x570
> [  305.226358]  driver_register+0x134/0x460
> [  305.226361]  do_one_initcall+0xd3/0x430
> [  305.226366]  do_init_module+0x238/0x770
> [  305.226370]  load_module+0x5581/0x6f10
> [  305.226374]  __do_sys_init_module+0x1f2/0x220
> [  305.226377]  do_syscall_64+0x61/0xe0
> [  305.226381]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> 
> [  305.226387] Freed by task 4436:
> [  305.226389]  kasan_save_stack+0x33/0x60
> [  305.226392]  kasan_set_track+0x25/0x30
> [  305.226396]  kasan_save_free_info+0x2b/0x50
> [  305.226399]  __kasan_slab_free+0x10b/0x1a0
> [  305.226402]  slab_free_freelist_hook+0x12b/0x1e0
> [  305.226406]  __kmem_cache_free+0xd4/0x1d0
> [  305.226410]  drm_sched_fini+0x178/0x320 [gpu_sched]
> [  305.226418]  nouveau_drm_device_fini+0x2a0/0x490 [nouveau]
> [  305.226624]  nouveau_drm_remove+0x18e/0x220 [nouveau]
> [  305.226832]  pci_device_remove+0xa3/0x1d0
> [  305.226836]  device_release_driver_internal+0x379/0x540
> [  305.226840]  driver_detach+0xc5/0x180
> [  305.226843]  bus_remove_driver+0x11e/0x2a0
> [  305.226847]  pci_unregister_driver+0x2a/0x250
> [  305.226850]  nouveau_drm_exit+0x1f/0x970 [nouveau]
> [  305.227056]  __do_sys_delete_module+0x350/0x580
> [  305.227060]  do_syscall_64+0x61/0xe0
> [  305.227064]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> 
> [  305.227070] The buggy address belongs to the object at ffff8881440a8f00
>                 which belongs to the cache kmalloc-128 of size 128
> [  305.227073] The buggy address is located 72 bytes inside of
>                 freed 128-byte region [ffff8881440a8f00, ffff8881440a8f80)
> 
> [  305.227078] The buggy address belongs to the physical page:
> [  305.227081] page:00000000627efa0a refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1440a8
> [  305.227085] head:00000000627efa0a order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> [  305.227088] flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
> [  305.227093] page_type: 0xffffffff()
> [  305.227097] raw: 0017ffffc0000840 ffff8881000428c0 ffffea0005b33500 dead000000000002
> [  305.227100] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
> [  305.227102] page dumped because: kasan: bad access detected
> 
> [  305.227106] Memory state around the buggy address:
> [  305.227109]  ffff8881440a8e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [  305.227112]  ffff8881440a8e80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [  305.227114] >ffff8881440a8f00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  305.227117]                                               ^
> [  305.227120]  ffff8881440a8f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [  305.227122]  ffff8881440a9000: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
> [  305.227125] ==================================================================
> 
> Cc: <stable@vger.kernel.org> # v6.7 only
> Reported-by: Karol Herbst <kherbst@redhat.com>
> Closes: https://gist.githubusercontent.com/karolherbst/a20eb0f937a06ed6aabe2ac2ca3d11b5/raw/9cd8b1dc5894872d0eeebbee3dd0fdd28bb576bc/gistfile1.txt
> Fixes: b88baab82871 ("drm/nouveau: implement new VM_BIND uAPI")

You say 6.7 only, but this commit is in 6.6, so why not 6.6 also?

thanks,

greg k-h
Danilo Krummrich March 4, 2024, 6:10 p.m. UTC | #3
On 3/4/24 18:55, Greg KH wrote:
> On Mon, Mar 04, 2024 at 06:01:46PM +0100, Danilo Krummrich wrote:
>> Cc: <stable@vger.kernel.org> # v6.7 only
> You say 6.7 only, but this commit is in 6.6, so why not 6.6 also?

Good catch, I was sure I originally merged this for 6.7. This fix should indeed
be applied to 6.6 as well. Should have double checked that, my bad.

- Danilo

> 
> thanks,
> 
> greg k-h
>
Greg KH March 4, 2024, 6:18 p.m. UTC | #4
On Mon, Mar 04, 2024 at 07:10:56PM +0100, Danilo Krummrich wrote:
> On 3/4/24 18:55, Greg KH wrote:
> > On Mon, Mar 04, 2024 at 06:01:46PM +0100, Danilo Krummrich wrote:
> > > Cc: <stable@vger.kernel.org> # v6.7 only
> > You say 6.7 only, but this commit is in 6.6, so why not 6.6 also?
> 
> Good catch, I was sure I originally merged this for 6.7. This fix should indeed
> be applied to 6.6 as well. Should have double checked that, my bad.

Great, now queued up, thanks!

greg k-h
diff mbox series

Patch

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 50589f982d1a..75545da9d1e9 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -708,10 +708,11 @@  nouveau_drm_device_fini(struct drm_device *dev)
 	}
 	mutex_unlock(&drm->clients_lock);
 
-	nouveau_sched_fini(drm);
-
 	nouveau_cli_fini(&drm->client);
 	nouveau_cli_fini(&drm->master);
+
+	nouveau_sched_fini(drm);
+
 	nvif_parent_dtor(&drm->parent);
 	mutex_destroy(&drm->clients_lock);
 	kfree(drm);