Message ID | 20191206115635.46931-1-chuansheng.liu@intel.com (mailing list archive)
---|---
State | New, archived
Series | drm/i915: Fix the use-after-free between i915_active_ref and __active_retire
Quoting Chuansheng Liu (2019-12-06 11:56:35)
> We easily hit a drm/i915 panic on TGL when running glmark2, and finally
> caught the use-after-free race condition by enabling KASAN.
>
> The call stack is below:
> ===
> [  534.472675] BUG: KASAN: use-after-free in __i915_active_fence_set+0x26d/0x3d0 [i915]
> [  534.472679] Write of size 8 at addr ffff8883f0372388 by task glmark2/3199
>
> [KASAN call trace, allocation stack and free stack snipped]
> ===
>
> The race scenario is like:
> Initially ref->count is 1, and the interrupt handler is trying to free
> the node.
>
> ===
> CPUA in interrupt context          CPUB in i915_gem_execbuffer2_ioctl
> __active_retire -->
>   spin_lock(&ref->tree_lock)
>   decrease ref->count to 0
>                                    i915_active_ref -->
>                                      increase ref->count to 1
>                                      (i915_active_acquire)
>
>                                      get the dirty ref->cache
>                                      (READ_ONCE(ref->cache))
>
>                                      return the dirty node
>
>   set ref->cache to NULL
>   spin_unlock(&ref->tree_lock)
>   free the node
>
>                                    hit use-after-free in
>                                    __i915_active_fence_set()
> ===
>
> Here we need to use the spinlock ref->tree_lock to protect the access
> READ_ONCE(ref->cache); with that, the race scenario is resolved.
>
> With this patch, it passed our stress test for a very long time.
>
> Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_active.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
> index dca15ace88f6..3d68b910e949 100644
> --- a/drivers/gpu/drm/i915/i915_active.c
> +++ b/drivers/gpu/drm/i915/i915_active.c
> @@ -214,7 +214,10 @@ active_instance(struct i915_active *ref, struct intel_timeline *tl)
>  	 * after the previous activity has been retired, or if it matches the
>  	 * current timeline.
>  	 */
> +	spin_lock_irq(&ref->tree_lock);
>  	node = READ_ONCE(ref->cache);
> +	spin_unlock_irq(&ref->tree_lock);

Incorrect. The serialisation with __active_retire is required at
i915_active_acquire. The problem is that the serialisation was provided
by ODEBUG in our CI, so it went under the radar.
-Chris
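[Editorial note] The direction pointed at here — serialising with __active_retire at i915_active_acquire rather than around the cache read — can be sketched as a lock-free "increment unless zero" fast path, analogous to the kernel's atomic_add_unless(): a count that has already reached zero (and may be mid-retirement) is never revived without going through a locked slow path. This is an illustrative userspace model in C11 atomics with a hypothetical name, not the driver's actual code:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of the acquire-side serialisation: increment the
 * count only while it is still non-zero. If the count has hit zero,
 * refuse, so the caller must take ref->tree_lock and revive the ref
 * under the same lock that __active_retire holds. */
static int active_count_inc_if_live(atomic_int *count)
{
	int old = atomic_load(count);

	do {
		if (old == 0)
			return 0;	/* caller must revive under ref->tree_lock */
	} while (!atomic_compare_exchange_weak(count, &old, old + 1));

	return 1;	/* still live: retirement cannot free the nodes now */
}
```

With this shape, a concurrent retire either completes before the revival (and the locked slow path then sees a clean, empty cache) or observes a non-zero count and backs off — the stale ref->cache pointer never escapes.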
Chris,

Thanks for reviewing, please see my comments below.

> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Friday, December 6, 2019 8:04 PM
> To: Liu, Chuansheng <chuansheng.liu@intel.com>; intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH] drm/i915: Fix the use-after-free between i915_active_ref and __active_retire
>
> Quoting Chuansheng Liu (2019-12-06 11:56:35)
> > [patch description, KASAN report and race diagram snipped]
> >
> > +	spin_lock_irq(&ref->tree_lock);
> >  	node = READ_ONCE(ref->cache);
> > +	spin_unlock_irq(&ref->tree_lock);
>
> Incorrect. The serialisation with __active_retire is required at
> i915_active_acquire.

You suggest the change can be made in i915_active_acquire(), so that
ref->count is handled closely together with tree_lock and the ODEBUG
annotations?

If so, I can make a new patch.
Quoting Liu, Chuansheng (2019-12-06 12:10:25)
> Chris,
>
> Thanks for reviewing, please see below comments.
>
> [quoted patch and review snipped]
>
> > Incorrect. The serialisation with __active_retire is required at
> > i915_active_acquire.
> You suggest the change can be made in i915_active_acquire()?
> So that we can play ref->count closely together with tree_lock
> and ODEBUG stuff.
>
> If so, I can make a new patch
> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Friday, December 6, 2019 8:15 PM
> To: Liu, Chuansheng <chuansheng.liu@intel.com>; intel-gfx@lists.freedesktop.org
> Subject: RE: [Intel-gfx] [PATCH] drm/i915: Fix the use-after-free between i915_active_ref and __active_retire
>
> Quoting Liu, Chuansheng (2019-12-06 12:10:25)
> > [quoted patch and review snipped]
> >
> > > Incorrect. The serialisation with __active_retire is required at
> > > i915_active_acquire.
> > You suggest the change can be made in i915_active_acquire()?
> > So that we can play ref->count closely together with tree_lock
> > and ODEBUG stuff.
> >
> > If so, I can make a new patch
diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index dca15ace88f6..3d68b910e949 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -214,7 +214,10 @@ active_instance(struct i915_active *ref, struct intel_timeline *tl)
 	 * after the previous activity has been retired, or if it matches the
 	 * current timeline.
 	 */
+	spin_lock_irq(&ref->tree_lock);
 	node = READ_ONCE(ref->cache);
+	spin_unlock_irq(&ref->tree_lock);
+
 	if (node && node->timeline == idx)
 		return &node->base;
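[Editorial note] As the review points out, wrapping only the cache read in the lock does not close the window: the lock is dropped before the returned pointer is dereferenced, so retirement can still free the node in between. A deterministic userspace replay of that interleaving (a toy model with made-up names, not driver code) makes the point:

```c
#include <assert.h>
#include <stddef.h>

struct toy_node { int freed; };

struct toy_active {
	int count;
	struct toy_node *cache;
};

/* The patch under review: lock, read the cache, unlock. Note that the
 * lock is released before the returned pointer is ever used. */
static struct toy_node *toy_read_cache(struct toy_active *ref)
{
	/* spin_lock_irq(&ref->tree_lock); */
	struct toy_node *node = ref->cache;
	/* spin_unlock_irq(&ref->tree_lock); */
	return node;
}

/* Stand-in for __active_retire: clears the cache and frees the node. */
static void toy_retire(struct toy_active *ref, struct toy_node *node)
{
	ref->cache = NULL;
	node->freed = 1;	/* stands in for kmem_cache_free() */
}

/* Replay one interleaving the patch leaves open: CPUB reads the cache
 * (under the lock), CPUA then retires, CPUB uses the stale node. */
static int toy_replay_race(void)
{
	struct toy_node node = { .freed = 0 };
	struct toy_active ref = { .count = 0, .cache = &node };

	struct toy_node *stale = toy_read_cache(&ref);	/* CPUB */
	toy_retire(&ref, &node);			/* CPUA */
	return stale->freed;	/* CPUB now touches freed memory */
}
```

`toy_replay_race()` returns 1: even though the read itself was locked, the pointer it handed back is already freed by the time it is dereferenced — which is exactly the write-after-free that KASAN reports in __i915_active_fence_set().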
We easily hit a drm/i915 panic on TGL when running glmark2, and finally
caught the use-after-free race condition by enabling KASAN.

The call stack is below:
===
[  534.472675] BUG: KASAN: use-after-free in __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472679] Write of size 8 at addr ffff8883f0372388 by task glmark2/3199

[  534.472684] CPU: 3 PID: 3199 Comm: glmark2 Tainted: G U E 5.4.0-rc8 #8
[  534.472687] Call Trace:
[  534.472693]  dump_stack+0x95/0xd5
[  534.472722]  ? __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472727]  print_address_description.constprop.5+0x20/0x320
[  534.472751]  ? __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472792]  ? __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472794]  __kasan_report+0x149/0x18c
[  534.472798]  ? _raw_spin_lock+0x1/0xd0
[  534.472820]  ? __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472822]  kasan_report+0x12/0x20
[  534.472825]  __asan_report_store8_noabort+0x17/0x20
[  534.472847]  __i915_active_fence_set+0x26d/0x3d0 [i915]
[  534.472870]  i915_active_ref+0x2c8/0x530 [i915]
[  534.472874]  ? dma_resv_add_shared_fence+0x291/0x460
[  534.472902]  __i915_vma_move_to_active+0x56/0x70 [i915]
[  534.472927]  i915_vma_move_to_active+0x54/0x420 [i915]
[  534.472931]  ? mutex_unlock+0x22/0x40
[  534.472957]  i915_gem_do_execbuffer+0x1d45/0x3e20 [i915]
[  534.472959]  ? __kmalloc_node+0x12c/0x350
[  534.472983]  ? eb_relocate_slow+0xb40/0xb40 [i915]
[  534.472985]  ? _raw_write_trylock+0x110/0x110
[  534.472987]  ? get_partial_node.isra.72+0x51/0x260
[  534.472991]  ? unix_stream_read_generic+0x583/0x1a80
[  534.472994]  ? ___slab_alloc+0x1d8/0x550
[  534.472998]  ? kvmalloc_node+0x31/0x80
[  534.473000]  ? kasan_unpoison_shadow+0x35/0x50
[  534.473002]  ? _raw_spin_lock+0x7b/0xd0
[  534.473004]  ? radix_tree_lookup+0xd/0x10
[  534.473006]  ? idr_find+0x3b/0x60
[  534.473029]  i915_gem_execbuffer2_ioctl+0x634/0x8a0 [i915]
[  534.473052]  ? i915_gem_execbuffer_ioctl+0xd50/0xd50 [i915]
[  534.473054]  ? unix_stream_recvmsg+0x97/0xd0
[  534.473056]  ? unix_stream_splice_read+0x1c0/0x1c0
[  534.473058]  ? __unix_insert_socket+0x180/0x180
[  534.473081]  ? i915_gem_execbuffer_ioctl+0xd50/0xd50 [i915]
[  534.473094]  drm_ioctl_kernel+0x1ed/0x2b0 [drm]
[  534.473103]  ? drm_setversion+0x8c0/0x8c0 [drm]
[  534.473106]  ? __kasan_check_write+0x14/0x20
[  534.473115]  drm_ioctl+0x68b/0xaa0 [drm]
...

[  534.473239] Allocated by task 3199:
[  534.473241]  save_stack+0x21/0x90
[  534.473243]  __kasan_kmalloc.constprop.8+0xa7/0xd0
[  534.473245]  kasan_slab_alloc+0x11/0x20
[  534.473246]  kmem_cache_alloc+0xce/0x240
[  534.473273]  i915_active_ref+0xc2/0x530 [i915]
[  534.473302]  __i915_vma_move_to_active+0x56/0x70 [i915]
[  534.473328]  i915_vma_move_to_active+0x54/0x420 [i915]
[  534.473355]  i915_gem_do_execbuffer+0x1d45/0x3e20 [i915]
[  534.473381]  i915_gem_execbuffer2_ioctl+0x634/0x8a0 [i915]
[  534.473392]  drm_ioctl_kernel+0x1ed/0x2b0 [drm]
[  534.473402]  drm_ioctl+0x68b/0xaa0 [drm]
[  534.473404]  do_vfs_ioctl+0x19a/0xf10
[  534.473405]  ksys_ioctl+0x75/0x80
[  534.473407]  __x64_sys_ioctl+0x73/0xb0
[  534.473408]  do_syscall_64+0x9f/0x3a0
[  534.473410]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  534.473412] Freed by task 0:
[  534.473414]  save_stack+0x21/0x90
[  534.473415]  __kasan_slab_free+0x137/0x190
[  534.473417]  kasan_slab_free+0xe/0x10
[  534.473418]  kmem_cache_free+0xeb/0x2c0
[  534.473444]  __active_retire+0x1f2/0x240 [i915]
[  534.473471]  active_retire+0x13b/0x1b0 [i915]
[  534.473496]  node_retire+0x54/0x80 [i915]
[  534.473523]  intel_engine_breadcrumbs_irq+0x5f0/0xd10 [i915]
[  534.473549]  cs_irq_handler+0x66/0xb0 [i915]
[  534.473575]  gen11_gt_irq_handler+0x26c/0x400 [i915]
[  534.473600]  gen11_irq_handler+0xc3/0x250 [i915]
[  534.473603]  __handle_irq_event_percpu+0xe0/0x4c0
[  534.473605]  handle_irq_event_percpu+0x71/0x140
[  534.473606]  handle_irq_event+0xad/0x140
[  534.473608]  handle_edge_irq+0x1f6/0x780
[  534.473610]  do_IRQ+0x9f/0x1f0

[  534.473612] The buggy address belongs to the object at ffff8883f0372380
               which belongs to the cache active_node of size 72
[  534.473615] The buggy address is located 8 bytes inside of

===

The race scenario is like:
Initially ref->count is 1, and the interrupt handler is trying to free
the node.

===
CPUA in interrupt context          CPUB in i915_gem_execbuffer2_ioctl
__active_retire -->
  spin_lock(&ref->tree_lock)
  decrease ref->count to 0
                                   i915_active_ref -->
                                     increase ref->count to 1
                                     (i915_active_acquire)

                                     get the dirty ref->cache
                                     (READ_ONCE(ref->cache))

                                     return the dirty node

  set ref->cache to NULL
  spin_unlock(&ref->tree_lock)
  free the node

                                   hit use-after-free in
                                   __i915_active_fence_set()
===

Here we need to use the spinlock ref->tree_lock to protect the access
READ_ONCE(ref->cache); with that, the race scenario is resolved.

With this patch, it passed our stress test for a very long time.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
---
 drivers/gpu/drm/i915/i915_active.c | 3 +++
 1 file changed, 3 insertions(+)
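[Editorial note] For completeness, the review's alternative — serialising the 0 -> 1 revival in i915_active_acquire with __active_retire under ref->tree_lock — can be modelled deterministically. In this sketch (illustrative names and a simplified protocol, not the actual driver change), retire re-checks the count under the lock, so a ref revived under that same lock is never torn down:

```c
#include <assert.h>
#include <stddef.h>

struct model_node { int freed; };

struct model_active {
	int count;			/* atomic in the real driver */
	struct model_node *cache;
};

/* Revival path: the 0 -> 1 transition happens under the same lock
 * that __active_retire holds, so the two cannot interleave. */
static void model_acquire_locked(struct model_active *ref)
{
	/* spin_lock_irq(&ref->tree_lock); */
	ref->count++;
	/* spin_unlock_irq(&ref->tree_lock); */
}

/* Retire path: re-check the count under the lock. A ref that was
 * revived in the meantime is left alone; its cached node survives. */
static void model_retire(struct model_active *ref)
{
	/* spin_lock_irq(&ref->tree_lock); */
	if (ref->count == 0) {
		if (ref->cache)
			ref->cache->freed = 1;	/* kmem_cache_free() */
		ref->cache = NULL;
	}
	/* spin_unlock_irq(&ref->tree_lock); */
}

/* With revival serialised, the commit message's interleaving collapses
 * into one of two safe orders; here the acquire wins the lock first. */
static int model_no_uaf(void)
{
	struct model_node node = { .freed = 0 };
	struct model_active ref = { .count = 0, .cache = &node };

	model_acquire_locked(&ref);	/* CPUB revives: count 0 -> 1 */
	model_retire(&ref);		/* CPUA sees count != 0, backs off */
	return node.freed;		/* 0: the node was not freed */
}
```

If instead retire wins the lock first, it frees the node and clears ref->cache before the revival runs, so the acquirer can never observe a dangling cache pointer — either way the use-after-free window is gone.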