Message ID | CAPM=9twko1gCNTB3CPf7CAQqWFayMj=1fa3ZoEwwviDFhF48kQ@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | [git,pull] drm fixes for 5.14-rc4 |
The pull request you sent on Fri, 30 Jul 2021 11:11:27 +1000:
> git://anongit.freedesktop.org/drm/drm tags/drm-fixes-2021-07-30
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/764a5bc89b12b82c18ce7ca5d7c1b10dd748a440
Thank you!
This might possibly have been fixed already by the previous drm pull,
but I wanted to report it anyway, just in case.

It happened after an uptime of over a week, so it might not be trivial
to reproduce.

It's a NULL pointer dereference in dc_stream_retain() with the code being

    lock xadd %eax,0x390(%rdi) <-- trapping instruction

and that's just the

    kref_get(&stream->refcount);

with a NULL 'stream' argument.

Call Trace:
    dc_resource_state_copy_construct+0x13f/0x190 [amdgpu]
    amdgpu_dm_atomic_commit_tail+0xd5/0x1540 [amdgpu]
    commit_tail+0x97/0x180 [drm_kms_helper]
    process_one_work+0x1df/0x3a0

the oops is followed by a stream of

    [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:55:crtc-1]
    hw_done or flip_done timed out

and the machine was not usable afterwards.

lspci says this is a

    49:00.0 VGA compatible controller [0300]:
    Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere
    [Radeon RX 470/480/570/570X/580/580X/590]
    [1002:67df] (rev e7) (prog-if 00 [VGA controller])

Full oops in the attachment, but I think the above is all the really
salient details.

Linus
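For readers following the oops, here is a minimal user-space sketch of why a NULL 'stream' faults at address 0x390: taking &stream->refcount only adds the member offset to the pointer, so the atomic increment touches NULL + 0x390. Everything named demo_* is hypothetical. The padded struct is not the real dc_stream_state layout, and the NULL check only illustrates a defensive retain; it is not the actual amdgpu fix.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct kref. */
struct demo_kref {
    atomic_int refcount;
};

/*
 * Hypothetical stand-in for the stream object. The padding exists only to
 * place the refcount at offset 0x390, as in the trapping instruction.
 */
struct demo_stream {
    unsigned char other_fields[0x390];
    struct demo_kref refcount;
};

static_assert(offsetof(struct demo_stream, refcount) == 0x390,
              "refcount is expected at offset 0x390 in this sketch");

/* Address the CPU would access for a stream located at stream_addr. */
size_t demo_refcount_fault_address(size_t stream_addr)
{
    return stream_addr + offsetof(struct demo_stream, refcount);
}

/*
 * Defensive retain: refuses a NULL stream instead of faulting.
 * The atomic increment plays the role of kref_get().
 */
bool demo_stream_retain(struct demo_stream *stream)
{
    if (stream == NULL)
        return false;
    atomic_fetch_add(&stream->refcount.refcount, 1);
    return true;
}
```

Calling demo_refcount_fault_address(0) gives the address a NULL stream would hit, and demo_stream_retain(NULL) returns false rather than dereferencing it.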
On Thu, Aug 5, 2021 at 2:14 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> This might possibly have been fixed already by the previous drm pull,
> but I wanted to report it anyway, just in case.
>
> It happened after an uptime of over a week, so it might not be trivial
> to reproduce.
>
> It's a NULL pointer dereference in dc_stream_retain() with the code being
>
>     lock xadd %eax,0x390(%rdi) <-- trapping instruction
>
> and that's just the
>
>     kref_get(&stream->refcount);
>
> with a NULL 'stream' argument.
>
> Call Trace:
>     dc_resource_state_copy_construct+0x13f/0x190 [amdgpu]
>     amdgpu_dm_atomic_commit_tail+0xd5/0x1540 [amdgpu]
>     commit_tail+0x97/0x180 [drm_kms_helper]
>     process_one_work+0x1df/0x3a0
>
> the oops is followed by a stream of
>
>     [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:55:crtc-1]
>     hw_done or flip_done timed out
>
> and the machine was not usable afterwards.
>
> lspci says this is a
>
>     49:00.0 VGA compatible controller [0300]:
>     Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere
>     [Radeon RX 470/480/570/570X/580/580X/590]
>     [1002:67df] (rev e7) (prog-if 00 [VGA controller])
>
> Full oops in the attachment, but I think the above is all the really
> salient details.

Thanks for the report. Adding some display folks to take a look.

Alex
On Thu, Aug 5, 2021 at 8:14 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> This might possibly have been fixed already by the previous drm pull,
> but I wanted to report it anyway, just in case.
>
> It happened after an uptime of over a week, so it might not be trivial
> to reproduce.
>
> It's a NULL pointer dereference in dc_stream_retain() with the code being
>
>     lock xadd %eax,0x390(%rdi) <-- trapping instruction
>
> and that's just the
>
>     kref_get(&stream->refcount);
>
> with a NULL 'stream' argument.
>
> Call Trace:
>     dc_resource_state_copy_construct+0x13f/0x190 [amdgpu]
>     amdgpu_dm_atomic_commit_tail+0xd5/0x1540 [amdgpu]
>     commit_tail+0x97/0x180 [drm_kms_helper]
>     process_one_work+0x1df/0x3a0
>
> the oops is followed by a stream of
>
>     [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:55:crtc-1]
>     hw_done or flip_done timed out
>
> and the machine was not usable afterwards.

Hm, that part is a bit disappointing, because the atomic modeset commit
helpers are designed to recover from this (assuming we didn't fry the
hw). But amdgpu does these waits in amdgpu_dm_atomic_check(), which is
decidedly not great (you're not supposed to block on hw or a previous
commit in atomic_check, ever, because it can be called by userspace in
a TEST_ONLY mode to figure out whether a desired config would work),
and then returns that error to userspace, which is worse.

I guess that's another area where the integration between what atomic
modeset expects and what the DC backend provides is suboptimal. I think
we managed to fuse the data structures together fairly ok, but the
check/commit flow and semantics are a bit of a struggle.

Anyway, this was just an aside; I guess given the bug the driver
wouldn't have recovered anyway.
-Daniel

> lspci says this is a
>
>     49:00.0 VGA compatible controller [0300]:
>     Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere
>     [Radeon RX 470/480/570/570X/580/580X/590]
>     [1002:67df] (rev e7) (prog-if 00 [VGA controller])
>
> Full oops in the attachment, but I think the above is all the really
> salient details.
>
> Linus
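To make Daniel's aside about check versus commit concrete, here is a rough user-space sketch of the split he describes: the check step only validates the requested state and never waits on hardware, so a TEST_ONLY request returns without blocking, and any waiting happens only on the real commit path. All demo_* names, the size limits, and the wait counter are hypothetical. This is not amdgpu or DRM helper code; in the real uAPI the corresponding flag is DRM_MODE_ATOMIC_TEST_ONLY.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical requested display state. */
struct demo_state {
    int width;
    int height;
};

/* Hypothetical flag, standing in for a TEST_ONLY request. */
enum { DEMO_TEST_ONLY = 1u << 0 };

/* Counts how often the (simulated) blocking hardware wait ran. */
int demo_hw_waits;

/* Pure validation: no hardware access, no waiting on earlier commits. */
int demo_atomic_check(const struct demo_state *state)
{
    if (state->width <= 0 || state->height <= 0)
        return -EINVAL;
    if (state->width > 8192 || state->height > 8192) /* made-up limit */
        return -EINVAL;
    return 0;
}

/* Simulated blocking wait; only legal on the commit path. */
void demo_wait_for_hw(void)
{
    demo_hw_waits++;
}

int demo_atomic_commit(const struct demo_state *state, unsigned int flags)
{
    int ret = demo_atomic_check(state);

    if (ret)
        return ret;
    if (flags & DEMO_TEST_ONLY)
        return 0;       /* config would work; nothing was touched */

    demo_wait_for_hw(); /* waiting belongs here, not in check */
    /* ... program the hardware ... */
    return 0;
}
```

With this split, a caller probing a configuration with DEMO_TEST_ONLY gets 0 or -EINVAL immediately, and demo_hw_waits only increases on a real commit.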