Message ID: CAPM=9tzs4n8dDQ_XVVPS_5jrBgsNkhDQvf-B_XmUg+EG_M2i4Q@mail.gmail.com (mailing list archive)
State: New, archived
Series: [git,pull] drm for 6.1-rc1
On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie <airlied@gmail.com> wrote:
>
> This is very conflict heavy, mostly the correct answer is picking
> the version from drm-next.

Ugh, yes, that was a bit annoying. I get the same end result as you
did, but I do wonder if the drm people should try to keep some kind of
separate "fixes" branches for things that go both into the development
tree and then get sent to me for fixes pulls?

Hopefully this "lots of pointless noise" was a one-off, but it might
be due to how you guys work..

Linus
The pull request you sent on Wed, 5 Oct 2022 13:41:47 +1000:
> git://anongit.freedesktop.org/drm/drm tags/drm-next-2022-10-05
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/7e6739b9336e61fe23ca4e2c8d1fda8f19f979bf
Thank you!
On Thu, 6 Oct 2022 at 04:38, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie <airlied@gmail.com> wrote:
> >
> > This is very conflict heavy, mostly the correct answer is picking
> > the version from drm-next.
>
> Ugh, yes, that was a bit annoying.
>
> I get the same end result as you did, but I do wonder if the drm
> people should try to keep some kind of separate "fixes" branches for
> things that go both into the development tree and then get sent to me
> for fixes pulls?
>
> Hopefully this "lots of pointless noise" was a one-off, but it might
> be due to how you guys work..

In this case I think most of it was caused by a late set of fixes,
backported for new AMD hardware, that had to be redone to fit into the
current kernel. I haven't seen it this bad in a long while. We also
maintain a rerere tree ourselves to avoid continuously seeing it.

The problem is a lot of developers don't have the insight that the
maintainers do into the current state of the tree/pipeline. Stuff goes
into next because that is where the patch it fixes originally went, and
it goes through CI there. Then at some point someone else realises the
change needs to be in fixes and it gets backported. The volume of
patches and company signoff processes doesn't make it trivial to decide
up front what needs to go in -next or -fixes, unfortunately.

Dave.
On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie <airlied@gmail.com> wrote:
>
> Lots of stuff all over, some new AMD IP support and gang
> submit support [..]

Hmm. I have now had my main desktop lock up twice after pulling this.
Nothing in the dmesg after a reboot, and nothing in particular that
seems to trigger it, so I have a hard time even guessing what's up,
but the drm changes are the primary suspect.

I will try to see if I can get any information out of the machine, but
with the symptom being just a dead machine ...

This is the same (old) Radeon device:

  49:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
  [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)

with dual 4k monitors, running on my good old Threadripper setup.

Again, there's no explicit reason to blame the drm pull, except that
it started after that merge (that machine ran the kernel with the
networking pull for a day with no problems, and while there were other
pull requests in between them, they seem to be fairly unrelated to the
hardware I have).

But the lockup is so sporadic (twice in the last day) that I really
can't bisect it, so I'm afraid I have very very little info.

Any suggestions?

Linus
On Thu, Oct 6, 2022 at 2:48 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie <airlied@gmail.com> wrote:
> >
> > Lots of stuff all over, some new AMD IP support and gang
> > submit support [..]
>
> Hmm.
>
> I have now had my main desktop lock up twice after pulling this.
> Nothing in the dmesg after a reboot, and nothing in particular that
> seems to trigger it, so I have a hard time even guessing what's up,
> but the drm changes are the primary suspect.
>
> I will try to see if I can get any information out of the machine, but
> with the symptom being just a dead machine ...
>
> This is the same (old) Radeon device:
>
> 49:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
>
> with dual 4k monitors, running on my good old Threadripper setup.
>
> Again, there's no explicit reason to blame the drm pull, except that
> it started after that merge (that machine ran the kernel with the
> networking pull for a day with no problems, and while there were other
> pull requests in between them, they seem to be fairly unrelated to the
> hardware I have).
>
> But the lockup is so sporadic (twice in the last day) that I really
> can't bisect it, so I'm afraid I have very very little info.
>
> Any suggestions?

Maybe you are seeing this, which is an issue with GPU TLB flushes and
is kind of sporadic:
https://gitlab.freedesktop.org/drm/amd/-/issues/2113

Are you seeing any GPU page faults in your kernel log?

Alex
On Fri, 7 Oct 2022 at 04:48, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie <airlied@gmail.com> wrote:
> >
> > Lots of stuff all over, some new AMD IP support and gang
> > submit support [..]
>
> Hmm.
>
> I have now had my main desktop lock up twice after pulling this.
> Nothing in the dmesg after a reboot, and nothing in particular that
> seems to trigger it, so I have a hard time even guessing what's up,
> but the drm changes are the primary suspect.
>
> I will try to see if I can get any information out of the machine, but
> with the symptom being just a dead machine ...
>
> This is the same (old) Radeon device:
>
> 49:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
>
> with dual 4k monitors, running on my good old Threadripper setup.
>
> Again, there's no explicit reason to blame the drm pull, except that
> it started after that merge (that machine ran the kernel with the
> networking pull for a day with no problems, and while there were other
> pull requests in between them, they seem to be fairly unrelated to the
> hardware I have).
>
> But the lockup is so sporadic (twice in the last day) that I really
> can't bisect it, so I'm afraid I have very very little info.
>
> Any suggestions?

netconsole?

I'll plug in my 480 and see if I can make it die.

Dave.
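[For anyone following along, a minimal netconsole setup looks roughly like the below. The addresses, ports and interface name are placeholders, not details from this thread; this is a config sketch, see Documentation/networking/netconsole.rst for the exact parameter syntax.]

```shell
# On the crashing box: stream kernel messages over UDP.
# Format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac
modprobe netconsole netconsole=6665@192.168.1.10/enp4s0,6666@192.168.1.20/00:11:22:33:44:55

# Or put the same string on the kernel command line so logging is up
# before userspace starts:
#   netconsole=6665@192.168.1.10/enp4s0,6666@192.168.1.20/00:11:22:33:44:55

# On the receiving box: capture everything, including the final oops.
nc -l -u 6666 | tee netconsole.log
```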
On Thu, Oct 6, 2022 at 12:30 PM Dave Airlie <airlied@gmail.com> wrote:
>
> netconsole?

I've never been really successful with that in the past, and haven't
used it for decades. I guess I could try if nothing else works.

Linus
On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> Maybe you are seeing this which is an issue with GPU TLB flushes which
> is kind of sporadic:
> https://gitlab.freedesktop.org/drm/amd/-/issues/2113

Well, that seems to be 5.19, and while timing changes (or whatever
other software updates) could have made it start triggering, this
machine has been pretty solid otherwise.

> Are you seeing any GPU page faults in your kernel log?

Nothing even remotely like that "no-retry page fault" in that issue
report. Of course, if it happens just before the whole thing locks
up...

Linus
On Thu, Oct 6, 2022 at 3:48 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher <alexdeucher@gmail.com> wrote: > > > > Maybe you are seeing this which is an issue with GPU TLB flushes which > > is kind of sporadic: > > https://gitlab.freedesktop.org/drm/amd/-/issues/2113 > > Well, that seems to be 5.19, and while timing changes (or whatever > other software updates) could have made it start trigger, this machine > has been pretty solid otgerwise. > > > Are you seeing any GPU page faults in your kernel log? > > Nothing even remotely like that "no-retry page fault" in that issue > report. Of course, if it happens just before the whole thing locks > up... Your chip is too old to support retry faults so it's likely you could be just seeing a GPU page fault followed by a hang. Your chip also lacks a paging queue, so you would be affected by the TLB issue. Alex
On Fri, 7 Oct 2022 at 06:14, Alex Deucher <alexdeucher@gmail.com> wrote: > > On Thu, Oct 6, 2022 at 3:48 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher <alexdeucher@gmail.com> wrote: > > > > > > Maybe you are seeing this which is an issue with GPU TLB flushes which > > > is kind of sporadic: > > > https://gitlab.freedesktop.org/drm/amd/-/issues/2113 > > > > Well, that seems to be 5.19, and while timing changes (or whatever > > other software updates) could have made it start trigger, this machine > > has been pretty solid otgerwise. > > > > > Are you seeing any GPU page faults in your kernel log? > > > > Nothing even remotely like that "no-retry page fault" in that issue > > report. Of course, if it happens just before the whole thing locks > > up... > > Your chip is too old to support retry faults so it's likely you could > be just seeing a GPU page fault followed by a hang. Your chip also > lacks a paging queue, so you would be affected by the TLB issue. Okay I got my FIJI running Linus tree and netconsole to blow up like this, running fedora 36 desktop, steam, firefox, and then I ran poweroff over ssh. 
[ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
[ 1234.778782] #PF: supervisor read access in kernel mode
[ 1234.778787] #PF: error_code(0x0000) - not-present page
[ 1234.778791] PGD 0 P4D 0
[ 1234.778798] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
[ 1234.778809] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020
[ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
[ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00 00 f0
[ 1234.778834] RSP: 0000:ffffabe680380de0 EFLAGS: 00010087
[ 1234.778839] RAX: ffffffffc04e9230 RBX: 0000000000000000 RCX: 0000000000000018
[ 1234.778897] RDX: 00000ba278e8977a RSI: ffff953fb288b460 RDI: 0000000000000000
[ 1234.778901] RBP: ffff953fb288b598 R08: 00000000000000e0 R09: ffff953fbd98b808
[ 1234.778905] R10: 0000000000000000 R11: ffffabe680380ff8 R12: ffffabe680380e00
[ 1234.778908] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff953fbd9ec458
[ 1234.778912] FS: 00007f35e7008580(0000) GS:ffff95428ebc0000(0000) knlGS:0000000000000000
[ 1234.778916] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1234.778919] CR2: 0000000000000088 CR3: 000000010147c000 CR4: 00000000003506e0
[ 1234.778924] Call Trace:
[ 1234.778981]  <IRQ>
[ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
[ 1234.778999]  dma_fence_signal+0x2c/0x50
[ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
[ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
[ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
[ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
[ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
[ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
[ 1234.779946]  handle_irq_event+0x34/0x70
[ 1234.779949]  handle_edge_irq+0x9f/0x240
[ 1234.779954]  __common_interrupt+0x66/0x100
[ 1234.779960]  common_interrupt+0xa0/0xc0
[ 1234.779965]  </IRQ>
[ 1234.779968]  <TASK>
[ 1234.779971]  asm_common_interrupt+0x22/0x40
[ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
[ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48 83 ea
[ 1234.779985] RSP: 0000:ffffabe680bcfd78 EFLAGS: 00000202

I'll see if I can dig any more out.

Dave.
On Fri, 7 Oct 2022 at 06:24, Dave Airlie <airlied@gmail.com> wrote: > > On Fri, 7 Oct 2022 at 06:14, Alex Deucher <alexdeucher@gmail.com> wrote: > > > > On Thu, Oct 6, 2022 at 3:48 PM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher <alexdeucher@gmail.com> wrote: > > > > > > > > Maybe you are seeing this which is an issue with GPU TLB flushes which > > > > is kind of sporadic: > > > > https://gitlab.freedesktop.org/drm/amd/-/issues/2113 > > > > > > Well, that seems to be 5.19, and while timing changes (or whatever > > > other software updates) could have made it start trigger, this machine > > > has been pretty solid otgerwise. > > > > > > > Are you seeing any GPU page faults in your kernel log? > > > > > > Nothing even remotely like that "no-retry page fault" in that issue > > > report. Of course, if it happens just before the whole thing locks > > > up... > > > > Your chip is too old to support retry faults so it's likely you could > > be just seeing a GPU page fault followed by a hang. Your chip also > > lacks a paging queue, so you would be affected by the TLB issue. > > > Okay I got my FIJI running Linus tree and netconsole to blow up like > this, running fedora 36 desktop, steam, firefox, and then I ran > poweroff over ssh. 
> > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088 > [ 1234.778782] #PF: supervisor read access in kernel mode > [ 1234.778787] #PF: error_code(0x0000) - not-present page > [ 1234.778791] PGD 0 P4D 0 > [ 1234.778798] Oops: 0000 [#1] PREEMPT SMP NOPTI > [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2 > [ 1234.778809] Hardware name: System manufacturer System Product > Name/PRIME X370-PRO, BIOS 5603 07/28/2020 > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched] > [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f > ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53 > 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00 > 00 f0 > [ 1234.778834] RSP: 0000:ffffabe680380de0 EFLAGS: 00010087 > [ 1234.778839] RAX: ffffffffc04e9230 RBX: 0000000000000000 RCX: 0000000000000018 > [ 1234.778897] RDX: 00000ba278e8977a RSI: ffff953fb288b460 RDI: 0000000000000000 > [ 1234.778901] RBP: ffff953fb288b598 R08: 00000000000000e0 R09: ffff953fbd98b808 > [ 1234.778905] R10: 0000000000000000 R11: ffffabe680380ff8 R12: ffffabe680380e00 > [ 1234.778908] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff953fbd9ec458 > [ 1234.778912] FS: 00007f35e7008580(0000) GS:ffff95428ebc0000(0000) > knlGS:0000000000000000 > [ 1234.778916] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 1234.778919] CR2: 0000000000000088 CR3: 000000010147c000 CR4: 00000000003506e0 > [ 1234.778924] Call Trace: > [ 1234.778981] <IRQ> > [ 1234.778989] dma_fence_signal_timestamp_locked+0x6a/0xe0 > [ 1234.778999] dma_fence_signal+0x2c/0x50 > [ 1234.779005] amdgpu_fence_process+0xc8/0x140 [amdgpu] > [ 1234.779234] sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu] > [ 1234.779395] amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu] > [ 1234.779609] amdgpu_ih_process+0x80/0x100 [amdgpu] > [ 1234.779783] amdgpu_irq_handler+0x1f/0x60 [amdgpu] > [ 1234.779940] __handle_irq_event_percpu+0x46/0x190 > [ 
1234.779946] handle_irq_event+0x34/0x70 > [ 1234.779949] handle_edge_irq+0x9f/0x240 > [ 1234.779954] __common_interrupt+0x66/0x100 > [ 1234.779960] common_interrupt+0xa0/0xc0 > [ 1234.779965] </IRQ> > [ 1234.779968] <TASK> > [ 1234.779971] asm_common_interrupt+0x22/0x40 > [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110 > [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41 > 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30 > 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48 > 83 ea > [ 1234.779985] RSP: 0000:ffffabe680bcfd78 EFLAGS: 00000202 > > I'll see if I can dig any. I'm kicking the tires on the drm-next tree just prior to submission, and in an attempt to make myself look foolish and to tempt fate, it seems stable. This might mean a silent merge conflict/regression, I'll bash on the drm-next tree a lot more and see if I can play spot the difference. Dave.
On Fri, 7 Oct 2022 at 07:41, Dave Airlie <airlied@gmail.com> wrote: > > On Fri, 7 Oct 2022 at 06:24, Dave Airlie <airlied@gmail.com> wrote: > > > > On Fri, 7 Oct 2022 at 06:14, Alex Deucher <alexdeucher@gmail.com> wrote: > > > > > > On Thu, Oct 6, 2022 at 3:48 PM Linus Torvalds > > > <torvalds@linux-foundation.org> wrote: > > > > > > > > On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher <alexdeucher@gmail.com> wrote: > > > > > > > > > > Maybe you are seeing this which is an issue with GPU TLB flushes which > > > > > is kind of sporadic: > > > > > https://gitlab.freedesktop.org/drm/amd/-/issues/2113 > > > > > > > > Well, that seems to be 5.19, and while timing changes (or whatever > > > > other software updates) could have made it start trigger, this machine > > > > has been pretty solid otgerwise. > > > > > > > > > Are you seeing any GPU page faults in your kernel log? > > > > > > > > Nothing even remotely like that "no-retry page fault" in that issue > > > > report. Of course, if it happens just before the whole thing locks > > > > up... > > > > > > Your chip is too old to support retry faults so it's likely you could > > > be just seeing a GPU page fault followed by a hang. Your chip also > > > lacks a paging queue, so you would be affected by the TLB issue. > > > > > > Okay I got my FIJI running Linus tree and netconsole to blow up like > > this, running fedora 36 desktop, steam, firefox, and then I ran > > poweroff over ssh. 
> > > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088 > > [ 1234.778782] #PF: supervisor read access in kernel mode > > [ 1234.778787] #PF: error_code(0x0000) - not-present page > > [ 1234.778791] PGD 0 P4D 0 > > [ 1234.778798] Oops: 0000 [#1] PREEMPT SMP NOPTI > > [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2 > > [ 1234.778809] Hardware name: System manufacturer System Product > > Name/PRIME X370-PRO, BIOS 5603 07/28/2020 > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched] > > [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f > > ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53 > > 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00 > > 00 f0 > > [ 1234.778834] RSP: 0000:ffffabe680380de0 EFLAGS: 00010087 > > [ 1234.778839] RAX: ffffffffc04e9230 RBX: 0000000000000000 RCX: 0000000000000018 > > [ 1234.778897] RDX: 00000ba278e8977a RSI: ffff953fb288b460 RDI: 0000000000000000 > > [ 1234.778901] RBP: ffff953fb288b598 R08: 00000000000000e0 R09: ffff953fbd98b808 > > [ 1234.778905] R10: 0000000000000000 R11: ffffabe680380ff8 R12: ffffabe680380e00 > > [ 1234.778908] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff953fbd9ec458 > > [ 1234.778912] FS: 00007f35e7008580(0000) GS:ffff95428ebc0000(0000) > > knlGS:0000000000000000 > > [ 1234.778916] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 1234.778919] CR2: 0000000000000088 CR3: 000000010147c000 CR4: 00000000003506e0 > > [ 1234.778924] Call Trace: > > [ 1234.778981] <IRQ> > > [ 1234.778989] dma_fence_signal_timestamp_locked+0x6a/0xe0 > > [ 1234.778999] dma_fence_signal+0x2c/0x50 > > [ 1234.779005] amdgpu_fence_process+0xc8/0x140 [amdgpu] > > [ 1234.779234] sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu] > > [ 1234.779395] amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu] > > [ 1234.779609] amdgpu_ih_process+0x80/0x100 [amdgpu] > > [ 1234.779783] amdgpu_irq_handler+0x1f/0x60 [amdgpu] 
> > [ 1234.779940] __handle_irq_event_percpu+0x46/0x190 > > [ 1234.779946] handle_irq_event+0x34/0x70 > > [ 1234.779949] handle_edge_irq+0x9f/0x240 > > [ 1234.779954] __common_interrupt+0x66/0x100 > > [ 1234.779960] common_interrupt+0xa0/0xc0 > > [ 1234.779965] </IRQ> > > [ 1234.779968] <TASK> > > [ 1234.779971] asm_common_interrupt+0x22/0x40 > > [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110 > > [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41 > > 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30 > > 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48 > > 83 ea > > [ 1234.779985] RSP: 0000:ffffabe680bcfd78 EFLAGS: 00000202 > > > > I'll see if I can dig any. > > I'm kicking the tires on the drm-next tree just prior to submission, > and in an attempt to make myself look foolish and to tempt fate, it > seems stable. Yay it worked, crashed drm-next. will start reverting down the rabbit hole. Dave.
On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@gmail.com> wrote:
>
> [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]

As far as I can tell, that's the line

        struct drm_gpu_scheduler *sched = s_fence->sched;

where 's_fence' is NULL. The code is

   0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   5:   41 54                   push   %r12
   7:   55                      push   %rbp
   8:   53                      push   %rbx
   9:   48 89 fb                mov    %rdi,%rbx
   c:*  48 8b af 88 00 00 00    mov    0x88(%rdi),%rbp    <-- trapping instruction
  13:   f0 ff 8d f0 00 00 00    lock decl 0xf0(%rbp)
  1a:   48 8b 85 80 01 00 00    mov    0x180(%rbp),%rax

and that next 'lock decl' instruction would have been the

        atomic_dec(&sched->hw_rq_count);

at the top of drm_sched_job_done().

Now, as to *why* you'd have a NULL s_fence, it would seem that
drm_sched_job_cleanup() was called with an active job. Looking at that
code, it does

        if (kref_read(&job->s_fence->finished.refcount)) {
                /* drm_sched_job_arm() has been called */
                dma_fence_put(&job->s_fence->finished);
        ...

but then it does

        job->s_fence = NULL;

anyway, despite the job still being active. The logic of that kind of
"fake refcount" escapes me. The above looks fundamentally racy, not to
say pointless and wrong (a refcount is a _count_, not a flag, so there
could be multiple references to it, what says that you can just
decrement one of them and say "I'm done").

Now, _why_ any of that happens, I have no idea. I'm just looking at
the immediate "that pointer is NULL" thing, and reacting to what looks
like a completely bogus refcount pattern.

But that odd refcount pattern isn't new, so it's presumably some user
on the amd gpu side that changed.

The problem hasn't happened again for me, but that's not saying a lot,
since it was very random to begin with.

Linus
On Fri, 7 Oct 2022 at 09:45, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@gmail.com> wrote: > > > > > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088 > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched] > > As far as I can tell, that's the line > > struct drm_gpu_scheduler *sched = s_fence->sched; > > where 's_fence' is NULL. The code is > > 0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > 5: 41 54 push %r12 > 7: 55 push %rbp > 8: 53 push %rbx > 9: 48 89 fb mov %rdi,%rbx > c:* 48 8b af 88 00 00 00 mov 0x88(%rdi),%rbp <-- trapping instruction > 13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp) > 1a: 48 8b 85 80 01 00 00 mov 0x180(%rbp),%rax > > and that next 'lock decl' instruction would have been the > > atomic_dec(&sched->hw_rq_count); > > at the top of drm_sched_job_done(). > > Now, as to *why* you'd have a NULL s_fence, it would seem that > drm_sched_job_cleanup() was called with an active job. Looking at that > code, it does > > if (kref_read(&job->s_fence->finished.refcount)) { > /* drm_sched_job_arm() has been called */ > dma_fence_put(&job->s_fence->finished); > ... > > but then it does > > job->s_fence = NULL; > > anyway, despite the job still being active. The logic of that kind of > "fake refcount" escapes me. The above looks fundamentally racy, not to > say pointless and wrong (a refcount is a _count_, not a flag, so there > could be multiple references to it, what says that you can just > decrement one of them and say "I'm done"). > > Now, _why_ any of that happens, I have no idea. I'm just looking at > the immediate "that pointer is NULL" thing, and reacting to what looks > like a completely bogus refcount pattern. > > But that odd refcount pattern isn't new, so it's presumably some user > on the amd gpu side that changed. > > The problem hasn't happened again for me, but that's not saying a lot, > since it was very random to begin with. 
I chased down the culprit to a drm sched patch, I'll send you a pull
with a revert in it.

commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
Author: Arvind Yadav <Arvind.Yadav@amd.com>
Date:   Wed Sep 14 22:13:20 2022 +0530

    drm/sched: Use parent fence instead of finished

    Using the parent fence instead of the finished fence
    to get the job status. This change is to avoid GPU
    scheduler timeout error which can cause GPU reset.

    Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
    Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com
    Signed-off-by: Christian König <christian.koenig@amd.com>

I'll let Arvind and Christian maybe work out what is going wrong there.

Dave.

> Linus
On Fri, 7 Oct 2022 at 12:45, Dave Airlie <airlied@gmail.com> wrote: > > On Fri, 7 Oct 2022 at 09:45, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@gmail.com> wrote: > > > > > > > > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088 > > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched] > > > > As far as I can tell, that's the line > > > > struct drm_gpu_scheduler *sched = s_fence->sched; > > > > where 's_fence' is NULL. The code is > > > > 0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > > 5: 41 54 push %r12 > > 7: 55 push %rbp > > 8: 53 push %rbx > > 9: 48 89 fb mov %rdi,%rbx > > c:* 48 8b af 88 00 00 00 mov 0x88(%rdi),%rbp <-- trapping instruction > > 13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp) > > 1a: 48 8b 85 80 01 00 00 mov 0x180(%rbp),%rax > > > > and that next 'lock decl' instruction would have been the > > > > atomic_dec(&sched->hw_rq_count); > > > > at the top of drm_sched_job_done(). > > > > Now, as to *why* you'd have a NULL s_fence, it would seem that > > drm_sched_job_cleanup() was called with an active job. Looking at that > > code, it does > > > > if (kref_read(&job->s_fence->finished.refcount)) { > > /* drm_sched_job_arm() has been called */ > > dma_fence_put(&job->s_fence->finished); > > ... > > > > but then it does > > > > job->s_fence = NULL; > > > > anyway, despite the job still being active. The logic of that kind of > > "fake refcount" escapes me. The above looks fundamentally racy, not to > > say pointless and wrong (a refcount is a _count_, not a flag, so there > > could be multiple references to it, what says that you can just > > decrement one of them and say "I'm done"). > > > > Now, _why_ any of that happens, I have no idea. I'm just looking at > > the immediate "that pointer is NULL" thing, and reacting to what looks > > like a completely bogus refcount pattern. 
> >
> > But that odd refcount pattern isn't new, so it's presumably some user
> > on the amd gpu side that changed.
> >
> > The problem hasn't happened again for me, but that's not saying a lot,
> > since it was very random to begin with.
>
> I chased down the culprit to a drm sched patch, I'll send you a pull
> with a revert in it.
>
> commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
> Author: Arvind Yadav <Arvind.Yadav@amd.com>
> Date:   Wed Sep 14 22:13:20 2022 +0530
>
>     drm/sched: Use parent fence instead of finished
>
>     Using the parent fence instead of the finished fence
>     to get the job status. This change is to avoid GPU
>     scheduler timeout error which can cause GPU reset.
>
>     Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
>     Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com
>     Signed-off-by: Christian König <christian.koenig@amd.com>
>
> I'll let Arvind and Christian maybe work out what is going wrong there.

I do spy two changes queued for -next that might be relevant, so I
might try just pulling those instead.

I'll send a PR in the next hour once I test it.

Dave.
On Fri, 7 Oct 2022 at 12:54, Dave Airlie <airlied@gmail.com> wrote: > > On Fri, 7 Oct 2022 at 12:45, Dave Airlie <airlied@gmail.com> wrote: > > > > On Fri, 7 Oct 2022 at 09:45, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@gmail.com> wrote: > > > > > > > > > > > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088 > > > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched] > > > > > > As far as I can tell, that's the line > > > > > > struct drm_gpu_scheduler *sched = s_fence->sched; > > > > > > where 's_fence' is NULL. The code is > > > > > > 0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > > > 5: 41 54 push %r12 > > > 7: 55 push %rbp > > > 8: 53 push %rbx > > > 9: 48 89 fb mov %rdi,%rbx > > > c:* 48 8b af 88 00 00 00 mov 0x88(%rdi),%rbp <-- trapping instruction > > > 13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp) > > > 1a: 48 8b 85 80 01 00 00 mov 0x180(%rbp),%rax > > > > > > and that next 'lock decl' instruction would have been the > > > > > > atomic_dec(&sched->hw_rq_count); > > > > > > at the top of drm_sched_job_done(). > > > > > > Now, as to *why* you'd have a NULL s_fence, it would seem that > > > drm_sched_job_cleanup() was called with an active job. Looking at that > > > code, it does > > > > > > if (kref_read(&job->s_fence->finished.refcount)) { > > > /* drm_sched_job_arm() has been called */ > > > dma_fence_put(&job->s_fence->finished); > > > ... > > > > > > but then it does > > > > > > job->s_fence = NULL; > > > > > > anyway, despite the job still being active. The logic of that kind of > > > "fake refcount" escapes me. The above looks fundamentally racy, not to > > > say pointless and wrong (a refcount is a _count_, not a flag, so there > > > could be multiple references to it, what says that you can just > > > decrement one of them and say "I'm done"). > > > > > > Now, _why_ any of that happens, I have no idea. 
I'm just looking at > > > the immediate "that pointer is NULL" thing, and reacting to what looks > > > like a completely bogus refcount pattern. > > > > > > But that odd refcount pattern isn't new, so it's presumably some user > > > on the amd gpu side that changed. > > > > > > The problem hasn't happened again for me, but that's not saying a lot, > > > since it was very random to begin with. > > > > I chased down the culprit to a drm sched patch, I'll send you a pull > > with a revert in it. > > > > commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86 > > Author: Arvind Yadav <Arvind.Yadav@amd.com> > > Date: Wed Sep 14 22:13:20 2022 +0530 > > > > drm/sched: Use parent fence instead of finished > > > > Using the parent fence instead of the finished fence > > to get the job status. This change is to avoid GPU > > scheduler timeout error which can cause GPU reset. > > > > Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com> > > Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> > > Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com > > Signed-off-by: Christian König <christian.koenig@amd.com> > > > > I'll let Arvind and Christian maybe work out what is going wrong there. > > I do spy two changes queued for -next that might be relevant, so I > might try just pulling those instead. > > I'll send a PR in next hour once I test it. Okay sent, let me know if you see any further problems. Dave.
On 07.10.22 04:45, Dave Airlie wrote:
> On Fri, 7 Oct 2022 at 09:45, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@gmail.com> wrote:
>>>
>>> [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
>>> [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
>>
>> As far as I can tell, that's the line
>>
>>         struct drm_gpu_scheduler *sched = s_fence->sched;
>>
>> where 's_fence' is NULL. The code is
>>
>>    0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>    5:   41 54                   push   %r12
>>    7:   55                      push   %rbp
>>    8:   53                      push   %rbx
>>    9:   48 89 fb                mov    %rdi,%rbx
>>    c:*  48 8b af 88 00 00 00    mov    0x88(%rdi),%rbp  <-- trapping instruction
>>   13:   f0 ff 8d f0 00 00 00    lock decl 0xf0(%rbp)
>>   1a:   48 8b 85 80 01 00 00    mov    0x180(%rbp),%rax
>>
>> and that next 'lock decl' instruction would have been the
>>
>>         atomic_dec(&sched->hw_rq_count);
>>
>> at the top of drm_sched_job_done().
>>
>> Now, as to *why* you'd have a NULL s_fence, it would seem that
>> drm_sched_job_cleanup() was called with an active job. Looking at that
>> code, it does
>>
>>         if (kref_read(&job->s_fence->finished.refcount)) {
>>                 /* drm_sched_job_arm() has been called */
>>                 dma_fence_put(&job->s_fence->finished);
>>         ...
>>
>> but then it does
>>
>>         job->s_fence = NULL;
>>
>> anyway, despite the job still being active. The logic of that kind of
>> "fake refcount" escapes me. The above looks fundamentally racy, not to
>> say pointless and wrong (a refcount is a _count_, not a flag, so there
>> could be multiple references to it, what says that you can just
>> decrement one of them and say "I'm done").
>>
>> Now, _why_ any of that happens, I have no idea. I'm just looking at
>> the immediate "that pointer is NULL" thing, and reacting to what looks
>> like a completely bogus refcount pattern.
>>
>> But that odd refcount pattern isn't new, so it's presumably some user
>> on the amd gpu side that changed.
>>
>> The problem hasn't happened again for me, but that's not saying a lot,
>> since it was very random to begin with.
>
> I chased down the culprit to a drm sched patch, I'll send you a pull
> with a revert in it.
>
> commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
> Author: Arvind Yadav <Arvind.Yadav@amd.com>
> Date:   Wed Sep 14 22:13:20 2022 +0530
>
>     drm/sched: Use parent fence instead of finished
>
>     Using the parent fence instead of the finished fence
>     to get the job status. This change is to avoid GPU
>     scheduler timeout error which can cause GPU reset.
>
>     Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
>     Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com
>     Signed-off-by: Christian König <christian.koenig@amd.com>
>
> I'll let Arvind and Christian maybe work out what is going wrong there.

That's a known issue Arvind has already been investigating for a while.
Any idea how you triggered it on boot? We have only been able to
trigger it very sporadically.

Reverting the patch for now sounds like a good idea to me, it's only a
cleanup anyway.

Thanks,
Christian.

>
> Dave.
>
>> Linus
On Fri, 7 Oct 2022 at 01:45, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@gmail.com> wrote:
> >
> >
> > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
>
> As far as I can tell, that's the line
>
>         struct drm_gpu_scheduler *sched = s_fence->sched;
>
> where 's_fence' is NULL. The code is
>
>    0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>    5:   41 54                   push   %r12
>    7:   55                      push   %rbp
>    8:   53                      push   %rbx
>    9:   48 89 fb                mov    %rdi,%rbx
>    c:*  48 8b af 88 00 00 00    mov    0x88(%rdi),%rbp  <-- trapping instruction
>   13:   f0 ff 8d f0 00 00 00    lock decl 0xf0(%rbp)
>   1a:   48 8b 85 80 01 00 00    mov    0x180(%rbp),%rax
>
> and that next 'lock decl' instruction would have been the
>
>         atomic_dec(&sched->hw_rq_count);
>
> at the top of drm_sched_job_done().
>
> Now, as to *why* you'd have a NULL s_fence, it would seem that
> drm_sched_job_cleanup() was called with an active job. Looking at that
> code, it does
>
>         if (kref_read(&job->s_fence->finished.refcount)) {
>                 /* drm_sched_job_arm() has been called */
>                 dma_fence_put(&job->s_fence->finished);
>         ...
>
> but then it does
>
>         job->s_fence = NULL;
>
> anyway, despite the job still being active. The logic of that kind of
> "fake refcount" escapes me. The above looks fundamentally racy, not to
> say pointless and wrong (a refcount is a _count_, not a flag, so there
> could be multiple references to it, what says that you can just
> decrement one of them and say "I'm done").

Just figured I'll clarify this, because it's indeed a bit wtf and the
comment doesn't explain much. drm_sched_job_cleanup can be called both
when a real job is being cleaned up (which holds a full reference on
job->s_fence and needs to drop it) and to simplify the error path in job
construction (and the "is this refcount initialized already" check signals
what exactly needs to be cleaned up or not). So no race, because the
only time this check goes differently is when job construction has
failed before the job struct is visible to any other thread.

But yeah the comment could actually explain what's going on here :-)

And yeah the patch Dave reverted screws up the cascade of references
that ensures this all stays alive until drm_sched_job_cleanup is
called on active jobs, so looks all reasonable to me. Some Kunit tests
maybe to exercise these corners? Not the first time pure scheduler
code blew up, so probably worth the effort.
-Daniel

>
> Now, _why_ any of that happens, I have no idea. I'm just looking at
> the immediate "that pointer is NULL" thing, and reacting to what looks
> like a completely bogus refcount pattern.
>
> But that odd refcount pattern isn't new, so it's presumably some user
> on the amd gpu side that changed.
>
> The problem hasn't happened again for me, but that's not saying a lot,
> since it was very random to begin with.
>
> Linus
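Daniel's point about the "refcount as an armed flag" can be sketched in a few lines of plain C. This is a hypothetical toy model for illustration only, not the actual drm_sched code: the `fence`/`job` structs and `job_arm`/`job_cleanup` names are made up, and a plain `int` stands in for the kernel's kref.

```c
#include <stddef.h>

/*
 * Toy model of the pattern: one cleanup function serves both the error
 * path of job construction (fence never armed, refcount still zero) and
 * the teardown of a real job (fence armed, so drop our reference).
 */
struct fence {
	int refcount;		/* stands in for the kernel's kref */
};

struct job {
	struct fence *s_fence;
};

static void fence_get(struct fence *f) { f->refcount++; }
static void fence_put(struct fence *f) { f->refcount--; }

/* drm_sched_job_arm() analogue: takes the first reference. */
static void job_arm(struct job *job)
{
	fence_get(job->s_fence);
}

/* drm_sched_job_cleanup() analogue. Returns 1 if it dropped a ref. */
static int job_cleanup(struct job *job)
{
	int armed = job->s_fence->refcount != 0;

	if (armed) {
		/* job_arm() has been called: drop our reference. */
		fence_put(job->s_fence);
	}
	/* Either way the job no longer owns the fence pointer. */
	job->s_fence = NULL;
	return armed;
}
```

As Daniel notes, the check only takes the unarmed branch before the job struct is visible to any other thread, which is why it is not a race in practice.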
Forgot to add Andrey as scheduler maintainer.
-Daniel

On Fri, 7 Oct 2022 at 10:16, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Fri, 7 Oct 2022 at 01:45, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@gmail.com> wrote:
> > >
> > >
> > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> >
> > As far as I can tell, that's the line
> >
> >         struct drm_gpu_scheduler *sched = s_fence->sched;
> >
> > where 's_fence' is NULL. The code is
> >
> >    0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> >    5:   41 54                   push   %r12
> >    7:   55                      push   %rbp
> >    8:   53                      push   %rbx
> >    9:   48 89 fb                mov    %rdi,%rbx
> >    c:*  48 8b af 88 00 00 00    mov    0x88(%rdi),%rbp  <-- trapping instruction
> >   13:   f0 ff 8d f0 00 00 00    lock decl 0xf0(%rbp)
> >   1a:   48 8b 85 80 01 00 00    mov    0x180(%rbp),%rax
> >
> > and that next 'lock decl' instruction would have been the
> >
> >         atomic_dec(&sched->hw_rq_count);
> >
> > at the top of drm_sched_job_done().
> >
> > Now, as to *why* you'd have a NULL s_fence, it would seem that
> > drm_sched_job_cleanup() was called with an active job. Looking at that
> > code, it does
> >
> >         if (kref_read(&job->s_fence->finished.refcount)) {
> >                 /* drm_sched_job_arm() has been called */
> >                 dma_fence_put(&job->s_fence->finished);
> >         ...
> >
> > but then it does
> >
> >         job->s_fence = NULL;
> >
> > anyway, despite the job still being active. The logic of that kind of
> > "fake refcount" escapes me. The above looks fundamentally racy, not to
> > say pointless and wrong (a refcount is a _count_, not a flag, so there
> > could be multiple references to it, what says that you can just
> > decrement one of them and say "I'm done").
>
> Just figured I'll clarify this, because it's indeed a bit wtf and the
> comment doesn't explain much. drm_sched_job_cleanup can be called both
> when a real job is being cleaned up (which holds a full reference on
> job->s_fence and needs to drop it) and to simplify the error path in job
> construction (and the "is this refcount initialized already" check signals
> what exactly needs to be cleaned up or not). So no race, because the
> only time this check goes differently is when job construction has
> failed before the job struct is visible to any other thread.
>
> But yeah the comment could actually explain what's going on here :-)
>
> And yeah the patch Dave reverted screws up the cascade of references
> that ensures this all stays alive until drm_sched_job_cleanup is
> called on active jobs, so looks all reasonable to me. Some Kunit tests
> maybe to exercise these corners? Not the first time pure scheduler
> code blew up, so probably worth the effort.
> -Daniel
>
> >
> > Now, _why_ any of that happens, I have no idea. I'm just looking at
> > the immediate "that pointer is NULL" thing, and reacting to what looks
> > like a completely bogus refcount pattern.
> >
> > But that odd refcount pattern isn't new, so it's presumably some user
> > on the amd gpu side that changed.
> >
> > The problem hasn't happened again for me, but that's not saying a lot,
> > since it was very random to begin with.
> >
> > Linus
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
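The crash Linus decoded above can likewise be modeled in a few lines of plain C. Again a hypothetical toy sketch, not the kernel source: the `toy_*` names are made up, and the explicit NULL check here stands in for the spot where the real completion callback simply faults (the `mov 0x88(%rdi),%rbp` at address 0x88 in the oops).

```c
#include <stddef.h>

/* Minimal model: a scheduler, a fence pointing back at it, and a job. */
struct toy_scheduler { int hw_rq_count; };
struct toy_fence     { struct toy_scheduler *sched; };
struct toy_job       { struct toy_fence *s_fence; };

/*
 * Completion path, like drm_sched_job_done(): the first thing it does is
 * chase s_fence->sched. Returns 0 on success, -1 where the real code,
 * having no such check, would dereference NULL.
 */
static int toy_job_done(struct toy_job *job)
{
	if (job->s_fence == NULL)
		return -1;
	job->s_fence->sched->hw_rq_count--;
	return 0;
}

/*
 * Cleanup path: detaches the fence from the job. Running this while the
 * job is still in flight is exactly the ordering bug in the thread.
 */
static void toy_job_cleanup(struct toy_job *job)
{
	job->s_fence = NULL;
}
```

The cascade of references Daniel mentions exists precisely so that cleanup cannot run before the completion callback is done with the fence; the reverted patch broke that ordering.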