Message ID | CAPM=9tzR4BqTtamrTy4T_XV7E0fUNyduaVtH5zAi=sqwX_3udg@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [git,pull] drm for 5.14-rc1 | expand |
On Wed, Jun 30, 2021 at 9:34 PM Dave Airlie <airlied@gmail.com> wrote: > > Hi Linus, > > This is the main drm pull request for 5.14-rc1. > > I've done a test pull into your current tree, and hit two conflicts > (one in vc4, one in amdgpu), both seem pretty trivial, the amdgpu one > is recent and sfr sent out a resolution for it today. Well, the resolutions may be trivial, but the conflict made me look at the code, and it's buggy. Commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function") is broken. It made the code do mmap_read_lock(mm); vma = find_vma(mm, start); mmap_read_unlock(mm); and then it *uses* that "vma" after it has dropped the lock. That's a big no-no - once you've dropped the lock, the vma contents simply aren't reliable any more. That mapping could now be unmapped and removed at any time. Now, the conflict actually made one of the uses go away (switching to vma_lookup() means that the subsequent code no longer needs to look at "vm_start" to verify we're actually _inside_ the vma), but it still checks for vma->vm_file afterwards. So those locking changes in commit 04d8d73dbcbe are completely bogus. I tried to fix up that bug while handling the conflict, but who knows what else similar is going on elsewhere. So I would ask people to (a) verify that I didn't make things worse as I fixed things up (note how I had to change the last argument to amdgpu_hmm_range_get_pages() from false to true etc). (b) go and look at their vma lookup code: you can't just look up a vma under the lock, and then drop the lock, and then think things stay stable. In particular for that (b) case: it is *NOT* enough to look up vma->vm_file inside the lock and cache that. No - if the test is about "no backing file before looking up pages", then you have to *keep* holding the lock until after you've actually looked up the pages! Because otherwise any test for "vma->vm_file" is entirely pointless, for the same reason it's buggy to even look at it after dropping the lock: because once you've dropped the lock, the thing you just tested for might not be true any more. So no, it's not valid to do bool has_file = vma && vma->vm_file; and then drop the lock, because you don't use 'vma' any more as a pointer, and then use 'has_file' outside the lock. Because after you've dropped the lock, 'has_file' is now meaningless. So it's not just about "you can't look at vma->vm_file after dropping the lock". It's more fundamental than that. Any *decision* you make based on the vma is entirely pointless and moot after the lock is dropped! Did I fix it up correctly? Who knows. The code makes more sense to me now and seems valid. But I really *really* want to stress how locking is important. You also can't just unlock in the middle of an operation - even if you then take the lock *again* later (as amdgpu_hmm_range_get_pages() then did), the fact that you unlocked in the middle means that all the earlier tests you did are simply no longer valid when you re-take the lock. Linus
The pull request you sent on Thu, 1 Jul 2021 14:34:15 +1000:
> git://anongit.freedesktop.org/drm/drm tags/drm-next-2021-07-01
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/e058a84bfddc42ba356a2316f2cf1141974625c9
Thank you!
Am 2021-07-01 um 4:15 p.m. schrieb Linus Torvalds: > On Wed, Jun 30, 2021 at 9:34 PM Dave Airlie <airlied@gmail.com> wrote: >> Hi Linus, >> >> This is the main drm pull request for 5.14-rc1. >> >> I've done a test pull into your current tree, and hit two conflicts >> (one in vc4, one in amdgpu), both seem pretty trivial, the amdgpu one >> is recent and sfr sent out a resolution for it today. > Well, the resolutions may be trivial, but the conflict made me look at > the code, and it's buggy. > > Commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function") > is broken. It made the code do > > mmap_read_lock(mm); > vma = find_vma(mm, start); > mmap_read_unlock(mm); > > and then it *uses* that "vma" after it has dropped the lock. > > That's a big no-no - once you've dropped the lock, the vma contents > simply aren't reliable any more. That mapping could now be unmapped > and removed at any time. > > Now, the conflict actually made one of the uses go away (switching to > vma_lookup() means that the subsequent code no longer needs to look at > "vm_start" to verify we're actually _inside_ the vma), but it still > checks for vma->vm_file afterwards. > > So those locking changes in commit 04d8d73dbcbe are completely bogus. > > I tried to fix up that bug while handling the conflict, but who knows > what else similar is going on elsewhere. > > So I would ask people to > > (a) verify that I didn't make things worse as I fixed things up (note > how I had to change the last argument to amdgpu_hmm_range_get_pages() > from false to true etc). > > (b) go and look at their vma lookup code: you can't just look up a > vma under the lock, and then drop the lock, and then think things stay > stable. > > In particular for that (b) case: it is *NOT* enough to look up > vma->vm_file inside the lock and cache that. No - if the test is about > "no backing file before looking up pages", then you have to *keep* > holding the lock until after you've actually looked up the pages! > > Because otherwise any test for "vma->vm_file" is entirely pointless, > for the same reason it's buggy to even look at it after dropping the > lock: because once you've dropped the lock, the thing you just tested > for might not be true any more. > > So no, it's not valid to do > > bool has_file = vma && vma->vm_file; > > and then drop the lock, because you don't use 'vma' any more as a > pointer, and then use 'has_file' outside the lock. Because after > you've dropped the lock, 'has_file' is now meaningless. > > So it's not just about "you can't look at vma->vm_file after dropping > the lock". It's more fundamental than that. Any *decision* you make > based on the vma is entirely pointless and moot after the lock is > dropped! > > Did I fix it up correctly? Who knows. The code makes more sense to me > now and seems valid. But I really *really* want to stress how locking > is important. Thank you for the fix and the explanation. Your fix looks correct. I also double-checked all other uses of find_vma in the amdgpu driver. They all hold the mmap lock correctly. Two comments: With this fix, we could remove the bool mmap_locked parameter from amdgpu_hmm_range_get_pages because it always gets called with the lock held now. You're now holding the mmap lock from the vma_lookup until hmm_range_fault is done. This ensures that the result of the vma->vm_file check remains valid. This was broken even before our commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function"). > > You also can't just unlock in the middle of an operation - even if you > then take the lock *again* later (as amdgpu_hmm_range_get_pages() then > did), the fact that you unlocked in the middle means that all the > earlier tests you did are simply no longer valid when you re-take the > lock. I agree completely. I catch a lot of locking bugs in code review. I probably missed this one because I wasn't paying enough attention to what was being protected by the mmap_read_lock in this case. Regards, Felix > > Linus
Hi, torvalds at linux-foundation.org (Linus Torvalds) writes: > Did I fix it up correctly? Who knows. The code makes more sense to me > now and seems valid. But I really *really* want to stress how locking > is important. As far as I can tell, this merge conflict resolution made my Raspberry Pi 3 hang on boot. You can find the full serial console output at: https://t.zekjur.net/linux-5.14-raspberry-pi-3-hang-vc4.txt The last few messages are from vc4, then the boot hangs. Using git-bisect, I tracked this down to https://github.com/torvalds/linux/commit/e058a84bfddc42ba356a2316f2cf1141974625c9, which is the merge you’re talking about here, AFAICT. I also tried the git://anongit.freedesktop.org/drm/drm, and that tree boots as expected, suggesting that the problem really is with the additional changes. The code seems to work on my Raspberry Pi 4, just not on the Raspberry Pi 3. Any ideas why that might be, and how to fix it? Thanks!
CC Emma and Maxime On Saturday, September 18th, 2021 at 11:18, Michael Stapelberg <michael@stapelberg.ch> wrote: > Hi, > > torvalds at linux-foundation.org (Linus Torvalds) writes: > > Did I fix it up correctly? Who knows. The code makes more sense to me > > now and seems valid. But I really *really* want to stress how locking > > is important. > > As far as I can tell, this merge conflict resolution made my Raspberry > Pi 3 hang on boot. You can find the full serial console output at: > > https://t.zekjur.net/linux-5.14-raspberry-pi-3-hang-vc4.txt > > The last few messages are from vc4, then the boot hangs. > > Using git-bisect, I tracked this down to > https://github.com/torvalds/linux/commit/e058a84bfddc42ba356a2316f2cf1141974625c9, > which is the merge you’re talking about here, AFAICT. > > I also tried the git://anongit.freedesktop.org/drm/drm, and that tree > boots as expected, suggesting that the problem really is with the > additional changes. > > The code seems to work on my Raspberry Pi 4, just not on the Raspberry > Pi 3. Any ideas why that might be, and how to fix it? > > Thanks! > > -- > Best regards, > Michael
On Sat, Sep 18, 2021 at 2:18 AM Michael Stapelberg <michael@stapelberg.ch> wrote: > > torvalds at linux-foundation.org (Linus Torvalds) writes: > > Did I fix it up correctly? Who knows. The code makes more sense to me > > now and seems valid. But I really *really* want to stress how locking > > is important. > > As far as I can tell, this merge conflict resolution made my Raspberry > Pi 3 hang on boot. Ok, that's a different merge issue than the locking one (which is about the amd ttm code). But the VC4 driver did have changes close to each other in the hdmi detection and clock setting code. And it doesn't seem to be just RPi3, there was a report back a couple of weeks ago about RPi4 also having regressed (with an Ubuntu install). That one was an oops in vc4_hdmi_audio_prepare(). I don't know if that got resolved, I heard nothing about it after the report. So there's something seriously wrong in VC4 space. The main issue seems to be the runtime power management changes. As far as I can tell, the commits that didn't come in through that drm pull were these two 9984d6664ce9 ("drm/vc4: hdmi: Make sure the controller is powered in detect") 411efa18e4b0 ("drm/vc4: hdmi: Move the HSM clock enable to runtime_pm") Michael - do things work if you revert those two (sadly, they don't revert cleanly exactly _because_ of the other changes in the same area)? Maxime? Linus
On Sat, 18 Sept 2021 at 21:24, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, Sep 18, 2021 at 2:18 AM Michael Stapelberg > <michael@stapelberg.ch> wrote: > > > > torvalds at linux-foundation.org (Linus Torvalds) writes: > > > Did I fix it up correctly? Who knows. The code makes more sense to me > > > now and seems valid. But I really *really* want to stress how locking > > > is important. > > > > As far as I can tell, this merge conflict resolution made my Raspberry > > Pi 3 hang on boot. > > Ok, that's a different merge issue than the locking one (which is > about the amd ttm code). Ah, my apologies for getting these mixed up! > > But the VC4 driver did have changes close to each other in the hdmi > detection and clock setting code. > > And it doesn't seem to be just RPi3, there was a report back a couple > of weeks ago about RPi4 also having regressed (with an Ubuntu > install). That one was an oops in vc4_hdmi_audio_prepare(). I don't > know if that got resolved, I heard nothing about it after the report. > > So there's something seriously wrong in VC4 space. > > The main issue seems to be the runtime power management changes. As > far as I can tell, the commits that didn't come in through that drm > pull were these two > > 9984d6664ce9 ("drm/vc4: hdmi: Make sure the controller is powered in detect") > 411efa18e4b0 ("drm/vc4: hdmi: Move the HSM clock enable to runtime_pm") > > Michael - do things work if you revert those two (sadly, they don't > revert cleanly exactly _because_ of the other changes in the same > area)? Reverting only 9984d6664ce9 is not sufficient, but reverting both 9984d6664ce9 and 411efa18e4b0 does indeed make my Raspberry Pi 3 boot again! Thanks for identifying the faulty commits. Please let me know if any of y’all want me to test anything else in the process to get this fixed for the next kernel release (or perhaps even a minor release)? > > Maxime? > > Linus
Hi Linus, On Sat, Sep 18, 2021 at 8:24 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, Sep 18, 2021 at 2:18 AM Michael Stapelberg > <michael@stapelberg.ch> wrote: > > > > torvalds at linux-foundation.org (Linus Torvalds) writes: > > > Did I fix it up correctly? Who knows. The code makes more sense to me > > > now and seems valid. But I really *really* want to stress how locking > > > is important. > > > > As far as I can tell, this merge conflict resolution made my Raspberry > > Pi 3 hang on boot. > > Ok, that's a different merge issue than the locking one (which is > about the amd ttm code). > > But the VC4 driver did have changes close to each other in the hdmi > detection and clock setting code. > > And it doesn't seem to be just RPi3, there was a report back a couple > of weeks ago about RPi4 also having regressed (with an Ubuntu > install). That one was an oops in vc4_hdmi_audio_prepare(). I don't > know if that got resolved, I heard nothing about it after the report. Its still there. I am seeing it every night. This was from last night - https://lava.qa.codethink.co.uk/scheduler/job/164#L1356 Last night's test was on top of 4357f03d6611 ("Merge tag 'pm-5.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")
On Sat, Sep 18, 2021 at 1:13 PM Michael Stapelberg <michael@stapelberg.ch> wrote: > > > Michael - do things work if you revert those two (sadly, they don't > > revert cleanly exactly _because_ of the other changes in the same > > area)? > > Reverting only 9984d6664ce9 is not sufficient, but reverting both > 9984d6664ce9 and 411efa18e4b0 does indeed make my Raspberry Pi 3 boot > again! Since you did those reverts and fixed up the conflicts, would you mind sending out the resulting patch so that maybe Sudip can test it too? Maybe the RPi4 hdmi audio issues are related to the RPi4 hdmi problems despite the symptoms apparently being rather different.. Linus
On Sat, Sep 18, 2021 at 3:00 PM Sudip Mukherjee <sudipm.mukherjee@gmail.com> wrote: > > Its still there. I am seeing it every night. This was from last night > - https://lava.qa.codethink.co.uk/scheduler/job/164#L1356 Note that that web server is not available at least to me. Looks like some internal name or limited dns, I just get lava.qa.codethink.co.uk: Name or service not known so your reports aren't getting a lot of traction. Linus
On Sat, Sep 18, 2021 at 11:15 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, Sep 18, 2021 at 3:00 PM Sudip Mukherjee > <sudipm.mukherjee@gmail.com> wrote: > > > > Its still there. I am seeing it every night. This was from last night > > - https://lava.qa.codethink.co.uk/scheduler/job/164#L1356 > > Note that that web server is not available at least to me. Looks like > some internal name or limited dns, I just get > > lava.qa.codethink.co.uk: Name or service not known > > so your reports aren't getting a lot of traction. Looks like a DNS issue with the subdomains, someone else also reported the same last week. I have reported this to our ops and they are looking into it. Also, I have now tested by reverting those two commits and I still get the same trace on rpi4. Copying here again for your reference. [ 37.929903] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000348 [ 37.940882] Mem abort info: [ 37.943715] ESR = 0x96000004 [ 37.946875] EC = 0x25: DABT (current EL), IL = 32 bits [ 37.952547] SET = 0, FnV = 0 [ 37.956086] EA = 0, S1PTW = 0 [ 37.959570] FSC = 0x04: level 0 translation fault [ 37.964561] Data abort info: [ 37.967518] ISV = 0, ISS = 0x00000004 [ 37.971437] CM = 0, WnR = 0 [ 37.974666] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000050be0000 [ 37.981239] [0000000000000348] pgd=0000000000000000, p4d=0000000000000000 [ 37.988234] Internal error: Oops: 96000004 [#1] PREEMPT SMP [ 37.993897] Modules linked in: overlay algif_hash algif_skcipher af_alg bnep sch_fq_codel ppdev lp parport ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic uas usbhid usb_storage snd_soc_hdmi_codec btsdio hci_uart bcm2835_v4l2(C) bcm2835_mmal_vchiq(C) btqca brcmfmac btrtl btbcm videobuf2_vmalloc videobuf2_memops brcmutil btintel videobuf2_v4l2 cfg80211 raspberrypi_hwmon i2c_brcmstb videobuf2_common bluetooth videodev dwc2 udc_core roles vc4 crct10dif_ce ecdh_generic drm_kms_helper pwm_bcm2835 cec ecc snd_bcm2835(C) mc drm snd_soc_core xhci_pci xhci_pci_renesas ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd fb_sys_fops syscopyarea sysfillrect sysimgblt phy_generic uio_pdrv_genirq uio aes_neon_bs aes_neon_blk crypto_simd cryptd [ 38.073649] CPU: 1 PID: 1572 Comm: pulseaudio Tainted: G C 5.15.0-rc1-d4d016caa4b8 #1 [ 38.082913] Hardware name: Raspberry Pi 4 Model B (DT) [ 38.088119] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 38.095178] pc : vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4] [ 38.100683] lr : vc4_hdmi_audio_prepare+0x308/0xba4 [vc4] [ 38.106173] sp : ffff800013073a60 [ 38.109527] x29: ffff800013073a60 x28: ffff00004086b800 x27: 0000000000000000 [ 38.116767] x26: 0000000000000000 x25: 000000000000ac44 x24: 0000000021002003 [ 38.124004] x23: ffff800013073b50 x22: 0000000000000003 x21: ffff0000561b2080 [ 38.131242] x20: ffff0000561b24c8 x19: 0005833333380600 x18: 0000000000000000 [ 38.138479] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 38.145716] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000991 [ 38.152952] x11: 0000000000000001 x10: 000000000001d4c0 x9 : ffff800008f87708 [ 38.160188] x8 : 0000000000000031 x7 : 000000000001d4c0 x6 : 0000000000000030 [ 38.167424] x5 : ffff800013073aa8 x4 : ffff800008f9baa8 x3 : 0000000010624dd3 [ 38.174661] x2 : 00000000000003e8 x1 : 0000000000000000 x0 : 0000000000562200 [ 38.181899] Call trace: [ 38.184371] vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4] [ 38.189511] hdmi_codec_prepare+0xe8/0x1b0 [snd_soc_hdmi_codec] [ 38.195515] snd_soc_pcm_dai_prepare+0x5c/0x10c [snd_soc_core] [ 38.201476] soc_pcm_prepare+0x5c/0x130 [snd_soc_core] [ 38.206729] snd_pcm_prepare+0x150/0x1f0 [snd_pcm] [ 38.211608] snd_pcm_common_ioctl+0x1644/0x1d14 [snd_pcm] [ 38.217099] snd_pcm_ioctl+0x3c/0x5c [snd_pcm] [ 38.221620] __arm64_sys_ioctl+0xb4/0x100 [ 38.225687] invoke_syscall+0x50/0x120 [ 38.229486] el0_svc_common.constprop.0+0x180/0x1a0 [ 38.234431] do_el0_svc+0x34/0xa0 [ 38.237788] el0_svc+0x2c/0xc0 [ 38.240883] el0t_64_sync_handler+0xa4/0x12c [ 38.245208] el0t_64_sync+0x1a4/0x1a8 [ 38.248921] Code: 52807d02 72a20c43 f9400421 9ba37c13 (f941a423) [ 38.255098] ---[ end trace 925d8184702abf4d ]---
On Sat, Sep 18, 2021 at 3:48 PM Sudip Mukherjee <sudipm.mukherjee@gmail.com> wrote: > > Also, I have now tested by reverting those two commits and I still get > the same trace on rpi4. Ok. I'm afraid we really need to have the VC4 people figure it out - I count do the two reverts that are reported to fix the RPi3 issue, but it looks like the RPi4 pulseaudio issue needs something else. Any chance you could bisect that? Actually, looking at that first report of yours, the RPi4 sound problem apparently happened in the range 9e9fb7655ed5..7c636d4d20f8 and while that range does have a drm merge from Dave Airlie, it _also_ has a sound merge from Takashi and various ARM SoC updates from Arnd. But most (all?) of the changes in that range to the vc4 source code came from the drm tree. And Maxime in particular. I think. Linus
Sure, no problem. Here are the patch files that make it work for me: https://github.com/gokrazy/kernel/blob/d04d64114aae51aa27752adca6080ed4c9a0c70c/0001-Revert-drm-vc4-hdmi-Make-sure-the-controller-is-powe.patch https://github.com/gokrazy/kernel/blob/d04d64114aae51aa27752adca6080ed4c9a0c70c/0002-Revert-drm-vc4-hdmi-Move-the-HSM-clock-enable-to-run.patch On Sun, 19 Sept 2021 at 00:13, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, Sep 18, 2021 at 1:13 PM Michael Stapelberg > <michael@stapelberg.ch> wrote: > > > > > Michael - do things work if you revert those two (sadly, they don't > > > revert cleanly exactly _because_ of the other changes in the same > > > area)? > > > > Reverting only 9984d6664ce9 is not sufficient, but reverting both > > 9984d6664ce9 and 411efa18e4b0 does indeed make my Raspberry Pi 3 boot > > again! > > Since you did those reverts and fixed up the conflicts, would you mind > sending out the resulting patch so that maybe Sudip can test it too? > > Maybe the RPi4 hdmi audio issues are related to the RPi4 hdmi problems > despite the symptoms apparently being rather different.. > > Linus
Hi Linus, On Sun, Sep 19, 2021 at 12:06 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, Sep 18, 2021 at 3:48 PM Sudip Mukherjee > <sudipm.mukherjee@gmail.com> wrote: > > > > Also, I have now tested by reverting those two commits and I still get > > the same trace on rpi4. > > Ok. I'm afraid we really need to have the VC4 people figure it out - I > count do the two reverts that are reported to fix the RPi3 issue, but > it looks like the RPi4 pulseaudio issue needs something else. > > Any chance you could bisect that? Done, here is the bisect log: # bad: [7c636d4d20f8c5acfbfbc60f326fddb0e1cf5daa] Merge tag 'dt-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc # good: [9e9fb7655ed585da8f468e29221f0ba194a5f613] Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next git bisect start '7c636d4d20f8' '9e9fb7655ed5' '--' 'drivers/gpu/drm/vc4/' # good: [776efe800feda95a29cefecce1ce36cc27d70b29] drm/vc4: hdmi: Drop devm interrupt handler for hotplug interrupts git bisect good 776efe800feda95a29cefecce1ce36cc27d70b29 # bad: [588b3eee528873d73bf777f329d35b2e65e24777] Merge tag 'drm-misc-next-2021-07-16' of git://anongit.freedesktop.org/drm/drm-misc into drm-next git bisect bad 588b3eee528873d73bf777f329d35b2e65e24777 # bad: [27da370e0fb343a0baf308f503bb3e5dcdfe3362] drm/vc4: hdmi: Remove drm_encoder->crtc usage git bisect bad 27da370e0fb343a0baf308f503bb3e5dcdfe3362 # good: [44fe9f90eb9d2533d049b4ba09540eed6cad9f49] drm/vc4: hdmi: Only call into DRM framework if registered git bisect good 44fe9f90eb9d2533d049b4ba09540eed6cad9f49 # first bad commit: [27da370e0fb343a0baf308f503bb3e5dcdfe3362] drm/vc4: hdmi: Remove drm_encoder->crtc usage And indeed, reverting 27da370e0fb3 ("drm/vc4: hdmi: Remove drm_encoder->crtc usage") on top of d4d016caa4b8 ("alpha: move __udiv_qrnnd library function to arch/alpha/lib/") has fixed the error.
On Sun, Sep 19, 2021 at 4:05 AM Sudip Mukherjee <sudipm.mukherjee@gmail.com> wrote: > > And indeed, reverting 27da370e0fb3 ("drm/vc4: hdmi: Remove > drm_encoder->crtc usage") on top of d4d016caa4b8 ("alpha: move > __udiv_qrnnd library function to arch/alpha/lib/") > has fixed the error. Ok, since your pulseaudio problem was reported already over two weeks ago with no apparent movement, I've done that revert in my tree. I reverted the two runtime PM changes that cause problems for Michael too. Linus
Hi, On Sat, Sep 18, 2021 at 11:18:33AM +0200, Michael Stapelberg wrote: > torvalds at linux-foundation.org (Linus Torvalds) writes: > > Did I fix it up correctly? Who knows. The code makes more sense to me > > now and seems valid. But I really *really* want to stress how locking > > is important. > > As far as I can tell, this merge conflict resolution made my Raspberry > Pi 3 hang on boot. You can find the full serial console output at: > > https://t.zekjur.net/linux-5.14-raspberry-pi-3-hang-vc4.txt > > The last few messages are from vc4, then the boot hangs. > > Using git-bisect, I tracked this down to > https://github.com/torvalds/linux/commit/e058a84bfddc42ba356a2316f2cf1141974625c9, > which is the merge you’re talking about here, AFAICT. > > I also tried the git://anongit.freedesktop.org/drm/drm, and that tree > boots as expected, suggesting that the problem really is with the > additional changes. > > The code seems to work on my Raspberry Pi 4, just not on the Raspberry > Pi 3. Any ideas why that might be, and how to fix it? I assume you run your Pi without anything connected on HDMI, and without hdmi_force_hotplug in your config.txt? If so, can you test that branch, and let me know if it works for you https://github.com/mripard/linux/tree/rpi/bug-fixes Maxime
On Sun, Sep 19, 2021 at 10:19:35AM -0700, Linus Torvalds wrote: > On Sun, Sep 19, 2021 at 4:05 AM Sudip Mukherjee > <sudipm.mukherjee@gmail.com> wrote: > > > > And indeed, reverting 27da370e0fb3 ("drm/vc4: hdmi: Remove > > drm_encoder->crtc usage") on top of d4d016caa4b8 ("alpha: move > > __udiv_qrnnd library function to arch/alpha/lib/") > > has fixed the error. > > Ok, since your pulseaudio problem was reported already over two weeks > ago with no apparent movement, I've done that revert in my tree. > > I reverted the two runtime PM changes that cause problems for Michael too. I'm sorry, but I find that situation to be a bit ridiculous. In order to be properly addressed, Michael's issue needs some patches that have been unreviewed for 5 monthes now. However, if one reports an issue to you, without cc'ing the author, on a week-end, the revert is done in a single day. Even that audio bug only got a proper report yesterday, after you asked for it. Do you really expect us to work during the week end too? Maxime
On Mon, Sep 20, 2021 at 10:55:31AM +0200, Maxime Ripard wrote: > Hi, > > On Sat, Sep 18, 2021 at 11:18:33AM +0200, Michael Stapelberg wrote: > > torvalds at linux-foundation.org (Linus Torvalds) writes: > > > Did I fix it up correctly? Who knows. The code makes more sense to me > > > now and seems valid. But I really *really* want to stress how locking > > > is important. > > > > As far as I can tell, this merge conflict resolution made my Raspberry > > Pi 3 hang on boot. You can find the full serial console output at: > > > > https://t.zekjur.net/linux-5.14-raspberry-pi-3-hang-vc4.txt > > > > The last few messages are from vc4, then the boot hangs. > > > > Using git-bisect, I tracked this down to > > https://github.com/torvalds/linux/commit/e058a84bfddc42ba356a2316f2cf1141974625c9, > > which is the merge you’re talking about here, AFAICT. > > > > I also tried the git://anongit.freedesktop.org/drm/drm, and that tree > > boots as expected, suggesting that the problem really is with the > > additional changes. > > > > The code seems to work on my Raspberry Pi 4, just not on the Raspberry > > Pi 3. Any ideas why that might be, and how to fix it? > > I assume you run your Pi without anything connected on HDMI, and without > hdmi_force_hotplug in your config.txt? > > If so, can you test that branch, and let me know if it works for you > https://github.com/mripard/linux/tree/rpi/bug-fixes This breaks every one else, unfortunately. I'll try to come up with something. Maxime
On Mon, Sep 20, 2021 at 1:17 PM Maxime Ripard <maxime@cerno.tech> wrote: > > On Sun, Sep 19, 2021 at 10:19:35AM -0700, Linus Torvalds wrote: > > On Sun, Sep 19, 2021 at 4:05 AM Sudip Mukherjee > > <sudipm.mukherjee@gmail.com> wrote: > > > > > > And indeed, reverting 27da370e0fb3 ("drm/vc4: hdmi: Remove > > > drm_encoder->crtc usage") on top of d4d016caa4b8 ("alpha: move > > > __udiv_qrnnd library function to arch/alpha/lib/") > > > has fixed the error. > > > > Ok, since your pulseaudio problem was reported already over two weeks > > ago with no apparent movement, I've done that revert in my tree. > > > > I reverted the two runtime PM changes that cause problems for Michael too. > > I'm sorry, but I find that situation to be a bit ridiculous. In order to > be properly addressed, Michael's issue needs some patches that have been > unreviewed for 5 monthes now. > > However, if one reports an issue to you, without cc'ing the author, on a > week-end, the revert is done in a single day. > > Even that audio bug only got a proper report yesterday, after you asked > for it. Apologies if I have missed any of your mail, but iirc you have only asked me what tests I am running, but Linus asked if I can do a bisect. -- Regards Sudip
On Mon, Sep 20, 2021 at 5:17 AM Maxime Ripard <maxime@cerno.tech> wrote: > > I'm sorry, but I find that situation to be a bit ridiculous. Honestly, the original report about the pulseaudio problem came in over two weeks ago, and all you seemed to do was to ignore everything that Sudip said and reported. THAT is the ridiculous part. The rules for the kernel are very clear: regressions get fixed or they get reverted. And I saw absolutely _zero_ indication that you took that pulseaudio oops at all seriously. Linus
On Mon, Sep 20, 2021 at 10:10:57AM -0700, Linus Torvalds wrote: > On Mon, Sep 20, 2021 at 5:17 AM Maxime Ripard <maxime@cerno.tech> wrote: > > > > I'm sorry, but I find that situation to be a bit ridiculous. > > Honestly, the original report about the pulseaudio problem came in > over two weeks ago, and all you seemed to do was to ignore everything > that Sudip said and reported. > > THAT is the ridiculous part. > > The rules for the kernel are very clear: regressions get fixed or they > get reverted. And I saw absolutely _zero_ indication that you took > that pulseaudio oops at all seriously. I wasn't questioning the revert itself though, I can see how you took that decision for the audio patch. What I was interested in was more about the context itself, and I'd still like an answer on whether it's ok to wait for a review for 5 months though, or if it's an expectation from now on that we are supposed to fix bugs over the week-end. Maxime
On Mon, Sep 20, 2021 at 10:33 AM Maxime Ripard <maxime@cerno.tech> wrote: > > What I was interested in was more about the context itself, and I'd > still like an answer on whether it's ok to wait for a review for 5 > months though, or if it's an expectation from now on that we are > supposed to fix bugs over the week-end. Oh, it's definitely not "over a weekend". These reverts happened on a Sunday just because that's when I do rc releases, and this was one of those pending issues that had been around long enough that I went "ok, I'm reverting now since it's been bisected and verified". So it happened on a weekend, but that's pretty incidental. You should not wait for 5 months to send bug-fixes. That's not the point of review, and review shouldn't hold up reported regressions of existing code. That's just basic _testing_ - either the fix should be applied, or - if the fix is too invasive or too ugly - the problematic source of the regression should be reverted. Review should be about new code, it shouldn't be holding up "there's a bug report, here's the obvious fix". And for something like a NULL pointer dereference, there really should generally be an "obvious fix". Of course, a corollary to that "fixes are different from new development", though, is that bug fixes need to be kept separate from new code - just so that they _can_ be handled separately and so that you could have sent Sudip (and Michael, although that was apparently a very different bug, and the report came in later) a "can you test this fix" kind of thing. I don't know what the review issue on the vc4 drm side is, but I suspect that the vc4 people are just perhaps not as integrated with a lot of the other core drm people. Or maybe review of new features are held off because there are bug reports on the old code. Linus
On Mon, Sep 20, 2021 at 10:47:43AM -0700, Linus Torvalds wrote: > On Mon, Sep 20, 2021 at 10:33 AM Maxime Ripard <maxime@cerno.tech> wrote: > > > > What I was interested in was more about the context itself, and I'd > > still like an answer on whether it's ok to wait for a review for 5 > > months though, or if it's an expectation from now on that we are > > supposed to fix bugs over the week-end. > > Oh, it's definitely not "over a weekend". These reverts happened on a > Sunday just because that's when I do rc releases, and this was one of > those pending issues that had been around long enough that I went "ok, > I'm reverting now since it's been bisected and verified". > > So it happened on a weekend, but that's pretty incidental. Ok. > You should not wait for 5 months to send bug-fixes. That's not the > point of review, and review shouldn't hold up reported regressions of > existing code. That's just basic _testing_ - either the fix should be > applied, or - if the fix is too invasive or too ugly - the problematic > source of the regression should be reverted. > > Review should be about new code, it shouldn't be holding up "there's a > bug report, here's the obvious fix". > > And for something like a NULL pointer dereference, there really should > generally be an "obvious fix". > > Of course, a corollary to that "fixes are different from new > development", though, is that bug fixes need to be kept separate from > new code - just so that they _can_ be handled separately and so that > you could have sent Sudip (and Michael, although that was apparently a > very different bug, and the report came in later) a "can you test this > fix" kind of thing. I still don't have a way to reproduce Sudip's bug, so I can't even provide that. > I don't know what the review issue on the vc4 drm side is, but I > suspect that the vc4 people are just perhaps not as integrated with a > lot of the other core drm people. Or maybe review of new features are > held off because there are bug reports on the old code. It's not really about drm here, it's a dependency on the clock framework. Maxime