Message ID | 1444982077.2350.0.camel@giantmonkey.de (mailing list archive) |
---|---|
State | New, archived |
Dear Linux SCSI folks,


On Friday, 16.10.2015 at 09:54 +0200, Paul Menzel wrote:
> Package: linux-image-4.2.0-1-686-pae
> Version: 4.2.3-2
> Severity: important
>

please don’t include the address submit@bugs.debian.org in your reply.
This issue is now also tracked in the Debian Bug Tracking System [2] and
has the number #801925 [3]. Please keep that address in CC.

> On Friday, 16.10.2015 at 03:05 +0200, Paul Menzel wrote:
>
> > using Debian Sid/unstable with Linux 4.2.3-1 upgrading from systemd
> > 227-1 to 227-2 [1] and other packages, the system doesn’t start up
> > anymore and the /dev/md1 device doesn’t seem to be found and I am
> > dropped into a shell from initramfs (BusyBox).
> >
> > Only having wireless LAN and no serial or USB debug capabilities, and
> > since mounting a USB storage device did not work, I manually copied
> > the beginning of the Oops.
> >
> > ```
> > BUG: unable to handle kernel NULL pointer dereference at 00000014
> > IP: [<f828a00c>] sr_runtime_suspend+0xc/0x20 [sr_mod]
> > *pdpt = 000000003696e001 *pde = 000000000000000000
> > Oops: 0000 [#1] SMB
> > Modules linked in: sd_mod(+) sr_mod(+) cdrom ata_generic ohci_pci ahci libahci pata_amd firwire_ohci firewire_core crc_iti_t forcedeth libata scsi_mod ohci_hcd ehci_pci ehci_hcd usbcore usb_common fan thermal thermal_sys floppy(+)
> > CPU: 1 PID: 73 Comm: systemd-udevd Not tainted 4.2.0-1-686-pae #1 Debian 4.2.3-1
> > Hardware name: Packard Bell imedia S3210/WMCP78M, BIOs P01-B2 11/06/2009
> > task: f68dd040 ti: f6988000 task.ti: f6988000
> > EIP: 0060:[<fh28a00c>] EFLAGS: 00010246 CPU: 1
> > EIP is at sr_runtime_suspend+0xc/0x20 [sr_mod]
> > EAX: 00000000 EBX: f6a30cd8 ECX: f6c03d2c EDX: 00000000
> > ESI: 00000000 EDI: f828e100 EBP: f6989ba8 ESP: f6989b88
> > DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> > CR0: 8005003b CR2: 00000014 CR3: 3696d780 CR4: 000006f0
> > Stack:
> > af83346c3 00000000 00000001 fffffff5 f6a7d150 f6a30cd8 f6a30d3c 00000000
> > f6989bbc c1390cb7 f6a30cd8 f8334660 00000000 f6989bd0 c1390d0f f6a30cd8
> > f8334660 00000000 f6989c0c c13916cb f694a614 f68dd040 00000000 00000008
> > Call Trace:
> > […] ? scsi_runtime_suspend+0x63/0xa0 [scsi_mod]
> > […] ? __rpm_callback+0x27/0x60
> > […]
> > ```
> >
> > I tried also to boot with Linux 4.1 and it fails the same way.
> >
> > Is that a known problem, and has it been fixed in the meantime? It’d be
> > great if you helped me get the system to boot again. Please tell me
> > if you need more information to debug this issue and I’ll do my best to
> > get it.
>
> Ben Hutchings asked me to test the patch below to get more debug
> information.
>
> ```
> diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
> index 8bd54a6..dd5b5b2 100644
> --- a/drivers/scsi/sr.c
> +++ b/drivers/scsi/sr.c
> @@ -144,6 +144,12 @@ static int sr_runtime_suspend(struct device *dev)
>  {
>  	struct scsi_cd *cd = dev_get_drvdata(dev);
>  
> +	if (WARN_ON(!cd)) {
> +		pr_info("%s: cd == NULL; power.usage_count = %d\n",
> +			__func__, atomic_read(&dev->power.usage_count));
> +		return 0;
> +	}
> +
>  	if (cd->media_present)
>  		return -EBUSY;
>  	else
> @@ -652,7 +658,13 @@ static int sr_probe(struct device *dev)
>  	struct scsi_cd *cd;
>  	int minor, error;
>  
> -	scsi_autopm_get_device(sdev);
> +	error = scsi_autopm_get_device(sdev);
> +	if (error) {
> +		pr_err("%s: scsi_autopm_get_device returned %d\n",
> +		       __func__, error);
> +		return error;
> +	}
> +
>  	error = -ENODEV;
>  	if (sdev->type != TYPE_ROM && sdev->type != TYPE_WORM)
>  		goto fail;
> @@ -719,6 +731,9 @@ static int sr_probe(struct device *dev)
>  	if (register_cdrom(&cd->cdi))
>  		goto fail_put;
>  
> +	pr_info("%s: power.usage_count = %d\n",
> +		__func__, atomic_read(&dev->power.usage_count));
> +
>  	/*
>  	 * Initialize block layer runtime PM stuffs before the
>  	 * periodic event checking request gets started in add_disk.
> ```
>
> I’ll try that as soon as a spare drive has arrived to which I can copy
> the data as a backup.
>
> More thoughts are welcome! Especially whether that error suggests a
> failing drive or not.


Thanks,

Paul


> > [1] http://metadata.ftp-master.debian.org/changelogs//main/s/systemd/systemd_227-2_changelog
[2] https://www.debian.org/Bugs/
[3] https://bugs.debian.org/801925
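A note on the fault address in the oops above: the `BUG:` line and `CR2` both show `00000014`, and `EAX` is zero. One plausible reading is that a structure member at byte offset 0x14 was read through a NULL pointer, which is the hypothesis the `WARN_ON(!cd)` check in the debug patch tests. The userspace sketch below only illustrates that arithmetic. `struct fake_cd` and `field_address()` are hypothetical names, and the layout is invented so that the flag lands at 0x14; it is not claimed to be the kernel's `struct scsi_cd`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, invented for illustration: five 32-bit members
 * come first, so media_present sits at byte offset 0x14.
 */
struct fake_cd {
	uint32_t other[5];       /* offsets 0x00 .. 0x13 */
	uint32_t media_present;  /* offset 0x14 */
};

/* Checked at compile time, so the offsets in the comments stay honest. */
static_assert(offsetof(struct fake_cd, media_present) == 0x14,
	      "media_present must sit at offset 0x14 in this sketch");

/*
 * Address the CPU would access for base->media_present, computed with
 * plain integer arithmetic; nothing is dereferenced here.
 */
static uintptr_t field_address(uintptr_t base)
{
	return base + offsetof(struct fake_cd, media_present);
}
```

With a base of 0, `field_address(0)` evaluates to `0x14`, which is the pattern the oops would show if the driver data pointer was NULL when the member was read.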
On Fri, 2015-10-16 at 09:54 +0200, Paul Menzel wrote:
[...]
> > BUG: unable to handle kernel NULL pointer dereference at 00000014
> > IP: [<f828a00c>] sr_runtime_suspend+0xc/0x20 [sr_mod]
> > *pdpt = 000000003696e001 *pde = 000000000000000000
> > Oops: 0000 [#1] SMB
> > Modules linked in: sd_mod(+) sr_mod(+) cdrom ata_generic ohci_pci ahci libahci pata_amd firwire_ohci firewire_core crc_iti_t forcedeth libata scsi_mod ohci_hcd ehci_pci ehci_hcd usbcore usb_common fan thermal thermal_sys floppy(+)
> > CPU: 1 PID: 73 Comm: systemd-udevd Not tainted 4.2.0-1-686-pae #1 Debian 4.2.3-1
> > Hardware name: Packard Bell imedia S3210/WMCP78M, BIOs P01-B2 11/06/2009
> > task: f68dd040 ti: f6988000 task.ti: f6988000
> > EIP: 0060:[<fh28a00c>] EFLAGS: 00010246 CPU: 1
> > EIP is at sr_runtime_suspend+0xc/0x20 [sr_mod]
> > EAX: 00000000 EBX: f6a30cd8 ECX: f6c03d2c EDX: 00000000
> > ESI: 00000000 EDI: f828e100 EBP: f6989ba8 ESP: f6989b88
> > DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> > CR0: 8005003b CR2: 00000014 CR3: 3696d780 CR4: 000006f0
> > Stack:
> > af83346c3 00000000 00000001 fffffff5 f6a7d150 f6a30cd8 f6a30d3c 00000000
> > f6989bbc c1390cb7 f6a30cd8 f8334660 00000000 f6989bd0 c1390d0f f6a30cd8
> > f8334660 00000000 f6989c0c c13916cb f694a614 f68dd040 00000000 00000008
> > Call Trace:
> > […] ? scsi_runtime_suspend+0x63/0xa0 [scsi_mod]
> > […] ? __rpm_callback+0x27/0x60
> > […]
[...]
> Ben Hutchings asked me to test the patch below to get more debug
> information.
[...]

Well, that didn't help much. Paul hit another oops, this time in
sd_mod but again apparently related to runtime PM. My patch only
touched sr_mod.

This time he sent photos of the complete oops; see
<https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=801925;filename=20151020_005.jpg;att=4;msg=15>
and
<https://bugs.debian.org/cgi-bin/bugreport.cgi?filename=20151020_006.jpg;bug=801925;att=3;msg=15>

Ben.
Control: notfound -1 3.19-1~exp1
Control: found -1 4.2.5-1


On Tuesday, 20.10.2015 at 02:39 +0100, Ben Hutchings wrote:
> On Fri, 2015-10-16 at 09:54 +0200, Paul Menzel wrote:
> [...]
> > > BUG: unable to handle kernel NULL pointer dereference at 00000014
> > > IP: [<f828a00c>] sr_runtime_suspend+0xc/0x20 [sr_mod]
> > > *pdpt = 000000003696e001 *pde = 000000000000000000
> > > Oops: 0000 [#1] SMB
> > > Modules linked in: sd_mod(+) sr_mod(+) cdrom ata_generic ohci_pci ahci libahci pata_amd firwire_ohci firewire_core crc_iti_t forcedeth libata scsi_mod ohci_hcd ehci_pci ehci_hcd usbcore usb_common fan thermal thermal_sys floppy(+)
> > > CPU: 1 PID: 73 Comm: systemd-udevd Not tainted 4.2.0-1-686-pae #1 Debian 4.2.3-1
> > > Hardware name: Packard Bell imedia S3210/WMCP78M, BIOs P01-B2 11/06/2009
> > > task: f68dd040 ti: f6988000 task.ti: f6988000
> > > EIP: 0060:[<fh28a00c>] EFLAGS: 00010246 CPU: 1
> > > EIP is at sr_runtime_suspend+0xc/0x20 [sr_mod]
> > > EAX: 00000000 EBX: f6a30cd8 ECX: f6c03d2c EDX: 00000000
> > > ESI: 00000000 EDI: f828e100 EBP: f6989ba8 ESP: f6989b88
> > > DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> > > CR0: 8005003b CR2: 00000014 CR3: 3696d780 CR4: 000006f0
> > > Stack:
> > > af83346c3 00000000 00000001 fffffff5 f6a7d150 f6a30cd8 f6a30d3c 00000000
> > > f6989bbc c1390cb7 f6a30cd8 f8334660 00000000 f6989bd0 c1390d0f f6a30cd8
> > > f8334660 00000000 f6989c0c c13916cb f694a614 f68dd040 00000000 00000008
> > > Call Trace:
> > > […] ? scsi_runtime_suspend+0x63/0xa0 [scsi_mod]
> > > […] ? __rpm_callback+0x27/0x60
> > > […]
> [...]
> > Ben Hutchings asked me to test the patch below to get more debug
> > information.
> [...]
>
> Well, that didn't help much. Paul hit another oops, this time in
> sd_mod but again apparently related to runtime PM. My patch only
> touched sr_mod.
>
> This time he sent photos of the complete oops; see
> <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=801925;filename=20151020_005.jpg;att=4;msg=15>
> and
> <https://bugs.debian.org/cgi-bin/bugreport.cgi?filename=20151020_006.jpg;bug=801925;att=3;msg=15>

After backing up my data, I tested a little bit more, and using Linux
3.19 the drive is detected and the system boots.

Does anything stand out that changed in this area between Linux 3.19 and
4.1?


Thanks

Paul
On Sat, 31 Oct 2015, Paul Menzel wrote:

> > Well, that didn't help much. Paul hit another oops, this time in
> > sd_mod but again apparently related to runtime PM. My patch only
> > touched sr_mod.
> >
> > This time he sent photos of the complete oops; see
> > <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=801925;filename=20151020_005.jpg;att=4;msg=15>
> > and
> > <https://bugs.debian.org/cgi-bin/bugreport.cgi?filename=20151020_006.jpg;bug=801925;att=3;msg=15>
>
> After backing up my data, I tested a little bit more, and using Linux
> 3.19 the drive is detected and the system boots.
>
> Does anything stand out that changed in this area between Linux 3.19 and
> 4.1?

I believe the problem shown in that photo was fixed by commit
49718f0fb8c9 ("SCSI: Fix NULL pointer dereference in runtime PM"),
which was merged in 4.2 and has been back-ported to various stable
releases.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, 31 Oct 2015, Alan Stern wrote:

> I believe the problem shown in that photo was fixed by commit
> 49718f0fb8c9 ("SCSI: Fix NULL pointer dereference in runtime PM"),
> which was merged in 4.2 and has been back-ported to various stable
> releases.

On second thought, it seems more likely that this issue was
_caused_ by that commit. The fix can be found in these two emails:

	http://marc.info/?l=linux-scsi&m=144185206825609&w=2
	http://marc.info/?l=linux-scsi&m=144185208525611&w=2

which have not been merged yet as far as I know even though they were
submitted back in September.

Alan Stern
Version: 4.4~rc8-1~exp1

Dear Alan,


Thank you for your help!

There were some follow-ups to the bug report [1], but I think you and I
were not in CC.


On Saturday, 31.10.2015 at 22:05 -0400, Alan Stern wrote:
> On Sat, 31 Oct 2015, Alan Stern wrote:
>
> > I believe the problem shown in that photo was fixed by commit
> > 49718f0fb8c9 ("SCSI: Fix NULL pointer dereference in runtime PM"),
> > which was merged in 4.2 and has been back-ported to various stable
> > releases.
>
> On second thought, it seems more likely that this issue was
> _caused_ by that commit. The fix can be found in these two emails:
>
> 	http://marc.info/?l=linux-scsi&m=144185206825609&w=2
> 	http://marc.info/?l=linux-scsi&m=144185208525611&w=2
>
> which have not been merged yet as far as I know even though they were
> submitted back in September.

I can only say that I am still unable to boot my system with Linux
4.4-rc8 [2]. Are these patches included there?


Thanks,

Paul


[1] https://bugs.debian.org/801925
[2] https://packages.debian.org/experimental/linux-image-4.4.0-rc8-686-pae-dbg
On Sat, 9 Jan 2016, Paul Menzel wrote:

> Version: 4.4~rc8-1~exp1
>
> Dear Alan,
>
>
> Thank you for your help!
>
> There were some follow-ups to the bug report [1], but I think you and I
> were not in CC.

I wasn't.

> > 	http://marc.info/?l=linux-scsi&m=144185206825609&w=2
> > 	http://marc.info/?l=linux-scsi&m=144185208525611&w=2

> I can only say that I am still unable to boot my system with Linux
> 4.4-rc8 [2]. Are these patches included there?

They are. I don't see how they could cause a NULL pointer dereference
in sd_resume(), though. If you revert them, does the problem go away?

Also, can you add some debugging statements to sd_resume() so we can
see where the NULL pointer comes from?

Alan Stern
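The request above, to find out where the NULL pointer comes from, amounts to checking each link in a pointer chain before following it and logging the first one that is NULL. The following is a minimal userspace sketch of that pattern, not kernel code: `struct model_device`, `struct model_disk`, `struct model_scsi_device`, `enum null_link` and `find_null_link()` are all hypothetical stand-ins, and the chain shown is an assumption about what such a check might walk. In the kernel one would presumably use `pr_info()` or `WARN_ON()` as in the sr.c debug patch from this thread.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are not the kernel's structures. */
struct model_scsi_device { int id; };
struct model_disk { struct model_scsi_device *device; };
struct model_device { void *driver_data; };

/* Which link of the chain turned out to be NULL, if any. */
enum null_link { LINK_OK = 0, LINK_DEV, LINK_DRVDATA, LINK_SDEV };

/* Walk the chain one link at a time and report the first NULL link. */
static enum null_link find_null_link(const struct model_device *dev)
{
	const struct model_disk *disk;

	if (!dev) {
		fprintf(stderr, "%s: dev == NULL\n", __func__);
		return LINK_DEV;
	}

	disk = dev->driver_data;
	if (!disk) {
		fprintf(stderr, "%s: driver data == NULL\n", __func__);
		return LINK_DRVDATA;
	}

	if (!disk->device) {
		fprintf(stderr, "%s: disk->device == NULL\n", __func__);
		return LINK_SDEV;
	}

	return LINK_OK;
}
```

Returning a distinct code per link, instead of a single error value, is a deliberate choice here: a photographed console then shows which pointer was missing even if the rest of the log scrolls away.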
Hi all,
4.4-rc8 does not fix the problem for me.
Anything beyond 4.1.0 remains unable to boot this computer.

Unfortunately, because the error occurs during early SCSI
initialization, I do not have easy access to the log - no disk, no
network.
It happens during SATA initialization: "scsi_runtime_resume".

So my back trace looks different from Alex's in
https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=42;filename=scsi-null-pointer-dereference.log;bug=801925;att=1
but like the one Paul is seeing:
https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=15;filename=20151020_006.jpg;bug=801925;att=3

I will try to take a photo next time, too.

Here is some dmesg output from a successful boot on 4.1.0:
Note there are some ACPI Errors there (but probably not related).
---
ahci 0000:00:1f.2: version 3.0
ahci 0000:00:1f.2: SSS flag set, parallel bus scan disabled
ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 3 Gbps 0x1 impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq sntf stag pm led clo pio slum part ems apst
scsi host0: ahci
scsi host1: ahci
scsi host2: ahci
scsi host3: ahci
scsi host4: ahci
scsi host5: ahci
ata1: SATA max UDMA/133 abar m2048@0xc0728000 port 0xc0728100 irq 30
ata2: DUMMY
ata3: DUMMY
ata4: DUMMY
ata5: DUMMY
ata6: DUMMY
usb 3-1: new high-speed USB device number 2 using ehci-pci
usb 4-1: new high-speed USB device number 2 using ehci-pci
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ACPI Error: [GTF0] Namespace lookup failure, AE_NOT_FOUND (20150410/psargs-359)
ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.PRT0._SDD] (Node ffff8802458b1608), AE_NOT_FOUND (20150410/psparse-536)
ACPI Error: [GTF0] Namespace lookup failure, AE_NOT_FOUND (20150410/psargs-359)
ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.PRT0._GTF] (Node ffff8802458b15e0), AE_NOT_FOUND (20150410/psparse-536)
ata1.00: ATA-8: TOSHIBA THNSNS256GMCP, TA2ABBF0, max UDMA/133
ata1.00: 500118192 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
ACPI Error: [GTF0] Namespace lookup failure, AE_NOT_FOUND (20150410/psargs-359)
ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.PRT0._SDD] (Node ffff8802458b1608), AE_NOT_FOUND (20150410/psparse-536)
ACPI Error: [GTF0] Namespace lookup failure, AE_NOT_FOUND (20150410/psargs-359)
ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.PRT0._GTF] (Node ffff8802458b15e0), AE_NOT_FOUND (20150410/psparse-536)
ata1.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access ATA TOSHIBA THNSNS25 BBF0 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 500118192 512-byte logical blocks: (256 GB/238 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
sd 0:0:0:0: [sda] Attached SCSI disk
PM: Starting manual resume from disk
PM: Hibernation image partition 8:6 present
PM: Looking for hibernation image.
PM: Image not found (code -22)
PM: Hibernation image not present or could not be loaded.
---

On Sat, Jan 9, 2016 at 5:36 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sat, 9 Jan 2016, Paul Menzel wrote:
>
>> Version: 4.4~rc8-1~exp1
>>
>> Dear Alan,
>>
>>
>> Thank you for your help!
>>
>> There were some follow-ups to the bug report [1], but I think you and I
>> were not in CC.
>
> I wasn't.
>
>> > http://marc.info/?l=linux-scsi&m=144185206825609&w=2
>> > http://marc.info/?l=linux-scsi&m=144185208525611&w=2
>
>> I can only say that I am still unable to boot my system with Linux
>> 4.4-rc8 [2]. Are these patches included there?
>
> They are. I don't see how they could cause a NULL pointer dereference
> in sd_resume(), though. If you revert them, does the problem go away?
>
> Also, can you add some debugging statements to sd_resume() so we can
> see where the NULL pointer comes from?
>
> Alan Stern
On Sun, 10 Jan 2016, Erich Schubert wrote:

> Hi all,
> 4.4-rc8 does not fix the problem for me.
> Anything beyond 4.1.0 remains unable to boot this computer.
>
> Unfortunately, because the error occurs during early SCSI
> initialization, I do not have easy access to the log - no disk, no
> network.
> It happens during SATA initialization: "scsi_runtime_resume".

You didn't include any debugging information. However...

> So my back trace looks different from Alex's in
> https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=42;filename=scsi-null-pointer-dereference.log;bug=801925;att=1
> but like the one Paul is seeing:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=15;filename=20151020_006.jpg;bug=801925;att=3

The information in that bug report says that the failure happens in
sr_runtime_resume, not in scsi_runtime_resume. Compare with the
Subject: line in this email thread.

> I will try to take a photo next time, too.

If I send you a patch, can you build and test it?

Alan Stern
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 8bd54a6..dd5b5b2 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -144,6 +144,12 @@ static int sr_runtime_suspend(struct device *dev)
 {
 	struct scsi_cd *cd = dev_get_drvdata(dev);
 
+	if (WARN_ON(!cd)) {
+		pr_info("%s: cd == NULL; power.usage_count = %d\n",
+			__func__, atomic_read(&dev->power.usage_count));
+		return 0;
+	}
+
 	if (cd->media_present)
 		return -EBUSY;
 	else
@@ -652,7 +658,13 @@ static int sr_probe(struct device *dev)
 	struct scsi_cd *cd;
 	int minor, error;
 
-	scsi_autopm_get_device(sdev);
+	error = scsi_autopm_get_device(sdev);
+	if (error) {
+		pr_err("%s: scsi_autopm_get_device returned %d\n",
+		       __func__, error);
+		return error;
+	}
+
 	error = -ENODEV;
 	if (sdev->type != TYPE_ROM && sdev->type != TYPE_WORM)
 		goto fail;
@@ -719,6 +731,9 @@ static int sr_probe(struct device *dev)
 	if (register_cdrom(&cd->cdi))
 		goto fail_put;
 
+	pr_info("%s: power.usage_count = %d\n",
+		__func__, atomic_read(&dev->power.usage_count));
+
 	/*
 	 * Initialize block layer runtime PM stuffs before the
 	 * periodic event checking request gets started in add_disk.
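The debug patch prints `power.usage_count` both in `sr_probe()` and in `sr_runtime_suspend()`. As I read it, the point is to see whether the suspend callback runs while probe still holds a runtime-PM reference, since a device with a non-zero usage count is not expected to be runtime-suspended. The sketch below is a simplified userspace model of that counting rule only. `struct model_pm`, `model_pm_get()`, `model_pm_put()` and `model_pm_may_suspend()` are hypothetical names, and the real runtime-PM core applies further conditions that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a per-device runtime-PM usage counter. */
struct model_pm {
	int usage_count;
};

/* Take a reference: the device should stay active while it is held. */
static void model_pm_get(struct model_pm *pm)
{
	pm->usage_count++;
}

/* Drop a reference previously taken with model_pm_get(). */
static void model_pm_put(struct model_pm *pm)
{
	assert(pm->usage_count > 0); /* an unbalanced put would be a bug */
	pm->usage_count--;
}

/* In this model, suspend is allowed only when nobody holds a reference. */
static bool model_pm_may_suspend(const struct model_pm *pm)
{
	return pm->usage_count == 0;
}
```

In this model, a suspend callback that fires between `model_pm_get()` and the matching `model_pm_put()` would indicate a counting bug, which is the kind of anomaly the added `pr_info()` lines would make visible.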