Message ID | 20230926081507.69346-1-dlemoal@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | Fix libata suspend/resume handling and code cleanup | expand |
Hi Damien, (oops, found a two-day old email still in draft) On Tue, Sep 26, 2023 at 10:15 AM Damien Le Moal <dlemoal@kernel.org> wrote: > The first 9 patches of this series fix several issues with suspend/resume > power management operations in scsi and libata. The most significant > changes introduced are in patch 4 and 5, where the manage_start_stop > flag of scsi devices is split into the manage_system_start_stop and > manage_runtime_start_stop flags to allow keeping scsi runtime power > operations for spining up/down ATA devices but have libata do its own > system suspend/resume device power state management using EH. > > The remaining patches are code cleanup that do not introduce any > significant functional change. > > This series was tested on qemu and on various PCs and servers. I am > CC-ing people who recently reported issues with suspend/resume. > Additional testing would be much appreciated. JFTR, with current libata/for-next[*], I saw the following with rcar-sata, once (interesting lines marked with "!"): PM: suspend entry (s2idle) Filesystems sync: 0.026 seconds Freezing user space processes ! ata1.00: qc timeout after 10000 msecs (cmd 0x40) Freezing user space processes completed (elapsed 0.007 seconds) ! ata1.00: VERIFY failed (err_mask=0x4) OOM killer disabled. ! ata1.00: failed to IDENTIFY (I/O error, err_mask=0x40) Freezing remaining freezable tasks ! ata1.00: revalidation failed (errno=-5) Freezing remaining freezable tasks completed (elapsed 0.002 seconds) sd 0:0:0:0: [sda] Synchronizing SCSI cache ata1: link resume succeeded after 1 retries ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata1.00: configured for UDMA/133 ata1.00: Entering active power mode ata1.00: Entering standby power mode ravb e6800000.ethernet eth0: Link is Down Micrel KSZ9031 Gigabit PHY e6800000.ethernet-ffffffff:00: attached PHY driver (mii_bus:phy_addr=e6800000.ethernet-ffffffff:00, irq=136) OOM killer enabled. Restarting tasks ... done. random: crng reseeded on system resumption PM: suspend exit ata1: link resume succeeded after 1 retries ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata1.00: Entering active power mode ata1.00: configured for UDMA/133 ravb e6800000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off Regardless, the disk worked fine after resume. Note that I saw this only once. [*] Commit f940258b63da95f4 ("ata: libata: Cleanup inline DMA helper functions"). No idea which version of this series that corresponds to, adding Link:-tags to your commits would make it easier to track that. https://www.kernel.org/doc/html/latest/maintainer/configure-git.html#creating-commit-links-to-lore-kernel-org Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
On 2023/09/28 14:26, Geert Uytterhoeven wrote: > Hi Damien, > > (oops, found a two-day old email still in draft) > > On Tue, Sep 26, 2023 at 10:15 AM Damien Le Moal <dlemoal@kernel.org> wrote: >> The first 9 patches of this series fix several issues with suspend/resume >> power management operations in scsi and libata. The most significant >> changes introduced are in patch 4 and 5, where the manage_start_stop >> flag of scsi devices is split into the manage_system_start_stop and >> manage_runtime_start_stop flags to allow keeping scsi runtime power >> operations for spining up/down ATA devices but have libata do its own >> system suspend/resume device power state management using EH. >> >> The remaining patches are code cleanup that do not introduce any >> significant functional change. >> >> This series was tested on qemu and on various PCs and servers. I am >> CC-ing people who recently reported issues with suspend/resume. >> Additional testing would be much appreciated. > > JFTR, with current libata/for-next[*], I saw the following with > rcar-sata, once (interesting lines marked with "!"): > > PM: suspend entry (s2idle) > Filesystems sync: 0.026 seconds > Freezing user space processes > ! ata1.00: qc timeout after 10000 msecs (cmd 0x40) > Freezing user space processes completed (elapsed 0.007 seconds) > ! ata1.00: VERIFY failed (err_mask=0x4) > OOM killer disabled. > ! ata1.00: failed to IDENTIFY (I/O error, err_mask=0x40) > Freezing remaining freezable tasks > ! ata1.00: revalidation failed (errno=-5) > Freezing remaining freezable tasks completed (elapsed 0.002 seconds) > sd 0:0:0:0: [sda] Synchronizing SCSI cache > ata1: link resume succeeded after 1 retries > ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > ata1.00: configured for UDMA/133 > ata1.00: Entering active power mode > ata1.00: Entering standby power mode > ravb e6800000.ethernet eth0: Link is Down > Micrel KSZ9031 Gigabit PHY e6800000.ethernet-ffffffff:00: attached > PHY driver (mii_bus:phy_addr=e6800000.ethernet-ffffffff:00, irq=136) > OOM killer enabled. > Restarting tasks ... done. > random: crng reseeded on system resumption > PM: suspend exit > ata1: link resume succeeded after 1 retries > ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > ata1.00: Entering active power mode > ata1.00: configured for UDMA/133 > ravb e6800000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off > > Regardless, the disk worked fine after resume. Looks like ata-EH was still doing a resume while the suspend was on-going... Hmm.. Will need to look at that. Regardless, I am going to send the fixes for rc4 given that they vastly improve things as they are. But there may still be some corner cases that are not handled. Thanks for testing. > > Note that I saw this only once. > > [*] Commit f940258b63da95f4 ("ata: libata: Cleanup inline DMA helper > functions"). > No idea which version of this series that corresponds to, adding > Link:-tags to your commits would make it easier to track that. > https://www.kernel.org/doc/html/latest/maintainer/configure-git.html#creating-commit-links-to-lore-kernel-org > > Gr{oetje,eeting}s, > > Geert > > > -- > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org > > In personal conversations with technical people, I call myself a hacker. But > when I'm talking to journalists I just say "programmer" or something like that. > -- Linus Torvalds
On 2023/09/28 14:26, Geert Uytterhoeven wrote: > Hi Damien, > > (oops, found a two-day old email still in draft) > > On Tue, Sep 26, 2023 at 10:15 AM Damien Le Moal <dlemoal@kernel.org> wrote: >> The first 9 patches of this series fix several issues with suspend/resume >> power management operations in scsi and libata. The most significant >> changes introduced are in patch 4 and 5, where the manage_start_stop >> flag of scsi devices is split into the manage_system_start_stop and >> manage_runtime_start_stop flags to allow keeping scsi runtime power >> operations for spining up/down ATA devices but have libata do its own >> system suspend/resume device power state management using EH. >> >> The remaining patches are code cleanup that do not introduce any >> significant functional change. >> >> This series was tested on qemu and on various PCs and servers. I am >> CC-ing people who recently reported issues with suspend/resume. >> Additional testing would be much appreciated. > > JFTR, with current libata/for-next[*], I saw the following with > rcar-sata, once (interesting lines marked with "!"): > > PM: suspend entry (s2idle) > Filesystems sync: 0.026 seconds > Freezing user space processes > ! ata1.00: qc timeout after 10000 msecs (cmd 0x40) > Freezing user space processes completed (elapsed 0.007 seconds) > ! ata1.00: VERIFY failed (err_mask=0x4) > OOM killer disabled. > ! ata1.00: failed to IDENTIFY (I/O error, err_mask=0x40) > Freezing remaining freezable tasks > ! ata1.00: revalidation failed (errno=-5) > Freezing remaining freezable tasks completed (elapsed 0.002 seconds) > sd 0:0:0:0: [sda] Synchronizing SCSI cache > ata1: link resume succeeded after 1 retries > ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > ata1.00: configured for UDMA/133 > ata1.00: Entering active power mode > ata1.00: Entering standby power mode > ravb e6800000.ethernet eth0: Link is Down > Micrel KSZ9031 Gigabit PHY e6800000.ethernet-ffffffff:00: attached > PHY driver (mii_bus:phy_addr=e6800000.ethernet-ffffffff:00, irq=136) > OOM killer enabled. > Restarting tasks ... done. > random: crng reseeded on system resumption > PM: suspend exit > ata1: link resume succeeded after 1 retries > ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > ata1.00: Entering active power mode > ata1.00: configured for UDMA/133 > ravb e6800000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off > > Regardless, the disk worked fine after resume. > > Note that I saw this only once. I think I found the reason for this, but to confirm, were you doing a suspend right after resuming the system ? If yes, that I think I exactly understand the issue and why you saw it only once (it is a subtle race with scheduling libata-EH suspend/resume operations). I will send a fix next week.
Hi Damien, On Fri, Sep 29, 2023 at 3:37 PM Damien Le Moal <dlemoal@kernel.org> wrote: > On 2023/09/28 14:26, Geert Uytterhoeven wrote: > > On Tue, Sep 26, 2023 at 10:15 AM Damien Le Moal <dlemoal@kernel.org> wrote: > >> The first 9 patches of this series fix several issues with suspend/resume > >> power management operations in scsi and libata. The most significant > >> changes introduced are in patch 4 and 5, where the manage_start_stop > >> flag of scsi devices is split into the manage_system_start_stop and > >> manage_runtime_start_stop flags to allow keeping scsi runtime power > >> operations for spining up/down ATA devices but have libata do its own > >> system suspend/resume device power state management using EH. > >> > >> The remaining patches are code cleanup that do not introduce any > >> significant functional change. > >> > >> This series was tested on qemu and on various PCs and servers. I am > >> CC-ing people who recently reported issues with suspend/resume. > >> Additional testing would be much appreciated. > > > > JFTR, with current libata/for-next[*], I saw the following with > > rcar-sata, once (interesting lines marked with "!"): > > > > PM: suspend entry (s2idle) > > Filesystems sync: 0.026 seconds > > Freezing user space processes > > ! ata1.00: qc timeout after 10000 msecs (cmd 0x40) > > Freezing user space processes completed (elapsed 0.007 seconds) > > ! ata1.00: VERIFY failed (err_mask=0x4) > > OOM killer disabled. > > ! ata1.00: failed to IDENTIFY (I/O error, err_mask=0x40) > > Freezing remaining freezable tasks > > ! ata1.00: revalidation failed (errno=-5) > > Freezing remaining freezable tasks completed (elapsed 0.002 seconds) > > sd 0:0:0:0: [sda] Synchronizing SCSI cache > > ata1: link resume succeeded after 1 retries > > ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > > ata1.00: configured for UDMA/133 > > ata1.00: Entering active power mode > > ata1.00: Entering standby power mode > > ravb e6800000.ethernet eth0: Link is Down > > Micrel KSZ9031 Gigabit PHY e6800000.ethernet-ffffffff:00: attached > > PHY driver (mii_bus:phy_addr=e6800000.ethernet-ffffffff:00, irq=136) > > OOM killer enabled. > > Restarting tasks ... done. > > random: crng reseeded on system resumption > > PM: suspend exit > > ata1: link resume succeeded after 1 retries > > ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > > ata1.00: Entering active power mode > > ata1.00: configured for UDMA/133 > > ravb e6800000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off > > > > Regardless, the disk worked fine after resume. > > > > Note that I saw this only once. > > I think I found the reason for this, but to confirm, were you doing a suspend > right after resuming the system ? If yes, that I think I exactly understand the > issue and why you saw it only once (it is a subtle race with scheduling > libata-EH suspend/resume operations). I will send a fix next week. Now you ask that, yes there was a system suspend before. Relevant log with timing info: [ 130.177616] PM: suspend exit [ 130.257981] ata1: link resume succeeded after 1 retries [ 130.376714] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) [ 130.388525] ata1.00: Entering active power mode so the drive should have been ready here. [ 140.452669] PM: suspend entry (s2idle) [ 140.488313] Filesystems sync: 0.026 seconds [ 140.515957] Freezing user space processes [ 140.518209] ata1.00: qc timeout after 10000 msecs (cmd 0x40) [ 140.523384] Freezing user space processes completed (elapsed 0.007 seconds) [ 140.527718] ata1.00: VERIFY failed (err_mask=0x4) [ 140.532541] OOM killer disabled. [ 140.537270] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x40) [ 140.542069] Freezing remaining freezable tasks [ 140.546784] ata1.00: revalidation failed (errno=-5) Gr{oetje,eeting}s, Geert
On 2023/09/29 15:56, Geert Uytterhoeven wrote: > Hi Damien, > > On Fri, Sep 29, 2023 at 3:37 PM Damien Le Moal <dlemoal@kernel.org> wrote: >> On 2023/09/28 14:26, Geert Uytterhoeven wrote: >>> On Tue, Sep 26, 2023 at 10:15 AM Damien Le Moal <dlemoal@kernel.org> wrote: >>>> The first 9 patches of this series fix several issues with suspend/resume >>>> power management operations in scsi and libata. The most significant >>>> changes introduced are in patch 4 and 5, where the manage_start_stop >>>> flag of scsi devices is split into the manage_system_start_stop and >>>> manage_runtime_start_stop flags to allow keeping scsi runtime power >>>> operations for spining up/down ATA devices but have libata do its own >>>> system suspend/resume device power state management using EH. >>>> >>>> The remaining patches are code cleanup that do not introduce any >>>> significant functional change. >>>> >>>> This series was tested on qemu and on various PCs and servers. I am >>>> CC-ing people who recently reported issues with suspend/resume. >>>> Additional testing would be much appreciated. >>> >>> JFTR, with current libata/for-next[*], I saw the following with >>> rcar-sata, once (interesting lines marked with "!"): >>> >>> PM: suspend entry (s2idle) >>> Filesystems sync: 0.026 seconds >>> Freezing user space processes >>> ! ata1.00: qc timeout after 10000 msecs (cmd 0x40) >>> Freezing user space processes completed (elapsed 0.007 seconds) >>> ! ata1.00: VERIFY failed (err_mask=0x4) >>> OOM killer disabled. >>> ! ata1.00: failed to IDENTIFY (I/O error, err_mask=0x40) >>> Freezing remaining freezable tasks >>> ! ata1.00: revalidation failed (errno=-5) >>> Freezing remaining freezable tasks completed (elapsed 0.002 seconds) >>> sd 0:0:0:0: [sda] Synchronizing SCSI cache >>> ata1: link resume succeeded after 1 retries >>> ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) >>> ata1.00: configured for UDMA/133 >>> ata1.00: Entering active power mode >>> ata1.00: Entering standby power mode >>> ravb e6800000.ethernet eth0: Link is Down >>> Micrel KSZ9031 Gigabit PHY e6800000.ethernet-ffffffff:00: attached >>> PHY driver (mii_bus:phy_addr=e6800000.ethernet-ffffffff:00, irq=136) >>> OOM killer enabled. >>> Restarting tasks ... done. >>> random: crng reseeded on system resumption >>> PM: suspend exit >>> ata1: link resume succeeded after 1 retries >>> ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) >>> ata1.00: Entering active power mode >>> ata1.00: configured for UDMA/133 >>> ravb e6800000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off >>> >>> Regardless, the disk worked fine after resume. >>> >>> Note that I saw this only once. >> >> I think I found the reason for this, but to confirm, were you doing a suspend >> right after resuming the system ? If yes, that I think I exactly understand the >> issue and why you saw it only once (it is a subtle race with scheduling >> libata-EH suspend/resume operations). I will send a fix next week. > > Now you ask that, yes there was a system suspend before. > > Relevant log with timing info: > > [ 130.177616] PM: suspend exit > [ 130.257981] ata1: link resume succeeded after 1 retries > [ 130.376714] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > [ 130.388525] ata1.00: Entering active power mode > > so the drive should have been ready here. yep. > > [ 140.452669] PM: suspend entry (s2idle) Then suspend again 10s later. > [ 140.488313] Filesystems sync: 0.026 seconds > [ 140.515957] Freezing user space processes > [ 140.518209] ata1.00: qc timeout after 10000 msecs (cmd 0x40) > [ 140.523384] Freezing user space processes completed (elapsed > 0.007 seconds) > [ 140.527718] ata1.00: VERIFY failed (err_mask=0x4) But that verify sent 10s earlier to spinup the drive failed... Hmmm... That is not exactly what I was thinking of. While the race between scheduling suspend while resume is still on-going does exist, it is likely not what is happening here given the time interval with suspend entry. Need to dig further. Do you perhaps have "Power-up in Standby" (PUIS) enabled on that drive ? > [ 140.532541] OOM killer disabled. > [ 140.537270] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x40) > [ 140.542069] Freezing remaining freezable tasks > [ 140.546784] ata1.00: revalidation failed (errno=-5) > > Gr{oetje,eeting}s, > > Geert >