Message ID | 20230927141828.90288-5-dlemoal@kernel.org (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | Fix libata suspend/resume handling and code cleanup | expand |
Damien, > The underlying device and driver of a SCSI disk may have different > system and runtime power mode control requirements. This is because > runtime power management affects only the SCSI disk, while sustem level /s/sustem/system/ > power management affects all devices, including the controller for the > SCSI disk. > > +manage_start_stop_show(struct device *dev, > + struct device_attribute *attr, char *buf) > { > struct scsi_disk *sdkp = to_scsi_disk(dev); > struct scsi_device *sdp = sdkp->device; > > - return sprintf(buf, "%u\n", sdp->manage_start_stop); > + return sprintf(buf, "%u\n", > + sdp->manage_system_start_stop && > + sdp->manage_runtime_start_stop); > } > +static DEVICE_ATTR_RO(manage_start_stop); Nitpick: I suggest you turn these sprintf() calls into sysfs_emit() before the checker bots notice. Otherwise this looks OK to me. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Damien Le Moal <dlemoal@kernel.org> writes: > system suspend/resume operations, the ATA port used to connect the > device will also be suspended and resumed, with the resume operation > requiring re-validating the device link and the device itself. In this > case, issuing a VERIFY command to spinup the disk must be done before > starting to revalidate the device, when the ata port is being resumed. > In such case, we must not allow the SCSI disk driver to issue START STOP > UNIT commands. Why must a VERIFY be issued to spinup the disk before revalidating? Before these patches, by default, manage_start_stop was on, and so sd would cause a VERIFY in the system resume path. That resume however ( sd and its issuing START UNIT ), would have happened AFTER the link was resumed and the ATA device was revalidated, woudldn't it? So at that point, the drive would already be spinning. And if manage_start_stop was disabled, then there would be no VERIFY at all, and this did not seem to cause a problem before. If this VERIFY were skipped, the next thing that would happen is for ata_dev_revalidate() to issue IDENTIFY, which would wait for the drive to spin up before returning wouldn't it? ( unless the drive has PuiS enabled ).
On 10/10/23 22:09, Phillip Susi wrote: > Damien Le Moal <dlemoal@kernel.org> writes: > >> system suspend/resume operations, the ATA port used to connect the >> device will also be suspended and resumed, with the resume operation >> requiring re-validating the device link and the device itself. In this >> case, issuing a VERIFY command to spinup the disk must be done before >> starting to revalidate the device, when the ata port is being resumed. >> In such case, we must not allow the SCSI disk driver to issue START STOP >> UNIT commands. > > Why must a VERIFY be issued to spinup the disk before revalidating? We can most likely move that VERIFY to after revalidation. That should shorten delays on first access after resume as many drive do actually spinup with the reset done before revalidating (but note that the specs say that even a reset shall not take a drive out of standby mode, without specifying the reset type, so this is rather loosely defined). > Before these patches, by default, manage_start_stop was on, and so sd > would cause a VERIFY in the system resume path. That resume however ( > sd and its issuing START UNIT ), would have happened AFTER the link was > resumed and the ATA device was revalidated, woudldn't it? So at that In theory, yes, that was the intent. In practice, the verify was issued from scsi PM resume context while the actual drive port reset + revalidation is done in libata EH context, triggered from ATA port resume context which itself was not synchronized/ordered with the scsi disk resume. So we ended up with the verify command execution sometimes being attempted with the drive not even revalidated yet, or with the port/link not even active sometimes (depending on timing). So problems all over and deadlocks due to scsi revalidate using the device lock, which PM use too. > point, the drive would already be spinning. And if manage_start_stop > was disabled, then there would be no VERIFY at all, and this did not > seem to cause a problem before. See above. With the switch to async PM ops in scsi in kernel 5.16, things broke badly due to the lack of synchronization that sync PM provided before that. > > If this VERIFY were skipped, the next thing that would happen is for > ata_dev_revalidate() to issue IDENTIFY, which would wait for the drive > to spin up before returning wouldn't it? ( unless the drive has PuiS > enabled ). ACS defines that only media access commands can get a drive out of standby mode back into active mode. So an IDENTIFY command would not (normally) spinup a drive. Nor would READ LOG. Normally, IDENTIFY, READ LOG etc done in revalidate can be processed with the drive spun down (*but* I have seen drives that react badly to some of these commands when spun down). Hence why I put it first, before revalidate. Now that all the synchronization is in place and libata EH does its own manage_start_stop for system suspend/resume, I will see if moving the verify to the end of revalidate works.
For SCSI disks that are runtime suspended, it looks like they skip waking the disk on system resume, leaving them in runtime suspend. After these patches, it looks like libata always wakes up the disk, but I don't see any calls to pm_runtime_disable/set_active/enable to mark the scsi disk as active after the system resume. That should result in a disk that is spinning, but runtime pm thinks is not, and so will not put it into suspend after the inactivity timeout.
On 10/16/23 01:14, Phillip Susi wrote: > For SCSI disks that are runtime suspended, it looks like they skip > waking the disk on system resume, leaving them in runtime suspend. > After these patches, it looks like libata always wakes up the disk, but > I don't see any calls to pm_runtime_disable/set_active/enable to mark > the scsi disk as active after the system resume. That should result in > a disk that is spinning, but runtime pm thinks is not, and so will not > put it into suspend after the inactivity timeout. Yes, correct, but this does not create any issues in practice beside the undesired disk spinup. Fixing that is not trivial because using runtime suspend/resume on the SCSI disk is just that, it will affect *only* the SCSI disk and not the ATA device and its port. In other words, a runtime suspend of the SCSI disk will spin down the drive but it will not runtime suspend the ATA port. So if you suspend the system, on resume, the ATA port will not be runtime suspended and so it will be resumed. The SCSI disk will not be resumed, but the ATA port resume will have spun up the disk, which we do not really want in that case. I am looking into this. Again, that is not a trivial fix. The other thing to notice here is that ATA port runtime suspend/resume is in fact broken: it does not track accesses to the device(s) connected to the port. And given that more than one device may be connected to a port, we need PM runtime reference counting to be done for this to work correctly. That is missing. Solutions are: fix everything or simply do not support ATA port runtime suspend/resume (i.e. remove code doing it). I am leaning toward the latter as it seems that no one actually noticed these issues because no one is actually using ATA port runtime suspend/resume...
Damien Le Moal <dlemoal@kernel.org> writes: > Yes, correct, but this does not create any issues in practice beside the > undesired disk spinup. The issue it creates is the opposite of that: it breaks the desired spin down. After some period of inactivity, the disk should be suspended, but after a system resume, the kernel thinks that it already is, and so won't suspend it again. > Fixing that is not trivial because using runtime suspend/resume on the SCSI disk > is just that, it will affect *only* the SCSI disk and not the ATA device and its > port. In other words, a runtime suspend of the SCSI disk will spin down the > drive but it will not runtime suspend the ATA port. So if you suspend > the I tested this last week and it appeared to work. I enabled runtime pm on the disk, as well as the ata port, and as soon as the disk suspended, the port did as well. > system, on resume, the ATA port will not be runtime suspended and so it will be > resumed. The SCSI disk will not be resumed, but the ATA port resume will have > spun up the disk, which we do not really want in that case. Right, I would rather the disk stay asleep if it has PuiS enabled, and I'm working on a patch for that. In the process of doing that though, I noticed that despite waking the disk up, it does not inform runtime pm about that. > I am looking into this. Again, that is not a trivial fix. The other thing to > notice here is that ATA port runtime suspend/resume is in fact broken: it does > not track accesses to the device(s) connected to the port. And given that more > than one device may be connected to a port, we need PM runtime reference > counting to be done for this to work correctly. That is > missing. Solutions are: Again, it seems to me that the child reference counting IS working. > fix everything or simply do not support ATA port runtime suspend/resume (i.e. > remove code doing it). I am leaning toward the latter as it seems that no one > actually noticed these issues because no one is actually using ATA port runtime > suspend/resume... Probably nobody is using it yes, but that doesn't mean we shouldn't try to get it working. It would be nice to have the drive go into deep SLEEP instead of standby, as well as suspend the ata port, and possibly even the whole AHCI controller rather than relying on the old APM drive internal auto standby mode.
On 10/16/23 21:39, Phillip Susi wrote: > Damien Le Moal <dlemoal@kernel.org> writes: > >> Yes, correct, but this does not create any issues in practice beside the >> undesired disk spinup. > > The issue it creates is the opposite of that: it breaks the desired spin > down. After some period of inactivity, the disk should be suspended, > but after a system resume, the kernel thinks that it already is, and so > won't suspend it again. That one should be fixable, though it I do not see an elegant method to do it. It would be easy with ugly code, e.g. tweaking the scsi device runtime pm state from libata... Not great. > >> Fixing that is not trivial because using runtime suspend/resume on the SCSI disk >> is just that, it will affect *only* the SCSI disk and not the ATA device and its >> port. In other words, a runtime suspend of the SCSI disk will spin down the >> drive but it will not runtime suspend the ATA port. So if you suspend >> the > > I tested this last week and it appeared to work. I enabled runtime pm > on the disk, as well as the ata port, and as soon as the disk suspended, > the port did as well. Never saw that in my tests when enabling runtime pm on the scsi disk only. Which is the important point here: there is no propagation of the suspend state down to the device parent it seems. > >> system, on resume, the ATA port will not be runtime suspended and so it will be >> resumed. The SCSI disk will not be resumed, but the ATA port resume will have >> spun up the disk, which we do not really want in that case. > > Right, I would rather the disk stay asleep if it has PuiS enabled, and > I'm working on a patch for that. In the process of doing that though, I > noticed that despite waking the disk up, it does not inform runtime pm > about that. But there are no runtime PM operations defined for ATA devices, only for ports. So not sure that matters... I am probably still missing something about runtime PM and devices ancestry. I focused a lot on system suspend/resume to fix the issues. runtime suspend/resume is next. > >> I am looking into this. Again, that is not a trivial fix. The other thing to >> notice here is that ATA port runtime suspend/resume is in fact broken: it does >> not track accesses to the device(s) connected to the port. And given that more >> than one device may be connected to a port, we need PM runtime reference >> counting to be done for this to work correctly. That is >> missing. Solutions are: > > Again, it seems to me that the child reference counting IS working. I am not sure of that, especially with cases of ATA ports with multiple disks (e.g. pmp or IDE). > >> fix everything or simply do not support ATA port runtime suspend/resume (i.e. >> remove code doing it). I am leaning toward the latter as it seems that no one >> actually noticed these issues because no one is actually using ATA port runtime >> suspend/resume... > > Probably nobody is using it yes, but that doesn't mean we shouldn't try > to get it working. It would be nice to have the drive go into deep > SLEEP instead of standby, as well as suspend the ata port, and possibly > even the whole AHCI controller rather than relying on the old APM drive > internal auto standby mode. >
Damien Le Moal <dlemoal@kernel.org> writes: > That one should be fixable, though it I do not see an elegant method to do it. > It would be easy with ugly code, e.g. tweaking the scsi device runtime pm state > from libata... Not great. What would be not great about it? libata already takes over the system suspend/resume from sd. I'm currently testing having libata do just this right now. I just got ahold of some jumpers today to put the drives back into PuiS and do some further testing tonight. > Never saw that in my tests when enabling runtime pm on the scsi disk only. Which > is the important point here: there is no propagation of the suspend state down > to the device parent it seems. Last night I again saw the port auto suspend when the scsi disk was runtime suspended. Tonight I'll test with PuiS, as well as with system resume while runtime suspended. Maybe I'll even try to get the whole AHCI controller to auto suspend. It seems like it should once all of the ports do. > I am not sure of that, especially with cases of ATA ports with multiple disks > (e.g. pmp or IDE). Good point. I have an eSATA dock with PMP. I'll check tonight if the children are counted properly.
On 10/18/23 03:03, Phillip Susi wrote: > Damien Le Moal <dlemoal@kernel.org> writes: > >> That one should be fixable, though it I do not see an elegant method to do it. >> It would be easy with ugly code, e.g. tweaking the scsi device runtime pm state >> from libata... Not great. > > What would be not great about it? libata already takes over the system To have to add code in libata that touches the scsi device PM status is what's not going to be great. Such code should stay in sd.c and scsi_pm.c. But not sure we actually need anything. > suspend/resume from sd. I'm currently testing having libata do just > this right now. I just got ahold of some jumpers today to put the > drives back into PuiS and do some further testing tonight. > >> Never saw that in my tests when enabling runtime pm on the scsi disk only. Which >> is the important point here: there is no propagation of the suspend state down >> to the device parent it seems. > > Last night I again saw the port auto suspend when the scsi disk was > runtime suspended. Tonight I'll test with PuiS, as well as with system > resume while runtime suspended. Maybe I'll even try to get the whole > AHCI controller to auto suspend. It seems like it should once all of > the ports do. With my tests, I never set the ata port power/control to "auto", which may be why I did not see the port being runtime suspended when the scsi disk was runtime suspended. Will try again. >> I am not sure of that, especially with cases of ATA ports with multiple disks >> (e.g. pmp or IDE). > > Good point. I have an eSATA dock with PMP. I'll check tonight if the > children are counted properly. With the device links in place between port and scsi devices, we should be OK. But still need to check that we do not need runtime_get/put calls added. Ideally, we should have the chain: scsi disk -> scsi target -> scsi host -> ata port for runtime suspend, and the reverse for runtime resume. If there is a system suspend/resume between runtime suspend/resume, the port should not be resumed if it is runtime suspended.
On 10/18/23 03:03, Phillip Susi wrote: > Damien Le Moal <dlemoal@kernel.org> writes: > >> That one should be fixable, though it I do not see an elegant method to do it. >> It would be easy with ugly code, e.g. tweaking the scsi device runtime pm state >> from libata... Not great. > > What would be not great about it? libata already takes over the system > suspend/resume from sd. I'm currently testing having libata do just > this right now. I just got ahold of some jumpers today to put the > drives back into PuiS and do some further testing tonight. > >> Never saw that in my tests when enabling runtime pm on the scsi disk only. Which >> is the important point here: there is no propagation of the suspend state down >> to the device parent it seems. > > Last night I again saw the port auto suspend when the scsi disk was > runtime suspended. Tonight I'll test with PuiS, as well as with system > resume while runtime suspended. Maybe I'll even try to get the whole > AHCI controller to auto suspend. It seems like it should once all of > the ports do. > >> I am not sure of that, especially with cases of ATA ports with multiple disks >> (e.g. pmp or IDE). > > Good point. I have an eSATA dock with PMP. I'll check tonight if the > children are counted properly. On my system, I see: cat /sys/class/ata_port/ata1/power/runtime_active_kids 0 and same for port 10 which is a PMP box with 3 drives. So it means that the children will be ignored, which is wrong. Note that the corresponding scsi_host device also shows 0. So to be safe with port runtime PM, we need to fix that first. Otherwise, the port may end up being runtime suspended with running drives still attached to it. /sys/class/ata_port/ata1/power/control is set to "auto" by default.
Damien Le Moal <dlemoal@kernel.org> writes: > With the device links in place between port and scsi devices, we should be OK. > But still need to check that we do not need runtime_get/put calls added. > Ideally, we should have the chain: > > scsi disk -> scsi target -> scsi host -> ata port It looks to me like there is an additional generic block device that sits on top and that is what actually has the idle timeout. Or maybe that's the scsi disk, since it's name incldues the SCSI LUN, but in the structure, its called sdev_gendev. But then there's also sdev_dev, and sdev_target. > for runtime suspend, and the reverse for runtime resume. If there is a system > suspend/resume between runtime suspend/resume, the port should not be resumed if > it is runtime suspended. I'm not sure about it. The port has to be resumed so that we can attempt to revalidate the devices on it. For disks that have spun up on their own, we should not leave then marked as runtime suspended, but really they are spinning. I suppose we could put them to sleep, though I was leaning to just marking them as active, and leaving the runtime pm timer to put them to sleep later, which then could allow the port to suspend again.
Damien Le Moal <dlemoal@kernel.org> writes: > On my system, I see: > > cat /sys/class/ata_port/ata1/power/runtime_active_kids > 0 I see a 1 there, which is the single scsi_host. The scsi_host has 2 active kids; the two disks. When I enabled runtime pm, only when the second disk was suspended did that allow the scsi_host to suspend, which then allowed the port to suspend. Everything looked fine there so far. Then I tried: echo 1 > /sys/block/sdf/device/delete And the SCSI EH appears to have tried to wake up the disk, and hung in the process. [ 314.246282] sd 7:0:0:0: [sde] Synchronizing SCSI cache [ 314.246445] sd 7:0:0:0: [sde] Stopping disk First disk suspends. [ 388.518295] sd 7:1:0:0: [sdf] Synchronizing SCSI cache [ 388.518519] sd 7:1:0:0: [sdf] Stopping disk Second disk suspends some time later. [ 388.930428] ata8.00: Entering standby power mode [ 389.330651] ata8.01: Entering standby power mode That allowed the port to suspend. This is when I tried to detach the disk driver, which I think tried to resume the disk before detaching, which resumed the port. [ 467.511878] ata8.15: SATA link down (SStatus 0 SControl 310) [ 468.142726] ata8.15: failed to read PMP GSCR[0] (Emask=0x100) [ 468.142741] ata8.15: PMP revalidation failed (errno=-5) I ran hdparm -C on the other disk at this point. I just noticed that the ata8.15 that represents the PMP itself was NOT suspended along with the two drive links, and then maybe was not resumed before trying to revalidate the PMP? And that's why it failed? [ 473.172792] ata8.15: SATA link up 1.5 Gbps (SStatus 113 SControl 310) [ 473.486860] ata8.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310) [ 473.802139] ata8.01: SATA link up 1.5 Gbps (SStatus 113 SControl 310) It seems like it ended up recovering here though? And yet the scsi_eh remained hung, as did the hdparm -C: [ 605.566814] INFO: task scsi_eh_7:173 blocked for more than 120 seconds. [ 605.566829] Not tainted 6.6.0-rc5+ #5 [ 605.566834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 605.566838] task:scsi_eh_7 state:D stack:0 pid:173 ppid:2 flags:0x00004000 [ 605.566850] Call Trace: [ 605.566853] <TASK> [ 605.566860] __schedule+0x37c/0xb70 [ 605.566878] schedule+0x61/0xd0 [ 605.566888] rpm_resume+0x156/0x760 [ 605.566896] ? sched_energy_aware_handler+0xb0/0xb0 [ 605.566907] rpm_resume+0x255/0x760 [ 605.566915] rpm_resume+0x255/0x760 [ 605.566923] rpm_resume+0x255/0x760 [ 605.566931] __pm_runtime_resume+0x4e/0x80 [ 605.566941] ata_eh_recover+0x695/0x1060 [libata] [ 605.567001] ? ata_port_pm_suspend+0x50/0x50 [libata] [ 605.567048] ? ahci_do_softreset+0x2d0/0x2d0 [libahci] [ 605.567067] ? ata_host_release+0x80/0x80 [libata] [ 605.567108] ? ata_port_runtime_idle+0x110/0x110 [libata] [ 605.567151] ? sata_pmp_configure+0x72/0x210 [libata] [ 605.567204] sata_pmp_error_handler+0x357/0xac0 [libata] [ 605.567249] ? ata_port_pm_suspend+0x50/0x50 [libata] [ 605.567291] ? ahci_stop_engine+0xe0/0xe0 [libahci] [ 605.567309] ? ahci_do_hardreset+0x140/0x140 [libahci] [ 605.567325] ? ahci_do_softreset+0x2d0/0x2d0 [libahci] [ 605.567344] ? _raw_spin_unlock_irqrestore+0x27/0x40 [ 605.567355] ahci_error_handler+0x36/0x60 [libahci] [ 605.567373] ata_scsi_port_error_handler+0x3de/0x8a0 [libata] [ 605.567424] ? scsi_eh_get_sense+0x250/0x250 [scsi_mod] [ 605.567464] ata_scsi_error+0x95/0xc0 [libata] [ 605.567511] scsi_error_handler+0xb9/0x580 [scsi_mod] [ 605.567547] ? preempt_count_add+0x6c/0xa0 [ 605.567556] ? scsi_eh_get_sense+0x250/0x250 [scsi_mod] [ 605.567587] kthread+0xf2/0x120 [ 605.567594] ? kthread_complete_and_exit+0x20/0x20 [ 605.567602] ret_from_fork+0x31/0x50 [ 605.567611] ? kthread_complete_and_exit+0x20/0x20 [ 605.567617] ret_from_fork_asm+0x11/0x20 [ 605.567630] </TASK> [ 605.567663] INFO: task bash:1305 blocked for more than 120 seconds. [ 605.567670] Not tainted 6.6.0-rc5+ #5 [ 605.567675] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 605.567678] task:bash state:D stack:0 pid:1305 ppid:1300 flags:0x00004004 [ 605.567687] Call Trace: [ 605.567689] <TASK> [ 605.567693] __schedule+0x37c/0xb70 [ 605.567703] ? try_to_wake_up+0xb2/0x5e0 [ 605.567715] schedule+0x61/0xd0 [ 605.567725] ata_port_wait_eh+0x7c/0xf0 [libata] [ 605.567776] ? sched_energy_aware_handler+0xb0/0xb0 [ 605.567784] ? ata_sas_port_resume+0x30/0x30 [libata] [ 605.567829] ata_port_runtime_resume+0x27/0x30 [libata] [ 605.567870] __rpm_callback+0x41/0x110 [ 605.567879] ? ata_sas_port_resume+0x30/0x30 [libata] [ 605.567917] rpm_callback+0x35/0x70 [ 605.567925] rpm_resume+0x513/0x760 [ 605.567931] ? _raw_read_lock_irqsave+0x28/0x50 [ 605.567938] ? _raw_read_unlock_irqrestore+0x2a/0x40 [ 605.567944] ? ep_poll_callback+0x269/0x2d0 [ 605.567955] rpm_resume+0x255/0x760 [ 605.567962] ? __slab_free+0xc7/0x320 [ 605.567972] rpm_resume+0x255/0x760 [ 605.567978] ? kernfs_should_drain_open_files+0x38/0x50 [ 605.567989] rpm_resume+0x255/0x760 [ 605.567994] ? kernfs_should_drain_open_files+0x38/0x50 [ 605.568002] ? kernfs_drain+0xec/0x120 [ 605.568010] __pm_runtime_resume+0x4e/0x80 [ 605.568018] device_release_driver_internal+0xa8/0x200 [ 605.568028] bus_remove_device+0xc0/0x120 [ 605.568035] device_del+0x158/0x3d0 [ 605.568045] ? mutex_lock+0x12/0x30 [ 605.568051] __scsi_remove_device+0x12b/0x180 [scsi_mod] [ 605.568095] sdev_store_delete+0x6a/0xd0 [scsi_mod] [ 605.568132] kernfs_fop_write_iter+0x129/0x1c0 [ 605.568141] vfs_write+0x2d3/0x3f0 [ 605.568155] ksys_write+0x63/0xe0 [ 605.568165] do_syscall_64+0x5a/0xb0 [ 605.568173] ? syscall_exit_to_user_mode+0x2b/0x40 [ 605.568178] ? do_syscall_64+0x67/0xb0 [ 605.568184] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 605.568193] RIP: 0033:0x7f0c1e9b6473 [ 605.568200] RSP: 002b:00007ffe01b0bd28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 605.568208] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f0c1e9b6473 [ 605.568213] RDX: 0000000000000002 RSI: 000056115e4b86f0 RDI: 0000000000000001 [ 605.568216] RBP: 000056115e4b86f0 R08: 000000000000000a R09: 00007f0c1ea5a0c0 [ 605.568220] R10: 00007f0c1ea59fc0 R11: 0000000000000246 R12: 0000000000000002 [ 605.568224] R13: 00007f0c1ea9a6a0 R14: 0000000000000002 R15: 00007f0c1ea95880 [ 605.568232] </TASK>
On 10/21/23 06:23, Phillip Susi wrote: > Damien Le Moal <dlemoal@kernel.org> writes: > >> On my system, I see: >> >> cat /sys/class/ata_port/ata1/power/runtime_active_kids >> 0 > > I see a 1 there, which is the single scsi_host. The scsi_host has 2 > active kids; the two disks. When I enabled runtime pm, only when the > second disk was suspended did that allow the scsi_host to suspend, which > then allowed the port to suspend. Everything looked fine there so far. > Then I tried: > > echo 1 > /sys/block/sdf/device/delete > > And the SCSI EH appears to have tried to wake up the disk, and hung in > the process. > > [ 314.246282] sd 7:0:0:0: [sde] Synchronizing SCSI cache > [ 314.246445] sd 7:0:0:0: [sde] Stopping disk > > First disk suspends. > > [ 388.518295] sd 7:1:0:0: [sdf] Synchronizing SCSI cache > [ 388.518519] sd 7:1:0:0: [sdf] Stopping disk > > Second disk suspends some time later. > > [ 388.930428] ata8.00: Entering standby power mode > [ 389.330651] ata8.01: Entering standby power mode > > That allowed the port to suspend. This is when I tried to detach the > disk driver, which I think tried to resume the disk before detaching, > which resumed the port. > > [ 467.511878] ata8.15: SATA link down (SStatus 0 SControl 310) > [ 468.142726] ata8.15: failed to read PMP GSCR[0] (Emask=0x100) > [ 468.142741] ata8.15: PMP revalidation failed (errno=-5) > > I ran hdparm -C on the other disk at this point. I just noticed that > the ata8.15 that represents the PMP itself was NOT suspended along with > the two drive links, and then maybe was not resumed before trying to > revalidate the PMP? And that's why it failed? > > [ 473.172792] ata8.15: SATA link up 1.5 Gbps (SStatus 113 SControl 310) > [ 473.486860] ata8.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310) > [ 473.802139] ata8.01: SATA link up 1.5 Gbps (SStatus 113 SControl 310) > > It seems like it ended up recovering here though? And yet the scsi_eh > remained hung, as did the hdparm -C: > > [ 605.566814] INFO: task scsi_eh_7:173 blocked for more than 120 seconds. > [ 605.566829] Not tainted 6.6.0-rc5+ #5 > [ 605.566834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 605.566838] task:scsi_eh_7 state:D stack:0 pid:173 ppid:2 flags:0x00004000 > [ 605.566850] Call Trace: > [ 605.566853] <TASK> > [ 605.566860] __schedule+0x37c/0xb70 > [ 605.566878] schedule+0x61/0xd0 > [ 605.566888] rpm_resume+0x156/0x760 Looks like a deadlock somewhere, likely with the device remove that you triggered with the "echo 1 > /sys/block/sdf/device/delete". Can you send the exact list of commands & events you executed to get to that point ? Also please share your kernel config.
Damien Le Moal <dlemoal@kernel.org> writes: > On 10/21/23 06:23, Phillip Susi wrote: > Looks like a deadlock somewhere, likely with the device remove that you > triggered with the "echo 1 > /sys/block/sdf/device/delete". > > Can you send the exact list of commands & events you executed to get to that > point ? Also please share your kernel config. I wrote auto to /sys/block/sd[ef]/device/power/config and 10000 to /sys/block/sd[ef]/device/power/auto_suspend_delay_ms, and auto to the control file of their corresponding ata8 port ( both drives are behind an PMP in an eSATA dock on ata8 ). As I said before, it the dmesg showed that the ata port only suspended after BOTH drives had suspended, which is as it should be. The problem seems to be after I echo 1 > /sys/block/sdf/device/delete, when this happens: Oct 26 16:43:00 faldara kernel: ata8.15: failed to read PMP GSCR[0] (Emask=0x100) Oct 26 16:43:00 faldara kernel: ata8.15: PMP revalidation failed (errno=-5) Oct 26 16:43:05 faldara kernel: ata8.15: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Oct 26 16:43:05 faldara kernel: ata8.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320) Oct 26 16:43:05 faldara kernel: ata8.01: SATA link up 3.0 Gbps (SStatus 123 SControl 320) Then I get the hung task. I reverted the PuiS patch that I have been working on, and the hang no longer happens, however, the above errors do still happen. I think that is a problem in itself, and may or may not be related to the hang. I see no reason why the PMP revalidation should fail after the two disks and the port are runtime pm suspended, but for some reason, with my patch applied, that then leads to the hang. Here is my git log showing your two patches applied on the upstream kernel, plus my patch: commit bb5db8bb505171fd2b8b67c3f22b16d8355d2a8b Author: Phillip Susi <phill@thesusis.net> Date: Mon Oct 16 17:03:51 2023 -0400 Olibata: don't start disks on resume Disks with Power Up In Standby enabled that required the SET FEATURES command to start up were being issued the command during resume. Suppress this until the disk is actually accessed. commit 4f1a1a9da4ff833417fa520d097b3f07c20e3c5d Author: Damien Le Moal <dlemoal@kernel.org> Date: Thu Oct 12 16:18:00 2023 +0900 [PATCH 2/2] ata: libata-core: Improve ata_dev_power_set_active() Improve the function ata_dev_power_set_active() by having it do nothing for a disk that is already in the active power state. To do that, introduce the function ata_dev_power_is_active() to test the current power state of the disk and return true if the disk is in the PM0: active or PM1: idle state (0xff value for the count field of the CHECK POWER MODE command output). To preserve the existing behavior, if the CHECK POWER MODE command issued in ata_dev_power_is_active() fails, the drive is assumed to be in standby mode and false is returned. With this change, issuing the VERIFY command to access the disk media to spin it up becomes unnecessary most of the time during system resume as the port reset done by libata-eh on resume often result in the drive to spin-up (this behavior is not clearly defined by the ACS specifications and may thus vary between disk models). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> commit bb7e1fcd9e3a207820e4b828e9f4f02986942d8d Author: Damien Le Moal <dlemoal@kernel.org> Date: Thu Oct 12 16:17:59 2023 +0900 ata: libata-eh: Spinup disk on resume after revalidation Move the call to ata_dev_power_set_active() to transition a disk in standby power mode to the active power mode from ata_eh_revalidate_and_attach() before doing revalidation to the end of ata_eh_recover(), after the link speed for the device is reconfigured (if that was necessary). This is safer as this ensure that the VERIFY command executed to spinup the disk is executed with the drive properly reconfigured first. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> commit 727fb83765049981e342db4c5a8b51aca72201d8 Merge: 8cb1f10d8c4b 5c15c60e7be6 Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Fri Oct 13 23:19:16 2023 -0700 Merge tag 'input-for-v6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input fixes from Dmitry Torokhov: - a reworked way for handling reset delay on SMBus-connected Synaptics touchpads (the original one, while being correct, uncovered an old bug in fallback to PS/2 code that was fixed separately; the new one however avoids having delay in serio port "fast" resume, and instead has the wait in the RMI4 code) - a fix for potential crashes when devices with Elan controllers (and Synaptics) fall back to PS/2 code. Can't be hit without the original patch above, but still good to have it fixed - a couple new device IDs in xpad Xbox driver - another quirk for Goodix driver to deal with stuff vendors put in ACPI tables - a fix for use-after-free on disconnect for powermate driver - a quirk to not initialize PS/2 mouse port on Fujitsu Lifebook E5411 laptop as it makes keyboard not usable and the device uses hid-over-i2c touchpad anyways * tag 'input-for-v6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: powermate - fix use-after-free in powermate_config_complete Input: xpad - add PXN V900 support Input: synaptics-rmi4 - handle reset delay when using SMBus trsnsport Input: psmouse - fix fast_reconnect function for PS/2 mode Revert "Input: psmouse - add delay when deactivating for SMBus mode" Input: goodix - ensure int GPIO is in input for gpio_count == 1 && gpio_int_idx == 0 case Input: i8042 - add Fujitsu Lifebook E5411 to i8042 quirk table Input: xpad - add HyperX Clutch Gladiate Support And here is my patch, which basically checks for PuiS during system resume, and either forces the disk to suspend or resume depending on whether it is PuiS. Since I have not even engaged in any system suspend/resume for this test, this patch should not have any effect. So far in my testing of this patch on my 3 internal drives that I have enabled PuiS on, it appears to work in so much as after a suspend/resume, the runtime status of the disks is suspended ( as long as I have *enabled* runtime pm on them, otherwise not ). Another problem seems to be that while the disks are left suspended after system resume, they very quickly are resumed due to begnign IO eminating from either ext4 or udsisks2 that does not cause a drive to spin up when it is suspended with hdparm -y. This would be the case of either FLUSH CASH or CHECK POWER CONDITION not causing the drive to spin itself up when given the commands, but the runtime pm core does not know that these commands can be completed without resuming the disk, so it resumes them first.
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index 8b43290ca2cd..73428ad0c8d2 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -1056,7 +1056,8 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev) * will be woken up by ata_port_pm_resume() with a port reset * and device revalidation. */ - sdev->manage_start_stop = 1; + sdev->manage_system_start_stop = true; + sdev->manage_runtime_start_stop = true; sdev->no_start_on_resume = 1; } diff --git a/drivers/firewire/sbp2.c b/drivers/firewire/sbp2.c index 26db5b8dfc1e..749868b9e80d 100644 --- a/drivers/firewire/sbp2.c +++ b/drivers/firewire/sbp2.c @@ -81,7 +81,8 @@ MODULE_PARM_DESC(exclusive_login, "Exclusive login to sbp2 device " * * - power condition * Set the power condition field in the START STOP UNIT commands sent by - * sd_mod on suspend, resume, and shutdown (if manage_start_stop is on). + * sd_mod on suspend, resume, and shutdown (if manage_system_start_stop or + * manage_runtime_start_stop is on). * Some disks need this to spin down or to resume properly. * * - override internal blacklist @@ -1517,8 +1518,10 @@ static int sbp2_scsi_slave_configure(struct scsi_device *sdev) sdev->use_10_for_rw = 1; - if (sbp2_param_exclusive_login) - sdev->manage_start_stop = 1; + if (sbp2_param_exclusive_login) { + sdev->manage_system_start_stop = true; + sdev->manage_runtime_start_stop = true; + } if (sdev->type == TYPE_ROM) sdev->use_10_for_ms = 1; diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index c92a317ba547..a1ef4eef904f 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -201,18 +201,32 @@ cache_type_store(struct device *dev, struct device_attribute *attr, } static ssize_t -manage_start_stop_show(struct device *dev, struct device_attribute *attr, - char *buf) +manage_start_stop_show(struct device *dev, + struct device_attribute *attr, char *buf) { struct scsi_disk *sdkp = to_scsi_disk(dev); struct scsi_device *sdp = sdkp->device; - return sprintf(buf, "%u\n", sdp->manage_start_stop); + return sprintf(buf, "%u\n", + sdp->manage_system_start_stop && + sdp->manage_runtime_start_stop); } +static DEVICE_ATTR_RO(manage_start_stop); static ssize_t -manage_start_stop_store(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count) +manage_system_start_stop_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct scsi_disk *sdkp = to_scsi_disk(dev); + struct scsi_device *sdp = sdkp->device; + + return sprintf(buf, "%u\n", sdp->manage_system_start_stop); +} + +static ssize_t +manage_system_start_stop_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) { struct scsi_disk *sdkp = to_scsi_disk(dev); struct scsi_device *sdp = sdkp->device; @@ -224,11 +238,42 @@ manage_start_stop_store(struct device *dev, struct device_attribute *attr, if (kstrtobool(buf, &v)) return -EINVAL; - sdp->manage_start_stop = v; + sdp->manage_system_start_stop = v; return count; } -static DEVICE_ATTR_RW(manage_start_stop); +static DEVICE_ATTR_RW(manage_system_start_stop); + +static ssize_t +manage_runtime_start_stop_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct scsi_disk *sdkp = to_scsi_disk(dev); + struct scsi_device *sdp = sdkp->device; + + return sprintf(buf, "%u\n", sdp->manage_runtime_start_stop); +} + +static ssize_t +manage_runtime_start_stop_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct scsi_disk *sdkp = to_scsi_disk(dev); + struct scsi_device *sdp = sdkp->device; + bool v; + + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + + if (kstrtobool(buf, &v)) + return -EINVAL; + + sdp->manage_runtime_start_stop = v; + + return count; +} +static DEVICE_ATTR_RW(manage_runtime_start_stop); static ssize_t allow_restart_show(struct device *dev, struct device_attribute *attr, char *buf) @@ -560,6 +605,8 @@ static struct attribute *sd_disk_attrs[] = { &dev_attr_FUA.attr, &dev_attr_allow_restart.attr, &dev_attr_manage_start_stop.attr, + &dev_attr_manage_system_start_stop.attr, + &dev_attr_manage_runtime_start_stop.attr, &dev_attr_protection_type.attr, &dev_attr_protection_mode.attr, &dev_attr_app_tag_own.attr, @@ -3771,13 +3818,20 @@ static void sd_shutdown(struct device *dev) sd_sync_cache(sdkp, NULL); } - if (system_state != SYSTEM_RESTART && sdkp->device->manage_start_stop) { + if (system_state != SYSTEM_RESTART && + sdkp->device->manage_system_start_stop) { sd_printk(KERN_NOTICE, sdkp, "Stopping disk\n"); sd_start_stop_device(sdkp, 0); } } -static int sd_suspend_common(struct device *dev, bool ignore_stop_errors) +static inline bool sd_do_start_stop(struct scsi_device *sdev, bool runtime) +{ + return (sdev->manage_system_start_stop && !runtime) || + (sdev->manage_runtime_start_stop && runtime); +} + +static int sd_suspend_common(struct device *dev, bool runtime) { struct scsi_disk *sdkp = dev_get_drvdata(dev); struct scsi_sense_hdr sshdr; @@ -3809,12 +3863,12 @@ static int sd_suspend_common(struct device *dev, bool ignore_stop_errors) } } - if (sdkp->device->manage_start_stop) { + if (sd_do_start_stop(sdkp->device, runtime)) { if (!sdkp->device->silence_suspend) sd_printk(KERN_NOTICE, sdkp, "Stopping disk\n"); /* an error is not worth aborting a system sleep */ ret = sd_start_stop_device(sdkp, 0); - if (ignore_stop_errors) + if (!runtime) ret = 0; } @@ -3826,23 +3880,23 @@ static int sd_suspend_system(struct device *dev) if (pm_runtime_suspended(dev)) return 0; - return sd_suspend_common(dev, true); + return sd_suspend_common(dev, false); } static int sd_suspend_runtime(struct device *dev) { - return sd_suspend_common(dev, false); + return sd_suspend_common(dev, true); } -static int sd_resume(struct device *dev) +static int sd_resume(struct device *dev, bool runtime) { struct scsi_disk *sdkp = dev_get_drvdata(dev); - int ret = 0; + int ret; if (!sdkp) /* E.g.: runtime resume at the start of sd_probe() */ return 0; - if (!sdkp->device->manage_start_stop) + if (!sd_do_start_stop(sdkp->device, runtime)) return 0; if (!sdkp->device->no_start_on_resume) { @@ -3860,7 +3914,7 @@ static int sd_resume_system(struct device *dev) if (pm_runtime_suspended(dev)) return 0; - return sd_resume(dev); + return sd_resume(dev, false); } static int sd_resume_runtime(struct device *dev) @@ -3887,7 +3941,7 @@ static int sd_resume_runtime(struct device *dev) "Failed to clear sense data\n"); } - return sd_resume(dev); + return sd_resume(dev, true); } static const struct dev_pm_ops sd_pm_ops = { diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index b9230b6add04..fd41fdac0a8e 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -161,6 +161,10 @@ struct scsi_device { * pass settings from slave_alloc to scsi * core. */ unsigned int eh_timeout; /* Error handling timeout */ + + bool manage_system_start_stop; /* Let HLD (sd) manage system start/stop */ + bool manage_runtime_start_stop; /* Let HLD (sd) manage runtime start/stop */ + unsigned removable:1; unsigned changed:1; /* Data invalid due to media change */ unsigned busy:1; /* Used to prevent races */ @@ -193,7 +197,6 @@ struct scsi_device { unsigned use_192_bytes_for_3f:1; /* ask for 192 bytes from page 0x3f */ unsigned no_start_on_add:1; /* do not issue start on add */ unsigned allow_restart:1; /* issue START_UNIT in error handler */ - unsigned manage_start_stop:1; /* Let HLD (sd) manage start/stop */ unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on resume */ unsigned start_stop_pwr_cond:1; /* Set power cond. in START_STOP_UNIT */ unsigned no_uld_attach:1; /* disable connecting to upper level drivers */