Message ID | 20230911040217.253905-8-dlemoal@kernel.org (mailing list archive) |
---|---|
State | Superseded |
Series | Fix libata suspend/resume handling and code cleanup |
On 9/11/23 06:02, Damien Le Moal wrote:
> If an error occurs when resuming a host adapter before the devices
> attached to the adapter are resumed, the adapter low level driver may
> remove the scsi host, resulting in a call to sd_remove() for the
> disks of the host. However, since this function calls sd_shutdown(),
> a synchronize cache command and a start stop unit may be issued with the
> drive still sleeping and the HBA non-functional. This causes PM resume
> to hang, forcing a reset of the machine to recover.
>
> Fix this by checking a device host state in sd_shutdown() and by
> returning early doing nothing if the host state is not SHOST_RUNNING.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>  drivers/scsi/sd.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index c92a317ba547..a415abb721d3 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3763,7 +3763,8 @@ static void sd_shutdown(struct device *dev)
>  	if (!sdkp)
>  		return;         /* this can happen */
>
> -	if (pm_runtime_suspended(dev))
> +	if (pm_runtime_suspended(dev) ||
> +	    sdkp->device->host->shost_state != SHOST_RUNNING)
>  		return;
>
>  	if (sdkp->WCE && sdkp->media_present) {

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
On Mon, Sep 11, 2023 at 01:02:05PM +0900, Damien Le Moal wrote:
> If an error occurs when resuming a host adapter before the devices
> attached to the adapter are resumed, the adapter low level driver may
> remove the scsi host, resulting in a call to sd_remove() for the
> disks of the host. However, since this function calls sd_shutdown(),
> a synchronize cache command and a start stop unit may be issued with the
> drive still sleeping and the HBA non-functional. This causes PM resume
> to hang, forcing a reset of the machine to recover.
>
> Fix this by checking a device host state in sd_shutdown() and by
> returning early doing nothing if the host state is not SHOST_RUNNING.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>

Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
On 9/10/23 21:02, Damien Le Moal wrote:
> If an error occurs when resuming a host adapter before the devices
> attached to the adapter are resumed, the adapter low level driver may
> remove the scsi host, resulting in a call to sd_remove() for the
> disks of the host. However, since this function calls sd_shutdown(),
> a synchronize cache command and a start stop unit may be issued with the
> drive still sleeping and the HBA non-functional. This causes PM resume
> to hang, forcing a reset of the machine to recover.
>
> Fix this by checking a device host state in sd_shutdown() and by
> returning early doing nothing if the host state is not SHOST_RUNNING.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>  drivers/scsi/sd.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index c92a317ba547..a415abb721d3 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3763,7 +3763,8 @@ static void sd_shutdown(struct device *dev)
>  	if (!sdkp)
>  		return;         /* this can happen */
>
> -	if (pm_runtime_suspended(dev))
> +	if (pm_runtime_suspended(dev) ||
> +	    sdkp->device->host->shost_state != SHOST_RUNNING)
>  		return;
>
>  	if (sdkp->WCE && sdkp->media_present) {

Why test the host state instead of dev->power.runtime_status? I don't
think that it is safe to skip shutdown if the error handler is active.
If the error handler can recover the device, a SYNCHRONIZE CACHE command
should be submitted.

Thanks,

Bart.
On 9/14/23 05:50, Bart Van Assche wrote:
> On 9/10/23 21:02, Damien Le Moal wrote:
>> If an error occurs when resuming a host adapter before the devices
>> attached to the adapter are resumed, the adapter low level driver may
>> remove the scsi host, resulting in a call to sd_remove() for the
>> disks of the host. However, since this function calls sd_shutdown(),
>> a synchronize cache command and a start stop unit may be issued with the
>> drive still sleeping and the HBA non-functional. This causes PM resume
>> to hang, forcing a reset of the machine to recover.
>>
>> Fix this by checking a device host state in sd_shutdown() and by
>> returning early doing nothing if the host state is not SHOST_RUNNING.
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> ---
>>  drivers/scsi/sd.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index c92a317ba547..a415abb721d3 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -3763,7 +3763,8 @@ static void sd_shutdown(struct device *dev)
>>  	if (!sdkp)
>>  		return;         /* this can happen */
>>
>> -	if (pm_runtime_suspended(dev))
>> +	if (pm_runtime_suspended(dev) ||
>> +	    sdkp->device->host->shost_state != SHOST_RUNNING)
>>  		return;
>>
>>  	if (sdkp->WCE && sdkp->media_present) {
>
> Why test the host state instead of dev->power.runtime_status? I don't
> think that it is safe to skip shutdown if the error handler is active.
> If the error handler can recover the device, a SYNCHRONIZE CACHE command
> should be submitted.

But there is no synchronization with EH that I can see anyway. At least for
sd_remove(), I would assume that this is called only once the device references
were all dropped, so presumably EH is not doing anything with the drive when
that happens, no?

In any case, looking at dev->power.runtime_status is not correct as this is set
to RPM_ACTIVE when the device is suspended through system suspend. We could
replace the test "sdkp->device->host->shost_state != SHOST_RUNNING" with
"dev->power.is_suspended", as that indicates true (1) for a suspended device.
However, I really do not like that as that is a PM internal field and we should
not be accessing it directly. The PM code comments say as much. Any better idea?
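The distinction drawn in this reply can be illustrated with a small userspace model. This is only a sketch: the `pm_model` struct, the `MODEL_RPM_*` values, and both helper functions are hypothetical stand-ins and not kernel code. It assumes that the runtime PM check is true only when the runtime status is "suspended" and runtime PM is not disabled, and it assumes, as stated in the reply, that a disk put to sleep by system suspend still reports an active runtime status.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the runtime PM status values. */
enum model_rpm_status {
	MODEL_RPM_ACTIVE,
	MODEL_RPM_RESUMING,
	MODEL_RPM_SUSPENDED,
	MODEL_RPM_SUSPENDING,
};

/* Reduced, illustrative model of the PM fields named in the thread. */
struct pm_model {
	enum model_rpm_status runtime_status; /* tracks runtime PM only */
	unsigned int disable_depth;           /* non-zero: runtime PM disabled */
	bool is_suspended;                    /* set by system-wide suspend */
};

/* Models a runtime-PM-only "is this device suspended?" check. */
static bool model_pm_runtime_suspended(const struct pm_model *pm)
{
	return pm->runtime_status == MODEL_RPM_SUSPENDED &&
	       pm->disable_depth == 0;
}

/*
 * Builds the state described in the reply: the disk sleeps because of
 * system suspend, yet its runtime status still reads "active".
 * (Assumption: disable_depth is left at 0 to keep the model minimal.)
 */
static struct pm_model model_system_suspended_disk(void)
{
	struct pm_model pm = {
		.runtime_status = MODEL_RPM_ACTIVE,
		.disable_depth = 0,
		.is_suspended = true,
	};
	return pm;
}
```

In this model, `model_pm_runtime_suspended()` returns false for the disk built by `model_system_suspended_disk()` even though `is_suspended` is true, which is the gap the reply points out: a check based on runtime PM status alone would not stop sd_shutdown() from issuing commands to a disk suspended by system suspend.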
On 9/13/23 17:29, Damien Le Moal wrote:
> On 9/14/23 05:50, Bart Van Assche wrote:
>> On 9/10/23 21:02, Damien Le Moal wrote:
>>> If an error occurs when resuming a host adapter before the devices
>>> attached to the adapter are resumed, the adapter low level driver may
>>> remove the scsi host, resulting in a call to sd_remove() for the
>>> disks of the host. However, since this function calls sd_shutdown(),
>>> a synchronize cache command and a start stop unit may be issued with the
>>> drive still sleeping and the HBA non-functional. This causes PM resume
>>> to hang, forcing a reset of the machine to recover.
>>>
>>> Fix this by checking a device host state in sd_shutdown() and by
>>> returning early doing nothing if the host state is not SHOST_RUNNING.
>>>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>> ---
>>>  drivers/scsi/sd.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>>> index c92a317ba547..a415abb721d3 100644
>>> --- a/drivers/scsi/sd.c
>>> +++ b/drivers/scsi/sd.c
>>> @@ -3763,7 +3763,8 @@ static void sd_shutdown(struct device *dev)
>>>  	if (!sdkp)
>>>  		return;         /* this can happen */
>>>
>>> -	if (pm_runtime_suspended(dev))
>>> +	if (pm_runtime_suspended(dev) ||
>>> +	    sdkp->device->host->shost_state != SHOST_RUNNING)
>>>  		return;
>>>
>>>  	if (sdkp->WCE && sdkp->media_present) {
>>
>> Why test the host state instead of dev->power.runtime_status? I don't
>> think that it is safe to skip shutdown if the error handler is active.
>> If the error handler can recover the device, a SYNCHRONIZE CACHE command
>> should be submitted.
>
> But there is no synchronization with EH that I can see anyway. At least for
> sd_remove(), I would assume that this is called only once the device references
> were all dropped, so presumably EH is not doing anything with the drive when
> that happens, no?
>
> In any case, looking at dev->power.runtime_status is not correct as this is set
> to RPM_ACTIVE when the device is suspended through system suspend. We could
> replace the test "sdkp->device->host->shost_state != SHOST_RUNNING" with
> "dev->power.is_suspended", as that indicates true (1) for a suspended device.
> However, I really do not like that as that is a PM internal field and we should
> not be accessing it directly. The PM code comments say as much. Any better idea?

I will reply to the above question on v2 of this patch.

Bart.
If an error occurs when resuming a host adapter before the devices
attached to the adapter are resumed, the adapter low level driver may
remove the scsi host, resulting in a call to sd_remove() for the
disks of the host. However, since this function calls sd_shutdown(),
a synchronize cache command and a start stop unit may be issued with the
drive still sleeping and the HBA non-functional. This causes PM resume
to hang, forcing a reset of the machine to recover.

Fix this by checking a device host state in sd_shutdown() and by
returning early doing nothing if the host state is not SHOST_RUNNING.

Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/scsi/sd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index c92a317ba547..a415abb721d3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3763,7 +3763,8 @@ static void sd_shutdown(struct device *dev)
 	if (!sdkp)
 		return;         /* this can happen */
 
-	if (pm_runtime_suspended(dev))
+	if (pm_runtime_suspended(dev) ||
+	    sdkp->device->host->shost_state != SHOST_RUNNING)
 		return;
 
 	if (sdkp->WCE && sdkp->media_present) {
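The early return that this patch adds to sd_shutdown() can be modeled in userspace to see which host states it affects. The following is a minimal sketch, not the driver code: `model_disk`, the `MODEL_SHOST_*` enum, and both helpers are hypothetical names chosen for illustration. The enum follows the names of the SCSI host states, but its numeric values should be treated as illustrative.

```c
#include <stdbool.h>

/* Illustrative copy of the SCSI host state names (values are assumptions). */
enum model_shost_state {
	MODEL_SHOST_CREATED = 1,
	MODEL_SHOST_RUNNING,
	MODEL_SHOST_CANCEL,
	MODEL_SHOST_DEL,
	MODEL_SHOST_RECOVERY,
	MODEL_SHOST_CANCEL_RECOVERY,
	MODEL_SHOST_DEL_RECOVERY,
};

/* Hypothetical reduction of the state that sd_shutdown() looks at. */
struct model_disk {
	bool runtime_suspended;            /* result of the runtime PM check */
	enum model_shost_state host_state; /* state of the owning SCSI host */
	bool wce;                          /* write cache enabled */
	bool media_present;
};

/* True when the patched guard would return early and issue nothing. */
static bool model_shutdown_skipped(const struct model_disk *d)
{
	return d->runtime_suspended ||
	       d->host_state != MODEL_SHOST_RUNNING;
}

/* True when a SYNCHRONIZE CACHE would still be issued after the guard. */
static bool model_would_sync_cache(const struct model_disk *d)
{
	if (model_shutdown_skipped(d))
		return false;
	return d->wce && d->media_present;
}
```

With a running host, write cache enabled, and media present, `model_would_sync_cache()` returns true. It returns false for any other host state, including the recovery states. That last case is what the review comment about the error handler is concerned with: under this guard, a host in recovery is treated the same as a host being removed, so no cache flush is attempted even if recovery later succeeds.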