Message ID | 20230920135439.929695-10-dlemoal@kernel.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Fix libata suspend/resume handling and code cleanup | expand |
On 9/20/23 06:54, Damien Le Moal wrote: > If an error occurs when resuming a host adapter before the devices > attached to the adapter are resumed, the adapter low level driver may > remove the scsi host, resulting in a call to sd_remove() for the > disks of the host. This in turn results in a call to sd_shutdown() which > will issue a synchronize cache command and a start stop unit command to > spindown the disk. sd_shutdown() issues the commands only if the device > is not already suspended but does not check the power state for > system-wide suspend/resume. That is, the commands may be issued with the > device in a suspended state, which causes PM resume to hang, forcing a > reset of the machine to recover. > > Fix this by not calling sd_shutdown() in sd_remove() if the device > is not running. Hi Damien, I'd like to look into an alternative fix (after this patch series went in) but I couldn't identify the call chain in the ATA resume code that results in removal of the SCSI host. Can you please show me the call chain that results in SCSI host removal if resuming fails? Thanks, Bart.
On 2023/09/21 14:36, Bart Van Assche wrote: > On 9/20/23 06:54, Damien Le Moal wrote: >> If an error occurs when resuming a host adapter before the devices >> attached to the adapter are resumed, the adapter low level driver may >> remove the scsi host, resulting in a call to sd_remove() for the >> disks of the host. This in turn results in a call to sd_shutdown() which >> will issue a synchronize cache command and a start stop unit command to >> spindown the disk. sd_shutdown() issues the commands only if the device >> is not already suspended but does not check the power state for >> system-wide suspend/resume. That is, the commands may be issued with the >> device in a suspended state, which causes PM resume to hang, forcing a >> reset of the machine to recover. >> >> Fix this by not calling sd_shutdown() in sd_remove() if the device >> is not running. > > Hi Damien, > > I'd like to look into an alternative fix (after this patch series went > in) but I couldn't identify the call chain in the ATA resume code that > results in removal of the SCSI host. Can you please show me the call > chain that results in SCSI host removal if resuming fails? See the pm80xx driver for which I recently fixed a resume issue. That is how I found this problem with device removal: resuming the pm800xx HBA was failing and the driver then called scsi_remove_host() to drop the ports and that led to trying to removed sd devices that were still suspended. > > Thanks, > > Bart. >
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 1d106c8ad5af..d86306d42445 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -3727,7 +3727,8 @@ static int sd_remove(struct device *dev) device_del(&sdkp->disk_dev); del_gendisk(sdkp->disk); - sd_shutdown(dev); + if (sdkp->device->sdev_state == SDEV_RUNNING) + sd_shutdown(dev); put_disk(sdkp->disk); return 0;