Message ID | 20240808034619.768289-1-liyihang9@huawei.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | scsi: sd: Have scsi-ml retry START_STOP errors | expand |
On 8/7/24 8:46 PM, Yihang Li wrote: > When sending START_STOP commands to resume scsi_device, it may be > interrupted by exception operations such as host reset or PCI FLR. Once > the command of START_STOP is failed, the runtime_status of scsi device > will be error and it is difficult for user to recover it. How is the PCI FLR sent to the device? Shouldn't PCI FLRs only be triggered by the SCSI LLD from inside an error handler callback? How can a PCI FLR be triggered while a START STOP UNIT command is being processed? Why can PCI FLRs only be triggered while a START STOP UNIT command is being processed and not while any other command is being processed? > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c > index 5cd88a8eea73..29f30407d713 100644 > --- a/drivers/scsi/sd.c > +++ b/drivers/scsi/sd.c > @@ -4088,9 +4088,20 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start) > { > unsigned char cmd[6] = { START_STOP }; /* START_VALID */ > struct scsi_sense_hdr sshdr; > + struct scsi_failure failure_defs[] = { > + { > + .allowed = 3, > + .result = SCMD_FAILURE_RESULT_ANY, > + }, > + {} > + }; > + struct scsi_failures failures = { > + .failure_definitions = failure_defs, > + }; > const struct scsi_exec_args exec_args = { > .sshdr = &sshdr, > .req_flags = BLK_MQ_REQ_PM, > + .failures = &failures, > }; > struct scsi_device *sdp = sdkp->device; > int res; The above change makes the START STOP UNIT command to be retried unconditionally. A START STOP UNIT command should not be retried unconditionally. Please take a look at the following patch series (posted yesterday): https://lore.kernel.org/linux-scsi/20240807203215.2439244-1-bvanassche@acm.org/ Thanks, Bart.
On 2024/8/9 1:20, Bart Van Assche wrote: > On 8/7/24 8:46 PM, Yihang Li wrote: >> When sending START_STOP commands to resume scsi_device, it may be >> interrupted by exception operations such as host reset or PCI FLR. Once >> the command of START_STOP is failed, the runtime_status of scsi device >> will be error and it is difficult for user to recover it. > > How is the PCI FLR sent to the device? Shouldn't PCI FLRs only be > triggered by the SCSI LLD from inside an error handler callback? How can > a PCI FLR be triggered while a START STOP UNIT command is being > processed? Why can PCI FLRs only be triggered while a START STOP UNIT > command is being processed and not while any other command is being > processed? The PCI FLR mentioned in my description is not sent to the SCSI device, but to the SCSI host. When the START STOP UNIT command is submitted to the SCSI device, the command is sent to the SCSI device through the SCSI host. If the user triggers the PCI FLR through the sysfs interface or the host is reset due to an error, the START STOP UNIT command (other commands are the same) is interrupted. So I think the command needs to be retried. > >> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c >> index 5cd88a8eea73..29f30407d713 100644 >> --- a/drivers/scsi/sd.c >> +++ b/drivers/scsi/sd.c >> @@ -4088,9 +4088,20 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start) >> { >> unsigned char cmd[6] = { START_STOP }; /* START_VALID */ >> struct scsi_sense_hdr sshdr; >> + struct scsi_failure failure_defs[] = { >> + { >> + .allowed = 3, >> + .result = SCMD_FAILURE_RESULT_ANY, >> + }, >> + {} >> + }; >> + struct scsi_failures failures = { >> + .failure_definitions = failure_defs, >> + }; >> const struct scsi_exec_args exec_args = { >> .sshdr = &sshdr, >> .req_flags = BLK_MQ_REQ_PM, >> + .failures = &failures, >> }; >> struct scsi_device *sdp = sdkp->device; >> int res; > > The above change makes the START STOP UNIT command to be retried > unconditionally. A START STOP UNIT command should not be retried > unconditionally. > > Please take a look at the following patch series (posted yesterday): > https://lore.kernel.org/linux-scsi/20240807203215.2439244-1-bvanassche@acm.org/ I will reconsider the retry of the START STOP UNIT command, and try this patch series as well. Thanks, Yihang.
On 8/9/24 1:28 AM, Yihang Li wrote: > If the user triggers the PCI FLR through the sysfs interface or > the host is reset due to an error, the START STOP UNIT command > (other commands are the same) is interrupted. So I think the command > needs to be retried. Hmm ... aren't operations that bypass the SCSI core considered out of scope for SCSI core fixes? Thanks, Bart.
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 5cd88a8eea73..29f30407d713 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -4088,9 +4088,20 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start) { unsigned char cmd[6] = { START_STOP }; /* START_VALID */ struct scsi_sense_hdr sshdr; + struct scsi_failure failure_defs[] = { + { + .allowed = 3, + .result = SCMD_FAILURE_RESULT_ANY, + }, + {} + }; + struct scsi_failures failures = { + .failure_definitions = failure_defs, + }; const struct scsi_exec_args exec_args = { .sshdr = &sshdr, .req_flags = BLK_MQ_REQ_PM, + .failures = &failures, }; struct scsi_device *sdp = sdkp->device; int res;
When sending START_STOP commands to resume scsi_device, it may be interrupted by exception operations such as host reset or PCI FLR. Once the command of START_STOP is failed, the runtime_status of scsi device will be error and it is difficult for user to recover it. So try more retries to increase robustness as the process of command START_STOP. Signed-off-by: Yihang Li <liyihang9@huawei.com> --- drivers/scsi/sd.c | 11 +++++++++++ 1 file changed, 11 insertions(+)