Message ID | 20230922093636.2645961-2-haowenchao2@huawei.com (mailing list archive) |
---|---|
State | Changes Requested |
Series | Fix two issue between removing device and error handle |
On 9/22/23 02:36, Wenchao Hao wrote:
> SDEV_CANCEL is set when removing device and scsi_device_online() should
> return false if sdev_state is SDEV_CANCEL.
>
> IO hang would be caused if return true when state is SDEV_CANCEL with
> following order:
>
> T1:                                        T2:scsi_error_handler
> __scsi_remove_device()
>   scsi_device_set_state(sdev, SDEV_CANCEL)
>                                            scsi_eh_flush_done_q()
>                                              if (scsi_device_online(sdev))
>                                                scsi_queue_insert(scmd,...)
>
> The command added by scsi_queue_insert() would never be handled any
> more.

Why not? I think the blk_mq_destroy_queue() call in __scsi_remove_device()
will cause it to fail.

Thanks,

Bart.
On 2023/9/22 23:23, Bart Van Assche wrote:
> On 9/22/23 02:36, Wenchao Hao wrote:
>> SDEV_CANCEL is set when removing device and scsi_device_online() should
>> return false if sdev_state is SDEV_CANCEL.
>>
>> IO hang would be caused if return true when state is SDEV_CANCEL with
>> following order:
>>
>> T1:                                        T2:scsi_error_handler
>> __scsi_remove_device()
>>   scsi_device_set_state(sdev, SDEV_CANCEL)
>>                                            scsi_eh_flush_done_q()
>>                                              if (scsi_device_online(sdev))
>>                                                scsi_queue_insert(scmd,...)
>>
>> The command added by scsi_queue_insert() would never be handled any
>> more.
>
> Why not? I think the blk_mq_destroy_queue() call in __scsi_remove_device()
> will cause it to fail.
>
> Thanks,
>
> Bart.
>

Sorry, I did not describe this in detail. __scsi_remove_device() would be
blocked in blk_mq_freeze_queue_wait(), waiting for all block requests to
finish, so blk_mq_destroy_queue() would never be called, and the task that
tries to remove the scsi_device would hang.
On 2023/9/22 17:36, Wenchao Hao wrote:
> SDEV_CANCEL is set when removing device and scsi_device_online() should
> return false if sdev_state is SDEV_CANCEL.
>
> IO hang would be caused if return true when state is SDEV_CANCEL with
> following order:
>
> T1:                                        T2:scsi_error_handler
> __scsi_remove_device()
>   scsi_device_set_state(sdev, SDEV_CANCEL)
>                                            scsi_eh_flush_done_q()
>                                              if (scsi_device_online(sdev))
>                                                scsi_queue_insert(scmd,...)
>
> The command added by scsi_queue_insert() would never be handled any
> more.
>
> Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
> ---
>  include/scsi/scsi_device.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 75b2235b99e2..c498a12f7715 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -517,7 +517,8 @@ static inline int scsi_device_online(struct scsi_device *sdev)
>  {
>  	return (sdev->sdev_state != SDEV_OFFLINE &&
>  		sdev->sdev_state != SDEV_TRANSPORT_OFFLINE &&
> -		sdev->sdev_state != SDEV_DEL);
> +		sdev->sdev_state != SDEV_DEL &&
> +		sdev->sdev_state != SDEV_CANCEL);
>  }
>  static inline int scsi_device_blocked(struct scsi_device *sdev)
>  {

Returning false when sdev_state is SDEV_CANCEL seems to change some flows
in the error handler, and I am not sure whether we should introduce these
changes. I think it is acceptable either to finish the failed command or
to try more recovery steps.

For example, in scsi_eh_bus_device_reset(), when
scsi_try_bus_device_reset() returns SUCCESS but sdev_state is SDEV_CANCEL,
should we skip the TUR and just call scsi_eh_finish_cmd() to add this
LUN's failed commands to done_q?
We can address the IO hang described in this patch by running the
scsi_device's queue regardless of the scsi_device's state. That seems a
better solution, because the main cause of the IO hang is the following:

scsi_restart_operations()
  -> scsi_run_host_queues()
    -> shost_for_each_device()  // skips scsi_device in SDEV_DEL
                                // or SDEV_CANCEL state
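The call chain above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `mock_sdev`, `iterator_visits()`, `run_host_queues()`, `removal_would_block()` and the `ignore_state` flag are hypothetical names invented for illustration, and the "queue" is reduced to a counter. It models a host-wide queue run that skips devices in the CANCEL or DEL state, so a command requeued onto such a device is never dispatched and the removal path keeps waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a few scsi_device_state values (hypothetical). */
enum mock_sdev_state { MOCK_SDEV_RUNNING, MOCK_SDEV_CANCEL, MOCK_SDEV_DEL };

struct mock_sdev {
	enum mock_sdev_state sdev_state;
	int queued;	/* commands waiting in this device's queue */
};

/* Models the iteration rule from the thread: CANCEL and DEL devices are skipped. */
bool iterator_visits(const struct mock_sdev *sdev)
{
	return sdev->sdev_state != MOCK_SDEV_CANCEL &&
	       sdev->sdev_state != MOCK_SDEV_DEL;
}

/*
 * Models a host-wide queue run. With ignore_state set, every device's queue
 * is run regardless of state (the alternative proposed in the thread).
 * Returns the number of commands dispatched.
 */
int run_host_queues(struct mock_sdev *devs, size_t n, bool ignore_state)
{
	int dispatched = 0;

	for (size_t i = 0; i < n; i++) {
		if (!ignore_state && !iterator_visits(&devs[i]))
			continue;
		dispatched += devs[i].queued;
		devs[i].queued = 0;
	}
	return dispatched;
}

/* Stand-in for the freeze wait: removal blocks while commands are outstanding. */
bool removal_would_block(const struct mock_sdev *sdev)
{
	return sdev->queued > 0;
}

/* Walks through the scenario: one running device, one device being removed. */
void demo_stuck_command(void)
{
	struct mock_sdev devs[2] = {
		{ MOCK_SDEV_RUNNING, 1 },
		{ MOCK_SDEV_CANCEL, 1 },	/* command requeued by error handling */
	};

	/* State-aware run: only the running device's command is dispatched. */
	assert(run_host_queues(devs, 2, false) == 1);
	assert(removal_would_block(&devs[1]));

	/* State-agnostic run: the stuck command is dispatched too. */
	assert(run_host_queues(devs, 2, true) == 1);
	assert(!removal_would_block(&devs[1]));
}
```

In this model the hang comes from the queue run skipping the device, which is why running the queue regardless of state removes it without touching scsi_device_online().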
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 75b2235b99e2..c498a12f7715 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -517,7 +517,8 @@ static inline int scsi_device_online(struct scsi_device *sdev)
 {
 	return (sdev->sdev_state != SDEV_OFFLINE &&
 		sdev->sdev_state != SDEV_TRANSPORT_OFFLINE &&
-		sdev->sdev_state != SDEV_DEL);
+		sdev->sdev_state != SDEV_DEL &&
+		sdev->sdev_state != SDEV_CANCEL);
 }
 static inline int scsi_device_blocked(struct scsi_device *sdev)
 {
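To make the effect of this one-line change easier to see, here is a minimal userspace sketch. It is not kernel code: the enum mirrors only the state names that appear in the diff plus a running state, and `mock_sdev`, `online_before_patch()`, `online_after_patch()` and `check_patch_effect()` are hypothetical names used purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the states named in the diff (hypothetical enum). */
enum mock_sdev_state {
	MOCK_SDEV_RUNNING,
	MOCK_SDEV_CANCEL,
	MOCK_SDEV_DEL,
	MOCK_SDEV_OFFLINE,
	MOCK_SDEV_TRANSPORT_OFFLINE,
};

struct mock_sdev {
	enum mock_sdev_state sdev_state;
};

/* Predicate as it reads on the '-' side of the diff. */
bool online_before_patch(const struct mock_sdev *sdev)
{
	return sdev->sdev_state != MOCK_SDEV_OFFLINE &&
	       sdev->sdev_state != MOCK_SDEV_TRANSPORT_OFFLINE &&
	       sdev->sdev_state != MOCK_SDEV_DEL;
}

/* Predicate as it reads on the '+' side: CANCEL no longer counts as online. */
bool online_after_patch(const struct mock_sdev *sdev)
{
	return online_before_patch(sdev) &&
	       sdev->sdev_state != MOCK_SDEV_CANCEL;
}

/* The two predicates differ only for a device in the CANCEL state. */
void check_patch_effect(void)
{
	struct mock_sdev sdev = { MOCK_SDEV_CANCEL };

	assert(online_before_patch(&sdev));	/* would be requeued */
	assert(!online_after_patch(&sdev));	/* would not be requeued */

	sdev.sdev_state = MOCK_SDEV_RUNNING;
	assert(online_before_patch(&sdev) && online_after_patch(&sdev));
}
```

Every caller of scsi_device_online() sees the new result, not only scsi_eh_flush_done_q(), which is the concern raised in the follow-up about other error-handling flows.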
SDEV_CANCEL is set when removing device and scsi_device_online() should
return false if sdev_state is SDEV_CANCEL.

IO hang would be caused if return true when state is SDEV_CANCEL with
following order:

T1:                                        T2:scsi_error_handler
__scsi_remove_device()
  scsi_device_set_state(sdev, SDEV_CANCEL)
                                           scsi_eh_flush_done_q()
                                             if (scsi_device_online(sdev))
                                               scsi_queue_insert(scmd,...)

The command added by scsi_queue_insert() would never be handled any
more.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 include/scsi/scsi_device.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)