[1/2] scsi: core: scsi_device_online() return false if state is SDEV_CANCEL

Message ID 20230922093636.2645961-2-haowenchao2@huawei.com (mailing list archive)
State Changes Requested
Series Fix two issue between removing device and error handle

Commit Message

Wenchao Hao Sept. 22, 2023, 9:36 a.m. UTC
SDEV_CANCEL is set when a device is being removed, so scsi_device_online()
should return false if sdev_state is SDEV_CANCEL.

If it returns true while the state is SDEV_CANCEL, an IO hang can occur
with the following ordering:

T1:                                         T2: scsi_error_handler
__scsi_remove_device()
  scsi_device_set_state(sdev, SDEV_CANCEL)
                                            scsi_eh_flush_done_q()
                                            if (scsi_device_online(sdev))
                                              scsi_queue_insert(scmd,...)

The command added by scsi_queue_insert() would never be handled again.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 include/scsi/scsi_device.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Bart Van Assche Sept. 22, 2023, 3:23 p.m. UTC | #1
On 9/22/23 02:36, Wenchao Hao wrote:
> SDEV_CANCEL is set when removing device and scsi_device_online() should
> return false if sdev_state is SDEV_CANCEL.
> 
> IO hang would be caused if return true when state is SDEV_CANCEL with
> following order:
> 
> T1:					    T2:scsi_error_handler
> __scsi_remove_device()
>    scsi_device_set_state(sdev, SDEV_CANCEL)
>    					    scsi_eh_flush_done_q()
> 					    if (scsi_device_online(sdev))
> 					      scsi_queue_insert(scmd,...)
> 
> The command added by scsi_queue_insert() would never be handled any
> more.

Why not? I think the blk_mq_destroy_queue() call in 
__scsi_remove_device() will cause it to fail.

Thanks,

Bart.
Wenchao Hao Sept. 24, 2023, 6:37 a.m. UTC | #2
On 2023/9/22 23:23, Bart Van Assche wrote:
> On 9/22/23 02:36, Wenchao Hao wrote:
>> SDEV_CANCEL is set when removing device and scsi_device_online() should
>> return false if sdev_state is SDEV_CANCEL.
>>
>> IO hang would be caused if return true when state is SDEV_CANCEL with
>> following order:
>>
>> T1:                        T2:scsi_error_handler
>> __scsi_remove_device()
>>    scsi_device_set_state(sdev, SDEV_CANCEL)
>>                            scsi_eh_flush_done_q()
>>                         if (scsi_device_online(sdev))
>>                           scsi_queue_insert(scmd,...)
>>
>> The command added by scsi_queue_insert() would never be handled any
>> more.
> 
> Why not? I think the blk_mq_destroy_queue() call in __scsi_remove_device() will cause it to fail.
> 
> Thanks,
> 
> Bart.
> 

Sorry, I did not describe this in detail. __scsi_remove_device() would block
in blk_mq_freeze_queue_wait(), waiting for all block requests to finish, so
blk_mq_destroy_queue() would never be called and the task trying to remove
the scsi_device would hang.
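
The dependency described above can be sketched as a small userspace model.
This is only an illustration, not kernel code: it assumes the freeze wait
returns only once every outstanding request has finished, and every mock_*
name below is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a request queue: counts unfinished requests. */
struct mock_queue {
	int outstanding;
};

static void mock_request_start(struct mock_queue *q)
{
	q->outstanding++;
}

static void mock_request_complete(struct mock_queue *q)
{
	q->outstanding--;
}

/* Assumption: the freeze wait can return only when nothing is outstanding. */
static bool mock_freeze_wait_would_return(const struct mock_queue *q)
{
	return q->outstanding == 0;
}

/*
 * eh_finishes == true:  the error handler completes the failed command.
 * eh_finishes == false: the command is requeued and the queue is never
 *                       run again, so the command never completes.
 * Returns whether device removal could get past the freeze wait.
 */
static bool mock_removal_can_proceed(bool eh_finishes)
{
	struct mock_queue q = { 0 };

	mock_request_start(&q);
	if (eh_finishes)
		mock_request_complete(&q);

	return mock_freeze_wait_would_return(&q);
}
```

In this model, removal proceeds only when the error handler finishes the
command. A requeued command that is never dispatched keeps the wait from
returning.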
Wenchao Hao Sept. 25, 2023, 3:02 p.m. UTC | #3
On 2023/9/22 17:36, Wenchao Hao wrote:
> SDEV_CANCEL is set when removing device and scsi_device_online() should
> return false if sdev_state is SDEV_CANCEL.
> 
> IO hang would be caused if return true when state is SDEV_CANCEL with
> following order:
> 
> T1:					    T2:scsi_error_handler
> __scsi_remove_device()
>    scsi_device_set_state(sdev, SDEV_CANCEL)
>    					    scsi_eh_flush_done_q()
> 					    if (scsi_device_online(sdev))
> 					      scsi_queue_insert(scmd,...)
> 
> The command added by scsi_queue_insert() would never be handled any
> more.
> 
> Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
> ---
>   include/scsi/scsi_device.h | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 75b2235b99e2..c498a12f7715 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -517,7 +517,8 @@ static inline int scsi_device_online(struct scsi_device *sdev)
>   {
>   	return (sdev->sdev_state != SDEV_OFFLINE &&
>   		sdev->sdev_state != SDEV_TRANSPORT_OFFLINE &&
> -		sdev->sdev_state != SDEV_DEL);
> +		sdev->sdev_state != SDEV_DEL &&
> +		sdev->sdev_state != SDEV_CANCEL);
>   }
>   static inline int scsi_device_blocked(struct scsi_device *sdev)
>   {

Returning false when sdev_state is SDEV_CANCEL seems to change some flows in
error handling, and I don't know whether we should introduce these changes.
I think either finishing the failed command or trying more recovery steps
would be OK.

For example, in scsi_eh_bus_device_reset(), when scsi_try_bus_device_reset()
has returned SUCCEED but sdev_state is SDEV_CANCEL, should we skip the TUR and
just call scsi_eh_finish_cmd() to add this LUN's error command to done_q?

We can address the IO hang described in this patch by running the
scsi_device's queue regardless of the scsi_device's state. That seems to be
a better solution, because the main reason for the IO hang is the following:

scsi_restart_operations()
	-> scsi_run_host_queues()
		-> shost_for_each_device() // skip scsi_device with SDEV_DEL
					   // or SDEV_CANCEL state
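
The effect of that skip can be sketched with a userspace model. It is a
simplification under stated assumptions: the iterator is modelled as
refusing devices in the CANCEL or DEL state, as the comment in the call
chain says, and all mock_* names are hypothetical rather than kernel APIs.

```c
#include <stdbool.h>
#include <stddef.h>

enum mock_state { MOCK_RUNNING, MOCK_CANCEL, MOCK_DEL };

struct mock_sdev {
	enum mock_state state;
	int requeued;   /* commands put back by the error handler */
	int dispatched; /* commands handed out again by a queue run */
};

/* Stand-in for the iterator's device grab: refuses CANCEL and DEL. */
static bool mock_device_get(const struct mock_sdev *sdev)
{
	return sdev->state != MOCK_CANCEL && sdev->state != MOCK_DEL;
}

/* Running a device queue dispatches everything that was requeued. */
static void mock_run_queue(struct mock_sdev *sdev)
{
	sdev->dispatched += sdev->requeued;
	sdev->requeued = 0;
}

/* skip_dying == true models the behaviour described in the call chain. */
static void mock_run_host_queues(struct mock_sdev *devs, size_t n,
				 bool skip_dying)
{
	for (size_t i = 0; i < n; i++) {
		if (skip_dying && !mock_device_get(&devs[i]))
			continue;
		mock_run_queue(&devs[i]);
	}
}

/* Returns how many requeued commands are still waiting after one rerun. */
static int mock_stranded_after_rerun(bool skip_dying)
{
	struct mock_sdev devs[2] = {
		{ MOCK_RUNNING, 1, 0 },
		{ MOCK_CANCEL, 1, 0 },
	};

	mock_run_host_queues(devs, 2, skip_dying);
	return devs[0].requeued + devs[1].requeued;
}
```

With the skip in place, the command requeued on the CANCEL device stays
stranded. Running every queue regardless of state leaves nothing behind in
this model.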

Patch

diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 75b2235b99e2..c498a12f7715 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -517,7 +517,8 @@  static inline int scsi_device_online(struct scsi_device *sdev)
 {
 	return (sdev->sdev_state != SDEV_OFFLINE &&
 		sdev->sdev_state != SDEV_TRANSPORT_OFFLINE &&
-		sdev->sdev_state != SDEV_DEL);
+		sdev->sdev_state != SDEV_DEL &&
+		sdev->sdev_state != SDEV_CANCEL);
 }
 static inline int scsi_device_blocked(struct scsi_device *sdev)
 {
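
To make the behavioural change concrete, the predicate before and after
this patch can be compared in a standalone sketch. The enum below is a
local mock, not the kernel's definition: only the state names are taken
from the kernel, and the numeric values and the mock_* helpers are
assumptions made for illustration.

```c
#include <stdbool.h>

/* Local mock of the device states; the numeric values are not significant. */
enum mock_sdev_state {
	MOCK_SDEV_CREATED = 1,
	MOCK_SDEV_RUNNING,
	MOCK_SDEV_CANCEL,
	MOCK_SDEV_DEL,
	MOCK_SDEV_QUIESCE,
	MOCK_SDEV_OFFLINE,
	MOCK_SDEV_TRANSPORT_OFFLINE,
	MOCK_SDEV_BLOCK,
	MOCK_SDEV_CREATED_BLOCK,
};

/* The check as it reads before the patch. */
static bool mock_online_before(enum mock_sdev_state s)
{
	return s != MOCK_SDEV_OFFLINE &&
	       s != MOCK_SDEV_TRANSPORT_OFFLINE &&
	       s != MOCK_SDEV_DEL;
}

/* The check with the patch applied: CANCEL is also treated as offline. */
static bool mock_online_after(enum mock_sdev_state s)
{
	return mock_online_before(s) && s != MOCK_SDEV_CANCEL;
}

/* Counts the mock states for which the two versions disagree. */
static int mock_count_changed_states(void)
{
	int changed = 0;

	for (int s = MOCK_SDEV_CREATED; s <= MOCK_SDEV_CREATED_BLOCK; s++) {
		enum mock_sdev_state st = (enum mock_sdev_state)s;

		if (mock_online_before(st) != mock_online_after(st))
			changed++;
	}
	return changed;
}
```

Within this mock, SDEV_CANCEL is the only state whose result changes, so
every caller of the check sees a device under removal as offline once the
patch is applied.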