diff mbox series

scsi: core: always finish successfully aborted notry command

Message ID 20220401035020.1043239-1-haowenchao@huawei.com (mailing list archive)
State Changes Requested
Headers show
Series scsi: core: always finish successfully aborted notry command | expand

Commit Message

Wenchao Hao April 1, 2022, 3:50 a.m. UTC
If the abort command succeed and it does not need to be retired, do not add
it to error handle list. 

Adding command to error handle list is an annoying flow which would stop I/O
of all LUNs which shared a same HBA.

So here if we successfully abort a command, we can finish it via
scsi_finish_command() which would reduce time spent on scsi_error_handler()

Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
---
 drivers/scsi/scsi_error.c | 55 +++++++++++++++++++++------------------
 1 file changed, 29 insertions(+), 26 deletions(-)

Comments

Mike Christie April 3, 2022, 4:58 p.m. UTC | #1
On 3/31/22 10:50 PM, Wenchao Hao wrote:
> If the abort command succeed and it does not need to be retired, do not add
> it to error handle list. 
> 
> Adding command to error handle list is an annoying flow which would stop I/O
> of all LUNs which shared a same HBA.
> 
> So here if we successfully abort a command, we can finish it via
> scsi_finish_command() which would reduce time spent on scsi_error_handler()
> 
> Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
> ---
>  drivers/scsi/scsi_error.c | 55 +++++++++++++++++++++------------------
>  1 file changed, 29 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index cdaca13ac1f1..15299603b7ee 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -173,41 +173,44 @@ scmd_eh_abort_handler(struct work_struct *work)
>  		goto out;
>  	}
>  	set_host_byte(scmd, DID_TIME_OUT);
> -	if (scsi_host_eh_past_deadline(shost)) {
> -		SCSI_LOG_ERROR_RECOVERY(3,
> -			scmd_printk(KERN_INFO, scmd,
> -				    "eh timeout, not retrying "
> -				    "aborted command\n"));
> -		goto out;
> -	}
>  
> -	spin_lock_irqsave(shost->host_lock, flags);
> -	list_del_init(&scmd->eh_entry);
> +	if (scsi_noretry_cmd(scmd) ||
> +	    !scsi_cmd_retry_allowed(scmd) &&
> +	    !scsi_eh_should_retry_cmd(scmd)) {


I don't think this test is correct. Did you want all ||s?

For what the patch is trying to accomplish I'm not sure if it's
the behavior people wanted. Do all drivers have configurable
abort timeouts? If an abort takes N minutes to complete ok,
and that's put us over the eh deadline maybe the user wanted
that device to be offlined.
Wenchao Hao April 6, 2022, 6:56 a.m. UTC | #2
On 2022/4/4 0:58, Mike Christie wrote:
> On 3/31/22 10:50 PM, Wenchao Hao wrote:
>> If the abort command succeed and it does not need to be retired, do not add
>> it to error handle list. 
>>
>> Adding command to error handle list is an annoying flow which would stop I/O
>> of all LUNs which shared a same HBA.
>>
>> So here if we successfully abort a command, we can finish it via
>> scsi_finish_command() which would reduce time spent on scsi_error_handler()
>>
>> Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
>> ---
>>  drivers/scsi/scsi_error.c | 55 +++++++++++++++++++++------------------
>>  1 file changed, 29 insertions(+), 26 deletions(-)
>>
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index cdaca13ac1f1..15299603b7ee 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -173,41 +173,44 @@ scmd_eh_abort_handler(struct work_struct *work)
>>  		goto out;
>>  	}
>>  	set_host_byte(scmd, DID_TIME_OUT);
>> -	if (scsi_host_eh_past_deadline(shost)) {
>> -		SCSI_LOG_ERROR_RECOVERY(3,
>> -			scmd_printk(KERN_INFO, scmd,
>> -				    "eh timeout, not retrying "
>> -				    "aborted command\n"));
>> -		goto out;
>> -	}
>>  
>> -	spin_lock_irqsave(shost->host_lock, flags);
>> -	list_del_init(&scmd->eh_entry);
>> +	if (scsi_noretry_cmd(scmd) ||
>> +	    !scsi_cmd_retry_allowed(scmd) &&
>> +	    !scsi_eh_should_retry_cmd(scmd)) {
> 
> 
> I don't think this test is correct. Did you want all ||s?

Sorry, I made a mistake, it should be "||" here.

> 
> For what the patch is trying to accomplish I'm not sure if it's
> the behavior people wanted. Do all drivers have configurable
> abort timeouts? If an abort takes N minutes to complete ok,
> and that's put us over the eh deadline maybe the user wanted
> that device to be offlined.
> .
> 

I think the changes would not affect the timeouts and deadline
because the logic changed is after command abort successfully.

We check if the command need to be retired after aborting command
successfully, if it does not need to retry, just finish it (only
commands aborted failed should be add to error handle list).

I submit this change because we want to reduce the time spent on 
scsi_error_handler().

Do you think the change is OK? Except previous wrong "&&" which should
be "||".

Thanks, looking forward to hearing from you.
Mike Christie April 6, 2022, 4:08 p.m. UTC | #3
On 4/6/22 1:56 AM, Wenchao Hao wrote:
> On 2022/4/4 0:58, Mike Christie wrote:
>> On 3/31/22 10:50 PM, Wenchao Hao wrote:
>>> If the abort command succeed and it does not need to be retired, do not add
>>> it to error handle list. 
>>>
>>> Adding command to error handle list is an annoying flow which would stop I/O
>>> of all LUNs which shared a same HBA.
>>>
>>> So here if we successfully abort a command, we can finish it via
>>> scsi_finish_command() which would reduce time spent on scsi_error_handler()
>>>
>>> Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
>>> ---
>>>  drivers/scsi/scsi_error.c | 55 +++++++++++++++++++++------------------
>>>  1 file changed, 29 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>>> index cdaca13ac1f1..15299603b7ee 100644
>>> --- a/drivers/scsi/scsi_error.c
>>> +++ b/drivers/scsi/scsi_error.c
>>> @@ -173,41 +173,44 @@ scmd_eh_abort_handler(struct work_struct *work)
>>>  		goto out;
>>>  	}
>>>  	set_host_byte(scmd, DID_TIME_OUT);
>>> -	if (scsi_host_eh_past_deadline(shost)) {
>>> -		SCSI_LOG_ERROR_RECOVERY(3,
>>> -			scmd_printk(KERN_INFO, scmd,
>>> -				    "eh timeout, not retrying "
>>> -				    "aborted command\n"));
>>> -		goto out;
>>> -	}
>>>  
>>> -	spin_lock_irqsave(shost->host_lock, flags);
>>> -	list_del_init(&scmd->eh_entry);
>>> +	if (scsi_noretry_cmd(scmd) ||
>>> +	    !scsi_cmd_retry_allowed(scmd) &&
>>> +	    !scsi_eh_should_retry_cmd(scmd)) {
>>
>>
>> I don't think this test is correct. Did you want all ||s?
> 
> Sorry, I made a mistake, it should be "||" here.
> 
>>
>> For what the patch is trying to accomplish I'm not sure if it's
>> the behavior people wanted. Do all drivers have configurable
>> abort timeouts? If an abort takes N minutes to complete ok,
>> and that's put us over the eh deadline maybe the user wanted
>> that device to be offlined.
>> .
>>
> 
> I think the changes would not affect the timeouts and deadline
> because the logic changed is after command abort successfully.
> 
> We check if the command need to be retired after aborting command
> successfully, if it does not need to retry, just finish it (only
> commands aborted failed should be add to error handle list).

I'm saying an abort can complete but the device could be flakey.
The user could have set the deadline for 5 minutes. If the abort
took 5 minutes and 1 second then the user might want that device
set to offline, because retrying that device could take another 5
minutes. Instead of waiting for N retries they might have just
wanted all to that device failed.

> 
> I submit this change because we want to reduce the time spent on 
> scsi_error_handler().

I get that and I agree users do not want to wait for the entire
host to be blocked. I'm just saying that it might change behavior
that users wanted where the device will be offlined. I'm not 100% sure
though, so it's better for maybe Hannes and others to chime in.

Fixing up the EH to work in a way that does not require the entire
host to be blocked like in your other post is best. It would
give you want you want, and preserve the existing behavior and also
help other users.
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index cdaca13ac1f1..15299603b7ee 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -173,41 +173,44 @@  scmd_eh_abort_handler(struct work_struct *work)
 		goto out;
 	}
 	set_host_byte(scmd, DID_TIME_OUT);
-	if (scsi_host_eh_past_deadline(shost)) {
-		SCSI_LOG_ERROR_RECOVERY(3,
-			scmd_printk(KERN_INFO, scmd,
-				    "eh timeout, not retrying "
-				    "aborted command\n"));
-		goto out;
-	}
 
-	spin_lock_irqsave(shost->host_lock, flags);
-	list_del_init(&scmd->eh_entry);
+	if (scsi_noretry_cmd(scmd) ||
+	    !scsi_cmd_retry_allowed(scmd) &&
+	    !scsi_eh_should_retry_cmd(scmd)) {
+		spin_lock_irqsave(shost->host_lock, flags);
+		list_del_init(&scmd->eh_entry);
 
-	/*
-	 * If the abort succeeds, and there is no further
-	 * EH action, clear the ->last_reset time.
-	 */
-	if (list_empty(&shost->eh_abort_list) &&
-	    list_empty(&shost->eh_cmd_q))
-		if (shost->eh_deadline != -1)
-			shost->last_reset = 0;
+		/*
+		 * If the abort succeeds, and there is no further
+		 * EH action, clear the ->last_reset time.
+		 */
+		if (list_empty(&shost->eh_abort_list) &&
+		    list_empty(&shost->eh_cmd_q))
+			if (shost->eh_deadline != -1)
+				shost->last_reset = 0;
 
-	spin_unlock_irqrestore(shost->host_lock, flags);
+		spin_unlock_irqrestore(shost->host_lock, flags);
 
-	if (!scsi_noretry_cmd(scmd) &&
-	    scsi_cmd_retry_allowed(scmd) &&
-	    scsi_eh_should_retry_cmd(scmd)) {
-		SCSI_LOG_ERROR_RECOVERY(3,
-			scmd_printk(KERN_WARNING, scmd,
-				    "retry aborted command\n"));
-		scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
-	} else {
 		SCSI_LOG_ERROR_RECOVERY(3,
 			scmd_printk(KERN_WARNING, scmd,
 				    "finish aborted command\n"));
 		scsi_finish_command(scmd);
+		return;
 	}
+
+	if (scsi_host_eh_past_deadline(shost)) {
+		SCSI_LOG_ERROR_RECOVERY(3,
+			scmd_printk(KERN_INFO, scmd,
+				    "eh timeout, not retrying "
+				    "aborted command\n"));
+		goto out;
+	}
+
+	SCSI_LOG_ERROR_RECOVERY(3,
+		scmd_printk(KERN_WARNING, scmd,
+			    "retry aborted command\n"));
+	scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
+
 	return;
 
 out: