scsi: sd: Have scsi-ml retry START_STOP errors

Message ID	20240808034619.768289-1-liyihang9@huawei.com (mailing list archive)
State	Changes Requested
Headers	show Received: from szxga08-in.huawei.com (szxga08-in.huawei.com [45.249.212.255]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 23B5C65C; Thu, 8 Aug 2024 03:46:22 +0000 (UTC) From: Yihang Li <liyihang9@huawei.com> To: <James.Bottomley@HansenPartnership.com>, <martin.petersen@oracle.com> CC: <linux-scsi@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <prime.zeng@huawei.com>, <liyihang9@huawei.com>, <linuxarm@huawei.com> Subject: [PATCH] scsi: sd: Have scsi-ml retry START_STOP errors Date: Thu, 8 Aug 2024 11:46:19 +0800 Message-ID: <20240808034619.768289-1-liyihang9@huawei.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain
Series	scsi: sd: Have scsi-ml retry START_STOP errors \| expand scsi: sd: Have scsi-ml retry START_STOP errors

Message ID

20240808034619.768289-1-liyihang9@huawei.com (mailing list archive)

State

Changes Requested

Headers

From: Yihang Li <liyihang9@huawei.com>
To: <James.Bottomley@HansenPartnership.com>, <martin.petersen@oracle.com>
CC: <linux-scsi@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<prime.zeng@huawei.com>, <liyihang9@huawei.com>, <linuxarm@huawei.com>
Subject: [PATCH] scsi: sd: Have scsi-ml retry START_STOP errors
Date: Thu, 8 Aug 2024 11:46:19 +0800
Message-ID: <20240808034619.768289-1-liyihang9@huawei.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain

Series

scsi: sd: Have scsi-ml retry START_STOP errors | expand

Commit Message

Yihang Li Aug. 8, 2024, 3:46 a.m. UTC

When sending START_STOP commands to resume scsi_device, it may be
interrupted by exception operations such as host reset or PCI FLR. Once
the command of START_STOP is failed, the runtime_status of scsi device
will be error and it is difficult for user to recover it.

So try more retries to increase robustness as the process of command
START_STOP.

Signed-off-by: Yihang Li <liyihang9@huawei.com>
---
 drivers/scsi/sd.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Bart Van Assche Aug. 8, 2024, 5:20 p.m. UTC | #1

On 8/7/24 8:46 PM, Yihang Li wrote:
> When sending START_STOP commands to resume scsi_device, it may be
> interrupted by exception operations such as host reset or PCI FLR. Once
> the command of START_STOP is failed, the runtime_status of scsi device
> will be error and it is difficult for user to recover it.

How is the PCI FLR sent to the device? Shouldn't PCI FLRs only be
triggered by the SCSI LLD from inside an error handler callback? How can
a PCI FLR be triggered while a START STOP UNIT command is being
processed? Why can PCI FLRs only be triggered while a START STOP UNIT
command is being processed and not while any other command is being
processed?

> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 5cd88a8eea73..29f30407d713 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -4088,9 +4088,20 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start)
>   {
>   	unsigned char cmd[6] = { START_STOP };	/* START_VALID */
>   	struct scsi_sense_hdr sshdr;
> +	struct scsi_failure failure_defs[] = {
> +		{
> +			.allowed = 3,
> +			.result = SCMD_FAILURE_RESULT_ANY,
> +		},
> +		{}
> +	};
> +	struct scsi_failures failures = {
> +		.failure_definitions = failure_defs,
> +	};
>   	const struct scsi_exec_args exec_args = {
>   		.sshdr = &sshdr,
>   		.req_flags = BLK_MQ_REQ_PM,
> +		.failures = &failures,
>   	};
>   	struct scsi_device *sdp = sdkp->device;
>   	int res;

The above change makes the START STOP UNIT command to be retried
unconditionally. A START STOP UNIT command should not be retried
unconditionally.

Please take a look at the following patch series (posted yesterday):
https://lore.kernel.org/linux-scsi/20240807203215.2439244-1-bvanassche@acm.org/

Thanks,

Bart.

Yihang Li Aug. 9, 2024, 8:28 a.m. UTC | #2

On 2024/8/9 1:20, Bart Van Assche wrote:
> On 8/7/24 8:46 PM, Yihang Li wrote:
>> When sending START_STOP commands to resume scsi_device, it may be
>> interrupted by exception operations such as host reset or PCI FLR. Once
>> the command of START_STOP is failed, the runtime_status of scsi device
>> will be error and it is difficult for user to recover it.
> 
> How is the PCI FLR sent to the device? Shouldn't PCI FLRs only be
> triggered by the SCSI LLD from inside an error handler callback? How can
> a PCI FLR be triggered while a START STOP UNIT command is being
> processed? Why can PCI FLRs only be triggered while a START STOP UNIT
> command is being processed and not while any other command is being
> processed?

The PCI FLR mentioned in my description is not sent to the SCSI device,
but to the SCSI host.

When the START STOP UNIT command is submitted to the SCSI device,
the command is sent to the SCSI device through the SCSI host.
If the user triggers the PCI FLR through the sysfs interface or
the host is reset due to an error, the START STOP UNIT command
(other commands are the same) is interrupted. So I think the command
needs to be retried.

> 
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index 5cd88a8eea73..29f30407d713 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -4088,9 +4088,20 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start)
>>   {
>>       unsigned char cmd[6] = { START_STOP };    /* START_VALID */
>>       struct scsi_sense_hdr sshdr;
>> +    struct scsi_failure failure_defs[] = {
>> +        {
>> +            .allowed = 3,
>> +            .result = SCMD_FAILURE_RESULT_ANY,
>> +        },
>> +        {}
>> +    };
>> +    struct scsi_failures failures = {
>> +        .failure_definitions = failure_defs,
>> +    };
>>       const struct scsi_exec_args exec_args = {
>>           .sshdr = &sshdr,
>>           .req_flags = BLK_MQ_REQ_PM,
>> +        .failures = &failures,
>>       };
>>       struct scsi_device *sdp = sdkp->device;
>>       int res;
> 
> The above change makes the START STOP UNIT command to be retried
> unconditionally. A START STOP UNIT command should not be retried
> unconditionally.
> 
> Please take a look at the following patch series (posted yesterday):
> https://lore.kernel.org/linux-scsi/20240807203215.2439244-1-bvanassche@acm.org/

I will reconsider the retry of the START STOP UNIT command, and try this patch series as well.

Thanks,

Yihang.

Bart Van Assche Aug. 9, 2024, 6:25 p.m. UTC | #3

On 8/9/24 1:28 AM, Yihang Li wrote:
> If the user triggers the PCI FLR through the sysfs interface or
> the host is reset due to an error, the START STOP UNIT command
> (other commands are the same) is interrupted. So I think the command
> needs to be retried.
Hmm ... aren't operations that bypass the SCSI core considered out of
scope for SCSI core fixes?

Thanks,

Bart.

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 5cd88a8eea73..29f30407d713 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -4088,9 +4088,20 @@  static int sd_start_stop_device(struct scsi_disk *sdkp, int start)
 {
 	unsigned char cmd[6] = { START_STOP };	/* START_VALID */
 	struct scsi_sense_hdr sshdr;
+	struct scsi_failure failure_defs[] = {
+		{
+			.allowed = 3,
+			.result = SCMD_FAILURE_RESULT_ANY,
+		},
+		{}
+	};
+	struct scsi_failures failures = {
+		.failure_definitions = failure_defs,
+	};
 	const struct scsi_exec_args exec_args = {
 		.sshdr = &sshdr,
 		.req_flags = BLK_MQ_REQ_PM,
+		.failures = &failures,
 	};
 	struct scsi_device *sdp = sdkp->device;
 	int res;

scsi: sd: Have scsi-ml retry START_STOP errors

Commit Message

Comments

Patch