[v10,2/2] ufs: core: requeue aborted request

Message ID	20241001091917.6917-3-peter.wang@mediatek.com (mailing list archive)
State	Accepted
Headers	show Received: from mailgw01.mediatek.com (unknown [60.244.123.138]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 105C019D884; Tue, 1 Oct 2024 09:19:27 +0000 (UTC) From: <peter.wang@mediatek.com> To: <linux-scsi@vger.kernel.org>, <martin.petersen@oracle.com>, <avri.altman@wdc.com>, <alim.akhtar@samsung.com>, <jejb@linux.ibm.com> CC: <wsd_upstream@mediatek.com>, <linux-mediatek@lists.infradead.org>, <peter.wang@mediatek.com>, <chun-hung.wu@mediatek.com>, <alice.chao@mediatek.com>, <cc.chou@mediatek.com>, <chaotian.jing@mediatek.com>, <jiajie.hao@mediatek.com>, <powen.kao@mediatek.com>, <qilin.tan@mediatek.com>, <lin.gui@mediatek.com>, <tun-yu.yu@mediatek.com>, <eddie.huang@mediatek.com>, <naomi.chu@mediatek.com>, <ed.tsai@mediatek.com>, <bvanassche@acm.org>, <quic_nguyenb@quicinc.com>, <stable@vger.kernel.org> Subject: [PATCH v10 2/2] ufs: core: requeue aborted request Date: Tue, 1 Oct 2024 17:19:17 +0800 Message-ID: <20241001091917.6917-3-peter.wang@mediatek.com> In-Reply-To: <20241001091917.6917-1-peter.wang@mediatek.com> References: <20241001091917.6917-1-peter.wang@mediatek.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain
Series	fix abort defect \| expand [v10,0/2] fix abort defect [v10,1/2] ufs: core: fix the issue of ICU failure [v10,2/2] ufs: core: requeue aborted request

Peter Wang (王信友) Oct. 1, 2024, 9:19 a.m. UTC

From: Peter Wang <peter.wang@mediatek.com>

After the SQ cleanup fix, the CQ will receive a response with
the corresponding tag marked as OCS: ABORTED. To align with
the behavior of Legacy SDB mode, the handling of OCS: ABORTED
has been changed to match that of OCS_INVALID_COMMAND_STATUS
(SDB), with both returning a SCSI result of DID_REQUEUE.

Furthermore, the workaround implemented before the SQ cleanup
fix can be removed.

Fixes: ab248643d3d6 ("scsi: ufs: core: Add error handling for MCQ mode")
Cc: stable@vger.kernel.org
Signed-off-by: Peter Wang <peter.wang@mediatek.com>
---
 drivers/ufs/core/ufshcd.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

Bart Van Assche Oct. 1, 2024, 5:13 p.m. UTC | #1

On 10/1/24 2:19 AM, peter.wang@mediatek.com wrote:
> From: Peter Wang <peter.wang@mediatek.com>
> 
> After the SQ cleanup fix, the CQ will receive a response with
> the corresponding tag marked as OCS: ABORTED. To align with
> the behavior of Legacy SDB mode, the handling of OCS: ABORTED
> has been changed to match that of OCS_INVALID_COMMAND_STATUS
> (SDB), with both returning a SCSI result of DID_REQUEUE.
> 
> Furthermore, the workaround implemented before the SQ cleanup
> fix can be removed.
> 
> Fixes: ab248643d3d6 ("scsi: ufs: core: Add error handling for MCQ mode")
> Cc: stable@vger.kernel.org
> Signed-off-by: Peter Wang <peter.wang@mediatek.com>
> ---
>   drivers/ufs/core/ufshcd.c | 20 ++++----------------
>   1 file changed, 4 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> index 24a32e2fd75e..8e2a7889a565 100644
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -5417,10 +5417,12 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp,
>   		}
>   		break;
>   	case OCS_ABORTED:
> -		result |= DID_ABORT << 16;
> -		break;
>   	case OCS_INVALID_COMMAND_STATUS:
>   		result |= DID_REQUEUE << 16;
> +		dev_warn(hba->dev,
> +				"OCS %s from controller for tag %d\n",
> +				(ocs == OCS_ABORTED? "aborted" : "invalid"),
> +				lrbp->task_tag);
>   		break;
>   	case OCS_INVALID_CMD_TABLE_ATTR:
>   	case OCS_INVALID_PRDT_ATTR:
> @@ -6466,26 +6468,12 @@ static bool ufshcd_abort_one(struct request *rq, void *priv)
>   	struct scsi_device *sdev = cmd->device;
>   	struct Scsi_Host *shost = sdev->host;
>   	struct ufs_hba *hba = shost_priv(shost);
> -	struct ufshcd_lrb *lrbp = &hba->lrb[tag];
> -	struct ufs_hw_queue *hwq;
> -	unsigned long flags;
>   
>   	*ret = ufshcd_try_to_abort_task(hba, tag);
>   	dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag,
>   		hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1,
>   		*ret ? "failed" : "succeeded");
>   
> -	/* Release cmd in MCQ mode if abort succeeds */
> -	if (hba->mcq_enabled && (*ret == 0)) {
> -		hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp->cmd));
> -		if (!hwq)
> -			return 0;
> -		spin_lock_irqsave(&hwq->cq_lock, flags);
> -		if (ufshcd_cmd_inflight(lrbp->cmd))
> -			ufshcd_release_scsi_cmd(hba, lrbp);
> -		spin_unlock_irqrestore(&hwq->cq_lock, flags);
> -	}
> -
>   	return *ret == 0;
>   }

As mentioned before, ufshcd_try_to_abort_task() cannot handle concurrent
scsi_done() calls. ufshcd_abort_one() calls ufshcd_try_to_abort_task()
without even trying to prevent that scsi_done() is called concurrently. 
Since this could result in a kernel crash, I think that it is important 
that this gets fixed, even if it requires modifying the SCSI core.

Bart.

Peter Wang (王信友) Oct. 2, 2024, 12:42 p.m. UTC | #2

On Tue, 2024-10-01 at 10:13 -0700, Bart Van Assche wrote:
>  	 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  On 10/1/24 2:19 AM, peter.wang@mediatek.com wrote:
> > From: Peter Wang <peter.wang@mediatek.com>
> > 
> > After the SQ cleanup fix, the CQ will receive a response with
> > the corresponding tag marked as OCS: ABORTED. To align with
> > the behavior of Legacy SDB mode, the handling of OCS: ABORTED
> > has been changed to match that of OCS_INVALID_COMMAND_STATUS
> > (SDB), with both returning a SCSI result of DID_REQUEUE.
> > 
> > Furthermore, the workaround implemented before the SQ cleanup
> > fix can be removed.
> > 
> > Fixes: ab248643d3d6 ("scsi: ufs: core: Add error handling for MCQ
> mode")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Peter Wang <peter.wang@mediatek.com>
> > ---
> >   drivers/ufs/core/ufshcd.c | 20 ++++----------------
> >   1 file changed, 4 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> > index 24a32e2fd75e..8e2a7889a565 100644
> > --- a/drivers/ufs/core/ufshcd.c
> > +++ b/drivers/ufs/core/ufshcd.c
> > @@ -5417,10 +5417,12 @@ ufshcd_transfer_rsp_status(struct ufs_hba
> *hba, struct ufshcd_lrb *lrbp,
> >   }
> >   break;
> >   case OCS_ABORTED:
> > -result |= DID_ABORT << 16;
> > -break;
> >   case OCS_INVALID_COMMAND_STATUS:
> >   result |= DID_REQUEUE << 16;
> > +dev_warn(hba->dev,
> > +"OCS %s from controller for tag %d\n",
> > +(ocs == OCS_ABORTED? "aborted" : "invalid"),
> > +lrbp->task_tag);
> >   break;
> >   case OCS_INVALID_CMD_TABLE_ATTR:
> >   case OCS_INVALID_PRDT_ATTR:
> > @@ -6466,26 +6468,12 @@ static bool ufshcd_abort_one(struct request
> *rq, void *priv)
> >   struct scsi_device *sdev = cmd->device;
> >   struct Scsi_Host *shost = sdev->host;
> >   struct ufs_hba *hba = shost_priv(shost);
> > -struct ufshcd_lrb *lrbp = &hba->lrb[tag];
> > -struct ufs_hw_queue *hwq;
> > -unsigned long flags;
> >   
> >   *ret = ufshcd_try_to_abort_task(hba, tag);
> >   dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag,
> >   hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1,
> >   *ret ? "failed" : "succeeded");
> >   
> > -/* Release cmd in MCQ mode if abort succeeds */
> > -if (hba->mcq_enabled && (*ret == 0)) {
> > -hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp->cmd));
> > -if (!hwq)
> > -return 0;
> > -spin_lock_irqsave(&hwq->cq_lock, flags);
> > -if (ufshcd_cmd_inflight(lrbp->cmd))
> > -ufshcd_release_scsi_cmd(hba, lrbp);
> > -spin_unlock_irqrestore(&hwq->cq_lock, flags);
> > -}
> > -
> >   return *ret == 0;
> >   }
> 
> As mentioned before, ufshcd_try_to_abort_task() cannot handle
> concurrent
> scsi_done() calls. ufshcd_abort_one() calls
> ufshcd_try_to_abort_task()
> without even trying to prevent that scsi_done() is called
> concurrently. 
> Since this could result in a kernel crash, I think that it is
> important 
> that this gets fixed, even if it requires modifying the SCSI core.
> 
> Bart.
> 
> 

Hi Bart,

This patch merely aligns with the approach of SDB mode 
and does not involve the flow of scsi_done. Besides, 
I don't see any issue with concurrency between 
ufshcd_abort_one() calling ufshcd_try_to_abort_task() 
and scsi_done(). Can you point out the specific flow where 
the problem occurs? If there is one, shouldn't SDB mode
have the same issue?

Thanks
Peter

Bart Van Assche Oct. 3, 2024, 8:02 p.m. UTC | #3

On 10/2/24 5:42 AM, Peter Wang (王信友) wrote:
> This patch merely aligns with the approach of SDB mode
> and does not involve the flow of scsi_done. Besides,
> I don't see any issue with concurrency between
> ufshcd_abort_one() calling ufshcd_try_to_abort_task()
> and scsi_done(). Can you point out the specific flow where
> the problem occurs? If there is one, shouldn't SDB mode
> have the same issue?

Hi Peter,

Correct, my comment applies to both legacy mode and MCQ mode. From the 
section in the UFS standard about ABORT TASK: "A response of FUNCTION
COMPLETE shall indicate that the command was aborted or was not in the 
task set." In other words, if a command completes just before 
ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then
ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command
that has already completed. In legacy mode, this call will succeed.
Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call
ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to be
decremented twice instead of only once. Do you agree that this can
happen and also that it should be prevented that this happens?

Thanks,

Bart.

Peter Wang (王信友) Oct. 7, 2024, 7:20 a.m. UTC | #4

On Thu, 2024-10-03 at 13:02 -0700, Bart Van Assche wrote:
>  	 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  On 10/2/24 5:42 AM, Peter Wang (王信友) wrote:
> > This patch merely aligns with the approach of SDB mode
> > and does not involve the flow of scsi_done. Besides,
> > I don't see any issue with concurrency between
> > ufshcd_abort_one() calling ufshcd_try_to_abort_task()
> > and scsi_done(). Can you point out the specific flow where
> > the problem occurs? If there is one, shouldn't SDB mode
> > have the same issue?
> 
> Hi Peter,
> 
> Correct, my comment applies to both legacy mode and MCQ mode. From
> the 
> section in the UFS standard about ABORT TASK: "A response of FUNCTION
> COMPLETE shall indicate that the command was aborted or was not in
> the 
> task set." In other words, if a command completes just before 
> ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then
> ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command
> that has already completed. In legacy mode, this call will succeed.
> 

Hi Bart,

Yes, the legacy SDB mode is protected by the outstanding_lock.


> Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call
> ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to
> be
> decremented twice instead of only once. Do you agree that this can
> happen and also that it should be prevented that this happens?
> 
> Thanks,
> 
> Bart.

Sorry, I still don't understand why both ufshcd_compl_one_cqe() 
and ufshcd_abort_all() will call ufshcd_release(hba)? 
Because I have already removed the ufshcd_release_scsi_cmd from 
ufshcd_abort_one, so the command won't be released immediately 
after ufshcd_try_to_abort_task succeeds. Instead, it will wait 
until the CQ Entry comes in before releasing. And since it is 
protected by the cq_lock, it should only release once, right?

Thanks.
Peter

Bart Van Assche Oct. 8, 2024, 6:29 p.m. UTC | #5

On 10/7/24 12:20 AM, Peter Wang (王信友) wrote:
> On Thu, 2024-10-03 at 13:02 -0700, Bart Van Assche wrote:
>>   	
>> External email : Please do not click links or open attachments until
>> you have verified the sender or the content.
>>   On 10/2/24 5:42 AM, Peter Wang (王信友) wrote:
>>> This patch merely aligns with the approach of SDB mode
>>> and does not involve the flow of scsi_done. Besides,
>>> I don't see any issue with concurrency between
>>> ufshcd_abort_one() calling ufshcd_try_to_abort_task()
>>> and scsi_done(). Can you point out the specific flow where
>>> the problem occurs? If there is one, shouldn't SDB mode
>>> have the same issue?
>>
>> Hi Peter,
>>
>> Correct, my comment applies to both legacy mode and MCQ mode. From
>> the
>> section in the UFS standard about ABORT TASK: "A response of FUNCTION
>> COMPLETE shall indicate that the command was aborted or was not in
>> the
>> task set." In other words, if a command completes just before
>> ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then
>> ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command
>> that has already completed. In legacy mode, this call will succeed.
>>
> 
> Hi Bart,
> 
> Yes, the legacy SDB mode is protected by the outstanding_lock.
> 
> 
>> Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call
>> ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to
>> be
>> decremented twice instead of only once. Do you agree that this can
>> happen and also that it should be prevented that this happens?
>>
>> Thanks,
>>
>> Bart.
> 
> Sorry, I still don't understand why both ufshcd_compl_one_cqe()
> and ufshcd_abort_all() will call ufshcd_release(hba)?
> Because I have already removed the ufshcd_release_scsi_cmd from
> ufshcd_abort_one, so the command won't be released immediately
> after ufshcd_try_to_abort_task succeeds. Instead, it will wait
> until the CQ Entry comes in before releasing. And since it is
> protected by the cq_lock, it should only release once, right?

Hi Peter,

I think what you wrote applies to MCQ mode only. In my previous email
I clearly referred to "legacy mode" (SDB mode). Summarizing my previous
email, I think that in legacy mode it is possible that ufshcd_release()
is called twice while it only should be called once. Here are the
possible solutions I see:
* Add a function to the SCSI core for setting SCMD_STATE_COMPLETE. This
   may be controversial since no other SCSI LLD needs this functionality.
* Changing the error handling approach in the UFS driver to the same
   approach other SCSI LLDs use: instead of using queue_work() to
   activate the error handler, call scsi_schedule_eh(). This will cause
   the error handler to be activated later, namely after all pending
   commands have timed out instead of aborting any pending commands
   first.
* Add a variant of scsi_schedule_eh() to the SCSI core that accelerates
   error handling by calling scsi_timeout() on all pending commands.

Thanks,

Bart.

Peter Wang (王信友) Oct. 9, 2024, 2:17 a.m. UTC | #6

On Tue, 2024-10-08 at 11:29 -0700, Bart Van Assche wrote:
> Hi Peter,
> 
> I think what you wrote applies to MCQ mode only. In my previous email
> I clearly referred to "legacy mode" (SDB mode). Summarizing my
> previous
> email, I think that in legacy mode it is possible that
> ufshcd_release()
> is called twice while it only should be called once. Here are the
> possible solutions I see:
> * Add a function to the SCSI core for setting SCMD_STATE_COMPLETE.
> This
>    may be controversial since no other SCSI LLD needs this
> functionality.
> * Changing the error handling approach in the UFS driver to the same
>    approach other SCSI LLDs use: instead of using queue_work() to
>    activate the error handler, call scsi_schedule_eh(). This will
> cause
>    the error handler to be activated later, namely after all pending
>    commands have timed out instead of aborting any pending commands
>    first.
> * Add a variant of scsi_schedule_eh() to the SCSI core that
> accelerates
>    error handling by calling scsi_timeout() on all pending commands.
> 
> Thanks,
> 
> Bart.
> 

Hi Bart,

Yes, this patch is only for MCQ mode, because only MCQ mode 
receives OCS: ABORTED, right? This patch doesn't modify 
any of the Legacy mode flows, does it?

Additionally, I still don't understand why you say there would 
be an issue with legacy mode having duplicate ufshcd_release(hba) 
calls. As I mentioned before, it is protected by the 
outstanding_lock. Could you please clarify the detailed 
error flow?

Furthermore, even if there is an issue with Legacy mode, it 
should be addressed by a separate patch, not by this one, which is 
intended to resolve the MCQ mode issue. We shouldn't mix two 
different issues together, don't you agree?

Thanks
Peter

Bart Van Assche Oct. 9, 2024, 5:59 p.m. UTC | #7

On 10/1/24 2:19 AM, peter.wang@mediatek.com wrote:
> After the SQ cleanup fix, the CQ will receive a response with
> the corresponding tag marked as OCS: ABORTED. To align with
> the behavior of Legacy SDB mode, the handling of OCS: ABORTED
> has been changed to match that of OCS_INVALID_COMMAND_STATUS
> (SDB), with both returning a SCSI result of DID_REQUEUE.
> 
> Furthermore, the workaround implemented before the SQ cleanup
> fix can be removed.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Bart Van Assche Oct. 9, 2024, 6:06 p.m. UTC | #8

On 10/8/24 7:17 PM, Peter Wang (王信友) wrote:
> Yes, this patch is only for MCQ mode, because only MCQ mode
> receives OCS: ABORTED, right? This patch doesn't modify
> any of the Legacy mode flows, does it?

Agreed. What I mentioned in my email is an existing bug in the legacy
flow for ufshcd_abort_all().

> Furthermore, even if there is an issue with Legacy mode, it
> should be addressed by a separate patch, not by this one, which is
> intended to resolve the MCQ mode issue. We shouldn't mix two
> different issues together, don't you agree?

Let's proceed with this patch series and let's address what I brought
up in my email separately.

With the current approach for error handling in the UFS driver, anyone
who wants to verify or modify ufshcd_try_to_abort_task() has to consider
all possible interleavings of ufshcd_try_to_abort_task() and the
completion path (ufshcd_compl_one_cqe()). That's an unnecessary burden
on UFS driver contributors. Additionally, this is error-prone. This
applies to both modes (legacy and MCQ). I know of reports of sporadic
crashes in legacy mode related to UFS error handling. I'm wondering
whether these are perhaps the result of the issue I mentioned in a
previous email. Anyway, I will look further into this myself as soon as
I have the time.

Thanks,

Bart.

Peter Wang (王信友) Oct. 11, 2024, 5:44 a.m. UTC | #9

On Wed, 2024-10-09 at 11:06 -0700, Bart Van Assche wrote:
>  	 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  
> On 10/8/24 7:17 PM, Peter Wang (王信友) wrote:
> > Yes, this patch is only for MCQ mode, because only MCQ mode
> > receives OCS: ABORTED, right? This patch doesn't modify
> > any of the Legacy mode flows, does it?
> 
> Agreed. What I mentioned in my email is an existing bug in the legacy
> flow for ufshcd_abort_all().
> 
> > Furthermore, even if there is an issue with Legacy mode, it
> > should be addressed by a separate patch, not by this one, which is
> > intended to resolve the MCQ mode issue. We shouldn't mix two
> > different issues together, don't you agree?
> 
> Let's proceed with this patch series and let's address what I brought
> up in my email separately.
> 
> With the current approach for error handling in the UFS driver,
> anyone
> who wants to verify or modify ufshcd_try_to_abort_task() has to
> consider
> all possible interleavings of ufshcd_try_to_abort_task() and the
> completion path (ufshcd_compl_one_cqe()). That's an unnecessary
> burden
> on UFS driver contributors. Additionally, this is error-prone. This
> applies to both modes (legacy and MCQ). I know of reports of sporadic
> crashes in legacy mode related to UFS error handling. I'm wondering
> whether these are perhaps the result of the issue I mentioned in a
> previous email. Anyway, I will look further into this myself as soon
> as
> I have the time.
> 
> Thanks,
> 
> Bart.

Hi Bart,

Thank you for your review.
 
I currently cannot see the issue of duplicate releases in 
legacy SDB mode. ufshcd_try_to_abort_task() will directly 
reset if it fails. It is only in the case of success that 
we need to consider the possibility of ufshcd_compl_one_cqe. 
I believe the original design flow has already taken this 
into account, which is why there is protection with 
outstanding_lock/cq_lock. Perhaps we can wait for an actual 
example to occur before making corrections. Even if there 
is an issue, I think the probability should be very low, 
because the flow for legacy SDB mode has been in use 
for several years.

Thank you again for your review.

Thanks
Peter

[v10,2/2] ufs: core: requeue aborted request

Commit Message

Comments

Patch