[v9,2/3] ufs: core: fix error handler process for MCQ abort

Message ID	20240925095546.19492-3-peter.wang@mediatek.com (mailing list archive)
State	Superseded
Headers	show Received: from mailgw01.mediatek.com (unknown [60.244.123.138]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6776A158557 for <linux-scsi@vger.kernel.org>; Wed, 25 Sep 2024 10:11:57 +0000 (UTC) From: <peter.wang@mediatek.com> To: <linux-scsi@vger.kernel.org>, <martin.petersen@oracle.com>, <avri.altman@wdc.com>, <alim.akhtar@samsung.com>, <jejb@linux.ibm.com> CC: <wsd_upstream@mediatek.com>, <linux-mediatek@lists.infradead.org>, <peter.wang@mediatek.com>, <chun-hung.wu@mediatek.com>, <alice.chao@mediatek.com>, <cc.chou@mediatek.com>, <chaotian.jing@mediatek.com>, <jiajie.hao@mediatek.com>, <powen.kao@mediatek.com>, <qilin.tan@mediatek.com>, <lin.gui@mediatek.com>, <tun-yu.yu@mediatek.com>, <eddie.huang@mediatek.com>, <naomi.chu@mediatek.com>, <ed.tsai@mediatek.com>, <bvanassche@acm.org>, <quic_nguyenb@quicinc.com> Subject: [PATCH v9 2/3] ufs: core: fix error handler process for MCQ abort Date: Wed, 25 Sep 2024 17:55:45 +0800 Message-ID: <20240925095546.19492-3-peter.wang@mediatek.com> In-Reply-To: <20240925095546.19492-1-peter.wang@mediatek.com> References: <20240925095546.19492-1-peter.wang@mediatek.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain
Series	fix abort defect \| expand [v9,0/3] fix abort defect [v9,1/3] ufs: core: fix the issue of ICU failure [v9,2/3] ufs: core: fix error handler process for MCQ abort [v9,3/3] ufs: core: add a quirk for MediaTek SDB mode aborted

Peter Wang (王信友) Sept. 25, 2024, 9:55 a.m. UTC

From: Peter Wang <peter.wang@mediatek.com>

When the error handler successfully aborts a MCQ request,
it only releases the command and does not notify the SCSI layer.
This may cause another abort after 30 seconds timeout.
This patch notifies the SCSI layer to requeue the request.

Additionally, ignore the OCS: ABORTED CQ slot after MCQ mode
SQ cleanup. This makes the behavior of MCQ mode consistent with
that of legacy SDB mode.

Also, print logs for OCS: ABORTED and OCS_INVALID_COMMAND_STATUS
for debugging purposes.

Signed-off-by: Peter Wang <peter.wang@mediatek.com>
---
 drivers/ufs/core/ufshcd.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

Bart Van Assche Sept. 25, 2024, 4:49 p.m. UTC | #1

On 9/25/24 2:55 AM, peter.wang@mediatek.com wrote:
>   	case OCS_INVALID_COMMAND_STATUS:
>   		result |= DID_REQUEUE << 16;
> +		dev_warn(hba->dev,
> +				"OCS invaild from controller for tag %d\n",
> +				lrbp->task_tag);

Please change "invaild" into "invalid". Once that change has been made,
feel free to add:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Peter Wang (王信友) Sept. 26, 2024, 3:45 a.m. UTC | #2

On Wed, 2024-09-25 at 09:49 -0700, Bart Van Assche wrote:
>  	 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  On 9/25/24 2:55 AM, peter.wang@mediatek.com wrote:
> >   case OCS_INVALID_COMMAND_STATUS:
> >   result |= DID_REQUEUE << 16;
> > +dev_warn(hba->dev,
> > +"OCS invaild from controller for tag %d\n",
> > +lrbp->task_tag);
> 
> Please change "invaild" into "invalid". Once that change has been
> made,
> feel free to add:
> 
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Hi Bart,

Sorry, this typo wasn't corrected.

However, I still feel that this patch is not quite reasonable.
The reason is that in the first patch,
"ufs: core: fix the issue of ICU failure"
we corrected the ICU problem, allowing the SQ to clean up 
correctly, while the CQ will have a corresponding response. 
But the second patch directly ignores the CQ's response and 
continues to use a workaround to release the command right 
after ufshcd_try_to_abort_task. 
The Legacy SDB mode's approach would not release the 
command after ufshcd_try_to_abort_task. 
Instead, it releases after the DBR is cleared.

Therefore, as I previously suggested, it would probably 
be more reasonable to directly requeue the ABORTED commands 
as shown in the patch below.

---------------------------------------------------------------------
---
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 24a32e2fd75e..06aa4ed1a9e6 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5417,10 +5417,12 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba,
struct ufshcd_lrb *lrbp,
 		}
 		break;
 	case OCS_ABORTED:
-		result |= DID_ABORT << 16;
-		break;
 	case OCS_INVALID_COMMAND_STATUS:
 		result |= DID_REQUEUE << 16;
+		dev_warn(hba->dev,
+				"OCS %s from controller for tag %d\n",
+				(ocs == OCS_ABORTED? "aborted" :
"invalid"),
+				lrbp->task_tag);
 		break;
 	case OCS_INVALID_CMD_TABLE_ATTR:
 	case OCS_INVALID_PRDT_ATTR:
@@ -6466,26 +6468,12 @@ static bool ufshcd_abort_one(struct request
*rq, void *priv)
 	struct scsi_device *sdev = cmd->device;
 	struct Scsi_Host *shost = sdev->host;
 	struct ufs_hba *hba = shost_priv(shost);
-	struct ufshcd_lrb *lrbp = &hba->lrb[tag];
-	struct ufs_hw_queue *hwq;
-	unsigned long flags;

 	*ret = ufshcd_try_to_abort_task(hba, tag);
 	dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag,
 		hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1,
 		*ret ? "failed" : "succeeded");

-	/* Release cmd in MCQ mode if abort succeeds */
-	if (hba->mcq_enabled && (*ret == 0)) {
-		hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp-
>cmd));
-		if (!hwq)
-			return 0;
-		spin_lock_irqsave(&hwq->cq_lock, flags);
-		if (ufshcd_cmd_inflight(lrbp->cmd))
-			ufshcd_release_scsi_cmd(hba, lrbp);
-		spin_unlock_irqrestore(&hwq->cq_lock, flags);
-	}
-
 	return *ret == 0;
 }

---------------------------------------------------------------------
---

This patch has several advantages:

1. It makes the patch 'ufs: core: fix the issue of ICU failure' 
   seem valuable.
2. The patch is more concise.
3. There is no need to fetch OCS to determine OCS: ABORTED 
   on every CQ completion, which increases ISR time.
4. The err_handler flow for SDB and MCQ would be consistent.
5. There is no need for the MediaTek SDB quirk.

What do you think?"

Bart Van Assche Sept. 26, 2024, 6:26 p.m. UTC | #3

On 9/25/24 8:45 PM, Peter Wang (王信友) wrote:
> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> index 24a32e2fd75e..06aa4ed1a9e6 100644
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -5417,10 +5417,12 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba,
> struct ufshcd_lrb *lrbp,
>   		}
>   		break;
>   	case OCS_ABORTED:
> -		result |= DID_ABORT << 16;
> -		break;
>   	case OCS_INVALID_COMMAND_STATUS:
>   		result |= DID_REQUEUE << 16;
> +		dev_warn(hba->dev,
> +				"OCS %s from controller for tag %d\n",
> +				(ocs == OCS_ABORTED? "aborted" :
> "invalid"),
> +				lrbp->task_tag);
>   		break;
>   	case OCS_INVALID_CMD_TABLE_ATTR:
>   	case OCS_INVALID_PRDT_ATTR:
> @@ -6466,26 +6468,12 @@ static bool ufshcd_abort_one(struct request
> *rq, void *priv)
>   	struct scsi_device *sdev = cmd->device;
>   	struct Scsi_Host *shost = sdev->host;
>   	struct ufs_hba *hba = shost_priv(shost);
> -	struct ufshcd_lrb *lrbp = &hba->lrb[tag];
> -	struct ufs_hw_queue *hwq;
> -	unsigned long flags;
>   
>   	*ret = ufshcd_try_to_abort_task(hba, tag);
>   	dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag,
>   		hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1,
>   		*ret ? "failed" : "succeeded");
>   
> -	/* Release cmd in MCQ mode if abort succeeds */
> -	if (hba->mcq_enabled && (*ret == 0)) {
> -		hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp-
>> cmd));
> -		if (!hwq)
> -			return 0;
> -		spin_lock_irqsave(&hwq->cq_lock, flags);
> -		if (ufshcd_cmd_inflight(lrbp->cmd))
> -			ufshcd_release_scsi_cmd(hba, lrbp);
> -		spin_unlock_irqrestore(&hwq->cq_lock, flags);
> -	}
> -
>   	return *ret == 0;
>   }
>   
> ---------------------------------------------------------------------
> 
> 
> This patch has several advantages:
> 
> 1. It makes the patch 'ufs: core: fix the issue of ICU failure'
>     seem valuable.
> 2. The patch is more concise.
> 3. There is no need to fetch OCS to determine OCS: ABORTED
>     on every CQ completion, which increases ISR time.
> 4. The err_handler flow for SDB and MCQ would be consistent.
> 5. There is no need for the MediaTek SDB quirk.
> 
> 
> What do you think?"

Hi Peter,

Is the above patch sufficient? In MCQ mode, aborting a command happens
as follows (simplified):
(1) Send the ABORT TASK TMF. If this TMF succeeds, no SQE will be
     generated. If this TMF succeeds it means that the SCSI command has
     reached the UFS device and hence is no longer present in any
     submission queue (SQ).
(2) If the command is still in a submission queue, nullify the SQE. In
     this case a CQE will be generated with status ABORTED.

It seems to me that the above patch handles (2) but not (1)?

Thanks,

Bart.

Peter Wang (王信友) Sept. 27, 2024, 7:51 a.m. UTC | #4

On Thu, 2024-09-26 at 11:26 -0700, Bart Van Assche wrote:
>  	 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  On 9/25/24 8:45 PM, Peter Wang (王信友) wrote:
> > diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> > index 24a32e2fd75e..06aa4ed1a9e6 100644
> > --- a/drivers/ufs/core/ufshcd.c
> > +++ b/drivers/ufs/core/ufshcd.c
> > @@ -5417,10 +5417,12 @@ ufshcd_transfer_rsp_status(struct ufs_hba
> *hba,
> > struct ufshcd_lrb *lrbp,
> >   }
> >   break;
> >   case OCS_ABORTED:
> > -result |= DID_ABORT << 16;
> > -break;
> >   case OCS_INVALID_COMMAND_STATUS:
> >   result |= DID_REQUEUE << 16;
> > +dev_warn(hba->dev,
> > +"OCS %s from controller for tag %d\n",
> > +(ocs == OCS_ABORTED? "aborted" :
> > "invalid"),
> > +lrbp->task_tag);
> >   break;
> >   case OCS_INVALID_CMD_TABLE_ATTR:
> >   case OCS_INVALID_PRDT_ATTR:
> > @@ -6466,26 +6468,12 @@ static bool ufshcd_abort_one(struct request
> > *rq, void *priv)
> >   struct scsi_device *sdev = cmd->device;
> >   struct Scsi_Host *shost = sdev->host;
> >   struct ufs_hba *hba = shost_priv(shost);
> > -struct ufshcd_lrb *lrbp = &hba->lrb[tag];
> > -struct ufs_hw_queue *hwq;
> > -unsigned long flags;
> >   
> >   *ret = ufshcd_try_to_abort_task(hba, tag);
> >   dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag,
> >   hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1,
> >   *ret ? "failed" : "succeeded");
> >   
> > -/* Release cmd in MCQ mode if abort succeeds */
> > -if (hba->mcq_enabled && (*ret == 0)) {
> > -hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp-
> >> cmd));
> > -if (!hwq)
> > -return 0;
> > -spin_lock_irqsave(&hwq->cq_lock, flags);
> > -if (ufshcd_cmd_inflight(lrbp->cmd))
> > -ufshcd_release_scsi_cmd(hba, lrbp);
> > -spin_unlock_irqrestore(&hwq->cq_lock, flags);
> > -}
> > -
> >   return *ret == 0;
> >   }
> >   
> > -----------------------------------------------------------------
> ----
> > 
> > 
> > This patch has several advantages:
> > 
> > 1. It makes the patch 'ufs: core: fix the issue of ICU failure'
> >     seem valuable.
> > 2. The patch is more concise.
> > 3. There is no need to fetch OCS to determine OCS: ABORTED
> >     on every CQ completion, which increases ISR time.
> > 4. The err_handler flow for SDB and MCQ would be consistent.
> > 5. There is no need for the MediaTek SDB quirk.
> > 
> > 
> > What do you think?"
> 
> Hi Peter,
> 
> Is the above patch sufficient? In MCQ mode, aborting a command
> happens
> as follows (simplified):
> (1) Send the ABORT TASK TMF. If this TMF succeeds, no SQE will be
>      generated. If this TMF succeeds it means that the SCSI command
> has
>      reached the UFS device and hence is no longer present in any
>      submission queue (SQ).
> (2) If the command is still in a submission queue, nullify the SQE.
> In
>      this case a CQE will be generated with status ABORTED.
> 
> It seems to me that the above patch handles (2) but not (1)?
> 
> Thanks,
> 
> Bart.

Hi Bart,

Regarding your description of 'ABORT TMF success' and 'SQE' 
I think there might be some misunderstanding.

In this section of the UFSHCI 4.0 specification. 
4.4.6 (Informative) Processing Abort in MCQ mode: An Implementation
Example
There are three case for MCQ abort:

1. When the host controller has already sent out the SQE 
   and the UFS device has already responded with the 
   corresponding response, the CQ Entry will automatically 
   increment by 1. This case is the simplest, the SQE 
   will have a corresponding CQE for the host to cleanup 
   resources.

2. When the host controller has not yet sent out this SQE 
  (SQ is not empty), the software can fill in 'nullify' to 
  notify the host controller that there is no need to send 
  it, and directly fill the corresponding response into the 
  CQ with OCS: ABORTED. This scenario is also straightforward, 
  the UFS device won't be aware, and only the host controller 
  needs to clean up the related resources.

3. When the host controller has already sent out the SQE 
   and is waiting for the response from the UFS device (CQE), 
   the software can initiate cleanup to notify the host 
   controller that there is no need to wait, and directly fill 
   the corresponding response into the CQ with OCS: ABORTED.

Therefore, when a TMF is successfully executed, for example, 
tag 0 has been aborted, the software will know that the UFS 
device will no longer return a response for tag 0. 
Thus, the host controller needs to initiate cleanup, 
allowing the CQE corresponding to this SQE to return, 
while also cleaning the resources for tag 0.

In both the second and third cases, no matter which one occurs, 
they will result in the CQE corresponding to the SQE being 
filled with OCS: ABORTED. So, as long as we directly 
requeue when OCS: ABORTED, it will satisfy both of these 
situations.

Thanks
Peter

Bart Van Assche Sept. 28, 2024, 11:10 p.m. UTC | #5

On 9/27/24 12:51 AM, Peter Wang (王信友) wrote:
> In this section of the UFSHCI 4.0 specification.
> 4.4.6 (Informative) Processing Abort in MCQ mode: An Implementation
> Example
> There are three case for MCQ abort:
> 
> 1. When the host controller has already sent out the SQE
>     and the UFS device has already responded with the
>     corresponding response, the CQ Entry will automatically
>     increment by 1. This case is the simplest, the SQE
>     will have a corresponding CQE for the host to cleanup
>     resources.
> 
> 2. When the host controller has not yet sent out this SQE
>    (SQ is not empty), the software can fill in 'nullify' to
>    notify the host controller that there is no need to send
>    it, and directly fill the corresponding response into the
>    CQ with OCS: ABORTED. This scenario is also straightforward,
>    the UFS device won't be aware, and only the host controller
>    needs to clean up the related resources.
> 
> 3. When the host controller has already sent out the SQE
>     and is waiting for the response from the UFS device (CQE),
>     the software can initiate cleanup to notify the host
>     controller that there is no need to wait, and directly fill
>     the corresponding response into the CQ with OCS: ABORTED.

Hi Peter,

Thank you for having drawn my attention to the above text. Regarding
the code changes included in your previous email, do you agree that the
completion code must not call scsi_done() if the CQE status is ABORTED
and if the SCSI command has been aborted by the SCSI core since in this
case calling scsi_done() could result in a use-after-free?

Thanks,

Bart.

Peter Wang (王信友) Sept. 30, 2024, 6:45 a.m. UTC | #6

On Sat, 2024-09-28 at 16:10 -0700, Bart Van Assche wrote:
> 
> Hi Peter,
> 
> Thank you for having drawn my attention to the above text. Regarding
> the code changes included in your previous email, do you agree that
> the
> completion code must not call scsi_done() if the CQE status is
> ABORTED
> and if the SCSI command has been aborted by the SCSI core since in
> this
> case calling scsi_done() could result in a use-after-free?
> 
> Thanks,
> 
> Bart.

Hi Bart,

I'm not quite sure what you mean. Are you suggesting that scsi_done() 
should not be called in the case of a SCSI Abort? This should be 
unrelated to OCS: ABORTED (MCQ), because in the case of OCS: INVALID 
(SDB), scsi_done() might also be called, and calling scsi_done()
should 
not cause any issues. This is because it has already been aborted 
by the SCSI timeout, so the test bit(SCMD_STATE_COMPLETE) would 
directly return. Below is the call flow:

scsi_done
  scsi_done_internal
    if (unlikely(test_and_set_bit(SCMD_STATE_COMPLETE, &cmd->state)))
      return;

Thanks.
Peter

Bart Van Assche Sept. 30, 2024, 5:25 p.m. UTC | #7

On 9/29/24 11:45 PM, Peter Wang (王信友) wrote:
> I'm not quite sure what you mean. Are you suggesting that scsi_done()
> should not be called in the case of a SCSI Abort? This should be
> unrelated to OCS: ABORTED (MCQ), because in the case of OCS: INVALID
> (SDB), scsi_done() might also be called, and calling scsi_done()
> should
> not cause any issues. This is because it has already been aborted
> by the SCSI timeout, so the test bit(SCMD_STATE_COMPLETE) would
> directly return. Below is the call flow:
> 
> scsi_done
>    scsi_done_internal
>      if (unlikely(test_and_set_bit(SCMD_STATE_COMPLETE, &cmd->state)))
>        return;

Hi Peter,

The purpose of the SCMD_STATE_COMPLETE bit is to prevent that the SCSI
core tries to abort a SCSI command concurrently with the SCSI LLD (UFS
driver in this case) calling scsi_done(). Making scsi_done() calls no-
ops while an .eh_abort_handler() implementation is in progress is an
undocumented side effect of this mechanism. But since it is unlikely
that this behavior will change in the SCSI core, I'm OK with relying on
this behavior.

ufshcd_abort_one() does not set the SCMD_STATE_COMPLETE bit before it
tries to abort a SCSI command. I think this is wrong because plenty of
code in ufshcd_try_to_abort_task() relies on the assumption that
scsi_done() is not called while that function is in progress. Do you
agree that SCMD_STATE_COMPLETE should be set by ufshcd_abort_one()
before it calls ufshcd_try_to_abort_task()? If this change is made, a
consequence is that the completion handler won't inform the SCSI core
anymore that abortion of a command by ufshcd_abort_one() has completed.
Hence, the cmd->result value won't matter anymore for commands aborted
by ufshcd_abort_one() and how ufshcd_transfer_rsp_status() translates
OCS_ABORTED won't matter anymore either.

Thanks,

Bart.

Peter Wang (王信友) Oct. 1, 2024, 7:46 a.m. UTC | #8

On Mon, 2024-09-30 at 10:25 -0700, Bart Van Assche wrote:
> 

> Hi Peter,
> 
> The purpose of the SCMD_STATE_COMPLETE bit is to prevent that the
> SCSI
> core tries to abort a SCSI command concurrently with the SCSI LLD
> (UFS
> driver in this case) calling scsi_done(). Making scsi_done() calls
> no-
> ops while an .eh_abort_handler() implementation is in progress is an
> undocumented side effect of this mechanism. But since it is unlikely
> that this behavior will change in the SCSI core, I'm OK with relying
> on
> this behavior.

Hi Bart,

Yes, the SCMD_STATE_COMPLETE bit is there to protect against
a scsi command timeout and finish occurring simultaneously. 
Regardless of whether the timeout or finish happens first, 
the other will not proceed with its subsequent actions. 
Therefore, it is within expectations that scsi_done will 
not perform any actions after SCSI notified that the command 
has already been timeout aborted.

> 
> ufshcd_abort_one() does not set the SCMD_STATE_COMPLETE bit before it
> tries to abort a SCSI command. I think this is wrong because plenty
> of
> code in ufshcd_try_to_abort_task() relies on the assumption that
> scsi_done() is not called while that function is in progress. Do you
> agree that SCMD_STATE_COMPLETE should be set by ufshcd_abort_one()
> before it calls ufshcd_try_to_abort_task()? If this change is made, a
> consequence is that the completion handler won't inform the SCSI core
> anymore that abortion of a command by ufshcd_abort_one() has
> completed.
> Hence, the cmd->result value won't matter anymore for commands
> aborted
> by ufshcd_abort_one() and how ufshcd_transfer_rsp_status() translates
> OCS_ABORTED won't matter anymore either.
> 
> Thanks,
> 

SCMD_STATE_COMPLETE should not be set outside of the SCSI 
core. It is reasonable to set it through scsi_done(), as 
setting this flag also requires notifying the block layer 
on how to handle the completed command. If the UFS driver 
sets this flag, the driver would have to bypass the SCSI 
layer and directly notify the block layer, which is clearly 
not a reasonable situation. Additionally, when a command 
completes, the SCSI layer needs to determine how to handle 
it based on the result. This is obviously not something 
the UFS driver layer can handle on its own. Notifying 
the SCSI layer through the result on how to proceed 
is what the UFS driver should do.

Furthermore, ufshcd_try_to_abort_task() is responsible for 
comunication between the UFS host controller and the UFS 
device, unrelated to the SCSI layer. Notifying SCSI layter 
based on the outcome of ufshcd_try_to_abort_task() should 
be the correct and reasonable approach, as has always been 
the case in SDB mode. The current difference is only that 
the OCS and result in SDB mode and MCQ is differ. Aligning 
both to the same result would likely be a more reasonable 
approach, wouldn't it?"

Thanks.
Peter

> Bart.

[v9,2/3] ufs: core: fix error handler process for MCQ abort

Commit Message

Comments

Patch