[v4,09/10] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

Message ID 1624433711-9339-11-git-send-email-cang@codeaurora.org (mailing list archive)
State Changes Requested
Series Complementary changes for error handling

Commit Message

Can Guo June 23, 2021, 7:35 a.m. UTC
If PM requests fail during runtime suspend/resume, the RPM framework saves
the error to dev->power.runtime_error. Until runtime_error is cleared,
runtime PM on this specific device won't work again, leaving the device
either runtime active or runtime suspended permanently.

When a task abort happens to a PM request sent during runtime suspend/resume,
the RPM framework still saves the (TIMEOUT) error, even if the request is
successfully aborted. In this situation, we can leverage error handling to
recover and clear the runtime_error. So, let PM requests take the fast
abort path in ufshcd_abort().

Signed-off-by: Can Guo <cang@codeaurora.org>
---
 drivers/scsi/ufs/ufshcd.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)
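
The latching behaviour described in the commit message can be sketched as a small user-space model. This is only an illustration under stated assumptions: `struct model_dev` and the `model_*` functions are hypothetical names, not kernel APIs, and the model ignores everything the RPM framework does besides recording and checking the error.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the runtime PM state kept in struct device. */
struct model_dev {
	int runtime_error;	/* models dev->power.runtime_error */
	int suspended;		/* 1 = runtime suspended, 0 = runtime active */
};

/*
 * Model of a runtime suspend attempt. cb_ret is the result of the driver's
 * suspend callback (0 on success, negative errno on failure).
 */
static int model_runtime_suspend(struct model_dev *dev, int cb_ret)
{
	if (dev->runtime_error)
		return -EINVAL;		/* an earlier error is still latched */
	if (cb_ret) {
		dev->runtime_error = cb_ret;	/* latch the failure */
		return cb_ret;
	}
	dev->suspended = 1;
	return 0;
}

/* Model of error handling clearing the latch after recovery (assumption). */
static void model_eh_clear_error(struct model_dev *dev)
{
	dev->runtime_error = 0;
	dev->suspended = 0;	/* the recovered device is treated as active */
}

static void model_demo(void)
{
	struct model_dev dev = { 0, 0 };

	/* A PM request times out during runtime suspend. */
	assert(model_runtime_suspend(&dev, -ETIMEDOUT) == -ETIMEDOUT);
	/* Later attempts fail before the driver callback is even tried. */
	assert(model_runtime_suspend(&dev, 0) == -EINVAL);
	/* Once error handling clears the latch, runtime PM works again. */
	model_eh_clear_error(&dev);
	assert(model_runtime_suspend(&dev, 0) == 0);
	assert(dev.suspended == 1);
}
```

Calling `model_demo()` walks through the sequence the patch is concerned with: a timed-out PM request, the stuck state that follows, and recovery once the error is cleared.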

Comments

Bart Van Assche June 23, 2021, 9:33 p.m. UTC | #1
On 6/23/21 12:35 AM, Can Guo wrote:
> @@ -2737,7 +2737,7 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
>  		 * err handler blocked for too long. So, just fail the scsi cmd
>  		 * sent from PM ops, err handler can recover PM error anyways.
>  		 */
> -		if (hba->wlu_pm_op_in_progress) {
> +		if (cmd->request->rq_flags & RQF_PM) {
>  			hba->force_reset = true;
>  			set_host_byte(cmd, DID_BAD_TARGET);
>  			cmd->scsi_done(cmd);

I'm still concerned that the above code may trigger data corruption. I
prefer that the above code is removed instead of being modified.

Thanks,

Bart.
Can Guo June 24, 2021, 4:16 a.m. UTC | #2
Hi Bart,

On 2021-06-24 05:33, Bart Van Assche wrote:
> On 6/23/21 12:35 AM, Can Guo wrote:
>> @@ -2737,7 +2737,7 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
>>  		 * err handler blocked for too long. So, just fail the scsi cmd
>>  		 * sent from PM ops, err handler can recover PM error anyways.
>>  		 */
>> -		if (hba->wlu_pm_op_in_progress) {
>> +		if (cmd->request->rq_flags & RQF_PM) {
>>  			hba->force_reset = true;
>>  			set_host_byte(cmd, DID_BAD_TARGET);
>>  			cmd->scsi_done(cmd);
> 
> I'm still concerned that the above code may trigger data corruption. I
> prefer that the above code is removed instead of being modified.

Removing the change will lead to deadlock when error handling prepare
calls pm_runtime_get_sync().

RQF_PM is only given to requests sent from power management operations,
during which the specific device/LU is suspending/resuming, meaning no
data transaction is ongoing. How can fast failing a PM request trigger
data corruption?

Thanks,

Can Guo.

> 
> Thanks,
> 
> Bart.
Bart Van Assche June 24, 2021, 4:57 p.m. UTC | #3
On 6/23/21 9:16 PM, Can Guo wrote:
> On 2021-06-24 05:33, Bart Van Assche wrote:
>> On 6/23/21 12:35 AM, Can Guo wrote:
>>> @@ -2737,7 +2737,7 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
>>>           * err handler blocked for too long. So, just fail the scsi cmd
>>>           * sent from PM ops, err handler can recover PM error anyways.
>>>           */
>>> -        if (hba->wlu_pm_op_in_progress) {
>>> +        if (cmd->request->rq_flags & RQF_PM) {
>>>              hba->force_reset = true;
>>>              set_host_byte(cmd, DID_BAD_TARGET);
>>>              cmd->scsi_done(cmd);
>>
>> I'm still concerned that the above code may trigger data corruption. I
>> prefer that the above code is removed instead of being modified.
> 
> Removing the change will lead to deadlock when error handling prepare
> calls pm_runtime_get_sync().
> 
> RQF_PM is only given to requests sent from power management operations,
> during which the specific device/LU is suspending/resuming, meaning no
> data transaction is ongoing. How can fast failing a PM request trigger
> data corruption?

Right, the above code only affects power management requests so there is
no risk of data corruption.

Thanks,

Bart.
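
The point settled in this thread, that only requests flagged RQF_PM are failed fast and ordinary I/O is left alone, can be sketched in user space as follows. This is a simplified model, not the driver code: the `model_*` names and `MODEL_RQF_PM` value are illustrative, and the `eh_in_progress` field is an assumption standing in for whatever host state guards the check in ufshcd_queuecommand().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values; the real definitions live in kernel headers. */
#define MODEL_RQF_PM		(1u << 0)
#define MODEL_DID_BAD_TARGET	0x04u

struct model_hba {
	bool eh_in_progress;	/* assumption: error handling is pending */
	bool force_reset;	/* models hba->force_reset */
};

struct model_cmd {
	uint32_t rq_flags;	/* models cmd->request->rq_flags */
	uint32_t result;	/* models cmd->result */
	bool done;		/* set once the command has been completed */
};

/* The host byte occupies bits 16..23 of the result word. */
static void model_set_host_byte(struct model_cmd *cmd, uint32_t status)
{
	cmd->result = (cmd->result & 0xff00ffffu) | (status << 16);
}

/*
 * Returns true when the command was failed immediately instead of being
 * left to wait behind error handling.
 */
static bool model_fast_fail(struct model_hba *hba, struct model_cmd *cmd)
{
	if (!hba->eh_in_progress || !(cmd->rq_flags & MODEL_RQF_PM))
		return false;

	hba->force_reset = true;	/* ask the error handler for a reset */
	model_set_host_byte(cmd, MODEL_DID_BAD_TARGET);
	cmd->done = true;		/* stands in for cmd->scsi_done(cmd) */
	return true;
}
```

In this model a command without the PM flag is never completed early, which mirrors the argument above that no data transfer is affected by the fast-fail path.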

Patch

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index d739401..59fc521 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -2737,7 +2737,7 @@  static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
 		 * err handler blocked for too long. So, just fail the scsi cmd
 		 * sent from PM ops, err handler can recover PM error anyways.
 		 */
-		if (hba->wlu_pm_op_in_progress) {
+		if (cmd->request->rq_flags & RQF_PM) {
 			hba->force_reset = true;
 			set_host_byte(cmd, DID_BAD_TARGET);
 			cmd->scsi_done(cmd);
@@ -6981,11 +6981,14 @@  static int ufshcd_abort(struct scsi_cmnd *cmd)
 	int err = 0;
 	struct ufshcd_lrb *lrbp;
 	u32 reg;
+	bool need_eh = false;
 
 	host = cmd->device->host;
 	hba = shost_priv(host);
 	tag = cmd->request->tag;
 	lrbp = &hba->lrb[tag];
+
+	dev_info(hba->dev, "%s: Device abort task at tag %d\n", __func__, tag);
 	if (!ufshcd_valid_tag(hba, tag)) {
 		dev_err(hba->dev,
 			"%s: invalid command tag %d: cmd=0x%p, cmd->request=0x%p",
@@ -7003,9 +7006,6 @@  static int ufshcd_abort(struct scsi_cmnd *cmd)
 		goto out;
 	}
 
-	/* Print Transfer Request of aborted task */
-	dev_info(hba->dev, "%s: Device abort task at tag %d\n", __func__, tag);
-
 	/*
 	 * Print detailed info about aborted request.
 	 * As more than one request might get aborted at the same time,
@@ -7033,21 +7033,21 @@  static int ufshcd_abort(struct scsi_cmnd *cmd)
 	}
 
 	/*
-	 * Task abort to the device W-LUN is illegal. When this command
-	 * will fail, due to spec violation, scsi err handling next step
-	 * will be to send LU reset which, again, is a spec violation.
-	 * To avoid these unnecessary/illegal steps, first we clean up
-	 * the lrb taken by this cmd and re-set it in outstanding_reqs,
-	 * then queue the eh_work and bail.
+	 * This fast path guarantees the cmd always gets aborted successfully,
+	 * meanwhile it invokes the error handler. It allows contexts, which
+	 * are blocked by this cmd, to fail fast. It serves multiple purposes:
+	 * #1 To avoid unnecessary/illegal abort attempts to the W-LU.
+	 * #2 To avoid live lock between eh_work and specific contexts, i.e.,
+	 *    suspend/resume and eh_work itself.
+	 * #3 To let eh_work recover runtime PM error in case abort happens
+	 *    to cmds sent from runtime suspend/resume ops.
 	 */
-	if (lrbp->lun == UFS_UPIU_UFS_DEVICE_WLUN) {
+	if (lrbp->lun == UFS_UPIU_UFS_DEVICE_WLUN ||
+	    (cmd->request->rq_flags & RQF_PM)) {
 		ufshcd_update_evt_hist(hba, UFS_EVT_ABORT, lrbp->lun);
 		__ufshcd_transfer_req_compl(hba, (1UL << tag));
 		set_bit(tag, &hba->outstanding_reqs);
-		spin_lock_irqsave(host->host_lock, flags);
-		hba->force_reset = true;
-		ufshcd_schedule_eh_work(hba);
-		spin_unlock_irqrestore(host->host_lock, flags);
+		need_eh = true;
 		goto out;
 	}
 
@@ -7061,6 +7061,12 @@  static int ufshcd_abort(struct scsi_cmnd *cmd)
 cleanup:
 		__ufshcd_transfer_req_compl(hba, (1UL << tag));
 out:
+		if (cmd->request->rq_flags & RQF_PM || need_eh) {
+			spin_lock_irqsave(host->host_lock, flags);
+			hba->force_reset = true;
+			ufshcd_schedule_eh_work(hba);
+			spin_unlock_irqrestore(host->host_lock, flags);
+		}
 		err = SUCCESS;
 	} else {
 		dev_err(hba->dev, "%s: failed with err %d\n", __func__, err);