Message ID | 1624433711-9339-11-git-send-email-cang@codeaurora.org (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | Complementary changes for error handling | expand |
On 6/23/21 12:35 AM, Can Guo wrote: > @@ -2737,7 +2737,7 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd) > * err handler blocked for too long. So, just fail the scsi cmd > * sent from PM ops, err handler can recover PM error anyways. > */ > - if (hba->wlu_pm_op_in_progress) { > + if (cmd->request->rq_flags & RQF_PM) { > hba->force_reset = true; > set_host_byte(cmd, DID_BAD_TARGET); > cmd->scsi_done(cmd); I'm still concerned that the above code may trigger data corruption. I prefer that the above code is removed instead of being modified. Thanks, Bart.
Hi Bart, On 2021-06-24 05:33, Bart Van Assche wrote: > On 6/23/21 12:35 AM, Can Guo wrote: >> @@ -2737,7 +2737,7 @@ static int ufshcd_queuecommand(struct Scsi_Host >> *host, struct scsi_cmnd *cmd) >> * err handler blocked for too long. So, just fail the scsi cmd >> * sent from PM ops, err handler can recover PM error anyways. >> */ >> - if (hba->wlu_pm_op_in_progress) { >> + if (cmd->request->rq_flags & RQF_PM) { >> hba->force_reset = true; >> set_host_byte(cmd, DID_BAD_TARGET); >> cmd->scsi_done(cmd); > > I'm still concerned that the above code may trigger data corruption. I > prefer that the above code is removed instead of being modified. Removing the change will lead to deadlock when error handling prepare calls pm_runtime_get_sync(). RQF_PM is only given to requests sent from power management operations, during which the specific device/LU is suspending/resuming, meaning no data transaction is ongoing. How can fast failing a PM request trigger data corruption? Thanks, Can Guo. > > Thanks, > > Bart.
On 6/23/21 9:16 PM, Can Guo wrote: > On 2021-06-24 05:33, Bart Van Assche wrote: >> On 6/23/21 12:35 AM, Can Guo wrote: >>> @@ -2737,7 +2737,7 @@ static int ufshcd_queuecommand(struct Scsi_Host >>> *host, struct scsi_cmnd *cmd) >>> * err handler blocked for too long. So, just fail the scsi cmd >>> * sent from PM ops, err handler can recover PM error anyways. >>> */ >>> - if (hba->wlu_pm_op_in_progress) { >>> + if (cmd->request->rq_flags & RQF_PM) { >>> hba->force_reset = true; >>> set_host_byte(cmd, DID_BAD_TARGET); >>> cmd->scsi_done(cmd); >> >> I'm still concerned that the above code may trigger data corruption. I >> prefer that the above code is removed instead of being modified. > > Removing the change will lead to deadlock when error handling prepare > calls pm_runtime_get_sync(). > > RQF_PM is only given to requests sent from power management operations, > during which the specific device/LU is suspending/resuming, meaning no > data transaction is ongoing. How can fast failing a PM request trigger > data corruption? Right, the above code only affects power management requests so there is no risk for data corruption. Thanks, Bart.
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c index d739401..59fc521 100644 --- a/drivers/scsi/ufs/ufshcd.c +++ b/drivers/scsi/ufs/ufshcd.c @@ -2737,7 +2737,7 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd) * err handler blocked for too long. So, just fail the scsi cmd * sent from PM ops, err handler can recover PM error anyways. */ - if (hba->wlu_pm_op_in_progress) { + if (cmd->request->rq_flags & RQF_PM) { hba->force_reset = true; set_host_byte(cmd, DID_BAD_TARGET); cmd->scsi_done(cmd); @@ -6981,11 +6981,14 @@ static int ufshcd_abort(struct scsi_cmnd *cmd) int err = 0; struct ufshcd_lrb *lrbp; u32 reg; + bool need_eh = false; host = cmd->device->host; hba = shost_priv(host); tag = cmd->request->tag; lrbp = &hba->lrb[tag]; + + dev_info(hba->dev, "%s: Device abort task at tag %d\n", __func__, tag); if (!ufshcd_valid_tag(hba, tag)) { dev_err(hba->dev, "%s: invalid command tag %d: cmd=0x%p, cmd->request=0x%p", @@ -7003,9 +7006,6 @@ static int ufshcd_abort(struct scsi_cmnd *cmd) goto out; } - /* Print Transfer Request of aborted task */ - dev_info(hba->dev, "%s: Device abort task at tag %d\n", __func__, tag); - /* * Print detailed info about aborted request. * As more than one request might get aborted at the same time, @@ -7033,21 +7033,21 @@ static int ufshcd_abort(struct scsi_cmnd *cmd) } /* - * Task abort to the device W-LUN is illegal. When this command - * will fail, due to spec violation, scsi err handling next step - * will be to send LU reset which, again, is a spec violation. - * To avoid these unnecessary/illegal steps, first we clean up - * the lrb taken by this cmd and re-set it in outstanding_reqs, - * then queue the eh_work and bail. + * This fast path guarantees the cmd always gets aborted successfully, + * meanwhile it invokes the error handler. It allows contexts, which + * are blocked by this cmd, to fail fast. It serves multiple purposes: + * #1 To avoid unnecessary/illagal abort attempts to the W-LU. + * #2 To avoid live lock between eh_work and specific contexts, i.e., + * suspend/resume and eh_work itself. + * #3 To let eh_work recover runtime PM error in case abort happens + * to cmds sent from runtime suspend/resume ops. */ - if (lrbp->lun == UFS_UPIU_UFS_DEVICE_WLUN) { + if (lrbp->lun == UFS_UPIU_UFS_DEVICE_WLUN || + (cmd->request->rq_flags & RQF_PM)) { ufshcd_update_evt_hist(hba, UFS_EVT_ABORT, lrbp->lun); __ufshcd_transfer_req_compl(hba, (1UL << tag)); set_bit(tag, &hba->outstanding_reqs); - spin_lock_irqsave(host->host_lock, flags); - hba->force_reset = true; - ufshcd_schedule_eh_work(hba); - spin_unlock_irqrestore(host->host_lock, flags); + need_eh = true; goto out; } @@ -7061,6 +7061,12 @@ static int ufshcd_abort(struct scsi_cmnd *cmd) cleanup: __ufshcd_transfer_req_compl(hba, (1UL << tag)); out: + if (cmd->request->rq_flags & RQF_PM || need_eh) { + spin_lock_irqsave(host->host_lock, flags); + hba->force_reset = true; + ufshcd_schedule_eh_work(hba); + spin_unlock_irqrestore(host->host_lock, flags); + } err = SUCCESS; } else { dev_err(hba->dev, "%s: failed with err %d\n", __func__, err);
If PM requests fail during runtime suspend/resume, RPM framework saves the error to dev->power.runtime_error. Before the runtime_error gets cleared, runtime PM on this specific device won't work again, leaving the device either runtime active or runtime suspended permanently. When task abort happens to a PM request sent during runtime suspend/resume, even if it can be successfully aborted, RPM framework anyways saves the (TIMEOUT) error. In this situation, we can leverage error handling to recover and clear the runtime_error. So, let PM requests take the fast abort path in ufshcd_abort(). Signed-off-by: Can Guo <cang@codeaurora.org> --- drivers/scsi/ufs/ufshcd.c | 36 +++++++++++++++++++++--------------- 1 file changed, 21 insertions(+), 15 deletions(-)