Message ID | 20241001091917.6917-3-peter.wang@mediatek.com (mailing list archive) |
---|---|
State | Accepted |
Headers | show |
Series | fix abort defect | expand |
On 10/1/24 2:19 AM, peter.wang@mediatek.com wrote: > From: Peter Wang <peter.wang@mediatek.com> > > After the SQ cleanup fix, the CQ will receive a response with > the corresponding tag marked as OCS: ABORTED. To align with > the behavior of Legacy SDB mode, the handling of OCS: ABORTED > has been changed to match that of OCS_INVALID_COMMAND_STATUS > (SDB), with both returning a SCSI result of DID_REQUEUE. > > Furthermore, the workaround implemented before the SQ cleanup > fix can be removed. > > Fixes: ab248643d3d6 ("scsi: ufs: core: Add error handling for MCQ mode") > Cc: stable@vger.kernel.org > Signed-off-by: Peter Wang <peter.wang@mediatek.com> > --- > drivers/ufs/core/ufshcd.c | 20 ++++---------------- > 1 file changed, 4 insertions(+), 16 deletions(-) > > diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c > index 24a32e2fd75e..8e2a7889a565 100644 > --- a/drivers/ufs/core/ufshcd.c > +++ b/drivers/ufs/core/ufshcd.c > @@ -5417,10 +5417,12 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp, > } > break; > case OCS_ABORTED: > - result |= DID_ABORT << 16; > - break; > case OCS_INVALID_COMMAND_STATUS: > result |= DID_REQUEUE << 16; > + dev_warn(hba->dev, > + "OCS %s from controller for tag %d\n", > + (ocs == OCS_ABORTED? "aborted" : "invalid"), > + lrbp->task_tag); > break; > case OCS_INVALID_CMD_TABLE_ATTR: > case OCS_INVALID_PRDT_ATTR: > @@ -6466,26 +6468,12 @@ static bool ufshcd_abort_one(struct request *rq, void *priv) > struct scsi_device *sdev = cmd->device; > struct Scsi_Host *shost = sdev->host; > struct ufs_hba *hba = shost_priv(shost); > - struct ufshcd_lrb *lrbp = &hba->lrb[tag]; > - struct ufs_hw_queue *hwq; > - unsigned long flags; > > *ret = ufshcd_try_to_abort_task(hba, tag); > dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag, > hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1, > *ret ? "failed" : "succeeded"); > > - /* Release cmd in MCQ mode if abort succeeds */ > - if (hba->mcq_enabled && (*ret == 0)) { > - hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp->cmd)); > - if (!hwq) > - return 0; > - spin_lock_irqsave(&hwq->cq_lock, flags); > - if (ufshcd_cmd_inflight(lrbp->cmd)) > - ufshcd_release_scsi_cmd(hba, lrbp); > - spin_unlock_irqrestore(&hwq->cq_lock, flags); > - } > - > return *ret == 0; > } As mentioned before, ufshcd_try_to_abort_task() cannot handle concurrent scsi_done() calls. ufshcd_abort_one() calls ufshcd_try_to_abort_task() without even trying to prevent that scsi_done() is called concurrently. Since this could result in a kernel crash, I think that it is important that this gets fixed, even if it requires modifying the SCSI core. Bart.
On Tue, 2024-10-01 at 10:13 -0700, Bart Van Assche wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > On 10/1/24 2:19 AM, peter.wang@mediatek.com wrote: > > From: Peter Wang <peter.wang@mediatek.com> > > > > After the SQ cleanup fix, the CQ will receive a response with > > the corresponding tag marked as OCS: ABORTED. To align with > > the behavior of Legacy SDB mode, the handling of OCS: ABORTED > > has been changed to match that of OCS_INVALID_COMMAND_STATUS > > (SDB), with both returning a SCSI result of DID_REQUEUE. > > > > Furthermore, the workaround implemented before the SQ cleanup > > fix can be removed. > > > > Fixes: ab248643d3d6 ("scsi: ufs: core: Add error handling for MCQ > mode") > > Cc: stable@vger.kernel.org > > Signed-off-by: Peter Wang <peter.wang@mediatek.com> > > --- > > drivers/ufs/core/ufshcd.c | 20 ++++---------------- > > 1 file changed, 4 insertions(+), 16 deletions(-) > > > > diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c > > index 24a32e2fd75e..8e2a7889a565 100644 > > --- a/drivers/ufs/core/ufshcd.c > > +++ b/drivers/ufs/core/ufshcd.c > > @@ -5417,10 +5417,12 @@ ufshcd_transfer_rsp_status(struct ufs_hba > *hba, struct ufshcd_lrb *lrbp, > > } > > break; > > case OCS_ABORTED: > > -result |= DID_ABORT << 16; > > -break; > > case OCS_INVALID_COMMAND_STATUS: > > result |= DID_REQUEUE << 16; > > +dev_warn(hba->dev, > > +"OCS %s from controller for tag %d\n", > > +(ocs == OCS_ABORTED? "aborted" : "invalid"), > > +lrbp->task_tag); > > break; > > case OCS_INVALID_CMD_TABLE_ATTR: > > case OCS_INVALID_PRDT_ATTR: > > @@ -6466,26 +6468,12 @@ static bool ufshcd_abort_one(struct request > *rq, void *priv) > > struct scsi_device *sdev = cmd->device; > > struct Scsi_Host *shost = sdev->host; > > struct ufs_hba *hba = shost_priv(shost); > > -struct ufshcd_lrb *lrbp = &hba->lrb[tag]; > > -struct ufs_hw_queue *hwq; > > -unsigned long flags; > > > > *ret = ufshcd_try_to_abort_task(hba, tag); > > dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag, > > hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1, > > *ret ? "failed" : "succeeded"); > > > > -/* Release cmd in MCQ mode if abort succeeds */ > > -if (hba->mcq_enabled && (*ret == 0)) { > > -hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp->cmd)); > > -if (!hwq) > > -return 0; > > -spin_lock_irqsave(&hwq->cq_lock, flags); > > -if (ufshcd_cmd_inflight(lrbp->cmd)) > > -ufshcd_release_scsi_cmd(hba, lrbp); > > -spin_unlock_irqrestore(&hwq->cq_lock, flags); > > -} > > - > > return *ret == 0; > > } > > As mentioned before, ufshcd_try_to_abort_task() cannot handle > concurrent > scsi_done() calls. ufshcd_abort_one() calls > ufshcd_try_to_abort_task() > without even trying to prevent that scsi_done() is called > concurrently. > Since this could result in a kernel crash, I think that it is > important > that this gets fixed, even if it requires modifying the SCSI core. > > Bart. > > Hi Bart, This patch merely aligns with the approach of SDB mode and does not involve the flow of scsi_done. Besides, I don't see any issue with concurrency between ufshcd_abort_one() calling ufshcd_try_to_abort_task() and scsi_done(). Can you point out the specific flow where the problem occurs? If there is one, shouldn't SDB mode have the same issue? Thanks Peter
On 10/2/24 5:42 AM, Peter Wang (王信友) wrote: > This patch merely aligns with the approach of SDB mode > and does not involve the flow of scsi_done. Besides, > I don't see any issue with concurrency between > ufshcd_abort_one() calling ufshcd_try_to_abort_task() > and scsi_done(). Can you point out the specific flow where > the problem occurs? If there is one, shouldn't SDB mode > have the same issue? Hi Peter, Correct, my comment applies to both legacy mode and MCQ mode. From the section in the UFS standard about ABORT TASK: "A response of FUNCTION COMPLETE shall indicate that the command was aborted or was not in the task set." In other words, if a command completes just before ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command that has already completed. In legacy mode, this call will succeed. Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to be decremented twice instead of only once. Do you agree that this can happen and also that it should be prevented that this happens? Thanks, Bart.
On Thu, 2024-10-03 at 13:02 -0700, Bart Van Assche wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > On 10/2/24 5:42 AM, Peter Wang (王信友) wrote: > > This patch merely aligns with the approach of SDB mode > > and does not involve the flow of scsi_done. Besides, > > I don't see any issue with concurrency between > > ufshcd_abort_one() calling ufshcd_try_to_abort_task() > > and scsi_done(). Can you point out the specific flow where > > the problem occurs? If there is one, shouldn't SDB mode > > have the same issue? > > Hi Peter, > > Correct, my comment applies to both legacy mode and MCQ mode. From > the > section in the UFS standard about ABORT TASK: "A response of FUNCTION > COMPLETE shall indicate that the command was aborted or was not in > the > task set." In other words, if a command completes just before > ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then > ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command > that has already completed. In legacy mode, this call will succeed. > Hi Bart, Yes, the legacy SDB mode is protected by the outstanding_lock. > Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call > ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to > be > decremented twice instead of only once. Do you agree that this can > happen and also that it should be prevented that this happens? > > Thanks, > > Bart. Sorry, I still don't understand why both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call ufshcd_release(hba)? Because I have already removed the ufshcd_release_scsi_cmd from ufshcd_abort_one, so the command won't be released immediately after ufshcd_try_to_abort_task succeeds. Instead, it will wait until the CQ Entry comes in before releasing. And since it is protected by the cq_lock, it should only release once, right? Thanks. Peter
On 10/7/24 12:20 AM, Peter Wang (王信友) wrote: > On Thu, 2024-10-03 at 13:02 -0700, Bart Van Assche wrote: >> >> External email : Please do not click links or open attachments until >> you have verified the sender or the content. >> On 10/2/24 5:42 AM, Peter Wang (王信友) wrote: >>> This patch merely aligns with the approach of SDB mode >>> and does not involve the flow of scsi_done. Besides, >>> I don't see any issue with concurrency between >>> ufshcd_abort_one() calling ufshcd_try_to_abort_task() >>> and scsi_done(). Can you point out the specific flow where >>> the problem occurs? If there is one, shouldn't SDB mode >>> have the same issue? >> >> Hi Peter, >> >> Correct, my comment applies to both legacy mode and MCQ mode. From >> the >> section in the UFS standard about ABORT TASK: "A response of FUNCTION >> COMPLETE shall indicate that the command was aborted or was not in >> the >> task set." In other words, if a command completes just before >> ufshcd_try_to_abort_task() calls ufshcd_issue_tm_cmd(), then >> ufshcd_try_to_abort_task() will call ufshcd_clear_cmd() for a command >> that has already completed. In legacy mode, this call will succeed. >> > > Hi Bart, > > Yes, the legacy SDB mode is protected by the outstanding_lock. > > >> Hence, both ufshcd_compl_one_cqe() and ufshcd_abort_all() will call >> ufshcd_release(hba). This will cause hba->clk_gating.active_reqs to >> be >> decremented twice instead of only once. Do you agree that this can >> happen and also that it should be prevented that this happens? >> >> Thanks, >> >> Bart. > > Sorry, I still don't understand why both ufshcd_compl_one_cqe() > and ufshcd_abort_all() will call ufshcd_release(hba)? > Because I have already removed the ufshcd_release_scsi_cmd from > ufshcd_abort_one, so the command won't be released immediately > after ufshcd_try_to_abort_task succeeds. Instead, it will wait > until the CQ Entry comes in before releasing. And since it is > protected by the cq_lock, it should only release once, right? Hi Peter, I think what you wrote applies to MCQ mode only. In my previous email I clearly referred to "legacy mode" (SDB mode). Summarizing my previous email, I think that in legacy mode it is possible that ufshcd_release() is called twice while it only should be called once. Here are the possible solutions I see: * Add a function to the SCSI core for setting SCMD_STATE_COMPLETE. This may be controversial since no other SCSI LLD needs this functionality. * Changing the error handling approach in the UFS driver to the same approach other SCSI LLDs use: instead of using queue_work() to activate the error handler, call scsi_schedule_eh(). This will cause the error handler to be activated later, namely after all pending commands have timed out instead of aborting any pending commands first. * Add a variant of scsi_schedule_eh() to the SCSI core that accelerates error handling by calling scsi_timeout() on all pending commands. Thanks, Bart.
On Tue, 2024-10-08 at 11:29 -0700, Bart Van Assche wrote: > Hi Peter, > > I think what you wrote applies to MCQ mode only. In my previous email > I clearly referred to "legacy mode" (SDB mode). Summarizing my > previous > email, I think that in legacy mode it is possible that > ufshcd_release() > is called twice while it only should be called once. Here are the > possible solutions I see: > * Add a function to the SCSI core for setting SCMD_STATE_COMPLETE. > This > may be controversial since no other SCSI LLD needs this > functionality. > * Changing the error handling approach in the UFS driver to the same > approach other SCSI LLDs use: instead of using queue_work() to > activate the error handler, call scsi_schedule_eh(). This will > cause > the error handler to be activated later, namely after all pending > commands have timed out instead of aborting any pending commands > first. > * Add a variant of scsi_schedule_eh() to the SCSI core that > accelerates > error handling by calling scsi_timeout() on all pending commands. > > Thanks, > > Bart. > Hi Bart, Yes, this patch is only for MCQ mode, because only MCQ mode receives OCS: ABORTED, right? This patch doesn't modify any of the Legacy mode flows, does it? Additionally, I still don't understand why you say there would be an issue with legacy mode having duplicate ufshcd_release(hba) calls. As I mentioned before, it is protected by the outstanding_lock. Could you please clarify the detailed error flow? Furthermore, even if there is an issue with Legacy mode, it should be addressed by a separate patch, not by this one, which is intended to resolve the MCQ mode issue. We shouldn't mix two different issues together, don't you agree? Thanks Peter
On 10/1/24 2:19 AM, peter.wang@mediatek.com wrote: > After the SQ cleanup fix, the CQ will receive a response with > the corresponding tag marked as OCS: ABORTED. To align with > the behavior of Legacy SDB mode, the handling of OCS: ABORTED > has been changed to match that of OCS_INVALID_COMMAND_STATUS > (SDB), with both returning a SCSI result of DID_REQUEUE. > > Furthermore, the workaround implemented before the SQ cleanup > fix can be removed. Reviewed-by: Bart Van Assche <bvanassche@acm.org>
On 10/8/24 7:17 PM, Peter Wang (王信友) wrote: > Yes, this patch is only for MCQ mode, because only MCQ mode > receives OCS: ABORTED, right? This patch doesn't modify > any of the Legacy mode flows, does it? Agreed. What I mentioned in my email is an existing bug in the legacy flow for ufshcd_abort_all(). > Furthermore, even if there is an issue with Legacy mode, it > should be addressed by a separate patch, not by this one, which is > intended to resolve the MCQ mode issue. We shouldn't mix two > different issues together, don't you agree? Let's proceed with this patch series and let's address what I brought up in my email separately. With the current approach for error handling in the UFS driver, anyone who wants to verify or modify ufshcd_try_to_abort_task() has to consider all possible interleavings of ufshcd_try_to_abort_task() and the completion path (ufshcd_compl_one_cqe()). That's an unnecessary burden on UFS driver contributors. Additionally, this is error-prone. This applies to both modes (legacy and MCQ). I know of reports of sporadic crashes in legacy mode related to UFS error handling. I'm wondering whether these are perhaps the result of the issue I mentioned in a previous email. Anyway, I will look further into this myself as soon as I have the time. Thanks, Bart.
On Wed, 2024-10-09 at 11:06 -0700, Bart Van Assche wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > > On 10/8/24 7:17 PM, Peter Wang (王信友) wrote: > > Yes, this patch is only for MCQ mode, because only MCQ mode > > receives OCS: ABORTED, right? This patch doesn't modify > > any of the Legacy mode flows, does it? > > Agreed. What I mentioned in my email is an existing bug in the legacy > flow for ufshcd_abort_all(). > > > Furthermore, even if there is an issue with Legacy mode, it > > should be addressed by a separate patch, not by this one, which is > > intended to resolve the MCQ mode issue. We shouldn't mix two > > different issues together, don't you agree? > > Let's proceed with this patch series and let's address what I brought > up in my email separately. > > With the current approach for error handling in the UFS driver, > anyone > who wants to verify or modify ufshcd_try_to_abort_task() has to > consider > all possible interleavings of ufshcd_try_to_abort_task() and the > completion path (ufshcd_compl_one_cqe()). That's an unnecessary > burden > on UFS driver contributors. Additionally, this is error-prone. This > applies to both modes (legacy and MCQ). I know of reports of sporadic > crashes in legacy mode related to UFS error handling. I'm wondering > whether these are perhaps the result of the issue I mentioned in a > previous email. Anyway, I will look further into this myself as soon > as > I have the time. > > Thanks, > > Bart. Hi Bart, Thank you for your review. I currently cannot see the issue of duplicate releases in legacy SDB mode. ufshcd_try_to_abort_task() will directly reset if it fails. It is only in the case of success that we need to consider the possibility of ufshcd_compl_one_cqe. I believe the original design flow has already taken this into account, which is why there is protection with outstanding_lock/cq_lock. Perhaps we can wait for an actual example to occur before making corrections. Even if there is an issue, I think the probability should be very low, because the flow for legacy SDB mode has been in use for several years. Thank you again for your review. Thanks Peter
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c index 24a32e2fd75e..8e2a7889a565 100644 --- a/drivers/ufs/core/ufshcd.c +++ b/drivers/ufs/core/ufshcd.c @@ -5417,10 +5417,12 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp, } break; case OCS_ABORTED: - result |= DID_ABORT << 16; - break; case OCS_INVALID_COMMAND_STATUS: result |= DID_REQUEUE << 16; + dev_warn(hba->dev, + "OCS %s from controller for tag %d\n", + (ocs == OCS_ABORTED? "aborted" : "invalid"), + lrbp->task_tag); break; case OCS_INVALID_CMD_TABLE_ATTR: case OCS_INVALID_PRDT_ATTR: @@ -6466,26 +6468,12 @@ static bool ufshcd_abort_one(struct request *rq, void *priv) struct scsi_device *sdev = cmd->device; struct Scsi_Host *shost = sdev->host; struct ufs_hba *hba = shost_priv(shost); - struct ufshcd_lrb *lrbp = &hba->lrb[tag]; - struct ufs_hw_queue *hwq; - unsigned long flags; *ret = ufshcd_try_to_abort_task(hba, tag); dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag, hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1, *ret ? "failed" : "succeeded"); - /* Release cmd in MCQ mode if abort succeeds */ - if (hba->mcq_enabled && (*ret == 0)) { - hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp->cmd)); - if (!hwq) - return 0; - spin_lock_irqsave(&hwq->cq_lock, flags); - if (ufshcd_cmd_inflight(lrbp->cmd)) - ufshcd_release_scsi_cmd(hba, lrbp); - spin_unlock_irqrestore(&hwq->cq_lock, flags); - } - return *ret == 0; }