Message ID | 20200706132113.21096-1-stanley.chu@mediatek.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v3] scsi: ufs: Cleanup completed request without interrupt notification |
>
> If somehow no interrupt notification is raised for a completed request
> and its doorbell bit is cleared by host, UFS driver needs to cleanup
> its outstanding bit in ufshcd_abort().
Theoretically, this case is already accounted for -
See line 6407: a proper error is issued and eventually outstanding req is cleared.

Can you go over the scenario you are attending line by line,
And explain why ufshcd_abort does not account for it?

>
> Otherwise, system may crash by below abnormal flow:
>
> After this request is requeued by SCSI layer with its
> outstanding bit set, the next completed request will trigger
> ufshcd_transfer_req_compl() to handle all "completed outstanding
> bits". In this time, the "abnormal outstanding bit" will be detected
> and the "requeued request" will be chosen to execute request
> post-processing flow. This is wrong and blk_finish_request() will
> BUG_ON because this request is still "alive".
>
> It is worth mentioning that before ufshcd_abort() cleans the timed-out
> request, driver need to check again if this request is really not
> handled by __ufshcd_transfer_req_compl() yet because it may be
> possible that the interrupt comes very lately before the cleaning.
What do you mean? Why checking the outstanding reqs isn't enough?

>
> Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
> ---
>  drivers/scsi/ufs/ufshcd.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index 8603b07045a6..f23fb14df9f6 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -6462,7 +6462,7 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  			/* command completed already */
>  			dev_err(hba->dev, "%s: cmd at tag %d successfully cleared from
> DB.\n",
>  				__func__, tag);
> -			goto out;
> +			goto cleanup;
But you've arrived here only if (!(test_bit(tag, &hba->outstanding_reqs))) -
See line 6400.

>  		} else {
>  			dev_err(hba->dev,
>  				"%s: no response from device. tag = %d, err %d\n",
> @@ -6496,9 +6496,14 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  		goto out;
>  	}
>
> +cleanup:
> +	spin_lock_irqsave(host->host_lock, flags);
> +	if (!test_bit(tag, &hba->outstanding_reqs)) {
> +		spin_unlock_irqrestore(host->host_lock, flags);
> +		goto out;
> +	}
>  	scsi_dma_unmap(cmd);
>
> -	spin_lock_irqsave(host->host_lock, flags);
>  	ufshcd_outstanding_req_clear(hba, tag);
>  	hba->lrb[tag].cmd = NULL;
>  	spin_unlock_irqrestore(host->host_lock, flags);
> --
> 2.18.0
Hi Avri,

On Thu, 2020-07-09 at 08:31 +0000, Avri Altman wrote:
> >
> > If somehow no interrupt notification is raised for a completed request
> > and its doorbell bit is cleared by host, UFS driver needs to cleanup
> > its outstanding bit in ufshcd_abort().
> Theoretically, this case is already accounted for -
> See line 6407: a proper error is issued and eventually outstanding req is cleared.
>
> Can you go over the scenario you are attending line by line,
> And explain why ufshcd_abort does not account for it?

Sure.

If a request using tag N is completed by UFS device without interrupt
notification till timeout happens, ufshcd_abort() will be invoked.

Since request completion flow is not executed, current status may be

- Tag N in hba->outstanding_reqs is set
- Tag N in doorbell register is not set

In this case, ufshcd_abort() flow would be

- This log is printed: "ufshcd_abort: cmd was completed, but without a
  notifying intr, tag = N"
- This log is printed: "ufshcd_abort: Device abort task at tag N"
- If hba->req_abort_skip is zero, QUERY_TASK command is sent
- Device responds "UPIU_TASK_MANAGEMENT_FUNC_COMPL"
- This log is printed: "ufshcd_abort: cmd at tag N not pending in the
  device."
- Doorbell tells that tag N is not set, so the driver goes to label
  "out" with this log printed: "ufshcd_abort: cmd at tag %d successfully
  cleared from DB."
- In label "out" section, no cleanup will be made, and then ufshcd_abort
  exits
- This request will be re-queued to request queue by SCSI timeout
  handler

Now, Inconsistent state shows-up: A request is "re-queued" but its
corresponding resource in UFS layer is not cleared, below flow will
trigger bad things,

- A new request with tag M is finished
- Interrupt is raised and ufshcd_transfer_req_compl() found both tag N
  and M can process the completion flow
- The post-processing flow for tag N will be executed while its request
  is still alive

I am sorry that below messages are only for old kernel in non-blk-mq
case. However above scenario will also trigger bad thing in blk-mq case.

>
> >
> > Otherwise, system may crash by below abnormal flow:
> >
> > After this request is requeued by SCSI layer with its
> > outstanding bit set, the next completed request will trigger
> > ufshcd_transfer_req_compl() to handle all "completed outstanding
> > bits". In this time, the "abnormal outstanding bit" will be detected
> > and the "requeued request" will be chosen to execute request
> > post-processing flow. This is wrong and blk_finish_request() will
> > BUG_ON because this request is still "alive".
> >
> > It is worth mentioning that before ufshcd_abort() cleans the timed-out
> > request, driver need to check again if this request is really not
> > handled by __ufshcd_transfer_req_compl() yet because it may be
> > possible that the interrupt comes very lately before the cleaning.
> What do you mean? Why checking the outstanding reqs isn't enough?
>
> >
> > Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
> > ---
> >  drivers/scsi/ufs/ufshcd.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> > index 8603b07045a6..f23fb14df9f6 100644
> > --- a/drivers/scsi/ufs/ufshcd.c
> > +++ b/drivers/scsi/ufs/ufshcd.c
> > @@ -6462,7 +6462,7 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> >  			/* command completed already */
> >  			dev_err(hba->dev, "%s: cmd at tag %d successfully cleared from
> > DB.\n",
> >  				__func__, tag);
> > -			goto out;
> > +			goto cleanup;
> But you've arrived here only if (!(test_bit(tag, &hba->outstanding_reqs))) -
> See line 6400.
>
> >  		} else {
> >  			dev_err(hba->dev,
> >  				"%s: no response from device. tag = %d, err %d\n",
> > @@ -6496,9 +6496,14 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> >  		goto out;
> >  	}
> >
> > +cleanup:
> > +	spin_lock_irqsave(host->host_lock, flags);
> > +	if (!test_bit(tag, &hba->outstanding_reqs)) {
> > +		spin_unlock_irqrestore(host->host_lock, flags);
> > +		goto out;
> > +	}
> >  	scsi_dma_unmap(cmd);
> >
> > -	spin_lock_irqsave(host->host_lock, flags);
> >  	ufshcd_outstanding_req_clear(hba, tag);
> >  	hba->lrb[tag].cmd = NULL;
> >  	spin_unlock_irqrestore(host->host_lock, flags);
> > --
> > 2.18.0
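
The inconsistent state walked through in the message above can be modelled in a few lines of user-space C. This is only a minimal sketch under stated assumptions, not driver code: `struct hba_model`, `abort_tag()` and `irq_complete()` are hypothetical names, and the two bitmaps merely stand in for `hba->outstanding_reqs` and the transfer request doorbell register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two bitmaps; not the real struct ufs_hba. */
struct hba_model {
	uint32_t outstanding_reqs; /* tags the driver still tracks */
	uint32_t doorbell;         /* tags the controller is still processing */
};

/*
 * Models the abort path for a tag whose doorbell bit is already clear.
 * With cleanup == false the outstanding bit is left behind (the "goto out"
 * behaviour described above); with cleanup == true it is cleared.
 */
static void abort_tag(struct hba_model *hba, unsigned int tag, bool cleanup)
{
	assert(tag < 32);
	if (cleanup)
		hba->outstanding_reqs &= ~(UINT32_C(1) << tag);
}

/*
 * Models the completion interrupt: every tracked tag whose doorbell bit is
 * clear is treated as completed. Returns the mask of tags post-processed.
 */
static uint32_t irq_complete(struct hba_model *hba)
{
	uint32_t completed = hba->outstanding_reqs & ~hba->doorbell;

	hba->outstanding_reqs &= ~completed;
	return completed;
}
```

With tag N = 3 completed silently and tag M = 5 still in flight, calling `abort_tag(&hba, 3, false)` and then letting M finish makes `irq_complete()` report both tags, which corresponds to the post-processing of a request that is still alive. Calling `abort_tag(&hba, 3, true)` instead leaves only tag 5 to be reported.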
>
> Hi Avri,
>
> On Thu, 2020-07-09 at 08:31 +0000, Avri Altman wrote:
> > >
> > > If somehow no interrupt notification is raised for a completed request
> > > and its doorbell bit is cleared by host, UFS driver needs to cleanup
> > > its outstanding bit in ufshcd_abort().
> > Theoretically, this case is already accounted for -
> > See line 6407: a proper error is issued and eventually outstanding req is
> cleared.
> >
> > Can you go over the scenario you are attending line by line,
> > And explain why ufshcd_abort does not account for it?
>
> Sure.
>
> If a request using tag N is completed by UFS device without interrupt
> notification till timeout happens, ufshcd_abort() will be invoked.
>
> Since request completion flow is not executed, current status may be
>
> - Tag N in hba->outstanding_reqs is set
> - Tag N in doorbell register is not set
>
> In this case, ufshcd_abort() flow would be
>
> - This log is printed: "ufshcd_abort: cmd was completed, but without a
> notifying intr, tag = N"
> - This log is printed: "ufshcd_abort: Device abort task at tag N"
> - If hba->req_abort_skip is zero, QUERY_TASK command is sent
> - Device responds "UPIU_TASK_MANAGEMENT_FUNC_COMPL"
> - This log is printed: "ufshcd_abort: cmd at tag N not pending in the
> device."
> - Doorbell tells that tag N is not set, so the driver goes to label
> "out" with this log printed: "ufshcd_abort: cmd at tag %d successfully
> cleared from DB."
> - In label "out" section, no cleanup will be made, and then ufshcd_abort
> exits
> - This request will be re-queued to request queue by SCSI timeout
> handler
>
> Now, Inconsistent state shows-up: A request is "re-queued" but its
> corresponding resource in UFS layer is not cleared, below flow will
> trigger bad things,
>
> - A new request with tag M is finished
> - Interrupt is raised and ufshcd_transfer_req_compl() found both tag N
> and M can process the completion flow
> - The post-processing flow for tag N will be executed while its request
> is still alive
>
> I am sorry that below messages are only for old kernel in non-blk-mq
> case. However above scenario will also trigger bad thing in blk-mq case.

Ok. Thanks.

>
> >
> > >
> > > Otherwise, system may crash by below abnormal flow:
> > >
> > > After this request is requeued by SCSI layer with its
> > > outstanding bit set, the next completed request will trigger
> > > ufshcd_transfer_req_compl() to handle all "completed outstanding
> > > bits". In this time, the "abnormal outstanding bit" will be detected
> > > and the "requeued request" will be chosen to execute request
> > > post-processing flow. This is wrong and blk_finish_request() will
> > > BUG_ON because this request is still "alive".
> > >
> > > It is worth mentioning that before ufshcd_abort() cleans the timed-out
> > > request, driver need to check again if this request is really not
> > > handled by __ufshcd_transfer_req_compl() yet because it may be
> > > possible that the interrupt comes very lately before the cleaning.
> > What do you mean? Why checking the outstanding reqs isn't enough?
> >
> > >
> > > Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
> > > ---
> > >  drivers/scsi/ufs/ufshcd.c | 9 +++++++--
> > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> > > index 8603b07045a6..f23fb14df9f6 100644
> > > --- a/drivers/scsi/ufs/ufshcd.c
> > > +++ b/drivers/scsi/ufs/ufshcd.c
> > > @@ -6462,7 +6462,7 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> > >  			/* command completed already */
> > >  			dev_err(hba->dev, "%s: cmd at tag %d successfully cleared
> from
> > > DB.\n",
> > >  				__func__, tag);
> > > -			goto out;
> > > +			goto cleanup;
> > But you've arrived here only if (!(test_bit(tag, &hba->outstanding_reqs))) -
> > See line 6400.
> >
> > >  		} else {
> > >  			dev_err(hba->dev,
> > >  				"%s: no response from device. tag = %d, err %d\n",
> > > @@ -6496,9 +6496,14 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> > >  		goto out;
> > >  	}
> > >
> > > +cleanup:
> > > +	spin_lock_irqsave(host->host_lock, flags);
> > > +	if (!test_bit(tag, &hba->outstanding_reqs)) {
Is this needed? it was already checked in line 6439.

Thanks,
Avri

> > > +		spin_unlock_irqrestore(host->host_lock, flags);
> > > +		goto out;
> > > +	}
> > >  	scsi_dma_unmap(cmd);
> > >
> > > -	spin_lock_irqsave(host->host_lock, flags);
> > >  	ufshcd_outstanding_req_clear(hba, tag);
> > >  	hba->lrb[tag].cmd = NULL;
> > >  	spin_unlock_irqrestore(host->host_lock, flags);
> > > --
> > > 2.18.0
On 2020-07-06 06:21, Stanley Chu wrote:
> If somehow no interrupt notification is raised for a completed request
> and its doorbell bit is cleared by host, UFS driver needs to cleanup
> its outstanding bit in ufshcd_abort().

How is it possible that no interrupt notification is raised for a completed
request? Is this the result of a hardware shortcoming or rather the result
of how the UFS driver works? In the latter case, is this patch perhaps a
workaround? If so, has it been considered to fix the root cause instead of
implementing a workaround?

In section 7.2.3 of the UFS specification I found the following about how
to process request completions: "Software determines if new TRs have
completed since step #2, by repeating one of the two methods described in
step #2. If new TRs have completed, software repeats the sequence from step
#3." Is such a loop perhaps missing from the Linux UFS driver?

Thanks,

Bart.
Hi Bart and Avri,

On Sun, 2020-07-12 at 18:39 -0700, Bart Van Assche wrote:
> On 2020-07-06 06:21, Stanley Chu wrote:
> > If somehow no interrupt notification is raised for a completed request
> > and its doorbell bit is cleared by host, UFS driver needs to cleanup
> > its outstanding bit in ufshcd_abort().
>
> How is it possible that no interrupt notification is raised for a completed
> request? Is this the result of a hardware shortcoming or rather the result
> of how the UFS driver works? In the latter case, is this patch perhaps a
> workaround? If so, has it been considered to fix the root cause instead of
> implementing a workaround?

Actually this fail is triggered by "error injection" to produce a
command timeout event for checking if anything can be improved or fixed.

I agree that "no interrupt notification" may be something wrong in
hardware and the root cause shall be fixed in the highest priority.
However from this injection, we found ufshcd_abort() indeed has a defect
flow for a corner case, so we are looking for the solution to fix the
"hole".

What would you think if Linux driver shall consider this case? If this
is not necessary, I would drop this patch : )

Thanks a lot,
Stanley Chu

>
> In section 7.2.3 of the UFS specification I found the following about how
> to process request completions: "Software determines if new TRs have
> completed since step #2, by repeating one of the two methods described in
> step #2. If new TRs have completed, software repeats the sequence from step
> #3." Is such a loop perhaps missing from the Linux UFS driver?
>
> Thanks,
>
> Bart.
>
> Hi Bart and Avri,
>
> On Sun, 2020-07-12 at 18:39 -0700, Bart Van Assche wrote:
> > On 2020-07-06 06:21, Stanley Chu wrote:
> > > If somehow no interrupt notification is raised for a completed request
> > > and its doorbell bit is cleared by host, UFS driver needs to cleanup
> > > its outstanding bit in ufshcd_abort().
> >
> > How is it possible that no interrupt notification is raised for a completed
> > request? Is this the result of a hardware shortcoming or rather the result
> > of how the UFS driver works? In the latter case, is this patch perhaps a
> > workaround? If so, has it been considered to fix the root cause instead of
> > implementing a workaround?
>
> Actually this fail is triggered by "error injection" to produce a
> command timeout event for checking if anything can be improved or fixed.
>
> I agree that "no interrupt notification" may be something wrong in
> hardware and the root cause shall be fixed in the highest priority.
> However from this injection, we found ufshcd_abort() indeed has a defect
> flow for a corner case, so we are looking for the solution to fix the
> "hole".
>
> What would you think if Linux driver shall consider this case? If this
> is not necessary, I would drop this patch : )
Artificially injecting errors is a very common validation mechanism,
Provided that you are not breaking anything of the upper-layers,
Which I don't think you are doing.

Can you refer please to my last comment?

>
> Thanks a lot,
> Stanley Chu
>
> >
> > In section 7.2.3 of the UFS specification I found the following about how
> > to process request completions: "Software determines if new TRs have
> > completed since step #2, by repeating one of the two methods described in
> > step #2. If new TRs have completed, software repeats the sequence from
> step
> > #3." Is such a loop perhaps missing from the Linux UFS driver?
Could not find that citation.
What version of the spec are you using?

Thanks,
Avri

> >
> > Thanks,
> >
> > Bart.
Hi Avri,

Sorry for the late response.

On Sun, 2020-07-12 at 10:04 +0000, Avri Altman wrote:
>
> >
> > Hi Avri,
> >
> > On Thu, 2020-07-09 at 08:31 +0000, Avri Altman wrote:
> > > >
> > > > If somehow no interrupt notification is raised for a completed request
> > > > and its doorbell bit is cleared by host, UFS driver needs to cleanup
> > > > its outstanding bit in ufshcd_abort().
> > > Theoretically, this case is already accounted for -
> > > See line 6407: a proper error is issued and eventually outstanding req is
> > cleared.
> > >
> > > Can you go over the scenario you are attending line by line,
> > > And explain why ufshcd_abort does not account for it?
> >
> > Sure.
> >
> > If a request using tag N is completed by UFS device without interrupt
> > notification till timeout happens, ufshcd_abort() will be invoked.
> >
> > Since request completion flow is not executed, current status may be
> >
> > - Tag N in hba->outstanding_reqs is set
> > - Tag N in doorbell register is not set
> >
> > In this case, ufshcd_abort() flow would be
> >
> > - This log is printed: "ufshcd_abort: cmd was completed, but without a
> > notifying intr, tag = N"
> > - This log is printed: "ufshcd_abort: Device abort task at tag N"
> > - If hba->req_abort_skip is zero, QUERY_TASK command is sent
> > - Device responds "UPIU_TASK_MANAGEMENT_FUNC_COMPL"
> > - This log is printed: "ufshcd_abort: cmd at tag N not pending in the
> > device."
> > - Doorbell tells that tag N is not set, so the driver goes to label
> > "out" with this log printed: "ufshcd_abort: cmd at tag %d successfully
> > cleared from DB."
> > - In label "out" section, no cleanup will be made, and then ufshcd_abort
> > exits
> > - This request will be re-queued to request queue by SCSI timeout
> > handler
> >
> > Now, Inconsistent state shows-up: A request is "re-queued" but its
> > corresponding resource in UFS layer is not cleared, below flow will
> > trigger bad things,
> >
> > - A new request with tag M is finished
> > - Interrupt is raised and ufshcd_transfer_req_compl() found both tag N
> > and M can process the completion flow
> > - The post-processing flow for tag N will be executed while its request
> > is still alive
> >
> > I am sorry that below messages are only for old kernel in non-blk-mq
> > case. However above scenario will also trigger bad thing in blk-mq case.
>
> Ok. Thanks.
>
> >
> > >
> > > >
> > > > Otherwise, system may crash by below abnormal flow:
> > > >
> > > > After this request is requeued by SCSI layer with its
> > > > outstanding bit set, the next completed request will trigger
> > > > ufshcd_transfer_req_compl() to handle all "completed outstanding
> > > > bits". In this time, the "abnormal outstanding bit" will be detected
> > > > and the "requeued request" will be chosen to execute request
> > > > post-processing flow. This is wrong and blk_finish_request() will
> > > > BUG_ON because this request is still "alive".
> > > >
> > > > It is worth mentioning that before ufshcd_abort() cleans the timed-out
> > > > request, driver need to check again if this request is really not
> > > > handled by __ufshcd_transfer_req_compl() yet because it may be
> > > > possible that the interrupt comes very lately before the cleaning.
> > > What do you mean? Why checking the outstanding reqs isn't enough?
> > >
> > > >
> > > > Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
> > > > ---
> > > >  drivers/scsi/ufs/ufshcd.c | 9 +++++++--
> > > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> > > > index 8603b07045a6..f23fb14df9f6 100644
> > > > --- a/drivers/scsi/ufs/ufshcd.c
> > > > +++ b/drivers/scsi/ufs/ufshcd.c
> > > > @@ -6462,7 +6462,7 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> > > >  			/* command completed already */
> > > >  			dev_err(hba->dev, "%s: cmd at tag %d successfully cleared
> > from
> > > > DB.\n",
> > > >  				__func__, tag);
> > > > -			goto out;
> > > > +			goto cleanup;
> > > But you've arrived here only if (!(test_bit(tag, &hba->outstanding_reqs))) -
> > > See line 6400.
> > >
> > > >  		} else {
> > > >  			dev_err(hba->dev,
> > > >  				"%s: no response from device. tag = %d, err %d\n",
> > > > @@ -6496,9 +6496,14 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> > > >  		goto out;
> > > >  	}
> > > >
> > > > +cleanup:
> > > > +	spin_lock_irqsave(host->host_lock, flags);
> > > > +	if (!test_bit(tag, &hba->outstanding_reqs)) {
> Is this needed? it was already checked in line 6439.
>

I am worried about the case that interrupt comes very lately. For
example, if interrupt finally comes while ufshcd_abort() is handling
this command, then probably this command may be completed first by
interrupt handler. In this case, ufshcd_abort() shall not clear this
command again. In contrast, if ufshcd_abort() clears this command first,
then interrupt shall not complete it. Thus here checking
hba->outstanding_req with host lock held is required to prevent above
racing.

Thanks,
Stanley Chu
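
The race described in the message above (a late interrupt and the abort handler both trying to retire the same tag) comes down to testing and clearing the outstanding bit inside one critical section. The following is a hedged user-space sketch of that idea, not the driver's code: a `pthread_mutex_t` stands in for `host->host_lock`, and `struct hba_model`, `hba_model_init()`, `claim_tag()`, `irq_path()` and `abort_path()` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the host state guarded by host_lock. */
struct hba_model {
	pthread_mutex_t lock;      /* plays the role of host->host_lock */
	uint32_t outstanding_reqs; /* one bit per in-flight tag */
};

static void hba_model_init(struct hba_model *hba, uint32_t outstanding)
{
	pthread_mutex_init(&hba->lock, NULL);
	hba->outstanding_reqs = outstanding;
}

/*
 * Test and clear the outstanding bit in one critical section.
 * Returns true only for the caller that found the bit still set.
 */
static bool claim_tag(struct hba_model *hba, unsigned int tag)
{
	uint32_t bit = UINT32_C(1) << tag;
	bool owned;

	pthread_mutex_lock(&hba->lock);
	owned = (hba->outstanding_reqs & bit) != 0;
	if (owned)
		hba->outstanding_reqs &= ~bit;
	pthread_mutex_unlock(&hba->lock);
	return owned;
}

/* Late interrupt path: complete the request only if the bit was still set. */
static bool irq_path(struct hba_model *hba, unsigned int tag)
{
	return claim_tag(hba, tag);
}

/* Abort path: release resources only if the interrupt has not done so. */
static bool abort_path(struct hba_model *hba, unsigned int tag)
{
	return claim_tag(hba, tag);
}
```

Whichever path runs first gets `true` and performs the cleanup; the other gets `false` and backs off. This mirrors the intent of the extra `test_bit()` under the lock in the proposed `cleanup:` label.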
> > > > > +cleanup:
> > > > > +	spin_lock_irqsave(host->host_lock, flags);
> > > > > +	if (!test_bit(tag, &hba->outstanding_reqs)) {
> > Is this needed? it was already checked in line 6439.
> >
>
> I am worried about the case that interrupt comes very lately.
scsi timeout is 30sec - do you expect an interrupt to arrive after that?

Thanks,
Avri

>For
> example, if interrupt finally comes while ufshcd_abort() is handling
> this command, then probably this command may be completed first by
> interrupt handler. In this case, ufshcd_abort() shall not clear this
> command again. In contrast, if ufshcd_abort() clears this command first,
> then interrupt shall not complete it. Thus here checking
> hba->outstanding_req with host lock held is required to prevent above
> racing.
>
> Thanks,
> Stanley Chu
>
Hi Avri,

On Tue, 2020-07-14 at 09:29 +0000, Avri Altman wrote:
> > > > > > +cleanup:
> > > > > > +	spin_lock_irqsave(host->host_lock, flags);
> > > > > > +	if (!test_bit(tag, &hba->outstanding_reqs)) {
> > > Is this needed? it was already checked in line 6439.
> > >
> >
> > I am worried about the case that interrupt comes very lately.
> scsi timeout is 30sec - do you expect an interrupt to arrive after that?
>

Yeah, I agree that a 30s delayed interrupt sounds kind of ridiculous.
This checking is just to make the cleanup flow safer.

Thanks,
Stanley Chu
On 2020-07-13 01:10, Avri Altman wrote:
> Artificially injecting errors is a very common validation mechanism,
> Provided that you are not breaking anything of the upper-layers,
> Which I don't think you are doing.

Hi Avri,

My concern is that the code that is being added in the abort handler
sooner or later will evolve into a duplicate of the regular completion
path. Wouldn't it be better to poll for completions from the timeout
handler by calling ufshcd_transfer_req_compl() instead of duplicating
that function?

>>> In section 7.2.3 of the UFS specification I found the following about how
>>> to process request completions: "Software determines if new TRs have
>>> completed since step #2, by repeating one of the two methods described in
>>> step #2. If new TRs have completed, software repeats the sequence from
>>> step #3." Is such a loop perhaps missing from the Linux UFS driver?
>
> Could not find that citation.
> What version of the spec are you using?

That quote comes from the following document: "Universal Flash Storage
Host Controller Interface (UFSHCI); Version 2.1; JESD223C; (Revision of
JESD223B, September 2013); MARCH 2016".

Bart.
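
The re-check loop that the quoted UFSHCI passage describes can be sketched roughly as follows. This is an illustrative user-space model only, under the assumption that "completed" means "tracked by software but no longer set in the doorbell"; `struct sim_hba`, `read_doorbell()` and `drain_completions()` are hypothetical names, and the doorbell is simulated by an array of values returned on successive reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated controller; successive doorbell reads come from an array. */
struct sim_hba {
	uint32_t outstanding_reqs; /* tags issued and not yet completed */
	const uint32_t *db_reads;  /* value returned by each doorbell read */
	size_t n_reads;            /* number of entries in db_reads, must be > 0 */
	size_t next;               /* index of the next read */
};

/* Return the next simulated doorbell value; the last value repeats forever. */
static uint32_t read_doorbell(struct sim_hba *hba)
{
	size_t i = hba->next < hba->n_reads ? hba->next : hba->n_reads - 1;

	hba->next++;
	return hba->db_reads[i];
}

/*
 * Keep re-reading the doorbell until a pass finds no newly completed tag.
 * Returns the number of tags retired across all passes.
 */
static unsigned int drain_completions(struct sim_hba *hba)
{
	unsigned int handled = 0;

	for (;;) {
		uint32_t done = hba->outstanding_reqs & ~read_doorbell(hba);

		if (!done)
			break;
		hba->outstanding_reqs &= ~done;
		/* Count the set bits, one per retired tag. */
		for (; done; done &= done - 1)
			handled++;
	}
	return handled;
}
```

Such a loop only picks up requests that finish while the handler is already running. It does nothing for a completion whose interrupt never arrives, because the handler is then never entered in the first place.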
Hi Bart, Avri,

On Tue, 2020-07-14 at 21:00 -0700, Bart Van Assche wrote:
> On 2020-07-13 01:10, Avri Altman wrote:
> > Artificially injecting errors is a very common validation mechanism,
> > Provided that you are not breaking anything of the upper-layers,
> > Which I don't think you are doing.
>

As the concerns of below questions,

"scsi timeout is 30sec - do you expect an interrupt to arrive after
that?"

Actually in my test scenario, the flow works well without re-checking
"outstanding_reqs" in "cleanup" section in ufshcd_abort(), so I would
remove this checking first and resend this fix (with refined commit
message according to blk-mq, not legacy blk).

Please let me know if you have any suggestions.

> Hi Avri,
>
> My concern is that the code that is being added in the abort handler
> sooner or later will evolve into a duplicate of the regular completion
> path. Wouldn't it be better to poll for completions from the timeout
> handler by calling ufshcd_transfer_req_compl() instead of duplicating
> that function?
>

The duplicated calls of cleanup job would be as below,

	scsi_dma_unmap(cmd);
	hba->lrb[tag].cmd = NULL;
	ufshcd_outstanding_req_clear(hba, tag);

As your suggestions, above calls could be re-factored but the third call
in __ufshcd_transfer_req_compl() would be more efficient by

	hba->outstanding_reqs ^= completed_reqs;

for all handled requests in interrupt handler.

Here we could not directly use "ufshcd_transfer_req_compl()" or its
inner function "__ufshcd_transfer_req_compl()" since at least
scsi_done() is not required in ufshcd_abort() because the completion
flow will be handled by SCSI error handler, not ufshcd_abort() itself.

> >>> In section 7.2.3 of the UFS specification I found the following about how
> >>> to process request completions: "Software determines if new TRs have
> >>> completed since step #2, by repeating one of the two methods described in
> >>> step #2. If new TRs have completed, software repeats the sequence from
> >>> step #3." Is such a loop perhaps missing from the Linux UFS driver?
> >
> > Could not find that citation.
> > What version of the spec are you using?
>
> That quote comes from the following document: "Universal Flash Storage
> Host Controller Interface (UFSHCI); Version 2.1; JESD223C; (Revision of
> JESD223B, September 2013); MARCH 2016".

Above description has already be implemented in ufshcd_intr() and
ufshcd_transfer_req_compl(). But this loop cannot save "missing
interrupt" just like this injected error case.

Thanks,
Stanley Chu
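
The trade-off discussed in the message above can be sketched as a shared per-slot release helper, a completion path that clears all handled bits with one XOR and signals completion upward, and an abort path that clears a single bit without signalling. This is a hypothetical user-space model, not a proposed refactoring of the driver: `struct hba_model`, `issue_tag()`, `release_slot()`, `complete_reqs()` and `abort_cleanup()` are made-up names, and the `done_calls` counter only stands in for the `scsi_done()` call mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_TAGS 32

/* Hypothetical per-tag state, standing in for hba->lrb[tag]. */
struct slot_model {
	bool has_cmd;
	bool dma_mapped;
};

struct hba_model {
	uint32_t outstanding_reqs;
	struct slot_model lrb[NUM_TAGS];
	unsigned int done_calls; /* counts simulated completion callbacks */
};

static void issue_tag(struct hba_model *hba, unsigned int tag)
{
	assert(tag < NUM_TAGS);
	hba->lrb[tag].has_cmd = true;
	hba->lrb[tag].dma_mapped = true;
	hba->outstanding_reqs |= UINT32_C(1) << tag;
}

/* Cleanup shared by both paths: unmap DMA and drop the command pointer. */
static void release_slot(struct hba_model *hba, unsigned int tag)
{
	hba->lrb[tag].dma_mapped = false;
	hba->lrb[tag].has_cmd = false;
}

/* Completion path: release each slot, signal completion, batch-clear via XOR. */
static void complete_reqs(struct hba_model *hba, uint32_t completed)
{
	/* XOR only clears correctly if every completed tag is outstanding. */
	assert((completed & ~hba->outstanding_reqs) == 0);
	for (unsigned int tag = 0; tag < NUM_TAGS; tag++) {
		if (completed & (UINT32_C(1) << tag)) {
			release_slot(hba, tag);
			hba->done_calls++;
		}
	}
	hba->outstanding_reqs ^= completed;
}

/* Abort path: same slot cleanup, single-bit clear, no completion callback. */
static void abort_cleanup(struct hba_model *hba, unsigned int tag)
{
	assert(tag < NUM_TAGS);
	release_slot(hba, tag);
	hba->outstanding_reqs &= ~(UINT32_C(1) << tag);
}
```

The XOR clears every handled tag in one operation, but it relies on each completed bit being set in the outstanding mask, which is why the sketch asserts that precondition.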
If somehow no interrupt notification is raised for a completed request
and its doorbell bit is cleared by host, UFS driver needs to cleanup
its outstanding bit in ufshcd_abort().

Otherwise, system may crash by below abnormal flow:

After this request is requeued by SCSI layer with its
outstanding bit set, the next completed request will trigger
ufshcd_transfer_req_compl() to handle all "completed outstanding
bits". In this time, the "abnormal outstanding bit" will be detected
and the "requeued request" will be chosen to execute request
post-processing flow. This is wrong and blk_finish_request() will
BUG_ON because this request is still "alive".

It is worth mentioning that before ufshcd_abort() cleans the timed-out
request, driver need to check again if this request is really not
handled by __ufshcd_transfer_req_compl() yet because it may be
possible that the interrupt comes very lately before the cleaning.

Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
---
 drivers/scsi/ufs/ufshcd.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 8603b07045a6..f23fb14df9f6 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -6462,7 +6462,7 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
 			/* command completed already */
 			dev_err(hba->dev, "%s: cmd at tag %d successfully cleared from DB.\n",
 				__func__, tag);
-			goto out;
+			goto cleanup;
 		} else {
 			dev_err(hba->dev,
 				"%s: no response from device. tag = %d, err %d\n",
@@ -6496,9 +6496,14 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
 		goto out;
 	}
 
+cleanup:
+	spin_lock_irqsave(host->host_lock, flags);
+	if (!test_bit(tag, &hba->outstanding_reqs)) {
+		spin_unlock_irqrestore(host->host_lock, flags);
+		goto out;
+	}
 	scsi_dma_unmap(cmd);
 
-	spin_lock_irqsave(host->host_lock, flags);
 	ufshcd_outstanding_req_clear(hba, tag);
 	hba->lrb[tag].cmd = NULL;
 	spin_unlock_irqrestore(host->host_lock, flags);