Message ID | 1562906656-27154-1-git-send-email-stanley.chu@mediatek.com (mailing list archive) |
---|---|
Headers | show |
Series | scsi: ufs: Fix broken hba->outstanding_tasks | expand |
Hi Avri, Alim and Pedrom, Gentle ping for this fix. On Fri, 2019-07-12 at 12:44 +0800, Stanley Chu wrote: > Currently bits in hba->outstanding_tasks are cleared only after their > corresponding task management commands are successfully done by > __ufshcd_issue_tm_cmd(). > > If timeout happens in a task management command, its corresponding > bit in hba->outstanding_tasks will not be cleared until next task > management command with the same tag used successfully finishes.‧ > > This is wrong and can lead to some issues, like power consumpton issue. > For example, ufshcd_release() and ufshcd_gate_work() will do nothing > if hba->outstanding_tasks is not zero even if both UFS host and devices > are actually idle. > > Because error handling flow, i.e., ufshcd_reset_and_restore(), will be > triggered after any task management command times out, we fix this by > clearing corresponding hba->outstanding_tasks bits during this flow. > To achieve this, we need a mask to track timed-out commands and thus > error handling flow can clear their tags specifically. > > Stanley Chu (2): > scsi: ufs: Make new function for clearing outstanding task bits > scsi: ufs: Fix broken hba->outstanding_tasks > > drivers/scsi/ufs/ufshcd.c | 49 +++++++++++++++++++++++++++++++++------ > drivers/scsi/ufs/ufshcd.h | 1 + > 2 files changed, 43 insertions(+), 7 deletions(-) > Thanks, Stanley
Hi, > > Currently bits in hba->outstanding_tasks are cleared only after their > corresponding task management commands are successfully done by > __ufshcd_issue_tm_cmd(). > > If timeout happens in a task management command, its corresponding > bit in hba->outstanding_tasks will not be cleared until next task > management command with the same tag used successfully finishes.‧ ufshcd_clear_tm_cmd is also called as part of ufshcd_err_handler. Does this change something in your assumptions? Thanks, Avri > > This is wrong and can lead to some issues, like power consumpton issue. > For example, ufshcd_release() and ufshcd_gate_work() will do nothing > if hba->outstanding_tasks is not zero even if both UFS host and devices > are actually idle. > > Because error handling flow, i.e., ufshcd_reset_and_restore(), will be > triggered after any task management command times out, we fix this by > clearing corresponding hba->outstanding_tasks bits during this flow. > To achieve this, we need a mask to track timed-out commands and thus > error handling flow can clear their tags specifically. > > Stanley Chu (2): > scsi: ufs: Make new function for clearing outstanding task bits > scsi: ufs: Fix broken hba->outstanding_tasks > > drivers/scsi/ufs/ufshcd.c | 49 +++++++++++++++++++++++++++++++++------ > drivers/scsi/ufs/ufshcd.h | 1 + > 2 files changed, 43 insertions(+), 7 deletions(-) > > -- > 2.18.0
> > Hi, > > > > > Currently bits in hba->outstanding_tasks are cleared only after their > > corresponding task management commands are successfully done by > > __ufshcd_issue_tm_cmd(). > > > > If timeout happens in a task management command, its corresponding > > bit in hba->outstanding_tasks will not be cleared until next task > > management command with the same tag used successfully finishes.‧ > ufshcd_clear_tm_cmd is also called as part of ufshcd_err_handler. > Does this change something in your assumptions? And BTW there is a specific __clear_bit in __ufshcd_issue_tm_cmd() in case of a TO. > > Thanks, > Avri >
> > > > > Hi, > > > > > > > > Currently bits in hba->outstanding_tasks are cleared only after their > > > corresponding task management commands are successfully done by > > > __ufshcd_issue_tm_cmd(). > > > > > > If timeout happens in a task management command, its corresponding > > > bit in hba->outstanding_tasks will not be cleared until next task > > > management command with the same tag used successfully finishes.‧ > > ufshcd_clear_tm_cmd is also called as part of ufshcd_err_handler. > > Does this change something in your assumptions? > And BTW there is a specific __clear_bit in __ufshcd_issue_tm_cmd() in case > of a TO. Gave it another look - If indeed this bit isn't cleared as part of the error flow that the timeout triggers, I think you should relate to ufshcd_clear_tm_cmd specifically in your commit log - Because this is the obvious place where the bit cleanup should take place. Also the fix should be much more intuitive IMO - Today we do __clear_bit() on success, ufshcd_clear_tm_cmd() on error, And also ufshcd_put_tm_slot() either way? Maybe you can choose a single place to clear it, without any additional code? Thanks, Avri
Hi Avri, On Mon, 2019-07-22 at 06:10 +0000, Avri Altman wrote: > > > > > > > > Hi, > > > > > > > > > > > Currently bits in hba->outstanding_tasks are cleared only after their > > > > corresponding task management commands are successfully done by > > > > __ufshcd_issue_tm_cmd(). > > > > > > > > If timeout happens in a task management command, its corresponding > > > > bit in hba->outstanding_tasks will not be cleared until next task > > > > management command with the same tag used successfully finishes.‧ > > > ufshcd_clear_tm_cmd is also called as part of ufshcd_err_handler. > > > Does this change something in your assumptions? > > And BTW there is a specific __clear_bit in __ufshcd_issue_tm_cmd() in case > > of a TO. > > Gave it another look - > If indeed this bit isn't cleared as part of the error flow that the timeout triggers, > I think you should relate to ufshcd_clear_tm_cmd specifically in your commit log - > Because this is the obvious place where the bit cleanup should take place. > > Also the fix should be much more intuitive IMO - > Today we do __clear_bit() on success, ufshcd_clear_tm_cmd() on error, > And also ufshcd_put_tm_slot() either way? > > Maybe you can choose a single place to clear it, without any additional code? ufshcd_clear_tm_cmd() is similar to ufshcd_clear_cmd() which tries to write timed-out bit in "clear register". They do not clean bits in both outstanding masks. Another reason to not to do outstanding bits cleanup in ufshcd_clear_tm_cmd() is: if host is unable to clear TM command by setting "clear register", the TM command may be still "outstanding" in the device. In this case, it may be better to do cleanup after reset is done. Cleanup includes bits in hba->outstanding_tasks and hba->tm_slots_in_use which is possibly cleaned too early by ufshcd_put_tm_slot() if command is timed-out now. Referring to error handling flow of hba->outstanding_reqs, all timed-out bits will be cleaned by ufshcd_reset_and_restore() => ufshcd_transfer_req_compl() after reset is done. Similar handling for hba->outstanding_tasks could be applied, i.e., do cleanup by ufshcd_reset_and_restore() => ufshcd_tmc_handler(). The next thing is what you suggested: How to make the fix more intuitive. Patchset v2 will be uploaded for review: It tries to "re-factor" cleanup jobs first, and then add fixed flow to make the whole patch more readable. One more thing, above description and flow is done through UFSHCD SCSI error handling routines registered with SCSI Midlayer. For TM command timeout happening in bsg path without error handling triggered by SCSI layer, we may need to consider to handle those tasks in future patches. > > Thanks, > Avri Thanks, Stanley > _______________________________________________ > Linux-mediatek mailing list > Linux-mediatek@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-mediatek
Stanley, > > Hi Avri, > > On Mon, 2019-07-22 at 06:10 +0000, Avri Altman wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > > > Currently bits in hba->outstanding_tasks are cleared only after their > > > > > corresponding task management commands are successfully done > by > > > > > __ufshcd_issue_tm_cmd(). > > > > > > > > > > If timeout happens in a task management command, its > corresponding > > > > > bit in hba->outstanding_tasks will not be cleared until next task > > > > > management command with the same tag used successfully > finishes.‧ > > > > ufshcd_clear_tm_cmd is also called as part of ufshcd_err_handler. > > > > Does this change something in your assumptions? > > > And BTW there is a specific __clear_bit in __ufshcd_issue_tm_cmd() in > case > > > of a TO. > > > > Gave it another look - > > If indeed this bit isn't cleared as part of the error flow that the timeout > triggers, > > I think you should relate to ufshcd_clear_tm_cmd specifically in your > commit log - > > Because this is the obvious place where the bit cleanup should take > place. > > > > Also the fix should be much more intuitive IMO - > > Today we do __clear_bit() on success, ufshcd_clear_tm_cmd() on error, > > And also ufshcd_put_tm_slot() either way? > > > > Maybe you can choose a single place to clear it, without any additional > code? > > ufshcd_clear_tm_cmd() is similar to ufshcd_clear_cmd() which tries to > write timed-out bit in "clear register". They do not clean bits in both > outstanding masks. > > Another reason to not to do outstanding bits cleanup in > ufshcd_clear_tm_cmd() is: if host is unable to clear TM command by > setting "clear register", the TM command may be still "outstanding" in > the device. In this case, it may be better to do cleanup after reset is > done. Cleanup includes bits in hba->outstanding_tasks and > hba->tm_slots_in_use which is possibly cleaned too early by > ufshcd_put_tm_slot() if command is timed-out now. > > Referring to error handling flow of hba->outstanding_reqs, all timed-out > bits will be cleaned by ufshcd_reset_and_restore() => > ufshcd_transfer_req_compl() after reset is done. Similar handling for > hba->outstanding_tasks could be applied, i.e., do cleanup by > ufshcd_reset_and_restore() => ufshcd_tmc_handler(). Fair enough. Thank you for the detailed explanation. > > The next thing is what you suggested: How to make the fix more > intuitive. Patchset v2 will be uploaded for review: It tries to > "re-factor" cleanup jobs first, and then add fixed flow to make the > whole patch more readable. > > One more thing, above description and flow is done through UFSHCD SCSI > error handling routines registered with SCSI Midlayer. For TM command > timeout happening in bsg path without error handling triggered by SCSI > layer, we may need to consider to handle those tasks in future patches. Please do. Thanks, Avri