Message ID | 20221101142410.31463-1-peter.wang@mediatek.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v1] ufs: core: wlun suspend SSU fail recovery | expand |
On 1/11/22 16:24, peter.wang@mediatek.com wrote: > From: Peter Wang <peter.wang@mediatek.com> > > When SSU fail in wlun suspend flow, trigger error handlder and handlder -> handler Why / how does SSU fail? > return busy to break the suspend. > If not, wlun runtime pm status become error and the consumer will > stuck in runtime suspend status. > > Signed-off-by: Peter Wang <peter.wang@mediatek.com> > --- > drivers/ufs/core/ufshcd.c | 16 +++++++++++++++- > 1 file changed, 15 insertions(+), 1 deletion(-) > > diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c > index b1f59a5fe632..2f2d3d5d8684 100644 > --- a/drivers/ufs/core/ufshcd.c > +++ b/drivers/ufs/core/ufshcd.c > @@ -8970,6 +8970,7 @@ static int __ufshcd_wl_suspend(struct ufs_hba *hba, enum ufs_pm_op pm_op) > enum ufs_pm_level pm_lvl; > enum ufs_dev_pwr_mode req_dev_pwr_mode; > enum uic_link_state req_link_state; > + unsigned long flags; > > hba->pm_op_in_progress = true; > if (pm_op != UFS_SHUTDOWN_PM) { > @@ -9049,8 +9050,21 @@ static int __ufshcd_wl_suspend(struct ufs_hba *hba, enum ufs_pm_op pm_op) > > if (!hba->dev_info.b_rpm_dev_flush_capable) { > ret = ufshcd_set_dev_pwr_mode(hba, req_dev_pwr_mode); > - if (ret) > + if (ret) { > + /* > + * If retrun err in suspend flow, IO will hang. retrun -> return > + * Trigger error handler and break suspend for > + * error recovery. > + */ > + spin_lock_irqsave(hba->host->host_lock, flags); > + hba->force_reset = true; > + ufshcd_schedule_eh_work(hba); __ufshcd_wl_suspend() is also used for shutdown and poweroff where scheduling EH is not appropriate. > + spin_unlock_irqrestore(hba->host->host_lock, > + flags); > + > + ret = -EBUSY; > goto enable_scaling; > + } > } > } >
On 11/30/22 02:17, Adrian Hunter wrote: > On 1/11/22 16:24, peter.wang@mediatek.com wrote: >> From: Peter Wang <peter.wang@mediatek.com> >> >> When SSU fail in wlun suspend flow, trigger error handlder and > > handlder -> handler > > Why / how does SSU fail? I'm not sure but the issue that Peter is trying to fix with this patch may already have been fixed by my patch series "Fix a deadlock in the UFS driver". Thanks, Bart.
On Wed, 2022-11-30 at 12:17 +0200, Adrian Hunter wrote: > On 1/11/22 16:24, peter.wang@mediatek.com wrote: > > From: Peter Wang <peter.wang@mediatek.com> > > > > When SSU fail in wlun suspend flow, trigger error handlder and > > handlder -> handler will fix next versio, thanks. > > Why / how does SSU fail? > Unknow, it is seldom and we cannot catch fail the failure scenarios. We suspect is maybe the device problem which response wrong sense code. So, I think the driver is better to cover this error. Here is our cmd histroy log with 3 times SSU fail, and retrun busy to do error handler. 26-c(1), 2576.085601525, 9396,18, rs, ret=-16, time_us= 41136, pwr_mode=1, link_status=1 25-r(1), 2576.085027371, 9396, 1,0x1b,t=53,db:0x 0,is:0x 80000,crypt:0,0,lba= 0,len= -1, 31308 24-r(1), 2576.084996910, 9396, 0,0x1b,t=53,db:0x 0,is:0x 80000,crypt:0,0,lba= 0,len= -1, 0 23-r(1), 2576.084919679, 9396, 1,0x1b,t=52,db:0x 0,is:0x 80000,crypt:0,0,lba= 0,len= -1, 32846 22-r(1), 2576.084887679, 9396, 0,0x1b,t=52,db:0x 0,is:0x 80000,crypt:0,0,lba= 0,len= -1, 0 21-r(1), 2576.084704295, 0, 1,0x1b,t=55,db:0x 0,is:0x 80000,crypt:0,0,lba= 0,len= -1, 39777539 20-r(1), 2576.044927679, 9396, 0,0x1b,t=55,db:0x 0,is:0x 80000,crypt:0,0,lba= 0,len= -1, 0 > > return busy to break the suspend. > > If not, wlun runtime pm status become error and the consumer will > > stuck in runtime suspend status. > > > > Signed-off-by: Peter Wang <peter.wang@mediatek.com> > > --- > > drivers/ufs/core/ufshcd.c | 16 +++++++++++++++- > > 1 file changed, 15 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c > > index b1f59a5fe632..2f2d3d5d8684 100644 > > --- a/drivers/ufs/core/ufshcd.c > > +++ b/drivers/ufs/core/ufshcd.c > > @@ -8970,6 +8970,7 @@ static int __ufshcd_wl_suspend(struct ufs_hba > > *hba, enum ufs_pm_op pm_op) > > enum ufs_pm_level pm_lvl; > > enum ufs_dev_pwr_mode req_dev_pwr_mode; > > enum uic_link_state req_link_state; > > + unsigned long flags; > > > > hba->pm_op_in_progress = true; > > if (pm_op != UFS_SHUTDOWN_PM) { > > @@ -9049,8 +9050,21 @@ static int __ufshcd_wl_suspend(struct > > ufs_hba *hba, enum ufs_pm_op pm_op) > > > > if (!hba->dev_info.b_rpm_dev_flush_capable) { > > ret = ufshcd_set_dev_pwr_mode(hba, > > req_dev_pwr_mode); > > - if (ret) > > + if (ret) { > > + /* > > + * If retrun err in suspend flow, IO > > will hang. > > retrun -> return will fix next version, thanks > > > + * Trigger error handler and break > > suspend for > > + * error recovery. > > + */ > > + spin_lock_irqsave(hba->host->host_lock, > > flags); > > + hba->force_reset = true; > > + ufshcd_schedule_eh_work(hba); > > __ufshcd_wl_suspend() is also used for shutdown and poweroff > where scheduling EH is not appropriate. Will check pm_op and bypass UFS_SHUTDOWN_PM next version, thanks. > > > + spin_unlock_irqrestore(hba->host- > > >host_lock, > > + flags); > > + > > + ret = -EBUSY; > > goto enable_scaling; > > + } > > } > > } > > > >
On Wed, 2022-11-30 at 09:49 -0800, Bart Van Assche wrote: > On 11/30/22 02:17, Adrian Hunter wrote: > > On 1/11/22 16:24, peter.wang@mediatek.com wrote: > > > From: Peter Wang <peter.wang@mediatek.com> > > > > > > When SSU fail in wlun suspend flow, trigger error handlder and > > > > handlder -> handler > > > > Why / how does SSU fail? > > I'm not sure but the issue that Peter is trying to fix with this > patch > may already have been fixed by my patch series "Fix a deadlock in > the > UFS driver". > > Thanks, > > Bart. > Hi Bart, Yes, it may also can fix by "[v4,07/10] scsi: ufs: Try harder to change the power mode" But I am not sure if any outher corner case SSU will got antother err which retry canoot fix? (ex, device always return same error sense code which host is not expect to receive again, again and again let retry sill fail) By the way, in wl suspend flow, not only SSU, but also enter hibern8(0x17) could got fail. I think it is better fix in both fail case. Hera is our 0x17 timeout fail log and runtime suspend retrun timeout. 473-c(1), 21140.355276017,31742,18, rs, ret=-110, time_us= 2892755, pwr_mode=2, link_status=3 472-u(1), 21139.829423708,31742, 7,0x17,arg1=0x0,arg2=0x0,arg3=0x0, 0 471-r(1), 21139.829361861, 0, 1,0x1b,t=46,db:0x 0,is:0x 80000,crypt:0,0,lba= 0,len= -1, 1738384 470-r(1), 21139.827623708,31742, 0,0x1b,t=46,db:0x 0,is:0x 80000,crypt:0,0,lba= 0,len= -1, 0 Thanks. Peter
On Thu, Dec 1, 2022 at 4:50 AM Bart Van Assche <bvanassche@acm.org> wrote: > > On 11/30/22 02:17, Adrian Hunter wrote: > > On 1/11/22 16:24, peter.wang@mediatek.com wrote: > >> From: Peter Wang <peter.wang@mediatek.com> > >> > >> When SSU fail in wlun suspend flow, trigger error handlder and > > > > handlder -> handler > > > > Why / how does SSU fail? > > I'm not sure but the issue that Peter is trying to fix with this patch > may already have been fixed by my patch series "Fix a deadlock in the > UFS driver". We observe a similar failure scenario when a device tries to change power mode (e.g. due to autosuspend) while "purge" is in progress. It appears that in this case the "pm_only" counter is increased on the device queue prior to the attempt of entering a runtime suspended state, but not reduced upon it failing due to the device being busy with an ongoing purge. That leads to a deadlock of subsequent IO operations to the device. --Daniil
On Fri, 2022-12-02 at 10:32 +1100, Daniil Lunev wrote: > On Thu, Dec 1, 2022 at 4:50 AM Bart Van Assche <bvanassche@acm.org> > wrote: > > > > On 11/30/22 02:17, Adrian Hunter wrote: > > > On 1/11/22 16:24, peter.wang@mediatek.com wrote: > > > > From: Peter Wang <peter.wang@mediatek.com> > > > > > > > > When SSU fail in wlun suspend flow, trigger error handlder and > > > > > > handlder -> handler > > > > > > Why / how does SSU fail? > > > > I'm not sure but the issue that Peter is trying to fix with this > > patch > > may already have been fixed by my patch series "Fix a deadlock in > > the > > UFS driver". > > We observe a similar failure scenario when a device tries to change > power mode (e.g. due to autosuspend) while "purge" is in progress. It > appears that in this case the "pm_only" counter is increased on the > device queue prior to the attempt of entering a runtime suspended > state, but not reduced upon it failing due to the device being busy > with an ongoing purge. That leads to a deadlock of subsequent IO > operations to the device. > > --Daniil Hi Daniil, May I have a question, which dievce you use in this failure scenario? I think it is same issue due to SSU fail, and wlun devcie pm enter error state. So the cunsomer(scsi lu device) is block in suspend state and connot resume to reduce pm_only lead to IO hang. Thanks. BR Peter
> May I have a question, which dievce you use in this failure scenario? > I think it is same issue due to SSU fail, and wlun devcie pm enter > error state. So the cunsomer(scsi lu device) is block in suspend state > and connot resume to reduce pm_only lead to IO hang. > > Thanks. > BR > Peter I don't think it is device specific. I am pretty sure it is the same problem, for it traces to the same origin and your patch fixes the issue (though with a lot of log spam which is a bit unfortunate, but we can live with that) --Daniil
Hi Daniil, Thanks your feedback. I am glad to hear your problem can be solved by this patch. Please also help have a review of my v3 patch https://patchwork.kernel.org/project/linux-scsi/patch/20221202073532.7884-1-peter.wang@mediatek.com/ Thanks. BR Peter On Mon, 2022-12-05 at 07:02 +1100, Daniil Lunev wrote: > > May I have a question, which dievce you use in this failure > > scenario? > > I think it is same issue due to SSU fail, and wlun devcie pm enter > > error state. So the cunsomer(scsi lu device) is block in suspend > > state > > and connot resume to reduce pm_only lead to IO hang. > > > > Thanks. > > BR > > Peter > > I don't think it is device specific. I am pretty sure it is the same > problem, for it traces to the same origin and your patch fixes the > issue (though with a lot of log spam which is a bit unfortunate, but > we can live with that) > > --Daniil
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c index b1f59a5fe632..2f2d3d5d8684 100644 --- a/drivers/ufs/core/ufshcd.c +++ b/drivers/ufs/core/ufshcd.c @@ -8970,6 +8970,7 @@ static int __ufshcd_wl_suspend(struct ufs_hba *hba, enum ufs_pm_op pm_op) enum ufs_pm_level pm_lvl; enum ufs_dev_pwr_mode req_dev_pwr_mode; enum uic_link_state req_link_state; + unsigned long flags; hba->pm_op_in_progress = true; if (pm_op != UFS_SHUTDOWN_PM) { @@ -9049,8 +9050,21 @@ static int __ufshcd_wl_suspend(struct ufs_hba *hba, enum ufs_pm_op pm_op) if (!hba->dev_info.b_rpm_dev_flush_capable) { ret = ufshcd_set_dev_pwr_mode(hba, req_dev_pwr_mode); - if (ret) + if (ret) { + /* + * If retrun err in suspend flow, IO will hang. + * Trigger error handler and break suspend for + * error recovery. + */ + spin_lock_irqsave(hba->host->host_lock, flags); + hba->force_reset = true; + ufshcd_schedule_eh_work(hba); + spin_unlock_irqrestore(hba->host->host_lock, + flags); + + ret = -EBUSY; goto enable_scaling; + } } }