[v3] ufs: core: wlun resume SSU(Acitve) fail recovery

Message ID 20221228060157.24852-1-peter.wang@mediatek.com (mailing list archive)
State New, archived
Series [v3] ufs: core: wlun resume SSU(Acitve) fail recovery

Commit Message

Peter Wang (王信友) Dec. 28, 2022, 6:01 a.m. UTC
From: Peter Wang <peter.wang@mediatek.com>

When wlun resume SSU(Active) timeout, scsi try eh_host_reset_handler.
But ufshcd_eh_host_reset_handler hang at wait flush_work(&hba->eh_work).
And ufshcd_err_handler hang at wait rpm resume.
Do link recovery only in this case. Below is IO hang stack dump.

<ffffffdd78e02b34> schedule+0x110/0x204
<ffffffdd78e0be60> schedule_timeout+0x98/0x138
<ffffffdd78e040e8> wait_for_common_io+0x130/0x2d0
<ffffffdd77d6a000> blk_execute_rq+0x10c/0x16c
<ffffffdd78126d90> __scsi_execute+0xfc/0x278
<ffffffdd7813891c> ufshcd_set_dev_pwr_mode+0x1c8/0x40c
<ffffffdd78137d1c> __ufshcd_wl_resume+0xf0/0x5cc
<ffffffdd78137ae0> ufshcd_wl_runtime_resume+0x40/0x18c
<ffffffdd78136108> scsi_runtime_resume+0x88/0x104
<ffffffdd7809a4f8> __rpm_callback+0x1a0/0xaec
<ffffffdd7809b624> rpm_resume+0x7e0/0xcd0
<ffffffdd7809a788> __rpm_callback+0x430/0xaec
<ffffffdd7809b644> rpm_resume+0x800/0xcd0
<ffffffdd780a0778> pm_runtime_work+0x148/0x198

<ffffffdd78e02b34> schedule+0x110/0x204
<ffffffdd78e0be10> schedule_timeout+0x48/0x138
<ffffffdd78e03d9c> wait_for_common+0x144/0x2dc
<ffffffdd7758bba4> __flush_work+0x3d0/0x508
<ffffffdd7815572c> ufshcd_eh_host_reset_handler+0x134/0x3a8
<ffffffdd781216f4> scsi_try_host_reset+0x54/0x204
<ffffffdd78120594> scsi_eh_ready_devs+0xb30/0xd48
<ffffffdd7812373c> scsi_error_handler+0x260/0x874

<ffffffdd78e02b34> schedule+0x110/0x204
<ffffffdd7809af64> rpm_resume+0x120/0xcd0
<ffffffdd7809fde8> __pm_runtime_resume+0xa0/0x17c
<ffffffdd7815193c> ufshcd_err_handling_prepare+0x40/0x430
<ffffffdd7814cce8> ufshcd_err_handler+0x1c4/0xd4c

Signed-off-by: Peter Wang <peter.wang@mediatek.com>
---
 drivers/ufs/core/ufshcd.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Comments

Bart Van Assche Jan. 3, 2023, 12:29 a.m. UTC | #1
On 12/27/22 22:01, peter.wang@mediatek.com wrote:
> When wlun resume SSU(Active) timeout, scsi try eh_host_reset_handler.
                    ^^^^^^^^^^^
Please use the same spelling in the patch subject (Acitve -> Active).

timeout -> times out
scsi try -> the SCSI core invokes

> But ufshcd_eh_host_reset_handler hang at wait flush_work(&hba->eh_work).

hang at -> hangs in

> And ufshcd_err_handler hang at wait rpm resume.

hang at wait rpm resume -> hangs in rpm_resume().

 > <ffffffdd78e02b34> schedule+0x110/0x204
 > <ffffffdd78e0be60> schedule_timeout+0x98/0x138
 > <ffffffdd78e040e8> wait_for_common_io+0x130/0x2d0
 > <ffffffdd77d6a000> blk_execute_rq+0x10c/0x16c
 > <ffffffdd78126d90> __scsi_execute+0xfc/0x278
 > <ffffffdd7813891c> ufshcd_set_dev_pwr_mode+0x1c8/0x40c
 > <ffffffdd78137d1c> __ufshcd_wl_resume+0xf0/0x5cc
 > <ffffffdd78137ae0> ufshcd_wl_runtime_resume+0x40/0x18c
 > <ffffffdd78136108> scsi_runtime_resume+0x88/0x104
 > <ffffffdd7809a4f8> __rpm_callback+0x1a0/0xaec
 > <ffffffdd7809b624> rpm_resume+0x7e0/0xcd0
 > <ffffffdd7809a788> __rpm_callback+0x430/0xaec
 > <ffffffdd7809b644> rpm_resume+0x800/0xcd0
 > <ffffffdd780a0778> pm_runtime_work+0x148/0x198
 >
 > <ffffffdd78e02b34> schedule+0x110/0x204
 > <ffffffdd78e0be10> schedule_timeout+0x48/0x138
 > <ffffffdd78e03d9c> wait_for_common+0x144/0x2dc
 > <ffffffdd7758bba4> __flush_work+0x3d0/0x508
 > <ffffffdd7815572c> ufshcd_eh_host_reset_handler+0x134/0x3a8
 > <ffffffdd781216f4> scsi_try_host_reset+0x54/0x204
 > <ffffffdd78120594> scsi_eh_ready_devs+0xb30/0xd48
 > <ffffffdd7812373c> scsi_error_handler+0x260/0x874
 >
 > <ffffffdd78e02b34> schedule+0x110/0x204
 > <ffffffdd7809af64> rpm_resume+0x120/0xcd0
 > <ffffffdd7809fde8> __pm_runtime_resume+0xa0/0x17c
 > <ffffffdd7815193c> ufshcd_err_handling_prepare+0x40/0x430
 > <ffffffdd7814cce8> ufshcd_err_handler+0x1c4/0xd4c

On top of which kernel version has this patch been developed?
I think this deadlock has already been fixed by commit 7029e2151a7c 
("scsi: ufs: Fix a deadlock between PM and the SCSI error handler").
Please check whether that commit by itself (without this patch) is 
sufficient to fix the reported deadlock.

> ---
>   drivers/ufs/core/ufshcd.c | 17 +++++++++++++++++
>   1 file changed, 17 insertions(+)

The changelog is missing. Please include a changelog when posting v2 or 
a later version of a patch.

> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> index e18c9f4463ec..0dfb9a35bf66 100644
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -7366,6 +7366,23 @@ static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
>   
>   	hba = shost_priv(cmd->device->host);
>   
> +	/*
> +	 * If pm op resume fail and wait err recovery, do link recovery only.
> +	 * Because schedule eh work will get dead lock in ufshcd_rpm_get_sync
> +	 * and wait wlun resume, but wlun resume error wait eh work finish.
> +	 */

The above comment has grammar issues and some parts are 
incomprehensible. What does e.g. "wait err recovery" mean? Please 
improve this source code comment.

Thanks,

Bart.
Peter Wang (王信友) Sept. 22, 2023, 9:08 a.m. UTC | #2
On Mon, 2023-01-02 at 16:29 -0800, Bart Van Assche wrote:
> On 12/27/22 22:01, peter.wang@mediatek.com wrote:
> > When wlun resume SSU(Active) timeout, scsi try
> > eh_host_reset_handler.
> 
>                     ^^^^^^^^^^^
> Please use the same spelling in the patch subject (Acitve -> Active).
> 
> timeout -> times out
> scsi try -> the SCSI core invokes
> 
> > But ufshcd_eh_host_reset_handler hang at wait flush_work(&hba-
> > >eh_work).
> 
> hang at -> hangs in
> 
> > And ufshcd_err_handler hang at wait rpm resume.
> 
> hang at wait rpm resume -> hangs in rpm_resume().
> 
>  > <ffffffdd78e02b34> schedule+0x110/0x204
>  > <ffffffdd78e0be60> schedule_timeout+0x98/0x138
>  > <ffffffdd78e040e8> wait_for_common_io+0x130/0x2d0
>  > <ffffffdd77d6a000> blk_execute_rq+0x10c/0x16c
>  > <ffffffdd78126d90> __scsi_execute+0xfc/0x278
>  > <ffffffdd7813891c> ufshcd_set_dev_pwr_mode+0x1c8/0x40c
>  > <ffffffdd78137d1c> __ufshcd_wl_resume+0xf0/0x5cc
>  > <ffffffdd78137ae0> ufshcd_wl_runtime_resume+0x40/0x18c
>  > <ffffffdd78136108> scsi_runtime_resume+0x88/0x104
>  > <ffffffdd7809a4f8> __rpm_callback+0x1a0/0xaec
>  > <ffffffdd7809b624> rpm_resume+0x7e0/0xcd0
>  > <ffffffdd7809a788> __rpm_callback+0x430/0xaec
>  > <ffffffdd7809b644> rpm_resume+0x800/0xcd0
>  > <ffffffdd780a0778> pm_runtime_work+0x148/0x198
>  >
>  > <ffffffdd78e02b34> schedule+0x110/0x204
>  > <ffffffdd78e0be10> schedule_timeout+0x48/0x138
>  > <ffffffdd78e03d9c> wait_for_common+0x144/0x2dc
>  > <ffffffdd7758bba4> __flush_work+0x3d0/0x508
>  > <ffffffdd7815572c> ufshcd_eh_host_reset_handler+0x134/0x3a8
>  > <ffffffdd781216f4> scsi_try_host_reset+0x54/0x204
>  > <ffffffdd78120594> scsi_eh_ready_devs+0xb30/0xd48
>  > <ffffffdd7812373c> scsi_error_handler+0x260/0x874
>  >
>  > <ffffffdd78e02b34> schedule+0x110/0x204
>  > <ffffffdd7809af64> rpm_resume+0x120/0xcd0
>  > <ffffffdd7809fde8> __pm_runtime_resume+0xa0/0x17c
>  > <ffffffdd7815193c> ufshcd_err_handling_prepare+0x40/0x430
>  > <ffffffdd7814cce8> ufshcd_err_handler+0x1c4/0xd4c
> 
> On top of which kernel version has this patch been developed?
> I think this deadlock has already been fixed by commit 7029e2151a7c 
> ("scsi: ufs: Fix a deadlock between PM and the SCSI error handler").
> Please check whether that commit by itself (without this patch) is 
> sufficient to fix the reported deadlock.
> 

Hi Bart, 

Commit 7029e2151a7c does not fix this deadlock; we still see the issue
happen on kernel 6.1.
I will update this patch and correct the comment.
Please have a look again.

Thanks.
Peter


> > ---
> >   drivers/ufs/core/ufshcd.c | 17 +++++++++++++++++
> >   1 file changed, 17 insertions(+)
> 
> The changelog is missing. Please include a changelog when posting v2
> or 
> a later version of a patch.
> 
> > diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> > index e18c9f4463ec..0dfb9a35bf66 100644
> > --- a/drivers/ufs/core/ufshcd.c
> > +++ b/drivers/ufs/core/ufshcd.c
> > @@ -7366,6 +7366,23 @@ static int
> > ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
> >   
> >   	hba = shost_priv(cmd->device->host);
> >   
> > +	/*
> > +	 * If pm op resume fail and wait err recovery, do link recovery
> > only.
> > +	 * Because schedule eh work will get dead lock in
> > ufshcd_rpm_get_sync
> > +	 * and wait wlun resume, but wlun resume error wait eh work
> > finish.
> > +	 */
> 
> The above comment has grammar issues and some parts are 
> incomprehensible. What does e.g. "wait err recovery" mean? Please 
> improve this source code comment.
> 
> Thanks,
> 
> Bart.
Patch

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index e18c9f4463ec..0dfb9a35bf66 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -7366,6 +7366,23 @@  static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
 
 	hba = shost_priv(cmd->device->host);
 
+	/*
+	 * If pm op resume fail and wait err recovery, do link recovery only.
+	 * Because schedule eh work will get dead lock in ufshcd_rpm_get_sync
+	 * and wait wlun resume, but wlun resume error wait eh work finish.
+	 */
+	if (hba->pm_op_in_progress &&
+	    hba->curr_dev_pwr_mode != UFS_ACTIVE_PWR_MODE) {
+		err = ufshcd_link_recovery(hba);
+		if (err) {
+			dev_err(hba->dev, "%s: power state %d pm_progress %d",
+				__func__,
+				hba->curr_dev_pwr_mode,
+				hba->pm_op_in_progress);
+		}
+		return err;
+	}
+
 	spin_lock_irqsave(hba->host->host_lock, flags);
 	hba->force_reset = true;
 	ufshcd_schedule_eh_work(hba);