
[4/8] qla2xxx: Fix driver unload hang

Message ID 20191105150657.8092-5-hmadhani@marvell.com (mailing list archive)
State Mainlined
Commit dd322b7f3efc8cda085bb60eadc4aee6324eadd8
Series: qla2xxx: Bug Fixes for the driver

Commit Message

Himanshu Madhani Nov. 5, 2019, 3:06 p.m. UTC
From: Quinn Tran <qutran@marvell.com>

This patch fixes a driver unload hang by removing the msleep() wait loop
from qla2xxx_delete_qpair().

Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair functionality.")
Cc: stable@vger.kernel.org
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Himanshu Madhani <hmadhani@marvell.com>
---
 drivers/scsi/qla2xxx/qla_init.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Ewan Milne Nov. 5, 2019, 3:19 p.m. UTC | #1
On Tue, 2019-11-05 at 07:06 -0800, Himanshu Madhani wrote:
> From: Quinn Tran <qutran@marvell.com>
> 
> This patch fixes driver unload hang by removing msleep()
> 
> Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair functionality.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Quinn Tran <qutran@marvell.com>
> Signed-off-by: Himanshu Madhani <hmadhani@marvell.com>
> ---
>  drivers/scsi/qla2xxx/qla_init.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
> index bddb26baedd2..ff4528702b4e 100644
> --- a/drivers/scsi/qla2xxx/qla_init.c
> +++ b/drivers/scsi/qla2xxx/qla_init.c
> @@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host *vha, struct qla_qpair *qpair)
>  	struct qla_hw_data *ha = qpair->hw;
>  
>  	qpair->delete_in_progress = 1;
> -	while (atomic_read(&qpair->ref_count))
> -		msleep(500);
>  
>  	ret = qla25xx_delete_req_que(vha, qpair->req);
>  	if (ret != QLA_SUCCESS)

Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Bart Van Assche Nov. 7, 2019, 4:54 p.m. UTC | #2
On 11/5/19 7:06 AM, Himanshu Madhani wrote:
> From: Quinn Tran <qutran@marvell.com>
> 
> This patch fixes driver unload hang by removing msleep()
> 
> Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair functionality.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Quinn Tran <qutran@marvell.com>
> Signed-off-by: Himanshu Madhani <hmadhani@marvell.com>
> ---
>   drivers/scsi/qla2xxx/qla_init.c | 2 --
>   1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
> index bddb26baedd2..ff4528702b4e 100644
> --- a/drivers/scsi/qla2xxx/qla_init.c
> +++ b/drivers/scsi/qla2xxx/qla_init.c
> @@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host *vha, struct qla_qpair *qpair)
>   	struct qla_hw_data *ha = qpair->hw;
>   
>   	qpair->delete_in_progress = 1;
> -	while (atomic_read(&qpair->ref_count))
> -		msleep(500);
>   
>   	ret = qla25xx_delete_req_que(vha, qpair->req);
>   	if (ret != QLA_SUCCESS)

I think that an explanation is needed why that loop had been introduced 
and also why it is safe not to wait until qpair->ref_count drops to zero 
in qla2xxx_delete_qpair().

Thanks,

Bart.
Bart Van Assche Nov. 7, 2019, 5:58 p.m. UTC | #3
On 11/7/19 9:55 AM, Himanshu Madhani wrote:
> 
> 
>> On Nov 7, 2019, at 10:54 AM, Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> On 11/5/19 7:06 AM, Himanshu Madhani wrote:
>>> From: Quinn Tran <qutran@marvell.com>
>>> This patch fixes driver unload hang by removing msleep()
>>> Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair functionality.")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Quinn Tran <qutran@marvell.com>
>>> Signed-off-by: Himanshu Madhani <hmadhani@marvell.com>
>>> ---
>>>  drivers/scsi/qla2xxx/qla_init.c | 2 --
>>>  1 file changed, 2 deletions(-)
>>> diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
>>> index bddb26baedd2..ff4528702b4e 100644
>>> --- a/drivers/scsi/qla2xxx/qla_init.c
>>> +++ b/drivers/scsi/qla2xxx/qla_init.c
>>> @@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host *vha, struct qla_qpair *qpair)
>>>  	struct qla_hw_data *ha = qpair->hw;
>>>  
>>>  	qpair->delete_in_progress = 1;
>>> -	while (atomic_read(&qpair->ref_count))
>>> -		msleep(500);
>>>  
>>>  	ret = qla25xx_delete_req_que(vha, qpair->req);
>>>  	if (ret != QLA_SUCCESS)
>>
>> I think that an explanation is needed why that loop had been
>> introduced and also why it is safe not to wait until qpair->ref_count
>> drops to zero in qla2xxx_delete_qpair().
>>
> 
> Commit d74595278f4ab had a design drawback in the MQ implementation in
> qla2xxx. Now that MQ is enabled by default in the 5.x kernels, we have
> been working on making it more stable. What we discovered is that after
> a heavy I/O workload in a cluster environment, driver unload hangs with
> the following stack trace:
> 
> # ps -fax | grep rmmod
> 6029 pts/0 D+ 0:00 | \_ rmmod qla2xxx
> 
> [<0>] msleep+0x29/0x30
> [<0>] qla2xxx_delete_qpair+0x2c/0x160 [qla2xxx]
> [<0>] qla25xx_delete_queues+0x14b/0x1d0 [qla2xxx]
> [<0>] qla2x00_free_device+0x31/0xe0 [qla2xxx]
> [<0>] qla2x00_remove_one+0x239/0x370 [qla2xxx]
> [<0>] pci_device_remove+0x3b/0xc0
> [<0>] device_release_driver_internal+0x18c/0x250
> [<0>] driver_detach+0x39/0x6d
> [<0>] bus_remove_driver+0x77/0xc9
> [<0>] pci_unregister_driver+0x2d/0xb0
> [<0>] qla2x00_module_exit+0x2d/0x90 [qla2xxx]
> [<0>] __x64_sys_delete_module+0x139/0x270
> [<0>] do_syscall_64+0x5b/0x1b0
> [<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca
> [<0>] 0xffffffffffffffff
> 
> Removing this msleep() loop resolves the hang.

Hi Himanshu,

Does your answer mean that this hang has not yet been root-caused fully
and hence that it is possible this patch is only a workaround but not a
fix of the root cause?

Thanks,

Bart.
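[Editorial note] The failure mode under discussion is that the removed loop polls qpair->ref_count with no upper bound, so a single leaked reference leaves rmmod sleeping forever in D state. The userspace sketch below (hypothetical names, C11 atomics standing in for the real driver structures) models why the unbounded poll never terminates once a reference is leaked; it caps the loop only so the "infinite" case is observable:

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Model of the wait loop removed by this patch. In the driver the loop
 * body was msleep(500); here we just count polls and bail out after a
 * cap so the hang is observable. Returns the number of polls it took
 * for ref_count to reach zero, or -1 if the loop would never exit.
 */
static int wait_for_refcount_drop(atomic_int *ref_count, int max_polls)
{
	int polls = 0;

	while (atomic_load(ref_count) != 0) {
		if (++polls > max_polls)
			return -1; /* rmmod would be stuck here forever */
		/* real driver: msleep(500); */
	}
	return polls;
}
```

With a clean teardown (ref_count already zero) the loop exits immediately; with a leaked reference it can only be escaped by the artificial cap, which corresponds to the rmmod hang in the stack trace above.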
Bart Van Assche Nov. 7, 2019, 6:30 p.m. UTC | #4
On 11/7/19 9:58 AM, Bart Van Assche wrote:
> Does your answer mean that this hang has not yet been root-caused fully
> and hence that it is possible this patch is only a workaround but not a
> fix of the root cause?

Answering my own question: I think that a qpair refcount leak is a 
severe problem and not something that should be ignored. How about 
changing the while loop into something like the following:

	if (atomic_read(&qpair->ref_count))
		msleep(500);
	WARN_ON_ONCE(atomic_read(&qpair->ref_count));

Thanks,

Bart.
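[Editorial note] Bart's suggestion (one bounded grace period, then a one-shot warning instead of an endless poll) can be modeled in userspace roughly as follows. This is a sketch with hypothetical names, using a plain flag in place of the kernel's WARN_ON_ONCE latch, not the actual driver change:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static bool qpair_leak_warned; /* stand-in for WARN_ON_ONCE's once-latch */

/*
 * Bounded variant of the teardown wait: give outstanding references one
 * grace period, then proceed anyway but report a leak (once) instead of
 * hanging. Returns true if a leak warning was emitted on this call.
 */
static bool delete_qpair_check(atomic_int *ref_count)
{
	if (atomic_load(ref_count) != 0) {
		/* real driver would msleep(500) here: once, not in a loop */
	}
	if (atomic_load(ref_count) != 0 && !qpair_leak_warned) {
		qpair_leak_warned = true; /* fires exactly once */
		return true;
	}
	return false;
}
```

The point of the WARN_ON_ONCE in the real proposal is that teardown no longer blocks, but a leaked qpair reference still leaves a trace in the kernel log rather than being silently ignored.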
Himanshu Madhani Nov. 8, 2019, 11:38 p.m. UTC | #5
Hi Bart, 

> On Nov 7, 2019, at 12:30 PM, Bart Van Assche <bvanassche@acm.org> wrote:
> 
> On 11/7/19 9:58 AM, Bart Van Assche wrote:
>> Does your answer mean that this hang has not yet been root-caused fully
>> and hence that it is possible this patch is only a workaround but not a
>> fix of the root cause?
> 
> Answering my own question: I think that a qpair refcount leak is a severe problem and not something that should be ignored. How about changing the while loop into something like the following:
> 
> 	if (atomic_read(&qpair->ref_count))
> 		msleep(500);
> 	WARN_ON_ONCE(atomic_read(&qpair->ref_count));
> 
> Thanks,
> 
> Bart.

Since we had seen this hang in a specific cluster environment where a refcount leak was observed, I would like to take this patch as is, and I will evaluate your suggestion to verify whether adding WARN_ON_ONCE makes any difference. If adding WARN_ON_ONCE indeed helps, I will send a follow-up patch with a Fixes tag during the rc window.

Let me know if you disagree. 

Thanks,
Himanshu
Bart Van Assche Nov. 8, 2019, 11:58 p.m. UTC | #6
On 11/8/19 3:38 PM, Himanshu Madhani wrote:
> Hi Bart,
> 
>> On Nov 7, 2019, at 12:30 PM, Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> On 11/7/19 9:58 AM, Bart Van Assche wrote:
>>> Does your answer mean that this hang has not yet been root-caused fully
>>> and hence that it is possible this patch is only a workaround but not a
>>> fix of the root cause?
>>
>> Answering my own question: I think that a qpair refcount leak is a severe problem and not something that should be ignored. How about changing the while loop into something like the following:
>>
>> 	if (atomic_read(&qpair->ref_count))
>> 		msleep(500);
>> 	WARN_ON_ONCE(atomic_read(&qpair->ref_count));
>>
>> Thanks,
>>
>> Bart.
> 
> Since we had seen this hang in a specific cluster environment where a refcount leak was observed, I would like to take this patch as is, and I will evaluate your suggestion to verify whether adding WARN_ON_ONCE makes any difference. If adding WARN_ON_ONCE indeed helps, I will send a follow-up patch with a Fixes tag during the rc window.
> 
> Let me know if you disagree.

Hi Himanshu,

Please do not suppress reports of kernel bugs but instead make sure that 
some report is provided that indicates that something went wrong and 
needs further attention.

Thanks,

Bart.

Patch

diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
index bddb26baedd2..ff4528702b4e 100644
--- a/drivers/scsi/qla2xxx/qla_init.c
+++ b/drivers/scsi/qla2xxx/qla_init.c
@@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host *vha, struct qla_qpair *qpair)
 	struct qla_hw_data *ha = qpair->hw;
 
 	qpair->delete_in_progress = 1;
-	while (atomic_read(&qpair->ref_count))
-		msleep(500);
 
 	ret = qla25xx_delete_req_que(vha, qpair->req);
 	if (ret != QLA_SUCCESS)