
[4/8] qla2xxx: Fix driver unload hang

Message ID 20191105150657.8092-5-hmadhani@marvell.com (mailing list archive)
State Mainlined
Commit dd322b7f3efc8cda085bb60eadc4aee6324eadd8
Series: qla2xxx: Bug Fixes for the driver

Commit Message

Himanshu Madhani Nov. 5, 2019, 3:06 p.m. UTC
From: Quinn Tran <qutran@marvell.com>

This patch fixes a driver unload hang by removing the msleep() wait loop
from qla2xxx_delete_qpair().

Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair functionality.")
Cc: stable@vger.kernel.org
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Himanshu Madhani <hmadhani@marvell.com>
---
 drivers/scsi/qla2xxx/qla_init.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Ewan Milne Nov. 5, 2019, 3:19 p.m. UTC | #1
On Tue, 2019-11-05 at 07:06 -0800, Himanshu Madhani wrote:
> From: Quinn Tran <qutran@marvell.com>
> 
> This patch fixes driver unload hang by removing msleep()
> 
> Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair functionality.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Quinn Tran <qutran@marvell.com>
> Signed-off-by: Himanshu Madhani <hmadhani@marvell.com>
> ---
>  drivers/scsi/qla2xxx/qla_init.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
> index bddb26baedd2..ff4528702b4e 100644
> --- a/drivers/scsi/qla2xxx/qla_init.c
> +++ b/drivers/scsi/qla2xxx/qla_init.c
> @@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host *vha, struct qla_qpair *qpair)
>  	struct qla_hw_data *ha = qpair->hw;
>  
>  	qpair->delete_in_progress = 1;
> -	while (atomic_read(&qpair->ref_count))
> -		msleep(500);
>  
>  	ret = qla25xx_delete_req_que(vha, qpair->req);
>  	if (ret != QLA_SUCCESS)

Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Bart Van Assche Nov. 7, 2019, 4:54 p.m. UTC | #2
On 11/5/19 7:06 AM, Himanshu Madhani wrote:
> From: Quinn Tran <qutran@marvell.com>
> 
> This patch fixes driver unload hang by removing msleep()
> 
> Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair functionality.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Quinn Tran <qutran@marvell.com>
> Signed-off-by: Himanshu Madhani <hmadhani@marvell.com>
> ---
>   drivers/scsi/qla2xxx/qla_init.c | 2 --
>   1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
> index bddb26baedd2..ff4528702b4e 100644
> --- a/drivers/scsi/qla2xxx/qla_init.c
> +++ b/drivers/scsi/qla2xxx/qla_init.c
> @@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host *vha, struct qla_qpair *qpair)
>   	struct qla_hw_data *ha = qpair->hw;
>   
>   	qpair->delete_in_progress = 1;
> -	while (atomic_read(&qpair->ref_count))
> -		msleep(500);
>   
>   	ret = qla25xx_delete_req_que(vha, qpair->req);
>   	if (ret != QLA_SUCCESS)

I think that an explanation is needed why that loop had been introduced 
and also why it is safe not to wait until qpair->ref_count drops to zero 
in qla2xxx_delete_qpair().

Thanks,

Bart.
Bart Van Assche Nov. 7, 2019, 5:58 p.m. UTC | #3
On 11/7/19 9:55 AM, Himanshu Madhani wrote:
> 
> 
>> On Nov 7, 2019, at 10:54 AM, Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> On 11/5/19 7:06 AM, Himanshu Madhani wrote:
>>> From: Quinn Tran <qutran@marvell.com>
>>> This patch fixes driver unload hang by removing msleep()
>>> Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair functionality.")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Quinn Tran <qutran@marvell.com>
>>> Signed-off-by: Himanshu Madhani <hmadhani@marvell.com>
>>> ---
>>>  drivers/scsi/qla2xxx/qla_init.c | 2 --
>>>  1 file changed, 2 deletions(-)
>>> diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
>>> index bddb26baedd2..ff4528702b4e 100644
>>> --- a/drivers/scsi/qla2xxx/qla_init.c
>>> +++ b/drivers/scsi/qla2xxx/qla_init.c
>>> @@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host *vha, struct qla_qpair *qpair)
>>>  	struct qla_hw_data *ha = qpair->hw;
>>>  
>>>  	qpair->delete_in_progress = 1;
>>> -	while (atomic_read(&qpair->ref_count))
>>> -		msleep(500);
>>>  
>>>  	ret = qla25xx_delete_req_que(vha, qpair->req);
>>>  	if (ret != QLA_SUCCESS)
>>
>> I think that an explanation is needed why that loop had been
>> introduced and also why it is safe not to wait until qpair->ref_count
>> drops to zero in qla2xxx_delete_qpair().
>>
> 
> Commit d74595278f4ab had a design drawback in the MQ implementation in
> qla2xxx. Now that MQ is enabled by default in the 5.x kernels, we have
> been working on making it more stable. What we discovered is that after
> a heavy I/O workload in a cluster environment, driver unload hangs with
> the following stack trace:
> 
> # ps -fax | grep rmmod
> 6029 pts/0 D+ 0:00 | \_ rmmod qla2xxx
> 
> [<0>] msleep+0x29/0x30
> [<0>] qla2xxx_delete_qpair+0x2c/0x160 [qla2xxx]
> [<0>] qla25xx_delete_queues+0x14b/0x1d0 [qla2xxx]
> [<0>] qla2x00_free_device+0x31/0xe0 [qla2xxx]
> [<0>] qla2x00_remove_one+0x239/0x370 [qla2xxx]
> [<0>] pci_device_remove+0x3b/0xc0
> [<0>] device_release_driver_internal+0x18c/0x250
> [<0>] driver_detach+0x39/0x6d
> [<0>] bus_remove_driver+0x77/0xc9
> [<0>] pci_unregister_driver+0x2d/0xb0
> [<0>] qla2x00_module_exit+0x2d/0x90 [qla2xxx]
> [<0>] __x64_sys_delete_module+0x139/0x270
> [<0>] do_syscall_64+0x5b/0x1b0
> [<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca
> [<0>] 0xffffffffffffffff
> 
> Removing this msleep() loop resolves the hang.

Hi Himanshu,

Does your answer mean that this hang has not yet been root-caused fully
and hence that it is possible this patch is only a workaround but not a
fix of the root cause?

Thanks,

Bart.
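[Editorial note] The failure mode under discussion is that the removed loop polls qpair->ref_count with no upper bound, so a single leaked reference leaves rmmod sleeping forever in D state. The userspace sketch below (hypothetical names, C11 atomics standing in for the real driver structures) models why the unbounded poll never terminates once a reference is leaked; it caps the loop only so the "infinite" case is observable:

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Model of the wait loop removed by this patch. In the driver the loop
 * body was msleep(500); here we just count polls and bail out after a
 * cap so the hang is observable. Returns the number of polls it took
 * for ref_count to reach zero, or -1 if the loop would never exit.
 */
static int wait_for_refcount_drop(atomic_int *ref_count, int max_polls)
{
	int polls = 0;

	while (atomic_load(ref_count) != 0) {
		if (++polls > max_polls)
			return -1; /* rmmod would be stuck here forever */
		/* real driver: msleep(500); */
	}
	return polls;
}
```

With a clean teardown (ref_count already zero) the loop exits immediately; with a leaked reference it can only be escaped by the artificial cap, which corresponds to the rmmod hang in the stack trace above.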
Bart Van Assche Nov. 7, 2019, 6:30 p.m. UTC | #4
On 11/7/19 9:58 AM, Bart Van Assche wrote:
> Does your answer mean that this hang has not yet been root-caused fully
> and hence that it is possible this patch is only a workaround but not a
> fix of the root cause?

Answering my own question: I think that a qpair refcount leak is a 
severe problem and not something that should be ignored. How about 
changing the while loop into something like the following:

	if (atomic_read(&qpair->ref_count))
		msleep(500);
	WARN_ON_ONCE(atomic_read(&qpair->ref_count));

Thanks,

Bart.
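[Editorial note] Bart's suggestion (one bounded grace period, then a one-shot warning instead of an endless poll) can be modeled in userspace roughly as follows. This is a sketch with hypothetical names, using a plain flag in place of the kernel's WARN_ON_ONCE latch, not the actual driver change:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static bool qpair_leak_warned; /* stand-in for WARN_ON_ONCE's once-latch */

/*
 * Bounded variant of the teardown wait: give outstanding references one
 * grace period, then proceed anyway but report a leak (once) instead of
 * hanging. Returns true if a leak warning was emitted on this call.
 */
static bool delete_qpair_check(atomic_int *ref_count)
{
	if (atomic_load(ref_count) != 0) {
		/* real driver would msleep(500) here: once, not in a loop */
	}
	if (atomic_load(ref_count) != 0 && !qpair_leak_warned) {
		qpair_leak_warned = true; /* fires exactly once */
		return true;
	}
	return false;
}
```

The point of the WARN_ON_ONCE in the real proposal is that teardown no longer blocks, but a leaked qpair reference still leaves a trace in the kernel log rather than being silently ignored.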
Himanshu Madhani Nov. 8, 2019, 11:38 p.m. UTC | #5
Hi Bart, 

> On Nov 7, 2019, at 12:30 PM, Bart Van Assche <bvanassche@acm.org> wrote:
> 
> On 11/7/19 9:58 AM, Bart Van Assche wrote:
>> Does your answer mean that this hang has not yet been root-caused fully
>> and hence that it is possible this patch is only a workaround but not a
>> fix of the root cause?
> 
> Answering my own question: I think that a qpair refcount leak is a severe problem and not something that should be ignored. How about changing the while loop into something like the following:
> 
> 	if (atomic_read(&qpair->ref_count))
> 		msleep(500);
> 	WARN_ON_ONCE(atomic_read(&qpair->ref_count));
> 
> Thanks,
> 
> Bart.

Since we had seen this hang in a specific cluster environment where a refcount leak was observed, I would like to take this patch as is, and I will evaluate your suggestion to verify whether adding WARN_ON_ONCE makes any difference. If adding WARN_ON_ONCE indeed helps, I will send a follow-up patch with a Fixes tag during the rc window.

Let me know if you disagree. 

Thanks,
Himanshu
Bart Van Assche Nov. 8, 2019, 11:58 p.m. UTC | #6
On 11/8/19 3:38 PM, Himanshu Madhani wrote:
> Hi Bart,
> 
>> On Nov 7, 2019, at 12:30 PM, Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> On 11/7/19 9:58 AM, Bart Van Assche wrote:
>>> Does your answer mean that this hang has not yet been root-caused fully
>>> and hence that it is possible this patch is only a workaround but not a
>>> fix of the root cause?
>>
>> Answering my own question: I think that a qpair refcount leak is a severe problem and not something that should be ignored. How about changing the while loop into something like the following:
>>
>> 	if (atomic_read(&qpair->ref_count))
>> 		msleep(500);
>> 	WARN_ON_ONCE(atomic_read(&qpair->ref_count));
>>
>> Thanks,
>>
>> Bart.
> 
> Since we had seen this hang in a specific cluster environment where a refcount leak was observed, I would like to take this patch as is, and I will evaluate your suggestion to verify whether adding WARN_ON_ONCE makes any difference. If adding WARN_ON_ONCE indeed helps, I will send a follow-up patch with a Fixes tag during the rc window.
> 
> Let me know if you disagree.

Hi Himanshu,

Please do not suppress reports of kernel bugs but instead make sure that 
some report is provided that indicates that something went wrong and 
needs further attention.

Thanks,

Bart.

Patch

diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
index bddb26baedd2..ff4528702b4e 100644
--- a/drivers/scsi/qla2xxx/qla_init.c
+++ b/drivers/scsi/qla2xxx/qla_init.c
@@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host *vha, struct qla_qpair *qpair)
 	struct qla_hw_data *ha = qpair->hw;
 
 	qpair->delete_in_progress = 1;
-	while (atomic_read(&qpair->ref_count))
-		msleep(500);
 
 	ret = qla25xx_delete_req_que(vha, qpair->req);
 	if (ret != QLA_SUCCESS)