Message ID | 20191105150657.8092-5-hmadhani@marvell.com (mailing list archive) |
---|---|
State | Mainlined |
Commit | dd322b7f3efc8cda085bb60eadc4aee6324eadd8 |
Headers | show |
Series | qla2xxx: Bug Fixes for the driver | expand |
On Tue, 2019-11-05 at 07:06 -0800, Himanshu Madhani wrote: > From: Quinn Tran <qutran@marvell.com> > > This patch fixes driver unload hang by removing msleep() > > Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair functionality.") > Cc: stable@vger.kernel.org > Signed-off-by: Quinn Tran <qutran@marvell.com> > Signed-off-by: Himanshu Madhani <hmadhani@marvell.com> > --- > drivers/scsi/qla2xxx/qla_init.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c > index bddb26baedd2..ff4528702b4e 100644 > --- a/drivers/scsi/qla2xxx/qla_init.c > +++ b/drivers/scsi/qla2xxx/qla_init.c > @@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host *vha, struct qla_qpair *qpair) > struct qla_hw_data *ha = qpair->hw; > > qpair->delete_in_progress = 1; > - while (atomic_read(&qpair->ref_count)) > - msleep(500); > > ret = qla25xx_delete_req_que(vha, qpair->req); > if (ret != QLA_SUCCESS) Reviewed-by: Ewan D. Milne <emilne@redhat.com>
On 11/5/19 7:06 AM, Himanshu Madhani wrote: > From: Quinn Tran <qutran@marvell.com> > > This patch fixes driver unload hang by removing msleep() > > Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair functionality.") > Cc: stable@vger.kernel.org > Signed-off-by: Quinn Tran <qutran@marvell.com> > Signed-off-by: Himanshu Madhani <hmadhani@marvell.com> > --- > drivers/scsi/qla2xxx/qla_init.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c > index bddb26baedd2..ff4528702b4e 100644 > --- a/drivers/scsi/qla2xxx/qla_init.c > +++ b/drivers/scsi/qla2xxx/qla_init.c > @@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host *vha, struct qla_qpair *qpair) > struct qla_hw_data *ha = qpair->hw; > > qpair->delete_in_progress = 1; > - while (atomic_read(&qpair->ref_count)) > - msleep(500); > > ret = qla25xx_delete_req_que(vha, qpair->req); > if (ret != QLA_SUCCESS) I think that an explanation is needed why that loop had been introduced and also why it is safe not to wait until qpair->ref_count drops to zero in qla2xxx_delete_qpair(). Thanks, Bart.
On 11/7/19 9:55 AM, Himanshu Madhani wrote: > > >> On Nov 7, 2019, at 10:54 AM, Bart Van Assche <bvanassche@acm.org >> <mailto:bvanassche@acm.org>> wrote: >> >> On 11/5/19 7:06 AM, Himanshu Madhani wrote: >>> From: Quinn Tran <qutran@marvell.com <mailto:qutran@marvell.com>> >>> This patch fixes driver unload hang by removing msleep() >>> Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair >>> functionality.") >>> Cc: stable@vger.kernel.org <mailto:stable@vger.kernel.org> >>> Signed-off-by: Quinn Tran <qutran@marvell.com >>> <mailto:qutran@marvell.com>> >>> Signed-off-by: Himanshu Madhani <hmadhani@marvell.com >>> <mailto:hmadhani@marvell.com>> >>> --- >>> drivers/scsi/qla2xxx/qla_init.c | 2 -- >>> 1 file changed, 2 deletions(-) >>> diff --git a/drivers/scsi/qla2xxx/qla_init.c >>> b/drivers/scsi/qla2xxx/qla_init.c >>> index bddb26baedd2..ff4528702b4e 100644 >>> --- a/drivers/scsi/qla2xxx/qla_init.c >>> +++ b/drivers/scsi/qla2xxx/qla_init.c >>> @@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host >>> *vha, struct qla_qpair *qpair) >>> struct qla_hw_data *ha = qpair->hw; >>> qpair->delete_in_progress = 1; >>> -while (atomic_read(&qpair->ref_count)) >>> -msleep(500); >>> ret = qla25xx_delete_req_que(vha, qpair->req); >>> if (ret != QLA_SUCCESS) >> >> I think that an explanation is needed why that loop had been >> introduced and also why it is safe not to wait until qpair->ref_count >> drops to zero in qla2xxx_delete_qpair(). >> > > commit d74595278f4ab had drawback in design for MQ implementation in > qla2xxx. Now that we have been making this more stable with MQ being > default on for 5x kernel. What we discovered that after heavy IO > workload in a cluster environment, driver unload encountered hang and > shows following stack trace > > # ps -fax | grep rmmod > 6029 pts/0 D+ 0:00 | \_ rmmod qla2xxx > > [<0>] msleep+0x29/0x30 [<0>] qla2xxx_delete_qpair+0x2c/0x160 [qla2xxx] > [<0>] qla25xx_delete_queues+0x14b/0x1d0 [qla2xxx] [<0>] > qla2x00_free_device+0x31/0xe0 [qla2xxx] [<0>] > qla2x00_remove_one+0x239/0x370 [qla2xxx] [<0>] > pci_device_remove+0x3b/0xc0 [<0>] > device_release_driver_internal+0x18c/0x250 [<0>] driver_detach+0x39/0x6d > [<0>] bus_remove_driver+0x77/0xc9 [<0>] pci_unregister_driver+0x2d/0xb0 > [<0>] qla2x00_module_exit+0x2d/0x90 [qla2xxx] [<0>] > __x64_sys_delete_module+0x139/0x270 [<0>] do_syscall_64+0x5b/0x1b0 [<0>] > entry_SYSCALL_64_after_hwframe+0x65/0xca [<0>] 0xffffffffffffffff > > Removing this msleep() help resolve this stack trace. Hi Himanshu, Does your answer mean that this hang has not yet been root-caused fully and hence that it is possible this patch is only a workaround but not a fix of the root cause? Thanks, Bart.
On 11/7/19 9:58 AM, Bart Van Assche wrote: > Does your answer mean that this hang has not yet been root-caused fully > and hence that it is possible this patch is only a workaround but not a > fix of the root cause? Answering my own question: I think that a qpair refcount leak is a severe problem and not something that should be ignored. How about changing the while loop into something like the following: if (atomic_read(&qpair->ref_count)) msleep(500); WARN_ON_ONCE(atomic_read(&qpair->ref_count)); Thanks, Bart.
Hi Bart, > On Nov 7, 2019, at 12:30 PM, Bart Van Assche <bvanassche@acm.org> wrote: > > External Email > > ---------------------------------------------------------------------- > On 11/7/19 9:58 AM, Bart Van Assche wrote: >> Does your answer mean that this hang has not yet been root-caused fully >> and hence that it is possible this patch is only a workaround but not a >> fix of the root cause? > > Answering my own question: I think that a qpair refcount leak is a severe problem and not something that should be ignored. How about changing the while loop into something like the following: > > if (atomic_read(&qpair->ref_count)) > msleep(500); > WARN_ON_ONCE(atomic_read(&qpair->ref_count)); > > Thanks, > > Bart. Since we had seen this hang in a specific cluster environment and refcount leak was observed, I would like to add this patch as is and will consider your suggestion to verify if adding WARN_ON_ONCE will make any difference. If we discover that adding WARN_ON_ONCE indeed helps, then I will add a patch with fixes tag during rc window. Let me know if you disagree. Thanks, Himanshu
On 11/8/19 3:38 PM, Himanshu Madhani wrote: > Hi Bart, > >> On Nov 7, 2019, at 12:30 PM, Bart Van Assche <bvanassche@acm.org> wrote: >> >> External Email >> >> ---------------------------------------------------------------------- >> On 11/7/19 9:58 AM, Bart Van Assche wrote: >>> Does your answer mean that this hang has not yet been root-caused fully >>> and hence that it is possible this patch is only a workaround but not a >>> fix of the root cause? >> >> Answering my own question: I think that a qpair refcount leak is a severe problem and not something that should be ignored. How about changing the while loop into something like the following: >> >> if (atomic_read(&qpair->ref_count)) >> msleep(500); >> WARN_ON_ONCE(atomic_read(&qpair->ref_count)); >> >> Thanks, >> >> Bart. > > Since we had seen this hang in a specific cluster environment and refcount leak was observed, I would like to add this patch as is and will consider your suggestion to verify if adding WARN_ON_ONCE will make any difference. If we discover that adding WARN_ON_ONCE indeed helps, then I will add a patch with fixes tag during rc window. > > Let me know if you disagree. Hi Himanshu, Please do not suppress reports of kernel bugs but instead make sure that some report is provided that indicates that something went wrong and needs further attention. Thanks, Bart.
diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c index bddb26baedd2..ff4528702b4e 100644 --- a/drivers/scsi/qla2xxx/qla_init.c +++ b/drivers/scsi/qla2xxx/qla_init.c @@ -9009,8 +9009,6 @@ int qla2xxx_delete_qpair(struct scsi_qla_host *vha, struct qla_qpair *qpair) struct qla_hw_data *ha = qpair->hw; qpair->delete_in_progress = 1; - while (atomic_read(&qpair->ref_count)) - msleep(500); ret = qla25xx_delete_req_que(vha, qpair->req); if (ret != QLA_SUCCESS)