diff mbox series

Regarding - Patch - Fix crash on qla2x00_mailbox_command

Message ID DM6PR18MB30346814DE1F5807188A844CD2B80@DM6PR18MB3034.namprd18.prod.outlook.com (mailing list archive)
State Changes Requested
Headers show
Series Regarding - Patch - Fix crash on qla2x00_mailbox_command | expand

Commit Message

Saurav Kashyap May 18, 2020, 6:49 a.m. UTC
Hi  Rodrigo,
We are seen regression introduced by below patch for QLA 82XX HBAs. On unload, the disable interrupt, mailbox command (MBX 0x10) fails because of this patch and leaves the FW/HW in unstable state. The next load fails with initialization FW timing out.
The only way out of this is to reboot the server. I  and  test team have tried to reproduce an original problem that is fixed by below patch but we don't have any luck.

We would like to revert the below patch but would like to address original problem as well. Can you share more details about the NULL pointer dereference? Which data structure was NULL and what was the test case?

==============================
git show 3cb182b3fa8b7a61f05c671525494697cba39c6a
commit 3cb182b3fa8b7a61f05c671525494697cba39c6a
Author: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
Date:   Mon May 28 14:58:44 2018 -0300
 
    scsi: qla2xxx: Fix crash on qla2x00_mailbox_command
 
    This patch fixes a crash on qla2x00_mailbox_command caused when the driver
    is on UNLOADING state and tries to call qla2x00_poll, which triggers a
    NULL pointer dereference.
 
    Signed-off-by: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
    Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
    Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
 

Thanks,
~Saurav

Comments

Saurav Kashyap May 20, 2020, 6:33 a.m. UTC | #1
Hi Rodrigo,
Any updates on this?

Thanks,
~Saurav

> -----Original Message-----
> From: Saurav Kashyap
> Sent: Monday, May 18, 2020 12:20 PM
> To: rosattig@linux.vnet.ibm.com
> Cc: linux-scsi@vger.kernel.org; Arun Easi <aeasi@marvell.com>; Girish Basrur
> <gbasrur@marvell.com>; Nilesh Javali <njavali@marvell.com>
> Subject: Regarding - Patch - Fix crash on qla2x00_mailbox_command
> 
> Hi  Rodrigo,
> We are seen regression introduced by below patch for QLA 82XX HBAs. On
> unload, the disable interrupt, mailbox command (MBX 0x10) fails because of
> this patch and leaves the FW/HW in unstable state. The next load fails with
> initialization FW timing out.
> The only way out of this is to reboot the server. I  and  test team have tried to
> reproduce an original problem that is fixed by below patch but we don't have
> any luck.
> 
> We would like to revert the below patch but would like to address original
> problem as well. Can you share more details about the NULL pointer
> dereference? Which data structure was NULL and what was the test case?
> 
> ==============================
> git show 3cb182b3fa8b7a61f05c671525494697cba39c6a
> commit 3cb182b3fa8b7a61f05c671525494697cba39c6a
> Author: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
> Date:   Mon May 28 14:58:44 2018 -0300
> 
>     scsi: qla2xxx: Fix crash on qla2x00_mailbox_command
> 
>     This patch fixes a crash on qla2x00_mailbox_command caused when the
> driver
>     is on UNLOADING state and tries to call qla2x00_poll, which triggers a
>     NULL pointer dereference.
> 
>     Signed-off-by: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
>     Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
>     Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>
>     Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> 
> diff --git a/drivers/scsi/qla2xxx/qla_mbx.c b/drivers/scsi/qla2xxx/qla_mbx.c
> index d8a36c1..7e875f5 100644
> --- a/drivers/scsi/qla2xxx/qla_mbx.c
> +++ b/drivers/scsi/qla2xxx/qla_mbx.c
> @@ -292,6 +292,14 @@ static int is_rom_cmd(uint16_t cmd)
>                         if (time_after(jiffies, wait_time))
>                                 break;
> 
> +                       /*
> +                        * Check if it's UNLOADING, cause we cannot poll in
> +                        * this case, or else a NULL pointer dereference
> +                        * is triggered.
> +                        */
> +                       if (unlikely(test_bit(UNLOADING, &base_vha->dpc_flags)))
> +                               return QLA_FUNCTION_TIMEOUT;
> +
>                         /* Check for pending interrupts. */
>                         qla2x00_poll(ha->rsp_q_map[0]);
> ====================
> 
> Thanks,
> ~Saurav
Guilherme G. Piccoli June 3, 2020, 3:24 a.m. UTC | #2
IIRC Rodrigo's email does not work anymore...looping Mauro, he should
be able to forward to appropriate folks at IBM.
Cheers,


Guilherme


On Wed, May 20, 2020 at 3:33 AM Saurav Kashyap <skashyap@marvell.com> wrote:
>
> Hi Rodrigo,
> Any updates on this?
>
> Thanks,
> ~Saurav
>
> > -----Original Message-----
> > From: Saurav Kashyap
> > Sent: Monday, May 18, 2020 12:20 PM
> > To: rosattig@linux.vnet.ibm.com
> > Cc: linux-scsi@vger.kernel.org; Arun Easi <aeasi@marvell.com>; Girish Basrur
> > <gbasrur@marvell.com>; Nilesh Javali <njavali@marvell.com>
> > Subject: Regarding - Patch - Fix crash on qla2x00_mailbox_command
> >
> > Hi  Rodrigo,
> > We are seen regression introduced by below patch for QLA 82XX HBAs. On
> > unload, the disable interrupt, mailbox command (MBX 0x10) fails because of
> > this patch and leaves the FW/HW in unstable state. The next load fails with
> > initialization FW timing out.
> > The only way out of this is to reboot the server. I  and  test team have tried to
> > reproduce an original problem that is fixed by below patch but we don't have
> > any luck.
> >
> > We would like to revert the below patch but would like to address original
> > problem as well. Can you share more details about the NULL pointer
> > dereference? Which data structure was NULL and what was the test case?
> >
> > ==============================
> > git show 3cb182b3fa8b7a61f05c671525494697cba39c6a
> > commit 3cb182b3fa8b7a61f05c671525494697cba39c6a
> > Author: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
> > Date:   Mon May 28 14:58:44 2018 -0300
> >
> >     scsi: qla2xxx: Fix crash on qla2x00_mailbox_command
> >
> >     This patch fixes a crash on qla2x00_mailbox_command caused when the
> > driver
> >     is on UNLOADING state and tries to call qla2x00_poll, which triggers a
> >     NULL pointer dereference.
> >
> >     Signed-off-by: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
> >     Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
> >     Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>
> >     Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> >
> > diff --git a/drivers/scsi/qla2xxx/qla_mbx.c b/drivers/scsi/qla2xxx/qla_mbx.c
> > index d8a36c1..7e875f5 100644
> > --- a/drivers/scsi/qla2xxx/qla_mbx.c
> > +++ b/drivers/scsi/qla2xxx/qla_mbx.c
> > @@ -292,6 +292,14 @@ static int is_rom_cmd(uint16_t cmd)
> >                         if (time_after(jiffies, wait_time))
> >                                 break;
> >
> > +                       /*
> > +                        * Check if it's UNLOADING, cause we cannot poll in
> > +                        * this case, or else a NULL pointer dereference
> > +                        * is triggered.
> > +                        */
> > +                       if (unlikely(test_bit(UNLOADING, &base_vha->dpc_flags)))
> > +                               return QLA_FUNCTION_TIMEOUT;
> > +
> >                         /* Check for pending interrupts. */
> >                         qla2x00_poll(ha->rsp_q_map[0]);
> > ====================
> >
> > Thanks,
> > ~Saurav
Mauro S. M. Rodrigues June 16, 2020, 3:26 p.m. UTC | #3
Hi Saurav,

I've reopened the internal bugzilla so we can check what's going on.

Do you have a script to reproduce the issue you're seeing so our team can try to
reproduce it?

From what I've read at the internal bz, this issue was quite consistent to
reproduce on our boxes, I suspect that you might not be able to reproduce it
because it was related to EEH mechanism in power architecture, I'm sorry that
information isn't mentioned in the commit message. 

The trace seen at the time was:

38:mon> e
cpu 0x38: Vector: 300 (Data Access) at [c000003fe22872f0]
pc: c00800001382727c: qla2x00_mailbox_command+0x93c/0x1520 [qla2xxx]
lr: c0080000138272cc: qla2x00_mailbox_command+0x98c/0x1520 [qla2xxx]
sp: c000003fe2287570
msr: 9000000000009033
dar: 0
dsisr: 40000000
current = 0xc000003fe2241700
paca    = 0xc000000007a46800	 softe: 0	 irq_happened: 0x01
pid   = 779, comm = eehd
Linux version 4.14.0-49.1.1.el7a.ppc64le
(mockbuild@ppc-048.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623
(Red Hat 4.8.5-28) (GCC)) #1 SMP Tue Apr 10 08:31:02 UTC 2018
38:mon> t
[c000003fe22876c0] c008000013834bb4 qla2x00_dump_ram+0xf4/0x220 [qla2xxx]
[c000003fe22877b0] c00800001389e280 qla25xx_read_optrom_data+0x230/0x390 [qla2xxx]
[c000003fe2287880] c008000013824da0 qla81xx_nvram_config+0xb0/0xb60 [qla2xxx]
[c000003fe2287930] c0080000138222f4 qla2x00_abort_isp+0x204/0x900 [qla2xxx]
[c000003fe2287a20] c0080000138023bc qla2xxx_pci_slot_reset+0x21c/0x600 [qla2xxx]
[c000003fe2287ab0] c00000000004b724 eeh_report_reset+0x184/0x1c0
[c000003fe2287af0] c000000000048f18 eeh_pe_dev_traverse+0x98/0x170
[c000003fe2287b80] c00000000004bcc4 eeh_handle_normal_event+0x564/0x650
[c000003fe2287c60] c00000000004bf74 eeh_handle_event+0x54/0x380
[c000003fe2287d10] c00000000004c69c eeh_event_handler+0x14c/0x210
[c000003fe2287dc0] c000000000171c88 kthread+0x168/0x1b0
[c000003fe2287e30] c00000000000b528 ret_from_kernel_thread+0x5c/0xb4

I'll see if our team can get more information on the specific data structure.

Thank you,

Mauro

On Wed, Jun 03, 2020 at 12:24:22AM -0300, Guilherme G. Piccoli wrote:
> IIRC Rodrigo's email does not work anymore...looping Mauro, he should
> be able to forward to appropriate folks at IBM.
> Cheers,
> 
> 
> Guilherme
> 
> 
> On Wed, May 20, 2020 at 3:33 AM Saurav Kashyap <skashyap@marvell.com> wrote:
> >
> > Hi Rodrigo,
> > Any updates on this?
> >
> > Thanks,
> > ~Saurav
> >
> > > -----Original Message-----
> > > From: Saurav Kashyap
> > > Sent: Monday, May 18, 2020 12:20 PM
> > > To: rosattig@linux.vnet.ibm.com
> > > Cc: linux-scsi@vger.kernel.org; Arun Easi <aeasi@marvell.com>; Girish Basrur
> > > <gbasrur@marvell.com>; Nilesh Javali <njavali@marvell.com>
> > > Subject: Regarding - Patch - Fix crash on qla2x00_mailbox_command
> > >
> > > Hi  Rodrigo,
> > > We are seen regression introduced by below patch for QLA 82XX HBAs. On
> > > unload, the disable interrupt, mailbox command (MBX 0x10) fails because of
> > > this patch and leaves the FW/HW in unstable state. The next load fails with
> > > initialization FW timing out.
> > > The only way out of this is to reboot the server. I  and  test team have tried to
> > > reproduce an original problem that is fixed by below patch but we don't have
> > > any luck.
> > >
> > > We would like to revert the below patch but would like to address original
> > > problem as well. Can you share more details about the NULL pointer
> > > dereference? Which data structure was NULL and what was the test case?
> > >
> > > ==============================
> > > git show 3cb182b3fa8b7a61f05c671525494697cba39c6a
> > > commit 3cb182b3fa8b7a61f05c671525494697cba39c6a
> > > Author: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
> > > Date:   Mon May 28 14:58:44 2018 -0300
> > >
> > >     scsi: qla2xxx: Fix crash on qla2x00_mailbox_command
> > >
> > >     This patch fixes a crash on qla2x00_mailbox_command caused when the
> > > driver
> > >     is on UNLOADING state and tries to call qla2x00_poll, which triggers a
> > >     NULL pointer dereference.
> > >
> > >     Signed-off-by: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
> > >     Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
> > >     Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>
> > >     Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> > >
> > > diff --git a/drivers/scsi/qla2xxx/qla_mbx.c b/drivers/scsi/qla2xxx/qla_mbx.c
> > > index d8a36c1..7e875f5 100644
> > > --- a/drivers/scsi/qla2xxx/qla_mbx.c
> > > +++ b/drivers/scsi/qla2xxx/qla_mbx.c
> > > @@ -292,6 +292,14 @@ static int is_rom_cmd(uint16_t cmd)
> > >                         if (time_after(jiffies, wait_time))
> > >                                 break;
> > >
> > > +                       /*
> > > +                        * Check if it's UNLOADING, cause we cannot poll in
> > > +                        * this case, or else a NULL pointer dereference
> > > +                        * is triggered.
> > > +                        */
> > > +                       if (unlikely(test_bit(UNLOADING, &base_vha->dpc_flags)))
> > > +                               return QLA_FUNCTION_TIMEOUT;
> > > +
> > >                         /* Check for pending interrupts. */
> > >                         qla2x00_poll(ha->rsp_q_map[0]);
> > > ====================
> > >
> > > Thanks,
> > > ~Saurav
diff mbox series

Patch

diff --git a/drivers/scsi/qla2xxx/qla_mbx.c b/drivers/scsi/qla2xxx/qla_mbx.c
index d8a36c1..7e875f5 100644
--- a/drivers/scsi/qla2xxx/qla_mbx.c
+++ b/drivers/scsi/qla2xxx/qla_mbx.c
@@ -292,6 +292,14 @@  static int is_rom_cmd(uint16_t cmd)
                        if (time_after(jiffies, wait_time))
                                break;
 
+                       /*
+                        * Check if it's UNLOADING, cause we cannot poll in
+                        * this case, or else a NULL pointer dereference
+                        * is triggered.
+                        */
+                       if (unlikely(test_bit(UNLOADING, &base_vha->dpc_flags)))
+                               return QLA_FUNCTION_TIMEOUT;
+
                        /* Check for pending interrupts. */
                        qla2x00_poll(ha->rsp_q_map[0]);
====================