diff mbox series

[v4,3/6] mpt3sas: Introdude _scsih_get_shost_and_ioc.

Message ID 1537935759-14754-4-git-send-email-suganath-prabu.subramani@broadcom.com (mailing list archive)
State Changes Requested
Headers show
Series mpt3sas: Hot-Plug Surprise removal support on IOC. | expand

Commit Message

Suganath Prabu S Sept. 26, 2018, 4:22 a.m. UTC
The code for getting shost and IOC is redundant so
moved that to function "scsih_get_shost_and_ioc".
Also checks for NULL are added to IOC and shost.

Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 98 ++++++++++++++++++++++++++++++------
 1 file changed, 82 insertions(+), 16 deletions(-)

Comments

Bjorn Helgaas Sept. 26, 2018, 9:09 p.m. UTC | #1
On Wed, Sep 26, 2018 at 09:52:36AM +0530, Suganath Prabu S wrote:
> The code for getting shost and IOC is redundant so
> moved that to function "scsih_get_shost_and_ioc".
> Also checks for NULL are added to IOC and shost.
> 
> Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
> ---
>  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 98 ++++++++++++++++++++++++++++++------
>  1 file changed, 82 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> index 566a550..f6e92eb 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> @@ -9809,6 +9809,35 @@ _scsih_ir_shutdown(struct MPT3SAS_ADAPTER *ioc)
>  }
>  
>  /**
> + * _scsih_get_shost_and_ioc - get shost and ioc
> + *			and verify whether they are NULL or not
> + * @pdev: PCI device struct
> + * @shost: address of scsi host pointer
> + * @ioc: address of HBA adapter pointer
> + *
> + * Return zero if *shost and *ioc are not NULL otherwise return error number.
> + */
> +static int
> +_scsih_get_shost_and_ioc(struct pci_dev *pdev,
> +	struct Scsi_Host **shost, struct MPT3SAS_ADAPTER **ioc)
> +{
> +	*shost = pci_get_drvdata(pdev);
> +	if (*shost == NULL) {
> +		dev_err(&pdev->dev, "pdev's driver data is null\n");
> +		return -ENXIO;
> +	}
> +
> +	*ioc = shost_priv(*shost);
> +	if (*ioc == NULL) {
> +		dev_err(&pdev->dev, "shost's private data is null\n");
> +		return -ENXIO;

I think it's better to omit NULL pointer checks like these because
there should not be a path where we can execute this code when these
pointers are NULL.

If there *is* such a path, I think that's a serious bug and it's
better to oops when we try to dereference the NULL pointer.  If we
just return an error code, it's likely the bug will be ignored and
never fixed.

Bjorn
Bjorn Helgaas Sept. 26, 2018, 9:33 p.m. UTC | #2
Trivial nits/questions.  In subject:

  s/Introdude/Introduce/
  s/  / /   (remove double space)
  s/\.//    (remove trailing period, also appears in patches 4, 5, 6)

On Wed, Sep 26, 2018 at 09:52:36AM +0530, Suganath Prabu S wrote:
> The code for getting shost and IOC is redundant so
> moved that to function "scsih_get_shost_and_ioc".
> Also checks for NULL are added to IOC and shost.
> 
> Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>

Per [1], it's good to use full names, e.g., spell out "S" when that
makes sense.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst#n456

> ---
>  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 98 ++++++++++++++++++++++++++++++------
>  1 file changed, 82 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> index 566a550..f6e92eb 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> @@ -9809,6 +9809,35 @@ _scsih_ir_shutdown(struct MPT3SAS_ADAPTER *ioc)
>  }
>  
>  /**
> + * _scsih_get_shost_and_ioc - get shost and ioc
> + *			and verify whether they are NULL or not
> + * @pdev: PCI device struct
> + * @shost: address of scsi host pointer

There seems to be a convention to capitalize "IOC" and "SCSI" when
used in English text, i.e., when they're not variable names or part of
a function name.

> + * @ioc: address of HBA adapter pointer
> + *
> + * Return zero if *shost and *ioc are not NULL otherwise return error number.
> + */
> +static int
> +_scsih_get_shost_and_ioc(struct pci_dev *pdev,
> +	struct Scsi_Host **shost, struct MPT3SAS_ADAPTER **ioc)
> +{
> +	*shost = pci_get_drvdata(pdev);
> +	if (*shost == NULL) {
> +		dev_err(&pdev->dev, "pdev's driver data is null\n");

I don't see any other uses of dev_printk() in this file; existing
output seems to use sdev_printk(), pr_info(), etc.  Just pointing this
out to make sure dev_printk() is really what you want here.

> +		return -ENXIO;
> +	}

Bjorn
Suganath Prabu S Oct. 1, 2018, 7:27 a.m. UTC | #3
On Thu, Sep 27, 2018 at 2:39 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Sep 26, 2018 at 09:52:36AM +0530, Suganath Prabu S wrote:
> > The code for getting shost and IOC is redundant so
> > moved that to function "scsih_get_shost_and_ioc".
> > Also checks for NULL are added to IOC and shost.
> >
> > Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
> > ---
> >  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 98 ++++++++++++++++++++++++++++++------
> >  1 file changed, 82 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > index 566a550..f6e92eb 100644
> > --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > @@ -9809,6 +9809,35 @@ _scsih_ir_shutdown(struct MPT3SAS_ADAPTER *ioc)
> >  }
> >
> >  /**
> > + * _scsih_get_shost_and_ioc - get shost and ioc
> > + *                   and verify whether they are NULL or not
> > + * @pdev: PCI device struct
> > + * @shost: address of scsi host pointer
> > + * @ioc: address of HBA adapter pointer
> > + *
> > + * Return zero if *shost and *ioc are not NULL otherwise return error number.
> > + */
> > +static int
> > +_scsih_get_shost_and_ioc(struct pci_dev *pdev,
> > +     struct Scsi_Host **shost, struct MPT3SAS_ADAPTER **ioc)
> > +{
> > +     *shost = pci_get_drvdata(pdev);
> > +     if (*shost == NULL) {
> > +             dev_err(&pdev->dev, "pdev's driver data is null\n");
> > +             return -ENXIO;
> > +     }
> > +
> > +     *ioc = shost_priv(*shost);
> > +     if (*ioc == NULL) {
> > +             dev_err(&pdev->dev, "shost's private data is null\n");
> > +             return -ENXIO;
>
> I think it's better to omit NULL pointer checks like these because
> there should not be a path where we can execute this code when these
> pointers are NULL.
>
> If there *is* such a path, I think that's a serious bug and it's
> better to oops when we try to dereference the NULL pointer.  If we
> just return an error code, it's likely the bug will be ignored and
> never fixed.
>
We have added the ioc and shost checks, because of kernel panic in
below scenario.
Have 3 HBA's in system and perform below operation.
1) Run “poweroff”.
2) Immediate hot unplug HBA.
We have observed, kernel's shutdown process has removed all the 3 HBA
devices smoothly,
and also user have unplugged the HBA device during this time. PCI
subsystem's hot-plug thread is also trying to remove
the hot plugged PCI device which is already removed/cleaned by the
shutdown process. (Which is not expected for the
already removed device) Due to this kernel panic is observed. And we
are not sure whether it
has to fixed from pciehp layer, so we added sanity checks in driver.

[ 1745.605510] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000a98
[ 1745.606554] IP: [<ffffffffa054c750>] scsih_remove+0x20/0x2d0 [mpt3sas]
[ 1745.607609] PGD 0
[ 1745.608621] Oops: 0000 [#1] SMP
[ 1745.622989] CPU: 0 PID: 668 Comm: kworker/0:2 Tainted: G
O    4.4.55-1.el7.elrepo.x86_64 #1
[ 1745.624800] Hardware name: PRO-MNU65930231
PRO-NME69559126/BRD-PRO55212588, BIOS 0.51e 05/08/2017
[ 1745.626673] Workqueue: pciehp-3 pciehp_power_thread
[ 1745.628566] task: ffff881fe50dd880 ti: ffff881fe88e4000 task.ti:
ffff881fe88e4000
[ 1745.630530] RIP: 0010:[<ffffffffa054c750>]  [<ffffffffa054c750>]
scsih_remove+0x20/0x2d0 [mpt3sas]
[ 1745.632577] RSP: 0018:ffff881fe88e7c98  EFLAGS: 00010292
[ 1745.634639] RAX: 0000000000000001 RBX: ffff881feef5c000 RCX: 0000000000000000
[ 1745.636718] RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff881feef5c000
[ 1745.638832] RBP: ffff881fe88e7cc8 R08: 0000000000000000 R09: 0000000180220002
[ 1745.640972] R10: 00000000eaf4a401 R11: ffffea007fabd280 R12: 0000000000000000
[ 1745.643136] R13: ffffffffa0576020 R14: ffff881fe9af8240 R15: 0000000000000000
[ 1745.645320] FS:  0000000000000000(0000) GS:ffff881ffde00000(0000)
knlGS:0000000000000000
[ 1745.647572] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1745.649833] CR2: 0000000000000a98 CR3: 0000001fe76df000 CR4: 00000000003406f0
[ 1745.652147] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1745.654476] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1745.656825] Stack:
[ 1745.659138]  ffff881fe88e7cc8 ffff881feef5c098 ffff881feef5c000
ffffffffa0576020
[ 1745.661562]  ffff881fe9af8240 0000000000000000 ffff881fe88e7cf0
ffffffff8137f9d9
[ 1745.663990]  ffff881feef5c098 ffffffffa0576088 ffff881feef5c000
ffff881fe88e7d10
[ 1745.666428] Call Trace:
[ 1745.668830]  [<ffffffff8137f9d9>] pci_device_remove+0x39/0xc0
[ 1745.671256]  [<ffffffff8147db36>] __device_release_driver+0x96/0x130
[ 1745.673664]  [<ffffffff8147dbf3>] device_release_driver+0x23/0x30
[ 1745.676071]  [<ffffffff8137851c>] pci_stop_bus_device+0x8c/0xa0
[ 1745.678485]  [<ffffffff81378612>] pci_stop_and_remove_bus_device+0x12/0x20
[ 1745.680909]  [<ffffffff81392eea>] pciehp_unconfigure_device+0xaa/0x1b0
[ 1745.683331]  [<ffffffff813929e2>] pciehp_disable_slot+0x52/0xd0
[ 1745.685767]  [<ffffffff81392aed>] pciehp_power_thread+0x8d/0xb0
[ 1745.688210]  [<ffffffff8109728f>] process_one_work+0x14f/0x400
[ 1745.690633]  [<ffffffff81097b04>] worker_thread+0x114/0x470
[ 1745.693080]  [<ffffffff810979f0>] ? rescuer_thread+0x310/0x310
[ 1745.695518]  [<ffffffff8109d648>] kthread+0xd8/0xf0
[ 1745.697936]  [<ffffffff8109d570>] ? kthread_park+0x60/0x60
[ 1745.700359]  [<ffffffff816f854f>] ret_from_fork+0x3f/0x70
[ 1745.702747]  [<ffffffff8109d570>] ? kthread_park+0x60/0x60

> Bjorn
Bjorn Helgaas Oct. 1, 2018, 1:40 p.m. UTC | #4
[+cc LKML]

On Mon, Oct 01, 2018 at 12:57:01PM +0530, Suganath Prabu Subramani wrote:
> On Thu, Sep 27, 2018 at 2:39 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Wed, Sep 26, 2018 at 09:52:36AM +0530, Suganath Prabu S wrote:
> > > The code for getting shost and IOC is redundant so
> > > moved that to function "scsih_get_shost_and_ioc".
> > > Also checks for NULL are added to IOC and shost.
> > >
> > > Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
> > > ---
> > >  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 98 ++++++++++++++++++++++++++++++------
> > >  1 file changed, 82 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > > index 566a550..f6e92eb 100644
> > > --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > > +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > > @@ -9809,6 +9809,35 @@ _scsih_ir_shutdown(struct MPT3SAS_ADAPTER *ioc)
> > >  }
> > >
> > >  /**
> > > + * _scsih_get_shost_and_ioc - get shost and ioc
> > > + *                   and verify whether they are NULL or not
> > > + * @pdev: PCI device struct
> > > + * @shost: address of scsi host pointer
> > > + * @ioc: address of HBA adapter pointer
> > > + *
> > > + * Return zero if *shost and *ioc are not NULL otherwise return error number.
> > > + */
> > > +static int
> > > +_scsih_get_shost_and_ioc(struct pci_dev *pdev,
> > > +     struct Scsi_Host **shost, struct MPT3SAS_ADAPTER **ioc)
> > > +{
> > > +     *shost = pci_get_drvdata(pdev);
> > > +     if (*shost == NULL) {
> > > +             dev_err(&pdev->dev, "pdev's driver data is null\n");
> > > +             return -ENXIO;
> > > +     }
> > > +
> > > +     *ioc = shost_priv(*shost);
> > > +     if (*ioc == NULL) {
> > > +             dev_err(&pdev->dev, "shost's private data is null\n");
> > > +             return -ENXIO;
> >
> > I think it's better to omit NULL pointer checks like these because
> > there should not be a path where we can execute this code when these
> > pointers are NULL.
> >
> > If there *is* such a path, I think that's a serious bug and it's
> > better to oops when we try to dereference the NULL pointer.  If we
> > just return an error code, it's likely the bug will be ignored and
> > never fixed.
> >
> We have added the ioc and shost checks, because of kernel panic in
> below scenario.
> Have 3 HBA's in system and perform below operation.
> 1) Run “poweroff”.
> 2) Immediate hot unplug HBA.
> We have observed, kernel's shutdown process has removed all the 3
> HBA devices smoothly, and also user have unplugged the HBA device
> during this time. PCI subsystem's hot-plug thread is also trying to
> remove the hot plugged PCI device which is already removed/cleaned
> by the shutdown process. (Which is not expected for the already
> removed device) Due to this kernel panic is observed. And we are not
> sure whether it has to fixed from pciehp layer, so we added sanity
> checks in driver.

This is a great example.  scsih_remove() is the mpt3sas_driver.remove
method.  It sounds like it's getting called twice for the same HBA.
I think that's a PCI core defect and we should fix it there instead of
trying to work around it in every driver.

The trace below is from v4.4.55.  Can you reproduce the same problem
with a current tree, e.g., v4.19-rc?  There have been many changes
since v4.4 that may have fixed this problem.

> [ 1745.605510] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000a98
> [ 1745.606554] IP: [<ffffffffa054c750>] scsih_remove+0x20/0x2d0 [mpt3sas]
> [ 1745.607609] PGD 0
> [ 1745.608621] Oops: 0000 [#1] SMP
> [ 1745.622989] CPU: 0 PID: 668 Comm: kworker/0:2 Tainted: G
> O    4.4.55-1.el7.elrepo.x86_64 #1
> [ 1745.624800] Hardware name: PRO-MNU65930231
> PRO-NME69559126/BRD-PRO55212588, BIOS 0.51e 05/08/2017
> [ 1745.626673] Workqueue: pciehp-3 pciehp_power_thread
> [ 1745.628566] task: ffff881fe50dd880 ti: ffff881fe88e4000 task.ti:
> ffff881fe88e4000
> [ 1745.630530] RIP: 0010:[<ffffffffa054c750>]  [<ffffffffa054c750>]
> scsih_remove+0x20/0x2d0 [mpt3sas]
> [ 1745.632577] RSP: 0018:ffff881fe88e7c98  EFLAGS: 00010292
> [ 1745.634639] RAX: 0000000000000001 RBX: ffff881feef5c000 RCX: 0000000000000000
> [ 1745.636718] RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff881feef5c000
> [ 1745.638832] RBP: ffff881fe88e7cc8 R08: 0000000000000000 R09: 0000000180220002
> [ 1745.640972] R10: 00000000eaf4a401 R11: ffffea007fabd280 R12: 0000000000000000
> [ 1745.643136] R13: ffffffffa0576020 R14: ffff881fe9af8240 R15: 0000000000000000
> [ 1745.645320] FS:  0000000000000000(0000) GS:ffff881ffde00000(0000)
> knlGS:0000000000000000
> [ 1745.647572] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1745.649833] CR2: 0000000000000a98 CR3: 0000001fe76df000 CR4: 00000000003406f0
> [ 1745.652147] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1745.654476] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1745.656825] Stack:
> [ 1745.659138]  ffff881fe88e7cc8 ffff881feef5c098 ffff881feef5c000
> ffffffffa0576020
> [ 1745.661562]  ffff881fe9af8240 0000000000000000 ffff881fe88e7cf0
> ffffffff8137f9d9
> [ 1745.663990]  ffff881feef5c098 ffffffffa0576088 ffff881feef5c000
> ffff881fe88e7d10
> [ 1745.666428] Call Trace:
> [ 1745.668830]  [<ffffffff8137f9d9>] pci_device_remove+0x39/0xc0
> [ 1745.671256]  [<ffffffff8147db36>] __device_release_driver+0x96/0x130
> [ 1745.673664]  [<ffffffff8147dbf3>] device_release_driver+0x23/0x30
> [ 1745.676071]  [<ffffffff8137851c>] pci_stop_bus_device+0x8c/0xa0
> [ 1745.678485]  [<ffffffff81378612>] pci_stop_and_remove_bus_device+0x12/0x20
> [ 1745.680909]  [<ffffffff81392eea>] pciehp_unconfigure_device+0xaa/0x1b0
> [ 1745.683331]  [<ffffffff813929e2>] pciehp_disable_slot+0x52/0xd0
> [ 1745.685767]  [<ffffffff81392aed>] pciehp_power_thread+0x8d/0xb0
> [ 1745.688210]  [<ffffffff8109728f>] process_one_work+0x14f/0x400
> [ 1745.690633]  [<ffffffff81097b04>] worker_thread+0x114/0x470
> [ 1745.693080]  [<ffffffff810979f0>] ? rescuer_thread+0x310/0x310
> [ 1745.695518]  [<ffffffff8109d648>] kthread+0xd8/0xf0
> [ 1745.697936]  [<ffffffff8109d570>] ? kthread_park+0x60/0x60
> [ 1745.700359]  [<ffffffff816f854f>] ret_from_fork+0x3f/0x70
> [ 1745.702747]  [<ffffffff8109d570>] ? kthread_park+0x60/0x60
Suganath Prabu S Oct. 8, 2018, 6:47 a.m. UTC | #5
Hi Bjorn,
We tried to recreate the issue with latest kernel multiple
times.(Without including this patch,)
but we were not able to recreate it. On 4.4 kernel we hit this issue
time long back.
So we are not sure that this issue is really fixed with latest kernel or not.
We prefer to keep this patch as this only adds sanity check.
We are also okay, if you suggest us to completely remove this patch.

Thanks,
Suganath Prabu
On Mon, Oct 1, 2018 at 7:11 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc LKML]
>
> On Mon, Oct 01, 2018 at 12:57:01PM +0530, Suganath Prabu Subramani wrote:
> > On Thu, Sep 27, 2018 at 2:39 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Wed, Sep 26, 2018 at 09:52:36AM +0530, Suganath Prabu S wrote:
> > > > The code for getting shost and IOC is redundant so
> > > > moved that to function "scsih_get_shost_and_ioc".
> > > > Also checks for NULL are added to IOC and shost.
> > > >
> > > > Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
> > > > ---
> > > >  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 98 ++++++++++++++++++++++++++++++------
> > > >  1 file changed, 82 insertions(+), 16 deletions(-)
> > > >
> > > > diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > > > index 566a550..f6e92eb 100644
> > > > --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > > > +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > > > @@ -9809,6 +9809,35 @@ _scsih_ir_shutdown(struct MPT3SAS_ADAPTER *ioc)
> > > >  }
> > > >
> > > >  /**
> > > > + * _scsih_get_shost_and_ioc - get shost and ioc
> > > > + *                   and verify whether they are NULL or not
> > > > + * @pdev: PCI device struct
> > > > + * @shost: address of scsi host pointer
> > > > + * @ioc: address of HBA adapter pointer
> > > > + *
> > > > + * Return zero if *shost and *ioc are not NULL otherwise return error number.
> > > > + */
> > > > +static int
> > > > +_scsih_get_shost_and_ioc(struct pci_dev *pdev,
> > > > +     struct Scsi_Host **shost, struct MPT3SAS_ADAPTER **ioc)
> > > > +{
> > > > +     *shost = pci_get_drvdata(pdev);
> > > > +     if (*shost == NULL) {
> > > > +             dev_err(&pdev->dev, "pdev's driver data is null\n");
> > > > +             return -ENXIO;
> > > > +     }
> > > > +
> > > > +     *ioc = shost_priv(*shost);
> > > > +     if (*ioc == NULL) {
> > > > +             dev_err(&pdev->dev, "shost's private data is null\n");
> > > > +             return -ENXIO;
> > >
> > > I think it's better to omit NULL pointer checks like these because
> > > there should not be a path where we can execute this code when these
> > > pointers are NULL.
> > >
> > > If there *is* such a path, I think that's a serious bug and it's
> > > better to oops when we try to dereference the NULL pointer.  If we
> > > just return an error code, it's likely the bug will be ignored and
> > > never fixed.
> > >
> > We have added the ioc and shost checks, because of kernel panic in
> > below scenario.
> > Have 3 HBA's in system and perform below operation.
> > 1) Run “poweroff”.
> > 2) Immediate hot unplug HBA.
> > We have observed, kernel's shutdown process has removed all the 3
> > HBA devices smoothly, and also user have unplugged the HBA device
> > during this time. PCI subsystem's hot-plug thread is also trying to
> > remove the hot plugged PCI device which is already removed/cleaned
> > by the shutdown process. (Which is not expected for the already
> > removed device) Due to this kernel panic is observed. And we are not
> > sure whether it has to fixed from pciehp layer, so we added sanity
> > checks in driver.
>
> This is a great example.  scsih_remove() is the mpt3sas_driver.remove
> method.  It sounds like it's getting called twice for the same HBA.
> I think that's a PCI core defect and we should fix it there instead of
> trying to work around it in every driver.
>
> The trace below is from v4.4.55.  Can you reproduce the same problem
> with a current tree, e.g., v4.19-rc?  There have been many changes
> since v4.4 that may have fixed this problem.
>
> > [ 1745.605510] BUG: unable to handle kernel NULL pointer dereference
> > at 0000000000000a98
> > [ 1745.606554] IP: [<ffffffffa054c750>] scsih_remove+0x20/0x2d0 [mpt3sas]
> > [ 1745.607609] PGD 0
> > [ 1745.608621] Oops: 0000 [#1] SMP
> > [ 1745.622989] CPU: 0 PID: 668 Comm: kworker/0:2 Tainted: G
> > O    4.4.55-1.el7.elrepo.x86_64 #1
> > [ 1745.624800] Hardware name: PRO-MNU65930231
> > PRO-NME69559126/BRD-PRO55212588, BIOS 0.51e 05/08/2017
> > [ 1745.626673] Workqueue: pciehp-3 pciehp_power_thread
> > [ 1745.628566] task: ffff881fe50dd880 ti: ffff881fe88e4000 task.ti:
> > ffff881fe88e4000
> > [ 1745.630530] RIP: 0010:[<ffffffffa054c750>]  [<ffffffffa054c750>]
> > scsih_remove+0x20/0x2d0 [mpt3sas]
> > [ 1745.632577] RSP: 0018:ffff881fe88e7c98  EFLAGS: 00010292
> > [ 1745.634639] RAX: 0000000000000001 RBX: ffff881feef5c000 RCX: 0000000000000000
> > [ 1745.636718] RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff881feef5c000
> > [ 1745.638832] RBP: ffff881fe88e7cc8 R08: 0000000000000000 R09: 0000000180220002
> > [ 1745.640972] R10: 00000000eaf4a401 R11: ffffea007fabd280 R12: 0000000000000000
> > [ 1745.643136] R13: ffffffffa0576020 R14: ffff881fe9af8240 R15: 0000000000000000
> > [ 1745.645320] FS:  0000000000000000(0000) GS:ffff881ffde00000(0000)
> > knlGS:0000000000000000
> > [ 1745.647572] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1745.649833] CR2: 0000000000000a98 CR3: 0000001fe76df000 CR4: 00000000003406f0
> > [ 1745.652147] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 1745.654476] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 1745.656825] Stack:
> > [ 1745.659138]  ffff881fe88e7cc8 ffff881feef5c098 ffff881feef5c000
> > ffffffffa0576020
> > [ 1745.661562]  ffff881fe9af8240 0000000000000000 ffff881fe88e7cf0
> > ffffffff8137f9d9
> > [ 1745.663990]  ffff881feef5c098 ffffffffa0576088 ffff881feef5c000
> > ffff881fe88e7d10
> > [ 1745.666428] Call Trace:
> > [ 1745.668830]  [<ffffffff8137f9d9>] pci_device_remove+0x39/0xc0
> > [ 1745.671256]  [<ffffffff8147db36>] __device_release_driver+0x96/0x130
> > [ 1745.673664]  [<ffffffff8147dbf3>] device_release_driver+0x23/0x30
> > [ 1745.676071]  [<ffffffff8137851c>] pci_stop_bus_device+0x8c/0xa0
> > [ 1745.678485]  [<ffffffff81378612>] pci_stop_and_remove_bus_device+0x12/0x20
> > [ 1745.680909]  [<ffffffff81392eea>] pciehp_unconfigure_device+0xaa/0x1b0
> > [ 1745.683331]  [<ffffffff813929e2>] pciehp_disable_slot+0x52/0xd0
> > [ 1745.685767]  [<ffffffff81392aed>] pciehp_power_thread+0x8d/0xb0
> > [ 1745.688210]  [<ffffffff8109728f>] process_one_work+0x14f/0x400
> > [ 1745.690633]  [<ffffffff81097b04>] worker_thread+0x114/0x470
> > [ 1745.693080]  [<ffffffff810979f0>] ? rescuer_thread+0x310/0x310
> > [ 1745.695518]  [<ffffffff8109d648>] kthread+0xd8/0xf0
> > [ 1745.697936]  [<ffffffff8109d570>] ? kthread_park+0x60/0x60
> > [ 1745.700359]  [<ffffffff816f854f>] ret_from_fork+0x3f/0x70
> > [ 1745.702747]  [<ffffffff8109d570>] ? kthread_park+0x60/0x60
diff mbox series

Patch

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 566a550..f6e92eb 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -9809,6 +9809,35 @@  _scsih_ir_shutdown(struct MPT3SAS_ADAPTER *ioc)
 }
 
 /**
+ * _scsih_get_shost_and_ioc - get shost and ioc
+ *			and verify whether they are NULL or not
+ * @pdev: PCI device struct
+ * @shost: address of scsi host pointer
+ * @ioc: address of HBA adapter pointer
+ *
+ * Return zero if *shost and *ioc are not NULL otherwise return error number.
+ */
+static int
+_scsih_get_shost_and_ioc(struct pci_dev *pdev,
+	struct Scsi_Host **shost, struct MPT3SAS_ADAPTER **ioc)
+{
+	*shost = pci_get_drvdata(pdev);
+	if (*shost == NULL) {
+		dev_err(&pdev->dev, "pdev's driver data is null\n");
+		return -ENXIO;
+	}
+
+	*ioc = shost_priv(*shost);
+	if (*ioc == NULL) {
+		dev_err(&pdev->dev, "shost's private data is null\n");
+		return -ENXIO;
+	}
+
+	return 0;
+}
+
+
+/**
  * scsih_remove - detach and remove add host
  * @pdev: PCI device struct
  *
@@ -9816,8 +9845,8 @@  _scsih_ir_shutdown(struct MPT3SAS_ADAPTER *ioc)
  */
 static void scsih_remove(struct pci_dev *pdev)
 {
-	struct Scsi_Host *shost = pci_get_drvdata(pdev);
-	struct MPT3SAS_ADAPTER *ioc = shost_priv(shost);
+	struct Scsi_Host *shost = NULL;
+	struct MPT3SAS_ADAPTER *ioc = NULL;
 	struct _sas_port *mpt3sas_port, *next_port;
 	struct _raid_device *raid_device, *next;
 	struct MPT3SAS_TARGET *sas_target_priv_data;
@@ -9825,6 +9854,10 @@  static void scsih_remove(struct pci_dev *pdev)
 	struct workqueue_struct	*wq;
 	unsigned long flags;
 
+	if (_scsih_get_shost_and_ioc(pdev, &shost, &ioc)) {
+		dev_err(&pdev->dev, "unable to remove device\n");
+		return;
+	}
 	ioc->remove_host = 1;
 
 	mpt3sas_wait_for_commands_to_complete(ioc);
@@ -9898,11 +9931,16 @@  static void scsih_remove(struct pci_dev *pdev)
 static void
 scsih_shutdown(struct pci_dev *pdev)
 {
-	struct Scsi_Host *shost = pci_get_drvdata(pdev);
-	struct MPT3SAS_ADAPTER *ioc = shost_priv(shost);
+	struct Scsi_Host *shost = NULL;
+	struct MPT3SAS_ADAPTER *ioc = NULL;
 	struct workqueue_struct	*wq;
 	unsigned long flags;
 
+	if (_scsih_get_shost_and_ioc(pdev, &shost, &ioc)) {
+		dev_err(&pdev->dev, "unable to shutdown device\n");
+		return;
+	}
+
 	ioc->remove_host = 1;
 
 	mpt3sas_wait_for_commands_to_complete(ioc);
@@ -10727,10 +10765,16 @@  out_add_shost_fail:
 static int
 scsih_suspend(struct pci_dev *pdev, pm_message_t state)
 {
-	struct Scsi_Host *shost = pci_get_drvdata(pdev);
-	struct MPT3SAS_ADAPTER *ioc = shost_priv(shost);
+	struct Scsi_Host *shost = NULL;
+	struct MPT3SAS_ADAPTER *ioc = NULL;
 	pci_power_t device_state;
+	int rc;
 
+	rc = _scsih_get_shost_and_ioc(pdev, &shost, &ioc);
+	if (rc) {
+		dev_err(&pdev->dev, "unable to suspend device\n");
+		return rc;
+	}
 	mpt3sas_base_stop_watchdog(ioc);
 	flush_scheduled_work();
 	scsi_block_requests(shost);
@@ -10754,11 +10798,17 @@  scsih_suspend(struct pci_dev *pdev, pm_message_t state)
 static int
 scsih_resume(struct pci_dev *pdev)
 {
-	struct Scsi_Host *shost = pci_get_drvdata(pdev);
-	struct MPT3SAS_ADAPTER *ioc = shost_priv(shost);
+	struct Scsi_Host *shost = NULL;
+	struct MPT3SAS_ADAPTER *ioc = NULL;
 	pci_power_t device_state = pdev->current_state;
 	int r;
 
+	r = _scsih_get_shost_and_ioc(pdev, &shost, &ioc);
+	if (r) {
+		dev_err(&pdev->dev, "unable to resume device\n");
+		return r;
+	}
+
 	pr_info(MPT3SAS_FMT
 		"pdev=0x%p, slot=%s, previous operating state [D%d]\n",
 		ioc->name, pdev, pci_name(pdev), device_state);
@@ -10790,9 +10840,13 @@  scsih_resume(struct pci_dev *pdev)
 static pci_ers_result_t
 scsih_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
 {
-	struct Scsi_Host *shost = pci_get_drvdata(pdev);
-	struct MPT3SAS_ADAPTER *ioc = shost_priv(shost);
+	struct Scsi_Host *shost = NULL;
+	struct MPT3SAS_ADAPTER *ioc = NULL;
 
+	if (_scsih_get_shost_and_ioc(pdev, &shost, &ioc)) {
+		dev_err(&pdev->dev, "device unavailable\n");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
 	pr_info(MPT3SAS_FMT "PCI error: detected callback, state(%d)!!\n",
 	    ioc->name, state);
 
@@ -10827,10 +10881,14 @@  scsih_pci_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
 static pci_ers_result_t
 scsih_pci_slot_reset(struct pci_dev *pdev)
 {
-	struct Scsi_Host *shost = pci_get_drvdata(pdev);
-	struct MPT3SAS_ADAPTER *ioc = shost_priv(shost);
+	struct Scsi_Host *shost = NULL;
+	struct MPT3SAS_ADAPTER *ioc = NULL;
 	int rc;
 
+	if (_scsih_get_shost_and_ioc(pdev, &shost, &ioc)) {
+		dev_err(&pdev->dev, "unable to perform slot reset\n");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
 	pr_info(MPT3SAS_FMT "PCI error: slot reset callback!!\n",
 	     ioc->name);
 
@@ -10863,9 +10921,13 @@  scsih_pci_slot_reset(struct pci_dev *pdev)
 static void
 scsih_pci_resume(struct pci_dev *pdev)
 {
-	struct Scsi_Host *shost = pci_get_drvdata(pdev);
-	struct MPT3SAS_ADAPTER *ioc = shost_priv(shost);
+	struct Scsi_Host *shost = NULL;
+	struct MPT3SAS_ADAPTER *ioc = NULL;
 
+	if (_scsih_get_shost_and_ioc(pdev, &shost, &ioc)) {
+		dev_err(&pdev->dev, "unable to resume device\n");
+		return;
+	}
 	pr_info(MPT3SAS_FMT "PCI error: resume callback!!\n", ioc->name);
 
 	pci_cleanup_aer_uncorrect_error_status(pdev);
@@ -10880,9 +10942,13 @@  scsih_pci_resume(struct pci_dev *pdev)
 static pci_ers_result_t
 scsih_pci_mmio_enabled(struct pci_dev *pdev)
 {
-	struct Scsi_Host *shost = pci_get_drvdata(pdev);
-	struct MPT3SAS_ADAPTER *ioc = shost_priv(shost);
+	struct Scsi_Host *shost = NULL;
+	struct MPT3SAS_ADAPTER *ioc = NULL;
 
+	if (_scsih_get_shost_and_ioc(pdev, &shost, &ioc)) {
+		dev_err(&pdev->dev, "unable to enable mmio\n");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
 	pr_info(MPT3SAS_FMT "PCI error: mmio enabled callback!!\n",
 	    ioc->name);