diff mbox

scsi: mpt3sas: fix oops in error handlers after shutdown/unload

Message ID 1517425200-19126-1-git-send-email-mauricfo@linux.vnet.ibm.com (mailing list archive)
State Changes Requested
Headers show

Commit Message

Mauricio Faria de Oliveira Jan. 31, 2018, 7 p.m. UTC
This patch adds checks for 'ioc->remove_host' in the SCSI error
handlers, so not to access pointers/resources potentially freed
in the PCI shutdown/module unload path.  The error handlers may
be invoked after shutdown/unload, depending on other components.

This problem was observed with kexec on a system with a mpt3sas
based adapter and an infiniband adapter which takes long enough
to shutdown.  The mpt3sas driver finished shuttting down, which
disabled interruption handling, thus some running commands have
not finished and timed out.  Since the system was still running,
waiting for the infiniband adapter to shut down, the scsi error
handler for task abort of mpt3sas was invoked, and hit the oops
because 'ioc->scsi_lookup' was NULL.

During patch testing, a different oops happens if host reset is
reached (when it eventually calls 'mpt3sas_base_get_iocstate()').

The device reset and target reset handlers do not cause oopses,
but print a misleading message of host reset in progress, thus
fix those too.

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

Comments

Bart Van Assche Jan. 31, 2018, 7:06 p.m. UTC | #1
On Wed, 2018-01-31 at 17:00 -0200, Mauricio Faria de Oliveira wrote:
> This problem was observed with kexec on a system with a mpt3sas

> based adapter and an infiniband adapter which takes long enough

> to shutdown.  The mpt3sas driver finished shuttting down, which

> disabled interruption handling, thus some running commands have

> not finished and timed out.  Since the system was still running,

> waiting for the infiniband adapter to shut down, the scsi error

> handler for task abort of mpt3sas was invoked, and hit the oops

> because 'ioc->scsi_lookup' was NULL.


Sorry but I think this patch introduces new race conditions. Have you
considered to modify the mpt3sas driver such that interrupts are only
disabled after all commands have finished?

Thanks,

Bart.
Mauricio Faria de Oliveira Jan. 31, 2018, 7:48 p.m. UTC | #2
Bart,

Thanks for reviewing.

On 01/31/2018 05:06 PM, Bart Van Assche wrote:
> Sorry but I think this patch introduces new race conditions. Have you

Can you detail the race conditions? As far as I can see, the only race
condition would be when an error handler is invoked very close in time
to the shutdown/unload handler, and as such miss the 'ioc->remove_host'
assignment (thus it continues running anyway, and hit the oops).

But this problem has never been reported, so I'd think that case would
be rare.  Not that it does not need to be perfectly solved, but _if_
that is the only problem with the patch (and _if_ I got that right),
then it would still be better than the current code in oops.

 > considered to modify the mpt3sas driver such that interrupts are only
 > disabled after all commands have finished?

Yes, I considered introducing waits in several points, but since the
driver is tearing down, the only point which it seem to make sense to
insert is right at the beginning of shutdown/unload -- but it seemed
not sufficient: imagine not all commands finish, which would block that
path, so we need a time out mechanism, but that does not guarantee
the call back won't occur later on (say, a while after we gave up),
so we would still need a guard at the callbacks.

I also considered checking for the particular pointer which is set
to NULL and cause the Oops, but that is a series of 10ish (small)
patches which must check 'ioc->scsi_lookup' in a spinlock, so it
seemed a bit bigger than what was needed.. and would still be prone
to hitting an oops due to another pointer at a later time.

So after thinking about those, I went with the simplest approach.

I'd be happy to consider the race conditions you thought of,
provided more detail.

thanks,
Mauricio
Bart Van Assche Jan. 31, 2018, 10:59 p.m. UTC | #3
On Wed, 2018-01-31 at 17:48 -0200, Mauricio Faria de Oliveira wrote:
> On 01/31/2018 05:06 PM, Bart Van Assche wrote:

> > Sorry but I think this patch introduces new race conditions. Have you

> 

> Can you detail the race conditions? As far as I can see, the only race

> condition would be when an error handler is invoked very close in time

> to the shutdown/unload handler, and as such miss the 'ioc->remove_host'

> assignment (thus it continues running anyway, and hit the oops).


That's indeed the race I was referring to. BTW, are you aware that your patch
does not seem to apply on Martin's tree (the fixes branch of
git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git):

$ patch -p1 < ~/\[PATCH\]_scsi\:_mpt3sas\:_fix_oops_in_error_handlers_after_shutdown_unload.mbox       
patching file drivers/scsi/mpt3sas/mpt3sas_scsih.c
Hunk #1 FAILED at 3007.
Hunk #2 succeeded at 2970 (offset -99 lines).
Hunk #3 succeeded at 2977 (offset -161 lines).
Hunk #4 succeeded at 3033 (offset -160 lines).
1 out of 4 hunks FAILED -- saving rejects to file drivers/scsi/mpt3sas/mpt3sas_scsih.c.rej

Can you rebase this patch on top of Martin's tree?

Thanks,

Bart.
Mauricio Faria de Oliveira Feb. 1, 2018, 7:54 p.m. UTC | #4
On 01/31/2018 08:59 PM, Bart Van Assche wrote:
> On Wed, 2018-01-31 at 17:48 -0200, Mauricio Faria de Oliveira wrote:
>> On 01/31/2018 05:06 PM, Bart Van Assche wrote:
>>> Sorry but I think this patch introduces new race conditions. Have you

>> Can you detail the race conditions? As far as I can see, the only race
>> condition would be when an error handler is invoked very close in time
>> to the shutdown/unload handler, and as such miss the 'ioc->remove_host'
>> assignment (thus it continues running anyway, and hit the oops).

> That's indeed the race I was referring to. [snip]

Okay.  I'm not sure that point invalidates this patch; let me explain.

First, IMHO that problem already exists in mpt3sas as it is now.

There are 17 checks for 'ioc->remove_host' to return early, in the
same manner.  AFAIK, those are subject to this same race condition.

   $ grep -r 'ioc->remove_host[^=]*$' drivers/scsi/mpt3sas/ | wc -l
   17

Second, I don't think this patch makes the driver any worse.

Even with the potential race condition (which already exists elsewhere
in the driver), it reduces the chance to hit an oops from _every time_
to only when the race condition occurs _and_ the error handler looses.

So, even not being completely safe (like other parts of the driver),
it still does help.

> BTW, are you aware that your patch
> does not seem to apply on Martin's tree (the fixes branch of
> git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git):
 > [snip]
> Can you rebase this patch on top of Martin's tree?

Yes, sure. I'll rebase, test, and send v2. Thanks for the pointer.

cheers,
Mauricio
diff mbox

Patch

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index b258f21..3c4e47c 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -3007,6 +3007,13 @@  int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle,
 		goto out;
 	}
 
+	if (ioc->remove_host) {
+		scmd->result = DID_NO_CONNECT << 16;
+		scmd->scsi_done(scmd);
+		r = SUCCESS;
+		goto out;
+	}
+
 	/* search for the command */
 	smid = _scsih_scsi_lookup_find_by_scmd(ioc, scmd);
 	if (!smid) {
@@ -3069,6 +3076,13 @@  int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle,
 		goto out;
 	}
 
+	if (ioc->remove_host) {
+		scmd->result = DID_NO_CONNECT << 16;
+		scmd->scsi_done(scmd);
+		r = SUCCESS;
+		goto out;
+	}
+
 	/* for hidden raid components obtain the volume_handle */
 	handle = 0;
 	if (sas_device_priv_data->sas_target->flags &
@@ -3131,6 +3145,13 @@  int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle,
 		goto out;
 	}
 
+	if (ioc->remove_host) {
+		scmd->result = DID_NO_CONNECT << 16;
+		scmd->scsi_done(scmd);
+		r = SUCCESS;
+		goto out;
+	}
+
 	/* for hidden raid components obtain the volume_handle */
 	handle = 0;
 	if (sas_device_priv_data->sas_target->flags &
@@ -3179,6 +3200,13 @@  int mpt3sas_scsih_issue_locked_tm(struct MPT3SAS_ADAPTER *ioc, u16 handle,
 	    ioc->name, scmd);
 	scsi_print_command(scmd);
 
+	if (ioc->remove_host) {
+		scmd->result = DID_NO_CONNECT << 16;
+		scmd->scsi_done(scmd);
+		r = FAILED;
+		goto out;
+	}
+
 	if (ioc->is_driver_loading) {
 		pr_info(MPT3SAS_FMT "Blocking the host reset\n",
 		    ioc->name);