Bugs in multipath scsi in 4.3-rc2

Message ID	20150930151449.GC26299@lst.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-scsi-owner@kernel.org> Date: Wed, 30 Sep 2015 17:14:49 +0200 From: Christoph Hellwig <hch@lst.de> To: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Paul Mackerras <paulus@ozlabs.org>, linux-scsi@vger.kernel.org, Tejun Heo <tj@kernel.org> Subject: Re: Bugs in multipath scsi in 4.3-rc2 Message-ID: <20150930151449.GC26299@lst.de> References: <20150925121636.GC12540@fergus.ozlabs.ibm.com> <20150925151802.GB20282@lst.de> <1443202278.2188.13.camel@HansenPartnership.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1443202278.2188.13.camel@HansenPartnership.com> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk

Christoph Hellwig Sept. 30, 2015, 3:14 p.m. UTC

On Fri, Sep 25, 2015 at 10:31:18AM -0700, James Bottomley wrote:
> So the warning seems to be because scsi_dh_find_driver() is not quite
> consistent.  For everything except alua, it scans the dh driver list to
> see what might attach to the device.  It returns "alua" if the TPGS
> field is anything other than zero, regardless of whether the alua driver
> is loaded.  We could fix the problem by returning NULL if the alua
> driver isn't present ... would that have any other adverse consequences?

It's not inconsistent - to autoload a driver we cant't require it to
be loaded.  For alua we check the TPGS bit per the standard, and for
the non-standard drivers we use a whitelist in scsi_dh.c now.

The problem is that async probing deadlocks vs a synchronous
request_module, as Tejun figured out based on the thrad in
http://thread.gmane.org/gmane.linux.kernel/1420814

Tejun, if I understand the thread and your patch right there really
isn't any good altenative to a synchronous request_module, and you
thus disabled autoloading elevator modules, right?

For SCSI device handlers we could do this by switching from scsi_dh_lookup
to __scsi_dh_lookup in scsi_dh_add_device, but autoloading the device
handlers is a really useful feature that I'd love to keep if possible.

Given that before 4.3-rc we could not autoload them the switch to
__scsi_dh_lookup migt be the required band aid for now until we can
come up with something better.

Paul, can you try the trivial one liner below?

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Tejun Heo Sept. 30, 2015, 9:53 p.m. UTC | #1

Hello, Christoph.

On Wed, Sep 30, 2015 at 05:14:49PM +0200, Christoph Hellwig wrote:
> The problem is that async probing deadlocks vs a synchronous
> request_module, as Tejun figured out based on the thrad in
> http://thread.gmane.org/gmane.linux.kernel/1420814
> 
> Tejun, if I understand the thread and your patch right there really
> isn't any good altenative to a synchronous request_module, and you
> thus disabled autoloading elevator modules, right?

Yeap, that's right.  IIRC, the conclusion was "let's not do that".

Thanks.

James Bottomley Sept. 30, 2015, 10:34 p.m. UTC | #2

On Wed, 2015-09-30 at 17:53 -0400, Tejun Heo wrote:
> Hello, Christoph.
> 
> On Wed, Sep 30, 2015 at 05:14:49PM +0200, Christoph Hellwig wrote:
> > The problem is that async probing deadlocks vs a synchronous
> > request_module, as Tejun figured out based on the thrad in
> > http://thread.gmane.org/gmane.linux.kernel/1420814
> > 
> > Tejun, if I understand the thread and your patch right there really
> > isn't any good altenative to a synchronous request_module, and you
> > thus disabled autoloading elevator modules, right?
> 
> Yeap, that's right.  IIRC, the conclusion was "let's not do that".

Perhaps we don't have to be that draconian.  There's no real reason we
can't autoload asynchronously.  If the module isn't ready by the time we
come to check the attachment, then it will attach to the device later
when the module init routine runs.

Should we do anything to limit the module_request floods?  This will
likely happen for every LUN of an ALUA system ... there can be hundreds
of those.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Paul Mackerras Oct. 1, 2015, 4:34 a.m. UTC | #3

On Wed, Sep 30, 2015 at 05:14:49PM +0200, Christoph Hellwig wrote:
> 
> Paul, can you try the trivial one liner below?
> 
> diff --git a/drivers/scsi/scsi_dh.c b/drivers/scsi/scsi_dh.c
> index edb044a..fbc9502 100644
> --- a/drivers/scsi/scsi_dh.c
> +++ b/drivers/scsi/scsi_dh.c
> @@ -391,7 +391,7 @@ int scsi_dh_attach(struct request_queue *q, const char *name)
>  	if (!sdev)
>  		return -ENODEV;
>  
> -	scsi_dh = scsi_dh_lookup(name);
> +	scsi_dh = __scsi_dh_lookup(name);
>  	if (!scsi_dh) {
>  		err = -EINVAL;
>  		goto out_put_device;

I still get the warning:

WARNING: at /home/paulus/kernel/kvm/kernel/kmod.c:140
Modules linked in: radeon(+) i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
 sysimgblt fb_sys_fops ttm drm tg3(+) ptp ipr cxgb3 pps_core i2c_core mdio dm_multipath
CPU: 11 PID: 7 Comm: kworker/u321:0 Not tainted 4.3.0-rc2-kvm+ #73
Workqueue: events_unbound async_run_entry_fn
task: c000000ff0545080 ti: c000000ff0610000 task.ti: c000000ff0610000

...

NIP [c0000000000d390c] __request_module+0x21c/0x380
LR [c0000000000d38f4] __request_module+0x204/0x380
Call Trace:
[c000000ff0613920] [c0000000000d38f4] __request_module+0x204/0x380 (unreliable)
[c000000ff06139e0] [c0000000006369d4] scsi_dh_lookup+0x64/0x80
[c000000ff0613a50] [c000000000636fcc] scsi_dh_add_device+0x13c/0x170
[c000000ff0613aa0] [c000000000630ea4] scsi_sysfs_add_sdev+0x114/0x380
[c000000ff0613b30] [c00000000062e040] do_scan_async+0xf0/0x240
[c000000ff0613bc0] [c0000000000e6bc0] async_run_entry_fn+0xa0/0x200
[c000000ff0613c50] [c0000000000d9750] process_one_work+0x1a0/0x4b0
[c000000ff0613ce0] [c0000000000d9bf0] worker_thread+0x190/0x5f0
[c000000ff0613d80] [c0000000000e21b0] kthread+0x110/0x130
[c000000ff0613e30] [c0000000000095b0] ret_from_kernel_thread+0x5c/0xac

So it's the call to scsi_dh_lookup() in scsi_dh_add_device() that is
causing the warning this time.  Do you want me to add the "__" to that
one too?

Paul.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Oct. 2, 2015, 12:56 p.m. UTC | #4

On Wed, Sep 30, 2015 at 03:34:54PM -0700, James Bottomley wrote:
> Perhaps we don't have to be that draconian.  There's no real reason we
> can't autoload asynchronously.  If the module isn't ready by the time we
> come to check the attachment, then it will attach to the device later
> when the module init routine runs.

I don't think this works.  We're attaching just after the request_module
call, so modprobe doesn't really have a chance to finish before we
return.

> Should we do anything to limit the module_request floods?  This will
> likely happen for every LUN of an ALUA system ... there can be hundreds
> of those.

Once the alua module is loaded there won't be any request_module calls.
If you build with device handler support but without ALUA support
you'll get a lot of calls, but that's not any different than probing
for other optional features.

Personally I think we probably should simple build ALUA support into
scsi_mod if SCSI_DH is selected, as that's part of the standard and
supported by any modern multi pathing device.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

James Bottomley Oct. 2, 2015, 1:25 p.m. UTC | #5

On Fri, 2015-10-02 at 14:56 +0200, Christoph Hellwig wrote:
> On Wed, Sep 30, 2015 at 03:34:54PM -0700, James Bottomley wrote:
> > Perhaps we don't have to be that draconian.  There's no real reason we
> > can't autoload asynchronously.  If the module isn't ready by the time we
> > come to check the attachment, then it will attach to the device later
> > when the module init routine runs.
> 
> I don't think this works.  We're attaching just after the request_module
> call, so modprobe doesn't really have a chance to finish before we
> return.

Yes, I suspect it will mostly happen in the alua attach, except for
large LUN sets.

> > Should we do anything to limit the module_request floods?  This will
> > likely happen for every LUN of an ALUA system ... there can be hundreds
> > of those.
> 
> Once the alua module is loaded there won't be any request_module calls.
> If you build with device handler support but without ALUA support
> you'll get a lot of calls, but that's not any different than probing
> for other optional features.

That doesn't matter: if you modprobe alua after all devices are
discovered, it will attach correctly to all potential devices from the
alua module_init.  This means the effect is the same whether the
request_module is sync or async ... the object is to get the device
attached to alua if it is an alua device.

The only drawback of an async request_module is that we get a flood of
request_modules()s from every LUN on an array until alua is loaded.

> Personally I think we probably should simple build ALUA support into
> scsi_mod if SCSI_DH is selected, as that's part of the standard and
> supported by any modern multi pathing device.

I suppose; the code size is roughly the same as scsi_dh.o which we just
made non modular.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Oct. 2, 2015, 1:34 p.m. UTC | #6

On Fri, Oct 02, 2015 at 06:25:01AM -0700, James Bottomley wrote:
> That doesn't matter: if you modprobe alua after all devices are
> discovered, it will attach correctly to all potential devices from the
> alua module_init.  This means the effect is the same whether the
> request_module is sync or async ... the object is to get the device
> attached to alua if it is an alua device.

No, in 4.3-rc it won't.  We removed that feature.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

James Bottomley Oct. 2, 2015, 1:44 p.m. UTC | #7

On Fri, 2015-10-02 at 15:34 +0200, Christoph Hellwig wrote:
> On Fri, Oct 02, 2015 at 06:25:01AM -0700, James Bottomley wrote:
> > That doesn't matter: if you modprobe alua after all devices are
> > discovered, it will attach correctly to all potential devices from the
> > alua module_init.  This means the effect is the same whether the
> > request_module is sync or async ... the object is to get the device
> > attached to alua if it is an alua device.
> 
> No, in 4.3-rc it won't.  We removed that feature.

I think I prefer restoring that to having to build in every dh module to
get them to work.  If we take your proposed fix for the sync module load
in the current scheme, any non-built in modules would never attach, so
we'd be moving towards the conclusion that *every* device handler has to
be non-modular.

Skimming the code it looks like dh should be using the driver binding
model rather than reinventing it.  That would decouple it better and
make sure binding happened regardless of when the module was loaded.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Oct. 4, 2015, 7:45 a.m. UTC | #8

On Fri, Oct 02, 2015 at 06:44:57AM -0700, James Bottomley wrote:
> I think I prefer restoring that to having to build in every dh module to
> get them to work.  If we take your proposed fix for the sync module load
> in the current scheme, any non-built in modules would never attach, so
> we'd be moving towards the conclusion that *every* device handler has to
> be non-modular.

You don't need to build every module in to make it work.  In 4.2 and earlier
we already only auto load modules when dm-multipath explicitly attaches
to them.  That will still work in 4.3+.  In fact we will now autoload
when activating through sysfs as well.  With the change I sent to Paul
we still won't autoload at scan time, which would be really useful to have,
but wasn't implemented previously.

> Skimming the code it looks like dh should be using the driver binding
> model rather than reinventing it.  That would decouple it better and
> make sure binding happened regardless of when the module was loaded.

I tried this early on but gave up because I ran into too many problems.
I can try to give it a spin again.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hannes Reinecke Oct. 12, 2015, 12:45 p.m. UTC | #9

On 10/04/2015 09:45 AM, Christoph Hellwig wrote:
> On Fri, Oct 02, 2015 at 06:44:57AM -0700, James Bottomley wrote:
>> I think I prefer restoring that to having to build in every dh module to
>> get them to work.  If we take your proposed fix for the sync module load
>> in the current scheme, any non-built in modules would never attach, so
>> we'd be moving towards the conclusion that *every* device handler has to
>> be non-modular.
> 
> You don't need to build every module in to make it work.  In 4.2 and earlier
> we already only auto load modules when dm-multipath explicitly attaches
> to them.  That will still work in 4.3+.  In fact we will now autoload
> when activating through sysfs as well.  With the change I sent to Paul
> we still won't autoload at scan time, which would be really useful to have,
> but wasn't implemented previously.
> 
>> Skimming the code it looks like dh should be using the driver binding
>> model rather than reinventing it.  That would decouple it better and
>> make sure binding happened regardless of when the module was loaded.
> 
> I tried this early on but gave up because I ran into too many problems.
> I can try to give it a spin again.

You cannot easily use the driver model here as the scsi_device is
already (potentially) bound to the ULDs.
If you were to go with the driver model you'd have to introduce
another sub device between scsi_target and scsi_device.

Actually I have been thinking that, as it might make my life for the
ALUA handler easier. However, this would be quite a largish redesign
of the current handler infrastructure, pushing out my ALUA handler
update even more.
So I'd like to have the ALUA changes ironed out first and merged,
and then work on a redesigned device handler infrastructure.

Cheers,

Hannes

Christoph Hellwig Oct. 12, 2015, 2:39 p.m. UTC | #10

On Mon, Oct 12, 2015 at 02:45:38PM +0200, Hannes Reinecke wrote:
> You cannot easily use the driver model here as the scsi_device is
> already (potentially) bound to the ULDs.
> If you were to go with the driver model you'd have to introduce
> another sub device between scsi_target and scsi_device.

You can have two struct devices in struct scsi_device, while it's
not pretty there are plenty of example all over the kernel with
multiple devices in a single containing structure. 

> Actually I have been thinking that, as it might make my life for the
> ALUA handler easier. However, this would be quite a largish redesign
> of the current handler infrastructure, pushing out my ALUA handler
> update even more.
> So I'd like to have the ALUA changes ironed out first and merged,
> and then work on a redesigned device handler infrastructure.

Fine with me.  As mentioned before we've never supported autoloading
the device handler modules at boot time - we only ever loaded them when
explicitly attaching them through device mapper.  Waiting another
release or maybe two to finally get there isn't the end of the world.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

James Bottomley Oct. 12, 2015, 2:51 p.m. UTC | #11

On Mon, 2015-10-12 at 14:45 +0200, Hannes Reinecke wrote:
> On 10/04/2015 09:45 AM, Christoph Hellwig wrote:
> > On Fri, Oct 02, 2015 at 06:44:57AM -0700, James Bottomley wrote:
> >> I think I prefer restoring that to having to build in every dh module to
> >> get them to work.  If we take your proposed fix for the sync module load
> >> in the current scheme, any non-built in modules would never attach, so
> >> we'd be moving towards the conclusion that *every* device handler has to
> >> be non-modular.
> > 
> > You don't need to build every module in to make it work.  In 4.2 and earlier
> > we already only auto load modules when dm-multipath explicitly attaches
> > to them.  That will still work in 4.3+.  In fact we will now autoload
> > when activating through sysfs as well.  With the change I sent to Paul
> > we still won't autoload at scan time, which would be really useful to have,
> > but wasn't implemented previously.
> > 
> >> Skimming the code it looks like dh should be using the driver binding
> >> model rather than reinventing it.  That would decouple it better and
> >> make sure binding happened regardless of when the module was loaded.
> > 
> > I tried this early on but gave up because I ran into too many problems.
> > I can try to give it a spin again.
> 
> You cannot easily use the driver model here as the scsi_device is
> already (potentially) bound to the ULDs.
> If you were to go with the driver model you'd have to introduce
> another sub device between scsi_target and scsi_device.

I was thinking more like what we do today for the ULD's: 3 of them use
the driver binding and one uses the class interface model.  Why can't we
also use the class interface model for dh?

James


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mike Snitzer Oct. 12, 2015, 7:29 p.m. UTC | #12

On Mon, Oct 12, 2015 at 10:39 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Oct 12, 2015 at 02:45:38PM +0200, Hannes Reinecke wrote:
>> You cannot easily use the driver model here as the scsi_device is
>> already (potentially) bound to the ULDs.
>> If you were to go with the driver model you'd have to introduce
>> another sub device between scsi_target and scsi_device.
>
> You can have two struct devices in struct scsi_device, while it's
> not pretty there are plenty of example all over the kernel with
> multiple devices in a single containing structure.
>
>> Actually I have been thinking that, as it might make my life for the
>> ALUA handler easier. However, this would be quite a largish redesign
>> of the current handler infrastructure, pushing out my ALUA handler
>> update even more.
>> So I'd like to have the ALUA changes ironed out first and merged,
>> and then work on a redesigned device handler infrastructure.
>
> Fine with me.  As mentioned before we've never supported autoloading
> the device handler modules at boot time - we only ever loaded them when
> explicitly attaching them through device mapper.  Waiting another
> release or maybe two to finally get there isn't the end of the world.

What may be getting lost during this discussion is that historically
it has been important to be able to attach the scsi_dh during the SCSI
scan.  As the scsi_dh alters the SCSI midlayer's sense code processing
(via callout to the attached scsi_dh).  And this altered SCSI sense
code handling amounts to the difference between a successful/quick
boot versus hugely delayed and ultimately error-prone boot on systems
with many LUNs that have multiple paths.

So that is why either of these solutions were deployed:
1) in RHEL6 we'd require dracut to preload the scsi_dh modules early
in loading the initramfs
2) in RHEL7 all scsi_dh modules _are_ builtin

Both achieve the goal of having all required scsi_dh available during SCSI scan.

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig Oct. 12, 2015, 7:36 p.m. UTC | #13

On Mon, Oct 12, 2015 at 03:29:45PM -0400, Mike Snitzer wrote:
> What may be getting lost during this discussion is that historically
> it has been important to be able to attach the scsi_dh during the SCSI
> scan.  As the scsi_dh alters the SCSI midlayer's sense code processing
> (via callout to the attached scsi_dh).  And this altered SCSI sense
> code handling amounts to the difference between a successful/quick
> boot versus hugely delayed and ultimately error-prone boot on systems
> with many LUNs that have multiple paths.
> 
> So that is why either of these solutions were deployed:
> 1) in RHEL6 we'd require dracut to preload the scsi_dh modules early
> in loading the initramfs
> 2) in RHEL7 all scsi_dh modules _are_ builtin
> 
> Both achieve the goal of having all required scsi_dh available during SCSI scan.

Yes, and both are workarounds.  I tried to implement this properly,
but due to async probing it doesn't actually work.  Given the statement
from Tejun I don't really see how to ever get it to work as long as
we use async probing and the module loading code isn't safe to call
from the async probe path unfortunately.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hannes Reinecke Oct. 13, 2015, 6 a.m. UTC | #14

On 10/12/2015 09:36 PM, Christoph Hellwig wrote:
> On Mon, Oct 12, 2015 at 03:29:45PM -0400, Mike Snitzer wrote:
>> What may be getting lost during this discussion is that historically
>> it has been important to be able to attach the scsi_dh during the SCSI
>> scan.  As the scsi_dh alters the SCSI midlayer's sense code processing
>> (via callout to the attached scsi_dh).  And this altered SCSI sense
>> code handling amounts to the difference between a successful/quick
>> boot versus hugely delayed and ultimately error-prone boot on systems
>> with many LUNs that have multiple paths.
>>
>> So that is why either of these solutions were deployed:
>> 1) in RHEL6 we'd require dracut to preload the scsi_dh modules early
>> in loading the initramfs
>> 2) in RHEL7 all scsi_dh modules _are_ builtin
>>
>> Both achieve the goal of having all required scsi_dh available during SCSI scan.
> 
> Yes, and both are workarounds.  I tried to implement this properly,
> but due to async probing it doesn't actually work.  Given the statement
> from Tejun I don't really see how to ever get it to work as long as
> we use async probing and the module loading code isn't safe to call
> from the async probe path unfortunately.
> 
We've originally designed them to be modular as some handler
(notably emc, netapp, and alua) are mutually exclusive, ie you could
load either the alua or one of the vendor specific ones.

In the light of the discussion I think it would be better to
upgrade the device handler to become their own transport class,
to be hooked in between scsi_target and scsi_device.
That way we would have a way of exposing the topology (target port
groups etc) and would get rid of the probing problem.

It would mean that effectively the device handler would be
demodularized and become a compile-time option, but even that
wouldn't be too much of an issue, seeing that most vendors load
the modules unconditionally anyway.
(BTW, SUSE also loads the modules unconditionally ...)

Cheers,

Hannes

Christoph Hellwig Oct. 13, 2015, 11:52 a.m. UTC | #15

On Tue, Oct 13, 2015 at 08:00:43AM +0200, Hannes Reinecke wrote:
> In the light of the discussion I think it would be better to
> upgrade the device handler to become their own transport class,
> to be hooked in between scsi_target and scsi_device.
> That way we would have a way of exposing the topology (target port
> groups etc) and would get rid of the probing problem.

I don't think a transport class makes sense here.  A struct device
or a class_interface as suggested by James sound much better.

But even then unless we solve the request_module from async context
problem we can't reliably load them as soon as we encounter the first
ALUA/RDAC/EMC/HP device, so distribution would probably still need to
pre-load them.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bugs in multipath scsi in 4.3-rc2

Commit Message

Comments

Patch