diff mbox

scsi_error: should not get sense for timeout IO in scsi error handler

Message ID OFAF74C95C.6058723F-ON48257E93.00322982-48257E93.00363C6C@zte.com.cn (mailing list archive)
State New, archived
Headers show

Commit Message

Jiang Biao July 31, 2015, 9:52 a.m. UTC
scsi_error: should not get sense for timeout IO in scsi error handler

When an IO timeout occurs, the IO will be aborted in
scsi_abort_command() and SCSI_EH_ABORT_SCHEDULED will be set. Because
of that, the SCSI_EH_CANCEL_CMD will be clear in scsi_eh_scmd_add().
So when scsi error handler starts, it will get sense for this
timeout IO and the scmd of the IO request will be reused. In that
case, the scmd may be double released when racing with io_done(),
which will result in crash.
SO SCSI_EH_ABORT_SCHEDULED should also be checked when getting sense.
The bug maybe reproduced when the link between host and disk is
unstable.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Long Chun <long.chun@zte.com.cn>
Reviewed-by: Tan Hu <tan.hu@zte.com.cn>
Reviewed-by: Chen Donghai <chen.donghai@zte.com.cn>
Reviewed-by: Cai Qu <cai.qu@zte.com.cn>

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Hannes Reinecke July 31, 2015, 1:17 p.m. UTC | #1
On 07/31/2015 11:52 AM, jiang.biao2@zte.com.cn wrote:
> scsi_error: should not get sense for timeout IO in scsi error handler
> 
> When an IO timeout occurs, the IO will be aborted in
> scsi_abort_command() and SCSI_EH_ABORT_SCHEDULED will be set. Because
> of that, the SCSI_EH_CANCEL_CMD will be clear in scsi_eh_scmd_add().
> So when scsi error handler starts, it will get sense for this
> timeout IO and the scmd of the IO request will be reused. In that
> case, the scmd may be double released when racing with io_done(),
> which will result in crash.
> SO SCSI_EH_ABORT_SCHEDULED should also be checked when getting sense.
> The bug maybe reproduced when the link between host and disk is
> unstable.
> 
> Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
> Signed-off-by: Long Chun <long.chun@zte.com.cn>
> Reviewed-by: Tan Hu <tan.hu@zte.com.cn>
> Reviewed-by: Chen Donghai <chen.donghai@zte.com.cn>
> Reviewed-by: Cai Qu <cai.qu@zte.com.cn>
> 
> diff -uprN drivers/scsi/scsi_error.c drivers_new/scsi/scsi_error.c
> --- scsi/scsi_error.c   2015-07-31 16:03:18.000000000 +0800
> +++ scsi_new/scsi_error.c       2015-07-31 16:29:25.000000000 +0800
> @@ -1156,9 +1156,14 @@ int scsi_eh_get_sense(struct list_head *
>         struct Scsi_Host *shost;
>         int rtn;
> 
> +       /*
> +        * If SCSI_EH_ABORT_SCHEDULED has been set, it is timeout IO,
> +        * should not get sense.
> +        */
>         list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
>                 if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
> -                   SCSI_SENSE_VALID(scmd))
> +                   (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED) ||
> +                    SCSI_SENSE_VALID(scmd))
>                         continue;
> 
>                 shost = scmd->device->host;
> --
_Actually_ you need to test for both, SCSI_EH_CANCEL_CMD _and_
SCSI_EH_ABORT_SCHEDULED.
Not every driver is required to implement and/or support
asynchronous command aborts, and those will be setting
SCSI_EH_CANCEL_CMD even though they've run into a timeout.

Cheers,

Hannes
Jiang Biao Aug. 1, 2015, 4:39 a.m. UTC | #2
bGludXgtc2NzaS1vd25lckB2Z2VyLmtlcm5lbC5vcmcgd3JvdGUgb24gMjAxNS8wNy8zMSAyMTox
NzozMzoNCg0KPiBIYW5uZXMgUmVpbmVja2UgPGhhcmVAc3VzZS5kZT4gDQo+IOWPkeS7tuS6ujog
IGxpbnV4LXNjc2ktb3duZXJAdmdlci5rZXJuZWwub3JnDQo+IA0KPiAyMDE1LzA3LzMxIDIxOjE3
DQo+IA0KPiDmlLbku7bkuroNCj4gDQo+IGppYW5nLmJpYW8yQHp0ZS5jb20uY24sIGxpbnV4LXNj
c2lAdmdlci5rZXJuZWwub3JnLCBKQm90dG9tbGV5QG9kaW4uY29tLCANCg0KPiANCj4g5oqE6YCB
DQo+IA0KPiDkuLvpopgNCj4gDQo+IFJlOiBbUGF0Y2hdIHNjc2lfZXJyb3I6IHNob3VsZCBub3Qg
Z2V0IHNlbnNlIGZvciB0aW1lb3V0IElPIGluIHNjc2kgDQo+IGVycm9yIGhhbmRsZXINCj4gDQo+
IE9uIDA3LzMxLzIwMTUgMTE6NTIgQU0sIGppYW5nLmJpYW8yQHp0ZS5jb20uY24gd3JvdGU6DQo+
ID4gc2NzaV9lcnJvcjogc2hvdWxkIG5vdCBnZXQgc2Vuc2UgZm9yIHRpbWVvdXQgSU8gaW4gc2Nz
aSBlcnJvciBoYW5kbGVyDQo+ID4gDQo+ID4gV2hlbiBhbiBJTyB0aW1lb3V0IG9jY3VycywgdGhl
IElPIHdpbGwgYmUgYWJvcnRlZCBpbg0KPiA+IHNjc2lfYWJvcnRfY29tbWFuZCgpIGFuZCBTQ1NJ
X0VIX0FCT1JUX1NDSEVEVUxFRCB3aWxsIGJlIHNldC4gQmVjYXVzZQ0KPiA+IG9mIHRoYXQsIHRo
ZSBTQ1NJX0VIX0NBTkNFTF9DTUQgd2lsbCBiZSBjbGVhciBpbiBzY3NpX2VoX3NjbWRfYWRkKCku
DQo+ID4gU28gd2hlbiBzY3NpIGVycm9yIGhhbmRsZXIgc3RhcnRzLCBpdCB3aWxsIGdldCBzZW5z
ZSBmb3IgdGhpcw0KPiA+IHRpbWVvdXQgSU8gYW5kIHRoZSBzY21kIG9mIHRoZSBJTyByZXF1ZXN0
IHdpbGwgYmUgcmV1c2VkLiBJbiB0aGF0DQo+ID4gY2FzZSwgdGhlIHNjbWQgbWF5IGJlIGRvdWJs
ZSByZWxlYXNlZCB3aGVuIHJhY2luZyB3aXRoIGlvX2RvbmUoKSwNCj4gPiB3aGljaCB3aWxsIHJl
c3VsdCBpbiBjcmFzaC4NCj4gPiBTTyBTQ1NJX0VIX0FCT1JUX1NDSEVEVUxFRCBzaG91bGQgYWxz
byBiZSBjaGVja2VkIHdoZW4gZ2V0dGluZyBzZW5zZS4NCj4gPiBUaGUgYnVnIG1heWJlIHJlcHJv
ZHVjZWQgd2hlbiB0aGUgbGluayBiZXR3ZWVuIGhvc3QgYW5kIGRpc2sgaXMNCj4gPiB1bnN0YWJs
ZS4NCj4gPiANCj4gPiBTaWduZWQtb2ZmLWJ5OiBKaWFuZyBCaWFvIDxqaWFuZy5iaWFvMkB6dGUu
Y29tLmNuPg0KPiA+IFNpZ25lZC1vZmYtYnk6IExvbmcgQ2h1biA8bG9uZy5jaHVuQHp0ZS5jb20u
Y24+DQo+ID4gUmV2aWV3ZWQtYnk6IFRhbiBIdSA8dGFuLmh1QHp0ZS5jb20uY24+DQo+ID4gUmV2
aWV3ZWQtYnk6IENoZW4gRG9uZ2hhaSA8Y2hlbi5kb25naGFpQHp0ZS5jb20uY24+DQo+ID4gUmV2
aWV3ZWQtYnk6IENhaSBRdSA8Y2FpLnF1QHp0ZS5jb20uY24+DQo+ID4gDQo+ID4gZGlmZiAtdXBy
TiBkcml2ZXJzL3Njc2kvc2NzaV9lcnJvci5jIGRyaXZlcnNfbmV3L3Njc2kvc2NzaV9lcnJvci5j
DQo+ID4gLS0tIHNjc2kvc2NzaV9lcnJvci5jICAgMjAxNS0wNy0zMSAxNjowMzoxOC4wMDAwMDAw
MDAgKzA4MDANCj4gPiArKysgc2NzaV9uZXcvc2NzaV9lcnJvci5jICAgICAgIDIwMTUtMDctMzEg
MTY6Mjk6MjUuMDAwMDAwMDAwICswODAwDQo+ID4gQEAgLTExNTYsOSArMTE1NiwxNCBAQCBpbnQg
c2NzaV9laF9nZXRfc2Vuc2Uoc3RydWN0IGxpc3RfaGVhZCAqDQo+ID4gICAgICAgICBzdHJ1Y3Qg
U2NzaV9Ib3N0ICpzaG9zdDsNCj4gPiAgICAgICAgIGludCBydG47DQo+ID4gDQo+ID4gKyAgICAg
ICAvKg0KPiA+ICsgICAgICAgICogSWYgU0NTSV9FSF9BQk9SVF9TQ0hFRFVMRUQgaGFzIGJlZW4g
c2V0LCBpdCBpcyB0aW1lb3V0IElPLA0KPiA+ICsgICAgICAgICogc2hvdWxkIG5vdCBnZXQgc2Vu
c2UuDQo+ID4gKyAgICAgICAgKi8NCj4gPiAgICAgICAgIGxpc3RfZm9yX2VhY2hfZW50cnlfc2Fm
ZShzY21kLCBuZXh0LCB3b3JrX3EsIGVoX2VudHJ5KSB7DQo+ID4gICAgICAgICAgICAgICAgIGlm
ICgoc2NtZC0+ZWhfZWZsYWdzICYgU0NTSV9FSF9DQU5DRUxfQ01EKSB8fA0KPiA+IC0gICAgICAg
ICAgICAgICAgICAgU0NTSV9TRU5TRV9WQUxJRChzY21kKSkNCj4gPiArICAgICAgICAgICAgICAg
ICAgIChzY21kLT5laF9lZmxhZ3MgJiBTQ1NJX0VIX0FCT1JUX1NDSEVEVUxFRCkgfHwNCj4gPiAr
ICAgICAgICAgICAgICAgICAgICBTQ1NJX1NFTlNFX1ZBTElEKHNjbWQpKQ0KPiA+ICAgICAgICAg
ICAgICAgICAgICAgICAgIGNvbnRpbnVlOw0KPiA+IA0KPiA+ICAgICAgICAgICAgICAgICBzaG9z
dCA9IHNjbWQtPmRldmljZS0+aG9zdDsNCj4gPiAtLQ0KPiBfQWN0dWFsbHlfIHlvdSBuZWVkIHRv
IHRlc3QgZm9yIGJvdGgsIFNDU0lfRUhfQ0FOQ0VMX0NNRCBfYW5kXw0KPiBTQ1NJX0VIX0FCT1JU
X1NDSEVEVUxFRC4NCj4gTm90IGV2ZXJ5IGRyaXZlciBpcyByZXF1aXJlZCB0byBpbXBsZW1lbnQg
YW5kL29yIHN1cHBvcnQNCj4gYXN5bmNocm9ub3VzIGNvbW1hbmQgYWJvcnRzLCBhbmQgdGhvc2Ug
d2lsbCBiZSBzZXR0aW5nDQo+IFNDU0lfRUhfQ0FOQ0VMX0NNRCBldmVuIHRob3VnaCB0aGV5J3Zl
IHJ1biBpbnRvIGEgdGltZW91dC4NCj4gDQpUaGF0J3MgcmlnaHQsIGJ1dCBTQ1NJX0VIX0NBTkNF
TF9DTUQgX2hhc18gYWxyZWFkeSBiZWVuIHRlc3RlZCANCmluIHRoZSBjdXJyZW50IGNvZGUsIHNv
IHRoZXJlJ3Mgbm8gbmVlZCB0byBhZGQgaW4gdGhlIHBhdGNoLiANCkFmdGVyIHBhdGNoZWQsIGJv
dGggU0NTSV9FSF9DQU5DRUxfQ01EIF9hbmRfIA0KU0NTSV9FSF9BQk9SVF9TQ0hFRFVMRUQgYXJl
IHRlc3RlZCBoZXJlLCB0aGF0J2xsIGVuc3VyZSBubyANCmdldHRpbmcgc2Vuc2UgZm9yICp0aW1l
b3V0IGlvKi4NCg0KVGhhbmtzLg0K
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Reinecke Aug. 1, 2015, 7:37 a.m. UTC | #3
On 08/01/2015 04:26 AM, jiang.biao2@zte.com.cn wrote:
> Hannes Reinecke <hare@suse.de> ?? 2015/07/31 21:17:33:
> 
>> Hannes Reinecke <hare@suse.de> 
>> 2015/07/31 21:17
>>
>> ???
>>
>> jiang.biao2@zte.com.cn, linux-scsi@vger.kernel.org, JBottomley@odin.com, 
> 
>>
>> ??
>>
>> ??
>>
>> Re: [Patch] scsi_error: should not get sense for timeout IO in scsi 
>> error handler
>>
>> On 07/31/2015 11:52 AM, jiang.biao2@zte.com.cn wrote:
>>> scsi_error: should not get sense for timeout IO in scsi error handler
>>>
>>> When an IO timeout occurs, the IO will be aborted in
>>> scsi_abort_command() and SCSI_EH_ABORT_SCHEDULED will be set. Because
>>> of that, the SCSI_EH_CANCEL_CMD will be clear in scsi_eh_scmd_add().
>>> So when scsi error handler starts, it will get sense for this
>>> timeout IO and the scmd of the IO request will be reused. In that
>>> case, the scmd may be double released when racing with io_done(),
>>> which will result in crash.
>>> SO SCSI_EH_ABORT_SCHEDULED should also be checked when getting sense.
>>> The bug maybe reproduced when the link between host and disk is
>>> unstable.
>>>
>>> Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
>>> Signed-off-by: Long Chun <long.chun@zte.com.cn>
>>> Reviewed-by: Tan Hu <tan.hu@zte.com.cn>
>>> Reviewed-by: Chen Donghai <chen.donghai@zte.com.cn>
>>> Reviewed-by: Cai Qu <cai.qu@zte.com.cn>
>>>
>>> diff -uprN drivers/scsi/scsi_error.c drivers_new/scsi/scsi_error.c
>>> --- scsi/scsi_error.c   2015-07-31 16:03:18.000000000 +0800
>>> +++ scsi_new/scsi_error.c       2015-07-31 16:29:25.000000000 +0800
>>> @@ -1156,9 +1156,14 @@ int scsi_eh_get_sense(struct list_head *
>>>         struct Scsi_Host *shost;
>>>         int rtn;
>>>
>>> +       /*
>>> +        * If SCSI_EH_ABORT_SCHEDULED has been set, it is timeout IO,
>>> +        * should not get sense.
>>> +        */
>>>         list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
>>>                 if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
>>> -                   SCSI_SENSE_VALID(scmd))
>>> +                   (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED) ||
>>> +                    SCSI_SENSE_VALID(scmd))
>>>                         continue;
>>>
>>>                 shost = scmd->device->host;
>>> --
>> _Actually_ you need to test for both, SCSI_EH_CANCEL_CMD _and_
>> SCSI_EH_ABORT_SCHEDULED.
>> Not every driver is required to implement and/or support
>> asynchronous command aborts, and those will be setting
>> SCSI_EH_CANCEL_CMD even though they've run into a timeout.
> That's right, but SCSI_EH_CANCEL_CMD _has_ already been tested 
> in the current code, so there's no need to add in the patch. 
> After patched, both SCSI_EH_CANCEL_CMD _and_ 
> SCSI_EH_ABORT_SCHEDULED are tested here, that'll ensure no 
> getting sense for timeout io.
> 
Ah, right. I've misread the patch.

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
James Bottomley Aug. 27, 2015, 12:31 a.m. UTC | #4
On Fri, 2015-07-31 at 17:52 +0800, jiang.biao2@zte.com.cn wrote:
> scsi_error: should not get sense for timeout IO in scsi error handler
> 
> When an IO timeout occurs, the IO will be aborted in
> scsi_abort_command() and SCSI_EH_ABORT_SCHEDULED will be set. Because
> of that, the SCSI_EH_CANCEL_CMD will be clear in scsi_eh_scmd_add().
> So when scsi error handler starts, it will get sense for this
> timeout IO and the scmd of the IO request will be reused. In that
> case, the scmd may be double released when racing with io_done(),
> which will result in crash.
> SO SCSI_EH_ABORT_SCHEDULED should also be checked when getting sense.
> The bug maybe reproduced when the link between host and disk is
> unstable.
> 
> Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
> Signed-off-by: Long Chun <long.chun@zte.com.cn>
> Reviewed-by: Tan Hu <tan.hu@zte.com.cn>
> Reviewed-by: Chen Donghai <chen.donghai@zte.com.cn>
> Reviewed-by: Cai Qu <cai.qu@zte.com.cn>
> 
> diff -uprN drivers/scsi/scsi_error.c drivers_new/scsi/scsi_error.c

to apply easily, diffs need to start at the top of the tree, please.

> --- scsi/scsi_error.c   2015-07-31 16:03:18.000000000 +0800
> +++ scsi_new/scsi_error.c       2015-07-31 16:29:25.000000000 +0800
> @@ -1156,9 +1156,14 @@ int scsi_eh_get_sense(struct list_head *
>         struct Scsi_Host *shost;
>         int rtn;
> 
> +       /*
> +        * If SCSI_EH_ABORT_SCHEDULED has been set, it is timeout IO,
> +        * should not get sense.
> +        */
>         list_for_each_entry_safe(scmd, next, work_q, eh_entry) {

and here all the tabs have been converted to spaces; you need to read
Documentation/email-clients.txt for details on how to avoid this.

I managed to fix it up this time, but won't again.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff -uprN drivers/scsi/scsi_error.c drivers_new/scsi/scsi_error.c
--- scsi/scsi_error.c   2015-07-31 16:03:18.000000000 +0800
+++ scsi_new/scsi_error.c       2015-07-31 16:29:25.000000000 +0800
@@ -1156,9 +1156,14 @@  int scsi_eh_get_sense(struct list_head *
        struct Scsi_Host *shost;
        int rtn;

+       /*
+        * If SCSI_EH_ABORT_SCHEDULED has been set, it is timeout IO,
+        * should not get sense.
+        */
        list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
                if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
-                   SCSI_SENSE_VALID(scmd))
+                   (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED) ||
+                    SCSI_SENSE_VALID(scmd))
                        continue;

                shost = scmd->device->host;