diff mbox

qla2xxx: Fix NULL pointer crash due to active timer for ABTS

Message ID 20180212182814.4541-1-himanshu.madhani@cavium.com (mailing list archive)
State Accepted
Headers show

Commit Message

Madhani, Himanshu Feb. 12, 2018, 6:28 p.m. UTC
This patch fixes NULL pointer crash due to active timer running
for abort IOCB.

From crash dump analysis it was discoverd that get_next_timer_interrupt()
encountered a corrupted entry on the timer list.

 #9 [ffff95e1f6f0fd40] page_fault at ffffffff914fe8f8
    [exception RIP: get_next_timer_interrupt+440]
    RIP: ffffffff90ea3088  RSP: ffff95e1f6f0fdf0  RFLAGS: 00010013
    RAX: ffff95e1f6451028  RBX: 000218e2389e5f40  RCX: 00000001232ad600
    RDX: 0000000000000001  RSI: ffff95e1f6f0fdf0  RDI: 0000000001232ad6
    RBP: ffff95e1f6f0fe40   R8: ffff95e1f6451188   R9: 0000000000000001
    R10: 0000000000000016  R11: 0000000000000016  R12: 00000001232ad5f6
    R13: ffff95e1f6450000  R14: ffff95e1f6f0fdf8  R15: ffff95e1f6f0fe10
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018

 Looking at the assembly of get_next_timer_interrupt(), address came
 from %r8 (ffff95e1f6451188) which is pointing to list_head with single
 entry at ffff95e5ff621178.

 0xffffffff90ea307a <get_next_timer_interrupt+426>:      mov    (%r8),%rdx
 0xffffffff90ea307d <get_next_timer_interrupt+429>:      cmp    %r8,%rdx
 0xffffffff90ea3080 <get_next_timer_interrupt+432>:      je     0xffffffff90ea30a7 <get_next_timer_interrupt+471>
 0xffffffff90ea3082 <get_next_timer_interrupt+434>:      nopw   0x0(%rax,%rax,1)
 0xffffffff90ea3088 <get_next_timer_interrupt+440>:      testb  $0x1,0x18(%rdx)

 crash> rd ffff95e1f6451188 10
 ffff95e1f6451188:  ffff95e5ff621178 ffff95e5ff621178   x.b.....x.b.....
 ffff95e1f6451198:  ffff95e1f6451198 ffff95e1f6451198   ..E.......E.....
 ffff95e1f64511a8:  ffff95e1f64511a8 ffff95e1f64511a8   ..E.......E.....
 ffff95e1f64511b8:  ffff95e77cf509a0 ffff95e77cf509a0   ...|.......|....
 ffff95e1f64511c8:  ffff95e1f64511c8 ffff95e1f64511c8   ..E.......E.....

 crash> rd ffff95e5ff621178 10
 ffff95e5ff621178:  0000000000000001 ffff95e15936aa00   ..........6Y....
 ffff95e5ff621188:  0000000000000000 00000000ffffffff   ................
 ffff95e5ff621198:  00000000000000a0 0000000000000010   ................
 ffff95e5ff6211a8:  ffff95e5ff621198 000000000000000c   ..b.............
 ffff95e5ff6211b8:  00000f5800000000 ffff95e751f8d720   ....X... ..Q....

 ffff95e5ff621178 belongs to freed mempool object at ffff95e5ff621080.

 CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
 ffff95dc7fd74d00 mnt_cache                384      19785     24948    594    16k
   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
   ffffdc5dabfd8800  ffff95e5ff620000     1     42         29    13
   FREE / [ALLOCATED]
    ffff95e5ff621080  (cpu 6 cache)

 Examining the contents of that memory reveals a pointer to a constant
 string in the driver, "abort\0", which is set by qla24xx_async_abort_cmd().

 crash> rd ffffffffc059277c 20
 ffffffffc059277c:  6e490074726f6261 0074707572726574   abort.Interrupt.
 ffffffffc059278c:  00676e696c6c6f50 6920726576697244   Polling.Driver i
 ffffffffc059279c:  646f6d207325206e 6974736554000a65   n %s mode..Testi
 ffffffffc05927ac:  636976656420676e 786c252074612065   ng device at %lx
 ffffffffc05927bc:  6b63656843000a2e 646f727020676e69   ...Checking prod
 ffffffffc05927cc:  6f20444920746375 0a2e706968632066   uct ID of chip..
 ffffffffc05927dc:  5120646e756f4600 204130303232414c   .Found QLA2200A
 ffffffffc05927ec:  43000a2e70696843 20676e696b636568   Chip...Checking
 ffffffffc05927fc:  65786f626c69616d 6c636e69000a2e73   mailboxes...incl
 ffffffffc059280c:  756e696c2f656475 616d2d616d642f78   ude/linux/dma-ma

 crash> struct -ox srb_iocb
 struct srb_iocb {
           union {
               struct {...} logio;
               struct {...} els_logo;
               struct {...} tmf;
               struct {...} fxiocb;
               struct {...} abt;
               struct ct_arg ctarg;
               struct {...} mbx;
               struct {...} nack;
    [0x0 ] } u;
    [0xb8] struct timer_list timer;
    [0x108] void (*timeout)(void *);
 }
 SIZE: 0x110

 crash> ! bc
 ibase=16
 obase=10
 B8+40
 F8

 The object is a srb_t, and at offset 0xf8 within that structure
 (i.e. ffff95e5ff621080 + f8 -> ffff95e5ff621178) is a struct timer_list.

Cc: <stable@vger.stable.com> #4.4+
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
---
Hi Martin, 

This patch addresses crash due to NULL pointer access because driver
left active timer running for abort IOCB.

Please apply this patch to 4.16/scsi-fixes at your earliest convenience.

Thanks,
Himanshu
---
 drivers/scsi/qla2xxx/qla_init.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Johannes Thumshirn Feb. 13, 2018, 10:38 a.m. UTC | #1
Thanks Himanshu,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

Do you happen to know which commit it actually fixes?
John Garry Feb. 13, 2018, 10:54 a.m. UTC | #2
On 12/02/2018 18:28, Himanshu Madhani wrote:
> This patch fixes NULL pointer crash due to active timer running
> for abort IOCB.
>
>>From crash dump analysis it was discoverd that get_next_timer_interrupt()
> encountered a corrupted entry on the timer list.
>
>  #9 [ffff95e1f6f0fd40] page_fault at ffffffff914fe8f8
>     [exception RIP: get_next_timer_interrupt+440]
>     RIP: ffffffff90ea3088  RSP: ffff95e1f6f0fdf0  RFLAGS: 00010013
>     RAX: ffff95e1f6451028  RBX: 000218e2389e5f40  RCX: 00000001232ad600
>     RDX: 0000000000000001  RSI: ffff95e1f6f0fdf0  RDI: 0000000001232ad6
>     RBP: ffff95e1f6f0fe40   R8: ffff95e1f6451188   R9: 0000000000000001
>     R10: 0000000000000016  R11: 0000000000000016  R12: 00000001232ad5f6
>     R13: ffff95e1f6450000  R14: ffff95e1f6f0fdf8  R15: ffff95e1f6f0fe10
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>
>  Looking at the assembly of get_next_timer_interrupt(), address came
>  from %r8 (ffff95e1f6451188) which is pointing to list_head with single
>  entry at ffff95e5ff621178.
>
>  0xffffffff90ea307a <get_next_timer_interrupt+426>:      mov    (%r8),%rdx
>  0xffffffff90ea307d <get_next_timer_interrupt+429>:      cmp    %r8,%rdx
>  0xffffffff90ea3080 <get_next_timer_interrupt+432>:      je     0xffffffff90ea30a7 <get_next_timer_interrupt+471>
>  0xffffffff90ea3082 <get_next_timer_interrupt+434>:      nopw   0x0(%rax,%rax,1)
>  0xffffffff90ea3088 <get_next_timer_interrupt+440>:      testb  $0x1,0x18(%rdx)
>
>  crash> rd ffff95e1f6451188 10
>  ffff95e1f6451188:  ffff95e5ff621178 ffff95e5ff621178   x.b.....x.b.....
>  ffff95e1f6451198:  ffff95e1f6451198 ffff95e1f6451198   ..E.......E.....
>  ffff95e1f64511a8:  ffff95e1f64511a8 ffff95e1f64511a8   ..E.......E.....
>  ffff95e1f64511b8:  ffff95e77cf509a0 ffff95e77cf509a0   ...|.......|....
>  ffff95e1f64511c8:  ffff95e1f64511c8 ffff95e1f64511c8   ..E.......E.....
>
>  crash> rd ffff95e5ff621178 10
>  ffff95e5ff621178:  0000000000000001 ffff95e15936aa00   ..........6Y....
>  ffff95e5ff621188:  0000000000000000 00000000ffffffff   ................
>  ffff95e5ff621198:  00000000000000a0 0000000000000010   ................
>  ffff95e5ff6211a8:  ffff95e5ff621198 000000000000000c   ..b.............
>  ffff95e5ff6211b8:  00000f5800000000 ffff95e751f8d720   ....X... ..Q....
>
>  ffff95e5ff621178 belongs to freed mempool object at ffff95e5ff621080.
>
>  CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
>  ffff95dc7fd74d00 mnt_cache                384      19785     24948    594    16k
>    SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>    ffffdc5dabfd8800  ffff95e5ff620000     1     42         29    13
>    FREE / [ALLOCATED]
>     ffff95e5ff621080  (cpu 6 cache)
>
>  Examining the contents of that memory reveals a pointer to a constant
>  string in the driver, "abort\0", which is set by qla24xx_async_abort_cmd().
>
>  crash> rd ffffffffc059277c 20
>  ffffffffc059277c:  6e490074726f6261 0074707572726574   abort.Interrupt.
>  ffffffffc059278c:  00676e696c6c6f50 6920726576697244   Polling.Driver i
>  ffffffffc059279c:  646f6d207325206e 6974736554000a65   n %s mode..Testi
>  ffffffffc05927ac:  636976656420676e 786c252074612065   ng device at %lx
>  ffffffffc05927bc:  6b63656843000a2e 646f727020676e69   ...Checking prod
>  ffffffffc05927cc:  6f20444920746375 0a2e706968632066   uct ID of chip..
>  ffffffffc05927dc:  5120646e756f4600 204130303232414c   .Found QLA2200A
>  ffffffffc05927ec:  43000a2e70696843 20676e696b636568   Chip...Checking
>  ffffffffc05927fc:  65786f626c69616d 6c636e69000a2e73   mailboxes...incl
>  ffffffffc059280c:  756e696c2f656475 616d2d616d642f78   ude/linux/dma-ma
>
>  crash> struct -ox srb_iocb
>  struct srb_iocb {
>            union {
>                struct {...} logio;
>                struct {...} els_logo;
>                struct {...} tmf;
>                struct {...} fxiocb;
>                struct {...} abt;
>                struct ct_arg ctarg;
>                struct {...} mbx;
>                struct {...} nack;
>     [0x0 ] } u;
>     [0xb8] struct timer_list timer;
>     [0x108] void (*timeout)(void *);
>  }
>  SIZE: 0x110
>
>  crash> ! bc
>  ibase=16
>  obase=10
>  B8+40
>  F8
>
>  The object is a srb_t, and at offset 0xf8 within that structure
>  (i.e. ffff95e5ff621080 + f8 -> ffff95e5ff621178) is a struct timer_list.
>
> Cc: <stable@vger.stable.com> #4.4+
> Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
> ---
> Hi Martin,
>
> This patch addresses crash due to NULL pointer access because driver
> left active timer running for abort IOCB.
>
> Please apply this patch to 4.16/scsi-fixes at your earliest convenience.
>
> Thanks,
> Himanshu
> ---
>  drivers/scsi/qla2xxx/qla_init.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
> index 2dea1129d396..04870621e712 100644
> --- a/drivers/scsi/qla2xxx/qla_init.c
> +++ b/drivers/scsi/qla2xxx/qla_init.c
> @@ -1527,6 +1527,7 @@ qla24xx_abort_sp_done(void *ptr, int res)
>  	srb_t *sp = ptr;
>  	struct srb_iocb *abt = &sp->u.iocb_cmd;
>
> +	del_timer(&sp->u.iocb_cmd.timer);

Hi,

I am not familiar with this code. But I note that this abort seems to be 
"asynchronous" (I see it's setup in qla24xx_async_abort_cmd()), so do 
you need the "sync" variant of del_timer() for safety? This would be to 
ensure the timeout has not actually occurred and is racing with the 
"done" callback?

Thanks,
John

>  	complete(&abt->u.abt.comp);
>  }
>
>
Madhani, Himanshu Feb. 13, 2018, 6:13 p.m. UTC | #3
Hi Johannes, 

> On Feb 13, 2018, at 2:38 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:

> 

> Thanks Himanshu,

> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

> 

> Do you happen to know which commit it actually fixes?

> 


Thanks for asking (I did not add fixes commit ID because it was pre 4.x kernel) 

Here’s commit id 4440e46d5db7b445a961a84444849b2a31fa7fd1 which is fixed by this patch.

> -- 

> Johannes Thumshirn                                          Storage

> jthu

> mshirn@suse.de                                +49 911 74053 689

> SUSE

> LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg

> GF: Felix Imendörffer, Jane

> Smithard, Graham Norton

> HRB 21284 (AG Nürnberg)

> Key fingerprint = EC38

> 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Thanks,
- Himanshu
Madhani, Himanshu Feb. 13, 2018, 8:07 p.m. UTC | #4
Hi John, 

> On Feb 13, 2018, at 2:54 AM, John Garry <john.garry@huawei.com> wrote:
> 
> On 12/02/2018 18:28, Himanshu Madhani wrote:
>> This patch fixes NULL pointer crash due to active timer running
>> for abort IOCB.
>> 
>>> From crash dump analysis it was discoverd that get_next_timer_interrupt()
>> encountered a corrupted entry on the timer list.
>> 
>> #9 [ffff95e1f6f0fd40] page_fault at ffffffff914fe8f8
>>    [exception RIP: get_next_timer_interrupt+440]
>>    RIP: ffffffff90ea3088  RSP: ffff95e1f6f0fdf0  RFLAGS: 00010013
>>    RAX: ffff95e1f6451028  RBX: 000218e2389e5f40  RCX: 00000001232ad600
>>    RDX: 0000000000000001  RSI: ffff95e1f6f0fdf0  RDI: 0000000001232ad6
>>    RBP: ffff95e1f6f0fe40   R8: ffff95e1f6451188   R9: 0000000000000001
>>    R10: 0000000000000016  R11: 0000000000000016  R12: 00000001232ad5f6
>>    R13: ffff95e1f6450000  R14: ffff95e1f6f0fdf8  R15: ffff95e1f6f0fe10
>>    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> 
>> Looking at the assembly of get_next_timer_interrupt(), address came
>> from %r8 (ffff95e1f6451188) which is pointing to list_head with single
>> entry at ffff95e5ff621178.
>> 
>> 0xffffffff90ea307a <get_next_timer_interrupt+426>:      mov    (%r8),%rdx
>> 0xffffffff90ea307d <get_next_timer_interrupt+429>:      cmp    %r8,%rdx
>> 0xffffffff90ea3080 <get_next_timer_interrupt+432>:      je     0xffffffff90ea30a7 <get_next_timer_interrupt+471>
>> 0xffffffff90ea3082 <get_next_timer_interrupt+434>:      nopw   0x0(%rax,%rax,1)
>> 0xffffffff90ea3088 <get_next_timer_interrupt+440>:      testb  $0x1,0x18(%rdx)
>> 
>> crash> rd ffff95e1f6451188 10
>> ffff95e1f6451188:  ffff95e5ff621178 ffff95e5ff621178   x.b.....x.b.....
>> ffff95e1f6451198:  ffff95e1f6451198 ffff95e1f6451198   ..E.......E.....
>> ffff95e1f64511a8:  ffff95e1f64511a8 ffff95e1f64511a8   ..E.......E.....
>> ffff95e1f64511b8:  ffff95e77cf509a0 ffff95e77cf509a0   ...|.......|....
>> ffff95e1f64511c8:  ffff95e1f64511c8 ffff95e1f64511c8   ..E.......E.....
>> 
>> crash> rd ffff95e5ff621178 10
>> ffff95e5ff621178:  0000000000000001 ffff95e15936aa00   ..........6Y....
>> ffff95e5ff621188:  0000000000000000 00000000ffffffff   ................
>> ffff95e5ff621198:  00000000000000a0 0000000000000010   ................
>> ffff95e5ff6211a8:  ffff95e5ff621198 000000000000000c   ..b.............
>> ffff95e5ff6211b8:  00000f5800000000 ffff95e751f8d720   ....X... ..Q....
>> 
>> ffff95e5ff621178 belongs to freed mempool object at ffff95e5ff621080.
>> 
>> CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
>> ffff95dc7fd74d00 mnt_cache                384      19785     24948    594    16k
>>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>>   ffffdc5dabfd8800  ffff95e5ff620000     1     42         29    13
>>   FREE / [ALLOCATED]
>>    ffff95e5ff621080  (cpu 6 cache)
>> 
>> Examining the contents of that memory reveals a pointer to a constant
>> string in the driver, "abort\0", which is set by qla24xx_async_abort_cmd().
>> 
>> crash> rd ffffffffc059277c 20
>> ffffffffc059277c:  6e490074726f6261 0074707572726574   abort.Interrupt.
>> ffffffffc059278c:  00676e696c6c6f50 6920726576697244   Polling.Driver i
>> ffffffffc059279c:  646f6d207325206e 6974736554000a65   n %s mode..Testi
>> ffffffffc05927ac:  636976656420676e 786c252074612065   ng device at %lx
>> ffffffffc05927bc:  6b63656843000a2e 646f727020676e69   ...Checking prod
>> ffffffffc05927cc:  6f20444920746375 0a2e706968632066   uct ID of chip..
>> ffffffffc05927dc:  5120646e756f4600 204130303232414c   .Found QLA2200A
>> ffffffffc05927ec:  43000a2e70696843 20676e696b636568   Chip...Checking
>> ffffffffc05927fc:  65786f626c69616d 6c636e69000a2e73   mailboxes...incl
>> ffffffffc059280c:  756e696c2f656475 616d2d616d642f78   ude/linux/dma-ma
>> 
>> crash> struct -ox srb_iocb
>> struct srb_iocb {
>>           union {
>>               struct {...} logio;
>>               struct {...} els_logo;
>>               struct {...} tmf;
>>               struct {...} fxiocb;
>>               struct {...} abt;
>>               struct ct_arg ctarg;
>>               struct {...} mbx;
>>               struct {...} nack;
>>    [0x0 ] } u;
>>    [0xb8] struct timer_list timer;
>>    [0x108] void (*timeout)(void *);
>> }
>> SIZE: 0x110
>> 
>> crash> ! bc
>> ibase=16
>> obase=10
>> B8+40
>> F8
>> 
>> The object is a srb_t, and at offset 0xf8 within that structure
>> (i.e. ffff95e5ff621080 + f8 -> ffff95e5ff621178) is a struct timer_list.
>> 
>> Cc: <stable@vger.stable.com> #4.4+
>> Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
>> ---
>> Hi Martin,
>> 
>> This patch addresses crash due to NULL pointer access because driver
>> left active timer running for abort IOCB.
>> 
>> Please apply this patch to 4.16/scsi-fixes at your earliest convenience.
>> 
>> Thanks,
>> Himanshu
>> ---
>> drivers/scsi/qla2xxx/qla_init.c | 1 +
>> 1 file changed, 1 insertion(+)
>> 
>> diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
>> index 2dea1129d396..04870621e712 100644
>> --- a/drivers/scsi/qla2xxx/qla_init.c
>> +++ b/drivers/scsi/qla2xxx/qla_init.c
>> @@ -1527,6 +1527,7 @@ qla24xx_abort_sp_done(void *ptr, int res)
>> 	srb_t *sp = ptr;
>> 	struct srb_iocb *abt = &sp->u.iocb_cmd;
>> 
>> +	del_timer(&sp->u.iocb_cmd.timer);
> 
> Hi,
> 
> I am not familiar with this code. But I note that this abort seems to be "asynchronous" (I see it's setup in qla24xx_async_abort_cmd()), so do you need the "sync" variant of del_timer() for safety? This would be to ensure the timeout has not actually occurred and is racing with the "done" callback?
> 

You do have a point here. I need to investigate your suggestion and see if we need to go with “sync” variant. I’ll update you shortly.

> Thanks,
> John
> 
>> 	complete(&abt->u.abt.comp);
>> }
>> 
>> 
> 
> 

Thanks,
- Himanshu
Johannes Thumshirn Feb. 14, 2018, 8:27 a.m. UTC | #5
On Tue, 2018-02-13 at 18:13 +0000, Madhani, Himanshu wrote:
> Thanks for asking (I did not add fixes commit ID because it was pre
> 4.x kernel) 
> 
> Here’s commit id 4440e46d5db7b445a961a84444849b2a31fa7fd1 which is
> fixed by this patch.

Thanks for the info. In general having a 
Fixes: 4440e46d5db7 ("[SCSI] qla2xxx: Add IOCB Abort command
asynchronous handling.")
line is extremly valuable as many distributions have automation in
place to check whether a fix for a backport arrived upstream.

Thanks,
	Johannes
Madhani, Himanshu Feb. 14, 2018, 7:58 p.m. UTC | #6
Hi John,

> On Feb 13, 2018, at 2:54 AM, John Garry <john.garry@huawei.com> wrote:
> 
> On 12/02/2018 18:28, Himanshu Madhani wrote:
>> This patch fixes NULL pointer crash due to active timer running
>> for abort IOCB.
>> 
>>> From crash dump analysis it was discoverd that get_next_timer_interrupt()
>> encountered a corrupted entry on the timer list.
>> 
>> #9 [ffff95e1f6f0fd40] page_fault at ffffffff914fe8f8
>>    [exception RIP: get_next_timer_interrupt+440]
>>    RIP: ffffffff90ea3088  RSP: ffff95e1f6f0fdf0  RFLAGS: 00010013
>>    RAX: ffff95e1f6451028  RBX: 000218e2389e5f40  RCX: 00000001232ad600
>>    RDX: 0000000000000001  RSI: ffff95e1f6f0fdf0  RDI: 0000000001232ad6
>>    RBP: ffff95e1f6f0fe40   R8: ffff95e1f6451188   R9: 0000000000000001
>>    R10: 0000000000000016  R11: 0000000000000016  R12: 00000001232ad5f6
>>    R13: ffff95e1f6450000  R14: ffff95e1f6f0fdf8  R15: ffff95e1f6f0fe10
>>    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> 
>> Looking at the assembly of get_next_timer_interrupt(), address came
>> from %r8 (ffff95e1f6451188) which is pointing to list_head with single
>> entry at ffff95e5ff621178.
>> 
>> 0xffffffff90ea307a <get_next_timer_interrupt+426>:      mov    (%r8),%rdx
>> 0xffffffff90ea307d <get_next_timer_interrupt+429>:      cmp    %r8,%rdx
>> 0xffffffff90ea3080 <get_next_timer_interrupt+432>:      je     0xffffffff90ea30a7 <get_next_timer_interrupt+471>
>> 0xffffffff90ea3082 <get_next_timer_interrupt+434>:      nopw   0x0(%rax,%rax,1)
>> 0xffffffff90ea3088 <get_next_timer_interrupt+440>:      testb  $0x1,0x18(%rdx)
>> 
>> crash> rd ffff95e1f6451188 10
>> ffff95e1f6451188:  ffff95e5ff621178 ffff95e5ff621178   x.b.....x.b.....
>> ffff95e1f6451198:  ffff95e1f6451198 ffff95e1f6451198   ..E.......E.....
>> ffff95e1f64511a8:  ffff95e1f64511a8 ffff95e1f64511a8   ..E.......E.....
>> ffff95e1f64511b8:  ffff95e77cf509a0 ffff95e77cf509a0   ...|.......|....
>> ffff95e1f64511c8:  ffff95e1f64511c8 ffff95e1f64511c8   ..E.......E.....
>> 
>> crash> rd ffff95e5ff621178 10
>> ffff95e5ff621178:  0000000000000001 ffff95e15936aa00   ..........6Y....
>> ffff95e5ff621188:  0000000000000000 00000000ffffffff   ................
>> ffff95e5ff621198:  00000000000000a0 0000000000000010   ................
>> ffff95e5ff6211a8:  ffff95e5ff621198 000000000000000c   ..b.............
>> ffff95e5ff6211b8:  00000f5800000000 ffff95e751f8d720   ....X... ..Q....
>> 
>> ffff95e5ff621178 belongs to freed mempool object at ffff95e5ff621080.
>> 
>> CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
>> ffff95dc7fd74d00 mnt_cache                384      19785     24948    594    16k
>>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>>   ffffdc5dabfd8800  ffff95e5ff620000     1     42         29    13
>>   FREE / [ALLOCATED]
>>    ffff95e5ff621080  (cpu 6 cache)
>> 
>> Examining the contents of that memory reveals a pointer to a constant
>> string in the driver, "abort\0", which is set by qla24xx_async_abort_cmd().
>> 
>> crash> rd ffffffffc059277c 20
>> ffffffffc059277c:  6e490074726f6261 0074707572726574   abort.Interrupt.
>> ffffffffc059278c:  00676e696c6c6f50 6920726576697244   Polling.Driver i
>> ffffffffc059279c:  646f6d207325206e 6974736554000a65   n %s mode..Testi
>> ffffffffc05927ac:  636976656420676e 786c252074612065   ng device at %lx
>> ffffffffc05927bc:  6b63656843000a2e 646f727020676e69   ...Checking prod
>> ffffffffc05927cc:  6f20444920746375 0a2e706968632066   uct ID of chip..
>> ffffffffc05927dc:  5120646e756f4600 204130303232414c   .Found QLA2200A
>> ffffffffc05927ec:  43000a2e70696843 20676e696b636568   Chip...Checking
>> ffffffffc05927fc:  65786f626c69616d 6c636e69000a2e73   mailboxes...incl
>> ffffffffc059280c:  756e696c2f656475 616d2d616d642f78   ude/linux/dma-ma
>> 
>> crash> struct -ox srb_iocb
>> struct srb_iocb {
>>           union {
>>               struct {...} logio;
>>               struct {...} els_logo;
>>               struct {...} tmf;
>>               struct {...} fxiocb;
>>               struct {...} abt;
>>               struct ct_arg ctarg;
>>               struct {...} mbx;
>>               struct {...} nack;
>>    [0x0 ] } u;
>>    [0xb8] struct timer_list timer;
>>    [0x108] void (*timeout)(void *);
>> }
>> SIZE: 0x110
>> 
>> crash> ! bc
>> ibase=16
>> obase=10
>> B8+40
>> F8
>> 
>> The object is a srb_t, and at offset 0xf8 within that structure
>> (i.e. ffff95e5ff621080 + f8 -> ffff95e5ff621178) is a struct timer_list.
>> 
>> Cc: <stable@vger.stable.com> #4.4+
>> Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
>> ---
>> Hi Martin,
>> 
>> This patch addresses crash due to NULL pointer access because driver
>> left active timer running for abort IOCB.
>> 
>> Please apply this patch to 4.16/scsi-fixes at your earliest convenience.
>> 
>> Thanks,
>> Himanshu
>> ---
>> drivers/scsi/qla2xxx/qla_init.c | 1 +
>> 1 file changed, 1 insertion(+)
>> 
>> diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
>> index 2dea1129d396..04870621e712 100644
>> --- a/drivers/scsi/qla2xxx/qla_init.c
>> +++ b/drivers/scsi/qla2xxx/qla_init.c
>> @@ -1527,6 +1527,7 @@ `(void *ptr, int res)
>> 	srb_t *sp = ptr;
>> 	struct srb_iocb *abt = &sp->u.iocb_cmd;
>> 
>> +	del_timer(&sp->u.iocb_cmd.timer);
> 
> Hi,
> 
> I am not familiar with this code. But I note that this abort seems to be "asynchronous" (I see it's setup in qla24xx_async_abort_cmd()), so do you need the "sync" variant of del_timer() for safety? This would be to ensure the timeout has not actually occurred and is racing with the "done" callback?
> 

after reviewing code for the usage of timer i see that qla24xx_async_abort_cmd() would not be adding timer back even though the call is
asynchronous. I think its safe to use only del_timer(). 

> Thanks,
> John
> 
>> 	complete(&abt->u.abt.comp);
>> }

Thanks,
- Himanshu
Madhani, Himanshu Feb. 23, 2018, 1:04 a.m. UTC | #7
Hi Martin, 

> On Feb 14, 2018, at 11:58 AM, Madhani, Himanshu <Himanshu.Madhani@cavium.com> wrote:
> 
> Hi John,
> 
>> On Feb 13, 2018, at 2:54 AM, John Garry <john.garry@huawei.com> wrote:
>> 
>> On 12/02/2018 18:28, Himanshu Madhani wrote:
>>> This patch fixes NULL pointer crash due to active timer running
>>> for abort IOCB.
>>> 
>>>> From crash dump analysis it was discoverd that get_next_timer_interrupt()
>>> encountered a corrupted entry on the timer list.
>>> 
>>> #9 [ffff95e1f6f0fd40] page_fault at ffffffff914fe8f8
>>>   [exception RIP: get_next_timer_interrupt+440]
>>>   RIP: ffffffff90ea3088  RSP: ffff95e1f6f0fdf0  RFLAGS: 00010013
>>>   RAX: ffff95e1f6451028  RBX: 000218e2389e5f40  RCX: 00000001232ad600
>>>   RDX: 0000000000000001  RSI: ffff95e1f6f0fdf0  RDI: 0000000001232ad6
>>>   RBP: ffff95e1f6f0fe40   R8: ffff95e1f6451188   R9: 0000000000000001
>>>   R10: 0000000000000016  R11: 0000000000000016  R12: 00000001232ad5f6
>>>   R13: ffff95e1f6450000  R14: ffff95e1f6f0fdf8  R15: ffff95e1f6f0fe10
>>>   ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>> 
>>> Looking at the assembly of get_next_timer_interrupt(), address came
>>> from %r8 (ffff95e1f6451188) which is pointing to list_head with single
>>> entry at ffff95e5ff621178.
>>> 
>>> 0xffffffff90ea307a <get_next_timer_interrupt+426>:      mov    (%r8),%rdx
>>> 0xffffffff90ea307d <get_next_timer_interrupt+429>:      cmp    %r8,%rdx
>>> 0xffffffff90ea3080 <get_next_timer_interrupt+432>:      je     0xffffffff90ea30a7 <get_next_timer_interrupt+471>
>>> 0xffffffff90ea3082 <get_next_timer_interrupt+434>:      nopw   0x0(%rax,%rax,1)
>>> 0xffffffff90ea3088 <get_next_timer_interrupt+440>:      testb  $0x1,0x18(%rdx)
>>> 
>>> crash> rd ffff95e1f6451188 10
>>> ffff95e1f6451188:  ffff95e5ff621178 ffff95e5ff621178   x.b.....x.b.....
>>> ffff95e1f6451198:  ffff95e1f6451198 ffff95e1f6451198   ..E.......E.....
>>> ffff95e1f64511a8:  ffff95e1f64511a8 ffff95e1f64511a8   ..E.......E.....
>>> ffff95e1f64511b8:  ffff95e77cf509a0 ffff95e77cf509a0   ...|.......|....
>>> ffff95e1f64511c8:  ffff95e1f64511c8 ffff95e1f64511c8   ..E.......E.....
>>> 
>>> crash> rd ffff95e5ff621178 10
>>> ffff95e5ff621178:  0000000000000001 ffff95e15936aa00   ..........6Y....
>>> ffff95e5ff621188:  0000000000000000 00000000ffffffff   ................
>>> ffff95e5ff621198:  00000000000000a0 0000000000000010   ................
>>> ffff95e5ff6211a8:  ffff95e5ff621198 000000000000000c   ..b.............
>>> ffff95e5ff6211b8:  00000f5800000000 ffff95e751f8d720   ....X... ..Q....
>>> 
>>> ffff95e5ff621178 belongs to freed mempool object at ffff95e5ff621080.
>>> 
>>> CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
>>> ffff95dc7fd74d00 mnt_cache                384      19785     24948    594    16k
>>>  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>>>  ffffdc5dabfd8800  ffff95e5ff620000     1     42         29    13
>>>  FREE / [ALLOCATED]
>>>   ffff95e5ff621080  (cpu 6 cache)
>>> 
>>> Examining the contents of that memory reveals a pointer to a constant
>>> string in the driver, "abort\0", which is set by qla24xx_async_abort_cmd().
>>> 
>>> crash> rd ffffffffc059277c 20
>>> ffffffffc059277c:  6e490074726f6261 0074707572726574   abort.Interrupt.
>>> ffffffffc059278c:  00676e696c6c6f50 6920726576697244   Polling.Driver i
>>> ffffffffc059279c:  646f6d207325206e 6974736554000a65   n %s mode..Testi
>>> ffffffffc05927ac:  636976656420676e 786c252074612065   ng device at %lx
>>> ffffffffc05927bc:  6b63656843000a2e 646f727020676e69   ...Checking prod
>>> ffffffffc05927cc:  6f20444920746375 0a2e706968632066   uct ID of chip..
>>> ffffffffc05927dc:  5120646e756f4600 204130303232414c   .Found QLA2200A
>>> ffffffffc05927ec:  43000a2e70696843 20676e696b636568   Chip...Checking
>>> ffffffffc05927fc:  65786f626c69616d 6c636e69000a2e73   mailboxes...incl
>>> ffffffffc059280c:  756e696c2f656475 616d2d616d642f78   ude/linux/dma-ma
>>> 
>>> crash> struct -ox srb_iocb
>>> struct srb_iocb {
>>>          union {
>>>              struct {...} logio;
>>>              struct {...} els_logo;
>>>              struct {...} tmf;
>>>              struct {...} fxiocb;
>>>              struct {...} abt;
>>>              struct ct_arg ctarg;
>>>              struct {...} mbx;
>>>              struct {...} nack;
>>>   [0x0 ] } u;
>>>   [0xb8] struct timer_list timer;
>>>   [0x108] void (*timeout)(void *);
>>> }
>>> SIZE: 0x110
>>> 
>>> crash> ! bc
>>> ibase=16
>>> obase=10
>>> B8+40
>>> F8
>>> 
>>> The object is a srb_t, and at offset 0xf8 within that structure
>>> (i.e. ffff95e5ff621080 + f8 -> ffff95e5ff621178) is a struct timer_list.
>>> 
>>> Cc: <stable@vger.stable.com> #4.4+
>>> Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
>>> ---
>>> Hi Martin,
>>> 
>>> This patch addresses crash due to NULL pointer access because driver
>>> left active timer running for abort IOCB.
>>> 
>>> Please apply this patch to 4.16/scsi-fixes at your earliest convenience.
>>> 
>>> Thanks,
>>> Himanshu
>>> ---
>>> drivers/scsi/qla2xxx/qla_init.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>> 
>>> diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
>>> index 2dea1129d396..04870621e712 100644
>>> --- a/drivers/scsi/qla2xxx/qla_init.c
>>> +++ b/drivers/scsi/qla2xxx/qla_init.c
>>> @@ -1527,6 +1527,7 @@ `(void *ptr, int res)
>>> 	srb_t *sp = ptr;
>>> 	struct srb_iocb *abt = &sp->u.iocb_cmd;
>>> 
>>> +	del_timer(&sp->u.iocb_cmd.timer);
>> 
>> Hi,
>> 
>> I am not familiar with this code. But I note that this abort seems to be "asynchronous" (I see it's setup in qla24xx_async_abort_cmd()), so do you need the "sync" variant of del_timer() for safety? This would be to ensure the timeout has not actually occurred and is racing with the "done" callback?
>> 
> 
> after reviewing code for the usage of timer i see that qla24xx_async_abort_cmd() would not be adding timer back even though the call is
> asynchronous. I think its safe to use only del_timer(). 
> 
>> Thanks,
>> John
>> 
>>> 	complete(&abt->u.abt.comp);
>>> }
> 
> Thanks,
> - Himanshu

Can you please queue this patch for 4.16-rc3

Thanks,
- Himanshu
Martin K. Petersen Feb. 23, 2018, 1:17 a.m. UTC | #8
Himanshu,

> Can you please queue this patch for 4.16-rc3

Applied to 4.16/scsi-fixes with the missing Fixes: tag. Thank you!
James Bottomley March 1, 2018, 7:27 p.m. UTC | #9
On Mon, 2018-02-12 at 10:28 -0800, Himanshu Madhani wrote:
[...]
> Cc: <stable@vger.stable.com> #4.4+

This is the wrong stable tag, which would lead to stable not picking up
the patch automatically.  The correct stable address is

Cc: <stable@vger.kernel.org> #4.4+
                 ^^^^^^^^^^

Please be more careful in future.

Martin, can you fix this up with a rebase and I'll resync my tree?

Thanks,

James
Martin K. Petersen March 2, 2018, 1:20 a.m. UTC | #10
James,

>> Cc: <stable@vger.stable.com> #4.4+

Ugh.

> Martin, can you fix this up with a rebase and I'll resync my tree?

Done and pushed.
diff mbox

Patch

diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
index 2dea1129d396..04870621e712 100644
--- a/drivers/scsi/qla2xxx/qla_init.c
+++ b/drivers/scsi/qla2xxx/qla_init.c
@@ -1527,6 +1527,7 @@  qla24xx_abort_sp_done(void *ptr, int res)
 	srb_t *sp = ptr;
 	struct srb_iocb *abt = &sp->u.iocb_cmd;
 
+	del_timer(&sp->u.iocb_cmd.timer);
 	complete(&abt->u.abt.comp);
 }