diff mbox

[2/2] qla2xxx: Avoid that issuing a LIP triggers a kernel crash

Message ID 91a8f843-c969-72c4-1add-8e44ea2d9a8a@linux.vnet.ibm.com (mailing list archive)
State Accepted, archived
Headers show

Commit Message

Mauricio Faria de Oliveira Jan. 24, 2017, 2:59 p.m. UTC
Hi Bart,

First of all, sorry for the new bug; I didn't realize the pointer could
be NULL at this scenario.

On 01/23/2017 02:34 PM, Bart Van Assche wrote:
> @@ -1624,7 +1627,8 @@ qla2x00_abort_all_cmds(scsi_qla_host_t *vha, int res)
>  					 */
>  					sp_get(sp);
>  					spin_unlock_irqrestore(&ha->hardware_lock, flags);
> -					qla2xxx_eh_abort(GET_CMD_SP(sp));
> +					if (scmd)
> +						qla2xxx_eh_abort(scmd);
>  					spin_lock_irqsave(&ha->hardware_lock, flags);
>  				}

Now, this chunk has a problem with reference counting (and unnecessary
spin-locking), which we can avoid by simply moving up this NULL check.

The call to sp_get() increments the sp->ref_count, but if you skip the
call to qla2xxx_eh_abort() you don't get the decrement from the call to
sp->done() at abort handling from ISR, e.g., qla24xx_abort_iocb_entry().
[or if the command completed successfully between issue/complete abort,
at the completion from ISR, e.g., qla2x00_process_completed_request().]

The sp->done() call just below this chunk was supposed to drop the
initial reference [set at qla2xxx_queuecommand()] at a time we did
not call qla2xxx_eh_abort() yet... but now that we __may__ call it
(and get that sp->done() call from the ISR abort handling), we need
to only increment it if we're going to drop it.

That should be resolved with this slight change to your patch
(which also helps w/ the spin-locking).  What do you/others think?

                                         /* Get a reference to the sp 
and drop the lock.
                                          * The reference ensures this 
sp->done() call
                                          * - and not the call in 
qla2xxx_eh_abort() -
@@ -1624,7 +1627,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
                                          */
                                         sp_get(sp);
 
spin_unlock_irqrestore(&ha->hardware_lock, flags);
-                                       qla2xxx_eh_abort(GET_CMD_SP(sp));
+                                       qla2xxx_eh_abort(scmd);
 
spin_lock_irqsave(&ha->hardware_lock, flags);
                                 }
                                 req->outstanding_cmds[cnt] = NULL;


Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>

Comments

Madhani, Himanshu Jan. 25, 2017, 10:05 p.m. UTC | #1
On 1/24/17, 6:59 AM, "Mauricio Faria de Oliveira" <mauricfo@linux.vnet.ibm.com> wrote:

>Hi Bart,

>

>First of all, sorry for the new bug; I didn't realize the pointer could

>be NULL at this scenario.

>

>On 01/23/2017 02:34 PM, Bart Van Assche wrote:

>> @@ -1624,7 +1627,8 @@ qla2x00_abort_all_cmds(scsi_qla_host_t *vha, int res)

>>  					 */

>>  					sp_get(sp);

>>  					spin_unlock_irqrestore(&ha->hardware_lock, flags);

>> -					qla2xxx_eh_abort(GET_CMD_SP(sp));

>> +					if (scmd)

>> +						qla2xxx_eh_abort(scmd);

>>  					spin_lock_irqsave(&ha->hardware_lock, flags);

>>  				}

>

>Now, this chunk has a problem with reference counting (and unnecessary

>spin-locking), which we can avoid by simply moving up this NULL check.

>

>The call to sp_get() increments the sp->ref_count, but if you skip the

>call to qla2xxx_eh_abort() you don't get the decrement from the call to

>sp->done() at abort handling from ISR, e.g., qla24xx_abort_iocb_entry().

>[or if the command completed successfully between issue/complete abort,

>at the completion from ISR, e.g., qla2x00_process_completed_request().]

>

>The sp->done() call just below this chunk was supposed to drop the

>initial reference [set at qla2xxx_queuecommand()] at a time we did

>not call qla2xxx_eh_abort() yet... but now that we __may__ call it

>(and get that sp->done() call from the ISR abort handling), we need

>to only increment it if we're going to drop it.

>

>That should be resolved with this slight change to your patch

>(which also helps w/ the spin-locking).  What do you/others think?

>

>diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c

>index 0a000ecf0881..a17cb63b3fd5 100644

>--- a/drivers/scsi/qla2xxx/qla_os.c

>+++ b/drivers/scsi/qla2xxx/qla_os.c

>@@ -1600,6 +1600,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)

>         srb_t *sp;

>         struct qla_hw_data *ha = vha->hw;

>         struct req_que *req;

>+       struct scsi_cmnd *scmd;

>

>         qlt_host_reset_handler(ha);

>

>@@ -1613,10 +1614,12 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data 

>*ha)

>                 for (cnt = 1; cnt < req->num_outstanding_cmds; cnt++) {

>                         sp = req->outstanding_cmds[cnt];

>                         if (sp) {

>+                               scmd = GET_CMD_SP(sp);

>+

>                                 /* Don't abort commands in adapter 

>during EEH

>                                  * recovery as it's not 

>accessible/responding.

>                                  */

>-                               if (!ha->flags.eeh_busy) {

>+                               if (scmd && !ha->flags.eeh_busy) {

>                                         /* Get a reference to the sp 

>and drop the lock.

>                                          * The reference ensures this 

>sp->done() call

>                                          * - and not the call in 

>qla2xxx_eh_abort() -

>@@ -1624,7 +1627,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)

>                                          */

>                                         sp_get(sp);

> 

>spin_unlock_irqrestore(&ha->hardware_lock, flags);

>-                                       qla2xxx_eh_abort(GET_CMD_SP(sp));

>+                                       qla2xxx_eh_abort(scmd);

> 

>spin_lock_irqsave(&ha->hardware_lock, flags);

>                                 }

>                                 req->outstanding_cmds[cnt] = NULL;

>

>

>Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>

>

>

>-- 

>Mauricio Faria de Oliveira

>IBM Linux Technology Center


This is more appropriate fix. Looks good.
Martin K. Petersen Jan. 25, 2017, 11:29 p.m. UTC | #2
>>>>> "Mauricio" == Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> writes:

Hi Mauricio,

Mauricio> First of all, sorry for the new bug; I didn't realize the
Mauricio> pointer could be NULL at this scenario.

Please do a proper patch submission for this fix.
Mauricio Faria de Oliveira Jan. 26, 2017, 12:09 a.m. UTC | #3
Hi Martin,

On 01/25/2017 09:29 PM, Martin K. Petersen wrote:
> Please do a proper patch submission for this fix.

Okay, I submitted a v2 patch w/ the suggested change.

However, the original patch has been submitted by Bart,
so I believe credit is due, but not sure how to handle
this case.

Thus, please feel free to change the sign-off line as
appropriate here.

Thanks,
diff mbox

Patch

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 0a000ecf0881..a17cb63b3fd5 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -1600,6 +1600,7 @@  uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
         srb_t *sp;
         struct qla_hw_data *ha = vha->hw;
         struct req_que *req;
+       struct scsi_cmnd *scmd;

         qlt_host_reset_handler(ha);

@@ -1613,10 +1614,12 @@  uint32_t qla2x00_isp_reg_stat(struct qla_hw_data 
*ha)
                 for (cnt = 1; cnt < req->num_outstanding_cmds; cnt++) {
                         sp = req->outstanding_cmds[cnt];
                         if (sp) {
+                               scmd = GET_CMD_SP(sp);
+
                                 /* Don't abort commands in adapter 
during EEH
                                  * recovery as it's not 
accessible/responding.
                                  */
-                               if (!ha->flags.eeh_busy) {
+                               if (scmd && !ha->flags.eeh_busy) {