Message ID | 20200205214422.3657-2-mwilck@suse.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | scsi: qla2xxx: fixes for driver unloading | expand |
Hi Martin, On Wed, Feb 05, 2020 at 10:44:20PM +0100, mwilck@suse.com wrote: > From: Martin Wilck <mwilck@suse.com> > > Since commit 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip"), > it is possible that FC commands are scheduled after the adapter firmware > has been shut down. IO sent to the firmware in this situation hangs > indefinitely. Avoid this for the LOGO code path that is typically taken > when adapters are shut down. > > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip") > Signed-off-by: Martin Wilck <mwilck@suse.com> > Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com> Just to understand it correctly: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip") is saying FW is not able to shutdown propably and therefore we kill it first and still try to do some cleanup. Are you sure you got all the necessary places fixed up? I tend to say we should add if (!ha->flags.fw_started) return QLA_FUNCTION_FAILED; do qla2x00_mailbox_command() and handle the errors one by one. Just an idea. Thanks, Daniel
On Thu, 2020-02-06 at 13:33 +0100, Daniel Wagner wrote: > Hi Martin, > > On Wed, Feb 05, 2020 at 10:44:20PM +0100, mwilck@suse.com wrote: > > From: Martin Wilck <mwilck@suse.com> > > > > Since commit 45235022da99 ("scsi: qla2xxx: Fix driver unload by > > shutting down chip"), > > it is possible that FC commands are scheduled after the adapter > > firmware > > has been shut down. IO sent to the firmware in this situation hangs > > indefinitely. Avoid this for the LOGO code path that is typically > > taken > > when adapters are shut down. > > > > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting > > down chip") > > Signed-off-by: Martin Wilck <mwilck@suse.com> > > Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com> > > Just to understand it correctly: 45235022da99 ("scsi: qla2xxx: Fix > driver unload by shutting down chip") is saying FW is not able to > shutdown propably and therefore we kill it first and still try to do > some cleanup. Yes, that's what I observed. Mailbox access hangs in this case, as the mailbox simply won't execute the command and the expected status change will not happen. > Are you sure you got all the necessary places fixed up? > > I tend to say we should add > > if (!ha->flags.fw_started) > return QLA_FUNCTION_FAILED; > > do qla2x00_mailbox_command() and handle the errors one by one. Just > an > idea. I had that idea too. I even tried it out. IIRC it's not that easy. Some commands need to be sent to the mailbox even in shut-down state (MBC_EXECUTE_FIRMWARE, for example?). I admit I didn't dare going down the path you're suggesting, I thought it had quite a potential to cause regressions. Did you mean this as a negative review, or rather an additional suggestion? Thanks Martin
On Thu, Feb 06, 2020 at 01:50:32PM +0100, Martin Wilck wrote: > On Thu, 2020-02-06 at 13:33 +0100, Daniel Wagner wrote: > > Are you sure you got all the necessary places fixed up? > > > > I tend to say we should add > > > > if (!ha->flags.fw_started) > > return QLA_FUNCTION_FAILED; > > > > do qla2x00_mailbox_command() and handle the errors one by one. Just > > an > > idea. > > I had that idea too. I even tried it out. IIRC it's not that easy. Some > commands need to be sent to the mailbox even in shut-down state > (MBC_EXECUTE_FIRMWARE, for example?). I admit I didn't dare going down > the path you're suggesting, I thought it had quite a potential to cause > regressions. Yes, this certainly causing regressions. This patch is using the blacklist approach which is probably introducing less regression. But I think for finding all the right spots the white listing approach is better. Maybe only during development phase. Just an idea. > Did you mean this as a negative review, or rather an additional > suggestion? This patch clearly improves things. So yes, I approve. Reviewed-by: Daniel Wagner <dwagner@suse.de> Thanks, Daniel
On Wed, 5 Feb 2020, 1:44pm, mwilck@suse.com wrote: > From: Martin Wilck <mwilck@suse.com> > > Since commit 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip"), > it is possible that FC commands are scheduled after the adapter firmware > has been shut down. IO sent to the firmware in this situation hangs > indefinitely. Avoid this for the LOGO code path that is typically taken > when adapters are shut down. > > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip") > Signed-off-by: Martin Wilck <mwilck@suse.com> > Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com> > --- > drivers/scsi/qla2xxx/qla_mbx.c | 3 +++ > drivers/scsi/qla2xxx/qla_os.c | 3 +++ > 2 files changed, 6 insertions(+) > > diff --git a/drivers/scsi/qla2xxx/qla_mbx.c b/drivers/scsi/qla2xxx/qla_mbx.c > index 9e09964..53129f2 100644 > --- a/drivers/scsi/qla2xxx/qla_mbx.c > +++ b/drivers/scsi/qla2xxx/qla_mbx.c > @@ -2644,6 +2644,9 @@ qla24xx_fabric_logout(scsi_qla_host_t *vha, uint16_t loop_id, uint8_t domain, > ql_dbg(ql_dbg_mbx + ql_dbg_verbose, vha, 0x106d, > "Entered %s.\n", __func__); > > + if (!ha->flags.fw_started) > + return QLA_FUNCTION_FAILED; > + Ok. > lg = dma_pool_zalloc(ha->s_dma_pool, GFP_KERNEL, &lg_dma); > if (lg == NULL) { > ql_log(ql_log_warn, vha, 0x106e, > diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c > index b520a98..2dafb46 100644 > --- a/drivers/scsi/qla2xxx/qla_os.c > +++ b/drivers/scsi/qla2xxx/qla_os.c > @@ -4878,6 +4878,9 @@ qla2x00_post_work(struct scsi_qla_host *vha, struct qla_work_evt *e) > unsigned long flags; > bool q = false; > > + if (!vha->hw->flags.fw_started) > + return QLA_FUNCTION_FAILED; > + I'd probably not do it here; rather in qla2x00_get_sp() /qla2x00_start_sp()/ qla2xxx_get_qpair_sp() time. Not all works are for posting to firmware (QLA_EVT_IDC_ACK, QLA_EVT_UNMAP etc.). Regards, -Arun > spin_lock_irqsave(&vha->work_lock, flags); > list_add_tail(&e->list, &vha->work_list); > >
On Tue, 2020-03-24 at 16:51 -0700, Arun Easi wrote: > On Wed, 5 Feb 2020, 1:44pm, mwilck@suse.com wrote: > > > From: Martin Wilck <mwilck@suse.com> > > > > Since commit 45235022da99 ("scsi: qla2xxx: Fix driver unload by > > shutting down chip"), > > it is possible that FC commands are scheduled after the adapter > > firmware > > has been shut down. IO sent to the firmware in this situation hangs > > indefinitely. Avoid this for the LOGO code path that is typically > > taken > > when adapters are shut down. > > > > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting > > down chip") > > Signed-off-by: Martin Wilck <mwilck@suse.com> > > Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com> > > --- > > drivers/scsi/qla2xxx/qla_mbx.c | 3 +++ > > drivers/scsi/qla2xxx/qla_os.c | 3 +++ > > 2 files changed, 6 insertions(+) > > > > > > --- a/drivers/scsi/qla2xxx/qla_os.c > > +++ b/drivers/scsi/qla2xxx/qla_os.c > > @@ -4878,6 +4878,9 @@ qla2x00_post_work(struct scsi_qla_host *vha, > > struct qla_work_evt *e) > > unsigned long flags; > > bool q = false; > > > > + if (!vha->hw->flags.fw_started) > > + return QLA_FUNCTION_FAILED; > > + > > I'd probably not do it here; rather in qla2x00_get_sp() > /qla2x00_start_sp()/ qla2xxx_get_qpair_sp() time. Not all works are > for > posting to firmware (QLA_EVT_IDC_ACK, QLA_EVT_UNMAP etc.). qla81xx_idc_ack() calls qla2x00_mailbox_command(), which should be avoided IMO. But I'll review the various cases and re-post the patch. Thanks, Martin > > Regards, > -Arun > > > spin_lock_irqsave(&vha->work_lock, flags); > > list_add_tail(&e->list, &vha->work_list); > > > >
On Wed, 2020-03-25 at 17:28 +0100, Martin Wilck wrote: > On Tue, 2020-03-24 at 16:51 -0700, Arun Easi wrote: > > On Wed, 5 Feb 2020, 1:44pm, mwilck@suse.com wrote: > > > > > From: Martin Wilck <mwilck@suse.com> > > > > > > Since commit 45235022da99 ("scsi: qla2xxx: Fix driver unload by > > > shutting down chip"), > > > it is possible that FC commands are scheduled after the adapter > > > firmware > > > has been shut down. IO sent to the firmware in this situation > > > hangs > > > indefinitely. Avoid this for the LOGO code path that is typically > > > taken > > > when adapters are shut down. > > > > > > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by > > > shutting > > > down chip") > > > Signed-off-by: Martin Wilck <mwilck@suse.com> > > > Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com> > > > --- > > > drivers/scsi/qla2xxx/qla_mbx.c | 3 +++ > > > drivers/scsi/qla2xxx/qla_os.c | 3 +++ > > > 2 files changed, 6 insertions(+) > > > > > > [...] > > > --- a/drivers/scsi/qla2xxx/qla_os.c > > > +++ b/drivers/scsi/qla2xxx/qla_os.c > > > @@ -4878,6 +4878,9 @@ qla2x00_post_work(struct scsi_qla_host > > > *vha, > > > struct qla_work_evt *e) > > > unsigned long flags; > > > bool q = false; > > > > > > + if (!vha->hw->flags.fw_started) > > > + return QLA_FUNCTION_FAILED; > > > + > > > > I'd probably not do it here; rather in qla2x00_get_sp() > > /qla2x00_start_sp()/ qla2xxx_get_qpair_sp() time. Not all works are > > for > > posting to firmware (QLA_EVT_IDC_ACK, QLA_EVT_UNMAP etc.). > > qla81xx_idc_ack() calls qla2x00_mailbox_command(), which should be > avoided IMO. But I'll review the various cases and re-post the patch. Thinking about it once more, the approach is racy. qla2x00_try_stop_firmware() can be called any time, and it sets fw_started = 0 *after* actually stopping the firmware. Even if we check fw_started, the firwmare might be stopped between our check and our actual mailbox / FW register access, and we'd end up hanging. At least the check should be as close as possible to the actual FW access, e.g. in qla2x00_mailbox_command(), or before writing to the request queue registers in qla2x00_start_sp() etc. Perhaps the (!fw_started) condition should be treated like ABORT_ISP_ACTIVE in qla2x00_mailbox_command, i.e. execute only if is_rom_cmd() returns true? I can re-post, but I feel this should really be done by someone who knows exactly how the firmware operates, IOW Marvell staff. Regards Martin
On Wed, 2020-03-25 at 18:20 +0100, Martin Wilck wrote: > Perhaps the (!fw_started) condition should be treated like > ABORT_ISP_ACTIVE in qla2x00_mailbox_command, i.e. execute only if > is_rom_cmd() returns true? No, this won't be sufficient, as e.g. MBC_IOCB_COMMAND_A64 is in rom_cmds[], but has been found to hang (this is why I had the hunk in qla24xx_fabric_logout()). The list of mailbox commands that must be passed on when the FW is stopped has to be shorter than rom_cmds[]. Better suggestions welcome. Martin
On Wed, 25 Mar 2020, 11:21am, Martin Wilck wrote: > On Wed, 2020-03-25 at 18:20 +0100, Martin Wilck wrote: > > Perhaps the (!fw_started) condition should be treated like > > ABORT_ISP_ACTIVE in qla2x00_mailbox_command, i.e. execute only if > > is_rom_cmd() returns true? > > No, this won't be sufficient, as e.g. MBC_IOCB_COMMAND_A64 is in > rom_cmds[], but has been found to hang (this is why I had the hunk in > qla24xx_fabric_logout()). The list of mailbox commands > that must be passed on when the FW is stopped has to be shorter than > rom_cmds[]. > With your patch 2/3, where UNLOADING is set prior to the reset of chip, in place this issue should be largely addressed. Basically, paths that send out request to wire check UNLOADING bit (if any path is missing that, should add the check) before sending it out. Now with UNLOADING set (with your patch 2/3), chip reset, and all outstanding command's completion called (qla2x00_abort_isp_cleanup) I see less chance of anything being sent out. If you see any issue with your patches 1 & 2 (addressing my comments) applied, let me know and we can tackle then. How about that? Regards, -Arun
On Wed, 2020-03-25 at 15:13 -0700, Arun Easi wrote: > On Wed, 25 Mar 2020, 11:21am, Martin Wilck wrote: > > > On Wed, 2020-03-25 at 18:20 +0100, Martin Wilck wrote: > > > Perhaps the (!fw_started) condition should be treated like > > > ABORT_ISP_ACTIVE in qla2x00_mailbox_command, i.e. execute only if > > > is_rom_cmd() returns true? > > > > No, this won't be sufficient, as e.g. MBC_IOCB_COMMAND_A64 is in > > rom_cmds[], but has been found to hang (this is why I had the hunk > > in > > qla24xx_fabric_logout()). The list of mailbox commands > > that must be passed on when the FW is stopped has to be shorter > > than > > rom_cmds[]. > > > > With your patch 2/3, where UNLOADING is set prior to the reset of > chip, in > place this issue should be largely addressed. Basically, paths that > send > out request to wire check UNLOADING bit (if any path is missing > that, > should add the check) before sending it out. > > Now with UNLOADING set (with your patch 2/3), chip reset, and all > outstanding command's completion called (qla2x00_abort_isp_cleanup) I > see > less chance of anything being sent out. If you see any issue with > your > patches 1 & 2 (addressing my comments) applied, let me know and we > can > tackle then. How about that? It sounds like a plan. Although it means that I just wasted time trying to figure out which mailbox commands need to be processed even if the firmware is down :-) Thanks, Martin
On Wed, 25 Mar 2020, 3:41pm, Martin Wilck wrote: > > ---------------------------------------------------------------------- > On Wed, 2020-03-25 at 15:13 -0700, Arun Easi wrote: > > On Wed, 25 Mar 2020, 11:21am, Martin Wilck wrote: > > > > > On Wed, 2020-03-25 at 18:20 +0100, Martin Wilck wrote: > > > > Perhaps the (!fw_started) condition should be treated like > > > > ABORT_ISP_ACTIVE in qla2x00_mailbox_command, i.e. execute only if > > > > is_rom_cmd() returns true? > > > > > > No, this won't be sufficient, as e.g. MBC_IOCB_COMMAND_A64 is in > > > rom_cmds[], but has been found to hang (this is why I had the hunk > > > in > > > qla24xx_fabric_logout()). The list of mailbox commands > > > that must be passed on when the FW is stopped has to be shorter > > > than > > > rom_cmds[]. > > > > > > > With your patch 2/3, where UNLOADING is set prior to the reset of > > chip, in place this issue should be largely addressed. Basically, > > paths that send out request to wire check UNLOADING bit (if any path > > is missing that, should add the check) before sending it out. > > > > Now with UNLOADING set (with your patch 2/3), chip reset, and all > > outstanding command's completion called (qla2x00_abort_isp_cleanup) I > > see less chance of anything being sent out. If you see any issue with > > your patches 1 & 2 (addressing my comments) applied, let me know and > > we can tackle then. How about that? > > It sounds like a plan. Although it means that I just wasted time trying > to figure out which mailbox commands need to be processed even if the > firmware is down :-) > I hope it was not too much time. On the plus side, you know the mailbox path very well now. :) Regards, -Arun
diff --git a/drivers/scsi/qla2xxx/qla_mbx.c b/drivers/scsi/qla2xxx/qla_mbx.c index 9e09964..53129f2 100644 --- a/drivers/scsi/qla2xxx/qla_mbx.c +++ b/drivers/scsi/qla2xxx/qla_mbx.c @@ -2644,6 +2644,9 @@ qla24xx_fabric_logout(scsi_qla_host_t *vha, uint16_t loop_id, uint8_t domain, ql_dbg(ql_dbg_mbx + ql_dbg_verbose, vha, 0x106d, "Entered %s.\n", __func__); + if (!ha->flags.fw_started) + return QLA_FUNCTION_FAILED; + lg = dma_pool_zalloc(ha->s_dma_pool, GFP_KERNEL, &lg_dma); if (lg == NULL) { ql_log(ql_log_warn, vha, 0x106e, diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c index b520a98..2dafb46 100644 --- a/drivers/scsi/qla2xxx/qla_os.c +++ b/drivers/scsi/qla2xxx/qla_os.c @@ -4878,6 +4878,9 @@ qla2x00_post_work(struct scsi_qla_host *vha, struct qla_work_evt *e) unsigned long flags; bool q = false; + if (!vha->hw->flags.fw_started) + return QLA_FUNCTION_FAILED; + spin_lock_irqsave(&vha->work_lock, flags); list_add_tail(&e->list, &vha->work_list);