Message ID | 1538059180-28025-1-git-send-email-William.Kuzeja@stratus.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | scsi: qla2xxx: I/Os timing out on surprise removal of | expand |
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c index 42b8f0d..3ba3765 100644 --- a/drivers/scsi/qla2xxx/qla_os.c +++ b/drivers/scsi/qla2xxx/qla_os.c @@ -1771,8 +1771,9 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) * if immediate exit from * ql2xxx_eh_abort */ - if (status == FAILED && - (qla2x00_isp_reg_stat(ha))) + if (((status == FAILED) && + (qla2x00_isp_reg_stat(ha))) || + (status == FAST_IO_FAIL)) atomic_dec( &sp->ref_count); }
When removing an adapter through sysfs, some in flight I/Os can get stuck and take a while to complete (they actually timeout and are retried). We are not handling an early error exit from qla2xxx_eh_abort properly. Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip") Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com> --- When doing a sysfs remove of a QLogic adapter, the driver's remove function gets called and we end up aborting all in progress I/Os. Here is the code flow: qla2x00_remove_one qla2x00_abort_isp_cleanup qla2x00_abort_all_cmds __qla2x00_abort_all_cmds qla2xxx_eh_abort At the start of qla2xxx_eh_abort, there are some sanity checks done before actually sending the abort. One of these checks is a call to fc_block_scsi_eh. In the case of a sysfs remove, it turns out that this routine can exit with FAST_IO_FAIL. When this occurs, we return back to __qla2x00_abort_all_cmds with an extra reference on sp (because the abort never gets sent). Originally, I remedied this kind of situation with another fix: commit 4cd3b6ebff85 scsi: qla2xxx: Fix extraneous ref on sp's after adapter break But this later added change complicated matters: commit 45235022da99 scsi: qla2xxx: Fix driver unload by shutting down chip Because the abort is now being done earlier in the teardown (through qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past the first check because qla2x00_isp_reg_stat(ha) returns zero. When we fail a few lines later in fc_block_scsi_eh, this error is not handled properly in __qla2x00_abort_all_cmds and the I/O ends up hanging and timing out because of the extra reference. For this fix, I will add this case to __qla2x00_abort_all_cmds where we check to see if qla2xxx_eh_abort succeeded or not. This removes the extra reference in this additional early exit case. In my testing, this eliminates the timeouts and delays and the remove proceeds smoothly. --- drivers/scsi/qla2xxx/qla_os.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)