From patchwork Thu Sep 27 14:39:40 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Bill Kuzeja
X-Patchwork-Id: 10618107
Return-Path:
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 24A3716B1
	for ; Thu, 27 Sep 2018 14:45:55 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 142012AB1F
	for ; Thu, 27 Sep 2018 14:45:55 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 077972AFD0; Thu, 27 Sep 2018 14:45:55 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 696862AB1F
	for ; Thu, 27 Sep 2018 14:45:54 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1727341AbeI0VE3 convert rfc822-to-8bit (ORCPT );
	Thu, 27 Sep 2018 17:04:29 -0400
Received: from us-smtp-delivery-131.mimecast.com ([63.128.21.131]:49435
	"EHLO us-smtp-delivery-131.mimecast.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1727307AbeI0VE3 (ORCPT );
	Thu, 27 Sep 2018 17:04:29 -0400
X-Greylist: delayed 371 seconds by postgrey-1.27 at vger.kernel.org;
	Thu, 27 Sep 2018 17:04:28 EDT
Received: from mailhub5.stratus.com (mailhub5.stratus.com [134.111.1.18])
	by us-smtp-1.mimecast.com with ESMTP id us-mta-248-WKKKF-yGMXe2il-8cA92cA-1;
	Thu, 27 Sep 2018 10:39:41 -0400
Received: from EXHQ1.corp.stratus.com (exhq1.corp.stratus.com [134.111.200.125])
	by mailhub5.stratus.com (8.12.11/8.12.11) with ESMTP id w8REdfXZ010043;
	Thu, 27 Sep 2018 10:39:41 -0400
Received: from linuxdev.lnx.eng.stratus.com (134.111.220.63)
	by EXHQ1.corp.stratus.com (134.111.200.125) with Microsoft SMTP Server (TLS)
	id 14.3.279.2; Thu, 27 Sep 2018 10:39:41 -0400
From: Bill Kuzeja
To:
CC: ,
Subject: [PATCH] scsi: qla2xxx: I/Os timing out on surprise removal of
Date: Thu, 27 Sep 2018 10:39:40 -0400
Message-ID: <1538059180-28025-1-git-send-email-William.Kuzeja@stratus.com>
X-Mailer: git-send-email 1.8.3.1
MIME-Version: 1.0
X-MC-Unique: WKKKF-yGMXe2il-8cA92cA-1
Sender: linux-scsi-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-scsi@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

When removing an adapter through sysfs, some in-flight I/Os can get stuck
and take a while to complete (they actually time out and are retried). We
are not handling an early error exit from qla2xxx_eh_abort properly.

Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip")
Signed-off-by: Bill Kuzeja
---

When doing a sysfs remove of a QLogic adapter, the driver's remove function
gets called and we end up aborting all in-progress I/Os. Here is the code
flow:

qla2x00_remove_one
  qla2x00_abort_isp_cleanup
    qla2x00_abort_all_cmds
      __qla2x00_abort_all_cmds
        qla2xxx_eh_abort

At the start of qla2xxx_eh_abort, there are some sanity checks done before
actually sending the abort. One of these checks is a call to
fc_block_scsi_eh. In the case of a sysfs remove, it turns out that this
routine can exit with FAST_IO_FAIL.

When this occurs, we return to __qla2x00_abort_all_cmds with an extra
reference on sp (because the abort never gets sent).
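
As a rough illustration of the reference accounting described above, here is
a minimal user-space sketch. It is not driver code: the enum values, the
caller_drops_ref() and refs_after_early_exit() helpers, and the plain int
reference count are all hypothetical stand-ins (the driver uses an atomic
counter and the SCSI midlayer's FAILED and FAST_IO_FAIL codes). It only models
the caller's decision to drop the extra reference when the abort handler
exits early, with and without the FAST_IO_FAIL case.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the SCSI EH return codes; not the kernel values. */
enum eh_status { EH_SUCCESS, EH_FAILED, EH_FAST_IO_FAIL };

/*
 * Should the caller drop the extra reference it took before the abort attempt?
 * reg_disconnected models a non-zero return from qla2x00_isp_reg_stat(ha).
 * handle_fast_io_fail selects whether the FAST_IO_FAIL early exit is covered.
 */
static bool caller_drops_ref(enum eh_status status, bool reg_disconnected,
			     bool handle_fast_io_fail)
{
	if (status == EH_FAILED && reg_disconnected)
		return true;
	return handle_fast_io_fail && status == EH_FAST_IO_FAIL;
}

/*
 * Reference count left on a command after an early exit from the abort
 * handler. The command starts with its own reference plus the extra one
 * taken for the abort attempt.
 */
static int refs_after_early_exit(enum eh_status status, bool reg_disconnected,
				 bool handle_fast_io_fail)
{
	int ref_count = 2;

	if (caller_drops_ref(status, reg_disconnected, handle_fast_io_fail)) {
		assert(ref_count > 0);
		ref_count--;	/* mirrors the atomic_dec in the caller */
	}
	return ref_count;
}
```

In this model, refs_after_early_exit(EH_FAST_IO_FAIL, false, false) leaves
both references in place, which corresponds to the stuck I/O, while passing
true for the last argument brings the count back down to one.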

Originally, I remedied this kind of situation with another fix:

commit 4cd3b6ebff85
    scsi: qla2xxx: Fix extraneous ref on sp's after adapter break

But this later added change complicated matters:

commit 45235022da99
    scsi: qla2xxx: Fix driver unload by shutting down chip

Because the abort is now being done earlier in the teardown (through
qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past
the first check because qla2x00_isp_reg_stat(ha) returns zero. When we
fail a few lines later in fc_block_scsi_eh, this error is not handled
properly in __qla2x00_abort_all_cmds and the I/O ends up hanging and
timing out because of the extra reference.

For this fix, I will add this case to __qla2x00_abort_all_cmds where we
check to see if qla2xxx_eh_abort succeeded or not. This removes the extra
reference in this additional early exit case. In my testing, this
eliminates the timeouts and delays and the remove proceeds smoothly.

---
 drivers/scsi/qla2xxx/qla_os.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 42b8f0d..3ba3765 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -1771,8 +1771,9 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
 					 * if immediate exit from
 					 * ql2xxx_eh_abort
 					 */
-					if (status == FAILED &&
-					    (qla2x00_isp_reg_stat(ha)))
+					if (((status == FAILED) &&
+					    (qla2x00_isp_reg_stat(ha))) ||
+					    (status == FAST_IO_FAIL))
 						atomic_dec(
 						    &sp->ref_count);
 				}