From patchwork Tue Jul 11 03:41:32 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
X-Patchwork-Id: 9834025
Return-Path: <target-devel-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
	[172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id
	1FB9C60325 for <patchwork-target-devel@patchwork.kernel.org>;
	Tue, 11 Jul 2017 03:41:40 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id F088623B23
	for <patchwork-target-devel@patchwork.kernel.org>;
	Tue, 11 Jul 2017 03:41:39 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id E512927FB3; Tue, 11 Jul 2017 03:41:39 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI
	autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5EC2426E97
	for <patchwork-target-devel@patchwork.kernel.org>;
	Tue, 11 Jul 2017 03:41:39 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755178AbdGKDli (ORCPT
	<rfc822;patchwork-target-devel@patchwork.kernel.org>);
	Mon, 10 Jul 2017 23:41:38 -0400
Received: from mail.linux-iscsi.org ([67.23.28.174]:60864 "EHLO
	linux-iscsi.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755171AbdGKDlh (ORCPT
	<rfc822;target-devel@vger.kernel.org>);
	Mon, 10 Jul 2017 23:41:37 -0400
Received: from [192.168.1.66] (75-37-194-224.lightspeed.lsatca.sbcglobal.net
	[75.37.194.224])
	(using SSLv3 with cipher AES256-SHA (256/256 bits))
	(No client certificate requested) (Authenticated sender: nab)
	by linux-iscsi.org (Postfix) with ESMTPSA id F116140B19;
	Tue, 11 Jul 2017 03:46:23 +0000 (UTC)
Message-ID: <1499744492.28145.12.camel@haakon3.daterainc.com>
Subject: Re: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319!
From: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
To: Pascal de Bruijn <p.debruijn@unilogic.nl>
Cc: "Madhani, Himanshu" <Himanshu.Madhani@cavium.com>,
	target-devel <target-devel@vger.kernel.org>,
	"Tran, Quinn" <Quinn.Tran@cavium.com>, s.kleijkers@unilogic.nl
Date: Mon, 10 Jul 2017 20:41:32 -0700
In-Reply-To: <d78c5f77-f588-079a-98ed-97fb1fe70c22@unilogic.nl>
References: <e35f36eb-bb55-e9a6-a6a1-9a21c6599660@unilogic.nl>
	<F6E81C3D-7B0F-4026-91D8-F5EBBC5C259B@cavium.com>
	<1498429881.26123.55.camel@haakon3.risingtidesystems.com>
	<0883cc0b-1cd8-66e7-e2d0-8bd9c0c76b9d@unilogic.nl>
	<1498807372.24896.45.camel@haakon3.risingtidesystems.com>
	<061954d8-6cfe-8b21-6e35-3f3c570b8c68@unilogic.nl>
	<1499401581.30628.16.camel@haakon3.risingtidesystems.com>
	<d78c5f77-f588-079a-98ed-97fb1fe70c22@unilogic.nl>
X-Mailer: Evolution 3.4.4-1 
Mime-Version: 1.0
Sender: target-devel-owner@vger.kernel.org
Precedence: bulk
List-ID: <target-devel.vger.kernel.org>
X-Mailing-List: target-devel@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

On Mon, 2017-07-10 at 17:04 +0200, Pascal de Bruijn wrote:
> On 07-07-17 06:26, Nicholas A. Bellinger wrote:
> > On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote:
> >> So abort_task can still be observed, but they do not result in a
> >> non-functional not-quite-PANICked machine anymore.
> > Thank alot for the bug report and your continuous testing to get this
> > resolved.  The patch is queued up in target-pending/for-next with your
> > Tested-by, and will be CC' to stable so the older v4.x.y and v3.x.y
> > kernels get this fix as well.
> >
> > Thanks again.
> 
> I'm afraid we may not be quite there yet after all...
> 
> So we had the other two machines run an md check this weekend
> as well, again with a rediculously high synx_speed_max:
> 
> Jul  9 04:00:01 myhost kernel: [661309.794774] md: data-check of RAID array md0
> Jul  9 04:00:01 myhost kernel: [661309.799173] md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
> Jul  9 04:00:01 myhost kernel: [661309.805219] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> Jul  9 04:00:01 myhost kernel: [661309.815194] md: using 128k window, over a total of 3252682752k.
> Jul  9 04:00:42 myhost kernel: [661351.076391] qla2xxx/21:00:00:24:ff:4b:8f:58: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> Jul  9 04:02:01 myhost kernel: [661429.985082] qla2xxx/21:00:00:24:ff:4b:9e:19: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> Jul  9 04:04:24 myhost kernel: [661573.395245] qla2xxx/50:01:43:80:28:ca:86:36: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> Jul  9 04:04:57 myhost kernel: [661605.837694] qla2xxx/50:01:43:80:28:ca:86:e6: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> Jul  9 04:09:19 myhost kernel: [661868.261211] qla2xxx/21:00:00:24:ff:54:9e:ab: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> Jul  9 04:13:17 myhost kernel: [662105.788459] ABORT_TASK: Found referenced qla2xxx task_tag: 1175332
> Jul  9 04:13:17 myhost kernel: [662105.794794] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1175332
> Jul  9 04:13:17 myhost kernel: [662105.990584] ABORT_TASK: Found referenced qla2xxx task_tag: 1175380
> Jul  9 04:13:18 myhost kernel: [662106.510403] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1175380
> Jul  9 04:13:20 myhost kernel: [662108.988526] ABORT_TASK: Found referenced qla2xxx task_tag: 1175620
> Jul  9 04:13:31 myhost kernel: [662119.684969] ABORT_TASK: Found referenced qla2xxx task_tag: 1211140

This means the outstanding I/O was located, but never completed back.

Or, there is something else blocked waiting for completion like
outstanding qla2xxx WRITE transfer, before the abort may proceed.

> Jul  9 04:16:42 myhost kernel: [662310.617910] qla2xxx/21:00:00:24:ff:92:bf:43: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> Jul  9 04:18:00 myhost kernel: [662389.415853] qla2xxx/21:00:00:24:ff:54:a1:33: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> Jul  9 04:18:22 myhost kernel: [662411.066461] qla2xxx/21:00:00:24:ff:92:bf:59: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> Jul  9 04:20:23 myhost kernel: [662531.852833] qla2xxx/21:00:00:24:ff:3c:d0:94: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> Jul  9 07:00:28 myhost kernel: [672137.325166] md: md0: data-check done.
> 

Btw, where there any hung task warnings while this was happening..?

> The machine in question was still responsive (was accepting
> SSH logins), however it seemed VMware hosts weren't seeing
> the volume anymore (presumably due to the heavy IO on the backend).
> 
> Also, several (12?) kworkers seemed stuck in a D state.
> 

Next time that happens, do a 'cat /proc/$PID/stack' to see where those
kworkers are blocked in un-interruptible sleep.

> When my collegue tried to reboot the machine it got
> (presumably) stuck on
> 
> /usr/bin/targetctl clear
> 
> After which it was forcefully rebooted :)
> 
> Sorry we don't have any more detailed info at this point.
> We haven't been able to reproduce this on a
> different machine yet :(

Ok, please try with the following debug patch to verify if it's blocked
on qla_tgt_cmd->write_pending_abort_comp.

Btw, normally if it was blocked here indefinitely there would be a hung
task warning after 180 seconds, unless that is not enabled in your
kernel..

If it's not, then you probably want to enable it via:

    CONFIG_DETECT_HUNG_TASK=y
---
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
index 8c1bf9b..1199969 100644
--- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
+++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
@@ -425,6 +425,7 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
 	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
 	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
 		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
+		printk("Completing write_pending_abort_comp for tag: %llu\n", se_cmd->tag);
 		wait_for_completion(&cmd->write_pending_abort_comp);
 		return 0;
 	}
@@ -515,7 +516,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
 		 * waiting upon completion in tcm_qla2xxx_write_pending_status()
 		 */
 		if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
+			u64 tag = cmd->se_cmd.tag;
+
+			printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
 			complete(&cmd->write_pending_abort_comp);
+			printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
 			return;
 		}