From patchwork Fri Jun  9 05:21:52 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
X-Patchwork-Id: 9777181
Message-ID: <1496985712.28997.13.camel@haakon3.risingtidesystems.com>
Subject: Re: ESXi snapshot I/O error after upgrade to 4.9.30
From: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
To: Martin Svec <martin.svec@zoner.cz>
Cc: target-devel <target-devel@vger.kernel.org>
Date: Thu, 08 Jun 2017 22:21:52 -0700
In-Reply-To: <e159c844-dd3a-42c8-3714-e385e6c8f254@zoner.cz>
References: <e159c844-dd3a-42c8-3714-e385e6c8f254@zoner.cz>
X-Mailer: Evolution 3.4.4-1 
Mime-Version: 1.0
Sender: target-devel-owner@vger.kernel.org
Precedence: bulk
List-ID: <target-devel.vger.kernel.org>
X-Mailing-List: target-devel@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Hi Martin,

On Mon, 2017-06-05 at 18:05 +0200, Martin Svec wrote:
> Hello Nic,
> 
> Today, three of our vSphere VMs running on iSCSI LIO 4.9.30 failed to create a backup snapshot and
> hung with errors like "Create virtual machine snapshot xxxxx. Unable to close the 
> '/vmfs/volumes/.../xxxxx-000001-ctk.vmdk' file: 5 (Input/output error)." or other more general I/O
> errors. It always happened during snapshot creation and there were multiple "Detected MISCOMPARE +
> Target/iblock: Send MISCOMPARE check condition and sense" in target log at the same time.
> Subsequently, virtual machines lost access to their virtual disks and required VM reset. The
> failures seem to be independent of each other and VMs ran on different hosts.
> 

So nothing else in the target logs of interest..?

I assume the MISCOMPARE warnings occur at the normal rate..?

> The storage was upgraded to 4.9.30 only two days ago. However, we have an identical iSCSI LIO
> storage running 4.9.27 more than three weeks without any issue in the same vSphere cluster. So I'm
> wondering if this could be caused by a stable target patch between 4.9.27 and 4.9.30. Quick look
> into changelog shows "target: Fix compare_and_write_callback handling for non GOOD status" as the
> only fix related to CAW since 4.9.27. What do you think?
> 
> We have ESXi 5.5.0 rev. 5230635 on all ESXi nodes.

Note the 'target: Fix compare_and_write_callback handling for non GOOD
status' change only affects COMPARE_AND_WRITE related I/Os that actually
fail.

That is, unless the underlying backend target device was actually
generating hard I/O errors, the CAW change above in v4.9.30 won't have
any effect. Such errors would look something like the following, where
'sdc' is your target backend device:

   Buffer I/O error on dev sdc, logical block 0, async page read
   blk_update_request: I/O error, dev sdc, sector 2097144
   blk_update_request: I/O error, dev sdc, sector 2097144
   Buffer I/O error on dev sdc, logical block 262143, async page read
   blk_update_request: I/O error, dev sdc, sector 0
   Buffer I/O error on dev sdc, logical block 0, async page read
   blk_update_request: I/O error, dev sdc, sector 0
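
To make the "only failed CAW I/O" point concrete, here is a minimal
userspace sketch of the branch the change touches. It is a model, not
the kernel code: caw_early_exit() and the caw_result values are
hypothetical names, and what the real callback returns in each case is
an assumption, since only the scsi_status test, the *post_ret = 1
assignment and the CHECK_CONDITION comparison are visible in the hunk
at the end of this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SAM status code for CHECK CONDITION. */
#define SAM_STAT_CHECK_CONDITION 0x02

/* Hypothetical stand-ins for the callback's possible outcomes. */
enum caw_result {
	CAW_CONTINUE = 0,      /* status GOOD: carry on with the compare */
	CAW_EXIT_NO_SENSE,     /* failed, status other than CHECK CONDITION */
	CAW_EXIT_COMM_FAILURE, /* failed with CHECK CONDITION */
};

/*
 * Userspace model of the early-exit branch only. scsi_status is the
 * status the command already carries; *post_ret mirrors the flag set
 * in the hunk.
 */
static enum caw_result caw_early_exit(uint8_t scsi_status, int *post_ret)
{
	assert(post_ret != NULL);

	/* GOOD status: the branch is skipped entirely. */
	if (!scsi_status)
		return CAW_CONTINUE;

	*post_ret = 1;
	if (scsi_status == SAM_STAT_CHECK_CONDITION)
		return CAW_EXIT_COMM_FAILURE;
	return CAW_EXIT_NO_SENSE;
}
```

With a zero scsi_status the function returns CAW_CONTINUE and leaves
*post_ret untouched, which is the case for every CAW that the backend
completes without error.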

If the issue is reproducible, you can verify by re-enabling the debug
message for a hard I/O error in compare_and_write_callback(), using the
patch at the end of this message.

That said, if you can confirm the backend device is not generating hard
I/O errors for COMPARE_AND_WRITE I/O up to target-core, I'd wager the
ESX host failures observed aren't specific to the change.
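
If it helps with confirming that, below is a rough sketch of a helper
for picking backend I/O errors out of saved kernel log lines. The name
is_backend_io_error() is made up for illustration, and it only knows
the two message shapes quoted earlier in this mail ("I/O error, dev
sdc," and "I/O error on dev sdc,"), so treat it as a starting point
rather than a complete filter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if a kernel log line reports an I/O error for the given
 * device name (e.g. "sdc"), based on the message shapes quoted above.
 */
static bool is_backend_io_error(const char *line, const char *dev)
{
	assert(line != NULL && dev != NULL);

	/* Both message shapes contain "I/O error" followed by "dev <name>,". */
	const char *p = strstr(line, "I/O error");
	if (!p)
		return false;

	p = strstr(p, "dev ");
	if (!p)
		return false;
	p += strlen("dev ");

	size_t n = strlen(dev);
	/* Require the name to end at ',' so "sdc" does not match "sdc1". */
	return strncmp(p, dev, n) == 0 && p[n] == ',';
}
```

Feeding it each line of the saved log with your backend device name
and counting the matches is enough to tell whether the backend was
reporting errors around the time of the snapshot failures.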
---

diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index ca42fba..a0de5ab 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -479,7 +479,7 @@ static sense_reason_t compare_and_write_callback(struct se_cmd *cmd, bool succes
         * been failed with a non-zero SCSI status.
         */
        if (cmd->scsi_status) {
-               pr_debug("compare_and_write_callback: non zero scsi_status:"
+               printk_ratelimited("compare_and_write_callback: non zero scsi_status:"
                        " 0x%02x\n", cmd->scsi_status);
                *post_ret = 1;
                if (cmd->scsi_status == SAM_STAT_CHECK_CONDITION)