From patchwork Mon Jun 11 23:08:18 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: James Bottomley <James.Bottomley@HansenPartnership.com>
X-Patchwork-Id: 10458955
Message-ID: <1528758498.3572.16.camel@HansenPartnership.com>
Subject: Re: sd 6:0:0:0: [sdb] Unaligned partial completion
From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Ted Cabeen <ted.cabeen@lscg.ucsb.edu>, dgilbert@interlog.com,
	linux-scsi@vger.kernel.org
Date: Mon, 11 Jun 2018 16:08:18 -0700
In-Reply-To: <55517505-06f2-c0bc-0b61-18149954cfc9@lscg.ucsb.edu>
References: <ef458835-1dfe-77c3-0ddb-e1458193f3af@lscg.ucsb.edu>
	<5b81824e-3a7f-e1df-e8d3-07e258e31af3@interlog.com>
	<1528753251.3572.3.camel@HansenPartnership.com>
	<55517505-06f2-c0bc-0b61-18149954cfc9@lscg.ucsb.edu>
X-Mailer: Evolution 3.22.6 
Mime-Version: 1.0
Sender: linux-scsi-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-scsi.vger.kernel.org>
X-Mailing-List: linux-scsi@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

On Mon, 2018-06-11 at 14:59 -0700, Ted Cabeen wrote:
> On 06/11/2018 02:40 PM, James Bottomley wrote:
> > On Mon, 2018-06-11 at 12:20 -0400, Douglas Gilbert wrote:
> > > I have also seen Aborted Command sense when doing heavy testing
> > > on one or more SAS disks behind a SAS expander. I put it down to
> > > a temporary lack of paths available (on the link between the
> > > host's HBA and the expander) when one of those SAS disks tries to
> > > get a connection back to the host with the data (data-in
> > > transfer) from an earlier READ command.
> > > 
> > > In my code (ddpt and sg_dd) I treat it as a "retry" type error
> > > and in my experience that works. IOW a follow-up READ with the
> > > same parameters is successful.
> > 
> > We do treat ABORTED_COMMAND as a retry.  However, it will tick down
> > the retry count (usually 3) and then fail if it still occurs.  How
> > long does this condition persist for? because if it's long lived we
> > could treat it as ADD_TO_MLQUEUE which would mean we'd retry until
> > the timeout condition was reached.
> 
> On my system, it's a bit hard to tell, as as soon as ZFS sees the
> read error, it starts resilvering to repair the sector that reported
> the I/O error.  Without the scrub, it happened once over a 5-day
> window.  During the scrub, it was usually 10s of minutes between
> occurrences that failed all the retries, but I had some occasions
> where it happened about 5-10 minutes apart.  It definitely seems to
> be load-related, so how long and hard the load stays elevated is a
> factor.

OK, try this patch. When it triggers it prints a rate-limited warning
(which confirms it is this problem) and returns ADD_TO_MLQUEUE for every
sense code with ASC 0x4b, i.e. all the SAS errors. We'll likely narrow
this if it works, but for now let's do the lot.
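
To make the difference between the two dispositions concrete, here is a
small userspace sketch. It is not kernel code: the enum, the struct, and
simulate() are hypothetical names invented for illustration, and the
command timeout is modelled crudely as a cap on the number of attempts.
It assumes a command that fails on its first bad_attempts issues and
succeeds afterwards.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two dispositions discussed above. */
enum disposition { DISP_NEEDS_RETRY, DISP_ADD_TO_MLQUEUE };

struct outcome {
	bool completed;	/* true if the command eventually succeeded */
	int attempts;	/* how many times the command was issued */
};

/*
 * Issue a command that fails with a transient link error on its first
 * bad_attempts issues. With DISP_NEEDS_RETRY each failure uses up one of
 * allowed_retries; with DISP_ADD_TO_MLQUEUE the command is simply
 * requeued until the (simplified) timeout of timeout_attempts is hit.
 */
static struct outcome simulate(enum disposition disp, int bad_attempts,
			       int allowed_retries, int timeout_attempts)
{
	struct outcome out = { false, 0 };
	int retries = 0;

	assert(allowed_retries >= 0 && timeout_attempts > 0);

	while (out.attempts < timeout_attempts) {
		out.attempts++;
		if (out.attempts > bad_attempts) {
			out.completed = true;	/* the link recovered */
			break;
		}
		if (disp == DISP_NEEDS_RETRY && ++retries > allowed_retries)
			break;	/* retry budget used up: I/O error */
		/* DISP_ADD_TO_MLQUEUE: requeue, budget untouched */
	}
	return out;
}
```

With a retry budget of 3, a condition that persists for five issues
fails under DISP_NEEDS_RETRY after four attempts, while
DISP_ADD_TO_MLQUEUE keeps requeueing and completes on the sixth, as long
as the timeout has not been reached first.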

James

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..94aa5cb94064 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -531,6 +531,11 @@ int scsi_check_sense(struct scsi_cmnd *scmd)
 		if (sshdr.asc == 0xc1 && sshdr.ascq == 0x01 &&
 		    sdev->sdev_bflags & BLIST_RETRY_ASC_C1)
 			return ADD_TO_MLQUEUE;
+		/* ASC 0x4b: requeue instead of using up a retry */
+		if (sshdr.asc == 0x4b) {
+			printk_ratelimited(KERN_WARNING "SAS/SATA link retry\n");
+			return ADD_TO_MLQUEUE;
+		}
 
 		return NEEDS_RETRY;
 	case NOT_READY: