From patchwork Mon Jun 11 23:08:18 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: James Bottomley
X-Patchwork-Id: 10458955
Message-ID: <1528758498.3572.16.camel@HansenPartnership.com>
Subject: Re: sd 6:0:0:0: [sdb] Unaligned partial completion
From: James Bottomley
To: Ted Cabeen, dgilbert@interlog.com, linux-scsi@vger.kernel.org
Date: Mon, 11 Jun 2018 16:08:18 -0700
In-Reply-To: <55517505-06f2-c0bc-0b61-18149954cfc9@lscg.ucsb.edu>
References: <5b81824e-3a7f-e1df-e8d3-07e258e31af3@interlog.com>
 <1528753251.3572.3.camel@HansenPartnership.com>
 <55517505-06f2-c0bc-0b61-18149954cfc9@lscg.ucsb.edu>
X-Mailer: Evolution 3.22.6
Sender: linux-scsi-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-scsi@vger.kernel.org

On Mon, 2018-06-11 at 14:59 -0700, Ted Cabeen wrote:
> On 06/11/2018 02:40 PM, James Bottomley wrote:
> > On Mon, 2018-06-11 at 12:20 -0400, Douglas Gilbert wrote:
> > > I have also seen Aborted Command sense when doing heavy testing
> > > on one or more SAS disks behind a SAS expander. I put it down to
> > > a temporary lack of paths available (on the link between the
> > > host's HBA and the expander) when one of those SAS disks tries to
> > > get a connection back to the host with the data (data-in
> > > transfer) from an earlier READ command.
> > >
> > > In my code (ddpt and sg_dd) I treat it as a "retry" type error
> > > and in my experience that works. IOW a follow-up READ with the
> > > same parameters is successful.
> >
> > We do treat ABORTED_COMMAND as a retry.  However, it will tick down
> > the retry count (usually 3) and then fail if it still occurs.  How
> > long does this condition persist for? because if it's long lived we
> > could treat it as ADD_TO_MLQUEUE which would mean we'd retry until
> > the timeout condition was reached.
>
> On my system, it's a bit hard to tell, as as soon as ZFS sees the
> read error, it starts resilvering to repair the sector that reported
> the I/O error.  Without the scrub, it happened once over a 5-day
> window.  During the scrub, it was usually 10s of minutes between
> occurrences that failed all the retries, but I had some occasions
> where it happened about 5-10 minutes apart.  It definitely seems to
> be load-related, so how long and hard the load stays elevated is a
> factor.

OK, try this: it will print a rate-limited warning if it triggers
(showing it is this problem) and return ADD_TO_MLQUEUE for all the SAS
errors (we'll likely narrow this if it works, but for now let's do the
lot).

James

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..94aa5cb94064 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -531,6 +531,11 @@ int scsi_check_sense(struct scsi_cmnd *scmd)
 		if (sshdr.asc == 0xc1 && sshdr.ascq == 0x01 &&
 		    sdev->sdev_bflags & BLIST_RETRY_ASC_C1)
 			return ADD_TO_MLQUEUE;
 
+		if (sshdr.asc == 0x4b) {
+			printk_ratelimited(KERN_WARNING "SAS/SATA link retry\n");
+			return ADD_TO_MLQUEUE;
+		}
+
 		return NEEDS_RETRY;
 	case NOT_READY:
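
For anyone following the thread, here is a rough userspace model of the difference between the two dispositions discussed above. It is only a sketch, not kernel code: the enum names mirror the kernel's, but classify_aborted(), command_succeeds() and the max_requeues bound (a stand-in for the command timeout) are hypothetical, and the retry budget of 3 is just the "usually 3" figure quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: names mirror the kernel's, values are arbitrary. */
enum disposition { NEEDS_RETRY, ADD_TO_MLQUEUE };

/* Models the patched ABORTED_COMMAND branch for the one ASC the patch adds. */
static enum disposition classify_aborted(unsigned char asc)
{
	return asc == 0x4b ? ADD_TO_MLQUEUE : NEEDS_RETRY;
}

/*
 * Returns true if the command eventually completes.
 *
 * failures:     consecutive attempts that hit the transient error
 * max_retries:  budget consumed only by NEEDS_RETRY (assumed 3 by callers)
 * max_requeues: hypothetical bound standing in for the command timeout
 */
static bool command_succeeds(enum disposition d, int failures,
			     int max_retries, int max_requeues)
{
	int retries = 0, requeues = 0;

	assert(failures >= 0);

	for (int attempt = 0; ; attempt++) {
		if (attempt >= failures)
			return true;	/* transient condition has cleared */

		if (d == NEEDS_RETRY) {
			if (++retries > max_retries)
				return false;	/* budget exhausted: I/O error */
		} else {
			if (++requeues > max_requeues)
				return false;	/* modelled timeout reached */
		}
	}
}
```

With a budget of 3, a condition that outlasts four consecutive attempts fails under NEEDS_RETRY but completes under ADD_TO_MLQUEUE as long as it clears before the modelled timeout, which is the behaviour the patch is meant to test.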