| Message ID | 1528758498.3572.16.camel@HansenPartnership.com (mailing list archive) |
|---|---|
| State | Not Applicable |
Will do. It'll be a few weeks, as I have to schedule downtime, but I'll report back my results when it's done.

--Ted

On 06/11/2018 04:08 PM, James Bottomley wrote:
> OK, try this: it will print a rate limited warning if it triggers
> (showing it is this problem) and return ADD_TO_MLQUEUE for all the SAS
> errors (we'll likely narrow this if it works, but for now let's do the
> lot).
>
> James
>
> ---
>
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 8932ae81a15a..94aa5cb94064 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -531,6 +531,11 @@ int scsi_check_sense(struct scsi_cmnd *scmd)
>  		if (sshdr.asc == 0xc1 && sshdr.ascq == 0x01 &&
>  		    sdev->sdev_bflags & BLIST_RETRY_ASC_C1)
>  			return ADD_TO_MLQUEUE;
> +		if (sshdr.asc == 0x4b) {
> +			printk_ratelimited(KERN_WARNING "SAS/SATA link retry\n");
> +			return ADD_TO_MLQUEUE;
> +		}
> +
>
>  		return NEEDS_RETRY;
>  	case NOT_READY:
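The check James adds keys off the additional sense code (ASC) reported by the device; ASC 0x4b is the "data phase error" family that SAS transports use for connection and ACK/NAK problems, and `sshdr` is the already-parsed sense header the kernel builds from the raw sense buffer. As a point of reference, here is a minimal user-space sketch (not kernel code; the struct and function names are invented for illustration) of where the sense key, ASC, and ASCQ sit in fixed- and descriptor-format sense data:

```c
/*
 * Illustrative user-space sketch only: shows where the ASC/ASCQ bytes the
 * patch above tests (asc == 0x4b) live in a SCSI sense buffer.  Layouts
 * follow the SPC fixed (0x70/0x71) and descriptor (0x72/0x73) formats.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct sense_fields {
        uint8_t key;    /* sense key, e.g. 0x0b = ABORTED COMMAND */
        uint8_t asc;    /* additional sense code */
        uint8_t ascq;   /* additional sense code qualifier */
};

/* Decode either fixed- or descriptor-format sense data. */
static int decode_sense(const uint8_t *sb, size_t len, struct sense_fields *out)
{
        if (len < 8)
                return -1;

        switch (sb[0] & 0x7f) {
        case 0x70: case 0x71:                   /* fixed format */
                out->key  = sb[2] & 0x0f;
                out->asc  = len > 12 ? sb[12] : 0;
                out->ascq = len > 13 ? sb[13] : 0;
                return 0;
        case 0x72: case 0x73:                   /* descriptor format */
                out->key  = sb[1] & 0x0f;
                out->asc  = sb[2];
                out->ascq = sb[3];
                return 0;
        default:
                return -1;
        }
}

int main(void)
{
        /* Fixed-format sense: ABORTED COMMAND with ASC 0x4b. */
        const uint8_t sense[18] = { 0x70, 0, 0x0b, 0, 0, 0, 0, 10,
                                    0, 0, 0, 0, 0x4b, 0x00 };
        struct sense_fields f;

        if (decode_sense(sense, sizeof(sense), &f) == 0)
                printf("key=%#x asc=%#x ascq=%#x -> %s\n", f.key, f.asc, f.ascq,
                       f.asc == 0x4b ? "requeue (transport glitch?)" : "normal retry");
        return 0;
}
```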
On 06/11/2018 04:08 PM, James Bottomley wrote:
> On Mon, 2018-06-11 at 14:59 -0700, Ted Cabeen wrote:
>> On 06/11/2018 02:40 PM, James Bottomley wrote:
>>> On Mon, 2018-06-11 at 12:20 -0400, Douglas Gilbert wrote:
>>>> I have also seen Aborted Command sense when doing heavy testing
>>>> on one or more SAS disks behind a SAS expander. I put it down to
>>>> a temporary lack of paths available (on the link between the
>>>> host's HBA and the expander) when one of those SAS disks tries to
>>>> get a connection back to the host with the data (data-in
>>>> transfer) from an earlier READ command.
>>>>
>>>> In my code (ddpt and sg_dd) I treat it as a "retry" type error
>>>> and in my experience that works. IOW a follow-up READ with the
>>>> same parameters is successful.
>>>
>>> We do treat ABORTED_COMMAND as a retry. However, it will tick down
>>> the retry count (usually 3) and then fail if it still occurs. How
>>> long does this condition persist for? because if it's long lived we
>>> could treat it as ADD_TO_MLQUEUE which would mean we'd retry until
>>> the timeout condition was reached.
>>
>> On my system, it's a bit hard to tell, as as soon as ZFS sees the
>> read error, it starts resilvering to repair the sector that reported
>> the I/O error. Without the scrub, it happened once over a 5-day
>> window. During the scrub, it was usually 10s of minutes between
>> occurrences that failed all the retries, but I had some occasions
>> where it happened about 5-10 minutes apart. It definitely seems to
>> be load-related, so how long and hard the load stays elevated is a
>> factor.
>
> OK, try this: it will print a rate limited warning if it triggers
> (showing it is this problem) and return ADD_TO_MLQUEUE for all the SAS
> errors (we'll likely narrow this if it works, but for now let's do the
> lot).

I replaced the HBA in this system with a new one, and the problem resolved, so this was an intermittent hardware issue, and not software-related. Thanks for digging in with me, it helped a lot to fully understand the software side.

--Ted
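For context on the user-space behaviour Douglas describes, a rough sketch of the "retry on ABORTED COMMAND" policy using the Linux SG_IO interface is below. The function name, retry limit, device node, and the fixed 512-byte block size are assumptions for illustration only; this is not ddpt's or sg_dd's actual code.

```c
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Issue a READ(10) via SG_IO; retry a few times if the device reports
 * ABORTED COMMAND sense (key 0x0b).  Retry count and block size are
 * illustrative only.
 */
static int read_lba(int fd, unsigned int lba, unsigned char *buf, int blocks)
{
        unsigned char cdb[10] = { 0x28, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
        unsigned char sense[32];
        struct sg_io_hdr io;

        cdb[2] = lba >> 24; cdb[3] = lba >> 16; cdb[4] = lba >> 8; cdb[5] = lba;
        cdb[7] = blocks >> 8; cdb[8] = blocks;

        for (int attempt = 0; attempt < 4; attempt++) {
                memset(&io, 0, sizeof(io));
                io.interface_id = 'S';
                io.cmd_len = sizeof(cdb);
                io.cmdp = cdb;
                io.dxfer_direction = SG_DXFER_FROM_DEV;
                io.dxferp = buf;
                io.dxfer_len = blocks * 512;    /* assumes 512-byte blocks */
                io.sbp = sense;
                io.mx_sb_len = sizeof(sense);
                io.timeout = 60000;             /* ms */

                if (ioctl(fd, SG_IO, &io) < 0)
                        return -1;
                if ((io.info & SG_INFO_OK_MASK) == SG_INFO_OK)
                        return 0;               /* transfer succeeded */

                /* Fixed-format sense: key in byte 2; 0x0b == ABORTED COMMAND. */
                if (io.sb_len_wr > 2 && (sense[2] & 0x0f) == 0x0b) {
                        fprintf(stderr, "aborted command, retrying (attempt %d)\n",
                                attempt + 1);
                        continue;
                }
                return -1;                      /* some other error */
        }
        return -1;
}

int main(void)
{
        unsigned char buf[8 * 512];
        int fd = open("/dev/sg0", O_RDWR);      /* example device node */

        if (fd < 0 || read_lba(fd, 0, buf, 8) < 0)
                perror("read_lba");
        return 0;
}
```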
```diff
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..94aa5cb94064 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -531,6 +531,11 @@ int scsi_check_sense(struct scsi_cmnd *scmd)
 		if (sshdr.asc == 0xc1 && sshdr.ascq == 0x01 &&
 		    sdev->sdev_bflags & BLIST_RETRY_ASC_C1)
 			return ADD_TO_MLQUEUE;
+		if (sshdr.asc == 0x4b) {
+			printk_ratelimited(KERN_WARNING "SAS/SATA link retry\n");
+			return ADD_TO_MLQUEUE;
+		}
+
 
 		return NEEDS_RETRY;
 	case NOT_READY:
```
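The practical difference between the two dispositions, as James describes above, is that NEEDS_RETRY ticks down the command's retry budget (usually a handful of attempts) and then fails the I/O, whereas ADD_TO_MLQUEUE requeues the command without charging a retry, so it keeps being reissued until the timeout condition is reached. A toy stand-alone model of that distinction (invented constants, not kernel code):

```c
/*
 * Toy model of retry budget vs. requeue-until-timeout.  The constants and
 * the handle_error() helper are made up for illustration; only the names of
 * the dispositions come from the kernel.
 */
#include <stdbool.h>
#include <stdio.h>

enum disposition { NEEDS_RETRY, ADD_TO_MLQUEUE, FAILED };

struct cmd {
        int retries;    /* retries consumed so far */
        int allowed;    /* retry budget, like scmd->allowed */
        int deadline;   /* remaining "time" before the overall timeout */
};

/* One round of error handling for a command that keeps hitting the
 * transient SAS error.  'requeue' selects ADD_TO_MLQUEUE behaviour. */
static enum disposition handle_error(struct cmd *c, bool requeue)
{
        if (c->deadline-- <= 0)
                return FAILED;                  /* timeout: give up either way */
        if (requeue)
                return ADD_TO_MLQUEUE;          /* requeued, budget untouched */
        if (++c->retries > c->allowed)
                return FAILED;                  /* budget exhausted */
        return NEEDS_RETRY;
}

int main(void)
{
        struct cmd a = { .allowed = 3, .deadline = 20 };
        struct cmd b = { .allowed = 3, .deadline = 20 };
        int tries_a = 0, tries_b = 0;

        while (handle_error(&a, false) != FAILED) tries_a++;
        while (handle_error(&b, true) != FAILED) tries_b++;

        printf("NEEDS_RETRY path gave up after %d attempts\n", tries_a);
        printf("ADD_TO_MLQUEUE path kept going for %d attempts\n", tries_b);
        return 0;
}
```

With the numbers above, the retry-budget path gives up after 3 attempts, while the requeue path keeps retrying until the simulated deadline expires, which is why ADD_TO_MLQUEUE rides out a longer-lived transport glitch.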