| Message ID | 1528758498.3572.16.camel@HansenPartnership.com (mailing list archive) |
|---|---|
| State | Not Applicable |
Will do. It'll be a few weeks, as I have to schedule downtime, but I'll report back my results when it's done.

--Ted

On 06/11/2018 04:08 PM, James Bottomley wrote:
> OK, try this: it will print a rate limited warning if it triggers
> (showing it is this problem) and return ADD_TO_MLQUEUE for all the SAS
> errors (we'll likely narrow this if it works, but for now let's do the
> lot).
>
> James
>
> ---
>
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 8932ae81a15a..94aa5cb94064 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -531,6 +531,11 @@ int scsi_check_sense(struct scsi_cmnd *scmd)
>  		if (sshdr.asc == 0xc1 && sshdr.ascq == 0x01 &&
>  		    sdev->sdev_bflags & BLIST_RETRY_ASC_C1)
>  			return ADD_TO_MLQUEUE;
> +		if (sshdr.asc == 0x4b) {
> +			printk_ratelimited(KERN_WARNING "SAS/SATA link retry\n");
> +			return ADD_TO_MLQUEUE;
> +		}
> +
>
>  		return NEEDS_RETRY;
>  	case NOT_READY:
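The check James adds keys off the additional sense code (ASC) reported by the device; ASC 0x4b is the "data phase error" family that SAS transports use for connection and ACK/NAK problems, and `sshdr` is the already-parsed sense header the kernel builds from the raw sense buffer. As a point of reference, here is a minimal user-space sketch (not kernel code; the struct and function names are invented for illustration) of where the sense key, ASC, and ASCQ sit in fixed- and descriptor-format sense data:

```c
/*
 * Illustrative user-space sketch only: shows where the ASC/ASCQ bytes the
 * patch above tests (asc == 0x4b) live in a SCSI sense buffer.  Layouts
 * follow the SPC fixed (0x70/0x71) and descriptor (0x72/0x73) formats.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct sense_fields {
        uint8_t key;    /* sense key, e.g. 0x0b = ABORTED COMMAND */
        uint8_t asc;    /* additional sense code */
        uint8_t ascq;   /* additional sense code qualifier */
};

/* Decode either fixed- or descriptor-format sense data. */
static int decode_sense(const uint8_t *sb, size_t len, struct sense_fields *out)
{
        if (len < 8)
                return -1;

        switch (sb[0] & 0x7f) {
        case 0x70: case 0x71:                   /* fixed format */
                out->key  = sb[2] & 0x0f;
                out->asc  = len > 12 ? sb[12] : 0;
                out->ascq = len > 13 ? sb[13] : 0;
                return 0;
        case 0x72: case 0x73:                   /* descriptor format */
                out->key  = sb[1] & 0x0f;
                out->asc  = sb[2];
                out->ascq = sb[3];
                return 0;
        default:
                return -1;
        }
}

int main(void)
{
        /* Fixed-format sense: ABORTED COMMAND with ASC 0x4b. */
        const uint8_t sense[18] = { 0x70, 0, 0x0b, 0, 0, 0, 0, 10,
                                    0, 0, 0, 0, 0x4b, 0x00 };
        struct sense_fields f;

        if (decode_sense(sense, sizeof(sense), &f) == 0)
                printf("key=%#x asc=%#x ascq=%#x -> %s\n", f.key, f.asc, f.ascq,
                       f.asc == 0x4b ? "requeue (transport glitch?)" : "normal retry");
        return 0;
}
```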
On 06/11/2018 04:08 PM, James Bottomley wrote:
> On Mon, 2018-06-11 at 14:59 -0700, Ted Cabeen wrote:
>> On 06/11/2018 02:40 PM, James Bottomley wrote:
>>> On Mon, 2018-06-11 at 12:20 -0400, Douglas Gilbert wrote:
>>>> I have also seen Aborted Command sense when doing heavy testing
>>>> on one or more SAS disks behind a SAS expander. I put it down to
>>>> a temporary lack of paths available (on the link between the
>>>> host's HBA and the expander) when one of those SAS disks tries to
>>>> get a connection back to the host with the data (data-in
>>>> transfer) from an earlier READ command.
>>>>
>>>> In my code (ddpt and sg_dd) I treat it as a "retry" type error
>>>> and in my experience that works. IOW a follow-up READ with the
>>>> same parameters is successful.
>>>
>>> We do treat ABORTED_COMMAND as a retry. However, it will tick down
>>> the retry count (usually 3) and then fail if it still occurs. How
>>> long does this condition persist for? because if it's long lived we
>>> could treat it as ADD_TO_MLQUEUE which would mean we'd retry until
>>> the timeout condition was reached.
>>
>> On my system, it's a bit hard to tell, as as soon as ZFS sees the
>> read error, it starts resilvering to repair the sector that reported
>> the I/O error. Without the scrub, it happened once over a 5-day
>> window. During the scrub, it was usually 10s of minutes between
>> occurrences that failed all the retries, but I had some occasions
>> where it happened about 5-10 minutes apart. It definitely seems to
>> be load-related, so how long and hard the load stays elevated is a
>> factor.
>
> OK, try this: it will print a rate limited warning if it triggers
> (showing it is this problem) and return ADD_TO_MLQUEUE for all the SAS
> errors (we'll likely narrow this if it works, but for now let's do the
> lot).

I replaced the HBA in this system with a new one, and the problem resolved, so this was an intermittent hardware issue, and not software-related. Thanks for digging in with me, it helped a lot to fully understand the software side.

--Ted
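For context on the user-space behaviour Douglas describes, a rough sketch of the "retry on ABORTED COMMAND" policy using the Linux SG_IO interface is below. The function name, retry limit, device node, and the fixed 512-byte block size are assumptions for illustration only; this is not ddpt's or sg_dd's actual code.

```c
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Issue a READ(10) via SG_IO; retry a few times if the device reports
 * ABORTED COMMAND sense (key 0x0b).  Retry count and block size are
 * illustrative only.
 */
static int read_lba(int fd, unsigned int lba, unsigned char *buf, int blocks)
{
        unsigned char cdb[10] = { 0x28, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
        unsigned char sense[32];
        struct sg_io_hdr io;

        cdb[2] = lba >> 24; cdb[3] = lba >> 16; cdb[4] = lba >> 8; cdb[5] = lba;
        cdb[7] = blocks >> 8; cdb[8] = blocks;

        for (int attempt = 0; attempt < 4; attempt++) {
                memset(&io, 0, sizeof(io));
                io.interface_id = 'S';
                io.cmd_len = sizeof(cdb);
                io.cmdp = cdb;
                io.dxfer_direction = SG_DXFER_FROM_DEV;
                io.dxferp = buf;
                io.dxfer_len = blocks * 512;    /* assumes 512-byte blocks */
                io.sbp = sense;
                io.mx_sb_len = sizeof(sense);
                io.timeout = 60000;             /* ms */

                if (ioctl(fd, SG_IO, &io) < 0)
                        return -1;
                if ((io.info & SG_INFO_OK_MASK) == SG_INFO_OK)
                        return 0;               /* transfer succeeded */

                /* Fixed-format sense: key in byte 2; 0x0b == ABORTED COMMAND. */
                if (io.sb_len_wr > 2 && (sense[2] & 0x0f) == 0x0b) {
                        fprintf(stderr, "aborted command, retrying (attempt %d)\n",
                                attempt + 1);
                        continue;
                }
                return -1;                      /* some other error */
        }
        return -1;
}

int main(void)
{
        unsigned char buf[8 * 512];
        int fd = open("/dev/sg0", O_RDWR);      /* example device node */

        if (fd < 0 || read_lba(fd, 0, buf, 8) < 0)
                perror("read_lba");
        return 0;
}
```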
```diff
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..94aa5cb94064 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -531,6 +531,11 @@ int scsi_check_sense(struct scsi_cmnd *scmd)
 		if (sshdr.asc == 0xc1 && sshdr.ascq == 0x01 &&
 		    sdev->sdev_bflags & BLIST_RETRY_ASC_C1)
 			return ADD_TO_MLQUEUE;
+		if (sshdr.asc == 0x4b) {
+			printk_ratelimited(KERN_WARNING "SAS/SATA link retry\n");
+			return ADD_TO_MLQUEUE;
+		}
+
 
 		return NEEDS_RETRY;
 	case NOT_READY:
```
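The practical difference between the two dispositions, as James describes above, is that NEEDS_RETRY ticks down the command's retry budget (usually a handful of attempts) and then fails the I/O, whereas ADD_TO_MLQUEUE requeues the command without charging a retry, so it keeps being reissued until the timeout condition is reached. A toy stand-alone model of that distinction (invented constants, not kernel code):

```c
/*
 * Toy model of retry budget vs. requeue-until-timeout.  The constants and
 * the handle_error() helper are made up for illustration; only the names of
 * the dispositions come from the kernel.
 */
#include <stdbool.h>
#include <stdio.h>

enum disposition { NEEDS_RETRY, ADD_TO_MLQUEUE, FAILED };

struct cmd {
        int retries;    /* retries consumed so far */
        int allowed;    /* retry budget, like scmd->allowed */
        int deadline;   /* remaining "time" before the overall timeout */
};

/* One round of error handling for a command that keeps hitting the
 * transient SAS error.  'requeue' selects ADD_TO_MLQUEUE behaviour. */
static enum disposition handle_error(struct cmd *c, bool requeue)
{
        if (c->deadline-- <= 0)
                return FAILED;                  /* timeout: give up either way */
        if (requeue)
                return ADD_TO_MLQUEUE;          /* requeued, budget untouched */
        if (++c->retries > c->allowed)
                return FAILED;                  /* budget exhausted */
        return NEEDS_RETRY;
}

int main(void)
{
        struct cmd a = { .allowed = 3, .deadline = 20 };
        struct cmd b = { .allowed = 3, .deadline = 20 };
        int tries_a = 0, tries_b = 0;

        while (handle_error(&a, false) != FAILED) tries_a++;
        while (handle_error(&b, true) != FAILED) tries_b++;

        printf("NEEDS_RETRY path gave up after %d attempts\n", tries_a);
        printf("ADD_TO_MLQUEUE path kept going for %d attempts\n", tries_b);
        return 0;
}
```

With the numbers above, the retry-budget path gives up after 3 attempts, while the requeue path keeps retrying until the simulated deadline expires, which is why ADD_TO_MLQUEUE rides out a longer-lived transport glitch.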