Message ID | 1631711048-6177-1-git-send-email-wenxiong@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [V3,1/1] scsi/ses: Saw "Failed to get diagnostic page 0x1" | expand |
On Wed, 2021-09-15 at 08:04 -0500, wenxiong@linux.vnet.ibm.com wrote: > From: Wen Xiong <wenxiong@linux.vnet.ibm.com> > > Setting scsi logging level with error=3, we saw some errors from > enclosues: > > [108017.360833] ses 0:0:9:0: tag#641 Done: NEEDS_RETRY Result: > hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s > [108017.360838] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c 01 01 > 00 20 00 > [108017.427778] ses 0:0:9:0: Power-on or device reset occurred > [108017.427784] ses 0:0:9:0: tag#641 Done: SUCCESS Result: > hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s > [108017.427788] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c 01 01 > 00 20 00 > [108017.427791] ses 0:0:9:0: tag#641 Sense Key : Unit Attention > [current] > [108017.427793] ses 0:0:9:0: tag#641 Add. Sense: Bus device reset > function occurred > [108017.427801] ses 0:0:9:0: Failed to get diagnostic page 0x1 > [108017.427804] ses 0:0:9:0: Failed to bind enclosure -19 > [108017.427895] ses 0:0:10:0: Attached Enclosure device > [108017.427942] ses 0:0:10:0: Attached scsi generic sg18 type 13 > > As Martin's suggestion, the patch checks to retry on NOT_READY as > well as > UNIT_ATTENTION with ASC 0x29. This looks fine to me. I think the reason expecting_cc_ua doesn't work for you is that you're getting > 1 reset per command. expecting_cc_ua automatically resets after eating the first unit attention. Reviewed-by: James Bottomley <jejb@linux.ibm.com> James
On 2021-09-15 10:43 a.m., James Bottomley wrote: > On Wed, 2021-09-15 at 08:04 -0500, wenxiong@linux.vnet.ibm.com wrote: >> From: Wen Xiong <wenxiong@linux.vnet.ibm.com> >> >> Setting scsi logging level with error=3, we saw some errors from >> enclosues: >> >> [108017.360833] ses 0:0:9:0: tag#641 Done: NEEDS_RETRY Result: >> hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s >> [108017.360838] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c 01 01 >> 00 20 00 >> [108017.427778] ses 0:0:9:0: Power-on or device reset occurred >> [108017.427784] ses 0:0:9:0: tag#641 Done: SUCCESS Result: >> hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s >> [108017.427788] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c 01 01 >> 00 20 00 >> [108017.427791] ses 0:0:9:0: tag#641 Sense Key : Unit Attention >> [current] >> [108017.427793] ses 0:0:9:0: tag#641 Add. Sense: Bus device reset >> function occurred >> [108017.427801] ses 0:0:9:0: Failed to get diagnostic page 0x1 >> [108017.427804] ses 0:0:9:0: Failed to bind enclosure -19 >> [108017.427895] ses 0:0:10:0: Attached Enclosure device >> [108017.427942] ses 0:0:10:0: Attached scsi generic sg18 type 13 >> >> As Martin's suggestion, the patch checks to retry on NOT_READY as >> well as >> UNIT_ATTENTION with ASC 0x29. > > This looks fine to me. I think the reason expecting_cc_ua doesn't work > for you is that you're getting > 1 reset per command. expecting_cc_ua > automatically resets after eating the first unit attention. Rather that simply consuming UAs, what do you think of a fixed length FIFO, say 8 entries, that holds the asc,ascq of the last 8 UAs together with a timestamp of when it was received (with a boot time epoch). Then allow the user space to see that FIFO via sysfs (e.g. under /sys/class/scsi_device/<hctl>). Remembering the previous UA may also be useful for the mid-level UA processing. For example after a firmware upgrade, there may be UAs for both INQUIRY data change and device reset. Also the first device reset after a reboot (power cycle) is expected, having one later, for example when part of a disk is mounted, is a bit more suspicious. Doug Gilbert
On Wed, 2021-09-15 at 13:55 -0400, Douglas Gilbert wrote: > On 2021-09-15 10:43 a.m., James Bottomley wrote: > > On Wed, 2021-09-15 at 08:04 -0500, wenxiong@linux.vnet.ibm.com > > wrote: > > > From: Wen Xiong <wenxiong@linux.vnet.ibm.com> > > > > > > Setting scsi logging level with error=3, we saw some errors from > > > enclosues: > > > > > > [108017.360833] ses 0:0:9:0: tag#641 Done: NEEDS_RETRY Result: > > > hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s > > > [108017.360838] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c > > > 01 01 > > > 00 20 00 > > > [108017.427778] ses 0:0:9:0: Power-on or device reset occurred > > > [108017.427784] ses 0:0:9:0: tag#641 Done: SUCCESS Result: > > > hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s > > > [108017.427788] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c > > > 01 01 > > > 00 20 00 > > > [108017.427791] ses 0:0:9:0: tag#641 Sense Key : Unit Attention > > > [current] > > > [108017.427793] ses 0:0:9:0: tag#641 Add. Sense: Bus device reset > > > function occurred > > > [108017.427801] ses 0:0:9:0: Failed to get diagnostic page 0x1 > > > [108017.427804] ses 0:0:9:0: Failed to bind enclosure -19 > > > [108017.427895] ses 0:0:10:0: Attached Enclosure device > > > [108017.427942] ses 0:0:10:0: Attached scsi generic sg18 type 13 > > > > > > As Martin's suggestion, the patch checks to retry on NOT_READY as > > > well as UNIT_ATTENTION with ASC 0x29. > > > > This looks fine to me. I think the reason expecting_cc_ua doesn't > > work for you is that you're getting > 1 reset per > > command. expecting_cc_ua automatically resets after eating the > > first unit attention. > > Rather that simply consuming UAs, what do you think of a fixed length > FIFO, say 8 entries, that holds the asc,ascq of the last 8 UAs > together with a timestamp of when it was received (with a boot time > epoch). Then allow the user space to see that FIFO via sysfs (e.g. > under /sys/class/scsi_device/<hctl>). Remembering the previous UA may > also be useful for the mid-level UA processing. For example after a > firmware upgrade, there may be UAs for both INQUIRY data change and > device reset. I don't really see how it would be useful. On the available debugging information (which is a bit scant) it looks like a single device problem (as usual) ... possibly not clearing the condition after the REQUEST_SENSE because I can't believe another reset went over the bus immediately after the first UA/CC was received. > Also the first device reset after a reboot (power cycle) is expected, > having one later, for example when part of a disk is mounted, is a > bit more suspicious. It's definitely hugely suspicious, but tying it to ASC/ASQC 0x29/0x00 should limit the fallout considerably. James
diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c index c2afba2a5414..7b6efe4ccc13 100644 --- a/drivers/scsi/ses.c +++ b/drivers/scsi/ses.c @@ -87,9 +87,16 @@ static int ses_recv_diag(struct scsi_device *sdev, int page_code, 0 }; unsigned char recv_page_code; + struct scsi_sense_hdr sshdr; + int retries = SES_RETRIES; + + do { + ret = scsi_execute_req(sdev, cmd, DMA_FROM_DEVICE, buf, bufflen, + &sshdr, SES_TIMEOUT, 1, NULL); + } while (scsi_sense_valid(&sshdr) && --retries && + (sshdr.sense_key == NOT_READY || + (sshdr.sense_key == UNIT_ATTENTION && sshdr.asc == 0x29))); - ret = scsi_execute_req(sdev, cmd, DMA_FROM_DEVICE, buf, bufflen, - NULL, SES_TIMEOUT, SES_RETRIES, NULL); if (unlikely(ret)) return ret; @@ -121,9 +128,16 @@ static int ses_send_diag(struct scsi_device *sdev, int page_code, bufflen & 0xff, 0 }; + struct scsi_sense_hdr sshdr; + int retries = SES_RETRIES; + + do { + result = scsi_execute_req(sdev, cmd, DMA_FROM_DEVICE, buf, bufflen, + &sshdr, SES_TIMEOUT, 1, NULL); + } while (scsi_sense_valid(&sshdr) && --retries && + (sshdr.sense_key == NOT_READY || + (sshdr.sense_key == UNIT_ATTENTION && sshdr.asc == 0x29))); - result = scsi_execute_req(sdev, cmd, DMA_TO_DEVICE, buf, bufflen, - NULL, SES_TIMEOUT, SES_RETRIES, NULL); if (result) sdev_printk(KERN_ERR, sdev, "SEND DIAGNOSTIC result: %8x\n", result);