Message ID | 20171110151344.10563-1-hdegoede@redhat.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
On Fri, 10 Nov 2017 16:13:44 +0100 Hans de Goede <hdegoede@redhat.com> wrote: > Just like all previous UAS capable Seagate disk enclosures, this > one needs a US_FL_NO_ATA_1X quirk. > [...] > +/* Reported-by: Andrey Astafyev <dev@246060.ru> */ Hi Andrey, I notice that you have an external Seagate SMR drive, I am encountering some trouble and was wondering if you (or anyone in CC) could have some feedback. While the drives I have (PID 0xab45, "Backup+ Hub BK") are used without problem with very light loads, I'm starting to get annoyed when doing more serious work. Here is a series of messages (v4.14) I have when simply trying to balance 1 chunk of BTRFS data, ie. write 5 GiB, and waiting a long time in-between: Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#7 uas_eh_abort_handler 0 uas-tag 8 inflight: CMD OUT Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#7 CDB: Write(16) 8a 00 00 00 00 00 00 00 11 00 00 00 00 80 00 00 Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#6 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD OUT Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#6 CDB: Write(16) 8a 00 00 00 00 00 00 00 11 80 00 00 00 80 00 00 Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#5 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#5 CDB: Write(16) 8a 00 00 00 00 00 00 bf fd 80 00 00 00 80 00 00 Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#4 uas_eh_abort_handler 0 uas-tag 5 inflight: CMD OUT Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#4 CDB: Write(16) 8a 00 00 00 00 00 00 bf fd 00 00 00 00 80 00 00 Nov 12 16:15:35 Bidule kernel: scsi host22: uas_eh_device_reset_handler start Nov 12 16:15:35 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 9 using xhci_hcd Nov 12 16:15:35 Bidule kernel: scsi host22: uas_eh_device_reset_handler success Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#3 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD IN Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#3 CDB: Read(16) 88 00 00 00 00 00 01 02 53 80 00 00 00 80 00 00 Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#13 uas_eh_abort_handler 0 uas-tag 14 inflight: CMD OUT Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#13 CDB: Write(16) 8a 00 00 00 00 00 01 02 3b c0 00 00 00 20 00 00 Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 CDB: Write(16) 8a 00 00 00 00 00 01 02 41 00 00 00 01 00 00 00 Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#12 uas_eh_abort_handler 0 uas-tag 13 inflight: CMD OUT Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#12 CDB: Write(16) 8a 00 00 00 00 00 01 02 3c a0 00 00 00 20 00 00 Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#11 uas_eh_abort_handler 0 uas-tag 12 inflight: CMD OUT Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#11 CDB: Write(16) 8a 00 00 00 00 00 01 02 62 00 00 00 00 80 00 00 Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 CDB: Write(16) 8a 00 00 00 00 00 01 02 61 80 00 00 00 80 00 00 Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#10 uas_eh_abort_handler 0 uas-tag 11 inflight: CMD OUT Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#10 CDB: Write(16) 8a 00 00 00 00 00 01 02 61 00 00 00 00 80 00 00 Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#9 uas_eh_abort_handler 0 uas-tag 10 inflight: CMD OUT Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#9 CDB: Write(16) 8a 00 00 00 00 00 01 02 60 80 00 00 00 80 00 00 Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#8 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD OUT Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#8 CDB: Write(16) 8a 00 00 00 00 00 01 02 60 00 00 00 00 80 00 00 Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#7 uas_eh_abort_handler 0 uas-tag 8 inflight: CMD OUT Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#7 CDB: Write(16) 8a 00 00 00 00 00 01 02 5f 80 00 00 00 80 00 00 Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#6 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD OUT Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#6 CDB: Write(16) 8a 00 00 00 00 00 01 02 5f 00 00 00 00 80 00 00 Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#5 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#5 CDB: Write(16) 8a 00 00 00 00 00 01 02 50 00 00 00 00 80 00 00 Nov 12 16:20:28 Bidule kernel: scsi host22: uas_eh_device_reset_handler start Nov 12 16:20:28 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 9 using xhci_hcd Nov 12 16:20:28 Bidule kernel: scsi host22: uas_eh_device_reset_handler success Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 CDB: Write(16) 8a 00 00 00 00 00 20 00 08 00 00 00 00 08 00 00 Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 CDB: Write(16) 8a 00 00 00 00 00 00 02 08 00 00 00 00 08 00 00 Nov 12 16:21:29 Bidule kernel: sd 22:0:0:0: [sdaa] tag#0 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD Nov 12 16:21:29 Bidule kernel: sd 22:0:0:0: [sdaa] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00 Nov 12 16:21:29 Bidule kernel: scsi host22: uas_eh_device_reset_handler start Nov 12 16:21:29 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 9 using xhci_hcd Nov 12 16:21:29 Bidule kernel: scsi host22: uas_eh_device_reset_handler success Do you see such things? I understand that latencies can climb to several seconds as the disks can do GC, and suspect that the issues (sometimes the USB reset doesn't work and a reboot is needed...) are due to some unawareness of that, but I don't know. There are several threads on distribution forums & bug trackers, but nothing really helpful, and I didn't see much on LKML. Regards,
13.11.2017 00:42, Jérôme Carretero пишет: > Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 > uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT > Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 CDB: Write(16) 8a 00 00 00 00 00 20 00 08 00 00 00 00 08 00 00 > Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT > Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 CDB: Write(16) 8a 00 00 00 00 00 00 02 08 00 00 00 00 08 00 00 > Nov 12 16:21:29 Bidule kernel: sd 22:0:0:0: [sdaa] tag#0 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD > Nov 12 16:21:29 Bidule kernel: sd 22:0:0:0: [sdaa] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00 > Nov 12 16:21:29 Bidule kernel: scsi host22: uas_eh_device_reset_handler start > Nov 12 16:21:29 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 9 using xhci_hcd > Nov 12 16:21:29 Bidule kernel: scsi host22: uas_eh_device_reset_handler success > > > Do you see such things? > Hi, I've seen dmesg output like this so I've added my device to quirks list like any other Seagate USB drive.
On Mon, 13 Nov 2017 07:01:30 +0300 Andrey Astafyev <1@246060.ru> wrote: > 13.11.2017 00:42, Jérôme Carretero пишет: > > Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 > > uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT > > [...] > > Do you see such things? > > > Hi, I've seen dmesg output like this so I've added my device to > quirks list like any other Seagate USB drive. Hi Andrey, Hans, For my devices, adding US_FL_NO_ATA_1X to unusual_uas.h didn't change anything, and while adding US_FL_IGNORE_UAS (using quirks=0bc2:ab34:u,0bc2:ab38:u) there are still device resets, but they cause shorter hangs in system activity (~1 second when UAS was more like ~20). Is the US_FL_NO_ATA_1X supposed to get rid of the R/W errors? The first entry I saw about this quirk was so that the device can be used at all (wouldn't be able to mount without it). I'll follow up: when scrubbing the devices after a sequence of experiments, there were checksum errors, so I'll retry with a more reproducible sequence to try and get something more solid. I wouldn't want to disable UAS just to see that reliability has been reduced (one Arch bug report mentions something like that https://bugs.archlinux.org/task/48362). Regards,
13.11.2017 09:14, Jérôme Carretero пишет: > For my devices, adding US_FL_NO_ATA_1X to unusual_uas.h didn't > change anything, and while adding US_FL_IGNORE_UAS (using > quirks=0bc2:ab34:u,0bc2:ab38:u) there are still device resets, > but they cause shorter hangs in system activity (~1 second when > UAS was more like ~20). Maybe you should try like this: quirks=0bc2:ab34:ut,0bc2:ab38:ut ?
On Mon, 13 Nov 2017 09:16:39 +0300 Andrey Astafyev <1@246060.ru> wrote: > 13.11.2017 09:14, Jérôme Carretero пишет: > > For my devices, adding US_FL_NO_ATA_1X to unusual_uas.h didn't > > change anything, and while adding US_FL_IGNORE_UAS (using > > quirks=0bc2:ab34:u,0bc2:ab38:u) there are still device resets, > > but they cause shorter hangs in system activity (~1 second when > > UAS was more like ~20). > > Maybe you should try like this: quirks=0bc2:ab34:ut,0bc2:ab38:ut ? It looks like "ut" would do the same as "u" alone, as US_FL_NO_ATA_1X is only used inside uas.ko. For some reason, first I wanted to go in the .h, and it's only after, when the trial and error went more intense, that I used the command-line parameters. Will follow up later... I hope the drives won't end up costing more than 12Gb/s SAS drives on an expensive HBA... as they are definitely not hassle-free so far. I had gotten the first 8 TB model in 2015 and had issues, but my simple, background workload (attic/borg) could afford to use a workaround of throttling writes*.
Hi, On 13-11-17 07:14, Jérôme Carretero wrote: > On Mon, 13 Nov 2017 07:01:30 +0300 > Andrey Astafyev <1@246060.ru> wrote: > >> 13.11.2017 00:42, Jérôme Carretero пишет: >>> Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 >>> uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT >>> [...] >>> Do you see such things? >>> >> Hi, I've seen dmesg output like this so I've added my device to >> quirks list like any other Seagate USB drive. > > Hi Andrey, Hans, > > > For my devices, adding US_FL_NO_ATA_1X to unusual_uas.h didn't > change anything, and while adding US_FL_IGNORE_UAS (using > quirks=0bc2:ab34:u,0bc2:ab38:u) there are still device resets, > but they cause shorter hangs in system activity (~1 second when > UAS was more like ~20). The errors you are seeing are write errors. If you're seeing these errors with both the usb-storage and uas drivers then there likely is something wrong with your setup / hardware. Does the drive in question use an external power-supply or is it USB bus-powered? If it is the latter then that is likely the problem. Anyways things I would check and try to swap are both the cable used, the power-supply used (if any), the USB-port used as well as trying the disk on a completely different computer. I've the feeling something is busted with your hardware, it could be the disk itself. Did you mention that this was the first release of a new higher capacity ? Those often have some kinks which are worked out in later revisions. Regards, Hans
Hi Hans, On Mon, 13 Nov 2017 10:04:53 +0100 Hans de Goede <hdegoede@redhat.com> wrote: > Hi, > > On 13-11-17 07:14, Jérôme Carretero wrote: > > On Mon, 13 Nov 2017 07:01:30 +0300 > > Andrey Astafyev <1@246060.ru> wrote: > > > >> 13.11.2017 00:42, Jérôme Carretero пишет: > >>> Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 > >>> uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT > >>> [...] > >>> Do you see such things? > > For my devices, adding US_FL_NO_ATA_1X to unusual_uas.h didn't > > change anything, and while adding US_FL_IGNORE_UAS (using > > quirks=0bc2:ab34:u,0bc2:ab38:u) there are still device resets, > > but they cause shorter hangs in system activity (~1 second when > > UAS was more like ~20). > > The errors you are seeing are write errors. If you're seeing these > errors with both the usb-storage and uas drivers then there likely > is something wrong with your setup / hardware. My latest drives are Seagate Backup+ Hub 8TB and have ~ 50 hours of uptime. I have connected them to different controllers and they do the same as the first generation of the same capacity from 2015. SMART says that everything is OK on these disks (I have another that was RMA'ed and the symptoms of failure are something else), and if there were USB errors, the messages wouldn't be at the higher SCSI level, I guess I would see "xact failed" USB errors... no? > Does the drive in question use an external power-supply or is it > USB bus-powered? If it is the latter then that is likely the problem. External power supply & ~2-ft cable provided by Seagate. > Anyways things I would check and try to swap are both the cable > used, the power-supply used (if any), the USB-port used as well > as trying the disk on a completely different computer. I did that. The same thing happens. > I've the feeling something is busted with your hardware, it > could be the disk itself. Did you mention that this was the first > release of a new higher capacity ? Those often have some kinks > which are worked out in later revisions. No, that's about the 3rd release I think. I really suspect this has to do with GC activity of these SMR drives, as if the write activity is throttled or in more spaced bursts (same USB-level intensity), then there is no problem. I will do longer tests and see if only some of them do that, after they have been subjected to similar usage history. Best regards,
Hi Hans, Tests are currently undergoing with drives operating in plain USB mass storage class. In a first time, I'm filling drives with data (uncontrolled corpus, just TBs that I have on hand). It looks like the drives with most usage history are the ones that drop most often. kernel: usb 3-4.1.1: reset SuperSpeed USB device number 6 using xhci_hcd kernel: usb 3-4.2.1: reset SuperSpeed USB device number 7 using xhci_hcd kernel: usb 3-4.3.1.1: reset SuperSpeed USB device number 13 using xhci_hcd kernel: usb 3-4.3.2.1: reset SuperSpeed USB device number 14 using xhci_hcd kernel: usb 3-4.4: reset SuperSpeed USB device number 8 using xhci_hcd kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd kernel: usb 6-4.3.3.1: reset SuperSpeed USB device number 9 using xhci_hcd kernel: usb 6-4.4.1: reset SuperSpeed USB device number 6 using xhci_hcd Will provide some more interesting/visual data later. I'm surprised that the message "reset SuperSpeed USB device ..." is displayed without prior information about why. Someone with more background could give hints? I took a look at the USB MSC code and have few questions / observations: - It looks like (haven't tested it yet) the CONFIG_DYNAMIC_DEBUG isn't used with the USB mass storage debugging infrastructure, please confirm? If unused, are we interested to have a patch that would go back to regular pr_debug() that can work with dynamic debugging? Because with several of these drives / lots of activity / occasional issues, it looks like it will be hard to catch (yes I can use usbmon). - It looks like there is no configurable timeout for USB MSC requests. Perhaps the device is not responding in time and this is why it's reset? Best regards,
(Adding Tejun Heo who was assigned on still-open bugzilla #93581 which is about SATA but seems terribly related.) On Wed, 15 Nov 2017 16:43:14 -0500 Jérôme Carretero <cJ-ko@zougloub.eu> wrote: > Hi Hans, > > > Tests are currently undergoing with drives operating in plain USB mass > storage class. In a first time, I'm filling drives with data > (uncontrolled corpus, just TBs that I have on hand). It looks like the > drives with most usage history are the ones that drop most often. > > kernel: usb 3-4.1.1: reset SuperSpeed USB device number 6 using > xhci_hcd kernel: usb 3-4.2.1: reset SuperSpeed USB device number 7 > using xhci_hcd kernel: usb 3-4.3.1.1: reset SuperSpeed USB device > number 13 using xhci_hcd kernel: usb 3-4.3.2.1: reset SuperSpeed USB > device number 14 using xhci_hcd kernel: usb 3-4.4: reset SuperSpeed > USB device number 8 using xhci_hcd kernel: usb 6-4.3.2.1: reset > SuperSpeed USB device number 8 using xhci_hcd kernel: usb 6-4.3.3.1: > reset SuperSpeed USB device number 9 using xhci_hcd kernel: usb > 6-4.4.1: reset SuperSpeed USB device number 6 using xhci_hcd > > Will provide some more interesting/visual data later. > > > I'm surprised that the message "reset SuperSpeed USB device ..." is > displayed without prior information about why. > Someone with more background could give hints? > > > I took a look at the USB MSC code and have few questions / > observations: > > - It looks like (haven't tested it yet) the CONFIG_DYNAMIC_DEBUG isn't > used with the USB mass storage debugging infrastructure, please > confirm? If unused, are we interested to have a patch that would go > back to regular pr_debug() that can work with dynamic debugging? > > Because with several of these drives / lots of activity / occasional > issues, it looks like it will be hard to catch (yes I can use > usbmon). > > - It looks like there is no configurable timeout for USB MSC requests. > Perhaps the device is not responding in time and this is why it's > reset? > > > Best regards,
On Wed, 15 Nov 2017, Jérôme Carretero wrote: > (Adding Tejun Heo who was assigned on still-open bugzilla #93581 which > is about SATA but seems terribly related.) > > On Wed, 15 Nov 2017 16:43:14 -0500 > Jérôme Carretero <cJ-ko@zougloub.eu> wrote: > > > Hi Hans, > > > > > > Tests are currently undergoing with drives operating in plain USB mass > > storage class. In a first time, I'm filling drives with data > > (uncontrolled corpus, just TBs that I have on hand). It looks like the > > drives with most usage history are the ones that drop most often. > > > > kernel: usb 3-4.1.1: reset SuperSpeed USB device number 6 using > > xhci_hcd kernel: usb 3-4.2.1: reset SuperSpeed USB device number 7 > > using xhci_hcd kernel: usb 3-4.3.1.1: reset SuperSpeed USB device > > number 13 using xhci_hcd kernel: usb 3-4.3.2.1: reset SuperSpeed USB > > device number 14 using xhci_hcd kernel: usb 3-4.4: reset SuperSpeed > > USB device number 8 using xhci_hcd kernel: usb 6-4.3.2.1: reset > > SuperSpeed USB device number 8 using xhci_hcd kernel: usb 6-4.3.3.1: > > reset SuperSpeed USB device number 9 using xhci_hcd kernel: usb > > 6-4.4.1: reset SuperSpeed USB device number 6 using xhci_hcd > > > > Will provide some more interesting/visual data later. > > > > > > I'm surprised that the message "reset SuperSpeed USB device ..." is > > displayed without prior information about why. > > Someone with more background could give hints? usb-storage resets a drive whenever the SCSI layer tells it to or when a protocol error occurs. As far as I know, the SCSI layer issues these resets only when a command has timed out. > > I took a look at the USB MSC code and have few questions / > > observations: > > > > - It looks like (haven't tested it yet) the CONFIG_DYNAMIC_DEBUG isn't > > used with the USB mass storage debugging infrastructure, please > > confirm? If unused, are we interested to have a patch that would go > > back to regular pr_debug() that can work with dynamic debugging? dev_dbg() please, not pr_debug(). But yes, that would be worthwhile. Note, however, that usb-storage debugging is meant only for debugging problems in the usb-storage driver itself, not for debugging problems in attached devices. We use usbmon or wireshark for the latter. > > Because with several of these drives / lots of activity / occasional > > issues, it looks like it will be hard to catch (yes I can use > > usbmon). > > > > - It looks like there is no configurable timeout for USB MSC requests. > > Perhaps the device is not responding in time and this is why it's > > reset? Timeouts are set by the SCSI layer. I believe they are rather long (30 seconds, by default). Presumably they are configurable, although I would have to do some digging to figure out how. Alan Stern
On Wed, 2017-11-15 at 17:02 -0500, Alan Stern wrote: > On Wed, 15 Nov 2017, Jérôme Carretero wrote: > > > Because with several of these drives / lots of activity / > occasional > > > issues, it looks like it will be hard to catch (yes I can use > > > usbmon). > > > > > > - It looks like there is no configurable timeout for USB MSC > requests. > > > Perhaps the device is not responding in time and this is why > it's > > > reset? > > Timeouts are set by the SCSI layer. I believe they are rather long > (30 seconds, by default). Presumably they are configurable, although > I would have to do some digging to figure out how. They're in /sys/class/scsi_device/<id>/device/timeout jejb@bedivere:~> ls -l /sys/class/scsi_device/0\:0\:0\:0/device/timeout -rw-r--r-- 1 root root 4096 Nov 15 14:37 /sys/class/scsi_device/0:0:0:0/device/timeout jejb@bedivere:~> cat /sys/class/scsi_device/0\:0\:0\:0/device/timeout 30 You can actually have a udev rule adjust them on a per device id basis. James
Hi, On Thu, 16 Nov 2017 07:40:08 +0900 James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > On Wed, 2017-11-15 at 17:02 -0500, Alan Stern wrote: > > On Wed, 15 Nov 2017, Jérôme Carretero wrote: > > > > Because with several of these drives / lots of activity / > > occasional > > > > issues, it looks like it will be hard to catch (yes I can use > > > > usbmon). > > > > > > > > - It looks like there is no configurable timeout for USB MSC > > requests. > > > > Perhaps the device is not responding in time and this is why > > it's > > > > reset? > > > > Timeouts are set by the SCSI layer. I believe they are rather long > > (30 seconds, by default). Presumably they are configurable, > > although I would have to do some digging to figure out how. > > They're in /sys/class/scsi_device/<id>/device/timeout I'll use wireshark to check the cause: for sure, the drives are not "timing out" after 30 seconds (indeed the reported value in /sys/class/scsi_device/.../timeout or /sys/block/*/device/timeout), because I see (in dstat) that a disk is busy until the right about the moment where its reset message appears. Is it possible that the SCSI timeout doesn't get set into an USB URB timeout (I'll check by myself, but asking doesn't hurt) ? Thank you,
On Wed, 15 Nov 2017 17:02:39 -0500 (EST) Alan Stern <stern@rowland.harvard.edu> wrote: > > > - It looks like (haven't tested it yet) the CONFIG_DYNAMIC_DEBUG > > > isn't used with the USB mass storage debugging infrastructure, > > > please confirm? If unused, are we interested to have a patch that > > > would go back to regular pr_debug() that can work with dynamic > > > debugging? > > dev_dbg() please, not pr_debug(). But yes, that would be worthwhile. Yes obviously. Adding Joe Perches to the loop as he did a refactor of USB mass storage a while ago. > Note, however, that usb-storage debugging is meant only for debugging > problems in the usb-storage driver itself, not for debugging problems > in attached devices. We use usbmon or wireshark for the latter. OK but I find that a "reset" message without any reason is not as helpful as it could have been. At the minimum I'll try to scratch my own itch and see if I can go at the bottom of my issue. Regards,
On Wed, 2017-11-15 at 18:27 -0500, Jérôme Carretero wrote: > OK but I find that a "reset" message without any reason is not > as helpful as it could have been. At the minimum I'll try to scratch my > own itch and see if I can go at the bottom of my issue. If you want more information about SCSI error handler decisions, observing the output sent by the following command to the system log will probably help: echo 63 > /sys/module/scsi_mod/parameters/scsi_logging_level See also https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/scsi_logging.h Bart.
Hi, On Wed, 15 Nov 2017 18:17:08 -0500 Jérôme Carretero <cJ-ko@zougloub.eu> wrote: > Hi, > > > On Thu, 16 Nov 2017 07:40:08 +0900 > James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > > On Wed, 2017-11-15 at 17:02 -0500, Alan Stern wrote: > > > On Wed, 15 Nov 2017, Jérôme Carretero wrote: > > > > > Because with several of these drives / lots of > > > > >activity / > > > occasional > > > > > issues, it looks like it will be hard to catch (yes I can > > > > >use > > > > > usbmon). > > > > > > > > > > - It looks like there is no configurable timeout for USB > > > > > MSC > > > requests. > > > > > Perhaps the device is not responding in time and this is > > > > >why > > > it's > > > > > reset? > > > > > > Timeouts are set by the SCSI layer. I believe they are rather > > > long (30 seconds, by default). Presumably they are configurable, > > > although I would have to do some digging to figure out how. > > > > They're in /sys/class/scsi_device/<id>/device/timeout > > > I'll use wireshark to check the cause: for sure, the drives are not > "timing out" after 30 seconds (indeed the reported value > in /sys/class/scsi_device/.../timeout or /sys/block/*/device/timeout), > because I see (in dstat) that a disk is busy until the right about the > moment where its reset message appears. > > Is it possible that the SCSI timeout doesn't get set into an USB URB > timeout (I'll check by myself, but asking doesn't hurt) ? I performed an usbmon capture extract, centered around the event (there was a few hundred MBs written for this to happen): Nov 15 22:16:33 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd I can see that the computer is sending a write request, and sees a -EPROTO in answer (capture in attachment), so scratch the timeout issue (and actually when thinking about it, this matches what UAS was saying, except that UAS was taking ages to recover). Looked for EPROTO in the usb code, and found a dynamic debug printf in XHCI; after enabling it: Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 13 ep 3 on endpoint Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 12 ep 3 on endpoint Nov 15 22:45:03 Bidule kernel: usb 6-4.3.3.1: reset SuperSpeed USB device number 9 using xhci_hcd Nov 15 22:45:03 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd First, I understand that a bad USB device could poison the kernel log, but shouldn't that xhci_dbg() (and others eg. babble) be at least an xhci_info() (I saw 2a9227a5)? Then... I don't know enough to attribute the issue the upstream USB hub(s) or the drive endpoint not behaving properly, or the kernel... what should I do with these messages? I'm still filling the drives, will perform a scrub after, to see if the issue causes data loss...
On Wed, 15 Nov 2017, Jérôme Carretero wrote: > I performed an usbmon capture extract, centered around the event > (there was a few hundred MBs written for this to happen): > > Nov 15 22:16:33 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB > device number 8 using xhci_hcd > > I can see that the computer is sending a write request, and sees a > -EPROTO in answer (capture in attachment), so scratch the timeout issue > (and actually when thinking about it, this matches what UAS was saying, > except that UAS was taking ages to recover). > > Looked for EPROTO in the usb code, and found a dynamic debug printf in > XHCI; after enabling it: > > Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 13 ep 3 on endpoint > Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 12 ep 3 on endpoint > Nov 15 22:45:03 Bidule kernel: usb 6-4.3.3.1: reset SuperSpeed USB device number 9 using xhci_hcd > Nov 15 22:45:03 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd > > First, I understand that a bad USB device could poison the kernel log, > but shouldn't that xhci_dbg() (and others eg. babble) be at least an > xhci_info() (I saw 2a9227a5)? I suspect that if every USB error got printed in the kernel log, people would be upset at how much useless information was added. > Then... I don't know enough to attribute the issue the upstream USB hub(s) > or the drive endpoint not behaving properly, or the kernel... what > should I do with these messages? Here's the error: b5251480 0.505661 S Bo:6:008:2 -115 196608 = 540a2813 1a33dd99 ab76840c bf72fc6b 60f9fcaf 4d61822c c007ff4e ab72d022 b5251480 0.506280 C Bo:6:008:2 -71 86016 > This means the kernel tried to write 196608 bytes to the drive. After 86016 had been transferred, the drive did not reply correctly to the next output transaction, causing the kernel to perform a reset. That's what happened, according to the viewpoint of the xhci-hcd driver. In theory it's possible that the drive did respond correctly and the information get messed up on the USB cable or on the computer's end. Since we can't see what signals were actually sent on the USB bus, there's no way to be certain. But it seems most likely that the drive (or rather, its USB interface) was at fault. Alan Stern > I'm still filling the drives, will perform a scrub after, to see if > the issue causes data loss...
Hi, On Thu, 16 Nov 2017 14:42:51 -0500 (EST) Alan Stern <stern@rowland.harvard.edu> wrote: > On Wed, 15 Nov 2017, Jérôme Carretero wrote: > > > I performed an usbmon capture extract, centered around the event > > (there was a few hundred MBs written for this to happen): > > > > Nov 15 22:16:33 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB > > device number 8 using xhci_hcd > > > > I can see that the computer is sending a write request, and sees a > > -EPROTO in answer (capture in attachment), so scratch the timeout > > issue (and actually when thinking about it, this matches what UAS > > was saying, except that UAS was taking ages to recover). > > > > Looked for EPROTO in the usb code, and found a dynamic debug printf > > in XHCI; after enabling it: > > > > Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer > > error for slot 13 ep 3 on endpoint Nov 15 22:45:03 Bidule kernel: > > xhci_hcd 0000:07:00.0: Transfer error for slot 12 ep 3 on endpoint > > Nov 15 22:45:03 Bidule kernel: usb 6-4.3.3.1: reset SuperSpeed USB > > device number 9 using xhci_hcd Nov 15 22:45:03 Bidule kernel: usb > > 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd > > > > First, I understand that a bad USB device could poison the kernel > > log, but shouldn't that xhci_dbg() (and others eg. babble) be at > > least an xhci_info() (I saw 2a9227a5)? > > I suspect that if every USB error got printed in the kernel log, > people would be upset at how much useless information was added. So it turns out that one of the 2 drives that produced most of these errors died overnight (the kernel first reported failure at READ DMA EXT, SMART seeing 6k Current_Pending_Sector / Offline_Uncorrectable, then the drive just lost it and wouldn't even complete USB enumeration now. IMHO too much information is perhaps better than not enough, and I bet that people would reconsider purchasing low-quality hardware if they noticed these (unless they can happen for no reason). > > > Then... I don't know enough to attribute the issue the upstream USB > > hub(s) or the drive endpoint not behaving properly, or the > > kernel... what should I do with these messages? > > Here's the error: > > b5251480 0.505661 S Bo:6:008:2 -115 196608 = 540a2813 1a33dd99 > ab76840c bf72fc6b 60f9fcaf 4d61822c c007ff4e ab72d022 b5251480 > 0.506280 C Bo:6:008:2 -71 86016 > Out of curiosity, which tool produced this condensed output? > This means the kernel tried to write 196608 bytes to the drive. After > 86016 had been transferred, the drive did not reply correctly to the > next output transaction, causing the kernel to perform a reset. > That's what happened, according to the viewpoint of the xhci-hcd > driver. > > In theory it's possible that the drive did respond correctly and the > information get messed up on the USB cable or on the computer's end. Wow, that sucks. I had a mental image where the transactions used FEC and it would be obviously possible to differentiate between cable/hub/endpoint errors. > Since we can't see what signals were actually sent on the USB bus, > there's no way to be certain. But it seems most likely that the drive > (or rather, its USB interface) was at fault. I would speculate (with high confidence) that the drive itself is doing unexpected stuff, because of that bugzilla issue showing that these SMR drives also behave strangely when connected on SATA. I have had in circulation 10 of these 8 TB SMR drives, 1 SATA and 9 USB, and all of them are generating unexpected kernel logging to some extent, when subject to write-intensive loads. 2 from 2015 and SMART says they're all good; the rest since 10 days ago, one was DOA (very early SMART bad sectors) and tonight's failure has an S/N consecutive to that first DOA one, which smells a little. > > I'm still filling the drives, will perform a scrub after, to see if > > the issue causes data loss... To be continued... since it looks like there's no fundamental issue with the kernel itself and this is turning into a rant on hardware, I'll just direct follow-up e-mails to the ML only, tell me if you want to stay in CC. Thanks again,
On Fri, 17 Nov 2017, Jérôme Carretero wrote: > > Here's the error: > > > > b5251480 0.505661 S Bo:6:008:2 -115 196608 = 540a2813 1a33dd99 > > ab76840c bf72fc6b 60f9fcaf 4d61822c c007ff4e ab72d022 b5251480 > > 0.506280 C Bo:6:008:2 -71 86016 > > > Out of curiosity, which tool produced this condensed output? I wrote a program for myself to translate pcap files into usbmon's text format. I find that format a lot easier to read than trying to deal with wireshark's "one packet at a time" approach. > > This means the kernel tried to write 196608 bytes to the drive. After > > 86016 had been transferred, the drive did not reply correctly to the > > next output transaction, causing the kernel to perform a reset. > > That's what happened, according to the viewpoint of the xhci-hcd > > driver. > > > > In theory it's possible that the drive did respond correctly and the > > information get messed up on the USB cable or on the computer's end. > > Wow, that sucks. > I had a mental image where the transactions used FEC and it would be > obviously possible to differentiate between cable/hub/endpoint errors. I don't see how FEC could help in a situation where a device sends a properly formatted message and then some component closer to the kernel improperly rejects it. Alan Stern
diff --git a/drivers/usb/storage/unusual_uas.h b/drivers/usb/storage/unusual_uas.h index cde115359793..2fe4fd336446 100644 --- a/drivers/usb/storage/unusual_uas.h +++ b/drivers/usb/storage/unusual_uas.h @@ -121,6 +121,13 @@ UNUSUAL_DEV(0x0bc2, 0xab2a, 0x0000, 0x9999, USB_SC_DEVICE, USB_PR_DEVICE, NULL, US_FL_NO_ATA_1X), +/* Reported-by: Andrey Astafyev <dev@246060.ru> */ +UNUSUAL_DEV(0x0bc2, 0xab30, 0x0000, 0x9999, + "Seagate", + "BUP BK", + USB_SC_DEVICE, USB_PR_DEVICE, NULL, + US_FL_NO_ATA_1X), + /* Reported-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> */ UNUSUAL_DEV(0x13fd, 0x3940, 0x0000, 0x9999, "Initio Corporation",