uas: Add US_FL_NO_ATA_1X quirk for one more Seagate device

Message ID	20171110151344.10563-1-hdegoede@redhat.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-scsi-owner@kernel.org> From: Hans de Goede <hdegoede@redhat.com> To: Oliver Neukum <oneukum@suse.com>, Alan Stern <stern@rowland.harvard.edu> Cc: Hans de Goede <hdegoede@redhat.com>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, linux-usb@vger.kernel.org, linux-scsi@vger.kernel.org, stable@vger.kernel.org, Andrey Astafyev <1@246060.ru> Subject: [PATCH] uas: Add US_FL_NO_ATA_1X quirk for one more Seagate device Date: Fri, 10 Nov 2017 16:13:44 +0100 Message-Id: <20171110151344.10563-1-hdegoede@redhat.com> Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk

Hans de Goede Nov. 10, 2017, 3:13 p.m. UTC

Just like all previous UAS capable Seagate disk enclosures, this
one needs a US_FL_NO_ATA_1X quirk.

Cc: stable@vger.kernel.org # 3.16
Signed-off-by: Andrey Astafyev <1@246060.ru>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
---
 drivers/usb/storage/unusual_uas.h | 7 +++++++
 1 file changed, 7 insertions(+)

Jérôme Carretero Nov. 12, 2017, 9:42 p.m. UTC | #1

On Fri, 10 Nov 2017 16:13:44 +0100
Hans de Goede <hdegoede@redhat.com> wrote:

> Just like all previous UAS capable Seagate disk enclosures, this
> one needs a US_FL_NO_ATA_1X quirk.
> [...]
> +/* Reported-by: Andrey Astafyev <dev@246060.ru> */

Hi Andrey,


I notice that you have an external Seagate SMR drive, I am encountering
some trouble and was wondering if you (or anyone in CC) could have some
feedback.

While the drives I have (PID 0xab45, "Backup+ Hub BK") are used
without problem with very light loads, I'm starting to get annoyed when
doing more serious work.
Here is a series of messages (v4.14) I have when simply trying to balance 1
chunk of BTRFS data, ie. write 5 GiB, and waiting a long time in-between:

	Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#7 uas_eh_abort_handler 0 uas-tag 8 inflight: CMD OUT
	Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#7 CDB: Write(16) 8a 00 00 00 00 00 00 00 11 00 00 00 00 80 00 00
	Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#6 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD OUT
	Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#6 CDB: Write(16) 8a 00 00 00 00 00 00 00 11 80 00 00 00 80 00 00
	Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#5 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT
	Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#5 CDB: Write(16) 8a 00 00 00 00 00 00 bf fd 80 00 00 00 80 00 00
	Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#4 uas_eh_abort_handler 0 uas-tag 5 inflight: CMD OUT
	Nov 12 16:15:35 Bidule kernel: sd 22:0:0:0: [sdaa] tag#4 CDB: Write(16) 8a 00 00 00 00 00 00 bf fd 00 00 00 00 80 00 00
	Nov 12 16:15:35 Bidule kernel: scsi host22: uas_eh_device_reset_handler start
	Nov 12 16:15:35 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 9 using xhci_hcd
	Nov 12 16:15:35 Bidule kernel: scsi host22: uas_eh_device_reset_handler success


	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#3 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD IN
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#3 CDB: Read(16) 88 00 00 00 00 00 01 02 53 80 00 00 00 80 00 00
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#13 uas_eh_abort_handler 0 uas-tag 14 inflight: CMD OUT
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#13 CDB: Write(16) 8a 00 00 00 00 00 01 02 3b c0 00 00 00 20 00 00
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 CDB: Write(16) 8a 00 00 00 00 00 01 02 41 00 00 00 01 00 00 00
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#12 uas_eh_abort_handler 0 uas-tag 13 inflight: CMD OUT
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#12 CDB: Write(16) 8a 00 00 00 00 00 01 02 3c a0 00 00 00 20 00 00
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#11 uas_eh_abort_handler 0 uas-tag 12 inflight: CMD OUT
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#11 CDB: Write(16) 8a 00 00 00 00 00 01 02 62 00 00 00 00 80 00 00
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 CDB: Write(16) 8a 00 00 00 00 00 01 02 61 80 00 00 00 80 00 00
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#10 uas_eh_abort_handler 0 uas-tag 11 inflight: CMD OUT
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#10 CDB: Write(16) 8a 00 00 00 00 00 01 02 61 00 00 00 00 80 00 00
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#9 uas_eh_abort_handler 0 uas-tag 10 inflight: CMD OUT
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#9 CDB: Write(16) 8a 00 00 00 00 00 01 02 60 80 00 00 00 80 00 00
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#8 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD OUT
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#8 CDB: Write(16) 8a 00 00 00 00 00 01 02 60 00 00 00 00 80 00 00
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#7 uas_eh_abort_handler 0 uas-tag 8 inflight: CMD OUT
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#7 CDB: Write(16) 8a 00 00 00 00 00 01 02 5f 80 00 00 00 80 00 00
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#6 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD OUT
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#6 CDB: Write(16) 8a 00 00 00 00 00 01 02 5f 00 00 00 00 80 00 00
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#5 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT
	Nov 12 16:20:28 Bidule kernel: sd 22:0:0:0: [sdaa] tag#5 CDB: Write(16) 8a 00 00 00 00 00 01 02 50 00 00 00 00 80 00 00
	Nov 12 16:20:28 Bidule kernel: scsi host22: uas_eh_device_reset_handler start
	Nov 12 16:20:28 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 9 using xhci_hcd
	Nov 12 16:20:28 Bidule kernel: scsi host22: uas_eh_device_reset_handler success


	Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
	Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 CDB: Write(16) 8a 00 00 00 00 00 20 00 08 00 00 00 00 08 00 00
	Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT
	Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 CDB: Write(16) 8a 00 00 00 00 00 00 02 08 00 00 00 00 08 00 00
	Nov 12 16:21:29 Bidule kernel: sd 22:0:0:0: [sdaa] tag#0 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD
	Nov 12 16:21:29 Bidule kernel: sd 22:0:0:0: [sdaa] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
	Nov 12 16:21:29 Bidule kernel: scsi host22: uas_eh_device_reset_handler start
	Nov 12 16:21:29 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 9 using xhci_hcd
	Nov 12 16:21:29 Bidule kernel: scsi host22: uas_eh_device_reset_handler success


Do you see such things?

I understand that latencies can climb to several seconds as the disks can do GC,
and suspect that the issues (sometimes the USB reset doesn't work and a reboot
is needed...) are due to some unawareness of that, but I don't know.

There are several threads on distribution forums & bug trackers, but nothing
really helpful, and I didn't see much on LKML.


Regards,

Andrey Astafyev Nov. 13, 2017, 4:01 a.m. UTC | #2

13.11.2017 00:42, Jérôme Carretero пишет:
> Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 
> uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
> 	Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 CDB: Write(16) 8a 00 00 00 00 00 20 00 08 00 00 00 00 08 00 00
> 	Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT
> 	Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#1 CDB: Write(16) 8a 00 00 00 00 00 00 02 08 00 00 00 00 08 00 00
> 	Nov 12 16:21:29 Bidule kernel: sd 22:0:0:0: [sdaa] tag#0 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD
> 	Nov 12 16:21:29 Bidule kernel: sd 22:0:0:0: [sdaa] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
> 	Nov 12 16:21:29 Bidule kernel: scsi host22: uas_eh_device_reset_handler start
> 	Nov 12 16:21:29 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 9 using xhci_hcd
> 	Nov 12 16:21:29 Bidule kernel: scsi host22: uas_eh_device_reset_handler success
>
>
> Do you see such things?
>
Hi, I've seen dmesg output like this so I've added my device to quirks 
list like any other Seagate USB drive.

Jérôme Carretero Nov. 13, 2017, 6:14 a.m. UTC | #3

On Mon, 13 Nov 2017 07:01:30 +0300
Andrey Astafyev <1@246060.ru> wrote:

> 13.11.2017 00:42, Jérôme Carretero пишет:
> > Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2 
> > uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
> > [...]
> > Do you see such things?
> >  
> Hi, I've seen dmesg output like this so I've added my device to
> quirks list like any other Seagate USB drive.

Hi Andrey, Hans,

For my devices, adding US_FL_NO_ATA_1X to unusual_uas.h didn't
change anything, and while adding US_FL_IGNORE_UAS (using
quirks=0bc2:ab34:u,0bc2:ab38:u) there are still device resets,
but they cause shorter hangs in system activity (~1 second when
UAS was more like ~20).

Is the US_FL_NO_ATA_1X supposed to get rid of the R/W errors?
The first entry I saw about this quirk was so that the device can
be used at all (wouldn't be able to mount without it).

I'll follow up: when scrubbing the devices after a sequence of
experiments, there were checksum errors, so I'll retry with a more
reproducible sequence to try and get something more solid.
I wouldn't want to disable UAS just to see that reliability has
been reduced (one Arch bug report mentions something like that
https://bugs.archlinux.org/task/48362).

Regards,

Andrey Astafyev Nov. 13, 2017, 6:16 a.m. UTC | #4

13.11.2017 09:14, Jérôme Carretero пишет:
> For my devices, adding US_FL_NO_ATA_1X to unusual_uas.h didn't
> change anything, and while adding US_FL_IGNORE_UAS (using
> quirks=0bc2:ab34:u,0bc2:ab38:u) there are still device resets,
> but they cause shorter hangs in system activity (~1 second when
> UAS was more like ~20).

Maybe you should try like this: quirks=0bc2:ab34:ut,0bc2:ab38:ut ?

Jérôme Carretero Nov. 13, 2017, 7:14 a.m. UTC | #5

On Mon, 13 Nov 2017 09:16:39 +0300
Andrey Astafyev <1@246060.ru> wrote:

> 13.11.2017 09:14, Jérôme Carretero пишет:
> > For my devices, adding US_FL_NO_ATA_1X to unusual_uas.h didn't
> > change anything, and while adding US_FL_IGNORE_UAS (using
> > quirks=0bc2:ab34:u,0bc2:ab38:u) there are still device resets,
> > but they cause shorter hangs in system activity (~1 second when
> > UAS was more like ~20).  
> 
> Maybe you should try like this: quirks=0bc2:ab34:ut,0bc2:ab38:ut ?

It looks like "ut" would do the same as "u" alone, as US_FL_NO_ATA_1X
is only used inside uas.ko.
For some reason, first I wanted to go in the .h, and it's only after,
when the trial and error went more intense, that I used the command-line
parameters.

Will follow up later... I hope the drives won't end up costing more
than 12Gb/s SAS drives on an expensive HBA... as they are definitely
not hassle-free so far.
I had gotten the first 8 TB model in 2015 and had issues, but my
simple, background workload (attic/borg) could afford to use a
workaround of throttling writes*.

Hans de Goede Nov. 13, 2017, 9:04 a.m. UTC | #6

Hi,

On 13-11-17 07:14, Jérôme Carretero wrote:
> On Mon, 13 Nov 2017 07:01:30 +0300
> Andrey Astafyev <1@246060.ru> wrote:
> 
>> 13.11.2017 00:42, Jérôme Carretero пишет:
>>> Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2
>>> uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
>>> [...]
>>> Do you see such things?
>>>   
>> Hi, I've seen dmesg output like this so I've added my device to
>> quirks list like any other Seagate USB drive.
> 
> Hi Andrey, Hans,
> 
> 
> For my devices, adding US_FL_NO_ATA_1X to unusual_uas.h didn't
> change anything, and while adding US_FL_IGNORE_UAS (using
> quirks=0bc2:ab34:u,0bc2:ab38:u) there are still device resets,
> but they cause shorter hangs in system activity (~1 second when
> UAS was more like ~20).

The errors you are seeing are write errors. If you're seeing these
errors with both the usb-storage and uas drivers then there likely
is something wrong with your setup / hardware.

Does the drive in question use an external power-supply or is it
USB bus-powered? If it is the latter then that is likely the problem.

Anyways things I would check and try to swap are both the cable
used, the power-supply used (if any), the USB-port used as well
as trying the disk on a completely different computer.

I've the feeling something is busted with your hardware, it
could be the disk itself. Did you mention that this was the first
release of a new higher capacity ? Those often have some kinks
which are worked out in later revisions.

Regards,

Hans

Jérôme Carretero Nov. 13, 2017, 5:38 p.m. UTC | #7

Hi Hans,

On Mon, 13 Nov 2017 10:04:53 +0100
Hans de Goede <hdegoede@redhat.com> wrote:

> Hi,
> 
> On 13-11-17 07:14, Jérôme Carretero wrote:
> > On Mon, 13 Nov 2017 07:01:30 +0300
> > Andrey Astafyev <1@246060.ru> wrote:
> >   
> >> 13.11.2017 00:42, Jérôme Carretero пишет:  
> >>> Nov 12 16:20:59 Bidule kernel: sd 22:0:0:0: [sdaa] tag#2
> >>> uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
> >>> [...]
> >>> Do you see such things?

> > For my devices, adding US_FL_NO_ATA_1X to unusual_uas.h didn't
> > change anything, and while adding US_FL_IGNORE_UAS (using
> > quirks=0bc2:ab34:u,0bc2:ab38:u) there are still device resets,
> > but they cause shorter hangs in system activity (~1 second when
> > UAS was more like ~20).  
> 
> The errors you are seeing are write errors. If you're seeing these
> errors with both the usb-storage and uas drivers then there likely
> is something wrong with your setup / hardware.

My latest drives are Seagate Backup+ Hub 8TB and have ~ 50 hours of
uptime. I have connected them to different controllers and they do the
same as the first generation of the same capacity from 2015.

SMART says that everything is OK on these disks (I have another that
was RMA'ed and the symptoms of failure are something else), and if there
were USB errors, the messages wouldn't be at the higher SCSI level, I
guess I would see "xact failed" USB errors... no?

> Does the drive in question use an external power-supply or is it
> USB bus-powered? If it is the latter then that is likely the problem.

External power supply & ~2-ft cable provided by Seagate.

> Anyways things I would check and try to swap are both the cable
> used, the power-supply used (if any), the USB-port used as well
> as trying the disk on a completely different computer.

I did that. The same thing happens.

> I've the feeling something is busted with your hardware, it
> could be the disk itself. Did you mention that this was the first
> release of a new higher capacity ? Those often have some kinks
> which are worked out in later revisions.

No, that's about the 3rd release I think.

I really suspect this has to do with GC activity of these SMR drives,
as if the write activity is throttled or in more spaced bursts (same
USB-level intensity), then there is no problem.

I will do longer tests and see if only some of them do that, after
they have been subjected to similar usage history.

Best regards,

Jérôme Carretero Nov. 15, 2017, 9:43 p.m. UTC | #8

Hi Hans,


Tests are currently undergoing with drives operating in plain USB mass
storage class. In a first time, I'm filling drives with data
(uncontrolled corpus, just TBs that I have on hand). It looks like the
drives with most usage history are the ones that drop most often.

kernel: usb 3-4.1.1: reset SuperSpeed USB device number 6 using xhci_hcd
kernel: usb 3-4.2.1: reset SuperSpeed USB device number 7 using xhci_hcd
kernel: usb 3-4.3.1.1: reset SuperSpeed USB device number 13 using xhci_hcd
kernel: usb 3-4.3.2.1: reset SuperSpeed USB device number 14 using xhci_hcd
kernel: usb 3-4.4: reset SuperSpeed USB device number 8 using xhci_hcd
kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd
kernel: usb 6-4.3.3.1: reset SuperSpeed USB device number 9 using xhci_hcd
kernel: usb 6-4.4.1: reset SuperSpeed USB device number 6 using xhci_hcd

Will provide some more interesting/visual data later.


I'm surprised that the message "reset SuperSpeed USB device ..." is
displayed without prior information about why.
Someone with more background could give hints?


I took a look at the USB MSC code and have few questions / observations:

- It looks like (haven't tested it yet) the CONFIG_DYNAMIC_DEBUG isn't
  used with the USB mass storage debugging infrastructure, please
  confirm? If unused, are we interested to have a patch that would go
  back to regular pr_debug() that can work with dynamic debugging?

  Because with several of these drives / lots of activity / occasional
  issues, it looks like it will be hard to catch (yes I can use usbmon).

- It looks like there is no configurable timeout for USB MSC requests.
  Perhaps the device is not responding in time and this is why it's
  reset?


Best regards,

Jérôme Carretero Nov. 15, 2017, 9:49 p.m. UTC | #9

(Adding Tejun Heo who was assigned on still-open bugzilla #93581 which
is about SATA but seems terribly related.)

On Wed, 15 Nov 2017 16:43:14 -0500
Jérôme Carretero <cJ-ko@zougloub.eu> wrote:

> Hi Hans,
> 
> 
> Tests are currently undergoing with drives operating in plain USB mass
> storage class. In a first time, I'm filling drives with data
> (uncontrolled corpus, just TBs that I have on hand). It looks like the
> drives with most usage history are the ones that drop most often.
> 
> kernel: usb 3-4.1.1: reset SuperSpeed USB device number 6 using
> xhci_hcd kernel: usb 3-4.2.1: reset SuperSpeed USB device number 7
> using xhci_hcd kernel: usb 3-4.3.1.1: reset SuperSpeed USB device
> number 13 using xhci_hcd kernel: usb 3-4.3.2.1: reset SuperSpeed USB
> device number 14 using xhci_hcd kernel: usb 3-4.4: reset SuperSpeed
> USB device number 8 using xhci_hcd kernel: usb 6-4.3.2.1: reset
> SuperSpeed USB device number 8 using xhci_hcd kernel: usb 6-4.3.3.1:
> reset SuperSpeed USB device number 9 using xhci_hcd kernel: usb
> 6-4.4.1: reset SuperSpeed USB device number 6 using xhci_hcd
> 
> Will provide some more interesting/visual data later.
> 
> 
> I'm surprised that the message "reset SuperSpeed USB device ..." is
> displayed without prior information about why.
> Someone with more background could give hints?
> 
> 
> I took a look at the USB MSC code and have few questions /
> observations:
> 
> - It looks like (haven't tested it yet) the CONFIG_DYNAMIC_DEBUG isn't
>   used with the USB mass storage debugging infrastructure, please
>   confirm? If unused, are we interested to have a patch that would go
>   back to regular pr_debug() that can work with dynamic debugging?
> 
>   Because with several of these drives / lots of activity / occasional
>   issues, it looks like it will be hard to catch (yes I can use
> usbmon).
> 
> - It looks like there is no configurable timeout for USB MSC requests.
>   Perhaps the device is not responding in time and this is why it's
>   reset?
> 
> 
> Best regards,

Alan Stern Nov. 15, 2017, 10:02 p.m. UTC | #10

On Wed, 15 Nov 2017, Jérôme Carretero wrote:

> (Adding Tejun Heo who was assigned on still-open bugzilla #93581 which
> is about SATA but seems terribly related.)
> 
> On Wed, 15 Nov 2017 16:43:14 -0500
> Jérôme Carretero <cJ-ko@zougloub.eu> wrote:
> 
> > Hi Hans,
> > 
> > 
> > Tests are currently undergoing with drives operating in plain USB mass
> > storage class. In a first time, I'm filling drives with data
> > (uncontrolled corpus, just TBs that I have on hand). It looks like the
> > drives with most usage history are the ones that drop most often.
> > 
> > kernel: usb 3-4.1.1: reset SuperSpeed USB device number 6 using
> > xhci_hcd kernel: usb 3-4.2.1: reset SuperSpeed USB device number 7
> > using xhci_hcd kernel: usb 3-4.3.1.1: reset SuperSpeed USB device
> > number 13 using xhci_hcd kernel: usb 3-4.3.2.1: reset SuperSpeed USB
> > device number 14 using xhci_hcd kernel: usb 3-4.4: reset SuperSpeed
> > USB device number 8 using xhci_hcd kernel: usb 6-4.3.2.1: reset
> > SuperSpeed USB device number 8 using xhci_hcd kernel: usb 6-4.3.3.1:
> > reset SuperSpeed USB device number 9 using xhci_hcd kernel: usb
> > 6-4.4.1: reset SuperSpeed USB device number 6 using xhci_hcd
> > 
> > Will provide some more interesting/visual data later.
> > 
> > 
> > I'm surprised that the message "reset SuperSpeed USB device ..." is
> > displayed without prior information about why.
> > Someone with more background could give hints?

usb-storage resets a drive whenever the SCSI layer tells it to or when
a protocol error occurs.  As far as I know, the SCSI layer issues these
resets only when a command has timed out.

> > I took a look at the USB MSC code and have few questions /
> > observations:
> > 
> > - It looks like (haven't tested it yet) the CONFIG_DYNAMIC_DEBUG isn't
> >   used with the USB mass storage debugging infrastructure, please
> >   confirm? If unused, are we interested to have a patch that would go
> >   back to regular pr_debug() that can work with dynamic debugging?

dev_dbg() please, not pr_debug().  But yes, that would be worthwhile.

Note, however, that usb-storage debugging is meant only for debugging
problems in the usb-storage driver itself, not for debugging problems
in attached devices.  We use usbmon or wireshark for the latter.

> >   Because with several of these drives / lots of activity / occasional
> >   issues, it looks like it will be hard to catch (yes I can use
> > usbmon).
> > 
> > - It looks like there is no configurable timeout for USB MSC requests.
> >   Perhaps the device is not responding in time and this is why it's
> >   reset?

Timeouts are set by the SCSI layer.  I believe they are rather long (30 
seconds, by default).  Presumably they are configurable, although I 
would have to do some digging to figure out how.

Alan Stern

James Bottomley Nov. 15, 2017, 10:40 p.m. UTC | #11

On Wed, 2017-11-15 at 17:02 -0500, Alan Stern wrote:
> On Wed, 15 Nov 2017, Jérôme Carretero wrote:
> > >   Because with several of these drives / lots of activity /
> occasional
> > >   issues, it looks like it will be hard to catch (yes I can use
> > > usbmon).
> > > 
> > > - It looks like there is no configurable timeout for USB MSC
> requests.
> > >   Perhaps the device is not responding in time and this is why
> it's
> > >   reset?
> 
> Timeouts are set by the SCSI layer.  I believe they are rather long
> (30 seconds, by default).  Presumably they are configurable, although
> I would have to do some digging to figure out how.

They're in /sys/class/scsi_device/<id>/device/timeout

jejb@bedivere:~> ls -l /sys/class/scsi_device/0\:0\:0\:0/device/timeout
-rw-r--r-- 1 root root 4096 Nov 15 14:37 /sys/class/scsi_device/0:0:0:0/device/timeout
jejb@bedivere:~> cat /sys/class/scsi_device/0\:0\:0\:0/device/timeout
30

You can actually have a udev rule adjust them on a per device id basis.

James

Jérôme Carretero Nov. 15, 2017, 11:17 p.m. UTC | #12

Hi,

On Thu, 16 Nov 2017 07:40:08 +0900
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> On Wed, 2017-11-15 at 17:02 -0500, Alan Stern wrote:
> > On Wed, 15 Nov 2017, Jérôme Carretero wrote:  
> > > >   Because with several of these drives / lots of activity /  
> > occasional  
> > > >   issues, it looks like it will be hard to catch (yes I can use
> > > > usbmon).
> > > > 
> > > > - It looks like there is no configurable timeout for USB MSC  
> > requests.  
> > > >   Perhaps the device is not responding in time and this is why  
> > it's  
> > > >   reset?  
> > 
> > Timeouts are set by the SCSI layer.  I believe they are rather long
> > (30 seconds, by default).  Presumably they are configurable,
> > although I would have to do some digging to figure out how.  
> 
> They're in /sys/class/scsi_device/<id>/device/timeout

I'll use wireshark to check the cause: for sure, the drives are not
"timing out" after 30 seconds (indeed the reported value
in /sys/class/scsi_device/.../timeout or /sys/block/*/device/timeout),
because I see (in dstat) that a disk is busy until the right about the
moment where its reset message appears.

Is it possible that the SCSI timeout doesn't get set into an USB URB
timeout (I'll check by myself, but asking doesn't hurt) ?

Thank you,

Jérôme Carretero Nov. 15, 2017, 11:27 p.m. UTC | #13

On Wed, 15 Nov 2017 17:02:39 -0500 (EST)
Alan Stern <stern@rowland.harvard.edu> wrote:

> > > - It looks like (haven't tested it yet) the CONFIG_DYNAMIC_DEBUG
> > > isn't used with the USB mass storage debugging infrastructure,
> > > please confirm? If unused, are we interested to have a patch that
> > > would go back to regular pr_debug() that can work with dynamic
> > > debugging?  
> 
> dev_dbg() please, not pr_debug().  But yes, that would be worthwhile.

Yes obviously.

Adding Joe Perches to the loop as he did a refactor of USB mass storage
a while ago.


> Note, however, that usb-storage debugging is meant only for debugging
> problems in the usb-storage driver itself, not for debugging problems
> in attached devices.  We use usbmon or wireshark for the latter.

OK but I find that a "reset" message without any reason is not
as helpful as it could have been. At the minimum I'll try to scratch my
own itch and see if I can go at the bottom of my issue.


Regards,

Bart Van Assche Nov. 15, 2017, 11:40 p.m. UTC | #14

On Wed, 2017-11-15 at 18:27 -0500, Jérôme Carretero wrote:
> OK but I find that a "reset" message without any reason is not

> as helpful as it could have been. At the minimum I'll try to scratch my

> own itch and see if I can go at the bottom of my issue.

If you want more information about SCSI error handler decisions, observing
the output sent by the following command to the system log will probably help:

echo 63 > /sys/module/scsi_mod/parameters/scsi_logging_level

See also
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/scsi_logging.h
Bart.

Jérôme Carretero Nov. 16, 2017, 4:21 a.m. UTC | #15

Hi,

On Wed, 15 Nov 2017 18:17:08 -0500
Jérôme Carretero <cJ-ko@zougloub.eu> wrote:

> Hi,
> 
> 
> On Thu, 16 Nov 2017 07:40:08 +0900
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> > On Wed, 2017-11-15 at 17:02 -0500, Alan Stern wrote:  
> > > On Wed, 15 Nov 2017, Jérôme Carretero wrote:    
> > > > >   Because with several of these drives / lots of
> > > > >activity /    
> > > occasional    
> > > > >   issues, it looks like it will be hard to catch (yes I can
> > > > >use
> > > > > usbmon).
> > > > > 
> > > > > - It looks like there is no configurable timeout for USB
> > > > > MSC    
> > > requests.    
> > > > >   Perhaps the device is not responding in time and this is
> > > > >why    
> > > it's    
> > > > >   reset?    
> > > 
> > > Timeouts are set by the SCSI layer.  I believe they are rather
> > > long (30 seconds, by default).  Presumably they are configurable,
> > > although I would have to do some digging to figure out how.    
> > 
> > They're in /sys/class/scsi_device/<id>/device/timeout  
> 
> 
> I'll use wireshark to check the cause: for sure, the drives are not
> "timing out" after 30 seconds (indeed the reported value
> in /sys/class/scsi_device/.../timeout or /sys/block/*/device/timeout),
> because I see (in dstat) that a disk is busy until the right about the
> moment where its reset message appears.
> 
> Is it possible that the SCSI timeout doesn't get set into an USB URB
> timeout (I'll check by myself, but asking doesn't hurt) ?

I performed an usbmon capture extract, centered around the event
(there was a few hundred MBs written for this to happen):

 Nov 15 22:16:33 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB
  device number 8 using xhci_hcd

I can see that the computer is sending a write request, and sees a
-EPROTO in answer (capture in attachment), so scratch the timeout issue
(and actually when thinking about it, this matches what UAS was saying,
except that UAS was taking ages to recover).

Looked for EPROTO in the usb code, and found a dynamic debug printf in
XHCI; after enabling it:

 Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 13 ep 3 on endpoint
 Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 12 ep 3 on endpoint
 Nov 15 22:45:03 Bidule kernel: usb 6-4.3.3.1: reset SuperSpeed USB device number 9 using xhci_hcd
 Nov 15 22:45:03 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd

First, I understand that a bad USB device could poison the kernel log,
but shouldn't that xhci_dbg() (and others eg. babble) be at least an
xhci_info() (I saw 2a9227a5)?

Then... I don't know enough to attribute the issue the upstream USB hub(s)
or the drive endpoint not behaving properly, or the kernel... what
should I do with these messages?

I'm still filling the drives, will perform a scrub after, to see if
the issue causes data loss...

Alan Stern Nov. 16, 2017, 7:42 p.m. UTC | #16

On Wed, 15 Nov 2017, Jérôme Carretero wrote:

> I performed an usbmon capture extract, centered around the event
> (there was a few hundred MBs written for this to happen):
> 
>  Nov 15 22:16:33 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB
>   device number 8 using xhci_hcd
> 
> I can see that the computer is sending a write request, and sees a
> -EPROTO in answer (capture in attachment), so scratch the timeout issue
> (and actually when thinking about it, this matches what UAS was saying,
> except that UAS was taking ages to recover).
> 
> Looked for EPROTO in the usb code, and found a dynamic debug printf in
> XHCI; after enabling it:
> 
>  Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 13 ep 3 on endpoint
>  Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 12 ep 3 on endpoint
>  Nov 15 22:45:03 Bidule kernel: usb 6-4.3.3.1: reset SuperSpeed USB device number 9 using xhci_hcd
>  Nov 15 22:45:03 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd
> 
> First, I understand that a bad USB device could poison the kernel log,
> but shouldn't that xhci_dbg() (and others eg. babble) be at least an
> xhci_info() (I saw 2a9227a5)?

I suspect that if every USB error got printed in the kernel log, people 
would be upset at how much useless information was added.

> Then... I don't know enough to attribute the issue the upstream USB hub(s)
> or the drive endpoint not behaving properly, or the kernel... what
> should I do with these messages?

Here's the error:

b5251480 0.505661 S Bo:6:008:2 -115 196608 = 540a2813 1a33dd99 ab76840c bf72fc6b 60f9fcaf 4d61822c c007ff4e ab72d022
b5251480 0.506280 C Bo:6:008:2 -71 86016 >

This means the kernel tried to write 196608 bytes to the drive.  After
86016 had been transferred, the drive did not reply correctly to the
next output transaction, causing the kernel to perform a reset.  
That's what happened, according to the viewpoint of the xhci-hcd 
driver.

In theory it's possible that the drive did respond correctly and the
information get messed up on the USB cable or on the computer's end.  
Since we can't see what signals were actually sent on the USB bus,
there's no way to be certain.  But it seems most likely that the drive
(or rather, its USB interface) was at fault.

Alan Stern

> I'm still filling the drives, will perform a scrub after, to see if
> the issue causes data loss...

Jérôme Carretero Nov. 17, 2017, 10:19 p.m. UTC | #17

Hi,

On Thu, 16 Nov 2017 14:42:51 -0500 (EST)
Alan Stern <stern@rowland.harvard.edu> wrote:

> On Wed, 15 Nov 2017, Jérôme Carretero wrote:
> 
> > I performed an usbmon capture extract, centered around the event
> > (there was a few hundred MBs written for this to happen):
> > 
> >  Nov 15 22:16:33 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB
> >   device number 8 using xhci_hcd
> > 
> > I can see that the computer is sending a write request, and sees a
> > -EPROTO in answer (capture in attachment), so scratch the timeout
> > issue (and actually when thinking about it, this matches what UAS
> > was saying, except that UAS was taking ages to recover).
> > 
> > Looked for EPROTO in the usb code, and found a dynamic debug printf
> > in XHCI; after enabling it:
> > 
> >  Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer
> > error for slot 13 ep 3 on endpoint Nov 15 22:45:03 Bidule kernel:
> > xhci_hcd 0000:07:00.0: Transfer error for slot 12 ep 3 on endpoint
> > Nov 15 22:45:03 Bidule kernel: usb 6-4.3.3.1: reset SuperSpeed USB
> > device number 9 using xhci_hcd Nov 15 22:45:03 Bidule kernel: usb
> > 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd
> > 
> > First, I understand that a bad USB device could poison the kernel
> > log, but shouldn't that xhci_dbg() (and others eg. babble) be at
> > least an xhci_info() (I saw 2a9227a5)?  
> 
> I suspect that if every USB error got printed in the kernel log,
> people would be upset at how much useless information was added.

So it turns out that one of the 2 drives that produced most of these
errors died overnight (the kernel first reported failure at READ DMA
EXT, SMART seeing 6k Current_Pending_Sector / Offline_Uncorrectable,
then the drive just lost it and wouldn't even complete USB enumeration
now.

IMHO too much information is perhaps better than not enough, and I bet
that people would reconsider purchasing low-quality hardware if they
noticed these (unless they can happen for no reason).

> 
> > Then... I don't know enough to attribute the issue the upstream USB
> > hub(s) or the drive endpoint not behaving properly, or the
> > kernel... what should I do with these messages?  
> 
> Here's the error:
> 
> b5251480 0.505661 S Bo:6:008:2 -115 196608 = 540a2813 1a33dd99
> ab76840c bf72fc6b 60f9fcaf 4d61822c c007ff4e ab72d022 b5251480
> 0.506280 C Bo:6:008:2 -71 86016 >

Out of curiosity, which tool produced this condensed output?

> This means the kernel tried to write 196608 bytes to the drive.  After
> 86016 had been transferred, the drive did not reply correctly to the
> next output transaction, causing the kernel to perform a reset.  
> That's what happened, according to the viewpoint of the xhci-hcd 
> driver.
> 
> In theory it's possible that the drive did respond correctly and the
> information get messed up on the USB cable or on the computer's end.

Wow, that sucks.
I had a mental image where the transactions used FEC and it would be
obviously possible to differentiate between cable/hub/endpoint errors.

> Since we can't see what signals were actually sent on the USB bus,
> there's no way to be certain.  But it seems most likely that the drive
> (or rather, its USB interface) was at fault.

I would speculate (with high confidence) that the drive itself is doing
unexpected stuff, because of that bugzilla issue showing that these SMR
drives also behave strangely when connected on SATA.

I have had in circulation 10 of these 8 TB SMR drives, 1 SATA and 9 USB,
and all of them are generating unexpected kernel logging to some
extent, when subject to write-intensive loads.
2 from 2015 and SMART says they're all good; the rest since 10 days
ago, one was DOA (very early SMART bad sectors) and tonight's failure
has an S/N consecutive to that first DOA one, which smells a little.

> > I'm still filling the drives, will perform a scrub after, to see if
> > the issue causes data loss...  

To be continued... since it looks like there's no fundamental issue
with the kernel itself and this is turning into a rant on hardware,
I'll just direct follow-up e-mails to the ML only, tell me if you want
to stay in CC.

Thanks again,

Alan Stern Nov. 18, 2017, 4:57 p.m. UTC | #18

On Fri, 17 Nov 2017, Jérôme Carretero wrote:

> > Here's the error:
> > 
> > b5251480 0.505661 S Bo:6:008:2 -115 196608 = 540a2813 1a33dd99
> > ab76840c bf72fc6b 60f9fcaf 4d61822c c007ff4e ab72d022 b5251480
> > 0.506280 C Bo:6:008:2 -71 86016 >
> 
> Out of curiosity, which tool produced this condensed output?

I wrote a program for myself to translate pcap files into usbmon's text
format.  I find that format a lot easier to read than trying to deal
with wireshark's "one packet at a time" approach.

> > This means the kernel tried to write 196608 bytes to the drive.  After
> > 86016 had been transferred, the drive did not reply correctly to the
> > next output transaction, causing the kernel to perform a reset.  
> > That's what happened, according to the viewpoint of the xhci-hcd 
> > driver.
> > 
> > In theory it's possible that the drive did respond correctly and the
> > information get messed up on the USB cable or on the computer's end.
> 
> Wow, that sucks.
> I had a mental image where the transactions used FEC and it would be
> obviously possible to differentiate between cable/hub/endpoint errors.

I don't see how FEC could help in a situation where a device sends a
properly formatted message and then some component closer to the kernel
improperly rejects it.

Alan Stern

uas: Add US_FL_NO_ATA_1X quirk for one more Seagate device

Commit Message

Comments

Patch