Message ID | 0b089b17-00fc-5a7c-baa3-e6141029b191@free.fr (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 28.08.2017 17:40, Mason wrote: > On 28/08/2017 10:39, Mathias Nyman wrote: > >> Could you take a log with the following added debug, without >> your extra delays, It should show a bit more about the state >> of the controller when we read 0xffffffff > > I applied the following patch on top of v4.12-rc1 > > diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c > index 5e3e9d4c6956..c7ea7d4c801f 100644 > --- a/drivers/usb/host/xhci-hub.c > +++ b/drivers/usb/host/xhci-hub.c > @@ -23,6 +23,7 @@ > > #include <linux/slab.h> > #include <asm/unaligned.h> > +#include <linux/pci.h> > > #include "xhci.h" > #include "xhci-trace.h" > @@ -1268,7 +1269,10 @@ int xhci_hub_control(struct usb_hcd *hcd, u16 typeReq, u16 wValue, > wIndex--; > temp = readl(port_array[wIndex]); > if (temp == ~(u32)0) { > - xhci_hc_died(xhci); > + struct pci_dev *pdev = to_pci_dev(hcd->self.controller); > + xhci_err(xhci, "ClearPortFeat port%d @%p=%x, hcd->state:0x%x hcd->flags:0x%x, pci_state 0x%x\n", > + wIndex, port_array[wIndex], temp, hcd->state, hcd->flags, pdev->current_state); > + WARN_ON(1); > retval = -ENODEV; > break; > } > > > And here are logs I get when I plug/unplug my USB3 device. > > [ 14.970148] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd > [ 15.003487] usb 2-2: New USB device found, idVendor=0951, idProduct=1666 > [ 15.010237] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3 > [ 15.017483] usb 2-2: Product: DataTraveler 3.0 > [ 15.021990] usb 2-2: Manufacturer: Kingston > [ 15.026234] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA > [ 15.034830] usb-storage 2-2:1.0: USB Mass Storage device detected > [ 15.041269] scsi host0: usb-storage 2-2:1.0 > [ 16.056140] scsi 0:0:0:0: Direct-Access Kingston DataTraveler 3.0 PQ: 0 ANSI: 6 > [ 16.064979] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB) > [ 16.072978] sd 0:0:0:0: [sda] Write Protect is off > [ 16.078076] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA > [ 16.089417] sda: sda1 > [ 16.093050] sd 0:0:0:0: [sda] Attached SCSI removable disk > > > [ 22.152078] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000 > [ 22.160157] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID) > [ 22.172051] pcieport 0000:00:00.0: device [1105:0024] error status/mask=00004000/00000000 > [ 22.180493] pcieport 0000:00:00.0: [14] Completion Timeout (First) > [ 22.187368] pcieport 0000:00:00.0: AER: Device recovery failed > [ 22.885269] xhci_hcd 0000:01:00.0: ClearPortFeat port1 @e0852430=ffffffff, hcd->state:0x1 hcd->flags:0x1a5, pci_state 0x0 State is HC_STATE_RUNNING, Flags bits 0,2,5,7,8 set: #define HCD_FLAG_HW_ACCESSIBLE 0 /* at full power */ #define HCD_FLAG_POLL_RH 2 /* poll for rh status? */ #define HCD_FLAG_RH_RUNNING 5 /* root hub is running? */ #define HCD_FLAG_INTF_AUTHORIZED 7 /* authorize interfaces? */ #define HCD_FLAG_DEV_AUTHORIZED 8 /* authorize devices? */ And pci state seems to be D0 (according to driver, pdev->current_state) I can't see anything wrong from xhci/usb point of view. I'd focus more on the PCI errors in the logs as the cause for reading 0xffffffff from xhci mmio. Then again it might be a bit too drastic to kill xhci just because we read 0xffffffff once from a mmio xhci register. Maybe we should return an error a couple times before actually tearing down xhci. This tight check was originally done to detect pci hotplug removed hosts as soon as possible. -Mathias
On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote: > Then again it might be a bit too drastic to kill xhci just because > we read 0xffffffff once from a mmio xhci register. Maybe we should > return an error a couple times before actually tearing down xhci. > > This tight check was originally done to detect pci hotplug removed > hosts as soon as possible. Just make pci_dev_is_disconnected() public to detect PCI hot removal. We *know* when the device was hot removed, so I think there's no need to guess that based on reading "all ones" from mmio (which may happen for entirely legitimate reasons unrelated to hot removal). Thanks, Lukas
On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote: > On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote: > > Then again it might be a bit too drastic to kill xhci just because > > we read 0xffffffff once from a mmio xhci register. Maybe we should > > return an error a couple times before actually tearing down xhci. > > > > This tight check was originally done to detect pci hotplug removed > > hosts as soon as possible. > > Just make pci_dev_is_disconnected() public to detect PCI hot removal. > We *know* when the device was hot removed, so I think there's no need > to guess that based on reading "all ones" from mmio (which may happen > for entirely legitimate reasons unrelated to hot removal). No, you don't always "know" when a device is removed, don't rely on it, not all platforms support that. One more reason why I hate that function and I'm glad it's not exported for others to think it somehow actually works for their system... Reading all ff shows the device is removed, that's all the PCI spec guarantees. What other legitimate reason could that happen for? thanks, greg k-h
On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote: > On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote: > > On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote: > > > Then again it might be a bit too drastic to kill xhci just because > > > we read 0xffffffff once from a mmio xhci register. Maybe we should > > > return an error a couple times before actually tearing down xhci. > > > > > > This tight check was originally done to detect pci hotplug removed > > > hosts as soon as possible. > > > > Just make pci_dev_is_disconnected() public to detect PCI hot removal. > > We *know* when the device was hot removed, so I think there's no need > > to guess that based on reading "all ones" from mmio (which may happen > > for entirely legitimate reasons unrelated to hot removal). > > No, you don't always "know" when a device is removed, don't rely on it, > not all platforms support that. Please explain, which platforms don't support that? They wouldn't be compliant with the spec it seems. PCIe r3.1, section 6.7.3: "A Downstream Port with hot-plug capabilities supports the following hot-plug events: Presence Detect Changed A Downstream Port with hot-plug capabilities monitors the slot it controls for the slot events listed above. [...] If enabled through the associated enable field, slot events must generate a software notification." And pciehp sets the flag on all downstream devices that they're removed once the software notification has been received and processed. > Reading all ff shows the device is removed, that's all the PCI spec > guarantees. What other legitimate reason could that happen for? Is 0xffffffff not a valid value to be stored in and read from mmio space? Best regards, Lukas
On Tue, Aug 29, 2017 at 05:34:56PM +0200, Lukas Wunner wrote: > On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote: > > On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote: > > > On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote: > > > > Then again it might be a bit too drastic to kill xhci just because > > > > we read 0xffffffff once from a mmio xhci register. Maybe we should > > > > return an error a couple times before actually tearing down xhci. > > > > > > > > This tight check was originally done to detect pci hotplug removed > > > > hosts as soon as possible. > > > > > > Just make pci_dev_is_disconnected() public to detect PCI hot removal. > > > We *know* when the device was hot removed, so I think there's no need > > > to guess that based on reading "all ones" from mmio (which may happen > > > for entirely legitimate reasons unrelated to hot removal). > > > > No, you don't always "know" when a device is removed, don't rely on it, > > not all platforms support that. > > Please explain, which platforms don't support that? They wouldn't be > compliant with the spec it seems. > > PCIe r3.1, section 6.7.3: > > "A Downstream Port with hot-plug capabilities supports the > following hot-plug events: > > Presence Detect Changed > > A Downstream Port with hot-plug capabilities monitors the slot > it controls for the slot events listed above. [...] > If enabled through the associated enable field, slot events > must generate a software notification." > > And pciehp sets the flag on all downstream devices that they're removed > once the software notification has been received and processed. What about all of the non-pciehp platforms? :) Also, there is always a race between when that notification has been sent and processed on the PCIe channel and the reading of all 1s on the PCI bus by the driver. For fun, try disconnecting a USB device from one of the more modern laptops with a USB 3.1 connection on it. The bios treats those as a pci hotpluggable xhci controller on the PCI bus. When the device is disconnected, the BIOS rips out the PCI device as well, but all that time, the xhci driver is thinking the device is still present as it takes a while for the BIOS to do all of the needed housekeeping. It's a really long time for everything to shut down and to help prevent the driver from going crazy, it has to detect ffff reads as "disconnection happened". All PCI drivers have had to do this for decades now, it's nothing new here, PCIe just gave us a chance to be notified that the device really is gone now, PCI hotplug has always been out-of-band like this. > > Reading all ff shows the device is removed, that's all the PCI spec > > guarantees. What other legitimate reason could that happen for? > > Is 0xffffffff not a valid value to be stored in and read from mmio space? For a specific register, doubtful, which is why the code errors out, right? If it is a valid value, then it shouldn't be exiting, and move on to the next read. I don't understand what we are arguing about here anymore... thanks, greg k-h
On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote: > This tight check was originally done to detect pci hotplug removed > hosts as soon as possible. In Mason's case, the parent of the XHCI controller isn't a hotplug port, see this lspci output: https://www.spinics.net/lists/linux-usb/msg160010.html Please check is_hotplug_bridge in the parent's struct pci_dev before assuming that the XHCI controller was unplugged. Thanks, Lukas
On Wed, Aug 30, 2017 at 01:53:10AM +0200, Lukas Wunner wrote: > On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote: > > This tight check was originally done to detect pci hotplug removed > > hosts as soon as possible. > > In Mason's case, the parent of the XHCI controller isn't a hotplug port, > see this lspci output: > > https://www.spinics.net/lists/linux-usb/msg160010.html > > Please check is_hotplug_bridge in the parent's struct pci_dev before > assuming that the XHCI controller was unplugged. How can you guarantee that this is set on some systems? Will it be set on cardbus devices? What about on a "normal" system where I can just go and yank out a PCI card at will? I don't think this is a valid thing to check, and again, why are we arguing this point? It's been this way since the 1990's, this isn't a new thing... To get back to the original issue here, the hardware seems to have died, the driver stops talking to it, and all is good. The "regression" here is that we now properly can determine that the hardware is crap. So, how do you think we should proceed, delay a bit longer before saying the device is gone? How long is "long enough"? How many bus errors are we allowed to tolerate (hint, the PCI spec says none...) Maybe someone wants to get to the root problem here, why is the hardware suddenly reporting all 1s? thanks, greg k-h
On Tue, Aug 29, 2017 at 05:51:38PM +0200, Greg Kroah-Hartman wrote: > On Tue, Aug 29, 2017 at 05:34:56PM +0200, Lukas Wunner wrote: > > On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote: > > > On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote: > > Is 0xffffffff not a valid value to be stored in and read from mmio space? > > For a specific register, doubtful Well, "doubtful" means you don't know for sure. It's fine to check for "all ones" as a heuristic if that's not a valid value for the register read, however a hotplug notification is a *definitive* indication the hardware is gone. I you seem to prefer forgoing a *definitive* indication for a mere heuristic, that doesn't make sense from my point of view. > It's a really long time for everything to shut down and to help > prevent the driver from going crazy, [...] > Also, there is always a race between when that notification has been > sent and processed on the PCIe channel and the reading of all 1s on the > PCI bus by the driver. Yes I know that. In practice the interrupt signaling hot removal comes fast enough to prevent drivers from "going crazy", as I've mentioned in this patch: https://patchwork.kernel.org/patch/9405255/ > For fun, try disconnecting a USB device from one of the more modern > laptops with a USB 3.1 connection on it. The bios treats those as a pci > hotpluggable xhci controller on the PCI bus. When the device is > disconnected, the BIOS rips out the PCI device as well, but all that > time, the xhci driver is thinking the device is still present as it > takes a while for the BIOS to do all of the needed housekeeping. Yes, that is the case with Thunderbolt 3 controllers on non-Macs: The XHCI controller appears below downstream bridge 2 of the Thunderbolt controller's PCIe switch. Once the last device is removed, the PCIe switch and all devices below it disappear and the controller is powered down. The controller is thus only visible if powered up. On Mac this is all native instead: Native pciehp, native tunnel setup, native PM. > > > > Just make pci_dev_is_disconnected() public to detect PCI hot removal. > > > > We *know* when the device was hot removed, so I think there's no need > > > > to guess that based on reading "all ones" from mmio (which may happen > > > > for entirely legitimate reasons unrelated to hot removal). > > > > > > No, you don't always "know" when a device is removed, don't rely on it, > > > not all platforms support that. > > > > Please explain, which platforms don't support that? They wouldn't be > > compliant with the spec it seems. > > What about all of the non-pciehp platforms? :) Fair enough, those should be extended to set PCI_DEV_DISCONNECTED as well. Thanks, Lukas
On Wed, Aug 30, 2017 at 08:36:23AM +0200, Lukas Wunner wrote: > On Tue, Aug 29, 2017 at 05:51:38PM +0200, Greg Kroah-Hartman wrote: > > On Tue, Aug 29, 2017 at 05:34:56PM +0200, Lukas Wunner wrote: > > > On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote: > > > > On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote: > > > Is 0xffffffff not a valid value to be stored in and read from mmio space? > > > > For a specific register, doubtful > > Well, "doubtful" means you don't know for sure. > > It's fine to check for "all ones" as a heuristic if that's not a valid > value for the register read, however a hotplug notification is a > *definitive* indication the hardware is gone. > > I you seem to prefer forgoing a *definitive* indication for a mere > heuristic, that doesn't make sense from my point of view. I still don't know what you are arguing about here. The _driver_ knows if a specific read allows all ones as a valid return value. If it isn't, then the driver knows the device is now gone. It's that simple, don't do that type of check if all ones is a valid read. And that's not what is happening here anyway, so again, what is this discussion about? Unless there's something specific we can do here for the xhci driver, I think this thread is dead until someone determines what is going wrong with the hardware the original reporter posted about. greg k-h
On 30/08/2017 08:02, Greg Kroah-Hartman wrote: > To get back to the original issue here, the hardware seems to have died, > the driver stops talking to it, and all is good. The "regression" here > is that we now properly can determine that the hardware is crap. Before 4.12, when I unplugged my USB3 Flash drive, Linux would detect a few "Uncorrected Non-Fatal errors" via AER, but it was still possible to plug the drive back in. Since 4.12, once I unplug the drive, the whole USB3 card is marked as dead (all 4 ports), and I can no longer plug anything in (not even the USB2 drive that didn't have any issues, IIRC). It seems a bit premature to "mark as dead" something that remains functional, doesn't it? Disclaimer, there are many variables in this setup, and I've only tested a small fraction of the problem space: only one system, only one USB3 board, only one USB3 Flash drive. > So, how do you think we should proceed, delay a bit longer before saying > the device is gone? How long is "long enough"? How many bus errors are > we allowed to tolerate (hint, the PCI spec says none...) > > Maybe someone wants to get to the root problem here, why is the hardware > suddenly reporting all 1s? I'm afraid I won't be able to make any progress on this front, unless I can get my hands on a PCIe packet analyzer. Regards.
On Wed, Aug 30, 2017 at 10:55:37AM +0200, Mason wrote: > On 30/08/2017 08:02, Greg Kroah-Hartman wrote: > > > To get back to the original issue here, the hardware seems to have died, > > the driver stops talking to it, and all is good. The "regression" here > > is that we now properly can determine that the hardware is crap. > > Before 4.12, when I unplugged my USB3 Flash drive, Linux would > detect a few "Uncorrected Non-Fatal errors" via AER, but it was > still possible to plug the drive back in. > > Since 4.12, once I unplug the drive, the whole USB3 card is marked > as dead (all 4 ports), and I can no longer plug anything in (not even > the USB2 drive that didn't have any issues, IIRC). > > It seems a bit premature to "mark as dead" something that remains > functional, doesn't it? I agree, but if the device sends all ones, it's a good indication it is really dead, right? Or something is wrong with it. > Disclaimer, there are many variables in this setup, and I've only > tested a small fraction of the problem space: only one system, > only one USB3 board, only one USB3 Flash drive. Did you ever happen to narrow this down to a single git commit using 'git bisect'? I can't remember what happened in the beginning of this thread... > > So, how do you think we should proceed, delay a bit longer before saying > > the device is gone? How long is "long enough"? How many bus errors are > > we allowed to tolerate (hint, the PCI spec says none...) > > > > Maybe someone wants to get to the root problem here, why is the hardware > > suddenly reporting all 1s? > > I'm afraid I won't be able to make any progress on this front, > unless I can get my hands on a PCIe packet analyzer. Odds of that happening are pretty rare, right? I've never even seen one of those... thanks, greg k-h
On 30 August 2017 at 09:55, Mason <slash.tmp@free.fr> wrote: > On 30/08/2017 08:02, Greg Kroah-Hartman wrote: > >> To get back to the original issue here, the hardware seems to have died, >> the driver stops talking to it, and all is good. The "regression" here >> is that we now properly can determine that the hardware is crap. > > Before 4.12, when I unplugged my USB3 Flash drive, Linux would > detect a few "Uncorrected Non-Fatal errors" via AER, but it was > still possible to plug the drive back in. > > Since 4.12, once I unplug the drive, the whole USB3 card is marked > as dead (all 4 ports), and I can no longer plug anything in (not even > the USB2 drive that didn't have any issues, IIRC). > > It seems a bit premature to "mark as dead" something that remains > functional, doesn't it? > > Disclaimer, there are many variables in this setup, and I've only > tested a small fraction of the problem space: only one system, > only one USB3 board, only one USB3 Flash drive. > Please don't forget to mention that this is quirky hardware that depends on BROKEN because it multiplexes MMIO and config space accesses in the same memory window without any locking whatsoever (which would be difficult to do in the first place because we don't use accessors for MMIO in the kernel). So how likely is it that you are attempting to read from the xhci BAR window while a config space access is in progress? Any way to instrument this in your driver? >> So, how do you think we should proceed, delay a bit longer before saying >> the device is gone? How long is "long enough"? How many bus errors are >> we allowed to tolerate (hint, the PCI spec says none...) >> >> Maybe someone wants to get to the root problem here, why is the hardware >> suddenly reporting all 1s? > > I'm afraid I won't be able to make any progress on this front, > unless I can get my hands on a PCIe packet analyzer. > > Regards. > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Wed, Aug 30, 2017 at 10:07:59AM +0100, Ard Biesheuvel wrote: > On 30 August 2017 at 09:55, Mason <slash.tmp@free.fr> wrote: > > On 30/08/2017 08:02, Greg Kroah-Hartman wrote: > > > >> To get back to the original issue here, the hardware seems to have died, > >> the driver stops talking to it, and all is good. The "regression" here > >> is that we now properly can determine that the hardware is crap. > > > > Before 4.12, when I unplugged my USB3 Flash drive, Linux would > > detect a few "Uncorrected Non-Fatal errors" via AER, but it was > > still possible to plug the drive back in. > > > > Since 4.12, once I unplug the drive, the whole USB3 card is marked > > as dead (all 4 ports), and I can no longer plug anything in (not even > > the USB2 drive that didn't have any issues, IIRC). > > > > It seems a bit premature to "mark as dead" something that remains > > functional, doesn't it? > > > > Disclaimer, there are many variables in this setup, and I've only > > tested a small fraction of the problem space: only one system, > > only one USB3 board, only one USB3 Flash drive. > > > > Please don't forget to mention that this is quirky hardware that > depends on BROKEN because it multiplexes MMIO and config space > accesses in the same memory window without any locking whatsoever > (which would be difficult to do in the first place because we don't > use accessors for MMIO in the kernel). > > So how likely is it that you are attempting to read from the xhci BAR > window while a config space access is in progress? Any way to > instrument this in your driver? Seriously? Ok, that's crap hardware, sorry, I don't feel bad at all here. You are going to have worse problems than just a single USB controller issue if that's your hardware design, go kick some hardware engineers for me please. good luck, you are on your own :( greg k-h
On 30/08/2017 11:07, Ard Biesheuvel wrote: > Please don't forget to mention that this is quirky hardware that > depends on BROKEN because it multiplexes MMIO and config space > accesses in the same memory window without any locking whatsoever > (which would be difficult to do in the first place because we don't > use accessors for MMIO in the kernel). You're right, it was in the back of my mind, but I didn't state it explicitly for the benefit of linux-usb readers. > So how likely is it that you are attempting to read from the xhci BAR > window while a config space access is in progress? Any way to > instrument this in your driver? I logged config space accesses here: https://www.spinics.net/lists/arm-kernel/msg602832.html IIRC, the config space accesses are generated by the AER ISR. So disabling the AER driver should guarantee that no config space accesses are occurring when the drive is unplugged. Regards.
On 30/08/2017 11:37, Mason wrote: > On 30/08/2017 11:07, Ard Biesheuvel wrote: > >> Please don't forget to mention that this is quirky hardware that >> depends on BROKEN because it multiplexes MMIO and config space >> accesses in the same memory window without any locking whatsoever >> (which would be difficult to do in the first place because we don't >> use accessors for MMIO in the kernel). > > You're right, it was in the back of my mind, but I didn't state > it explicitly for the benefit of linux-usb readers. > >> So how likely is it that you are attempting to read from the xhci >> BAR window while a config space access is in progress? Any way to >> instrument this in your driver? > > I logged config space accesses here: > > https://www.spinics.net/lists/arm-kernel/msg602832.html > > IIRC, the config space accesses are generated by the AER ISR. > So disabling the AER driver should guarantee that no config space > accesses are occurring when the drive is unplugged. I checked, and I *did* remember correctly. Disabling the AER driver results in 0 config space access occurring when the USB3 drive is unplugged. This confirms that the controller's broken design (muxing config and mem space) is not responsible for the glitches occurring on unplug events. Furthermore, I confirm that once the controller has been deemed "dead", even USB2 drives are no longer detected, and all USB port on the PCIe board are disabled. Regards. For reads/writes in config space, I have: if (do_debug) { printk("\t READ: bus=%d devfn=%u where=%d size=%d val=0x%x\n", bus->number, devfn, where, size, *val); dump_stack(); } if (do_debug) { printk("\tWRITE: bus=%d devfn=%u where=%d size=%d val=0x%x\n", bus->number, devfn, where, size, val); dump_stack(); } During setup I do get, e.g. [ 7.621417] READ: bus=1 devfn=0 where=84 size=2 val=0x8 [ 7.626840] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G C 4.12.0-rc1 #2 [ 7.634358] Hardware name: Sigma Tango DT [ 7.638387] [<c010e8b4>] (unwind_backtrace) from [<c010ac00>] (show_stack+0x10/0x14) [ 7.646171] [<c010ac00>] (show_stack) from [<c0257a30>] (dump_stack+0x84/0x98) [ 7.653429] [<c0257a30>] (dump_stack) from [<c029cb34>] (smp8759_config_read+0xa0/0xa4) [ 7.661474] [<c029cb34>] (smp8759_config_read) from [<c0282908>] (pci_bus_read_config_word+0x6c/0x94) [ 7.670742] [<c0282908>] (pci_bus_read_config_word) from [<c0282cfc>] (pci_read_config_word+0x24/0x38) [ 7.680097] [<c0282cfc>] (pci_read_config_word) from [<c028c5c0>] (__pci_dev_reset+0x11c/0x2fc) [ 7.688841] [<c028c5c0>] (__pci_dev_reset) from [<c028c9c4>] (pci_probe_reset_function+0xc/0x10) [ 7.697673] [<c028c9c4>] (pci_probe_reset_function) from [<c028f720>] (pci_create_sysfs_dev_files+0x2a8/0x374) [ 7.707728] [<c028f720>] (pci_create_sysfs_dev_files) from [<c0515cf8>] (pci_sysfs_init+0x34/0x54) [ 7.716734] [<c0515cf8>] (pci_sysfs_init) from [<c010175c>] (do_one_initcall+0x44/0x168) [ 7.724867] [<c010175c>] (do_one_initcall) from [<c0500dd8>] (kernel_init_freeable+0x15c/0x1e8) [ 7.733611] [<c0500dd8>] (kernel_init_freeable) from [<c0332348>] (kernel_init+0x8/0x108) [ 7.741831] [<c0332348>] (kernel_init) from [<c01076f8>] (ret_from_fork+0x14/0x3c) On plug/unplug events, there are no config space accesses: [ 88.006750] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd [ 88.040179] usb 2-2: New USB device found, idVendor=0951, idProduct=1666 [ 88.046930] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3 [ 88.054177] usb 2-2: Product: DataTraveler 3.0 [ 88.058684] usb 2-2: Manufacturer: Kingston [ 88.062927] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA [ 88.071523] usb-storage 2-2:1.0: USB Mass Storage device detected [ 88.081334] scsi host0: usb-storage 2-2:1.0 [ 89.096074] scsi 0:0:0:0: Direct-Access Kingston DataTraveler 3.0 PQ: 0 ANSI: 6 [ 89.104828] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB) [ 89.112996] sd 0:0:0:0: [sda] Write Protect is off [ 89.118060] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA [ 89.129463] sda: sda1 [ 89.133104] sd 0:0:0:0: [sda] Attached SCSI removable disk [ 103.375210] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead [ 103.382917] xhci_hcd 0000:01:00.0: HC died; cleaning up [ 103.388281] usb 2-2: USB disconnect, device number 2
On 30/08/2017 11:06, Greg Kroah-Hartman wrote: > On Wed, Aug 30, 2017 at 10:55:37AM +0200, Mason wrote: > >> On 30/08/2017 08:02, Greg Kroah-Hartman wrote: >> >>> To get back to the original issue here, the hardware seems to have died, >>> the driver stops talking to it, and all is good. The "regression" here >>> is that we now properly can determine that the hardware is crap. >> >> Before 4.12, when I unplugged my USB3 Flash drive, Linux would >> detect a few "Uncorrected Non-Fatal errors" via AER, but it was >> still possible to plug the drive back in. >> >> Since 4.12, once I unplug the drive, the whole USB3 card is marked >> as dead (all 4 ports), and I can no longer plug anything in (not even >> the USB2 drive that didn't have any issues, IIRC). >> >> It seems a bit premature to "mark as dead" something that remains >> functional, doesn't it? > > I agree, but if the device sends all ones, it's a good indication it is > really dead, right? Or something is wrong with it. I wouldn't call it dead if I can plug the drive back in, and have it working... But I agree that something fishy is happening... >> Disclaimer, there are many variables in this setup, and I've only >> tested a small fraction of the problem space: only one system, >> only one USB3 board, only one USB3 Flash drive. > > Did you ever happen to narrow this down to a single git commit using > 'git bisect'? I can't remember what happened in the beginning of this > thread... Mathias pointed out d9f11ba9f107aa335091ab8d7ba5eea714e46e8b >>> So, how do you think we should proceed, delay a bit longer before saying >>> the device is gone? How long is "long enough"? How many bus errors are >>> we allowed to tolerate (hint, the PCI spec says none...) >>> >>> Maybe someone wants to get to the root problem here, why is the hardware >>> suddenly reporting all 1s? >> >> I'm afraid I won't be able to make any progress on this front, >> unless I can get my hands on a PCIe packet analyzer. > > Odds of that happening are pretty rare, right? I've never even seen one > of those... I had a "Summit T24 Analyzer" on my desk a few months ago, but I was getting strange results, and the knowledgeable people in my company were not available at the time. http://teledynelecroy.com/protocolanalyzer/protocoloverview.aspx?seriesid=445 Regards.
On 31.08.2017 12:17, Mason wrote: > On 30/08/2017 11:37, Mason wrote: > >> On 30/08/2017 11:07, Ard Biesheuvel wrote: >> >>> Please don't forget to mention that this is quirky hardware that >>> depends on BROKEN because it multiplexes MMIO and config space >>> accesses in the same memory window without any locking whatsoever >>> (which would be difficult to do in the first place because we don't >>> use accessors for MMIO in the kernel). >> >> You're right, it was in the back of my mind, but I didn't state >> it explicitly for the benefit of linux-usb readers. >> >>> So how likely is it that you are attempting to read from the xhci >>> BAR window while a config space access is in progress? Any way to >>> instrument this in your driver? >> >> I logged config space accesses here: >> >> https://www.spinics.net/lists/arm-kernel/msg602832.html >> >> IIRC, the config space accesses are generated by the AER ISR. >> So disabling the AER driver should guarantee that no config space >> accesses are occurring when the drive is unplugged. > > I checked, and I *did* remember correctly. > > Disabling the AER driver results in 0 config space access occurring > when the USB3 drive is unplugged. This confirms that the controller's > broken design (muxing config and mem space) is not responsible for > the glitches occurring on unplug events. > > Furthermore, I confirm that once the controller has been deemed "dead", > even USB2 drives are no longer detected, and all USB port on the PCIe > board are disabled. xhci handles both USB3 and USB2, If there is only a xhci in use then all usb ports will be disabled. Many systems have both ehci and xhci, where ehci handles USB2 side. I'm guessing yours only have the xhci. -Mathias
On 31.08.2017 12:39, Mason wrote: > On 30/08/2017 11:06, Greg Kroah-Hartman wrote: > >> On Wed, Aug 30, 2017 at 10:55:37AM +0200, Mason wrote: >> >>> On 30/08/2017 08:02, Greg Kroah-Hartman wrote: >>> >>>> To get back to the original issue here, the hardware seems to have died, >>>> the driver stops talking to it, and all is good. The "regression" here >>>> is that we now properly can determine that the hardware is crap. >>> >>> Before 4.12, when I unplugged my USB3 Flash drive, Linux would >>> detect a few "Uncorrected Non-Fatal errors" via AER, but it was >>> still possible to plug the drive back in. >>> >>> Since 4.12, once I unplug the drive, the whole USB3 card is marked >>> as dead (all 4 ports), and I can no longer plug anything in (not even >>> the USB2 drive that didn't have any issues, IIRC). >>> >>> It seems a bit premature to "mark as dead" something that remains >>> functional, doesn't it? >> >> I agree, but if the device sends all ones, it's a good indication it is >> really dead, right? Or something is wrong with it. > > I wouldn't call it dead if I can plug the drive back in, and have > it working... But I agree that something fishy is happening... > >>> Disclaimer, there are many variables in this setup, and I've only >>> tested a small fraction of the problem space: only one system, >>> only one USB3 board, only one USB3 Flash drive. >> >> Did you ever happen to narrow this down to a single git commit using >> 'git bisect'? I can't remember what happened in the beginning of this >> thread... > > Mathias pointed out d9f11ba9f107aa335091ab8d7ba5eea714e46e8b > That patch only changes how xhci reacts to reading 0xffffffff. we used to just returned -ENODEV, but after patch we assume hardware is broken or removed. -Mathias
diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c index 5e3e9d4c6956..c7ea7d4c801f 100644 --- a/drivers/usb/host/xhci-hub.c +++ b/drivers/usb/host/xhci-hub.c @@ -23,6 +23,7 @@ #include <linux/slab.h> #include <asm/unaligned.h> +#include <linux/pci.h> #include "xhci.h" #include "xhci-trace.h" @@ -1268,7 +1269,10 @@ int xhci_hub_control(struct usb_hcd *hcd, u16 typeReq, u16 wValue, wIndex--; temp = readl(port_array[wIndex]); if (temp == ~(u32)0) { - xhci_hc_died(xhci); + struct pci_dev *pdev = to_pci_dev(hcd->self.controller); + xhci_err(xhci, "ClearPortFeat port%d @%p=%x, hcd->state:0x%x hcd->flags:0x%x, pci_state 0x%x\n", + wIndex, port_array[wIndex], temp, hcd->state, hcd->flags, pdev->current_state); + WARN_ON(1); retval = -ENODEV; break; }