diff mbox

Possible regression between 4.9 and 4.13

Message ID 0b089b17-00fc-5a7c-baa3-e6141029b191@free.fr (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Mason Aug. 28, 2017, 2:40 p.m. UTC
On 28/08/2017 10:39, Mathias Nyman wrote:

> Could you take a log with the following added debug, without
> your extra delays, It should show a bit more about the state
> of the controller when we read 0xffffffff

I applied the following patch on top of v4.12-rc1



And here are logs I get when I plug/unplug my USB3 device.

[   14.970148] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   15.003487] usb 2-2: New USB device found, idVendor=0951, idProduct=1666
[   15.010237] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   15.017483] usb 2-2: Product: DataTraveler 3.0
[   15.021990] usb 2-2: Manufacturer: Kingston
[   15.026234] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA
[   15.034830] usb-storage 2-2:1.0: USB Mass Storage device detected
[   15.041269] scsi host0: usb-storage 2-2:1.0
[   16.056140] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   16.064979] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   16.072978] sd 0:0:0:0: [sda] Write Protect is off
[   16.078076] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   16.089417]  sda: sda1
[   16.093050] sd 0:0:0:0: [sda] Attached SCSI removable disk


[   22.152078] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   22.160157] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   22.172051] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   22.180493] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   22.187368] pcieport 0000:00:00.0: AER: Device recovery failed
[   22.885269] xhci_hcd 0000:01:00.0: ClearPortFeat port1 @e0852430=ffffffff, hcd->state:0x1 hcd->flags:0x1a5, pci_state 0x0
[   22.896284] ------------[ cut here ]------------
[   22.900938] WARNING: CPU: 0 PID: 127 at drivers/usb/host/xhci-hub.c:1275 xhci_hub_control+0x10f4/0x1778
[   22.910377] Modules linked in:
[   22.913447] CPU: 0 PID: 127 Comm: kworker/0:1 Tainted: G         C      4.12.0-rc1 #4
[   22.921314] Hardware name: Sigma Tango DT
[   22.925342] Workqueue: usb_hub_wq hub_event
[   22.929564] [<c010e8b4>] (unwind_backtrace) from [<c010ac00>] (show_stack+0x10/0x14)
[   22.937353] [<c010ac00>] (show_stack) from [<c0257a30>] (dump_stack+0x84/0x98)
[   22.944617] [<c0257a30>] (dump_stack) from [<c01183d0>] (__warn+0xe8/0x100)
[   22.951616] [<c01183d0>] (__warn) from [<c0118498>] (warn_slowpath_null+0x20/0x28)
[   22.959227] [<c0118498>] (warn_slowpath_null) from [<c031ad90>] (xhci_hub_control+0x10f4/0x1778)
[   22.968062] [<c031ad90>] (xhci_hub_control) from [<c02fbb4c>] (usb_hcd_submit_urb+0x264/0x810)
[   22.976719] [<c02fbb4c>] (usb_hcd_submit_urb) from [<c02fccec>] (usb_submit_urb+0x2b0/0x4b4)
[   22.985201] [<c02fccec>] (usb_submit_urb) from [<c02fd3c4>] (usb_start_wait_urb+0x4c/0xbc)
[   22.993509] [<c02fd3c4>] (usb_start_wait_urb) from [<c02fd4d4>] (usb_control_msg+0xa0/0xcc)
[   23.001904] [<c02fd4d4>] (usb_control_msg) from [<c02f5718>] (usb_clear_port_feature+0x44/0x4c)
[   23.010648] [<c02f5718>] (usb_clear_port_feature) from [<c02f60fc>] (hub_port_reset+0x228/0x51c)
[   23.019479] [<c02f60fc>] (hub_port_reset) from [<c02f82f0>] (hub_event+0x1f4/0xe64)
[   23.027177] [<c02f82f0>] (hub_event) from [<c012d398>] (process_one_work+0x1d4/0x3ec)
[   23.035049] [<c012d398>] (process_one_work) from [<c012dfb4>] (worker_thread+0x38/0x554)
[   23.043185] [<c012dfb4>] (worker_thread) from [<c0132c84>] (kthread+0x108/0x138)
[   23.050620] [<c0132c84>] (kthread) from [<c01076f8>] (ret_from_fork+0x14/0x3c)
[   23.057877] ---[ end trace 5e4494cf1f6e3761 ]---
[   23.062691] xhci_hcd 0000:01:00.0: ClearPortFeat port1 @e0852430=ffffffff, hcd->state:0x1 hcd->flags:0x1a5, pci_state 0x0
[   23.073707] ------------[ cut here ]------------
[   23.078349] WARNING: CPU: 0 PID: 127 at drivers/usb/host/xhci-hub.c:1275 xhci_hub_control+0x10f4/0x1778
[   23.087787] Modules linked in:
[   23.090854] CPU: 0 PID: 127 Comm: kworker/0:1 Tainted: G        WC      4.12.0-rc1 #4
[   23.098720] Hardware name: Sigma Tango DT
[   23.102745] Workqueue: usb_hub_wq hub_event
[   23.106953] [<c010e8b4>] (unwind_backtrace) from [<c010ac00>] (show_stack+0x10/0x14)
[   23.114737] [<c010ac00>] (show_stack) from [<c0257a30>] (dump_stack+0x84/0x98)
[   23.121998] [<c0257a30>] (dump_stack) from [<c01183d0>] (__warn+0xe8/0x100)
[   23.128996] [<c01183d0>] (__warn) from [<c0118498>] (warn_slowpath_null+0x20/0x28)
[   23.136606] [<c0118498>] (warn_slowpath_null) from [<c031ad90>] (xhci_hub_control+0x10f4/0x1778)
[   23.145439] [<c031ad90>] (xhci_hub_control) from [<c02fbb4c>] (usb_hcd_submit_urb+0x264/0x810)
[   23.154095] [<c02fbb4c>] (usb_hcd_submit_urb) from [<c02fccec>] (usb_submit_urb+0x2b0/0x4b4)
[   23.162577] [<c02fccec>] (usb_submit_urb) from [<c02fd3c4>] (usb_start_wait_urb+0x4c/0xbc)
[   23.170884] [<c02fd3c4>] (usb_start_wait_urb) from [<c02fd4d4>] (usb_control_msg+0xa0/0xcc)
[   23.179278] [<c02fd4d4>] (usb_control_msg) from [<c02f5718>] (usb_clear_port_feature+0x44/0x4c)
[   23.188021] [<c02f5718>] (usb_clear_port_feature) from [<c02f611c>] (hub_port_reset+0x248/0x51c)
[   23.196851] [<c02f611c>] (hub_port_reset) from [<c02f82f0>] (hub_event+0x1f4/0xe64)
[   23.204547] [<c02f82f0>] (hub_event) from [<c012d398>] (process_one_work+0x1d4/0x3ec)
[   23.212418] [<c012d398>] (process_one_work) from [<c012dfb4>] (worker_thread+0x38/0x554)
[   23.220551] [<c012dfb4>] (worker_thread) from [<c0132c84>] (kthread+0x108/0x138)
[   23.227986] [<c0132c84>] (kthread) from [<c01076f8>] (ret_from_fork+0x14/0x3c)
[   23.235242] ---[ end trace 5e4494cf1f6e3762 ]---
[   23.239953] xhci_hcd 0000:01:00.0: Cannot set link state.
[   23.245403] usb usb2-port2: cannot disable (err = -32)
[   23.250575] usb 2-2: USB disconnect, device number 2
[   23.255724] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   23.264048] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   23.275985] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   23.284417] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   23.291256] pcieport 0000:00:00.0: AER: Device recovery failed
[   23.297144] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   23.305218] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   23.317047] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   23.325467] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   23.332309] pcieport 0000:00:00.0: AER: Device recovery failed
[   23.338188] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   23.346273] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   23.358093] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   23.366518] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   23.373357] pcieport 0000:00:00.0: AER: Device recovery failed
[   23.379229] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   23.387287] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   23.399101] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   23.407504] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   23.414344] pcieport 0000:00:00.0: AER: Device recovery failed
[   23.434143] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   23.442263] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   23.454100] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   23.462542] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   23.469427] pcieport 0000:00:00.0: AER: Device recovery failed


Regards.

Comments

Mathias Nyman Aug. 29, 2017, 1:28 p.m. UTC | #1
On 28.08.2017 17:40, Mason wrote:
> On 28/08/2017 10:39, Mathias Nyman wrote:
>
>> Could you take a log with the following added debug, without
>> your extra delays, It should show a bit more about the state
>> of the controller when we read 0xffffffff
>
> I applied the following patch on top of v4.12-rc1
>
> diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
> index 5e3e9d4c6956..c7ea7d4c801f 100644
> --- a/drivers/usb/host/xhci-hub.c
> +++ b/drivers/usb/host/xhci-hub.c
> @@ -23,6 +23,7 @@
>
>   #include <linux/slab.h>
>   #include <asm/unaligned.h>
> +#include <linux/pci.h>
>
>   #include "xhci.h"
>   #include "xhci-trace.h"
> @@ -1268,7 +1269,10 @@ int xhci_hub_control(struct usb_hcd *hcd, u16 typeReq, u16 wValue,
>   		wIndex--;
>   		temp = readl(port_array[wIndex]);
>   		if (temp == ~(u32)0) {
> -			xhci_hc_died(xhci);
> +			struct pci_dev *pdev = to_pci_dev(hcd->self.controller);
> +			xhci_err(xhci, "ClearPortFeat port%d @%p=%x, hcd->state:0x%x hcd->flags:0x%x, pci_state 0x%x\n",
> +					wIndex, port_array[wIndex], temp, hcd->state, hcd->flags, pdev->current_state);
> +			WARN_ON(1);
>   			retval = -ENODEV;
>   			break;
>   		}
>
>
> And here are logs I get when I plug/unplug my USB3 device.
>
> [   14.970148] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
> [   15.003487] usb 2-2: New USB device found, idVendor=0951, idProduct=1666
> [   15.010237] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
> [   15.017483] usb 2-2: Product: DataTraveler 3.0
> [   15.021990] usb 2-2: Manufacturer: Kingston
> [   15.026234] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA
> [   15.034830] usb-storage 2-2:1.0: USB Mass Storage device detected
> [   15.041269] scsi host0: usb-storage 2-2:1.0
> [   16.056140] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
> [   16.064979] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
> [   16.072978] sd 0:0:0:0: [sda] Write Protect is off
> [   16.078076] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
> [   16.089417]  sda: sda1
> [   16.093050] sd 0:0:0:0: [sda] Attached SCSI removable disk
>
>
> [   22.152078] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
> [   22.160157] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
> [   22.172051] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
> [   22.180493] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
> [   22.187368] pcieport 0000:00:00.0: AER: Device recovery failed
> [   22.885269] xhci_hcd 0000:01:00.0: ClearPortFeat port1 @e0852430=ffffffff, hcd->state:0x1 hcd->flags:0x1a5, pci_state 0x0

State is HC_STATE_RUNNING,

Flags bits 0,2,5,7,8 set:
#define HCD_FLAG_HW_ACCESSIBLE          0       /* at full power */
#define HCD_FLAG_POLL_RH                2       /* poll for rh status? */
#define HCD_FLAG_RH_RUNNING             5       /* root hub is running? */
#define HCD_FLAG_INTF_AUTHORIZED        7       /* authorize interfaces? */
#define HCD_FLAG_DEV_AUTHORIZED         8       /* authorize devices? */

And pci state seems to be D0 (according to driver, pdev->current_state)

I can't see anything wrong from xhci/usb point of view.
I'd focus more on the PCI errors in the logs as the cause for reading 0xffffffff from xhci mmio.

Then again it might be a bit too drastic to kill xhci just because we read 0xffffffff once
from a mmio xhci register. Maybe we should return an error a couple times before actually
tearing down xhci.

This tight check was originally done to detect pci hotplug removed hosts as soon as possible.

-Mathias
Lukas Wunner Aug. 29, 2017, 1:38 p.m. UTC | #2
On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> Then again it might be a bit too drastic to kill xhci just because
> we read 0xffffffff once from a mmio xhci register. Maybe we should
> return an error a couple times before actually tearing down xhci.
> 
> This tight check was originally done to detect pci hotplug removed
> hosts as soon as possible.

Just make pci_dev_is_disconnected() public to detect PCI hot removal.
We *know* when the device was hot removed, so I think there's no need
to guess that based on reading "all ones" from mmio (which may happen
for entirely legitimate reasons unrelated to hot removal).

Thanks,

Lukas
Greg KH Aug. 29, 2017, 2:47 p.m. UTC | #3
On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote:
> On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> > Then again it might be a bit too drastic to kill xhci just because
> > we read 0xffffffff once from a mmio xhci register. Maybe we should
> > return an error a couple times before actually tearing down xhci.
> > 
> > This tight check was originally done to detect pci hotplug removed
> > hosts as soon as possible.
> 
> Just make pci_dev_is_disconnected() public to detect PCI hot removal.
> We *know* when the device was hot removed, so I think there's no need
> to guess that based on reading "all ones" from mmio (which may happen
> for entirely legitimate reasons unrelated to hot removal).

No, you don't always "know" when a device is removed, don't rely on it,
not all platforms support that.

One more reason why I hate that function and I'm glad it's not exported
for others to think it somehow actually works for their system...

Reading all ff shows the device is removed, that's all the PCI spec
guarantees.  What other legitimate reason could that happen for?

thanks,

greg k-h
Lukas Wunner Aug. 29, 2017, 3:34 p.m. UTC | #4
On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote:
> On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote:
> > On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> > > Then again it might be a bit too drastic to kill xhci just because
> > > we read 0xffffffff once from a mmio xhci register. Maybe we should
> > > return an error a couple times before actually tearing down xhci.
> > > 
> > > This tight check was originally done to detect pci hotplug removed
> > > hosts as soon as possible.
> > 
> > Just make pci_dev_is_disconnected() public to detect PCI hot removal.
> > We *know* when the device was hot removed, so I think there's no need
> > to guess that based on reading "all ones" from mmio (which may happen
> > for entirely legitimate reasons unrelated to hot removal).
> 
> No, you don't always "know" when a device is removed, don't rely on it,
> not all platforms support that.

Please explain, which platforms don't support that?  They wouldn't be
compliant with the spec it seems.

PCIe r3.1, section 6.7.3:

	"A Downstream Port with hot-plug capabilities supports the
	 following hot-plug events:

	 Presence Detect Changed

	 A Downstream Port with hot-plug capabilities monitors the slot
	 it controls for the slot events listed above. [...]
	 If enabled through the associated enable field, slot events
	 must generate a software notification."

And pciehp sets the flag on all downstream devices that they're removed
once the software notification has been received and processed.


> Reading all ff shows the device is removed, that's all the PCI spec
> guarantees.  What other legitimate reason could that happen for?

Is 0xffffffff not a valid value to be stored in and read from mmio space?

Best regards,

Lukas
Greg KH Aug. 29, 2017, 3:51 p.m. UTC | #5
On Tue, Aug 29, 2017 at 05:34:56PM +0200, Lukas Wunner wrote:
> On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote:
> > On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote:
> > > On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> > > > Then again it might be a bit too drastic to kill xhci just because
> > > > we read 0xffffffff once from a mmio xhci register. Maybe we should
> > > > return an error a couple times before actually tearing down xhci.
> > > > 
> > > > This tight check was originally done to detect pci hotplug removed
> > > > hosts as soon as possible.
> > > 
> > > Just make pci_dev_is_disconnected() public to detect PCI hot removal.
> > > We *know* when the device was hot removed, so I think there's no need
> > > to guess that based on reading "all ones" from mmio (which may happen
> > > for entirely legitimate reasons unrelated to hot removal).
> > 
> > No, you don't always "know" when a device is removed, don't rely on it,
> > not all platforms support that.
> 
> Please explain, which platforms don't support that?  They wouldn't be
> compliant with the spec it seems.
> 
> PCIe r3.1, section 6.7.3:
> 
> 	"A Downstream Port with hot-plug capabilities supports the
> 	 following hot-plug events:
> 
> 	 Presence Detect Changed
> 
> 	 A Downstream Port with hot-plug capabilities monitors the slot
> 	 it controls for the slot events listed above. [...]
> 	 If enabled through the associated enable field, slot events
> 	 must generate a software notification."
> 
> And pciehp sets the flag on all downstream devices that they're removed
> once the software notification has been received and processed.

What about all of the non-pciehp platforms?  :)

Also, there is always a race between when that notification has been
sent and processed on the PCIe channel and the reading of all 1s on the
PCI bus by the driver.

For fun, try disconnecting a USB device from one of the more modern
laptops with a USB 3.1 connection on it.  The bios treats those as a pci
hotpluggable xhci controller on the PCI bus.  When the device is
disconnected, the BIOS rips out the PCI device as well, but all that
time, the xhci driver is thinking the device is still present as it
takes a while for the BIOS to do all of the needed housekeeping.  It's a
really long time for everything to shut down and to help prevent the
driver from going crazy, it has to detect ffff reads as "disconnection
happened".

All PCI drivers have had to do this for decades now, it's nothing new
here, PCIe just gave us a chance to be notified that the device really
is gone now, PCI hotplug has always been out-of-band like this.

> > Reading all ff shows the device is removed, that's all the PCI spec
> > guarantees.  What other legitimate reason could that happen for?
> 
> Is 0xffffffff not a valid value to be stored in and read from mmio space?

For a specific register, doubtful, which is why the code errors out,
right?  If it is a valid value, then it shouldn't be exiting, and move
on to the next read.

I don't understand what we are arguing about here anymore...

thanks,

greg k-h
Lukas Wunner Aug. 29, 2017, 11:53 p.m. UTC | #6
On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> This tight check was originally done to detect pci hotplug removed
> hosts as soon as possible.

In Mason's case, the parent of the XHCI controller isn't a hotplug port,
see this lspci output:

https://www.spinics.net/lists/linux-usb/msg160010.html

Please check is_hotplug_bridge in the parent's struct pci_dev before
assuming that the XHCI controller was unplugged.

Thanks,

Lukas
Greg KH Aug. 30, 2017, 6:02 a.m. UTC | #7
On Wed, Aug 30, 2017 at 01:53:10AM +0200, Lukas Wunner wrote:
> On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> > This tight check was originally done to detect pci hotplug removed
> > hosts as soon as possible.
> 
> In Mason's case, the parent of the XHCI controller isn't a hotplug port,
> see this lspci output:
> 
> https://www.spinics.net/lists/linux-usb/msg160010.html
> 
> Please check is_hotplug_bridge in the parent's struct pci_dev before
> assuming that the XHCI controller was unplugged.

How can you guarantee that this is set on some systems?  Will it be set
on cardbus devices?  What about on a "normal" system where I can just go
and yank out a PCI card at will?

I don't think this is a valid thing to check, and again, why are we
arguing this point?  It's been this way since the 1990's, this isn't a
new thing...

To get back to the original issue here, the hardware seems to have died,
the driver stops talking to it, and all is good.  The "regression" here
is that we now properly can determine that the hardware is crap.

So, how do you think we should proceed, delay a bit longer before saying
the device is gone?  How long is "long enough"?  How many bus errors are
we allowed to tolerate (hint, the PCI spec says none...)

Maybe someone wants to get to the root problem here, why is the hardware
suddenly reporting all 1s?

thanks,

greg k-h
Lukas Wunner Aug. 30, 2017, 6:36 a.m. UTC | #8
On Tue, Aug 29, 2017 at 05:51:38PM +0200, Greg Kroah-Hartman wrote:
> On Tue, Aug 29, 2017 at 05:34:56PM +0200, Lukas Wunner wrote:
> > On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote:
> > > On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote:
> > Is 0xffffffff not a valid value to be stored in and read from mmio space?
> 
> For a specific register, doubtful

Well, "doubtful" means you don't know for sure.

It's fine to check for "all ones" as a heuristic if that's not a valid
value for the register read, however a hotplug notification is a
*definitive* indication the hardware is gone.

I you seem to prefer forgoing a *definitive* indication for a mere
heuristic, that doesn't make sense from my point of view.


> It's a really long time for everything to shut down and to help
> prevent the driver from going crazy, [...]
> Also, there is always a race between when that notification has been
> sent and processed on the PCIe channel and the reading of all 1s on the
> PCI bus by the driver.

Yes I know that.  In practice the interrupt signaling hot removal
comes fast enough to prevent drivers from "going crazy", as I've
mentioned in this patch:

https://patchwork.kernel.org/patch/9405255/


> For fun, try disconnecting a USB device from one of the more modern
> laptops with a USB 3.1 connection on it.  The bios treats those as a pci
> hotpluggable xhci controller on the PCI bus.  When the device is
> disconnected, the BIOS rips out the PCI device as well, but all that
> time, the xhci driver is thinking the device is still present as it
> takes a while for the BIOS to do all of the needed housekeeping.

Yes, that is the case with Thunderbolt 3 controllers on non-Macs:
The XHCI controller appears below downstream bridge 2 of the Thunderbolt
controller's PCIe switch.  Once the last device is removed, the PCIe
switch and all devices below it disappear and the controller is powered
down.  The controller is thus only visible if powered up.  On Mac this
is all native instead:  Native pciehp, native tunnel setup, native PM.


> > > > Just make pci_dev_is_disconnected() public to detect PCI hot removal.
> > > > We *know* when the device was hot removed, so I think there's no need
> > > > to guess that based on reading "all ones" from mmio (which may happen
> > > > for entirely legitimate reasons unrelated to hot removal).
> > > 
> > > No, you don't always "know" when a device is removed, don't rely on it,
> > > not all platforms support that.
> > 
> > Please explain, which platforms don't support that?  They wouldn't be
> > compliant with the spec it seems.
> 
> What about all of the non-pciehp platforms?  :)

Fair enough, those should be extended to set PCI_DEV_DISCONNECTED as well.

Thanks,

Lukas
Greg KH Aug. 30, 2017, 6:45 a.m. UTC | #9
On Wed, Aug 30, 2017 at 08:36:23AM +0200, Lukas Wunner wrote:
> On Tue, Aug 29, 2017 at 05:51:38PM +0200, Greg Kroah-Hartman wrote:
> > On Tue, Aug 29, 2017 at 05:34:56PM +0200, Lukas Wunner wrote:
> > > On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote:
> > > > On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote:
> > > Is 0xffffffff not a valid value to be stored in and read from mmio space?
> > 
> > For a specific register, doubtful
> 
> Well, "doubtful" means you don't know for sure.
> 
> It's fine to check for "all ones" as a heuristic if that's not a valid
> value for the register read, however a hotplug notification is a
> *definitive* indication the hardware is gone.
> 
> I you seem to prefer forgoing a *definitive* indication for a mere
> heuristic, that doesn't make sense from my point of view.

I still don't know what you are arguing about here.  The _driver_ knows
if a specific read allows all ones as a valid return value.  If it
isn't, then the driver knows the device is now gone.  It's that simple,
don't do that type of check if all ones is a valid read.

And that's not what is happening here anyway, so again, what is this
discussion about?

Unless there's something specific we can do here for the xhci driver, I
think this thread is dead until someone determines what is going wrong
with the hardware the original reporter posted about.

greg k-h
Mason Aug. 30, 2017, 8:55 a.m. UTC | #10
On 30/08/2017 08:02, Greg Kroah-Hartman wrote:

> To get back to the original issue here, the hardware seems to have died,
> the driver stops talking to it, and all is good.  The "regression" here
> is that we now properly can determine that the hardware is crap.

Before 4.12, when I unplugged my USB3 Flash drive, Linux would
detect a few "Uncorrected Non-Fatal errors" via AER, but it was
still possible to plug the drive back in.

Since 4.12, once I unplug the drive, the whole USB3 card is marked
as dead (all 4 ports), and I can no longer plug anything in (not even
the USB2 drive that didn't have any issues, IIRC).

It seems a bit premature to "mark as dead" something that remains
functional, doesn't it?

Disclaimer, there are many variables in this setup, and I've only
tested a small fraction of the problem space: only one system,
only one USB3 board, only one USB3 Flash drive.

> So, how do you think we should proceed, delay a bit longer before saying
> the device is gone?  How long is "long enough"?  How many bus errors are
> we allowed to tolerate (hint, the PCI spec says none...)
> 
> Maybe someone wants to get to the root problem here, why is the hardware
> suddenly reporting all 1s?

I'm afraid I won't be able to make any progress on this front,
unless I can get my hands on a PCIe packet analyzer.

Regards.
Greg KH Aug. 30, 2017, 9:06 a.m. UTC | #11
On Wed, Aug 30, 2017 at 10:55:37AM +0200, Mason wrote:
> On 30/08/2017 08:02, Greg Kroah-Hartman wrote:
> 
> > To get back to the original issue here, the hardware seems to have died,
> > the driver stops talking to it, and all is good.  The "regression" here
> > is that we now properly can determine that the hardware is crap.
> 
> Before 4.12, when I unplugged my USB3 Flash drive, Linux would
> detect a few "Uncorrected Non-Fatal errors" via AER, but it was
> still possible to plug the drive back in.
> 
> Since 4.12, once I unplug the drive, the whole USB3 card is marked
> as dead (all 4 ports), and I can no longer plug anything in (not even
> the USB2 drive that didn't have any issues, IIRC).
> 
> It seems a bit premature to "mark as dead" something that remains
> functional, doesn't it?

I agree, but if the device sends all ones, it's a good indication it is
really dead, right?  Or something is wrong with it.

> Disclaimer, there are many variables in this setup, and I've only
> tested a small fraction of the problem space: only one system,
> only one USB3 board, only one USB3 Flash drive.

Did you ever happen to narrow this down to a single git commit using
'git bisect'?  I can't remember what happened in the beginning of this
thread...

> > So, how do you think we should proceed, delay a bit longer before saying
> > the device is gone?  How long is "long enough"?  How many bus errors are
> > we allowed to tolerate (hint, the PCI spec says none...)
> > 
> > Maybe someone wants to get to the root problem here, why is the hardware
> > suddenly reporting all 1s?
> 
> I'm afraid I won't be able to make any progress on this front,
> unless I can get my hands on a PCIe packet analyzer.

Odds of that happening are pretty rare, right?  I've never even seen one
of those...

thanks,

greg k-h
Ard Biesheuvel Aug. 30, 2017, 9:07 a.m. UTC | #12
On 30 August 2017 at 09:55, Mason <slash.tmp@free.fr> wrote:
> On 30/08/2017 08:02, Greg Kroah-Hartman wrote:
>
>> To get back to the original issue here, the hardware seems to have died,
>> the driver stops talking to it, and all is good.  The "regression" here
>> is that we now properly can determine that the hardware is crap.
>
> Before 4.12, when I unplugged my USB3 Flash drive, Linux would
> detect a few "Uncorrected Non-Fatal errors" via AER, but it was
> still possible to plug the drive back in.
>
> Since 4.12, once I unplug the drive, the whole USB3 card is marked
> as dead (all 4 ports), and I can no longer plug anything in (not even
> the USB2 drive that didn't have any issues, IIRC).
>
> It seems a bit premature to "mark as dead" something that remains
> functional, doesn't it?
>
> Disclaimer, there are many variables in this setup, and I've only
> tested a small fraction of the problem space: only one system,
> only one USB3 board, only one USB3 Flash drive.
>

Please don't forget to mention that this is quirky hardware that
depends on BROKEN because it multiplexes MMIO and config space
accesses in the same memory window without any locking whatsoever
(which would be difficult to do in the first place because we don't
use accessors for MMIO in the kernel).

So how likely is it that you are attempting to read from the xhci BAR
window while a config space access is in progress? Any way to
instrument this in your driver?

>> So, how do you think we should proceed, delay a bit longer before saying
>> the device is gone?  How long is "long enough"?  How many bus errors are
>> we allowed to tolerate (hint, the PCI spec says none...)
>>
>> Maybe someone wants to get to the root problem here, why is the hardware
>> suddenly reporting all 1s?
>
> I'm afraid I won't be able to make any progress on this front,
> unless I can get my hands on a PCIe packet analyzer.
>
> Regards.
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Greg KH Aug. 30, 2017, 9:22 a.m. UTC | #13
On Wed, Aug 30, 2017 at 10:07:59AM +0100, Ard Biesheuvel wrote:
> On 30 August 2017 at 09:55, Mason <slash.tmp@free.fr> wrote:
> > On 30/08/2017 08:02, Greg Kroah-Hartman wrote:
> >
> >> To get back to the original issue here, the hardware seems to have died,
> >> the driver stops talking to it, and all is good.  The "regression" here
> >> is that we now properly can determine that the hardware is crap.
> >
> > Before 4.12, when I unplugged my USB3 Flash drive, Linux would
> > detect a few "Uncorrected Non-Fatal errors" via AER, but it was
> > still possible to plug the drive back in.
> >
> > Since 4.12, once I unplug the drive, the whole USB3 card is marked
> > as dead (all 4 ports), and I can no longer plug anything in (not even
> > the USB2 drive that didn't have any issues, IIRC).
> >
> > It seems a bit premature to "mark as dead" something that remains
> > functional, doesn't it?
> >
> > Disclaimer, there are many variables in this setup, and I've only
> > tested a small fraction of the problem space: only one system,
> > only one USB3 board, only one USB3 Flash drive.
> >
> 
> Please don't forget to mention that this is quirky hardware that
> depends on BROKEN because it multiplexes MMIO and config space
> accesses in the same memory window without any locking whatsoever
> (which would be difficult to do in the first place because we don't
> use accessors for MMIO in the kernel).
> 
> So how likely is it that you are attempting to read from the xhci BAR
> window while a config space access is in progress? Any way to
> instrument this in your driver?

Seriously?  Ok, that's crap hardware, sorry, I don't feel bad at all
here.  You are going to have worse problems than just a single USB
controller issue if that's your hardware design, go kick some hardware
engineers for me please.

good luck, you are on your own :(

greg k-h
Mason Aug. 30, 2017, 9:37 a.m. UTC | #14
On 30/08/2017 11:07, Ard Biesheuvel wrote:

> Please don't forget to mention that this is quirky hardware that
> depends on BROKEN because it multiplexes MMIO and config space
> accesses in the same memory window without any locking whatsoever
> (which would be difficult to do in the first place because we don't
> use accessors for MMIO in the kernel).

You're right, it was in the back of my mind, but I didn't state
it explicitly for the benefit of linux-usb readers.

> So how likely is it that you are attempting to read from the xhci BAR
> window while a config space access is in progress? Any way to
> instrument this in your driver?

I logged config space accesses here:

https://www.spinics.net/lists/arm-kernel/msg602832.html

IIRC, the config space accesses are generated by the AER ISR.
So disabling the AER driver should guarantee that no config space
accesses are occurring when the drive is unplugged.

Regards.
Mason Aug. 31, 2017, 9:17 a.m. UTC | #15
On 30/08/2017 11:37, Mason wrote:

> On 30/08/2017 11:07, Ard Biesheuvel wrote:
> 
>> Please don't forget to mention that this is quirky hardware that
>> depends on BROKEN because it multiplexes MMIO and config space
>> accesses in the same memory window without any locking whatsoever
>> (which would be difficult to do in the first place because we don't
>> use accessors for MMIO in the kernel).
> 
> You're right, it was in the back of my mind, but I didn't state
> it explicitly for the benefit of linux-usb readers.
> 
>> So how likely is it that you are attempting to read from the xhci
>> BAR window while a config space access is in progress? Any way to
>> instrument this in your driver?
> 
> I logged config space accesses here:
> 
> https://www.spinics.net/lists/arm-kernel/msg602832.html
> 
> IIRC, the config space accesses are generated by the AER ISR.
> So disabling the AER driver should guarantee that no config space
> accesses are occurring when the drive is unplugged.

I checked, and I *did* remember correctly.

Disabling the AER driver results in 0 config space access occurring
when the USB3 drive is unplugged. This confirms that the controller's
broken design (muxing config and mem space) is not responsible for
the glitches occurring on unplug events.

Furthermore, I confirm that once the controller has been deemed "dead",
even USB2 drives are no longer detected, and all USB port on the PCIe
board are disabled.

Regards.


For reads/writes in config space, I have:

	if (do_debug) {
		printk("\t READ: bus=%d devfn=%u where=%d size=%d val=0x%x\n",
			bus->number, devfn, where, size, *val);
		dump_stack();
	}

	if (do_debug) {
		printk("\tWRITE: bus=%d devfn=%u where=%d size=%d val=0x%x\n",
			bus->number, devfn, where, size, val);
		dump_stack();
	}

During setup I do get, e.g.

[    7.621417]   READ: bus=1 devfn=0 where=84 size=2 val=0x8
[    7.626840] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G         C      4.12.0-rc1 #2
[    7.634358] Hardware name: Sigma Tango DT
[    7.638387] [<c010e8b4>] (unwind_backtrace) from [<c010ac00>] (show_stack+0x10/0x14)
[    7.646171] [<c010ac00>] (show_stack) from [<c0257a30>] (dump_stack+0x84/0x98)
[    7.653429] [<c0257a30>] (dump_stack) from [<c029cb34>] (smp8759_config_read+0xa0/0xa4)
[    7.661474] [<c029cb34>] (smp8759_config_read) from [<c0282908>] (pci_bus_read_config_word+0x6c/0x94)
[    7.670742] [<c0282908>] (pci_bus_read_config_word) from [<c0282cfc>] (pci_read_config_word+0x24/0x38)
[    7.680097] [<c0282cfc>] (pci_read_config_word) from [<c028c5c0>] (__pci_dev_reset+0x11c/0x2fc)
[    7.688841] [<c028c5c0>] (__pci_dev_reset) from [<c028c9c4>] (pci_probe_reset_function+0xc/0x10)
[    7.697673] [<c028c9c4>] (pci_probe_reset_function) from [<c028f720>] (pci_create_sysfs_dev_files+0x2a8/0x374)
[    7.707728] [<c028f720>] (pci_create_sysfs_dev_files) from [<c0515cf8>] (pci_sysfs_init+0x34/0x54)
[    7.716734] [<c0515cf8>] (pci_sysfs_init) from [<c010175c>] (do_one_initcall+0x44/0x168)
[    7.724867] [<c010175c>] (do_one_initcall) from [<c0500dd8>] (kernel_init_freeable+0x15c/0x1e8)
[    7.733611] [<c0500dd8>] (kernel_init_freeable) from [<c0332348>] (kernel_init+0x8/0x108)
[    7.741831] [<c0332348>] (kernel_init) from [<c01076f8>] (ret_from_fork+0x14/0x3c)


On plug/unplug events, there are no config space accesses:

[   88.006750] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   88.040179] usb 2-2: New USB device found, idVendor=0951, idProduct=1666
[   88.046930] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   88.054177] usb 2-2: Product: DataTraveler 3.0
[   88.058684] usb 2-2: Manufacturer: Kingston
[   88.062927] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA
[   88.071523] usb-storage 2-2:1.0: USB Mass Storage device detected
[   88.081334] scsi host0: usb-storage 2-2:1.0
[   89.096074] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   89.104828] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   89.112996] sd 0:0:0:0: [sda] Write Protect is off
[   89.118060] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   89.129463]  sda: sda1
[   89.133104] sd 0:0:0:0: [sda] Attached SCSI removable disk

[  103.375210] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
[  103.382917] xhci_hcd 0000:01:00.0: HC died; cleaning up
[  103.388281] usb 2-2: USB disconnect, device number 2
Mason Aug. 31, 2017, 9:39 a.m. UTC | #16
On 30/08/2017 11:06, Greg Kroah-Hartman wrote:

> On Wed, Aug 30, 2017 at 10:55:37AM +0200, Mason wrote:
>
>> On 30/08/2017 08:02, Greg Kroah-Hartman wrote:
>>
>>> To get back to the original issue here, the hardware seems to have died,
>>> the driver stops talking to it, and all is good.  The "regression" here
>>> is that we now properly can determine that the hardware is crap.
>>
>> Before 4.12, when I unplugged my USB3 Flash drive, Linux would
>> detect a few "Uncorrected Non-Fatal errors" via AER, but it was
>> still possible to plug the drive back in.
>>
>> Since 4.12, once I unplug the drive, the whole USB3 card is marked
>> as dead (all 4 ports), and I can no longer plug anything in (not even
>> the USB2 drive that didn't have any issues, IIRC).
>>
>> It seems a bit premature to "mark as dead" something that remains
>> functional, doesn't it?
> 
> I agree, but if the device sends all ones, it's a good indication it is
> really dead, right?  Or something is wrong with it.

I wouldn't call it dead if I can plug the drive back in, and have
it working... But I agree that something fishy is happening...

>> Disclaimer, there are many variables in this setup, and I've only
>> tested a small fraction of the problem space: only one system,
>> only one USB3 board, only one USB3 Flash drive.
> 
> Did you ever happen to narrow this down to a single git commit using
> 'git bisect'?  I can't remember what happened in the beginning of this
> thread...

Mathias pointed out d9f11ba9f107aa335091ab8d7ba5eea714e46e8b

>>> So, how do you think we should proceed, delay a bit longer before saying
>>> the device is gone?  How long is "long enough"?  How many bus errors are
>>> we allowed to tolerate (hint, the PCI spec says none...)
>>>
>>> Maybe someone wants to get to the root problem here, why is the hardware
>>> suddenly reporting all 1s?
>>
>> I'm afraid I won't be able to make any progress on this front,
>> unless I can get my hands on a PCIe packet analyzer.
> 
> Odds of that happening are pretty rare, right?  I've never even seen one
> of those...

I had a "Summit T24 Analyzer" on my desk a few months ago, but I was getting
strange results, and the knowledgeable people in my company were not available
at the time.

http://teledynelecroy.com/protocolanalyzer/protocoloverview.aspx?seriesid=445

Regards.
Mathias Nyman Aug. 31, 2017, 11:38 a.m. UTC | #17
On 31.08.2017 12:17, Mason wrote:
> On 30/08/2017 11:37, Mason wrote:
>
>> On 30/08/2017 11:07, Ard Biesheuvel wrote:
>>
>>> Please don't forget to mention that this is quirky hardware that
>>> depends on BROKEN because it multiplexes MMIO and config space
>>> accesses in the same memory window without any locking whatsoever
>>> (which would be difficult to do in the first place because we don't
>>> use accessors for MMIO in the kernel).
>>
>> You're right, it was in the back of my mind, but I didn't state
>> it explicitly for the benefit of linux-usb readers.
>>
>>> So how likely is it that you are attempting to read from the xhci
>>> BAR window while a config space access is in progress? Any way to
>>> instrument this in your driver?
>>
>> I logged config space accesses here:
>>
>> https://www.spinics.net/lists/arm-kernel/msg602832.html
>>
>> IIRC, the config space accesses are generated by the AER ISR.
>> So disabling the AER driver should guarantee that no config space
>> accesses are occurring when the drive is unplugged.
>
> I checked, and I *did* remember correctly.
>
> Disabling the AER driver results in 0 config space access occurring
> when the USB3 drive is unplugged. This confirms that the controller's
> broken design (muxing config and mem space) is not responsible for
> the glitches occurring on unplug events.
>
> Furthermore, I confirm that once the controller has been deemed "dead",
> even USB2 drives are no longer detected, and all USB port on the PCIe
> board are disabled.

xhci handles both USB3 and USB2, If there is only a xhci in use then all
usb ports will be disabled.
Many systems have both ehci and xhci, where ehci handles USB2 side.
I'm guessing yours only have the xhci.

-Mathias
Mathias Nyman Aug. 31, 2017, 11:40 a.m. UTC | #18
On 31.08.2017 12:39, Mason wrote:
> On 30/08/2017 11:06, Greg Kroah-Hartman wrote:
>
>> On Wed, Aug 30, 2017 at 10:55:37AM +0200, Mason wrote:
>>
>>> On 30/08/2017 08:02, Greg Kroah-Hartman wrote:
>>>
>>>> To get back to the original issue here, the hardware seems to have died,
>>>> the driver stops talking to it, and all is good.  The "regression" here
>>>> is that we now properly can determine that the hardware is crap.
>>>
>>> Before 4.12, when I unplugged my USB3 Flash drive, Linux would
>>> detect a few "Uncorrected Non-Fatal errors" via AER, but it was
>>> still possible to plug the drive back in.
>>>
>>> Since 4.12, once I unplug the drive, the whole USB3 card is marked
>>> as dead (all 4 ports), and I can no longer plug anything in (not even
>>> the USB2 drive that didn't have any issues, IIRC).
>>>
>>> It seems a bit premature to "mark as dead" something that remains
>>> functional, doesn't it?
>>
>> I agree, but if the device sends all ones, it's a good indication it is
>> really dead, right?  Or something is wrong with it.
>
> I wouldn't call it dead if I can plug the drive back in, and have
> it working... But I agree that something fishy is happening...
>
>>> Disclaimer, there are many variables in this setup, and I've only
>>> tested a small fraction of the problem space: only one system,
>>> only one USB3 board, only one USB3 Flash drive.
>>
>> Did you ever happen to narrow this down to a single git commit using
>> 'git bisect'?  I can't remember what happened in the beginning of this
>> thread...
>
> Mathias pointed out d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
>

That patch only changes how xhci reacts to reading 0xffffffff.
we used to just returned -ENODEV, but after patch we assume
hardware is broken or removed.

-Mathias
diff mbox

Patch

diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
index 5e3e9d4c6956..c7ea7d4c801f 100644
--- a/drivers/usb/host/xhci-hub.c
+++ b/drivers/usb/host/xhci-hub.c
@@ -23,6 +23,7 @@ 
 
 #include <linux/slab.h>
 #include <asm/unaligned.h>
+#include <linux/pci.h>
 
 #include "xhci.h"
 #include "xhci-trace.h"
@@ -1268,7 +1269,10 @@  int xhci_hub_control(struct usb_hcd *hcd, u16 typeReq, u16 wValue,
 		wIndex--;
 		temp = readl(port_array[wIndex]);
 		if (temp == ~(u32)0) {
-			xhci_hc_died(xhci);
+			struct pci_dev *pdev = to_pci_dev(hcd->self.controller);
+			xhci_err(xhci, "ClearPortFeat port%d @%p=%x, hcd->state:0x%x hcd->flags:0x%x, pci_state 0x%x\n",
+					wIndex, port_array[wIndex], temp, hcd->state, hcd->flags, pdev->current_state);
+			WARN_ON(1);
 			retval = -ENODEV;
 			break;
 		}