Fwd: Enabling USB (auto)suspend for xHCI controllers incurs random device failures since kernel 4.15

Message ID 0f586dae-8af3-db03-d191-791f3f56b0c4@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Mathias Nyman May 14, 2018, 1:27 p.m. UTC
On 11.05.2018 19:37, gert wrote:
> 
> Since the upgrade to Linux 4.15 (and also in 4.16), I'm experiencing an issue where all my USB devices just die seemingly without any cause. Both my laptop's internal (attached) keyboard as well as my external keyboard die.
> 
> Replugging the external keyboard unfortunately does not solve the problem. My touchpad, on the other hand, continues to work, though it may internally be connected via PS/2.
> 
> After this happens, I have only been able to solve it by rebooting.
> 
> In the logs, the following error can be found.
> 
> xhci_hcd 0000:3d:00.0: xHCI host controller not responding, assume dead
> 
> Previously, similar issues occurred to users that could be fixed by adding intel_iommu=false to the kernel parameters. This however seems to be a different problem, as it newly occurs in this specific kernel version and is not solved by the aforementioned solution.
> 
> This was also posted at the Arch Linux forums [1], where we managed to pin down the issue to being related to xHCI and autosuspend (power management). I'm using powertop's --auto-tune, and reverting the "good" setting for all xHCI controllers makes the issue disappear. Linux 4.14 and lower do not exhibit this issue.
> 
> Please also find attached the complete journalctl output of one boot from start to finish that exposed the issue, which may be helpful during debugging. [1]
> 

Hi

I got a report about a very similar issue; the thread can be found at:
https://marc.info/?l=linux-usb&m=152335174903017&w=2

I didn't get any feedback about the suggested patch.
If you can test it, either by compiling my for-usb-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git for-usb-linus

or alternatively by applying the attached patch, I would appreciate it.

If it doesn't help then we need to dig deeper into it, with more detailed logs.

Thanks
Mathias
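
As an illustration of the workaround mentioned in the report (taking the xHCI controller back out of runtime autosuspend), here is a minimal C sketch. It assumes the controller's PCI address is 0000:3d:00.0, as in the log line quoted above, and relies on the standard sysfs power/control attribute, where "on" disables runtime power management and "auto" allows it. The helper names pci_control_path() and set_runtime_pm() are made up for this example, and none of this is part of the patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of the runtime-PM control file for a PCI device.
 * Returns 0 on success, -1 if the buffer is too small.
 * (Hypothetical helper, for illustration only.)
 */
int pci_control_path(char *buf, size_t len, const char *pci_addr)
{
	int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/power/control",
			 pci_addr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write "on" (never runtime suspend) or "auto" (allow runtime suspend)
 * to a power/control file. Returns 0 on success, -1 on any failure.
 * Writing to sysfs normally requires root.
 * (Hypothetical helper, for illustration only.)
 */
int set_runtime_pm(const char *control_path, const char *mode)
{
	FILE *f;
	int ok;

	/* Only the two values discussed here are accepted. */
	if (strcmp(mode, "on") != 0 && strcmp(mode, "auto") != 0)
		return -1;

	f = fopen(control_path, "w");
	if (!f)
		return -1;

	ok = fputs(mode, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A caller would build the path with pci_control_path(path, sizeof(path), "0000:3d:00.0") and then call set_runtime_pm(path, "on"). This only sidesteps the problem by keeping the controller awake; it does not fix the underlying race.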

Comments

russianneuromancer@ya.ru May 14, 2018, 6:06 p.m. UTC | #1
Hello!

> I didn't get any feedback about the suggested patch.

If you are talking about this one,
https://marc.info/?l=linux-usb&m=152528356902325&w=2
then unfortunately I missed it.

Mathias, please confirm, is this the patch I need to check on the Dell 5855?

14/05/2018 at 16:27 +0300, Mathias Nyman:
> On 11.05.2018 19:37, gert wrote:
> > 
> > Since the upgrade to Linux 4.15 (and also in 4.16), I'm
> > experiencing an issue where all my USB devices just die seemingly
> > without any cause. Both my laptop's internal (attached) keyboard as
> > well as my external keyboard die.
> > 
> > Replugging the external keyboard unfortunately does not solve the
> > problem. My touchpad, on the other hand, continues to work, though
> > it may internally be connected via PS/2.
> > 
> > After this happens, I have only been able to solve it by rebooting.
> > 
> > In the logs, the following error can be found.
> > 
> > xhci_hcd 0000:3d:00.0: xHCI host controller not responding, assume dead
> > 
> > Previously, similar issues occurred to users that could be fixed by
> > adding intel_iommu=false to the kernel parameters. This however
> > seems to be a different problem, as it newly occurs in this
> > specific kernel version and is not solved by the aforementioned
> > solution.
> > 
> > This was also posted at the Archlinux forums [1], where we managed
> > to pin down the issue to being related to xHCI and autosuspend
> > (power management). I'm using powertop's --auto-tune and disabling
> > the "good" setting for all xHCI controllers again makes the issue
> > disappear. Linux 4.14 and lower also do not expose this issue.
> > 
> > Please also find attached the complete journalctl output of one
> > boot from start to finish that exposed the issue, which may be
> > helpful during debugging. [1]
> > 
> 
> Hi
> 
> I got a report about a very similar issue, thread can be found at:
> https://marc.info/?l=linux-usb&m=152335174903017&w=2
> 
> I didn't get any feedback about the suggested patch.
> If you can test it, either by compiling my for-usb-linus branch:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git for-usb-linus
> 
> or alternatively by applying the attached patch, I would appreciate it.
> 
> If it doesn't help then we need to dig deeper into it, with more
> detailed logs.
> 
> Thanks
> Mathias
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mathias Nyman May 15, 2018, 7:33 a.m. UTC | #3
Hi

>> I didn't get any feedback about the suggested patch.
> 
> If you are talking about this one,
> https://marc.info/?l=linux-usb&m=152528356902325&w=2
> then unfortunately I missed it.
> 
> Mathias, please confirm, is this the patch I need to check on the Dell 5855?
> 

Ah, no, it's a different one:

https://marc.info/?l=linux-usb&m=152544392822763&w=2
also found in my for-usb-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git for-usb-linus
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=for-usb-linus

-Mathias

gert May 17, 2018, 8:14 p.m. UTC | #4
Hello

Thanks for your quick response.

I'll be away on holidays for the coming two weeks, but after that I'll 
certainly try taking the patch for a test drive for a couple of days 
(as the issue seemed intermittent).

I'll also notify the users on the Arch Linux forum of the patch, as
there were others experiencing the same issue. The more testers, the 
merrier!

On Tuesday 15 May 2018 at 9:33, Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
> Hi
> 
>>> I didn't get any feedback about the suggested patch.
>> 
>> If you are talking about this one,
>> https://marc.info/?l=linux-usb&m=152528356902325&w=2
>> then unfortunately I missed it.
>> 
>> Mathias, please confirm, is this the patch I need to check on the Dell 5855?
>> 
> 
> Ah, no, it's a different one:
> 
> https://marc.info/?l=linux-usb&m=152544392822763&w=2
> also found in my for-usb-linus branch:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git for-usb-linus
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=for-usb-linus
> 
> -Mathias
> 

russianneuromancer@ya.ru June 2, 2018, 12:32 p.m. UTC | #5
Hi!

I tested the for-usb-linus branch on the Dell 5855 and my issue is no longer reproducible with it (at least for the last two days, which should be more than enough to reproduce it).

15.05.2018, 15:30, "Mathias Nyman" <mathias.nyman@linux.intel.com>:
> Hi
>
>>>  I didn't get any feedback about the suggested patch.
>>
>>  If you are talking about this one,
>>  https://marc.info/?l=linux-usb&m=152528356902325&w=2
>>  then unfortunately I missed it.
>>
>>  Mathias, please confirm, is this the patch I need to check on the Dell 5855?
>
> Ah, no, it's a different one:
>
> https://marc.info/?l=linux-usb&m=152544392822763&w=2
> also found in my for-usb-linus branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git for-usb-linus
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=for-usb-linus
>
> -Mathias
Mathias Nyman June 4, 2018, 7:33 a.m. UTC | #6
On 02.06.2018 15:32, russianneuromancer@ya.ru wrote:
> Hi!
> 
> I tested the for-usb-linus branch on the Dell 5855 and my issue is no longer reproducible with it (at least for the last two days, which should be more than enough to reproduce it).
> 

Great, Thanks.

If you want to have a "tested-by:" flag with your name added to the patch I'll need
a real name to go with the email address.

-Mathias
russianneuromancer@ya.ru June 4, 2018, 12:07 p.m. UTC | #7
Hello!

Thank you for fixing this issue! 

> If you want to have a "tested-by:" flag with your name added to the
> patch 

It doesn't matter to me.

04/06/2018 10:33 +0300, Mathias Nyman:
> On 02.06.2018 15:32, russianneuromancer@ya.ru wrote:
> > Hi!
> > 
> > I tested the for-usb-linus branch on the Dell 5855 and my issue is no
> > longer reproducible with it (at least for the last two days, which
> > should be more than enough to reproduce it).
> > 
> 
> Great, Thanks.
> 
> If you want to have a "tested-by:" flag with your name added to the
> patch I'll need
> a real name to go with the email address.
> 
> -Mathias
gert June 7, 2018, 5:56 p.m. UTC | #8
Hello

Sorry for the delay. I've briefly tested the changes and so far have 
not encountered the issue anymore - due to the nature of it, it could 
just coincidentally not be occurring anymore, though I suspect it would 
have occurred some time by now.

Thanks!

On Monday 4 June 2018 at 14:07, russianneuromancer@ya.ru wrote:
> Hello!
> 
> Thank you for fixing this issue!
> 
>>  If you want to have a "tested-by:" flag with your name added to the
>>  patch
> 
> It doesn't matter to me.
> 
> 04/06/2018 10:33 +0300, Mathias Nyman:
>>  On 02.06.2018 15:32, russianneuromancer@ya.ru wrote:
>>  > Hi!
>>  >
>>  > I tested the for-usb-linus branch on the Dell 5855 and my issue is no
>>  > longer reproducible with it (at least for the last two days, which
>>  > should be more than enough to reproduce it).
>>  >
>> 
>>  Great, Thanks.
>> 
>>  If you want to have a "tested-by:" flag with your name added to the
>>  patch I'll need
>>  a real name to go with the email address.
>> 
>>  -Mathias

Mathias Nyman June 8, 2018, 7:12 a.m. UTC | #9
On 07.06.2018 20:56, gert wrote:
> Hello
> 
> Sorry for the delay. I've briefly tested the changes and so far have not encountered the issue anymore - due to the nature of it, it could just coincidentally not be occurring anymore, though I suspect it would have occurred some time by now.
> 
> Thanks!
> 

Great, looks promising if neither you nor russianneuromancer@ya.ru can reproduce the issue after the patch.
I'll send it forward after rc-1.

Thanks
Mathias

Patch

From d5fb6d354cd14d962d9f790f782ace0d06bf5002 Mon Sep 17 00:00:00 2001
From: Mathias Nyman <mathias.nyman@linux.intel.com>
Date: Fri, 4 May 2018 15:38:34 +0300
Subject: [PATCH] xhci: Fix perceived dead host due to runtime suspend race
 with event handler

Don't rely on event interrupt (EINT) bit alone to detect pending port
change in resume. If no change event is detected the host may be suspended
again, otherwise roothubs are resumed.

There is a lag in xHC setting EINT. If we don't notice the pending change
in resume, and the controller is runtime suspended again, it causes the
event handler to assume host is dead as it will fail to read xHC registers
once PCI puts the controller to D3 state.

[  268.520969] xhci_hcd: xhci_resume: starting port polling.
[  268.520985] xhci_hcd: xhci_hub_status_data: stopping port polling.
[  268.521030] xhci_hcd: xhci_suspend: stopping port polling.
[  268.521040] xhci_hcd: // Setting command ring address to 0x349bd001
[  268.521139] xhci_hcd: Port Status Change Event for port 3
[  268.521149] xhci_hcd: resume root hub
[  268.521163] xhci_hcd: port resume event for port 3
[  268.521168] xhci_hcd: xHC is not running.
[  268.521174] xhci_hcd: handle_port_status: starting port polling.
[  268.596322] xhci_hcd: xhci_hc_died: xHCI host controller not responding, assume dead

The EINT lag is described in an additional note in the xhci spec, section 4.19.2:

"Due to internal xHC scheduling and system delays, there will be a lag
between a change bit being set and the Port Status Change Event that it
generated being written to the Event Ring. If SW reads the PORTSC and
sees a change bit set, there is no guarantee that the corresponding Port
Status Change Event has already been written into the Event Ring."

Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci.c | 37 ++++++++++++++++++++++++++++++++++---
 drivers/usb/host/xhci.h |  4 ++++
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 711da33..18e9fcf 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -844,6 +844,38 @@  static void xhci_disable_port_wake_on_bits(struct xhci_hcd *xhci)
 	spin_unlock_irqrestore(&xhci->lock, flags);
 }
 
+static bool xhci_pending_portevent(struct xhci_hcd *xhci)
+{
+	int	port_index;
+	u32	status;
+	u32	portsc;
+
+	status = readl(&xhci->op_regs->status);
+	if (status & STS_EINT)
+		return true;
+	/*
+	 * Checking STS_EINT is not enough as there is a lag between a change
+	 * bit being set and the Port Status Change Event that it generated
+	 * being written to the Event Ring. See note in xhci 1.1 section 4.19.2.
+	 */
+
+	port_index = xhci->num_usb2_ports;
+	while (port_index--) {
+		portsc = readl(xhci->usb2_ports[port_index]);
+		if (portsc & PORT_CHANGE_MASK ||
+		    (portsc & PORT_PLS_MASK) == XDEV_RESUME)
+			return true;
+	}
+	port_index = xhci->num_usb3_ports;
+	while (port_index--) {
+		portsc = readl(xhci->usb3_ports[port_index]);
+		if (portsc & PORT_CHANGE_MASK ||
+		    (portsc & PORT_PLS_MASK) == XDEV_RESUME)
+			return true;
+	}
+	return false;
+}
+
 /*
  * Stop HC (not bus-specific)
  *
@@ -945,7 +977,7 @@  EXPORT_SYMBOL_GPL(xhci_suspend);
  */
 int xhci_resume(struct xhci_hcd *xhci, bool hibernated)
 {
-	u32			command, temp = 0, status;
+	u32			command, temp = 0;
 	struct usb_hcd		*hcd = xhci_to_hcd(xhci);
 	struct usb_hcd		*secondary_hcd;
 	int			retval = 0;
@@ -1069,8 +1101,7 @@  int xhci_resume(struct xhci_hcd *xhci, bool hibernated)
  done:
 	if (retval == 0) {
 		/* Resume root hubs only when have pending events. */
-		status = readl(&xhci->op_regs->status);
-		if (status & STS_EINT) {
+		if (xhci_pending_portevent(xhci)) {
 			usb_hcd_resume_root_hub(xhci->shared_hcd);
 			usb_hcd_resume_root_hub(hcd);
 		}
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 6dfc486..9751c13 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -382,6 +382,10 @@  struct xhci_op_regs {
 #define PORT_PLC	(1 << 22)
 /* port configure error change - port failed to configure its link partner */
 #define PORT_CEC	(1 << 23)
+#define PORT_CHANGE_MASK	(PORT_CSC | PORT_PEC | PORT_WRC | PORT_OCC | \
+				 PORT_RC | PORT_PLC | PORT_CEC)
+
+
 /* Cold Attach Status - xHC can set this bit to report device attached during
  * Sx state. Warm port reset should be perfomed to clear this bit and move port
  * to connected state.
-- 
2.7.4
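
To illustrate the difference between the old and the new resume check, the following user-space C sketch models the logic with plain integers instead of real registers. The bit values are copied from drivers/usb/host/xhci.h as far as I can tell; struct fake_xhc and both function names are hypothetical and exist only for this example. It is a simplified model of the idea in the patch, not the driver code itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit definitions mirrored from drivers/usb/host/xhci.h. */
#define STS_EINT	(1u << 3)	/* USBSTS: event interrupt pending */
#define PORT_PLS_MASK	(0xfu << 5)	/* PORTSC: port link state field */
#define XDEV_RESUME	(0xfu << 5)	/* link state "resume" */
#define PORT_CSC	(1u << 17)
#define PORT_PEC	(1u << 18)
#define PORT_WRC	(1u << 19)
#define PORT_OCC	(1u << 20)
#define PORT_RC		(1u << 21)
#define PORT_PLC	(1u << 22)
#define PORT_CEC	(1u << 23)
#define PORT_CHANGE_MASK (PORT_CSC | PORT_PEC | PORT_WRC | PORT_OCC | \
			  PORT_RC | PORT_PLC | PORT_CEC)

/* Hypothetical stand-in for the registers the resume path reads. */
struct fake_xhc {
	uint32_t usbsts;	/* snapshot of the USBSTS register */
	const uint32_t *portsc;	/* snapshot of each port's PORTSC */
	size_t num_ports;
};

/* Old behaviour: only the EINT bit decides whether roothubs are resumed. */
bool eint_only_pending(const struct fake_xhc *xhc)
{
	return (xhc->usbsts & STS_EINT) != 0;
}

/*
 * New behaviour: EINT, or any port with a change bit set, or any port
 * whose link state is "resume", counts as a pending port event.
 */
bool portevent_pending(const struct fake_xhc *xhc)
{
	size_t i;

	if (xhc->usbsts & STS_EINT)
		return true;

	for (i = 0; i < xhc->num_ports; i++) {
		uint32_t portsc = xhc->portsc[i];

		if ((portsc & PORT_CHANGE_MASK) ||
		    (portsc & PORT_PLS_MASK) == XDEV_RESUME)
			return true;
	}
	return false;
}
```

With USBSTS reading 0 and one port showing PORT_PLC, eint_only_pending() reports nothing pending while portevent_pending() reports a pending change. That is the window described in the commit message, where the controller could be runtime suspended again just before the event arrives.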