diff mbox series

[v5,2/5] PCI: Check PCI_PM_CTRL instead of PCI_COMMAND in pci_dev_wait()

Message ID 20240823154023.360234-3-superm1@kernel.org (mailing list archive)
State New
Delegated to: Bjorn Helgaas
Headers show
Series Verify devices transition from D3cold to D0 | expand

Commit Message

Mario Limonciello Aug. 23, 2024, 3:40 p.m. UTC
From: Mario Limonciello <mario.limonciello@amd.com>

If a dock is plugged in at the same time as autosuspend delay then this
can cause malfunctions in the USB4 stack. This happens because the
device is still in D3cold at the time that the PCI core handed
control back to the USB4 stack.

A device that has gone through a reset may return a value in PCI_COMMAND
but that doesn't mean it's finished transitioning to D0.  For evices that
support power management explicitly check PCI_PM_CTRL on everything but
system resume to ensure the transition happened.

Devices that don't support power management and system resume will
continue to use PCI_COMMAND.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
v4->v5:
 * Fix misleading indentation
 * Amend commit message
---
 drivers/pci/pci.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

Comments

Bjorn Helgaas Aug. 23, 2024, 7:54 p.m. UTC | #1
[+cc Duc, Alex]

On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote:
> If a dock is plugged in at the same time as autosuspend delay then this
> can cause malfunctions in the USB4 stack. This happens because the
> device is still in D3cold at the time that the PCI core handed
> control back to the USB4 stack.

I assume the USB device in question is in the dock that was hot-added?
This patch suggests that pci_dev_wait() has waited for a read of
PCI_COMMAND to respond with something other than ~0, but the device is
still in D3cold.  I suppose we got to pci_dev_wait() via
pci_pm_bridge_power_up_actions() calling
pci_bridge_wait_for_secondary_bus(), since I wouldn't expect a reset
in the hot-add case.

> A device that has gone through a reset may return a value in PCI_COMMAND
> but that doesn't mean it's finished transitioning to D0.  For evices that
> support power management explicitly check PCI_PM_CTRL on everything but
> system resume to ensure the transition happened.

s/evices/devices/

> Devices that don't support power management and system resume will
> continue to use PCI_COMMAND.

Is there a bugzilla or other report with more details that we can
include here?

> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> ---
> v4->v5:
>  * Fix misleading indentation
>  * Amend commit message
> ---
>  drivers/pci/pci.c | 28 ++++++++++++++++++++--------
>  1 file changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 1e219057a5069..f032a4aaec268 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int
>  	 * the read (except when CRS SV is enabled and the read was for the
>  	 * Vendor ID; in that case it synthesizes 0x0001 data).
>  	 *
> -	 * Wait for the device to return a non-CRS completion.  Read the
> -	 * Command register instead of Vendor ID so we don't have to
> -	 * contend with the CRS SV value.
> +	 * Wait for the device to return a non-CRS completion.  On devices
> +	 * that support PM control and on waits that aren't part of system
> +	 * resume read the PM control register to ensure the device has
> +	 * transitioned to D0.  On devices that don't support PM control,
> +	 * or during system resume read the command register to instead of
> +	 * Vendor ID so we don't have to contend with the CRS SV value.
>  	 */
>  	for (;;) {
> -		u32 id;
> -
>  		if (pci_dev_is_disconnected(dev)) {
>  			pci_dbg(dev, "disconnected; not waiting\n");
>  			return -ENOTTY;
>  		}
>  
> -		pci_read_config_dword(dev, PCI_COMMAND, &id);
> -		if (!PCI_POSSIBLE_ERROR(id))
> -			break;
> +		if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) {
> +			u16 pmcsr;
> +
> +			pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> +			if (!PCI_POSSIBLE_ERROR(pmcsr) &&
> +			    (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0)
> +				break;
> +		} else {
> +			u32 id;
> +
> +			pci_read_config_dword(dev, PCI_COMMAND, &id);
> +			if (!PCI_POSSIBLE_ERROR(id))
> +				break;
> +		}

What is the rationale behind using PCI_PM_CTRL in some cases and
PCI_COMMAND in others?  Is there some spec language we can cite for
this?

IIUC, pci_dev_wait() waits for a device to be ready after a reset
(FLR, D3hot->D0 transition for devices where No_Soft_Reset is clear,
DPC) and after power-up from D3cold (pci_pm_bridge_power_up_actions()).
I think device readiness in all of these cases is covered by PCIe
r6.0, sec 6.6.1.

If the Root Port above the device supports Configuration RRS Software
Visibility, I think we probably should use that here instead of either
PCI_COMMAND or PCI_PM_CTRL.  SR-IOV VFs don't implement Vendor ID,
which might complicate this a little.  But it niggles in my mind that
there may be some other problem beyond that.  Maybe Alex remembers.

Anyway, if we meet the requirements of sec 6.6.1, the device should be
ready to respond to config requests, and I assume that also means
the device is in D0.

>  		if (delay > timeout) {
>  			pci_warn(dev, "not ready %dms after %s; giving up\n",
> -- 
> 2.43.0
>
Mario Limonciello Aug. 26, 2024, 7:16 p.m. UTC | #2
On 8/23/2024 14:54, Bjorn Helgaas wrote:
> [+cc Duc, Alex]
> 
> On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote:
>> If a dock is plugged in at the same time as autosuspend delay then this
>> can cause malfunctions in the USB4 stack. This happens because the
>> device is still in D3cold at the time that the PCI core handed
>> control back to the USB4 stack.
> 
> I assume the USB device in question is in the dock that was hot-added?

No; it's actually the USB4 router that is malfunctioning.  The CM 
(thunderbolt.ko) thinks the router is in D0 already, but when it 
attempts to do a register read it gets back all F's and it trusts that.

> This patch suggests that pci_dev_wait() has waited for a read of
> PCI_COMMAND to respond with something other than ~0, but the device is
> still in D3cold.  I suppose we got to pci_dev_wait() via
> pci_pm_bridge_power_up_actions() calling
> pci_bridge_wait_for_secondary_bus(), since I wouldn't expect a reset
> in the hot-add case.

As I said it's the router (part of the SoC). The device never 
disappears.  It's the action of plugging in/out the the dock that causes 
it to change power states.

We didn't try it, but I wouldn't be surprised if it could be reproduced 
with a script that turned on/off runtime PM on very tight timing around 
the autosuspend delay.

> 
>> A device that has gone through a reset may return a value in PCI_COMMAND
>> but that doesn't mean it's finished transitioning to D0.  For evices that
>> support power management explicitly check PCI_PM_CTRL on everything but
>> system resume to ensure the transition happened.
> 
> s/evices/devices/

Thanks.

> 
>> Devices that don't support power management and system resume will
>> continue to use PCI_COMMAND.
> 
> Is there a bugzilla or other report with more details that we can
> include here?

No, unfortunately in this case it was only reported internally at AMD.

Gary who is on CC brought it to me though, and if you think there are 
some other specific details needed but are missing we can see what else 
can be added to the commit message.

> 
>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>> ---
>> v4->v5:
>>   * Fix misleading indentation
>>   * Amend commit message
>> ---
>>   drivers/pci/pci.c | 28 ++++++++++++++++++++--------
>>   1 file changed, 20 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 1e219057a5069..f032a4aaec268 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int
>>   	 * the read (except when CRS SV is enabled and the read was for the
>>   	 * Vendor ID; in that case it synthesizes 0x0001 data).
>>   	 *
>> -	 * Wait for the device to return a non-CRS completion.  Read the
>> -	 * Command register instead of Vendor ID so we don't have to
>> -	 * contend with the CRS SV value.
>> +	 * Wait for the device to return a non-CRS completion.  On devices
>> +	 * that support PM control and on waits that aren't part of system
>> +	 * resume read the PM control register to ensure the device has
>> +	 * transitioned to D0.  On devices that don't support PM control,
>> +	 * or during system resume read the command register to instead of
>> +	 * Vendor ID so we don't have to contend with the CRS SV value.
>>   	 */
>>   	for (;;) {
>> -		u32 id;
>> -
>>   		if (pci_dev_is_disconnected(dev)) {
>>   			pci_dbg(dev, "disconnected; not waiting\n");
>>   			return -ENOTTY;
>>   		}
>>   
>> -		pci_read_config_dword(dev, PCI_COMMAND, &id);
>> -		if (!PCI_POSSIBLE_ERROR(id))
>> -			break;
>> +		if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) {
>> +			u16 pmcsr;
>> +
>> +			pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>> +			if (!PCI_POSSIBLE_ERROR(pmcsr) &&
>> +			    (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0)
>> +				break;
>> +		} else {
>> +			u32 id;
>> +
>> +			pci_read_config_dword(dev, PCI_COMMAND, &id);
>> +			if (!PCI_POSSIBLE_ERROR(id))
>> +				break;
>> +		}
> 
> What is the rationale behind using PCI_PM_CTRL in some cases and
> PCI_COMMAND in others?

We saw a deadlock during resume from suspend when PCI_PM_CTRL was used 
for all cases that supported dev->pm_cap.

> Is there some spec language we can cite for
> this?

Perhaps it being a "cold reset" during resume?

> 
> IIUC, pci_dev_wait() waits for a device to be ready after a reset
> (FLR, D3hot->D0 transition for devices where No_Soft_Reset is clear,
> DPC) and after power-up from D3cold (pci_pm_bridge_power_up_actions()).
> I think device readiness in all of these cases is covered by PCIe
> r6.0, sec 6.6.1.

Would it be helpful to you to get a dump_stack() call trace to 
pci_power_up() the specific call flow that needed this fix?

Gary is able to to reproduce this at will, I think he should be able to 
gather that using an unpatched kernel to help this conversation.

> 
> If the Root Port above the device supports Configuration RRS Software
> Visibility, I think we probably should use that here instead of either
> PCI_COMMAND or PCI_PM_CTRL.  

I did check and in this case the root port the USB4 routers are 
connected to support this.

How do you think this should be done instead?

> SR-IOV VFs don't implement Vendor ID,
> which might complicate this a little.  But it niggles in my mind that
> there may be some other problem beyond that.  Maybe Alex remembers.


> 
> Anyway, if we meet the requirements of sec 6.6.1, the device should be
> ready to respond to config requests, and I assume that also means
> the device is in D0.
> 

Within that section there is a quote to point out:

"
To allow components to perform internal initialization, system software 
must wait a specified minimum period
following exit from a Conventional Reset of one or more devices before 
it is permitted to issue Configuration
Requests to those devices
"

In pci_power_up() I don't really see any hardcoded delays specifically 
for this case of leaving D3cold. The PCI PM spec specifies that it will 
take "Full context restore or boot latency".  I don't think it's 
reasonable to have NO delay.
Mario Limonciello Aug. 27, 2024, 5:43 p.m. UTC | #3
On 8/26/2024 14:16, Mario Limonciello wrote:
> On 8/23/2024 14:54, Bjorn Helgaas wrote:
>> [+cc Duc, Alex]
>>
>> On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote:
>>> If a dock is plugged in at the same time as autosuspend delay then this
>>> can cause malfunctions in the USB4 stack. This happens because the
>>> device is still in D3cold at the time that the PCI core handed
>>> control back to the USB4 stack.
>>
>> I assume the USB device in question is in the dock that was hot-added?
> 
> No; it's actually the USB4 router that is malfunctioning.  The CM 
> (thunderbolt.ko) thinks the router is in D0 already, but when it 
> attempts to do a register read it gets back all F's and it trusts that.
> 
>> This patch suggests that pci_dev_wait() has waited for a read of
>> PCI_COMMAND to respond with something other than ~0, but the device is
>> still in D3cold.  I suppose we got to pci_dev_wait() via
>> pci_pm_bridge_power_up_actions() calling
>> pci_bridge_wait_for_secondary_bus(), since I wouldn't expect a reset
>> in the hot-add case.
> 
> As I said it's the router (part of the SoC). The device never 
> disappears.  It's the action of plugging in/out the the dock that causes 
> it to change power states.
> 
> We didn't try it, but I wouldn't be surprised if it could be reproduced 
> with a script that turned on/off runtime PM on very tight timing around 
> the autosuspend delay.
> 
>>
>>> A device that has gone through a reset may return a value in PCI_COMMAND
>>> but that doesn't mean it's finished transitioning to D0.  For evices 
>>> that
>>> support power management explicitly check PCI_PM_CTRL on everything but
>>> system resume to ensure the transition happened.
>>
>> s/evices/devices/
> 
> Thanks.
> 
>>
>>> Devices that don't support power management and system resume will
>>> continue to use PCI_COMMAND.
>>
>> Is there a bugzilla or other report with more details that we can
>> include here?
> 
> No, unfortunately in this case it was only reported internally at AMD.
> 
> Gary who is on CC brought it to me though, and if you think there are 
> some other specific details needed but are missing we can see what else 
> can be added to the commit message.
> 
>>
>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>> ---
>>> v4->v5:
>>>   * Fix misleading indentation
>>>   * Amend commit message
>>> ---
>>>   drivers/pci/pci.c | 28 ++++++++++++++++++++--------
>>>   1 file changed, 20 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index 1e219057a5069..f032a4aaec268 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, 
>>> enum pci_reset_type reset_type, int
>>>        * the read (except when CRS SV is enabled and the read was for 
>>> the
>>>        * Vendor ID; in that case it synthesizes 0x0001 data).
>>>        *
>>> -     * Wait for the device to return a non-CRS completion.  Read the
>>> -     * Command register instead of Vendor ID so we don't have to
>>> -     * contend with the CRS SV value.
>>> +     * Wait for the device to return a non-CRS completion.  On devices
>>> +     * that support PM control and on waits that aren't part of system
>>> +     * resume read the PM control register to ensure the device has
>>> +     * transitioned to D0.  On devices that don't support PM control,
>>> +     * or during system resume read the command register to instead of
>>> +     * Vendor ID so we don't have to contend with the CRS SV value.
>>>        */
>>>       for (;;) {
>>> -        u32 id;
>>> -
>>>           if (pci_dev_is_disconnected(dev)) {
>>>               pci_dbg(dev, "disconnected; not waiting\n");
>>>               return -ENOTTY;
>>>           }
>>> -        pci_read_config_dword(dev, PCI_COMMAND, &id);
>>> -        if (!PCI_POSSIBLE_ERROR(id))
>>> -            break;
>>> +        if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) {
>>> +            u16 pmcsr;
>>> +
>>> +            pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, 
>>> &pmcsr);
>>> +            if (!PCI_POSSIBLE_ERROR(pmcsr) &&
>>> +                (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0)
>>> +                break;
>>> +        } else {
>>> +            u32 id;
>>> +
>>> +            pci_read_config_dword(dev, PCI_COMMAND, &id);
>>> +            if (!PCI_POSSIBLE_ERROR(id))
>>> +                break;
>>> +        }
>>
>> What is the rationale behind using PCI_PM_CTRL in some cases and
>> PCI_COMMAND in others?
> 
> We saw a deadlock during resume from suspend when PCI_PM_CTRL was used 
> for all cases that supported dev->pm_cap.
> 
>> Is there some spec language we can cite for
>> this?
> 
> Perhaps it being a "cold reset" during resume?
> 
>>
>> IIUC, pci_dev_wait() waits for a device to be ready after a reset
>> (FLR, D3hot->D0 transition for devices where No_Soft_Reset is clear,
>> DPC) and after power-up from D3cold (pci_pm_bridge_power_up_actions()).
>> I think device readiness in all of these cases is covered by PCIe
>> r6.0, sec 6.6.1.
> 
> Would it be helpful to you to get a dump_stack() call trace to 
> pci_power_up() the specific call flow that needed this fix?
> 
> Gary is able to to reproduce this at will, I think he should be able to 
> gather that using an unpatched kernel to help this conversation.

Here is the kernel trace with a dump_stack() and a printk of current 
inserted into pci_power_up() right before the failure occurs.

https://gist.github.com/superm1/cb407766ab15f42f12a6fe9d1196f6fc

Also the failure is visible right after.

> 
>>
>> If the Root Port above the device supports Configuration RRS Software
>> Visibility, I think we probably should use that here instead of either
>> PCI_COMMAND or PCI_PM_CTRL. 
> 
> I did check and in this case the root port the USB4 routers are 
> connected to support this.
> 
> How do you think this should be done instead?
> 
>> SR-IOV VFs don't implement Vendor ID,
>> which might complicate this a little.  But it niggles in my mind that
>> there may be some other problem beyond that.  Maybe Alex remembers.
> 
> 
>>
>> Anyway, if we meet the requirements of sec 6.6.1, the device should be
>> ready to respond to config requests, and I assume that also means
>> the device is in D0.
>>
> 
> Within that section there is a quote to point out:
> 
> "
> To allow components to perform internal initialization, system software 
> must wait a specified minimum period
> following exit from a Conventional Reset of one or more devices before 
> it is permitted to issue Configuration
> Requests to those devices
> "
> 
> In pci_power_up() I don't really see any hardcoded delays specifically 
> for this case of leaving D3cold. The PCI PM spec specifies that it will 
> take "Full context restore or boot latency".  I don't think it's 
> reasonable to have NO delay.
Bjorn Helgaas Aug. 27, 2024, 7:44 p.m. UTC | #4
[+cc Rafael for ACPI power transition from D3cold to D0 delay]

On Mon, Aug 26, 2024 at 02:16:34PM -0500, Mario Limonciello wrote:
> On 8/23/2024 14:54, Bjorn Helgaas wrote:
> > On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote:
> > > If a dock is plugged in at the same time as autosuspend delay
> > > then this can cause malfunctions in the USB4 stack. This happens
> > > because the device is still in D3cold at the time that the PCI
> > > core handed control back to the USB4 stack.
> ...

> > > A device that has gone through a reset may return a value in
> > > PCI_COMMAND but that doesn't mean it's finished transitioning to
> > > D0.  For devices that support power management explicitly check
> > > PCI_PM_CTRL on everything but system resume to ensure the
> > > transition happened.
> ...

> > > Devices that don't support power management and system resume
> > > will continue to use PCI_COMMAND.

> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int
> > >   	 * the read (except when CRS SV is enabled and the read was for the
> > >   	 * Vendor ID; in that case it synthesizes 0x0001 data).
> > >   	 *
> > > -	 * Wait for the device to return a non-CRS completion.  Read the
> > > -	 * Command register instead of Vendor ID so we don't have to
> > > -	 * contend with the CRS SV value.
> > > +	 * Wait for the device to return a non-CRS completion.  On devices
> > > +	 * that support PM control and on waits that aren't part of system
> > > +	 * resume read the PM control register to ensure the device has
> > > +	 * transitioned to D0.  On devices that don't support PM control,
> > > +	 * or during system resume read the command register to instead of
> > > +	 * Vendor ID so we don't have to contend with the CRS SV value.
> > >   	 */
> > >   	for (;;) {
> > > -		u32 id;
> > > -
> > >   		if (pci_dev_is_disconnected(dev)) {
> > >   			pci_dbg(dev, "disconnected; not waiting\n");
> > >   			return -ENOTTY;
> > >   		}
> > > -		pci_read_config_dword(dev, PCI_COMMAND, &id);
> > > -		if (!PCI_POSSIBLE_ERROR(id))
> > > -			break;
> > > +		if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) {
> > > +			u16 pmcsr;
> > > +
> > > +			pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > > +			if (!PCI_POSSIBLE_ERROR(pmcsr) &&
> > > +			    (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0)
> > > +				break;
> > > +		} else {
> > > +			u32 id;
> > > +
> > > +			pci_read_config_dword(dev, PCI_COMMAND, &id);
> > > +			if (!PCI_POSSIBLE_ERROR(id))
> > > +				break;
> > > +		}
> > 
> > What is the rationale behind using PCI_PM_CTRL in some cases and
> > PCI_COMMAND in others?
> 
> We saw a deadlock during resume from suspend when PCI_PM_CTRL was used for
> all cases that supported dev->pm_cap.
> 
> > Is there some spec language we can cite for
> > this?
> 
> Perhaps it being a "cold reset" during resume?
> 
> > IIUC, pci_dev_wait() waits for a device to be ready after a reset
> > (FLR, D3hot->D0 transition for devices where No_Soft_Reset is clear,
> > DPC) and after power-up from D3cold (pci_pm_bridge_power_up_actions()).
> > I think device readiness in all of these cases is covered by PCIe
> > r6.0, sec 6.6.1.
> ...

> > If the Root Port above the device supports Configuration RRS
> > Software Visibility, I think we probably should use that here
> > instead of either PCI_COMMAND or PCI_PM_CTRL.
> 
> I did check and in this case the root port the USB4 routers are
> connected to support this.
> 
> How do you think this should be done instead?
> 
> > SR-IOV VFs don't implement Vendor ID, which might complicate this
> > a little.  But it niggles in my mind that there may be some other
> > problem beyond that.  Maybe Alex remembers.
> > 
> > Anyway, if we meet the requirements of sec 6.6.1, the device
> > should be ready to respond to config requests, and I assume that
> > also means the device is in D0.
> 
> Within that section there is a quote to point out:
> 
> "
> To allow components to perform internal initialization, system
> software must wait a specified minimum period following exit from a
> Conventional Reset of one or more devices before it is permitted to
> issue Configuration Requests to those devices
> "
> 
> In pci_power_up() I don't really see any hardcoded delays
> specifically for this case of leaving D3cold. The PCI PM spec
> specifies that it will take "Full context restore or boot latency".
> I don't think it's reasonable to have NO delay.

I agree, leaving D3cold means "applying power to the device", and PCIe
r6.0, sec 6.6.1, says that's a Fundamental Reset:

  A Fundamental Reset must occur following the application of power to
  the component. This is called a Cold Reset.

So we need a delay similar to what we do in
pci_bridge_wait_for_secondary_bus().  I don't know whether that is
supposed to happen somewhere on the ACPI side or in the PCI side, but
my inclination would be the PCI side because the delay isn't really
platform-specific, it's specified by the PCI specs, and the OS needs
to manage the RRS Software Visibility retries and features like
Readiness Notifications.

pci_set_low_power_state() contains corresponding delays for putting
devices in D1, D2, and D3hot.

Maybe the D3cold -> D0 delays should be in
platform_pci_set_power_state()?  I think pci_power_up() is the only
caller that sets the state to D0, and it assumes the device is
Configuration Ready when platform_pci_set_power_state() returns:

  pci_power_up
    platform_pci_set_power_state
    pci_read_config_word(PCI_PM_CTRL)

So I think we need basically the same delays and pci_dev_wait() stuff
in this path.

Bjorn
Bjorn Helgaas Aug. 30, 2024, 12:01 a.m. UTC | #5
On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote:
> From: Mario Limonciello <mario.limonciello@amd.com>
> 
> If a dock is plugged in at the same time as autosuspend delay then
> this can cause malfunctions in the USB4 stack. This happens because
> the device is still in D3cold at the time that the PCI core handed
> control back to the USB4 stack.
> 
> A device that has gone through a reset may return a value in
> PCI_COMMAND but that doesn't mean it's finished transitioning to D0.
> For devices that support power management explicitly check
> PCI_PM_CTRL on everything but system resume to ensure the transition
> happened.

Still trying to understand what's going on here.

I posted a change to pci_dev_wait() to read Vendor ID, look for Config
RRS status, and wait for a successful completion (when RRS Software
Visibility is enabled) [1].

You tested that and found that it didn't help with *this* issue [2].
I assume you tested something like v6.11-rc plus the patches from [1],
i.e., without the PCI_PM_CTRL changes in this series.

  1) Initially the device is in D0

  2) We put it in D3cold (using some ACPI method) because the
  autosuspend delay expired (?)

  3) Plugging in the dock wakes up the device, so we power up the
  device (via pci_power_up(), which again uses some ACPI method), and
  it should transition to D0uninitialized

  4) The USB4 stack sees the device but thinks it's in D3cold (?)

If your testing only included [1], but did not include the
pci_power_up() change from patch 3/5 "Verify functions currently in
D3cold have entered D0", I don't think we would call pci_dev_wait(),
so I wouldn't expect [1] to make any difference.

If you *did* include both [1] and patch 3/5, the implication would be
that pci_dev_wait() successfully read the Vendor ID, meaning the
device is not in D3cold when pci_power_up() returns.

Can you clarify what you see and possibly expand/correct my timeline
above?

[1] https://lore.kernel.org/linux-pci/20240827234848.4429-1-helgaas@kernel.org/
[2] https://lore.kernel.org/linux-pci/30d9589a-8050-421b-a9a5-ad3422feadad@amd.com/

> Devices that don't support power management and system resume will
> continue to use PCI_COMMAND.
> 
> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> ---
> v4->v5:
>  * Fix misleading indentation
>  * Amend commit message
> ---
>  drivers/pci/pci.c | 28 ++++++++++++++++++++--------
>  1 file changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 1e219057a5069..f032a4aaec268 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int
>  	 * the read (except when CRS SV is enabled and the read was for the
>  	 * Vendor ID; in that case it synthesizes 0x0001 data).
>  	 *
> -	 * Wait for the device to return a non-CRS completion.  Read the
> -	 * Command register instead of Vendor ID so we don't have to
> -	 * contend with the CRS SV value.
> +	 * Wait for the device to return a non-CRS completion.  On devices
> +	 * that support PM control and on waits that aren't part of system
> +	 * resume read the PM control register to ensure the device has
> +	 * transitioned to D0.  On devices that don't support PM control,
> +	 * or during system resume read the command register to instead of
> +	 * Vendor ID so we don't have to contend with the CRS SV value.
>  	 */
>  	for (;;) {
> -		u32 id;
> -
>  		if (pci_dev_is_disconnected(dev)) {
>  			pci_dbg(dev, "disconnected; not waiting\n");
>  			return -ENOTTY;
>  		}
>  
> -		pci_read_config_dword(dev, PCI_COMMAND, &id);
> -		if (!PCI_POSSIBLE_ERROR(id))
> -			break;
> +		if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) {
> +			u16 pmcsr;
> +
> +			pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> +			if (!PCI_POSSIBLE_ERROR(pmcsr) &&
> +			    (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0)
> +				break;
> +		} else {
> +			u32 id;
> +
> +			pci_read_config_dword(dev, PCI_COMMAND, &id);
> +			if (!PCI_POSSIBLE_ERROR(id))
> +				break;
> +		}
>  
>  		if (delay > timeout) {
>  			pci_warn(dev, "not ready %dms after %s; giving up\n",
> -- 
> 2.43.0
>
Mario Limonciello Sept. 3, 2024, 4:29 p.m. UTC | #6
On 8/29/2024 19:01, Bjorn Helgaas wrote:
> On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote:
>> From: Mario Limonciello <mario.limonciello@amd.com>
>>
>> If a dock is plugged in at the same time as autosuspend delay then
>> this can cause malfunctions in the USB4 stack. This happens because
>> the device is still in D3cold at the time that the PCI core handed
>> control back to the USB4 stack.
>>
>> A device that has gone through a reset may return a value in
>> PCI_COMMAND but that doesn't mean it's finished transitioning to D0.
>> For devices that support power management explicitly check
>> PCI_PM_CTRL on everything but system resume to ensure the transition
>> happened.
> 
> Still trying to understand what's going on here.
> 
> I posted a change to pci_dev_wait() to read Vendor ID, look for Config
> RRS status, and wait for a successful completion (when RRS Software
> Visibility is enabled) [1].
> 
> You tested that and found that it didn't help with *this* issue [2].
> I assume you tested something like v6.11-rc plus the patches from [1],
> i.e., without the PCI_PM_CTRL changes in this series.
> 
>    1) Initially the device is in D0
> 
>    2) We put it in D3cold (using some ACPI method) because the
>    autosuspend delay expired (?)
> 
>    3) Plugging in the dock wakes up the device, so we power up the
>    device (via pci_power_up(), which again uses some ACPI method), and
>    it should transition to D0uninitialized
> 
>    4) The USB4 stack sees the device but thinks it's in D3cold (?)
> 
> If your testing only included [1], but did not include the
> pci_power_up() change from patch 3/5 "Verify functions currently in
> D3cold have entered D0", I don't think we would call pci_dev_wait(),
> so I wouldn't expect [1] to make any difference.
> 
> If you *did* include both [1] and patch 3/5, the implication would be
> that pci_dev_wait() successfully read the Vendor ID, meaning the
> device is not in D3cold when pci_power_up() returns.

The testing here was from the LTS 6.6.y kernel with both [1] and patch 
3/5 from this series.

I can get testing from 6.11-rc6 with a combination of patches if you 
would like.

> 
> Can you clarify what you see and possibly expand/correct my timeline
> above?

The timeline you shared is nearly correct.  The USB4 stack *thinks* the 
device is in D0 because of the return of pci_power_up().

As by polling PCI_PM_CTRL we can see it's still in D3, so it hasn't made 
it to D0uninitialized yet.

I guess I reading between the lines you have an assumption that you 
can't read the vendor ID from D3; which doesn't appear to be the case 
from our testing.

> 
> [1] https://lore.kernel.org/linux-pci/20240827234848.4429-1-helgaas@kernel.org/
> [2] https://lore.kernel.org/linux-pci/30d9589a-8050-421b-a9a5-ad3422feadad@amd.com/
> 
>> Devices that don't support power management and system resume will
>> continue to use PCI_COMMAND.
>>
>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>> ---
>> v4->v5:
>>   * Fix misleading indentation
>>   * Amend commit message
>> ---
>>   drivers/pci/pci.c | 28 ++++++++++++++++++++--------
>>   1 file changed, 20 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 1e219057a5069..f032a4aaec268 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int
>>   	 * the read (except when CRS SV is enabled and the read was for the
>>   	 * Vendor ID; in that case it synthesizes 0x0001 data).
>>   	 *
>> -	 * Wait for the device to return a non-CRS completion.  Read the
>> -	 * Command register instead of Vendor ID so we don't have to
>> -	 * contend with the CRS SV value.
>> +	 * Wait for the device to return a non-CRS completion.  On devices
>> +	 * that support PM control and on waits that aren't part of system
>> +	 * resume read the PM control register to ensure the device has
>> +	 * transitioned to D0.  On devices that don't support PM control,
>> +	 * or during system resume read the command register to instead of
>> +	 * Vendor ID so we don't have to contend with the CRS SV value.
>>   	 */
>>   	for (;;) {
>> -		u32 id;
>> -
>>   		if (pci_dev_is_disconnected(dev)) {
>>   			pci_dbg(dev, "disconnected; not waiting\n");
>>   			return -ENOTTY;
>>   		}
>>   
>> -		pci_read_config_dword(dev, PCI_COMMAND, &id);
>> -		if (!PCI_POSSIBLE_ERROR(id))
>> -			break;
>> +		if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) {
>> +			u16 pmcsr;
>> +
>> +			pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>> +			if (!PCI_POSSIBLE_ERROR(pmcsr) &&
>> +			    (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0)
>> +				break;
>> +		} else {
>> +			u32 id;
>> +
>> +			pci_read_config_dword(dev, PCI_COMMAND, &id);
>> +			if (!PCI_POSSIBLE_ERROR(id))
>> +				break;
>> +		}
>>   
>>   		if (delay > timeout) {
>>   			pci_warn(dev, "not ready %dms after %s; giving up\n",
>> -- 
>> 2.43.0
>>
Bjorn Helgaas Sept. 3, 2024, 5:11 p.m. UTC | #7
On Tue, Sep 03, 2024 at 11:29:23AM -0500, Mario Limonciello wrote:
> On 8/29/2024 19:01, Bjorn Helgaas wrote:
> > On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote:
> > > From: Mario Limonciello <mario.limonciello@amd.com>
> > > 
> > > If a dock is plugged in at the same time as autosuspend delay then
> > > this can cause malfunctions in the USB4 stack. This happens because
> > > the device is still in D3cold at the time that the PCI core handed
> > > control back to the USB4 stack.
> > > 
> > > A device that has gone through a reset may return a value in
> > > PCI_COMMAND but that doesn't mean it's finished transitioning to D0.
> > > For devices that support power management explicitly check
> > > PCI_PM_CTRL on everything but system resume to ensure the transition
> > > happened.
> > 
> > Still trying to understand what's going on here.
> > 
> > I posted a change to pci_dev_wait() to read Vendor ID, look for Config
> > RRS status, and wait for a successful completion (when RRS Software
> > Visibility is enabled) [1].
> > 
> > You tested that and found that it didn't help with *this* issue [2].
> > I assume you tested something like v6.11-rc plus the patches from [1],
> > i.e., without the PCI_PM_CTRL changes in this series.
> > 
> >    1) Initially the device is in D0
> > 
> >    2) We put it in D3cold (using some ACPI method) because the
> >    autosuspend delay expired (?)
> > 
> >    3) Plugging in the dock wakes up the device, so we power up the
> >    device (via pci_power_up(), which again uses some ACPI method), and
> >    it should transition to D0uninitialized
> > 
> >    4) The USB4 stack sees the device but thinks it's in D3cold (?)
> ...

> > If you *did* include both [1] and patch 3/5, the implication would be
> > that pci_dev_wait() successfully read the Vendor ID, meaning the
> > device is not in D3cold when pci_power_up() returns.
> 
> The testing here was from the LTS 6.6.y kernel with both [1] and
> patch 3/5 from this series.
> 
> > Can you clarify what you see and possibly expand/correct my
> > timeline above?
> 
> The timeline you shared is nearly correct.  The USB4 stack *thinks*
> the device is in D0 because of the return of pci_power_up().
> 
> As by polling PCI_PM_CTRL we can see it's still in D3, so it hasn't
> made it to D0uninitialized yet.
> 
> I guess I reading between the lines you have an assumption that you
> can't read the vendor ID from D3; which doesn't appear to be the
> case from our testing.

A Vendor ID read of a device in D3hot should definitely work.
Obviously if the device were in D3cold, we'd get no response at all,
so the requester should log a UR error and fabricate ~0 data.

But if the device starts out in D3cold and we power it up, it should
not go through D3hot.  The only legal transition from D3cold is to
D0uninitialized (PCIe r6.0, sec 5.8).

OK, so with [1] and patch 3/5:

  1) Initially the device is in D0

  2) We put it in D3cold (using some ACPI method) because the
  autosuspend delay expired (?)

  3) Plugging in the dock wakes up the device, so we power up the
  device (via pci_power_up(), which again uses some ACPI method), and
  it should transition to D0uninitialized

  4) With patch 3/5, pci_power_up() calls pci_dev_wait() because
  dev->current_state == PCI_D3cold

  5) I *assume* RRS SV is enabled (lspci -vv of Root Port would
  confirm this; maybe we should add a pci_dbg message about which
  register we're polling).  If so, patch [1] means we should poll
  Vendor ID until successful completion.

  6) pci_dbg log should confirm the device is ready with a "ready %dms
  after D3cold->D0" message, which would mean we got a successful
  completion when reading Vendor ID

  7) For debugging purposes, it would be interesting to read and log
  the PCI_PM_CTRL value here.  Per sec 2.3.1, the device is not
  allowed to return RRS at this point since we already got a
  successful completion.

  8) The USB4 stack sees the device and assumes it is in D0, but it
  seems to still be in D3cold.  What is this based on?  Is there a
  config read that returns ~0 data when it shouldn't?

> > [1] https://lore.kernel.org/linux-pci/20240827234848.4429-1-helgaas@kernel.org/
> > [2] https://lore.kernel.org/linux-pci/30d9589a-8050-421b-a9a5-ad3422feadad@amd.com/
> > 
> > > Devices that don't support power management and system resume will
> > > continue to use PCI_COMMAND.
> > > 
> > > Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> > > ---
> > > v4->v5:
> > >   * Fix misleading indentation
> > >   * Amend commit message
> > > ---
> > >   drivers/pci/pci.c | 28 ++++++++++++++++++++--------
> > >   1 file changed, 20 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index 1e219057a5069..f032a4aaec268 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int
> > >   	 * the read (except when CRS SV is enabled and the read was for the
> > >   	 * Vendor ID; in that case it synthesizes 0x0001 data).
> > >   	 *
> > > -	 * Wait for the device to return a non-CRS completion.  Read the
> > > -	 * Command register instead of Vendor ID so we don't have to
> > > -	 * contend with the CRS SV value.
> > > +	 * Wait for the device to return a non-CRS completion.  On devices
> > > +	 * that support PM control and on waits that aren't part of system
> > > +	 * resume read the PM control register to ensure the device has
> > > +	 * transitioned to D0.  On devices that don't support PM control,
> > > +	 * or during system resume read the command register to instead of
> > > +	 * Vendor ID so we don't have to contend with the CRS SV value.
> > >   	 */
> > >   	for (;;) {
> > > -		u32 id;
> > > -
> > >   		if (pci_dev_is_disconnected(dev)) {
> > >   			pci_dbg(dev, "disconnected; not waiting\n");
> > >   			return -ENOTTY;
> > >   		}
> > > -		pci_read_config_dword(dev, PCI_COMMAND, &id);
> > > -		if (!PCI_POSSIBLE_ERROR(id))
> > > -			break;
> > > +		if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) {
> > > +			u16 pmcsr;
> > > +
> > > +			pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > > +			if (!PCI_POSSIBLE_ERROR(pmcsr) &&
> > > +			    (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0)
> > > +				break;
> > > +		} else {
> > > +			u32 id;
> > > +
> > > +			pci_read_config_dword(dev, PCI_COMMAND, &id);
> > > +			if (!PCI_POSSIBLE_ERROR(id))
> > > +				break;
> > > +		}
> > >   		if (delay > timeout) {
> > >   			pci_warn(dev, "not ready %dms after %s; giving up\n",
> > > -- 
> > > 2.43.0
> > > 
>
Mario Limonciello Sept. 3, 2024, 5:31 p.m. UTC | #8
On 9/3/2024 12:11, Bjorn Helgaas wrote:

<snip>

>>
>> I guess I reading between the lines you have an assumption that you
>> can't read the vendor ID from D3; which doesn't appear to be the
>> case from our testing.
> 
> A Vendor ID read of a device in D3hot should definitely work.
> Obviously if the device were in D3cold, we'd get no response at all,
> so the requester should log a UR error and fabricate ~0 data.
> 
> But if the device starts out in D3cold and we power it up, it should
> not go through D3hot.  The only legal transition from D3cold is to
> D0uninitialized (PCIe r6.0, sec 5.8).

Right.  The issue is it didn't finish getting into D3 at the time that 
we attempted to go to D0 though.  So all this extra time is basically 
waiting for the D0->D3 transition to finish followed by D3->D0uninitialized.

The best description I could offer is to call it an "aborted" D3.

> 
> OK, so with [1] and patch 3/5:
> 
>    1) Initially the device is in D0
> 
>    2) We put it in D3cold (using some ACPI method) because the
>    autosuspend delay expired (?)
> 
>    3) Plugging in the dock wakes up the device, so we power up the
>    device (via pci_power_up(), which again uses some ACPI method), and
>    it should transition to D0uninitialized
> 
>    4) With patch 3/5, pci_power_up() calls pci_dev_wait() because
>    dev->current_state == PCI_D3cold
> 
>    5) I *assume* RRS SV is enabled (lspci -vv of Root Port would
>    confirm this; maybe we should add a pci_dbg message about which
>    register we're polling).  If so, patch [1] means we should poll
>    Vendor ID until successful completion.
> 

Yup.
                 RootCap: CRSVisible+
                 RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- 
PMEIntEna+ CRSVisible+

>    6) pci_dbg log should confirm the device is ready with a "ready %dms
>    after D3cold->D0" message, which would mean we got a successful
>    completion when reading Vendor ID
> 
>    7) For debugging purposes, it would be interesting to read and log
>    the PCI_PM_CTRL value here.  Per sec 2.3.1, the device is not
>    allowed to return RRS at this point since we already got a
>    successful completion.
> 

OK let me get a debug log with [1], 3/5, 6.11-rc6 and a message added 
about this value to share back.

>    8) The USB4 stack sees the device and assumes it is in D0, but it
>    seems to still be in D3cold.  What is this based on?  Is there a
>    config read that returns ~0 data when it shouldn't?
> 

Yes there is.  From earlier in the thread I have a [log] I shared.

The message emitted is from ring_interrupt_active():

"thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled"

[log] https://gist.github.com/superm1/cb407766ab15f42f12a6fe9d1196f6fc
Bjorn Helgaas Sept. 3, 2024, 6:25 p.m. UTC | #9
On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote:
> On 9/3/2024 12:11, Bjorn Helgaas wrote:
> ...

> >    8) The USB4 stack sees the device and assumes it is in D0, but it
> >    seems to still be in D3cold.  What is this based on?  Is there a
> >    config read that returns ~0 data when it shouldn't?
> 
> Yes there is.  From earlier in the thread I have a [log] I shared.
> 
> The message emitted is from ring_interrupt_active():
> 
> "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled"

Right, that's in the cover letter, but I can't tell from this what the
ioread32(ring->nhi->iobase + reg) returned.  It looks like this is an
MMIO read of BAR 0, not a config read.

> [log] https://gist.github.com/superm1/cb407766ab15f42f12a6fe9d1196f6fc
Mario Limonciello Sept. 3, 2024, 6:32 p.m. UTC | #10
On 9/3/2024 13:25, Bjorn Helgaas wrote:
> On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote:
>> On 9/3/2024 12:11, Bjorn Helgaas wrote:
>> ...
> 
>>>     8) The USB4 stack sees the device and assumes it is in D0, but it
>>>     seems to still be in D3cold.  What is this based on?  Is there a
>>>     config read that returns ~0 data when it shouldn't?
>>
>> Yes there is.  From earlier in the thread I have a [log] I shared.
>>
>> The message emitted is from ring_interrupt_active():
>>
>> "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled"
> 
> Right, that's in the cover letter, but I can't tell from this what the
> ioread32(ring->nhi->iobase + reg) returned.  It looks like this is an
> MMIO read of BAR 0, not a config read.
> 

Yeah.  I suppose another way to approach this problem is to make 
something else in the call chain poll PCI_PM_CTRL.

Polling at the start of nhi_runtime_resume() should also work.  For the 
"normal" scenario it would just be a single read to PCI_PM_CTRL.

Mika, thoughts?
Bjorn Helgaas Sept. 3, 2024, 9:32 p.m. UTC | #11
On Tue, Sep 03, 2024 at 01:32:30PM -0500, Mario Limonciello wrote:
> On 9/3/2024 13:25, Bjorn Helgaas wrote:
> > On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote:
> > > On 9/3/2024 12:11, Bjorn Helgaas wrote:
> > > ...
> > 
> > > >     8) The USB4 stack sees the device and assumes it is in D0, but it
> > > >     seems to still be in D3cold.  What is this based on?  Is there a
> > > >     config read that returns ~0 data when it shouldn't?
> > > 
> > > Yes there is.  From earlier in the thread I have a [log] I shared.
> > > 
> > > The message emitted is from ring_interrupt_active():
> > > 
> > > "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled"
> > 
> > Right, that's in the cover letter, but I can't tell from this what the
> > ioread32(ring->nhi->iobase + reg) returned.  It looks like this is an
> > MMIO read of BAR 0, not a config read.
> 
> Yeah.  I suppose another way to approach this problem is to make something
> else in the call chain poll PCI_PM_CTRL.
> 
> Polling at the start of nhi_runtime_resume() should also work.  For the
> "normal" scenario it would just be a single read to PCI_PM_CTRL.

It doesn't sound like this is anything specific to Thunderbolt, so
there should be a generic solution.
Mika Westerberg Sept. 4, 2024, 12:05 p.m. UTC | #12
Hi,

On Tue, Sep 03, 2024 at 01:32:30PM -0500, Mario Limonciello wrote:
> On 9/3/2024 13:25, Bjorn Helgaas wrote:
> > On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote:
> > > On 9/3/2024 12:11, Bjorn Helgaas wrote:
> > > ...
> > 
> > > >     8) The USB4 stack sees the device and assumes it is in D0, but it
> > > >     seems to still be in D3cold.  What is this based on?  Is there a
> > > >     config read that returns ~0 data when it shouldn't?
> > > 
> > > Yes there is.  From earlier in the thread I have a [log] I shared.
> > > 
> > > The message emitted is from ring_interrupt_active():
> > > 
> > > "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled"
> > 
> > Right, that's in the cover letter, but I can't tell from this what the
> > ioread32(ring->nhi->iobase + reg) returned.  It looks like this is an
> > MMIO read of BAR 0, not a config read.
> > 
> 
> Yeah.  I suppose another way to approach this problem is to make something
> else in the call chain poll PCI_PM_CTRL.
> 
> Polling at the start of nhi_runtime_resume() should also work.  For the
> "normal" scenario it would just be a single read to PCI_PM_CTRL.
> 
> Mika, thoughts?

I'm starting to wonder if we are looking at the correct place ;-) This
reminds me that our PCIe SV people recently reported a couple of Linux
related issues which they recommended to fix, and these are on my list
but I'll share them because maybe they are related?

First problem, and actually a PCI spec violation, is that Linux does not
clear Bus Master, MMIO and IO space enables when it programs the device
to D3 on runtime suspend path. It does so on system sleep path though.
Something like below (untested) should do that:

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index f412ef73a6e4..79a566376301 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1332,6 +1332,7 @@ static int pci_pm_runtime_suspend(struct device *dev)
 
 	if (!pci_dev->state_saved) {
 		pci_save_state(pci_dev);
+		pci_pm_default_suspend(pci_dev);
 		pci_finish_runtime_suspend(pci_dev);
 	}
 

The second thing is that Thunderbolt driver, for historical reasons,
leaves the MSI enabled when entering D3. This too might be related. I
think we can unconditionally disable it so below hack should do that
(untested as well). I wonder if you could try if any of these or both
can help here? Both of these issues can result unwanted events during D3
entry as far as I understand.

diff --git a/drivers/thunderbolt/ctl.c b/drivers/thunderbolt/ctl.c
index dc1f456736dc..73b815fbbceb 100644
--- a/drivers/thunderbolt/ctl.c
+++ b/drivers/thunderbolt/ctl.c
@@ -659,12 +659,11 @@ struct tb_ctl *tb_ctl_alloc(struct tb_nhi *nhi, int index, int timeout_msec,
 	if (!ctl->frame_pool)
 		goto err;
 
-	ctl->tx = tb_ring_alloc_tx(nhi, 0, 10, RING_FLAG_NO_SUSPEND);
+	ctl->tx = tb_ring_alloc_tx(nhi, 0, 10, 0);
 	if (!ctl->tx)
 		goto err;
 
-	ctl->rx = tb_ring_alloc_rx(nhi, 0, 10, RING_FLAG_NO_SUSPEND, 0, 0xffff,
-				   0xffff, NULL, NULL);
+	ctl->rx = tb_ring_alloc_rx(nhi, 0, 10, 0, 0, 0xffff, 0xffff, NULL, NULL);
 	if (!ctl->rx)
 		goto err;
Mario Limonciello Sept. 4, 2024, 3:24 p.m. UTC | #13
On 9/4/2024 07:05, Mika Westerberg wrote:
> Hi,
> 
> On Tue, Sep 03, 2024 at 01:32:30PM -0500, Mario Limonciello wrote:
>> On 9/3/2024 13:25, Bjorn Helgaas wrote:
>>> On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote:
>>>> On 9/3/2024 12:11, Bjorn Helgaas wrote:
>>>> ...
>>>
>>>>>      8) The USB4 stack sees the device and assumes it is in D0, but it
>>>>>      seems to still be in D3cold.  What is this based on?  Is there a
>>>>>      config read that returns ~0 data when it shouldn't?
>>>>
>>>> Yes there is.  From earlier in the thread I have a [log] I shared.
>>>>
>>>> The message emitted is from ring_interrupt_active():
>>>>
>>>> "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled"
>>>
>>> Right, that's in the cover letter, but I can't tell from this what the
>>> ioread32(ring->nhi->iobase + reg) returned.  It looks like this is an
>>> MMIO read of BAR 0, not a config read.
>>>
>>
>> Yeah.  I suppose another way to approach this problem is to make something
>> else in the call chain poll PCI_PM_CTRL.
>>
>> Polling at the start of nhi_runtime_resume() should also work.  For the
>> "normal" scenario it would just be a single read to PCI_PM_CTRL.
>>
>> Mika, thoughts?

We did this experiment to throw code to poll PCI_PM_CTRL at the start of 
nhi_runtime_resume() but this also fails.  From that I would hypothesize 
the device transitioned to D0uninitialized sometime in the middle of 
pci_pm_runtime_resume() before the call to pm->runtime_resume(dev);

> 
> I'm starting to wonder if we are looking at the correct place ;-) This
> reminds me that our PCIe SV people recently reported a couple of Linux
> related issues which they recommended to fix, and these are on my list
> but I'll share them because maybe they are related?

Thanks for sharing those.  We had a try with them but sorry to say no 
improvements to the issue at hand.

> 
> First problem, and actually a PCI spec violation, is that Linux does not
> clear Bus Master, MMIO and IO space enables when it programs the device
> to D3 on runtime suspend path. It does so on system sleep path though.
> Something like below (untested) should do that:
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index f412ef73a6e4..79a566376301 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1332,6 +1332,7 @@ static int pci_pm_runtime_suspend(struct device *dev)
>   
>   	if (!pci_dev->state_saved) {
>   		pci_save_state(pci_dev);
> +		pci_pm_default_suspend(pci_dev);
>   		pci_finish_runtime_suspend(pci_dev);
>   	}
>   
> 
> The second thing is that Thunderbolt driver, for historical reasons,
> leaves the MSI enabled when entering D3. This too might be related. I
> think we can unconditionally disable it so below hack should do that
> (untested as well). I wonder if you could try if any of these or both
> can help here? Both of these issues can result unwanted events during D3
> entry as far as I understand.
> 
> diff --git a/drivers/thunderbolt/ctl.c b/drivers/thunderbolt/ctl.c
> index dc1f456736dc..73b815fbbceb 100644
> --- a/drivers/thunderbolt/ctl.c
> +++ b/drivers/thunderbolt/ctl.c
> @@ -659,12 +659,11 @@ struct tb_ctl *tb_ctl_alloc(struct tb_nhi *nhi, int index, int timeout_msec,
>   	if (!ctl->frame_pool)
>   		goto err;
>   
> -	ctl->tx = tb_ring_alloc_tx(nhi, 0, 10, RING_FLAG_NO_SUSPEND);
> +	ctl->tx = tb_ring_alloc_tx(nhi, 0, 10, 0);
>   	if (!ctl->tx)
>   		goto err;
>   
> -	ctl->rx = tb_ring_alloc_rx(nhi, 0, 10, RING_FLAG_NO_SUSPEND, 0, 0xffff,
> -				   0xffff, NULL, NULL);
> +	ctl->rx = tb_ring_alloc_rx(nhi, 0, 10, 0, 0, 0xffff, 0xffff, NULL, NULL);
>   	if (!ctl->rx)
>   		goto err;
>
Mika Westerberg Sept. 5, 2024, 9:33 a.m. UTC | #14
Hi,

On Wed, Sep 04, 2024 at 10:24:26AM -0500, Mario Limonciello wrote:
> On 9/4/2024 07:05, Mika Westerberg wrote:
> > Hi,
> > 
> > On Tue, Sep 03, 2024 at 01:32:30PM -0500, Mario Limonciello wrote:
> > > On 9/3/2024 13:25, Bjorn Helgaas wrote:
> > > > On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote:
> > > > > On 9/3/2024 12:11, Bjorn Helgaas wrote:
> > > > > ...
> > > > 
> > > > > >      8) The USB4 stack sees the device and assumes it is in D0, but it
> > > > > >      seems to still be in D3cold.  What is this based on?  Is there a
> > > > > >      config read that returns ~0 data when it shouldn't?
> > > > > 
> > > > > Yes there is.  From earlier in the thread I have a [log] I shared.
> > > > > 
> > > > > The message emitted is from ring_interrupt_active():
> > > > > 
> > > > > "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled"
> > > > 
> > > > Right, that's in the cover letter, but I can't tell from this what the
> > > > ioread32(ring->nhi->iobase + reg) returned.  It looks like this is an
> > > > MMIO read of BAR 0, not a config read.
> > > > 
> > > 
> > > Yeah.  I suppose another way to approach this problem is to make something
> > > else in the call chain poll PCI_PM_CTRL.
> > > 
> > > Polling at the start of nhi_runtime_resume() should also work.  For the
> > > "normal" scenario it would just be a single read to PCI_PM_CTRL.
> > > 
> > > Mika, thoughts?
> 
> We did this experiment to throw code to poll PCI_PM_CTRL at the start of
> nhi_runtime_resume() but this also fails.  From that I would hypothesize the
> device transitioned to D0uninitialized sometime in the middle of
> pci_pm_runtime_resume() before the call to pm->runtime_resume(dev);
> 
> > 
> > I'm starting to wonder if we are looking at the correct place ;-) This
> > reminds me that our PCIe SV people recently reported a couple of Linux
> > related issues which they recommended to fix, and these are on my list
> > but I'll share them because maybe they are related?
> 
> Thanks for sharing those.  We had a try with them but sorry to say no
> improvements to the issue at hand.

Okay, thanks for checking.

Few additional side paths here, though. This is supposed to work so that
once the host router sleep bit is set the driver is supposed to allow
the domain to enter sleep (e.g it should not be waken up before it is
fully transitioned). That's what we do:

1. All tunneled PCIe Root/Downstream ports are in D3.
2. All tunneled USB 3.x ports are in U3.
3. No DisplayPort is tunneled.
4. Thunderbolt driver enables wakes.
5. Thunderbolt driver writes sleep ready bit of the host router.
6. Thunderbolt driver runtime suspend is complete.
7. ACPI method is called (_PS3 or _PR3.OFF) that will trigger the "Sleep
Event".

If between 5 and 7 there is device connected, it should not "abort" the
sequence. Unfortunately this is not explict in the USB4 spec but the
connection manager guide has similar note. Even if the connect happens
there the "Sleep Event" should happen but after that it can trigger
normal wakeup which will then bring everything back.

Would it be possible to enable tracing around these steps so that we
could see if there is hotplug notification somewhere there that is not
expected? Here are instructions how to get pretty accurate trace:

https://github.com/intel/tbtools?tab=readme-ov-file#tracing

Please also take full dmesg.

It is entirely possible that this has nothing to do with the issue but I
think it is worth checking.

The second thing we could try is to check the wake status bits after
this has happened, like:

  # tbdump -r 0 -a <ADAPTER> -vv -N 1 PORT_CS_18

(where <ADAPTER> is the lane 0 adapter of the USB4 port the device was
connected).

The third thing to try is to comment out TB_WAKE_ON_CONNECT in
tb_switch_suspend(). This should result no wake even if the device is
connected. This tells us that it is really the connect on USB4 port that
triggered the wake.

These could (also) explain why the host router appears to be in D3 even
if it should be in D0 already.
Mario Limonciello Sept. 9, 2024, 8:40 p.m. UTC | #15
On 9/5/2024 04:33, Mika Westerberg wrote:
> Hi,
> 
> On Wed, Sep 04, 2024 at 10:24:26AM -0500, Mario Limonciello wrote:
>> On 9/4/2024 07:05, Mika Westerberg wrote:
>>> Hi,
>>>
>>> On Tue, Sep 03, 2024 at 01:32:30PM -0500, Mario Limonciello wrote:
>>>> On 9/3/2024 13:25, Bjorn Helgaas wrote:
>>>>> On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote:
>>>>>> On 9/3/2024 12:11, Bjorn Helgaas wrote:
>>>>>> ...
>>>>>
>>>>>>>       8) The USB4 stack sees the device and assumes it is in D0, but it
>>>>>>>       seems to still be in D3cold.  What is this based on?  Is there a
>>>>>>>       config read that returns ~0 data when it shouldn't?
>>>>>>
>>>>>> Yes there is.  From earlier in the thread I have a [log] I shared.
>>>>>>
>>>>>> The message emitted is from ring_interrupt_active():
>>>>>>
>>>>>> "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled"
>>>>>
>>>>> Right, that's in the cover letter, but I can't tell from this what the
>>>>> ioread32(ring->nhi->iobase + reg) returned.  It looks like this is an
>>>>> MMIO read of BAR 0, not a config read.
>>>>>
>>>>
>>>> Yeah.  I suppose another way to approach this problem is to make something
>>>> else in the call chain poll PCI_PM_CTRL.
>>>>
>>>> Polling at the start of nhi_runtime_resume() should also work.  For the
>>>> "normal" scenario it would just be a single read to PCI_PM_CTRL.
>>>>
>>>> Mika, thoughts?
>>
>> We did this experiment to throw code to poll PCI_PM_CTRL at the start of
>> nhi_runtime_resume() but this also fails.  From that I would hypothesize the
>> device transitioned to D0uninitialized sometime in the middle of
>> pci_pm_runtime_resume() before the call to pm->runtime_resume(dev);
>>
>>>
>>> I'm starting to wonder if we are looking at the correct place ;-) This
>>> reminds me that our PCIe SV people recently reported a couple of Linux
>>> related issues which they recommended to fix, and these are on my list
>>> but I'll share them because maybe they are related?
>>
>> Thanks for sharing those.  We had a try with them but sorry to say no
>> improvements to the issue at hand.
> 
> Okay, thanks for checking.
> 
> Few additional side paths here, though. This is supposed to work so that
> once the host router sleep bit is set the driver is supposed to allow
> the domain to enter sleep (e.g it should not be waken up before it is
> fully transitioned). That's what we do:
> 
> 1. All tunneled PCIe Root/Downstream ports are in D3.
> 2. All tunneled USB 3.x ports are in U3.
> 3. No DisplayPort is tunneled.
> 4. Thunderbolt driver enables wakes.
> 5. Thunderbolt driver writes sleep ready bit of the host router.
> 6. Thunderbolt driver runtime suspend is complete.
> 7. ACPI method is called (_PS3 or _PR3.OFF) that will trigger the "Sleep
> Event".
> 
> If between 5 and 7 there is device connected, it should not "abort" the
> sequence. Unfortunately this is not explict in the USB4 spec but the
> connection manager guide has similar note. Even if the connect happens
> there the "Sleep Event" should happen but after that it can trigger
> normal wakeup which will then bring everything back.
> 
> Would it be possible to enable tracing around these steps so that we
> could see if there is hotplug notification somewhere there that is not
> expected? Here are instructions how to get pretty accurate trace:
> 
> https://github.com/intel/tbtools?tab=readme-ov-file#tracing
> 
> Please also take full dmesg.

Sure, here is the dmesg with tracing enabled:

https://gist.github.com/superm1/5186e0023c8a5d2ecd75c50fd2168308
> 
> It is entirely possible that this has nothing to do with the issue but I
> think it is worth checking.
> 
> The second thing we could try is to check the wake status bits after
> this has happened, like:
> 
>    # tbdump -r 0 -a <ADAPTER> -vv -N 1 PORT_CS_18
> 
> (where <ADAPTER> is the lane 0 adapter of the USB4 port the device was
> connected).
> 

Unfortunately the adapter is in such a bad state at this time that 
tbdump doesn't work.

> The third thing to try is to comment out TB_WAKE_ON_CONNECT in
> tb_switch_suspend(). This should result no wake even if the device is
> connected. This tells us that it is really the connect on USB4 port that
> triggered the wake.

Yup that's correct; there is no action on the hotplug with this change.

> 
> These could (also) explain why the host router appears to be in D3 even
> if it should be in D0 already.
Mika Westerberg Sept. 10, 2024, 9:13 a.m. UTC | #16
Hi,

On Mon, Sep 09, 2024 at 03:40:54PM -0500, Mario Limonciello wrote:
> > Few additional side paths here, though. This is supposed to work so that
> > once the host router sleep bit is set the driver is supposed to allow
> > the domain to enter sleep (e.g it should not be waken up before it is
> > fully transitioned). That's what we do:
> > 
> > 1. All tunneled PCIe Root/Downstream ports are in D3.
> > 2. All tunneled USB 3.x ports are in U3.
> > 3. No DisplayPort is tunneled.
> > 4. Thunderbolt driver enables wakes.
> > 5. Thunderbolt driver writes sleep ready bit of the host router.
> > 6. Thunderbolt driver runtime suspend is complete.
> > 7. ACPI method is called (_PS3 or _PR3.OFF) that will trigger the "Sleep
> > Event".
> > 
> > If between 5 and 7 there is device connected, it should not "abort" the
> > sequence. Unfortunately this is not explict in the USB4 spec but the
> > connection manager guide has similar note. Even if the connect happens
> > there the "Sleep Event" should happen but after that it can trigger
> > normal wakeup which will then bring everything back.
> > 
> > Would it be possible to enable tracing around these steps so that we
> > could see if there is hotplug notification somewhere there that is not
> > expected? Here are instructions how to get pretty accurate trace:
> > 
> > https://github.com/intel/tbtools?tab=readme-ov-file#tracing
> > 
> > Please also take full dmesg.
> 
> Sure, here is the dmesg with tracing enabled:
> 
> https://gist.github.com/superm1/5186e0023c8a5d2ecd75c50fd2168308

Thanks! I meant also enable control channel tracing as explained in the
above linked page so we could maybe see the hotplug packet coming from
the lane adapter too. Anyhow,

[  560.789681] thunderbolt 0-2: device disconnected
[  560.789771] thunderbolt 0000:e5:00.5: bandwidth consumption changed, re-calculating estimated bandwidth
[  560.789775] thunderbolt 0000:e5:00.5: bandwidth re-calculation done
[  560.789778] thunderbolt 0000:e5:00.5: looking for DP IN <-> DP OUT pairs:
[  560.789800] thunderbolt 0000:e5:00.5: 0:6: DP IN available
[  560.789803] thunderbolt 0000:e5:00.5: 0:6: no suitable DP OUT adapter available, not tunneling
[  560.790484] thunderbolt 0000:e5:00.5: 0:7: DP IN available
[  560.790487] thunderbolt 0000:e5:00.5: 0:7: no suitable DP OUT adapter available, not tunneling
...
[  578.677640] thunderbolt 0000:e5:00.5: nhi_runtime_suspend() - enter, pdev->current_state = 0
[  578.677648] thunderbolt 0000:e5:00.5: 0: suspending switch
[  578.677653] thunderbolt 0000:e5:00.5: 0: enabling wakeup: 0x3f
[  578.678248] thunderbolt 0000:e5:00.5: stopping RX ring 0
[  578.678256] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 3 (0x9 -> 0x1)
[  578.678272] thunderbolt 0000:e5:00.5: stopping TX ring 0
[  578.678277] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 0 (0x1 -> 0x0)
[  578.678287] thunderbolt 0000:e5:00.5: control channel stopped
...
pci_pm_runtime_resume()
  pci_pm_default_resume_early(pci_dev);
    pci_pm_power_up_and_verify_state(pci_dev);
      pci_power_up(pci_dev);
        platform_pci_set_power_state(dev, PCI_D0);
	// here we should see acpi_pci_set_power_state() printing "xxx: power
	// state changed by ACPI to D0" but we don't.
	//
	// also pdev->current_state is set to PCI_D0
      pci_update_current_state(pci_dev, PCI_D0);
        // this one reads it back, it should be 0..
  pm->runtime_resume(dev);
    nhi_runtime_resume(dev)
[  578.773503] thunderbolt 0000:e5:00.5: nhi_runtime_resume() - enter, pdev->current_state = 3

// .. but it is not. It seems to be powered off, D3cold.

[  578.773516] thunderbolt 0000:e5:00.5: control channel starting...
[  578.773518] thunderbolt 0000:e5:00.5: starting TX ring 0
[  578.773524] thunderbolt 0000:e5:00.5: enabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xffffffff)

// As here too.

It would be interesting to see what ACPI does here, I mean enabling
dynamic debugging of acpi_device_set_power() and friends. One suspect is
that when the runtime suspends happens, the power resource or _PS3()
gets called to put the device into D3cold and it has some sort of
Sleep() inside. This allows the OS to relese the CPU to do other things
too according to ACPI spec (and Linux does this) so now when you plug in
the device the GPE triggers wake and Linux starts processing that one.
We see the above runtime resume but then the Sleep() in the _PS3()
completes and cuts the power from the NHI while we are in the middle of
resuming it already (or something along those lines).

We saw similar "context switch" happening when PCIe L2 failed, BIOS
implemented the timeout using Sleep() that allowed Linux to process
other things which resulted unexpected wake (not the same as here
though).

So one thing to check too is how the ACPI methods get called and
especially if they somehow end up messing with each other.

> > It is entirely possible that this has nothing to do with the issue but I
> > think it is worth checking.
> > 
> > The second thing we could try is to check the wake status bits after
> > this has happened, like:
> > 
> >    # tbdump -r 0 -a <ADAPTER> -vv -N 1 PORT_CS_18
> > 
> > (where <ADAPTER> is the lane 0 adapter of the USB4 port the device was
> > connected).
> > 
> 
> Unfortunately the adapter is in such a bad state at this time that tbdump
> doesn't work.

Oh, indeed it is.

> > The third thing to try is to comment out TB_WAKE_ON_CONNECT in
> > tb_switch_suspend(). This should result no wake even if the device is
> > connected. This tells us that it is really the connect on USB4 port that
> > triggered the wake.
> 
> Yup that's correct; there is no action on the hotplug with this change.

Okay thanks for checking.
Mario Limonciello Sept. 13, 2024, 4:12 a.m. UTC | #17
On 9/10/2024 04:13, Mika Westerberg wrote:
> Hi,
> 
> On Mon, Sep 09, 2024 at 03:40:54PM -0500, Mario Limonciello wrote:
>>> Few additional side paths here, though. This is supposed to work so that
>>> once the host router sleep bit is set the driver is supposed to allow
>>> the domain to enter sleep (e.g it should not be waken up before it is
>>> fully transitioned). That's what we do:
>>>
>>> 1. All tunneled PCIe Root/Downstream ports are in D3.
>>> 2. All tunneled USB 3.x ports are in U3.
>>> 3. No DisplayPort is tunneled.
>>> 4. Thunderbolt driver enables wakes.
>>> 5. Thunderbolt driver writes sleep ready bit of the host router.
>>> 6. Thunderbolt driver runtime suspend is complete.
>>> 7. ACPI method is called (_PS3 or _PR3.OFF) that will trigger the "Sleep
>>> Event".
>>>
>>> If between 5 and 7 there is device connected, it should not "abort" the
>>> sequence. Unfortunately this is not explict in the USB4 spec but the
>>> connection manager guide has similar note. Even if the connect happens
>>> there the "Sleep Event" should happen but after that it can trigger
>>> normal wakeup which will then bring everything back.
>>>
>>> Would it be possible to enable tracing around these steps so that we
>>> could see if there is hotplug notification somewhere there that is not
>>> expected? Here are instructions how to get pretty accurate trace:
>>>
>>> https://github.com/intel/tbtools?tab=readme-ov-file#tracing
>>>
>>> Please also take full dmesg.
>>
>> Sure, here is the dmesg with tracing enabled:
>>
>> https://gist.github.com/superm1/5186e0023c8a5d2ecd75c50fd2168308
> 
> Thanks! I meant also enable control channel tracing as explained in the
> above linked page so we could maybe see the hotplug packet coming from
> the lane adapter too. Anyhow,
> 
> [  560.789681] thunderbolt 0-2: device disconnected
> [  560.789771] thunderbolt 0000:e5:00.5: bandwidth consumption changed, re-calculating estimated bandwidth
> [  560.789775] thunderbolt 0000:e5:00.5: bandwidth re-calculation done
> [  560.789778] thunderbolt 0000:e5:00.5: looking for DP IN <-> DP OUT pairs:
> [  560.789800] thunderbolt 0000:e5:00.5: 0:6: DP IN available
> [  560.789803] thunderbolt 0000:e5:00.5: 0:6: no suitable DP OUT adapter available, not tunneling
> [  560.790484] thunderbolt 0000:e5:00.5: 0:7: DP IN available
> [  560.790487] thunderbolt 0000:e5:00.5: 0:7: no suitable DP OUT adapter available, not tunneling
> ...
> [  578.677640] thunderbolt 0000:e5:00.5: nhi_runtime_suspend() - enter, pdev->current_state = 0
> [  578.677648] thunderbolt 0000:e5:00.5: 0: suspending switch
> [  578.677653] thunderbolt 0000:e5:00.5: 0: enabling wakeup: 0x3f
> [  578.678248] thunderbolt 0000:e5:00.5: stopping RX ring 0
> [  578.678256] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 3 (0x9 -> 0x1)
> [  578.678272] thunderbolt 0000:e5:00.5: stopping TX ring 0
> [  578.678277] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 0 (0x1 -> 0x0)
> [  578.678287] thunderbolt 0000:e5:00.5: control channel stopped
> ...
> pci_pm_runtime_resume()
>    pci_pm_default_resume_early(pci_dev);
>      pci_pm_power_up_and_verify_state(pci_dev);
>        pci_power_up(pci_dev);
>          platform_pci_set_power_state(dev, PCI_D0);
> 	// here we should see acpi_pci_set_power_state() printing "xxx: power
> 	// state changed by ACPI to D0" but we don't.
> 	//
> 	// also pdev->current_state is set to PCI_D0
>        pci_update_current_state(pci_dev, PCI_D0);
>          // this one reads it back, it should be 0..
>    pm->runtime_resume(dev);
>      nhi_runtime_resume(dev)
> [  578.773503] thunderbolt 0000:e5:00.5: nhi_runtime_resume() - enter, pdev->current_state = 3
> 
> // .. but it is not. It seems to be powered off, D3cold.
> 
> [  578.773516] thunderbolt 0000:e5:00.5: control channel starting...
> [  578.773518] thunderbolt 0000:e5:00.5: starting TX ring 0
> [  578.773524] thunderbolt 0000:e5:00.5: enabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xffffffff)
> 
> // As here too.
> 
> It would be interesting to see what ACPI does here, I mean enabling
> dynamic debugging of acpi_device_set_power() and friends. 

Here is a log with dyndbg turned up.
https://gist.github.com/superm1/2f7ca6123431365d11e9bc3cc9329c14

You can see firmware node \_SB_.PCI0.BUSC.NHI0 is physical device 
0000:e5:00.5.

> One suspect is
> that when the runtime suspends happens, the power resource or _PS3()
> gets called to put the device into D3cold and it has some sort of
> Sleep() inside. This allows the OS to relese the CPU to do other things
> too according to ACPI spec (and Linux does this) so now when you plug in
> the device the GPE triggers wake and Linux starts processing that one.
> We see the above runtime resume but then the Sleep() in the _PS3()
> completes and cuts the power from the NHI while we are in the middle of
> resuming it already (or something along those lines).
> 
> We saw similar "context switch" happening when PCIe L2 failed, BIOS
> implemented the timeout using Sleep() that allowed Linux to process
> other things which resulted unexpected wake (not the same as here
> though).
> 
> So one thing to check too is how the ACPI methods get called and
> especially if they somehow end up messing with each other.

At least to me it looks pretty straightforward that each D0 sequence is 
going through and nothing looked out of order:

acpi_pci_set_power_state()
->acpi_device_set_power()
->->acpi_dev_pm_explicit_set()


There is a Sleep() call in M232() which is a SMI used in both _PS0 and 
_PS3, but as far as I can tell it's not actually called as the case 
always has zero as an argument.

Here's some snippets of ASL from the NHI0 device to see.
https://gist.github.com/superm1/32ef1f822a6220a41b19574024717f79

I feel your theory is right about it being an aborted D3cold request, 
but I don't believe it's from concurrency issue of _PS3.  I feel there 
isn't any guard rails either in Linux or the AML for ensuring that
the transition actually finished.
Mika Westerberg Sept. 13, 2024, 4:58 a.m. UTC | #18
Hi,

On Thu, Sep 12, 2024 at 11:12:11PM -0500, Mario Limonciello wrote:
> On 9/10/2024 04:13, Mika Westerberg wrote:
> > Hi,
> > 
> > On Mon, Sep 09, 2024 at 03:40:54PM -0500, Mario Limonciello wrote:
> > > > Few additional side paths here, though. This is supposed to work so that
> > > > once the host router sleep bit is set the driver is supposed to allow
> > > > the domain to enter sleep (e.g it should not be waken up before it is
> > > > fully transitioned). That's what we do:
> > > > 
> > > > 1. All tunneled PCIe Root/Downstream ports are in D3.
> > > > 2. All tunneled USB 3.x ports are in U3.
> > > > 3. No DisplayPort is tunneled.
> > > > 4. Thunderbolt driver enables wakes.
> > > > 5. Thunderbolt driver writes sleep ready bit of the host router.
> > > > 6. Thunderbolt driver runtime suspend is complete.
> > > > 7. ACPI method is called (_PS3 or _PR3.OFF) that will trigger the "Sleep
> > > > Event".
> > > > 
> > > > If between 5 and 7 there is device connected, it should not "abort" the
> > > > sequence. Unfortunately this is not explict in the USB4 spec but the
> > > > connection manager guide has similar note. Even if the connect happens
> > > > there the "Sleep Event" should happen but after that it can trigger
> > > > normal wakeup which will then bring everything back.
> > > > 
> > > > Would it be possible to enable tracing around these steps so that we
> > > > could see if there is hotplug notification somewhere there that is not
> > > > expected? Here are instructions how to get pretty accurate trace:
> > > > 
> > > > https://github.com/intel/tbtools?tab=readme-ov-file#tracing
> > > > 
> > > > Please also take full dmesg.
> > > 
> > > Sure, here is the dmesg with tracing enabled:
> > > 
> > > https://gist.github.com/superm1/5186e0023c8a5d2ecd75c50fd2168308
> > 
> > Thanks! I meant also enable control channel tracing as explained in the
> > above linked page so we could maybe see the hotplug packet coming from
> > the lane adapter too. Anyhow,
> > 
> > [  560.789681] thunderbolt 0-2: device disconnected
> > [  560.789771] thunderbolt 0000:e5:00.5: bandwidth consumption changed, re-calculating estimated bandwidth
> > [  560.789775] thunderbolt 0000:e5:00.5: bandwidth re-calculation done
> > [  560.789778] thunderbolt 0000:e5:00.5: looking for DP IN <-> DP OUT pairs:
> > [  560.789800] thunderbolt 0000:e5:00.5: 0:6: DP IN available
> > [  560.789803] thunderbolt 0000:e5:00.5: 0:6: no suitable DP OUT adapter available, not tunneling
> > [  560.790484] thunderbolt 0000:e5:00.5: 0:7: DP IN available
> > [  560.790487] thunderbolt 0000:e5:00.5: 0:7: no suitable DP OUT adapter available, not tunneling
> > ...
> > [  578.677640] thunderbolt 0000:e5:00.5: nhi_runtime_suspend() - enter, pdev->current_state = 0
> > [  578.677648] thunderbolt 0000:e5:00.5: 0: suspending switch
> > [  578.677653] thunderbolt 0000:e5:00.5: 0: enabling wakeup: 0x3f
> > [  578.678248] thunderbolt 0000:e5:00.5: stopping RX ring 0
> > [  578.678256] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 3 (0x9 -> 0x1)
> > [  578.678272] thunderbolt 0000:e5:00.5: stopping TX ring 0
> > [  578.678277] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 0 (0x1 -> 0x0)
> > [  578.678287] thunderbolt 0000:e5:00.5: control channel stopped
> > ...
> > pci_pm_runtime_resume()
> >    pci_pm_default_resume_early(pci_dev);
> >      pci_pm_power_up_and_verify_state(pci_dev);
> >        pci_power_up(pci_dev);
> >          platform_pci_set_power_state(dev, PCI_D0);
> > 	// here we should see acpi_pci_set_power_state() printing "xxx: power
> > 	// state changed by ACPI to D0" but we don't.
> > 	//
> > 	// also pdev->current_state is set to PCI_D0
> >        pci_update_current_state(pci_dev, PCI_D0);
> >          // this one reads it back, it should be 0..
> >    pm->runtime_resume(dev);
> >      nhi_runtime_resume(dev)
> > [  578.773503] thunderbolt 0000:e5:00.5: nhi_runtime_resume() - enter, pdev->current_state = 3
> > 
> > // .. but it is not. It seems to be powered off, D3cold.
> > 
> > [  578.773516] thunderbolt 0000:e5:00.5: control channel starting...
> > [  578.773518] thunderbolt 0000:e5:00.5: starting TX ring 0
> > [  578.773524] thunderbolt 0000:e5:00.5: enabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xffffffff)
> > 
> > // As here too.
> > 
> > It would be interesting to see what ACPI does here, I mean enabling
> > dynamic debugging of acpi_device_set_power() and friends.
> 
> Here is a log with dyndbg turned up.
> https://gist.github.com/superm1/2f7ca6123431365d11e9bc3cc9329c14
> 
> You can see firmware node \_SB_.PCI0.BUSC.NHI0 is physical device
> 0000:e5:00.5.

Thanks!

> > One suspect is
> > that when the runtime suspends happens, the power resource or _PS3()
> > gets called to put the device into D3cold and it has some sort of
> > Sleep() inside. This allows the OS to relese the CPU to do other things
> > too according to ACPI spec (and Linux does this) so now when you plug in
> > the device the GPE triggers wake and Linux starts processing that one.
> > We see the above runtime resume but then the Sleep() in the _PS3()
> > completes and cuts the power from the NHI while we are in the middle of
> > resuming it already (or something along those lines).
> > 
> > We saw similar "context switch" happening when PCIe L2 failed, BIOS
> > implemented the timeout using Sleep() that allowed Linux to process
> > other things which resulted unexpected wake (not the same as here
> > though).
> > 
> > So one thing to check too is how the ACPI methods get called and
> > especially if they somehow end up messing with each other.
> 
> At least to me it looks pretty straightforward that each D0 sequence is
> going through and nothing looked out of order:
> 
> acpi_pci_set_power_state()
> ->acpi_device_set_power()
> ->->acpi_dev_pm_explicit_set()
> 
> 
> There is a Sleep() call in M232() which is a SMI used in both _PS0 and _PS3,
> but as far as I can tell it's not actually called as the case always has
> zero as an argument.
> 
> Here's some snippets of ASL from the NHI0 device to see.
> https://gist.github.com/superm1/32ef1f822a6220a41b19574024717f79
> 
> I feel your theory is right about it being an aborted D3cold request, but I
> don't believe it's from concurrency issue of _PS3.  I feel there isn't any
> guard rails either in Linux or the AML for ensuring that
> the transition actually finished.

Yeah, I agree now. It does not look like the methods are messing each
other here. We don't see the GPE handler being run but I don't think it
matters here. For some reason the device just is not yet ready when it
is supposed to be in D0.

Sorry for wasting your time with these suspects.
Mika Westerberg Sept. 13, 2024, 7:23 a.m. UTC | #19
Hi again,

On Fri, Sep 13, 2024 at 07:58:07AM +0300, Mika Westerberg wrote:
> Yeah, I agree now. It does not look like the methods are messing each
> other here. We don't see the GPE handler being run but I don't think it
> matters here. For some reason the device just is not yet ready when it
> is supposed to be in D0.
> 
> Sorry for wasting your time with these suspects.

One more suggestion though ;-) I realized that my hack patch to disable
I/O and MMIO did not actually do that because it looks like we don't
clear those bits ever. I wonder if you could still check if the below
changes anything? At least it should now "disable" the device to follow
the spec.

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index f412ef73a6e4..79a566376301 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1332,6 +1332,7 @@ static int pci_pm_runtime_suspend(struct device *dev)
 
 	if (!pci_dev->state_saved) {
 		pci_save_state(pci_dev);
+		pci_pm_default_suspend(pci_dev);
 		pci_finish_runtime_suspend(pci_dev);
 	}
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ffaaca0978cb..91f4e7a03c94 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2218,6 +2218,13 @@ static void do_pci_disable_device(struct pci_dev *dev)
 		pci_command &= ~PCI_COMMAND_MASTER;
 		pci_write_config_word(dev, PCI_COMMAND, pci_command);
 	}
+	/*
+	 * PCI PM 1.2 sec 8.2.2 says that when a function is put into D3
+	 * the OS needs to disable I/O and MMIO space in addition to bus
+	 * mastering so do that here.
+	 */
+	pci_command &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY);
+	pci_write_config_word(dev, PCI_COMMAND, pci_command);
 
 	pcibios_disable_device(dev);
 }
Mario Limonciello Sept. 13, 2024, 8:56 p.m. UTC | #20
On 9/13/2024 02:23, Mika Westerberg wrote:
> Hi again,
> 
> On Fri, Sep 13, 2024 at 07:58:07AM +0300, Mika Westerberg wrote:
>> Yeah, I agree now. It does not look like the methods are messing each
>> other here. We don't see the GPE handler being run but I don't think it
>> matters here. For some reason the device just is not yet ready when it
>> is supposed to be in D0.
>>
>> Sorry for wasting your time with these suspects.

Don't worry about it.  It validates that we're coming to the right 
conclusions about where the problem is.

> 
> One more suggestion though ;-) I realized that my hack patch to disable
> I/O and MMIO did not actually do that because it looks like we don't
> clear those bits ever. I wonder if you could still check if the below
> changes anything? At least it should now "disable" the device to follow
> the spec.

This actually causes the system to fail to boot.  I guess some deadlock 
from other callers to pci_disable_device().

We also double checked putting the PCI_COMMAND writes just into the 
runtime suspend call backs instead (to narrow down if that is part of 
the issue here).  Putting it there fixed the boot hang, but no change to 
the actual issue behavior.

Bjorn,

Considering this finding, are you more amenable to the polling approach?

> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index f412ef73a6e4..79a566376301 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1332,6 +1332,7 @@ static int pci_pm_runtime_suspend(struct device *dev)
>   
>   	if (!pci_dev->state_saved) {
>   		pci_save_state(pci_dev);
> +		pci_pm_default_suspend(pci_dev);
>   		pci_finish_runtime_suspend(pci_dev);
>   	}
>   
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index ffaaca0978cb..91f4e7a03c94 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2218,6 +2218,13 @@ static void do_pci_disable_device(struct pci_dev *dev)
>   		pci_command &= ~PCI_COMMAND_MASTER;
>   		pci_write_config_word(dev, PCI_COMMAND, pci_command);
>   	}
> +	/*
> +	 * PCI PM 1.2 sec 8.2.2 says that when a function is put into D3
> +	 * the OS needs to disable I/O and MMIO space in addition to bus
> +	 * mastering so do that here.
> +	 */
> +	pci_command &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY);
> +	pci_write_config_word(dev, PCI_COMMAND, pci_command);
>   
>   	pcibios_disable_device(dev);
>   }
Mika Westerberg Sept. 15, 2024, 7:07 a.m. UTC | #21
Hi,

On Fri, Sep 13, 2024 at 03:56:46PM -0500, Mario Limonciello wrote:
> > One more suggestion though ;-) I realized that my hack patch to disable
> > I/O and MMIO did not actually do that because it looks like we don't
> > clear those bits ever. I wonder if you could still check if the below
> > changes anything? At least it should now "disable" the device to follow
> > the spec.
> 
> This actually causes the system to fail to boot.  I guess some deadlock from
> other callers to pci_disable_device().
> 
> We also double checked putting the PCI_COMMAND writes just into the runtime
> suspend call backs instead (to narrow down if that is part of the issue
> here).  Putting it there fixed the boot hang, but no change to the actual
> issue behavior.

Okay, thanks again for checking!
diff mbox series

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 1e219057a5069..f032a4aaec268 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1309,21 +1309,33 @@  static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int
 	 * the read (except when CRS SV is enabled and the read was for the
 	 * Vendor ID; in that case it synthesizes 0x0001 data).
 	 *
-	 * Wait for the device to return a non-CRS completion.  Read the
-	 * Command register instead of Vendor ID so we don't have to
-	 * contend with the CRS SV value.
+	 * Wait for the device to return a non-CRS completion.  On devices
+	 * that support PM control and on waits that aren't part of system
+	 * resume read the PM control register to ensure the device has
+	 * transitioned to D0.  On devices that don't support PM control,
+	 * or during system resume read the command register to instead of
+	 * Vendor ID so we don't have to contend with the CRS SV value.
 	 */
 	for (;;) {
-		u32 id;
-
 		if (pci_dev_is_disconnected(dev)) {
 			pci_dbg(dev, "disconnected; not waiting\n");
 			return -ENOTTY;
 		}
 
-		pci_read_config_dword(dev, PCI_COMMAND, &id);
-		if (!PCI_POSSIBLE_ERROR(id))
-			break;
+		if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) {
+			u16 pmcsr;
+
+			pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
+			if (!PCI_POSSIBLE_ERROR(pmcsr) &&
+			    (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0)
+				break;
+		} else {
+			u32 id;
+
+			pci_read_config_dword(dev, PCI_COMMAND, &id);
+			if (!PCI_POSSIBLE_ERROR(id))
+				break;
+		}
 
 		if (delay > timeout) {
 			pci_warn(dev, "not ready %dms after %s; giving up\n",