diff mbox series

thunderbolt: Resume PCIe bridges after switch is found on AMD USB4 controller

Message ID 20220905065622.1573811-1-kai.heng.feng@canonical.com (mailing list archive)
State New, archived
Headers show
Series thunderbolt: Resume PCIe bridges after switch is found on AMD USB4 controller | expand

Commit Message

Kai-Heng Feng Sept. 5, 2022, 6:56 a.m. UTC
AMD USB4 can not detect external PCIe devices like external NVMe when
it's hotplugged, because card/link are not up:

pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101

Use `lspci` to resume pciehp bridges can find external devices.

A long delay before checking card/link presence doesn't help, either.
The only way to make the hotplug work is to enable pciehp interrupt and
check card presence after the TB switch is added.

Since the topology of USB4 and its PCIe bridges are siblings, hardcode
the bridge ID so TBT driver can wake them up to check presence.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216448
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/thunderbolt/nhi.c    | 29 +++++++++++++++++++++++++++++
 drivers/thunderbolt/switch.c |  6 ++++++
 drivers/thunderbolt/tb.c     |  1 +
 drivers/thunderbolt/tb.h     |  5 +++++
 include/linux/thunderbolt.h  |  1 +
 5 files changed, 42 insertions(+)

Comments

Mika Westerberg Sept. 5, 2022, 7:07 a.m. UTC | #1
Hi,

On Mon, Sep 05, 2022 at 02:56:22PM +0800, Kai-Heng Feng wrote:
> AMD USB4 can not detect external PCIe devices like external NVMe when
> it's hotplugged, because card/link are not up:
> 
> pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101

I think the correct solution is then to block them from runtime
suspending entirely.

> Use `lspci` to resume pciehp bridges can find external devices.
> 
> A long delay before checking card/link presence doesn't help, either.
> The only way to make the hotplug work is to enable pciehp interrupt and
> check card presence after the TB switch is added.
> 
> Since the topology of USB4 and its PCIe bridges are siblings, hardcode
> the bridge ID so TBT driver can wake them up to check presence.

Let's not add PCI things into TBT driver unless absolutely necessary.

At least on Intel hardware the PCIe hotplug is signaled by SCI when the
root port is in D3, I wonder if AMD has something similar.
Kai-Heng Feng Sept. 5, 2022, 7:26 a.m. UTC | #2
Hi Mika,

On Mon, Sep 5, 2022 at 3:06 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> Hi,
>
> On Mon, Sep 05, 2022 at 02:56:22PM +0800, Kai-Heng Feng wrote:
> > AMD USB4 can not detect external PCIe devices like external NVMe when
> > it's hotplugged, because card/link are not up:
> >
> > pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101
>
> I think the correct solution is then to block them from runtime
> suspending entirely.

Do you mean disable runtime suspend completely? Or just block runtime
suspend for a period?

>
> > Use `lspci` to resume pciehp bridges can find external devices.
> >
> > A long delay before checking card/link presence doesn't help, either.
> > The only way to make the hotplug work is to enable pciehp interrupt and
> > check card presence after the TB switch is added.
> >
> > Since the topology of USB4 and its PCIe bridges are siblings, hardcode
> > the bridge ID so TBT driver can wake them up to check presence.
>
> Let's not add PCI things into TBT driver unless absolutely necessary.

OK. It's getting harder as different components are intertwined
together on new hardwares...

>
> At least on Intel hardware the PCIe hotplug is signaled by SCI when the
> root port is in D3, I wonder if AMD has something similar.

Yes those root ports are resumed to D0 when something is plugged. They
however fail to detect any externel PCIe devices.

Kai-Heng
Greg KH Sept. 5, 2022, 7:29 a.m. UTC | #3
On Mon, Sep 05, 2022 at 02:56:22PM +0800, Kai-Heng Feng wrote:
> AMD USB4 can not detect external PCIe devices like external NVMe when
> it's hotplugged, because card/link are not up:
> 
> pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101

That sounds like a hardware bug, how does this work in other operating
systems for this hardware?

> Use `lspci` to resume pciehp bridges can find external devices.

That's not good :(

> A long delay before checking card/link presence doesn't help, either.
> The only way to make the hotplug work is to enable pciehp interrupt and
> check card presence after the TB switch is added.
> 
> Since the topology of USB4 and its PCIe bridges are siblings, hardcode
> the bridge ID so TBT driver can wake them up to check presence.

As I mention below, this is not an acceptable solution.

AMD developers, any ideas on how to get this fixed in the TB controller
firware instead?

> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216448
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/thunderbolt/nhi.c    | 29 +++++++++++++++++++++++++++++
>  drivers/thunderbolt/switch.c |  6 ++++++
>  drivers/thunderbolt/tb.c     |  1 +
>  drivers/thunderbolt/tb.h     |  5 +++++
>  include/linux/thunderbolt.h  |  1 +
>  5 files changed, 42 insertions(+)
> 
> diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
> index cb8c9c4ae93a2..75f5ce5e22978 100644
> --- a/drivers/thunderbolt/nhi.c
> +++ b/drivers/thunderbolt/nhi.c
> @@ -1225,6 +1225,8 @@ static int nhi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	struct tb_nhi *nhi;
>  	struct tb *tb;
> +	struct pci_dev *p = NULL;
> +	struct tb_pci_bridge *pci_bridge, *n;
>  	int res;
>  
>  	if (!nhi_imr_valid(pdev)) {
> @@ -1306,6 +1308,19 @@ static int nhi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  		nhi_shutdown(nhi);
>  		return res;
>  	}
> +
> +	if (pdev->vendor == PCI_VENDOR_ID_AMD) {
> +		while ((p = pci_get_device(PCI_VENDOR_ID_AMD, 0x14cd, p))) {
> +			pci_bridge = kmalloc(sizeof(struct tb_pci_bridge), GFP_KERNEL);
> +			if (!pci_bridge)
> +				goto cleanup;
> +
> +			pci_bridge->bridge = p;
> +			INIT_LIST_HEAD(&pci_bridge->list);
> +			list_add(&pci_bridge->list, &tb->bridge_list);
> +		}
> +	}

You can't walk the device tree and create a "shadow" list of devices
like this and expect any lifetime rules to work properly with them at
all.

Please do not do this.

greg k-h
Mika Westerberg Sept. 5, 2022, 7:50 a.m. UTC | #4
On Mon, Sep 05, 2022 at 03:26:28PM +0800, Kai-Heng Feng wrote:
> Hi Mika,
> 
> On Mon, Sep 5, 2022 at 3:06 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > Hi,
> >
> > On Mon, Sep 05, 2022 at 02:56:22PM +0800, Kai-Heng Feng wrote:
> > > AMD USB4 can not detect external PCIe devices like external NVMe when
> > > it's hotplugged, because card/link are not up:
> > >
> > > pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101
> >
> > I think the correct solution is then to block them from runtime
> > suspending entirely.
> 
> Do you mean disable runtime suspend completely? Or just block runtime
> suspend for a period?

Completely. The port should enter D3 if it cannot wake up and Linux does
not even enable runtime PM for such ports unless they declare
"HotPlugSupportInD3" in their ACPI description:

https://docs.microsoft.com/en-us/windows-hardware/drivers/pci/dsd-for-pcie-root-ports#identifying-pcie-root-ports-supporting-hot-plug-in-d3

So that property should not be there if they cannot wake up.

> > > Use `lspci` to resume pciehp bridges can find external devices.
> > >
> > > A long delay before checking card/link presence doesn't help, either.
> > > The only way to make the hotplug work is to enable pciehp interrupt and
> > > check card presence after the TB switch is added.
> > >
> > > Since the topology of USB4 and its PCIe bridges are siblings, hardcode
> > > the bridge ID so TBT driver can wake them up to check presence.
> >
> > Let's not add PCI things into TBT driver unless absolutely necessary.
> 
> OK. It's getting harder as different components are intertwined
> together on new hardwares...
> 
> >
> > At least on Intel hardware the PCIe hotplug is signaled by SCI when the
> > root port is in D3, I wonder if AMD has something similar.
> 
> Yes those root ports are resumed to D0 when something is plugged. They
> however fail to detect any externel PCIe devices.

Hmm, so you see the actual hotplug but the tunneled PCIe link may not be
detected? Does the PCIe "Card Present" (or Data Link Layer Active)
status change at all or is it always 0?
Mika Westerberg Sept. 5, 2022, 1:18 p.m. UTC | #5
Hi,

On Mon, Sep 05, 2022 at 10:50:33AM +0300, Mika Westerberg wrote:
> > Yes those root ports are resumed to D0 when something is plugged. They
> > however fail to detect any externel PCIe devices.
> 
> Hmm, so you see the actual hotplug but the tunneled PCIe link may not be
> detected? Does the PCIe "Card Present" (or Data Link Layer Active)
> status change at all or is it always 0?

I wonder if we are simply missing the required delays here? Looking at
the lspci dump in the bugzilla you refer the root port 03.1 supports
active link reporting:

		LnkCap:	Port #247, Speed 2.5GT/s, Width x1, ASPM L1, Exit Latency L1 <4us
			ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+

So when the link goes back to L0 (D3 -> D0 transition) the kernel should
issue the 100+ ms reset delay in pci_bridge_wait_for_secondary_bus():

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pci.c#n5045

can you check if that is happening? It should show up in the dmesg when
CONFIG_PCI_DEBUG=y but I don't see it in yours.
Kai-Heng Feng Sept. 5, 2022, 3:21 p.m. UTC | #6
On Mon, Sep 5, 2022 at 3:50 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> On Mon, Sep 05, 2022 at 03:26:28PM +0800, Kai-Heng Feng wrote:
> > Hi Mika,
> >
> > On Mon, Sep 5, 2022 at 3:06 PM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > >
> > > Hi,
> > >
> > > On Mon, Sep 05, 2022 at 02:56:22PM +0800, Kai-Heng Feng wrote:
> > > > AMD USB4 can not detect external PCIe devices like external NVMe when
> > > > it's hotplugged, because card/link are not up:
> > > >
> > > > pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101
> > >
> > > I think the correct solution is then to block them from runtime
> > > suspending entirely.
> >
> > Do you mean disable runtime suspend completely? Or just block runtime
> > suspend for a period?
>
> Completely. The port should enter D3 if it cannot wake up and Linux does
> not even enable runtime PM for such ports unless they declare
> "HotPlugSupportInD3" in their ACPI description:
>
> https://docs.microsoft.com/en-us/windows-hardware/drivers/pci/dsd-for-pcie-root-ports#identifying-pcie-root-ports-supporting-hot-plug-in-d3
>
> So that property should not be there if they cannot wake up.

The port can wake up but card/link isn't up yet.

>
> > > > Use `lspci` to resume pciehp bridges can find external devices.
> > > >
> > > > A long delay before checking card/link presence doesn't help, either.
> > > > The only way to make the hotplug work is to enable pciehp interrupt and
> > > > check card presence after the TB switch is added.
> > > >
> > > > Since the topology of USB4 and its PCIe bridges are siblings, hardcode
> > > > the bridge ID so TBT driver can wake them up to check presence.
> > >
> > > Let's not add PCI things into TBT driver unless absolutely necessary.
> >
> > OK. It's getting harder as different components are intertwined
> > together on new hardwares...
> >
> > >
> > > At least on Intel hardware the PCIe hotplug is signaled by SCI when the
> > > root port is in D3, I wonder if AMD has something similar.
> >
> > Yes those root ports are resumed to D0 when something is plugged. They
> > however fail to detect any externel PCIe devices.
>
> Hmm, so you see the actual hotplug but the tunneled PCIe link may not be
> detected? Does the PCIe "Card Present" (or Data Link Layer Active)
> status change at all or is it always 0?

It changes only after tb_switch_add() is called.

Kai-Heng
Kai-Heng Feng Sept. 5, 2022, 3:24 p.m. UTC | #7
On Mon, Sep 5, 2022 at 9:18 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> Hi,
>
> On Mon, Sep 05, 2022 at 10:50:33AM +0300, Mika Westerberg wrote:
> > > Yes those root ports are resumed to D0 when something is plugged. They
> > > however fail to detect any externel PCIe devices.
> >
> > Hmm, so you see the actual hotplug but the tunneled PCIe link may not be
> > detected? Does the PCIe "Card Present" (or Data Link Layer Active)
> > status change at all or is it always 0?
>
> I wonder if we are simply missing the required delays here? Looking at
> the lspci dump in the bugzilla you refer the root port 03.1 supports
> active link reporting:
>
>                 LnkCap: Port #247, Speed 2.5GT/s, Width x1, ASPM L1, Exit Latency L1 <4us
>                         ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+
>
> So when the link goes back to L0 (D3 -> D0 transition) the kernel should
> issue the 100+ ms reset delay in pci_bridge_wait_for_secondary_bus():

This was actually the first thing I tried but it doesn't work. Even a
5 seconds delay doesn't work either.

>
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pci.c#n5045
>
> can you check if that is happening? It should show up in the dmesg when
> CONFIG_PCI_DEBUG=y but I don't see it in yours.

That's because there isn't any child device yet, so the function bails early:
if (!dev->subordinate || list_empty(&dev->subordinate->devices))

Kai-Heng
Mika Westerberg Sept. 5, 2022, 3:34 p.m. UTC | #8
On Mon, Sep 05, 2022 at 11:21:36PM +0800, Kai-Heng Feng wrote:
> > Hmm, so you see the actual hotplug but the tunneled PCIe link may not be
> > detected? Does the PCIe "Card Present" (or Data Link Layer Active)
> > status change at all or is it always 0?
> 
> It changes only after tb_switch_add() is called.

I doubt tb_switch_add() does anything but instead it is the established
PCIe tunnel that then shows up as it toggles the Card Present bit or so.
But that should also trigger PME if the root port is in D3 so you should
see this wake if everything works accordingly (unless I'm missing
something).

So if you do this:

1. Boot the system up, nothing connected
2. Plug in the TBT/USB4 device but do not authorize the PCIe tunnel
3. Wait for the TBT/USB4 domain to enter sleep (runtime suspend)
4. Authorize the PCIe tunnel

  # echo 1 > .../authorized

The established PCIe tunnel should trigger PME and the root port then
should be able to detect the PCIe link. Can you add full dmesg with
"thunderbolt.dyndbg=+p" in the command line to the bug?
Mika Westerberg Sept. 5, 2022, 3:36 p.m. UTC | #9
On Mon, Sep 05, 2022 at 11:24:26PM +0800, Kai-Heng Feng wrote:
> > can you check if that is happening? It should show up in the dmesg when
> > CONFIG_PCI_DEBUG=y but I don't see it in yours.
> 
> That's because there isn't any child device yet, so the function bails early:
> if (!dev->subordinate || list_empty(&dev->subordinate->devices))

Ah, of course I forgot that this is hotplug case so there are no need
for the delays :/
Kai-Heng Feng Sept. 6, 2022, 12:57 p.m. UTC | #10
On Mon, Sep 5, 2022 at 11:34 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> On Mon, Sep 05, 2022 at 11:21:36PM +0800, Kai-Heng Feng wrote:
> > > Hmm, so you see the actual hotplug but the tunneled PCIe link may not be
> > > detected? Does the PCIe "Card Present" (or Data Link Layer Active)
> > > status change at all or is it always 0?
> >
> > It changes only after tb_switch_add() is called.
>
> I doubt tb_switch_add() does anything but instead it is the established
> PCIe tunnel that then shows up as it toggles the Card Present bit or so.
> But that should also trigger PME if the root port is in D3 so you should
> see this wake if everything works accordingly (unless I'm missing
> something).

You are right. Sometimes it may still fail to detect hotplugged device
right after tb_switch_add().
At which point PCIe tunnels are established? Is it after tb_scan_port()?

I found that it's cleaner to wakeup hotplug ports via iterating device
link consumers at the end of tb_scan_port().

According to your commit b2be2b05cf3b1c7b499d3b05decdcc524879fea7
("thunderbolt: Create device links from ACPI description"), it states
"The _DSD can be added to tunneled USB3 and PCIe ports, and is needed to
make sure the USB4 NHI is resumed before any of the tunneled ports so
the protocol tunnels get established properly before the actual port
itself is resumed. Othwerwise the USB/PCI core find the link may not be
established and starts tearing down the device stack."

So isn't waking them up a logical thing to do here?

>
> So if you do this:
>
> 1. Boot the system up, nothing connected
> 2. Plug in the TBT/USB4 device but do not authorize the PCIe tunnel
> 3. Wait for the TBT/USB4 domain to enter sleep (runtime suspend)
> 4. Authorize the PCIe tunnel
>
>   # echo 1 > .../authorized
>
> The established PCIe tunnel should trigger PME and the root port then
> should be able to detect the PCIe link. Can you add full dmesg with
> "thunderbolt.dyndbg=+p" in the command line to the bug?

dmesg attached. Unfortunately there's no PME.

Kai-Heng
Mika Westerberg Sept. 6, 2022, 1:37 p.m. UTC | #11
On Tue, Sep 06, 2022 at 08:57:20PM +0800, Kai-Heng Feng wrote:
> On Mon, Sep 5, 2022 at 11:34 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > On Mon, Sep 05, 2022 at 11:21:36PM +0800, Kai-Heng Feng wrote:
> > > > Hmm, so you see the actual hotplug but the tunneled PCIe link may not be
> > > > detected? Does the PCIe "Card Present" (or Data Link Layer Active)
> > > > status change at all or is it always 0?
> > >
> > > It changes only after tb_switch_add() is called.
> >
> > I doubt tb_switch_add() does anything but instead it is the established
> > PCIe tunnel that then shows up as it toggles the Card Present bit or so.
> > But that should also trigger PME if the root port is in D3 so you should
> > see this wake if everything works accordingly (unless I'm missing
> > something).
> 
> You are right. Sometimes it may still fail to detect hotplugged device
> right after tb_switch_add().
> At which point PCIe tunnels are established? Is it after tb_scan_port()?

They are established when userspace writes "1" to ../authorized of the
device (not automatically).

On Ubuntu that's boltd that handles this so you may need to disable it
before you do the experiment.

> I found that it's cleaner to wakeup hotplug ports via iterating device
> link consumers at the end of tb_scan_port().
> 
> According to your commit b2be2b05cf3b1c7b499d3b05decdcc524879fea7
> ("thunderbolt: Create device links from ACPI description"), it states
> "The _DSD can be added to tunneled USB3 and PCIe ports, and is needed to
> make sure the USB4 NHI is resumed before any of the tunneled ports so
> the protocol tunnels get established properly before the actual port
> itself is resumed. Othwerwise the USB/PCI core find the link may not be
> established and starts tearing down the device stack."
> 
> So isn't waking them up a logical thing to do here?

No they should wake up themselves.

> > So if you do this:
> >
> > 1. Boot the system up, nothing connected
> > 2. Plug in the TBT/USB4 device but do not authorize the PCIe tunnel
> > 3. Wait for the TBT/USB4 domain to enter sleep (runtime suspend)
> > 4. Authorize the PCIe tunnel
> >
> >   # echo 1 > .../authorized
> >
> > The established PCIe tunnel should trigger PME and the root port then
> > should be able to detect the PCIe link. Can you add full dmesg with
> > "thunderbolt.dyndbg=+p" in the command line to the bug?
> 
> dmesg attached. Unfortunately there's no PME.

Hmm, attached to where? Forgot to attach? ;-)
Kai-Heng Feng Sept. 6, 2022, 2:29 p.m. UTC | #12
On Tue, Sep 6, 2022 at 9:37 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> On Tue, Sep 06, 2022 at 08:57:20PM +0800, Kai-Heng Feng wrote:
> > On Mon, Sep 5, 2022 at 11:34 PM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > >
> > > On Mon, Sep 05, 2022 at 11:21:36PM +0800, Kai-Heng Feng wrote:
> > > > > Hmm, so you see the actual hotplug but the tunneled PCIe link may not be
> > > > > detected? Does the PCIe "Card Present" (or Data Link Layer Active)
> > > > > status change at all or is it always 0?
> > > >
> > > > It changes only after tb_switch_add() is called.
> > >
> > > I doubt tb_switch_add() does anything but instead it is the established
> > > PCIe tunnel that then shows up as it toggles the Card Present bit or so.
> > > But that should also trigger PME if the root port is in D3 so you should
> > > see this wake if everything works accordingly (unless I'm missing
> > > something).
> >
> > You are right. Sometimes it may still fail to detect hotplugged device
> > right after tb_switch_add().
> > At which point PCIe tunnels are established? Is it after tb_scan_port()?
>
> They are established when userspace writes "1" to ../authorized of the
> device (not automatically).
>
> On Ubuntu that's boltd that handles this so you may need to disable it
> before you do the experiment.

In the dmesg it was disabled and "authorized" was 0 originally.

>
> > I found that it's cleaner to wakeup hotplug ports via iterating device
> > link consumers at the end of tb_scan_port().
> >
> > According to your commit b2be2b05cf3b1c7b499d3b05decdcc524879fea7
> > ("thunderbolt: Create device links from ACPI description"), it states
> > "The _DSD can be added to tunneled USB3 and PCIe ports, and is needed to
> > make sure the USB4 NHI is resumed before any of the tunneled ports so
> > the protocol tunnels get established properly before the actual port
> > itself is resumed. Othwerwise the USB/PCI core find the link may not be
> > established and starts tearing down the device stack."
> >
> > So isn't waking them up a logical thing to do here?
>
> No they should wake up themselves.

OK.

>
> > > So if you do this:
> > >
> > > 1. Boot the system up, nothing connected
> > > 2. Plug in the TBT/USB4 device but do not authorize the PCIe tunnel
> > > 3. Wait for the TBT/USB4 domain to enter sleep (runtime suspend)
> > > 4. Authorize the PCIe tunnel
> > >
> > >   # echo 1 > .../authorized
> > >
> > > The established PCIe tunnel should trigger PME and the root port then
> > > should be able to detect the PCIe link. Can you add full dmesg with
> > > "thunderbolt.dyndbg=+p" in the command line to the bug?
> >
> > dmesg attached. Unfortunately there's no PME.
>
> Hmm, attached to where? Forgot to attach? ;-)

Oops, it's attached to the Bugzilla now.

Kai-Heng
Mika Westerberg Sept. 6, 2022, 2:59 p.m. UTC | #13
On Tue, Sep 06, 2022 at 10:29:03PM +0800, Kai-Heng Feng wrote:
> On Tue, Sep 6, 2022 at 9:37 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > On Tue, Sep 06, 2022 at 08:57:20PM +0800, Kai-Heng Feng wrote:
> > > On Mon, Sep 5, 2022 at 11:34 PM Mika Westerberg
> > > <mika.westerberg@linux.intel.com> wrote:
> > > >
> > > > On Mon, Sep 05, 2022 at 11:21:36PM +0800, Kai-Heng Feng wrote:
> > > > > > Hmm, so you see the actual hotplug but the tunneled PCIe link may not be
> > > > > > detected? Does the PCIe "Card Present" (or Data Link Layer Active)
> > > > > > status change at all or is it always 0?
> > > > >
> > > > > It changes only after tb_switch_add() is called.
> > > >
> > > > I doubt tb_switch_add() does anything but instead it is the established
> > > > PCIe tunnel that then shows up as it toggles the Card Present bit or so.
> > > > But that should also trigger PME if the root port is in D3 so you should
> > > > see this wake if everything works accordingly (unless I'm missing
> > > > something).
> > >
> > > You are right. Sometimes it may still fail to detect hotplugged device
> > > right after tb_switch_add().
> > > At which point PCIe tunnels are established? Is it after tb_scan_port()?
> >
> > They are established when userspace writes "1" to ../authorized of the
> > device (not automatically).
> >
> > On Ubuntu that's boltd that handles this so you may need to disable it
> > before you do the experiment.
> 
> In the dmesg it was disabled and "authorized" was 0 originally.

OK.

> >
> > > I found that it's cleaner to wakeup hotplug ports via iterating device
> > > link consumers at the end of tb_scan_port().
> > >
> > > According to your commit b2be2b05cf3b1c7b499d3b05decdcc524879fea7
> > > ("thunderbolt: Create device links from ACPI description"), it states
> > > "The _DSD can be added to tunneled USB3 and PCIe ports, and is needed to
> > > make sure the USB4 NHI is resumed before any of the tunneled ports so
> > > the protocol tunnels get established properly before the actual port
> > > itself is resumed. Othwerwise the USB/PCI core find the link may not be
> > > established and starts tearing down the device stack."
> > >
> > > So isn't waking them up a logical thing to do here?
> >
> > No they should wake up themselves.
> 
> OK.
> 
> >
> > > > So if you do this:
> > > >
> > > > 1. Boot the system up, nothing connected
> > > > 2. Plug in the TBT/USB4 device but do not authorize the PCIe tunnel
> > > > 3. Wait for the TBT/USB4 domain to enter sleep (runtime suspend)
> > > > 4. Authorize the PCIe tunnel
> > > >
> > > >   # echo 1 > .../authorized
> > > >
> > > > The established PCIe tunnel should trigger PME and the root port then
> > > > should be able to detect the PCIe link. Can you add full dmesg with
> > > > "thunderbolt.dyndbg=+p" in the command line to the bug?
> > >
> > > dmesg attached. Unfortunately there's no PME.
> >
> > Hmm, attached to where? Forgot to attach? ;-)
> 
> Oops, it's attached to the Bugzilla now.

Okay I see in the dmesg now that the root port enters D3hot:

[   40.363249] pcieport 0000:00:04.1: saving config space at offset 0x34 (reading 0x50)
[   40.363257] pcieport 0000:00:04.1: saving config space at offset 0x38 (reading 0x0)
[   40.363265] pcieport 0000:00:04.1: saving config space at offset 0x3c (reading 0x200ff)
[   40.365052] pcieport 0000:00:04.1: PME# enabled

and then the PCIe tunnel is authorized:

[   58.936258] thunderbolt 0000:64:00.6: 0:5 <-> 2:6 (PCI): activating
[   58.936270] thunderbolt 0000:64:00.6: activating PCIe Down path from 0:5 to 2:6
[   58.936434] thunderbolt 0000:64:00.6: 2:1: Writing hop 1
[   58.936443] thunderbolt 0000:64:00.6: 2:1:  In HopID: 8 => Out port: 6 Out HopID: 8
[   58.936449] thunderbolt 0000:64:00.6: 2:1:   Weight: 1 Priority: 3 Credits: 32 Drop: 0
[   58.936453] thunderbolt 0000:64:00.6: 2:1:    Counter enabled: 0 Counter index: 2047
[   58.936457] thunderbolt 0000:64:00.6: 2:1:   Flow Control (In/Eg): 1/0 Shared Buffer (In/Eg): 0/0
[   58.936460] thunderbolt 0000:64:00.6: 2:1:   Unknown1: 0x0 Unknown2: 0x0 Unknown3: 0x0
[   58.936786] thunderbolt 0000:64:00.6: 0:5: Writing hop 0
[   58.936795] thunderbolt 0000:64:00.6: 0:5:  In HopID: 8 => Out port: 2 Out HopID: 8
[   58.936800] thunderbolt 0000:64:00.6: 0:5:   Weight: 1 Priority: 3 Credits: 7 Drop: 0
[   58.936803] thunderbolt 0000:64:00.6: 0:5:    Counter enabled: 0 Counter index: 2047
[   58.936807] thunderbolt 0000:64:00.6: 0:5:   Flow Control (In/Eg): 1/1 Shared Buffer (In/Eg): 0/0
[   58.936810] thunderbolt 0000:64:00.6: 0:5:   Unknown1: 0x0 Unknown2: 0x0 Unknown3: 0x0
[   58.936971] thunderbolt 0000:64:00.6: path activation complete
[   58.936979] thunderbolt 0000:64:00.6: activating PCIe Up path from 2:6 to 0:5
[   58.937129] thunderbolt 0000:64:00.6: 0:2: Writing hop 1
[   58.937138] thunderbolt 0000:64:00.6: 0:2:  In HopID: 8 => Out port: 5 Out HopID: 8
[   58.937143] thunderbolt 0000:64:00.6: 0:2:   Weight: 1 Priority: 3 Credits: 32 Drop: 0
[   58.937147] thunderbolt 0000:64:00.6: 0:2:    Counter enabled: 0 Counter index: 2047
[   58.937150] thunderbolt 0000:64:00.6: 0:2:   Flow Control (In/Eg): 1/0 Shared Buffer (In/Eg): 0/0
[   58.937154] thunderbolt 0000:64:00.6: 0:2:   Unknown1: 0x0 Unknown2: 0x0 Unknown3: 0x0
[   58.937453] thunderbolt 0000:64:00.6: 2:6: Writing hop 0
[   58.937462] thunderbolt 0000:64:00.6: 2:6:  In HopID: 8 => Out port: 1 Out HopID: 8
[   58.937467] thunderbolt 0000:64:00.6: 2:6:   Weight: 1 Priority: 3 Credits: 7 Drop: 0
[   58.937471] thunderbolt 0000:64:00.6: 2:6:    Counter enabled: 0 Counter index: 2047
[   58.937474] thunderbolt 0000:64:00.6: 2:6:   Flow Control (In/Eg): 1/1 Shared Buffer (In/Eg): 0/0
[   58.937478] thunderbolt 0000:64:00.6: 2:6:   Unknown1: 0x0 Unknown2: 0x0 Unknown3: 0x0
[   58.937633] thunderbolt 0000:64:00.6: path activation complete

but this is not detected by the root port so after a while the domain is
suspended as well:

[   84.327759] thunderbolt 0000:64:00.6: 0: suspending switch

What should happen right after the PCIe tunnel activation is that the root port
should get PME / hotplug and it should then find the devices but this for some
reason is not happening.

This reminded me that in Intel hardware there is an ACPI power resource that is
shared between related devices. IIRC there is _PR0() method under the root
port, xHCI and the TBT NHI that returns the same power resource. Now, when the
power resource is turned on for any of the devices the kernel wakes up the rest
too to make sure they get properly re-initialized if they went into
D0unitialized or something like that. The commit that added this is
4533771c1e53 ("ACPI / PM: Introduce concept of a _PR0 dependent device").

I'm not entirely sure if that applies here because none of the devices seem to
enter D3cold, though. Is this working on Windows side and do you know if it
goes into D3cold too?
Mika Westerberg Sept. 6, 2022, 3:22 p.m. UTC | #14
On Tue, Sep 06, 2022 at 05:59:49PM +0300, Mika Westerberg wrote:
> This reminded me that in Intel hardware there is an ACPI power resource that is
> shared between related devices. IIRC there is _PR0() method under the root
> port, xHCI and the TBT NHI that returns the same power resource. Now, when the
> power resource is turned on for any of the devices the kernel wakes up the rest
> too to make sure they get properly re-initialized if they went into
> D0unitialized or something like that. The commit that added this is
> 4533771c1e53 ("ACPI / PM: Introduce concept of a _PR0 dependent device").

Probably has nothing to do with this actually.
Mario Limonciello Sept. 7, 2022, 4:30 p.m. UTC | #15
[Public]

Hi,

> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Monday, September 5, 2022 02:30
> To: Kai-Heng Feng <kai.heng.feng@canonical.com>
> Cc: mika.westerberg@linux.intel.com; andreas.noever@gmail.com;
> michael.jamet@intel.com; YehezkelShB@gmail.com; Mehta, Sanju
> <Sanju.Mehta@amd.com>; Limonciello, Mario
> <Mario.Limonciello@amd.com>; linux-usb@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH] thunderbolt: Resume PCIe bridges after switch is found
> on AMD USB4 controller
> 
> On Mon, Sep 05, 2022 at 02:56:22PM +0800, Kai-Heng Feng wrote:
> > AMD USB4 can not detect external PCIe devices like external NVMe when
> > it's hotplugged, because card/link are not up:
> >
> > pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101
> 
> That sounds like a hardware bug, how does this work in other operating
> systems for this hardware?

We happen to have this HP system in our lab.  My colleague Anson (now on CC) flashed
the same BIOS to it (01.02.01) using dediprog and loaded a 6.0-rc3 mainline kernel built
from the Canonical mainline kernel PPA.

He then tried to hotplug a TBT3 SSD a number of times but couldn't hit this issue.
I attached his log to the kernel Bugzilla.

> 
> > Use `lspci` to resume pciehp bridges can find external devices.
> 
> That's not good :(
> 
> > A long delay before checking card/link presence doesn't help, either.
> > The only way to make the hotplug work is to enable pciehp interrupt and
> > check card presence after the TB switch is added.
> >
> > Since the topology of USB4 and its PCIe bridges are siblings, hardcode
> > the bridge ID so TBT driver can wake them up to check presence.
> 
> As I mention below, this is not an acceptable solution.
> 
> AMD developers, any ideas on how to get this fixed in the TB controller
> firware instead?

Anson also double checked on the AMD reference hardware that the HP system is built
against and couldn't reproduce it there either.

KH, I've got a few questions/comments to try to better explain why we're here.

1) How did you flash the 01.02.01 firmware?  In Anson's check, he used dediprog.
Is it possible there was some stateful stuff used by HP's BIOS still on the SPI from the
upgrade that didn't get set/cleared properly from an earlier pre-release BIOS?

2) Did you change any BIOS settings?  Particularly anything to do with Pre-OS CM?

3) If you explicitly reset to HP's "default BIOS settings" does it resolve?

4) Can you double check ADP_CS_5 bit 31?  I attached is a patch to kernel Bugzilla to
add dyndbg output for it.  If it was for some reason set by Pre-OS CM in your BIOS/settings
combination, we might need to undo it by the Linux CM.

5) Are you changing any of the default runtime PM policies for any of the USB4 routers or
root ports used for tunneling using software like TLP?

> 
> >
> > Bugzilla:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugz
> illa.kernel.org%2Fshow_bug.cgi%3Fid%3D216448&amp;data=05%7C01%7Cm
> ario.limonciello%40amd.com%7C1e27b1d6f69e42796c7b08da8f107121%7C3d
> d8961fe4884e608e11a82d994e183d%7C0%7C0%7C637979598042186185%7CU
> nknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI
> 6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=0lhcaKfUyoK
> 0FXT9uDZ8a%2Fpxs9tHd8aoQcyPFdB%2F0eY%3D&amp;reserved=0
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > ---
> >  drivers/thunderbolt/nhi.c    | 29 +++++++++++++++++++++++++++++
> >  drivers/thunderbolt/switch.c |  6 ++++++
> >  drivers/thunderbolt/tb.c     |  1 +
> >  drivers/thunderbolt/tb.h     |  5 +++++
> >  include/linux/thunderbolt.h  |  1 +
> >  5 files changed, 42 insertions(+)
> >
> > diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
> > index cb8c9c4ae93a2..75f5ce5e22978 100644
> > --- a/drivers/thunderbolt/nhi.c
> > +++ b/drivers/thunderbolt/nhi.c
> > @@ -1225,6 +1225,8 @@ static int nhi_probe(struct pci_dev *pdev, const
> struct pci_device_id *id)
> >  {
> >  	struct tb_nhi *nhi;
> >  	struct tb *tb;
> > +	struct pci_dev *p = NULL;
> > +	struct tb_pci_bridge *pci_bridge, *n;
> >  	int res;
> >
> >  	if (!nhi_imr_valid(pdev)) {
> > @@ -1306,6 +1308,19 @@ static int nhi_probe(struct pci_dev *pdev, const
> struct pci_device_id *id)
> >  		nhi_shutdown(nhi);
> >  		return res;
> >  	}
> > +
> > +	if (pdev->vendor == PCI_VENDOR_ID_AMD) {
> > +		while ((p = pci_get_device(PCI_VENDOR_ID_AMD, 0x14cd,
> p))) {
> > +			pci_bridge = kmalloc(sizeof(struct tb_pci_bridge),
> GFP_KERNEL);
> > +			if (!pci_bridge)
> > +				goto cleanup;
> > +
> > +			pci_bridge->bridge = p;
> > +			INIT_LIST_HEAD(&pci_bridge->list);
> > +			list_add(&pci_bridge->list, &tb->bridge_list);
> > +		}
> > +	}
> 
> You can't walk the device tree and create a "shadow" list of devices
> like this and expect any lifetime rules to work properly with them at
> all.
> 
> Please do not do this.
> 
> greg k-h
Kai-Heng Feng Sept. 8, 2022, 2:02 p.m. UTC | #16
"

On Thu, Sep 8, 2022 at 12:30 AM Limonciello, Mario
<Mario.Limonciello@amd.com> wrote:
>
> [Public]
>
> Hi,
>
> > -----Original Message-----
> > From: Greg KH <gregkh@linuxfoundation.org>
> > Sent: Monday, September 5, 2022 02:30
> > To: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > Cc: mika.westerberg@linux.intel.com; andreas.noever@gmail.com;
> > michael.jamet@intel.com; YehezkelShB@gmail.com; Mehta, Sanju
> > <Sanju.Mehta@amd.com>; Limonciello, Mario
> > <Mario.Limonciello@amd.com>; linux-usb@vger.kernel.org; linux-
> > kernel@vger.kernel.org
> > Subject: Re: [PATCH] thunderbolt: Resume PCIe bridges after switch is found
> > on AMD USB4 controller
> >
> > On Mon, Sep 05, 2022 at 02:56:22PM +0800, Kai-Heng Feng wrote:
> > > AMD USB4 can not detect external PCIe devices like external NVMe when
> > > it's hotplugged, because card/link are not up:
> > >
> > > pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101
> >
> > That sounds like a hardware bug, how does this work in other operating
> > systems for this hardware?
>
> We happen to have this HP system in our lab.  My colleague Anson (now on CC) flashed
> the same BIOS to it (01.02.01) using dediprog and loaded a 6.0-rc3 mainline kernel built
> from the Canonical mainline kernel PPA.
>
> He then tried to hotplug a TBT3 SSD a number of times but couldn't hit this issue.
> I attached his log to the kernel Bugzilla.

Nice to hear. Hopefully this can be fixed at firmware/hardware side.

>
> >
> > > Use `lspci` to resume pciehp bridges can find external devices.
> >
> > That's not good :(
> >
> > > A long delay before checking card/link presence doesn't help, either.
> > > The only way to make the hotplug work is to enable pciehp interrupt and
> > > check card presence after the TB switch is added.
> > >
> > > Since the topology of USB4 and its PCIe bridges are siblings, hardcode
> > > the bridge ID so TBT driver can wake them up to check presence.
> >
> > As I mention below, this is not an acceptable solution.
> >
> > AMD developers, any ideas on how to get this fixed in the TB controller
> > firware instead?
>
> Anson also double checked on the AMD reference hardware that the HP system is built
> against and couldn't reproduce it there either.
>
> KH, I've got a few questions/comments to try to better explain why we're here.
>
> 1) How did you flash the 01.02.01 firmware?  In Anson's check, he used dediprog.
> Is it possible there was some stateful stuff used by HP's BIOS still on the SPI from the
> upgrade that didn't get set/cleared properly from an earlier pre-release BIOS?

We used UEFI capsule to update the firmware, via fwupd.

>
> 2) Did you change any BIOS settings?  Particularly anything to do with Pre-OS CM?

No, nothing in BIOS was changed.

>
> 3) If you explicitly reset to HP's "default BIOS settings" does it resolve?

Doesn't help. I put the device to ACPI G3 and it doesn't help, either.

>
> 4) Can you double check ADP_CS_5 bit 31?  I attached is a patch to kernel Bugzilla to
> add dyndbg output for it.  If it was for some reason set by Pre-OS CM in your BIOS/settings
> combination, we might need to undo it by the Linux CM.

All ports say "Hotplug disabled: 0".

dmesg attached to the bugzilla.

>
> 5) Are you changing any of the default runtime PM policies for any of the USB4 routers or
> root ports used for tunneling using software like TLP?

No. And they should be suspended by default.

Kai-Heng

>
> >
> > >
> > > Bugzilla:
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugz
> > illa.kernel.org%2Fshow_bug.cgi%3Fid%3D216448&amp;data=05%7C01%7Cm
> > ario.limonciello%40amd.com%7C1e27b1d6f69e42796c7b08da8f107121%7C3d
> > d8961fe4884e608e11a82d994e183d%7C0%7C0%7C637979598042186185%7CU
> > nknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI
> > 6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=0lhcaKfUyoK
> > 0FXT9uDZ8a%2Fpxs9tHd8aoQcyPFdB%2F0eY%3D&amp;reserved=0
> > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > ---
> > >  drivers/thunderbolt/nhi.c    | 29 +++++++++++++++++++++++++++++
> > >  drivers/thunderbolt/switch.c |  6 ++++++
> > >  drivers/thunderbolt/tb.c     |  1 +
> > >  drivers/thunderbolt/tb.h     |  5 +++++
> > >  include/linux/thunderbolt.h  |  1 +
> > >  5 files changed, 42 insertions(+)
> > >
> > > diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
> > > index cb8c9c4ae93a2..75f5ce5e22978 100644
> > > --- a/drivers/thunderbolt/nhi.c
> > > +++ b/drivers/thunderbolt/nhi.c
> > > @@ -1225,6 +1225,8 @@ static int nhi_probe(struct pci_dev *pdev, const
> > struct pci_device_id *id)
> > >  {
> > >     struct tb_nhi *nhi;
> > >     struct tb *tb;
> > > +   struct pci_dev *p = NULL;
> > > +   struct tb_pci_bridge *pci_bridge, *n;
> > >     int res;
> > >
> > >     if (!nhi_imr_valid(pdev)) {
> > > @@ -1306,6 +1308,19 @@ static int nhi_probe(struct pci_dev *pdev, const
> > struct pci_device_id *id)
> > >             nhi_shutdown(nhi);
> > >             return res;
> > >     }
> > > +
> > > +   if (pdev->vendor == PCI_VENDOR_ID_AMD) {
> > > +           while ((p = pci_get_device(PCI_VENDOR_ID_AMD, 0x14cd,
> > p))) {
> > > +                   pci_bridge = kmalloc(sizeof(struct tb_pci_bridge),
> > GFP_KERNEL);
> > > +                   if (!pci_bridge)
> > > +                           goto cleanup;
> > > +
> > > +                   pci_bridge->bridge = p;
> > > +                   INIT_LIST_HEAD(&pci_bridge->list);
> > > +                   list_add(&pci_bridge->list, &tb->bridge_list);
> > > +           }
> > > +   }
> >
> > You can't walk the device tree and create a "shadow" list of devices
> > like this and expect any lifetime rules to work properly with them at
> > all.
> >
> > Please do not do this.
> >
> > greg k-h
Mario Limonciello Sept. 8, 2022, 3:22 p.m. UTC | #17
On 9/8/2022 09:02, Kai-Heng Feng wrote:
> "
> 
> On Thu, Sep 8, 2022 at 12:30 AM Limonciello, Mario
> <Mario.Limonciello@amd.com> wrote:
>>
>> [Public]
>>
>> Hi,
>>
>>> -----Original Message-----
>>> From: Greg KH <gregkh@linuxfoundation.org>
>>> Sent: Monday, September 5, 2022 02:30
>>> To: Kai-Heng Feng <kai.heng.feng@canonical.com>
>>> Cc: mika.westerberg@linux.intel.com; andreas.noever@gmail.com;
>>> michael.jamet@intel.com; YehezkelShB@gmail.com; Mehta, Sanju
>>> <Sanju.Mehta@amd.com>; Limonciello, Mario
>>> <Mario.Limonciello@amd.com>; linux-usb@vger.kernel.org; linux-
>>> kernel@vger.kernel.org
>>> Subject: Re: [PATCH] thunderbolt: Resume PCIe bridges after switch is found
>>> on AMD USB4 controller
>>>
>>> On Mon, Sep 05, 2022 at 02:56:22PM +0800, Kai-Heng Feng wrote:
>>>> AMD USB4 can not detect external PCIe devices like external NVMe when
>>>> it's hotplugged, because card/link are not up:
>>>>
>>>> pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101
>>>
>>> That sounds like a hardware bug, how does this work in other operating
>>> systems for this hardware?
>>
>> We happen to have this HP system in our lab.  My colleague Anson (now on CC) flashed
>> the same BIOS to it (01.02.01) using dediprog and loaded a 6.0-rc3 mainline kernel built
>> from the Canonical mainline kernel PPA.
>>
>> He then tried to hotplug a TBT3 SSD a number of times but couldn't hit this issue.
>> I attached his log to the kernel Bugzilla.
> 
> Nice to hear. Hopefully this can be fixed at firmware/hardware side.

I guess you and Anson might want to sync up offline and compare whether 
you have the same hardware stepping.

> 
>>
>>>
>>>> Use `lspci` to resume pciehp bridges can find external devices.
>>>
>>> That's not good :(
>>>
>>>> A long delay before checking card/link presence doesn't help, either.
>>>> The only way to make the hotplug work is to enable pciehp interrupt and
>>>> check card presence after the TB switch is added.
>>>>
>>>> Since the topology of USB4 and its PCIe bridges are siblings, hardcode
>>>> the bridge ID so TBT driver can wake them up to check presence.
>>>
>>> As I mention below, this is not an acceptable solution.
>>>
>>> AMD developers, any ideas on how to get this fixed in the TB controller
>>> firware instead?
>>
>> Anson also double checked on the AMD reference hardware that the HP system is built
>> against and couldn't reproduce it there either.
>>
>> KH, I've got a few questions/comments to try to better explain why we're here.
>>
>> 1) How did you flash the 01.02.01 firmware?  In Anson's check, he used dediprog.
>> Is it possible there was some stateful stuff used by HP's BIOS still on the SPI from the
>> upgrade that didn't get set/cleared properly from an earlier pre-release BIOS?
> 
> We used UEFI capsule to update the firmware, via fwupd.

So that's a difference from how Anson did it.  Could you perhaps dump 
the BIOS SPI image?  Anson can flash the exact same dump via dediprog 
and see if he can repro.

It would let us confirm if it was caused by your upgrade path.

> 
>>
>> 2) Did you change any BIOS settings?  Particularly anything to do with Pre-OS CM?
> 
> No, nothing in BIOS was changed. >
>>
>> 3) If you explicitly reset to HP's "default BIOS settings" does it resolve?
> 
> Doesn't help. I put the device to ACPI G3 and it doesn't help, either.

OK.

> 
>>
>> 4) Can you double check ADP_CS_5 bit 31?  I attached is a patch to kernel Bugzilla to
>> add dyndbg output for it.  If it was for some reason set by Pre-OS CM in your BIOS/settings
>> combination, we might need to undo it by the Linux CM.
> 
> All ports say "Hotplug disabled: 0".
> 
> dmesg attached to the bugzilla.

OK, that at least rules out DHP from Pre-OS CM.

> 
>>
>> 5) Are you changing any of the default runtime PM policies for any of the USB4 routers or
>> root ports used for tunneling using software like TLP?
> 
> No. And they should be suspended by default.
> 

Thinking about this being possibly a firmware upgrade path problem, can 
you please check:

# grep SMC /sys/kernel/debug/dri/0/amdgpu_firmware_info

Anson's system was 0x04453200 (program 4, version 69.50.0).

> Kai-Heng
> 
>>
>>>
>>>>
>>>> Bugzilla:
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugz
>>> illa.kernel.org%2Fshow_bug.cgi%3Fid%3D216448&amp;data=05%7C01%7Cm
>>> ario.limonciello%40amd.com%7C1e27b1d6f69e42796c7b08da8f107121%7C3d
>>> d8961fe4884e608e11a82d994e183d%7C0%7C0%7C637979598042186185%7CU
>>> nknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI
>>> 6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=0lhcaKfUyoK
>>> 0FXT9uDZ8a%2Fpxs9tHd8aoQcyPFdB%2F0eY%3D&amp;reserved=0
>>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>>>> ---
>>>>   drivers/thunderbolt/nhi.c    | 29 +++++++++++++++++++++++++++++
>>>>   drivers/thunderbolt/switch.c |  6 ++++++
>>>>   drivers/thunderbolt/tb.c     |  1 +
>>>>   drivers/thunderbolt/tb.h     |  5 +++++
>>>>   include/linux/thunderbolt.h  |  1 +
>>>>   5 files changed, 42 insertions(+)
>>>>
>>>> diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
>>>> index cb8c9c4ae93a2..75f5ce5e22978 100644
>>>> --- a/drivers/thunderbolt/nhi.c
>>>> +++ b/drivers/thunderbolt/nhi.c
>>>> @@ -1225,6 +1225,8 @@ static int nhi_probe(struct pci_dev *pdev, const
>>> struct pci_device_id *id)
>>>>   {
>>>>      struct tb_nhi *nhi;
>>>>      struct tb *tb;
>>>> +   struct pci_dev *p = NULL;
>>>> +   struct tb_pci_bridge *pci_bridge, *n;
>>>>      int res;
>>>>
>>>>      if (!nhi_imr_valid(pdev)) {
>>>> @@ -1306,6 +1308,19 @@ static int nhi_probe(struct pci_dev *pdev, const
>>> struct pci_device_id *id)
>>>>              nhi_shutdown(nhi);
>>>>              return res;
>>>>      }
>>>> +
>>>> +   if (pdev->vendor == PCI_VENDOR_ID_AMD) {
>>>> +           while ((p = pci_get_device(PCI_VENDOR_ID_AMD, 0x14cd,
>>> p))) {
>>>> +                   pci_bridge = kmalloc(sizeof(struct tb_pci_bridge),
>>> GFP_KERNEL);
>>>> +                   if (!pci_bridge)
>>>> +                           goto cleanup;
>>>> +
>>>> +                   pci_bridge->bridge = p;
>>>> +                   INIT_LIST_HEAD(&pci_bridge->list);
>>>> +                   list_add(&pci_bridge->list, &tb->bridge_list);
>>>> +           }
>>>> +   }
>>>
>>> You can't walk the device tree and create a "shadow" list of devices
>>> like this and expect any lifetime rules to work properly with them at
>>> all.
>>>
>>> Please do not do this.
>>>
>>> greg k-h
Kai-Heng Feng Sept. 12, 2022, 7:35 a.m. UTC | #18
Hi,

On Thu, Sep 8, 2022 at 11:22 PM Limonciello, Mario
<mario.limonciello@amd.com> wrote:
[snipped]
> > Nice to hear. Hopefully this can be fixed at firmware/hardware side.
>
> I guess you and Anson might want to sync up offline and compare whether
> you have the same hardware stepping.

Sure.

[snipped]

> > We used UEFI capsule to update the firmware, via fwupd.
>
> So that's a difference from how Anson did it.  Could you perhaps dump
> the BIOS SPI image?  Anson can flash the exact same dump via dediprog
> and see if he can repro.
>
> It would let us confirm if it was caused by your upgrade path.

OK, will dissuss this with AMD/HP offline.

>
> >
> >>
> >> 2) Did you change any BIOS settings?  Particularly anything to do with Pre-OS CM?
> >
> > No, nothing in BIOS was changed. >
> >>
> >> 3) If you explicitly reset to HP's "default BIOS settings" does it resolve?
> >
> > Doesn't help. I put the device to ACPI G3 and it doesn't help, either.
>
> OK.
>
> >
> >>
> >> 4) Can you double check ADP_CS_5 bit 31?  I attached is a patch to kernel Bugzilla to
> >> add dyndbg output for it.  If it was for some reason set by Pre-OS CM in your BIOS/settings
> >> combination, we might need to undo it by the Linux CM.
> >
> > All ports say "Hotplug disabled: 0".
> >
> > dmesg attached to the bugzilla.
>
> OK, that at least rules out DHP from Pre-OS CM.
>
> >
> >>
> >> 5) Are you changing any of the default runtime PM policies for any of the USB4 routers or
> >> root ports used for tunneling using software like TLP?
> >
> > No. And they should be suspended by default.
> >
>
> Thinking about this being possibly a firmware upgrade path problem, can
> you please check:
>
> # grep SMC /sys/kernel/debug/dri/0/amdgpu_firmware_info
>
> Anson's system was 0x04453200 (program 4, version 69.50.0).

Exactly the same here.

Kai-Heng

>
> > Kai-Heng
> >
> >>
> >>>
> >>>>
> >>>> Bugzilla:
> >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugz
> >>> illa.kernel.org%2Fshow_bug.cgi%3Fid%3D216448&amp;data=05%7C01%7Cm
> >>> ario.limonciello%40amd.com%7C1e27b1d6f69e42796c7b08da8f107121%7C3d
> >>> d8961fe4884e608e11a82d994e183d%7C0%7C0%7C637979598042186185%7CU
> >>> nknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI
> >>> 6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=0lhcaKfUyoK
> >>> 0FXT9uDZ8a%2Fpxs9tHd8aoQcyPFdB%2F0eY%3D&amp;reserved=0
> >>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> >>>> ---
> >>>>   drivers/thunderbolt/nhi.c    | 29 +++++++++++++++++++++++++++++
> >>>>   drivers/thunderbolt/switch.c |  6 ++++++
> >>>>   drivers/thunderbolt/tb.c     |  1 +
> >>>>   drivers/thunderbolt/tb.h     |  5 +++++
> >>>>   include/linux/thunderbolt.h  |  1 +
> >>>>   5 files changed, 42 insertions(+)
> >>>>
> >>>> diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
> >>>> index cb8c9c4ae93a2..75f5ce5e22978 100644
> >>>> --- a/drivers/thunderbolt/nhi.c
> >>>> +++ b/drivers/thunderbolt/nhi.c
> >>>> @@ -1225,6 +1225,8 @@ static int nhi_probe(struct pci_dev *pdev, const
> >>> struct pci_device_id *id)
> >>>>   {
> >>>>      struct tb_nhi *nhi;
> >>>>      struct tb *tb;
> >>>> +   struct pci_dev *p = NULL;
> >>>> +   struct tb_pci_bridge *pci_bridge, *n;
> >>>>      int res;
> >>>>
> >>>>      if (!nhi_imr_valid(pdev)) {
> >>>> @@ -1306,6 +1308,19 @@ static int nhi_probe(struct pci_dev *pdev, const
> >>> struct pci_device_id *id)
> >>>>              nhi_shutdown(nhi);
> >>>>              return res;
> >>>>      }
> >>>> +
> >>>> +   if (pdev->vendor == PCI_VENDOR_ID_AMD) {
> >>>> +           while ((p = pci_get_device(PCI_VENDOR_ID_AMD, 0x14cd,
> >>> p))) {
> >>>> +                   pci_bridge = kmalloc(sizeof(struct tb_pci_bridge),
> >>> GFP_KERNEL);
> >>>> +                   if (!pci_bridge)
> >>>> +                           goto cleanup;
> >>>> +
> >>>> +                   pci_bridge->bridge = p;
> >>>> +                   INIT_LIST_HEAD(&pci_bridge->list);
> >>>> +                   list_add(&pci_bridge->list, &tb->bridge_list);
> >>>> +           }
> >>>> +   }
> >>>
> >>> You can't walk the device tree and create a "shadow" list of devices
> >>> like this and expect any lifetime rules to work properly with them at
> >>> all.
> >>>
> >>> Please do not do this.
> >>>
> >>> greg k-h
>
diff mbox series

Patch

diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
index cb8c9c4ae93a2..75f5ce5e22978 100644
--- a/drivers/thunderbolt/nhi.c
+++ b/drivers/thunderbolt/nhi.c
@@ -1225,6 +1225,8 @@  static int nhi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct tb_nhi *nhi;
 	struct tb *tb;
+	struct pci_dev *p = NULL;
+	struct tb_pci_bridge *pci_bridge, *n;
 	int res;
 
 	if (!nhi_imr_valid(pdev)) {
@@ -1306,6 +1308,19 @@  static int nhi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		nhi_shutdown(nhi);
 		return res;
 	}
+
+	if (pdev->vendor == PCI_VENDOR_ID_AMD) {
+		while ((p = pci_get_device(PCI_VENDOR_ID_AMD, 0x14cd, p))) {
+			pci_bridge = kmalloc(sizeof(struct tb_pci_bridge), GFP_KERNEL);
+			if (!pci_bridge)
+				goto cleanup;
+
+			pci_bridge->bridge = p;
+			INIT_LIST_HEAD(&pci_bridge->list);
+			list_add(&pci_bridge->list, &tb->bridge_list);
+		}
+	}
+
 	pci_set_drvdata(pdev, tb);
 
 	device_wakeup_enable(&pdev->dev);
@@ -1316,12 +1331,26 @@  static int nhi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	pm_runtime_put_autosuspend(&pdev->dev);
 
 	return 0;
+
+cleanup:
+	list_for_each_entry_safe(pci_bridge, n, &tb->bridge_list, list) {
+		list_del(&pci_bridge->list);
+		kfree(pci_bridge);
+	}
+
+	return -ENOMEM;
 }
 
 static void nhi_remove(struct pci_dev *pdev)
 {
 	struct tb *tb = pci_get_drvdata(pdev);
 	struct tb_nhi *nhi = tb->nhi;
+	struct tb_pci_bridge *pci_bridge, *n;
+
+	list_for_each_entry_safe(pci_bridge, n, &tb->bridge_list, list) {
+		list_del(&pci_bridge->list);
+		kfree(pci_bridge);
+	}
 
 	pm_runtime_get_sync(&pdev->dev);
 	pm_runtime_dont_use_autosuspend(&pdev->dev);
diff --git a/drivers/thunderbolt/switch.c b/drivers/thunderbolt/switch.c
index c63c1f4ff9dc7..aae898cc907d3 100644
--- a/drivers/thunderbolt/switch.c
+++ b/drivers/thunderbolt/switch.c
@@ -2836,6 +2836,8 @@  static void tb_switch_credits_init(struct tb_switch *sw)
 int tb_switch_add(struct tb_switch *sw)
 {
 	int i, ret;
+	struct tb *tb = sw->tb;
+	struct tb_pci_bridge *pci_bridge;
 
 	/*
 	 * Initialize DMA control port now before we read DROM. Recent
@@ -2933,6 +2935,10 @@  int tb_switch_add(struct tb_switch *sw)
 	}
 
 	tb_switch_debugfs_init(sw);
+
+	list_for_each_entry(pci_bridge, &tb->bridge_list, list)
+		pm_request_resume(&pci_bridge->bridge->dev);
+
 	return 0;
 
 err_ports:
diff --git a/drivers/thunderbolt/tb.c b/drivers/thunderbolt/tb.c
index 9853f6c7e81d7..07e97b77ac630 100644
--- a/drivers/thunderbolt/tb.c
+++ b/drivers/thunderbolt/tb.c
@@ -1771,6 +1771,7 @@  struct tb *tb_probe(struct tb_nhi *nhi)
 		tb->security_level = TB_SECURITY_NOPCIE;
 
 	tb->cm_ops = &tb_cm_ops;
+	INIT_LIST_HEAD(&tb->bridge_list);
 
 	tcm = tb_priv(tb);
 	INIT_LIST_HEAD(&tcm->tunnel_list);
diff --git a/drivers/thunderbolt/tb.h b/drivers/thunderbolt/tb.h
index 5db76de40cc1c..8efbd1afacad0 100644
--- a/drivers/thunderbolt/tb.h
+++ b/drivers/thunderbolt/tb.h
@@ -489,6 +489,11 @@  struct tb_cm_ops {
 						   u32 *status);
 };
 
+struct tb_pci_bridge {
+	struct pci_dev *bridge;
+	struct list_head list;
+};
+
 static inline void *tb_priv(struct tb *tb)
 {
 	return (void *)tb->privdata;
diff --git a/include/linux/thunderbolt.h b/include/linux/thunderbolt.h
index 9f442d73f3df8..728bb36070e9d 100644
--- a/include/linux/thunderbolt.h
+++ b/include/linux/thunderbolt.h
@@ -83,6 +83,7 @@  struct tb {
 	int index;
 	enum tb_security_level security_level;
 	size_t nboot_acl;
+	struct list_head bridge_list;
 	unsigned long privdata[];
 };