diff mbox series

[v2,2/3] PCI: brcmstb: CLKREQ# accomodations of downstream device

Message ID 20230411165919.23955-3-jim2101024@gmail.com (mailing list archive)
State New, archived
Headers show
Series PCI: brcmstb: CLKREQ# accomodations of downstream device | expand

Commit Message

Jim Quinlan April 11, 2023, 4:59 p.m. UTC
The Broadcom STB/CM PCIe HW core, which is also used in RPi SOCs, must be
deliberately set by the probe() into one of three mutually exclusive modes:

  (a) No CLKREQ# expected or required, refclk is always available.
  (b) CLKREQ# is expected to be driven by downstream device when needed.
  (c) Bidirectional CLKREQ# for L1SS capable devices.

Previously, only (b) was supported by the driver, as almost all STB/CM
boards operate in this mode.  But now there is interest in activating L1SS
power savings from STB/CM customers, and also interest in accomodating mode
(a) for designs such as the RPi CM4 with IO board.

The HW+driver is able to tell us when mode (a) mode is needed.  But there
is no easy way to tell if L1SS mode should be configured.  In certain
situations, getting this wrong may cause a panic during boot time.  So we
rely on the DT prop "brcm,enable-l1ss" to tell us when mode (c) is desired.
Using this mode only makes sense when the downstream device is L1SS-capable
and the OS has been configured to activate L1SS
(e.g. policy==powersupersave).

This property has already been in use by Raspian Linux, but this
immplementation adds more details and discerns between (a) and (b)
automatically.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217276
Signed-off-by: Jim Quinlan <jim2101024@gmail.com>
---
 drivers/pci/controller/pcie-brcmstb.c | 69 +++++++++++++++++++++++----
 1 file changed, 59 insertions(+), 10 deletions(-)

Comments

Cyril Brulebois April 13, 2023, 2:39 p.m. UTC | #1
Hi Jim,

Jim Quinlan <jim2101024@gmail.com> (2023-04-11):
> […]
> This property has already been in use by Raspian Linux, but this
> immplementation adds more details and discerns between (a) and (b)
  ^^^^^^^^^^^^^^^
  implementation

> automatically.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217276
> Signed-off-by: Jim Quinlan <jim2101024@gmail.com>

Sorry, it seems like my initial tests with v1 (applied on top of
v6.3-rc5-137-gf2afccfefe7b) weren't thorough enough, and I'm seeing the
same problems with v2 (applied on top of v6.3-rc6-46-gde4664485abb):
 - same setup as in https://bugzilla.kernel.org/show_bug.cgi?id=217276
 - the kernel panic is indeed gone;
 - a USB keyboard connected on that SupaHub PCIe-to-multiple-USB adapter
   isn't seen by the kernel;
 - a USB memory stick connected on the same adapter isn't seen by the
   kernel either;
 - of course both USB devices are confirmed to work fine if they're
   plugged directly on the CM4's USB ports.

Logs with v2:

    root@cm4:~# dmesg|grep -i pci
    [    0.610997] PCI: CLS 0 bytes, default 64
    [    1.664886] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
    [    1.672083] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
    [    1.679125] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
    [    1.688279] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x0603ffffff -> 0x00f8000000
    [    1.696463] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0400000000
    [    1.705282] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
    [    1.711629] pci_bus 0000:00: root bus resource [bus 00-ff]
    [    1.717172] pci_bus 0000:00: root bus resource [mem 0x600000000-0x603ffffff] (bus address [0xf8000000-0xfbffffff])
    [    1.727653] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
    [    1.733768] pci 0000:00:00.0: PME# supported from D0 D3hot
    [    1.740235] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
    [    1.855826] brcm-pcie fd500000.pcie: CLKREQ# ignored; no ASPM
    [    1.863666] brcm-pcie fd500000.pcie: link up, 5.0 GT/s PCIe x1 (SSC)
    [    1.870115] pci 0000:01:00.0: [1912:0014] type 00 class 0x0c0330
    [    1.876205] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00001fff 64bit]
    [    1.883177] pci 0000:01:00.0: PME# supported from D0 D3hot
    [    1.888881] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
    [    1.895581] pci 0000:00:00.0: BAR 14: assigned [mem 0x600000000-0x6000fffff]
    [    1.902707] pci 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x600001fff 64bit]
    [    1.910279] pci 0000:00:00.0: PCI bridge to [bus 01]
    [    1.915293] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
    [    1.922412] pcieport 0000:00:00.0: enabling device (0000 -> 0002)
    [    1.928633] pcieport 0000:00:00.0: PME: Signaling with IRQ 23
    [    1.934609] pcieport 0000:00:00.0: AER: enabled with IRQ 23
    [    1.940340] pci 0000:01:00.0: enabling device (0000 -> 0002)
    [    6.946090] pci 0000:01:00.0: xHCI HW not ready after 5 sec (HC bug?) status = 0x1801
    [    6.954026] pci 0000:01:00.0: quirk_usb_early_handoff+0x0/0x968 took 4896180 usecs

Please let me know what I can do to help.


Cheers,
Jim Quinlan April 13, 2023, 2:57 p.m. UTC | #2
Hello Cyril,
Can you provide (a) the full boot log prior to applying the patch
series and (b) full boot log after applying the series, using an
IDENTICAL  setup.
If it fails on both then it has little to do with my patch series.

In my last series your testing somehow conflated the effect of an
unrelated MMC interrupt issue so please be precise.

Regards,
Jim  Quinlan
Broadcom STB

On Thu, Apr 13, 2023 at 10:39 AM Cyril Brulebois <kibi@debian.org> wrote:
>
> Hi Jim,
>
> Jim Quinlan <jim2101024@gmail.com> (2023-04-11):
> > […]
> > This property has already been in use by Raspian Linux, but this
> > immplementation adds more details and discerns between (a) and (b)
>   ^^^^^^^^^^^^^^^
>   implementation
>
> > automatically.
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=217276
> > Signed-off-by: Jim Quinlan <jim2101024@gmail.com>
>
> Sorry, it seems like my initial tests with v1 (applied on top of
> v6.3-rc5-137-gf2afccfefe7b) weren't thorough enough, and I'm seeing the
> same problems with v2 (applied on top of v6.3-rc6-46-gde4664485abb):
>  - same setup as in https://bugzilla.kernel.org/show_bug.cgi?id=217276
>  - the kernel panic is indeed gone;
>  - a USB keyboard connected on that SupaHub PCIe-to-multiple-USB adapter
>    isn't seen by the kernel;
>  - a USB memory stick connected on the same adapter isn't seen by the
>    kernel either;
>  - of course both USB devices are confirmed to work fine if they're
>    plugged directly on the CM4's USB ports.
>
> Logs with v2:
>
>     root@cm4:~# dmesg|grep -i pci
>     [    0.610997] PCI: CLS 0 bytes, default 64
>     [    1.664886] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
>     [    1.672083] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
>     [    1.679125] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
>     [    1.688279] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x0603ffffff -> 0x00f8000000
>     [    1.696463] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0400000000
>     [    1.705282] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
>     [    1.711629] pci_bus 0000:00: root bus resource [bus 00-ff]
>     [    1.717172] pci_bus 0000:00: root bus resource [mem 0x600000000-0x603ffffff] (bus address [0xf8000000-0xfbffffff])
>     [    1.727653] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
>     [    1.733768] pci 0000:00:00.0: PME# supported from D0 D3hot
>     [    1.740235] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>     [    1.855826] brcm-pcie fd500000.pcie: CLKREQ# ignored; no ASPM
>     [    1.863666] brcm-pcie fd500000.pcie: link up, 5.0 GT/s PCIe x1 (SSC)
>     [    1.870115] pci 0000:01:00.0: [1912:0014] type 00 class 0x0c0330
>     [    1.876205] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00001fff 64bit]
>     [    1.883177] pci 0000:01:00.0: PME# supported from D0 D3hot
>     [    1.888881] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
>     [    1.895581] pci 0000:00:00.0: BAR 14: assigned [mem 0x600000000-0x6000fffff]
>     [    1.902707] pci 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x600001fff 64bit]
>     [    1.910279] pci 0000:00:00.0: PCI bridge to [bus 01]
>     [    1.915293] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
>     [    1.922412] pcieport 0000:00:00.0: enabling device (0000 -> 0002)
>     [    1.928633] pcieport 0000:00:00.0: PME: Signaling with IRQ 23
>     [    1.934609] pcieport 0000:00:00.0: AER: enabled with IRQ 23
>     [    1.940340] pci 0000:01:00.0: enabling device (0000 -> 0002)
>     [    6.946090] pci 0000:01:00.0: xHCI HW not ready after 5 sec (HC bug?) status = 0x1801
>     [    6.954026] pci 0000:01:00.0: quirk_usb_early_handoff+0x0/0x968 took 4896180 usecs
>
> Please let me know what I can do to help.
>
>
> Cheers,
> --
> Cyril Brulebois (kibi@debian.org)            <https://debamax.com/>
> D-I release manager -- Release team member -- Freelance Consultant
Florian Fainelli April 13, 2023, 2:58 p.m. UTC | #3
On 4/13/2023 7:39 AM, Cyril Brulebois wrote:
> Hi Jim,
> 
> Jim Quinlan <jim2101024@gmail.com> (2023-04-11):
>> […]
>> This property has already been in use by Raspian Linux, but this
>> immplementation adds more details and discerns between (a) and (b)
>    ^^^^^^^^^^^^^^^
>    implementation
> 
>> automatically.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217276
>> Signed-off-by: Jim Quinlan <jim2101024@gmail.com>
> 
> Sorry, it seems like my initial tests with v1 (applied on top of
> v6.3-rc5-137-gf2afccfefe7b) weren't thorough enough, and I'm seeing the
> same problems with v2 (applied on top of v6.3-rc6-46-gde4664485abb):
>   - same setup as in https://bugzilla.kernel.org/show_bug.cgi?id=217276
>   - the kernel panic is indeed gone;
>   - a USB keyboard connected on that SupaHub PCIe-to-multiple-USB adapter
>     isn't seen by the kernel;
>   - a USB memory stick connected on the same adapter isn't seen by the
>     kernel either;
>   - of course both USB devices are confirmed to work fine if they're
>     plugged directly on the CM4's USB ports.
> 
> Logs with v2:
> 
>      root@cm4:~# dmesg|grep -i pci
>      [    0.610997] PCI: CLS 0 bytes, default 64
>      [    1.664886] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
>      [    1.672083] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
>      [    1.679125] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
>      [    1.688279] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x0603ffffff -> 0x00f8000000
>      [    1.696463] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0400000000
>      [    1.705282] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
>      [    1.711629] pci_bus 0000:00: root bus resource [bus 00-ff]
>      [    1.717172] pci_bus 0000:00: root bus resource [mem 0x600000000-0x603ffffff] (bus address [0xf8000000-0xfbffffff])
>      [    1.727653] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
>      [    1.733768] pci 0000:00:00.0: PME# supported from D0 D3hot
>      [    1.740235] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>      [    1.855826] brcm-pcie fd500000.pcie: CLKREQ# ignored; no ASPM
>      [    1.863666] brcm-pcie fd500000.pcie: link up, 5.0 GT/s PCIe x1 (SSC)
>      [    1.870115] pci 0000:01:00.0: [1912:0014] type 00 class 0x0c0330
>      [    1.876205] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00001fff 64bit]
>      [    1.883177] pci 0000:01:00.0: PME# supported from D0 D3hot
>      [    1.888881] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
>      [    1.895581] pci 0000:00:00.0: BAR 14: assigned [mem 0x600000000-0x6000fffff]
>      [    1.902707] pci 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x600001fff 64bit]
>      [    1.910279] pci 0000:00:00.0: PCI bridge to [bus 01]
>      [    1.915293] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
>      [    1.922412] pcieport 0000:00:00.0: enabling device (0000 -> 0002)
>      [    1.928633] pcieport 0000:00:00.0: PME: Signaling with IRQ 23
>      [    1.934609] pcieport 0000:00:00.0: AER: enabled with IRQ 23
>      [    1.940340] pci 0000:01:00.0: enabling device (0000 -> 0002)
>      [    6.946090] pci 0000:01:00.0: xHCI HW not ready after 5 sec (HC bug?) status = 0x1801
>      [    6.954026] pci 0000:01:00.0: quirk_usb_early_handoff+0x0/0x968 took 4896180 usecs
> 
> Please let me know what I can do to help.

Could you please attach your .config so we can check a few things?
Jim Quinlan April 14, 2023, 12:14 p.m. UTC | #4
On Thu, Apr 13, 2023 at 4:06 PM Cyril Brulebois <kibi@debian.org> wrote:
>
> Hi Jim,
>
> Jim Quinlan <jim2101024@gmail.com> (2023-04-13):
> > Can you provide (a) the full boot log prior to applying the patch
> > series and (b) full boot log after applying the series, using an
> > IDENTICAL setup. If it fails on both then it has little to do with my
> > patch series.
>
> Just to be clear, the issue I reported was with:
>  - Raspberry Pi Compute Module 4 (Rev 1.1, 4G RAM, 32G storage)
>  - Raspberry Pi Compute Module 4 IO Board
>  - SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S
>
> This was my minimal reproducer for the kernel panic at boot-up, which
> goes away with either v1 or v2. When I realized I didn't actually check
> whether the SupaHub board was working correctly, I plugged 2 devices to
> obtain this setup:
>  - Raspberry Pi Compute Module 4 (Rev 1.1, 4G RAM, 32G storage)
>  - Raspberry Pi Compute Module 4 IO Board
>  - SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S
>  - Kingston DataTraveler G4 32GB on USB-A port #1 of the SupaHub board.
>  - Logitech K120 keyboard on USB-A port #2 of the SupaHub board.
>
> It turns out that this particular revision of the SupaHub board isn't
> supported by xhci_hcd directly (failing to probe with error -110) and
> one needs to enable CONFIG_USB_XHCI_PCI_RENESAS=m and also ship its
> accompanying firmware (/lib/firmware/renesas_usb_fw.mem). With this
> updated kernel config, I'm able to use the keyboard and to read data
> from the memory stick without problems (70 MB/s).
>
> > In my last series your testing somehow conflated the effect of an
> > unrelated MMC interrupt issue so please be precise.
>
> I wish things would be simpler and didn't involve combinatorics, let
> alone other bugs/regressions at times, but I'm really trying my best to
> navigate and report issues and test patches when I can spare some time…

Hi Cyril,

I want to encourage you and others doing testing and bug reporting:
everyone wins when a bug or issue is reported, fixed, and tested.
I'm just asking that when you have negative results, that you provide
information on the "before" and "after" test results of
the patch series, and run both on the same test environment.

Regards,
Jim Quinlan
Broadcom STB

>
>
> In my defence, the very similar board PCE6U1C-R02, VER 006 boots without
> a kernel panic, and works fine with xhci_hcd without that extra module
> and its firmware. It's based on the exact same chip (Renesas Technology
> Corp. uPD720201) though, that's why I didn't realize the need for an
> extra module until now for the PCE6U1C-R02, VER 006S one. (Florian,
> thanks for mentioning .config earlier…)
>
> To sum up, it seems both (sub)versions of that SupaHub PCE6U1C-R02 board
> are usable with this patch series: the kernel panic at boot-up is gone,
> and USB devices plugged into those boards are working as expected.
>
>
> But, speaking of combinatorics, while the patch series helps fix
> https://bugzilla.kernel.org/show_bug.cgi?id=217276 on that particular
> initial combination of CM4 and SupaHub PCE6U1C-R02, VER 006S, it also
> generates a regression if I use a different CM4…
>
> Setup:
>  - Raspberry Pi Compute Module 4 (Rev 1.0, 8G RAM, 32G storage)
>  - Raspberry Pi Compute Module 4 IO Board
>  - SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S
>
> You'll find attached (a) and (b) as requested above, for this setup, as
> well as the kernel configuration file:
>  - (a) = dmesg-unpatched.txt = boots fine (and USB devices work fine if
>    memory sticks or keyboards are plugged in).
>  - (b) = dmesg-patched.txt = kernel panic.
>  - cm4+pcie.config
>
> I'm getting similar results with the other SupaHub board:
>  - Raspberry Pi Compute Module 4 (Rev 1.0, 8G RAM, 32G storage)
>  - Raspberry Pi Compute Module 4 IO Board
>  - SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006
>
>
> This is probably best summarized this way:
>
>                               | unpatched    | patched
> ------------------------------+--------------+--------------
> CM4 Rev 1.1 4G/32G + VER 006  | OK           | OK
> CM4 Rev 1.1 4G/32G + VER 006S | Kernel panic | OK            ← Bugzilla entry
> CM4 Rev 1.0 8G/32G + VER 006  | OK           | Kernel panic
> CM4 Rev 1.0 8G/32G + VER 006S | OK           | Kernel panic
>
>
> (And I'm of course using the same settings regarding the serial console,
> to get traces when needed.)
>
>
> Cheers,
> --
> Cyril Brulebois (kibi@debian.org)            <https://debamax.com/>
> D-I release manager -- Release team member -- Freelance Consultant
Florian Fainelli April 14, 2023, 12:27 p.m. UTC | #5
On 4/14/2023 5:14 AM, Jim Quinlan wrote:
> On Thu, Apr 13, 2023 at 4:06 PM Cyril Brulebois <kibi@debian.org> wrote:
>>
>> Hi Jim,
>>
>> Jim Quinlan <jim2101024@gmail.com> (2023-04-13):
>>> Can you provide (a) the full boot log prior to applying the patch
>>> series and (b) full boot log after applying the series, using an
>>> IDENTICAL setup. If it fails on both then it has little to do with my
>>> patch series.
>>
>> Just to be clear, the issue I reported was with:
>>   - Raspberry Pi Compute Module 4 (Rev 1.1, 4G RAM, 32G storage)
>>   - Raspberry Pi Compute Module 4 IO Board
>>   - SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S
>>
>> This was my minimal reproducer for the kernel panic at boot-up, which
>> goes away with either v1 or v2. When I realized I didn't actually check
>> whether the SupaHub board was working correctly, I plugged 2 devices to
>> obtain this setup:
>>   - Raspberry Pi Compute Module 4 (Rev 1.1, 4G RAM, 32G storage)
>>   - Raspberry Pi Compute Module 4 IO Board
>>   - SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S
>>   - Kingston DataTraveler G4 32GB on USB-A port #1 of the SupaHub board.
>>   - Logitech K120 keyboard on USB-A port #2 of the SupaHub board.
>>
>> It turns out that this particular revision of the SupaHub board isn't
>> supported by xhci_hcd directly (failing to probe with error -110) and
>> one needs to enable CONFIG_USB_XHCI_PCI_RENESAS=m and also ship its
>> accompanying firmware (/lib/firmware/renesas_usb_fw.mem). With this
>> updated kernel config, I'm able to use the keyboard and to read data
>> from the memory stick without problems (70 MB/s).
>>
>>> In my last series your testing somehow conflated the effect of an
>>> unrelated MMC interrupt issue so please be precise.
>>
>> I wish things would be simpler and didn't involve combinatorics, let
>> alone other bugs/regressions at times, but I'm really trying my best to
>> navigate and report issues and test patches when I can spare some time…
> 
> Hi Cyril,
> 
> I want to encourage you and others doing testing and bug reporting:
> everyone wins when a bug or issue is reported, fixed, and tested.
> I'm just asking that when you have negative results, that you provide
> information on the "before" and "after" test results of
> the patch series, and run both on the same test environment.

Cyril, based upon the table and logs you provided whereby you have used 
the following:

- Raspberry Pi Compute Module 4 (Rev 1.0, 8G RAM, 32G storage)
- Raspberry Pi Compute Module 4 IO Board
- SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S

in the before/unpatched case we have a PCIe link down and in the 
after/patched we have a PCIe link up but a kernel panic. Neither are 
great nor resulting in a fully functional PCIe device.

Looking at:

https://www.amazon.co.uk/SupaHub-Express-BandWidth-Capable-Expanding/dp/B092ZQWG5B

it would appear that it can accept an external power supply, do you have 
one connected to that USB expansion card by any chance? Are you able to 
boot the kernel before/after if you disconnect any USB peripheral?

This looks like a broader electrical problem than the scope of this 
patch, though it would be neat if we could find a combination that 
works. At least with Jim's patch we have a PCIe link with 
uni-directional CLKREQ# so we could try a variety of things.

Does that SupaHub board plugged to the CM4 1.0 system work fine in the 
Raspberry Pi kernel tree?
Jim Quinlan April 14, 2023, 1:31 p.m. UTC | #6
On Fri, Apr 14, 2023 at 8:27 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>
>
> On 4/14/2023 5:14 AM, Jim Quinlan wrote:
> > On Thu, Apr 13, 2023 at 4:06 PM Cyril Brulebois <kibi@debian.org> wrote:
> >>
> >> Hi Jim,
> >>
> >> Jim Quinlan <jim2101024@gmail.com> (2023-04-13):
> >>> Can you provide (a) the full boot log prior to applying the patch
> >>> series and (b) full boot log after applying the series, using an
> >>> IDENTICAL setup. If it fails on both then it has little to do with my
> >>> patch series.
> >>
> >> Just to be clear, the issue I reported was with:
> >>   - Raspberry Pi Compute Module 4 (Rev 1.1, 4G RAM, 32G storage)
> >>   - Raspberry Pi Compute Module 4 IO Board
> >>   - SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S
> >>
> >> This was my minimal reproducer for the kernel panic at boot-up, which
> >> goes away with either v1 or v2. When I realized I didn't actually check
> >> whether the SupaHub board was working correctly, I plugged 2 devices to
> >> obtain this setup:
> >>   - Raspberry Pi Compute Module 4 (Rev 1.1, 4G RAM, 32G storage)
> >>   - Raspberry Pi Compute Module 4 IO Board
> >>   - SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S
> >>   - Kingston DataTraveler G4 32GB on USB-A port #1 of the SupaHub board.
> >>   - Logitech K120 keyboard on USB-A port #2 of the SupaHub board.
> >>
> >> It turns out that this particular revision of the SupaHub board isn't
> >> supported by xhci_hcd directly (failing to probe with error -110) and
> >> one needs to enable CONFIG_USB_XHCI_PCI_RENESAS=m and also ship its
> >> accompanying firmware (/lib/firmware/renesas_usb_fw.mem). With this
> >> updated kernel config, I'm able to use the keyboard and to read data
> >> from the memory stick without problems (70 MB/s).
> >>
> >>> In my last series your testing somehow conflated the effect of an
> >>> unrelated MMC interrupt issue so please be precise.
> >>
> >> I wish things would be simpler and didn't involve combinatorics, let
> >> alone other bugs/regressions at times, but I'm really trying my best to
> >> navigate and report issues and test patches when I can spare some time…
> >
> > Hi Cyril,
> >
> > I want to encourage you and others doing testing and bug reporting:
> > everyone wins when a bug or issue is reported, fixed, and tested.
> > I'm just asking that when you have negative results, that you provide
> > information on the "before" and "after" test results of
> > the patch series, and run both on the same test environment.
>
> Cyril, based upon the table and logs you provided whereby you have used
> the following:
>
> - Raspberry Pi Compute Module 4 (Rev 1.0, 8G RAM, 32G storage)
> - Raspberry Pi Compute Module 4 IO Board
> - SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S
>
> in the before/unpatched case we have a PCIe link down and in the
> after/patched we have a PCIe link up but a kernel panic. Neither are
> great nor resulting in a fully functional PCIe device.

The data do not make sense to me. My new code is executed AFTER pcie
link up/down determination.  Both test results should have the same
link status -- either "link up" or "link down".

If the system is wishy-washy, i.e. it has link-up in 5/10 boots, then
we need to repeat  the experiment to compare the "link up" cases only.
Or discount the test completely

If the system is not wishy-washy, then something has been changed
between the "before" and "after"  tests.


>
> Looking at:
>
> https://www.amazon.co.uk/SupaHub-Express-BandWidth-Capable-Expanding/dp/B092ZQWG5B
>
> it would appear that it can accept an external power supply, do you have
> one connected to that USB expansion card by any chance? Are you able to
> boot the kernel before/after if you disconnect any USB peripheral?
>
> This looks like a broader electrical problem than the scope of this
> patch, though it would be neat if we could find a combination that
> works. At least with Jim's patch we have a PCIe link with
> uni-directional CLKREQ# so we could try a variety of things.
>
> Does that SupaHub board plugged to the CM4 1.0 system work fine in the
> Raspberry Pi kernel tree?
> --
> Florian
Cyril Brulebois April 14, 2023, 4:19 p.m. UTC | #7
Florian Fainelli <f.fainelli@gmail.com> (2023-04-14):
> Cyril, based upon the table and logs you provided whereby you have used the
> following:
> 
> - Raspberry Pi Compute Module 4 (Rev 1.0, 8G RAM, 32G storage)
> - Raspberry Pi Compute Module 4 IO Board
> - SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S
> 
> in the before/unpatched case we have a PCIe link down and in the
> after/patched we have a PCIe link up but a kernel panic. Neither are great
> nor resulting in a fully functional PCIe device.

Based on Jim's feedback, I'm attaching a new dmesg for the aforementioned
setup, with an unpatched kernel (dmesg-unpatched-pcie-link-up.txt).

I fear that initially I might have not waited enough before power off and
power on when replugging the SupaHub adapter, and I've only seen the PCIe
link down a few times (now that I'm actively paying attention to that
part). I'm indeed seeing PCIe link up consistently (100%) when using the
following method:

    for i in $(seq 1 20); do
      # power on, let the system boot up and settle:
      sispmctl -t 4; sleep 2m
      ssh cm4 sh -c "dmesg > dmesg-$i.txt; poweroff"
      # let the system power down, and power off:
      sleep 30; sispmctl -t 4
      # wait before power cycling:
      sleep 10
    done

> Looking at:
> https://www.amazon.co.uk/SupaHub-Express-BandWidth-Capable-Expanding/dp/B092ZQWG5B
> 
> it would appear that it can accept an external power supply, do you have one
> connected to that USB expansion card by any chance?

With the patched kernel, I have tried several things (leaving the regular
12V adapter plugged in all the time):
 - No external power supply (that's what I've been using in all previous
   attempts).
 - Using a 5V PSU with 2 pins (ground and 5V) plugged on the Ext PSU
   4-pin header on the IO Board.
 - Using a 5V PSU with a SATA connector plugged directly onto the SupaHub
   adapter.

I'm getting the same trace in all cases.

> Are you able to boot the kernel before/after if you disconnect any USB
> peripheral?

The trace is obtained without any USB peripheral (on the extension card or
on the onboard USB slots). Besides the SupaHub adapter on the PCIe slot, I
only have Ethernet and HDMI plugged in (I'm getting traces via serial
console, and operating the system over SSH when it boots fine).

As soon as I unplug the SupaHub board itself, the system boots fine.

> Does that SupaHub board plugged to the CM4 1.0 system work fine in the
> Raspberry Pi kernel tree?

With the Raspberry Pi OS (64-bit) > Raspberry Pi OS Lite image
(2023-02-21-raspios-bullseye-arm64-lite.img.xz), the system at least
boots fine; I haven't tried adding devices to the mix just yet, trying
to stick with that particular setup.

A full dmesg is attached (dmesg-raspios.txt).


Cheers,
Bjorn Helgaas April 14, 2023, 8:27 p.m. UTC | #8
This subject line no verb.  Can you add a leading verb to suggest what
this patch does?

s/accomodations/accommodations/

On Tue, Apr 11, 2023 at 12:59:17PM -0400, Jim Quinlan wrote:
> The Broadcom STB/CM PCIe HW core, which is also used in RPi SOCs, must be
> deliberately set by the probe() into one of three mutually exclusive modes:
> 
>   (a) No CLKREQ# expected or required, refclk is always available.
>   (b) CLKREQ# is expected to be driven by downstream device when needed.
>   (c) Bidirectional CLKREQ# for L1SS capable devices.
> 
> Previously, only (b) was supported by the driver, as almost all STB/CM
> boards operate in this mode.  But now there is interest in activating L1SS
> power savings from STB/CM customers, and also interest in accomodating mode
> (a) for designs such as the RPi CM4 with IO board.

accommodating

> The HW+driver is able to tell us when mode (a) mode is needed.  But there
> is no easy way to tell if L1SS mode should be configured.  In certain
> situations, getting this wrong may cause a panic during boot time.  So we
> rely on the DT prop "brcm,enable-l1ss" to tell us when mode (c) is desired.
> Using this mode only makes sense when the downstream device is L1SS-capable
> and the OS has been configured to activate L1SS
> (e.g. policy==powersupersave).

I'm really concerned about the user experience here.  I assume users
do not want to edit the DT based on what device they plug in.  They
shouldn't need to (and probably won't) know whether the device
supports L1SS.

I hate kernel/module parameters, but I think even that would be better
then having to edit the DT.

There's obviously a period of time when L1SS is supported but not yet
enabled, so I'm *guessing* the "OS has been configured to activate
L1SS" is not actually a requirement, and choosing (c) really just
opens the possibility that L1SS can be used?

Would be nice to have a hint (maybe a line or two of the panic
message) to help users find the fix for a problem they're seeing.

Obviously the ideal would be if we could use (c) in all cases, so I
assume that's where a panic might happen.  What situation would that
be?  An endpoint that doesn't support L1SS?  One that supports L1SS
but it's not enabled?  Maybe if L1SS isn't configured correctly, e.g.,
LTR values programmed wrong?

Bjorn
Florian Fainelli April 14, 2023, 8:33 p.m. UTC | #9
On 4/14/23 13:27, Bjorn Helgaas wrote:
> This subject line no verb.  Can you add a leading verb to suggest what
> this patch does?
> 
> s/accomodations/accommodations/
> 
> On Tue, Apr 11, 2023 at 12:59:17PM -0400, Jim Quinlan wrote:
>> The Broadcom STB/CM PCIe HW core, which is also used in RPi SOCs, must be
>> deliberately set by the probe() into one of three mutually exclusive modes:
>>
>>    (a) No CLKREQ# expected or required, refclk is always available.
>>    (b) CLKREQ# is expected to be driven by downstream device when needed.
>>    (c) Bidirectional CLKREQ# for L1SS capable devices.
>>
>> Previously, only (b) was supported by the driver, as almost all STB/CM
>> boards operate in this mode.  But now there is interest in activating L1SS
>> power savings from STB/CM customers, and also interest in accomodating mode
>> (a) for designs such as the RPi CM4 with IO board.
> 
> accommodating
> 
>> The HW+driver is able to tell us when mode (a) mode is needed.  But there
>> is no easy way to tell if L1SS mode should be configured.  In certain
>> situations, getting this wrong may cause a panic during boot time.  So we
>> rely on the DT prop "brcm,enable-l1ss" to tell us when mode (c) is desired.
>> Using this mode only makes sense when the downstream device is L1SS-capable
>> and the OS has been configured to activate L1SS
>> (e.g. policy==powersupersave).
> 
> I'm really concerned about the user experience here.  I assume users
> do not want to edit the DT based on what device they plug in.  They
> shouldn't need to (and probably won't) know whether the device
> supports L1SS.
> 
> I hate kernel/module parameters, but I think even that would be better
> then having to edit the DT.

The problem I see with kernel/module parameters is that it is really 
hard to differentiate whether they should be applied across all 
instances of the device/drivers or just for one in particular.

The Raspberry Pi 4 is a single pcie-brcmstb instance, but we have other 
systems supported by that driver on Set-top-box and Cable Modem chips 
for instance where we may have different types of end-points being 
connected, some of which could be Multi-chip-module (MCM) where 
everything is known ahead of time, and sometimes cards that are plugged 
to full-sized PCIe or mini-PCIe connectors, where some amount of runtime 
discoverability is involved.

Without inventing some custom modular parameter syntax, it may not work 
that well. The Device Tree properties at least easily allow for 
per-instance configuration.

> 
> There's obviously a period of time when L1SS is supported but not yet
> enabled, so I'm *guessing* the "OS has been configured to activate
> L1SS" is not actually a requirement, and choosing (c) really just
> opens the possibility that L1SS can be used?
> 
> Would be nice to have a hint (maybe a line or two of the panic
> message) to help users find the fix for a problem they're seeing.
> 
> Obviously the ideal would be if we could use (c) in all cases, so I
> assume that's where a panic might happen.  What situation would that
> be?  An endpoint that doesn't support L1SS?  One that supports L1SS
> but it's not enabled?  Maybe if L1SS isn't configured correctly, e.g.,
> LTR values programmed wrong?
> 
> Bjorn
Jim Quinlan April 14, 2023, 11:14 p.m. UTC | #10
On Fri, Apr 14, 2023 at 4:27 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> This subject line no verb.  Can you add a leading verb to suggest what
> this patch does?
>
> s/accomodations/accommodations/
>
> On Tue, Apr 11, 2023 at 12:59:17PM -0400, Jim Quinlan wrote:
> > The Broadcom STB/CM PCIe HW core, which is also used in RPi SOCs, must be
> > deliberately set by the probe() into one of three mutually exclusive modes:
> >
> >   (a) No CLKREQ# expected or required, refclk is always available.
> >   (b) CLKREQ# is expected to be driven by downstream device when needed.
> >   (c) Bidirectional CLKREQ# for L1SS capable devices.
> >
> > Previously, only (b) was supported by the driver, as almost all STB/CM
> > boards operate in this mode.  But now there is interest in activating L1SS
> > power savings from STB/CM customers, and also interest in accomodating mode
> > (a) for designs such as the RPi CM4 with IO board.
>
> accommodating
>
> > The HW+driver is able to tell us when mode (a) mode is needed.  But there
> > is no easy way to tell if L1SS mode should be configured.  In certain
> > situations, getting this wrong may cause a panic during boot time.  So we
> > rely on the DT prop "brcm,enable-l1ss" to tell us when mode (c) is desired.
> > Using this mode only makes sense when the downstream device is L1SS-capable
> > and the OS has been configured to activate L1SS
> > (e.g. policy==powersupersave).
>
> I'm really concerned about the user experience here.  I assume users
> do not want to edit the DT based on what device they plug in.  They
> shouldn't need to (and probably won't) know whether the device
> supports L1SS.
>
> I hate kernel/module parameters, but I think even that would be better
> then having to edit the DT.
>
> There's obviously a period of time when L1SS is supported but not yet
> enabled, so I'm *guessing* the "OS has been configured to activate
> L1SS" is not actually a requirement, and choosing (c) really just
> opens the possibility that L1SS can be used?

Yes.  Before this patch series we had two panic scenarios:

(a) Endpoint devices with no CLKREQ# connection
(b) Endpoints that are L1SS-capable

Even without  the "brcm,enable-l1ss" property present, both (a) and
(b) should be eliminated.
The reason (b) is eliminated is because the RC driver now unadvertises
 RC L1SS by default; subsequently, Linux does
not turn it on.  So the default setting should be fine for all devices.

For those folks who have L1SS capable devices and desire L1SS power
savings, they can add
the brcm,enable-l1ss property.  But everyone should have functionality
 w/o doing anything.

As I am typing this I realize that my comments and dev_info()s  are not
aligned with what I am saying so I will change them in V3.  Sorry
about the confusion.

>
> Would be nice to have a hint (maybe a line or two of the panic
> message) to help users find the fix for a problem they're seeing.SS
>
> Obviously the ideal would be if we could use (c) in all cases, so I
> assume that's where a panic might happen.  What situation would that
> be?  An endpoint that doesn't support L1SS?  One that supports L1SS
> but it's not enabled?  Maybe if L1SS isn't configured correctly, e.g.,
> LTR values programmed wrong?

Let me test everything on Monday and get back to you before I make any
statements.


Regards,
Jim Quinilan
Broadcom STB

>
> Bjorn
Bjorn Helgaas April 17, 2023, 9:41 p.m. UTC | #11
On Fri, Apr 14, 2023 at 01:33:29PM -0700, Florian Fainelli wrote:
> On 4/14/23 13:27, Bjorn Helgaas wrote:
> > On Tue, Apr 11, 2023 at 12:59:17PM -0400, Jim Quinlan wrote:
> ...

> > > The HW+driver is able to tell us when mode (a) mode is needed.  But there
> > > is no easy way to tell if L1SS mode should be configured.  In certain
> > > situations, getting this wrong may cause a panic during boot time.  So we
> > > rely on the DT prop "brcm,enable-l1ss" to tell us when mode (c) is desired.
> > > Using this mode only makes sense when the downstream device is L1SS-capable
> > > and the OS has been configured to activate L1SS
> > > (e.g. policy==powersupersave).
> > 
> > I'm really concerned about the user experience here.  I assume users
> > do not want to edit the DT based on what device they plug in.  They
> > shouldn't need to (and probably won't) know whether the device
> > supports L1SS.
> > 
> > I hate kernel/module parameters, but I think even that would be better
> > then having to edit the DT.
> 
> The problem I see with kernel/module parameters is that it is really hard to
> differentiate whether they should be applied across all instances of the
> device/drivers or just for one in particular.
> 
> The Raspberry Pi 4 is a single pcie-brcmstb instance, but we have other
> systems supported by that driver on Set-top-box and Cable Modem chips for
> instance where we may have different types of end-points being connected,
> some of which could be Multi-chip-module (MCM) where everything is known
> ahead of time, and sometimes cards that are plugged to full-sized PCIe or
> mini-PCIe connectors, where some amount of runtime discoverability is
> involved.
> 
> Without inventing some custom modular parameter syntax, it may not work that
> well. The Device Tree properties at least easily allow for per-instance
> configuration.

We do have this already (from
Documentation/admin-guide/kernel-parameters.txt):

        pci=option[,option...]  [PCI] various PCI subsystem options.

                                Some options herein operate on a specific device
                                or a set of devices (<pci_dev>). These are
                                specified in one of the following formats:

                                [<domain>:]<bus>:<dev>.<func>[/<dev>.<func>]*
                                pci:<vendor>:<device>[:<subvendor>:<subdevice>]

                                Note: the first format specifies a PCI
                                bus/device/function address which may change
                                if new hardware is inserted, if motherboard
                                firmware changes, or due to changes caused
                                by other kernel parameters. If the
                                domain is left unspecified, it is
                                taken to be zero. Optionally, a path
                                to a device through multiple device/function
                                addresses can be specified after the base
                                address (this is more robust against
                                renumbering issues).  The second format
                                selects devices using IDs from the
                                configuration space which may match multiple
                                devices in the system.
Jim Quinlan April 19, 2023, 2:23 p.m. UTC | #12
On Fri, Apr 14, 2023 at 12:19 PM Cyril Brulebois <kibi@debian.org> wrote:
>
> Florian Fainelli <f.fainelli@gmail.com> (2023-04-14):
> > Cyril, based upon the table and logs you provided whereby you have used the
> > following:
> >
> > - Raspberry Pi Compute Module 4 (Rev 1.0, 8G RAM, 32G storage)
> > - Raspberry Pi Compute Module 4 IO Board
> > - SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S
> >
> > in the before/unpatched case we have a PCIe link down and in the
> > after/patched we have a PCIe link up but a kernel panic. Neither are great
> > nor resulting in a fully functional PCIe device.
>
> Based on Jim's feedback, I'm attaching a new dmesg for the aforementioned
> setup, with an unpatched kernel (dmesg-unpatched-pcie-link-up.txt).

Hello Cyril,

I'm still seeing things in the logs that do not make sense to me.

First, the "unpatched" version logs -- including the Raspian case --
all have the following lines:

    [    0.000000] Movable zone start for each node
    /* ... */
    [    0.000000] pcpu-alloc: s88232 r8192 d30552 u126976 alloc=31*4096
    [    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3

The above lines are also in my logs.  But they are missing from your
"after patch" logs -- what did you  change to have these lines
disappear on the patched logs?

Second, you say that you used a different "CM4" from the one used in
the Bugzilla  report -- what do you mean by that?   Are you referring
to the CM4 module proper, whose only change was going from 4GB to 8GB,
or the IO board being used?  My  testing is done with a standard RPi
CM4 and standard RPi IO board.  Before this patch series, the only way
this standard configuration can work for most hobbyist PCI cards (i.e.
floating CLKREQ# pin) is by using Raspian AND adding
"brcm,enable-l1ss" to the DT node.

I'm guessing that you are using the configuration that you described
to me in  a personal email: "[the] chip is embedded directly on the
modified PCB, as opposed to plugged into the PCIe slot on the official
CM4 IO Board".  If true, you are testing on a configuration that (a)
is unique to you and your group and (b) must be doing something with
the CLKREQ# signal so that your "before" case does not panic.  Is this
the case?

Finally, and this is a big one, is the fact that in both of the
"before" and "after" cases the value written to the internal Brcm
CLKREQ# register is the same.
This is evident by this line in the "after" patch:  "brcm-pcie
fd500000.pcie: uni-dir CLKREQ# for ASPM".  This mode is the only mode
supported by the "before"
case.  Now there are  some other things going on in the patch series
-- reading two registers and extending the timeout value of a
completion abort, but I'm having
a hard time imagining how they could cause a panic.

Regards,
Jim

PS  Can you please include in your logs the output of  "sudo lspci
-vvv" for those cases that boot to prompt?
>
> I fear that initially I might have not waited enough before power off and
> power on when replugging the SupaHub adapter, and I've only seen the PCIe
> link down a few times (now that I'm actively paying attention to that
> part). I'm indeed seeing PCIe link up consistently (100%) when using the
> following method:
>
>     for i in $(seq 1 20); do
>       # power on, let the system boot up and settle:
>       sispmctl -t 4; sleep 2m
>       ssh cm4 sh -c "dmesg > dmesg-$i.txt; poweroff"
>       # let the system power down, and power off:
>       sleep 30; sispmctl -t 4
>       # wait before power cycling:
>       sleep 10
>     done
>
> > Looking at:
> > https://www.amazon.co.uk/SupaHub-Express-BandWidth-Capable-Expanding/dp/B092ZQWG5B
> >
> > it would appear that it can accept an external power supply, do you have one
> > connected to that USB expansion card by any chance?
>
> With the patched kernel, I have tried several things (leaving the regular
> 12V adapter plugged in all the time):
>  - No external power supply (that's what I've been using in all previous
>    attempts).
>  - Using a 5V PSU with 2 pins (ground and 5V) plugged on the Ext PSU
>    4-pin header on the IO Board.
>  - Using a 5V PSU with a SATA connector plugged directly onto the SupaHub
>    adapter.
>
> I'm getting the same trace in all cases.
>
> > Are you able to boot the kernel before/after if you disconnect any USB
> > peripheral?
>
> The trace is obtained without any USB peripheral (on the extension card or
> on the onboard USB slots). Besides the SupaHub adapter on the PCIe slot, I
> only have Ethernet and HDMI plugged in (I'm getting traces via serial
> console, and operating the system over SSH when it boots fine).
>
> As soon as I unplug the SupaHub board itself, the system boots fine.
>
> > Does that SupaHub board plugged to the CM4 1.0 system work fine in the
> > Raspberry Pi kernel tree?
>
> With the Raspberry Pi OS (64-bit) > Raspberry Pi OS Lite image
> (2023-02-21-raspios-bullseye-arm64-lite.img.xz), the system at least
> boots fine; I haven't tried adding devices to the mix just yet, trying
> to stick with that particular setup.
>
> A full dmesg is attached (dmesg-raspios.txt).
>
>
> Cheers,
> --
> Cyril Brulebois (kibi@debian.org)            <https://debamax.com/>
> D-I release manager -- Release team member -- Freelance Consultant
Cyril Brulebois April 19, 2023, 3:57 p.m. UTC | #13
Hi Jim,

It might take a few days before getting back to you regarding the
various questions you asked. To be on the safe side, I'll probably run
a cold boot for each setup a number of times (e.g. 10), so that any
possible outlier can be spotted/rejected. And maybe share results off
list once we have a better understanding of what's going on. Does that
make sense to you?

I'll cover a particular topic right away though.

Jim Quinlan <jim2101024@gmail.com> (2023-04-19):
> Second, you say that you used a different "CM4" from the one used in
> the Bugzilla  report -- what do you mean by that?   Are you referring
> to the CM4 module proper, whose only change was going from 4GB to 8GB,
> or the IO board being used?  My  testing is done with a standard RPi
> CM4 and standard RPi IO board.  Before this patch series, the only way
> this standard configuration can work for most hobbyist PCI cards (i.e.
> floating CLKREQ# pin) is by using Raspian AND adding
> "brcm,enable-l1ss" to the DT node.

Regarding the IO Board, I'm using the official Compute Module 4 IO
Board: https://www.raspberrypi.com/products/compute-module-4-io-board/

I've been using the very same IO Board for all my testing, and what I'm
changing is the standard RPi CM4 plugged on it.

> I'm guessing that you are using the configuration that you described
> to me in  a personal email: "[the] chip is embedded directly on the
> modified PCB, as opposed to plugged into the PCIe slot on the official
> CM4 IO Board".  If true, you are testing on a configuration that (a)
> is unique to you and your group and (b) must be doing something with
> the CLKREQ# signal so that your "before" case does not panic.  Is this
> the case?

That's definitely not the case.

True, as mentioned in a personal mail, we've seen problems with a custom
PCB, derived from the CM4 IO Board design, but of course there could be
some faulty design at work there… So we've first researched whether the
same problem could be produced with consumer grade products, and once
we've verified that, I opened #217276 on Bugzilla.


Since Florian's testing seems overwhelmingly positive, and since I'm
seeing definitive improvement with at least one CM4, maybe it would make
sense not to block the patch series on the kernel panic I'm seeing with
the other CM4, and track that particular issue via a separate bug?


Cheers,
diff mbox series

Patch

diff --git a/drivers/pci/controller/pcie-brcmstb.c b/drivers/pci/controller/pcie-brcmstb.c
index edf283e2b5dd..56b96aa02221 100644
--- a/drivers/pci/controller/pcie-brcmstb.c
+++ b/drivers/pci/controller/pcie-brcmstb.c
@@ -48,10 +48,17 @@ 
 #define PCIE_RC_CFG_PRIV1_LINK_CAPABILITY			0x04dc
 #define  PCIE_RC_CFG_PRIV1_LINK_CAPABILITY_ASPM_SUPPORT_MASK	0xc00
 
+#define PCIE_RC_CFG_PRIV1_ROOT_CAP			0x4f8
+#define  PCIE_RC_CFG_PRIV1_ROOT_CAP_L1SS_MODE_MASK	0xf8
+
 #define PCIE_RC_DL_MDIO_ADDR				0x1100
 #define PCIE_RC_DL_MDIO_WR_DATA				0x1104
 #define PCIE_RC_DL_MDIO_RD_DATA				0x1108
 
+#define PCIE_0_RC_PL_PHY_DBG_CLKREQ2_0			0x1e30
+#define  CLKREQ2_0_CLKREQ_IN_CNT_MASK			0x3f000000
+#define  CLKREQ2_0_CLKREQ_IN_MASK			0x40000000
+
 #define PCIE_MISC_MISC_CTRL				0x4008
 #define  PCIE_MISC_MISC_CTRL_PCIE_RCB_64B_MODE_MASK	0x80
 #define  PCIE_MISC_MISC_CTRL_PCIE_RCB_MPS_MODE_MASK	0x400
@@ -121,9 +128,12 @@ 
 
 #define PCIE_MISC_HARD_PCIE_HARD_DEBUG					0x4204
 #define  PCIE_MISC_HARD_PCIE_HARD_DEBUG_CLKREQ_DEBUG_ENABLE_MASK	0x2
+#define  PCIE_MISC_HARD_PCIE_HARD_DEBUG_L1SS_ENABLE_MASK		0x200000
 #define  PCIE_MISC_HARD_PCIE_HARD_DEBUG_SERDES_IDDQ_MASK		0x08000000
 #define  PCIE_BMIPS_MISC_HARD_PCIE_HARD_DEBUG_SERDES_IDDQ_MASK		0x00800000
-
+#define  PCIE_CLKREQ_MASK \
+	  (PCIE_MISC_HARD_PCIE_HARD_DEBUG_CLKREQ_DEBUG_ENABLE_MASK | \
+	   PCIE_MISC_HARD_PCIE_HARD_DEBUG_L1SS_ENABLE_MASK)
 
 #define PCIE_INTR2_CPU_BASE		0x4300
 #define PCIE_MSI_INTR2_BASE		0x4500
@@ -1024,13 +1034,58 @@  static int brcm_pcie_setup(struct brcm_pcie *pcie)
 	return 0;
 }
 
+static void brcm_config_clkreq(struct brcm_pcie *pcie)
+{
+	bool l1ss = of_property_read_bool(pcie->np, "brcm,enable-l1ss");
+	void __iomem *base = pcie->base;
+	u32 clkreq_set, tmp = readl(base + PCIE_0_RC_PL_PHY_DBG_CLKREQ2_0);
+	bool clkreq_in_seen;
+
+	/*
+	 * We have "seen" CLKREQ# if it is asserted or has been in the past.
+	 * Note that the CLKREQ_IN_MASK is 1 if CLKREQ# is asserted.
+	 */
+	clkreq_in_seen = !!(tmp & CLKREQ2_0_CLKREQ_IN_MASK) ||
+		!!FIELD_GET(CLKREQ2_0_CLKREQ_IN_CNT_MASK, tmp);
+
+	/* Start with safest setting where we provide refclk regardless */
+	clkreq_set = readl(pcie->base + PCIE_MISC_HARD_PCIE_HARD_DEBUG) &
+		~PCIE_CLKREQ_MASK;
+
+	if (l1ss && IS_ENABLED(CONFIG_PCIEASPM)) {
+		/*
+		 * Note: This (l1ss) mode may not meet requirement for
+		 * downstream devices that require CLKREQ# assertion to
+		 * clock active within 400ns.
+		 */
+		clkreq_set |= PCIE_MISC_HARD_PCIE_HARD_DEBUG_L1SS_ENABLE_MASK;
+		dev_info(pcie->dev, "bi-dir CLKREQ# for l1ss-capable devs\n");
+		dev_info(pcie->dev, "ASPM policy should be set to \"powersupersave\"\n");
+	} else {
+		if (clkreq_in_seen && IS_ENABLED(CONFIG_PCIEASPM)) {
+			clkreq_set |= PCIE_MISC_HARD_PCIE_HARD_DEBUG_CLKREQ_DEBUG_ENABLE_MASK;
+			dev_info(pcie->dev, "uni-dir CLKREQ# for ASPM\n");
+		} else {
+			dev_info(pcie->dev, "CLKREQ# ignored; no ASPM\n");
+			/* Might as well unadvertise ASPM */
+			tmp = readl(base + PCIE_RC_CFG_PRIV1_LINK_CAPABILITY) &
+				~PCIE_RC_CFG_PRIV1_LINK_CAPABILITY_ASPM_SUPPORT_MASK;
+			writel(tmp, base + PCIE_RC_CFG_PRIV1_LINK_CAPABILITY);
+		}
+		/* Setting the field to 2 unadvertises L1SS support */
+		tmp = readl(base + PCIE_RC_CFG_PRIV1_ROOT_CAP);
+		u32p_replace_bits(&tmp, 2, PCIE_RC_CFG_PRIV1_ROOT_CAP_L1SS_MODE_MASK);
+		writel(tmp, base + PCIE_RC_CFG_PRIV1_ROOT_CAP);
+	}
+	writel(clkreq_set, pcie->base + PCIE_MISC_HARD_PCIE_HARD_DEBUG);
+}
+
 static int brcm_pcie_start_link(struct brcm_pcie *pcie)
 {
 	struct device *dev = pcie->dev;
 	void __iomem *base = pcie->base;
 	u16 nlw, cls, lnksta;
 	bool ssc_good = false;
-	u32 tmp;
 	int ret, i;
 
 	/* Unassert the fundamental reset */
@@ -1055,6 +1110,8 @@  static int brcm_pcie_start_link(struct brcm_pcie *pcie)
 		return -ENODEV;
 	}
 
+	brcm_config_clkreq(pcie);
+
 	if (pcie->gen)
 		brcm_pcie_set_gen(pcie, pcie->gen);
 
@@ -1073,14 +1130,6 @@  static int brcm_pcie_start_link(struct brcm_pcie *pcie)
 		 pci_speed_string(pcie_link_speed[cls]), nlw,
 		 ssc_good ? "(SSC)" : "(!SSC)");
 
-	/*
-	 * Refclk from RC should be gated with CLKREQ# input when ASPM L0s,L1
-	 * is enabled => setting the CLKREQ_DEBUG_ENABLE field to 1.
-	 */
-	tmp = readl(base + PCIE_MISC_HARD_PCIE_HARD_DEBUG);
-	tmp |= PCIE_MISC_HARD_PCIE_HARD_DEBUG_CLKREQ_DEBUG_ENABLE_MASK;
-	writel(tmp, base + PCIE_MISC_HARD_PCIE_HARD_DEBUG);
-
 	return 0;
 }