[1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

Message ID 20201024205548.1837770-1-ian.kumlien@gmail.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Series [1/3] PCI/ASPM: Use the path max in L1 ASPM latency check

Commit Message

Ian Kumlien Oct. 24, 2020, 8:55 p.m. UTC
Make pcie_aspm_check_latency() comply with the PCIe spec, specifically
sec 5.4.1.2.2, "Exit from the L1 State", which makes it clear that each
switch is required to initiate an L1 exit on its upstream port within
1 μs of the beginning of an L1 exit on one of its downstream ports.
These 1 μs delays accumulate, and on top of them we have to wait for
the slowest link along the path before the path is back in L0.

The current code doesn't take the maximum exit latency along the path
into account.

From the example in sec 5.4.1.2.2:
   +----------------+
   |                |
   |  Root complex  |
   |                |
   |    +-----+     |
   |    |32 μs|     |
   +----------------+
           |
           |  Link 1
           |
   +----------------+
   |     |8 μs|     |
   |     +----+     |
   |    Switch A    |
   |     +----+     |
   |     |8 μs|     |
   +----------------+
           |
           |  Link 2
           |
   +----------------+
   |    |32 μs|     |
   |    +-----+     |
   |    Switch B    |
   |    +-----+     |
   |    |32 μs|     |
   +----------------+
           |
           |  Link 3
           |
   +----------------+
   |     |8μs|      |
   |     +---+      |
   |   Endpoint C   |
   |                |
   |                |
   +----------------+

Links 1, 2 and 3 are all in the L1 state, and endpoint C initiates the
transition to L0 at time T. Since switch B takes 32 μs to exit L1 on
its ports, link 3 will transition to L0 at T+32 (the longest of T+8
for endpoint C and T+32 for switch B).

Switch B is required to initiate a transition from L1 on its upstream
port no more than 1 μs after the beginning of the L1 exit transition
on its downstream port. Therefore the L1-to-L0 transition begins on
link 2 at T+1, and this cascades up the path.

The path will exit L1 at T+34.
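
To make the accumulation concrete, here is a minimal standalone sketch
of the calculation (a hypothetical helper, not the patch itself), using
the per-link exit latencies from the diagram above:

/*
 * Sketch only: worst-case L1 exit latency of a path.  Each link's exit
 * latency is the max of the two ports on that link; every intermediate
 * switch may add up to 1 us before it starts waking its upstream link.
 */
static unsigned int l1_path_exit_latency_us(const unsigned int *link_us,
                                            unsigned int nr_links)
{
        unsigned int i, max_us = 0;

        for (i = 0; i < nr_links; i++)
                if (link_us[i] > max_us)
                        max_us = link_us[i];

        return max_us + (nr_links - 1); /* 1 us per intermediate switch */
}

/*
 * For the diagram: link 1 = max(32, 8) = 32, link 2 = max(8, 32) = 32,
 * link 3 = max(32, 8) = 32; two intermediate switches give
 * 32 + 2 = 34 us, i.e. the path exits L1 at T+34.
 */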

On my specific system:
03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)

            Exit latency       Acceptable latency
Tree:       L1       L0s       L1       L0s
----------  -------  -----     -------  ------
00:01.2     <32 us   -
| 01:00.0   <32 us   -
|- 02:03.0  <32 us   -
| \03:00.0  <16 us   <2us      <64 us   <512ns
|
\- 02:04.0  <32 us   -
  \04:00.0  <64 us   unlimited <64 us   <512ns

04:00.0's exit latency is the same as the maximum latency it accepts,
so as we walk the path the first switch's 1 μs delay already pushes the
total past the acceptable limit for that device, and as a side effect
this also fixes my issues with 03:00.0.

Without this patch, 03:00.0 misbehaves and only gives me ~40 Mbit/s
over connections with 6 or more hops. With this patch I'm back to a
maximum of ~933 Mbit/s.

The original code path did:
04:00.0-02:04.0 max latency 64    -> ok
02:04.0-01:00.0 max latency 32 +1 -> ok
01:00.0-00:01.2 max latency 32 +2 -> ok

And thus didn't see any L1 ASPM latency issues.

The new code does:
04:00.0-02:04.0 max latency 64    -> ok
02:04.0-01:00.0 max latency 64 +1 -> latency exceeded
01:00.0-00:01.2 max latency 64 +2 -> latency exceeded

It correctly identifies the issue.

For reference, PCIe information:
https://bugzilla.kernel.org/show_bug.cgi?id=209725

Kai-Heng Feng has a machine that will not boot with ASPM enabled
without this patch; the details are documented here:
https://bugzilla.kernel.org/show_bug.cgi?id=209671

Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com>
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/pci/pcie/aspm.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

Comments

Ian Kumlien Nov. 15, 2020, 9:49 p.m. UTC | #1
Is this ok? Or what's missing?

Ian Kumlien Dec. 7, 2020, 11:04 a.m. UTC | #2
I was hoping that this patch would get into 5.10 since it'll be an LTS
release and all... but it doesn't seem like it.

Any thoughts?

Bjorn Helgaas Dec. 12, 2020, 11:47 p.m. UTC | #3
On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> "5.4.1.2.2. Exit from the L1 State"
> 
> Which makes it clear that each switch is required to initiate a
> transition within 1μs from receiving it, accumulating this latency and
> then we have to wait for the slowest link along the path before
> entering L0 state from L1.
> ...

> On my specific system:
> 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)
> 
>             Exit latency       Acceptable latency
> Tree:       L1       L0s       L1       L0s
> ----------  -------  -----     -------  ------
> 00:01.2     <32 us   -
> | 01:00.0   <32 us   -
> |- 02:03.0  <32 us   -
> | \03:00.0  <16 us   <2us      <64 us   <512ns
> |
> \- 02:04.0  <32 us   -
>   \04:00.0  <64 us   unlimited <64 us   <512ns
> 
> 04:00.0's latency is the same as the maximum it allows so as we walk the path
> the first switchs startup latency will pass the acceptable latency limit
> for the link, and as a side-effect it fixes my issues with 03:00.0.
> 
> Without this patch, 03:00.0 misbehaves and only gives me ~40 mbit/s over
> links with 6 or more hops. With this patch I'm back to a maximum of ~933
> mbit/s.

There are two paths here that share a Link:

  00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
  00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek

1) The path to the I211 NIC includes four Ports and two Links (the
   connection between 01:00.0 and 02:03.0 is internal Switch routing,
   not a Link).

   The Ports advertise L1 exit latencies of <32us, <32us, <32us,
   <16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
   01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
   to 1 + 32 = 33us of L1 exit latency.

   The NIC can tolerate up to 64us of L1 exit latency, so it is safe
   to enable L1 for both Links.

2) The path to the Realtek device is similar except that the Realtek
   L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
   initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
   but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
   latency.

   The Realtek device can only tolerate 64us of latency, so it is not
   safe to enable L1 for both Links.  It should be safe to enable L1
   on the shared link because the exit latency for that link would be
   <32us.
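
As a quick illustration, here is a standalone sketch of that arithmetic
(hypothetical code, not the kernel's; values in microseconds taken from
the lspci output above):

/* Sketch only: whole-path L1 exit latency vs. what the endpoint accepts. */
#include <stdio.h>

static void check_path(const char *name, unsigned int slowest_link_us,
                       unsigned int switches, unsigned int acceptable_us)
{
        unsigned int exit_us = slowest_link_us + switches; /* 1 us per switch */

        printf("%s: %u + %u = %u us, tolerates %u us -> %s\n", name,
               switches, slowest_link_us, exit_us, acceptable_us,
               exit_us <= acceptable_us ? "L1 ok" : "L1 not ok");
}

int main(void)
{
        check_path("03:00.0 I211   ", 32, 1, 64); /* 1 + 32 = 33 us */
        check_path("04:00.x Realtek", 64, 1, 64); /* 1 + 64 = 65 us */
        return 0;
}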

> The original code path did:
> 04:00:0-02:04.0 max latency 64    -> ok
> 02:04.0-01:00.0 max latency 32 +1 -> ok
> 01:00.0-00:01.2 max latency 32 +2 -> ok
> 
> And thus didn't see any L1 ASPM latency issues.
> 
> The new code does:
> 04:00:0-02:04.0 max latency 64    -> ok
> 02:04.0-01:00.0 max latency 64 +1 -> latency exceeded
> 01:00.0-00:01.2 max latency 64 +2 -> latency exceeded

[Nit: I don't think we should add 1 for the 02:04.0 -- 01:00.0 piece
because that's internal Switch routing, not a Link.  But even without
that extra microsecond, this path does exceed the acceptable latency
since 1 + 64 = 65us, and 04:00.0 can only tolerate 64us.]

> It correctly identifies the issue.
> 
> For reference, pcie information:
> https://bugzilla.kernel.org/show_bug.cgi?id=209725

The "lspci without my patches" [1] shows L1 enabled for the shared
Link from 00:01.2 --- 01:00.0 and for the Link to 03:00.0 (I211), but
not for the Link to 04:00.x (Realtek).

Per my analysis above, that looks like it *should* be a safe
configuration.  03:00.0 can tolerate 64us, actual is <33us.  04:00.0
can tolerate 64us, actual should be <32us since only the shared Link
is in L1.

However, the commit log at [2] shows L1 *enabled* for both the shared
Link from 00:01.2 --- 01:00.0 and the 02:04.0 --- 04:00.x Link, and 
that would definitely be a problem.

Can you explain the differences between [1] and [2]?

> Kai-Heng Feng has a machine that will not boot with ASPM without this patch,
> information is documented here:
> https://bugzilla.kernel.org/show_bug.cgi?id=209671

I started working through this info, too, but there's not enough
information to tell what difference this patch makes.  The attachments
compare:

  1) CONFIG_PCIEASPM_DEFAULT=y without the patch [3] and
  2) CONFIG_PCIEASPM_POWERSAVE=y *with* the patch [4]

Obviously CONFIG_PCIEASPM_POWERSAVE=y will configure things
differently than CONFIG_PCIEASPM_DEFAULT=y, so we can't tell what
changes are due to the config change and what are due to the patch.

The lspci *with* the patch ([4]) shows L0s and L1 enabled at almost
every possible place.  Here are the Links, how they're configured, and
my analysis of the exit latencies vs acceptable latencies:

  00:01.1 --- 01:00.0      L1+ (                  L1 <64us vs unl)
  00:01.2 --- 02:00.0      L1+ (                  L1 <64us vs 64us)
  00:01.3 --- 03:00.0      L1+ (                  L1 <64us vs 64us)
  00:01.4 --- 04:00.0      L1+ (                  L1 <64us vs unl)
  00:08.1 --- 05:00.x L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
  00:08.2 --- 06:00.0 L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)

So I can't tell what change prevents the freeze.  I would expect the
patch would cause us to *disable* L0s or L1 somewhere.

The only place [4] shows ASPM disabled is for 05:00.1.  The spec says
we should program the same value in all functions of a multi-function
device.  This is a non-ARI device, so "only capabilities enabled in
all functions are enabled for the component as a whole."  That would
mean that L0s and L1 are effectively disabled for 05:00.x even though
05:00.0 claims they're enabled.  But the latencies say ASPM L0s and L1
should be safe to be enabled.  This looks like another bug that's
probably unrelated.
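
For illustration, here is a rough sketch (not kernel code) of that
AND-across-functions rule, assuming the standard LnkCtl ASPM Control
encoding (bit 0 = L0s Entry Enabled, bit 1 = L1 Entry Enabled):

#define ASPM_CTL_L0S    0x1
#define ASPM_CTL_L1     0x2

/* Effective ASPM for a non-ARI multi-function component: only states
 * enabled in every function count as enabled for the component.
 */
static unsigned int effective_aspm(const unsigned int *fn_aspm_ctl,
                                   unsigned int nr_fns)
{
        unsigned int i, eff = ASPM_CTL_L0S | ASPM_CTL_L1;

        for (i = 0; i < nr_fns; i++)
                eff &= fn_aspm_ctl[i];

        return eff;
}

/* 05:00.0 = L0s|L1, 05:00.1 = 0 -> effective_aspm() returns 0, i.e.
 * ASPM is effectively disabled for 05:00.x as a whole.
 */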

The patch might be correct; I haven't actually analyzed the code.  But
the commit log doesn't make sense to me yet.

[1] https://bugzilla.kernel.org/attachment.cgi?id=293047
[2] https://lore.kernel.org/linux-pci/20201007132808.647589-1-ian.kumlien@gmail.com/
[3] https://bugzilla.kernel.org/attachment.cgi?id=292955
[4] https://bugzilla.kernel.org/attachment.cgi?id=292957

> Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com>
> Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/pci/pcie/aspm.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index 253c30cc1967..c03ead0f1013 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
>  
>  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
>  {
> -	u32 latency, l1_switch_latency = 0;
> +	u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
>  	struct aspm_latency *acceptable;
>  	struct pcie_link_state *link;
>  
> @@ -456,10 +456,14 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
>  		if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
>  		    (link->latency_dw.l0s > acceptable->l0s))
>  			link->aspm_capable &= ~ASPM_STATE_L0S_DW;
> +
>  		/*
>  		 * Check L1 latency.
> -		 * Every switch on the path to root complex need 1
> -		 * more microsecond for L1. Spec doesn't mention L0s.
> +		 *
> +		 * PCIe r5.0, sec 5.4.1.2.2 states:
> +		 * A Switch is required to initiate an L1 exit transition on its
> +		 * Upstream Port Link after no more than 1 μs from the beginning of an
> +		 * L1 exit transition on any of its Downstream Port Links.
>  		 *
>  		 * The exit latencies for L1 substates are not advertised
>  		 * by a device.  Since the spec also doesn't mention a way
> @@ -469,11 +473,13 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
>  		 * L1 exit latencies advertised by a device include L1
>  		 * substate latencies (and hence do not do any check).
>  		 */
> -		latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> -		if ((link->aspm_capable & ASPM_STATE_L1) &&
> -		    (latency + l1_switch_latency > acceptable->l1))
> -			link->aspm_capable &= ~ASPM_STATE_L1;
> -		l1_switch_latency += 1000;
> +		if (link->aspm_capable & ASPM_STATE_L1) {
> +			latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> +			l1_max_latency = max_t(u32, latency, l1_max_latency);
> +			if (l1_max_latency + l1_switch_latency > acceptable->l1)
> +				link->aspm_capable &= ~ASPM_STATE_L1;
> +			l1_switch_latency += 1000;
> +		}
>  
>  		link = link->parent;
>  	}
> -- 
> 2.29.1
>
Ian Kumlien Dec. 13, 2020, 9:39 p.m. UTC | #4
On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> > "5.4.1.2.2. Exit from the L1 State"
> >
> > Which makes it clear that each switch is required to initiate a
> > transition within 1μs from receiving it, accumulating this latency and
> > then we have to wait for the slowest link along the path before
> > entering L0 state from L1.
> > ...
>
> > On my specific system:
> > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)
> >
> >             Exit latency       Acceptable latency
> > Tree:       L1       L0s       L1       L0s
> > ----------  -------  -----     -------  ------
> > 00:01.2     <32 us   -
> > | 01:00.0   <32 us   -
> > |- 02:03.0  <32 us   -
> > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > |
> > \- 02:04.0  <32 us   -
> >   \04:00.0  <64 us   unlimited <64 us   <512ns
> >
> > 04:00.0's latency is the same as the maximum it allows so as we walk the path
> > the first switchs startup latency will pass the acceptable latency limit
> > for the link, and as a side-effect it fixes my issues with 03:00.0.
> >
> > Without this patch, 03:00.0 misbehaves and only gives me ~40 mbit/s over
> > links with 6 or more hops. With this patch I'm back to a maximum of ~933
> > mbit/s.
>
> There are two paths here that share a Link:
>
>   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
>   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
>
> 1) The path to the I211 NIC includes four Ports and two Links (the
>    connection between 01:00.0 and 02:03.0 is internal Switch routing,
>    not a Link).

>    The Ports advertise L1 exit latencies of <32us, <32us, <32us,
>    <16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
>    01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
>    to 1 + 32 = 33us of L1 exit latency.
>
>    The NIC can tolerate up to 64us of L1 exit latency, so it is safe
>    to enable L1 for both Links.
>
> 2) The path to the Realtek device is similar except that the Realtek
>    L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
>    initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
>    but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
>    latency.
>
>    The Realtek device can only tolerate 64us of latency, so it is not
>    safe to enable L1 for both Links.  It should be safe to enable L1
>    on the shared link because the exit latency for that link would be
>    <32us.

04:00.0:
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us

So the maximum latency for the entire path has to be <64 μs, while the
device itself needs up to 64 μs just to leave L1.

The device is therefore the slowest point along the path, which means
that nothing else along that path can have L1 ASPM enabled.
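
For reference, a minimal sketch of where those lspci numbers live in
config space (an illustration, not the aspm.c code; it assumes the
usual LnkCap[17:15] / DevCap[11:9] encodings, where encoding n gives an
upper bound of 1 us << n and 7 means ">64us" / "no limit"):

#include <linux/pci.h>

/* L1 exit latency advertised in Link Capabilities, in nanoseconds */
static u32 l1_exit_latency_ns(struct pci_dev *pdev)
{
        u32 lnkcap, enc;

        pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &lnkcap);
        enc = (lnkcap & PCI_EXP_LNKCAP_L1EL) >> 15;

        return enc == 7 ? ~0U : 1000 << enc;    /* 04:00.0: 6 -> 64000 ns */
}

/* Acceptable L1 latency advertised in Device Capabilities, in nanoseconds */
static u32 l1_acceptable_latency_ns(struct pci_dev *pdev)
{
        u32 devcap, enc;

        pcie_capability_read_dword(pdev, PCI_EXP_DEVCAP, &devcap);
        enc = (devcap & PCI_EXP_DEVCAP_L1) >> 9;

        return enc == 7 ? ~0U : 1000 << enc;    /* 04:00.0: 6 -> 64000 ns */
}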

> > The original code path did:
> > 04:00:0-02:04.0 max latency 64    -> ok
> > 02:04.0-01:00.0 max latency 32 +1 -> ok
> > 01:00.0-00:01.2 max latency 32 +2 -> ok
> >
> > And thus didn't see any L1 ASPM latency issues.
> >
> > The new code does:
> > 04:00:0-02:04.0 max latency 64    -> ok
> > 02:04.0-01:00.0 max latency 64 +1 -> latency exceeded
> > 01:00.0-00:01.2 max latency 64 +2 -> latency exceeded
>
> [Nit: I don't think we should add 1 for the 02:04.0 -- 01:00.0 piece
> because that's internal Switch routing, not a Link.  But even without
> that extra microsecond, this path does exceed the acceptable latency
> since 1 + 64 = 65us, and 04:00.0 can only tolerate 64us.]

Both ends of that segment do report L1 ASPM, so the code counts it as
a link like any other.

I also assume that the switch can power down individual ports... and
enter a resting state if no links are up.

> > It correctly identifies the issue.
> >
> > For reference, pcie information:
> > https://bugzilla.kernel.org/show_bug.cgi?id=209725
>
> The "lspci without my patches" [1] shows L1 enabled for the shared
> Link from 00:01.2 --- 01:00.0 and for the Link to 03:00.0 (I211), but
> not for the Link to 04:00.x (Realtek).
>
> Per my analysis above, that looks like it *should* be a safe
> configuration.  03:00.0 can tolerate 64us, actual is <33us.  04:00.0
> can tolerate 64us, actual should be <32us since only the shared Link
> is in L1.

See above.

> However, the commit log at [2] shows L1 *enabled* for both the shared
> Link from 00:01.2 --- 01:00.0 and the 02:04.0 --- 04:00.x Link, and
> that would definitely be a problem.
>
> Can you explain the differences between [1] and [2]?

I don't understand which sections you're referring to.

> > Kai-Heng Feng has a machine that will not boot with ASPM without this patch,
> > information is documented here:
> > https://bugzilla.kernel.org/show_bug.cgi?id=209671
>
> I started working through this info, too, but there's not enough
> information to tell what difference this patch makes.  The attachments
> compare:
>
>   1) CONFIG_PCIEASPM_DEFAULT=y without the patch [3] and
>   2) CONFIG_PCIEASPM_POWERSAVE=y *with* the patch [4]
>
> Obviously CONFIG_PCIEASPM_POWERSAVE=y will configure things
> differently than CONFIG_PCIEASPM_DEFAULT=y, so we can't tell what
> changes are due to the config change and what are due to the patch.
>
> The lspci *with* the patch ([4]) shows L0s and L1 enabled at almost
> every possible place.  Here are the Links, how they're configured, and
> my analysis of the exit latencies vs acceptable latencies:
>
>   00:01.1 --- 01:00.0      L1+ (                  L1 <64us vs unl)
>   00:01.2 --- 02:00.0      L1+ (                  L1 <64us vs 64us)
>   00:01.3 --- 03:00.0      L1+ (                  L1 <64us vs 64us)
>   00:01.4 --- 04:00.0      L1+ (                  L1 <64us vs unl)
>   00:08.1 --- 05:00.x L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
>   00:08.2 --- 06:00.0 L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
>
> So I can't tell what change prevents the freeze.  I would expect the
> patch would cause us to *disable* L0s or L1 somewhere.
>
> The only place [4] shows ASPM disabled is for 05:00.1.  The spec says
> we should program the same value in all functions of a multi-function
> device.  This is a non-ARI device, so "only capabilities enabled in
> all functions are enabled for the component as a whole."  That would
> mean that L0s and L1 are effectively disabled for 05:00.x even though
> 05:00.0 claims they're enabled.  But the latencies say ASPM L0s and L1
> should be safe to be enabled.  This looks like another bug that's
> probably unrelated.

I don't think it's unrelated; I suspect it's how PCIe works with
multiple links... a device can cause some kind of head-of-queue
stalling. I don't know how, but it really looks like it.

> The patch might be correct; I haven't actually analyzed the code.  But
> the commit log doesn't make sense to me yet.

I personally don't think all this PCI information is required; the
Linux kernel is currently doing the check wrong according to the spec.

Also, since it's clearly doing the wrong thing, I'm worried that
distributions will ship kernels with ASPM enabled and there will be a
lot of bug reports of non-booting systems or other weird issues... and
the culprit was known all along.

It's been five months...

> [1] https://bugzilla.kernel.org/attachment.cgi?id=293047
> [2] https://lore.kernel.org/linux-pci/20201007132808.647589-1-ian.kumlien@gmail.com/
> [3] https://bugzilla.kernel.org/attachment.cgi?id=292955
> [4] https://bugzilla.kernel.org/attachment.cgi?id=292957
>
> > Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com>
> > Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > ---
> >  drivers/pci/pcie/aspm.c | 22 ++++++++++++++--------
> >  1 file changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > index 253c30cc1967..c03ead0f1013 100644
> > --- a/drivers/pci/pcie/aspm.c
> > +++ b/drivers/pci/pcie/aspm.c
> > @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
> >
> >  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> >  {
> > -     u32 latency, l1_switch_latency = 0;
> > +     u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
> >       struct aspm_latency *acceptable;
> >       struct pcie_link_state *link;
> >
> > @@ -456,10 +456,14 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> >               if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
> >                   (link->latency_dw.l0s > acceptable->l0s))
> >                       link->aspm_capable &= ~ASPM_STATE_L0S_DW;
> > +
> >               /*
> >                * Check L1 latency.
> > -              * Every switch on the path to root complex need 1
> > -              * more microsecond for L1. Spec doesn't mention L0s.
> > +              *
> > +              * PCIe r5.0, sec 5.4.1.2.2 states:
> > +              * A Switch is required to initiate an L1 exit transition on its
> > +              * Upstream Port Link after no more than 1 μs from the beginning of an
> > +              * L1 exit transition on any of its Downstream Port Links.
> >                *
> >                * The exit latencies for L1 substates are not advertised
> >                * by a device.  Since the spec also doesn't mention a way
> > @@ -469,11 +473,13 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> >                * L1 exit latencies advertised by a device include L1
> >                * substate latencies (and hence do not do any check).
> >                */
> > -             latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > -             if ((link->aspm_capable & ASPM_STATE_L1) &&
> > -                 (latency + l1_switch_latency > acceptable->l1))
> > -                     link->aspm_capable &= ~ASPM_STATE_L1;
> > -             l1_switch_latency += 1000;
> > +             if (link->aspm_capable & ASPM_STATE_L1) {
> > +                     latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > +                     l1_max_latency = max_t(u32, latency, l1_max_latency);
> > +                     if (l1_max_latency + l1_switch_latency > acceptable->l1)
> > +                             link->aspm_capable &= ~ASPM_STATE_L1;
> > +                     l1_switch_latency += 1000;
> > +             }
> >
> >               link = link->parent;
> >       }
> > --
> > 2.29.1
> >
Bjorn Helgaas Dec. 14, 2020, 5:44 a.m. UTC | #5
[+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
issue with I211 or Realtek NICs.  Beginning of thread:
https://lore.kernel.org/r/20201024205548.1837770-1-ian.kumlien@gmail.com

Short story: Ian has:

  Root Port --- Switch --- I211 NIC
                       \-- multifunction Realtek NIC, etc

and the I211 performance is poor with ASPM L1 enabled on both links
in the path to it.  The patch here disables ASPM on the upstream link
and fixes the performance, but AFAICT the devices in that path give us
no reason to disable L1.  If I understand the spec correctly, the
Realtek device should not be relevant to the I211 path.]

On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> > > "5.4.1.2.2. Exit from the L1 State"
> > >
> > > Which makes it clear that each switch is required to initiate a
> > > transition within 1μs from receiving it, accumulating this latency and
> > > then we have to wait for the slowest link along the path before
> > > entering L0 state from L1.
> > > ...
> >
> > > On my specific system:
> > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)
> > >
> > >             Exit latency       Acceptable latency
> > > Tree:       L1       L0s       L1       L0s
> > > ----------  -------  -----     -------  ------
> > > 00:01.2     <32 us   -
> > > | 01:00.0   <32 us   -
> > > |- 02:03.0  <32 us   -
> > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > |
> > > \- 02:04.0  <32 us   -
> > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > >
> > > 04:00.0's latency is the same as the maximum it allows so as we walk the path
> > > the first switchs startup latency will pass the acceptable latency limit
> > > for the link, and as a side-effect it fixes my issues with 03:00.0.
> > >
> > > Without this patch, 03:00.0 misbehaves and only gives me ~40 mbit/s over
> > > links with 6 or more hops. With this patch I'm back to a maximum of ~933
> > > mbit/s.
> >
> > There are two paths here that share a Link:
> >
> >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> >
> > 1) The path to the I211 NIC includes four Ports and two Links (the
> >    connection between 01:00.0 and 02:03.0 is internal Switch routing,
> >    not a Link).
> 
> >    The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> >    <16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
> >    01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
> >    to 1 + 32 = 33us of L1 exit latency.
> >
> >    The NIC can tolerate up to 64us of L1 exit latency, so it is safe
> >    to enable L1 for both Links.
> >
> > 2) The path to the Realtek device is similar except that the Realtek
> >    L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> >    initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
> >    but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> >    latency.
> >
> >    The Realtek device can only tolerate 64us of latency, so it is not
> >    safe to enable L1 for both Links.  It should be safe to enable L1
> >    on the shared link because the exit latency for that link would be
> >    <32us.
> 
> 04:00.0:
> DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s
> unlimited, L1 <64us
> 
> So maximum latency for the entire link has to be <64 us
> For the device to leave L1 ASPM takes <64us
> 
> So the device itself is the slowest entry along the link, which
> means that nothing else along that path can have ASPM enabled

Yes.  That's what I said above: "it is not safe to enable L1 for both
Links."  Unless I'm missing something, we agree on that.

I also said that it should be safe to enable L1 on the shared Link
(from 00:01.2 to 01:00.0) because if the downstream Link is always in
L0, the exit latency of the shared Link should be <32us, and 04:00.x
can tolerate 64us.

> > > The original code path did:
> > > 04:00:0-02:04.0 max latency 64    -> ok
> > > 02:04.0-01:00.0 max latency 32 +1 -> ok
> > > 01:00.0-00:01.2 max latency 32 +2 -> ok
> > >
> > > And thus didn't see any L1 ASPM latency issues.
> > >
> > > The new code does:
> > > 04:00:0-02:04.0 max latency 64    -> ok
> > > 02:04.0-01:00.0 max latency 64 +1 -> latency exceeded
> > > 01:00.0-00:01.2 max latency 64 +2 -> latency exceeded
> >
> > [Nit: I don't think we should add 1 for the 02:04.0 -- 01:00.0 piece
> > because that's internal Switch routing, not a Link.  But even without
> > that extra microsecond, this path does exceed the acceptable latency
> > since 1 + 64 = 65us, and 04:00.0 can only tolerate 64us.]
> 
> It does report L1 ASPM on both ends, so the links will be counted as
> such in the code.

This is a bit of a tangent and we shouldn't get too wrapped up in it.
This is a confusing aspect of PCIe.  We're talking about this path:

  00:01.2 --- [01:00.0 -- 02:04.0] --- 04:00.x multifunction Realtek

This path only contains two Links.  The first one is 
00:01.2 --- 01:00.0, and the second one is 02:04.0 --- 04:00.x.

01:00.0 is a Switch Upstream Port and 02:04.0 is a Switch Downstream
Port.  The connection between them is not a Link; it is some internal
wiring of the Switch that is completely opaque to software.

The ASPM information and knobs in 01:00.0 apply to the Link on its
upstream side, and the ASPM info and knobs in 02:04.0 apply to the
Link on its downstream side.

The example in sec 5.4.1.2.2 contains three Links.  The L1 exit latency
for the Link is the max of the exit latencies at each end:

  Link 1: max(32, 8) = 32us
  Link 2: max(8, 32) = 32us
  Link 3: max(32, 8) = 32us

The total delay for a TLP starting at the downstream end of Link 3
is 32 + 2 = 34us.

In the path to your 04:00.x Realtek device:

  Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us
  Link 2 (from 02:04.0 to 04:00.x): max(32, 64) = 64us

If L1 were enabled on both Links, the exit latency would be 64 + 1 =
65us.
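
A small sketch (hypothetical helper, not the kernel code) of that walk
with the patched accumulation, starting at the endpoint:

/*
 * link_us[] holds max(upstream port, downstream port) per link,
 * endpoint first; each further hop adds up to 1 us of switch delay.
 */
static unsigned int l1_links_allowed(const unsigned int *link_us,
                                     unsigned int nr_links,
                                     unsigned int acceptable_us)
{
        unsigned int i, max_us = 0, switch_us = 0, allowed = 0;

        for (i = 0; i < nr_links; i++) {
                if (link_us[i] > max_us)
                        max_us = link_us[i];
                if (max_us + switch_us <= acceptable_us)
                        allowed++;      /* this link may keep L1 */
                switch_us++;
        }
        return allowed;
}

/*
 * For { 64, 32 } and acceptable 64: the 02:04.0 -- 04:00.x link passes
 * (64 + 0), the shared 00:01.2 -- 01:00.0 link fails (64 + 1 = 65), so
 * under the new check L1 would be dropped on the shared link first.
 */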

> I also assume that it can power down individual ports... and enter
> rest state if no links are up.

I don't think this is quite true -- a Link can't enter L1 unless the
Ports on both ends have L1 enabled, so I don't think it makes sense to
talk about an individual Port being in L1.

> > > It correctly identifies the issue.
> > >
> > > For reference, pcie information:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> >
> > The "lspci without my patches" [1] shows L1 enabled for the shared
> > Link from 00:01.2 --- 01:00.0 and for the Link to 03:00.0 (I211), but
> > not for the Link to 04:00.x (Realtek).
> >
> > Per my analysis above, that looks like it *should* be a safe
> > configuration.  03:00.0 can tolerate 64us, actual is <33us.  04:00.0
> > can tolerate 64us, actual should be <32us since only the shared Link
> > is in L1.
> 
> See above.

As I said above, if we enabled L1 only on the shared Link from 00:01.2
to 01:00.0, the exit latency should be acceptable.  In that case, a
TLP from 04:00.x would see only 32us of latency:

  Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us

and 04:00.x can tolerate 64us.

> > However, the commit log at [2] shows L1 *enabled* for both the shared
> > Link from 00:01.2 --- 01:00.0 and the 02:04.0 --- 04:00.x Link, and
> > that would definitely be a problem.
> >
> > Can you explain the differences between [1] and [2]?
> 
> I don't understand which sections you're referring to.

[1] is the "lspci without my patches" attachment of bugzilla #209725,
which is supposed to show the problem this patch solves.  We're
talking about the path to 04:00.x, and [1] show this:

  00:01.2 L1+
  01:00.0 L1+
  02:04.0 L1-
  04:00.0 L1-

AFAICT, that should be a legal configuration as far as 04:00.0 is
concerned, so it's not a reason for this patch.

[2] is a previous posting of this same patch, and its commit log
includes information about the same path to 04:00.x, but the "LnkCtl
Before" column shows:

  00:01.2 L1+
  01:00.0 L1+
  02:04.0 L1+
  04:00.0 L1+

I don't know why [1] shows L1 disabled on the downstream Link, while
[2] shows L1 *enabled* on the same Link.

> > > Kai-Heng Feng has a machine that will not boot with ASPM without
> > > this patch, information is documented here:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=209671
> >
> > I started working through this info, too, but there's not enough
> > information to tell what difference this patch makes.  The attachments
> > compare:
> >
> >   1) CONFIG_PCIEASPM_DEFAULT=y without the patch [3] and
> >   2) CONFIG_PCIEASPM_POWERSAVE=y *with* the patch [4]
> >
> > Obviously CONFIG_PCIEASPM_POWERSAVE=y will configure things
> > differently than CONFIG_PCIEASPM_DEFAULT=y, so we can't tell what
> > changes are due to the config change and what are due to the patch.
> >
> > The lspci *with* the patch ([4]) shows L0s and L1 enabled at almost
> > every possible place.  Here are the Links, how they're configured, and
> > my analysis of the exit latencies vs acceptable latencies:
> >
> >   00:01.1 --- 01:00.0      L1+ (                  L1 <64us vs unl)
> >   00:01.2 --- 02:00.0      L1+ (                  L1 <64us vs 64us)
> >   00:01.3 --- 03:00.0      L1+ (                  L1 <64us vs 64us)
> >   00:01.4 --- 04:00.0      L1+ (                  L1 <64us vs unl)
> >   00:08.1 --- 05:00.x L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> >   00:08.2 --- 06:00.0 L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> >
> > So I can't tell what change prevents the freeze.  I would expect the
> > patch would cause us to *disable* L0s or L1 somewhere.
> >
> > The only place [4] shows ASPM disabled is for 05:00.1.  The spec says
> > we should program the same value in all functions of a multi-function
> > device.  This is a non-ARI device, so "only capabilities enabled in
> > all functions are enabled for the component as a whole."  That would
> > mean that L0s and L1 are effectively disabled for 05:00.x even though
> > 05:00.0 claims they're enabled.  But the latencies say ASPM L0s and L1
> > should be safe to be enabled.  This looks like another bug that's
> > probably unrelated.
> 
> I don't think it's unrelated, i suspect it's how PCIe works with
> multiple links...  a device can cause some kind of head of queue
> stalling - i don't know how but it really looks like it.

The text in quotes above is straight out of the spec (PCIe r5.0, sec
7.5.3.7).  Either the device works that way or it's not compliant.

The OS configures ASPM based on the requirements and capabilities
advertised by the device.  If a device has any head of queue stalling
or similar issues, those must be comprehended in the numbers
advertised by the device.  It's not up to the OS to speculate about
issues like that.

> > The patch might be correct; I haven't actually analyzed the code.  But
> > the commit log doesn't make sense to me yet.
> 
> I personally don't think that all this PCI information is required,
> the linux kernel is currently doing it wrong according to the spec.

We're trying to establish exactly *what* Linux is doing wrong.  So far
we don't have a good explanation of that.

Based on [1], in the path to 03:00.0, both Links have L1 enabled, with
an exit latency of <33us, and 03:00.0 can tolerate 64us.  That should
work fine.

Also based on [1], in the path to 04:00.x, the upstream Link has L1
enabled and the downstream Link has L1 disabled, for an exit latency
of <32us, and 04:00.0 can tolerate 64us.  That should also work fine.

(Alternately, disabling L1 on the upstream Link and enabling it on the
downstream Link should have an exit latency of <64us and 04:00.0 can
tolerate 64us, so that should work fine, too.)

> Also, since it's clearly doing the wrong thing, I'm worried that
> dists will take a kernel enable aspm and there will be alot of
> bugreports of non-booting systems or other weird issues... And the
> culprit was known all along.

There's clearly a problem on your system, but I don't know yet whether
Linux is doing something wrong, a device in your system is designed
incorrectly, or a device is designed correctly but the instance in
your system is defective.

> It's been five months...

I apologize for the delay.  ASPM is a subtle area of PCIe, the Linux
code is complicated, and we have a long history of issues with it.  I
want to fix the problem, but I want to make sure we do it in a way
that matches the spec so the fix applies to all systems.  I don't want
a magic fix that fixes your system in a way I don't quite understand.

Obviously *you* understand this, so hopefully it's just a matter of
pounding it through my thick skull :)

> > [1] https://bugzilla.kernel.org/attachment.cgi?id=293047
> > [2] https://lore.kernel.org/linux-pci/20201007132808.647589-1-ian.kumlien@gmail.com/
> > [3] https://bugzilla.kernel.org/attachment.cgi?id=292955
> > [4] https://bugzilla.kernel.org/attachment.cgi?id=292957
> >
> > > Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com>
> > > Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > ---
> > >  drivers/pci/pcie/aspm.c | 22 ++++++++++++++--------
> > >  1 file changed, 14 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > index 253c30cc1967..c03ead0f1013 100644
> > > --- a/drivers/pci/pcie/aspm.c
> > > +++ b/drivers/pci/pcie/aspm.c
> > > @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
> > >
> > >  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > >  {
> > > -     u32 latency, l1_switch_latency = 0;
> > > +     u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
> > >       struct aspm_latency *acceptable;
> > >       struct pcie_link_state *link;
> > >
> > > @@ -456,10 +456,14 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > >               if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
> > >                   (link->latency_dw.l0s > acceptable->l0s))
> > >                       link->aspm_capable &= ~ASPM_STATE_L0S_DW;
> > > +
> > >               /*
> > >                * Check L1 latency.
> > > -              * Every switch on the path to root complex need 1
> > > -              * more microsecond for L1. Spec doesn't mention L0s.
> > > +              *
> > > +              * PCIe r5.0, sec 5.4.1.2.2 states:
> > > +              * A Switch is required to initiate an L1 exit transition on its
> > > +              * Upstream Port Link after no more than 1 μs from the beginning of an
> > > +              * L1 exit transition on any of its Downstream Port Links.
> > >                *
> > >                * The exit latencies for L1 substates are not advertised
> > >                * by a device.  Since the spec also doesn't mention a way
> > > @@ -469,11 +473,13 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > >                * L1 exit latencies advertised by a device include L1
> > >                * substate latencies (and hence do not do any check).
> > >                */
> > > -             latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > -             if ((link->aspm_capable & ASPM_STATE_L1) &&
> > > -                 (latency + l1_switch_latency > acceptable->l1))
> > > -                     link->aspm_capable &= ~ASPM_STATE_L1;
> > > -             l1_switch_latency += 1000;
> > > +             if (link->aspm_capable & ASPM_STATE_L1) {
> > > +                     latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > +                     l1_max_latency = max_t(u32, latency, l1_max_latency);
> > > +                     if (l1_max_latency + l1_switch_latency > acceptable->l1)
> > > +                             link->aspm_capable &= ~ASPM_STATE_L1;
> > > +                     l1_switch_latency += 1000;
> > > +             }
> > >
> > >               link = link->parent;
> > >       }
> > > --
> > > 2.29.1
> > >
Ian Kumlien Dec. 14, 2020, 9:14 a.m. UTC | #6
On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
> issue with I211 or Realtek NICs.  Beginning of thread:
> https://lore.kernel.org/r/20201024205548.1837770-1-ian.kumlien@gmail.com
>
> Short story: Ian has:
>
>   Root Port --- Switch --- I211 NIC
>                        \-- multifunction Realtek NIC, etc
>
> and the I211 performance is poor with ASPM L1 enabled on both links
> in the path to it.  The patch here disables ASPM on the upstream link
> and fixes the performance, but AFAICT the devices in that path give us
> no reason to disable L1.  If I understand the spec correctly, the
> Realtek device should not be relevant to the I211 path.]
>
> On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > > Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> > > > "5.4.1.2.2. Exit from the L1 State"
> > > >
> > > > Which makes it clear that each switch is required to initiate a
> > > > transition within 1μs from receiving it, accumulating this latency and
> > > > then we have to wait for the slowest link along the path before
> > > > entering L0 state from L1.
> > > > ...
> > >
> > > > On my specific system:
> > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)
> > > >
> > > >             Exit latency       Acceptable latency
> > > > Tree:       L1       L0s       L1       L0s
> > > > ----------  -------  -----     -------  ------
> > > > 00:01.2     <32 us   -
> > > > | 01:00.0   <32 us   -
> > > > |- 02:03.0  <32 us   -
> > > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > > |
> > > > \- 02:04.0  <32 us   -
> > > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > > >
> > > > 04:00.0's latency is the same as the maximum it allows so as we walk the path
> > > > the first switchs startup latency will pass the acceptable latency limit
> > > > for the link, and as a side-effect it fixes my issues with 03:00.0.
> > > >
> > > > Without this patch, 03:00.0 misbehaves and only gives me ~40 mbit/s over
> > > > links with 6 or more hops. With this patch I'm back to a maximum of ~933
> > > > mbit/s.
> > >
> > > There are two paths here that share a Link:
> > >
> > >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> > >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> > >
> > > 1) The path to the I211 NIC includes four Ports and two Links (the
> > >    connection between 01:00.0 and 02:03.0 is internal Switch routing,
> > >    not a Link).
> >
> > >    The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> > >    <16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
> > >    01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
> > >    to 1 + 32 = 33us of L1 exit latency.
> > >
> > >    The NIC can tolerate up to 64us of L1 exit latency, so it is safe
> > >    to enable L1 for both Links.
> > >
> > > 2) The path to the Realtek device is similar except that the Realtek
> > >    L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> > >    initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
> > >    but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> > >    latency.
> > >
> > >    The Realtek device can only tolerate 64us of latency, so it is not
> > >    safe to enable L1 for both Links.  It should be safe to enable L1
> > >    on the shared link because the exit latency for that link would be
> > >    <32us.
> >
> > 04:00.0:
> > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s
> > unlimited, L1 <64us
> >
> > So maximum latency for the entire link has to be <64 us
> > For the device to leave L1 ASPM takes <64us
> >
> > So the device itself is the slowest entry along the link, which
> > means that nothing else along that path can have ASPM enabled
>
> Yes.  That's what I said above: "it is not safe to enable L1 for both
> Links."  Unless I'm missing something, we agree on that.
>
> I also said that it should be safe to enable L1 on the shared Link
> (from 00:01.2 to 01:00.0) because if the downstream Link is always in
> L0, the exit latency of the shared Link should be <32us, and 04:00.x
> can tolerate 64us.

Exit latency of shared link would be max of link, ie 64 + L1-hops, not 32

> > > > The original code path did:
> > > > 04:00:0-02:04.0 max latency 64    -> ok
> > > > 02:04.0-01:00.0 max latency 32 +1 -> ok
> > > > 01:00.0-00:01.2 max latency 32 +2 -> ok
> > > >
> > > > And thus didn't see any L1 ASPM latency issues.
> > > >
> > > > The new code does:
> > > > 04:00:0-02:04.0 max latency 64    -> ok
> > > > 02:04.0-01:00.0 max latency 64 +1 -> latency exceeded
> > > > 01:00.0-00:01.2 max latency 64 +2 -> latency exceeded
> > >
> > > [Nit: I don't think we should add 1 for the 02:04.0 -- 01:00.0 piece
> > > because that's internal Switch routing, not a Link.  But even without
> > > that extra microsecond, this path does exceed the acceptable latency
> > > since 1 + 64 = 65us, and 04:00.0 can only tolerate 64us.]
> >
> > It does report L1 ASPM on both ends, so the links will be counted as
> > such in the code.
>
> This is a bit of a tangent and we shouldn't get too wrapped up in it.
> This is a confusing aspect of PCIe.  We're talking about this path:
>
>   00:01.2 --- [01:00.0 -- 02:04.0] --- 04:00.x multifunction Realtek
>
> This path only contains two Links.  The first one is
> 00:01.2 --- 01:00.0, and the second one is 02:04.0 --- 04:00.x.
>
> 01:00.0 is a Switch Upstream Port and 02:04.0 is a Switch Downstream
> Port.  The connection between them is not a Link; it is some internal
> wiring of the Switch that is completely opaque to software.
>
> The ASPM information and knobs in 01:00.0 apply to the Link on its
> upstream side, and the ASPM info and knobs in 02:04.0 apply to the
> Link on its downstream side.
>
> The example in sec 5.4.1.2.2 contains three Links.  The L1 exit latency
> for the Link is the max of the exit latencies at each end:
>
>   Link 1: max(32, 8) = 32us
>   Link 2: max(8, 32) = 32us
>   Link 3: max(32, 8) = 32us
>
> The total delay for a TLP starting at the downstream end of Link 3
> is 32 + 2 = 34us.
>
> In the path to your 04:00.x Realtek device:
>
>   Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us
>   Link 2 (from 02:04.0 to 04:00.x): max(32, 64) = 64us
>
> If L1 were enabled on both Links, the exit latency would be 64 + 1 =
> 65us.

So one line to be removed from the changelog, I assume... And yes, the
code handles that - first disable is 01:00.0 <-> 00:01.2

> > I also assume that it can power down individual ports... and enter
> > rest state if no links are up.
>
> I don't think this is quite true -- a Link can't enter L1 unless the
> Ports on both ends have L1 enabled, so I don't think it makes sense to
> talk about an individual Port being in L1.
>
> > > > It correctly identifies the issue.
> > > >
> > > > For reference, pcie information:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> > >
> > > The "lspci without my patches" [1] shows L1 enabled for the shared
> > > Link from 00:01.2 --- 01:00.0 and for the Link to 03:00.0 (I211), but
> > > not for the Link to 04:00.x (Realtek).
> > >
> > > Per my analysis above, that looks like it *should* be a safe
> > > configuration.  03:00.0 can tolerate 64us, actual is <33us.  04:00.0
> > > can tolerate 64us, actual should be <32us since only the shared Link
> > > is in L1.
> >
> > See above.
>
> As I said above, if we enabled L1 only on the shared Link from 00:01.2
> to 01:00.0, the exit latency should be acceptable.  In that case, a
> TLP from 04:00.x would see only 32us of latency:
>
>   Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us
>
> and 04:00.x can tolerate 64us.

But, again, you're completely ignoring the full link, ie 04:00.x would
also have to power on.

> > > However, the commit log at [2] shows L1 *enabled* for both the shared
> > > Link from 00:01.2 --- 01:00.0 and the 02:04.0 --- 04:00.x Link, and
> > > that would definitely be a problem.
> > >
> > > Can you explain the differences between [1] and [2]?
> >
> > I don't understand which sections you're referring to.
>
> [1] is the "lspci without my patches" attachment of bugzilla #209725,
> which is supposed to show the problem this patch solves.  We're
> talking about the path to 04:00.x, and [1] show this:
>
>   01:00.2 L1+
>   01:00.0 L1+
>   02:04.0 L1-
>   04:00.0 L1-
>
> AFAICT, that should be a legal configuration as far as 04:00.0 is
> concerned, so it's not a reason for this patch.

Actually, no, maximum path latency 64us

04:00.0 wakeup latency == 64us

Again, as stated, it can't be behind any sleeping L1 links

> [2] is a previous posting of this same patch, and its commit log
> includes information about the same path to 04:00.x, but the "LnkCtl
> Before" column shows:
>
>   01:00.2 L1+
>   01:00.0 L1+
>   02:04.0 L1+
>   04:00.0 L1+
>
> I don't know why [1] shows L1 disabled on the downstream Link, while
> [2] shows L1 *enabled* on the same Link.

From the data they look switched.

> > > > Kai-Heng Feng has a machine that will not boot with ASPM without
> > > > this patch, information is documented here:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=209671
> > >
> > > I started working through this info, too, but there's not enough
> > > information to tell what difference this patch makes.  The attachments
> > > compare:
> > >
> > >   1) CONFIG_PCIEASPM_DEFAULT=y without the patch [3] and
> > >   2) CONFIG_PCIEASPM_POWERSAVE=y *with* the patch [4]
> > >
> > > Obviously CONFIG_PCIEASPM_POWERSAVE=y will configure things
> > > differently than CONFIG_PCIEASPM_DEFAULT=y, so we can't tell what
> > > changes are due to the config change and what are due to the patch.
> > >
> > > The lspci *with* the patch ([4]) shows L0s and L1 enabled at almost
> > > every possible place.  Here are the Links, how they're configured, and
> > > my analysis of the exit latencies vs acceptable latencies:
> > >
> > >   00:01.1 --- 01:00.0      L1+ (                  L1 <64us vs unl)
> > >   00:01.2 --- 02:00.0      L1+ (                  L1 <64us vs 64us)
> > >   00:01.3 --- 03:00.0      L1+ (                  L1 <64us vs 64us)
> > >   00:01.4 --- 04:00.0      L1+ (                  L1 <64us vs unl)
> > >   00:08.1 --- 05:00.x L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> > >   00:08.2 --- 06:00.0 L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> > >
> > > So I can't tell what change prevents the freeze.  I would expect the
> > > patch would cause us to *disable* L0s or L1 somewhere.
> > >
> > > The only place [4] shows ASPM disabled is for 05:00.1.  The spec says
> > > we should program the same value in all functions of a multi-function
> > > device.  This is a non-ARI device, so "only capabilities enabled in
> > > all functions are enabled for the component as a whole."  That would
> > > mean that L0s and L1 are effectively disabled for 05:00.x even though
> > > 05:00.0 claims they're enabled.  But the latencies say ASPM L0s and L1
> > > should be safe to be enabled.  This looks like another bug that's
> > > probably unrelated.
> >
> > I don't think it's unrelated, i suspect it's how PCIe works with
> > multiple links...  a device can cause some kind of head of queue
> > stalling - i don't know how but it really looks like it.
>
> The text in quotes above is straight out of the spec (PCIe r5.0, sec
> 7.5.3.7).  Either the device works that way or it's not compliant.
>
> The OS configures ASPM based on the requirements and capabilities
> advertised by the device.  If a device has any head of queue stalling
> or similar issues, those must be comprehended in the numbers
> advertised by the device.  It's not up to the OS to speculate about
> issues like that.
>
> > > The patch might be correct; I haven't actually analyzed the code.  But
> > > the commit log doesn't make sense to me yet.
> >
> > I personally don't think that all this PCI information is required,
> > the linux kernel is currently doing it wrong according to the spec.
>
> We're trying to establish exactly *what* Linux is doing wrong.  So far
> we don't have a good explanation of that.

Yes we do: Linux counts hops + max for each individual "link", while what
should be done is counting hops + max for the whole path.
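
To make that concrete, here is a minimal standalone sketch (my
illustration, not the aspm.c code itself) of the two checks, fed with
the three hops from the changelog tables for the 04:00.x path, in
microseconds and listed from the endpoint towards the root:

#include <stdio.h>

int main(void)
{
        unsigned int link_l1[] = { 64, 32, 32 };  /* max(up, dw) per hop */
        unsigned int acceptable = 64;             /* 04:00.0: L1 <64us */
        unsigned int n = sizeof(link_l1) / sizeof(link_l1[0]);
        unsigned int i, sw, path_max;

        /* old check: each hop is judged only on its own exit latency */
        for (i = 0, sw = 0; i < n; i++, sw++)
                printf("old hop %u: %2u + %u -> %s\n", i, link_l1[i], sw,
                       link_l1[i] + sw > acceptable ? "exceeded" : "ok");

        /* new check: carry the largest exit latency seen so far */
        for (i = 0, sw = 0, path_max = 0; i < n; i++, sw++) {
                if (link_l1[i] > path_max)
                        path_max = link_l1[i];
                printf("new hop %u: %2u + %u -> %s\n", i, path_max, sw,
                       path_max + sw > acceptable ? "exceeded" : "ok");
        }
        return 0;
}

The old loop accepts every hop (64, 32+1 and 32+2 all fit in 64us),
while the path-max loop flags the two upper hops (64+1 and 64+2 exceed
64us), matching the tables in the changelog.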

> Based on [1], in the path to 03:00.0, both Links have L1 enabled, with
> an exit latency of <33us, and 03:00.0 can tolerate 64us.  That should
> work fine.
>
> Also based on [1], in the path to 04:00.x, the upstream Link has L1
> enabled and the downstream Link has L1 disabled, for an exit latency
> of <32us, and 04:00.0 can tolerate 64us.  That should also work fine.

Again, ignoring the exit latency for 04:00.0

> (Alternately, disabling L1 on the upstream Link and enabling it on the
> downstream Link should have an exit latency of <64us and 04:00.0 can
> tolerate 64us, so that should work fine, too.)

Then nothing else can have L1 ASPM enabled

> > Also, since it's clearly doing the wrong thing, I'm worried that
> > dists will take a kernel, enable ASPM, and there will be a lot of
> > bug reports of non-booting systems or other weird issues... And the
> > culprit was known all along.
>
> There's clearly a problem on your system, but I don't know yet whether
> Linux is doing something wrong, a device in your system is designed
> incorrectly, or a device is designed correctly but the instance in
> your system is defective.

According to the spec it is: there is an explanation of how to
calculate the exit latency, and when you implement that - which I did
before knowing the actual spec - then it works...

> > It's been five months...
>
> I apologize for the delay.  ASPM is a subtle area of PCIe, the Linux
> code is complicated, and we have a long history of issues with it.  I
> want to fix the problem, but I want to make sure we do it in a way
> that matches the spec so the fix applies to all systems.  I don't want
> a magic fix that fixes your system in a way I don't quite understand.

> Obviously *you* understand this, so hopefully it's just a matter of
> pounding it through my thick skull :)

I only understand what I've been forced to understand - and I do
leverage the existing code without knowing what it does underneath.
I only look at each link's maximum latency and make sure that I keep
the maximum latency along the whole path, not just link for link.

Once you realise that the max allowed latency is buffer dependent,
this becomes obviously correct - and then the PCIe spec showed it as
being correct as well... so...


> > > [1] https://bugzilla.kernel.org/attachment.cgi?id=293047
> > > [2] https://lore.kernel.org/linux-pci/20201007132808.647589-1-ian.kumlien@gmail.com/
> > > [3] https://bugzilla.kernel.org/attachment.cgi?id=292955
> > > [4] https://bugzilla.kernel.org/attachment.cgi?id=292957
> > >
> > > > Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com>
> > > > Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > ---
> > > >  drivers/pci/pcie/aspm.c | 22 ++++++++++++++--------
> > > >  1 file changed, 14 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > index 253c30cc1967..c03ead0f1013 100644
> > > > --- a/drivers/pci/pcie/aspm.c
> > > > +++ b/drivers/pci/pcie/aspm.c
> > > > @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
> > > >
> > > >  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > >  {
> > > > -     u32 latency, l1_switch_latency = 0;
> > > > +     u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
> > > >       struct aspm_latency *acceptable;
> > > >       struct pcie_link_state *link;
> > > >
> > > > @@ -456,10 +456,14 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > >               if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
> > > >                   (link->latency_dw.l0s > acceptable->l0s))
> > > >                       link->aspm_capable &= ~ASPM_STATE_L0S_DW;
> > > > +
> > > >               /*
> > > >                * Check L1 latency.
> > > > -              * Every switch on the path to root complex need 1
> > > > -              * more microsecond for L1. Spec doesn't mention L0s.
> > > > +              *
> > > > +              * PCIe r5.0, sec 5.4.1.2.2 states:
> > > > +              * A Switch is required to initiate an L1 exit transition on its
> > > > +              * Upstream Port Link after no more than 1 μs from the beginning of an
> > > > +              * L1 exit transition on any of its Downstream Port Links.
> > > >                *
> > > >                * The exit latencies for L1 substates are not advertised
> > > >                * by a device.  Since the spec also doesn't mention a way
> > > > @@ -469,11 +473,13 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > >                * L1 exit latencies advertised by a device include L1
> > > >                * substate latencies (and hence do not do any check).
> > > >                */
> > > > -             latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > -             if ((link->aspm_capable & ASPM_STATE_L1) &&
> > > > -                 (latency + l1_switch_latency > acceptable->l1))
> > > > -                     link->aspm_capable &= ~ASPM_STATE_L1;
> > > > -             l1_switch_latency += 1000;
> > > > +             if (link->aspm_capable & ASPM_STATE_L1) {
> > > > +                     latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > +                     l1_max_latency = max_t(u32, latency, l1_max_latency);
> > > > +                     if (l1_max_latency + l1_switch_latency > acceptable->l1)
> > > > +                             link->aspm_capable &= ~ASPM_STATE_L1;
> > > > +                     l1_switch_latency += 1000;
> > > > +             }
> > > >
> > > >               link = link->parent;
> > > >       }
> > > > --
> > > > 2.29.1
> > > >
Bjorn Helgaas Dec. 14, 2020, 2:02 p.m. UTC | #7
On Mon, Dec 14, 2020 at 10:14:18AM +0100, Ian Kumlien wrote:
> On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
> > issue with I211 or Realtek NICs.  Beginning of thread:
> > https://lore.kernel.org/r/20201024205548.1837770-1-ian.kumlien@gmail.com
> >
> > Short story: Ian has:
> >
> >   Root Port --- Switch --- I211 NIC
> >                        \-- multifunction Realtek NIC, etc
> >
> > and the I211 performance is poor with ASPM L1 enabled on both links
> > in the path to it.  The patch here disables ASPM on the upstream link
> > and fixes the performance, but AFAICT the devices in that path give us
> > no reason to disable L1.  If I understand the spec correctly, the
> > Realtek device should not be relevant to the I211 path.]
> >
> > On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> > > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > > > Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> > > > > "5.4.1.2.2. Exit from the L1 State"
> > > > >
> > > > > Which makes it clear that each switch is required to
> > > > > initiate a transition within 1μs from receiving it,
> > > > > accumulating this latency and then we have to wait for the
> > > > > slowest link along the path before entering L0 state from
> > > > > L1.
> > > > > ...
> > > >
> > > > > On my specific system:
> > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> > > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)
> > > > >
> > > > >             Exit latency       Acceptable latency
> > > > > Tree:       L1       L0s       L1       L0s
> > > > > ----------  -------  -----     -------  ------
> > > > > 00:01.2     <32 us   -
> > > > > | 01:00.0   <32 us   -
> > > > > |- 02:03.0  <32 us   -
> > > > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > > > |
> > > > > \- 02:04.0  <32 us   -
> > > > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > > > >
> > > > > 04:00.0's latency is the same as the maximum it allows so as
> > > > > we walk the path the first switchs startup latency will pass
> > > > > the acceptable latency limit for the link, and as a
> > > > > side-effect it fixes my issues with 03:00.0.
> > > > >
> > > > > Without this patch, 03:00.0 misbehaves and only gives me ~40
> > > > > mbit/s over links with 6 or more hops. With this patch I'm
> > > > > back to a maximum of ~933 mbit/s.
> > > >
> > > > There are two paths here that share a Link:
> > > >
> > > >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> > > >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> > > >
> > > > 1) The path to the I211 NIC includes four Ports and two Links (the
> > > >    connection between 01:00.0 and 02:03.0 is internal Switch routing,
> > > >    not a Link).
> > >
> > > >    The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> > > >    <16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
> > > >    01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
> > > >    to 1 + 32 = 33us of L1 exit latency.
> > > >
> > > >    The NIC can tolerate up to 64us of L1 exit latency, so it is safe
> > > >    to enable L1 for both Links.
> > > >
> > > > 2) The path to the Realtek device is similar except that the Realtek
> > > >    L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> > > >    initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
> > > >    but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> > > >    latency.
> > > >
> > > >    The Realtek device can only tolerate 64us of latency, so it is not
> > > >    safe to enable L1 for both Links.  It should be safe to enable L1
> > > >    on the shared link because the exit latency for that link would be
> > > >    <32us.
> > >
> > > 04:00.0:
> > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s
> > > unlimited, L1 <64us
> > >
> > > So maximum latency for the entire link has to be <64 us
> > > For the device to leave L1 ASPM takes <64us
> > >
> > > So the device itself is the slowest entry along the link, which
> > > means that nothing else along that path can have ASPM enabled
> >
> > Yes.  That's what I said above: "it is not safe to enable L1 for both
> > Links."  Unless I'm missing something, we agree on that.
> >
> > I also said that it should be safe to enable L1 on the shared Link
> > (from 00:01.2 to 01:00.0) because if the downstream Link is always in
> > L0, the exit latency of the shared Link should be <32us, and 04:00.x
> > can tolerate 64us.
> 
> Exit latency of shared link would be max of link, ie 64 + L1-hops, not 32

I don't think this is true.  The path from 00:01.2 to 04:00.x includes
two Links, and they are independent.  The exit latency for each Link
depends only on the Port at each end:

  Link 1 (depends on 00:01.2 and 01:00.0): max(32, 32) = 32us
  Link 2 (depends on 02:04.0 and 04:00.x): max(32, 64) = 64us

If L1 is enabled for Link 1 and disabled for Link 2, Link 2 will
remain in L0 so it has no L1 exit latency, and the exit latency of
the entire path should be 32us.  
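
To spell that accounting out (an illustration only, not kernel code;
values in microseconds from the lspci output above):

#include <stdio.h>

static unsigned int link_l1(unsigned int up, unsigned int dw)
{
        return up > dw ? up : dw;       /* exit latency of one Link */
}

int main(void)
{
        unsigned int link1 = link_l1(32, 32);   /* 00:01.2 --- 01:00.0 */
        unsigned int link2 = link_l1(32, 64);   /* 02:04.0 --- 04:00.x */

        /* L1 on both Links: 1us Switch delay + 64us = 65us > 64us budget */
        printf("both Links in L1:       %uus\n",
               1 + (link1 > link2 ? link1 : link2));

        /* L1 only on the shared Link: Link 2 stays in L0, so 32us */
        printf("only shared Link in L1: %uus\n", link1);
        return 0;
}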

> > > > > The original code path did:
> > > > > 04:00:0-02:04.0 max latency 64    -> ok
> > > > > 02:04.0-01:00.0 max latency 32 +1 -> ok
> > > > > 01:00.0-00:01.2 max latency 32 +2 -> ok
> > > > >
> > > > > And thus didn't see any L1 ASPM latency issues.
> > > > >
> > > > > The new code does:
> > > > > 04:00:0-02:04.0 max latency 64    -> ok
> > > > > 02:04.0-01:00.0 max latency 64 +1 -> latency exceeded
> > > > > 01:00.0-00:01.2 max latency 64 +2 -> latency exceeded
> > > >
> > > > [Nit: I don't think we should add 1 for the 02:04.0 -- 01:00.0 piece
> > > > because that's internal Switch routing, not a Link.  But even without
> > > > that extra microsecond, this path does exceed the acceptable latency
> > > > since 1 + 64 = 65us, and 04:00.0 can only tolerate 64us.]
> > >
> > > It does report L1 ASPM on both ends, so the links will be counted as
> > > such in the code.
> >
> > This is a bit of a tangent and we shouldn't get too wrapped up in it.
> > This is a confusing aspect of PCIe.  We're talking about this path:
> >
> >   00:01.2 --- [01:00.0 -- 02:04.0] --- 04:00.x multifunction Realtek
> >
> > This path only contains two Links.  The first one is
> > 00:01.2 --- 01:00.0, and the second one is 02:04.0 --- 04:00.x.
> >
> > 01:00.0 is a Switch Upstream Port and 02:04.0 is a Switch Downstream
> > Port.  The connection between them is not a Link; it is some internal
> > wiring of the Switch that is completely opaque to software.
> >
> > The ASPM information and knobs in 01:00.0 apply to the Link on its
> > upstream side, and the ASPM info and knobs in 02:04.0 apply to the
> > Link on its downstream side.
> >
> > The example in sec 5.4.1.2.2 contains three Links.  The L1 exit latency
> > for the Link is the max of the exit latencies at each end:
> >
> >   Link 1: max(32, 8) = 32us
> >   Link 2: max(8, 32) = 32us
> >   Link 3: max(32, 8) = 32us
> >
> > The total delay for a TLP starting at the downstream end of Link 3
> > is 32 + 2 = 34us.
> >
> > In the path to your 04:00.x Realtek device:
> >
> >   Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us
> >   Link 2 (from 02:04.0 to 04:00.x): max(32, 64) = 64us
> >
> > If L1 were enabled on both Links, the exit latency would be 64 + 1 =
> > 65us.
> 
> So one line to be removed from the changelog, I assume... And yes, the
> code handles that - first disable is 01:00.0 <-> 00:01.2
> 
> > > I also assume that it can power down individual ports... and enter
> > > rest state if no links are up.
> >
> > I don't think this is quite true -- a Link can't enter L1 unless the
> > Ports on both ends have L1 enabled, so I don't think it makes sense to
> > talk about an individual Port being in L1.
> >
> > > > > It correctly identifies the issue.
> > > > >
> > > > > For reference, pcie information:
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> > > >
> > > > The "lspci without my patches" [1] shows L1 enabled for the shared
> > > > Link from 00:01.2 --- 01:00.0 and for the Link to 03:00.0 (I211), but
> > > > not for the Link to 04:00.x (Realtek).
> > > >
> > > > Per my analysis above, that looks like it *should* be a safe
> > > > configuration.  03:00.0 can tolerate 64us, actual is <33us.  04:00.0
> > > > can tolerate 64us, actual should be <32us since only the shared Link
> > > > is in L1.
> > >
> > > See above.
> >
> > As I said above, if we enabled L1 only on the shared Link from 00:01.2
> > to 01:00.0, the exit latency should be acceptable.  In that case, a
> > TLP from 04:00.x would see only 32us of latency:
> >
> >   Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us
> >
> > and 04:00.x can tolerate 64us.
> 
> But, again, you're completely ignoring the full link, ie 04:00.x would
> also have to power on.

I think you're using "the full link" to refer to the entire path from
00:01.2 to 04:00.x.  In PCIe, a "Link" directly connects two Ports.
It doesn't refer to the entire path.

No, if L1 is disabled on 02:04.0 and 04:00.x (as Linux apparently does
by default), the Link between them never enters L1, so there is no
power-on for this Link.

> > > > However, the commit log at [2] shows L1 *enabled* for both the shared
> > > > Link from 00:01.2 --- 01:00.0 and the 02:04.0 --- 04:00.x Link, and
> > > > that would definitely be a problem.
> > > >
> > > > Can you explain the differences between [1] and [2]?
> > >
> > > I don't understand which sections you're referring to.
> >
> > [1] is the "lspci without my patches" attachment of bugzilla #209725,
> > which is supposed to show the problem this patch solves.  We're
> > talking about the path to 04:00.x, and [1] show this:
> >
> >   01:00.2 L1+               # <-- my typo here, should be 00:01.2
> >   01:00.0 L1+
> >   02:04.0 L1-
> >   04:00.0 L1-
> >
> > AFAICT, that should be a legal configuration as far as 04:00.0 is
> > concerned, so it's not a reason for this patch.
> 
> Actually, no, maximum path latency 64us
> 
> 04:00.0 wakeup latency == 64us
> 
> Again, as stated, it can't be behind any sleeping L1 links

It would be pointless for a device to advertise L1 support if it could
never be used.  04:00.0 advertises that it can tolerate L1 latency of
64us and that it can exit L1 in 64us or less.  So it *can* be behind a
Link in L1 as long as nothing else in the path adds more latency.

> > [2] is a previous posting of this same patch, and its commit log
> > includes information about the same path to 04:00.x, but the "LnkCtl
> > Before" column shows:
> >
> >   01:00.2 L1+               # <-- my typo here, should be 00:01.2
> >   01:00.0 L1+
> >   02:04.0 L1+
> >   04:00.0 L1+
> >
> > I don't know why [1] shows L1 disabled on the downstream Link, while
> > [2] shows L1 *enabled* on the same Link.
> 
> From the data they look switched.
> 
> > > > > Kai-Heng Feng has a machine that will not boot with ASPM without
> > > > > this patch, information is documented here:
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=209671
> > > >
> > > > I started working through this info, too, but there's not enough
> > > > information to tell what difference this patch makes.  The attachments
> > > > compare:
> > > >
> > > >   1) CONFIG_PCIEASPM_DEFAULT=y without the patch [3] and
> > > >   2) CONFIG_PCIEASPM_POWERSAVE=y *with* the patch [4]
> > > >
> > > > Obviously CONFIG_PCIEASPM_POWERSAVE=y will configure things
> > > > differently than CONFIG_PCIEASPM_DEFAULT=y, so we can't tell what
> > > > changes are due to the config change and what are due to the patch.
> > > >
> > > > The lspci *with* the patch ([4]) shows L0s and L1 enabled at almost
> > > > every possible place.  Here are the Links, how they're configured, and
> > > > my analysis of the exit latencies vs acceptable latencies:
> > > >
> > > >   00:01.1 --- 01:00.0      L1+ (                  L1 <64us vs unl)
> > > >   00:01.2 --- 02:00.0      L1+ (                  L1 <64us vs 64us)
> > > >   00:01.3 --- 03:00.0      L1+ (                  L1 <64us vs 64us)
> > > >   00:01.4 --- 04:00.0      L1+ (                  L1 <64us vs unl)
> > > >   00:08.1 --- 05:00.x L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> > > >   00:08.2 --- 06:00.0 L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> > > >
> > > > So I can't tell what change prevents the freeze.  I would expect the
> > > > patch would cause us to *disable* L0s or L1 somewhere.
> > > >
> > > > The only place [4] shows ASPM disabled is for 05:00.1.  The spec says
> > > > we should program the same value in all functions of a multi-function
> > > > device.  This is a non-ARI device, so "only capabilities enabled in
> > > > all functions are enabled for the component as a whole."  That would
> > > > mean that L0s and L1 are effectively disabled for 05:00.x even though
> > > > 05:00.0 claims they're enabled.  But the latencies say ASPM L0s and L1
> > > > should be safe to be enabled.  This looks like another bug that's
> > > > probably unrelated.
> > >
> > > I don't think it's unrelated, i suspect it's how PCIe works with
> > > multiple links...  a device can cause some kind of head of queue
> > > stalling - i don't know how but it really looks like it.
> >
> > The text in quotes above is straight out of the spec (PCIe r5.0, sec
> > 7.5.3.7).  Either the device works that way or it's not compliant.
> >
> > The OS configures ASPM based on the requirements and capabilities
> > advertised by the device.  If a device has any head of queue stalling
> > or similar issues, those must be comprehended in the numbers
> > advertised by the device.  It's not up to the OS to speculate about
> > issues like that.
> >
> > > > The patch might be correct; I haven't actually analyzed the code.  But
> > > > the commit log doesn't make sense to me yet.
> > >
> > > I personally don't think that all this PCI information is required,
> > > the linux kernel is currently doing it wrong according to the spec.
> >
> > We're trying to establish exactly *what* Linux is doing wrong.  So far
> > we don't have a good explanation of that.
> 
> Yes we do: Linux counts hops + max for each individual "link", while what
> should be done is counting hops + max for the whole path.

I think you're saying we need to include L1 exit latency even for
Links where L1 is disabled.  I don't think we should include those.

> > Based on [1], in the path to 03:00.0, both Links have L1 enabled, with
> > an exit latency of <33us, and 03:00.0 can tolerate 64us.  That should
> > work fine.
> >
> > Also based on [1], in the path to 04:00.x, the upstream Link has L1
> > enabled and the downstream Link has L1 disabled, for an exit latency
> > of <32us, and 04:00.0 can tolerate 64us.  That should also work fine.
> 
> Again, ignoring the exit latency for 04:00.0
> 
> > (Alternately, disabling L1 on the upstream Link and enabling it on the
> > downstream Link should have an exit latency of <64us and 04:00.0 can
> > tolerate 64us, so that should work fine, too.)
> 
> Then nothing else can have L1 ASPM enabled

Yes, as I said, we should be able to enable L1 on either of the Links
in the path to 04:00.x, but not both.

The original problem here is not with the Realtek device at 04:00.x
but with the I211 NIC at 03:00.0.  So we also need to figure out what
the connection is.  Does the same I211 performance problem occur if
you remove the Realtek device from the system?

03:00.0 can tolerate 64us of latency, so even if L1 is enabled on both
Links leading to it, the path exit latency would be <33us, which
should be fine.

> > > Also, since it's clearly doing the wrong thing, I'm worried that
> > > dists will take a kernel, enable ASPM, and there will be a lot of
> > > bug reports of non-booting systems or other weird issues... And the
> > > culprit was known all along.
> >
> > There's clearly a problem on your system, but I don't know yet whether
> > Linux is doing something wrong, a device in your system is designed
> > incorrectly, or a device is designed correctly but the instance in
> > your system is defective.
> 
> According to the spec it is: there is an explanation of how to
> calculate the exit latency, and when you implement that - which I did
> before knowing the actual spec - then it works...
> 
> > > It's been five months...
> >
> > I apologize for the delay.  ASPM is a subtle area of PCIe, the Linux
> > code is complicated, and we have a long history of issues with it.  I
> > want to fix the problem, but I want to make sure we do it in a way
> > that matches the spec so the fix applies to all systems.  I don't want
> > a magic fix that fixes your system in a way I don't quite understand.
> 
> > Obviously *you* understand this, so hopefully it's just a matter of
> > pounding it through my thick skull :)
> 
> I only understand what I've been forced to understand - and I do
> leverage the existing code without knowing what it does underneath.
> I only look at each link's maximum latency and make sure that I keep
> the maximum latency along the whole path, not just link for link.
>
> Once you realise that the max allowed latency is buffer dependent,
> this becomes obviously correct - and then the PCIe spec showed it as
> being correct as well... so...
> 
> 
> > > > [1] https://bugzilla.kernel.org/attachment.cgi?id=293047
> > > > [2] https://lore.kernel.org/linux-pci/20201007132808.647589-1-ian.kumlien@gmail.com/
> > > > [3] https://bugzilla.kernel.org/attachment.cgi?id=292955
> > > > [4] https://bugzilla.kernel.org/attachment.cgi?id=292957
> > > >
> > > > > Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com>
> > > > > Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > > ---
> > > > >  drivers/pci/pcie/aspm.c | 22 ++++++++++++++--------
> > > > >  1 file changed, 14 insertions(+), 8 deletions(-)
> > > > >
> > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > index 253c30cc1967..c03ead0f1013 100644
> > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
> > > > >
> > > > >  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > >  {
> > > > > -     u32 latency, l1_switch_latency = 0;
> > > > > +     u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
> > > > >       struct aspm_latency *acceptable;
> > > > >       struct pcie_link_state *link;
> > > > >
> > > > > @@ -456,10 +456,14 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > >               if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
> > > > >                   (link->latency_dw.l0s > acceptable->l0s))
> > > > >                       link->aspm_capable &= ~ASPM_STATE_L0S_DW;
> > > > > +
> > > > >               /*
> > > > >                * Check L1 latency.
> > > > > -              * Every switch on the path to root complex need 1
> > > > > -              * more microsecond for L1. Spec doesn't mention L0s.
> > > > > +              *
> > > > > +              * PCIe r5.0, sec 5.4.1.2.2 states:
> > > > > +              * A Switch is required to initiate an L1 exit transition on its
> > > > > +              * Upstream Port Link after no more than 1 μs from the beginning of an
> > > > > +              * L1 exit transition on any of its Downstream Port Links.
> > > > >                *
> > > > >                * The exit latencies for L1 substates are not advertised
> > > > >                * by a device.  Since the spec also doesn't mention a way
> > > > > @@ -469,11 +473,13 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > >                * L1 exit latencies advertised by a device include L1
> > > > >                * substate latencies (and hence do not do any check).
> > > > >                */
> > > > > -             latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > > -             if ((link->aspm_capable & ASPM_STATE_L1) &&
> > > > > -                 (latency + l1_switch_latency > acceptable->l1))
> > > > > -                     link->aspm_capable &= ~ASPM_STATE_L1;
> > > > > -             l1_switch_latency += 1000;
> > > > > +             if (link->aspm_capable & ASPM_STATE_L1) {
> > > > > +                     latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > > +                     l1_max_latency = max_t(u32, latency, l1_max_latency);
> > > > > +                     if (l1_max_latency + l1_switch_latency > acceptable->l1)
> > > > > +                             link->aspm_capable &= ~ASPM_STATE_L1;
> > > > > +                     l1_switch_latency += 1000;
> > > > > +             }
> > > > >
> > > > >               link = link->parent;
> > > > >       }
> > > > > --
> > > > > 2.29.1
> > > > >
Ian Kumlien Dec. 14, 2020, 3:47 p.m. UTC | #8
On Mon, Dec 14, 2020 at 3:02 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Dec 14, 2020 at 10:14:18AM +0100, Ian Kumlien wrote:
> > On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
> > > issue with I211 or Realtek NICs.  Beginning of thread:
> > > https://lore.kernel.org/r/20201024205548.1837770-1-ian.kumlien@gmail.com
> > >
> > > Short story: Ian has:
> > >
> > >   Root Port --- Switch --- I211 NIC
> > >                        \-- multifunction Realtek NIC, etc
> > >
> > > and the I211 performance is poor with ASPM L1 enabled on both links
> > > in the path to it.  The patch here disables ASPM on the upstream link
> > > and fixes the performance, but AFAICT the devices in that path give us
> > > no reason to disable L1.  If I understand the spec correctly, the
> > > Realtek device should not be relevant to the I211 path.]
> > >
> > > On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> > > > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > > > > Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> > > > > > "5.4.1.2.2. Exit from the L1 State"
> > > > > >
> > > > > > Which makes it clear that each switch is required to
> > > > > > initiate a transition within 1μs from receiving it,
> > > > > > accumulating this latency and then we have to wait for the
> > > > > > slowest link along the path before entering L0 state from
> > > > > > L1.
> > > > > > ...
> > > > >
> > > > > > On my specific system:
> > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> > > > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)
> > > > > >
> > > > > >             Exit latency       Acceptable latency
> > > > > > Tree:       L1       L0s       L1       L0s
> > > > > > ----------  -------  -----     -------  ------
> > > > > > 00:01.2     <32 us   -
> > > > > > | 01:00.0   <32 us   -
> > > > > > |- 02:03.0  <32 us   -
> > > > > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > > > > |
> > > > > > \- 02:04.0  <32 us   -
> > > > > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > > > > >
> > > > > > 04:00.0's latency is the same as the maximum it allows so as
> > > > > > we walk the path the first switchs startup latency will pass
> > > > > > the acceptable latency limit for the link, and as a
> > > > > > side-effect it fixes my issues with 03:00.0.
> > > > > >
> > > > > > Without this patch, 03:00.0 misbehaves and only gives me ~40
> > > > > > mbit/s over links with 6 or more hops. With this patch I'm
> > > > > > back to a maximum of ~933 mbit/s.
> > > > >
> > > > > There are two paths here that share a Link:
> > > > >
> > > > >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> > > > >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> > > > >
> > > > > 1) The path to the I211 NIC includes four Ports and two Links (the
> > > > >    connection between 01:00.0 and 02:03.0 is internal Switch routing,
> > > > >    not a Link).
> > > >
> > > > >    The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> > > > >    <16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
> > > > >    01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
> > > > >    to 1 + 32 = 33us of L1 exit latency.
> > > > >
> > > > >    The NIC can tolerate up to 64us of L1 exit latency, so it is safe
> > > > >    to enable L1 for both Links.
> > > > >
> > > > > 2) The path to the Realtek device is similar except that the Realtek
> > > > >    L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> > > > >    initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
> > > > >    but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> > > > >    latency.
> > > > >
> > > > >    The Realtek device can only tolerate 64us of latency, so it is not
> > > > >    safe to enable L1 for both Links.  It should be safe to enable L1
> > > > >    on the shared link because the exit latency for that link would be
> > > > >    <32us.
> > > >
> > > > 04:00.0:
> > > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > > LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s
> > > > unlimited, L1 <64us
> > > >
> > > > So maximum latency for the entire link has to be <64 us
> > > > For the device to leave L1 ASPM takes <64us
> > > >
> > > > So the device itself is the slowest entry along the link, which
> > > > means that nothing else along that path can have ASPM enabled
> > >
> > > Yes.  That's what I said above: "it is not safe to enable L1 for both
> > > Links."  Unless I'm missing something, we agree on that.
> > >
> > > I also said that it should be safe to enable L1 on the shared Link
> > > (from 00:01.2 to 01:00.0) because if the downstream Link is always in
> > > L0, the exit latency of the shared Link should be <32us, and 04:00.x
> > > can tolerate 64us.
> >
> > Exit latency of shared link would be max of link, ie 64 + L1-hops, not 32
>
> I don't think this is true.  The path from 00:01.2 to 04:00.x includes
> two Links, and they are independent.  The exit latency for each Link
> depends only on the Port at each end:

The full path is what is important, because that is the actual latency
(which the current Linux code doesn't account for).

>   Link 1 (depends on 00:01.2 and 01:00.0): max(32, 32) = 32us
>   Link 2 (depends on 02:04.0 and 04:00.x): max(32, 64) = 64us
>
> If L1 is enabled for Link 1 and disabled for Link 2, Link 2 will
> remain in L0 so it has no L1 exit latency, and the exit latency of
> the entire path should be 32us.

My patch disables this, so yes.

> > > > > > The original code path did:
> > > > > > 04:00:0-02:04.0 max latency 64    -> ok
> > > > > > 02:04.0-01:00.0 max latency 32 +1 -> ok
> > > > > > 01:00.0-00:01.2 max latency 32 +2 -> ok
> > > > > >
> > > > > > And thus didn't see any L1 ASPM latency issues.
> > > > > >
> > > > > > The new code does:
> > > > > > 04:00:0-02:04.0 max latency 64    -> ok
> > > > > > 02:04.0-01:00.0 max latency 64 +1 -> latency exceeded
> > > > > > 01:00.0-00:01.2 max latency 64 +2 -> latency exceeded
> > > > >
> > > > > [Nit: I don't think we should add 1 for the 02:04.0 -- 01:00.0 piece
> > > > > because that's internal Switch routing, not a Link.  But even without
> > > > > that extra microsecond, this path does exceed the acceptable latency
> > > > > since 1 + 64 = 65us, and 04:00.0 can only tolerate 64us.]
> > > >
> > > > It does report L1 ASPM on both ends, so the links will be counted as
> > > > such in the code.
> > >
> > > This is a bit of a tangent and we shouldn't get too wrapped up in it.
> > > This is a confusing aspect of PCIe.  We're talking about this path:
> > >
> > >   00:01.2 --- [01:00.0 -- 02:04.0] --- 04:00.x multifunction Realtek
> > >
> > > This path only contains two Links.  The first one is
> > > 00:01.2 --- 01:00.0, and the second one is 02:04.0 --- 04:00.x.
> > >
> > > 01:00.0 is a Switch Upstream Port and 02:04.0 is a Switch Downstream
> > > Port.  The connection between them is not a Link; it is some internal
> > > wiring of the Switch that is completely opaque to software.
> > >
> > > The ASPM information and knobs in 01:00.0 apply to the Link on its
> > > upstream side, and the ASPM info and knobs in 02:04.0 apply to the
> > > Link on its downstream side.
> > >
> > > The example in sec 5.4.1.2.2 contains three Links.  The L1 exit latency
> > > for the Link is the max of the exit latencies at each end:
> > >
> > >   Link 1: max(32, 8) = 32us
> > >   Link 2: max(8, 32) = 32us
> > >   Link 3: max(32, 8) = 32us
> > >
> > > The total delay for a TLP starting at the downstream end of Link 3
> > > is 32 + 2 = 34us.
> > >
> > > In the path to your 04:00.x Realtek device:
> > >
> > >   Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us
> > >   Link 2 (from 02:04.0 to 04:00.x): max(32, 64) = 64us
> > >
> > > If L1 were enabled on both Links, the exit latency would be 64 + 1 =
> > > 65us.
> >
> > So one line to be removed from the changelog, I assume... And yes, the
> > code handles that - first disable is 01:00.0 <-> 00:01.2
> >
> > > > I also assume that it can power down individual ports... and enter
> > > > rest state if no links are up.
> > >
> > > I don't think this is quite true -- a Link can't enter L1 unless the
> > > Ports on both ends have L1 enabled, so I don't think it makes sense to
> > > talk about an individual Port being in L1.
> > >
> > > > > > It correctly identifies the issue.
> > > > > >
> > > > > > For reference, pcie information:
> > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> > > > >
> > > > > The "lspci without my patches" [1] shows L1 enabled for the shared
> > > > > Link from 00:01.2 --- 01:00.0 and for the Link to 03:00.0 (I211), but
> > > > > not for the Link to 04:00.x (Realtek).
> > > > >
> > > > > Per my analysis above, that looks like it *should* be a safe
> > > > > configuration.  03:00.0 can tolerate 64us, actual is <33us.  04:00.0
> > > > > can tolerate 64us, actual should be <32us since only the shared Link
> > > > > is in L1.
> > > >
> > > > See above.
> > >
> > > As I said above, if we enabled L1 only on the shared Link from 00:01.2
> > > to 01:00.0, the exit latency should be acceptable.  In that case, a
> > > TLP from 04:00.x would see only 32us of latency:
> > >
> > >   Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us
> > >
> > > and 04:00.x can tolerate 64us.
> >
> > But, again, you're completely ignoring the full link, ie 04:00.x would
> > also have to power on.
>
> I think you're using "the full link" to refer to the entire path from
> 00:01.2 to 04:00.x.  In PCIe, a "Link" directly connects two Ports.
> It doesn't refer to the entire path.
>
> No, if L1 is disabled on 02:04.0 and 04:00.x (as Linux apparently does
> by default), the Link between them never enters L1, so there is no
> power-on for this Link.

It doesn't do it by default; my patch does.

> > > > > However, the commit log at [2] shows L1 *enabled* for both the shared
> > > > > Link from 00:01.2 --- 01:00.0 and the 02:04.0 --- 04:00.x Link, and
> > > > > that would definitely be a problem.
> > > > >
> > > > > Can you explain the differences between [1] and [2]?
> > > >
> > > > I don't understand which sections you're referring to.
> > >
> > > [1] is the "lspci without my patches" attachment of bugzilla #209725,
> > > which is supposed to show the problem this patch solves.  We're
> > > talking about the path to 04:00.x, and [1] show this:
> > >
> > >   01:00.2 L1+               # <-- my typo here, should be 00:01.2
> > >   01:00.0 L1+
> > >   02:04.0 L1-
> > >   04:00.0 L1-
> > >
> > > AFAICT, that should be a legal configuration as far as 04:00.0 is
> > > concerned, so it's not a reason for this patch.
> >
> > Actually, no, maximum path latency 64us
> >
> > 04:00.0 wakeup latency == 64us
> >
> > Again, as stated, it can't be behind any sleeping L1 links
>
> It would be pointless for a device to advertise L1 support if it could
> never be used.  04:00.0 advertises that it can tolerate L1 latency of
> 64us and that it can exit L1 in 64us or less.  So it *can* be behind a
> Link in L1 as long as nothing else in the path adds more latency.

Yes, as long as nothing along the entire path adds latency - and since I
didn't make the component, I can only go by what it states, and we have
to handle it.

> > > [2] is a previous posting of this same patch, and its commit log
> > > includes information about the same path to 04:00.x, but the "LnkCtl
> > > Before" column shows:
> > >
> > >   01:00.2 L1+               # <-- my typo here, should be 00:01.2
> > >   01:00.0 L1+
> > >   02:04.0 L1+
> > >   04:00.0 L1+
> > >
> > > I don't know why [1] shows L1 disabled on the downstream Link, while
> > > [2] shows L1 *enabled* on the same Link.
> >
> > From the data they look switched.
> >
> > > > > > Kai-Heng Feng has a machine that will not boot with ASPM without
> > > > > > this patch, information is documented here:
> > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=209671
> > > > >
> > > > > I started working through this info, too, but there's not enough
> > > > > information to tell what difference this patch makes.  The attachments
> > > > > compare:
> > > > >
> > > > >   1) CONFIG_PCIEASPM_DEFAULT=y without the patch [3] and
> > > > >   2) CONFIG_PCIEASPM_POWERSAVE=y *with* the patch [4]
> > > > >
> > > > > Obviously CONFIG_PCIEASPM_POWERSAVE=y will configure things
> > > > > differently than CONFIG_PCIEASPM_DEFAULT=y, so we can't tell what
> > > > > changes are due to the config change and what are due to the patch.
> > > > >
> > > > > The lspci *with* the patch ([4]) shows L0s and L1 enabled at almost
> > > > > every possible place.  Here are the Links, how they're configured, and
> > > > > my analysis of the exit latencies vs acceptable latencies:
> > > > >
> > > > >   00:01.1 --- 01:00.0      L1+ (                  L1 <64us vs unl)
> > > > >   00:01.2 --- 02:00.0      L1+ (                  L1 <64us vs 64us)
> > > > >   00:01.3 --- 03:00.0      L1+ (                  L1 <64us vs 64us)
> > > > >   00:01.4 --- 04:00.0      L1+ (                  L1 <64us vs unl)
> > > > >   00:08.1 --- 05:00.x L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> > > > >   00:08.2 --- 06:00.0 L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> > > > >
> > > > > So I can't tell what change prevents the freeze.  I would expect the
> > > > > patch would cause us to *disable* L0s or L1 somewhere.
> > > > >
> > > > > The only place [4] shows ASPM disabled is for 05:00.1.  The spec says
> > > > > we should program the same value in all functions of a multi-function
> > > > > device.  This is a non-ARI device, so "only capabilities enabled in
> > > > > all functions are enabled for the component as a whole."  That would
> > > > > mean that L0s and L1 are effectively disabled for 05:00.x even though
> > > > > 05:00.0 claims they're enabled.  But the latencies say ASPM L0s and L1
> > > > > should be safe to be enabled.  This looks like another bug that's
> > > > > probably unrelated.
> > > >
> > > > I don't think it's unrelated, i suspect it's how PCIe works with
> > > > multiple links...  a device can cause some kind of head of queue
> > > > stalling - i don't know how but it really looks like it.
> > >
> > > The text in quotes above is straight out of the spec (PCIe r5.0, sec
> > > 7.5.3.7).  Either the device works that way or it's not compliant.
> > >
> > > The OS configures ASPM based on the requirements and capabilities
> > > advertised by the device.  If a device has any head of queue stalling
> > > or similar issues, those must be comprehended in the numbers
> > > advertised by the device.  It's not up to the OS to speculate about
> > > issues like that.
> > >
> > > > > The patch might be correct; I haven't actually analyzed the code.  But
> > > > > the commit log doesn't make sense to me yet.
> > > >
> > > > I personally don't think that all this PCI information is required,
> > > > the linux kernel is currently doing it wrong according to the spec.
> > >
> > > We're trying to establish exactly *what* Linux is doing wrong.  So far
> > > we don't have a good explanation of that.
> >
> > Yes we do: Linux counts hops + max for each individual "link", while what
> > should be done is counting hops + max for the whole path.
>
> I think you're saying we need to include L1 exit latency even for
> Links where L1 is disabled.  I don't think we should include those.

Nope, the code does not do that; it only adds the L1 latency for L1-enabled hops.

> > > Based on [1], in the path to 03:00.0, both Links have L1 enabled, with
> > > an exit latency of <33us, and 03:00.0 can tolerate 64us.  That should
> > > work fine.
> > >
> > > Also based on [1], in the path to 04:00.x, the upstream Link has L1
> > > enabled and the downstream Link has L1 disabled, for an exit latency
> > > of <32us, and 04:00.0 can tolerate 64us.  That should also work fine.
> >
> > Again, ignoring the exit latency for 04:00.0
> >
> > > (Alternately, disabling L1 on the upstream Link and enabling it on the
> > > downstream Link should have an exit latency of <64us and 04:00.0 can
> > > tolerate 64us, so that should work fine, too.)
> >
> > Then nothing else can have L1 ASPM enabled
>
> Yes, as I said, we should be able to enable L1 on either of the Links
> in the path to 04:00.x, but not both.

The code works backwards from the endpoint and disables the first hop
that exceeds the latency requirements.  We could argue that it should
be smarter about it and disable the minimum number of links while still
staying within the acceptable latency, but... it is what it is, and it
works when patched.
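
For the 04:00.x path that walk looks roughly like this (a sketch of the
patched check, not aspm.c itself, over the two real Links only):

#include <stdio.h>

int main(void)
{
        const char *name[] = { "02:04.0 -- 04:00.x", "00:01.2 -- 01:00.0" };
        unsigned int l1[] = { 64, 32 };         /* per-Link exit latency, us */
        unsigned int acceptable = 64, path_max = 0, sw, i;

        for (i = 0, sw = 0; i < 2; i++, sw++) {
                if (l1[i] > path_max)
                        path_max = l1[i];
                printf("%s: %u + %u us -> %s\n", name[i], path_max, sw,
                       path_max + sw > acceptable ? "disable L1" : "keep L1");
        }
        return 0;
}

So the shared 00:01.2 <-> 01:00.0 Link is the one that loses L1; a
smarter policy might instead keep that Link in L1 and disable only the
02:04.0 <-> 04:00.x Link, which per your numbers above would also fit
the 64us budget.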

> The original problem here is not with the Realtek device at 04:00.x
> but with the I211 NIC at 03:00.0.  So we also need to figure out what
> the connection is.  Does the same I211 performance problem occur if
> you remove the Realtek device from the system?

It's mounted on the motherboard, so no, I can't remove it.

> 03:00.0 can tolerate 64us of latency, so even if L1 is enabled on both
> Links leading to it, the path exit latency would be <33us, which
> should be fine.

Yes, it "should be" but due to broken ASPM latency calculations we
have some kind of
side effect that triggers a racecondition/sideeffect/bug that causes
it to misbehave.

Since fixing the latency calculation fixes it, I'll leave the rest to
someone with a logic analyzer and a die-hard fetish for PCIe links -
I can't debug it.

> > > > Also, since it's clearly doing the wrong thing, I'm worried that
> > > > dists will take a kernel, enable ASPM, and there will be a lot of
> > > > bug reports of non-booting systems or other weird issues... And the
> > > > culprit was known all along.
> > >
> > > There's clearly a problem on your system, but I don't know yet whether
> > > Linux is doing something wrong, a device in your system is designed
> > > incorrectly, or a device is designed correctly but the instance in
> > > your system is defective.
> >
> > According to the spec it is: there is an explanation of how to
> > calculate the exit latency, and when you implement that - which I did
> > before knowing the actual spec - then it works...
> >
> > > > It's been five months...
> > >
> > > I apologize for the delay.  ASPM is a subtle area of PCIe, the Linux
> > > code is complicated, and we have a long history of issues with it.  I
> > > want to fix the problem, but I want to make sure we do it in a way
> > > that matches the spec so the fix applies to all systems.  I don't want
> > > a magic fix that fixes your system in a way I don't quite understand.
> >
> > > Obviously *you* understand this, so hopefully it's just a matter of
> > > pounding it through my thick skull :)
> >
> > I only understand what I've been forced to understand - and I do
> > leverage the existing code without knowing what it does underneath.
> > I only look at each link's maximum latency and make sure that I keep
> > the maximum latency along the whole path, not just link for link.
> >
> > Once you realise that the max allowed latency is buffer dependent,
> > this becomes obviously correct - and then the PCIe spec showed it as
> > being correct as well... so...
> >
> >
> > > > > [1] https://bugzilla.kernel.org/attachment.cgi?id=293047
> > > > > [2] https://lore.kernel.org/linux-pci/20201007132808.647589-1-ian.kumlien@gmail.com/
> > > > > [3] https://bugzilla.kernel.org/attachment.cgi?id=292955
> > > > > [4] https://bugzilla.kernel.org/attachment.cgi?id=292957
> > > > >
> > > > > > Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com>
> > > > > > Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > > > ---
> > > > > >  drivers/pci/pcie/aspm.c | 22 ++++++++++++++--------
> > > > > >  1 file changed, 14 insertions(+), 8 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > index 253c30cc1967..c03ead0f1013 100644
> > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
> > > > > >
> > > > > >  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > > >  {
> > > > > > -     u32 latency, l1_switch_latency = 0;
> > > > > > +     u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
> > > > > >       struct aspm_latency *acceptable;
> > > > > >       struct pcie_link_state *link;
> > > > > >
> > > > > > @@ -456,10 +456,14 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > > >               if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
> > > > > >                   (link->latency_dw.l0s > acceptable->l0s))
> > > > > >                       link->aspm_capable &= ~ASPM_STATE_L0S_DW;
> > > > > > +
> > > > > >               /*
> > > > > >                * Check L1 latency.
> > > > > > -              * Every switch on the path to root complex need 1
> > > > > > -              * more microsecond for L1. Spec doesn't mention L0s.
> > > > > > +              *
> > > > > > +              * PCIe r5.0, sec 5.4.1.2.2 states:
> > > > > > +              * A Switch is required to initiate an L1 exit transition on its
> > > > > > +              * Upstream Port Link after no more than 1 μs from the beginning of an
> > > > > > +              * L1 exit transition on any of its Downstream Port Links.
> > > > > >                *
> > > > > >                * The exit latencies for L1 substates are not advertised
> > > > > >                * by a device.  Since the spec also doesn't mention a way
> > > > > > @@ -469,11 +473,13 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > > >                * L1 exit latencies advertised by a device include L1
> > > > > >                * substate latencies (and hence do not do any check).
> > > > > >                */
> > > > > > -             latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > > > -             if ((link->aspm_capable & ASPM_STATE_L1) &&
> > > > > > -                 (latency + l1_switch_latency > acceptable->l1))
> > > > > > -                     link->aspm_capable &= ~ASPM_STATE_L1;
> > > > > > -             l1_switch_latency += 1000;
> > > > > > +             if (link->aspm_capable & ASPM_STATE_L1) {
> > > > > > +                     latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > > > +                     l1_max_latency = max_t(u32, latency, l1_max_latency);
> > > > > > +                     if (l1_max_latency + l1_switch_latency > acceptable->l1)
> > > > > > +                             link->aspm_capable &= ~ASPM_STATE_L1;
> > > > > > +                     l1_switch_latency += 1000;
> > > > > > +             }
> > > > > >
> > > > > >               link = link->parent;
> > > > > >       }
> > > > > > --
> > > > > > 2.29.1
> > > > > >
Bjorn Helgaas Dec. 14, 2020, 7:19 p.m. UTC | #9
On Mon, Dec 14, 2020 at 04:47:32PM +0100, Ian Kumlien wrote:
> On Mon, Dec 14, 2020 at 3:02 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, Dec 14, 2020 at 10:14:18AM +0100, Ian Kumlien wrote:
> > > On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > >
> > > > [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
> > > > issue with I211 or Realtek NICs.  Beginning of thread:
> > > > https://lore.kernel.org/r/20201024205548.1837770-1-ian.kumlien@gmail.com
> > > >
> > > > Short story: Ian has:
> > > >
> > > >   Root Port --- Switch --- I211 NIC
> > > >                        \-- multifunction Realtek NIC, etc
> > > >
> > > > and the I211 performance is poor with ASPM L1 enabled on both links
> > > > in the path to it.  The patch here disables ASPM on the upstream link
> > > > and fixes the performance, but AFAICT the devices in that path give us
> > > > no reason to disable L1.  If I understand the spec correctly, the
> > > > Realtek device should not be relevant to the I211 path.]
> > > >
> > > > On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> > > > > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > > > > > Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> > > > > > > "5.4.1.2.2. Exit from the L1 State"
> > > > > > >
> > > > > > > Which makes it clear that each switch is required to
> > > > > > > initiate a transition within 1μs from receiving it,
> > > > > > > accumulating this latency and then we have to wait for the
> > > > > > > slowest link along the path before entering L0 state from
> > > > > > > L1.
> > > > > > > ...
> > > > > >
> > > > > > > On my specific system:
> > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> > > > > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)
> > > > > > >
> > > > > > >             Exit latency       Acceptable latency
> > > > > > > Tree:       L1       L0s       L1       L0s
> > > > > > > ----------  -------  -----     -------  ------
> > > > > > > 00:01.2     <32 us   -
> > > > > > > | 01:00.0   <32 us   -
> > > > > > > |- 02:03.0  <32 us   -
> > > > > > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > > > > > |
> > > > > > > \- 02:04.0  <32 us   -
> > > > > > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > > > > > >
> > > > > > > 04:00.0's latency is the same as the maximum it allows so as
> > > > > > > we walk the path the first switchs startup latency will pass
> > > > > > > the acceptable latency limit for the link, and as a
> > > > > > > side-effect it fixes my issues with 03:00.0.
> > > > > > >
> > > > > > > Without this patch, 03:00.0 misbehaves and only gives me ~40
> > > > > > > mbit/s over links with 6 or more hops. With this patch I'm
> > > > > > > back to a maximum of ~933 mbit/s.
> > > > > >
> > > > > > There are two paths here that share a Link:
> > > > > >
> > > > > >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> > > > > >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> > > > > >
> > > > > > 1) The path to the I211 NIC includes four Ports and two Links (the
> > > > > >    connection between 01:00.0 and 02:03.0 is internal Switch routing,
> > > > > >    not a Link).
> > > > >
> > > > > >    The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> > > > > >    <16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
> > > > > >    01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
> > > > > >    to 1 + 32 = 33us of L1 exit latency.
> > > > > >
> > > > > >    The NIC can tolerate up to 64us of L1 exit latency, so it is safe
> > > > > >    to enable L1 for both Links.
> > > > > >
> > > > > > 2) The path to the Realtek device is similar except that the Realtek
> > > > > >    L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> > > > > >    initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
> > > > > >    but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> > > > > >    latency.
> > > > > >
> > > > > >    The Realtek device can only tolerate 64us of latency, so it is not
> > > > > >    safe to enable L1 for both Links.  It should be safe to enable L1
> > > > > >    on the shared link because the exit latency for that link would be
> > > > > >    <32us.
> > > > >
> > > > > 04:00.0:
> > > > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > > > LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s
> > > > > unlimited, L1 <64us
> > > > >
> > > > > So maximum latency for the entire link has to be <64 us
> > > > > For the device to leave L1 ASPM takes <64us
> > > > >
> > > > > So the device itself is the slowest entry along the link, which
> > > > > means that nothing else along that path can have ASPM enabled
> > > >
> > > > Yes.  That's what I said above: "it is not safe to enable L1 for both
> > > > Links."  Unless I'm missing something, we agree on that.
> > > >
> > > > I also said that it should be safe to enable L1 on the shared Link
> > > > (from 00:01.2 to 01:00.0) because if the downstream Link is always in
> > > > L0, the exit latency of the shared Link should be <32us, and 04:00.x
> > > > can tolerate 64us.
> > >
> > > Exit latency of shared link would be max of link, ie 64 + L1-hops, not 32
> >
> > I don't think this is true.  The path from 00:01.2 to 04:00.x includes
> > two Links, and they are independent.  The exit latency for each Link
> > depends only on the Port at each end:
> 
> The full path is what is important, because that is the actual latency
> (which the current linux code doesn't do)

I think you're saying we need to include the 04:00.x exit latency of
64us even though L1 is not enabled for 04:00.x.  I disagree; the L1
exit latency of Ports where L1 is disabled is irrelevant.  
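
For reference, something like this should show, for every Port in the two
paths, the advertised L1 exit latency (LnkCap) and whether ASPM is actually
enabled (LnkCtl), which is all that matters for the path calculation.  A
rough sketch using the BDFs from this thread; run it as root so lspci can
decode the config space:

  # dump advertised exit latencies and the current ASPM enable state
  for d in 00:01.2 01:00.0 02:03.0 03:00.0 02:04.0 04:00.0; do
      echo "== $d"
      lspci -s $d -vv | grep -E 'LnkCap:|LnkCtl:'
  done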

> >   Link 1 (depends on 00:01.2 and 01:00.0): max(32, 32) = 32us
> >   Link 2 (depends on 02:04.0 and 04:00.x): max(32, 64) = 64us
> >
> > If L1 is enabled for Link 1 and disabled for Link 2, Link 2 will
> > remain in L0 so it has no L1 exit latency, and the exit latency of
> > the entire path should be 32us.
> 
> My patch disables this so yes.
> 
> > > > > > > The original code path did:
> > > > > > > 04:00:0-02:04.0 max latency 64    -> ok
> > > > > > > 02:04.0-01:00.0 max latency 32 +1 -> ok
> > > > > > > 01:00.0-00:01.2 max latency 32 +2 -> ok
> > > > > > >
> > > > > > > And thus didn't see any L1 ASPM latency issues.
> > > > > > >
> > > > > > > The new code does:
> > > > > > > 04:00:0-02:04.0 max latency 64    -> ok
> > > > > > > 02:04.0-01:00.0 max latency 64 +1 -> latency exceeded
> > > > > > > 01:00.0-00:01.2 max latency 64 +2 -> latency exceeded
> > > > > >
> > > > > > [Nit: I don't think we should add 1 for the 02:04.0 -- 01:00.0 piece
> > > > > > because that's internal Switch routing, not a Link.  But even without
> > > > > > that extra microsecond, this path does exceed the acceptable latency
> > > > > > since 1 + 64 = 65us, and 04:00.0 can only tolerate 64us.]
> > > > >
> > > > > It does report L1 ASPM on both ends, so the links will be counted as
> > > > > such in the code.
> > > >
> > > > This is a bit of a tangent and we shouldn't get too wrapped up in it.
> > > > This is a confusing aspect of PCIe.  We're talking about this path:
> > > >
> > > >   00:01.2 --- [01:00.0 -- 02:04.0] --- 04:00.x multifunction Realtek
> > > >
> > > > This path only contains two Links.  The first one is
> > > > 00:01.2 --- 01:00.0, and the second one is 02:04.0 --- 04:00.x.
> > > >
> > > > 01:00.0 is a Switch Upstream Port and 02:04.0 is a Switch Downstream
> > > > Port.  The connection between them is not a Link; it is some internal
> > > > wiring of the Switch that is completely opaque to software.
> > > >
> > > > The ASPM information and knobs in 01:00.0 apply to the Link on its
> > > > upstream side, and the ASPM info and knobs in 02:04.0 apply to the
> > > > Link on its downstream side.
> > > >
> > > > The example in sec 5.4.1.2.2 contains three Links.  The L1 exit latency
> > > > for the Link is the max of the exit latencies at each end:
> > > >
> > > >   Link 1: max(32, 8) = 32us
> > > >   Link 2: max(8, 32) = 32us
> > > >   Link 3: max(32, 8) = 32us
> > > >
> > > > The total delay for a TLP starting at the downstream end of Link 3
> > > > is 32 + 2 = 32us.
> > > >
> > > > In the path to your 04:00.x Realtek device:
> > > >
> > > >   Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us
> > > >   Link 2 (from 02:04.0 to 04:00.x): max(32, 64) = 64us
> > > >
> > > > If L1 were enabled on both Links, the exit latency would be 64 + 1 =
> > > > 65us.
> > >
> > > So one line to be removed from the changelog, i assume... And yes, the
> > > code handles that - first disable is 01:00.0 <-> 00:01.2
> > >
> > > > > I also assume that it can power down individual ports... and enter
> > > > > rest state if no links are up.
> > > >
> > > > I don't think this is quite true -- a Link can't enter L1 unless the
> > > > Ports on both ends have L1 enabled, so I don't think it makes sense to
> > > > talk about an individual Port being in L1.
> > > >
> > > > > > > It correctly identifies the issue.
> > > > > > >
> > > > > > > For reference, pcie information:
> > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> > > > > >
> > > > > > The "lspci without my patches" [1] shows L1 enabled for the shared
> > > > > > Link from 00:01.2 --- 01:00.0 and for the Link to 03:00.0 (I211), but
> > > > > > not for the Link to 04:00.x (Realtek).
> > > > > >
> > > > > > Per my analysis above, that looks like it *should* be a safe
> > > > > > configuration.  03:00.0 can tolerate 64us, actual is <33us.  04:00.0
> > > > > > can tolerate 64us, actual should be <32us since only the shared Link
> > > > > > is in L1.
> > > > >
> > > > > See above.
> > > >
> > > > As I said above, if we enabled L1 only on the shared Link from 00:01.2
> > > > to 01:00.0, the exit latency should be acceptable.  In that case, a
> > > > TLP from 04:00.x would see only 32us of latency:
> > > >
> > > >   Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us
> > > >
> > > > and 04:00.x can tolerate 64us.
> > >
> > > But, again, you're completely ignoring the full link, ie 04:00.x would
> > > also have to power on.
> >
> > I think you're using "the full link" to refer to the entire path from
> > 00:01.2 to 04:00.x.  In PCIe, a "Link" directly connects two Ports.
> > It doesn't refer to the entire path.
> >
> > No, if L1 is disabled on 02:04.0 and 04:00.x (as Linux apparently does
> > by default), the Link between them never enters L1, so there is no
> > power-on for this Link.
> 
> It doesn't do it by default, my patch does

I'm relying on [1], your "lspci without my patches" attachment named
"lspci-5.9-mainline.txt", which shows:

  02:04.0 LnkCtl: ASPM Disabled
  04:00.0 LnkCtl: ASPM Disabled

so I assumed that was what Linux did by default.
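
For completeness, the active ASPM policy of the running kernel can be checked
directly on any kernel built with CONFIG_PCIEASPM (a small sketch; the
bracketed entry is the one in effect), which helps tell apart attachments
collected with different CONFIG_PCIEASPM_* options:

  # the bracketed policy is the one currently in effect
  cat /sys/module/pcie_aspm/parameters/policy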

> > > > > > However, the commit log at [2] shows L1 *enabled* for both
> > > > > > the shared Link from 00:01.2 --- 01:00.0 and the 02:04.0
> > > > > > --- 04:00.x Link, and that would definitely be a problem.
> > > > > >
> > > > > > Can you explain the differences between [1] and [2]?
> > > > >
> > > > > I don't understand which sections you're referring to.
> > > >
> > > > [1] is the "lspci without my patches" attachment of bugzilla #209725,
> > > > which is supposed to show the problem this patch solves.  We're
> > > > talking about the path to 04:00.x, and [1] show this:
> > > >
> > > >   01:00.2 L1+               # <-- my typo here, should be 00:01.2
> > > >   01:00.0 L1+
> > > >   02:04.0 L1-
> > > >   04:00.0 L1-
> > > >
> > > > AFAICT, that should be a legal configuration as far as 04:00.0 is
> > > > concerned, so it's not a reason for this patch.
> > >
> > > Actually, no, maximum path latency 64us
> > >
> > > 04:00.0 wakeup latency == 64us
> > >
> > > Again, as stated, it can't be behind any sleeping L1 links
> >
> > It would be pointless for a device to advertise L1 support if it could
> > never be used.  04:00.0 advertises that it can tolerate L1 latency of
> > 64us and that it can exit L1 in 64us or less.  So it *can* be behind a
> > Link in L1 as long as nothing else in the path adds more latency.
> 
> Yes, as long as nothing along the entire path adds latency - and I
> didn't make the component
> I can only say what it states, and we have to handle it.
> 
> > > > [2] is a previous posting of this same patch, and its commit log
> > > > includes information about the same path to 04:00.x, but the "LnkCtl
> > > > Before" column shows:
> > > >
> > > >   01:00.2 L1+               # <-- my typo here, should be 00:01.2
> > > >   01:00.0 L1+
> > > >   02:04.0 L1+
> > > >   04:00.0 L1+
> > > >
> > > > I don't know why [1] shows L1 disabled on the downstream Link, while
> > > > [2] shows L1 *enabled* on the same Link.
> > >
> > > From the data they look switched.
> > >
> > > > > > > Kai-Heng Feng has a machine that will not boot with ASPM without
> > > > > > > this patch, information is documented here:
> > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=209671
> > > > > >
> > > > > > I started working through this info, too, but there's not
> > > > > > enough information to tell what difference this patch
> > > > > > makes.  The attachments compare:
> > > > > >
> > > > > >   1) CONFIG_PCIEASPM_DEFAULT=y without the patch [3] and
> > > > > >   2) CONFIG_PCIEASPM_POWERSAVE=y *with* the patch [4]
> > > > > >
> > > > > > Obviously CONFIG_PCIEASPM_POWERSAVE=y will configure
> > > > > > things differently than CONFIG_PCIEASPM_DEFAULT=y, so we
> > > > > > can't tell what changes are due to the config change and
> > > > > > what are due to the patch.
> > > > > >
> > > > > > The lspci *with* the patch ([4]) shows L0s and L1 enabled
> > > > > > at almost every possible place.  Here are the Links, how
> > > > > > they're configured, and my analysis of the exit latencies
> > > > > > vs acceptable latencies:
> > > > > >
> > > > > >   00:01.1 --- 01:00.0      L1+ (                  L1 <64us vs unl)
> > > > > >   00:01.2 --- 02:00.0      L1+ (                  L1 <64us vs 64us)
> > > > > >   00:01.3 --- 03:00.0      L1+ (                  L1 <64us vs 64us)
> > > > > >   00:01.4 --- 04:00.0      L1+ (                  L1 <64us vs unl)
> > > > > >   00:08.1 --- 05:00.x L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> > > > > >   00:08.2 --- 06:00.0 L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> > > > > >
> > > > > > So I can't tell what change prevents the freeze.  I would
> > > > > > expect the patch would cause us to *disable* L0s or L1
> > > > > > somewhere.
> > > > > >
> > > > > > The only place [4] shows ASPM disabled is for 05:00.1.
> > > > > > The spec says we should program the same value in all
> > > > > > functions of a multi-function device.  This is a non-ARI
> > > > > > device, so "only capabilities enabled in all functions are
> > > > > > enabled for the component as a whole."  That would mean
> > > > > > that L0s and L1 are effectively disabled for 05:00.x even
> > > > > > though 05:00.0 claims they're enabled.  But the latencies
> > > > > > say ASPM L0s and L1 should be safe to be enabled.  This
> > > > > > looks like another bug that's probably unrelated.
> > > > >
> > > > > I don't think it's unrelated, i suspect it's how PCIe works with
> > > > > multiple links...  a device can cause some kind of head of queue
> > > > > stalling - i don't know how but it really looks like it.
> > > >
> > > > The text in quotes above is straight out of the spec (PCIe r5.0, sec
> > > > 7.5.3.7).  Either the device works that way or it's not compliant.
> > > >
> > > > The OS configures ASPM based on the requirements and capabilities
> > > > advertised by the device.  If a device has any head of queue stalling
> > > > or similar issues, those must be comprehended in the numbers
> > > > advertised by the device.  It's not up to the OS to speculate about
> > > > issues like that.
> > > >
> > > > > > The patch might be correct; I haven't actually analyzed
> > > > > > the code.  But the commit log doesn't make sense to me
> > > > > > yet.
> > > > >
> > > > > I personally don't think that all this PCI information is required,
> > > > > the linux kernel is currently doing it wrong according to the spec.
> > > >
> > > > We're trying to establish exactly *what* Linux is doing wrong.  So far
> > > > we don't have a good explanation of that.
> > >
> > > Yes we do, linux counts hops + max for "link" while what should be done is
> > > counting hops + max for path
> >
> > I think you're saying we need to include L1 exit latency even for
> > Links where L1 is disabled.  I don't think we should include those.
> 
> Nope, the code does not do that, it only adds the l1 latency on L1
> enabled hops
> 
> > > > Based on [1], in the path to 03:00.0, both Links have L1 enabled, with
> > > > an exit latency of <33us, and 03:00.0 can tolerate 64us.  That should
> > > > work fine.
> > > >
> > > > Also based on [1], in the path to 04:00.x, the upstream Link has L1
> > > > enabled and the downstream Link has L1 disabled, for an exit latency
> > > > of <32us, and 04:00.0 can tolerate 64us.  That should also work fine.
> > >
> > > Again, ignoring the exit latency for 04:00.0
> > >
> > > > (Alternately, disabling L1 on the upstream Link and enabling it on the
> > > > downstream Link should have an exit latency of <64us and 04:00.0 can
> > > > tolerate 64us, so that should work fine, too.)
> > >
> > > Then nothing else can have L1 aspm enabled
> >
> > Yes, as I said, we should be able to enable L1 on either of the Links
> > in the path to 04:00.x, but not both.
> 
> The code works backwards and disables the first hop that exceeds the
> latency requirements -
> we could argue that it should try to be smarter about it and try to
> disable a minimum amount of links
> while still retaining the minimum latency but... It is what it is and
> it works when patched.
> 
> > The original problem here is not with the Realtek device at 04:00.x
> > but with the I211 NIC at 03:00.0.  So we also need to figure out what
> > the connection is.  Does the same I211 performance problem occur if
> > you remove the Realtek device from the system?
> 
> It's mounted on the motherboard, so no I can't remove it.

If you're interested, you could probably unload the Realtek drivers,
remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
in 02:04.0, e.g.,

  # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
  # echo 1 > $RT/0000:04:00.0/remove
  # echo 1 > $RT/0000:04:00.1/remove
  # echo 1 > $RT/0000:04:00.2/remove
  # echo 1 > $RT/0000:04:00.4/remove
  # echo 1 > $RT/0000:04:00.7/remove
  # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010

That should take 04:00.x out of the picture.
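
To undo the experiment afterwards, something like the following should work;
this is just a sketch that assumes setpci's value:mask write syntax (a reboot
obviously restores things as well):

  # clear only the Link Disable bit (bit 4 of Link Control at CAP_EXP+0x10)
  setpci -s02:04.0 CAP_EXP+0x10.w=0x0000:0x0010
  # re-enumerate the functions removed above
  echo 1 > /sys/bus/pci/rescan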

> > 03:00.0 can tolerate 64us of latency, so even if L1 is enabled on both
> > Links leading to it, the path exit latency would be <33us, which
> > should be fine.
> 
> Yes, it "should be" but due to broken ASPM latency calculations we
> have some kind of
> side effect that triggers a racecondition/sideeffect/bug that causes
> it to misbehave.
> 
> Since fixing the latency calculation fixes it, I'll leave the rest to
> someone with a logic
> analyzer and a die-hard-fetish for pcie links - I can't debug it.
> 
> > > > > Also, since it's clearly doing the wrong thing, I'm worried that
> > > > > dists will take a kernel enable aspm and there will be alot of
> > > > > bugreports of non-booting systems or other weird issues... And the
> > > > > culprit was known all along.
> > > >
> > > > There's clearly a problem on your system, but I don't know yet whether
> > > > Linux is doing something wrong, a device in your system is designed
> > > > incorrectly, or a device is designed correctly but the instance in
> > > > your system is defective.
> > >
> > > According to the spec it is, there is a explanation of how to
> > > calculate the exit latency
> > > and when you implement that, which i did (before knowing the actual
> > > spec) then it works...
> > >
> > > > > It's been five months...
> > > >
> > > > I apologize for the delay.  ASPM is a subtle area of PCIe, the Linux
> > > > code is complicated, and we have a long history of issues with it.  I
> > > > want to fix the problem, but I want to make sure we do it in a way
> > > > that matches the spec so the fix applies to all systems.  I don't want
> > > > a magic fix that fixes your system in a way I don't quite understand.
> > >
> > > > Obviously *you* understand this, so hopefully it's just a matter of
> > > > pounding it through my thick skull :)
> > >
> > > I only understand what I've been forced to understand - and I do
> > > leverage the existing code without
> > > knowing what it does underneath, I only look at the links maximum
> > > latency and make sure that I keep
> > > the maximum latency along the path and not just link for link
> > >
> > > once you realise that the max allowed latency is buffer dependent -
> > > then this becomes obviously correct,
> > > and then the pcie spec showed it as being correct as well... so...
> > >
> > >
> > > > > > [1] https://bugzilla.kernel.org/attachment.cgi?id=293047
> > > > > > [2] https://lore.kernel.org/linux-pci/20201007132808.647589-1-ian.kumlien@gmail.com/
> > > > > > [3] https://bugzilla.kernel.org/attachment.cgi?id=292955
> > > > > > [4] https://bugzilla.kernel.org/attachment.cgi?id=292957
> > > > > >
> > > > > > > Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com>
> > > > > > > Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > > > > ---
> > > > > > >  drivers/pci/pcie/aspm.c | 22 ++++++++++++++--------
> > > > > > >  1 file changed, 14 insertions(+), 8 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > > index 253c30cc1967..c03ead0f1013 100644
> > > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > > @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
> > > > > > >
> > > > > > >  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > > > >  {
> > > > > > > -     u32 latency, l1_switch_latency = 0;
> > > > > > > +     u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
> > > > > > >       struct aspm_latency *acceptable;
> > > > > > >       struct pcie_link_state *link;
> > > > > > >
> > > > > > > @@ -456,10 +456,14 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > > > >               if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
> > > > > > >                   (link->latency_dw.l0s > acceptable->l0s))
> > > > > > >                       link->aspm_capable &= ~ASPM_STATE_L0S_DW;
> > > > > > > +
> > > > > > >               /*
> > > > > > >                * Check L1 latency.
> > > > > > > -              * Every switch on the path to root complex need 1
> > > > > > > -              * more microsecond for L1. Spec doesn't mention L0s.
> > > > > > > +              *
> > > > > > > +              * PCIe r5.0, sec 5.4.1.2.2 states:
> > > > > > > +              * A Switch is required to initiate an L1 exit transition on its
> > > > > > > +              * Upstream Port Link after no more than 1 μs from the beginning of an
> > > > > > > +              * L1 exit transition on any of its Downstream Port Links.
> > > > > > >                *
> > > > > > >                * The exit latencies for L1 substates are not advertised
> > > > > > >                * by a device.  Since the spec also doesn't mention a way
> > > > > > > @@ -469,11 +473,13 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > > > >                * L1 exit latencies advertised by a device include L1
> > > > > > >                * substate latencies (and hence do not do any check).
> > > > > > >                */
> > > > > > > -             latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > > > > -             if ((link->aspm_capable & ASPM_STATE_L1) &&
> > > > > > > -                 (latency + l1_switch_latency > acceptable->l1))
> > > > > > > -                     link->aspm_capable &= ~ASPM_STATE_L1;
> > > > > > > -             l1_switch_latency += 1000;
> > > > > > > +             if (link->aspm_capable & ASPM_STATE_L1) {
> > > > > > > +                     latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > > > > +                     l1_max_latency = max_t(u32, latency, l1_max_latency);
> > > > > > > +                     if (l1_max_latency + l1_switch_latency > acceptable->l1)
> > > > > > > +                             link->aspm_capable &= ~ASPM_STATE_L1;
> > > > > > > +                     l1_switch_latency += 1000;
> > > > > > > +             }
> > > > > > >
> > > > > > >               link = link->parent;
> > > > > > >       }
> > > > > > > --
> > > > > > > 2.29.1
> > > > > > >
Ian Kumlien Dec. 14, 2020, 10:56 p.m. UTC | #10
On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Dec 14, 2020 at 04:47:32PM +0100, Ian Kumlien wrote:
> > On Mon, Dec 14, 2020 at 3:02 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Mon, Dec 14, 2020 at 10:14:18AM +0100, Ian Kumlien wrote:
> > > > On Mon, Dec 14, 2020 at 6:44 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > >
> > > > > [+cc Jesse, Tony, David, Jakub, Heiner, lists in case there's an ASPM
> > > > > issue with I211 or Realtek NICs.  Beginning of thread:
> > > > > https://lore.kernel.org/r/20201024205548.1837770-1-ian.kumlien@gmail.com
> > > > >
> > > > > Short story: Ian has:
> > > > >
> > > > >   Root Port --- Switch --- I211 NIC
> > > > >                        \-- multifunction Realtek NIC, etc
> > > > >
> > > > > and the I211 performance is poor with ASPM L1 enabled on both links
> > > > > in the path to it.  The patch here disables ASPM on the upstream link
> > > > > and fixes the performance, but AFAICT the devices in that path give us
> > > > > no reason to disable L1.  If I understand the spec correctly, the
> > > > > Realtek device should not be relevant to the I211 path.]
> > > > >
> > > > > On Sun, Dec 13, 2020 at 10:39:53PM +0100, Ian Kumlien wrote:
> > > > > > On Sun, Dec 13, 2020 at 12:47 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> > > > > > > > Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> > > > > > > > "5.4.1.2.2. Exit from the L1 State"
> > > > > > > >
> > > > > > > > Which makes it clear that each switch is required to
> > > > > > > > initiate a transition within 1μs from receiving it,
> > > > > > > > accumulating this latency and then we have to wait for the
> > > > > > > > slowest link along the path before entering L0 state from
> > > > > > > > L1.
> > > > > > > > ...
> > > > > > >
> > > > > > > > On my specific system:
> > > > > > > > 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> > > > > > > > 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)
> > > > > > > >
> > > > > > > >             Exit latency       Acceptable latency
> > > > > > > > Tree:       L1       L0s       L1       L0s
> > > > > > > > ----------  -------  -----     -------  ------
> > > > > > > > 00:01.2     <32 us   -
> > > > > > > > | 01:00.0   <32 us   -
> > > > > > > > |- 02:03.0  <32 us   -
> > > > > > > > | \03:00.0  <16 us   <2us      <64 us   <512ns
> > > > > > > > |
> > > > > > > > \- 02:04.0  <32 us   -
> > > > > > > >   \04:00.0  <64 us   unlimited <64 us   <512ns
> > > > > > > >
> > > > > > > > 04:00.0's latency is the same as the maximum it allows so as
> > > > > > > > we walk the path the first switchs startup latency will pass
> > > > > > > > the acceptable latency limit for the link, and as a
> > > > > > > > side-effect it fixes my issues with 03:00.0.
> > > > > > > >
> > > > > > > > Without this patch, 03:00.0 misbehaves and only gives me ~40
> > > > > > > > mbit/s over links with 6 or more hops. With this patch I'm
> > > > > > > > back to a maximum of ~933 mbit/s.
> > > > > > >
> > > > > > > There are two paths here that share a Link:
> > > > > > >
> > > > > > >   00:01.2 --- 01:00.0 -- 02:03.0 --- 03:00.0 I211 NIC
> > > > > > >   00:01.2 --- 01:00.0 -- 02:04.0 --- 04:00.x multifunction Realtek
> > > > > > >
> > > > > > > 1) The path to the I211 NIC includes four Ports and two Links (the
> > > > > > >    connection between 01:00.0 and 02:03.0 is internal Switch routing,
> > > > > > >    not a Link).
> > > > > >
> > > > > > >    The Ports advertise L1 exit latencies of <32us, <32us, <32us,
> > > > > > >    <16us.  If both Links are in L1 and 03:00.0 initiates L1 exit at T,
> > > > > > >    01:00.0 initiates L1 exit at T + 1.  A TLP from 03:00.0 may see up
> > > > > > >    to 1 + 32 = 33us of L1 exit latency.
> > > > > > >
> > > > > > >    The NIC can tolerate up to 64us of L1 exit latency, so it is safe
> > > > > > >    to enable L1 for both Links.
> > > > > > >
> > > > > > > 2) The path to the Realtek device is similar except that the Realtek
> > > > > > >    L1 exit latency is <64us.  If both Links are in L1 and 04:00.x
> > > > > > >    initiates L1 exit at T, 01:00.0 again initiates L1 exit at T + 1,
> > > > > > >    but a TLP from 04:00.x may see up to 1 + 64 = 65us of L1 exit
> > > > > > >    latency.
> > > > > > >
> > > > > > >    The Realtek device can only tolerate 64us of latency, so it is not
> > > > > > >    safe to enable L1 for both Links.  It should be safe to enable L1
> > > > > > >    on the shared link because the exit latency for that link would be
> > > > > > >    <32us.
> > > > > >
> > > > > > 04:00.0:
> > > > > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> > > > > > LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s
> > > > > > unlimited, L1 <64us
> > > > > >
> > > > > > So maximum latency for the entire link has to be <64 us
> > > > > > For the device to leave L1 ASPM takes <64us
> > > > > >
> > > > > > So the device itself is the slowest entry along the link, which
> > > > > > means that nothing else along that path can have ASPM enabled
> > > > >
> > > > > Yes.  That's what I said above: "it is not safe to enable L1 for both
> > > > > Links."  Unless I'm missing something, we agree on that.
> > > > >
> > > > > I also said that it should be safe to enable L1 on the shared Link
> > > > > (from 00:01.2 to 01:00.0) because if the downstream Link is always in
> > > > > L0, the exit latency of the shared Link should be <32us, and 04:00.x
> > > > > can tolerate 64us.
> > > >
> > > > Exit latency of shared link would be max of link, ie 64 + L1-hops, not 32
> > >
> > > I don't think this is true.  The path from 00:01.2 to 04:00.x includes
> > > two Links, and they are independent.  The exit latency for each Link
> > > depends only on the Port at each end:
> >
> > The full path is what is important, because that is the actual latency
> > (which the current linux code doesn't do)
>
> I think you're saying we need to include the 04:00.x exit latency of
> 64us even though L1 is not enabled for 04:00.x.  I disagree; the L1
> exit latency of Ports where L1 is disabled is irrelevant.

I will redo the test without the patch and look again; I know that I have
to wait a while for it to happen.

With patch 3 I get:
dec 14 13:44:40 localhost kernel: pci 0000:04:00.0: ASPM latency
exceeded, disabling: L1:0000:01:00.0-0000:00:01.2

And it should only check links that have L1 ASPM enabled, as per the
original code.

> > >   Link 1 (depends on 00:01.2 and 01:00.0): max(32, 32) = 32us
> > >   Link 2 (depends on 02:04.0 and 04:00.x): max(32, 64) = 64us
> > >
> > > If L1 is enabled for Link 1 and disabled for Link 2, Link 2 will
> > > remain in L0 so it has no L1 exit latency, and the exit latency of
> > > the entire path should be 32us.
> >
> > My patch disables this so yes.
> >
> > > > > > > > The original code path did:
> > > > > > > > 04:00:0-02:04.0 max latency 64    -> ok
> > > > > > > > 02:04.0-01:00.0 max latency 32 +1 -> ok
> > > > > > > > 01:00.0-00:01.2 max latency 32 +2 -> ok
> > > > > > > >
> > > > > > > > And thus didn't see any L1 ASPM latency issues.
> > > > > > > >
> > > > > > > > The new code does:
> > > > > > > > 04:00:0-02:04.0 max latency 64    -> ok
> > > > > > > > 02:04.0-01:00.0 max latency 64 +1 -> latency exceeded
> > > > > > > > 01:00.0-00:01.2 max latency 64 +2 -> latency exceeded
> > > > > > >
> > > > > > > [Nit: I don't think we should add 1 for the 02:04.0 -- 01:00.0 piece
> > > > > > > because that's internal Switch routing, not a Link.  But even without
> > > > > > > that extra microsecond, this path does exceed the acceptable latency
> > > > > > > since 1 + 64 = 65us, and 04:00.0 can only tolerate 64us.]
> > > > > >
> > > > > > It does report L1 ASPM on both ends, so the links will be counted as
> > > > > > such in the code.
> > > > >
> > > > > This is a bit of a tangent and we shouldn't get too wrapped up in it.
> > > > > This is a confusing aspect of PCIe.  We're talking about this path:
> > > > >
> > > > >   00:01.2 --- [01:00.0 -- 02:04.0] --- 04:00.x multifunction Realtek
> > > > >
> > > > > This path only contains two Links.  The first one is
> > > > > 00:01.2 --- 01:00.0, and the second one is 02:04.0 --- 04:00.x.
> > > > >
> > > > > 01:00.0 is a Switch Upstream Port and 02:04.0 is a Switch Downstream
> > > > > Port.  The connection between them is not a Link; it is some internal
> > > > > wiring of the Switch that is completely opaque to software.
> > > > >
> > > > > The ASPM information and knobs in 01:00.0 apply to the Link on its
> > > > > upstream side, and the ASPM info and knobs in 02:04.0 apply to the
> > > > > Link on its downstream side.
> > > > >
> > > > > The example in sec 5.4.1.2.2 contains three Links.  The L1 exit latency
> > > > > for the Link is the max of the exit latencies at each end:
> > > > >
> > > > >   Link 1: max(32, 8) = 32us
> > > > >   Link 2: max(8, 32) = 32us
> > > > >   Link 3: max(32, 8) = 32us
> > > > >
> > > > > The total delay for a TLP starting at the downstream end of Link 3
> > > > > is 32 + 2 = 32us.
> > > > >
> > > > > In the path to your 04:00.x Realtek device:
> > > > >
> > > > >   Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us
> > > > >   Link 2 (from 02:04.0 to 04:00.x): max(32, 64) = 64us
> > > > >
> > > > > If L1 were enabled on both Links, the exit latency would be 64 + 1 =
> > > > > 65us.
> > > >
> > > > So one line to be removed from the changelog, i assume... And yes, the
> > > > code handles that - first disable is 01:00.0 <-> 00:01.2
> > > >
> > > > > > I also assume that it can power down individual ports... and enter
> > > > > > rest state if no links are up.
> > > > >
> > > > > I don't think this is quite true -- a Link can't enter L1 unless the
> > > > > Ports on both ends have L1 enabled, so I don't think it makes sense to
> > > > > talk about an individual Port being in L1.
> > > > >
> > > > > > > > It correctly identifies the issue.
> > > > > > > >
> > > > > > > > For reference, pcie information:
> > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> > > > > > >
> > > > > > > The "lspci without my patches" [1] shows L1 enabled for the shared
> > > > > > > Link from 00:01.2 --- 01:00.0 and for the Link to 03:00.0 (I211), but
> > > > > > > not for the Link to 04:00.x (Realtek).
> > > > > > >
> > > > > > > Per my analysis above, that looks like it *should* be a safe
> > > > > > > configuration.  03:00.0 can tolerate 64us, actual is <33us.  04:00.0
> > > > > > > can tolerate 64us, actual should be <32us since only the shared Link
> > > > > > > is in L1.
> > > > > >
> > > > > > See above.
> > > > >
> > > > > As I said above, if we enabled L1 only on the shared Link from 00:01.2
> > > > > to 01:00.0, the exit latency should be acceptable.  In that case, a
> > > > > TLP from 04:00.x would see only 32us of latency:
> > > > >
> > > > >   Link 1 (from 00:01.2 to 01:00.0): max(32, 32) = 32us
> > > > >
> > > > > and 04:00.x can tolerate 64us.
> > > >
> > > > But, again, you're completely ignoring the full link, ie 04:00.x would
> > > > also have to power on.
> > >
> > > I think you're using "the full link" to refer to the entire path from
> > > 00:01.2 to 04:00.x.  In PCIe, a "Link" directly connects two Ports.
> > > It doesn't refer to the entire path.
> > >
> > > No, if L1 is disabled on 02:04.0 and 04:00.x (as Linux apparently does
> > > by default), the Link between them never enters L1, so there is no
> > > power-on for this Link.
> >
> > It doesn't do it by default, my patch does
>
> I'm relying on [1], your "lspci without my patches" attachment named
> "lspci-5.9-mainline.txt", which shows:
>
>   02:04.0 LnkCtl: ASPM Disabled
>   04:00.0 LnkCtl: ASPM Disabled
>
> so I assumed that was what Linux did by default.

Interesting, they are disabled.

> > > > > > > However, the commit log at [2] shows L1 *enabled* for both
> > > > > > > the shared Link from 00:01.2 --- 01:00.0 and the 02:04.0
> > > > > > > --- 04:00.x Link, and that would definitely be a problem.
> > > > > > >
> > > > > > > Can you explain the differences between [1] and [2]?
> > > > > >
> > > > > > I don't understand which sections you're referring to.
> > > > >
> > > > > [1] is the "lspci without my patches" attachment of bugzilla #209725,
> > > > > which is supposed to show the problem this patch solves.  We're
> > > > > talking about the path to 04:00.x, and [1] show this:
> > > > >
> > > > >   01:00.2 L1+               # <-- my typo here, should be 00:01.2
> > > > >   01:00.0 L1+
> > > > >   02:04.0 L1-
> > > > >   04:00.0 L1-
> > > > >
> > > > > AFAICT, that should be a legal configuration as far as 04:00.0 is
> > > > > concerned, so it's not a reason for this patch.
> > > >
> > > > Actually, no, maximum path latency 64us
> > > >
> > > > 04:00.0 wakeup latency == 64us
> > > >
> > > > Again, as stated, it can't be behind any sleeping L1 links
> > >
> > > It would be pointless for a device to advertise L1 support if it could
> > > never be used.  04:00.0 advertises that it can tolerate L1 latency of
> > > 64us and that it can exit L1 in 64us or less.  So it *can* be behind a
> > > Link in L1 as long as nothing else in the path adds more latency.
> >
> > Yes, as long as nothing along the entire path adds latency - and I
> > didn't make the component
> > I can only say what it states, and we have to handle it.
> >
> > > > > [2] is a previous posting of this same patch, and its commit log
> > > > > includes information about the same path to 04:00.x, but the "LnkCtl
> > > > > Before" column shows:
> > > > >
> > > > >   01:00.2 L1+               # <-- my typo here, should be 00:01.2
> > > > >   01:00.0 L1+
> > > > >   02:04.0 L1+
> > > > >   04:00.0 L1+
> > > > >
> > > > > I don't know why [1] shows L1 disabled on the downstream Link, while
> > > > > [2] shows L1 *enabled* on the same Link.
> > > >
> > > > From the data they look switched.
> > > >
> > > > > > > > Kai-Heng Feng has a machine that will not boot with ASPM without
> > > > > > > > this patch, information is documented here:
> > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=209671
> > > > > > >
> > > > > > > I started working through this info, too, but there's not
> > > > > > > enough information to tell what difference this patch
> > > > > > > makes.  The attachments compare:
> > > > > > >
> > > > > > >   1) CONFIG_PCIEASPM_DEFAULT=y without the patch [3] and
> > > > > > >   2) CONFIG_PCIEASPM_POWERSAVE=y *with* the patch [4]
> > > > > > >
> > > > > > > Obviously CONFIG_PCIEASPM_POWERSAVE=y will configure
> > > > > > > things differently than CONFIG_PCIEASPM_DEFAULT=y, so we
> > > > > > > can't tell what changes are due to the config change and
> > > > > > > what are due to the patch.
> > > > > > >
> > > > > > > The lspci *with* the patch ([4]) shows L0s and L1 enabled
> > > > > > > at almost every possible place.  Here are the Links, how
> > > > > > > they're configured, and my analysis of the exit latencies
> > > > > > > vs acceptable latencies:
> > > > > > >
> > > > > > >   00:01.1 --- 01:00.0      L1+ (                  L1 <64us vs unl)
> > > > > > >   00:01.2 --- 02:00.0      L1+ (                  L1 <64us vs 64us)
> > > > > > >   00:01.3 --- 03:00.0      L1+ (                  L1 <64us vs 64us)
> > > > > > >   00:01.4 --- 04:00.0      L1+ (                  L1 <64us vs unl)
> > > > > > >   00:08.1 --- 05:00.x L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> > > > > > >   00:08.2 --- 06:00.0 L0s+ L1+ (L0s <64ns vs 4us, L1  <1us vs unl)
> > > > > > >
> > > > > > > So I can't tell what change prevents the freeze.  I would
> > > > > > > expect the patch would cause us to *disable* L0s or L1
> > > > > > > somewhere.
> > > > > > >
> > > > > > > The only place [4] shows ASPM disabled is for 05:00.1.
> > > > > > > The spec says we should program the same value in all
> > > > > > > functions of a multi-function device.  This is a non-ARI
> > > > > > > device, so "only capabilities enabled in all functions are
> > > > > > > enabled for the component as a whole."  That would mean
> > > > > > > that L0s and L1 are effectively disabled for 05:00.x even
> > > > > > > though 05:00.0 claims they're enabled.  But the latencies
> > > > > > > say ASPM L0s and L1 should be safe to be enabled.  This
> > > > > > > looks like another bug that's probably unrelated.
> > > > > >
> > > > > > I don't think it's unrelated, i suspect it's how PCIe works with
> > > > > > multiple links...  a device can cause some kind of head of queue
> > > > > > stalling - i don't know how but it really looks like it.
> > > > >
> > > > > The text in quotes above is straight out of the spec (PCIe r5.0, sec
> > > > > 7.5.3.7).  Either the device works that way or it's not compliant.
> > > > >
> > > > > The OS configures ASPM based on the requirements and capabilities
> > > > > advertised by the device.  If a device has any head of queue stalling
> > > > > or similar issues, those must be comprehended in the numbers
> > > > > advertised by the device.  It's not up to the OS to speculate about
> > > > > issues like that.
> > > > >
> > > > > > > The patch might be correct; I haven't actually analyzed
> > > > > > > the code.  But the commit log doesn't make sense to me
> > > > > > > yet.
> > > > > >
> > > > > > I personally don't think that all this PCI information is required,
> > > > > > the linux kernel is currently doing it wrong according to the spec.
> > > > >
> > > > > We're trying to establish exactly *what* Linux is doing wrong.  So far
> > > > > we don't have a good explanation of that.
> > > >
> > > > Yes we do, linux counts hops + max for "link" while what should be done is
> > > > counting hops + max for path
> > >
> > > I think you're saying we need to include L1 exit latency even for
> > > Links where L1 is disabled.  I don't think we should include those.
> >
> > Nope, the code does not do that, it only adds the l1 latency on L1
> > enabled hops
> >
> > > > > Based on [1], in the path to 03:00.0, both Links have L1 enabled, with
> > > > > an exit latency of <33us, and 03:00.0 can tolerate 64us.  That should
> > > > > work fine.
> > > > >
> > > > > Also based on [1], in the path to 04:00.x, the upstream Link has L1
> > > > > enabled and the downstream Link has L1 disabled, for an exit latency
> > > > > of <32us, and 04:00.0 can tolerate 64us.  That should also work fine.
> > > >
> > > > Again, ignoring the exit latency for 04:00.0
> > > >
> > > > > (Alternately, disabling L1 on the upstream Link and enabling it on the
> > > > > downstream Link should have an exit latency of <64us and 04:00.0 can
> > > > > tolerate 64us, so that should work fine, too.)
> > > >
> > > > Then nothing else can have L1 aspm enabled
> > >
> > > Yes, as I said, we should be able to enable L1 on either of the Links
> > > in the path to 04:00.x, but not both.
> >
> > The code works backwards and disables the first hop that exceeds the
> > latency requirements -
> > we could argue that it should try to be smarter about it and try to
> > disable a minimum amount of links
> > while still retaining the minimum latency but... It is what it is and
> > it works when patched.
> >
> > > The original problem here is not with the Realtek device at 04:00.x
> > > but with the I211 NIC at 03:00.0.  So we also need to figure out what
> > > the connection is.  Does the same I211 performance problem occur if
> > > you remove the Realtek device from the system?
> >
> > It's mounted on the motherboard, so no I can't remove it.
>
> If you're interested, you could probably unload the Realtek drivers,
> remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> in 02:04.0, e.g.,
>
>   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
>   # echo 1 > $RT/0000:04:00.0/remove
>   # echo 1 > $RT/0000:04:00.1/remove
>   # echo 1 > $RT/0000:04:00.2/remove
>   # echo 1 > $RT/0000:04:00.4/remove
>   # echo 1 > $RT/0000:04:00.7/remove
>   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
>
> That should take 04:00.x out of the picture.

That didn't actually change the behaviour; I'm suspecting an erratum in the
AMD PCIe implementation...

So I did this with an unpatched kernel:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
[  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
[  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
[  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
[  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
[  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
[  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
[  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
[  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
[  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver

and:
echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm

and:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
[  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
[  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
[  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
[  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
[  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
[  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
[  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
[  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
[  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
[  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                  receiver

But this only confirms that the fix I experience is a side effect.

The original code is still wrong :)
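
For anyone wanting to reproduce this, a rough sketch of the sysfs knobs I
used above (paths as on my system; the attributes sit under the device at
the downstream end of each Link, and the exact set of *_aspm files depends
on what the kernel exposes there):

  # show the current per-link ASPM state for the shared link
  grep . /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/*_aspm
  # re-enable L1 on the shared 00:01.2 --- 01:00.0 link after testing
  echo 1 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm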

> > > 03:00.0 can tolerate 64us of latency, so even if L1 is enabled on both
> > > Links leading to it, the path exit latency would be <33us, which
> > > should be fine.
> >
> > Yes, it "should be" but due to broken ASPM latency calculations we
> > have some kind of
> > side effect that triggers a racecondition/sideeffect/bug that causes
> > it to misbehave.
> >
> > Since fixing the latency calculation fixes it, I'll leave the rest to
> > someone with a logic
> > analyzer and a die-hard-fetish for pcie links - I can't debug it.
> >
> > > > > > Also, since it's clearly doing the wrong thing, I'm worried that
> > > > > > dists will take a kernel enable aspm and there will be alot of
> > > > > > bugreports of non-booting systems or other weird issues... And the
> > > > > > culprit was known all along.
> > > > >
> > > > > There's clearly a problem on your system, but I don't know yet whether
> > > > > Linux is doing something wrong, a device in your system is designed
> > > > > incorrectly, or a device is designed correctly but the instance in
> > > > > your system is defective.
> > > >
> > > > According to the spec it is, there is a explanation of how to
> > > > calculate the exit latency
> > > > and when you implement that, which i did (before knowing the actual
> > > > spec) then it works...
> > > >
> > > > > > It's been five months...
> > > > >
> > > > > I apologize for the delay.  ASPM is a subtle area of PCIe, the Linux
> > > > > code is complicated, and we have a long history of issues with it.  I
> > > > > want to fix the problem, but I want to make sure we do it in a way
> > > > > that matches the spec so the fix applies to all systems.  I don't want
> > > > > a magic fix that fixes your system in a way I don't quite understand.
> > > >
> > > > > Obviously *you* understand this, so hopefully it's just a matter of
> > > > > pounding it through my thick skull :)
> > > >
> > > > I only understand what I've been forced to understand - and I do
> > > > leverage the existing code without
> > > > knowing what it does underneath, I only look at the links maximum
> > > > latency and make sure that I keep
> > > > the maximum latency along the path and not just link for link
> > > >
> > > > once you realise that the max allowed latency is buffer dependent -
> > > > then this becomes obviously correct,
> > > > and then the pcie spec showed it as being correct as well... so...
> > > >
> > > >
> > > > > > > [1] https://bugzilla.kernel.org/attachment.cgi?id=293047
> > > > > > > [2] https://lore.kernel.org/linux-pci/20201007132808.647589-1-ian.kumlien@gmail.com/
> > > > > > > [3] https://bugzilla.kernel.org/attachment.cgi?id=292955
> > > > > > > [4] https://bugzilla.kernel.org/attachment.cgi?id=292957
> > > > > > >
> > > > > > > > Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com>
> > > > > > > > Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> > > > > > > > ---
> > > > > > > >  drivers/pci/pcie/aspm.c | 22 ++++++++++++++--------
> > > > > > > >  1 file changed, 14 insertions(+), 8 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > > > index 253c30cc1967..c03ead0f1013 100644
> > > > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > > > @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
> > > > > > > >
> > > > > > > >  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > > > > >  {
> > > > > > > > -     u32 latency, l1_switch_latency = 0;
> > > > > > > > +     u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
> > > > > > > >       struct aspm_latency *acceptable;
> > > > > > > >       struct pcie_link_state *link;
> > > > > > > >
> > > > > > > > @@ -456,10 +456,14 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > > > > >               if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
> > > > > > > >                   (link->latency_dw.l0s > acceptable->l0s))
> > > > > > > >                       link->aspm_capable &= ~ASPM_STATE_L0S_DW;
> > > > > > > > +
> > > > > > > >               /*
> > > > > > > >                * Check L1 latency.
> > > > > > > > -              * Every switch on the path to root complex need 1
> > > > > > > > -              * more microsecond for L1. Spec doesn't mention L0s.
> > > > > > > > +              *
> > > > > > > > +              * PCIe r5.0, sec 5.4.1.2.2 states:
> > > > > > > > +              * A Switch is required to initiate an L1 exit transition on its
> > > > > > > > +              * Upstream Port Link after no more than 1 μs from the beginning of an
> > > > > > > > +              * L1 exit transition on any of its Downstream Port Links.
> > > > > > > >                *
> > > > > > > >                * The exit latencies for L1 substates are not advertised
> > > > > > > >                * by a device.  Since the spec also doesn't mention a way
> > > > > > > > @@ -469,11 +473,13 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > > > > >                * L1 exit latencies advertised by a device include L1
> > > > > > > >                * substate latencies (and hence do not do any check).
> > > > > > > >                */
> > > > > > > > -             latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > > > > > -             if ((link->aspm_capable & ASPM_STATE_L1) &&
> > > > > > > > -                 (latency + l1_switch_latency > acceptable->l1))
> > > > > > > > -                     link->aspm_capable &= ~ASPM_STATE_L1;
> > > > > > > > -             l1_switch_latency += 1000;
> > > > > > > > +             if (link->aspm_capable & ASPM_STATE_L1) {
> > > > > > > > +                     latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > > > > > +                     l1_max_latency = max_t(u32, latency, l1_max_latency);
> > > > > > > > +                     if (l1_max_latency + l1_switch_latency > acceptable->l1)
> > > > > > > > +                             link->aspm_capable &= ~ASPM_STATE_L1;
> > > > > > > > +                     l1_switch_latency += 1000;
> > > > > > > > +             }
> > > > > > > >
> > > > > > > >               link = link->parent;
> > > > > > > >       }
> > > > > > > > --
> > > > > > > > 2.29.1
> > > > > > > >
Bjorn Helgaas Dec. 15, 2020, 12:40 a.m. UTC | #11
On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas <helgaas@kernel.org> wrote:

> > If you're interested, you could probably unload the Realtek drivers,
> > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> > in 02:04.0, e.g.,
> >
> >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> >   # echo 1 > $RT/0000:04:00.0/remove
> >   # echo 1 > $RT/0000:04:00.1/remove
> >   # echo 1 > $RT/0000:04:00.2/remove
> >   # echo 1 > $RT/0000:04:00.4/remove
> >   # echo 1 > $RT/0000:04:00.7/remove
> >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> >
> > That should take 04:00.x out of the picture.
> 
> Didn't actually change the behaviour, I'm suspecting an errata for AMD pcie...
> 
> So did this, with unpatched kernel:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> 
> and:
> echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm

BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
pleased that it seems to be working as intended.

> and:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                  receiver
> 
> But this only confirms that the fix i experience is a side effect.
> 
> The original code is still wrong :)

What exactly is this machine?  Brand, model, config?  Maybe you could
add this and a dmesg log to the bugzilla?  It seems like other people
should be seeing the same problem, so I'm hoping to grub around on the
web to see if there are similar reports involving these devices.

https://bugzilla.kernel.org/show_bug.cgi?id=209725

Here's one that is superficially similar:
https://linux-hardware.org/index.php?probe=e5f24075e5&log=lspci_all
in that it has a RP -- switch -- I211 path.  Interestingly, the switch
here advertises <64us L1 exit latency instead of the <32us latency
your switch advertises.  Of course, I can't tell if it's exactly the
same switch.

Bjorn
Ian Kumlien Dec. 15, 2020, 1:09 p.m. UTC | #12
On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> > > If you're interested, you could probably unload the Realtek drivers,
> > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> > > in 02:04.0, e.g.,
> > >
> > >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> > >   # echo 1 > $RT/0000:04:00.0/remove
> > >   # echo 1 > $RT/0000:04:00.1/remove
> > >   # echo 1 > $RT/0000:04:00.2/remove
> > >   # echo 1 > $RT/0000:04:00.4/remove
> > >   # echo 1 > $RT/0000:04:00.7/remove
> > >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> > >
> > > That should take 04:00.x out of the picture.
> >
> > Didn't actually change the behaviour, I'm suspecting an errata for AMD pcie...
> >
> > So did this, with unpatched kernel:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> > [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> > [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> > [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> > [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> > [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> >
> > and:
> > echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm
>
> BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
> pleased that it seems to be working as intended.

It was nice to find it for easy disabling :)

> > and:
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> > [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> > [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> > [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                  receiver
> >
> > But this only confirms that the fix i experience is a side effect.
> >
> > The original code is still wrong :)
>
> What exactly is this machine?  Brand, model, config?  Maybe you could
> add this and a dmesg log to the buzilla?  It seems like other people
> should be seeing the same problem, so I'm hoping to grub around on the
> web to see if there are similar reports involving these devices.

ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X

> https://bugzilla.kernel.org/show_bug.cgi?id=209725
>
> Here's one that is superficially similar:
> https://linux-hardware.org/index.php?probe=e5f24075e5&log=lspci_all
> in that it has a RP -- switch -- I211 path.  Interestingly, the switch
> here advertises <64us L1 exit latency instead of the <32us latency
> your switch advertises.  Of course, I can't tell if it's exactly the
> same switch.

Same chipset it seems

I'm running bios version:
        Version: 2206
        Release Date: 08/13/2020

And the latest is:
Version 3003
2020/12/07

Will test upgrading that as well, but it could be that they report the
incorrect latency of the switch - I don't know how many things AGESA
changes but... It's been updated twice since my upgrade.

> Bjorn
Bjorn Helgaas Dec. 16, 2020, 12:08 a.m. UTC | #13
On Tue, Dec 15, 2020 at 02:09:12PM +0100, Ian Kumlien wrote:
> On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> > > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > > > If you're interested, you could probably unload the Realtek drivers,
> > > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> > > > in 02:04.0, e.g.,
> > > >
> > > >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> > > >   # echo 1 > $RT/0000:04:00.0/remove
> > > >   # echo 1 > $RT/0000:04:00.1/remove
> > > >   # echo 1 > $RT/0000:04:00.2/remove
> > > >   # echo 1 > $RT/0000:04:00.4/remove
> > > >   # echo 1 > $RT/0000:04:00.7/remove
> > > >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> > > >
> > > > That should take 04:00.x out of the picture.
> > >
> > > Didn't actually change the behaviour, I'm suspecting an errata for AMD pcie...
> > >
> > > So did this, with unpatched kernel:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> > > [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> > > [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> > > [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> > > [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> > > [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> > >
> > > and:
> > > echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm
> >
> > BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
> > pleased that it seems to be working as intended.
> 
> It was nice to find it for easy disabling :)
> 
> > > and:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> > > [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> > > [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> > > [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                  receiver
> > >
> > > But this only confirms that the fix i experience is a side effect.
> > >
> > > The original code is still wrong :)
> >
> > What exactly is this machine?  Brand, model, config?  Maybe you could
> > add this and a dmesg log to the buzilla?  It seems like other people
> > should be seeing the same problem, so I'm hoping to grub around on the
> > web to see if there are similar reports involving these devices.
> 
> ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X

Possible similar issues:

  https://forums.unraid.net/topic/94274-hardware-upgrade-woes/
  https://forums.servethehome.com/index.php?threads/upgraded-my-home-server-from-intel-to-amd-virtual-disk-stuck-in-degraded-unhealty-state.25535/ (Windows)

> > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> >
> > Here's one that is superficially similar:
> > https://linux-hardware.org/index.php?probe=e5f24075e5&log=lspci_all
> > in that it has a RP -- switch -- I211 path.  Interestingly, the switch
> > here advertises <64us L1 exit latency instead of the <32us latency
> > your switch advertises.  Of course, I can't tell if it's exactly the
> > same switch.
> 
> Same chipset it seems
> 
> I'm running bios version:
>         Version: 2206
>         Release Date: 08/13/2020
> 
> ANd latest is:
> Version 3003
> 2020/12/07
> 
> Will test upgrading that as well, but it could be that they report the
> incorrect latency of the switch - I don't know how many things AGESA
> changes but... It's been updated twice since my upgrade.

I wouldn't be surprised if the advertised exit latencies are writable
by the BIOS because it probably depends on electrical characteristics
outside the switch.  If so, it's possible ASUS just screwed it up.
Ian Kumlien Dec. 16, 2020, 11:20 a.m. UTC | #14
On Wed, Dec 16, 2020 at 1:08 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Tue, Dec 15, 2020 at 02:09:12PM +0100, Ian Kumlien wrote:
> > On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> > > > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > > > If you're interested, you could probably unload the Realtek drivers,
> > > > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> > > > > in 02:04.0, e.g.,
> > > > >
> > > > >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> > > > >   # echo 1 > $RT/0000:04:00.0/remove
> > > > >   # echo 1 > $RT/0000:04:00.1/remove
> > > > >   # echo 1 > $RT/0000:04:00.2/remove
> > > > >   # echo 1 > $RT/0000:04:00.4/remove
> > > > >   # echo 1 > $RT/0000:04:00.7/remove
> > > > >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> > > > >
> > > > > That should take 04:00.x out of the picture.
> > > >
> > > > Didn't actually change the behaviour, I'm suspecting an errata for AMD pcie...
> > > >
> > > > So did this, with unpatched kernel:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> > > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> > > > [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> > > > [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> > > > [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> > > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> > > > [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> > > > [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> > > >
> > > > and:
> > > > echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm
> > >
> > > BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
> > > pleased that it seems to be working as intended.
> >
> > It was nice to find it for easy disabling :)
> >
> > > > and:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> > > > [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> > > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> > > > [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> > > > [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> > > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                  receiver
> > > >
> > > > But this only confirms that the fix i experience is a side effect.
> > > >
> > > > The original code is still wrong :)
> > >
> > > What exactly is this machine?  Brand, model, config?  Maybe you could
> > > add this and a dmesg log to the buzilla?  It seems like other people
> > > should be seeing the same problem, so I'm hoping to grub around on the
> > > web to see if there are similar reports involving these devices.
> >
> > ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X
>
> Possible similar issues:
>
>   https://forums.unraid.net/topic/94274-hardware-upgrade-woes/
>   https://forums.servethehome.com/index.php?threads/upgraded-my-home-server-from-intel-to-amd-virtual-disk-stuck-in-degraded-unhealty-state.25535/ (Windows)

Could be, I suspect that we need a workaround (is there a quirk for
"reporting wrong latency"?) and the patches.

> > > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> > >
> > > Here's one that is superficially similar:
> > > https://linux-hardware.org/index.php?probe=e5f24075e5&log=lspci_all
> > > in that it has a RP -- switch -- I211 path.  Interestingly, the switch
> > > here advertises <64us L1 exit latency instead of the <32us latency
> > > your switch advertises.  Of course, I can't tell if it's exactly the
> > > same switch.
> >
> > Same chipset it seems
> >
> > I'm running bios version:
> >         Version: 2206
> >         Release Date: 08/13/2020
> >
> > ANd latest is:
> > Version 3003
> > 2020/12/07
> >
> > Will test upgrading that as well, but it could be that they report the
> > incorrect latency of the switch - I don't know how many things AGESA
> > changes but... It's been updated twice since my upgrade.
>
> I wouldn't be surprised if the advertised exit latencies are writable
> by the BIOS because it probably depends on electrical characteristics
> outside the switch.  If so, it's possible ASUS just screwed it up.

Not surprisingly, nothing changed.
(There was a lot of "stability improvements")
Bjorn Helgaas Dec. 16, 2020, 11:21 p.m. UTC | #15
On Wed, Dec 16, 2020 at 12:20:53PM +0100, Ian Kumlien wrote:
> On Wed, Dec 16, 2020 at 1:08 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Tue, Dec 15, 2020 at 02:09:12PM +0100, Ian Kumlien wrote:
> > > On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> > > > > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > >
> > > > > > If you're interested, you could probably unload the Realtek drivers,
> > > > > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> > > > > > in 02:04.0, e.g.,
> > > > > >
> > > > > >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> > > > > >   # echo 1 > $RT/0000:04:00.0/remove
> > > > > >   # echo 1 > $RT/0000:04:00.1/remove
> > > > > >   # echo 1 > $RT/0000:04:00.2/remove
> > > > > >   # echo 1 > $RT/0000:04:00.4/remove
> > > > > >   # echo 1 > $RT/0000:04:00.7/remove
> > > > > >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> > > > > >
> > > > > > That should take 04:00.x out of the picture.
> > > > >
> > > > > Didn't actually change the behaviour, I'm suspecting an errata for AMD pcie...
> > > > >
> > > > > So did this, with unpatched kernel:
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> > > > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> > > > > [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> > > > > [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> > > > > [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> > > > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > > [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> > > > > [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > > [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> > > > > [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> > > > >
> > > > > and:
> > > > > echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm
> > > >
> > > > BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
> > > > pleased that it seems to be working as intended.
> > >
> > > It was nice to find it for easy disabling :)
> > >
> > > > > and:
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> > > > > [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> > > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> > > > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> > > > > [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> > > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> > > > > [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> > > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> > > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> > > > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> > > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                  receiver
> > > > >
> > > > > But this only confirms that the fix i experience is a side effect.
> > > > >
> > > > > The original code is still wrong :)
> > > >
> > > > What exactly is this machine?  Brand, model, config?  Maybe you could
> > > > add this and a dmesg log to the buzilla?  It seems like other people
> > > > should be seeing the same problem, so I'm hoping to grub around on the
> > > > web to see if there are similar reports involving these devices.
> > >
> > > ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X
> >
> > Possible similar issues:
> >
> >   https://forums.unraid.net/topic/94274-hardware-upgrade-woes/
> >   https://forums.servethehome.com/index.php?threads/upgraded-my-home-server-from-intel-to-amd-virtual-disk-stuck-in-degraded-unhealty-state.25535/ (Windows)
> 
> Could be, I suspect that we need a workaround (is there a quirk for
> "reporting wrong latency"?) and the patches.

I don't think there's currently a quirk mechanism that would work for
correcting latencies, but there should be, and we could add one if we
can figure out for sure what's wrong.
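
If we did grow one, a very rough sketch of the shape it could take, purely
hypothetical: it assumes Link Capabilities were cached in a (currently
nonexistent) pci_dev field, and the device ID and board string below are
placeholders, not verified values:

#include <linux/pci.h>
#include <linux/dmi.h>

#define SWITCH_UPSTREAM_DEVID  0x57ad           /* placeholder device ID */

static void quirk_x570_l1_exit_latency(struct pci_dev *pdev)
{
        if (!dmi_match(DMI_BOARD_NAME, "Pro WS X570-ACE"))  /* placeholder */
                return;

        /*
         * Pretend the Switch Upstream Port advertises <64us L1 exit
         * latency.  "lnkcap" is the hypothetical cached copy of Link
         * Capabilities, not a field that exists today.
         */
        pdev->lnkcap &= ~PCI_EXP_LNKCAP_L1EL;
        pdev->lnkcap |= 6 << 15;                /* 110b == <64us */
}
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, SWITCH_UPSTREAM_DEVID,
                        quirk_x570_l1_exit_latency);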

I found this:

  https://www.reddit.com/r/VFIO/comments/hgk3cz/x570_pcieclassic_pci_bridge_woes/

which looks like it should be the same hardware (if you can collect a
dmesg log or "lspci -nnvv" output we could tell for sure) and is
interesting because it includes some lspci output that shows different
L1 exit latencies than what you see.

> > > > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> > > >
> > > > Here's one that is superficially similar:
> > > > https://linux-hardware.org/index.php?probe=e5f24075e5&log=lspci_all
> > > > in that it has a RP -- switch -- I211 path.  Interestingly, the switch
> > > > here advertises <64us L1 exit latency instead of the <32us latency
> > > > your switch advertises.  Of course, I can't tell if it's exactly the
> > > > same switch.
> > >
> > > Same chipset it seems
> > >
> > > I'm running bios version:
> > >         Version: 2206
> > >         Release Date: 08/13/2020
> > >
> > > ANd latest is:
> > > Version 3003
> > > 2020/12/07
> > >
> > > Will test upgrading that as well, but it could be that they report the
> > > incorrect latency of the switch - I don't know how many things AGESA
> > > changes but... It's been updated twice since my upgrade.
> >
> > I wouldn't be surprised if the advertised exit latencies are writable
> > by the BIOS because it probably depends on electrical characteristics
> > outside the switch.  If so, it's possible ASUS just screwed it up.
> 
> Not surprisingly, nothing changed.
> (There was a lot of "stability improvements")

I wouldn't be totally surprised if ASUS didn't test that I211 NIC
under Linux, but I'm sure it must work well under Windows.  If you
happen to have Windows, a free trial version of AIDA64 should be able
to give us the equivalent of "lspci -vv".

Bjorn
Ian Kumlien Dec. 17, 2020, 11:37 p.m. UTC | #16
On Thu, Dec 17, 2020 at 12:21 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Dec 16, 2020 at 12:20:53PM +0100, Ian Kumlien wrote:
> > On Wed, Dec 16, 2020 at 1:08 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Tue, Dec 15, 2020 at 02:09:12PM +0100, Ian Kumlien wrote:
> > > > On Tue, Dec 15, 2020 at 1:40 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Mon, Dec 14, 2020 at 11:56:31PM +0100, Ian Kumlien wrote:
> > > > > > On Mon, Dec 14, 2020 at 8:19 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > >
> > > > > > > If you're interested, you could probably unload the Realtek drivers,
> > > > > > > remove the devices, and set the PCI_EXP_LNKCTL_LD (Link Disable) bit
> > > > > > > in 02:04.0, e.g.,
> > > > > > >
> > > > > > >   # RT=/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0
> > > > > > >   # echo 1 > $RT/0000:04:00.0/remove
> > > > > > >   # echo 1 > $RT/0000:04:00.1/remove
> > > > > > >   # echo 1 > $RT/0000:04:00.2/remove
> > > > > > >   # echo 1 > $RT/0000:04:00.4/remove
> > > > > > >   # echo 1 > $RT/0000:04:00.7/remove
> > > > > > >   # setpci -s02:04.0 CAP_EXP+0x10.w=0x0010
> > > > > > >
> > > > > > > That should take 04:00.x out of the picture.
> > > > > >
> > > > > > Didn't actually change the behaviour, I'm suspecting an errata for AMD pcie...
> > > > > >
> > > > > > So did this, with unpatched kernel:
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > [  5]   0.00-1.00   sec  4.56 MBytes  38.2 Mbits/sec    0   67.9 KBytes
> > > > > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   96.2 KBytes
> > > > > > [  5]   2.00-3.00   sec  4.85 MBytes  40.7 Mbits/sec    0   50.9 KBytes
> > > > > > [  5]   3.00-4.00   sec  4.23 MBytes  35.4 Mbits/sec    0   70.7 KBytes
> > > > > > [  5]   4.00-5.00   sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > > > [  5]   5.00-6.00   sec  4.23 MBytes  35.4 Mbits/sec    0   45.2 KBytes
> > > > > > [  5]   6.00-7.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > > > [  5]   7.00-8.00   sec  3.98 MBytes  33.4 Mbits/sec    0   36.8 KBytes
> > > > > > [  5]   8.00-9.00   sec  4.23 MBytes  35.4 Mbits/sec    0   36.8 KBytes
> > > > > > [  5]   9.00-10.00  sec  4.23 MBytes  35.4 Mbits/sec    0   48.1 KBytes
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-10.00  sec  43.2 MBytes  36.2 Mbits/sec    0             sender
> > > > > > [  5]   0.00-10.00  sec  42.7 MBytes  35.8 Mbits/sec                  receiver
> > > > > >
> > > > > > and:
> > > > > > echo 0 > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm
> > > > >
> > > > > BTW, thanks a lot for testing out the "l1_aspm" sysfs file.  I'm very
> > > > > pleased that it seems to be working as intended.
> > > >
> > > > It was nice to find it for easy disabling :)
> > > >
> > > > > > and:
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > > [  5]   0.00-1.00   sec   113 MBytes   951 Mbits/sec  153    772 KBytes
> > > > > > [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  276    550 KBytes
> > > > > > [  5]   2.00-3.00   sec   111 MBytes   933 Mbits/sec  123    625 KBytes
> > > > > > [  5]   3.00-4.00   sec   111 MBytes   933 Mbits/sec   31    687 KBytes
> > > > > > [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec    0    679 KBytes
> > > > > > [  5]   5.00-6.00   sec   110 MBytes   923 Mbits/sec  136    577 KBytes
> > > > > > [  5]   6.00-7.00   sec   110 MBytes   923 Mbits/sec  214    645 KBytes
> > > > > > [  5]   7.00-8.00   sec   110 MBytes   923 Mbits/sec   32    628 KBytes
> > > > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec   81    537 KBytes
> > > > > > [  5]   9.00-10.00  sec   110 MBytes   923 Mbits/sec   10    577 KBytes
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-10.00  sec  1.08 GBytes   927 Mbits/sec  1056             sender
> > > > > > [  5]   0.00-10.00  sec  1.07 GBytes   923 Mbits/sec                  receiver
> > > > > >
> > > > > > But this only confirms that the fix i experience is a side effect.
> > > > > >
> > > > > > The original code is still wrong :)
> > > > >
> > > > > What exactly is this machine?  Brand, model, config?  Maybe you could
> > > > > add this and a dmesg log to the buzilla?  It seems like other people
> > > > > should be seeing the same problem, so I'm hoping to grub around on the
> > > > > web to see if there are similar reports involving these devices.
> > > >
> > > > ASUS Pro WS X570-ACE with AMD Ryzen 9 3900X
> > >
> > > Possible similar issues:
> > >
> > >   https://forums.unraid.net/topic/94274-hardware-upgrade-woes/
> > >   https://forums.servethehome.com/index.php?threads/upgraded-my-home-server-from-intel-to-amd-virtual-disk-stuck-in-degraded-unhealty-state.25535/ (Windows)
> >
> > Could be, I suspect that we need a workaround (is there a quirk for
> > "reporting wrong latency"?) and the patches.
>
> I don't think there's currently a quirk mechanism that would work for
> correcting latencies, but there should be, and we could add one if we
> can figure out for sure what's wrong.
>
> I found this:
>
>   https://www.reddit.com/r/VFIO/comments/hgk3cz/x570_pcieclassic_pci_bridge_woes/
>
> which looks like it should be the same hardware (if you can collect a
> dmesg log or "lspci -nnvv" output we could tell for sure) and is
> interesting because it includes some lspci output that shows different
> L1 exit latencies than what you see.

I'll send both of them separately to you, no reason to push that to
everyone I assume... =)

> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=209725
> > > > >
> > > > > Here's one that is superficially similar:
> > > > > https://linux-hardware.org/index.php?probe=e5f24075e5&log=lspci_all
> > > > > in that it has a RP -- switch -- I211 path.  Interestingly, the switch
> > > > > here advertises <64us L1 exit latency instead of the <32us latency
> > > > > your switch advertises.  Of course, I can't tell if it's exactly the
> > > > > same switch.
> > > >
> > > > Same chipset it seems
> > > >
> > > > I'm running bios version:
> > > >         Version: 2206
> > > >         Release Date: 08/13/2020
> > > >
> > > > ANd latest is:
> > > > Version 3003
> > > > 2020/12/07
> > > >
> > > > Will test upgrading that as well, but it could be that they report the
> > > > incorrect latency of the switch - I don't know how many things AGESA
> > > > changes but... It's been updated twice since my upgrade.
> > >
> > > I wouldn't be surprised if the advertised exit latencies are writable
> > > by the BIOS because it probably depends on electrical characteristics
> > > outside the switch.  If so, it's possible ASUS just screwed it up.
> >
> > Not surprisingly, nothing changed.
> > (There was a lot of "stability improvements")
>
> I wouldn't be totally surprised if ASUS didn't test that I211 NIC
> under Linux, but I'm sure it must work well under Windows.  If you
> happen to have Windows, a free trial version of AIDA64 should be able
> to give us the equivalent of "lspci -vv".

I don't have Windows, haven't had Windows at home since '98 ;)

I'll check with some friends that dual-boot on systems that might be
similar - will see what I can get


> Bjorn
Bjorn Helgaas Jan. 12, 2021, 8:42 p.m. UTC | #17
On Sat, Oct 24, 2020 at 10:55:46PM +0200, Ian Kumlien wrote:
> Make pcie_aspm_check_latency comply with the PCIe spec, specifically:
> "5.4.1.2.2. Exit from the L1 State"
> 
> Which makes it clear that each switch is required to initiate a
> transition within 1μs from receiving it, accumulating this latency and
> then we have to wait for the slowest link along the path before
> entering L0 state from L1.
> 
> The current code doesn't take the maximum latency into account.
> 
> From the example:
>    +----------------+
>    |                |
>    |  Root complex  |
>    |                |
>    |    +-----+     |
>    |    |32 μs|     |
>    +----------------+
>            |
>            |  Link 1
>            |
>    +----------------+
>    |     |8 μs|     |
>    |     +----+     |
>    |    Switch A    |
>    |     +----+     |
>    |     |8 μs|     |
>    +----------------+
>            |
>            |  Link 2
>            |
>    +----------------+
>    |    |32 μs|     |
>    |    +-----+     |
>    |    Switch B    |
>    |    +-----+     |
>    |    |32 μs|     |
>    +----------------+
>            |
>            |  Link 3
>            |
>    +----------------+
>    |     |8μs|      |
>    |     +---+      |
>    |   Endpoint C   |
>    |                |
>    |                |
>    +----------------+
> 
> Links 1, 2 and 3 are all in L1 state - endpoint C initiates the
> transition to L0 at time T. Since switch B takes 32 μs to exit L1 on
> it's ports, Link 3 will transition to L0 at T+32 (longest time
> considering T+8 for endpoint C and T+32 for switch B).
> 
> Switch B is required to initiate a transition from the L1 state on it's
> upstream port after no more than 1 μs from the beginning of the
> transition from L1 state on the downstream port. Therefore, transition from
> L1 to L0 will begin on link 2 at T+1, this will cascade up the path.
> 
> The path will exit L1 at T+34.
> 
> On my specific system:
> 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> 04:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 816e (rev 1a)
> 
>             Exit latency       Acceptable latency
> Tree:       L1       L0s       L1       L0s
> ----------  -------  -----     -------  ------
> 00:01.2     <32 us   -
> | 01:00.0   <32 us   -
> |- 02:03.0  <32 us   -
> | \03:00.0  <16 us   <2us      <64 us   <512ns
> |
> \- 02:04.0  <32 us   -
>   \04:00.0  <64 us   unlimited <64 us   <512ns
> 
> 04:00.0's latency is the same as the maximum it allows so as we walk the path
> the first switchs startup latency will pass the acceptable latency limit
> for the link, and as a side-effect it fixes my issues with 03:00.0.

I don't think this is quite right.  We're looking at the path to
04:00.0, which includes two Links and one Switch:

The upstream Link:
  00:01.2 AMD Root Port               L1 Exit Latency <32us
  01:00.0 AMD Switch Upstream Port    L1 Exit Latency <32us

The downstream Link:
  02:04.0 AMD Switch Downstream Port  L1 Exit Latency <32us
  04:00.0 Realtek Endpoint            L1 Exit Latency <64us, Acceptable 64us

If both Links are in L1 and 04:00.0 needs to use the Link at time T,
I think the following events are relevant:

  T        04:00.0 initiates L1 exit on downstream Link
  T+1us    01:00.0 initiates L1 exit on upstream Link
  T+33us   upstream Link is in L0 (32us after initiating exit)
  T+64us   downstream Link is in L0 (64us after initiating exit)

The upstream Link's L1 exit latency is completely covered by the
downstream Link's, so 04:00.0 *should* only see its own exit latency
(64us), which it claims to be able to tolerate.

This patch computes "l1_max_latency + l1_switch_latency", which is
64us + 1us in this case.  I don't think it's correct to add the 1us
here because that delay is only relevant to the upstream Link.  We
should add it to the *upstream Link's* exit latency, but even with
that, its exit latency is only 33us from 04:00.0's point of view.
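
To put numbers on that, here is a minimal userspace sketch (plain C, not
kernel code) of the two models for this path, using the exit latencies from
the example above; "existing" is the per-Link latency plus accumulated
Switch delay, "patched" is the path-max latency plus accumulated Switch
delay:

#include <stdio.h>

int main(void)
{
        /*
         * Links walked from 04:00.0 upward: first the downstream Link
         * (02:04.0 <-> 04:00.0, <64us), then the upstream Link
         * (00:01.2 <-> 01:00.0, <32us).  Values in microseconds.
         */
        unsigned int link_l1[] = { 64, 32 };
        unsigned int acceptable = 64;   /* 04:00.0 Acceptable L1 latency */
        unsigned int switch_delay = 0, path_max = 0;

        for (int i = 0; i < 2; i++) {
                unsigned int existing = link_l1[i] + switch_delay;

                if (link_l1[i] > path_max)
                        path_max = link_l1[i];

                printf("link %d: existing %uus, patched %uus, acceptable %uus\n",
                       i, existing, path_max + switch_delay, acceptable);
                switch_delay += 1;      /* up to 1us per Switch traversed */
        }
        return 0;
}

The existing model tops out at 33us for the upstream Link, which the 64us
the endpoint already tolerates easily covers; the patched model reaches
65us there and turns L1 off.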

> Without this patch, 03:00.0 misbehaves and only gives me ~40 mbit/s over
> links with 6 or more hops. With this patch I'm back to a maximum of ~933
> mbit/s.
> 
> The original code path did:
> 04:00:0-02:04.0 max latency 64    -> ok
> 02:04.0-01:00.0 max latency 32 +1 -> ok
> 01:00.0-00:01.2 max latency 32 +2 -> ok
> 
> And thus didn't see any L1 ASPM latency issues.
> 
> The new code does:
> 04:00:0-02:04.0 max latency 64    -> ok
> 02:04.0-01:00.0 max latency 64 +1 -> latency exceeded
> 01:00.0-00:01.2 max latency 64 +2 -> latency exceeded
> 
> It correctly identifies the issue.
> 
> For reference, pcie information:
> https://bugzilla.kernel.org/show_bug.cgi?id=209725
> 
> Kai-Heng Feng has a machine that will not boot with ASPM without this patch,
> information is documented here:
> https://bugzilla.kernel.org/show_bug.cgi?id=209671
> 
> Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com>
> Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/pci/pcie/aspm.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index 253c30cc1967..c03ead0f1013 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
>  
>  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
>  {
> -	u32 latency, l1_switch_latency = 0;
> +	u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
>  	struct aspm_latency *acceptable;
>  	struct pcie_link_state *link;
>  
> @@ -456,10 +456,14 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
>  		if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
>  		    (link->latency_dw.l0s > acceptable->l0s))
>  			link->aspm_capable &= ~ASPM_STATE_L0S_DW;
> +
>  		/*
>  		 * Check L1 latency.
> -		 * Every switch on the path to root complex need 1
> -		 * more microsecond for L1. Spec doesn't mention L0s.
> +		 *
> +		 * PCIe r5.0, sec 5.4.1.2.2 states:
> +		 * A Switch is required to initiate an L1 exit transition on its
> +		 * Upstream Port Link after no more than 1 μs from the beginning of an
> +		 * L1 exit transition on any of its Downstream Port Links.
>  		 *
>  		 * The exit latencies for L1 substates are not advertised
>  		 * by a device.  Since the spec also doesn't mention a way
> @@ -469,11 +473,13 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
>  		 * L1 exit latencies advertised by a device include L1
>  		 * substate latencies (and hence do not do any check).
>  		 */
> -		latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> -		if ((link->aspm_capable & ASPM_STATE_L1) &&
> -		    (latency + l1_switch_latency > acceptable->l1))
> -			link->aspm_capable &= ~ASPM_STATE_L1;
> -		l1_switch_latency += 1000;
> +		if (link->aspm_capable & ASPM_STATE_L1) {
> +			latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> +			l1_max_latency = max_t(u32, latency, l1_max_latency);
> +			if (l1_max_latency + l1_switch_latency > acceptable->l1)
> +				link->aspm_capable &= ~ASPM_STATE_L1;
> +			l1_switch_latency += 1000;
> +		}

This is pretty subtle but I *think* the existing code is actually
correct.  The exit latency of a downstream Link overlaps all except
1us of the latency of the next upstream Link, so I don't think we have
to add the total Switch delay to the max Link exit latency.  We only
have to add the Switch delays downstream of Link X to Link X's exit
latency.

Also, I think we should accumulate the 1us Switch latency for *all*
Switches as the existing code does.  For the case where a Switch's
Upstream Port is L1-capable but the Downstream Port is not, this patch
doesn't add any l1_switch_latency.  That assumes the Switch can start
the upstream L1 exit instantly upon receipt of a TLP at the Downstream
Port.  I think we should assume it takes the same time (up to 1us) to
start that exit as it would if the Downstream Port were in L1.
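
To make that concrete with made-up numbers (an endpoint that tolerates
64us, sitting below a Switch whose Downstream Port doesn't do L1, with a
64us L1 exit latency on the Link above the Switch), a tiny sketch:

#include <stdio.h>

int main(void)
{
        unsigned int acceptable = 64;   /* endpoint's Acceptable L1 latency, us */
        unsigned int upstream_l1 = 64;  /* exit latency of the Link above the Switch */

        /* Patch as posted: a non-L1 downstream Link adds no Switch delay */
        printf("conditional accumulation:   %uus vs %uus acceptable\n",
               upstream_l1, acceptable);

        /* Charging the Switch its 1us anyway, as the existing code does */
        printf("unconditional accumulation: %uus vs %uus acceptable\n",
               upstream_l1 + 1, acceptable);
        return 0;
}

With the conditional accumulation the upstream Link squeaks by at exactly
64us; charging the Switch its 1us anyway pushes it to 65us and L1 would be
disabled.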

>  		link = link->parent;
>  	}

My guess is the real problem is the Switch is advertising incorrect
exit latencies.  If the Switch advertised "<64us" exit latency for its
Upstream Port, we'd compute "64us exit latency + 1us Switch delay =
65us", which is more than either 03:00.0 or 04:00.0 can tolerate, so
we would disable L1 on that upstream Link.

Working around this would require some sort of quirk to override the
values read from the Switch, which is a little messy.  Here's what I'm
thinking (not asking you to do this; just trying to think of an
approach):

  - Configure common clock earlier, in pci_configure_device(), because
    this changes the "read-only" L1 exit latencies in Link
    Capabilities.

  - Cache Link Capabilities in the pci_dev.

  - Add a quirk to override the cached values for the Switch based on
    Vendor/Device ID and probably DMI motherboard/BIOS info.

Bjorn
Ian Kumlien Jan. 28, 2021, 12:41 p.m. UTC | #18
Sorry about the late reply, been trying to figure out what goes wrong
since this email...

And yes, I think you're right - the fact that it fixed my system was
basically too good to be true =)

On Tue, Jan 12, 2021 at 9:42 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> My guess is the real problem is the Switch is advertising incorrect
> exit latencies.  If the Switch advertised "<64us" exit latency for its
> Upstream Port, we'd compute "64us exit latency + 1us Switch delay =
> 65us", which is more than either 03:00.0 or 04:00.0 can tolerate, so
> we would disable L1 on that upstream Link.
>
> Working around this would require some sort of quirk to override the
> values read from the Switch, which is a little messy.  Here's what I'm
> thinking (not asking you to do this; just trying to think of an
> approach):

The question is if it's the device or if it's the bridge...

I.e., if the device can't quite handle it or if the bridge/switch/port
advertises the wrong latency. I have a friend with a similar motherboard
and he has the same latency values - but his kernel doesn't apply ASPM.

I also want to check L0s since it seems to be enabled...

>   - Configure common clock earlier, in pci_configure_device(), because
>     this changes the "read-only" L1 exit latencies in Link
>     Capabilities.
>
>   - Cache Link Capabilities in the pci_dev.
>
>   - Add a quirk to override the cached values for the Switch based on
>     Vendor/Device ID and probably DMI motherboard/BIOS info.

I can't say and I actually think it depends on the actual culprit
which we haven't quite identified yet...

> Bjorn
Ian Kumlien Feb. 24, 2021, 10:19 p.m. UTC | #19
On Thu, Jan 28, 2021 at 1:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
>
> Sorry about the late reply, been trying to figure out what goes wrong
> since this email...
>
> And yes, I think you're right - the fact that it fixed my system was
> basically too good to be true =)

So, finally had some time to look at this again...

I played some with:
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index ac0557a305af..fdf252eee206 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -392,13 +392,13 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)

        while (link) {
                /* Check upstream direction L0s latency */
-               if ((link->aspm_capable & ASPM_STATE_L0S_UP) &&
-                   (link->latency_up.l0s > acceptable->l0s))
+               if ((link->aspm_capable & ASPM_STATE_L0S_UP) /* &&
+                   (link->latency_up.l0s > acceptable->l0s)*/)
                        link->aspm_capable &= ~ASPM_STATE_L0S_UP;

                /* Check downstream direction L0s latency */
-               if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
-                   (link->latency_dw.l0s > acceptable->l0s))
+               if ((link->aspm_capable & ASPM_STATE_L0S_DW) /* &&
+                   (link->latency_dw.l0s > acceptable->l0s)*/)
                        link->aspm_capable &= ~ASPM_STATE_L0S_DW;
                /*
                 * Check L1 latency.
---

Which does perform better but doesn't solve all the issues...

Home machine:
Latency:       3.364 ms
Download:    640.170 Mbit/s
Upload:      918.865 Mbit/s

My test server:
Latency:       4.549 ms
Download:    945.137 Mbit/s
Upload:      957.848 Mbit/s

But iperf3 still gets bogged down...
[  5]   0.00-1.00   sec  4.66 MBytes  39.0 Mbits/sec    0   82.0 KBytes
[  5]   1.00-2.00   sec  4.60 MBytes  38.6 Mbits/sec    0   79.2 KBytes
[  5]   2.00-3.00   sec  4.47 MBytes  37.5 Mbits/sec    0   56.6 KBytes

And with L1 ASPM disabled as usual:
[  5]   0.00-1.00   sec   112 MBytes   938 Mbits/sec  439    911 KBytes
[  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  492    888 KBytes
[  5]   2.00-3.00   sec   110 MBytes   923 Mbits/sec  370   1.07 MBytes

And just for reference, bredbandskollen again with L1 ASPM disabled:
Latency:       2.281 ms
Download:    742.136 Mbit/s
Upload:      938.053 Mbit/s

Anyway, we started to look at the PCIe bridges etc, but I think it's the
network card that is at fault, either with advertised latencies (should be
lower) or some bug, since other cards and peripherals connected to the
system work just fine...

So, L0s actually seems to have somewhat of an impact - which I found
surprising since both machines are ~6 hops away - however latency differs
(measured with TCP)

Can we collect L1 ASPM latency numbers for:
Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
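
For what it's worth, a rough userspace sketch that pulls the two relevant
numbers out of config space via sysfs (the BDF is the I211 on my box,
03:00.0 - adjust as needed; reading past the first 64 bytes of config
space normally needs root, and the sketch assumes a little-endian host):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
        const char *path = "/sys/bus/pci/devices/0000:03:00.0/config";
        uint8_t cfg[256];
        uint32_t devcap, lnkcap;
        uint8_t pos;
        FILE *f = fopen(path, "rb");

        if (!f || fread(cfg, 1, sizeof(cfg), f) != sizeof(cfg)) {
                perror(path);
                return 1;
        }
        fclose(f);

        /* Walk the capability list (pointer at 0x34) to the PCIe cap (ID 0x10) */
        pos = cfg[0x34] & 0xfc;
        while (pos && cfg[pos] != 0x10)
                pos = cfg[pos + 1] & 0xfc;
        if (!pos) {
                fprintf(stderr, "no PCI Express capability\n");
                return 1;
        }

        memcpy(&devcap, cfg + pos + 0x04, 4);   /* Device Capabilities */
        memcpy(&lnkcap, cfg + pos + 0x0c, 4);   /* Link Capabilities */

        /* L1 exit latency: 0=<1us ... 5=<32us 6=<64us 7=>64us; acceptable: 7=no limit */
        printf("LnkCap L1 exit latency code:       %u\n", (lnkcap >> 15) & 0x7);
        printf("DevCap acceptable L1 latency code: %u\n", (devcap >> 9) & 0x7);
        return 0;
}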

> On Tue, Jan 12, 2021 at 9:42 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > My guess is the real problem is the Switch is advertising incorrect
> > exit latencies.  If the Switch advertised "<64us" exit latency for its
> > Upstream Port, we'd compute "64us exit latency + 1us Switch delay =
> > 65us", which is more than either 03:00.0 or 04:00.0 can tolerate, so
> > we would disable L1 on that upstream Link.
> >
> > Working around this would require some sort of quirk to override the
> > values read from the Switch, which is a little messy.  Here's what I'm
> > thinking (not asking you to do this; just trying to think of an
> > approach):
>
> The question is if it's the device or if it's the bridge...
>
> Ie, if the device can't quite handle it or if the bridge/switch/port
> advertises the wrong latency
> I have a friend with a similar motherboard and he has the same latency
> values - but his kernel doesn't apply ASPM
>
> I also want to check L0s since it seems to be enabled...
>
> >   - Configure common clock earlier, in pci_configure_device(), because
> >     this changes the "read-only" L1 exit latencies in Link
> >     Capabilities.
> >
> >   - Cache Link Capabilities in the pci_dev.
> >
> >   - Add a quirk to override the cached values for the Switch based on
> >     Vendor/Device ID and probably DMI motherboard/BIOS info.
>
> I can't say and I actually think it depends on the actual culprit
> which we haven't quite identified yet...
>
> > Bjorn
Bjorn Helgaas Feb. 25, 2021, 10:03 p.m. UTC | #20
On Wed, Feb 24, 2021 at 11:19:55PM +0100, Ian Kumlien wrote:
> On Thu, Jan 28, 2021 at 1:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > Sorry about the late reply, been trying to figure out what goes wrong
> > since this email...
> >
> > And yes, I think you're right - the fact that it fixed my system was
> > basically too good to be true =)
> 
> So, finally had some time to look at this again...
> 
> I played some with:
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index ac0557a305af..fdf252eee206 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -392,13 +392,13 @@ static void pcie_aspm_check_latency(struct
> pci_dev *endpoint)
> 
>         while (link) {
>                 /* Check upstream direction L0s latency */
> -               if ((link->aspm_capable & ASPM_STATE_L0S_UP) &&
> -                   (link->latency_up.l0s > acceptable->l0s))
> +               if ((link->aspm_capable & ASPM_STATE_L0S_UP) /* &&
> +                   (link->latency_up.l0s > acceptable->l0s)*/)
>                         link->aspm_capable &= ~ASPM_STATE_L0S_UP;
> 
>                 /* Check downstream direction L0s latency */
> -               if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
> -                   (link->latency_dw.l0s > acceptable->l0s))
> +               if ((link->aspm_capable & ASPM_STATE_L0S_DW) /* &&
> +                   (link->latency_dw.l0s > acceptable->l0s)*/)
>                         link->aspm_capable &= ~ASPM_STATE_L0S_DW;
>                 /*
>                  * Check L1 latency.
> ---
> 
> Which does perform better but doesn't solve all the issues...
> 
> Home machine:
> Latency:       3.364 ms
> Download:    640.170 Mbit/s
> Upload:      918.865 Mbit/s
> 
> My test server:
> Latency:       4.549 ms
> Download:    945.137 Mbit/s
> Upload:      957.848 Mbit/s
> 
> But iperf3 still gets bogged down...
> [  5]   0.00-1.00   sec  4.66 MBytes  39.0 Mbits/sec    0   82.0 KBytes
> [  5]   1.00-2.00   sec  4.60 MBytes  38.6 Mbits/sec    0   79.2 KBytes
> [  5]   2.00-3.00   sec  4.47 MBytes  37.5 Mbits/sec    0   56.6 KBytes
> 
> And with L1 ASPM disabled as usual:
> [  5]   0.00-1.00   sec   112 MBytes   938 Mbits/sec  439    911 KBytes
> [  5]   1.00-2.00   sec   109 MBytes   912 Mbits/sec  492    888 KBytes
> [  5]   2.00-3.00   sec   110 MBytes   923 Mbits/sec  370   1.07 MBytes
> 
> And just for reference, bredbandskollen again with L1 ASPM disabled:
> Latency:       2.281 ms
> Download:    742.136 Mbit/s
> Upload:      938.053 Mbit/s
> 
> Anyway, we started to look at the PCIe bridges etc, but i think it's
> the network card that is at fault, either with advertised latencies
> (should be lower) or some bug since other cards and peripherals
> connected to the system works just fine...
> 
> So, L0s actually seems to have somewhat of an impact - which I found
> surprising sice both machines are ~6 hops away - however latency
> differs (measured with tcp)
> 
> Can we collect L1 ASPM latency numbers for:
> Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

I think the most useful information would be the ASPM configuration of
the tree rooted at 00:01.2 under Windows, since there the NIC should
be supported and have good performance.

Bjorn
Ian Kumlien April 26, 2021, 2:36 p.m. UTC | #21
On Thu, Feb 25, 2021 at 11:03 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Feb 24, 2021 at 11:19:55PM +0100, Ian Kumlien wrote:

[... 8<...]

> I think the most useful information would be the ASPM configuration of
> the tree rooted at 00:01.2 under Windows, since there the NIC should
> be supported and have good performance.

So the AMD BIOS patches to fix USB issues seem to have fixed this
issue as well!

There are still further BIOS fixes available, but currently those are
only in beta form
Bjorn Helgaas April 28, 2021, 9:15 p.m. UTC | #22
On Mon, Apr 26, 2021 at 04:36:24PM +0200, Ian Kumlien wrote:
> On Thu, Feb 25, 2021 at 11:03 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Wed, Feb 24, 2021 at 11:19:55PM +0100, Ian Kumlien wrote:
> 
> [... 8<...]
> 
> > I think the most useful information would be the ASPM configuration of
> > the tree rooted at 00:01.2 under Windows, since there the NIC should
> > be supported and have good performance.
> 
> So the AMD bios patches to fix USB issues seems to have fixed this
> issue as well!

Really?  That's amazing!  I assume they did this by changing the exit
or acceptable latency values?

It would be really interesting to see the "lspci -vv" output after the
BIOS update.

Thanks a lot for following up on this!

Bjorn
Ian Kumlien May 15, 2021, 11:52 a.m. UTC | #23
On Wed, Apr 28, 2021 at 11:15 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Apr 26, 2021 at 04:36:24PM +0200, Ian Kumlien wrote:
> > On Thu, Feb 25, 2021 at 11:03 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > On Wed, Feb 24, 2021 at 11:19:55PM +0100, Ian Kumlien wrote:
> >
> > [... 8<...]
> >
> > > I think the most useful information would be the ASPM configuration of
> > > the tree rooted at 00:01.2 under Windows, since there the NIC should
> > > be supported and have good performance.
> >
> > So the AMD bios patches to fix USB issues seems to have fixed this
> > issue as well!
>
> Really?  That's amazing!  I assume they did this by changing the exit
> or acceptable latency values?
>
> It would be really interesting to see the "lspci -vv" output after the
> BIOS update.

First time I looked there was no difference...

for x in 00:01.2 01:00.0 02:03.0 03:00.0 ; do lspci -vvvv -s $x |grep Latency ; done
Latency: 0, Cache Line Size: 64 bytes
LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
Latency: 0, Cache Line Size: 64 bytes
LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
Latency: 0, Cache Line Size: 64 bytes
LnkCap: Port #3, Speed 16GT/s, Width x1, ASPM L1, Exit Latency L1 <32us
Latency: 0, Cache Line Size: 64 bytes
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <16us

I think they actually fixed an issue behind the scenes

> Thanks a lot for following up on this!

Sorry about the delay - work + looking at a switch with a dodgy lldp bug :/

> Bjorn
diff mbox series

Patch

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index 253c30cc1967..c03ead0f1013 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -434,7 +434,7 @@  static void pcie_get_aspm_reg(struct pci_dev *pdev,
 
 static void pcie_aspm_check_latency(struct pci_dev *endpoint)
 {
-	u32 latency, l1_switch_latency = 0;
+	u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
 	struct aspm_latency *acceptable;
 	struct pcie_link_state *link;
 
@@ -456,10 +456,14 @@  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
 		if ((link->aspm_capable & ASPM_STATE_L0S_DW) &&
 		    (link->latency_dw.l0s > acceptable->l0s))
 			link->aspm_capable &= ~ASPM_STATE_L0S_DW;
+
 		/*
 		 * Check L1 latency.
-		 * Every switch on the path to root complex need 1
-		 * more microsecond for L1. Spec doesn't mention L0s.
+		 *
+		 * PCIe r5.0, sec 5.4.1.2.2 states:
+		 * A Switch is required to initiate an L1 exit transition on its
+		 * Upstream Port Link after no more than 1 μs from the beginning of an
+		 * L1 exit transition on any of its Downstream Port Links.
 		 *
 		 * The exit latencies for L1 substates are not advertised
 		 * by a device.  Since the spec also doesn't mention a way
@@ -469,11 +473,13 @@  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
 		 * L1 exit latencies advertised by a device include L1
 		 * substate latencies (and hence do not do any check).
 		 */
-		latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
-		if ((link->aspm_capable & ASPM_STATE_L1) &&
-		    (latency + l1_switch_latency > acceptable->l1))
-			link->aspm_capable &= ~ASPM_STATE_L1;
-		l1_switch_latency += 1000;
+		if (link->aspm_capable & ASPM_STATE_L1) {
+			latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
+			l1_max_latency = max_t(u32, latency, l1_max_latency);
+			if (l1_max_latency + l1_switch_latency > acceptable->l1)
+				link->aspm_capable &= ~ASPM_STATE_L1;
+			l1_switch_latency += 1000;
+		}
 
 		link = link->parent;
 	}