diff mbox series

[v8,4/4] PCI/ASPM: Fix L1.2 parameters when enable link state

Message ID 20240719080255.10998-2-jhp@endlessos.org (mailing list archive)
State Superseded
Delegated to: Bjorn Helgaas
Headers show
Series PCI: vmd: Enable PCI PM's L1 substates of remapped PCIe Root Port and NVMe | expand

Commit Message

Jian-Hong Pan July 19, 2024, 8:02 a.m. UTC
Currently, when enable link's L1.2 features with __pci_enable_link_state(),
it configs the link directly without ensuring related L1.2 parameters, such
as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have been
programmed.

This leads the link's L1.2 between PCIe Root Port and child device gets
wrong configs when a caller tries to enabled it.

Here is a failed example on ASUS B1400CEAE with enabled VMD:

10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor PCIe Controller (rev 01) (prog-if 00 [Normal decode])
    ...
    Capabilities: [200 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
        	  PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
        	   T_CommonMode=45us LTR1.2_Threshold=101376ns
        L1SubCtl2: T_PwrOn=50us

10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550 NVMe SSD (rev 01) (prog-if 02 [NVM Express])
    ...
    Capabilities: [900 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
                  PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
                   T_CommonMode=0us LTR1.2_Threshold=0ns
        L1SubCtl2: T_PwrOn=10us

According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the PCIe
Root Port and the child NVMe, they should be programmed with the same
LTR1.2_Threshold value. However, they have different values in this case.

Invoke aspm_calc_l12_info() to program the L1.2 parameters properly before
enable L1.2 bits of L1 PM Substates Control Register in
__pci_enable_link_state().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
---
v2:
- Prepare the PCIe LTR parameters before enable L1 Substates

v3:
- Only enable supported features for the L1 Substates part

v4:
- Focus on fixing L1.2 parameters, instead of re-initializing whole L1SS

v5:
- Fix typo and commit message
- Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
  aspm_get_l1ss_cap()"

v6:
- Skipped

v7:
- Pick back and rebase on the new version kernel
- Drop the link state flag check. And, always config link state's timing
  parameters

v8:
- Because pcie_aspm_get_link() might return the link as NULL, move
  getting the link's parent and child devices after check the link is
  not NULL. This avoids NULL memory access.

 drivers/pci/pcie/aspm.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Comments

Jian-Hong Pan Aug. 2, 2024, 8:24 a.m. UTC | #1
Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
>
> Currently, when enable link's L1.2 features with __pci_enable_link_state(),
> it configs the link directly without ensuring related L1.2 parameters, such
> as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have been
> programmed.
>
> This leads the link's L1.2 between PCIe Root Port and child device gets
> wrong configs when a caller tries to enabled it.
>
> Here is a failed example on ASUS B1400CEAE with enabled VMD:
>
> 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor PCIe Controller (rev 01) (prog-if 00 [Normal decode])
>     ...
>     Capabilities: [200 v1] L1 PM Substates
>         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
>                   PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
>         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
>                    T_CommonMode=45us LTR1.2_Threshold=101376ns
>         L1SubCtl2: T_PwrOn=50us
>
> 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550 NVMe SSD (rev 01) (prog-if 02 [NVM Express])
>     ...
>     Capabilities: [900 v1] L1 PM Substates
>         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
>                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
>         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
>                    T_CommonMode=0us LTR1.2_Threshold=0ns
>         L1SubCtl2: T_PwrOn=10us
>
> According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the PCIe
> Root Port and the child NVMe, they should be programmed with the same
> LTR1.2_Threshold value. However, they have different values in this case.
>
> Invoke aspm_calc_l12_info() to program the L1.2 parameters properly before
> enable L1.2 bits of L1 PM Substates Control Register in
> __pci_enable_link_state().
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> ---
> v2:
> - Prepare the PCIe LTR parameters before enable L1 Substates
>
> v3:
> - Only enable supported features for the L1 Substates part
>
> v4:
> - Focus on fixing L1.2 parameters, instead of re-initializing whole L1SS
>
> v5:
> - Fix typo and commit message
> - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
>   aspm_get_l1ss_cap()"
>
> v6:
> - Skipped
>
> v7:
> - Pick back and rebase on the new version kernel
> - Drop the link state flag check. And, always config link state's timing
>   parameters
>
> v8:
> - Because pcie_aspm_get_link() might return the link as NULL, move
>   getting the link's parent and child devices after check the link is
>   not NULL. This avoids NULL memory access.
>
>  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index 5db1044c9895..55ff1d26fcea 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
>  static int __pci_enable_link_state(struct pci_dev *pdev, int state, bool locked)
>  {
>         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> +       u32 parent_l1ss_cap, child_l1ss_cap;
> +       struct pci_dev *parent, *child;
>
>         if (!link)
>                 return -EINVAL;
> +
> +       parent = link->pdev;
> +       child = link->downstream;
> +
>         /*
>          * A driver requested that ASPM be enabled on this device, but
>          * if we don't have permission to manage ASPM (e.g., on ACPI
> @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct pci_dev *pdev, int state, bool locked)
>         if (!locked)
>                 down_read(&pci_bus_sem);
>         mutex_lock(&aspm_lock);
> +       /*
> +        * Ensure L1.2 parameters: Common_Mode_Restore_Times, T_POWER_ON and
> +        * LTR_L1.2_THRESHOLD are programmed properly before enable bits for
> +        * L1.2, per PCIe r6.0, sec 5.5.4.
> +        */
> +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
> +
>         link->aspm_default = pci_calc_aspm_enable_mask(state);
>         pcie_config_aspm_link(link, policy_to_aspm_state(link));
>
> --
> 2.45.2
>

Hi Nirmal and Paul,

It will be great to have your review here.

I had tried to "set the threshold value in vmd_pm_enable_quirk()"
directly as Paul said [1].  However, it still needs to get the PCIe
link from the PCIe device to set the threshold value.
And, pci_enable_link_state_locked() gets the link. Then, it will be
great to calculate and programm L1 sub-states' parameters properly
before configuring the link's ASPM there.

[1]: https://lore.kernel.org/linux-kernel/20240624081108.10143-2-jhp@endlessos.org/T/#mc467498213fe1a6116985c04d714dae378976124

Jian-Hong Pan
Nirmal Patel Aug. 5, 2024, 6:24 p.m. UTC | #2
On Fri, 2 Aug 2024 16:24:18 +0800
Jian-Hong Pan <jhp@endlessos.org> wrote:

> Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> >
> > Currently, when enable link's L1.2 features with
> > __pci_enable_link_state(), it configs the link directly without
> > ensuring related L1.2 parameters, such as T_POWER_ON,
> > Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have been
> > programmed.
> >
> > This leads the link's L1.2 between PCIe Root Port and child device
> > gets wrong configs when a caller tries to enabled it.
> >
> > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> >
> > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor
> > PCIe Controller (rev 01) (prog-if 00 [Normal decode]) ...
> >     Capabilities: [200 v1] L1 PM Substates
> >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > L1_PM_Substates+ PortCommonModeRestoreTime=45us
> > PortTPowerOnTime=50us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1-
> > ASPM_L1.2+ ASPM_L1.1- T_CommonMode=45us LTR1.2_Threshold=101376ns
> >         L1SubCtl2: T_PwrOn=50us
> >
> > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue
> > SN550 NVMe SSD (rev 01) (prog-if 02 [NVM Express]) ...
> >     Capabilities: [900 v1] L1 PM Substates
> >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > L1_PM_Substates+ PortCommonModeRestoreTime=32us
> > PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1-
> > ASPM_L1.2+ ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=0ns
> >         L1SubCtl2: T_PwrOn=10us
> >
> > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on
> > the PCIe Root Port and the child NVMe, they should be programmed
> > with the same LTR1.2_Threshold value. However, they have different
> > values in this case.
> >
> > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly
> > before enable L1.2 bits of L1 PM Substates Control Register in
> > __pci_enable_link_state().
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > ---
> > v2:
> > - Prepare the PCIe LTR parameters before enable L1 Substates
> >
> > v3:
> > - Only enable supported features for the L1 Substates part
> >
> > v4:
> > - Focus on fixing L1.2 parameters, instead of re-initializing whole
> > L1SS
> >
> > v5:
> > - Fix typo and commit message
> > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> >   aspm_get_l1ss_cap()"
> >
> > v6:
> > - Skipped
> >
> > v7:
> > - Pick back and rebase on the new version kernel
> > - Drop the link state flag check. And, always config link state's
> > timing parameters
> >
> > v8:
> > - Because pcie_aspm_get_link() might return the link as NULL, move
> >   getting the link's parent and child devices after check the link
> > is not NULL. This avoids NULL memory access.
> >
> >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> >
> > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > index 5db1044c9895..55ff1d26fcea 100644
> > --- a/drivers/pci/pcie/aspm.c
> > +++ b/drivers/pci/pcie/aspm.c
> > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> >  static int __pci_enable_link_state(struct pci_dev *pdev, int
> > state, bool locked) {
> >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > +       struct pci_dev *parent, *child;
> >
> >         if (!link)
> >                 return -EINVAL;
> > +
> > +       parent = link->pdev;
> > +       child = link->downstream;
> > +
> >         /*
> >          * A driver requested that ASPM be enabled on this device,
> > but
> >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct
> > pci_dev *pdev, int state, bool locked) if (!locked)
> >                 down_read(&pci_bus_sem);
> >         mutex_lock(&aspm_lock);
> > +       /*
> > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times,
> > T_POWER_ON and
> > +        * LTR_L1.2_THRESHOLD are programmed properly before enable
> > bits for
> > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > +        */
> > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
> > +
> >         link->aspm_default = pci_calc_aspm_enable_mask(state);
> >         pcie_config_aspm_link(link, policy_to_aspm_state(link));
> >
> > --
> > 2.45.2
> >  
> 
> Hi Nirmal and Paul,
> 
> It will be great to have your review here.
> 
> I had tried to "set the threshold value in vmd_pm_enable_quirk()"
> directly as Paul said [1].  However, it still needs to get the PCIe
> link from the PCIe device to set the threshold value.
> And, pci_enable_link_state_locked() gets the link. Then, it will be
> great to calculate and programm L1 sub-states' parameters properly
> before configuring the link's ASPM there.
> 
> [1]:
> https://lore.kernel.org/linux-kernel/20240624081108.10143-2-jhp@endlessos.org/T/#mc467498213fe1a6116985c04d714dae378976124
> 
> Jian-Hong Pan

Hi Jian-Hong Pan,

I am not an LTR, ASPM expert, but this part looks good to me.

Can you explain why you decided to move pci_enable_link_state_locked()
call down to out_state_change in vmd.c? Will it cause any issue if 
pci_find_ext_capability returns 0?

Thanks
-nirmal
David E. Box Aug. 5, 2024, 8:26 p.m. UTC | #3
Hi Jian-Hong,

On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> > 
> > Currently, when enable link's L1.2 features with __pci_enable_link_state(),
> > it configs the link directly without ensuring related L1.2 parameters, such
> > as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have been
> > programmed.
> > 
> > This leads the link's L1.2 between PCIe Root Port and child device gets
> > wrong configs when a caller tries to enabled it.
> > 
> > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> > 
> > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor PCIe
> > Controller (rev 01) (prog-if 00 [Normal decode])
> >     ...
> >     Capabilities: [200 v1] L1 PM Substates
> >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > L1_PM_Substates+
> >                   PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> >                    T_CommonMode=45us LTR1.2_Threshold=101376ns
> >         L1SubCtl2: T_PwrOn=50us
> > 
> > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550
> > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> >     ...
> >     Capabilities: [900 v1] L1 PM Substates
> >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > L1_PM_Substates+
> >                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> >                    T_CommonMode=0us LTR1.2_Threshold=0ns
> >         L1SubCtl2: T_PwrOn=10us
> > 
> > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the PCIe
> > Root Port and the child NVMe, they should be programmed with the same
> > LTR1.2_Threshold value. However, they have different values in this case.
> > 
> > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly before
> > enable L1.2 bits of L1 PM Substates Control Register in
> > __pci_enable_link_state().
> > 
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > ---
> > v2:
> > - Prepare the PCIe LTR parameters before enable L1 Substates
> > 
> > v3:
> > - Only enable supported features for the L1 Substates part
> > 
> > v4:
> > - Focus on fixing L1.2 parameters, instead of re-initializing whole L1SS
> > 
> > v5:
> > - Fix typo and commit message
> > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> >   aspm_get_l1ss_cap()"
> > 
> > v6:
> > - Skipped
> > 
> > v7:
> > - Pick back and rebase on the new version kernel
> > - Drop the link state flag check. And, always config link state's timing
> >   parameters
> > 
> > v8:
> > - Because pcie_aspm_get_link() might return the link as NULL, move
> >   getting the link's parent and child devices after check the link is
> >   not NULL. This avoids NULL memory access.
> > 
> >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> > 
> > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > index 5db1044c9895..55ff1d26fcea 100644
> > --- a/drivers/pci/pcie/aspm.c
> > +++ b/drivers/pci/pcie/aspm.c
> > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> >  static int __pci_enable_link_state(struct pci_dev *pdev, int state, bool
> > locked)
> >  {
> >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > +       struct pci_dev *parent, *child;
> > 
> >         if (!link)
> >                 return -EINVAL;
> > +
> > +       parent = link->pdev;
> > +       child = link->downstream;
> > +
> >         /*
> >          * A driver requested that ASPM be enabled on this device, but
> >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct pci_dev
> > *pdev, int state, bool locked)
> >         if (!locked)
> >                 down_read(&pci_bus_sem);
> >         mutex_lock(&aspm_lock);
> > +       /*
> > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times, T_POWER_ON and
> > +        * LTR_L1.2_THRESHOLD are programmed properly before enable bits for
> > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > +        */
> > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);

I still don't think this is the place to recalculate the L1.2 parameters
especially when know the calculation was done but was cleared by
pci_bus_reset(). Can't we just do a pci_save/restore_state() before/after
pci_bus_reset() in vmd.c?

David

> > +
> >         link->aspm_default = pci_calc_aspm_enable_mask(state);
> >         pcie_config_aspm_link(link, policy_to_aspm_state(link));
> > 
> > --
> > 2.45.2
> > 
> 
> Hi Nirmal and Paul,
> 
> It will be great to have your review here.
> 
> I had tried to "set the threshold value in vmd_pm_enable_quirk()"
> directly as Paul said [1].  However, it still needs to get the PCIe
> link from the PCIe device to set the threshold value.
> And, pci_enable_link_state_locked() gets the link. Then, it will be
> great to calculate and programm L1 sub-states' parameters properly
> before configuring the link's ASPM there.
> 
> [1]:
> https://lore.kernel.org/linux-kernel/20240624081108.10143-2-jhp@endlessos.org/T/#mc467498213fe1a6116985c04d714dae378976124
> 
> Jian-Hong Pan
Jian-Hong Pan Aug. 7, 2024, 4:23 a.m. UTC | #4
Nirmal Patel <nirmal.patel@linux.intel.com> 於 2024年8月6日 週二 上午2:25寫道:
>
> On Fri, 2 Aug 2024 16:24:18 +0800
> Jian-Hong Pan <jhp@endlessos.org> wrote:
>
> > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> > >
> > > Currently, when enable link's L1.2 features with
> > > __pci_enable_link_state(), it configs the link directly without
> > > ensuring related L1.2 parameters, such as T_POWER_ON,
> > > Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have been
> > > programmed.
> > >
> > > This leads the link's L1.2 between PCIe Root Port and child device
> > > gets wrong configs when a caller tries to enabled it.
> > >
> > > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> > >
> > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor
> > > PCIe Controller (rev 01) (prog-if 00 [Normal decode]) ...
> > >     Capabilities: [200 v1] L1 PM Substates
> > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > > L1_PM_Substates+ PortCommonModeRestoreTime=45us
> > > PortTPowerOnTime=50us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1-
> > > ASPM_L1.2+ ASPM_L1.1- T_CommonMode=45us LTR1.2_Threshold=101376ns
> > >         L1SubCtl2: T_PwrOn=50us
> > >
> > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue
> > > SN550 NVMe SSD (rev 01) (prog-if 02 [NVM Express]) ...
> > >     Capabilities: [900 v1] L1 PM Substates
> > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > L1_PM_Substates+ PortCommonModeRestoreTime=32us
> > > PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1-
> > > ASPM_L1.2+ ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=0ns
> > >         L1SubCtl2: T_PwrOn=10us
> > >
> > > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on
> > > the PCIe Root Port and the child NVMe, they should be programmed
> > > with the same LTR1.2_Threshold value. However, they have different
> > > values in this case.
> > >
> > > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly
> > > before enable L1.2 bits of L1 PM Substates Control Register in
> > > __pci_enable_link_state().
> > >
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > ---
> > > v2:
> > > - Prepare the PCIe LTR parameters before enable L1 Substates
> > >
> > > v3:
> > > - Only enable supported features for the L1 Substates part
> > >
> > > v4:
> > > - Focus on fixing L1.2 parameters, instead of re-initializing whole
> > > L1SS
> > >
> > > v5:
> > > - Fix typo and commit message
> > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> > >   aspm_get_l1ss_cap()"
> > >
> > > v6:
> > > - Skipped
> > >
> > > v7:
> > > - Pick back and rebase on the new version kernel
> > > - Drop the link state flag check. And, always config link state's
> > > timing parameters
> > >
> > > v8:
> > > - Because pcie_aspm_get_link() might return the link as NULL, move
> > >   getting the link's parent and child devices after check the link
> > > is not NULL. This avoids NULL memory access.
> > >
> > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > >  1 file changed, 15 insertions(+)
> > >
> > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > index 5db1044c9895..55ff1d26fcea 100644
> > > --- a/drivers/pci/pcie/aspm.c
> > > +++ b/drivers/pci/pcie/aspm.c
> > > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> > >  static int __pci_enable_link_state(struct pci_dev *pdev, int
> > > state, bool locked) {
> > >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > +       struct pci_dev *parent, *child;
> > >
> > >         if (!link)
> > >                 return -EINVAL;
> > > +
> > > +       parent = link->pdev;
> > > +       child = link->downstream;
> > > +
> > >         /*
> > >          * A driver requested that ASPM be enabled on this device,
> > > but
> > >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct
> > > pci_dev *pdev, int state, bool locked) if (!locked)
> > >                 down_read(&pci_bus_sem);
> > >         mutex_lock(&aspm_lock);
> > > +       /*
> > > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times,
> > > T_POWER_ON and
> > > +        * LTR_L1.2_THRESHOLD are programmed properly before enable
> > > bits for
> > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > +        */
> > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
> > > +
> > >         link->aspm_default = pci_calc_aspm_enable_mask(state);
> > >         pcie_config_aspm_link(link, policy_to_aspm_state(link));
> > >
> > > --
> > > 2.45.2
> > >
> >
> > Hi Nirmal and Paul,
> >
> > It will be great to have your review here.
> >
> > I had tried to "set the threshold value in vmd_pm_enable_quirk()"
> > directly as Paul said [1].  However, it still needs to get the PCIe
> > link from the PCIe device to set the threshold value.
> > And, pci_enable_link_state_locked() gets the link. Then, it will be
> > great to calculate and programm L1 sub-states' parameters properly
> > before configuring the link's ASPM there.
> >
> > [1]:
> > https://lore.kernel.org/linux-kernel/20240624081108.10143-2-jhp@endlessos.org/T/#mc467498213fe1a6116985c04d714dae378976124
> >
> > Jian-Hong Pan
>
> Hi Jian-Hong Pan,
>
> I am not an LTR, ASPM expert, but this part looks good to me.
>
> Can you explain why you decided to move pci_enable_link_state_locked()
> call down to out_state_change in vmd.c?

The idea is setting all LTR related parameters before enabling the ASPM feature.

> Will it cause any issue if pci_find_ext_capability returns 0?

If pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_LTR) in
vmd_pm_enable_quirk() returns 0, then the device is not a PCIe device.

Then, it goes to:
...
pci_enable_link_state_locked()
  __pci_enable_link_state()

__pci_enable_link_state() uses pcie_aspm_get_link() to get the link
between the PCIe bridge and the PCIe device.  And,
pcie_aspm_get_link() returns the link as a barrier.  If
pcie_aspm_get_link() does not get the link, then the device is a PCIe
bridge, or not a PCIe device.  Because the link is NULL,
__pci_enable_link_state() returns with -EINVAL directly and will not
configure/enable ASPM things.

Jian-Hong Pan
Jian-Hong Pan Aug. 7, 2024, 10:05 a.m. UTC | #5
David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日 週二 上午4:26寫道:
>
> Hi Jian-Hong,
>
> On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> > >
> > > Currently, when enable link's L1.2 features with __pci_enable_link_state(),
> > > it configs the link directly without ensuring related L1.2 parameters, such
> > > as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have been
> > > programmed.
> > >
> > > This leads the link's L1.2 between PCIe Root Port and child device gets
> > > wrong configs when a caller tries to enabled it.
> > >
> > > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> > >
> > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor PCIe
> > > Controller (rev 01) (prog-if 00 [Normal decode])
> > >     ...
> > >     Capabilities: [200 v1] L1 PM Substates
> > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > > L1_PM_Substates+
> > >                   PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > >                    T_CommonMode=45us LTR1.2_Threshold=101376ns
> > >         L1SubCtl2: T_PwrOn=50us
> > >
> > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550
> > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > >     ...
> > >     Capabilities: [900 v1] L1 PM Substates
> > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > L1_PM_Substates+
> > >                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > >                    T_CommonMode=0us LTR1.2_Threshold=0ns
> > >         L1SubCtl2: T_PwrOn=10us
> > >
> > > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the PCIe
> > > Root Port and the child NVMe, they should be programmed with the same
> > > LTR1.2_Threshold value. However, they have different values in this case.
> > >
> > > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly before
> > > enable L1.2 bits of L1 PM Substates Control Register in
> > > __pci_enable_link_state().
> > >
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > ---
> > > v2:
> > > - Prepare the PCIe LTR parameters before enable L1 Substates
> > >
> > > v3:
> > > - Only enable supported features for the L1 Substates part
> > >
> > > v4:
> > > - Focus on fixing L1.2 parameters, instead of re-initializing whole L1SS
> > >
> > > v5:
> > > - Fix typo and commit message
> > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> > >   aspm_get_l1ss_cap()"
> > >
> > > v6:
> > > - Skipped
> > >
> > > v7:
> > > - Pick back and rebase on the new version kernel
> > > - Drop the link state flag check. And, always config link state's timing
> > >   parameters
> > >
> > > v8:
> > > - Because pcie_aspm_get_link() might return the link as NULL, move
> > >   getting the link's parent and child devices after check the link is
> > >   not NULL. This avoids NULL memory access.
> > >
> > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > >  1 file changed, 15 insertions(+)
> > >
> > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > index 5db1044c9895..55ff1d26fcea 100644
> > > --- a/drivers/pci/pcie/aspm.c
> > > +++ b/drivers/pci/pcie/aspm.c
> > > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> > >  static int __pci_enable_link_state(struct pci_dev *pdev, int state, bool
> > > locked)
> > >  {
> > >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > +       struct pci_dev *parent, *child;
> > >
> > >         if (!link)
> > >                 return -EINVAL;
> > > +
> > > +       parent = link->pdev;
> > > +       child = link->downstream;
> > > +
> > >         /*
> > >          * A driver requested that ASPM be enabled on this device, but
> > >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct pci_dev
> > > *pdev, int state, bool locked)
> > >         if (!locked)
> > >                 down_read(&pci_bus_sem);
> > >         mutex_lock(&aspm_lock);
> > > +       /*
> > > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times, T_POWER_ON and
> > > +        * LTR_L1.2_THRESHOLD are programmed properly before enable bits for
> > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > +        */
> > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
>
> I still don't think this is the place to recalculate the L1.2 parameters
> especially when know the calculation was done but was cleared by
> pci_bus_reset(). Can't we just do a pci_save/restore_state() before/after
> pci_bus_reset() in vmd.c?

I have not thought pci_save/restore_state() around pci_bus_reset()
before.  It is an interesting direction.

So, I prepare modification below for test.  Include "[PATCH v8 1/4]
PCI: vmd: Set PCI devices to D0 before enable PCI PM's L1 substates",
too.  Then, both the PCIe bridge and the PCIe device have the same
LTR_L1.2_THRESHOLD 101376ns as expected.

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index bbf4a47e7b31..6b8dd4f30127 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -727,6 +727,18 @@ static void vmd_copy_host_bridge_flags(struct
pci_host_bridge *root_bridge,
        vmd_bridge->native_dpc = root_bridge->native_dpc;
 }

+static int vmd_pci_save_state(struct pci_dev *pdev, void *userdata)
+{
+       pci_save_state(pdev);
+       return 0;
+}
+
+static int vmd_pci_restore_state(struct pci_dev *pdev, void *userdata)
+{
+       pci_restore_state(pdev);
+       return 0;
+}
+
 /*
  * Enable ASPM and LTR settings on devices that aren't configured by BIOS.
  */
@@ -927,6 +939,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
unsigned long features)
        pci_scan_child_bus(vmd->bus);
        vmd_domain_reset(vmd);

+       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
        /* When Intel VMD is enabled, the OS does not discover the Root Ports
         * owned by Intel VMD within the MMCFG space. pci_reset_bus() applies
         * a reset to the parent of the PCI device supplied as argument. This
@@ -945,6 +958,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
unsigned long features)
                        break;
                }
        }
+       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);

        pci_assign_unassigned_bus_resources(vmd->bus);


Jian-Hong Pan

> > > +
> > >         link->aspm_default = pci_calc_aspm_enable_mask(state);
> > >         pcie_config_aspm_link(link, policy_to_aspm_state(link));
> > >
> > > --
> > > 2.45.2
> > >
> >
> > Hi Nirmal and Paul,
> >
> > It will be great to have your review here.
> >
> > I had tried to "set the threshold value in vmd_pm_enable_quirk()"
> > directly as Paul said [1].  However, it still needs to get the PCIe
> > link from the PCIe device to set the threshold value.
> > And, pci_enable_link_state_locked() gets the link. Then, it will be
> > great to calculate and programm L1 sub-states' parameters properly
> > before configuring the link's ASPM there.
> >
> > [1]:
> > https://lore.kernel.org/linux-kernel/20240624081108.10143-2-jhp@endlessos.org/T/#mc467498213fe1a6116985c04d714dae378976124
> >
> > Jian-Hong Pan
>
Ilpo Järvinen Aug. 7, 2024, 11:18 a.m. UTC | #6
On Wed, 7 Aug 2024, Jian-Hong Pan wrote:

> David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日 週二 上午4:26寫道:
> >
> > Hi Jian-Hong,
> >
> > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> > > >
> > > > Currently, when enable link's L1.2 features with __pci_enable_link_state(),
> > > > it configs the link directly without ensuring related L1.2 parameters, such
> > > > as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have been
> > > > programmed.
> > > >
> > > > This leads the link's L1.2 between PCIe Root Port and child device gets
> > > > wrong configs when a caller tries to enabled it.
> > > >
> > > > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> > > >
> > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor PCIe
> > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > >     ...
> > > >     Capabilities: [200 v1] L1 PM Substates
> > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > > > L1_PM_Substates+
> > > >                   PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > >                    T_CommonMode=45us LTR1.2_Threshold=101376ns
> > > >         L1SubCtl2: T_PwrOn=50us
> > > >
> > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550
> > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > >     ...
> > > >     Capabilities: [900 v1] L1 PM Substates
> > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > L1_PM_Substates+
> > > >                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > >                    T_CommonMode=0us LTR1.2_Threshold=0ns
> > > >         L1SubCtl2: T_PwrOn=10us
> > > >
> > > > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the PCIe
> > > > Root Port and the child NVMe, they should be programmed with the same
> > > > LTR1.2_Threshold value. However, they have different values in this case.
> > > >
> > > > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly before
> > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > __pci_enable_link_state().
> > > >
> > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > ---
> > > > v2:
> > > > - Prepare the PCIe LTR parameters before enable L1 Substates
> > > >
> > > > v3:
> > > > - Only enable supported features for the L1 Substates part
> > > >
> > > > v4:
> > > > - Focus on fixing L1.2 parameters, instead of re-initializing whole L1SS
> > > >
> > > > v5:
> > > > - Fix typo and commit message
> > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> > > >   aspm_get_l1ss_cap()"
> > > >
> > > > v6:
> > > > - Skipped
> > > >
> > > > v7:
> > > > - Pick back and rebase on the new version kernel
> > > > - Drop the link state flag check. And, always config link state's timing
> > > >   parameters
> > > >
> > > > v8:
> > > > - Because pcie_aspm_get_link() might return the link as NULL, move
> > > >   getting the link's parent and child devices after check the link is
> > > >   not NULL. This avoids NULL memory access.
> > > >
> > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > >  1 file changed, 15 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > index 5db1044c9895..55ff1d26fcea 100644
> > > > --- a/drivers/pci/pcie/aspm.c
> > > > +++ b/drivers/pci/pcie/aspm.c
> > > > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> > > >  static int __pci_enable_link_state(struct pci_dev *pdev, int state, bool
> > > > locked)
> > > >  {
> > > >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > +       struct pci_dev *parent, *child;
> > > >
> > > >         if (!link)
> > > >                 return -EINVAL;
> > > > +
> > > > +       parent = link->pdev;
> > > > +       child = link->downstream;
> > > > +
> > > >         /*
> > > >          * A driver requested that ASPM be enabled on this device, but
> > > >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > > > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct pci_dev
> > > > *pdev, int state, bool locked)
> > > >         if (!locked)
> > > >                 down_read(&pci_bus_sem);
> > > >         mutex_lock(&aspm_lock);
> > > > +       /*
> > > > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times, T_POWER_ON and
> > > > +        * LTR_L1.2_THRESHOLD are programmed properly before enable bits for
> > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > +        */
> > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
> >
> > I still don't think this is the place to recalculate the L1.2 parameters
> > especially when know the calculation was done but was cleared by
> > pci_bus_reset(). Can't we just do a pci_save/restore_state() before/after
> > pci_bus_reset() in vmd.c?
> 
> I have not thought pci_save/restore_state() around pci_bus_reset()
> before.  It is an interesting direction.
> 
> So, I prepare modification below for test.  Include "[PATCH v8 1/4]
> PCI: vmd: Set PCI devices to D0 before enable PCI PM's L1 substates",
> too.  Then, both the PCIe bridge and the PCIe device have the same
> LTR_L1.2_THRESHOLD 101376ns as expected.
> 
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index bbf4a47e7b31..6b8dd4f30127 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
> @@ -727,6 +727,18 @@ static void vmd_copy_host_bridge_flags(struct
> pci_host_bridge *root_bridge,
>         vmd_bridge->native_dpc = root_bridge->native_dpc;
>  }
> 
> +static int vmd_pci_save_state(struct pci_dev *pdev, void *userdata)
> +{
> +       pci_save_state(pdev);
> +       return 0;
> +}
> +
> +static int vmd_pci_restore_state(struct pci_dev *pdev, void *userdata)
> +{
> +       pci_restore_state(pdev);
> +       return 0;
> +}
> +
>  /*
>   * Enable ASPM and LTR settings on devices that aren't configured by BIOS.
>   */
> @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> unsigned long features)
>         pci_scan_child_bus(vmd->bus);
>         vmd_domain_reset(vmd);
> 
> +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
>         /* When Intel VMD is enabled, the OS does not discover the Root Ports
>          * owned by Intel VMD within the MMCFG space. pci_reset_bus() applies
>          * a reset to the parent of the PCI device supplied as argument. This
> @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> unsigned long features)
>                         break;
>                 }
>         }
> +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);

Why not call pci_reset_bus() (or __pci_reset_bus()) then in 
vmd_enable_domain() which preserves state unlike pci_reset_bus()?

(Don't tell me naming of these functions is a horrible mess. :-/)
David E. Box Aug. 7, 2024, 11:27 p.m. UTC | #7
On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:
> On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> 
> > David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日 週二 上午4:26寫道:
> > > 
> > > Hi Jian-Hong,
> > > 
> > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> > > > > 
> > > > > Currently, when enable link's L1.2 features with
> > > > > __pci_enable_link_state(),
> > > > > it configs the link directly without ensuring related L1.2 parameters,
> > > > > such
> > > > > as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have
> > > > > been
> > > > > programmed.
> > > > > 
> > > > > This leads the link's L1.2 between PCIe Root Port and child device
> > > > > gets
> > > > > wrong configs when a caller tries to enabled it.
> > > > > 
> > > > > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> > > > > 
> > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor
> > > > > PCIe
> > > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > > >     ...
> > > > >     Capabilities: [200 v1] L1 PM Substates
> > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > > > > L1_PM_Substates+
> > > > >                   PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > >                    T_CommonMode=45us LTR1.2_Threshold=101376ns
> > > > >         L1SubCtl2: T_PwrOn=50us
> > > > > 
> > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue
> > > > > SN550
> > > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > >     ...
> > > > >     Capabilities: [900 v1] L1 PM Substates
> > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > L1_PM_Substates+
> > > > >                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > >                    T_CommonMode=0us LTR1.2_Threshold=0ns
> > > > >         L1SubCtl2: T_PwrOn=10us
> > > > > 
> > > > > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the
> > > > > PCIe
> > > > > Root Port and the child NVMe, they should be programmed with the same
> > > > > LTR1.2_Threshold value. However, they have different values in this
> > > > > case.
> > > > > 
> > > > > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly
> > > > > before
> > > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > > __pci_enable_link_state().
> > > > > 
> > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > > ---
> > > > > v2:
> > > > > - Prepare the PCIe LTR parameters before enable L1 Substates
> > > > > 
> > > > > v3:
> > > > > - Only enable supported features for the L1 Substates part
> > > > > 
> > > > > v4:
> > > > > - Focus on fixing L1.2 parameters, instead of re-initializing whole
> > > > > L1SS
> > > > > 
> > > > > v5:
> > > > > - Fix typo and commit message
> > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> > > > >   aspm_get_l1ss_cap()"
> > > > > 
> > > > > v6:
> > > > > - Skipped
> > > > > 
> > > > > v7:
> > > > > - Pick back and rebase on the new version kernel
> > > > > - Drop the link state flag check. And, always config link state's
> > > > > timing
> > > > >   parameters
> > > > > 
> > > > > v8:
> > > > > - Because pcie_aspm_get_link() might return the link as NULL, move
> > > > >   getting the link's parent and child devices after check the link is
> > > > >   not NULL. This avoids NULL memory access.
> > > > > 
> > > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > >  1 file changed, 15 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > index 5db1044c9895..55ff1d26fcea 100644
> > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> > > > >  static int __pci_enable_link_state(struct pci_dev *pdev, int state,
> > > > > bool
> > > > > locked)
> > > > >  {
> > > > >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > +       struct pci_dev *parent, *child;
> > > > > 
> > > > >         if (!link)
> > > > >                 return -EINVAL;
> > > > > +
> > > > > +       parent = link->pdev;
> > > > > +       child = link->downstream;
> > > > > +
> > > > >         /*
> > > > >          * A driver requested that ASPM be enabled on this device, but
> > > > >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > > > > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct
> > > > > pci_dev
> > > > > *pdev, int state, bool locked)
> > > > >         if (!locked)
> > > > >                 down_read(&pci_bus_sem);
> > > > >         mutex_lock(&aspm_lock);
> > > > > +       /*
> > > > > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times,
> > > > > T_POWER_ON and
> > > > > +        * LTR_L1.2_THRESHOLD are programmed properly before enable
> > > > > bits for
> > > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > +        */
> > > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
> > > 
> > > I still don't think this is the place to recalculate the L1.2 parameters
> > > especially when know the calculation was done but was cleared by
> > > pci_bus_reset(). Can't we just do a pci_save/restore_state() before/after
> > > pci_bus_reset() in vmd.c?
> > 
> > I have not thought pci_save/restore_state() around pci_bus_reset()
> > before.  It is an interesting direction.
> > 
> > So, I prepare modification below for test.  Include "[PATCH v8 1/4]
> > PCI: vmd: Set PCI devices to D0 before enable PCI PM's L1 substates",
> > too.  Then, both the PCIe bridge and the PCIe device have the same
> > LTR_L1.2_THRESHOLD 101376ns as expected.
> > 
> > diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> > index bbf4a47e7b31..6b8dd4f30127 100644
> > --- a/drivers/pci/controller/vmd.c
> > +++ b/drivers/pci/controller/vmd.c
> > @@ -727,6 +727,18 @@ static void vmd_copy_host_bridge_flags(struct
> > pci_host_bridge *root_bridge,
> >         vmd_bridge->native_dpc = root_bridge->native_dpc;
> >  }
> > 
> > +static int vmd_pci_save_state(struct pci_dev *pdev, void *userdata)
> > +{
> > +       pci_save_state(pdev);
> > +       return 0;
> > +}
> > +
> > +static int vmd_pci_restore_state(struct pci_dev *pdev, void *userdata)
> > +{
> > +       pci_restore_state(pdev);
> > +       return 0;
> > +}
> > +
> >  /*
> >   * Enable ASPM and LTR settings on devices that aren't configured by BIOS.
> >   */
> > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > unsigned long features)
> >         pci_scan_child_bus(vmd->bus);
> >         vmd_domain_reset(vmd);
> > 
> > +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> >         /* When Intel VMD is enabled, the OS does not discover the Root
> > Ports
> >          * owned by Intel VMD within the MMCFG space. pci_reset_bus()
> > applies
> >          * a reset to the parent of the PCI device supplied as argument.
> > This
> > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > unsigned long features)
> >                         break;
> >                 }
> >         }
> > +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);
> 
> Why not call pci_reset_bus() (or __pci_reset_bus()) then in 
> vmd_enable_domain() which preserves state unlike pci_reset_bus()?
> 
> (Don't tell me naming of these functions is a horrible mess. :-/)
> 

Hmm. So this *is* calling pci_reset_bus(). L1.2 configuration has specific
ordering requirements for changes to parent & child devices. Could be why it's
not getting restored properly.

David
Ilpo Järvinen Aug. 8, 2024, 9:48 a.m. UTC | #8
On Wed, 7 Aug 2024, David E. Box wrote:

> On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:
> > On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> > 
> > > David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日 週二 上午4:26寫道:
> > > > 
> > > > Hi Jian-Hong,
> > > > 
> > > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > > > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> > > > > > 
> > > > > > Currently, when enable link's L1.2 features with
> > > > > > __pci_enable_link_state(),
> > > > > > it configs the link directly without ensuring related L1.2 parameters,
> > > > > > such
> > > > > > as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have
> > > > > > been
> > > > > > programmed.
> > > > > > 
> > > > > > This leads the link's L1.2 between PCIe Root Port and child device
> > > > > > gets
> > > > > > wrong configs when a caller tries to enabled it.
> > > > > > 
> > > > > > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> > > > > > 
> > > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor
> > > > > > PCIe
> > > > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > > > >     ...
> > > > > >     Capabilities: [200 v1] L1 PM Substates
> > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > > > > > L1_PM_Substates+
> > > > > >                   PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > >                    T_CommonMode=45us LTR1.2_Threshold=101376ns
> > > > > >         L1SubCtl2: T_PwrOn=50us
> > > > > > 
> > > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue
> > > > > > SN550
> > > > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > > >     ...
> > > > > >     Capabilities: [900 v1] L1 PM Substates
> > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > L1_PM_Substates+
> > > > > >                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > >                    T_CommonMode=0us LTR1.2_Threshold=0ns
> > > > > >         L1SubCtl2: T_PwrOn=10us
> > > > > > 
> > > > > > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the
> > > > > > PCIe
> > > > > > Root Port and the child NVMe, they should be programmed with the same
> > > > > > LTR1.2_Threshold value. However, they have different values in this
> > > > > > case.
> > > > > > 
> > > > > > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly
> > > > > > before
> > > > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > > > __pci_enable_link_state().
> > > > > > 
> > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > > > ---
> > > > > > v2:
> > > > > > - Prepare the PCIe LTR parameters before enable L1 Substates
> > > > > > 
> > > > > > v3:
> > > > > > - Only enable supported features for the L1 Substates part
> > > > > > 
> > > > > > v4:
> > > > > > - Focus on fixing L1.2 parameters, instead of re-initializing whole
> > > > > > L1SS
> > > > > > 
> > > > > > v5:
> > > > > > - Fix typo and commit message
> > > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> > > > > >   aspm_get_l1ss_cap()"
> > > > > > 
> > > > > > v6:
> > > > > > - Skipped
> > > > > > 
> > > > > > v7:
> > > > > > - Pick back and rebase on the new version kernel
> > > > > > - Drop the link state flag check. And, always config link state's
> > > > > > timing
> > > > > >   parameters
> > > > > > 
> > > > > > v8:
> > > > > > - Because pcie_aspm_get_link() might return the link as NULL, move
> > > > > >   getting the link's parent and child devices after check the link is
> > > > > >   not NULL. This avoids NULL memory access.
> > > > > > 
> > > > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > > >  1 file changed, 15 insertions(+)
> > > > > > 
> > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > index 5db1044c9895..55ff1d26fcea 100644
> > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> > > > > >  static int __pci_enable_link_state(struct pci_dev *pdev, int state,
> > > > > > bool
> > > > > > locked)
> > > > > >  {
> > > > > >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > > > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > > +       struct pci_dev *parent, *child;
> > > > > > 
> > > > > >         if (!link)
> > > > > >                 return -EINVAL;
> > > > > > +
> > > > > > +       parent = link->pdev;
> > > > > > +       child = link->downstream;
> > > > > > +
> > > > > >         /*
> > > > > >          * A driver requested that ASPM be enabled on this device, but
> > > > > >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > > > > > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct
> > > > > > pci_dev
> > > > > > *pdev, int state, bool locked)
> > > > > >         if (!locked)
> > > > > >                 down_read(&pci_bus_sem);
> > > > > >         mutex_lock(&aspm_lock);
> > > > > > +       /*
> > > > > > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times,
> > > > > > T_POWER_ON and
> > > > > > +        * LTR_L1.2_THRESHOLD are programmed properly before enable
> > > > > > bits for
> > > > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > > +        */
> > > > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
> > > > 
> > > > I still don't think this is the place to recalculate the L1.2 parameters
> > > > especially when know the calculation was done but was cleared by
> > > > pci_bus_reset(). Can't we just do a pci_save/restore_state() before/after
> > > > pci_bus_reset() in vmd.c?
> > > 
> > > I have not thought pci_save/restore_state() around pci_bus_reset()
> > > before.  It is an interesting direction.
> > > 
> > > So, I prepare modification below for test.  Include "[PATCH v8 1/4]
> > > PCI: vmd: Set PCI devices to D0 before enable PCI PM's L1 substates",
> > > too.  Then, both the PCIe bridge and the PCIe device have the same
> > > LTR_L1.2_THRESHOLD 101376ns as expected.
> > > 
> > > diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> > > index bbf4a47e7b31..6b8dd4f30127 100644
> > > --- a/drivers/pci/controller/vmd.c
> > > +++ b/drivers/pci/controller/vmd.c
> > > @@ -727,6 +727,18 @@ static void vmd_copy_host_bridge_flags(struct
> > > pci_host_bridge *root_bridge,
> > >         vmd_bridge->native_dpc = root_bridge->native_dpc;
> > >  }
> > > 
> > > +static int vmd_pci_save_state(struct pci_dev *pdev, void *userdata)
> > > +{
> > > +       pci_save_state(pdev);
> > > +       return 0;
> > > +}
> > > +
> > > +static int vmd_pci_restore_state(struct pci_dev *pdev, void *userdata)
> > > +{
> > > +       pci_restore_state(pdev);
> > > +       return 0;
> > > +}
> > > +
> > >  /*
> > >   * Enable ASPM and LTR settings on devices that aren't configured by BIOS.
> > >   */
> > > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > unsigned long features)
> > >         pci_scan_child_bus(vmd->bus);
> > >         vmd_domain_reset(vmd);
> > > 
> > > +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> > >         /* When Intel VMD is enabled, the OS does not discover the Root
> > > Ports
> > >          * owned by Intel VMD within the MMCFG space. pci_reset_bus()
> > > applies
> > >          * a reset to the parent of the PCI device supplied as argument.
> > > This
> > > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > unsigned long features)
> > >                         break;
> > >                 }
> > >         }
> > > +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);
> > 
> > Why not call pci_reset_bus() (or __pci_reset_bus()) then in 
> > vmd_enable_domain() which preserves state unlike pci_reset_bus()?
> > 
> > (Don't tell me naming of these functions is a horrible mess. :-/)
> 
> Hmm. So this *is* calling pci_reset_bus().

Yeah, I managed to get confused by the names myself, I somehow 
ended up thinking it calls pci_bus_reset() which is not correct...

> L1.2 configuration has specific
> ordering requirements for changes to parent & child devices. Could be why it's
> not getting restored properly.

Indeed, it has to be something else since the patch above doesn't even 
restore anything because dev->state_saved should get set to false by the 
first pci_restore_state() called from 
__pci_reset_bus() -> pci_bus_restore_locked() -> pci_dev_restore(), I 
think!?
Jian-Hong Pan Aug. 12, 2024, 8:18 a.m. UTC | #9
Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年8月8日 週四 下午5:49寫道:
>
> On Wed, 7 Aug 2024, David E. Box wrote:
>
> > On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:
> > > On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> > >
> > > > David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日 週二 上午4:26寫道:
> > > > >
> > > > > Hi Jian-Hong,
> > > > >
> > > > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > > > > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> > > > > > >
> > > > > > > Currently, when enable link's L1.2 features with
> > > > > > > __pci_enable_link_state(),
> > > > > > > it configs the link directly without ensuring related L1.2 parameters,
> > > > > > > such
> > > > > > > as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have
> > > > > > > been
> > > > > > > programmed.
> > > > > > >
> > > > > > > This leads the link's L1.2 between PCIe Root Port and child device
> > > > > > > gets
> > > > > > > wrong configs when a caller tries to enabled it.
> > > > > > >
> > > > > > > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> > > > > > >
> > > > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor
> > > > > > > PCIe
> > > > > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > > > > >     ...
> > > > > > >     Capabilities: [200 v1] L1 PM Substates
> > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > > > > > > L1_PM_Substates+
> > > > > > >                   PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > >                    T_CommonMode=45us LTR1.2_Threshold=101376ns
> > > > > > >         L1SubCtl2: T_PwrOn=50us
> > > > > > >
> > > > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue
> > > > > > > SN550
> > > > > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > > > >     ...
> > > > > > >     Capabilities: [900 v1] L1 PM Substates
> > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > L1_PM_Substates+
> > > > > > >                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > >                    T_CommonMode=0us LTR1.2_Threshold=0ns
> > > > > > >         L1SubCtl2: T_PwrOn=10us
> > > > > > >
> > > > > > > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the
> > > > > > > PCIe
> > > > > > > Root Port and the child NVMe, they should be programmed with the same
> > > > > > > LTR1.2_Threshold value. However, they have different values in this
> > > > > > > case.
> > > > > > >
> > > > > > > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly
> > > > > > > before
> > > > > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > > > > __pci_enable_link_state().
> > > > > > >
> > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > > > > ---
> > > > > > > v2:
> > > > > > > - Prepare the PCIe LTR parameters before enable L1 Substates
> > > > > > >
> > > > > > > v3:
> > > > > > > - Only enable supported features for the L1 Substates part
> > > > > > >
> > > > > > > v4:
> > > > > > > - Focus on fixing L1.2 parameters, instead of re-initializing whole
> > > > > > > L1SS
> > > > > > >
> > > > > > > v5:
> > > > > > > - Fix typo and commit message
> > > > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> > > > > > >   aspm_get_l1ss_cap()"
> > > > > > >
> > > > > > > v6:
> > > > > > > - Skipped
> > > > > > >
> > > > > > > v7:
> > > > > > > - Pick back and rebase on the new version kernel
> > > > > > > - Drop the link state flag check. And, always config link state's
> > > > > > > timing
> > > > > > >   parameters
> > > > > > >
> > > > > > > v8:
> > > > > > > - Because pcie_aspm_get_link() might return the link as NULL, move
> > > > > > >   getting the link's parent and child devices after check the link is
> > > > > > >   not NULL. This avoids NULL memory access.
> > > > > > >
> > > > > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > > > >  1 file changed, 15 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > > index 5db1044c9895..55ff1d26fcea 100644
> > > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> > > > > > >  static int __pci_enable_link_state(struct pci_dev *pdev, int state,
> > > > > > > bool
> > > > > > > locked)
> > > > > > >  {
> > > > > > >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > > > > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > > > +       struct pci_dev *parent, *child;
> > > > > > >
> > > > > > >         if (!link)
> > > > > > >                 return -EINVAL;
> > > > > > > +
> > > > > > > +       parent = link->pdev;
> > > > > > > +       child = link->downstream;
> > > > > > > +
> > > > > > >         /*
> > > > > > >          * A driver requested that ASPM be enabled on this device, but
> > > > > > >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > > > > > > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct
> > > > > > > pci_dev
> > > > > > > *pdev, int state, bool locked)
> > > > > > >         if (!locked)
> > > > > > >                 down_read(&pci_bus_sem);
> > > > > > >         mutex_lock(&aspm_lock);
> > > > > > > +       /*
> > > > > > > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times,
> > > > > > > T_POWER_ON and
> > > > > > > +        * LTR_L1.2_THRESHOLD are programmed properly before enable
> > > > > > > bits for
> > > > > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > > > +        */
> > > > > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > > > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
> > > > >
> > > > > I still don't think this is the place to recalculate the L1.2 parameters
> > > > > especially when know the calculation was done but was cleared by
> > > > > pci_bus_reset(). Can't we just do a pci_save/restore_state() before/after
> > > > > pci_bus_reset() in vmd.c?
> > > >
> > > > I have not thought pci_save/restore_state() around pci_bus_reset()
> > > > before.  It is an interesting direction.
> > > >
> > > > So, I prepare modification below for test.  Include "[PATCH v8 1/4]
> > > > PCI: vmd: Set PCI devices to D0 before enable PCI PM's L1 substates",
> > > > too.  Then, both the PCIe bridge and the PCIe device have the same
> > > > LTR_L1.2_THRESHOLD 101376ns as expected.
> > > >
> > > > diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> > > > index bbf4a47e7b31..6b8dd4f30127 100644
> > > > --- a/drivers/pci/controller/vmd.c
> > > > +++ b/drivers/pci/controller/vmd.c
> > > > @@ -727,6 +727,18 @@ static void vmd_copy_host_bridge_flags(struct
> > > > pci_host_bridge *root_bridge,
> > > >         vmd_bridge->native_dpc = root_bridge->native_dpc;
> > > >  }
> > > >
> > > > +static int vmd_pci_save_state(struct pci_dev *pdev, void *userdata)
> > > > +{
> > > > +       pci_save_state(pdev);
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static int vmd_pci_restore_state(struct pci_dev *pdev, void *userdata)
> > > > +{
> > > > +       pci_restore_state(pdev);
> > > > +       return 0;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Enable ASPM and LTR settings on devices that aren't configured by BIOS.
> > > >   */
> > > > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > > unsigned long features)
> > > >         pci_scan_child_bus(vmd->bus);
> > > >         vmd_domain_reset(vmd);
> > > >
> > > > +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> > > >         /* When Intel VMD is enabled, the OS does not discover the Root
> > > > Ports
> > > >          * owned by Intel VMD within the MMCFG space. pci_reset_bus()
> > > > applies
> > > >          * a reset to the parent of the PCI device supplied as argument.
> > > > This
> > > > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > > unsigned long features)
> > > >                         break;
> > > >                 }
> > > >         }
> > > > +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);
> > >
> > > Why not call pci_reset_bus() (or __pci_reset_bus()) then in
> > > vmd_enable_domain() which preserves state unlike pci_reset_bus()?
> > >
> > > (Don't tell me naming of these functions is a horrible mess. :-/)
> >
> > Hmm. So this *is* calling pci_reset_bus().
>
> Yeah, I managed to get confused by the names myself, I somehow
> ended up thinking it calls pci_bus_reset() which is not correct...
>
> > L1.2 configuration has specific
> > ordering requirements for changes to parent & child devices. Could be why it's
> > not getting restored properly.
>
> Indeed, it has to be something else since the patch above doesn't even
> restore anything because dev->state_saved should get set to false by the
> first pci_restore_state() called from
> __pci_reset_bus() -> pci_bus_restore_locked() -> pci_dev_restore(), I
> think!?

Inspired by Ilpo's comment.  I add some debug messages based on
linux-next's tag 'next-20240809' to understand the code path of
pci_reset_bus():

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ffaaca0978cb..3ee71374f1de 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5133,8 +5133,10 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
         * races with ->remove() by the device lock, which must be held by
         * the caller.
         */
-       if (err_handler && err_handler->reset_prepare)
+       if (err_handler && err_handler->reset_prepare) {
+               pci_info(dev, "%s: %pF\n", __func__,
err_handler->reset_prepare);
                err_handler->reset_prepare(dev);
+       }

        /*
         * Wake-up device prior to save.  PM registers default to D0 after
@@ -5144,6 +5146,7 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
        pci_set_power_state(dev, PCI_D0);

        pci_save_state(dev);
+       pci_info(dev, "%s: PCI state_saved is %s\n", __func__,
dev->state_saved ? "true" : "false");
        /*
         * Disable the device by clearing the Command register, except for
         * INTx-disable which is set.  This not only disables MMIO and I/O port
@@ -5655,6 +5658,10 @@ static void
pci_bus_save_and_disable_locked(struct pci_bus *bus)
        struct pci_dev *dev;

        list_for_each_entry(dev, &bus->devices, bus_list) {
+               pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
+                        __func__,
+                        dev->state_saved ? "true" : "false",
+                        dev->subordinate ? "has" : "does not have");
                pci_dev_save_and_disable(dev);
                if (dev->subordinate)
                        pci_bus_save_and_disable_locked(dev->subordinate);
@@ -5671,6 +5678,10 @@ static void pci_bus_restore_locked(struct pci_bus *bus)
        struct pci_dev *dev;

        list_for_each_entry(dev, &bus->devices, bus_list) {
+               pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
+                        __func__,
+                        dev->state_saved ? "true" : "false",
+                        dev->subordinate ? "has" : "does not have");
                pci_dev_restore(dev);
                if (dev->subordinate)
                        pci_bus_restore_locked(dev->subordinate);
@@ -5786,8 +5797,10 @@ static int pci_bus_reset(struct pci_bus *bus, bool probe)
        if (!bus->self || !pci_bus_resettable(bus))
                return -ENOTTY;

-       if (probe)
+       if (probe) {
+               pci_info(bus->self, "%s: probe is true.  So return 0
directly", __func__);
                return 0;
+       }

        pci_bus_lock(bus);

@@ -5858,10 +5871,12 @@ static int __pci_reset_bus(struct pci_bus *bus)
        int rc;

        rc = pci_bus_reset(bus, PCI_RESET_PROBE);
+       pci_info(bus->self, "%s: pci_bus_reset() returns %d\n", __func__, rc);
        if (rc)
                return rc;

        if (pci_bus_trylock(bus)) {
+               pci_info(bus->self, "%s: pci_bus_trylock() returns
true\n", __func__);
                pci_bus_save_and_disable_locked(bus);
                might_sleep();
                rc = pci_bridge_secondary_bus_reset(bus->self);
@@ -5881,6 +5896,7 @@ static int __pci_reset_bus(struct pci_bus *bus)
  */
 int pci_reset_bus(struct pci_dev *pdev)
 {
+       pci_info(pdev, "%s: %s", __func__,
!pci_probe_reset_slot(pdev->slot) ? "true" : "false");
        return (!pci_probe_reset_slot(pdev->slot)) ?
            __pci_reset_slot(pdev->slot) : __pci_reset_bus(pdev->bus);
 }

And, have the information of VMD PCIe devices with the built kernel:

10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core
Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal
decode])
  ...
  Capabilities: [200 v1] L1 PM Substates
    L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
      PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
    L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
      T_CommonMode=0us LTR1.2_Threshold=0ns
    L1SubCtl2: T_PwrOn=0us

10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD
Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
  ...
  Capabilities: [900 v1] L1 PM Substates
    L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
      PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
    L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
      T_CommonMode=0us LTR1.2_Threshold=101376ns
    L1SubCtl2: T_PwrOn=50us

We can see the NVMe has expected LTR1.2_Threshold=101376ns, but the
PCIe bridge has LTR1.2_Threshold=0ns.

Then, check the dmesg.  I notice the debug messages:

pci 10000:e0:06.0: PCI bridge to [bus e1]
pci 10000:e0:06.0: Primary bus is hard wired to 0
pci 10000:e1:00.0: pci_reset_bus: false
pci 10000:e0:06.0: pci_bus_reset: probe is true.  So return 0 directly
pci 10000:e0:06.0: __pci_reset_bus: pci_bus_reset() returns 0
pci 10000:e0:06.0: __pci_reset_bus: pci_bus_trylock() returns true
pci 10000:e1:00.0: pci_bus_save_and_disable_locked: PCI state_saved is
false, and does not have subordinate
pci 10000:e1:00.0: pci_dev_save_and_disable: PCI state_saved is true
Freeing initrd memory: 75236K
pci 10000:e1:00.0: pci_bus_restore_locked: PCI state_saved is true,
and does not have subordinate

So, the code path is:

vmd_enable_domain()
  pci_reset_bus()
    __pci_reset_bus()
      pci_bus_reset()
        pci_bus_save_and_disable_locked()
          pci_dev_save_and_disable()
        pci_bus_restore_locked()
          pci_dev_restore()

And, from the debug messages, I learned only NVMe 10000:e1:00.0 does
pci_save/restore_state.  But, the PCIe bridge 10000:e0:06.0 does not.
So, PCIe bridge 10000:e0:06.0 does not restore state correctly.

Besides, it is NVMe 10000:e1:00.0's bus [e1] been reset, not the VMD's
bus in vmd_enable_domain().
* Bus "e1" has only NVMe 10000:e1:00.0
* VMD's bus in vmd_enable_domain() has PCIe bridge 10000:e0:06.0, NVMe
10000:e1:00.0 and SATA Controller 10000:e0:17.0.

Here is the PCI tree:

-+-[0000:00]-+-00.0  Intel Corporation Device 9a04
 |           +-02.0  Intel Corporation Tiger Lake-LP GT2 [UHD Graphics G4]
 |           +-04.0  Intel Corporation TigerLake-LP Dynamic Tuning
Processor Participant
 |           +-06.0  Intel Corporation RST VMD Managed Controller
 |           +-07.0-[01-2b]--
 |           +-08.0  Intel Corporation GNA Scoring Accelerator module
 |           +-0a.0  Intel Corporation Tigerlake Telemetry Aggregator Driver
 |           +-0d.0  Intel Corporation Tiger Lake-LP Thunderbolt 4 USB
Controller
 |           +-0d.2  Intel Corporation Tiger Lake-LP Thunderbolt 4 NHI #0
 |           +-0e.0  Intel Corporation Volume Management Device NVMe
RAID Controller
 |           +-14.0  Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1
xHCI Host Controller
 |           +-14.2  Intel Corporation Tiger Lake-LP Shared SRAM
 |           +-14.3  Intel Corporation Wi-Fi 6 AX201
 |           +-15.0  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #0
 |           +-15.1  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #1
 |           +-16.0  Intel Corporation Tiger Lake-LP Management Engine Interface
 |           +-17.0  Intel Corporation RST VMD Managed Controller
 |           +-1f.0  Intel Corporation Tiger Lake-LP LPC Controller
 |           +-1f.3  Intel Corporation Tiger Lake-LP Smart Sound
Technology Audio Controller
 |           +-1f.4  Intel Corporation Tiger Lake-LP SMBus Controller
 |           +-1f.5  Intel Corporation Tiger Lake-LP SPI Controller
 |           \-1f.6  Intel Corporation Ethernet Connection (13) I219-V
 \-[10000:e0]-+-06.0-[e1]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD
              \-17.0  Intel Corporation Tiger Lake-LP SATA Controller

According the findings above, to ensure the devices on the VMD bus
have correctly states, seems pci_save_state() all the devices before
pci_reset_bus(), and pci_restore_state() all the devices after
pci_reset_bus() is the correct answer.

Jian-Hong Pan
Ilpo Järvinen Sept. 2, 2024, 3:43 p.m. UTC | #10
On Mon, 12 Aug 2024, Jian-Hong Pan wrote:

> Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年8月8日 週四 下午5:49寫道:
> > On Wed, 7 Aug 2024, David E. Box wrote:
> > > On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:
> > > > On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> > > >
> > > > > David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日 週二 上午4:26寫道:
> > > > > >
> > > > > > Hi Jian-Hong,
> > > > > >
> > > > > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > > > > > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> > > > > > > >
> > > > > > > > Currently, when enable link's L1.2 features with
> > > > > > > > __pci_enable_link_state(),
> > > > > > > > it configs the link directly without ensuring related L1.2 parameters,
> > > > > > > > such
> > > > > > > > as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have
> > > > > > > > been
> > > > > > > > programmed.
> > > > > > > >
> > > > > > > > This leads the link's L1.2 between PCIe Root Port and child device
> > > > > > > > gets
> > > > > > > > wrong configs when a caller tries to enabled it.
> > > > > > > >
> > > > > > > > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> > > > > > > >
> > > > > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor
> > > > > > > > PCIe
> > > > > > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > > > > > >     ...
> > > > > > > >     Capabilities: [200 v1] L1 PM Substates
> > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > > > > > > > L1_PM_Substates+
> > > > > > > >                   PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > > > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > >                    T_CommonMode=45us LTR1.2_Threshold=101376ns
> > > > > > > >         L1SubCtl2: T_PwrOn=50us
> > > > > > > >
> > > > > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue
> > > > > > > > SN550
> > > > > > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > > > > >     ...
> > > > > > > >     Capabilities: [900 v1] L1 PM Substates
> > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > L1_PM_Substates+
> > > > > > > >                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > > > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > >                    T_CommonMode=0us LTR1.2_Threshold=0ns
> > > > > > > >         L1SubCtl2: T_PwrOn=10us
> > > > > > > >
> > > > > > > > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the
> > > > > > > > PCIe
> > > > > > > > Root Port and the child NVMe, they should be programmed with the same
> > > > > > > > LTR1.2_Threshold value. However, they have different values in this
> > > > > > > > case.
> > > > > > > >
> > > > > > > > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly
> > > > > > > > before
> > > > > > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > > > > > __pci_enable_link_state().
> > > > > > > >
> > > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > > > > > ---
> > > > > > > > v2:
> > > > > > > > - Prepare the PCIe LTR parameters before enable L1 Substates
> > > > > > > >
> > > > > > > > v3:
> > > > > > > > - Only enable supported features for the L1 Substates part
> > > > > > > >
> > > > > > > > v4:
> > > > > > > > - Focus on fixing L1.2 parameters, instead of re-initializing whole
> > > > > > > > L1SS
> > > > > > > >
> > > > > > > > v5:
> > > > > > > > - Fix typo and commit message
> > > > > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> > > > > > > >   aspm_get_l1ss_cap()"
> > > > > > > >
> > > > > > > > v6:
> > > > > > > > - Skipped
> > > > > > > >
> > > > > > > > v7:
> > > > > > > > - Pick back and rebase on the new version kernel
> > > > > > > > - Drop the link state flag check. And, always config link state's
> > > > > > > > timing
> > > > > > > >   parameters
> > > > > > > >
> > > > > > > > v8:
> > > > > > > > - Because pcie_aspm_get_link() might return the link as NULL, move
> > > > > > > >   getting the link's parent and child devices after check the link is
> > > > > > > >   not NULL. This avoids NULL memory access.
> > > > > > > >
> > > > > > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > > > index 5db1044c9895..55ff1d26fcea 100644
> > > > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > > > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> > > > > > > >  static int __pci_enable_link_state(struct pci_dev *pdev, int state,
> > > > > > > > bool
> > > > > > > > locked)
> > > > > > > >  {
> > > > > > > >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > > > > > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > > > > +       struct pci_dev *parent, *child;
> > > > > > > >
> > > > > > > >         if (!link)
> > > > > > > >                 return -EINVAL;
> > > > > > > > +
> > > > > > > > +       parent = link->pdev;
> > > > > > > > +       child = link->downstream;
> > > > > > > > +
> > > > > > > >         /*
> > > > > > > >          * A driver requested that ASPM be enabled on this device, but
> > > > > > > >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > > > > > > > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct
> > > > > > > > pci_dev
> > > > > > > > *pdev, int state, bool locked)
> > > > > > > >         if (!locked)
> > > > > > > >                 down_read(&pci_bus_sem);
> > > > > > > >         mutex_lock(&aspm_lock);
> > > > > > > > +       /*
> > > > > > > > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times,
> > > > > > > > T_POWER_ON and
> > > > > > > > +        * LTR_L1.2_THRESHOLD are programmed properly before enable
> > > > > > > > bits for
> > > > > > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > > > > +        */
> > > > > > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > > > > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
> > > > > >
> > > > > > I still don't think this is the place to recalculate the L1.2 parameters
> > > > > > especially when know the calculation was done but was cleared by
> > > > > > pci_bus_reset(). Can't we just do a pci_save/restore_state() before/after
> > > > > > pci_bus_reset() in vmd.c?
> > > > >
> > > > > I have not thought pci_save/restore_state() around pci_bus_reset()
> > > > > before.  It is an interesting direction.
> > > > >
> > > > > So, I prepare modification below for test.  Include "[PATCH v8 1/4]
> > > > > PCI: vmd: Set PCI devices to D0 before enable PCI PM's L1 substates",
> > > > > too.  Then, both the PCIe bridge and the PCIe device have the same
> > > > > LTR_L1.2_THRESHOLD 101376ns as expected.
> > > > >
> > > > > diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> > > > > index bbf4a47e7b31..6b8dd4f30127 100644
> > > > > --- a/drivers/pci/controller/vmd.c
> > > > > +++ b/drivers/pci/controller/vmd.c
> > > > > @@ -727,6 +727,18 @@ static void vmd_copy_host_bridge_flags(struct
> > > > > pci_host_bridge *root_bridge,
> > > > >         vmd_bridge->native_dpc = root_bridge->native_dpc;
> > > > >  }
> > > > >
> > > > > +static int vmd_pci_save_state(struct pci_dev *pdev, void *userdata)
> > > > > +{
> > > > > +       pci_save_state(pdev);
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > > +static int vmd_pci_restore_state(struct pci_dev *pdev, void *userdata)
> > > > > +{
> > > > > +       pci_restore_state(pdev);
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Enable ASPM and LTR settings on devices that aren't configured by BIOS.
> > > > >   */
> > > > > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > > > unsigned long features)
> > > > >         pci_scan_child_bus(vmd->bus);
> > > > >         vmd_domain_reset(vmd);
> > > > >
> > > > > +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> > > > >         /* When Intel VMD is enabled, the OS does not discover the Root
> > > > > Ports
> > > > >          * owned by Intel VMD within the MMCFG space. pci_reset_bus()
> > > > > applies
> > > > >          * a reset to the parent of the PCI device supplied as argument.
> > > > > This
> > > > > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > > > unsigned long features)
> > > > >                         break;
> > > > >                 }
> > > > >         }
> > > > > +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);
> > > >
> > > > Why not call pci_reset_bus() (or __pci_reset_bus()) then in
> > > > vmd_enable_domain() which preserves state unlike pci_reset_bus()?
> > > >
> > > > (Don't tell me naming of these functions is a horrible mess. :-/)
> > >
> > > Hmm. So this *is* calling pci_reset_bus().
> >
> > Yeah, I managed to get confused by the names myself, I somehow
> > ended up thinking it calls pci_bus_reset() which is not correct...
> >
> > > L1.2 configuration has specific
> > > ordering requirements for changes to parent & child devices. Could be why it's
> > > not getting restored properly.
> >
> > Indeed, it has to be something else since the patch above doesn't even
> > restore anything because dev->state_saved should get set to false by the
> > first pci_restore_state() called from
> > __pci_reset_bus() -> pci_bus_restore_locked() -> pci_dev_restore(), I
> > think!?
> 
> Inspired by Ilpo's comment.  I add some debug messages based on
> linux-next's tag 'next-20240809' to understand the code path of
> pci_reset_bus():
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index ffaaca0978cb..3ee71374f1de 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5133,8 +5133,10 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
>          * races with ->remove() by the device lock, which must be held by
>          * the caller.
>          */
> -       if (err_handler && err_handler->reset_prepare)
> +       if (err_handler && err_handler->reset_prepare) {
> +               pci_info(dev, "%s: %pF\n", __func__,
> err_handler->reset_prepare);
>                 err_handler->reset_prepare(dev);
> +       }
> 
>         /*
>          * Wake-up device prior to save.  PM registers default to D0 after
> @@ -5144,6 +5146,7 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
>         pci_set_power_state(dev, PCI_D0);
> 
>         pci_save_state(dev);
> +       pci_info(dev, "%s: PCI state_saved is %s\n", __func__,
> dev->state_saved ? "true" : "false");
>         /*
>          * Disable the device by clearing the Command register, except for
>          * INTx-disable which is set.  This not only disables MMIO and I/O port
> @@ -5655,6 +5658,10 @@ static void
> pci_bus_save_and_disable_locked(struct pci_bus *bus)
>         struct pci_dev *dev;
> 
>         list_for_each_entry(dev, &bus->devices, bus_list) {
> +               pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
> +                        __func__,
> +                        dev->state_saved ? "true" : "false",
> +                        dev->subordinate ? "has" : "does not have");
>                 pci_dev_save_and_disable(dev);
>                 if (dev->subordinate)
>                         pci_bus_save_and_disable_locked(dev->subordinate);
> @@ -5671,6 +5678,10 @@ static void pci_bus_restore_locked(struct pci_bus *bus)
>         struct pci_dev *dev;
> 
>         list_for_each_entry(dev, &bus->devices, bus_list) {
> +               pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
> +                        __func__,
> +                        dev->state_saved ? "true" : "false",
> +                        dev->subordinate ? "has" : "does not have");
>                 pci_dev_restore(dev);
>                 if (dev->subordinate)
>                         pci_bus_restore_locked(dev->subordinate);
> @@ -5786,8 +5797,10 @@ static int pci_bus_reset(struct pci_bus *bus, bool probe)
>         if (!bus->self || !pci_bus_resettable(bus))
>                 return -ENOTTY;
> 
> -       if (probe)
> +       if (probe) {
> +               pci_info(bus->self, "%s: probe is true.  So return 0
> directly", __func__);
>                 return 0;
> +       }
> 
>         pci_bus_lock(bus);
> 
> @@ -5858,10 +5871,12 @@ static int __pci_reset_bus(struct pci_bus *bus)
>         int rc;
> 
>         rc = pci_bus_reset(bus, PCI_RESET_PROBE);
> +       pci_info(bus->self, "%s: pci_bus_reset() returns %d\n", __func__, rc);
>         if (rc)
>                 return rc;
> 
>         if (pci_bus_trylock(bus)) {
> +               pci_info(bus->self, "%s: pci_bus_trylock() returns
> true\n", __func__);
>                 pci_bus_save_and_disable_locked(bus);
>                 might_sleep();
>                 rc = pci_bridge_secondary_bus_reset(bus->self);
> @@ -5881,6 +5896,7 @@ static int __pci_reset_bus(struct pci_bus *bus)
>   */
>  int pci_reset_bus(struct pci_dev *pdev)
>  {
> +       pci_info(pdev, "%s: %s", __func__,
> !pci_probe_reset_slot(pdev->slot) ? "true" : "false");
>         return (!pci_probe_reset_slot(pdev->slot)) ?
>             __pci_reset_slot(pdev->slot) : __pci_reset_bus(pdev->bus);
>  }
> 
> And, have the information of VMD PCIe devices with the built kernel:
> 
> 10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core
> Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal
> decode])
>   ...
>   Capabilities: [200 v1] L1 PM Substates
>     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
>       PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
>     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
>       T_CommonMode=0us LTR1.2_Threshold=0ns
>     L1SubCtl2: T_PwrOn=0us
> 
> 10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD
> Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
>   ...
>   Capabilities: [900 v1] L1 PM Substates
>     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
>       PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
>     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
>       T_CommonMode=0us LTR1.2_Threshold=101376ns
>     L1SubCtl2: T_PwrOn=50us
> 
> We can see the NVMe has expected LTR1.2_Threshold=101376ns, but the
> PCIe bridge has LTR1.2_Threshold=0ns.

This is now the other way around as in the original posting that had 
0ns for 10000:e1:00.0 ??

Is this behavior even consistent or did you e.g. mess up some copy 
pasting somewhere?

> Then, check the dmesg.  I notice the debug messages:
> 
> pci 10000:e0:06.0: PCI bridge to [bus e1]
> pci 10000:e0:06.0: Primary bus is hard wired to 0
> pci 10000:e1:00.0: pci_reset_bus: false
> pci 10000:e0:06.0: pci_bus_reset: probe is true.  So return 0 directly
> pci 10000:e0:06.0: __pci_reset_bus: pci_bus_reset() returns 0
> pci 10000:e0:06.0: __pci_reset_bus: pci_bus_trylock() returns true
> pci 10000:e1:00.0: pci_bus_save_and_disable_locked: PCI state_saved is
> false, and does not have subordinate
> pci 10000:e1:00.0: pci_dev_save_and_disable: PCI state_saved is true
> Freeing initrd memory: 75236K
> pci 10000:e1:00.0: pci_bus_restore_locked: PCI state_saved is true,
> and does not have subordinate
> 
> So, the code path is:
> 
> vmd_enable_domain()
>   pci_reset_bus()
>     __pci_reset_bus()
>       pci_bus_reset()
>         pci_bus_save_and_disable_locked()
>           pci_dev_save_and_disable()
>         pci_bus_restore_locked()
>           pci_dev_restore()
> 
> And, from the debug messages, I learned only NVMe 10000:e1:00.0 does
> pci_save/restore_state.  But, the PCIe bridge 10000:e0:06.0 does not.
> So, PCIe bridge 10000:e0:06.0 does not restore state correctly.

It should not be necessary to restore the bridge's configuration as it 
ought to not be changed by the SBR itself, PCIe6 spec 7.5.1.3.13:

"Port configuration registers must not be changed, except as required to 
update Port status."

While the second part wording leaves some leeway, I don't think any of 
these field really fall under "Port status".

> Besides, it is NVMe 10000:e1:00.0's bus [e1] been reset, not the VMD's
> bus in vmd_enable_domain().
> * Bus "e1" has only NVMe 10000:e1:00.0
> * VMD's bus in vmd_enable_domain() has PCIe bridge 10000:e0:06.0, NVMe
> 10000:e1:00.0 and SATA Controller 10000:e0:17.0.

...But even if those registers on the PCIe bridge were changed underneath 
against the spec, it's not clear from your debug log why pci_dev_restore() 
-> pci_restore_state() -> pci_restore_pcie_state() -> 
pci_restore_aspm_l1ss_state() did not restore also parent's 
LTR1.2_Threshold?? I think it should attempt to do that.
Jian-Hong Pan Sept. 3, 2024, 10:31 a.m. UTC | #11
Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年9月2日 週一 下午11:44寫道:
>
> On Mon, 12 Aug 2024, Jian-Hong Pan wrote:
>
> > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年8月8日 週四 下午5:49寫道:
> > > On Wed, 7 Aug 2024, David E. Box wrote:
> > > > On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:
> > > > > On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> > > > >
> > > > > > David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日 週二 上午4:26寫道:
> > > > > > >
> > > > > > > Hi Jian-Hong,
> > > > > > >
> > > > > > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > > > > > > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> > > > > > > > >
> > > > > > > > > Currently, when enable link's L1.2 features with
> > > > > > > > > __pci_enable_link_state(),
> > > > > > > > > it configs the link directly without ensuring related L1.2 parameters,
> > > > > > > > > such
> > > > > > > > > as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have
> > > > > > > > > been
> > > > > > > > > programmed.
> > > > > > > > >
> > > > > > > > > This leads the link's L1.2 between PCIe Root Port and child device
> > > > > > > > > gets
> > > > > > > > > wrong configs when a caller tries to enabled it.
> > > > > > > > >
> > > > > > > > > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> > > > > > > > >
> > > > > > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor
> > > > > > > > > PCIe
> > > > > > > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > > > > > > >     ...
> > > > > > > > >     Capabilities: [200 v1] L1 PM Substates
> > > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > > > > > > > > L1_PM_Substates+
> > > > > > > > >                   PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > > > > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > >                    T_CommonMode=45us LTR1.2_Threshold=101376ns
> > > > > > > > >         L1SubCtl2: T_PwrOn=50us
> > > > > > > > >
> > > > > > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue
> > > > > > > > > SN550
> > > > > > > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > > > > > >     ...
> > > > > > > > >     Capabilities: [900 v1] L1 PM Substates
> > > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > > L1_PM_Substates+
> > > > > > > > >                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > > > > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > >                    T_CommonMode=0us LTR1.2_Threshold=0ns
> > > > > > > > >         L1SubCtl2: T_PwrOn=10us
> > > > > > > > >
> > > > > > > > > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the
> > > > > > > > > PCIe
> > > > > > > > > Root Port and the child NVMe, they should be programmed with the same
> > > > > > > > > LTR1.2_Threshold value. However, they have different values in this
> > > > > > > > > case.
> > > > > > > > >
> > > > > > > > > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly
> > > > > > > > > before
> > > > > > > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > > > > > > __pci_enable_link_state().
> > > > > > > > >
> > > > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > > > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > > > > > > ---
> > > > > > > > > v2:
> > > > > > > > > - Prepare the PCIe LTR parameters before enable L1 Substates
> > > > > > > > >
> > > > > > > > > v3:
> > > > > > > > > - Only enable supported features for the L1 Substates part
> > > > > > > > >
> > > > > > > > > v4:
> > > > > > > > > - Focus on fixing L1.2 parameters, instead of re-initializing whole
> > > > > > > > > L1SS
> > > > > > > > >
> > > > > > > > > v5:
> > > > > > > > > - Fix typo and commit message
> > > > > > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> > > > > > > > >   aspm_get_l1ss_cap()"
> > > > > > > > >
> > > > > > > > > v6:
> > > > > > > > > - Skipped
> > > > > > > > >
> > > > > > > > > v7:
> > > > > > > > > - Pick back and rebase on the new version kernel
> > > > > > > > > - Drop the link state flag check. And, always config link state's
> > > > > > > > > timing
> > > > > > > > >   parameters
> > > > > > > > >
> > > > > > > > > v8:
> > > > > > > > > - Because pcie_aspm_get_link() might return the link as NULL, move
> > > > > > > > >   getting the link's parent and child devices after check the link is
> > > > > > > > >   not NULL. This avoids NULL memory access.
> > > > > > > > >
> > > > > > > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > > > > index 5db1044c9895..55ff1d26fcea 100644
> > > > > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > > > > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> > > > > > > > >  static int __pci_enable_link_state(struct pci_dev *pdev, int state,
> > > > > > > > > bool
> > > > > > > > > locked)
> > > > > > > > >  {
> > > > > > > > >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > > > > > > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > > > > > +       struct pci_dev *parent, *child;
> > > > > > > > >
> > > > > > > > >         if (!link)
> > > > > > > > >                 return -EINVAL;
> > > > > > > > > +
> > > > > > > > > +       parent = link->pdev;
> > > > > > > > > +       child = link->downstream;
> > > > > > > > > +
> > > > > > > > >         /*
> > > > > > > > >          * A driver requested that ASPM be enabled on this device, but
> > > > > > > > >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > > > > > > > > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct
> > > > > > > > > pci_dev
> > > > > > > > > *pdev, int state, bool locked)
> > > > > > > > >         if (!locked)
> > > > > > > > >                 down_read(&pci_bus_sem);
> > > > > > > > >         mutex_lock(&aspm_lock);
> > > > > > > > > +       /*
> > > > > > > > > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times,
> > > > > > > > > T_POWER_ON and
> > > > > > > > > +        * LTR_L1.2_THRESHOLD are programmed properly before enable
> > > > > > > > > bits for
> > > > > > > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > > > > > +        */
> > > > > > > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > > > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > > > > > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
> > > > > > >
> > > > > > > I still don't think this is the place to recalculate the L1.2 parameters
> > > > > > > especially when know the calculation was done but was cleared by
> > > > > > > pci_bus_reset(). Can't we just do a pci_save/restore_state() before/after
> > > > > > > pci_bus_reset() in vmd.c?
> > > > > >
> > > > > > I have not thought pci_save/restore_state() around pci_bus_reset()
> > > > > > before.  It is an interesting direction.
> > > > > >
> > > > > > So, I prepare modification below for test.  Include "[PATCH v8 1/4]
> > > > > > PCI: vmd: Set PCI devices to D0 before enable PCI PM's L1 substates",
> > > > > > too.  Then, both the PCIe bridge and the PCIe device have the same
> > > > > > LTR_L1.2_THRESHOLD 101376ns as expected.
> > > > > >
> > > > > > diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> > > > > > index bbf4a47e7b31..6b8dd4f30127 100644
> > > > > > --- a/drivers/pci/controller/vmd.c
> > > > > > +++ b/drivers/pci/controller/vmd.c
> > > > > > @@ -727,6 +727,18 @@ static void vmd_copy_host_bridge_flags(struct
> > > > > > pci_host_bridge *root_bridge,
> > > > > >         vmd_bridge->native_dpc = root_bridge->native_dpc;
> > > > > >  }
> > > > > >
> > > > > > +static int vmd_pci_save_state(struct pci_dev *pdev, void *userdata)
> > > > > > +{
> > > > > > +       pci_save_state(pdev);
> > > > > > +       return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int vmd_pci_restore_state(struct pci_dev *pdev, void *userdata)
> > > > > > +{
> > > > > > +       pci_restore_state(pdev);
> > > > > > +       return 0;
> > > > > > +}
> > > > > > +
> > > > > >  /*
> > > > > >   * Enable ASPM and LTR settings on devices that aren't configured by BIOS.
> > > > > >   */
> > > > > > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > > > > unsigned long features)
> > > > > >         pci_scan_child_bus(vmd->bus);
> > > > > >         vmd_domain_reset(vmd);
> > > > > >
> > > > > > +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> > > > > >         /* When Intel VMD is enabled, the OS does not discover the Root
> > > > > > Ports
> > > > > >          * owned by Intel VMD within the MMCFG space. pci_reset_bus()
> > > > > > applies
> > > > > >          * a reset to the parent of the PCI device supplied as argument.
> > > > > > This
> > > > > > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > > > > unsigned long features)
> > > > > >                         break;
> > > > > >                 }
> > > > > >         }
> > > > > > +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);
> > > > >
> > > > > Why not call pci_reset_bus() (or __pci_reset_bus()) then in
> > > > > vmd_enable_domain() which preserves state unlike pci_reset_bus()?
> > > > >
> > > > > (Don't tell me naming of these functions is a horrible mess. :-/)
> > > >
> > > > Hmm. So this *is* calling pci_reset_bus().
> > >
> > > Yeah, I managed to get confused by the names myself, I somehow
> > > ended up thinking it calls pci_bus_reset() which is not correct...
> > >
> > > > L1.2 configuration has specific
> > > > ordering requirements for changes to parent & child devices. Could be why it's
> > > > not getting restored properly.
> > >
> > > Indeed, it has to be something else since the patch above doesn't even
> > > restore anything because dev->state_saved should get set to false by the
> > > first pci_restore_state() called from
> > > __pci_reset_bus() -> pci_bus_restore_locked() -> pci_dev_restore(), I
> > > think!?
> >
> > Inspired by Ilpo's comment.  I add some debug messages based on
> > linux-next's tag 'next-20240809' to understand the code path of
> > pci_reset_bus():
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index ffaaca0978cb..3ee71374f1de 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5133,8 +5133,10 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
> >          * races with ->remove() by the device lock, which must be held by
> >          * the caller.
> >          */
> > -       if (err_handler && err_handler->reset_prepare)
> > +       if (err_handler && err_handler->reset_prepare) {
> > +               pci_info(dev, "%s: %pF\n", __func__,
> > err_handler->reset_prepare);
> >                 err_handler->reset_prepare(dev);
> > +       }
> >
> >         /*
> >          * Wake-up device prior to save.  PM registers default to D0 after
> > @@ -5144,6 +5146,7 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
> >         pci_set_power_state(dev, PCI_D0);
> >
> >         pci_save_state(dev);
> > +       pci_info(dev, "%s: PCI state_saved is %s\n", __func__,
> > dev->state_saved ? "true" : "false");
> >         /*
> >          * Disable the device by clearing the Command register, except for
> >          * INTx-disable which is set.  This not only disables MMIO and I/O port
> > @@ -5655,6 +5658,10 @@ static void
> > pci_bus_save_and_disable_locked(struct pci_bus *bus)
> >         struct pci_dev *dev;
> >
> >         list_for_each_entry(dev, &bus->devices, bus_list) {
> > +               pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
> > +                        __func__,
> > +                        dev->state_saved ? "true" : "false",
> > +                        dev->subordinate ? "has" : "does not have");
> >                 pci_dev_save_and_disable(dev);
> >                 if (dev->subordinate)
> >                         pci_bus_save_and_disable_locked(dev->subordinate);
> > @@ -5671,6 +5678,10 @@ static void pci_bus_restore_locked(struct pci_bus *bus)
> >         struct pci_dev *dev;
> >
> >         list_for_each_entry(dev, &bus->devices, bus_list) {
> > +               pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
> > +                        __func__,
> > +                        dev->state_saved ? "true" : "false",
> > +                        dev->subordinate ? "has" : "does not have");
> >                 pci_dev_restore(dev);
> >                 if (dev->subordinate)
> >                         pci_bus_restore_locked(dev->subordinate);
> > @@ -5786,8 +5797,10 @@ static int pci_bus_reset(struct pci_bus *bus, bool probe)
> >         if (!bus->self || !pci_bus_resettable(bus))
> >                 return -ENOTTY;
> >
> > -       if (probe)
> > +       if (probe) {
> > +               pci_info(bus->self, "%s: probe is true.  So return 0
> > directly", __func__);
> >                 return 0;
> > +       }
> >
> >         pci_bus_lock(bus);
> >
> > @@ -5858,10 +5871,12 @@ static int __pci_reset_bus(struct pci_bus *bus)
> >         int rc;
> >
> >         rc = pci_bus_reset(bus, PCI_RESET_PROBE);
> > +       pci_info(bus->self, "%s: pci_bus_reset() returns %d\n", __func__, rc);
> >         if (rc)
> >                 return rc;
> >
> >         if (pci_bus_trylock(bus)) {
> > +               pci_info(bus->self, "%s: pci_bus_trylock() returns
> > true\n", __func__);
> >                 pci_bus_save_and_disable_locked(bus);
> >                 might_sleep();
> >                 rc = pci_bridge_secondary_bus_reset(bus->self);
> > @@ -5881,6 +5896,7 @@ static int __pci_reset_bus(struct pci_bus *bus)
> >   */
> >  int pci_reset_bus(struct pci_dev *pdev)
> >  {
> > +       pci_info(pdev, "%s: %s", __func__,
> > !pci_probe_reset_slot(pdev->slot) ? "true" : "false");
> >         return (!pci_probe_reset_slot(pdev->slot)) ?
> >             __pci_reset_slot(pdev->slot) : __pci_reset_bus(pdev->bus);
> >  }
> >
> > And, have the information of VMD PCIe devices with the built kernel:
> >
> > 10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core
> > Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal
> > decode])
> >   ...
> >   Capabilities: [200 v1] L1 PM Substates
> >     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
> >       PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> >     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> >       T_CommonMode=0us LTR1.2_Threshold=0ns
> >     L1SubCtl2: T_PwrOn=0us
> >
> > 10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD
> > Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
> >   ...
> >   Capabilities: [900 v1] L1 PM Substates
> >     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
> >       PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> >     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> >       T_CommonMode=0us LTR1.2_Threshold=101376ns
> >     L1SubCtl2: T_PwrOn=50us
> >
> > We can see the NVMe has expected LTR1.2_Threshold=101376ns, but the
> > PCIe bridge has LTR1.2_Threshold=0ns.
>
> This is now the other way around as in the original posting that had
> 0ns for 10000:e1:00.0 ??
>
> Is this behavior even consistent or did you e.g. mess up some copy
> pasting somewhere?

The original posting came with older kernel 6.5. It shows:

10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core
Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal
decode])
...
  Capabilities: [200 v1] L1 PM Substates
    L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
      PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
    L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
      T_CommonMode=45us LTR1.2_Threshold=101376ns
    L1SubCtl2: T_PwrOn=50us
...

10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD
Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
...
  Capabilities: [900 v1] L1 PM Substates
    L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
      PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
    L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
      T_CommonMode=0us LTR1.2_Threshold=0ns
    L1SubCtl2: T_PwrOn=10us
...

Full information:
https://gist.github.com/starnight/e19487a44efefff477f9ac9ed641c183

But, newer kernel, for example linux-next next-20240809 and
next-20240820 which I have tried shows:

10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core
Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal
decode])
...
  Capabilities: [200 v1] L1 PM Substates
    L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
      PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
    L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
      T_CommonMode=0us LTR1.2_Threshold=0ns
    L1SubCtl2: T_PwrOn=0us
...

10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD
Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
...
  Capabilities: [900 v1] L1 PM Substates
    L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
      PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
    L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
      T_CommonMode=0us LTR1.2_Threshold=101376ns
    L1SubCtl2: T_PwrOn=50us
...

Full information:
https://gist.github.com/starnight/081ea4adbce40a27faf234e5e135b49a

So, according to the information above, different kernel versions show
different L1 sub-states.

Jian-Hong Pan
Nirmal Patel Sept. 3, 2024, 3:17 p.m. UTC | #12
On Mon, 12 Aug 2024 16:18:22 +0800
Jian-Hong Pan <jhp@endlessos.org> wrote:

> Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年8月8日 週四
> 下午5:49寫道:
> >
> > On Wed, 7 Aug 2024, David E. Box wrote:
> >  
> > > On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:  
> > > > On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> > > >  
> > > > > David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日
> > > > > 週二 上午4:26寫道:  
> > > > > >
> > > > > > Hi Jian-Hong,
> > > > > >
> > > > > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:  
> > > > > > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五
> > > > > > > 下午4:04寫道:  
> > > > > > > >
> > > > > > > > Currently, when enable link's L1.2 features with
> > > > > > > > __pci_enable_link_state(),
> > > > > > > > it configs the link directly without ensuring related
> > > > > > > > L1.2 parameters, such
> > > > > > > > as T_POWER_ON, Common_Mode_Restore_Time, and
> > > > > > > > LTR_L1.2_THRESHOLD have been
> > > > > > > > programmed.
> > > > > > > >
> > > > > > > > This leads the link's L1.2 between PCIe Root Port and
> > > > > > > > child device gets
> > > > > > > > wrong configs when a caller tries to enabled it.
> > > > > > > >
> > > > > > > > Here is a failed example on ASUS B1400CEAE with enabled
> > > > > > > > VMD:
> > > > > > > >
> > > > > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen
> > > > > > > > Core Processor PCIe
> > > > > > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > > > > > >     ...
> > > > > > > >     Capabilities: [200 v1] L1 PM Substates
> > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+
> > > > > > > > ASPM_L1.1+ L1_PM_Substates+
> > > > > > > >                   PortCommonModeRestoreTime=45us
> > > > > > > > PortTPowerOnTime=50us L1SubCtl1: PCI-PM_L1.2-
> > > > > > > > PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- T_CommonMode=45us
> > > > > > > > LTR1.2_Threshold=101376ns L1SubCtl2: T_PwrOn=50us
> > > > > > > >
> > > > > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk
> > > > > > > > Corp WD Blue SN550
> > > > > > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > > > > >     ...
> > > > > > > >     Capabilities: [900 v1] L1 PM Substates
> > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+
> > > > > > > > ASPM_L1.1- L1_PM_Substates+
> > > > > > > >                   PortCommonModeRestoreTime=32us
> > > > > > > > PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2-
> > > > > > > > PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- T_CommonMode=0us
> > > > > > > > LTR1.2_Threshold=0ns L1SubCtl2: T_PwrOn=10us
> > > > > > > >
> > > > > > > > According to "PCIe r6.0, sec 5.5.4", before enabling
> > > > > > > > ASPM L1.2 on the PCIe
> > > > > > > > Root Port and the child NVMe, they should be programmed
> > > > > > > > with the same LTR1.2_Threshold value. However, they
> > > > > > > > have different values in this case.
> > > > > > > >
> > > > > > > > Invoke aspm_calc_l12_info() to program the L1.2
> > > > > > > > parameters properly before
> > > > > > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > > > > > __pci_enable_link_state().
> > > > > > > >
> > > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > > > > > ---
> > > > > > > > v2:
> > > > > > > > - Prepare the PCIe LTR parameters before enable L1
> > > > > > > > Substates
> > > > > > > >
> > > > > > > > v3:
> > > > > > > > - Only enable supported features for the L1 Substates
> > > > > > > > part
> > > > > > > >
> > > > > > > > v4:
> > > > > > > > - Focus on fixing L1.2 parameters, instead of
> > > > > > > > re-initializing whole L1SS
> > > > > > > >
> > > > > > > > v5:
> > > > > > > > - Fix typo and commit message
> > > > > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM:
> > > > > > > > Introduce aspm_get_l1ss_cap()"
> > > > > > > >
> > > > > > > > v6:
> > > > > > > > - Skipped
> > > > > > > >
> > > > > > > > v7:
> > > > > > > > - Pick back and rebase on the new version kernel
> > > > > > > > - Drop the link state flag check. And, always config
> > > > > > > > link state's timing
> > > > > > > >   parameters
> > > > > > > >
> > > > > > > > v8:
> > > > > > > > - Because pcie_aspm_get_link() might return the link as
> > > > > > > > NULL, move getting the link's parent and child devices
> > > > > > > > after check the link is not NULL. This avoids NULL
> > > > > > > > memory access.
> > > > > > > >
> > > > > > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/drivers/pci/pcie/aspm.c
> > > > > > > > b/drivers/pci/pcie/aspm.c index
> > > > > > > > 5db1044c9895..55ff1d26fcea 100644 ---
> > > > > > > > a/drivers/pci/pcie/aspm.c +++ b/drivers/pci/pcie/aspm.c
> > > > > > > > @@ -1411,9 +1411,15 @@
> > > > > > > > EXPORT_SYMBOL(pci_disable_link_state); static int
> > > > > > > > __pci_enable_link_state(struct pci_dev *pdev, int
> > > > > > > > state, bool locked)
> > > > > > > >  {
> > > > > > > >         struct pcie_link_state *link =
> > > > > > > > pcie_aspm_get_link(pdev);
> > > > > > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > > > > +       struct pci_dev *parent, *child;
> > > > > > > >
> > > > > > > >         if (!link)
> > > > > > > >                 return -EINVAL;
> > > > > > > > +
> > > > > > > > +       parent = link->pdev;
> > > > > > > > +       child = link->downstream;
> > > > > > > > +
> > > > > > > >         /*
> > > > > > > >          * A driver requested that ASPM be enabled on
> > > > > > > > this device, but
> > > > > > > >          * if we don't have permission to manage ASPM
> > > > > > > > (e.g., on ACPI @@ -1428,6 +1434,15 @@ static int
> > > > > > > > __pci_enable_link_state(struct pci_dev
> > > > > > > > *pdev, int state, bool locked)
> > > > > > > >         if (!locked)
> > > > > > > >                 down_read(&pci_bus_sem);
> > > > > > > >         mutex_lock(&aspm_lock);
> > > > > > > > +       /*
> > > > > > > > +        * Ensure L1.2 parameters:
> > > > > > > > Common_Mode_Restore_Times, T_POWER_ON and
> > > > > > > > +        * LTR_L1.2_THRESHOLD are programmed properly
> > > > > > > > before enable bits for
> > > > > > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > > > > +        */
> > > > > > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > > > > +       aspm_calc_l12_info(link, parent_l1ss_cap,
> > > > > > > > child_l1ss_cap);  
> > > > > >
> > > > > > I still don't think this is the place to recalculate the
> > > > > > L1.2 parameters especially when know the calculation was
> > > > > > done but was cleared by pci_bus_reset(). Can't we just do a
> > > > > > pci_save/restore_state() before/after pci_bus_reset() in
> > > > > > vmd.c?  
> > > > >
> > > > > I have not thought pci_save/restore_state() around
> > > > > pci_bus_reset() before.  It is an interesting direction.
> > > > >
> > > > > So, I prepare modification below for test.  Include "[PATCH
> > > > > v8 1/4] PCI: vmd: Set PCI devices to D0 before enable PCI
> > > > > PM's L1 substates", too.  Then, both the PCIe bridge and the
> > > > > PCIe device have the same LTR_L1.2_THRESHOLD 101376ns as
> > > > > expected.
> > > > >
> > > > > diff --git a/drivers/pci/controller/vmd.c
> > > > > b/drivers/pci/controller/vmd.c index
> > > > > bbf4a47e7b31..6b8dd4f30127 100644 ---
> > > > > a/drivers/pci/controller/vmd.c +++
> > > > > b/drivers/pci/controller/vmd.c @@ -727,6 +727,18 @@ static
> > > > > void vmd_copy_host_bridge_flags(struct pci_host_bridge
> > > > > *root_bridge, vmd_bridge->native_dpc =
> > > > > root_bridge->native_dpc; }
> > > > >
> > > > > +static int vmd_pci_save_state(struct pci_dev *pdev, void
> > > > > *userdata) +{
> > > > > +       pci_save_state(pdev);
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > > +static int vmd_pci_restore_state(struct pci_dev *pdev, void
> > > > > *userdata) +{
> > > > > +       pci_restore_state(pdev);
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Enable ASPM and LTR settings on devices that aren't
> > > > > configured by BIOS. */
> > > > > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct
> > > > > vmd_dev *vmd, unsigned long features)
> > > > >         pci_scan_child_bus(vmd->bus);
> > > > >         vmd_domain_reset(vmd);
> > > > >
> > > > > +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> > > > >         /* When Intel VMD is enabled, the OS does not
> > > > > discover the Root Ports
> > > > >          * owned by Intel VMD within the MMCFG space.
> > > > > pci_reset_bus() applies
> > > > >          * a reset to the parent of the PCI device supplied
> > > > > as argument. This
> > > > > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct
> > > > > vmd_dev *vmd, unsigned long features)
> > > > >                         break;
> > > > >                 }
> > > > >         }
> > > > > +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);  
> > > >
> > > > Why not call pci_reset_bus() (or __pci_reset_bus()) then in
> > > > vmd_enable_domain() which preserves state unlike
> > > > pci_reset_bus()?
> > > >
> > > > (Don't tell me naming of these functions is a horrible mess.
> > > > :-/)  
> > >
> > > Hmm. So this *is* calling pci_reset_bus().  
> >
> > Yeah, I managed to get confused by the names myself, I somehow
> > ended up thinking it calls pci_bus_reset() which is not correct...
> >  
> > > L1.2 configuration has specific
> > > ordering requirements for changes to parent & child devices.
> > > Could be why it's not getting restored properly.  
> >
> > Indeed, it has to be something else since the patch above doesn't
> > even restore anything because dev->state_saved should get set to
> > false by the first pci_restore_state() called from
> > __pci_reset_bus() -> pci_bus_restore_locked() -> pci_dev_restore(),
> > I think!?  
> 
> Inspired by Ilpo's comment.  I add some debug messages based on
> linux-next's tag 'next-20240809' to understand the code path of
> pci_reset_bus():
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index ffaaca0978cb..3ee71374f1de 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5133,8 +5133,10 @@ static void pci_dev_save_and_disable(struct
> pci_dev *dev)
>          * races with ->remove() by the device lock, which must be
> held by
>          * the caller.
>          */
> -       if (err_handler && err_handler->reset_prepare)
> +       if (err_handler && err_handler->reset_prepare) {
> +               pci_info(dev, "%s: %pF\n", __func__,
> err_handler->reset_prepare);
>                 err_handler->reset_prepare(dev);
> +       }
> 
>         /*
>          * Wake-up device prior to save.  PM registers default to D0
> after @@ -5144,6 +5146,7 @@ static void
> pci_dev_save_and_disable(struct pci_dev *dev)
> pci_set_power_state(dev, PCI_D0);
> 
>         pci_save_state(dev);
> +       pci_info(dev, "%s: PCI state_saved is %s\n", __func__,
> dev->state_saved ? "true" : "false");
>         /*
>          * Disable the device by clearing the Command register,
> except for
>          * INTx-disable which is set.  This not only disables MMIO
> and I/O port @@ -5655,6 +5658,10 @@ static void
> pci_bus_save_and_disable_locked(struct pci_bus *bus)
>         struct pci_dev *dev;
> 
>         list_for_each_entry(dev, &bus->devices, bus_list) {
> +               pci_info(dev, "%s: PCI state_saved is %s, and %s
> subordinate\n",
> +                        __func__,
> +                        dev->state_saved ? "true" : "false",
> +                        dev->subordinate ? "has" : "does not have");
>                 pci_dev_save_and_disable(dev);
>                 if (dev->subordinate)
>                         pci_bus_save_and_disable_locked(dev->subordinate);
> @@ -5671,6 +5678,10 @@ static void pci_bus_restore_locked(struct
> pci_bus *bus) struct pci_dev *dev;
> 
>         list_for_each_entry(dev, &bus->devices, bus_list) {
> +               pci_info(dev, "%s: PCI state_saved is %s, and %s
> subordinate\n",
> +                        __func__,
> +                        dev->state_saved ? "true" : "false",
> +                        dev->subordinate ? "has" : "does not have");
>                 pci_dev_restore(dev);
>                 if (dev->subordinate)
>                         pci_bus_restore_locked(dev->subordinate);
> @@ -5786,8 +5797,10 @@ static int pci_bus_reset(struct pci_bus *bus,
> bool probe) if (!bus->self || !pci_bus_resettable(bus))
>                 return -ENOTTY;
> 
> -       if (probe)
> +       if (probe) {
> +               pci_info(bus->self, "%s: probe is true.  So return 0
> directly", __func__);
>                 return 0;
> +       }
> 
>         pci_bus_lock(bus);
> 
> @@ -5858,10 +5871,12 @@ static int __pci_reset_bus(struct pci_bus
> *bus) int rc;
> 
>         rc = pci_bus_reset(bus, PCI_RESET_PROBE);
> +       pci_info(bus->self, "%s: pci_bus_reset() returns %d\n",
> __func__, rc); if (rc)
>                 return rc;
> 
>         if (pci_bus_trylock(bus)) {
> +               pci_info(bus->self, "%s: pci_bus_trylock() returns
> true\n", __func__);
>                 pci_bus_save_and_disable_locked(bus);
>                 might_sleep();
>                 rc = pci_bridge_secondary_bus_reset(bus->self);
> @@ -5881,6 +5896,7 @@ static int __pci_reset_bus(struct pci_bus *bus)
>   */
>  int pci_reset_bus(struct pci_dev *pdev)
>  {
> +       pci_info(pdev, "%s: %s", __func__,
> !pci_probe_reset_slot(pdev->slot) ? "true" : "false");
>         return (!pci_probe_reset_slot(pdev->slot)) ?
>             __pci_reset_slot(pdev->slot) : __pci_reset_bus(pdev->bus);
>  }
> 
> And, have the information of VMD PCIe devices with the built kernel:
> 
> 10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core
> Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal
> decode])
>   ...
>   Capabilities: [200 v1] L1 PM Substates
>     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> L1_PM_Substates+ PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
>     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
>       T_CommonMode=0us LTR1.2_Threshold=0ns
>     L1SubCtl2: T_PwrOn=0us
> 
> 10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD
> Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
>   ...
>   Capabilities: [900 v1] L1 PM Substates
>     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> L1_PM_Substates+ PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
>     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
>       T_CommonMode=0us LTR1.2_Threshold=101376ns
>     L1SubCtl2: T_PwrOn=50us
> 
> We can see the NVMe has expected LTR1.2_Threshold=101376ns, but the
> PCIe bridge has LTR1.2_Threshold=0ns.
> 
> Then, check the dmesg.  I notice the debug messages:
> 
> pci 10000:e0:06.0: PCI bridge to [bus e1]
> pci 10000:e0:06.0: Primary bus is hard wired to 0
> pci 10000:e1:00.0: pci_reset_bus: false
> pci 10000:e0:06.0: pci_bus_reset: probe is true.  So return 0 directly
> pci 10000:e0:06.0: __pci_reset_bus: pci_bus_reset() returns 0
> pci 10000:e0:06.0: __pci_reset_bus: pci_bus_trylock() returns true
> pci 10000:e1:00.0: pci_bus_save_and_disable_locked: PCI state_saved is
> false, and does not have subordinate
> pci 10000:e1:00.0: pci_dev_save_and_disable: PCI state_saved is true
> Freeing initrd memory: 75236K
> pci 10000:e1:00.0: pci_bus_restore_locked: PCI state_saved is true,
> and does not have subordinate
> 
> So, the code path is:
> 
> vmd_enable_domain()
>   pci_reset_bus()
>     __pci_reset_bus()
>       pci_bus_reset()
>         pci_bus_save_and_disable_locked()
>           pci_dev_save_and_disable()
>         pci_bus_restore_locked()
>           pci_dev_restore()
> 
> And, from the debug messages, I learned only NVMe 10000:e1:00.0 does
> pci_save/restore_state.  But, the PCIe bridge 10000:e0:06.0 does not.
> So, PCIe bridge 10000:e0:06.0 does not restore state correctly.
> 
> Besides, it is NVMe 10000:e1:00.0's bus [e1] been reset, not the VMD's
> bus in vmd_enable_domain().
> * Bus "e1" has only NVMe 10000:e1:00.0
> * VMD's bus in vmd_enable_domain() has PCIe bridge 10000:e0:06.0, NVMe
> 10000:e1:00.0 and SATA Controller 10000:e0:17.0.
> 
> Here is the PCI tree:
> 
> -+-[0000:00]-+-00.0  Intel Corporation Device 9a04
>  |           +-02.0  Intel Corporation Tiger Lake-LP GT2 [UHD
> Graphics G4] |           +-04.0  Intel Corporation TigerLake-LP
> Dynamic Tuning Processor Participant
>  |           +-06.0  Intel Corporation RST VMD Managed Controller
>  |           +-07.0-[01-2b]--
>  |           +-08.0  Intel Corporation GNA Scoring Accelerator module
>  |           +-0a.0  Intel Corporation Tigerlake Telemetry Aggregator
> Driver |           +-0d.0  Intel Corporation Tiger Lake-LP
> Thunderbolt 4 USB Controller
>  |           +-0d.2  Intel Corporation Tiger Lake-LP Thunderbolt 4
> NHI #0 |           +-0e.0  Intel Corporation Volume Management Device
> NVMe RAID Controller
>  |           +-14.0  Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1
> xHCI Host Controller
>  |           +-14.2  Intel Corporation Tiger Lake-LP Shared SRAM
>  |           +-14.3  Intel Corporation Wi-Fi 6 AX201
>  |           +-15.0  Intel Corporation Tiger Lake-LP Serial IO I2C
> Controller #0 |           +-15.1  Intel Corporation Tiger Lake-LP
> Serial IO I2C Controller #1 |           +-16.0  Intel Corporation
> Tiger Lake-LP Management Engine Interface |           +-17.0  Intel
> Corporation RST VMD Managed Controller |           +-1f.0  Intel
> Corporation Tiger Lake-LP LPC Controller |           +-1f.3  Intel
> Corporation Tiger Lake-LP Smart Sound Technology Audio Controller
>  |           +-1f.4  Intel Corporation Tiger Lake-LP SMBus Controller
>  |           +-1f.5  Intel Corporation Tiger Lake-LP SPI Controller
>  |           \-1f.6  Intel Corporation Ethernet Connection (13) I219-V
>  \-[10000:e0]-+-06.0-[e1]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD
>               \-17.0  Intel Corporation Tiger Lake-LP SATA Controller
> 
> According the findings above, to ensure the devices on the VMD bus
> have correctly states, seems pci_save_state() all the devices before
> pci_reset_bus(), and pci_restore_state() all the devices after
> pci_reset_bus() is the correct answer.
What happens if you call pci_reset_bus with PCIe bridge 10000:e0:06.0
instead of NVMe 10000:e1:00.0? I believe the current implementation in
vmd_enable_domain finds first child device on each rootport and calls
pci_reset_bus with NVMe.

-nirmal
> 
> Jian-Hong Pan
Jian-Hong Pan Sept. 4, 2024, 3:56 a.m. UTC | #13
Nirmal Patel <nirmal.patel@linux.intel.com> 於 2024年9月3日 週二 下午11:17寫道:
>
> On Mon, 12 Aug 2024 16:18:22 +0800
> Jian-Hong Pan <jhp@endlessos.org> wrote:
>
> > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年8月8日 週四
> > 下午5:49寫道:
> > >
> > > On Wed, 7 Aug 2024, David E. Box wrote:
> > >
> > > > On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:
> > > > > On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> > > > >
> > > > > > David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日
> > > > > > 週二 上午4:26寫道:
> > > > > > >
> > > > > > > Hi Jian-Hong,
> > > > > > >
> > > > > > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > > > > > > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五
> > > > > > > > 下午4:04寫道:
> > > > > > > > >
> > > > > > > > > Currently, when enable link's L1.2 features with
> > > > > > > > > __pci_enable_link_state(),
> > > > > > > > > it configs the link directly without ensuring related
> > > > > > > > > L1.2 parameters, such
> > > > > > > > > as T_POWER_ON, Common_Mode_Restore_Time, and
> > > > > > > > > LTR_L1.2_THRESHOLD have been
> > > > > > > > > programmed.
> > > > > > > > >
> > > > > > > > > This leads the link's L1.2 between PCIe Root Port and
> > > > > > > > > child device gets
> > > > > > > > > wrong configs when a caller tries to enabled it.
> > > > > > > > >
> > > > > > > > > Here is a failed example on ASUS B1400CEAE with enabled
> > > > > > > > > VMD:
> > > > > > > > >
> > > > > > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen
> > > > > > > > > Core Processor PCIe
> > > > > > > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > > > > > > >     ...
> > > > > > > > >     Capabilities: [200 v1] L1 PM Substates
> > > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+
> > > > > > > > > ASPM_L1.1+ L1_PM_Substates+
> > > > > > > > >                   PortCommonModeRestoreTime=45us
> > > > > > > > > PortTPowerOnTime=50us L1SubCtl1: PCI-PM_L1.2-
> > > > > > > > > PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- T_CommonMode=45us
> > > > > > > > > LTR1.2_Threshold=101376ns L1SubCtl2: T_PwrOn=50us
> > > > > > > > >
> > > > > > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk
> > > > > > > > > Corp WD Blue SN550
> > > > > > > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > > > > > >     ...
> > > > > > > > >     Capabilities: [900 v1] L1 PM Substates
> > > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+
> > > > > > > > > ASPM_L1.1- L1_PM_Substates+
> > > > > > > > >                   PortCommonModeRestoreTime=32us
> > > > > > > > > PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2-
> > > > > > > > > PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- T_CommonMode=0us
> > > > > > > > > LTR1.2_Threshold=0ns L1SubCtl2: T_PwrOn=10us
> > > > > > > > >
> > > > > > > > > According to "PCIe r6.0, sec 5.5.4", before enabling
> > > > > > > > > ASPM L1.2 on the PCIe
> > > > > > > > > Root Port and the child NVMe, they should be programmed
> > > > > > > > > with the same LTR1.2_Threshold value. However, they
> > > > > > > > > have different values in this case.
> > > > > > > > >
> > > > > > > > > Invoke aspm_calc_l12_info() to program the L1.2
> > > > > > > > > parameters properly before
> > > > > > > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > > > > > > __pci_enable_link_state().
> > > > > > > > >
> > > > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > > > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > > > > > > ---
> > > > > > > > > v2:
> > > > > > > > > - Prepare the PCIe LTR parameters before enable L1
> > > > > > > > > Substates
> > > > > > > > >
> > > > > > > > > v3:
> > > > > > > > > - Only enable supported features for the L1 Substates
> > > > > > > > > part
> > > > > > > > >
> > > > > > > > > v4:
> > > > > > > > > - Focus on fixing L1.2 parameters, instead of
> > > > > > > > > re-initializing whole L1SS
> > > > > > > > >
> > > > > > > > > v5:
> > > > > > > > > - Fix typo and commit message
> > > > > > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM:
> > > > > > > > > Introduce aspm_get_l1ss_cap()"
> > > > > > > > >
> > > > > > > > > v6:
> > > > > > > > > - Skipped
> > > > > > > > >
> > > > > > > > > v7:
> > > > > > > > > - Pick back and rebase on the new version kernel
> > > > > > > > > - Drop the link state flag check. And, always config
> > > > > > > > > link state's timing
> > > > > > > > >   parameters
> > > > > > > > >
> > > > > > > > > v8:
> > > > > > > > > - Because pcie_aspm_get_link() might return the link as
> > > > > > > > > NULL, move getting the link's parent and child devices
> > > > > > > > > after check the link is not NULL. This avoids NULL
> > > > > > > > > memory access.
> > > > > > > > >
> > > > > > > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/pci/pcie/aspm.c
> > > > > > > > > b/drivers/pci/pcie/aspm.c index
> > > > > > > > > 5db1044c9895..55ff1d26fcea 100644 ---
> > > > > > > > > a/drivers/pci/pcie/aspm.c +++ b/drivers/pci/pcie/aspm.c
> > > > > > > > > @@ -1411,9 +1411,15 @@
> > > > > > > > > EXPORT_SYMBOL(pci_disable_link_state); static int
> > > > > > > > > __pci_enable_link_state(struct pci_dev *pdev, int
> > > > > > > > > state, bool locked)
> > > > > > > > >  {
> > > > > > > > >         struct pcie_link_state *link =
> > > > > > > > > pcie_aspm_get_link(pdev);
> > > > > > > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > > > > > +       struct pci_dev *parent, *child;
> > > > > > > > >
> > > > > > > > >         if (!link)
> > > > > > > > >                 return -EINVAL;
> > > > > > > > > +
> > > > > > > > > +       parent = link->pdev;
> > > > > > > > > +       child = link->downstream;
> > > > > > > > > +
> > > > > > > > >         /*
> > > > > > > > >          * A driver requested that ASPM be enabled on
> > > > > > > > > this device, but
> > > > > > > > >          * if we don't have permission to manage ASPM
> > > > > > > > > (e.g., on ACPI @@ -1428,6 +1434,15 @@ static int
> > > > > > > > > __pci_enable_link_state(struct pci_dev
> > > > > > > > > *pdev, int state, bool locked)
> > > > > > > > >         if (!locked)
> > > > > > > > >                 down_read(&pci_bus_sem);
> > > > > > > > >         mutex_lock(&aspm_lock);
> > > > > > > > > +       /*
> > > > > > > > > +        * Ensure L1.2 parameters:
> > > > > > > > > Common_Mode_Restore_Times, T_POWER_ON and
> > > > > > > > > +        * LTR_L1.2_THRESHOLD are programmed properly
> > > > > > > > > before enable bits for
> > > > > > > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > > > > > +        */
> > > > > > > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > > > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > > > > > +       aspm_calc_l12_info(link, parent_l1ss_cap,
> > > > > > > > > child_l1ss_cap);
> > > > > > >
> > > > > > > I still don't think this is the place to recalculate the
> > > > > > > L1.2 parameters especially when know the calculation was
> > > > > > > done but was cleared by pci_bus_reset(). Can't we just do a
> > > > > > > pci_save/restore_state() before/after pci_bus_reset() in
> > > > > > > vmd.c?
> > > > > >
> > > > > > I have not thought pci_save/restore_state() around
> > > > > > pci_bus_reset() before.  It is an interesting direction.
> > > > > >
> > > > > > So, I prepare modification below for test.  Include "[PATCH
> > > > > > v8 1/4] PCI: vmd: Set PCI devices to D0 before enable PCI
> > > > > > PM's L1 substates", too.  Then, both the PCIe bridge and the
> > > > > > PCIe device have the same LTR_L1.2_THRESHOLD 101376ns as
> > > > > > expected.
> > > > > >
> > > > > > diff --git a/drivers/pci/controller/vmd.c
> > > > > > b/drivers/pci/controller/vmd.c index
> > > > > > bbf4a47e7b31..6b8dd4f30127 100644 ---
> > > > > > a/drivers/pci/controller/vmd.c +++
> > > > > > b/drivers/pci/controller/vmd.c @@ -727,6 +727,18 @@ static
> > > > > > void vmd_copy_host_bridge_flags(struct pci_host_bridge
> > > > > > *root_bridge, vmd_bridge->native_dpc =
> > > > > > root_bridge->native_dpc; }
> > > > > >
> > > > > > +static int vmd_pci_save_state(struct pci_dev *pdev, void
> > > > > > *userdata) +{
> > > > > > +       pci_save_state(pdev);
> > > > > > +       return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int vmd_pci_restore_state(struct pci_dev *pdev, void
> > > > > > *userdata) +{
> > > > > > +       pci_restore_state(pdev);
> > > > > > +       return 0;
> > > > > > +}
> > > > > > +
> > > > > >  /*
> > > > > >   * Enable ASPM and LTR settings on devices that aren't
> > > > > > configured by BIOS. */
> > > > > > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct
> > > > > > vmd_dev *vmd, unsigned long features)
> > > > > >         pci_scan_child_bus(vmd->bus);
> > > > > >         vmd_domain_reset(vmd);
> > > > > >
> > > > > > +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> > > > > >         /* When Intel VMD is enabled, the OS does not
> > > > > > discover the Root Ports
> > > > > >          * owned by Intel VMD within the MMCFG space.
> > > > > > pci_reset_bus() applies
> > > > > >          * a reset to the parent of the PCI device supplied
> > > > > > as argument. This
> > > > > > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct
> > > > > > vmd_dev *vmd, unsigned long features)
> > > > > >                         break;
> > > > > >                 }
> > > > > >         }
> > > > > > +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);
> > > > >
> > > > > Why not call pci_reset_bus() (or __pci_reset_bus()) then in
> > > > > vmd_enable_domain() which preserves state unlike
> > > > > pci_reset_bus()?
> > > > >
> > > > > (Don't tell me naming of these functions is a horrible mess.
> > > > > :-/)
> > > >
> > > > Hmm. So this *is* calling pci_reset_bus().
> > >
> > > Yeah, I managed to get confused by the names myself, I somehow
> > > ended up thinking it calls pci_bus_reset() which is not correct...
> > >
> > > > L1.2 configuration has specific
> > > > ordering requirements for changes to parent & child devices.
> > > > Could be why it's not getting restored properly.
> > >
> > > Indeed, it has to be something else since the patch above doesn't
> > > even restore anything because dev->state_saved should get set to
> > > false by the first pci_restore_state() called from
> > > __pci_reset_bus() -> pci_bus_restore_locked() -> pci_dev_restore(),
> > > I think!?
> >
> > Inspired by Ilpo's comment.  I add some debug messages based on
> > linux-next's tag 'next-20240809' to understand the code path of
> > pci_reset_bus():
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index ffaaca0978cb..3ee71374f1de 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5133,8 +5133,10 @@ static void pci_dev_save_and_disable(struct
> > pci_dev *dev)
> >          * races with ->remove() by the device lock, which must be
> > held by
> >          * the caller.
> >          */
> > -       if (err_handler && err_handler->reset_prepare)
> > +       if (err_handler && err_handler->reset_prepare) {
> > +               pci_info(dev, "%s: %pF\n", __func__,
> > err_handler->reset_prepare);
> >                 err_handler->reset_prepare(dev);
> > +       }
> >
> >         /*
> >          * Wake-up device prior to save.  PM registers default to D0
> > after @@ -5144,6 +5146,7 @@ static void
> > pci_dev_save_and_disable(struct pci_dev *dev)
> > pci_set_power_state(dev, PCI_D0);
> >
> >         pci_save_state(dev);
> > +       pci_info(dev, "%s: PCI state_saved is %s\n", __func__,
> > dev->state_saved ? "true" : "false");
> >         /*
> >          * Disable the device by clearing the Command register,
> > except for
> >          * INTx-disable which is set.  This not only disables MMIO
> > and I/O port @@ -5655,6 +5658,10 @@ static void
> > pci_bus_save_and_disable_locked(struct pci_bus *bus)
> >         struct pci_dev *dev;
> >
> >         list_for_each_entry(dev, &bus->devices, bus_list) {
> > +               pci_info(dev, "%s: PCI state_saved is %s, and %s
> > subordinate\n",
> > +                        __func__,
> > +                        dev->state_saved ? "true" : "false",
> > +                        dev->subordinate ? "has" : "does not have");
> >                 pci_dev_save_and_disable(dev);
> >                 if (dev->subordinate)
> >                         pci_bus_save_and_disable_locked(dev->subordinate);
> > @@ -5671,6 +5678,10 @@ static void pci_bus_restore_locked(struct
> > pci_bus *bus) struct pci_dev *dev;
> >
> >         list_for_each_entry(dev, &bus->devices, bus_list) {
> > +               pci_info(dev, "%s: PCI state_saved is %s, and %s
> > subordinate\n",
> > +                        __func__,
> > +                        dev->state_saved ? "true" : "false",
> > +                        dev->subordinate ? "has" : "does not have");
> >                 pci_dev_restore(dev);
> >                 if (dev->subordinate)
> >                         pci_bus_restore_locked(dev->subordinate);
> > @@ -5786,8 +5797,10 @@ static int pci_bus_reset(struct pci_bus *bus,
> > bool probe) if (!bus->self || !pci_bus_resettable(bus))
> >                 return -ENOTTY;
> >
> > -       if (probe)
> > +       if (probe) {
> > +               pci_info(bus->self, "%s: probe is true.  So return 0
> > directly", __func__);
> >                 return 0;
> > +       }
> >
> >         pci_bus_lock(bus);
> >
> > @@ -5858,10 +5871,12 @@ static int __pci_reset_bus(struct pci_bus
> > *bus) int rc;
> >
> >         rc = pci_bus_reset(bus, PCI_RESET_PROBE);
> > +       pci_info(bus->self, "%s: pci_bus_reset() returns %d\n",
> > __func__, rc); if (rc)
> >                 return rc;
> >
> >         if (pci_bus_trylock(bus)) {
> > +               pci_info(bus->self, "%s: pci_bus_trylock() returns
> > true\n", __func__);
> >                 pci_bus_save_and_disable_locked(bus);
> >                 might_sleep();
> >                 rc = pci_bridge_secondary_bus_reset(bus->self);
> > @@ -5881,6 +5896,7 @@ static int __pci_reset_bus(struct pci_bus *bus)
> >   */
> >  int pci_reset_bus(struct pci_dev *pdev)
> >  {
> > +       pci_info(pdev, "%s: %s", __func__,
> > !pci_probe_reset_slot(pdev->slot) ? "true" : "false");
> >         return (!pci_probe_reset_slot(pdev->slot)) ?
> >             __pci_reset_slot(pdev->slot) : __pci_reset_bus(pdev->bus);
> >  }
> >
> > And, have the information of VMD PCIe devices with the built kernel:
> >
> > 10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core
> > Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal
> > decode])
> >   ...
> >   Capabilities: [200 v1] L1 PM Substates
> >     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > L1_PM_Substates+ PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> >     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> >       T_CommonMode=0us LTR1.2_Threshold=0ns
> >     L1SubCtl2: T_PwrOn=0us
> >
> > 10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD
> > Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
> >   ...
> >   Capabilities: [900 v1] L1 PM Substates
> >     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > L1_PM_Substates+ PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> >     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> >       T_CommonMode=0us LTR1.2_Threshold=101376ns
> >     L1SubCtl2: T_PwrOn=50us
> >
> > We can see the NVMe has expected LTR1.2_Threshold=101376ns, but the
> > PCIe bridge has LTR1.2_Threshold=0ns.
> >
> > Then, check the dmesg.  I notice the debug messages:
> >
> > pci 10000:e0:06.0: PCI bridge to [bus e1]
> > pci 10000:e0:06.0: Primary bus is hard wired to 0
> > pci 10000:e1:00.0: pci_reset_bus: false
> > pci 10000:e0:06.0: pci_bus_reset: probe is true.  So return 0 directly
> > pci 10000:e0:06.0: __pci_reset_bus: pci_bus_reset() returns 0
> > pci 10000:e0:06.0: __pci_reset_bus: pci_bus_trylock() returns true
> > pci 10000:e1:00.0: pci_bus_save_and_disable_locked: PCI state_saved is
> > false, and does not have subordinate
> > pci 10000:e1:00.0: pci_dev_save_and_disable: PCI state_saved is true
> > Freeing initrd memory: 75236K
> > pci 10000:e1:00.0: pci_bus_restore_locked: PCI state_saved is true,
> > and does not have subordinate
> >
> > So, the code path is:
> >
> > vmd_enable_domain()
> >   pci_reset_bus()
> >     __pci_reset_bus()
> >       pci_bus_reset()
> >         pci_bus_save_and_disable_locked()
> >           pci_dev_save_and_disable()
> >         pci_bus_restore_locked()
> >           pci_dev_restore()
> >
> > And, from the debug messages, I learned only NVMe 10000:e1:00.0 does
> > pci_save/restore_state.  But, the PCIe bridge 10000:e0:06.0 does not.
> > So, PCIe bridge 10000:e0:06.0 does not restore state correctly.
> >
> > Besides, it is NVMe 10000:e1:00.0's bus [e1] been reset, not the VMD's
> > bus in vmd_enable_domain().
> > * Bus "e1" has only NVMe 10000:e1:00.0
> > * VMD's bus in vmd_enable_domain() has PCIe bridge 10000:e0:06.0, NVMe
> > 10000:e1:00.0 and SATA Controller 10000:e0:17.0.
> >
> > Here is the PCI tree:
> >
> > -+-[0000:00]-+-00.0  Intel Corporation Device 9a04
> >  |           +-02.0  Intel Corporation Tiger Lake-LP GT2 [UHD
> > Graphics G4] |           +-04.0  Intel Corporation TigerLake-LP
> > Dynamic Tuning Processor Participant
> >  |           +-06.0  Intel Corporation RST VMD Managed Controller
> >  |           +-07.0-[01-2b]--
> >  |           +-08.0  Intel Corporation GNA Scoring Accelerator module
> >  |           +-0a.0  Intel Corporation Tigerlake Telemetry Aggregator
> > Driver |           +-0d.0  Intel Corporation Tiger Lake-LP
> > Thunderbolt 4 USB Controller
> >  |           +-0d.2  Intel Corporation Tiger Lake-LP Thunderbolt 4
> > NHI #0 |           +-0e.0  Intel Corporation Volume Management Device
> > NVMe RAID Controller
> >  |           +-14.0  Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1
> > xHCI Host Controller
> >  |           +-14.2  Intel Corporation Tiger Lake-LP Shared SRAM
> >  |           +-14.3  Intel Corporation Wi-Fi 6 AX201
> >  |           +-15.0  Intel Corporation Tiger Lake-LP Serial IO I2C
> > Controller #0 |           +-15.1  Intel Corporation Tiger Lake-LP
> > Serial IO I2C Controller #1 |           +-16.0  Intel Corporation
> > Tiger Lake-LP Management Engine Interface |           +-17.0  Intel
> > Corporation RST VMD Managed Controller |           +-1f.0  Intel
> > Corporation Tiger Lake-LP LPC Controller |           +-1f.3  Intel
> > Corporation Tiger Lake-LP Smart Sound Technology Audio Controller
> >  |           +-1f.4  Intel Corporation Tiger Lake-LP SMBus Controller
> >  |           +-1f.5  Intel Corporation Tiger Lake-LP SPI Controller
> >  |           \-1f.6  Intel Corporation Ethernet Connection (13) I219-V
> >  \-[10000:e0]-+-06.0-[e1]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD
> >               \-17.0  Intel Corporation Tiger Lake-LP SATA Controller
> >
> > According the findings above, to ensure the devices on the VMD bus
> > have correctly states, seems pci_save_state() all the devices before
> > pci_reset_bus(), and pci_restore_state() all the devices after
> > pci_reset_bus() is the correct answer.
> What happens if you call pci_reset_bus with PCIe bridge 10000:e0:06.0
> instead of NVMe 10000:e1:00.0? I believe the current implementation in
> vmd_enable_domain finds first child device on each rootport and calls
> pci_reset_bus with NVMe.

If call pci_reset_bus with PCIe bridge 10000:e0:06.0 instead of NVMe
10000:e1:00.0, it will show warning message "pci 10000:e0:06.0 can't
reset device: -25".

Jian-Hong Pan
Nirmal Patel Sept. 20, 2024, 4:03 p.m. UTC | #14
On Mon, 12 Aug 2024 16:18:22 +0800
Jian-Hong Pan <jhp@endlessos.org> wrote:

> Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年8月8日 週四
> 下午5:49寫道:
> >
> > On Wed, 7 Aug 2024, David E. Box wrote:
> >  
> > > On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:  
> > > > On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> > > >  
> > > > > David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日
> > > > > 週二 上午4:26寫道:  
> > > > > >
> > > > > > Hi Jian-Hong,
> > > > > >
> > > > > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:  
> > > > > > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五
> > > > > > > 下午4:04寫道:  
> > > > > > > >
> > > > > > > > Currently, when enable link's L1.2 features with
> > > > > > > > __pci_enable_link_state(),
> > > > > > > > it configs the link directly without ensuring related
> > > > > > > > L1.2 parameters, such
> > > > > > > > as T_POWER_ON, Common_Mode_Restore_Time, and
> > > > > > > > LTR_L1.2_THRESHOLD have been
> > > > > > > > programmed.
> > > > > > > >
> > > > > > > > This leads the link's L1.2 between PCIe Root Port and
> > > > > > > > child device gets
> > > > > > > > wrong configs when a caller tries to enabled it.
> > > > > > > >
> > > > > > > > Here is a failed example on ASUS B1400CEAE with enabled
> > > > > > > > VMD:
> > > > > > > >
> > > > > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen
> > > > > > > > Core Processor PCIe
> > > > > > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > > > > > >     ...
> > > > > > > >     Capabilities: [200 v1] L1 PM Substates
> > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+
> > > > > > > > ASPM_L1.1+ L1_PM_Substates+
> > > > > > > >                   PortCommonModeRestoreTime=45us
> > > > > > > > PortTPowerOnTime=50us L1SubCtl1: PCI-PM_L1.2-
> > > > > > > > PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- T_CommonMode=45us
> > > > > > > > LTR1.2_Threshold=101376ns L1SubCtl2: T_PwrOn=50us
> > > > > > > >
> > > > > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk
> > > > > > > > Corp WD Blue SN550
> > > > > > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > > > > >     ...
> > > > > > > >     Capabilities: [900 v1] L1 PM Substates
> > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+
> > > > > > > > ASPM_L1.1- L1_PM_Substates+
> > > > > > > >                   PortCommonModeRestoreTime=32us
> > > > > > > > PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2-
> > > > > > > > PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- T_CommonMode=0us
> > > > > > > > LTR1.2_Threshold=0ns L1SubCtl2: T_PwrOn=10us
> > > > > > > >
> > > > > > > > According to "PCIe r6.0, sec 5.5.4", before enabling
> > > > > > > > ASPM L1.2 on the PCIe
> > > > > > > > Root Port and the child NVMe, they should be programmed
> > > > > > > > with the same LTR1.2_Threshold value. However, they
> > > > > > > > have different values in this case.
> > > > > > > >
> > > > > > > > Invoke aspm_calc_l12_info() to program the L1.2
> > > > > > > > parameters properly before
> > > > > > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > > > > > __pci_enable_link_state().
> > > > > > > >
> > > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > > > > > ---
> > > > > > > > v2:
> > > > > > > > - Prepare the PCIe LTR parameters before enable L1
> > > > > > > > Substates
> > > > > > > >
> > > > > > > > v3:
> > > > > > > > - Only enable supported features for the L1 Substates
> > > > > > > > part
> > > > > > > >
> > > > > > > > v4:
> > > > > > > > - Focus on fixing L1.2 parameters, instead of
> > > > > > > > re-initializing whole L1SS
> > > > > > > >
> > > > > > > > v5:
> > > > > > > > - Fix typo and commit message
> > > > > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM:
> > > > > > > > Introduce aspm_get_l1ss_cap()"
> > > > > > > >
> > > > > > > > v6:
> > > > > > > > - Skipped
> > > > > > > >
> > > > > > > > v7:
> > > > > > > > - Pick back and rebase on the new version kernel
> > > > > > > > - Drop the link state flag check. And, always config
> > > > > > > > link state's timing
> > > > > > > >   parameters
> > > > > > > >
> > > > > > > > v8:
> > > > > > > > - Because pcie_aspm_get_link() might return the link as
> > > > > > > > NULL, move getting the link's parent and child devices
> > > > > > > > after check the link is not NULL. This avoids NULL
> > > > > > > > memory access.
> > > > > > > >
> > > > > > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/drivers/pci/pcie/aspm.c
> > > > > > > > b/drivers/pci/pcie/aspm.c index
> > > > > > > > 5db1044c9895..55ff1d26fcea 100644 ---
> > > > > > > > a/drivers/pci/pcie/aspm.c +++ b/drivers/pci/pcie/aspm.c
> > > > > > > > @@ -1411,9 +1411,15 @@
> > > > > > > > EXPORT_SYMBOL(pci_disable_link_state); static int
> > > > > > > > __pci_enable_link_state(struct pci_dev *pdev, int
> > > > > > > > state, bool locked)
> > > > > > > >  {
> > > > > > > >         struct pcie_link_state *link =
> > > > > > > > pcie_aspm_get_link(pdev);
> > > > > > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > > > > +       struct pci_dev *parent, *child;
> > > > > > > >
> > > > > > > >         if (!link)
> > > > > > > >                 return -EINVAL;
> > > > > > > > +
> > > > > > > > +       parent = link->pdev;
> > > > > > > > +       child = link->downstream;
> > > > > > > > +
> > > > > > > >         /*
> > > > > > > >          * A driver requested that ASPM be enabled on
> > > > > > > > this device, but
> > > > > > > >          * if we don't have permission to manage ASPM
> > > > > > > > (e.g., on ACPI @@ -1428,6 +1434,15 @@ static int
> > > > > > > > __pci_enable_link_state(struct pci_dev
> > > > > > > > *pdev, int state, bool locked)
> > > > > > > >         if (!locked)
> > > > > > > >                 down_read(&pci_bus_sem);
> > > > > > > >         mutex_lock(&aspm_lock);
> > > > > > > > +       /*
> > > > > > > > +        * Ensure L1.2 parameters:
> > > > > > > > Common_Mode_Restore_Times, T_POWER_ON and
> > > > > > > > +        * LTR_L1.2_THRESHOLD are programmed properly
> > > > > > > > before enable bits for
> > > > > > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > > > > +        */
> > > > > > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > > > > +       aspm_calc_l12_info(link, parent_l1ss_cap,
> > > > > > > > child_l1ss_cap);  
> > > > > >
> > > > > > I still don't think this is the place to recalculate the
> > > > > > L1.2 parameters especially when know the calculation was
> > > > > > done but was cleared by pci_bus_reset(). Can't we just do a
> > > > > > pci_save/restore_state() before/after pci_bus_reset() in
> > > > > > vmd.c?  
> > > > >
> > > > > I have not thought pci_save/restore_state() around
> > > > > pci_bus_reset() before.  It is an interesting direction.
> > > > >
> > > > > So, I prepare modification below for test.  Include "[PATCH
> > > > > v8 1/4] PCI: vmd: Set PCI devices to D0 before enable PCI
> > > > > PM's L1 substates", too.  Then, both the PCIe bridge and the
> > > > > PCIe device have the same LTR_L1.2_THRESHOLD 101376ns as
> > > > > expected.
> > > > >
> > > > > diff --git a/drivers/pci/controller/vmd.c
> > > > > b/drivers/pci/controller/vmd.c index
> > > > > bbf4a47e7b31..6b8dd4f30127 100644 ---
> > > > > a/drivers/pci/controller/vmd.c +++
> > > > > b/drivers/pci/controller/vmd.c @@ -727,6 +727,18 @@ static
> > > > > void vmd_copy_host_bridge_flags(struct pci_host_bridge
> > > > > *root_bridge, vmd_bridge->native_dpc =
> > > > > root_bridge->native_dpc; }
> > > > >
> > > > > +static int vmd_pci_save_state(struct pci_dev *pdev, void
> > > > > *userdata) +{
> > > > > +       pci_save_state(pdev);
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > > +static int vmd_pci_restore_state(struct pci_dev *pdev, void
> > > > > *userdata) +{
> > > > > +       pci_restore_state(pdev);
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Enable ASPM and LTR settings on devices that aren't
> > > > > configured by BIOS. */
> > > > > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct
> > > > > vmd_dev *vmd, unsigned long features)
> > > > >         pci_scan_child_bus(vmd->bus);
> > > > >         vmd_domain_reset(vmd);
> > > > >
> > > > > +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> > > > >         /* When Intel VMD is enabled, the OS does not
> > > > > discover the Root Ports
> > > > >          * owned by Intel VMD within the MMCFG space.
> > > > > pci_reset_bus() applies
> > > > >          * a reset to the parent of the PCI device supplied
> > > > > as argument. This
> > > > > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct
> > > > > vmd_dev *vmd, unsigned long features)
> > > > >                         break;
> > > > >                 }
> > > > >         }
> > > > > +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);  
> > > >
> > > > Why not call pci_reset_bus() (or __pci_reset_bus()) then in
> > > > vmd_enable_domain() which preserves state unlike
> > > > pci_reset_bus()?
> > > >
> > > > (Don't tell me naming of these functions is a horrible mess.
> > > > :-/)  
> > >
> > > Hmm. So this *is* calling pci_reset_bus().  
> >
> > Yeah, I managed to get confused by the names myself, I somehow
> > ended up thinking it calls pci_bus_reset() which is not correct...
> >  
> > > L1.2 configuration has specific
> > > ordering requirements for changes to parent & child devices.
> > > Could be why it's not getting restored properly.  
> >
> > Indeed, it has to be something else since the patch above doesn't
> > even restore anything because dev->state_saved should get set to
> > false by the first pci_restore_state() called from
> > __pci_reset_bus() -> pci_bus_restore_locked() -> pci_dev_restore(),
> > I think!?  
> 
> Inspired by Ilpo's comment.  I add some debug messages based on
> linux-next's tag 'next-20240809' to understand the code path of
> pci_reset_bus():
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index ffaaca0978cb..3ee71374f1de 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5133,8 +5133,10 @@ static void pci_dev_save_and_disable(struct
> pci_dev *dev)
>          * races with ->remove() by the device lock, which must be
> held by
>          * the caller.
>          */
> -       if (err_handler && err_handler->reset_prepare)
> +       if (err_handler && err_handler->reset_prepare) {
> +               pci_info(dev, "%s: %pF\n", __func__,
> err_handler->reset_prepare);
>                 err_handler->reset_prepare(dev);
> +       }
> 
>         /*
>          * Wake-up device prior to save.  PM registers default to D0
> after @@ -5144,6 +5146,7 @@ static void
> pci_dev_save_and_disable(struct pci_dev *dev)
> pci_set_power_state(dev, PCI_D0);
> 
>         pci_save_state(dev);
> +       pci_info(dev, "%s: PCI state_saved is %s\n", __func__,
> dev->state_saved ? "true" : "false");
>         /*
>          * Disable the device by clearing the Command register,
> except for
>          * INTx-disable which is set.  This not only disables MMIO
> and I/O port @@ -5655,6 +5658,10 @@ static void
> pci_bus_save_and_disable_locked(struct pci_bus *bus)
>         struct pci_dev *dev;
> 
>         list_for_each_entry(dev, &bus->devices, bus_list) {
> +               pci_info(dev, "%s: PCI state_saved is %s, and %s
> subordinate\n",
> +                        __func__,
> +                        dev->state_saved ? "true" : "false",
> +                        dev->subordinate ? "has" : "does not have");
>                 pci_dev_save_and_disable(dev);
>                 if (dev->subordinate)
>                         pci_bus_save_and_disable_locked(dev->subordinate);
> @@ -5671,6 +5678,10 @@ static void pci_bus_restore_locked(struct
> pci_bus *bus) struct pci_dev *dev;
> 
>         list_for_each_entry(dev, &bus->devices, bus_list) {
> +               pci_info(dev, "%s: PCI state_saved is %s, and %s
> subordinate\n",
> +                        __func__,
> +                        dev->state_saved ? "true" : "false",
> +                        dev->subordinate ? "has" : "does not have");
>                 pci_dev_restore(dev);
>                 if (dev->subordinate)
>                         pci_bus_restore_locked(dev->subordinate);
> @@ -5786,8 +5797,10 @@ static int pci_bus_reset(struct pci_bus *bus,
> bool probe) if (!bus->self || !pci_bus_resettable(bus))
>                 return -ENOTTY;
> 
> -       if (probe)
> +       if (probe) {
> +               pci_info(bus->self, "%s: probe is true.  So return 0
> directly", __func__);
>                 return 0;
> +       }
> 
>         pci_bus_lock(bus);
> 
> @@ -5858,10 +5871,12 @@ static int __pci_reset_bus(struct pci_bus
> *bus) int rc;
> 
>         rc = pci_bus_reset(bus, PCI_RESET_PROBE);
> +       pci_info(bus->self, "%s: pci_bus_reset() returns %d\n",
> __func__, rc); if (rc)
>                 return rc;
> 
>         if (pci_bus_trylock(bus)) {
> +               pci_info(bus->self, "%s: pci_bus_trylock() returns
> true\n", __func__);
>                 pci_bus_save_and_disable_locked(bus);
>                 might_sleep();
>                 rc = pci_bridge_secondary_bus_reset(bus->self);
> @@ -5881,6 +5896,7 @@ static int __pci_reset_bus(struct pci_bus *bus)
>   */
>  int pci_reset_bus(struct pci_dev *pdev)
>  {
> +       pci_info(pdev, "%s: %s", __func__,
> !pci_probe_reset_slot(pdev->slot) ? "true" : "false");
>         return (!pci_probe_reset_slot(pdev->slot)) ?
>             __pci_reset_slot(pdev->slot) : __pci_reset_bus(pdev->bus);
>  }
> 
> And, have the information of VMD PCIe devices with the built kernel:
> 
> 10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core
> Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal
> decode])
>   ...
>   Capabilities: [200 v1] L1 PM Substates
>     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> L1_PM_Substates+ PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
>     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
>       T_CommonMode=0us LTR1.2_Threshold=0ns
>     L1SubCtl2: T_PwrOn=0us
> 
> 10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD
> Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
>   ...
>   Capabilities: [900 v1] L1 PM Substates
>     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> L1_PM_Substates+ PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
>     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
>       T_CommonMode=0us LTR1.2_Threshold=101376ns
>     L1SubCtl2: T_PwrOn=50us
> 
> We can see the NVMe has expected LTR1.2_Threshold=101376ns, but the
> PCIe bridge has LTR1.2_Threshold=0ns.
> 
> Then, check the dmesg.  I notice the debug messages:
> 
> pci 10000:e0:06.0: PCI bridge to [bus e1]
> pci 10000:e0:06.0: Primary bus is hard wired to 0
> pci 10000:e1:00.0: pci_reset_bus: false
> pci 10000:e0:06.0: pci_bus_reset: probe is true.  So return 0 directly
> pci 10000:e0:06.0: __pci_reset_bus: pci_bus_reset() returns 0
> pci 10000:e0:06.0: __pci_reset_bus: pci_bus_trylock() returns true
> pci 10000:e1:00.0: pci_bus_save_and_disable_locked: PCI state_saved is
> false, and does not have subordinate
> pci 10000:e1:00.0: pci_dev_save_and_disable: PCI state_saved is true
> Freeing initrd memory: 75236K
> pci 10000:e1:00.0: pci_bus_restore_locked: PCI state_saved is true,
> and does not have subordinate
> 
> So, the code path is:
> 
> vmd_enable_domain()
>   pci_reset_bus()
>     __pci_reset_bus()
>       pci_bus_reset()
>         pci_bus_save_and_disable_locked()
>           pci_dev_save_and_disable()
>         pci_bus_restore_locked()
>           pci_dev_restore()
> 
> And, from the debug messages, I learned only NVMe 10000:e1:00.0 does
> pci_save/restore_state.  But, the PCIe bridge 10000:e0:06.0 does not.
> So, PCIe bridge 10000:e0:06.0 does not restore state correctly.
> 
> Besides, it is NVMe 10000:e1:00.0's bus [e1] been reset, not the VMD's
> bus in vmd_enable_domain().
> * Bus "e1" has only NVMe 10000:e1:00.0
> * VMD's bus in vmd_enable_domain() has PCIe bridge 10000:e0:06.0, NVMe
> 10000:e1:00.0 and SATA Controller 10000:e0:17.0.
> 
> Here is the PCI tree:
> 
> -+-[0000:00]-+-00.0  Intel Corporation Device 9a04
>  |           +-02.0  Intel Corporation Tiger Lake-LP GT2 [UHD
> Graphics G4] |           +-04.0  Intel Corporation TigerLake-LP
> Dynamic Tuning Processor Participant
>  |           +-06.0  Intel Corporation RST VMD Managed Controller
>  |           +-07.0-[01-2b]--
>  |           +-08.0  Intel Corporation GNA Scoring Accelerator module
>  |           +-0a.0  Intel Corporation Tigerlake Telemetry Aggregator
> Driver |           +-0d.0  Intel Corporation Tiger Lake-LP
> Thunderbolt 4 USB Controller
>  |           +-0d.2  Intel Corporation Tiger Lake-LP Thunderbolt 4
> NHI #0 |           +-0e.0  Intel Corporation Volume Management Device
> NVMe RAID Controller
>  |           +-14.0  Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1
> xHCI Host Controller
>  |           +-14.2  Intel Corporation Tiger Lake-LP Shared SRAM
>  |           +-14.3  Intel Corporation Wi-Fi 6 AX201
>  |           +-15.0  Intel Corporation Tiger Lake-LP Serial IO I2C
> Controller #0 |           +-15.1  Intel Corporation Tiger Lake-LP
> Serial IO I2C Controller #1 |           +-16.0  Intel Corporation
> Tiger Lake-LP Management Engine Interface |           +-17.0  Intel
> Corporation RST VMD Managed Controller |           +-1f.0  Intel
> Corporation Tiger Lake-LP LPC Controller |           +-1f.3  Intel
> Corporation Tiger Lake-LP Smart Sound Technology Audio Controller
>  |           +-1f.4  Intel Corporation Tiger Lake-LP SMBus Controller
>  |           +-1f.5  Intel Corporation Tiger Lake-LP SPI Controller
>  |           \-1f.6  Intel Corporation Ethernet Connection (13) I219-V
>  \-[10000:e0]-+-06.0-[e1]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD
>               \-17.0  Intel Corporation Tiger Lake-LP SATA Controller
> 
> According the findings above, to ensure the devices on the VMD bus
> have correctly states, seems pci_save_state() all the devices before
> pci_reset_bus(), and pci_restore_state() all the devices after
> pci_reset_bus() is the correct answer.

Did you get a chance to test this theory?

> 
> Jian-Hong Pan
Jian-Hong Pan Sept. 23, 2024, 8:41 a.m. UTC | #15
Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年9月2日 週一 下午11:44寫道:
>
> On Mon, 12 Aug 2024, Jian-Hong Pan wrote:
>
> > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年8月8日 週四 下午5:49寫道:
> > > On Wed, 7 Aug 2024, David E. Box wrote:
> > > > On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:
> > > > > On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> > > > >
> > > > > > David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日 週二 上午4:26寫道:
> > > > > > >
> > > > > > > Hi Jian-Hong,
> > > > > > >
> > > > > > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > > > > > > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> > > > > > > > >
> > > > > > > > > Currently, when enable link's L1.2 features with
> > > > > > > > > __pci_enable_link_state(),
> > > > > > > > > it configs the link directly without ensuring related L1.2 parameters,
> > > > > > > > > such
> > > > > > > > > as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have
> > > > > > > > > been
> > > > > > > > > programmed.
> > > > > > > > >
> > > > > > > > > This leads the link's L1.2 between PCIe Root Port and child device
> > > > > > > > > gets
> > > > > > > > > wrong configs when a caller tries to enabled it.
> > > > > > > > >
> > > > > > > > > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> > > > > > > > >
> > > > > > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor
> > > > > > > > > PCIe
> > > > > > > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > > > > > > >     ...
> > > > > > > > >     Capabilities: [200 v1] L1 PM Substates
> > > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > > > > > > > > L1_PM_Substates+
> > > > > > > > >                   PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > > > > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > >                    T_CommonMode=45us LTR1.2_Threshold=101376ns
> > > > > > > > >         L1SubCtl2: T_PwrOn=50us
> > > > > > > > >
> > > > > > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue
> > > > > > > > > SN550
> > > > > > > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > > > > > >     ...
> > > > > > > > >     Capabilities: [900 v1] L1 PM Substates
> > > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > > L1_PM_Substates+
> > > > > > > > >                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > > > > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > >                    T_CommonMode=0us LTR1.2_Threshold=0ns
> > > > > > > > >         L1SubCtl2: T_PwrOn=10us
> > > > > > > > >
> > > > > > > > > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the
> > > > > > > > > PCIe
> > > > > > > > > Root Port and the child NVMe, they should be programmed with the same
> > > > > > > > > LTR1.2_Threshold value. However, they have different values in this
> > > > > > > > > case.
> > > > > > > > >
> > > > > > > > > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly
> > > > > > > > > before
> > > > > > > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > > > > > > __pci_enable_link_state().
> > > > > > > > >
> > > > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > > > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > > > > > > ---
> > > > > > > > > v2:
> > > > > > > > > - Prepare the PCIe LTR parameters before enable L1 Substates
> > > > > > > > >
> > > > > > > > > v3:
> > > > > > > > > - Only enable supported features for the L1 Substates part
> > > > > > > > >
> > > > > > > > > v4:
> > > > > > > > > - Focus on fixing L1.2 parameters, instead of re-initializing whole
> > > > > > > > > L1SS
> > > > > > > > >
> > > > > > > > > v5:
> > > > > > > > > - Fix typo and commit message
> > > > > > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> > > > > > > > >   aspm_get_l1ss_cap()"
> > > > > > > > >
> > > > > > > > > v6:
> > > > > > > > > - Skipped
> > > > > > > > >
> > > > > > > > > v7:
> > > > > > > > > - Pick back and rebase on the new version kernel
> > > > > > > > > - Drop the link state flag check. And, always config link state's
> > > > > > > > > timing
> > > > > > > > >   parameters
> > > > > > > > >
> > > > > > > > > v8:
> > > > > > > > > - Because pcie_aspm_get_link() might return the link as NULL, move
> > > > > > > > >   getting the link's parent and child devices after check the link is
> > > > > > > > >   not NULL. This avoids NULL memory access.
> > > > > > > > >
> > > > > > > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > > > > index 5db1044c9895..55ff1d26fcea 100644
> > > > > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > > > > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> > > > > > > > >  static int __pci_enable_link_state(struct pci_dev *pdev, int state,
> > > > > > > > > bool
> > > > > > > > > locked)
> > > > > > > > >  {
> > > > > > > > >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > > > > > > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > > > > > +       struct pci_dev *parent, *child;
> > > > > > > > >
> > > > > > > > >         if (!link)
> > > > > > > > >                 return -EINVAL;
> > > > > > > > > +
> > > > > > > > > +       parent = link->pdev;
> > > > > > > > > +       child = link->downstream;
> > > > > > > > > +
> > > > > > > > >         /*
> > > > > > > > >          * A driver requested that ASPM be enabled on this device, but
> > > > > > > > >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > > > > > > > > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct
> > > > > > > > > pci_dev
> > > > > > > > > *pdev, int state, bool locked)
> > > > > > > > >         if (!locked)
> > > > > > > > >                 down_read(&pci_bus_sem);
> > > > > > > > >         mutex_lock(&aspm_lock);
> > > > > > > > > +       /*
> > > > > > > > > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times,
> > > > > > > > > T_POWER_ON and
> > > > > > > > > +        * LTR_L1.2_THRESHOLD are programmed properly before enable
> > > > > > > > > bits for
> > > > > > > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > > > > > +        */
> > > > > > > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > > > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > > > > > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
> > > > > > >
> > > > > > > I still don't think this is the place to recalculate the L1.2 parameters
> > > > > > > especially when know the calculation was done but was cleared by
> > > > > > > pci_bus_reset(). Can't we just do a pci_save/restore_state() before/after
> > > > > > > pci_bus_reset() in vmd.c?
> > > > > >
> > > > > > I have not thought pci_save/restore_state() around pci_bus_reset()
> > > > > > before.  It is an interesting direction.
> > > > > >
> > > > > > So, I prepare modification below for test.  Include "[PATCH v8 1/4]
> > > > > > PCI: vmd: Set PCI devices to D0 before enable PCI PM's L1 substates",
> > > > > > too.  Then, both the PCIe bridge and the PCIe device have the same
> > > > > > LTR_L1.2_THRESHOLD 101376ns as expected.
> > > > > >
> > > > > > diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> > > > > > index bbf4a47e7b31..6b8dd4f30127 100644
> > > > > > --- a/drivers/pci/controller/vmd.c
> > > > > > +++ b/drivers/pci/controller/vmd.c
> > > > > > @@ -727,6 +727,18 @@ static void vmd_copy_host_bridge_flags(struct
> > > > > > pci_host_bridge *root_bridge,
> > > > > >         vmd_bridge->native_dpc = root_bridge->native_dpc;
> > > > > >  }
> > > > > >
> > > > > > +static int vmd_pci_save_state(struct pci_dev *pdev, void *userdata)
> > > > > > +{
> > > > > > +       pci_save_state(pdev);
> > > > > > +       return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int vmd_pci_restore_state(struct pci_dev *pdev, void *userdata)
> > > > > > +{
> > > > > > +       pci_restore_state(pdev);
> > > > > > +       return 0;
> > > > > > +}
> > > > > > +
> > > > > >  /*
> > > > > >   * Enable ASPM and LTR settings on devices that aren't configured by BIOS.
> > > > > >   */
> > > > > > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > > > > unsigned long features)
> > > > > >         pci_scan_child_bus(vmd->bus);
> > > > > >         vmd_domain_reset(vmd);
> > > > > >
> > > > > > +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> > > > > >         /* When Intel VMD is enabled, the OS does not discover the Root
> > > > > > Ports
> > > > > >          * owned by Intel VMD within the MMCFG space. pci_reset_bus()
> > > > > > applies
> > > > > >          * a reset to the parent of the PCI device supplied as argument.
> > > > > > This
> > > > > > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > > > > unsigned long features)
> > > > > >                         break;
> > > > > >                 }
> > > > > >         }
> > > > > > +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);
> > > > >
> > > > > Why not call pci_reset_bus() (or __pci_reset_bus()) then in
> > > > > vmd_enable_domain() which preserves state unlike pci_reset_bus()?
> > > > >
> > > > > (Don't tell me naming of these functions is a horrible mess. :-/)
> > > >
> > > > Hmm. So this *is* calling pci_reset_bus().
> > >
> > > Yeah, I managed to get confused by the names myself, I somehow
> > > ended up thinking it calls pci_bus_reset() which is not correct...
> > >
> > > > L1.2 configuration has specific
> > > > ordering requirements for changes to parent & child devices. Could be why it's
> > > > not getting restored properly.
> > >
> > > Indeed, it has to be something else since the patch above doesn't even
> > > restore anything because dev->state_saved should get set to false by the
> > > first pci_restore_state() called from
> > > __pci_reset_bus() -> pci_bus_restore_locked() -> pci_dev_restore(), I
> > > think!?
> >
> > Inspired by Ilpo's comment.  I add some debug messages based on
> > linux-next's tag 'next-20240809' to understand the code path of
> > pci_reset_bus():
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index ffaaca0978cb..3ee71374f1de 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5133,8 +5133,10 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
> >          * races with ->remove() by the device lock, which must be held by
> >          * the caller.
> >          */
> > -       if (err_handler && err_handler->reset_prepare)
> > +       if (err_handler && err_handler->reset_prepare) {
> > +               pci_info(dev, "%s: %pF\n", __func__,
> > err_handler->reset_prepare);
> >                 err_handler->reset_prepare(dev);
> > +       }
> >
> >         /*
> >          * Wake-up device prior to save.  PM registers default to D0 after
> > @@ -5144,6 +5146,7 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
> >         pci_set_power_state(dev, PCI_D0);
> >
> >         pci_save_state(dev);
> > +       pci_info(dev, "%s: PCI state_saved is %s\n", __func__,
> > dev->state_saved ? "true" : "false");
> >         /*
> >          * Disable the device by clearing the Command register, except for
> >          * INTx-disable which is set.  This not only disables MMIO and I/O port
> > @@ -5655,6 +5658,10 @@ static void
> > pci_bus_save_and_disable_locked(struct pci_bus *bus)
> >         struct pci_dev *dev;
> >
> >         list_for_each_entry(dev, &bus->devices, bus_list) {
> > +               pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
> > +                        __func__,
> > +                        dev->state_saved ? "true" : "false",
> > +                        dev->subordinate ? "has" : "does not have");
> >                 pci_dev_save_and_disable(dev);
> >                 if (dev->subordinate)
> >                         pci_bus_save_and_disable_locked(dev->subordinate);
> > @@ -5671,6 +5678,10 @@ static void pci_bus_restore_locked(struct pci_bus *bus)
> >         struct pci_dev *dev;
> >
> >         list_for_each_entry(dev, &bus->devices, bus_list) {
> > +               pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
> > +                        __func__,
> > +                        dev->state_saved ? "true" : "false",
> > +                        dev->subordinate ? "has" : "does not have");
> >                 pci_dev_restore(dev);
> >                 if (dev->subordinate)
> >                         pci_bus_restore_locked(dev->subordinate);
> > @@ -5786,8 +5797,10 @@ static int pci_bus_reset(struct pci_bus *bus, bool probe)
> >         if (!bus->self || !pci_bus_resettable(bus))
> >                 return -ENOTTY;
> >
> > -       if (probe)
> > +       if (probe) {
> > +               pci_info(bus->self, "%s: probe is true.  So return 0
> > directly", __func__);
> >                 return 0;
> > +       }
> >
> >         pci_bus_lock(bus);
> >
> > @@ -5858,10 +5871,12 @@ static int __pci_reset_bus(struct pci_bus *bus)
> >         int rc;
> >
> >         rc = pci_bus_reset(bus, PCI_RESET_PROBE);
> > +       pci_info(bus->self, "%s: pci_bus_reset() returns %d\n", __func__, rc);
> >         if (rc)
> >                 return rc;
> >
> >         if (pci_bus_trylock(bus)) {
> > +               pci_info(bus->self, "%s: pci_bus_trylock() returns
> > true\n", __func__);
> >                 pci_bus_save_and_disable_locked(bus);
> >                 might_sleep();
> >                 rc = pci_bridge_secondary_bus_reset(bus->self);
> > @@ -5881,6 +5896,7 @@ static int __pci_reset_bus(struct pci_bus *bus)
> >   */
> >  int pci_reset_bus(struct pci_dev *pdev)
> >  {
> > +       pci_info(pdev, "%s: %s", __func__,
> > !pci_probe_reset_slot(pdev->slot) ? "true" : "false");
> >         return (!pci_probe_reset_slot(pdev->slot)) ?
> >             __pci_reset_slot(pdev->slot) : __pci_reset_bus(pdev->bus);
> >  }
> >
> > And, have the information of VMD PCIe devices with the built kernel:
> >
> > 10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core
> > Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal
> > decode])
> >   ...
> >   Capabilities: [200 v1] L1 PM Substates
> >     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
> >       PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> >     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> >       T_CommonMode=0us LTR1.2_Threshold=0ns
> >     L1SubCtl2: T_PwrOn=0us
> >
> > 10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD
> > Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
> >   ...
> >   Capabilities: [900 v1] L1 PM Substates
> >     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
> >       PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> >     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> >       T_CommonMode=0us LTR1.2_Threshold=101376ns
> >     L1SubCtl2: T_PwrOn=50us
> >
> > We can see the NVMe has expected LTR1.2_Threshold=101376ns, but the
> > PCIe bridge has LTR1.2_Threshold=0ns.
>
> This is now the other way around as in the original posting that had
> 0ns for 10000:e1:00.0 ??
>
> Is this behavior even consistent or did you e.g. mess up some copy
> pasting somewhere?
>
> > Then, check the dmesg.  I notice the debug messages:
> >
> > pci 10000:e0:06.0: PCI bridge to [bus e1]
> > pci 10000:e0:06.0: Primary bus is hard wired to 0
> > pci 10000:e1:00.0: pci_reset_bus: false
> > pci 10000:e0:06.0: pci_bus_reset: probe is true.  So return 0 directly
> > pci 10000:e0:06.0: __pci_reset_bus: pci_bus_reset() returns 0
> > pci 10000:e0:06.0: __pci_reset_bus: pci_bus_trylock() returns true
> > pci 10000:e1:00.0: pci_bus_save_and_disable_locked: PCI state_saved is
> > false, and does not have subordinate
> > pci 10000:e1:00.0: pci_dev_save_and_disable: PCI state_saved is true
> > Freeing initrd memory: 75236K
> > pci 10000:e1:00.0: pci_bus_restore_locked: PCI state_saved is true,
> > and does not have subordinate
> >
> > So, the code path is:
> >
> > vmd_enable_domain()
> >   pci_reset_bus()
> >     __pci_reset_bus()
> >       pci_bus_reset()
> >         pci_bus_save_and_disable_locked()
> >           pci_dev_save_and_disable()
> >         pci_bus_restore_locked()
> >           pci_dev_restore()
> >
> > And, from the debug messages, I learned only NVMe 10000:e1:00.0 does
> > pci_save/restore_state.  But, the PCIe bridge 10000:e0:06.0 does not.
> > So, PCIe bridge 10000:e0:06.0 does not restore state correctly.
>
> It should not be necessary to restore the bridge's configuration as it
> ought to not be changed by the SBR itself, PCIe6 spec 7.5.1.3.13:
>
> "Port configuration registers must not be changed, except as required to
> update Port status."
>
> While the second part wording leaves some leeway, I don't think any of
> these field really fall under "Port status".
>
> > Besides, it is NVMe 10000:e1:00.0's bus [e1] been reset, not the VMD's
> > bus in vmd_enable_domain().
> > * Bus "e1" has only NVMe 10000:e1:00.0
> > * VMD's bus in vmd_enable_domain() has PCIe bridge 10000:e0:06.0, NVMe
> > 10000:e1:00.0 and SATA Controller 10000:e0:17.0.
>
> ...But even if those registers on the PCIe bridge were changed underneath
> against the spec, it's not clear from your debug log why pci_dev_restore()
> -> pci_restore_state() -> pci_restore_pcie_state() ->
> pci_restore_aspm_l1ss_state() did not restore also parent's
> LTR1.2_Threshold?? I think it should attempt to do that.
>
> --
>  i.
>
> > Here is the PCI tree:
> >
> > -+-[0000:00]-+-00.0  Intel Corporation Device 9a04
> >  |           +-02.0  Intel Corporation Tiger Lake-LP GT2 [UHD Graphics G4]
> >  |           +-04.0  Intel Corporation TigerLake-LP Dynamic Tuning
> > Processor Participant
> >  |           +-06.0  Intel Corporation RST VMD Managed Controller
> >  |           +-07.0-[01-2b]--
> >  |           +-08.0  Intel Corporation GNA Scoring Accelerator module
> >  |           +-0a.0  Intel Corporation Tigerlake Telemetry Aggregator Driver
> >  |           +-0d.0  Intel Corporation Tiger Lake-LP Thunderbolt 4 USB
> > Controller
> >  |           +-0d.2  Intel Corporation Tiger Lake-LP Thunderbolt 4 NHI #0
> >  |           +-0e.0  Intel Corporation Volume Management Device NVMe
> > RAID Controller
> >  |           +-14.0  Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1
> > xHCI Host Controller
> >  |           +-14.2  Intel Corporation Tiger Lake-LP Shared SRAM
> >  |           +-14.3  Intel Corporation Wi-Fi 6 AX201
> >  |           +-15.0  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #0
> >  |           +-15.1  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #1
> >  |           +-16.0  Intel Corporation Tiger Lake-LP Management Engine Interface
> >  |           +-17.0  Intel Corporation RST VMD Managed Controller
> >  |           +-1f.0  Intel Corporation Tiger Lake-LP LPC Controller
> >  |           +-1f.3  Intel Corporation Tiger Lake-LP Smart Sound
> > Technology Audio Controller
> >  |           +-1f.4  Intel Corporation Tiger Lake-LP SMBus Controller
> >  |           +-1f.5  Intel Corporation Tiger Lake-LP SPI Controller
> >  |           \-1f.6  Intel Corporation Ethernet Connection (13) I219-V
> >  \-[10000:e0]-+-06.0-[e1]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD
> >               \-17.0  Intel Corporation Tiger Lake-LP SATA Controller
> >
> > According the findings above, to ensure the devices on the VMD bus
> > have correctly states, seems pci_save_state() all the devices before
> > pci_reset_bus(), and pci_restore_state() all the devices after
> > pci_reset_bus() is the correct answer.

I add some debug messages based on v6.11 as following:

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index cee2365e54b8..404ce92f0152 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -109,21 +109,28 @@ void pci_restore_aspm_l1ss_state(struct pci_dev *pdev)
        u32 cl_ctl1, cl_ctl2, cl_l1_2_enable;
        u16 clnkctl, plnkctl;

+       pci_info(pdev, "%s", __func__);
        /*
         * In case BIOS enabled L1.2 when resuming, we need to disable it first
         * on the downstream component before the upstream. So, don't attempt to
         * restore either until we are at the downstream component.
         */
-       if (pcie_downstream_port(pdev) || !parent)
+       if (pcie_downstream_port(pdev) || !parent) {
+               pci_info(pdev, "%s: %s", __func__,
pcie_downstream_port(pdev) ? "is downstream port" : "not parent");
                return;
+       }

-       if (!pdev->l1ss || !parent->l1ss)
+       if (!pdev->l1ss || !parent->l1ss) {
+               pci_info(pdev, "%s: %sdoes not have l1ss", __func__,
!pdev->l1ss ? "" : "parent ");
                return;
+       }

        cl_save_state = pci_find_saved_ext_cap(pdev, PCI_EXT_CAP_ID_L1SS);
        pl_save_state = pci_find_saved_ext_cap(parent, PCI_EXT_CAP_ID_L1SS);
-       if (!cl_save_state || !pl_save_state)
+       if (!cl_save_state || !pl_save_state) {
+               pci_info(pdev, "%s: %sdid not save ext_cap", __func__,
!cl_save_state ? "" : "parent ");
                return;
+       }

        cap = &cl_save_state->cap.data[0];
        cl_ctl2 = *cap++;
@@ -131,6 +138,7 @@ void pci_restore_aspm_l1ss_state(struct pci_dev *pdev)
        cap = &pl_save_state->cap.data[0];
        pl_ctl2 = *cap++;
        pl_ctl1 = *cap;
+       pci_info(pdev, "%s: cl_ctl1: 0x%08x, cl_ctl2: 0x%08x, pl_ctl1:
0x%08x, pl_ctl2: 0x%08x", __func__, cl_ctl1, cl_ctl2, pl_ctl1,
pl_ctl2);

        /* Make sure L0s/L1 are disabled before updating L1SS config */
        pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &clnkctl);

Here is the corresponding log:

[    0.418931] pci 10000:e0:06.0: PCI bridge to [bus e1]
[    0.418936] pci 10000:e0:06.0: Primary bus is hard wired to 0
[    0.447474] Freeing initrd memory: 24700K
[    0.670789] pci 10000:e1:00.0: pci_restore_aspm_l1ss_state
[    0.670807] pci 10000:e1:00.0: pci_restore_aspm_l1ss_state:
cl_ctl1: 0x40630000, cl_ctl2: 0x00000029, pl_ctl1: 0x00000000,
pl_ctl2: 0x00000000
[    0.670862] pci 10000:e0:06.0: bridge window [mem
0x76000000-0x760fffff]: assigned
[    0.670864] pci 10000:e0:17.0: BAR 0 [mem 0x76100000-0x76101fff]: assigned
[    0.670881] pci 10000:e0:06.0: bridge window [io  size 0x1000]:
can't assign; no space
[    0.670882] pci 10000:e0:06.0: bridge window [io  size 0x1000]:
failed to assign
[    0.670884] pci 10000:e0:17.0: BAR 5 [mem 0x76102000-0x761027ff]: assigned
[    0.670893] pci 10000:e0:17.0: BAR 1 [mem 0x76102800-0x761028ff]: assigned
[    0.670896] pci 10000:e0:17.0: BAR 4 [io  size 0x0020]: can't
assign; no space
[    0.670897] pci 10000:e0:17.0: BAR 4 [io  size 0x0020]: failed to assign
[    0.670898] pci 10000:e0:17.0: BAR 2 [io  size 0x0008]: can't
assign; no space
[    0.670898] pci 10000:e0:17.0: BAR 2 [io  size 0x0008]: failed to assign
[    0.670899] pci 10000:e0:17.0: BAR 3 [io  size 0x0004]: can't
assign; no space
[    0.670900] pci 10000:e0:17.0: BAR 3 [io  size 0x0004]: failed to assign
[    0.670901] pci 10000:e1:00.0: BAR 0 [mem 0x76000000-0x76003fff
64bit]: assigned
[    0.670909] pci 10000:e1:00.0: BAR 4 [mem 0x76004000-0x760040ff
64bit]: assigned
[    0.670918] pci 10000:e0:06.0: PCI bridge to [bus e1]
[    0.670921] pci 10000:e0:06.0:   bridge window [mem 0x76000000-0x760fffff]
[    0.670950] pci 10000:e1:00.0: VMD: Default LTR value set by driver
[    0.670977] pcieport 10000:e0:06.0: can't derive routing for PCI INT D
[    0.670979] pcieport 10000:e0:06.0: PCI INT D: no GSI
[    0.671043] pcieport 10000:e0:06.0: PME: Signaling with IRQ 143
[    0.671092] vmd 0000:00:0e.0: Bound to PCI domain 10000

We can notice both pl_ctl1 and pl_ctl2 are 0x0.  Because, the link's
parent device (PCIe bridge 10000:e0:06.0) did not save state before
reset.  So, pci_restore_aspm_l1ss_state() restores parent's
LTR1.2_Threshold with a wrong value 0.

Jian-Hong Pan
Jian-Hong Pan Sept. 23, 2024, 10:57 a.m. UTC | #16
Nirmal Patel <nirmal.patel@linux.intel.com> 於 2024年9月21日 週六 上午12:03寫道:
>
> On Mon, 12 Aug 2024 16:18:22 +0800
> Jian-Hong Pan <jhp@endlessos.org> wrote:
>
> > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年8月8日 週四
> > 下午5:49寫道:
> > >
> > > On Wed, 7 Aug 2024, David E. Box wrote:
> > >
> > > > On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:
> > > > > On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> > > > >
> > > > > > David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日
> > > > > > 週二 上午4:26寫道:
> > > > > > >
> > > > > > > Hi Jian-Hong,
> > > > > > >
> > > > > > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > > > > > > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五
> > > > > > > > 下午4:04寫道:
> > > > > > > > >
> > > > > > > > > Currently, when enable link's L1.2 features with
> > > > > > > > > __pci_enable_link_state(),
> > > > > > > > > it configs the link directly without ensuring related
> > > > > > > > > L1.2 parameters, such
> > > > > > > > > as T_POWER_ON, Common_Mode_Restore_Time, and
> > > > > > > > > LTR_L1.2_THRESHOLD have been
> > > > > > > > > programmed.
> > > > > > > > >
> > > > > > > > > This leads the link's L1.2 between PCIe Root Port and
> > > > > > > > > child device gets
> > > > > > > > > wrong configs when a caller tries to enabled it.
> > > > > > > > >
> > > > > > > > > Here is a failed example on ASUS B1400CEAE with enabled
> > > > > > > > > VMD:
> > > > > > > > >
> > > > > > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen
> > > > > > > > > Core Processor PCIe
> > > > > > > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > > > > > > >     ...
> > > > > > > > >     Capabilities: [200 v1] L1 PM Substates
> > > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+
> > > > > > > > > ASPM_L1.1+ L1_PM_Substates+
> > > > > > > > >                   PortCommonModeRestoreTime=45us
> > > > > > > > > PortTPowerOnTime=50us L1SubCtl1: PCI-PM_L1.2-
> > > > > > > > > PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- T_CommonMode=45us
> > > > > > > > > LTR1.2_Threshold=101376ns L1SubCtl2: T_PwrOn=50us
> > > > > > > > >
> > > > > > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk
> > > > > > > > > Corp WD Blue SN550
> > > > > > > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > > > > > >     ...
> > > > > > > > >     Capabilities: [900 v1] L1 PM Substates
> > > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+
> > > > > > > > > ASPM_L1.1- L1_PM_Substates+
> > > > > > > > >                   PortCommonModeRestoreTime=32us
> > > > > > > > > PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2-
> > > > > > > > > PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- T_CommonMode=0us
> > > > > > > > > LTR1.2_Threshold=0ns L1SubCtl2: T_PwrOn=10us
> > > > > > > > >
> > > > > > > > > According to "PCIe r6.0, sec 5.5.4", before enabling
> > > > > > > > > ASPM L1.2 on the PCIe
> > > > > > > > > Root Port and the child NVMe, they should be programmed
> > > > > > > > > with the same LTR1.2_Threshold value. However, they
> > > > > > > > > have different values in this case.
> > > > > > > > >
> > > > > > > > > Invoke aspm_calc_l12_info() to program the L1.2
> > > > > > > > > parameters properly before
> > > > > > > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > > > > > > __pci_enable_link_state().
> > > > > > > > >
> > > > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > > > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > > > > > > ---
> > > > > > > > > v2:
> > > > > > > > > - Prepare the PCIe LTR parameters before enable L1
> > > > > > > > > Substates
> > > > > > > > >
> > > > > > > > > v3:
> > > > > > > > > - Only enable supported features for the L1 Substates
> > > > > > > > > part
> > > > > > > > >
> > > > > > > > > v4:
> > > > > > > > > - Focus on fixing L1.2 parameters, instead of
> > > > > > > > > re-initializing whole L1SS
> > > > > > > > >
> > > > > > > > > v5:
> > > > > > > > > - Fix typo and commit message
> > > > > > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM:
> > > > > > > > > Introduce aspm_get_l1ss_cap()"
> > > > > > > > >
> > > > > > > > > v6:
> > > > > > > > > - Skipped
> > > > > > > > >
> > > > > > > > > v7:
> > > > > > > > > - Pick back and rebase on the new version kernel
> > > > > > > > > - Drop the link state flag check. And, always config
> > > > > > > > > link state's timing
> > > > > > > > >   parameters
> > > > > > > > >
> > > > > > > > > v8:
> > > > > > > > > - Because pcie_aspm_get_link() might return the link as
> > > > > > > > > NULL, move getting the link's parent and child devices
> > > > > > > > > after check the link is not NULL. This avoids NULL
> > > > > > > > > memory access.
> > > > > > > > >
> > > > > > > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/pci/pcie/aspm.c
> > > > > > > > > b/drivers/pci/pcie/aspm.c index
> > > > > > > > > 5db1044c9895..55ff1d26fcea 100644 ---
> > > > > > > > > a/drivers/pci/pcie/aspm.c +++ b/drivers/pci/pcie/aspm.c
> > > > > > > > > @@ -1411,9 +1411,15 @@
> > > > > > > > > EXPORT_SYMBOL(pci_disable_link_state); static int
> > > > > > > > > __pci_enable_link_state(struct pci_dev *pdev, int
> > > > > > > > > state, bool locked)
> > > > > > > > >  {
> > > > > > > > >         struct pcie_link_state *link =
> > > > > > > > > pcie_aspm_get_link(pdev);
> > > > > > > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > > > > > +       struct pci_dev *parent, *child;
> > > > > > > > >
> > > > > > > > >         if (!link)
> > > > > > > > >                 return -EINVAL;
> > > > > > > > > +
> > > > > > > > > +       parent = link->pdev;
> > > > > > > > > +       child = link->downstream;
> > > > > > > > > +
> > > > > > > > >         /*
> > > > > > > > >          * A driver requested that ASPM be enabled on
> > > > > > > > > this device, but
> > > > > > > > >          * if we don't have permission to manage ASPM
> > > > > > > > > (e.g., on ACPI @@ -1428,6 +1434,15 @@ static int
> > > > > > > > > __pci_enable_link_state(struct pci_dev
> > > > > > > > > *pdev, int state, bool locked)
> > > > > > > > >         if (!locked)
> > > > > > > > >                 down_read(&pci_bus_sem);
> > > > > > > > >         mutex_lock(&aspm_lock);
> > > > > > > > > +       /*
> > > > > > > > > +        * Ensure L1.2 parameters:
> > > > > > > > > Common_Mode_Restore_Times, T_POWER_ON and
> > > > > > > > > +        * LTR_L1.2_THRESHOLD are programmed properly
> > > > > > > > > before enable bits for
> > > > > > > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > > > > > +        */
> > > > > > > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > > > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > > > > > +       aspm_calc_l12_info(link, parent_l1ss_cap,
> > > > > > > > > child_l1ss_cap);
> > > > > > >
> > > > > > > I still don't think this is the place to recalculate the
> > > > > > > L1.2 parameters especially when know the calculation was
> > > > > > > done but was cleared by pci_bus_reset(). Can't we just do a
> > > > > > > pci_save/restore_state() before/after pci_bus_reset() in
> > > > > > > vmd.c?
> > > > > >
> > > > > > I have not thought pci_save/restore_state() around
> > > > > > pci_bus_reset() before.  It is an interesting direction.
> > > > > >
> > > > > > So, I prepare modification below for test.  Include "[PATCH
> > > > > > v8 1/4] PCI: vmd: Set PCI devices to D0 before enable PCI
> > > > > > PM's L1 substates", too.  Then, both the PCIe bridge and the
> > > > > > PCIe device have the same LTR_L1.2_THRESHOLD 101376ns as
> > > > > > expected.
> > > > > >
> > > > > > diff --git a/drivers/pci/controller/vmd.c
> > > > > > b/drivers/pci/controller/vmd.c index
> > > > > > bbf4a47e7b31..6b8dd4f30127 100644 ---
> > > > > > a/drivers/pci/controller/vmd.c +++
> > > > > > b/drivers/pci/controller/vmd.c @@ -727,6 +727,18 @@ static
> > > > > > void vmd_copy_host_bridge_flags(struct pci_host_bridge
> > > > > > *root_bridge, vmd_bridge->native_dpc =
> > > > > > root_bridge->native_dpc; }
> > > > > >
> > > > > > +static int vmd_pci_save_state(struct pci_dev *pdev, void
> > > > > > *userdata) +{
> > > > > > +       pci_save_state(pdev);
> > > > > > +       return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static int vmd_pci_restore_state(struct pci_dev *pdev, void
> > > > > > *userdata) +{
> > > > > > +       pci_restore_state(pdev);
> > > > > > +       return 0;
> > > > > > +}
> > > > > > +
> > > > > >  /*
> > > > > >   * Enable ASPM and LTR settings on devices that aren't
> > > > > > configured by BIOS. */
> > > > > > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct
> > > > > > vmd_dev *vmd, unsigned long features)
> > > > > >         pci_scan_child_bus(vmd->bus);
> > > > > >         vmd_domain_reset(vmd);
> > > > > >
> > > > > > +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> > > > > >         /* When Intel VMD is enabled, the OS does not
> > > > > > discover the Root Ports
> > > > > >          * owned by Intel VMD within the MMCFG space.
> > > > > > pci_reset_bus() applies
> > > > > >          * a reset to the parent of the PCI device supplied
> > > > > > as argument. This
> > > > > > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct
> > > > > > vmd_dev *vmd, unsigned long features)
> > > > > >                         break;
> > > > > >                 }
> > > > > >         }
> > > > > > +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);
> > > > >
> > > > > Why not call pci_reset_bus() (or __pci_reset_bus()) then in
> > > > > vmd_enable_domain() which preserves state unlike
> > > > > pci_reset_bus()?
> > > > >
> > > > > (Don't tell me naming of these functions is a horrible mess.
> > > > > :-/)
> > > >
> > > > Hmm. So this *is* calling pci_reset_bus().
> > >
> > > Yeah, I managed to get confused by the names myself, I somehow
> > > ended up thinking it calls pci_bus_reset() which is not correct...
> > >
> > > > L1.2 configuration has specific
> > > > ordering requirements for changes to parent & child devices.
> > > > Could be why it's not getting restored properly.
> > >
> > > Indeed, it has to be something else since the patch above doesn't
> > > even restore anything because dev->state_saved should get set to
> > > false by the first pci_restore_state() called from
> > > __pci_reset_bus() -> pci_bus_restore_locked() -> pci_dev_restore(),
> > > I think!?
> >
> > Inspired by Ilpo's comment.  I add some debug messages based on
> > linux-next's tag 'next-20240809' to understand the code path of
> > pci_reset_bus():
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index ffaaca0978cb..3ee71374f1de 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5133,8 +5133,10 @@ static void pci_dev_save_and_disable(struct
> > pci_dev *dev)
> >          * races with ->remove() by the device lock, which must be
> > held by
> >          * the caller.
> >          */
> > -       if (err_handler && err_handler->reset_prepare)
> > +       if (err_handler && err_handler->reset_prepare) {
> > +               pci_info(dev, "%s: %pF\n", __func__,
> > err_handler->reset_prepare);
> >                 err_handler->reset_prepare(dev);
> > +       }
> >
> >         /*
> >          * Wake-up device prior to save.  PM registers default to D0
> > after @@ -5144,6 +5146,7 @@ static void
> > pci_dev_save_and_disable(struct pci_dev *dev)
> > pci_set_power_state(dev, PCI_D0);
> >
> >         pci_save_state(dev);
> > +       pci_info(dev, "%s: PCI state_saved is %s\n", __func__,
> > dev->state_saved ? "true" : "false");
> >         /*
> >          * Disable the device by clearing the Command register,
> > except for
> >          * INTx-disable which is set.  This not only disables MMIO
> > and I/O port @@ -5655,6 +5658,10 @@ static void
> > pci_bus_save_and_disable_locked(struct pci_bus *bus)
> >         struct pci_dev *dev;
> >
> >         list_for_each_entry(dev, &bus->devices, bus_list) {
> > +               pci_info(dev, "%s: PCI state_saved is %s, and %s
> > subordinate\n",
> > +                        __func__,
> > +                        dev->state_saved ? "true" : "false",
> > +                        dev->subordinate ? "has" : "does not have");
> >                 pci_dev_save_and_disable(dev);
> >                 if (dev->subordinate)
> >                         pci_bus_save_and_disable_locked(dev->subordinate);
> > @@ -5671,6 +5678,10 @@ static void pci_bus_restore_locked(struct
> > pci_bus *bus) struct pci_dev *dev;
> >
> >         list_for_each_entry(dev, &bus->devices, bus_list) {
> > +               pci_info(dev, "%s: PCI state_saved is %s, and %s
> > subordinate\n",
> > +                        __func__,
> > +                        dev->state_saved ? "true" : "false",
> > +                        dev->subordinate ? "has" : "does not have");
> >                 pci_dev_restore(dev);
> >                 if (dev->subordinate)
> >                         pci_bus_restore_locked(dev->subordinate);
> > @@ -5786,8 +5797,10 @@ static int pci_bus_reset(struct pci_bus *bus,
> > bool probe) if (!bus->self || !pci_bus_resettable(bus))
> >                 return -ENOTTY;
> >
> > -       if (probe)
> > +       if (probe) {
> > +               pci_info(bus->self, "%s: probe is true.  So return 0
> > directly", __func__);
> >                 return 0;
> > +       }
> >
> >         pci_bus_lock(bus);
> >
> > @@ -5858,10 +5871,12 @@ static int __pci_reset_bus(struct pci_bus
> > *bus) int rc;
> >
> >         rc = pci_bus_reset(bus, PCI_RESET_PROBE);
> > +       pci_info(bus->self, "%s: pci_bus_reset() returns %d\n",
> > __func__, rc); if (rc)
> >                 return rc;
> >
> >         if (pci_bus_trylock(bus)) {
> > +               pci_info(bus->self, "%s: pci_bus_trylock() returns
> > true\n", __func__);
> >                 pci_bus_save_and_disable_locked(bus);
> >                 might_sleep();
> >                 rc = pci_bridge_secondary_bus_reset(bus->self);
> > @@ -5881,6 +5896,7 @@ static int __pci_reset_bus(struct pci_bus *bus)
> >   */
> >  int pci_reset_bus(struct pci_dev *pdev)
> >  {
> > +       pci_info(pdev, "%s: %s", __func__,
> > !pci_probe_reset_slot(pdev->slot) ? "true" : "false");
> >         return (!pci_probe_reset_slot(pdev->slot)) ?
> >             __pci_reset_slot(pdev->slot) : __pci_reset_bus(pdev->bus);
> >  }
> >
> > And, have the information of VMD PCIe devices with the built kernel:
> >
> > 10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core
> > Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal
> > decode])
> >   ...
> >   Capabilities: [200 v1] L1 PM Substates
> >     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > L1_PM_Substates+ PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> >     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> >       T_CommonMode=0us LTR1.2_Threshold=0ns
> >     L1SubCtl2: T_PwrOn=0us
> >
> > 10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD
> > Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
> >   ...
> >   Capabilities: [900 v1] L1 PM Substates
> >     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > L1_PM_Substates+ PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> >     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> >       T_CommonMode=0us LTR1.2_Threshold=101376ns
> >     L1SubCtl2: T_PwrOn=50us
> >
> > We can see the NVMe has expected LTR1.2_Threshold=101376ns, but the
> > PCIe bridge has LTR1.2_Threshold=0ns.
> >
> > Then, check the dmesg.  I notice the debug messages:
> >
> > pci 10000:e0:06.0: PCI bridge to [bus e1]
> > pci 10000:e0:06.0: Primary bus is hard wired to 0
> > pci 10000:e1:00.0: pci_reset_bus: false
> > pci 10000:e0:06.0: pci_bus_reset: probe is true.  So return 0 directly
> > pci 10000:e0:06.0: __pci_reset_bus: pci_bus_reset() returns 0
> > pci 10000:e0:06.0: __pci_reset_bus: pci_bus_trylock() returns true
> > pci 10000:e1:00.0: pci_bus_save_and_disable_locked: PCI state_saved is
> > false, and does not have subordinate
> > pci 10000:e1:00.0: pci_dev_save_and_disable: PCI state_saved is true
> > Freeing initrd memory: 75236K
> > pci 10000:e1:00.0: pci_bus_restore_locked: PCI state_saved is true,
> > and does not have subordinate
> >
> > So, the code path is:
> >
> > vmd_enable_domain()
> >   pci_reset_bus()
> >     __pci_reset_bus()
> >       pci_bus_reset()
> >         pci_bus_save_and_disable_locked()
> >           pci_dev_save_and_disable()
> >         pci_bus_restore_locked()
> >           pci_dev_restore()
> >
> > And, from the debug messages, I learned only NVMe 10000:e1:00.0 does
> > pci_save/restore_state.  But, the PCIe bridge 10000:e0:06.0 does not.
> > So, PCIe bridge 10000:e0:06.0 does not restore state correctly.
> >
> > Besides, it is NVMe 10000:e1:00.0's bus [e1] been reset, not the VMD's
> > bus in vmd_enable_domain().
> > * Bus "e1" has only NVMe 10000:e1:00.0
> > * VMD's bus in vmd_enable_domain() has PCIe bridge 10000:e0:06.0, NVMe
> > 10000:e1:00.0 and SATA Controller 10000:e0:17.0.
> >
> > Here is the PCI tree:
> >
> > -+-[0000:00]-+-00.0  Intel Corporation Device 9a04
> >  |           +-02.0  Intel Corporation Tiger Lake-LP GT2 [UHD
> > Graphics G4] |           +-04.0  Intel Corporation TigerLake-LP
> > Dynamic Tuning Processor Participant
> >  |           +-06.0  Intel Corporation RST VMD Managed Controller
> >  |           +-07.0-[01-2b]--
> >  |           +-08.0  Intel Corporation GNA Scoring Accelerator module
> >  |           +-0a.0  Intel Corporation Tigerlake Telemetry Aggregator
> > Driver |           +-0d.0  Intel Corporation Tiger Lake-LP
> > Thunderbolt 4 USB Controller
> >  |           +-0d.2  Intel Corporation Tiger Lake-LP Thunderbolt 4
> > NHI #0 |           +-0e.0  Intel Corporation Volume Management Device
> > NVMe RAID Controller
> >  |           +-14.0  Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1
> > xHCI Host Controller
> >  |           +-14.2  Intel Corporation Tiger Lake-LP Shared SRAM
> >  |           +-14.3  Intel Corporation Wi-Fi 6 AX201
> >  |           +-15.0  Intel Corporation Tiger Lake-LP Serial IO I2C
> > Controller #0 |           +-15.1  Intel Corporation Tiger Lake-LP
> > Serial IO I2C Controller #1 |           +-16.0  Intel Corporation
> > Tiger Lake-LP Management Engine Interface |           +-17.0  Intel
> > Corporation RST VMD Managed Controller |           +-1f.0  Intel
> > Corporation Tiger Lake-LP LPC Controller |           +-1f.3  Intel
> > Corporation Tiger Lake-LP Smart Sound Technology Audio Controller
> >  |           +-1f.4  Intel Corporation Tiger Lake-LP SMBus Controller
> >  |           +-1f.5  Intel Corporation Tiger Lake-LP SPI Controller
> >  |           \-1f.6  Intel Corporation Ethernet Connection (13) I219-V
> >  \-[10000:e0]-+-06.0-[e1]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD
> >               \-17.0  Intel Corporation Tiger Lake-LP SATA Controller
> >
> > According the findings above, to ensure the devices on the VMD bus
> > have correctly states, seems pci_save_state() all the devices before
> > pci_reset_bus(), and pci_restore_state() all the devices after
> > pci_reset_bus() is the correct answer.
>
> Did you get a chance to test this theory?

Yes.  But, according the findings in the previous mail talking about
restoring parent's LTR1.2_Threshold with a wrong value 0, it can be
shrunk to save/restore the PCIe bridge's state.

Preparing a new patch set.

Jian-Hong Pan
Ilpo Järvinen Sept. 23, 2024, 1:44 p.m. UTC | #17
On Mon, 23 Sep 2024, Jian-Hong Pan wrote:

> Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年9月2日 週一 下午11:44寫道:
> >
> > On Mon, 12 Aug 2024, Jian-Hong Pan wrote:
> >
> > > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> 於 2024年8月8日 週四 下午5:49寫道:
> > > > On Wed, 7 Aug 2024, David E. Box wrote:
> > > > > On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:
> > > > > > On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> > > > > >
> > > > > > > David E. Box <david.e.box@linux.intel.com> 於 2024年8月6日 週二 上午4:26寫道:
> > > > > > > >
> > > > > > > > Hi Jian-Hong,
> > > > > > > >
> > > > > > > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > > > > > > > > Jian-Hong Pan <jhp@endlessos.org> 於 2024年7月19日 週五 下午4:04寫道:
> > > > > > > > > >
> > > > > > > > > > Currently, when enable link's L1.2 features with
> > > > > > > > > > __pci_enable_link_state(),
> > > > > > > > > > it configs the link directly without ensuring related L1.2 parameters,
> > > > > > > > > > such
> > > > > > > > > > as T_POWER_ON, Common_Mode_Restore_Time, and LTR_L1.2_THRESHOLD have
> > > > > > > > > > been
> > > > > > > > > > programmed.
> > > > > > > > > >
> > > > > > > > > > This leads the link's L1.2 between PCIe Root Port and child device
> > > > > > > > > > gets
> > > > > > > > > > wrong configs when a caller tries to enabled it.
> > > > > > > > > >
> > > > > > > > > > Here is a failed example on ASUS B1400CEAE with enabled VMD:
> > > > > > > > > >
> > > > > > > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor
> > > > > > > > > > PCIe
> > > > > > > > > > Controller (rev 01) (prog-if 00 [Normal decode])
> > > > > > > > > >     ...
> > > > > > > > > >     Capabilities: [200 v1] L1 PM Substates
> > > > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> > > > > > > > > > L1_PM_Substates+
> > > > > > > > > >                   PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > > > > > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > > >                    T_CommonMode=45us LTR1.2_Threshold=101376ns
> > > > > > > > > >         L1SubCtl2: T_PwrOn=50us
> > > > > > > > > >
> > > > > > > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue
> > > > > > > > > > SN550
> > > > > > > > > > NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > > > > > > >     ...
> > > > > > > > > >     Capabilities: [900 v1] L1 PM Substates
> > > > > > > > > >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > > > L1_PM_Substates+
> > > > > > > > > >                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > > > > > > > > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > > >                    T_CommonMode=0us LTR1.2_Threshold=0ns
> > > > > > > > > >         L1SubCtl2: T_PwrOn=10us
> > > > > > > > > >
> > > > > > > > > > According to "PCIe r6.0, sec 5.5.4", before enabling ASPM L1.2 on the
> > > > > > > > > > PCIe
> > > > > > > > > > Root Port and the child NVMe, they should be programmed with the same
> > > > > > > > > > LTR1.2_Threshold value. However, they have different values in this
> > > > > > > > > > case.
> > > > > > > > > >
> > > > > > > > > > Invoke aspm_calc_l12_info() to program the L1.2 parameters properly
> > > > > > > > > > before
> > > > > > > > > > enable L1.2 bits of L1 PM Substates Control Register in
> > > > > > > > > > __pci_enable_link_state().
> > > > > > > > > >
> > > > > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > > > > > > Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> > > > > > > > > > ---
> > > > > > > > > > v2:
> > > > > > > > > > - Prepare the PCIe LTR parameters before enable L1 Substates
> > > > > > > > > >
> > > > > > > > > > v3:
> > > > > > > > > > - Only enable supported features for the L1 Substates part
> > > > > > > > > >
> > > > > > > > > > v4:
> > > > > > > > > > - Focus on fixing L1.2 parameters, instead of re-initializing whole
> > > > > > > > > > L1SS
> > > > > > > > > >
> > > > > > > > > > v5:
> > > > > > > > > > - Fix typo and commit message
> > > > > > > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM: Introduce
> > > > > > > > > >   aspm_get_l1ss_cap()"
> > > > > > > > > >
> > > > > > > > > > v6:
> > > > > > > > > > - Skipped
> > > > > > > > > >
> > > > > > > > > > v7:
> > > > > > > > > > - Pick back and rebase on the new version kernel
> > > > > > > > > > - Drop the link state flag check. And, always config link state's
> > > > > > > > > > timing
> > > > > > > > > >   parameters
> > > > > > > > > >
> > > > > > > > > > v8:
> > > > > > > > > > - Because pcie_aspm_get_link() might return the link as NULL, move
> > > > > > > > > >   getting the link's parent and child devices after check the link is
> > > > > > > > > >   not NULL. This avoids NULL memory access.
> > > > > > > > > >
> > > > > > > > > >  drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > > > > > index 5db1044c9895..55ff1d26fcea 100644
> > > > > > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > > > > > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> > > > > > > > > >  static int __pci_enable_link_state(struct pci_dev *pdev, int state,
> > > > > > > > > > bool
> > > > > > > > > > locked)
> > > > > > > > > >  {
> > > > > > > > > >         struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > > > > > > > > > +       u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > > > > > > +       struct pci_dev *parent, *child;
> > > > > > > > > >
> > > > > > > > > >         if (!link)
> > > > > > > > > >                 return -EINVAL;
> > > > > > > > > > +
> > > > > > > > > > +       parent = link->pdev;
> > > > > > > > > > +       child = link->downstream;
> > > > > > > > > > +
> > > > > > > > > >         /*
> > > > > > > > > >          * A driver requested that ASPM be enabled on this device, but
> > > > > > > > > >          * if we don't have permission to manage ASPM (e.g., on ACPI
> > > > > > > > > > @@ -1428,6 +1434,15 @@ static int __pci_enable_link_state(struct
> > > > > > > > > > pci_dev
> > > > > > > > > > *pdev, int state, bool locked)
> > > > > > > > > >         if (!locked)
> > > > > > > > > >                 down_read(&pci_bus_sem);
> > > > > > > > > >         mutex_lock(&aspm_lock);
> > > > > > > > > > +       /*
> > > > > > > > > > +        * Ensure L1.2 parameters: Common_Mode_Restore_Times,
> > > > > > > > > > T_POWER_ON and
> > > > > > > > > > +        * LTR_L1.2_THRESHOLD are programmed properly before enable
> > > > > > > > > > bits for
> > > > > > > > > > +        * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > > > > > > +        */
> > > > > > > > > > +       parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > > > > > > +       child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > > > > > > +       aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
> > > > > > > >
> > > > > > > > I still don't think this is the place to recalculate the L1.2 parameters
> > > > > > > > especially when know the calculation was done but was cleared by
> > > > > > > > pci_bus_reset(). Can't we just do a pci_save/restore_state() before/after
> > > > > > > > pci_bus_reset() in vmd.c?
> > > > > > >
> > > > > > > I have not thought pci_save/restore_state() around pci_bus_reset()
> > > > > > > before.  It is an interesting direction.
> > > > > > >
> > > > > > > So, I prepare modification below for test.  Include "[PATCH v8 1/4]
> > > > > > > PCI: vmd: Set PCI devices to D0 before enable PCI PM's L1 substates",
> > > > > > > too.  Then, both the PCIe bridge and the PCIe device have the same
> > > > > > > LTR_L1.2_THRESHOLD 101376ns as expected.
> > > > > > >
> > > > > > > diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> > > > > > > index bbf4a47e7b31..6b8dd4f30127 100644
> > > > > > > --- a/drivers/pci/controller/vmd.c
> > > > > > > +++ b/drivers/pci/controller/vmd.c
> > > > > > > @@ -727,6 +727,18 @@ static void vmd_copy_host_bridge_flags(struct
> > > > > > > pci_host_bridge *root_bridge,
> > > > > > >         vmd_bridge->native_dpc = root_bridge->native_dpc;
> > > > > > >  }
> > > > > > >
> > > > > > > +static int vmd_pci_save_state(struct pci_dev *pdev, void *userdata)
> > > > > > > +{
> > > > > > > +       pci_save_state(pdev);
> > > > > > > +       return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int vmd_pci_restore_state(struct pci_dev *pdev, void *userdata)
> > > > > > > +{
> > > > > > > +       pci_restore_state(pdev);
> > > > > > > +       return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /*
> > > > > > >   * Enable ASPM and LTR settings on devices that aren't configured by BIOS.
> > > > > > >   */
> > > > > > > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > > > > > unsigned long features)
> > > > > > >         pci_scan_child_bus(vmd->bus);
> > > > > > >         vmd_domain_reset(vmd);
> > > > > > >
> > > > > > > +       pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> > > > > > >         /* When Intel VMD is enabled, the OS does not discover the Root
> > > > > > > Ports
> > > > > > >          * owned by Intel VMD within the MMCFG space. pci_reset_bus()
> > > > > > > applies
> > > > > > >          * a reset to the parent of the PCI device supplied as argument.
> > > > > > > This
> > > > > > > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd,
> > > > > > > unsigned long features)
> > > > > > >                         break;
> > > > > > >                 }
> > > > > > >         }
> > > > > > > +       pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);
> > > > > >
> > > > > > Why not call pci_reset_bus() (or __pci_reset_bus()) then in
> > > > > > vmd_enable_domain() which preserves state unlike pci_reset_bus()?
> > > > > >
> > > > > > (Don't tell me naming of these functions is a horrible mess. :-/)
> > > > >
> > > > > Hmm. So this *is* calling pci_reset_bus().
> > > >
> > > > Yeah, I managed to get confused by the names myself, I somehow
> > > > ended up thinking it calls pci_bus_reset() which is not correct...
> > > >
> > > > > L1.2 configuration has specific
> > > > > ordering requirements for changes to parent & child devices. Could be why it's
> > > > > not getting restored properly.
> > > >
> > > > Indeed, it has to be something else since the patch above doesn't even
> > > > restore anything because dev->state_saved should get set to false by the
> > > > first pci_restore_state() called from
> > > > __pci_reset_bus() -> pci_bus_restore_locked() -> pci_dev_restore(), I
> > > > think!?
> > >
> > > Inspired by Ilpo's comment.  I add some debug messages based on
> > > linux-next's tag 'next-20240809' to understand the code path of
> > > pci_reset_bus():
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index ffaaca0978cb..3ee71374f1de 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -5133,8 +5133,10 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
> > >          * races with ->remove() by the device lock, which must be held by
> > >          * the caller.
> > >          */
> > > -       if (err_handler && err_handler->reset_prepare)
> > > +       if (err_handler && err_handler->reset_prepare) {
> > > +               pci_info(dev, "%s: %pF\n", __func__,
> > > err_handler->reset_prepare);
> > >                 err_handler->reset_prepare(dev);
> > > +       }
> > >
> > >         /*
> > >          * Wake-up device prior to save.  PM registers default to D0 after
> > > @@ -5144,6 +5146,7 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
> > >         pci_set_power_state(dev, PCI_D0);
> > >
> > >         pci_save_state(dev);
> > > +       pci_info(dev, "%s: PCI state_saved is %s\n", __func__,
> > > dev->state_saved ? "true" : "false");
> > >         /*
> > >          * Disable the device by clearing the Command register, except for
> > >          * INTx-disable which is set.  This not only disables MMIO and I/O port
> > > @@ -5655,6 +5658,10 @@ static void
> > > pci_bus_save_and_disable_locked(struct pci_bus *bus)
> > >         struct pci_dev *dev;
> > >
> > >         list_for_each_entry(dev, &bus->devices, bus_list) {
> > > +               pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
> > > +                        __func__,
> > > +                        dev->state_saved ? "true" : "false",
> > > +                        dev->subordinate ? "has" : "does not have");
> > >                 pci_dev_save_and_disable(dev);
> > >                 if (dev->subordinate)
> > >                         pci_bus_save_and_disable_locked(dev->subordinate);
> > > @@ -5671,6 +5678,10 @@ static void pci_bus_restore_locked(struct pci_bus *bus)
> > >         struct pci_dev *dev;
> > >
> > >         list_for_each_entry(dev, &bus->devices, bus_list) {
> > > +               pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
> > > +                        __func__,
> > > +                        dev->state_saved ? "true" : "false",
> > > +                        dev->subordinate ? "has" : "does not have");
> > >                 pci_dev_restore(dev);
> > >                 if (dev->subordinate)
> > >                         pci_bus_restore_locked(dev->subordinate);
> > > @@ -5786,8 +5797,10 @@ static int pci_bus_reset(struct pci_bus *bus, bool probe)
> > >         if (!bus->self || !pci_bus_resettable(bus))
> > >                 return -ENOTTY;
> > >
> > > -       if (probe)
> > > +       if (probe) {
> > > +               pci_info(bus->self, "%s: probe is true.  So return 0
> > > directly", __func__);
> > >                 return 0;
> > > +       }
> > >
> > >         pci_bus_lock(bus);
> > >
> > > @@ -5858,10 +5871,12 @@ static int __pci_reset_bus(struct pci_bus *bus)
> > >         int rc;
> > >
> > >         rc = pci_bus_reset(bus, PCI_RESET_PROBE);
> > > +       pci_info(bus->self, "%s: pci_bus_reset() returns %d\n", __func__, rc);
> > >         if (rc)
> > >                 return rc;
> > >
> > >         if (pci_bus_trylock(bus)) {
> > > +               pci_info(bus->self, "%s: pci_bus_trylock() returns
> > > true\n", __func__);
> > >                 pci_bus_save_and_disable_locked(bus);
> > >                 might_sleep();
> > >                 rc = pci_bridge_secondary_bus_reset(bus->self);
> > > @@ -5881,6 +5896,7 @@ static int __pci_reset_bus(struct pci_bus *bus)
> > >   */
> > >  int pci_reset_bus(struct pci_dev *pdev)
> > >  {
> > > +       pci_info(pdev, "%s: %s", __func__,
> > > !pci_probe_reset_slot(pdev->slot) ? "true" : "false");
> > >         return (!pci_probe_reset_slot(pdev->slot)) ?
> > >             __pci_reset_slot(pdev->slot) : __pci_reset_bus(pdev->bus);
> > >  }
> > >
> > > And, have the information of VMD PCIe devices with the built kernel:
> > >
> > > 10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core
> > > Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal
> > > decode])
> > >   ...
> > >   Capabilities: [200 v1] L1 PM Substates
> > >     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
> > >       PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > >     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > >       T_CommonMode=0us LTR1.2_Threshold=0ns
> > >     L1SubCtl2: T_PwrOn=0us
> > >
> > > 10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD
> > > Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
> > >   ...
> > >   Capabilities: [900 v1] L1 PM Substates
> > >     L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
> > >       PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > >     L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > >       T_CommonMode=0us LTR1.2_Threshold=101376ns
> > >     L1SubCtl2: T_PwrOn=50us
> > >
> > > We can see the NVMe has expected LTR1.2_Threshold=101376ns, but the
> > > PCIe bridge has LTR1.2_Threshold=0ns.
> >
> > This is now the other way around as in the original posting that had
> > 0ns for 10000:e1:00.0 ??
> >
> > Is this behavior even consistent or did you e.g. mess up some copy
> > pasting somewhere?
> >
> > > Then, check the dmesg.  I notice the debug messages:
> > >
> > > pci 10000:e0:06.0: PCI bridge to [bus e1]
> > > pci 10000:e0:06.0: Primary bus is hard wired to 0
> > > pci 10000:e1:00.0: pci_reset_bus: false
> > > pci 10000:e0:06.0: pci_bus_reset: probe is true.  So return 0 directly
> > > pci 10000:e0:06.0: __pci_reset_bus: pci_bus_reset() returns 0
> > > pci 10000:e0:06.0: __pci_reset_bus: pci_bus_trylock() returns true
> > > pci 10000:e1:00.0: pci_bus_save_and_disable_locked: PCI state_saved is
> > > false, and does not have subordinate
> > > pci 10000:e1:00.0: pci_dev_save_and_disable: PCI state_saved is true
> > > Freeing initrd memory: 75236K
> > > pci 10000:e1:00.0: pci_bus_restore_locked: PCI state_saved is true,
> > > and does not have subordinate
> > >
> > > So, the code path is:
> > >
> > > vmd_enable_domain()
> > >   pci_reset_bus()
> > >     __pci_reset_bus()
> > >       pci_bus_reset()
> > >         pci_bus_save_and_disable_locked()
> > >           pci_dev_save_and_disable()
> > >         pci_bus_restore_locked()
> > >           pci_dev_restore()
> > >
> > > And, from the debug messages, I learned only NVMe 10000:e1:00.0 does
> > > pci_save/restore_state.  But, the PCIe bridge 10000:e0:06.0 does not.
> > > So, PCIe bridge 10000:e0:06.0 does not restore state correctly.
> >
> > It should not be necessary to restore the bridge's configuration as it
> > ought to not be changed by the SBR itself, PCIe6 spec 7.5.1.3.13:
> >
> > "Port configuration registers must not be changed, except as required to
> > update Port status."
> >
> > While the second part wording leaves some leeway, I don't think any of
> > these field really fall under "Port status".
> >
> > > Besides, it is NVMe 10000:e1:00.0's bus [e1] been reset, not the VMD's
> > > bus in vmd_enable_domain().
> > > * Bus "e1" has only NVMe 10000:e1:00.0
> > > * VMD's bus in vmd_enable_domain() has PCIe bridge 10000:e0:06.0, NVMe
> > > 10000:e1:00.0 and SATA Controller 10000:e0:17.0.
> >
> > ...But even if those registers on the PCIe bridge were changed underneath
> > against the spec, it's not clear from your debug log why pci_dev_restore()
> > -> pci_restore_state() -> pci_restore_pcie_state() ->
> > pci_restore_aspm_l1ss_state() did not restore also parent's
> > LTR1.2_Threshold?? I think it should attempt to do that.
> >
> > --
> >  i.
> >
> > > Here is the PCI tree:
> > >
> > > -+-[0000:00]-+-00.0  Intel Corporation Device 9a04
> > >  |           +-02.0  Intel Corporation Tiger Lake-LP GT2 [UHD Graphics G4]
> > >  |           +-04.0  Intel Corporation TigerLake-LP Dynamic Tuning
> > > Processor Participant
> > >  |           +-06.0  Intel Corporation RST VMD Managed Controller
> > >  |           +-07.0-[01-2b]--
> > >  |           +-08.0  Intel Corporation GNA Scoring Accelerator module
> > >  |           +-0a.0  Intel Corporation Tigerlake Telemetry Aggregator Driver
> > >  |           +-0d.0  Intel Corporation Tiger Lake-LP Thunderbolt 4 USB
> > > Controller
> > >  |           +-0d.2  Intel Corporation Tiger Lake-LP Thunderbolt 4 NHI #0
> > >  |           +-0e.0  Intel Corporation Volume Management Device NVMe
> > > RAID Controller
> > >  |           +-14.0  Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1
> > > xHCI Host Controller
> > >  |           +-14.2  Intel Corporation Tiger Lake-LP Shared SRAM
> > >  |           +-14.3  Intel Corporation Wi-Fi 6 AX201
> > >  |           +-15.0  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #0
> > >  |           +-15.1  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #1
> > >  |           +-16.0  Intel Corporation Tiger Lake-LP Management Engine Interface
> > >  |           +-17.0  Intel Corporation RST VMD Managed Controller
> > >  |           +-1f.0  Intel Corporation Tiger Lake-LP LPC Controller
> > >  |           +-1f.3  Intel Corporation Tiger Lake-LP Smart Sound
> > > Technology Audio Controller
> > >  |           +-1f.4  Intel Corporation Tiger Lake-LP SMBus Controller
> > >  |           +-1f.5  Intel Corporation Tiger Lake-LP SPI Controller
> > >  |           \-1f.6  Intel Corporation Ethernet Connection (13) I219-V
> > >  \-[10000:e0]-+-06.0-[e1]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD
> > >               \-17.0  Intel Corporation Tiger Lake-LP SATA Controller
> > >
> > > According the findings above, to ensure the devices on the VMD bus
> > > have correctly states, seems pci_save_state() all the devices before
> > > pci_reset_bus(), and pci_restore_state() all the devices after
> > > pci_reset_bus() is the correct answer.
> 
> I add some debug messages based on v6.11 as following:
> 
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index cee2365e54b8..404ce92f0152 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -109,21 +109,28 @@ void pci_restore_aspm_l1ss_state(struct pci_dev *pdev)
>         u32 cl_ctl1, cl_ctl2, cl_l1_2_enable;
>         u16 clnkctl, plnkctl;
> 
> +       pci_info(pdev, "%s", __func__);
>         /*
>          * In case BIOS enabled L1.2 when resuming, we need to disable it first
>          * on the downstream component before the upstream. So, don't attempt to
>          * restore either until we are at the downstream component.
>          */
> -       if (pcie_downstream_port(pdev) || !parent)
> +       if (pcie_downstream_port(pdev) || !parent) {
> +               pci_info(pdev, "%s: %s", __func__,
> pcie_downstream_port(pdev) ? "is downstream port" : "not parent");
>                 return;
> +       }
> 
> -       if (!pdev->l1ss || !parent->l1ss)
> +       if (!pdev->l1ss || !parent->l1ss) {
> +               pci_info(pdev, "%s: %sdoes not have l1ss", __func__,
> !pdev->l1ss ? "" : "parent ");
>                 return;
> +       }
> 
>         cl_save_state = pci_find_saved_ext_cap(pdev, PCI_EXT_CAP_ID_L1SS);
>         pl_save_state = pci_find_saved_ext_cap(parent, PCI_EXT_CAP_ID_L1SS);
> -       if (!cl_save_state || !pl_save_state)
> +       if (!cl_save_state || !pl_save_state) {
> +               pci_info(pdev, "%s: %sdid not save ext_cap", __func__,
> !cl_save_state ? "" : "parent ");
>                 return;
> +       }
> 
>         cap = &cl_save_state->cap.data[0];
>         cl_ctl2 = *cap++;
> @@ -131,6 +138,7 @@ void pci_restore_aspm_l1ss_state(struct pci_dev *pdev)
>         cap = &pl_save_state->cap.data[0];
>         pl_ctl2 = *cap++;
>         pl_ctl1 = *cap;
> +       pci_info(pdev, "%s: cl_ctl1: 0x%08x, cl_ctl2: 0x%08x, pl_ctl1:
> 0x%08x, pl_ctl2: 0x%08x", __func__, cl_ctl1, cl_ctl2, pl_ctl1,
> pl_ctl2);
> 
>         /* Make sure L0s/L1 are disabled before updating L1SS config */
>         pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &clnkctl);
> 
> Here is the corresponding log:
> 
> [    0.418931] pci 10000:e0:06.0: PCI bridge to [bus e1]
> [    0.418936] pci 10000:e0:06.0: Primary bus is hard wired to 0
> [    0.447474] Freeing initrd memory: 24700K
> [    0.670789] pci 10000:e1:00.0: pci_restore_aspm_l1ss_state
> [    0.670807] pci 10000:e1:00.0: pci_restore_aspm_l1ss_state:
> cl_ctl1: 0x40630000, cl_ctl2: 0x00000029, pl_ctl1: 0x00000000,
> pl_ctl2: 0x00000000
> [    0.670862] pci 10000:e0:06.0: bridge window [mem
> 0x76000000-0x760fffff]: assigned
> [    0.670864] pci 10000:e0:17.0: BAR 0 [mem 0x76100000-0x76101fff]: assigned
> [    0.670881] pci 10000:e0:06.0: bridge window [io  size 0x1000]:
> can't assign; no space
> [    0.670882] pci 10000:e0:06.0: bridge window [io  size 0x1000]:
> failed to assign
> [    0.670884] pci 10000:e0:17.0: BAR 5 [mem 0x76102000-0x761027ff]: assigned
> [    0.670893] pci 10000:e0:17.0: BAR 1 [mem 0x76102800-0x761028ff]: assigned
> [    0.670896] pci 10000:e0:17.0: BAR 4 [io  size 0x0020]: can't
> assign; no space
> [    0.670897] pci 10000:e0:17.0: BAR 4 [io  size 0x0020]: failed to assign
> [    0.670898] pci 10000:e0:17.0: BAR 2 [io  size 0x0008]: can't
> assign; no space
> [    0.670898] pci 10000:e0:17.0: BAR 2 [io  size 0x0008]: failed to assign
> [    0.670899] pci 10000:e0:17.0: BAR 3 [io  size 0x0004]: can't
> assign; no space
> [    0.670900] pci 10000:e0:17.0: BAR 3 [io  size 0x0004]: failed to assign
> [    0.670901] pci 10000:e1:00.0: BAR 0 [mem 0x76000000-0x76003fff
> 64bit]: assigned
> [    0.670909] pci 10000:e1:00.0: BAR 4 [mem 0x76004000-0x760040ff
> 64bit]: assigned
> [    0.670918] pci 10000:e0:06.0: PCI bridge to [bus e1]
> [    0.670921] pci 10000:e0:06.0:   bridge window [mem 0x76000000-0x760fffff]
> [    0.670950] pci 10000:e1:00.0: VMD: Default LTR value set by driver
> [    0.670977] pcieport 10000:e0:06.0: can't derive routing for PCI INT D
> [    0.670979] pcieport 10000:e0:06.0: PCI INT D: no GSI
> [    0.671043] pcieport 10000:e0:06.0: PME: Signaling with IRQ 143
> [    0.671092] vmd 0000:00:0e.0: Bound to PCI domain 10000
> 
> We can notice both pl_ctl1 and pl_ctl2 are 0x0.  Because, the link's
> parent device (PCIe bridge 10000:e0:06.0) did not save state before
> reset.  So, pci_restore_aspm_l1ss_state() restores parent's
> LTR1.2_Threshold with a wrong value 0.

Okay, this is very useful finding because it leads us to the root cause, 
which is the disparity between pci_save_aspm_l1ss_state() and 
pci_restore_aspm_l1ss_state(). The latter restores both dev and its 
parent, yet the save state side only saves the ASPM state for the dev 
itself, not for the parent.
diff mbox series

Patch

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index 5db1044c9895..55ff1d26fcea 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -1411,9 +1411,15 @@  EXPORT_SYMBOL(pci_disable_link_state);
 static int __pci_enable_link_state(struct pci_dev *pdev, int state, bool locked)
 {
 	struct pcie_link_state *link = pcie_aspm_get_link(pdev);
+	u32 parent_l1ss_cap, child_l1ss_cap;
+	struct pci_dev *parent, *child;
 
 	if (!link)
 		return -EINVAL;
+
+	parent = link->pdev;
+	child = link->downstream;
+
 	/*
 	 * A driver requested that ASPM be enabled on this device, but
 	 * if we don't have permission to manage ASPM (e.g., on ACPI
@@ -1428,6 +1434,15 @@  static int __pci_enable_link_state(struct pci_dev *pdev, int state, bool locked)
 	if (!locked)
 		down_read(&pci_bus_sem);
 	mutex_lock(&aspm_lock);
+	/*
+	 * Ensure L1.2 parameters: Common_Mode_Restore_Times, T_POWER_ON and
+	 * LTR_L1.2_THRESHOLD are programmed properly before enable bits for
+	 * L1.2, per PCIe r6.0, sec 5.5.4.
+	 */
+	parent_l1ss_cap = aspm_get_l1ss_cap(parent);
+	child_l1ss_cap = aspm_get_l1ss_cap(child);
+	aspm_calc_l12_info(link, parent_l1ss_cap, child_l1ss_cap);
+
 	link->aspm_default = pci_calc_aspm_enable_mask(state);
 	pcie_config_aspm_link(link, policy_to_aspm_state(link));