Message ID | 1663669347-29308-2-git-send-email-quic_krichai@quicinc.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Lorenzo Pieralisi |
Series | PCI: qcom: Add system suspend & resume support |
On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote:
> Add suspend and resume syscore ops.
>
> Few PCIe endpoints like NVMe and WLANs are always expecting the device
> to be in D0 state and the link to be active (or in l1ss) all the time
> (including in S3 state).

What does this have to do with the patch? I don't see any NVMe or
WLAN patches here.

> In qcom platform PCIe resources( clocks, phy etc..) can released
> when the link is in L1ss to reduce the power consumption. So if the link
> is in L1ss, release the PCIe resources. And when the system resumes,
> enable the PCIe resources if they released in the suspend path.

What's the connection with L1.x? Links enter L1.x based on activity
and timing. That doesn't seem like a reliable indicator to turn PHYs
off and disable clocks.

> is_suspended flag indicates if the PCIe resources are released or not
> in the suspend path.

Why is "is_suspended" important for the commit log? It looks like
just a standard implementation detail.

> Its observed that access to Ep PCIe space to mask MSI/MSIX is happening
> at the very late stage of suspend path (access by affinity changes while
> making CPUs offline during suspend, this will happen after devices are
> suspended (after all phases of suspend ops)). If we turn off clocks in
> any PM callback, afterwards running into crashes due to un-clocked access
> due to above mentioned MSI/MSIx access.
> So, we are making use of syscore framework to turn off the PCIe clocks
> which will be called after making CPUs offline.

Add blank lines between paragraphs. Or rewrap into a single paragraph.

s/Its observed/It's observed/
s/MSIX/MSI-X/ throughout
s/MSIx/MSI-X/ throughout

Bjorn

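The ordering argument in the last quoted paragraph of the commit log can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the `Controller` class, the `UnclockedAccess` exception and the phase names are hypothetical and exist purely to show why the clocks are turned off in a syscore op rather than in a regular PM callback.

```python
class UnclockedAccess(Exception):
    """Raised when endpoint PCIe space is touched while clocks are off."""


class Controller:
    """Hypothetical stand-in for the PCIe controller; not a kernel API."""

    def __init__(self):
        self.clocks_on = True

    def disable_clocks(self):
        self.clocks_on = False

    def mask_msi(self):
        # Masking MSI/MSI-X is an access to endpoint PCIe space,
        # which in this model requires the clocks to be running.
        if not self.clocks_on:
            raise UnclockedAccess("MSI/MSI-X mask access with clocks off")


def suspend(ctrl, clocks_off_in):
    """Run the modeled suspend phases in order and return the trace.

    clocks_off_in is "device_pm" (a regular PM callback) or "syscore"
    (a syscore suspend op).
    """
    trace = []

    # 1. Device PM callbacks (all phases) run first.
    trace.append("device_pm")
    if clocks_off_in == "device_pm":
        ctrl.disable_clocks()

    # 2. CPUs are taken offline; the resulting affinity changes
    #    mask MSI/MSI-X in the endpoint.
    trace.append("cpus_offline")
    ctrl.mask_msi()

    # 3. Syscore suspend ops run after the CPUs are offline.
    trace.append("syscore")
    if clocks_off_in == "syscore":
        ctrl.disable_clocks()

    return trace
```

In this model, `suspend(Controller(), "device_pm")` raises `UnclockedAccess` during the CPU-offline step, while `suspend(Controller(), "syscore")` completes with the clocks turned off last.
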
On 9/20/2022 3:22 AM, Krishna chaitanya chundru wrote:
> Add suspend and resume syscore ops.
>
> Few PCIe endpoints like NVMe and WLANs are always expecting the device
> to be in D0 state and the link to be active (or in l1ss) all the time
> (including in S3 state).
>
> In qcom platform PCIe resources( clocks, phy etc..) can released

can *be* released... ??

> when the link is in L1ss to reduce the power consumption. So if the link
> is in L1ss, release the PCIe resources. And when the system resumes,
> enable the PCIe resources if they released in the suspend path.

if they *were* released... ??

>
> is_suspended flag indicates if the PCIe resources are released or not
> in the suspend path.
>
> Its observed that access to Ep PCIe space to mask MSI/MSIX is happening

s/Its/It's/

> at the very late stage of suspend path (access by affinity changes while
> making CPUs offline during suspend, this will happen after devices are
> suspended (after all phases of suspend ops)). If we turn off clocks in

All those parenthesis, thought I was reading Lisp. Can you rewrite in
conversational English?

On 9/20/2022 11:46 PM, Bjorn Helgaas wrote:
> On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote:
>> Add suspend and resume syscore ops.
>>
>> Few PCIe endpoints like NVMe and WLANs are always expecting the device
>> to be in D0 state and the link to be active (or in l1ss) all the time
>> (including in S3 state).
> What does this have to do with the patch? I don't see any NVMe or
> WLAN patches here.

The existing NVMe driver expects the NVMe device to stay in D0 during S3
as well. If we turn off the link in suspend, the NVMe resume path breaks
because the state machine in the NVMe device gets reset. As a result, the
host driver state machine and the device state machine go out of sync,
and all NVMe commands issued after resume time out.

IIRC, Tegra is also facing this issue with NVMe.

This issue has been discussed in the threads below:

https://lore.kernel.org/all/Yl+6V3pWuyRYuVV8@infradead.org/T/

https://lore.kernel.org/linux-nvme/20220201165006.3074615-1-kbusch@kernel.org/

>> In qcom platform PCIe resources( clocks, phy etc..) can released
>> when the link is in L1ss to reduce the power consumption. So if the link
>> is in L1ss, release the PCIe resources. And when the system resumes,
>> enable the PCIe resources if they released in the suspend path.
> What's the connection with L1.x? Links enter L1.x based on activity
> and timing. That doesn't seem like a reliable indicator to turn PHYs
> off and disable clocks.

This is a Qcom PHY-specific feature (retaining the link state in L1.x
with clocks turned off). It is possible only when the link is in L1.x.
The PHY can't retain the link state in L0 with the clocks turned off, and
we need to retrain the link if it's in L2 or L3. So we can support this
feature only with L1.x. That is why we take L1.x as the trigger to turn
off clocks (only in the suspend path).

>> is_suspended flag indicates if the PCIe resources are released or not
>> in the suspend path.
> Why is "is_suspended" important for the commit log? It looks like
> just a standard implementation detail.

Someone asked for this to be included in the commit text on one of the
previous patches.

>> Its observed that access to Ep PCIe space to mask MSI/MSIX is happening
>> at the very late stage of suspend path (access by affinity changes while
>> making CPUs offline during suspend, this will happen after devices are
>> suspended (after all phases of suspend ops)). If we turn off clocks in
>> any PM callback, afterwards running into crashes due to un-clocked access
>> due to above mentioned MSI/MSIx access.
>> So, we are making use of syscore framework to turn off the PCIe clocks
>> which will be called after making CPUs offline.
> Add blank lines between paragraphs. Or rewrap into a single paragraph.
>
> s/Its observed/It's observed/
> s/MSIX/MSI-X/ throughout
> s/MSIx/MSI-X/ throughout
>
> Bjorn

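The rule described in the reply above (release clocks and PHY only when the link is found in L1.x at suspend, and restore them on resume only if they were released) can be sketched as a small model. This is an illustrative Python sketch, not the driver code: `LinkState` and `LinkModel` are hypothetical names, and only the `is_suspended` flag comes from the commit log. The sketch samples the link state once at suspend time, which is exactly the point the review questions, since the link may enter or leave L1.x without software involvement.

```python
from enum import Enum


class LinkState(Enum):
    L0 = "L0"
    L0S = "L0s"
    L1_SUBSTATE = "L1.x"
    L2 = "L2"
    L3 = "L3"


class LinkModel:
    """Hypothetical model of the suspend/resume decision; not driver code."""

    def __init__(self, state):
        self.state = state
        self.resources_on = True    # clocks, PHY, etc.
        self.is_suspended = False   # flag named in the commit log

    def suspend(self):
        # Per the description, only L1.x lets the PHY retain the link
        # state with clocks off; L0 cannot, and L2/L3 need retraining.
        if self.state is LinkState.L1_SUBSTATE:
            self.resources_on = False
            self.is_suspended = True

    def resume(self):
        # Re-enable resources only if the suspend path released them.
        if self.is_suspended:
            self.resources_on = True
            self.is_suspended = False
```

A link sampled in `LinkState.L0` keeps its resources across `suspend()`, while one sampled in `LinkState.L1_SUBSTATE` has them released and then restored by `resume()`.
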
[+cc Rafael, linux-pm since this is real power management magic, beginning of thread: https://lore.kernel.org/all/1663669347-29308-1-git-send-email-quic_krichai@quicinc.com/ full patch since I trimmed too much of it: https://lore.kernel.org/all/1663669347-29308-2-git-send-email-quic_krichai@quicinc.com/] On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya Chundru wrote: > On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: > > On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote: > > > Add suspend and resume syscore ops. > > > > > > Few PCIe endpoints like NVMe and WLANs are always expecting the device > > > to be in D0 state and the link to be active (or in l1ss) all the time > > > (including in S3 state). > > > > What does this have to do with the patch? I don't see any NVMe or > > WLAN patches here. > > Existing NVMe driver expecting NVMe device to be in D0 during S3 also. If we > turn off the link in > suspend, the NVMe resume path is broken as the state machine is getting > reset in the NVMe device. > Due to this, the host driver state machine and the device state machine are > going out of sync, and all NVMe commands > after resumes are getting timed out. > > IIRC, Tegra is also facing this issue with NVMe. > > This issue has been discussed below threads: > > https://lore.kernel.org/all/Yl+6V3pWuyRYuVV8@infradead.org/T/ > > https://lore.kernel.org/linux-nvme/20220201165006.3074615-1-kbusch@kernel.org/ The problem is that this commit log doesn't explain the problem and doesn't give us anything to connect the NVMe and WLAN assumptions with this special driver behavior. There needs to be some explicit property of NVMe and WLAN that the PM core or drivers like qcom can use to tell whether the clocks can be turned off. > > > In qcom platform PCIe resources( clocks, phy etc..) can released > > > when the link is in L1ss to reduce the power consumption. So if the link > > > is in L1ss, release the PCIe resources. 
And when the system resumes, > > > enable the PCIe resources if they released in the suspend path. > > > > What's the connection with L1.x? Links enter L1.x based on activity > > and timing. That doesn't seem like a reliable indicator to turn PHYs > > off and disable clocks. > > This is a Qcom PHY-specific feature (retaining the link state in L1.x with > clocks turned off). > It is possible only with the link being in l1.x. PHY can't retain the link > state in L0 with the > clocks turned off and we need to re-train the link if it's in L2 or L3. So > we can support this feature only with L1.x. > That is the reason we are taking l1.x as the trigger to turn off clocks (in > only suspend path). This doesn't address my question. L1.x is an ASPM feature, which means hardware may enter or leave L1.x autonomously at any time without software intervention. Therefore, I don't think reading the current state is a reliable way to decide anything. > ... > > > Its observed that access to Ep PCIe space to mask MSI/MSIX is happening > > > at the very late stage of suspend path (access by affinity changes while > > > making CPUs offline during suspend, this will happen after devices are > > > suspended (after all phases of suspend ops)). If we turn off clocks in > > > any PM callback, afterwards running into crashes due to un-clocked access > > > due to above mentioned MSI/MSIx access. > > > So, we are making use of syscore framework to turn off the PCIe clocks > > > which will be called after making CPUs offline.
On 9/21/2022 10:26 PM, Bjorn Helgaas wrote: > [+cc Rafael, linux-pm since this is real power management magic, > beginning of thread: > https://lore.kernel.org/all/1663669347-29308-1-git-send-email-quic_krichai@quicinc.com/ > full patch since I trimmed too much of it: > https://lore.kernel.org/all/1663669347-29308-2-git-send-email-quic_krichai@quicinc.com/] > > On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya Chundru wrote: >> On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: >>> On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote: >>>> Add suspend and resume syscore ops. >>>> >>>> Few PCIe endpoints like NVMe and WLANs are always expecting the device >>>> to be in D0 state and the link to be active (or in l1ss) all the time >>>> (including in S3 state). >>> What does this have to do with the patch? I don't see any NVMe or >>> WLAN patches here. >> Existing NVMe driver expecting NVMe device to be in D0 during S3 also. If we >> turn off the link in >> suspend, the NVMe resume path is broken as the state machine is getting >> reset in the NVMe device. >> Due to this, the host driver state machine and the device state machine are >> going out of sync, and all NVMe commands >> after resumes are getting timed out. >> >> IIRC, Tegra is also facing this issue with NVMe. >> >> This issue has been discussed below threads: >> >> https://lore.kernel.org/all/Yl+6V3pWuyRYuVV8@infradead.org/T/ >> >> https://lore.kernel.org/linux-nvme/20220201165006.3074615-1-kbusch@kernel.org/ > The problem is that this commit log doesn't explain the problem and > doesn't give us anything to connect the NVMe and WLAN assumptions with > this special driver behavior. There needs to be some explicit > property of NVMe and WLAN that the PM core or drivers like qcom can > use to tell whether the clocks can be turned off. Not only that NVMe is expecting the device state to be always in D0. 
So PCIe drivers should not turn off the link in suspend and retrain it
in resume, as the NVMe device treats this as a power cycle, which
eventually increases the wear of the NVMe flash.

We are trying to keep the device in D0 and also reduce the power
consumption when the system is in S3 by turning off clocks and phy with
this patch series.

>
>>>> In qcom platform PCIe resources( clocks, phy etc..) can released
>>>> when the link is in L1ss to reduce the power consumption. So if the link
>>>> is in L1ss, release the PCIe resources. And when the system resumes,
>>>> enable the PCIe resources if they released in the suspend path.
>>> What's the connection with L1.x? Links enter L1.x based on activity
>>> and timing. That doesn't seem like a reliable indicator to turn PHYs
>>> off and disable clocks.
>> This is a Qcom PHY-specific feature (retaining the link state in L1.x with
>> clocks turned off).
>> It is possible only with the link being in l1.x. PHY can't retain the link
>> state in L0 with the
>> clocks turned off and we need to re-train the link if it's in L2 or L3. So
>> we can support this feature only with L1.x.
>> That is the reason we are taking l1.x as the trigger to turn off clocks (in
>> only suspend path).
> This doesn't address my question. L1.x is an ASPM feature, which
> means hardware may enter or leave L1.x autonomously at any time
> without software intervention. Therefore, I don't think reading the
> current state is a reliable way to decide anything.

After the link enters L1.x, it will come out only if there is some
activity on the link. As the system is suspended and the NVMe driver is
also suspended (queues are frozen in suspend), who else can initiate any
data?

As long as the link stays in L1ss we can turn off clocks and phy. When
the system resumes, we turn on clocks and phy before resuming the NVMe.
This makes sure the clocks and phy are up before there is any activity
that brings the link back to L0 from L1.x.

>
>> ...
>>>> Its observed that access to Ep PCIe space to mask MSI/MSIX is happening >>>> at the very late stage of suspend path (access by affinity changes while >>>> making CPUs offline during suspend, this will happen after devices are >>>> suspended (after all phases of suspend ops)). If we turn off clocks in >>>> any PM callback, afterwards running into crashes due to un-clocked access >>>> due to above mentioned MSI/MSIx access. >>>> So, we are making use of syscore framework to turn off the PCIe clocks >>>> which will be called after making CPUs offline.
On Thu, Sep 22, 2022 at 09:09:28PM +0530, Krishna Chaitanya Chundru wrote: > On 9/21/2022 10:26 PM, Bjorn Helgaas wrote: > > [+cc Rafael, linux-pm since this is real power management magic, > > beginning of thread: > > https://lore.kernel.org/all/1663669347-29308-1-git-send-email-quic_krichai@quicinc.com/ > > full patch since I trimmed too much of it: > > https://lore.kernel.org/all/1663669347-29308-2-git-send-email-quic_krichai@quicinc.com/] > > > > On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya Chundru wrote: > > > On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: > > > > On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote: > > > > > Add suspend and resume syscore ops. > > > > > > > > > > Few PCIe endpoints like NVMe and WLANs are always expecting the device > > > > > to be in D0 state and the link to be active (or in l1ss) all the time > > > > > (including in S3 state). > > > > What does this have to do with the patch? I don't see any NVMe or > > > > WLAN patches here. > > > Existing NVMe driver expecting NVMe device to be in D0 during S3 also. If we > > > turn off the link in > > > suspend, the NVMe resume path is broken as the state machine is getting > > > reset in the NVMe device. > > > Due to this, the host driver state machine and the device state machine are > > > going out of sync, and all NVMe commands > > > after resumes are getting timed out. > > > > > > IIRC, Tegra is also facing this issue with NVMe. > > > > > > This issue has been discussed below threads: > > > > > > https://lore.kernel.org/all/Yl+6V3pWuyRYuVV8@infradead.org/T/ > > > > > > https://lore.kernel.org/linux-nvme/20220201165006.3074615-1-kbusch@kernel.org/ > > The problem is that this commit log doesn't explain the problem and > > doesn't give us anything to connect the NVMe and WLAN assumptions with > > this special driver behavior. 
There needs to be some explicit > > property of NVMe and WLAN that the PM core or drivers like qcom can > > use to tell whether the clocks can be turned off. > > Not only that NVMe is expecting the device state to be always in D0. > So any PCIe drivers should not turn off the link in suspend and do > link retraining in the resume. As this is considered a power cycle > by the NVMe device and eventually increases the wear of the NVMe > flash. I can't quite parse this. Are you saying that all PCI devices should stay in D0 when the system is in S3? > We are trying to keep the device in D0 and also reduce the power > consumption when the system is in S3 by turning off clocks and phy > with this patch series. The decision to keep a device in D0 is not up to qcom or any other PCI controller driver. > > > > > In qcom platform PCIe resources( clocks, phy etc..) can > > > > > released when the link is in L1ss to reduce the power > > > > > consumption. So if the link is in L1ss, release the PCIe > > > > > resources. And when the system resumes, enable the PCIe > > > > > resources if they released in the suspend path. > > > > > > > > What's the connection with L1.x? Links enter L1.x based on > > > > activity and timing. That doesn't seem like a reliable > > > > indicator to turn PHYs off and disable clocks. > > > > > > This is a Qcom PHY-specific feature (retaining the link state in > > > L1.x with clocks turned off). It is possible only with the link > > > being in l1.x. PHY can't retain the link state in L0 with the > > > clocks turned off and we need to re-train the link if it's in L2 > > > or L3. So we can support this feature only with L1.x. That is > > > the reason we are taking l1.x as the trigger to turn off clocks > > > (in only suspend path). > > > > This doesn't address my question. L1.x is an ASPM feature, which > > means hardware may enter or leave L1.x autonomously at any time > > without software intervention. 
Therefore, I don't think reading the > > current state is a reliable way to decide anything. > > After the link enters the L1.x it will come out only if there is > some activity on the link. AS system is suspended and NVMe driver > is also suspended( queues will freeze in suspend) who else can > initiate any data. I don't think we can assume that nothing will happen to cause exit from L1.x. For instance, PCIe Messages for INTx signaling, LTR, OBFF, PTM, etc., may be sent even though we think the device is idle and there should be no link activity. Bjorn
On 9/23/2022 12:12 AM, Bjorn Helgaas wrote: > On Thu, Sep 22, 2022 at 09:09:28PM +0530, Krishna Chaitanya Chundru wrote: >> On 9/21/2022 10:26 PM, Bjorn Helgaas wrote: >>> [+cc Rafael, linux-pm since this is real power management magic, >>> beginning of thread: >>> https://lore.kernel.org/all/1663669347-29308-1-git-send-email-quic_krichai@quicinc.com/ >>> full patch since I trimmed too much of it: >>> https://lore.kernel.org/all/1663669347-29308-2-git-send-email-quic_krichai@quicinc.com/] >>> >>> On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya Chundru wrote: >>>> On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: >>>>> On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote: >>>>>> Add suspend and resume syscore ops. >>>>>> >>>>>> Few PCIe endpoints like NVMe and WLANs are always expecting the device >>>>>> to be in D0 state and the link to be active (or in l1ss) all the time >>>>>> (including in S3 state). >>>>> What does this have to do with the patch? I don't see any NVMe or >>>>> WLAN patches here. >>>> Existing NVMe driver expecting NVMe device to be in D0 during S3 also. If we >>>> turn off the link in >>>> suspend, the NVMe resume path is broken as the state machine is getting >>>> reset in the NVMe device. >>>> Due to this, the host driver state machine and the device state machine are >>>> going out of sync, and all NVMe commands >>>> after resumes are getting timed out. >>>> >>>> IIRC, Tegra is also facing this issue with NVMe. >>>> >>>> This issue has been discussed below threads: >>>> >>>> https://lore.kernel.org/all/Yl+6V3pWuyRYuVV8@infradead.org/T/ >>>> >>>> https://lore.kernel.org/linux-nvme/20220201165006.3074615-1-kbusch@kernel.org/ >>> The problem is that this commit log doesn't explain the problem and >>> doesn't give us anything to connect the NVMe and WLAN assumptions with >>> this special driver behavior. 
There needs to be some explicit
>>> property of NVMe and WLAN that the PM core or drivers like qcom can
>>> use to tell whether the clocks can be turned off.
>> Not only that NVMe is expecting the device state to be always in D0.
>> So any PCIe drivers should not turn off the link in suspend and do
>> link retraining in the resume. As this is considered a power cycle
>> by the NVMe device and eventually increases the wear of the NVMe
>> flash.
> I can't quite parse this. Are you saying that all PCI devices should
> stay in D0 when the system is in S3?

Not all PCI devices, only some PCI devices like NVMe. The NVMe driver
expects the device to stay in D0.

>
>> We are trying to keep the device in D0 and also reduce the power
>> consumption when the system is in S3 by turning off clocks and phy
>> with this patch series.
> The decision to keep a device in D0 is not up to qcom or any other PCI
> controller driver.

Yes, it is the NVMe driver that decides to keep the device in D0. Our
QCOM PCI controller driver is trying to keep the device in the state the
client driver expects, while also reducing power consumption.

>
>>>>>> In qcom platform PCIe resources( clocks, phy etc..) can
>>>>>> released when the link is in L1ss to reduce the power
>>>>>> consumption. So if the link is in L1ss, release the PCIe
>>>>>> resources. And when the system resumes, enable the PCIe
>>>>>> resources if they released in the suspend path.
>>>>> What's the connection with L1.x? Links enter L1.x based on
>>>>> activity and timing. That doesn't seem like a reliable
>>>>> indicator to turn PHYs off and disable clocks.
>>>> This is a Qcom PHY-specific feature (retaining the link state in
>>>> L1.x with clocks turned off). It is possible only with the link
>>>> being in l1.x. PHY can't retain the link state in L0 with the
>>>> clocks turned off and we need to re-train the link if it's in L2
>>>> or L3. So we can support this feature only with L1.x.
That is
>>>> the reason we are taking l1.x as the trigger to turn off clocks
>>>> (in only suspend path).
>>> This doesn't address my question. L1.x is an ASPM feature, which
>>> means hardware may enter or leave L1.x autonomously at any time
>>> without software intervention. Therefore, I don't think reading the
>>> current state is a reliable way to decide anything.
>> After the link enters the L1.x it will come out only if there is
>> some activity on the link. AS system is suspended and NVMe driver
>> is also suspended( queues will freeze in suspend) who else can
>> initiate any data.
> I don't think we can assume that nothing will happen to cause exit
> from L1.x. For instance, PCIe Messages for INTx signaling, LTR, OBFF,
> PTM, etc., may be sent even though we think the device is idle and
> there should be no link activity.
>
> Bjorn

I don't think there will be any activity on the link of the kind you
mentioned after it enters L1.x, except for PCIe messages like
INTx/MSI/MSI-X. Those messages will not come either, because client
drivers like NVMe keep their devices in the lowest power mode.

The link comes out of L1.x only when there is a config or memory access,
or a message from the device to trigger an interrupt. We are already
making sure such accesses do not happen in S3. What you describe is
expected if the link is in L0 or L0s, but not in L1.x.

On Fri, Sep 23, 2022 at 07:29:31AM +0530, Krishna Chaitanya Chundru wrote: > > On 9/23/2022 12:12 AM, Bjorn Helgaas wrote: > > On Thu, Sep 22, 2022 at 09:09:28PM +0530, Krishna Chaitanya Chundru wrote: > > > On 9/21/2022 10:26 PM, Bjorn Helgaas wrote: > > > > On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya Chundru wrote: > > > > > On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: > > > > > > On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote: > > > > > > > In qcom platform PCIe resources( clocks, phy etc..) can > > > > > > > released when the link is in L1ss to reduce the power > > > > > > > consumption. So if the link is in L1ss, release the PCIe > > > > > > > resources. And when the system resumes, enable the PCIe > > > > > > > resources if they released in the suspend path. > > > > > > What's the connection with L1.x? Links enter L1.x based on > > > > > > activity and timing. That doesn't seem like a reliable > > > > > > indicator to turn PHYs off and disable clocks. > > > > > > > > > > This is a Qcom PHY-specific feature (retaining the link state in > > > > > L1.x with clocks turned off). It is possible only with the link > > > > > being in l1.x. PHY can't retain the link state in L0 with the > > > > > clocks turned off and we need to re-train the link if it's in L2 > > > > > or L3. So we can support this feature only with L1.x. That is > > > > > the reason we are taking l1.x as the trigger to turn off clocks > > > > > (in only suspend path). > > > > > > > > This doesn't address my question. L1.x is an ASPM feature, which > > > > means hardware may enter or leave L1.x autonomously at any time > > > > without software intervention. Therefore, I don't think reading the > > > > current state is a reliable way to decide anything. > > > > > > After the link enters the L1.x it will come out only if there is > > > some activity on the link. 
AS system is suspended and NVMe driver > > > is also suspended( queues will freeze in suspend) who else can > > > initiate any data. > > > > I don't think we can assume that nothing will happen to cause exit > > from L1.x. For instance, PCIe Messages for INTx signaling, LTR, OBFF, > > PTM, etc., may be sent even though we think the device is idle and > > there should be no link activity. > > I don't think after the link enters into L1.x there will some > activity on the link as you mentioned, except for PCIe messages like > INTx/MSI/MSIX. These messages also will not come because the client > drivers like NVMe will keep their device in the lowest power mode. > > The link will come out of L1.x only when there is config or memory > access or some messages to trigger the interrupts from the devices. > We are already making sure this access will not be there in S3. If > the link is in L0 or L0s what you said is expected but not in L1.x Forgive me for being skeptical, but we just spent a few months untangling the fact that some switches send PTM request messages even when they're in a non-D0 state. We expected that devices in D3hot would not send such messages because "why would they?" But it turns out the spec allows that, and they actually *do*. I don't think it's robust interoperable design for a PCI controller driver like qcom to assume anything about PCI devices unless it's required by the spec. Bjorn
On 9/23/2022 7:56 PM, Bjorn Helgaas wrote: > On Fri, Sep 23, 2022 at 07:29:31AM +0530, Krishna Chaitanya Chundru wrote: >> On 9/23/2022 12:12 AM, Bjorn Helgaas wrote: >>> On Thu, Sep 22, 2022 at 09:09:28PM +0530, Krishna Chaitanya Chundru wrote: >>>> On 9/21/2022 10:26 PM, Bjorn Helgaas wrote: >>>>> On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya Chundru wrote: >>>>>> On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: >>>>>>> On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote: >>>>>>>> In qcom platform PCIe resources( clocks, phy etc..) can >>>>>>>> released when the link is in L1ss to reduce the power >>>>>>>> consumption. So if the link is in L1ss, release the PCIe >>>>>>>> resources. And when the system resumes, enable the PCIe >>>>>>>> resources if they released in the suspend path. >>>>>>> What's the connection with L1.x? Links enter L1.x based on >>>>>>> activity and timing. That doesn't seem like a reliable >>>>>>> indicator to turn PHYs off and disable clocks. >>>>>> This is a Qcom PHY-specific feature (retaining the link state in >>>>>> L1.x with clocks turned off). It is possible only with the link >>>>>> being in l1.x. PHY can't retain the link state in L0 with the >>>>>> clocks turned off and we need to re-train the link if it's in L2 >>>>>> or L3. So we can support this feature only with L1.x. That is >>>>>> the reason we are taking l1.x as the trigger to turn off clocks >>>>>> (in only suspend path). >>>>> This doesn't address my question. L1.x is an ASPM feature, which >>>>> means hardware may enter or leave L1.x autonomously at any time >>>>> without software intervention. Therefore, I don't think reading the >>>>> current state is a reliable way to decide anything. >>>> After the link enters the L1.x it will come out only if there is >>>> some activity on the link. AS system is suspended and NVMe driver >>>> is also suspended( queues will freeze in suspend) who else can >>>> initiate any data. 
>>> I don't think we can assume that nothing will happen to cause exit >>> from L1.x. For instance, PCIe Messages for INTx signaling, LTR, OBFF, >>> PTM, etc., may be sent even though we think the device is idle and >>> there should be no link activity. >> I don't think after the link enters into L1.x there will some >> activity on the link as you mentioned, except for PCIe messages like >> INTx/MSI/MSIX. These messages also will not come because the client >> drivers like NVMe will keep their device in the lowest power mode. >> >> The link will come out of L1.x only when there is config or memory >> access or some messages to trigger the interrupts from the devices. >> We are already making sure this access will not be there in S3. If >> the link is in L0 or L0s what you said is expected but not in L1.x > Forgive me for being skeptical, but we just spent a few months > untangling the fact that some switches send PTM request messages even > when they're in a non-D0 state. We expected that devices in D3hot > would not send such messages because "why would they?" But it turns > out the spec allows that, and they actually *do*. > > I don't think it's robust interoperable design for a PCI controller > driver like qcom to assume anything about PCI devices unless it's > required by the spec. > > Bjorn We will check the spec once and will come back to you,
On 9/23/2022 7:56 PM, Bjorn Helgaas wrote: > On Fri, Sep 23, 2022 at 07:29:31AM +0530, Krishna Chaitanya Chundru wrote: >> On 9/23/2022 12:12 AM, Bjorn Helgaas wrote: >>> On Thu, Sep 22, 2022 at 09:09:28PM +0530, Krishna Chaitanya Chundru wrote: >>>> On 9/21/2022 10:26 PM, Bjorn Helgaas wrote: >>>>> On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya Chundru wrote: >>>>>> On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: >>>>>>> On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote: >>>>>>>> In qcom platform PCIe resources( clocks, phy etc..) can >>>>>>>> released when the link is in L1ss to reduce the power >>>>>>>> consumption. So if the link is in L1ss, release the PCIe >>>>>>>> resources. And when the system resumes, enable the PCIe >>>>>>>> resources if they released in the suspend path. >>>>>>> What's the connection with L1.x? Links enter L1.x based on >>>>>>> activity and timing. That doesn't seem like a reliable >>>>>>> indicator to turn PHYs off and disable clocks. >>>>>> This is a Qcom PHY-specific feature (retaining the link state in >>>>>> L1.x with clocks turned off). It is possible only with the link >>>>>> being in l1.x. PHY can't retain the link state in L0 with the >>>>>> clocks turned off and we need to re-train the link if it's in L2 >>>>>> or L3. So we can support this feature only with L1.x. That is >>>>>> the reason we are taking l1.x as the trigger to turn off clocks >>>>>> (in only suspend path). >>>>> This doesn't address my question. L1.x is an ASPM feature, which >>>>> means hardware may enter or leave L1.x autonomously at any time >>>>> without software intervention. Therefore, I don't think reading the >>>>> current state is a reliable way to decide anything. >>>> After the link enters the L1.x it will come out only if there is >>>> some activity on the link. AS system is suspended and NVMe driver >>>> is also suspended( queues will freeze in suspend) who else can >>>> initiate any data. 
>>> I don't think we can assume that nothing will happen to cause exit >>> from L1.x. For instance, PCIe Messages for INTx signaling, LTR, OBFF, >>> PTM, etc., may be sent even though we think the device is idle and >>> there should be no link activity. >> I don't think after the link enters into L1.x there will some >> activity on the link as you mentioned, except for PCIe messages like >> INTx/MSI/MSIX. These messages also will not come because the client >> drivers like NVMe will keep their device in the lowest power mode. >> >> The link will come out of L1.x only when there is config or memory >> access or some messages to trigger the interrupts from the devices. >> We are already making sure this access will not be there in S3. If >> the link is in L0 or L0s what you said is expected but not in L1.x > Forgive me for being skeptical, but we just spent a few months > untangling the fact that some switches send PTM request messages even > when they're in a non-D0 state. We expected that devices in D3hot > would not send such messages because "why would they?" But it turns > out the spec allows that, and they actually *do*. > > I don't think it's robust interoperable design for a PCI controller > driver like qcom to assume anything about PCI devices unless it's > required by the spec. > > Bjorn From pci spec 4, in sec 5.5 "Ports that support L1 PM Substates must not require a reference clock while in L1 PM Substates other than L1.0". If there is no reference clk we can say there is no activity on the link. If anything needs to be sent (such as LTR, or some messages ), the link needs to be back in L0 before it sends the packet to the link partner. To exit from L1.x clkreq pin should be asserted. In suspend after turning off clocks and phy we can enable to trigger an interrupt whenever the clk req pin asserts. In that interrupt handler, we can enable the pcie resources back. What are your thoughts on this?
On 9/25/2022 7:23 AM, Krishna Chaitanya Chundru wrote: > > On 9/23/2022 7:56 PM, Bjorn Helgaas wrote: >> On Fri, Sep 23, 2022 at 07:29:31AM +0530, Krishna Chaitanya Chundru >> wrote: >>> On 9/23/2022 12:12 AM, Bjorn Helgaas wrote: >>>> On Thu, Sep 22, 2022 at 09:09:28PM +0530, Krishna Chaitanya Chundru >>>> wrote: >>>>> On 9/21/2022 10:26 PM, Bjorn Helgaas wrote: >>>>>> On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya >>>>>> Chundru wrote: >>>>>>> On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: >>>>>>>> On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya >>>>>>>> chundru wrote: >>>>>>>>> In qcom platform PCIe resources( clocks, phy etc..) can >>>>>>>>> released when the link is in L1ss to reduce the power >>>>>>>>> consumption. So if the link is in L1ss, release the PCIe >>>>>>>>> resources. And when the system resumes, enable the PCIe >>>>>>>>> resources if they released in the suspend path. >>>>>>>> What's the connection with L1.x? Links enter L1.x based on >>>>>>>> activity and timing. That doesn't seem like a reliable >>>>>>>> indicator to turn PHYs off and disable clocks. >>>>>>> This is a Qcom PHY-specific feature (retaining the link state in >>>>>>> L1.x with clocks turned off). It is possible only with the link >>>>>>> being in l1.x. PHY can't retain the link state in L0 with the >>>>>>> clocks turned off and we need to re-train the link if it's in L2 >>>>>>> or L3. So we can support this feature only with L1.x. That is >>>>>>> the reason we are taking l1.x as the trigger to turn off clocks >>>>>>> (in only suspend path). >>>>>> This doesn't address my question. L1.x is an ASPM feature, which >>>>>> means hardware may enter or leave L1.x autonomously at any time >>>>>> without software intervention. Therefore, I don't think reading the >>>>>> current state is a reliable way to decide anything. >>>>> After the link enters the L1.x it will come out only if there is >>>>> some activity on the link. 
AS system is suspended and NVMe driver >>>>> is also suspended( queues will freeze in suspend) who else can >>>>> initiate any data. >>>> I don't think we can assume that nothing will happen to cause exit >>>> from L1.x. For instance, PCIe Messages for INTx signaling, LTR, OBFF, >>>> PTM, etc., may be sent even though we think the device is idle and >>>> there should be no link activity. >>> I don't think after the link enters into L1.x there will some >>> activity on the link as you mentioned, except for PCIe messages like >>> INTx/MSI/MSIX. These messages also will not come because the client >>> drivers like NVMe will keep their device in the lowest power mode. >>> >>> The link will come out of L1.x only when there is config or memory >>> access or some messages to trigger the interrupts from the devices. >>> We are already making sure this access will not be there in S3. If >>> the link is in L0 or L0s what you said is expected but not in L1.x >> Forgive me for being skeptical, but we just spent a few months >> untangling the fact that some switches send PTM request messages even >> when they're in a non-D0 state. We expected that devices in D3hot >> would not send such messages because "why would they?" But it turns >> out the spec allows that, and they actually *do*. >> >> I don't think it's robust interoperable design for a PCI controller >> driver like qcom to assume anything about PCI devices unless it's >> required by the spec. >> >> Bjorn > We will check the spec once and will come back to you, From pci spec 4, in sec 5.5 "Ports that support L1 PM Substates must not require a reference clock while in L1 PM Substates other than L1.0". If there is no reference clk we can say there is no activity on the link. If anything needs to be sent (such as LTR, or some messages ), the link needs to be back in L0 before it sends the packet to the link partner. To exit from L1.x clkreq pin should be asserted. 
In suspend, after turning off the clocks and PHY, we can enable an interrupt that triggers whenever the clkreq pin asserts. In that interrupt handler, we can re-enable the PCIe resources. What are your thoughts on this?
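The proposal above can be read as a small state machine. What follows is a minimal sketch in plain C, not driver code: `struct pcie_model` and the `model_*` functions are hypothetical names invented for illustration, and the hardware is reduced to a few booleans. It assumes resources are released only if the link is in L1ss at suspend time, and that either a clkreq interrupt or system resume turns them back on.

```c
#include <stdbool.h>

/* Hypothetical model of the controller state; not a kernel structure. */
struct pcie_model {
	bool link_in_l1ss;     /* link state sampled at suspend time */
	bool resources_on;     /* clocks and PHY enabled */
	bool is_suspended;     /* resources were released in the suspend path */
	bool clkreq_irq_armed; /* wake interrupt on clkreq is enabled */
};

/* Suspend: release clocks/PHY only when the link sits in L1ss. */
static void model_suspend(struct pcie_model *m)
{
	if (!m->link_in_l1ss)
		return;
	m->resources_on = false;
	m->is_suspended = true;
	m->clkreq_irq_armed = true;
}

/* Restore resources if, and only if, the suspend path released them. */
static void model_restore(struct pcie_model *m)
{
	if (!m->is_suspended)
		return;
	m->resources_on = true;
	m->is_suspended = false;
	m->clkreq_irq_armed = false;
}

/* clkreq asserted while suspended: the handler restores resources. */
static void model_clkreq_irq(struct pcie_model *m)
{
	if (m->clkreq_irq_armed)
		model_restore(m);
}

/* System resume: same restore path; a no-op if the IRQ already ran. */
static void model_resume(struct pcie_model *m)
{
	model_restore(m);
}
```

The model deliberately ignores how long the restore path takes; it only shows that the `is_suspended` flag keeps the interrupt handler and the resume path from enabling the resources twice.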
On Mon, Sep 26, 2022 at 09:00:11PM +0530, Krishna Chaitanya Chundru wrote: > On 9/23/2022 7:56 PM, Bjorn Helgaas wrote: > > On Fri, Sep 23, 2022 at 07:29:31AM +0530, Krishna Chaitanya Chundru wrote: > > > On 9/23/2022 12:12 AM, Bjorn Helgaas wrote: > > > > On Thu, Sep 22, 2022 at 09:09:28PM +0530, Krishna Chaitanya Chundru wrote: > > > > > On 9/21/2022 10:26 PM, Bjorn Helgaas wrote: > > > > > > On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya Chundru wrote: > > > > > > > On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: > > > > > > > > On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote: > > > > > > > > > In qcom platform PCIe resources( clocks, phy etc..) can > > > > > > > > > released when the link is in L1ss to reduce the power > > > > > > > > > consumption. So if the link is in L1ss, release the PCIe > > > > > > > > > resources. And when the system resumes, enable the PCIe > > > > > > > > > resources if they released in the suspend path. > > > > > > > > What's the connection with L1.x? Links enter L1.x based on > > > > > > > > activity and timing. That doesn't seem like a reliable > > > > > > > > indicator to turn PHYs off and disable clocks. > > > > > > > This is a Qcom PHY-specific feature (retaining the link state in > > > > > > > L1.x with clocks turned off). It is possible only with the link > > > > > > > being in l1.x. PHY can't retain the link state in L0 with the > > > > > > > clocks turned off and we need to re-train the link if it's in L2 > > > > > > > or L3. So we can support this feature only with L1.x. That is > > > > > > > the reason we are taking l1.x as the trigger to turn off clocks > > > > > > > (in only suspend path). > > > > > > This doesn't address my question. L1.x is an ASPM feature, which > > > > > > means hardware may enter or leave L1.x autonomously at any time > > > > > > without software intervention. Therefore, I don't think reading the > > > > > > current state is a reliable way to decide anything. 
> > > > > After the link enters the L1.x it will come out only if there is > > > > > some activity on the link. AS system is suspended and NVMe driver > > > > > is also suspended( queues will freeze in suspend) who else can > > > > > initiate any data. > > > > I don't think we can assume that nothing will happen to cause exit > > > > from L1.x. For instance, PCIe Messages for INTx signaling, LTR, OBFF, > > > > PTM, etc., may be sent even though we think the device is idle and > > > > there should be no link activity. > > > I don't think after the link enters into L1.x there will some > > > activity on the link as you mentioned, except for PCIe messages like > > > INTx/MSI/MSIX. These messages also will not come because the client > > > drivers like NVMe will keep their device in the lowest power mode. > > > > > > The link will come out of L1.x only when there is config or memory > > > access or some messages to trigger the interrupts from the devices. > > > We are already making sure this access will not be there in S3. If > > > the link is in L0 or L0s what you said is expected but not in L1.x > > Forgive me for being skeptical, but we just spent a few months > > untangling the fact that some switches send PTM request messages even > > when they're in a non-D0 state. We expected that devices in D3hot > > would not send such messages because "why would they?" But it turns > > out the spec allows that, and they actually *do*. > > > > I don't think it's robust interoperable design for a PCI controller > > driver like qcom to assume anything about PCI devices unless it's > > required by the spec. > > From pci spec 4, in sec 5.5 > "Ports that support L1 PM Substates must not require a reference clock while > in L1 PM Substates > other than L1.0". > If there is no reference clk we can say there is no activity on the link. > If anything needs to be sent (such as LTR, or some messages ), the link > needs to be back in L0 before it > sends the packet to the link partner. 
> > To exit from L1.x clkreq pin should be asserted. > > In suspend after turning off clocks and phy we can enable to trigger an > interrupt whenever the clk req pin asserts. > In that interrupt handler, we can enable the pcie resources back. From the point of view of the endpoint driver, ASPM should be invisible -- no software intervention required. I think you're suggesting that the PCIe controller driver could help exit L1.x by handling a clk req interrupt and enabling clock and PHY then. But doesn't L1.x exit also have to happen within the time the endpoint can tolerate? E.g., I think L1.2 exit has to happen within the LTR time advertised by the endpoint (PCIe r6.0, sec 5.5.5). How can we guarantee that if software is involved?
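To make the timing concern concrete: an LTR latency is encoded as a 10-bit value and a 3-bit scale, where scale n means units of 32^n ns and scales 6 and 7 are not permitted. The sketch below decodes that encoding and compares it with a software-assisted exit path. It is only an illustration; `sw_exit_fits()` and its two latency parameters are hypothetical, and any numbers passed to it are made-up inputs, not measurements from Qcom hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decode an LTR latency (10-bit value, 3-bit scale) into nanoseconds.
 * Returns false for out-of-range input.
 */
static bool ltr_to_ns(uint16_t value, uint8_t scale, uint64_t *ns)
{
	if (value > 0x3ff || scale > 5)
		return false;
	/* 32^scale == 2^(5 * scale) */
	*ns = (uint64_t)value << (5 * scale);
	return true;
}

/*
 * Hypothetical budget check: does interrupt latency plus the time to
 * turn clocks and PHY back on fit within the advertised LTR latency?
 */
static bool sw_exit_fits(uint64_t ltr_ns, uint64_t irq_latency_ns,
			 uint64_t clk_phy_on_ns)
{
	return irq_latency_ns + clk_phy_on_ns <= ltr_ns;
}
```

For example, value 100 with scale 2 decodes to 100 * 1024 = 102400 ns. If the interrupt took 50000 ns to be serviced and the clocks and PHY needed another 60000 ns, the check would fail, which is the situation the question above is worried about.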
On 9/30/2022 12:23 AM, Bjorn Helgaas wrote: > On Mon, Sep 26, 2022 at 09:00:11PM +0530, Krishna Chaitanya Chundru wrote: >> On 9/23/2022 7:56 PM, Bjorn Helgaas wrote: >>> On Fri, Sep 23, 2022 at 07:29:31AM +0530, Krishna Chaitanya Chundru wrote: >>>> On 9/23/2022 12:12 AM, Bjorn Helgaas wrote: >>>>> On Thu, Sep 22, 2022 at 09:09:28PM +0530, Krishna Chaitanya Chundru wrote: >>>>>> On 9/21/2022 10:26 PM, Bjorn Helgaas wrote: >>>>>>> On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya Chundru wrote: >>>>>>>> On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: >>>>>>>>> On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote: >>>>>>>>>> In qcom platform PCIe resources( clocks, phy etc..) can >>>>>>>>>> released when the link is in L1ss to reduce the power >>>>>>>>>> consumption. So if the link is in L1ss, release the PCIe >>>>>>>>>> resources. And when the system resumes, enable the PCIe >>>>>>>>>> resources if they released in the suspend path. >>>>>>>>> What's the connection with L1.x? Links enter L1.x based on >>>>>>>>> activity and timing. That doesn't seem like a reliable >>>>>>>>> indicator to turn PHYs off and disable clocks. >>>>>>>> This is a Qcom PHY-specific feature (retaining the link state in >>>>>>>> L1.x with clocks turned off). It is possible only with the link >>>>>>>> being in l1.x. PHY can't retain the link state in L0 with the >>>>>>>> clocks turned off and we need to re-train the link if it's in L2 >>>>>>>> or L3. So we can support this feature only with L1.x. That is >>>>>>>> the reason we are taking l1.x as the trigger to turn off clocks >>>>>>>> (in only suspend path). >>>>>>> This doesn't address my question. L1.x is an ASPM feature, which >>>>>>> means hardware may enter or leave L1.x autonomously at any time >>>>>>> without software intervention. Therefore, I don't think reading the >>>>>>> current state is a reliable way to decide anything. 
>>>>>> After the link enters the L1.x it will come out only if there is >>>>>> some activity on the link. AS system is suspended and NVMe driver >>>>>> is also suspended( queues will freeze in suspend) who else can >>>>>> initiate any data. >>>>> I don't think we can assume that nothing will happen to cause exit >>>>> from L1.x. For instance, PCIe Messages for INTx signaling, LTR, OBFF, >>>>> PTM, etc., may be sent even though we think the device is idle and >>>>> there should be no link activity. >>>> I don't think after the link enters into L1.x there will some >>>> activity on the link as you mentioned, except for PCIe messages like >>>> INTx/MSI/MSIX. These messages also will not come because the client >>>> drivers like NVMe will keep their device in the lowest power mode. >>>> >>>> The link will come out of L1.x only when there is config or memory >>>> access or some messages to trigger the interrupts from the devices. >>>> We are already making sure this access will not be there in S3. If >>>> the link is in L0 or L0s what you said is expected but not in L1.x >>> Forgive me for being skeptical, but we just spent a few months >>> untangling the fact that some switches send PTM request messages even >>> when they're in a non-D0 state. We expected that devices in D3hot >>> would not send such messages because "why would they?" But it turns >>> out the spec allows that, and they actually *do*. >>> >>> I don't think it's robust interoperable design for a PCI controller >>> driver like qcom to assume anything about PCI devices unless it's >>> required by the spec. >> From pci spec 4, in sec 5.5 >> "Ports that support L1 PM Substates must not require a reference clock while >> in L1 PM Substates >> other than L1.0". >> If there is no reference clk we can say there is no activity on the link. >> If anything needs to be sent (such as LTR, or some messages ), the link >> needs to be back in L0 before it >> sends the packet to the link partner. 
>> >> To exit from L1.x clkreq pin should be asserted. >> >> In suspend after turning off clocks and phy we can enable to trigger an >> interrupt whenever the clk req pin asserts. >> In that interrupt handler, we can enable the pcie resources back. > From the point of view of the endpoint driver, ASPM should be > invisible -- no software intervention required. I think you're > suggesting that the PCIe controller driver could help exit L1.x by > handling a clk req interrupt and enabling clock and PHY then. > > But doesn't L1.x exit also have to happen within the time the endpoint > can tolerate? E.g., I think L1.2 exit has to happen within the LTR > time advertised by the endpoint (PCIe r6.0, sec 5.5.5). How can we > guarantee that if software is involved?

It is true that it is difficult to guarantee those delays. On our internal boards we are able to achieve this, but not with the Linux kernel.

With an NVMe device attached, we connected a protocol analyzer to check for transactions over the link. We found no transactions on the link from the time it enters L1.x until we resume the system. As the NVMe device is passive, it does not initiate any transactions.

This whole requirement came from the NVMe driver, which requires keeping the link active while the system is suspended.

There are only two things we can do in PCIe suspend, as we have to turn off the PCIe clocks to allow the system to reach the lowest possible power state.

1) Keep the device in D3cold and turn off all the clocks, PHY, etc. (This is not ideal, as it decreases the NVMe lifetime because a few NVMe devices treat link-down and link-up as a power cycle.)

2) This is the one we are proposing, where we turn off the clocks and PHY once the link enters L1ss.
Can you please suggest any other possible solution that meets both the NVMe requirement (keeping the link active during suspend) and the Qcom platform requirement (turning off all the clocks to allow the lowest possible power state)? The Qcom PCIe controller is compatible with the v3.1 specification only. Thanks & Regards, Krishna Chaitanya.
On Mon, Oct 03, 2022 at 05:40:21PM +0530, Krishna Chaitanya Chundru wrote: > On 9/30/2022 12:23 AM, Bjorn Helgaas wrote: > > On Mon, Sep 26, 2022 at 09:00:11PM +0530, Krishna Chaitanya Chundru wrote: > > > On 9/23/2022 7:56 PM, Bjorn Helgaas wrote: > > > > On Fri, Sep 23, 2022 at 07:29:31AM +0530, Krishna Chaitanya Chundru wrote: > > > > > On 9/23/2022 12:12 AM, Bjorn Helgaas wrote: > > > > > > On Thu, Sep 22, 2022 at 09:09:28PM +0530, Krishna Chaitanya Chundru wrote: > > > > > > > On 9/21/2022 10:26 PM, Bjorn Helgaas wrote: > > > > > > > > On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya Chundru wrote: > > > > > > > > > On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: > > > > > > > > > > On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote: > > > > > > > > > > > In qcom platform PCIe resources( clocks, phy > > > > > > > > > > > etc..) can released when the link is in L1ss to > > > > > > > > > > > reduce the power consumption. So if the link is > > > > > > > > > > > in L1ss, release the PCIe resources. And when > > > > > > > > > > > the system resumes, enable the PCIe resources if > > > > > > > > > > > they released in the suspend path. > > > > > > > > > > What's the connection with L1.x? Links enter L1.x > > > > > > > > > > based on activity and timing. That doesn't seem > > > > > > > > > > like a reliable indicator to turn PHYs off and > > > > > > > > > > disable clocks. > > > > > > > > > This is a Qcom PHY-specific feature (retaining the > > > > > > > > > link state in L1.x with clocks turned off). It is > > > > > > > > > possible only with the link being in l1.x. PHY can't > > > > > > > > > retain the link state in L0 with the clocks turned > > > > > > > > > off and we need to re-train the link if it's in L2 > > > > > > > > > or L3. So we can support this feature only with > > > > > > > > > L1.x. That is the reason we are taking l1.x as the > > > > > > > > > trigger to turn off clocks (in only suspend path). 
> > > > > > > > This doesn't address my question. L1.x is an ASPM > > > > > > > > feature, which means hardware may enter or leave L1.x > > > > > > > > autonomously at any time without software > > > > > > > > intervention. Therefore, I don't think reading the > > > > > > > > current state is a reliable way to decide anything. > > > > > > > After the link enters the L1.x it will come out only if > > > > > > > there is some activity on the link. As system is > > > > > > > suspended and NVMe driver is also suspended (queues > > > > > > > will freeze in suspend) who else can initiate any data. > > > > > > I don't think we can assume that nothing will happen to > > > > > > cause exit from L1.x. For instance, PCIe Messages for > > > > > > INTx signaling, LTR, OBFF, PTM, etc., may be sent even > > > > > > though we think the device is idle and there should be no > > > > > > link activity. > > > > > I don't think after the link enters into L1.x there will > > > > > some activity on the link as you mentioned, except for PCIe > > > > > messages like INTx/MSI/MSIX. These messages also will not > > > > > come because the client drivers like NVMe will keep their > > > > > device in the lowest power mode. > > > > > > > > > > The link will come out of L1.x only when there is config or > > > > > memory access or some messages to trigger the interrupts > > > > > from the devices. We are already making sure this access > > > > > will not be there in S3. If the link is in L0 or L0s what > > > > > you said is expected but not in L1.x > > > > Forgive me for being skeptical, but we just spent a few months > > > > untangling the fact that some switches send PTM request > > > > messages even when they're in a non-D0 state. We expected > > > > that devices in D3hot would not send such messages because > > > > "why would they?" But it turns out the spec allows that, and > > > > they actually *do*. 
> > > > > > > > I don't think it's robust interoperable design for a PCI > > > > controller driver like qcom to assume anything about PCI > > > > devices unless it's required by the spec. > > > From pci spec 4, in sec 5.5 "Ports that support L1 PM Substates > > > must not require a reference clock while in L1 PM Substates > > > other than L1.0". If there is no reference clk we can say > > > there is no activity on the link. If anything needs to be sent > > > (such as LTR, or some messages ), the link needs to be back in > > > L0 before it sends the packet to the link partner. > > > > > > To exit from L1.x clkreq pin should be asserted. > > > > > > In suspend after turning off clocks and phy we can enable to > > > trigger an interrupt whenever the clk req pin asserts. In that > > > interrupt handler, we can enable the pcie resources back. > > From the point of view of the endpoint driver, ASPM should be > > invisible -- no software intervention required. I think you're > > suggesting that the PCIe controller driver could help exit L1.x by > > handling a clk req interrupt and enabling clock and PHY then. > > > > But doesn't L1.x exit also have to happen within the time the > > endpoint can tolerate? E.g., I think L1.2 exit has to happen > > within the LTR time advertised by the endpoint (PCIe r6.0, sec > > 5.5.5). How can we guarantee that if software is involved? > It is true that it is difficult to guarantee those delays. On our > internal boards, we are able to achieve this but that is not with > linux kernel. > > With NVMe attach we have connected the protocol analyzer and tried > to see if there are any transactions over the link. We found there > are no transactions on the link once the link enters L1.x till we > resume the system. As the NVMe is a passive system it is not > initiating any transactions. > > This whole requirement came from the NVMe driver, it requires > keeping the link active state when the system is suspended. 
> > There are only two things we can in do in PCIe suspend as we have to > turn off PCIe clocks to allow the system to the lowest possible > power state. > > 1) Keep the device in D3 cold and turn off all the clocks and phy > etc. (It is not an ideal one as this decreases the NVMe lifetime > because link-down and link-up is treated as a power cycle by a few > NVMe devices). > > 2) This is the one we are proposing where we turn off the clocks, > phy once the link enters L1ss. It sounds like both options turn off the clocks and PHY. But apparently they do not look the same to the NVMe endpoint? I guess NVMe is in D3cold for 1), but it's in D0 for 2), right? > Can you please suggest us any other possible solutions to meet NVMe > requirement (That is to keep the link active during suspend) and the > Qcom platform requirement (that is to turn off all the clocks to > allow a lower possible power state)? Qcom PCIe controller is > compatible with v3.1 specification only. The PCIe spec clearly envisions Refclk being turned off (sec 5.5.3.3.1) and PHYs being powered off (sec 5.5.3.2) while in L1.2. I've been assuming L1.2 exit (which includes Refclk being turned on and PHYs being powered up) is completely handled by hardware, but it sounds like the Qcom controller needs software assistance which fields an interrupt when CLKREQ# is asserted and turns on Refclk and the PHYs? 5.5.3 does say "All Link and PHY state must be maintained during L1.2, or must be restored upon exit using implementation specific means", and maybe Qcom counts as using implementation specific means. I *am* concerned about whether software can do the L1.2 exit fast enough, but the biggest reason I'm struggling with this is because using the syscore framework to work around IRQ affinity changes that happen late in suspend just seems kind of kludgy and it doesn't seem like it fits cleanly in the power management model. Bjorn
On 10/6/2022 2:43 AM, Bjorn Helgaas wrote: > On Mon, Oct 03, 2022 at 05:40:21PM +0530, Krishna Chaitanya Chundru wrote: >> On 9/30/2022 12:23 AM, Bjorn Helgaas wrote: >>> On Mon, Sep 26, 2022 at 09:00:11PM +0530, Krishna Chaitanya Chundru wrote: >>>> On 9/23/2022 7:56 PM, Bjorn Helgaas wrote: >>>>> On Fri, Sep 23, 2022 at 07:29:31AM +0530, Krishna Chaitanya Chundru wrote: >>>>>> On 9/23/2022 12:12 AM, Bjorn Helgaas wrote: >>>>>>> On Thu, Sep 22, 2022 at 09:09:28PM +0530, Krishna Chaitanya Chundru wrote: >>>>>>>> On 9/21/2022 10:26 PM, Bjorn Helgaas wrote: >>>>>>>>> On Wed, Sep 21, 2022 at 03:23:35PM +0530, Krishna Chaitanya Chundru wrote: >>>>>>>>>> On 9/20/2022 11:46 PM, Bjorn Helgaas wrote: >>>>>>>>>>> On Tue, Sep 20, 2022 at 03:52:23PM +0530, Krishna chaitanya chundru wrote: >>>>>>>>>>>> In qcom platform PCIe resources( clocks, phy >>>>>>>>>>>> etc..) can released when the link is in L1ss to >>>>>>>>>>>> reduce the power consumption. So if the link is >>>>>>>>>>>> in L1ss, release the PCIe resources. And when >>>>>>>>>>>> the system resumes, enable the PCIe resources if >>>>>>>>>>>> they released in the suspend path. >>>>>>>>>>> What's the connection with L1.x? Links enter L1.x >>>>>>>>>>> based on activity and timing. That doesn't seem >>>>>>>>>>> like a reliable indicator to turn PHYs off and >>>>>>>>>>> disable clocks. >>>>>>>>>> This is a Qcom PHY-specific feature (retaining the >>>>>>>>>> link state in L1.x with clocks turned off). It is >>>>>>>>>> possible only with the link being in l1.x. PHY can't >>>>>>>>>> retain the link state in L0 with the clocks turned >>>>>>>>>> off and we need to re-train the link if it's in L2 >>>>>>>>>> or L3. So we can support this feature only with >>>>>>>>>> L1.x. That is the reason we are taking l1.x as the >>>>>>>>>> trigger to turn off clocks (in only suspend path). >>>>>>>>> This doesn't address my question. 
L1.x is an ASPM >>>>>>>>> feature, which means hardware may enter or leave L1.x >>>>>>>>> autonomously at any time without software >>>>>>>>> intervention. Therefore, I don't think reading the >>>>>>>>> current state is a reliable way to decide anything. >>>>>>>> After the link enters the L1.x it will come out only if >>>>>>>> there is some activity on the link. As system is >>>>>>>> suspended and NVMe driver is also suspended (queues >>>>>>>> will freeze in suspend) who else can initiate any data. >>>>>>> I don't think we can assume that nothing will happen to >>>>>>> cause exit from L1.x. For instance, PCIe Messages for >>>>>>> INTx signaling, LTR, OBFF, PTM, etc., may be sent even >>>>>>> though we think the device is idle and there should be no >>>>>>> link activity. >>>>>> I don't think after the link enters into L1.x there will >>>>>> some activity on the link as you mentioned, except for PCIe >>>>>> messages like INTx/MSI/MSIX. These messages also will not >>>>>> come because the client drivers like NVMe will keep their >>>>>> device in the lowest power mode. >>>>>> >>>>>> The link will come out of L1.x only when there is config or >>>>>> memory access or some messages to trigger the interrupts >>>>>> from the devices. We are already making sure this access >>>>>> will not be there in S3. If the link is in L0 or L0s what >>>>>> you said is expected but not in L1.x >>>>> Forgive me for being skeptical, but we just spent a few months >>>>> untangling the fact that some switches send PTM request >>>>> messages even when they're in a non-D0 state. We expected >>>>> that devices in D3hot would not send such messages because >>>>> "why would they?" But it turns out the spec allows that, and >>>>> they actually *do*. >>>>> >>>>> I don't think it's robust interoperable design for a PCI >>>>> controller driver like qcom to assume anything about PCI >>>>> devices unless it's required by the spec. 
>>>> From pci spec 4, in sec 5.5 "Ports that support L1 PM Substates >>>> must not require a reference clock while in L1 PM Substates >>>> other than L1.0". If there is no reference clk we can say >>>> there is no activity on the link. If anything needs to be sent >>>> (such as LTR, or some messages ), the link needs to be back in >>>> L0 before it sends the packet to the link partner. >>>> >>>> To exit from L1.x clkreq pin should be asserted. >>>> >>>> In suspend after turning off clocks and phy we can enable to >>>> trigger an interrupt whenever the clk req pin asserts. In that >>>> interrupt handler, we can enable the pcie resources back. >>> From the point of view of the endpoint driver, ASPM should be >>> invisible -- no software intervention required. I think you're >>> suggesting that the PCIe controller driver could help exit L1.x by >>> handling a clk req interrupt and enabling clock and PHY then. >>> >>> But doesn't L1.x exit also have to happen within the time the >>> endpoint can tolerate? E.g., I think L1.2 exit has to happen >>> within the LTR time advertised by the endpoint (PCIe r6.0, sec >>> 5.5.5). How can we guarantee that if software is involved? >> It is true that it is difficult to guarantee those delays. On our >> internal boards, we are able to achieve this but that is not with >> linux kernel. >> >> With NVMe attach we have connected the protocol analyzer and tried >> to see if there are any transactions over the link. We found there >> are no transactions on the link once the link enters L1.x till we >> resume the system. As the NVMe is a passive system it is not >> initiating any transactions. >> >> This whole requirement came from the NVMe driver, it requires >> keeping the link active state when the system is suspended. >> >> There are only two things we can in do in PCIe suspend as we have to >> turn off PCIe clocks to allow the system to the lowest possible >> power state. 
>> >> 1) Keep the device in D3 cold and turn off all the clocks and phy >> etc. (It is not an ideal one as this decreases the NVMe lifetime >> because link-down and link-up is treated as a power cycle by a few >> NVMe devices). >> >> 2) This is the one we are proposing where we turn off the clocks, >> phy once the link enters L1ss. > It sounds like both options turn off the clocks and PHY. But > apparently they do not look the same to the NVMe endpoint? I guess > NVMe is in D3cold for 1), but it's in D0 for 2), right? > >> Can you please suggest us any other possible solutions to meet NVMe >> requirement (That is to keep the link active during suspend) and the >> Qcom platform requirement (that is to turn off all the clocks to >> allow a lower possible power state)? Qcom PCIe controller is >> compatible with v3.1 specification only. > The PCIe spec clearly envisions Refclk being turned off > (sec 5.5.3.3.1) and PHYs being powered off (sec 5.5.3.2) while in > L1.2. > > I've been assuming L1.2 exit (which includes Refclk being turned on > and PHYs being powered up) is completely handled by hardware, but it > sounds like the Qcom controller needs software assistance which fields > an interrupt when CLKREQ# is asserted and turns on Refclk and the > PHYs? > > 5.5.3 does say "All Link and PHY state must be maintained during L1.2, > or must be restored upon exit using implementation specific means", > and maybe Qcom counts as using implementation specific means. > > I *am* concerned about whether software can do the L1.2 exit fast > enough, but the biggest reason I'm struggling with this is because > using the syscore framework to work around IRQ affinity changes that > happen late in suspend just seems kind of kludgy and it doesn't seem > like it fits cleanly in the power management model. > > Bjorn

Bjorn, can you please suggest any other way to work around the IRQ affinity changes? Thanks & Regards, Krishna Chaitanya.
[+cc Marc, Kevin]

On Wed, Oct 12, 2022 at 07:36:52PM +0530, Krishna Chaitanya Chundru wrote:
> On 10/6/2022 2:43 AM, Bjorn Helgaas wrote:

[I'm declaring quote text bankruptcy and dropping the huge wall of
text. The IRQ affinity change you mention seems to be the critical
issue :)]

> > The PCIe spec clearly envisions Refclk being turned off
> > (sec 5.5.3.3.1) and PHYs being powered off (sec 5.5.3.2) while in
> > L1.2.
> >
> > I've been assuming L1.2 exit (which includes Refclk being turned on
> > and PHYs being powered up) is completely handled by hardware, but it
> > sounds like the Qcom controller needs software assistance which fields
> > an interrupt when CLKREQ# is asserted and turns on Refclk and the
> > PHYs?
> >
> > 5.5.3 does say "All Link and PHY state must be maintained during L1.2,
> > or must be restored upon exit using implementation specific means",
> > and maybe Qcom counts as using implementation specific means.
> >
> > I *am* concerned about whether software can do the L1.2 exit fast
> > enough, but the biggest reason I'm struggling with this is because
> > using the syscore framework to work around IRQ affinity changes that
> > happen late in suspend just seems kind of kludgy and it doesn't seem
> > like it fits cleanly in the power management model.
>
> Can you please suggest another way to work around IRQ affinity
> changes?

One of your earlier patches [1] made dw_msi_mask_irq() look like this:

  static void dw_msi_mask_irq(struct irq_data *d)
  {
    struct pcie_port *pp = irq_data_get_irq_chip_data(d->parent_data);
    struct dw_pcie *pci = to_dw_pcie_from_pp(pp);

    if (dw_pcie_link_up(pci))
      pci_msi_mask_irq(d);

    irq_chip_mask_parent(d);
  }

That was an awful lot like Marc's suggestion [2] that the
pci_msi_mask_irq() should be redundant.

If it's truly redundant, maybe pci_msi_mask_irq() can be removed from
dw_msi_mask_irq() (and other similar *_mask_irq() implementations)
completely?

Bjorn

[1] https://lore.kernel.org/r/1659526134-22978-3-git-send-email-quic_krichai@quicinc.com
[2] https://lore.kernel.org/linux-pci/86k05m7dkr.wl-maz@kernel.org/
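The redundancy question above can be illustrated with a toy user-space model of two-level MSI masking. This is only a sketch: struct msi_model, its fields, and the helper functions are hypothetical stand-ins and not kernel code. It also assumes that masking at the parent interrupt controller alone is enough to stop delivery, which is the claim under discussion in this thread rather than an established fact about the DWC hardware.

```c
#include <stdbool.h>

/* Hypothetical model of one MSI vector behind a DWC-style controller. */
struct msi_model {
	bool ep_clocked;     /* endpoint config space reachable (clocks on) */
	bool ep_masked;      /* mask bit in the endpoint's MSI capability */
	bool parent_masked;  /* mask bit at the controller / parent irqchip */
	int unclocked_hits;  /* endpoint accesses attempted with clocks off */
};

/* Models writing the endpoint mask bit; counts un-clocked accesses. */
static void model_ep_mask(struct msi_model *m)
{
	if (!m->ep_clocked) {
		m->unclocked_hits++;	/* on real hardware this could crash */
		return;
	}
	m->ep_masked = true;
}

/* Variant 1: touch the endpoint first, then mask at the parent. */
static void mask_both(struct msi_model *m)
{
	model_ep_mask(m);
	m->parent_masked = true;
}

/* Variant 2: mask at the parent only; the endpoint is never accessed. */
static void mask_parent_only(struct msi_model *m)
{
	m->parent_masked = true;
}

/* An interrupt reaches the CPU only if neither level masks it. */
static bool delivered(const struct msi_model *m)
{
	return !m->ep_masked && !m->parent_masked;
}
```

In this model both variants stop delivery, but only mask_both() attempts an endpoint access when the clocks are already off. That is the access pattern the syscore workaround was introduced to avoid.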
diff --git a/drivers/pci/controller/dwc/pcie-qcom.c b/drivers/pci/controller/dwc/pcie-qcom.c
index 39ca06f..3f5424a 100644
--- a/drivers/pci/controller/dwc/pcie-qcom.c
+++ b/drivers/pci/controller/dwc/pcie-qcom.c
@@ -27,6 +27,7 @@
 #include <linux/reset.h>
 #include <linux/slab.h>
 #include <linux/types.h>
+#include <linux/syscore_ops.h>
 
 #include "../../pci.h"
 #include "pcie-designware.h"
@@ -44,6 +45,9 @@
 #define PCIE20_PARF_PM_CTRL			0x20
 #define REQ_NOT_ENTR_L1				BIT(5)
 
+#define PCIE20_PARF_PM_STTS			0x24
+#define PCIE20_PARF_PM_STTS_LINKST_IN_L1SUB	BIT(8)
+
 #define PCIE20_PARF_PHY_CTRL			0x40
 #define PHY_CTRL_PHY_TX0_TERM_OFFSET_MASK	GENMASK(20, 16)
 #define PHY_CTRL_PHY_TX0_TERM_OFFSET(x)		((x) << 16)
@@ -122,6 +126,8 @@
 
 #define QCOM_PCIE_CRC8_POLYNOMIAL (BIT(2) | BIT(1) | BIT(0))
 
+static LIST_HEAD(qcom_pcie_list);
+
 struct qcom_pcie_resources_2_1_0 {
 	struct clk_bulk_data clks[QCOM_PCIE_2_1_0_MAX_CLOCKS];
 	struct reset_control *pci_reset;
@@ -211,13 +217,21 @@ struct qcom_pcie_ops {
 	void (*post_deinit)(struct qcom_pcie *pcie);
 	void (*ltssm_enable)(struct qcom_pcie *pcie);
 	int (*config_sid)(struct qcom_pcie *pcie);
+	int (*suspend)(struct qcom_pcie *pcie);
+	int (*resume)(struct qcom_pcie *pcie);
 };
 
 struct qcom_pcie_cfg {
 	const struct qcom_pcie_ops *ops;
+	/*
+	 * Flag ensures which devices will turn off clks, phy
+	 * in system suspend.
+	 */
+	unsigned int supports_system_suspend:1;
 };
 
 struct qcom_pcie {
+	struct list_head list;	/* list to probed instances */
 	struct dw_pcie *pci;
 	void __iomem *parf;			/* DT parf */
 	void __iomem *elbi;			/* DT elbi */
@@ -225,10 +239,14 @@ struct qcom_pcie {
 	struct phy *phy;
 	struct gpio_desc *reset;
 	const struct qcom_pcie_cfg *cfg;
+	unsigned int is_suspended:1;
 };
 
 #define to_qcom_pcie(x)		dev_get_drvdata((x)->dev)
 
+static int __maybe_unused qcom_pcie_syscore_op_suspend(void);
+static void __maybe_unused qcom_pcie_syscore_op_resume(void);
+
 static void qcom_ep_reset_assert(struct qcom_pcie *pcie)
 {
 	gpiod_set_value_cansleep(pcie->reset, 1);
@@ -1301,6 +1319,28 @@ static void qcom_pcie_deinit_2_7_0(struct qcom_pcie *pcie)
 	regulator_bulk_disable(ARRAY_SIZE(res->supplies), res->supplies);
 }
 
+static int qcom_pcie_resume_2_7_0(struct qcom_pcie *pcie)
+{
+	struct qcom_pcie_resources_2_7_0 *res = &pcie->res.v2_7_0;
+	int ret;
+
+	ret = clk_bulk_prepare_enable(res->num_clks, res->clks);
+
+	phy_power_on(pcie->phy);
+
+	return ret;
+}
+
+static int qcom_pcie_suspend_2_7_0(struct qcom_pcie *pcie)
+{
+	struct qcom_pcie_resources_2_7_0 *res = &pcie->res.v2_7_0;
+
+	phy_power_off(pcie->phy);
+
+	clk_bulk_disable_unprepare(res->num_clks, res->clks);
+	return 0;
+}
+
 static int qcom_pcie_get_resources_2_9_0(struct qcom_pcie *pcie)
 {
 	struct qcom_pcie_resources_2_9_0 *res = &pcie->res.v2_9_0;
@@ -1594,6 +1634,8 @@ static const struct qcom_pcie_ops ops_1_9_0 = {
 	.deinit = qcom_pcie_deinit_2_7_0,
 	.ltssm_enable = qcom_pcie_2_3_2_ltssm_enable,
 	.config_sid = qcom_pcie_config_sid_sm8250,
+	.suspend = qcom_pcie_suspend_2_7_0,
+	.resume = qcom_pcie_resume_2_7_0,
 };
 
 /* Qcom IP rev.: 2.9.0 Synopsys IP rev.: 5.00a */
@@ -1613,6 +1655,11 @@ static const struct qcom_pcie_cfg cfg_1_9_0 = {
 	.ops = &ops_1_9_0,
 };
 
+static const struct qcom_pcie_cfg sc7280_cfg = {
+	.ops = &ops_1_9_0,
+	.supports_system_suspend = true,
+};
+
 static const struct qcom_pcie_cfg cfg_2_1_0 = {
 	.ops = &ops_2_1_0,
 };
@@ -1642,6 +1689,23 @@ static const struct dw_pcie_ops dw_pcie_ops = {
 	.start_link = qcom_pcie_start_link,
 };
 
+/*
+ * There is access to Ep PCIe space to mask MSI/MSIX after pm suspend
+ * ops.(getting hit by affinity changes while making CPUs offline during
+ * suspend, this will happen after devices are suspended
+ * (all phases of suspend ops)).
+ *
+ * When registered with pm ops there is a crash due to un-clocked access,
+ * as in the pm suspend op clocks are disabled.
+ *
+ * So, registering with syscore ops which will called after making
+ * CPU's offline.
+ */
+static struct syscore_ops qcom_pcie_syscore_ops = {
+	.suspend = qcom_pcie_syscore_op_suspend,
+	.resume = qcom_pcie_syscore_op_resume,
+};
+
 static int qcom_pcie_probe(struct platform_device *pdev)
 {
 	struct device *dev = &pdev->dev;
@@ -1720,6 +1784,17 @@ static int qcom_pcie_probe(struct platform_device *pdev)
 		goto err_phy_exit;
 	}
 
+	/* Register for syscore ops only when first instance probed */
+	if (list_empty(&qcom_pcie_list))
+		register_syscore_ops(&qcom_pcie_syscore_ops);
+
+	/*
+	 * Add the qcom_pcie list of each PCIe instance probed to
+	 * the global list so that we use it iterate through each PCIe
+	 * instance in the syscore ops.
+	 */
+	list_add_tail(&pcie->list, &qcom_pcie_list);
+
 	return 0;
 
 err_phy_exit:
@@ -1731,6 +1806,68 @@ static int qcom_pcie_probe(struct platform_device *pdev)
 	return ret;
 }
 
+static int __maybe_unused qcom_pcie_pm_suspend(struct qcom_pcie *pcie)
+{
+	u32 val;
+	struct dw_pcie *pci = pcie->pci;
+	struct device *dev = pci->dev;
+
+	/* if the link is not active turn off clocks */
+	if (!dw_pcie_link_up(pci)) {
+		dev_dbg(dev, "Link is not active\n");
+		goto suspend;
+	}
+
+	/* if the link is not in l1ss don't turn off clocks */
+	val = readl(pcie->parf + PCIE20_PARF_PM_STTS);
+	if (!(val & PCIE20_PARF_PM_STTS_LINKST_IN_L1SUB)) {
+		dev_warn(dev, "Link is not in L1ss\n");
+		return 0;
+	}
+
+suspend:
+	if (pcie->cfg->ops->suspend)
+		pcie->cfg->ops->suspend(pcie);
+
+	pcie->is_suspended = true;
+
+	return 0;
+}
+
+static int __maybe_unused qcom_pcie_pm_resume(struct qcom_pcie *pcie)
+{
+	if (!pcie->is_suspended)
+		return 0;
+
+	if (pcie->cfg->ops->resume)
+		pcie->cfg->ops->resume(pcie);
+
+	pcie->is_suspended = false;
+
+	return 0;
+}
+
+static int __maybe_unused qcom_pcie_syscore_op_suspend(void)
+{
+	struct qcom_pcie *qcom_pcie;
+
+	list_for_each_entry(qcom_pcie, &qcom_pcie_list, list) {
+
+		if (qcom_pcie->cfg->supports_system_suspend)
+			qcom_pcie_pm_suspend(qcom_pcie);
+	}
+	return 0;
+}
+
+static void __maybe_unused qcom_pcie_syscore_op_resume(void)
+{
+	struct qcom_pcie *qcom_pcie;
+
+	list_for_each_entry(qcom_pcie, &qcom_pcie_list, list) {
+		qcom_pcie_pm_resume(qcom_pcie);
+	}
+}
+
 static const struct of_device_id qcom_pcie_match[] = {
 	{ .compatible = "qcom,pcie-apq8064", .data = &cfg_2_1_0 },
 	{ .compatible = "qcom,pcie-apq8084", .data = &cfg_1_0_0 },
@@ -1742,7 +1879,7 @@ static const struct of_device_id qcom_pcie_match[] = {
 	{ .compatible = "qcom,pcie-msm8996", .data = &cfg_2_3_2 },
 	{ .compatible = "qcom,pcie-qcs404", .data = &cfg_2_4_0 },
 	{ .compatible = "qcom,pcie-sa8540p", .data = &cfg_1_9_0 },
-	{ .compatible = "qcom,pcie-sc7280", .data = &cfg_1_9_0 },
+	{ .compatible = "qcom,pcie-sc7280", .data = &sc7280_cfg },
 	{ .compatible = "qcom,pcie-sc8180x", .data = &cfg_1_9_0 },
 	{ .compatible = "qcom,pcie-sc8280xp", .data = &cfg_1_9_0 },
 	{ .compatible = "qcom,pcie-sdm845", .data = &cfg_2_7_0 },
Add suspend and resume syscore ops.

Few PCIe endpoints like NVMe and WLANs are always expecting the device
to be in D0 state and the link to be active (or in l1ss) all the time
(including in S3 state).

In qcom platform PCIe resources( clocks, phy etc..) can released
when the link is in L1ss to reduce the power consumption. So if the link
is in L1ss, release the PCIe resources. And when the system resumes,
enable the PCIe resources if they released in the suspend path.

is_suspended flag indicates if the PCIe resources are released or not
in the suspend path.

Its observed that access to Ep PCIe space to mask MSI/MSIX is happening
at the very late stage of suspend path (access by affinity changes while
making CPUs offline during suspend, this will happen after devices are
suspended (after all phases of suspend ops)). If we turn off clocks in
any PM callback, afterwards running into crashes due to un-clocked access
due to above mentioned MSI/MSIx access.
So, we are making use of syscore framework to turn off the PCIe clocks
which will be called after making CPUs offline.

Signed-off-by: Krishna chaitanya chundru <quic_krichai@quicinc.com>
---
changes since v6:
	- move the supports_system_suspend check to syscore ops.
changes since v5:
	- Rebasing the code and replaced pm ops with syscore ops as
	  we are getting access to pci region after pm ops. syscore ops
	  will called after disabling non boot cpus and there is no pci
	  access after that.
Changes since v4:
	- Rebasing the code and removed the supports_system_suspend flag
	- in the resume path as is_suspended will serve its purpose.
Changes since v3:
	- Powering down the phy in suspend and powering it on resume to
	  achieve maximum power savings.
Changes since v2:
	- Replaced the enable, disable clks ops with suspend and resume
	- Renamed support_pm_opsi flag with supports_system_suspend.
Changes since v1:
	- Fixed compilation errors.
---
 drivers/pci/controller/dwc/pcie-qcom.c | 139 ++++++++++++++++++++++++++++++++-
 1 file changed, 138 insertions(+), 1 deletion(-)
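The decision made in the suspend path of this patch (release clocks and PHY when the link is down or in L1ss, otherwise leave them on) can be restated as a small user-space sketch. The bit position is taken from the PCIE20_PARF_PM_STTS_LINKST_IN_L1SUB definition in the patch. The enum and the function suspend_decision() are hypothetical names used only for illustration, and the register value is passed in as a plain integer instead of being read with readl().

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT(n)					(1u << (n))
/* Same bit as in the patch: link is in an L1 substate. */
#define PCIE20_PARF_PM_STTS_LINKST_IN_L1SUB	BIT(8)

enum suspend_action {
	KEEP_RESOURCES,		/* link up but not in L1ss: leave clocks/PHY on */
	RELEASE_RESOURCES,	/* link down, or link in L1ss: clocks/PHY may go off */
};

/*
 * Mirrors the branch structure of qcom_pcie_pm_suspend() in the patch:
 * an inactive link always releases resources, while an active link
 * releases them only if the PM status reports L1ss.
 */
static enum suspend_action suspend_decision(bool link_up, uint32_t pm_stts)
{
	if (!link_up)
		return RELEASE_RESOURCES;

	if (!(pm_stts & PCIE20_PARF_PM_STTS_LINKST_IN_L1SUB))
		return KEEP_RESOURCES;

	return RELEASE_RESOURCES;
}
```

The sketch evaluates the status bit once, at the moment the suspend callback runs. Whether a single L1ss sample is a reliable indicator is exactly what the review comments on this patch question.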