Message ID | 20240727090604.24646-1-manivannan.sadhasivam@linaro.org (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Krzysztof Wilczyński |
Headers | show |
Series | PCI: qcom-ep: Do not enable resources during probe() | expand |
On Sat, Jul 27, 2024 at 02:36:04PM GMT, Manivannan Sadhasivam wrote: > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure > for drivers requiring refclk from host"), all the hardware register access > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk > from host. > > So there is no need to enable the endpoint resources (like clk, regulators, > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources() > helper from probe(). This was added earlier because dw_pcie_ep_init() was > doing DBI access, which is not done now. ... moreover his makes PCIe EP fail on some of the platforms as powering on PHY requires refclk from the RC side, which is not enabled at the probe time. Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the > EP controller in the case of failure. > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> > --- > drivers/pci/controller/dwc/pcie-qcom-ep.c | 14 ++++---------- > 1 file changed, 4 insertions(+), 10 deletions(-)
On Sat, Jul 27, 2024 at 01:32:40PM +0300, Dmitry Baryshkov wrote: > On Sat, Jul 27, 2024 at 02:36:04PM GMT, Manivannan Sadhasivam wrote: > > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure > > for drivers requiring refclk from host"), all the hardware register access > > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only > > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk > > from host. > > > > So there is no need to enable the endpoint resources (like clk, regulators, > > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources() > > helper from probe(). This was added earlier because dw_pcie_ep_init() was > > doing DBI access, which is not done now. > > ... moreover his makes PCIe EP fail on some of the platforms as powering > on PHY requires refclk from the RC side, which is not enabled at the > probe time. > Yeah. I hope Bjorn/Krzysztof could add this to the commit message while applying. > Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> > Thanks! - Mani > > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the > > EP controller in the case of failure. > > > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> > > --- > > drivers/pci/controller/dwc/pcie-qcom-ep.c | 14 ++++---------- > > 1 file changed, 4 insertions(+), 10 deletions(-) > > -- > With best wishes > Dmitry
Hello, > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure > for drivers requiring refclk from host"), all the hardware register access > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk > from host. > > So there is no need to enable the endpoint resources (like clk, regulators, > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources() > helper from probe(). This was added earlier because dw_pcie_ep_init() was > doing DBI access, which is not done now. > > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the > EP controller in the case of failure. Applied to controller/qcom, thank you! [1/1] PCI: qcom-ep: Do not enable resources during probe() https://git.kernel.org/pci/pci/c/cd0b3e13ec30 Krzysztof
Hello, [...] > > > So there is no need to enable the endpoint resources (like clk, regulators, > > > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources() > > > helper from probe(). This was added earlier because dw_pcie_ep_init() was > > > doing DBI access, which is not done now. > > > > ... moreover his makes PCIe EP fail on some of the platforms as powering > > on PHY requires refclk from the RC side, which is not enabled at the > > probe time. > > > > Yeah. I hope Bjorn/Krzysztof could add this to the commit message while > applying. No worries. Added to the commit log. Krzysztof
On Sat, Jul 27, 2024 at 02:36:04PM +0530, Manivannan Sadhasivam wrote: > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure > for drivers requiring refclk from host"), all the hardware register access > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk > from host. > > So there is no need to enable the endpoint resources (like clk, regulators, > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources() > helper from probe(). This was added earlier because dw_pcie_ep_init() was > doing DBI access, which is not done now. > > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the > EP controller in the case of failure. Is this v6.11 material? If so, we need a little more justification than "no need to enable". > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> > --- > drivers/pci/controller/dwc/pcie-qcom-ep.c | 14 ++++---------- > 1 file changed, 4 insertions(+), 10 deletions(-) > > diff --git a/drivers/pci/controller/dwc/pcie-qcom-ep.c b/drivers/pci/controller/dwc/pcie-qcom-ep.c > index 236229f66c80..2319ff2ae9f6 100644 > --- a/drivers/pci/controller/dwc/pcie-qcom-ep.c > +++ b/drivers/pci/controller/dwc/pcie-qcom-ep.c > @@ -846,21 +846,15 @@ static int qcom_pcie_ep_probe(struct platform_device *pdev) > if (ret) > return ret; > > - ret = qcom_pcie_enable_resources(pcie_ep); > - if (ret) { > - dev_err(dev, "Failed to enable resources: %d\n", ret); > - return ret; > - } > - > ret = dw_pcie_ep_init(&pcie_ep->pci.ep); > if (ret) { > dev_err(dev, "Failed to initialize endpoint: %d\n", ret); > - goto err_disable_resources; > + return ret; > } > > ret = qcom_pcie_ep_enable_irq_resources(pdev, pcie_ep); > if (ret) > - goto err_disable_resources; > + goto err_ep_deinit; > > name = devm_kasprintf(dev, GFP_KERNEL, "%pOFP", dev->of_node); > if (!name) { > @@ -877,8 +871,8 @@ static int qcom_pcie_ep_probe(struct platform_device *pdev) > disable_irq(pcie_ep->global_irq); > disable_irq(pcie_ep->perst_irq); > > -err_disable_resources: > - qcom_pcie_disable_resources(pcie_ep); > +err_ep_deinit: > + dw_pcie_ep_deinit(&pcie_ep->pci.ep); > > return ret; > } > -- > 2.25.1 >
On Thu, Aug 15, 2024 at 01:15:57PM -0500, Bjorn Helgaas wrote: > On Sat, Jul 27, 2024 at 02:36:04PM +0530, Manivannan Sadhasivam wrote: > > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure > > for drivers requiring refclk from host"), all the hardware register access > > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only > > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk > > from host. > > > > So there is no need to enable the endpoint resources (like clk, regulators, > > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources() > > helper from probe(). This was added earlier because dw_pcie_ep_init() was > > doing DBI access, which is not done now. > > > > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the > > EP controller in the case of failure. > > Is this v6.11 material? If so, we need a little more justification > than "no need to enable". > That's why I asked to merge the comment from Dmitry: "...moreover his makes PCIe EP fail on some of the platforms as powering on PHY requires refclk from the RC side, which is not enabled at the probe time." And Krzysztof did that: https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?h=controller/qcom&id=cd0b3e13ec309dcbe1efb66b1969fc72088b791d - Mani > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> > > --- > > drivers/pci/controller/dwc/pcie-qcom-ep.c | 14 ++++---------- > > 1 file changed, 4 insertions(+), 10 deletions(-) > > > > diff --git a/drivers/pci/controller/dwc/pcie-qcom-ep.c b/drivers/pci/controller/dwc/pcie-qcom-ep.c > > index 236229f66c80..2319ff2ae9f6 100644 > > --- a/drivers/pci/controller/dwc/pcie-qcom-ep.c > > +++ b/drivers/pci/controller/dwc/pcie-qcom-ep.c > > @@ -846,21 +846,15 @@ static int qcom_pcie_ep_probe(struct platform_device *pdev) > > if (ret) > > return ret; > > > > - ret = qcom_pcie_enable_resources(pcie_ep); > > - if (ret) { > > - dev_err(dev, "Failed to enable resources: %d\n", ret); > > - return ret; > > - } > > - > > ret = dw_pcie_ep_init(&pcie_ep->pci.ep); > > if (ret) { > > dev_err(dev, "Failed to initialize endpoint: %d\n", ret); > > - goto err_disable_resources; > > + return ret; > > } > > > > ret = qcom_pcie_ep_enable_irq_resources(pdev, pcie_ep); > > if (ret) > > - goto err_disable_resources; > > + goto err_ep_deinit; > > > > name = devm_kasprintf(dev, GFP_KERNEL, "%pOFP", dev->of_node); > > if (!name) { > > @@ -877,8 +871,8 @@ static int qcom_pcie_ep_probe(struct platform_device *pdev) > > disable_irq(pcie_ep->global_irq); > > disable_irq(pcie_ep->perst_irq); > > > > -err_disable_resources: > > - qcom_pcie_disable_resources(pcie_ep); > > +err_ep_deinit: > > + dw_pcie_ep_deinit(&pcie_ep->pci.ep); > > > > return ret; > > } > > -- > > 2.25.1 > >
On Fri, Aug 16, 2024 at 11:07:57AM +0530, Manivannan Sadhasivam wrote: > On Thu, Aug 15, 2024 at 01:15:57PM -0500, Bjorn Helgaas wrote: > > On Sat, Jul 27, 2024 at 02:36:04PM +0530, Manivannan Sadhasivam wrote: > > > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure > > > for drivers requiring refclk from host"), all the hardware register access > > > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only > > > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk > > > from host. > > > > > > So there is no need to enable the endpoint resources (like clk, regulators, > > > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources() > > > helper from probe(). This was added earlier because dw_pcie_ep_init() was > > > doing DBI access, which is not done now. > > > > > > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the > > > EP controller in the case of failure. > > > > Is this v6.11 material? If so, we need a little more justification > > than "no need to enable". > > That's why I asked to merge the comment from Dmitry: > > "...moreover his makes PCIe EP fail on some of the platforms as powering on PHY > requires refclk from the RC side, which is not enabled at the probe time." The patch was posted and described basically as a cleanup of something that was unnecessary but not harmful, which is not post-merge window material. If it is in fact a critical fix, that needs to be the central point of the commit log, not something tacked on with "moreover" or "additionally". And we need something like a Fixes: tag so we know when the problem was introduced and where to backport this. Bjorn
On Wed, Aug 14, 2024 at 05:25:47AM +0900, Krzysztof Wilczyński wrote: > > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure > > for drivers requiring refclk from host"), all the hardware register access > > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only > > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk > > from host. > > > > So there is no need to enable the endpoint resources (like clk, regulators, > > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources() > > helper from probe(). This was added earlier because dw_pcie_ep_init() was > > doing DBI access, which is not done now. > > > > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the > > EP controller in the case of failure. > > Applied to controller/qcom, thank you! > > [1/1] PCI: qcom-ep: Do not enable resources during probe() > https://git.kernel.org/pci/pci/c/cd0b3e13ec30 I think we do need this, but I dropped it for now pending a commit log that says "we're fixing a crash" and explains how. The current log says "869bc5253406 moved hardware register access like DBI to dw_pcie_ep_init_registers()", but 869bc5253406 actually moved a bunch of register accesses from dw_pcie_ep_init() to dw_pcie_ep_init_complete(), and a subsequent patch renamed dw_pcie_ep_init_complete() to dw_pcie_ep_init_registers(). I did eventually figure out the rename, but it took a while to make that leap. It also says dw_pcie_ep_init_registers() is called only from qcom_pcie_perst_deassert(), but obviously all drivers call it. I think what you meant is that on qcom and tegra194, dw_pcie_ep_init_registers() isn't called from .probe(); it's called later because they require refclk to access the registers, so qcom and tegra194 call it after PERST# is deasserted, because then refclk is available. Trying to understand the 869bc5253406 reference: I guess the point is that the dw_pcie_ep_init_registers() work depends on qcom_pcie_enable_resources(), and before 869bc5253406, that work was done by qcom_pcie_ep_probe() calling dw_pcie_ep_init(), so it had to call qcom_pcie_enable_resources() first. But after 869bc5253406, dw_pcie_ep_init_registers() is done in qcom_pcie_perst_deassert(), which already calls qcom_pcie_enable_resources(). So qcom_pcie_ep_probe() no longer needs to call qcom_pcie_enable_resources(). As far as the *crash*, phy_power_on() has been called from qcom_pcie_ep_probe() since the very beginning in f55fee56a631 ("PCI: qcom-ep: Add Qualcomm PCIe Endpoint controller driver"). But apparently on some new platforms phy_power_on() depends on refclk and (I assume) it causes a crash when done from qcom_pcie_ep_probe(). So I would think the commit log should look something like this: PCI: qcom-ep: Postpone PHY power-on until refclk available qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and powers on PHYs. On new platforms like X, Y, Z, this depends on refclk from the RC, which may not be available at the time of qcom_pcie_ep_probe(), so this causes a crash in the qcom-ep driver. qcom_pcie_enable_resources() is already called by qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is available at that time. Remove the unnecessary call from qcom_pcie_ep_probe() to prevent the crash on X, Y, Z. Although I do have the question of what happens if the RC deasserts PERST# before qcom-ep is loaded. We probably don't execute qcom_pcie_perst_deassert() in that case, so how does the init happen? Bjorn
On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote: > On Wed, Aug 14, 2024 at 05:25:47AM +0900, Krzysztof Wilczyński wrote: > > > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure > > > for drivers requiring refclk from host"), all the hardware register access > > > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only > > > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk > > > from host. > > > > > > So there is no need to enable the endpoint resources (like clk, regulators, > > > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources() > > > helper from probe(). This was added earlier because dw_pcie_ep_init() was > > > doing DBI access, which is not done now. > > > > > > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the > > > EP controller in the case of failure. > > > > Applied to controller/qcom, thank you! > > > > [1/1] PCI: qcom-ep: Do not enable resources during probe() > > https://git.kernel.org/pci/pci/c/cd0b3e13ec30 > > I think we do need this, but I dropped it for now pending a commit log > that says "we're fixing a crash" and explains how. > > The current log says "869bc5253406 moved hardware register access like > DBI to dw_pcie_ep_init_registers()", but 869bc5253406 actually moved a > bunch of register accesses from dw_pcie_ep_init() to > dw_pcie_ep_init_complete(), and a subsequent patch renamed > dw_pcie_ep_init_complete() to dw_pcie_ep_init_registers(). I did > eventually figure out the rename, but it took a while to make that > leap. > Ah, sorry about that. I somehow managed to merge those two commits in my mind. > It also says dw_pcie_ep_init_registers() is called only from > qcom_pcie_perst_deassert(), but obviously all drivers call it. I > think what you meant is that on qcom and tegra194, Yeah, the context is only applicable to the respective SoCs. > dw_pcie_ep_init_registers() isn't called from .probe(); it's called > later because they require refclk to access the registers, so qcom and > tegra194 call it after PERST# is deasserted, because then refclk is > available. > > Trying to understand the 869bc5253406 reference: I guess the > point is that the dw_pcie_ep_init_registers() work depends on > qcom_pcie_enable_resources(), and before 869bc5253406, that work was > done by qcom_pcie_ep_probe() calling dw_pcie_ep_init(), so it had to > call qcom_pcie_enable_resources() first. > > But after 869bc5253406, dw_pcie_ep_init_registers() is done in > qcom_pcie_perst_deassert(), which already calls > qcom_pcie_enable_resources(). So qcom_pcie_ep_probe() no longer needs > to call qcom_pcie_enable_resources(). > Right. > As far as the *crash*, phy_power_on() has been called from > qcom_pcie_ep_probe() since the very beginning in f55fee56a631 ("PCI: > qcom-ep: Add Qualcomm PCIe Endpoint controller driver"). But > apparently on some new platforms phy_power_on() depends on refclk and > (I assume) it causes a crash when done from qcom_pcie_ep_probe(). > Yeah. The actual intention of the patch was to get rid of the qcom_pcie_enable_resources() from probe() as it is now not needed. But later Dmitry pointed out that qcom_pcie_enable_resources() is also causing a crash on his platform because phy_power_on() depends on refclk from host. So he had to manually boot the host first and then endpoint to avoid the crash (without this patch). Then I thought that this is an RC candidate as it fixes a hard crash and asked to merge the patch ammending the commit message with Dmitry's info. But now I understood that I should've reworded the commit message to make it look like an RC material. > So I would think the commit log should look something like this: > > PCI: qcom-ep: Postpone PHY power-on until refclk available > > qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and > powers on PHYs. On new platforms like X, Y, Z, this depends on > refclk from the RC, which may not be available at the time of > qcom_pcie_ep_probe(), so this causes a crash in the qcom-ep driver. > > qcom_pcie_enable_resources() is already called by > qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is > available at that time. > > Remove the unnecessary call from qcom_pcie_ep_probe() to prevent the > crash on X, Y, Z. > Looks good to me. I'll post v2 incorporating this. > Although I do have the question of what happens if the RC deasserts > PERST# before qcom-ep is loaded. We probably don't execute > qcom_pcie_perst_deassert() in that case, so how does the init happen? > PERST# is a level trigger signal. So even if the host has asserted it before EP booted, the level will stay low and ep will detect it while booting. If it is an edge trigger signal, then ep wouldn't be able to catch it as you suspected. - Mani
On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote: > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote: > ... > > Although I do have the question of what happens if the RC deasserts > > PERST# before qcom-ep is loaded. We probably don't execute > > qcom_pcie_perst_deassert() in that case, so how does the init happen? > > PERST# is a level trigger signal. So even if the host has asserted > it before EP booted, the level will stay low and ep will detect it > while booting. The PERST# signal itself is definitely level oriented. I'm still skeptical about the *interrupt* from the PCIe controller being level-triggered, as I mentioned here: https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas tegra194 is also dwc-based and has a similar PERST# interrupt but it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think is a cleaner implementation. Then you don't have to remember the current state, switch between high and low trigger, worry about races and missing a pulse, etc. > If it is an edge trigger signal, then ep wouldn't be able to catch > it as you suspected. Bjorn
On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote: > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote: > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote: > > ... > > > > Although I do have the question of what happens if the RC deasserts > > > PERST# before qcom-ep is loaded. We probably don't execute > > > qcom_pcie_perst_deassert() in that case, so how does the init happen? > > > > PERST# is a level trigger signal. So even if the host has asserted > > it before EP booted, the level will stay low and ep will detect it > > while booting. > > The PERST# signal itself is definitely level oriented. > > I'm still skeptical about the *interrupt* from the PCIe controller > being level-triggered, as I mentioned here: > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas > Sorry, that comment got buried into my inbox. So didn't get a chance to respond. > tegra194 is also dwc-based and has a similar PERST# interrupt but it's > edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think is a > cleaner implementation. Then you don't have to remember the current > state, switch between high and low trigger, worry about races and > missing a pulse, etc. > I did try to mimic what tegra194 did when I wrote the qcom-ep driver, but it didn't work. If we use the level triggered interrupt as edge, the interrupt will be missed if we do not listen at the right time (when PERST# goes from high to low and vice versa). I don't know how tegra194 interrupt controller is wired up, but IIUC they will need to boot the endpoint first and then host to catch the PERST# interrupt. Otherwise, the endpoint will never see the interrupt until host toggles it again. But there is no point in forcing this ordering and that was the reason why I went with the level triggered approach. - Mani
On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote: > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote: > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote: > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote: > > > ... > > > > > > Although I do have the question of what happens if the RC deasserts > > > > PERST# before qcom-ep is loaded. We probably don't execute > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen? > > > > > > PERST# is a level trigger signal. So even if the host has asserted > > > it before EP booted, the level will stay low and ep will detect it > > > while booting. > > > > The PERST# signal itself is definitely level oriented. > > > > I'm still skeptical about the *interrupt* from the PCIe controller > > being level-triggered, as I mentioned here: > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas > > Sorry, that comment got buried into my inbox. So didn't get a chance > to respond. > > > tegra194 is also dwc-based and has a similar PERST# interrupt but > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think > > is a cleaner implementation. Then you don't have to remember the > > current state, switch between high and low trigger, worry about > > races and missing a pulse, etc. > > I did try to mimic what tegra194 did when I wrote the qcom-ep > driver, but it didn't work. If we use the level triggered interrupt > as edge, the interrupt will be missed if we do not listen at the > right time (when PERST# goes from high to low and vice versa). > > I don't know how tegra194 interrupt controller is wired up, but IIUC > they will need to boot the endpoint first and then host to catch the > PERST# interrupt. Otherwise, the endpoint will never see the > interrupt until host toggles it again. Having to control the boot ordering of endpoint and host is definitely problematic. What is the nature of the crash when we try to enable the PHY when Refclk is not available? The endpoint has no control over when the host asserts/deasserts PERST#. If PERST# happens to be asserted while the endpoint is enabling the PHY, and this causes some kind of crash that the endpoint driver can't easily recover from, that's a serious robustness problem. > But there is no point in forcing this ordering and that was the > reason why I went with the level triggered approach.
On Thu, Aug 22, 2024 at 12:31:33PM -0500, Bjorn Helgaas wrote: > On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote: > > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote: > > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote: > > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote: > > > > ... > > > > > > > > Although I do have the question of what happens if the RC deasserts > > > > > PERST# before qcom-ep is loaded. We probably don't execute > > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen? > > > > > > > > PERST# is a level trigger signal. So even if the host has asserted > > > > it before EP booted, the level will stay low and ep will detect it > > > > while booting. > > > > > > The PERST# signal itself is definitely level oriented. > > > > > > I'm still skeptical about the *interrupt* from the PCIe controller > > > being level-triggered, as I mentioned here: > > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas > > > > Sorry, that comment got buried into my inbox. So didn't get a chance > > to respond. > > > > > tegra194 is also dwc-based and has a similar PERST# interrupt but > > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think > > > is a cleaner implementation. Then you don't have to remember the > > > current state, switch between high and low trigger, worry about > > > races and missing a pulse, etc. > > > > I did try to mimic what tegra194 did when I wrote the qcom-ep > > driver, but it didn't work. If we use the level triggered interrupt > > as edge, the interrupt will be missed if we do not listen at the > > right time (when PERST# goes from high to low and vice versa). > > > > I don't know how tegra194 interrupt controller is wired up, but IIUC > > they will need to boot the endpoint first and then host to catch the > > PERST# interrupt. Otherwise, the endpoint will never see the > > interrupt until host toggles it again. > > Having to control the boot ordering of endpoint and host is definitely > problematic. > > What is the nature of the crash when we try to enable the PHY when > Refclk is not available? The endpoint has no control over when the > host asserts/deasserts PERST#. If PERST# happens to be asserted while > the endpoint is enabling the PHY, and this causes some kind of crash > that the endpoint driver can't easily recover from, that's a serious > robustness problem. > The whole endpoint SoC crashes if the refclk is not available during phy_power_on() as the PHY driver tries to access some register on Dmitry's platform (I did not see this crash on SM8450 SoC though). If we keep the enable_resources() during probe() then the race condition you observed above could apply. So removing that from probe() will also make the race condition go away, - Mani
On Fri, Aug 23, 2024 at 10:11:33AM +0530, Manivannan Sadhasivam wrote: > On Thu, Aug 22, 2024 at 12:31:33PM -0500, Bjorn Helgaas wrote: > > On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote: > > > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote: > > > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote: > > > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote: > > > > > ... > > > > > > > > > > Although I do have the question of what happens if the RC deasserts > > > > > > PERST# before qcom-ep is loaded. We probably don't execute > > > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen? > > > > > > > > > > PERST# is a level trigger signal. So even if the host has asserted > > > > > it before EP booted, the level will stay low and ep will detect it > > > > > while booting. > > > > > > > > The PERST# signal itself is definitely level oriented. > > > > > > > > I'm still skeptical about the *interrupt* from the PCIe controller > > > > being level-triggered, as I mentioned here: > > > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas > > > > > > Sorry, that comment got buried into my inbox. So didn't get a chance > > > to respond. > > > > > > > tegra194 is also dwc-based and has a similar PERST# interrupt but > > > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think > > > > is a cleaner implementation. Then you don't have to remember the > > > > current state, switch between high and low trigger, worry about > > > > races and missing a pulse, etc. > > > > > > I did try to mimic what tegra194 did when I wrote the qcom-ep > > > driver, but it didn't work. If we use the level triggered interrupt > > > as edge, the interrupt will be missed if we do not listen at the > > > right time (when PERST# goes from high to low and vice versa). > > > > > > I don't know how tegra194 interrupt controller is wired up, but IIUC > > > they will need to boot the endpoint first and then host to catch the > > > PERST# interrupt. Otherwise, the endpoint will never see the > > > interrupt until host toggles it again. > > > > Having to control the boot ordering of endpoint and host is definitely > > problematic. > > > > What is the nature of the crash when we try to enable the PHY when > > Refclk is not available? The endpoint has no control over when the > > host asserts/deasserts PERST#. If PERST# happens to be asserted while > > the endpoint is enabling the PHY, and this causes some kind of crash > > that the endpoint driver can't easily recover from, that's a serious > > robustness problem. > > The whole endpoint SoC crashes if the refclk is not available during > phy_power_on() as the PHY driver tries to access some register on Dmitry's > platform (I did not see this crash on SM8450 SoC though). > > If we keep the enable_resources() during probe() then the race condition you > observed above could apply. So removing that from probe() will also make the > race condition go away, Example: 1) host deasserts PERST# 2) qcom-ep handles PERST# IRQ 3) qcom_pcie_ep_perst_irq_thread() calls qcom_pcie_perst_deassert() 4) host asserts PERST#, Refclk no longer valid 5) qcom_pcie_perst_deassert() calls qcom_pcie_enable_resources() 6) qcom_pcie_enable_resources() enables PHY I don't see what prevents the PERST# assertion at 4. It sounds like the endpoint SoC crashes at 6.
On Fri, Aug 23, 2024 at 05:04:36PM -0500, Bjorn Helgaas wrote: > On Fri, Aug 23, 2024 at 10:11:33AM +0530, Manivannan Sadhasivam wrote: > > On Thu, Aug 22, 2024 at 12:31:33PM -0500, Bjorn Helgaas wrote: > > > On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote: > > > > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote: > > > > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote: > > > > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote: > > > > > > ... > > > > > > > > > > > > Although I do have the question of what happens if the RC deasserts > > > > > > > PERST# before qcom-ep is loaded. We probably don't execute > > > > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen? > > > > > > > > > > > > PERST# is a level trigger signal. So even if the host has asserted > > > > > > it before EP booted, the level will stay low and ep will detect it > > > > > > while booting. > > > > > > > > > > The PERST# signal itself is definitely level oriented. > > > > > > > > > > I'm still skeptical about the *interrupt* from the PCIe controller > > > > > being level-triggered, as I mentioned here: > > > > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas > > > > > > > > Sorry, that comment got buried into my inbox. So didn't get a chance > > > > to respond. > > > > > > > > > tegra194 is also dwc-based and has a similar PERST# interrupt but > > > > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think > > > > > is a cleaner implementation. Then you don't have to remember the > > > > > current state, switch between high and low trigger, worry about > > > > > races and missing a pulse, etc. > > > > > > > > I did try to mimic what tegra194 did when I wrote the qcom-ep > > > > driver, but it didn't work. If we use the level triggered interrupt > > > > as edge, the interrupt will be missed if we do not listen at the > > > > right time (when PERST# goes from high to low and vice versa). > > > > > > > > I don't know how tegra194 interrupt controller is wired up, but IIUC > > > > they will need to boot the endpoint first and then host to catch the > > > > PERST# interrupt. Otherwise, the endpoint will never see the > > > > interrupt until host toggles it again. > > > > > > Having to control the boot ordering of endpoint and host is definitely > > > problematic. > > > > > > What is the nature of the crash when we try to enable the PHY when > > > Refclk is not available? The endpoint has no control over when the > > > host asserts/deasserts PERST#. If PERST# happens to be asserted while > > > the endpoint is enabling the PHY, and this causes some kind of crash > > > that the endpoint driver can't easily recover from, that's a serious > > > robustness problem. > > > > The whole endpoint SoC crashes if the refclk is not available during > > phy_power_on() as the PHY driver tries to access some register on Dmitry's > > platform (I did not see this crash on SM8450 SoC though). > > > > If we keep the enable_resources() during probe() then the race condition you > > observed above could apply. So removing that from probe() will also make the > > race condition go away, > > Example: > > 1) host deasserts PERST# > 2) qcom-ep handles PERST# IRQ > 3) qcom_pcie_ep_perst_irq_thread() calls qcom_pcie_perst_deassert() > 4) host asserts PERST#, Refclk no longer valid > 5) qcom_pcie_perst_deassert() calls qcom_pcie_enable_resources() > 6) qcom_pcie_enable_resources() enables PHY > > I don't see what prevents the PERST# assertion at 4. It sounds like > the endpoint SoC crashes at 6. IDK why host would quickly assert the PERST# after deasserting during probe() unless someone intentionally does that from host side. If that happens then there is a possibility of the endpoint SoC crash, but I'm not sure how we can avoid that. But what this patch fixes is a crash occuring in a sane scenario: 1) Endpoint boots first (no refclk from host) 2) Probe() calls qcom_pcie_enable_resources() --> Crash - Mani
On Sat, Aug 24, 2024 at 07:49:46AM +0530, Manivannan Sadhasivam wrote: > On Fri, Aug 23, 2024 at 05:04:36PM -0500, Bjorn Helgaas wrote: > > On Fri, Aug 23, 2024 at 10:11:33AM +0530, Manivannan Sadhasivam wrote: > > > On Thu, Aug 22, 2024 at 12:31:33PM -0500, Bjorn Helgaas wrote: > > > > On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote: > > > > > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote: > > > > > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote: > > > > > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote: > > > > > > > ... > > > > > > > > > > > > > > Although I do have the question of what happens if the RC deasserts > > > > > > > > PERST# before qcom-ep is loaded. We probably don't execute > > > > > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen? > > > > > > > > > > > > > > PERST# is a level trigger signal. So even if the host has asserted > > > > > > > it before EP booted, the level will stay low and ep will detect it > > > > > > > while booting. > > > > > > > > > > > > The PERST# signal itself is definitely level oriented. > > > > > > > > > > > > I'm still skeptical about the *interrupt* from the PCIe controller > > > > > > being level-triggered, as I mentioned here: > > > > > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas > > > > > > > > > > Sorry, that comment got buried into my inbox. So didn't get a chance > > > > > to respond. > > > > > > > > > > > tegra194 is also dwc-based and has a similar PERST# interrupt but > > > > > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think > > > > > > is a cleaner implementation. Then you don't have to remember the > > > > > > current state, switch between high and low trigger, worry about > > > > > > races and missing a pulse, etc. > > > > > > > > > > I did try to mimic what tegra194 did when I wrote the qcom-ep > > > > > driver, but it didn't work. If we use the level triggered interrupt > > > > > as edge, the interrupt will be missed if we do not listen at the > > > > > right time (when PERST# goes from high to low and vice versa). > > > > > > > > > > I don't know how tegra194 interrupt controller is wired up, but IIUC > > > > > they will need to boot the endpoint first and then host to catch the > > > > > PERST# interrupt. Otherwise, the endpoint will never see the > > > > > interrupt until host toggles it again. > > > > > > > > Having to control the boot ordering of endpoint and host is definitely > > > > problematic. > > > > > > > > What is the nature of the crash when we try to enable the PHY when > > > > Refclk is not available? The endpoint has no control over when the > > > > host asserts/deasserts PERST#. If PERST# happens to be asserted while > > > > the endpoint is enabling the PHY, and this causes some kind of crash > > > > that the endpoint driver can't easily recover from, that's a serious > > > > robustness problem. > > > > > > The whole endpoint SoC crashes if the refclk is not available during > > > phy_power_on() as the PHY driver tries to access some register on Dmitry's > > > platform (I did not see this crash on SM8450 SoC though). I don't think the nature of this crash has been explained, so I don't know whether it's a recoverable situation or not. > > > If we keep the enable_resources() during probe() then the race > > > condition you observed above could apply. So removing that from > > > probe() will also make the race condition go away, > > > > Example: > > > > 1) host deasserts PERST# > > 2) qcom-ep handles PERST# IRQ > > 3) qcom_pcie_ep_perst_irq_thread() calls qcom_pcie_perst_deassert() > > 4) host asserts PERST#, Refclk no longer valid > > 5) qcom_pcie_perst_deassert() calls qcom_pcie_enable_resources() > > 6) qcom_pcie_enable_resources() enables PHY > > > > I don't see what prevents the PERST# assertion at 4. It sounds like > > the endpoint SoC crashes at 6. > > IDK why host would quickly assert the PERST# after deasserting > during probe() unless someone intentionally does that from host > side. I think the host is allowed to assert PERST# at any arbitrary time, so an endpoint should be able to handle it no matter when it happens. > If that happens then there is a possibility of the endpoint SoC > crash, but I'm not sure how we can avoid that. > > But what this patch fixes is a crash occuring in a sane scenario: > > 1) Endpoint boots first (no refclk from host) > 2) Probe() calls qcom_pcie_enable_resources() --> Crash I agree with this, although I think it's more of a band-aid than a complete solution. I don't have access to any SoC or PCIe controller docs, so maybe this is a hardware design problem and this is the best we can do. Bjorn
On Sat, Aug 24, 2024 at 11:12:34AM -0500, Bjorn Helgaas wrote: > On Sat, Aug 24, 2024 at 07:49:46AM +0530, Manivannan Sadhasivam wrote: > > On Fri, Aug 23, 2024 at 05:04:36PM -0500, Bjorn Helgaas wrote: > > > On Fri, Aug 23, 2024 at 10:11:33AM +0530, Manivannan Sadhasivam wrote: > > > > On Thu, Aug 22, 2024 at 12:31:33PM -0500, Bjorn Helgaas wrote: > > > > > On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote: > > > > > > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote: > > > > > > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote: > > > > > > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote: > > > > > > > > ... > > > > > > > > > > > > > > > > Although I do have the question of what happens if the RC deasserts > > > > > > > > > PERST# before qcom-ep is loaded. We probably don't execute > > > > > > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen? > > > > > > > > > > > > > > > > PERST# is a level trigger signal. So even if the host has asserted > > > > > > > > it before EP booted, the level will stay low and ep will detect it > > > > > > > > while booting. > > > > > > > > > > > > > > The PERST# signal itself is definitely level oriented. > > > > > > > > > > > > > > I'm still skeptical about the *interrupt* from the PCIe controller > > > > > > > being level-triggered, as I mentioned here: > > > > > > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas > > > > > > > > > > > > Sorry, that comment got buried into my inbox. So didn't get a chance > > > > > > to respond. > > > > > > > > > > > > > tegra194 is also dwc-based and has a similar PERST# interrupt but > > > > > > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think > > > > > > > is a cleaner implementation. Then you don't have to remember the > > > > > > > current state, switch between high and low trigger, worry about > > > > > > > races and missing a pulse, etc. > > > > > > > > > > > > I did try to mimic what tegra194 did when I wrote the qcom-ep > > > > > > driver, but it didn't work. If we use the level triggered interrupt > > > > > > as edge, the interrupt will be missed if we do not listen at the > > > > > > right time (when PERST# goes from high to low and vice versa). > > > > > > > > > > > > I don't know how tegra194 interrupt controller is wired up, but IIUC > > > > > > they will need to boot the endpoint first and then host to catch the > > > > > > PERST# interrupt. Otherwise, the endpoint will never see the > > > > > > interrupt until host toggles it again. > > > > > > > > > > Having to control the boot ordering of endpoint and host is definitely > > > > > problematic. > > > > > > > > > > What is the nature of the crash when we try to enable the PHY when > > > > > Refclk is not available? The endpoint has no control over when the > > > > > host asserts/deasserts PERST#. If PERST# happens to be asserted while > > > > > the endpoint is enabling the PHY, and this causes some kind of crash > > > > > that the endpoint driver can't easily recover from, that's a serious > > > > > robustness problem. > > > > > > > > The whole endpoint SoC crashes if the refclk is not available during > > > > phy_power_on() as the PHY driver tries to access some register on Dmitry's > > > > platform (I did not see this crash on SM8450 SoC though). > > I don't think the nature of this crash has been explained, so I don't > know whether it's a recoverable situation or not. > I will add this info in the commit message. > > > > If we keep the enable_resources() during probe() then the race > > > > condition you observed above could apply. So removing that from > > > > probe() will also make the race condition go away, > > > > > > Example: > > > > > > 1) host deasserts PERST# > > > 2) qcom-ep handles PERST# IRQ > > > 3) qcom_pcie_ep_perst_irq_thread() calls qcom_pcie_perst_deassert() > > > 4) host asserts PERST#, Refclk no longer valid > > > 5) qcom_pcie_perst_deassert() calls qcom_pcie_enable_resources() > > > 6) qcom_pcie_enable_resources() enables PHY > > > > > > I don't see what prevents the PERST# assertion at 4. It sounds like > > > the endpoint SoC crashes at 6. > > > > IDK why host would quickly assert the PERST# after deasserting > > during probe() unless someone intentionally does that from host > > side. > > I think the host is allowed to assert PERST# at any arbitrary time, so > an endpoint should be able to handle it no matter when it happens. > > > If that happens then there is a possibility of the endpoint SoC > > crash, but I'm not sure how we can avoid that. > > > > But what this patch fixes is a crash occuring in a sane scenario: > > > > 1) Endpoint boots first (no refclk from host) > > 2) Probe() calls qcom_pcie_enable_resources() --> Crash > > I agree with this, although I think it's more of a band-aid than a > complete solution. I don't have access to any SoC or PCIe controller > docs, so maybe this is a hardware design problem and this is the best > we can do. > I agree. But AFAIK there is no way endpoint can avoid this crash unless it generates its own clock. I did some investigation on the SRIS support and able to get it work in my local branch. I will try to upstream that feature for the currerntly supported Qcom SoCs in endpoint mode. But Qcom told me that non-SRIS mode is also required by some customers, so unfortunately we cannot make it as the default operating mode. - Mani
Hello,
[...]
> Applied to controller/qcom, thank you!
Based on the conversation here, I removed this patch from the branch.
Krzysztof
diff --git a/drivers/pci/controller/dwc/pcie-qcom-ep.c b/drivers/pci/controller/dwc/pcie-qcom-ep.c index 236229f66c80..2319ff2ae9f6 100644 --- a/drivers/pci/controller/dwc/pcie-qcom-ep.c +++ b/drivers/pci/controller/dwc/pcie-qcom-ep.c @@ -846,21 +846,15 @@ static int qcom_pcie_ep_probe(struct platform_device *pdev) if (ret) return ret; - ret = qcom_pcie_enable_resources(pcie_ep); - if (ret) { - dev_err(dev, "Failed to enable resources: %d\n", ret); - return ret; - } - ret = dw_pcie_ep_init(&pcie_ep->pci.ep); if (ret) { dev_err(dev, "Failed to initialize endpoint: %d\n", ret); - goto err_disable_resources; + return ret; } ret = qcom_pcie_ep_enable_irq_resources(pdev, pcie_ep); if (ret) - goto err_disable_resources; + goto err_ep_deinit; name = devm_kasprintf(dev, GFP_KERNEL, "%pOFP", dev->of_node); if (!name) { @@ -877,8 +871,8 @@ static int qcom_pcie_ep_probe(struct platform_device *pdev) disable_irq(pcie_ep->global_irq); disable_irq(pcie_ep->perst_irq); -err_disable_resources: - qcom_pcie_disable_resources(pcie_ep); +err_ep_deinit: + dw_pcie_ep_deinit(&pcie_ep->pci.ep); return ret; }
Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure for drivers requiring refclk from host"), all the hardware register access (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk from host. So there is no need to enable the endpoint resources (like clk, regulators, PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources() helper from probe(). This was added earlier because dw_pcie_ep_init() was doing DBI access, which is not done now. While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the EP controller in the case of failure. Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> --- drivers/pci/controller/dwc/pcie-qcom-ep.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-)