PCI: qcom-ep: Do not enable resources during probe()

Message ID	20240727090604.24646-1-manivannan.sadhasivam@linaro.org (mailing list archive)
State	Accepted
Delegated to:	Krzysztof Wilczyński
Headers	show Received: from mail-il1-f179.google.com (mail-il1-f179.google.com [209.85.166.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CE8F647F4D for <linux-pci@vger.kernel.org>; Sat, 27 Jul 2024 09:06:11 +0000 (UTC) From: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> To: lpieralisi@kernel.org, kw@linux.com Cc: robh@kernel.org, bhelgaas@google.com, linux-arm-msm@vger.kernel.org, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Subject: [PATCH] PCI: qcom-ep: Do not enable resources during probe() Date: Sat, 27 Jul 2024 14:36:04 +0530 Message-Id: <20240727090604.24646-1-manivannan.sadhasivam@linaro.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	PCI: qcom-ep: Do not enable resources during probe() \| expand PCI: qcom-ep: Do not enable resources during probe()

Manivannan Sadhasivam July 27, 2024, 9:06 a.m. UTC

Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure
for drivers requiring refclk from host"), all the hardware register access
(like DBI) were moved to dw_pcie_ep_init_registers() which gets called only
in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk
from host.

So there is no need to enable the endpoint resources (like clk, regulators,
PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources()
helper from probe(). This was added earlier because dw_pcie_ep_init() was
doing DBI access, which is not done now.

While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the
EP controller in the case of failure.

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
---
 drivers/pci/controller/dwc/pcie-qcom-ep.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

Dmitry Baryshkov July 27, 2024, 10:32 a.m. UTC | #1

On Sat, Jul 27, 2024 at 02:36:04PM GMT, Manivannan Sadhasivam wrote:
> Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure
> for drivers requiring refclk from host"), all the hardware register access
> (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only
> in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk
> from host.
> 
> So there is no need to enable the endpoint resources (like clk, regulators,
> PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources()
> helper from probe(). This was added earlier because dw_pcie_ep_init() was
> doing DBI access, which is not done now.

... moreover his makes PCIe EP fail on some of the platforms as powering
on PHY requires refclk from the RC side, which is not enabled at the
probe time.

Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>

> While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the
> EP controller in the case of failure.
> 
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> ---
>  drivers/pci/controller/dwc/pcie-qcom-ep.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)

Manivannan Sadhasivam Aug. 13, 2024, 5:02 p.m. UTC | #2

On Sat, Jul 27, 2024 at 01:32:40PM +0300, Dmitry Baryshkov wrote:
> On Sat, Jul 27, 2024 at 02:36:04PM GMT, Manivannan Sadhasivam wrote:
> > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure
> > for drivers requiring refclk from host"), all the hardware register access
> > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only
> > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk
> > from host.
> > 
> > So there is no need to enable the endpoint resources (like clk, regulators,
> > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources()
> > helper from probe(). This was added earlier because dw_pcie_ep_init() was
> > doing DBI access, which is not done now.
> 
> ... moreover his makes PCIe EP fail on some of the platforms as powering
> on PHY requires refclk from the RC side, which is not enabled at the
> probe time.
> 

Yeah. I hope Bjorn/Krzysztof could add this to the commit message while
applying.

> Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
> 

Thanks!

- Mani

> > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the
> > EP controller in the case of failure.
> > 
> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > ---
> >  drivers/pci/controller/dwc/pcie-qcom-ep.c | 14 ++++----------
> >  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> -- 
> With best wishes
> Dmitry

Krzysztof Wilczyński Aug. 13, 2024, 8:25 p.m. UTC | #3

Hello,

> Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure
> for drivers requiring refclk from host"), all the hardware register access
> (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only
> in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk
> from host.
> 
> So there is no need to enable the endpoint resources (like clk, regulators,
> PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources()
> helper from probe(). This was added earlier because dw_pcie_ep_init() was
> doing DBI access, which is not done now.
> 
> While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the
> EP controller in the case of failure.

Applied to controller/qcom, thank you!

[1/1] PCI: qcom-ep: Do not enable resources during probe()
      https://git.kernel.org/pci/pci/c/cd0b3e13ec30

	Krzysztof

Krzysztof Wilczyński Aug. 13, 2024, 8:27 p.m. UTC | #4

Hello,

[...]
> > > So there is no need to enable the endpoint resources (like clk, regulators,
> > > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources()
> > > helper from probe(). This was added earlier because dw_pcie_ep_init() was
> > > doing DBI access, which is not done now.
> > 
> > ... moreover his makes PCIe EP fail on some of the platforms as powering
> > on PHY requires refclk from the RC side, which is not enabled at the
> > probe time.
> > 
> 
> Yeah. I hope Bjorn/Krzysztof could add this to the commit message while
> applying.

No worries. Added to the commit log.

	Krzysztof

Bjorn Helgaas Aug. 15, 2024, 6:15 p.m. UTC | #5

On Sat, Jul 27, 2024 at 02:36:04PM +0530, Manivannan Sadhasivam wrote:
> Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure
> for drivers requiring refclk from host"), all the hardware register access
> (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only
> in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk
> from host.
> 
> So there is no need to enable the endpoint resources (like clk, regulators,
> PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources()
> helper from probe(). This was added earlier because dw_pcie_ep_init() was
> doing DBI access, which is not done now.
> 
> While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the
> EP controller in the case of failure.

Is this v6.11 material?  If so, we need a little more justification
than "no need to enable".

> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> ---
>  drivers/pci/controller/dwc/pcie-qcom-ep.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pcie-qcom-ep.c b/drivers/pci/controller/dwc/pcie-qcom-ep.c
> index 236229f66c80..2319ff2ae9f6 100644
> --- a/drivers/pci/controller/dwc/pcie-qcom-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-qcom-ep.c
> @@ -846,21 +846,15 @@ static int qcom_pcie_ep_probe(struct platform_device *pdev)
>  	if (ret)
>  		return ret;
>  
> -	ret = qcom_pcie_enable_resources(pcie_ep);
> -	if (ret) {
> -		dev_err(dev, "Failed to enable resources: %d\n", ret);
> -		return ret;
> -	}
> -
>  	ret = dw_pcie_ep_init(&pcie_ep->pci.ep);
>  	if (ret) {
>  		dev_err(dev, "Failed to initialize endpoint: %d\n", ret);
> -		goto err_disable_resources;
> +		return ret;
>  	}
>  
>  	ret = qcom_pcie_ep_enable_irq_resources(pdev, pcie_ep);
>  	if (ret)
> -		goto err_disable_resources;
> +		goto err_ep_deinit;
>  
>  	name = devm_kasprintf(dev, GFP_KERNEL, "%pOFP", dev->of_node);
>  	if (!name) {
> @@ -877,8 +871,8 @@ static int qcom_pcie_ep_probe(struct platform_device *pdev)
>  	disable_irq(pcie_ep->global_irq);
>  	disable_irq(pcie_ep->perst_irq);
>  
> -err_disable_resources:
> -	qcom_pcie_disable_resources(pcie_ep);
> +err_ep_deinit:
> +	dw_pcie_ep_deinit(&pcie_ep->pci.ep);
>  
>  	return ret;
>  }
> -- 
> 2.25.1
>

Manivannan Sadhasivam Aug. 16, 2024, 5:37 a.m. UTC | #6

On Thu, Aug 15, 2024 at 01:15:57PM -0500, Bjorn Helgaas wrote:
> On Sat, Jul 27, 2024 at 02:36:04PM +0530, Manivannan Sadhasivam wrote:
> > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure
> > for drivers requiring refclk from host"), all the hardware register access
> > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only
> > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk
> > from host.
> > 
> > So there is no need to enable the endpoint resources (like clk, regulators,
> > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources()
> > helper from probe(). This was added earlier because dw_pcie_ep_init() was
> > doing DBI access, which is not done now.
> > 
> > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the
> > EP controller in the case of failure.
> 
> Is this v6.11 material?  If so, we need a little more justification
> than "no need to enable".
> 

That's why I asked to merge the comment from Dmitry:

"...moreover his makes PCIe EP fail on some of the platforms as powering on PHY
requires refclk from the RC side, which is not enabled at the probe time."

And Krzysztof did that:
https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?h=controller/qcom&id=cd0b3e13ec309dcbe1efb66b1969fc72088b791d

- Mani

> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > ---
> >  drivers/pci/controller/dwc/pcie-qcom-ep.c | 14 ++++----------
> >  1 file changed, 4 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/pci/controller/dwc/pcie-qcom-ep.c b/drivers/pci/controller/dwc/pcie-qcom-ep.c
> > index 236229f66c80..2319ff2ae9f6 100644
> > --- a/drivers/pci/controller/dwc/pcie-qcom-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-qcom-ep.c
> > @@ -846,21 +846,15 @@ static int qcom_pcie_ep_probe(struct platform_device *pdev)
> >  	if (ret)
> >  		return ret;
> >  
> > -	ret = qcom_pcie_enable_resources(pcie_ep);
> > -	if (ret) {
> > -		dev_err(dev, "Failed to enable resources: %d\n", ret);
> > -		return ret;
> > -	}
> > -
> >  	ret = dw_pcie_ep_init(&pcie_ep->pci.ep);
> >  	if (ret) {
> >  		dev_err(dev, "Failed to initialize endpoint: %d\n", ret);
> > -		goto err_disable_resources;
> > +		return ret;
> >  	}
> >  
> >  	ret = qcom_pcie_ep_enable_irq_resources(pdev, pcie_ep);
> >  	if (ret)
> > -		goto err_disable_resources;
> > +		goto err_ep_deinit;
> >  
> >  	name = devm_kasprintf(dev, GFP_KERNEL, "%pOFP", dev->of_node);
> >  	if (!name) {
> > @@ -877,8 +871,8 @@ static int qcom_pcie_ep_probe(struct platform_device *pdev)
> >  	disable_irq(pcie_ep->global_irq);
> >  	disable_irq(pcie_ep->perst_irq);
> >  
> > -err_disable_resources:
> > -	qcom_pcie_disable_resources(pcie_ep);
> > +err_ep_deinit:
> > +	dw_pcie_ep_deinit(&pcie_ep->pci.ep);
> >  
> >  	return ret;
> >  }
> > -- 
> > 2.25.1
> >

Bjorn Helgaas Aug. 16, 2024, 12:02 p.m. UTC | #7

On Fri, Aug 16, 2024 at 11:07:57AM +0530, Manivannan Sadhasivam wrote:
> On Thu, Aug 15, 2024 at 01:15:57PM -0500, Bjorn Helgaas wrote:
> > On Sat, Jul 27, 2024 at 02:36:04PM +0530, Manivannan Sadhasivam wrote:
> > > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure
> > > for drivers requiring refclk from host"), all the hardware register access
> > > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only
> > > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk
> > > from host.
> > > 
> > > So there is no need to enable the endpoint resources (like clk, regulators,
> > > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources()
> > > helper from probe(). This was added earlier because dw_pcie_ep_init() was
> > > doing DBI access, which is not done now.
> > > 
> > > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the
> > > EP controller in the case of failure.
> > 
> > Is this v6.11 material?  If so, we need a little more justification
> > than "no need to enable".
> 
> That's why I asked to merge the comment from Dmitry:
> 
> "...moreover his makes PCIe EP fail on some of the platforms as powering on PHY
> requires refclk from the RC side, which is not enabled at the probe time."

The patch was posted and described basically as a cleanup of something
that was unnecessary but not harmful, which is not post-merge window
material.

If it is in fact a critical fix, that needs to be the central point of
the commit log, not something tacked on with "moreover" or
"additionally".  And we need something like a Fixes: tag so we know
when the problem was introduced and where to backport this.

Bjorn

Bjorn Helgaas Aug. 21, 2024, 10:56 p.m. UTC | #8

On Wed, Aug 14, 2024 at 05:25:47AM +0900, Krzysztof Wilczyński wrote:
> > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure
> > for drivers requiring refclk from host"), all the hardware register access
> > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only
> > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk
> > from host.
> > 
> > So there is no need to enable the endpoint resources (like clk, regulators,
> > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources()
> > helper from probe(). This was added earlier because dw_pcie_ep_init() was
> > doing DBI access, which is not done now.
> > 
> > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the
> > EP controller in the case of failure.
> 
> Applied to controller/qcom, thank you!
> 
> [1/1] PCI: qcom-ep: Do not enable resources during probe()
>       https://git.kernel.org/pci/pci/c/cd0b3e13ec30

I think we do need this, but I dropped it for now pending a commit log
that says "we're fixing a crash" and explains how.

The current log says "869bc5253406 moved hardware register access like
DBI to dw_pcie_ep_init_registers()", but 869bc5253406 actually moved a
bunch of register accesses from dw_pcie_ep_init() to
dw_pcie_ep_init_complete(), and a subsequent patch renamed
dw_pcie_ep_init_complete() to dw_pcie_ep_init_registers().  I did
eventually figure out the rename, but it took a while to make that
leap.

It also says dw_pcie_ep_init_registers() is called only from
qcom_pcie_perst_deassert(), but obviously all drivers call it.  I
think what you meant is that on qcom and tegra194,
dw_pcie_ep_init_registers() isn't called from .probe(); it's called
later because they require refclk to access the registers, so qcom and
tegra194 call it after PERST# is deasserted, because then refclk is
available.

Trying to understand the 869bc5253406 reference: I guess the
point is that the dw_pcie_ep_init_registers() work depends on
qcom_pcie_enable_resources(), and before 869bc5253406, that work was
done by qcom_pcie_ep_probe() calling dw_pcie_ep_init(), so it had to
call qcom_pcie_enable_resources() first.

But after 869bc5253406, dw_pcie_ep_init_registers() is done in
qcom_pcie_perst_deassert(), which already calls
qcom_pcie_enable_resources().  So qcom_pcie_ep_probe() no longer needs
to call qcom_pcie_enable_resources().

As far as the *crash*, phy_power_on() has been called from
qcom_pcie_ep_probe() since the very beginning in f55fee56a631 ("PCI:
qcom-ep: Add Qualcomm PCIe Endpoint controller driver").  But
apparently on some new platforms phy_power_on() depends on refclk and
(I assume) it causes a crash when done from qcom_pcie_ep_probe().

So I would think the commit log should look something like this:

  PCI: qcom-ep: Postpone PHY power-on until refclk available

  qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and
  powers on PHYs.  On new platforms like X, Y, Z, this depends on
  refclk from the RC, which may not be available at the time of
  qcom_pcie_ep_probe(), so this causes a crash in the qcom-ep driver.

  qcom_pcie_enable_resources() is already called by
  qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is
  available at that time.

  Remove the unnecessary call from qcom_pcie_ep_probe() to prevent the
  crash on X, Y, Z.

Although I do have the question of what happens if the RC deasserts
PERST# before qcom-ep is loaded.  We probably don't execute
qcom_pcie_perst_deassert() in that case, so how does the init happen?

Bjorn

Manivannan Sadhasivam Aug. 22, 2024, 6:48 a.m. UTC | #9

On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote:
> On Wed, Aug 14, 2024 at 05:25:47AM +0900, Krzysztof Wilczyński wrote:
> > > Starting from commit 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure
> > > for drivers requiring refclk from host"), all the hardware register access
> > > (like DBI) were moved to dw_pcie_ep_init_registers() which gets called only
> > > in qcom_pcie_perst_deassert() i.e., only after the endpoint received refclk
> > > from host.
> > > 
> > > So there is no need to enable the endpoint resources (like clk, regulators,
> > > PHY) during probe(). Hence, remove the call to qcom_pcie_enable_resources()
> > > helper from probe(). This was added earlier because dw_pcie_ep_init() was
> > > doing DBI access, which is not done now.
> > > 
> > > While at it, let's also call dw_pcie_ep_deinit() in err path to deinit the
> > > EP controller in the case of failure.
> > 
> > Applied to controller/qcom, thank you!
> > 
> > [1/1] PCI: qcom-ep: Do not enable resources during probe()
> >       https://git.kernel.org/pci/pci/c/cd0b3e13ec30
> 
> I think we do need this, but I dropped it for now pending a commit log
> that says "we're fixing a crash" and explains how.
> 
> The current log says "869bc5253406 moved hardware register access like
> DBI to dw_pcie_ep_init_registers()", but 869bc5253406 actually moved a
> bunch of register accesses from dw_pcie_ep_init() to
> dw_pcie_ep_init_complete(), and a subsequent patch renamed
> dw_pcie_ep_init_complete() to dw_pcie_ep_init_registers().  I did
> eventually figure out the rename, but it took a while to make that
> leap.
> 

Ah, sorry about that. I somehow managed to merge those two commits in my mind.

> It also says dw_pcie_ep_init_registers() is called only from
> qcom_pcie_perst_deassert(), but obviously all drivers call it.  I
> think what you meant is that on qcom and tegra194,

Yeah, the context is only applicable to the respective SoCs.

> dw_pcie_ep_init_registers() isn't called from .probe(); it's called
> later because they require refclk to access the registers, so qcom and
> tegra194 call it after PERST# is deasserted, because then refclk is
> available.
> 
> Trying to understand the 869bc5253406 reference: I guess the
> point is that the dw_pcie_ep_init_registers() work depends on
> qcom_pcie_enable_resources(), and before 869bc5253406, that work was
> done by qcom_pcie_ep_probe() calling dw_pcie_ep_init(), so it had to
> call qcom_pcie_enable_resources() first.
> 
> But after 869bc5253406, dw_pcie_ep_init_registers() is done in
> qcom_pcie_perst_deassert(), which already calls
> qcom_pcie_enable_resources().  So qcom_pcie_ep_probe() no longer needs
> to call qcom_pcie_enable_resources().
> 

Right.

> As far as the *crash*, phy_power_on() has been called from
> qcom_pcie_ep_probe() since the very beginning in f55fee56a631 ("PCI:
> qcom-ep: Add Qualcomm PCIe Endpoint controller driver").  But
> apparently on some new platforms phy_power_on() depends on refclk and
> (I assume) it causes a crash when done from qcom_pcie_ep_probe().
> 

Yeah. The actual intention of the patch was to get rid of the
qcom_pcie_enable_resources() from probe() as it is now not needed. But later
Dmitry pointed out that qcom_pcie_enable_resources() is also causing a crash on
his platform because phy_power_on() depends on refclk from host. So he had to
manually boot the host first and then endpoint to avoid the crash (without this
patch).

Then I thought that this is an RC candidate as it fixes a hard crash and asked
to merge the patch ammending the commit message with Dmitry's info.

But now I understood that I should've reworded the commit message to make it
look like an RC material.

> So I would think the commit log should look something like this:
> 
>   PCI: qcom-ep: Postpone PHY power-on until refclk available
> 
>   qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and
>   powers on PHYs.  On new platforms like X, Y, Z, this depends on
>   refclk from the RC, which may not be available at the time of
>   qcom_pcie_ep_probe(), so this causes a crash in the qcom-ep driver.
> 
>   qcom_pcie_enable_resources() is already called by
>   qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is
>   available at that time.
> 
>   Remove the unnecessary call from qcom_pcie_ep_probe() to prevent the
>   crash on X, Y, Z.
> 

Looks good to me. I'll post v2 incorporating this.

> Although I do have the question of what happens if the RC deasserts
> PERST# before qcom-ep is loaded.  We probably don't execute
> qcom_pcie_perst_deassert() in that case, so how does the init happen?
> 

PERST# is a level trigger signal. So even if the host has asserted it before EP
booted, the level will stay low and ep will detect it while booting.

If it is an edge trigger signal, then ep wouldn't be able to catch it as you
suspected.

- Mani

Bjorn Helgaas Aug. 22, 2024, 3:16 p.m. UTC | #10

On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote:
> On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote:
> ...

> > Although I do have the question of what happens if the RC deasserts
> > PERST# before qcom-ep is loaded.  We probably don't execute
> > qcom_pcie_perst_deassert() in that case, so how does the init happen?
> 
> PERST# is a level trigger signal. So even if the host has asserted
> it before EP booted, the level will stay low and ep will detect it
> while booting.

The PERST# signal itself is definitely level oriented.

I'm still skeptical about the *interrupt* from the PCIe controller
being level-triggered, as I mentioned here:
https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas

tegra194 is also dwc-based and has a similar PERST# interrupt but it's
edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think is a
cleaner implementation.  Then you don't have to remember the current
state, switch between high and low trigger, worry about races and
missing a pulse, etc.

> If it is an edge trigger signal, then ep wouldn't be able to catch
> it as you suspected.

Bjorn

Manivannan Sadhasivam Aug. 22, 2024, 3:40 p.m. UTC | #11

On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote:
> On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote:
> > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote:
> > ...
> 
> > > Although I do have the question of what happens if the RC deasserts
> > > PERST# before qcom-ep is loaded.  We probably don't execute
> > > qcom_pcie_perst_deassert() in that case, so how does the init happen?
> > 
> > PERST# is a level trigger signal. So even if the host has asserted
> > it before EP booted, the level will stay low and ep will detect it
> > while booting.
> 
> The PERST# signal itself is definitely level oriented.
> 
> I'm still skeptical about the *interrupt* from the PCIe controller
> being level-triggered, as I mentioned here:
> https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas
> 

Sorry, that comment got buried into my inbox. So didn't get a chance to respond.

> tegra194 is also dwc-based and has a similar PERST# interrupt but it's
> edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think is a
> cleaner implementation.  Then you don't have to remember the current
> state, switch between high and low trigger, worry about races and
> missing a pulse, etc.
> 

I did try to mimic what tegra194 did when I wrote the qcom-ep driver, but it
didn't work. If we use the level triggered interrupt as edge, the interrupt will
be missed if we do not listen at the right time (when PERST# goes from high to
low and vice versa).

I don't know how tegra194 interrupt controller is wired up, but IIUC they will
need to boot the endpoint first and then host to catch the PERST# interrupt.
Otherwise, the endpoint will never see the interrupt until host toggles it
again.

But there is no point in forcing this ordering and that was the reason why I
went with the level triggered approach.

- Mani

Bjorn Helgaas Aug. 22, 2024, 5:31 p.m. UTC | #12

On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote:
> On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote:
> > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote:
> > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote:
> > > ...
> > 
> > > > Although I do have the question of what happens if the RC deasserts
> > > > PERST# before qcom-ep is loaded.  We probably don't execute
> > > > qcom_pcie_perst_deassert() in that case, so how does the init happen?
> > > 
> > > PERST# is a level trigger signal. So even if the host has asserted
> > > it before EP booted, the level will stay low and ep will detect it
> > > while booting.
> > 
> > The PERST# signal itself is definitely level oriented.
> > 
> > I'm still skeptical about the *interrupt* from the PCIe controller
> > being level-triggered, as I mentioned here:
> > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas
> 
> Sorry, that comment got buried into my inbox. So didn't get a chance
> to respond.
> 
> > tegra194 is also dwc-based and has a similar PERST# interrupt but
> > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think
> > is a cleaner implementation.  Then you don't have to remember the
> > current state, switch between high and low trigger, worry about
> > races and missing a pulse, etc.
> 
> I did try to mimic what tegra194 did when I wrote the qcom-ep
> driver, but it didn't work. If we use the level triggered interrupt
> as edge, the interrupt will be missed if we do not listen at the
> right time (when PERST# goes from high to low and vice versa).
> 
> I don't know how tegra194 interrupt controller is wired up, but IIUC
> they will need to boot the endpoint first and then host to catch the
> PERST# interrupt.  Otherwise, the endpoint will never see the
> interrupt until host toggles it again.

Having to control the boot ordering of endpoint and host is definitely
problematic.

What is the nature of the crash when we try to enable the PHY when
Refclk is not available?  The endpoint has no control over when the
host asserts/deasserts PERST#.  If PERST# happens to be asserted while
the endpoint is enabling the PHY, and this causes some kind of crash
that the endpoint driver can't easily recover from, that's a serious
robustness problem.

> But there is no point in forcing this ordering and that was the
> reason why I went with the level triggered approach.

Manivannan Sadhasivam Aug. 23, 2024, 4:41 a.m. UTC | #13

On Thu, Aug 22, 2024 at 12:31:33PM -0500, Bjorn Helgaas wrote:
> On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote:
> > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote:
> > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote:
> > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote:
> > > > ...
> > > 
> > > > > Although I do have the question of what happens if the RC deasserts
> > > > > PERST# before qcom-ep is loaded.  We probably don't execute
> > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen?
> > > > 
> > > > PERST# is a level trigger signal. So even if the host has asserted
> > > > it before EP booted, the level will stay low and ep will detect it
> > > > while booting.
> > > 
> > > The PERST# signal itself is definitely level oriented.
> > > 
> > > I'm still skeptical about the *interrupt* from the PCIe controller
> > > being level-triggered, as I mentioned here:
> > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas
> > 
> > Sorry, that comment got buried into my inbox. So didn't get a chance
> > to respond.
> > 
> > > tegra194 is also dwc-based and has a similar PERST# interrupt but
> > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think
> > > is a cleaner implementation.  Then you don't have to remember the
> > > current state, switch between high and low trigger, worry about
> > > races and missing a pulse, etc.
> > 
> > I did try to mimic what tegra194 did when I wrote the qcom-ep
> > driver, but it didn't work. If we use the level triggered interrupt
> > as edge, the interrupt will be missed if we do not listen at the
> > right time (when PERST# goes from high to low and vice versa).
> > 
> > I don't know how tegra194 interrupt controller is wired up, but IIUC
> > they will need to boot the endpoint first and then host to catch the
> > PERST# interrupt.  Otherwise, the endpoint will never see the
> > interrupt until host toggles it again.
> 
> Having to control the boot ordering of endpoint and host is definitely
> problematic.
> 
> What is the nature of the crash when we try to enable the PHY when
> Refclk is not available?  The endpoint has no control over when the
> host asserts/deasserts PERST#.  If PERST# happens to be asserted while
> the endpoint is enabling the PHY, and this causes some kind of crash
> that the endpoint driver can't easily recover from, that's a serious
> robustness problem.
> 

The whole endpoint SoC crashes if the refclk is not available during
phy_power_on() as the PHY driver tries to access some register on Dmitry's
platform (I did not see this crash on SM8450 SoC though).

If we keep the enable_resources() during probe() then the race condition you
observed above could apply. So removing that from probe() will also make the
race condition go away,

- Mani

Bjorn Helgaas Aug. 23, 2024, 10:04 p.m. UTC | #14

On Fri, Aug 23, 2024 at 10:11:33AM +0530, Manivannan Sadhasivam wrote:
> On Thu, Aug 22, 2024 at 12:31:33PM -0500, Bjorn Helgaas wrote:
> > On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote:
> > > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote:
> > > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote:
> > > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote:
> > > > > ...
> > > > 
> > > > > > Although I do have the question of what happens if the RC deasserts
> > > > > > PERST# before qcom-ep is loaded.  We probably don't execute
> > > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen?
> > > > > 
> > > > > PERST# is a level trigger signal. So even if the host has asserted
> > > > > it before EP booted, the level will stay low and ep will detect it
> > > > > while booting.
> > > > 
> > > > The PERST# signal itself is definitely level oriented.
> > > > 
> > > > I'm still skeptical about the *interrupt* from the PCIe controller
> > > > being level-triggered, as I mentioned here:
> > > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas
> > > 
> > > Sorry, that comment got buried into my inbox. So didn't get a chance
> > > to respond.
> > > 
> > > > tegra194 is also dwc-based and has a similar PERST# interrupt but
> > > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think
> > > > is a cleaner implementation.  Then you don't have to remember the
> > > > current state, switch between high and low trigger, worry about
> > > > races and missing a pulse, etc.
> > > 
> > > I did try to mimic what tegra194 did when I wrote the qcom-ep
> > > driver, but it didn't work. If we use the level triggered interrupt
> > > as edge, the interrupt will be missed if we do not listen at the
> > > right time (when PERST# goes from high to low and vice versa).
> > > 
> > > I don't know how tegra194 interrupt controller is wired up, but IIUC
> > > they will need to boot the endpoint first and then host to catch the
> > > PERST# interrupt.  Otherwise, the endpoint will never see the
> > > interrupt until host toggles it again.
> > 
> > Having to control the boot ordering of endpoint and host is definitely
> > problematic.
> > 
> > What is the nature of the crash when we try to enable the PHY when
> > Refclk is not available?  The endpoint has no control over when the
> > host asserts/deasserts PERST#.  If PERST# happens to be asserted while
> > the endpoint is enabling the PHY, and this causes some kind of crash
> > that the endpoint driver can't easily recover from, that's a serious
> > robustness problem.
> 
> The whole endpoint SoC crashes if the refclk is not available during
> phy_power_on() as the PHY driver tries to access some register on Dmitry's
> platform (I did not see this crash on SM8450 SoC though).
> 
> If we keep the enable_resources() during probe() then the race condition you
> observed above could apply. So removing that from probe() will also make the
> race condition go away,

Example:

  1) host deasserts PERST#
  2) qcom-ep handles PERST# IRQ
  3) qcom_pcie_ep_perst_irq_thread() calls qcom_pcie_perst_deassert()
  4) host asserts PERST#, Refclk no longer valid
  5) qcom_pcie_perst_deassert() calls qcom_pcie_enable_resources()
  6) qcom_pcie_enable_resources() enables PHY

I don't see what prevents the PERST# assertion at 4.  It sounds like
the endpoint SoC crashes at 6.

Manivannan Sadhasivam Aug. 24, 2024, 2:19 a.m. UTC | #15

On Fri, Aug 23, 2024 at 05:04:36PM -0500, Bjorn Helgaas wrote:
> On Fri, Aug 23, 2024 at 10:11:33AM +0530, Manivannan Sadhasivam wrote:
> > On Thu, Aug 22, 2024 at 12:31:33PM -0500, Bjorn Helgaas wrote:
> > > On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote:
> > > > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote:
> > > > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote:
> > > > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote:
> > > > > > ...
> > > > > 
> > > > > > > Although I do have the question of what happens if the RC deasserts
> > > > > > > PERST# before qcom-ep is loaded.  We probably don't execute
> > > > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen?
> > > > > > 
> > > > > > PERST# is a level trigger signal. So even if the host has asserted
> > > > > > it before EP booted, the level will stay low and ep will detect it
> > > > > > while booting.
> > > > > 
> > > > > The PERST# signal itself is definitely level oriented.
> > > > > 
> > > > > I'm still skeptical about the *interrupt* from the PCIe controller
> > > > > being level-triggered, as I mentioned here:
> > > > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas
> > > > 
> > > > Sorry, that comment got buried into my inbox. So didn't get a chance
> > > > to respond.
> > > > 
> > > > > tegra194 is also dwc-based and has a similar PERST# interrupt but
> > > > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think
> > > > > is a cleaner implementation.  Then you don't have to remember the
> > > > > current state, switch between high and low trigger, worry about
> > > > > races and missing a pulse, etc.
> > > > 
> > > > I did try to mimic what tegra194 did when I wrote the qcom-ep
> > > > driver, but it didn't work. If we use the level triggered interrupt
> > > > as edge, the interrupt will be missed if we do not listen at the
> > > > right time (when PERST# goes from high to low and vice versa).
> > > > 
> > > > I don't know how tegra194 interrupt controller is wired up, but IIUC
> > > > they will need to boot the endpoint first and then host to catch the
> > > > PERST# interrupt.  Otherwise, the endpoint will never see the
> > > > interrupt until host toggles it again.
> > > 
> > > Having to control the boot ordering of endpoint and host is definitely
> > > problematic.
> > > 
> > > What is the nature of the crash when we try to enable the PHY when
> > > Refclk is not available?  The endpoint has no control over when the
> > > host asserts/deasserts PERST#.  If PERST# happens to be asserted while
> > > the endpoint is enabling the PHY, and this causes some kind of crash
> > > that the endpoint driver can't easily recover from, that's a serious
> > > robustness problem.
> > 
> > The whole endpoint SoC crashes if the refclk is not available during
> > phy_power_on() as the PHY driver tries to access some register on Dmitry's
> > platform (I did not see this crash on SM8450 SoC though).
> > 
> > If we keep the enable_resources() during probe() then the race condition you
> > observed above could apply. So removing that from probe() will also make the
> > race condition go away,
> 
> Example:
> 
>   1) host deasserts PERST#
>   2) qcom-ep handles PERST# IRQ
>   3) qcom_pcie_ep_perst_irq_thread() calls qcom_pcie_perst_deassert()
>   4) host asserts PERST#, Refclk no longer valid
>   5) qcom_pcie_perst_deassert() calls qcom_pcie_enable_resources()
>   6) qcom_pcie_enable_resources() enables PHY
> 
> I don't see what prevents the PERST# assertion at 4.  It sounds like
> the endpoint SoC crashes at 6.

IDK why host would quickly assert the PERST# after deasserting during probe()
unless someone intentionally does that from host side.

If that happens then there is a possibility of the endpoint SoC crash, but I'm
not sure how we can avoid that.

But what this patch fixes is a crash occuring in a sane scenario:

1) Endpoint boots first (no refclk from host)
2) Probe() calls qcom_pcie_enable_resources() --> Crash

- Mani

Bjorn Helgaas Aug. 24, 2024, 4:12 p.m. UTC | #16

On Sat, Aug 24, 2024 at 07:49:46AM +0530, Manivannan Sadhasivam wrote:
> On Fri, Aug 23, 2024 at 05:04:36PM -0500, Bjorn Helgaas wrote:
> > On Fri, Aug 23, 2024 at 10:11:33AM +0530, Manivannan Sadhasivam wrote:
> > > On Thu, Aug 22, 2024 at 12:31:33PM -0500, Bjorn Helgaas wrote:
> > > > On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote:
> > > > > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote:
> > > > > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote:
> > > > > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote:
> > > > > > > ...
> > > > > > 
> > > > > > > > Although I do have the question of what happens if the RC deasserts
> > > > > > > > PERST# before qcom-ep is loaded.  We probably don't execute
> > > > > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen?
> > > > > > > 
> > > > > > > PERST# is a level trigger signal. So even if the host has asserted
> > > > > > > it before EP booted, the level will stay low and ep will detect it
> > > > > > > while booting.
> > > > > > 
> > > > > > The PERST# signal itself is definitely level oriented.
> > > > > > 
> > > > > > I'm still skeptical about the *interrupt* from the PCIe controller
> > > > > > being level-triggered, as I mentioned here:
> > > > > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas
> > > > > 
> > > > > Sorry, that comment got buried into my inbox. So didn't get a chance
> > > > > to respond.
> > > > > 
> > > > > > tegra194 is also dwc-based and has a similar PERST# interrupt but
> > > > > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think
> > > > > > is a cleaner implementation.  Then you don't have to remember the
> > > > > > current state, switch between high and low trigger, worry about
> > > > > > races and missing a pulse, etc.
> > > > > 
> > > > > I did try to mimic what tegra194 did when I wrote the qcom-ep
> > > > > driver, but it didn't work. If we use the level triggered interrupt
> > > > > as edge, the interrupt will be missed if we do not listen at the
> > > > > right time (when PERST# goes from high to low and vice versa).
> > > > > 
> > > > > I don't know how tegra194 interrupt controller is wired up, but IIUC
> > > > > they will need to boot the endpoint first and then host to catch the
> > > > > PERST# interrupt.  Otherwise, the endpoint will never see the
> > > > > interrupt until host toggles it again.
> > > > 
> > > > Having to control the boot ordering of endpoint and host is definitely
> > > > problematic.
> > > > 
> > > > What is the nature of the crash when we try to enable the PHY when
> > > > Refclk is not available?  The endpoint has no control over when the
> > > > host asserts/deasserts PERST#.  If PERST# happens to be asserted while
> > > > the endpoint is enabling the PHY, and this causes some kind of crash
> > > > that the endpoint driver can't easily recover from, that's a serious
> > > > robustness problem.
> > > 
> > > The whole endpoint SoC crashes if the refclk is not available during
> > > phy_power_on() as the PHY driver tries to access some register on Dmitry's
> > > platform (I did not see this crash on SM8450 SoC though).

I don't think the nature of this crash has been explained, so I don't
know whether it's a recoverable situation or not.

> > > If we keep the enable_resources() during probe() then the race
> > > condition you observed above could apply. So removing that from
> > > probe() will also make the race condition go away,
> > 
> > Example:
> > 
> >   1) host deasserts PERST#
> >   2) qcom-ep handles PERST# IRQ
> >   3) qcom_pcie_ep_perst_irq_thread() calls qcom_pcie_perst_deassert()
> >   4) host asserts PERST#, Refclk no longer valid
> >   5) qcom_pcie_perst_deassert() calls qcom_pcie_enable_resources()
> >   6) qcom_pcie_enable_resources() enables PHY
> > 
> > I don't see what prevents the PERST# assertion at 4.  It sounds like
> > the endpoint SoC crashes at 6.
> 
> IDK why host would quickly assert the PERST# after deasserting
> during probe() unless someone intentionally does that from host
> side.

I think the host is allowed to assert PERST# at any arbitrary time, so
an endpoint should be able to handle it no matter when it happens.

> If that happens then there is a possibility of the endpoint SoC
> crash, but I'm not sure how we can avoid that.
> 
> But what this patch fixes is a crash occuring in a sane scenario:
> 
> 1) Endpoint boots first (no refclk from host)
> 2) Probe() calls qcom_pcie_enable_resources() --> Crash

I agree with this, although I think it's more of a band-aid than a
complete solution.  I don't have access to any SoC or PCIe controller
docs, so maybe this is a hardware design problem and this is the best
we can do.

Bjorn

Manivannan Sadhasivam Aug. 24, 2024, 4:34 p.m. UTC | #17

On Sat, Aug 24, 2024 at 11:12:34AM -0500, Bjorn Helgaas wrote:
> On Sat, Aug 24, 2024 at 07:49:46AM +0530, Manivannan Sadhasivam wrote:
> > On Fri, Aug 23, 2024 at 05:04:36PM -0500, Bjorn Helgaas wrote:
> > > On Fri, Aug 23, 2024 at 10:11:33AM +0530, Manivannan Sadhasivam wrote:
> > > > On Thu, Aug 22, 2024 at 12:31:33PM -0500, Bjorn Helgaas wrote:
> > > > > On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote:
> > > > > > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote:
> > > > > > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote:
> > > > > > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote:
> > > > > > > > ...
> > > > > > > 
> > > > > > > > > Although I do have the question of what happens if the RC deasserts
> > > > > > > > > PERST# before qcom-ep is loaded.  We probably don't execute
> > > > > > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen?
> > > > > > > > 
> > > > > > > > PERST# is a level trigger signal. So even if the host has asserted
> > > > > > > > it before EP booted, the level will stay low and ep will detect it
> > > > > > > > while booting.
> > > > > > > 
> > > > > > > The PERST# signal itself is definitely level oriented.
> > > > > > > 
> > > > > > > I'm still skeptical about the *interrupt* from the PCIe controller
> > > > > > > being level-triggered, as I mentioned here:
> > > > > > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas
> > > > > > 
> > > > > > Sorry, that comment got buried into my inbox. So didn't get a chance
> > > > > > to respond.
> > > > > > 
> > > > > > > tegra194 is also dwc-based and has a similar PERST# interrupt but
> > > > > > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think
> > > > > > > is a cleaner implementation.  Then you don't have to remember the
> > > > > > > current state, switch between high and low trigger, worry about
> > > > > > > races and missing a pulse, etc.
> > > > > > 
> > > > > > I did try to mimic what tegra194 did when I wrote the qcom-ep
> > > > > > driver, but it didn't work. If we use the level triggered interrupt
> > > > > > as edge, the interrupt will be missed if we do not listen at the
> > > > > > right time (when PERST# goes from high to low and vice versa).
> > > > > > 
> > > > > > I don't know how tegra194 interrupt controller is wired up, but IIUC
> > > > > > they will need to boot the endpoint first and then host to catch the
> > > > > > PERST# interrupt.  Otherwise, the endpoint will never see the
> > > > > > interrupt until host toggles it again.
> > > > > 
> > > > > Having to control the boot ordering of endpoint and host is definitely
> > > > > problematic.
> > > > > 
> > > > > What is the nature of the crash when we try to enable the PHY when
> > > > > Refclk is not available?  The endpoint has no control over when the
> > > > > host asserts/deasserts PERST#.  If PERST# happens to be asserted while
> > > > > the endpoint is enabling the PHY, and this causes some kind of crash
> > > > > that the endpoint driver can't easily recover from, that's a serious
> > > > > robustness problem.
> > > > 
> > > > The whole endpoint SoC crashes if the refclk is not available during
> > > > phy_power_on() as the PHY driver tries to access some register on Dmitry's
> > > > platform (I did not see this crash on SM8450 SoC though).
> 
> I don't think the nature of this crash has been explained, so I don't
> know whether it's a recoverable situation or not.
> 

I will add this info in the commit message.

> > > > If we keep the enable_resources() during probe() then the race
> > > > condition you observed above could apply. So removing that from
> > > > probe() will also make the race condition go away,
> > > 
> > > Example:
> > > 
> > >   1) host deasserts PERST#
> > >   2) qcom-ep handles PERST# IRQ
> > >   3) qcom_pcie_ep_perst_irq_thread() calls qcom_pcie_perst_deassert()
> > >   4) host asserts PERST#, Refclk no longer valid
> > >   5) qcom_pcie_perst_deassert() calls qcom_pcie_enable_resources()
> > >   6) qcom_pcie_enable_resources() enables PHY
> > > 
> > > I don't see what prevents the PERST# assertion at 4.  It sounds like
> > > the endpoint SoC crashes at 6.
> > 
> > IDK why host would quickly assert the PERST# after deasserting
> > during probe() unless someone intentionally does that from host
> > side.
> 
> I think the host is allowed to assert PERST# at any arbitrary time, so
> an endpoint should be able to handle it no matter when it happens.
> 
> > If that happens then there is a possibility of the endpoint SoC
> > crash, but I'm not sure how we can avoid that.
> > 
> > But what this patch fixes is a crash occuring in a sane scenario:
> > 
> > 1) Endpoint boots first (no refclk from host)
> > 2) Probe() calls qcom_pcie_enable_resources() --> Crash
> 
> I agree with this, although I think it's more of a band-aid than a
> complete solution.  I don't have access to any SoC or PCIe controller
> docs, so maybe this is a hardware design problem and this is the best
> we can do.
> 

I agree. But AFAIK there is no way endpoint can avoid this crash unless it
generates its own clock. I did some investigation on the SRIS support and able
to get it work in my local branch.

I will try to upstream that feature for the currerntly supported Qcom SoCs in
endpoint mode. But Qcom told me that non-SRIS mode is also required by some
customers, so unfortunately we cannot make it as the default operating mode.

- Mani

Krzysztof Wilczyński Sept. 1, 2024, 4:34 p.m. UTC | #18

Hello,

[...]
> Applied to controller/qcom, thank you!

Based on the conversation here, I removed this patch from the branch.

	Krzysztof

PCI: qcom-ep: Do not enable resources during probe()

Commit Message

Comments

Patch