Message ID | pmodcoakbs25z2a7mlo5gpuz63zluh35vbgb5itn6k5aqhjnny@jvphbpvahtse (mailing list archive) |
---|---|
State | Changes Requested |
Series | Revert "arm64: dts: qcom: sa8540p-ride: enable pcie2a node" |
Hi Lucas,

On Fri, Jun 02, 2023 at 03:33:21PM -0400, Lucas Karpinski wrote:
> This reverts commit 2eb4cdcd5aba2db83f2111de1242721eeb659f71.

I am all for reverting this commit; however, I think your commit message
needs to be cleaned up.

> The patch introduced a sporadic error where the Qdrive3 will fail to
> boot occasionally due to an rcu preempt stall.
> Qualcomm has disabled pcie2a downstream:
> https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/rh-patch/-/commit/447f2135909683d1385af36f95fae5e1d63a7e2f

Personally I'd remove the mention of the downstream kernel in this case.
Also, your paragraphs are formatted oddly, with a newline at the end of
every sentence. Get them to flow together as a regular paragraph. This is
the relevant line that I have in my muttrc file to help:

set editor="vim -c 'set spell spelllang=en' -c 'set tw=72' -c 'set wrap'"

> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu: 0-....: (1 GPs behind) idle=77fc/1/0x4000000000000004 softirq=841/841 fqs=2476
> rcu: (t=5253 jiffies g=-175 q=2552 ncpus=8)
> Call trace:
> __do_softirq
> ____do_softirq
> call_on_irq_stack
> do_softirq_own_stack
> __irq_exit_rcu
> irq_exit_rcu
>
> The issue occurs normally once every 3-4 boot cycles.
> There is likely a race condition caused when setting up the two pcie
> domains concurrently (pcie2a and pcie3a).

I would also add that Qualcomm told us that upgrading the firmware on
the PCIe switch would correct this issue. We've upgraded the PCIe switch
to the latest firmware and this issue is still present. Apparently we
need to use a specific older version of the firmware that we can't get
from the PCIe switch vendor or Qualcomm.

Nothing is hooked up to pcie2a on the QDrive3, so there's no loss in
functionality by disabling this. We always have to remember to revert
this commit when working with an upstream kernel.

> This is not a solution, so this patch is disabling pcie2a as it seems
> Red Hat are the only ones working on the board,
> we're find with disabling the node until a root cause is found. If
> anyone has further suggestions for debugging, let me know.

This should go under the ---.

Brian
On Fri, Jun 02, 2023 at 03:33:21PM -0400, Lucas Karpinski wrote:
> This reverts commit 2eb4cdcd5aba2db83f2111de1242721eeb659f71.
>
> The patch introduced a sporadic error where the Qdrive3 will fail to
> boot occasionally due to an rcu preempt stall.
> Qualcomm has disabled pcie2a downstream:
> https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/rh-patch/-/commit/447f2135909683d1385af36f95fae5e1d63a7e2f
>
> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu: 0-....: (1 GPs behind) idle=77fc/1/0x4000000000000004 softirq=841/841 fqs=2476
> rcu: (t=5253 jiffies g=-175 q=2552 ncpus=8)
> Call trace:
> __do_softirq
> ____do_softirq
> call_on_irq_stack
> do_softirq_own_stack
> __irq_exit_rcu
> irq_exit_rcu
>
> The issue occurs normally once every 3-4 boot cycles.
> There is likely a race condition caused when setting up the two pcie
> domains concurrently (pcie2a and pcie3a).
>
> The issue is not present when only pcie2a is enabled or when only pcie3a
> is enabled.
> A workaround was found that allowed the Qdrive3 to boot with both pcie2a
> and pcie3a enabled.
> Set the .probe_type to PROBE_FORCE_SYNCHRONOUS and add an msleep() to
> the probing function.
> This is not a solution, so this patch is disabling pcie2a as it seems
> Red Hat are the only ones working on the board,
> we're find with disabling the node until a root cause is found. If
> anyone has further suggestions for debugging, let me know.
>
> Signed-off-by: Lucas Karpinski <lkarpins@redhat.com>
> ---
> During debugging:
> - Added additional time for clock/regulator stabilization.
> - Reduced the bandwidth across pcie2a and pcie3a.
> - Replaced the interconnect setup from another driver.
> - The 32-bit/64-bit/config-io space for both pcie2a and pcie3a look to be mapped correctly.
> - Verified interconnects were started successfully.
I was looking at another issue downstream triggering a soft lock on
CPU0, but it turns out this could be the same thing except the symptoms
are less noticeable (the 3-4 boot cycles you mention).

Using next-20230609, if I add a return kprobe on dw_handle_msi_irq:

    echo 'r:dwmsi_probe dw_handle_msi_irq $retval' > /sys/kernel/debug/tracing/kprobe_events
    echo 1 > /sys/kernel/debug/tracing/events/kprobes/dwmsi_probe/enable
    cat /sys/kernel/debug/tracing/trace_pipe

    <idle>-0 [000] d.h1. 690.417268: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417272: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417276: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417281: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417284: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417288: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    [...]

dw_handle_msi_irq constantly fires and never returns IRQ_HANDLED. It
happens consistently for pcie2a or pcie3a, after I disable one or the
other. I presume having both might be enough to overwhelm the system and
trigger the stall?

Looking at the handler, the status is always 0 after:

    status = dw_pcie_readl_dbi(pci, PCIE_MSI_INTR0_STATUS +
                               (i * MSI_REG_CTRL_BLOCK_SIZE));

Unfortunately I do not know why that is yet.
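To make the observation above concrete, here is a minimal user-space sketch of the control flow being described. It is a simplified model, not the driver's code: the `status[]` array stands in for the values that `dw_pcie_readl_dbi()` would return for each controller block, and `model_handle_msi_irq()`, the local `enum irqreturn`, and the `dispatched` counter are hypothetical names introduced only for illustration. The point it shows is that when every status word reads 0, nothing is dispatched and the function falls through to `IRQ_NONE`, which would match the `arg1=0x0` return values in the trace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel's types and constants. */
enum irqreturn { IRQ_NONE = 0, IRQ_HANDLED = 1 };
#define MAX_MSI_IRQS_PER_CTRL 32

/*
 * Simplified model of an MSI status loop: status[i] is the value a
 * status-register read would yield for controller block i. The return
 * value is IRQ_HANDLED only if at least one block has a pending bit.
 * *dispatched counts the vectors that would have been handed on.
 */
static enum irqreturn model_handle_msi_irq(const uint32_t *status,
                                           unsigned int num_ctrls,
                                           unsigned int *dispatched)
{
    enum irqreturn ret = IRQ_NONE;

    assert(status != NULL && dispatched != NULL);
    *dispatched = 0;

    for (unsigned int i = 0; i < num_ctrls; i++) {
        if (!status[i])
            continue;               /* nothing pending in this block */

        ret = IRQ_HANDLED;
        for (unsigned int pos = 0; pos < MAX_MSI_IRQS_PER_CTRL; pos++) {
            if (status[i] & (UINT32_C(1) << pos))
                (*dispatched)++;    /* a real handler would dispatch here */
        }
    }

    return ret;
}
```

Calling it with an all-zero `status[]` returns `IRQ_NONE` with zero vectors dispatched, whereas a single block holding `0x5` returns `IRQ_HANDLED` with two vectors. Why the hardware keeps raising the interrupt while the status reads 0 is exactly the open question, and this model does not attempt to answer it.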
>
>  arch/arm64/boot/dts/qcom/sa8540p-ride.dts | 44 -----------------------
>  1 file changed, 44 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> index 24fa449d48a6..d492723ccf7c 100644
> --- a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> +++ b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> @@ -186,27 +186,6 @@ &i2c18 {
>  	status = "okay";
>  };
>  
> -&pcie2a {
> -	ranges = <0x01000000 0x0 0x3c200000 0x0 0x3c200000 0x0 0x100000>,
> -		 <0x02000000 0x0 0x3c300000 0x0 0x3c300000 0x0 0x1d00000>,
> -		 <0x03000000 0x5 0x00000000 0x5 0x00000000 0x1 0x00000000>;
> -
> -	perst-gpios = <&tlmm 143 GPIO_ACTIVE_LOW>;
> -	wake-gpios = <&tlmm 145 GPIO_ACTIVE_HIGH>;
> -
> -	pinctrl-names = "default";
> -	pinctrl-0 = <&pcie2a_default>;
> -
> -	status = "okay";
> -};
> -
> -&pcie2a_phy {
> -	vdda-phy-supply = <&vreg_l11a>;
> -	vdda-pll-supply = <&vreg_l3a>;
> -
> -	status = "okay";
> -};
> -
>  &pcie3a {
>  	ranges = <0x01000000 0x0 0x40200000 0x0 0x40200000 0x0 0x100000>,
>  		 <0x02000000 0x0 0x40300000 0x0 0x40300000 0x0 0x20000000>,
> @@ -356,29 +335,6 @@ i2c18_default: i2c18-default-state {
>  		bias-pull-up;
>  	};
>  
> -	pcie2a_default: pcie2a-default-state {
> -		perst-pins {
> -			pins = "gpio143";
> -			function = "gpio";
> -			drive-strength = <2>;
> -			bias-pull-down;
> -		};
> -
> -		clkreq-pins {
> -			pins = "gpio142";
> -			function = "pcie2a_clkreq";
> -			drive-strength = <2>;
> -			bias-pull-up;
> -		};
> -
> -		wake-pins {
> -			pins = "gpio145";
> -			function = "gpio";
> -			drive-strength = <2>;
> -			bias-pull-up;
> -		};
> -	};
> -
>  	pcie3a_default: pcie3a-default-state {
>  		perst-pins {
>  			pins = "gpio151";
> --
> 2.40.1
>
diff --git a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
index 24fa449d48a6..d492723ccf7c 100644
--- a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
+++ b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
@@ -186,27 +186,6 @@ &i2c18 {
 	status = "okay";
 };
 
-&pcie2a {
-	ranges = <0x01000000 0x0 0x3c200000 0x0 0x3c200000 0x0 0x100000>,
-		 <0x02000000 0x0 0x3c300000 0x0 0x3c300000 0x0 0x1d00000>,
-		 <0x03000000 0x5 0x00000000 0x5 0x00000000 0x1 0x00000000>;
-
-	perst-gpios = <&tlmm 143 GPIO_ACTIVE_LOW>;
-	wake-gpios = <&tlmm 145 GPIO_ACTIVE_HIGH>;
-
-	pinctrl-names = "default";
-	pinctrl-0 = <&pcie2a_default>;
-
-	status = "okay";
-};
-
-&pcie2a_phy {
-	vdda-phy-supply = <&vreg_l11a>;
-	vdda-pll-supply = <&vreg_l3a>;
-
-	status = "okay";
-};
-
 &pcie3a {
 	ranges = <0x01000000 0x0 0x40200000 0x0 0x40200000 0x0 0x100000>,
 		 <0x02000000 0x0 0x40300000 0x0 0x40300000 0x0 0x20000000>,
@@ -356,29 +335,6 @@ i2c18_default: i2c18-default-state {
 		bias-pull-up;
 	};
 
-	pcie2a_default: pcie2a-default-state {
-		perst-pins {
-			pins = "gpio143";
-			function = "gpio";
-			drive-strength = <2>;
-			bias-pull-down;
-		};
-
-		clkreq-pins {
-			pins = "gpio142";
-			function = "pcie2a_clkreq";
-			drive-strength = <2>;
-			bias-pull-up;
-		};
-
-		wake-pins {
-			pins = "gpio145";
-			function = "gpio";
-			drive-strength = <2>;
-			bias-pull-up;
-		};
-	};
-
 	pcie3a_default: pcie3a-default-state {
 		perst-pins {
 			pins = "gpio151";
This reverts commit 2eb4cdcd5aba2db83f2111de1242721eeb659f71.

The patch introduced a sporadic error where the Qdrive3 will fail to
boot occasionally due to an rcu preempt stall.
Qualcomm has disabled pcie2a downstream:
https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/rh-patch/-/commit/447f2135909683d1385af36f95fae5e1d63a7e2f

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 0-....: (1 GPs behind) idle=77fc/1/0x4000000000000004 softirq=841/841 fqs=2476
rcu: (t=5253 jiffies g=-175 q=2552 ncpus=8)
Call trace:
__do_softirq
____do_softirq
call_on_irq_stack
do_softirq_own_stack
__irq_exit_rcu
irq_exit_rcu

The issue occurs normally once every 3-4 boot cycles.
There is likely a race condition caused when setting up the two pcie
domains concurrently (pcie2a and pcie3a).

The issue is not present when only pcie2a is enabled or when only pcie3a
is enabled.
A workaround was found that allowed the Qdrive3 to boot with both pcie2a
and pcie3a enabled.
Set the .probe_type to PROBE_FORCE_SYNCHRONOUS and add an msleep() to
the probing function.
This is not a solution, so this patch is disabling pcie2a as it seems
Red Hat are the only ones working on the board,
we're find with disabling the node until a root cause is found. If
anyone has further suggestions for debugging, let me know.

Signed-off-by: Lucas Karpinski <lkarpins@redhat.com>
---
During debugging:
- Added additional time for clock/regulator stabilization.
- Reduced the bandwidth across pcie2a and pcie3a.
- Replaced the interconnect setup from another driver.
- The 32-bit/64-bit/config-io space for both pcie2a and pcie3a look to be mapped correctly.
- Verified interconnects were started successfully.

 arch/arm64/boot/dts/qcom/sa8540p-ride.dts | 44 -----------------------
 1 file changed, 44 deletions(-)
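
As a reading aid for the `t=5253 jiffies` figure in the stall report, the small helper below converts a jiffies count to milliseconds. The tick rate is an assumption: the thread does not state the board's CONFIG_HZ, and `ASSUMED_HZ` and `jiffies_to_millis()` are hypothetical names used only for this sketch. If the kernel were built with HZ=250, 5253 jiffies would come to roughly 21 seconds, which would be in line with the kernel's default RCU CPU stall timeout of 21 seconds; with a different HZ the figure changes accordingly.

```c
#include <assert.h>

/* ASSUMPTION: tick rate in Hz; the thread does not state CONFIG_HZ. */
#define ASSUMED_HZ 250u

/* Convert a jiffies count to whole milliseconds for a given tick rate. */
static unsigned long jiffies_to_millis(unsigned long jiffies, unsigned int hz)
{
    assert(hz > 0);
    return jiffies * 1000ul / hz;
}
```

For example, `jiffies_to_millis(5253, ASSUMED_HZ)` evaluates to 21012, while the same count at a tick rate of 1000 Hz would be 5253 ms.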