omap3: l3: Temporary fix to avoid the kernel hang with beagle board.

Message ID 1304952352-27837-1-git-send-email-r.sricharan@ti.com (mailing list archive)
State New, archived

Commit Message

R Sricharan May 9, 2011, 2:45 p.m. UTC
Paul Walmsley reported a kernel hang issue with beagle board during
boot. This is an intermittent bug and the execution was found to be
stuck at the l3 interrupt handler.

This was due to a dss initiator agent timeout occurring during
the boot even when there is no actual interconnect access made by the
dss. Since the reason for the dss timeout is not root-caused yet,
the timeout feature is disabled at the interconnect level.
Note that this is a temporary fix that should be removed once
the dss interconnect agent timeout issue is resolved.

Thanks to Paul Walmsley for reporting and helping in reproducing
this issue.

Signed-off-by: sricharan <r.sricharan@ti.com>
Cc: Paul Walmsley <paul@pwsan.com>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
---
 arch/arm/mach-omap2/omap_l3_smx.c |   11 +++++++++++
 arch/arm/mach-omap2/omap_l3_smx.h |    2 ++
 2 files changed, 13 insertions(+), 0 deletions(-)

Comments

Paul Walmsley July 9, 2011, 10:25 p.m. UTC | #1
Hi

On Mon, 9 May 2011, sricharan wrote:

> Paul Walmsley reported a kernel hang issue with beagle board during
> boot. This is an intermittent bug and the execution was found to be
> stuck at the l3 interrupt handler.
> 
> This was due to a dss initiator agent timeout occurring during
> the boot even when there is no actual interconnect access made by the
> dss. Since the reason for the dss timeout is not root-caused yet,
> the timeout feature is disabled at the interconnect level.
> Note that this is a temporary fix that should be removed once
> the dss interconnect agent timeout issue is resolved.

So it's been two months since this bug was reported.  Any progress on 
root-causing it?

I don't see how I can upstream this temporary patch with a straight face.

First, it tries to unconditionally reset the L3 DSS interconnect agent, 
even if there's no problem on the L3 DSS IA that requires a reset.  It 
should only try to reset an IA if it's in a bad state.

Second, are you sure that reset sequence is correct?  Writing a 1 and then 
a 0 to that reset bit, without any barrier or delay in between?  Could you 
please confirm that this is a correct reset sequence with the L3 IA
designers and cc me on the E-mails, or send me an extract from the 
relevant documentation?

Third, the patch disables L3 timeout reporting.  This effectively reacts 
to an error by pretending that the error did not exist.  This isn't right.  
If there's an L3 timeout, it needs to be reported, if at all possible.  It 
should never happen and it indicates something is wrong with the software 
or the hardware.
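
The three review points above can be put together in a small user-space
sketch: check the agent status first, report a timeout if one is pending,
and only then pulse the reset bit, with an ordering point between the two
writes. This is a model under stated assumptions, not the driver code. Only
L3_DSS_IA_CONTROL comes from the patch. The FAKE_* register offsets, the
bit positions, the helper names and the behaviour that releasing reset
clears the timeout are all hypothetical, and whether the real hardware
needs a barrier or a delay is exactly the open question in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Offset taken from the patch. */
#define L3_DSS_IA_CONTROL	0x5400

/* ASSUMPTIONS for illustration only: register offsets and bit positions. */
#define FAKE_AGENT_CONTROL	0x020
#define FAKE_AGENT_STATUS	0x028
#define FAKE_CONTROL_RESET	(1u << 0)
#define FAKE_STATUS_TIMEOUT	(1u << 0)

#define FAKE_REG_SPACE		0x6000

/* Plain array standing in for the memory-mapped L3 register target. */
static uint32_t fake_regs[FAKE_REG_SPACE / 4];
static unsigned int reset_pulses;

static uint32_t fake_readl(uint32_t off)
{
	assert(off < FAKE_REG_SPACE && off % 4 == 0);
	return fake_regs[off / 4];
}

static void fake_writel(uint32_t off, uint32_t val)
{
	const uint32_t ctrl = L3_DSS_IA_CONTROL + FAKE_AGENT_CONTROL;
	const uint32_t stat = L3_DSS_IA_CONTROL + FAKE_AGENT_STATUS;

	assert(off < FAKE_REG_SPACE && off % 4 == 0);
	/* Model: releasing the reset bit (1 -> 0) clears the pending timeout. */
	if (off == ctrl && (fake_regs[off / 4] & FAKE_CONTROL_RESET) &&
	    !(val & FAKE_CONTROL_RESET)) {
		fake_regs[stat / 4] &= ~FAKE_STATUS_TIMEOUT;
		reset_pulses++;
	}
	fake_regs[off / 4] = val;
}

/* Returns 1 if a timeout was found, reported and cleared, else 0. */
static int dss_ia_recover_if_timed_out(void)
{
	const uint32_t ctrl = L3_DSS_IA_CONTROL + FAKE_AGENT_CONTROL;
	const uint32_t stat = L3_DSS_IA_CONTROL + FAKE_AGENT_STATUS;
	uint32_t status = fake_readl(stat);

	if (!(status & FAKE_STATUS_TIMEOUT))
		return 0;	/* agent is healthy: leave it alone */

	/* Report the error instead of hiding it. */
	fprintf(stderr, "L3: DSS IA timeout (status 0x%08x), resetting agent\n",
		(unsigned int)status);

	fake_writel(ctrl, FAKE_CONTROL_RESET);
	/* Stand-in for whatever barrier or delay the hardware may require. */
	atomic_thread_fence(memory_order_seq_cst);
	fake_writel(ctrl, 0);
	return 1;
}
```

Calling dss_ia_recover_if_timed_out() with a clear status register leaves
the agent untouched. With the assumed timeout bit set, it logs the error
and issues one reset pulse. Global timeout reporting is never switched off.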


- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Santosh Shilimkar July 10, 2011, 12:22 a.m. UTC | #2
+ Tomi,

On 7/9/2011 3:25 PM, Paul Walmsley wrote:
> Hi
>
> On Mon, 9 May 2011, sricharan wrote:
>
>> Paul Walmsley reported a kernel hang issue with beagle board
>> during boot. This is an intermittent bug and the execution was
>> found to be stuck at the l3 interrupt handler.
>>
>> This was due to a dss initiator agent timeout occurring during the
>> boot even when there is no actual interconnect access made by the
>> dss. Since the reason for the dss timeout is not root-caused yet,
>> the timeout feature is disabled at the interconnect level. Note
>> that this is a temporary fix that should be removed once the dss
>> interconnect agent timeout issue is resolved.
>
> So it's been two months since this bug was reported.  Any progress
> on root-causing it?
>
> I don't see how I can upstream this temporary patch with a straight
> face.
>
Sorry for not closing the loop on this thread, but I thought Tomi had
root-caused the DSS timeout issue to an incorrect reset sequence of the
DSS IP. With that fixed, I thought we shouldn't see that issue.

> First, it tries to unconditionally reset the L3 DSS interconnect
> agent, even if there's no problem on the L3 DSS IA that requires a
> reset.  It should only try to reset an IA if it's in a bad state.
>
This was to clear the error in case it had already been triggered during
the boot-loader DSS reset sequence. But I agree with your comments.

> Second, are you sure that reset sequence is correct?  Writing a 1 and
> then a 0 to that reset bit, without any barrier or delay in between?
> Could you please confirm that this is a correct reset sequence with
> the L3 IA designers and cc me on the E-mails, or send me an extract
> from the relevant documentation?
>
> Third, the patch disables L3 timeout reporting.  This effectively
> reacts to an error by pretending that the error did not exist.  This
> isn't right. If there's an L3 timeout, it needs to be reported, if at
> all possible.  It should never happen and it indicates something is
> wrong with the software or the hardware.
>
Will come back to you on above queries.

Regards
Santosh

Paul Walmsley July 10, 2011, 12:30 a.m. UTC | #3
Hi Santosh,

On Sat, 9 Jul 2011, Santosh Shilimkar wrote:

> Sorry for not closing the loop on this thread but I thought Tomi
> root-caused the DSS timeout issue to incorrect reset sequence of
> DSS IP. With that fixed I thought we shouldn't see that issue.

OK great, happy to hear that it was tracked down!

Tomi, do you have patches to fix the reset bug?

> > First, it tries to unconditionally reset the L3 DSS interconnect
> > agent, even if there's no problem on the L3 DSS IA that requires a
> > reset.  It should only try to reset an IA if it's in a bad state.
> > 
> This was to clear the error in case it had already been triggered during
> the boot-loader DSS reset sequence. But I agree with your comments.

That's a good idea, but the patch should only do that if the L3 DSS IA is 
reporting a timeout error.

> > Second, are you sure that reset sequence is correct?  Writing a 1 and
> > then a 0 to that reset bit, without any barrier or delay in between?
> > Could you please confirm that this is a correct reset sequence with
> > the L3 IA designers and cc me on the E-mails, or send me an extract
> > from the relevant documentation?
> > 
> > Third, the patch disables L3 timeout reporting.  This effectively
> > reacts to an error by pretending that the error did not exist.  This
> > isn't right. If there's an L3 timeout, it needs to be reported, if at
> > all possible.  It should never happen and it indicates something is
> > wrong with the software or the hardware.
> > 
> Will come back to you on above queries.

regards,

- Paul
Tomi Valkeinen Aug. 1, 2011, 6:01 a.m. UTC | #4
On Sat, 2011-07-09 at 18:30 -0600, Paul Walmsley wrote:
> Hi Santosh,
> 
> On Sat, 9 Jul 2011, Santosh Shilimkar wrote:
> 
> > Sorry for not closing the loop on this thread but I thought Tomi
> > root-caused the DSS timeout issue to incorrect reset sequence of
> > DSS IP. With that fixed I thought we shouldn't see that issue.
> 
> OK great, happy to hear that it was tracked down!
> 
> Tomi, do you have patches to fix the reset bug?

I have to say I'm not sure what this is about... I haven't seen any
hangs.

There were (or perhaps still are) problems with the hwmod code resetting
DSS. This was because the hwmod code didn't enable all the DSS clocks
before doing the reset. However, this shouldn't cause any problems in
the current mainline kernel, as the DSS driver there does a reset
itself. This will change when the DSS starts using pmruntime.
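
The clock dependency described above can be sketched with a toy model: a
soft reset that only completes when every clock feeding the block is
running, and a wrapper that enables them all first. Everything here is
hypothetical (the fake_* types, the function names, the number of clocks
and the clock labels). It is not the hwmod or DSS driver API, only an
illustration of the ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FAKE_DSS_NUM_CLKS 3	/* illustrative count, not from the thread */

struct fake_clk {
	const char *name;	/* illustrative label only */
	bool enabled;
};

struct fake_dss {
	struct fake_clk clks[FAKE_DSS_NUM_CLKS];
	bool reset_done;
};

static bool fake_dss_all_clocks_on(const struct fake_dss *dss)
{
	assert(dss != NULL);
	for (size_t i = 0; i < FAKE_DSS_NUM_CLKS; i++)
		if (!dss->clks[i].enabled)
			return false;
	return true;
}

/* Model: the soft reset only completes when every clock is running. */
static bool fake_dss_softreset(struct fake_dss *dss)
{
	assert(dss != NULL);
	dss->reset_done = fake_dss_all_clocks_on(dss);
	return dss->reset_done;
}

/* Enable every clock first, then reset. Clocks are left enabled afterwards. */
static bool fake_dss_reset_with_clocks(struct fake_dss *dss)
{
	assert(dss != NULL);
	for (size_t i = 0; i < FAKE_DSS_NUM_CLKS; i++)
		dss->clks[i].enabled = true;
	return fake_dss_softreset(dss);
}
```

In this model, fake_dss_softreset() fails while any clock is off, which
mirrors the failure mode described for the hwmod reset path, and
fake_dss_reset_with_clocks() succeeds because it enables the clocks first.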

 Tomi


Santosh Shilimkar Aug. 1, 2011, 6:13 a.m. UTC | #5
Tomi,
On 8/1/2011 11:31 AM, Tomi Valkeinen wrote:
> On Sat, 2011-07-09 at 18:30 -0600, Paul Walmsley wrote:
>> Hi Santosh,
>>
>> On Sat, 9 Jul 2011, Santosh Shilimkar wrote:
>>
>>> Sorry for not closing the loop on this thread but I thought Tomi
>>> root-caused the DSS timeout issue to incorrect reset sequence of
>>> DSS IP. With that fixed I thought we shouldn't see that issue.
>>
>> OK great, happy to hear that it was tracked down!
>>
>> Tomi, do you have patches to fix the reset bug?
>
> I have to say I'm not sure what this is about... I haven't seen any
> hangs.
>
> There were (or perhaps still are) problems with the hwmod code resetting
> DSS. This was because the hwmod code didn't enable all the DSS clocks
> before doing the reset. However, this shouldn't cause any problems in
> the current mainline kernel, as the DSS driver there does a reset
> itself. This will change when the DSS starts using pmruntime.
>
During your vacation, Archit and Sricharan looked at the issue further.
The issue is indeed related to the DSS reset. Archit has posted the patch
for internal review. Please have a look at it.

Regards
Santosh

Patch

diff --git a/arch/arm/mach-omap2/omap_l3_smx.c b/arch/arm/mach-omap2/omap_l3_smx.c
index 4321e79..4ea7dcd 100644
--- a/arch/arm/mach-omap2/omap_l3_smx.c
+++ b/arch/arm/mach-omap2/omap_l3_smx.c
@@ -248,6 +248,17 @@  static int __init omap3_l3_probe(struct platform_device *pdev)
 		goto err2;
 	}
 
+	/*
+	 * FIX ME: dss interconnect timeout error.
+	 * Disable the l3 timeout reporting feature for all modules.
+	 * Also reset the dss initiator agent with which the error is seen
+	 * to clear the interrupt. This is a temporary fix and should be
+	 * removed after root causing the issue.
+	 */
+	omap3_l3_writell(l3->rt, L3_RT_NETWORK_CONTROL, 0x0);
+	omap3_l3_writell(l3->rt + L3_DSS_IA_CONTROL, L3_AGENT_CONTROL, 0x1);
+	omap3_l3_writell(l3->rt + L3_DSS_IA_CONTROL, L3_AGENT_CONTROL, 0x0);
+
 	l3->debug_irq = platform_get_irq(pdev, 0);
 	ret = request_irq(l3->debug_irq, omap3_l3_app_irq,
 		IRQF_DISABLED | IRQF_TRIGGER_RISING,
diff --git a/arch/arm/mach-omap2/omap_l3_smx.h b/arch/arm/mach-omap2/omap_l3_smx.h
index ba2ed9a..96fff9d 100644
--- a/arch/arm/mach-omap2/omap_l3_smx.h
+++ b/arch/arm/mach-omap2/omap_l3_smx.h
@@ -35,6 +35,8 @@ 
 #define L3_ERROR_LOG_SECONDARY		(1 << 30)
 
 #define L3_ERROR_LOG_ADDR		0x060
+#define L3_RT_NETWORK_CONTROL		0x078
+#define L3_DSS_IA_CONTROL		0x5400
 
 /* Register definitions for Sideband Interconnect */
 #define L3_SI_CONTROL			0x020