Message ID | 20130207100403.GD16987@e106331-lin.cambridge.arm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Mark, On Thursday 07 February 2013 03:34 PM, Mark Rutland wrote: > Hi Stephen, > > Sorry about this; I'm to blame for the bug. > > On Wed, Feb 06, 2013 at 08:51:43PM +0000, Stephen Warren wrote: >> On 01/14/2013 10:05 AM, Mark Rutland wrote: >>> Implement timer_broadcast for the arm architecture, allowing for the use >>> of clock_event_device_drivers decoupled from the timer tick broadcast >>> mechanism. >> >> Mark, this patch is now in next-20130206 and causes a crash during boot >> on Tegra. The reason appears to be because of: >> >>> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c >> >>> @@ -524,7 +524,6 @@ static void __cpuinit percpu_timer_setup(void) >>> struct clock_event_device *evt = &per_cpu(percpu_clockevent, cpu); >>> >>> evt->cpumask = cpumask_of(cpu); >>> - evt->broadcast = smp_timer_broadcast; >> >> After that change, evt->broadcast is never assigned, and hence is NULL. >> Yet elsewhere in kernel/time/tick-broadcast.c it's used unconditionally: >> >> static void tick_do_broadcast(struct cpumask *mask) >> ... >> if (!cpumask_empty(mask)) { >> ... >> td = &per_cpu(tick_cpu_device, cpumask_first(mask)); >> td->evtdev->broadcast(mask); >> >> Now perhaps the Tegra timer driver simply isn't being set up correctly, >> so the bug is there... But the only other place I can find where >> ->broadcast is assigned is in tick_device_uses_broadcast() which only >> does it for "non-functional" timers, which doesn't apply to Tegra's timer. >> > > The intent of 12ad100046: "clockevents: Add generic timer broadcast function" > was to setup the broadcast function both for non-functional/dummy timers and > those that stop in low-power states (CLOCK_EVT_FEAT_C3STOP). I missed the > CLOCK_EVT_FEAT_C3STOP case. > I am not sure this exactly the case. Because in my testing, the C3STOP path was exercised already. And if the C3STOP is used then notifiers calls are expected for switching of clock-events to broadcast mode. And dummy broad-cast hook should come into picture only if the per CPU local timer clock-event are not registered. I just tried "next-20130207" on OMAP and I see the broad-cast mechanism works and didn't observe any crash. ------------------------------ Tick Device: mode: 1 Broadcast device Clock Event Device: gp_timer max_delta_ns: 131071523464981 min_delta_ns: 91552 mult: 70369 shift: 31 mode: 3 next_event: 89984375000 nsecs set_next_event: omap2_gp_timer_set_next_event set_mode: omap2_gp_timer_set_mode event_handler: tick_handle_oneshot_broadcast retries: 0 tick_broadcast_mask: 00000003 tick_broadcast_oneshot_mask: 00000000 Tick Device: mode: 1 Per CPU device: 0 Clock Event Device: local_timer max_delta_ns: 10737418240 min_delta_ns: 1000 mult: 858993459 shift: 31 mode: 3 next_event: 125250000000 nsecs set_next_event: twd_set_next_event set_mode: twd_set_mode event_handler: hrtimer_interrupt retries: 346 Tick Device: mode: 1 Per CPU device: 1 Clock Event Device: local_timer max_delta_ns: 10737418240 min_delta_ns: 1000 mult: 858993459 shift: 31 mode: 3 next_event: 89921875000 nsecs set_next_event: twd_set_next_event set_mode: twd_set_mode event_handler: hrtimer_interrupt retries: 258 # ------------------------------ > I believe the patch below will fix this for Tegra and any other platforms where > broadcast is required in low power states. > Am not sure you really need that patch unless and until am missing a scenario in my test. Stephan, Do we have crash log seen on Tegra ? Which tegra CPUIDLE driver is being used when crash is seen ? Regards Santosh
On 02/07/2013 03:04 AM, Mark Rutland wrote: > Hi Stephen, > > Sorry about this; I'm to blame for the bug. > > On Wed, Feb 06, 2013 at 08:51:43PM +0000, Stephen Warren wrote: >> On 01/14/2013 10:05 AM, Mark Rutland wrote: >>> Implement timer_broadcast for the arm architecture, allowing for the use >>> of clock_event_device_drivers decoupled from the timer tick broadcast >>> mechanism. >> >> Mark, this patch is now in next-20130206 and causes a crash during boot >> on Tegra. The reason appears to be because of: >> >>> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c >> >>> @@ -524,7 +524,6 @@ static void __cpuinit percpu_timer_setup(void) >>> struct clock_event_device *evt = &per_cpu(percpu_clockevent, cpu); >>> >>> evt->cpumask = cpumask_of(cpu); >>> - evt->broadcast = smp_timer_broadcast; >> >> After that change, evt->broadcast is never assigned, and hence is NULL. >> Yet elsewhere in kernel/time/tick-broadcast.c it's used unconditionally: >> >> static void tick_do_broadcast(struct cpumask *mask) >> ... >> if (!cpumask_empty(mask)) { >> ... >> td = &per_cpu(tick_cpu_device, cpumask_first(mask)); >> td->evtdev->broadcast(mask); >> >> Now perhaps the Tegra timer driver simply isn't being set up correctly, >> so the bug is there... But the only other place I can find where >> ->broadcast is assigned is in tick_device_uses_broadcast() which only >> does it for "non-functional" timers, which doesn't apply to Tegra's timer. >> > > The intent of 12ad100046: "clockevents: Add generic timer broadcast function" > was to setup the broadcast function both for non-functional/dummy timers and > those that stop in low-power states (CLOCK_EVT_FEAT_C3STOP). I missed the > CLOCK_EVT_FEAT_C3STOP case. > > I believe the patch below will fix this for Tegra and any other platforms where > broadcast is required in low power states. > > Stephen, could you test this for Tegra? Tested-by: Stephen Warren <swarren@nvidia.com> Thanks.
On 02/07/2013 04:40 AM, Santosh Shilimkar wrote: > Mark, > > On Thursday 07 February 2013 03:34 PM, Mark Rutland wrote: >> Hi Stephen, >> >> Sorry about this; I'm to blame for the bug. >> >> On Wed, Feb 06, 2013 at 08:51:43PM +0000, Stephen Warren wrote: >>> On 01/14/2013 10:05 AM, Mark Rutland wrote: >>>> Implement timer_broadcast for the arm architecture, allowing for the >>>> use >>>> of clock_event_device_drivers decoupled from the timer tick broadcast >>>> mechanism. >>> >>> Mark, this patch is now in next-20130206 and causes a crash during boot >>> on Tegra. The reason appears to be because of: >>> >>>> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c >>> >>>> @@ -524,7 +524,6 @@ static void __cpuinit percpu_timer_setup(void) >>>> struct clock_event_device *evt = &per_cpu(percpu_clockevent, >>>> cpu); >>>> >>>> evt->cpumask = cpumask_of(cpu); >>>> - evt->broadcast = smp_timer_broadcast; >>> >>> After that change, evt->broadcast is never assigned, and hence is NULL. >>> Yet elsewhere in kernel/time/tick-broadcast.c it's used unconditionally: >>> >>> static void tick_do_broadcast(struct cpumask *mask) >>> ... >>> if (!cpumask_empty(mask)) { >>> ... >>> td = &per_cpu(tick_cpu_device, cpumask_first(mask)); >>> td->evtdev->broadcast(mask); >>> >>> Now perhaps the Tegra timer driver simply isn't being set up correctly, >>> so the bug is there... But the only other place I can find where >>> ->broadcast is assigned is in tick_device_uses_broadcast() which only >>> does it for "non-functional" timers, which doesn't apply to Tegra's >>> timer. >> >> The intent of 12ad100046: "clockevents: Add generic timer broadcast >> function" >> was to setup the broadcast function both for non-functional/dummy >> timers and >> those that stop in low-power states (CLOCK_EVT_FEAT_C3STOP). I missed the >> CLOCK_EVT_FEAT_C3STOP case. > > I am not sure this exactly the case. Because in my testing, the C3STOP > path was exercised already. And if the C3STOP is used then notifiers > calls are expected for switching of clock-events to broadcast mode. > > And dummy broad-cast hook should come into picture only if the per CPU > local timer clock-event are not registered. > > I just tried "next-20130207" on OMAP and I see the broad-cast mechanism > works and didn't observe any crash. ... >> I believe the patch below will fix this for Tegra and any other >> platforms where >> broadcast is required in low power states. >> > Am not sure you really need that patch unless and until am missing a > scenario in my test. > > Stephan, > > Do we have crash log seen on Tegra ? Which tegra CPUIDLE driver > is being used when crash is seen ? The crash log is below. It's simply because td->evt->broadcast is NULL because it wasn't ever set. I'm using the CPUIDLE driver in arch/arm/mach-tegra/cpuidle*.c. > [ 4.179156] Unable to handle kernel NULL pointer dereference at virtual address 00000000 > [ 4.211200] pgd = c0004000 > [ 4.237688] [00000000] *pgd=00000000 > [ 4.264868] Internal error: Oops: 80000005 [#1] PREEMPT SMP ARM > [ 4.294396] Modules linked in: > [ 4.320940] CPU: 0 Not tainted (3.8.0-rc6-next-20130206-00002-g2aaa5ff #6) > [ 4.352125] PC is at 0x0 > [ 4.378346] LR is at tick_do_broadcast.constprop.3+0x74/0xb0 > [ 4.407786] pc : [<00000000>] lr : [<c0063b38>] psr: 20000193 > [ 4.407786] sp : c06d7d60 ip : 00000002 fp : 00000004 > [ 4.466633] r10: c06d2ef8 r9 : c06def08 r8 : c06dec98 > [ 4.495457] r7 : 00000000 r6 : f8b8d230 r5 : c073d598 r4 : 00000000 > [ 4.525825] r3 : 00000000 r2 : 008ac000 r1 : 00000004 r0 : c073d5a4 > [ 4.556192] Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment kernel > [ 4.587865] Control: 10c5387d Table: 0000404a DAC: 00000015 > [ 4.617983] Process swapper/0 (pid: 0, stack limit = 0xc06d6238) > [ 4.648314] Stack: (0xc06d7d60 to 0xc06d8000) > [ 4.677145] 7d60: 00000002 ffffffff 7fffffff c0063c28 00000000 c06f17c0 f8b8d230 00000000 > [ 4.710387] 7d80: c0029b40 c06f1780 ee015ad0 00000000 00000000 00000049 c07186a7 ee015a80 > [ 4.743652] 7da0: 00000001 c0228c2c c0228c08 c0079184 f8b66130 00000000 ee2dc8e8 ee015a80 > [ 4.776806] 7dc0: ee015ad0 c06f1780 fe000100 c06d8080 c06d6000 c04e01ac 00000000 c0079304 > [ 4.810088] 7de0: ee015a80 ee015ad0 c06d7ee8 c007bf38 c007bebc 00000049 00000049 c0078b8c > [ 4.843304] 7e00: c06d3678 c000eb74 fe00010c c06df040 c06d7e30 c00086d8 c0029acc c0029b40 > [ 4.876440] 7e20: 20000113 ffffffff c06d7e64 c000df40 00000001 c073a100 00000000 00000000 > [ 4.909580] 7e40: 00000202 0000001d 00000000 c06d6000 c06d8080 c06d6000 c04e01ac 00000000 > [ 4.942655] 7e60: 00000000 c06d7e78 c0029acc c0029b40 20000113 ffffffff c06ede80 ee012380 > [ 4.975697] 7e80: c0f7a834 00000000 ffff8c6e 00200000 c01f4290 c06d6000 0000001d 00000000 > [ 5.008939] 7ea0: fe000100 ee1a8b80 c06d6000 c04e01ac 00000000 c0029fa4 c06d3678 c000eb78 > [ 5.042564] 7ec0: fe00010c c06df040 c06d7ee8 c00086d8 c005d5f8 c0325700 60000113 ffffffff > [ 5.076394] 7ee0: c06d7f1c c000df40 c06d7f30 3b9aca00 f8b59610 00000000 f8105948 00000000 > [ 5.110339] 7f00: c0f763e8 00000001 ee1a8b80 c06d6000 c04e01ac 00000000 00000015 c06d7f30 > [ 5.144383] 7f20: c005d5f8 c0325700 60000113 ffffffff f8b59610 00000000 00000000 c04e01ac > [ 5.178363] 7f40: 00000000 c0f763e8 c0f763e8 00000001 c06e3de0 c03253ec 00000004 ee1a8b94 > [ 5.212337] 7f60: c0f763e8 c0327290 0000004c c0f763e8 c0761a0c 00000001 c06e3de0 00000000 > [ 5.246703] 7f80: c06e2ab0 c0325504 c06d6000 c06d6000 c0718888 c06de3f4 c04e01ac c000ef74 > [ 5.281026] 7fa0: c06decf8 c06d6000 c07187c0 c0f73d40 ffffffff 411fc090 3fffffff c069f7b4 > [ 5.315263] 7fc0: ffffffff ffffffff c069f2ec 00000000 00000000 c06c86a8 00000000 10c5387d > [ 5.349498] 7fe0: c06de3f0 c06c86a4 c06e2aa4 0000406a 00000000 00008074 00000000 00000000 > [ 5.383733] [<c0063b38>] (tick_do_broadcast.constprop.3+0x74/0xb0) from [<c0063c28>] (tick_handle_oneshot_broadcast+0xb4/0x108) > [ 5.421789] [<c0063c28>] (tick_handle_oneshot_broadcast+0xb4/0x108) from [<c0228c2c>] (tegra_timer_interrupt+0x24/0x2c) > [ 5.459485] [<c0228c2c>] (tegra_timer_interrupt+0x24/0x2c) from [<c0079184>] (handle_irq_event_percpu+0x50/0x194) > [ 5.497012] [<c0079184>] (handle_irq_event_percpu+0x50/0x194) from [<c0079304>] (handle_irq_event+0x3c/0x5c) > [ 5.534558] [<c0079304>] (handle_irq_event+0x3c/0x5c) from [<c007bf38>] (handle_fasteoi_irq+0x7c/0x138) > [ 5.572007] [<c007bf38>] (handle_fasteoi_irq+0x7c/0x138) from [<c0078b8c>] (generic_handle_irq+0x20/0x30) > [ 5.609920] [<c0078b8c>] (generic_handle_irq+0x20/0x30) from [<c000eb74>] (handle_IRQ+0x38/0x94) > [ 5.647209] [<c000eb74>] (handle_IRQ+0x38/0x94) from [<c00086d8>] (gic_handle_irq+0x28/0x5c) > [ 5.684385] [<c00086d8>] (gic_handle_irq+0x28/0x5c) from [<c000df40>] (__irq_svc+0x40/0x70) > [ 5.721713] Exception stack(0xc06d7e30 to 0xc06d7e78) > [ 5.755567] 7e20: 00000001 c073a100 00000000 00000000 > [ 5.792782] 7e40: 00000202 0000001d 00000000 c06d6000 c06d8080 c06d6000 c04e01ac 00000000 > [ 5.829856] 7e60: 00000000 c06d7e78 c0029acc c0029b40 20000113 ffffffff > [ 5.865532] [<c000df40>] (__irq_svc+0x40/0x70) from [<c0029b40>] (__do_softirq+0x84/0x1b8) > [ 5.903191] [<c0029b40>] (__do_softirq+0x84/0x1b8) from [<c0029fa4>] (irq_exit+0x90/0x98) > [ 5.940843] [<c0029fa4>] (irq_exit+0x90/0x98) from [<c000eb78>] (handle_IRQ+0x3c/0x94) > [ 5.978365] [<c000eb78>] (handle_IRQ+0x3c/0x94) from [<c00086d8>] (gic_handle_irq+0x28/0x5c) > [ 6.016482] [<c00086d8>] (gic_handle_irq+0x28/0x5c) from [<c000df40>] (__irq_svc+0x40/0x70) > [ 6.054601] Exception stack(0xc06d7ee8 to 0xc06d7f30) > [ 6.089523] 7ee0: c06d7f30 3b9aca00 f8b59610 00000000 f8105948 00000000 > [ 6.127885] 7f00: c0f763e8 00000001 ee1a8b80 c06d6000 c04e01ac 00000000 00000015 c06d7f30 > [ 6.166169] 7f20: c005d5f8 c0325700 60000113 ffffffff > [ 6.201233] [<c000df40>] (__irq_svc+0x40/0x70) from [<c0325700>] (cpuidle_wrap_enter+0x48/0x98) > [ 6.240573] [<c0325700>] (cpuidle_wrap_enter+0x48/0x98) from [<c03253ec>] (cpuidle_enter_state+0x14/0x60) > [ 6.281197] [<c03253ec>] (cpuidle_enter_state+0x14/0x60) from [<c0327290>] (cpuidle_enter_state_coupled+0x208/0x2a0) > [ 6.323163] [<c0327290>] (cpuidle_enter_state_coupled+0x208/0x2a0) from [<c0325504>] (cpuidle_idle_call+0xcc/0x11c) > [ 6.365258] [<c0325504>] (cpuidle_idle_call+0xcc/0x11c) from [<c000ef74>] (cpu_idle+0xb0/0x114) > [ 6.405790] [<c000ef74>] (cpu_idle+0xb0/0x114) from [<c069f7b4>] (start_kernel+0x2a4/0x2f4) > [ 6.446051] Code: bad PC value > [ 6.480872] ---[ end trace 319bbea8753aac77 ]--- > [ 6.517329] Kernel panic- not syncing: Fatal exception in interrupt > [ 7.944703] SMP: failed to stop secondary CPUs
[...] > > The intent of 12ad100046: "clockevents: Add generic timer broadcast function" > > was to setup the broadcast function both for non-functional/dummy timers and > > those that stop in low-power states (CLOCK_EVT_FEAT_C3STOP). I missed the > > CLOCK_EVT_FEAT_C3STOP case. > > > I am not sure this exactly the case. Because in my testing, the C3STOP > path was exercised already. And if the C3STOP is used then notifiers > calls are expected for switching of clock-events to broadcast mode. I don't thing the C3STOP path was fully exercised. The notifier calls sets up the broadcast source and adds cpus to tick_broadcast_oneshot_mask, but they don't set up the broadcast function on the local timer clock_event_devices. While everything else was set up for broadcast, the CPU in idle's tick_cpu_device->evtdev had a NULL broadcast function pointer. As soon as the broadcast device receives its interrupt, it attempts to broadcast to all cpus in tick_broadcast_oneshot_mask, calling the broadcast function pointer on their local timers' struct clock_event_device. As this was never set up, it explodes, attempting to branch to NULL. > > And dummy broad-cast hook should come into picture only if the per CPU > local timer clock-event are not registered. This is irrelevant here. The issue is that the broadcast function pointer isn't set up for clocks with CLOCK_EVT_FEAT_C3STOP. > > I just tried "next-20130207" on OMAP and I see the broad-cast mechanism > works and didn't observe any crash. > > ------------------------------ > Tick Device: mode: 1 > Broadcast device > Clock Event Device: gp_timer > max_delta_ns: 131071523464981 > min_delta_ns: 91552 > mult: 70369 > shift: 31 > mode: 3 > next_event: 89984375000 nsecs > set_next_event: omap2_gp_timer_set_next_event > set_mode: omap2_gp_timer_set_mode > event_handler: tick_handle_oneshot_broadcast > retries: 0 > > tick_broadcast_mask: 00000003 > tick_broadcast_oneshot_mask: 00000000 The broadcast clocks' next broadcast is in ~90 seconds time. [...] > > Tick Device: mode: 1 > Per CPU device: 1 > Clock Event Device: local_timer > max_delta_ns: 10737418240 > min_delta_ns: 1000 > mult: 858993459 > shift: 31 > mode: 3 > next_event: 89921875000 nsecs > set_next_event: twd_set_next_event > set_mode: twd_set_mode > event_handler: hrtimer_interrupt > retries: 258 And this clock next expects a tick in ~90 seconds time. Did you leave the board on for at least 90 seconds after grabbing this output, with the cpu staying in idle? If not, the kernel won't have attempted to broadcast. > > # > > ------------------------------ > > > I believe the patch below will fix this for Tegra and any other platforms where > > broadcast is required in low power states. > > > Am not sure you really need that patch unless and until am missing a > scenario in my test. I suspect that with the time before broadcast being so large, broadcast never actually occurred (even though the global timer was ready for it to). I've don't have a suitable OMAP platform to test that theory on, unfortunately. > > Stephan, > > Do we have crash log seen on Tegra ? Which tegra CPUIDLE driver > is being used when crash is seen ? I've acquired a Harmony board, and running next-20130207, I encounter the exact issue I described above (with the PC at NULL). With my suggested patch applied, the problem disappears. I believe the idle driver would be tegra20-cpuidle, and I believe it's doing the right thing. Thanks, Mark.
[...] > > The intent of 12ad100046: "clockevents: Add generic timer broadcast function" > > was to setup the broadcast function both for non-functional/dummy timers and > > those that stop in low-power states (CLOCK_EVT_FEAT_C3STOP). I missed the > > CLOCK_EVT_FEAT_C3STOP case. > > > > I believe the patch below will fix this for Tegra and any other platforms where > > broadcast is required in low power states. > > > > Stephen, could you test this for Tegra? > > Tested-by: Stephen Warren <swarren@nvidia.com> Thanks. Sorry for the bother. Mark.
On Thursday 07 February 2013 10:21 PM, Stephen Warren wrote: > On 02/07/2013 04:40 AM, Santosh Shilimkar wrote: [...] >> Stephan, >> >> Do we have crash log seen on Tegra ? Which tegra CPUIDLE driver >> is being used when crash is seen ? > > The crash log is below. It's simply because td->evt->broadcast is NULL > because it wasn't ever set. > Thanks for the log. Its clear from the log that broadcast event wasn't set. Regards, Santosh
Mark, On Thursday 07 February 2013 10:47 PM, Mark Rutland wrote: [..] >>> I believe the patch below will fix this for Tegra and any other platforms where >>> broadcast is required in low power states. >>> From the Stephan's crash the issue is pretty clear and your patch make sense. Some how I didn't manage to reproduce the issue on my board and hence assumed everything is just fine. Will have a look at it later why it didn't explode since it was so obvious. Sorry for the noise. Regards Santosh
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index f726537..2fb8cb8 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -92,6 +92,17 @@ static void err_broadcast(const struct cpumask *mask) pr_crit_once("Failed to broadcast timer tick. Some CPUs may be unresponsive.\n"); } +static void tick_device_setup_broadcast_func(struct clock_event_device *dev) +{ + if (!dev->broadcast) + dev->broadcast = tick_broadcast; + if (!dev->broadcast) { + pr_warn_once("%s depends on broadcast, but no broadcast function available\n", + dev->name); + dev->broadcast = err_broadcast; + } +} + /* * Check, if the device is disfunctional and a place holder, which * needs to be handled by the broadcast device. @@ -111,13 +122,7 @@ int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu) */ if (!tick_device_is_functional(dev)) { dev->event_handler = tick_handle_periodic; - if (!dev->broadcast) - dev->broadcast = tick_broadcast; - if (!dev->broadcast) { - pr_warn_once("%s depends on broadcast, but no broadcast function available\n", - dev->name); - dev->broadcast = err_broadcast; - } + tick_device_setup_broadcast_func(dev); cpumask_set_cpu(cpu, tick_get_broadcast_mask()); tick_broadcast_start_periodic(tick_broadcast_device.evtdev); ret = 1; @@ -129,9 +134,10 @@ int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu) */ if (!(dev->features & CLOCK_EVT_FEAT_C3STOP)) { int cpu = smp_processor_id(); - cpumask_clear_cpu(cpu, tick_get_broadcast_mask()); tick_broadcast_clear_oneshot(cpu); + } else { + tick_device_setup_broadcast_func(dev); } } raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);