Message ID | 20110620101438.GD2082@n2100.arm.linux.org.uk (mailing list archive) |
---|---|
State | New, archived |
On 6/20/2011 3:44 PM, Russell King - ARM Linux wrote:
> On Mon, Jun 20, 2011 at 10:50:53AM +0100, Russell King - ARM Linux wrote:
>> On Mon, Jun 20, 2011 at 02:53:59PM +0530, Santosh Shilimkar wrote:
>>> The current ARM CPU hotplug code suffers from couple of race conditions
>>> in CPU online path with scheduler.
>>> The ARM CPU hotplug code doesn't wait for hot-plugged CPU to be marked
>>> active as part of cpu_notify() by the CPU which brought it up before
>>> enabling interrupts.
>>
>> Hmm, why not just move the set_cpu_online() call before notify_cpu_starting()
>> and add the wait after the set_cpu_online() ?
>
> Actually, the race is caused by the CPU being marked online (and therefore
> available for the scheduler) but not yet active (the CPU asking this one
> to boot hasn't run the online notifiers yet.)
>
Scheduler uses the active mask and not online mask. For the scheduler, a CPU
is ready for migration as soon as it is marked as active and that's
the reason, interrupts should never be enabled before CPU is marked
as active in online path.

> This, I feel, is a fault of generic code. If the CPU is not ready to have
> processes scheduled on it (because migration is not initialized) then we
> shouldn't be scheduling processes on the new CPU yet.
>
> In any case, this should close the window by ensuring that we don't receive
> an interrupt in the online-but-not-active case. Can you please test?
>
No it doesn't work. I still get the crash. The important point
here is not to enable interrupts before CPU is marked
as online and active.

Regards
Santosh

On Mon, Jun 20, 2011 at 03:58:03PM +0530, Santosh Shilimkar wrote:
> On 6/20/2011 3:44 PM, Russell King - ARM Linux wrote:
>> On Mon, Jun 20, 2011 at 10:50:53AM +0100, Russell King - ARM Linux wrote:
>>> On Mon, Jun 20, 2011 at 02:53:59PM +0530, Santosh Shilimkar wrote:
>>>> The current ARM CPU hotplug code suffers from couple of race conditions
>>>> in CPU online path with scheduler.
>>>> The ARM CPU hotplug code doesn't wait for hot-plugged CPU to be marked
>>>> active as part of cpu_notify() by the CPU which brought it up before
>>>> enabling interrupts.
>>>
>>> Hmm, why not just move the set_cpu_online() call before notify_cpu_starting()
>>> and add the wait after the set_cpu_online() ?
>>
>> Actually, the race is caused by the CPU being marked online (and therefore
>> available for the scheduler) but not yet active (the CPU asking this one
>> to boot hasn't run the online notifiers yet.)
>>
> Scheduler uses the active mask and not online mask. For the scheduler, a CPU
> is ready for migration as soon as it is marked as active and that's
> the reason, interrupts should never be enabled before CPU is marked
> as active in online path.
>
>> This, I feel, is a fault of generic code. If the CPU is not ready to have
>> processes scheduled on it (because migration is not initialized) then we
>> shouldn't be scheduling processes on the new CPU yet.
>>
>> In any case, this should close the window by ensuring that we don't receive
>> an interrupt in the online-but-not-active case. Can you please test?
>>
> No it doesn't work. I still get the crash. The important point
> here is not to enable interrupts before CPU is marked
> as online and active.

But we can't do that.

On Mon, Jun 20, 2011 at 03:58:03PM +0530, Santosh Shilimkar wrote:
> No it doesn't work. I still get the crash. The important point
> here is not to enable interrupts before CPU is marked
> as online and active.

What is the crash (in full please)?

Do we know what interrupt is causing it?

On 6/20/2011 4:05 PM, Russell King - ARM Linux wrote:
> On Mon, Jun 20, 2011 at 03:58:03PM +0530, Santosh Shilimkar wrote:
>> On 6/20/2011 3:44 PM, Russell King - ARM Linux wrote:
>>> On Mon, Jun 20, 2011 at 10:50:53AM +0100, Russell King - ARM Linux wrote:
>>>> On Mon, Jun 20, 2011 at 02:53:59PM +0530, Santosh Shilimkar wrote:
>>>>> The current ARM CPU hotplug code suffers from couple of race conditions
>>>>> in CPU online path with scheduler.
>>>>> The ARM CPU hotplug code doesn't wait for hot-plugged CPU to be marked
>>>>> active as part of cpu_notify() by the CPU which brought it up before
>>>>> enabling interrupts.
>>>>
>>>> Hmm, why not just move the set_cpu_online() call before notify_cpu_starting()
>>>> and add the wait after the set_cpu_online() ?
>>>
>>> Actually, the race is caused by the CPU being marked online (and therefore
>>> available for the scheduler) but not yet active (the CPU asking this one
>>> to boot hasn't run the online notifiers yet.)
>>>
>> Scheduler uses the active mask and not online mask. For the scheduler, a CPU
>> is ready for migration as soon as it is marked as active and that's
>> the reason, interrupts should never be enabled before CPU is marked
>> as active in online path.
>>
>>> This, I feel, is a fault of generic code. If the CPU is not ready to have
>>> processes scheduled on it (because migration is not initialized) then we
>>> shouldn't be scheduling processes on the new CPU yet.
>>>
>>> In any case, this should close the window by ensuring that we don't receive
>>> an interrupt in the online-but-not-active case. Can you please test?
>>>
>> No it doesn't work. I still get the crash. The important point
>> here is not to enable interrupts before CPU is marked
>> as online and active.
>
> But we can't do that.
Why is that?
Is it because of calibration or the hotplug start notifiers need to
be called with interrupts enabled?

Regards
Santosh

On 6/20/2011 4:14 PM, Russell King - ARM Linux wrote:
> On Mon, Jun 20, 2011 at 03:58:03PM +0530, Santosh Shilimkar wrote:
>> No it doesn't work. I still get the crash. The important point
>> here is not to enable interrupts before CPU is marked
>> as online and active.
>
> What is the crash (in full please)?
>
> Do we know what interrupt is causing it?
Yes. It's because of interrupt and the CPU active-online
race.

Here is the crash log..
[ 21.025451] CPU1: Booted secondary processor
[ 21.025451] CPU1: Unknown IPI message 0x1
[ 21.029113] Switched to NOHz mode on CPU #1
[ 21.029174] BUG: spinlock lockup on CPU#1, swapper/0, c06220c4
[ 21.029235] [<c0064704>] (unwind_backtrace+0x0/0xf4) from [<c028edc8>] (do_raw_spin_lock+0xd0/0x164)
[ 21.029266] [<c028edc8>] (do_raw_spin_lock+0xd0/0x164) from [<c00cc3c4>] (tick_do_update_jiffies64+0x3c/0x118)
[ 21.029296] [<c00cc3c4>] (tick_do_update_jiffies64+0x3c/0x118) from [<c00ccb04>] (tick_check_idle+0xb0/0x110)
[ 21.029327] [<c00ccb04>] (tick_check_idle+0xb0/0x110) from [<c00a29cc>] (irq_enter+0x68/0x70)
[ 21.029327] [<c00a29cc>] (irq_enter+0x68/0x70) from [<c00623c4>] (ipi_timer+0x24/0x40)
[ 21.029357] [<c00623c4>] (ipi_timer+0x24/0x40) from [<c0051368>] (do_local_timer+0x54/0x70)
[ 21.029388] [<c0051368>] (do_local_timer+0x54/0x70) from [<c048a09c>] (__irq_svc+0x3c/0x120)
[ 21.029388] Exception stack(0xef87bf78 to 0xef87bfc0)
[ 21.029388] bf60: 00000000 00026ec0
[ 21.029418] bf80: c0622080 ffff7483 c0622080 ffff7483 ef87a000 00000000 c0622080 411fc092
[ 21.029418] bfa0: c063a4f0 00000000 00000001 ef87bfc0 c0482e08 c0482b0c 60000113 ffffffff
[ 21.029449] [<c048a09c>] (__irq_svc+0x3c/0x120) from [<c0482b0c>] (calibrate_delay+0x8c/0x1d4)
[ 21.029479] [<c0482b0c>] (calibrate_delay+0x8c/0x1d4) from [<c0482e08>] (secondary_start_kernel+0x110/0x1ac)
[ 21.029510] [<c0482e08>] (secondary_start_kernel+0x110/0x1ac) from [<c0070ee4>] (platform_cpu_die+0x34/0x54)
[ 22.021362] CPU1: failed to come online
[ 23.997955] CPU1: failed to come online
[ 25.000122] BUG: spinlock lockup on CPU#0, kthreadd/663, efa27e64
[ 25.006408] [<c0064704>] (unwind_backtrace+0x0/0xf4) from [<c028edc8>] (do_raw_spin_lock+0xd0/0x164)
[ 25.015808] [<c028edc8>] (do_raw_spin_lock+0xd0/0x164) from [<c048985c>] (_raw_spin_lock_irqsave+0x4c/0x58)
[ 25.025848] [<c048985c>] (_raw_spin_lock_irqsave+0x4c/0x58) from [<c008ba24>] (complete+0x1c/0x5c)
[ 25.035095] [<c008ba24>] (complete+0x1c/0x5c) from [<c00baf78>] (kthread+0x68/0x90)
[ 25.042968] [<c00baf78>] (kthread+0x68/0x90) from [<c005dfdc>] (kernel_thread_exit+0x0/0x8)

On Mon, Jun 20, 2011 at 04:17:58PM +0530, Santosh Shilimkar wrote:
> Yes. It's because of interrupt and the CPU active-online
> race.

I don't see that as a conclusion from this dump.

> Here is the crash log..
> [ 21.025451] CPU1: Booted secondary processor
> [ 21.025451] CPU1: Unknown IPI message 0x1
> [ 21.029113] Switched to NOHz mode on CPU #1
> [ 21.029174] BUG: spinlock lockup on CPU#1, swapper/0, c06220c4

That's the xtime seqlock. We're trying to update the xtime from CPU1,
which is not yet online and not yet active. That's fine, we're just
spinning on the spinlock here, waiting for the other CPUs to release
it.

But what this is saying is that the other CPUs aren't releasing it.
The cpu hotplug code doesn't hold the seqlock either. So who else is
holding this lock, causing CPU1 to time out on it.

The other thing is that this is only supposed to trigger after about
one second:

	u64 loops = loops_per_jiffy * HZ;
	for (i = 0; i < loops; i++) {
		if (arch_spin_trylock(&lock->raw_lock))
			return;
		__delay(1);
	}

which from the timings you have at the beginning of your printk lines
is clearly not the case - it's more like 61us.

Are you running with those h/w timer delay patches?

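The 61us figure can be re-derived from the two printk timestamps in the quoted log (21.029113 for "Switched to NOHz mode" and 21.029174 for the lockup report). The following is only a small userspace sketch of that arithmetic; the helper names `printk_stamp_us` and `stamp_delta_us` are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse a printk-style "[ seconds.microseconds]" prefix into microseconds.
 * Returns -1 if the line does not start with such a prefix.
 * Assumes the fractional part has exactly six digits, as in the log above.
 */
static long long printk_stamp_us(const char *line)
{
	long long sec, usec;

	if (sscanf(line, " [ %lld.%6lld", &sec, &usec) != 2)
		return -1;
	return sec * 1000000LL + usec;
}

/* Microseconds elapsed between two log lines (later minus earlier). */
static long long stamp_delta_us(const char *earlier, const char *later)
{
	long long a = printk_stamp_us(earlier);
	long long b = printk_stamp_us(later);

	assert(a >= 0 && b >= 0);	/* both lines must carry a timestamp */
	return b - a;
}
```

Feeding it the "Switched to NOHz mode" and "BUG: spinlock lockup" lines gives the gap between enabling NOHz and the lockup report, which is far short of the roughly one second the detector loop is meant to spin for.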
On 6/20/2011 4:43 PM, Russell King - ARM Linux wrote:
> On Mon, Jun 20, 2011 at 04:17:58PM +0530, Santosh Shilimkar wrote:
>> Yes. It's because of interrupt and the CPU active-online
>> race.
>
> I don't see that as a conclusion from this dump.
>
>> Here is the crash log..
>> [ 21.025451] CPU1: Booted secondary processor
>> [ 21.025451] CPU1: Unknown IPI message 0x1
>> [ 21.029113] Switched to NOHz mode on CPU #1
>> [ 21.029174] BUG: spinlock lockup on CPU#1, swapper/0, c06220c4
>
> That's the xtime seqlock. We're trying to update the xtime from CPU1,
> which is not yet online and not yet active. That's fine, we're just
> spinning on the spinlock here, waiting for the other CPUs to release
> it.
>
> But what this is saying is that the other CPUs aren't releasing it.
> The cpu hotplug code doesn't hold the seqlock either. So who else is
> holding this lock, causing CPU1 to time out on it.
>
> The other thing is that this is only supposed to trigger after about
> one second:
>
> 	u64 loops = loops_per_jiffy * HZ;
> 	for (i = 0; i< loops; i++) {
> 		if (arch_spin_trylock(&lock->raw_lock))
> 			return;
> 		__delay(1);
> 	}
>
> which from the timings you have at the beginning of your printk lines
> is clearly not the case - it's more like 61us.
>
> Are you running with those h/w timer delay patches?
Nope.

Regards
Santosh

On Mon, Jun 20, 2011 at 04:55:43PM +0530, Santosh Shilimkar wrote:
> On 6/20/2011 4:43 PM, Russell King - ARM Linux wrote:
>> On Mon, Jun 20, 2011 at 04:17:58PM +0530, Santosh Shilimkar wrote:
>>> Yes. It's because of interrupt and the CPU active-online
>>> race.
>>
>> I don't see that as a conclusion from this dump.
>>
>>> Here is the crash log..
>>> [ 21.025451] CPU1: Booted secondary processor
>>> [ 21.025451] CPU1: Unknown IPI message 0x1
>>> [ 21.029113] Switched to NOHz mode on CPU #1
>>> [ 21.029174] BUG: spinlock lockup on CPU#1, swapper/0, c06220c4
>>
>> That's the xtime seqlock. We're trying to update the xtime from CPU1,
>> which is not yet online and not yet active. That's fine, we're just
>> spinning on the spinlock here, waiting for the other CPUs to release
>> it.
>>
>> But what this is saying is that the other CPUs aren't releasing it.
>> The cpu hotplug code doesn't hold the seqlock either. So who else is
>> holding this lock, causing CPU1 to time out on it.
>>
>> The other thing is that this is only supposed to trigger after about
>> one second:
>>
>> 	u64 loops = loops_per_jiffy * HZ;
>> 	for (i = 0; i< loops; i++) {
>> 		if (arch_spin_trylock(&lock->raw_lock))
>> 			return;
>> 		__delay(1);
>> 	}
>>
>> which from the timings you have at the beginning of your printk lines
>> is clearly not the case - it's more like 61us.
>>
>> Are you running with those h/w timer delay patches?
> Nope.

Ok. So loops_per_jiffy must be too small. My guess is you're using an
older kernel without 71c696b1 (calibrate: extract fall-back calculation
into own helper).

The delay calibration code used to start out by setting:

	loops_per_jiffy = (1<<12);

This will shorten the delay right down, and that's probably causing these
false spinlock lockup bug dumps.

Arranging for IRQs to be disabled across the delay calibration just avoids
the issue by preventing any spinlock being taken.

The reason that CPU#0 also complains about spinlock lockup is that for
some reason CPU#1 never finishes its calibration, and so the loop also
times out early on CPU#0.

Of course, fiddling with this global variable in this way is _not_ a good
idea while other CPUs are running and using that variable.

We could also do with implementing trigger_all_cpu_backtrace() to get
backtraces from the other CPUs when spinlock lockup happens...

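To make the scaling argument concrete: the detector spins for loops_per_jiffy * HZ iterations of __delay(1), so if loops_per_jiffy is temporarily (1<<12) instead of the calibrated value, the time budget shrinks by the same ratio. The sketch below is plain userspace arithmetic, not kernel code. It assumes each __delay(1) costs exactly one calibrated delay-loop iteration and ignores the cost of the trylock itself; the function names and the example numbers in the usage note are hypothetical.

```c
#include <assert.h>

/* Iteration budget of the lockup detector loop: loops_per_jiffy * HZ. */
static unsigned long long lockup_loop_budget(unsigned long lpj, unsigned int hz)
{
	assert(hz > 0);
	return (unsigned long long)lpj * hz;
}

/*
 * Approximate wall-clock timeout in microseconds when the detector reads
 * lpj_seen from the global variable while the hardware really needs
 * lpj_true delay loops per jiffy. HZ cancels out of the ratio:
 *   (lpj_seen * HZ) iterations / (lpj_true * HZ) iterations per second.
 */
static unsigned long long lockup_timeout_us(unsigned long lpj_seen,
					    unsigned long lpj_true)
{
	assert(lpj_true > 0);
	return 1000000ULL * lpj_seen / lpj_true;
}
```

With a made-up calibrated value of 4096000, `lockup_timeout_us(4096000, 4096000)` is the intended one second, while `lockup_timeout_us(1 << 12, 4096000)` is about one millisecond. The real figures depend on the platform, so this only shows the direction and size of the effect being described.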
On 6/20/2011 4:15 PM, Santosh Shilimkar wrote:
> On 6/20/2011 4:05 PM, Russell King - ARM Linux wrote:
>> On Mon, Jun 20, 2011 at 03:58:03PM +0530, Santosh Shilimkar wrote:
>>> On 6/20/2011 3:44 PM, Russell King - ARM Linux wrote:
>>>> On Mon, Jun 20, 2011 at 10:50:53AM +0100, Russell King - ARM Linux wrote:
>>>>> On Mon, Jun 20, 2011 at 02:53:59PM +0530, Santosh Shilimkar wrote:
>>>>>> The current ARM CPU hotplug code suffers from couple of race conditions
>>>>>> in CPU online path with scheduler.
>>>>>> The ARM CPU hotplug code doesn't wait for hot-plugged CPU to be marked
>>>>>> active as part of cpu_notify() by the CPU which brought it up before
>>>>>> enabling interrupts.
>>>>>
>>>>> Hmm, why not just move the set_cpu_online() call before notify_cpu_starting()
>>>>> and add the wait after the set_cpu_online() ?
>>>>
>>>> Actually, the race is caused by the CPU being marked online (and therefore
>>>> available for the scheduler) but not yet active (the CPU asking this one
>>>> to boot hasn't run the online notifiers yet.)
>>>>
>>> Scheduler uses the active mask and not online mask. For the scheduler, a CPU
>>> is ready for migration as soon as it is marked as active and that's
>>> the reason, interrupts should never be enabled before CPU is marked
>>> as active in online path.
>>>
>>>> This, I feel, is a fault of generic code. If the CPU is not ready to have
>>>> processes scheduled on it (because migration is not initialized) then we
>>>> shouldn't be scheduling processes on the new CPU yet.
>>>>
>>>> In any case, this should close the window by ensuring that we don't receive
>>>> an interrupt in the online-but-not-active case. Can you please test?
>>>>
>>> No it doesn't work. I still get the crash. The important point
>>> here is not to enable interrupts before CPU is marked
>>> as online and active.
>>
>> But we can't do that.
> Why is that?
> Is it because of calibration or the hotplug start notifiers need to
> be called with interrupts enabled?
>
BTW, how is ARM different from x86 here? I mean, the x86 code seems
to do something similar to what my patch is trying to fix for ARM.

Some pointers would help me to understand why we can't delay the
interrupt enable part on ARM hotplug code.

Regards
Santosh

On 6/20/2011 5:10 PM, Russell King - ARM Linux wrote:
> On Mon, Jun 20, 2011 at 04:55:43PM +0530, Santosh Shilimkar wrote:
>> On 6/20/2011 4:43 PM, Russell King - ARM Linux wrote:
>>> On Mon, Jun 20, 2011 at 04:17:58PM +0530, Santosh Shilimkar wrote:
>>>> Yes. It's because of interrupt and the CPU active-online
>>>> race.
>>>
>>> I don't see that as a conclusion from this dump.
>>>
>>>> Here is the crash log..
>>>> [ 21.025451] CPU1: Booted secondary processor
>>>> [ 21.025451] CPU1: Unknown IPI message 0x1
>>>> [ 21.029113] Switched to NOHz mode on CPU #1
>>>> [ 21.029174] BUG: spinlock lockup on CPU#1, swapper/0, c06220c4
>>>
>>> That's the xtime seqlock. We're trying to update the xtime from CPU1,
>>> which is not yet online and not yet active. That's fine, we're just
>>> spinning on the spinlock here, waiting for the other CPUs to release
>>> it.
>>>
>>> But what this is saying is that the other CPUs aren't releasing it.
>>> The cpu hotplug code doesn't hold the seqlock either. So who else is
>>> holding this lock, causing CPU1 to time out on it.
>>>
>>> The other thing is that this is only supposed to trigger after about
>>> one second:
>>>
>>> 	u64 loops = loops_per_jiffy * HZ;
>>> 	for (i = 0; i< loops; i++) {
>>> 		if (arch_spin_trylock(&lock->raw_lock))
>>> 			return;
>>> 		__delay(1);
>>> 	}
>>>
>>> which from the timings you have at the beginning of your printk lines
>>> is clearly not the case - it's more like 61us.
>>>
>>> Are you running with those h/w timer delay patches?
>> Nope.
>
> Ok. So loops_per_jiffy must be too small. My guess is you're using an
> older kernel without 71c696b1 (calibrate: extract fall-back calculation
> into own helper).
>
I am on V3.0-rc3+(latest mainline) and the above commit is already
part of it.

> The delay calibration code used to start out by setting:
>
> 	loops_per_jiffy = (1<<12);
>
> This will shorten the delay right down, and that's probably causing these
> false spinlock lockup bug dumps.
>
> Arranging for IRQs to be disabled across the delay calibration just avoids
> the issue by preventing any spinlock being taken.
>
> The reason that CPU#0 also complains about spinlock lockup is that for
> some reason CPU#1 never finishes its calibration, and so the loop also
> times out early on CPU#0.
>
I am not sure but what I think is happening is as soon as interrupts
start firing, as part of IRQ handling, scheduler will try to
enqueue softIRQ thread for newly booted CPU since it sees that
it's active and ready. But that's failing and both CPUs
eventually lock up. But I may be wrong here.

> Of course, fiddling with this global variable in this way is _not_ a good
> idea while other CPUs are running and using that variable.
>
> We could also do with implementing trigger_all_cpu_backtrace() to get
> backtraces from the other CPUs when spinlock lockup happens...

Any pointers on the other question about "why we need to enable
interrupts before the CPU is ready?"

Regards
Santosh

On Mon, Jun 20, 2011 at 05:21:48PM +0530, Santosh Shilimkar wrote:
> On 6/20/2011 5:10 PM, Russell King - ARM Linux wrote:
>> On Mon, Jun 20, 2011 at 04:55:43PM +0530, Santosh Shilimkar wrote:
>>> On 6/20/2011 4:43 PM, Russell King - ARM Linux wrote:
>>>> On Mon, Jun 20, 2011 at 04:17:58PM +0530, Santosh Shilimkar wrote:
>>>>> Yes. It's because of interrupt and the CPU active-online
>>>>> race.
>>>>
>>>> I don't see that as a conclusion from this dump.
>>>>
>>>>> Here is the crash log..
>>>>> [ 21.025451] CPU1: Booted secondary processor
>>>>> [ 21.025451] CPU1: Unknown IPI message 0x1
>>>>> [ 21.029113] Switched to NOHz mode on CPU #1
>>>>> [ 21.029174] BUG: spinlock lockup on CPU#1, swapper/0, c06220c4
>>>>
>>>> That's the xtime seqlock. We're trying to update the xtime from CPU1,
>>>> which is not yet online and not yet active. That's fine, we're just
>>>> spinning on the spinlock here, waiting for the other CPUs to release
>>>> it.
>>>>
>>>> But what this is saying is that the other CPUs aren't releasing it.
>>>> The cpu hotplug code doesn't hold the seqlock either. So who else is
>>>> holding this lock, causing CPU1 to time out on it.
>>>>
>>>> The other thing is that this is only supposed to trigger after about
>>>> one second:
>>>>
>>>> 	u64 loops = loops_per_jiffy * HZ;
>>>> 	for (i = 0; i< loops; i++) {
>>>> 		if (arch_spin_trylock(&lock->raw_lock))
>>>> 			return;
>>>> 		__delay(1);
>>>> 	}
>>>>
>>>> which from the timings you have at the beginning of your printk lines
>>>> is clearly not the case - it's more like 61us.
>>>>
>>>> Are you running with those h/w timer delay patches?
>>> Nope.
>>
>> Ok. So loops_per_jiffy must be too small. My guess is you're using an
>> older kernel without 71c696b1 (calibrate: extract fall-back calculation
>> into own helper).
>>
> I am on V3.0-rc3+(latest mainline) and the above commit is already
> part of it.
>
>> The delay calibration code used to start out by setting:
>>
>> 	loops_per_jiffy = (1<<12);
>>
>> This will shorten the delay right down, and that's probably causing these
>> false spinlock lockup bug dumps.
>>
>> Arranging for IRQs to be disabled across the delay calibration just avoids
>> the issue by preventing any spinlock being taken.
>>
>> The reason that CPU#0 also complains about spinlock lockup is that for
>> some reason CPU#1 never finishes its calibration, and so the loop also
>> times out early on CPU#0.
>>
> I am not sure but what I think is happening is as soon as interrupts
> start firing, as part of IRQ handling, scheduler will try to
> enqueue softIRQ thread for newly booted CPU since it sees that
> it's active and ready. But that's failing and both CPUs
> eventually lock up. But I may be wrong here.

Even if that happens, there is NO WAY that the spinlock lockup detector
should report lockup in anything under 1s.

>> Of course, fiddling with this global variable in this way is _not_ a good
>> idea while other CPUs are running and using that variable.
>>
>> We could also do with implementing trigger_all_cpu_backtrace() to get
>> backtraces from the other CPUs when spinlock lockup happens...
>
> Any pointers on the other question about "why we need to enable
> interrupts before the CPU is ready?"

To ensure that things like the delay loop calibration and twd calibration
can run, though that looks like it'll run happily enough with the boot
CPU updating jiffies.

However, I'm still not taking your patch because I believe it's just
papering over the real issue, which is not as you describe.

You first need to work out why the spinlock lockup detection is firing
after just 61us rather than the full 1s and fix that.

You then need to work out whether you really do have spinlock lockup,
and if so, why. Implementing trigger_all_cpu_backtrace() may help to
find out what CPU#0 is doing, though we can only do that with IRQs on,
and so would be fragile.

We can test whether CPU#0 is going off to do something else while CPU#1
is being brought up, by adding a preempt_disable() / preempt_enable()
in __cpu_up() to prevent the wait-for-cpu#1-online being preempted by
other threads - I suspect you'll still see spinlock lockup on the
xtime seqlock on CPU#1 though. That would suggest a coherency issue.

Finally, how are you provoking this - and what kernel configuration are
you using?

On 6/20/2011 5:49 PM, Russell King - ARM Linux wrote:
> On Mon, Jun 20, 2011 at 05:21:48PM +0530, Santosh Shilimkar wrote:
>> On 6/20/2011 5:10 PM, Russell King - ARM Linux wrote:

[...]

>>
>> Any pointers on the other question about "why we need to enable
>> interrupts before the CPU is ready?"
>
> To ensure that things like the delay loop calibration and twd calibration
> can run, though that looks like it'll run happily enough with the boot
> CPU updating jiffies.
>
I guessed it and had same point as above. Calibration will still
work.

> However, I'm still not taking your patch because I believe it's just
> papering over the real issue, which is not as you describe.
>
> You first need to work out why the spinlock lockup detection is firing
> after just 61us rather than the full 1s and fix that.
>
This is possibly because of my script which doesn't wait for 1
second.

> You then need to work out whether you really do have spinlock lockup,
> and if so, why. Implementing trigger_all_cpu_backtrace() may help to
> find out what CPU#0 is doing, though we can only do that with IRQs on,
> and so would be fragile.
>
> We can test whether CPU#0 is going off to do something else while CPU#1
> is being brought up, by adding a preempt_disable() / preempt_enable()
> in __cpu_up() to prevent the wait-for-cpu#1-online being preempted by
> other threads - I suspect you'll still see spinlock lockup on the
> xtime seqlock on CPU#1 though. That would suggest a coherency issue.
>
> Finally, how are you provoking this - and what kernel configuration are
> you using?
Latest mainline kernel with omap2plus_defconfig and below simple script
to trigger the failure.

-------------
while true
do
	echo 0 > /sys/devices/system/cpu/cpu1/online
	echo 1 > /sys/devices/system/cpu/cpu1/online
done

Regards
Santosh

On Mon, Jun 20, 2011 at 05:57:01PM +0530, Santosh Shilimkar wrote:
> On 6/20/2011 5:49 PM, Russell King - ARM Linux wrote:
>> On Mon, Jun 20, 2011 at 05:21:48PM +0530, Santosh Shilimkar wrote:
>>> On 6/20/2011 5:10 PM, Russell King - ARM Linux wrote:
>
> [...]
>
>>>
>>> Any pointers on the other question about "why we need to enable
>>> interrupts before the CPU is ready?"
>>
>> To ensure that things like the delay loop calibration and twd calibration
>> can run, though that looks like it'll run happily enough with the boot
>> CPU updating jiffies.
>>
> I guessed it and had same point as above. Calibration will still
> work.
>
>> However, I'm still not taking your patch because I believe it's just
>> papering over the real issue, which is not as you describe.
>>
>> You first need to work out why the spinlock lockup detection is firing
>> after just 61us rather than the full 1s and fix that.
>>
> This is possibly because of my script which doesn't wait for 1
> second.

How could a userspace script affect the internal behaviour of spin_lock()
and the spinlock lockup detector?

> Latest mainline kernel with omap2plus_defconfig and below simple script
> to trigger the failure.
>
> -------------
> while true
> do
> 	echo 0 > /sys/devices/system/cpu/cpu1/online
> 	echo 1 > /sys/devices/system/cpu/cpu1/online
> done

Thanks, I'll give it a go here and see if I can debug it further.

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 344e52b..e34d750 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -318,9 +318,15 @@ asmlinkage void __cpuinit secondary_start_kernel(void)
 	smp_store_cpu_info(cpu);
 
 	/*
-	 * OK, now it's safe to let the boot CPU continue
+	 * OK, now it's safe to let the boot CPU continue.  Wait for
+	 * the CPU migration code to notice that the CPU is online
+	 * before we continue.
 	 */
+	local_irq_disable();
 	set_cpu_online(cpu, true);
+	while (!cpumask_test_cpu(cpu, cpu_active_mask))
+		cpu_relax();
+	local_irq_enable();
 
 	/*
 	 * OK, it's off to the idle thread for us
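
For readers following the online-versus-active discussion, the window this diff tries to close can be modelled with two bitmasks. This is a userspace sketch only: `struct cpu_masks` and the helpers below are hypothetical stand-ins for the kernel's cpu_online_mask and cpu_active_mask, and nothing here touches real interrupts or the scheduler.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for cpu_online_mask and cpu_active_mask. */
struct cpu_masks {
	unsigned long online;
	unsigned long active;
};

/* Test one CPU's bit in a mask. */
static bool mask_test(unsigned long mask, unsigned int cpu)
{
	assert(cpu < 8 * sizeof(mask));	/* bit index must fit in the mask */
	return (mask >> cpu) & 1UL;
}

/* True while the CPU is marked online but the notifiers have not yet
 * marked it active, i.e. the window discussed in this thread. */
static bool in_online_not_active_window(const struct cpu_masks *m,
					unsigned int cpu)
{
	return mask_test(m->online, cpu) && !mask_test(m->active, cpu);
}

/* The rule the diff enforces: leave IRQs masked until both bits are set. */
static bool may_enable_irqs(const struct cpu_masks *m, unsigned int cpu)
{
	return mask_test(m->online, cpu) && mask_test(m->active, cpu);
}
```

In this model the secondary CPU sets its online bit, then spins (the `cpu_relax()` loop in the diff) until the CPU that requested the boot sets the active bit, and only then does `may_enable_irqs()` become true.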