diff mbox

[RFC] ARM: smp: Fix the CPU hotplug race with scheduler.

Message ID 1308561839-18407-1-git-send-email-santosh.shilimkar@ti.com (mailing list archive)
State New, archived
Headers show

Commit Message

Santosh Shilimkar June 20, 2011, 9:23 a.m. UTC
The current ARM CPU hotplug code suffers from couple of race conditions
in CPU online path with scheduler.
The ARM CPU hotplug code doesn't wait for hot-plugged CPU to be marked
active as part of cpu_notify() by the CPU which brought it up before
enabling interrupts.

So we end up in with couple of race conditions,
1) Interrupts are enabled even before CPU is marked as active.
2) Newly plugged CPU is marked as active but it is not marked online yet.

When an interrupt happens before the cpu_active bit is set, the scheduler
can't schedule the woken thread which is bound to that newly onlined cpu
and and selects a fallback runqueue.

Secondly marking CPU active before it is online also not desirable behaviour.

Fix this race conditions.

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
---
On v3.0 kernel I started seeing lock-up and random crashes when
CPU online/offline was attempted aggressively. With git bisect I could
reach to Peter's commit e4a52bcb9a18142d79e231b6733cabdbf2e67c1f[sched:
Remove rq->lock from the first half of ttwu()] which was also reported
by Marc Zyngier. But even after using the follow up fix from Peter
d6aa8f85f163[sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW],
I was still seeing issues with hotplug path. So as a experiment I just
pushed down the interrupt enabling on newly plugged CPU after it's marked
as online. This made things better and much stable but occasionally I was
still seeing lock-up.

With above as background I looked at arch/x86/ code and got convinced
myself that the experimental hack could be the right fix. While doing
this I came across a commit from Thomas fd8a7de177b [x86: cpu-hotplug:
Prevent softirq wakeup on wrong CPU] which fixed the race 2) on x86
architecture. 

In this patch I have folded possible fixes for both race conditions for
ARM hotplug code as mentioned in change log. Hopefully I am not
introducing any new race with this patch and hence the RFC. 

 arch/arm/kernel/smp.c |   18 +++++++++++-------
 1 files changed, 11 insertions(+), 7 deletions(-)

Comments

Russell King - ARM Linux June 20, 2011, 9:50 a.m. UTC | #1
On Mon, Jun 20, 2011 at 02:53:59PM +0530, Santosh Shilimkar wrote:
> The current ARM CPU hotplug code suffers from couple of race conditions
> in CPU online path with scheduler.
> The ARM CPU hotplug code doesn't wait for hot-plugged CPU to be marked
> active as part of cpu_notify() by the CPU which brought it up before
> enabling interrupts.

Hmm, why not just move the set_cpu_online() call before notify_cpu_starting()
and add the wait after the set_cpu_online() ?
Santosh Shilimkar June 20, 2011, 10:19 a.m. UTC | #2
On 6/20/2011 3:20 PM, Russell King - ARM Linux wrote:
> On Mon, Jun 20, 2011 at 02:53:59PM +0530, Santosh Shilimkar wrote:
>> The current ARM CPU hotplug code suffers from couple of race conditions
>> in CPU online path with scheduler.
>> The ARM CPU hotplug code doesn't wait for hot-plugged CPU to be marked
>> active as part of cpu_notify() by the CPU which brought it up before
>> enabling interrupts.
>
> Hmm, why not just move the set_cpu_online() call before notify_cpu_starting()
> and add the wait after the set_cpu_online() ?
I think that's what patch is doing. Do you mean, calling hotplug
notifier chain immediately after CPU marked as online. Something
like below.

	set_cpu_online(cpu, true);
	notify_cpu_starting(cpu);
         while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
                 cpu_relax();

         /*
          * Enable local interrupts.
          */
                 local_irq_enable();
         local_fiq_enable();


Regards
Santosh
diff mbox

Patch

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 344e52b..84373a9 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -302,13 +302,6 @@  asmlinkage void __cpuinit secondary_start_kernel(void)
 	platform_secondary_init(cpu);
 
 	/*
-	 * Enable local interrupts.
-	 */
-	notify_cpu_starting(cpu);
-	local_irq_enable();
-	local_fiq_enable();
-
-	/*
 	 * Setup the percpu timer for this CPU.
 	 */
 	percpu_timer_setup();
@@ -322,6 +315,17 @@  asmlinkage void __cpuinit secondary_start_kernel(void)
 	 */
 	set_cpu_online(cpu, true);
 
+
+	while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
+		cpu_relax();
+
+	/*
+	 * Enable local interrupts.
+	 */
+	notify_cpu_starting(cpu);
+	local_irq_enable();
+	local_fiq_enable();
+
 	/*
 	 * OK, it's off to the idle thread for us
 	 */