| Message ID | 20110620142338.GL2082@n2100.arm.linux.org.uk (mailing list archive) |
|---|---|
| State | New, archived |

On 6/20/2011 7:53 PM, Russell King - ARM Linux wrote:
> On Mon, Jun 20, 2011 at 12:40:19PM +0100, Russell King - ARM Linux wrote:
>> Ok. So loops_per_jiffy must be too small. My guess is you're using an
>> older kernel without 71c696b1 (calibrate: extract fall-back calculation
>> into own helper).
>
> Right, this commit above helps show the problem - and it's fairly subtle.
>
> It's a race condition. Let's first look at the spinlock debugging code.
> It does this:
>
> static void __spin_lock_debug(raw_spinlock_t *lock)
> {
> 	u64 i;
> 	u64 loops = loops_per_jiffy * HZ;
>
> 	for (;;) {
> 		for (i = 0; i < loops; i++) {
> 			if (arch_spin_trylock(&lock->raw_lock))
> 				return;
> 			__delay(1);
> 		}
> 		/* print warning */
> 	}
> }
>
> If loops_per_jiffy is zero, we never try to grab the spinlock, because
> we never enter the inner for loop. We immediately print a warning,
> and re-execute the outer loop for ever, resulting in the CPU locking up
> in this condition.
>
> In theory, we should never see a zero loops_per_jiffy value, because it
> represents the number of loops __delay() needs to delay by one jiffy and
> clearly zero makes no sense.
>
> However, calibrate_delay() does this (which x86 and ARM call on secondary
> CPU startup):
>
> calibrate_delay()
> {
> ...
> 	if (preset_lpj) {
> 	} else if ((!printed) && lpj_fine) {
> 	} else if ((loops_per_jiffy = calibrate_delay_direct()) != 0) {
> 	} else {
> 		/* approximation/convergence stuff */
> 	}
> }
>
> Now, before 71c696b, this used to be:
>
> 	} else {
> 		loops_per_jiffy = (1<<12);
>
> So the window between calibrate_delay_direct() returning and setting
> loops_per_jiffy to zero, and the re-initialization of loops_per_jiffy,
> was relatively short (maybe even the compiler optimized away the zero
> write.)
>
> However, after 71c696b, this now does:
>
> 	} else {
> 		if (!printed)
> 			pr_info("Calibrating delay loop... ");
> +		loops_per_jiffy = calibrate_delay_converge();
>
> So, as loops_per_jiffy is not local to this function, the compiler has
> to write out that zero value before calling calibrate_delay_converge(),
> and loops_per_jiffy only becomes non-zero _after_ calibrate_delay_converge()
> has returned. This opens the window and allows the spinlock debugging
> code to explode.
>
> This patch closes the window completely, by writing to loops_per_jiffy
> only when we have a real value for it.
>
> This allows me to boot 3.0.0-rc3 on Versatile Express (4 CPU) whereas
> without this it fails with spinlock lockup and rcu problems.
>
>  init/calibrate.c |   14 ++++++++------
>  1 files changed, 8 insertions(+), 6 deletions(-)
>
I am away from my board now. Will test this change.
btw, the online-active race is still open even with this patch
and should be fixed.

Regards
Santosh

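To make the failure mode described above concrete, here is a minimal user-space C sketch (not kernel code) of the __spin_lock_debug() pattern. The names spin_lock_debug_sketch() and arch_spin_trylock_stub() are invented stand-ins for illustration: with loops_per_jiffy still zero, the inner loop body never runs, so the lock is never even attempted and the outer loop would spin printing lockup warnings forever (the demo stops after three passes).

```c
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's global and primitives. */
static unsigned long loops_per_jiffy;	/* zero during the race window */
#define HZ 100

static int arch_spin_trylock_stub(void)
{
	return 0;			/* pretend another CPU holds the lock */
}

/* Mirrors the shape of __spin_lock_debug(): if loops == 0, the inner
 * loop never executes, so the lock is never tried and every pass of
 * the outer loop just prints a lockup warning. */
static void spin_lock_debug_sketch(void)
{
	unsigned long long i, loops = (unsigned long long)loops_per_jiffy * HZ;
	int passes = 0;

	for (;;) {
		for (i = 0; i < loops; i++)
			if (arch_spin_trylock_stub())
				return;
		printf("BUG: spinlock lockup suspected (loops=%llu)\n", loops);
		if (++passes == 3)	/* the kernel would spin here forever */
			return;
	}
}

int main(void)
{
	loops_per_jiffy = 0;		/* the window opened by calibrate_delay() */
	spin_lock_debug_sketch();
	return 0;
}
```
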
On Mon, Jun 20, 2011 at 08:24:33PM +0530, Santosh Shilimkar wrote:
> I am away from my board now. Will test this change.
> btw, the online-active race is still open even with this patch
> and should be fixed.

I have yet to see any evidence of that race - I've been running your
test loop for about an hour so far on Versatile Express and nothing yet.

That's not to say that we shouldn't wait for the active mask to become
true before calling schedule(), but I don't think it's as big a deal as
you're suggesting it is.

On 6/20/2011 8:31 PM, Russell King - ARM Linux wrote:
> On Mon, Jun 20, 2011 at 08:24:33PM +0530, Santosh Shilimkar wrote:
>> I am away from my board now. Will test this change.
>> btw, the online-active race is still open even with this patch
>> and should be fixed.
>
> I have yet to see any evidence of that race - I've been running your
> test loop for about an hour so far on Versatile Express and nothing
> yet.
>
In that case my script was just exposing the race condition in the
calibrate() code.

> That's not to say that we shouldn't wait for the active mask to become
> true before calling schedule(), but I don't think it's as big a deal as
> you're suggesting it is.

I am not expert enough to trigger that exact online-to-active race
myself, but I am sure it needs to be fixed. Maybe Thomas can suggest a
test case to expose it. From his fix for x86, it appeared that the race
was indeed hit in some sequence.

Regards,
Santosh

Russell,

On 6/20/2011 8:24 PM, Santosh Shilimkar wrote:
> On 6/20/2011 7:53 PM, Russell King - ARM Linux wrote:
>> On Mon, Jun 20, 2011 at 12:40:19PM +0100, Russell King - ARM Linux wrote:

[....]

>> So, as loops_per_jiffy is not local to this function, the compiler has
>> to write out that zero value before calling calibrate_delay_converge(),
>> and loops_per_jiffy only becomes non-zero _after_
>> calibrate_delay_converge() has returned. This opens the window and
>> allows the spinlock debugging code to explode.
>>
>> This patch closes the window completely, by writing to loops_per_jiffy
>> only when we have a real value for it.
>>
>> This allows me to boot 3.0.0-rc3 on Versatile Express (4 CPU) whereas
>> without this it fails with spinlock lockup and rcu problems.
>>
>>  init/calibrate.c |   14 ++++++++------
>>  1 files changed, 8 insertions(+), 6 deletions(-)
>>
> I am away from my board now. Will test this change.

Have tested your change and it seems to fix the crash I was observing.
Are you planning to send this fix for rc5?

> btw, the online-active race is still open even with this patch
> and should be fixed.
>
The only problem that remains is waiting for the active mask before
marking the CPU online. Shall I refresh my patch with only this change
then?

Regards
Santosh

On Tue, Jun 21, 2011 at 02:38:34PM +0530, Santosh Shilimkar wrote:
> Russell,
>
> On 6/20/2011 8:24 PM, Santosh Shilimkar wrote:
>> On 6/20/2011 7:53 PM, Russell King - ARM Linux wrote:
>>> So, as loops_per_jiffy is not local to this function, the compiler has
>>> to write out that zero value before calling calibrate_delay_converge(),
>>> and loops_per_jiffy only becomes non-zero _after_
>>> calibrate_delay_converge() has returned. This opens the window and
>>> allows the spinlock debugging code to explode.
>>>
>>> This patch closes the window completely, by writing to loops_per_jiffy
>>> only when we have a real value for it.
>>>
>>> This allows me to boot 3.0.0-rc3 on Versatile Express (4 CPU) whereas
>>> without this it fails with spinlock lockup and rcu problems.
>>>
>>>  init/calibrate.c |   14 ++++++++------
>>>  1 files changed, 8 insertions(+), 6 deletions(-)
>>>
>> I am away from my board now. Will test this change.
> Have tested your change and it seems to fix the crash I
> was observing. Are you planning to send this fix for rc5?

Yes. I think sending CPUs into infinite loops in the spinlock code is
definitely sufficiently serious that it needs to go to Linus ASAP.
It'd be nice to have a tested-by line though.

>> btw, the online-active race is still open even with this patch
>> and should be fixed.
>>
> The only problem that remains is waiting for the active mask before
> marking the CPU online. Shall I refresh my patch with only this change
> then?

I already have that as a separate change.

On 6/21/2011 3:30 PM, Russell King - ARM Linux wrote:
> On Tue, Jun 21, 2011 at 02:38:34PM +0530, Santosh Shilimkar wrote:
>> Russell,
>>
>> On 6/20/2011 8:24 PM, Santosh Shilimkar wrote:
>>> On 6/20/2011 7:53 PM, Russell King - ARM Linux wrote:
>>>> So, as loops_per_jiffy is not local to this function, the compiler has
>>>> to write out that zero value before calling calibrate_delay_converge(),
>>>> and loops_per_jiffy only becomes non-zero _after_
>>>> calibrate_delay_converge() has returned. This opens the window and
>>>> allows the spinlock debugging code to explode.
>>>>
>>>> This patch closes the window completely, by writing to loops_per_jiffy
>>>> only when we have a real value for it.
>>>>
>>>> This allows me to boot 3.0.0-rc3 on Versatile Express (4 CPU) whereas
>>>> without this it fails with spinlock lockup and rcu problems.
>>>>
>>>>  init/calibrate.c |   14 ++++++++------
>>>>  1 files changed, 8 insertions(+), 6 deletions(-)
>>>>
>>> I am away from my board now. Will test this change.
>> Have tested your change and it seems to fix the crash I
>> was observing. Are you planning to send this fix for rc5?
>
> Yes. I think sending CPUs into infinite loops in the spinlock code is
> definitely sufficiently serious that it needs to go to Linus ASAP.
> It'd be nice to have a tested-by line though.
>
Sure.

>>> btw, the online-active race is still open even with this patch
>>> and should be fixed.
>>>
>> The only problem that remains is waiting for the active mask before
>> marking the CPU online. Shall I refresh my patch with only this change
>> then?
>
> I already have that as a separate change.

Can you point me to both of these commits so that I have them in my
tree for testing. Thanks for the help.

Regards
Santosh

On Tue, Jun 21, 2011 at 03:47:04PM +0530, Santosh Shilimkar wrote:
> On 6/21/2011 3:30 PM, Russell King - ARM Linux wrote:
>> On Tue, Jun 21, 2011 at 02:38:34PM +0530, Santosh Shilimkar wrote:
>>> Russell,
>>>
>>> On 6/20/2011 8:24 PM, Santosh Shilimkar wrote:
>>>> On 6/20/2011 7:53 PM, Russell King - ARM Linux wrote:
>>>>> So, as loops_per_jiffy is not local to this function, the compiler has
>>>>> to write out that zero value before calling calibrate_delay_converge(),
>>>>> and loops_per_jiffy only becomes non-zero _after_
>>>>> calibrate_delay_converge() has returned. This opens the window and
>>>>> allows the spinlock debugging code to explode.
>>>>>
>>>>> This patch closes the window completely, by writing to loops_per_jiffy
>>>>> only when we have a real value for it.
>>>>>
>>>>> This allows me to boot 3.0.0-rc3 on Versatile Express (4 CPU) whereas
>>>>> without this it fails with spinlock lockup and rcu problems.
>>>>>
>>>>>  init/calibrate.c |   14 ++++++++------
>>>>>  1 files changed, 8 insertions(+), 6 deletions(-)
>>>>>
>>>> I am away from my board now. Will test this change.
>>> Have tested your change and it seems to fix the crash I
>>> was observing. Are you planning to send this fix for rc5?
>>
>> Yes. I think sending CPUs into infinite loops in the spinlock code is
>> definitely sufficiently serious that it needs to go to Linus ASAP.
>> It'd be nice to have a tested-by line though.
>>
> Sure.
>
>>>> btw, the online-active race is still open even with this patch
>>>> and should be fixed.
>>>>
>>> The only problem that remains is waiting for the active mask before
>>> marking the CPU online. Shall I refresh my patch with only this change
>>> then?
>>
>> I already have that as a separate change.
> Can you point me to both of these commits so that I have
> them in my tree for testing.

I won't be committing the init/calibrate.c change to a git tree - it
isn't ARM stuff so it goes in patch form.

On 6/21/2011 3:49 PM, Russell King - ARM Linux wrote:
> On Tue, Jun 21, 2011 at 03:47:04PM +0530, Santosh Shilimkar wrote:

[...]

>>> I already have that as a separate change.
>> Can you point me to both of these commits so that I have
>> them in my tree for testing.
>
> I won't be committing the init/calibrate.c change to a git tree - it
> isn't ARM stuff so it goes in patch form.

Patches with a change log would be fine as well.

Regards
Santosh

On Tue, Jun 21, 2011 at 03:51:00PM +0530, Santosh Shilimkar wrote:
> On 6/21/2011 3:49 PM, Russell King - ARM Linux wrote:
>> On Tue, Jun 21, 2011 at 03:47:04PM +0530, Santosh Shilimkar wrote:
>
> [...]
>
>>>> I already have that as a separate change.
>>> Can you point me to both of these commits so that I have
>>> them in my tree for testing.
>>
>> I won't be committing the init/calibrate.c change to a git tree - it
>> isn't ARM stuff so it goes in patch form.
>
> Patches with a change log would be fine as well.

The answer is not at the moment, but maybe soon.

On Tue, Jun 21, 2011 at 01:16:47PM -0700, Stephen Boyd wrote:
> On 06/21/2011 03:26 AM, Russell King - ARM Linux wrote:
> > On Tue, Jun 21, 2011 at 03:51:00PM +0530, Santosh Shilimkar wrote:
> >> On 6/21/2011 3:49 PM, Russell King - ARM Linux wrote:
> >>> I won't be committing the init/calibrate.c change to a git tree - it
> >>> isn't ARM stuff so it goes in patch form.
> >> Patches with a change log would be fine as well.
> > The answer is not at the moment, but maybe soon.
>
> Should we send those two patches to the stable trees as well? They seem
> to fix issues with cpu onlining that have existed for a long time.

Looks to me like the problem was introduced for 2.6.39-rc1, so we
should probably get the fix into the 2.6.39-stable tree too.

On 06/21/2011 04:10 PM, Russell King - ARM Linux wrote:
> On Tue, Jun 21, 2011 at 01:16:47PM -0700, Stephen Boyd wrote:
>> On 06/21/2011 03:26 AM, Russell King - ARM Linux wrote:
>>> On Tue, Jun 21, 2011 at 03:51:00PM +0530, Santosh Shilimkar wrote:
>>>> On 6/21/2011 3:49 PM, Russell King - ARM Linux wrote:
>>>>> I won't be committing the init/calibrate.c change to a git tree - it
>>>>> isn't ARM stuff so it goes in patch form.
>>>> Patches with a change log would be fine as well.
>>> The answer is not at the moment, but maybe soon.
>> Should we send those two patches to the stable trees as well? They seem
>> to fix issues with cpu onlining that have existed for a long time.
> Looks to me like the problem was introduced for 2.6.39-rc1, so we
> should probably get the fix into the 2.6.39-stable tree too.

Are we talking about the loops_per_jiffy problem or the cpu_active
problem? I would think the cpu_active problem has been there since SMP
support was added to ARM, and the loops_per_jiffy problem has been there
(depending on the compiler) since 8a9e1b0 ([PATCH] Platform SMIs and
their interferance with tsc based delay calibration, 2005-06-23). So
pretty much every stable tree would want both of these patches.

On Tue, Jun 21, 2011 at 05:06:58PM -0700, Stephen Boyd wrote:
> On 06/21/2011 04:10 PM, Russell King - ARM Linux wrote:
> > On Tue, Jun 21, 2011 at 01:16:47PM -0700, Stephen Boyd wrote:
> >> On 06/21/2011 03:26 AM, Russell King - ARM Linux wrote:
> >>> On Tue, Jun 21, 2011 at 03:51:00PM +0530, Santosh Shilimkar wrote:
> >>>> On 6/21/2011 3:49 PM, Russell King - ARM Linux wrote:
> >>>>> I won't be committing the init/calibrate.c change to a git tree - it
> >>>>> isn't ARM stuff so it goes in patch form.
> >>>> Patches with a change log would be fine as well.
> >>> The answer is not at the moment, but maybe soon.
> >> Should we send those two patches to the stable trees as well? They seem
> >> to fix issues with cpu onlining that have existed for a long time.
> > Looks to me like the problem was introduced for 2.6.39-rc1, so we
> > should probably get the fix into the 2.6.39-stable tree too.
>
> Are we talking about the loops_per_jiffy problem or the cpu_active
> problem? I would think the cpu_active problem has been there since SMP
> support was added to ARM, and the loops_per_jiffy problem has been there
> (depending on the compiler) since 8a9e1b0 ([PATCH] Platform SMIs and
> their interferance with tsc based delay calibration, 2005-06-23).

The cpu_active problem hasn't actually caused any symptoms on ARM, so
it's low priority. It's only a problem which should be sorted in
-stable _if_ someone reports that it has caused a problem. Up until
Santosh's patch, no one has done so, and I've not seen any problems on
any of my ARM SMP platforms coming from it.

As for loops_per_jiffy, it isn't a problem before the commit ID I
pointed out - I've checked the assembly, and the compiler optimizes
away the initialization of loops_per_jiffy to zero - the first write
is when it's set to (1<<12).

Take a moment to think about this:

	if ((loops_per_jiffy = 0) == 0) {
	} else {
		loops_per_jiffy = 1<<12;
		...
	}

Any compiler worth talking about is going to optimize away the initial
constant write to loops_per_jiffy there, provided loops_per_jiffy is
not volatile.

So, although it's not desirable for older kernels to have their lpj
overwritten in this way, it doesn't cause the spinlock debugging code
to explode. This can be shown to be correct because there hasn't been
any problem with ARM secondary CPU bringup until recently. Plus, the
previous version of the code would require significant changes to sort
the problem out.

So, the lpj patch will only sensibly apply to 2.6.39-rc1 and later, and
so it's only going to be submitted for 2.6.39-stable. For previous
kernels, the risk of changing it outweighs by several orders of
magnitude any benefit coming from the change.

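Russell's dead-store argument can be seen with a small compile-only C sketch, to be inspected with `gcc -O2 -S`. This is not kernel code: old_fallback(), new_fallback() and extern_calibrate() are hypothetical names, and extern_calibrate() stands in for calibrate_delay_converge(), deliberately left as an external declaration so the compiler must assume it may read the global.

```c
/* dead_store.c - compile-only demo: gcc -O2 -S dead_store.c */
unsigned long loops_per_jiffy;		/* non-volatile global, as in the kernel */

/* Hypothetical stand-in for calibrate_delay_converge(); left external so
 * the compiler cannot prove it never reads loops_per_jiffy. */
unsigned long extern_calibrate(void);

void old_fallback(void)
{
	/* The zero store is immediately overwritten by a constant, with no
	 * intervening point where anything could observe the global, so an
	 * optimizing compiler may drop it: only the 1<<12 store is emitted. */
	loops_per_jiffy = 0;
	loops_per_jiffy = 1 << 12;
}

void new_fallback(void)
{
	/* Here the zero store precedes a call to an external function that
	 * might read the global, so it has to be emitted - the global stays
	 * zero until the call returns and the result is written back. */
	loops_per_jiffy = 0;
	loops_per_jiffy = extern_calibrate();
}
```

Comparing the assembly of the two functions shows the window in question: the second form must publish the zero value before the call, which is exactly what the patch below avoids by calibrating into a local variable first.
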
```diff
diff --git a/init/calibrate.c b/init/calibrate.c
index 2568d22..aae2f40 100644
--- a/init/calibrate.c
+++ b/init/calibrate.c
@@ -245,30 +245,32 @@ static unsigned long __cpuinit calibrate_delay_converge(void)
 
 void __cpuinit calibrate_delay(void)
 {
+	unsigned long lpj;
 	static bool printed;
 
 	if (preset_lpj) {
-		loops_per_jiffy = preset_lpj;
+		lpj = preset_lpj;
 		if (!printed)
 			pr_info("Calibrating delay loop (skipped) "
 				"preset value.. ");
 	} else if ((!printed) && lpj_fine) {
-		loops_per_jiffy = lpj_fine;
+		lpj = lpj_fine;
 		pr_info("Calibrating delay loop (skipped), "
 			"value calculated using timer frequency.. ");
-	} else if ((loops_per_jiffy = calibrate_delay_direct()) != 0) {
+	} else if ((lpj = calibrate_delay_direct()) != 0) {
 		if (!printed)
 			pr_info("Calibrating delay using timer "
 				"specific routine.. ");
 	} else {
 		if (!printed)
 			pr_info("Calibrating delay loop... ");
-		loops_per_jiffy = calibrate_delay_converge();
+		lpj = calibrate_delay_converge();
 	}
 	if (!printed)
 		pr_cont("%lu.%02lu BogoMIPS (lpj=%lu)\n",
-			loops_per_jiffy/(500000/HZ),
-			(loops_per_jiffy/(5000/HZ)) % 100, loops_per_jiffy);
+			lpj/(500000/HZ),
+			(lpj/(5000/HZ)) % 100, lpj);
+	loops_per_jiffy = lpj;
 	printed = true;
 }
```