Message ID | 1362663356-21151-1-git-send-email-tom.leiming@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Mar 7, 2013 at 9:35 PM, Ming Lei <tom.leiming@gmail.com> wrote: > Commit 14318efb(ARM: 7587/1: implement optimized percpu variable access) > introduces arm's __my_cpu_offset to optimize percpu vaiable access, > which really works well on hackbench, but will cause __my_cpu_offset > to return garbage value before it is initialized in cpu_init() called > by setup_arch, so accessing percpu variable before setup_arch may cause > kernel hang. But generic __my_cpu_offset always returns zero before > percpu area is brought up, and won't hang kernel. > > So the patch tries to clear __my_cpu_offset on boot CPU early > to avoid boot hang. > > At least now percpu variable is accessed by lockdep before > setup_arch(), and enabling CONFIG_LOCK_STAT or CONFIG_DEBUG_LOCKDEP > can trigger kernel hang. > > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Rob Herring <rob.herring@calxeda.com> > Cc: Will Deacon <will.deacon@arm.com> > Cc: Nicolas Pitre <nico@linaro.org> > Cc: Russell King <rmk+kernel@arm.linux.org.uk> > Signed-off-by: Ming Lei <tom.leiming@gmail.com> > --- > V1: > - documents lockdep uses percpu variable early Looks no one objects the patch, so I has submitted it into Russell's patch system, and hope it can be pushed to linus tree soon and make LOCK_STAT/DEBUG_LOCKDEP usable on ARMv7. Thanks,
On Tue, Mar 12, 2013 at 10:32:21AM +0800, Ming Lei wrote: > On Thu, Mar 7, 2013 at 9:35 PM, Ming Lei <tom.leiming@gmail.com> wrote: > > Commit 14318efb(ARM: 7587/1: implement optimized percpu variable access) > > introduces arm's __my_cpu_offset to optimize percpu vaiable access, > > which really works well on hackbench, but will cause __my_cpu_offset > > to return garbage value before it is initialized in cpu_init() called > > by setup_arch, so accessing percpu variable before setup_arch may cause > > kernel hang. But generic __my_cpu_offset always returns zero before > > percpu area is brought up, and won't hang kernel. > > > > So the patch tries to clear __my_cpu_offset on boot CPU early > > to avoid boot hang. > > > > At least now percpu variable is accessed by lockdep before > > setup_arch(), and enabling CONFIG_LOCK_STAT or CONFIG_DEBUG_LOCKDEP > > can trigger kernel hang. > > > > Cc: Peter Zijlstra <peterz@infradead.org> > > Cc: Ingo Molnar <mingo@redhat.com> > > Cc: Rob Herring <rob.herring@calxeda.com> > > Cc: Will Deacon <will.deacon@arm.com> > > Cc: Nicolas Pitre <nico@linaro.org> > > Cc: Russell King <rmk+kernel@arm.linux.org.uk> > > Signed-off-by: Ming Lei <tom.leiming@gmail.com> > > --- > > V1: > > - documents lockdep uses percpu variable early > > Looks no one objects the patch, so I has submitted it into Russell's > patch system, and hope it can be pushed to linus tree soon and > make LOCK_STAT/DEBUG_LOCKDEP usable on ARMv7. I'm not convinced it is correct. Is the percpu data as stored in the kernel image (in other words, at offset zero) supposed to be read only? If so, the above means that we'll be accessing that rather than the copy of the percpu data we should be accessing. The percpu data areas are allocated by setup_per_cpu_areas() - that's where we should be initializing this, just like it's done on PowerPC.
On Tue, Mar 12, 2013 at 6:56 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: >> >> Looks no one objects the patch, so I has submitted it into Russell's >> patch system, and hope it can be pushed to linus tree soon and >> make LOCK_STAT/DEBUG_LOCKDEP usable on ARMv7. > > I'm not convinced it is correct. Is the percpu data as stored in the > kernel image (in other words, at offset zero) supposed to be read only? It should have been used after setup_per_cpu_areas(). > If so, the above means that we'll be accessing that rather than the > copy of the percpu data we should be accessing. I admit the patch is a work around for the problem, but it is harmless to make lockdep workable on arm at least. > The percpu data areas are allocated by setup_per_cpu_areas() - that's > where we should be initializing this, just like it's done on PowerPC. From the entry of start_kernel to setup_per_cpu_areas, there are many locks which will be acquired/released, so the percpu variable in lock_release has to be used early now. Either disabling lockdep during the period or introducing stupid/simple percpu variable inside lockdep may fix the probem, but looks both aren't perfect. So the workaround is proposed in this patch... Ingo and Peter, what is your opinion on the problem? Thanks
On Tue, Mar 12, 2013 at 07:25:28PM +0800, Ming Lei wrote: > On Tue, Mar 12, 2013 at 6:56 PM, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: > >> > >> Looks no one objects the patch, so I has submitted it into Russell's > >> patch system, and hope it can be pushed to linus tree soon and > >> make LOCK_STAT/DEBUG_LOCKDEP usable on ARMv7. > > > > I'm not convinced it is correct. Is the percpu data as stored in the > > kernel image (in other words, at offset zero) supposed to be read only? > > It should have been used after setup_per_cpu_areas(). > > > If so, the above means that we'll be accessing that rather than the > > copy of the percpu data we should be accessing. > > I admit the patch is a work around for the problem, but it is harmless > to make lockdep workable on arm at least. > > > The percpu data areas are allocated by setup_per_cpu_areas() - that's > > where we should be initializing this, just like it's done on PowerPC. > > >From the entry of start_kernel to setup_per_cpu_areas, there are many > locks which will be acquired/released, so the percpu variable in lock_release > has to be used early now. Either disabling lockdep during the period or > introducing stupid/simple percpu variable inside lockdep may fix the probem, > but looks both aren't perfect. > > So the workaround is proposed in this patch... > > Ingo and Peter, what is your opinion on the problem? Having discussed this with Ben Herrenschmidt, it seems that we do need to have a more complex patch to sort this out - we need to setup our private pointer inside setup_per_cpu_areas().
On Tue, Mar 12, 2013 at 7:30 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: >> >> Ingo and Peter, what is your opinion on the problem? > > Having discussed this with Ben Herrenschmidt, it seems that we do need > to have a more complex patch to sort this out - we need to setup our > private pointer inside setup_per_cpu_areas(). Suppose so, seems the patch is still needed to make CPU0 see static variables in '.data..percpu' section correctly. Thanks,
Hello, On Tue, Mar 12, 2013 at 07:44:55PM +0800, Ming Lei wrote: > On Tue, Mar 12, 2013 at 7:30 PM, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: > >> > >> Ingo and Peter, what is your opinion on the problem? > > > > Having discussed this with Ben Herrenschmidt, it seems that we do need > > to have a more complex patch to sort this out - we need to setup our > > private pointer inside setup_per_cpu_areas(). > > Suppose so, seems the patch is still needed to make CPU0 see > static variables in '.data..percpu' section correctly. If my memory serves me right, x86 also has places where CPU0 accesses its per-cpu data in .data.percpu. While those existed (not sure they're still there but probably they're) mostly due to historical reasons rather than by design, as long as the data is in consistent state by and during percpu setup, nothing will break. Thanks.
On Wed, Mar 13, 2013 at 1:25 AM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Tue, Mar 12, 2013 at 07:44:55PM +0800, Ming Lei wrote: >> On Tue, Mar 12, 2013 at 7:30 PM, Russell King - ARM Linux >> <linux@arm.linux.org.uk> wrote: >> >> >> >> Ingo and Peter, what is your opinion on the problem? >> > >> > Having discussed this with Ben Herrenschmidt, it seems that we do need >> > to have a more complex patch to sort this out - we need to setup our >> > private pointer inside setup_per_cpu_areas(). >> >> Suppose so, seems the patch is still needed to make CPU0 see >> static variables in '.data..percpu' section correctly. > > If my memory serves me right, x86 also has places where CPU0 accesses > its per-cpu data in .data.percpu. While those existed (not sure > they're still there but probably they're) mostly due to historical > reasons rather than by design, as long as the data is in consistent > state by and during percpu setup, nothing will break. Tejun, thanks for your input, yes, nothing will break. For lockdep, the percpu variables in non-boot-CPUs may be initialized again after percpu area is set up. Thanks,
On Tue, Mar 12, 2013 at 10:56:38AM +0000, Russell King - ARM Linux wrote: > On Tue, Mar 12, 2013 at 10:32:21AM +0800, Ming Lei wrote: > > On Thu, Mar 7, 2013 at 9:35 PM, Ming Lei <tom.leiming@gmail.com> wrote: > > > Commit 14318efb(ARM: 7587/1: implement optimized percpu variable access) > > > introduces arm's __my_cpu_offset to optimize percpu vaiable access, > > > which really works well on hackbench, but will cause __my_cpu_offset > > > to return garbage value before it is initialized in cpu_init() called > > > by setup_arch, so accessing percpu variable before setup_arch may cause > > > kernel hang. But generic __my_cpu_offset always returns zero before > > > percpu area is brought up, and won't hang kernel. > > > > > > So the patch tries to clear __my_cpu_offset on boot CPU early > > > to avoid boot hang. > > > > > > At least now percpu variable is accessed by lockdep before > > > setup_arch(), and enabling CONFIG_LOCK_STAT or CONFIG_DEBUG_LOCKDEP > > > can trigger kernel hang. > > > > > > Cc: Peter Zijlstra <peterz@infradead.org> > > > Cc: Ingo Molnar <mingo@redhat.com> > > > Cc: Rob Herring <rob.herring@calxeda.com> > > > Cc: Will Deacon <will.deacon@arm.com> > > > Cc: Nicolas Pitre <nico@linaro.org> > > > Cc: Russell King <rmk+kernel@arm.linux.org.uk> > > > Signed-off-by: Ming Lei <tom.leiming@gmail.com> > > > --- > > > V1: > > > - documents lockdep uses percpu variable early > > > > Looks no one objects the patch, so I has submitted it into Russell's > > patch system, and hope it can be pushed to linus tree soon and > > make LOCK_STAT/DEBUG_LOCKDEP usable on ARMv7. > > I'm not convinced it is correct. Is the percpu data as stored in the > kernel image (in other words, at offset zero) supposed to be read only? > If so, the above means that we'll be accessing that rather than the > copy of the percpu data we should be accessing. > > The percpu data areas are allocated by setup_per_cpu_areas() - that's > where we should be initializing this, just like it's done on PowerPC. Still not convinced this is a proper fix. Look, the problem is this: - Initially, set the CPU percpu offset to zero. This means the boot CPU reads and writes to the percpu data section in the kernel image. - The percpu areas are initialized, and the percpu data copied to each percpu data - this will have any writes from the boot CPU included as changes to the percpu data. - The boot CPU continues to read/write to the percpu data section. - If the boot CPU suspends/resumes, cpu_init() gets called, which will call set_my_cpu_offset(per_cpu_offset(cpu)); for the boot CPU. - The boot CPU now uses the allocated percpu data section and any updates it did in the percpu data section in the kernel image are lost to it. Your patch may be right on its own to solve the initial problem, but it leaves a _big_ hole. Now, the big question here: is it right that the boot CPU should ever write to the static percpu data section in the kernel image? What if there's a pointer in there, initially NULL, which then gets checked by each CPU and initialized if NULL - we'll end up sharing the same allocation amongst all CPUs, which probably isn't what was intended. If there's a list_head which gets added to, that too will be very bad. Although you have uncovered a problem, I still think by setting the offset to zero initially, you're just papering over a much bigger can of worms. So, should percpu data be used this early in boot before the percpu stuff is properly initialized? That feels _extremely_ unsafe. This, I think, needs to be addressed properly. And part of that is knowing where things went wrong. Will Deacon asked you for a backtrace showing where this problem occured. Your response seems to be to resend the patch with a "v1" tag a no new information. Sorry, not applying this until the above issue has been discussed and the location of these percpu accesses has been properly analysed.
Hi, On Thu, Apr 4, 2013 at 12:19 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > > Still not convinced this is a proper fix. Look, the problem is this: > > - Initially, set the CPU percpu offset to zero. This means the boot > CPU reads and writes to the percpu data section in the kernel image. > > - The percpu areas are initialized, and the percpu data copied to each > percpu data - this will have any writes from the boot CPU included as > changes to the percpu data. > > - The boot CPU continues to read/write to the percpu data section. No, the boot CPU switches to access percpu area after the area is created(after setup_per_cpu_areas() and smp_prepare_boot_cpu()). > > - If the boot CPU suspends/resumes, cpu_init() gets called, which will > call set_my_cpu_offset(per_cpu_offset(cpu)); for the boot CPU. > > - The boot CPU now uses the allocated percpu data section and any > updates it did in the percpu data section in the kernel image are > lost to it. > > Your patch may be right on its own to solve the initial problem, but > it leaves a _big_ hole. Without the patch, looks the hols was still here, at least before commit 14318efb(ARM: 7587/1: implement optimized percpu variable access), the per_cpu_offset() always return 0 before percpu area is created on ARM. Same on other ARCHs too. > > Now, the big question here: is it right that the boot CPU should ever > write to the static percpu data section in the kernel image? What if IMO, it isn't right, but as Tejun said, "it mostly due to historical reasons rather than by design, as long as the data is in consistent state by and during percpu setup, nothing will break." As far as the lockdep case, it is a bit special since lock can be used very early, and surely more early than creating percpu area. > there's a pointer in there, initially NULL, which then gets checked > by each CPU and initialized if NULL - we'll end up sharing the same > allocation amongst all CPUs, which probably isn't what was intended. > If there's a list_head which gets added to, that too will be very bad. Once lockstat is disabled, arm can boot well, that means no others might access percpu data section early, so could we just treat it as a special case? > > Although you have uncovered a problem, I still think by setting the > offset to zero initially, you're just papering over a much bigger > can of worms. > > So, should percpu data be used this early in boot before the percpu > stuff is properly initialized? That feels _extremely_ unsafe. > > This, I think, needs to be addressed properly. And part of that is > knowing where things went wrong. Will Deacon asked you for a backtrace > showing where this problem occured. Your response seems to be to > resend the patch with a "v1" tag a no new information. I have explained to Will Deacon, and looks no oops trace is generated on the memory access exception during booting. And the hang happened in mutex_unlock(&cgroup_mutex)(cgroup_init_subsys<- cgroup_init_early). > Sorry, not applying this until the above issue has been discussed > and the location of these percpu accesses has been properly analysed. No problem. Thanks,
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c index 3f6cbb2..1a9dc21 100644 --- a/arch/arm/kernel/setup.c +++ b/arch/arm/kernel/setup.c @@ -439,6 +439,13 @@ void __init smp_setup_processor_id(void) for (i = 1; i < nr_cpu_ids; ++i) cpu_logical_map(i) = i == cpu ? 0 : i; + /* + * clear __my_cpu_offset on boot CPU to avoid hang caused by + * using percpu variable early, for example, lockdep will + * access percpu variable inside lock_release + */ + set_my_cpu_offset(0); + printk(KERN_INFO "Booting Linux on physical CPU 0x%x\n", mpidr); }
Commit 14318efb(ARM: 7587/1: implement optimized percpu variable access) introduces arm's __my_cpu_offset to optimize percpu vaiable access, which really works well on hackbench, but will cause __my_cpu_offset to return garbage value before it is initialized in cpu_init() called by setup_arch, so accessing percpu variable before setup_arch may cause kernel hang. But generic __my_cpu_offset always returns zero before percpu area is brought up, and won't hang kernel. So the patch tries to clear __my_cpu_offset on boot CPU early to avoid boot hang. At least now percpu variable is accessed by lockdep before setup_arch(), and enabling CONFIG_LOCK_STAT or CONFIG_DEBUG_LOCKDEP can trigger kernel hang. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Nicolas Pitre <nico@linaro.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ming Lei <tom.leiming@gmail.com> --- V1: - documents lockdep uses percpu variable early arch/arm/kernel/setup.c | 7 +++++++ 1 file changed, 7 insertions(+)