Message ID: 50224547.9020000@de.bosch.com (mailing list archive)
State: New, archived
Dirk Behme wrote:
> On 08.08.2012 12:18, Shawn Guo wrote:
>> Thanks Dirk for reporting that imx6q restart (reboot command) is broken.
>>
>> I tracked down the issue a little bit and found imx6q_restart hangs
>> on the of_iomap/ioremap call. The following change, moving the call
>> somewhere else than imx6q_restart, will just fix the problem.
>>
>> Does that mean ioremap call is not allowed in platform restart hook?
>> I'm not sure about that, because I found it works just fine if I build
>> imx_v6_v7_defconfig with V6 (imx3) platforms excluded (IOW, build a V7
>> only kernel - imx5 and imx6), which is the case how I tested imx6q
>> restart feature when I was adding it.
>>
>> To summarize, the imx6q_restart hangs at ioremap call on a V6 + V7
>> kernel, while it works fine on a V7 only image. I need some help to
>> understand that.
>
> Some additional information from my debugging:
>
> a) Having a JTAG debugger attached to the i.MX6 SabreLite board I use
> (kernel built with imx_v6_v7_defconfig) the reboot does work. No hang.
> This does mean I can't debug the reboot hang with a JTAG debugger.
> Therefore I added some printk debugging:
>
> b) Adding some printk statements [1] in the of_iomap/ioremap call, it
> looks to me that the system hangs in
>
> of_iomap() -> ... -> set_pte_at() -> set_pte_ext() /
> cpu_v7_set_pte_ext() <= hang

I noticed this problem several weeks ago and did some debugging. What I found is the following:

- At the last stage of reset, all non-boot CPUs call ipi_cpu_stop()->cpu_relax(). For V6, cpu_relax() is defined as smp_mb(), and smp_mb() is defined as dmb ("mcr p15, 0, %0, c7, c10, 5").

- The boot CPU calls cpu_v7_set_pte_ext() in proc-v7-2level.S. At the end of this function the boot CPU executes flush_pte (mcr p15, 0, r0, c7, c10, 1), and after executing it the system hangs. That is to say, while all non-boot CPUs repeatedly execute "mcr p15, 0, %0, c7, c10, 5", the system hangs as soon as the boot CPU executes flush_pte (mcr p15, 0, r0, c7, c10, 1). This occurs on the Cortex-A9 r2p10 CPUs (Freescale i.MX6Q). I don't know the root cause of this problem, but if we split imx_v6_v7_defconfig into a v7-only config and use it to build the kernel image for imx6q, the problem disappears, since cpu_relax() is then defined as barrier().

- I think we should split imx_v6_v7_defconfig into v6-only and v7-only configs, because imx_v6_v7_defconfig introduces "-D__LINUX_ARM_ARCH__=6" unconditionally. This macro is fine for V6 platforms (imx3), but it is not good for V7 platforms (imx5, imx6). If you grep for "__LINUX_ARM_ARCH__" under arch/arm, every (#if __LINUX_ARM_ARCH__ == 6) and (#if __LINUX_ARM_ARCH__ >= 7) occurrence will affect the V7 platforms.

Regards,
Hui.
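For context, the definitions Hui refers to look roughly like this in kernels of that era; this is an illustrative excerpt of the generic ARM headers, not an exact copy of any particular tree:

/* arch/arm/include/asm/processor.h (approximate) */
#if __LINUX_ARM_ARCH__ == 6 || defined(CONFIG_ARM_ERRATA_754327)
#define cpu_relax()	smp_mb()
#else
#define cpu_relax()	barrier()
#endif

/* Barrier selection (approximate): ARMv6 uses the CP15 operation,
 * ARMv7 uses the dedicated DMB instruction. */
#if __LINUX_ARM_ARCH__ >= 7
#define dmb()	__asm__ __volatile__ ("dmb" : : : "memory")
#elif __LINUX_ARM_ARCH__ == 6
#define dmb()	__asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 5" \
				      : : "r" (0) : "memory")
#endif

With CONFIG_SMP, smp_mb() expands to dmb(), so on a combined V6+V7 build the stopped CPUs sit in a tight loop issuing the CP15 barrier, which is the behaviour discussed below.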
On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote: > Dirk Behme wrote: > >On 08.08.2012 12:18, Shawn Guo wrote: > >>Thanks Dirk for reporting that imx6q restart (reboot command) is broken. > >> > >>I tracked down the issue a little bit and found imx6q_restart hangs > >>on the of_iomap/ioremap call. The following change, moving the call > >>somewhere else than imx6q_restart, will just fix the problem. > >> > >>Does that mean ioremap call is not allowed in platform restart hook? > >>I'm not sure about that, because I found it works just fine if I build > >>imx_v6_v7_defconfig with V6 (imx3) platforms excluded (IOW, build a V7 > >>only kernel - imx5 and imx6), which is the case how I tested imx6q > >>restart feature when I was adding it. > >> > >>To summarize, the imx6q_restart hangs at ioremap call on a V6 + V7 > >>kernel, while it works fine on a V7 only image. I need some help to > >>understand that. > > > >Some additional information from my debugging: > > > >a) Having a JTAG debugger attached to the i.MX6 SabreLite board I > >use (kernel built with imx_v6_v7_defconfig) the reboot does work. > >No hang. This does mean I can't debug the reboot hang with a JTAG > >debugger. Therefore I added some printk debugging: > > > >b) Adding some printk statements [1] in the of_iomap/ioremap call, > >it looks to me that the system hangs in > > > >of_iomap() -> ... -> set_pte_at() -> set_pte_ext() / > >cpu_v7_set_pte_ext() <= hang > > > I noticed this problem several weeks ago, and did some debug, what i > found is following: > Thanks for these great findings. I also just tracked it down to that cpu_v7_set_pte_ext hangs on mcr p15, 0, r0, c7, c10, 1 @ flush_pte > - at the last stage of reset, all non-boot cpus will call > ipi_cpu_stop()->cpu_relax(), the cpu_relax() is defined to smp_mb() > for V6, and smp_mb() is defined to dmb ("mcr p15, 0, %0, c7, c10, > 5") > > - the boot cpu will call cpu_v7_set_pte_ext() in the > proc-v7-2level.S, at the end of this function, boot cpu will call > flush_pte (mcr p15, 0, r0, c7, c10, 1), after executing this > function, the system will hang. That is to say, all non-boot cpus > repeatedly call ("mcr p15, 0, %0, c7, c10, 5"), if boot cpu call > flush_pte (mcr p15, 0, r0, c7, c10, 1), the system will hang, this > occurs on the cortexa9 r2p10 cpus (freescale i.MX6Q), i don't know > the root cause for this problem, but if we split imx_v6_v7_defconfig > into v7_only and use v7_only to build kernel image for imx6q, this > problem will disappear, since the cpu_relax() will be defined to > barrier(). > > - i think we should split imx_v6_v7_defconfig into v6_only and > v7_only, since when we use imx_v6_v7_defconfig, we will introduce > "-D__LINUX_ARM_ARCH__=6" unconditionally, this macro is OK for V6 > platforms (imx3), but it is not good for V7 platforms (imx5, imx6). > lets grep "__LINUX_ARM_ARCH__" occurrence under the arch/arm, all > (#if __LINUX_ARM_ARCH__ == 6) and (#if __LINUX_ARM_ARCH__ >= 7) > occurrence will affect V7 platforms. I just copied a few more people. Let's hear their comments. Regards, Shawn
On Thu, Aug 09, 2012 at 12:41:38PM +0800, Shawn Guo wrote:
> On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote:
> >
> > - i think we should split imx_v6_v7_defconfig into v6_only and
> > v7_only, since when we use imx_v6_v7_defconfig, we will introduce
> > "-D__LINUX_ARM_ARCH__=6" unconditionally, this macro is OK for V6
> > platforms (imx3), but it is not good for V7 platforms (imx5, imx6).
> > lets grep "__LINUX_ARM_ARCH__" occurrence under the arch/arm, all
> > (#if __LINUX_ARM_ARCH__ == 6) and (#if __LINUX_ARM_ARCH__ >= 7)
> > occurrence will affect V7 platforms.
>
> I just copied a few more people. Let's hear their comments.

As a last resort we could split the defconfigs, but before that we should try to find a solution for this problem.

Sascha
On Thu, Aug 09, 2012 at 05:41:38AM +0100, Shawn Guo wrote: > On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote: > > Dirk Behme wrote: > > >On 08.08.2012 12:18, Shawn Guo wrote: > > >>Thanks Dirk for reporting that imx6q restart (reboot command) is broken. > > >> > > >>I tracked down the issue a little bit and found imx6q_restart hangs > > >>on the of_iomap/ioremap call. The following change, moving the call > > >>somewhere else than imx6q_restart, will just fix the problem. > > >> > > >>Does that mean ioremap call is not allowed in platform restart hook? > > >>I'm not sure about that, because I found it works just fine if I build > > >>imx_v6_v7_defconfig with V6 (imx3) platforms excluded (IOW, build a V7 > > >>only kernel - imx5 and imx6), which is the case how I tested imx6q > > >>restart feature when I was adding it. > > >> > > >>To summarize, the imx6q_restart hangs at ioremap call on a V6 + V7 > > >>kernel, while it works fine on a V7 only image. I need some help to > > >>understand that. > > > > > >Some additional information from my debugging: > > > > > >a) Having a JTAG debugger attached to the i.MX6 SabreLite board I > > >use (kernel built with imx_v6_v7_defconfig) the reboot does work. > > >No hang. This does mean I can't debug the reboot hang with a JTAG > > >debugger. Therefore I added some printk debugging: > > > > > >b) Adding some printk statements [1] in the of_iomap/ioremap call, > > >it looks to me that the system hangs in > > > > > >of_iomap() -> ... -> set_pte_at() -> set_pte_ext() / > > >cpu_v7_set_pte_ext() <= hang > > > > > I noticed this problem several weeks ago, and did some debug, what i > > found is following: > > > Thanks for these great findings. I also just tracked it down to that > cpu_v7_set_pte_ext hangs on > > mcr p15, 0, r0, c7, c10, 1 @ flush_pte > > > - at the last stage of reset, all non-boot cpus will call > > ipi_cpu_stop()->cpu_relax(), the cpu_relax() is defined to smp_mb() > > for V6, and smp_mb() is defined to dmb ("mcr p15, 0, %0, c7, c10, > > 5") > > > > - the boot cpu will call cpu_v7_set_pte_ext() in the > > proc-v7-2level.S, at the end of this function, boot cpu will call > > flush_pte (mcr p15, 0, r0, c7, c10, 1), after executing this > > function, the system will hang. That is to say, all non-boot cpus > > repeatedly call ("mcr p15, 0, %0, c7, c10, 5"), if boot cpu call > > flush_pte (mcr p15, 0, r0, c7, c10, 1), the system will hang, this > > occurs on the cortexa9 r2p10 cpus (freescale i.MX6Q), i don't know > > the root cause for this problem, but if we split imx_v6_v7_defconfig > > into v7_only and use v7_only to build kernel image for imx6q, this > > problem will disappear, since the cpu_relax() will be defined to > > barrier(). Does this work if you always define cpu_relax() to smp_mb() (even on ARMv7) and compile the kernel to v7-only? In this case, the smp_mb() would use the DMB instruction rather than the CP15 equivalent. It's not a solution, just trying to understand the problem better. Thanks.
On Thu, Aug 09, 2012 at 09:06:48AM +0100, Catalin Marinas wrote: > Does this work if you always define cpu_relax() to smp_mb() (even on > ARMv7) and compile the kernel to v7-only? In this case, the smp_mb() > would use the DMB instruction rather than the CP15 equivalent. It's not > a solution, just trying to understand the problem better. > This is a test I already did a couple of hours ago. No, it does not work. The kernel hangs at the same place.
On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote:
> - at the last stage of reset, all non-boot cpus will call
> ipi_cpu_stop()->cpu_relax(), the cpu_relax() is defined to smp_mb() for
> V6, and smp_mb() is defined to dmb ("mcr p15, 0, %0, c7, c10, 5")

I suspect having this dmb inside cpu_relax() is flooding the interconnects with traffic, which then prevents other CPUs getting a look-in (maybe there's no fairness when it comes to dmb's).

If I'm right, you'll find that even converting this to the ARMv7 DMB instruction won't fix the problem. It does, however, point towards a more serious problem - it means that any tight loop using dmb is detrimental. I have heard some people mention that even on various ARM SMP platforms, they have seen quite an amount of interaction between the individual CPU cores, and I'm beginning to wonder whether this is why.

I think a useful test would be to only execute the DMB maybe once in 50 or 100 loops - the DMB is there to work around a different problem with the temporal locality of stores on the local CPU. So, the only requirement is that we issue a DMB at some point while spinning waiting for another CPU to respond to our previous writes.
On Thu, Aug 9, 2012 at 4:20 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote: >> - at the last stage of reset, all non-boot cpus will call >> ipi_cpu_stop()->cpu_relax(), the cpu_relax() is defined to smp_mb() for >> V6, and smp_mb() is defined to dmb ("mcr p15, 0, %0, c7, c10, 5") > > I suspect having this dmb inside cpu_relax() is flooding the > interconnects with traffic, which then prevents other CPUs getting > a look-in (maybe there's no fairness when it comes to dmb's. > > If I'm right, you'll find is that even converting this to the ARMv7 > DMB instruction won't fix the problem. It does, however, point > towards a more serious problem - it means that any tight loop using > dmb is detremental. I have heard some people mention that even on > various ARM SMP platforms, they have see quite an amount of > interaction between the individual CPU cores, and I'm beginning > to wonder whether this is why. I have an irrelevant but possibly related question here; in mm/proc-v7.S there's this snip of code; #ifdef CONFIG_SMP ALT_SMP(mrc p15, 0, r0, c1, c0, 1) ALT_UP(mov r0, #(1 << 6)) @ fake it for UP tst r0, #(1 << 6) @ SMP/nAMP mode enabled? orreq r0, r0, #(1 << 6) @ Enable SMP/nAMP mode orreq r0, r0, r10 @ Enable CPU-specific SMP bits mcreq p15, 0, r0, c1, c0, 1 #endif Which I am reading as, read the SMP bit from cp15 and see if it's enabled, or on UP set the SMP bit and then write it back regardless. On a system where !CONFIG_SMP but it's SMP-capable like i.MX6Q, ALT_UP method will get used and the SMP bit will get set regardless. No other cores will be enabled (at least the i.MX6Q you need to turn them on using the System Reset Controller..) but it's still going to end up involving the processors in SMP activity on the bus. Or.. actually it just sets the SMP bit *regardless* on any platform. I don't have my assembly hat on today. Is this correct? Is this actually architecturally correct? Does SMP bit get set regardless on any system that has multiple CPU cores but they're all brought down except the primary (or only-left-executing) core? Once you've got all the CPUs down, in my mind, SMP should be unset and in theory any cache management broadcasts between the CPUs should stop architecturally (with the caveat that on CPU hotplug, you need to flush the bejesus out of the primary CPU cache and invalidate the cache for one you just brought up). So I assume this is some kind of performance optimization to enable CPUs to be unplugged and replugged without expensive cache management? Aren't caches invalidated/flushed by the kernel anyway at this point, or is the expectation that the SCU/PL310 or so would manage this underneath as long as the primary CPU is involved in SMP? Above that TLB ops broadcasting is turned on for A9 and A15 too. Even in UP? Are these operations simply essential to the SCU since it might not have any idea how many CPUs are running, and just broadcasts changes on the AXI bus anyway for the one CPU that's left to pick up? The docs don't make it too clear what's going on here, and the code doesn't enlighten me. I would think that on a non-SMP system you'd want to turn all of this off, not "fake it for UP"..? We think we've had some serious performance problems here which point to a significant loss of performance on the AXI bus going to DDR memory. 
The only thing we can attribute it to is some misconfiguration of the SCU/PL310 or on MX51 maybe the "M4IF" unit with regards to AXI caching or priority for different bus masters, and that SMP bus traffic is causing a bottleneck of cache management operations. Otherwise we'd expect several gigabytes per second to memory in a streaming situation and we can architect it to only manage a couple of hundred, and that doesn't seem at all right (since beyond using misaligned accesses or plain kooky instruction sequences, everything gets aligned by the time it gets to DDR, and the DDR memory controller should be doing huge bursts which get cached in the bus arbiter..). I can only look at this from a naive point of view because there's no way to look at it with the bare minimum code running, we need the OS around to do what we're doing. Also in other tests (cortex-strings) we get ridiculously good performance, but that's mostly because it's single threaded.. no other CPU does anything and the cache is basically useless with very large transfers. Now we know dmb makes things explode, you have to wonder are things being needlessly involved here or turned on when they needn't be, just to make the code easier to write? We also saw a performance decrease by some hard to understand amount in the lpj calculation on i.MX51 between a CPU_V7-only kernel based on 2.6.31 and Freescale's BSP, versus a V6/V7 defconfig kernel direct from mainline. I wonder if the less efficient V6 stuff is causing that.. it would really, really bite if that was it... we're still running tests though. We like to enable errata fixes regardless of whether it's on our processor since the vast majority of them are only selected at runtime depending on whether the core or unit revision affected is present, however... #if __LINUX_ARM_ARCH__ == 6 || defined(CONFIG_ARM_ERRATA_754327) #define cpu_relax() smp_mb() #else #define cpu_relax() barrier() #endif This dirt-old (pre r2p0) Cortex-A9 erratum being selected will force a V7-only kernel to use smp_mb rather than barrier regardless of what the processor type is.. that means every one-image-for-ARMv7 kernel which included Tegra2, the appropriate above erratum and a few others would hit this problem regardless. I assume if looping and comparing a value of "i" and then calling smp_mb can't break the procedure being expected then reading out the CPU core revision from a cached value in memory somewhere and comparing it and OPTIONALLY running smp_mb (which is then defined as dmb) would not be bad either? Maybe I am getting this backwards, but I really want someone to tell me that :)
On Thu, Aug 09, 2012 at 02:03:13PM -0500, Matt Sealey wrote: > I have an irrelevant but possibly related question here; in > mm/proc-v7.S there's this snip of code; > > #ifdef CONFIG_SMP > ALT_SMP(mrc p15, 0, r0, c1, c0, 1) > ALT_UP(mov r0, #(1 << 6)) @ fake it for UP > tst r0, #(1 << 6) @ SMP/nAMP mode enabled? > orreq r0, r0, #(1 << 6) @ Enable SMP/nAMP mode > orreq r0, r0, r10 @ Enable CPU-specific SMP bits > mcreq p15, 0, r0, c1, c0, 1 > #endif > > Which I am reading as, read the SMP bit from cp15 and see if it's > enabled, or on UP set the SMP bit and then write it back > regardless. > > On a system where !CONFIG_SMP but it's SMP-capable like i.MX6Q, ALT_UP > method will get used and the SMP bit will get > set regardless. I got here and stopped reading any further. No, it doesn't work like that; you don't understand the assembly. Yes, you're right that the ALT_UP version will be used. This sets r0 to 1 << 6. The very next instruction tests whether bit 6 is set in r0. The following three instructions will only be executed _if_ that bit was zero. I'm assuming that the rest of your email is therefore irrelevant; I have no desire to continue reading such a long message when the premise it's (presumably) based upon is false. If you have any further questions please restate them succinctly after taking on board the above. Thanks.
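To make the control flow Russell describes explicit, here is a C-style rendering of that proc-v7.S fragment; the helper names (read_actlr(), write_actlr()) and the function itself are invented for the sketch, and the ALT_SMP()/ALT_UP() runtime patching is modelled as a plain branch:

/* Sketch only -- not kernel code; helpers are invented for illustration. */
void v7_smp_bit_logic(int built_for_smp, unsigned int cpu_specific_bits)
{
	unsigned int actlr;

	if (built_for_smp)			/* ALT_SMP */
		actlr = read_actlr();		/* mrc p15, 0, r0, c1, c0, 1 */
	else					/* ALT_UP */
		actlr = 1 << 6;			/* "fake it": pretend the SMP bit is already set */

	if (!(actlr & (1 << 6))) {		/* tst: the orreq/mcreq only run if bit 6 was clear */
		actlr |= 1 << 6;		/* enable SMP/nAMP mode */
		actlr |= cpu_specific_bits;	/* r10: CPU-specific SMP bits */
		write_actlr(actlr);		/* mcreq p15, 0, r0, c1, c0, 1 */
	}
	/* On the ALT_UP path bit 6 is already set in the faked value, so the
	 * conditional write-back never executes and the real register is
	 * left untouched -- which is the point Russell is making. */
}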
On Thu, Aug 9, 2012 at 4:07 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Thu, Aug 09, 2012 at 02:03:13PM -0500, Matt Sealey wrote: >> I have an irrelevant but possibly related question here; in >> mm/proc-v7.S there's this snip of code; >> >> #ifdef CONFIG_SMP >> ALT_SMP(mrc p15, 0, r0, c1, c0, 1) >> ALT_UP(mov r0, #(1 << 6)) @ fake it for UP >> tst r0, #(1 << 6) @ SMP/nAMP mode enabled? >> orreq r0, r0, #(1 << 6) @ Enable SMP/nAMP mode >> orreq r0, r0, r10 @ Enable CPU-specific SMP bits >> mcreq p15, 0, r0, c1, c0, 1 >> #endif >> >> Which I am reading as, read the SMP bit from cp15 and see if it's >> enabled, or on UP set the SMP bit and then write it back >> regardless. >> >> On a system where !CONFIG_SMP but it's SMP-capable like i.MX6Q, ALT_UP >> method will get used and the SMP bit will get >> set regardless. > > I got here and stopped reading any further. No, it doesn't work like > that; you don't understand the assembly. > > Yes, you're right that the ALT_UP version will be used. This sets r0 > to 1 << 6. The very next instruction tests whether bit 6 is set in r0. > The following three instructions will only be executed _if_ that bit > was zero. Right but, as I read it... if (CONFIG_SMP) read cp15 else set bit to one to hack it if (!smp) set smp write it back So who unsets the SMP bit? When the core goes down does it clear it, or are we just being safe for CPU startup? Does it still have an effect in the chip logic even if the core is 'not plugged'? What happens if you're in CONFIG_SMP and you say maxcpus=1? > I'm assuming that the rest of your email is therefore irrelevant; I > have no desire to continue reading such a long message when the premise > it's (presumably) based upon is false. If you have any further questions > please restate them succinctly after taking on board the above. Thanks. Please read the rest... I just didn't write first my question properly.
On Fri, Aug 10, 2012 at 08:33:06AM -0500, Matt Sealey wrote: > On Thu, Aug 9, 2012 at 4:07 PM, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: > > On Thu, Aug 09, 2012 at 02:03:13PM -0500, Matt Sealey wrote: > >> I have an irrelevant but possibly related question here; in > >> mm/proc-v7.S there's this snip of code; > >> > >> #ifdef CONFIG_SMP > >> ALT_SMP(mrc p15, 0, r0, c1, c0, 1) > >> ALT_UP(mov r0, #(1 << 6)) @ fake it for UP > >> tst r0, #(1 << 6) @ SMP/nAMP mode enabled? > >> orreq r0, r0, #(1 << 6) @ Enable SMP/nAMP mode > >> orreq r0, r0, r10 @ Enable CPU-specific SMP bits > >> mcreq p15, 0, r0, c1, c0, 1 > >> #endif > >> > >> Which I am reading as, read the SMP bit from cp15 and see if it's > >> enabled, or on UP set the SMP bit and then write it back > >> regardless. > >> > >> On a system where !CONFIG_SMP but it's SMP-capable like i.MX6Q, ALT_UP > >> method will get used and the SMP bit will get > >> set regardless. > > > > I got here and stopped reading any further. No, it doesn't work like > > that; you don't understand the assembly. > > > > Yes, you're right that the ALT_UP version will be used. This sets r0 > > to 1 << 6. The very next instruction tests whether bit 6 is set in r0. > > The following three instructions will only be executed _if_ that bit > > was zero. > > Right but, as I read it... > > if (CONFIG_SMP) > read cp15 > else > set bit to one to hack it > if (!smp) > set smp > write it back Thank you for not reading what I wrote (or maybe you can't write back what I said correctly, but either way, it's just making me think that replying is a waste of time.) > So who unsets the SMP bit? When the core goes down does it clear it, > or are we just > being safe for CPU startup? Does it still have an effect in the chip > logic even if the core > is 'not plugged'? Generally we don't. Because on secure mode devices, we can't write the register, and if we tried to we'd get an abort. So, if we detect a SMP capable CPU, we will enable the SMP bit if it wasn't already enabled, and never ever attempt to disable it. > What happens if you're in CONFIG_SMP and you say maxcpus=1? > > > I'm assuming that the rest of your email is therefore irrelevant; I > > have no desire to continue reading such a long message when the premise > > it's (presumably) based upon is false. If you have any further questions > > please restate them succinctly after taking on board the above. Thanks. > > Please read the rest... I just didn't write first my question properly. No, unless you can say what you want more succinctly, I'm not going to bother reading your (apparantly regular) massive emails.
On 08/09/2012 04:20 AM, Russell King - ARM Linux wrote:
> On Thu, Aug 09, 2012 at 11:18:47AM +0800, Hui Wang wrote:
>> - at the last stage of reset, all non-boot cpus will call
>> ipi_cpu_stop()->cpu_relax(), the cpu_relax() is defined to smp_mb() for
>> V6, and smp_mb() is defined to dmb ("mcr p15, 0, %0, c7, c10, 5")
>
> I suspect having this dmb inside cpu_relax() is flooding the
> interconnects with traffic, which then prevents other CPUs getting
> a look-in (maybe there's no fairness when it comes to dmb's.
>
> If I'm right, you'll find is that even converting this to the ARMv7
> DMB instruction won't fix the problem. It does, however, point
> towards a more serious problem - it means that any tight loop using
> dmb is detremental. I have heard some people mention that even on
> various ARM SMP platforms, they have see quite an amount of
> interaction between the individual CPU cores, and I'm beginning
> to wonder whether this is why.
>
> I think a useful test would be to only execute the DMB maybe once
> in 50 or 100 loops - the DMB is there to work around a different
> problem with the temporal locality of stores on the local CPU. So,
> the only requirement is that we issue a DMB at some point while
> spinning waiting for another CPU to respond to our previous writes.

I think I am seeing a similar problem on highbank with a v7 only build.

From what I've debugged, restart hangs for me on the L2x0 spinlock during a writel. Changing the writel to writel_relaxed in the restart hook fixes the problem. This skips barriers in the writel and for the spinlock. However, I'm still puzzled as cpu_relax on the secondary cores should not be doing a dmb in my case on a v7 only build.

Rob
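A minimal sketch of the kind of change Rob describes; the function, base pointer, register offset and value here are made up for illustration and are not the actual highbank code:

/* Hypothetical restart hook (sketch only). */
static void __iomem *sysregs;		/* assumed to be ioremap'd earlier */
#define SYS_RESET_CTRL	0x14		/* invented offset */

static void board_restart(char mode, const char *cmd)
{
	/*
	 * writel() issues a write barrier before the store; with an outer
	 * L2x0 cache that barrier ends up in outer_sync(), which takes the
	 * l2x0 spinlock -- the point where the restart was seen to hang.
	 * The relaxed accessor is a plain store with no barrier and no
	 * outer_sync().
	 */
	writel_relaxed(1, sysregs + SYS_RESET_CTRL);
}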
On Wed, Aug 15, 2012 at 10:07:47AM -0500, Rob Herring wrote: > I think I am seeing a similar problem on highbank with a v7 only build. > > >From what I've debugged, restart hangs for me on the L2x0 spinlock > during a writel. Changing the writel to writel_relaxed in the restart > hook fixes the problem. This skips barriers in the writel and for the > spinlock. However, I'm still puzzled as cpu_relax on the secondary cores > should not be doing a dmb in my case on a v7 only build. Well, I had the idea of only doing a dmb() once every N loops, but I don't think we can sensibly introduce such a change into the mainline kernel. (How would cpu_relax() know it's being used in a loop?) Remember that the dmb() is in cpu_relax() as a work-around for the lack of temporal flushing of pending stores, and is needed to make various bits of the kernel work. So at the moment, there is no solution for this - and as I've pointed out it can be trivially exploited in userspace on the affected CPUs. So really a kernel work-around isn't going to sort it.
On Wed, Aug 15, 2012 at 10:44:12PM +0100, Russell King - ARM Linux wrote: > On Wed, Aug 15, 2012 at 10:07:47AM -0500, Rob Herring wrote: > > I think I am seeing a similar problem on highbank with a v7 only build. > > > > >From what I've debugged, restart hangs for me on the L2x0 spinlock > > during a writel. Changing the writel to writel_relaxed in the restart > > hook fixes the problem. This skips barriers in the writel and for the > > spinlock. However, I'm still puzzled as cpu_relax on the secondary cores > > should not be doing a dmb in my case on a v7 only build. > > Well, I had the idea of only doing a dmb() once every N loops, but I don't > think we can sensibly introduce such a change into the mainline kernel. > (How would cpu_relax() know it's being used in a loop?) > > Remember that the dmb() is in cpu_relax() as a work-around for the lack > of temporal flushing of pending stores, and is needed to make various bits > of the kernel work. > The cpu_relax() will do dmb() only #if __LINUX_ARM_ARCH__ == 6 || defined(CONFIG_ARM_ERRATA_754327). Otherwise, it's just a barrier() call. I guess Rob's puzzle is since the cpu_relax on the stopped cores does not do dmb, why a wmb/mb call on the running cpu would hang it. One thing I note is that mb() will do a outer_sync() call. Since the issue is around L2x0 operation, not sure if they are related ... > So at the moment, there is no solution for this - and as I've pointed out > it can be trivially exploited in userspace on the affected CPUs. So > really a kernel work-around isn't going to sort it. So, Sascha, it seems we get another good reason to split imx_v6_v7_defconfig into imx_v6_defconfig and imx_v7_defconfig? I have seen Matt Sealey and Hui Wang suggested doing this already.
On 15 August 2012 22:44, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Wed, Aug 15, 2012 at 10:07:47AM -0500, Rob Herring wrote: >> I think I am seeing a similar problem on highbank with a v7 only build. >> >> >From what I've debugged, restart hangs for me on the L2x0 spinlock >> during a writel. Changing the writel to writel_relaxed in the restart >> hook fixes the problem. This skips barriers in the writel and for the >> spinlock. However, I'm still puzzled as cpu_relax on the secondary cores >> should not be doing a dmb in my case on a v7 only build. > > Well, I had the idea of only doing a dmb() once every N loops, but I don't > think we can sensibly introduce such a change into the mainline kernel. > (How would cpu_relax() know it's being used in a loop?) It doesn't really need to know. We can say that every X calls to cpu_relax() should result in a dmb and we can make the counter per-cpu. This would ensure the write-buffer draining which was the aim of this dmb.
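A minimal sketch of what Catalin is suggesting; the interval, the names and the placement are all assumptions, and a real patch would declare the per-CPU counter in a header and define it in a C file rather than doing it all inline:

/* Sketch only: drain the write buffer every Nth cpu_relax() call instead
 * of on every call.  Needs <linux/percpu.h>. */
#define CPU_RELAX_DMB_INTERVAL	64	/* arbitrary */

static DEFINE_PER_CPU(unsigned int, relax_count);

static inline void cpu_relax(void)
{
	/*
	 * Migration between the increment and the dmb is harmless: the dmb
	 * is only a heuristic write-buffer drain, not a correctness barrier.
	 */
	if (__this_cpu_inc_return(relax_count) % CPU_RELAX_DMB_INTERVAL == 0)
		dmb();
	else
		barrier();
}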
On Wed, Aug 15, 2012 at 9:31 PM, Shawn Guo <shawn.guo@linaro.org> wrote: > On Wed, Aug 15, 2012 at 10:44:12PM +0100, Russell King - ARM Linux wrote: >> On Wed, Aug 15, 2012 at 10:07:47AM -0500, Rob Herring wrote: >> > I think I am seeing a similar problem on highbank with a v7 only build. >> > >> > >From what I've debugged, restart hangs for me on the L2x0 spinlock >> > during a writel. Changing the writel to writel_relaxed in the restart >> > hook fixes the problem. This skips barriers in the writel and for the >> > spinlock. However, I'm still puzzled as cpu_relax on the secondary cores >> > should not be doing a dmb in my case on a v7 only build. >> >> Well, I had the idea of only doing a dmb() once every N loops, but I don't >> think we can sensibly introduce such a change into the mainline kernel. >> (How would cpu_relax() know it's being used in a loop?) >> >> Remember that the dmb() is in cpu_relax() as a work-around for the lack >> of temporal flushing of pending stores, and is needed to make various bits >> of the kernel work. >> > The cpu_relax() will do dmb() only #if __LINUX_ARM_ARCH__ == 6 || > defined(CONFIG_ARM_ERRATA_754327). Otherwise, it's just a barrier() > call. I guess Rob's puzzle is since the cpu_relax on the stopped > cores does not do dmb, why a wmb/mb call on the running cpu would hang > it. > > One thing I note is that mb() will do a outer_sync() call. Since the > issue is around L2x0 operation, not sure if they are related ... > >> So at the moment, there is no solution for this - and as I've pointed out >> it can be trivially exploited in userspace on the affected CPUs. So >> really a kernel work-around isn't going to sort it. > > So, Sascha, it seems we get another good reason to split > imx_v6_v7_defconfig into imx_v6_defconfig and imx_v7_defconfig? > I have seen Matt Sealey and Hui Wang suggested doing this already. Only because Hui suggested it. I figured it would be better from the standpoint of allowing proper compiler-generated or more efficient/architecture-revision-specific code to be generated. It was mostly because I also have a side nit about individual erratum being enabled, in that by default in imx_v6_v7_defconfig they aren't; that's not nice and it's true of other ARM platforms too (Tegra notwithstanding). It's unlikely you'd want to run your chip without these fixes enabled, but in any kernel built from imx_v6_v7_defconfig it's likely MX51, MX53 aren't going to run particularly reliably. There are a couple options to make it more likely to run with a working configuration, namely turning every single one on regardless, or alternatively every mach defined (SOC_IMX51, SOC_OMAP or so) or so depends on the errata necessary for correct operation (just as ARCH_TEGRA_2x_SOC does right now) so you can still whittle down the number of compiled-in errata to the CPUs and cache controllers needed to boot that particular kernel on the particular boards you chose. The vast majority are only done on cores that need it, but this particular one, and a couple others I noticed, will be enabled in the future on "v7-pure" multi-platform configs if you just enable Tegra support with the above solution (since it's not v6, but it is affected by erratum 754327). Is there any way to fix cpu_relax(), smp_mb() or whatever to specifically fix that particular core or particular A9 revision at runtime rather than having it be compile-time if we're doing it every few loops, per-cpu like Catalin suggested? 
I assume it's not dangerous to do so (and would mean we don't run into it on the systems it's not booted on, rather than running into it by virtue of compiling it in). I'm not suggesting it fixes the dmb issue Russell described (especially the userspace part), but it would mean the kernel won't be issuing them itself under any circumstance when it doesn't need to.
On Thu, Aug 16, 2012 at 10:31:11AM +0800, Shawn Guo wrote: > On Wed, Aug 15, 2012 at 10:44:12PM +0100, Russell King - ARM Linux wrote: > > On Wed, Aug 15, 2012 at 10:07:47AM -0500, Rob Herring wrote: > > > I think I am seeing a similar problem on highbank with a v7 only build. > > > > > > >From what I've debugged, restart hangs for me on the L2x0 spinlock > > > during a writel. Changing the writel to writel_relaxed in the restart > > > hook fixes the problem. This skips barriers in the writel and for the > > > spinlock. However, I'm still puzzled as cpu_relax on the secondary cores > > > should not be doing a dmb in my case on a v7 only build. > > > > Well, I had the idea of only doing a dmb() once every N loops, but I don't > > think we can sensibly introduce such a change into the mainline kernel. > > (How would cpu_relax() know it's being used in a loop?) > > > > Remember that the dmb() is in cpu_relax() as a work-around for the lack > > of temporal flushing of pending stores, and is needed to make various bits > > of the kernel work. > > > The cpu_relax() will do dmb() only #if __LINUX_ARM_ARCH__ == 6 || > defined(CONFIG_ARM_ERRATA_754327). Otherwise, it's just a barrier() > call. I guess Rob's puzzle is since the cpu_relax on the stopped > cores does not do dmb, why a wmb/mb call on the running cpu would hang > it. > > One thing I note is that mb() will do a outer_sync() call. Since the > issue is around L2x0 operation, not sure if they are related ... This doesn't get around the problem that userspace can still effectively issue a DoS against the system by just running a dmb in a tight loop. Or maybe this would have a much more dramatic effect: while (1) { asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); asm("dmb"); } and make that 3 seconds to get a ps listing turn into something much longer? > > So at the moment, there is no solution for this - and as I've pointed out > > it can be trivially exploited in userspace on the affected CPUs. So > > really a kernel work-around isn't going to sort it. > > So, Sascha, it seems we get another good reason to split > imx_v6_v7_defconfig into imx_v6_defconfig and imx_v7_defconfig? > I have seen Matt Sealey and Hui Wang suggested doing this already. The last thing we need are more defconfigs. I think what needs to happen here (while we wait) is someone _with_ the problem needs to experiment, and find out how many nops are needed for the DMB not to have much effect in cpu_relax(). If it turns out we just need to put one nop in, then that's not _too_ bad.
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -13,6 +13,19 @@
 #include <asm/cacheflush.h>
 #include <asm/pgtable.h>
 
+void set_pte_debug(struct mm_struct *mm, unsigned long addr,
+		   pte_t *ptep, pte_t pteval)
+{
+	if (addr >= TASK_SIZE) {
+		printk("=> set_pte_debug #1\n");
+		set_pte_ext(ptep, pteval, 0);
+		printk("=> set_pte_debug #2\n");
+	} else {
+		__sync_icache_dcache(pteval);
+		set_pte_ext(ptep, pteval, PTE_EXT_NG);
+	}
+}
+
 static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
 		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
 {
@@ -25,7 +38,7 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
 		return -ENOMEM;
 	do {
 		BUG_ON(!pte_none(*pte));
-		set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
+		set_pte_debug(&init_mm, addr, pte, pfn_pte(pfn, prot));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);