Message ID | 1372152228-16199-1-git-send-email-josephl@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 06/25/2013 03:23 AM, Joseph Lo wrote: > Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework > to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering > this state. I assume this patch is only needed to support the other two series you sent today; the Tegra114 cpuidle and system suspend series? In other words, there's no need to apply this to 3.11 to fix a bug; it can wait for 3.12 along with those other two series?
On Tue, 2013-06-25 at 23:12 +0800, Stephen Warren wrote: > On 06/25/2013 03:23 AM, Joseph Lo wrote: > > Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework > > to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering > > this state. > > I assume this patch is only needed to support the other two series you > sent today; the Tegra114 cpuidle and system suspend series? In other > words, there's no need to apply this to 3.11 to fix a bug; it can wait > for 3.12 along with those other two series? This patch is independent of the other two series, you can apply it along. Yes, all these series were for 3.12. Thanks, Joseph
On 06/25/2013 03:23 AM, Joseph Lo wrote: > Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework > to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering > this state. I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114: cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM: tegra114: add support for system suspend", all on top of v3.11-rc1. On Dalmore, the new cpuidle mode /appears/ to work (I see increasing values in the sysfs cpuidle "usage" file for all defined cpuidle states), but I don't see the "CPU VDD off" LED light up; I'm not convinced that the CPU is actually being powered off in the idle mode. With these patches applies, Harmony acts very strangely. After hot-unplug and re-plug of CPU1, the system is hung or almost hung. The patches appear to reduce the amount of time CPU VDD is off judging by the LEDs. The patches might cause issues with system suspend too, but I'm not 100% sure. As such, I haven't applied any of these. Can you please test boards for all of Tegra20/30/114, and validate that on Dalmore the CPU power is actually being turned off, and report back? Thanks.
On Mon, Jul 15, 2013 at 08:04:44PM +0200, Stephen Warren wrote: > On 06/25/2013 03:23 AM, Joseph Lo wrote: > > Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework > > to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering > > this state. > > I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114: > cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM: > tegra114: add support for system suspend", all on top of v3.11-rc1. > > On Dalmore, the new cpuidle mode /appears/ to work (I see increasing > values in the sysfs cpuidle "usage" file for all defined cpuidle > states), but I don't see the "CPU VDD off" LED light up; I'm not > convinced that the CPU is actually being powered off in the idle mode. The current patchset only powergates the individual cores on Tegra114. The entire cluster (including L2 cache etc) isn't railgated as a CPUidle state. Hence the CPU VDD off LED will never be on. Cheers, Peter.
On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote: > On 06/25/2013 03:23 AM, Joseph Lo wrote: > > Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework > > to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering > > this state. > > I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114: > cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM: > tegra114: add support for system suspend", all on top of v3.11-rc1. > > On Dalmore, the new cpuidle mode /appears/ to work (I see increasing > values in the sysfs cpuidle "usage" file for all defined cpuidle > states), but I don't see the "CPU VDD off" LED light up; I'm not > convinced that the CPU is actually being powered off in the idle mode. > The LED of the CPU Vdd indicates the power of CPU cluster. But the series of CPU idle power down mode support for Tegra114 only supports per core power down. It did not support cluster power down yet in idle mode. That needs some extra work to support cluster power down in idle mode. There are some items that I plat to do: 1. integrate the MCPM (multi cluster power management) code into Tegra CPU PM code 2. integrate coupled CPUidle framework into Tegra CPU idle driver 3. normalize the Tegra CPU idle driver to single one > With these patches applies, Harmony acts very strangely. After > hot-unplug and re-plug of CPU1, the system is hung or almost hung. The > patches appear to reduce the amount of time CPU VDD is off judging by > the LEDs. The patches might cause issues with system suspend too, but > I'm not 100% sure. > Actually I didn't add anything about hotplug or something can increase the loading or make regressions. But I think you are testing with USB device attached, right? That might cause some extra loading. You can give it a try after just removing USB device. And I double confirm that on Seaboard. Except that, the test result looks OK for me. > As such, I haven't applied any of these. Can you please test boards for > all of Tegra20/30/114, and validate that on Dalmore the CPU power is > actually being turned off, and report back? Thanks. I have tested all these patches based on next-20130716. And verified the functions on Seaboard, Cardhu and Dalmore. Looks good for me. And the per core power down in idle is OK too. The cluster power down only support when the system goes into suspend mode right now for Tegra114. Thanks, Joseph
On 07/16/2013 01:17 PM, Joseph Lo wrote: > On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote: >> On 06/25/2013 03:23 AM, Joseph Lo wrote: >>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework >>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering >>> this state. >> >> I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114: >> cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM: >> tegra114: add support for system suspend", all on top of v3.11-rc1. >> >> On Dalmore, the new cpuidle mode /appears/ to work (I see increasing >> values in the sysfs cpuidle "usage" file for all defined cpuidle >> states), but I don't see the "CPU VDD off" LED light up; I'm not >> convinced that the CPU is actually being powered off in the idle mode. >> > The LED of the CPU Vdd indicates the power of CPU cluster. But the > series of CPU idle power down mode support for Tegra114 only supports > per core power down. It did not support cluster power down yet in idle > mode. That needs some extra work to support cluster power down in idle > mode. > There are some items that I plat to do: > 1. integrate the MCPM (multi cluster power management) code into Tegra > CPU PM code +1 > 2. integrate coupled CPUidle framework into Tegra CPU idle driver If you are using the MCPM, I suggest you rely on the MCPM to replace the coupled idle state by it. The code will look much more consistent. > 3. normalize the Tegra CPU idle driver to single one I am working on making a single ARM driver, so may be instead of doing a single tegra driver, we can sync up to move the code forward to the same scheme than the others driver, then consolidating the tegra drivers would be easier and consistent. That would be very nice, if you can separate in the code the PM arch specific blocks and encapsulate the driver specific code, so no more include from <mach/...> are needed. That will allow to move the drivers to drivers/cpuidle directory. >> With these patches applies, Harmony acts very strangely. After >> hot-unplug and re-plug of CPU1, the system is hung or almost hung. The >> patches appear to reduce the amount of time CPU VDD is off judging by >> the LEDs. The patches might cause issues with system suspend too, but >> I'm not 100% sure. >> > Actually I didn't add anything about hotplug or something can increase > the loading or make regressions. But I think you are testing with USB > device attached, right? That might cause some extra loading. You can > give it a try after just removing USB device. And I double confirm that > on Seaboard. Except that, the test result looks OK for me. > >> As such, I haven't applied any of these. Can you please test boards for >> all of Tegra20/30/114, and validate that on Dalmore the CPU power is >> actually being turned off, and report back? Thanks. > > I have tested all these patches based on next-20130716. And verified the > functions on Seaboard, Cardhu and Dalmore. Looks good for me. And the > per core power down in idle is OK too. The cluster power down only > support when the system goes into suspend mode right now for Tegra114. > > Thanks, > Joseph > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel >
On 07/16/2013 05:17 AM, Joseph Lo wrote: > On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote: >> On 06/25/2013 03:23 AM, Joseph Lo wrote: >>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework >>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering >>> this state. >> >> I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114: >> cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM: >> tegra114: add support for system suspend", all on top of v3.11-rc1. >> >> On Dalmore, the new cpuidle mode /appears/ to work (I see increasing >> values in the sysfs cpuidle "usage" file for all defined cpuidle >> states), but I don't see the "CPU VDD off" LED light up; I'm not >> convinced that the CPU is actually being powered off in the idle mode. > > The LED of the CPU Vdd indicates the power of CPU cluster. But the > series of CPU idle power down mode support for Tegra114 only supports > per core power down. OK, that's fine then. >> With these patches applies, Harmony acts very strangely. After >> hot-unplug and re-plug of CPU1, the system is hung or almost hung. The >> patches appear to reduce the amount of time CPU VDD is off judging by >> the LEDs. The patches might cause issues with system suspend too, but >> I'm not 100% sure. > > Actually I didn't add anything about hotplug Well, the patches do touch the CPU reset vector and related code. That's shared with hotplug, right? But anyway, it turns out the CPU hotplug is not related to the issue; the system acts strangely even without/before performing any hotplug. I just hadn't noticed that before. > or something can increase > the loading or make regressions. But I think you are testing with USB > device attached, right? That might cause some extra loading. You can > give it a try after just removing USB device. And I double confirm that > on Seaboard. Except that, the test result looks OK for me. Yes, having a USB device plugged in does affect the issue. In summary: next-20130716, with or without USB devices plugged in: Works fine. next-20130716 with your patches, without USB device plugged in: Occasional short problems (detailed below). next-20130716 with your patches, with USB device plugged in: Continual long problems (detailed below). The testing I did was to log in over the serial console, hit <enter> around five times, and watch the shell prompt echo back, then type "ls<enter" and watch the (directory listing appear (I have about 20 files in my directory, with long-ish names), then repeat. With plain next-20130716, this is nice and fast and there are no pauses. With your patches applied, there are occasional or continual pauses, which last for either a short time (tenths of a second), or even a second or two, both depending on whether a USB device is plugged in or not. I believe this is the same issue I saw when I applied your patches to the Tegra tree on top of v3.11-rc1. So, either your patch causes a bug, or exposes some pre-existing bug. Either way, I can't apply them because they cause a regression. I wonder if this is at all related to NVIDIA bug 1286857 "Upstreaming task: Enabling LP2 causes slow serial console"? I tried the WAR there: echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state1/disable echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state1/disable ... and the system locked up solid very soon after; I had just enough time to type "ls<enter>", then it locked up. Again, performing those same echos on next-20130716 without your patches works fine.
On Tue, 2013-07-16 at 20:11 +0800, Daniel Lezcano wrote: > On 07/16/2013 01:17 PM, Joseph Lo wrote: > > On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote: > >> On 06/25/2013 03:23 AM, Joseph Lo wrote: > >>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework > >>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering > >>> this state. > >> > >> I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114: > >> cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM: > >> tegra114: add support for system suspend", all on top of v3.11-rc1. > >> > >> On Dalmore, the new cpuidle mode /appears/ to work (I see increasing > >> values in the sysfs cpuidle "usage" file for all defined cpuidle > >> states), but I don't see the "CPU VDD off" LED light up; I'm not > >> convinced that the CPU is actually being powered off in the idle mode. > >> > > The LED of the CPU Vdd indicates the power of CPU cluster. But the > > series of CPU idle power down mode support for Tegra114 only supports > > per core power down. It did not support cluster power down yet in idle > > mode. That needs some extra work to support cluster power down in idle > > mode. > > There are some items that I plat to do: > > 1. integrate the MCPM (multi cluster power management) code into Tegra > > CPU PM code > > +1 > > > 2. integrate coupled CPUidle framework into Tegra CPU idle driver > > If you are using the MCPM, I suggest you rely on the MCPM to replace the > coupled idle state by it. The code will look much more consistent. > Yes, true. Thanks for your advise. Joseph
On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote: > On 07/16/2013 05:17 AM, Joseph Lo wrote: > > On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote: > >> On 06/25/2013 03:23 AM, Joseph Lo wrote: > >>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework > >>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering > >>> this state. > > or something can increase > > the loading or make regressions. But I think you are testing with USB > > device attached, right? That might cause some extra loading. You can > > give it a try after just removing USB device. And I double confirm that > > on Seaboard. Except that, the test result looks OK for me. > > Yes, having a USB device plugged in does affect the issue. In summary: > > next-20130716, with or without USB devices plugged in: Works fine. > > next-20130716 with your patches, without USB device plugged in: > Occasional short problems (detailed below). > > next-20130716 with your patches, with USB device plugged in: Continual > long problems (detailed below). > > The testing I did was to log in over the serial console, hit <enter> > around five times, and watch the shell prompt echo back, then type > "ls<enter" and watch the (directory listing appear (I have about 20 > files in my directory, with long-ish names), then repeat. With plain > next-20130716, this is nice and fast and there are no pauses. With your > patches applied, there are occasional or continual pauses, which last > for either a short time (tenths of a second), or even a second or two, > both depending on whether a USB device is plugged in or not. > > I believe this is the same issue I saw when I applied your patches to > the Tegra tree on top of v3.11-rc1. > OK. I did more stress tests last night and today. I found it cause by the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and only impact the Tegra20 platform. The hot plug regression seems due to this patch. After dropping this patch on top of v3.11-rc1, the Tegra20 can back to normal. And the hop plug and suspend stress test can pass on Tegra30/114 too. Can the other two patch series for Tegra114 to support CPU idle power down mode and system suspend still moving forward, not be blocked by this patch? Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for hot plug on Tegra20, I will continue to check this. You can just drop this patch. Thanks, Joseph
On 07/17/2013 12:15 PM, Joseph Lo wrote: > On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote: >> On 07/16/2013 05:17 AM, Joseph Lo wrote: >>> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote: >>>> On 06/25/2013 03:23 AM, Joseph Lo wrote: >>>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework >>>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering >>>>> this state. >>> or something can increase >>> the loading or make regressions. But I think you are testing with USB >>> device attached, right? That might cause some extra loading. You can >>> give it a try after just removing USB device. And I double confirm that >>> on Seaboard. Except that, the test result looks OK for me. >> >> Yes, having a USB device plugged in does affect the issue. In summary: >> >> next-20130716, with or without USB devices plugged in: Works fine. >> >> next-20130716 with your patches, without USB device plugged in: >> Occasional short problems (detailed below). >> >> next-20130716 with your patches, with USB device plugged in: Continual >> long problems (detailed below). >> >> The testing I did was to log in over the serial console, hit <enter> >> around five times, and watch the shell prompt echo back, then type >> "ls<enter" and watch the (directory listing appear (I have about 20 >> files in my directory, with long-ish names), then repeat. With plain >> next-20130716, this is nice and fast and there are no pauses. With your >> patches applied, there are occasional or continual pauses, which last >> for either a short time (tenths of a second), or even a second or two, >> both depending on whether a USB device is plugged in or not. >> >> I believe this is the same issue I saw when I applied your patches to >> the Tegra tree on top of v3.11-rc1. >> > OK. I did more stress tests last night and today. I found it cause by > the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and > only impact the Tegra20 platform. The hot plug regression seems due to > this patch. After dropping this patch on top of v3.11-rc1, the Tegra20 > can back to normal. > > And the hop plug and suspend stress test can pass on Tegra30/114 too. > > Can the other two patch series for Tegra114 to support CPU idle power > down mode and system suspend still moving forward, not be blocked by > this patch? > > Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for > hot plug on Tegra20, I will continue to check this. You can just drop > this patch. Please in the future Cc me when problems occur with some patches I submitted. That would be easier for me to track the issues and fix them when they happen. Thanks -- Daniel
On Wed, 2013-07-17 at 18:21 +0800, Daniel Lezcano wrote: > On 07/17/2013 12:15 PM, Joseph Lo wrote: > > On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote: > >> On 07/16/2013 05:17 AM, Joseph Lo wrote: > >>> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote: > >>>> On 06/25/2013 03:23 AM, Joseph Lo wrote: > >>>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework > >>>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering > >>>>> this state. > >>> or something can increase > >>> the loading or make regressions. But I think you are testing with USB > >>> device attached, right? That might cause some extra loading. You can > >>> give it a try after just removing USB device. And I double confirm that > >>> on Seaboard. Except that, the test result looks OK for me. > >> > >> Yes, having a USB device plugged in does affect the issue. In summary: > >> > >> next-20130716, with or without USB devices plugged in: Works fine. > >> > >> next-20130716 with your patches, without USB device plugged in: > >> Occasional short problems (detailed below). > >> > >> next-20130716 with your patches, with USB device plugged in: Continual > >> long problems (detailed below). > >> > >> The testing I did was to log in over the serial console, hit <enter> > >> around five times, and watch the shell prompt echo back, then type > >> "ls<enter" and watch the (directory listing appear (I have about 20 > >> files in my directory, with long-ish names), then repeat. With plain > >> next-20130716, this is nice and fast and there are no pauses. With your > >> patches applied, there are occasional or continual pauses, which last > >> for either a short time (tenths of a second), or even a second or two, > >> both depending on whether a USB device is plugged in or not. > >> > >> I believe this is the same issue I saw when I applied your patches to > >> the Tegra tree on top of v3.11-rc1. > >> > > OK. I did more stress tests last night and today. I found it cause by > > the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and > > only impact the Tegra20 platform. The hot plug regression seems due to > > this patch. After dropping this patch on top of v3.11-rc1, the Tegra20 > > can back to normal. > > > > And the hop plug and suspend stress test can pass on Tegra30/114 too. > > > > Can the other two patch series for Tegra114 to support CPU idle power > > down mode and system suspend still moving forward, not be blocked by > > this patch? > > > > Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for > > hot plug on Tegra20, I will continue to check this. You can just drop > > this patch. > > Please in the future Cc me when problems occur with some patches I > submitted. That would be easier for me to track the issues and fix them > when they happen. > Ah, sorry for that. I should always Cc you. Will do next time. Thanks for take care. Joseph
On 07/17/2013 04:15 AM, Joseph Lo wrote: > On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote: >> On 07/16/2013 05:17 AM, Joseph Lo wrote: >>> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote: >>>> On 06/25/2013 03:23 AM, Joseph Lo wrote: >>>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework >>>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering >>>>> this state. ... [ discussion of issues with Joesph's patches applied] > > OK. I did more stress tests last night and today. I found it cause by > the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and > only impact the Tegra20 platform. The hot plug regression seems due to > this patch. After dropping this patch on top of v3.11-rc1, the Tegra20 > can back to normal. > > And the hop plug and suspend stress test can pass on Tegra30/114 too. > > Can the other two patch series for Tegra114 to support CPU idle power > down mode and system suspend still moving forward, not be blocked by > this patch? > > Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for > hot plug on Tegra20, I will continue to check this. You can just drop > this patch. OK, if I drop that patch, then everything on Tegra20 and Tegra30 seems fine again. However, I've found some new and exciting issue on Tegra114! With unmodified v3.11-rc1, I can do the following without issue: * Unplug/replug CPUs, so that I had all combinations of CPU 1, 2, 3 plugged/unpplugged (with CPU 0 always plugged). * Unplug/replug CPUs, so that I had all combinations of CPU 0, 1, 2, 3 plugged/unpplugged (with the obvious exception of never having all CPUs unplugged). However, if I try this with your Tegra114 cpuidle and suspend patches applied, I see the following issues: 1) If I boot, unplug CPU 0, then replug CPU 0, the system immediately hard-hangs. 2) If I run the hotplug test script, leaving CPU 0 always present, I sometimes see: > root@localhost:~# for i in `seq 1 50`; do echo ITERATION $i; ./cpuonline.py; done > ITERATION 1 > echo 0 > /sys/devices/system/cpu/cpu2/online > [ 458.910054] CPU2: shutdown > echo 0 > /sys/devices/system/cpu/cpu1/online > [ 461.004371] CPU1: shutdown > echo 0 > /sys/devices/system/cpu/cpu3/online > [ 463.027341] CPU3: shutdown > echo 1 > /sys/devices/system/cpu/cpu1/online > [ 465.061412] CPU1: Booted secondary processor > echo 1 > /sys/devices/system/cpu/cpu2/online > [ 467.095313] CPU2: Booted secondary processor > [ 467.113243] ------------[ cut here ]------------ > [ 467.117948] WARNING: CPU: 2 PID: 0 at kernel/time/tick-broadcast.c:667 tick_broadcast_oneshot_control+0x19c/0x1c4() > [ 467.128352] Modules linked in: > [ 467.131455] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.11.0-rc1-00022-g7487363-dirty #49 > [ 467.139678] [<c0015620>] (unwind_backtrace+0x0/0xf8) from [<c001154c>] (show_stack+0x10/0x14) > [ 467.148228] [<c001154c>] (show_stack+0x10/0x14) from [<c05135a8>] (dump_stack+0x80/0xc4) > [ 467.156336] [<c05135a8>] (dump_stack+0x80/0xc4) from [<c0024590>] (warn_slowpath_common+0x64/0x88) > [ 467.165300] [<c0024590>] (warn_slowpath_common+0x64/0x88) from [<c00245d0>] (warn_slowpath_null+0x1c/0x24) > [ 467.174959] [<c00245d0>] (warn_slowpath_null+0x1c/0x24) from [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4) > [ 467.185659] [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4) from [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc) > [ 467.196538] [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc) from [<c034f348>] (cpuidle_idle_call+0x11c/0x168) > [ 467.206292] [<c034f348>] (cpuidle_idle_call+0x11c/0x168) from [<c000f134>] (arch_cpu_idle+0x8/0x38) > [ 467.215359] [<c000f134>] (arch_cpu_idle+0x8/0x38) from [<c0061038>] (cpu_startup_entry+0x60/0x134) > [ 467.224325] [<c0061038>] (cpu_startup_entry+0x60/0x134) from [<800083d8>] (0x800083d8) > [ 467.232227] ---[ end trace ea579be22a00e7fb ]--- > echo 0 > /sys/devices/system/cpu/cpu1/online > [ 469.126682] CPU1: shutdown I have found no solution for (1) (although I didn't look hard!). (2) can be solved with the following (at least 50 iterations of my test script worked with this patch applied): > diff --git a/arch/arm/mach-tegra/cpuidle-tegra114.c b/arch/arm/mach-tegra/cpuidle-tegra114.c > index 658b205..896408d 100644 > --- a/arch/arm/mach-tegra/cpuidle-tegra114.c > +++ b/arch/arm/mach-tegra/cpuidle-tegra114.c > @@ -66,8 +66,7 @@ static struct cpuidle_driver tegra_idle_driver = { > .exit_latency = 500, > .target_residency = 1000, > .power_usage = 0, > - .flags = CPUIDLE_FLAG_TIME_VALID | > - CPUIDLE_FLAG_TIMER_STOP, > + .flags = CPUIDLE_FLAG_TIME_VALID, > .name = "powered-down", > .desc = "CPU power gated", > }, Here's my test script for reference: #!/usr/bin/env python import multiprocessing import os import sys import time cpus = multiprocessing.cpu_count() if cpus == 4: socf = file('/sys/devices/soc0/soc_id') soc = socf.readline().strip() socf.close() if True: #soc == '48': gc = (11, 9, 1, 3, 7, 5, 13, 15) else: gc = (14, 10, 11, 9, 8, 1, 3, 2, 6, 7, 5, 4, 12, 13, 15) elif cpus == 2: gc = (1, 3) else: raise Exception("Invalid CPU count %d" % cpus) oldidx = len(gc) - 1 oldmask = gc[oldidx] for newidx in range(len(gc)): newmask = gc[newidx] for cpu in range(cpus): oldon = oldmask & (1 << cpu) newon = newmask & (1 << cpu) if oldon != newon: if newon: newonval = 1 else: newonval = 0 cmd = "echo %d > /sys/devices/system/cpu/cpu%d/online" \ % (newonval, cpu) print cmd os.system(cmd) time.sleep(2) oldidx = newidx oldmask = newmask
On 07/17/2013 10:31 PM, Stephen Warren wrote: > On 07/17/2013 04:15 AM, Joseph Lo wrote: >> On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote: >>> On 07/16/2013 05:17 AM, Joseph Lo wrote: >>>> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote: >>>>> On 06/25/2013 03:23 AM, Joseph Lo wrote: >>>>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework >>>>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering >>>>>> this state. > ... [ discussion of issues with Joesph's patches applied] >> >> OK. I did more stress tests last night and today. I found it cause by >> the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and >> only impact the Tegra20 platform. The hot plug regression seems due to >> this patch. After dropping this patch on top of v3.11-rc1, the Tegra20 >> can back to normal. >> >> And the hop plug and suspend stress test can pass on Tegra30/114 too. >> >> Can the other two patch series for Tegra114 to support CPU idle power >> down mode and system suspend still moving forward, not be blocked by >> this patch? >> >> Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for >> hot plug on Tegra20, I will continue to check this. You can just drop >> this patch. > > OK, if I drop that patch, then everything on Tegra20 and Tegra30 seems > fine again. > > However, I've found some new and exciting issue on Tegra114! > > With unmodified v3.11-rc1, I can do the following without issue: > > * Unplug/replug CPUs, so that I had all combinations of CPU 1, 2, 3 > plugged/unpplugged (with CPU 0 always plugged). > > * Unplug/replug CPUs, so that I had all combinations of CPU 0, 1, 2, 3 > plugged/unpplugged (with the obvious exception of never having all CPUs > unplugged). > > However, if I try this with your Tegra114 cpuidle and suspend patches > applied, I see the following issues: > > 1) If I boot, unplug CPU 0, then replug CPU 0, the system immediately > hard-hangs. > > 2) If I run the hotplug test script, leaving CPU 0 always present, I > sometimes see: > >> root@localhost:~# for i in `seq 1 50`; do echo ITERATION $i; ./cpuonline.py; done >> ITERATION 1 >> echo 0 > /sys/devices/system/cpu/cpu2/online >> [ 458.910054] CPU2: shutdown >> echo 0 > /sys/devices/system/cpu/cpu1/online >> [ 461.004371] CPU1: shutdown >> echo 0 > /sys/devices/system/cpu/cpu3/online >> [ 463.027341] CPU3: shutdown >> echo 1 > /sys/devices/system/cpu/cpu1/online >> [ 465.061412] CPU1: Booted secondary processor >> echo 1 > /sys/devices/system/cpu/cpu2/online >> [ 467.095313] CPU2: Booted secondary processor >> [ 467.113243] ------------[ cut here ]------------ >> [ 467.117948] WARNING: CPU: 2 PID: 0 at kernel/time/tick-broadcast.c:667 tick_broadcast_oneshot_control+0x19c/0x1c4() >> [ 467.128352] Modules linked in: >> [ 467.131455] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.11.0-rc1-00022-g7487363-dirty #49 >> [ 467.139678] [<c0015620>] (unwind_backtrace+0x0/0xf8) from [<c001154c>] (show_stack+0x10/0x14) >> [ 467.148228] [<c001154c>] (show_stack+0x10/0x14) from [<c05135a8>] (dump_stack+0x80/0xc4) >> [ 467.156336] [<c05135a8>] (dump_stack+0x80/0xc4) from [<c0024590>] (warn_slowpath_common+0x64/0x88) >> [ 467.165300] [<c0024590>] (warn_slowpath_common+0x64/0x88) from [<c00245d0>] (warn_slowpath_null+0x1c/0x24) >> [ 467.174959] [<c00245d0>] (warn_slowpath_null+0x1c/0x24) from [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4) >> [ 467.185659] [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4) from [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc) >> [ 467.196538] [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc) from [<c034f348>] (cpuidle_idle_call+0x11c/0x168) >> [ 467.206292] [<c034f348>] (cpuidle_idle_call+0x11c/0x168) from [<c000f134>] (arch_cpu_idle+0x8/0x38) >> [ 467.215359] [<c000f134>] (arch_cpu_idle+0x8/0x38) from [<c0061038>] (cpu_startup_entry+0x60/0x134) >> [ 467.224325] [<c0061038>] (cpu_startup_entry+0x60/0x134) from [<800083d8>] (0x800083d8) >> [ 467.232227] ---[ end trace ea579be22a00e7fb ]--- >> echo 0 > /sys/devices/system/cpu/cpu1/online >> [ 469.126682] CPU1: shutdown > > I have found no solution for (1) (although I didn't look hard!). > > (2) can be solved with the following (at least 50 iterations of my test > script worked with this patch applied): Actually this warning is resulting from a bug in the tick broadcast code and has been solved with commit: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/urgent&id=ea8deb8dfa6b0e8d1b3d1051585706739b46656c This patch has been merged in timers/urgent branch but still need to merged with timers/core. The patch below does not fix the warning but prevents the tick warning to occur. Applying the patch above will fix your problem. >> diff --git a/arch/arm/mach-tegra/cpuidle-tegra114.c b/arch/arm/mach-tegra/cpuidle-tegra114.c >> index 658b205..896408d 100644 >> --- a/arch/arm/mach-tegra/cpuidle-tegra114.c >> +++ b/arch/arm/mach-tegra/cpuidle-tegra114.c >> @@ -66,8 +66,7 @@ static struct cpuidle_driver tegra_idle_driver = { >> .exit_latency = 500, >> .target_residency = 1000, >> .power_usage = 0, >> - .flags = CPUIDLE_FLAG_TIME_VALID | >> - CPUIDLE_FLAG_TIMER_STOP, >> + .flags = CPUIDLE_FLAG_TIME_VALID, >> .name = "powered-down", >> .desc = "CPU power gated", >> }, > > Here's my test script for reference: > > #!/usr/bin/env python > > import multiprocessing > import os > import sys > import time > > cpus = multiprocessing.cpu_count() > if cpus == 4: > socf = file('/sys/devices/soc0/soc_id') > soc = socf.readline().strip() > socf.close() > if True: #soc == '48': > gc = (11, 9, 1, 3, 7, 5, 13, 15) > else: > gc = (14, 10, 11, 9, 8, 1, 3, 2, 6, 7, 5, 4, 12, 13, 15) > elif cpus == 2: > gc = (1, 3) > else: > raise Exception("Invalid CPU count %d" % cpus) > > oldidx = len(gc) - 1 > oldmask = gc[oldidx] > > for newidx in range(len(gc)): > newmask = gc[newidx] > for cpu in range(cpus): > oldon = oldmask & (1 << cpu) > newon = newmask & (1 << cpu) > if oldon != newon: > if newon: > newonval = 1 > else: > newonval = 0 > cmd = "echo %d > /sys/devices/system/cpu/cpu%d/online" \ > % (newonval, cpu) > print cmd > os.system(cmd) > time.sleep(2) > oldidx = newidx > oldmask = newmask >
On 07/17/2013 03:45 PM, Daniel Lezcano wrote: > On 07/17/2013 10:31 PM, Stephen Warren wrote: >> On 07/17/2013 04:15 AM, Joseph Lo wrote: >>> On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote: >>>> On 07/16/2013 05:17 AM, Joseph Lo wrote: >>>>> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote: >>>>>> On 06/25/2013 03:23 AM, Joseph Lo wrote: >>>>>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework >>>>>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering >>>>>>> this state. >> ... [ discussion of issues with Joesph's patches applied] >>> >>> OK. I did more stress tests last night and today. I found it cause by >>> the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and >>> only impact the Tegra20 platform. The hot plug regression seems due to >>> this patch. After dropping this patch on top of v3.11-rc1, the Tegra20 >>> can back to normal. >>> >>> And the hop plug and suspend stress test can pass on Tegra30/114 too. >>> >>> Can the other two patch series for Tegra114 to support CPU idle power >>> down mode and system suspend still moving forward, not be blocked by >>> this patch? >>> >>> Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for >>> hot plug on Tegra20, I will continue to check this. You can just drop >>> this patch. >> >> OK, if I drop that patch, then everything on Tegra20 and Tegra30 seems >> fine again. >> >> However, I've found some new and exciting issue on Tegra114! >> >> With unmodified v3.11-rc1, I can do the following without issue: >> >> * Unplug/replug CPUs, so that I had all combinations of CPU 1, 2, 3 >> plugged/unpplugged (with CPU 0 always plugged). >> >> * Unplug/replug CPUs, so that I had all combinations of CPU 0, 1, 2, 3 >> plugged/unpplugged (with the obvious exception of never having all CPUs >> unplugged). >> >> However, if I try this with your Tegra114 cpuidle and suspend patches >> applied, I see the following issues: >> >> 1) If I boot, unplug CPU 0, then replug CPU 0, the system immediately >> hard-hangs. >> >> 2) If I run the hotplug test script, leaving CPU 0 always present, I >> sometimes see: >> >>> root@localhost:~# for i in `seq 1 50`; do echo ITERATION $i; ./cpuonline.py; done >>> ITERATION 1 >>> echo 0 > /sys/devices/system/cpu/cpu2/online >>> [ 458.910054] CPU2: shutdown >>> echo 0 > /sys/devices/system/cpu/cpu1/online >>> [ 461.004371] CPU1: shutdown >>> echo 0 > /sys/devices/system/cpu/cpu3/online >>> [ 463.027341] CPU3: shutdown >>> echo 1 > /sys/devices/system/cpu/cpu1/online >>> [ 465.061412] CPU1: Booted secondary processor >>> echo 1 > /sys/devices/system/cpu/cpu2/online >>> [ 467.095313] CPU2: Booted secondary processor >>> [ 467.113243] ------------[ cut here ]------------ >>> [ 467.117948] WARNING: CPU: 2 PID: 0 at kernel/time/tick-broadcast.c:667 tick_broadcast_oneshot_control+0x19c/0x1c4() >>> [ 467.128352] Modules linked in: >>> [ 467.131455] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.11.0-rc1-00022-g7487363-dirty #49 >>> [ 467.139678] [<c0015620>] (unwind_backtrace+0x0/0xf8) from [<c001154c>] (show_stack+0x10/0x14) >>> [ 467.148228] [<c001154c>] (show_stack+0x10/0x14) from [<c05135a8>] (dump_stack+0x80/0xc4) >>> [ 467.156336] [<c05135a8>] (dump_stack+0x80/0xc4) from [<c0024590>] (warn_slowpath_common+0x64/0x88) >>> [ 467.165300] [<c0024590>] (warn_slowpath_common+0x64/0x88) from [<c00245d0>] (warn_slowpath_null+0x1c/0x24) >>> [ 467.174959] [<c00245d0>] (warn_slowpath_null+0x1c/0x24) from [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4) >>> [ 467.185659] [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4) from [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc) >>> [ 467.196538] [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc) from [<c034f348>] (cpuidle_idle_call+0x11c/0x168) >>> [ 467.206292] [<c034f348>] (cpuidle_idle_call+0x11c/0x168) from [<c000f134>] (arch_cpu_idle+0x8/0x38) >>> [ 467.215359] [<c000f134>] (arch_cpu_idle+0x8/0x38) from [<c0061038>] (cpu_startup_entry+0x60/0x134) >>> [ 467.224325] [<c0061038>] (cpu_startup_entry+0x60/0x134) from [<800083d8>] (0x800083d8) >>> [ 467.232227] ---[ end trace ea579be22a00e7fb ]--- >>> echo 0 > /sys/devices/system/cpu/cpu1/online >>> [ 469.126682] CPU1: shutdown >> >> I have found no solution for (1) (although I didn't look hard!). >> >> (2) can be solved with the following (at least 50 iterations of my test >> script worked with this patch applied): > > Actually this warning is resulting from a bug in the tick broadcast code > and has been solved with commit: > > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/urgent&id=ea8deb8dfa6b0e8d1b3d1051585706739b46656c > > This patch has been merged in timers/urgent branch but still need to > merged with timers/core. > > The patch below does not fix the warning but prevents the tick warning > to occur. Applying the patch above will fix your problem. Yes, I was imprecise when I said solve; I simply meant that making that change prevented me from seeing that issue any more. That patch is already in v3.11-rc1, and I was using that as a base when I applied Joseph's patches. So, it doesn't solve this issue.
diff --git a/arch/arm/mach-tegra/cpuidle-tegra20.c b/arch/arm/mach-tegra/cpuidle-tegra20.c index 706aa42..c7598fb 100644 --- a/arch/arm/mach-tegra/cpuidle-tegra20.c +++ b/arch/arm/mach-tegra/cpuidle-tegra20.c @@ -60,7 +60,8 @@ static struct cpuidle_driver tegra_idle_driver = { .target_residency = 10000, .power_usage = 0, .flags = CPUIDLE_FLAG_TIME_VALID | - CPUIDLE_FLAG_COUPLED, + CPUIDLE_FLAG_COUPLED | + CPUIDLE_FLAG_TIMER_STOP, .name = "powered-down", .desc = "CPU power gated", }, @@ -137,12 +138,8 @@ static bool tegra20_cpu_cluster_power_down(struct cpuidle_device *dev, if (tegra20_reset_cpu_1() || !tegra_cpu_rail_off_ready()) return false; - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); - tegra_idle_lp2_last(); - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); - if (cpu_online(1)) tegra20_wake_cpu1_from_reset(); @@ -154,14 +151,10 @@ static bool tegra20_idle_enter_lp2_cpu_1(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) { - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); - cpu_suspend(0, tegra20_sleep_cpu_secondary_finish); tegra20_cpu_clear_resettable(); - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); - return true; } #else diff --git a/arch/arm/mach-tegra/cpuidle-tegra30.c b/arch/arm/mach-tegra/cpuidle-tegra30.c index ed2a2a7..8ffd8d9 100644 --- a/arch/arm/mach-tegra/cpuidle-tegra30.c +++ b/arch/arm/mach-tegra/cpuidle-tegra30.c @@ -56,7 +56,8 @@ static struct cpuidle_driver tegra_idle_driver = { .exit_latency = 2000, .target_residency = 2200, .power_usage = 0, - .flags = CPUIDLE_FLAG_TIME_VALID, + .flags = CPUIDLE_FLAG_TIME_VALID | + CPUIDLE_FLAG_TIMER_STOP, .name = "powered-down", .desc = "CPU power gated", }, @@ -77,12 +78,8 @@ static bool tegra30_cpu_cluster_power_down(struct cpuidle_device *dev, return false; } - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); - tegra_idle_lp2_last(); - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); - return true; } @@ -91,14 +88,8 @@ static bool tegra30_cpu_core_power_down(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) { - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); - - smp_wmb(); - cpu_suspend(0, tegra30_sleep_cpu_secondary_finish); - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); - return true; } #else
Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering this state. Signed-off-by: Joseph Lo <josephl@nvidia.com> --- This patch depends on the patch "tick: Fix tick_broadcast_pending_mask not cleared". --- arch/arm/mach-tegra/cpuidle-tegra20.c | 11 ++--------- arch/arm/mach-tegra/cpuidle-tegra30.c | 13 ++----------- 2 files changed, 4 insertions(+), 20 deletions(-)