ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag

Message ID	1372152228-16199-1-git-send-email-josephl@nvidia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Joseph Lo <josephl@nvidia.com> To: Stephen Warren <swarren@wwwdotorg.org> Subject: [PATCH] ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag Date: Tue, 25 Jun 2013 17:23:48 +0800 Message-ID: <1372152228-16199-1-git-send-email-josephl@nvidia.com> MIME-Version: 1.0 Cc: linux-tegra@vger.kernel.org, linux-arm-kernel@lists.infradead.org, Joseph Lo <josephl@nvidia.com> Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Joseph Lo June 25, 2013, 9:23 a.m. UTC

Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
this state.

Signed-off-by: Joseph Lo <josephl@nvidia.com>
---
This patch depends on the patch "tick:  Fix tick_broadcast_pending_mask not cleared".
---
 arch/arm/mach-tegra/cpuidle-tegra20.c | 11 ++---------
 arch/arm/mach-tegra/cpuidle-tegra30.c | 13 ++-----------
 2 files changed, 4 insertions(+), 20 deletions(-)

Stephen Warren June 25, 2013, 3:12 p.m. UTC | #1

On 06/25/2013 03:23 AM, Joseph Lo wrote:
> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
> this state.

I assume this patch is only needed to support the other two series you
sent today; the Tegra114 cpuidle and system suspend series? In other
words, there's no need to apply this to 3.11 to fix a bug; it can wait
for 3.12 along with those other two series?

Joseph Lo June 26, 2013, 11:11 a.m. UTC | #2

On Tue, 2013-06-25 at 23:12 +0800, Stephen Warren wrote:
> On 06/25/2013 03:23 AM, Joseph Lo wrote:
> > Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
> > to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
> > this state.
> 
> I assume this patch is only needed to support the other two series you
> sent today; the Tegra114 cpuidle and system suspend series? In other
> words, there's no need to apply this to 3.11 to fix a bug; it can wait
> for 3.12 along with those other two series?

This patch is independent of the other two series, you can apply it
along.
Yes, all these series were for 3.12.

Thanks,
Joseph

Stephen Warren July 15, 2013, 6:04 p.m. UTC | #3

On 06/25/2013 03:23 AM, Joseph Lo wrote:
> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
> this state.

I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114:
cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM:
tegra114: add support for system suspend", all on top of v3.11-rc1.

On Dalmore, the new cpuidle mode /appears/ to work (I see increasing
values in the sysfs cpuidle "usage" file for all defined cpuidle
states), but I don't see the "CPU VDD off" LED light up; I'm not
convinced that the CPU is actually being powered off in the idle mode.

With these patches applies, Harmony acts very strangely. After
hot-unplug and re-plug of CPU1, the system is hung or almost hung. The
patches appear to reduce the amount of time CPU VDD is off judging by
the LEDs. The patches might cause issues with system suspend too, but
I'm not 100% sure.

As such, I haven't applied any of these. Can you please test boards for
all of Tegra20/30/114, and validate that on Dalmore the CPU power is
actually being turned off, and report back? Thanks.

Peter De Schrijver July 16, 2013, 10:19 a.m. UTC | #4

On Mon, Jul 15, 2013 at 08:04:44PM +0200, Stephen Warren wrote:
> On 06/25/2013 03:23 AM, Joseph Lo wrote:
> > Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
> > to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
> > this state.
> 
> I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114:
> cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM:
> tegra114: add support for system suspend", all on top of v3.11-rc1.
> 
> On Dalmore, the new cpuidle mode /appears/ to work (I see increasing
> values in the sysfs cpuidle "usage" file for all defined cpuidle
> states), but I don't see the "CPU VDD off" LED light up; I'm not
> convinced that the CPU is actually being powered off in the idle mode.

The current patchset only powergates the individual cores on Tegra114. The
entire cluster (including L2 cache etc) isn't railgated as a CPUidle state.
Hence the CPU VDD off LED will never be on.

Cheers,

Peter.

Joseph Lo July 16, 2013, 11:17 a.m. UTC | #5

On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote:
> On 06/25/2013 03:23 AM, Joseph Lo wrote:
> > Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
> > to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
> > this state.
> 
> I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114:
> cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM:
> tegra114: add support for system suspend", all on top of v3.11-rc1.
> 
> On Dalmore, the new cpuidle mode /appears/ to work (I see increasing
> values in the sysfs cpuidle "usage" file for all defined cpuidle
> states), but I don't see the "CPU VDD off" LED light up; I'm not
> convinced that the CPU is actually being powered off in the idle mode.
> 
The LED of the CPU Vdd indicates the power of CPU cluster. But the
series of CPU idle power down mode support for Tegra114 only supports
per core power down. It did not support cluster power down yet in idle
mode. That needs some extra work to support cluster power down in idle
mode.
There are some items that I plat to do:
1. integrate the MCPM (multi cluster power management) code into Tegra
CPU PM code
2. integrate coupled CPUidle framework into Tegra CPU idle driver
3. normalize the Tegra CPU idle driver to single one

> With these patches applies, Harmony acts very strangely. After
> hot-unplug and re-plug of CPU1, the system is hung or almost hung. The
> patches appear to reduce the amount of time CPU VDD is off judging by
> the LEDs. The patches might cause issues with system suspend too, but
> I'm not 100% sure.
> 
Actually I didn't add anything about hotplug or something can increase
the loading or make regressions. But I think you are testing with USB
device attached, right? That might cause some extra loading. You can
give it a try after just removing USB device. And I double confirm that
on Seaboard. Except that, the test result looks OK for me.

> As such, I haven't applied any of these. Can you please test boards for
> all of Tegra20/30/114, and validate that on Dalmore the CPU power is
> actually being turned off, and report back? Thanks.

I have tested all these patches based on next-20130716. And verified the
functions on Seaboard, Cardhu and Dalmore. Looks good for me. And the
per core power down in idle is OK too. The cluster power down only
support when the system goes into suspend mode right now for Tegra114.

Thanks,
Joseph

Daniel Lezcano July 16, 2013, 12:11 p.m. UTC | #6

On 07/16/2013 01:17 PM, Joseph Lo wrote:
> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote:
>> On 06/25/2013 03:23 AM, Joseph Lo wrote:
>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
>>> this state.
>>
>> I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114:
>> cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM:
>> tegra114: add support for system suspend", all on top of v3.11-rc1.
>>
>> On Dalmore, the new cpuidle mode /appears/ to work (I see increasing
>> values in the sysfs cpuidle "usage" file for all defined cpuidle
>> states), but I don't see the "CPU VDD off" LED light up; I'm not
>> convinced that the CPU is actually being powered off in the idle mode.
>>
> The LED of the CPU Vdd indicates the power of CPU cluster. But the
> series of CPU idle power down mode support for Tegra114 only supports
> per core power down. It did not support cluster power down yet in idle
> mode. That needs some extra work to support cluster power down in idle
> mode.
> There are some items that I plat to do:
> 1. integrate the MCPM (multi cluster power management) code into Tegra
> CPU PM code

+1

> 2. integrate coupled CPUidle framework into Tegra CPU idle driver

If you are using the MCPM, I suggest you rely on the MCPM to replace the
coupled idle state by it. The code will look much more consistent.

> 3. normalize the Tegra CPU idle driver to single one

I am working on making a single ARM driver, so may be instead of doing a
single tegra driver, we can sync up to move the code forward to the same
scheme than the others driver, then consolidating the tegra drivers
would be easier and consistent.

That would be very nice, if you can separate in the code the PM arch
specific blocks and encapsulate the driver specific code, so no more
include from <mach/...> are needed. That will allow to move the drivers
to drivers/cpuidle directory.

>> With these patches applies, Harmony acts very strangely. After
>> hot-unplug and re-plug of CPU1, the system is hung or almost hung. The
>> patches appear to reduce the amount of time CPU VDD is off judging by
>> the LEDs. The patches might cause issues with system suspend too, but
>> I'm not 100% sure.
>>
> Actually I didn't add anything about hotplug or something can increase
> the loading or make regressions. But I think you are testing with USB
> device attached, right? That might cause some extra loading. You can
> give it a try after just removing USB device. And I double confirm that
> on Seaboard. Except that, the test result looks OK for me.
> 
>> As such, I haven't applied any of these. Can you please test boards for
>> all of Tegra20/30/114, and validate that on Dalmore the CPU power is
>> actually being turned off, and report back? Thanks.
> 
> I have tested all these patches based on next-20130716. And verified the
> functions on Seaboard, Cardhu and Dalmore. Looks good for me. And the
> per core power down in idle is OK too. The cluster power down only
> support when the system goes into suspend mode right now for Tegra114.
> 
> Thanks,
> Joseph
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>

Stephen Warren July 16, 2013, 7:51 p.m. UTC | #7

On 07/16/2013 05:17 AM, Joseph Lo wrote:
> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote:
>> On 06/25/2013 03:23 AM, Joseph Lo wrote:
>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
>>> this state.
>>
>> I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114:
>> cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM:
>> tegra114: add support for system suspend", all on top of v3.11-rc1.
>>
>> On Dalmore, the new cpuidle mode /appears/ to work (I see increasing
>> values in the sysfs cpuidle "usage" file for all defined cpuidle
>> states), but I don't see the "CPU VDD off" LED light up; I'm not
>> convinced that the CPU is actually being powered off in the idle mode.
>
> The LED of the CPU Vdd indicates the power of CPU cluster. But the
> series of CPU idle power down mode support for Tegra114 only supports
> per core power down.

OK, that's fine then.

>> With these patches applies, Harmony acts very strangely. After
>> hot-unplug and re-plug of CPU1, the system is hung or almost hung. The
>> patches appear to reduce the amount of time CPU VDD is off judging by
>> the LEDs. The patches might cause issues with system suspend too, but
>> I'm not 100% sure.
>
> Actually I didn't add anything about hotplug

Well, the patches do touch the CPU reset vector and related code. That's
shared with hotplug, right? But anyway, it turns out the CPU hotplug is
not related to the issue; the system acts strangely even without/before
performing any hotplug. I just hadn't noticed that before.

> or something can increase
> the loading or make regressions. But I think you are testing with USB
> device attached, right? That might cause some extra loading. You can
> give it a try after just removing USB device. And I double confirm that
> on Seaboard. Except that, the test result looks OK for me.

Yes, having a USB device plugged in does affect the issue. In summary:

next-20130716, with or without USB devices plugged in: Works fine.

next-20130716 with your patches, without USB device plugged in:
Occasional short problems (detailed below).

next-20130716 with your patches, with USB device plugged in: Continual
long problems (detailed below).

The testing I did was to log in over the serial console, hit <enter>
around five times, and watch the shell prompt echo back, then type
"ls<enter" and watch the (directory listing appear (I have about 20
files in my directory, with long-ish names), then repeat. With plain
next-20130716, this is nice and fast and there are no pauses. With your
patches applied, there are occasional or continual pauses, which last
for either a short time (tenths of a second), or even a second or two,
both depending on whether a USB device is plugged in or not.

I believe this is the same issue I saw when I applied your patches to
the Tegra tree on top of v3.11-rc1.

So, either your patch causes a bug, or exposes some pre-existing bug.
Either way, I can't apply them because they cause a regression.

I wonder if this is at all related to NVIDIA bug 1286857 "Upstreaming
task: Enabling LP2 causes slow serial console"? I tried the WAR there:

echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state1/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state1/disable

... and the system locked up solid very soon after; I had just enough
time to type "ls<enter>", then it locked up. Again, performing those
same echos on next-20130716 without your patches works fine.

Joseph Lo July 17, 2013, 6:19 a.m. UTC | #8

On Tue, 2013-07-16 at 20:11 +0800, Daniel Lezcano wrote:
> On 07/16/2013 01:17 PM, Joseph Lo wrote:
> > On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote:
> >> On 06/25/2013 03:23 AM, Joseph Lo wrote:
> >>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
> >>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
> >>> this state.
> >>
> >> I tried applying this patch, your series "[PATCH V3 0/3] ARM: tegra114:
> >> cpuidle: add power down state", and your series "[PATCH V2 00/11] ARM:
> >> tegra114: add support for system suspend", all on top of v3.11-rc1.
> >>
> >> On Dalmore, the new cpuidle mode /appears/ to work (I see increasing
> >> values in the sysfs cpuidle "usage" file for all defined cpuidle
> >> states), but I don't see the "CPU VDD off" LED light up; I'm not
> >> convinced that the CPU is actually being powered off in the idle mode.
> >>
> > The LED of the CPU Vdd indicates the power of CPU cluster. But the
> > series of CPU idle power down mode support for Tegra114 only supports
> > per core power down. It did not support cluster power down yet in idle
> > mode. That needs some extra work to support cluster power down in idle
> > mode.
> > There are some items that I plat to do:
> > 1. integrate the MCPM (multi cluster power management) code into Tegra
> > CPU PM code
> 
> +1
> 
> > 2. integrate coupled CPUidle framework into Tegra CPU idle driver
> 
> If you are using the MCPM, I suggest you rely on the MCPM to replace the
> coupled idle state by it. The code will look much more consistent.
> 
Yes, true. Thanks for your advise.

Joseph

Joseph Lo July 17, 2013, 10:15 a.m. UTC | #9

On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote:
> On 07/16/2013 05:17 AM, Joseph Lo wrote:
> > On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote:
> >> On 06/25/2013 03:23 AM, Joseph Lo wrote:
> >>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
> >>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
> >>> this state.
> > or something can increase
> > the loading or make regressions. But I think you are testing with USB
> > device attached, right? That might cause some extra loading. You can
> > give it a try after just removing USB device. And I double confirm that
> > on Seaboard. Except that, the test result looks OK for me.
> 
> Yes, having a USB device plugged in does affect the issue. In summary:
> 
> next-20130716, with or without USB devices plugged in: Works fine.
> 
> next-20130716 with your patches, without USB device plugged in:
> Occasional short problems (detailed below).
> 
> next-20130716 with your patches, with USB device plugged in: Continual
> long problems (detailed below).
> 
> The testing I did was to log in over the serial console, hit <enter>
> around five times, and watch the shell prompt echo back, then type
> "ls<enter" and watch the (directory listing appear (I have about 20
> files in my directory, with long-ish names), then repeat. With plain
> next-20130716, this is nice and fast and there are no pauses. With your
> patches applied, there are occasional or continual pauses, which last
> for either a short time (tenths of a second), or even a second or two,
> both depending on whether a USB device is plugged in or not.
> 
> I believe this is the same issue I saw when I applied your patches to
> the Tegra tree on top of v3.11-rc1.
> 
OK. I did more stress tests last night and today. I found it cause by
the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and
only impact the Tegra20 platform. The hot plug regression seems due to
this patch. After dropping this patch on top of v3.11-rc1, the Tegra20
can back to normal.

And the hop plug and suspend stress test can pass on Tegra30/114 too.

Can the other two patch series for Tegra114 to support CPU idle power
down mode and system suspend still moving forward, not be blocked by
this patch?

Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for
hot plug on Tegra20, I will continue to check this. You can just drop
this patch.

Thanks,
Joseph

Daniel Lezcano July 17, 2013, 10:21 a.m. UTC | #10

On 07/17/2013 12:15 PM, Joseph Lo wrote:
> On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote:
>> On 07/16/2013 05:17 AM, Joseph Lo wrote:
>>> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote:
>>>> On 06/25/2013 03:23 AM, Joseph Lo wrote:
>>>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
>>>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
>>>>> this state.
>>> or something can increase
>>> the loading or make regressions. But I think you are testing with USB
>>> device attached, right? That might cause some extra loading. You can
>>> give it a try after just removing USB device. And I double confirm that
>>> on Seaboard. Except that, the test result looks OK for me.
>>
>> Yes, having a USB device plugged in does affect the issue. In summary:
>>
>> next-20130716, with or without USB devices plugged in: Works fine.
>>
>> next-20130716 with your patches, without USB device plugged in:
>> Occasional short problems (detailed below).
>>
>> next-20130716 with your patches, with USB device plugged in: Continual
>> long problems (detailed below).
>>
>> The testing I did was to log in over the serial console, hit <enter>
>> around five times, and watch the shell prompt echo back, then type
>> "ls<enter" and watch the (directory listing appear (I have about 20
>> files in my directory, with long-ish names), then repeat. With plain
>> next-20130716, this is nice and fast and there are no pauses. With your
>> patches applied, there are occasional or continual pauses, which last
>> for either a short time (tenths of a second), or even a second or two,
>> both depending on whether a USB device is plugged in or not.
>>
>> I believe this is the same issue I saw when I applied your patches to
>> the Tegra tree on top of v3.11-rc1.
>>
> OK. I did more stress tests last night and today. I found it cause by
> the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and
> only impact the Tegra20 platform. The hot plug regression seems due to
> this patch. After dropping this patch on top of v3.11-rc1, the Tegra20
> can back to normal.
> 
> And the hop plug and suspend stress test can pass on Tegra30/114 too.
> 
> Can the other two patch series for Tegra114 to support CPU idle power
> down mode and system suspend still moving forward, not be blocked by
> this patch?
> 
> Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for
> hot plug on Tegra20, I will continue to check this. You can just drop
> this patch.

Please in the future Cc me when problems occur with some patches I
submitted. That would be easier for me to track the issues and fix them
when they happen.

Thanks
  -- Daniel

Joseph Lo July 17, 2013, 10:29 a.m. UTC | #11

On Wed, 2013-07-17 at 18:21 +0800, Daniel Lezcano wrote:
> On 07/17/2013 12:15 PM, Joseph Lo wrote:
> > On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote:
> >> On 07/16/2013 05:17 AM, Joseph Lo wrote:
> >>> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote:
> >>>> On 06/25/2013 03:23 AM, Joseph Lo wrote:
> >>>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
> >>>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
> >>>>> this state.
> >>> or something can increase
> >>> the loading or make regressions. But I think you are testing with USB
> >>> device attached, right? That might cause some extra loading. You can
> >>> give it a try after just removing USB device. And I double confirm that
> >>> on Seaboard. Except that, the test result looks OK for me.
> >>
> >> Yes, having a USB device plugged in does affect the issue. In summary:
> >>
> >> next-20130716, with or without USB devices plugged in: Works fine.
> >>
> >> next-20130716 with your patches, without USB device plugged in:
> >> Occasional short problems (detailed below).
> >>
> >> next-20130716 with your patches, with USB device plugged in: Continual
> >> long problems (detailed below).
> >>
> >> The testing I did was to log in over the serial console, hit <enter>
> >> around five times, and watch the shell prompt echo back, then type
> >> "ls<enter" and watch the (directory listing appear (I have about 20
> >> files in my directory, with long-ish names), then repeat. With plain
> >> next-20130716, this is nice and fast and there are no pauses. With your
> >> patches applied, there are occasional or continual pauses, which last
> >> for either a short time (tenths of a second), or even a second or two,
> >> both depending on whether a USB device is plugged in or not.
> >>
> >> I believe this is the same issue I saw when I applied your patches to
> >> the Tegra tree on top of v3.11-rc1.
> >>
> > OK. I did more stress tests last night and today. I found it cause by
> > the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and
> > only impact the Tegra20 platform. The hot plug regression seems due to
> > this patch. After dropping this patch on top of v3.11-rc1, the Tegra20
> > can back to normal.
> > 
> > And the hop plug and suspend stress test can pass on Tegra30/114 too.
> > 
> > Can the other two patch series for Tegra114 to support CPU idle power
> > down mode and system suspend still moving forward, not be blocked by
> > this patch?
> > 
> > Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for
> > hot plug on Tegra20, I will continue to check this. You can just drop
> > this patch.
> 
> Please in the future Cc me when problems occur with some patches I
> submitted. That would be easier for me to track the issues and fix them
> when they happen.
> 
Ah, sorry for that. I should always Cc you. Will do next time. Thanks
for take care.

Joseph

Stephen Warren July 17, 2013, 8:31 p.m. UTC | #12

On 07/17/2013 04:15 AM, Joseph Lo wrote:
> On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote:
>> On 07/16/2013 05:17 AM, Joseph Lo wrote:
>>> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote:
>>>> On 06/25/2013 03:23 AM, Joseph Lo wrote:
>>>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
>>>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
>>>>> this state.
... [ discussion of issues with Joesph's patches applied]
>
> OK. I did more stress tests last night and today. I found it cause by
> the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and
> only impact the Tegra20 platform. The hot plug regression seems due to
> this patch. After dropping this patch on top of v3.11-rc1, the Tegra20
> can back to normal.
> 
> And the hop plug and suspend stress test can pass on Tegra30/114 too.
> 
> Can the other two patch series for Tegra114 to support CPU idle power
> down mode and system suspend still moving forward, not be blocked by
> this patch?
> 
> Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for
> hot plug on Tegra20, I will continue to check this. You can just drop
> this patch.

OK, if I drop that patch, then everything on Tegra20 and Tegra30 seems
fine again.

However, I've found some new and exciting issue on Tegra114!

With unmodified v3.11-rc1, I can do the following without issue:

* Unplug/replug CPUs, so that I had all combinations of CPU 1, 2, 3
plugged/unpplugged (with CPU 0 always plugged).

* Unplug/replug CPUs, so that I had all combinations of CPU 0, 1, 2, 3
plugged/unpplugged (with the obvious exception of never having all CPUs
unplugged).

However, if I try this with your Tegra114 cpuidle and suspend patches
applied, I see the following issues:

1) If I boot, unplug CPU 0, then replug CPU 0, the system immediately
hard-hangs.

2) If I run the hotplug test script, leaving CPU 0 always present, I
sometimes see:

> root@localhost:~# for i in `seq 1 50`; do echo ITERATION $i; ./cpuonline.py; done
> ITERATION 1
> echo 0 > /sys/devices/system/cpu/cpu2/online
> [  458.910054] CPU2: shutdown
> echo 0 > /sys/devices/system/cpu/cpu1/online
> [  461.004371] CPU1: shutdown
> echo 0 > /sys/devices/system/cpu/cpu3/online
> [  463.027341] CPU3: shutdown
> echo 1 > /sys/devices/system/cpu/cpu1/online
> [  465.061412] CPU1: Booted secondary processor
> echo 1 > /sys/devices/system/cpu/cpu2/online
> [  467.095313] CPU2: Booted secondary processor
> [  467.113243] ------------[ cut here ]------------
> [  467.117948] WARNING: CPU: 2 PID: 0 at kernel/time/tick-broadcast.c:667 tick_broadcast_oneshot_control+0x19c/0x1c4()
> [  467.128352] Modules linked in:
> [  467.131455] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.11.0-rc1-00022-g7487363-dirty #49
> [  467.139678] [<c0015620>] (unwind_backtrace+0x0/0xf8) from [<c001154c>] (show_stack+0x10/0x14)
> [  467.148228] [<c001154c>] (show_stack+0x10/0x14) from [<c05135a8>] (dump_stack+0x80/0xc4)
> [  467.156336] [<c05135a8>] (dump_stack+0x80/0xc4) from [<c0024590>] (warn_slowpath_common+0x64/0x88)
> [  467.165300] [<c0024590>] (warn_slowpath_common+0x64/0x88) from [<c00245d0>] (warn_slowpath_null+0x1c/0x24)
> [  467.174959] [<c00245d0>] (warn_slowpath_null+0x1c/0x24) from [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4)
> [  467.185659] [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4) from [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc)
> [  467.196538] [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc) from [<c034f348>] (cpuidle_idle_call+0x11c/0x168)
> [  467.206292] [<c034f348>] (cpuidle_idle_call+0x11c/0x168) from [<c000f134>] (arch_cpu_idle+0x8/0x38)
> [  467.215359] [<c000f134>] (arch_cpu_idle+0x8/0x38) from [<c0061038>] (cpu_startup_entry+0x60/0x134)
> [  467.224325] [<c0061038>] (cpu_startup_entry+0x60/0x134) from [<800083d8>] (0x800083d8)
> [  467.232227] ---[ end trace ea579be22a00e7fb ]---
> echo 0 > /sys/devices/system/cpu/cpu1/online
> [  469.126682] CPU1: shutdown

I have found no solution for (1) (although I didn't look hard!).

(2) can be solved with the following (at least 50 iterations of my test
script worked with this patch applied):

> diff --git a/arch/arm/mach-tegra/cpuidle-tegra114.c b/arch/arm/mach-tegra/cpuidle-tegra114.c
> index 658b205..896408d 100644
> --- a/arch/arm/mach-tegra/cpuidle-tegra114.c
> +++ b/arch/arm/mach-tegra/cpuidle-tegra114.c
> @@ -66,8 +66,7 @@ static struct cpuidle_driver tegra_idle_driver = {
>                         .exit_latency           = 500,
>                         .target_residency       = 1000,
>                         .power_usage            = 0,
> -                       .flags                  = CPUIDLE_FLAG_TIME_VALID |
> -                                                 CPUIDLE_FLAG_TIMER_STOP,
> +                       .flags                  = CPUIDLE_FLAG_TIME_VALID,
>                         .name                   = "powered-down",
>                         .desc                   = "CPU power gated",
>                 },

Here's my test script for reference:

#!/usr/bin/env python

import multiprocessing
import os
import sys
import time

cpus = multiprocessing.cpu_count()
if cpus == 4:
  socf = file('/sys/devices/soc0/soc_id')
  soc = socf.readline().strip()
  socf.close()
  if True: #soc == '48':
    gc = (11, 9, 1, 3, 7, 5, 13, 15)
  else:
    gc = (14, 10, 11, 9, 8, 1, 3, 2, 6, 7, 5, 4, 12, 13, 15)
elif cpus == 2:
  gc = (1, 3)
else:
  raise Exception("Invalid CPU count %d" % cpus)

oldidx = len(gc) - 1
oldmask = gc[oldidx]

for newidx in range(len(gc)):
  newmask = gc[newidx]
  for cpu in range(cpus):
    oldon = oldmask & (1 << cpu)
    newon = newmask & (1 << cpu)
    if oldon != newon:
      if newon:
        newonval = 1
      else:
        newonval = 0
      cmd = "echo %d > /sys/devices/system/cpu/cpu%d/online" \
% (newonval, cpu)
      print cmd
      os.system(cmd)
  time.sleep(2)
  oldidx = newidx
  oldmask = newmask

Daniel Lezcano July 17, 2013, 9:45 p.m. UTC | #13

On 07/17/2013 10:31 PM, Stephen Warren wrote:
> On 07/17/2013 04:15 AM, Joseph Lo wrote:
>> On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote:
>>> On 07/16/2013 05:17 AM, Joseph Lo wrote:
>>>> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote:
>>>>> On 06/25/2013 03:23 AM, Joseph Lo wrote:
>>>>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
>>>>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
>>>>>> this state.
> ... [ discussion of issues with Joesph's patches applied]
>>
>> OK. I did more stress tests last night and today. I found it cause by
>> the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and
>> only impact the Tegra20 platform. The hot plug regression seems due to
>> this patch. After dropping this patch on top of v3.11-rc1, the Tegra20
>> can back to normal.
>>
>> And the hop plug and suspend stress test can pass on Tegra30/114 too.
>>
>> Can the other two patch series for Tegra114 to support CPU idle power
>> down mode and system suspend still moving forward, not be blocked by
>> this patch?
>>
>> Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for
>> hot plug on Tegra20, I will continue to check this. You can just drop
>> this patch.
> 
> OK, if I drop that patch, then everything on Tegra20 and Tegra30 seems
> fine again.
> 
> However, I've found some new and exciting issue on Tegra114!
> 
> With unmodified v3.11-rc1, I can do the following without issue:
> 
> * Unplug/replug CPUs, so that I had all combinations of CPU 1, 2, 3
> plugged/unpplugged (with CPU 0 always plugged).
> 
> * Unplug/replug CPUs, so that I had all combinations of CPU 0, 1, 2, 3
> plugged/unpplugged (with the obvious exception of never having all CPUs
> unplugged).
> 
> However, if I try this with your Tegra114 cpuidle and suspend patches
> applied, I see the following issues:
> 
> 1) If I boot, unplug CPU 0, then replug CPU 0, the system immediately
> hard-hangs.
> 
> 2) If I run the hotplug test script, leaving CPU 0 always present, I
> sometimes see:
> 
>> root@localhost:~# for i in `seq 1 50`; do echo ITERATION $i; ./cpuonline.py; done
>> ITERATION 1
>> echo 0 > /sys/devices/system/cpu/cpu2/online
>> [  458.910054] CPU2: shutdown
>> echo 0 > /sys/devices/system/cpu/cpu1/online
>> [  461.004371] CPU1: shutdown
>> echo 0 > /sys/devices/system/cpu/cpu3/online
>> [  463.027341] CPU3: shutdown
>> echo 1 > /sys/devices/system/cpu/cpu1/online
>> [  465.061412] CPU1: Booted secondary processor
>> echo 1 > /sys/devices/system/cpu/cpu2/online
>> [  467.095313] CPU2: Booted secondary processor
>> [  467.113243] ------------[ cut here ]------------
>> [  467.117948] WARNING: CPU: 2 PID: 0 at kernel/time/tick-broadcast.c:667 tick_broadcast_oneshot_control+0x19c/0x1c4()
>> [  467.128352] Modules linked in:
>> [  467.131455] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.11.0-rc1-00022-g7487363-dirty #49
>> [  467.139678] [<c0015620>] (unwind_backtrace+0x0/0xf8) from [<c001154c>] (show_stack+0x10/0x14)
>> [  467.148228] [<c001154c>] (show_stack+0x10/0x14) from [<c05135a8>] (dump_stack+0x80/0xc4)
>> [  467.156336] [<c05135a8>] (dump_stack+0x80/0xc4) from [<c0024590>] (warn_slowpath_common+0x64/0x88)
>> [  467.165300] [<c0024590>] (warn_slowpath_common+0x64/0x88) from [<c00245d0>] (warn_slowpath_null+0x1c/0x24)
>> [  467.174959] [<c00245d0>] (warn_slowpath_null+0x1c/0x24) from [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4)
>> [  467.185659] [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4) from [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc)
>> [  467.196538] [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc) from [<c034f348>] (cpuidle_idle_call+0x11c/0x168)
>> [  467.206292] [<c034f348>] (cpuidle_idle_call+0x11c/0x168) from [<c000f134>] (arch_cpu_idle+0x8/0x38)
>> [  467.215359] [<c000f134>] (arch_cpu_idle+0x8/0x38) from [<c0061038>] (cpu_startup_entry+0x60/0x134)
>> [  467.224325] [<c0061038>] (cpu_startup_entry+0x60/0x134) from [<800083d8>] (0x800083d8)
>> [  467.232227] ---[ end trace ea579be22a00e7fb ]---
>> echo 0 > /sys/devices/system/cpu/cpu1/online
>> [  469.126682] CPU1: shutdown
> 
> I have found no solution for (1) (although I didn't look hard!).
> 
> (2) can be solved with the following (at least 50 iterations of my test
> script worked with this patch applied):

Actually this warning is resulting from a bug in the tick broadcast code
and has been solved with commit:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/urgent&id=ea8deb8dfa6b0e8d1b3d1051585706739b46656c

This patch has been merged in timers/urgent branch but still need to
merged with timers/core.

The patch below does not fix the warning but prevents the tick warning
to occur. Applying the patch above will fix your problem.


>> diff --git a/arch/arm/mach-tegra/cpuidle-tegra114.c b/arch/arm/mach-tegra/cpuidle-tegra114.c
>> index 658b205..896408d 100644
>> --- a/arch/arm/mach-tegra/cpuidle-tegra114.c
>> +++ b/arch/arm/mach-tegra/cpuidle-tegra114.c
>> @@ -66,8 +66,7 @@ static struct cpuidle_driver tegra_idle_driver = {
>>                         .exit_latency           = 500,
>>                         .target_residency       = 1000,
>>                         .power_usage            = 0,
>> -                       .flags                  = CPUIDLE_FLAG_TIME_VALID |
>> -                                                 CPUIDLE_FLAG_TIMER_STOP,
>> +                       .flags                  = CPUIDLE_FLAG_TIME_VALID,
>>                         .name                   = "powered-down",
>>                         .desc                   = "CPU power gated",
>>                 },
> 
> Here's my test script for reference:
> 
> #!/usr/bin/env python
> 
> import multiprocessing
> import os
> import sys
> import time
> 
> cpus = multiprocessing.cpu_count()
> if cpus == 4:
>   socf = file('/sys/devices/soc0/soc_id')
>   soc = socf.readline().strip()
>   socf.close()
>   if True: #soc == '48':
>     gc = (11, 9, 1, 3, 7, 5, 13, 15)
>   else:
>     gc = (14, 10, 11, 9, 8, 1, 3, 2, 6, 7, 5, 4, 12, 13, 15)
> elif cpus == 2:
>   gc = (1, 3)
> else:
>   raise Exception("Invalid CPU count %d" % cpus)
> 
> oldidx = len(gc) - 1
> oldmask = gc[oldidx]
> 
> for newidx in range(len(gc)):
>   newmask = gc[newidx]
>   for cpu in range(cpus):
>     oldon = oldmask & (1 << cpu)
>     newon = newmask & (1 << cpu)
>     if oldon != newon:
>       if newon:
>         newonval = 1
>       else:
>         newonval = 0
>       cmd = "echo %d > /sys/devices/system/cpu/cpu%d/online" \
> % (newonval, cpu)
>       print cmd
>       os.system(cmd)
>   time.sleep(2)
>   oldidx = newidx
>   oldmask = newmask
>

Stephen Warren July 17, 2013, 10:01 p.m. UTC | #14

On 07/17/2013 03:45 PM, Daniel Lezcano wrote:
> On 07/17/2013 10:31 PM, Stephen Warren wrote:
>> On 07/17/2013 04:15 AM, Joseph Lo wrote:
>>> On Wed, 2013-07-17 at 03:51 +0800, Stephen Warren wrote:
>>>> On 07/16/2013 05:17 AM, Joseph Lo wrote:
>>>>> On Tue, 2013-07-16 at 02:04 +0800, Stephen Warren wrote:
>>>>>> On 06/25/2013 03:23 AM, Joseph Lo wrote:
>>>>>>> Use the CPUIDLE_FLAG_TIMER_STOP and let the cpuidle framework
>>>>>>> to handle the CLOCK_EVT_NOTIFY_BROADCAST_ENTER/EXIT when entering
>>>>>>> this state.
>> ... [ discussion of issues with Joesph's patches applied]
>>>
>>> OK. I did more stress tests last night and today. I found it cause by
>>> the patch "ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag" and
>>> only impact the Tegra20 platform. The hot plug regression seems due to
>>> this patch. After dropping this patch on top of v3.11-rc1, the Tegra20
>>> can back to normal.
>>>
>>> And the hop plug and suspend stress test can pass on Tegra30/114 too.
>>>
>>> Can the other two patch series for Tegra114 to support CPU idle power
>>> down mode and system suspend still moving forward, not be blocked by
>>> this patch?
>>>
>>> Looks the CPUIDLE_FLAG_TIMER_STOP flag still cause some other issue for
>>> hot plug on Tegra20, I will continue to check this. You can just drop
>>> this patch.
>>
>> OK, if I drop that patch, then everything on Tegra20 and Tegra30 seems
>> fine again.
>>
>> However, I've found some new and exciting issue on Tegra114!
>>
>> With unmodified v3.11-rc1, I can do the following without issue:
>>
>> * Unplug/replug CPUs, so that I had all combinations of CPU 1, 2, 3
>> plugged/unpplugged (with CPU 0 always plugged).
>>
>> * Unplug/replug CPUs, so that I had all combinations of CPU 0, 1, 2, 3
>> plugged/unpplugged (with the obvious exception of never having all CPUs
>> unplugged).
>>
>> However, if I try this with your Tegra114 cpuidle and suspend patches
>> applied, I see the following issues:
>>
>> 1) If I boot, unplug CPU 0, then replug CPU 0, the system immediately
>> hard-hangs.
>>
>> 2) If I run the hotplug test script, leaving CPU 0 always present, I
>> sometimes see:
>>
>>> root@localhost:~# for i in `seq 1 50`; do echo ITERATION $i; ./cpuonline.py; done
>>> ITERATION 1
>>> echo 0 > /sys/devices/system/cpu/cpu2/online
>>> [  458.910054] CPU2: shutdown
>>> echo 0 > /sys/devices/system/cpu/cpu1/online
>>> [  461.004371] CPU1: shutdown
>>> echo 0 > /sys/devices/system/cpu/cpu3/online
>>> [  463.027341] CPU3: shutdown
>>> echo 1 > /sys/devices/system/cpu/cpu1/online
>>> [  465.061412] CPU1: Booted secondary processor
>>> echo 1 > /sys/devices/system/cpu/cpu2/online
>>> [  467.095313] CPU2: Booted secondary processor
>>> [  467.113243] ------------[ cut here ]------------
>>> [  467.117948] WARNING: CPU: 2 PID: 0 at kernel/time/tick-broadcast.c:667 tick_broadcast_oneshot_control+0x19c/0x1c4()
>>> [  467.128352] Modules linked in:
>>> [  467.131455] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.11.0-rc1-00022-g7487363-dirty #49
>>> [  467.139678] [<c0015620>] (unwind_backtrace+0x0/0xf8) from [<c001154c>] (show_stack+0x10/0x14)
>>> [  467.148228] [<c001154c>] (show_stack+0x10/0x14) from [<c05135a8>] (dump_stack+0x80/0xc4)
>>> [  467.156336] [<c05135a8>] (dump_stack+0x80/0xc4) from [<c0024590>] (warn_slowpath_common+0x64/0x88)
>>> [  467.165300] [<c0024590>] (warn_slowpath_common+0x64/0x88) from [<c00245d0>] (warn_slowpath_null+0x1c/0x24)
>>> [  467.174959] [<c00245d0>] (warn_slowpath_null+0x1c/0x24) from [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4)
>>> [  467.185659] [<c00695e4>] (tick_broadcast_oneshot_control+0x19c/0x1c4) from [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc)
>>> [  467.196538] [<c0067cdc>] (clockevents_notify+0x1b0/0x1dc) from [<c034f348>] (cpuidle_idle_call+0x11c/0x168)
>>> [  467.206292] [<c034f348>] (cpuidle_idle_call+0x11c/0x168) from [<c000f134>] (arch_cpu_idle+0x8/0x38)
>>> [  467.215359] [<c000f134>] (arch_cpu_idle+0x8/0x38) from [<c0061038>] (cpu_startup_entry+0x60/0x134)
>>> [  467.224325] [<c0061038>] (cpu_startup_entry+0x60/0x134) from [<800083d8>] (0x800083d8)
>>> [  467.232227] ---[ end trace ea579be22a00e7fb ]---
>>> echo 0 > /sys/devices/system/cpu/cpu1/online
>>> [  469.126682] CPU1: shutdown
>>
>> I have found no solution for (1) (although I didn't look hard!).
>>
>> (2) can be solved with the following (at least 50 iterations of my test
>> script worked with this patch applied):
> 
> Actually this warning is resulting from a bug in the tick broadcast code
> and has been solved with commit:
> 
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/urgent&id=ea8deb8dfa6b0e8d1b3d1051585706739b46656c
> 
> This patch has been merged in timers/urgent branch but still need to
> merged with timers/core.
> 
> The patch below does not fix the warning but prevents the tick warning
> to occur. Applying the patch above will fix your problem.

Yes, I was imprecise when I said solve; I simply meant that making that
change prevented me from seeing that issue any more.

That patch is already in v3.11-rc1, and I was using that as a base when
I applied Joseph's patches. So, it doesn't solve this issue.

ARM: tegra: cpuidle: use CPUIDLE_FLAG_TIMER_STOP flag

Commit Message

Comments

Patch