
new ACPI processor driver to force CPUs idle

Message ID 20090624041354.GA15936@sli10-desk.sh.intel.com (mailing list archive)
State Accepted
Headers show

Commit Message

Shaohua Li June 24, 2009, 4:13 a.m. UTC
This patch supports the processor aggregator device. When the OS receives an
ACPI notification, the driver will idle some number of CPUs.

To idle a CPU, the patch creates a power-saving thread. The scheduler will
migrate the thread to the preferred CPU. The thread has maximum priority and
the SCHED_RR policy, so it can occupy one CPU. To save power, the thread keeps
executing the C-state (MWAIT) instruction. Routine power_saving_thread() is
the entry point of the thread.

To avoid starvation, the thread sleeps for 5% of every second (the current RT
scheduler has a throttling threshold to avoid starvation, but if other CPUs
are idle, a CPU can borrow runtime from them, so that mechanism does not work
here).

This approach (forcing a CPU idle) should have no impact on the scheduler, and
tasks with affinity to an idled CPU still get a chance to run. Any
comments/suggestions are welcome.
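
A minimal sketch of the loop described above, assuming a hypothetical
do_mwait_idle() helper for the MWAIT-based C-state entry (the real routine is
in processor_aggregator.c in this patch):

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/jiffies.h>

void do_mwait_idle(void);	/* hypothetical helper: enter a C-state via MWAIT */

static int power_saving_thread(void *data)
{
	struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };

	/* RT policy so the thread can occupy whichever CPU it lands on */
	sched_setscheduler(current, SCHED_RR, &param);

	while (!kthread_should_stop()) {
		unsigned long expire = jiffies + HZ - HZ / 20;

		/* forced idle for ~95% of each second */
		while (!kthread_should_stop() && time_before(jiffies, expire))
			do_mwait_idle();

		/* yield the CPU for the remaining ~5% to avoid starvation */
		schedule_timeout_interruptible(HZ / 20);
	}
	return 0;
}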

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 drivers/acpi/Kconfig                |   11 +
 drivers/acpi/Makefile               |    2 
 drivers/acpi/processor_aggregator.c |  389 ++++++++++++++++++++++++++++++++++++
 3 files changed, 402 insertions(+)


Comments

Peter Zijlstra June 24, 2009, 6:39 a.m. UTC | #1
On Wed, 2009-06-24 at 12:13 +0800, Shaohua Li wrote:
> This patch supports the processor aggregator device. When OS gets one ACPI
> notification, the driver will idle some number of cpus.
> 
> To make CPU idle, the patch will create power saving thread. Scheduler
> will migrate the thread to preferred CPU. The thread has max priority and
> has SCHED_RR policy, so it can occupy one CPU. To save power, the thread will
> keep calling C-state instruction. Routine power_saving_thread() is the entry
> of the thread.
> 
> To avoid starvation, the thread will sleep 5% time for every second
> (current RT scheduler has threshold to avoid starvation, but if other
> CPUs are idle, the CPU can borrow CPU timer from other, so makes the mechanism
> not work here)
> 
> This approach (to force CPU idle) should hasn't impact to scheduler and tasks
> with affinity still can get chance to run even the tasks run on idled cpu. Any
> comments/suggestions are welcome.

> +static int power_saving_thread(void *data)
> +{
> +	struct sched_param param = {.sched_priority = MAX_RT_PRIO - 1};
> +	int do_sleep;
> +
> +	/*
> +	 * we just create a RT task to do power saving. Scheduler will migrate
> +	 * the task to any CPU.
> +	 */
> +	sched_setscheduler(current, SCHED_RR, &param);
> +

This is crazy and wrong.

1) cpusets can be so configured as to not have the full machine in a
single load-balance domain, eg. the above comment about the scheduler is
false.

2) you're running at MAX_RT_PRIO-1, this will mightily upset the
migration thread and kstopmachine bits.

3) you're going to starve RT processes by being of a higher priority,
even though you might gain enough idle time by simply moving SCHED_OTHER
tasks around.

4) you're introducing 57s latencies to processes that happen to get
scheduled on whatever CPU you end up on, not nice.

NACK

Shaohua Li June 24, 2009, 7:47 a.m. UTC | #2
On Wed, Jun 24, 2009 at 02:39:18PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-06-24 at 12:13 +0800, Shaohua Li wrote:
> > This patch supports the processor aggregator device. When OS gets one ACPI
> > notification, the driver will idle some number of cpus.
> > 
> > To make CPU idle, the patch will create power saving thread. Scheduler
> > will migrate the thread to preferred CPU. The thread has max priority and
> > has SCHED_RR policy, so it can occupy one CPU. To save power, the thread will
> > keep calling C-state instruction. Routine power_saving_thread() is the entry
> > of the thread.
> > 
> > To avoid starvation, the thread will sleep 5% time for every second
> > (current RT scheduler has threshold to avoid starvation, but if other
> > CPUs are idle, the CPU can borrow CPU timer from other, so makes the mechanism
> > not work here)
> > 
> > This approach (to force CPU idle) should hasn't impact to scheduler and tasks
> > with affinity still can get chance to run even the tasks run on idled cpu. Any
> > comments/suggestions are welcome.
> 
> > +static int power_saving_thread(void *data)
> > +{
> > +	struct sched_param param = {.sched_priority = MAX_RT_PRIO - 1};
> > +	int do_sleep;
> > +
> > +	/*
> > +	 * we just create a RT task to do power saving. Scheduler will migrate
> > +	 * the task to any CPU.
> > +	 */
> > +	sched_setscheduler(current, SCHED_RR, &param);
> > +
> 
> This is crazy and wrong.
> 
> 1) cpusets can be so configured as to not have the full machine in a
> single load-balance domain, eg. the above comment about the scheduler is
> false.
Assume the user will not assign such a thread to a cpuset; if they do, that's
the user's mistake.
 
> 2) you're running at MAX_RT_PRIO-1, this will mightily upset the
> migration thread and kstopmachine bits.
> 
> 3) you're going to starve RT processes by being of a higher priority,
> even though you might gain enough idle time by simply moving SCHED_OTHER
> tasks around.
For 2/3: the power-saving thread has SCHED_RR, so it will run out of its time
slice within 100ms. SCHED_OTHER might not work, because the system might be very busy.

Or we can lower the priority so as not to upset kernel RT threads. Applications
are usually not RT.

> 4) you're introducing 57s latencies to processes that happen to get
> scheduled on whatever CPU you end up on, not nice.
Sorry for my ignorance of the scheduler, but I don't understand what you mean.
Won't the scheduler migrate normal threads off the CPU?

Thanks,
Shaohua
Peter Zijlstra June 24, 2009, 8:03 a.m. UTC | #3
On Wed, 2009-06-24 at 15:47 +0800, Shaohua Li wrote:
> On Wed, Jun 24, 2009 at 02:39:18PM +0800, Peter Zijlstra wrote:
> > On Wed, 2009-06-24 at 12:13 +0800, Shaohua Li wrote:
> > > This patch supports the processor aggregator device. When OS gets one ACPI
> > > notification, the driver will idle some number of cpus.
> > > 
> > > To make CPU idle, the patch will create power saving thread. Scheduler
> > > will migrate the thread to preferred CPU. The thread has max priority and
> > > has SCHED_RR policy, so it can occupy one CPU. To save power, the thread will
> > > keep calling C-state instruction. Routine power_saving_thread() is the entry
> > > of the thread.
> > > 
> > > To avoid starvation, the thread will sleep 5% time for every second
> > > (current RT scheduler has threshold to avoid starvation, but if other
> > > CPUs are idle, the CPU can borrow CPU timer from other, so makes the mechanism
> > > not work here)
> > > 
> > > This approach (to force CPU idle) should hasn't impact to scheduler and tasks
> > > with affinity still can get chance to run even the tasks run on idled cpu. Any
> > > comments/suggestions are welcome.
> > 
> > > +static int power_saving_thread(void *data)
> > > +{
> > > +	struct sched_param param = {.sched_priority = MAX_RT_PRIO - 1};
> > > +	int do_sleep;
> > > +
> > > +	/*
> > > +	 * we just create a RT task to do power saving. Scheduler will migrate
> > > +	 * the task to any CPU.
> > > +	 */
> > > +	sched_setscheduler(current, SCHED_RR, &param);
> > > +
> > 
> > This is crazy and wrong.
> > 
> > 1) cpusets can be so configured as to not have the full machine in a
> > single load-balance domain, eg. the above comment about the scheduler is
> > false.
> Assume user will not assign such thread to a cpuset, if yes, it's user's
> wrong.

No, it's user policy, and especially on large machines cpusets are very useful.
The kernel not taking that into account is simply not an option.

Any thermal facility that doesn't take cpusets into account, or worse
destroys user policy (the hotplug road), is a full stop in my book.

It's similar to the saying that the customer is always right: sure, the admin
can indeed configure the machine so that any thermal policy is indeed
doomed to fail, and in that case I would print some warnings into syslog
and let the machine die of thermal overload -- not our problem.

The thing is, the admin configures it in a way, and then expects it to
work like that. If any random event can void the guarantees what good
are they?

Now, if ACPI-4.0 is so broken that it simply cannot support a sane
thermal model, then I suggest we simply not support this feature and
hope they will grow clue for 4.1 and try again next time.

> > 2) you're running at MAX_RT_PRIO-1, this will mightily upset the
> > migration thread and kstopmachine bits.
> > 
> > 3) you're going to starve RT processes by being of a higher priority,
> > even though you might gain enough idle time by simply moving SCHED_OTHER
> > tasks around.
> for 2/3, the power saving thread has SCHED_RR, it will run out of its time slice
> in 100ms. SCHED_OTHER might not work, because the system might be very busy.
> 
> Or we can lower the priority to not upset kernel RT threads. Usually applications
> are not RT.

Right, doing this at RR prio 1 would be much better.

> > 4) you're introducing 57s latencies to processes that happen to get
> > scheduled on whatever CPU you end up on, not nice.
> Sorry for my ignorance on scheduler, I don't understand what you mean.
> Won't scheduler will migrate normal threads out the cpu?

Not currently, no, that's one of the more interesting things on the todo
list.

But it appears I can't read very well; it's 0.95s, still a lot but not
quite as bad as I made it out.

Shaohua Li June 24, 2009, 8:21 a.m. UTC | #4
On Wed, Jun 24, 2009 at 04:03:05PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-06-24 at 15:47 +0800, Shaohua Li wrote:
> > On Wed, Jun 24, 2009 at 02:39:18PM +0800, Peter Zijlstra wrote:
> > > On Wed, 2009-06-24 at 12:13 +0800, Shaohua Li wrote:
> > > > This patch supports the processor aggregator device. When OS gets one ACPI
> > > > notification, the driver will idle some number of cpus.
> > > > 
> > > > To make CPU idle, the patch will create power saving thread. Scheduler
> > > > will migrate the thread to preferred CPU. The thread has max priority and
> > > > has SCHED_RR policy, so it can occupy one CPU. To save power, the thread will
> > > > keep calling C-state instruction. Routine power_saving_thread() is the entry
> > > > of the thread.
> > > > 
> > > > To avoid starvation, the thread will sleep 5% time for every second
> > > > (current RT scheduler has threshold to avoid starvation, but if other
> > > > CPUs are idle, the CPU can borrow CPU timer from other, so makes the mechanism
> > > > not work here)
> > > > 
> > > > This approach (to force CPU idle) should hasn't impact to scheduler and tasks
> > > > with affinity still can get chance to run even the tasks run on idled cpu. Any
> > > > comments/suggestions are welcome.
> > > 
> > > > +static int power_saving_thread(void *data)
> > > > +{
> > > > +	struct sched_param param = {.sched_priority = MAX_RT_PRIO - 1};
> > > > +	int do_sleep;
> > > > +
> > > > +	/*
> > > > +	 * we just create a RT task to do power saving. Scheduler will migrate
> > > > +	 * the task to any CPU.
> > > > +	 */
> > > > +	sched_setscheduler(current, SCHED_RR, &param);
> > > > +
> > > 
> > > This is crazy and wrong.
> > > 
> > > 1) cpusets can be so configured as to not have the full machine in a
> > > single load-balance domain, eg. the above comment about the scheduler is
> > > false.
> > Assume user will not assign such thread to a cpuset, if yes, it's user's
> > wrong.
> 
> No its user policy, and esp on large machines cpusets are very useful.
> The kernel not taking that into account is simply not an option.
> 
> Any thermal facility that doesn't take cpusets into account, or worse
> destroys user policy (the hotplug road), is a full stop in my book.
> 
> Is similar to the saying the customer is always right, sure the admin
> can indeed configure the machine so that any thermal policy is indeed
> doomed to fail, and in that case I would print some warnings into syslog
> and let the machine die of thermal overload -- not our problem.
> 
> The thing is, the admin configures it in a way, and then expects it to
> work like that. If any random event can void the guarantees what good
> are they?
> 
> Now, if ACPI-4.0 is so broken that it simply cannot support a sane
> thermal model, then I suggest we simply not support this feature and
> hope they will grow clue for 4.1 and try again next time.
The assumption is that the user does not assign the power-saving thread to
a specific cpuset. I thought that assumption was reasonable: users can assign
the threads they care about to a cpuset, but not all threads.
The power-saving thread stays in the top cpuset, so it still has a chance to run
on any CPU. If the power-saving thread runs on a CPU, the tasks on that CPU still
get a chance to run (at least 0.05s per second), so it does not completely break
user policy.
 
> > > 2) you're running at MAX_RT_PRIO-1, this will mightily upset the
> > > migration thread and kstopmachine bits.
> > > 
> > > 3) you're going to starve RT processes by being of a higher priority,
> > > even though you might gain enough idle time by simply moving SCHED_OTHER
> > > tasks around.
> > for 2/3, the power saving thread has SCHED_RR, it will run out of its time slice
> > in 100ms. SCHED_OTHER might not work, because the system might be very busy.
> > 
> > Or we can lower the priority to not upset kernel RT threads. Usually applications
> > are not RT.
> 
> Right, doing this at RR prio 1 would be much better.
ok, will do this.
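
For reference, a sketch of the lowered-priority setup being agreed on here
(RR priority 1 instead of MAX_RT_PRIO-1; illustrative only):

#include <linux/sched.h>

static void set_power_saving_prio(void)
{
	/*
	 * lowest RT priority: still above SCHED_OTHER, but below the
	 * migration thread, kstopmachine and other kernel RT threads
	 */
	struct sched_param param = { .sched_priority = 1 };

	sched_setscheduler(current, SCHED_RR, &param);
}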

> > > 4) you're introducing 57s latencies to processes that happen to get
> > > scheduled on whatever CPU you end up on, not nice.
> > Sorry for my ignorance on scheduler, I don't understand what you mean.
> > Won't scheduler will migrate normal threads out the cpu?
> 
> Not currently, no, that's one of the more interesting things on the todo
> list.
> 
> But it appears I can't read very well, its .95s, still a lot but not
> quite as bad as I made it out.
good. 

Thanks,
Shaohua
Len Brown June 24, 2009, 5:20 p.m. UTC | #5
> Any thermal facility that doesn't take cpusets into account, or worse
> destroys user policy (the hotplug road), is a full stop in my book.
>
> Is similar to the saying the customer is always right, sure the admin
> can indeed configure the machine so that any thermal policy is indeed
> doomed to fail, and in that case I would print some warnings into syslog
> and let the machine die of thermal overload -- not our problem.
>
> The thing is, the admin configures it in a way, and then expects it to
> work like that. If any random event can void the guarantees what good
> are they?
> 
> Now, if ACPI-4.0 is so broken that it simply cannot support a sane
> thermal model, then I suggest we simply not support this feature and
> hope they will grow clue for 4.1 and try again next time.

Peter,
ACPI is just the messenger here - user policy is in charge,
and everybody agrees, user policy is always right.

The policy may be a thermal cap to deal with thermal emergencies
as gracefully as possible, or it may be an electrical cap to
prevent a rack from approaching the limits of the provisioned
electrical supply.

This isn't about a brain dead administrator, doomed thermal policy,
or a broken ACPI spec.  This mechanism is about trying to maintain
uptime in the face of thermal emergencies, and spending limited
electrical provisioning dollars to match, rather than grossly exceed,
maximum machine room requirements.

Do you have any fundamental issues with these goals?
Are we in agreement that they are worthy goals?

The forced-idle technique is employed after the processors have
all already been forced to their lowest performance P-state
and the power/thermal problem has not been resolved.

No, this isn't a happy scenario; we are definitely impacting
performance.  However, we are trying to impact system performance
as little as possible while saving as much energy as possible.

After P-states are exhausted and the problem is not resolved,
the rack (via ACPI) asks Linux to idle a processor.
Linux has full freedom to choose which processor.
If the condition does not get resolved, the rack will ask us
to offline more processors.

If this technique fails, the rack will throttle the processors
down as low as 1/16th of their lowest performance P-state.
Yes, that is about 100MHz on most multi GHz systems...

If that fails, the entire system is powered-off.

Obviously, the approach is to impact performance as little as possible
while impacting energy consumption as much as possible.  Use the most
efficient means first, and resort to increasingly invasive measures
as necessary...

I think we all agree that we must not break the administrator's
cpuset policy if we are asked to force a core to be idle -- for
when the emergency is over, the system should return to normal
and bear no permanent scars.

The simplest thing that comes to mind is to declare a system
with cpusets or binding fundamentally incompatible with
forced idle, and to skip that technique and let the hardware
throttle all the processor clocks with T-states.

However, on aggregate, forced-idle is a more efficient way
to save energy, as idle on today's processors is highly optimized.

So if you can suggest how we can force processors to be idle
even when cpusets and binding are present in a system,
that would be great.

thanks,
-Len Brown, Intel Open Source Technology Center
Peter Zijlstra June 26, 2009, 7:46 a.m. UTC | #6
On Wed, 2009-06-24 at 13:20 -0400, Len Brown wrote:
> > Any thermal facility that doesn't take cpusets into account, or worse
> > destroys user policy (the hotplug road), is a full stop in my book.
> >
> > Is similar to the saying the customer is always right, sure the admin
> > can indeed configure the machine so that any thermal policy is indeed
> > doomed to fail, and in that case I would print some warnings into syslog
> > and let the machine die of thermal overload -- not our problem.
> >
> > The thing is, the admin configures it in a way, and then expects it to
> > work like that. If any random event can void the guarantees what good
> > are they?
> > 
> > Now, if ACPI-4.0 is so broken that it simply cannot support a sane
> > thermal model, then I suggest we simply not support this feature and
> > hope they will grow clue for 4.1 and try again next time.
> 
> Peter,
> ACPI is just the messenger here - user policy in in charge,
> and everybody agrees, user policy is always right.
> 
> The policy may be a thermal cap to deal with thermal emergencies
> as gracefully as possible, or it may be an electrical cap to
> prevent a rack from approaching the limits of the provisioned
> electrical supply.
> 
> This isn't about a brain dead administrator, doomed thermal policy,
> or a broken ACPI spec.  This mechanism is about trying to maintain
> uptime in the face of thermal emergencies, and spending limited
> electrical provisioning dollars to match, rather than grosely exceed,
> maximum machine room requirements.
> 
> Do you have any fundamental issues with these goals?
> Are we agreement that they are worth goals?

As long as we all agree that these will be rare events, yes.

If people think it's OK to seriously overcommit on thermal or electrical
(that was a new one for me) capacity, then we're in disagreement.

> The forced-idle technique is employed after the processors have
> all already been forced to their lowest performance P-state
> and the power/thermal problem has not been resolved.

Hmm, would fully idling a socket not be more efficient (throughput wise)
than forcing everybody into P states?

Also, who does the P state forcing, is that the BIOS or is that under OS
control?

> No, this isn't a happy scenario, we are definately impacting
> performance.  However, we are trying to impact system performance
> as little as possible while saving as much energy as possible.
> 
> After P-states are exhausted and the problem is not resolved,
> the rack (via ACPI) asks Linux to idle a processor.
> Linux has full freedom to choose which processor.
> If the condition does not get resolved, the rack will ask us
> to offline more processors.

Right, is there some measure we can tie into a closed feedback loop?

The thing I'm thinking of is Vaidy's load-balancer changes that take an
overload packing argument.

If we can couple that to the ACPI driver in a closed feedback loop we
have automagic tuning.

We could even make an extension to cpusets where you can indicate that
you want your configuration to be able to support thermal control which
would limit configuration in a way that there is always some room to
idle sockets.

This could help avoid the: Oh my, I've melted my rack through
mis-configuration, scenario.

> If this technique fails, the rack will throttle the processors
> down as low as 1/16th of their lowest performance P-state.
> Yes, that is about 100MHz on most multi GHz systems...

Whee :-)

> If that fails, the entire system is powered-off.

I suppose if that fails someone messed up real bad anyway, that's a
level of thermal/electrical overcommit that should have corporal
punishment attached.

> Obviously, the approach is to impact performance as little as possible
> while impacting energy consumption as much as possible.  Use the most
> efficieint means first, and resort to increasingly invasive measures
> as necessary...
> 
> I think we all agree that we must not break the administrator's
> cpuset policy if we are asked to force a core to be idle -- for
> whent the emergency is over,the system should return to normal
> and bear not permanent scars.
> 
> The simplest thing that comes to mind is to declare a system
> with cpusets or binding fundamentally incompatible with
> forced idle, and to skip that technique and let the hardware
> throttle all the processor clocks with T-states.

Right, I really really want to avoid having thermal management and
cpusets become an exclusive feature. I think it would basically render
cpusets useless for a large number of people, and that would be an utter
shame.

> However, on aggregate, forced-idle is a more efficient way
> to save energy, as idle on today's processors is highly optimized.
> 
> So if you can suggest how we can force processors to be idle
> even when cpusets and binding are present in a system,
> that would be great.

Right, so I think the load-balancer angle possibly with a cpuset
extension that limits partitioning so that there is room for idling a
few sockets should work out nicely.

All we need is a metric to couple that load-balancer overload number to.

Some integration with P states might be interesting to think about. But
as it stands getting that load-balancer placement stuff fixed seems like
enough fun ;-)

Len Brown June 26, 2009, 4:46 p.m. UTC | #7
> > ACPI is just the messenger here - user policy in in charge,
> > and everybody agrees, user policy is always right.
> > 
> > The policy may be a thermal cap to deal with thermal emergencies
> > as gracefully as possible, or it may be an electrical cap to
> > prevent a rack from approaching the limits of the provisioned
> > electrical supply.
> > 
> > This isn't about a brain dead administrator, doomed thermal policy,
> > or a broken ACPI spec.  This mechanism is about trying to maintain
> > uptime in the face of thermal emergencies, and spending limited
> > electrical provisioning dollars to match, rather than grosely exceed,
> > maximum machine room requirements.
> > 
> > Do you have any fundamental issues with these goals?
> > Are we agreement that they are worth goals?
> 
> As long as we all agree that these will be rare events, yes.
> 
> If people think its OK to seriously overcommit on thermal or electrical
> (that was a new one for me) capacity, then we're in disagreement.

As with any knob, there is a reasonable and an un-reasonable range...

The most obvious and reasonable use is as a guarantee
that the rack shall not exceed provisioned power.  In the past,
IT would add up the AC/DC name-plates inside the rack and call facilities
to provision that total.  While this was indeed a "guaranteed not to
exceed" number, the margin of that number over actual peak was
often over 2x, causing IT to way over-estimate and way over-spend.
So giving IT a way to set up a guarantee that is closer to measured
peak consumption saves them a bundle of $$ on electrical provisioning.

The next most useful scenario is when the cooling fails.
Many machine rooms have multiple units, and when one goes offline,
the temperature rises.  Rather than having to either power off the
servers or provision fully redundant cooling, it is extremely
valuable to be able to ride out cooling issues, preserving uptime.

Could somebody cut it too close and have these mechanisms cut in
frequently?  Sure, and they'd have a measurable impact on performance,
likely impacting their users and their job security...

> > The forced-idle technique is employed after the processors have
> > all already been forced to their lowest performance P-state
> > and the power/thermal problem has not been resolved.
> 
> Hmm, would fully idling a socket not be more efficient (throughput wise)
> than forcing everybody into P states?

Nope.

Low Frequency Mode (LFM), aka Pn - the deepest P-state,
is the lowest energy/instruction because it is the highest
frequency available at the lowest voltage that can still
retire instructions.

That is why it is the first method used -- it returns the
highest power_savings/performance_impact.

> Also, who does the P state forcing, is that the BIOS or is that under OS
> control?

Yes.
The platform (via ACPI) tells the OS that the highest
performance P-state is off limits, and cpufreq responds
to that by keeping the frequency below that limit.

The platform monitors the cause of the issue and if it doesn't
go away, tells us to limit to successively deeper P-states
until if necessary, we arrive at Pn, the deepest (lowest 
performance) P-state.

If the OS does not respond to these requests in a timely manner
some platforms have the capability to make these P-state
changes behind the OS's back.
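
A hedged sketch of that flow, roughly mirroring the existing processor_perflib
path: the platform's _PPC object names the highest P-state the OS may use, and
cpufreq is asked to re-evaluate its policy against the new limit (the helper
name is illustrative):

#include <linux/acpi.h>
#include <linux/cpufreq.h>
#include <acpi/processor.h>

static void handle_platform_limit(struct acpi_processor *pr)
{
	unsigned long long ppc = 0;
	acpi_status status;

	/* _PPC returns the index of the highest P-state we may use */
	status = acpi_evaluate_integer(pr->handle, "_PPC", NULL, &ppc);
	if (ACPI_FAILURE(status))
		return;

	pr->performance_platform_limit = (int)ppc;

	/* cpufreq re-clamps policy->max against the new platform limit */
	cpufreq_update_policy(pr->id);
}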

> > No, this isn't a happy scenario, we are definately impacting
> > performance.  However, we are trying to impact system performance
> > as little as possible while saving as much energy as possible.
> > 
> > After P-states are exhausted and the problem is not resolved,
> > the rack (via ACPI) asks Linux to idle a processor.
> > Linux has full freedom to choose which processor.
> > If the condition does not get resolved, the rack will ask us
> > to offline more processors.
> 
> Right, is there some measure we can tie into a closed feedback loop?

The power and thermal monitoring are out-of-band in the platform,
so Linux is not (currently) part of a closed control loop.
However, Linux is part of the control, and the loop is indeed closed:-)

> The thing I'm thinking off is vaidy's load-balancer changes that take an
> overload packing argument.
> 
> If we can couple that to the ACPI driver in a closed feedback loop we
> have automagic tuning.

I think that those changes are probably fancier than we need for
this simple mechanism right now -- though if they ended up being different
ways to use the same code in the long run, that would be fine.

> We could even make an extension to cpusets where you can indicate that
> you want your configuration to be able to support thermal control which
> would limit configuration in a way that there is always some room to
> idle sockets.
> 
> This could help avoid the: Oh my, I've melted my rack through
> mis-configuration, scenario.

I'd rather it be more idiot proof.

e.g. it doesn't matter _where_ the forced-idle thread lives,
it just matters that it exists _somewhere_.  So if we could
move it around with some granularity such that its penalty
were equally shared across the system, then that would
be idiot proof.

> > If this technique fails, the rack will throttle the processors
> > down as low as 1/16th of their lowest performance P-state.
> > Yes, that is about 100MHz on most multi GHz systems...
> 
> Whee :-)
> 
> > If that fails, the entire system is powered-off.
> 
> I suppose if that fails someone messed up real bad anyway, that's a
> level of thermal/electrical overcommit that should have corporal
> punishment attached.
> 
> > Obviously, the approach is to impact performance as little as possible
> > while impacting energy consumption as much as possible.  Use the most
> > efficieint means first, and resort to increasingly invasive measures
> > as necessary...
> > 
> > I think we all agree that we must not break the administrator's
> > cpuset policy if we are asked to force a core to be idle -- for
> > whent the emergency is over,the system should return to normal
> > and bear not permanent scars.
> > 
> > The simplest thing that comes to mind is to declare a system
> > with cpusets or binding fundamentally incompatible with
> > forced idle, and to skip that technique and let the hardware
> > throttle all the processor clocks with T-states.
> 
> Right, I really really want to avoid having thermal management and
> cpusets become an exclusive feature. I think it would basically render
> cpusets useless for a large number of people, and that would be an utter
> shame.
> 
> > However, on aggregate, forced-idle is a more efficient way
> > to save energy, as idle on today's processors is highly optimized.
> > 
> > So if you can suggest how we can force processors to be idle
> > even when cpusets and binding are present in a system,
> > that would be great.
> 
> Right, so I think the load-balancer angle possibly with a cpuset
> extension that limits partitioning so that there is room for idling a
> few sockets should work out nicely.
> 
> All we need is a metric to couple that load-balancer overload number to.
> 
> Some integration with P states might be interesting to think about. But
> as it stands getting that load-balancer placement stuff fixed seems like
> enough fun ;-)

I think that we already have an issue with scheduler vs P-states,
as the scheduler is handing out buckets of time assuming that 
they are all equal.  However, a high-frequency bucket is more valuable
than a low frequency bucket.  So probably the scheduler should be tracking
cycles rather than time...

But that is independent of the forced-idle thread issue at hand.

We'd like to ship the forced-idle thread as a self-contained driver,
if possible.  Because that would enable us to easily back-port it
to some enterprise releases that want the feature.  So if we can
implement this such that it is functional with existing scheduler
facilities, that would get us by.  If the scheduler evolves
and provides a more optimal mechanism in the future, then that is
great, as long as we don't have to wait for that to provide
the basic version of the feature.

thanks,
Len Brown, Intel Open Source Technology Center

Vaidyanathan Srinivasan June 26, 2009, 6:16 p.m. UTC | #8
* Shaohua Li <shaohua.li@intel.com> [2009-06-24 16:21:12]:

> On Wed, Jun 24, 2009 at 04:03:05PM +0800, Peter Zijlstra wrote:
> > On Wed, 2009-06-24 at 15:47 +0800, Shaohua Li wrote:
> > > On Wed, Jun 24, 2009 at 02:39:18PM +0800, Peter Zijlstra wrote:
> > > > On Wed, 2009-06-24 at 12:13 +0800, Shaohua Li wrote:
> > > > > This patch supports the processor aggregator device. When OS gets one ACPI
> > > > > notification, the driver will idle some number of cpus.
> > > > > 
> > > > > To make CPU idle, the patch will create power saving thread. Scheduler
> > > > > will migrate the thread to preferred CPU. The thread has max priority and
> > > > > has SCHED_RR policy, so it can occupy one CPU. To save power, the thread will
> > > > > keep calling C-state instruction. Routine power_saving_thread() is the entry
> > > > > of the thread.
> > > > > 
> > > > > To avoid starvation, the thread will sleep 5% time for every second
> > > > > (current RT scheduler has threshold to avoid starvation, but if other
> > > > > CPUs are idle, the CPU can borrow CPU timer from other, so makes the mechanism
> > > > > not work here)
> > > > > 
> > > > > This approach (to force CPU idle) should hasn't impact to scheduler and tasks
> > > > > with affinity still can get chance to run even the tasks run on idled cpu. Any
> > > > > comments/suggestions are welcome.
> > > > 
> > > > > +static int power_saving_thread(void *data)
> > > > > +{
> > > > > +	struct sched_param param = {.sched_priority = MAX_RT_PRIO - 1};
> > > > > +	int do_sleep;
> > > > > +
> > > > > +	/*
> > > > > +	 * we just create a RT task to do power saving. Scheduler will migrate
> > > > > +	 * the task to any CPU.
> > > > > +	 */
> > > > > +	sched_setscheduler(current, SCHED_RR, &param);
> > > > > +
> > > > 
> > > > This is crazy and wrong.
> > > > 
> > > > 1) cpusets can be so configured as to not have the full machine in a
> > > > single load-balance domain, eg. the above comment about the scheduler is
> > > > false.
> > > Assume user will not assign such thread to a cpuset, if yes, it's user's
> > > wrong.
> > 
> > No its user policy, and esp on large machines cpusets are very useful.
> > The kernel not taking that into account is simply not an option.
> > 
> > Any thermal facility that doesn't take cpusets into account, or worse
> > destroys user policy (the hotplug road), is a full stop in my book.
> > 
> > Is similar to the saying the customer is always right, sure the admin
> > can indeed configure the machine so that any thermal policy is indeed
> > doomed to fail, and in that case I would print some warnings into syslog
> > and let the machine die of thermal overload -- not our problem.
> > 
> > The thing is, the admin configures it in a way, and then expects it to
> > work like that. If any random event can void the guarantees what good
> > are they?
> > 
> > Now, if ACPI-4.0 is so broken that it simply cannot support a sane
> > thermal model, then I suggest we simply not support this feature and
> > hope they will grow clue for 4.1 and try again next time.
> The assumption is user not assigns power saving thread to a specific cpuset.
> I thought the assumption is feasible, user can assign threads they care about
> to a cpuset, but not all.
> Power saving thread stays at the top cpuset, so it still has chance to run on any
> cpus. If power saving thread runs on a cpu, the tasks on the cpu still have chance
> to run (at least 0.05s), so it does not completely break user policy.

How do we handle interrupts and timers during this interval?  You seem
to disable interrupts and hold the CPU idle for 0.95 sec.  That may
cause timeouts and overflows for network interrupts, right?

The next issue is halting the sibling threads belonging to a core at the
same time, to get any power/thermal benefit.  Who does the coordination
for forced idle in this approach?

--Vaidy

Vaidyanathan Srinivasan June 26, 2009, 6:42 p.m. UTC | #9
* Len Brown <lenb@kernel.org> [2009-06-26 12:46:53]:

> > > ACPI is just the messenger here - user policy in in charge,
> > > and everybody agrees, user policy is always right.
> > > 
> > > The policy may be a thermal cap to deal with thermal emergencies
> > > as gracefully as possible, or it may be an electrical cap to
> > > prevent a rack from approaching the limits of the provisioned
> > > electrical supply.
> > > 
> > > This isn't about a brain dead administrator, doomed thermal policy,
> > > or a broken ACPI spec.  This mechanism is about trying to maintain
> > > uptime in the face of thermal emergencies, and spending limited
> > > electrical provisioning dollars to match, rather than grosely exceed,
> > > maximum machine room requirements.
> > > 
> > > Do you have any fundamental issues with these goals?
> > > Are we agreement that they are worth goals?
> > 
> > As long as we all agree that these will be rare events, yes.
> > 
> > If people think its OK to seriously overcommit on thermal or electrical
> > (that was a new one for me) capacity, then we're in disagreement.
> 
> As with any knob, there is a reasonable and an un-reasonable range...
> 
> The most obvious and reasonable is to use this mechanism as a guarantee
> that the rack shall not exceed provisioned power.  In the past,
> IT would add up the AC/CD name-plates inside the rack and call facilities
> to provision that total.  While this was indeed a "guaranteed not to 
> exceed" number, the margin of that number over actual peak actual was
> often over 2x, causing IT to way over-estimate and way over-spend.
> So giving IT a way to set up a guarantee that is closer to measured
> peak consumption saves them a bundle of $$ on electrical provisioning.
> 
> The next most useful scenario is when the cooling fails.
> Many machine rooms have multiple units and and when one goes off line,
> the temperature rises.  Rather than having to either power off the
> servers or provision fully redundant cooling, it is extremely 
> valuableuseful to be able to ride-out cooling issues, preserving uptime.
> 
> Could somebody cut it too close and have these mechanisms cut in 
> frequently?  Sure, and they'd have a mesurable impact on performance,
> likely impacting their users and their job security...
> 
> > > The forced-idle technique is employed after the processors have
> > > all already been forced to their lowest performance P-state
> > > and the power/thermal problem has not been resolved.
> > 
> > Hmm, would fully idling a socket not be more efficient (throughput wise)
> > than forcing everybody into P states?
> 
> Nope.
> 
> Low Frequency Mode (LFM), aka Pn - the deepest P-state,
> is the lowest energy/instruction because it is this highest
> frequency available at the lowest voltage that can still
> retire instructions.

This is true if you want to retire instructions.  But if you want
to stop retiring instructions and hold cores idle, then idling the
complete package will be more efficient, right?  At least you will need
to idle all the sibling threads at the same time to save power in
a core.
 
> That is why it is the first method used -- it returns the
> highest power_savings/performance_impact.

Depending on what is running in the system, force-idling cores may
help reduce average power compared to running all cores at the lowest
P-state.
 
> > Also, who does the P state forcing, is that the BIOS or is that under OS
> > control?
> 
> Yes.
> The platform (via ACPI) tells the OS that the highest
> performance p-state is off limites, and cpufreq responds
> to that by keeping the frequency below that limit.
> 
> The platform monitors the cause of the issue and if it doesn't
> go away, tells us to limit to successively deeper P-states
> until if necessary, we arrive at Pn, the deepest (lowest 
> performance) P-state.
> 
> If the OS does not respond to these requests in a timely manner
> some platforms have the capability to make these P-state
> changes behind the OS's back.
> 
> > > No, this isn't a happy scenario, we are definately impacting
> > > performance.  However, we are trying to impact system performance
> > > as little as possible while saving as much energy as possible.
> > > 
> > > After P-states are exhausted and the problem is not resolved,
> > > the rack (via ACPI) asks Linux to idle a processor.
> > > Linux has full freedom to choose which processor.
> > > If the condition does not get resolved, the rack will ask us
> > > to offline more processors.
> > 
> > Right, is there some measure we can tie into a closed feedback loop?
> 
> The power and thermal monitoring are out-of-band in the platform,
> so Linux is not (currently) part of a closed control loop.
> However, Linux is part of the control, and the loop is indeed closed:-)

The more we can include Linux in the control loop, the better we can
react to the situation with the least performance impact.

> > The thing I'm thinking off is vaidy's load-balancer changes that take an
> > overload packing argument.
> > 
> > If we can couple that to the ACPI driver in a closed feedback loop we
> > have automagic tuning.
> 
> I think that those changes are probably fancier than we need for
> this simple mechanism right now -- though if they ended up being different
> ways to use the same code in the long run, that would be fine.

I agree that the load-balancer approach is more complex and has
challenges.  But it does have long term benefits because we can
utilise the scheduler's knowledge of system topology and current
system load to arrive at what is best.

> > We could even make an extension to cpusets where you can indicate that
> > you want your configuration to be able to support thermal control which
> > would limit configuration in a way that there is always some room to
> > idle sockets.
> > 
> > This could help avoid the: Oh my, I've melted my rack through
> > mis-configuration, scenario.
> 
> I'd rather it be more idiot proof.
> 
> eg. it doesn't matter _where_ the forced idle thread lives,
> it just matters that it exists _somewhere_.  So if we could
> move it around with some granuarity such that its penalty
> were equally shared across the system, then that would
> be idiot proof.

The requirement is clear, but the challenge is to transparently remove
capacity without breaking user-space policies.  P-states do this to
some extent, but we face challenges if the capacity of a CPU is
completely removed.

> > > If this technique fails, the rack will throttle the processors
> > > down as low as 1/16th of their lowest performance P-state.
> > > Yes, that is about 100MHz on most multi GHz systems...
> > 
> > Whee :-)
> > 
> > > If that fails, the entire system is powered-off.
> > 
> > I suppose if that fails someone messed up real bad anyway, that's a
> > level of thermal/electrical overcommit that should have corporal
> > punishment attached.
> > 
> > > Obviously, the approach is to impact performance as little as possible
> > > while impacting energy consumption as much as possible.  Use the most
> > > efficieint means first, and resort to increasingly invasive measures
> > > as necessary...
> > > 
> > > I think we all agree that we must not break the administrator's
> > > cpuset policy if we are asked to force a core to be idle -- for
> > > whent the emergency is over,the system should return to normal
> > > and bear not permanent scars.
> > > 
> > > The simplest thing that comes to mind is to declare a system
> > > with cpusets or binding fundamentally incompatible with
> > > forced idle, and to skip that technique and let the hardware
> > > throttle all the processor clocks with T-states.
> > 
> > Right, I really really want to avoid having thermal management and
> > cpusets become an exclusive feature. I think it would basically render
> > cpusets useless for a large number of people, and that would be an utter
> > shame.
> > 
> > > However, on aggregate, forced-idle is a more efficient way
> > > to save energy, as idle on today's processors is highly optimized.
> > > 
> > > So if you can suggest how we can force processors to be idle
> > > even when cpusets and binding are present in a system,
> > > that would be great.
> > 
> > Right, so I think the load-balancer angle possibly with a cpuset
> > extension that limits partitioning so that there is room for idling a
> > few sockets should work out nicely.
> > 
> > All we need is a metric to couple that load-balancer overload number to.
> > 
> > Some integration with P states might be interesting to think about. But
> > as it stands getting that load-balancer placement stuff fixed seems like
> > enough fun ;-)
> 
> I think that we already have an issue with scheduler vs P-states,
> as the scheduler is handing out buckets of time assuming that 
> they are all equal.  However, a high-frequency bucket is more valuable
> than a low frequency bucket.  So probably the scheduler should be tracking
> cycles rather than time...
> 
> But that is independent of the forced-idle thread issue at hand.
> 
> We'd like to ship the forced-idle thread as a self-contained driver,
> if possilbe.  Because that would enable us to easily back-port it
> to some enterprise releases that want the feature.  So if we can
> implement this such that it is functional with existing scheduler
> facilities, that would be get us by.  If the scheduler evolves
> and provides a more optimal mechanism in the future, then that is
> great, as long as we don't have to wait for that to provide
> the basic version of the feature.

OK, so if you want a solution that also works on older distros,
then your choices are limited.  For backports, perhaps this module
will work, but it should not be the baseline solution going forward.

--Vaidy

Matthew Garrett June 26, 2009, 7:49 p.m. UTC | #10
On Fri, Jun 26, 2009 at 12:46:53PM -0400, Len Brown wrote:

> Low Frequency Mode (LFM), aka Pn - the deepest P-state,
> is the lowest energy/instruction because it is this highest
> frequency available at the lowest voltage that can still
> retire instructions.
> 
> That is why it is the first method used -- it returns the
> highest power_savings/performance_impact.

For a straightforward workload on a dual package system, do you get more 
performance from two packages running at their lowest P state or from 
one package at its highest P state and a forced-idle package? Which 
consumes more power?
Shaohua Li June 29, 2009, 2:54 a.m. UTC | #11
On Sat, Jun 27, 2009 at 02:16:23AM +0800, Vaidyanathan Srinivasan wrote:
> * Shaohua Li <shaohua.li@intel.com> [2009-06-24 16:21:12]:
> 
> > On Wed, Jun 24, 2009 at 04:03:05PM +0800, Peter Zijlstra wrote:
> > > On Wed, 2009-06-24 at 15:47 +0800, Shaohua Li wrote:
> > > > On Wed, Jun 24, 2009 at 02:39:18PM +0800, Peter Zijlstra wrote:
> > > > > On Wed, 2009-06-24 at 12:13 +0800, Shaohua Li wrote:
> > > > > > This patch supports the processor aggregator device. When OS gets one ACPI
> > > > > > notification, the driver will idle some number of cpus.
> > > > > > 
> > > > > > To make CPU idle, the patch will create power saving thread. Scheduler
> > > > > > will migrate the thread to preferred CPU. The thread has max priority and
> > > > > > has SCHED_RR policy, so it can occupy one CPU. To save power, the thread will
> > > > > > keep calling C-state instruction. Routine power_saving_thread() is the entry
> > > > > > of the thread.
> > > > > > 
> > > > > > To avoid starvation, the thread will sleep 5% time for every second
> > > > > > (current RT scheduler has threshold to avoid starvation, but if other
> > > > > > CPUs are idle, the CPU can borrow CPU timer from other, so makes the mechanism
> > > > > > not work here)
> > > > > > 
> > > > > > This approach (to force CPU idle) should hasn't impact to scheduler and tasks
> > > > > > with affinity still can get chance to run even the tasks run on idled cpu. Any
> > > > > > comments/suggestions are welcome.
> > > > > 
> > > > > > +static int power_saving_thread(void *data)
> > > > > > +{
> > > > > > +	struct sched_param param = {.sched_priority = MAX_RT_PRIO - 1};
> > > > > > +	int do_sleep;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * we just create a RT task to do power saving. Scheduler will migrate
> > > > > > +	 * the task to any CPU.
> > > > > > +	 */
> > > > > > +	sched_setscheduler(current, SCHED_RR, &param);
> > > > > > +
> > > > > 
> > > > > This is crazy and wrong.
> > > > > 
> > > > > 1) cpusets can be so configured as to not have the full machine in a
> > > > > single load-balance domain, eg. the above comment about the scheduler is
> > > > > false.
> > > > Assume user will not assign such thread to a cpuset, if yes, it's user's
> > > > wrong.
> > > 
> > > No its user policy, and esp on large machines cpusets are very useful.
> > > The kernel not taking that into account is simply not an option.
> > > 
> > > Any thermal facility that doesn't take cpusets into account, or worse
> > > destroys user policy (the hotplug road), is a full stop in my book.
> > > 
> > > Is similar to the saying the customer is always right, sure the admin
> > > can indeed configure the machine so that any thermal policy is indeed
> > > doomed to fail, and in that case I would print some warnings into syslog
> > > and let the machine die of thermal overload -- not our problem.
> > > 
> > > The thing is, the admin configures it in a way, and then expects it to
> > > work like that. If any random event can void the guarantees what good
> > > are they?
> > > 
> > > Now, if ACPI-4.0 is so broken that it simply cannot support a sane
> > > thermal model, then I suggest we simply not support this feature and
> > > hope they will grow clue for 4.1 and try again next time.
> > The assumption is user not assigns power saving thread to a specific cpuset.
> > I thought the assumption is feasible, user can assign threads they care about
> > to a cpuset, but not all.
> > Power saving thread stays at the top cpuset, so it still has chance to run on any
> > cpus. If power saving thread runs on a cpu, the tasks on the cpu still have chance
> > to run (at least 0.05s), so it does not completely break user policy.
> 
> How do we handle interrupts and timers during this interval?  You seem
> to disable interrupts and hold the cpu at idle for 0.95 sec.  It may
> cause timeouts and overflows for network interrupts right?
The x86 mwait/monitor instructions can detect an interrupt and complete
execution even when interrupts are disabled, so this isn't an issue.
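
For reference, a sketch of that idle step: with ecx bit 0 set (the "interrupt
break-event" hint), MWAIT wakes on a pending interrupt even while interrupts
are disabled, and the interrupt is serviced once they are re-enabled.
__monitor()/__mwait() are the usual x86 helpers; the surrounding checks are
illustrative:

#include <linux/sched.h>
#include <linux/irqflags.h>
#include <asm/processor.h>	/* __monitor()/__mwait() (asm/mwait.h on newer trees) */

static void do_mwait_idle(void)
{
	unsigned long cx_hint = 0;	/* C-state hint; the driver picks a deeper state */

	local_irq_disable();
	if (!need_resched()) {
		/* arm the monitor on this thread's flags word */
		__monitor((void *)&current_thread_info()->flags, 0, 0);
		smp_mb();
		if (!need_resched())
			__mwait(cx_hint, 1);	/* ecx=1: break on interrupt */
	}
	local_irq_enable();
}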

> Next issue is halting sibling threads belonging to a core at the same
> time to have any power/thermal benefit.  Who does the coordination for
> forced idle in this approach?
Nobody does the coordination. Halting some threads, even if they belong to the
same core, is the best we can provide now. In the future, if the scheduler
approach really works, we will happily use it.

Thanks,
Shaohua
Vaidyanathan Srinivasan July 6, 2009, 6:03 p.m. UTC | #12
* Shaohua Li <shaohua.li@intel.com> [2009-06-29 10:54:55]:

[snip]

> > 
> > How do we handle interrupts and timers during this interval?  You seem
> > to disable interrupts and hold the cpu at idle for 0.95 sec.  It may
> > cause timeouts and overflows for network interrupts right?
> The x86 mwait/monitor instruction can detect interrupt and complete execution
> even interrupt is disabled, so this isn't an issue.

Cool, this will save a lot of trouble :)
 
> > Next issue is halting sibling threads belonging to a core at the same
> > time to have any power/thermal benefit.  Who does the coordination for
> > forced idle in this approach?
> Nobody does the coordination. Halt some threads even they belong to a core
> is the best we can provide now. For future, if the scheduler approach really
> works, we will happily use it.

Do you have some indicative data to show that arbitrarily force-idling
hardware threads reduces power and heat?  It would be very good if
this works without coordination among siblings.  Basically what you
are saying is that running one or two of the force-idle threads in
a 16-thread system provides a significant reduction in power even
without explicit methods to ensure that they idle sibling threads
belonging to the same core.

--Vaidy

Andi Kleen July 6, 2009, 11:43 p.m. UTC | #13
On Mon, Jul 06, 2009 at 11:33:12PM +0530, Vaidyanathan Srinivasan wrote:
> > > 
> > > How do we handle interrupts and timers during this interval?  You seem
> > > to disable interrupts and hold the cpu at idle for 0.95 sec.  It may
> > > cause timeouts and overflows for network interrupts right?
> > The x86 mwait/monitor instruction can detect interrupt and complete execution
> > even interrupt is disabled, so this isn't an issue.
> 
> Cool, this will save a lot of trouble :)

This is only available in newer Intel CPUs. 

-Andi
venkip July 7, 2009, 12:50 a.m. UTC | #14
>-----Original Message-----
>From: Andi Kleen [mailto:andi@firstfloor.org] 
>Sent: Monday, July 06, 2009 4:44 PM
>To: Vaidyanathan Srinivasan
>Cc: Li, Shaohua; Peter Zijlstra; linux-acpi@vger.kernel.org; 
>tglx@linutronix.de; lenb@kernel.org; Pallipadi, Venkatesh; 
>andi@firstfloor.org; Ingo Molnar
>Subject: Re: [PATCH]new ACPI processor driver to force CPUs idle
>
>On Mon, Jul 06, 2009 at 11:33:12PM +0530, Vaidyanathan 
>Srinivasan wrote:
>> > > 
>> > > How do we handle interrupts and timers during this 
>interval?  You seem
>> > > to disable interrupts and hold the cpu at idle for 0.95 
>sec.  It may
>> > > cause timeouts and overflows for network interrupts right?
>> > The x86 mwait/monitor instruction can detect interrupt and 
>complete execution
>> > even interrupt is disabled, so this isn't an issue.
>> 
>> Cool, this will save a lot of trouble :)
>
>This is only available in newer Intel CPUs. 
>

This is present in Intel Core/Core 2 and beyond.
Shaohua's patch checks for this in the init path:

> 
> +	if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
> +	    !(ecx & CPUID5_ECX_INTERRUPT_BREAK))
> +		return;
>

and -EINVALing when the feature is not present. So,
that will disable the feature on older CPUs.
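
A sketch of that init-path check, using the standard CPUID leaf-5
(MONITOR/MWAIT) flags; the helper name is illustrative:

#include <asm/processor.h>	/* cpuid(), boot_cpu_data */

/* CPUID leaf 5 feature bits, values as in the x86 headers */
#define CPUID_MWAIT_LEAF			5
#define CPUID5_ECX_EXTENSIONS_SUPPORTED		0x1
#define CPUID5_ECX_INTERRUPT_BREAK		0x2

static int mwait_interrupt_break_supported(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
		return 0;

	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);

	/* both bits must be set, otherwise the driver refuses to load */
	return (ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) &&
	       (ecx & CPUID5_ECX_INTERRUPT_BREAK);
}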

Thanks,
Venki
Peter Zijlstra July 7, 2009, 8:24 a.m. UTC | #15
On Fri, 2009-06-26 at 12:46 -0400, Len Brown wrote:

> We'd like to ship the forced-idle thread as a self-contained driver,
> if possilbe.  Because that would enable us to easily back-port it
> to some enterprise releases that want the feature.  So if we can
> implement this such that it is functional with existing scheduler
> facilities, that would be get us by.  If the scheduler evolves
> and provides a more optimal mechanism in the future, then that is
> great, as long as we don't have to wait for that to provide
> the basic version of the feature.

I don't think we should ever merge anything because of backports for
stale kernels.

If you want it in enterprise stuff all you need is something upstream,
and that something should be the scheduler-driven bits.

After that you can argue the full backport isn't possible (should be
quite easy given that enterprise kernels still think kABI is something
sane) and propose this hack for them.

Len Brown July 10, 2009, 7:31 p.m. UTC | #16
> Do you have some indicative data to show that arbitrary force-idling
> of hardware threads reduce power and heat?  It will be very good if
> this works without coordination among siblings.  Basically what you
> are saying is that running one or two of the force-idle threads in
> a 16-thread system provides significant reduction in power even
> without explicit methods to ensure that they idle sibling threads
> belonging to same core.

Re: siblings

Idling one HT sibling w/o idling the other saves very little power.
(On the flip side, the good news is that enabling HT consumes
 very little power:-)

So to save the most power possible,
the latest patch will attempt to idle 
threads in pairs of siblings.

But again, the driver is simply the messenger.  If the user
asks us to idle an odd number of threads, then one of them
will idle without idling its sibling.
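
(Illustration only -- the posted patch does not pin its threads, it lets
the scheduler place them.  One hypothetical way to bias a pair of
forced-idle threads onto the two HT siblings of a core, assuming the
usual topology_thread_cpumask()/set_cpus_allowed_ptr() helpers:)

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/topology.h>

/* Hypothetical helper, not part of the patch: put two power-saving
 * threads on the two hardware threads of one core, so the whole core
 * can reach a deep C-state. */
static void bind_idle_pair(struct task_struct *t1, struct task_struct *t2,
			   unsigned int cpu)
{
	unsigned int sibling = cpumask_any_but(topology_thread_cpumask(cpu), cpu);

	set_cpus_allowed_ptr(t1, cpumask_of(cpu));
	if (sibling < nr_cpu_ids)	/* only if the core really has a sibling */
		set_cpus_allowed_ptr(t2, cpumask_of(sibling));
}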

cheers,
-Len Brown, Intel Open Source Technology Center


Len Brown July 10, 2009, 7:47 p.m. UTC | #17
> > > Hmm, would fully idling a socket not be more efficient (throughput wise)
> > > than forcing everybody into P states?
> > 
> > Nope.
> > 
> > Low Frequency Mode (LFM), aka Pn - the deepest P-state,
> > is the lowest energy/instruction because it is the highest
> > frequency available at the lowest voltage that can still
> > retire instructions.
> 
> This is true if you want to retire instructions.  But in case you want
> to stop retiring instructions and hold cores in idle, then idling
> the complete package will be more efficient, right?


No.  Efficiency = Work/Energy
Thus if Work=0, then Efficiency=0.

>  At least you will need
> to idle all the sibling threads at the same time to save power in
> a core.

Yes.  Both HT siblings need to be idled in a core
for it to significantly reduce power.

> > That is why it is the first method used -- it returns the
> > highest power_savings/performance_impact.
> 
> Depending on what is running in the system, force idling cores may
> help reduce average power as compared to running all cores at lowest
> P-state.

The workloads that we've measured show that reducing P-states
has a smaller impact on average performance than idling cores.

> > The power and thermal monitoring are out-of-band in the platform,
> > so Linux is not (currently) part of a closed control loop.
> > However, Linux is part of the control, and the loop is indeed closed:-)
> 
> The more we can include Linux in the control loop, the better we can
> react to the situation with the least performance impact.

Some vendors prefer to control things in-band,
some prefer to control them out-of-band.

I'm agnostic.  Vendors should be free to provision, and customers
should be free to run systems in the way that they choose.

> > > The thing I'm thinking off is vaidy's load-balancer changes that take an
> > > overload packing argument.
> > > 
> > > If we can couple that to the ACPI driver in a closed feedback loop we
> > > have automagic tuning.
> > 
> > I think that those changes are probably fancier than we need for
> > this simple mechanism right now -- though if they ended up being different
> > ways to use the same code in the long run, that would be fine.
> 
> I agree that the load-balancer approach is more complex and has
> challenges.

The main challenge of the load-balancer approach is
that it is not available to be shipped today.

> But it does have long term benefits because we can
> utilise the scheduler's knowledge of system topology and current
> system load to arrive at what is best.

> > > Some integration with P states might be interesting to think about. But
> > > as it stands getting that load-balancer placement stuff fixed seems like
> > > enough fun ;-)
> > 
> > I think that we already have an issue with scheduler vs P-states,
> > as the scheduler is handing out buckets of time assuming that 
> > they are all equal.  However, a high-frequency bucket is more valuable
> > than a low frequency bucket.  So probably the scheduler should be tracking
> > cycles rather than time...
> > 
> > But that is independent of the forced-idle thread issue at hand.
> > 
> > We'd like to ship the forced-idle thread as a self-contained driver,
> > if possible.  Because that would enable us to easily back-port it
> > to some enterprise releases that want the feature.  So if we can
> > implement this such that it is functional with existing scheduler
> > facilities, that would get us by.  If the scheduler evolves
> > and provides a more optimal mechanism in the future, then that is
> > great, as long as we don't have to wait for that to provide
> > the basic version of the feature.
> 
> ok, so if you want a solution that would work on older distros also,
> then your choices are limited.  For backports, perhaps this module
> will work, but should not be a baseline solution for future.

The current driver receives the number of CPUs to idle from
the system, and spawns that many forced-idle threads.

When Linux has a method better than spawning forced-idle-threads,
we'll gladly update the driver to use it...

thanks,
-Len Brown, Intel Open Source Technology Center

Len Brown July 10, 2009, 8:29 p.m. UTC | #18
> > Low Frequency Mode (LFM), aka Pn - the deepest P-state,
> > is the lowest energy/instruction because it is the highest
> > frequency available at the lowest voltage that can still
> > retire instructions.
> > 
> > That is why it is the first method used -- it returns the
> > highest power_savings/performance_impact.
> 
> For a straightforward workload on a dual package system, do you get more 
> performance from two packages running at their lowest P state or from 
> one package at its highest P state and a forced-idle package?
>
> Which consumes more power?

For simplicity, let's say...

Pn = P1/2, e.g. P1 = 3 GHz and Pn = 1.5 GHz.
Let's assume that the application
has zero cache and memory footprint
and performance = CPU cycles.

In that case, we could choose to either cut the
frequency of 8 threads in half, or cut the frequency
of 4 threads to 0 -- and we have a "cycle count"
performance wash.
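
(Putting numbers on the example above: 8 threads at 1.5 GHz retire
8 x 1.5G = 12G cycles/s, while 4 threads at 3 GHz plus 4 forced-idle
threads retire 4 x 3G = 12G cycles/s -- the same cycle budget either way.)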

Reality is that applications do care about memory,
and so doubling the available cache is a performance
win for P-states.  If the application is something that
notices latency of multiple threads sharing a CPU
vs a thread/CPU, then P-states would again win
because of more available cores.

Current Intel multi-package systems don't quite get
down to P1/2 -- so the deepest P-state would actually
not have quite as much performance impact.
Last AMD system I saw got down to 1GHz, so on that
system P-state could (potentially) have a performance
impact greater than off-lining.

Power is more complicated.

We discussed not saving much when an HT sibling is taken
off line, right?  The converse is also true: HT siblings
are almost free in terms of power.  So even though an HT
sibling doesn't have the performance of a dedicated core,
it is actually one of the best performance/watt parts
of the system on many workloads.  So you want to disable
this last, not first, by exhausting P-states before
you take your siblings off line.

Let's discuss turbo mode.
Think of turbo in these terms...
The ideal system to the HW designers is one that can
spend a fixed power+electrical budget in any way
on performance.
i.e. when some cores are idle, spend that budget
to make the other cores go faster -- faster even
than the advertised P0 frequency.

As turbo has the highest voltage, it has the lowest
efficiency in terms of instruction/energy.
Thus it is very important that in power-limited scenarios
turbo (aka P0) be disabled first.

There is a cross-over where a highly optimized C-state
will have less impact on total system efficiency
than a reduced P-state.  However, we don't see that
cross-over on current Intel systems on the workloads measured.
The reason is that they set Pn to the minimum voltage
where P-states are useful and run as fast as they
can at that voltage.  If the system were throttled
say via T-states, down to an extremely low clock rate,
then C-states would win sooner.

-Len Brown, Intel Open Source Technology Center



Len Brown July 10, 2009, 8:41 p.m. UTC | #19
> > We'd like to ship the forced-idle thread as a self-contained driver,
> > if possible.  Because that would enable us to easily back-port it
> > to some enterprise releases that want the feature.  So if we can
> > implement this such that it is functional with existing scheduler
> > facilities, that would get us by.  If the scheduler evolves
> > and provides a more optimal mechanism in the future, then that is
> > great, as long as we don't have to wait for that to provide
> > the basic version of the feature.
> 
> I don't think we should ever merge anything because of backports for
> stale kernels.
> 
> If you want it in enterprise stuff all you need is something upstream,
> and that something should be the scheduler driven bits.

Peter,
I'm glad that you agree that the forced-idle goals are worthwhile,
and I'm delighted that you offer the scheduler as the place to
implement a solution to this problem.

When will we see a scheduler API that we can call from our
ACPI notification driver to create idle-time on a busy system?

As you've seen in the driver, we are notified simply of the
total number of hardware threads to idle.  Linux has
full freedom in selecting (and even moving) which processors are
idled.
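
(Purely as a hypothetical sketch of the shape of hook being asked for;
no such scheduler interface exists today.)

/* Hypothetical, for discussion only: ask the scheduler to keep 'num'
 * CPUs forced idle, chosen and migrated at its own discretion.  Returns
 * how many CPUs it is actually idling, so the caller could report that
 * back to the platform via _OST. */
int sched_set_forced_idle_cpus(unsigned int num);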

thanks,
-Len Brown, Intel Open Source Technology Center




diff mbox

Patch

Index: linux/drivers/acpi/Kconfig
===================================================================
--- linux.orig/drivers/acpi/Kconfig	2009-06-23 15:34:33.000000000 +0800
+++ linux/drivers/acpi/Kconfig	2009-06-23 15:52:53.000000000 +0800
@@ -196,6 +196,17 @@  config ACPI_HOTPLUG_CPU
 	select ACPI_CONTAINER
 	default y
 
+config ACPI_PROCESSOR_AGGREGATOR
+	tristate "Processor Aggregator"
+	depends on ACPI_PROCESSOR
+	depends on EXPERIMENTAL
+	help
+	  ACPI 4.0 defines the processor Aggregator device, which enables the
+	  OS to perform specific processor configuration and control that
+	  applies to all processors in the platform.  Currently only logical
+	  processor idling is defined, which is used to reduce power
+	  consumption.  This driver supports the new device.
+
 config ACPI_THERMAL
 	tristate "Thermal Zone"
 	depends on ACPI_PROCESSOR
Index: linux/drivers/acpi/Makefile
===================================================================
--- linux.orig/drivers/acpi/Makefile	2009-06-23 15:34:33.000000000 +0800
+++ linux/drivers/acpi/Makefile	2009-06-23 15:52:53.000000000 +0800
@@ -61,3 +61,5 @@  obj-$(CONFIG_ACPI_SBS)		+= sbs.o
 processor-y			:= processor_core.o processor_throttling.o
 processor-y			+= processor_idle.o processor_thermal.o
 processor-$(CONFIG_CPU_FREQ)	+= processor_perflib.o
+
+obj-$(CONFIG_ACPI_PROCESSOR_AGGREGATOR) += processor_aggregator.o
Index: linux/drivers/acpi/processor_aggregator.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/drivers/acpi/processor_aggregator.c	2009-06-24 11:58:33.000000000 +0800
@@ -0,0 +1,389 @@ 
+/*
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or (at
+ *  your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful, but
+ *  WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License along
+ *  with this program; if not, write to the Free Software Foundation, Inc.,
+ *  59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
+ */
+#include <linux/kernel.h>
+#include <linux/cpumask.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/cpu.h>
+#include <linux/clockchips.h>
+#include <acpi/acpi_bus.h>
+#include <acpi/acpi_drivers.h>
+
+#define ACPI_PROCESSOR_AGGREGATOR_CLASS	"processor_aggregator"
+#define ACPI_PROCESSOR_AGGREGATOR_DEVICE_NAME "Processor Aggregator"
+#define ACPI_PROCESSOR_AGGREGATOR_NOTIFY 0x80
+static DEFINE_MUTEX(isolated_cpus_lock);
+
+#define MWAIT_SUBSTATE_MASK	(0xf)
+#define MWAIT_CSTATE_MASK	(0xf)
+#define MWAIT_SUBSTATE_SIZE	(4)
+#define CPUID_MWAIT_LEAF (5)
+#define CPUID5_ECX_EXTENSIONS_SUPPORTED (0x1)
+#define CPUID5_ECX_INTERRUPT_BREAK	(0x2)
+static unsigned long power_saving_mwait_eax;
+static void power_saving_mwait_init(void)
+{
+	unsigned int eax, ebx, ecx, edx;
+	unsigned int highest_cstate = 0;
+	unsigned int highest_subcstate = 0;
+	int i;
+
+	if (!boot_cpu_has(X86_FEATURE_MWAIT))
+		return;
+	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+		return;
+
+	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
+
+	if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
+	    !(ecx & CPUID5_ECX_INTERRUPT_BREAK))
+		return;
+
+	edx >>= MWAIT_SUBSTATE_SIZE;
+	for (i = 0; i < 7 && edx; i++, edx >>= MWAIT_SUBSTATE_SIZE) {
+		if (edx & MWAIT_SUBSTATE_MASK) {
+			highest_cstate = i;
+			highest_subcstate = edx & MWAIT_SUBSTATE_MASK;
+		}
+	}
+	power_saving_mwait_eax = (highest_cstate << MWAIT_SUBSTATE_SIZE) |
+		(highest_subcstate - 1);
+
+	for_each_online_cpu(i)
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ON, &i);
+
+#if defined (CONFIG_GENERIC_TIME) && defined (CONFIG_X86)
+	switch (boot_cpu_data.x86_vendor) {
+	case X86_VENDOR_AMD:
+	case X86_VENDOR_INTEL:
+		/*
+		 * AMD Fam10h TSC will tick in all
+		 * C/P/S0/S1 states when this bit is set.
+		 */
+		if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
+			return;
+
+		/*FALL THROUGH*/
+	default:
+		/* TSC could halt in idle, so notify users */
+		mark_tsc_unstable("TSC halts in idle");
+	}
+#endif
+}
+
+static int power_saving_thread(void *data)
+{
+	struct sched_param param = {.sched_priority = MAX_RT_PRIO - 1};
+	int do_sleep;
+
+	/*
+	 * We just create an RT task to do power saving.  The scheduler will
+	 * migrate the task to any CPU.
+	 */
+	sched_setscheduler(current, SCHED_RR, &param);
+
+	while (!kthread_should_stop()) {
+		int cpu;
+		u64 expire_time;
+
+		try_to_freeze();
+
+		do_sleep = 0;
+
+		current_thread_info()->status &= ~TS_POLLING;
+		/*
+		 * TS_POLLING-cleared state must be visible before we test
+		 * NEED_RESCHED:
+		 */
+		smp_mb();
+
+		expire_time = jiffies + HZ * 95 / 100;
+
+		while (!need_resched()) {
+			local_irq_disable();
+			cpu = smp_processor_id();
+			clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+			stop_critical_timings();
+
+			__monitor((void *)&current_thread_info()->flags, 0, 0);
+			smp_mb();
+			if (!need_resched())
+				__mwait(power_saving_mwait_eax, 1);
+
+			start_critical_timings();
+			clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+			local_irq_enable();
+
+			if (time_after(jiffies, expire_time)) {
+				do_sleep = 1;
+				break;
+			}
+		}
+
+		current_thread_info()->status |= TS_POLLING;
+
+		/*
+		 * The current sched_rt code has a threshold for RT task run
+		 * time: when an RT task uses 95% of the CPU time, it is
+		 * scheduled out for 5% so that other tasks are not starved.
+		 * But that only works when every CPU has an RT task running;
+		 * if one CPU has none, RT tasks on other CPUs can borrow its
+		 * CPU time and exceed the 95% limit.  To keep starvation
+		 * avoidance working, take a nap here.
+		 */
+		if (do_sleep)
+			schedule_timeout_killable(HZ * 5 / 100);
+	}
+	return 0;
+}
+
+static struct task_struct *ps_tsks[NR_CPUS];
+static unsigned int ps_tsk_num;
+static int create_power_saving_task(void)
+{
+	ps_tsks[ps_tsk_num] = kthread_run(power_saving_thread, NULL,
+		"power_saving/%d", ps_tsk_num);
+	if (!IS_ERR(ps_tsks[ps_tsk_num])) {
+		ps_tsk_num++;
+		return 0;
+	}
+	return -EINVAL;
+}
+
+static void destroy_power_saving_task(void)
+{
+	if (ps_tsk_num > 0) {
+		ps_tsk_num--;
+		kthread_stop(ps_tsks[ps_tsk_num]);
+	}
+}
+
+static void set_power_saving_task_num(unsigned int num)
+{
+	if (num > ps_tsk_num) {
+		while (ps_tsk_num < num) {
+			if (create_power_saving_task())
+				return;
+		}
+	} else if (num < ps_tsk_num) {
+		while (ps_tsk_num > num)
+			destroy_power_saving_task();
+	}
+}
+
+static int acpi_processor_aggregator_idle_cpus(unsigned int num_cpus)
+{
+	get_online_cpus();
+
+	num_cpus = min_t(unsigned int, num_cpus, num_online_cpus());
+	set_power_saving_task_num(num_cpus);
+
+	put_online_cpus();
+	return 0;
+}
+
+static uint32_t acpi_processor_aggregator_idle_cpus_num(void)
+{
+	return ps_tsk_num;
+}
+
+static ssize_t acpi_processor_aggregator_idlecpus_store(struct device *dev,
+	struct device_attribute *attr, const char *buf, size_t count)
+{
+	unsigned long num;
+	if (strict_strtoul(buf, 0, &num))
+		return -EINVAL;
+	mutex_lock(&isolated_cpus_lock);
+	acpi_processor_aggregator_idle_cpus(num);
+	mutex_unlock(&isolated_cpus_lock);
+	return count;
+}
+
+static ssize_t acpi_processor_aggregator_idlecpus_show(struct device *dev,
+	struct device_attribute *attr, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%d",
+		acpi_processor_aggregator_idle_cpus_num());
+}
+static DEVICE_ATTR(idlecpus, S_IRUGO|S_IWUSR,
+	acpi_processor_aggregator_idlecpus_show,
+	acpi_processor_aggregator_idlecpus_store);
+
+static int acpi_processor_aggregator_add_sysfs(struct acpi_device *device)
+{
+	int result;
+
+	result = device_create_file(&device->dev, &dev_attr_idlecpus);
+	if (result)
+		return -ENODEV;
+	return 0;
+}
+
+static void acpi_processor_aggregator_remove_sysfs(struct acpi_device *device)
+{
+	device_remove_file(&device->dev, &dev_attr_idlecpus);
+}
+
+/* Query firmware how many CPUs should be idle */
+static int acpi_processor_aggregator_pur(acpi_handle handle, int *num_cpus)
+{
+	struct acpi_buffer buffer = {ACPI_ALLOCATE_BUFFER, NULL};
+	acpi_status status;
+	union acpi_object *package;
+	int rev, num, ret = -EINVAL;
+
+	status = acpi_evaluate_object(handle, "_PUR", NULL, &buffer);
+	if (ACPI_FAILURE(status))
+		return -EINVAL;
+	package = buffer.pointer;
+	if (package->type != ACPI_TYPE_PACKAGE || package->package.count != 2)
+		goto out;
+	rev = package->package.elements[0].integer.value;
+	num = package->package.elements[1].integer.value;
+	if (rev != 1)
+		goto out;
+	*num_cpus = num;
+	ret = 0;
+out:
+	kfree(buffer.pointer);
+	return ret;
+}
+
+/* Notify firmware how many CPUs are idle */
+static void acpi_processor_aggregator_ost(acpi_handle handle, int stat,
+	uint32_t idle_cpus)
+{
+	union acpi_object params[3] = {
+		{.type = ACPI_TYPE_INTEGER,},
+		{.type = ACPI_TYPE_INTEGER,},
+		{.type = ACPI_TYPE_BUFFER,},
+	};
+	struct acpi_object_list arg_list = {3, params};
+
+	params[0].integer.value = ACPI_PROCESSOR_AGGREGATOR_NOTIFY;
+	params[1].integer.value =  stat;
+	params[2].buffer.length = 4;
+	params[2].buffer.pointer = (void *)&idle_cpus;
+	acpi_evaluate_object(handle, "_OST", &arg_list, NULL);
+}
+
+static void acpi_processor_aggregator_handle_notify(acpi_handle handle)
+{
+	int num_cpus, ret;
+	uint32_t idle_cpus;
+
+	mutex_lock(&isolated_cpus_lock);
+	if (acpi_processor_aggregator_pur(handle, &num_cpus)) {
+		mutex_unlock(&isolated_cpus_lock);
+		return;
+	}
+	ret = acpi_processor_aggregator_idle_cpus(num_cpus);
+	idle_cpus = acpi_processor_aggregator_idle_cpus_num();
+	if (!ret)
+		acpi_processor_aggregator_ost(handle, 0, idle_cpus);
+	else
+		acpi_processor_aggregator_ost(handle, 1, 0);
+	mutex_unlock(&isolated_cpus_lock);
+}
+
+static void acpi_processor_aggregator_notify(acpi_handle handle, u32 event,
+	void *data)
+{
+	struct acpi_device *device = data;
+
+	switch (event) {
+	case ACPI_PROCESSOR_AGGREGATOR_NOTIFY:
+		acpi_processor_aggregator_handle_notify(handle);
+		acpi_bus_generate_proc_event(device, event, 0);
+		acpi_bus_generate_netlink_event(device->pnp.device_class,
+			dev_name(&device->dev), event, 0);
+		break;
+	default:
+		printk(KERN_WARNING"Unsupported event [0x%x]\n", event);
+		break;
+	}
+}
+
+static int acpi_processor_aggregator_add(struct acpi_device *device)
+{
+	acpi_status status;
+
+	strcpy(acpi_device_name(device), ACPI_PROCESSOR_AGGREGATOR_DEVICE_NAME);
+	strcpy(acpi_device_class(device), ACPI_PROCESSOR_AGGREGATOR_CLASS);
+
+	if (acpi_processor_aggregator_add_sysfs(device))
+		return -ENODEV;
+
+	status = acpi_install_notify_handler(device->handle,
+		ACPI_DEVICE_NOTIFY, acpi_processor_aggregator_notify, device);
+	if (ACPI_FAILURE(status)) {
+		acpi_processor_aggregator_remove_sysfs(device);
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+static int acpi_processor_aggregator_remove(struct acpi_device *device, int type)
+{
+	mutex_lock(&isolated_cpus_lock);
+	acpi_processor_aggregator_idle_cpus(0);
+	mutex_unlock(&isolated_cpus_lock);
+
+	acpi_remove_notify_handler(device->handle,
+		ACPI_DEVICE_NOTIFY, acpi_processor_aggregator_notify);
+	acpi_processor_aggregator_remove_sysfs(device);
+	return 0;
+}
+
+static const struct acpi_device_id processor_aggregator_device_ids[] = {
+	{"ACPI000C", 0},
+	{"", 0},
+};
+MODULE_DEVICE_TABLE(acpi, processor_aggregator_device_ids);
+
+static struct acpi_driver acpi_processor_aggregator_driver = {
+	.name = "processor_aggregator",
+	.class = ACPI_PROCESSOR_AGGREGATOR_CLASS,
+	.ids = processor_aggregator_device_ids,
+	.ops = {
+		.add = acpi_processor_aggregator_add,
+		.remove = acpi_processor_aggregator_remove,
+	},
+};
+
+static int __init acpi_processor_aggregator_init(void)
+{
+	power_saving_mwait_init();
+	if (power_saving_mwait_eax == 0)
+		return -EINVAL;
+
+	return acpi_bus_register_driver(&acpi_processor_aggregator_driver);
+}
+
+static void __exit acpi_processor_aggregator_exit(void)
+{
+	acpi_bus_unregister_driver(&acpi_processor_aggregator_driver);
+}
+
+module_init(acpi_processor_aggregator_init);
+module_exit(acpi_processor_aggregator_exit);
+MODULE_AUTHOR("Shaohua Li<shaohua.li@intel.com>");
+MODULE_DESCRIPTION("ACPI Processor Aggregator Driver");
+MODULE_LICENSE("GPL");