diff mbox series

[RFC] x86, mce, therm_throt: Optimize notifications of thermal throttle

Message ID 20191025001924.10199-1-srinivas.pandruvada@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series [RFC] x86, mce, therm_throt: Optimize notifications of thermal throttle | expand

Commit Message

srinivas pandruvada Oct. 25, 2019, 12:19 a.m. UTC
Some modern systems have very tight thermal tolerances. Because of this
they may cross thermal thresholds when running normal workloads (even
during boot). The CPU hardware will react by limiting power/frequency
and using duty cycles to bring the temperature back into normal range.

Thus users may see a "critical" message about the "temperature above
threshold" which is soon followed by "temperature/speed normal". These
messages are rate limited, but still may repeat every few minutes.

A test run on a laptop with Intel 8th Gen i5 core for two hours with a
workload resulted in 20K+ thermal interrupts per CPU for core level and
another 20K+ interrupts at package level. The kernel logs were full of
throttling messages.

Brief background, on why there are so many thermal interrupts in a
modern system:
From IvyBridge, there is another offset called TCC offset is introduced.
This adds an offset to the real PROCHOT temperature target. So effectively
this interrupt is generated much before the PROCHOT. There will be several
very short throttling by the processor using adaptive thermal monitoring
at this threshold, instead of more aggressive action close to PROCHOT.
This offset is configured by OEMs and some tend to be more conservative
than others. So logging such events just generates noise in the logs.

The real value of these threshold interrupts, is to debug problems with
the external cooling solutions and performance issues due to excessive
throttling.

So the solution here:
- Show in the current thermal_throttle folder, the maximum time for one
throttling event and total amount of time, the system was in throttling
state.
- Don't log short excursions.
- Log only when, in spite of thermal throttling the temperature is rising.
This is done by monitoring temperature trend using three point moving
average. On the high threshold interrupt trigger a delayed workqueue,
which monitors the threshold violation log bit, calculates moving moving
average and logs when temperature trend is raising. When the log bit is
clear and temperature is below threshold temperature, it will print
"Normal" message. Once a high threshold event is logged, it rate limits
number of log messages.
- Reduce the logging severity to warning.

With this patch, on the same test laptop, no warnings are printed in logs
as the max time the processor could bring the temperature under control is
only 280 ms.

This implementation is done with the inputs from Alan Cox and Tony Luck.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 arch/x86/kernel/cpu/mce/therm_throt.c | 193 +++++++++++++++++++++++---
 1 file changed, 172 insertions(+), 21 deletions(-)

Comments

Borislav Petkov Nov. 5, 2019, 2:44 p.m. UTC | #1
On Thu, Oct 24, 2019 at 05:19:24PM -0700, Srinivas Pandruvada wrote:
> Some modern systems have very tight thermal tolerances. Because of this
> they may cross thermal thresholds when running normal workloads (even
> during boot). The CPU hardware will react by limiting power/frequency
> and using duty cycles to bring the temperature back into normal range.
> 
> Thus users may see a "critical" message about the "temperature above
> threshold" which is soon followed by "temperature/speed normal". These
> messages are rate limited, but still may repeat every few minutes.

rate-limited

> A test run on a laptop with Intel 8th Gen i5 core for two hours with a
> workload resulted in 20K+ thermal interrupts per CPU for core level and
> another 20K+ interrupts at package level. The kernel logs were full of
> throttling messages.
> 
> Brief background, on why there are so many thermal interrupts in a
> modern system:
> From IvyBridge, there is another offset called TCC offset is introduced.

That sentence needs fixing.

> This adds an offset to the real PROCHOT temperature target. So effectively
> this interrupt is generated much before the PROCHOT. There will be several
> very short throttling by the processor using adaptive thermal monitoring
  ^^^^^^^^^^^^^^^^^^^^^^

"There will be several very short throttling" reads funny.

> at this threshold, instead of more aggressive action close to PROCHOT.
> This offset is configured by OEMs and some tend to be more conservative
> than others. So logging such events just generates noise in the logs.
> 
> The real value of these threshold interrupts, is to debug problems with
> the external cooling solutions and performance issues due to excessive
> throttling.
> 
> So the solution here:
> - Show in the current thermal_throttle folder, the maximum time for one
> throttling event and total amount of time, the system was in throttling
> state.
> - Don't log short excursions.
> - Log only when, in spite of thermal throttling the temperature is rising.
> This is done by monitoring temperature trend using three point moving
> average. On the high threshold interrupt trigger a delayed workqueue,
> which monitors the threshold violation log bit, calculates moving moving

What is the "threshold violation log bit" ? THERM_STATUS_PROCHOT_LOG ?

s/moving moving/moving/

Please read your commit message before sending to check whether it makes
any sense. Commit messages are not write-only.

> average and logs when temperature trend is raising. When the log bit is
> clear and temperature is below threshold temperature, it will print
> "Normal" message. Once a high threshold event is logged, it rate limits
> number of log messages.
> - Reduce the logging severity to warning.

I already took the reducing printk severity patch, you'd need to redo
yours ontop of tip/master.

> With this patch, on the same test laptop, no warnings are printed in logs
> as the max time the processor could bring the temperature under control is
> only 280 ms.
> 
> This implementation is done with the inputs from Alan Cox and Tony Luck.
> 
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/mce/therm_throt.c | 193 +++++++++++++++++++++++---
>  1 file changed, 172 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
> index 6e2becf547c5..e06f8d475207 100644
> --- a/arch/x86/kernel/cpu/mce/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mce/therm_throt.c
> @@ -47,8 +47,19 @@ struct _thermal_state {
>  	bool			new_event;
>  	int			event;
>  	u64			next_check;
> +	u64			last_interrupt_time;
> +	struct delayed_work	therm_work;
>  	unsigned long		count;
>  	unsigned long		last_count;
> +	unsigned long		max_time_ms;
> +	unsigned long		total_time_ms;
> +	int			rate_control_active;

That wants to be a bool judging by the context it is used in.

> +	int			level;
> +	int			sample_index;
> +	int			sample_count;
> +	int			average;
> +	int			baseline_temp;
> +	u8			temp_samples[3];

All these new members (and old members) should be documented what they
are. Like, what is "max_time_ms", for example? I can find out from the
sysfs func names below but having comments explaining what those are is
much better.

>  };
>  
>  struct thermal_state {
> @@ -121,8 +132,22 @@ define_therm_throt_device_one_ro(package_throttle_count);
>  define_therm_throt_device_show_func(package_power_limit, count);
>  define_therm_throt_device_one_ro(package_power_limit_count);
>  
> +define_therm_throt_device_show_func(core_throttle, max_time_ms);
> +define_therm_throt_device_one_ro(core_throttle_max_time_ms);
> +
> +define_therm_throt_device_show_func(package_throttle, max_time_ms);
> +define_therm_throt_device_one_ro(package_throttle_max_time_ms);
> +
> +define_therm_throt_device_show_func(core_throttle, total_time_ms);
> +define_therm_throt_device_one_ro(core_throttle_total_time_ms);
> +
> +define_therm_throt_device_show_func(package_throttle, total_time_ms);
> +define_therm_throt_device_one_ro(package_throttle_total_time_ms);
> +
>  static struct attribute *thermal_throttle_attrs[] = {
>  	&dev_attr_core_throttle_count.attr,
> +	&dev_attr_core_throttle_max_time_ms.attr,
> +	&dev_attr_core_throttle_total_time_ms.attr,
>  	NULL
>  };
>  
> @@ -135,6 +160,95 @@ static const struct attribute_group thermal_attr_group = {
>  #define CORE_LEVEL	0
>  #define PACKAGE_LEVEL	1
>  
> +#define THERM_THROT_POLL_INTERVAL	HZ
> +#define THERM_STATUS_PROCHOT_LOG	BIT(1)
> +
> +static void therm_throt_clear_therm_status_log(int level)

That's a static function, called only once so prepending its name with
the "therm_throt_" prefix is pointless and doesn't help readability.
Having simply

	clear_therm_status_log(state->level);

in the code is clearer and shorter. Ditto for the other helper functions.

> +{
> +	u64 msr_val;
> +	int msr;
> +
> +	msr = (level == CORE_LEVEL) ? MSR_IA32_THERM_STATUS :
> +			MSR_IA32_PACKAGE_THERM_STATUS;

Make that a normal if-else statement for better readability:

	if (level == CORE_LEVEL)
		msr = MSR_IA32_THERM_STATUS;
	else
		msr = MSR_IA32_PACKAGE_THERM_STATUS;

> +	rdmsrl(msr, msr_val);

Is that rdmsrl() always going to succeed here or you need to handle a
possible error it returns?

> +	wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);

Same here.

> +}
> +
> +static void therm_throt_get_therm_status(int level, int *proc_hot, int *temp)
> +{
> +	u64 msr_val;
> +	int msr;
> +
> +	msr = (level == CORE_LEVEL) ? MSR_IA32_THERM_STATUS :
> +			MSR_IA32_PACKAGE_THERM_STATUS;
> +	rdmsrl(msr, msr_val);
> +	*proc_hot = msr_val & THERM_STATUS_PROCHOT_LOG ? 1 : 0;
> +	*temp = (msr_val >> 16) & 0x7F;

Same comments as for the therm_throt_clear_therm_status_log() function above.

...

> @@ -178,27 +292,23 @@ static void therm_throt_process(bool new_event, int event, int level)
>  	if (new_event)
>  		state->count++;
>  
> -	if (time_before64(now, state->next_check) &&
> -			state->count != state->last_count)
> -		return;
> +	if (event == THERMAL_THROTTLING_EVENT) {

Save an indentation level:

	if (event != THERMAL_THROTTLING_EVENT)
		return;

	/* next statement starts here */

> +		if (new_event && !state->last_interrupt_time) {
> +			int hot;
>  
> -	state->next_check = now + CHECK_INTERVAL;
> -	state->last_count = state->count;
> +			therm_throt_get_therm_status(state->level, &hot, &state->baseline_temp);
>  
> -	/* if we just entered the thermal event */
> -	if (new_event) {
> -		if (event == THERMAL_THROTTLING_EVENT)
> -			pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
> -				this_cpu,
> -				level == CORE_LEVEL ? "Core" : "Package",
> -				state->count);
> -		return;
> -	}
> -	if (old_event) {
> -		if (event == THERMAL_THROTTLING_EVENT)
> -			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
> -				level == CORE_LEVEL ? "Core" : "Package");
> -		return;
> +			state->last_interrupt_time = now;
> +			schedule_delayed_work_on(this_cpu, &state->therm_work, THERM_THROT_POLL_INTERVAL);
> +		} else if (old_event && state->last_interrupt_time) {
> +			unsigned long throttle_time;
> +
> +			throttle_time = jiffies_delta_to_msecs(now - state->last_interrupt_time);
> +			if (throttle_time > state->max_time_ms)
> +				state->max_time_ms = throttle_time;
> +			state->total_time_ms += throttle_time;
> +			state->last_interrupt_time = 0;
> +		}
>  	}
>  }
>
srinivas pandruvada Nov. 5, 2019, 8:36 p.m. UTC | #2
On Tue, 2019-11-05 at 15:44 +0100, Borislav Petkov wrote:
> On Thu, Oct 24, 2019 at 05:19:24PM -0700, Srinivas Pandruvada wrote:

Thanks for the review.

> > Some modern systems have very tight thermal tolerances. Because of
> > this
> > they may cross thermal thresholds when running normal workloads
> > (even
> > during boot). The CPU hardware will react by limiting
> > power/frequency
> > and using duty cycles to bring the temperature back into normal
> > range.
> > 
> > Thus users may see a "critical" message about the "temperature
> > above
> > threshold" which is soon followed by "temperature/speed normal".
> > These
> > messages are rate limited, but still may repeat every few minutes.
> 
> rate-limited
OK

> 
> > A test run on a laptop with Intel 8th Gen i5 core for two hours
> > with a
> > workload resulted in 20K+ thermal interrupts per CPU for core level
> > and
> > another 20K+ interrupts at package level. The kernel logs were full
> > of
> > throttling messages.
> > 
> > Brief background, on why there are so many thermal interrupts in a
> > modern system:
> > From IvyBridge, there is another offset called TCC offset is
> > introduced.
> 
> That sentence needs fixing.
Will try.

> 
> > This adds an offset to the real PROCHOT temperature target. So
> > effectively
> > this interrupt is generated much before the PROCHOT. There will be
> > several
> > very short throttling by the processor using adaptive thermal
> > monitoring
> 
>   ^^^^^^^^^^^^^^^^^^^^^^
> 
> "There will be several very short throttling" reads funny.
Will try to remove "Fun" out of it.

> 
> > at this threshold, instead of more aggressive action close to
> > PROCHOT.
> > This offset is configured by OEMs and some tend to be more
> > conservative
> > than others. So logging such events just generates noise in the
> > logs.
> > 
> > The real value of these threshold interrupts, is to debug problems
> > with
> > the external cooling solutions and performance issues due to
> > excessive
> > throttling.
> > 
> > So the solution here:
> > - Show in the current thermal_throttle folder, the maximum time for
> > one
> > throttling event and total amount of time, the system was in
> > throttling
> > state.
> > - Don't log short excursions.
> > - Log only when, in spite of thermal throttling the temperature is
> > rising.
> > This is done by monitoring temperature trend using three point
> > moving
> > average. On the high threshold interrupt trigger a delayed
> > workqueue,
> > which monitors the threshold violation log bit, calculates moving
> > moving
> 
> What is the "threshold violation log bit" ? THERM_STATUS_PROCHOT_LOG
> ?
Yes, BIT 1 of therm
MSR_IA32_THERM_STATUS/MSR_IA32_PACKAGE_THERM_STATUS.

> 
> s/moving moving/moving/
> 
> Please read your commit message before sending to check whether it
> makes
> any sense. Commit messages are not write-only.
Noted.

> 
> > average and logs when temperature trend is raising. When the log
> > bit is
> > clear and temperature is below threshold temperature, it will print
> > "Normal" message. Once a high threshold event is logged, it rate
> > limits
> > number of log messages.
> > - Reduce the logging severity to warning.
> 
> I already took the reducing printk severity patch, you'd need to redo
> yours ontop of tip/master.
I will rebase and remove this from the description.

> > 

[...]

> 
> > +	int			rate_control_active;
> 
> That wants to be a bool judging by the context it is used in.
I can change to bool, just didn't use it
https://yarchive.net/comp/linux/bool.html

> 
> > +	int			level;
> > +	int			sample_index;
> > +	int			sample_count;
> > +	int			average;
> > +	int			baseline_temp;
> > +	u8			temp_samples[3];
> 
> All these new members (and old members) should be documented what
> they
> are. Like, what is "max_time_ms", for example? I can find out from
> the
> sysfs func names below but having comments explaining what those are
> is
> much better.
> 
> >  };
I will do that.

> >  

[...]

> > +
> > +static void therm_throt_clear_therm_status_log(int level)
> 
> That's a static function, called only once so prepending its name
> with
> the "therm_throt_" prefix is pointless and doesn't help readability.
> Having simply
> 
> 	clear_therm_status_log(state->level);
> 
> in the code is clearer and shorter. Ditto for the other helper
> functions.
OK.

> 
> > +{
> > +	u64 msr_val;
> > +	int msr;
> > +
> > +	msr = (level == CORE_LEVEL) ? MSR_IA32_THERM_STATUS :
> > +			MSR_IA32_PACKAGE_THERM_STATUS;
> 
> Make that a normal if-else statement for better readability:
> 
OK

> 	if (level == CORE_LEVEL)
> 		msr = MSR_IA32_THERM_STATUS;
> 	else
> 		msr = MSR_IA32_PACKAGE_THERM_STATUS;
> 
> > +	rdmsrl(msr, msr_val);
> 
> Is that rdmsrl() always going to succeed here or you need to handle a
> possible error it returns?
> 
It should not.
They are architectural MSRs and the fact that we are getting called
means that they are enabled by looking at CPUID bits. Also this MSR was
read before once in intel_thermal_interrupt(). 


> > +	wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
> 
> Same here.

> 
It shouldn't fail for local CPU write. But I can add error handling.


> > +}
> > +
> > +static void therm_throt_get_therm_status(int level, int *proc_hot,
> > int *temp)
> > +{
> > +	u64 msr_val;
> > +	int msr;
> > +
> > +	msr = (level == CORE_LEVEL) ? MSR_IA32_THERM_STATUS :
> > +			MSR_IA32_PACKAGE_THERM_STATUS;
> > +	rdmsrl(msr, msr_val);
> > +	*proc_hot = msr_val & THERM_STATUS_PROCHOT_LOG ? 1 : 0;
> > +	*temp = (msr_val >> 16) & 0x7F;
> 
> Same comments as for the therm_throt_clear_therm_status_log()
> function above.
> 
OK

> ...
> 
> > @@ -178,27 +292,23 @@ static void therm_throt_process(bool
> > new_event, int event, int level)
> >  	if (new_event)
> >  		state->count++;
> >  
> > -	if (time_before64(now, state->next_check) &&
> > -			state->count != state->last_count)
> > -		return;
> > +	if (event == THERMAL_THROTTLING_EVENT) {
> 
> Save an indentation level:
> 
OK

Thanks,
Srinivas

> 	if (event != THERMAL_THROTTLING_EVENT)
> 		return;
> 
> 	/* next statement starts here */
> 
> > +		if (new_event && !state->last_interrupt_time) {
> > +			int hot;
> >  
> > -	state->next_check = now + CHECK_INTERVAL;
> > -	state->last_count = state->count;
> > +			therm_throt_get_therm_status(state->level,
> > &hot, &state->baseline_temp);
> >  
> > -	/* if we just entered the thermal event */
> > -	if (new_event) {
> > -		if (event == THERMAL_THROTTLING_EVENT)
> > -			pr_crit("CPU%d: %s temperature above threshold,
> > cpu clock throttled (total events = %lu)\n",
> > -				this_cpu,
> > -				level == CORE_LEVEL ? "Core" :
> > "Package",
> > -				state->count);
> > -		return;
> > -	}
> > -	if (old_event) {
> > -		if (event == THERMAL_THROTTLING_EVENT)
> > -			pr_info("CPU%d: %s temperature/speed normal\n",
> > this_cpu,
> > -				level == CORE_LEVEL ? "Core" :
> > "Package");
> > -		return;
> > +			state->last_interrupt_time = now;
> > +			schedule_delayed_work_on(this_cpu, &state-
> > >therm_work, THERM_THROT_POLL_INTERVAL);
> > +		} else if (old_event && state->last_interrupt_time) {
> > +			unsigned long throttle_time;
> > +
> > +			throttle_time = jiffies_delta_to_msecs(now -
> > state->last_interrupt_time);
> > +			if (throttle_time > state->max_time_ms)
> > +				state->max_time_ms = throttle_time;
> > +			state->total_time_ms += throttle_time;
> > +			state->last_interrupt_time = 0;
> > +		}
> >  	}
> >  }
> >  
> 
>
Borislav Petkov Nov. 5, 2019, 8:56 p.m. UTC | #3
On Tue, Nov 05, 2019 at 12:36:32PM -0800, Srinivas Pandruvada wrote:
> > That wants to be a bool judging by the context it is used in.
> I can change to bool, just didn't use it
> https://yarchive.net/comp/linux/bool.html

And are you using it in a union or where the size of bool - which is
implementation-specific - plays any role, esp. in your particular use
case?

> They are architectural MSRs and the fact that we are getting called
> means that they are enabled by looking at CPUID bits.

If the CPUID bits guarantees their presence, then the error handling is
not absolutely necessary.

Thx.
srinivas pandruvada Nov. 5, 2019, 9:21 p.m. UTC | #4
On Tue, 2019-11-05 at 21:56 +0100, Borislav Petkov wrote:
> On Tue, Nov 05, 2019 at 12:36:32PM -0800, Srinivas Pandruvada wrote:
> > > That wants to be a bool judging by the context it is used in.
> > 
> > I can change to bool, just didn't use it
> > https://yarchive.net/comp/linux/bool.html
> 
> And are you using it in a union or where the size of bool - which is
> implementation-specific - plays any role, esp. in your particular use
> case?
No.

> 
> > They are architectural MSRs and the fact that we are getting called
> > means that they are enabled by looking at CPUID bits.
> 
> If the CPUID bits guarantees their presence, then the error handling
> is
> not absolutely necessary.
> 
> Thx.
>
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
index 6e2becf547c5..e06f8d475207 100644
--- a/arch/x86/kernel/cpu/mce/therm_throt.c
+++ b/arch/x86/kernel/cpu/mce/therm_throt.c
@@ -47,8 +47,19 @@  struct _thermal_state {
 	bool			new_event;
 	int			event;
 	u64			next_check;
+	u64			last_interrupt_time;
+	struct delayed_work	therm_work;
 	unsigned long		count;
 	unsigned long		last_count;
+	unsigned long		max_time_ms;
+	unsigned long		total_time_ms;
+	int			rate_control_active;
+	int			level;
+	int			sample_index;
+	int			sample_count;
+	int			average;
+	int			baseline_temp;
+	u8			temp_samples[3];
 };
 
 struct thermal_state {
@@ -121,8 +132,22 @@  define_therm_throt_device_one_ro(package_throttle_count);
 define_therm_throt_device_show_func(package_power_limit, count);
 define_therm_throt_device_one_ro(package_power_limit_count);
 
+define_therm_throt_device_show_func(core_throttle, max_time_ms);
+define_therm_throt_device_one_ro(core_throttle_max_time_ms);
+
+define_therm_throt_device_show_func(package_throttle, max_time_ms);
+define_therm_throt_device_one_ro(package_throttle_max_time_ms);
+
+define_therm_throt_device_show_func(core_throttle, total_time_ms);
+define_therm_throt_device_one_ro(core_throttle_total_time_ms);
+
+define_therm_throt_device_show_func(package_throttle, total_time_ms);
+define_therm_throt_device_one_ro(package_throttle_total_time_ms);
+
 static struct attribute *thermal_throttle_attrs[] = {
 	&dev_attr_core_throttle_count.attr,
+	&dev_attr_core_throttle_max_time_ms.attr,
+	&dev_attr_core_throttle_total_time_ms.attr,
 	NULL
 };
 
@@ -135,6 +160,95 @@  static const struct attribute_group thermal_attr_group = {
 #define CORE_LEVEL	0
 #define PACKAGE_LEVEL	1
 
+#define THERM_THROT_POLL_INTERVAL	HZ
+#define THERM_STATUS_PROCHOT_LOG	BIT(1)
+
+static void therm_throt_clear_therm_status_log(int level)
+{
+	u64 msr_val;
+	int msr;
+
+	msr = (level == CORE_LEVEL) ? MSR_IA32_THERM_STATUS :
+			MSR_IA32_PACKAGE_THERM_STATUS;
+	rdmsrl(msr, msr_val);
+	wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
+}
+
+static void therm_throt_get_therm_status(int level, int *proc_hot, int *temp)
+{
+	u64 msr_val;
+	int msr;
+
+	msr = (level == CORE_LEVEL) ? MSR_IA32_THERM_STATUS :
+			MSR_IA32_PACKAGE_THERM_STATUS;
+	rdmsrl(msr, msr_val);
+	*proc_hot = msr_val & THERM_STATUS_PROCHOT_LOG ? 1 : 0;
+	*temp = (msr_val >> 16) & 0x7F;
+}
+
+static void therm_throt_active_work(struct work_struct *work)
+{
+	struct _thermal_state *state = container_of(to_delayed_work(work),
+						struct _thermal_state, therm_work);
+	int i, avg, this_cpu, hot, temp;
+	u64 now = get_jiffies_64();
+
+	this_cpu = smp_processor_id();
+
+	therm_throt_get_therm_status(state->level, &hot, &temp);
+	/* temperature value is offset from the max so lesser means hotter */
+	if (!hot && temp > state->baseline_temp) {
+		if (state->rate_control_active)
+			pr_warn("CPU%d: %s temperature/speed normal (total events = %lu)\n",
+				this_cpu,
+				state->level == CORE_LEVEL ? "Core" : "Package",
+				state->count);
+
+		state->rate_control_active = 0;
+		return;
+	}
+
+	if (time_before64(now, state->next_check) &&
+			  state->rate_control_active)
+		goto re_arm;
+
+	state->next_check = now + CHECK_INTERVAL;
+
+	if (state->count != state->last_count) {
+		/* There was one new thermal interrupt */
+		state->last_count = state->count;
+		state->average = 0;
+		state->sample_count = 0;
+		state->sample_index = 0;
+	}
+
+	state->temp_samples[state->sample_index] = temp;
+	state->sample_count++;
+	state->sample_index = (state->sample_index + 1) % ARRAY_SIZE(state->temp_samples);
+	if (state->sample_count < ARRAY_SIZE(state->temp_samples))
+		goto re_arm;
+
+	avg = 0;
+	for (i = 0; i < ARRAY_SIZE(state->temp_samples); ++i)
+		avg += state->temp_samples[i];
+
+	avg /= ARRAY_SIZE(state->temp_samples);
+
+	if (state->average > avg) {
+		pr_warn("CPU%d: %s temperature is above threshold, cpu clock is throttled (total events = %lu)\n",
+			this_cpu,
+			state->level == CORE_LEVEL ? "Core" : "Package",
+			state->count);
+		state->rate_control_active = 1;
+	}
+
+	state->average = avg;
+
+re_arm:
+	therm_throt_clear_therm_status_log(state->level);
+	schedule_delayed_work_on(this_cpu, &state->therm_work, THERM_THROT_POLL_INTERVAL);
+}
+
 /***
  * therm_throt_process - Process thermal throttling event from interrupt
  * @curr: Whether the condition is current or not (boolean), since the
@@ -178,27 +292,23 @@  static void therm_throt_process(bool new_event, int event, int level)
 	if (new_event)
 		state->count++;
 
-	if (time_before64(now, state->next_check) &&
-			state->count != state->last_count)
-		return;
+	if (event == THERMAL_THROTTLING_EVENT) {
+		if (new_event && !state->last_interrupt_time) {
+			int hot;
 
-	state->next_check = now + CHECK_INTERVAL;
-	state->last_count = state->count;
+			therm_throt_get_therm_status(state->level, &hot, &state->baseline_temp);
 
-	/* if we just entered the thermal event */
-	if (new_event) {
-		if (event == THERMAL_THROTTLING_EVENT)
-			pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
-				this_cpu,
-				level == CORE_LEVEL ? "Core" : "Package",
-				state->count);
-		return;
-	}
-	if (old_event) {
-		if (event == THERMAL_THROTTLING_EVENT)
-			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
-				level == CORE_LEVEL ? "Core" : "Package");
-		return;
+			state->last_interrupt_time = now;
+			schedule_delayed_work_on(this_cpu, &state->therm_work, THERM_THROT_POLL_INTERVAL);
+		} else if (old_event && state->last_interrupt_time) {
+			unsigned long throttle_time;
+
+			throttle_time = jiffies_delta_to_msecs(now - state->last_interrupt_time);
+			if (throttle_time > state->max_time_ms)
+				state->max_time_ms = throttle_time;
+			state->total_time_ms += throttle_time;
+			state->last_interrupt_time = 0;
+		}
 	}
 }
 
@@ -244,20 +354,47 @@  static int thermal_throttle_add_dev(struct device *dev, unsigned int cpu)
 	if (err)
 		return err;
 
-	if (cpu_has(c, X86_FEATURE_PLN) && int_pln_enable)
+	if (cpu_has(c, X86_FEATURE_PLN) && int_pln_enable) {
 		err = sysfs_add_file_to_group(&dev->kobj,
 					      &dev_attr_core_power_limit_count.attr,
 					      thermal_attr_group.name);
+		if (err)
+			goto del_group;
+	}
+
 	if (cpu_has(c, X86_FEATURE_PTS)) {
 		err = sysfs_add_file_to_group(&dev->kobj,
 					      &dev_attr_package_throttle_count.attr,
 					      thermal_attr_group.name);
-		if (cpu_has(c, X86_FEATURE_PLN) && int_pln_enable)
+		if (err)
+			goto del_group;
+
+		err = sysfs_add_file_to_group(&dev->kobj,
+					      &dev_attr_package_throttle_max_time_ms.attr,
+					      thermal_attr_group.name);
+		if (err)
+			goto del_group;
+
+		err = sysfs_add_file_to_group(&dev->kobj,
+					      &dev_attr_package_throttle_total_time_ms.attr,
+					      thermal_attr_group.name);
+		if (err)
+			goto del_group;
+
+		if (cpu_has(c, X86_FEATURE_PLN) && int_pln_enable) {
 			err = sysfs_add_file_to_group(&dev->kobj,
 					&dev_attr_package_power_limit_count.attr,
 					thermal_attr_group.name);
+			if (err)
+				goto del_group;
+		}
 	}
 
+	return 0;
+
+del_group:
+	sysfs_remove_group(&dev->kobj, &thermal_attr_group);
+
 	return err;
 }
 
@@ -269,15 +406,29 @@  static void thermal_throttle_remove_dev(struct device *dev)
 /* Get notified when a cpu comes on/off. Be hotplug friendly. */
 static int thermal_throttle_online(unsigned int cpu)
 {
+	struct thermal_state *state = &per_cpu(thermal_state, cpu);
 	struct device *dev = get_cpu_device(cpu);
 
+	state->package_throttle.level = PACKAGE_LEVEL;
+	state->core_throttle.level = CORE_LEVEL;
+
+	INIT_DELAYED_WORK(&state->package_throttle.therm_work, therm_throt_active_work);
+	INIT_DELAYED_WORK(&state->core_throttle.therm_work, therm_throt_active_work);
+
 	return thermal_throttle_add_dev(dev, cpu);
 }
 
 static int thermal_throttle_offline(unsigned int cpu)
 {
+	struct thermal_state *state = &per_cpu(thermal_state, cpu);
 	struct device *dev = get_cpu_device(cpu);
 
+	cancel_delayed_work(&state->package_throttle.therm_work);
+	cancel_delayed_work(&state->core_throttle.therm_work);
+
+	state->package_throttle.rate_control_active = 0;
+	state->core_throttle.rate_control_active = 0;
+
 	thermal_throttle_remove_dev(dev);
 	return 0;
 }