
thermal/intel: introduce tcc cooling driver

Message ID 20210115094744.21156-1-rui.zhang@intel.com (mailing list archive)
State New, archived
Delegated to: Zhang Rui

Commit Message

Zhang Rui Jan. 15, 2021, 9:47 a.m. UTC
On Intel processors, the core frequency can be reduced below the OS request
when the current temperature reaches the TCC (Thermal Control Circuit)
activation temperature.

The default TCC activation temperature is specified by
MSR_IA32_TEMPERATURE_TARGET. However, it can be adjusted by specifying an
offset in degrees C, using the TCC Offset bits in the same MSR register.

This patch introduces a cooling device driver that utilizes the TCC
Offset feature. The higher the current cooling state is, the lower the
effective TCC activation temperature is, so that the processors can be
throttled earlier, before the system overheats critically.

This patch has been tested on a KBL mobile platform.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
---
 drivers/thermal/intel/Kconfig             |   8 ++
 drivers/thermal/intel/Makefile            |   1 +
 drivers/thermal/intel/intel_tcc_cooling.c | 128 ++++++++++++++++++++++
 3 files changed, 137 insertions(+)
 create mode 100644 drivers/thermal/intel/intel_tcc_cooling.c
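
The relationship described in the commit message (effective TCC activation temperature = default target minus TCC offset) can be illustrated with a small user-space sketch. This is not the driver code. It assumes the default target sits in bits 23:16 of MSR_IA32_TEMPERATURE_TARGET and that the TCC offset is a 6-bit field in bits 29:24; the field width differs between processors, so the mask is an assumption. The sample value is the MSR reading quoted later in this thread.

```python
# Sketch only: field positions and widths are assumptions, check the SDM
# for the processor at hand.
TJMAX_SHIFT = 16
TJMAX_MASK = 0xFF
TCC_OFFSET_SHIFT = 24
TCC_OFFSET_MASK = 0x3F  # assumed 6-bit TCC offset field


def decode_temperature_target(msr):
    """Return (default target, TCC offset, effective TCC activation temp)."""
    tjmax = (msr >> TJMAX_SHIFT) & TJMAX_MASK
    offset = (msr >> TCC_OFFSET_SHIFT) & TCC_OFFSET_MASK
    return tjmax, offset, tjmax - offset


tjmax, offset, effective = decode_temperature_target(0x2B64100D)
print(f"{effective} C ({tjmax} default - {offset} offset)")
```

With the cooling device from this patch, the cooling state plays the role of `offset`, so a larger state yields a lower effective activation temperature.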

Comments

Doug Smythies Jan. 16, 2021, 5:08 p.m. UTC | #1
On 2021.01.15 Zhang Rui wrote:
> 
> On Intel processors, the core frequency can be reduced below OS request,
> when the current temperature reaches the TCC (Thermal Control Circuit)
> activation temperature.
> 
> The default TCC activation temperature is specified by
> MSR_IA32_TEMPERATURE_TARGET. However, it can be adjusted by specifying an
> offset in degrees C, using the TCC Offset bits in the same MSR register.
> 
> This patch introduces a cooling devices driver that utilizes the TCC
> Offset feature. The bigger the current cooling state is, the lower the
> effective TCC activation temperature is, so that the processors can be
> throttled earlier before system critical overheats.

Thank you for this useful patch.
My systems don't need thermald or any other thermal control, but it is nice
to have this extra margin to add to the critical stuff, as a backup.
I also like to use the offset to test stuff.

I use the internal power limit servo for power limiting,
and that servo works very well indeed. Using this temperature
offset as a way to servo the thermal operating limit does work,
but tends to overshoot, oscillate, hold low excessively long
(minutes). It also seems to limit CPU clock frequency reduction
to the non-turbo limit, regardless of the desired maximum
temperature.

I am not familiar with the thermal stuff at all, and didn't know
where to find the trip point knob. Anyway, found "cooling_device11".

I do not understand this:

~$ cat /sys/devices/virtual/thermal/cooling_device11/stats/trans_table
cat: /sys/devices/virtual/thermal/cooling_device11/stats/trans_table: File too large

Rather than enter the actual TCC offset, I would rather enter the desired trip
point, and have the driver do the math to convert it to the offset.
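
The conversion requested here is simple arithmetic, sketched below in user-space Python. The function name is hypothetical, the default of 100 degrees C is the value reported for this test system elsewhere in the thread, and the maximum offset of 63 assumes a 6-bit offset field; none of this is taken from the patch itself.

```python
def trip_point_to_offset(trip_c, tjmax_c=100, max_offset=63):
    """Convert a desired trip point (degrees C) into a TCC offset.

    tjmax_c and max_offset are assumptions for illustration; a real
    implementation would read both from the hardware.
    """
    offset = tjmax_c - trip_c
    if not 0 <= offset <= max_offset:
        raise ValueError(f"trip point {trip_c} C is out of range")
    return offset


print(trip_point_to_offset(55))  # offset needed for a 55 C trip point
```

Rejecting out-of-range requests, rather than silently clamping them, is only one possible design choice.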

Example step function overshoot, trip point set to 55 degrees C.

doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 1
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
0.07    800     45      24      1.89    0.00
0.04    800     29      23      1.89    0.00
61.76   4546    4151    66      103.77  0.00 < step function load applied on 4 of 6 cores
67.76   4570    4476    66      120.42  0.00
68.03   4567    4488    66      120.73  0.00
67.98   4572    4492    67      121.00  0.00 < 19 degrees over trip point
68.10   4489    4493    58      109.19  0.00 < this throttling is either the power servo or the temp servo.
68.08   4262    4476    51      82.82   0.00 < this throttling is the temp servo.
68.13   4143    4513    48      75.16   0.00
68.03   4086    4488    46      71.87   0.00 < It actually undershoots often, I don't know why.
68.12   4000    4505    46      67.02   0.00 < often it doesn't undershoot.
68.44   4000    4502    45      67.16   0.00
68.06   4000    4483    45      66.95   0.00
68.02   3973    4490    44      65.20   0.00
67.94   3900    4489    43      60.51   0.00
67.88   3900    4501    44      60.55   0.00
67.85   3900    4472    43      60.52   0.00
67.96   3900    4481    43      60.59   0.00
68.26   3900    4501    44      60.70   0.00
67.93   3900    4498    43      60.58   0.00
68.03   3900    4476    43      60.68   0.00
67.83   3900    4481    44      60.54   0.00
35.06   3895    2412    25      32.13   0.00 < load removed.
0.04    800     25      24      1.89    0.00
0.04    800     22      23      1.89    0.00
0.06    800     35      23      1.90    0.00
0.03    800     18      23      1.89    0.00
0.04    800     26      22      1.90    0.00
0.30    1927    44      23      1.97    0.00
^C0.10  800     25      23      1.91    0.00

Example long time to recover:
(actually, this example never recovers, unusual):
Note: 3.7 GHz is the limit.

doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 30
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
67.58   3700    134812  42      52.15   0.00 <<< the trip point was changed from 37 to 57 degrees
67.90   3700    134964  42      52.08   0.00
68.07   3700    134424  42      52.06   0.00
68.01   3700    134415  41      50.76   0.00
68.14   3700    134521  41      50.78   0.00
68.11   3700    134424  42      50.75   0.00
68.03   3700    134329  42      50.70   0.00
68.11   3700    134321  42      50.76   0.00
68.05   3700    134456  42      51.09   0.00
68.12   3700    134549  42      52.21   0.00
68.12   3700    134482  42      52.19   0.00
68.10   3700    134301  42      52.20   0.00
68.11   3700    134444  42      52.14   0.00
68.08   3700    134422  42      52.17   0.00
68.07   3700    134430  42      52.23   0.00
68.00   3700    134723  42      52.12   0.00
67.96   3711    135207  44      52.53   0.00 <<< It takes 8 minutes until the frequency goes above 3.7 GHz
68.05   3765    134519  42      54.34   0.00
68.11   3771    134461  43      54.60   0.00
67.83   3763    134867  43      54.26   0.00
67.93   3773    134577  43      54.78   0.00 <<< But it never recovers, Why not?
...

For some unknown reason, the processor now seems to
think it is not heavily loaded. From my MSR decoder:

0x64F: MSR_CORE_PERF_LIMIT_REASONS: 200020 AUTO AUTOL

From the book:

> Autonomous Utilization-Based Frequency Control
> Status (R0)
> When set, frequency is reduced below the operating
> system request because the processor has detected
> that utilization is low.

Which is not true.

Anyway,

Acked-by: Doug Smythies <dsmythies@telus.net>

... Doug
Doug Smythies Jan. 16, 2021, 9:21 p.m. UTC | #2
On 2021.01.16 09:08 Doug Smythies wrote: 
> On 2021.01.15 Zhang Rui wrote:

Added Len to the "To" list:

Turbostat has another issue with this stuff.
It will be more work than I want to do to submit a fix patch, so I am not,
but see further down for my hack fix.

...

> Example step function overshoot, trip point set to 55 degrees C.
> 
> doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 1
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
> 0.07    800     45      24      1.89    0.00
> 0.04    800     29      23      1.89    0.00
> 61.76   4546    4151    66      103.77  0.00 < step function load applied on 4 of 6 cores
> 67.76   4570    4476    66      120.42  0.00
> 68.03   4567    4488    66      120.73  0.00
> 67.98   4572    4492    67      121.00  0.00 < 19 degrees over trip point
> 68.10   4489    4493    58      109.19  0.00 < this throttling is either the power servo or the temp servo.
> 68.08   4262    4476    51      82.82   0.00 < this throttling is the temp servo.
> 68.13   4143    4513    48      75.16   0.00
> 68.03   4086    4488    46      71.87   0.00 < It actually undershoots often, I don't know why.
> 68.12   4000    4505    46      67.02   0.00 < often it doesn't undershoot.

It turns out that turbostat does not list the package
temperature properly if it is started with an active TCC offset.
It erroneously includes the offset in the temperature math.
In the above example turbostat had also not yet been fixed for the
bit mask issue. So the real temp above was 59 degrees C.

> 68.44   4000    4502    45      67.16   0.00
> 68.06   4000    4483    45      66.95   0.00
> 68.02   3973    4490    44      65.20   0.00
> 67.94   3900    4489    43      60.51   0.00
> 67.88   3900    4501    44      60.55   0.00
> 67.85   3900    4472    43      60.52   0.00

And it settled at about 56 degrees, close to what was asked for.

To proceed with my work, I did a hack fix to turbostat:

doug@s18:~/temp-k-git/linux/tools/power/x86/turbostat$ git diff
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index d7acdd4d16c4..7f0a22ab3a0d 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -4831,6 +4831,7 @@ int read_tcc_activation_temp()
                fprintf(outf, "cpu%d: MSR_IA32_TEMPERATURE_TARGET: 0x%08llx (%d C) (%d default - %d offset)\n",
                        base_cpu, msr, tcc, target_c, offset_c);

+       tcc = target_c;
        return tcc;
 }

So this:

cpu4: MSR_IA32_TEMPERATURE_TARGET: 0x2b64100d (57 C) (100 default - 43 offset)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88420000 (-9 C)

becomes this:

cpu1: MSR_IA32_TEMPERATURE_TARGET: 0x2b64100d (57 C) (100 default - 43 offset)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88400000 (36 C)

and this:

Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
0.08    1079    928     -11     1.91    0.00

Becomes this:

Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
0.05    1046    846     32      1.94    0.00
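
The two MSR_IA32_PACKAGE_THERM_STATUS readings above can be reproduced with a short user-space sketch. It assumes the digital readout (degrees below the reference temperature) is in bits 22:16 of the status MSR. The only difference between the two results is the reference: the offset-adjusted value (57) before the hack, and the default target (100) after it.

```python
# Sketch only: the readout field position is an assumption.
READOUT_SHIFT = 16
READOUT_MASK = 0x7F


def package_temp(therm_status, reference_c):
    """Temperature = reference minus the digital readout in the status MSR."""
    readout = (therm_status >> READOUT_SHIFT) & READOUT_MASK
    return reference_c - readout


# Before the hack: offset-adjusted reference (100 - 43 = 57).
print(package_temp(0x88420000, 57))
# After the hack: default target used as the reference.
print(package_temp(0x88400000, 100))
```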

So now back to my overshoot example:

This:

> 67.98   4572    4492    67      121.00  0.00 < 19 degrees over trip point

Was actually:

> 67.98   4572    4492    80      121.00  0.00 <<< 25 degrees over trip point
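
The correction amounts to adding back whatever offset turbostat had subtracted. In the sketch below, the applied offset of 13 is inferred from the 67 to 80 correction itself; it would be consistent with only the low 4 bits of an offset of 45 being used, but that reading of the "bit mask issue" is an assumption.

```python
def corrected_temp(reported_c, applied_offset_c):
    """Undo an offset that was wrongly subtracted from the reference."""
    return reported_c + applied_offset_c


# Assumed: only the low 4 bits of the real offset (45) were applied.
applied = 45 & 0xF

for reported in (67, 46, 43):
    print(reported, "->", corrected_temp(reported, applied))
```

The same arithmetic gives the 59 and roughly 56 degrees C mentioned earlier in this message for the readings of 46 and 43.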

But let's just do it again:

doug@s18:~$ cat /sys/devices/virtual/thermal/cooling_device11/cur_state
43       <<< so 100 - 43 = 57 degrees trip point.
doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 0.25
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
0.09    800     6       36      2.01    0.00
0.16    800     23      36      2.00    0.00
0.11    800     14      36      2.15    0.00
66.81   4461    1160    70      101.17  0.00 <<< load applied, temp up 34 degrees in less than 0.25 seconds. Normal.
68.06   4581    1126    74      117.36  0.00
67.69   4589    1119    76      119.60  0.00
67.80   4589    1125    77      120.94  0.00
67.83   4587    1132    78      120.75  0.00
67.68   4591    1125    78      121.63  0.00
68.07   4585    1139    77      121.25  0.00
67.80   4588    1121    79      121.41  0.00 <<< now 20 degrees over trip point.
68.57   4579    1139    79      121.71  0.00
...
68.03   4220    1130    63      80.28   0.00 <<< it takes quite a while (>7 seconds) to really throttle down.

... Doug
Zhang Rui Jan. 18, 2021, 9:31 a.m. UTC | #3
> -----Original Message-----
> From: Doug Smythies <dsmythies@telus.net>
> Sent: Sunday, January 17, 2021 5:22 AM
> To: Zhang, Rui <rui.zhang@intel.com>; Brown, Len <len.brown@intel.com>
> Cc: daniel.lezcano@linaro.org; srinivas.pandruvada@linux.intel.com; linux-
> pm@vger.kernel.org; 'Doug Smythies' <dsmythies@telus.net>
> Subject: RE: [PATCH] thermal/intel: introduce tcc cooling driver
> Importance: High
> 
> On 2021.01.16 09:08 Doug Smythies wrote:
> > On 2021.01.15 Zhang Rui wrote:
> 
> Added Len to the "To" list:
> 
> Turbostat has another issue with this stuff.
> It will be more work than I want to do to submit a fix patch, so I am not, but
> see further down for my hack fix.
> 
> ...
> 
> > Example step function overshoot, trip point set to 55 degrees C.
> >
> > doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 1
> > Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
> > 0.07    800     45      24      1.89    0.00
> > 0.04    800     29      23      1.89    0.00
> > 61.76   4546    4151    66      103.77  0.00 < step function load applied on 4 of 6 cores
> > 67.76   4570    4476    66      120.42  0.00
> > 68.03   4567    4488    66      120.73  0.00
> > 67.98   4572    4492    67      121.00  0.00 < 19 degrees over trip point
> > 68.10   4489    4493    58      109.19  0.00 < this throttling is either the power servo or the temp servo.
> > 68.08   4262    4476    51      82.82   0.00 < this throttling is the temp servo.
> > 68.13   4143    4513    48      75.16   0.00
> > 68.03   4086    4488    46      71.87   0.00 < It actually undershoots often, I don't know why.
> > 68.12   4000    4505    46      67.02   0.00 < often it doesn't undershoot.
> 
> It turns out that turbostat does not list the package temperature properly if it
> is started with an active TCC offset.
> It erroneously includes the offset in the temperature math.
> In the above example turbostat had also not yet been fixed for the bit mask
> issue. So the real temp above was 59 degrees C.
> 
> > 68.44   4000    4502    45      67.16   0.00
> > 68.06   4000    4483    45      66.95   0.00
> > 68.02   3973    4490    44      65.20   0.00
> > 67.94   3900    4489    43      60.51   0.00
> > 67.88   3900    4501    44      60.55   0.00
> > 67.85   3900    4472    43      60.52   0.00
> 
> And it settled at about 56 degrees, close to what was asked for.
> 
> To proceed with my work, I did a hack fix to turbostat:
> 
> doug@s18:~/temp-k-git/linux/tools/power/x86/turbostat$ git diff
> diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> index d7acdd4d16c4..7f0a22ab3a0d 100644
> --- a/tools/power/x86/turbostat/turbostat.c
> +++ b/tools/power/x86/turbostat/turbostat.c
> @@ -4831,6 +4831,7 @@ int read_tcc_activation_temp()
>                 fprintf(outf, "cpu%d: MSR_IA32_TEMPERATURE_TARGET: 0x%08llx (%d C) (%d default - %d offset)\n",
>                         base_cpu, msr, tcc, target_c, offset_c);
> 
> +       tcc = target_c;
>         return tcc;
>  }
> 

Yes, this is the right fix.
I think Len already knows about this breakage and will propose a fix soon.

> So this:
> 
> cpu4: MSR_IA32_TEMPERATURE_TARGET: 0x2b64100d (57 C) (100 default - 43 offset)
> cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88420000 (-9 C)
> 
> becomes this:
> 
> cpu1: MSR_IA32_TEMPERATURE_TARGET: 0x2b64100d (57 C) (100 default - 43 offset)
> cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88400000 (36 C)
> 
> and this:
> 
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
> 0.08    1079    928     -11     1.91    0.00
> 
> Becomes this:
> 
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
> 0.05    1046    846     32      1.94    0.00
> 
> So now back to my overshoot example:
> 
> This:
> 
> > 67.98   4572    4492    67      121.00  0.00 < 19 degrees over trip point
> 
> Was actually:
> 
> > 67.98   4572    4492    80      121.00  0.00 <<< 25 degrees over trip point
> 
> But let's just do it again:
> 
> doug@s18:~$ cat /sys/devices/virtual/thermal/cooling_device11/cur_state
> 43       <<< so 100 - 43 = 57 degrees trip point.
> doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 0.25
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
> 0.09    800     6       36      2.01    0.00
> 0.16    800     23      36      2.00    0.00
> 0.11    800     14      36      2.15    0.00
> 66.81   4461    1160    70      101.17  0.00 <<< load applied, temp up 34 degrees in less than 0.25 seconds. Normal.
> 68.06   4581    1126    74      117.36  0.00
> 67.69   4589    1119    76      119.60  0.00
> 67.80   4589    1125    77      120.94  0.00
> 67.83   4587    1132    78      120.75  0.00
> 67.68   4591    1125    78      121.63  0.00
> 68.07   4585    1139    77      121.25  0.00
> 67.80   4588    1121    79      121.41  0.00 <<< now 20 degrees over trip point.
> 68.57   4579    1139    79      121.71  0.00
> ...
> 68.03   4220    1130    63      80.28   0.00 <<< it takes quite awhile (>7 seconds) to really throttle down.

What platform is this?
On a KBL platform I'm running right now, with the performance governor, the TCC offset set to 30 (effective TCC is 70 C), and turbostat fixed,
I can observe that:
1. all CPUs run at the max turbo frequency (3.9 GHz) when idle, with PkgTmp around 40 C.
2. with load applied (I use the stress tool to get 100% CPU load), PkgTmp reports 70 C and the frequency drops to around 3 GHz, IMMEDIATELY.
3. when I change the TCC offset to 60, the CPU is throttled to around 200 MHz and the temperature is around 50 C, IMMEDIATELY.
4. when I change the TCC offset to 20, the CPU frequency rises to the turbo range and PkgTmp reaches 80 C, IMMEDIATELY.

So in your test, there is something I don't understand. 
Zhang Rui Jan. 18, 2021, 9:46 a.m. UTC | #4
Hi, Doug,

Thanks for testing this patch.

> -----Original Message-----
> From: Doug Smythies <dsmythies@telus.net>
> Sent: Sunday, January 17, 2021 1:08 AM
> To: Zhang, Rui <rui.zhang@intel.com>
> Cc: daniel.lezcano@linaro.org; srinivas.pandruvada@linux.intel.com; linux-
> pm@vger.kernel.org
> Subject: RE: [PATCH] thermal/intel: introduce tcc cooling driver
> Importance: High
> 
> On 2021.01.15 Zhang Rui wrote:
> >
> > On Intel processors, the core frequency can be reduced below OS
> > request, when the current temperature reaches the TCC (Thermal Control
> > Circuit) activation temperature.
> >
> > The default TCC activation temperature is specified by
> > MSR_IA32_TEMPERATURE_TARGET. However, it can be adjusted by specifying
> > an offset in degrees C, using the TCC Offset bits in the same MSR register.
> >
> > This patch introduces a cooling devices driver that utilizes the TCC
> > Offset feature. The bigger the current cooling state is, the lower the
> > effective TCC activation temperature is, so that the processors can be
> > throttled earlier before system critical overheats.
> 
> Thank you for this useful patch.
> My systems don't need thermald or any other thermal control, but it is nice
> to have this extra margin to add to the critical stuff, as a backup.
> I also like to use the offset to test stuff.
> 
> I use the internal power limit servo for power limiting, and that servo works
> very well indeed. Using this temperature offset as a way to servo the
> thermal operating limit does work, but tends to overshoot, oscillate, hold low
> excessively long (minutes). 

Do you have a script to test and show the drawbacks of this feature?
It seems that it behaves differently on different platforms.
Maybe we can evaluate this on more platforms.

> It also seems to limit CPU clock frequency
> reduction to the non-turbo limit, regardless of the desired maximum
> temperature.
> 
> I am not familiar with the thermal stuff at all, and didn't know where to find
> the trip point knob. Anyway, found "cooling_device11".
> 
> I do not understand this:
> 
> ~$ cat /sys/devices/virtual/thermal/cooling_device11/stats/trans_table
> cat: /sys/devices/virtual/thermal/cooling_device11/stats/trans_table: File too large

This is a known issue: the stats table cannot handle devices with too many cooling states, say, the 127 cooling states of the TCC Offset cooling device.
We can ignore this for now.
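
A rough size estimate shows why a table with that many states cannot be read. The sketch assumes the transition table is an N x N matrix printed into a single 4096-byte sysfs page, with about 9 characters per cell; both numbers are assumptions about the thermal sysfs code, not something stated in this thread.

```python
PAGE_SIZE = 4096  # assumed sysfs buffer limit, in bytes
CELL_WIDTH = 9    # assumed characters per table cell


def trans_table_min_bytes(n_states):
    """Lower bound on the table size: cells only, no headers or row labels."""
    return n_states * n_states * CELL_WIDTH


for n in (10, 64, 128):
    size = trans_table_min_bytes(n)
    print(f"{n} states: at least {size} bytes, fits in a page: {size <= PAGE_SIZE}")
```

Under these assumptions the table stops fitting at a couple of dozen states, well below the state count of this cooling device.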

> 
> Rather than enter the actual TCC offset, I would rather enter the desired trip
> point, and have the driver do the math to convert it to the offset.

Hmmm, a writable trip point? I need to think about this.

> 
> Example step function overshoot, trip point set to 55 degrees C.
> 
> doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 1
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
> 0.07    800     45      24      1.89    0.00
> 0.04    800     29      23      1.89    0.00
> 61.76   4546    4151    66      103.77  0.00 < step function load applied on 4 of 6 cores
> 67.76   4570    4476    66      120.42  0.00
> 68.03   4567    4488    66      120.73  0.00
> 67.98   4572    4492    67      121.00  0.00 < 19 degrees over trip point
> 68.10   4489    4493    58      109.19  0.00 < this throttling is either the power servo or the temp servo.
> 68.08   4262    4476    51      82.82   0.00 < this throttling is the temp servo.
> 68.13   4143    4513    48      75.16   0.00
> 68.03   4086    4488    46      71.87   0.00 < It actually undershoots often, I don't know why.
> 68.12   4000    4505    46      67.02   0.00 < often it doesn't undershoot.
> 68.44   4000    4502    45      67.16   0.00
> 68.06   4000    4483    45      66.95   0.00
> 68.02   3973    4490    44      65.20   0.00
> 67.94   3900    4489    43      60.51   0.00
> 67.88   3900    4501    44      60.55   0.00
> 67.85   3900    4472    43      60.52   0.00
> 67.96   3900    4481    43      60.59   0.00
> 68.26   3900    4501    44      60.70   0.00
> 67.93   3900    4498    43      60.58   0.00
> 68.03   3900    4476    43      60.68   0.00
> 67.83   3900    4481    44      60.54   0.00
> 35.06   3895    2412    25      32.13   0.00 < load removed.
> 0.04    800     25      24      1.89    0.00
> 0.04    800     22      23      1.89    0.00
> 0.06    800     35      23      1.90    0.00
> 0.03    800     18      23      1.89    0.00
> 0.04    800     26      22      1.90    0.00
> 0.30    1927    44      23      1.97    0.00
> ^C0.10  800     25      23      1.91    0.00
> 
> Example long time to recover:
> (actually, this example never recovers, unusual):
> Note: 3.7 GHz is the limit.
> 
> doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 30
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
> 67.58   3700    134812  42      52.15   0.00 <<< the trip point was changed from 37 to 57 degrees
> 67.90   3700    134964  42      52.08   0.00
> 68.07   3700    134424  42      52.06   0.00
> 68.01   3700    134415  41      50.76   0.00
> 68.14   3700    134521  41      50.78   0.00
> 68.11   3700    134424  42      50.75   0.00
> 68.03   3700    134329  42      50.70   0.00
> 68.11   3700    134321  42      50.76   0.00
> 68.05   3700    134456  42      51.09   0.00
> 68.12   3700    134549  42      52.21   0.00
> 68.12   3700    134482  42      52.19   0.00
> 68.10   3700    134301  42      52.20   0.00
> 68.11   3700    134444  42      52.14   0.00
> 68.08   3700    134422  42      52.17   0.00
> 68.07   3700    134430  42      52.23   0.00
> 68.00   3700    134723  42      52.12   0.00
> 67.96   3711    135207  44      52.53   0.00 <<< It takes 8 minutes until the frequency goes above 3.7 GHz
> 68.05   3765    134519  42      54.34   0.00
> 68.11   3771    134461  43      54.60   0.00
> 67.83   3763    134867  43      54.26   0.00
> 67.93   3773    134577  43      54.78   0.00 <<< But it never recovers, Why not?
> ...
> 
> For unknown reason the processor seems to now think it is not heavily
> loaded. From my MSR decoder:
> 
> 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 200020 AUTO AUTOL
> 
> From the book:
> 
> > Autonomous Utilization-Based Frequency Control Status (R0) When set,
> > frequency is reduced below the operating system request because the
> > processor has detected that utilization is low.
> 
> Which is not true.
> 
> Anyway,
> 
> Acked-by: Doug Smythies <dsmythies@telus.net>
> 
thanks,
rui
Doug Smythies Jan. 19, 2021, 7:10 a.m. UTC | #5
On 2021.01.18 01:32 Zhang, Rui wrote:
>  On 2021.01.17 05:22 Doug Smythies wrote:
> > On 2021.01.16 09:08 Doug Smythies wrote:
> > > On 2021.01.15 Zhang Rui wrote:
...
> 
> What platform this is?

My i5-9600K test server.
Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
6 CPUs and 6 cores.
Kernel: 5.11-rc3 + this patch.
Water cooled, with water pump always running full speed.

> On a KBL platform I'm running right now, with performance governor, and tcc offset set to 30
> (Effective TCC  is 70c), and also turbostat fixed,
> I can observe that
> 1. all cpus running at max turbo freq (3.9G) when idle, PkgTmp around 40C
> 2. with load applied (I use stress tool to get 100% CPU load), the PkgTmp reports 70C and the
> frequency drops to  around 3G, IMMEDIATELY.
> 3. when I change TCC Offset to 60, cpu is throttled to around 200MHz, and the temperature is at around
> 50C, IMMEDIATELY.
> 4. when I change TCC Offset to  20, cpu freq raises to turbo range, and PkgTmp reaches 80C,
> IMMEDIATELY.

O.K. You should be able to measure "IMMEDIATELY" and tell us what it is.

> 
> So in your test, there is something I don't understand. 
Doug Smythies Jan. 26, 2021, 7:18 p.m. UTC | #6
Hi, Just a small follow up on this one:

On 2021.01.16 09:08 Doug Smythies wrote:
> On 2021.01.15 Zhang Rui wrote:
...
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
> 67.93   3773    134577  43      54.78
> 
> For unknown reason the processor seems to now
> think it is not heavily loaded. From my MSR decoder:
> 
> 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 200020 AUTO AUTOL
> 
> From the book:
> 
> > Autonomous Utilization-Based Frequency Control
> > Status (R0)
> > When set, frequency is reduced below the operating
> > system request because the processor has detected
> > that utilization is low.
> 
> Which is not true.
> 
> Anyway,
> 
> Acked-by: Doug Smythies <dsmythies@telus.net>

O.K. there were 2 things wrong above:

1.) I used the wrong Intel SDM table for those bit definitions.
They should have been: RATL and RATLL.

From the proper page of the book:

> Running Average Thermal Limit Status (R0)
> When set, frequency is reduced below the operating
> system request due to Running Average Thermal Limit
> (RATL).

2.) Due to the already discussed turbostat issue, that was not
the actual temperature and so the RATL bit being set was actually
valid at that time.
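
For reference, the value 0x200020 has exactly two bits set, 5 and 21. The sketch below decodes it under the assumption that bits 15:0 are status bits and bits 31:16 are the matching log bits, which is consistent with the RATL and RATLL names above. Only the name for bit 5 is taken from this thread; all other bits are printed by number.

```python
LOG_SHIFT = 16               # assumed: log bit = status bit + 16
STATUS_NAMES = {5: "RATL"}   # only the bit named in this thread


def decode_perf_limit_reasons(msr):
    """Return labels for every set bit; log bits get an 'L' suffix."""
    labels = []
    for bit in range(32):
        if not msr & (1 << bit):
            continue
        status_bit = bit % LOG_SHIFT
        name = STATUS_NAMES.get(status_bit, f"bit{status_bit}")
        labels.append(name + "L" if bit >= LOG_SHIFT else name)
    return labels


print(decode_perf_limit_reasons(0x200020))
```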

I have not been able to find the time window knob for this, if there
even is one, similar to the time window knobs for the package power limits.
I wanted to reduce the time constant, just as a test, in an attempt
to reduce the step function load potential temperature overshoot.

One additional informational follow up note:

There always seems to be a significant turn on transient to using the
TCC offset, appearing as temperature undershoot. I am saying that
an offset of 0 seems to also act as some sort of on/off switch to the
running average.

Example 1 - start with offset 0:

$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
51.17   4600    3531    71      93.57
51.37   4600    3543    71      93.60
51.37   4600    3590    71      93.63  <<< offset changed from 0 to 24
50.99   3737    3566    52      43.49  <<< trip point = 76 degrees
51.20   3700    3550    51      41.14  <<< TCC offset turn on transient
51.09   3700    3559    51      41.30  <<< There was no need to throttle
51.12   3779    3515    53      43.78
50.95   4064    3553    58      55.57
51.55   4271    3522    63      65.30
51.18   4424    3534    67      76.58
51.27   4500    3532    68      84.12
51.14   4500    3529    68      84.14
51.24   4599    3522    71      93.61
51.14   4600    3523    71      93.71  <<< Eventually it does return to not throttled.

Example 2 - start with offset 1:

Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
51.14   4600    3554    73      94.73
51.37   4600    3544    73      94.85
51.03   4600    3560    74      94.64 <<< offset changed from 1 to 24
51.33   4600    3508    73      94.88 <<< trip point = 76 degrees
51.14   4600    3526    73      94.69 <<< No TCC offset transient
51.22   4600    3614    73      94.85
51.24   4600    3531    73      94.84
51.50   4600    3578    73      94.92
51.15   4600    3571    73      94.77
51.20   4600    3521    73      94.91
51.19   4600    3550    73      94.76
51.27   4600    3522    74      94.81
51.27   4600    3530    74      94.98

... Doug
Zhang Rui Jan. 28, 2021, 5:29 p.m. UTC | #7
Hi, Doug,

On Tue, 2021-01-26 at 11:18 -0800, Doug Smythies wrote:
> Hi, Just a small follow up on this one:
> 
> On 2021.01.16 09:08 Doug Smythies wrote:
> > On 2021.01.15 Zhang Rui wrote:
> 
> ...
> > Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
> > 67.93   3773    134577  43      54.78
> > 
> > For unknown reason the processor seems to now
> > think it is not heavily loaded. From my MSR decoder:
> > 
> > 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 200020 AUTO AUTOL
> > 
> > From the book:
> > 
> > > Autonomous Utilization-Based Frequency Control
> > > Status (R0)
> > > When set, frequency is reduced below the operating
> > > system request because the processor has detected
> > > that utilization is low.
> > 
> > Which is not true.
> > 
> > Anyway,
> > 
> > Acked-by: Doug Smythies <dsmythies@telus.net>
> 

> O.K. there were 2 things wrong above:
> 
> 1.) I used the wrong intel SDM table for those bit definitions.
> They should have been: RATL and RATLL.
> 
> From the proper page of the book:
> 
> > Running Average Thermal Limit Status (R0)
> > When set, frequency is reduced below the operating
> > system request due to Running Average Thermal Limit
> > (RATL).
> 

> 2.) Due to the already discussed turbostat issue, that was not
> the actual temperature and so the RATL bit being set was actually
> valid at that time.
> 
On my side, I got the "Thermal status bit" set.

> I have not been able to find the time window knob for this, if there
> even is one, similar to the time window knobs for the package power
> limits.
> I wanted to reduce the time constant, just as a test, in an attempt
> to reduce the step function load potential temperature overshoot.
> 


> One additional informational follow up note:
> 
> There always seems to be a significant turn on transient to using the
> TCC offset, appearing as temperature undershoot. I am saying that
> an offset of 0 seems to also act as some sort of on/off switch to the
> running average.
> 
> Example 1 - start with offset 0:
> 
> $ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
> 51.17   4600    3531    71      93.57
> 51.37   4600    3543    71      93.60
> 51.37   4600    3590    71      93.63  <<< offset changed from 0 to 24
> 50.99   3737    3566    52      43.49  <<< trip point = 76 degrees
> 51.20   3700    3550    51      41.14  <<< TCC offset turn on transient
> 51.09   3700    3559    51      41.30  <<< There was no need to throttle
> 51.12   3779    3515    53      43.78
> 50.95   4064    3553    58      55.57
> 51.55   4271    3522    63      65.30
> 51.18   4424    3534    67      76.58
> 51.27   4500    3532    68      84.12
> 51.14   4500    3529    68      84.14
> 51.24   4599    3522    71      93.61
> 51.14   4600    3523    71      93.71  <<< Eventually it does return to not throttled.
> 

> Example 2 - start with offset 1:
> 
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
> 51.14   4600    3554    73      94.73
> 51.37   4600    3544    73      94.85
> 51.03   4600    3560    74      94.64 <<< offset changed from 1 to 24
> 51.33   4600    3508    73      94.88 <<< trip point = 76 degrees
> 51.14   4600    3526    73      94.69 <<< No TCC offset transient
> 51.22   4600    3614    73      94.85
> 51.24   4600    3531    73      94.84
> 51.50   4600    3578    73      94.92
> 51.15   4600    3571    73      94.77
> 51.20   4600    3521    73      94.91
> 51.19   4600    3550    73      94.76
> 51.27   4600    3522    74      94.81
> 51.27   4600    3530    74      94.98
> 
> 
Thanks for your test.
I suspect this is platform specific,
because it behaves very differently from what I observed.

$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
99.45	2216	10656	80	14.81  <<< start with offset=0
99.48	2234	10621	79	15.02
99.47	2233	10436	80	14.96
99.45	2236	10587	79	14.94
99.49	2216	10673	79	15.04
99.46	2226	10685	79	14.87
99.43	2233	10776	79	14.89
99.73	399	9139	66	4.51   <<< offset set to 50
99.76	212	8998	65	3.31
99.77	212	8902	64	3.27
...                                    <<< throttled for 20 seconds
99.76	212	8911	55	2.97
99.77	211	8851	55	2.95
99.76	211	8916	55	2.94
99.77	211	8844	55	3.05
99.77	211	8828	54	3.21
99.77	211	8911	54	3.05
99.74	212	8998	54	3.20
99.77	212	8802	54	2.90
99.77	211	8849	54	2.90
99.76	212	8942	53	2.98
99.76	211	9039	53	3.22
99.74	212	8977	53	2.89
99.77	211	8913	53	2.89
99.76	212	8900	53	2.89
99.77	211	8817	52	2.87
99.77	212	8923	52	2.88
99.77	212	8985	52	2.88
99.73	212	8877	52	2.86
99.58	575	9308	66	5.54    <<< offset set to 32
98.92	2460	13694	66	17.32
98.98	2298	13768	66	15.24
99.03	2244	14652	66	14.48
98.97	2198	14489	66	13.95
99.03	2148	14583	66	13.43
99.02	2107	14093	66	13.45
99.06	2060	13750	66	12.61
99.06	2036	14195	66	12.27
99.07	2007	14240	66	12.07   
99.12	2888	12147	98	28.23   <<< offset cleared
99.03	3413	11503	98	37.21
98.96	3317	11698	98	34.64
99.07	3246	11410	98	32.89
98.95	3210	12107	98	32.13
98.94	3164	11790	98	31.08
99.00	3124	12106	98	30.84
99.00	3086	11876	98	29.60
98.94	3054	12482	98	29.00
98.89	3030	12629	98	28.54
99.39	2377	10764	82	17.62   <<< Didn't do anything, so it is probably thermald or something
99.49	2200	10679	81	14.44
99.52	2211	10267	80	14.66
99.49	2221	10318	80	14.71
99.45	2220	10289	81	14.74
99.43	2222	10326	81	14.76

I tried both tests, and the results are the same, in both cases, it
starts throttling immediately (within a second), and no over-throttling 
observed.

Do you have a script to do this? Say, run turbostat in background and
then change tcc offset at certain timestamp? Maybe we can try exactly
the same test on different machines.

thanks,
rui
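
One way to make the test repeatable across machines is to script the state change itself. The following is only a minimal sketch, not something either tester used: the helper name is hypothetical, and it assumes the TCC Offset cooling device is reachable at a path of the form /sys/devices/virtual/thermal/cooling_deviceN/cur_state, where N differs from machine to machine.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait delay_s seconds, then write a new state to a
 * cooling device's cur_state file (the path is supplied by the caller).
 * Returns 0 on success, -1 on failure.
 */
int set_cooling_state_after(const char *path, unsigned int delay_s,
			    unsigned long state)
{
	FILE *f;
	int ok;

	sleep(delay_s);

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%lu\n", state) > 0;

	/* fclose() flushes the buffer, so a rejected write is reported here. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

With turbostat already running in the background at a 1 second interval, calling set_cooling_state_after(path, 10, 24) as root would change the offset roughly ten samples into the trace, so the same sequence could be replayed on different machines.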
Zhang Rui Jan. 28, 2021, 5:32 p.m. UTC | #8
> > 
> > Rather than enter the actual TCC offset, I would rather enter the
> > desired trip point, and have the driver do the math to convert it to
> > the offset.
> 
> Hmmm, a writable trip point? I need to think about this.

I think this is a better idea, and I will export this as a writable
trip point of the x86_pkg_temp_thermal driver later, thanks for the
suggestion.

thanks,
rui
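
The conversion being discussed is simple arithmetic: the offset is the default TCC activation temperature minus the desired trip point. The sketch below illustrates that math only; the function names are hypothetical and this is not the code that was later written for x86_pkg_temp_thermal. The 63 degree upper bound assumes the 6-bit offset field used by TCC_MASK in this patch.

```c
#define TCC_OFFSET_MAX 63	/* 6-bit field, matching TCC_MASK (0x3f) in the patch */

/*
 * Convert a desired trip point (degrees C) into a TCC offset.
 * tjmax_c is the default TCC activation temperature; obtaining it is
 * outside the scope of this sketch.
 * Returns the offset, or -1 if the trip point cannot be expressed.
 */
int trip_to_tcc_offset(int tjmax_c, int trip_c)
{
	int offset = tjmax_c - trip_c;

	if (offset < 0 || offset > TCC_OFFSET_MAX)
		return -1;

	return offset;
}

/* Inverse: effective activation temperature for a given offset. */
int tcc_offset_to_trip(int tjmax_c, int offset)
{
	return tjmax_c - offset;
}
```

The traces in this thread pair offset 24 with a 76 degree trip point, which implies a default activation temperature of 100 degrees on that machine; with that value, trip_to_tcc_offset(100, 76) yields 24.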
Doug Smythies Jan. 30, 2021, 4:58 p.m. UTC | #9
On Thu, Jan 28, 2021 at 9:30 AM Zhang Rui <rui.zhang@intel.com> wrote:
> On Tue, 2021-01-26 at 11:18 -0800, Doug Smythies wrote:
> > On 2021.01.16 09:08 Doug Smythies wrote:
> > > On 2021.01.15 Zhang Rui wrote:
...
> > They should have been: RATL and RATLL.
> >
> > From the proper page of the book:
> >
> > > Running Average Thermal Limit Status (R0)
> > > When set, frequency is reduced below the operating
> > > system request due to Running Average Thermal Limit
> > > (RATL).
> >
>
> > 2.) Due to the already discussed turbostat issue, that was not
> > the actual temperature and so the RATL bit being set was actually
> > valid at that time.
> >
> On my side, I got the "Thermal status bit" set.

Yes, and if I understand your comment correctly, you are referring
to IA32_THERM_STATUS (0X19C) and/or
IA32_PACKAGE_THERM_STATUS (0X1B1). I am referring to
MSR_CORE_PERF_LIMIT_REASONS (0X64F).

>
> > I have not been able to find the time window knob for this, if there
> > even is one, similar to the time window knobs for the package power
> > limits.

I just assume there is a time window, similar to the RAPL based
power limits. But I haven't found it.

> > I wanted to reduce the time constant, just as a test, in an attempt
> > to reduce the step function load potential temperature overshoot.
...

> >
> Thanks for your test.
> I'd prefer this is platform specific.
> Because it behaves really differently from what I observed.

O.K. These oddities aside, in the end it does do
the expected job.

> 99.06   2036    14195   66      12.27
> 99.07   2007    14240   66      12.07
> 99.12   2888    12147   98      28.23   <<< offset cleared
> 99.03   3413    11503   98      37.21
> 98.96   3317    11698   98      34.64

very close to critical temp.
I never knowingly allow my processor
to go above 80 degrees.
Although, I admit it hit 90 degrees a couple of
times during this work.

> 99.07   3246    11410   98      32.89
> 98.95   3210    12107   98      32.13
> 98.94   3164    11790   98      31.08
> 99.00   3124    12106   98      30.84
> 99.00   3086    11876   98      29.60
> 98.94   3054    12482   98      29.00
> 98.89   3030    12629   98      28.54
> 99.39   2377    10764   82      17.62   <<< Didn't do anything, so it is probably thermald or something

or critical temp hit.

>
> I tried both tests, and the results are the same, in both cases, it
> starts throttling immediately (within a second), and no over-throttling
> observed.
>
> Do you have a script to do this?

No, all of my tests were done manually, varying:
. placement of high loads on some cores for more heat over smaller surface area.
. balance between 100% CPU load at max heat versus 100% CPU load at less heat.
. balance between this TCC Offset throttling versus package power limits
. using ambient (coolant temperature) as a heat removal capacity knob.

In summary: I played around until I found something interesting.

> Say, run turbostat in background and
> then change tcc offset at certain timestamp? Maybe we can try exactly
> the same test on different machines.

I had an idea, and wasted way way too much time trying to make it work.
I thought to just get turbostat to also show the offset, so then we know for
certain when it changed. I tried virtually all combinations of:

turbostat --Summary --quiet --add /sys/devices/virtual/thermal/cooling_device11/cur_state,,,,TCC --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
turbostat --Summary --quiet --add msr0x1a2,u32,package,raw,TCC --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1

and could never get it to work in "Summary" mode. (note: about 95% of
my use of turbostat is in "Summary" mode.)
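
For reference, a raw value read from MSR 0x1a2 can be split into its fields by hand. This is a minimal sketch with hypothetical helper names: the offset field (bits 29:24) matches TCC_SHIFT and TCC_MASK in the patch, while the position of the default activation temperature (bits 23:16) is an assumption taken from the Intel SDM description of MSR_IA32_TEMPERATURE_TARGET and should be checked for the platform in question.

```c
#include <stdint.h>

#define TCC_SHIFT	24
#define TCC_MASK	(0x3fULL << TCC_SHIFT)		/* as in the patch */
#define TJMAX_SHIFT	16				/* assumption, per Intel SDM */
#define TJMAX_MASK	(0xffULL << TJMAX_SHIFT)	/* assumption, per Intel SDM */

/* Extract the TCC offset (degrees C) from a raw MSR value. */
unsigned int tcc_offset_from_msr(uint64_t val)
{
	return (unsigned int)((val & TCC_MASK) >> TCC_SHIFT);
}

/* Extract the default TCC activation temperature (degrees C). */
unsigned int tjmax_from_msr(uint64_t val)
{
	return (unsigned int)((val & TJMAX_MASK) >> TJMAX_SHIFT);
}

/* Return the raw value with only the offset field replaced. */
uint64_t msr_with_tcc_offset(uint64_t val, unsigned int offset)
{
	val &= ~TCC_MASK;
	val |= ((uint64_t)offset << TCC_SHIFT) & TCC_MASK;
	return val;
}
```

As a made-up example, a raw value of 0x18640000 would decode to an offset of 24 and a default activation temperature of 100, giving an effective trip point of 76 degrees.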

Anyway, after too long, I did get this to work:

turbostat --quiet --cpu 0 --add /sys/devices/virtual/thermal/cooling_device11/cur_state,u32,,raw,TCC --show CPU,Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1 | grep "^ 0"

Example 1:

turbostat --quiet --cpu 0 --add /sys/devices/virtual/thermal/cooling_device11/cur_state,u32,,raw,TCC --show CPU,Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1 | grep "^0"
CPU     Busy%   Bzy_MHz IRQ            TCC      PkgTmp  PkgWatt
0       100.26  4500    1002    0x00000001      78      99.88 <<< Offset = 1
0       100.26  4501    1002    0x00000001      77      99.90 <<< steady state power limit throttle
0       100.26  4501    1004    0x00000001      77      99.92
0       100.26  4500    1002    0x0000001e      78      99.91   <<< offset changed, trip point 70
0       100.25  4502    1003    0x0000001e      77      100.03
0       100.25  4503    1002    0x0000001e      77      99.85
0       100.25  4502    1002    0x0000001e      78      99.92
0       100.26  4501    1003    0x0000001e      78      99.95
0       100.25  4503    1002    0x0000001e      77      99.88
0       100.25  4502    1002    0x0000001e      78      99.86
0       100.25  4502    1004    0x0000001e      77      99.92
0       100.25  4503    1002    0x0000001e      77      99.98
0       100.25  4502    1002    0x0000001e      77      99.88
0       100.26  4498    1004    0x0000001e      77      100.06
0       100.26  4501    1002    0x0000001e      78      99.77
0       100.26  4500    1002    0x0000001e      78      99.53
0       100.26  4430    1002    0x0000001e      72      91.19  <<< Thermal throttling. 13 Seconds
0       100.26  4400    1002    0x0000001e      72      87.55
0       100.26  4400    1002    0x0000001e      71      87.52
0       100.26  4400    1005    0x0000001e      71      87.56
0       100.26  4400    1002    0x0000001e      72      87.53

Example 2:

0       100.26  4600    1002    0x00000000      83      113.26 <<< Offset = 0
0       100.26  4600    1002    0x00000000      84      113.43
0       100.25  4599    1002    0x00000000      83      113.42 <<< No power limit throttle yet.
0       100.26  4600    1004    0x00000000      83      113.40 <<< Not steady state.
0       100.26  4600    1002    0x00000000      83      113.25
0       100.25  3797    1003    0x00000018      56      54.11  <<< Overshoot is immediate.
0       100.26  3700    1002    0x00000018      56      47.09
0       100.26  3700    1002    0x00000018      55      47.08
0       100.26  3700    1002    0x00000018      54      46.98
0       100.26  3820    1002    0x00000018      58      51.62  <<< starts to recover.
0       100.26  4016    1002    0x00000018      62      61.55
0       100.26  4177    1002    0x00000018      64      69.91
0       100.26  4275    1004    0x00000018      68      75.81
0       100.26  4300    1002    0x00000018      68      77.36
0       100.26  4371    1002    0x00000018      71      84.53
0       100.26  4400    1002    0x00000018      72      87.52
0       100.26  4400    1003    0x00000018      72      87.62

Example 3:
This test is specifically an attempt to test the TCC Offset in the exact
way I intend to use it. trip point = 75 degrees, and never changes.
Power limit 2 is 115 watts, timing window short.
Power limit 1 is 100 watts, timing window 8 seconds.
Note: all previous work was with the timing window at 28 seconds.
Note: typically temperature < 75 at 100 watts.

The load is 4 prime95 maximum heat threads, plus 0 weaker memory
hammering threads.

The coolant had to be preheated for about an hour before this test
started, otherwise the processor would not get hot enough before
package power limit 1 took over the throttling duties.

Now, watching the TCC offset is useless for this test, so let's watch
MSR_CORE_PERF_LIMIT_REASONS instead:

turbostat --add msr0x64f,u32,,raw,TCC --show CPU,Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ,RAMWatt --interval 1 | grep "^0"

(O.K., I should have changed the added column name. I filter it
anyhow, but manually added back, edited.)

CPU     Busy%   Bzy_MHz IRQ            TCC      PkgTmp  PkgWatt RAMWatt
0       0.07    1081    5       0x08200000      38      2.31    0.45  <<< Note high idle start temp.
0       0.16    824     11      0x08200000      38      2.12    0.45
0       1.74    3430    44      0x00000000      38      2.65    0.45  <<< clear last times log bits
0       0.16    851     6       0x00000000      37      2.27    0.45
0       4.32    3313    269     0x00000000      75      47.15   0.45  <<< load applied
0       4.24    4585    458     0x08000800      78      97.16   0.45  <<< package power limit 2
0       2.80    4588    482     0x08000000      77      97.49   0.45  <<< temperature just high
0       2.87    4593    463     0x08000000      78      97.95   0.45
0       3.39    4600    465     0x08000000      78      97.68   0.45
0       2.66    4600    462     0x08000000      78      97.55   0.45
0       2.28    4584    490     0x08000000      78      97.97   0.45
0       3.29    4583    478     0x08000000      78      97.72   0.45
0       3.24    4595    465     0x08000000      77      97.52   0.45
0       2.47    4600    465     0x08000000      78      97.50   0.45
0       4.18    4570    464     0x08000000      78      97.72   0.45
0       2.51    4600    470     0x08000000      78      97.40   0.45
0       1.77    4601    482     0x08000000      78      97.33   0.45
0       3.13    4584    462     0x08000000      78      97.57   0.45
0       3.06    4600    466     0x08000000      78      97.77   0.45
0       2.86    4592    461     0x08000000      78      97.56   0.45
0       2.85    4569    486     0x08000000      78      97.99   0.45
0       2.96    4600    465     0x08000000      78      97.91   0.45
0       3.00    4585    451     0x08000000      78      97.68   0.45
0       2.06    4600    475     0x08000000      78      97.50   0.45
0       3.05    4594    462     0x08000000      78      97.78   0.45
0       3.11    4592    461     0x08000000      78      97.68   0.45
0       2.31    4546    463     0x08200020      73      93.00   0.45  <<< RATL
0       2.80    4525    454     0x08200000      78      91.29   0.45  <<< Oscillates within
0       3.32    4538    445     0x08200020      73      91.61   0.45  <<< 1 pstate
0       3.27    4557    434     0x08200000      78      93.12   0.45
0       3.26    4523    470     0x08200020      73      89.85   0.45  <<< rough estimate is
0       2.48    4586    466     0x08200020      74      95.67   0.45  <<< oscillation costs 0.4%
0       1.95    4521    468     0x08200000      76      87.93   0.45  <<< performance loss versus
0       3.28    4569    449     0x08200020      73      94.67   0.45  <<< the power limit 2 servo.
0       0.44    4546    495     0x08200000      78      91.77   0.45  <<< (very crude, hard to defend
0       1.91    4518    487     0x08200020      73      91.24   0.45  <<< data.)
0       3.25    4539    460     0x08200000      78      91.63   0.45
0       2.51    4546    469     0x08200020      74      91.12   0.45
0       3.60    4540    453     0x08200000      77      91.43   0.45
0       3.06    4542    463     0x08200020      73      91.56   0.45
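
The raw MSR_CORE_PERF_LIMIT_REASONS values in the TCC column above can be decoded bit by bit. The sketch below is hedged: the helper names are hypothetical, and the bit positions (5 and 21 for the RATL status and log bits, 11 and 27 for the package power limit 2 status and log bits) are assumptions taken from the Intel SDM layout for this MSR. They do agree with the annotations in the trace, but they should be verified against the SDM for the processor in use.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed bit layout of MSR_CORE_PERF_LIMIT_REASONS (0x64F). */
#define LIMIT_RATL_STATUS	(1u << 5)	/* Running Average Thermal Limit, active now */
#define LIMIT_PL2_STATUS	(1u << 11)	/* package power limit 2, active now */
#define LIMIT_RATL_LOG		(1u << 21)	/* RATL occurred since the log was cleared */
#define LIMIT_PL2_LOG		(1u << 27)	/* PL2 occurred since the log was cleared */

int ratl_active(uint32_t reasons)
{
	return (reasons & LIMIT_RATL_STATUS) != 0;
}

int ratl_logged(uint32_t reasons)
{
	return (reasons & LIMIT_RATL_LOG) != 0;
}

int pl2_active(uint32_t reasons)
{
	return (reasons & LIMIT_PL2_STATUS) != 0;
}

int pl2_logged(uint32_t reasons)
{
	return (reasons & LIMIT_PL2_LOG) != 0;
}

/* Print a one-line summary of the four decoded bits. */
void print_limit_reasons(uint32_t reasons)
{
	printf("0x%08x: RATL %s (logged: %s), PL2 %s (logged: %s)\n",
	       (unsigned int)reasons,
	       ratl_active(reasons) ? "active" : "inactive",
	       ratl_logged(reasons) ? "yes" : "no",
	       pl2_active(reasons) ? "active" : "inactive",
	       pl2_logged(reasons) ? "yes" : "no");
}
```

Read this way, 0x08000800 means power limit 2 is active and logged, while the alternation between 0x08200020 and 0x08200000 in the last rows is the RATL status bit toggling with its log bit staying set.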

... Doug

Patch

diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
index 8025b21f43fa..67de49cc9fb4 100644
--- a/drivers/thermal/intel/Kconfig
+++ b/drivers/thermal/intel/Kconfig
@@ -75,3 +75,11 @@  config INTEL_PCH_THERMAL
 	  Enable this to support thermal reporting on certain intel PCHs.
 	  Thermal reporting device will provide temperature reading,
 	  programmable trip points and other information.
+
+config INTEL_TCC_COOLING
+	tristate "Intel TCC offset cooling Driver"
+	depends on X86
+	help
+	  Enable this to support system cooling by adjusting the effective TCC
+	  activation temperature via the TCC Offset register, which is widely
+	  supported on modern Intel platforms.
diff --git a/drivers/thermal/intel/Makefile b/drivers/thermal/intel/Makefile
index 0d9736ced5d4..40e86973f88d 100644
--- a/drivers/thermal/intel/Makefile
+++ b/drivers/thermal/intel/Makefile
@@ -10,3 +10,4 @@  obj-$(CONFIG_INTEL_QUARK_DTS_THERMAL)	+= intel_quark_dts_thermal.o
 obj-$(CONFIG_INT340X_THERMAL)  += int340x_thermal/
 obj-$(CONFIG_INTEL_BXT_PMIC_THERMAL) += intel_bxt_pmic_thermal.o
 obj-$(CONFIG_INTEL_PCH_THERMAL)	+= intel_pch_thermal.o
+obj-$(CONFIG_INTEL_TCC_COOLING)	+= intel_tcc_cooling.o
diff --git a/drivers/thermal/intel/intel_tcc_cooling.c b/drivers/thermal/intel/intel_tcc_cooling.c
new file mode 100644
index 000000000000..aa6bbb9ba898
--- /dev/null
+++ b/drivers/thermal/intel/intel_tcc_cooling.c
@@ -0,0 +1,128 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * cooling device driver that activates the processor throttling by
+ * programming the TCC Offset register.
+ * Copyright (c) 2021, Intel Corporation.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/thermal.h>
+#include <asm/cpu_device_id.h>
+
+#define TCC_SHIFT 24
+#define TCC_MASK	(0x3fULL<<24)
+#define TCC_PROGRAMMABLE	BIT(30)
+
+static struct thermal_cooling_device *tcc_cdev;
+
+static int tcc_get_max_state(struct thermal_cooling_device *cdev, unsigned long
+			     *state)
+{
+	*state = TCC_MASK >> TCC_SHIFT;
+	return 0;
+}
+
+static int tcc_offset_update(int tcc)
+{
+	u64 val;
+	int err;
+
+	err = rdmsrl_safe(MSR_IA32_TEMPERATURE_TARGET, &val);
+	if (err)
+		return err;
+
+	val &= ~TCC_MASK;
+	val |= tcc << TCC_SHIFT;
+
+	err = wrmsrl_safe(MSR_IA32_TEMPERATURE_TARGET, val);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+static int tcc_get_cur_state(struct thermal_cooling_device *cdev, unsigned long
+			     *state)
+{
+	u64 val;
+	int err;
+
+	err = rdmsrl_safe(MSR_IA32_TEMPERATURE_TARGET, &val);
+	if (err)
+		return err;
+
+	*state = (val & TCC_MASK) >> TCC_SHIFT;
+	return 0;
+}
+
+static int tcc_set_cur_state(struct thermal_cooling_device *cdev, unsigned long
+			     state)
+{
+	return tcc_offset_update(state);
+}
+
+static const struct thermal_cooling_device_ops tcc_cooling_ops = {
+	.get_max_state = tcc_get_max_state,
+	.get_cur_state = tcc_get_cur_state,
+	.set_cur_state = tcc_set_cur_state,
+};
+
+static const struct x86_cpu_id tcc_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_L, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE_L, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, NULL),
+	{}
+};
+
+MODULE_DEVICE_TABLE(x86cpu, tcc_ids);
+
+static int __init tcc_cooling_init(void)
+{
+	int ret;
+	u64 val;
+	const struct x86_cpu_id *id;
+
+	int err;
+
+	id = x86_match_cpu(tcc_ids);
+	if (!id)
+		return -ENODEV;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, &val);
+	if (err)
+		return err;
+
+	if (!(val & TCC_PROGRAMMABLE))
+		return -ENODEV;
+
+	pr_info("Programmable TCC Offset detected\n");
+
+	tcc_cdev =
+	    thermal_cooling_device_register("TCC Offset", NULL,
+					    &tcc_cooling_ops);
+	if (IS_ERR(tcc_cdev)) {
+		ret = PTR_ERR(tcc_cdev);
+		return ret;
+	}
+	return 0;
+}
+
+module_init(tcc_cooling_init)
+
+static void __exit tcc_cooling_exit(void)
+{
+	thermal_cooling_device_unregister(tcc_cdev);
+}
+
+module_exit(tcc_cooling_exit)
+
+MODULE_DESCRIPTION("TCC offset cooling device Driver");
+MODULE_AUTHOR("Zhang Rui <rui.zhang@intel.com>");
+MODULE_LICENSE("GPL v2");