diff mbox series

[v3,04/14,RESEND] cpufreq: Add Hardware P-State (HWP) driver

Message ID 20230501193034.88575-5-jandryuk@gmail.com (mailing list archive)
State Superseded
Headers show
Series Intel Hardware P-States (HWP) support | expand

Commit Message

Jason Andryuk May 1, 2023, 7:30 p.m. UTC
From the Intel SDM: "Hardware-Controlled Performance States (HWP), which
autonomously selects performance states while utilizing OS supplied
performance guidance hints."

Enable HWP to run in autonomous mode by poking the correct MSRs.
cpufreq=xen:hwp enables and cpufreq=xen:hwp=0 disables.  The same for
hdc.

There is no interface to configure - xen_sysctl_pm_op/xenpm will
be to be extended to configure in subsequent patches.  It will run with
the default values, which should be the default 0x80 (out
of 0x0-0xff) energy/performance preference.

Unscientific powertop measurement of an mostly idle, customized OpenXT
install:
A 10th gen 6-core laptop showed battery discharge drop from ~9.x to
~7.x watts.
A 8th gen 4-core laptop dropped from ~10 to ~9

Power usage depends on many factors, especially display brightness, but
this does show an power saving in balanced mode when CPU utilization is
low.

HWP isn't compatible with an external governor - it doesn't take
explicit frequency requests.  Therefore a minimal internal governor,
hwp-internal, is also added as a placeholder.

While adding to the xen-command-line.pandoc entry, un-nest verbose from
minfreq.  They are independent.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>

---

We disable on cpuid_level < 0x16.  cpuid(0x16) is used to get the cpu
frequencies for calculating the APERF/MPERF.  Without it, things would
still work, but the averge cpufrequency output would be wrong.

My 8th & 10th gen test systems both report:
(XEN) HWP: 1 notify: 1 act_window: 1 energy_perf: 1 pkg_level: 0 peci: 0
(XEN) HWP: Hardware Duty Cycling (HDC) supported
(XEN) HWP: HW_FEEDBACK not supported

IA32_ENERGY_PERF_BIAS has not been tested.

For cpufreq=xen:hwp, placing the option inside the governor wouldn't
work.  Users would have to select the hwp-internal governor to turn off
hwp support.  hwp-internal isn't usable without hwp, and users wouldn't
be able to select a different governor.  That doesn't matter while hwp
defaults off, but it would if or when hwp defaults to enabled.

We can't use parse_boolean() since it requires a single name=val string
and cpufreq_handle_common_option is provided two strings.  Use
parse_bool() and manual handle no-hwp.

Write to disable the interrupt - the linux pstate driver does this.  We
don't use the interrupts, so we can just turn them off.  We aren't ready
to handle them, so we don't want any.  Unclear if this is necessary.
SDM says it's default disabled.

FAST_IA32_HWP_REQUEST was removed in v2.  The check in v1 was wrong,
it's a model specific feature and the CPUID bit is only available
after enabling via the MSR.  Support was untested since I don't have
hardware with the feature.  Writes are expected to be infrequent, so
just leave it out.

---
v2:
Alphabetize headers
Re-work driver registration
name hwp_drv_data anonymous union "hw"
Drop hwp_verbose_cont
style cleanups
Condense hwp_governor switch
hwp_cpufreq_target remove .raw from hwp_req assignment
Use typed-pointer in a few functions
Pass type to xzalloc
Add HWP_ENERGY_PERF_BALANCE/IA32_ENERGY_BIAS_BALANCE defines
Add XEN_HWP_GOVERNOR define for "hwp-internal"
Capitalize CPUID and MSR defines
Change '_' to '-' for energy-perf & act-window
Read-modify-write MSRs updates
Use FAST_IA32_HWP_REQUEST_MSR_ENABLE define
constify pointer in hwp_set_misc_turbo
Add space after non-fallthrough break in governor switch
Add IA32_ENERGY_BIAS_MASK define
Check CPUID_PM_LEAK for energy bias when needed
Fail initialization with curr_req = -1
Fold hwp_read_capabilities into hwp_init_msrs
Add command line cpufreq=xen:hwp
Add command line cpufreq=xen:hdc
Use per_cpu for hwp_drv_data pointers
Move hwp_energy_perf_bias call into hwp_write_request
energy_perf 0 is valid, so hwp_energy_perf_bias cannot be skipped
Ensure we don't generate interrupts
Remove Fast Write of Uncore MSR
Initialize hwp_drv_data from curr_req
Use SPDX line instead of license text in hwp.c

v3:
Add cf_check to cpufreq_gov_hwp_init() - Marek
Print cpuid_level with %#x - Marek
---
 docs/misc/xen-command-line.pandoc         |   8 +-
 xen/arch/x86/acpi/cpufreq/Makefile        |   1 +
 xen/arch/x86/acpi/cpufreq/cpufreq.c       |   5 +-
 xen/arch/x86/acpi/cpufreq/hwp.c           | 506 ++++++++++++++++++++++
 xen/arch/x86/include/asm/cpufeature.h     |  13 +-
 xen/arch/x86/include/asm/msr-index.h      |  13 +
 xen/drivers/cpufreq/cpufreq.c             |  32 ++
 xen/include/acpi/cpufreq/cpufreq.h        |   3 +
 xen/include/acpi/cpufreq/processor_perf.h |   3 +
 xen/include/public/sysctl.h               |   1 +
 10 files changed, 581 insertions(+), 4 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c

Comments

Jan Beulich May 4, 2023, 1:11 p.m. UTC | #1
On 01.05.2023 21:30, Jason Andryuk wrote:
> For cpufreq=xen:hwp, placing the option inside the governor wouldn't
> work.  Users would have to select the hwp-internal governor to turn off
> hwp support.

I'm afraid I don't understand this, and you'll find a comment towards
this further down. Even when ...

>  hwp-internal isn't usable without hwp, and users wouldn't
> be able to select a different governor.  That doesn't matter while hwp
> defaults off, but it would if or when hwp defaults to enabled.

... it starts defaulting to enabled, selecting another governor can
simply have the side effect of turning off hwp.

> Write to disable the interrupt - the linux pstate driver does this.  We
> don't use the interrupts, so we can just turn them off.  We aren't ready
> to handle them, so we don't want any.  Unclear if this is necessary.
> SDM says it's default disabled.

Definitely better to be on the safe side.

> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
>  available support.
>  
>  ### cpufreq
> -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
> +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`

Considering you use a special internal governor, the 4 governor alternatives are
meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
was another governor name would seem better to me. This would then also get rid
of one of the two special "no-" prefix parsing cases (which I'm not overly
happy about).

Even if not done that way I'm puzzled by the way you spell out the interaction
of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
"hwp" was specified, so even if not merged with the governors group "hwp" should
come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
way you've spelled it out it actually looks to be kind of the other way around.)

Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
was specified.

Overall I'm getting the impression that beyond your "verbose" related adjustment
more is needed, if you're meaning to get things closer to how we parse the
option (splitting across multiple lines to help see what I mean):

`= none
 | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
                          [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
                          [,verbose]]}
 | dom0-kernel`

(We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
maxfreq, but better be more tight in the doc than too relaxed.)

Furthermore while max/min freq don't apply directly, there are still two MSRs
controlling bounds at the package and logical processor levels.

> --- /dev/null
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -0,0 +1,506 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> + *
> + * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
> + */
> +
> +#include <xen/cpumask.h>
> +#include <xen/init.h>
> +#include <xen/param.h>
> +#include <xen/xmalloc.h>
> +#include <asm/io.h>
> +#include <asm/msr.h>
> +#include <acpi/cpufreq/cpufreq.h>
> +
> +static bool feature_hwp;
> +static bool feature_hwp_notification;
> +static bool feature_hwp_activity_window;
> +static bool feature_hwp_energy_perf;
> +static bool feature_hwp_pkg_level_ctl;
> +static bool feature_hwp_peci;
> +
> +static bool feature_hdc;

Most (all?) of these want to be __ro_after_init, I expect.

> +__initdata bool opt_cpufreq_hwp = false;
> +__initdata bool opt_cpufreq_hdc = true;

Nit (style): Please put annotations after the type.

> +#define HWP_ENERGY_PERF_BALANCE         0x80
> +#define IA32_ENERGY_BIAS_BALANCE        0x7
> +#define IA32_ENERGY_BIAS_MAX_POWERSAVE  0xf
> +#define IA32_ENERGY_BIAS_MASK           0xf
> +
> +union hwp_request
> +{
> +    struct
> +    {
> +        uint64_t min_perf:8;
> +        uint64_t max_perf:8;
> +        uint64_t desired:8;
> +        uint64_t energy_perf:8;
> +        uint64_t activity_window:10;
> +        uint64_t package_control:1;
> +        uint64_t reserved:16;
> +        uint64_t activity_window_valid:1;
> +        uint64_t energy_perf_valid:1;
> +        uint64_t desired_valid:1;
> +        uint64_t max_perf_valid:1;
> +        uint64_t min_perf_valid:1;

The boolean fields here would probably better be of type "bool". I also
don't see the need for using uint64_t for any of the other fields -
unsigned int will be quite fine, I think. Only ...

> +    };
> +    uint64_t raw;

... this wants to keep this type. (Same again below then.)

> +bool __init hwp_available(void)
> +{
> +    unsigned int eax, ecx, unused;
> +    bool use_hwp;
> +
> +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> +    {
> +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> +                    boot_cpu_data.cpuid_level);
> +        return false;
> +    }
> +
> +    if ( boot_cpu_data.cpuid_level < 0x16 )
> +    {
> +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
> +                 boot_cpu_data.cpuid_level);
> +        return false;
> +    }
> +
> +    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
> +
> +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
> +         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
> +    {
> +        hwp_verbose("HWP disabled: No energy/performance preference available");
> +        return false;
> +    }
> +
> +    feature_hwp                 = eax & CPUID6_EAX_HWP;
> +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
> +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> +    feature_hwp_energy_perf     =
> +        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
> +    feature_hwp_pkg_level_ctl   = eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
> +    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
> +
> +    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
> +                feature_hwp, feature_hwp_notification,
> +                feature_hwp_activity_window, feature_hwp_energy_perf,
> +                feature_hwp_pkg_level_ctl, feature_hwp_peci);
> +
> +    if ( !feature_hwp )
> +        return false;
> +
> +    feature_hdc = eax & CPUID6_EAX_HDC;
> +
> +    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
> +                feature_hdc ? "" : "not ",
> +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
> +                            : "");
> +
> +    feature_hdc = feature_hdc && opt_cpufreq_hdc;
> +
> +    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
> +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");

You report this, but you don't really use it?

> +    use_hwp = feature_hwp && opt_cpufreq_hwp;

There's a lot of output you may produce until you make it here, which is
largely meaningless when opt_cpufreq_hwp == false. Is there a reason you
don't check that flag first thing in the function?

> +static void hdc_set_pkg_hdc_ctl(bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
> +
> +        return;
> +    }
> +
> +    if ( val )
> +        msr |= IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
> +    else
> +        msr &= ~IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
> +
> +    if ( wrmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> +        hwp_err("error wrmsr_safe(MSR_IA32_PKG_HDC_CTL): %016lx\n", msr);
> +}
> +
> +static void hdc_set_pm_ctl1(bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
> +
> +        return;
> +    }
> +
> +    if ( val )
> +        msr |= IA32_PM_CTL1_HDC_ALLOW_BLOCK;
> +    else
> +        msr &= ~IA32_PM_CTL1_HDC_ALLOW_BLOCK;
> +
> +    if ( wrmsr_safe(MSR_IA32_PM_CTL1, msr) )
> +        hwp_err("error wrmsr_safe(MSR_IA32_PM_CTL1): %016lx\n", msr);
> +}

For both functions: Elsewhere you also log the affected CPU in hwp_err().
Without this I'm not convinced the logging here is very useful. In fact I
wonder whether hwp_err() shouldn't take care of this and/or the "error"
part of the string literal. A HWP: prefix might also not be bad ...

> +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
> +{
> +    uint32_t base_khz, max_khz, bus_khz, edx;
> +
> +    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
> +
> +    /* aperf/mperf scales base. */
> +    policy->cpuinfo.perf_freq = base_khz * 1000;
> +    policy->cpuinfo.min_freq = base_khz * 1000;
> +    policy->cpuinfo.max_freq = max_khz * 1000;
> +    policy->min = base_khz * 1000;
> +    policy->max = max_khz * 1000;
> +    policy->cur = 0;

What is the comment intended to be telling me here?

> +static void cf_check hwp_init_msrs(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    uint64_t val;
> +
> +    /*
> +     * Package level MSR, but we don't have a good idea of packages here, so
> +     * just do it everytime.
> +     */
> +    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu);
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    /* Ensure we don't generate interrupts */
> +    if ( feature_hwp_notification )
> +        wrmsr_safe(MSR_IA32_HWP_INTERRUPT, 0);
> +
> +    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
> +    if ( !(val & IA32_PM_ENABLE_HWP_ENABLE) )
> +    {
> +        val |= IA32_PM_ENABLE_HWP_ENABLE;
> +        if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) )
> +        {
> +            hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n",
> +                    policy->cpu, val);
> +            data->curr_req.raw = -1;
> +            return;
> +        }
> +    }
> +
> +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
> +                policy->cpu);
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    if ( !feature_hwp_energy_perf ) {

Nit: Brace placement.

> +        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
> +        {
> +            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> +            data->curr_req.raw = -1;
> +
> +            return;
> +        }
> +
> +        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
> +    }

In order to not need to undo the "enable" you've already done, maybe that
should move down here? With all the sanity checking you do here, maybe
you should also check that the write of the enable bit actually took
effect?

> +/* val 0 - highest performance, 15 - maximum energy savings */
> +static void hwp_energy_perf_bias(const struct hwp_drv_data *data)
> +{
> +    uint64_t msr;
> +    uint8_t val = data->energy_perf;
> +
> +    ASSERT(val <= IA32_ENERGY_BIAS_MAX_POWERSAVE);
> +
> +    if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> +
> +        return;
> +    }
> +
> +    msr &= ~IA32_ENERGY_BIAS_MASK;
> +    msr |= val;
> +
> +    if ( wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
> +        hwp_err("error wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS): %016lx\n", msr);
> +}
> +
> +static void cf_check hwp_write_request(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    union hwp_request hwp_req = data->curr_req;
> +
> +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
> +    if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) )
> +    {
> +        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n",
> +                policy->cpu, hwp_req.raw);
> +        rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw);
> +    }
> +
> +    if ( !feature_hwp_energy_perf )
> +        hwp_energy_perf_bias(data);
> +
> +}
> +
> +static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
> +                                       unsigned int target_freq,
> +                                       unsigned int relation)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> +    /* Zero everything to ensure reserved bits are zero... */
> +    union hwp_request hwp_req = { .raw = 0 };
> +
> +    /* .. and update from there */
> +    hwp_req.min_perf = data->minimum;
> +    hwp_req.max_perf = data->maximum;
> +    hwp_req.desired = data->desired;
> +    if ( feature_hwp_energy_perf )
> +        hwp_req.energy_perf = data->energy_perf;
> +    if ( feature_hwp_activity_window )
> +        hwp_req.activity_window = data->activity_window;
> +
> +    if ( hwp_req.raw == data->curr_req.raw )
> +        return 0;
> +
> +    data->curr_req = hwp_req;
> +
> +    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
> +    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
> +
> +    return 0;
> +}

If I'm not mistaken these 3 functions can only be reached from the user
space tool (via set_cpufreq_para()). On that path I don't think there
should be any hwp_err(); definitely not in non-verbose mode. Instead it
would be good if a sensible error code could be reported back. (Same
then for hwp_cpufreq_update() and its helper.)

> --- a/xen/arch/x86/include/asm/cpufeature.h
> +++ b/xen/arch/x86/include/asm/cpufeature.h
> @@ -46,8 +46,17 @@ extern struct cpuinfo_x86 boot_cpu_data;
>  #define cpu_has(c, bit)		test_bit(bit, (c)->x86_capability)
>  #define boot_cpu_has(bit)	test_bit(bit, boot_cpu_data.x86_capability)
>  
> -#define CPUID_PM_LEAF                    6
> -#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
> +#define CPUID_PM_LEAF                                6
> +#define CPUID6_EAX_HWP                               (_AC(1, U) <<  7)
> +#define CPUID6_EAX_HWP_NOTIFICATION                  (_AC(1, U) <<  8)
> +#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               (_AC(1, U) <<  9)
> +#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE (_AC(1, U) << 10)
> +#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         (_AC(1, U) << 11)
> +#define CPUID6_EAX_HDC                               (_AC(1, U) << 13)
> +#define CPUID6_EAX_HWP_PECI                          (_AC(1, U) << 16)
> +#define CPUID6_EAX_HW_FEEDBACK                       (_AC(1, U) << 19)

Perhaps better without open-coding BIT()?

I also find it a little odd that e.g. bit 17 is left out here despite you
declaring the 5 "valid" bits in union hwp_request (which are qualified by
this CPUID bit afaict).

> +#define CPUID6_ECX_APERFMPERF_CAPABILITY             0x1
> +#define CPUID6_ECX_IA32_ENERGY_PERF_BIAS             0x8

Why not the same form here?

> --- a/xen/arch/x86/include/asm/msr-index.h
> +++ b/xen/arch/x86/include/asm/msr-index.h
> @@ -151,6 +151,13 @@
>  
>  #define MSR_PKRS                            0x000006e1
>  
> +#define MSR_IA32_PM_ENABLE                  0x00000770
> +#define  IA32_PM_ENABLE_HWP_ENABLE          (_AC(1, ULL) <<  0)
> +
> +#define MSR_IA32_HWP_CAPABILITIES           0x00000771
> +#define MSR_IA32_HWP_INTERRUPT              0x00000773
> +#define MSR_IA32_HWP_REQUEST                0x00000774

I think for new MSRs being added here in particular Andrew would like to
see the IA32 infixes omitted. (I'd extend this then to
CPUID6_ECX_IA32_ENERGY_PERF_BIAS as well.)

> @@ -165,6 +172,11 @@
>  #define  PASID_PASID_MASK                   0x000fffff
>  #define  PASID_VALID                        (_AC(1, ULL) << 31)
>  
> +#define MSR_IA32_PKG_HDC_CTL                0x00000db0
> +#define  IA32_PKG_HDC_CTL_HDC_PKG_ENABLE    (_AC(1, ULL) <<  0)

The name has two redundant infixes, which looks odd, but then I can't
suggest any better without going too much out of sync with the SDM.

> --- a/xen/drivers/cpufreq/cpufreq.c
> +++ b/xen/drivers/cpufreq/cpufreq.c
> @@ -565,6 +565,38 @@ static void cpufreq_cmdline_common_para(struct cpufreq_policy *new_policy)
>  
>  static int __init cpufreq_handle_common_option(const char *name, const char *val)
>  {
> +    if (!strcmp(name, "hdc")) {
> +        if (val) {
> +            int ret = parse_bool(val, NULL);
> +            if (ret != -1) {
> +                opt_cpufreq_hdc = ret;
> +                return 1;
> +            }
> +        } else {
> +            opt_cpufreq_hdc = true;
> +            return 1;
> +        }
> +    } else if (!strcmp(name, "no-hdc")) {
> +        opt_cpufreq_hdc = false;
> +        return 1;
> +    }

I think recognizing a "no-" prefix would want to be separated out, and be
restricted to val being NULL. It would result in val being pointed at the
string "no" (or "off" or anything else parse_bool() recognizes as negative
indicator).

Yet if, as suggested above, "hwp" became a "fake" governor also when
parsing the command line, "hdc" could actually be handled in its
handle_option() hook.

Jan
Jason Andryuk May 4, 2023, 4:56 p.m. UTC | #2
On Thu, May 4, 2023 at 9:11 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > For cpufreq=xen:hwp, placing the option inside the governor wouldn't
> > work.  Users would have to select the hwp-internal governor to turn off
> > hwp support.
>
> I'm afraid I don't understand this, and you'll find a comment towards
> this further down. Even when ...
>
> >  hwp-internal isn't usable without hwp, and users wouldn't
> > be able to select a different governor.  That doesn't matter while hwp
> > defaults off, but it would if or when hwp defaults to enabled.
>
> ... it starts defaulting to enabled, selecting another governor can
> simply have the side effect of turning off hwp.

I didn't think of that - makes sense.

> > Write to disable the interrupt - the linux pstate driver does this.  We
> > don't use the interrupts, so we can just turn them off.  We aren't ready
> > to handle them, so we don't want any.  Unclear if this is necessary.
> > SDM says it's default disabled.
>
> Definitely better to be on the safe side.
>
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
> >  available support.
> >
> >  ### cpufreq
> > -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
> > +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
>
> Considering you use a special internal governor, the 4 governor alternatives are
> meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
> was another governor name would seem better to me. This would then also get rid
> of one of the two special "no-" prefix parsing cases (which I'm not overly
> happy about).
>
> Even if not done that way I'm puzzled by the way you spell out the interaction
> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
> "hwp" was specified, so even if not merged with the governors group "hwp" should
> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
> way you've spelled it out it actually looks to be kind of the other way around.)

I placed them in alphabetical order, but, yes, it doesn't make sense.

> Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
> was specified.
>
> Overall I'm getting the impression that beyond your "verbose" related adjustment
> more is needed, if you're meaning to get things closer to how we parse the
> option (splitting across multiple lines to help see what I mean):
>
> `= none
>  | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
>                           [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
>                           [,verbose]]}
>  | dom0-kernel`
>
> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
> maxfreq, but better be more tight in the doc than too relaxed.)
>
> Furthermore while max/min freq don't apply directly, there are still two MSRs
> controlling bounds at the package and logical processor levels.

Well, we only program the logical processor level MSRs because we
don't have a good idea of the packages to know when we can skip
writing an MSR.

How about this:
`= none
 | {{ <boolean> | xen } {
[:{powersave|performance|ondemand|userspace}[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]
                        | [:hwp[,hdc]] }
                          [,verbose]]}
 | dom0-kernel`

i.e:
xen:hwp,hdc

> > --- /dev/null
> > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> > @@ -0,0 +1,506 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> > + *
> > + * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
> > + */
> > +
> > +#include <xen/cpumask.h>
> > +#include <xen/init.h>
> > +#include <xen/param.h>
> > +#include <xen/xmalloc.h>
> > +#include <asm/io.h>
> > +#include <asm/msr.h>
> > +#include <acpi/cpufreq/cpufreq.h>
> > +
> > +static bool feature_hwp;
> > +static bool feature_hwp_notification;
> > +static bool feature_hwp_activity_window;
> > +static bool feature_hwp_energy_perf;
> > +static bool feature_hwp_pkg_level_ctl;
> > +static bool feature_hwp_peci;
> > +
> > +static bool feature_hdc;
>
> Most (all?) of these want to be __ro_after_init, I expect.

I think you are correct.  (This pre-dates __ro_after_init and I didn't
update it.)

> > +__initdata bool opt_cpufreq_hwp = false;
> > +__initdata bool opt_cpufreq_hdc = true;
>
> Nit (style): Please put annotations after the type.
>
> > +#define HWP_ENERGY_PERF_BALANCE         0x80
> > +#define IA32_ENERGY_BIAS_BALANCE        0x7
> > +#define IA32_ENERGY_BIAS_MAX_POWERSAVE  0xf
> > +#define IA32_ENERGY_BIAS_MASK           0xf
> > +
> > +union hwp_request
> > +{
> > +    struct
> > +    {
> > +        uint64_t min_perf:8;
> > +        uint64_t max_perf:8;
> > +        uint64_t desired:8;
> > +        uint64_t energy_perf:8;
> > +        uint64_t activity_window:10;
> > +        uint64_t package_control:1;
> > +        uint64_t reserved:16;
> > +        uint64_t activity_window_valid:1;
> > +        uint64_t energy_perf_valid:1;
> > +        uint64_t desired_valid:1;
> > +        uint64_t max_perf_valid:1;
> > +        uint64_t min_perf_valid:1;
>
> The boolean fields here would probably better be of type "bool". I also
> don't see the need for using uint64_t for any of the other fields -
> unsigned int will be quite fine, I think. Only ...

This is the hardware MSR format, so it seemed natural to use uint64_t
and the bit fields.  To me, uint64_t foo:$bits; better shows that we
are dividing up a single hardware register using bit fields.
Honestly, I'm unfamiliar with the finer points of laying out bitfields
with bool.  And the 10 bits of activity window throws off aligning to
standard types.

This seems to have the correct layout:
struct
{
        unsigned char min_perf;
        unsigned char max_perf;
        unsigned char desired;
        unsigned char energy_perf;
        unsigned int activity_window:10;
        bool package_control:1;
        unsigned int reserved:16;
        bool activity_window_valid:1;
        bool energy_perf_valid:1;
        bool desired_valid:1;
        bool max_perf_valid:1;
        bool min_perf_valid:1;
} ;

Or would you prefer the first 8 bit ones to be unsigned int
min_perf:8?  The bools seem to need :1, which doesn't seem to be
gaining us much, IMO.  I'd strongly prefer just keeping it as I have
it, but I will change it however you like.

> > +    };
> > +    uint64_t raw;
>
> ... this wants to keep this type. (Same again below then.)

For "below", do you want:

        struct
        {
            unsigned char highest;
            unsigned char guaranteed;
            unsigned char most_efficient;
            unsigned char lowest;
            unsigned int reserved;
        } hw;
?

> > +bool __init hwp_available(void)
> > +{
> > +    unsigned int eax, ecx, unused;
> > +    bool use_hwp;
> > +
> > +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> > +    {
> > +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> > +                    boot_cpu_data.cpuid_level);
> > +        return false;
> > +    }
> > +
> > +    if ( boot_cpu_data.cpuid_level < 0x16 )
> > +    {
> > +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
> > +                 boot_cpu_data.cpuid_level);
> > +        return false;
> > +    }
> > +
> > +    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
> > +
> > +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
> > +         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
> > +    {
> > +        hwp_verbose("HWP disabled: No energy/performance preference available");
> > +        return false;
> > +    }
> > +
> > +    feature_hwp                 = eax & CPUID6_EAX_HWP;
> > +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
> > +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> > +    feature_hwp_energy_perf     =
> > +        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
> > +    feature_hwp_pkg_level_ctl   = eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
> > +    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
> > +
> > +    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
> > +                feature_hwp, feature_hwp_notification,
> > +                feature_hwp_activity_window, feature_hwp_energy_perf,
> > +                feature_hwp_pkg_level_ctl, feature_hwp_peci);
> > +
> > +    if ( !feature_hwp )
> > +        return false;
> > +
> > +    feature_hdc = eax & CPUID6_EAX_HDC;
> > +
> > +    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
> > +                feature_hdc ? "" : "not ",
> > +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
> > +                            : "");
> > +
> > +    feature_hdc = feature_hdc && opt_cpufreq_hdc;
> > +
> > +    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
> > +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
>
> You report this, but you don't really use it?

Correct.  I needed to know what capabilities my processors have.

feature_hwp_pkg_level_ctl and feature_hwp_peci can also be dropped
since they aren't used beyond printing their values.  I'd still lean
toward keeping their printing under verbose since otherwise there
isn't a convenient way to know if they are available without
recompiling.

> > +    use_hwp = feature_hwp && opt_cpufreq_hwp;
>
> There's a lot of output you may produce until you make it here, which is
> largely meaningless when opt_cpufreq_hwp == false. Is there a reason you
> don't check that flag first thing in the function?

opt_cpufreq_hwp can be checked earlier for an early exit, yes.  The
code came about during development to print all the HWP capabilities
even if it wasn't enabled.  But eliminating it now makes sense.

> > +static void hdc_set_pkg_hdc_ctl(bool val)
> > +{
> > +    uint64_t msr;
> > +
> > +    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> > +    {
> > +        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
> > +
> > +        return;
> > +    }
> > +
> > +    if ( val )
> > +        msr |= IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
> > +    else
> > +        msr &= ~IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
> > +
> > +    if ( wrmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> > +        hwp_err("error wrmsr_safe(MSR_IA32_PKG_HDC_CTL): %016lx\n", msr);
> > +}
> > +
> > +static void hdc_set_pm_ctl1(bool val)
> > +{
> > +    uint64_t msr;
> > +
> > +    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
> > +    {
> > +        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
> > +
> > +        return;
> > +    }
> > +
> > +    if ( val )
> > +        msr |= IA32_PM_CTL1_HDC_ALLOW_BLOCK;
> > +    else
> > +        msr &= ~IA32_PM_CTL1_HDC_ALLOW_BLOCK;
> > +
> > +    if ( wrmsr_safe(MSR_IA32_PM_CTL1, msr) )
> > +        hwp_err("error wrmsr_safe(MSR_IA32_PM_CTL1): %016lx\n", msr);
> > +}
>
> For both functions: Elsewhere you also log the affected CPU in hwp_err().
> Without this I'm not convinced the logging here is very useful. In fact I
> wonder whether hwp_err() shouldn't take care of this and/or the "error"
> part of the string literal. A HWP: prefix might also not be bad ...

Sounds good.  I'll investigate.

> > +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
> > +{
> > +    uint32_t base_khz, max_khz, bus_khz, edx;
> > +
> > +    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
> > +
> > +    /* aperf/mperf scales base. */
> > +    policy->cpuinfo.perf_freq = base_khz * 1000;
> > +    policy->cpuinfo.min_freq = base_khz * 1000;
> > +    policy->cpuinfo.max_freq = max_khz * 1000;
> > +    policy->min = base_khz * 1000;
> > +    policy->max = max_khz * 1000;
> > +    policy->cur = 0;
>
> What is the comment intended to be telling me here?

When I was surprised to discover that I needed to pass in the base
frequency for proper aperf/mperf scaling, it seemed relevant at the
time as it's the opposite of ACPI cpufreq.  It can be dropped now.

> > +static void cf_check hwp_init_msrs(void *info)
> > +{
> > +    struct cpufreq_policy *policy = info;
> > +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> > +    uint64_t val;
> > +
> > +    /*
> > +     * Package level MSR, but we don't have a good idea of packages here, so
> > +     * just do it everytime.
> > +     */
> > +    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
> > +    {
> > +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu);
> > +        data->curr_req.raw = -1;
> > +        return;
> > +    }
> > +
> > +    /* Ensure we don't generate interrupts */
> > +    if ( feature_hwp_notification )
> > +        wrmsr_safe(MSR_IA32_HWP_INTERRUPT, 0);
> > +
> > +    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
> > +    if ( !(val & IA32_PM_ENABLE_HWP_ENABLE) )
> > +    {
> > +        val |= IA32_PM_ENABLE_HWP_ENABLE;
> > +        if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) )
> > +        {
> > +            hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n",
> > +                    policy->cpu, val);
> > +            data->curr_req.raw = -1;
> > +            return;
> > +        }
> > +    }
> > +
> > +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
> > +    {
> > +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
> > +                policy->cpu);
> > +        data->curr_req.raw = -1;
> > +        return;
> > +    }
> > +
> > +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
> > +    {
> > +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
> > +        data->curr_req.raw = -1;
> > +        return;
> > +    }
> > +
> > +    if ( !feature_hwp_energy_perf ) {
>
> Nit: Brace placement.
>
> > +        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
> > +        {
> > +            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> > +            data->curr_req.raw = -1;
> > +
> > +            return;
> > +        }
> > +
> > +        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
> > +    }
>
> In order to not need to undo the "enable" you've already done, maybe that
> should move down here?

HWP needs to be enabled before the Capabilities and Request MSRs can
be read.  Reading them shouldn't fail, but it seems safer to use
rdmsr_safe in case something goes wrong.

I think I will rip out ENERGY_PERF_BIAS.  The Linux driver doesn't
support it.  I thought it might be necessary, but my test machines
don't need it.  The Qubes report with SkyLake wasn't using
ENERGY_PERF_BIAS, and SkyLake introduced HWP.  So the set of machines
needing it is probably small and older, so it probably isn't worth
supporting.

> With all the sanity checking you do here, maybe
> you should also check that the write of the enable bit actually took
> effect?

I can add that.

> > +/* val 0 - highest performance, 15 - maximum energy savings */
> > +static void hwp_energy_perf_bias(const struct hwp_drv_data *data)
> > +{
> > +    uint64_t msr;
> > +    uint8_t val = data->energy_perf;
> > +
> > +    ASSERT(val <= IA32_ENERGY_BIAS_MAX_POWERSAVE);
> > +
> > +    if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
> > +    {
> > +        hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> > +
> > +        return;
> > +    }
> > +
> > +    msr &= ~IA32_ENERGY_BIAS_MASK;
> > +    msr |= val;
> > +
> > +    if ( wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
> > +        hwp_err("error wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS): %016lx\n", msr);
> > +}
> > +
> > +static void cf_check hwp_write_request(void *info)
> > +{
> > +    struct cpufreq_policy *policy = info;
> > +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> > +    union hwp_request hwp_req = data->curr_req;
> > +
> > +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
> > +    if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) )
> > +    {
> > +        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n",
> > +                policy->cpu, hwp_req.raw);
> > +        rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw);
> > +    }
> > +
> > +    if ( !feature_hwp_energy_perf )
> > +        hwp_energy_perf_bias(data);
> > +
> > +}
> > +
> > +static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
> > +                                       unsigned int target_freq,
> > +                                       unsigned int relation)
> > +{
> > +    unsigned int cpu = policy->cpu;
> > +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> > +    /* Zero everything to ensure reserved bits are zero... */
> > +    union hwp_request hwp_req = { .raw = 0 };
> > +
> > +    /* .. and update from there */
> > +    hwp_req.min_perf = data->minimum;
> > +    hwp_req.max_perf = data->maximum;
> > +    hwp_req.desired = data->desired;
> > +    if ( feature_hwp_energy_perf )
> > +        hwp_req.energy_perf = data->energy_perf;
> > +    if ( feature_hwp_activity_window )
> > +        hwp_req.activity_window = data->activity_window;
> > +
> > +    if ( hwp_req.raw == data->curr_req.raw )
> > +        return 0;
> > +
> > +    data->curr_req = hwp_req;
> > +
> > +    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
> > +    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
> > +
> > +    return 0;
> > +}
>
> If I'm not mistaken these 3 functions can only be reached from the user
> space tool (via set_cpufreq_para()). On that path I don't think there
> should be any hwp_err(); definitely not in non-verbose mode. Instead it
> would be good if a sensible error code could be reported back. (Same
> then for hwp_cpufreq_update() and its helper.)

I'll investigate this.  I guess I'll have to stash a result in struct
hwp_drv_data.

> > --- a/xen/arch/x86/include/asm/cpufeature.h
> > +++ b/xen/arch/x86/include/asm/cpufeature.h
> > @@ -46,8 +46,17 @@ extern struct cpuinfo_x86 boot_cpu_data;
> >  #define cpu_has(c, bit)              test_bit(bit, (c)->x86_capability)
> >  #define boot_cpu_has(bit)    test_bit(bit, boot_cpu_data.x86_capability)
> >
> > -#define CPUID_PM_LEAF                    6
> > -#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
> > +#define CPUID_PM_LEAF                                6
> > +#define CPUID6_EAX_HWP                               (_AC(1, U) <<  7)
> > +#define CPUID6_EAX_HWP_NOTIFICATION                  (_AC(1, U) <<  8)
> > +#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               (_AC(1, U) <<  9)
> > +#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE (_AC(1, U) << 10)
> > +#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         (_AC(1, U) << 11)
> > +#define CPUID6_EAX_HDC                               (_AC(1, U) << 13)
> > +#define CPUID6_EAX_HWP_PECI                          (_AC(1, U) << 16)
> > +#define CPUID6_EAX_HW_FEEDBACK                       (_AC(1, U) << 19)
>
> Perhaps better without open-coding BIT()?

Ok.

> I also find it a little odd that e.g. bit 17 is left out here despite you
> declaring the 5 "valid" bits in union hwp_request (which are qualified by
> this CPUID bit afaict).

Well, I thought I wasn't supposed to introduce unused defines, so I
didn't add one for 17.  For union hwp_request, the "valid" bits are
part of the register structure, so it makes sense to include them
instead of an incomplete definition.  IIRC, at some point I set the
"valid" bits when I wasn't supposed to, and they caused the wrmsr
calls to fail.  That might have been because my test machines don't
have package-level HWP.

(I was confused when the CPUID section stated "Bit 17: Flexible HWP is
supported if set.", but there are no further references to "Flexible
HWP" in the SDM.)

> > +#define CPUID6_ECX_APERFMPERF_CAPABILITY             0x1
> > +#define CPUID6_ECX_IA32_ENERGY_PERF_BIAS             0x8
>
> Why not the same form here?

I was re-indenting APERFMPERF, and added ENERGY_PERF_BIAS in a
consistent style.  I will update with BIT().

> > --- a/xen/arch/x86/include/asm/msr-index.h
> > +++ b/xen/arch/x86/include/asm/msr-index.h
> > @@ -151,6 +151,13 @@
> >
> >  #define MSR_PKRS                            0x000006e1
> >
> > +#define MSR_IA32_PM_ENABLE                  0x00000770
> > +#define  IA32_PM_ENABLE_HWP_ENABLE          (_AC(1, ULL) <<  0)
> > +
> > +#define MSR_IA32_HWP_CAPABILITIES           0x00000771
> > +#define MSR_IA32_HWP_INTERRUPT              0x00000773
> > +#define MSR_IA32_HWP_REQUEST                0x00000774
>
> I think for new MSRs being added here in particular Andrew would like to
> see the IA32 infixes omitted. (I'd extend this then to
> CPUID6_ECX_IA32_ENERGY_PERF_BIAS as well.)

Ok.

> > @@ -165,6 +172,11 @@
> >  #define  PASID_PASID_MASK                   0x000fffff
> >  #define  PASID_VALID                        (_AC(1, ULL) << 31)
> >
> > +#define MSR_IA32_PKG_HDC_CTL                0x00000db0
> > +#define  IA32_PKG_HDC_CTL_HDC_PKG_ENABLE    (_AC(1, ULL) <<  0)
>
> The name has two redundant infixes, which looks odd, but then I can't
> suggest any better without going too much out of sync with the SDM.

Yes, it's not a good name, but I was trying to keep close to the SDM.
FAOD, these should drop IA32_ to become:
MSR_PKG_HDC_CTL
PKG_HDC_CTL_HDC_PKG_ENABLE
?

> > --- a/xen/drivers/cpufreq/cpufreq.c
> > +++ b/xen/drivers/cpufreq/cpufreq.c
> > @@ -565,6 +565,38 @@ static void cpufreq_cmdline_common_para(struct cpufreq_policy *new_policy)
> >
> >  static int __init cpufreq_handle_common_option(const char *name, const char *val)
> >  {
> > +    if (!strcmp(name, "hdc")) {
> > +        if (val) {
> > +            int ret = parse_bool(val, NULL);
> > +            if (ret != -1) {
> > +                opt_cpufreq_hdc = ret;
> > +                return 1;
> > +            }
> > +        } else {
> > +            opt_cpufreq_hdc = true;
> > +            return 1;
> > +        }
> > +    } else if (!strcmp(name, "no-hdc")) {
> > +        opt_cpufreq_hdc = false;
> > +        return 1;
> > +    }
>
> I think recognizing a "no-" prefix would want to be separated out, and be
> restricted to val being NULL. It would result in val being pointed at the
> string "no" (or "off" or anything else parse_bool() recognizes as negative
> indicator).
>
> Yet if, as suggested above, "hwp" became a "fake" governor also when
> parsing the command line, "hdc" could actually be handled in its
> handle_option() hook.

Makes sense.

Thank you for taking the time to review this.

Regards,
Jason
Jan Beulich May 5, 2023, 7:01 a.m. UTC | #3
On 04.05.2023 18:56, Jason Andryuk wrote:
> On Thu, May 4, 2023 at 9:11 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>> --- a/docs/misc/xen-command-line.pandoc
>>> +++ b/docs/misc/xen-command-line.pandoc
>>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
>>>  available support.
>>>
>>>  ### cpufreq
>>> -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
>>> +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
>>
>> Considering you use a special internal governor, the 4 governor alternatives are
>> meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
>> was another governor name would seem better to me. This would then also get rid
>> of one of the two special "no-" prefix parsing cases (which I'm not overly
>> happy about).
>>
>> Even if not done that way I'm puzzled by the way you spell out the interaction
>> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
>> "hwp" was specified, so even if not merged with the governors group "hwp" should
>> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
>> way you've spelled it out it actually looks to be kind of the other way around.)
> 
> I placed them in alphabetical order, but, yes, it doesn't make sense.
> 
>> Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
>> was specified.
>>
>> Overall I'm getting the impression that beyond your "verbose" related adjustment
>> more is needed, if you're meaning to get things closer to how we parse the
>> option (splitting across multiple lines to help see what I mean):
>>
>> `= none
>>  | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
>>                           [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
>>                           [,verbose]]}
>>  | dom0-kernel`
>>
>> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
>> maxfreq, but better be more tight in the doc than too relaxed.)
>>
>> Furthermore while max/min freq don't apply directly, there are still two MSRs
>> controlling bounds at the package and logical processor levels.
> 
> Well, we only program the logical processor level MSRs because we
> don't have a good idea of the packages to know when we can skip
> writing an MSR.
> 
> How about this:
> `= none
>  | {{ <boolean> | xen } {
> [:{powersave|performance|ondemand|userspace}[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]
>                         | [:hwp[,hdc]] }
>                           [,verbose]]}
>  | dom0-kernel`

Looks right, yes.

>>> --- /dev/null
>>> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
>>> @@ -0,0 +1,506 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/*
>>> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
>>> + *
>>> + * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
>>> + */
>>> +
>>> +#include <xen/cpumask.h>
>>> +#include <xen/init.h>
>>> +#include <xen/param.h>
>>> +#include <xen/xmalloc.h>
>>> +#include <asm/io.h>
>>> +#include <asm/msr.h>
>>> +#include <acpi/cpufreq/cpufreq.h>
>>> +
>>> +static bool feature_hwp;
>>> +static bool feature_hwp_notification;
>>> +static bool feature_hwp_activity_window;
>>> +static bool feature_hwp_energy_perf;
>>> +static bool feature_hwp_pkg_level_ctl;
>>> +static bool feature_hwp_peci;
>>> +
>>> +static bool feature_hdc;
>>
>> Most (all?) of these want to be __ro_after_init, I expect.
> 
> I think you are correct.  (This pre-dates __ro_after_init and I didn't
> update it.)

Yet even then they should have used __read_mostly.

>>> +union hwp_request
>>> +{
>>> +    struct
>>> +    {
>>> +        uint64_t min_perf:8;
>>> +        uint64_t max_perf:8;
>>> +        uint64_t desired:8;
>>> +        uint64_t energy_perf:8;
>>> +        uint64_t activity_window:10;
>>> +        uint64_t package_control:1;
>>> +        uint64_t reserved:16;
>>> +        uint64_t activity_window_valid:1;
>>> +        uint64_t energy_perf_valid:1;
>>> +        uint64_t desired_valid:1;
>>> +        uint64_t max_perf_valid:1;
>>> +        uint64_t min_perf_valid:1;
>>
>> The boolean fields here would probably better be of type "bool". I also
>> don't see the need for using uint64_t for any of the other fields -
>> unsigned int will be quite fine, I think. Only ...
> 
> This is the hardware MSR format, so it seemed natural to use uint64_t
> and the bit fields.  To me, uint64_t foo:$bits; better shows that we
> are dividing up a single hardware register using bit fields.
> Honestly, I'm unfamiliar with the finer points of laying out bitfields
> with bool.  And the 10 bits of activity window throws off aligning to
> standard types.
> 
> This seems to have the correct layout:
> struct
> {
>         unsigned char min_perf;
>         unsigned char max_perf;
>         unsigned char desired;
>         unsigned char energy_perf;
>         unsigned int activity_window:10;
>         bool package_control:1;
>         unsigned int reserved:16;
>         bool activity_window_valid:1;
>         bool energy_perf_valid:1;
>         bool desired_valid:1;
>         bool max_perf_valid:1;
>         bool min_perf_valid:1;
> } ;
> 
> Or would you prefer the first 8 bit ones to be unsigned int
> min_perf:8?

Personally I think using bitfields uniformly would be better. What you
definitely cannot use if not using a bitfield is "unsigned char", it
ought to by uint8_t then. If using a bitfield, as said, I think it's
best to stick to unsigned int and bool, unless field width goes
beyond 32 bits or fields cross a 32-bit boundary.

>  The bools seem to need :1, which doesn't seem to be
> gaining us much, IMO.  I'd strongly prefer just keeping it as I have
> it, but I will change it however you like.

It's not so much how I like it, but to follow (a) existing practice
(for the boolean fields) and (b) ./CODING_STYLE (for the selection of
types).

>>> +    };
>>> +    uint64_t raw;
>>
>> ... this wants to keep this type. (Same again below then.)
> 
> For "below", do you want:
> 
>         struct
>         {
>             unsigned char highest;
>             unsigned char guaranteed;
>             unsigned char most_efficient;
>             unsigned char lowest;
>             unsigned int reserved;
>         } hw;
> ?

No - it can only be bitfields or fixed-width types here.

>>> +bool __init hwp_available(void)
>>> +{
>>> +    unsigned int eax, ecx, unused;
>>> +    bool use_hwp;
>>> +
>>> +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
>>> +    {
>>> +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
>>> +                    boot_cpu_data.cpuid_level);
>>> +        return false;
>>> +    }
>>> +
>>> +    if ( boot_cpu_data.cpuid_level < 0x16 )
>>> +    {
>>> +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
>>> +                 boot_cpu_data.cpuid_level);
>>> +        return false;
>>> +    }
>>> +
>>> +    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
>>> +
>>> +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
>>> +         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
>>> +    {
>>> +        hwp_verbose("HWP disabled: No energy/performance preference available");
>>> +        return false;
>>> +    }
>>> +
>>> +    feature_hwp                 = eax & CPUID6_EAX_HWP;
>>> +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
>>> +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
>>> +    feature_hwp_energy_perf     =
>>> +        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
>>> +    feature_hwp_pkg_level_ctl   = eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
>>> +    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
>>> +
>>> +    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
>>> +                feature_hwp, feature_hwp_notification,
>>> +                feature_hwp_activity_window, feature_hwp_energy_perf,
>>> +                feature_hwp_pkg_level_ctl, feature_hwp_peci);
>>> +
>>> +    if ( !feature_hwp )
>>> +        return false;
>>> +
>>> +    feature_hdc = eax & CPUID6_EAX_HDC;
>>> +
>>> +    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
>>> +                feature_hdc ? "" : "not ",
>>> +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
>>> +                            : "");
>>> +
>>> +    feature_hdc = feature_hdc && opt_cpufreq_hdc;
>>> +
>>> +    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
>>> +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
>>
>> You report this, but you don't really use it?
> 
> Correct.  I needed to know what capabilities my processors have.
> 
> feature_hwp_pkg_level_ctl and feature_hwp_peci can also be dropped
> since they aren't used beyond printing their values.  I'd still lean
> toward keeping their printing under verbose since otherwise there
> isn't a convenient way to know if they are available without
> recompiling.

That's fine, but wants mentioning in the description. Also respective
variables would want to be __initdata then, be local to the function,
or be dropped altogether. Plus you'd want to be consistent - either
you use a helper variable for all print-only features, or you don't.

>>> +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
>>> +{
>>> +    uint32_t base_khz, max_khz, bus_khz, edx;
>>> +
>>> +    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
>>> +
>>> +    /* aperf/mperf scales base. */
>>> +    policy->cpuinfo.perf_freq = base_khz * 1000;
>>> +    policy->cpuinfo.min_freq = base_khz * 1000;
>>> +    policy->cpuinfo.max_freq = max_khz * 1000;
>>> +    policy->min = base_khz * 1000;
>>> +    policy->max = max_khz * 1000;
>>> +    policy->cur = 0;
>>
>> What is the comment intended to be telling me here?
> 
> When I was surprised to discover that I needed to pass in the base
> frequency for proper aperf/mperf scaling, it seemed relevant at the
> time as it's the opposite of ACPI cpufreq.  It can be dropped now.

Well, I'm not insisting on dropping the comment. It could also be left,
but then extended so it can be understood what is meant.

>>> +static void cf_check hwp_init_msrs(void *info)
>>> +{
>>> +    struct cpufreq_policy *policy = info;
>>> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
>>> +    uint64_t val;
>>> +
>>> +    /*
>>> +     * Package level MSR, but we don't have a good idea of packages here, so
>>> +     * just do it everytime.
>>> +     */
>>> +    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
>>> +    {
>>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu);
>>> +        data->curr_req.raw = -1;
>>> +        return;
>>> +    }
>>> +
>>> +    /* Ensure we don't generate interrupts */
>>> +    if ( feature_hwp_notification )
>>> +        wrmsr_safe(MSR_IA32_HWP_INTERRUPT, 0);
>>> +
>>> +    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
>>> +    if ( !(val & IA32_PM_ENABLE_HWP_ENABLE) )
>>> +    {
>>> +        val |= IA32_PM_ENABLE_HWP_ENABLE;
>>> +        if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) )
>>> +        {
>>> +            hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n",
>>> +                    policy->cpu, val);
>>> +            data->curr_req.raw = -1;
>>> +            return;
>>> +        }
>>> +    }
>>> +
>>> +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
>>> +    {
>>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
>>> +                policy->cpu);
>>> +        data->curr_req.raw = -1;
>>> +        return;
>>> +    }
>>> +
>>> +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
>>> +    {
>>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
>>> +        data->curr_req.raw = -1;
>>> +        return;
>>> +    }
>>> +
>>> +    if ( !feature_hwp_energy_perf ) {
>>
>> Nit: Brace placement.
>>
>>> +        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
>>> +        {
>>> +            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
>>> +            data->curr_req.raw = -1;
>>> +
>>> +            return;
>>> +        }
>>> +
>>> +        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
>>> +    }
>>
>> In order to not need to undo the "enable" you've already done, maybe that
>> should move down here?
> 
> HWP needs to be enabled before the Capabilities and Request MSRs can
> be read.

I must have missed this aspect in the SDM. Do you have a pointer?

>  Reading them shouldn't fail, but it seems safer to use
> rdmsr_safe in case something goes wrong.

Sure. But then the "enable" will need undoing in the unlikely event of
failure.

>>> --- a/xen/arch/x86/include/asm/cpufeature.h
>>> +++ b/xen/arch/x86/include/asm/cpufeature.h
>>> @@ -46,8 +46,17 @@ extern struct cpuinfo_x86 boot_cpu_data;
>>>  #define cpu_has(c, bit)              test_bit(bit, (c)->x86_capability)
>>>  #define boot_cpu_has(bit)    test_bit(bit, boot_cpu_data.x86_capability)
>>>
>>> -#define CPUID_PM_LEAF                    6
>>> -#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
>>> +#define CPUID_PM_LEAF                                6
>>> +#define CPUID6_EAX_HWP                               (_AC(1, U) <<  7)
>>> +#define CPUID6_EAX_HWP_NOTIFICATION                  (_AC(1, U) <<  8)
>>> +#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               (_AC(1, U) <<  9)
>>> +#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE (_AC(1, U) << 10)
>>> +#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         (_AC(1, U) << 11)
>>> +#define CPUID6_EAX_HDC                               (_AC(1, U) << 13)
>>> +#define CPUID6_EAX_HWP_PECI                          (_AC(1, U) << 16)
>>> +#define CPUID6_EAX_HW_FEEDBACK                       (_AC(1, U) << 19)
>>
>> Perhaps better without open-coding BIT()?
> 
> Ok.
> 
>> I also find it a little odd that e.g. bit 17 is left out here despite you
>> declaring the 5 "valid" bits in union hwp_request (which are qualified by
>> this CPUID bit afaict).
> 
> Well, I thought I wasn't supposed to introduce unused defines, so I
> didn't add one for 17.  For union hwp_request, the "valid" bits are
> part of the register structure, so it makes sense to include them
> instead of an incomplete definition.  IIRC, at some point I set the
> "valid" bits when I wasn't supposed to, and they caused the wrmsr
> calls to fail.  That might have been because my test machines don't
> have package-level HWP.
> 
> (I was confused when the CPUID section stated "Bit 17: Flexible HWP is
> supported if set.", but there are no further references to "Flexible
> HWP" in the SDM.)

A not uncommon issue with the SDM. At least there is a place where bit
17's purpose is described in the HWP section.

>>> @@ -165,6 +172,11 @@
>>>  #define  PASID_PASID_MASK                   0x000fffff
>>>  #define  PASID_VALID                        (_AC(1, ULL) << 31)
>>>
>>> +#define MSR_IA32_PKG_HDC_CTL                0x00000db0
>>> +#define  IA32_PKG_HDC_CTL_HDC_PKG_ENABLE    (_AC(1, ULL) <<  0)
>>
>> The name has two redundant infixes, which looks odd, but then I can't
>> suggest any better without going too much out of sync with the SDM.
> 
> Yes, it's not a good name, but I was trying to keep close to the SDM.
> FAOD, these should drop IA32_ to become:
> MSR_PKG_HDC_CTL
> PKG_HDC_CTL_HDC_PKG_ENABLE
> ?

Right.

> Thank you for taking the time to review this.

Well, it has taken me awfully long to get back to this.

Jan
Jason Andryuk May 5, 2023, 3:35 p.m. UTC | #4
On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 04.05.2023 18:56, Jason Andryuk wrote:
> > On Thu, May 4, 2023 at 9:11 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> On 01.05.2023 21:30, Jason Andryuk wrote:
> >>> --- a/docs/misc/xen-command-line.pandoc
> >>> +++ b/docs/misc/xen-command-line.pandoc
> >>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
> >>>  available support.
> >>>
> >>>  ### cpufreq
> >>> -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
> >>> +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
> >>
> >> Considering you use a special internal governor, the 4 governor alternatives are
> >> meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
> >> was another governor name would seem better to me. This would then also get rid
> >> of one of the two special "no-" prefix parsing cases (which I'm not overly
> >> happy about).
> >>
> >> Even if not done that way I'm puzzled by the way you spell out the interaction
> >> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
> >> "hwp" was specified, so even if not merged with the governors group "hwp" should
> >> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
> >> way you've spelled it out it actually looks to be kind of the other way around.)
> >
> > I placed them in alphabetical order, but, yes, it doesn't make sense.
> >
> >> Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
> >> was specified.
> >>
> >> Overall I'm getting the impression that beyond your "verbose" related adjustment
> >> more is needed, if you're meaning to get things closer to how we parse the
> >> option (splitting across multiple lines to help see what I mean):
> >>
> >> `= none
> >>  | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
> >>                           [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
> >>                           [,verbose]]}
> >>  | dom0-kernel`
> >>
> >> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
> >> maxfreq, but better be more tight in the doc than too relaxed.)
> >>
> >> Furthermore while max/min freq don't apply directly, there are still two MSRs
> >> controlling bounds at the package and logical processor levels.
> >
> > Well, we only program the logical processor level MSRs because we
> > don't have a good idea of the packages to know when we can skip
> > writing an MSR.
> >
> > How about this:
> > `= none
> >  | {{ <boolean> | xen } {
> > [:{powersave|performance|ondemand|userspace}[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]
> >                         | [:hwp[,hdc]] }
> >                           [,verbose]]}
> >  | dom0-kernel`
>
> Looks right, yes.

There is a wrinkle to using the hwp governor.  The hwp governor was
named "hwp-internal", so it needs to be renamed to "hwp" for use with
command line parsing.  That means the checking for "-internal" needs
to change to just "hwp" which removes the generality of the original
implementation.

The other issue is that if you select "hwp" as the governor, but HWP
hardware support is not available, then hwp_available() needs to reset
the governor back to the default.  This feels like a layering
violation.

I'm still investigating, but promoting hwp to a top level option -
cpufreq=hwp - might be a better arrangement.

> >>> +union hwp_request
> >>> +{
> >>> +    struct
> >>> +    {
> >>> +        uint64_t min_perf:8;
> >>> +        uint64_t max_perf:8;
> >>> +        uint64_t desired:8;
> >>> +        uint64_t energy_perf:8;
> >>> +        uint64_t activity_window:10;
> >>> +        uint64_t package_control:1;
> >>> +        uint64_t reserved:16;
> >>> +        uint64_t activity_window_valid:1;
> >>> +        uint64_t energy_perf_valid:1;
> >>> +        uint64_t desired_valid:1;
> >>> +        uint64_t max_perf_valid:1;
> >>> +        uint64_t min_perf_valid:1;
> >>
> >> The boolean fields here would probably better be of type "bool". I also
> >> don't see the need for using uint64_t for any of the other fields -
> >> unsigned int will be quite fine, I think. Only ...
> >
> > This is the hardware MSR format, so it seemed natural to use uint64_t
> > and the bit fields.  To me, uint64_t foo:$bits; better shows that we
> > are dividing up a single hardware register using bit fields.
> > Honestly, I'm unfamiliar with the finer points of laying out bitfields
> > with bool.  And the 10 bits of activity window throws off aligning to
> > standard types.
> >
> > This seems to have the correct layout:
> > struct
> > {
> >         unsigned char min_perf;
> >         unsigned char max_perf;
> >         unsigned char desired;
> >         unsigned char energy_perf;
> >         unsigned int activity_window:10;
> >         bool package_control:1;
> >         unsigned int reserved:16;
> >         bool activity_window_valid:1;
> >         bool energy_perf_valid:1;
> >         bool desired_valid:1;
> >         bool max_perf_valid:1;
> >         bool min_perf_valid:1;
> > } ;
> >
> > Or would you prefer the first 8 bit ones to be unsigned int
> > min_perf:8?
>
> Personally I think using bitfields uniformly would be better. What you
> definitely cannot use if not using a bitfield is "unsigned char", it
> ought to by uint8_t then. If using a bitfield, as said, I think it's
> best to stick to unsigned int and bool, unless field width goes
> beyond 32 bits or fields cross a 32-bit boundary.

Ok, thanks.

> >>> +bool __init hwp_available(void)
> >>> +{
> >>> +    unsigned int eax, ecx, unused;
> >>> +    bool use_hwp;
> >>> +
> >>> +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> >>> +    {
> >>> +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> >>> +                    boot_cpu_data.cpuid_level);
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    if ( boot_cpu_data.cpuid_level < 0x16 )
> >>> +    {
> >>> +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
> >>> +                 boot_cpu_data.cpuid_level);
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
> >>> +
> >>> +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
> >>> +         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
> >>> +    {
> >>> +        hwp_verbose("HWP disabled: No energy/performance preference available");
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    feature_hwp                 = eax & CPUID6_EAX_HWP;
> >>> +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
> >>> +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> >>> +    feature_hwp_energy_perf     =
> >>> +        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
> >>> +    feature_hwp_pkg_level_ctl   = eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
> >>> +    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
> >>> +
> >>> +    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
> >>> +                feature_hwp, feature_hwp_notification,
> >>> +                feature_hwp_activity_window, feature_hwp_energy_perf,
> >>> +                feature_hwp_pkg_level_ctl, feature_hwp_peci);
> >>> +
> >>> +    if ( !feature_hwp )
> >>> +        return false;
> >>> +
> >>> +    feature_hdc = eax & CPUID6_EAX_HDC;
> >>> +
> >>> +    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
> >>> +                feature_hdc ? "" : "not ",
> >>> +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
> >>> +                            : "");
> >>> +
> >>> +    feature_hdc = feature_hdc && opt_cpufreq_hdc;
> >>> +
> >>> +    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
> >>> +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
> >>
> >> You report this, but you don't really use it?
> >
> > Correct.  I needed to know what capabilities my processors have.
> >
> > feature_hwp_pkg_level_ctl and feature_hwp_peci can also be dropped
> > since they aren't used beyond printing their values.  I'd still lean
> > toward keeping their printing under verbose since otherwise there
> > isn't a convenient way to know if they are available without
> > recompiling.
>
> That's fine, but wants mentioning in the description. Also respective
> variables would want to be __initdata then, be local to the function,
> or be dropped altogether. Plus you'd want to be consistent - either
> you use a helper variable for all print-only features, or you don't.

Got it, thanks.

> >>> +        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
> >>> +        {
> >>> +            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> >>> +            data->curr_req.raw = -1;
> >>> +
> >>> +            return;
> >>> +        }
> >>> +
> >>> +        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
> >>> +    }
> >>
> >> In order to not need to undo the "enable" you've already done, maybe that
> >> should move down here?
> >
> > HWP needs to be enabled before the Capabilities and Request MSRs can
> > be read.
>
> I must have missed this aspect in the SDM. Do you have a pointer?

In 15.4.2 Enabling HWP
Additional MSRs associated with HWP may only be accessed after HWP is
enabled, with the exception of IA32_HWP_INTERRUPT and MSR_PPERF.
Accessing the IA32_HWP_INTERRUPT MSR requires only HWP is present as
enumerated by CPUID but does not require enabling HWP.

> >  Reading them shouldn't fail, but it seems safer to use
> > rdmsr_safe in case something goes wrong.
>
> Sure. But then the "enable" will need undoing in the unlikely event of
> failure.

Yes.

Regards,
Jason
Jan Beulich May 8, 2023, 6:33 a.m. UTC | #5
On 05.05.2023 17:35, Jason Andryuk wrote:
> On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 04.05.2023 18:56, Jason Andryuk wrote:
>>> On Thu, May 4, 2023 at 9:11 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>>>> --- a/docs/misc/xen-command-line.pandoc
>>>>> +++ b/docs/misc/xen-command-line.pandoc
>>>>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
>>>>>  available support.
>>>>>
>>>>>  ### cpufreq
>>>>> -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
>>>>> +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
>>>>
>>>> Considering you use a special internal governor, the 4 governor alternatives are
>>>> meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
>>>> was another governor name would seem better to me. This would then also get rid
>>>> of one of the two special "no-" prefix parsing cases (which I'm not overly
>>>> happy about).
>>>>
>>>> Even if not done that way I'm puzzled by the way you spell out the interaction
>>>> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
>>>> "hwp" was specified, so even if not merged with the governors group "hwp" should
>>>> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
>>>> way you've spelled it out it actually looks to be kind of the other way around.)
>>>
>>> I placed them in alphabetical order, but, yes, it doesn't make sense.
>>>
>>>> Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
>>>> was specified.
>>>>
>>>> Overall I'm getting the impression that beyond your "verbose" related adjustment
>>>> more is needed, if you're meaning to get things closer to how we parse the
>>>> option (splitting across multiple lines to help see what I mean):
>>>>
>>>> `= none
>>>>  | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
>>>>                           [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
>>>>                           [,verbose]]}
>>>>  | dom0-kernel`
>>>>
>>>> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
>>>> maxfreq, but better be more tight in the doc than too relaxed.)
>>>>
>>>> Furthermore while max/min freq don't apply directly, there are still two MSRs
>>>> controlling bounds at the package and logical processor levels.
>>>
>>> Well, we only program the logical processor level MSRs because we
>>> don't have a good idea of the packages to know when we can skip
>>> writing an MSR.
>>>
>>> How about this:
>>> `= none
>>>  | {{ <boolean> | xen } {
>>> [:{powersave|performance|ondemand|userspace}[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]
>>>                         | [:hwp[,hdc]] }
>>>                           [,verbose]]}
>>>  | dom0-kernel`
>>
>> Looks right, yes.
> 
> There is a wrinkle to using the hwp governor.  The hwp governor was
> named "hwp-internal", so it needs to be renamed to "hwp" for use with
> command line parsing.  That means the checking for "-internal" needs
> to change to just "hwp" which removes the generality of the original
> implementation.

I'm afraid I don't see why this would strictly be necessary or a
consequence.

> The other issue is that if you select "hwp" as the governor, but HWP
> hardware support is not available, then hwp_available() needs to reset
> the governor back to the default.  This feels like a layering
> violation.

Layering violation - yes. But why would the governor need resetting in
this case? If HWP was asked for but isn't available, I don't think any
other cpufreq handling (and hence governor) should be put in place.
And turning off cpufreq altogether (if necessary in the first place)
wouldn't, to me, feel as much like a layering violation.

> I'm still investigating, but promoting hwp to a top level option -
> cpufreq=hwp - might be a better arrangement.

Might be an alternative, yes.

Jan
Jason Andryuk May 10, 2023, 1:54 p.m. UTC | #6
On Mon, May 8, 2023 at 2:33 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 05.05.2023 17:35, Jason Andryuk wrote:
> > On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 04.05.2023 18:56, Jason Andryuk wrote:
> >>> On Thu, May 4, 2023 at 9:11 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>>> On 01.05.2023 21:30, Jason Andryuk wrote:
> >>>>> --- a/docs/misc/xen-command-line.pandoc
> >>>>> +++ b/docs/misc/xen-command-line.pandoc
> >>>>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
> >>>>>  available support.
> >>>>>
> >>>>>  ### cpufreq
> >>>>> -> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
> >>>>> +> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
> >>>>
> >>>> Considering you use a special internal governor, the 4 governor alternatives are
> >>>> meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
> >>>> was another governor name would seem better to me. This would then also get rid
> >>>> of one of the two special "no-" prefix parsing cases (which I'm not overly
> >>>> happy about).
> >>>>
> >>>> Even if not done that way I'm puzzled by the way you spell out the interaction
> >>>> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
> >>>> "hwp" was specified, so even if not merged with the governors group "hwp" should
> >>>> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
> >>>> way you've spelled it out it actually looks to be kind of the other way around.)
> >>>
> >>> I placed them in alphabetical order, but, yes, it doesn't make sense.
> >>>
> >>>> Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
> >>>> was specified.
> >>>>
> >>>> Overall I'm getting the impression that beyond your "verbose" related adjustment
> >>>> more is needed, if you're meaning to get things closer to how we parse the
> >>>> option (splitting across multiple lines to help see what I mean):
> >>>>
> >>>> `= none
> >>>>  | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
> >>>>                           [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
> >>>>                           [,verbose]]}
> >>>>  | dom0-kernel`
> >>>>
> >>>> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
> >>>> maxfreq, but better be more tight in the doc than too relaxed.)
> >>>>
> >>>> Furthermore while max/min freq don't apply directly, there are still two MSRs
> >>>> controlling bounds at the package and logical processor levels.
> >>>
> >>> Well, we only program the logical processor level MSRs because we
> >>> don't have a good idea of the packages to know when we can skip
> >>> writing an MSR.
> >>>
> >>> How about this:
> >>> `= none
> >>>  | {{ <boolean> | xen } {
> >>> [:{powersave|performance|ondemand|userspace}[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]
> >>>                         | [:hwp[,hdc]] }
> >>>                           [,verbose]]}
> >>>  | dom0-kernel`
> >>
> >> Looks right, yes.
> >
> > There is a wrinkle to using the hwp governor.  The hwp governor was
> > named "hwp-internal", so it needs to be renamed to "hwp" for use with
> > command line parsing.  That means the checking for "-internal" needs
> > to change to just "hwp" which removes the generality of the original
> > implementation.
>
> I'm afraid I don't see why this would strictly be necessary or a
> consequence.

Maybe I took your comment too far when you mentioned using hwp as a
fake governor.  I used the actual HWP struct cpufreq_governor to hook
into cpufreq_cmdline_parse().  cpufreq_cmdline_parse() uses the that
name for comparison.  But the naming stops being an issue if struct
cpufreq_governor gains a bool .internal flag.  That flag also makes
the registration checks clearer.

> > The other issue is that if you select "hwp" as the governor, but HWP
> > hardware support is not available, then hwp_available() needs to reset
> > the governor back to the default.  This feels like a layering
> > violation.
>
> Layering violation - yes. But why would the governor need resetting in
> this case? If HWP was asked for but isn't available, I don't think any
> other cpufreq handling (and hence governor) should be put in place.
> And turning off cpufreq altogether (if necessary in the first place)
> wouldn't, to me, feel as much like a layering violation.

My goal was for Xen to use HWP if available and fallback to the acpi
cpufreq driver if not.  That to me seems more user-friendly than
disabling cpufreq.

            if ( hwp_available() )
                ret = hwp_register_driver();
            else
                ret = cpufreq_register_driver(&acpi_cpufreq_driver);

If we are setting cpufreq_opt_governor to enter hwp_available(), but
then HWP isn't available, it seems to me that we need to reset
cpufreq_opt_governor when exiting hwp_available() false.

Regards,
Jason
Jan Beulich May 10, 2023, 2:19 p.m. UTC | #7
On 10.05.2023 15:54, Jason Andryuk wrote:
> On Mon, May 8, 2023 at 2:33 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 05.05.2023 17:35, Jason Andryuk wrote:
>>> On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
>>> The other issue is that if you select "hwp" as the governor, but HWP
>>> hardware support is not available, then hwp_available() needs to reset
>>> the governor back to the default.  This feels like a layering
>>> violation.
>>
>> Layering violation - yes. But why would the governor need resetting in
>> this case? If HWP was asked for but isn't available, I don't think any
>> other cpufreq handling (and hence governor) should be put in place.
>> And turning off cpufreq altogether (if necessary in the first place)
>> wouldn't, to me, feel as much like a layering violation.
> 
> My goal was for Xen to use HWP if available and fallback to the acpi
> cpufreq driver if not.  That to me seems more user-friendly than
> disabling cpufreq.
> 
>             if ( hwp_available() )
>                 ret = hwp_register_driver();
>             else
>                 ret = cpufreq_register_driver(&acpi_cpufreq_driver);

That's fine as a (future) default, but for now using hwp requires a
command line option, and if that option says "hwp" then it ought to
be hwp imo.

> If we are setting cpufreq_opt_governor to enter hwp_available(), but
> then HWP isn't available, it seems to me that we need to reset
> cpufreq_opt_governor when exiting hwp_available() false.

This may be necessary in the future, but shouldn't be necessary right
now. It's not entirely clear to me how that future is going to look
like, command line option wise.

Jan
Marek Marczykowski-Górecki May 12, 2023, 1:02 a.m. UTC | #8
On Wed, May 10, 2023 at 04:19:57PM +0200, Jan Beulich wrote:
> On 10.05.2023 15:54, Jason Andryuk wrote:
> > On Mon, May 8, 2023 at 2:33 AM Jan Beulich <jbeulich@suse.com> wrote:
> >> On 05.05.2023 17:35, Jason Andryuk wrote:
> >>> On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>> The other issue is that if you select "hwp" as the governor, but HWP
> >>> hardware support is not available, then hwp_available() needs to reset
> >>> the governor back to the default.  This feels like a layering
> >>> violation.
> >>
> >> Layering violation - yes. But why would the governor need resetting in
> >> this case? If HWP was asked for but isn't available, I don't think any
> >> other cpufreq handling (and hence governor) should be put in place.
> >> And turning off cpufreq altogether (if necessary in the first place)
> >> wouldn't, to me, feel as much like a layering violation.
> > 
> > My goal was for Xen to use HWP if available and fallback to the acpi
> > cpufreq driver if not.  That to me seems more user-friendly than
> > disabling cpufreq.
> > 
> >             if ( hwp_available() )
> >                 ret = hwp_register_driver();
> >             else
> >                 ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> 
> That's fine as a (future) default, but for now using hwp requires a
> command line option, and if that option says "hwp" then it ought to
> be hwp imo.

As a downstrem distribution, I'd strongly prefer to have an option that
would enable HWP when present and fallback to other driver otherwise,
even if that isn't the default upstream. I can't possibly require large
group of users (either HWP-having or HWP-not-having) to edit the Xen
cmdline to get power management working well.

If the meaning for cpufreq=hwp absolutely must include "nothing if HWP
is not available", then maybe it should be named cpufreq=try-hwp
instead, or cpufreq=prefer-hwp or something else like this?

> > If we are setting cpufreq_opt_governor to enter hwp_available(), but
> > then HWP isn't available, it seems to me that we need to reset
> > cpufreq_opt_governor when exiting hwp_available() false.
> 
> This may be necessary in the future, but shouldn't be necessary right
> now. It's not entirely clear to me how that future is going to look
> like, command line option wise.
> 
> Jan
>
Jan Beulich May 12, 2023, 6:28 a.m. UTC | #9
On 12.05.2023 03:02, Marek Marczykowski-Górecki wrote:
> On Wed, May 10, 2023 at 04:19:57PM +0200, Jan Beulich wrote:
>> On 10.05.2023 15:54, Jason Andryuk wrote:
>>> On Mon, May 8, 2023 at 2:33 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 05.05.2023 17:35, Jason Andryuk wrote:
>>>>> On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>>> The other issue is that if you select "hwp" as the governor, but HWP
>>>>> hardware support is not available, then hwp_available() needs to reset
>>>>> the governor back to the default.  This feels like a layering
>>>>> violation.
>>>>
>>>> Layering violation - yes. But why would the governor need resetting in
>>>> this case? If HWP was asked for but isn't available, I don't think any
>>>> other cpufreq handling (and hence governor) should be put in place.
>>>> And turning off cpufreq altogether (if necessary in the first place)
>>>> wouldn't, to me, feel as much like a layering violation.
>>>
>>> My goal was for Xen to use HWP if available and fallback to the acpi
>>> cpufreq driver if not.  That to me seems more user-friendly than
>>> disabling cpufreq.
>>>
>>>             if ( hwp_available() )
>>>                 ret = hwp_register_driver();
>>>             else
>>>                 ret = cpufreq_register_driver(&acpi_cpufreq_driver);
>>
>> That's fine as a (future) default, but for now using hwp requires a
>> command line option, and if that option says "hwp" then it ought to
>> be hwp imo.
> 
> As a downstrem distribution, I'd strongly prefer to have an option that
> would enable HWP when present and fallback to other driver otherwise,
> even if that isn't the default upstream. I can't possibly require large
> group of users (either HWP-having or HWP-not-having) to edit the Xen
> cmdline to get power management working well.
> 
> If the meaning for cpufreq=hwp absolutely must include "nothing if HWP
> is not available", then maybe it should be named cpufreq=try-hwp
> instead, or cpufreq=prefer-hwp or something else like this?

Any new sub-option needs to fit the existing ones in its meaning. I
could see e.g. "cpufreq=xen" alone to effect what you want (once hwp
becomes available for use by default). But (for now at least) I
continue to think that a request for "hwp" ought to mean HWP.

Jan
diff mbox series

Patch

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index e0b89b7d33..aaa31f444b 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -499,7 +499,7 @@  If set, force use of the performance counters for oprofile, rather than detectin
 available support.
 
 ### cpufreq
-> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]} | dom0-kernel`
+> `= none | {{ <boolean> | xen } [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]} | dom0-kernel`
 
 > Default: `xen`
 
@@ -510,6 +510,12 @@  choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
 * `<maxfreq>` and `<minfreq>` are integers which represent max and min processor frequencies
   respectively.
 * `verbose` option can be included as a string or also as `verbose=<integer>`
+* `<hwp>` is a boolean to enable Hardware-Controlled Performance States (HWP)
+  on supported Intel hardware.  HWP is a Skylake+ feature which provides better
+  CPU power management.  The default is disabled.
+* `<hdc>` is a boolean to enable Hardware Duty Cycling (HDC).  HDC enables the
+  processor to autonomously force physical package components into idle state.
+  The default is enabled, but the option only applies when `<hwp>` is enabled.
 
 ### cpuid (x86)
 > `= List of comma separated booleans`
diff --git a/xen/arch/x86/acpi/cpufreq/Makefile b/xen/arch/x86/acpi/cpufreq/Makefile
index f75da9b9ca..db83aa6b14 100644
--- a/xen/arch/x86/acpi/cpufreq/Makefile
+++ b/xen/arch/x86/acpi/cpufreq/Makefile
@@ -1,2 +1,3 @@ 
 obj-y += cpufreq.o
+obj-y += hwp.o
 obj-y += powernow.o
diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index f1cc473b4f..56816b1aee 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -642,7 +642,10 @@  static int __init cf_check cpufreq_driver_init(void)
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = cpufreq_register_driver(&acpi_cpufreq_driver);
+            if ( hwp_available() )
+                ret = hwp_register_driver();
+            else
+                ret = cpufreq_register_driver(&acpi_cpufreq_driver);
             break;
 
         case X86_VENDOR_AMD:
diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
new file mode 100644
index 0000000000..57f13867d3
--- /dev/null
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -0,0 +1,506 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
+ *
+ * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com>
+ */
+
+#include <xen/cpumask.h>
+#include <xen/init.h>
+#include <xen/param.h>
+#include <xen/xmalloc.h>
+#include <asm/io.h>
+#include <asm/msr.h>
+#include <acpi/cpufreq/cpufreq.h>
+
+static bool feature_hwp;
+static bool feature_hwp_notification;
+static bool feature_hwp_activity_window;
+static bool feature_hwp_energy_perf;
+static bool feature_hwp_pkg_level_ctl;
+static bool feature_hwp_peci;
+
+static bool feature_hdc;
+
+__initdata bool opt_cpufreq_hwp = false;
+__initdata bool opt_cpufreq_hdc = true;
+
+#define HWP_ENERGY_PERF_BALANCE         0x80
+#define IA32_ENERGY_BIAS_BALANCE        0x7
+#define IA32_ENERGY_BIAS_MAX_POWERSAVE  0xf
+#define IA32_ENERGY_BIAS_MASK           0xf
+
+union hwp_request
+{
+    struct
+    {
+        uint64_t min_perf:8;
+        uint64_t max_perf:8;
+        uint64_t desired:8;
+        uint64_t energy_perf:8;
+        uint64_t activity_window:10;
+        uint64_t package_control:1;
+        uint64_t reserved:16;
+        uint64_t activity_window_valid:1;
+        uint64_t energy_perf_valid:1;
+        uint64_t desired_valid:1;
+        uint64_t max_perf_valid:1;
+        uint64_t min_perf_valid:1;
+    };
+    uint64_t raw;
+};
+
+struct hwp_drv_data
+{
+    union
+    {
+        uint64_t hwp_caps;
+        struct
+        {
+            uint64_t highest:8;
+            uint64_t guaranteed:8;
+            uint64_t most_efficient:8;
+            uint64_t lowest:8;
+            uint64_t reserved:32;
+        } hw;
+    };
+    union hwp_request curr_req;
+    uint16_t activity_window;
+    uint8_t minimum;
+    uint8_t maximum;
+    uint8_t desired;
+    uint8_t energy_perf;
+};
+DEFINE_PER_CPU_READ_MOSTLY(struct hwp_drv_data *, hwp_drv_data);
+
+#define hwp_err(...)     printk(XENLOG_ERR __VA_ARGS__)
+#define hwp_info(...)    printk(XENLOG_INFO __VA_ARGS__)
+#define hwp_verbose(...)                   \
+({                                         \
+    if ( cpufreq_verbose )                 \
+        printk(XENLOG_DEBUG __VA_ARGS__);  \
+})
+
+static int cf_check hwp_governor(struct cpufreq_policy *policy,
+                                 unsigned int event)
+{
+    int ret;
+
+    if ( policy == NULL )
+        return -EINVAL;
+
+    switch ( event )
+    {
+    case CPUFREQ_GOV_START:
+    case CPUFREQ_GOV_LIMITS:
+        ret = 0;
+        break;
+
+    case CPUFREQ_GOV_STOP:
+    default:
+        ret = -EINVAL;
+        break;
+    }
+
+    return ret;
+}
+
+static struct cpufreq_governor hwp_cpufreq_governor =
+{
+    .name          = XEN_HWP_GOVERNOR,
+    .governor      = hwp_governor,
+};
+
+static int __init cf_check cpufreq_gov_hwp_init(void)
+{
+    return cpufreq_register_governor(&hwp_cpufreq_governor);
+}
+__initcall(cpufreq_gov_hwp_init);
+
+bool __init hwp_available(void)
+{
+    unsigned int eax, ecx, unused;
+    bool use_hwp;
+
+    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
+    {
+        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
+                    boot_cpu_data.cpuid_level);
+        return false;
+    }
+
+    if ( boot_cpu_data.cpuid_level < 0x16 )
+    {
+        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq info\n",
+                 boot_cpu_data.cpuid_level);
+        return false;
+    }
+
+    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
+
+    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
+         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
+    {
+        hwp_verbose("HWP disabled: No energy/performance preference available");
+        return false;
+    }
+
+    feature_hwp                 = eax & CPUID6_EAX_HWP;
+    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
+    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
+    feature_hwp_energy_perf     =
+        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
+    feature_hwp_pkg_level_ctl   = eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
+    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
+
+    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d pkg-level: %d peci: %d\n",
+                feature_hwp, feature_hwp_notification,
+                feature_hwp_activity_window, feature_hwp_energy_perf,
+                feature_hwp_pkg_level_ctl, feature_hwp_peci);
+
+    if ( !feature_hwp )
+        return false;
+
+    feature_hdc = eax & CPUID6_EAX_HDC;
+
+    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
+                feature_hdc ? "" : "not ",
+                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
+                            : "");
+
+    feature_hdc = feature_hdc && opt_cpufreq_hdc;
+
+    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
+                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
+
+    use_hwp = feature_hwp && opt_cpufreq_hwp;
+    cpufreq_governor_internal = use_hwp;
+
+    if ( use_hwp )
+        hwp_info("Using HWP for cpufreq\n");
+
+    return use_hwp;
+}
+
+static void hdc_set_pkg_hdc_ctl(bool val)
+{
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
+    {
+        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
+
+        return;
+    }
+
+    if ( val )
+        msr |= IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
+    else
+        msr &= ~IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
+
+    if ( wrmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
+        hwp_err("error wrmsr_safe(MSR_IA32_PKG_HDC_CTL): %016lx\n", msr);
+}
+
+static void hdc_set_pm_ctl1(bool val)
+{
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
+    {
+        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
+
+        return;
+    }
+
+    if ( val )
+        msr |= IA32_PM_CTL1_HDC_ALLOW_BLOCK;
+    else
+        msr &= ~IA32_PM_CTL1_HDC_ALLOW_BLOCK;
+
+    if ( wrmsr_safe(MSR_IA32_PM_CTL1, msr) )
+        hwp_err("error wrmsr_safe(MSR_IA32_PM_CTL1): %016lx\n", msr);
+}
+
+static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
+{
+    uint32_t base_khz, max_khz, bus_khz, edx;
+
+    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
+
+    /* aperf/mperf scales base. */
+    policy->cpuinfo.perf_freq = base_khz * 1000;
+    policy->cpuinfo.min_freq = base_khz * 1000;
+    policy->cpuinfo.max_freq = max_khz * 1000;
+    policy->min = base_khz * 1000;
+    policy->max = max_khz * 1000;
+    policy->cur = 0;
+}
+
+static void cf_check hwp_init_msrs(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
+    uint64_t val;
+
+    /*
+     * Package level MSR, but we don't have a good idea of packages here, so
+     * just do it everytime.
+     */
+    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu);
+        data->curr_req.raw = -1;
+        return;
+    }
+
+    /* Ensure we don't generate interrupts */
+    if ( feature_hwp_notification )
+        wrmsr_safe(MSR_IA32_HWP_INTERRUPT, 0);
+
+    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
+    if ( !(val & IA32_PM_ENABLE_HWP_ENABLE) )
+    {
+        val |= IA32_PM_ENABLE_HWP_ENABLE;
+        if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) )
+        {
+            hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n",
+                    policy->cpu, val);
+            data->curr_req.raw = -1;
+            return;
+        }
+    }
+
+    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
+                policy->cpu);
+        data->curr_req.raw = -1;
+        return;
+    }
+
+    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
+        data->curr_req.raw = -1;
+        return;
+    }
+
+    if ( !feature_hwp_energy_perf ) {
+        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
+        {
+            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
+            data->curr_req.raw = -1;
+
+            return;
+        }
+
+        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
+    }
+
+    /*
+     * Check for APERF/MPERF support in hardware
+     * also check for boost/turbo support
+     */
+    intel_feature_detect(policy);
+
+    if ( feature_hdc )
+    {
+        hdc_set_pkg_hdc_ctl(true);
+        hdc_set_pm_ctl1(true);
+    }
+
+    hwp_get_cpu_speeds(policy);
+}
+
+static int cf_check hwp_cpufreq_verify(struct cpufreq_policy *policy)
+{
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
+
+    if ( !feature_hwp_energy_perf && data->energy_perf )
+    {
+        if ( data->energy_perf > IA32_ENERGY_BIAS_MAX_POWERSAVE )
+        {
+            hwp_err("energy_perf %d exceeds IA32_ENERGY_PERF_BIAS range 0-15\n",
+                    data->energy_perf);
+
+            return -EINVAL;
+        }
+    }
+
+    if ( !feature_hwp_activity_window && data->activity_window )
+    {
+        hwp_err("HWP activity window not supported\n");
+
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+/* val 0 - highest performance, 15 - maximum energy savings */
+static void hwp_energy_perf_bias(const struct hwp_drv_data *data)
+{
+    uint64_t msr;
+    uint8_t val = data->energy_perf;
+
+    ASSERT(val <= IA32_ENERGY_BIAS_MAX_POWERSAVE);
+
+    if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
+    {
+        hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
+
+        return;
+    }
+
+    msr &= ~IA32_ENERGY_BIAS_MASK;
+    msr |= val;
+
+    if ( wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
+        hwp_err("error wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS): %016lx\n", msr);
+}
+
+static void cf_check hwp_write_request(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
+    union hwp_request hwp_req = data->curr_req;
+
+    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
+    if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) )
+    {
+        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n",
+                policy->cpu, hwp_req.raw);
+        rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw);
+    }
+
+    if ( !feature_hwp_energy_perf )
+        hwp_energy_perf_bias(data);
+
+}
+
+static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
+                                       unsigned int target_freq,
+                                       unsigned int relation)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+    /* Zero everything to ensure reserved bits are zero... */
+    union hwp_request hwp_req = { .raw = 0 };
+
+    /* .. and update from there */
+    hwp_req.min_perf = data->minimum;
+    hwp_req.max_perf = data->maximum;
+    hwp_req.desired = data->desired;
+    if ( feature_hwp_energy_perf )
+        hwp_req.energy_perf = data->energy_perf;
+    if ( feature_hwp_activity_window )
+        hwp_req.activity_window = data->activity_window;
+
+    if ( hwp_req.raw == data->curr_req.raw )
+        return 0;
+
+    data->curr_req = hwp_req;
+
+    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
+    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
+
+    return 0;
+}
+
+static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+    unsigned int cpu = policy->cpu;
+    struct hwp_drv_data *data;
+
+    data = xzalloc(struct hwp_drv_data);
+    if ( !data )
+        return -ENOMEM;
+
+    if ( cpufreq_opt_governor )
+        printk(XENLOG_WARNING
+               "HWP: governor \"%s\" is incompatible with hwp. Using default \"%s\"\n",
+               cpufreq_opt_governor->name, hwp_cpufreq_governor.name);
+    policy->governor = &hwp_cpufreq_governor;
+
+    per_cpu(hwp_drv_data, cpu) = data;
+
+    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
+
+    if ( data->curr_req.raw == -1 )
+    {
+        hwp_err("CPU%u: Could not initialize HWP properly\n", cpu);
+        XFREE(per_cpu(hwp_drv_data, cpu));
+        return -ENODEV;
+    }
+
+    data->minimum = data->curr_req.min_perf;
+    data->maximum = data->curr_req.max_perf;
+    data->desired = data->curr_req.desired;
+    /* the !feature_hwp_energy_perf case was handled in hwp_init_msrs(). */
+    if ( feature_hwp_energy_perf )
+        data->energy_perf = data->curr_req.energy_perf;
+
+    hwp_verbose("CPU%u: IA32_HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);
+
+    hwp_verbose("CPU%u: rdmsr HWP_REQUEST %016lx\n", cpu, data->curr_req.raw);
+
+    return 0;
+}
+
+static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+    XFREE(per_cpu(hwp_drv_data, policy->cpu));
+
+    return 0;
+}
+
+/*
+ * The SDM reads like turbo should be disabled with MSR_IA32_PERF_CTL and
+ * PERF_CTL_TURBO_DISENGAGE, but that does not seem to actually work, at least
+ * with my HWP testing.  MSR_IA32_MISC_ENABLE and MISC_ENABLE_TURBO_DISENGAGE
+ * is what Linux uses and seems to work.
+ */
+static void cf_check hwp_set_misc_turbo(void *info)
+{
+    const struct cpufreq_policy *policy = info;
+    uint64_t msr;
+
+    if ( rdmsr_safe(MSR_IA32_MISC_ENABLE, msr) )
+    {
+        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_MISC_ENABLE)\n", policy->cpu);
+
+        return;
+    }
+
+    if ( policy->turbo == CPUFREQ_TURBO_ENABLED )
+        msr &= ~MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE;
+    else
+        msr |= MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE;
+
+    if ( wrmsr_safe(MSR_IA32_MISC_ENABLE, msr) )
+        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_MISC_ENABLE): %016lx\n",
+                policy->cpu, msr);
+}
+
+static int cf_check hwp_cpufreq_update(int cpuid, struct cpufreq_policy *policy)
+{
+    on_selected_cpus(cpumask_of(cpuid), hwp_set_misc_turbo, policy, 1);
+
+    return 0;
+}
+
+static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
+{
+    .name   = "hwp-cpufreq",
+    .verify = hwp_cpufreq_verify,
+    .target = hwp_cpufreq_target,
+    .init   = hwp_cpufreq_cpu_init,
+    .exit   = hwp_cpufreq_cpu_exit,
+    .update = hwp_cpufreq_update,
+};
+
+int __init hwp_register_driver(void)
+{
+    return cpufreq_register_driver(&hwp_cpufreq_driver);
+}
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 4140ec0938..f2ff1d5fde 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -46,8 +46,17 @@  extern struct cpuinfo_x86 boot_cpu_data;
 #define cpu_has(c, bit)		test_bit(bit, (c)->x86_capability)
 #define boot_cpu_has(bit)	test_bit(bit, boot_cpu_data.x86_capability)
 
-#define CPUID_PM_LEAF                    6
-#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
+#define CPUID_PM_LEAF                                6
+#define CPUID6_EAX_HWP                               (_AC(1, U) <<  7)
+#define CPUID6_EAX_HWP_NOTIFICATION                  (_AC(1, U) <<  8)
+#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               (_AC(1, U) <<  9)
+#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE (_AC(1, U) << 10)
+#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         (_AC(1, U) << 11)
+#define CPUID6_EAX_HDC                               (_AC(1, U) << 13)
+#define CPUID6_EAX_HWP_PECI                          (_AC(1, U) << 16)
+#define CPUID6_EAX_HW_FEEDBACK                       (_AC(1, U) << 19)
+#define CPUID6_ECX_APERFMPERF_CAPABILITY             0x1
+#define CPUID6_ECX_IA32_ENERGY_PERF_BIAS             0x8
 
 /* CPUID level 0x00000001.edx */
 #define cpu_has_fpu             1
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index fa771ed0b5..a2a22339e4 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -151,6 +151,13 @@ 
 
 #define MSR_PKRS                            0x000006e1
 
+#define MSR_IA32_PM_ENABLE                  0x00000770
+#define  IA32_PM_ENABLE_HWP_ENABLE          (_AC(1, ULL) <<  0)
+
+#define MSR_IA32_HWP_CAPABILITIES           0x00000771
+#define MSR_IA32_HWP_INTERRUPT              0x00000773
+#define MSR_IA32_HWP_REQUEST                0x00000774
+
 #define MSR_X2APIC_FIRST                    0x00000800
 #define MSR_X2APIC_LAST                     0x000008ff
 
@@ -165,6 +172,11 @@ 
 #define  PASID_PASID_MASK                   0x000fffff
 #define  PASID_VALID                        (_AC(1, ULL) << 31)
 
+#define MSR_IA32_PKG_HDC_CTL                0x00000db0
+#define  IA32_PKG_HDC_CTL_HDC_PKG_ENABLE    (_AC(1, ULL) <<  0)
+#define MSR_IA32_PM_CTL1                    0x00000db1
+#define  IA32_PM_CTL1_HDC_ALLOW_BLOCK       (_AC(1, ULL) <<  0)
+
 #define MSR_UARCH_MISC_CTRL                 0x00001b01
 #define  UARCH_CTRL_DOITM                   (_AC(1, ULL) <<  0)
 
@@ -500,6 +512,7 @@ 
 #define MSR_IA32_MISC_ENABLE_LIMIT_CPUID  (1<<22)
 #define MSR_IA32_MISC_ENABLE_XTPR_DISABLE (1<<23)
 #define MSR_IA32_MISC_ENABLE_XD_DISABLE	(1ULL << 34)
+#define MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE (1ULL << 38)
 
 #define MSR_IA32_TSC_DEADLINE		0x000006E0
 #define MSR_IA32_ENERGY_PERF_BIAS	0x000001b0
diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c
index 7bd81680da..9470eb7230 100644
--- a/xen/drivers/cpufreq/cpufreq.c
+++ b/xen/drivers/cpufreq/cpufreq.c
@@ -565,6 +565,38 @@  static void cpufreq_cmdline_common_para(struct cpufreq_policy *new_policy)
 
 static int __init cpufreq_handle_common_option(const char *name, const char *val)
 {
+    if (!strcmp(name, "hdc")) {
+        if (val) {
+            int ret = parse_bool(val, NULL);
+            if (ret != -1) {
+                opt_cpufreq_hdc = ret;
+                return 1;
+            }
+        } else {
+            opt_cpufreq_hdc = true;
+            return 1;
+        }
+    } else if (!strcmp(name, "no-hdc")) {
+        opt_cpufreq_hdc = false;
+        return 1;
+    }
+
+    if (!strcmp(name, "hwp")) {
+        if (val) {
+            int ret = parse_bool(val, NULL);
+            if (ret != -1) {
+                opt_cpufreq_hwp = ret;
+                return 1;
+            }
+        } else {
+            opt_cpufreq_hwp = true;
+            return 1;
+        }
+    } else if (!strcmp(name, "no-hwp")) {
+        opt_cpufreq_hwp = false;
+        return 1;
+    }
+
     if (!strcmp(name, "maxfreq") && val) {
         usr_max_freq = simple_strtoul(val, NULL, 0);
         return 1;
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 0f334d2a43..29a712a4f1 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -245,4 +245,7 @@  void cpufreq_dbs_timer_resume(void);
 
 void intel_feature_detect(struct cpufreq_policy *policy);
 
+extern bool opt_cpufreq_hwp;
+extern bool opt_cpufreq_hdc;
+
 #endif /* __XEN_CPUFREQ_PM_H__ */
diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h
index d8a1ba68a6..b751ca4937 100644
--- a/xen/include/acpi/cpufreq/processor_perf.h
+++ b/xen/include/acpi/cpufreq/processor_perf.h
@@ -7,6 +7,9 @@ 
 
 #define XEN_PX_INIT 0x80000000
 
+bool hwp_available(void);
+int hwp_register_driver(void);
+
 int powernow_cpufreq_init(void);
 unsigned int powernow_register_driver(void);
 unsigned int get_measured_perf(unsigned int cpu, unsigned int flag);
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 2b24d6bfd0..b448f13b75 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -292,6 +292,7 @@  struct xen_ondemand {
     uint32_t up_threshold;
 };
 
+#define XEN_HWP_GOVERNOR "hwp-internal"
 /*
  * cpufreq para name of this structure named
  * same as sysfs file name of native linux