Message ID: 20210503192810.36084-5-jandryuk@gmail.com
State: New, archived
Series: Intel Hardware P-States (HWP) support
On 03.05.2021 21:28, Jason Andryuk wrote: > If HWP Energy_Performance_Preference isn't supported, the code falls > back to IA32_ENERGY_PERF_BIAS. Right now, we don't check > CPUID.06H:ECX.SETBH[bit 3] before using that MSR. May I ask what problem there is doing so? > The SDM reads like > it'll be available, and I assume it was available by the time Skylake > introduced HWP. The SDM documents the MSR's presence back to at least Nehalem, but ties it to the CPUID bit even there. > --- a/docs/misc/xen-command-line.pandoc > +++ b/docs/misc/xen-command-line.pandoc > @@ -1355,6 +1355,15 @@ Specify whether guests are to be given access to physical port 80 > (often used for debugging purposes), to override the DMI based > detection of systems known to misbehave upon accesses to that port. > > +### hwp (x86) > +> `= <boolean>` > + > +> Default: `true` > + > +Specifies whether Xen uses Hardware-Controlled Performance States (HWP) > +on supported Intel hardware. HWP is a Skylake+ feature which provides > +better CPU power management. Is there a particular reason giving this a top-level option rather than a sub-option of cpufreq=? > --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c > +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c > @@ -641,9 +641,12 @@ static int __init cpufreq_driver_init(void) > int ret = 0; > > if ((cpufreq_controller == FREQCTL_xen) && > - (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)) > - ret = cpufreq_register_driver(&acpi_cpufreq_driver); > - else if ((cpufreq_controller == FREQCTL_xen) && > + (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)) { > + if (hwp_available()) > + ret = hwp_register_driver(); > + else > + ret = cpufreq_register_driver(&acpi_cpufreq_driver); > + } else if ((cpufreq_controller == FREQCTL_xen) && I'd prefer if you did this with slightly less code churn, e.g. 
(considering that the vendor check isn't really necessary afaict) if ((cpufreq_controller == FREQCTL_xen) && hwp_available()) ret = hwp_register_driver(); else if ((cpufreq_controller == FREQCTL_xen) && (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)) ret = cpufreq_register_driver(&acpi_cpufreq_driver); ... > --- /dev/null > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c > @@ -0,0 +1,533 @@ > +/* > + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP) > + * > + * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or (at > + * your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, but > + * WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License along > + * with this program; If not, see <http://www.gnu.org/licenses/>. > + */ > + > +#include <xen/cpumask.h> > +#include <xen/init.h> > +#include <xen/param.h> > +#include <xen/xmalloc.h> > +#include <asm/msr.h> > +#include <asm/io.h> Nit: Please swap these two for being properly sorted (alphabetically). > +#include <acpi/cpufreq/cpufreq.h> > + > +static bool feature_hwp; > +static bool feature_hwp_notification; > +static bool feature_hwp_activity_window; > +static bool feature_hwp_energy_perf; > +static bool feature_hwp_pkg_level_ctl; > +static bool feature_hwp_peci; > + > +static bool feature_hdc; > +static bool feature_fast_msr; > + > +bool opt_hwp = true; Please add __read_mostly or even __initdata as applicable. 
> +struct hwp_drv_data > +{ > + union > + { > + uint64_t hwp_caps; > + struct > + { > + uint64_t hw_highest:8; > + uint64_t hw_guaranteed:8; > + uint64_t hw_most_efficient:8; > + uint64_t hw_lowest:8; > + uint64_t hw_reserved:32; I'd like to suggest to convert the hw_ prefixes here into ... > + }; ... naming of the field ("hw") here. > + }; > + union hwp_request curr_req; > + uint16_t activity_window; > + uint8_t minimum; > + uint8_t maximum; > + uint8_t desired; > + uint8_t energy_perf; > +}; > +struct hwp_drv_data *hwp_drv_data[NR_CPUS]; New NR_CPUS-dimensioned arrays need explicit justification. From what I get I can't see why this couldn't be per-CPU data instead. Also - static? > +#define hwp_err(...) printk(XENLOG_ERR __VA_ARGS__) > +#define hwp_info(...) printk(XENLOG_INFO __VA_ARGS__) > +#define hwp_verbose(...) \ > +({ \ > + if ( cpufreq_verbose ) \ > + { \ > + printk(XENLOG_DEBUG __VA_ARGS__); \ > + } \ > +}) > +#define hwp_verbose_cont(...) \ > +({ \ > + if ( cpufreq_verbose ) \ > + { \ > + printk( __VA_ARGS__); \ > + } \ > +}) Please omit this as not properly working (we don't have Linux'es printk continuation logic) and as not actually used anywhere. For hwp_verbose() please omit unnecessary braces. > +static int hwp_governor(struct cpufreq_policy *policy, > + unsigned int event) > +{ > + int ret; > + > + if ( policy == NULL ) > + return -EINVAL; > + > + switch (event) Style: Missing blanks. > + { > + case CPUFREQ_GOV_START: > + ret = 0; > + break; > + case CPUFREQ_GOV_STOP: > + ret = -EINVAL; > + break; Any particular need for this, when you have ... > + case CPUFREQ_GOV_LIMITS: > + ret = 0; > + break; > + default: > + ret = -EINVAL; ... this (albeit with a missing "break")? Similarly, any particular reason not to fold the other two? > +bool hwp_available(void) The only caller of this function is __init, so the function here should be, too. > +{ > + uint32_t eax; This could well be unsigned int afaict - see ./CODING_STYLE. 
> + uint64_t val; > + bool use_hwp; > + > + if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF ) > + { > + hwp_verbose("cpuid_level (%u) lacks HWP support\n", boot_cpu_data.cpuid_level); > + > + return false; > + } > + > + eax = cpuid_eax(CPUID_PM_LEAF); > + feature_hwp = !!(eax & CPUID6_EAX_HWP); > + feature_hwp_notification = !!(eax & CPUID6_EAX_HWP_Notification); > + feature_hwp_activity_window = !!(eax & CPUID6_EAX_HWP_Activity_Window); > + feature_hwp_energy_perf = > + !!(eax & CPUID6_EAX_HWP_Energy_Performance_Preference); > + feature_hwp_pkg_level_ctl = > + !!(eax & CPUID6_EAX_HWP_Package_Level_Request); > + feature_hwp_peci = !!(eax & CPUID6_EAX_HWP_PECI); Please avoid !! unless it's really needed (i.e. when the conversion to bool isn't implicit anyway). Also elsewhere. > + hwp_verbose("HWP: %d notify: %d act_window: %d energy_perf: %d pkg_level: %d peci: %d\n", > + feature_hwp, feature_hwp_notification, > + feature_hwp_activity_window, feature_hwp_energy_perf, > + feature_hwp_pkg_level_ctl, feature_hwp_peci); > + > + if ( !feature_hwp ) > + { > + hwp_verbose("Hardware does not support HWP\n"); This is redundant with the hwp_verbose immediately ahead. > + return false; > + } > + > + if ( boot_cpu_data.cpuid_level < 0x16 ) > + { > + hwp_info("HWP disabled: cpuid_level %x < 0x16 lacks CPU freq info\n", > + boot_cpu_data.cpuid_level); > + > + return false; While we commonly insist on blank lines ahead of the main return statement of a function, we don't normally have such extra blank lines in cases like this one. > + } Perhaps worth folding with the earlier CPUID level check? > + hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n", > + eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not "); > + if ( eax & CPUID6_EAX_FAST_HWP_MSR ) > + { > + if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) ) > + hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n"); > + > + hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val); Missing "else" above here?
> + if (val & FAST_IA32_HWP_REQUEST ) Style: Missing blank. > +static void hdc_set_pkg_hdc_ctl(bool val) > +{ > + uint64_t msr; > + > + if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) ) > + { > + hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n"); I'm not convinced of the need of having such log messages after failed rdmsr/wrmsr, but I'm definitely against them getting logged unconditionally. In verbose mode, maybe. > + return; > + } > + > + msr = val ? IA32_PKG_HDC_CTL_HDC_PKG_Enable : 0; If you don't use the prior value, why did you read it? But I think you really mean to set/clear just bit 0. > +static void hdc_set_pm_ctl1(bool val) > +{ > + uint64_t msr; > + > + if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) ) > + { > + hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n"); > + > + return; > + } > + > + msr = val ? IA32_PM_CTL1_HDC_Allow_Block : 0; Same here then, and ... > +static void hwp_fast_uncore_msrs_ctl(bool val) > +{ > + uint64_t msr; > + > + if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) ) > + hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL)\n"); > + > + msr = val; ... here (where you imply bit 0 instead of using a proper constant). Also for all three functions I'm not convinced their names are in good sync with their parameters being boolean. > +static void hwp_read_capabilities(void *info) > +{ > + struct cpufreq_policy *policy = info; > + struct hwp_drv_data *data = hwp_drv_data[policy->cpu]; > + > + if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) ) > + { > + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n", > + policy->cpu); > + > + return; > + } > + > + if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) ) > + { > + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu); > + > + return; > + } > +} This function doesn't indicate failure to its caller(s), so am I to understand that failure to read either of the MSRs is actually benign to the driver?
> +static void hwp_init_msrs(void *info) > +{ > + struct cpufreq_policy *policy = info; > + uint64_t val; > + > + /* Package level MSR, but we don't have a good idea of packages here, so > + * just do it everytime. */ Style: Comment format (also elsewhere). > + if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) ) > + { > + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu); > + > + return; > + } > + > + hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val); > + if ( val != IA32_PM_ENABLE_HWP_ENABLE ) > + { > + val = IA32_PM_ENABLE_HWP_ENABLE; You should neither depend on reserved bits being zero, nor discard any non-zero value here, I think. > + if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) ) > + hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n", > + policy->cpu, val); > + } > + > + hwp_read_capabilities(info); Please pass properly typed arguments (and have properly typed parameters) wherever possible - here: policy. The exception are callback functions and alike, where the function type may have to match a sufficiently generic one. > +static int hwp_cpufreq_verify(struct cpufreq_policy *policy) > +{ > + unsigned int cpu = policy->cpu; > + struct hwp_drv_data *data = hwp_drv_data[cpu]; > + > + if ( !feature_hwp_energy_perf && data->energy_perf ) > + { > + if ( data->energy_perf > 15 ) > + { > + hwp_err("energy_perf %d exceeds IA32_ENERGY_PERF_BIAS range 0-15\n", > + data->energy_perf); > + > + return -EINVAL; > + } > + } > + > + if ( !feature_hwp_activity_window && data->activity_window ) > + { > + hwp_err("HWP activity window not supported.\n"); As in the majority of log messages you have, please omit full stops. 
> +static void hwp_energy_perf_bias(void *info) > +{ > + uint64_t msr; > + struct hwp_drv_data *data = info; > + uint8_t val = data->energy_perf; > + > + ASSERT(val <= 15); > + > + if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) ) > + { > + hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n"); > + > + return; > + } > + > + msr &= ~(0xf); Unnecessary parentheses. > +static void hwp_write_request(void *info) > +{ > + struct cpufreq_policy *policy = info; > + struct hwp_drv_data *data = hwp_drv_data[policy->cpu]; > + union hwp_request hwp_req = data->curr_req; > + > + BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t)); ITYM BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw)); here? > + if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) ) > + { > + hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n", > + policy->cpu, hwp_req.raw); > + rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw); What if this one fails, too? data->curr_req.raw then pretty certainly ends up stale. > +static int hwp_cpufreq_target(struct cpufreq_policy *policy, > + unsigned int target_freq, unsigned int relation) > +{ > + unsigned int cpu = policy->cpu; > + struct hwp_drv_data *data = hwp_drv_data[cpu]; > + union hwp_request hwp_req; > + > + /* Zero everything to ensure reserved bits are zero... */ > + hwp_req.raw = 0; > + /* .. and update from there */ > + hwp_req.min_perf = data->minimum; > + hwp_req.max_perf = data->maximum; > + hwp_req.desired = data->desired; We typically prefer to use initializers to achieve the same effect. Since the bitfields part is in an unnamed struct, old gcc would prohibit use of an initializer for all of the assignments, but at least "raw" can be set in the initializer.
> + if ( feature_hwp_energy_perf ) > + hwp_req.energy_perf = data->energy_perf; > + if ( feature_hwp_activity_window ) > + hwp_req.activity_window = data->activity_window; > + > + if ( hwp_req.raw == data->curr_req.raw ) > + return 0; > + > + data->curr_req.raw = hwp_req.raw; I think you can omit .raw on both sides. > + hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw); > + on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1); > + > + if ( !feature_hwp_energy_perf && data->energy_perf ) > + { > + on_selected_cpus(cpumask_of(cpu), hwp_energy_perf_bias, > + data, 1); > + } Like elsewhere, please omit unnecessary braces. > +static int hwp_cpufreq_cpu_init(struct cpufreq_policy *policy) > +{ > + unsigned int cpu = policy->cpu; > + struct hwp_drv_data *data; > + > + if ( cpufreq_opt_governor ) > + { > + printk(XENLOG_WARNING > + "HWP: governor \"%s\" is incompatible with hwp. Using default \"%s\"\n", > + cpufreq_opt_governor->name, hwp_cpufreq_governor.name); > + } Same here (and perhaps elsewhere) again. > + policy->governor = &hwp_cpufreq_governor; > + > + data = xzalloc(typeof(*data)); Commonly we specify the type explicitly in such cases, rather than using typeof(). I will admit though that I'm not entirely certain which one's better. But consistency across the code base is perhaps preferable for the time being. > + if ( !data ) > + return -ENOMEM; Is it correct to have set the governor before this error check? > + hwp_drv_data[cpu] = data; > + > + on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1); > + > + data->minimum = data->hw_lowest; > + data->maximum = data->hw_highest; > + data->desired = 0; /* default to HW autonomous */ > + if ( feature_hwp_energy_perf ) > + data->energy_perf = 0x80; > + else > + data->energy_perf = 7; Where's this 7 coming from? (You do mention the 0x80 at least in the description.) 
> +static int hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy) > +{ > + unsigned int cpu = policy->cpu; > + > + xfree(hwp_drv_data[cpu]); > + hwp_drv_data[cpu] = NULL; Please don't open-code XFREE(). > +int hwp_register_driver(void) __init > +{ > + int ret; > + > + ret = cpufreq_register_driver(&hwp_cpufreq_driver); > + > + return ret; Preferably the body would consist of just a single return statement. > --- a/xen/include/asm-x86/msr-index.h > +++ b/xen/include/asm-x86/msr-index.h > @@ -101,6 +101,12 @@ > #define MSR_RTIT_ADDR_A(n) (0x00000580 + (n) * 2) > #define MSR_RTIT_ADDR_B(n) (0x00000581 + (n) * 2) > > +#define MSR_FAST_UNCORE_MSRS_CTL 0x00000657 > +#define FAST_IA32_HWP_REQUEST_MSR_ENABLE (_AC(1, ULL) << 0) > + > +#define MSR_FAST_UNCORE_MSRS_CAPABILITY 0x0000065f > +#define FAST_IA32_HWP_REQUEST (_AC(1, ULL) << 0) > + > #define MSR_U_CET 0x000006a0 > #define MSR_S_CET 0x000006a2 > #define CET_SHSTK_EN (_AC(1, ULL) << 0) > @@ -112,10 +118,20 @@ > #define MSR_PL3_SSP 0x000006a7 > #define MSR_INTERRUPT_SSP_TABLE 0x000006a8 > > +#define MSR_IA32_PM_ENABLE 0x00000770 > +#define IA32_PM_ENABLE_HWP_ENABLE (_AC(1, ULL) << 0) Please have a blank line after here. > +#define MSR_IA32_HWP_CAPABILITIES 0x00000771 > +#define MSR_IA32_HWP_REQUEST 0x00000774 > + > #define MSR_PASID 0x00000d93 > #define PASID_PASID_MASK 0x000fffff > #define PASID_VALID (_AC(1, ULL) << 31) > > +#define MSR_IA32_PKG_HDC_CTL 0x00000db0 > +#define IA32_PKG_HDC_CTL_HDC_PKG_Enable (_AC(1, ULL) << 0) I don't think "HDC" twice is helpful? Also please use all upper case names (actually also for the CPUID constants higher up). Jan
On 26.05.2021 16:59, Jan Beulich wrote: > On 03.05.2021 21:28, Jason Andryuk wrote: >> +static void hdc_set_pkg_hdc_ctl(bool val) >> +{ >> + uint64_t msr; >> + >> + if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) ) >> + { >> + hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n"); > > I'm not convinced of the need of having such log messages after > failed rdmsr/wrmsr, but I'm definitely against them getting logged > unconditionally. In verbose mode, maybe. Perhaps not even there, considering that recovery from faults gets logged already anyway (see extable_fixup()), and is suitably restricted to debug builds. Jan
On 03.05.2021 21:28, Jason Andryuk wrote: > If HWP Energy_Performance_Preference isn't supported, the code falls > back to IA32_ENERGY_PERF_BIAS. Right now, we don't check > CPUID.06H:ECX.SETBH[bit 3] before using that MSR. The SDM reads like > it'll be available, and I assume it was available by the time Skylake > introduced HWP. Upon more detailed reading of the respective SDM sections, I only see two options: Either you fail driver initialization if the bit is clear, or you correctly deal with the bit being clear. If Xen runs virtualized itself, the combination of CPUID bits set may not match that of any bare metal hardware that exists. Jan
On Wed, May 26, 2021 at 11:00 AM Jan Beulich <jbeulich@suse.com> wrote: > > On 03.05.2021 21:28, Jason Andryuk wrote: > > If HWP Energy_Performance_Preference isn't supported, the code falls > > back to IA32_ENERGY_PERF_BIAS. Right now, we don't check > > CPUID.06H:ECX.SETBH[bit 3] before using that MSR. > > May I ask what problem there is doing so? > > > The SDM reads like > > it'll be available, and I assume it was available by the time Skylake > > introduced HWP. > > The SDM documents the MSR's presence back to at least Nehalem, but ties > it to the CPUID bit even there. Your point about inconsistent virtualized environments is something I hadn't considered. I will add a check, but I will make hwp disable in that case. While it could run just without an energy/performance preference, HWP doesn't seem useful in that scenario. > > --- a/docs/misc/xen-command-line.pandoc > > +++ b/docs/misc/xen-command-line.pandoc > > @@ -1355,6 +1355,15 @@ Specify whether guests are to be given access to physical port 80 > > (often used for debugging purposes), to override the DMI based > > detection of systems known to misbehave upon accesses to that port. > > > > +### hwp (x86) > > +> `= <boolean>` > > + > > +> Default: `true` > > + > > +Specifies whether Xen uses Hardware-Controlled Performance States (HWP) > > +on supported Intel hardware. HWP is a Skylake+ feature which provides > > +better CPU power management. > > Is there a particular reason giving this a top-level option rather > than a sub-option of cpufreq=? I will investigate. Below, I'm trimming everything where I will just make the changes according to your earlier email. > > + }; > > + union hwp_request curr_req; > > + uint16_t activity_window; > > + uint8_t minimum; > > + uint8_t maximum; > > + uint8_t desired; > > + uint8_t energy_perf; > > +}; > > +struct hwp_drv_data *hwp_drv_data[NR_CPUS]; > > New NR_CPUS-dimensioned arrays need explicit justification. 
From > what I get I can't see why this couldn't be per-CPU data instead. > > Also - static? I followed acpi-cpufreq with the NR_CPUS array. per-cpu makes sense and I'll investigate. > > + hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n", > > + eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not "); > > + if ( eax & CPUID6_EAX_FAST_HWP_MSR ) > > + { > > + if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) ) > > + hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n"); > > + > > + hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val); > > Missing "else" above here? Are unbalanced braces acceptable or must they be balanced? Is this acceptable: if () one; else { two; three; } > > +static void hdc_set_pkg_hdc_ctl(bool val) > > +{ > > + uint64_t msr; > > + > > + if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) ) > > + { > > + hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n"); > > I'm not convinced of the need of having such log messages after > failed rdmsr/wrmsr, but I'm definitely against them getting logged > unconditionally. In verbose mode, maybe. We should not fail the rdmsr if our earlier cpuid check passed. So in that respect these messages can be removed. The benefit here is that you can see which MSR failed. If you relied on extable_fixup, you would have to cross reference manually. How will administrators know if hwp isn't working properly if there are no messages? For HWP in general, the SDM says to check CPUID for the availability of MSRs. Therefore rdmsr/wrmsr shouldn't fail. During development, I saw wrmsr failures with "Valid Maximum" and other "Valid" bits set. I think that was because I hadn't set up the Package Request MSR. That has been fixed by not using those bits. Anyway, rdmsr/wrmsr shouldn't fail, so how much code should be put into checking for those failures which shouldn't happen? My sample size is 2 models of processor, so verbose reporting of errors makes some sense during wider deployment and testing.
> > + return; > > + } > > + > > + msr = val ? IA32_PKG_HDC_CTL_HDC_PKG_Enable : 0; > > If you don't use the prior value, why did you read it? But I > think you really mean to set/clear just bit 0. During development I printed the initial values. I removed the printing before submission but not the reading. In the SDM, it says "Bits 63:1 are reserved and must be zero", so I intended to only write 0 or 1. Below, you remark on not discarding reserved bits. So you want all of these to be read-modify-write operations? > > +static void hdc_set_pm_ctl1(bool val) > > +{ > > + uint64_t msr; > > + > > + if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) ) > > + { > > + hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n"); > > + > > + return; > > + } > > + > > + msr = val ? IA32_PM_CTL1_HDC_Allow_Block : 0; > > Same here then, and ... > > +static void hwp_fast_uncore_msrs_ctl(bool val) > > +{ > > + uint64_t msr; > > + > > + if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) ) > > + hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL)\n"); > > + > > + msr = val; > > ... here (where you imply bit 0 instead of using a proper > constant). > > Also for all three functions I'm not convinced their names are > in good sync with their parameters being boolean.
Would you prefer something named closer to the bit names like: hdc_set_pkg_hdc_ctl -> hdc_set_pkg_enable hdc_set_pm_ctl1 -> hdc_set_allow_block hwp_fast_uncore_msrs_ctl -> hwp_fast_request_enable > > +static void hwp_read_capabilities(void *info) > > +{ > > + struct cpufreq_policy *policy = info; > > + struct hwp_drv_data *data = hwp_drv_data[policy->cpu]; > > + > > + if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) ) > > + { > > + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n", > > + policy->cpu); > > + > > + return; > > + } > > + > > + if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) ) > > + { > > + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu); > > + > > + return; > > + } > > +} > > This function doesn't indicate failure to its caller(s), so am I > to understand that failure to read either ofthe MSRs is actually > benign to the driver? They really shouldn't fail since the CPUID check passed during initialization. If you can't read/write MSRs, something is broken and the driver cannot function. The machine would still run, but HWP would be uncontrollable. How much care should be given to "impossible" situations? > > + if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) ) > > + { > > + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu); > > + > > + return; > > + } > > + > > + hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val); > > + if ( val != IA32_PM_ENABLE_HWP_ENABLE ) > > + { > > + val = IA32_PM_ENABLE_HWP_ENABLE; > > You should neither depend on reserved bits being zero, nor discard any > non-zero value here, I think. Ok. > > +static void hwp_write_request(void *info) > > +{ > > + struct cpufreq_policy *policy = info; > > + struct hwp_drv_data *data = hwp_drv_data[policy->cpu]; > > + union hwp_request hwp_req = data->curr_req; > > + > > + BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t)); > > ITYM > > BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw)); > > here? 
Originally, I wanted this to live next to the union definition. However, BUILD_BUG_ON needs to live in a function, so I placed it here where it is used. I'd prefer BUILD_BUG_ON(sizeof(hwp_req) != sizeof(uint64_t)) to make the comparison to 64bit, the size of the MSR, explicit. > > + if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) ) > > + { > > + hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n", > > + policy->cpu, hwp_req.raw); > > + rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw); > > What if this one fails, too? data->curr_req.raw then pretty certainly > ends up stale. It would be stale, but I think this is unlikely. rdmsr should not fail given our earlier CPUID checks. wrmsr could fail if you set something wrong. During testing I set some of the valid "Maximum/Minimum" bits (now unused) that cause wrmsr to fault when some other MSR (maybe IA32_HWP_REQUEST_PKG) wasn't set. > > + policy->governor = &hwp_cpufreq_governor; > > + > > + data = xzalloc(typeof(*data)); > > Commonly we specify the type explicitly in such cases, rather than using > typeof(). I will admit though that I'm not entirely certain which one's > better. But consistency across the code base is perhaps preferable for > the time being. I thought typeof(*data) is always preferable since it will always be the matching type. I can change it, but I feel like it's a step backwards. > > + if ( !data ) > > + return -ENOMEM; > > Is it correct to have set the governor before this error check? I will re-order. > > + hwp_drv_data[cpu] = data; > > + > > + on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1); > > + > > + data->minimum = data->hw_lowest; > > + data->maximum = data->hw_highest; > > + data->desired = 0; /* default to HW autonomous */ > > + if ( feature_hwp_energy_perf ) > > + data->energy_perf = 0x80; > > + else > > + data->energy_perf = 7; > > Where's this 7 coming from? (You do mention the 0x80 at least in the > description.) 
When HWP energy performance preference is unavailable, it falls back to IA32_ENERGY_PERF_BIAS with a 0-15 range. Per the SDM Vol3 14.3.4, "A value of 7 roughly translates into a hint to balance performance with energy consumption." I will add a comment. > > --- a/xen/include/asm-x86/msr-index.h > > +++ b/xen/include/asm-x86/msr-index.h > > +#define MSR_IA32_HWP_CAPABILITIES 0x00000771 > > +#define MSR_IA32_HWP_REQUEST 0x00000774 > > + > > #define MSR_PASID 0x00000d93 > > #define PASID_PASID_MASK 0x000fffff > > #define PASID_VALID (_AC(1, ULL) << 31) > > > > +#define MSR_IA32_PKG_HDC_CTL 0x00000db0 > > +#define IA32_PKG_HDC_CTL_HDC_PKG_Enable (_AC(1, ULL) << 0) > > I don't think "HDC" twice is helpful? I took the MSR name "IA32_PKG_HDC_CTL" and bit name "HDC_PKG_Enable" from the SDM. I intentionally left the case to match the SDM. > Also please use all upper case names (actually also for the > CPUID constants higher up). Again, I took them straight from the SDM. I will make them all upper case and switch IA32_PKG_HDC_CTL_HDC_PKG_Enable to IA32_PKG_HDC_CTL_PKG_Enable. Regards, Jason
On 27.05.2021 20:50, Jason Andryuk wrote: > On Wed, May 26, 2021 at 11:00 AM Jan Beulich <jbeulich@suse.com> wrote: >> >> On 03.05.2021 21:28, Jason Andryuk wrote: >>> If HWP Energy_Performance_Preference isn't supported, the code falls >>> back to IA32_ENERGY_PERF_BIAS. Right now, we don't check >>> CPUID.06H:ECX.SETBH[bit 3] before using that MSR. >> >> May I ask what problem there is doing so? >> >>> The SDM reads like >>> it'll be available, and I assume it was available by the time Skylake >>> introduced HWP. >> >> The SDM documents the MSR's presence back to at least Nehalem, but ties >> it to the CPUID bit even there. > > Your point about inconsistent virtualized environments is something I > hadn't considered. I will add a check, but I will make hwp disable in > that case. While it could run just without an energy/performance > preference, HWP doesn't seem useful in that scenario. Makes sense. Of course I wouldn't expect hypervisors to populate much of CPUID leaf 6 anyway, like is the case for Xen itself. >>> + hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n", >>> + eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not "); >>> + if ( eax & CPUID6_EAX_FAST_HWP_MSR ) >>> + { >>> + if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) ) >>> + hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n"); >>> + >>> + hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val); >> >> Missing "else" above here? > > Are unbalanced braces acceptable or must they be balanced? Is this acceptable: > if () > one; > else { > two; > three; > } Yes, it is. But I don't see how the question relates to my comment. All that needs to go in the else's body is the hwp_verbose(). 
>>> +static void hdc_set_pkg_hdc_ctl(bool val) >>> +{ >>> + uint64_t msr; >>> + >>> + if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) ) >>> + { >>> + hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n"); >> >> I'm not convinced of the need of having such log messages after >> failed rdmsr/wrmsr, but I'm definitely against them getting logged >> unconditionally. In verbose mode, maybe. > > We should not fail the rdmsr if our earlier cpuid check passed. So in > that respect these messages can be removed. The benefit here is that > you can see which MSR failed. If you relied on extable_fixup, you > would have to cross reference manually. How will administors know if > hwp isn't working properly there are no messages? This same question would go for various other MSR accesses which might fail, but which aren't accompanied by an explicit log message. I wouldn't mind a single summary message reporting if e.g. HWP setup failed. More detailed analysis of such would be more of a developer's than an administrator's job then anyway. > For HWP in general, the SDM says to check CPUID for the availability > of MSRs. Therefore rdmsr/wrmsr shouldn't fail. During development, I > saw wrmsr failures when with "Valid Maximum" and other "Valid" bits > set. I think that was because I hadn't set up the Package Request > MSR. That has been fixed by not using those bits. Anyway, > rdmsr/wrmsr shouldn't fail, so how much code should be put into > checking for those failures which shouldn't happen? If there's any risk of accesses causing a fault, using *msr_safe() is of course the right choice. Beyond that you will need to consider what the consequences are. Sadly this needs doing on every single case individually. See "Handling unexpected conditions" section of ./CODING_STYLE for guidelines (even if the specific constructs aren't in use here). >>> + return; >>> + } >>> + >>> + msr = val ? IA32_PKG_HDC_CTL_HDC_PKG_Enable : 0; >> >> If you don't use the prior value, why did you read it? 
But I >> think you really mean to set/clear just bot 0. > > During development I printed the initial values . I removed the > printing before submission but not the reading. > > In the SDM, It says "Bits 63:1 are reserved and must be zero", so I > intended to only write 0 or 1. Below, you remark on not discarding > reserved bits. So you want all of these to be read-modify-write > operations? Yes, this is the only way to be forward compatible. >>> +static void hdc_set_pm_ctl1(bool val) >>> +{ >>> + uint64_t msr; >>> + >>> + if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) ) >>> + { >>> + hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n"); >>> + >>> + return; >>> + } >>> + >>> + msr = val ? IA32_PM_CTL1_HDC_Allow_Block : 0; >> >> Same here then, and ... >> >>> +static void hwp_fast_uncore_msrs_ctl(bool val) >>> +{ >>> + uint64_t msr; >>> + >>> + if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) ) >>> + hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL)\n"); >>> + >>> + msr = val; >> >> ... here (where you imply bit 0 instead of using a proper >> constant). >> >> Also for all three functions I'm not convinced their names are >> in good sync with their parameters being boolean. > > Would you prefer something named closer to the bit names like: > hdc_set_pkg_hdc_ctl -> hdc_set_pkg_enable > hdc_set_pm_ctl1 -> hdc_set_allow_block > hwp_fast_uncore_msrs_ctl -> hwp_fast_request_enable My primary request is for function name, purpose, and parameter(s) to be in line. So e.g. if you left the parameters as boolean, then I think your suggested name changes make sense. Alternatively you could make the functions e.g. be full register updating ones, with suitable parameters, which could be a mask-to-set and mask-to-clear. 
>>> +static void hwp_read_capabilities(void *info)
>>> +{
>>> +    struct cpufreq_policy *policy = info;
>>> +    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
>>> +
>>> +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
>>> +    {
>>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
>>> +                policy->cpu);
>>> +
>>> +        return;
>>> +    }
>>> +
>>> +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
>>> +    {
>>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
>>> +
>>> +        return;
>>> +    }
>>> +}
>>
>> This function doesn't indicate failure to its caller(s), so am I
>> to understand that failure to read either of the MSRs is actually
>> benign to the driver?
>
> They really shouldn't fail since the CPUID check passed during
> initialization. If you can't read/write MSRs, something is broken and
> the driver cannot function. The machine would still run, but HWP
> would be uncontrollable. How much care should be given to
> "impossible" situations?

See above. The main point is to avoid crashing. In the specific
case here I think you could simply drop both the log messages and
the early return, assuming the caller can deal fine with the zero
value(s) that rdmsr_safe() will substitute for the MSR value(s).
Bailing early, otoh, calls for handing back an error indicator
imo.

>>> +static void hwp_write_request(void *info)
>>> +{
>>> +    struct cpufreq_policy *policy = info;
>>> +    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
>>> +    union hwp_request hwp_req = data->curr_req;
>>> +
>>> +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
>>
>> ITYM
>>
>> BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw));
>>
>> here?
>
> Originally, I wanted this to live next to the union definition.
> However, BUILD_BUG_ON needs to live in a function, so I placed it here
> where it is used.
>
> I'd prefer
> BUILD_BUG_ON(sizeof(hwp_req) != sizeof(uint64_t))
>
> to make the comparison to 64bit, the size of the MSR, explicit.

Well, then "raw" could still be wrong in principle, but it is the
whole point of having "raw" to have it match MSR size. So while I
could live with your alternative, I continue to think my suggestion
is the more appropriate form of the check.

>>> +    policy->governor = &hwp_cpufreq_governor;
>>> +
>>> +    data = xzalloc(typeof(*data));
>>
>> Commonly we specify the type explicitly in such cases, rather than using
>> typeof(). I will admit though that I'm not entirely certain which one's
>> better. But consistency across the code base is perhaps preferable for
>> the time being.
>
> I thought typeof(*data) is always preferable since it will always be
> the matching type. I can change it, but I feel like it's a step
> backwards.

There's exactly one similar use in the entire code base. The original
idea with xmalloc() was that one would explicitly specify the intended
type, such that without looking elsewhere one can see what exactly is
to be allocated. One could further rely on the compiler warning if
there was a disagreement between declaration and assignment. If
instead we were to use typeof() everywhere, there'd be a fair amount
of redundancy between the LHS of the assignment and the operand of
typeof(), which would call for eliminating (by abstracting away).

>>> +    if ( feature_hwp_energy_perf )
>>> +        data->energy_perf = 0x80;
>>> +    else
>>> +        data->energy_perf = 7;
>>
>> Where's this 7 coming from? (You do mention the 0x80 at least in the
>> description.)
>
> When HWP energy performance preference is unavailable, it falls back
> to IA32_ENERGY_PERF_BIAS with a 0-15 range. Per the SDM Vol3 14.3.4,
> "A value of 7 roughly translates into a hint to balance performance
> with energy consumption." I will add a comment.
Actually, as per a comment on a later patch, I'm instead expecting
you to add a #define, the name of which will serve as comment.

Jan
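The requested #define could look roughly like this; the macro names here are hypothetical (the real ones are left to the next revision of the patch), while the values come from SDM Vol. 3 14.3.4 and the 0-0xff HWP preference range already used in the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names.  7 is SDM Vol. 3 14.3.4's "balance performance
 * with energy consumption" hint for IA32_ENERGY_PERF_BIAS (0-15
 * range); 0x80 is the midpoint of HWP's 0-0xff energy/performance
 * preference range. */
#define IA32_ENERGY_BIAS_BALANCE    7
#define IA32_ENERGY_PERF_BALANCE    0x80

/* Default depends on whether the HWP energy/performance preference
 * field is available, matching the if/else in hwp_cpufreq_cpu_init(). */
static uint8_t default_energy_perf(bool hwp_energy_perf)
{
    return hwp_energy_perf ? IA32_ENERGY_PERF_BALANCE
                           : IA32_ENERGY_BIAS_BALANCE;
}
```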
On Fri, May 28, 2021 at 2:35 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 27.05.2021 20:50, Jason Andryuk wrote:
> > On Wed, May 26, 2021 at 11:00 AM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 03.05.2021 21:28, Jason Andryuk wrote:
> >>> +    hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n",
> >>> +                eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not ");
> >>> +    if ( eax & CPUID6_EAX_FAST_HWP_MSR )
> >>> +    {
> >>> +        if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) )
> >>> +            hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n");
> >>> +
> >>> +        hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val);
> >>
> >> Missing "else" above here?
> >
> > Are unbalanced braces acceptable or must they be balanced? Is this acceptable:
> > if ()
> >     one;
> > else {
> >     two;
> >     three;
> > }
>
> Yes, it is. But I don't see how the question relates to my comment.
> All that needs to go in the else's body is the hwp_verbose().

'val' shouldn't be used to set features when the rdmsr fails, so the
following code needs to be within the else. Unless you want to rely
on a failed rdmsr returning 0.

> >>> +static void hdc_set_pkg_hdc_ctl(bool val)
> >>> +{
> >>> +    uint64_t msr;
> >>> +
> >>> +    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> >>> +    {
> >>> +        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
> >>
> >> I'm not convinced of the need of having such log messages after
> >> failed rdmsr/wrmsr, but I'm definitely against them getting logged
> >> unconditionally. In verbose mode, maybe.
> >
> > We should not fail the rdmsr if our earlier cpuid check passed. So in
> > that respect these messages can be removed. The benefit here is that
> > you can see which MSR failed. If you relied on extable_fixup, you
> > would have to cross reference manually. How will administrators know
> > if hwp isn't working properly when there are no messages?
>
> This same question would go for various other MSR accesses which
> might fail, but which aren't accompanied by an explicit log message.
> I wouldn't mind a single summary message reporting if e.g. HWP
> setup failed. More detailed analysis of such would be more of a
> developer's than an administrator's job then anyway.
>
> > For HWP in general, the SDM says to check CPUID for the availability
> > of MSRs. Therefore rdmsr/wrmsr shouldn't fail. During development, I
> > saw wrmsr failures with "Valid Maximum" and other "Valid" bits
> > set. I think that was because I hadn't set up the Package Request
> > MSR. That has been fixed by not using those bits. Anyway,
> > rdmsr/wrmsr shouldn't fail, so how much code should be put into
> > checking for those failures which shouldn't happen?
>
> If there's any risk of accesses causing a fault, using *msr_safe()
> is of course the right choice. Beyond that you will need to consider
> what the consequences are. Sadly this needs doing on every single
> case individually. See "Handling unexpected conditions" section of
> ./CODING_STYLE for guidelines (even if the specific constructs
> aren't in use here).

Using *msr_safe(), I think the worst case is the users can't set HWP
in the way they want. So power/performance may not be what the users
wanted, but Xen won't crash. The hardware will throttle itself if
needed for self-protection, so that should be okay as well.

> >>> +static void hdc_set_pm_ctl1(bool val)
> >>> +{
> >>> +    uint64_t msr;
> >>> +
> >>> +    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
> >>> +    {
> >>> +        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
> >>> +
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    msr = val ? IA32_PM_CTL1_HDC_Allow_Block : 0;
> >>
> >> Same here then, and ...
> >>
> >>> +static void hwp_fast_uncore_msrs_ctl(bool val)
> >>> +{
> >>> +    uint64_t msr;
> >>> +
> >>> +    if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) )
> >>> +        hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL)\n");
> >>> +
> >>> +    msr = val;
> >>
> >> ... here (where you imply bit 0 instead of using a proper
> >> constant).
> >>
> >> Also for all three functions I'm not convinced their names are
> >> in good sync with their parameters being boolean.
> >
> > Would you prefer something named closer to the bit names like:
> > hdc_set_pkg_hdc_ctl -> hdc_set_pkg_enable
> > hdc_set_pm_ctl1 -> hdc_set_allow_block
> > hwp_fast_uncore_msrs_ctl -> hwp_fast_request_enable
>
> My primary request is for function name, purpose, and parameter(s)
> to be in line. So e.g. if you left the parameters as boolean, then
> I think your suggested name changes make sense. Alternatively you
> could make the functions e.g. be full register updating ones, with
> suitable parameters, which could be a mask-to-set and mask-to-clear.

I'm going to use the replacement names while keeping the boolean
argument. Those MSRs only have single bits defined today, so
functions with boolean parameters are simpler.

> >>> +static void hwp_read_capabilities(void *info)
> >>> +{
> >>> +    struct cpufreq_policy *policy = info;
> >>> +    struct hwp_drv_data *data = hwp_drv_data[policy->cpu];
> >>> +
> >>> +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
> >>> +    {
> >>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
> >>> +                policy->cpu);
> >>> +
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
> >>> +    {
> >>> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu);
> >>> +
> >>> +        return;
> >>> +    }
> >>> +}
> >>
> >> This function doesn't indicate failure to its caller(s), so am I
> >> to understand that failure to read either of the MSRs is actually
> >> benign to the driver?
> >
> > They really shouldn't fail since the CPUID check passed during
> > initialization. If you can't read/write MSRs, something is broken and
> > the driver cannot function. The machine would still run, but HWP
> > would be uncontrollable. How much care should be given to
> > "impossible" situations?
>
> See above. The main point is to avoid crashing. In the specific
> case here I think you could simply drop both the log messages and
> the early return, assuming the caller can deal fine with the zero
> value(s) that rdmsr_safe() will substitute for the MSR value(s).
> Bailing early, otoh, calls for handing back an error indicator
> imo.

I changed it to have failure set curr_req.raw = -1. Then
cpufreq_driver.init returns -ENODEV in that case.

> >>> +    policy->governor = &hwp_cpufreq_governor;
> >>> +
> >>> +    data = xzalloc(typeof(*data));
> >>
> >> Commonly we specify the type explicitly in such cases, rather than using
> >> typeof(). I will admit though that I'm not entirely certain which one's
> >> better. But consistency across the code base is perhaps preferable for
> >> the time being.
> >
> > I thought typeof(*data) is always preferable since it will always be
> > the matching type. I can change it, but I feel like it's a step
> > backwards.
>
> There's exactly one similar use in the entire code base. The original
> idea with xmalloc() was that one would explicitly specify the intended
> type, such that without looking elsewhere one can see what exactly is
> to be allocated. One could further rely on the compiler warning if
> there was a disagreement between declaration and assignment.

Oh, okay. Thanks for the explanation of xmalloc().

> >>> +    if ( feature_hwp_energy_perf )
> >>> +        data->energy_perf = 0x80;
> >>> +    else
> >>> +        data->energy_perf = 7;
> >>
> >> Where's this 7 coming from? (You do mention the 0x80 at least in the
> >> description.)
> >
> > When HWP energy performance preference is unavailable, it falls back
> > to IA32_ENERGY_PERF_BIAS with a 0-15 range. Per the SDM Vol3 14.3.4,
> > "A value of 7 roughly translates into a hint to balance performance
> > with energy consumption." I will add a comment.
>
> Actually, as per a comment on a later patch, I'm instead expecting
> you to add a #define, the name of which will serve as comment.

Yes, sounds good.

Regards,
Jason
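The curr_req.raw = -1 scheme Jason describes could look roughly like this (a user-space sketch: the union is trimmed and the fault is simulated, since the real reader runs on the remote CPU via on_selected_cpus() and cannot return an error directly):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Trimmed version of the patch's union hwp_request. */
union hwp_request {
    struct {
        uint64_t min_perf:8;
        uint64_t max_perf:8;
        uint64_t remainder:48;
    };
    uint64_t raw;
};

/* Remote-CPU reader: a faulting rdmsr leaves the all-ones sentinel in
 * place of the MSR value, because a void cross-CPU callback has no
 * way to hand back an error code. */
static void read_request(union hwp_request *req, int fault)
{
    req->raw = fault ? ~0ULL : 0x7f01ULL /* mock MSR value */;
}

/* Back on the initiating CPU, cpufreq_driver.init() checks for the
 * sentinel and fails driver initialisation. */
static int hwp_init_check(const union hwp_request *req)
{
    return req->raw == ~0ULL ? -ENODEV : 0;
}
```

The sentinel works because ~0 is not a plausible MSR value here (it would mean every reserved bit set), so no valid read can be mistaken for a failure.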
On 03.06.2021 13:55, Jason Andryuk wrote:
> On Fri, May 28, 2021 at 2:35 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 27.05.2021 20:50, Jason Andryuk wrote:
>>> On Wed, May 26, 2021 at 11:00 AM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 03.05.2021 21:28, Jason Andryuk wrote:
>>>>> +    hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n",
>>>>> +                eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not ");
>>>>> +    if ( eax & CPUID6_EAX_FAST_HWP_MSR )
>>>>> +    {
>>>>> +        if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) )
>>>>> +            hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n");
>>>>> +
>>>>> +        hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val);
>>>>
>>>> Missing "else" above here?
>>>
>>> Are unbalanced braces acceptable or must they be balanced? Is this acceptable:
>>> if ()
>>>     one;
>>> else {
>>>     two;
>>>     three;
>>> }
>>
>> Yes, it is. But I don't see how the question relates to my comment.
>> All that needs to go in the else's body is the hwp_verbose().
>
> 'val' shouldn't be used to set features when the rdmsr fails, so the
> following code needs to be within the else. Unless you want to rely
> on a failed rdmsr returning 0.

It is intentional for rdmsr_safe() to return a zero value when the
access faulted, so I certainly think you may rely on this.

Jan
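The zero-on-fault behaviour Jan points to is what makes the unbalanced form safe; a user-space mock of the detection path (stub accessor and hypothetical `detect_fast_msr` name, mirroring rdmsr_safe()'s documented substitution of zero):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAST_IA32_HWP_REQUEST (1ULL << 0)

/* Stub mirroring Xen's rdmsr_safe(): when the access "faults" it
 * substitutes zero for the value and returns non-zero. */
static int rdmsr_stub(bool fault, uint64_t hw, uint64_t *val)
{
    *val = fault ? 0 : hw;
    return fault ? -1 : 0;
}

/* Feature detection stays correct even without an else branch: a
 * faulted read yields val == 0, so no capability bit can be seen as
 * set from stale or garbage data. */
static bool detect_fast_msr(bool fault, uint64_t hw)
{
    uint64_t val;

    if ( rdmsr_stub(fault, hw, &val) )
        /* only a log message would go here */;

    return val & FAST_IA32_HWP_REQUEST;
}
```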
diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc index c32a397a12..66363a3d71 100644 --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -1355,6 +1355,15 @@ Specify whether guests are to be given access to physical port 80 (often used for debugging purposes), to override the DMI based detection of systems known to misbehave upon accesses to that port. +### hwp (x86) +> `= <boolean>` + +> Default: `true` + +Specifies whether Xen uses Hardware-Controlled Performance States (HWP) +on supported Intel hardware. HWP is a Skylake+ feature which provides +better CPU power management. + ### idle_latency_factor (x86) > `= <integer>` diff --git a/xen/arch/x86/acpi/cpufreq/Makefile b/xen/arch/x86/acpi/cpufreq/Makefile index f75da9b9ca..db83aa6b14 100644 --- a/xen/arch/x86/acpi/cpufreq/Makefile +++ b/xen/arch/x86/acpi/cpufreq/Makefile @@ -1,2 +1,3 @@ obj-y += cpufreq.o +obj-y += hwp.o obj-y += powernow.o diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c index 8aae9b534d..966490bda1 100644 --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c @@ -641,9 +641,12 @@ static int __init cpufreq_driver_init(void) int ret = 0; if ((cpufreq_controller == FREQCTL_xen) && - (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)) - ret = cpufreq_register_driver(&acpi_cpufreq_driver); - else if ((cpufreq_controller == FREQCTL_xen) && + (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)) { + if (hwp_available()) + ret = hwp_register_driver(); + else + ret = cpufreq_register_driver(&acpi_cpufreq_driver); + } else if ((cpufreq_controller == FREQCTL_xen) && (boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON))) ret = powernow_register_driver(); diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c new file mode 100644 index 0000000000..f8e6fdbd41 --- /dev/null +++ b/xen/arch/x86/acpi/cpufreq/hwp.c @@ -0,0 +1,533 @@ +/* + * hwp.c cpufreq 
driver to run Intel Hardware P-States (HWP) + * + * Copyright (C) 2021 Jason Andryuk <jandryuk@gmail.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or (at + * your option) any later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License along + * with this program; If not, see <http://www.gnu.org/licenses/>. + */ + +#include <xen/cpumask.h> +#include <xen/init.h> +#include <xen/param.h> +#include <xen/xmalloc.h> +#include <asm/msr.h> +#include <asm/io.h> +#include <acpi/cpufreq/cpufreq.h> + +static bool feature_hwp; +static bool feature_hwp_notification; +static bool feature_hwp_activity_window; +static bool feature_hwp_energy_perf; +static bool feature_hwp_pkg_level_ctl; +static bool feature_hwp_peci; + +static bool feature_hdc; +static bool feature_fast_msr; + +bool opt_hwp = true; +boolean_param("hwp", opt_hwp); + +union hwp_request +{ + struct + { + uint64_t min_perf:8; + uint64_t max_perf:8; + uint64_t desired:8; + uint64_t energy_perf:8; + uint64_t activity_window:10; + uint64_t package_control:1; + uint64_t reserved:16; + uint64_t activity_window_valid:1; + uint64_t energy_perf_valid:1; + uint64_t desired_valid:1; + uint64_t max_perf_valid:1; + uint64_t min_perf_valid:1; + }; + uint64_t raw; +}; + +struct hwp_drv_data +{ + union + { + uint64_t hwp_caps; + struct + { + uint64_t hw_highest:8; + uint64_t hw_guaranteed:8; + uint64_t hw_most_efficient:8; + uint64_t hw_lowest:8; + uint64_t hw_reserved:32; + }; + }; + union hwp_request curr_req; + uint16_t activity_window; + uint8_t minimum; + uint8_t maximum; + 
uint8_t desired; + uint8_t energy_perf; +}; +struct hwp_drv_data *hwp_drv_data[NR_CPUS]; + +#define hwp_err(...) printk(XENLOG_ERR __VA_ARGS__) +#define hwp_info(...) printk(XENLOG_INFO __VA_ARGS__) +#define hwp_verbose(...) \ +({ \ + if ( cpufreq_verbose ) \ + { \ + printk(XENLOG_DEBUG __VA_ARGS__); \ + } \ +}) +#define hwp_verbose_cont(...) \ +({ \ + if ( cpufreq_verbose ) \ + { \ + printk( __VA_ARGS__); \ + } \ +}) + +static int hwp_governor(struct cpufreq_policy *policy, + unsigned int event) +{ + int ret; + + if ( policy == NULL ) + return -EINVAL; + + switch (event) + { + case CPUFREQ_GOV_START: + ret = 0; + break; + case CPUFREQ_GOV_STOP: + ret = -EINVAL; + break; + case CPUFREQ_GOV_LIMITS: + ret = 0; + break; + default: + ret = -EINVAL; + } + + return ret; +} + +static struct cpufreq_governor hwp_cpufreq_governor = +{ + .name = "hwp-internal", + .governor = hwp_governor, +}; + +static int __init cpufreq_gov_hwp_init(void) +{ + return cpufreq_register_governor(&hwp_cpufreq_governor); +} +__initcall(cpufreq_gov_hwp_init); + +bool hwp_available(void) +{ + uint32_t eax; + uint64_t val; + bool use_hwp; + + if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF ) + { + hwp_verbose("cpuid_level (%u) lacks HWP support\n", boot_cpu_data.cpuid_level); + + return false; + } + + eax = cpuid_eax(CPUID_PM_LEAF); + feature_hwp = !!(eax & CPUID6_EAX_HWP); + feature_hwp_notification = !!(eax & CPUID6_EAX_HWP_Notification); + feature_hwp_activity_window = !!(eax & CPUID6_EAX_HWP_Activity_Window); + feature_hwp_energy_perf = + !!(eax & CPUID6_EAX_HWP_Energy_Performance_Preference); + feature_hwp_pkg_level_ctl = + !!(eax & CPUID6_EAX_HWP_Package_Level_Request); + feature_hwp_peci = !!(eax & CPUID6_EAX_HWP_PECI); + + hwp_verbose("HWP: %d notify: %d act_window: %d energy_perf: %d pkg_level: %d peci: %d\n", + feature_hwp, feature_hwp_notification, + feature_hwp_activity_window, feature_hwp_energy_perf, + feature_hwp_pkg_level_ctl, feature_hwp_peci); + + if ( !feature_hwp ) + { + 
hwp_verbose("Hardware does not support HWP\n"); + + return false; + } + + if ( boot_cpu_data.cpuid_level < 0x16 ) + { + hwp_info("HWP disabled: cpuid_level %x < 0x16 lacks CPU freq info\n", + boot_cpu_data.cpuid_level); + + return false; + } + + hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n", + eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not "); + if ( eax & CPUID6_EAX_FAST_HWP_MSR ) + { + if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) ) + hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n"); + + hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val); + if (val & FAST_IA32_HWP_REQUEST ) + { + hwp_verbose("HWP: FAST_IA32_HWP_REQUEST MSR available\n"); + feature_fast_msr = true; + } + } + + feature_hdc = !!(eax & CPUID6_EAX_HDC); + + hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported\n", + feature_hdc ? "" : "not "); + + hwp_verbose("HWP: HW_FEEDBACK %ssupported\n", + (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not "); + + use_hwp = feature_hwp && opt_hwp; + cpufreq_governor_internal = use_hwp; + + if ( use_hwp ) + hwp_info("Using HWP for cpufreq\n"); + + return use_hwp; +} + +static void hdc_set_pkg_hdc_ctl(bool val) +{ + uint64_t msr; + + if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) ) + { + hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n"); + + return; + } + + msr = val ? IA32_PKG_HDC_CTL_HDC_PKG_Enable : 0; + + if ( wrmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) ) + hwp_err("error wrmsr_safe(MSR_IA32_PKG_HDC_CTL): %016lx\n", msr); +} + +static void hdc_set_pm_ctl1(bool val) +{ + uint64_t msr; + + if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) ) + { + hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n"); + + return; + } + + msr = val ? 
IA32_PM_CTL1_HDC_Allow_Block : 0; + + if ( wrmsr_safe(MSR_IA32_PM_CTL1, msr) ) + hwp_err("error wrmsr_safe(MSR_IA32_PM_CTL1): %016lx\n", msr); +} + +static void hwp_fast_uncore_msrs_ctl(bool val) +{ + uint64_t msr; + + if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) ) + hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL)\n"); + + msr = val; + + if ( wrmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) ) + hwp_err("error wrmsr_safe(MSR_FAST_UNCORE_MSRS_CTL): %016lx\n", msr); +} + +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy) +{ + uint32_t base_khz, max_khz, bus_khz, edx; + + cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx); + + /* aperf/mperf scales base. */ + policy->cpuinfo.perf_freq = base_khz * 1000; + policy->cpuinfo.min_freq = base_khz * 1000; + policy->cpuinfo.max_freq = max_khz * 1000; + policy->min = base_khz * 1000; + policy->max = max_khz * 1000; + policy->cur = 0; +} + +static void hwp_read_capabilities(void *info) +{ + struct cpufreq_policy *policy = info; + struct hwp_drv_data *data = hwp_drv_data[policy->cpu]; + + if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) ) + { + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n", + policy->cpu); + + return; + } + + if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) ) + { + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", policy->cpu); + + return; + } +} + +static void hwp_init_msrs(void *info) +{ + struct cpufreq_policy *policy = info; + uint64_t val; + + /* Package level MSR, but we don't have a good idea of packages here, so + * just do it everytime. 
*/ + if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) ) + { + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", policy->cpu); + + return; + } + + hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val); + if ( val != IA32_PM_ENABLE_HWP_ENABLE ) + { + val = IA32_PM_ENABLE_HWP_ENABLE; + if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) ) + hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n", + policy->cpu, val); + } + + hwp_read_capabilities(info); + + /* Check for APERF/MPERF support in hardware + * also check for boost/turbo support */ + intel_feature_detect(policy); + + if ( feature_hdc ) + { + hdc_set_pkg_hdc_ctl(true); + hdc_set_pm_ctl1(true); + } + + if ( feature_fast_msr ) + hwp_fast_uncore_msrs_ctl(true); + + hwp_get_cpu_speeds(policy); +} + +static int hwp_cpufreq_verify(struct cpufreq_policy *policy) +{ + unsigned int cpu = policy->cpu; + struct hwp_drv_data *data = hwp_drv_data[cpu]; + + if ( !feature_hwp_energy_perf && data->energy_perf ) + { + if ( data->energy_perf > 15 ) + { + hwp_err("energy_perf %d exceeds IA32_ENERGY_PERF_BIAS range 0-15\n", + data->energy_perf); + + return -EINVAL; + } + } + + if ( !feature_hwp_activity_window && data->activity_window ) + { + hwp_err("HWP activity window not supported.\n"); + + return -EINVAL; + } + + return 0; +} + +/* val 0 - highest performance, 15 - maximum energy savings */ +static void hwp_energy_perf_bias(void *info) +{ + uint64_t msr; + struct hwp_drv_data *data = info; + uint8_t val = data->energy_perf; + + ASSERT(val <= 15); + + if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) ) + { + hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n"); + + return; + } + + msr &= ~(0xf); + msr |= val; + + if ( wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) ) + hwp_err("error wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS): %016lx\n", msr); +} + +static void hwp_write_request(void *info) +{ + struct cpufreq_policy *policy = info; + struct hwp_drv_data *data = hwp_drv_data[policy->cpu]; + union hwp_request hwp_req = 
data->curr_req; + + BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t)); + if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) ) + { + hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n", + policy->cpu, hwp_req.raw); + rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw); + } +} + +static int hwp_cpufreq_target(struct cpufreq_policy *policy, + unsigned int target_freq, unsigned int relation) +{ + unsigned int cpu = policy->cpu; + struct hwp_drv_data *data = hwp_drv_data[cpu]; + union hwp_request hwp_req; + + /* Zero everything to ensure reserved bits are zero... */ + hwp_req.raw = 0; + /* .. and update from there */ + hwp_req.min_perf = data->minimum; + hwp_req.max_perf = data->maximum; + hwp_req.desired = data->desired; + if ( feature_hwp_energy_perf ) + hwp_req.energy_perf = data->energy_perf; + if ( feature_hwp_activity_window ) + hwp_req.activity_window = data->activity_window; + + if ( hwp_req.raw == data->curr_req.raw ) + return 0; + + data->curr_req.raw = hwp_req.raw; + + hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw); + on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1); + + if ( !feature_hwp_energy_perf && data->energy_perf ) + { + on_selected_cpus(cpumask_of(cpu), hwp_energy_perf_bias, + data, 1); + } + + return 0; +} + +static int hwp_cpufreq_cpu_init(struct cpufreq_policy *policy) +{ + unsigned int cpu = policy->cpu; + struct hwp_drv_data *data; + + if ( cpufreq_opt_governor ) + { + printk(XENLOG_WARNING + "HWP: governor \"%s\" is incompatible with hwp. 
Using default \"%s\"\n", + cpufreq_opt_governor->name, hwp_cpufreq_governor.name); + } + policy->governor = &hwp_cpufreq_governor; + + data = xzalloc(typeof(*data)); + if ( !data ) + return -ENOMEM; + + hwp_drv_data[cpu] = data; + + on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1); + + data->minimum = data->hw_lowest; + data->maximum = data->hw_highest; + data->desired = 0; /* default to HW autonomous */ + if ( feature_hwp_energy_perf ) + data->energy_perf = 0x80; + else + data->energy_perf = 7; + + hwp_verbose("CPU%u: IA32_HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps); + + hwp_verbose("CPU%u: rdmsr HWP_REQUEST %016lx\n", cpu, data->curr_req.raw); + + return 0; +} + +static int hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy) +{ + unsigned int cpu = policy->cpu; + + xfree(hwp_drv_data[cpu]); + hwp_drv_data[cpu] = NULL; + + return 0; +} + +/* The SDM reads like turbo should be disabled with MSR_IA32_PERF_CTL and + * PERF_CTL_TURBO_DISENGAGE, but that does not seem to actually work, at least + * with my HWP testing. MSR_IA32_MISC_ENABLE and MISC_ENABLE_TURBO_DISENGAGE + * is what Linux uses and seems to work. 
*/ +static void hwp_set_misc_turbo(void *info) +{ + struct cpufreq_policy *policy = info; + uint64_t msr; + + if ( rdmsr_safe(MSR_IA32_MISC_ENABLE, msr) ) + { + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_MISC_ENABLE)\n", policy->cpu); + + return; + } + + if ( policy->turbo == CPUFREQ_TURBO_ENABLED ) + msr &= ~MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE; + else + msr |= MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE; + + if ( wrmsr_safe(MSR_IA32_MISC_ENABLE, msr) ) + hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_MISC_ENABLE): %016lx\n", + policy->cpu, msr); +} + +static int hwp_cpufreq_update(int cpuid, struct cpufreq_policy *policy) +{ + on_selected_cpus(cpumask_of(cpuid), hwp_set_misc_turbo, policy, 1); + + return 0; +} + +static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver = +{ + .name = "hwp-cpufreq", + .verify = hwp_cpufreq_verify, + .target = hwp_cpufreq_target, + .init = hwp_cpufreq_cpu_init, + .exit = hwp_cpufreq_cpu_exit, + .update = hwp_cpufreq_update, +}; + +int hwp_register_driver(void) +{ + int ret; + + ret = cpufreq_register_driver(&hwp_cpufreq_driver); + + return ret; +} diff --git a/xen/include/acpi/cpufreq/processor_perf.h b/xen/include/acpi/cpufreq/processor_perf.h index e2c08f0e6d..2e67e667e0 100644 --- a/xen/include/acpi/cpufreq/processor_perf.h +++ b/xen/include/acpi/cpufreq/processor_perf.h @@ -9,6 +9,9 @@ void intel_feature_detect(void *info); +bool hwp_available(void); +int hwp_register_driver(void); + int powernow_cpufreq_init(void); unsigned int powernow_register_driver(void); unsigned int get_measured_perf(unsigned int cpu, unsigned int flag); diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h index 33b2257888..1900c90f90 100644 --- a/xen/include/asm-x86/cpufeature.h +++ b/xen/include/asm-x86/cpufeature.h @@ -26,7 +26,16 @@ #define CPUID5_ECX_EXTENSIONS_SUPPORTED 0x1 #define CPUID5_ECX_INTERRUPT_BREAK 0x2 -#define CPUID_PM_LEAF 6 +#define CPUID_PM_LEAF 6 +#define CPUID6_EAX_HWP (_AC(1, U) << 7) +#define 
CPUID6_EAX_HWP_Notification (_AC(1, U) << 8) +#define CPUID6_EAX_HWP_Activity_Window (_AC(1, U) << 9) +#define CPUID6_EAX_HWP_Energy_Performance_Preference (_AC(1, U) << 10) +#define CPUID6_EAX_HWP_Package_Level_Request (_AC(1, U) << 11) +#define CPUID6_EAX_HDC (_AC(1, U) << 13) +#define CPUID6_EAX_HWP_PECI (_AC(1, U) << 16) +#define CPUID6_EAX_FAST_HWP_MSR (_AC(1, U) << 18) +#define CPUID6_EAX_HW_FEEDBACK (_AC(1, U) << 19) #define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1 /* CPUID level 0x00000001.edx */ diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h index bd3a3a1e7f..b8f712e1a3 100644 --- a/xen/include/asm-x86/msr-index.h +++ b/xen/include/asm-x86/msr-index.h @@ -101,6 +101,12 @@ #define MSR_RTIT_ADDR_A(n) (0x00000580 + (n) * 2) #define MSR_RTIT_ADDR_B(n) (0x00000581 + (n) * 2) +#define MSR_FAST_UNCORE_MSRS_CTL 0x00000657 +#define FAST_IA32_HWP_REQUEST_MSR_ENABLE (_AC(1, ULL) << 0) + +#define MSR_FAST_UNCORE_MSRS_CAPABILITY 0x0000065f +#define FAST_IA32_HWP_REQUEST (_AC(1, ULL) << 0) + #define MSR_U_CET 0x000006a0 #define MSR_S_CET 0x000006a2 #define CET_SHSTK_EN (_AC(1, ULL) << 0) @@ -112,10 +118,20 @@ #define MSR_PL3_SSP 0x000006a7 #define MSR_INTERRUPT_SSP_TABLE 0x000006a8 +#define MSR_IA32_PM_ENABLE 0x00000770 +#define IA32_PM_ENABLE_HWP_ENABLE (_AC(1, ULL) << 0) +#define MSR_IA32_HWP_CAPABILITIES 0x00000771 +#define MSR_IA32_HWP_REQUEST 0x00000774 + #define MSR_PASID 0x00000d93 #define PASID_PASID_MASK 0x000fffff #define PASID_VALID (_AC(1, ULL) << 31) +#define MSR_IA32_PKG_HDC_CTL 0x00000db0 +#define IA32_PKG_HDC_CTL_HDC_PKG_Enable (_AC(1, ULL) << 0) +#define MSR_IA32_PM_CTL1 0x00000db1 +#define IA32_PM_CTL1_HDC_Allow_Block (_AC(1, ULL) << 0) + #define MSR_K8_VM_CR 0xc0010114 #define VM_CR_INIT_REDIRECTION (_AC(1, ULL) << 1) #define VM_CR_SVM_DISABLE (_AC(1, ULL) << 4) @@ -460,6 +476,7 @@ #define MSR_IA32_MISC_ENABLE_LIMIT_CPUID (1<<22) #define MSR_IA32_MISC_ENABLE_XTPR_DISABLE (1<<23) #define MSR_IA32_MISC_ENABLE_XD_DISABLE (1ULL 
<< 34) +#define MSR_IA32_MISC_ENABLE_TURBO_DISENGAGE (1ULL << 38) #define MSR_IA32_TSC_DEADLINE 0x000006E0 #define MSR_IA32_ENERGY_PERF_BIAS 0x000001b0
From the Intel SDM: "Hardware-Controlled Performance States (HWP),
which autonomously selects performance states while utilizing OS
supplied performance guidance hints."

Enable HWP to run in autonomous mode by poking the correct MSRs.

There is no interface to configure - it hardcodes the default 0x80
(out of 0x0-0xff) energy/performance preference.
xen_sysctl_pm_op/xenpm will be extended to allow configuration in
subsequent patches.

Unscientific powertop measurement of a mostly idle, customized OpenXT
install:
A 10th gen 6-core laptop showed battery discharge drop from ~9.x to
~7.x watts.
An 8th gen 4-core laptop dropped from ~10 to ~9.
Power usage depends on many factors, especially display brightness,
but this does show a power saving in balanced mode when CPU
utilization is low.

HWP isn't compatible with an external governor - it doesn't take
explicit frequency requests. Therefore a minimal internal governor,
hwp-internal, is also added as a placeholder.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
We disable on cpuid_level < 0x16. cpuid(0x16) is used to get the cpu
frequencies for calculating the APERF/MPERF. Without it, things would
still work, but the average CPU frequency output would be wrong.

If HWP Energy_Performance_Preference isn't supported, the code falls
back to IA32_ENERGY_PERF_BIAS. Right now, we don't check
CPUID.06H:ECX.SETBH[bit 3] before using that MSR. The SDM reads like
it'll be available, and I assume it was available by the time Skylake
introduced HWP.

My 8th & 10th gen test systems both report:
(XEN) HWP: 1 notify: 1 act_window: 1 energy_perf: 1 pkg_level: 0 peci: 0
(XEN) HWP: FAST_IA32_HWP_REQUEST not supported
(XEN) HWP: Hardware Duty Cycling (HDC) supported
(XEN) HWP: HW_FEEDBACK not supported

So FAST_IA32_HWP_REQUEST and IA32_ENERGY_PERF_BIAS have not been
tested.
--- docs/misc/xen-command-line.pandoc | 9 + xen/arch/x86/acpi/cpufreq/Makefile | 1 + xen/arch/x86/acpi/cpufreq/cpufreq.c | 9 +- xen/arch/x86/acpi/cpufreq/hwp.c | 533 ++++++++++++++++++++++ xen/include/acpi/cpufreq/processor_perf.h | 3 + xen/include/asm-x86/cpufeature.h | 11 +- xen/include/asm-x86/msr-index.h | 17 + 7 files changed, 579 insertions(+), 4 deletions(-) create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c