diff mbox series

[v8,12/12] x86/HVM: don't needlessly intercept APERF/MPERF/TSC MSR reads

Message ID e92b6c1a-b2c3-13e7-116c-4772c851dd0b@suse.com (mailing list archive)
State Superseded
Headers show
Series x86emul: further work | expand

Commit Message

Jan Beulich May 5, 2020, 8:20 a.m. UTC
If the hardware can handle accesses, we should allow it to do so. This
way we can expose EFRO to HVM guests, and "all" that's left for exposing
APERF/MPERF is to figure out how to handle writes to these MSRs. (Note
that the leaf 6 guest CPUID checks will evaluate to false for now, as
recalculate_misc() zaps the entire leaf.)

For TSC the intercepts are made mirror the RDTSC ones.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v4: Make TSC intercepts mirror RDTSC ones. Re-base.
v3: New.

Comments

Andrew Cooper May 8, 2020, 9:04 p.m. UTC | #1
On 05/05/2020 09:20, Jan Beulich wrote:
> If the hardware can handle accesses, we should allow it to do so. This
> way we can expose EFRO to HVM guests,

I'm reminded now of the conversation I'm sure we've had before, although
I have a feeling it was on IRC.

APERF/MPERF (including the EFRO interface on AMD) are free running
counters but only in C0.  The raw values are not synchronised across
threads.

A vCPU which gets rescheduled has a 50% chance of finding the one or
both values going backwards, and a 100% chance of totally bogus calculation.

There is no point exposing APERF/MPERF to guests.  It can only be used
safely in hypervisor context, on AMD hardware with a CLGI/STGI region,
or on Intel hardware in an NMI handler if you trust that a machine check
isn't going to ruin your day.

VMs have no way of achieving the sampling requirements, and has a fair
chance of getting a plausible-but-wrong answer.

The only possibility to do this safely is on a VM which is pinned to
pCPU for its lifetime, but even I'm unconvinced of the correctness.

I don't think we should be exposing this functionality to guests at all,
although I might be persuaded if someone wanting to use it in a VM can
provide a concrete justification of why the above problems won't get in
their way.

~Andrew
Jan Beulich May 13, 2020, 1:35 p.m. UTC | #2
On 08.05.2020 23:04, Andrew Cooper wrote:
> On 05/05/2020 09:20, Jan Beulich wrote:
>> If the hardware can handle accesses, we should allow it to do so. This
>> way we can expose EFRO to HVM guests,
> 
> I'm reminded now of the conversation I'm sure we've had before, although
> I have a feeling it was on IRC.
> 
> APERF/MPERF (including the EFRO interface on AMD) are free running
> counters but only in C0.  The raw values are not synchronised across
> threads.
> 
> A vCPU which gets rescheduled has a 50% chance of finding the one or
> both values going backwards, and a 100% chance of totally bogus calculation.
> 
> There is no point exposing APERF/MPERF to guests.  It can only be used
> safely in hypervisor context, on AMD hardware with a CLGI/STGI region,
> or on Intel hardware in an NMI handler if you trust that a machine check
> isn't going to ruin your day.
> 
> VMs have no way of achieving the sampling requirements, and has a fair
> chance of getting a plausible-but-wrong answer.
> 
> The only possibility to do this safely is on a VM which is pinned to
> pCPU for its lifetime, but even I'm unconvinced of the correctness.
> 
> I don't think we should be exposing this functionality to guests at all,
> although I might be persuaded if someone wanting to use it in a VM can
> provide a concrete justification of why the above problems won't get in
> their way.

Am I getting it right then that here you're reverting what you've said
on patch 10: "I'm tempted to suggest that we offer EFRO on Intel ..."?
And hence your request is to drop both that and this patch?

Jan
Jan Beulich May 14, 2020, 8:52 a.m. UTC | #3
On 13.05.2020 15:35, Jan Beulich wrote:
> On 08.05.2020 23:04, Andrew Cooper wrote:
>> On 05/05/2020 09:20, Jan Beulich wrote:
>>> If the hardware can handle accesses, we should allow it to do so. This
>>> way we can expose EFRO to HVM guests,
>>
>> I'm reminded now of the conversation I'm sure we've had before, although
>> I have a feeling it was on IRC.
>>
>> APERF/MPERF (including the EFRO interface on AMD) are free running
>> counters but only in C0.  The raw values are not synchronised across
>> threads.
>>
>> A vCPU which gets rescheduled has a 50% chance of finding the one or
>> both values going backwards, and a 100% chance of totally bogus calculation.
>>
>> There is no point exposing APERF/MPERF to guests.  It can only be used
>> safely in hypervisor context, on AMD hardware with a CLGI/STGI region,
>> or on Intel hardware in an NMI handler if you trust that a machine check
>> isn't going to ruin your day.
>>
>> VMs have no way of achieving the sampling requirements, and has a fair
>> chance of getting a plausible-but-wrong answer.
>>
>> The only possibility to do this safely is on a VM which is pinned to
>> pCPU for its lifetime, but even I'm unconvinced of the correctness.
>>
>> I don't think we should be exposing this functionality to guests at all,
>> although I might be persuaded if someone wanting to use it in a VM can
>> provide a concrete justification of why the above problems won't get in
>> their way.
> 
> Am I getting it right then that here you're reverting what you've said
> on patch 10: "I'm tempted to suggest that we offer EFRO on Intel ..."?
> And hence your request is to drop both that and this patch?

Which in turn would then mean also dropping patch 11. Not
supporting RDPRU because of the APERF/MPERF peculiarities
means we also won't be able to support it once other leaves
get defined, as its specification seems to only halfway
allow for supporting higher leaves but not lower ones (by
making the lower ones return zero, which for the two
initially defined leaves may not be expected by the caller).

Since I don't see us settle on this very soon, I guess I'll
re-submit the series with the last three patches left out.

Jan
diff mbox series

Patch

--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -595,6 +595,7 @@  static void svm_cpuid_policy_changed(str
     struct vmcb_struct *vmcb = svm->vmcb;
     const struct cpuid_policy *cp = v->domain->arch.cpuid;
     u32 bitmap = vmcb_get_exception_intercepts(vmcb);
+    unsigned int mode;
 
     if ( opt_hvm_fep ||
          (v->domain->arch.cpuid->x86_vendor != boot_cpu_data.x86_vendor) )
@@ -607,6 +608,17 @@  static void svm_cpuid_policy_changed(str
     /* Give access to MSR_PRED_CMD if the guest has been told about it. */
     svm_intercept_msr(v, MSR_PRED_CMD,
                       cp->extd.ibpb ? MSR_INTERCEPT_NONE : MSR_INTERCEPT_RW);
+
+    /* Allow direct reads from APERF/MPERF if permitted by the policy. */
+    mode = cp->basic.raw[6].c & CPUID6_ECX_APERFMPERF_CAPABILITY
+           ? MSR_INTERCEPT_WRITE : MSR_INTERCEPT_RW;
+    svm_intercept_msr(v, MSR_IA32_APERF, mode);
+    svm_intercept_msr(v, MSR_IA32_MPERF, mode);
+
+    /* Allow direct access to their r/o counterparts if permitted. */
+    mode = cp->extd.efro ? MSR_INTERCEPT_NONE : MSR_INTERCEPT_RW;
+    svm_intercept_msr(v, MSR_APERF_RD_ONLY, mode);
+    svm_intercept_msr(v, MSR_MPERF_RD_ONLY, mode);
 }
 
 void svm_sync_vmcb(struct vcpu *v, enum vmcb_sync_state new_state)
@@ -860,7 +872,10 @@  static void svm_set_rdtsc_exiting(struct
     {
         general1_intercepts |= GENERAL1_INTERCEPT_RDTSC;
         general2_intercepts |= GENERAL2_INTERCEPT_RDTSCP;
+        svm_enable_intercept_for_msr(v, MSR_IA32_TSC);
     }
+    else
+        svm_intercept_msr(v, MSR_IA32_TSC, MSR_INTERCEPT_WRITE);
 
     vmcb_set_general1_intercepts(vmcb, general1_intercepts);
     vmcb_set_general2_intercepts(vmcb, general2_intercepts);
--- a/xen/arch/x86/hvm/svm/vmcb.c
+++ b/xen/arch/x86/hvm/svm/vmcb.c
@@ -108,6 +108,7 @@  static int construct_vmcb(struct vcpu *v
     {
         vmcb->_general1_intercepts |= GENERAL1_INTERCEPT_RDTSC;
         vmcb->_general2_intercepts |= GENERAL2_INTERCEPT_RDTSCP;
+        svm_intercept_msr(v, MSR_IA32_TSC, MSR_INTERCEPT_WRITE);
     }
 
     /* Guest segment limits. */
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -1141,8 +1141,13 @@  static int construct_vmcs(struct vcpu *v
         vmx_clear_msr_intercept(v, MSR_IA32_SYSENTER_CS, VMX_MSR_RW);
         vmx_clear_msr_intercept(v, MSR_IA32_SYSENTER_ESP, VMX_MSR_RW);
         vmx_clear_msr_intercept(v, MSR_IA32_SYSENTER_EIP, VMX_MSR_RW);
+
+        if ( !(v->arch.hvm.vmx.exec_control & CPU_BASED_RDTSC_EXITING) )
+            vmx_clear_msr_intercept(v, MSR_IA32_TSC, VMX_MSR_R);
+
         if ( paging_mode_hap(d) && (!is_iommu_enabled(d) || iommu_snoop) )
             vmx_clear_msr_intercept(v, MSR_IA32_CR_PAT, VMX_MSR_RW);
+
         if ( (vmexit_ctl & VM_EXIT_CLEAR_BNDCFGS) &&
              (vmentry_ctl & VM_ENTRY_LOAD_BNDCFGS) )
             vmx_clear_msr_intercept(v, MSR_IA32_BNDCFGS, VMX_MSR_RW);
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -589,6 +589,18 @@  static void vmx_cpuid_policy_changed(str
         vmx_clear_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW);
     else
         vmx_set_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW);
+
+    /* Allow direct reads from APERF/MPERF if permitted by the policy. */
+    if ( cp->basic.raw[6].c & CPUID6_ECX_APERFMPERF_CAPABILITY )
+    {
+        vmx_clear_msr_intercept(v, MSR_IA32_APERF, VMX_MSR_R);
+        vmx_clear_msr_intercept(v, MSR_IA32_MPERF, VMX_MSR_R);
+    }
+    else
+    {
+        vmx_set_msr_intercept(v, MSR_IA32_APERF, VMX_MSR_R);
+        vmx_set_msr_intercept(v, MSR_IA32_MPERF, VMX_MSR_R);
+    }
 }
 
 int vmx_guest_x86_mode(struct vcpu *v)
@@ -1254,7 +1266,12 @@  static void vmx_set_rdtsc_exiting(struct
     vmx_vmcs_enter(v);
     v->arch.hvm.vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
     if ( enable )
+    {
         v->arch.hvm.vmx.exec_control |= CPU_BASED_RDTSC_EXITING;
+        vmx_set_msr_intercept(v, MSR_IA32_TSC, VMX_MSR_R);
+    }
+    else
+        vmx_clear_msr_intercept(v, MSR_IA32_TSC, VMX_MSR_R);
     vmx_update_cpu_exec_control(v);
     vmx_vmcs_exit(v);
 }
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -245,7 +245,7 @@  XEN_CPUFEATURE(ENQCMD,        6*32+29) /
 
 /* AMD-defined CPU features, CPUID level 0x80000007.edx, word 7 */
 XEN_CPUFEATURE(ITSC,          7*32+ 8) /*   Invariant TSC */
-XEN_CPUFEATURE(EFRO,          7*32+10) /*   APERF/MPERF Read Only interface */
+XEN_CPUFEATURE(EFRO,          7*32+10) /*S  APERF/MPERF Read Only interface */
 
 /* AMD-defined CPU features, CPUID level 0x80000008.ebx, word 8 */
 XEN_CPUFEATURE(CLZERO,        8*32+ 0) /*A  CLZERO instruction */