Message ID | e92b6c1a-b2c3-13e7-116c-4772c851dd0b@suse.com
---|---
State | Superseded
Series | x86emul: further work
On 05/05/2020 09:20, Jan Beulich wrote:
> If the hardware can handle accesses, we should allow it to do so. This
> way we can expose EFRO to HVM guests,

I'm reminded now of a conversation I'm sure we've had before, although I
have a feeling it was on IRC.

APERF/MPERF (including the EFRO interface on AMD) are free-running
counters, but only in C0. The raw values are not synchronised across
threads.

A vCPU which gets rescheduled has a 50% chance of finding one or both
values going backwards, and a 100% chance of a totally bogus calculation.

There is no point exposing APERF/MPERF to guests. It can only be used
safely in hypervisor context: on AMD hardware within a CLGI/STGI region,
or on Intel hardware in an NMI handler, if you trust that a machine check
isn't going to ruin your day.

VMs have no way of meeting the sampling requirements, and have a fair
chance of getting a plausible-but-wrong answer.

The only way to do this safely is on a VM which is pinned to a pCPU for
its lifetime, but even then I'm unconvinced of the correctness.

I don't think we should be exposing this functionality to guests at all,
although I might be persuaded if someone wanting to use it in a VM can
provide a concrete justification of why the above problems won't get in
their way.

~Andrew
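For readers unfamiliar with the calculation being discussed: the usual
effective-frequency computation scales a known base (P0/TSC) frequency by
the ratio of APERF and MPERF deltas, where both samples of the pair must
be taken on the same logical CPU with nothing scheduled in between. Below
is a minimal sketch of the arithmetic only (illustrative, not from the
patch); the sampling itself is exactly the part a vCPU cannot do reliably.

```c
#include <stdint.h>

/*
 * Illustrative sketch of the standard APERF/MPERF frequency calculation.
 * aperf0/mperf0 and aperf1/mperf1 are two samples of the counter pair,
 * which must come from the same logical CPU with no rescheduling between
 * them -- the property a vCPU cannot guarantee for itself.
 */
static uint64_t effective_khz(uint64_t aperf0, uint64_t mperf0,
                              uint64_t aperf1, uint64_t mperf1,
                              uint64_t base_khz)
{
    uint64_t da = aperf1 - aperf0; /* actual cycles accumulated in C0 */
    uint64_t dm = mperf1 - mperf0; /* reference cycles accumulated in C0 */

    if ( !dm )
        return 0;

    /* Effective frequency = base frequency scaled by dAPERF/dMPERF. */
    return base_khz * da / dm;
}
```

If either sample comes from a different thread, or one counter wraps or
goes "backwards" after a migration, the quotient is meaningless, which is
the objection being raised here.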
On 08.05.2020 23:04, Andrew Cooper wrote:
> On 05/05/2020 09:20, Jan Beulich wrote:
>> If the hardware can handle accesses, we should allow it to do so. This
>> way we can expose EFRO to HVM guests,
>
> I'm reminded now of a conversation I'm sure we've had before, although I
> have a feeling it was on IRC.
>
> APERF/MPERF (including the EFRO interface on AMD) are free-running
> counters, but only in C0. The raw values are not synchronised across
> threads.
>
> A vCPU which gets rescheduled has a 50% chance of finding one or both
> values going backwards, and a 100% chance of a totally bogus calculation.
>
> There is no point exposing APERF/MPERF to guests. It can only be used
> safely in hypervisor context: on AMD hardware within a CLGI/STGI region,
> or on Intel hardware in an NMI handler, if you trust that a machine check
> isn't going to ruin your day.
>
> VMs have no way of meeting the sampling requirements, and have a fair
> chance of getting a plausible-but-wrong answer.
>
> The only way to do this safely is on a VM which is pinned to a pCPU for
> its lifetime, but even then I'm unconvinced of the correctness.
>
> I don't think we should be exposing this functionality to guests at all,
> although I might be persuaded if someone wanting to use it in a VM can
> provide a concrete justification of why the above problems won't get in
> their way.

Am I getting it right then that here you're reverting what you said on
patch 10: "I'm tempted to suggest that we offer EFRO on Intel ..."? And
hence your request is to drop both that patch and this one?

Jan
On 13.05.2020 15:35, Jan Beulich wrote:
> On 08.05.2020 23:04, Andrew Cooper wrote:
>> On 05/05/2020 09:20, Jan Beulich wrote:
>>> If the hardware can handle accesses, we should allow it to do so. This
>>> way we can expose EFRO to HVM guests,
>>
>> I'm reminded now of a conversation I'm sure we've had before, although
>> I have a feeling it was on IRC.
>>
>> APERF/MPERF (including the EFRO interface on AMD) are free-running
>> counters, but only in C0. The raw values are not synchronised across
>> threads.
>>
>> A vCPU which gets rescheduled has a 50% chance of finding one or both
>> values going backwards, and a 100% chance of a totally bogus
>> calculation.
>>
>> There is no point exposing APERF/MPERF to guests. It can only be used
>> safely in hypervisor context: on AMD hardware within a CLGI/STGI
>> region, or on Intel hardware in an NMI handler, if you trust that a
>> machine check isn't going to ruin your day.
>>
>> VMs have no way of meeting the sampling requirements, and have a fair
>> chance of getting a plausible-but-wrong answer.
>>
>> The only way to do this safely is on a VM which is pinned to a pCPU
>> for its lifetime, but even then I'm unconvinced of the correctness.
>>
>> I don't think we should be exposing this functionality to guests at
>> all, although I might be persuaded if someone wanting to use it in a
>> VM can provide a concrete justification of why the above problems
>> won't get in their way.
>
> Am I getting it right then that here you're reverting what you said on
> patch 10: "I'm tempted to suggest that we offer EFRO on Intel ..."? And
> hence your request is to drop both that patch and this one?

Which in turn would then mean also dropping patch 11. Not supporting
RDPRU because of the APERF/MPERF peculiarities means we also won't be
able to support it once other leaves get defined, as its specification
seems to only halfway allow for supporting higher leaves but not lower
ones (by making the lower ones return zero, which callers of the two
initially defined leaves may not expect).

Since I don't see us settling on this very soon, I guess I'll re-submit
the series with the last three patches left out.

Jan
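For reference on the RDPRU leaf numbering Jan mentions: the instruction
reads a processor register selected by ECX and returns it in EDX:EAX,
with leaf 0 being MPERF and leaf 1 being APERF per AMD's documentation.
A rough guest-side sketch follows (illustrative only; the opcode bytes
are hard-coded since an assembler may not know the mnemonic).

```c
#include <stdint.h>

/*
 * RDPRU: read the processor register selected by ECX into EDX:EAX.
 * Leaf 0 is MPERF, leaf 1 is APERF; a hypothetical future implementation
 * offering only higher leaves would have to return zero for these two,
 * which existing callers are unlikely to expect -- the concern above.
 */
static inline uint64_t rdpru(uint32_t leaf)
{
    uint32_t lo, hi;

    asm volatile ( ".byte 0x0f, 0x01, 0xfd" /* rdpru */
                   : "=a" (lo), "=d" (hi)
                   : "c" (leaf) );

    return ((uint64_t)hi << 32) | lo;
}
```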
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -595,6 +595,7 @@ static void svm_cpuid_policy_changed(str
     struct vmcb_struct *vmcb = svm->vmcb;
     const struct cpuid_policy *cp = v->domain->arch.cpuid;
     u32 bitmap = vmcb_get_exception_intercepts(vmcb);
+    unsigned int mode;
 
     if ( opt_hvm_fep ||
          (v->domain->arch.cpuid->x86_vendor != boot_cpu_data.x86_vendor) )
@@ -607,6 +608,17 @@ static void svm_cpuid_policy_changed(str
     /* Give access to MSR_PRED_CMD if the guest has been told about it. */
     svm_intercept_msr(v, MSR_PRED_CMD,
                       cp->extd.ibpb ? MSR_INTERCEPT_NONE : MSR_INTERCEPT_RW);
+
+    /* Allow direct reads from APERF/MPERF if permitted by the policy. */
+    mode = cp->basic.raw[6].c & CPUID6_ECX_APERFMPERF_CAPABILITY
+           ? MSR_INTERCEPT_WRITE : MSR_INTERCEPT_RW;
+    svm_intercept_msr(v, MSR_IA32_APERF, mode);
+    svm_intercept_msr(v, MSR_IA32_MPERF, mode);
+
+    /* Allow direct access to their r/o counterparts if permitted. */
+    mode = cp->extd.efro ? MSR_INTERCEPT_NONE : MSR_INTERCEPT_RW;
+    svm_intercept_msr(v, MSR_APERF_RD_ONLY, mode);
+    svm_intercept_msr(v, MSR_MPERF_RD_ONLY, mode);
 }
 
 void svm_sync_vmcb(struct vcpu *v, enum vmcb_sync_state new_state)
@@ -860,7 +872,10 @@ static void svm_set_rdtsc_exiting(struct
     {
         general1_intercepts |= GENERAL1_INTERCEPT_RDTSC;
         general2_intercepts |= GENERAL2_INTERCEPT_RDTSCP;
+        svm_enable_intercept_for_msr(v, MSR_IA32_TSC);
     }
+    else
+        svm_intercept_msr(v, MSR_IA32_TSC, MSR_INTERCEPT_WRITE);
 
     vmcb_set_general1_intercepts(vmcb, general1_intercepts);
     vmcb_set_general2_intercepts(vmcb, general2_intercepts);
--- a/xen/arch/x86/hvm/svm/vmcb.c
+++ b/xen/arch/x86/hvm/svm/vmcb.c
@@ -108,6 +108,7 @@ static int construct_vmcb(struct vcpu *v
     {
         vmcb->_general1_intercepts |= GENERAL1_INTERCEPT_RDTSC;
         vmcb->_general2_intercepts |= GENERAL2_INTERCEPT_RDTSCP;
+        svm_intercept_msr(v, MSR_IA32_TSC, MSR_INTERCEPT_WRITE);
     }
 
     /* Guest segment limits. */
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -1141,8 +1141,13 @@ static int construct_vmcs(struct vcpu *v
         vmx_clear_msr_intercept(v, MSR_IA32_SYSENTER_CS, VMX_MSR_RW);
         vmx_clear_msr_intercept(v, MSR_IA32_SYSENTER_ESP, VMX_MSR_RW);
         vmx_clear_msr_intercept(v, MSR_IA32_SYSENTER_EIP, VMX_MSR_RW);
+
+        if ( !(v->arch.hvm.vmx.exec_control & CPU_BASED_RDTSC_EXITING) )
+            vmx_clear_msr_intercept(v, MSR_IA32_TSC, VMX_MSR_R);
+
         if ( paging_mode_hap(d) && (!is_iommu_enabled(d) || iommu_snoop) )
             vmx_clear_msr_intercept(v, MSR_IA32_CR_PAT, VMX_MSR_RW);
+
         if ( (vmexit_ctl & VM_EXIT_CLEAR_BNDCFGS) &&
              (vmentry_ctl & VM_ENTRY_LOAD_BNDCFGS) )
             vmx_clear_msr_intercept(v, MSR_IA32_BNDCFGS, VMX_MSR_RW);
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -589,6 +589,18 @@ static void vmx_cpuid_policy_changed(str
         vmx_clear_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW);
     else
         vmx_set_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW);
+
+    /* Allow direct reads from APERF/MPERF if permitted by the policy. */
+    if ( cp->basic.raw[6].c & CPUID6_ECX_APERFMPERF_CAPABILITY )
+    {
+        vmx_clear_msr_intercept(v, MSR_IA32_APERF, VMX_MSR_R);
+        vmx_clear_msr_intercept(v, MSR_IA32_MPERF, VMX_MSR_R);
+    }
+    else
+    {
+        vmx_set_msr_intercept(v, MSR_IA32_APERF, VMX_MSR_R);
+        vmx_set_msr_intercept(v, MSR_IA32_MPERF, VMX_MSR_R);
+    }
 }
 
 int vmx_guest_x86_mode(struct vcpu *v)
@@ -1254,7 +1266,12 @@ static void vmx_set_rdtsc_exiting(struct
     vmx_vmcs_enter(v);
     v->arch.hvm.vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
     if ( enable )
+    {
         v->arch.hvm.vmx.exec_control |= CPU_BASED_RDTSC_EXITING;
+        vmx_set_msr_intercept(v, MSR_IA32_TSC, VMX_MSR_R);
+    }
+    else
+        vmx_clear_msr_intercept(v, MSR_IA32_TSC, VMX_MSR_R);
     vmx_update_cpu_exec_control(v);
     vmx_vmcs_exit(v);
 }
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -245,7 +245,7 @@ XEN_CPUFEATURE(ENQCMD,        6*32+29) /
 
 /* AMD-defined CPU features, CPUID level 0x80000007.edx, word 7 */
 XEN_CPUFEATURE(ITSC,          7*32+ 8) /*   Invariant TSC */
-XEN_CPUFEATURE(EFRO,          7*32+10) /*   APERF/MPERF Read Only interface */
+XEN_CPUFEATURE(EFRO,          7*32+10) /*S  APERF/MPERF Read Only interface */
 
 /* AMD-defined CPU features, CPUID level 0x80000008.ebx, word 8 */
 XEN_CPUFEATURE(CLZERO,        8*32+ 0) /*A  CLZERO instruction */
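For completeness, the guest-side effect of the intercept changes above is
that plain RDMSR of the counters no longer vmexits, while writes still
trap (MSR_INTERCEPT_WRITE on SVM, clearing only VMX_MSR_R on VMX). A
minimal sketch of such a read, assuming the architectural indices for
APERF/MPERF and AMD's read-only variants (the EFRO interface); the MSR
numbers below are taken from the respective vendor documentation and are
given here only for illustration.

```c
#include <stdint.h>

#define MSR_IA32_MPERF     0x000000e7
#define MSR_IA32_APERF     0x000000e8
#define MSR_MPERF_RD_ONLY  0xc00000e7  /* AMD read-only counter (EFRO) */
#define MSR_APERF_RD_ONLY  0xc00000e8  /* AMD read-only counter (EFRO) */

/* Ring-0 only: read a 64-bit MSR. */
static inline uint64_t rdmsr64(uint32_t idx)
{
    uint32_t lo, hi;

    asm volatile ( "rdmsr" : "=a" (lo), "=d" (hi) : "c" (idx) );

    return ((uint64_t)hi << 32) | lo;
}

/*
 * Nothing prevents the scheduler from migrating the vCPU between two
 * such reads, which is the objection raised in the thread: the resulting
 * deltas may mix values from different physical CPUs.
 */
```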