diff mbox series

[v8,10/12] x86/HVM: scale MPERF values reported to guests (on AMD)

Message ID 5da4ed2e-8eb8-0b18-3c1f-9d419371c08a@suse.com (mailing list archive)
State Superseded
Headers show
Series x86emul: further work | expand

Commit Message

Jan Beulich May 5, 2020, 8:18 a.m. UTC
AMD's PM specifies that MPERF (and its r/o counterpart) reads are
affected by the TSC ratio. Hence when processing such reads in software
we too should scale the values. While we don't currently (yet) expose
the underlying feature flags, besides us allowing the MSRs to be read
nevertheless, RDPRU is going to expose the values even to user space.

Furthermore, due to the not exposed feature flags, this change has the
effect of making properly inaccessible (for reads) the two MSRs.

Note that writes to MPERF (and APERF) continue to be unsupported.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
---
I did consider whether to put the code in guest_rdmsr() instead, but
decided that it's better to have it next to TSC handling.

Comments

Andrew Cooper May 8, 2020, 8:32 p.m. UTC | #1
On 05/05/2020 09:18, Jan Beulich wrote:
> AMD's PM specifies that MPERF (and its r/o counterpart) reads are
> affected by the TSC ratio. Hence when processing such reads in software
> we too should scale the values. While we don't currently (yet) expose
> the underlying feature flags, besides us allowing the MSRs to be read
> nevertheless, RDPRU is going to expose the values even to user space.
>
> Furthermore, due to the not exposed feature flags, this change has the
> effect of making properly inaccessible (for reads) the two MSRs.
>
> Note that writes to MPERF (and APERF) continue to be unsupported.

Linux is now using MPERF/APERF for its frequency-invariant scheduling
logic.  Irritatingly, via its read/write alias rather than its read-only
alias.  Even more irritatingly, Intel's reference algorithm recommends
writing to both, despite this being being far less efficient than (one
of) AMD's (two) algorithm(s) which tells you just to subtract the values
you last sampled.

On the one hand, I'm tempted to suggest that we offer EFRO on Intel and
update Linux to use it.  OTOH, that would VMExit as Intel CPUs don't
understand the EFRO interface.

I can't see any sane way to virtualise the write behaviour for MPERF/APERF.

>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v3: New.
> ---
> I did consider whether to put the code in guest_rdmsr() instead, but
> decided that it's better to have it next to TSC handling.

Please do put it in guest_rdmsr().  This is code hygene just as much as
bool_t or style fixes are.

The relationship to TSC is passing-at-best.

>
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -3478,6 +3478,22 @@ int hvm_msr_read_intercept(unsigned int
>          *msr_content = v->arch.hvm.msr_tsc_adjust;
>          break;
>  
> +    case MSR_MPERF_RD_ONLY:
> +        if ( !d->arch.cpuid->extd.efro )
> +        {
> +            goto gp_fault;
> +
> +    case MSR_IA32_MPERF:
> +            if ( !(d->arch.cpuid->basic.raw[6].c &
> +                   CPUID6_ECX_APERFMPERF_CAPABILITY) )
> +                goto gp_fault;
> +        }
> +        if ( rdmsr_safe(msr, *msr_content) )
> +            goto gp_fault;
> +        if ( d->arch.cpuid->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON) )

I suspect we want to gain amd_like() outside of the emulator.

> +            *msr_content = hvm_get_guest_tsc_fixed(v, *msr_content);
> +        break;
> +
>      case MSR_APIC_BASE:
>          *msr_content = vcpu_vlapic(v)->hw.apic_base_msr;
>          break;
> --- a/xen/include/asm-x86/msr-index.h
> +++ b/xen/include/asm-x86/msr-index.h
> @@ -405,6 +405,9 @@
>  #define MSR_IA32_MPERF			0x000000e7
>  #define MSR_IA32_APERF			0x000000e8
>  
> +#define MSR_MPERF_RD_ONLY		0xc00000e7
> +#define MSR_APERF_RD_ONLY		0xc00000e8

S/RD_ONLY/RO/ ?  No loss of meaning.  Also, above the legacy line please.

~Andrew

> +
>  #define MSR_IA32_THERM_CONTROL		0x0000019a
>  #define MSR_IA32_THERM_INTERRUPT	0x0000019b
>  #define MSR_IA32_THERM_STATUS		0x0000019c
>
diff mbox series

Patch

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3478,6 +3478,22 @@  int hvm_msr_read_intercept(unsigned int
         *msr_content = v->arch.hvm.msr_tsc_adjust;
         break;
 
+    case MSR_MPERF_RD_ONLY:
+        if ( !d->arch.cpuid->extd.efro )
+        {
+            goto gp_fault;
+
+    case MSR_IA32_MPERF:
+            if ( !(d->arch.cpuid->basic.raw[6].c &
+                   CPUID6_ECX_APERFMPERF_CAPABILITY) )
+                goto gp_fault;
+        }
+        if ( rdmsr_safe(msr, *msr_content) )
+            goto gp_fault;
+        if ( d->arch.cpuid->x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON) )
+            *msr_content = hvm_get_guest_tsc_fixed(v, *msr_content);
+        break;
+
     case MSR_APIC_BASE:
         *msr_content = vcpu_vlapic(v)->hw.apic_base_msr;
         break;
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -405,6 +405,9 @@ 
 #define MSR_IA32_MPERF			0x000000e7
 #define MSR_IA32_APERF			0x000000e8
 
+#define MSR_MPERF_RD_ONLY		0xc00000e7
+#define MSR_APERF_RD_ONLY		0xc00000e8
+
 #define MSR_IA32_THERM_CONTROL		0x0000019a
 #define MSR_IA32_THERM_INTERRUPT	0x0000019b
 #define MSR_IA32_THERM_STATUS		0x0000019c