Message ID: 1448060471-14128-1-git-send-email-bp@alien8.de (mailing list archive)
State: New, archived
Hi,

CC'ing qemu-devel.

Am 21.11.2015 um 00:01 schrieb Borislav Petkov:
> From: Borislav Petkov <bp@suse.de>
>
> Software Error Recovery, i.e. SER, is purely an Intel feature and it
> shouldn't be set by default. Enable it only on Intel.

Is this new in 2.5? Otherwise we would probably need compatibility code
in pc*.[ch] for incoming live migration from older versions.

> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  target-i386/cpu.c | 7 -------
>  target-i386/cpu.h | 9 ++++++++-
>  target-i386/kvm.c | 5 +++++
>  3 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 11e5e39a756a..8155ee94fbe1 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2803,13 +2803,6 @@ static void x86_cpu_apic_realize(X86CPU *cpu, Error **errp)
>  }
>  #endif
>
> -
> -#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
> -                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
> -                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
> -#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> -                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> -                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
>  static void x86_cpu_realizefn(DeviceState *dev, Error **errp)
>  {
>      CPUState *cs = CPU(dev);
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index fc4a605d6a29..2605c564239a 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -283,7 +283,7 @@
>  #define MCG_CTL_P (1ULL<<8)   /* MCG_CAP register available */
>  #define MCG_SER_P (1ULL<<24)  /* MCA recovery/new status bits */
>
> -#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
> +#define MCE_CAP_DEF MCG_CTL_P
>  #define MCE_BANKS_DEF 10
>
>  #define MCG_STATUS_RIPV (1ULL<<0)  /* restart ip valid */
> @@ -610,6 +610,13 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
>  #define CPUID_MWAIT_IBE (1U << 1)  /* Interrupts can exit capability */
>  #define CPUID_MWAIT_EMX (1U << 0)  /* enumeration supported */
>
> +#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
> +                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
> +                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
> +#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> +                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> +                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> +
>  #ifndef HYPERV_SPINLOCK_NEVER_RETRY
>  #define HYPERV_SPINLOCK_NEVER_RETRY 0xFFFFFFFF
>  #endif
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 2a9953b2d4b5..082d38d4838d 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -787,8 +787,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
>      if (banks > MCE_BANKS_DEF) {
>          banks = MCE_BANKS_DEF;
>      }
> +
>      mcg_cap &= MCE_CAP_DEF;
>      mcg_cap |= banks;
> +
> +	if (IS_INTEL_CPU(env))
> +		mcg_cap |= MCG_SER_P;

Tabs and missing braces.

> +
>      ret = kvm_vcpu_ioctl(cs, KVM_X86_SETUP_MCE, &mcg_cap);
>      if (ret < 0) {
>          fprintf(stderr, "KVM_X86_SETUP_MCE: %s", strerror(-ret));

Regards,
Andreas
On Sat, Nov 21, 2015 at 12:11:35AM +0100, Andreas Färber wrote:
> Hi,
>
> CC'ing qemu-devel.

Ah, thanks.

> Am 21.11.2015 um 00:01 schrieb Borislav Petkov:
> > From: Borislav Petkov <bp@suse.de>
> >
> > Software Error Recovery, i.e. SER, is purely an Intel feature and it
> > shouldn't be set by default. Enable it only on Intel.
>
> Is this new in 2.5? Otherwise we would probably need compatibility code
> in pc*.[ch] for incoming live migration from older versions.

It looks it is really old, AFAIK from 2010:

  c0532a76b407 ("MCE: Relay UCR MCE to guest")

You'd need to be more verbose about pc*.[ch]. An example perhaps...?

> >      mcg_cap &= MCE_CAP_DEF;
> >      mcg_cap |= banks;
> > +
> > +	if (IS_INTEL_CPU(env))
> > +		mcg_cap |= MCG_SER_P;
>
> Tabs and missing braces.

Ok. Thanks.
On Sat, Nov 21, 2015 at 02:09:25AM +0100, Borislav Petkov wrote:
> On Sat, Nov 21, 2015 at 12:11:35AM +0100, Andreas Färber wrote:
> > Hi,
> >
> > CC'ing qemu-devel.
>
> Ah, thanks.
>
> > Am 21.11.2015 um 00:01 schrieb Borislav Petkov:
> > > From: Borislav Petkov <bp@suse.de>
> > >
> > > Software Error Recovery, i.e. SER, is purely an Intel feature and it
> > > shouldn't be set by default. Enable it only on Intel.

What happens when SER is enabled on an AMD CPU? If it really shouldn't
be enabled, why is KVM returning it on KVM_X86_GET_MCE_CAP_SUPPORTED?

> > Is this new in 2.5? Otherwise we would probably need compatibility code
> > in pc*.[ch] for incoming live migration from older versions.
>
> It looks it is really old, AFAIK from 2010:
>
>   c0532a76b407 ("MCE: Relay UCR MCE to guest")
>
> You'd need to be more verbose about pc*.[ch]. An example perhaps...?

If you change something that's guest-visible and not part of the
migration stream, you need to keep the old behavior on older
machine-types (e.g. pc-i440fx-2.4), or the CPU will change under the
guest's feet when migrating to another host.

For examples, see the recent commits to include/hw/i386/pc.h. They are
all about keeping compatibility when CPUID bits are changed:

  33b5e8c0 target-i386: Disable rdtscp on Opteron_G* CPU models
  6aa91e4a target-i386: Remove POPCNT from qemu64 and qemu32 CPU models
  71195672 target-i386: Remove ABM from qemu64 CPU model
  0909ad24 target-i386: Remove SSE4a from qemu64 CPU model

In the case of this code, it looks like it's already broken because the
resulting mcg_cap depends on host kernel capabilities (the ones reported
by kvm_get_mce_cap_supported()), and the data initialized by
target-i386/cpu.c:mce_init() is silently overwritten by
kvm_arch_init_vcpu(). So we would need to fix that before implementing a
proper compatibility mechanism for mcg_cap.
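For readers unfamiliar with the compat mechanism Eduardo refers to: the commits listed above work by pinning old defaults in the PC_COMPAT_* macros of include/hw/i386/pc.h. A purely hypothetical sketch of what such an entry could look like for this patch - the "mce-ser" property name is invented here and does not exist in QEMU; the real fix would depend on how mcg_cap is made configurable:

```c
/* Hypothetical sketch only: if SER_P were exposed as a configurable
 * X86CPU property (say "mce-ser"), older machine-types would pin the
 * old always-on default so that migration from a 2.4 machine keeps
 * the guest-visible MCG_CAP unchanged: */
#define PC_COMPAT_2_4 \
    {\
        .driver   = TYPE_X86_CPU,\
        .property = "mce-ser",\
        .value    = "on",\
    },
```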
On 23/11/2015 14:22, Eduardo Habkost wrote:
> > Software Error Recovery, i.e. SER, is purely an Intel feature and it
> > shouldn't be set by default. Enable it only on Intel.
>
> What happens when SER is enabled on an AMD CPU? If it really
> shouldn't be enabled, why is KVM returning it on
> KVM_X86_GET_MCE_CAP_SUPPORTED?

Indeed... is it a problem if our frankenstein AMD CPU can recover from
memory errors?

Paolo
+ Tony.

On Mon, Nov 23, 2015 at 03:47:44PM +0100, Paolo Bonzini wrote:
> On 23/11/2015 14:22, Eduardo Habkost wrote:
> > > Software Error Recovery, i.e. SER, is purely an Intel feature and it
> > > shouldn't be set by default. Enable it only on Intel.
> >
> > What happens when SER is enabled on an AMD CPU? If it really
> > shouldn't be enabled, why is KVM returning it on
> > KVM_X86_GET_MCE_CAP_SUPPORTED?
>
> Indeed... is it a problem if our frankenstein AMD CPU can recover from
> memory errors?

The problem stems from the fact that the guest kernel looks at SER and
does different handling depending on that bit:

machine_check_poll:

	...
	if (!(flags & MCP_UC) &&
	    (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC)))
		continue;

so when the guest kernel gets a correctable error (injected..., for
example) it sees that bit set. Even though kvm/qemu emulates an AMD CPU.

So on AMD with that bit set, it would puzzle the whole error
handling/reporting in the guest kernel.

Oh, btw, I'm using a kvm guest to inject MCEs. In case you were
wondering why is he even doing that. :-)

And I'm not sure it makes sense to set that bit for an Intel guest too.
For the simple reason that I don't know how much of the Software Error
Recovery stuff is actually implemented there. If stuff is missing, you
probably don't want to advertise it there too. And by "stuff" I mean all
that fun in section "15.6 RECOVERY OF UNCORRECTED RECOVERABLE (UCR)
ERRORS" of the SDM.

It's a whole another question how/whether to do UCR error handling in
the guest or maybe leave it to the host...
On Mon, Nov 23, 2015 at 11:22:37AM -0200, Eduardo Habkost wrote:
[...]
> In the case of this code, it looks like it's already broken
> because the resulting mcg_cap depends on host kernel capabilities
> (the ones reported by kvm_get_mce_cap_supported()), and the data
> initialized by target-i386/cpu.c:mce_init() is silently
> overwritten by kvm_arch_init_vcpu(). So we would need to fix that
> before implementing a proper compatibility mechanism for
> mcg_cap.

Fortunately, when running Linux v2.6.37 and later,
kvm_arch_init_vcpu() won't actually change mcg_cap (see details below).

But the code is broken if running on Linux between v2.6.32 and v2.6.36:
it will clear MCG_SER_P silently (and silently enable MCG_SER_P when
migrating to a newer host).

But I don't know what we should do on those cases. If we abort
initialization when the host doesn't support MCG_SER_P, all CPU models
with MCE and MCA enabled will become unrunnable on Linux between v2.6.32
and v2.6.36. Should we do that, and simply ask people to upgrade their
kernels (or explicitly disable MCE) if they want to run latest QEMU?
For reference, these are the capabilities returned by Linux:

* KVM_MAX_MCE_BANKS is 32 since
  890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199)
* KVM_MCE_CAP_SUPPORTED is (MCG_CTL_P | MCG_SER_P) since
  5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)
* KVM_MCE_CAP_SUPPORTED is MCG_CTL_P between
  890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199) and
  5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)

The current definitions in QEMU are:

  #define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
  #define MCE_BANKS_DEF 10

The target-i386/cpu.c:mce_init() code sets mcg_cap to:

  env->mcg_cap == MCE_CAP_DEF | MCE_BANKS_DEF;
              == (MCG_CTL_P|MCG_SER_P) | 10;

The kvm_arch_init_vcpu() code that changes mcg_cap does the following:

  kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks);
  if (banks > MCE_BANKS_DEF) {
      banks = MCE_BANKS_DEF;
  }
  mcg_cap &= MCE_CAP_DEF;
  mcg_cap |= banks;
  env->mcg_cap = mcg_cap;

* Therefore, if running Linux v2.6.37 or newer, this will be the result:
  banks == 10;
  mcg_cap == (MCG_CTL_P|MCG_SER_P) | banks
          == (MCG_CTL_P|MCG_SER_P) | 10;
* That's the same value set by mce_init(), so fortunately
  kvm_arch_init_vcpu() isn't actually changing mcg_cap if running
  Linux v2.6.37 and newer.
* However, if running Linux between v2.6.32 and v2.6.37,
  kvm_arch_init_vcpu() will silently clear MCG_SER_P.
On Mon, Nov 23, 2015 at 01:11:27PM -0200, Eduardo Habkost wrote: > On Mon, Nov 23, 2015 at 11:22:37AM -0200, Eduardo Habkost wrote: > [...] > > In the case of this code, it looks like it's already broken > > because the resulting mcg_cap depends on host kernel capabilities > > (the ones reported by kvm_get_mce_cap_supported()), and the data > > initialized by target-i386/cpu.c:mce_init() is silently > > overwritten by kvm_arch_init_vcpu(). So we would need to fix that > > before implementing a proper compatibility mechanism for > > mcg_cap. > > Fortunately, when running Linux v2.6.37 and later, > kvm_arch_init_vcpu() won't actually change mcg_cap (see details > below). > > But the code is broken if running on Linux between v2.6.32 and > v2.6.36: it will clear MCG_SER_P silently (and silently enable > MCG_SER_P when migrating to a newer host). > > But I don't know what we should do on those cases. If we abort > initialization when the host doesn't support MCG_SER_P, all CPU > models with MCE and MCA enabled will become unrunnable on Linux > between v2.6.32 and v2.6.36. Should we do that, and simply ask > people to upgrade their kernels (or explicitly disable MCE) if > they want to run latest QEMU? > > For reference, these are the capabilities returned by Linux: > * KVM_MAX_MCE_BANKS is 32 since > 890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199) > * KVM_MCE_CAP_SUPPORTED is (MCG_CTL_P | MCG_SER_P) since > 5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3) The commit message of that one says that there is MCG_SER_P support in the kernel. The previous commit talks about MCE injection with KVM_X86_SET_MCE ioctl but frankly, I don't see that. From looking at the current code, KVM_X86_SET_MCE does kvm_vcpu_ioctl_x86_setup_mce() which simply sets MCG_CAP. And it gets those from KVM_X86_GET_MCE_CAP_SUPPORTED which is #define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P) So it basically sets those two supported bits. 
But how is supported == actually present ?!?! That soo doesn't make any sense. > * KVM_MCE_CAP_SUPPORTED is MCG_CTL_P between > 890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199) > and 5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3) > > The current definitions in QEMU are: > #define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P) > #define MCE_BANKS_DEF 10 That's also wrong. The number of banks is, of course, generation-specific. The qemu commit adding that is 79c4f6b08009 ("QEMU: MCE: Add MCE simulation to qemu/tcg") and I really don't get what the reasoning behind it is? Is this supposed to mimick a certain default CPU which is not really resembling a real CPU or some generic CPU or what is it? Because the moment you start qemu with -cpu <something AMD>, all that MCA information is totally wrong. > The target-i386/cpu.c:mce_init() code sets mcg_cap to: > env->mcg_cap == MCE_CAP_DEF | MCE_BANKS_DEF; > == (MCG_CTL_P|MCG_SER_P) | 10; > > The kvm_arch_init_vcpu() code that changes mcg_cap does the following: > kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks); > if (banks > MCE_BANKS_DEF) { > banks = MCE_BANKS_DEF; > } > mcg_cap &= MCE_CAP_DEF; > mcg_cap |= banks; > env->mcg_cap = mcg_cap; > * Therefore, if running Linux v2.6.37 or newer, this will be > the result: > banks == 10; > mcg_cap == (MCG_CTL_P|MCG_SER_P) | banks > == (MCG_CTL_P|MCG_SER_P) | 10; > * That's the same value set by mce_init(), so fortunately > kvm_arch_init_vcpu() isn't actually changing mcg_cap > if running Linux v.2.6.37 and newer. > * However, if running Linux between v2.6.32 and v2.6.37, > kvm_arch_init_vcpu() will silently clear MCG_SER_P. I don't understand - you want that cleared when emulating a !Intel CPU or an Intel CPU which doesn't support SER. This bit should be clear there. Why should be set at all there? Hmmm?
On Mon, Nov 23, 2015 at 05:43:14PM +0100, Borislav Petkov wrote:
> On Mon, Nov 23, 2015 at 01:11:27PM -0200, Eduardo Habkost wrote:
> > On Mon, Nov 23, 2015 at 11:22:37AM -0200, Eduardo Habkost wrote:
> > [...]
> > > In the case of this code, it looks like it's already broken
> > > because the resulting mcg_cap depends on host kernel capabilities
> > > (the ones reported by kvm_get_mce_cap_supported()), and the data
> > > initialized by target-i386/cpu.c:mce_init() is silently
> > > overwritten by kvm_arch_init_vcpu(). So we would need to fix that
> > > before implementing a proper compatibility mechanism for
> > > mcg_cap.
> >
> > Fortunately, when running Linux v2.6.37 and later,
> > kvm_arch_init_vcpu() won't actually change mcg_cap (see details
> > below).
> >
> > But the code is broken if running on Linux between v2.6.32 and
> > v2.6.36: it will clear MCG_SER_P silently (and silently enable
> > MCG_SER_P when migrating to a newer host).
> >
> > But I don't know what we should do on those cases. If we abort
> > initialization when the host doesn't support MCG_SER_P, all CPU
> > models with MCE and MCA enabled will become unrunnable on Linux
> > between v2.6.32 and v2.6.36. Should we do that, and simply ask
> > people to upgrade their kernels (or explicitly disable MCE) if
> > they want to run latest QEMU?
> >
> > For reference, these are the capabilities returned by Linux:
> > * KVM_MAX_MCE_BANKS is 32 since
> >   890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199)
> > * KVM_MCE_CAP_SUPPORTED is (MCG_CTL_P | MCG_SER_P) since
> >   5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)
>
> The commit message of that one says that there is MCG_SER_P support in
> the kernel.
>
> The previous commit talks about MCE injection with KVM_X86_SET_MCE
> ioctl but frankly, I don't see that. From looking at the current code,
> KVM_X86_SET_MCE does kvm_vcpu_ioctl_x86_setup_mce() which simply sets
> MCG_CAP. And it gets those from KVM_X86_GET_MCE_CAP_SUPPORTED which is
>
>   #define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P)
>
> So it basically sets those two supported bits. But how is
>
>   supported == actually present
>
> ?!?!
>
> That soo doesn't make any sense.

I will let the people working on the actual MCE emulation in KVM answer
that. I am assuming that KVM_MCE_CAP_SUPPORTED is set to something that
makes sense.

> > * KVM_MCE_CAP_SUPPORTED is MCG_CTL_P between
> >   890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199)
> >   and 5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)
> >
> > The current definitions in QEMU are:
> >
> >   #define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
> >   #define MCE_BANKS_DEF 10
>
> That's also wrong. The number of banks is, of course,
> generation-specific.

Note that we don't mimic every single detail of real CPUs out there, and
this isn't necessarily a problem (although sometimes we choose bad
defaults). Do you see real world scenarios where choosing 10 as the
default causes problems for guest OSes, or do you just worry that this
might cause problems because it doesn't match any real-world CPU?

The number of banks is whatever QEMU chooses to be the number of banks.
The pc-*-2.4 and older machine-types already expose 10 banks, but we can
change it in future machine-types.

> The qemu commit adding that is
>
>   79c4f6b08009 ("QEMU: MCE: Add MCE simulation to qemu/tcg")
>
> and I really don't get what the reasoning behind it is? Is this
> supposed to mimic a certain default CPU which is not really resembling
> a real CPU or some generic CPU or what is it?

I don't know the reasoning behind those defaults, but that's the
existing default in pc-*-2.4 and older.

> Because the moment you start qemu with -cpu <something AMD>, all that
> MCA information is totally wrong.

If we really care about matching the number of banks of real CPUs, we
can make it configurable, defined by the CPU model, and/or have better
defaults in future machine-types. That won't be a problem.

But I still don't know what we should do when somebody runs:

  -machine pc-i440fx-2.4 -cpu Penryn

on a host kernel that doesn't report MCG_SER_P on
KVM_MCE_CAP_SUPPORTED.

> > The target-i386/cpu.c:mce_init() code sets mcg_cap to:
> >
> >   env->mcg_cap == MCE_CAP_DEF | MCE_BANKS_DEF;
> >               == (MCG_CTL_P|MCG_SER_P) | 10;
> >
> > The kvm_arch_init_vcpu() code that changes mcg_cap does the following:
> >
> >   kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks);
> >   if (banks > MCE_BANKS_DEF) {
> >       banks = MCE_BANKS_DEF;
> >   }
> >   mcg_cap &= MCE_CAP_DEF;
> >   mcg_cap |= banks;
> >   env->mcg_cap = mcg_cap;
> >
> > * Therefore, if running Linux v2.6.37 or newer, this will be
> >   the result:
> >   banks == 10;
> >   mcg_cap == (MCG_CTL_P|MCG_SER_P) | banks
> >           == (MCG_CTL_P|MCG_SER_P) | 10;
> > * That's the same value set by mce_init(), so fortunately
> >   kvm_arch_init_vcpu() isn't actually changing mcg_cap
> >   if running Linux v2.6.37 and newer.
> > * However, if running Linux between v2.6.32 and v2.6.37,
> >   kvm_arch_init_vcpu() will silently clear MCG_SER_P.
>
> I don't understand - you want that cleared when emulating a !Intel CPU
> or an Intel CPU which doesn't support SER. This bit should be clear
> there. Why should it be set at all there? Hmmm?

I am just saying we already clear it when running on Linux
v2.6.32-v2.6.36, no matter the host CPU or the -cpu options we have. And
we do not clear it when running Linux v2.6.37 or newer. That's the
behavior of pc-*-2.4 and older, even if we change it on future
machine-types.
On Mon, Nov 23, 2015 at 05:42:08PM -0200, Eduardo Habkost wrote:
> I will let the people working on the actual MCE emulation in KVM
> answer that. I am assuming that KVM_MCE_CAP_SUPPORTED is set to
> something that makes sense.

Well, that should be, IMHO, the same as all those feature bits assigned
to the ->features arrays of the different cpu types in qemu's
X86CPUDefinition descriptors.

> Note that we don't mimic every single detail of real CPUs out
> there, and this isn't necessarily a problem (although sometimes
> we choose bad defaults). Do you see real world scenarios where
> choosing 10 as the default causes problems for guest OSes, or do
> you just worry that this might cause problems because it doesn't
> match any real-world CPU?

Well, the problems would come when the guests start using the MCA
infrastructure bits. That's why I asked how exactly people imagine doing
all the hardware error handling in the guest.

I know we do something with poisoning pages, i.e.
kvm_send_hwpoison_signal() and all that machinery around it, but in that
particular case it is the hypervisor which marks the pages as poison and
kvm notices that on the __get_user_pages() path and the error is
injected into the guest. AFAICT, of course.

In my case, I'm injecting a HW error in the guest kernel by writing into
the *guest* MSRs and the *guest* kernel MCA code is supposed to handle
the error. And the problem here is that I'm emulating an AMD guest. But
a guest which sports an Intel-only feature, and that puzzles the guest
kernel.

Does that make more sense? I hope...

> If we really care about matching the number of banks of real
> CPUs, we can make it configurable, defined by the CPU model,
> and/or have better defaults in future machine-types. That won't
> be a problem.

I think we should try to do that if we're striving for accurate
emulation of guest CPUs. But then there's the migration use-case which
has a different focus...

> But I still don't know what we should do when somebody runs:
>
>   -machine pc-i440fx-2.4 -cpu Penryn
>
> on a host kernel that doesn't report MCG_SER_P on
> KVM_MCE_CAP_SUPPORTED.

Right, before we ask that question we should ask the more generic one:
how do people want to do error handling in the guest? Do they even want
to? More importantly, does it even make sense to handle hardware errors
in the guest? If so, which, and if not, why not?

I mean, no one would've noticed the MCG_SER_P issue if no one had tried
to use it and what it implies. So it all comes down to whether the guest
uses the emulated feature.

It seems to me this issue remained unnoticed for such a long time for
the simple reason that nothing used it. So nothing in the guest cared
whether SER_P is set or not, or how many MCA banks there are.

So if you run "-machine pc-i440fx-2.4 -cpu Penryn" it wouldn't matter
because, AFAIK - and correct me if I'm wrong here - the guest never got
to see the Action Required and Action Optional MCEs which are the result
of SER_P support. So the guest didn't care.

Yes, no, am I missing something completely here?

> I am just saying we already clear it when running on Linux
> v2.6.32-v2.6.36, no matter the host CPU or the -cpu options we
> have. And we do not clear it when running Linux v2.6.37 or newer.
> That's the behavior of pc-*-2.4 and older, even if we change it on
> future machine-types.

Right, ok.

So the fact that it was clear in the v2.6.32-v2.6.36 time frame and set
later, and nothing complained, *probably* confirms my theory that the
guest didn't really care about that setting and it probably doesn't now
either... Unless you try to use it, like I did. :-)

Thanks.
On Mon, Nov 23, 2015 at 05:43:14PM +0100, Borislav Petkov wrote:
> On Mon, Nov 23, 2015 at 01:11:27PM -0200, Eduardo Habkost wrote:
> > On Mon, Nov 23, 2015 at 11:22:37AM -0200, Eduardo Habkost wrote:
> > [...]
> > > In the case of this code, it looks like it's already broken
> > > because the resulting mcg_cap depends on host kernel capabilities
> > > (the ones reported by kvm_get_mce_cap_supported()), and the data
> > > initialized by target-i386/cpu.c:mce_init() is silently
> > > overwritten by kvm_arch_init_vcpu(). So we would need to fix that
> > > before implementing a proper compatibility mechanism for
> > > mcg_cap.
> >
> > Fortunately, when running Linux v2.6.37 and later,
> > kvm_arch_init_vcpu() won't actually change mcg_cap (see details
> > below).
> >
> > But the code is broken if running on Linux between v2.6.32 and
> > v2.6.36: it will clear MCG_SER_P silently (and silently enable
> > MCG_SER_P when migrating to a newer host).
> >
> > But I don't know what we should do on those cases. If we abort
> > initialization when the host doesn't support MCG_SER_P, all CPU
> > models with MCE and MCA enabled will become unrunnable on Linux
> > between v2.6.32 and v2.6.36. Should we do that, and simply ask
> > people to upgrade their kernels (or explicitly disable MCE) if
> > they want to run latest QEMU?
> >
> > For reference, these are the capabilities returned by Linux:
> > * KVM_MAX_MCE_BANKS is 32 since
> >   890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199)
> > * KVM_MCE_CAP_SUPPORTED is (MCG_CTL_P | MCG_SER_P) since
> >   5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)
>
> The commit message of that one says that there is MCG_SER_P support in
> the kernel.
>
> The previous commit talks about MCE injection with KVM_X86_SET_MCE
> ioctl but frankly, I don't see that. From looking at the current code,
> KVM_X86_SET_MCE does kvm_vcpu_ioctl_x86_setup_mce() which simply sets
> MCG_CAP. And it gets those from KVM_X86_GET_MCE_CAP_SUPPORTED which is

KVM_X86_SET_MCE does not call kvm_vcpu_ioctl_x86_setup_mce(). It calls
kvm_vcpu_ioctl_x86_set_mce(), which stores the
IA32_MCi_{STATUS,ADDR,MISC} register contents at vcpu->arch.mce_banks.

>   #define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P)
>
> So it basically sets those two supported bits. But how is
>
>   supported == actually present
>
> ?!?!
>
> That soo doesn't make any sense.

I didn't check the QEMU MCE code to confirm that, but I assume it is
implemented there. In that case, MCG_SER_P in KVM_MCE_CAP_SUPPORTED just
indicates it can be implemented by userspace, as long as it makes the
appropriate KVM_X86_SET_MCE (or maybe KVM_SET_MSRS?) calls.
On Tue, Nov 24, 2015 at 02:36:20PM -0200, Eduardo Habkost wrote:
> KVM_X86_SET_MCE does not call kvm_vcpu_ioctl_x86_setup_mce(). It
> calls kvm_vcpu_ioctl_x86_set_mce(), which stores the
> IA32_MCi_{STATUS,ADDR,MISC} register contents at
> vcpu->arch.mce_banks.

Ah, correct. I've mistakenly followed KVM_X86_SETUP_MCE and not
KVM_X86_SET_MCE, sorry.

Ok, so this makes more sense now - there's kvm_inject_mce_oldstyle() in
qemu, and kvm_arch_on_sigbus_vcpu(), which is on the SIGBUS handler
path, actually does:

	if ((env->mcg_cap & MCG_SER_P) && addr &&
	    (code == BUS_MCEERR_AR || code == BUS_MCEERR_AO)) {
	...

I betcha that MCG_SER_P is set on every guest, even !Intel ones. I need
to go stare more at that code.

> I didn't check the QEMU MCE code to confirm that, but I assume it
> is implemented there. In that case, MCG_SER_P in
> KVM_MCE_CAP_SUPPORTED just indicates it can be implemented by
> userspace, as long as it makes the appropriate KVM_X86_SET_MCE
> (or maybe KVM_SET_MSRS?) calls.

I think it is that kvm_arch_on_sigbus_vcpu()/kvm_arch_on_sigbus() which
handles SIGBUS with BUS_MCEERR_AR/BUS_MCEERR_AO si_code. See
mm/memory-failure.c:kill_proc() in the kernel, where we do send those
signals to processes.

However, I still think the MCG_SER_P bit being set on !Intel is wrong,
even though the recovery action done by
kvm_arch_on_sigbus_vcpu()/kvm_arch_on_sigbus() is correct.

Why, you're asking. :-)

Well, what happens above is that the qemu process gets the signal that
there was an uncorrectable error detected in its memory and it is either
required to do something:

	BUS_MCEERR_AR == Action Required

or its action is optional:

	BUS_MCEERR_AO == Action Optional

The SER_P text in the SDM describes those two:

"SRAO errors indicate that some data in the system is corrupt, but the
data has not been consumed and the processor state is valid. SRAO errors
provide the additional error information for system software to perform
a recovery action. An SRAO error is indicated with UC=1, PCC=0, S=1,
EN=1 and AR=0 in the IA32_MCi_STATUS register."

and

"Software recoverable action required (SRAR) - a UCR error that requires
system software to take a recovery action on this processor before
scheduling another stream of execution on this processor. SRAR errors
indicate that the error was detected and raised at the point of the
consumption in the execution flow. An SRAR error is indicated with UC=1,
PCC=0, S=1, EN=1 and AR=1 in the IA32_MCi_STATUS register."

And for that we don't need to look at SER_P in qemu - we only need to
know what the severity of the error is and then we go and handle it
accordingly. Because those two si_codes are purely software-defined. And
the application which gets that SIGBUS type doesn't need to care about
SER_P.

For example, AMD has similar error severities and they can be injected
into qemu too. And qemu can do the exact same recovery actions based on
the severity without even looking at the SER_P bit.

So here's the problem:

* SER_P is set on all guests and it puzzles kernels running on !Intel
  guests.
* Hardware error recovery actions can be done regardless of that bit.

The only case where that bit makes sense is if the emulated hardware
itself is generating accurate MCEs and then, as a result, wants to
generate accurate error signatures:

	SRAO: UC=1, PCC=0, S=1, EN=1 and AR=0
	SRAR: UC=1, PCC=0, S=1, EN=1 and AR=1

Those bits should have these settings only when the emulated hw actually
implements SER_P. Otherwise, you'd get those old crude MCEs which are
either uncorrectable and generate an #MC, or are correctable errors.

But ok, let me go do some staring at the examples you sent me previously
first. I might get a better idea after I sleep on it. :-)

Thanks!
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 11e5e39a756a..8155ee94fbe1 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2803,13 +2803,6 @@ static void x86_cpu_apic_realize(X86CPU *cpu, Error **errp)
 }
 #endif
 
-
-#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
-                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
-                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
-#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
-                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
-                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
 static void x86_cpu_realizefn(DeviceState *dev, Error **errp)
 {
     CPUState *cs = CPU(dev);
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index fc4a605d6a29..2605c564239a 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -283,7 +283,7 @@
 #define MCG_CTL_P (1ULL<<8)   /* MCG_CAP register available */
 #define MCG_SER_P (1ULL<<24)  /* MCA recovery/new status bits */
 
-#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
+#define MCE_CAP_DEF MCG_CTL_P
 #define MCE_BANKS_DEF 10
 
 #define MCG_STATUS_RIPV (1ULL<<0)  /* restart ip valid */
@@ -610,6 +610,13 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
 #define CPUID_MWAIT_IBE (1U << 1)  /* Interrupts can exit capability */
 #define CPUID_MWAIT_EMX (1U << 0)  /* enumeration supported */
 
+#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
+                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
+                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
+#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
+                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
+                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
+
 #ifndef HYPERV_SPINLOCK_NEVER_RETRY
 #define HYPERV_SPINLOCK_NEVER_RETRY 0xFFFFFFFF
 #endif
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 2a9953b2d4b5..082d38d4838d 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -787,8 +787,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
         if (banks > MCE_BANKS_DEF) {
             banks = MCE_BANKS_DEF;
         }
+
         mcg_cap &= MCE_CAP_DEF;
         mcg_cap |= banks;
+
+	if (IS_INTEL_CPU(env))
+		mcg_cap |= MCG_SER_P;
+
         ret = kvm_vcpu_ioctl(cs, KVM_X86_SETUP_MCE, &mcg_cap);
         if (ret < 0) {
             fprintf(stderr, "KVM_X86_SETUP_MCE: %s", strerror(-ret));