Message ID: 1448060471-14128-1-git-send-email-bp@alien8.de (mailing list archive)
State: New, archived
Hi,

CC'ing qemu-devel.

Am 21.11.2015 um 00:01 schrieb Borislav Petkov:
> From: Borislav Petkov <bp@suse.de>
>
> Software Error Recovery, i.e. SER, is purely an Intel feature and it
> shouldn't be set by default. Enable it only on Intel.

Is this new in 2.5? Otherwise we would probably need compatibility code
in pc*.[ch] for incoming live migration from older versions.

> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  target-i386/cpu.c | 7 -------
>  target-i386/cpu.h | 9 ++++++++-
>  target-i386/kvm.c | 5 +++++
>  3 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 11e5e39a756a..8155ee94fbe1 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2803,13 +2803,6 @@ static void x86_cpu_apic_realize(X86CPU *cpu, Error **errp)
>  }
>  #endif
>
> -
> -#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
> -                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
> -                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
> -#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> -                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> -                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
>  static void x86_cpu_realizefn(DeviceState *dev, Error **errp)
>  {
>      CPUState *cs = CPU(dev);
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index fc4a605d6a29..2605c564239a 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -283,7 +283,7 @@
>  #define MCG_CTL_P (1ULL<<8)   /* MCG_CAP register available */
>  #define MCG_SER_P (1ULL<<24)  /* MCA recovery/new status bits */
>
> -#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
> +#define MCE_CAP_DEF MCG_CTL_P
>  #define MCE_BANKS_DEF 10
>
>  #define MCG_STATUS_RIPV (1ULL<<0)  /* restart ip valid */
> @@ -610,6 +610,13 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
>  #define CPUID_MWAIT_IBE (1U << 1)  /* Interrupts can exit capability */
>  #define CPUID_MWAIT_EMX (1U << 0)  /* enumeration supported */
>
> +#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
> +                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
> +                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
> +#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> +                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> +                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> +
>  #ifndef HYPERV_SPINLOCK_NEVER_RETRY
>  #define HYPERV_SPINLOCK_NEVER_RETRY 0xFFFFFFFF
>  #endif
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 2a9953b2d4b5..082d38d4838d 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -787,8 +787,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
>      if (banks > MCE_BANKS_DEF) {
>          banks = MCE_BANKS_DEF;
>      }
> +
>      mcg_cap &= MCE_CAP_DEF;
>      mcg_cap |= banks;
> +
> +	if (IS_INTEL_CPU(env))
> +		mcg_cap |= MCG_SER_P;

Tabs and missing braces.

> +
>      ret = kvm_vcpu_ioctl(cs, KVM_X86_SETUP_MCE, &mcg_cap);
>      if (ret < 0) {
>          fprintf(stderr, "KVM_X86_SETUP_MCE: %s", strerror(-ret));

Regards,
Andreas
On Sat, Nov 21, 2015 at 12:11:35AM +0100, Andreas Färber wrote:
> Hi,
>
> CC'ing qemu-devel.

Ah, thanks.

> Am 21.11.2015 um 00:01 schrieb Borislav Petkov:
> > From: Borislav Petkov <bp@suse.de>
> >
> > Software Error Recovery, i.e. SER, is purely an Intel feature and it
> > shouldn't be set by default. Enable it only on Intel.
>
> Is this new in 2.5? Otherwise we would probably need compatibility code
> in pc*.[ch] for incoming live migration from older versions.

It looks it is really old, AFAIK from 2010:

  c0532a76b407 ("MCE: Relay UCR MCE to guest")

You'd need to be more verbose about pc*.[ch]. An example perhaps...?

> >      mcg_cap &= MCE_CAP_DEF;
> >      mcg_cap |= banks;
> > +
> > +	if (IS_INTEL_CPU(env))
> > +		mcg_cap |= MCG_SER_P;
>
> Tabs and missing braces.

Ok. Thanks.
On Sat, Nov 21, 2015 at 02:09:25AM +0100, Borislav Petkov wrote:
> On Sat, Nov 21, 2015 at 12:11:35AM +0100, Andreas Färber wrote:
> > Hi,
> >
> > CC'ing qemu-devel.
>
> Ah, thanks.
>
> > Am 21.11.2015 um 00:01 schrieb Borislav Petkov:
> > > From: Borislav Petkov <bp@suse.de>
> > >
> > > Software Error Recovery, i.e. SER, is purely an Intel feature and it
> > > shouldn't be set by default. Enable it only on Intel.

What happens when SER is enabled on an AMD CPU? If it really shouldn't
be enabled, why is KVM returning it on KVM_X86_GET_MCE_CAP_SUPPORTED?

> > Is this new in 2.5? Otherwise we would probably need compatibility code
> > in pc*.[ch] for incoming live migration from older versions.
>
> It looks it is really old, AFAIK from 2010:
>
>   c0532a76b407 ("MCE: Relay UCR MCE to guest")
>
> You'd need to be more verbose about pc*.[ch]. An example perhaps...?

If you change something that's guest-visible and not part of the
migration stream, you need to keep the old behavior on older
machine-types (e.g. pc-i440fx-2.4), or the CPU will change under the
guest's feet when migrating to another host.

For examples, see the recent commits to include/hw/i386/pc.h. They are
all about keeping compatibility when CPUID bits are changed:

  33b5e8c0 target-i386: Disable rdtscp on Opteron_G* CPU models
  6aa91e4a target-i386: Remove POPCNT from qemu64 and qemu32 CPU models
  71195672 target-i386: Remove ABM from qemu64 CPU model
  0909ad24 target-i386: Remove SSE4a from qemu64 CPU model

In the case of this code, it looks like it's already broken because the
resulting mcg_cap depends on host kernel capabilities (the ones reported
by kvm_get_mce_cap_supported()), and the data initialized by
target-i386/cpu.c:mce_init() is silently overwritten by
kvm_arch_init_vcpu(). So we would need to fix that before implementing a
proper compatibility mechanism for mcg_cap.
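For readers unfamiliar with the compat mechanism Eduardo refers to: the commits listed above work by pinning old defaults in the PC_COMPAT_* macros of include/hw/i386/pc.h. A purely hypothetical sketch of what such an entry could look like for this patch - the "mce-ser" property name is invented here and does not exist in QEMU; the real fix would depend on how mcg_cap is made configurable:

```c
/* Hypothetical sketch only: if SER_P were exposed as a configurable
 * X86CPU property (say "mce-ser"), older machine-types would pin the
 * old always-on default so that migration from a 2.4 machine keeps
 * the guest-visible MCG_CAP unchanged: */
#define PC_COMPAT_2_4 \
    {\
        .driver   = TYPE_X86_CPU,\
        .property = "mce-ser",\
        .value    = "on",\
    },
```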
On 23/11/2015 14:22, Eduardo Habkost wrote:
> > Software Error Recovery, i.e. SER, is purely an Intel feature and it
> > shouldn't be set by default. Enable it only on Intel.
>
> What happens when SER is enabled on an AMD CPU? If it really
> shouldn't be enabled, why is KVM returning it on
> KVM_X86_GET_MCE_CAP_SUPPORTED?

Indeed... is it a problem if our frankenstein AMD CPU can recover from
memory errors?

Paolo
+ Tony.

On Mon, Nov 23, 2015 at 03:47:44PM +0100, Paolo Bonzini wrote:
> On 23/11/2015 14:22, Eduardo Habkost wrote:
> > > Software Error Recovery, i.e. SER, is purely an Intel feature and it
> > > shouldn't be set by default. Enable it only on Intel.
> >
> > What happens when SER is enabled on an AMD CPU? If it really
> > shouldn't be enabled, why is KVM returning it on
> > KVM_X86_GET_MCE_CAP_SUPPORTED?
>
> Indeed... is it a problem if our frankenstein AMD CPU can recover from
> memory errors?

The problem stems from the fact that the guest kernel looks at SER and
does different handling depending on that bit:

machine_check_poll:

	...
	if (!(flags & MCP_UC) &&
	    (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC)))
		continue;

so when the guest kernel gets a correctable error (injected..., for
example) it sees that bit set. Even though kvm/qemu emulates an AMD CPU.

So on AMD with that bit set, it would puzzle the whole error
handling/reporting in the guest kernel.

Oh, btw, I'm using a kvm guest to inject MCEs. In case you were
wondering why is he even doing that. :-)

And I'm not sure it makes sense to set that bit for an Intel guest too.
For the simple reason that I don't know how much of the Software Error
Recovery stuff is actually implemented there. If stuff is missing, you
probably don't want to advertise it there too. And by "stuff" I mean all
that fun in section "15.6 RECOVERY OF UNCORRECTED RECOVERABLE (UCR)
ERRORS" of the SDM.

It's a whole another question how/whether to do UCR error handling in
the guest or maybe leave it to the host...
On Mon, Nov 23, 2015 at 11:22:37AM -0200, Eduardo Habkost wrote:
[...]
> In the case of this code, it looks like it's already broken
> because the resulting mcg_cap depends on host kernel capabilities
> (the ones reported by kvm_get_mce_cap_supported()), and the data
> initialized by target-i386/cpu.c:mce_init() is silently
> overwritten by kvm_arch_init_vcpu(). So we would need to fix that
> before implementing a proper compatibility mechanism for
> mcg_cap.

Fortunately, when running Linux v2.6.37 and later,
kvm_arch_init_vcpu() won't actually change mcg_cap (see details below).

But the code is broken if running on Linux between v2.6.32 and v2.6.36:
it will clear MCG_SER_P silently (and silently enable MCG_SER_P when
migrating to a newer host).

But I don't know what we should do on those cases. If we abort
initialization when the host doesn't support MCG_SER_P, all CPU models
with MCE and MCA enabled will become unrunnable on Linux between v2.6.32
and v2.6.36. Should we do that, and simply ask people to upgrade their
kernels (or explicitly disable MCE) if they want to run latest QEMU?
For reference, these are the capabilities returned by Linux:

* KVM_MAX_MCE_BANKS is 32 since
  890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199)
* KVM_MCE_CAP_SUPPORTED is (MCG_CTL_P | MCG_SER_P) since
  5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)
* KVM_MCE_CAP_SUPPORTED is MCG_CTL_P between
  890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199) and
  5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)

The current definitions in QEMU are:

  #define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
  #define MCE_BANKS_DEF 10

The target-i386/cpu.c:mce_init() code sets mcg_cap to:

  env->mcg_cap == MCE_CAP_DEF | MCE_BANKS_DEF;
              == (MCG_CTL_P|MCG_SER_P) | 10;

The kvm_arch_init_vcpu() code that changes mcg_cap does the following:

  kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks);
  if (banks > MCE_BANKS_DEF) {
      banks = MCE_BANKS_DEF;
  }
  mcg_cap &= MCE_CAP_DEF;
  mcg_cap |= banks;
  env->mcg_cap = mcg_cap;

* Therefore, if running Linux v2.6.37 or newer, this will be the result:
  banks == 10;
  mcg_cap == (MCG_CTL_P|MCG_SER_P) | banks
          == (MCG_CTL_P|MCG_SER_P) | 10;
* That's the same value set by mce_init(), so fortunately
  kvm_arch_init_vcpu() isn't actually changing mcg_cap if running
  Linux v2.6.37 and newer.
* However, if running Linux between v2.6.32 and v2.6.37,
  kvm_arch_init_vcpu() will silently clear MCG_SER_P.
On Mon, Nov 23, 2015 at 01:11:27PM -0200, Eduardo Habkost wrote: > On Mon, Nov 23, 2015 at 11:22:37AM -0200, Eduardo Habkost wrote: > [...] > > In the case of this code, it looks like it's already broken > > because the resulting mcg_cap depends on host kernel capabilities > > (the ones reported by kvm_get_mce_cap_supported()), and the data > > initialized by target-i386/cpu.c:mce_init() is silently > > overwritten by kvm_arch_init_vcpu(). So we would need to fix that > > before implementing a proper compatibility mechanism for > > mcg_cap. > > Fortunately, when running Linux v2.6.37 and later, > kvm_arch_init_vcpu() won't actually change mcg_cap (see details > below). > > But the code is broken if running on Linux between v2.6.32 and > v2.6.36: it will clear MCG_SER_P silently (and silently enable > MCG_SER_P when migrating to a newer host). > > But I don't know what we should do on those cases. If we abort > initialization when the host doesn't support MCG_SER_P, all CPU > models with MCE and MCA enabled will become unrunnable on Linux > between v2.6.32 and v2.6.36. Should we do that, and simply ask > people to upgrade their kernels (or explicitly disable MCE) if > they want to run latest QEMU? > > For reference, these are the capabilities returned by Linux: > * KVM_MAX_MCE_BANKS is 32 since > 890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199) > * KVM_MCE_CAP_SUPPORTED is (MCG_CTL_P | MCG_SER_P) since > 5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3) The commit message of that one says that there is MCG_SER_P support in the kernel. The previous commit talks about MCE injection with KVM_X86_SET_MCE ioctl but frankly, I don't see that. From looking at the current code, KVM_X86_SET_MCE does kvm_vcpu_ioctl_x86_setup_mce() which simply sets MCG_CAP. And it gets those from KVM_X86_GET_MCE_CAP_SUPPORTED which is #define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P) So it basically sets those two supported bits. 
But how is supported == actually present ?!?! That soo doesn't make any sense. > * KVM_MCE_CAP_SUPPORTED is MCG_CTL_P between > 890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199) > and 5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3) > > The current definitions in QEMU are: > #define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P) > #define MCE_BANKS_DEF 10 That's also wrong. The number of banks is, of course, generation-specific. The qemu commit adding that is 79c4f6b08009 ("QEMU: MCE: Add MCE simulation to qemu/tcg") and I really don't get what the reasoning behind it is? Is this supposed to mimick a certain default CPU which is not really resembling a real CPU or some generic CPU or what is it? Because the moment you start qemu with -cpu <something AMD>, all that MCA information is totally wrong. > The target-i386/cpu.c:mce_init() code sets mcg_cap to: > env->mcg_cap == MCE_CAP_DEF | MCE_BANKS_DEF; > == (MCG_CTL_P|MCG_SER_P) | 10; > > The kvm_arch_init_vcpu() code that changes mcg_cap does the following: > kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks); > if (banks > MCE_BANKS_DEF) { > banks = MCE_BANKS_DEF; > } > mcg_cap &= MCE_CAP_DEF; > mcg_cap |= banks; > env->mcg_cap = mcg_cap; > * Therefore, if running Linux v2.6.37 or newer, this will be > the result: > banks == 10; > mcg_cap == (MCG_CTL_P|MCG_SER_P) | banks > == (MCG_CTL_P|MCG_SER_P) | 10; > * That's the same value set by mce_init(), so fortunately > kvm_arch_init_vcpu() isn't actually changing mcg_cap > if running Linux v.2.6.37 and newer. > * However, if running Linux between v2.6.32 and v2.6.37, > kvm_arch_init_vcpu() will silently clear MCG_SER_P. I don't understand - you want that cleared when emulating a !Intel CPU or an Intel CPU which doesn't support SER. This bit should be clear there. Why should be set at all there? Hmmm?
On Mon, Nov 23, 2015 at 05:43:14PM +0100, Borislav Petkov wrote:
> On Mon, Nov 23, 2015 at 01:11:27PM -0200, Eduardo Habkost wrote:
> > On Mon, Nov 23, 2015 at 11:22:37AM -0200, Eduardo Habkost wrote:
> > [...]
> > > In the case of this code, it looks like it's already broken
> > > because the resulting mcg_cap depends on host kernel capabilities
> > > (the ones reported by kvm_get_mce_cap_supported()), and the data
> > > initialized by target-i386/cpu.c:mce_init() is silently
> > > overwritten by kvm_arch_init_vcpu(). So we would need to fix that
> > > before implementing a proper compatibility mechanism for
> > > mcg_cap.
> >
> > Fortunately, when running Linux v2.6.37 and later,
> > kvm_arch_init_vcpu() won't actually change mcg_cap (see details
> > below).
> >
> > But the code is broken if running on Linux between v2.6.32 and
> > v2.6.36: it will clear MCG_SER_P silently (and silently enable
> > MCG_SER_P when migrating to a newer host).
> >
> > But I don't know what we should do on those cases. If we abort
> > initialization when the host doesn't support MCG_SER_P, all CPU
> > models with MCE and MCA enabled will become unrunnable on Linux
> > between v2.6.32 and v2.6.36. Should we do that, and simply ask
> > people to upgrade their kernels (or explicitly disable MCE) if
> > they want to run latest QEMU?
> >
> > For reference, these are the capabilities returned by Linux:
> > * KVM_MAX_MCE_BANKS is 32 since
> >   890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199)
> > * KVM_MCE_CAP_SUPPORTED is (MCG_CTL_P | MCG_SER_P) since
> >   5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)
>
> The commit message of that one says that there is MCG_SER_P support in
> the kernel.
>
> The previous commit talks about MCE injection with KVM_X86_SET_MCE
> ioctl but frankly, I don't see that. From looking at the current code,
> KVM_X86_SET_MCE does kvm_vcpu_ioctl_x86_setup_mce() which simply sets
> MCG_CAP. And it gets those from KVM_X86_GET_MCE_CAP_SUPPORTED which is
>
>   #define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P)
>
> So it basically sets those two supported bits. But how is
>
>   supported == actually present
>
> ?!?!
>
> That soo doesn't make any sense.

I will let the people working on the actual MCE emulation in KVM answer
that. I am assuming that KVM_MCE_CAP_SUPPORTED is set to something that
makes sense.

> > * KVM_MCE_CAP_SUPPORTED is MCG_CTL_P between
> >   890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199)
> >   and 5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)
> >
> > The current definitions in QEMU are:
> >
> >   #define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
> >   #define MCE_BANKS_DEF 10
>
> That's also wrong. The number of banks is, of course,
> generation-specific.

Note that we don't mimic every single detail of real CPUs out there, and
this isn't necessarily a problem (although sometimes we choose bad
defaults). Do you see real world scenarios where choosing 10 as the
default causes problems for guest OSes, or do you just worry that this
might cause problems because it doesn't match any real-world CPU?

The number of banks is whatever QEMU chooses to be the number of banks.
The pc-*-2.4 and older machine-types already expose 10 banks, but we can
change it in future machine-types.

> The qemu commit adding that is
>
>   79c4f6b08009 ("QEMU: MCE: Add MCE simulation to qemu/tcg")
>
> and I really don't get what the reasoning behind it is? Is this
> supposed to mimic a certain default CPU which is not really resembling
> a real CPU or some generic CPU or what is it?

I don't know the reasoning behind those defaults, but that's the
existing default in pc-*-2.4 and older.

> Because the moment you start qemu with -cpu <something AMD>, all that
> MCA information is totally wrong.

If we really care about matching the number of banks of real CPUs, we
can make it configurable, defined by the CPU model, and/or have better
defaults in future machine-types. That won't be a problem.

But I still don't know what we should do when somebody runs:

  -machine pc-i440fx-2.4 -cpu Penryn

on a host kernel that doesn't report MCG_SER_P on
KVM_MCE_CAP_SUPPORTED.

> > The target-i386/cpu.c:mce_init() code sets mcg_cap to:
> >
> >   env->mcg_cap == MCE_CAP_DEF | MCE_BANKS_DEF;
> >               == (MCG_CTL_P|MCG_SER_P) | 10;
> >
> > The kvm_arch_init_vcpu() code that changes mcg_cap does the following:
> >
> >   kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks);
> >   if (banks > MCE_BANKS_DEF) {
> >       banks = MCE_BANKS_DEF;
> >   }
> >   mcg_cap &= MCE_CAP_DEF;
> >   mcg_cap |= banks;
> >   env->mcg_cap = mcg_cap;
> >
> > * Therefore, if running Linux v2.6.37 or newer, this will be
> >   the result:
> >   banks == 10;
> >   mcg_cap == (MCG_CTL_P|MCG_SER_P) | banks
> >           == (MCG_CTL_P|MCG_SER_P) | 10;
> > * That's the same value set by mce_init(), so fortunately
> >   kvm_arch_init_vcpu() isn't actually changing mcg_cap
> >   if running Linux v2.6.37 and newer.
> > * However, if running Linux between v2.6.32 and v2.6.37,
> >   kvm_arch_init_vcpu() will silently clear MCG_SER_P.
>
> I don't understand - you want that cleared when emulating a !Intel CPU
> or an Intel CPU which doesn't support SER. This bit should be clear
> there. Why should it be set at all there? Hmmm?

I am just saying we already clear it when running on Linux
v2.6.32-v2.6.36, no matter the host CPU or the -cpu options we have. And
we do not clear it when running Linux v2.6.37 or newer. That's the
behavior of pc-*-2.4 and older, even if we change it on future
machine-types.
On Mon, Nov 23, 2015 at 05:42:08PM -0200, Eduardo Habkost wrote:
> I will let the people working on the actual MCE emulation in KVM
> answer that. I am assuming that KVM_MCE_CAP_SUPPORTED is set to
> something that makes sense.

Well, that should be, IMHO, the same as all those feature bits assigned
to the ->features arrays of the different cpu types in qemu's
X86CPUDefinition descriptors.

> Note that we don't mimic every single detail of real CPUs out
> there, and this isn't necessarily a problem (although sometimes
> we choose bad defaults). Do you see real world scenarios where
> choosing 10 as the default causes problems for guest OSes, or do
> you just worry that this might cause problems because it doesn't
> match any real-world CPU?

Well, the problems would come when the guests start using the MCA
infrastructure bits. That's why I asked how exactly people imagine doing
all the hardware error handling in the guest.

I know we do something with poisoning pages, i.e.
kvm_send_hwpoison_signal() and all that machinery around it, but in that
particular case it is the hypervisor which marks the pages as poison and
kvm notices that on the __get_user_pages() path and the error is
injected into the guest. AFAICT, of course.

In my case, I'm injecting a HW error in the guest kernel by writing into
the *guest* MSRs and the *guest* kernel MCA code is supposed to handle
the error. And the problem here is that I'm emulating an AMD guest. But
a guest which sports an Intel-only feature, and that puzzles the guest
kernel.

Does that make more sense? I hope...

> If we really care about matching the number of banks of real
> CPUs, we can make it configurable, defined by the CPU model,
> and/or have better defaults in future machine-types. That won't
> be a problem.

I think we should try to do that if we're striving for accurate
emulation of guest CPUs. But then there's the migration use-case which
has a different focus...

> But I still don't know what we should do when somebody runs:
>
>   -machine pc-i440fx-2.4 -cpu Penryn
>
> on a host kernel that doesn't report MCG_SER_P on
> KVM_MCE_CAP_SUPPORTED.

Right, before we ask that question we should ask the more generic one:
how do people want to do error handling in the guest? Do they even want
to? More importantly, does it even make sense to handle hardware errors
in the guest? If so, which, and if not, why not?

I mean, no one would've noticed the MCG_SER_P issue if no one had tried
to use it and what it implies. So it all comes down to whether the guest
uses the emulated feature.

It seems to me this issue remained unnoticed for such a long time for
the simple reason that nothing used it. So nothing in the guest cared
whether SER_P is set or not, or how many MCA banks there are.

So if you run "-machine pc-i440fx-2.4 -cpu Penryn" it wouldn't matter
because, AFAIK - and correct me if I'm wrong here - the guest never got
to see the Action Required and Action Optional MCEs which are the result
of SER_P support. So the guest didn't care.

Yes, no, am I missing something completely here?

> I am just saying we already clear it when running on Linux
> v2.6.32-v2.6.36, no matter the host CPU or the -cpu options we
> have. And we do not clear it when running Linux v2.6.37 or newer.
> That's the behavior of pc-*-2.4 and older, even if we change it on
> future machine-types.

Right, ok.

So the fact that it was clear in the v2.6.32-v2.6.36 time frame and set
later, and nothing complained, *probably* confirms my theory that the
guest didn't really care about that setting and it probably doesn't now
either... Unless you try to use it, like I did. :-)

Thanks.
On Mon, Nov 23, 2015 at 05:43:14PM +0100, Borislav Petkov wrote:
> On Mon, Nov 23, 2015 at 01:11:27PM -0200, Eduardo Habkost wrote:
> > On Mon, Nov 23, 2015 at 11:22:37AM -0200, Eduardo Habkost wrote:
> > [...]
> > > In the case of this code, it looks like it's already broken
> > > because the resulting mcg_cap depends on host kernel capabilities
> > > (the ones reported by kvm_get_mce_cap_supported()), and the data
> > > initialized by target-i386/cpu.c:mce_init() is silently
> > > overwritten by kvm_arch_init_vcpu(). So we would need to fix that
> > > before implementing a proper compatibility mechanism for
> > > mcg_cap.
> >
> > Fortunately, when running Linux v2.6.37 and later,
> > kvm_arch_init_vcpu() won't actually change mcg_cap (see details
> > below).
> >
> > But the code is broken if running on Linux between v2.6.32 and
> > v2.6.36: it will clear MCG_SER_P silently (and silently enable
> > MCG_SER_P when migrating to a newer host).
> >
> > But I don't know what we should do on those cases. If we abort
> > initialization when the host doesn't support MCG_SER_P, all CPU
> > models with MCE and MCA enabled will become unrunnable on Linux
> > between v2.6.32 and v2.6.36. Should we do that, and simply ask
> > people to upgrade their kernels (or explicitly disable MCE) if
> > they want to run latest QEMU?
> >
> > For reference, these are the capabilities returned by Linux:
> > * KVM_MAX_MCE_BANKS is 32 since
> >   890ca9aefa78f7831f8f633cab9e4803636dffe4 (v2.6.32-rc1~693^2~199)
> > * KVM_MCE_CAP_SUPPORTED is (MCG_CTL_P | MCG_SER_P) since
> >   5854dbca9b235f8cdd414a0961018763d2d5bf77 (v2.6.37-rc1~142^2~3)
>
> The commit message of that one says that there is MCG_SER_P support in
> the kernel.
>
> The previous commit talks about MCE injection with KVM_X86_SET_MCE
> ioctl but frankly, I don't see that. From looking at the current code,
> KVM_X86_SET_MCE does kvm_vcpu_ioctl_x86_setup_mce() which simply sets
> MCG_CAP. And it gets those from KVM_X86_GET_MCE_CAP_SUPPORTED which is

KVM_X86_SET_MCE does not call kvm_vcpu_ioctl_x86_setup_mce(). It calls
kvm_vcpu_ioctl_x86_set_mce(), which stores the
IA32_MCi_{STATUS,ADDR,MISC} register contents at vcpu->arch.mce_banks.

>   #define KVM_MCE_CAP_SUPPORTED (MCG_CTL_P | MCG_SER_P)
>
> So it basically sets those two supported bits. But how is
>
>   supported == actually present
>
> ?!?!
>
> That soo doesn't make any sense.

I didn't check the QEMU MCE code to confirm that, but I assume it is
implemented there. In that case, MCG_SER_P in KVM_MCE_CAP_SUPPORTED just
indicates it can be implemented by userspace, as long as it makes the
appropriate KVM_X86_SET_MCE (or maybe KVM_SET_MSRS?) calls.
On Tue, Nov 24, 2015 at 02:36:20PM -0200, Eduardo Habkost wrote:
> KVM_X86_SET_MCE does not call kvm_vcpu_ioctl_x86_setup_mce(). It
> calls kvm_vcpu_ioctl_x86_set_mce(), which stores the
> IA32_MCi_{STATUS,ADDR,MISC} register contents at
> vcpu->arch.mce_banks.

Ah, correct. I've mistakenly followed KVM_X86_SETUP_MCE and not
KVM_X86_SET_MCE, sorry.

Ok, so this makes more sense now - there's kvm_inject_mce_oldstyle() in
qemu, and kvm_arch_on_sigbus_vcpu(), which is on the SIGBUS handler
path, actually does:

	if ((env->mcg_cap & MCG_SER_P) && addr &&
	    (code == BUS_MCEERR_AR || code == BUS_MCEERR_AO)) {
	...

I betcha that MCG_SER_P is set on every guest, even !Intel ones. I need
to go stare more at that code.

> I didn't check the QEMU MCE code to confirm that, but I assume it
> is implemented there. In that case, MCG_SER_P in
> KVM_MCE_CAP_SUPPORTED just indicates it can be implemented by
> userspace, as long as it makes the appropriate KVM_X86_SET_MCE
> (or maybe KVM_SET_MSRS?) calls.

I think it is that kvm_arch_on_sigbus_vcpu()/kvm_arch_on_sigbus() which
handles SIGBUS with BUS_MCEERR_AR/BUS_MCEERR_AO si_code. See
mm/memory-failure.c:kill_proc() in the kernel, where we do send those
signals to processes.

However, I still think the MCG_SER_P bit being set on !Intel is wrong,
even though the recovery action done by
kvm_arch_on_sigbus_vcpu()/kvm_arch_on_sigbus() is correct.

Why, you're asking. :-)

Well, what happens above is that the qemu process gets the signal that
there was an uncorrectable error detected in its memory and it is either
required to do something:

	BUS_MCEERR_AR == Action Required

or its action is optional:

	BUS_MCEERR_AO == Action Optional

The SER_P text in the SDM describes those two:

"SRAO errors indicate that some data in the system is corrupt, but the
data has not been consumed and the processor state is valid. SRAO errors
provide the additional error information for system software to perform
a recovery action. An SRAO error is indicated with UC=1, PCC=0, S=1,
EN=1 and AR=0 in the IA32_MCi_STATUS register."

and

"Software recoverable action required (SRAR) - a UCR error that requires
system software to take a recovery action on this processor before
scheduling another stream of execution on this processor. SRAR errors
indicate that the error was detected and raised at the point of the
consumption in the execution flow. An SRAR error is indicated with UC=1,
PCC=0, S=1, EN=1 and AR=1 in the IA32_MCi_STATUS register."

And for that we don't need to look at SER_P in qemu - we only need to
know what the severity of the error is and then we go and handle it
accordingly. Because those two si_codes are purely software-defined. And
the application which gets that SIGBUS type doesn't need to care about
SER_P.

For example, AMD has similar error severities and they can be injected
into qemu too. And qemu can do the exact same recovery actions based on
the severity without even looking at the SER_P bit.

So here's the problem:

* SER_P is set on all guests and it puzzles kernels running on !Intel
  guests.
* Hardware error recovery actions can be done regardless of that bit.

The only case where that bit makes sense is if the emulated hardware
itself is generating accurate MCEs and then, as a result, wants to
generate accurate error signatures:

	SRAO: UC=1, PCC=0, S=1, EN=1 and AR=0
	SRAR: UC=1, PCC=0, S=1, EN=1 and AR=1

Those bits should have these settings only when the emulated hw actually
implements SER_P. Otherwise, you'd get those old crude MCEs which are
either uncorrectable and generate an #MC, or are correctable errors.

But ok, let me go do some staring at the examples you sent me previously
first. I might get a better idea after I sleep on it. :-)

Thanks!
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 11e5e39a756a..8155ee94fbe1 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2803,13 +2803,6 @@ static void x86_cpu_apic_realize(X86CPU *cpu, Error **errp)
 }
 #endif
 
-
-#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
-                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
-                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
-#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
-                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
-                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
 static void x86_cpu_realizefn(DeviceState *dev, Error **errp)
 {
     CPUState *cs = CPU(dev);
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index fc4a605d6a29..2605c564239a 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -283,7 +283,7 @@
 #define MCG_CTL_P (1ULL<<8)   /* MCG_CAP register available */
 #define MCG_SER_P (1ULL<<24)  /* MCA recovery/new status bits */
 
-#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
+#define MCE_CAP_DEF MCG_CTL_P
 #define MCE_BANKS_DEF 10
 
 #define MCG_STATUS_RIPV (1ULL<<0)  /* restart ip valid */
@@ -610,6 +610,13 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
 #define CPUID_MWAIT_IBE (1U << 1)  /* Interrupts can exit capability */
 #define CPUID_MWAIT_EMX (1U << 0)  /* enumeration supported */
 
+#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
+                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
+                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
+#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
+                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
+                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
+
 #ifndef HYPERV_SPINLOCK_NEVER_RETRY
 #define HYPERV_SPINLOCK_NEVER_RETRY 0xFFFFFFFF
 #endif
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 2a9953b2d4b5..082d38d4838d 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -787,8 +787,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
         if (banks > MCE_BANKS_DEF) {
             banks = MCE_BANKS_DEF;
         }
+
         mcg_cap &= MCE_CAP_DEF;
         mcg_cap |= banks;
+
+	if (IS_INTEL_CPU(env))
+		mcg_cap |= MCG_SER_P;
+
         ret = kvm_vcpu_ioctl(cs, KVM_X86_SETUP_MCE, &mcg_cap);
         if (ret < 0) {
             fprintf(stderr, "KVM_X86_SETUP_MCE: %s", strerror(-ret));