diff mbox series

EDAC/mce_amd: Reduce unnecessary spew in dmesg if SMCA feature bit is not exposed

Message ID 20210614212129.227698-1-Smita.KoralahalliChannabasappa@amd.com (mailing list archive)
State New, archived
Headers show
Series EDAC/mce_amd: Reduce unnecessary spew in dmesg if SMCA feature bit is not exposed | expand

Commit Message

Smita Koralahalli June 14, 2021, 9:21 p.m. UTC
The SMCA feature bit is not exposed on the guest.

This causes a lot of noise in dmesg as the warning is printed for each
logical CPU.

$ dmesg |grep -i family

[    0.031000] smpboot: CPU0: AMD EPYC-Milan Processor (family: 0x19, model: 0x1, stepping: 0x1)
[    4.653422] Huh? What family is it: 0x19?!
[    4.720732] Huh? What family is it: 0x19?!
[    6.171028] Huh? What family is it: 0x19?!
[    6.766552] Huh? What family is it: 0x19?!
[    6.811119] Huh? What family is it: 0x19?!
[    6.839855] Huh? What family is it: 0x19?!

Give these messages debug severity and output once as it is mostly useful
for module developers and just noise for users.

Rephrase the statement to make it more meaningful.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
 drivers/edac/mce_amd.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

Comments

Borislav Petkov June 14, 2021, 10:01 p.m. UTC | #1
On Mon, Jun 14, 2021 at 04:21:29PM -0500, Smita Koralahalli wrote:
> The SMCA feature bit is not exposed on the guest.
> 
> This causes a lot of noise in dmesg as the warning is printed for each
> logical CPU.
> 
> $ dmesg |grep -i family
> 
> [    0.031000] smpboot: CPU0: AMD EPYC-Milan Processor (family: 0x19, model: 0x1, stepping: 0x1)
> [    4.653422] Huh? What family is it: 0x19?!
> [    4.720732] Huh? What family is it: 0x19?!
> [    6.171028] Huh? What family is it: 0x19?!
> [    6.766552] Huh? What family is it: 0x19?!
> [    6.811119] Huh? What family is it: 0x19?!
> [    6.839855] Huh? What family is it: 0x19?!
> 
> Give these messages debug severity and output once as it is mostly useful
> for module developers and just noise for users.

I'm getting a patch like that on a pretty regular basis and each time I
tell people simply not to load the module in guests. Hypervisors do not
support SMCA...

But apparently those wrong error messages bug people on a regular basis
and I'm tired of typing this each time so I'll take a different version
of this patch: check X86_FEATURE_HYPERVISOR on entry in mce_amd_init()
and return -ENODEV if set.

So that this is done once and for all.

Thx.
Luck, Tony June 14, 2021, 10:25 p.m. UTC | #2
> But apparently those wrong error messages bug people on a regular basis
> and I'm tired of typing this each time so I'll take a different version
> of this patch: check X86_FEATURE_HYPERVISOR on entry in mce_amd_init()
> and return -ENODEV if set.
>
> So that this is done once and for all.

I expect all the Intel EDAC drivers that load based on CPU model have similar
issues. Maybe they aren't whining as loudly about not being able to find the
memory controller devices?

They should also check for X86_FEATURE_HYPERVISOR too.

$ git grep -l x86_match_cpu -- drivers/edac
drivers/edac/amd64_edac.c
drivers/edac/i10nm_base.c
drivers/edac/ieh_edac.c
drivers/edac/pnd2_edac.c
drivers/edac/sb_edac.c
drivers/edac/skx_base.c

Though perhaps this is an issue outside of EDAC and x86_match_cpu()
could do the HYPERVISOR check and return no match. The few callers
who want to believe the fictional CPU model number passed in by the
VMM would need to use some new variant of the call?

-Tony
Borislav Petkov June 15, 2021, 9:16 a.m. UTC | #3
On Mon, Jun 14, 2021 at 10:25:36PM +0000, Luck, Tony wrote:
> I expect all the Intel EDAC drivers that load based on CPU model have similar
> issues. Maybe they aren't whining as loudly about not being able to find the
> memory controller devices?

Right.

> Though perhaps this is an issue outside of EDAC and x86_match_cpu()
> could do the HYPERVISOR check and return no match. The few callers
> who want to believe the fictional CPU model number passed in by the
> VMM would need to use some new variant of the call?

Yeah, we could do

X86_MATCH_VENDOR_FAM_MODEL_STEPPINGS_NOT_FEATURE

notice the "NOT" and have a x86_cpu_id.not_feature which to match
X86_FEATURE_HYPERVISOR. I'm not sure it is worth it, though, for a
handful of drivers.

The whole thing is a meh, why bother, but I got tired of this particular
intent of people wanting to shut this error message up just because they
should not load that driver in a VM in the first place.

But what happens is they boot a guest with -cpu host and in that case
that's a new CPU - family 0x19 - so it doesn't have a case 0x19 for the
pr_warn_once() there.

And instead of keep adding adding families there, I'd simply check
X86_FEATURE_HYPERVISOR.

Oh and that thing - mce_amd.c - doesn't use x86_match_cpu() so it has to
be an explicit check on function entry.

Thx.
Yazen Ghannam June 15, 2021, 3:08 p.m. UTC | #4
On Tue, Jun 15, 2021 at 11:16:49AM +0200, Borislav Petkov wrote:
> On Mon, Jun 14, 2021 at 10:25:36PM +0000, Luck, Tony wrote:
> > I expect all the Intel EDAC drivers that load based on CPU model have similar
> > issues. Maybe they aren't whining as loudly about not being able to find the
> > memory controller devices?
> 
> Right.
> 
> > Though perhaps this is an issue outside of EDAC and x86_match_cpu()
> > could do the HYPERVISOR check and return no match. The few callers
> > who want to believe the fictional CPU model number passed in by the
> > VMM would need to use some new variant of the call?
> 
> Yeah, we could do
> 
> X86_MATCH_VENDOR_FAM_MODEL_STEPPINGS_NOT_FEATURE
> 
> notice the "NOT" and have a x86_cpu_id.not_feature which to match
> X86_FEATURE_HYPERVISOR. I'm not sure it is worth it, though, for a
> handful of drivers.
> 
> The whole thing is a meh, why bother, but I got tired of this particular
> intent of people wanting to shut this error message up just because they
> should not load that driver in a VM in the first place.
> 
> But what happens is they boot a guest with -cpu host and in that case
> that's a new CPU - family 0x19 - so it doesn't have a case 0x19 for the
> pr_warn_once() there.
> 
> And instead of keep adding adding families there, I'd simply check
> X86_FEATURE_HYPERVISOR.
> 
> Oh and that thing - mce_amd.c - doesn't use x86_match_cpu() so it has to
> be an explicit check on function entry.
>

How about adding the the SMCA feature to the amd64_cpuids[] table in
amd64_edac.c?

We can use X86_MATCH_VENDOR_FEATURE to match on AMD (and Hygon) systems
with SMCA. And we can remove the X86_MATCH_VENDOR_FAM entries for
families 17h-19h.

I'm assuming the issue is that amd64_edac_mod is autoloading due to the
family-based device table, and this will load edac_mce_amd as a
dependency.

Thanks,
Yazen
Borislav Petkov June 15, 2021, 3:18 p.m. UTC | #5
On Tue, Jun 15, 2021 at 11:08:46AM -0400, Yazen Ghannam wrote:
> How about adding the the SMCA feature to the amd64_cpuids[] table in
> amd64_edac.c?
> 
> We can use X86_MATCH_VENDOR_FEATURE to match on AMD (and Hygon) systems
> with SMCA. And we can remove the X86_MATCH_VENDOR_FAM entries for
> families 17h-19h.

Sure. That'll alleviate the need to add new families which support SMCA
too.

> I'm assuming the issue is that amd64_edac_mod is autoloading due to the
> family-based device table, and this will load edac_mce_amd as a
> dependency.

Is it?

We have

early_initcall(mce_amd_init);

in mce_amd.c which attempts to load this thing unconditionally.
Yazen Ghannam June 15, 2021, 4 p.m. UTC | #6
On Tue, Jun 15, 2021 at 05:18:50PM +0200, Borislav Petkov wrote:
> On Tue, Jun 15, 2021 at 11:08:46AM -0400, Yazen Ghannam wrote:
> > How about adding the the SMCA feature to the amd64_cpuids[] table in
> > amd64_edac.c?
> > 
> > We can use X86_MATCH_VENDOR_FEATURE to match on AMD (and Hygon) systems
> > with SMCA. And we can remove the X86_MATCH_VENDOR_FAM entries for
> > families 17h-19h.
> 
> Sure. That'll alleviate the need to add new families which support SMCA
> too.
> 
> > I'm assuming the issue is that amd64_edac_mod is autoloading due to the
> > family-based device table, and this will load edac_mce_amd as a
> > dependency.
> 
> Is it?
> 
> We have
> 
> early_initcall(mce_amd_init);
> 
> in mce_amd.c which attempts to load this thing unconditionally.
>

I think edac_mce_amd is usually built as a module by distro configs, so
early_initcall() would be replaced by module_init(). But the default
option is built-in, so you're right about the early_initcall().

Also, you bring up a good point. We can't say to people "don't load the
module" if it's builtin. And I don't think it's fair to say "don't
build-in the module" if the default is "y".

So I think we can downgrade this warning to a debug message, if the
module stays builtin. And/or we change the default config option to
module, and we make sure the module only autoloads in the proper cases.

What do you think?

Thanks,
Yazen
Borislav Petkov June 15, 2021, 4:11 p.m. UTC | #7
On Tue, Jun 15, 2021 at 12:00:09PM -0400, Yazen Ghannam wrote:
> So I think we can downgrade this warning to a debug message, if the
> module stays builtin. And/or we change the default config option to
> module, and we make sure the module only autoloads in the proper cases.
> 
> What do you think?

I think, as I said before, that we should simply not load this in
guests. Then this will be taken care of once and for all and we can do
whatever we want on baremetal, as to what error message to issue and how
many times to issue it, whether the decoder is builtin or default y or
yadda yadda.

Because even if you say in a guest:

	pr_warn_once("Decoding supported only on Scalable MCA processors.\n");

you're kinda misleading because the guest might be an SMCA processor but
not all features are emulated, including SMCA. So it is not really an
SMCA processor but some not really existant hybrid or so.

So since this whole SMCA thing does not apply to virtualization, we
should simply not load on virt and be done with it.

Thx.
Yazen Ghannam June 15, 2021, 4:32 p.m. UTC | #8
On Tue, Jun 15, 2021 at 06:11:04PM +0200, Borislav Petkov wrote:
> On Tue, Jun 15, 2021 at 12:00:09PM -0400, Yazen Ghannam wrote:
> > So I think we can downgrade this warning to a debug message, if the
> > module stays builtin. And/or we change the default config option to
> > module, and we make sure the module only autoloads in the proper cases.
> > 
> > What do you think?
> 
> I think, as I said before, that we should simply not load this in
> guests. Then this will be taken care of once and for all and we can do
> whatever we want on baremetal, as to what error message to issue and how
> many times to issue it, whether the decoder is builtin or default y or
> yadda yadda.
> 
> Because even if you say in a guest:
> 
> 	pr_warn_once("Decoding supported only on Scalable MCA processors.\n");
> 
> you're kinda misleading because the guest might be an SMCA processor but
> not all features are emulated, including SMCA. So it is not really an
> SMCA processor but some not really existant hybrid or so.
> 
> So since this whole SMCA thing does not apply to virtualization, we
> should simply not load on virt and be done with it.
>

Yes, I agree. I was a bit confused about the X86_FEATURE_HYPERVISOR
thing, but I think I get it. This definitely looks simple to do.

Thanks,
Yazen
Luck, Tony June 15, 2021, 4:45 p.m. UTC | #9
There's little to no point in loading an EDAC driver running in a guest:
1) The CPU model reported by CPUID may not represent actual h/w
2) The hypervisor likely does not pass in access to memory controller devices
3) Hypervisors generally do not pass corrected error details to guests

Add a check in each of the Intel EDAC drivers for X86_FEATURE_HYPERVISOR
and simply return -ENODEV in the init routine.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 drivers/edac/i10nm_base.c | 3 +++
 drivers/edac/pnd2_edac.c  | 3 +++
 drivers/edac/sb_edac.c    | 3 +++
 drivers/edac/skx_base.c   | 3 +++
 4 files changed, 12 insertions(+)

diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c
index 238a4ad1e526..fa08622ff2a8 100644
--- a/drivers/edac/i10nm_base.c
+++ b/drivers/edac/i10nm_base.c
@@ -278,6 +278,9 @@ static int __init i10nm_init(void)
 	if (owner && strncmp(owner, EDAC_MOD_STR, sizeof(EDAC_MOD_STR)))
 		return -EBUSY;
 
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+		return -ENODEV;
+
 	id = x86_match_cpu(i10nm_cpuids);
 	if (!id)
 		return -ENODEV;
diff --git a/drivers/edac/pnd2_edac.c b/drivers/edac/pnd2_edac.c
index 928f63a374c7..5c7b36e2325f 100644
--- a/drivers/edac/pnd2_edac.c
+++ b/drivers/edac/pnd2_edac.c
@@ -1554,6 +1554,9 @@ static int __init pnd2_init(void)
 	if (owner && strncmp(owner, EDAC_MOD_STR, sizeof(EDAC_MOD_STR)))
 		return -EBUSY;
 
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+		return -ENODEV;
+
 	id = x86_match_cpu(pnd2_cpuids);
 	if (!id)
 		return -ENODEV;
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 93daa4297f2e..7d3b2b164e3c 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -3510,6 +3510,9 @@ static int __init sbridge_init(void)
 	if (owner && strncmp(owner, EDAC_MOD_STR, sizeof(EDAC_MOD_STR)))
 		return -EBUSY;
 
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+		return -ENODEV;
+
 	id = x86_match_cpu(sbridge_cpuids);
 	if (!id)
 		return -ENODEV;
diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c
index 6a4f0b27c654..c2f9da941217 100644
--- a/drivers/edac/skx_base.c
+++ b/drivers/edac/skx_base.c
@@ -656,6 +656,9 @@ static int __init skx_init(void)
 	if (owner && strncmp(owner, EDAC_MOD_STR, sizeof(EDAC_MOD_STR)))
 		return -EBUSY;
 
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+		return -ENODEV;
+
 	id = x86_match_cpu(skx_cpuids);
 	if (!id)
 		return -ENODEV;
Borislav Petkov June 15, 2021, 5 p.m. UTC | #10
On Tue, Jun 15, 2021 at 09:45:25AM -0700, Luck, Tony wrote:
> There's little to no point in loading an EDAC driver running in a guest:
> 1) The CPU model reported by CPUID may not represent actual h/w
> 2) The hypervisor likely does not pass in access to memory controller devices
> 3) Hypervisors generally do not pass corrected error details to guests
> 
> Add a check in each of the Intel EDAC drivers for X86_FEATURE_HYPERVISOR
> and simply return -ENODEV in the init routine.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  drivers/edac/i10nm_base.c | 3 +++
>  drivers/edac/pnd2_edac.c  | 3 +++
>  drivers/edac/sb_edac.c    | 3 +++
>  drivers/edac/skx_base.c   | 3 +++
>  4 files changed, 12 insertions(+)

If you insist... I mean, those drivers won't load for other reasons but
sure.

> diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c
> index 238a4ad1e526..fa08622ff2a8 100644
> --- a/drivers/edac/i10nm_base.c
> +++ b/drivers/edac/i10nm_base.c
> @@ -278,6 +278,9 @@ static int __init i10nm_init(void)
>  	if (owner && strncmp(owner, EDAC_MOD_STR, sizeof(EDAC_MOD_STR)))
>  		return -EBUSY;
>  
> +	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))

Can you change that to cpu_feature_enabled() everywhere pls?

We're going to use that single API for checking CPU features from now on
and slowly move away from the others to simplify things.

With that changed you can add:

Acked-by: Borislav Petkov <bp@suse.de>

Thx.
Borislav Petkov June 15, 2021, 5:25 p.m. UTC | #11
On Tue, Jun 15, 2021 at 12:32:21PM -0400, Yazen Ghannam wrote:
> Yes, I agree. I was a bit confused about the X86_FEATURE_HYPERVISOR
> thing, but I think I get it. This definitely looks simple to do.

Yeah, if you look at the qemu repo:

kvm_arch_get_supported_cpuid:

...

    } else if (function == 1 && reg == R_ECX) {
        /* We can set the hypervisor flag, even if KVM does not return it on
         * GET_SUPPORTED_CPUID
         */
        ret |= CPUID_EXT_HYPERVISOR;


so long story short, that thing is set by the HV and there's an
"agreement" of sorts and APM v3 even says so:

"CPUID Fn0000_0001_ECX Feature Identifiers

...

31:	RAZ. Reserved for use by hypervisor to indicate guest status."

HTH.
diff mbox series

Patch

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 5dd905a3f30c..2a9899088389 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1227,13 +1227,9 @@  static int __init mce_amd_init(void)
 		fam_ops.mc2_mce = f16h_mc2_mce;
 		break;
 
-	case 0x17:
-	case 0x18:
-		pr_warn_once("Decoding supported only on Scalable MCA processors.\n");
-		return -EINVAL;
-
 	default:
-		printk(KERN_WARNING "Huh? What family is it: 0x%x?!\n", c->x86);
+		pr_debug_once("MCE decoding is not supported for family: 0x%x\n",
+			      c->x86);
 		return -EINVAL;
 	}