Message ID | 87108f1d-4b13-4c1e-9432-4f14d4f5c12d@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | x86/NMI: refine "watchdog stuck" log message | expand |
On Wed, Jan 24, 2024 at 04:21:24PM +0100, Jan Beulich wrote: > Observing > > "Testing NMI watchdog on all CPUs: 0 stuck" > > it felt like it's not quite right, but I still read it as "no CPU stuck; > all good", when really the system suffered from what 6bdb965178bb > ("x86/intel: ensure Global Performance Counter Control is setup > correctly") works around. Convert this to > > "Testing NMI watchdog on all CPUs: {0} stuck" To make this even more obvious, maybe it could be prefixed with "error": "Testing NMI watchdog on all CPUs: error {0} stuck" Hm, albeit I don't like it that much > > or, with multiple CPUs having an issue, e.g. > > "Testing NMI watchdog on all CPUs: {0,40} stuck" > > to make more obvious that a lone number is not a count of CPUs. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> > --- > In principle "sep" could also fulfill the job of "ok"; it felt to me as > if this may not be liked very much, though. I think I prefer it the current way. Thanks, Roger.
On 24/01/2024 3:21 pm, Jan Beulich wrote: > Observing > > "Testing NMI watchdog on all CPUs: 0 stuck" > > it felt like it's not quite right, but I still read it as "no CPU stuck; > all good", when really the system suffered from what 6bdb965178bb > ("x86/intel: ensure Global Performance Counter Control is setup > correctly") works around. Convert this to > > "Testing NMI watchdog on all CPUs: {0} stuck" > > or, with multiple CPUs having an issue, e.g. > > "Testing NMI watchdog on all CPUs: {0,40} stuck" > > to make more obvious that a lone number is not a count of CPUs. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> I'd forgotten it was still opencoded like this. Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, but if you felt turning this into using a scratch cpumask and %*pb then I think that would be nicer still. ~Andrew
On 24.01.2024 16:55, Andrew Cooper wrote: > On 24/01/2024 3:21 pm, Jan Beulich wrote: >> Observing >> >> "Testing NMI watchdog on all CPUs: 0 stuck" >> >> it felt like it's not quite right, but I still read it as "no CPU stuck; >> all good", when really the system suffered from what 6bdb965178bb >> ("x86/intel: ensure Global Performance Counter Control is setup >> correctly") works around. Convert this to >> >> "Testing NMI watchdog on all CPUs: {0} stuck" >> >> or, with multiple CPUs having an issue, e.g. >> >> "Testing NMI watchdog on all CPUs: {0,40} stuck" >> >> to make more obvious that a lone number is not a count of CPUs. >> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > I'd forgotten it was still opencoded like this. > > Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, Thanks (Roger, also to you). > but if you felt > turning this into using a scratch cpumask and %*pb then I think that > would be nicer still. I expected you might say that, and indeed I considered it. But then I wanted to keep code churn limited. The way it's done it's imo an almost purely mechanical change. Much like with Roger's consideration of further refining the message text in the error case. Jan
On Thu, Jan 25, 2024 at 09:05:13AM +0100, Jan Beulich wrote: > On 24.01.2024 16:55, Andrew Cooper wrote: > > On 24/01/2024 3:21 pm, Jan Beulich wrote: > > but if you felt > > turning this into using a scratch cpumask and %*pb then I think that > > would be nicer still. > > I expected you might say that, and indeed I considered it. But then > I wanted to keep code churn limited. The way it's done it's imo an > almost purely mechanical change. Much like with Roger's consideration > of further refining the message text in the error case. One suggestion I had in mind (which is not explicitly related to the fix here) would be to taint the hypervisor when we hit such selftests errors, as that would make it more obvious to go search the dmesg for reported failures. I think it's still fairly easy to miss noticing the watchdog selftest failed on some CPUs. Regards, Roger.
--- a/xen/arch/x86/nmi.c +++ b/xen/arch/x86/nmi.c @@ -167,13 +167,14 @@ static void __init cf_check wait_for_nmi void __init check_nmi_watchdog(void) { static unsigned int __initdata prev_nmi_count[NR_CPUS]; - int cpu; + unsigned int cpu; + char sep = '{'; bool ok = true; if ( nmi_watchdog == NMI_NONE ) return; - printk("Testing NMI watchdog on all CPUs:"); + printk("Testing NMI watchdog on all CPUs: "); for_each_online_cpu ( cpu ) prev_nmi_count[cpu] = per_cpu(nmi_count, cpu); @@ -189,12 +190,13 @@ void __init check_nmi_watchdog(void) { if ( per_cpu(nmi_count, cpu) - prev_nmi_count[cpu] < 2 ) { - printk(" %d", cpu); + printk("%c%u", sep, cpu); + sep = ','; ok = false; } } - printk(" %s\n", ok ? "ok" : "stuck"); + printk("%s\n", ok ? "ok" : "} stuck"); /* * Now that we know it works we can reduce NMI frequency to
Observing "Testing NMI watchdog on all CPUs: 0 stuck" it felt like it's not quite right, but I still read it as "no CPU stuck; all good", when really the system suffered from what 6bdb965178bb ("x86/intel: ensure Global Performance Counter Control is setup correctly") works around. Convert this to "Testing NMI watchdog on all CPUs: {0} stuck" or, with multiple CPUs having an issue, e.g. "Testing NMI watchdog on all CPUs: {0,40} stuck" to make more obvious that a lone number is not a count of CPUs. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- In principle "sep" could also fulfill the job of "ok"; it felt to me as if this may not be liked very much, though.