diff mbox series

x86/NMI: refine "watchdog stuck" log message

Message ID 87108f1d-4b13-4c1e-9432-4f14d4f5c12d@suse.com (mailing list archive)
State New, archived
Headers show
Series x86/NMI: refine "watchdog stuck" log message | expand

Commit Message

Jan Beulich Jan. 24, 2024, 3:21 p.m. UTC
Observing

"Testing NMI watchdog on all CPUs: 0 stuck"

it felt like it's not quite right, but I still read it as "no CPU stuck;
all good", when really the system suffered from what 6bdb965178bb
("x86/intel: ensure Global Performance Counter Control is setup
correctly") works around. Convert this to

"Testing NMI watchdog on all CPUs: {0} stuck"

or, with multiple CPUs having an issue, e.g.

"Testing NMI watchdog on all CPUs: {0,40} stuck"

to make more obvious that a lone number is not a count of CPUs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
In principle "sep" could also fulfill the job of "ok"; it felt to me as
if this may not be liked very much, though.

Comments

Roger Pau Monné Jan. 24, 2024, 3:52 p.m. UTC | #1
On Wed, Jan 24, 2024 at 04:21:24PM +0100, Jan Beulich wrote:
> Observing
> 
> "Testing NMI watchdog on all CPUs: 0 stuck"
> 
> it felt like it's not quite right, but I still read it as "no CPU stuck;
> all good", when really the system suffered from what 6bdb965178bb
> ("x86/intel: ensure Global Performance Counter Control is setup
> correctly") works around. Convert this to
> 
> "Testing NMI watchdog on all CPUs: {0} stuck"

To make this even more obvious, maybe it could be prefixed with "error":

"Testing NMI watchdog on all CPUs: error {0} stuck"

Hm, albeit I don't like it that much

> 
> or, with multiple CPUs having an issue, e.g.
> 
> "Testing NMI watchdog on all CPUs: {0,40} stuck"
> 
> to make more obvious that a lone number is not a count of CPUs.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> In principle "sep" could also fulfill the job of "ok"; it felt to me as
> if this may not be liked very much, though.

I think I prefer it the current way.

Thanks, Roger.
Andrew Cooper Jan. 24, 2024, 3:55 p.m. UTC | #2
On 24/01/2024 3:21 pm, Jan Beulich wrote:
> Observing
>
> "Testing NMI watchdog on all CPUs: 0 stuck"
>
> it felt like it's not quite right, but I still read it as "no CPU stuck;
> all good", when really the system suffered from what 6bdb965178bb
> ("x86/intel: ensure Global Performance Counter Control is setup
> correctly") works around. Convert this to
>
> "Testing NMI watchdog on all CPUs: {0} stuck"
>
> or, with multiple CPUs having an issue, e.g.
>
> "Testing NMI watchdog on all CPUs: {0,40} stuck"
>
> to make more obvious that a lone number is not a count of CPUs.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I'd forgotten it was still opencoded like this.

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, but if you felt
turning this into using a scratch cpumask and %*pb then I think that
would be nicer still.

~Andrew
Jan Beulich Jan. 25, 2024, 8:05 a.m. UTC | #3
On 24.01.2024 16:55, Andrew Cooper wrote:
> On 24/01/2024 3:21 pm, Jan Beulich wrote:
>> Observing
>>
>> "Testing NMI watchdog on all CPUs: 0 stuck"
>>
>> it felt like it's not quite right, but I still read it as "no CPU stuck;
>> all good", when really the system suffered from what 6bdb965178bb
>> ("x86/intel: ensure Global Performance Counter Control is setup
>> correctly") works around. Convert this to
>>
>> "Testing NMI watchdog on all CPUs: {0} stuck"
>>
>> or, with multiple CPUs having an issue, e.g.
>>
>> "Testing NMI watchdog on all CPUs: {0,40} stuck"
>>
>> to make more obvious that a lone number is not a count of CPUs.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> I'd forgotten it was still opencoded like this.
> 
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>,

Thanks (Roger, also to you).

> but if you felt
> turning this into using a scratch cpumask and %*pb then I think that
> would be nicer still.

I expected you might say that, and indeed I considered it. But then
I wanted to keep code churn limited. The way it's done it's imo an
almost purely mechanical change. Much like with Roger's consideration
of further refining the message text in the error case.

Jan
Roger Pau Monné Jan. 25, 2024, 9:10 a.m. UTC | #4
On Thu, Jan 25, 2024 at 09:05:13AM +0100, Jan Beulich wrote:
> On 24.01.2024 16:55, Andrew Cooper wrote:
> > On 24/01/2024 3:21 pm, Jan Beulich wrote:
> > but if you felt
> > turning this into using a scratch cpumask and %*pb then I think that
> > would be nicer still.
> 
> I expected you might say that, and indeed I considered it. But then
> I wanted to keep code churn limited. The way it's done it's imo an
> almost purely mechanical change. Much like with Roger's consideration
> of further refining the message text in the error case.

One suggestion I had in mind (which is not explicitly related to the
fix here) would be to taint the hypervisor when we hit such selftests
errors, as that would make it more obvious to go search the dmesg for
reported failures. I think it's still fairly easy to miss noticing the
watchdog selftest failed on some CPUs.

Regards, Roger.
diff mbox series

Patch

--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -167,13 +167,14 @@  static void __init cf_check wait_for_nmi
 void __init check_nmi_watchdog(void)
 {
     static unsigned int __initdata prev_nmi_count[NR_CPUS];
-    int cpu;
+    unsigned int cpu;
+    char sep = '{';
     bool ok = true;
 
     if ( nmi_watchdog == NMI_NONE )
         return;
 
-    printk("Testing NMI watchdog on all CPUs:");
+    printk("Testing NMI watchdog on all CPUs: ");
 
     for_each_online_cpu ( cpu )
         prev_nmi_count[cpu] = per_cpu(nmi_count, cpu);
@@ -189,12 +190,13 @@  void __init check_nmi_watchdog(void)
     {
         if ( per_cpu(nmi_count, cpu) - prev_nmi_count[cpu] < 2 )
         {
-            printk(" %d", cpu);
+            printk("%c%u", sep, cpu);
+            sep = ',';
             ok = false;
         }
     }
 
-    printk(" %s\n", ok ? "ok" : "stuck");
+    printk("%s\n", ok ? "ok" : "} stuck");
 
     /*
      * Now that we know it works we can reduce NMI frequency to