x86/NMI: refine "watchdog stuck" log message

Message ID	87108f1d-4b13-4c1e-9432-4f14d4f5c12d@suse.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xenproject.org> Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> Message-ID: <87108f1d-4b13-4c1e-9432-4f14d4f5c12d@suse.com> Date: Wed, 24 Jan 2024 16:21:24 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Content-Language: en-US To: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Wei Liu <wl@xen.org>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com> From: Jan Beulich <jbeulich@suse.com> Subject: [PATCH] x86/NMI: refine "watchdog stuck" log message Autocrypt: addr=jbeulich@suse.com; keydata= xsDiBFk3nEQRBADAEaSw6zC/EJkiwGPXbWtPxl2xCdSoeepS07jW8UgcHNurfHvUzogEq5xk hu507c3BarVjyWCJOylMNR98Yd8VqD9UfmX0Hb8/BrA+Hl6/DB/eqGptrf4BSRwcZQM32aZK 7Pj2XbGWIUrZrd70x1eAP9QE3P79Y2oLrsCgbZJfEwCgvz9JjGmQqQkRiTVzlZVCJYcyGGsD /0tbFCzD2h20ahe8rC1gbb3K3qk+LpBtvjBu1RY9drYk0NymiGbJWZgab6t1jM7sk2vuf0Py O9Hf9XBmK0uE9IgMaiCpc32XV9oASz6UJebwkX+zF2jG5I1BfnO9g7KlotcA/v5ClMjgo6Gl MDY4HxoSRu3i1cqqSDtVlt+AOVBJBACrZcnHAUSuCXBPy0jOlBhxPqRWv6ND4c9PH1xjQ3NP nxJuMBS8rnNg22uyfAgmBKNLpLgAGVRMZGaGoJObGf72s6TeIqKJo/LtggAS9qAUiuKVnygo 3wjfkS9A3DRO+SpU7JqWdsveeIQyeyEJ/8PTowmSQLakF+3fote9ybzd880fSmFuIEJldWxp Y2ggPGpiZXVsaWNoQHN1c2UuY29tPsJgBBMRAgAgBQJZN5xEAhsDBgsJCAcDAgQVAggDBBYC AwECHgECF4AACgkQoDSui/t3IH4J+wCfQ5jHdEjCRHj23O/5ttg9r9OIruwAn3103WUITZee e7Sbg12UgcQ5lv7SzsFNBFk3nEQQCACCuTjCjFOUdi5Nm244F+78kLghRcin/awv+IrTcIWF hUpSs1Y91iQQ7KItirz5uwCPlwejSJDQJLIS+QtJHaXDXeV6NI0Uef1hP20+y8qydDiVkv6l IreXjTb7DvksRgJNvCkWtYnlS3mYvQ9NzS9PhyALWbXnH6sIJd2O9lKS1Mrfq+y0IXCP10eS FFGg+Av3IQeFatkJAyju0PPthyTqxSI4lZYuJVPknzgaeuJv/2NccrPvmeDg6Coe7ZIeQ8Yj t0ARxu2xytAkkLCel1Lz1WLmwLstV30g80nkgZf/wr+/BXJW/oIvRlonUkxv+IbBM3dX2OV8 AmRv1ySWPTP7AAMFB/9PQK/VtlNUJvg8GXj9ootzrteGfVZVVT4XBJkfwBcpC/XcPzldjv+3 HYudvpdNK3lLujXeA5fLOH+Z/G9WBc5pFVSMocI71I8bT8lIAzreg0WvkWg5V2WZsUMlnDL9 mpwIGFhlbM3gfDMs7MPMu8YQRFVdUvtSpaAs8OFfGQ0ia3LGZcjA6Ik2+xcqscEJzNH+qh8V m5jjp28yZgaqTaRbg3M/+MTbMpicpZuqF4rnB0AQD12/3BNWDR6bmh+EkYSMcEIpQmBM51qM EKYTQGybRCjpnKHGOxG0rfFY1085mBDZCH5Kx0cl0HVJuQKC+dV2ZY5AqjcKwAxpE75MLFkr wkkEGBECAAkFAlk3nEQCGwwACgkQoDSui/t3IH7nnwCfcJWUDUFKdCsBH/E5d+0ZnMQi+G0A nAuWpQkjM1ASeQwSHEeAWPgskBQL Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit
Series	x86/NMI: refine "watchdog stuck" log message \| expand x86/NMI: refine "watchdog stuck" log message

Message ID

87108f1d-4b13-4c1e-9432-4f14d4f5c12d@suse.com (mailing list archive)

State

New, archived

Headers

Errors-To: xen-devel-bounces@lists.xenproject.org
Precedence: list
Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org>
Message-ID: <87108f1d-4b13-4c1e-9432-4f14d4f5c12d@suse.com>
Date: Wed, 24 Jan 2024 16:21:24 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Content-Language: en-US
To: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Wei Liu <wl@xen.org>,
	=?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>
From: Jan Beulich <jbeulich@suse.com>
Subject: [PATCH] x86/NMI: refine "watchdog stuck" log message
Autocrypt: addr=jbeulich@suse.com; keydata=
 xsDiBFk3nEQRBADAEaSw6zC/EJkiwGPXbWtPxl2xCdSoeepS07jW8UgcHNurfHvUzogEq5xk
 hu507c3BarVjyWCJOylMNR98Yd8VqD9UfmX0Hb8/BrA+Hl6/DB/eqGptrf4BSRwcZQM32aZK
 7Pj2XbGWIUrZrd70x1eAP9QE3P79Y2oLrsCgbZJfEwCgvz9JjGmQqQkRiTVzlZVCJYcyGGsD
 /0tbFCzD2h20ahe8rC1gbb3K3qk+LpBtvjBu1RY9drYk0NymiGbJWZgab6t1jM7sk2vuf0Py
 O9Hf9XBmK0uE9IgMaiCpc32XV9oASz6UJebwkX+zF2jG5I1BfnO9g7KlotcA/v5ClMjgo6Gl
 MDY4HxoSRu3i1cqqSDtVlt+AOVBJBACrZcnHAUSuCXBPy0jOlBhxPqRWv6ND4c9PH1xjQ3NP
 nxJuMBS8rnNg22uyfAgmBKNLpLgAGVRMZGaGoJObGf72s6TeIqKJo/LtggAS9qAUiuKVnygo
 3wjfkS9A3DRO+SpU7JqWdsveeIQyeyEJ/8PTowmSQLakF+3fote9ybzd880fSmFuIEJldWxp
 Y2ggPGpiZXVsaWNoQHN1c2UuY29tPsJgBBMRAgAgBQJZN5xEAhsDBgsJCAcDAgQVAggDBBYC
 AwECHgECF4AACgkQoDSui/t3IH4J+wCfQ5jHdEjCRHj23O/5ttg9r9OIruwAn3103WUITZee
 e7Sbg12UgcQ5lv7SzsFNBFk3nEQQCACCuTjCjFOUdi5Nm244F+78kLghRcin/awv+IrTcIWF
 hUpSs1Y91iQQ7KItirz5uwCPlwejSJDQJLIS+QtJHaXDXeV6NI0Uef1hP20+y8qydDiVkv6l
 IreXjTb7DvksRgJNvCkWtYnlS3mYvQ9NzS9PhyALWbXnH6sIJd2O9lKS1Mrfq+y0IXCP10eS
 FFGg+Av3IQeFatkJAyju0PPthyTqxSI4lZYuJVPknzgaeuJv/2NccrPvmeDg6Coe7ZIeQ8Yj
 t0ARxu2xytAkkLCel1Lz1WLmwLstV30g80nkgZf/wr+/BXJW/oIvRlonUkxv+IbBM3dX2OV8
 AmRv1ySWPTP7AAMFB/9PQK/VtlNUJvg8GXj9ootzrteGfVZVVT4XBJkfwBcpC/XcPzldjv+3
 HYudvpdNK3lLujXeA5fLOH+Z/G9WBc5pFVSMocI71I8bT8lIAzreg0WvkWg5V2WZsUMlnDL9
 mpwIGFhlbM3gfDMs7MPMu8YQRFVdUvtSpaAs8OFfGQ0ia3LGZcjA6Ik2+xcqscEJzNH+qh8V
 m5jjp28yZgaqTaRbg3M/+MTbMpicpZuqF4rnB0AQD12/3BNWDR6bmh+EkYSMcEIpQmBM51qM
 EKYTQGybRCjpnKHGOxG0rfFY1085mBDZCH5Kx0cl0HVJuQKC+dV2ZY5AqjcKwAxpE75MLFkr
 wkkEGBECAAkFAlk3nEQCGwwACgkQoDSui/t3IH7nnwCfcJWUDUFKdCsBH/E5d+0ZnMQi+G0A
 nAuWpQkjM1ASeQwSHEeAWPgskBQL
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

Series

x86/NMI: refine "watchdog stuck" log message | expand

Commit Message

Jan Beulich Jan. 24, 2024, 3:21 p.m. UTC

Observing

"Testing NMI watchdog on all CPUs: 0 stuck"

it felt like it's not quite right, but I still read it as "no CPU stuck;
all good", when really the system suffered from what 6bdb965178bb
("x86/intel: ensure Global Performance Counter Control is setup
correctly") works around. Convert this to

"Testing NMI watchdog on all CPUs: {0} stuck"

or, with multiple CPUs having an issue, e.g.

"Testing NMI watchdog on all CPUs: {0,40} stuck"

to make more obvious that a lone number is not a count of CPUs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
In principle "sep" could also fulfill the job of "ok"; it felt to me as
if this may not be liked very much, though.

Comments

Roger Pau Monné Jan. 24, 2024, 3:52 p.m. UTC | #1

On Wed, Jan 24, 2024 at 04:21:24PM +0100, Jan Beulich wrote:
> Observing
> 
> "Testing NMI watchdog on all CPUs: 0 stuck"
> 
> it felt like it's not quite right, but I still read it as "no CPU stuck;
> all good", when really the system suffered from what 6bdb965178bb
> ("x86/intel: ensure Global Performance Counter Control is setup
> correctly") works around. Convert this to
> 
> "Testing NMI watchdog on all CPUs: {0} stuck"

To make this even more obvious, maybe it could be prefixed with "error":

"Testing NMI watchdog on all CPUs: error {0} stuck"

Hm, albeit I don't like it that much

> 
> or, with multiple CPUs having an issue, e.g.
> 
> "Testing NMI watchdog on all CPUs: {0,40} stuck"
> 
> to make more obvious that a lone number is not a count of CPUs.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> In principle "sep" could also fulfill the job of "ok"; it felt to me as
> if this may not be liked very much, though.

I think I prefer it the current way.

Thanks, Roger.

Andrew Cooper Jan. 24, 2024, 3:55 p.m. UTC | #2

On 24/01/2024 3:21 pm, Jan Beulich wrote:
> Observing
>
> "Testing NMI watchdog on all CPUs: 0 stuck"
>
> it felt like it's not quite right, but I still read it as "no CPU stuck;
> all good", when really the system suffered from what 6bdb965178bb
> ("x86/intel: ensure Global Performance Counter Control is setup
> correctly") works around. Convert this to
>
> "Testing NMI watchdog on all CPUs: {0} stuck"
>
> or, with multiple CPUs having an issue, e.g.
>
> "Testing NMI watchdog on all CPUs: {0,40} stuck"
>
> to make more obvious that a lone number is not a count of CPUs.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I'd forgotten it was still opencoded like this.

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, but if you felt
turning this into using a scratch cpumask and %*pb then I think that
would be nicer still.

~Andrew

Jan Beulich Jan. 25, 2024, 8:05 a.m. UTC | #3

On 24.01.2024 16:55, Andrew Cooper wrote:
> On 24/01/2024 3:21 pm, Jan Beulich wrote:
>> Observing
>>
>> "Testing NMI watchdog on all CPUs: 0 stuck"
>>
>> it felt like it's not quite right, but I still read it as "no CPU stuck;
>> all good", when really the system suffered from what 6bdb965178bb
>> ("x86/intel: ensure Global Performance Counter Control is setup
>> correctly") works around. Convert this to
>>
>> "Testing NMI watchdog on all CPUs: {0} stuck"
>>
>> or, with multiple CPUs having an issue, e.g.
>>
>> "Testing NMI watchdog on all CPUs: {0,40} stuck"
>>
>> to make more obvious that a lone number is not a count of CPUs.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> I'd forgotten it was still opencoded like this.
> 
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>,

Thanks (Roger, also to you).

> but if you felt
> turning this into using a scratch cpumask and %*pb then I think that
> would be nicer still.

I expected you might say that, and indeed I considered it. But then
I wanted to keep code churn limited. The way it's done it's imo an
almost purely mechanical change. Much like with Roger's consideration
of further refining the message text in the error case.

Jan

Roger Pau Monné Jan. 25, 2024, 9:10 a.m. UTC | #4

On Thu, Jan 25, 2024 at 09:05:13AM +0100, Jan Beulich wrote:
> On 24.01.2024 16:55, Andrew Cooper wrote:
> > On 24/01/2024 3:21 pm, Jan Beulich wrote:
> > but if you felt
> > turning this into using a scratch cpumask and %*pb then I think that
> > would be nicer still.
> 
> I expected you might say that, and indeed I considered it. But then
> I wanted to keep code churn limited. The way it's done it's imo an
> almost purely mechanical change. Much like with Roger's consideration
> of further refining the message text in the error case.

One suggestion I had in mind (which is not explicitly related to the
fix here) would be to taint the hypervisor when we hit such selftests
errors, as that would make it more obvious to go search the dmesg for
reported failures. I think it's still fairly easy to miss noticing the
watchdog selftest failed on some CPUs.

Regards, Roger.

--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -167,13 +167,14 @@  static void __init cf_check wait_for_nmi
 void __init check_nmi_watchdog(void)
 {
     static unsigned int __initdata prev_nmi_count[NR_CPUS];
-    int cpu;
+    unsigned int cpu;
+    char sep = '{';
     bool ok = true;
 
     if ( nmi_watchdog == NMI_NONE )
         return;
 
-    printk("Testing NMI watchdog on all CPUs:");
+    printk("Testing NMI watchdog on all CPUs: ");
 
     for_each_online_cpu ( cpu )
         prev_nmi_count[cpu] = per_cpu(nmi_count, cpu);
@@ -189,12 +190,13 @@  void __init check_nmi_watchdog(void)
     {
         if ( per_cpu(nmi_count, cpu) - prev_nmi_count[cpu] < 2 )
         {
-            printk(" %d", cpu);
+            printk("%c%u", sep, cpu);
+            sep = ',';
             ok = false;
         }
     }
 
-    printk(" %s\n", ok ? "ok" : "stuck");
+    printk("%s\n", ok ? "ok" : "} stuck");
 
     /*
      * Now that we know it works we can reduce NMI frequency to

x86/NMI: refine "watchdog stuck" log message

Commit Message

Comments

Patch