diff mbox series

xen: add support for automatic debug key actions in case of crash

Message ID 20201022143905.11032-1-jgross@suse.com (mailing list archive)
State Superseded
Headers show
Series xen: add support for automatic debug key actions in case of crash | expand

Commit Message

Jürgen Groß Oct. 22, 2020, 2:39 p.m. UTC
When the host crashes it would sometimes be nice to have additional
debug data available which could be produced via debug keys, but
halting the server for manual intervention might be impossible due to
the need to reboot/kexec rather sooner than later.

Add support for automatic debug key actions in case of crashes which
can be activated via boot- or runtime-parameter.

Depending on the type of crash the desired data might be different, so
support different settings for the possible types of crashes.

The parameter is "crash-debug" with the following syntax:

  crash-debug-<type>=<string>

with <type> being one of:

  panic, hwdom, watchdog, kexeccmd, debugkey

and <string> a sequence of debug key characters with '.' having the
special semantics of a 1 second pause.

So "crash-debug-watchdog=0.0qr" would result in special output in case
of watchdog triggered crash (dom0 state, 1 second pause, dom0 state,
domain info, run queues).

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 docs/misc/xen-command-line.pandoc | 23 +++++++++++++++++++
 xen/common/kexec.c                |  8 ++++---
 xen/common/keyhandler.c           | 37 +++++++++++++++++++++++++++++++
 xen/common/shutdown.c             |  4 ++--
 xen/drivers/char/console.c        |  2 +-
 xen/include/xen/kexec.h           | 10 +++++++--
 xen/include/xen/keyhandler.h      | 11 +++++++++
 7 files changed, 87 insertions(+), 8 deletions(-)

Comments

Jan Beulich Oct. 29, 2020, 2:22 p.m. UTC | #1
On 22.10.2020 16:39, Juergen Gross wrote:
> When the host crashes it would sometimes be nice to have additional
> debug data available which could be produced via debug keys, but
> halting the server for manual intervention might be impossible due to
> the need to reboot/kexec rather sooner than later.
> 
> Add support for automatic debug key actions in case of crashes which
> can be activated via boot- or runtime-parameter.

While I can certainly see this possibly being a useful thing in
certain situations, I'm uncertain it's going to be helpful in at
least a fair set of cases. What output to request very much
depends on the nature of the problem one is running into, and
the more keys one adds "just in case", the longer the reboot
latency, and the higher the risk (see also below) of the output
generation actually causing further problems.

IOW I'm neither fully convinced that we want this, nor fully
opposed.

> Depending on the type of crash the desired data might be different, so
> support different settings for the possible types of crashes.
> 
> The parameter is "crash-debug" with the following syntax:
> 
>   crash-debug-<type>=<string>
> 
> with <type> being one of:
> 
>   panic, hwdom, watchdog, kexeccmd, debugkey
> 
> and <string> a sequence of debug key characters with '.' having the
> special semantics of a 1 second pause.

1 second is a whole lot of time. To get two successive sets
of data, a much shorter delay (if any) would normally suffice.

Also, while '.' may seem like a good choice right now, with the
shortage of characters we may want to put a real handler behind
it at come point. The one character that clearly won't make much
sense to use in this context is 'h', but that's awful as a (kind
of) separator. Could we perhaps replace 'h' by '?', freeing up
'h' and allowing '?' to be used for this purpose here?

> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -574,6 +574,29 @@ reduction of features at Xen's disposal to manage guests.
>  ### cpuinfo (x86)
>  > `= <boolean>`
>  
> +### crash-debug-debugkey
> +### crash-debug-hwdom
> +### crash-debug-kexeccmd
> +### crash-debug-panic
> +### crash-debug-watchdog
> +> `= <string>`
> +
> +> Can be modified at runtime
> +
> +Specify debug-key actions in cases of crashes. Each of the parameters applies
> +to a different crash reason. The `<string>` is a sequence of debug key
> +characters, with `.` having the special meaning of a 1 second pause.
> +
> +So e.g. `crash-debug-watchdog=0.0r` would dump dom0 state twice with a
> +second between the two state dumps, followed by the run queues of the
> +hypervisor, if the system crashes due to a watchdog timeout.
> +
> +These parameters should be used carefully, as e.g. specifying
> +`crash-debug-debugkey=C` would result in an endless loop. Depending on the
> +reason of the system crash it might happen that triggering some debug key
> +action will result in a hang instead of dumping data and then doing a
> +reboot or crash dump.

I think it would be useful if the flavors were (briefly)
explained: At the very least "debugkey" doesn't directly fit "in
cases of crashes", as there's no crash involved. kexec_crash()
instead gets invoked without there having been any crash.

You may also want to point out that this is a best effort thing
only - system state at the point of a crash may be such that the
attempt of handling one or the debug keys would have further bad
effects on the system, including that the actual kexec may then
never occur.

> @@ -507,6 +509,41 @@ void __init initialize_keytable(void)
>      }
>  }
>  
> +#define CRASHACTION_SIZE  32
> +static char crash_debug_panic[CRASHACTION_SIZE];
> +static char crash_debug_hwdom[CRASHACTION_SIZE];
> +static char crash_debug_watchdog[CRASHACTION_SIZE];
> +static char crash_debug_kexeccmd[CRASHACTION_SIZE];
> +static char crash_debug_debugkey[CRASHACTION_SIZE];
> +
> +static char *crash_action[CRASHREASON_N] = {
> +    [CRASHREASON_PANIC] = crash_debug_panic,
> +    [CRASHREASON_HWDOM] = crash_debug_hwdom,
> +    [CRASHREASON_WATCHDOG] = crash_debug_watchdog,
> +    [CRASHREASON_KEXECCMD] = crash_debug_kexeccmd,
> +    [CRASHREASON_DEBUGKEY] = crash_debug_debugkey,
> +};
> +
> +string_runtime_param("crash-debug-panic", crash_debug_panic);
> +string_runtime_param("crash-debug-hwdom", crash_debug_hwdom);
> +string_runtime_param("crash-debug-watchdog", crash_debug_watchdog);
> +string_runtime_param("crash-debug-kexeccmd", crash_debug_kexeccmd);
> +string_runtime_param("crash-debug-debugkey", crash_debug_debugkey);

In general I'm not in favor of this (or similar) needing
five new command line options, instead of just one. The problem
with e.g.

crash-debug=panic:rq,watchdog:0d

is that ',' (or any other separator chosen) could in principle
also be a debug key. It would still be possible to use

crash-debug=panic:rq crash-debug=watchdog:0d

though. Thoughts?

> +void keyhandler_crash_action(enum crash_reason reason)
> +{
> +    const char *action = crash_action[reason];
> +    struct cpu_user_regs *regs = get_irq_regs() ? : guest_cpu_user_regs();
> +
> +    while ( *action ) {

Misplaced brace.

> +        if ( *action == '.' )
> +            mdelay(1000);
> +        else
> +            handle_keypress(*action, regs);
> +        action++;
> +    }
> +}

I think only diagnostic keys should be permitted here.

> --- a/xen/include/xen/kexec.h
> +++ b/xen/include/xen/kexec.h
> @@ -1,6 +1,8 @@
>  #ifndef __XEN_KEXEC_H__
>  #define __XEN_KEXEC_H__
>  
> +#include <xen/keyhandler.h>

Could we go without this, utilizing the gcc extension of forward
declared enums? Otoh ...

> @@ -82,7 +84,11 @@ void vmcoreinfo_append_str(const char *fmt, ...)
>  #define kexecing 0
>  
>  static inline void kexec_early_calculations(void) {}
> -static inline void kexec_crash(void) {}
> +static inline void kexec_crash(enum crash_reason reason)
> +{
> +    keyhandler_crash_action(reason);
> +}

... if this is to be an inline function and not just a #define,
it'll need the declaration of the function to have been seen.

Jan
Jürgen Groß Oct. 29, 2020, 2:40 p.m. UTC | #2
On 29.10.20 15:22, Jan Beulich wrote:
> On 22.10.2020 16:39, Juergen Gross wrote:
>> When the host crashes it would sometimes be nice to have additional
>> debug data available which could be produced via debug keys, but
>> halting the server for manual intervention might be impossible due to
>> the need to reboot/kexec rather sooner than later.
>>
>> Add support for automatic debug key actions in case of crashes which
>> can be activated via boot- or runtime-parameter.
> 
> While I can certainly see this possibly being a useful thing in
> certain situations, I'm uncertain it's going to be helpful in at
> least a fair set of cases. What output to request very much
> depends on the nature of the problem one is running into, and
> the more keys one adds "just in case", the longer the reboot
> latency, and the higher the risk (see also below) of the output
> generation actually causing further problems.

The obvious case is a watchdog induced crash: at least 2 sets of dom0
state will help in many cases.

> IOW I'm neither fully convinced that we want this, nor fully
> opposed.
> 
>> Depending on the type of crash the desired data might be different, so
>> support different settings for the possible types of crashes.
>>
>> The parameter is "crash-debug" with the following syntax:
>>
>>    crash-debug-<type>=<string>
>>
>> with <type> being one of:
>>
>>    panic, hwdom, watchdog, kexeccmd, debugkey
>>
>> and <string> a sequence of debug key characters with '.' having the
>> special semantics of a 1 second pause.
> 
> 1 second is a whole lot of time. To get two successive sets
> of data, a much shorter delay (if any) would normally suffice.

Yes, I'd be fine to trade that for a shorter period of time.

> Also, while '.' may seem like a good choice right now, with the
> shortage of characters we may want to put a real handler behind
> it at come point. The one character that clearly won't make much
> sense to use in this context is 'h', but that's awful as a (kind
> of) separator. Could we perhaps replace 'h' by '?', freeing up
> 'h' and allowing '?' to be used for this purpose here?

Fine with me. Another possibility would be to add '\' as an escape
character with '\.' meaning "debug-key .".

>> --- a/docs/misc/xen-command-line.pandoc
>> +++ b/docs/misc/xen-command-line.pandoc
>> @@ -574,6 +574,29 @@ reduction of features at Xen's disposal to manage guests.
>>   ### cpuinfo (x86)
>>   > `= <boolean>`
>>   
>> +### crash-debug-debugkey
>> +### crash-debug-hwdom
>> +### crash-debug-kexeccmd
>> +### crash-debug-panic
>> +### crash-debug-watchdog
>> +> `= <string>`
>> +
>> +> Can be modified at runtime
>> +
>> +Specify debug-key actions in cases of crashes. Each of the parameters applies
>> +to a different crash reason. The `<string>` is a sequence of debug key
>> +characters, with `.` having the special meaning of a 1 second pause.
>> +
>> +So e.g. `crash-debug-watchdog=0.0r` would dump dom0 state twice with a
>> +second between the two state dumps, followed by the run queues of the
>> +hypervisor, if the system crashes due to a watchdog timeout.
>> +
>> +These parameters should be used carefully, as e.g. specifying
>> +`crash-debug-debugkey=C` would result in an endless loop. Depending on the
>> +reason of the system crash it might happen that triggering some debug key
>> +action will result in a hang instead of dumping data and then doing a
>> +reboot or crash dump.
> 
> I think it would be useful if the flavors were (briefly)
> explained: At the very least "debugkey" doesn't directly fit "in
> cases of crashes", as there's no crash involved. kexec_crash()
> instead gets invoked without there having been any crash.

Yes, and having some additional state generate for this case might
help diagnosis.

> 
> You may also want to point out that this is a best effort thing
> only - system state at the point of a crash may be such that the
> attempt of handling one or the debug keys would have further bad
> effects on the system, including that the actual kexec may then
> never occur.

True.

> 
>> @@ -507,6 +509,41 @@ void __init initialize_keytable(void)
>>       }
>>   }
>>   
>> +#define CRASHACTION_SIZE  32
>> +static char crash_debug_panic[CRASHACTION_SIZE];
>> +static char crash_debug_hwdom[CRASHACTION_SIZE];
>> +static char crash_debug_watchdog[CRASHACTION_SIZE];
>> +static char crash_debug_kexeccmd[CRASHACTION_SIZE];
>> +static char crash_debug_debugkey[CRASHACTION_SIZE];
>> +
>> +static char *crash_action[CRASHREASON_N] = {
>> +    [CRASHREASON_PANIC] = crash_debug_panic,
>> +    [CRASHREASON_HWDOM] = crash_debug_hwdom,
>> +    [CRASHREASON_WATCHDOG] = crash_debug_watchdog,
>> +    [CRASHREASON_KEXECCMD] = crash_debug_kexeccmd,
>> +    [CRASHREASON_DEBUGKEY] = crash_debug_debugkey,
>> +};
>> +
>> +string_runtime_param("crash-debug-panic", crash_debug_panic);
>> +string_runtime_param("crash-debug-hwdom", crash_debug_hwdom);
>> +string_runtime_param("crash-debug-watchdog", crash_debug_watchdog);
>> +string_runtime_param("crash-debug-kexeccmd", crash_debug_kexeccmd);
>> +string_runtime_param("crash-debug-debugkey", crash_debug_debugkey);
> 
> In general I'm not in favor of this (or similar) needing
> five new command line options, instead of just one. The problem
> with e.g.
> 
> crash-debug=panic:rq,watchdog:0d
> 
> is that ',' (or any other separator chosen) could in principle
> also be a debug key. It would still be possible to use
> 
> crash-debug=panic:rq crash-debug=watchdog:0d
> 
> though. Thoughts?

OTOH the runtime parameters are much easier addressable that way.

> 
>> +void keyhandler_crash_action(enum crash_reason reason)
>> +{
>> +    const char *action = crash_action[reason];
>> +    struct cpu_user_regs *regs = get_irq_regs() ? : guest_cpu_user_regs();
>> +
>> +    while ( *action ) {
> 
> Misplaced brace.

Will fix.

> 
>> +        if ( *action == '.' )
>> +            mdelay(1000);
>> +        else
>> +            handle_keypress(*action, regs);
>> +        action++;
>> +    }
>> +}
> 
> I think only diagnostic keys should be permitted here.

While I understand that other keys could produce nonsense or do harm,
I'm not sure we should really prohibit them. Allowing them would e.g.
allow to do just a reboot without kdump for one type of crashes.

> 
>> --- a/xen/include/xen/kexec.h
>> +++ b/xen/include/xen/kexec.h
>> @@ -1,6 +1,8 @@
>>   #ifndef __XEN_KEXEC_H__
>>   #define __XEN_KEXEC_H__
>>   
>> +#include <xen/keyhandler.h>
> 
> Could we go without this, utilizing the gcc extension of forward
> declared enums? Otoh ...
> 
>> @@ -82,7 +84,11 @@ void vmcoreinfo_append_str(const char *fmt, ...)
>>   #define kexecing 0
>>   
>>   static inline void kexec_early_calculations(void) {}
>> -static inline void kexec_crash(void) {}
>> +static inline void kexec_crash(enum crash_reason reason)
>> +{
>> +    keyhandler_crash_action(reason);
>> +}
> 
> ... if this is to be an inline function and not just a #define,
> it'll need the declaration of the function to have been seen.

And even being a #define all users of kexec_crash() would need to
#include keyhandler.h (and I'm not sure there are any source files
including kexec.h which don't use kexec_crash()).


Juergen
Jan Beulich Oct. 29, 2020, 2:49 p.m. UTC | #3
On 29.10.2020 15:40, Jürgen Groß wrote:
> On 29.10.20 15:22, Jan Beulich wrote:
>> On 22.10.2020 16:39, Juergen Gross wrote:
>>> @@ -507,6 +509,41 @@ void __init initialize_keytable(void)
>>>       }
>>>   }
>>>   
>>> +#define CRASHACTION_SIZE  32
>>> +static char crash_debug_panic[CRASHACTION_SIZE];
>>> +static char crash_debug_hwdom[CRASHACTION_SIZE];
>>> +static char crash_debug_watchdog[CRASHACTION_SIZE];
>>> +static char crash_debug_kexeccmd[CRASHACTION_SIZE];
>>> +static char crash_debug_debugkey[CRASHACTION_SIZE];
>>> +
>>> +static char *crash_action[CRASHREASON_N] = {
>>> +    [CRASHREASON_PANIC] = crash_debug_panic,
>>> +    [CRASHREASON_HWDOM] = crash_debug_hwdom,
>>> +    [CRASHREASON_WATCHDOG] = crash_debug_watchdog,
>>> +    [CRASHREASON_KEXECCMD] = crash_debug_kexeccmd,
>>> +    [CRASHREASON_DEBUGKEY] = crash_debug_debugkey,
>>> +};
>>> +
>>> +string_runtime_param("crash-debug-panic", crash_debug_panic);
>>> +string_runtime_param("crash-debug-hwdom", crash_debug_hwdom);
>>> +string_runtime_param("crash-debug-watchdog", crash_debug_watchdog);
>>> +string_runtime_param("crash-debug-kexeccmd", crash_debug_kexeccmd);
>>> +string_runtime_param("crash-debug-debugkey", crash_debug_debugkey);
>>
>> In general I'm not in favor of this (or similar) needing
>> five new command line options, instead of just one. The problem
>> with e.g.
>>
>> crash-debug=panic:rq,watchdog:0d
>>
>> is that ',' (or any other separator chosen) could in principle
>> also be a debug key. It would still be possible to use
>>
>> crash-debug=panic:rq crash-debug=watchdog:0d
>>
>> though. Thoughts?
> 
> OTOH the runtime parameters are much easier addressable that way.

Ah yes, I can see this as a reason. Would make we wonder whether
command line and runtime handling may want disconnecting in this
specific case then. (But I can also see the argument of this
being too much overhead then.)

>>> +void keyhandler_crash_action(enum crash_reason reason)
>>> +{
>>> +    const char *action = crash_action[reason];
>>> +    struct cpu_user_regs *regs = get_irq_regs() ? : guest_cpu_user_regs();
>>> +
>>> +    while ( *action ) {
>>> +        if ( *action == '.' )
>>> +            mdelay(1000);
>>> +        else
>>> +            handle_keypress(*action, regs);
>>> +        action++;
>>> +    }
>>> +}
>>
>> I think only diagnostic keys should be permitted here.
> 
> While I understand that other keys could produce nonsense or do harm,
> I'm not sure we should really prohibit them. Allowing them would e.g.
> allow to do just a reboot without kdump for one type of crashes.

Ah yes, that's a fair point.

>>> --- a/xen/include/xen/kexec.h
>>> +++ b/xen/include/xen/kexec.h
>>> @@ -1,6 +1,8 @@
>>>   #ifndef __XEN_KEXEC_H__
>>>   #define __XEN_KEXEC_H__
>>>   
>>> +#include <xen/keyhandler.h>
>>
>> Could we go without this, utilizing the gcc extension of forward
>> declared enums? Otoh ...
>>
>>> @@ -82,7 +84,11 @@ void vmcoreinfo_append_str(const char *fmt, ...)
>>>   #define kexecing 0
>>>   
>>>   static inline void kexec_early_calculations(void) {}
>>> -static inline void kexec_crash(void) {}
>>> +static inline void kexec_crash(enum crash_reason reason)
>>> +{
>>> +    keyhandler_crash_action(reason);
>>> +}
>>
>> ... if this is to be an inline function and not just a #define,
>> it'll need the declaration of the function to have been seen.
> 
> And even being a #define all users of kexec_crash() would need to
> #include keyhandler.h (and I'm not sure there are any source files
> including kexec.h which don't use kexec_crash()).

About as many which do as ones which don't. But there's no
generally accessible header which includes xen/kexec.h, so perhaps
the extra dependency indeed isn't all this problematic.

Jan
Jürgen Groß Nov. 12, 2020, 12:50 p.m. UTC | #4
On 29.10.20 15:49, Jan Beulich wrote:
> On 29.10.2020 15:40, Jürgen Groß wrote:
>> On 29.10.20 15:22, Jan Beulich wrote:
>>> On 22.10.2020 16:39, Juergen Gross wrote:
>>>> @@ -507,6 +509,41 @@ void __init initialize_keytable(void)
>>>>        }
>>>>    }
>>>>    
>>>> +#define CRASHACTION_SIZE  32
>>>> +static char crash_debug_panic[CRASHACTION_SIZE];
>>>> +static char crash_debug_hwdom[CRASHACTION_SIZE];
>>>> +static char crash_debug_watchdog[CRASHACTION_SIZE];
>>>> +static char crash_debug_kexeccmd[CRASHACTION_SIZE];
>>>> +static char crash_debug_debugkey[CRASHACTION_SIZE];
>>>> +
>>>> +static char *crash_action[CRASHREASON_N] = {
>>>> +    [CRASHREASON_PANIC] = crash_debug_panic,
>>>> +    [CRASHREASON_HWDOM] = crash_debug_hwdom,
>>>> +    [CRASHREASON_WATCHDOG] = crash_debug_watchdog,
>>>> +    [CRASHREASON_KEXECCMD] = crash_debug_kexeccmd,
>>>> +    [CRASHREASON_DEBUGKEY] = crash_debug_debugkey,
>>>> +};
>>>> +
>>>> +string_runtime_param("crash-debug-panic", crash_debug_panic);
>>>> +string_runtime_param("crash-debug-hwdom", crash_debug_hwdom);
>>>> +string_runtime_param("crash-debug-watchdog", crash_debug_watchdog);
>>>> +string_runtime_param("crash-debug-kexeccmd", crash_debug_kexeccmd);
>>>> +string_runtime_param("crash-debug-debugkey", crash_debug_debugkey);
>>>
>>> In general I'm not in favor of this (or similar) needing
>>> five new command line options, instead of just one. The problem
>>> with e.g.
>>>
>>> crash-debug=panic:rq,watchdog:0d
>>>
>>> is that ',' (or any other separator chosen) could in principle
>>> also be a debug key. It would still be possible to use
>>>
>>> crash-debug=panic:rq crash-debug=watchdog:0d
>>>
>>> though. Thoughts?
>>
>> OTOH the runtime parameters are much easier addressable that way.
> 
> Ah yes, I can see this as a reason. Would make we wonder whether
> command line and runtime handling may want disconnecting in this
> specific case then. (But I can also see the argument of this
> being too much overhead then.)
> 
>>>> +void keyhandler_crash_action(enum crash_reason reason)
>>>> +{
>>>> +    const char *action = crash_action[reason];
>>>> +    struct cpu_user_regs *regs = get_irq_regs() ? : guest_cpu_user_regs();
>>>> +
>>>> +    while ( *action ) {
>>>> +        if ( *action == '.' )
>>>> +            mdelay(1000);
>>>> +        else
>>>> +            handle_keypress(*action, regs);
>>>> +        action++;
>>>> +    }
>>>> +}
>>>
>>> I think only diagnostic keys should be permitted here.
>>
>> While I understand that other keys could produce nonsense or do harm,
>> I'm not sure we should really prohibit them. Allowing them would e.g.
>> allow to do just a reboot without kdump for one type of crashes.
> 
> Ah yes, that's a fair point.
> 
>>>> --- a/xen/include/xen/kexec.h
>>>> +++ b/xen/include/xen/kexec.h
>>>> @@ -1,6 +1,8 @@
>>>>    #ifndef __XEN_KEXEC_H__
>>>>    #define __XEN_KEXEC_H__
>>>>    
>>>> +#include <xen/keyhandler.h>
>>>
>>> Could we go without this, utilizing the gcc extension of forward
>>> declared enums? Otoh ...
>>>
>>>> @@ -82,7 +84,11 @@ void vmcoreinfo_append_str(const char *fmt, ...)
>>>>    #define kexecing 0
>>>>    
>>>>    static inline void kexec_early_calculations(void) {}
>>>> -static inline void kexec_crash(void) {}
>>>> +static inline void kexec_crash(enum crash_reason reason)
>>>> +{
>>>> +    keyhandler_crash_action(reason);
>>>> +}
>>>
>>> ... if this is to be an inline function and not just a #define,
>>> it'll need the declaration of the function to have been seen.
>>
>> And even being a #define all users of kexec_crash() would need to
>> #include keyhandler.h (and I'm not sure there are any source files
>> including kexec.h which don't use kexec_crash()).
> 
> About as many which do as ones which don't. But there's no
> generally accessible header which includes xen/kexec.h, so perhaps
> the extra dependency indeed isn't all this problematic.

Any further comments, or even better, Acks?


Juergen
Jan Beulich Nov. 12, 2020, 1:36 p.m. UTC | #5
On 12.11.2020 13:50, Jürgen Groß wrote:
> Any further comments, or even better, Acks?

To be honest I'd prefer to have at least one of the people Cc-ed
minimally indicate they consider this a good idea. No need for a
close review or such, just a basic opinion. Anyone?

Jan
Stefano Stabellini Nov. 12, 2020, 9:38 p.m. UTC | #6
On Thu, 12 Nov 2020, Jan Beulich wrote:
> On 12.11.2020 13:50, Jürgen Groß wrote:
> > Any further comments, or even better, Acks?
> 
> To be honest I'd prefer to have at least one of the people Cc-ed
> minimally indicate they consider this a good idea. No need for a
> close review or such, just a basic opinion. Anyone?

I see Jan's point that it is not clear how much this is going to help in
production. However, it is not going to hurt either, and I have been
told a few times recently that debugging Xen is not easy. Anything that
helps in that regard would be good. So I think this patch would be an
improvement.
Jürgen Groß Nov. 13, 2020, 5:18 a.m. UTC | #7
On 12.11.20 22:38, Stefano Stabellini wrote:
> On Thu, 12 Nov 2020, Jan Beulich wrote:
>> On 12.11.2020 13:50, Jürgen Groß wrote:
>>> Any further comments, or even better, Acks?
>>
>> To be honest I'd prefer to have at least one of the people Cc-ed
>> minimally indicate they consider this a good idea. No need for a
>> close review or such, just a basic opinion. Anyone?
> 
> I see Jan's point that it is not clear how much this is going to help in
> production. However, it is not going to hurt either, and I have been
> told a few times recently that debugging Xen is not easy. Anything that
> helps in that regard would be good. So I think this patch would be an
> improvement.
> 

This patch is an effort to get better diagnostic data in case of
Xen crashes from our largest customer, so clearly intended for
production use.


Juergen
Julien Grall Nov. 18, 2020, 4:46 p.m. UTC | #8
Hi,

On 13/11/2020 05:18, Jürgen Groß wrote:
> On 12.11.20 22:38, Stefano Stabellini wrote:
>> On Thu, 12 Nov 2020, Jan Beulich wrote:
>>> On 12.11.2020 13:50, Jürgen Groß wrote:
>>>> Any further comments, or even better, Acks?
>>>
>>> To be honest I'd prefer to have at least one of the people Cc-ed
>>> minimally indicate they consider this a good idea. No need for a
>>> close review or such, just a basic opinion. Anyone?
>>
>> I see Jan's point that it is not clear how much this is going to help in
>> production. However, it is not going to hurt either, and I have been
>> told a few times recently that debugging Xen is not easy. Anything that
>> helps in that regard would be good. So I think this patch would be an
>> improvement.
>>
> 
> This patch is an effort to get better diagnostic data in case of
> Xen crashes from our largest customer, so clearly intended for
> production use.
>

I agree with this statement. When you get a crash from Xen in 
production, it can be useful to get as much information as possible 
dumped. Some of the information printed by the debugkeys cannot be 
retrieved from a crashkernel.

Cheers,
Jan Beulich Nov. 19, 2020, 9:59 a.m. UTC | #9
On 18.11.2020 17:46, Julien Grall wrote:
> On 13/11/2020 05:18, Jürgen Groß wrote:
>> On 12.11.20 22:38, Stefano Stabellini wrote:
>>> On Thu, 12 Nov 2020, Jan Beulich wrote:
>>>> On 12.11.2020 13:50, Jürgen Groß wrote:
>>>>> Any further comments, or even better, Acks?
>>>>
>>>> To be honest I'd prefer to have at least one of the people Cc-ed
>>>> minimally indicate they consider this a good idea. No need for a
>>>> close review or such, just a basic opinion. Anyone?
>>>
>>> I see Jan's point that it is not clear how much this is going to help in
>>> production. However, it is not going to hurt either, and I have been
>>> told a few times recently that debugging Xen is not easy. Anything that
>>> helps in that regard would be good. So I think this patch would be an
>>> improvement.
>>>
>>
>> This patch is an effort to get better diagnostic data in case of
>> Xen crashes from our largest customer, so clearly intended for
>> production use.
>>
> 
> I agree with this statement. When you get a crash from Xen in 
> production, it can be useful to get as much information as possible 
> dumped. Some of the information printed by the debugkeys cannot be 
> retrieved from a crashkernel.

So, Jürgen, my request was satisfied, so all that's needed is, I
suppose, a re-submission with the few smaller items addressed.

Jan
diff mbox series

Patch

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 4ae9391fcd..f328c99cf8 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -574,6 +574,29 @@  reduction of features at Xen's disposal to manage guests.
 ### cpuinfo (x86)
 > `= <boolean>`
 
+### crash-debug-debugkey
+### crash-debug-hwdom
+### crash-debug-kexeccmd
+### crash-debug-panic
+### crash-debug-watchdog
+> `= <string>`
+
+> Can be modified at runtime
+
+Specify debug-key actions in cases of crashes. Each of the parameters applies
+to a different crash reason. The `<string>` is a sequence of debug key
+characters, with `.` having the special meaning of a 1 second pause.
+
+So e.g. `crash-debug-watchdog=0.0r` would dump dom0 state twice with a
+second between the two state dumps, followed by the run queues of the
+hypervisor, if the system crashes due to a watchdog timeout.
+
+These parameters should be used carefully, as e.g. specifying
+`crash-debug-debugkey=C` would result in an endless loop. Depending on the
+reason of the system crash it might happen that triggering some debug key
+action will result in a hang instead of dumping data and then doing a
+reboot or crash dump.
+
 ### crashinfo_maxaddr
 > `= <size>`
 
diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index 52cdc4ebc3..ebeee6405a 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -373,10 +373,12 @@  static int kexec_common_shutdown(void)
     return 0;
 }
 
-void kexec_crash(void)
+void kexec_crash(enum crash_reason reason)
 {
     int pos;
 
+    keyhandler_crash_action(reason);
+
     pos = (test_bit(KEXEC_FLAG_CRASH_POS, &kexec_flags) != 0);
     if ( !test_bit(KEXEC_IMAGE_CRASH_BASE + pos, &kexec_flags) )
         return;
@@ -409,7 +411,7 @@  static long kexec_reboot(void *_image)
 static void do_crashdump_trigger(unsigned char key)
 {
     printk("'%c' pressed -> triggering crashdump\n", key);
-    kexec_crash();
+    kexec_crash(CRASHREASON_DEBUGKEY);
     printk(" * no crash kernel loaded!\n");
 }
 
@@ -840,7 +842,7 @@  static int kexec_exec(XEN_GUEST_HANDLE_PARAM(void) uarg)
         ret = continue_hypercall_on_cpu(0, kexec_reboot, image);
         break;
     case KEXEC_TYPE_CRASH:
-        kexec_crash(); /* Does not return */
+        kexec_crash(CRASHREASON_KEXECCMD); /* Does not return */
         break;
     }
 
diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
index 68364e987d..ac8229a4d7 100644
--- a/xen/common/keyhandler.c
+++ b/xen/common/keyhandler.c
@@ -3,7 +3,9 @@ 
  */
 
 #include <asm/regs.h>
+#include <xen/delay.h>
 #include <xen/keyhandler.h>
+#include <xen/param.h>
 #include <xen/shutdown.h>
 #include <xen/event.h>
 #include <xen/console.h>
@@ -507,6 +509,41 @@  void __init initialize_keytable(void)
     }
 }
 
+#define CRASHACTION_SIZE  32
+static char crash_debug_panic[CRASHACTION_SIZE];
+static char crash_debug_hwdom[CRASHACTION_SIZE];
+static char crash_debug_watchdog[CRASHACTION_SIZE];
+static char crash_debug_kexeccmd[CRASHACTION_SIZE];
+static char crash_debug_debugkey[CRASHACTION_SIZE];
+
+static char *crash_action[CRASHREASON_N] = {
+    [CRASHREASON_PANIC] = crash_debug_panic,
+    [CRASHREASON_HWDOM] = crash_debug_hwdom,
+    [CRASHREASON_WATCHDOG] = crash_debug_watchdog,
+    [CRASHREASON_KEXECCMD] = crash_debug_kexeccmd,
+    [CRASHREASON_DEBUGKEY] = crash_debug_debugkey,
+};
+
+string_runtime_param("crash-debug-panic", crash_debug_panic);
+string_runtime_param("crash-debug-hwdom", crash_debug_hwdom);
+string_runtime_param("crash-debug-watchdog", crash_debug_watchdog);
+string_runtime_param("crash-debug-kexeccmd", crash_debug_kexeccmd);
+string_runtime_param("crash-debug-debugkey", crash_debug_debugkey);
+
+void keyhandler_crash_action(enum crash_reason reason)
+{
+    const char *action = crash_action[reason];
+    struct cpu_user_regs *regs = get_irq_regs() ? : guest_cpu_user_regs();
+
+    while ( *action ) {
+        if ( *action == '.' )
+            mdelay(1000);
+        else
+            handle_keypress(*action, regs);
+        action++;
+    }
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/common/shutdown.c b/xen/common/shutdown.c
index 912593915b..abde48aa4c 100644
--- a/xen/common/shutdown.c
+++ b/xen/common/shutdown.c
@@ -43,7 +43,7 @@  void hwdom_shutdown(u8 reason)
     case SHUTDOWN_crash:
         debugger_trap_immediate();
         printk("Hardware Dom%u crashed: ", hardware_domain->domain_id);
-        kexec_crash();
+        kexec_crash(CRASHREASON_HWDOM);
         maybe_reboot();
         break; /* not reached */
 
@@ -56,7 +56,7 @@  void hwdom_shutdown(u8 reason)
     case SHUTDOWN_watchdog:
         printk("Hardware Dom%u shutdown: watchdog rebooting machine\n",
                hardware_domain->domain_id);
-        kexec_crash();
+        kexec_crash(CRASHREASON_WATCHDOG);
         machine_restart(0);
         break; /* not reached */
 
diff --git a/xen/drivers/char/console.c b/xen/drivers/char/console.c
index 861ad53a8f..acec277f5e 100644
--- a/xen/drivers/char/console.c
+++ b/xen/drivers/char/console.c
@@ -1271,7 +1271,7 @@  void panic(const char *fmt, ...)
 
     debugger_trap_immediate();
 
-    kexec_crash();
+    kexec_crash(CRASHREASON_PANIC);
 
     if ( opt_noreboot )
         machine_halt();
diff --git a/xen/include/xen/kexec.h b/xen/include/xen/kexec.h
index e85ba16405..9f7a912e97 100644
--- a/xen/include/xen/kexec.h
+++ b/xen/include/xen/kexec.h
@@ -1,6 +1,8 @@ 
 #ifndef __XEN_KEXEC_H__
 #define __XEN_KEXEC_H__
 
+#include <xen/keyhandler.h>
+
 #ifdef CONFIG_KEXEC
 
 #include <public/kexec.h>
@@ -48,7 +50,7 @@  void machine_kexec_unload(struct kexec_image *image);
 void machine_kexec_reserved(xen_kexec_reserve_t *reservation);
 void machine_reboot_kexec(struct kexec_image *image);
 void machine_kexec(struct kexec_image *image);
-void kexec_crash(void);
+void kexec_crash(enum crash_reason reason);
 void kexec_crash_save_cpu(void);
 struct crash_xen_info *kexec_crash_save_info(void);
 void machine_crash_shutdown(void);
@@ -82,7 +84,11 @@  void vmcoreinfo_append_str(const char *fmt, ...)
 #define kexecing 0
 
 static inline void kexec_early_calculations(void) {}
-static inline void kexec_crash(void) {}
+static inline void kexec_crash(enum crash_reason reason)
+{
+    keyhandler_crash_action(reason);
+}
+
 static inline void kexec_crash_save_cpu(void) {}
 static inline void set_kexec_crash_area_size(u64 system_ram) {}
 
diff --git a/xen/include/xen/keyhandler.h b/xen/include/xen/keyhandler.h
index 5131e86cbc..dbf797a8b4 100644
--- a/xen/include/xen/keyhandler.h
+++ b/xen/include/xen/keyhandler.h
@@ -48,4 +48,15 @@  void register_irq_keyhandler(unsigned char key,
 /* Inject a keypress into the key-handling subsystem. */
 extern void handle_keypress(unsigned char key, struct cpu_user_regs *regs);
 
+enum crash_reason {
+    CRASHREASON_PANIC,
+    CRASHREASON_HWDOM,
+    CRASHREASON_WATCHDOG,
+    CRASHREASON_KEXECCMD,
+    CRASHREASON_DEBUGKEY,
+    CRASHREASON_N
+};
+
+void keyhandler_crash_action(enum crash_reason reason);
+
 #endif /* __XEN_KEYHANDLER_H__ */