[v3,0/3] add MEMORY_FAILURE event

Message ID: 20200930100440.1060708-1-pizhenwei@bytedance.com

Message

zhenwei pi Sept. 30, 2020, 10:04 a.m. UTC
v2->v3:
Use g_strdup_printf instead of snprintf.
Declare the memory failure event as 3 parts: 'recipient', 'action', 'flags'.
Add wrapper functions emit_guest_memory_failure and
emit_hypervisor_memory_failure.
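
As an illustration of the three-part layout, here is a minimal Python
sketch of how a management-layer client might decode such an event. The
JSON envelope, the helper names (MemoryFailure, parse_memory_failure) and
the sample values ("guest", "inject") are assumptions made for this
example and are not taken from the patches; only the part names
'recipient', 'action', 'flags' and the two flag names come from this
cover letter.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryFailure:
    recipient: str         # which side the failure was attributed to
    action: str            # what QEMU did about it
    action_required: bool  # flags['action-required']
    recursive: bool        # flags['recursive']


def parse_memory_failure(line: str):
    """Return a MemoryFailure for a MEMORY_FAILURE event line, else None."""
    msg = json.loads(line)
    if msg.get("event") != "MEMORY_FAILURE":
        return None
    data = msg["data"]
    flags = data.get("flags", {})
    return MemoryFailure(
        recipient=data["recipient"],
        action=data["action"],
        action_required=bool(flags.get("action-required", False)),
        recursive=bool(flags.get("recursive", False)),
    )


# Hypothetical sample payload; the values are illustrative only.
SAMPLE = json.dumps({
    "event": "MEMORY_FAILURE",
    "data": {
        "recipient": "guest",
        "action": "inject",
        "flags": {"action-required": True, "recursive": False},
    },
})

if __name__ == "__main__":
    print(parse_memory_failure(SAMPLE))
```

Keeping the flags in their own sub-object means a client that does not
care about them can ignore the whole group.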

v1->v2:
Suggested by Peter Maydell, rename events to make them
architecture-neutral:
'PC-RAM' -> 'guest-memory'
'guest-triple-fault' -> 'guest-mce-fatal'

Suggested by Paolo, add more fields to the event:
'action-required': boolean, distinguishes whether a guest-mce is AR
             (action required) or AO (action optional).
'recursive': boolean, set to true if another AO MCE occurs while a
             previous MCE is still being processed in the guest.

v1:
Although QEMU can catch SIGBUS to handle hardware memory corruption
events, it currently just prints a short log message and tries to fix the
problem silently.

These patches introduce a 'MEMORY_FAILURE' event with four detailed QEMU
actions, so that the upper layer can learn what situation QEMU hit and
what it did. As a further step: if a host server hits 'hypervisor-ignore'
or 'guest-mce', the scheduler could migrate the VM to another host; if it
hits 'hypervisor-stop' or 'guest-triple-fault', the scheduler could select
other healthy servers to launch the VM.
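
The scheduler policy described above can be sketched as a small lookup
table. This is only an illustration in Python: the Decision enum and the
decide() helper are hypothetical names, and the action strings are the v1
names used in this paragraph (later revisions rename them).

```python
from enum import Enum


class Decision(Enum):
    MIGRATE = "migrate the VM to another host"
    RELAUNCH = "launch the VM on another healthy server"
    NONE = "no scheduler action"


# v1 action names as given in the cover letter, mapped to the
# scheduler reaction suggested there.
POLICY = {
    "hypervisor-ignore": Decision.MIGRATE,
    "guest-mce": Decision.MIGRATE,
    "hypervisor-stop": Decision.RELAUNCH,
    "guest-triple-fault": Decision.RELAUNCH,
}


def decide(action: str) -> Decision:
    """Map a reported action to a scheduler decision (NONE if unknown)."""
    return POLICY.get(action, Decision.NONE)


if __name__ == "__main__":
    for name in POLICY:
        print(f"{name}: {decide(name).value}")
```

Unknown actions fall through to Decision.NONE so that a scheduler built
this way does not act on events it does not understand.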

Zhenwei Pi (3):
  target-i386: seperate MCIP & MCE_MASK error reason
  qapi/run-state.json: introduce memory failure event
  target-i386: post memory failure event to uplayer

 qapi/run-state.json  | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 target/i386/helper.c | 47 ++++++++++++++++++++++-------
 target/i386/kvm.c    | 13 +++++++-
 3 files changed, 134 insertions(+), 11 deletions(-)

Comments

Paolo Bonzini Oct. 1, 2020, 5:28 p.m. UTC | #1
On 30/09/20 12:04, zhenwei pi wrote:
> v2->v3:
> Use g_strdup_printf instead of snprintf.
> Declare the memory failure event as 3 parts: 'recipient', 'action', 'flags'.
> Add wrapper functions emit_guest_memory_failure and
> emit_hypervisor_memory_failure.

Queued, thanks.  I took the liberty of adding a fourth value to
MemoryFailureAction, "reset", since "fatal" was used for two different
actions.

Paolo
