[RFC,0/6] hugetlbfs largepage RAS project

Message ID 20240910090747.2741475-1-william.roche@oracle.com (mailing list archive)

William Roche Sept. 10, 2024, 9:07 a.m. UTC
From: William Roche <william.roche@oracle.com>

Hello,

This is a Qemu RFC to introduce the possibility to deal with hardware
memory errors impacting hugetlbfs memory backed VMs. When using
hugetlbfs large pages, any large page location being impacted by an
HW memory error results in poisoning the entire page, suddenly making
a large chunk of the VM memory unusable.

The implemented proposal is simply a memory mapping change when an HW error
is reported to Qemu, to transform a hugetlbfs large page into a set of
standard sized pages. The failed large page is unmapped and a set of
standard sized pages is mapped in its place.
This mechanism is triggered when a SIGBUS/BUS_MCEERR_Ax signal is received
by qemu and the reported location corresponds to a large page.
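
As a rough illustration of the "mapped in place" step (a sketch, not code
from this series), the range of the failed large page can be covered again
with standard sized pages by an mmap() call using MAP_FIXED, which replaces
whatever was mapped at that address. The helper name is hypothetical and the
sketch uses private anonymous memory, whereas the real feature has to deal
with the hugetlbfs backend file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: cover [addr, addr + len) with fresh standard sized
 * anonymous pages. MAP_FIXED replaces the existing mapping, so the range
 * is never left unmapped between an munmap() and an mmap().
 * Returns 0 on success, -1 on failure (errno is set by mmap).
 */
int remap_with_small_pages(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    return p == addr ? 0 : -1;
}
```

After the call, the range reads as zeroes until valid data is copied back
into it.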

This gives the possibility to:
- Take advantage of newer hypervisor kernels that provide a way to retrieve
the still valid data of a poisoned hugetlbfs large page.
If the backend file is MAP_SHARED, we can copy the valid data into the
set of standard sized pages. If an error is returned when accessing
a location, we consider it poisoned and mark the corresponding standard sized
memory page as poisoned with a madvise(MADV_HWPOISON) call. Hence, the VM
can continue to use whatever valid data could be retrieved.
- Adjust the poison address information. When accessing a poisoned location,
an older kernel version may only provide the address of the beginning of
the poisoned large page in the associated SIGBUS siginfo data. Pointing to
the more accurate location that was actually touched allows the VM kernel
to trigger the right memory error reaction.
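
The recovery of valid data described in the first item can be sketched as
the loop below. This is only an illustration under assumptions: all names
are hypothetical, and the reader callback stands in for whatever access to
the backend file is actually used (for instance a read of the file at the
given offset). A chunk whose read fails is flagged so that the caller can
mark the corresponding standard sized page as poisoned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reader: returns 0 on success, -1 if the chunk is unreadable. */
typedef int (*chunk_reader)(void *opaque, size_t offset, void *dst, size_t len);

/*
 * Copy a failed large page into dst one standard sized chunk at a time.
 * poisoned[i] is set for every chunk whose read failed.
 * Returns the number of chunks that could not be recovered.
 */
size_t recover_large_page(chunk_reader rd, void *opaque, unsigned char *dst,
                          size_t large_sz, size_t small_sz, bool *poisoned)
{
    assert(small_sz > 0 && large_sz % small_sz == 0);

    size_t lost = 0;
    for (size_t i = 0; i < large_sz / small_sz; i++) {
        size_t off = i * small_sz;

        poisoned[i] = rd(opaque, off, dst + off, small_sz) != 0;
        if (poisoned[i]) {
            lost++;
        }
    }
    return lost;
}

/* Stand-in source for illustration: a buffer with one unreadable offset. */
struct fake_src {
    const unsigned char *data;
    size_t bad_offset;
};

int fake_read(void *opaque, size_t offset, void *dst, size_t len)
{
    const struct fake_src *s = opaque;

    if (s->bad_offset >= offset && s->bad_offset < offset + len) {
        return -1;              /* simulate a poisoned location */
    }
    memcpy(dst, s->data + offset, len);
    return 0;
}
```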

A warning is given for hugetlbfs backed memory-regions that are mapped
without the 'share=on' option.
(This warning is also given when using the deprecated "-mem-path" option)

The hugetlbfs memory mapping option should look like this
(with XXX replaced with the actual size):
  -object memory-backend-file,id=pc.ram,mem-path=/dev/hugepages,prealloc=on,share=on,size=XXX -machine memory-backend=pc.ram

I'm introducing new system/hugetlbfs_ras.[ch] files to separate the specific
code for this feature. It's only compiled on Linux hosts.

Note that we have to be able to mark as "poison" a replacing valid standard
sized page. We currently do that by calling madvise(..., MADV_HWPOISON),
but this requires the qemu process to have the CAP_SYS_ADMIN privilege.
Using userfaultfd instead of madvise() to mark the pages as poison could
remove this constraint, at the cost of complicating the code by adding
thread(s) to service the user page faults.


It's also worth mentioning the case of IO memory, i.e. vfio configured
memory buffers. The Qemu memory remapping (if it succeeds) will not
reconfigure any device IO buffer locations (no dma unmap/remap is performed),
and if a hardware IO is supposed to access (read or write) a poisoned
hugetlbfs page, I would expect it to fail the same way as before (as its
location hasn't been updated to take the new mapping into account).
But can someone confirm this behavior? Or tell me what should
be done to deal with this type of memory buffer?

Details:
--------
The following problems had to be considered:

. kvm dealing with memory faults:
 - Address space mapping changes can't be handled in a signal handler (mmap
   is not async signal safe for example)
     We have a separate listener thread (only created when we use hugetlbfs)
     to deal with the mapping changes.
 - If memory is not mapped when it is accessed, kvm fails with
   (exit_reason: KVM_EXIT_UNKNOWN)
     To avoid that, I needed to prevent access to a memory region while
     it is being changed: pausing the VM is used to do so.
 - A fault on a poisoned hugetlbfs large page will report a hardcoded page
   size of 4k (See kernel kvm_send_hwpoison_signal() function)
     When a SIGBUS is received with a page size indication of 4k we have to
     check whether the impacted page is actually part of a hugetlbfs page.
 - Asynchronous SIGBUS/BUS_MCEERR_AO signals provide the right page size,
   but the current Qemu version still needs to take this information into
   account.
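
To make the page size handling above more concrete, here is a small sketch
(not taken from the patches) of how the si_addr_lsb field of the SIGBUS
siginfo translates into a size, and how a 4k report can be widened when the
address turns out to be backed by a larger page. The function names are
hypothetical, and the backing page size is assumed to be known to the caller.

```c
#include <assert.h>
#include <stdint.h>

/*
 * si_addr_lsb is the least significant valid bit of the reported address,
 * i.e. log2 of the affected range: 12 for 4 KiB, 21 for 2 MiB.
 */
uint64_t mce_range_size(short addr_lsb)
{
    assert(addr_lsb >= 0 && addr_lsb < 64);
    return UINT64_C(1) << addr_lsb;
}

/* Start of the page containing addr; page_size must be a power of two. */
uint64_t page_start(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/*
 * A report smaller than the backing page size (the hardcoded 4k case)
 * is widened to the whole backing page.
 */
uint64_t impacted_size(short addr_lsb, uint64_t backing_page_size)
{
    uint64_t reported = mce_range_size(addr_lsb);

    return reported < backing_page_size ? backing_page_size : reported;
}
```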

. system/physmem needed fixes:
 - When recreating the memory mapping on VM reset, we have to consider the
   memory size impacted.
 - In the case of a mapped file, punching a hole is necessary to clean the
   poison.

. Implementation details:
 - A SIGBUS signal received for a large page will trigger the page
   modification, but in order to pause the VM, the signal handlers have to
   terminate.
     So we return from the SIGBUS signal handler(s) when a VM has to be stopped.
     A memory access that generated a SIGBUS/BUS_MCEERR_AR signal before the
     VM pause will be repeated when the VM resumes. If the memory is still
     not accessible (poisoned) the signal will be generated again by the
     hypervisor kernel.
     In the case of an asynchronous SIGBUS/BUS_MCEERR_AO signal, the signal is
     not repeated by the kernel and will be recorded by qemu in order to be
     replayed when the VM resumes.
 - Poisoning a memory page with MADV_HWPOISON can generate a SIGBUS when
   called. The listener thread taking care of the memory modification needs
   to deal with this case. To do so, it sets a thread specific variable
   that is recognized by the sigbus handler.
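
The thread specific variable mentioned in the last item could work along
the lines of the sketch below. It is a simplified illustration with
hypothetical names: the flag is set around the operation that may raise a
SIGBUS on the calling thread, and raise(SIGBUS) stands in for the
madvise(MADV_HWPOISON) call so that the example has no side effect on real
memory.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Per-thread state: set while this thread expects its own SIGBUS. */
static __thread volatile sig_atomic_t expecting_sigbus;
static __thread volatile sig_atomic_t swallowed_sigbus;

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)info;
    (void)ctx;
    if (expecting_sigbus) {
        swallowed_sigbus++;     /* self-inflicted signal: note it, return */
        return;
    }
    /* Otherwise the regular memory error handling would run here. */
}

int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Run op with the flag set; return how many SIGBUS were swallowed. */
int run_guarded(void (*op)(void))
{
    assert(op != NULL);

    sig_atomic_t before = swallowed_sigbus;

    expecting_sigbus = 1;
    op();
    expecting_sigbus = 0;
    return (int)(swallowed_sigbus - before);
}

/* Stand-in for the poisoning call: just sends SIGBUS to this thread. */
void fake_poison(void)
{
    raise(SIGBUS);
}
```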


Some questions:
---------------
. Should we take extra care for IO memory, vfio configured memory buffers ?

. My feature code is enclosed within "ifdef CONFIG_HUGETLBFS_RAS" and is only
  compiled on Linux hosts.
  Should we have a configure option to prevent the introduction of this
  feature in the code (turning off CONFIG_HUGETLBFS_RAS) ?

. Should I include the content of my system/hugetlbfs_ras.[ch] files into
  another existing file ?

. Should we force 'sharing' when using "-mem-path" option, instead of the
  -object memory-backend-file,share=on,... ?


This prototype is scripts/checkpatch.pl clean (except for the MAINTAINERS
update for the 2 added files).
'make check' runs fine on both x86 and ARM.
Unit tests have been done on Intel, AMD and ARM platforms.



William Roche (6):
  accel/kvm: SIGBUS handler should also deal with si_addr_lsb
  accel/kvm: Keep track of the HWPoisonPage sizes
  system/physmem: Remap memory pages on reset based on the page size
  system: Introducing hugetlbfs largepage RAS feature
  system/hugetlb_ras: Handle madvise SIGBUS signal on listener
  system/hugetlb_ras: Replay lost BUS_MCEERR_AO signals on VM resume

 accel/kvm/kvm-all.c      |  24 +-
 accel/stubs/kvm-stub.c   |   4 +-
 include/qemu/osdep.h     |   5 +-
 include/sysemu/kvm.h     |   7 +-
 include/sysemu/kvm_int.h |   3 +-
 meson.build              |   2 +
 system/cpus.c            |  15 +-
 system/hugetlbfs_ras.c   | 645 +++++++++++++++++++++++++++++++++++++++
 system/hugetlbfs_ras.h   |   4 +
 system/meson.build       |   1 +
 system/physmem.c         |  30 ++
 target/arm/kvm.c         |  15 +-
 target/i386/kvm/kvm.c    |  15 +-
 util/oslib-posix.c       |   3 +
 14 files changed, 753 insertions(+), 20 deletions(-)
 create mode 100644 system/hugetlbfs_ras.c
 create mode 100644 system/hugetlbfs_ras.h