Message ID | 20240910100216.2744078-1-william.roche@oracle.com
---|---
Series | hugetlbfs largepage RAS project
On 10.09.24 12:02, William Roche wrote:
> From: William Roche <william.roche@oracle.com>

Hi,

> Apologies for the noise; resending as I missed CC'ing the maintainers of the changed files
>
> Hello,
>
> This is a Qemu RFC to introduce the possibility to deal with hardware memory errors impacting hugetlbfs memory backed VMs. When using hugetlbfs large pages, any large page location being impacted by an HW memory error results in poisoning the entire page, suddenly making a large chunk of the VM memory unusable.
>
> The implemented proposal is simply a memory mapping change when an HW error is reported to Qemu, to transform a hugetlbfs large page into a set of standard sized pages. The failed large page is unmapped and a set of standard sized pages are mapped in place. This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is received by qemu and the reported location corresponds to a large page.
>
> This gives the possibility to:
> - Take advantage of newer hypervisor kernel providing a way to retrieve still valid data on the impacted hugetlbfs poisoned large page. If the backend file is MAP_SHARED, we can copy the valid data into the

How are you dealing with other consumers of the shared memory, such as vhost-user processes, vm migration whereby RAM is migrated using file content, vfio that might have these pages pinned?

In general, you cannot simply replace pages by private copies when somebody else might be relying on these pages to go to actual guest RAM.

It sounds very hacky and incomplete at first.
On 9/10/24 13:36, David Hildenbrand wrote:
> On 10.09.24 12:02, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>
> Hi,
>
>> Apologies for the noise; resending as I missed CC'ing the maintainers of the changed files
>>
>> Hello,
>>
>> This is a Qemu RFC to introduce the possibility to deal with hardware memory errors impacting hugetlbfs memory backed VMs. When using hugetlbfs large pages, any large page location being impacted by an HW memory error results in poisoning the entire page, suddenly making a large chunk of the VM memory unusable.
>>
>> The implemented proposal is simply a memory mapping change when an HW error is reported to Qemu, to transform a hugetlbfs large page into a set of standard sized pages. The failed large page is unmapped and a set of standard sized pages are mapped in place. This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is received by qemu and the reported location corresponds to a large page.
>>
>> This gives the possibility to:
>> - Take advantage of newer hypervisor kernel providing a way to retrieve still valid data on the impacted hugetlbfs poisoned large page. If the backend file is MAP_SHARED, we can copy the valid data into the

Thank you David for this first reaction on this proposal.

> How are you dealing with other consumers of the shared memory, such as vhost-user processes,

In the current proposal, I don't deal with this aspect. In fact, any other process sharing the changed memory will continue to map the poisoned large page. So any access to this page will generate a SIGBUS to this other process.

In this situation vhost-user processes should continue to receive SIGBUS signals (and probably continue to die because of that).

So I do see a real problem if 2 qemu processes are sharing the same hugetlbfs segment -- in this case, error recovery should not occur on this piece of the memory. Maybe dealing with this situation with "ivshmem" options is doable (marking the shared segment "not eligible" to hugetlbfs recovery, just like not "share=on" hugetlbfs entries are not eligible) -- I need to think about this specific case.

Please let me know if there is a better way to deal with this shared memory aspect and have a better system reaction.

> vm migration whereby RAM is migrated using file content,

Migration doesn't currently work with memory poisoning. You can give a look at the already integrated following commit:

  06152b89db64 migration: prevent migration when VM has poisoned memory

This proposal doesn't change anything on this side.

> vfio that might have these pages pinned?

AFAIK even pinned memory can be impacted by memory error and poisoned by the kernel. Now as I said in the cover letter, I'd like to know if we should take extra care for IO memory, vfio configured memory buffers...

> In general, you cannot simply replace pages by private copies when somebody else might be relying on these pages to go to actual guest RAM.

This is correct, but the current proposal is dealing with a specific shared memory type: poisoned large pages. So any other process mapping this type of page can't access it without generating a SIGBUS.

> It sounds very hacky and incomplete at first.

As you can see, RAS features need to be completed. And if this proposal is incomplete, what other changes should be done to complete it?

I do hope we can discuss this RFC to adapt what is incorrect, or find a better way to address this situation.

Thanks in advance for your feedback,
William.
Hi again,

>>> This is a Qemu RFC to introduce the possibility to deal with hardware memory errors impacting hugetlbfs memory backed VMs. When using hugetlbfs large pages, any large page location being impacted by an HW memory error results in poisoning the entire page, suddenly making a large chunk of the VM memory unusable.
>>>
>>> The implemented proposal is simply a memory mapping change when an HW error is reported to Qemu, to transform a hugetlbfs large page into a set of standard sized pages. The failed large page is unmapped and a set of standard sized pages are mapped in place. This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is received by qemu and the reported location corresponds to a large page.

One clarifying question: you simply replace the hugetlb page by multiple small pages using mmap(MAP_FIXED). So you

(a) are not able to recover any memory of the original page (as of now)
(b) no longer have a hugetlb page and, therefore, possibly a performance degradation, relevant in low-latency applications that really care about the usage of hugetlb pages.
(c) run into the described inconsistency issues

Why is what you propose beneficial over just fallocate(PUNCH_HOLE) the full page and get a fresh, non-poisoned page instead?

Sure, you have to reserve some pages if that ever happens, but what is the big selling point over PUNCH_HOLE + realloc? (sorry if I missed it and it was spelled out)

>>> This gives the possibility to:
>>> - Take advantage of newer hypervisor kernel providing a way to retrieve still valid data on the impacted hugetlbfs poisoned large page.

Reading that again, that shouldn't have to be hypervisor-specific. Really, if someone were to extract data from a poisoned hugetlb folio, it shouldn't be hypervisor-specific. The kernel should be able to know which regions are accessible and could allow ways for reading these, one way or the other.

It could just be a fairly hugetlb-special feature that would replace the poisoned page by a fresh hugetlb page where as much page content as possible has been recovered from the old one.

>>> If the backend file is MAP_SHARED, we can copy the valid data into the
>
> Thank you David for this first reaction on this proposal.
>
>> How are you dealing with other consumers of the shared memory, such as vhost-user processes,
>
> In the current proposal, I don't deal with this aspect. In fact, any other process sharing the changed memory will continue to map the poisoned large page. So any access to this page will generate a SIGBUS to this other process.
>
> In this situation vhost-user processes should continue to receive SIGBUS signals (and probably continue to die because of that).

That's ... suboptimal. :)

Assume you have a 1 GiB page. The guest OS can happily allocate buffers in there so they can end up in vhost-user and crash that process. Without any warning.

> So I do see a real problem if 2 qemu processes are sharing the same hugetlbfs segment -- in this case, error recovery should not occur on this piece of the memory. Maybe dealing with this situation with "ivshmem" options is doable (marking the shared segment "not eligible" to hugetlbfs recovery, just like not "share=on" hugetlbfs entries are not eligible) -- I need to think about this specific case.
>
> Please let me know if there is a better way to deal with this shared memory aspect and have a better system reaction.

Not creating the inconsistency in the first place :)

>> vm migration whereby RAM is migrated using file content,
>
> Migration doesn't currently work with memory poisoning. You can give a look at the already integrated following commit:
>
>   06152b89db64 migration: prevent migration when VM has poisoned memory
>
> This proposal doesn't change anything on this side.

That commit is fairly fresh and likely missed the option to *not* migrate RAM by reading it, but instead by migrating it through a shared file. For example, VM life-upgrade (CPR) wants to use that (or is already using that), to avoid RAM migration completely.

>> vfio that might have these pages pinned?
>
> AFAIK even pinned memory can be impacted by memory error and poisoned by the kernel. Now as I said in the cover letter, I'd like to know if we should take extra care for IO memory, vfio configured memory buffers...

Assume your GPU has a hugetlb folio pinned via vfio. As soon as you make the guest RAM point at anything else than what VFIO is aware of, we end up in the same problem we had when we learned about having to disable balloon inflation (MADVISE_DONTNEED) as soon as VFIO pinned pages.

We'd have to inform VFIO that the mapping is now different. Otherwise it's really better to crash the VM than having your GPU read/write different data than your CPU reads/writes.

>> In general, you cannot simply replace pages by private copies when somebody else might be relying on these pages to go to actual guest RAM.
>
> This is correct, but the current proposal is dealing with a specific shared memory type: poisoned large pages. So any other process mapping this type of page can't access it without generating a SIGBUS.

Right, and that's the issue. Because, for example, how should the VM be aware that this memory is now special and must not be used for some purposes without leading to problems elsewhere?

>> It sounds very hacky and incomplete at first.
>
> As you can see, RAS features need to be completed. And if this proposal is incomplete, what other changes should be done to complete it?
>
> I do hope we can discuss this RFC to adapt what is incorrect, or find a better way to address this situation.

One long-term goal people are working on is to allow remapping the hugetlb folios in smaller granularity, such that only a single affected PTE can be marked as poisoned. (used to be called high-granularity-mapping)

However, at the same time, the focus seems to shift towards using guest_memfd instead of hugetlb, once it supports 1 GiB pages and shared memory. It will likely be easier to support mapping 1 GiB pages using PTEs that way, and there are ongoing discussions how that can be achieved more easily.

There are also discussions [1] about not poisoning the mappings at all and handling it differently. But I haven't yet digested how exactly that could look like in reality.

[1] https://lkml.kernel.org/r/20240828234958.GE3773488@nvidia.com
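For reference, a minimal sketch of the fallocate(PUNCH_HOLE) + realloc alternative raised above, assuming a MAP_SHARED hugetlbfs backing file; the helper name, `fd`, `map_base` and `offset` are illustrative placeholders, not QEMU code:

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <string.h>     /* memset() */
#include <sys/types.h>

/*
 * Illustrative only: drop the poisoned huge page from the backing file
 * so that the next access faults in a fresh (zeroed) huge page.
 */
static int discard_poisoned_hugepage(int fd, void *map_base,
                                     off_t offset, size_t huge_sz)
{
    /* Punch a hole: releases the poisoned page backing this range. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, huge_sz) < 0) {
        return -1;
    }
    /* Optionally touch the range right away to fault in a new huge page
     * (its content is zero, so whatever lived here is gone). */
    memset((char *)map_base + offset, 0, huge_sz);
    return 0;
}
```

The trade-off under discussion: this keeps a huge mapping, consistent for every consumer of the shared file, but the whole page content is lost, whereas the series tries to preserve the still-valid data at the cost of splitting the mapping.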
On 9/12/24 00:07, David Hildenbrand wrote:
> Hi again,
>
>>>> This is a Qemu RFC to introduce the possibility to deal with hardware memory errors impacting hugetlbfs memory backed VMs. When using hugetlbfs large pages, any large page location being impacted by an HW memory error results in poisoning the entire page, suddenly making a large chunk of the VM memory unusable.
>>>>
>>>> The implemented proposal is simply a memory mapping change when an HW error is reported to Qemu, to transform a hugetlbfs large page into a set of standard sized pages. The failed large page is unmapped and a set of standard sized pages are mapped in place. This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is received by qemu and the reported location corresponds to a large page.
>
> One clarifying question: you simply replace the hugetlb page by multiple small pages using mmap(MAP_FIXED).

That's right.

> So you
>
> (a) are not able to recover any memory of the original page (as of now)

Once poisoned by the kernel, the original large page is entirely inaccessible, but the kernel can provide what remains of the poisoned hugetlbfs page through the backend file (when this file was mapped MAP_SHARED).

> (b) no longer have a hugetlb page and, therefore, possibly a performance degradation, relevant in low-latency applications that really care about the usage of hugetlb pages.

This is correct.

> (c) run into the described inconsistency issues

The inconsistency I agreed upon is the case of 2 qemu processes sharing a piece of the memory (through the ivshmem mechanism), which can be fixed by disabling recovery for ivshmem-associated hugetlbfs segments.

> Why is what you propose beneficial over just fallocate(PUNCH_HOLE) the full page and get a fresh, non-poisoned page instead?
>
> Sure, you have to reserve some pages if that ever happens, but what is the big selling point over PUNCH_HOLE + realloc? (sorry if I missed it and it was spelled out)

This project provides an essential component that can't be provided while keeping a large page to replace a failed large page: an uncorrected memory error on a memory page is a lost memory piece and needs to be identified as such for any user, to indicate the loss. The kernel granularity for that is the entire page. It marks it 'poisoned', making it inaccessible (no matter what the page size, or the lost memory piece size).

So recovering an area of a large page impacted by a memory error has to keep track of the lost area, and there is no other way but to lower the granularity and split the page into smaller pieces that can be marked 'poisoned' for the lost area.

That's the reason why we can't replace a failed large page with another large page. We need smaller pages.

>>>> This gives the possibility to:
>>>> - Take advantage of newer hypervisor kernel providing a way to retrieve still valid data on the impacted hugetlbfs poisoned large page.
>
> Reading that again, that shouldn't have to be hypervisor-specific. Really, if someone were to extract data from a poisoned hugetlb folio, it shouldn't be hypervisor-specific. The kernel should be able to know which regions are accessible and could allow ways for reading these, one way or the other.
>
> It could just be a fairly hugetlb-special feature that would replace the poisoned page by a fresh hugetlb page where as much page content as possible has been recovered from the old one.

I totally agree with the fact that it should be the kernel's role to split the page and keep track of the valid and lost pieces. This was an aspect of the high-granularity-mapping (HGM) project you are referring to. But HGM is not there yet (and may never be), and currently the only automatic memory split done by the kernel occurs when we are using Transparent Huge Pages (THP). Unfortunately THP doesn't show (for the moment) all the performance and memory optimisation possibilities that hugetlbfs use provides. And it's a large topic I'd prefer not to get into.

>>> How are you dealing with other consumers of the shared memory, such as vhost-user processes,
>>
>> In the current proposal, I don't deal with this aspect. In fact, any other process sharing the changed memory will continue to map the poisoned large page. So any access to this page will generate a SIGBUS to this other process.
>>
>> In this situation vhost-user processes should continue to receive SIGBUS signals (and probably continue to die because of that).
>
> That's ... suboptimal. :)

True.

> Assume you have a 1 GiB page. The guest OS can happily allocate buffers in there so they can end up in vhost-user and crash that process. Without any warning.

I confess that I don't know how/when and where vhost-user processes get their shared memory locations. But I agree that a recovered large page is currently not usable to associate new shared buffers between qemu and external processes.

Note that previously allocated buffers that could have been located on this page are marked 'poisoned' (after a memory error) on the vhost-user process side the same way they were before this project. The only difference is that, after a recovered memory error, qemu may continue to see the recovered address space and use it. But the receiving side (on vhost-user) will fail when accessing the location. Can a vhost-user process fail without any warning reported? I hope not.

>> So I do see a real problem if 2 qemu processes are sharing the same hugetlbfs segment -- in this case, error recovery should not occur on this piece of the memory. Maybe dealing with this situation with "ivshmem" options is doable (marking the shared segment "not eligible" to hugetlbfs recovery, just like not "share=on" hugetlbfs entries are not eligible) -- I need to think about this specific case.
>>
>> Please let me know if there is a better way to deal with this shared memory aspect and have a better system reaction.
>
> Not creating the inconsistency in the first place :)

Yes :)

Of course I don't want to introduce any inconsistency situation leading to a memory corruption. But if we consider that 'ivshmem' memory is not eligible for a recovery, it means that we still leave the entire large page location poisoned and there would not be any inconsistency for this memory component. Other hugetlbfs memory components would still have the possibility to be partially recovered, and give a higher chance to the VM not to crash immediately.

>>> vm migration whereby RAM is migrated using file content,
>>
>> Migration doesn't currently work with memory poisoning. You can give a look at the already integrated following commit:
>>
>>   06152b89db64 migration: prevent migration when VM has poisoned memory
>>
>> This proposal doesn't change anything on this side.
>
> That commit is fairly fresh and likely missed the option to *not* migrate RAM by reading it, but instead by migrating it through a shared file. For example, VM life-upgrade (CPR) wants to use that (or is already using that), to avoid RAM migration completely.

When a memory error occurs on a dirty page used for a mapped file, the data is lost and the file synchronisation should fail with EIO. You can't rely on the file content to reflect the latest memory content. So even a migration using such a file should be avoided, in my opinion.

>>> vfio that might have these pages pinned?
>>
>> AFAIK even pinned memory can be impacted by memory error and poisoned by the kernel. Now as I said in the cover letter, I'd like to know if we should take extra care for IO memory, vfio configured memory buffers...
>
> Assume your GPU has a hugetlb folio pinned via vfio. As soon as you make the guest RAM point at anything else than what VFIO is aware of, we end up in the same problem we had when we learned about having to disable balloon inflation (MADVISE_DONTNEED) as soon as VFIO pinned pages.
>
> We'd have to inform VFIO that the mapping is now different. Otherwise it's really better to crash the VM than having your GPU read/write different data than your CPU reads/writes.

Absolutely true, and fortunately this is not what would happen when the large poisoned page is still used by the VFIO. After a successful recovery, the CPU may still be able to read/write on a location where we had a vfio buffer, but the other side (the device for example) would fail reading or writing to any location of the poisoned large page.

>>> In general, you cannot simply replace pages by private copies when somebody else might be relying on these pages to go to actual guest RAM.
>>
>> This is correct, but the current proposal is dealing with a specific shared memory type: poisoned large pages. So any other process mapping this type of page can't access it without generating a SIGBUS.
>
> Right, and that's the issue. Because, for example, how should the VM be aware that this memory is now special and must not be used for some purposes without leading to problems elsewhere?

That's an excellent question, that I don't have the full answer to.

We are dealing here with a hardware fault situation; the hugetlbfs backend file still has a poisoned large page, so any attempt to map it in a process, or any process mapping it before the error, will not be able to use the segment. It doesn't mean that they get their own private copy of a page. The only one getting a private copy (to get what was still valid on the faulted large page) is qemu.

So if we imagine that ivshmem segments (between 2 qemu processes) don't get this recovery, I'm expecting the data exchange on this shared memory to fail, just like it would without the recovery mechanism. So I don't expect any established communication to continue to work, or any new segment using the recovered area to be successfully created. But of course I could be missing something here and be too optimistic...

So let's take a step back. I guess these "sharing" questions would not relate to memory segments that are not defined as 'share=on', am I right? Do ivshmem, vhost-user processes or even vfio only use 'share=on' memory segments? If yes, we could also imagine only enabling recovery for hugetlbfs segments that do not have the 'share=on' attribute, but we would have to map them MAP_SHARED in qemu address space anyway. This can maybe create other kinds of problems (?), but if these inconsistency questions do not appear with this approach it would be easy to adapt, and still enhance hugetlbfs use, for a first version of this feature.

>>> It sounds very hacky and incomplete at first.
>>
>> As you can see, RAS features need to be completed. And if this proposal is incomplete, what other changes should be done to complete it?
>>
>> I do hope we can discuss this RFC to adapt what is incorrect, or find a better way to address this situation.
>
> One long-term goal people are working on is to allow remapping the hugetlb folios in smaller granularity, such that only a single affected PTE can be marked as poisoned. (used to be called high-granularity-mapping)

I look forward to seeing this implemented, but it seems that it will take time to appear, and if hugetlbfs RAS can be enhanced for qemu it would be very useful. The day a kernel solution works, we can disable CONFIG_HUGETLBFS_RAS and rely on the kernel to provide the appropriate information. The first commits will continue to be necessary (dealing with the si_addr_lsb value of the SIGBUS siginfo, tracking the page size information in the hwpoison_page_list, and the memory remap on reset with the missing PUNCH_HOLE).

> However, at the same time, the focus seems to shift towards using guest_memfd instead of hugetlb, once it supports 1 GiB pages and shared memory. It will likely be easier to support mapping 1 GiB pages using PTEs that way, and there are ongoing discussions how that can be achieved more easily.
>
> There are also discussions [1] about not poisoning the mappings at all and handling it differently. But I haven't yet digested how exactly that could look like in reality.
>
> [1] https://lkml.kernel.org/r/20240828234958.GE3773488@nvidia.com

Thank you very much for this pointer. I hope a kernel solution (this one or another) can be implemented and widely adopted before the next 5 to 10 years ;)

In the meantime, we can try to enhance qemu's use of hugetlbfs for VM memory, which is more and more deployed.

Best regards,
William.
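For illustration, a minimal sketch of the recovery step described above: map standard sized pages over the failed huge page, copy back the chunks that the backend file still lets us read, and re-mark the lost chunk with MADV_HWPOISON. All names are hypothetical, error handling is simplified, and whether reads of the non-poisoned parts succeed (with EIO only for the lost chunk) depends on the kernel version, as the cover letter notes:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>   /* mmap(), madvise(), MADV_HWPOISON */
#include <unistd.h>     /* pread(), sysconf() */

/*
 * Illustrative only: replace one poisoned huge page at 'vaddr' (offset
 * 'offset' in the MAP_SHARED hugetlbfs file 'fd') with standard pages,
 * preserving whatever the kernel still lets us read back.
 */
static int remap_poisoned_hugepage(void *vaddr, int fd, off_t offset,
                                   size_t huge_sz)
{
    size_t small_sz = (size_t)sysconf(_SC_PAGESIZE);

    /* Map anonymous standard pages over the failed huge page
     * (the actual series may back them differently). */
    if (mmap(vaddr, huge_sz, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) {
        return -1;
    }

    for (size_t off = 0; off < huge_sz; off += small_sz) {
        ssize_t r = pread(fd, (char *)vaddr + off, small_sz, offset + off);
        if (r < 0 && errno == EIO) {
            /* This chunk is really lost: keep it marked as poisoned so the
             * guest still sees an error there (needs CAP_SYS_ADMIN). */
            if (madvise((char *)vaddr + off, small_sz, MADV_HWPOISON) < 0) {
                return -1;
            }
        }
    }
    return 0;
}
```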
Hello David,

I hope my last week email answered your interrogations about:
- retrieving the valid data from the lost hugepage
- the need of smaller pages to replace a failed large page
- the interaction of memory error and VM migration
- the non-symmetrical access to a poisoned memory area after a recovery: Qemu would be able to continue to access the still valid data location of the formerly poisoned hugepage, but any other entity mapping the large page would not be allowed to use the location.

I understand that this last item _is_ some kind of "inconsistency". So if I want to make sure that a "shared" memory region (used for vhost-user processes, vfio or ivshmem) is not recovered, how can I identify what region(s) of a guest memory could be used for such a shared location? Is there a way for qemu to identify the memory locations that have been shared?

Could you please let me know if there is an entry point I should consider?

Thanks in advance for your feedback.
William.
On Thu, Sep 19, 2024 at 06:52:37PM +0200, William Roche wrote:
> Hello David,
>
> I hope my last week email answered your interrogations about:
> - retrieving the valid data from the lost hugepage
> - the need of smaller pages to replace a failed large page
> - the interaction of memory error and VM migration
> - the non-symmetrical access to a poisoned memory area after a recovery: Qemu would be able to continue to access the still valid data location of the formerly poisoned hugepage, but any other entity mapping the large page would not be allowed to use the location.
>
> I understand that this last item _is_ some kind of "inconsistency". So if I want to make sure that a "shared" memory region (used for vhost-user processes, vfio or ivshmem) is not recovered, how can I identify what region(s) of a guest memory could be used for such a shared location? Is there a way for qemu to identify the memory locations that have been shared?

When there's no vIOMMU I think all guest pages need to be shared. When with vIOMMU it depends on what was mapped by the guest drivers, while in most sane setups they can still always be shared because the guest OS (if Linux) should normally have iommu=pt speeding up kernel drivers.

> Could you please let me know if there is an entry point I should consider?

IMHO it'll still be more reasonable that this issue be tackled from the kernel, not userspace, simply because it's a shared problem of all userspaces rather than the QEMU process alone.

With that, the kernel should guarantee consistency across the different processes accessing these pages, so logically all these complexities should better be done in the kernel once for all.

There are indeed difficulties in providing it in hugetlbfs with the mm community, and this is also not the only effort trying to fix 1G page poisoning with userspace workarounds, see:

  https://lore.kernel.org/r/20240924043924.3562257-1-jiaqiyan@google.com

My gut feeling is either hugetlbfs needs to be fixed (with less hope) or QEMU in general needs to move over to other file systems for consuming huge pages. Poisoning is not the only driving force; as David said, we at least also want to work out postcopy, which has a similar goal of being able to map hugetlbfs pages differently.

You may consider having a look at the gmemfd 1G proposal, posted here:

  https://lore.kernel.org/r/cover.1726009989.git.ackerleytng@google.com

We probably need that in one way or another for CoCo, and the chance is it can easily support non-CoCo with the same interface ultimately. Then 1G hugetlbfs can be abandoned in QEMU. It'll also need to tackle the same challenge here, either on page poisoning or postcopy, with/without QEMU's specific solution, because QEMU is also not the only userspace hypervisor.

That said, the initial few small patches seem to be standalone small fixes which may still be good. So if you think that's the case, you can at least consider sending them separately without the RFC tag.

Thanks,
On 10/9/24 17:45, Peter Xu wrote:
> On Thu, Sep 19, 2024 at 06:52:37PM +0200, William Roche wrote:
>> Hello David,
>>
>> I hope my last week email answered your interrogations about:
>> - retrieving the valid data from the lost hugepage
>> - the need of smaller pages to replace a failed large page
>> - the interaction of memory error and VM migration
>> - the non-symmetrical access to a poisoned memory area after a recovery: Qemu would be able to continue to access the still valid data location of the formerly poisoned hugepage, but any other entity mapping the large page would not be allowed to use the location.
>>
>> I understand that this last item _is_ some kind of "inconsistency". So if I want to make sure that a "shared" memory region (used for vhost-user processes, vfio or ivshmem) is not recovered, how can I identify what region(s) of a guest memory could be used for such a shared location? Is there a way for qemu to identify the memory locations that have been shared?
>
> When there's no vIOMMU I think all guest pages need to be shared. When with vIOMMU it depends on what was mapped by the guest drivers, while in most sane setups they can still always be shared because the guest OS (if Linux) should normally have iommu=pt speeding up kernel drivers.
>
>> Could you please let me know if there is an entry point I should consider?
>
> IMHO it'll still be more reasonable that this issue be tackled from the kernel, not userspace, simply because it's a shared problem of all userspaces rather than the QEMU process alone.
>
> With that, the kernel should guarantee consistency across the different processes accessing these pages, so logically all these complexities should better be done in the kernel once for all.
>
> There are indeed difficulties in providing it in hugetlbfs with the mm community, and this is also not the only effort trying to fix 1G page poisoning with userspace workarounds, see:
>
>   https://lore.kernel.org/r/20240924043924.3562257-1-jiaqiyan@google.com
>
> My gut feeling is either hugetlbfs needs to be fixed (with less hope) or QEMU in general needs to move over to other file systems for consuming huge pages. Poisoning is not the only driving force; as David said, we at least also want to work out postcopy, which has a similar goal of being able to map hugetlbfs pages differently.
>
> You may consider having a look at the gmemfd 1G proposal, posted here:
>
>   https://lore.kernel.org/r/cover.1726009989.git.ackerleytng@google.com
>
> We probably need that in one way or another for CoCo, and the chance is it can easily support non-CoCo with the same interface ultimately. Then 1G hugetlbfs can be abandoned in QEMU. It'll also need to tackle the same challenge here, either on page poisoning or postcopy, with/without QEMU's specific solution, because QEMU is also not the only userspace hypervisor.
>
> That said, the initial few small patches seem to be standalone small fixes which may still be good. So if you think that's the case, you can at least consider sending them separately without the RFC tag.
>
> Thanks,

Thank you very much Peter for your answer, pointers and explanations.

I understand and agree that having the kernel deal with huge page errors is a much better approach. Not an easy one...

I'll submit a trimmed down version of my first patches to fix some problems that currently exist in Qemu.

Thanks again,
William.
On 19.09.24 18:52, William Roche wrote:
> Hello David,

Hi William,

sorry for not replying earlier, it somehow fell through the cracks as my inbox got flooded :(

> I hope my last week email answered your interrogations about:
> - retrieving the valid data from the lost hugepage
> - the need of smaller pages to replace a failed large page
> - the interaction of memory error and VM migration
> - the non-symmetrical access to a poisoned memory area after a recovery: Qemu would be able to continue to access the still valid data location of the formerly poisoned hugepage, but any other entity mapping the large page would not be allowed to use the location.
>
> I understand that this last item _is_ some kind of "inconsistency".

That's my biggest concern. Physical memory and its properties are described by the QEMU RAMBlock, which includes page size, shared/private, and sometimes properties (e.g., uffd). Adding inconsistency there is really suboptimal :(

> So if I want to make sure that a "shared" memory region (used for vhost-user processes, vfio or ivshmem) is not recovered, how can I identify what region(s) of a guest memory could be used for such a shared location?
> Is there a way for qemu to identify the memory locations that have been shared?

I'll reply to your other cleanups/improvements, but we can detect if we must not discard arbitrary memory (because likely something is relying on long-term pinnings) using ram_block_discard_is_disabled().
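As a sketch of the check David points to, a recovery path could bail out when something relies on long-term pinnings. ram_block_discard_is_disabled() is an existing QEMU helper; the wrapper name below and the include path are assumptions, not code from this series:

```c
#include "qemu/osdep.h"
#include "exec/memory.h"    /* ram_block_discard_is_disabled() */

/*
 * Hypothetical guard: only attempt the large-page recovery when nobody
 * (e.g. vfio with long-term pinnings) depends on the current physical
 * pages staying in place.
 */
static bool hugetlbfs_ras_can_recover(void)
{
    if (ram_block_discard_is_disabled()) {
        /* Somebody (vfio, vdpa, ...) relies on the existing pages:
         * replacing them behind their back would be inconsistent. */
        return false;
    }
    return true;
}
```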
From: William Roche <william.roche@oracle.com>

Apologies for the noise; resending as I missed CC'ing the maintainers of the changed files

Hello,

This is a Qemu RFC to introduce the possibility to deal with hardware memory errors impacting hugetlbfs memory backed VMs. When using hugetlbfs large pages, any large page location being impacted by an HW memory error results in poisoning the entire page, suddenly making a large chunk of the VM memory unusable.

The implemented proposal is simply a memory mapping change when an HW error is reported to Qemu, to transform a hugetlbfs large page into a set of standard sized pages. The failed large page is unmapped and a set of standard sized pages are mapped in place. This mechanism is triggered when a SIGBUS/MCE_MCEERR_Ax signal is received by qemu and the reported location corresponds to a large page.

This gives the possibility to:

- Take advantage of newer hypervisor kernel providing a way to retrieve still valid data on the impacted hugetlbfs poisoned large page. If the backend file is MAP_SHARED, we can copy the valid data into the set of standard sized pages. But if an error is returned when accessing a location, we consider it poisoned and mark the corresponding standard sized memory page as poisoned with a MADV_HWPOISON madvise call. Hence, the VM can also continue to use the possibly valid pieces of information retrieved.

- Adjust the poison address information. When accessing a poison location, an older kernel version may only provide the address of the beginning of the poisoned large page in the associated SIGBUS siginfo data. Pointing to a more accurate touched poison location allows the VM kernel to trigger the right memory error reaction.

A warning is given for hugetlbfs backed memory-regions that are mapped without the 'share=on' option. (This warning is also given when using the deprecated "-mem-path" option.)

The hugetlbfs memory mapping option should look like this (with XXX replaced with the actual size):

  -object memory-backend-file,id=pc.ram,mem-path=/dev/hugepages,prealloc=on,share=on,size=XXX
  -machine memory-backend=pc.ram

I'm introducing new system/hugetlbfs_ras.[ch] files to separate the specific code for this feature. It is only compiled on Linux.

Note that we have to be able to mark a replacing valid standard sized page as "poisoned". We currently do that by calling madvise(..., MADV_HWPOISON). But this requires the qemu process to have the CAP_SYS_ADMIN privilege. Using userfaultfd instead of madvise() to mark the pages as poisoned could remove this constraint, at the cost of complicating the code by adding thread(s) dealing with the userspace page fault service.

It's also worth mentioning the case of IO memory and vfio configured memory buffers. The Qemu memory remapping (if it succeeds) will not reconfigure any device IO buffer locations (no dma unmap/remap is performed) and if a hardware IO is supposed to access (read or write) a poisoned hugetlbfs page, I would expect it to fail the same way as before (as its location hasn't been updated to take into account the new mapping). But can someone confirm this possible behavior? Or indicate to me what should be done to deal with this type of memory buffers?

Details:
--------

The following problems had to be considered:

. kvm dealing with memory faults:
  - Address space mapping changes can't be handled in a signal handler (mmap is not async signal safe, for example). We have a separate listener thread (only created when we use hugetlbfs) to deal with the mapping changes.
  - If a memory is not mapped when accessed, kvm fails with (exit_reason: KVM_EXIT_UNKNOWN). To avoid that, I needed to prevent the access to a changing memory region: pausing the VM is used to do so.
  - A fault on a poisoned hugetlbfs large page will report a hardcoded page size of 4k (see the kernel kvm_send_hwpoison_signal() function). When a SIGBUS is received with a page size indication of 4k we have to verify if the impacted page is not a hugetlbfs page.
  - Asynchronous SIGBUS/BUS_MCEERR_AO signals provide the right page size, but the current Qemu version needs to take the information into account.

. system/physmem needed fixes:
  - When recreating the memory mapping on VM reset, we have to consider the memory size impacted.
  - In the case of a mapped file, punching a hole is necessary to clean the poison.

. Implementation details:
  - A SIGBUS signal received for a large page will trigger the page modification, but in order to pause the VM, the signal handlers have to terminate. So we return from the SIGBUS signal handler(s) when a VM has to be stopped. A memory access that generated a SIGBUS/BUS_MCEERR_AR signal before the VM pause will be repeated when the VM resumes. If the memory is still not accessible (poisoned) the signal will be generated again by the hypervisor kernel. In the case of an asynchronous SIGBUS/BUS_MCEERR_AO signal, the signal is not repeated by the kernel and will be recorded by qemu in order to be replayed when the VM resumes.
  - Poisoning a memory page with MADV_HWPOISON can generate a SIGBUS when called. The listener thread taking care of the memory modification needs to deal with this case. To do so, it sets a thread specific variable that is recognized by the sigbus handler.

Some questions:
---------------

. Should we take extra care for IO memory, vfio configured memory buffers?

. My feature code is enclosed within "ifdef CONFIG_HUGETLBFS_RAS" and is only compiled on Linux. Should we have a configure option to prevent the introduction of this feature in the code (turning off CONFIG_HUGETLBFS_RAS)?

. Should I include the content of my system/hugetlbfs_ras.[ch] files into another existing file?

. Should we force 'sharing' when using the "-mem-path" option, instead of the -object memory-backend-file,share=on,... form?

This prototype is scripts/checkpatch.pl clean (except for the MAINTAINERS update for the 2 added files). 'make check' runs fine on both x86 and ARM. Unit tests have been done on Intel, AMD and ARM platforms.

William Roche (6):
  accel/kvm: SIGBUS handler should also deal with si_addr_lsb
  accel/kvm: Keep track of the HWPoisonPage sizes
  system/physmem: Remap memory pages on reset based on the page size
  system: Introducing hugetlbfs largepage RAS feature
  system/hugetlb_ras: Handle madvise SIGBUS signal on listener
  system/hugetlb_ras: Replay lost BUS_MCEERR_AO signals on VM resume

 accel/kvm/kvm-all.c      |  24 +-
 accel/stubs/kvm-stub.c   |   4 +-
 include/qemu/osdep.h     |   5 +-
 include/sysemu/kvm.h     |   7 +-
 include/sysemu/kvm_int.h |   3 +-
 meson.build              |   2 +
 system/cpus.c            |  15 +-
 system/hugetlbfs_ras.c   | 645 +++++++++++++++++++++++++++++++++++++++
 system/hugetlbfs_ras.h   |   4 +
 system/meson.build       |   1 +
 system/physmem.c         |  30 ++
 target/arm/kvm.c         |  15 +-
 target/i386/kvm/kvm.c    |  15 +-
 util/oslib-posix.c       |   3 +
 14 files changed, 753 insertions(+), 20 deletions(-)
 create mode 100644 system/hugetlbfs_ras.c
 create mode 100644 system/hugetlbfs_ras.h
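For illustration, a generic sketch of how a SIGBUS handler can use si_addr_lsb (the topic of the first patch above) to derive the reported page size; this is plain Linux signal handling, not code from the series:

```c
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Generic illustration: BUS_MCEERR_AR/AO reports carry the poisoned
 * address in si_addr and the page shift in si_addr_lsb, so a handler can
 * tell a 4k report from a 2M/1G hugetlbfs one.  As noted above, the
 * kernel hardcodes a 4k lsb for faults forwarded by KVM
 * (kvm_send_hwpoison_signal()), so the real backing page size still has
 * to be re-checked for such reports.
 */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
    if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
        uintptr_t addr = (uintptr_t)si->si_addr;
        size_t reported_size = (size_t)1 << si->si_addr_lsb;
        uintptr_t page_start = addr & ~((uintptr_t)reported_size - 1);

        /* async-signal-safe reporting/queuing omitted for brevity */
        (void)page_start;
    }
}
```

Judging by the diffstat, the series extends QEMU's existing SIGBUS path (system/cpus.c, accel/kvm, target/*/kvm) rather than installing a separate handler like this one.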