Message ID | 20241107102126.2183152-3-william.roche@oracle.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2,1/7] accel/kvm: Keep track of the HWPoisonPage page_size | expand |
On 07.11.24 11:21, “William Roche wrote: > From: William Roche <william.roche@oracle.com> > > We take into account the recorded page sizes to repair the > memory locations, calling ram_block_discard_range() to punch a hole > in the backend file when necessary and regenerate a usable memory. > Fall back to unmap/remap the memory location(s) if the kernel doesn't > support the madvise calls used by ram_block_discard_range(). > > Hugetlbfs poison case is also taken into account as a hole punch > with fallocate will reload a new page when first touched. > > Signed-off-by: William Roche <william.roche@oracle.com> > --- > system/physmem.c | 50 +++++++++++++++++++++++++++++------------------- > 1 file changed, 30 insertions(+), 20 deletions(-) > > diff --git a/system/physmem.c b/system/physmem.c > index 750604d47d..dfea120cc5 100644 > --- a/system/physmem.c > +++ b/system/physmem.c > @@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length) > } else if (xen_enabled()) { > abort(); > } else { > - flags = MAP_FIXED; > - flags |= block->flags & RAM_SHARED ? > - MAP_SHARED : MAP_PRIVATE; > - flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0; > - prot = PROT_READ; > - prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE; > - if (block->fd >= 0) { > - area = mmap(vaddr, length, prot, flags, block->fd, > - offset + block->fd_offset); > - } else { > - flags |= MAP_ANONYMOUS; > - area = mmap(vaddr, length, prot, flags, -1, 0); > - } > - if (area != vaddr) { > - error_report("Could not remap addr: " > - RAM_ADDR_FMT "@" RAM_ADDR_FMT "", > - length, addr); > - exit(1); > + if (ram_block_discard_range(block, offset + block->fd_offset, > + length) != 0) { > + if (length > TARGET_PAGE_SIZE) { > + /* punch hole is mandatory on hugetlbfs */ > + error_report("large page recovery failure addr: " > + RAM_ADDR_FMT "@" RAM_ADDR_FMT "", > + length, addr); > + exit(1); > + } For shared memory we really need it. Private file-backed is weird ... because we don't know if the shared or the private page is problematic ... :( Maybe we should just do: if (block->fd >= 0) { /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */ error_report(...); exit(-1); } Or alternatively if (block->fd >= 0 && qemu_ram_is_shared(block)) { /* mmap() cannot possibly zap our problematic page. */ error_report(...); exit(-1); } else if (block->fd >= 0) { /* * MAP_PRIVATE file-backed ... mmap() can only zap the private * page, not the shared one ... we don't know which one is * problematic. */ warn_report(...); } > + flags = MAP_FIXED; > + flags |= block->flags & RAM_SHARED ? > + MAP_SHARED : MAP_PRIVATE; > + flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0; > + prot = PROT_READ; > + prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE; > + if (block->fd >= 0) { > + area = mmap(vaddr, length, prot, flags, block->fd, > + offset + block->fd_offset); > + } else { > + flags |= MAP_ANONYMOUS; > + area = mmap(vaddr, length, prot, flags, -1, 0); > + } > + if (area != vaddr) { > + error_report("Could not remap addr: " > + RAM_ADDR_FMT "@" RAM_ADDR_FMT "", > + length, addr); > + exit(1); > + } > + memory_try_enable_merging(vaddr, length); > + qemu_ram_setup_dump(vaddr, length); Can we factor the mmap hack out into a separate helper function to clean this up a bit?
On 11/12/24 12:07, David Hildenbrand wrote: > On 07.11.24 11:21, “William Roche wrote: >> From: William Roche <william.roche@oracle.com> >> >> We take into account the recorded page sizes to repair the >> memory locations, calling ram_block_discard_range() to punch a hole >> in the backend file when necessary and regenerate a usable memory. >> Fall back to unmap/remap the memory location(s) if the kernel doesn't >> support the madvise calls used by ram_block_discard_range(). >> >> Hugetlbfs poison case is also taken into account as a hole punch >> with fallocate will reload a new page when first touched. >> >> Signed-off-by: William Roche <william.roche@oracle.com> >> --- >> system/physmem.c | 50 +++++++++++++++++++++++++++++------------------- >> 1 file changed, 30 insertions(+), 20 deletions(-) >> >> diff --git a/system/physmem.c b/system/physmem.c >> index 750604d47d..dfea120cc5 100644 >> --- a/system/physmem.c >> +++ b/system/physmem.c >> @@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, >> ram_addr_t length) >> } else if (xen_enabled()) { >> abort(); >> } else { >> - flags = MAP_FIXED; >> - flags |= block->flags & RAM_SHARED ? >> - MAP_SHARED : MAP_PRIVATE; >> - flags |= block->flags & RAM_NORESERVE ? >> MAP_NORESERVE : 0; >> - prot = PROT_READ; >> - prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE; >> - if (block->fd >= 0) { >> - area = mmap(vaddr, length, prot, flags, block->fd, >> - offset + block->fd_offset); >> - } else { >> - flags |= MAP_ANONYMOUS; >> - area = mmap(vaddr, length, prot, flags, -1, 0); >> - } >> - if (area != vaddr) { >> - error_report("Could not remap addr: " >> - RAM_ADDR_FMT "@" RAM_ADDR_FMT "", >> - length, addr); >> - exit(1); >> + if (ram_block_discard_range(block, offset + block- >> >fd_offset, >> + length) != 0) { >> + if (length > TARGET_PAGE_SIZE) { >> + /* punch hole is mandatory on hugetlbfs */ >> + error_report("large page recovery failure >> addr: " >> + RAM_ADDR_FMT "@" RAM_ADDR_FMT "", >> + length, addr); >> + exit(1); >> + } > > For shared memory we really need it. > > Private file-backed is weird ... because we don't know if the shared or > the private page is problematic ... :( I agree with you, and we have to decide when should we bail out if ram_block_discard_range() doesn't work. According to me, if discard doesn't work and we are dealing with file-backed largepages (shared or not) we have to exit, because the fallocate is mandatory. It is the case with hugetlbfs. In the non-file-backed case, or the file-backed non-largepage private case, according to me we can trust the mmap() method to put everything back in place for the VM reset to work as expected. Are there aspects I don't see, and for which mmap + the remap handler is not sufficient and we should also bail out here ? > > Maybe we should just do: > > if (block->fd >= 0) { > /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */ > error_report(...); > exit(-1); > } > > Or alternatively > > if (block->fd >= 0 && qemu_ram_is_shared(block)) { > /* mmap() cannot possibly zap our problematic page. */ > error_report(...); > exit(-1); > } else if (block->fd >= 0) { > /* > * MAP_PRIVATE file-backed ... mmap() can only zap the private > * page, not the shared one ... we don't know which one is > * problematic. > */ > warn_report(...); > } I also agree that any file-backed/shared case should bail out if discard (fallocate) fails, no mater large or standard pages are used. In the case of file-backed private standard pages, I think that a poison on the private page can be fixed with a new mmap. According to me, there are 2 cases to consider: at the moment the poison is seen, the page was dirty (so it means that it was a pure private page), or the page was not dirty, and in this case the poison could replace this non-dirty page with a new copy of the file content. In both cases, I'd say that the remap should clean up the poison. So the conditions when discard fails, could be something like: if (block->fd >= 0 && (qemu_ram_is_shared(block) || (length > TARGET_PAGE_SIZE))) { /* punch hole is mandatory, mmap() cannot possibly zap our page*/ error_report("%spage recovery failure addr: " RAM_ADDR_FMT "@" RAM_ADDR_FMT "", (length > TARGET_PAGE_SIZE) ? "large " : "", length, addr); exit(1); } >> + flags = MAP_FIXED; >> + flags |= block->flags & RAM_SHARED ? >> + MAP_SHARED : MAP_PRIVATE; >> + flags |= block->flags & RAM_NORESERVE ? >> MAP_NORESERVE : 0; >> + prot = PROT_READ; >> + prot |= block->flags & RAM_READONLY ? 0 : >> PROT_WRITE; >> + if (block->fd >= 0) { >> + area = mmap(vaddr, length, prot, flags, >> block->fd, >> + offset + block->fd_offset); >> + } else { >> + flags |= MAP_ANONYMOUS; >> + area = mmap(vaddr, length, prot, flags, -1, 0); >> + } >> + if (area != vaddr) { >> + error_report("Could not remap addr: " >> + RAM_ADDR_FMT "@" RAM_ADDR_FMT "", >> + length, addr); >> + exit(1); >> + } >> + memory_try_enable_merging(vaddr, length); >> + qemu_ram_setup_dump(vaddr, length); > > Can we factor the mmap hack out into a separate helper function to clean > this up a bit? Sure, I'll do that.
>> For shared memory we really need it. >> >> Private file-backed is weird ... because we don't know if the shared or >> the private page is problematic ... :( > > > I agree with you, and we have to decide when should we bail out if > ram_block_discard_range() doesn't work. > According to me, if discard doesn't work and we are dealing with > file-backed largepages (shared or not) we have to exit, because the > fallocate is mandatory. It is the case with hugetlbfs. > > In the non-file-backed case, or the file-backed non-largepage private > case, according to me we can trust the mmap() method to put everything > back in place for the VM reset to work as expected. > Are there aspects I don't see, and for which mmap + the remap handler is > not sufficient and we should also bail out here ? mmap() will only zap anonymous pages, no pagecache pages. See below. >> >> Maybe we should just do: >> >> if (block->fd >= 0) { >> /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */ >> error_report(...); >> exit(-1); >> } >> >> Or alternatively >> >> if (block->fd >= 0 && qemu_ram_is_shared(block)) { >> /* mmap() cannot possibly zap our problematic page. */ >> error_report(...); >> exit(-1); >> } else if (block->fd >= 0) { >> /* >> * MAP_PRIVATE file-backed ... mmap() can only zap the private >> * page, not the shared one ... we don't know which one is >> * problematic. >> */ >> warn_report(...); >> } > > I also agree that any file-backed/shared case should bail out if discard > (fallocate) fails, no mater large or standard pages are used. > > In the case of file-backed private standard pages, I think that a poison > on the private page can be fixed with a new mmap. > According to me, there are 2 cases to consider: at the moment the poison > is seen, the page was dirty (so it means that it was a pure private > page), or the page was not dirty, and in this case the poison could > replace this non-dirty page with a new copy of the file content. > In both cases, I'd say that the remap should clean up the poison. Let's assume we have mmap(MAP_RIVATE, fd). The following scenarios are possible: (a) We only have a pagecache page (never written) that is poisoned -> mmap(MAP_FIXED) cannot resolve that (b) We only have an anonymous page (e.g., pagecache truncated, or if the hugetlb file was empty) that is poisoned -> mmap(MAP_FIXED) can resolve that (c) We have an anonymous and a pagecache page (written -> COW). (c1) Anonymous page is poisoned -> mmap(MAP_FIXED) can resolve that (c2) Pagecache page is poisoned -> mmap(MAP_FIXED) cannot resolve that So mmap(MAP_FIXED) cannot sort out all cases. In practice, (a) and (c2) are uncommon, but possible. (b) is common with hugetlb. (a) and (c) are uncommon with hugetlb, just because of the nature of hugetlb pages being a scarce resource. And IIRC, (b) with hugetlb should should be sorted out with mmap(MAP_FIXED). Please double-check. > > So the conditions when discard fails, could be something like: > > if (block->fd >= 0 && (qemu_ram_is_shared(block) || > (length > TARGET_PAGE_SIZE))) { > /* punch hole is mandatory, mmap() cannot possibly zap our page*/ > error_report("%spage recovery failure addr: " > RAM_ADDR_FMT "@" RAM_ADDR_FMT "", > (length > TARGET_PAGE_SIZE) ? "large " : "", > length, addr); I'm not sure if we should be special-casing hugetlb. If we want to be 100% sure, we will do if (block->fd >= 0) { error_report(); exit(1); } But we could decide to be "nice" to hugetlb and assume (b) for them above: that is, we would do /* * mmap() cannot zap pagecache pages, only anonymous pages. As soon as * we might have pagecache pages involved (either private or shared * mapping), we must be careful. However, MAP_PRIVATE on empty hugetlb * files is common, and extremely uncommon on non-empty hugetlb files, * so we'll special-case them here. */ if (block->fd >= 0 && (qemu_ram_is_shared(block) || length == TARGET_PAGE_SIZE))) { ... } [in practice, we could use /proc/self/pagemap to see if we map an anonymous page ... but I'd rather not go down that path just yet] But, in the end the expectation is that madvise()+fallocate() will usually not fail.
diff --git a/system/physmem.c b/system/physmem.c index 750604d47d..dfea120cc5 100644 --- a/system/physmem.c +++ b/system/physmem.c @@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length) } else if (xen_enabled()) { abort(); } else { - flags = MAP_FIXED; - flags |= block->flags & RAM_SHARED ? - MAP_SHARED : MAP_PRIVATE; - flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0; - prot = PROT_READ; - prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE; - if (block->fd >= 0) { - area = mmap(vaddr, length, prot, flags, block->fd, - offset + block->fd_offset); - } else { - flags |= MAP_ANONYMOUS; - area = mmap(vaddr, length, prot, flags, -1, 0); - } - if (area != vaddr) { - error_report("Could not remap addr: " - RAM_ADDR_FMT "@" RAM_ADDR_FMT "", - length, addr); - exit(1); + if (ram_block_discard_range(block, offset + block->fd_offset, + length) != 0) { + if (length > TARGET_PAGE_SIZE) { + /* punch hole is mandatory on hugetlbfs */ + error_report("large page recovery failure addr: " + RAM_ADDR_FMT "@" RAM_ADDR_FMT "", + length, addr); + exit(1); + } + flags = MAP_FIXED; + flags |= block->flags & RAM_SHARED ? + MAP_SHARED : MAP_PRIVATE; + flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0; + prot = PROT_READ; + prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE; + if (block->fd >= 0) { + area = mmap(vaddr, length, prot, flags, block->fd, + offset + block->fd_offset); + } else { + flags |= MAP_ANONYMOUS; + area = mmap(vaddr, length, prot, flags, -1, 0); + } + if (area != vaddr) { + error_report("Could not remap addr: " + RAM_ADDR_FMT "@" RAM_ADDR_FMT "", + length, addr); + exit(1); + } + memory_try_enable_merging(vaddr, length); + qemu_ram_setup_dump(vaddr, length); } - memory_try_enable_merging(vaddr, length); - qemu_ram_setup_dump(vaddr, length); } } }