
[v2,2/7] system/physmem: poisoned memory discard on reboot

Message ID 20241107102126.2183152-3-william.roche@oracle.com (mailing list archive)
State New
Series [v2,1/7] accel/kvm: Keep track of the HWPoisonPage page_size

Commit Message

William Roche Nov. 7, 2024, 10:21 a.m. UTC
From: William Roche <william.roche@oracle.com>

We take the recorded page sizes into account to repair the
memory locations, calling ram_block_discard_range() to punch a hole
in the backend file when necessary and regenerate usable memory.
We fall back to unmapping and remapping the memory location(s) if the
kernel doesn't support the madvise calls used by ram_block_discard_range().

The hugetlbfs poison case is also taken into account, as punching a hole
with fallocate makes a fresh page load when the location is first touched.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 system/physmem.c | 50 +++++++++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 20 deletions(-)
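
As background, the discard boils down to a hole punch plus a madvise call; a
minimal sketch of the idea follows (an illustration only, assuming a Linux
file-backed region; the real ram_block_discard_range() in system/physmem.c
handles more cases and flags):

    /*
     * Sketch of the discard idea: punch a hole in the backing file so the
     * poisoned page is dropped from the pagecache, and zap the mapped copy
     * with madvise(). A clean page is faulted back in on the next access.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>

    static int discard_range(int fd, void *vaddr, off_t offset, size_t length)
    {
        if (fd >= 0 &&
            fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      offset, length) != 0) {
            return -1;  /* no hole punch support: caller must fall back */
        }
        /* Anonymous-only memory just needs the madvise() to drop pages. */
        return madvise(vaddr, length, MADV_DONTNEED);
    }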

Comments

David Hildenbrand Nov. 12, 2024, 11:07 a.m. UTC | #1
On 07.11.24 11:21, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> We take the recorded page sizes into account to repair the
> memory locations, calling ram_block_discard_range() to punch a hole
> in the backend file when necessary and regenerate usable memory.
> We fall back to unmapping and remapping the memory location(s) if the
> kernel doesn't support the madvise calls used by ram_block_discard_range().
> 
> The hugetlbfs poison case is also taken into account, as punching a hole
> with fallocate makes a fresh page load when the location is first touched.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>
> ---
>   system/physmem.c | 50 +++++++++++++++++++++++++++++-------------------
>   1 file changed, 30 insertions(+), 20 deletions(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 750604d47d..dfea120cc5 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>               } else if (xen_enabled()) {
>                   abort();
>               } else {
> -                flags = MAP_FIXED;
> -                flags |= block->flags & RAM_SHARED ?
> -                         MAP_SHARED : MAP_PRIVATE;
> -                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> -                prot = PROT_READ;
> -                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> -                if (block->fd >= 0) {
> -                    area = mmap(vaddr, length, prot, flags, block->fd,
> -                                offset + block->fd_offset);
> -                } else {
> -                    flags |= MAP_ANONYMOUS;
> -                    area = mmap(vaddr, length, prot, flags, -1, 0);
> -                }
> -                if (area != vaddr) {
> -                    error_report("Could not remap addr: "
> -                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> -                                 length, addr);
> -                    exit(1);
> +                if (ram_block_discard_range(block, offset + block->fd_offset,
> +                                            length) != 0) {
> +                    if (length > TARGET_PAGE_SIZE) {
> +                        /* punch hole is mandatory on hugetlbfs */
> +                        error_report("large page recovery failure addr: "
> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     length, addr);
> +                        exit(1);
> +                    }

For shared memory we really need it.

Private file-backed is weird ... because we don't know if the shared or 
the private page is problematic ... :(

Maybe we should just do:

if (block->fd >= 0) {
	/* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
	error_report(...);
	exit(-1);
}

Or alternatively

if (block->fd >= 0 && qemu_ram_is_shared(block)) {
	/* mmap() cannot possibly zap our problematic page. */
	error_report(...);
	exit(-1);
} else if (block->fd >= 0) {
	/*
	 * MAP_PRIVATE file-backed ... mmap() can only zap the private
	 * page, not the shared one ... we don't know which one is
	 * problematic.
	 */
	warn_report(...);
}


> +                    flags = MAP_FIXED;
> +                    flags |= block->flags & RAM_SHARED ?
> +                             MAP_SHARED : MAP_PRIVATE;
> +                    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
> +                    prot = PROT_READ;
> +                    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
> +                    if (block->fd >= 0) {
> +                        area = mmap(vaddr, length, prot, flags, block->fd,
> +                                    offset + block->fd_offset);
> +                    } else {
> +                        flags |= MAP_ANONYMOUS;
> +                        area = mmap(vaddr, length, prot, flags, -1, 0);
> +                    }
> +                    if (area != vaddr) {
> +                        error_report("Could not remap addr: "
> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
> +                                     length, addr);
> +                        exit(1);
> +                    }
> +                    memory_try_enable_merging(vaddr, length);
> +                    qemu_ram_setup_dump(vaddr, length);

Can we factor the mmap hack out into a separate helper function to clean 
this up a bit?
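
For illustration, the factored-out fallback could look roughly like the
following; the helper name qemu_ram_remap_mmap() is invented here, and the
body is just the existing patch code moved into a function:

    /* Hypothetical shape of the factored-out mmap() fallback. */
    static void qemu_ram_remap_mmap(RAMBlock *block, ram_addr_t addr,
                                    ram_addr_t offset, void *vaddr,
                                    ram_addr_t length)
    {
        int flags = MAP_FIXED;
        int prot = PROT_READ;
        void *area;

        flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE;
        flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
        prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
        if (block->fd >= 0) {
            area = mmap(vaddr, length, prot, flags, block->fd,
                        offset + block->fd_offset);
        } else {
            flags |= MAP_ANONYMOUS;
            area = mmap(vaddr, length, prot, flags, -1, 0);
        }
        if (area != vaddr) {
            error_report("Could not remap addr: "
                         RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
                         length, addr);
            exit(1);
        }
        memory_try_enable_merging(vaddr, length);
        qemu_ram_setup_dump(vaddr, length);
    }
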
William Roche Nov. 12, 2024, 6:17 p.m. UTC | #2
On 11/12/24 12:07, David Hildenbrand wrote:
> On 07.11.24 11:21, William Roche wrote:
>> From: William Roche <william.roche@oracle.com>
>>
>> We take the recorded page sizes into account to repair the
>> memory locations, calling ram_block_discard_range() to punch a hole
>> in the backend file when necessary and regenerate usable memory.
>> We fall back to unmapping and remapping the memory location(s) if the
>> kernel doesn't support the madvise calls used by ram_block_discard_range().
>>
>> The hugetlbfs poison case is also taken into account, as punching a hole
>> with fallocate makes a fresh page load when the location is first touched.
>>
>> Signed-off-by: William Roche <william.roche@oracle.com>
>> ---
>>   system/physmem.c | 50 +++++++++++++++++++++++++++++-------------------
>>   1 file changed, 30 insertions(+), 20 deletions(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 750604d47d..dfea120cc5 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
>>               } else if (xen_enabled()) {
>>                   abort();
>>               } else {
>> -                flags = MAP_FIXED;
>> -                flags |= block->flags & RAM_SHARED ?
>> -                         MAP_SHARED : MAP_PRIVATE;
>> -                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
>> -                prot = PROT_READ;
>> -                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>> -                if (block->fd >= 0) {
>> -                    area = mmap(vaddr, length, prot, flags, block->fd,
>> -                                offset + block->fd_offset);
>> -                } else {
>> -                    flags |= MAP_ANONYMOUS;
>> -                    area = mmap(vaddr, length, prot, flags, -1, 0);
>> -                }
>> -                if (area != vaddr) {
>> -                    error_report("Could not remap addr: "
>> -                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> -                                 length, addr);
>> -                    exit(1);
>> +                if (ram_block_discard_range(block, offset + block->fd_offset,
>> +                                            length) != 0) {
>> +                    if (length > TARGET_PAGE_SIZE) {
>> +                        /* punch hole is mandatory on hugetlbfs */
>> +                        error_report("large page recovery failure 
>> addr: "
>> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> +                                     length, addr);
>> +                        exit(1);
>> +                    }
> 
> For shared memory we really need it.
> 
> Private file-backed is weird ... because we don't know if the shared or 
> the private page is problematic ... :(


I agree with you, and we have to decide when we should bail out if
ram_block_discard_range() doesn't work.
In my view, if discard doesn't work and we are dealing with
file-backed large pages (shared or not), we have to exit, because the
fallocate is mandatory. That is the case with hugetlbfs.

In the non-file-backed case, or the file-backed non-largepage private
case, I think we can trust the mmap() method to put everything
back in place for the VM reset to work as expected.
Are there aspects I'm missing, for which mmap + the remap handler is
not sufficient and we should also bail out here?



> 
> Maybe we should just do:
> 
> if (block->fd >= 0) {
>      /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
>      error_report(...);
>      exit(-1);
> }
> 
> Or alternatively
> 
> if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>      /* mmap() cannot possibly zap our problematic page. */
>      error_report(...);
>      exit(-1);
> } else if (block->fd >= 0) {
>      /*
>       * MAP_PRIVATE file-backed ... mmap() can only zap the private
>       * page, not the shared one ... we don't know which one is
>       * problematic.
>       */
>      warn_report(...);
> }

I also agree that any file-backed/shared case should bail out if discard
(fallocate) fails, no matter whether large or standard pages are used.

In the case of file-backed private standard pages, I think that a poison
on the private page can be fixed with a new mmap.
There are two cases to consider: at the moment the poison is seen, either
the page was dirty (which means it was a pure private page), or the page
was not dirty, in which case the recovery can replace this non-dirty page
with a new copy of the file content.
In both cases, I'd say that the remap should clean up the poison.

So the conditions when discard fails could be something like:

    if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
        (length > TARGET_PAGE_SIZE))) {
        /* punch hole is mandatory, mmap() cannot possibly zap our page */
        error_report("%spage recovery failure addr: "
                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
                     (length > TARGET_PAGE_SIZE) ? "large " : "",
                     length, addr);
        exit(1);
    }


>> +                    flags = MAP_FIXED;
>> +                    flags |= block->flags & RAM_SHARED ?
>> +                             MAP_SHARED : MAP_PRIVATE;
>> +                    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
>> +                    prot = PROT_READ;
>> +                    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
>> +                    if (block->fd >= 0) {
>> +                        area = mmap(vaddr, length, prot, flags, block->fd,
>> +                                    offset + block->fd_offset);
>> +                    } else {
>> +                        flags |= MAP_ANONYMOUS;
>> +                        area = mmap(vaddr, length, prot, flags, -1, 0);
>> +                    }
>> +                    if (area != vaddr) {
>> +                        error_report("Could not remap addr: "
>> +                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>> +                                     length, addr);
>> +                        exit(1);
>> +                    }
>> +                    memory_try_enable_merging(vaddr, length);
>> +                    qemu_ram_setup_dump(vaddr, length);
> 
> Can we factor the mmap hack out into a separate helper function to clean 
> this up a bit?

Sure, I'll do that.
David Hildenbrand Nov. 12, 2024, 10:06 p.m. UTC | #3
>> For shared memory we really need it.
>>
>> Private file-backed is weird ... because we don't know if the shared or
>> the private page is problematic ... :(
> 
> 
> I agree with you, and we have to decide when we should bail out if
> ram_block_discard_range() doesn't work.
> In my view, if discard doesn't work and we are dealing with
> file-backed large pages (shared or not), we have to exit, because the
> fallocate is mandatory. That is the case with hugetlbfs.
> 
> In the non-file-backed case, or the file-backed non-largepage private
> case, I think we can trust the mmap() method to put everything
> back in place for the VM reset to work as expected.
> Are there aspects I'm missing, for which mmap + the remap handler is
> not sufficient and we should also bail out here?

mmap() will only zap anonymous pages, not pagecache pages. See below.

>>
>> Maybe we should just do:
>>
>> if (block->fd >= 0) {
>>       /* mmap(MAP_FIXED) cannot reliably zap our problematic page. */
>>       error_report(...);
>>       exit(-1);
>> }
>>
>> Or alternatively
>>
>> if (block->fd >= 0 && qemu_ram_is_shared(block)) {
>>       /* mmap() cannot possibly zap our problematic page. */
>>       error_report(...);
>>       exit(-1);
>> } else if (block->fd >= 0) {
>>       /*
>>        * MAP_PRIVATE file-backed ... mmap() can only zap the private
>>        * page, not the shared one ... we don't know which one is
>>        * problematic.
>>        */
>>       warn_report(...);
>> }
> 
> I also agree that any file-backed/shared case should bail out if discard
> (fallocate) fails, no matter whether large or standard pages are used.
> 
> In the case of file-backed private standard pages, I think that a poison
> on the private page can be fixed with a new mmap.
> There are two cases to consider: at the moment the poison is seen, either
> the page was dirty (which means it was a pure private page), or the page
> was not dirty, in which case the recovery can replace this non-dirty page
> with a new copy of the file content.
> In both cases, I'd say that the remap should clean up the poison.

Let's assume we have mmap(MAP_PRIVATE, fd). The following scenarios are
possible:

(a) We only have a pagecache page (never written) that is poisoned
	-> mmap(MAP_FIXED) cannot resolve that

(b) We only have an anonymous page (e.g., pagecache truncated, or if the
     hugetlb file was empty) that is poisoned
	-> mmap(MAP_FIXED) can resolve that

(c) We have an anonymous and a pagecache page (written -> COW).
(c1) Anonymous page is poisoned -> mmap(MAP_FIXED) can resolve that
(c2) Pagecache page is poisoned -> mmap(MAP_FIXED) cannot resolve that


So mmap(MAP_FIXED) cannot sort out all cases. In practice, (a) and (c2) 
are uncommon, but possible.

(b) is common with hugetlb. (a) and (c) are uncommon with hugetlb, just 
because of the nature of hugetlb pages being a scarce resource.

And IIRC, (b) with hugetlb should be sorted out with
mmap(MAP_FIXED). Please double-check.
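
One way to double-check that is a standalone test; this sketch assumes a
hugetlbfs mount at /dev/hugepages, 2M huge pages, and at least one free
huge page (all assumptions of the sketch, not facts from the thread):

    /*
     * Double-check for case (b): dirty a MAP_PRIVATE hugetlb mapping
     * (allocating an anonymous huge page), remap it in place with
     * MAP_FIXED, and see whether the write is gone.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HPAGE (2 * 1024 * 1024)

    int main(void)
    {
        int fd = open("/dev/hugepages/remap-test", O_CREAT | O_RDWR, 0600);
        unsigned char *p;

        if (fd < 0 || ftruncate(fd, HPAGE) != 0) {
            perror("setup");
            return 1;
        }
        p = mmap(NULL, HPAGE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        p[0] = 0x42;    /* CoW fault: now backed by an anonymous huge page */
        /* Remap the same range in place, as qemu_ram_remap() would. */
        if (mmap(p, HPAGE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_FIXED, fd, 0) != p) {
            perror("mmap(MAP_FIXED)");
            return 1;
        }
        printf("after remap: %#x (0 means the anonymous page was zapped)\n",
               p[0]);
        unlink("/dev/hugepages/remap-test");
        return 0;
    }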

> 
> So the conditions when discard fails could be something like:
> 
>     if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
>         (length > TARGET_PAGE_SIZE))) {
>         /* punch hole is mandatory, mmap() cannot possibly zap our page */
>         error_report("%spage recovery failure addr: "
>                      RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
>                      (length > TARGET_PAGE_SIZE) ? "large " : "",

I'm not sure if we should be special-casing hugetlb.

If we want to be 100% sure, we will do

if (block->fd >= 0) {
	error_report();
	exit(1);
}

But we could decide to be "nice" to hugetlb and assume (b) for them 
above: that is, we would do

/*
 * mmap() cannot zap pagecache pages, only anonymous pages. As soon as
 * we might have pagecache pages involved (either private or shared
 * mapping), we must be careful. However, MAP_PRIVATE on empty hugetlb
 * files is common, and extremely uncommon on non-empty hugetlb files,
 * so we'll special-case them here.
 */
if (block->fd >= 0 && (qemu_ram_is_shared(block) ||
    length == TARGET_PAGE_SIZE)) {
	...
}

[in practice, we could use /proc/self/pagemap to see if we map an 
anonymous page ... but I'd rather not go down that path just yet]
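
For reference, that pagemap lookup would go roughly like this (a sketch
only; KPF_ANON is bit 12 of /proc/kpageflags, and the caveats above apply):

    /*
     * Sketch of the /proc/self/pagemap idea: find the PFN backing vaddr,
     * then test KPF_ANON in /proc/kpageflags. Returns 1 for an anonymous
     * page, 0 for a file/pagecache page, -1 on error or if the page is
     * not present. Unprivileged readers see zeroed PFNs, which is one
     * reason not to go down this path.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    static int vaddr_is_anon(uintptr_t vaddr)
    {
        uint64_t ent, flags;
        long psize = sysconf(_SC_PAGESIZE);
        int pm = open("/proc/self/pagemap", O_RDONLY);
        int kf = open("/proc/kpageflags", O_RDONLY);
        int ret = -1;

        if (pm >= 0 && kf >= 0 &&
            pread(pm, &ent, sizeof(ent), vaddr / psize * 8) == sizeof(ent) &&
            (ent & (1ULL << 63)) &&  /* bit 63: page present */
            pread(kf, &flags, sizeof(flags),
                  (ent & ((1ULL << 55) - 1)) * 8) == sizeof(flags)) {
            ret = !!(flags & (1ULL << 12));  /* bit 12: KPF_ANON */
        }
        if (pm >= 0) {
            close(pm);
        }
        if (kf >= 0) {
            close(kf);
        }
        return ret;
    }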

But in the end, the expectation is that madvise()+fallocate() will
usually not fail.

Patch

diff --git a/system/physmem.c b/system/physmem.c
index 750604d47d..dfea120cc5 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2197,27 +2197,37 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
             } else if (xen_enabled()) {
                 abort();
             } else {
-                flags = MAP_FIXED;
-                flags |= block->flags & RAM_SHARED ?
-                         MAP_SHARED : MAP_PRIVATE;
-                flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
-                prot = PROT_READ;
-                prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
-                if (block->fd >= 0) {
-                    area = mmap(vaddr, length, prot, flags, block->fd,
-                                offset + block->fd_offset);
-                } else {
-                    flags |= MAP_ANONYMOUS;
-                    area = mmap(vaddr, length, prot, flags, -1, 0);
-                }
-                if (area != vaddr) {
-                    error_report("Could not remap addr: "
-                                 RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
-                                 length, addr);
-                    exit(1);
+                if (ram_block_discard_range(block, offset + block->fd_offset,
+                                            length) != 0) {
+                    if (length > TARGET_PAGE_SIZE) {
+                        /* punch hole is mandatory on hugetlbfs */
+                        error_report("large page recovery failure addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
+                    flags = MAP_FIXED;
+                    flags |= block->flags & RAM_SHARED ?
+                             MAP_SHARED : MAP_PRIVATE;
+                    flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0;
+                    prot = PROT_READ;
+                    prot |= block->flags & RAM_READONLY ? 0 : PROT_WRITE;
+                    if (block->fd >= 0) {
+                        area = mmap(vaddr, length, prot, flags, block->fd,
+                                    offset + block->fd_offset);
+                    } else {
+                        flags |= MAP_ANONYMOUS;
+                        area = mmap(vaddr, length, prot, flags, -1, 0);
+                    }
+                    if (area != vaddr) {
+                        error_report("Could not remap addr: "
+                                     RAM_ADDR_FMT "@" RAM_ADDR_FMT "",
+                                     length, addr);
+                        exit(1);
+                    }
+                    memory_try_enable_merging(vaddr, length);
+                    qemu_ram_setup_dump(vaddr, length);
                 }
-                memory_try_enable_merging(vaddr, length);
-                qemu_ram_setup_dump(vaddr, length);
             }
         }
     }