Message ID: 1620390320-301716-1-git-send-email-steven.sistare@oracle.com
Series: Live Update
Patchew URL: https://patchew.org/QEMU/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/

Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 1620390320-301716-1-git-send-email-steven.sistare@oracle.com
Subject: [PATCH V3 00/22] Live Update

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]    patchew/1620390320-301716-1-git-send-email-steven.sistare@oracle.com -> patchew/1620390320-301716-1-git-send-email-steven.sistare@oracle.com
 - [tag update] patchew/20210504124140.1100346-1-linux@roeck-us.net -> patchew/20210504124140.1100346-1-linux@roeck-us.net
 - [tag update] patchew/20210506185641.284821-1-dgilbert@redhat.com -> patchew/20210506185641.284821-1-dgilbert@redhat.com
 - [tag update] patchew/20210506193341.140141-1-lvivier@redhat.com -> patchew/20210506193341.140141-1-lvivier@redhat.com
 - [tag update] patchew/20210506194358.3925-1-peter.maydell@linaro.org -> patchew/20210506194358.3925-1-peter.maydell@linaro.org
Switched to a new branch 'test'
8c778e6 simplify savevm
aca4f09 cpr: maintainers
697f8d0 cpr: only-cpr-capable option
0a8c20e chardev: cpr for sockets
cb270f4 chardev: cpr for pty
279230e chardev: cpr for simple devices
b122cfa chardev: cpr framework
6596676 hostmem-memfd: cpr support
8cb6348 vhost: reset vhost devices upon cprsave
e3ae86d vfio-pci: cpr part 2
02c628d vfio-pci: cpr part 1
d93623c vfio-pci: refactor for cpr
bc63b3e pci: export functions for cpr
2b10bdd cpr: HMP interfaces
29bc20a cpr: QMP interfaces
3f84e6c cpr
0a32588 vl: add helper to request re-exec
466b4cf machine: memfd-alloc option
50c3e84 util: env var helpers
76c3550 oslib: qemu_clr_cloexec
d819bd4 qemu_ram_volatile
c466ddf as_flat_walk

=== OUTPUT BEGIN ===
1/22 Checking commit c466ddfd2209 (as_flat_walk)
2/22 Checking commit d819bd4dcc09 (qemu_ram_volatile)
3/22 Checking commit 76c3550a677b (oslib: qemu_clr_cloexec)
4/22 Checking commit 50c3e84cf5a6 (util: env var helpers)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#19:
new file mode 100644

ERROR: consider using qemu_strtol in preference to strtol
#72: FILE: util/env.c:20:
+ res = strtol(val, 0, 10);

total: 1 errors, 1 warnings, 129 lines checked

Patch 4/22 has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

5/22 Checking commit 466b4cf4ce8c (machine: memfd-alloc option)
6/22 Checking commit 0a32588de76e (vl: add helper to request re-exec)
7/22 Checking commit 3f84e6c38bd6 (cpr)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#55:
new file mode 100644

total: 0 errors, 1 warnings, 314 lines checked

Patch 7/22 has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

8/22 Checking commit 29bc20ab5870 (cpr: QMP interfaces)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#81:
new file mode 100644

total: 0 errors, 1 warnings, 136 lines checked

Patch 8/22 has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

9/22 Checking commit 2b10bdd5edb3 (cpr: HMP interfaces)
10/22 Checking commit bc63b3edc621 (pci: export functions for cpr)
11/22 Checking commit d93623c4da4d (vfio-pci: refactor for cpr)
12/22 Checking commit 02c628d50b57 (vfio-pci: cpr part 1)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#271:
new file mode 100644

total: 0 errors, 1 warnings, 566 lines checked

Patch 12/22 has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

13/22 Checking commit e3ae86d2076c (vfio-pci: cpr part 2)
14/22 Checking commit 8cb6348c8cff (vhost: reset vhost devices upon cprsave)
15/22 Checking commit 65966769fa93 (hostmem-memfd: cpr support)
16/22 Checking commit b122cfa96106 (chardev: cpr framework)
17/22 Checking commit 279230e03a78 (chardev: cpr for simple devices)
18/22 Checking commit cb270f49693f (chardev: cpr for pty)
19/22 Checking commit 0a8c20e0a8d4 (chardev: cpr for sockets)
20/22 Checking commit 697f8d021f43 (cpr: only-cpr-capable option)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#200:
new file mode 100644

total: 0 errors, 1 warnings, 133 lines checked

Patch 20/22 has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

21/22 Checking commit aca4f092c865 (cpr: maintainers)
22/22 Checking commit 8c778e6f284c (simplify savevm)
=== OUTPUT END ===

Test command exited with code: 1

The full log is available at
http://patchew.org/logs/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
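A note on the lone ERROR above: qemu_strtol() is QEMU's checked wrapper around
strtol(), declared in include/qemu/cutils.h. A minimal sketch of the substitution
checkpatch is asking for (the variable names follow the flagged line; the fallback
value on failure is an assumption, not from the patch):

    #include "qemu/cutils.h"

    long res;

    /* strtol() silently returns 0 on garbage and clamps on overflow;
     * qemu_strtol() instead returns -EINVAL/-ERANGE so the caller can
     * detect bad input. */
    if (qemu_strtol(val, NULL, 10, &res) < 0) {
        res = 0;  /* assumed default on parse failure */
    }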
On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> Provide the cprsave and cprload commands for live update. These save and
> restore VM state, with minimal guest pause time, so that qemu may be updated
> to a new version in between.
>
> cprsave stops the VM and saves vmstate to an ordinary file. It supports two
> modes: restart and reboot. For restart, cprsave exec's the qemu binary (or
> /usr/bin/qemu-exec if it exists) with the same argv. qemu restarts in a
> paused state and waits for the cprload command.

I think cprsave/cprload could be generalized by using QMP to stash the
file descriptors. The 'getfd' QMP command already exists and QEMU code
already opens fds passed using this mechanism.

I haven't checked but it may be possible to drop some patches by reusing
QEMU's monitor file descriptor passing, since the code already knows how
to open from 'getfd' fds.

The reason why using QMP is interesting is because it eliminates the
need for execve(2). QEMU may be unable to execute a program due to
chroot, seccomp, etc.

QMP would enable cprsave/cprload to work both with and without
execve(2).

One tricky thing with this approach might be startup ordering: how to
get fds via the QMP monitor in the new process before processing the
entire command-line.

Stefan
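For readers unfamiliar with the mechanism Stefan references: QMP's 'getfd'
command names a file descriptor that arrives as SCM_RIGHTS ancillary data on the
monitor's UNIX domain socket. The sketch below is hypothetical client code (the
helper name and framing are assumptions, not part of the series), showing how a
descriptor might be stashed with a running QEMU:

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /*
     * Hypothetical sketch: hand one fd to QEMU through the QMP monitor.
     * 'getfd' is a real QMP command; the fd itself must ride along as
     * SCM_RIGHTS ancillary data on the monitor's UNIX domain socket.
     */
    static int qmp_stash_fd(int qmp_sock, int fd, const char *name)
    {
        char cmd[256];
        int n = snprintf(cmd, sizeof(cmd),
            "{ \"execute\": \"getfd\", \"arguments\": { \"fdname\": \"%s\" } }\n",
            name);

        struct iovec iov = { .iov_base = cmd, .iov_len = n };
        char ctrl[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
        };

        /* Attach the descriptor as ancillary data. */
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(qmp_sock, &msg, 0) < 0 ? -1 : 0;
    }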
On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>> Provide the cprsave and cprload commands for live update. [...]
>
> I think cprsave/cprload could be generalized by using QMP to stash the
> file descriptors. The 'getfd' QMP command already exists and QEMU code
> already opens fds passed using this mechanism.
>
> [...]
>
> One tricky thing with this approach might be startup ordering: how to
> get fds via the QMP monitor in the new process before processing the
> entire command-line.

Early on I experimented with a similar approach. Old qemu passed descriptors to an
escrow process and exited; new qemu started and retrieved the descriptors from escrow.
vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
I suspect my recent vfio extensions would smooth the rough edges.

However, the main issue is that guest ram must be backed by named shared memory, and
we would need to add code to support shared memory for all the secondary memory objects.
That makes it less interesting for us at this time; we care about updating legacy qemu
instances with anonymous guest memory.

Having said all that, this would be an interesting project, just not the one I want
to push now. In the future we could add a new cprsave mode to support it in a backward
compatible manner.

- Steve
I will add to MAINTAINERS incrementally instead of at the end to make checkpatch happy.

I will use qemu_strtol even though I thought the message "consider using qemu_strtol"
was giving me a choice. You can't fight The Man when the man is a robot.

- Steve

On 5/7/2021 9:00 AM, no-reply@patchew.org wrote:
> Patchew URL: https://patchew.org/QEMU/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/
>
> Hi,
>
> This series seems to have some coding style problems. See output below for
> more information:
>
> [...]
>
> WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
>
> ERROR: consider using qemu_strtol in preference to strtol
> #72: FILE: util/env.c:20:
> + res = strtol(val, 0, 10);
>
> [...]
On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> > [...]
> > One tricky thing with this approach might be startup ordering: how to
> > get fds via the QMP monitor in the new process before processing the
> > entire command-line.
>
> Early on I experimented with a similar approach. Old qemu passed descriptors to an
> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> I suspect my recent vfio extensions would smooth the rough edges.

I wonder about the reason for VFIO's pid limitation, maybe because it
pins pages from the original process?

Is this VFIO pid limitation the main reason why you chose to make QEMU
execve(2) the new binary?

> However, the main issue is that guest ram must be backed by named shared memory, and
> we would need to add code to support shared memory for all the secondary memory objects.
> That makes it less interesting for us at this time; we care about updating legacy qemu
> instances with anonymous guest memory.

Thanks for explaining this more in the other sub-thread. The secondary
memory objects you mentioned are relatively small so I don't think
saving them in the traditional way is a problem.

Two approaches for zero-copy memory migration fit into QEMU's existing
migration infrastructure:

- Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
  etc) so they are not saved into the savevm file. The existing --object
  memory-backend-file syntax can be used.

- Extending the live migration protocol to detect when file descriptor
  passing is available (i.e. UNIX domain socket migration) and using
  that for memory-backend-* objects that have fds.

Either of these approaches would handle RAM with existing savevm/migrate
commands.

The remaining issue is how to migrate VFIO and other file descriptors
that cannot be reopened by the new process. As mentioned, QEMU already
has file descriptor passing support in the QMP monitor and support for
opening passed file descriptors (see qemu_open_internal(),
monitor_fd_param(), and socket_get_fd()).

The advantage of integrating live update functionality into the existing
savevm/migrate commands is that it will work in more use cases with
less code duplication/maintenance/bitrot prevention than the
special-case cprsave command in this patch series.

Maybe there is a fundamental technical reason why live update needs to
be different from QEMU's existing migration commands but I haven't
figured it out yet.

Stefan
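The helpers Stefan names are existing QEMU internals. Roughly, code that accepts
an fd from the monitor resolves it like the sketch below (the surrounding function
and the fd name are illustrative assumptions; only monitor_fd_param() and
monitor_cur() are real):

    #include "qemu/osdep.h"
    #include "monitor/monitor.h"

    /* Illustrative: resolve a descriptor previously stashed via the QMP
     * 'getfd' command into a usable fd, or set errp on failure. */
    static int open_stashed_fd(Error **errp)
    {
        return monitor_fd_param(monitor_cur(), "ram-fd", errp);
    }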
On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>> [...]
>> Early on I experimented with a similar approach. Old qemu passed descriptors to an
>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>> I suspect my recent vfio extensions would smooth the rough edges.
>
> I wonder about the reason for VFIO's pid limitation, maybe because it
> pins pages from the original process?

The dma unmap code verifies that the requesting task is the same as the task that mapped
the pages. We could add an ioctl that passes ownership to a new task. We would also need
to fix locked memory accounting, which is associated with the mm of the original task.

> Is this VFIO pid limitation the main reason why you chose to make QEMU
> execve(2) the new binary?

That is one. Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
errors I mentioned in the previous email. We would need to suppress redundant dma map calls,
but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt
like I was working against vfio rather than with it.

Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
code injection which I briefly mentioned in KVM forum). If we extend cpr to allow updates
without exec, I still need the exec option.

> [...]
>
> Two approaches for zero-copy memory migration fit into QEMU's existing
> migration infrastructure:
>
> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>   etc) so they are not saved into the savevm file. The existing --object
>   memory-backend-file syntax can be used.
>
> - Extending the live migration protocol to detect when file descriptor
>   passing is available (i.e. UNIX domain socket migration) and using
>   that for memory-backend-* objects that have fds.
>
> Either of these approaches would handle RAM with existing savevm/migrate
> commands.

Yes, but the vfio issues would still need to be solved, and we would need new
command line options to back existing and future secondary memory objects with
named shared memory.

> [...]
>
> Maybe there is a fundamental technical reason why live update needs to
> be different from QEMU's existing migration commands but I haven't
> figured it out yet.

vfio and anonymous memory.

Regarding code duplication, I did consider whether to extend the migration
syntax and implementation versus creating something new. Those functions
handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
use case, and the cpr functions handle state that is n/a for the migration case.
I judged that handling both in the same functions would be less readable and
maintainable. After feedback during the V1 review, I simplified the cprsave
code by calling qemu_save_device_state, as Xen does, thus eliminating any
interaction with the migration code.

Regarding bit rot, I still need to add a cpr test to the test suite, when the
review is more complete and folks agree on the final form of the functionality.

I do like the idea of supporting update without exec, but as a future project,
and not at the expense of dropping update with exec.

- Steve
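For context, qemu_save_device_state() is the same internal helper behind QMP's
xen-save-devices-state. A rough sketch of how a cprsave-style command can reuse
it, modeled on the shape of qmp_xen_save_devices_state() in 6.0-era QEMU; this
is illustrative only, not the actual patch, and header locations may differ by
version:

    #include "qemu/osdep.h"
    #include "io/channel-file.h"
    #include "migration/qemu-file-channel.h"
    #include "migration/savevm.h"
    #include "sysemu/runstate.h"

    /* Illustrative sketch: stop the VM and write device state (no RAM)
     * to an ordinary file, as xen-save-devices-state does. */
    static int cpr_save_device_state(const char *filename, Error **errp)
    {
        QIOChannelFile *ioc;
        QEMUFile *f;
        int ret;

        vm_stop(RUN_STATE_SAVE_VM);

        ioc = qio_channel_file_new_path(filename,
                                        O_WRONLY | O_CREAT | O_TRUNC,
                                        0660, errp);
        if (!ioc) {
            return -1;
        }
        f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
        object_unref(OBJECT(ioc));

        ret = qemu_save_device_state(f);
        qemu_fclose(f);
        return ret;
    }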
On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> > [...]
> > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > execve(2) the new binary?
>
> That is one. Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> errors I mentioned in the previous email. [...] In general, it felt like
> I was working against vfio rather than with it.
>
> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> code injection which I briefly mentioned in KVM forum). If we extend cpr to allow updates
> without exec, I still need the exec option.
>
> [...]
>
> I do like the idea of supporting update without exec, but as a future project,
> and not at the expense of dropping update with exec.

Alex: We're discussing how to live update QEMU while VFIO devices are
running. This patch series introduces monitor commands that call
execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
file descriptors. This way live update is transparent to VFIO but it
won't work if a sandboxed QEMU process is forbidden to call execve(2).

What are your thoughts on 1) the execve(2) approach and 2) extending
VFIO to allow running devices to be attached to a different process so
that execve(2) is not necessary?

Steven: Do you know if cpr will work with Intel's upcoming Shared
Virtual Addressing? I'm worried that execve(2) may be a short-term
solution that works around VFIO's current limitations but even
execve(2) may stop working in the future as IOMMUs and DMA approaches
change.

Stefan
On Mon, 17 May 2021 12:40:43 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> [...]
>
> Alex: We're discussing how to live update QEMU while VFIO devices are
> running. This patch series introduces monitor commands that call
> execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
> file descriptors. This way live update is transparent to VFIO but it
> won't work if a sandboxed QEMU process is forbidden to call execve(2).
>
> What are your thoughts on 1) the execve(2) approach and 2) extending
> VFIO to allow running devices to be attached to a different process so
> that execve(2) is not necessary?

Tracking processes is largely to support page pinning; we need to be
able to support both asynchronous page pinning to handle requests from
mdev drivers and we need to make sure pinned page accounting is
tracked to the same process. If userspace can "pay" for locked pages
from one process on mapping, then "credit" them to another process on
unmap, that seems fairly exploitable. We'd need some way to transfer
the locked memory accounting or handle it outside of vfio. Thanks,

Alex
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> > [...]
> > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > execve(2) the new binary?
>
> That is one. Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> errors I mentioned in the previous email. We would need to suppress redundant dma map calls,
> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt
> like I was working against vfio rather than with it.

OK, the weirdness of vfio helps explain a bit about why you're doing it
this way; can you help separate some differences between restart and
reboot for me though:

In 'reboot' mode, where the guest must do suspend in its drivers, how
much of these vfio requirements are needed? I guess the memfd use for
the anonymous areas isn't any use for reboot mode.

You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
vfio still care about the currently-anonymous areas?

> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> code injection which I briefly mentioned in KVM forum). If we extend cpr to allow updates
> without exec, I still need the exec option.

Can you explain what that code injection mechanism is for those of us
who didn't see that?

Dave

> [...]
On Mon, May 17, 2021 at 01:10:01PM -0600, Alex Williamson wrote:
> On Mon, 17 May 2021 12:40:43 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > [...]
> > What are your thoughts on 1) the execve(2) approach and 2) extending
> > VFIO to allow running devices to be attached to a different process so
> > that execve(2) is not necessary?
>
> Tracking processes is largely to support page pinning; we need to be
> able to support both asynchronous page pinning to handle requests from
> mdev drivers and we need to make sure pinned page accounting is
> tracked to the same process. If userspace can "pay" for locked pages
> from one process on mapping, then "credit" them to another process on
> unmap, that seems fairly exploitable. We'd need some way to transfer
> the locked memory accounting or handle it outside of vfio. Thanks,

Vhost's VHOST_SET_OWNER ioctl is somewhat similar. It's used to
associate the in-kernel vhost device with a userspace process and its
mm.

Would it be possible to add a VFIO_SET_OWNER ioctl that associates the
current process with the vfio_device? Only one process would be the
owner at any given time.

I'm not sure how existing DMA mappings would behave, but this patch
series seems to rely on DMA continuing to work even though there is a
window of time when the execve(2) process doesn't have guest RAM
mmapped. So I guess existing DMA mappings continue to work because the
pages were previously pinned?

Stefan
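For reference, the vhost pattern Stefan is drawing the analogy from looks roughly
like the sketch below. VHOST_SET_OWNER is a real ioctl; the VFIO_SET_OWNER analog
is purely hypothetical (it does not exist) and is shown only to illustrate the
proposed shape:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    int main(void)
    {
        /* Real API: bind an in-kernel vhost device to the calling
         * process's mm. Issued once, before other vhost setup ioctls. */
        int vhost_fd = open("/dev/vhost-net", O_RDWR);
        ioctl(vhost_fd, VHOST_SET_OWNER);

        /*
         * Hypothetical analog proposed above (NOT a real ioctl): a new
         * process that inherited a vfio device fd would claim ownership,
         * and the kernel would transfer pinned-page accounting to it.
         *
         * ioctl(vfio_device_fd, VFIO_SET_OWNER);
         */
        return 0;
    }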
On 5/18/2021 9:39 AM, Stefan Hajnoczi wrote: > On Mon, May 17, 2021 at 01:10:01PM -0600, Alex Williamson wrote: >> On Mon, 17 May 2021 12:40:43 +0100 >> Stefan Hajnoczi <stefanha@redhat.com> wrote: >> >>> On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote: >>>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote: >>>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote: >>>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote: >>>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote: >>>>>>>> Provide the cprsave and cprload commands for live update. These save and >>>>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated >>>>>>>> to a new version in between. >>>>>>>> >>>>>>>> cprsave stops the VM and saves vmstate to an ordinary file. It supports two >>>>>>>> modes: restart and reboot. For restart, cprsave exec's the qemu binary (or >>>>>>>> /usr/bin/qemu-exec if it exists) with the same argv. qemu restarts in a >>>>>>>> paused state and waits for the cprload command. >>>>>>> >>>>>>> I think cprsave/cprload could be generalized by using QMP to stash the >>>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code >>>>>>> already opens fds passed using this mechanism. >>>>>>> >>>>>>> I haven't checked but it may be possible to drop some patches by reusing >>>>>>> QEMU's monitor file descriptor passing since the code already knows how >>>>>>> to open from 'getfd' fds. >>>>>>> >>>>>>> The reason why using QMP is interesting is because it eliminates the >>>>>>> need for execve(2). QEMU may be unable to execute a program due to >>>>>>> chroot, seccomp, etc. >>>>>>> >>>>>>> QMP would enable cprsave/cprload to work both with and without >>>>>>> execve(2). >>>>>>> >>>>>>> One tricky thing with this approach might be startup ordering: how to >>>>>>> get fds via the QMP monitor in the new process before processing the >>>>>>> entire command-line. >>>>>> >>>>>> Early on I experimented with a similar approach. Old qemu passed descriptors to an >>>>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow. >>>>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check. >>>>>> I suspect my recent vfio extensions would smooth the rough edges. >>>>> >>>>> I wonder about the reason for VFIO's pid limitation, maybe because it >>>>> pins pages from the original process? >>>> >>>> The dma unmap code verifies that the requesting task is the same as the task that mapped >>>> the pages. We could add an ioctl that passes ownership to a new task. We would also need >>>> to fix locked memory accounting, which is associated with the mm of the original task. >>>> >>>>> Is this VFIO pid limitation the main reason why you chose to make QEMU >>>>> execve(2) the new binary? >>>> >>>> That is one. Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict >>>> errors I mentioned in the previous email. We would need to suppress redundant dma map calls, >>>> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and >>>> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like >>>> I was working against vfio rather than with it. >>>> >>>> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via >>>> code injection which I briefly mentioned in KVM forum). If we extend cpr to allow updates >>>> without exec, I still need the exec option. 
>>>> >>>>>> However, the main issue is that guest ram must be backed by named shared memory, and >>>>>> we would need to add code to support shared memory for all the secondary memory objects. >>>>>> That makes it less interesting for us at this time; we care about updating legacy qemu >>>>>> instances with anonymous guest memory. >>>>> >>>>> Thanks for explaining this more in the other sub-thread. The secondary >>>>> memory objects you mentioned are relatively small so I don't think >>>>> saving them in the traditional way is a problem. >>>>> >>>>> Two approaches for zero-copy memory migration fit into QEMU's existing >>>>> migration infrastructure: >>>>> >>>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs, >>>>> etc) so they are not saved into the savevm file. The existing --object >>>>> memory-backend-file syntax can be used. >>>>> >>>>> - Extending the live migration protocol to detect when file descriptor >>>>> passing is available (i.e. UNIX domain socket migration) and using >>>>> that for memory-backend-* objects that have fds. >>>>> >>>>> Either of these approaches would handle RAM with existing savevm/migrate >>>>> commands. >>>> >>>> Yes, but the vfio issues would still need to be solved, and we would need new >>>> command line options to back existing and future secondary memory objects with >>>> named shared memory. >>>> >>>>> The remaining issue is how to migrate VFIO and other file descriptors >>>>> that cannot be reopened by the new process. As mentioned, QEMU already >>>>> has file descriptor passing support in the QMP monitor and support for >>>>> opening passed file descriptors (see qemu_open_internal(), >>>>> monitor_fd_param(), and socket_get_fd()). >>>>> >>>>> The advantage of integrating live update functionality into the existing >>>>> savevm/migrate commands is that it will work in more use cases with >>>>> less code duplication/maintenance/bitrot prevention than the >>>>> special-case cprsave command in this patch series. >>>>> >>>>> Maybe there is a fundamental technical reason why live update needs to >>>>> be different from QEMU's existing migration commands but I haven't >>>>> figured it out yet. >>>> >>>> vfio and anonymous memory. >>>> >>>> Regarding code duplication, I did consider whether to extend the migration >>>> syntax and implementation versus creating something new. Those functions >>>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr >>>> use case, and the cpr functions handle state that is n/a for the migration case. >>>> I judged that handling both in the same functions would be less readable and >>>> maintainable. After feedback during the V1 review, I simplified the cprsave >>>> code by by calling qemu_save_device_state, as Xen does, thus eliminating any >>>> interaction with the migration code. >>>> >>>> Regarding bit rot, I still need to add a cpr test to the test suite, when the >>>> review is more complete and folks agree on the final form of the functionality. >>>> >>>> I do like the idea of supporting update without exec, but as a future project, >>>> and not at the expense of dropping update with exec. >>> >>> Alex: We're discussing how to live update QEMU while VFIO devices are >>> running. This patch series introduces monitor commands that call >>> execve(2) to run the new QEMU binary and inherit the memory/vfio/etc >>> file descriptors. This way live update is transparent to VFIO but it >>> won't work if a sandboxed QEMU process is forbidden to call execve(2). 
>>> What are your thoughts on 1) the execve(2) approach and 2) extending >>> VFIO to allow running devices to be attached to a different process so >>> that execve(2) is not necessary? >> >> Tracking processes is largely to support page pinning; we need to be >> able to support asynchronous page pinning to handle requests from >> mdev drivers, and we need to make sure pinned page accounting is >> tracked to the same process. If userspace can "pay" for locked pages >> from one process on mapping, then "credit" them to another process on >> unmap, that seems fairly exploitable. We'd need some way to transfer >> the locked memory accounting or handle it outside of vfio. Thanks, > > Vhost's VHOST_SET_OWNER ioctl is somewhat similar. It's used to > associate the in-kernel vhost device with a userspace process and its > mm. > > Would it be possible to add a VFIO_SET_OWNER ioctl that associates the > current process with the vfio_device? Only one process would be the > owner at any given time. It is possible, but the implementation would need to invent new hooks in mm to transfer the locked memory accounting. > I'm not sure how existing DMA mappings would behave, but this patch > series seems to rely on DMA continuing to work even though there is a > window of time when the execve(2) process doesn't have guest RAM > mmapped. So I guess existing DMA mappings continue to work because the > pages were previously pinned? Correct. And changes to mappings are blocked between the pre-exec process issuing VFIO_DMA_UNMAP_FLAG_VADDR and the post-exec process issuing VFIO_DMA_MAP_FLAG_VADDR. - Steve
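For concreteness, the suspend/resume sequence Steve describes maps onto the Linux 5.12 uapi roughly as follows. This is a sketch, not the series' code; the container fd, iova, and size are illustrative, and the assumption that the map update must carry the original READ/WRITE permissions is mine.

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Pre-exec: invalidate the vaddr for an iova range. DMA to pages that
 * are already pinned continues; mapping changes block until resumed. */
int cpr_suspend_vaddr(int container, __u64 iova, __u64 size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .flags = VFIO_DMA_UNMAP_FLAG_VADDR,
        .iova  = iova,
        .size  = size,
    };
    return ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
}

/* Post-exec: register the new vaddr for the same iova range. The
 * IOVA-to-HPA translation is unchanged; only the HVA differs. */
int cpr_resume_vaddr(int container, __u64 vaddr, __u64 iova, __u64 size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_VADDR |
                 VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = vaddr,
        .iova  = iova,
        .size  = size,
    };
    return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}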
On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote: > * Steven Sistare (steven.sistare@oracle.com) wrote: >> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote: >>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote: >>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote: >>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote: >>>>>> Provide the cprsave and cprload commands for live update. These save and >>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated >>>>>> to a new version in between. >>>>>> >>>>>> cprsave stops the VM and saves vmstate to an ordinary file. It supports two >>>>>> modes: restart and reboot. For restart, cprsave exec's the qemu binary (or >>>>>> /usr/bin/qemu-exec if it exists) with the same argv. qemu restarts in a >>>>>> paused state and waits for the cprload command. >>>>> >>>>> I think cprsave/cprload could be generalized by using QMP to stash the >>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code >>>>> already opens fds passed using this mechanism. >>>>> >>>>> I haven't checked but it may be possible to drop some patches by reusing >>>>> QEMU's monitor file descriptor passing since the code already knows how >>>>> to open from 'getfd' fds. >>>>> >>>>> The reason why using QMP is interesting is because it eliminates the >>>>> need for execve(2). QEMU may be unable to execute a program due to >>>>> chroot, seccomp, etc. >>>>> >>>>> QMP would enable cprsave/cprload to work both with and without >>>>> execve(2). >>>>> >>>>> One tricky thing with this approach might be startup ordering: how to >>>>> get fds via the QMP monitor in the new process before processing the >>>>> entire command-line. >>>> >>>> Early on I experimented with a similar approach. Old qemu passed descriptors to an >>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow. >>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check. >>>> I suspect my recent vfio extensions would smooth the rough edges. >>> >>> I wonder about the reason for VFIO's pid limitation, maybe because it >>> pins pages from the original process? >> >> The dma unmap code verifies that the requesting task is the same as the task that mapped >> the pages. We could add an ioctl that passes ownership to a new task. We would also need >> to fix locked memory accounting, which is associated with the mm of the original task. > >>> Is this VFIO pid limitation the main reason why you chose to make QEMU >>> execve(2) the new binary? >> >> That is one. Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict >> errors I mentioned in the previous email. We would need to suppress redundant dma map calls, >> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and >> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like >> I was working against vfio rather than with it. > > OK the weirdness of vfio helps explain a bit about why you're doing it > this way; can you help separate some difference between restart and > reboot for me though: > > In 'reboot' mode; where the guest must do suspend in it's drivers, how > much of these vfio requirements are needed? I guess the memfd use > for the anonymous areas isn't any use for reboot mode. Correct. For reboot no special vfio support or fiddling is needed. 
> You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does > vfio still care about the currently-anonymous areas? Yes, for restart mode. The physical pages behind the anonymous memory remain pinned and are targets for ongoing DMA. Post-exec qemu needs a way to find those same pages. >> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via >> code injection which I briefly mentioned in KVM forum). If we extend cpr to allow updates >> without exec, I still need the exec option. > > Can you explain what that code injection mechanism is for those of us > who didn't see that? Sure. Here is slide 12 from the talk. It relies on mmap(MADV_DOEXEC) which was not accepted upstream. ----------------------------------------------------------------------------- Legacy Live Update * Update legacy qemu process to latest version - Inject code into legacy qemu process to perform cprsave: vmsave.so . Access qemu data structures and globals - eg ram_list, savevm_state, chardevs, vhost_devices - dlopen does not resolve them, must get addresses via symbol lookup. . Delete some vmstate handlers, register new ones (eg vfio) . Call MADV_DOEXEC on guest memory. Find devices, preserve fd * Hot patch a monitor function to dlopen vmsave.so, call entry point - write patch to /proc/pid/mem - Call the monitor function via monitor socket * Send cprload to update qemu * vmsave.so has binary dependency on qemu data structures and variables - Build vmsave-ver.so per legacy version - Indexed by qemu's gcc build-id ----------------------------------------------------------------------------- - Steve >>>> However, the main issue is that guest ram must be backed by named shared memory, and >>>> we would need to add code to support shared memory for all the secondary memory objects. >>>> That makes it less interesting for us at this time; we care about updating legacy qemu >>>> instances with anonymous guest memory. >>> >>> Thanks for explaining this more in the other sub-thread. The secondary >>> memory objects you mentioned are relatively small so I don't think >>> saving them in the traditional way is a problem. >>> >>> Two approaches for zero-copy memory migration fit into QEMU's existing >>> migration infrastructure: >>> >>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs, >>> etc) so they are not saved into the savevm file. The existing --object >>> memory-backend-file syntax can be used. >>> >>> - Extending the live migration protocol to detect when file descriptor >>> passing is available (i.e. UNIX domain socket migration) and using >>> that for memory-backend-* objects that have fds. >>> >>> Either of these approaches would handle RAM with existing savevm/migrate >>> commands. >> >> Yes, but the vfio issues would still need to be solved, and we would need new >> command line options to back existing and future secondary memory objects with >> named shared memory. >> >>> The remaining issue is how to migrate VFIO and other file descriptors >>> that cannot be reopened by the new process. As mentioned, QEMU already >>> has file descriptor passing support in the QMP monitor and support for >>> opening passed file descriptors (see qemu_open_internal(), >>> monitor_fd_param(), and socket_get_fd()). 
>>> >>> The advantage of integrating live update functionality into the existing >>> savevm/migrate commands is that it will work in more use cases with >>> less code duplication/maintenance/bitrot prevention than the >>> special-case cprsave command in this patch series. >>> >>> Maybe there is a fundamental technical reason why live update needs to >>> be different from QEMU's existing migration commands but I haven't >>> figured it out yet. >> >> vfio and anonymous memory. >> >> Regarding code duplication, I did consider whether to extend the migration >> syntax and implementation versus creating something new. Those functions >> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr >> use case, and the cpr functions handle state that is n/a for the migration case. >> I judged that handling both in the same functions would be less readable and >> maintainable. After feedback during the V1 review, I simplified the cprsave >> code by by calling qemu_save_device_state, as Xen does, thus eliminating any >> interaction with the migration code. >> >> Regarding bit rot, I still need to add a cpr test to the test suite, when the >> review is more complete and folks agree on the final form of the functionality. >> >> I do like the idea of supporting update without exec, but as a future project, >> and not at the expense of dropping update with exec. >> >> - Steve >>
* Steven Sistare (steven.sistare@oracle.com) wrote: > On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote: > > * Steven Sistare (steven.sistare@oracle.com) wrote: > >> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote: > >>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote: > >>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote: > >>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote: > >>>>>> Provide the cprsave and cprload commands for live update. These save and > >>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated > >>>>>> to a new version in between. > >>>>>> > >>>>>> cprsave stops the VM and saves vmstate to an ordinary file. It supports two > >>>>>> modes: restart and reboot. For restart, cprsave exec's the qemu binary (or > >>>>>> /usr/bin/qemu-exec if it exists) with the same argv. qemu restarts in a > >>>>>> paused state and waits for the cprload command. > >>>>> > >>>>> I think cprsave/cprload could be generalized by using QMP to stash the > >>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code > >>>>> already opens fds passed using this mechanism. > >>>>> > >>>>> I haven't checked but it may be possible to drop some patches by reusing > >>>>> QEMU's monitor file descriptor passing since the code already knows how > >>>>> to open from 'getfd' fds. > >>>>> > >>>>> The reason why using QMP is interesting is because it eliminates the > >>>>> need for execve(2). QEMU may be unable to execute a program due to > >>>>> chroot, seccomp, etc. > >>>>> > >>>>> QMP would enable cprsave/cprload to work both with and without > >>>>> execve(2). > >>>>> > >>>>> One tricky thing with this approach might be startup ordering: how to > >>>>> get fds via the QMP monitor in the new process before processing the > >>>>> entire command-line. > >>>> > >>>> Early on I experimented with a similar approach. Old qemu passed descriptors to an > >>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow. > >>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check. > >>>> I suspect my recent vfio extensions would smooth the rough edges. > >>> > >>> I wonder about the reason for VFIO's pid limitation, maybe because it > >>> pins pages from the original process? > >> > >> The dma unmap code verifies that the requesting task is the same as the task that mapped > >> the pages. We could add an ioctl that passes ownership to a new task. We would also need > >> to fix locked memory accounting, which is associated with the mm of the original task. > > > >>> Is this VFIO pid limitation the main reason why you chose to make QEMU > >>> execve(2) the new binary? > >> > >> That is one. Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict > >> errors I mentioned in the previous email. We would need to suppress redundant dma map calls, > >> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and > >> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like > >> I was working against vfio rather than with it. > > > > OK the weirdness of vfio helps explain a bit about why you're doing it > > this way; can you help separate some difference between restart and > > reboot for me though: > > > > In 'reboot' mode; where the guest must do suspend in it's drivers, how > > much of these vfio requirements are needed? I guess the memfd use > > for the anonymous areas isn't any use for reboot mode. 
> > Correct. For reboot no special vfio support or fiddling is needed. > > > You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does > > vfio still care about the currently-anonymous areas? > > Yes, for restart mode. The physical pages behind the anonymous memory remain pinned and > are targets for ongoing DMA. Post-exec qemu needs a way to find those same pages. Is it possible with vfio to map it into multiple processes simultaneously or does it have to only be one at a time? Are you saying that you have no way to shut off DMA, and thus you can never know it's safe to terminate the source process? > >> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via > >> code injection which I briefly mentioned in KVM forum). If we extend cpr to allow updates > >> without exec, I still need the exec option. > > > > Can you explain what that code injection mechanism is for those of us > > who didn't see that? > > Sure. Here is slide 12 from the talk. It relies on mmap(MADV_DOEXEC) which was not > accepted upstream. In this series, without MADV_DOEXEC, how do you guarantee the same HVA in source and destination - or is that not necessary? > ----------------------------------------------------------------------------- > Legacy Live Update > > * Update legacy qemu process to latest version > - Inject code into legacy qemu process to perform cprsave: vmsave.so > . Access qemu data structures and globals > - eg ram_list, savevm_state, chardevs, vhost_devices > - dlopen does not resolve them, must get addresses via symbol lookup. > . Delete some vmstate handlers, register new ones (eg vfio) > . Call MADV_DOEXEC on guest memory. Find devices, preserve fd > * Hot patch a monitor function to dlopen vmsave.so, call entry point > - write patch to /proc/pid/mem > - Call the monitor function via monitor socket > * Send cprload to update qemu > * vmsave.so has binary dependency on qemu data structures and variables > - Build vmsave-ver.so per legacy version > - Indexed by qemu's gcc build-id > > ----------------------------------------------------------------------------- That's hairy! At that point isn't it easier to recompile a patched qemu against the original sources and ptrace something in to mmap the new qemu? Dave > - Steve > > >>>> However, the main issue is that guest ram must be backed by named shared memory, and > >>>> we would need to add code to support shared memory for all the secondary memory objects. > >>>> That makes it less interesting for us at this time; we care about updating legacy qemu > >>>> instances with anonymous guest memory. > >>> > >>> Thanks for explaining this more in the other sub-thread. The secondary > >>> memory objects you mentioned are relatively small so I don't think > >>> saving them in the traditional way is a problem. > >>> > >>> Two approaches for zero-copy memory migration fit into QEMU's existing > >>> migration infrastructure: > >>> > >>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs, > >>> etc) so they are not saved into the savevm file. The existing --object > >>> memory-backend-file syntax can be used. > >>> > >>> - Extending the live migration protocol to detect when file descriptor > >>> passing is available (i.e. UNIX domain socket migration) and using > >>> that for memory-backend-* objects that have fds. > >>> > >>> Either of these approaches would handle RAM with existing savevm/migrate > >>> commands. 
> >> > >> Yes, but the vfio issues would still need to be solved, and we would need new > >> command line options to back existing and future secondary memory objects with > >> named shared memory. > >> > >>> The remaining issue is how to migrate VFIO and other file descriptors > >>> that cannot be reopened by the new process. As mentioned, QEMU already > >>> has file descriptor passing support in the QMP monitor and support for > >>> opening passed file descriptors (see qemu_open_internal(), > >>> monitor_fd_param(), and socket_get_fd()). > >>> > >>> The advantage of integrating live update functionality into the existing > >>> savevm/migrate commands is that it will work in more use cases with > >>> less code duplication/maintenance/bitrot prevention than the > >>> special-case cprsave command in this patch series. > >>> > >>> Maybe there is a fundamental technical reason why live update needs to > >>> be different from QEMU's existing migration commands but I haven't > >>> figured it out yet. > >> > >> vfio and anonymous memory. > >> > >> Regarding code duplication, I did consider whether to extend the migration > >> syntax and implementation versus creating something new. Those functions > >> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr > >> use case, and the cpr functions handle state that is n/a for the migration case. > >> I judged that handling both in the same functions would be less readable and > >> maintainable. After feedback during the V1 review, I simplified the cprsave > >> code by by calling qemu_save_device_state, as Xen does, thus eliminating any > >> interaction with the migration code. > >> > >> Regarding bit rot, I still need to add a cpr test to the test suite, when the > >> review is more complete and folks agree on the final form of the functionality. > >> > >> I do like the idea of supporting update without exec, but as a future project, > >> and not at the expense of dropping update with exec. > >> > >> - Steve > >> >
On Tue, 18 May 2021 20:23:25 +0100 "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote: > * Steven Sistare (steven.sistare@oracle.com) wrote: > > On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote: > > > * Steven Sistare (steven.sistare@oracle.com) wrote: > > >> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote: > > >>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote: > > >>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote: > > >>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote: > > >>>>>> Provide the cprsave and cprload commands for live update. These save and > > >>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated > > >>>>>> to a new version in between. > > >>>>>> > > >>>>>> cprsave stops the VM and saves vmstate to an ordinary file. It supports two > > >>>>>> modes: restart and reboot. For restart, cprsave exec's the qemu binary (or > > >>>>>> /usr/bin/qemu-exec if it exists) with the same argv. qemu restarts in a > > >>>>>> paused state and waits for the cprload command. > > >>>>> > > >>>>> I think cprsave/cprload could be generalized by using QMP to stash the > > >>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code > > >>>>> already opens fds passed using this mechanism. > > >>>>> > > >>>>> I haven't checked but it may be possible to drop some patches by reusing > > >>>>> QEMU's monitor file descriptor passing since the code already knows how > > >>>>> to open from 'getfd' fds. > > >>>>> > > >>>>> The reason why using QMP is interesting is because it eliminates the > > >>>>> need for execve(2). QEMU may be unable to execute a program due to > > >>>>> chroot, seccomp, etc. > > >>>>> > > >>>>> QMP would enable cprsave/cprload to work both with and without > > >>>>> execve(2). > > >>>>> > > >>>>> One tricky thing with this approach might be startup ordering: how to > > >>>>> get fds via the QMP monitor in the new process before processing the > > >>>>> entire command-line. > > >>>> > > >>>> Early on I experimented with a similar approach. Old qemu passed descriptors to an > > >>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow. > > >>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check. > > >>>> I suspect my recent vfio extensions would smooth the rough edges. > > >>> > > >>> I wonder about the reason for VFIO's pid limitation, maybe because it > > >>> pins pages from the original process? > > >> > > >> The dma unmap code verifies that the requesting task is the same as the task that mapped > > >> the pages. We could add an ioctl that passes ownership to a new task. We would also need > > >> to fix locked memory accounting, which is associated with the mm of the original task. > > > > > >>> Is this VFIO pid limitation the main reason why you chose to make QEMU > > >>> execve(2) the new binary? > > >> > > >> That is one. Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict > > >> errors I mentioned in the previous email. We would need to suppress redundant dma map calls, > > >> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and > > >> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like > > >> I was working against vfio rather than with it. 
> > > OK the weirdness of vfio helps explain a bit about why you're doing it > > > this way; can you help separate some difference between restart and > > > reboot for me though: > > > > > > In 'reboot' mode; where the guest must do suspend in its drivers, how > > > much of these vfio requirements are needed? I guess the memfd use > > > for the anonymous areas isn't any use for reboot mode. > > > > Correct. For reboot no special vfio support or fiddling is needed. > > > > > You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does > > > vfio still care about the currently-anonymous areas? > > > > Yes, for restart mode. The physical pages behind the anonymous memory remain pinned and > > are targets for ongoing DMA. Post-exec qemu needs a way to find those same pages. > > Is it possible with vfio to map it into multiple processes > simultaneously or does it have to only be one at a time? The IOMMU maps an IOVA to a physical address; what Steve is saying is that the mapping persists across the restart. A given IOVA can only map to a specific physical address, so mapping into multiple processes doesn't make any sense. The two processes need to map the same IOVA to the same HPA; only the HVA is allowed to change. > Are you saying that you have no way to shut off DMA, and thus you can > never know it's safe to terminate the source process? Stopping DMA, e.g. disabling PCI bus master, would not only be visible in the behavior of the device, but likely detrimental. You'd need driver or device participation to some extent to make this seamless. > > >> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via > > >> code injection which I briefly mentioned in KVM forum). If we extend cpr to allow updates > > >> without exec, I still need the exec option. > > > > > > Can you explain what that code injection mechanism is for those of us > > > who didn't see that? > > > > Sure. Here is slide 12 from the talk. It relies on mmap(MADV_DOEXEC) which was not > > accepted upstream. > > In this series, without MADV_DOEXEC, how do you guarantee the same HVA > in source and destination - or is that not necessary? It's not necessary; the HVA is used to establish the IOVA to HPA mapping for the IOMMU. We have patches upstream that suspend (block) that translation for the window when the HVA is invalid and resume when it becomes valid. It's expected that the new HVA is equivalent to the old HVA and that the user can only hurt themselves should they violate this, i.e. they can still only map+pin memory they own, so at worst they create a bad translation for their own device. Thanks, Alex
On 5/18/2021 3:23 PM, Dr. David Alan Gilbert wrote: > * Steven Sistare (steven.sistare@oracle.com) wrote: >> On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote: >>> * Steven Sistare (steven.sistare@oracle.com) wrote: >>>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote: >>>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote: >>>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote: >>>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote: >>>>>>>> Provide the cprsave and cprload commands for live update. These save and >>>>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated >>>>>>>> to a new version in between. >>>>>>>> >>>>>>>> cprsave stops the VM and saves vmstate to an ordinary file. It supports two >>>>>>>> modes: restart and reboot. For restart, cprsave exec's the qemu binary (or >>>>>>>> /usr/bin/qemu-exec if it exists) with the same argv. qemu restarts in a >>>>>>>> paused state and waits for the cprload command. >>>>>>> >>>>>>> I think cprsave/cprload could be generalized by using QMP to stash the >>>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code >>>>>>> already opens fds passed using this mechanism. >>>>>>> >>>>>>> I haven't checked but it may be possible to drop some patches by reusing >>>>>>> QEMU's monitor file descriptor passing since the code already knows how >>>>>>> to open from 'getfd' fds. >>>>>>> >>>>>>> The reason why using QMP is interesting is because it eliminates the >>>>>>> need for execve(2). QEMU may be unable to execute a program due to >>>>>>> chroot, seccomp, etc. >>>>>>> >>>>>>> QMP would enable cprsave/cprload to work both with and without >>>>>>> execve(2). >>>>>>> >>>>>>> One tricky thing with this approach might be startup ordering: how to >>>>>>> get fds via the QMP monitor in the new process before processing the >>>>>>> entire command-line. >>>>>> >>>>>> Early on I experimented with a similar approach. Old qemu passed descriptors to an >>>>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow. >>>>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check. >>>>>> I suspect my recent vfio extensions would smooth the rough edges. >>>>> >>>>> I wonder about the reason for VFIO's pid limitation, maybe because it >>>>> pins pages from the original process? >>>> >>>> The dma unmap code verifies that the requesting task is the same as the task that mapped >>>> the pages. We could add an ioctl that passes ownership to a new task. We would also need >>>> to fix locked memory accounting, which is associated with the mm of the original task. >>> >>>>> Is this VFIO pid limitation the main reason why you chose to make QEMU >>>>> execve(2) the new binary? >>>> >>>> That is one. Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict >>>> errors I mentioned in the previous email. We would need to suppress redundant dma map calls, >>>> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and >>>> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like >>>> I was working against vfio rather than with it. >>> >>> OK the weirdness of vfio helps explain a bit about why you're doing it >>> this way; can you help separate some difference between restart and >>> reboot for me though: >>> >>> In 'reboot' mode; where the guest must do suspend in it's drivers, how >>> much of these vfio requirements are needed? 
I guess the memfd use >>> for the anonymous areas isn't any use for reboot mode. >> >> Correct. For reboot no special vfio support or fiddling is needed. >> >>> You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does >>> vfio still care about the currently-anonymous areas? >> >> Yes, for restart mode. The physical pages behind the anonymous memory remain pinned and >> are targets for ongoing DMA. Post-exec qemu needs a way to find those same pages. > > Is it possible with vfio to map it into multiple processes > simultaneously or does it have to only be one at a time? > Are you saying that you have no way to shut off DMA, and thus you can > never know it's safe to terminate the source process? > >>>> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via >>>> code injection which I briefly mentioned in KVM forum). If we extend cpr to allow updates >>>> without exec, I still need the exec option. >>> >>> Can you explain what that code injection mechanism is for those of us >>> who didn't see that? >> >> Sure. Here is slide 12 from the talk. It relies on mmap(MADV_DOEXEC) which was not >> accepted upstream. > > In this series, without MADV_DOEXEC, how do you guarantee the same HVA > in source and destination - or is that not necessary? Not necessary. We can safely change the HVA using the new vfio ioctls. >> ----------------------------------------------------------------------------- >> Legacy Live Update >> >> * Update legacy qemu process to latest version >> - Inject code into legacy qemu process to perform cprsave: vmsave.so >> . Access qemu data structures and globals >> - eg ram_list, savevm_state, chardevs, vhost_devices >> - dlopen does not resolve them, must get addresses via symbol lookup. >> . Delete some vmstate handlers, register new ones (eg vfio) >> . Call MADV_DOEXEC on guest memory. Find devices, preserve fd >> * Hot patch a monitor function to dlopen vmsave.so, call entry point >> - write patch to /proc/pid/mem >> - Call the monitor function via monitor socket >> * Send cprload to update qemu >> * vmsave.so has binary dependency on qemu data structures and variables >> - Build vmsave-ver.so per legacy version >> - Indexed by qemu's gcc build-id >> >> ----------------------------------------------------------------------------- > > That's hairy! > At that point isn't it easier to recompile a patched qemu against the > original sources and ptrace something in to mmap the new qemu? That could work, but safely capturing all the threads and forcing them to jump to the mmap'd qemu is hard. - Steve >>>>>> However, the main issue is that guest ram must be backed by named shared memory, and >>>>>> we would need to add code to support shared memory for all the secondary memory objects. >>>>>> That makes it less interesting for us at this time; we care about updating legacy qemu >>>>>> instances with anonymous guest memory. >>>>> >>>>> Thanks for explaining this more in the other sub-thread. The secondary >>>>> memory objects you mentioned are relatively small so I don't think >>>>> saving them in the traditional way is a problem. >>>>> >>>>> Two approaches for zero-copy memory migration fit into QEMU's existing >>>>> migration infrastructure: >>>>> >>>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs, >>>>> etc) so they are not saved into the savevm file. The existing --object >>>>> memory-backend-file syntax can be used.
>>>>> >>>>> - Extending the live migration protocol to detect when file descriptor >>>>> passing is available (i.e. UNIX domain socket migration) and using >>>>> that for memory-backend-* objects that have fds. >>>>> >>>>> Either of these approaches would handle RAM with existing savevm/migrate >>>>> commands. >>>> >>>> Yes, but the vfio issues would still need to be solved, and we would need new >>>> command line options to back existing and future secondary memory objects with >>>> named shared memory. >>>> >>>>> The remaining issue is how to migrate VFIO and other file descriptors >>>>> that cannot be reopened by the new process. As mentioned, QEMU already >>>>> has file descriptor passing support in the QMP monitor and support for >>>>> opening passed file descriptors (see qemu_open_internal(), >>>>> monitor_fd_param(), and socket_get_fd()). >>>>> >>>>> The advantage of integrating live update functionality into the existing >>>>> savevm/migrate commands is that it will work in more use cases with >>>>> less code duplication/maintenance/bitrot prevention than the >>>>> special-case cprsave command in this patch series. >>>>> >>>>> Maybe there is a fundamental technical reason why live update needs to >>>>> be different from QEMU's existing migration commands but I haven't >>>>> figured it out yet. >>>> >>>> vfio and anonymous memory. >>>> >>>> Regarding code duplication, I did consider whether to extend the migration >>>> syntax and implementation versus creating something new. Those functions >>>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr >>>> use case, and the cpr functions handle state that is n/a for the migration case. >>>> I judged that handling both in the same functions would be less readable and >>>> maintainable. After feedback during the V1 review, I simplified the cprsave >>>> code by by calling qemu_save_device_state, as Xen does, thus eliminating any >>>> interaction with the migration code. >>>> >>>> Regarding bit rot, I still need to add a cpr test to the test suite, when the >>>> review is more complete and folks agree on the final form of the functionality. >>>> >>>> I do like the idea of supporting update without exec, but as a future project, >>>> and not at the expense of dropping update with exec. >>>> >>>> - Steve >>>> >>
Hi Michael, Marcel, I hope you have time to review the pci and vfio-pci related patches in this series. They are an essential part of the live update functionality. The first 2 patches are straightforward, just exposing functions for use in vfio. The last 2 patches are more substantial. - pci: export functions for cpr - vfio-pci: refactor for cpr - vfio-pci: cpr part 1 - vfio-pci: cpr part 2 - Steve On 5/7/2021 8:24 AM, Steve Sistare wrote: > Provide the cprsave and cprload commands for live update. These save and > restore VM state, with minimal guest pause time, so that qemu may be updated > to a new version in between. > > cprsave stops the VM and saves vmstate to an ordinary file. It supports two > modes: restart and reboot. For restart, cprsave exec's the qemu binary (or > /usr/bin/qemu-exec if it exists) with the same argv. qemu restarts in a > paused state and waits for the cprload command. > > To use the restart mode, qemu must be started with the memfd-alloc option, > which allocates guest ram using memfd_create. The memfd's are saved to > the environment and kept open across exec, after which they are found from > the environment and re-mmap'd. Hence guest ram is preserved in place, > albeit with new virtual addresses in the qemu process. The caller resumes > the guest by calling cprload, which loads state from the file. If the VM > was running at cprsave time, then VM execution resumes. cprsave supports > any type of guest image and block device, but the caller must not modify > guest block devices between cprsave and cprload. > > The restart mode supports vfio devices by preserving the vfio container, > group, device, and event descriptors across the qemu re-exec, and by > updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and > VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/ > and integrated in Linux kernel 5.12. > > For the reboot mode, cprsave saves state and exits qemu, and the caller is > allowed to update the host kernel and system software and reboot. The > caller resumes the guest by running qemu with the same arguments as the > original process and calling cprload. To use this mode, guest ram must be > mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm > PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com. > > The reboot mode supports vfio devices if the caller suspends the guest > instead of stopping the VM, such as by issuing guest-suspend-ram to the > qemu guest agent. The guest drivers' suspend methods flush outstanding > requests and re-initialize the devices, and thus there is no device state > to save and restore. 
> > The first patches add helper functions: > > - as_flat_walk > - qemu_ram_volatile > - oslib: qemu_clr_cloexec > - util: env var helpers > - machine: memfd-alloc option > - vl: add helper to request re-exec > > The next patches implement cprsave and cprload: > > - cpr > - cpr: QMP interfaces > - cpr: HMP interfaces > > The next patches add vfio support for the restart mode: > > - pci: export functions for cpr > - vfio-pci: refactor for cpr > - vfio-pci: cpr part 1 > - vfio-pci: cpr part 2 > > The next patches preserve various descriptor-based backend devices across > a cprsave restart: > > - vhost: reset vhost devices upon cprsave > - hostmem-memfd: cpr support > - chardev: cpr framework > - chardev: cpr for simple devices > - chardev: cpr for pty > - chardev: cpr for sockets > - cpr: only-cpr-capable option > - cpr: maintainers > - simplify savevm > > Here is an example of updating qemu from v4.2.0 to v4.2.1 using > "cprload restart". The software update is performed while the guest is > running to minimize downtime. > > window 1 | window 2 > | > # qemu-system-x86_64 ... | > QEMU 4.2.0 monitor - type 'help' ... | > (qemu) info status | > VM status: running | > | # yum update qemu > (qemu) cprsave /tmp/qemu.sav restart | > QEMU 4.2.1 monitor - type 'help' ... | > (qemu) info status | > VM status: paused (prelaunch) | > (qemu) cprload /tmp/qemu.sav | > (qemu) info status | > VM status: running | > > > Here is an example of updating the host kernel using "cprload reboot" > > window 1 | window 2 > | > # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...| > QEMU 4.2.1 monitor - type 'help' ... | > (qemu) info status | > VM status: running | > | # yum update kernel-uek > (qemu) cprsave /tmp/qemu.sav restart | > | > # systemctl kexec | > kexec_core: Starting new kernel | > ... | > | > # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...| > QEMU 4.2.1 monitor - type 'help' ... | > (qemu) info status | > VM status: paused (prelaunch) | > (qemu) cprload /tmp/qemu.sav | > (qemu) info status | > VM status: running | > > Changes from V1 to V2: > - revert vmstate infrastructure changes > - refactor cpr functions into new files > - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to > preserve memory. 
> - add framework to filter chardev's that support cpr > - save and restore vfio eventfd's > - modify cprinfo QMP interface > - incorporate misc review feedback > - remove unrelated and unneeded patches > - refactor all patches into a shorter and easier to review series > > Changes from V2 to V3: > - rebase to qemu 6.0.0 > - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc) > - change memfd-alloc to a machine option > - use existing channel socket function instead of defining new ones > - close monitor socket during cpr > - support memory-backend-memfd > - fix a few unreported bugs > > Steve Sistare (18): > as_flat_walk > qemu_ram_volatile > oslib: qemu_clr_cloexec > util: env var helpers > machine: memfd-alloc option > vl: add helper to request re-exec > cpr > pci: export functions for cpr > vfio-pci: refactor for cpr > vfio-pci: cpr part 1 > vfio-pci: cpr part 2 > hostmem-memfd: cpr support > chardev: cpr framework > chardev: cpr for simple devices > chardev: cpr for pty > cpr: only-cpr-capable option > cpr: maintainers > simplify savevm > > Mark Kanda, Steve Sistare (4): > cpr: QMP interfaces > cpr: HMP interfaces > vhost: reset vhost devices upon cprsave > chardev: cpr for sockets > > MAINTAINERS | 11 +++ > backends/hostmem-memfd.c | 21 +++-- > chardev/char-mux.c | 1 + > chardev/char-null.c | 1 + > chardev/char-pty.c | 15 ++- > chardev/char-serial.c | 1 + > chardev/char-socket.c | 35 +++++++ > chardev/char-stdio.c | 8 ++ > chardev/char.c | 41 +++++++- > gdbstub.c | 1 + > hmp-commands.hx | 44 +++++++++ > hw/core/machine.c | 19 ++++ > hw/pci/msi.c | 4 + > hw/pci/msix.c | 20 ++-- > hw/pci/pci.c | 7 +- > hw/vfio/common.c | 68 +++++++++++++- > hw/vfio/cpr.c | 131 ++++++++++++++++++++++++++ > hw/vfio/meson.build | 1 + > hw/vfio/pci.c | 214 ++++++++++++++++++++++++++++++++++++++---- > hw/vfio/trace-events | 1 + > hw/virtio/vhost.c | 11 +++ > include/chardev/char.h | 6 ++ > include/exec/memory.h | 25 +++++ > include/hw/boards.h | 1 + > include/hw/pci/msix.h | 5 + > include/hw/pci/pci.h | 2 + > include/hw/vfio/vfio-common.h | 8 ++ > include/hw/virtio/vhost.h | 1 + > include/migration/cpr.h | 17 ++++ > include/monitor/hmp.h | 3 + > include/qemu/env.h | 23 +++++ > include/qemu/osdep.h | 1 + > include/sysemu/runstate.h | 2 + > include/sysemu/sysemu.h | 2 + > linux-headers/linux/vfio.h | 27 ++++++ > migration/cpr.c | 200 +++++++++++++++++++++++++++++++++++++++ > migration/meson.build | 1 + > migration/migration.c | 5 + > migration/savevm.c | 21 ++--- > migration/savevm.h | 2 + > monitor/hmp-cmds.c | 48 ++++++++++ > monitor/hmp.c | 3 + > monitor/qmp-cmds.c | 31 ++++++ > monitor/qmp.c | 3 + > qapi/char.json | 5 +- > qapi/cpr.json | 76 +++++++++++++++ > qapi/meson.build | 1 + > qapi/qapi-schema.json | 1 + > qemu-options.hx | 39 +++++++- > softmmu/globals.c | 2 + > softmmu/memory.c | 48 ++++++++++ > softmmu/physmem.c | 49 ++++++++-- > softmmu/runstate.c | 49 +++++++++- > softmmu/vl.c | 21 ++++- > stubs/cpr.c | 3 + > stubs/meson.build | 1 + > trace-events | 1 + > util/env.c | 99 +++++++++++++++++++ > util/meson.build | 1 + > util/oslib-posix.c | 9 ++ > util/oslib-win32.c | 4 + > util/qemu-config.c | 4 + > 62 files changed, 1431 insertions(+), 74 deletions(-) > create mode 100644 hw/vfio/cpr.c > create mode 100644 include/migration/cpr.h > create mode 100644 include/qemu/env.h > create mode 100644 migration/cpr.c > create mode 100644 qapi/cpr.json > create mode 100644 stubs/cpr.c > create mode 100644 util/env.c >
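As a rough, self-contained illustration of the restart-mode ram preservation the cover letter describes (a memfd kept open across exec, then re-mmap'd at a possibly new virtual address): the CPR_RAM_FD environment variable name and single-region layout are inventions of this sketch, not the naming the series actually uses.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

/* Old qemu: allocate guest ram from a memfd and publish the fd. */
void *alloc_preserved_ram(size_t size)
{
    int fd = memfd_create("pc.ram", 0);
    if (fd < 0 || ftruncate(fd, size) < 0) {
        return NULL;
    }
    /* Ensure FD_CLOEXEC is clear so the fd survives execve(2);
     * the series adds a qemu_clr_cloexec() helper for this. */
    fcntl(fd, F_SETFD, fcntl(fd, F_GETFD) & ~FD_CLOEXEC);
    char buf[16];
    snprintf(buf, sizeof(buf), "%d", fd);
    setenv("CPR_RAM_FD", buf, 1);       /* invented name for this sketch */
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* New qemu, after exec: find the fd and re-mmap the same pages,
 * possibly at a different virtual address. */
void *reattach_preserved_ram(size_t size)
{
    const char *val = getenv("CPR_RAM_FD");
    if (!val) {
        return NULL;
    }
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                atoi(val), 0);
}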
Hi Steven, I'd like to split the discussion into reboot and restart, so I can make sure I understand them individually. So, reboot mode: can you explain which parts of this series are needed for reboot mode? I've managed to do a kexec-based reboot on qemu with the current qemu, albeit with no vfio devices. My current understanding is that for doing reboot with vfio we just need some way of getting migrate to send the metadata associated with vfio devices while the guest is in S3. Is there something I'm missing that you have in this series? Dave
On the 'restart' branch of questions; can you explain, other than the passing of the fd's, why the outgoing side of qemu's 'migrate exec:' doesn't work for you? Dave
On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote: > Hi Steven, > I'd like to split the discussion into reboot and restart, > so I can make sure I understand them individually. > > So reboot mode; > Can you explain which parts of this series are needed for reboot mode; > I've managed to do a kexec based reboot on qemu with the current qemu - > albeit with no vfio devices, my current understanding is that for doing > reboot with vfio we just need some way of getting migrate to send the > metadata associated with vfio devices if the guest is in S3. > > Is there something I'm missing and which you have in this series? You are correct, this series has little special code for reboot mode, but it does allow reboot and restart to be handled similarly, which simplifies the management layer because the same calls are performed for each mode. For vfio in reboot mode, prior to sending cprload, the manager sends the guest-suspend-ram command to the qemu guest agent. This flushes requests and brings the guest device to a reset state, so there is no vfio metadata to save. Reboot mode does not call vfio_cprsave. There are a few unique patches to support reboot mode. One is qemu_ram_volatile, which is a sanity check that the writable ram blocks are backed by some form of shared memory. Plus there are a few fragments in the "cpr" patch that handle the suspended state that is induced by guest-suspend-ram. See qemu_system_start_on_wake_request() and instances of RUN_STATE_SUSPENDED in migration/cpr.c - Steve
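For readers following along, here is a guess at the shape of the qemu_ram_volatile check Steve mentions, written against block-walk helpers that already exist in the QEMU tree (qemu_ram_foreach_block, qemu_ram_is_shared); the actual patch may well differ in detail.

#include "qemu/osdep.h"
#include "qemu/error-report.h"
#include "exec/cpu-common.h"
#include "exec/memory.h"
#include "exec/ramblock.h"

static int ram_block_volatile(RAMBlock *rb, void *opaque)
{
    /* A writable block that is not backed by shared memory cannot
     * survive a reboot, so flag it and stop the walk. */
    if (!qemu_ram_is_shared(rb) && !memory_region_is_rom(rb->mr)) {
        error_report("ram block %s is volatile", qemu_ram_get_idstr(rb));
        return 1;
    }
    return 0;
}

bool qemu_ram_volatile(void)
{
    return qemu_ram_foreach_block(ram_block_volatile, NULL) != 0;
}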
On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote: > On the 'restart' branch of questions; can you explain, > other than the passing of the fd's, why the outgoing side of > qemu's 'migrate exec:' doesn't work for you? I'm not sure what I should describe. Can you be more specific? Do you mean: can we add the cpr specific bits to the migrate exec code? - Steve
* Steven Sistare (steven.sistare@oracle.com) wrote: > On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote: > > On the 'restart' branch of questions; can you explain, > > other than the passing of the fd's, why the outgoing side of > > qemu's 'migrate exec:' doesn't work for you? > > I'm not sure what I should describe. Can you be more specific? > Do you mean: can we add the cpr specific bits to the migrate exec code? Yes; if possible I'd prefer to just keep the one exec mechanism. It has an advantage of letting you specify the new command line; that avoids the problems I'd pointed out with needing to change the command line if a hotplug had happened. It also means we only need one chunk of exec code. Dave > - Steve >
On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote: > * Steven Sistare (steven.sistare@oracle.com) wrote: >> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote: >>> On the 'restart' branch of questions; can you explain, >>> other than the passing of the fd's, why the outgoing side of >>> qemu's 'migrate exec:' doesn't work for you? >> >> I'm not sure what I should describe. Can you be more specific? >> Do you mean: can we add the cpr specific bits to the migrate exec code? > > Yes; if possible I'd prefer to just keep the one exec mechanism. > It has an advantage of letting you specify the new command line; that > avoids the problems I'd pointed out with needing to change the command > line if a hotplug had happened. It also means we only need one chunk of > exec code. How/where would you specify a new command line? Are you picturing the usual migration setup where you start a second qemu with its own arguments, plus a migrate_incoming option or command? That does not work for live update restart; the old qemu must exec the new qemu. Or something else? We could shoehorn cpr restart into the migrate exec path by defining a new migration capability that the client would set before issuing the migrate command. However, the implementation code would be sprinkled with conditionals to suppress migrate-only bits and call cpr-only bits. IMO that would be less maintainable than having a separate cprsave function. Currently cprsave does not duplicate any migration functionality. cprsave calls qemu_save_device_state() which is used by xen. - Steve
Hi Michael, Alex has reviewed the vfio-pci patches. If you could give me a thumbs-up or a needs-work on "pci: export functions for cpr", I would appreciate it. Thanks! [PATCH V3 10/22] pci: export functions for cpr https://lore.kernel.org/qemu-devel/1620390320-301716-11-git-send-email-steven.sistare@oracle.com - Steve On 5/19/2021 12:43 PM, Steven Sistare wrote: > Hi Michael, Marcel, > I hope you have time to review the pci and vfio-pci related patches in this > series. They are an essential part of the live update functionality. The > first 2 patches are straightforward, just exposing functions for use in vfio. > The last 2 patches are more substantial. > > - pci: export functions for cpr > - vfio-pci: refactor for cpr > - vfio-pci: cpr part 1 > - vfio-pci: cpr part 2 > > - Steve > > On 5/7/2021 8:24 AM, Steve Sistare wrote: >> Provide the cprsave and cprload commands for live update. These save and >> restore VM state, with minimal guest pause time, so that qemu may be updated >> to a new version in between. >> >> cprsave stops the VM and saves vmstate to an ordinary file. It supports two >> modes: restart and reboot. For restart, cprsave exec's the qemu binary (or >> /usr/bin/qemu-exec if it exists) with the same argv. qemu restarts in a >> paused state and waits for the cprload command. >> >> To use the restart mode, qemu must be started with the memfd-alloc option, >> which allocates guest ram using memfd_create. The memfd's are saved to >> the environment and kept open across exec, after which they are found from >> the environment and re-mmap'd. Hence guest ram is preserved in place, >> albeit with new virtual addresses in the qemu process. The caller resumes >> the guest by calling cprload, which loads state from the file. If the VM >> was running at cprsave time, then VM execution resumes. cprsave supports >> any type of guest image and block device, but the caller must not modify >> guest block devices between cprsave and cprload. >> >> The restart mode supports vfio devices by preserving the vfio container, >> group, device, and event descriptors across the qemu re-exec, and by >> updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and >> VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/ >> and integrated in Linux kernel 5.12. >> >> For the reboot mode, cprsave saves state and exits qemu, and the caller is >> allowed to update the host kernel and system software and reboot. The >> caller resumes the guest by running qemu with the same arguments as the >> original process and calling cprload. To use this mode, guest ram must be >> mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm >> PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com. >> >> The reboot mode supports vfio devices if the caller suspends the guest >> instead of stopping the VM, such as by issuing guest-suspend-ram to the >> qemu guest agent. The guest drivers' suspend methods flush outstanding >> requests and re-initialize the devices, and thus there is no device state >> to save and restore. 
>>
>> The first patches add helper functions:
>>
>>  - as_flat_walk
>>  - qemu_ram_volatile
>>  - oslib: qemu_clr_cloexec
>>  - util: env var helpers
>>  - machine: memfd-alloc option
>>  - vl: add helper to request re-exec
>>
>> The next patches implement cprsave and cprload:
>>
>>  - cpr
>>  - cpr: QMP interfaces
>>  - cpr: HMP interfaces
>>
>> The next patches add vfio support for the restart mode:
>>
>>  - pci: export functions for cpr
>>  - vfio-pci: refactor for cpr
>>  - vfio-pci: cpr part 1
>>  - vfio-pci: cpr part 2
>>
>> The next patches preserve various descriptor-based backend devices
>> across a cprsave restart:
>>
>>  - vhost: reset vhost devices upon cprsave
>>  - hostmem-memfd: cpr support
>>  - chardev: cpr framework
>>  - chardev: cpr for simple devices
>>  - chardev: cpr for pty
>>  - chardev: cpr for sockets
>>  - cpr: only-cpr-capable option
>>  - cpr: maintainers
>>  - simplify savevm
>>
>> Here is an example of updating qemu from v4.2.0 to v4.2.1 using
>> "cprload restart".  The software update is performed while the guest is
>> running to minimize downtime.
>>
>> window 1                                        | window 2
>>                                                 |
>> # qemu-system-x86_64 ...                        |
>> QEMU 4.2.0 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: running                              |
>>                                                 | # yum update qemu
>> (qemu) cprsave /tmp/qemu.sav restart            |
>> QEMU 4.2.1 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: paused (prelaunch)                   |
>> (qemu) cprload /tmp/qemu.sav                    |
>> (qemu) info status                              |
>> VM status: running                              |
>>
>>
>> Here is an example of updating the host kernel using "cprload reboot"
>>
>> window 1                                        | window 2
>>                                                 |
>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: running                              |
>>                                                 | # yum update kernel-uek
>> (qemu) cprsave /tmp/qemu.sav reboot             |
>>                                                 |
>> # systemctl kexec                               |
>> kexec_core: Starting new kernel                 |
>> ...                                             |
>>                                                 |
>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: paused (prelaunch)                   |
>> (qemu) cprload /tmp/qemu.sav                    |
>> (qemu) info status                              |
>> VM status: running                              |
>>
>> Changes from V1 to V2:
>>  - revert vmstate infrastructure changes
>>  - refactor cpr functions into new files
>>  - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to
>>    preserve memory.
>>  - add framework to filter chardev's that support cpr
>>  - save and restore vfio eventfd's
>>  - modify cprinfo QMP interface
>>  - incorporate misc review feedback
>>  - remove unrelated and unneeded patches
>>  - refactor all patches into a shorter and easier to review series
>>
>> Changes from V2 to V3:
>>  - rebase to qemu 6.0.0
>>  - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
>>  - change memfd-alloc to a machine option
>>  - use existing channel socket function instead of defining new ones
>>  - close monitor socket during cpr
>>  - support memory-backend-memfd
>>  - fix a few unreported bugs
>>
>> Steve Sistare (18):
>>   as_flat_walk
>>   qemu_ram_volatile
>>   oslib: qemu_clr_cloexec
>>   util: env var helpers
>>   machine: memfd-alloc option
>>   vl: add helper to request re-exec
>>   cpr
>>   pci: export functions for cpr
>>   vfio-pci: refactor for cpr
>>   vfio-pci: cpr part 1
>>   vfio-pci: cpr part 2
>>   hostmem-memfd: cpr support
>>   chardev: cpr framework
>>   chardev: cpr for simple devices
>>   chardev: cpr for pty
>>   cpr: only-cpr-capable option
>>   cpr: maintainers
>>   simplify savevm
>>
>> Mark Kanda, Steve Sistare (4):
>>   cpr: QMP interfaces
>>   cpr: HMP interfaces
>>   vhost: reset vhost devices upon cprsave
>>   chardev: cpr for sockets
>>
>>  MAINTAINERS                   |  11 +++
>>  backends/hostmem-memfd.c      |  21 +++--
>>  chardev/char-mux.c            |   1 +
>>  chardev/char-null.c           |   1 +
>>  chardev/char-pty.c            |  15 ++-
>>  chardev/char-serial.c         |   1 +
>>  chardev/char-socket.c         |  35 +++++++
>>  chardev/char-stdio.c          |   8 ++
>>  chardev/char.c                |  41 +++++++-
>>  gdbstub.c                     |   1 +
>>  hmp-commands.hx               |  44 +++++++++
>>  hw/core/machine.c             |  19 ++++
>>  hw/pci/msi.c                  |   4 +
>>  hw/pci/msix.c                 |  20 ++--
>>  hw/pci/pci.c                  |   7 +-
>>  hw/vfio/common.c              |  68 +++++++++++++-
>>  hw/vfio/cpr.c                 | 131 ++++++++++++++++++++++++++
>>  hw/vfio/meson.build           |   1 +
>>  hw/vfio/pci.c                 | 214 ++++++++++++++++++++++++++++++++++++++----
>>  hw/vfio/trace-events          |   1 +
>>  hw/virtio/vhost.c             |  11 +++
>>  include/chardev/char.h        |   6 ++
>>  include/exec/memory.h         |  25 +++++
>>  include/hw/boards.h           |   1 +
>>  include/hw/pci/msix.h         |   5 +
>>  include/hw/pci/pci.h          |   2 +
>>  include/hw/vfio/vfio-common.h |   8 ++
>>  include/hw/virtio/vhost.h     |   1 +
>>  include/migration/cpr.h       |  17 ++++
>>  include/monitor/hmp.h         |   3 +
>>  include/qemu/env.h            |  23 +++++
>>  include/qemu/osdep.h          |   1 +
>>  include/sysemu/runstate.h     |   2 +
>>  include/sysemu/sysemu.h       |   2 +
>>  linux-headers/linux/vfio.h    |  27 ++++++
>>  migration/cpr.c               | 200 +++++++++++++++++++++++++++++++++++++++
>>  migration/meson.build         |   1 +
>>  migration/migration.c         |   5 +
>>  migration/savevm.c            |  21 ++---
>>  migration/savevm.h            |   2 +
>>  monitor/hmp-cmds.c            |  48 ++++++++++
>>  monitor/hmp.c                 |   3 +
>>  monitor/qmp-cmds.c            |  31 ++++++
>>  monitor/qmp.c                 |   3 +
>>  qapi/char.json                |   5 +-
>>  qapi/cpr.json                 |  76 +++++++++++++++
>>  qapi/meson.build              |   1 +
>>  qapi/qapi-schema.json         |   1 +
>>  qemu-options.hx               |  39 +++++++-
>>  softmmu/globals.c             |   2 +
>>  softmmu/memory.c              |  48 ++++++++++
>>  softmmu/physmem.c             |  49 ++++++++--
>>  softmmu/runstate.c            |  49 +++++++++-
>>  softmmu/vl.c                  |  21 ++++-
>>  stubs/cpr.c                   |   3 +
>>  stubs/meson.build             |   1 +
>>  trace-events                  |   1 +
>>  util/env.c                    |  99 +++++++++++++++++++
>>  util/meson.build              |   1 +
>>  util/oslib-posix.c            |   9 ++
>>  util/oslib-win32.c            |   4 +
>>  util/qemu-config.c            |   4 +
>>  62 files changed, 1431 insertions(+), 74 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>  create mode 100644 include/migration/cpr.h
>>  create mode 100644 include/qemu/env.h
>>  create mode 100644 migration/cpr.c
>>  create mode 100644 qapi/cpr.json
>>  create mode 100644 stubs/cpr.c
>>  create mode 100644 util/env.c
>>
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> > * Steven Sistare (steven.sistare@oracle.com) wrote:
> >> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> >>> On the 'restart' branch of questions; can you explain,
> >>> other than the passing of the fd's, why the outgoing side of
> >>> qemu's 'migrate exec:' doesn't work for you?
> >>
> >> I'm not sure what I should describe.  Can you be more specific?
> >> Do you mean: can we add the cpr specific bits to the migrate exec code?
> >
> > Yes; if possible I'd prefer to just keep the one exec mechanism.
> > It has an advantage of letting you specify the new command line; that
> > avoids the problems I'd pointed out with needing to change the command
> > line if a hotplug had happened.  It also means we only need one chunk of
> > exec code.
>
> How/where would you specify a new command line?  Are you picturing the
> usual migration setup where you start a second qemu with its own
> arguments, plus a migrate_incoming option or command?  That does not work
> for live update restart; the old qemu must exec the new qemu.  Or
> something else?

The existing migration path allows an exec - originally intended to exec
something like a compressor or a store to file rather than a real
migration; i.e. you can do:

  migrate "exec:gzip > mig"

and that will save the migration stream to a compressed file called mig.
Now, I *think* we can already do:

  migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"

(That's probably cleaner via the QMP interface).

I'm not quite sure what I want in the incoming there, but that is
already the source execing the destination qemu - although I think we'd
probably need to check if that's actually via an intermediary.

> We could shoehorn cpr restart into the migrate exec path by defining a
> new migration capability that the client would set before issuing the
> migrate command.  However, the implementation code would be sprinkled
> with conditionals to suppress migrate-only bits and call cpr-only bits.
> IMO that would be less maintainable than having a separate cprsave
> function.  Currently cprsave does not duplicate any migration
> functionality.  cprsave calls qemu_save_device_state() which is used by
> xen.

To me it feels like cprsave in particular is replicating more code.

It's also jumping through hoops in places to avoid changing the
commandline; that's going to cause more pain for a lot of people - not
just because it's hacks all over for that, but because a lot of people
are actually going to need to change the commandline even in a cpr like
case (e.g. due to hotplug or changing something else about the
environment, like auth data or route to storage or networking that
changed).

There are hooks for early parameter parsing, so if we need to add extra
commandline args we can; but for example the case of QEMU_START_FREEZE
to add -S just isn't needed as soon as you let go of the idea of needing
an identical commandline.

Dave

> - Steve
>
On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
> > On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> > > * Steven Sistare (steven.sistare@oracle.com) wrote:
> > >> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> > >>> On the 'restart' branch of questions; can you explain,
> > >>> other than the passing of the fd's, why the outgoing side of
> > >>> qemu's 'migrate exec:' doesn't work for you?
> > >>
> > >> I'm not sure what I should describe.  Can you be more specific?
> > >> Do you mean: can we add the cpr specific bits to the migrate exec code?
> > >
> > > Yes; if possible I'd prefer to just keep the one exec mechanism.
> > > It has an advantage of letting you specify the new command line; that
> > > avoids the problems I'd pointed out with needing to change the command
> > > line if a hotplug had happened.  It also means we only need one chunk
> > > of exec code.
> >
> > How/where would you specify a new command line?  Are you picturing the
> > usual migration setup where you start a second qemu with its own
> > arguments, plus a migrate_incoming option or command?  That does not
> > work for live update restart; the old qemu must exec the new qemu.  Or
> > something else?
>
> The existing migration path allows an exec - originally intended to exec
> something like a compressor or a store to file rather than a real
> migration; i.e. you can do:
>
>   migrate "exec:gzip > mig"
>
> and that will save the migration stream to a compressed file called mig.
> Now, I *think* we can already do:
>
>   migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
>
> (That's probably cleaner via the QMP interface).
>
> I'm not quite sure what I want in the incoming there, but that is
> already the source execing the destination qemu - although I think we'd
> probably need to check if that's actually via an intermediary.

I don't think you can directly exec qemu in that way, because the
source QEMU migration code is going to wait for completion of the
QEMU you exec'd and that'll never come on success. So you'll end
up with both QEMU's running forever. If you pass the daemonize
option to the new QEMU then it will immediately detach itself,
and the source QEMU will think the migration command has finished
or failed.

I think you can probably do it if you use a wrapper script though.
The wrapper would have to fork QEMU in the backend, and then the
wrapper would have to monitor the new QEMU to see when the incoming
migration has finished/aborted, at which point the wrapper can
exit, so the source QEMU sees a successful cleanup of the exec'd
command. </hand waving>

> > We could shoehorn cpr restart into the migrate exec path by defining a
> > new migration capability that the client would set before issuing the
> > migrate command.  However, the implementation code would be sprinkled
> > with conditionals to suppress migrate-only bits and call cpr-only bits.
> > IMO that would be less maintainable than having a separate cprsave
> > function.  Currently cprsave does not duplicate any migration
> > functionality.  cprsave calls qemu_save_device_state() which is used by
> > xen.
>
> To me it feels like cprsave in particular is replicating more code.
>
> It's also jumping through hoops in places to avoid changing the
> commandline; that's going to cause more pain for a lot of people - not
> just because it's hacks all over for that, but because a lot of people
> are actually going to need to change the commandline even in a cpr like
> case (e.g. due to hotplug or changing something else about the
> environment, like auth data or route to storage or networking that
> changed).

Management apps that already support migration, will almost certainly
know how to start up a new QEMU with a different command line that
takes account of hotplugged/unplugged devices. IOW avoiding changing
the command line only really addresses the simple case, and the hard
case is likely already solved for purposes of handling regular live
migration.

Regards,
Daniel
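[For illustration only, here is a minimal stand-alone sketch, in C rather
than a script, of the wrapper approach hand-waved above: fork the
destination qemu, hand it the migration stream arriving on stdin, and exit
successfully once the destination looks healthy.  The qemu path, the
"-incoming fd:0" argument, and the sleep standing in for a real readiness
probe are assumptions for this sketch; a real wrapper would watch the
destination's QMP socket for migration completion.]

/*
 * migrate-exec-wrapper.c: hypothetical wrapper for: migrate "exec:..."
 * The source qemu writes the migration stream to our stdin; the forked
 * destination qemu inherits that stdin and reads the stream from it.
 * DEST_QEMU and the readiness probe are placeholders, not code from
 * this series.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DEST_QEMU "/usr/bin/qemu-system-x86_64"    /* placeholder path */

int main(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: detach, then become the destination qemu, reading the
         * incoming migration stream from inherited stdin (fd 0). */
        setsid();
        execl(DEST_QEMU, DEST_QEMU, "-incoming", "fd:0", (char *)NULL);
        perror("execl");
        _exit(127);
    }
    /* Parent: a real wrapper would poll the destination's QMP socket
     * until the incoming migration finishes or aborts.  A sleep plus an
     * early-death check stands in for that here. */
    sleep(5);
    if (waitpid(pid, NULL, WNOHANG) == pid) {
        fprintf(stderr, "destination qemu exited early\n");
        return 1;
    }
    return 0;    /* source qemu sees the exec'd command succeed */
}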
On 6/3/2021 4:44 PM, Daniel P. Berrangé wrote:
> On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
>>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>>>>> On the 'restart' branch of questions; can you explain,
>>>>>> other than the passing of the fd's, why the outgoing side of
>>>>>> qemu's 'migrate exec:' doesn't work for you?
>>>>>
>>>>> I'm not sure what I should describe.  Can you be more specific?
>>>>> Do you mean: can we add the cpr specific bits to the migrate exec code?
>>>>
>>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
>>>> It has an advantage of letting you specify the new command line; that
>>>> avoids the problems I'd pointed out with needing to change the command
>>>> line if a hotplug had happened.  It also means we only need one chunk
>>>> of exec code.
>>>
>>> How/where would you specify a new command line?  Are you picturing the
>>> usual migration setup where you start a second qemu with its own
>>> arguments, plus a migrate_incoming option or command?  That does not
>>> work for live update restart; the old qemu must exec the new qemu.  Or
>>> something else?
>>
>> The existing migration path allows an exec - originally intended to exec
>> something like a compressor or a store to file rather than a real
>> migration; i.e. you can do:
>>
>>   migrate "exec:gzip > mig"
>>
>> and that will save the migration stream to a compressed file called mig.
>> Now, I *think* we can already do:
>>
>>   migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
>>
>> (That's probably cleaner via the QMP interface).
>>
>> I'm not quite sure what I want in the incoming there, but that is
>> already the source execing the destination qemu - although I think we'd
>> probably need to check if that's actually via an intermediary.
>
> I don't think you can directly exec qemu in that way, because the
> source QEMU migration code is going to wait for completion of the
> QEMU you exec'd and that'll never come on success. So you'll end
> up with both QEMU's running forever. If you pass the daemonize
> option to the new QEMU then it will immediately detach itself,
> and the source QEMU will think the migration command has finished
> or failed.
>
> I think you can probably do it if you use a wrapper script though.
> The wrapper would have to fork QEMU in the backend, and then the
> wrapper would have to monitor the new QEMU to see when the incoming
> migration has finished/aborted, at which point the wrapper can
> exit, so the source QEMU sees a successful cleanup of the exec'd
> command. </hand waving>

cpr restart does not work for any scheme that involves the old qemu
process co-existing with the new qemu process.  To preserve descriptors
and anonymous memory, cpr restart requires that old qemu directly execs
new qemu.  Not fork-exec.  Same pid.

So responding to Dave's comment, "keep the one exec mechanism", that is
not possible.  We still need the qemu_exec_requested mechanism to cause a
direct exec after state is saved.

>>> We could shoehorn cpr restart into the migrate exec path by defining a
>>> new migration capability that the client would set before issuing the
>>> migrate command.  However, the implementation code would be sprinkled
>>> with conditionals to suppress migrate-only bits and call cpr-only bits.
>>> IMO that would be less maintainable than having a separate cprsave
>>> function.  Currently cprsave does not duplicate any migration
>>> functionality.  cprsave calls qemu_save_device_state() which is used by
>>> xen.
>>
>> To me it feels like cprsave in particular is replicating more code.
>>
>> It's also jumping through hoops in places to avoid changing the
>> commandline; that's going to cause more pain for a lot of people - not
>> just because it's hacks all over for that, but because a lot of people
>> are actually going to need to change the commandline even in a cpr like
>> case (e.g. due to hotplug or changing something else about the
>> environment, like auth data or route to storage or networking that
>> changed).
>
> Management apps that already support migration, will almost certainly
> know how to start up a new QEMU with a different command line that
> takes account of hotplugged/unplugged devices. IOW avoiding changing
> the command line only really addresses the simple case, and the hard
> case is likely already solved for purposes of handling regular live
> migration.

Agreed, with the caveat that for cpr, the management app must communicate
the new arguments to the qemu-exec trampoline, rather than passing the
args on the command line to a new qemu process.

>> There are hooks for early parameter parsing, so if we need to add extra
>> commandline args we can; but for example the case of QEMU_START_FREEZE
>> to add -S just isn't needed as soon as you let go of the idea of needing
>> an identical commandline.

I'll delete QEMU_START_FREEZE.

I still need to preserve argv_main and pass it to the qemu-exec
trampoline, though, as the args contain identifying information that the
management app needs to modify the arguments based on the instance's hot
plug history.

Or, here is another possibility.  We could redefine cprsave to leave the
VM in a stopped state, and add a cprstart command to be called
subsequently that performs the exec.  It takes a single string argument:
a command plus arguments to exec.  The command may be qemu or a trampoline
like qemu-exec.  I like that the trampoline name is no longer hardcoded.
The management app can derive new qemu args for the instance as it would
with migration, and pass them to the command, instead of passing them to
qemu-exec via some side channel.  cprload finishes the job and does not
change.  I already like this scheme better.

- Steve
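[To make the "same pid, direct exec" constraint concrete, the following
stand-alone sketch shows the descriptor-preservation pattern that the
qemu_clr_cloexec and env-var helper patches build on: clear close-on-exec
on the fd, publish its number in the environment, exec the new binary
directly, and re-adopt the fd on the other side.  The CPR_MEMFD variable
name and the self-exec are illustrative assumptions, not the series'
actual code.]

/*
 * reexec.c: stand-alone sketch of preserving an open descriptor across
 * a direct exec, same pid, no fork.  CPR_MEMFD and the self-exec are
 * illustrative only.  Build: cc -o reexec reexec.c && ./reexec
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *val = getenv("CPR_MEMFD");

    (void)argc;

    if (val) {
        /* New binary, same pid: re-adopt the inherited descriptor. */
        char buf[16] = { 0 };
        int fd = atoi(val);

        pread(fd, buf, sizeof(buf) - 1, 0);
        printf("pid %d re-adopted fd %d: %s\n", getpid(), fd, buf);
        return 0;
    }

    /* Old binary: create state that must survive the live update. */
    int fd = memfd_create("guest-ram", 0);
    if (fd < 0) {
        perror("memfd_create");
        return 1;
    }
    write(fd, "preserved", 9);

    /* Ensure the fd survives exec; this is what a qemu_clr_cloexec()
     * style helper does (a no-op here since MFD_CLOEXEC was not set). */
    fcntl(fd, F_SETFD, 0);

    char num[16];
    snprintf(num, sizeof(num), "%d", fd);
    setenv("CPR_MEMFD", num, 1);

    printf("pid %d execing new binary\n", getpid());
    execv("/proc/self/exe", argv);    /* direct exec: no fork, same pid */
    perror("execv");
    return 1;
}

[Running it prints two lines from the same pid, one before the exec and
one after, demonstrating that the descriptor and its contents survive.]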
On 6/3/2021 3:36 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>>>> On the 'restart' branch of questions; can you explain,
>>>>> other than the passing of the fd's, why the outgoing side of
>>>>> qemu's 'migrate exec:' doesn't work for you?
>>>>
>>>> I'm not sure what I should describe.  Can you be more specific?
>>>> Do you mean: can we add the cpr specific bits to the migrate exec code?
>>>
>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
>>> It has an advantage of letting you specify the new command line; that
>>> avoids the problems I'd pointed out with needing to change the command
>>> line if a hotplug had happened.  It also means we only need one chunk
>>> of exec code.
>>
>> [...]
>
> I'm not quite sure what I want in the incoming there, but that is
> already the source execing the destination qemu - although I think we'd
> probably need to check if that's actually via an intermediary.
>
>> We could shoehorn cpr restart into the migrate exec path by defining a
>> new migration capability that the client would set before issuing the
>> migrate command.  However, the implementation code would be sprinkled
>> with conditionals to suppress migrate-only bits and call cpr-only bits.
>> IMO that would be less maintainable than having a separate cprsave
>> function.  Currently cprsave does not duplicate any migration
>> functionality.  cprsave calls qemu_save_device_state() which is used by
>> xen.
>
> To me it feels like cprsave in particular is replicating more code.

In the attached file I annotated lines of code that have some overlap
with migration code actions.  They include vm control, global_state_store,
and vmstate save, and cover 18 lines of 78 total.  I did not include the
body of qf_file_open because it is also called by xen.

The migration code adds capabilities, parameters, state, status, info,
precopy, postcopy, dirty bitmap, events, notifiers, 6 channel types,
blockers, pause+resume, return path, request-reply commands, throttling,
colo, blocks, phases, iteration, and threading, implemented by 20000+
lines of code.  To me it seems wrong to throw cpr into that mix to avoid
adding tens of lines of similar code.

- Steve

void cprsave(const char *file, CprMode mode, Error **errp)
{
    int ret = 0;
**  QEMUFile *f;
**  int saved_vm_running = runstate_is_running();
    bool restart = (mode == CPR_MODE_RESTART);
    bool reboot = (mode == CPR_MODE_REBOOT);

    if (reboot && qemu_ram_volatile(errp)) {
        return;
    }

    if (restart && xen_enabled()) {
        error_setg(errp, "xen does not support cprsave restart");
        return;
    }

    if (migrate_colo_enabled()) {
        error_setg(errp, "error: cprsave does not support x-colo");
        return;
    }

    if (replay_mode != REPLAY_MODE_NONE) {
        error_setg(errp, "error: cprsave does not support replay");
        return;
    }

    f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600,
                     "cprsave", errp);
    if (!f) {
        return;
    }

**  ret = global_state_store();
**  if (ret) {
**      error_setg(errp, "Error saving global state");
**      qemu_fclose(f);
**      return;
**  }

    if (runstate_check(RUN_STATE_SUSPENDED)) {
        /* Update timers_state before saving.  Suspend did not do so. */
        cpu_disable_ticks();
    }

**  vm_stop(RUN_STATE_SAVE_VM);
    cpr_is_active = true;

**  ret = qemu_save_device_state(f);
**  qemu_fclose(f);
**  if (ret < 0) {
**      error_setg(errp, "Error %d while saving VM state", ret);
**      goto err;
**  }

    if (reboot) {
        shutdown_action = SHUTDOWN_ACTION_POWEROFF;
        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
    } else if (restart) {
        if (!qemu_chr_cpr_capable(errp)) {
            goto err;
        }
        if (vfio_cprsave(errp)) {
            goto err;
        }
        walkenv(FD_PREFIX, preserve_fd, 0);
        vhost_dev_reset_all();
        qemu_term_exit();
        setenv("QEMU_START_FREEZE", "", 1);
        qemu_system_exec_request();
    }
    goto done;

err:
**  if (saved_vm_running) {
**      vm_start();
**  }
done:
    cpr_is_active = false;
    return;
}
On 6/7/2021 12:40 PM, Steven Sistare wrote:
> On 6/3/2021 4:44 PM, Daniel P. Berrangé wrote:
>> On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
>>>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>>>>>> On the 'restart' branch of questions; can you explain,
>>>>>>> other than the passing of the fd's, why the outgoing side of
>>>>>>> qemu's 'migrate exec:' doesn't work for you?
>>>>>>
>>>>>> I'm not sure what I should describe.  Can you be more specific?
>>>>>> Do you mean: can we add the cpr specific bits to the migrate exec
>>>>>> code?
>>>>>
>>>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
>>>>> It has an advantage of letting you specify the new command line; that
>>>>> avoids the problems I'd pointed out with needing to change the
>>>>> command line if a hotplug had happened.  It also means we only need
>>>>> one chunk of exec code.
>>>>
>>>> How/where would you specify a new command line?  Are you picturing the
>>>> usual migration setup where you start a second qemu with its own
>>>> arguments, plus a migrate_incoming option or command?  That does not
>>>> work for live update restart; the old qemu must exec the new qemu.  Or
>>>> something else?
>>>
>>> The existing migration path allows an exec - originally intended to
>>> exec something like a compressor or a store to file rather than a real
>>> migration; i.e. you can do:
>>>
>>>   migrate "exec:gzip > mig"
>>>
>>> and that will save the migration stream to a compressed file called
>>> mig.  Now, I *think* we can already do:
>>>
>>>   migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
>>>
>>> (That's probably cleaner via the QMP interface).
>>>
>>> I'm not quite sure what I want in the incoming there, but that is
>>> already the source execing the destination qemu - although I think we'd
>>> probably need to check if that's actually via an intermediary.
>>
>> I don't think you can directly exec qemu in that way, because the
>> source QEMU migration code is going to wait for completion of the
>> QEMU you exec'd and that'll never come on success. So you'll end
>> up with both QEMU's running forever. If you pass the daemonize
>> option to the new QEMU then it will immediately detach itself,
>> and the source QEMU will think the migration command has finished
>> or failed.
>>
>> I think you can probably do it if you use a wrapper script though.
>> The wrapper would have to fork QEMU in the backend, and then the
>> wrapper would have to monitor the new QEMU to see when the incoming
>> migration has finished/aborted, at which point the wrapper can
>> exit, so the source QEMU sees a successful cleanup of the exec'd
>> command. </hand waving>
>
> cpr restart does not work for any scheme that involves the old qemu
> process co-existing with the new qemu process.  To preserve descriptors
> and anonymous memory, cpr restart requires that old qemu directly execs
> new qemu.  Not fork-exec.  Same pid.
>
> So responding to Dave's comment, "keep the one exec mechanism", that is
> not possible.  We still need the qemu_exec_requested mechanism to cause
> a direct exec after state is saved.
>
>>>> We could shoehorn cpr restart into the migrate exec path by defining
>>>> a new migration capability that the client would set before issuing
>>>> the migrate command.  However, the implementation code would be
>>>> sprinkled with conditionals to suppress migrate-only bits and call
>>>> cpr-only bits.  IMO that would be less maintainable than having a
>>>> separate cprsave function.  Currently cprsave does not duplicate any
>>>> migration functionality.  cprsave calls qemu_save_device_state()
>>>> which is used by xen.
>>>
>>> To me it feels like cprsave in particular is replicating more code.
>>>
>>> It's also jumping through hoops in places to avoid changing the
>>> commandline; that's going to cause more pain for a lot of people - not
>>> just because it's hacks all over for that, but because a lot of people
>>> are actually going to need to change the commandline even in a cpr
>>> like case (e.g. due to hotplug or changing something else about the
>>> environment, like auth data or route to storage or networking that
>>> changed).
>>
>> Management apps that already support migration, will almost certainly
>> know how to start up a new QEMU with a different command line that
>> takes account of hotplugged/unplugged devices. IOW avoiding changing
>> the command line only really addresses the simple case, and the hard
>> case is likely already solved for purposes of handling regular live
>> migration.
>
> Agreed, with the caveat that for cpr, the management app must
> communicate the new arguments to the qemu-exec trampoline, rather than
> passing the args on the command line to a new qemu process.
>
>>> There are hooks for early parameter parsing, so if we need to add
>>> extra commandline args we can; but for example the case of
>>> QEMU_START_FREEZE to add -S just isn't needed as soon as you let go of
>>> the idea of needing an identical commandline.
>
> I'll delete QEMU_START_FREEZE.
>
> I still need to preserve argv_main and pass it to the qemu-exec
> trampoline, though, as the args contain identifying information that the
> management app needs to modify the arguments based on the instance's hot
> plug history.
>
> Or, here is another possibility.  We could redefine cprsave to leave the
> VM in a stopped state, and add a cprstart command to be called
> subsequently that performs the exec.  It takes a single string argument:
> a command plus arguments to exec.  The command may be qemu or a
> trampoline like qemu-exec.  I like that the trampoline name is no longer
> hardcoded.  The management app can derive new qemu args for the instance
> as it would with migration, and pass them to the command, instead of
> passing them to qemu-exec via some side channel.  cprload finishes the
> job and does not change.  I already like this scheme better.

Or, pass argv as an additional parameter to cprsave.

Daniel, David, do you like passing argv to cprsave or a new cprstart
command better than the current scheme?  I am ready to send V4 of the
series after we resolve this and the question of whether or not to fold
cpr into the migration command.

- Steve
On 6/7/2021 2:08 PM, Steven Sistare wrote:
> On 6/3/2021 3:36 PM, Dr. David Alan Gilbert wrote:
>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
>>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>>>>> On the 'restart' branch of questions; can you explain,
>>>>>> other than the passing of the fd's, why the outgoing side of
>>>>>> qemu's 'migrate exec:' doesn't work for you?
>>>>>
>>>>> I'm not sure what I should describe.  Can you be more specific?
>>>>> Do you mean: can we add the cpr specific bits to the migrate exec
>>>>> code?
>>>>
>>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
>>>> It has an advantage of letting you specify the new command line; that
>>>> avoids the problems I'd pointed out with needing to change the command
>>>> line if a hotplug had happened.  It also means we only need one chunk
>>>> of exec code.
>>>
>>> [...]
>>
>> I'm not quite sure what I want in the incoming there, but that is
>> already the source execing the destination qemu - although I think we'd
>> probably need to check if that's actually via an intermediary.
>>
>>> We could shoehorn cpr restart into the migrate exec path by defining a
>>> new migration capability that the client would set before issuing the
>>> migrate command.  However, the implementation code would be sprinkled
>>> with conditionals to suppress migrate-only bits and call cpr-only bits.
>>> IMO that would be less maintainable than having a separate cprsave
>>> function.  Currently cprsave does not duplicate any migration
>>> functionality.  cprsave calls qemu_save_device_state() which is used by
>>> xen.
>>
>> To me it feels like cprsave in particular is replicating more code.
>
> In the attached file I annotated lines of code that have some overlap
> with migration code actions.  They include vm control,
> global_state_store, and vmstate save, and cover 18 lines of 78 total.
> I did not include the body of qf_file_open because it is also called by
> xen.
>
> The migration code adds capabilities, parameters, state, status, info,
> precopy, postcopy, dirty bitmap, events, notifiers, 6 channel types,
> blockers, pause+resume, return path, request-reply commands, throttling,
> colo, blocks, phases, iteration, and threading, implemented by 20000+
> lines of code.  To me it seems wrong to throw cpr into that mix to avoid
> adding tens of lines of similar code.

Hi David, what is your decision?  Will you accept separate cpr commands?

One last point is that Xen made a similar choice, adding the
xen-save-devices-state command which calls qemu_save_device_state instead
of migration_thread.

- Steve
On Mon, Jun 14, 2021 at 10:31:32AM -0400, Steven Sistare wrote:
> On 6/7/2021 12:40 PM, Steven Sistare wrote:
> > On 6/3/2021 4:44 PM, Daniel P. Berrangé wrote:
> >> On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
> >>> * Steven Sistare (steven.sistare@oracle.com) wrote:
> >>>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> >>>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
> >>>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> >>>>>>> On the 'restart' branch of questions; can you explain,
> >>>>>>> other than the passing of the fd's, why the outgoing side of
> >>>>>>> qemu's 'migrate exec:' doesn't work for you?
> >>>>>>
> >>>>>> I'm not sure what I should describe.  Can you be more specific?
> >>>>>> Do you mean: can we add the cpr specific bits to the migrate exec
> >>>>>> code?
> >>>>>
> >>>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
> >>>>> It has an advantage of letting you specify the new command line;
> >>>>> that avoids the problems I'd pointed out with needing to change the
> >>>>> command line if a hotplug had happened.  It also means we only need
> >>>>> one chunk of exec code.
> >>>>
> >>>> How/where would you specify a new command line?  Are you picturing
> >>>> the usual migration setup where you start a second qemu with its own
> >>>> arguments, plus a migrate_incoming option or command?  That does not
> >>>> work for live update restart; the old qemu must exec the new qemu.
> >>>> Or something else?
> >>>
> >>> The existing migration path allows an exec - originally intended to
> >>> exec something like a compressor or a store to file rather than a
> >>> real migration; i.e. you can do:
> >>>
> >>>   migrate "exec:gzip > mig"
> >>>
> >>> and that will save the migration stream to a compressed file called
> >>> mig.  Now, I *think* we can already do:
> >>>
> >>>   migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
> >>>
> >>> (That's probably cleaner via the QMP interface).
> >>>
> >>> I'm not quite sure what I want in the incoming there, but that is
> >>> already the source execing the destination qemu - although I think
> >>> we'd probably need to check if that's actually via an intermediary.
> >>
> >> I don't think you can directly exec qemu in that way, because the
> >> source QEMU migration code is going to wait for completion of the
> >> QEMU you exec'd and that'll never come on success. So you'll end
> >> up with both QEMU's running forever. If you pass the daemonize
> >> option to the new QEMU then it will immediately detach itself,
> >> and the source QEMU will think the migration command has finished
> >> or failed.
> >>
> >> I think you can probably do it if you use a wrapper script though.
> >> The wrapper would have to fork QEMU in the backend, and then the
> >> wrapper would have to monitor the new QEMU to see when the incoming
> >> migration has finished/aborted, at which point the wrapper can
> >> exit, so the source QEMU sees a successful cleanup of the exec'd
> >> command. </hand waving>
> >
> > cpr restart does not work for any scheme that involves the old qemu
> > process co-existing with the new qemu process.  To preserve descriptors
> > and anonymous memory, cpr restart requires that old qemu directly execs
> > new qemu.  Not fork-exec.  Same pid.
> >
> > So responding to Dave's comment, "keep the one exec mechanism", that is
> > not possible.  We still need the qemu_exec_requested mechanism to cause
> > a direct exec after state is saved.
> >
> >>>> We could shoehorn cpr restart into the migrate exec path by defining
> >>>> a new migration capability that the client would set before issuing
> >>>> the migrate command.  However, the implementation code would be
> >>>> sprinkled with conditionals to suppress migrate-only bits and call
> >>>> cpr-only bits.  IMO that would be less maintainable than having a
> >>>> separate cprsave function.  Currently cprsave does not duplicate any
> >>>> migration functionality.  cprsave calls qemu_save_device_state()
> >>>> which is used by xen.
> >>>
> >>> To me it feels like cprsave in particular is replicating more code.
> >>>
> >>> It's also jumping through hoops in places to avoid changing the
> >>> commandline; that's going to cause more pain for a lot of people -
> >>> not just because it's hacks all over for that, but because a lot of
> >>> people are actually going to need to change the commandline even in a
> >>> cpr like case (e.g. due to hotplug or changing something else about
> >>> the environment, like auth data or route to storage or networking
> >>> that changed).
> >>
> >> Management apps that already support migration, will almost certainly
> >> know how to start up a new QEMU with a different command line that
> >> takes account of hotplugged/unplugged devices. IOW avoiding changing
> >> the command line only really addresses the simple case, and the hard
> >> case is likely already solved for purposes of handling regular live
> >> migration.
> >
> > Agreed, with the caveat that for cpr, the management app must
> > communicate the new arguments to the qemu-exec trampoline, rather than
> > passing the args on the command line to a new qemu process.
> >
> >>> There are hooks for early parameter parsing, so if we need to add
> >>> extra commandline args we can; but for example the case of
> >>> QEMU_START_FREEZE to add -S just isn't needed as soon as you let go
> >>> of the idea of needing an identical commandline.
> >
> > I'll delete QEMU_START_FREEZE.
> >
> > I still need to preserve argv_main and pass it to the qemu-exec
> > trampoline, though, as the args contain identifying information that
> > the management app needs to modify the arguments based on the
> > instance's hot plug history.
> >
> > Or, here is another possibility.  We could redefine cprsave to leave
> > the VM in a stopped state, and add a cprstart command to be called
> > subsequently that performs the exec.  It takes a single string
> > argument: a command plus arguments to exec.  The command may be qemu or
> > a trampoline like qemu-exec.  I like that the trampoline name is no
> > longer hardcoded.  The management app can derive new qemu args for the
> > instance as it would with migration, and pass them to the command,
> > instead of passing them to qemu-exec via some side channel.  cprload
> > finishes the job and does not change.  I already like this scheme
> > better.
>
> Or, pass argv as an additional parameter to cprsave.
>
> Daniel, David, do you like passing argv to cprsave or a new cprstart
> command better than the current scheme?  I am ready to send V4 of the
> series after we resolve this and the question of whether or not to fold
> cpr into the migration command.

I don't really have a strong opinion on this either way.

Regards,
Daniel
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 6/3/2021 4:44 PM, Daniel P. Berrangé wrote:
> > On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
> >> * Steven Sistare (steven.sistare@oracle.com) wrote:
> >>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> >>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
> >>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> >>>>>> On the 'restart' branch of questions; can you explain,
> >>>>>> other than the passing of the fd's, why the outgoing side of
> >>>>>> qemu's 'migrate exec:' doesn't work for you?
> >>>>>
> >>>>> I'm not sure what I should describe.  Can you be more specific?
> >>>>> Do you mean: can we add the cpr specific bits to the migrate exec
> >>>>> code?
> >>>>
> >>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
> >>>> It has an advantage of letting you specify the new command line;
> >>>> that avoids the problems I'd pointed out with needing to change the
> >>>> command line if a hotplug had happened.  It also means we only need
> >>>> one chunk of exec code.
> >>>
> >>> How/where would you specify a new command line?  Are you picturing
> >>> the usual migration setup where you start a second qemu with its own
> >>> arguments, plus a migrate_incoming option or command?  That does not
> >>> work for live update restart; the old qemu must exec the new qemu.
> >>> Or something else?
> >>
> >> The existing migration path allows an exec - originally intended to
> >> exec something like a compressor or a store to file rather than a real
> >> migration; i.e. you can do:
> >>
> >>   migrate "exec:gzip > mig"
> >>
> >> and that will save the migration stream to a compressed file called
> >> mig.  Now, I *think* we can already do:
> >>
> >>   migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
> >>
> >> (That's probably cleaner via the QMP interface).
> >>
> >> I'm not quite sure what I want in the incoming there, but that is
> >> already the source execing the destination qemu - although I think
> >> we'd probably need to check if that's actually via an intermediary.
> >
> > I don't think you can directly exec qemu in that way, because the
> > source QEMU migration code is going to wait for completion of the
> > QEMU you exec'd and that'll never come on success. So you'll end
> > up with both QEMU's running forever. If you pass the daemonize
> > option to the new QEMU then it will immediately detach itself,
> > and the source QEMU will think the migration command has finished
> > or failed.
> >
> > I think you can probably do it if you use a wrapper script though.
> > The wrapper would have to fork QEMU in the backend, and then the
> > wrapper would have to monitor the new QEMU to see when the incoming
> > migration has finished/aborted, at which point the wrapper can
> > exit, so the source QEMU sees a successful cleanup of the exec'd
> > command. </hand waving>
>
> cpr restart does not work for any scheme that involves the old qemu
> process co-existing with the new qemu process.  To preserve descriptors
> and anonymous memory, cpr restart requires that old qemu directly execs
> new qemu.  Not fork-exec.  Same pid.
>
> So responding to Dave's comment, "keep the one exec mechanism", that is
> not possible.  We still need the qemu_exec_requested mechanism to cause
> a direct exec after state is saved.

OK, note if you can find any way to make kernel changes to avoid this
kexec, life is going to get *much* better; starting a separate qemu at
the management layer would be so much easier.

> >>> We could shoehorn cpr restart into the migrate exec path by defining
> >>> a new migration capability that the client would set before issuing
> >>> the migrate command.  However, the implementation code would be
> >>> sprinkled with conditionals to suppress migrate-only bits and call
> >>> cpr-only bits.  IMO that would be less maintainable than having a
> >>> separate cprsave function.  Currently cprsave does not duplicate any
> >>> migration functionality.  cprsave calls qemu_save_device_state()
> >>> which is used by xen.
> >>
> >> To me it feels like cprsave in particular is replicating more code.
> >>
> >> It's also jumping through hoops in places to avoid changing the
> >> commandline; that's going to cause more pain for a lot of people - not
> >> just because it's hacks all over for that, but because a lot of people
> >> are actually going to need to change the commandline even in a cpr
> >> like case (e.g. due to hotplug or changing something else about the
> >> environment, like auth data or route to storage or networking that
> >> changed).
> >
> > Management apps that already support migration, will almost certainly
> > know how to start up a new QEMU with a different command line that
> > takes account of hotplugged/unplugged devices. IOW avoiding changing
> > the command line only really addresses the simple case, and the hard
> > case is likely already solved for purposes of handling regular live
> > migration.
>
> Agreed, with the caveat that for cpr, the management app must
> communicate the new arguments to the qemu-exec trampoline, rather than
> passing the args on the command line to a new qemu process.
>
> >> There are hooks for early parameter parsing, so if we need to add
> >> extra commandline args we can; but for example the case of
> >> QEMU_START_FREEZE to add -S just isn't needed as soon as you let go of
> >> the idea of needing an identical commandline.
>
> I'll delete QEMU_START_FREEZE.
>
> I still need to preserve argv_main and pass it to the qemu-exec
> trampoline, though, as the args contain identifying information that the
> management app needs to modify the arguments based on the instance's hot
> plug history.
>
> Or, here is another possibility.  We could redefine cprsave to leave the
> VM in a stopped state, and add a cprstart command to be called
> subsequently that performs the exec.  It takes a single string argument:
> a command plus arguments to exec.  The command may be qemu or a
> trampoline like qemu-exec.  I like that the trampoline name is no longer
> hardcoded.  The management app can derive new qemu args for the instance
> as it would with migration, and pass them to the command, instead of
> passing them to qemu-exec via some side channel.  cprload finishes the
> job and does not change.  I already like this scheme better.

Right, that's sounding better; now the other benefit you get is you
don't need to play with environment variables; you can define a command
line option that takes all the extra data it needs.

Dave

> - Steve
>
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
> > Hi Steven,
> >   I'd like to split the discussion into reboot and restart,
> > so I can make sure I understand them individually.
> >
> > So reboot mode;
> > Can you explain which parts of this series are needed for reboot mode;
> > I've managed to do a kexec based reboot on qemu with the current qemu -
> > albeit with no vfio devices, my current understanding is that for doing
> > reboot with vfio we just need some way of getting migrate to send the
> > metadata associated with vfio devices if the guest is in S3.
> >
> > Is there something I'm missing and which you have in this series?
>
> You are correct, this series has little special code for reboot mode,
> but it does allow reboot and restart to be handled similarly, which
> simplifies the management layer because the same calls are performed for
> each mode.
>
> For vfio in reboot mode, prior to sending cprload, the manager sends the
> guest-suspend-ram command to the qemu guest agent.  This flushes
> requests and brings the guest device to a reset state, so there is no
> vfio metadata to save.  Reboot mode does not call vfio_cprsave.
>
> There are a few unique patches to support reboot mode.  One is
> qemu_ram_volatile, which is a sanity check that the writable ram blocks
> are backed by some form of shared memory.  Plus there are a few
> fragments in the "cpr" patch that handle the suspended state that is
> induced by guest-suspend-ram.  See qemu_system_start_on_wake_request()
> and instances of RUN_STATE_SUSPENDED in migration/cpr.c

Could you split the 'reboot' part off separately, then we can review
that and perhaps get it in first?  It should be a relatively small patch
set - it'll get things moving in the right direction.

The guest-suspend-ram stuff seems reasonable as an idea; let's just try
and avoid doing it all via environment variables though; make it proper
command line options.

Dave

> - Steve
>
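[For reference, a sketch of what a qemu_ram_volatile()-style check could
look like.  It leans on RAM-block helpers that exist in the QEMU tree of
this era (qemu_ram_foreach_block, qemu_ram_is_shared, ramblock_is_pmem,
memory_region_is_rom), but it is QEMU-internal pseudocode of the idea,
not the actual patch, whose names and details may differ.]

/*
 * Sketch of a qemu_ram_volatile()-style sanity check (QEMU-internal,
 * not stand-alone): fail reboot-mode cprsave unless every writable RAM
 * block is backed by shared or pmem memory, since private anonymous RAM
 * cannot survive a kexec of the host.
 */
#include "qemu/osdep.h"
#include "exec/cpu-common.h"
#include "exec/memory.h"
#include "exec/ramblock.h"
#include "qapi/error.h"

static int volatile_ram_block(RAMBlock *rb, void *opaque)
{
    Error **errp = opaque;

    if (qemu_ram_is_shared(rb) || ramblock_is_pmem(rb) ||
        memory_region_is_rom(rb->mr)) {
        return 0;                   /* contents survive the reboot */
    }
    error_setg(errp, "RAM block %s is volatile", qemu_ram_get_idstr(rb));
    return 1;                       /* stop the walk, fail cprsave */
}

bool qemu_ram_volatile(Error **errp)
{
    return qemu_ram_foreach_block(volatile_ram_block, errp) != 0;
}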
On 6/15/2021 3:14 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
>>> Hi Steven,
>>>   I'd like to split the discussion into reboot and restart,
>>> so I can make sure I understand them individually.
>>>
>>> So reboot mode;
>>> Can you explain which parts of this series are needed for reboot mode;
>>> I've managed to do a kexec based reboot on qemu with the current qemu -
>>> albeit with no vfio devices, my current understanding is that for doing
>>> reboot with vfio we just need some way of getting migrate to send the
>>> metadata associated with vfio devices if the guest is in S3.
>>>
>>> Is there something I'm missing and which you have in this series?
>>
>> You are correct, this series has little special code for reboot mode,
>> but it does allow reboot and restart to be handled similarly, which
>> simplifies the management layer because the same calls are performed
>> for each mode.
>>
>> For vfio in reboot mode, prior to sending cprload, the manager sends
>> the guest-suspend-ram command to the qemu guest agent.  This flushes
>> requests and brings the guest device to a reset state, so there is no
>> vfio metadata to save.  Reboot mode does not call vfio_cprsave.
>>
>> There are a few unique patches to support reboot mode.  One is
>> qemu_ram_volatile, which is a sanity check that the writable ram blocks
>> are backed by some form of shared memory.  Plus there are a few
>> fragments in the "cpr" patch that handle the suspended state that is
>> induced by guest-suspend-ram.  See qemu_system_start_on_wake_request()
>> and instances of RUN_STATE_SUSPENDED in migration/cpr.c
>
> Could you split the 'reboot' part off separately, then we can review
> that and perhaps get it in first?  It should be a relatively small patch
> set - it'll get things moving in the right direction.
>
> The guest-suspend-ram stuff seems reasonable as an idea; let's just try
> and avoid doing it all via environment variables though; make it proper
> command line options.

How about I delete reboot mode and the mode argument instead.  Having two
modes is causing no end of confusion, and my primary business need is for
restart mode.

- Steve
On 6/24/2021 11:05 AM, Steven Sistare wrote:
> On 6/15/2021 3:14 PM, Dr. David Alan Gilbert wrote:
>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>> On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
>>>> Hi Steven,
>>>>   I'd like to split the discussion into reboot and restart,
>>>> so I can make sure I understand them individually.
>>>>
>>>> So reboot mode;
>>>> Can you explain which parts of this series are needed for reboot mode;
>>>> I've managed to do a kexec based reboot on qemu with the current qemu -
>>>> albeit with no vfio devices, my current understanding is that for
>>>> doing reboot with vfio we just need some way of getting migrate to
>>>> send the metadata associated with vfio devices if the guest is in S3.
>>>>
>>>> Is there something I'm missing and which you have in this series?
>>>
>>> You are correct, this series has little special code for reboot mode,
>>> but it does allow reboot and restart to be handled similarly, which
>>> simplifies the management layer because the same calls are performed
>>> for each mode.
>>>
>>> For vfio in reboot mode, prior to sending cprload, the manager sends
>>> the guest-suspend-ram command to the qemu guest agent.  This flushes
>>> requests and brings the guest device to a reset state, so there is no
>>> vfio metadata to save.  Reboot mode does not call vfio_cprsave.
>>>
>>> There are a few unique patches to support reboot mode.  One is
>>> qemu_ram_volatile, which is a sanity check that the writable ram
>>> blocks are backed by some form of shared memory.  Plus there are a few
>>> fragments in the "cpr" patch that handle the suspended state that is
>>> induced by guest-suspend-ram.  See qemu_system_start_on_wake_request()
>>> and instances of RUN_STATE_SUSPENDED in migration/cpr.c
>>
>> Could you split the 'reboot' part off separately, then we can review
>> that and perhaps get it in first?  It should be a relatively small
>> patch set - it'll get things moving in the right direction.
>>
>> The guest-suspend-ram stuff seems reasonable as an idea; let's just try
>> and avoid doing it all via environment variables though; make it proper
>> command line options.
>
> How about I delete reboot mode and the mode argument instead.  Having
> two modes is causing no end of confusion, and my primary business need
> is for restart mode.

I just posted V4 of the patch series, refactoring reboot mode into the
first 4 patches.

- Steve