[V3,00/22] Live Update

Message ID 1620390320-301716-1-git-send-email-steven.sistare@oracle.com (mailing list archive)

Message

Steven Sistare May 7, 2021, 12:24 p.m. UTC
Provide the cprsave and cprload commands for live update.  These save and
restore VM state, with minimal guest pause time, so that qemu may be updated
to a new version in between.

cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
/usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
paused state and waits for the cprload command.

To use the restart mode, qemu must be started with the memfd-alloc option,
which allocates guest ram using memfd_create.  The memfd file descriptors
are saved in environment variables and kept open across the exec, after
which they are retrieved from the environment and re-mmap'd.  Hence guest
ram is preserved in place,
albeit with new virtual addresses in the qemu process.  The caller resumes
the guest by calling cprload, which loads state from the file.  If the VM
was running at cprsave time, then VM execution resumes.  cprsave supports
any type of guest image and block device, but the caller must not modify
guest block devices between cprsave and cprload.
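
For illustration, here is a minimal standalone sketch of the fd-preservation
mechanism described above.  This is not the series' actual code; the
QEMU_RAM_FD environment variable name is invented for the example.

    /* Sketch: keep a memfd-backed ram block alive across exec. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define RAM_ENV "QEMU_RAM_FD"    /* invented env var name */

    static int save_side(size_t len, char **argv)
    {
        int fd = memfd_create("pc.ram", MFD_CLOEXEC);
        char buf[16];

        ftruncate(fd, len);
        /* ... guest runs with this fd mmap'd as ram ... */

        /* Before exec: clear close-on-exec and publish the fd number. */
        fcntl(fd, F_SETFD, fcntl(fd, F_GETFD) & ~FD_CLOEXEC);
        snprintf(buf, sizeof(buf), "%d", fd);
        setenv(RAM_ENV, buf, 1);
        execv(argv[0], argv);        /* new binary, same argv */
        return 1;                    /* only reached if exec fails */
    }

    static void *load_side(size_t len)
    {
        /* After exec: recover the fd number and map the same pages. */
        int fd = atoi(getenv(RAM_ENV));
        return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }

The guest pages survive because the memfd keeps the same inode; only the
mapping's virtual address changes in the new process, which is why the vfio
vaddr-update ioctls described next are needed.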

The restart mode supports vfio devices by preserving the vfio container,
group, device, and event descriptors across the qemu re-exec, and by
updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
and integrated in Linux kernel 5.12.
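
A rough sketch of how those flags are used around the exec, assuming the
uapi as merged in 5.12 and omitting error handling; this is not the series'
actual code:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Before exec: invalidate the old vaddr; iova stays pinned and mapped. */
    static void suspend_vaddr(int container, uint64_t iova, uint64_t size)
    {
        struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .flags = VFIO_DMA_UNMAP_FLAG_VADDR,
            .iova  = iova,
            .size  = size,
        };
        ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }

    /* After exec: register the new vaddr for the same iova range. */
    static void resume_vaddr(int container, uint64_t iova, uint64_t size,
                             void *new_vaddr)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_VADDR,
            .vaddr = (uintptr_t)new_vaddr,
            .iova  = iova,
            .size  = size,
        };
        ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }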

For the reboot mode, cprsave saves state and exits qemu, and the caller is
allowed to update the host kernel and system software and reboot.  The
caller resumes the guest by running qemu with the same arguments as the
original process and calling cprload.  To use this mode, guest ram must be
mapped to a persistent shared memory file, such as /dev/dax0.0, or to /dev/shm
preserved across reboot via PKRAM as proposed in
https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.

The reboot mode supports vfio devices if the caller suspends the guest
instead of stopping the VM, such as by issuing guest-suspend-ram to the
qemu guest agent.  The guest drivers' suspend methods flush outstanding
requests and re-initialize the devices, and thus there is no device state
to save and restore.
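
For example, with the guest agent listening on a unix socket on the host,
the caller might suspend the guest like this before cprsave (the socket
path is illustrative, and socat is assumed to be available):

    # echo '{"execute": "guest-suspend-ram"}' | socat - UNIX-CONNECT:/var/run/qga.sock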

The first patches add helper functions:

  - as_flat_walk
  - qemu_ram_volatile
  - oslib: qemu_clr_cloexec
  - util: env var helpers
  - machine: memfd-alloc option
  - vl: add helper to request re-exec

The next patches implement cprsave and cprload:

  - cpr
  - cpr: QMP interfaces
  - cpr: HMP interfaces

The next patches add vfio support for the restart mode:

  - pci: export functions for cpr
  - vfio-pci: refactor for cpr
  - vfio-pci: cpr part 1
  - vfio-pci: cpr part 2

The next patches preserve various descriptor-based backend devices across
a cprsave restart:

  - vhost: reset vhost devices upon cprsave
  - hostmem-memfd: cpr support
  - chardev: cpr framework
  - chardev: cpr for simple devices
  - chardev: cpr for pty
  - chardev: cpr for sockets
  - cpr: only-cpr-capable option
  - cpr: maintainers
  - simplify savevm

Here is an example of updating qemu from v4.2.0 to v4.2.1 using
"cprsave restart".  The software update is performed while the guest is
running to minimize downtime.

window 1				| window 2
					|
# qemu-system-x86_64 ... 		|
QEMU 4.2.0 monitor - type 'help' ...	|
(qemu) info status			|
VM status: running			|
					| # yum update qemu
(qemu) cprsave /tmp/qemu.sav restart	|
QEMU 4.2.1 monitor - type 'help' ...	|
(qemu) info status			|
VM status: paused (prelaunch)		|
(qemu) cprload /tmp/qemu.sav		|
(qemu) info status			|
VM status: running			|


Here is an example of updating the host kernel using "cprsave reboot".

window 1					| window 2
						|
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...		|
(qemu) info status				|
VM status: running				|
						| # yum update kernel-uek
(qemu) cprsave /tmp/qemu.sav reboot		|
						|
# systemctl kexec				|
kexec_core: Starting new kernel			|
...						|
						|
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...		|
(qemu) info status				|
VM status: paused (prelaunch)			|
(qemu) cprload /tmp/qemu.sav			|
(qemu) info status				|
VM status: running				|

Changes from V1 to V2:
  - revert vmstate infrastructure changes
  - refactor cpr functions into new files
  - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to 
    preserve memory.
  - add framework to filter chardev's that support cpr
  - save and restore vfio eventfd's
  - modify cprinfo QMP interface
  - incorporate misc review feedback
  - remove unrelated and unneeded patches
  - refactor all patches into a shorter and easier to review series

Changes from V2 to V3:
  - rebase to qemu 6.0.0
  - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
  - change memfd-alloc to a machine option
  - use existing channel socket function instead of defining new ones
  - close monitor socket during cpr
  - support memory-backend-memfd
  - fix a few unreported bugs

Steve Sistare (18):
  as_flat_walk
  qemu_ram_volatile
  oslib: qemu_clr_cloexec
  util: env var helpers
  machine: memfd-alloc option
  vl: add helper to request re-exec
  cpr
  pci: export functions for cpr
  vfio-pci: refactor for cpr
  vfio-pci: cpr part 1
  vfio-pci: cpr part 2
  hostmem-memfd: cpr support
  chardev: cpr framework
  chardev: cpr for simple devices
  chardev: cpr for pty
  cpr: only-cpr-capable option
  cpr: maintainers
  simplify savevm

Mark Kanda, Steve Sistare (4):
  cpr: QMP interfaces
  cpr: HMP interfaces
  vhost: reset vhost devices upon cprsave
  chardev: cpr for sockets

 MAINTAINERS                   |  11 +++
 backends/hostmem-memfd.c      |  21 +++--
 chardev/char-mux.c            |   1 +
 chardev/char-null.c           |   1 +
 chardev/char-pty.c            |  15 ++-
 chardev/char-serial.c         |   1 +
 chardev/char-socket.c         |  35 +++++++
 chardev/char-stdio.c          |   8 ++
 chardev/char.c                |  41 +++++++-
 gdbstub.c                     |   1 +
 hmp-commands.hx               |  44 +++++++++
 hw/core/machine.c             |  19 ++++
 hw/pci/msi.c                  |   4 +
 hw/pci/msix.c                 |  20 ++--
 hw/pci/pci.c                  |   7 +-
 hw/vfio/common.c              |  68 +++++++++++++-
 hw/vfio/cpr.c                 | 131 ++++++++++++++++++++++++++
 hw/vfio/meson.build           |   1 +
 hw/vfio/pci.c                 | 214 ++++++++++++++++++++++++++++++++++++++----
 hw/vfio/trace-events          |   1 +
 hw/virtio/vhost.c             |  11 +++
 include/chardev/char.h        |   6 ++
 include/exec/memory.h         |  25 +++++
 include/hw/boards.h           |   1 +
 include/hw/pci/msix.h         |   5 +
 include/hw/pci/pci.h          |   2 +
 include/hw/vfio/vfio-common.h |   8 ++
 include/hw/virtio/vhost.h     |   1 +
 include/migration/cpr.h       |  17 ++++
 include/monitor/hmp.h         |   3 +
 include/qemu/env.h            |  23 +++++
 include/qemu/osdep.h          |   1 +
 include/sysemu/runstate.h     |   2 +
 include/sysemu/sysemu.h       |   2 +
 linux-headers/linux/vfio.h    |  27 ++++++
 migration/cpr.c               | 200 +++++++++++++++++++++++++++++++++++++++
 migration/meson.build         |   1 +
 migration/migration.c         |   5 +
 migration/savevm.c            |  21 ++---
 migration/savevm.h            |   2 +
 monitor/hmp-cmds.c            |  48 ++++++++++
 monitor/hmp.c                 |   3 +
 monitor/qmp-cmds.c            |  31 ++++++
 monitor/qmp.c                 |   3 +
 qapi/char.json                |   5 +-
 qapi/cpr.json                 |  76 +++++++++++++++
 qapi/meson.build              |   1 +
 qapi/qapi-schema.json         |   1 +
 qemu-options.hx               |  39 +++++++-
 softmmu/globals.c             |   2 +
 softmmu/memory.c              |  48 ++++++++++
 softmmu/physmem.c             |  49 ++++++++--
 softmmu/runstate.c            |  49 +++++++++-
 softmmu/vl.c                  |  21 ++++-
 stubs/cpr.c                   |   3 +
 stubs/meson.build             |   1 +
 trace-events                  |   1 +
 util/env.c                    |  99 +++++++++++++++++++
 util/meson.build              |   1 +
 util/oslib-posix.c            |   9 ++
 util/oslib-win32.c            |   4 +
 util/qemu-config.c            |   4 +
 62 files changed, 1431 insertions(+), 74 deletions(-)
 create mode 100644 hw/vfio/cpr.c
 create mode 100644 include/migration/cpr.h
 create mode 100644 include/qemu/env.h
 create mode 100644 migration/cpr.c
 create mode 100644 qapi/cpr.json
 create mode 100644 stubs/cpr.c
 create mode 100644 util/env.c

Comments

no-reply@patchew.org May 7, 2021, 1 p.m. UTC | #1
Patchew URL: https://patchew.org/QEMU/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 1620390320-301716-1-git-send-email-steven.sistare@oracle.com
Subject: [PATCH V3 00/22] Live Update

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/1620390320-301716-1-git-send-email-steven.sistare@oracle.com -> patchew/1620390320-301716-1-git-send-email-steven.sistare@oracle.com
 - [tag update]      patchew/20210504124140.1100346-1-linux@roeck-us.net -> patchew/20210504124140.1100346-1-linux@roeck-us.net
 - [tag update]      patchew/20210506185641.284821-1-dgilbert@redhat.com -> patchew/20210506185641.284821-1-dgilbert@redhat.com
 - [tag update]      patchew/20210506193341.140141-1-lvivier@redhat.com -> patchew/20210506193341.140141-1-lvivier@redhat.com
 - [tag update]      patchew/20210506194358.3925-1-peter.maydell@linaro.org -> patchew/20210506194358.3925-1-peter.maydell@linaro.org
Switched to a new branch 'test'
8c778e6 simplify savevm
aca4f09 cpr: maintainers
697f8d0 cpr: only-cpr-capable option
0a8c20e chardev: cpr for sockets
cb270f4 chardev: cpr for pty
279230e chardev: cpr for simple devices
b122cfa chardev: cpr framework
6596676 hostmem-memfd: cpr support
8cb6348 vhost: reset vhost devices upon cprsave
e3ae86d vfio-pci: cpr part 2
02c628d vfio-pci: cpr part 1
d93623c vfio-pci: refactor for cpr
bc63b3e pci: export functions for cpr
2b10bdd cpr: HMP interfaces
29bc20a cpr: QMP interfaces
3f84e6c cpr
0a32588 vl: add helper to request re-exec
466b4cf machine: memfd-alloc option
50c3e84 util: env var helpers
76c3550 oslib: qemu_clr_cloexec
d819bd4 qemu_ram_volatile
c466ddf as_flat_walk

=== OUTPUT BEGIN ===
1/22 Checking commit c466ddfd2209 (as_flat_walk)
2/22 Checking commit d819bd4dcc09 (qemu_ram_volatile)
3/22 Checking commit 76c3550a677b (oslib: qemu_clr_cloexec)
4/22 Checking commit 50c3e84cf5a6 (util: env var helpers)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#19: 
new file mode 100644

ERROR: consider using qemu_strtol in preference to strtol
#72: FILE: util/env.c:20:
+        res = strtol(val, 0, 10);

total: 1 errors, 1 warnings, 129 lines checked

Patch 4/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

5/22 Checking commit 466b4cf4ce8c (machine: memfd-alloc option)
6/22 Checking commit 0a32588de76e (vl: add helper to request re-exec)
7/22 Checking commit 3f84e6c38bd6 (cpr)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#55: 
new file mode 100644

total: 0 errors, 1 warnings, 314 lines checked

Patch 7/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
8/22 Checking commit 29bc20ab5870 (cpr: QMP interfaces)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#81: 
new file mode 100644

total: 0 errors, 1 warnings, 136 lines checked

Patch 8/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
9/22 Checking commit 2b10bdd5edb3 (cpr: HMP interfaces)
10/22 Checking commit bc63b3edc621 (pci: export functions for cpr)
11/22 Checking commit d93623c4da4d (vfio-pci: refactor for cpr)
12/22 Checking commit 02c628d50b57 (vfio-pci: cpr part 1)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#271: 
new file mode 100644

total: 0 errors, 1 warnings, 566 lines checked

Patch 12/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
13/22 Checking commit e3ae86d2076c (vfio-pci: cpr part 2)
14/22 Checking commit 8cb6348c8cff (vhost: reset vhost devices upon cprsave)
15/22 Checking commit 65966769fa93 (hostmem-memfd: cpr support)
16/22 Checking commit b122cfa96106 (chardev: cpr framework)
17/22 Checking commit 279230e03a78 (chardev: cpr for simple devices)
18/22 Checking commit cb270f49693f (chardev: cpr for pty)
19/22 Checking commit 0a8c20e0a8d4 (chardev: cpr for sockets)
20/22 Checking commit 697f8d021f43 (cpr: only-cpr-capable option)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#200: 
new file mode 100644

total: 0 errors, 1 warnings, 133 lines checked

Patch 20/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
21/22 Checking commit aca4f092c865 (cpr: maintainers)
22/22 Checking commit 8c778e6f284c (simplify savevm)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Stefan Hajnoczi May 12, 2021, 4:42 p.m. UTC | #2
On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> Provide the cprsave and cprload commands for live update.  These save and
> restore VM state, with minimal guest pause time, so that qemu may be updated
> to a new version in between.
> 
> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> paused state and waits for the cprload command.

I think cprsave/cprload could be generalized by using QMP to stash the
file descriptors. The 'getfd' QMP command already exists and QEMU code
already opens fds passed using this mechanism.
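
(For reference, the client passes the descriptor over the QMP unix socket
with SCM_RIGHTS and then names it; the fdname value below is illustrative:

    -> { "execute": "getfd", "arguments": { "fdname": "ram0" } }
    <- { "return": {} }

after which code inside QEMU can look the fd up by name.)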

I haven't checked but it may be possible to drop some patches by reusing
QEMU's monitor file descriptor passing since the code already knows how
to open from 'getfd' fds.

The reason why using QMP is interesting is because it eliminates the
need for execve(2). QEMU may be unable to execute a program due to
chroot, seccomp, etc.

QMP would enable cprsave/cprload to work both with and without
execve(2).

One tricky thing with this approach might be startup ordering: how to
get fds via the QMP monitor in the new process before processing the
entire command-line.

Stefan
Steven Sistare May 13, 2021, 8:21 p.m. UTC | #3
On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>> Provide the cprsave and cprload commands for live update.  These save and
>> restore VM state, with minimal guest pause time, so that qemu may be updated
>> to a new version in between.
>>
>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>> paused state and waits for the cprload command.
> 
> I think cprsave/cprload could be generalized by using QMP to stash the
> file descriptors. The 'getfd' QMP command already exists and QEMU code
> already opens fds passed using this mechanism.
> 
> I haven't checked but it may be possible to drop some patches by reusing
> QEMU's monitor file descriptor passing since the code already knows how
> to open from 'getfd' fds.
> 
> The reason why using QMP is interesting is because it eliminates the
> need for execve(2). QEMU may be unable to execute a program due to
> chroot, seccomp, etc.
> 
> QMP would enable cprsave/cprload to work both with and without
> execve(2).
> 
> One tricky thing with this approach might be startup ordering: how to
> get fds via the QMP monitor in the new process before processing the
> entire command-line.

Early on I experimented with a similar approach.  Old qemu passed descriptors to an
escrow process and exited; new qemu started and retrieved the descriptors from escrow.
vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
I suspect my recent vfio extensions would smooth the rough edges.

However, the main issue is that guest ram must be backed by named shared memory, and
we would need to add code to support shared memory for all the secondary memory objects.
That makes it less interesting for us at this time; we care about updating legacy qemu 
instances with anonymous guest memory.

Having said all that, this would be an interesting project, just not the one I want to 
push now.  In the future we could add a new cprsave mode to support it in a backward
compatible manner.

- Steve
Steven Sistare May 13, 2021, 8:42 p.m. UTC | #4
I will add to MAINTAINERS incrementally instead of at the end to make checkpatch happy.

I will use qemu_strtol even though I thought the message "consider using qemu_strtol"
was giving me a choice.
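
The mechanical change is small; a sketch of the conversion for the flagged
line in util/env.c, assuming qemu_strtol from qemu/cutils.h:

    #include "qemu/cutils.h"

    long res = 0;

    /* qemu_strtol returns 0 on success, -errno on failure */
    if (qemu_strtol(val, NULL, 10, &res) < 0) {
        res = 0;    /* keep the old fallback value */
    }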

You can't fight The Man when the man is a robot.

- Steve

On 5/7/2021 9:00 AM, no-reply@patchew.org wrote:
> Patchew URL: https://patchew.org/QEMU/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/
> 
> Hi,
> 
> This series seems to have some coding style problems. See output below for
> more information:
> 
> [...]
Stefan Hajnoczi May 14, 2021, 11:53 a.m. UTC | #5
On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> > On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >> Provide the cprsave and cprload commands for live update.  These save and
> >> restore VM state, with minimal guest pause time, so that qemu may be updated
> >> to a new version in between.
> >>
> >> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >> paused state and waits for the cprload command.
> > 
> > I think cprsave/cprload could be generalized by using QMP to stash the
> > file descriptors. The 'getfd' QMP command already exists and QEMU code
> > already opens fds passed using this mechanism.
> > 
> > I haven't checked but it may be possible to drop some patches by reusing
> > QEMU's monitor file descriptor passing since the code already knows how
> > to open from 'getfd' fds.
> > 
> > The reason why using QMP is interesting is because it eliminates the
> > need for execve(2). QEMU may be unable to execute a program due to
> > chroot, seccomp, etc.
> > 
> > QMP would enable cprsave/cprload to work both with and without
> > execve(2).
> > 
> > One tricky thing with this approach might be startup ordering: how to
> > get fds via the QMP monitor in the new process before processing the
> > entire command-line.
> 
> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> I suspect my recent vfio extensions would smooth the rough edges.

I wonder about the reason for VFIO's pid limitation, maybe because it
pins pages from the original process?

Is this VFIO pid limitation the main reason why you chose to make QEMU
execve(2) the new binary?

> However, the main issue is that guest ram must be backed by named shared memory, and
> we would need to add code to support shared memory for all the secondary memory objects.
> That makes it less interesting for us at this time; we care about updating legacy qemu 
> instances with anonymous guest memory.

Thanks for explaining this more in the other sub-thread. The secondary
memory objects you mentioned are relatively small so I don't think
saving them in the traditional way is a problem.

Two approaches for zero-copy memory migration fit into QEMU's existing
migration infrastructure:

- Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
  etc) so they are not saved into the savevm file. The existing --object
  memory-backend-file syntax can be used.

- Extending the live migration protocol to detect when file descriptor
  passing is available (i.e. UNIX domain socket migration) and using
  that for memory-backend-* objects that have fds.

Either of these approaches would handle RAM with existing savevm/migrate
commands.
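
For illustration, backing guest RAM with a named file using existing syntax
might look like this (the id, size, and path are examples):

    qemu-system-x86_64 \
        -object memory-backend-file,id=pc.ram,size=4G,mem-path=/dev/shm/vm0.ram,share=on \
        -machine memory-backend=pc.ram ...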

The remaining issue is how to migrate VFIO and other file descriptors
that cannot be reopened by the new process. As mentioned, QEMU already
has file descriptor passing support in the QMP monitor and support for
opening passed file descriptors (see qemu_open_internal(),
monitor_fd_param(), and socket_get_fd()).

The advantage of integrating live update functionality into the existing
savevm/migrate commands is that it will work in more use cases with
less code duplication/maintenance/bitrot prevention than the
special-case cprsave command in this patch series.

Maybe there is a fundamental technical reason why live update needs to
be different from QEMU's existing migration commands but I haven't
figured it out yet.

Stefan
Steven Sistare May 14, 2021, 3:15 p.m. UTC | #6
On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>> to a new version in between.
>>>>
>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>> paused state and waits for the cprload command.
>>>
>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>> already opens fds passed using this mechanism.
>>>
>>> I haven't checked but it may be possible to drop some patches by reusing
>>> QEMU's monitor file descriptor passing since the code already knows how
>>> to open from 'getfd' fds.
>>>
>>> The reason why using QMP is interesting is because it eliminates the
>>> need for execve(2). QEMU may be unable to execute a program due to
>>> chroot, seccomp, etc.
>>>
>>> QMP would enable cprsave/cprload to work both with and without
>>> execve(2).
>>>
>>> One tricky thing with this approach might be startup ordering: how to
>>> get fds via the QMP monitor in the new process before processing the
>>> entire command-line.
>>
>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>> I suspect my recent vfio extensions would smooth the rough edges.
> 
> I wonder about the reason for VFIO's pid limitation, maybe because it
> pins pages from the original process?

The dma unmap code verifies that the requesting task is the same as the task that mapped
the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
to fix locked memory accounting, which is associated with the mm of the original task.

> Is this VFIO pid limitation the main reason why you chose to make QEMU
> execve(2) the new binary?

That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
I was working against vfio rather than with it.

Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
without exec, I still need the exec option.

>> However, the main issue is that guest ram must be backed by named shared memory, and
>> we would need to add code to support shared memory for all the secondary memory objects.
>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>> instances with anonymous guest memory.
> 
> Thanks for explaining this more in the other sub-thread. The secondary
> memory objects you mentioned are relatively small so I don't think
> saving them in the traditional way is a problem.
> 
> Two approaches for zero-copy memory migration fit into QEMU's existing
> migration infrastructure:
> 
> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>   etc) so they are not saved into the savevm file. The existing --object
>   memory-backend-file syntax can be used.
> 
> - Extending the live migration protocol to detect when file descriptor
>   passing is available (i.e. UNIX domain socket migration) and using
>   that for memory-backend-* objects that have fds.
> 
> Either of these approaches would handle RAM with existing savevm/migrate
> commands.

Yes, but the vfio issues would still need to be solved, and we would need new
command line options to back existing and future secondary memory objects with 
named shared memory.

> The remaining issue is how to migrate VFIO and other file descriptors
> that cannot be reopened by the new process. As mentioned, QEMU already
> has file descriptor passing support in the QMP monitor and support for
> opening passed file descriptors (see qemu_open_internal(),
> monitor_fd_param(), and socket_get_fd()).
> 
> The advantage of integrating live update functionality into the existing
> savevm/migrate commands is that it will work in more use cases with
> less code duplication/maintenance/bitrot prevention than the
> special-case cprsave command in this patch series.
> 
> Maybe there is a fundamental technical reason why live update needs to
> be different from QEMU's existing migration commands but I haven't
> figured it out yet.

vfio and anonymous memory.

Regarding code duplication, I did consider whether to extend the migration
syntax and implementation versus creating something new.  Those functions
handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
use case, and the cpr functions handle state that is n/a for the migration case.
I judged that handling both in the same functions would be less readable and
maintainable.  After feedback during the V1 review, I simplified the cprsave
code by calling qemu_save_device_state, as Xen does, thus eliminating any
interaction with the migration code.

Regarding bit rot, I still need to add a cpr test to the test suite, when the 
review is more complete and folks agree on the final form of the functionality.

I do like the idea of supporting update without exec, but as a future project, 
and not at the expense of dropping update with exec.

- Steve
Stefan Hajnoczi May 17, 2021, 11:40 a.m. UTC | #7
On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>> Provide the cprsave and cprload commands for live update.  These save and
> >>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> >>>> to a new version in between.
> >>>>
> >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >>>> paused state and waits for the cprload command.
> >>>
> >>> I think cprsave/cprload could be generalized by using QMP to stash the
> >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> >>> already opens fds passed using this mechanism.
> >>>
> >>> I haven't checked but it may be possible to drop some patches by reusing
> >>> QEMU's monitor file descriptor passing since the code already knows how
> >>> to open from 'getfd' fds.
> >>>
> >>> The reason why using QMP is interesting is because it eliminates the
> >>> need for execve(2). QEMU may be unable to execute a program due to
> >>> chroot, seccomp, etc.
> >>>
> >>> QMP would enable cprsave/cprload to work both with and without
> >>> execve(2).
> >>>
> >>> One tricky thing with this approach might be startup ordering: how to
> >>> get fds via the QMP monitor in the new process before processing the
> >>> entire command-line.
> >>
> >> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> >> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> >> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> >> I suspect my recent vfio extensions would smooth the rough edges.
> > 
> > I wonder about the reason for VFIO's pid limitation, maybe because it
> > pins pages from the original process?
> 
> The dma unmap code verifies that the requesting task is the same as the task that mapped
> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> to fix locked memory accounting, which is associated with the mm of the original task.
> 
> > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > execve(2) the new binary?
> 
> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> I was working against vfio rather than with it.
> 
> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> without exec, I still need the exec option.
> 
> >> However, the main issue is that guest ram must be backed by named shared memory, and
> >> we would need to add code to support shared memory for all the secondary memory objects.
> >> That makes it less interesting for us at this time; we care about updating legacy qemu 
> >> instances with anonymous guest memory.
> > 
> > Thanks for explaining this more in the other sub-thread. The secondary
> > memory objects you mentioned are relatively small so I don't think
> > saving them in the traditional way is a problem.
> > 
> > Two approaches for zero-copy memory migration fit into QEMU's existing
> > migration infrastructure:
> > 
> > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> >   etc) so they are not saved into the savevm file. The existing --object
> >   memory-backend-file syntax can be used.
> > 
> > - Extending the live migration protocol to detect when file descriptor
> >   passing is available (i.e. UNIX domain socket migration) and using
> >   that for memory-backend-* objects that have fds.
> > 
> > Either of these approaches would handle RAM with existing savevm/migrate
> > commands.
> 
> Yes, but the vfio issues would still need to be solved, and we would need new
> command line options to back existing and future secondary memory objects with 
> named shared memory.
> 
> > The remaining issue is how to migrate VFIO and other file descriptors
> > that cannot be reopened by the new process. As mentioned, QEMU already
> > has file descriptor passing support in the QMP monitor and support for
> > opening passed file descriptors (see qemu_open_internal(),
> > monitor_fd_param(), and socket_get_fd()).
> > 
> > The advantage of integrating live update functionality into the existing
> > savevm/migrate commands is that it will work in more use cases with
> > less code duplication/maintenance/bitrot prevention than the
> > special-case cprsave command in this patch series.
> > 
> > Maybe there is a fundamental technical reason why live update needs to
> > be different from QEMU's existing migration commands but I haven't
> > figured it out yet.
> 
> vfio and anonymous memory.
> 
> Regarding code duplication, I did consider whether to extend the migration
> syntax and implementation versus creating something new.  Those functions
> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> use case, and the cpr functions handle state that is n/a for the migration case.
> I judged that handling both in the same functions would be less readable and
> maintainable.  After feedback during the V1 review, I simplified the cprsave
> code by calling qemu_save_device_state, as Xen does, thus eliminating any
> interaction with the migration code.
> 
> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> review is more complete and folks agree on the final form of the functionality.
> 
> I do like the idea of supporting update without exec, but as a future project, 
> and not at the expense of dropping update with exec.

Alex: We're discussing how to live update QEMU while VFIO devices are
running. This patch series introduces monitor commands that call
execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
file descriptors. This way live update is transparent to VFIO but it
won't work if a sandboxed QEMU process is forbidden to call execve(2).
What are your thoughts on 1) the execve(2) approach and 2) extending
VFIO to allow running devices to be attached to a different process so
that execve(2) is not necessary?

Steven: Do you know if cpr will work with Intel's upcoming Shared
Virtual Addressing? I'm worried that execve(2) may be a short-term
solution that works around VFIO's current limitations but even execve(2)
may stop working in the future as IOMMUs and DMA approaches change.

Stefan
Alex Williamson May 17, 2021, 7:10 p.m. UTC | #8
On Mon, 17 May 2021 12:40:43 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
> > On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
> > > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
> > >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
> > >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
> > >>>> Provide the cprsave and cprload commands for live update.  These save and
> > >>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> > >>>> to a new version in between.
> > >>>>
> > >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> > >>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> > >>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> > >>>> paused state and waits for the cprload command.  
> > >>>
> > >>> I think cprsave/cprload could be generalized by using QMP to stash the
> > >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> > >>> already opens fds passed using this mechanism.
> > >>>
> > >>> I haven't checked but it may be possible to drop some patches by reusing
> > >>> QEMU's monitor file descriptor passing since the code already knows how
> > >>> to open from 'getfd' fds.
> > >>>
> > >>> The reason why using QMP is interesting is because it eliminates the
> > >>> need for execve(2). QEMU may be unable to execute a program due to
> > >>> chroot, seccomp, etc.
> > >>>
> > >>> QMP would enable cprsave/cprload to work both with and without
> > >>> execve(2).
> > >>>
> > >>> One tricky thing with this approach might be startup ordering: how to
> > >>> get fds via the QMP monitor in the new process before processing the
> > >>> entire command-line.  
> > >>
> > >> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> > >> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> > >> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> > >> I suspect my recent vfio extensions would smooth the rough edges.  
> > > 
> > > I wonder about the reason for VFIO's pid limitation, maybe because it
> > > pins pages from the original process?  
> > 
> > The dma unmap code verifies that the requesting task is the same as the task that mapped
> > the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> > to fix locked memory accounting, which is associated with the mm of the original task.
> >   
> > > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > > execve(2) the new binary?  
> > 
> > That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> > errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> > but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> > diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> > I was working against vfio rather than with it.
> > 
> > Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> > code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> > without exec, I still need the exec option.
> >   
> > >> However, the main issue is that guest ram must be backed by named shared memory, and
> > >> we would need to add code to support shared memory for all the secondary memory objects.
> > >> That makes it less interesting for us at this time; we care about updating legacy qemu 
> > >> instances with anonymous guest memory.  
> > > 
> > > Thanks for explaining this more in the other sub-thread. The secondary
> > > memory objects you mentioned are relatively small so I don't think
> > > saving them in the traditional way is a problem.
> > > 
> > > Two approaches for zero-copy memory migration fit into QEMU's existing
> > > migration infrastructure:
> > > 
> > > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> > >   etc) so they are not saved into the savevm file. The existing --object
> > >   memory-backend-file syntax can be used.
> > > 
> > > - Extending the live migration protocol to detect when file descriptor
> > >   passing is available (i.e. UNIX domain socket migration) and using
> > >   that for memory-backend-* objects that have fds.
> > > 
> > > Either of these approaches would handle RAM with existing savevm/migrate
> > > commands.  
> > 
> > Yes, but the vfio issues would still need to be solved, and we would need new
> > command line options to back existing and future secondary memory objects with 
> > named shared memory.
> >   
> > > The remaining issue is how to migrate VFIO and other file descriptors
> > > that cannot be reopened by the new process. As mentioned, QEMU already
> > > has file descriptor passing support in the QMP monitor and support for
> > > opening passed file descriptors (see qemu_open_internal(),
> > > monitor_fd_param(), and socket_get_fd()).
> > > 
> > > The advantage of integrating live update functionality into the existing
> > > savevm/migrate commands is that it will work in more use cases with
> > > less code duplication/maintenance/bitrot prevention than the
> > > special-case cprsave command in this patch series.
> > > 
> > > Maybe there is a fundamental technical reason why live update needs to
> > > be different from QEMU's existing migration commands but I haven't
> > > figured it out yet.  
> > 
> > vfio and anonymous memory.
> > 
> > Regarding code duplication, I did consider whether to extend the migration
> > syntax and implementation versus creating something new.  Those functions
> > handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> > use case, and the cpr functions handle state that is n/a for the migration case.
> > I judged that handling both in the same functions would be less readable and
> > maintainable.  After feedback during the V1 review, I simplified the cprsave
> > code by calling qemu_save_device_state, as Xen does, thus eliminating any
> > interaction with the migration code.
> > 
> > Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> > review is more complete and folks agree on the final form of the functionality.
> > 
> > I do like the idea of supporting update without exec, but as a future project, 
> > and not at the expense of dropping update with exec.  
> 
> Alex: We're discussing how to live update QEMU while VFIO devices are
> running. This patch series introduces monitor commands that call
> execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
> file descriptors. This way live update is transparent to VFIO but it
> won't work if a sandboxed QEMU process is forbidden to call execve(2).
> What are your thoughts on 1) the execve(2) approach and 2) extending
> VFIO to allow running devices to be attached to a different process so
> that execve(2) is not necessary?

Tracking processes is largely to support page pinning; we need to be
able to support both asynchronous page pinning to handle requests from
mdev drivers and we need to make sure pinned page accounting is
tracked to the same process.  If userspace can "pay" for locked pages
from one process on mappping, then "credit" them to another process on
unmap, that seems fairly exploitable.  We'd need some way to transfer
the locked memory accounting or handle it outside of vfio.  Thanks,

Alex
Dr. David Alan Gilbert May 18, 2021, 9:57 a.m. UTC | #9
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>> Provide the cprsave and cprload commands for live update.  These save and
> >>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> >>>> to a new version in between.
> >>>>
> >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >>>> paused state and waits for the cprload command.
> >>>
> >>> I think cprsave/cprload could be generalized by using QMP to stash the
> >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> >>> already opens fds passed using this mechanism.
> >>>
> >>> I haven't checked but it may be possible to drop some patches by reusing
> >>> QEMU's monitor file descriptor passing since the code already knows how
> >>> to open from 'getfd' fds.
> >>>
> >>> The reason why using QMP is interesting is because it eliminates the
> >>> need for execve(2). QEMU may be unable to execute a program due to
> >>> chroot, seccomp, etc.
> >>>
> >>> QMP would enable cprsave/cprload to work both with and without
> >>> execve(2).
> >>>
> >>> One tricky thing with this approach might be startup ordering: how to
> >>> get fds via the QMP monitor in the new process before processing the
> >>> entire command-line.
> >>
> >> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> >> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> >> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> >> I suspect my recent vfio extensions would smooth the rough edges.
> > 
> > I wonder about the reason for VFIO's pid limitation, maybe because it
> > pins pages from the original process?
> 
> The dma unmap code verifies that the requesting task is the same as the task that mapped
> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> to fix locked memory accounting, which is associated with the mm of the original task.

> > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > execve(2) the new binary?
> 
> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> I was working against vfio rather than with it.

OK the weirdness of vfio helps explain a bit about why you're doing it
this way; can you help separate some difference between restart and
reboot for me though:

In 'reboot' mode, where the guest must do suspend in its drivers, how
many of these vfio requirements are needed?  I guess the memfd use
for the anonymous areas isn't any use for reboot mode.

You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
vfio still care about the currently-anonymous areas?

> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> without exec, I still need the exec option.

Can you explain what that code injection mechanism is for those of us
who didn't see that?

Dave

> >> However, the main issue is that guest ram must be backed by named shared memory, and
> >> we would need to add code to support shared memory for all the secondary memory objects.
> >> That makes it less interesting for us at this time; we care about updating legacy qemu 
> >> instances with anonymous guest memory.
> > 
> > Thanks for explaining this more in the other sub-thread. The secondary
> > memory objects you mentioned are relatively small so I don't think
> > saving them in the traditional way is a problem.
> > 
> > Two approaches for zero-copy memory migration fit into QEMU's existing
> > migration infrastructure:
> > 
> > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> >   etc) so they are not saved into the savevm file. The existing --object
> >   memory-backend-file syntax can be used.
> > 
> > - Extending the live migration protocol to detect when file descriptor
> >   passing is available (i.e. UNIX domain socket migration) and using
> >   that for memory-backend-* objects that have fds.
> > 
> > Either of these approaches would handle RAM with existing savevm/migrate
> > commands.
> 
> Yes, but the vfio issues would still need to be solved, and we would need new
> command line options to back existing and future secondary memory objects with 
> named shared memory.
> 
> > The remaining issue is how to migrate VFIO and other file descriptors
> > that cannot be reopened by the new process. As mentioned, QEMU already
> > has file descriptor passing support in the QMP monitor and support for
> > opening passed file descriptors (see qemu_open_internal(),
> > monitor_fd_param(), and socket_get_fd()).
> > 
> > The advantage of integrating live update functionality into the existing
> > savevm/migrate commands is that it will work in more use cases with
> > less code duplication/maintenance/bitrot prevention than the
> > special-case cprsave command in this patch series.
> > 
> > Maybe there is a fundamental technical reason why live update needs to
> > be different from QEMU's existing migration commands but I haven't
> > figured it out yet.
> 
> vfio and anonymous memory.
> 
> Regarding code duplication, I did consider whether to extend the migration
> syntax and implementation versus creating something new.  Those functions
> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> use case, and the cpr functions handle state that is n/a for the migration case.
> I judged that handling both in the same functions would be less readable and
> maintainable.  After feedback during the V1 review, I simplified the cprsave
> code by calling qemu_save_device_state, as Xen does, thus eliminating any
> interaction with the migration code.
> 
> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> review is more complete and folks agree on the final form of the functionality.
> 
> I do like the idea of supporting update without exec, but as a future project, 
> and not at the expense of dropping update with exec.
> 
> - Steve
>
Stefan Hajnoczi May 18, 2021, 1:39 p.m. UTC | #10
On Mon, May 17, 2021 at 01:10:01PM -0600, Alex Williamson wrote:
> On Mon, 17 May 2021 12:40:43 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
> > > On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
> > > > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
> > > >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
> > > >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
> > > >>>> Provide the cprsave and cprload commands for live update.  These save and
> > > >>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> > > >>>> to a new version in between.
> > > >>>>
> > > >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> > > >>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> > > >>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> > > >>>> paused state and waits for the cprload command.  
> > > >>>
> > > >>> I think cprsave/cprload could be generalized by using QMP to stash the
> > > >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> > > >>> already opens fds passed using this mechanism.
> > > >>>
> > > >>> I haven't checked but it may be possible to drop some patches by reusing
> > > >>> QEMU's monitor file descriptor passing since the code already knows how
> > > >>> to open from 'getfd' fds.
> > > >>>
> > > >>> The reason why using QMP is interesting is because it eliminates the
> > > >>> need for execve(2). QEMU may be unable to execute a program due to
> > > >>> chroot, seccomp, etc.
> > > >>>
> > > >>> QMP would enable cprsave/cprload to work both with and without
> > > >>> execve(2).
> > > >>>
> > > >>> One tricky thing with this approach might be startup ordering: how to
> > > >>> get fds via the QMP monitor in the new process before processing the
> > > >>> entire command-line.  
> > > >>
> > > >> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> > > >> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> > > >> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> > > >> I suspect my recent vfio extensions would smooth the rough edges.  
> > > > 
> > > > I wonder about the reason for VFIO's pid limitation, maybe because it
> > > > pins pages from the original process?  
> > > 
> > > The dma unmap code verifies that the requesting task is the same as the task that mapped
> > > the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> > > to fix locked memory accounting, which is associated with the mm of the original task.
> > >   
> > > > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > > > execve(2) the new binary?  
> > > 
> > > That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> > > errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> > > but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> > > diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> > > I was working against vfio rather than with it.
> > > 
> > > Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> > > code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> > > without exec, I still need the exec option.
> > >   
> > > >> However, the main issue is that guest ram must be backed by named shared memory, and
> > > >> we would need to add code to support shared memory for all the secondary memory objects.
> > > >> That makes it less interesting for us at this time; we care about updating legacy qemu 
> > > >> instances with anonymous guest memory.  
> > > > 
> > > > Thanks for explaining this more in the other sub-thread. The secondary
> > > > memory objects you mentioned are relatively small so I don't think
> > > > saving them in the traditional way is a problem.
> > > > 
> > > > Two approaches for zero-copy memory migration fit into QEMU's existing
> > > > migration infrastructure:
> > > > 
> > > > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> > > >   etc) so they are not saved into the savevm file. The existing --object
> > > >   memory-backend-file syntax can be used.
> > > > 
> > > > - Extending the live migration protocol to detect when file descriptor
> > > >   passing is available (i.e. UNIX domain socket migration) and using
> > > >   that for memory-backend-* objects that have fds.
> > > > 
> > > > Either of these approaches would handle RAM with existing savevm/migrate
> > > > commands.  
> > > 
> > > Yes, but the vfio issues would still need to be solved, and we would need new
> > > command line options to back existing and future secondary memory objects with 
> > > named shared memory.
> > >   
> > > > The remaining issue is how to migrate VFIO and other file descriptors
> > > > that cannot be reopened by the new process. As mentioned, QEMU already
> > > > has file descriptor passing support in the QMP monitor and support for
> > > > opening passed file descriptors (see qemu_open_internal(),
> > > > monitor_fd_param(), and socket_get_fd()).
> > > > 
> > > > The advantage of integrating live update functionality into the existing
> > > > savevm/migrate commands is that it will work in more use cases with
> > > > less code duplication/maintenance/bitrot prevention than the
> > > > special-case cprsave command in this patch series.
> > > > 
> > > > Maybe there is a fundamental technical reason why live update needs to
> > > > be different from QEMU's existing migration commands but I haven't
> > > > figured it out yet.  
> > > 
> > > vfio and anonymous memory.
> > > 
> > > Regarding code duplication, I did consider whether to extend the migration
> > > syntax and implementation versus creating something new.  Those functions
> > > handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> > > use case, and the cpr functions handle state that is n/a for the migration case.
> > > I judged that handling both in the same functions would be less readable and
> > > maintainable.  After feedback during the V1 review, I simplified the cprsave
> > > code by calling qemu_save_device_state, as Xen does, thus eliminating any
> > > interaction with the migration code.
> > > 
> > > Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> > > review is more complete and folks agree on the final form of the functionality.
> > > 
> > > I do like the idea of supporting update without exec, but as a future project, 
> > > and not at the expense of dropping update with exec.  
> > 
> > Alex: We're discussing how to live update QEMU while VFIO devices are
> > running. This patch series introduces monitor commands that call
> > execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
> > file descriptors. This way live update is transparent to VFIO but it
> > won't work if a sandboxed QEMU process is forbidden to call execve(2).
> > What are your thoughts on 1) the execve(2) approach and 2) extending
> > VFIO to allow running devices to be attached to a different process so
> > that execve(2) is not necessary?
> 
> Tracking processes is largely to support page pinning; we need to be
> able to support asynchronous page pinning to handle requests from
> mdev drivers, and we need to make sure pinned page accounting is
> tracked to the same process.  If userspace can "pay" for locked pages
> from one process on mapping, then "credit" them to another process on
> unmap, that seems fairly exploitable.  We'd need some way to transfer
> the locked memory accounting or handle it outside of vfio.  Thanks,

Vhost's VHOST_SET_OWNER ioctl is somewhat similar. It's used to
associate the in-kernel vhost device with a userspace process and its
mm.

Would it be possible to add a VFIO_SET_OWNER ioctl that associates the
current process with the vfio_device? Only one process would be the
owner at any given time.
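
For illustration, a minimal sketch of what such a UAPI could look like.  The
ioctl name and request number are hypothetical; nothing like this exists in
<linux/vfio.h> today:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical ioctl -- it does not exist today; the request number is
 * made up, modeled loosely on vhost's VHOST_SET_OWNER. */
#define VFIO_DEVICE_SET_OWNER  _IO(VFIO_TYPE, VFIO_BASE + 99)

/* The inheriting process, holding an inherited device_fd, would claim
 * ownership so the kernel could re-associate pinned-page accounting with
 * the caller's mm. */
static int vfio_claim_ownership(int device_fd)
{
    return ioctl(device_fd, VFIO_DEVICE_SET_OWNER);
}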

I'm not sure how existing DMA mappings would behave, but this patch
series seems to rely on DMA continuing to work even though there is a
window of time when the execve(2) process doesn't have guest RAM
mmapped. So I guess existing DMA mappings continue to work because the
pages were previously pinned?

Stefan
Steven Sistare May 18, 2021, 3:48 p.m. UTC | #11
On 5/18/2021 9:39 AM, Stefan Hajnoczi wrote:
> On Mon, May 17, 2021 at 01:10:01PM -0600, Alex Williamson wrote:
>> On Mon, 17 May 2021 12:40:43 +0100
>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>
>>> On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
>>>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
>>>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
>>>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
>>>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
>>>>>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>>>>>> to a new version in between.
>>>>>>>>
>>>>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>>>>>> paused state and waits for the cprload command.  
>>>>>>>
>>>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>>>> already opens fds passed using this mechanism.
>>>>>>>
>>>>>>> I haven't checked but it may be possible to drop some patches by reusing
>>>>>>> QEMU's monitor file descriptor passing since the code already knows how
>>>>>>> to open from 'getfd' fds.
>>>>>>>
>>>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>>>> chroot, seccomp, etc.
>>>>>>>
>>>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>>>> execve(2).
>>>>>>>
>>>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>>>> get fds via the QMP monitor in the new process before processing the
>>>>>>> entire command-line.  
>>>>>>
>>>>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>>>>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>>>>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>>>>>> I suspect my recent vfio extensions would smooth the rough edges.  
>>>>>
>>>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>>>> pins pages from the original process?  
>>>>
>>>> The dma unmap code verifies that the requesting task is the same as the task that mapped
>>>> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
>>>> to fix locked memory accounting, which is associated with the mm of the original task.
>>>>   
>>>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>>>> execve(2) the new binary?  
>>>>
>>>> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
>>>> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
>>>> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
>>>> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
>>>> I was working against vfio rather than with it.
>>>>
>>>> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
>>>> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
>>>> without exec, I still need the exec option.
>>>>   
>>>>>> However, the main issue is that guest ram must be backed by named shared memory, and
>>>>>> we would need to add code to support shared memory for all the secondary memory objects.
>>>>>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>>>>>> instances with anonymous guest memory.  
>>>>>
>>>>> Thanks for explaining this more in the other sub-thread. The secondary
>>>>> memory objects you mentioned are relatively small so I don't think
>>>>> saving them in the traditional way is a problem.
>>>>>
>>>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>>>> migration infrastructure:
>>>>>
>>>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>>>>   etc) so they are not saved into the savevm file. The existing --object
>>>>>   memory-backend-file syntax can be used.
>>>>>
>>>>> - Extending the live migration protocol to detect when file descriptor
>>>>>   passing is available (i.e. UNIX domain socket migration) and using
>>>>>   that for memory-backend-* objects that have fds.
>>>>>
>>>>> Either of these approaches would handle RAM with existing savevm/migrate
>>>>> commands.  
>>>>
>>>> Yes, but the vfio issues would still need to be solved, and we would need new
>>>> command line options to back existing and future secondary memory objects with 
>>>> named shared memory.
>>>>   
>>>>> The remaining issue is how to migrate VFIO and other file descriptors
>>>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>>>> has file descriptor passing support in the QMP monitor and support for
>>>>> opening passed file descriptors (see qemu_open_internal(),
>>>>> monitor_fd_param(), and socket_get_fd()).
>>>>>
>>>>> The advantage of integrating live update functionality into the existing
>>>>> savevm/migrate commands is that it will work in more use cases with
>>>>> less code duplication/maintenance/bitrot prevention than the
>>>>> special-case cprsave command in this patch series.
>>>>>
>>>>> Maybe there is a fundamental technical reason why live update needs to
>>>>> be different from QEMU's existing migration commands but I haven't
>>>>> figured it out yet.  
>>>>
>>>> vfio and anonymous memory.
>>>>
>>>> Regarding code duplication, I did consider whether to extend the migration
>>>> syntax and implementation versus creating something new.  Those functions
>>>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
>>>> use case, and the cpr functions handle state that is n/a for the migration case.
>>>> I judged that handling both in the same functions would be less readable and
>>>> maintainable.  After feedback during the V1 review, I simplified the cprsave
>>>> code by calling qemu_save_device_state, as Xen does, thus eliminating any
>>>> interaction with the migration code.
>>>>
>>>> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
>>>> review is more complete and folks agree on the final form of the functionality.
>>>>
>>>> I do like the idea of supporting update without exec, but as a future project, 
>>>> and not at the expense of dropping update with exec.  
>>>
>>> Alex: We're discussing how to live update QEMU while VFIO devices are
>>> running. This patch series introduces monitor commands that call
>>> execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
>>> file descriptors. This way live update is transparent to VFIO but it
>>> won't work if a sandboxed QEMU process is forbidden to call execve(2).
>>> What are your thoughts on 1) the execve(2) approach and 2) extending
>>> VFIO to allow running devices to be attached to a different process so
>>> that execve(2) is not necessary?
>>
>> Tracking processes is largely to support page pinning; we need to be
>> able to support asynchronous page pinning to handle requests from
>> mdev drivers, and we need to make sure pinned page accounting is
>> tracked to the same process.  If userspace can "pay" for locked pages
>> from one process on mapping, then "credit" them to another process on
>> unmap, that seems fairly exploitable.  We'd need some way to transfer
>> the locked memory accounting or handle it outside of vfio.  Thanks,
> 
> Vhost's VHOST_SET_OWNER ioctl is somewhat similar. It's used to
> associate the in-kernel vhost device with a userspace process and its
> mm.
> 
> Would it be possible to add a VFIO_SET_OWNER ioctl that associates the
> current process with the vfio_device? Only one process would be the
> owner at any given time.

It is possible, but the implementation would need to invent new hooks in mm
to transfer the locked memory accounting.

> I'm not sure how existing DMA mappings would behave, but this patch
> series seems to rely on DMA continuing to work even though there is a
> window of time when the execve(2) process doesn't have guest RAM
> mmapped. So I guess existing DMA mappings continue to work because the
> pages were previously pinned?

Correct.  And changes to mappings are blocked between the pre-exec process issuing
VFIO_DMA_UNMAP_FLAG_VADDR and the post-exec process issuing VFIO_DMA_MAP_FLAG_VADDR.
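
As a rough sketch of that sequence, using the Linux 5.12 UAPI referenced in
the cover letter; container_fd, iova, size, and new_vaddr stand in for the
caller's real state:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Pre-exec: tell the kernel the vaddr for [iova, iova+size) is about to
 * become invalid.  DMA continues; translation updates are blocked. */
static int cpr_invalidate_vaddr(int container_fd, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .flags = VFIO_DMA_UNMAP_FLAG_VADDR,
        .iova  = iova,
        .size  = size,
    };
    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}

/* Post-exec: guest RAM has been re-mmap'd at new_vaddr; revalidate the
 * mapping so translation updates may resume. */
static int cpr_revalidate_vaddr(int container_fd, uint64_t iova, uint64_t size,
                                void *new_vaddr)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_VADDR,
        .vaddr = (uintptr_t)new_vaddr,
        .iova  = iova,
        .size  = size,
    };
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}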

- Steve
Steven Sistare May 18, 2021, 4 p.m. UTC | #12
On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>>>> to a new version in between.
>>>>>>
>>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>>>> paused state and waits for the cprload command.
>>>>>
>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>> already opens fds passed using this mechanism.
>>>>>
>>>>> I haven't checked but it may be possible to drop some patches by reusing
>>>>> QEMU's monitor file descriptor passing since the code already knows how
>>>>> to open from 'getfd' fds.
>>>>>
>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>> chroot, seccomp, etc.
>>>>>
>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>> execve(2).
>>>>>
>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>> get fds via the QMP monitor in the new process before processing the
>>>>> entire command-line.
>>>>
>>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>>>> I suspect my recent vfio extensions would smooth the rough edges.
>>>
>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>> pins pages from the original process?
>>
>> The dma unmap code verifies that the requesting task is the same as the task that mapped
>> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
>> to fix locked memory accounting, which is associated with the mm of the original task.
> 
>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>> execve(2) the new binary?
>>
>> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
>> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
>> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
>> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
>> I was working against vfio rather than with it.
> 
> OK the weirdness of vfio helps explain a bit about why you're doing it
> this way; can you help separate some difference between restart and
> reboot for me though:
> 
> In 'reboot' mode; where the guest must do suspend in its drivers, how
> much of these vfio requirements are needed?  I guess the memfd use
> for the anonymous areas isn't any use for reboot mode.

Correct.  For reboot no special vfio support or fiddling is needed.

> You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> vfio still care about the currently-anonymous areas?

Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.
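
A minimal sketch of the memfd-alloc idea described in the cover letter; the
environment-variable convention here is illustrative, not the one the series
actually uses:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Pre-exec: back guest RAM with a memfd and publish the fd number.  The fd
 * is deliberately left without FD_CLOEXEC so it survives execve(2). */
static void *alloc_preserved_ram(const char *name, size_t size)
{
    int fd = memfd_create(name, 0);
    char buf[16];

    if (fd < 0 || ftruncate(fd, size) < 0)
        return NULL;
    snprintf(buf, sizeof(buf), "%d", fd);
    setenv(name, buf, 1);
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* Post-exec: the same physical pages are found again through the inherited
 * fd; only the virtual address changes. */
static void *reattach_preserved_ram(const char *name, size_t size)
{
    const char *val = getenv(name);

    if (!val)
        return NULL;
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                atoi(val), 0);
}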

>> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
>> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
>> without exec, I still need the exec option.
> 
> Can you explain what that code injection mechanism is for those of us
> who didn't see that?

Sure.  Here is slide 12 from the talk.  It relies on mmap(MADV_DOEXEC) which was not
accepted upstream.

-----------------------------------------------------------------------------
Legacy Live Update

 * Update legacy qemu process to latest version
   - Inject code into legacy qemu process to perform cprsave: vmsave.so
     . Access qemu data structures and globals
       - eg ram_list, savevm_state, chardevs, vhost_devices
       - dlopen does not resolve them, must get addresses via symbol lookup.
     . Delete some vmstate handlers, register new ones (eg vfio)
     . Call MADV_DOEXEC on guest memory. Find devices, preserve fd
 * Hot patch a monitor function to dlopen vmsave.so, call entry point
   - write patch to /proc/pid/mem
   - Call the monitor function via monitor socket
 * Send cprload to update qemu
 * vmsave.so has binary dependency on qemu data structures and variables
   - Build vmsave-ver.so per legacy version
   - Indexed by qemu's gcc build-id

-----------------------------------------------------------------------------
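
For readers unfamiliar with the injection step, a heavily simplified and
hypothetical sketch of the symbol-lookup problem the slide alludes to;
dlsym() only works when the globals are dynamically visible, so a real
vmsave.so must resolve addresses from the binary's ELF symbol table instead:

#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical entry point for vmsave.so.  ram_list and savevm_state are
 * the qemu globals named on the slide. */
void vmsave_entry(void)
{
    void *self = dlopen(NULL, RTLD_NOW);        /* the running qemu image */
    void *ram_list = dlsym(self, "ram_list");
    void *savevm_state = dlsym(self, "savevm_state");

    if (!ram_list || !savevm_state) {
        fprintf(stderr, "vmsave: symbol lookup failed\n");
        return;
    }
    /* ... delete/register vmstate handlers, preserve fds, run cprsave ... */
}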

- Steve
 
>>>> However, the main issue is that guest ram must be backed by named shared memory, and
>>>> we would need to add code to support shared memory for all the secondary memory objects.
>>>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>>>> instances with anonymous guest memory.
>>>
>>> Thanks for explaining this more in the other sub-thread. The secondary
>>> memory objects you mentioned are relatively small so I don't think
>>> saving them in the traditional way is a problem.
>>>
>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>> migration infrastructure:
>>>
>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>>   etc) so they are not saved into the savevm file. The existing --object
>>>   memory-backend-file syntax can be used.
>>>
>>> - Extending the live migration protocol to detect when file descriptor
>>>   passing is available (i.e. UNIX domain socket migration) and using
>>>   that for memory-backend-* objects that have fds.
>>>
>>> Either of these approaches would handle RAM with existing savevm/migrate
>>> commands.
>>
>> Yes, but the vfio issues would still need to be solved, and we would need new
>> command line options to back existing and future secondary memory objects with 
>> named shared memory.
>>
>>> The remaining issue is how to migrate VFIO and other file descriptors
>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>> has file descriptor passing support in the QMP monitor and support for
>>> opening passed file descriptors (see qemu_open_internal(),
>>> monitor_fd_param(), and socket_get_fd()).
>>>
>>> The advantage of integrating live update functionality into the existing
>>> savevm/migrate commands is that it will work in more use cases with
>>> less code duplication/maintenance/bitrot prevention than the
>>> special-case cprsave command in this patch series.
>>>
>>> Maybe there is a fundamental technical reason why live update needs to
>>> be different from QEMU's existing migration commands but I haven't
>>> figured it out yet.
>>
>> vfio and anonymous memory.
>>
>> Regarding code duplication, I did consider whether to extend the migration
>> syntax and implementation versus creating something new.  Those functions
>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
>> use case, and the cpr functions handle state that is n/a for the migration case.
>> I judged that handling both in the same functions would be less readable and
>> maintainable.  After feedback during the V1 review, I simplified the cprsave
>> code by calling qemu_save_device_state, as Xen does, thus eliminating any
>> interaction with the migration code.
>>
>> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
>> review is more complete and folks agree on the final form of the functionality.
>>
>> I do like the idea of supporting update without exec, but as a future project, 
>> and not at the expense of dropping update with exec.
>>
>> - Steve
>>
Dr. David Alan Gilbert May 18, 2021, 7:23 p.m. UTC | #13
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
> > * Steven Sistare (steven.sistare@oracle.com) wrote:
> >> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> >>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>>>> Provide the cprsave and cprload commands for live update.  These save and
> >>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> >>>>>> to a new version in between.
> >>>>>>
> >>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >>>>>> paused state and waits for the cprload command.
> >>>>>
> >>>>> I think cprsave/cprload could be generalized by using QMP to stash the
> >>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> >>>>> already opens fds passed using this mechanism.
> >>>>>
> >>>>> I haven't checked but it may be possible to drop some patches by reusing
> >>>>> QEMU's monitor file descriptor passing since the code already knows how
> >>>>> to open from 'getfd' fds.
> >>>>>
> >>>>> The reason why using QMP is interesting is because it eliminates the
> >>>>> need for execve(2). QEMU may be unable to execute a program due to
> >>>>> chroot, seccomp, etc.
> >>>>>
> >>>>> QMP would enable cprsave/cprload to work both with and without
> >>>>> execve(2).
> >>>>>
> >>>>> One tricky thing with this approach might be startup ordering: how to
> >>>>> get fds via the QMP monitor in the new process before processing the
> >>>>> entire command-line.
> >>>>
> >>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> >>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> >>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> >>>> I suspect my recent vfio extensions would smooth the rough edges.
> >>>
> >>> I wonder about the reason for VFIO's pid limitation, maybe because it
> >>> pins pages from the original process?
> >>
> >> The dma unmap code verifies that the requesting task is the same as the task that mapped
> >> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> >> to fix locked memory accounting, which is associated with the mm of the original task.
> > 
> >>> Is this VFIO pid limitation the main reason why you chose to make QEMU
> >>> execve(2) the new binary?
> >>
> >> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> >> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> >> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> >> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> >> I was working against vfio rather than with it.
> > 
> > OK the weirdness of vfio helps explain a bit about why you're doing it
> > this way; can you help separate some difference between restart and
> > reboot for me though:
> > 
> > In 'reboot' mode; where the guest must do suspend in its drivers, how
> > much of these vfio requirements are needed?  I guess the memfd use
> > for the anonymous areas isn't any use for reboot mode.
> 
> Correct.  For reboot no special vfio support or fiddling is needed.
> 
> > You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> > vfio still care about the currently-anonymous areas?
> 
> Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
> are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.

Is it possible with vfio to map it into multiple processes
simultaneously or does it have to only be one at a time?
Are you saying that you have no way to shut off DMA, and thus you can
never know it's safe to terminate the source process?

> >> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> >> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> >> without exec, I still need the exec option.
> > 
> > Can you explain what that code injection mechanism is for those of us
> > who didn't see that?
> 
> Sure.  Here is slide 12 from the talk.  It relies on mmap(MADV_DOEXEC) which was not
> accepted upstream.

In this series, without MADV_DOEXEC, how do you guarantee the same HVA
in source and destination - or is that not necessary?

> -----------------------------------------------------------------------------
> Legacy Live Update
> 
>  * Update legacy qemu process to latest version
>    - Inject code into legacy qemu process to perform cprsave: vmsave.so
>      . Access qemu data structures and globals
>        - eg ram_list, savevm_state, chardevs, vhost_devices
>        - dlopen does not resolve them, must get addresses via symbol lookup.
>      . Delete some vmstate handlers, register new ones (eg vfio)
>      . Call MADV_DOEXEC on guest memory. Find devices, preserve fd
>  * Hot patch a monitor function to dlopen vmsave.so, call entry point
>    - write patch to /proc/pid/mem
>    - Call the monitor function via monitor socket
>  * Send cprload to update qemu
>  * vmsave.so has binary dependency on qemu data structures and variables
>    - Build vmsave-ver.so per legacy version
>    - Indexed by qemu's gcc build-id
> 
> -----------------------------------------------------------------------------

That's hairy!
At that point isn't it easier to recompile a patched qemu against the
original sources and ptrace something in to mmap the new qemu?

Dave

> - Steve
>  
> >>>> However, the main issue is that guest ram must be backed by named shared memory, and
> >>>> we would need to add code to support shared memory for all the secondary memory objects.
> >>>> That makes it less interesting for us at this time; we care about updating legacy qemu 
> >>>> instances with anonymous guest memory.
> >>>
> >>> Thanks for explaining this more in the other sub-thread. The secondary
> >>> memory objects you mentioned are relatively small so I don't think
> >>> saving them in the traditional way is a problem.
> >>>
> >>> Two approaches for zero-copy memory migration fit into QEMU's existing
> >>> migration infrastructure:
> >>>
> >>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> >>>   etc) so they are not saved into the savevm file. The existing --object
> >>>   memory-backend-file syntax can be used.
> >>>
> >>> - Extending the live migration protocol to detect when file descriptor
> >>>   passing is available (i.e. UNIX domain socket migration) and using
> >>>   that for memory-backend-* objects that have fds.
> >>>
> >>> Either of these approaches would handle RAM with existing savevm/migrate
> >>> commands.
> >>
> >> Yes, but the vfio issues would still need to be solved, and we would need new
> >> command line options to back existing and future secondary memory objects with 
> >> named shared memory.
> >>
> >>> The remaining issue is how to migrate VFIO and other file descriptors
> >>> that cannot be reopened by the new process. As mentioned, QEMU already
> >>> has file descriptor passing support in the QMP monitor and support for
> >>> opening passed file descriptors (see qemu_open_internal(),
> >>> monitor_fd_param(), and socket_get_fd()).
> >>>
> >>> The advantage of integrating live update functionality into the existing
> >>> savevm/migrate commands is that it will work in more use cases with
> >>> less code duplication/maintenance/bitrot prevention than the
> >>> special-case cprsave command in this patch series.
> >>>
> >>> Maybe there is a fundamental technical reason why live update needs to
> >>> be different from QEMU's existing migration commands but I haven't
> >>> figured it out yet.
> >>
> >> vfio and anonymous memory.
> >>
> >> Regarding code duplication, I did consider whether to extend the migration
> >> syntax and implementation versus creating something new.  Those functions
> >> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> >> use case, and the cpr functions handle state that is n/a for the migration case.
> >> I judged that handling both in the same functions would be less readable and
> >> maintainable.  After feedback during the V1 review, I simplified the cprsave
> >> code by calling qemu_save_device_state, as Xen does, thus eliminating any
> >> interaction with the migration code.
> >>
> >> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> >> review is more complete and folks agree on the final form of the functionality.
> >>
> >> I do like the idea of supporting update without exec, but as a future project, 
> >> and not at the expense of dropping update with exec.
> >>
> >> - Steve
> >>
>
Alex Williamson May 18, 2021, 8:01 p.m. UTC | #14
On Tue, 18 May 2021 20:23:25 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Steven Sistare (steven.sistare@oracle.com) wrote:
> > On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:  
> > > * Steven Sistare (steven.sistare@oracle.com) wrote:  
> > >> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
> > >>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
> > >>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
> > >>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
> > >>>>>> Provide the cprsave and cprload commands for live update.  These save and
> > >>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> > >>>>>> to a new version in between.
> > >>>>>>
> > >>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> > >>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> > >>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> > >>>>>> paused state and waits for the cprload command.  
> > >>>>>
> > >>>>> I think cprsave/cprload could be generalized by using QMP to stash the
> > >>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> > >>>>> already opens fds passed using this mechanism.
> > >>>>>
> > >>>>> I haven't checked but it may be possible to drop some patches by reusing
> > >>>>> QEMU's monitor file descriptor passing since the code already knows how
> > >>>>> to open from 'getfd' fds.
> > >>>>>
> > >>>>> The reason why using QMP is interesting is because it eliminates the
> > >>>>> need for execve(2). QEMU may be unable to execute a program due to
> > >>>>> chroot, seccomp, etc.
> > >>>>>
> > >>>>> QMP would enable cprsave/cprload to work both with and without
> > >>>>> execve(2).
> > >>>>>
> > >>>>> One tricky thing with this approach might be startup ordering: how to
> > >>>>> get fds via the QMP monitor in the new process before processing the
> > >>>>> entire command-line.  
> > >>>>
> > >>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> > >>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> > >>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> > >>>> I suspect my recent vfio extensions would smooth the rough edges.  
> > >>>
> > >>> I wonder about the reason for VFIO's pid limitation, maybe because it
> > >>> pins pages from the original process?  
> > >>
> > >> The dma unmap code verifies that the requesting task is the same as the task that mapped
> > >> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> > >> to fix locked memory accounting, which is associated with the mm of the original task.  
> > >   
> > >>> Is this VFIO pid limitation the main reason why you chose to make QEMU
> > >>> execve(2) the new binary?  
> > >>
> > >> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> > >> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> > >> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> > >> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> > >> I was working against vfio rather than with it.  
> > > 
> > > OK the weirdness of vfio helps explain a bit about why you're doing it
> > > this way; can you help separate some difference between restart and
> > > reboot for me though:
> > > 
> > > In 'reboot' mode; where the guest must do suspend in its drivers, how
> > > much of these vfio requirements are needed?  I guess the memfd use
> > > for the anonymous areas isn't any use for reboot mode.  
> > 
> > Correct.  For reboot no special vfio support or fiddling is needed.
> >   
> > > You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> > > vfio still care about the currently-anonymous areas?  
> > 
> > Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
> > are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.  
> 
> Is it possible with vfio to map it into multiple processes
> simultaneously or does it have to only be one at a time?

The IOMMU maps an IOVA to a physical address; what Steve is saying is
that the mapping persists across the restart.  A given IOVA can only map to
a specific physical address, so mapping into multiple processes doesn't
make any sense.  The two processes need to map the same IOVA to the
same HPA, only the HVA is allowed to change.

> Are you saying that you have no way to shut off DMA, and thus you can
> never know it's safe to terminate the source process?

Stopping DMA, e.g. by disabling PCI bus master, would not only be visible
in the behavior of the device but likely detrimental.  You'd need
driver or device participation to some extent to make this seamless.

> > >> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> > >> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> > >> without exec, I still need the exec option.  
> > > 
> > > Can you explain what that code injection mechanism is for those of us
> > > who didn't see that?  
> > 
> > Sure.  Here is slide 12 from the talk.  It relies on mmap(MADV_DOEXEC) which was not
> > accepted upstream.  
> 
> In this series, without MADV_DOEXEC, how do you guarantee the same HVA
> in source and destination - or is that not necessary?

It's not necessary; the HVA is used to establish the IOVA-to-HPA
mapping for the IOMMU.  We have patches upstream that suspend (block)
that translation for the window when the HVA is invalid and resume when
it becomes valid.  It's expected that the new HVA is equivalent to the
old HVA and that the user can only hurt themselves should they violate
this, i.e. they can still only map+pin memory they own, so at worst they
create a bad translation for their own device.  Thanks,

Alex
Steven Sistare May 18, 2021, 8:14 p.m. UTC | #15
On 5/18/2021 3:23 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
>>>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>>>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>>>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>>>>>> to a new version in between.
>>>>>>>>
>>>>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>>>>>> paused state and waits for the cprload command.
>>>>>>>
>>>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>>>> already opens fds passed using this mechanism.
>>>>>>>
>>>>>>> I haven't checked but it may be possible to drop some patches by reusing
>>>>>>> QEMU's monitor file descriptor passing since the code already knows how
>>>>>>> to open from 'getfd' fds.
>>>>>>>
>>>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>>>> chroot, seccomp, etc.
>>>>>>>
>>>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>>>> execve(2).
>>>>>>>
>>>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>>>> get fds via the QMP monitor in the new process before processing the
>>>>>>> entire command-line.
>>>>>>
>>>>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>>>>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>>>>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>>>>>> I suspect my recent vfio extensions would smooth the rough edges.
>>>>>
>>>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>>>> pins pages from the original process?
>>>>
>>>> The dma unmap code verifies that the requesting task is the same as the task that mapped
>>>> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
>>>> to fix locked memory accounting, which is associated with the mm of the original task.
>>>
>>>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>>>> execve(2) the new binary?
>>>>
>>>> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
>>>> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
>>>> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
>>>> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
>>>> I was working against vfio rather than with it.
>>>
>>> OK the weirdness of vfio helps explain a bit about why you're doing it
>>> this way; can you help separate some difference between restart and
>>> reboot for me though:
>>>
>>> In 'reboot' mode; where the guest must do suspend in its drivers, how
>>> much of these vfio requirements are needed?  I guess the memfd use
>>> for the anonymous areas isn't any use for reboot mode.
>>
>> Correct.  For reboot no special vfio support or fiddling is needed.
>>
>>> You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
>>> vfio still care about the currently-anonymous areas?
>>
>> Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
>> are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.
> 
> Is it possible with vfio to map it into multiple processes
> simultaneously or does it have to only be one at a time?
> Are you saying that you have no way to shut off DMA, and thus you can
> never know it's safe to terminate the source process?
> 
>>>> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
>>>> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
>>>> without exec, I still need the exec option.
>>>
>>> Can you explain what that code injection mechanism is for those of us
>>> who didn't see that?
>>
>> Sure.  Here is slide 12 from the talk.  It relies on mmap(MADV_DOEXEC) which was not
>> accepted upstream.
> 
> In this series, without MADV_DOEXEC, how do you guarantee the same HVA
> in source and destination - or is that not necessary?

Not necessary.  We can safely change the HVA using the new vfio ioctls.

>> -----------------------------------------------------------------------------
>> Legacy Live Update
>>
>>  * Update legacy qemu process to latest version
>>    - Inject code into legacy qemu process to perform cprsave: vmsave.so
>>      . Access qemu data structures and globals
>>        - eg ram_list, savevm_state, chardevs, vhost_devices
>>        - dlopen does not resolve them, must get addresses via symbol lookup.
>>      . Delete some vmstate handlers, register new ones (eg vfio)
>>      . Call MADV_DOEXEC on guest memory. Find devices, preserve fd
>>  * Hot patch a monitor function to dlopen vmsave.so, call entry point
>>    - write patch to /proc/pid/mem
>>    - Call the monitor function via monitor socket
>>  * Send cprload to update qemu
>>  * vmsave.so has binary dependency on qemu data structures and variables
>>    - Build vmsave-ver.so per legacy version
>>    - Indexed by qemu's gcc build-id
>>
>> -----------------------------------------------------------------------------
> 
> That's hairy!
> At that point isn't it easier to recompile a patched qemu against the
> original sources and ptrace something in to mmap the new qemu?

That could work, but safely capturing all the threads and forcing them to jump to the
mmap'd qemu is hard.

- Steve

>>>>>> However, the main issue is that guest ram must be backed by named shared memory, and
>>>>>> we would need to add code to support shared memory for all the secondary memory objects.
>>>>>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>>>>>> instances with anonymous guest memory.
>>>>>
>>>>> Thanks for explaining this more in the other sub-thread. The secondary
>>>>> memory objects you mentioned are relatively small so I don't think
>>>>> saving them in the traditional way is a problem.
>>>>>
>>>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>>>> migration infrastructure:
>>>>>
>>>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>>>>   etc) so they are not saved into the savevm file. The existing --object
>>>>>   memory-backend-file syntax can be used.
>>>>>
>>>>> - Extending the live migration protocol to detect when file descriptor
>>>>>   passing is available (i.e. UNIX domain socket migration) and using
>>>>>   that for memory-backend-* objects that have fds.
>>>>>
>>>>> Either of these approaches would handle RAM with existing savevm/migrate
>>>>> commands.
>>>>
>>>> Yes, but the vfio issues would still need to be solved, and we would need new
>>>> command line options to back existing and future secondary memory objects with 
>>>> named shared memory.
>>>>
>>>>> The remaining issue is how to migrate VFIO and other file descriptors
>>>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>>>> has file descriptor passing support in the QMP monitor and support for
>>>>> opening passed file descriptors (see qemu_open_internal(),
>>>>> monitor_fd_param(), and socket_get_fd()).
>>>>>
>>>>> The advantage of integrating live update functionality into the existing
>>>>> savevm/migrate commands is that it will work in more use cases with
>>>>> less code duplication/maintenance/bitrot prevention than the
>>>>> special-case cprsave command in this patch series.
>>>>>
>>>>> Maybe there is a fundamental technical reason why live update needs to
>>>>> be different from QEMU's existing migration commands but I haven't
>>>>> figured it out yet.
>>>>
>>>> vfio and anonymous memory.
>>>>
>>>> Regarding code duplication, I did consider whether to extend the migration
>>>> syntax and implementation versus creating something new.  Those functions
>>>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
>>>> use case, and the cpr functions handle state that is n/a for the migration case.
>>>> I judged that handling both in the same functions would be less readable and
>>>> maintainable.  After feedback during the V1 review, I simplified the cprsave
>>>> code by calling qemu_save_device_state, as Xen does, thus eliminating any
>>>> interaction with the migration code.
>>>>
>>>> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
>>>> review is more complete and folks agree on the final form of the functionality.
>>>>
>>>> I do like the idea of supporting update without exec, but as a future project, 
>>>> and not at the expense of dropping update with exec.
>>>>
>>>> - Steve
>>>>
>>
Steven Sistare May 19, 2021, 4:43 p.m. UTC | #16
Hi Michael, Marcel,
  I hope you have time to review the pci and vfio-pci related patches in this
series.  They are an essential part of the live update functionality.  The
first 2 patches are straightforward, just exposing functions for use in vfio.
The last 2 patches are more substantial.

  - pci: export functions for cpr
  - vfio-pci: refactor for cpr
  - vfio-pci: cpr part 1
  - vfio-pci: cpr part 2

- Steve

On 5/7/2021 8:24 AM, Steve Sistare wrote:
> Provide the cprsave and cprload commands for live update.  These save and
> restore VM state, with minimal guest pause time, so that qemu may be updated
> to a new version in between.
> 
> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> paused state and waits for the cprload command.
> 
> To use the restart mode, qemu must be started with the memfd-alloc option,
> which allocates guest ram using memfd_create.  The memfd's are saved to
> the environment and kept open across exec, after which they are found from
> the environment and re-mmap'd.  Hence guest ram is preserved in place,
> albeit with new virtual addresses in the qemu process.  The caller resumes
> the guest by calling cprload, which loads state from the file.  If the VM
> was running at cprsave time, then VM execution resumes.  cprsave supports
> any type of guest image and block device, but the caller must not modify
> guest block devices between cprsave and cprload.
> 
> The restart mode supports vfio devices by preserving the vfio container,
> group, device, and event descriptors across the qemu re-exec, and by
> updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
> VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
> and integrated in Linux kernel 5.12.
> 
> For the reboot mode, cprsave saves state and exits qemu, and the caller is
> allowed to update the host kernel and system software and reboot.  The
> caller resumes the guest by running qemu with the same arguments as the
> original process and calling cprload.  To use this mode, guest ram must be
> mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
> PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.
> 
> The reboot mode supports vfio devices if the caller suspends the guest
> instead of stopping the VM, such as by issuing guest-suspend-ram to the
> qemu guest agent.  The guest drivers' suspend methods flush outstanding
> requests and re-initialize the devices, and thus there is no device state
> to save and restore.
> 
> The first patches add helper functions:
> 
>   - as_flat_walk
>   - qemu_ram_volatile
>   - oslib: qemu_clr_cloexec
>   - util: env var helpers
>   - machine: memfd-alloc option
>   - vl: add helper to request re-exec
> 
> The next patches implement cprsave and cprload:
> 
>   - cpr
>   - cpr: QMP interfaces
>   - cpr: HMP interfaces
> 
> The next patches add vfio support for the restart mode:
> 
>   - pci: export functions for cpr
>   - vfio-pci: refactor for cpr
>   - vfio-pci: cpr part 1
>   - vfio-pci: cpr part 2
> 
> The next patches preserve various descriptor-based backend devices across
> a cprsave restart:
> 
>   - vhost: reset vhost devices upon cprsave
>   - hostmem-memfd: cpr support
>   - chardev: cpr framework
>   - chardev: cpr for simple devices
>   - chardev: cpr for pty
>   - chardev: cpr for sockets
>   - cpr: only-cpr-capable option
>   - cpr: maintainers
>   - simplify savevm
> 
> Here is an example of updating qemu from v4.2.0 to v4.2.1 using 
> "cprload restart".  The software update is performed while the guest is
> running to minimize downtime.
> 
> window 1				| window 2
> 					|
> # qemu-system-x86_64 ... 		|
> QEMU 4.2.0 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: running			|
> 					| # yum update qemu
> (qemu) cprsave /tmp/qemu.sav restart	|
> QEMU 4.2.1 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: paused (prelaunch)		|
> (qemu) cprload /tmp/qemu.sav		|
> (qemu) info status			|
> VM status: running			|
> 
> 
> Here is an example of updating the host kernel using "cprload reboot"
> 
> window 1					| window 2
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: running				|
> 						| # yum update kernel-uek
> (qemu) cprsave /tmp/qemu.sav reboot		|
> 						|
> # systemctl kexec				|
> kexec_core: Starting new kernel			|
> ...						|
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: paused (prelaunch)			|
> (qemu) cprload /tmp/qemu.sav			|
> (qemu) info status				|
> VM status: running				|
> 
> Changes from V1 to V2:
>   - revert vmstate infrastructure changes
>   - refactor cpr functions into new files
>   - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to 
>     preserve memory.
>   - add framework to filter chardev's that support cpr
>   - save and restore vfio eventfd's
>   - modify cprinfo QMP interface
>   - incorporate misc review feedback
>   - remove unrelated and unneeded patches
>   - refactor all patches into a shorter and easier to review series
> 
> Changes from V2 to V3:
>   - rebase to qemu 6.0.0
>   - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
>   - change memfd-alloc to a machine option
>   - use existing channel socket function instead of defining new ones
>   - close monitor socket during cpr
>   - support memory-backend-memfd
>   - fix a few unreported bugs
> 
> Steve Sistare (18):
>   as_flat_walk
>   qemu_ram_volatile
>   oslib: qemu_clr_cloexec
>   util: env var helpers
>   machine: memfd-alloc option
>   vl: add helper to request re-exec
>   cpr
>   pci: export functions for cpr
>   vfio-pci: refactor for cpr
>   vfio-pci: cpr part 1
>   vfio-pci: cpr part 2
>   hostmem-memfd: cpr support
>   chardev: cpr framework
>   chardev: cpr for simple devices
>   chardev: cpr for pty
>   cpr: only-cpr-capable option
>   cpr: maintainers
>   simplify savevm
> 
> Mark Kanda, Steve Sistare (4):
>   cpr: QMP interfaces
>   cpr: HMP interfaces
>   vhost: reset vhost devices upon cprsave
>   chardev: cpr for sockets
> 
>  MAINTAINERS                   |  11 +++
>  backends/hostmem-memfd.c      |  21 +++--
>  chardev/char-mux.c            |   1 +
>  chardev/char-null.c           |   1 +
>  chardev/char-pty.c            |  15 ++-
>  chardev/char-serial.c         |   1 +
>  chardev/char-socket.c         |  35 +++++++
>  chardev/char-stdio.c          |   8 ++
>  chardev/char.c                |  41 +++++++-
>  gdbstub.c                     |   1 +
>  hmp-commands.hx               |  44 +++++++++
>  hw/core/machine.c             |  19 ++++
>  hw/pci/msi.c                  |   4 +
>  hw/pci/msix.c                 |  20 ++--
>  hw/pci/pci.c                  |   7 +-
>  hw/vfio/common.c              |  68 +++++++++++++-
>  hw/vfio/cpr.c                 | 131 ++++++++++++++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 | 214 ++++++++++++++++++++++++++++++++++++++----
>  hw/vfio/trace-events          |   1 +
>  hw/virtio/vhost.c             |  11 +++
>  include/chardev/char.h        |   6 ++
>  include/exec/memory.h         |  25 +++++
>  include/hw/boards.h           |   1 +
>  include/hw/pci/msix.h         |   5 +
>  include/hw/pci/pci.h          |   2 +
>  include/hw/vfio/vfio-common.h |   8 ++
>  include/hw/virtio/vhost.h     |   1 +
>  include/migration/cpr.h       |  17 ++++
>  include/monitor/hmp.h         |   3 +
>  include/qemu/env.h            |  23 +++++
>  include/qemu/osdep.h          |   1 +
>  include/sysemu/runstate.h     |   2 +
>  include/sysemu/sysemu.h       |   2 +
>  linux-headers/linux/vfio.h    |  27 ++++++
>  migration/cpr.c               | 200 +++++++++++++++++++++++++++++++++++++++
>  migration/meson.build         |   1 +
>  migration/migration.c         |   5 +
>  migration/savevm.c            |  21 ++---
>  migration/savevm.h            |   2 +
>  monitor/hmp-cmds.c            |  48 ++++++++++
>  monitor/hmp.c                 |   3 +
>  monitor/qmp-cmds.c            |  31 ++++++
>  monitor/qmp.c                 |   3 +
>  qapi/char.json                |   5 +-
>  qapi/cpr.json                 |  76 +++++++++++++++
>  qapi/meson.build              |   1 +
>  qapi/qapi-schema.json         |   1 +
>  qemu-options.hx               |  39 +++++++-
>  softmmu/globals.c             |   2 +
>  softmmu/memory.c              |  48 ++++++++++
>  softmmu/physmem.c             |  49 ++++++++--
>  softmmu/runstate.c            |  49 +++++++++-
>  softmmu/vl.c                  |  21 ++++-
>  stubs/cpr.c                   |   3 +
>  stubs/meson.build             |   1 +
>  trace-events                  |   1 +
>  util/env.c                    |  99 +++++++++++++++++++
>  util/meson.build              |   1 +
>  util/oslib-posix.c            |   9 ++
>  util/oslib-win32.c            |   4 +
>  util/qemu-config.c            |   4 +
>  62 files changed, 1431 insertions(+), 74 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
>  create mode 100644 include/migration/cpr.h
>  create mode 100644 include/qemu/env.h
>  create mode 100644 migration/cpr.c
>  create mode 100644 qapi/cpr.json
>  create mode 100644 stubs/cpr.c
>  create mode 100644 util/env.c
>
Dr. David Alan Gilbert May 20, 2021, 1 p.m. UTC | #17
Hi Steven,
  I'd like to split the discussion into reboot and restart,
so I can make sure I understand them individually.

So reboot mode;
Can you explain which parts of this series are needed for reboot mode;
I've managed to do a kexec based reboot on qemu with the current qemu -
albeit with no vfio devices, my current understanding is that for doing
reboot with vfio we just need some way of getting migrate to send the
metadata associated with vfio devices if the guest is in S3.

Is there something I'm missing and which you have in this series?

Dave
Dr. David Alan Gilbert May 20, 2021, 1:13 p.m. UTC | #18
On the 'restart' branch of questions; can you explain,
other than the passing of the fd's, why the outgoing side of
qemu's 'migrate exec:' doesn't work for you?

Dave
Steven Sistare May 21, 2021, 2:55 p.m. UTC | #19
On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
> Hi Steven,
>   I'd like to split the discussion into reboot and restart,
> so I can make sure I understand them individually.
> 
> So reboot mode;
> Can you explain which parts of this series are needed for reboot mode;
> I've managed to do a kexec based reboot on qemu with the current qemu -
> albeit with no vfio devices, my current understanding is that for doing
> reboot with vfio we just need some way of getting migrate to send the
> metadata associated with vfio devices if the guest is in S3.
> 
> Is there something I'm missing and which you have in this series?

You are correct, this series has little special code for reboot mode, but it does allow
reboot and restart to be handled similarly, which simplifies the management layer because 
the same calls are performed for each mode. 

For vfio in reboot mode, prior to sending cprload, the manager sends the guest-suspend-ram
command to the qemu guest agent. This flushes requests and brings the guest device to a 
reset state, so there is no vfio metadata to save.  Reboot mode does not call vfio_cprsave.
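
For reference, the suspend request is the standard guest agent command, sent
over the qga channel (shown only to illustrate the flow; the manager-side
plumbing varies):

  { "execute": "guest-suspend-ram" }

and the manager can wait for the SUSPEND event on the QMP socket before
issuing cprsave.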

There are a few unique patches to support reboot mode.  One is qemu_ram_volatile, which
is a sanity check that the writable ram blocks are backed by some form of shared memory.
Plus there are a few fragments in the "cpr" patch that handle the suspended state that
is induced by guest-suspend-ram.  See qemu_system_start_on_wake_request() and instances
of RUN_STATE_SUSPENDED in migration/cpr.c.
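
Roughly, the shape of that check is as follows.  This is a sketch, not the
patch itself; it assumes qemu's internal qemu_ram_foreach_block(),
qemu_ram_is_shared() and qemu_ram_get_idstr() helpers, and skips the
read-only cases for brevity:

  static int ram_block_volatile(RAMBlock *rb, void *opaque)
  {
      Error **errp = opaque;

      /* writable ram that is not backed by shared memory would be lost */
      if (!qemu_ram_is_shared(rb)) {
          error_setg(errp, "ram block %s is volatile and would not survive "
                     "a reboot", qemu_ram_get_idstr(rb));
          return 1;    /* non-zero stops the walk with the error set */
      }
      return 0;
  }

  int qemu_ram_volatile(Error **errp)
  {
      return qemu_ram_foreach_block(ram_block_volatile, errp);
  }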

- Steve
Steven Sistare May 21, 2021, 2:56 p.m. UTC | #20
On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> On the 'restart' branch of questions; can you explain,
> other than the passing of the fd's, why the outgoing side of
> qemu's 'migrate exec:' doesn't work for you?

I'm not sure what I should describe.  Can you be more specific?
Do you mean: can we add the cpr specific bits to the migrate exec code?

- Steve
Dr. David Alan Gilbert May 24, 2021, 10:39 a.m. UTC | #21
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> > On the 'restart' branch of questions; can you explain,
> > other than the passing of the fd's, why the outgoing side of
> > qemu's 'migrate exec:' doesn't work for you?
> 
> I'm not sure what I should describe.  Can you be more specific?
> Do you mean: can we add the cpr specific bits to the migrate exec code?

Yes; if possible I'd prefer to just keep the one exec mechanism.
It has an advantage of letting you specify the new command line; that
avoids the problems I'd pointed out with needing to change the command
line if a hotplug had happened.  It also means we only need one chunk of
exec code.

Dave

> - Steve
>
Steven Sistare June 2, 2021, 1:51 p.m. UTC | #22
On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>> On the 'restart' branch of questions; can you explain,
>>> other than the passing of the fd's, why the outgoing side of
>>> qemu's 'migrate exec:' doesn't work for you?
>>
>> I'm not sure what I should describe.  Can you be more specific?
>> Do you mean: can we add the cpr specific bits to the migrate exec code?
> 
> Yes; if possible I'd prefer to just keep the one exec mechanism.
> It has an advantage of letting you specify the new command line; that
> avoids the problems I'd pointed out with needing to change the command
> line if a hotplug had happened.  It also means we only need one chunk of
> exec code.

How/where would you specify a new command line?  Are you picturing the usual migration
setup where you start a second qemu with its own arguments, plus a migrate_incoming
option or command?  That does not work for live update restart; the old qemu must exec
the new qemu.  Or something else?

We could shoehorn cpr restart into the migrate exec path by defining a new migration 
capability that the client would set before issuing the migrate command.  However, the
implementation code would be sprinkled with conditionals to suppress migrate-only bits
and call cpr-only bits.  IMO that would be less maintainable than having a separate
cprsave function.  Currently cprsave does not duplicate any migration functionality.
cprsave calls qemu_save_device_state() which is used by xen.
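
For illustration, that client-side shoehorn would look roughly like the
following, where "x-cpr" is a capability name invented for this sketch,
though migrate-set-capabilities and migrate are the existing QMP commands:

  { "execute": "migrate-set-capabilities",
    "arguments": { "capabilities": [
        { "capability": "x-cpr", "state": true } ] } }
  { "execute": "migrate", "arguments": { "uri": "exec:..." } }

and then every step of the migration path would need the cpr conditionals
described above.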

- Steve
Steven Sistare June 2, 2021, 3:19 p.m. UTC | #23
Hi Michael,
  Alex has reviewed the vfio-pci patches.  If you could give me a thumbs-up or a 
needs-work on "pci: export functions for cpr", I would appreciate it.  Thanks!

[PATCH V3 10/22] pci: export functions for cpr
https://lore.kernel.org/qemu-devel/1620390320-301716-11-git-send-email-steven.sistare@oracle.com

- Steve

On 5/19/2021 12:43 PM, Steven Sistare wrote:
> Hi Michael, Marcel,
>   I hope you have time to review the pci and vfio-pci related patches in this
> series.  They are an essential part of the live update functionality.  The
> first 2 patches are straightforward, just exposing functions for use in vfio.
> The last 2 patches are more substantial.
> 
>   - pci: export functions for cpr
>   - vfio-pci: refactor for cpr
>   - vfio-pci: cpr part 1
>   - vfio-pci: cpr part 2
> 
> - Steve
> 
> On 5/7/2021 8:24 AM, Steve Sistare wrote:
>> [...]
Dr. David Alan Gilbert June 3, 2021, 7:36 p.m. UTC | #24
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> > * Steven Sistare (steven.sistare@oracle.com) wrote:
> >> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> >>> On the 'restart' branch of questions; can you explain,
> >>> other than the passing of the fd's, why the outgoing side of
> >>> qemu's 'migrate exec:' doesn't work for you?
> >>
> >> I'm not sure what I should describe.  Can you be more specific?
> >> Do you mean: can we add the cpr specific bits to the migrate exec code?
> > 
> > Yes; if possible I'd prefer to just keep the one exec mechanism.
> > It has an advantage of letting you specify the new command line; that
> > avoids the problems I'd pointed out with needing to change the command
> > line if a hotplug had happened.  It also means we only need one chunk of
> > exec code.
> 
> How/where would you specify a new command line?  Are you picturing the usual migration
> setup where you start a second qemu with its own arguments, plus a migrate_incoming
> option or command?  That does not work for live update restart; the old qemu must exec
> the new qemu.  Or something else?

The existing migration path allows an exec - originally intended to exec
something like a compressor or a store to file rather than a real
migration; i.e. you can do:

  migrate "exec:gzip > mig"

and that will save the migration stream to a compressed file called mig.
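
The matching restore side is then the usual incoming exec form, e.g.:

  # qemu-system-x86_64 ... -incoming "exec:gzip -c -d mig"

(shown just to make the symmetry of the exec transport clear).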
Now, I *think* we can already do:

  migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
(That's probably cleaner via the QMP interface).

I'm not quite sure what I want in the incoming there, but that is
already the source execing the destination qemu - although I think we'd
probably need to check if that's actually via an intermediary.

> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
> capability that the client would set before issuing the migrate command.  However, the
> implementation code would be sprinkled with conditionals to suppress migrate-only bits
> and call cpr-only bits.  IMO that would be less maintainable than having a separate
> cprsave function.  Currently cprsave does not duplicate any migration functionality.
> cprsave calls qemu_save_device_state() which is used by xen.

To me it feels like cprsave in particular is replicating more code.

It's also jumping through hoops in places to avoid changing the
commandline;  that's going to cause more pain for a lot of people - not
just because it's hacks all over for that, but because a lot of people
are actually going to need to change the commandline even in a cpr-like
case (e.g. due to hotplug or changing something else about the
environment, like auth data or route to storage or networking that
changed).

There are hooks for early parameter parsing, so if we need to add extra
commandline args we can; but for example the case of QEMU_START_FREEZE
to add -S just isn't needed as soon as you let go of the idea of needing
an identical commandline.

Dave

> - Steve
>
Daniel P. Berrangé June 3, 2021, 8:44 p.m. UTC | #25
On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
> > [...]
> > 
> > How/where would you specify a new command line?  Are you picturing the usual migration
> > setup where you start a second qemu with its own arguments, plus a migrate_incoming
> > option or command?  That does not work for live update restart; the old qemu must exec
> > the new qemu.  Or something else?
> 
> The existing migration path allows an exec - originally intended to exec
> something like a compressor or a store to file rather than a real
> migration; i.e. you can do:
> 
>   migrate "exec:gzip > mig"
> 
> and that will save the migration stream to a compressed file called mig.
> Now, I *think* we can already do:
> 
>   migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
> (That's probably cleaner via the QMP interface).
> 
> I'm not quite sure what I want in the incoming there, but that is
> already the source execing the destination qemu - although I think we'd
> probably need to check if that's actually via an intermediary.

I don't think you can directly exec qemu in that way, because the
source QEMU migration code is going to wait for completion of the
QEMU you exec'd and that'll never come on success. So you'll end
up with both QEMUs running forever. If you pass the daemonize
option to the new QEMU then it will immediately detach itself,
and the source QEMU will think the migration command has finished
or failed.

I think you can probably do it if you use a wrapper script though.
The wrapper would have to fork QEMU in the backend, and then the
wrapper would have to monitor the new QEMU to see when the incoming
migration has finished/aborted, at which point the wrapper can
exit, so the source QEMU sees a successful cleanup of the exec'd
command. </hand waving>

> > We could shoehorn cpr restart into the migrate exec path by defining a new migration 
> > capability that the client would set before issuing the migrate command.  However, the
> > implementation code would be sprinkled with conditionals to suppress migrate-only bits
> > and call cpr-only bits.  IMO that would be less maintainable than having a separate
> > cprsave function.  Currently cprsave does not duplicate any migration functionality.
> > cprsave calls qemu_save_device_state() which is used by xen.
> 
> To me it feels like cprsave in particular is replicating more code.
> 
> It's also jumping through hoops in places to avoid changing the
> commandline;  that's going to cause more pain for a lot of people - not
> just because it's hacks all over for that, but because a lot of people
> are actually going to need to change the commandline even in a cpr like
> case (e.g. due to hotplug or changing something else about the
> environment, like auth data or route to storage or networking that
> changed).

Management apps that already support migration will almost certainly
know how to start up a new QEMU with a different command line that
takes account of hotplugged/unplugged devices. IOW avoiding changing
the command line only really addresses the simple case, and the hard
case is likely already solved for purposes of handling regular live
migration.

Regards,
Daniel
Steven Sistare June 7, 2021, 4:40 p.m. UTC | #26
On 6/3/2021 4:44 PM, Daniel P. Berrangé wrote:
> On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
>> [...]
> 
> I don't think you can directly exec qemu in that way, because the
> source QEMU migration code is going to wait for completion of the
> QEMU you exec'd and that'll never come on success. So you'll end
> up with both QEMUs running forever. If you pass the daemonize
> option to the new QEMU then it will immediately detach itself,
> and the source QEMU will think the migration command has finished
> or failed.
> 
> I think you can probably do it if you use a wrapper script though.
> The wrapper would have to fork QEMU in the backend, and then the
> wrapper would have to monitor the new QEMU to see when the incoming
> migration has finished/aborted, at which point the wrapper can
> exit, so the source QEMU sees a successful cleanup of the exec'd
> command. </hand waving>

cpr restart does not work for any scheme that involves the old qemu process co-existing with
the new qemu process.  To preserve descriptors and anonymous memory, cpr restart requires 
that old qemu directly execs new qemu.  Not fork-exec.  Same pid.

So responding to Dave's comment, "keep the one exec mechanism", that is not possible.
We still need the qemu_exec_requested mechanism to cause a direct exec after state is
saved.
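
The trick is small enough to show standalone.  A toy sketch (not the series'
code; the env var name and size are invented here) of preserving an fd-backed
anonymous region across a direct self-exec:

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/mman.h>

  #define FD_ENV "SKETCH_MEMFD"    /* invented name, not from the series */
  #define RAM_SIZE 4096

  int main(int argc, char **argv)
  {
      const char *val = getenv(FD_ENV);

      if (val) {
          /* post-exec: same pid, fd still open; map it at a new address */
          int fd = atoi(val);
          char *ram = mmap(NULL, RAM_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
          printf("pid %d after exec: %s\n", (int)getpid(), ram);
          return 0;
      }

      /* anonymous "guest ram", but backed by a kernel fd;
         error checks omitted for brevity */
      int fd = memfd_create("guest-ram", 0);   /* no MFD_CLOEXEC: survives exec */
      ftruncate(fd, RAM_SIZE);
      char *ram = mmap(NULL, RAM_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
      strcpy(ram, "contents survive exec");

      char buf[16];
      snprintf(buf, sizeof(buf), "%d", fd);
      setenv(FD_ENV, buf, 1);
      execv("/proc/self/exe", argv);           /* direct exec: pid unchanged */
      perror("execv");
      return 1;
  }

qemu additionally clears FD_CLOEXEC on descriptors it wants to keep
(qemu_clr_cloexec in this series); memfd_create without MFD_CLOEXEC already
leaves the fd inheritable across exec.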

>>> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
>>> capability that the client would set before issuing the migrate command.  However, the
>>> implementation code would be sprinkled with conditionals to suppress migrate-only bits
>>> and call cpr-only bits.  IMO that would be less maintainable than having a separate
>>> cprsave function.  Currently cprsave does not duplicate any migration functionality.
>>> cprsave calls qemu_save_device_state() which is used by xen.
>>
>> To me it feels like cprsave in particular is replicating more code.
>>
>> It's also jumping through hoops in places to avoid changing the
>> commandline;  that's going to cause more pain for a lot of people - not
>> just because it's hacks all over for that, but because a lot of people
>> are actually going to need to change the commandline even in a cpr like
>> case (e.g. due to hotplug or changing something else about the
>> environment, like auth data or route to storage or networking that
>> changed).
> 
> Management apps that already support migration will almost certainly
> know how to start up a new QEMU with a different command line that
> takes account of hotplugged/unplugged devices. IOW avoiding changing
> the command line only really addresses the simple case, and the hard
> case is likely already solved for purposes of handling regular live
> migration. 

Agreed, with the caveat that for cpr, the management app must communicate the new arguments
to the qemu-exec trampoline, rather than passing the args on the command line to a new 
qemu process.

>> There are hooks for early parameter parsing, so if we need to add extra
>> commandline args we can; but for example the case of QEMU_START_FREEZE
>> to add -S just isn't needed as soon as you let go of the idea of needing
>> an identical commandline.

I'll delete QEMU_START_FREEZE.  

I still need to preserve argv_main and pass it to the qemu-exec trampoline, though, as 
the args contain identifying information that the management app needs in order to modify 
the arguments based on the instance's hot-plug history.

Or, here is another possibility.  We could redefine cprsave to leave the VM in a
stopped state, and add a cprstart command to be called subsequently that performs 
the exec.  It takes a single string argument: a command plus arguments to exec.  
The command may be qemu or a trampoline like qemu-exec.  I like that the trampoline
name is no longer hardcoded.  The management app can derive new qemu args for the
instances as it would with migration, and pass them to the command, instead of passing
them to qemu-exec via some side channel.  cprload finishes the job and does not change.
I already like this scheme better.
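
As an HMP sketch (cprstart is the proposed command; the trampoline path and
arguments here are placeholders):

  (qemu) cprsave /tmp/qemu.sav
  (qemu) cprstart "/usr/bin/qemu-exec <new qemu args>"
  ... old qemu execs the command, monitor reconnects ...
  (qemu) cprload /tmp/qemu.sav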

- Steve
Steven Sistare June 7, 2021, 6:08 p.m. UTC | #27
On 6/3/2021 3:36 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>>>> On the 'restart' branch of questions; can you explain,
>>>>> other than the passing of the fd's, why the outgoing side of
>>>>> qemu's 'migrate exec:' doesn't work for you?
>>>>
>>>> I'm not sure what I should describe.  Can you be more specific?
>>>> Do you mean: can we add the cpr specific bits to the migrate exec code?
>>>
>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
>>> It has an advantage of letting you specify the new command line; that
>>> avoids the problems I'd pointed out with needing to change the command
>>> line if a hotplug had happened.  It also means we only need one chunk of
>>> exec code.
>>
>> [...]
> 
> I'm not quite sure what I want in the incoming there, but that is
> already the source execing the destination qemu - although I think we'd
> probably need to check if that's actually via an intermediary.
> 
>> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
>> capability that the client would set before issuing the migrate command.  However, the
>> implementation code would be sprinkled with conditionals to suppress migrate-only bits
>> and call cpr-only bits.  IMO that would be less maintainable than having a separate
>> cprsave function.  Currently cprsave does not duplicate any migration functionality.
>> cprsave calls qemu_save_device_state() which is used by xen.
> 
> To me it feels like cprsave in particular is replicating more code. 

In the attached file I annotated lines of code that have some overlap
with migration code actions.  They include vm control, global_state_store,
and vmstate save, and cover 18 lines of 78 total.  I did not include the
body of qf_file_open because it is also called by xen.

The migration code adds capabilities, parameters, state, status, info,
precopy, postcopy, dirty bitmap, events, notifiers, 6 channel types,
blockers, pause+resume, return path, request-reply commands, throttling, colo,
blocks, phases, iteration, and threading, implemented by 20000+ lines of code.
To me it seems wrong to throw cpr into that mix to avoid adding tens of lines 
of similar code.

- Steve
void cprsave(const char *file, CprMode mode, Error **errp)
  {
      int ret = 0;
**    QEMUFile *f;
**    int saved_vm_running = runstate_is_running();
      bool restart = (mode == CPR_MODE_RESTART);
      bool reboot = (mode == CPR_MODE_REBOOT);
  
      if (reboot && qemu_ram_volatile(errp)) {
          return;
      }
  
      if (restart && xen_enabled()) {
          error_setg(errp, "xen does not support cprsave restart");
          return;
      }
  
      if (migrate_colo_enabled()) {
          error_setg(errp, "error: cprsave does not support x-colo");
          return;
      }
  
      if (replay_mode != REPLAY_MODE_NONE) {
          error_setg(errp, "error: cprsave does not support replay");
          return;
      }
  
      f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, "cprsave", errp);
      if (!f) {
          return;
      }
  
**    ret = global_state_store();
**    if (ret) {
**        error_setg(errp, "Error saving global state");
**        qemu_fclose(f);
**        return;
**    }
      if (runstate_check(RUN_STATE_SUSPENDED)) {
          /* Update timers_state before saving.  Suspend did not do so. */
          cpu_disable_ticks();
      }
**    vm_stop(RUN_STATE_SAVE_VM);
  
      cpr_is_active = true;
**    ret = qemu_save_device_state(f);
**    qemu_fclose(f);
**    if (ret < 0) {
**        error_setg(errp, "Error %d while saving VM state", ret);
**        goto err;
**    }
  
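      /* reboot mode: guest ram persists in its shared backing file, so just
       * save state and power off.  restart mode: keep preserved descriptors
       * open and re-exec this same process. */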
      if (reboot) {
          shutdown_action = SHUTDOWN_ACTION_POWEROFF;
          qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
      } else if (restart) {
          if (!qemu_chr_cpr_capable(errp)) {
              goto err;
          }
          if (vfio_cprsave(errp)) {
              goto err;
          }
          walkenv(FD_PREFIX, preserve_fd, 0);
          vhost_dev_reset_all();
          qemu_term_exit();
          setenv("QEMU_START_FREEZE", "", 1);
          qemu_system_exec_request();
      }
      goto done;
  
  err:
**    if (saved_vm_running) {
**        vm_start();
**    }
  done:
      cpr_is_active = false;
      return;
  }
Steven Sistare June 14, 2021, 2:31 p.m. UTC | #28
On 6/7/2021 12:40 PM, Steven Sistare wrote:
> [...]
> 
> Or, here is another possibility.  We could redefine cprsave to leave the VM in a
> stopped state, and add a cprstart command to be called subsequently that performs 
> the exec.  It takes a single string argument: a command plus arguments to exec.  
> The command may be qemu or a trampoline like qemu-exec.  I like that the trampoline
> name is no longer hardcoded.  The management app can derive new qemu args for the
> instances as it would with migration, and pass them to the command, instead of passing
> them to qemu-exec via some side channel.  cprload finishes the job and does not change.
> I already like this scheme better.

Or, pass argv as an additional parameter to cprsave.

Daniel, David, do you like passing argv to cprsave or a new cprstart command better than the 
current scheme?  I am ready to send V4 of the series after we resolve this and the question of
whether or not to fold cpr into the migration command.

- Steve
Steven Sistare June 14, 2021, 2:33 p.m. UTC | #29
On 6/7/2021 2:08 PM, Steven Sistare wrote:
> On 6/3/2021 3:36 PM, Dr. David Alan Gilbert wrote:
>> [...]
>>
>> To me it feels like cprsave in particular is replicating more code. 
> 
> In the attached file I annotated lines of code that have some overlap
> with migration code actions.  They include vm control, global_state_store,
> and vmstate save, and cover 18 lines of 78 total.  I did not include the
> body of qf_file_open because it is also called by xen.
> 
> The migration code adds capabilities, parameters, state, status, info,
> precopy, postcopy, dirty bitmap, events, notifiers, 6 channel types,
> blockers, pause+resume, return path, request-reply commands, throttling, colo,
> blocks, phases, iteration, and threading, implemented by 20000+ lines of code.
> To me it seems wrong to throw cpr into that mix to avoid adding tens of lines 
> of similar code.

Hi David, what is your decision?  Will you accept separate cpr commands?
One last point is that Xen made a similar choice, adding the xen-save-devices-state
command which calls qemu_save_device_state instead of migration_thread.

- Steve
Daniel P. Berrangé June 14, 2021, 2:36 p.m. UTC | #30
On Mon, Jun 14, 2021 at 10:31:32AM -0400, Steven Sistare wrote:
> On 6/7/2021 12:40 PM, Steven Sistare wrote:
> > [...]
> > 
> > Or, here is another possibility.  We could redefine cprsave to leave the VM in a
> > stopped state, and add a cprstart command to be called subsequently that performs 
> > the exec.  It takes a single string argument: a command plus arguments to exec.  
> > The command may be qemu or a trampoline like qemu-exec.  I like that the trampoline
> > name is no longer hardcoded.  The management app can derive new qemu args for the
> > instances as it would with migration, and pass them to the command, instead of passing
> > them to qemu-exec via some side channel.  cprload finishes the job and does not change.
> > I already like this scheme better.
> 
> Or, pass argv as an additional parameter to cprsave.
> 
> Daniel, David, do you like passing argv to cprsave or a new cprstart command better than the 
> current scheme?  I am ready to send V4 of the series after we resolve this and the question of
> whether or not to fold cpr into the migration command.

I don't really have a strong opinion on this either way.


Regards,
Daniel
Dr. David Alan Gilbert June 15, 2021, 7:05 p.m. UTC | #31
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 6/3/2021 4:44 PM, Daniel P. Berrangé wrote:
> > [...]
> 
> cpr restart does not work for any scheme that involves the old qemu process co-existing with
> the new qemu process.  To preserve descriptors and anonymous memory, cpr restart requires 
> that old qemu directly execs new qemu.  Not fork-exec.  Same pid.
> 
> So responding to Dave's comment, "keep the one exec mechanism", that is not possible.
> We still need the qemu_exec_requested mechanism to cause a direct exec after state is
> saved.

OK, note if you can find any way to make kernel changes to avoid this
exec, life is going to get *much* better; starting a separate qemu at
the management layer would be so much easier.

> >>> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
> >>> capability that the client would set before issuing the migrate command.  However, the
> >>> implementation code would be sprinkled with conditionals to suppress migrate-only bits
> >>> and call cpr-only bits.  IMO that would be less maintainable than having a separate
> >>> cprsave function.  Currently cprsave does not duplicate any migration functionality.
> >>> cprsave calls qemu_save_device_state() which is used by xen.
> >>
> >> To me it feels like cprsave in particular is replicating more code.
> >>
> >> It's also jumping through hoops in places to avoid changing the
> >> commandline; that's going to cause more pain for a lot of people - not
> >> just because of the hacks scattered around to achieve it, but because a
> >> lot of people are actually going to need to change the commandline even
> >> in a cpr-like case (e.g. due to hotplug, or changing something else
> >> about the environment, like auth data or the route to storage or
> >> networking).
> > 
> > Management apps that already support migration will almost certainly
> > know how to start up a new QEMU with a different command line that
> > takes account of hotplugged/unplugged devices. IOW avoiding changing
> > the command line only really addresses the simple case, and the hard
> > case is likely already solved for purposes of handling regular live
> > migration. 
> 
> Agreed, with the caveat that for cpr, the management app must communicate the new arguments
> to the qemu-exec trampoline, rather than passing the args on the command line to a new 
> qemu process.
> 
> >> There are hooks for early parameter parsing, so if we need to add extra
> >> commandline args we can; but for example the case of QEMU_START_FREEZE
> >> to add -S just isn't needed as soon as you let go of the idea of needing
> >> an identical commandline.
> 
> I'll delete QEMU_START_FREEZE.  
> 
> I still need to preserve argv_main and pass it to the qemu-exec trampoline, though, as 
> the args contain identifying information that the management app needs in order to modify 
> the arguments based on the instance's hot plug history.
> 
> Or, here is another possibility.  We could redefine cprsave to leave the VM in a
> stopped state, and add a cprstart command to be called subsequently that performs 
> the exec.  It takes a single string argument: a command plus arguments to exec.  
> The command may be qemu or a trampoline like qemu-exec.  I like that the trampoline
> name is no longer hardcoded.  The management app can derive new qemu args for the
> instances as it would with migration, and pass them to the command, instead of passing
> them to qemu-exec via some side channel.  cprload finishes the job and does not change.
> I already like this scheme better.
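> 
> With that scheme the monitor flow might look something like this
> (syntax illustrative, since cprstart is only a proposal here):
> 
>   (qemu) cprsave /tmp/state restart
>   (qemu) cprstart "qemu-exec qemu-system-x86_64 <new args>"
>   ... same pid, new qemu comes up paused ...
>   (qemu) cprload /tmp/state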

Right, that's sounding better; now the other benefit you get is you
don't need to play with environment variables; you can define a command
line option that takes all the extra data it needs.

Dave

> - Steve
>
Dr. David Alan Gilbert June 15, 2021, 7:14 p.m. UTC | #32
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
> > Hi Steven,
> >   I'd like to split the discussion into reboot and restart,
> > so I can make sure I understand them individually.
> > 
> > So reboot mode;
> > Can you explain which parts of this series are needed for reboot mode;
> > I've managed to do a kexec based reboot on qemu with the current qemu -
> > albeit with no vfio devices, my current understanding is that for doing
> > reboot with vfio we just need some way of getting migrate to send the
> > metadata associated with vfio devices if the guest is in S3.
> > 
> > Is there something I'm missing and which you have in this series?
> 
> You are correct, this series has little special code for reboot mode, but it does allow
> reboot and restart to be handled similarly, which simplifies the management layer because 
> the same calls are performed for each mode. 
> 
> For vfio in reboot mode, prior to sending cprload, the manager sends the guest-suspend-ram
> command to the qemu guest agent. This flushes requests and brings the guest device to a 
> reset state, so there is no vfio metadata to save.  Reboot mode does not call vfio_cprsave.
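> 
> For example, before cprload the manager issues the standard guest
> agent command on the qemu-ga channel:
> 
>   { "execute": "guest-suspend-ram" }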
> 
> There are a few unique patches to support reboot mode.  One is qemu_ram_volatile, which
> is a sanity check that the writable ram blocks are backed by some form of shared memory.
> Plus there are a few fragments in the "cpr" patch that handle the suspended state that
> is induced by guest-suspend-ram.  See qemu_system_start_on_wake_request() and instances
> of RUN_STATE_SUSPENDED in migration/cpr.c

Could you split the 'reboot' part off separately, then we can review
that and perhaps get it in first? It should be a relatively small patch
set - it'll get things moving in the right direction.

The guest-suspend-ram stuff seems reasonable as an idea; let's just try
to avoid doing it all via environment variables though; make them proper
command line options.

Dave

> - Steve
>
Steven Sistare June 24, 2021, 3:05 p.m. UTC | #33
On 6/15/2021 3:14 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
>>> Hi Steven,
>>>   I'd like to split the discussion into reboot and restart,
>>> so I can make sure I understand them individually.
>>>
>>> So reboot mode;
>>> Can you explain which parts of this series are needed for reboot mode;
>>> I've managed to do a kexec based reboot on qemu with the current qemu -
>>> albeit with no vfio devices, my current understanding is that for doing
>>> reboot with vfio we just need some way of getting migrate to send the
>>> metadata associated with vfio devices if the guest is in S3.
>>>
>>> Is there something I'm missing and which you have in this series?
>>
>> You are correct, this series has little special code for reboot mode, but it does allow
>> reboot and restart to be handled similarly, which simplifies the management layer because 
>> the same calls are performed for each mode. 
>>
>> For vfio in reboot mode, prior to sending cprload, the manager sends the guest-suspend-ram
>> command to the qemu guest agent. This flushes requests and brings the guest device to a 
>> reset state, so there is no vfio metadata to save.  Reboot mode does not call vfio_cprsave.
>>
>> There are a few unique patches to support reboot mode.  One is qemu_ram_volatile, which
>> is a sanity check that the writable ram blocks are backed by some form of shared memory.
>> Plus there are a few fragments in the "cpr" patch that handle the suspended state that
>> is induced by guest-suspend-ram.  See qemu_system_start_on_wake_request() and instances
>> of RUN_STATE_SUSPENDED in migration/cpr.c
> 
> Could you split the 'reboot' part off separately, then we can review
> that and perhaps get it in first? It should be a relatively small patch
> set - it'll get things moving in the right direction.
> 
> The guest-suspend-ram stuff seems reasonable as an idea; let's just try
> to avoid doing it all via environment variables though; make them proper
> command line options.

How about I delete reboot mode and the mode argument instead?  Having two modes is causing no 
end of confusion, and my primary business need is for restart mode.

- Steve
Steven Sistare July 6, 2021, 5:31 p.m. UTC | #34
On 6/24/2021 11:05 AM, Steven Sistare wrote:
> On 6/15/2021 3:14 PM, Dr. David Alan Gilbert wrote:
>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>> On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
>>>> Hi Steven,
>>>>   I'd like to split the discussion into reboot and restart,
>>>> so I can make sure I understand them individually.
>>>>
>>>> So reboot mode;
>>>> Can you explain which parts of this series are needed for reboot mode;
>>>> I've managed to do a kexec based reboot on qemu with the current qemu -
>>>> albeit with no vfio devices, my current understanding is that for doing
>>>> reboot with vfio we just need some way of getting migrate to send the
>>>> metadata associated with vfio devices if the guest is in S3.
>>>>
>>>> Is there something I'm missing and which you have in this series?
>>>
>>> You are correct, this series has little special code for reboot mode, but it does allow
>>> reboot and restart to be handled similarly, which simplifies the management layer because 
>>> the same calls are performed for each mode. 
>>>
>>> For vfio in reboot mode, prior to sending cprload, the manager sends the guest-suspend-ram
>>> command to the qemu guest agent. This flushes requests and brings the guest device to a 
>>> reset state, so there is no vfio metadata to save.  Reboot mode does not call vfio_cprsave.
>>>
>>> There are a few unique patches to support reboot mode.  One is qemu_ram_volatile, which
>>> is a sanity check that the writable ram blocks are backed by some form of shared memory.
>>> Plus there are a few fragments in the "cpr" patch that handle the suspended state that
>>> is induced by guest-suspend-ram.  See qemu_system_start_on_wake_request() and instances
>>> of RUN_STATE_SUSPENDED in migration/cpr.c
>>
>> Could you split the 'reboot' part off separately, then we can review
>> that and perhaps get it in first? It should be a relatively small patch
>> set - it'll get things moving in the right direction.
>>
>> The guest-suspend-ram stuff seems reasonable as an idea; let's just try
>> to avoid doing it all via environment variables though; make them proper
>> command line options.
> 
> How about I delete reboot mode and the mode argument instead?  Having two modes is causing no 
> end of confusion, and my primary business need is for restart mode.

I just posted V4 of the patch series, refactoring reboot mode into the first 4 patches.

- Steve