
[V1,00/26] Live update: cpr-exec

Message ID 1714406135-451286-1-git-send-email-steven.sistare@oracle.com (mailing list archive)
Series Live update: cpr-exec

Message

Steven Sistare April 29, 2024, 3:55 p.m. UTC
This patch series adds the live migration cpr-exec mode.  In this mode, QEMU
stops the VM, writes VM state to the migration URI, and directly exec's a
new version of QEMU on the same host, replacing the original process while
retaining its PID.  Guest RAM is preserved in place, albeit with new virtual
addresses.  The user completes the migration by specifying the -incoming
option, and by issuing the migrate-incoming command if necessary.  This
saves and restores VM state, with minimal guest pause time, so that QEMU may
be updated to a new version in between.

The new interfaces are:
  * cpr-exec (MigMode migration parameter)
  * cpr-exec-args (migration parameter)
  * memfd-alloc=on (command-line option for -machine)
  * only-migratable-modes (command-line argument)

The caller sets the mode parameter before invoking the migrate command.
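
For example, via QMP (migrate-set-parameters and its mode argument already
exist in QEMU; the cpr-exec value is what this series adds):

  {"execute": "migrate-set-parameters",
   "arguments": {"mode": "cpr-exec"}}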

Arguments for the new QEMU process are taken from the cpr-exec-args parameter.
The first argument should be the path of a new QEMU binary, or a prefix
command that exec's the new QEMU binary, and the arguments should include
the -incoming option.
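
For example, to run the new binary under a prefix command (binary paths
here are illustrative only):

  (qemu) migrate_set_parameter cpr-exec-args /usr/bin/numactl --membind=0
  qemu-kvm ... -incoming file:vm.state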

Memory backend objects must have the share=on attribute, and must be mmap'able
in the new QEMU process.  For example, memory-backend-file is acceptable,
but memory-backend-ram is not.

QEMU must be started with the '-machine memfd-alloc=on' option.  This causes
implicit RAM blocks (those not explicitly described by a memory-backend
object) to be allocated by mmap'ing a memfd.  Examples include VGA, ROM,
and even guest RAM when it is specified without reference to a
memory-backend object.  The memfds are kept open across exec, their values
are saved in vmstate which is retrieved after exec, and they are re-mmap'd.
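
As a minimal sketch of that mechanism in plain POSIX/Linux C (names are
invented and error cleanup is omitted; this is not code from the series):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate a RAM block from a memfd and keep the fd open across exec. */
static void *alloc_ram_memfd(const char *name, size_t size, int *fd_out)
{
    int fd = memfd_create(name, 0);

    if (fd < 0 || ftruncate(fd, size) < 0) {
        return NULL;
    }
    /* Clear FD_CLOEXEC so the descriptor survives exec. */
    fcntl(fd, F_SETFD, fcntl(fd, F_GETFD) & ~FD_CLOEXEC);
    *fd_out = fd;    /* this value is recorded in precreate vmstate */
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* After exec, new QEMU reads the fd value from vmstate and re-mmaps it. */
static void *remap_ram_memfd(int fd, size_t size)
{
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}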

The '-only-migratable-modes cpr-exec' option guarantees that the
configuration supports cpr-exec.  QEMU will exit at start time if not.

Example:

In this example, we simply restart the same version of QEMU, but in
a real scenario one would set a new QEMU binary path in cpr-exec-args.

  # qemu-kvm -monitor stdio -object
  memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
  -m 4G -machine memfd-alloc=on ...

  QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
  (qemu) migrate_set_parameter mode cpr-exec
  (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming file:vm.state
  (qemu) migrate -d file:vm.state
  (qemu) QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running

cpr-exec mode preserves attributes of outgoing devices that must be known
before the device is created on the incoming side, such as the memfd descriptor
number, but currently the migration stream is read after all devices are
created.  To solve this problem, I add two VMStateDescription options:
precreate and factory.  precreate objects are saved to their own migration
stream, distinct from the main stream, and are read early by incoming QEMU,
before devices are created.  Factory objects are allocated on demand, without
relying on a pre-registered object's opaque address, which is necessary
because the devices to which the state will apply have not been created yet
and hence have not registered an opaque address to receive the state.
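
As a rough sketch, a precreate/factory vmstate might be declared as follows.
The precreate and factory flags are the new options described here; the
state type and field names are invented for illustration and do not match
the actual patches:

typedef struct RamBlockState {
    int32_t fd;               /* preserved memfd value */
} RamBlockState;

static const VMStateDescription vmstate_ram_block_cpr = {
    .name = "RAMBlock",
    .precreate = true,        /* saved to the separate early stream */
    .factory = true,          /* instance allocated on demand at load time */
    .fields = (VMStateField[]) {
        VMSTATE_INT32(fd, RamBlockState),
        VMSTATE_END_OF_LIST()
    },
};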

This patch series implements a minimal version of cpr-exec.  Future series
will add support for:
  * vfio
  * chardev's without loss of connectivity
  * vhost
  * fine-grained seccomp controls
  * hostmem-memfd
  * cpr-exec migration test


Steve Sistare (26):
  oslib: qemu_clear_cloexec
  vl: helper to request re-exec
  migration: SAVEVM_FOREACH
  migration: delete unused parameter mis
  migration: precreate vmstate
  migration: precreate vmstate for exec
  migration: VMStateId
  migration: vmstate_info_void_ptr
  migration: vmstate_register_named
  migration: vmstate_unregister_named
  migration: vmstate_register at init time
  migration: vmstate factory object
  physmem: ram_block_create
  physmem: hoist guest_memfd creation
  physmem: hoist host memory allocation
  physmem: set ram block idstr earlier
  machine: memfd-alloc option
  migration: cpr-exec-args parameter
  physmem: preserve ram blocks for cpr
  migration: cpr-exec mode
  migration: migrate_add_blocker_mode
  migration: ram block cpr-exec blockers
  migration: misc cpr-exec blockers
  seccomp: cpr-exec blocker
  migration: fix mismatched GPAs during cpr-exec
  migration: only-migratable-modes

 accel/xen/xen-all.c            |   5 +
 backends/hostmem-epc.c         |  12 +-
 hmp-commands.hx                |   2 +-
 hw/core/machine.c              |  22 +++
 hw/core/qdev.c                 |   1 +
 hw/intc/apic_common.c          |   2 +-
 hw/vfio/migration.c            |   3 +-
 include/exec/cpu-common.h      |   3 +-
 include/exec/memory.h          |  15 ++
 include/exec/ramblock.h        |  10 +-
 include/hw/boards.h            |   1 +
 include/migration/blocker.h    |   7 +
 include/migration/cpr.h        |  14 ++
 include/migration/misc.h       |  11 ++
 include/migration/vmstate.h    | 133 +++++++++++++++-
 include/qemu/osdep.h           |   9 ++
 include/sysemu/runstate.h      |   3 +
 include/sysemu/seccomp.h       |   1 +
 include/sysemu/sysemu.h        |   1 -
 migration/cpr.c                | 131 ++++++++++++++++
 migration/meson.build          |   3 +
 migration/migration-hmp-cmds.c |  50 +++++-
 migration/migration.c          |  48 +++++-
 migration/migration.h          |   5 +-
 migration/options.c            |  13 ++
 migration/precreate.c          | 139 +++++++++++++++++
 migration/ram.c                |  16 +-
 migration/savevm.c             | 306 +++++++++++++++++++++++++++++-------
 migration/savevm.h             |   3 +
 migration/trace-events         |   7 +
 migration/vmstate-factory.c    |  78 ++++++++++
 migration/vmstate-types.c      |  24 +++
 migration/vmstate.c            |   3 +-
 qapi/migration.json            |  48 +++++-
 qemu-options.hx                |  22 ++-
 replay/replay.c                |   6 +
 stubs/migr-blocker.c           |   5 +
 stubs/vmstate.c                |  13 ++
 system/globals.c               |   1 -
 system/memory.c                |  19 ++-
 system/physmem.c               | 346 +++++++++++++++++++++++++++--------------
 system/qemu-seccomp.c          |  10 +-
 system/runstate.c              |  29 ++++
 system/trace-events            |   4 +
 system/vl.c                    |  26 +++-
 target/s390x/cpu_models.c      |   4 +-
 util/oslib-posix.c             |   9 ++
 util/oslib-win32.c             |   4 +
 48 files changed, 1417 insertions(+), 210 deletions(-)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr.c
 create mode 100644 migration/precreate.c
 create mode 100644 migration/vmstate-factory.c

Comments

Steven Sistare May 2, 2024, 4:13 p.m. UTC | #1
On 4/29/2024 11:55 AM, Steve Sistare wrote:
> This patch series adds the live migration cpr-exec mode.

Here is the text I plan to add to docs/devel/migration/CPR.rst.  It is
premature for me to submit this as a patch, because it includes all
the functionality I plan to add in this and future series, but it may
help you while reviewing this series.

- Steve

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

cpr-exec mode
---------------

In this mode, QEMU stops the VM, writes VM state to the migration
URI, and directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID.  Guest RAM is
preserved in place, albeit with new virtual addresses.  The user
completes the migration by specifying the ``-incoming`` option, and
by issuing the ``migrate-incoming`` command if necessary; see details
below.

This mode supports vfio devices by preserving the device descriptors,
and hence kernel state, across the exec, even for devices that do not
support live migration.  It also preserves tap and vhost descriptors.

cpr-exec also preserves descriptors for a subset of chardevs,
including socket, file, parallel, pipe, serial, pty, stdio, and null.
Chardevs that support cpr-exec have the QEMU_CHAR_FEATURE_CPR flag set
in the Chardev object.  The client side of a preserved chardev sees no
loss of connectivity during cpr-exec.  More chardevs could be
preserved with additional development.

All chardevs have a ``reopen-on-cpr`` option which causes the chardev
to be closed and reopened during cpr-exec.  This can be set to allow
cpr-exec when the configuration includes a chardev (such as vc) that
does not have QEMU_CHAR_FEATURE_CPR.
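
For example, a hypothetical command line that sets the option on a vc
chardev:

::

   # qemu-kvm ... -chardev vc,id=vc0,reopen-on-cpr=on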

Because the old and new QEMU instances are not active concurrently,
the URI cannot be a type that streams data from one instance to the
other.

Usage
^^^^^

Arguments for the new QEMU process are taken from the
``cpr-exec-args`` parameter.  The first argument should be the
path of a new QEMU binary, or a prefix command that exec's the
new QEMU binary, and the arguments should include the ``-incoming``
option.

Memory backend objects must have the ``share=on`` attribute, and
must be mmap'able in the new QEMU process.  For example,
memory-backend-file is acceptable, but memory-backend-ram is
not.

The VM must be started with the ``-machine memfd-alloc=on``
option.  This causes implicit RAM blocks (those not explicitly
described by a memory-backend object) to be allocated by
mmap'ing a memfd.  Examples include VGA, ROM, and even guest
RAM when it is specified without reference to a
memory-backend object.

Add the ``-only-migratable-modes cpr-exec`` option to guarantee that
the configuration supports cpr-exec.  QEMU will exit at start time
if not.

Outgoing:
   * Set the migration mode parameter to ``cpr-exec``.
   * Set the ``cpr-exec-args`` parameter.
   * Issue the ``migrate`` command.  It is recommended that the URI be
     a ``file`` type, but one can use other types such as ``exec``,
     provided the command captures all the data from the outgoing side,
     and provides all the data to the incoming side.

Incoming:
   * You do not need to explicitly start the new QEMU.  It is started as
     a side effect of the migrate command above.
   * If the VM was running when the outgoing ``migrate`` command was
     issued, then QEMU automatically resumes VM execution.

Example 1: incoming URI
^^^^^^^^^^^^^^^^^^^^^^^

In these examples, we simply restart the same version of QEMU, but in
a real scenario one would set a new QEMU binary path in cpr-exec-args.

::

   # qemu-kvm -monitor stdio
   -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
   -m 4G
   -machine memfd-alloc=on
   ...

   QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running
   (qemu) migrate_set_parameter mode cpr-exec
   (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming file:vm.state
   (qemu) migrate -d file:vm.state
   (qemu) QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running

Example 2: incoming defer
^^^^^^^^^^^^^^^^^^^^^^^^^
::

   # qemu-kvm -monitor stdio
   -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
   -m 4G
   -machine memfd-alloc=on
   ...

   QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running
   (qemu) migrate_set_parameter mode cpr-exec
   (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming defer
   (qemu) migrate -d file:vm.state
   (qemu) QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: paused (inmigrate)
   (qemu) migrate_incoming file:vm.state
   (qemu) info status
   VM status: running


Caveats
^^^^^^^

cpr-exec mode may not be used with postcopy, background-snapshot,
or COLO.

cpr-exec mode requires permission to use the exec system call, which
is denied by certain sandbox options, such as ``spawn=deny``.  Use
finer-grained controls to allow exec, e.g.:
``-sandbox on,fork=deny,ns=deny,exec=allow``

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Peter Xu May 2, 2024, 6:15 p.m. UTC | #2
On Thu, May 02, 2024 at 12:13:17PM -0400, Steven Sistare wrote:
> On 4/29/2024 11:55 AM, Steve Sistare wrote:
> > This patch series adds the live migration cpr-exec mode.
> 
> Here is the text I plan to add to docs/devel/migration/CPR.rst.  It is
> premature for me to submit this as a patch, because it includes all
> the functionality I plan to add in this and future series, but it may
> help you while reviewing this series.

I haven't reached this series at all yet but thanks for sending this,
definitely helpful for reviews.  I almost tried to ask for it. :)

I don't think it's an issue to send doc updates without full
implementations ready.  We can still mark things as TBD even in the doc
IMHO, and it may help to provide a better picture of the whole thing if
e.g. this series only implements part of it, to either reviewers or users
(for the latter, if the partially implemented feature can already be
consumed).

Thanks,
Steven Sistare May 20, 2024, 6:30 p.m. UTC | #3
Hi Peter, Hi Fabiano,
   Will you have time to review the migration guts of this series any time soon?
In particular:

[PATCH V1 05/26] migration: precreate vmstate
[PATCH V1 06/26] migration: precreate vmstate for exec
[PATCH V1 12/26] migration: vmstate factory object
[PATCH V1 18/26] migration: cpr-exec-args parameter
[PATCH V1 20/26] migration: cpr-exec mode

- Steve

On 4/29/2024 11:55 AM, Steve Sistare wrote:
> This patch series adds the live migration cpr-exec mode. [...]
> [remainder of cover letter trimmed]
Fabiano Rosas May 20, 2024, 10:28 p.m. UTC | #4
Steven Sistare <steven.sistare@oracle.com> writes:

> Hi Peter, Hi Fabiano,
>    Will you have time to review the migration guts of this series any time soon?
> In particular:
>
> [PATCH V1 05/26] migration: precreate vmstate
> [PATCH V1 06/26] migration: precreate vmstate for exec
> [PATCH V1 12/26] migration: vmstate factory object
> [PATCH V1 18/26] migration: cpr-exec-args parameter
> [PATCH V1 20/26] migration: cpr-exec mode
>

I'll get to them this week. I'm trying to make some progress with my own
code before I forget how to program. I'm also trying to find some time
to implement the device options in the migration tests so we can stop
these virtio-* breakages that have been popping up.
Peter Xu May 21, 2024, 2:31 a.m. UTC | #5
Conference, then PTO until today, so tomorrow will be my first working
day after those.  Sorry Steve, will try my best to read it before next
week.  I didn't dare to read too much of my inbox yet.  A bit scared,
but need to face it tomorrow.

On Mon, May 20, 2024, 6:28 p.m. Fabiano Rosas <farosas@suse.de> wrote:

> Steven Sistare <steven.sistare@oracle.com> writes:
>
> > Hi Peter, Hi Fabiano,
> >    Will you have time to review the migration guts of this series any
> time soon?
> > In particular:
> >
> > [PATCH V1 05/26] migration: precreate vmstate
> > [PATCH V1 06/26] migration: precreate vmstate for exec
> > [PATCH V1 12/26] migration: vmstate factory object
> > [PATCH V1 18/26] migration: cpr-exec-args parameter
> > [PATCH V1 20/26] migration: cpr-exec mode
> >
>
> I'll get to them this week. I'm trying to make some progress with my own
> code before I forget how to program. I'm also trying to find some time
> to implement the device options in the migration tests so we can stop
> these virtio-* breakages that have been popping up.
>
>
Steven Sistare May 21, 2024, 11:46 a.m. UTC | #6
I understand, thanks.  If I can help with any of your todo list,
just ask - steve

On 5/20/2024 10:31 PM, Peter Xu wrote:
> [full quote of #5 trimmed]
Fabiano Rosas May 24, 2024, 1:02 p.m. UTC | #7
Steve Sistare <steven.sistare@oracle.com> writes:

> This patch series adds the live migration cpr-exec mode. [...]
> [remainder of cover letter trimmed]

Hi Steve,

make check is failing. I applied the series on top of master @
70581940ca (Merge tag 'pull-tcg-20240523' of
https://gitlab.com/rth7680/qemu into staging, 2024-05-23).

$ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/ivshmem-test
...
qemu-system-x86_64: ../system/physmem.c:1634: qemu_ram_verify_idstr:
Assertion `!strcmp(new_block->idstr, idstr)' failed.

$ QTEST_QEMU_BINARY=./qemu-system-x86_64 \
./tests/qtest/test-x86-cpuid-compat -p \
/x86_64/x86/cpuid/auto-level/pc-2.7
...
qemu-system-x86_64: ../system/physmem.c:1634: qemu_ram_verify_idstr:
Assertion `!strcmp(new_block->idstr, idstr)' failed.

$ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/qmp-cmd-test -p \
/x86_64/qmp/object-add-failure-modes
...
savevm_state_handler_insert: Detected duplicate SaveStateEntry:
id=ram1/RAMBlock, instance_id=0x0
Steven Sistare May 24, 2024, 2:07 p.m. UTC | #8
On 5/24/2024 9:02 AM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> This patch series adds the live migration cpr-exec mode. [...]
>> [remainder of cover letter trimmed]
> 
> Hi Steve,
> 
> make check is failing. I applied the series on top of master @
> 70581940ca (Merge tag 'pull-tcg-20240523' of
> https://gitlab.com/rth7680/qemu into staging, 2024-05-23).
> 
> $ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/ivshmem-test
> ...
> qemu-system-x86_64: ../system/physmem.c:1634: qemu_ram_verify_idstr:
> Assertion `!strcmp(new_block->idstr, idstr)' failed.
> 
> $ QTEST_QEMU_BINARY=./qemu-system-x86_64 \
> ./tests/qtest/test-x86-cpuid-compat -p \
> /x86_64/x86/cpuid/auto-level/pc-2.7
> ...
> qemu-system-x86_64: ../system/physmem.c:1634: qemu_ram_verify_idstr:
> Assertion `!strcmp(new_block->idstr, idstr)' failed.
> 
> $ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/qmp-cmd-test -p \
> /x86_64/qmp/object-add-failure-modes
> ...
> savevm_state_handler_insert: Detected duplicate SaveStateEntry:
> id=ram1/RAMBlock, instance_id=0x0

Thank you very much, I will investigate.

I suspect the vmstate dup error is due to this bug which I hit after
posting the patches:
-------------------------------
diff --git a/migration/savevm.c b/migration/savevm.c
index bb7fd9f..54aa233 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1012,7 +1012,7 @@ void vmstate_unregister_named(const char *vmsd_name,
      SaveStateEntry *se, *new_se;
      VMStateId idstr;

-    snprintf(idstr, sizeof(idstr), "%s/%s", vmsd_name, instance_name);
+    snprintf(idstr, sizeof(idstr), "%s/%s", instance_name, vmsd_name);

      SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
          if (!strcmp(se->idstr, idstr) &&
-----------------------------------

- Steve
Peter Xu May 27, 2024, 5:45 p.m. UTC | #9
On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
> I understand, thanks.  If I can help with any of your todo list,
> just ask - steve

Thanks for offering the help, Steve.  Started looking at this today, then I
found that I'm missing something high-level.  Let me ask here, and let me
apologize already for starting to throw multiple questions..

IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
this case not host kernel but QEMU-only, and/or upper.

Is there any justification on why the complexity is needed here?  It looks
to me this one is more involved than cpr-reboot, so I'm thinking how much
we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
the min support, and if we even expect more to come, that's really
important, IMHO.

For example, what's the major motivation of this whole work?  Is that more
on performance, or is it more for supporting the special devices like VFIO
which we used to not support, or something else?  I can't find them in
whatever cover letter I can find, including this one.

Firstly, regarding performance, IMHO it'll be always nice to share even
some very fundamental downtime measurement comparisons using the new exec
mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
have some number on hand when you started working on this feature years
ago?  Or maybe some old links on the list would help too, as I didn't
follow this work since the start.

On VFIO, IIUC you started out this project without VFIO migration being
there.  Now we have VFIO migration so not sure how much it would work for
the upgrade use case. Even with current VFIO migration, we may not want to
migrate device states for a local upgrade I suppose, as that can be a lot
depending on the type of device assigned.  However it'll be nice to discuss
this too if this is the major purpose of the series.

I think one other challenge on QEMU upgrade with VFIO devices is that the
dest QEMU won't be able to open the VFIO device when the src QEMU is still
using it as the owner.  IIUC this is a similar condition where QEMU wants
to have proper ownership transfer of a shared block device, and AFAIR right
now we resolved that issue using some form of file lock on the image file.
In this case it won't easily apply to a VFIO dev fd, but maybe we still
have other approaches, not sure whether you investigated any.  E.g. could
the VFIO handle be passed over using unix scm rights?  I think this might
remove one dependency of using exec which can cause quite some difference
v.s. a generic migration (from which regard, cpr-reboot is still a pretty
generic migration).

You also mentioned vhost/tap, is that also a major goal of this series in
the follow up patchsets?  Is this a problem only because this solution will
do exec?  Can it work if either the exec()ed qemu or dst qemu create the
vhost/tap fds when boot?

Meanwhile, could you elaborate a bit on the implication on chardevs?  From
what I read in the doc update it looks like a major part of work in the
future, but I don't yet understand the issue..  Is it also relevant to the
exec() approach?

In all cases, some of such discussion would be really appreciated.  And if
you used to consider other approaches to solve this problem it'll be great
to mention how you chose this way.  Considering this work contains too many
things, it'll be nice if such discussion can start with the fundamentals,
e.g. on why exec() is a must.

Thanks,
Peter Xu May 27, 2024, 6:07 p.m. UTC | #10
On Mon, Apr 29, 2024 at 08:55:09AM -0700, Steve Sistare wrote:
> This patch series implements a minimal version of cpr-exec.  Future series
> will add support for:
>   * vfio
>   * chardev's without loss of connectivity
>   * vhost
>   * fine-grained seccomp controls
>   * hostmem-memfd
>   * cpr-exec migration test

Another request besides the questions I threw already: could you push to a
tree that has everything?  Maybe that would also help review of the min set.

Thanks,
Steven Sistare May 28, 2024, 3:10 p.m. UTC | #11
On 5/27/2024 1:45 PM, Peter Xu wrote:
> [full quote of #9 trimmed]

The main goal of cpr-exec is providing a fast and reliable way to update
qemu. cpr-reboot is not fast enough or general enough.  It requires the
guest to support suspend and resume for all devices, and that takes seconds.
If one actually reboots the host, that adds more seconds, depending on
system services.  cpr-exec takes 0.1 secs, and works every time, unlike
live migration, which can fail to converge on a busy system.  Live migration
also consumes more system and network resources.  cpr-exec seamlessly
preserves client connections by preserving chardevs, and overall provides
a much nicer user experience.

chardev's are preserved by keeping their fd open across the exec, and
remembering the value of the fd in precreate vmstate so that new qemu
can associate the fd with the chardev rather than opening a new one.

The approach of preserving open file descriptors is very general and applicable
to all kinds of devices, regardless of whether they support live migration
in hardware.  Device fd's are preserved using the same mechanism as for
chardevs.

Devices that support live migration in hardware do not like to live migrate
in place to the same node.  It is not what they are designed for, and some
implementations will flat out fail because the source and target interfaces
are the same.

For vhost/tap, sometimes the management layer opens the dev and passes an
fd to qemu, and sometimes qemu opens the dev.  The upcoming vhost/tap support
allows both.  For the case where qemu opens the dev, the fd is preserved
using the same mechanism as for chardevs.

The fundamental requirements of this work are:
   - precreate vmstate
   - preserve open file descriptors

Direct exec from old to new qemu is not a hard requirement.  However,
it is simple, with few complications, and works with Oracle's cloud
containers, so it is the method I am most interested in finishing first.

I believe everything could also be made to work by using SCM_RIGHTS to
send fd's to a new qemu process that is started by some external means.
It would be requested with MIG_MODE_CPR_SCM (or some better name), and
would co-exist with MIG_MODE_CPR_EXEC.
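
For reference, the core of such a transfer is the standard SCM_RIGHTS
idiom; a minimal sketch in plain POSIX C (not code from this series):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one open fd over a unix domain socket with SCM_RIGHTS. */
static int send_fd(int sock, int fd)
{
    char dummy = 'F';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0);    /* -1 on error */
}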

- Steve
Peter Xu May 28, 2024, 4:42 p.m. UTC | #12
On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:
> On 5/27/2024 1:45 PM, Peter Xu wrote:
> > On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
> > > I understand, thanks.  If I can help with any of your todo list,
> > > just ask - steve
> > 
> > Thanks for offering the help, Steve.  Started looking at this today, then I
> > found that I miss something high-level.  Let me ask here, and let me
> > apologize already for starting to throw multiple questions..
> > 
> > IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
> > this case not host kernel but QEMU-only, and/or upper.
> > 
> > Is there any justification on why the complexity is needed here?  It looks
> > to me this one is more involved than cpr-reboot, so I'm thinking how much
> > we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
> > the min support, and if we even expect more to come, that's really
> > important, IMHO.
> > 
> > For example, what's the major motivation of this whole work?  Is that more
> > on performance, or is it more for supporting the special devices like VFIO
> > which we used to not support, or something else?  I can't find them in
> > whatever cover letter I can find, including this one.
> > 
> > Firstly, regarding performance, IMHO it'll be always nice to share even
> > some very fundamental downtime measurement comparisons using the new exec
> > mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
> > have some number on hand when you started working on this feature years
> > ago?  Or maybe some old links on the list would help too, as I didn't
> > follow this work since the start.
> > 
> > On VFIO, IIUC you started out this project without VFIO migration being
> > there.  Now we have VFIO migration so not sure how much it would work for
> > the upgrade use case. Even with current VFIO migration, we may not want to
> > migrate device states for a local upgrade I suppose, as that can be a lot
> > depending on the type of device assigned.  However it'll be nice to discuss
> > this too if this is the major purpose of the series.
> > 
> > I think one other challenge on QEMU upgrade with VFIO devices is that the
> > dest QEMU won't be able to open the VFIO device when the src QEMU is still
> > using it as the owner.  IIUC this is a similar condition where QEMU wants
> > to have proper ownership transfer of a shared block device, and AFAIR right
> > now we resolved that issue using some form of file lock on the image file.
> > In this case it won't easily apply to a VFIO dev fd, but maybe we still
> > have other approaches, not sure whether you investigated any.  E.g. could
> > the VFIO handle be passed over using unix scm rights?  I think this might
> > remove one dependency of using exec which can cause quite some difference
> > v.s. a generic migration (from which regard, cpr-reboot is still a pretty
> > generic migration).
> > 
> > You also mentioned vhost/tap, is that also a major goal of this series in
> > the follow up patchsets?  Is this a problem only because this solution will
> > do exec?  Can it work if either the exec()ed qemu or dst qemu create the
> > vhost/tap fds when boot?
> > 
> > Meanwhile, could you elaborate a bit on the implication on chardevs?  From
> > what I read in the doc update it looks like a major part of work in the
> > future, but I don't yet understand the issue..  Is it also relevant to the
> > exec() approach?
> > 
> > In all cases, some of such discussion would be really appreciated.  And if
> > you used to consider other approaches to solve this problem it'll be great
> > to mention how you chose this way.  Considering this work contains too many
> > things, it'll be nice if such discussion can start with the fundamentals,
> > e.g. on why exec() is a must.
> 
> The main goal of cpr-exec is providing a fast and reliable way to update
> qemu. cpr-reboot is not fast enough or general enough.  It requires the
> guest to support suspend and resume for all devices, and that takes seconds.
> If one actually reboots the host, that adds more seconds, depending on
> system services.  cpr-exec takes 0.1 secs, and works every time, unlike
> live migration, which can fail to converge on a busy system.  Live migration
> also consumes more system and network resources.

Right, but note that when I was thinking of a comparison between cpr-exec
v.s. normal migration, I didn't mean a "normal live migration".  I think
it's more of the case whether exec() can be avoided.  I had a feeling that
this exec() will cause a major part of work elsewhere but maybe I am wrong
as I didn't see the whole branch.

AFAIU, "cpr-exec takes 0.1 secs" is a conditional result.  I think it at
least should be relevant to what devices are attached to the VM, right?

E.g., I observed at least two things that can drastically enlarge the
blackout window:

  1) vcpu save/load sometimes can take ridiculously long time, even if 99%
  of them are fine.  I still didn't spend time looking at this issue, but
  the outlier (of a single cpu save/load, while I don't remember whether
  it's save or load, both will contribute to the downtime anyway) can cause
  100+ms already for that single vcpu.  It'll already get more than 0.1sec.

  2) virtio device loads can be sometimes very slow due to virtqueue
  manipulations.  We used to have developers working in this area,
  e.g. this thread:

  https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com

  I haven't yet had time to look further.  Since you mentioned vhost I was
  wondering whether you hit similar issues, and if not, why not.  IIRC it
  was only during VM load, so dest QEMU only.  Again, that'll contribute to
  the overall downtime too and can also be 100ms or more, but that may
  depend on VM memory topology and device setup.

When we compare the solutions, we definitely don't need to make it "live":
it could be a migration starting with the VM already paused, skipping all
dirty tracking just like cpr-reboot, but in this case it can be a relatively
normal migration, so that we still invoke the new qemu binary and load the
state on the fly, perhaps taking the fds via scm rights.  Then compare these
two solutions with/without exec().  Note that I'm not requesting such data;
it wouldn't be fair if that first required a lot of work to implement the
idea, but what I wanted to say is that it might be interesting to first
analyze what caused the downtime, and whether that can be logically
resolved without exec() too; hence the question below on "why exec()" in
the first place, as I still feel like that's something we should avoid
unless extremely necessary..
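
To illustrate, roughly (just a sketch, reusing the file migration syntax
from the cover letter; how the fds get handed over is the open part):

  (qemu) stop
  (qemu) migrate -d file:vm.state

with the new binary started separately with -incoming file:vm.state, and
the preserved fds passed to it over a unix socket via SCM_RIGHTS.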

> cpr-exec seamlessly preserves client connections by preserving chardevs,
> and overall provides a much nicer user experience.

I see.  However this is an issue common to migration in general, am I right?
I mean, if we have some chardevs on the src host, then we migrate the VM
from src to dst, a reconnect will be needed anyway.  It looks to me that as
long as the old live migration is supported, there's already a solution and
apps are ok with reconnecting to the new ports.  From that POV, I am curious
whether this can be seen as a (kind of) separate work besides cpr-exec,
though perhaps a new feature only valid for cpr-exec?

Meanwhile, could you elaborate on what the major user-experience
improvement with the new solution would be?

> 
> chardev's are preserved by keeping their fd open across the exec, and
> remembering the value of the fd in precreate vmstate so that new qemu
> can associate the fd with the chardev rather than opening a new one.
> 
> The approach of preserving open file descriptors is very general and applicable
> to all kinds of devices, regardless of whether they support live migration
> in hardware.  Device fd's are preserved using the same mechanism as for
> chardevs.
> 
> Devices that support live migration in hardware do not like to live migrate
> in place to the same node.  It is not what they are designed for, and some
> implementations will flat out fail because the source and target interfaces
> are the same.
> 
> For vhost/tap, sometimes the management layer opens the dev and passes an
> fd to qemu, and sometimes qemu opens the dev.  The upcoming vhost/tap support
> allows both.  For the case where qemu opens the dev, the fd is preserved
> using the same mechanism as for chardevs.
> 
> The fundamental requirements of this work are:
>   - precreate vmstate
>   - preserve open file descriptors
> 
> Direct exec from old to new qemu is not a hard requirement.

Great to know..

> However, it is simple, with few complications, and works with Oracle's
> cloud containers, so it is the method I am most interested in finishing
> first.
> 
> I believe everything could also be made to work by using SCM_RIGHTS to
> send fd's to a new qemu process that is started by some external means.
> It would be requested with MIG_MODE_CPR_SCM (or some better name), and
> would co-exist with MIG_MODE_CPR_EXEC.

That sounds like a better thing to me, so that the live migration framework
is not changed as drastically.  I still feel like exec() is too powerful, and
evil can reside in it, just like the black magic of fairy tales; magicians
try to avoid using it unless extremely necessary.

I think the next step for my review is to understand what is implied with
exec().  I'll wait for you to push your tree somewhere so maybe I can read
that and understand better.  A base commit would work too if you can share
so I can apply the series, as it doesn't seem to apply to master now.

Thanks,
Steven Sistare May 30, 2024, 5:17 p.m. UTC | #13
On 5/28/2024 12:42 PM, Peter Xu wrote:
> On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:
>> On 5/27/2024 1:45 PM, Peter Xu wrote:
>>> On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
>>>> I understand, thanks.  If I can help with any of your todo list,
>>>> just ask - steve
>>>
>>> Thanks for offering the help, Steve.  Started looking at this today, then I
>>> found that I miss something high-level.  Let me ask here, and let me
>>> apologize already for starting to throw multiple questions..
>>>
>>> IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
>>> this case not host kernel but QEMU-only, and/or upper.
>>>
>>> Is there any justification on why the complexity is needed here?  It looks
>>> to me this one is more involved than cpr-reboot, so I'm thinking how much
>>> we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
>>> the min support, and if we even expect more to come, that's really
>>> important, IMHO.
>>>
>>> For example, what's the major motivation of this whole work?  Is that more
>>> on performance, or is it more for supporting the special devices like VFIO
>>> which we used to not support, or something else?  I can't find them in
>>> whatever cover letter I can find, including this one.
>>>
>>> Firstly, regarding performance, IMHO it'll be always nice to share even
>>> some very fundamental downtime measurement comparisons using the new exec
>>> mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
>>> have some number on hand when you started working on this feature years
>>> ago?  Or maybe some old links on the list would help too, as I didn't
>>> follow this work since the start.
>>>
>>> On VFIO, IIUC you started out this project without VFIO migration being
>>> there.  Now we have VFIO migration so not sure how much it would work for
>>> the upgrade use case. Even with current VFIO migration, we may not want to
>>> migrate device states for a local upgrade I suppose, as that can be a lot
>>> depending on the type of device assigned.  However it'll be nice to discuss
>>> this too if this is the major purpose of the series.
>>>
>>> I think one other challenge on QEMU upgrade with VFIO devices is that the
>>> dest QEMU won't be able to open the VFIO device when the src QEMU is still
>>> using it as the owner.  IIUC this is a similar condition where QEMU wants
>>> to have proper ownership transfer of a shared block device, and AFAIR right
>>> now we resolved that issue using some form of file lock on the image file.
>>> In this case it won't easily apply to a VFIO dev fd, but maybe we still
>>> have other approaches, not sure whether you investigated any.  E.g. could
>>> the VFIO handle be passed over using unix scm rights?  I think this might
>>> remove one dependency of using exec which can cause quite some difference
>>> v.s. a generic migration (from which regard, cpr-reboot is still a pretty
>>> generic migration).
>>>
>>> You also mentioned vhost/tap, is that also a major goal of this series in
>>> the follow up patchsets?  Is this a problem only because this solution will
>>> do exec?  Can it work if either the exec()ed qemu or the dst qemu creates
>>> the vhost/tap fds at boot?
>>>
>>> Meanwhile, could you elaborate a bit on the implication on chardevs?  From
>>> what I read in the doc update it looks like a major part of work in the
>>> future, but I don't yet understand the issue..  Is it also relevant to the
>>> exec() approach?
>>>
>>> In all cases, some discussion of these points would be really appreciated.
>>> And if you considered other approaches to solve this problem, it'll be
>>> great to mention why you chose this one.  Considering this work contains
>>> so many things, it'll be nice if such discussion can start with the
>>> fundamentals, e.g. on why exec() is a must.
>>
>> The main goal of cpr-exec is providing a fast and reliable way to update
>> qemu. cpr-reboot is not fast enough or general enough.  It requires the
>> guest to support suspend and resume for all devices, and that takes seconds.
>> If one actually reboots the host, that adds more seconds, depending on
>> system services.  cpr-exec takes 0.1 secs, and works every time, unlike
>> live migration, which can fail to converge on a busy system.  Live migration
>> also consumes more system and network resources.
> 
> Right, but note that when I was thinking of a comparison between cpr-exec
> v.s. normal migration, I didn't mean a "normal live migration".  I think
> it's more a question of whether exec() can be avoided.  I had a feeling that
> this exec() will drive a major part of the work elsewhere, but maybe I am
> wrong as I didn't see the whole branch.

The only parts of this work that are specific to exec are these patches
and the qemu_clear_cloexec() calls in cpr.c.
   vl: helper to request re-exec
   migration: precreate vmstate for exec

The rest would be the same if some other mechanism were used to start
new qemu.   Additional code would be needed for the new mechanism, such
as SCM_RIGHTS sends.
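
For reference, a minimal sketch of what qemu_clear_cloexec() amounts to
(the inverse of the existing qemu_set_cloexec(); not the exact code from
the series):

  #include <fcntl.h>

  /* Drop FD_CLOEXEC so the descriptor stays open across exec(). */
  static void qemu_clear_cloexec(int fd)
  {
      int flags = fcntl(fd, F_GETFD);
      if (flags >= 0) {
          fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
      }
  }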

> AFAIU, "cpr-exec takes 0.1 secs" is a conditional result.  I think it should
> at least depend on what devices are attached to the VM, right?
>
> E.g., I observed at least two things that can drastically enlarge the
> blackout window:
> 
>    1) vcpu save/load can sometimes take a ridiculously long time, even if 99%
>    of them are fine.  I still haven't spent time looking at this issue, but
>    the outlier (a single vcpu save or load; I don't remember which, though
>    both contribute to the downtime anyway) can already cost 100+ms for that
>    single vcpu.  That alone exceeds 0.1 sec.
> 
>    2) virtio device loads can sometimes be very slow due to virtqueue
>    manipulations.  We used to have developers working in this area,
>    e.g. this thread:
> 
>    https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com
> 
>    I haven't yet had time to look further.  Since you mentioned vhost I was
>    wondering whether you hit similar issues, and if not, why not.  IIRC it
>    was only during VM load, so dest QEMU only.  Again, that'll contribute to
>    the overall downtime too and can also be 100ms or more, but that may
>    depend on VM memory topology and device setup.

100 ms is not a promise; it is an order-of-magnitude characterization, a
typical result.

> When we compare the solutions, we definitely don't need to make it "live":

Agreed.  The key metric is guest blackout time.  In fact, the 100 ms I quote
is blackout time, not elapsed time, though the latter is not much longer.

> it could be a migration starting with the VM already paused, skipping all
> dirty tracking just like cpr-reboot, but in this case it can be a relatively
> normal migration, so that we still invoke the new qemu binary and load the
> state on the fly, perhaps taking the fds via scm rights.  Then compare these
> two solutions with/without exec().  Note that I'm not requesting such data;
> it wouldn't be fair if that first required a lot of work to implement the
> idea, but what I wanted to say is that it might be interesting to first
> analyze what caused the downtime, and whether that can be logically
> resolved without exec() too; hence the question below on "why exec()" in
> the first place, as I still feel like that's something we should avoid
> unless extremely necessary..

Exec is not a key requirement, but it works well.  Please give it fair
consideration.

>> cpr-exec seamlessly preserves client connections by preserving chardevs,
>> and overall provides a much nicer user experience.
> 
> I see.  However this is an issue common to migration in general, am I right?
> I mean, if we have some chardevs on the src host, then we migrate the VM
> from src to dst, a reconnect will be needed anyway.  It looks to me that as
> long as the old live migration is supported, there's already a solution and
> apps are ok with reconnecting to the new ports.

Apps may be OK with it, but I offer a better experience.
To be clear, chardev preservation is a nice feature that is easy to implement
with the cpr-exec framework, but is not the primary motivation for my work.

> From that POV, I am curious
> whether this can be seen as a (kind of) separate work besides cpr-exec,
> though perhaps a new feature only valid for cpr-exec?

You need much of the cpr-exec (or cpr-scm) framework to support it:
a mechanism to preserve the open descriptor, and precreate vmstate to
identify the descriptor for new qemu.
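
To give a feel for the shape of it, a hand-wavy sketch (struct and field
names invented here for illustration; the real patches define the actual
interface):

  typedef struct CprChardev {
      int32_t fd;                /* fd value kept open across exec */
  } CprChardev;

  static const VMStateDescription vmstate_cpr_chardev = {
      .name = "cpr-chardev",
      .precreate = true,         /* loaded before devices are created */
      .fields = (VMStateField[]) {
          VMSTATE_INT32(fd, CprChardev),
          VMSTATE_END_OF_LIST()
      }
  };

New qemu reads this record from the precreate stream and associates the
still-open fd with the chardev instead of opening a new one.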

> Meanwhile, could you elaborate on what the major user-experience
> improvement with the new solution would be?
> 
>>
>> chardev's are preserved by keeping their fd open across the exec, and
>> remembering the value of the fd in precreate vmstate so that new qemu
>> can associate the fd with the chardev rather than opening a new one.
>>
>> The approach of preserving open file descriptors is very general and applicable
>> to all kinds of devices, regardless of whether they support live migration
>> in hardware.  Device fd's are preserved using the same mechanism as for
>> chardevs.
>>
>> Devices that support live migration in hardware do not like to live migrate
>> in place to the same node.  It is not what they are designed for, and some
>> implementations will flat out fail because the source and target interfaces
>> are the same.
>>
>> For vhost/tap, sometimes the management layer opens the dev and passes an
>> fd to qemu, and sometimes qemu opens the dev.  The upcoming vhost/tap support
>> allows both.  For the case where qemu opens the dev, the fd is preserved
>> using the same mechanism as for chardevs.
>>
>> The fundamental requirements of this work are:
>>    - precreate vmstate
>>    - preserve open file descriptors
>>
>> Direct exec from old to new qemu is not a hard requirement.
> 
> Great to know..
> 
>> However, it is simple, with few complications, and works with Oracle's
>> cloud containers, so it is the method I am most interested in finishing
>> first.
>>
>> I believe everything could also be made to work by using SCM_RIGHTS to
>> send fd's to a new qemu process that is started by some external means.
>> It would be requested with MIG_MODE_CPR_SCM (or some better name), and
>> would co-exist with MIG_MODE_CPR_EXEC.
> 
> That sounds like a better thing to me, so that the live migration framework
> is not changed as drastically.  I still feel like exec() is too powerful, and
> evil can reside in it, just like the black magic of fairy tales; magicians
> try to avoid using it unless extremely necessary.

Fork is scarier; it preserves almost everything, with a few exceptions.
Exec destroys almost everything, with a few exceptions.
Please give it a chance.  The theorized cpr-scm would no doubt be useful
for some cloud vendors, but so is cpr-exec.  cpr-scm is intellectually
interesting to me, and I might work on it at some point, but cpr-exec is
what I need for our cloud.
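
For anyone curious, the primitive that cpr-scm would rest on is the
standard SCM_RIGHTS handshake; a self-contained sketch (error handling
omitted, names mine):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Send one open fd to the peer over a connected unix socket. */
  static int send_fd(int sock, int fd)
  {
      char byte = 0;
      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
      union {
          struct cmsghdr hdr;
          char buf[CMSG_SPACE(sizeof(int))];
      } u;
      struct msghdr msg = { 0 };
      struct cmsghdr *cmsg;

      memset(&u, 0, sizeof(u));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = u.buf;
      msg.msg_controllen = sizeof(u.buf);
      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
      return sendmsg(sock, &msg, 0);
  }

The receiver does the matching recvmsg() and reads the fd back out of the
control data; the kernel installs a duplicate descriptor in the receiving
process.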

> I think the next step for my review is to understand what is implied with
> exec().  I'll wait for you to push your tree somewhere so maybe I can read
> that and understand better.  A base commit would work too if you can share
> so I can apply the series, as it doesn't seem to apply to master now.

Try these tracepoints:
-trace enable=qemu_anon_memfd_alloc
-trace enable=ram_block_create
-trace enable='*factory*'
-trace enable='vmstate_*register'
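
For example, appended to the qemu-kvm invocation from the cover letter:

  # qemu-kvm ... -machine memfd-alloc=on \
      -trace enable=qemu_anon_memfd_alloc -trace enable=ram_block_create \
      -trace enable='*factory*' -trace enable='vmstate_*register'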

I sent this to Peter already, but for others' benefit, this series applies to
commit 5da72194df36535d77.

- Steve
Peter Xu May 30, 2024, 7:23 p.m. UTC | #14
On Thu, May 30, 2024 at 01:17:05PM -0400, Steven Sistare wrote:
> On 5/28/2024 12:42 PM, Peter Xu wrote:
> > On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:
> > > On 5/27/2024 1:45 PM, Peter Xu wrote:
> > > > On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
> > > > > I understand, thanks.  If I can help with any of your todo list,
> > > > > just ask - steve
> > > > 
> > > > Thanks for offering the help, Steve.  Started looking at this today, then I
> > > > found that I miss something high-level.  Let me ask here, and let me
> > > > apologize already for starting to throw multiple questions..
> > > > 
> > > > IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
> > > > this case not host kernel but QEMU-only, and/or upper.
> > > > 
> > > > Is there any justification on why the complexity is needed here?  It looks
> > > > to me this one is more involved than cpr-reboot, so I'm thinking how much
> > > > we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
> > > > the min support, and if we even expect more to come, that's really
> > > > important, IMHO.
> > > > 
> > > > For example, what's the major motivation of this whole work?  Is that more
> > > > on performance, or is it more for supporting the special devices like VFIO
> > > > which we used to not support, or something else?  I can't find them in
> > > > whatever cover letter I can find, including this one.
> > > > 
> > > > Firstly, regarding performance, IMHO it'll be always nice to share even
> > > > some very fundamental downtime measurement comparisons using the new exec
> > > > mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
> > > > have some number on hand when you started working on this feature years
> > > > ago?  Or maybe some old links on the list would help too, as I didn't
> > > > follow this work since the start.
> > > > 
> > > > On VFIO, IIUC you started out this project without VFIO migration being
> > > > there.  Now we have VFIO migration so not sure how much it would work for
> > > > the upgrade use case. Even with current VFIO migration, we may not want to
> > > > migrate device states for a local upgrade I suppose, as that can be a lot
> > > > depending on the type of device assigned.  However it'll be nice to discuss
> > > > this too if this is the major purpose of the series.
> > > > 
> > > > I think one other challenge on QEMU upgrade with VFIO devices is that the
> > > > dest QEMU won't be able to open the VFIO device when the src QEMU is still
> > > > using it as the owner.  IIUC this is a similar condition where QEMU wants
> > > > to have proper ownership transfer of a shared block device, and AFAIR right
> > > > now we resolved that issue using some form of file lock on the image file.
> > > > In this case it won't easily apply to a VFIO dev fd, but maybe we still
> > > > have other approaches, not sure whether you investigated any.  E.g. could
> > > > the VFIO handle be passed over using unix scm rights?  I think this might
> > > > remove one dependency of using exec which can cause quite some difference
> > > > v.s. a generic migration (from which regard, cpr-reboot is still a pretty
> > > > generic migration).
> > > > 
> > > > You also mentioned vhost/tap, is that also a major goal of this series in
> > > > the follow up patchsets?  Is this a problem only because this solution will
> > > > do exec?  Can it work if either the exec()ed qemu or the dst qemu creates
> > > > the vhost/tap fds at boot?
> > > > 
> > > > Meanwhile, could you elaborate a bit on the implication on chardevs?  From
> > > > what I read in the doc update it looks like a major part of work in the
> > > > future, but I don't yet understand the issue..  Is it also relevant to the
> > > > exec() approach?
> > > > 
> > > > In all cases, some discussion of these points would be really appreciated.
> > > > And if you considered other approaches to solve this problem, it'll be
> > > > great to mention why you chose this one.  Considering this work contains
> > > > so many things, it'll be nice if such discussion can start with the
> > > > fundamentals, e.g. on why exec() is a must.
> > > 
> > > The main goal of cpr-exec is providing a fast and reliable way to update
> > > qemu. cpr-reboot is not fast enough or general enough.  It requires the
> > > guest to support suspend and resume for all devices, and that takes seconds.
> > > If one actually reboots the host, that adds more seconds, depending on
> > > system services.  cpr-exec takes 0.1 secs, and works every time, unlike
> > > live migration, which can fail to converge on a busy system.  Live migration
> > > also consumes more system and network resources.
> > 
> > Right, but note that when I was thinking of a comparison between cpr-exec
> > v.s. normal migration, I didn't mean a "normal live migration".  I think
> > it's more a question of whether exec() can be avoided.  I had a feeling that
> > this exec() will drive a major part of the work elsewhere, but maybe I am
> > wrong as I didn't see the whole branch.
> 
> The only parts of this work that are specific to exec are these patches
> and the qemu_clear_cloexec() calls in cpr.c.
>   vl: helper to request re-exec
>   migration: precreate vmstate for exec
> 
> The rest would be the same if some other mechanism were used to start
> new qemu.   Additional code would be needed for the new mechanism, such
> as SCM_RIGHTS sends.

Please see my other reply; I feel like there's a chance to avoid more, but I
don't think we've finished discussing e.g. the vga ram implications or the
vfio-pci fd reuse.  So we can keep the discussion there.

> 
> > AFAIU, "cpr-exec takes 0.1 secs" is a conditional result.  I think it should
> > at least depend on what devices are attached to the VM, right?
> > 
> > E.g., I observed at least two things that can drastically enlarge the
> > blackout window:
> > 
> >    1) vcpu save/load can sometimes take a ridiculously long time, even if 99%
> >    of them are fine.  I still haven't spent time looking at this issue, but
> >    the outlier (a single vcpu save or load; I don't remember which, though
> >    both contribute to the downtime anyway) can already cost 100+ms for that
> >    single vcpu.  That alone exceeds 0.1 sec.
> > 
> >    2) virtio device loads can sometimes be very slow due to virtqueue
> >    manipulations.  We used to have developers working in this area,
> >    e.g. this thread:
> > 
> >    https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com
> > 
> >    I haven't yet had time to look further.  Since you mentioned vhost I was
> >    wondering whether you hit similar issues, and if not, why not.  IIRC it
> >    was only during VM load, so dest QEMU only.  Again, that'll contribute to
> >    the overall downtime too and can also be 100ms or more, but that may
> >    depend on VM memory topology and device setup.
> 
> 100 ms is not a promise; it is an order-of-magnitude characterization, a
> typical result.
> 
> > When we compare the solutions, we definitely don't need to make it "live":
> 
> Agreed.  The key metric is guest blackout time.  In fact, the 100 ms I quote
> is blackout time, not elapsed time, though the latter is not much longer.

Here I think what would be interesting is how exec() could help reduce the
blackout time compared to invoking another qemu.

The major device state save/load looks like a shared contribution.  RAM
sharing is also a shared attribute that can be leveraged without the
exec() approach.

FD passover is indeed another good point for reducing the blackout window
(including your vfio vaddr update work), but that also doesn't seem
relevant to exec().

> 
> > it could be a migration starting with the VM already paused, skipping all
> > dirty tracking just like cpr-reboot, but in this case it can be a relatively
> > normal migration, so that we still invoke the new qemu binary and load the
> > state on the fly, perhaps taking the fds via scm rights.  Then compare these
> > two solutions with/without exec().  Note that I'm not requesting such data;
> > it wouldn't be fair if that first required a lot of work to implement the
> > idea, but what I wanted to say is that it might be interesting to first
> > analyze what caused the downtime, and whether that can be logically
> > resolved without exec() too; hence the question below on "why exec()" in
> > the first place, as I still feel like that's something we should avoid
> > unless extremely necessary..
> 
> Exec is not a key requirement, but it works well.  Please give it fair
> consideration.

Right, I think I'm still trying to understand what it can bring.  Even
though I must confess I personally prefer anything but it..  So maybe I'll
be convinced at some point; so far, just not fully.

> 
> > > cpr-exec seamlessly preserves client connections by preserving chardevs,
> > > and overall provides a much nicer user experience.
> > 
> > I see.  However this is an issue common to migration in general, am I right?
> > I mean, if we have some chardevs on the src host, then we migrate the VM
> > from src to dst, a reconnect will be needed anyway.  It looks to me that as
> > long as the old live migration is supported, there's already a solution and
> > apps are ok with reconnecting to the new ports.
> 
> Apps may be OK with it, but I offer a better experience.
> To be clear, chardev preservation is a nice feature that is easy to implement
> with the cpr-exec framework, but is not the primary motivation for my
> work.

E.g., libvirt used to have a connection to a chardev backend; with the
legacy code it would need a reconnect, while now libvirt can avoid that
reconnect operation.  Is that the case?

The issue is I still don't see why it's a major benefit if libvirt already
supports the reconnections; it would be another story if it didn't.  I don't
think chardev usages should be sensitive to performance / reconnects
either?

Meanwhile, do we need to modify all chardev call sites to support this one
by one?  Please bear with me if there are silly questions; I'm not
familiar with that area.

> 
> > From that POV, I am curious
> > whether this can be seen as a (kind of) separate work besides cpr-exec,
> > though perhaps a new feature only valid for cpr-exec?
> 
> You need much of the cpr-exec (or cpr-scm) framework to support it:
> a mechanism to preserve the open descriptor, and precreate vmstate to
> identify the descriptor for new qemu.

Let's see what you think about it when you read the vfio commit on the
pre-opened device.  I feel like it's a pretty good idea to provide such a
generic interface so that fds are more flexibly managed in QEMU.

I'd be more than glad if libvirt could manage all the fds, so that the
precreate approach isn't required, maybe?  That's a major tricky part of
this series that I feel nervous about, besides exec() itself.  I'm not sure
whether that can also extend to chardevs, but there'll be a similar question
of whether it'll be worthwhile to avoid the reconnection if major effort
is needed.

> 
> > Meanwhile, could you elaborate on what the major user-experience
> > improvement with the new solution would be?
> > 
> > > 
> > > chardev's are preserved by keeping their fd open across the exec, and
> > > remembering the value of the fd in precreate vmstate so that new qemu
> > > can associate the fd with the chardev rather than opening a new one.
> > > 
> > > The approach of preserving open file descriptors is very general and applicable
> > > to all kinds of devices, regardless of whether they support live migration
> > > in hardware.  Device fd's are preserved using the same mechanism as for
> > > chardevs.
> > > 
> > > Devices that support live migration in hardware do not like to live migrate
> > > in place to the same node.  It is not what they are designed for, and some
> > > implementations will flat out fail because the source and target interfaces
> > > are the same.
> > > 
> > > For vhost/tap, sometimes the management layer opens the dev and passes an
> > > fd to qemu, and sometimes qemu opens the dev.  The upcoming vhost/tap support
> > > allows both.  For the case where qemu opens the dev, the fd is preserved
> > > using the same mechanism as for chardevs.
> > > 
> > > The fundamental requirements of this work are:
> > >    - precreate vmstate
> > >    - preserve open file descriptors
> > > 
> > > Direct exec from old to new qemu is not a hard requirement.
> > 
> > Great to know..
> > 
> > > However, it is simple, with few complications, and works with Oracle's
> > > cloud containers, so it is the method I am most interested in finishing
> > > first.
> > > 
> > > I believe everything could also be made to work by using SCM_RIGHTS to
> > > send fd's to a new qemu process that is started by some external means.
> > > It would be requested with MIG_MODE_CPR_SCM (or some better name), and
> > > would co-exist with MIG_MODE_CPR_EXEC.
> > 
> > That sounds like a better thing to me, so that the live migration framework
> > is not changed as drastically.  I still feel like exec() is too powerful, and
> > evil can reside in it, just like the black magic of fairy tales; magicians
> > try to avoid using it unless extremely necessary.
> 
> Fork is scarier; it preserves almost everything, with a few exceptions.
> Exec destroys almost everything, with a few exceptions.

Hmm, this is a very interesting angle on the syscalls, thanks.  And OTOH..
I'm definitely not suggesting fork()..

> Please give it a chance.  The theorized cpr-scm would no doubt be useful
> for some cloud vendors, but so is cpr-exec.  cpr-scm is intellectually
> interesting to me, and I might work on it at some point, but cpr-exec is
> what I need for our cloud.

I kind of understand, and as an individual who has worked with you on
multiple series I have my own personal feelings.  You're definitely one of
the good developers I've been working with, if not one of the great ones.
It's all about the hat, not the red one..

CPR has been floating around for too long, and part of that was because
there weren't enough people reviewing, for which I'd blame QEMU if the
project were a "person", and a person can "die" if he/she does too many
wrong things.

However, from a company's pov, that's the trade-off of an upstreaming-first
approach, and the company needs to make a decision regardless of community
behavior, I guess.  Whereas when a company decides to go ahead without
upstreaming, there's the risk of "tech debt".

Please keep convincing us that cpr-exec is the best.  I don't think we have
a conclusion yet.

> 
> > I think the next step for my review is to understand what is implied with
> > exec().  I'll wait for you to push your tree somewhere so maybe I can read
> > that and understand better.  A base commit would work too if you can share
> > so I can apply the series, as it doesn't seem to apply to master now.
> 
> Try these tracepoints:
> -trace enable=qemu_anon_memfd_alloc
> -trace enable=ram_block_create
> -trace enable='*factory*'
> -trace enable='vmstate_*register'
> 
> I sent this to Peter already, but for others' benefit, this series applies to
> commit 5da72194df36535d77.

Yes, thanks.