
[v3,1/6] migration: Allow immutable device state to be migrated early (i.e., before RAM)

Message ID 20221222110215.130392-2-david@redhat.com (mailing list archive)
State New, archived
Series virtio-mem: Handle preallocation with migration

Commit Message

David Hildenbrand Dec. 22, 2022, 11:02 a.m. UTC
For virtio-mem, we want to have the plugged/unplugged state of memory
blocks available before migrating any actual RAM content. This
information is immutable on the migration source while migration is active.

For example, we want to use this information for proper preallocation
support with migration: currently, we don't preallocate memory on the
migration target, and especially with hugetlb, we can easily run out of
hugetlb pages during RAM migration and will crash (SIGBUS) instead of
catching this gracefully via preallocation.

Migrating device state before we start iterating is currently impossible.
Introduce and use qemu_savevm_state_start_precopy(), and use
a new special migration priority -- MIG_PRI_POST_SETUP -- to decide whether
state will be saved in qemu_savevm_state_start_precopy() or in
qemu_savevm_state_complete_precopy_*().

We have to take care of properly including the early device state in the
vmdesc. Relying on migrate_get_current() to temporarily store the vmdesc is
a bit sub-optimal, but we use that explicitly or implicitly all over the
place already, so this barely matters in practice.

Note that only select devices (i.e., ones seriously messing with
RAM setup) are supposed to make use of that.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/migration/vmstate.h |   7 +++
 migration/migration.c       |  13 +++++
 migration/migration.h       |   4 ++
 migration/savevm.c          | 104 +++++++++++++++++++++++++++---------
 migration/savevm.h          |   1 +
 5 files changed, 104 insertions(+), 25 deletions(-)
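
For illustration, a device would opt into the early phase by tagging its
vmstate with the new priority -- a minimal sketch based on the commit
message above; the device and its fields are hypothetical, and the exact
priority semantics are as defined by this patch:

static const VMStateDescription vmstate_foo_device_early = {
    .name = "foo-device-early",
    .version_id = 1,
    .minimum_version_id = 1,
    /* Saved in qemu_savevm_state_start_precopy(), before RAM. */
    .priority = MIG_PRI_POST_SETUP,
    .fields = (VMStateField[]) {
        /* Only state that is immutable while migration is active. */
        VMSTATE_UINT64(size, FooDeviceState),
        VMSTATE_END_OF_LIST()
    },
};

All other vmstates keep being saved in qemu_savevm_state_complete_precopy_*(),
as before.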

Comments

David Hildenbrand Dec. 23, 2022, 9:34 a.m. UTC | #1
On 22.12.22 12:02, David Hildenbrand wrote:
> For virtio-mem, we want to have the plugged/unplugged state of memory
> blocks available before migrating any actual RAM content. This
> information is immutable on the migration source while migration is active,
> 
> For example, we want to use this information for proper preallocation
> support with migration: currently, we don't preallocate memory on the
> migration target, and especially with hugetlb, we can easily run out of
> hugetlb pages during RAM migration and will crash (SIGBUS) instead of
> catching this gracefully via preallocation.
> 
> Migrating device state before we start iterating is currently impossible.
> Introduce and use qemu_savevm_state_start_precopy(), and use
> a new special migration priority -- MIG_PRI_POST_SETUP -- to decide whether
> state will be saved in qemu_savevm_state_start_precopy() or in
> qemu_savevm_state_complete_precopy_*().
> 
> We have to take care of properly including the early device state in the
> vmdesc. Relying on migrate_get_current() to temporarily store the vmdesc is
> a bit sub-optimal, but we use that explicitly or implicitly all over the
> place already, so this barely matters in practice.
> 
> Note that only very selected devices (i.e., ones seriously messing with
> RAM setup) are supposed to make use of that.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

[...]

>   
>       if (inactivate_disks) {
> @@ -1427,6 +1474,10 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>           qemu_put_buffer(f, (uint8_t *)json_writer_get(vmdesc), vmdesc_len);
>       }
>   
> +    /* Free it now to detect any inconsistencies. */
> +    g_free(vmdesc);

I missed converting that to json_writer_free().
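
That is, the fixup to squash into v4 (a sketch):

-    g_free(vmdesc);
+    json_writer_free(vmdesc);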
Peter Xu Jan. 4, 2023, 5:23 p.m. UTC | #2
On Thu, Dec 22, 2022 at 12:02:10PM +0100, David Hildenbrand wrote:
> Migrating device state before we start iterating is currently impossible.
> Introduce and use qemu_savevm_state_start_precopy(), and use
> a new special migration priority -- MIG_PRI_POST_SETUP -- to decide whether
> state will be saved in qemu_savevm_state_start_precopy() or in
> qemu_savevm_state_complete_precopy_*().

Can something like this be done in qemu_savevm_state_setup()?

Thanks,
Michael S. Tsirkin Jan. 5, 2023, 1:27 a.m. UTC | #3
On Fri, Dec 23, 2022 at 10:34:36AM +0100, David Hildenbrand wrote:
> On 22.12.22 12:02, David Hildenbrand wrote:
> > For virtio-mem, we want to have the plugged/unplugged state of memory
> > blocks available before migrating any actual RAM content. This
> > information is immutable on the migration source while migration is active,
> > 
> > For example, we want to use this information for proper preallocation
> > support with migration: currently, we don't preallocate memory on the
> > migration target, and especially with hugetlb, we can easily run out of
> > hugetlb pages during RAM migration and will crash (SIGBUS) instead of
> > catching this gracefully via preallocation.
> > 
> > Migrating device state before we start iterating is currently impossible.
> > Introduce and use qemu_savevm_state_start_precopy(), and use
> > a new special migration priority -- MIG_PRI_POST_SETUP -- to decide whether
> > state will be saved in qemu_savevm_state_start_precopy() or in
> > qemu_savevm_state_complete_precopy_*().
> > 
> > We have to take care of properly including the early device state in the
> > vmdesc. Relying on migrate_get_current() to temporarily store the vmdesc is
> > a bit sub-optimal, but we use that explicitly or implicitly all over the
> > place already, so this barely matters in practice.
> > 
> > Note that only very selected devices (i.e., ones seriously messing with
> > RAM setup) are supposed to make use of that.
> > 
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> [...]
> 
> >       if (inactivate_disks) {
> > @@ -1427,6 +1474,10 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> >           qemu_put_buffer(f, (uint8_t *)json_writer_get(vmdesc), vmdesc_len);
> >       }
> > +    /* Free it now to detect any inconsistencies. */
> > +    g_free(vmdesc);
> 
> Missed to convert that to a json_writer_free().


I take it you will post v4?

> -- 
> Thanks,
> 
> David / dhildenb
David Hildenbrand Jan. 5, 2023, 8:20 a.m. UTC | #4
On 05.01.23 02:27, Michael S. Tsirkin wrote:
> On Fri, Dec 23, 2022 at 10:34:36AM +0100, David Hildenbrand wrote:
>> On 22.12.22 12:02, David Hildenbrand wrote:
>>> For virtio-mem, we want to have the plugged/unplugged state of memory
>>> blocks available before migrating any actual RAM content. This
>>> information is immutable on the migration source while migration is active,
>>>
>>> For example, we want to use this information for proper preallocation
>>> support with migration: currently, we don't preallocate memory on the
>>> migration target, and especially with hugetlb, we can easily run out of
>>> hugetlb pages during RAM migration and will crash (SIGBUS) instead of
>>> catching this gracefully via preallocation.
>>>
>>> Migrating device state before we start iterating is currently impossible.
>>> Introduce and use qemu_savevm_state_start_precopy(), and use
>>> a new special migration priority -- MIG_PRI_POST_SETUP -- to decide whether
>>> state will be saved in qemu_savevm_state_start_precopy() or in
>>> qemu_savevm_state_complete_precopy_*().
>>>
>>> We have to take care of properly including the early device state in the
>>> vmdesc. Relying on migrate_get_current() to temporarily store the vmdesc is
>>> a bit sub-optimal, but we use that explicitly or implicitly all over the
>>> place already, so this barely matters in practice.
>>>
>>> Note that only very selected devices (i.e., ones seriously messing with
>>> RAM setup) are supposed to make use of that.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> [...]
>>
>>>        if (inactivate_disks) {
>>> @@ -1427,6 +1474,10 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>>>            qemu_put_buffer(f, (uint8_t *)json_writer_get(vmdesc), vmdesc_len);
>>>        }
>>> +    /* Free it now to detect any inconsistencies. */
>>> +    g_free(vmdesc);
>>
>> Missed to convert that to a json_writer_free().
> 
> 
> I get it you will post v4?

Yes, once the discussions on this version are done.
David Hildenbrand Jan. 5, 2023, 8:35 a.m. UTC | #5
On 04.01.23 18:23, Peter Xu wrote:
> On Thu, Dec 22, 2022 at 12:02:10PM +0100, David Hildenbrand wrote:
>> Migrating device state before we start iterating is currently impossible.
>> Introduce and use qemu_savevm_state_start_precopy(), and use
>> a new special migration priority -- MIG_PRI_POST_SETUP -- to decide whether
>> state will be saved in qemu_savevm_state_start_precopy() or in
>> qemu_savevm_state_complete_precopy_*().
> 
> Can something like this be done in qemu_savevm_state_setup()?

Hi Peter,

Do you mean

(a) Moving qemu_savevm_state_start_precopy() effectively into
     qemu_savevm_state_setup()

(b) Using se->ops->save_setup()

I first tried going via (b), but decided to go with the current approach 
of using a proper vmstate with properties (instead of, e.g., filling the 
stream manually), which also made vmdesc handling possible (and 
significantly cleaner).

Regarding (a), I decided not to move the logic of 
qemu_savevm_state_start_precopy() into qemu_savevm_state_setup(), 
because it looked cleaner to save device state with the BQL held and, 
for background snapshots, after the VM has been stopped -- and to 
decouple device state saving from the setup path, just like we do it 
right now for all vmstates.

Having that said, for virtio-mem, it would still work because that state 
is immutable once migration starts, but it felt cleaner to separate the 
setup() phase from actual device state saving.

Thanks!
Peter Xu Jan. 5, 2023, 5:15 p.m. UTC | #6
On Thu, Jan 05, 2023 at 09:35:54AM +0100, David Hildenbrand wrote:
> On 04.01.23 18:23, Peter Xu wrote:
> > On Thu, Dec 22, 2022 at 12:02:10PM +0100, David Hildenbrand wrote:
> > > Migrating device state before we start iterating is currently impossible.
> > > Introduce and use qemu_savevm_state_start_precopy(), and use
> > > a new special migration priority -- MIG_PRI_POST_SETUP -- to decide whether
> > > state will be saved in qemu_savevm_state_start_precopy() or in
> > > qemu_savevm_state_complete_precopy_*().
> > 
> > Can something like this be done in qemu_savevm_state_setup()?
> 
> Hi Peter,

Hi, David,

> 
> Do you mean
> 
> (a) Moving qemu_savevm_state_start_precopy() effectively into
>     qemu_savevm_state_setup()
> 
> (b) Using se->ops->save_setup()

I meant (b).

> 
> I first tried going via (b), but decided to go the current way of using a
> proper vmstate with properties (instead of e.g., filling the stream
> manually), which also made vmdesc handling possible (and significantly
> cleaner).
> 
> Regarding (a), I decided to not move logic of
> qemu_savevm_state_start_precopy() into qemu_savevm_state_setup(), because it
> looked cleaner to save device state with the BQL held and for background
> snapshots, the VM has been stopped. To decouple device state saving from the
> setup path, just like we do it right now for all vmstates.

Is the BQL required or optional?  IIUC it's at least still not taken in the
migration thread path, only in the savevm path.

> 
> Having that said, for virtio-mem, it would still work because that state is
> immutable once migration starts, but it felt cleaner to separate the setup()
> phase from actual device state saving.

I get the point.  My major concerns are:

  (1) The new migration priority changes the semantics of the original
      priority field, making it over-complicated

  (2) The new precopy-start routine adds one more step to the migration
      framework, while it somewhat overlaps with (if not mostly
      duplicates) save_setup().

For (1): the old priority was only deciding the order of save entries in
the global list, nothing more than that.  Even if we want to have a
precopy-start phase, I'd suggest we use something else and keep the
migration priority simple.  Otherwise we really need serious documentation
for MigrationPriority, and in that case I'd rather not bother and not reuse
the priority field.

For (2), there are a bunch of save_setup() implementations that already do
things like transferring static data besides the device states.  Besides
the notorious ram_save_setup() there's also dirty_bitmap_save_setup(), which
sends a bitmap during save_setup(), and some others.  It looks clean to
me to do it the same way as we used to.

Reusing vmstate_save() and the vmsd structures is useful too, which I totally
agree with.  So.. can we just call vmstate_save_state() in the save_setup() of
the other, new vmsd of virtio-mem?

Thanks,
David Hildenbrand Jan. 9, 2023, 2:34 p.m. UTC | #7
On 05.01.23 18:15, Peter Xu wrote:
> On Thu, Jan 05, 2023 at 09:35:54AM +0100, David Hildenbrand wrote:
>> On 04.01.23 18:23, Peter Xu wrote:
>>> On Thu, Dec 22, 2022 at 12:02:10PM +0100, David Hildenbrand wrote:
>>>> Migrating device state before we start iterating is currently impossible.
>>>> Introduce and use qemu_savevm_state_start_precopy(), and use
>>>> a new special migration priority -- MIG_PRI_POST_SETUP -- to decide whether
>>>> state will be saved in qemu_savevm_state_start_precopy() or in
>>>> qemu_savevm_state_complete_precopy_*().
>>>
>>> Can something like this be done in qemu_savevm_state_setup()?
>>
>> Hi Peter,
> 
> Hi, David,
> 
>>
>> Do you mean
>>
>> (a) Moving qemu_savevm_state_start_precopy() effectively into
>>      qemu_savevm_state_setup()
>>
>> (b) Using se->ops->save_setup()
> 
> I meant (b).
> 
>>
>> I first tried going via (b), but decided to go the current way of using a
>> proper vmstate with properties (instead of e.g., filling the stream
>> manually), which also made vmdesc handling possible (and significantly
>> cleaner).
>>
>> Regarding (a), I decided to not move logic of
>> qemu_savevm_state_start_precopy() into qemu_savevm_state_setup(), because it
>> looked cleaner to save device state with the BQL held and for background
>> snapshots, the VM has been stopped. To decouple device state saving from the
>> setup path, just like we do it right now for all vmstates.
> 
> Is BQL required or optional?  IIUC it's at least still not taken in the
> migration thread path, only in savevm path.
> 
>>
>> Having that said, for virtio-mem, it would still work because that state is
>> immutable once migration starts, but it felt cleaner to separate the setup()
>> phase from actual device state saving.
> 
> I get the point.  My major concerns are:
> 
>    (1) The new migration priority is changing the semantic of original,
>        making it over-complicated
> 
>    (2) The new precopy-start routine added one more step to the migration
>        framework, while it's somehow overlapping (if not to say, mostly the
>        same as..) save_setup().
> 
> For (1): the old priority was only deciding the order of save entries in
> the global list, nothing more than that.  Even if we want to have a
> precopy-start phase, I'd suggest we use something else and keep the
> migration priority simple.  Otherwise we really need serious documentation
> for MigrationPriority and if so I'd rather don't bother and not reuse the
> priority field.
> 
> For (2), if you see there're a bunch of save_setup() that already does
> things like transferring static data besides the device states.  Besides
> the notorious ram_save_setup() there's also dirty_bitmap_save_setup() which
> also sends a bitmap during save_setup() and some others.  It looks clean to
> me to do it in the same way as we used to.
> 
> Reusing vmstate_save() and vmsd structures are useful too which I totally
> agree.  So.. can we just call vmstate_save_state() in the save_setup() of
> the other new vmsd of virtio-mem?


I went halfway that way, by moving stuff into qemu_savevm_state_setup()
and avoiding using a new migration priority. Seems to work:

I think we could go one step further and perform it from a save_setup() callback;
however, I'm not convinced that this gets particularly cleaner (due to the
vmdesc handling, eventually).

If there are strong feelings, though, I can look into that. Thanks.


 From e501f80dbbca1260445a6dac03053f426fbb572d Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 20 Dec 2022 18:14:33 +0100
Subject: [PATCH] migration: Allow immutable device state to be migrated early
  (i.e., before RAM)

For virtio-mem, we want to have the plugged/unplugged state of memory
blocks available before migrating any actual RAM content. This
information is immutable on the migration source while migration is active.

For example, we want to use this information for proper preallocation
support with migration: currently, we don't preallocate memory on the
migration target, and especially with hugetlb, we can easily run out of
hugetlb pages during RAM migration and will crash (SIGBUS) instead of
catching this gracefully via preallocation.

Migrating device state before we start iterating is currently impossible.
Let's allow for migrating such state during the setup phase, indicating
applicable vmstate descriptors using an "immutable" flag.

We have to take care of properly including the early device state in the
vmdesc. Relying on migrate_get_current() to temporarily store the vmdesc is
a bit sub-optimal, but we use that explicitly or implicitly all over the
place already, so this barely matters in practice.

Note that only select devices (i.e., ones seriously messing with
RAM setup) are supposed to make use of that.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  include/migration/vmstate.h |  5 +++
  migration/migration.c       |  4 ++
  migration/migration.h       |  4 ++
  migration/savevm.c          | 85 +++++++++++++++++++++++++++----------
  4 files changed, 75 insertions(+), 23 deletions(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index ad24aa1934..610e4c1e38 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -179,6 +179,11 @@ struct VMStateField {
  struct VMStateDescription {
      const char *name;
      int unmigratable;
+    /*
+     * The state is immutable while migration is active and the state can
+     * be migrated early, during the setup phase.
+     */
+    int immutable;
      int version_id;
      int minimum_version_id;
      MigrationPriority priority;
diff --git a/migration/migration.c b/migration/migration.c
index 52b5d39244..1d33a7efa0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2170,6 +2170,9 @@ void migrate_init(MigrationState *s)
      s->vm_was_running = false;
      s->iteration_initial_bytes = 0;
      s->threshold_size = 0;
+
+    json_writer_free(s->vmdesc);
+    s->vmdesc = NULL;
  }
  
  int migrate_add_blocker_internal(Error *reason, Error **errp)
@@ -4445,6 +4448,7 @@ static void migration_instance_finalize(Object *obj)
      qemu_sem_destroy(&ms->rp_state.rp_sem);
      qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
      error_free(ms->error);
+    json_writer_free(ms->vmdesc);
  }
  
  static void migration_instance_init(Object *obj)
diff --git a/migration/migration.h b/migration/migration.h
index ae4ffd3454..66511ce532 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -17,6 +17,7 @@
  #include "exec/cpu-common.h"
  #include "hw/qdev-core.h"
  #include "qapi/qapi-types-migration.h"
+#include "qapi/qmp/json-writer.h"
  #include "qemu/thread.h"
  #include "qemu/coroutine_int.h"
  #include "io/channel.h"
@@ -366,6 +367,9 @@ struct MigrationState {
       * This save hostname when out-going migration starts
       */
      char *hostname;
+
+    /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
+    JSONWriter *vmdesc;
  };
  
  void migrate_set_state(int *state, int old_state, int new_state);
diff --git a/migration/savevm.c b/migration/savevm.c
index a0cdb714f7..e77f643f52 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -42,7 +42,6 @@
  #include "postcopy-ram.h"
  #include "qapi/error.h"
  #include "qapi/qapi-commands-migration.h"
-#include "qapi/qmp/json-writer.h"
  #include "qapi/clone-visitor.h"
  #include "qapi/qapi-builtin-visit.h"
  #include "qapi/qmp/qerror.h"
@@ -1161,14 +1160,63 @@ bool qemu_savevm_state_guest_unplug_pending(void)
      return false;
  }
  
+static int qemu_savevm_state_precopy_one_non_iterable(SaveStateEntry *se,
+                                                      QEMUFile *f,
+                                                      JSONWriter *vmdesc)
+{
+    int ret;
+
+    if (se->vmsd && !vmstate_save_needed(se->vmsd, se->opaque)) {
+        trace_savevm_section_skip(se->idstr, se->section_id);
+        return 0;
+    }
+
+    trace_savevm_section_start(se->idstr, se->section_id);
+
+    json_writer_start_object(vmdesc, NULL);
+    json_writer_str(vmdesc, "name", se->idstr);
+    json_writer_int64(vmdesc, "instance_id", se->instance_id);
+
+    save_section_header(f, se, QEMU_VM_SECTION_FULL);
+    ret = vmstate_save(f, se, vmdesc);
+    if (ret) {
+        qemu_file_set_error(f, ret);
+        return ret;
+    }
+    trace_savevm_section_end(se->idstr, se->section_id, 0);
+    save_section_footer(f, se);
+
+    json_writer_end_object(vmdesc);
+    return 0;
+}
+
  void qemu_savevm_state_setup(QEMUFile *f)
  {
-    SaveStateEntry *se;
+    MigrationState *ms = migrate_get_current();
      Error *local_err = NULL;
+    SaveStateEntry *se;
+    JSONWriter *vmdesc;
      int ret;
  
+    assert(!ms->vmdesc);
+    ms->vmdesc = vmdesc = json_writer_new(false);
+    json_writer_start_object(vmdesc, NULL);
+    json_writer_int64(vmdesc, "page_size", qemu_target_page_size());
+    json_writer_start_array(vmdesc, "devices");
+
      trace_savevm_state_setup();
      QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (se->vmsd) {
+            if (!se->vmsd->immutable) {
+                continue;
+            }
+            ret = qemu_savevm_state_precopy_one_non_iterable(se, f, vmdesc);
+            if (ret) {
+                break;
+            }
+            continue;
+        }
+
          if (!se->ops || !se->ops->save_setup) {
              continue;
          }
@@ -1364,41 +1412,28 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
                                                      bool in_postcopy,
                                                      bool inactivate_disks)
  {
-    g_autoptr(JSONWriter) vmdesc = NULL;
+    MigrationState *ms = migrate_get_current();
+    JSONWriter *vmdesc = ms->vmdesc;
      int vmdesc_len;
      SaveStateEntry *se;
      int ret;
  
-    vmdesc = json_writer_new(false);
-    json_writer_start_object(vmdesc, NULL);
-    json_writer_int64(vmdesc, "page_size", qemu_target_page_size());
-    json_writer_start_array(vmdesc, "devices");
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    /* qemu_savevm_state_setup() is expected to be called first. */
+    assert(vmdesc);
  
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
          if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
              continue;
          }
-        if (se->vmsd && !vmstate_save_needed(se->vmsd, se->opaque)) {
-            trace_savevm_section_skip(se->idstr, se->section_id);
+        if (se->vmsd && se->vmsd->immutable) {
+            /* Already saved during qemu_savevm_state_setup(). */
              continue;
          }
  
-        trace_savevm_section_start(se->idstr, se->section_id);
-
-        json_writer_start_object(vmdesc, NULL);
-        json_writer_str(vmdesc, "name", se->idstr);
-        json_writer_int64(vmdesc, "instance_id", se->instance_id);
-
-        save_section_header(f, se, QEMU_VM_SECTION_FULL);
-        ret = vmstate_save(f, se, vmdesc);
+        ret = qemu_savevm_state_precopy_one_non_iterable(se, f, vmdesc);
          if (ret) {
-            qemu_file_set_error(f, ret);
              return ret;
          }
-        trace_savevm_section_end(se->idstr, se->section_id, 0);
-        save_section_footer(f, se);
-
-        json_writer_end_object(vmdesc);
      }
  
      if (inactivate_disks) {
@@ -1427,6 +1462,10 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
          qemu_put_buffer(f, (uint8_t *)json_writer_get(vmdesc), vmdesc_len);
      }
  
+    /* Free it now to detect any inconsistencies. */
+    json_writer_free(vmdesc);
+    ms->vmdesc = NULL;
+
      return 0;
  }
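
For illustration, with this variant a device would opt in by setting the new
flag on its vmsd instead of using a special migration priority (a sketch;
the device is hypothetical):

static const VMStateDescription vmstate_foo_device_early = {
    .name = "foo-device-early",
    .version_id = 1,
    .minimum_version_id = 1,
    /* Saved during qemu_savevm_state_setup() instead of at completion. */
    .immutable = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT64(size, FooDeviceState),
        VMSTATE_END_OF_LIST()
    },
};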
Peter Xu Jan. 9, 2023, 7:54 p.m. UTC | #8
On Mon, Jan 09, 2023 at 03:34:48PM +0100, David Hildenbrand wrote:
> On 05.01.23 18:15, Peter Xu wrote:
> > On Thu, Jan 05, 2023 at 09:35:54AM +0100, David Hildenbrand wrote:
> > > On 04.01.23 18:23, Peter Xu wrote:
> > > > On Thu, Dec 22, 2022 at 12:02:10PM +0100, David Hildenbrand wrote:
> > > > > Migrating device state before we start iterating is currently impossible.
> > > > > Introduce and use qemu_savevm_state_start_precopy(), and use
> > > > > a new special migration priority -- MIG_PRI_POST_SETUP -- to decide whether
> > > > > state will be saved in qemu_savevm_state_start_precopy() or in
> > > > > qemu_savevm_state_complete_precopy_*().
> > > > 
> > > > Can something like this be done in qemu_savevm_state_setup()?
> > > 
> > > Hi Peter,
> > 
> > Hi, David,
> > 
> > > 
> > > Do you mean
> > > 
> > > (a) Moving qemu_savevm_state_start_precopy() effectively into
> > >      qemu_savevm_state_setup()
> > > 
> > > (b) Using se->ops->save_setup()
> > 
> > I meant (b).
> > 
> > > 
> > > I first tried going via (b), but decided to go the current way of using a
> > > proper vmstate with properties (instead of e.g., filling the stream
> > > manually), which also made vmdesc handling possible (and significantly
> > > cleaner).
> > > 
> > > Regarding (a), I decided to not move logic of
> > > qemu_savevm_state_start_precopy() into qemu_savevm_state_setup(), because it
> > > looked cleaner to save device state with the BQL held and for background
> > > snapshots, the VM has been stopped. To decouple device state saving from the
> > > setup path, just like we do it right now for all vmstates.
> > 
> > Is BQL required or optional?  IIUC it's at least still not taken in the
> > migration thread path, only in savevm path.
> > 
> > > 
> > > Having that said, for virtio-mem, it would still work because that state is
> > > immutable once migration starts, but it felt cleaner to separate the setup()
> > > phase from actual device state saving.
> > 
> > I get the point.  My major concerns are:
> > 
> >    (1) The new migration priority is changing the semantic of original,
> >        making it over-complicated
> > 
> >    (2) The new precopy-start routine added one more step to the migration
> >        framework, while it's somehow overlapping (if not to say, mostly the
> >        same as..) save_setup().
> > 
> > For (1): the old priority was only deciding the order of save entries in
> > the global list, nothing more than that.  Even if we want to have a
> > precopy-start phase, I'd suggest we use something else and keep the
> > migration priority simple.  Otherwise we really need serious documentation
> > for MigrationPriority and if so I'd rather don't bother and not reuse the
> > priority field.
> > 
> > For (2), if you see there're a bunch of save_setup() that already does
> > things like transferring static data besides the device states.  Besides
> > the notorious ram_save_setup() there's also dirty_bitmap_save_setup() which
> > also sends a bitmap during save_setup() and some others.  It looks clean to
> > me to do it in the same way as we used to.
> > 
> > Reusing vmstate_save() and vmsd structures are useful too which I totally
> > agree.  So.. can we just call vmstate_save_state() in the save_setup() of
> > the other new vmsd of virtio-mem?
> 
> 
> I went halfway that way, by moving stuff into qemu_savevm_state_setup()
> and avoiding using a new migration priority. Seems to work:

The whole point of my suggestion is not moving things into
qemu_savevm_state_setup(), but avoid introducing more complexity to the
migration framework if unnecessary, so keep the generic framework as simple
as possible.

> 
> I think we could go one step further and perform it from a save_setup() callback,
> however, I'm not convinced that this gets particularly cleaner (vmdesc handling
> eventually).

What I wanted to suggest is exactly trying to avoid vmsd handling.  To be
more explicit, I mean: besides vmstate_virtio_mem_device_early, virtio-mem
can register with another new SaveVMHandlers with both save_setup and
load_setup registered, then e.g. in its save_setup(), one simply calls:

  vmstate_save_state(f, &vmstate_virtio_mem_device_early, virtio_mem_dev,
                     NULL);

I'm not sure whether the JSONWriter* is required in this case, maybe not
yet to at least make it work.

We'll need to impl the load part, but then IIUC we don't need to touch the
migration framework at all, and we keep all similar things (like other
devices I mentioned) to be inside save_setup().

Would that work?
David Hildenbrand Jan. 10, 2023, 10:18 a.m. UTC | #9
On 09.01.23 20:54, Peter Xu wrote:
> On Mon, Jan 09, 2023 at 03:34:48PM +0100, David Hildenbrand wrote:
>> On 05.01.23 18:15, Peter Xu wrote:
>>> On Thu, Jan 05, 2023 at 09:35:54AM +0100, David Hildenbrand wrote:
>>>> On 04.01.23 18:23, Peter Xu wrote:
>>>>> On Thu, Dec 22, 2022 at 12:02:10PM +0100, David Hildenbrand wrote:
>>>>>> Migrating device state before we start iterating is currently impossible.
>>>>>> Introduce and use qemu_savevm_state_start_precopy(), and use
>>>>>> a new special migration priority -- MIG_PRI_POST_SETUP -- to decide whether
>>>>>> state will be saved in qemu_savevm_state_start_precopy() or in
>>>>>> qemu_savevm_state_complete_precopy_*().
>>>>>
>>>>> Can something like this be done in qemu_savevm_state_setup()?
>>>>
>>>> Hi Peter,
>>>
>>> Hi, David,
>>>
>>>>
>>>> Do you mean
>>>>
>>>> (a) Moving qemu_savevm_state_start_precopy() effectively into
>>>>       qemu_savevm_state_setup()
>>>>
>>>> (b) Using se->ops->save_setup()
>>>
>>> I meant (b).
>>>
>>>>
>>>> I first tried going via (b), but decided to go the current way of using a
>>>> proper vmstate with properties (instead of e.g., filling the stream
>>>> manually), which also made vmdesc handling possible (and significantly
>>>> cleaner).
>>>>
>>>> Regarding (a), I decided to not move logic of
>>>> qemu_savevm_state_start_precopy() into qemu_savevm_state_setup(), because it
>>>> looked cleaner to save device state with the BQL held and for background
>>>> snapshots, the VM has been stopped. To decouple device state saving from the
>>>> setup path, just like we do it right now for all vmstates.
>>>
>>> Is BQL required or optional?  IIUC it's at least still not taken in the
>>> migration thread path, only in savevm path.
>>>
>>>>
>>>> Having that said, for virtio-mem, it would still work because that state is
>>>> immutable once migration starts, but it felt cleaner to separate the setup()
>>>> phase from actual device state saving.
>>>
>>> I get the point.  My major concerns are:
>>>
>>>     (1) The new migration priority is changing the semantic of original,
>>>         making it over-complicated
>>>
>>>     (2) The new precopy-start routine added one more step to the migration
>>>         framework, while it's somehow overlapping (if not to say, mostly the
>>>         same as..) save_setup().
>>>
>>> For (1): the old priority was only deciding the order of save entries in
>>> the global list, nothing more than that.  Even if we want to have a
>>> precopy-start phase, I'd suggest we use something else and keep the
>>> migration priority simple.  Otherwise we really need serious documentation
>>> for MigrationPriority and if so I'd rather don't bother and not reuse the
>>> priority field.
>>>
>>> For (2), if you see there're a bunch of save_setup() that already does
>>> things like transferring static data besides the device states.  Besides
>>> the notorious ram_save_setup() there's also dirty_bitmap_save_setup() which
>>> also sends a bitmap during save_setup() and some others.  It looks clean to
>>> me to do it in the same way as we used to.
>>>
>>> Reusing vmstate_save() and vmsd structures are useful too which I totally
>>> agree.  So.. can we just call vmstate_save_state() in the save_setup() of
>>> the other new vmsd of virtio-mem?
>>
>>
>> I went halfway that way, by moving stuff into qemu_savevm_state_setup()
>> and avoiding using a new migration priority. Seems to work:
> 
> The whole point of my suggestion is not moving things into
> qemu_savevm_state_setup(), but avoid introducing more complexity to the
> migration framework if unnecessary, so keep the generic framework as simple
> as possible.

IMHO, the current approach is actually quite simple and clean. But ...
> 
>>
>> I think we could go one step further and perform it from a save_setup() callback,
>> however, I'm not convinced that this gets particularly cleaner (vmdesc handling
>> eventually).
> 
> What I wanted to suggest is exactly trying to avoid vmsd handling.  To be
> more explicit, I mean: besides vmstate_virtio_mem_device_early, virtio-mem
> can register with another new SaveVMHandlers with both save_setup and
> load_setup registered, then e.g. in its save_setup(), one simply calls:

... I can see if it can be made to work that way and how the result looks. I know
that we use vmstate_save_state() from virtio code, but I don't remember it being
used in save_setup(), i.e., from QEMU_VM_SECTION_START rather than
QEMU_VM_SECTION_FULL.


There is this interesting bit in register_savevm_live(), which sets "se->is_ram = 1";
qemu_save_device_state() will then not include the state. As that's used by Xen, I don't
particularly care.


> 
>    vmstate_save_state(f, &vmstate_virtio_mem_device_early, virtio_mem_dev,
>                       NULL);
> 
> I'm not sure whether the JSONWriter* is required in this case, maybe not
> yet to at least make it work.

It was required when handling vmstates the current way to make
analyze-migration.py not bail out (which is a good thing because one can
actually inspect the migration content):

$ ./scripts/analyze-migration.py -f STATEFILE
{
     "ram (2)": {
         "section sizes": {
             "0000:00:03.0/mem0": "0x0000000f00000000",
             "pc.ram": "0x0000000100000000",
             "/rom@etc/acpi/tables": "0x0000000000020000",
             "pc.bios": "0x0000000000040000",
             "0000:00:02.0/e1000.rom": "0x0000000000040000",
             "pc.rom": "0x0000000000020000",
             "/rom@etc/table-loader": "0x0000000000001000",
             "/rom@etc/acpi/rsdp": "0x0000000000001000"
         }
     },
     "0000:00:03.0/virtio-mem-device-early (51)": {
         "tmp": "00 00 00 01 40 00 00 00 00 00 00 0f 00 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00",
         "size": "0x0000000000000000",
         "bitmap": "00 00 00 00 00 00 00 00 00 00
	...
     },
     "timer (0)": {
...

> 
> We'll need to impl the load part, but then IIUC we don't need to touch the
> migration framework at all, and we keep all similar things (like other
> devices I mentioned) to be inside save_setup().
> 
> Would that work?

Let me play with it.
David Hildenbrand Jan. 10, 2023, 11:52 a.m. UTC | #10
On 10.01.23 11:18, David Hildenbrand wrote:
> On 09.01.23 20:54, Peter Xu wrote:
>> On Mon, Jan 09, 2023 at 03:34:48PM +0100, David Hildenbrand wrote:
>>> On 05.01.23 18:15, Peter Xu wrote:
>>>> On Thu, Jan 05, 2023 at 09:35:54AM +0100, David Hildenbrand wrote:
>>>>> On 04.01.23 18:23, Peter Xu wrote:
>>>>>> On Thu, Dec 22, 2022 at 12:02:10PM +0100, David Hildenbrand wrote:
>>>>>>> Migrating device state before we start iterating is currently impossible.
>>>>>>> Introduce and use qemu_savevm_state_start_precopy(), and use
>>>>>>> a new special migration priority -- MIG_PRI_POST_SETUP -- to decide whether
>>>>>>> state will be saved in qemu_savevm_state_start_precopy() or in
>>>>>>> qemu_savevm_state_complete_precopy_*().
>>>>>>
>>>>>> Can something like this be done in qemu_savevm_state_setup()?
>>>>>
>>>>> Hi Peter,
>>>>
>>>> Hi, David,
>>>>
>>>>>
>>>>> Do you mean
>>>>>
>>>>> (a) Moving qemu_savevm_state_start_precopy() effectively into
>>>>>        qemu_savevm_state_setup()
>>>>>
>>>>> (b) Using se->ops->save_setup()
>>>>
>>>> I meant (b).
>>>>
>>>>>
>>>>> I first tried going via (b), but decided to go the current way of using a
>>>>> proper vmstate with properties (instead of e.g., filling the stream
>>>>> manually), which also made vmdesc handling possible (and significantly
>>>>> cleaner).
>>>>>
>>>>> Regarding (a), I decided to not move logic of
>>>>> qemu_savevm_state_start_precopy() into qemu_savevm_state_setup(), because it
>>>>> looked cleaner to save device state with the BQL held and for background
>>>>> snapshots, the VM has been stopped. To decouple device state saving from the
>>>>> setup path, just like we do it right now for all vmstates.
>>>>
>>>> Is BQL required or optional?  IIUC it's at least still not taken in the
>>>> migration thread path, only in savevm path.
>>>>
>>>>>
>>>>> Having that said, for virtio-mem, it would still work because that state is
>>>>> immutable once migration starts, but it felt cleaner to separate the setup()
>>>>> phase from actual device state saving.
>>>>
>>>> I get the point.  My major concerns are:
>>>>
>>>>      (1) The new migration priority is changing the semantic of original,
>>>>          making it over-complicated
>>>>
>>>>      (2) The new precopy-start routine added one more step to the migration
>>>>          framework, while it's somehow overlapping (if not to say, mostly the
>>>>          same as..) save_setup().
>>>>
>>>> For (1): the old priority was only deciding the order of save entries in
>>>> the global list, nothing more than that.  Even if we want to have a
>>>> precopy-start phase, I'd suggest we use something else and keep the
>>>> migration priority simple.  Otherwise we really need serious documentation
>>>> for MigrationPriority and if so I'd rather don't bother and not reuse the
>>>> priority field.
>>>>
>>>> For (2), if you see there're a bunch of save_setup() that already does
>>>> things like transferring static data besides the device states.  Besides
>>>> the notorious ram_save_setup() there's also dirty_bitmap_save_setup() which
>>>> also sends a bitmap during save_setup() and some others.  It looks clean to
>>>> me to do it in the same way as we used to.
>>>>
>>>> Reusing vmstate_save() and vmsd structures are useful too which I totally
>>>> agree.  So.. can we just call vmstate_save_state() in the save_setup() of
>>>> the other new vmsd of virtio-mem?
>>>
>>>
>>> I went halfway that way, by moving stuff into qemu_savevm_state_setup()
>>> and avoiding using a new migration priority. Seems to work:
>>
>> The whole point of my suggestion is not moving things into
>> qemu_savevm_state_setup(), but avoid introducing more complexity to the
>> migration framework if unnecessary, so keep the generic framework as simple
>> as possible.
> 
> IMHO, the current approach is actually quite simple and clean. But ...
>>
>>>
>>> I think we could go one step further and perform it from a save_setup() callback,
>>> however, I'm not convinced that this gets particularly cleaner (vmdesc handling
>>> eventually).
>>
>> What I wanted to suggest is exactly trying to avoid vmsd handling.  To be
>> more explicit, I mean: besides vmstate_virtio_mem_device_early, virtio-mem
>> can register with another new SaveVMHandlers with both save_setup and
>> load_setup registered, then e.g. in its save_setup(), one simply calls:
> 
> ... I can see if it can be made working that way and how the result looks. I know
> that we use vmstate_save_state() from virtio code, but I don't remember using
> it in save_setup() from QEMU_VM_SECTION_START and not QEMU_VM_SECTION_FULL.
> 
> 
> There is this interesting bit in register_savevm_live(), which sets "se->is_ram = 1".
> qemu_save_device_state() will not include the state. As it's used by XEN, I don't
> particularly care.
> 
> 
>>
>>     vmstate_save_state(f, &vmstate_virtio_mem_device_early, virtio_mem_dev,
>>                        NULL);
>>
>> I'm not sure whether the JSONWriter* is required in this case, maybe not
>> yet to at least make it work.
> 
> It was required when handling vmstates the current way to make
> analyze-migration.py not bail out (which is a good thing because one can
> actually inspect the migration content):
> 
> $ ./scripts/analyze-migration.py -f STATEFILE
> {
>       "ram (2)": {
>           "section sizes": {
>               "0000:00:03.0/mem0": "0x0000000f00000000",
>               "pc.ram": "0x0000000100000000",
>               "/rom@etc/acpi/tables": "0x0000000000020000",
>               "pc.bios": "0x0000000000040000",
>               "0000:00:02.0/e1000.rom": "0x0000000000040000",
>               "pc.rom": "0x0000000000020000",
>               "/rom@etc/table-loader": "0x0000000000001000",
>               "/rom@etc/acpi/rsdp": "0x0000000000001000"
>           }
>       },
>       "0000:00:03.0/virtio-mem-device-early (51)": {
>           "tmp": "00 00 00 01 40 00 00 00 00 00 00 0f 00 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00",
>           "size": "0x0000000000000000",
>           "bitmap": "00 00 00 00 00 00 00 00 00 00
> 	...
>       },
>       "timer (0)": {
> ...
> 
>>
>> We'll need to impl the load part, but then IIUC we don't need to touch the
>> migration framework at all, and we keep all similar things (like other
>> devices I mentioned) to be inside save_setup().
>>
>> Would that work?
> 
> Let me play with it.
> 

The following seems to work, but makes analyze-migration.py angry:

$ ./scripts/analyze-migration.py -f STATEFILE
Traceback (most recent call last):
   File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 605, in <module>
     dump.read(dump_memory = args.memory)
   File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 539, in read
     classdesc = self.section_classes[section_key]
                 ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: ('0000:00:03.0/virtio-mem-early', 0)


We need the vmdesc to create info for the device.


 From 052d8deaa5341d7abe9c8428b50b50613bfef408 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 21 Dec 2022 10:04:06 +0100
Subject: [PATCH] virtio-mem: Migrate bitmap, size and sanity checks early

The bitmap and the size are immutable while migration is active: see
virtio_mem_is_busy(). We can migrate this information early, before
migrating any actual RAM content.

Having this information in place early will, for example, allow for
properly preallocating memory before touching these memory locations
during RAM migration: this way, we can make sure that all memory was
actually preallocated and that any user errors (e.g., insufficient
hugetlb pages) can be handled gracefully.

In contrast, usable_region_size and requested_size can theoretically
still be modified on the source while the VM is running. Keep migrating
these properties the usual, late, way.

Use a new device property to keep the behavior of compat machines
unmodified.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  hw/core/machine.c              |  4 +-
  hw/virtio/virtio-mem.c         | 81 +++++++++++++++++++++++++++++++++-
  include/hw/virtio/virtio-mem.h |  8 ++++
  3 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 616f3a207c..29b57f6448 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -41,7 +41,9 @@
  #include "hw/virtio/virtio-pci.h"
  #include "qom/object_interfaces.h"
  
-GlobalProperty hw_compat_7_2[] = {};
+GlobalProperty hw_compat_7_2[] = {
+    { "virtio-mem", "x-early-migration", "false" },
+};
  const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
  
  GlobalProperty hw_compat_7_1[] = {
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 02f7b5469a..54e4f956c8 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -26,11 +26,14 @@
  #include "qapi/visitor.h"
  #include "exec/ram_addr.h"
  #include "migration/misc.h"
+#include "migration/register.h"
  #include "hw/boards.h"
  #include "hw/qdev-properties.h"
  #include CONFIG_DEVICES
  #include "trace.h"
  
+static const SaveVMHandlers vmstate_virtio_mem_device_early_ops;
+
  /*
   * We only had legacy x86 guests that did not support
   * VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE. Other targets don't have legacy guests.
@@ -878,6 +881,23 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
  
      host_memory_backend_set_mapped(vmem->memdev, true);
      vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
+    if (vmem->early_migration) {
+        char idstr[256] = "";
+        char *oid;
+
+        /* See unregister_savevm() */
+        oid = vmstate_if_get_id(VMSTATE_IF(vmem));
+        if (oid) {
+            pstrcpy(idstr, sizeof(idstr), oid);
+            pstrcat(idstr, sizeof(idstr), "/");
+            g_free(oid);
+        }
+        pstrcat(idstr, sizeof(idstr), "virtio-mem-early");
+
+        register_savevm_live(idstr, VMSTATE_INSTANCE_ID_ANY, 1,
+                             &vmstate_virtio_mem_device_early_ops,
+                             vmem);
+    }
      qemu_register_reset(virtio_mem_system_reset, vmem);
  
      /*
@@ -899,6 +919,9 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
       */
      memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
      qemu_unregister_reset(virtio_mem_system_reset, vmem);
+    if (vmem->early_migration) {
+        unregister_savevm(VMSTATE_IF(vmem), "virtio-mem-early", vmem);
+    }
      vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
      host_memory_backend_set_mapped(vmem->memdev, false);
      virtio_del_queue(vdev, 0);
@@ -1015,23 +1038,75 @@ static const VMStateDescription vmstate_virtio_mem_sanity_checks = {
      },
  };
  
+static bool virtio_mem_vmstate_field_exists(void *opaque, int version_id)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(opaque);
+
+    /* With early migration, these fields were already migrated. */
+    return !vmem->early_migration;
+}
+
  static const VMStateDescription vmstate_virtio_mem_device = {
      .name = "virtio-mem-device",
      .minimum_version_id = 1,
      .version_id = 1,
      .priority = MIG_PRI_VIRTIO_MEM,
      .post_load = virtio_mem_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_WITH_TMP_TEST(VirtIOMEM, virtio_mem_vmstate_field_exists,
+                              VirtIOMEMMigSanityChecks,
+                              vmstate_virtio_mem_sanity_checks),
+        VMSTATE_UINT64(usable_region_size, VirtIOMEM),
+        VMSTATE_UINT64_TEST(size, VirtIOMEM, virtio_mem_vmstate_field_exists),
+        VMSTATE_UINT64(requested_size, VirtIOMEM),
+        VMSTATE_BITMAP_TEST(bitmap, VirtIOMEM, virtio_mem_vmstate_field_exists,
+                            0, bitmap_size),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+/*
+ * Transfer properties that are immutable while migration is active
+ * early, such that we have this information around before migrating
+ * any RAM content.
+ *
+ * Note that virtio_mem_is_busy() makes sure these properties can no longer
+ * change on the migration source until migration completed.
+ *
+ * With QEMU compat machines, we transmit these properties later, via
+ * vmstate_virtio_mem_device instead -- see virtio_mem_vmstate_field_exists().
+ */
+static const VMStateDescription vmstate_virtio_mem_device_early = {
+    .name = "virtio-mem-device-early",
+    .minimum_version_id = 1,
+    .version_id = 1,
      .fields = (VMStateField[]) {
          VMSTATE_WITH_TMP(VirtIOMEM, VirtIOMEMMigSanityChecks,
                           vmstate_virtio_mem_sanity_checks),
-        VMSTATE_UINT64(usable_region_size, VirtIOMEM),
          VMSTATE_UINT64(size, VirtIOMEM),
-        VMSTATE_UINT64(requested_size, VirtIOMEM),
          VMSTATE_BITMAP(bitmap, VirtIOMEM, 0, bitmap_size),
          VMSTATE_END_OF_LIST()
      },
  };
  
+static int virtio_mem_save_setup_early(QEMUFile *f, void *opaque)
+{
+    return vmstate_save_state(f, &vmstate_virtio_mem_device_early, opaque,
+                              NULL);
+}
+
+static int virtio_mem_load_state_early(QEMUFile *f, void *opaque,
+                                       int version_id)
+{
+    return vmstate_load_state(f, &vmstate_virtio_mem_device_early, opaque,
+                              vmstate_virtio_mem_device_early.version_id);
+}
+
+static const SaveVMHandlers vmstate_virtio_mem_device_early_ops = {
+    .save_setup = virtio_mem_save_setup_early,
+    .load_state = virtio_mem_load_state_early,
+};
+
  static const VMStateDescription vmstate_virtio_mem = {
      .name = "virtio-mem",
      .minimum_version_id = 1,
@@ -1211,6 +1286,8 @@ static Property virtio_mem_properties[] = {
      DEFINE_PROP_ON_OFF_AUTO(VIRTIO_MEM_UNPLUGGED_INACCESSIBLE_PROP, VirtIOMEM,
                              unplugged_inaccessible, ON_OFF_AUTO_AUTO),
  #endif
+    DEFINE_PROP_BOOL(VIRTIO_MEM_EARLY_MIGRATION_PROP, VirtIOMEM,
+                     early_migration, true),
      DEFINE_PROP_END_OF_LIST(),
  };
  
diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
index 7745cfc1a3..f15e561785 100644
--- a/include/hw/virtio/virtio-mem.h
+++ b/include/hw/virtio/virtio-mem.h
@@ -31,6 +31,7 @@ OBJECT_DECLARE_TYPE(VirtIOMEM, VirtIOMEMClass,
  #define VIRTIO_MEM_BLOCK_SIZE_PROP "block-size"
  #define VIRTIO_MEM_ADDR_PROP "memaddr"
  #define VIRTIO_MEM_UNPLUGGED_INACCESSIBLE_PROP "unplugged-inaccessible"
+#define VIRTIO_MEM_EARLY_MIGRATION_PROP "x-early-migration"
  #define VIRTIO_MEM_PREALLOC_PROP "prealloc"
  
  struct VirtIOMEM {
@@ -74,6 +75,13 @@ struct VirtIOMEM {
      /* whether to prealloc memory when plugging new blocks */
      bool prealloc;
  
+    /*
+     * Whether we migrate properties that are immutable while migration is
+     * active early, before state of other devices and especially, before
+     * migrating any RAM content.
+     */
+    bool early_migration;
+
      /* notifiers to notify when "size" changes */
      NotifierList size_change_notifiers;
Peter Xu Jan. 10, 2023, 8:03 p.m. UTC | #11
On Tue, Jan 10, 2023 at 12:52:32PM +0100, David Hildenbrand wrote:
> The following seems to work,

That looks much better, at least from the diffstat pov (compared to the
existing patch 1+5 and the framework changes), thanks.

> but makes analyze-migration.py angry:
> 
> $ ./scripts/analyze-migration.py -f STATEFILE
> Traceback (most recent call last):
>   File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 605, in <module>
>     dump.read(dump_memory = args.memory)
>   File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 539, in read
>     classdesc = self.section_classes[section_key]
>                 ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
> KeyError: ('0000:00:03.0/virtio-mem-early', 0)
> 
> 
> We need the vmdesc to create info for the device.

Migration may ignore the save entry in the "devices" section if save_state()
is not provided:

        if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
            continue;
        }

Could you try providing a shim save_state() for the new virtio-mem save
entry?

/*
 * Shim function to make sure the save entry will be dumped into "devices"
 * section, to make analyze-migration.py happy.
 */
static void virtio_mem_save_state_early(QEMUFile *file, void *opaque)
{
}

Then:

static const SaveVMHandlers vmstate_virtio_mem_device_early_ops = {
    .save_setup = virtio_mem_save_setup_early,
    .save_state = virtio_mem_save_state_early,
    .load_state = virtio_mem_load_state_early,
};

I'm not 100% sure it'll work yet, but maybe worth trying.

Thanks,
David Hildenbrand Jan. 11, 2023, 1:48 p.m. UTC | #12
On 10.01.23 21:03, Peter Xu wrote:
> On Tue, Jan 10, 2023 at 12:52:32PM +0100, David Hildenbrand wrote:
>> The following seems to work,
> 
> That looks much better at least from the diffstat pov (comparing to the
> existing patch 1+5 and the framework changes), thanks.
> 
>> but makes analyze-migration.py angry:
>>
>> $ ./scripts/analyze-migration.py -f STATEFILE
>> Traceback (most recent call last):
>>    File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 605, in <module>
>>      dump.read(dump_memory = args.memory)
>>    File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 539, in read
>>      classdesc = self.section_classes[section_key]
>>                  ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
>> KeyError: ('0000:00:03.0/virtio-mem-early', 0)
>>
>>
>> We need the vmdesc to create info for the device.
> 
> Migration may ignore the save entry if save_state() not provided in the
> "devices" section:
> 
>          if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
>              continue;
>          }
> 
> Could you try providing a shim save_state() for the new virtio-mem save
> entry?
> 
> /*
>   * Shim function to make sure the save entry will be dumped into "devices"
>   * section, to make analyze-migration.py happy.
>   */
> static void virtio_mem_save_state_early(QEMUFile *file, void *opaque)
> {
> }
> 
> Then:
> 
> static const SaveVMHandlers vmstate_virtio_mem_device_early_ops = {
>      .save_setup = virtio_mem_save_setup_early,
>      .save_state = virtio_mem_save_state_early,
>      .load_state = virtio_mem_load_state_early,
> };
> 
> I'm not 100% sure it'll work yet, but maybe worth trying.

It doesn't. virtio_mem_load_state_early() will get called twice (once 
with state saved during save_setup() and once with effectively nothing 
during save_state(), which breaks the whole migration).

vmdesc handling is also wrong: analyze-migration.py gets confused 
because it receives data stored during save_setup(), while the vmdesc 
created during save_state() says that there would be "nothing" to 
interpret ...

$ ./scripts/analyze-migration.py -f STATEFILE
{
     "ram (2)": {
         "section sizes": {
             "0000:00:03.0/mem0": "0x0000000f00000000",
             "pc.ram": "0x0000000100000000",
             "/rom@etc/acpi/tables": "0x0000000000020000",
             "pc.bios": "0x0000000000040000",
             "0000:00:02.0/e1000.rom": "0x0000000000040000",
             "pc.rom": "0x0000000000020000",
             "/rom@etc/table-loader": "0x0000000000001000",
             "/rom@etc/acpi/rsdp": "0x0000000000001000"
         }
     },
     "0000:00:03.0/virtio-mem-early (51)": {
         "data": ""
     }
}


Not sure if the whole thing becomes nicer when manually looking up the 
vmdesc ... because filling it would also require manually storing 
se->idstr and se->instance_id, both of which are migration-internal 
and not available inside save_setup().


Hm, I really prefer something like the simplified version that lets the 
migration core deal with vmstates to be migrated during the save_setup() 
phase. We could avoid the vmstate->immutable flag and simply use a 
separate function for registering the vmstate, like:

vmstate_register_immutable()
vmstate_register_early()
...
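
A rough sketch of what such a helper could look like (the signature is only
assumed to mirror vmstate_register(); nothing here is from an actual patch):

/*
 * Like vmstate_register(), but the vmsd is saved/loaded during the setup
 * phase and must only describe state that is immutable while migration
 * is active.
 */
int vmstate_register_early(VMStateIf *obj, uint32_t instance_id,
                           const VMStateDescription *vmsd, void *opaque);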
Peter Xu Jan. 11, 2023, 4:35 p.m. UTC | #13
On Wed, Jan 11, 2023 at 02:48:09PM +0100, David Hildenbrand wrote:
> On 10.01.23 21:03, Peter Xu wrote:
> > On Tue, Jan 10, 2023 at 12:52:32PM +0100, David Hildenbrand wrote:
> > > The following seems to work,
> > 
> > That looks much better at least from the diffstat pov (comparing to the
> > existing patch 1+5 and the framework changes), thanks.
> > 
> > > but makes analyze-migration.py angry:
> > > 
> > > $ ./scripts/analyze-migration.py -f STATEFILE
> > > Traceback (most recent call last):
> > >    File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 605, in <module>
> > >      dump.read(dump_memory = args.memory)
> > >    File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 539, in read
> > >      classdesc = self.section_classes[section_key]
> > >                  ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
> > > KeyError: ('0000:00:03.0/virtio-mem-early', 0)
> > > 
> > > 
> > > We need the vmdesc to create info for the device.
> > 
> > Migration may ignore the save entry if save_state() not provided in the
> > "devices" section:
> > 
> >          if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
> >              continue;
> >          }
> > 
> > Could you try providing a shim save_state() for the new virtio-mem save
> > entry?
> > 
> > /*
> >   * Shim function to make sure the save entry will be dumped into "devices"
> >   * section, to make analyze-migration.py happy.
> >   */
> > static void virtio_mem_save_state_early(QEMUFile *file, void *opaque)
> > {
> > }
> > 
> > Then:
> > 
> > static const SaveVMHandlers vmstate_virtio_mem_device_early_ops = {
> >      .save_setup = virtio_mem_save_setup_early,
> >      .save_state = virtio_mem_save_state_early,
> >      .load_state = virtio_mem_load_state_early,
> > };
> > 
> > I'm not 100% sure it'll work yet, but maybe worth trying.
> 
> It doesn't. virtio_mem_load_state_early() will get called twice (once with
> state saved during save_setup() and once with effectively nothing during
> save_state(), which breaks the whole migration).

Oh hold on.. so load_state() to pair save_setup()? Maybe you should pair it
with load_setup(), which I just noticed..  Then the load_state() needs to
be another shim like save_state().
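
I.e., roughly like this (sketch, untested; reusing your handlers):

/* Consume the state that save_setup() wrote into the stream. */
static int virtio_mem_load_setup_early(QEMUFile *f, void *opaque)
{
    return virtio_mem_load_state_early(f, opaque, 1);
}

/* Shim: save_state() wrote nothing, so there is nothing to load here. */
static int virtio_mem_load_state_shim(QEMUFile *f, void *opaque,
                                      int version_id)
{
    return 0;
}

static const SaveVMHandlers vmstate_virtio_mem_device_early_ops = {
    .save_setup = virtio_mem_save_setup_early,
    .save_state = virtio_mem_save_state_early,
    .load_setup = virtio_mem_load_setup_early,
    .load_state = virtio_mem_load_state_shim,
};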

> 
> vmdesc handling is also wrong, because analyze-migration.py gets confused
> because it receives data stored during save_setup() but vmdesc created
> during save_state() was told that there would be "nothing" to interpret ...
> 
> $ ./scripts/analyze-migration.py -f STATEFILE
> {
>     "ram (2)": {
>         "section sizes": {
>             "0000:00:03.0/mem0": "0x0000000f00000000",
>             "pc.ram": "0x0000000100000000",
>             "/rom@etc/acpi/tables": "0x0000000000020000",
>             "pc.bios": "0x0000000000040000",
>             "0000:00:02.0/e1000.rom": "0x0000000000040000",
>             "pc.rom": "0x0000000000020000",
>             "/rom@etc/table-loader": "0x0000000000001000",
>             "/rom@etc/acpi/rsdp": "0x0000000000001000"
>         }
>     },
>     "0000:00:03.0/virtio-mem-early (51)": {
>         "data": ""
>     }
> }

Yeah, this is expected, because the whole data stream within the setup phase
is a black box and not described by anything.  "ram" can display correctly
only because it's hard coded in the python script, I think.  The trick
above only keeps the script from breaking; it doesn't teach the script to
also check for the early vmsd.

But that's one step forward and IMHO it's fine for special devices. For
example, even with your initial patch, I think the static analyzer (aka,
qemu -dump-vmstate) will also ignore your new vmsd anyway because it's not
directly bound to the DeviceState* but registered separately.  So no ideal
solution so far, afaict, without more work done on this aspect.

> 
> 
> Not sure if the whole thing becomes nicer when manually looking up the
> vmdesc ... because filling it also requires manually storing the
> se->idstr and the se->instance_id, both of which are migration-internal and
> not available inside save_setup().
> 
> 
> Hm, I really prefer something like the simplified version that lets the
> migration core deal with vmstate to be migrated during the save_setup()
> phase. We could avoid the vmstate->immutable flag and simply use a separate
> function for registering the vmstate, like:
> 
> vmstate_register_immutable()
> vmstate_register_early()
> ...

I agree, this looks useful.  It's just that the devices that need this will
be rare, and unfortunately some of them already implemented the stream in
other ways so maybe non-trivial to convert them.

Not against it if you want to keep exploring, but if so please avoid using
the priority field, also I'd hope the early vmsd will be saved within
qemu_savevm_state_setup() so maybe it can be another alternative to
save_setup().

Maybe it's still easier we keep going with the save_setup() and the shim
functions above, if it works for you.
David Hildenbrand Jan. 11, 2023, 4:58 p.m. UTC | #14
On 11.01.23 17:35, Peter Xu wrote:
> On Wed, Jan 11, 2023 at 02:48:09PM +0100, David Hildenbrand wrote:
>> On 10.01.23 21:03, Peter Xu wrote:
>>> On Tue, Jan 10, 2023 at 12:52:32PM +0100, David Hildenbrand wrote:
>>>> The following seems to work,
>>>
>>> That looks much better at least from the diffstat pov (comparing to the
>>> existing patch 1+5 and the framework changes), thanks.
>>>
>>>> but makes analyze-migration.py angry:
>>>>
>>>> $ ./scripts/analyze-migration.py -f STATEFILE
>>>> Traceback (most recent call last):
>>>>     File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 605, in <module>
>>>>       dump.read(dump_memory = args.memory)
>>>>     File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 539, in read
>>>>       classdesc = self.section_classes[section_key]
>>>>                   ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
>>>> KeyError: ('0000:00:03.0/virtio-mem-early', 0)
>>>>
>>>>
>>>> We need the vmdesc to create info for the device.
>>>
>>> Migration may ignore the save entry if save_state() not provided in the
>>> "devices" section:
>>>
>>>           if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
>>>               continue;
>>>           }
>>>
>>> Could you try providing a shim save_state() for the new virtio-mem save
>>> entry?
>>>
>>> /*
>>>    * Shim function to make sure the save entry will be dumped into "devices"
>>>    * section, to make analyze-migration.py happy.
>>>    */
>>> static void virtio_mem_save_state_early(QEMUFile *file, void *opaque)
>>> {
>>> }
>>>
>>> Then:
>>>
>>> static const SaveVMHandlers vmstate_virtio_mem_device_early_ops = {
>>>       .save_setup = virtio_mem_save_setup_early,
>>>       .save_state = virtio_mem_save_state_early,
>>>       .load_state = virtio_mem_load_state_early,
>>> };
>>>
>>> I'm not 100% sure it'll work yet, but maybe worth trying.
>>
>> It doesn't. virtio_mem_load_state_early() will get called twice (once with
>> state saved during save_setup() and once with effectively nothing during
>> save_state(), which breaks the whole migration).
> 
> Oh hold on.. so load_state() to pair save_setup()? Maybe you should pair it
> with load_setup(), which I just noticed..  Then the load_state() needs to
> be another shim like save_state().

Let me see if that improves the situation. E.g., ram.c writes state in 
save_setup() but doesn't load any state in load_setup() -- it's all done 
in load_state().

... no, still not working. It will read garbage during load_setup() now.

qemu-system-x86_64: Property 'memaddr' changed from 0x100000002037261 to 
0x140000000
qemu-system-x86_64: Failed to load virtio-mem-device-early:tmp
qemu-system-x86_64: Load state of device 0000:00:03.0/virtio-mem-early 
failed
qemu-system-x86_64: load of migration failed: Invalid argument


I don't think that load_setup() is supposed to consume anything useful 
from the migration stream. I suspect it should actually not even take a 
QEMUFile right now, because the way it's used is just for initializing 
stuff.

qemu_loadvm_state_main() does the actual loading part, parsing sections 
etc. qemu_loadvm_state_setup() doesn't do any of that.

AFAICS, at least qemu_loadvm_state_setup() would have to parse sections, 
and the save_setup() users would have to be converted to using 
load_setup() as well. Not sure if more is missing.
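
For reference, qemu_loadvm_state_setup() essentially just does (simplified):

    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        if (!se->ops || !se->ops->load_setup) {
            continue;
        }
        ret = se->ops->load_setup(f, se->opaque);
        if (ret < 0) {
            qemu_file_set_error(f, ret);
            error_report("Load state of device %s failed", se->idstr);
            return ret;
        }
    }

No section headers are read at all, so a load_setup() that consumes from 
the stream just reads whatever bytes happen to be next.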

Even with that, I doubt that it would make analyze-migration.py work, 
because what we store is something different than what we record in the 
vmdesc.

> 
>>
>> vmdesc handling is also wrong, because analyze-migration.py gets confused
>> because it receives data stored during save_setup() but vmdesc created
>> during save_state() was told that there would be "nothing" to interpret ...
>>
>> $ ./scripts/analyze-migration.py -f STATEFILE
>> {
>>      "ram (2)": {
>>          "section sizes": {
>>              "0000:00:03.0/mem0": "0x0000000f00000000",
>>              "pc.ram": "0x0000000100000000",
>>              "/rom@etc/acpi/tables": "0x0000000000020000",
>>              "pc.bios": "0x0000000000040000",
>>              "0000:00:02.0/e1000.rom": "0x0000000000040000",
>>              "pc.rom": "0x0000000000020000",
>>              "/rom@etc/table-loader": "0x0000000000001000",
>>              "/rom@etc/acpi/rsdp": "0x0000000000001000"
>>          }
>>      },
>>      "0000:00:03.0/virtio-mem-early (51)": {
>>          "data": ""
>>      }
>> }
> 
> Yeah, this is expected, because the whole data stream within the setup phase
> is a black box and not described by anything.  "ram" can display correctly
> only because it's hard coded in the python script, I think.  The trick
> above only keeps the script from breaking; it doesn't teach the script to
> also check for the early vmsd.

Note that the issue here is that the script stops the output after the 
virtio-mem device. So I'm not complaining about the "data": "", but 
about the vmdesc claiming that the vmstate has no other devices :)

> 
> But that's one step forward and IMHO it's fine for special devices. For
> example, even with your initial patch, I think the static analyzer (aka,
> qemu -dump-vmstate) will also ignore your new vmsd anyway because it's not
> directly bound to the DeviceState* but registered separately.  So no ideal
> solution so far, afaict, without more work done on this aspect.
> 
>>
>>
>> Not sure if the whole thing becomes nicer when manually looking up the
>> vmdesc ... because filling it also requires manually storing the
>> se->idstr and the se->instance_id, both of which are migration-internal and
>> not available inside save_setup().
>>
>>
>> Hm, I really prefer something like the simplified version that lets the
>> migration core deal with vmstate to be migrated during the save_setup()
>> phase. We could avoid the vmstate->immutable flag and simply use a separate
>> function for registering the vmstate, like:
>>
>> vmstate_register_immutable()
>> vmstate_register_early()
>> ...
> 
> I agree, this looks useful.  It's just that the devices that need this will
> be rare, and unfortunately some of them already implemented the stream in
> other ways so maybe non-trivial to convert them.

Right, I agree about the "rare" part and that conversion might be hard, 
because they are not using a vmstate descriptor.

The only way to avoid that is moving away from using a vmstate 
descriptor and storing everything manually in virtio-mem code. Not 
particularly happy about that, but it's the only feasible option I can 
see that doesn't touch migration code.

> 
> Not against it if you want to keep exploring, but if so please avoid using
> the priority field, also I'd hope the early vmsd will be saved within
> qemu_savevm_state_setup() so maybe it can be another alternative to
> save_setup().
> 
> Maybe it's still easier we keep going with the save_setup() and the shim
> functions above, if it works for you.

I'm happy to go down the path you suggested if we can make it work. I'd 
be happy to have something "reasonable" for the virtio-mem device in the 
analyze-migration.py output. But I could live with just nothing useful 
for the device itself -- as long as at least the other devices still 
show up.

Thanks Peter!
Peter Xu Jan. 11, 2023, 5:28 p.m. UTC | #15
On Wed, Jan 11, 2023 at 05:58:36PM +0100, David Hildenbrand wrote:
> On 11.01.23 17:35, Peter Xu wrote:
> > On Wed, Jan 11, 2023 at 02:48:09PM +0100, David Hildenbrand wrote:
> > > On 10.01.23 21:03, Peter Xu wrote:
> > > > On Tue, Jan 10, 2023 at 12:52:32PM +0100, David Hildenbrand wrote:
> > > > > The following seems to work,
> > > > 
> > > > That looks much better at least from the diffstat pov (comparing to the
> > > > existing patch 1+5 and the framework changes), thanks.
> > > > 
> > > > > but makes analyze-migration.py angry:
> > > > > 
> > > > > $ ./scripts/analyze-migration.py -f STATEFILE
> > > > > Traceback (most recent call last):
> > > > >     File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 605, in <module>
> > > > >       dump.read(dump_memory = args.memory)
> > > > >     File "/home/dhildenb/git/qemu/./scripts/analyze-migration.py", line 539, in read
> > > > >       classdesc = self.section_classes[section_key]
> > > > >                   ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
> > > > > KeyError: ('0000:00:03.0/virtio-mem-early', 0)
> > > > > 
> > > > > 
> > > > > We need the vmdesc to create info for the device.
> > > > 
> > > > Migration may ignore the save entry if save_state() not provided in the
> > > > "devices" section:
> > > > 
> > > >           if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
> > > >               continue;
> > > >           }
> > > > 
> > > > Could you try providing a shim save_state() for the new virtio-mem save
> > > > entry?
> > > > 
> > > > /*
> > > >    * Shim function to make sure the save entry will be dumped into "devices"
> > > >    * section, to make analyze-migration.py happy.
> > > >    */
> > > > static void virtio_mem_save_state_early(QEMUFile *file, void *opaque)
> > > > {
> > > > }
> > > > 
> > > > Then:
> > > > 
> > > > static const SaveVMHandlers vmstate_virtio_mem_device_early_ops = {
> > > >       .save_setup = virtio_mem_save_setup_early,
> > > >       .save_state = virtio_mem_save_state_early,
> > > >       .load_state = virtio_mem_load_state_early,
> > > > };
> > > > 
> > > > I'm not 100% sure it'll work yet, but maybe worth trying.
> > > 
> > > It doesn't. virtio_mem_load_state_early() will get called twice (once with
> > > state saved during save_setup() and once with effectively nothing during
> > > save_state(), which breaks the whole migration).
> > 
> > Oh hold on.. so load_state() to pair save_setup()? Maybe you should pair it
> > with load_setup(), which I just noticed..  Then the load_state() needs to
> > be another shim like save_state().
> 
> Let me see if that improves the situation. E.g., ram.c writes state in
> save_setup() but doesn't load any state in load_setup() -- it's all done in
> load_state().
> 
> ... no, still not working. It will read garbage during load_setup() now.
> 
> qemu-system-x86_64: Property 'memaddr' changed from 0x100000002037261 to
> 0x140000000
> qemu-system-x86_64: Failed to load virtio-mem-device-early:tmp
> qemu-system-x86_64: Load state of device 0000:00:03.0/virtio-mem-early
> failed
> qemu-system-x86_64: load of migration failed: Invalid argument
> 
> 
> I don't think that load_setup() is supposed to consume anything useful from
> the migration stream. I suspect it should actually not even take a QEMUFile
> right now, because the way it's used is just for initializing stuff.
> 
> qemu_loadvm_state_main() does the actual loading part, parsing sections etc.
> qemu_loadvm_state_setup() doesn't do any of that.
> 
> AFAICS, at least qemu_loadvm_state_setup() would have to parse sections, and
> the save_setup() users would have to be converted to using load_setup() as
> well. Not sure if more is missing.

Ouch, yeah, load_setup() is unfortunate. :(

I think load_setup() should be named load_init() without having the
qemufile at all.  But let's keep that aside for now, and see what other
option we have..

> 
> Even with that, I doubt that it would make analyze-migration.py work,
> because what we store is something different than what we record in the
> vmdesc.
> 
> > 
> > > 
> > > vmdesc handling is also wrong, because analyze-migration.py gets confused
> > > because it receives data stored during save_setup() but vmdesc created
> > > during save_state() was told that there would be "nothing" to interpret ...
> > > 
> > > $ ./scripts/analyze-migration.py -f STATEFILE
> > > {
> > >      "ram (2)": {
> > >          "section sizes": {
> > >              "0000:00:03.0/mem0": "0x0000000f00000000",
> > >              "pc.ram": "0x0000000100000000",
> > >              "/rom@etc/acpi/tables": "0x0000000000020000",
> > >              "pc.bios": "0x0000000000040000",
> > >              "0000:00:02.0/e1000.rom": "0x0000000000040000",
> > >              "pc.rom": "0x0000000000020000",
> > >              "/rom@etc/table-loader": "0x0000000000001000",
> > >              "/rom@etc/acpi/rsdp": "0x0000000000001000"
> > >          }
> > >      },
> > >      "0000:00:03.0/virtio-mem-early (51)": {
> > >          "data": ""
> > >      }
> > > }
> > 
> > Yeah, this is expected, because the whole data stream within the setup
> > phase is a black box and not described by anything.  "ram" can display
> > correctly only because it's hard coded in the python script, I think.  The
> > trick above only keeps the script from breaking; it doesn't teach the
> > script to also check for the early vmsd.
> 
> Note that the issue here is that the script stops the output after the
> virtio-mem device. So I'm not complaining about the "data": "", but about
> the vmdesc claiming that the vmstate has no other devices :)
> 
> > 
> > But that's one step forward and IMHO it's fine for special devices. For
> > example, even with your initial patch, I think the static analyzer (aka,
> > qemu -dump-vmstate) will also ignore your new vmsd anyway because it's not
> > directly bound to the DeviceState* but registered separately.  So no ideal
> > solution so far, afaict, without more work done on this aspect.
> > 
> > > 
> > > 
> > > Not sure if the whole thing becomes nicer when manually looking up the
> > > vmdesc ... because filling it also requires manually storing the
> > > se->idstr and the se->instance_id, both of which are migration-internal
> > > and not available inside save_setup().
> > > 
> > > 
> > > Hm, I really prefer something like the simplified version that lets the
> > > migration core deal with vmstate to be migrated during the save_setup()
> > > phase. We could avoid the vmstate->immutable flag and simply use a
> > > separate function for registering the vmstate, like:
> > > 
> > > vmstate_register_immutable()
> > > vmstate_register_early()
> > > ...
> > 
> > I agree, this looks useful.  It's just that the devices that need this will
> > be rare, and unfortunately some of them already implemented the stream in
> > other ways so maybe non-trivial to convert them.
> 
> Right, I agree about the "rare" part and that conversion might be hard,
> because they are not using a vmstate descriptor.
> 
> The only way to avoid that is moving away from using a vmstate descriptor
> and storing everything manually in virtio-mem code. Not particularly happy
> about that, but it's the only feasible option I can see that doesn't touch
> migration code.
> 
> > 
> > Not against it if you want to keep exploring, but if so please avoid using
> > the priority field, also I'd hope the early vmsd will be saved within
> > qemu_savevm_state_setup() so maybe it can be another alternative to
> > save_setup().
> > 
> > Maybe it's still easier we keep going with the save_setup() and the shim
> > functions above, if it works for you.
> 
> I'm happy to go down the path you suggested if we can make it work. I'd be
> happy to have something "reasonable" for the virtio-mem device in the
> analyze-migration.py output. But I could live with just nothing useful for
> the device itself -- as long as at least the other devices still show up.

So, let's see whether we can go back to the load_state() approach..

static const SaveVMHandlers vmstate_virtio_mem_device_early_ops = {
      .save_setup = virtio_mem_save_setup_early,
      .save_state = virtio_mem_save_state_early,
      .load_state = virtio_mem_load_state_early,
};

vfio handled it using a single header flag for either save_setup() or
save_state(), then load_state() identifies it:

    data = qemu_get_be64(f);
    ...
        switch (data) {
        case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
        ...
        case VFIO_MIG_FLAG_DEV_SETUP_STATE:
        ...

Maybe play the same trick here?  The flag needs to be hard coded but at
least not the vmsd.  Based on previous experience, I could have missed
something else, though. :)
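
E.g. (sketch only; flag values made up here, mirroring the VFIO ones):

#define VIRTIO_MEM_MIG_FLAG_SETUP_STATE 0xef10000000000001ULL
#define VIRTIO_MEM_MIG_FLAG_DEV_STATE   0xef10000000000002ULL

static int virtio_mem_save_setup_early(QEMUFile *f, void *opaque)
{
    qemu_put_be64(f, VIRTIO_MEM_MIG_FLAG_SETUP_STATE);
    /* ... then the actual early state ... */
    return 0;
}

static void virtio_mem_save_state_early(QEMUFile *f, void *opaque)
{
    /* Marker only, so the "devices" section stays well-formed. */
    qemu_put_be64(f, VIRTIO_MEM_MIG_FLAG_DEV_STATE);
}

static int virtio_mem_load_state_early(QEMUFile *f, void *opaque,
                                       int version_id)
{
    uint64_t data = qemu_get_be64(f);

    switch (data) {
    case VIRTIO_MEM_MIG_FLAG_SETUP_STATE:
        /* ... load the actual early state ... */
        return 0;
    case VIRTIO_MEM_MIG_FLAG_DEV_STATE:
        /* Nothing else was saved. */
        return 0;
    default:
        return -EINVAL;
    }
}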

Or if you'd like to go the other way, I'm fine too.

Thanks,
David Hildenbrand Jan. 11, 2023, 5:44 p.m. UTC | #16
>>>
>>> Not against it if you want to keep exploring, but if so please avoid using
>>> the priority field, also I'd hope the early vmsd will be saved within
>>> qemu_savevm_state_setup() so maybe it can be another alternative to
>>> save_setup().
>>>
>>> Maybe it's still easier we keep going with the save_setup() and the shim
>>> functions above, if it works for you.
>>
>> I'll happy to go the path you suggested if we can make it work. I'd be happy
>> to have something "reasonable" for the virtio-mem device in the
>> analyze-migration.py output. But I could live with just nothing useful for
>> the device itself -- as long as at least the other devices still show up.
> 
> So, let's see whether we can go back to the load_state() approach..
> 
> static const SaveVMHandlers vmstate_virtio_mem_device_early_ops = {
>        .save_setup = virtio_mem_save_setup_early,
>        .save_state = virtio_mem_save_state_early,
>        .load_state = virtio_mem_load_state_early,
> };
> 
> vfio handled it using a single header flag for either save_setup() or
> save_state(), then load_state() identifies it:
> 
>      data = qemu_get_be64(f);
>      ...
>          switch (data) {
>          case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>          ...
>          case VFIO_MIG_FLAG_DEV_SETUP_STATE:
>          ...
> 
> Maybe play the same trick here?  The flag needs to be hard coded but at
> least not the vmsd.  Based on previous experience, I could have missed
> something else, though. :)

I could also remember "internally" whether load_state() was already 
called throughout the migration, I think.
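
Sketch -- "early_state_loaded" being a new, made-up field in the device 
state:

static int virtio_mem_load_state_early(QEMUFile *f, void *opaque,
                                       int version_id)
{
    VirtIOMEM *vmem = opaque;

    /* The second invocation comes via save_state(), which saved nothing. */
    if (vmem->early_state_loaded) {
        return 0;
    }
    vmem->early_state_loaded = true;
    /* ... load the actual early state ... */
    return 0;
}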

But, IIUC, it will still make analyze-migration.py produce wrong 
results, because of the vmdesc mismatch.

> 
> Or if you'd like to go the other way, I'm fine too.

IMHO my approach will be cleaner on the device side but will require 
migration code changes. I'll try getting that as clean as possible for 
now and resend. If there are more ideas on how to get the other approach 
running, I'll be happy to try.

Thanks!
diff mbox series

Patch

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index ad24aa1934..79eb2409a2 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -156,6 +156,13 @@  typedef enum {
     MIG_PRI_VIRTIO_MEM,         /* Must happen before IOMMU */
     MIG_PRI_GICV3_ITS,          /* Must happen before PCI devices */
     MIG_PRI_GICV3,              /* Must happen before the ITS */
+    /*
+     * Must happen before all other devices (iterable and non-iterable),
+     * especially before migrating RAM content. Such device state must be
+     * guaranteed to be immutable on the migration source until migration
+     * ends and must not depend on the CPU state to be synchronized.
+     */
+    MIG_PRI_POST_SETUP,
     MIG_PRI_MAX,
 } MigrationPriority;
 
diff --git a/migration/migration.c b/migration/migration.c
index 52b5d39244..78b6bb8765 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2170,6 +2170,9 @@  void migrate_init(MigrationState *s)
     s->vm_was_running = false;
     s->iteration_initial_bytes = 0;
     s->threshold_size = 0;
+
+    json_writer_free(s->vmdesc);
+    s->vmdesc = NULL;
 }
 
 int migrate_add_blocker_internal(Error *reason, Error **errp)
@@ -3997,6 +4000,9 @@  static void *migration_thread(void *opaque)
 
     trace_migration_thread_setup_complete();
 
+    /* Process early data that has to get migrated before iterating. */
+    qemu_savevm_state_start_precopy(s->to_dst_file);
+
     while (migration_is_active(s)) {
         if (urgent || !qemu_file_rate_limit(s->to_dst_file)) {
             MigIterateState iter_state = migration_iteration_run(s);
@@ -4125,6 +4131,12 @@  static void *bg_migration_thread(void *opaque)
     if (vm_stop_force_state(RUN_STATE_PAUSED)) {
         goto fail;
     }
+
+    /* Migrate early data that would usually get migrated before iterating. */
+    if (qemu_savevm_state_start_precopy(fb)) {
+        goto fail;
+    }
+
     /*
      * Put vCPUs in sync with shadow context structures, then
      * save their state to channel-buffer along with devices.
@@ -4445,6 +4457,7 @@  static void migration_instance_finalize(Object *obj)
     qemu_sem_destroy(&ms->rp_state.rp_sem);
     qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
     error_free(ms->error);
+    json_writer_free(ms->vmdesc);
 }
 
 static void migration_instance_init(Object *obj)
diff --git a/migration/migration.h b/migration/migration.h
index ae4ffd3454..66511ce532 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -17,6 +17,7 @@ 
 #include "exec/cpu-common.h"
 #include "hw/qdev-core.h"
 #include "qapi/qapi-types-migration.h"
+#include "qapi/qmp/json-writer.h"
 #include "qemu/thread.h"
 #include "qemu/coroutine_int.h"
 #include "io/channel.h"
@@ -366,6 +367,9 @@  struct MigrationState {
      * This save hostname when out-going migration starts
      */
     char *hostname;
+
+    /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
+    JSONWriter *vmdesc;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
diff --git a/migration/savevm.c b/migration/savevm.c
index a0cdb714f7..b810409574 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -42,7 +42,6 @@ 
 #include "postcopy-ram.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-migration.h"
-#include "qapi/qmp/json-writer.h"
 #include "qapi/clone-visitor.h"
 #include "qapi/qapi-builtin-visit.h"
 #include "qapi/qmp/qerror.h"
@@ -1325,6 +1324,71 @@  void qemu_savevm_state_complete_postcopy(QEMUFile *f)
     qemu_fflush(f);
 }
 
+static int qemu_savevm_state_precopy_one_non_iterable(SaveStateEntry *se,
+                                                      QEMUFile *f,
+                                                      JSONWriter *vmdesc)
+{
+    int ret;
+
+    if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
+        return 0;
+    }
+    if (se->vmsd && !vmstate_save_needed(se->vmsd, se->opaque)) {
+        trace_savevm_section_skip(se->idstr, se->section_id);
+        return 0;
+    }
+
+    trace_savevm_section_start(se->idstr, se->section_id);
+
+    json_writer_start_object(vmdesc, NULL);
+    json_writer_str(vmdesc, "name", se->idstr);
+    json_writer_int64(vmdesc, "instance_id", se->instance_id);
+
+    save_section_header(f, se, QEMU_VM_SECTION_FULL);
+    ret = vmstate_save(f, se, vmdesc);
+    if (ret) {
+        qemu_file_set_error(f, ret);
+        return ret;
+    }
+    trace_savevm_section_end(se->idstr, se->section_id, 0);
+    save_section_footer(f, se);
+
+    json_writer_end_object(vmdesc);
+    return 0;
+}
+
+int qemu_savevm_state_start_precopy(QEMUFile *f)
+{
+    MigrationState *ms = migrate_get_current();
+    JSONWriter *vmdesc;
+    SaveStateEntry *se;
+    int ret;
+
+    assert(!ms->vmdesc);
+    ms->vmdesc = vmdesc = json_writer_new(false);
+    json_writer_start_object(vmdesc, NULL);
+    json_writer_int64(vmdesc, "page_size", qemu_target_page_size());
+    json_writer_start_array(vmdesc, "devices");
+
+    /*
+     * Only immutable non-iterable device state is expected to be saved this
+     * early. All remaining (ordinary) non-iterable device state will be saved
+     * in qemu_savevm_state_complete_precopy_non_iterable().
+     */
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (save_state_priority(se) < MIG_PRI_POST_SETUP) {
+            continue;
+        }
+
+        ret = qemu_savevm_state_precopy_one_non_iterable(se, f, vmdesc);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 static
 int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
 {
@@ -1364,41 +1428,24 @@  int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
                                                     bool in_postcopy,
                                                     bool inactivate_disks)
 {
-    g_autoptr(JSONWriter) vmdesc = NULL;
+    MigrationState *ms = migrate_get_current();
+    JSONWriter *vmdesc = ms->vmdesc;
     int vmdesc_len;
     SaveStateEntry *se;
     int ret;
 
-    vmdesc = json_writer_new(false);
-    json_writer_start_object(vmdesc, NULL);
-    json_writer_int64(vmdesc, "page_size", qemu_target_page_size());
-    json_writer_start_array(vmdesc, "devices");
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    /* qemu_savevm_state_start_precopy() is expected to be called first. */
+    assert(vmdesc);
 
-        if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
-            continue;
-        }
-        if (se->vmsd && !vmstate_save_needed(se->vmsd, se->opaque)) {
-            trace_savevm_section_skip(se->idstr, se->section_id);
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (save_state_priority(se) >= MIG_PRI_POST_SETUP) {
             continue;
         }
 
-        trace_savevm_section_start(se->idstr, se->section_id);
-
-        json_writer_start_object(vmdesc, NULL);
-        json_writer_str(vmdesc, "name", se->idstr);
-        json_writer_int64(vmdesc, "instance_id", se->instance_id);
-
-        save_section_header(f, se, QEMU_VM_SECTION_FULL);
-        ret = vmstate_save(f, se, vmdesc);
+        ret = qemu_savevm_state_precopy_one_non_iterable(se, f, vmdesc);
         if (ret) {
-            qemu_file_set_error(f, ret);
             return ret;
         }
-        trace_savevm_section_end(se->idstr, se->section_id, 0);
-        save_section_footer(f, se);
-
-        json_writer_end_object(vmdesc);
     }
 
     if (inactivate_disks) {
@@ -1427,6 +1474,10 @@  int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
         qemu_put_buffer(f, (uint8_t *)json_writer_get(vmdesc), vmdesc_len);
     }
 
+    /* Free it now to detect any inconsistencies. */
+    json_writer_free(vmdesc);
+    ms->vmdesc = NULL;
+
     return 0;
 }
 
@@ -1541,6 +1592,9 @@  static int qemu_savevm_state(QEMUFile *f, Error **errp)
     qemu_savevm_state_setup(f);
     qemu_mutex_lock_iothread();
 
+    /* Process early data that has to get migrated before iterating. */
+    qemu_savevm_state_start_precopy(f);
+
     while (qemu_file_get_error(f) == 0) {
         if (qemu_savevm_state_iterate(f, false) > 0) {
             break;
diff --git a/migration/savevm.h b/migration/savevm.h
index 6461342cb4..323bd5ab3b 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -38,6 +38,7 @@  void qemu_savevm_state_header(QEMUFile *f);
 int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
 void qemu_savevm_state_cleanup(void);
 void qemu_savevm_state_complete_postcopy(QEMUFile *f);
+int qemu_savevm_state_start_precopy(QEMUFile *f);
 int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
                                        bool inactivate_disks);
 void qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size,
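
For context, a device opts in by giving its vmsd the new priority -- a
minimal sketch (not part of this patch; the actual virtio-mem wiring
happens in a later patch of the series):

static const VMStateDescription vmstate_virtio_mem_device_early = {
    .name = "virtio-mem-device-early",
    .priority = MIG_PRI_POST_SETUP,
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        /* ... immutable device state only ... */
        VMSTATE_END_OF_LIST()
    },
};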