Message ID | 1727725244-105198-5-git-send-email-steven.sistare@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Live update: cpr-transfer | expand |
On Mon, Sep 30, 2024 at 12:40:35PM -0700, Steve Sistare wrote: > Stop the vm earlier for cpr, to guarantee consistent device state when > CPR state is saved. Could you add some more info on why this order matters? E.g., qmp_migrate should switch migration state machine to SETUP, while this path holds BQL, I think it means there's no way devices got hot added concurrently of the whole process. Would other things change in the cpr states (name, fd, etc.)? It'll be great to mention these details in the commit message. Thanks, > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com> > --- > migration/migration.c | 22 +++++++++++++--------- > 1 file changed, 13 insertions(+), 9 deletions(-) > > diff --git a/migration/migration.c b/migration/migration.c > index df00e5c..868bf0e 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -2082,6 +2082,7 @@ void qmp_migrate(const char *uri, bool has_channels, > MigrationState *s = migrate_get_current(); > g_autoptr(MigrationChannel) channel = NULL; > MigrationAddress *addr = NULL; > + bool stopped = false; > > /* > * Having preliminary checks for uri and channel > @@ -2125,6 +2126,15 @@ void qmp_migrate(const char *uri, bool has_channels, > } > } > > + if (migrate_mode_is_cpr(s)) { > + int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE); > + if (ret < 0) { > + error_setg(&local_err, "migration_stop_vm failed, error %d", -ret); > + goto out; > + } > + stopped = true; > + } > + > if (cpr_state_save(&local_err)) { > goto out; > } > @@ -2160,6 +2170,9 @@ out: > } > migrate_fd_error(s, local_err); > error_propagate(errp, local_err); > + if (stopped) { > + vm_resume(s->vm_old_state); > + } > return; > } > } > @@ -3743,7 +3756,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) > Error *local_err = NULL; > uint64_t rate_limit; > bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP); > - int ret; > > /* > * If there's a previous error, free it and prepare for another one. > @@ -3815,14 +3827,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) > return; > } > > - if (migrate_mode_is_cpr(s)) { > - ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE); > - if (ret < 0) { > - error_setg(&local_err, "migration_stop_vm failed, error %d", -ret); > - goto fail; > - } > - } > - > if (migrate_background_snapshot()) { > qemu_thread_create(&s->thread, "mig/snapshot", > bg_migration_thread, s, QEMU_THREAD_JOINABLE); > -- > 1.8.3.1 >
On 10/7/2024 11:27 AM, Peter Xu wrote: > On Mon, Sep 30, 2024 at 12:40:35PM -0700, Steve Sistare wrote: >> Stop the vm earlier for cpr, to guarantee consistent device state when >> CPR state is saved. > > Could you add some more info on why this order matters? > > E.g., qmp_migrate should switch migration state machine to SETUP, while > this path holds BQL, I think it means there's no way devices got hot added > concurrently of the whole process. > > Would other things change in the cpr states (name, fd, etc.)? It'll be > great to mention these details in the commit message. Because of the new cpr-state save operation needed by this mode, I created this patch to be future proof. Performing a save operation while the machine is running is asking for trouble. But right now, I am not aware of any specific issues. Later in the "tap and vhost" series there is another reason to stop the vm here and save cpr state, because the devices must be stopped in old qemu before they are initialized in new qemu. If you are curious, see the 2 patches I attached to the email at https://lore.kernel.org/qemu-devel/fa95c40d-b5e5-41eb-bba7-7842bca2f73e@oracle.com/ But, that has nothing to do with the contents of cpr state. - Steve >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> >> --- >> migration/migration.c | 22 +++++++++++++--------- >> 1 file changed, 13 insertions(+), 9 deletions(-) >> >> diff --git a/migration/migration.c b/migration/migration.c >> index df00e5c..868bf0e 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -2082,6 +2082,7 @@ void qmp_migrate(const char *uri, bool has_channels, >> MigrationState *s = migrate_get_current(); >> g_autoptr(MigrationChannel) channel = NULL; >> MigrationAddress *addr = NULL; >> + bool stopped = false; >> >> /* >> * Having preliminary checks for uri and channel >> @@ -2125,6 +2126,15 @@ void qmp_migrate(const char *uri, bool has_channels, >> } >> } >> >> + if (migrate_mode_is_cpr(s)) { >> + int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE); >> + if (ret < 0) { >> + error_setg(&local_err, "migration_stop_vm failed, error %d", -ret); >> + goto out; >> + } >> + stopped = true; >> + } >> + >> if (cpr_state_save(&local_err)) { >> goto out; >> } >> @@ -2160,6 +2170,9 @@ out: >> } >> migrate_fd_error(s, local_err); >> error_propagate(errp, local_err); >> + if (stopped) { >> + vm_resume(s->vm_old_state); >> + } >> return; >> } >> } >> @@ -3743,7 +3756,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) >> Error *local_err = NULL; >> uint64_t rate_limit; >> bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP); >> - int ret; >> >> /* >> * If there's a previous error, free it and prepare for another one. >> @@ -3815,14 +3827,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) >> return; >> } >> >> - if (migrate_mode_is_cpr(s)) { >> - ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE); >> - if (ret < 0) { >> - error_setg(&local_err, "migration_stop_vm failed, error %d", -ret); >> - goto fail; >> - } >> - } >> - >> if (migrate_background_snapshot()) { >> qemu_thread_create(&s->thread, "mig/snapshot", >> bg_migration_thread, s, QEMU_THREAD_JOINABLE); >> -- >> 1.8.3.1 >> >
On Mon, Oct 07, 2024 at 04:52:43PM -0400, Steven Sistare wrote: > On 10/7/2024 11:27 AM, Peter Xu wrote: > > On Mon, Sep 30, 2024 at 12:40:35PM -0700, Steve Sistare wrote: > > > Stop the vm earlier for cpr, to guarantee consistent device state when > > > CPR state is saved. > > > > Could you add some more info on why this order matters? > > > > E.g., qmp_migrate should switch migration state machine to SETUP, while > > this path holds BQL, I think it means there's no way devices got hot added > > concurrently of the whole process. > > > > Would other things change in the cpr states (name, fd, etc.)? It'll be > > great to mention these details in the commit message. > > Because of the new cpr-state save operation needed by this mode, > I created this patch to be future proof. Performing a save operation while > the machine is running is asking for trouble. But right now, I am not aware > of any specific issues. > > Later in the "tap and vhost" series there is another reason to stop the vm here and > save cpr state, because the devices must be stopped in old qemu before they > are initialized in new qemu. If you are curious, see the 2 patches I attached > to the email at > https://lore.kernel.org/qemu-devel/fa95c40d-b5e5-41eb-bba7-7842bca2f73e@oracle.com/ > But, that has nothing to do with the contents of cpr state. Then I suggest we leave this patch to the vhost/tap series, then please document clearly in the commit mesasge on why this is needed. Linking to that discussion thread could work too. Side note: I saw you have MIG_EVENT_PRECOPY_CPR_SETUP in you own tree, I wonder whether we could reuse MIG_EVENT_PRECOPY_SETUP by moving it earlier in qmp_migrate(). After all CPR-* notifiers are already registered separately with the list of migration_state_notifiers[], so I suppose it'll service the same purpose. But we can discuss that later.
On 10/8/2024 11:35 AM, Peter Xu wrote: > On Mon, Oct 07, 2024 at 04:52:43PM -0400, Steven Sistare wrote: >> On 10/7/2024 11:27 AM, Peter Xu wrote: >>> On Mon, Sep 30, 2024 at 12:40:35PM -0700, Steve Sistare wrote: >>>> Stop the vm earlier for cpr, to guarantee consistent device state when >>>> CPR state is saved. >>> >>> Could you add some more info on why this order matters? >>> >>> E.g., qmp_migrate should switch migration state machine to SETUP, while >>> this path holds BQL, I think it means there's no way devices got hot added >>> concurrently of the whole process. >>> >>> Would other things change in the cpr states (name, fd, etc.)? It'll be >>> great to mention these details in the commit message. >> >> Because of the new cpr-state save operation needed by this mode, >> I created this patch to be future proof. Performing a save operation while >> the machine is running is asking for trouble. But right now, I am not aware >> of any specific issues. >> >> Later in the "tap and vhost" series there is another reason to stop the vm here and >> save cpr state, because the devices must be stopped in old qemu before they >> are initialized in new qemu. If you are curious, see the 2 patches I attached >> to the email at >> https://lore.kernel.org/qemu-devel/fa95c40d-b5e5-41eb-bba7-7842bca2f73e@oracle.com/ >> But, that has nothing to do with the contents of cpr state. > > Then I suggest we leave this patch to the vhost/tap series, then please > document clearly in the commit mesasge on why this is needed. Linking to > that discussion thread could work too. OK. > Side note: I saw you have MIG_EVENT_PRECOPY_CPR_SETUP in you own tree, I > wonder whether we could reuse MIG_EVENT_PRECOPY_SETUP by moving it earlier > in qmp_migrate(). After all CPR-* notifiers are already registered > separately with the list of migration_state_notifiers[], so I suppose it'll > service the same purpose. But we can discuss that later. Sure, we can discuss later (and I'll take another look before posting the vhost/tap series). - Steve
diff --git a/migration/migration.c b/migration/migration.c index df00e5c..868bf0e 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -2082,6 +2082,7 @@ void qmp_migrate(const char *uri, bool has_channels, MigrationState *s = migrate_get_current(); g_autoptr(MigrationChannel) channel = NULL; MigrationAddress *addr = NULL; + bool stopped = false; /* * Having preliminary checks for uri and channel @@ -2125,6 +2126,15 @@ void qmp_migrate(const char *uri, bool has_channels, } } + if (migrate_mode_is_cpr(s)) { + int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE); + if (ret < 0) { + error_setg(&local_err, "migration_stop_vm failed, error %d", -ret); + goto out; + } + stopped = true; + } + if (cpr_state_save(&local_err)) { goto out; } @@ -2160,6 +2170,9 @@ out: } migrate_fd_error(s, local_err); error_propagate(errp, local_err); + if (stopped) { + vm_resume(s->vm_old_state); + } return; } } @@ -3743,7 +3756,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) Error *local_err = NULL; uint64_t rate_limit; bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP); - int ret; /* * If there's a previous error, free it and prepare for another one. @@ -3815,14 +3827,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) return; } - if (migrate_mode_is_cpr(s)) { - ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE); - if (ret < 0) { - error_setg(&local_err, "migration_stop_vm failed, error %d", -ret); - goto fail; - } - } - if (migrate_background_snapshot()) { qemu_thread_create(&s->thread, "mig/snapshot", bg_migration_thread, s, QEMU_THREAD_JOINABLE);
Stop the vm earlier for cpr, to guarantee consistent device state when CPR state is saved. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> --- migration/migration.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-)