Message ID | 1465409584-16308-1-git-send-email-haris.phnx@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 06/08/2016 12:13 PM, Md Haris Iqbal wrote: The subject line is long, and has a typo (s/incase/in case/). Also, the mailing list automatically prepends [Qemu-devel], so you shouldn't repeat it manually. Better might have been a short subject line then a longer commit body: migration: keep source alive on network failure Details about what was failing, and why this code improves it Missing a Signed-off-by: attribution; without that, we can't take it. > --- You marked this patch as v2, but in the same minute sent another email with subject line v1, and didn't say what changed to need a v2. Here after the --- divider is a good place for that. > include/migration/migration.h | 1 + > migration/migration.c | 76 ++++++++++++++++++++++++++++++++++++++++--- > qapi-schema.json | 11 +++++-- > vl.c | 4 +++ > 4 files changed, 85 insertions(+), 7 deletions(-) > > @@ -1726,11 +1755,32 @@ static void *migration_thread(void *opaque) > } > } > > - if (qemu_file_get_error(s->to_dst_file)) { > - migrate_set_state(&s->state, current_active_state, > - MIGRATION_STATUS_FAILED); > - trace_migration_thread_file_err(); > - break; > + if ((ret = qemu_file_get_error(s->to_dst_file))) { > + fprintf(stderr, "1 : Error %s %d\n", strerror(-ret), -ret); fprintf() is rather awkward for errors; can we use qemu's Error mechanism? > + > + /* This check is based on how the error is set during the network > + * recv(). When recv() returns 0 (i.e. no data to read), the error > + * is set to -EIO. For all other network errors, it is set > + * according to the return value received. > + */ > + if (ret != -EIO && s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) { > + /* Network Failure during postcopy */ > + > + current_active_state = MIGRATION_STATUS_POSTCOPY_RECOVERY; > + runstate_set(RUN_STATE_POSTMIGRATE_RECOVERY); > + fprintf(stderr, "1.1 : Error %s %d\n", strerror(-ret), -ret); Does the end user really need to see "1.1 :" > +++ b/qapi-schema.json > @@ -154,12 +154,14 @@ > # @watchdog: the watchdog action is configured to pause and has been triggered > # > # @guest-panicked: guest has been panicked as a result of guest OS panic > +# > +# @postmigrate-recovery: guest is paused for recovery after a network failure Not your fault that the overall enum is missing an overall line: # Since: 1.4 nor that guest-panicked is missing a "(since 1.5)" hint, but at least your addition should have a "(since 2.7)" hint. > ## > { 'enum': 'RunState', > 'data': [ 'debug', 'inmigrate', 'internal-error', 'io-error', 'paused', > 'postmigrate', 'prelaunch', 'finish-migrate', 'restore-vm', > 'running', 'save-vm', 'shutdown', 'suspended', 'watchdog', > - 'guest-panicked' ] } > + 'guest-panicked', 'postmigrate-recovery' ] } Adding new enums can cause existing clients like libvirt to do weird things if they aren't expecting the new state. Are we sure we want to do it? Is it a state that cannot be entered by default, but only in response to a client request that proves the client is new enough to expect the new state? > > ## > # @StatusInfo: > @@ -434,12 +436,15 @@ > # > # @failed: some error occurred during migration process. > # > +# @postcopy-recovery: in recovery mode, after a network failure. > +# Missing a "(since 2.7)" hint. > # Since: 2.3 > # > ## > { 'enum': 'MigrationStatus', > 'data': [ 'none', 'setup', 'cancelling', 'cancelled', > - 'active', 'postcopy-active', 'completed', 'failed' ] } > + 'active', 'postcopy-active', 'completed', 'failed', > + 'postcopy-recovery' ] } > > ## > # @MigrationInfo > @@ -2058,6 +2063,8 @@ > # > # @uri: the Uniform Resource Identifier of the destination VM > # > +# @recover: #optional recover from a broken migration > +# I don't see any 'recover' parameter added to the 'migrate' command to match this added documentation. > # @blk: #optional do block migration (full disk copy) > # > # @inc: #optional incremental disk copy migration > diff --git a/vl.c b/vl.c > index 5fd22cb..c237140 100644 > --- a/vl.c > +++ b/vl.c > @@ -618,6 +618,10 @@ static const RunStateTransition runstate_transitions_def[] = { > { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING }, > { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE }, > { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PRELAUNCH }, > + { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE_RECOVERY }, > + > + { RUN_STATE_POSTMIGRATE_RECOVERY, RUN_STATE_FINISH_MIGRATE }, > + { RUN_STATE_POSTMIGRATE_RECOVERY, RUN_STATE_SHUTDOWN }, > > { RUN_STATE_RESTORE_VM, RUN_STATE_RUNNING }, > { RUN_STATE_RESTORE_VM, RUN_STATE_PRELAUNCH }, >
On Thu, Jun 9, 2016 at 2:11 AM, Eric Blake <eblake@redhat.com> wrote: > On 06/08/2016 12:13 PM, Md Haris Iqbal wrote: > > The subject line is long, and has a typo (s/incase/in case/). Also, the > mailing list automatically prepends [Qemu-devel], so you shouldn't > repeat it manually. Better might have been a short subject line then a > longer commit body: > > migration: keep source alive on network failure > > Details about what was failing, and why this code improves it Yes, I will add details and will take care while writing commit messages. > > Missing a Signed-off-by: attribution; without that, we can't take it. I will add it from next time. > >> --- > > You marked this patch as v2, but in the same minute sent another email > with subject line v1, and didn't say what changed to need a v2. Here > after the --- divider is a good place for that. The other patch is different then the one I posted with v1. There was another patch which I posted few days back, which wa v1 of this patch. I should have pointed out what has changed though. > >> include/migration/migration.h | 1 + >> migration/migration.c | 76 ++++++++++++++++++++++++++++++++++++++++--- >> qapi-schema.json | 11 +++++-- >> vl.c | 4 +++ >> 4 files changed, 85 insertions(+), 7 deletions(-) >> > >> @@ -1726,11 +1755,32 @@ static void *migration_thread(void *opaque) >> } >> } >> >> - if (qemu_file_get_error(s->to_dst_file)) { >> - migrate_set_state(&s->state, current_active_state, >> - MIGRATION_STATUS_FAILED); >> - trace_migration_thread_file_err(); >> - break; >> + if ((ret = qemu_file_get_error(s->to_dst_file))) { >> + fprintf(stderr, "1 : Error %s %d\n", strerror(-ret), -ret); > > fprintf() is rather awkward for errors; can we use qemu's Error mechanism? Just something I am using to debug. It will be absent in the final version of the patch. > >> + >> + /* This check is based on how the error is set during the network >> + * recv(). When recv() returns 0 (i.e. no data to read), the error >> + * is set to -EIO. For all other network errors, it is set >> + * according to the return value received. >> + */ >> + if (ret != -EIO && s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) { >> + /* Network Failure during postcopy */ >> + >> + current_active_state = MIGRATION_STATUS_POSTCOPY_RECOVERY; >> + runstate_set(RUN_STATE_POSTMIGRATE_RECOVERY); >> + fprintf(stderr, "1.1 : Error %s %d\n", strerror(-ret), -ret); > > Does the end user really need to see "1.1 :" Just a debugging output. > > >> +++ b/qapi-schema.json >> @@ -154,12 +154,14 @@ >> # @watchdog: the watchdog action is configured to pause and has been triggered >> # >> # @guest-panicked: guest has been panicked as a result of guest OS panic >> +# >> +# @postmigrate-recovery: guest is paused for recovery after a network failure > > Not your fault that the overall enum is missing an overall line: > > # Since: 1.4 > > nor that guest-panicked is missing a "(since 1.5)" hint, but at least > your addition should have a "(since 2.7)" hint. Added. > >> ## >> { 'enum': 'RunState', >> 'data': [ 'debug', 'inmigrate', 'internal-error', 'io-error', 'paused', >> 'postmigrate', 'prelaunch', 'finish-migrate', 'restore-vm', >> 'running', 'save-vm', 'shutdown', 'suspended', 'watchdog', >> - 'guest-panicked' ] } >> + 'guest-panicked', 'postmigrate-recovery' ] } > > Adding new enums can cause existing clients like libvirt to do weird > things if they aren't expecting the new state. Are we sure we want to do > it? I think so. If we do not have a new state, then one would not know that the VM is in recovery. > Is it a state that cannot be entered by default, but only in > response to a client request that proves the client is new enough to > expect the new state? I did not quite understand what you are trying to say. > >> >> ## >> # @StatusInfo: >> @@ -434,12 +436,15 @@ >> # >> # @failed: some error occurred during migration process. >> # >> +# @postcopy-recovery: in recovery mode, after a network failure. >> +# > > Missing a "(since 2.7)" hint. Added. > >> # Since: 2.3 >> # >> ## >> { 'enum': 'MigrationStatus', >> 'data': [ 'none', 'setup', 'cancelling', 'cancelled', >> - 'active', 'postcopy-active', 'completed', 'failed' ] } >> + 'active', 'postcopy-active', 'completed', 'failed', >> + 'postcopy-recovery' ] } >> >> ## >> # @MigrationInfo >> @@ -2058,6 +2063,8 @@ >> # >> # @uri: the Uniform Resource Identifier of the destination VM >> # >> +# @recover: #optional recover from a broken migration >> +# > > I don't see any 'recover' parameter added to the 'migrate' command to > match this added documentation. That was a mistake. This was suppose to go to the other patch I posted this same minute. Just missed splitting it out. > >> # @blk: #optional do block migration (full disk copy) >> # >> # @inc: #optional incremental disk copy migration >> diff --git a/vl.c b/vl.c >> index 5fd22cb..c237140 100644 >> --- a/vl.c >> +++ b/vl.c >> @@ -618,6 +618,10 @@ static const RunStateTransition runstate_transitions_def[] = { >> { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING }, >> { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE }, >> { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PRELAUNCH }, >> + { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE_RECOVERY }, >> + >> + { RUN_STATE_POSTMIGRATE_RECOVERY, RUN_STATE_FINISH_MIGRATE }, >> + { RUN_STATE_POSTMIGRATE_RECOVERY, RUN_STATE_SHUTDOWN }, >> >> { RUN_STATE_RESTORE_VM, RUN_STATE_RUNNING }, >> { RUN_STATE_RESTORE_VM, RUN_STATE_PRELAUNCH }, >> > > -- > Eric Blake eblake redhat com +1-919-301-3266 > Libvirt virtualization library http://libvirt.org >
On 06/13/2016 12:38 AM, haris iqbal wrote: >>> ## >>> { 'enum': 'RunState', >>> 'data': [ 'debug', 'inmigrate', 'internal-error', 'io-error', 'paused', >>> 'postmigrate', 'prelaunch', 'finish-migrate', 'restore-vm', >>> 'running', 'save-vm', 'shutdown', 'suspended', 'watchdog', >>> - 'guest-panicked' ] } >>> + 'guest-panicked', 'postmigrate-recovery' ] } >> >> Adding new enums can cause existing clients like libvirt to do weird >> things if they aren't expecting the new state. Are we sure we want to do >> it? > I think so. If we do not have a new state, then one would not know > that the VM is in recovery. > >> Is it a state that cannot be entered by default, but only in >> response to a client request that proves the client is new enough to >> expect the new state? > > I did not quite understand what you are trying to say. > A client that is not expecting the new 'postmigrate-recovery' state may mishandle a VM that is in that state. So I'm suggesting that we may want to special case this state, and make it possible to enter the state only if the client has done something first to inform qemu that it understands what it means for a VM to be in that state.
* Md Haris Iqbal (haris.phnx@gmail.com) wrote: > --- > include/migration/migration.h | 1 + > migration/migration.c | 76 ++++++++++++++++++++++++++++++++++++++++--- > qapi-schema.json | 11 +++++-- > vl.c | 4 +++ > 4 files changed, 85 insertions(+), 7 deletions(-) You need to pull this pair of patches together as a set of patches rather than two individual patches; it just gets too confusing (like with the version numbers). Pull them together in git, and then we can also get the patches split up into smaller parts and you can keep upto date with current qemu. > diff --git a/include/migration/migration.h b/include/migration/migration.h > index 4a3201b..59e26e6 100644 > --- a/include/migration/migration.h > +++ b/include/migration/migration.h > @@ -326,6 +326,7 @@ void global_state_store_running(void); > void flush_page_queue(MigrationState *ms); > int ram_save_queue_pages(MigrationState *ms, const char *rbname, > ram_addr_t start, ram_addr_t len); > +int qemu_migrate_postcopy_outgoing_recovery(MigrationState *ms); > > PostcopyState postcopy_state_get(void); > /* Set the state and return the old state */ > diff --git a/migration/migration.c b/migration/migration.c > index a77f62e..41c28e1 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -676,6 +676,33 @@ MigrationInfo *qmp_query_migrate(Error **errp) > case MIGRATION_STATUS_CANCELLED: > info->has_status = true; > break; > + case MIGRATION_STATUS_POSTCOPY_RECOVERY: > + info->has_status = true; > + info->has_total_time = true; > + info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); > + > + info->has_ram = true; > + info->ram = g_malloc0(sizeof(*info->ram)); > + info->ram->transferred = ram_bytes_transferred(); > + info->ram->remaining = ram_bytes_remaining(); > + info->ram->total = ram_bytes_total(); > + info->ram->duplicate = dup_mig_pages_transferred(); > + info->ram->skipped = skipped_mig_pages_transferred(); > + info->ram->normal = norm_mig_pages_transferred(); > + info->ram->normal_bytes = norm_mig_bytes_transferred(); > + info->ram->dirty_pages_rate = s->dirty_pages_rate; > + info->ram->mbps = s->mbps; > + info->ram->dirty_sync_count = s->dirty_sync_count; If you update to the current qemu git you'll see there's a populate_ram_info function that avoids a lot of this duplication. > + if (blk_mig_active()) { > + info->has_disk = true; > + info->disk = g_malloc0(sizeof(*info->disk)); > + info->disk->transferred = blk_mig_bytes_transferred(); > + info->disk->remaining = blk_mig_bytes_remaining(); > + info->disk->total = blk_mig_bytes_total(); > + } > + > + get_xbzrle_cache_stats(info); > } > info->status = s->state; > > @@ -1660,6 +1687,8 @@ static void *migration_thread(void *opaque) > /* The active state we expect to be in; ACTIVE or POSTCOPY_ACTIVE */ > enum MigrationStatus current_active_state = MIGRATION_STATUS_ACTIVE; > > + int ret; > + > rcu_register_thread(); > > qemu_savevm_state_header(s->to_dst_file); > @@ -1726,11 +1755,32 @@ static void *migration_thread(void *opaque) > } > } > > - if (qemu_file_get_error(s->to_dst_file)) { > - migrate_set_state(&s->state, current_active_state, > - MIGRATION_STATUS_FAILED); > - trace_migration_thread_file_err(); > - break; > + if ((ret = qemu_file_get_error(s->to_dst_file))) { > + fprintf(stderr, "1 : Error %s %d\n", strerror(-ret), -ret); > + > + /* This check is based on how the error is set during the network > + * recv(). When recv() returns 0 (i.e. no data to read), the error > + * is set to -EIO. For all other network errors, it is set > + * according to the return value received. > + */ > + if (ret != -EIO && s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) { shouldn't that be ret == -EIO ? > + /* Network Failure during postcopy */ > + > + current_active_state = MIGRATION_STATUS_POSTCOPY_RECOVERY; > + runstate_set(RUN_STATE_POSTMIGRATE_RECOVERY); > + fprintf(stderr, "1.1 : Error %s %d\n", strerror(-ret), -ret); All the fprintf's need to come out for the final version; use traces instaed. > + ret = qemu_migrate_postcopy_outgoing_recovery(s); > + if(ret < 0) { > + break; > + } > + > + } else { > + migrate_set_state(&s->state, current_active_state, > + MIGRATION_STATUS_FAILED); > + fprintf(stderr, "1.2 : Error %s %d\n", strerror(-ret), -ret); > + trace_migration_thread_file_err(); > + break; > + } > } > current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); > if (current_time >= initial_time + BUFFER_DELAY) { > @@ -1831,6 +1881,22 @@ void migrate_fd_connect(MigrationState *s) > s->migration_thread_running = true; > } > > +int qemu_migrate_postcopy_outgoing_recovery(MigrationState* ms) > +{ > + migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE, > + MIGRATION_STATUS_POSTCOPY_RECOVERY); > + > + ms->in_recovery = true; > + /* Code for network recovery to be added here */ > + while(atomic_mb_read(&ms->in_recovery) == true) { > + fprintf(stderr, "Not letting it fail %p\n", ms->to_dst_file); > + sleep(5); > + } > + > + return -1; > + > +} > + > PostcopyState postcopy_state_get(void) > { > return atomic_mb_read(&incoming_postcopy_state); > diff --git a/qapi-schema.json b/qapi-schema.json > index 1b7b1e1..613f8d2 100644 > --- a/qapi-schema.json > +++ b/qapi-schema.json > @@ -154,12 +154,14 @@ > # @watchdog: the watchdog action is configured to pause and has been triggered > # > # @guest-panicked: guest has been panicked as a result of guest OS panic > +# > +# @postmigrate-recovery: guest is paused for recovery after a network failure > ## > { 'enum': 'RunState', > 'data': [ 'debug', 'inmigrate', 'internal-error', 'io-error', 'paused', > 'postmigrate', 'prelaunch', 'finish-migrate', 'restore-vm', > 'running', 'save-vm', 'shutdown', 'suspended', 'watchdog', > - 'guest-panicked' ] } > + 'guest-panicked', 'postmigrate-recovery' ] } Would it be possible to have that as postcopy-recovery rather than postmigrate? > ## > # @StatusInfo: > @@ -434,12 +436,15 @@ > # > # @failed: some error occurred during migration process. > # > +# @postcopy-recovery: in recovery mode, after a network failure. > +# > # Since: 2.3 > # > ## > { 'enum': 'MigrationStatus', > 'data': [ 'none', 'setup', 'cancelling', 'cancelled', > - 'active', 'postcopy-active', 'completed', 'failed' ] } > + 'active', 'postcopy-active', 'completed', 'failed', > + 'postcopy-recovery' ] } > > ## > # @MigrationInfo > @@ -2058,6 +2063,8 @@ > # > # @uri: the Uniform Resource Identifier of the destination VM > # > +# @recover: #optional recover from a broken migration > +# That looks like it belongs in a different patch. > # @blk: #optional do block migration (full disk copy) > # > # @inc: #optional incremental disk copy migration > diff --git a/vl.c b/vl.c > index 5fd22cb..c237140 100644 > --- a/vl.c > +++ b/vl.c > @@ -618,6 +618,10 @@ static const RunStateTransition runstate_transitions_def[] = { > { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING }, > { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE }, > { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PRELAUNCH }, > + { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE_RECOVERY }, > + > + { RUN_STATE_POSTMIGRATE_RECOVERY, RUN_STATE_FINISH_MIGRATE }, > + { RUN_STATE_POSTMIGRATE_RECOVERY, RUN_STATE_SHUTDOWN }, > > { RUN_STATE_RESTORE_VM, RUN_STATE_RUNNING }, > { RUN_STATE_RESTORE_VM, RUN_STATE_PRELAUNCH }, > -- > 2.7.4 > Dave -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Eric Blake (eblake@redhat.com) wrote: > On 06/13/2016 12:38 AM, haris iqbal wrote: > > >>> ## > >>> { 'enum': 'RunState', > >>> 'data': [ 'debug', 'inmigrate', 'internal-error', 'io-error', 'paused', > >>> 'postmigrate', 'prelaunch', 'finish-migrate', 'restore-vm', > >>> 'running', 'save-vm', 'shutdown', 'suspended', 'watchdog', > >>> - 'guest-panicked' ] } > >>> + 'guest-panicked', 'postmigrate-recovery' ] } > >> > >> Adding new enums can cause existing clients like libvirt to do weird > >> things if they aren't expecting the new state. Are we sure we want to do > >> it? > > I think so. If we do not have a new state, then one would not know > > that the VM is in recovery. > > > >> Is it a state that cannot be entered by default, but only in > >> response to a client request that proves the client is new enough to > >> expect the new state? > > > > I did not quite understand what you are trying to say. > > > > A client that is not expecting the new 'postmigrate-recovery' state may > mishandle a VM that is in that state. So I'm suggesting that we may > want to special case this state, and make it possible to enter the state > only if the client has done something first to inform qemu that it > understands what it means for a VM to be in that state. Do you mean another migration capability? Dave > > -- > Eric Blake eblake redhat com +1-919-301-3266 > Libvirt virtualization library http://libvirt.org > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 06/15/2016 07:03 AM, Dr. David Alan Gilbert wrote: > * Eric Blake (eblake@redhat.com) wrote: >> On 06/13/2016 12:38 AM, haris iqbal wrote: >> >>>>> ## >>>>> { 'enum': 'RunState', >>>>> 'data': [ 'debug', 'inmigrate', 'internal-error', 'io-error', 'paused', >>>>> 'postmigrate', 'prelaunch', 'finish-migrate', 'restore-vm', >>>>> 'running', 'save-vm', 'shutdown', 'suspended', 'watchdog', >>>>> - 'guest-panicked' ] } >>>>> + 'guest-panicked', 'postmigrate-recovery' ] } >>>> >>>> Adding new enums can cause existing clients like libvirt to do weird >>>> things if they aren't expecting the new state. Are we sure we want to do >>>> it? >>> I think so. If we do not have a new state, then one would not know >>> that the VM is in recovery. >>> >>>> Is it a state that cannot be entered by default, but only in >>>> response to a client request that proves the client is new enough to >>>> expect the new state? >>> >>> I did not quite understand what you are trying to say. >>> >> >> A client that is not expecting the new 'postmigrate-recovery' state may >> mishandle a VM that is in that state. So I'm suggesting that we may >> want to special case this state, and make it possible to enter the state >> only if the client has done something first to inform qemu that it >> understands what it means for a VM to be in that state. > > Do you mean another migration capability? > Sure, that would be an introspectible way - if left off, the state cannot be reached, but if the client turns the capability on, then the state is useful.
diff --git a/include/migration/migration.h b/include/migration/migration.h index 4a3201b..59e26e6 100644 --- a/include/migration/migration.h +++ b/include/migration/migration.h @@ -326,6 +326,7 @@ void global_state_store_running(void); void flush_page_queue(MigrationState *ms); int ram_save_queue_pages(MigrationState *ms, const char *rbname, ram_addr_t start, ram_addr_t len); +int qemu_migrate_postcopy_outgoing_recovery(MigrationState *ms); PostcopyState postcopy_state_get(void); /* Set the state and return the old state */ diff --git a/migration/migration.c b/migration/migration.c index a77f62e..41c28e1 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -676,6 +676,33 @@ MigrationInfo *qmp_query_migrate(Error **errp) case MIGRATION_STATUS_CANCELLED: info->has_status = true; break; + case MIGRATION_STATUS_POSTCOPY_RECOVERY: + info->has_status = true; + info->has_total_time = true; + info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); + + info->has_ram = true; + info->ram = g_malloc0(sizeof(*info->ram)); + info->ram->transferred = ram_bytes_transferred(); + info->ram->remaining = ram_bytes_remaining(); + info->ram->total = ram_bytes_total(); + info->ram->duplicate = dup_mig_pages_transferred(); + info->ram->skipped = skipped_mig_pages_transferred(); + info->ram->normal = norm_mig_pages_transferred(); + info->ram->normal_bytes = norm_mig_bytes_transferred(); + info->ram->dirty_pages_rate = s->dirty_pages_rate; + info->ram->mbps = s->mbps; + info->ram->dirty_sync_count = s->dirty_sync_count; + + if (blk_mig_active()) { + info->has_disk = true; + info->disk = g_malloc0(sizeof(*info->disk)); + info->disk->transferred = blk_mig_bytes_transferred(); + info->disk->remaining = blk_mig_bytes_remaining(); + info->disk->total = blk_mig_bytes_total(); + } + + get_xbzrle_cache_stats(info); } info->status = s->state; @@ -1660,6 +1687,8 @@ static void *migration_thread(void *opaque) /* The active state we expect to be in; ACTIVE or POSTCOPY_ACTIVE */ enum MigrationStatus current_active_state = MIGRATION_STATUS_ACTIVE; + int ret; + rcu_register_thread(); qemu_savevm_state_header(s->to_dst_file); @@ -1726,11 +1755,32 @@ static void *migration_thread(void *opaque) } } - if (qemu_file_get_error(s->to_dst_file)) { - migrate_set_state(&s->state, current_active_state, - MIGRATION_STATUS_FAILED); - trace_migration_thread_file_err(); - break; + if ((ret = qemu_file_get_error(s->to_dst_file))) { + fprintf(stderr, "1 : Error %s %d\n", strerror(-ret), -ret); + + /* This check is based on how the error is set during the network + * recv(). When recv() returns 0 (i.e. no data to read), the error + * is set to -EIO. For all other network errors, it is set + * according to the return value received. + */ + if (ret != -EIO && s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) { + /* Network Failure during postcopy */ + + current_active_state = MIGRATION_STATUS_POSTCOPY_RECOVERY; + runstate_set(RUN_STATE_POSTMIGRATE_RECOVERY); + fprintf(stderr, "1.1 : Error %s %d\n", strerror(-ret), -ret); + ret = qemu_migrate_postcopy_outgoing_recovery(s); + if(ret < 0) { + break; + } + + } else { + migrate_set_state(&s->state, current_active_state, + MIGRATION_STATUS_FAILED); + fprintf(stderr, "1.2 : Error %s %d\n", strerror(-ret), -ret); + trace_migration_thread_file_err(); + break; + } } current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); if (current_time >= initial_time + BUFFER_DELAY) { @@ -1831,6 +1881,22 @@ void migrate_fd_connect(MigrationState *s) s->migration_thread_running = true; } +int qemu_migrate_postcopy_outgoing_recovery(MigrationState* ms) +{ + migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE, + MIGRATION_STATUS_POSTCOPY_RECOVERY); + + ms->in_recovery = true; + /* Code for network recovery to be added here */ + while(atomic_mb_read(&ms->in_recovery) == true) { + fprintf(stderr, "Not letting it fail %p\n", ms->to_dst_file); + sleep(5); + } + + return -1; + +} + PostcopyState postcopy_state_get(void) { return atomic_mb_read(&incoming_postcopy_state); diff --git a/qapi-schema.json b/qapi-schema.json index 1b7b1e1..613f8d2 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -154,12 +154,14 @@ # @watchdog: the watchdog action is configured to pause and has been triggered # # @guest-panicked: guest has been panicked as a result of guest OS panic +# +# @postmigrate-recovery: guest is paused for recovery after a network failure ## { 'enum': 'RunState', 'data': [ 'debug', 'inmigrate', 'internal-error', 'io-error', 'paused', 'postmigrate', 'prelaunch', 'finish-migrate', 'restore-vm', 'running', 'save-vm', 'shutdown', 'suspended', 'watchdog', - 'guest-panicked' ] } + 'guest-panicked', 'postmigrate-recovery' ] } ## # @StatusInfo: @@ -434,12 +436,15 @@ # # @failed: some error occurred during migration process. # +# @postcopy-recovery: in recovery mode, after a network failure. +# # Since: 2.3 # ## { 'enum': 'MigrationStatus', 'data': [ 'none', 'setup', 'cancelling', 'cancelled', - 'active', 'postcopy-active', 'completed', 'failed' ] } + 'active', 'postcopy-active', 'completed', 'failed', + 'postcopy-recovery' ] } ## # @MigrationInfo @@ -2058,6 +2063,8 @@ # # @uri: the Uniform Resource Identifier of the destination VM # +# @recover: #optional recover from a broken migration +# # @blk: #optional do block migration (full disk copy) # # @inc: #optional incremental disk copy migration diff --git a/vl.c b/vl.c index 5fd22cb..c237140 100644 --- a/vl.c +++ b/vl.c @@ -618,6 +618,10 @@ static const RunStateTransition runstate_transitions_def[] = { { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING }, { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE }, { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PRELAUNCH }, + { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE_RECOVERY }, + + { RUN_STATE_POSTMIGRATE_RECOVERY, RUN_STATE_FINISH_MIGRATE }, + { RUN_STATE_POSTMIGRATE_RECOVERY, RUN_STATE_SHUTDOWN }, { RUN_STATE_RESTORE_VM, RUN_STATE_RUNNING }, { RUN_STATE_RESTORE_VM, RUN_STATE_PRELAUNCH },