diff mbox

[COLO-Frame,(Base),v21,03/17] migration: Enter into COLO mode after migration if COLO is enabled

Message ID 1476792613-11712-4-git-send-email-zhang.zhanghailiang@huawei.com (mailing list archive)
State New, archived
Headers show

Commit Message

Zhanghailiang Oct. 18, 2016, 12:09 p.m. UTC
Add a new migration state: MIGRATION_STATUS_COLO. Migration source side
enters this state after the first live migration successfully finished
if COLO is enabled by command 'migrate_set_capability x-colo on'.

We reuse migration thread, so the process of checkpointing will be handled
in migration thread.

Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
v21:
- Don't mark the image with BDRV_O_INACTIVE flag if we are going into colo
v19:
- fix title to make it more exact.
v11:
- Rebase to master
- Add Reviewed-by tag
v10:
- Simplify process by dropping colo thread and reusing migration thread.
     (Dave's suggestion)
---
 include/migration/colo.h |  3 +++
 migration/colo.c         | 31 +++++++++++++++++++++++++++++++
 migration/migration.c    | 38 +++++++++++++++++++++++++++++++++-----
 migration/trace-events   |  3 +++
 qapi-schema.json         |  4 +++-
 stubs/migration-colo.c   |  9 +++++++++
 6 files changed, 82 insertions(+), 6 deletions(-)

Comments

Amit Shah Oct. 26, 2016, 4:50 a.m. UTC | #1
On (Tue) 18 Oct 2016 [20:09:59], zhanghailiang wrote:
> Add a new migration state: MIGRATION_STATUS_COLO. Migration source side
> enters this state after the first live migration successfully finished
> if COLO is enabled by command 'migrate_set_capability x-colo on'.
> 
> We reuse migration thread, so the process of checkpointing will be handled
> in migration thread.
> 
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

(snip)

> +static void colo_process_checkpoint(MigrationState *s)
> +{
> +    qemu_mutex_lock_iothread();
> +    vm_start();
> +    qemu_mutex_unlock_iothread();
> +    trace_colo_vm_state_change("stop", "run");
> +
> +    /* TODO: COLO checkpoint savevm loop */
> +
> +    migrate_set_state(&s->state, MIGRATION_STATUS_COLO,
> +                      MIGRATION_STATUS_COMPLETED);

Is this just a temporary thing that'll be removed in the next patches?
I guess so - because once you enter COLO state, you want to remain in
it, right?

I think the commit message implies that.  So the commit msg and the
code are not in sync.

(snip)

> diff --git a/migration/migration.c b/migration/migration.c
> index f7dd9c6..462007d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -695,6 +695,10 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>  
>          get_xbzrle_cache_stats(info);
>          break;
> +    case MIGRATION_STATUS_COLO:
> +        info->has_status = true;
> +        /* TODO: display COLO specific information (checkpoint info etc.) */
> +        break;

When do you plan to add this?  I guess it's important for debugging
and also to get the state of the system while colo is active.  What
info do you have planned to display here?


		Amit
Zhanghailiang Oct. 26, 2016, 1:49 p.m. UTC | #2
On 2016/10/26 12:50, Amit Shah wrote:
> On (Tue) 18 Oct 2016 [20:09:59], zhanghailiang wrote:
>> Add a new migration state: MIGRATION_STATUS_COLO. Migration source side
>> enters this state after the first live migration successfully finished
>> if COLO is enabled by command 'migrate_set_capability x-colo on'.
>>
>> We reuse migration thread, so the process of checkpointing will be handled
>> in migration thread.
>>
>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>
> (snip)
>
>> +static void colo_process_checkpoint(MigrationState *s)
>> +{
>> +    qemu_mutex_lock_iothread();
>> +    vm_start();
>> +    qemu_mutex_unlock_iothread();
>> +    trace_colo_vm_state_change("stop", "run");
>> +
>> +    /* TODO: COLO checkpoint savevm loop */
>> +
>> +    migrate_set_state(&s->state, MIGRATION_STATUS_COLO,
>> +                      MIGRATION_STATUS_COMPLETED);
>
> Is this just a temporary thing that'll be removed in the next patches?

Yes, you are right, we will move this codes into failover process in the next
patch, because after failover, we should finish the original migration, there
are still some cleanup work need to be done.

> I guess so - because once you enter COLO state, you want to remain in
> it, right?
>

Yes.

> I think the commit message implies that.  So the commit msg and the
> code are not in sync.
>

Hmm, i'll remove it here in this patch, is it OK ?

> (snip)
>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index f7dd9c6..462007d 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -695,6 +695,10 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>>
>>           get_xbzrle_cache_stats(info);
>>           break;
>> +    case MIGRATION_STATUS_COLO:
>> +        info->has_status = true;
>> +        /* TODO: display COLO specific information (checkpoint info etc.) */
>> +        break;
>
> When do you plan to add this?  I guess it's important for debugging
> and also to get the state of the system while colo is active.  What
> info do you have planned to display here?
>

IIRC, we have such patch which implemented this specific information in the previous
version long time ago. Yes, it is quit useful, for example, the average/max time of
pause while do checkpoint, the average/max number of dirty pages transferred to SVM,
the amount time of VM in COLO state, the total checkpoint times, the count of
checkpointing because of inconsistency of network packages compare.

Thanks,
Hailiang

>
> 		Amit
>
> .
>
Amit Shah Oct. 27, 2016, 3:58 a.m. UTC | #3
On (Wed) 26 Oct 2016 [21:49:10], Hailiang Zhang wrote:
> On 2016/10/26 12:50, Amit Shah wrote:
> >On (Tue) 18 Oct 2016 [20:09:59], zhanghailiang wrote:
> >>Add a new migration state: MIGRATION_STATUS_COLO. Migration source side
> >>enters this state after the first live migration successfully finished
> >>if COLO is enabled by command 'migrate_set_capability x-colo on'.
> >>
> >>We reuse migration thread, so the process of checkpointing will be handled
> >>in migration thread.
> >>
> >>Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> >>Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> >>Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> >>Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >
> >(snip)
> >
> >>+static void colo_process_checkpoint(MigrationState *s)
> >>+{
> >>+    qemu_mutex_lock_iothread();
> >>+    vm_start();
> >>+    qemu_mutex_unlock_iothread();
> >>+    trace_colo_vm_state_change("stop", "run");
> >>+
> >>+    /* TODO: COLO checkpoint savevm loop */
> >>+
> >>+    migrate_set_state(&s->state, MIGRATION_STATUS_COLO,
> >>+                      MIGRATION_STATUS_COMPLETED);
> >
> >Is this just a temporary thing that'll be removed in the next patches?
> 
> Yes, you are right, we will move this codes into failover process in the next
> patch, because after failover, we should finish the original migration, there
> are still some cleanup work need to be done.
> 
> >I guess so - because once you enter COLO state, you want to remain in
> >it, right?
> >
> 
> Yes.
> 
> >I think the commit message implies that.  So the commit msg and the
> >code are not in sync.
> >
> 
> Hmm, i'll remove it here in this patch, is it OK ?

Yes.

> 
> >(snip)
> >
> >>diff --git a/migration/migration.c b/migration/migration.c
> >>index f7dd9c6..462007d 100644
> >>--- a/migration/migration.c
> >>+++ b/migration/migration.c
> >>@@ -695,6 +695,10 @@ MigrationInfo *qmp_query_migrate(Error **errp)
> >>
> >>          get_xbzrle_cache_stats(info);
> >>          break;
> >>+    case MIGRATION_STATUS_COLO:
> >>+        info->has_status = true;
> >>+        /* TODO: display COLO specific information (checkpoint info etc.) */
> >>+        break;
> >
> >When do you plan to add this?  I guess it's important for debugging
> >and also to get the state of the system while colo is active.  What
> >info do you have planned to display here?
> >
> 
> IIRC, we have such patch which implemented this specific information in the previous
> version long time ago. Yes, it is quit useful, for example, the average/max time of
> pause while do checkpoint, the average/max number of dirty pages transferred to SVM,
> the amount time of VM in COLO state, the total checkpoint times, the count of
> checkpointing because of inconsistency of network packages compare.

Yes, please get this in soon as well.


		Amit
Zhanghailiang Oct. 27, 2016, 6:10 a.m. UTC | #4
On 2016/10/27 11:58, Amit Shah wrote:
> On (Wed) 26 Oct 2016 [21:49:10], Hailiang Zhang wrote:
>> On 2016/10/26 12:50, Amit Shah wrote:
>>> On (Tue) 18 Oct 2016 [20:09:59], zhanghailiang wrote:
>>>> Add a new migration state: MIGRATION_STATUS_COLO. Migration source side
>>>> enters this state after the first live migration successfully finished
>>>> if COLO is enabled by command 'migrate_set_capability x-colo on'.
>>>>
>>>> We reuse migration thread, so the process of checkpointing will be handled
>>>> in migration thread.
>>>>
>>>> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>>>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>>>> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
>>>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>
>>> (snip)
>>>
>>>> +static void colo_process_checkpoint(MigrationState *s)
>>>> +{
>>>> +    qemu_mutex_lock_iothread();
>>>> +    vm_start();
>>>> +    qemu_mutex_unlock_iothread();
>>>> +    trace_colo_vm_state_change("stop", "run");
>>>> +
>>>> +    /* TODO: COLO checkpoint savevm loop */
>>>> +
>>>> +    migrate_set_state(&s->state, MIGRATION_STATUS_COLO,
>>>> +                      MIGRATION_STATUS_COMPLETED);
>>>
>>> Is this just a temporary thing that'll be removed in the next patches?
>>
>> Yes, you are right, we will move this codes into failover process in the next
>> patch, because after failover, we should finish the original migration, there
>> are still some cleanup work need to be done.
>>
>>> I guess so - because once you enter COLO state, you want to remain in
>>> it, right?
>>>
>>
>> Yes.
>>
>>> I think the commit message implies that.  So the commit msg and the
>>> code are not in sync.
>>>
>>
>> Hmm, i'll remove it here in this patch, is it OK ?
>
> Yes.
>
>>
>>> (snip)
>>>
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index f7dd9c6..462007d 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -695,6 +695,10 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>>>>
>>>>           get_xbzrle_cache_stats(info);
>>>>           break;
>>>> +    case MIGRATION_STATUS_COLO:
>>>> +        info->has_status = true;
>>>> +        /* TODO: display COLO specific information (checkpoint info etc.) */
>>>> +        break;
>>>
>>> When do you plan to add this?  I guess it's important for debugging
>>> and also to get the state of the system while colo is active.  What
>>> info do you have planned to display here?
>>>
>>
>> IIRC, we have such patch which implemented this specific information in the previous
>> version long time ago. Yes, it is quit useful, for example, the average/max time of
>> pause while do checkpoint, the average/max number of dirty pages transferred to SVM,
>> the amount time of VM in COLO state, the total checkpoint times, the count of
>> checkpointing because of inconsistency of network packages compare.
>
> Yes, please get this in soon as well.
>
>

OK, we will do it, thanks.

> 		Amit
>
> .
>
diff mbox

Patch

diff --git a/include/migration/colo.h b/include/migration/colo.h
index 1c899a0..bf84b99 100644
--- a/include/migration/colo.h
+++ b/include/migration/colo.h
@@ -19,4 +19,7 @@ 
 bool colo_supported(void);
 void colo_info_init(void);
 
+void migrate_start_colo_process(MigrationState *s);
+bool migration_in_colo_state(void);
+
 #endif
diff --git a/migration/colo.c b/migration/colo.c
index d215057..fd3ceeb 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -11,9 +11,40 @@ 
  */
 
 #include "qemu/osdep.h"
+#include "sysemu/sysemu.h"
 #include "migration/colo.h"
+#include "trace.h"
 
 bool colo_supported(void)
 {
     return false;
 }
+
+bool migration_in_colo_state(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return (s->state == MIGRATION_STATUS_COLO);
+}
+
+static void colo_process_checkpoint(MigrationState *s)
+{
+    qemu_mutex_lock_iothread();
+    vm_start();
+    qemu_mutex_unlock_iothread();
+    trace_colo_vm_state_change("stop", "run");
+
+    /* TODO: COLO checkpoint savevm loop */
+
+    migrate_set_state(&s->state, MIGRATION_STATUS_COLO,
+                      MIGRATION_STATUS_COMPLETED);
+}
+
+void migrate_start_colo_process(MigrationState *s)
+{
+    qemu_mutex_unlock_iothread();
+    migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
+                      MIGRATION_STATUS_COLO);
+    colo_process_checkpoint(s);
+    qemu_mutex_lock_iothread();
+}
diff --git a/migration/migration.c b/migration/migration.c
index f7dd9c6..462007d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -695,6 +695,10 @@  MigrationInfo *qmp_query_migrate(Error **errp)
 
         get_xbzrle_cache_stats(info);
         break;
+    case MIGRATION_STATUS_COLO:
+        info->has_status = true;
+        /* TODO: display COLO specific information (checkpoint info etc.) */
+        break;
     case MIGRATION_STATUS_COMPLETED:
         get_xbzrle_cache_stats(info);
 
@@ -1113,7 +1117,8 @@  void qmp_migrate(const char *uri, bool has_blk, bool blk,
     params.shared = has_inc && inc;
 
     if (migration_is_setup_or_active(s->state) ||
-        s->state == MIGRATION_STATUS_CANCELLING) {
+        s->state == MIGRATION_STATUS_CANCELLING ||
+        s->state == MIGRATION_STATUS_COLO) {
         error_setg(errp, QERR_MIGRATION_ACTIVE);
         return;
     }
@@ -1660,7 +1665,11 @@  static void migration_completion(MigrationState *s, int current_active_state,
 
         if (!ret) {
             ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
-            if (ret >= 0) {
+            /*
+             * Don't mark the image with BDRV_O_INACTIVE flag if
+             * we will go into COLO stage later.
+             */
+            if (ret >= 0 && !migrate_colo_enabled()) {
                 ret = bdrv_inactivate_all();
             }
             if (ret >= 0) {
@@ -1702,8 +1711,11 @@  static void migration_completion(MigrationState *s, int current_active_state,
         goto fail_invalidate;
     }
 
-    migrate_set_state(&s->state, current_active_state,
-                      MIGRATION_STATUS_COMPLETED);
+    if (!migrate_colo_enabled()) {
+        migrate_set_state(&s->state, current_active_state,
+                          MIGRATION_STATUS_COMPLETED);
+    }
+
     return;
 
 fail_invalidate:
@@ -1748,6 +1760,7 @@  static void *migration_thread(void *opaque)
     bool entered_postcopy = false;
     /* The active state we expect to be in; ACTIVE or POSTCOPY_ACTIVE */
     enum MigrationStatus current_active_state = MIGRATION_STATUS_ACTIVE;
+    bool enable_colo = migrate_colo_enabled();
 
     rcu_register_thread();
 
@@ -1856,7 +1869,13 @@  static void *migration_thread(void *opaque)
     end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
 
     qemu_mutex_lock_iothread();
-    qemu_savevm_state_cleanup();
+    /*
+     * The resource has been allocated by migration will be reused in COLO
+     * process, so don't release them.
+     */
+    if (!enable_colo) {
+        qemu_savevm_state_cleanup();
+    }
     if (s->state == MIGRATION_STATUS_COMPLETED) {
         uint64_t transferred_bytes = qemu_ftell(s->to_dst_file);
         s->total_time = end_time - s->total_time;
@@ -1869,6 +1888,15 @@  static void *migration_thread(void *opaque)
         }
         runstate_set(RUN_STATE_POSTMIGRATE);
     } else {
+        if (s->state == MIGRATION_STATUS_ACTIVE && enable_colo) {
+            migrate_start_colo_process(s);
+            qemu_savevm_state_cleanup();
+            /*
+            * Fixme: we will run VM in COLO no matter its old running state.
+            * After exited COLO, we will keep running.
+            */
+            old_vm_running = true;
+        }
         if (old_vm_running && !entered_postcopy) {
             vm_start();
         } else {
diff --git a/migration/trace-events b/migration/trace-events
index dfee75a..a29e5a0 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -207,3 +207,6 @@  migration_tls_outgoing_handshake_complete(void) ""
 migration_tls_incoming_handshake_start(void) ""
 migration_tls_incoming_handshake_error(const char *err) "err=%s"
 migration_tls_incoming_handshake_complete(void) ""
+
+# migration/colo.c
+colo_vm_state_change(const char *old, const char *new) "Change '%s' => '%s'"
diff --git a/qapi-schema.json b/qapi-schema.json
index bed3d9e..aaed423 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -459,12 +459,14 @@ 
 #
 # @failed: some error occurred during migration process.
 #
+# @colo: VM is in the process of fault tolerance. (since 2.8)
+#
 # Since: 2.3
 #
 ##
 { 'enum': 'MigrationStatus',
   'data': [ 'none', 'setup', 'cancelling', 'cancelled',
-            'active', 'postcopy-active', 'completed', 'failed' ] }
+            'active', 'postcopy-active', 'completed', 'failed', 'colo' ] }
 
 ##
 # @MigrationInfo
diff --git a/stubs/migration-colo.c b/stubs/migration-colo.c
index d215057..0c8eef4 100644
--- a/stubs/migration-colo.c
+++ b/stubs/migration-colo.c
@@ -17,3 +17,12 @@  bool colo_supported(void)
 {
     return false;
 }
+
+bool migration_in_colo_state(void)
+{
+    return false;
+}
+
+void migrate_start_colo_process(MigrationState *s)
+{
+}