[V7,10/17] qmp event: Add COLO_EXIT event to notify users while exited COLO

Message ID 1526268228-27951-11-git-send-email-zhangckid@gmail.com (mailing list archive)
State New, archived
Commit Message

Zhang Chen May 14, 2018, 3:23 a.m. UTC
From: zhanghailiang <zhang.zhanghailiang@huawei.com>

If an error happens during a VM's COLO FT stage, it's important to
notify users of this event. Together with 'x-colo-lost-heartbeat',
users can intervene in COLO's failover work immediately.
Even if users don't want to get involved in COLO's failover verdict,
it is still necessary to notify them that we exited COLO mode.

Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Zhang Chen <zhangckid@gmail.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 migration/colo.c    | 20 ++++++++++++++++++++
 qapi/migration.json | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

Comments

Markus Armbruster May 15, 2018, 2:29 p.m. UTC | #1
Zhang Chen <zhangckid@gmail.com> writes:

> From: zhanghailiang <zhang.zhanghailiang@huawei.com>
>
> If some errors happen during VM's COLO FT stage, it's important to
> notify the users of this event. Together with 'x-colo-lost-heartbeat',
> Users can intervene in COLO's failover work immediately.
> If users don't want to get involved in COLO's failover verdict,
> it is still necessary to notify users that we exited COLO mode.
>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  migration/colo.c    | 20 ++++++++++++++++++++
>  qapi/migration.json | 37 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 57 insertions(+)
>
> diff --git a/migration/colo.c b/migration/colo.c
> index c083d36..8ca6381 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -28,6 +28,7 @@
>  #include "net/colo-compare.h"
>  #include "net/colo.h"
>  #include "block/block.h"
> +#include "qapi/qapi-events-migration.h"
>  
>  static bool vmstate_loading;
>  static Notifier packets_compare_notifier;
> @@ -514,6 +515,18 @@ out:
>          qemu_fclose(fb);
>      }
>  
> +    /*
> +     * There are only two reasons we can go here, some error happened.
> +     * Or the user triggered failover.
> +     */
> +    if (failover_get_state() == FAILOVER_STATUS_NONE) {
> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
> +                                  COLO_EXIT_REASON_ERROR, NULL);
> +    } else {
> +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
> +                                  COLO_EXIT_REASON_REQUEST, NULL);
> +    }

Your comment makes me suspect failover_get_state() can only be
FAILOVER_STATUS_NONE or FAILOVER_STATUS_REQUIRE here.  Is that correct?

If yes, I recommend to add a suitable assertion.

> +
>      /* Hope this not to be too long to wait here */
>      qemu_sem_wait(&s->colo_exit_sem);
>      qemu_sem_destroy(&s->colo_exit_sem);
> @@ -744,6 +757,13 @@ out:
>      if (local_err) {
>          error_report_err(local_err);
>      }
> +    if (failover_get_state() == FAILOVER_STATUS_NONE) {
> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY,
> +                                  COLO_EXIT_REASON_ERROR, NULL);
> +    } else {
> +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY,
> +                                  COLO_EXIT_REASON_REQUEST, NULL);
> +    }

Same question.

>  
>      if (fb) {
>          qemu_fclose(fb);
> diff --git a/qapi/migration.json b/qapi/migration.json
> index f3974c6..55dae48 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -875,6 +875,43 @@
>    'data': [ 'none', 'require', 'active', 'completed', 'relaunch' ] }
>  
>  ##
> +# @COLO_EXIT:
> +#
> +# Emitted when VM finishes COLO mode due to some errors happening or
> +# at the request of users.
> +#
> +# @mode: report COLO mode when COLO exited.
> +#
> +# @reason: describes the reason for the COLO exit.
> +#
> +# Since: 2.13
> +#
> +# Example:
> +#
> +# <- { "timestamp": {"seconds": 2032141960, "microseconds": 417172},
> +#      "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
> +#
> +##
> +{ 'event': 'COLO_EXIT',
> +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason' } }

'data' duplicates the next patch's ColoStatus, except it lacks
@colo-running.  Factoring out the common part doesn't seem worth the
bother.  Okay as is.

> +
> +##
> +# @COLOExitReason:
> +#
> +# The reason for a COLO exit
> +#
> +# @none: no failover has ever happened.

This can't occur in the COLO_EXIT event, only in the result of
query-colo-status, can it?

Worth spelling that out in the documentation?

> +#
> +# @request: COLO exit is due to an external request
> +#
> +# @error: COLO exit is due to an internal error
> +#
> +# Since: 2.13
> +##
> +{ 'enum': 'COLOExitReason',
> +  'data': [ 'none', 'request', 'error' ] }
> +
> +##
>  # @x-colo-lost-heartbeat:
>  #
>  # Tell qemu that heartbeat is lost, request it to do takeover procedures.
Zhang Chen May 16, 2018, 1:41 p.m. UTC | #2
On Tue, May 15, 2018 at 10:29 PM, Markus Armbruster <armbru@redhat.com>
wrote:

> Zhang Chen <zhangckid@gmail.com> writes:
>
> > From: zhanghailiang <zhang.zhanghailiang@huawei.com>
> >
> > If some errors happen during VM's COLO FT stage, it's important to
> > notify the users of this event. Together with 'x-colo-lost-heartbeat',
> > Users can intervene in COLO's failover work immediately.
> > If users don't want to get involved in COLO's failover verdict,
> > it is still necessary to notify users that we exited COLO mode.
> >
> > Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> > Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> > Reviewed-by: Eric Blake <eblake@redhat.com>
> > ---
> >  migration/colo.c    | 20 ++++++++++++++++++++
> >  qapi/migration.json | 37 +++++++++++++++++++++++++++++++++++++
> >  2 files changed, 57 insertions(+)
> >
> > diff --git a/migration/colo.c b/migration/colo.c
> > index c083d36..8ca6381 100644
> > --- a/migration/colo.c
> > +++ b/migration/colo.c
> > @@ -28,6 +28,7 @@
> >  #include "net/colo-compare.h"
> >  #include "net/colo.h"
> >  #include "block/block.h"
> > +#include "qapi/qapi-events-migration.h"
> >
> >  static bool vmstate_loading;
> >  static Notifier packets_compare_notifier;
> > @@ -514,6 +515,18 @@ out:
> >          qemu_fclose(fb);
> >      }
> >
> > +    /*
> > +     * There are only two reasons we can go here, some error happened.
> > +     * Or the user triggered failover.
> > +     */
> > +    if (failover_get_state() == FAILOVER_STATUS_NONE) {
> > +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
> > +                                  COLO_EXIT_REASON_ERROR, NULL);
> > +    } else {
> > +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
> > +                                  COLO_EXIT_REASON_REQUEST, NULL);
> > +    }
>
> Your comment makes me suspect failover_get_state() can only be
> FAILOVER_STATUS_NONE or FAILOVER_STATUS_REQUIRE here.  Is that correct?
>
> If yes, I recommend to add a suitable assertion.
>

Yes, and what kinds of 'suitable assertion'? Just for the
'failover_get_state()' ?



>
> > +
> >      /* Hope this not to be too long to wait here */
> >      qemu_sem_wait(&s->colo_exit_sem);
> >      qemu_sem_destroy(&s->colo_exit_sem);
> > @@ -744,6 +757,13 @@ out:
> >      if (local_err) {
> >          error_report_err(local_err);
> >      }
> > +    if (failover_get_state() == FAILOVER_STATUS_NONE) {
> > +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY,
> > +                                  COLO_EXIT_REASON_ERROR, NULL);
> > +    } else {
> > +        qapi_event_send_colo_exit(COLO_MODE_SECONDARY,
> > +                                  COLO_EXIT_REASON_REQUEST, NULL);
> > +    }
>
> Same question.
>
> >
> >      if (fb) {
> >          qemu_fclose(fb);
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index f3974c6..55dae48 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -875,6 +875,43 @@
> >    'data': [ 'none', 'require', 'active', 'completed', 'relaunch' ] }
> >
> >  ##
> > +# @COLO_EXIT:
> > +#
> > +# Emitted when VM finishes COLO mode due to some errors happening or
> > +# at the request of users.
> > +#
> > +# @mode: report COLO mode when COLO exited.
> > +#
> > +# @reason: describes the reason for the COLO exit.
> > +#
> > +# Since: 2.13
> > +#
> > +# Example:
> > +#
> > +# <- { "timestamp": {"seconds": 2032141960, "microseconds": 417172},
> > +#      "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
> > +#
> > +##
> > +{ 'event': 'COLO_EXIT',
> > +  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason' } }
>
> 'data' duplicates the next patch's ColoStatus, except it lacks
> @colo-running.  Factoring out the common part doesn't seem worth the
> bother.  Okay as is.
>
> > +
> > +##
> > +# @COLOExitReason:
> > +#
> > +# The reason for a COLO exit
> > +#
> > +# @none: no failover has ever happened.
>
> This can't occur in the COLO_EXIT event, only in the result of
> query-colo-status, can it?


Yes.


> Worth spelling that out in the documentation?
>
>
OK, I will add more comments here in next version.

Thanks
Zhang Chen



> > +#
> > +# @request: COLO exit is due to an external request
> > +#
> > +# @error: COLO exit is due to an internal error
> > +#
> > +# Since: 2.13
> > +##
> > +{ 'enum': 'COLOExitReason',
> > +  'data': [ 'none', 'request', 'error' ] }
> > +
> > +##
> >  # @x-colo-lost-heartbeat:
> >  #
> >  # Tell qemu that heartbeat is lost, request it to do takeover procedures.
>
Markus Armbruster May 17, 2018, 8:19 a.m. UTC | #3
Zhang Chen <zhangckid@gmail.com> writes:

> On Tue, May 15, 2018 at 10:29 PM, Markus Armbruster <armbru@redhat.com>
> wrote:
>
>> Zhang Chen <zhangckid@gmail.com> writes:
>>
>> > From: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> >
>> > If some errors happen during VM's COLO FT stage, it's important to
>> > notify the users of this event. Together with 'x-colo-lost-heartbeat',
>> > Users can intervene in COLO's failover work immediately.
>> > If users don't want to get involved in COLO's failover verdict,
>> > it is still necessary to notify users that we exited COLO mode.
>> >
>> > Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
>> > Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
>> > Reviewed-by: Eric Blake <eblake@redhat.com>
>> > ---
>> >  migration/colo.c    | 20 ++++++++++++++++++++
>> >  qapi/migration.json | 37 +++++++++++++++++++++++++++++++++++++
>> >  2 files changed, 57 insertions(+)
>> >
>> > diff --git a/migration/colo.c b/migration/colo.c
>> > index c083d36..8ca6381 100644
>> > --- a/migration/colo.c
>> > +++ b/migration/colo.c
>> > @@ -28,6 +28,7 @@
>> >  #include "net/colo-compare.h"
>> >  #include "net/colo.h"
>> >  #include "block/block.h"
>> > +#include "qapi/qapi-events-migration.h"
>> >
>> >  static bool vmstate_loading;
>> >  static Notifier packets_compare_notifier;
>> > @@ -514,6 +515,18 @@ out:
>> >          qemu_fclose(fb);
>> >      }
>> >
>> > +    /*
>> > +     * There are only two reasons we can go here, some error happened.
>> > +     * Or the user triggered failover.
>> > +     */
>> > +    if (failover_get_state() == FAILOVER_STATUS_NONE) {
>> > +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
>> > +                                  COLO_EXIT_REASON_ERROR, NULL);
>> > +    } else {
>> > +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
>> > +                                  COLO_EXIT_REASON_REQUEST, NULL);
>> > +    }
>>
>> Your comment makes me suspect failover_get_state() can only be
>> FAILOVER_STATUS_NONE or FAILOVER_STATUS_REQUIRE here.  Is that correct?
>>
>> If yes, I recommend to add a suitable assertion.

... to make the possible states immediately obvious.  The fact that you
felt a need for a comment is further evidence of non-obviousness.

>
> Yes, and what kinds of 'suitable assertion'? Just for the
> 'failover_get_state()' ?

Here's one way to skin this cat:

          failover_state = failover_get_state();
          if (failover_state == FAILOVER_STATUS_NONE) {
              qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
                                        COLO_EXIT_REASON_ERROR, NULL);
          } else {
              assert(failover_state == FAILOVER_STATUS_REQUIRE);
              qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
                                        COLO_EXIT_REASON_REQUEST, NULL);
          }

Another one:

          switch (failover_get_state()) {
          case FAILOVER_STATUS_NONE:
              qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
                                        COLO_EXIT_REASON_ERROR, NULL);
              break;
          case FAILOVER_STATUS_REQUIRE:
              qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
                                        COLO_EXIT_REASON_REQUEST, NULL);
              break;
          default:
              abort();
          }

Either way, the possible states are immediately obvious.  The run time
check is a nice bonus.

With just your comment, the reader still has to make the connection from
the comment's prose to states, i.e. from "some error happened" to
FAILOVER_STATUS_NONE, and from "user triggered failover" to
FAILOVER_STATUS_REQUIRE.
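
For illustration only, the switch variant above can be sketched as a small standalone helper that both the primary and the secondary exit paths could share. The enum types and the name sketch_exit_reason() are hypothetical stand-ins invented for this sketch; they are not QEMU's generated QAPI types, and the patch itself does not contain such a helper.

```c
#include <stdlib.h>

/* Stand-in for the failover states; hypothetical, not QEMU's FailoverStatus. */
typedef enum {
    SKETCH_FAILOVER_NONE,
    SKETCH_FAILOVER_REQUIRE,
    SKETCH_FAILOVER_ACTIVE,
} SketchFailoverStatus;

/* Stand-in for the exit reasons; hypothetical, not QEMU's COLOExitReason. */
typedef enum {
    SKETCH_EXIT_REASON_NONE,
    SKETCH_EXIT_REASON_REQUEST,
    SKETCH_EXIT_REASON_ERROR,
} SketchExitReason;

/*
 * Map the failover state seen on the exit path to the reason reported in
 * the event.  Only two states are expected here; any other state means the
 * assumption in the review is wrong, so abort instead of guessing.
 */
static SketchExitReason sketch_exit_reason(SketchFailoverStatus state)
{
    switch (state) {
    case SKETCH_FAILOVER_NONE:
        /* No failover was requested, so COLO exited because of an error. */
        return SKETCH_EXIT_REASON_ERROR;
    case SKETCH_FAILOVER_REQUIRE:
        /* The user triggered failover. */
        return SKETCH_EXIT_REASON_REQUEST;
    default:
        abort();
    }
}
```

With a helper of this shape, each exit path would make a single event call and pass the mapped reason, so the state-to-reason mapping lives in one place.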

[...]
Zhang Chen May 20, 2018, 3:20 p.m. UTC | #4
On Thu, May 17, 2018 at 4:19 PM, Markus Armbruster <armbru@redhat.com>
wrote:

> Zhang Chen <zhangckid@gmail.com> writes:
>
> > On Tue, May 15, 2018 at 10:29 PM, Markus Armbruster <armbru@redhat.com>
> > wrote:
> >
> >> Zhang Chen <zhangckid@gmail.com> writes:
> >>
> >> > From: zhanghailiang <zhang.zhanghailiang@huawei.com>
> >> >
> >> > If some errors happen during VM's COLO FT stage, it's important to
> >> > notify the users of this event. Together with 'x-colo-lost-heartbeat',
> >> > Users can intervene in COLO's failover work immediately.
> >> > If users don't want to get involved in COLO's failover verdict,
> >> > it is still necessary to notify users that we exited COLO mode.
> >> >
> >> > Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> >> > Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> >> > Signed-off-by: Zhang Chen <zhangckid@gmail.com>
> >> > Reviewed-by: Eric Blake <eblake@redhat.com>
> >> > ---
> >> >  migration/colo.c    | 20 ++++++++++++++++++++
> >> >  qapi/migration.json | 37 +++++++++++++++++++++++++++++++++++++
> >> >  2 files changed, 57 insertions(+)
> >> >
> >> > diff --git a/migration/colo.c b/migration/colo.c
> >> > index c083d36..8ca6381 100644
> >> > --- a/migration/colo.c
> >> > +++ b/migration/colo.c
> >> > @@ -28,6 +28,7 @@
> >> >  #include "net/colo-compare.h"
> >> >  #include "net/colo.h"
> >> >  #include "block/block.h"
> >> > +#include "qapi/qapi-events-migration.h"
> >> >
> >> >  static bool vmstate_loading;
> >> >  static Notifier packets_compare_notifier;
> >> > @@ -514,6 +515,18 @@ out:
> >> >          qemu_fclose(fb);
> >> >      }
> >> >
> >> > +    /*
> >> > +     * There are only two reasons we can go here, some error happened.
> >> > +     * Or the user triggered failover.
> >> > +     */
> >> > +    if (failover_get_state() == FAILOVER_STATUS_NONE) {
> >> > +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
> >> > +                                  COLO_EXIT_REASON_ERROR, NULL);
> >> > +    } else {
> >> > +        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
> >> > +                                  COLO_EXIT_REASON_REQUEST, NULL);
> >> > +    }
> >>
> >> Your comment makes me suspect failover_get_state() can only be
> >> FAILOVER_STATUS_NONE or FAILOVER_STATUS_REQUIRE here.  Is that correct?
> >>
> >> If yes, I recommend to add a suitable assertion.
>
> ... to make the possible states immediately obvious.  The fact that you
> felt a need for a comment is further evidence of non-obviousness.
>
> >
> > Yes, and what kinds of 'suitable assertion'? Just for the
> > 'failover_get_state()' ?
>
> Here's one way to skin this cat:
>
>           failover_state = failover_get_state();
>           if (failover_state == FAILOVER_STATUS_NONE) {
>               qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
>                                         COLO_EXIT_REASON_ERROR, NULL);
>           } else {
>               assert(failover_state == FAILOVER_STATUS_REQUIRE);
>               qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
>                                         COLO_EXIT_REASON_REQUEST, NULL);
>           }
>
> Another one:
>
>           switch (failover_get_state()) {
>           case FAILOVER_STATUS_NONE:
>               qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
>                                         COLO_EXIT_REASON_ERROR, NULL);
>               break;
>           case FAILOVER_STATUS_REQUIRE:
>               qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
>                                         COLO_EXIT_REASON_REQUEST, NULL);
>               break;
>           default:
>               abort();
>           }
>
> Either way, the possible states are immediately obvious.  The run time
> check is a nice bonus.
>
> With just your comment, the reader still has to make the connection from
> the comment's prose to states, i.e. from "some error happened" to
> FAILOVER_STATUS_NONE, and from "user triggered failover" to
> FAILOVER_STATUS_REQUIRE.
>
> [...]
>


I got it, thanks for your detailed reply.
I will fix it in next version.

Thanks
Zhang Chen
Patch

diff --git a/migration/colo.c b/migration/colo.c
index c083d36..8ca6381 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -28,6 +28,7 @@ 
 #include "net/colo-compare.h"
 #include "net/colo.h"
 #include "block/block.h"
+#include "qapi/qapi-events-migration.h"
 
 static bool vmstate_loading;
 static Notifier packets_compare_notifier;
@@ -514,6 +515,18 @@  out:
         qemu_fclose(fb);
     }
 
+    /*
+     * There are only two reasons we can go here, some error happened.
+     * Or the user triggered failover.
+     */
+    if (failover_get_state() == FAILOVER_STATUS_NONE) {
+        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
+                                  COLO_EXIT_REASON_ERROR, NULL);
+    } else {
+        qapi_event_send_colo_exit(COLO_MODE_PRIMARY,
+                                  COLO_EXIT_REASON_REQUEST, NULL);
+    }
+
     /* Hope this not to be too long to wait here */
     qemu_sem_wait(&s->colo_exit_sem);
     qemu_sem_destroy(&s->colo_exit_sem);
@@ -744,6 +757,13 @@  out:
     if (local_err) {
         error_report_err(local_err);
     }
+    if (failover_get_state() == FAILOVER_STATUS_NONE) {
+        qapi_event_send_colo_exit(COLO_MODE_SECONDARY,
+                                  COLO_EXIT_REASON_ERROR, NULL);
+    } else {
+        qapi_event_send_colo_exit(COLO_MODE_SECONDARY,
+                                  COLO_EXIT_REASON_REQUEST, NULL);
+    }
 
     if (fb) {
         qemu_fclose(fb);
diff --git a/qapi/migration.json b/qapi/migration.json
index f3974c6..55dae48 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -875,6 +875,43 @@ 
   'data': [ 'none', 'require', 'active', 'completed', 'relaunch' ] }
 
 ##
+# @COLO_EXIT:
+#
+# Emitted when VM finishes COLO mode due to some errors happening or
+# at the request of users.
+#
+# @mode: report COLO mode when COLO exited.
+#
+# @reason: describes the reason for the COLO exit.
+#
+# Since: 2.13
+#
+# Example:
+#
+# <- { "timestamp": {"seconds": 2032141960, "microseconds": 417172},
+#      "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
+#
+##
+{ 'event': 'COLO_EXIT',
+  'data': {'mode': 'COLOMode', 'reason': 'COLOExitReason' } }
+
+##
+# @COLOExitReason:
+#
+# The reason for a COLO exit
+#
+# @none: no failover has ever happened.
+#
+# @request: COLO exit is due to an external request
+#
+# @error: COLO exit is due to an internal error
+#
+# Since: 2.13
+##
+{ 'enum': 'COLOExitReason',
+  'data': [ 'none', 'request', 'error' ] }
+
+##
 # @x-colo-lost-heartbeat:
 #
 # Tell qemu that heartbeat is lost, request it to do takeover procedures.