
[3/3] COLO: Optimize memory back-up process

Message ID 20200217012049.22988-4-zhang.zhanghailiang@huawei.com (mailing list archive)
State New, archived
Series Optimize VM's downtime while do checkpoint in COLO

Commit Message

Zhanghailiang Feb. 17, 2020, 1:20 a.m. UTC
This patch reduces the VM's downtime in the initial COLO process.
Previously, we copied all this memory in the COLO preparing stage,
during which the VM has to be stopped; this is a time-consuming
process. Here we optimize it by a trick: back up every page during
the migration process while COLO is enabled. Though it affects the
speed of the migration, it obviously reduces the downtime of backing
up all the SVM's memory in the COLO preparing stage.

Signed-off-by: Hailiang Zhang <zhang.zhanghailiang@huawei.com>
---
 migration/colo.c |  3 +++
 migration/ram.c  | 35 +++++++++++++++++++++++++++--------
 migration/ram.h  |  1 +
 3 files changed, 31 insertions(+), 8 deletions(-)

Comments

Dr. David Alan Gilbert Feb. 20, 2020, 6:24 p.m. UTC | #1
* Hailiang Zhang (zhang.zhanghailiang@huawei.com) wrote:
> This patch reduces the VM's downtime in the initial COLO process.
> Previously, we copied all this memory in the COLO preparing stage,
> during which the VM has to be stopped; this is a time-consuming
> process. Here we optimize it by a trick: back up every page during
> the migration process while COLO is enabled. Though it affects the
> speed of the migration, it obviously reduces the downtime of backing
> up all the SVM's memory in the COLO preparing stage.
> 
> Signed-off-by: Hailiang Zhang <zhang.zhanghailiang@huawei.com>

OK, I think this is right, but it took me quite a while to understand;
I think one of the comments below might not be right:

> ---
>  migration/colo.c |  3 +++
>  migration/ram.c  | 35 +++++++++++++++++++++++++++--------
>  migration/ram.h  |  1 +
>  3 files changed, 31 insertions(+), 8 deletions(-)
> 
> diff --git a/migration/colo.c b/migration/colo.c
> index d30c6bc4ad..febf010571 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -26,6 +26,7 @@
>  #include "qemu/main-loop.h"
>  #include "qemu/rcu.h"
>  #include "migration/failover.h"
> +#include "migration/ram.h"
>  #ifdef CONFIG_REPLICATION
>  #include "replication.h"
>  #endif
> @@ -906,6 +907,8 @@ void *colo_process_incoming_thread(void *opaque)
>       */
>      qemu_file_set_blocking(mis->from_src_file, true);
>  
> +    colo_incoming_start_dirty_log();
> +
>      bioc = qio_channel_buffer_new(COLO_BUFFER_BASE_SIZE);
>      fb = qemu_fopen_channel_input(QIO_CHANNEL(bioc));
>      object_unref(OBJECT(bioc));
> diff --git a/migration/ram.c b/migration/ram.c
> index ed23ed1c7c..24a8aa3527 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2986,7 +2986,6 @@ int colo_init_ram_cache(void)
>                  }
>                  return -errno;
>              }
> -            memcpy(block->colo_cache, block->host, block->used_length);
>          }
>      }
>  
> @@ -3005,12 +3004,16 @@ int colo_init_ram_cache(void)
>              bitmap_set(block->bmap, 0, pages);
>          }
>      }
> +
> +    return 0;
> +}
> +
> +void colo_incoming_start_dirty_log(void)
> +{
>      ram_state = g_new0(RAMState, 1);
>      ram_state->migration_dirty_pages = 0;
>      qemu_mutex_init(&ram_state->bitmap_mutex);
>      memory_global_dirty_log_start();
> -
> -    return 0;
>  }
>  
>  /* It is need to hold the global lock to call this helper */
> @@ -3348,7 +3351,7 @@ static int ram_load_precopy(QEMUFile *f)
>  
>      while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
>          ram_addr_t addr, total_ram_bytes;
> -        void *host = NULL;
> +        void *host = NULL, *host_bak = NULL;
>          uint8_t ch;
>  
>          /*
> @@ -3378,13 +3381,26 @@ static int ram_load_precopy(QEMUFile *f)
>          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
>                       RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
>              RAMBlock *block = ram_block_from_stream(f, flags);
> -
>              /*
> -             * After going into COLO, we should load the Page into colo_cache.
> +             * After going into COLO, we should load the Page into colo_cache
> +             * NOTE: We need to keep a copy of SVM's ram in colo_cache.
> +             * Previously, we copied all this memory in the COLO preparing stage,
> +             * during which the VM has to be stopped; this is a time-consuming
> +             * process. Here we optimize it by a trick: back up every page during
> +             * the migration process while COLO is enabled. Though it affects the
> +             * speed of the migration, it obviously reduces the downtime of backing
> +             * up all the SVM's memory in the COLO preparing stage.
>               */
> -            if (migration_incoming_in_colo_state()) {
> +            if (migration_incoming_colo_enabled()) {
>                  host = colo_cache_from_block_offset(block, addr);
> -            } else {
> +                /*
> +                 * After going into COLO, load the Page into colo_cache.
> +                 */
> +                if (!migration_incoming_in_colo_state()) {
> +                    host_bak = host;
> +                }
> +            }
> +            if (!migration_incoming_in_colo_state()) {
>                  host = host_from_ram_block_offset(block, addr);

So this works out as quite complicated:
   a) In normal migration we do the last one and just set:
         host = host_from_ram_block_offset(block, addr);
         host_bak = NULL

   b) At the start, when colo_enabled, but !in_colo_state
         host = colo_cache
         host_bak = host
         host = host_from_ram_block_offset

   c) in_colo_state
         host = colo_cache
         host_bak = NULL


(b) is pretty confusing, setting host twice; can't we tidy that up?
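
Something along these lines (completely untested, just a sketch of the
shape) might make the three cases easier to follow:

   host = host_from_ram_block_offset(block, addr);
   if (migration_incoming_colo_enabled()) {
       if (migration_incoming_in_colo_state()) {
           /* In COLO state: load the page into the cache only. */
           host = colo_cache_from_block_offset(block, addr);
       } else {
           /*
            * Before COLO state: load into SVM's ram as usual, and
            * remember the cache slot so this page can also be copied
            * into the cache once it has been loaded.
            */
           host_bak = colo_cache_from_block_offset(block, addr);
       }
   }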

Also, that last comment 'After going into COLO' I think is really
  'Before COLO state, copy from ram into cache'
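i.e. something like (wording only, not a functional change):

   /*
    * Before COLO state, copy this page from ram into the cache.
    */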

Dave

>              }
>              if (!host) {
> @@ -3506,6 +3522,9 @@ static int ram_load_precopy(QEMUFile *f)
>          if (!ret) {
>              ret = qemu_file_get_error(f);
>          }
> +        if (!ret && host_bak && host) {
> +            memcpy(host_bak, host, TARGET_PAGE_SIZE);
> +        }
>      }
>  
>      ret |= wait_for_decompress_done();
> diff --git a/migration/ram.h b/migration/ram.h
> index a553d40751..5ceaff7cb4 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -66,5 +66,6 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
>  /* ram cache */
>  int colo_init_ram_cache(void);
>  void colo_release_ram_cache(void);
> +void colo_incoming_start_dirty_log(void);
>  
>  #endif
> -- 
> 2.21.0
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Zhanghailiang Feb. 24, 2020, 4:10 a.m. UTC | #2
Hi Dave,

> -----Original Message-----
> From: Dr. David Alan Gilbert [mailto:dgilbert@redhat.com]
> Sent: Friday, February 21, 2020 2:25 AM
> To: Zhanghailiang <zhang.zhanghailiang@huawei.com>
> Cc: qemu-devel@nongnu.org; quintela@redhat.com; chen.zhang@intel.com;
> danielcho@qnap.com
> Subject: Re: [PATCH 3/3] COLO: Optimize memory back-up process
> 
> * Hailiang Zhang (zhang.zhanghailiang@huawei.com) wrote:
> > This patch reduces the VM's downtime in the initial COLO process.
> > Previously, we copied all this memory in the COLO preparing stage,
> > during which the VM has to be stopped; this is a time-consuming
> > process. Here we optimize it by a trick: back up every page during
> > the migration process while COLO is enabled. Though it affects the
> > speed of the migration, it obviously reduces the downtime of backing
> > up all the SVM's memory in the COLO preparing stage.
> >
> > Signed-off-by: Hailiang Zhang <zhang.zhanghailiang@huawei.com>
> 
> OK, I think this is right, but it took me quite a while to understand;
> I think one of the comments below might not be right:
> 

> > ---
> >  migration/colo.c |  3 +++
> >  migration/ram.c  | 35 +++++++++++++++++++++++++++--------
> >  migration/ram.h  |  1 +
> >  3 files changed, 31 insertions(+), 8 deletions(-)
> >
> > diff --git a/migration/colo.c b/migration/colo.c
> > index d30c6bc4ad..febf010571 100644
> > --- a/migration/colo.c
> > +++ b/migration/colo.c
> > @@ -26,6 +26,7 @@
> >  #include "qemu/main-loop.h"
> >  #include "qemu/rcu.h"
> >  #include "migration/failover.h"
> > +#include "migration/ram.h"
> >  #ifdef CONFIG_REPLICATION
> >  #include "replication.h"
> >  #endif
> > @@ -906,6 +907,8 @@ void *colo_process_incoming_thread(void *opaque)
> >       */
> >      qemu_file_set_blocking(mis->from_src_file, true);
> >
> > +    colo_incoming_start_dirty_log();
> > +
> >      bioc = qio_channel_buffer_new(COLO_BUFFER_BASE_SIZE);
> >      fb = qemu_fopen_channel_input(QIO_CHANNEL(bioc));
> >      object_unref(OBJECT(bioc));
> > diff --git a/migration/ram.c b/migration/ram.c
> > index ed23ed1c7c..24a8aa3527 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -2986,7 +2986,6 @@ int colo_init_ram_cache(void)
> >                  }
> >                  return -errno;
> >              }
> > -            memcpy(block->colo_cache, block->host, block->used_length);
> >          }
> >      }
> >
> > @@ -3005,12 +3004,16 @@ int colo_init_ram_cache(void)
> >              bitmap_set(block->bmap, 0, pages);
> >          }
> >      }
> > +
> > +    return 0;
> > +}
> > +
> > +void colo_incoming_start_dirty_log(void)
> > +{
> >      ram_state = g_new0(RAMState, 1);
> >      ram_state->migration_dirty_pages = 0;
> >      qemu_mutex_init(&ram_state->bitmap_mutex);
> >      memory_global_dirty_log_start();
> > -
> > -    return 0;
> >  }
> >
> >  /* It is need to hold the global lock to call this helper */
> > @@ -3348,7 +3351,7 @@ static int ram_load_precopy(QEMUFile *f)
> >
> >      while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
> >          ram_addr_t addr, total_ram_bytes;
> > -        void *host = NULL;
> > +        void *host = NULL, *host_bak = NULL;
> >          uint8_t ch;
> >
> >          /*
> > @@ -3378,13 +3381,26 @@ static int ram_load_precopy(QEMUFile *f)
> >          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
> >                       RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
> >              RAMBlock *block = ram_block_from_stream(f, flags);
> > -
> >              /*
> > -             * After going into COLO, we should load the Page into colo_cache.
> > +             * After going into COLO, we should load the Page into colo_cache
> > +             * NOTE: We need to keep a copy of SVM's ram in colo_cache.
> > +             * Previously, we copied all this memory in the COLO preparing stage,
> > +             * during which the VM has to be stopped; this is a time-consuming
> > +             * process. Here we optimize it by a trick: back up every page during
> > +             * the migration process while COLO is enabled. Though it affects the
> > +             * speed of the migration, it obviously reduces the downtime of backing
> > +             * up all the SVM's memory in the COLO preparing stage.
> >               */
> > -            if (migration_incoming_in_colo_state()) {
> > +            if (migration_incoming_colo_enabled()) {
> >                  host = colo_cache_from_block_offset(block, addr);
> > -            } else {
> > +                /*
> > +                 * After going into COLO, load the Page into colo_cache.
> > +                 */
> > +                if (!migration_incoming_in_colo_state()) {
> > +                    host_bak = host;
> > +                }
> > +            }
> > +            if (!migration_incoming_in_colo_state()) {
> >                  host = host_from_ram_block_offset(block, addr);
> 
> So this works out as quite complicated:
>    a) In normal migration we do the last one and just set:
>          host = host_from_ram_block_offset(block, addr);
>          host_bak = NULL
> 
>    b) At the start, when colo_enabled, but !in_colo_state
>          host = colo_cache
>          host_bak = host
>          host = host_from_ram_block_offset
> 
>    c) in_colo_state
>          host = colo_cache
>          host_bak = NULL
> 
> 
> (b) is pretty confusing, setting host twice; can't we tidy that up?
> 
> Also, that last comment 'After going into COLO' I think is really
>   'Before COLO state, copy from ram into cache'
> 

The code logic here is not good; I have fixed it in V2 like this :)

+            host = host_from_ram_block_offset(block, addr);
             /*
-             * After going into COLO, we should load the Page into colo_cache.
+             * After going into COLO stage, we should not load the page
+             * into SVM's memory directly; we put them into colo_cache first.
+             * NOTE: We need to keep a copy of SVM's ram in colo_cache.
+             * Previously, we copied all this memory in the COLO preparing stage,
+             * during which the VM has to be stopped; this is a time-consuming
+             * process. Here we optimize it by a trick: back up every page during
+             * the migration process while COLO is enabled. Though it affects the
+             * speed of the migration, it obviously reduces the downtime of backing
+             * up all the SVM's memory in the COLO preparing stage.
              */
-            if (migration_incoming_in_colo_state()) {
-                host = colo_cache_from_block_offset(block, addr);
-            } else {
-                host = host_from_ram_block_offset(block, addr);
+            if (migration_incoming_colo_enabled()) {
+                if (migration_incoming_in_colo_state()) {
+                    /* In COLO stage, put all pages into cache temporarily */
+                    host = colo_cache_from_block_offset(block, addr);
+                } else {
+                   /*
+                    * In migration stage but before COLO stage,
+                    * Put all pages into both cache and SVM's memory.
+                    */
+                    host_bak = colo_cache_from_block_offset(block, addr);
+                }
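
When host_bak is set (migration stage, before COLO state), host points
at SVM's ram and host_bak at the colo_cache slot, so the page is only
copied into the cache at the end of the loop iteration, once it has
been loaded without error (the hunk quoted further down):

            if (!ret && host_bak && host) {
                memcpy(host_bak, host, TARGET_PAGE_SIZE);
            }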


Thanks,
Hailiang

> Dave
> 
> >              }
> >              if (!host) {
> > @@ -3506,6 +3522,9 @@ static int ram_load_precopy(QEMUFile *f)
> >          if (!ret) {
> >              ret = qemu_file_get_error(f);
> >          }
> > +        if (!ret && host_bak && host) {
> > +            memcpy(host_bak, host, TARGET_PAGE_SIZE);
> > +        }
> >      }
> >
> >      ret |= wait_for_decompress_done();
> > diff --git a/migration/ram.h b/migration/ram.h
> > index a553d40751..5ceaff7cb4 100644
> > --- a/migration/ram.h
> > +++ b/migration/ram.h
> > @@ -66,5 +66,6 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
> >  /* ram cache */
> >  int colo_init_ram_cache(void);
> >  void colo_release_ram_cache(void);
> > +void colo_incoming_start_dirty_log(void);
> >
> >  #endif
> > --
> > 2.21.0
> >
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Patch

diff --git a/migration/colo.c b/migration/colo.c
index d30c6bc4ad..febf010571 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -26,6 +26,7 @@ 
 #include "qemu/main-loop.h"
 #include "qemu/rcu.h"
 #include "migration/failover.h"
+#include "migration/ram.h"
 #ifdef CONFIG_REPLICATION
 #include "replication.h"
 #endif
@@ -906,6 +907,8 @@  void *colo_process_incoming_thread(void *opaque)
      */
     qemu_file_set_blocking(mis->from_src_file, true);
 
+    colo_incoming_start_dirty_log();
+
     bioc = qio_channel_buffer_new(COLO_BUFFER_BASE_SIZE);
     fb = qemu_fopen_channel_input(QIO_CHANNEL(bioc));
     object_unref(OBJECT(bioc));
diff --git a/migration/ram.c b/migration/ram.c
index ed23ed1c7c..24a8aa3527 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2986,7 +2986,6 @@  int colo_init_ram_cache(void)
                 }
                 return -errno;
             }
-            memcpy(block->colo_cache, block->host, block->used_length);
         }
     }
 
@@ -3005,12 +3004,16 @@  int colo_init_ram_cache(void)
             bitmap_set(block->bmap, 0, pages);
         }
     }
+
+    return 0;
+}
+
+void colo_incoming_start_dirty_log(void)
+{
     ram_state = g_new0(RAMState, 1);
     ram_state->migration_dirty_pages = 0;
     qemu_mutex_init(&ram_state->bitmap_mutex);
     memory_global_dirty_log_start();
-
-    return 0;
 }
 
 /* It is need to hold the global lock to call this helper */
@@ -3348,7 +3351,7 @@  static int ram_load_precopy(QEMUFile *f)
 
     while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
         ram_addr_t addr, total_ram_bytes;
-        void *host = NULL;
+        void *host = NULL, *host_bak = NULL;
         uint8_t ch;
 
         /*
@@ -3378,13 +3381,26 @@  static int ram_load_precopy(QEMUFile *f)
         if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
                      RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
             RAMBlock *block = ram_block_from_stream(f, flags);
-
             /*
-             * After going into COLO, we should load the Page into colo_cache.
+             * After going into COLO, we should load the Page into colo_cache
+             * NOTE: We need to keep a copy of SVM's ram in colo_cache.
+             * Previously, we copied all this memory in the COLO preparing stage,
+             * during which the VM has to be stopped; this is a time-consuming
+             * process. Here we optimize it by a trick: back up every page during
+             * the migration process while COLO is enabled. Though it affects the
+             * speed of the migration, it obviously reduces the downtime of backing
+             * up all the SVM's memory in the COLO preparing stage.
              */
-            if (migration_incoming_in_colo_state()) {
+            if (migration_incoming_colo_enabled()) {
                 host = colo_cache_from_block_offset(block, addr);
-            } else {
+                /*
+                 * After going into COLO, load the Page into colo_cache.
+                 */
+                if (!migration_incoming_in_colo_state()) {
+                    host_bak = host;
+                }
+            }
+            if (!migration_incoming_in_colo_state()) {
                 host = host_from_ram_block_offset(block, addr);
             }
             if (!host) {
@@ -3506,6 +3522,9 @@  static int ram_load_precopy(QEMUFile *f)
         if (!ret) {
             ret = qemu_file_get_error(f);
         }
+        if (!ret && host_bak && host) {
+            memcpy(host_bak, host, TARGET_PAGE_SIZE);
+        }
     }
 
     ret |= wait_for_decompress_done();
diff --git a/migration/ram.h b/migration/ram.h
index a553d40751..5ceaff7cb4 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -66,5 +66,6 @@  int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
 /* ram cache */
 int colo_init_ram_cache(void);
 void colo_release_ram_cache(void);
+void colo_incoming_start_dirty_log(void);
 
 #endif