
[v2] migration: skip sending ram pages released by virtio-balloon driver.

Message ID 1459138565-6244-1-git-send-email-jitendra.kolhe@hpe.com (mailing list archive)
State New, archived

Commit Message

Jitendra Kolhe March 28, 2016, 4:16 a.m. UTC
While measuring live migration performance for a qemu/kvm guest, it
was observed that qemu does not maintain any intelligence about guest
ram pages which have been released by the guest balloon driver, and
treats such pages like any other normal guest ram page. This has a
direct impact on overall migration time for a guest which has released
(ballooned out) memory to the host.

On large systems, where we can configure guests with 1TB of memory
and a considerable amount of that memory has been released to the
host by the balloon driver, the migration time gets significantly worse.

The solution proposed below is local to qemu (and does not require
any modification to the Linux kernel or any guest driver). We have
verified the fix for large guests (>= 1TB) on HPE Superdome X (which
can support up to 240 cores and 12TB of memory); in the case where 90%
of memory is released by the balloon driver, the migration time for an
idle guest reduces to ~600 secs from ~1200 secs.

Detail: During live migration, as part of the 1st iteration,
ram_save_iterate() -> ram_find_and_save_block() will try to migrate
ram pages which have been released by the virtio-balloon driver as
part of dynamic memory delete. Even though the pages which are
returned to the host by the virtio-balloon driver are zero pages, the
migration algorithm will still end up scanning each entire page in
ram_find_and_save_block() -> ram_save_page()/
ram_save_compressed_page() -> save_zero_page() -> is_zero_range(). We
also end up sending some control information over the network for
these pages during migration. This adds to the total migration time.
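As a rough illustration of the per-page cost described above, a
minimal stand-in for the zero check might look like the following.
This is a simplified sketch, not QEMU's actual code (the real
is_zero_range() path uses an optimized buffer_is_zero()):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TARGET_PAGE_SIZE 4096

/* Simplified stand-in for the save_zero_page() -> is_zero_range()
 * check: every byte of the page is scanned, which is exactly the
 * per-page cost the patch avoids for pages already ballooned out. */
static int is_zero_page(const uint8_t *p)
{
    for (size_t i = 0; i < TARGET_PAGE_SIZE; i++) {
        if (p[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

For a 1TB guest this scan runs once per 4 KiB page, so skipping it for
ballooned-out pages removes a large amount of memory traffic.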

The proposed fix uses the existing bitmap infrastructure to create a
virtio-balloon bitmap. Each bit in the bitmap represents a guest ram
page of size 1UL << VIRTIO_BALLOON_PFN_SHIFT. The bitmap covers the
entire guest ram, up to the maximum configured memory. Guest ram pages
claimed by the virtio-balloon driver are represented by 1 in the
bitmap. During live migration, each guest ram page (host VA offset) is
checked against the virtio-balloon bitmap; if the bit is set, the
corresponding ram page is excluded from scanning, and no control
information is sent for it during migration. The bitmap is also
migrated to the target as part of every ram_save_iterate loop, and
after the guest is stopped the remaining balloon bitmap is migrated as
part of the balloon driver's save/load interface.
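The address-to-bit mapping described above can be sketched as follows.
This is a simplified illustration (names, RCU, and locking from the
patch are omitted); VIRTIO_BALLOON_PFN_SHIFT is 12, i.e. 4 KiB balloon
pages:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define VIRTIO_BALLOON_PFN_SHIFT 12
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Inflate marks the page as ballooned out (skippable during
 * migration); deflate hands it back to the guest. */
static void balloon_bitmap_set(unsigned long *bmap, uint64_t addr, int inflate)
{
    uint64_t bit = addr >> VIRTIO_BALLOON_PFN_SHIFT;

    if (inflate) {
        bmap[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
    } else {
        bmap[bit / BITS_PER_LONG] &= ~(1UL << (bit % BITS_PER_LONG));
    }
}

/* Returns 1 if the page containing addr is ballooned out. */
static int balloon_bitmap_test(const unsigned long *bmap, uint64_t addr)
{
    uint64_t bit = addr >> VIRTIO_BALLOON_PFN_SHIFT;

    return (bmap[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1;
}
```

The patch's qemu_balloon_bitmap_update()/qemu_balloon_bitmap_test()
perform essentially this indexing, with the bitmap sized from
last_ram_offset() and protected by a mutex plus RCU.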

With the proposed fix, the average migration time for an idle guest
with 1TB maximum memory and 64 vCPUs:
 - reduces from ~1200 secs to ~600 secs, with guest memory ballooned
   down to 128GB (~10% of 1TB);
 - reduces from ~1300 secs to ~1200 secs (~7%), with guest memory
   ballooned down to 896GB (~90% of 1TB);
 - with no ballooning configured, we don't expect to see any impact
   on total migration time.

The optimization gets temporarily disabled if a balloon operation is
in progress. Since the optimization skips scanning and sending control
information for ballooned out pages, we might otherwise skip guest ram
pages in cases where the guest balloon driver has released a ram page
back to the guest but has not yet informed the host/qemu about it
(VIRTIO_BALLOON_F_MUST_TELL_HOST). In such a case, with the
optimization enabled, we might skip migrating ram pages which the
guest is actually using. Since this problem is specific to the balloon
leak (deflate) path, the in-progress check could be restricted to leak
operations only.
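The in-progress gate above mirrors the patch's
virtio_balloon_in_progress(), which simply compares the device's
target size against the size the guest has acknowledged. A minimal
sketch (field names taken from the virtio-balloon device state):

```c
#include <assert.h>
#include <stdint.h>

/* While the host-set target (num_pages) and the guest-reported size
 * (actual) disagree, an inflate or deflate is still in flight, so the
 * balloon bitmap cannot be trusted and migration must fall back to
 * normal page scanning for that iteration. */
struct balloon_state {
    uint32_t num_pages;  /* target, set by the host */
    uint32_t actual;     /* last value reported by the guest */
};

static int balloon_in_progress(const struct balloon_state *s)
{
    return s->num_pages != s->actual;
}
```

In the patch this check is consulted from qemu_balloon_bitmap_test()
before trusting any bit in the balloon bitmap.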

The optimization also gets permanently disabled (for all subsequent
migrations) if any migration uses the postcopy capability. With
postcopy, the balloon bitmap would need to be sent after vm_stop,
which would have a significant impact on the downtime. Moreover, since
applications in the guest won't actually fault on ram pages which are
already ballooned out, the proposed optimization would not show any
improvement in migration time during postcopy anyway.

Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
---
Changed in v2:
 - Resolved compilation issue for qemu-user binaries in exec.c
 - Localize balloon bitmap test to save_zero_page().
 - Updated version string for newly added migration capability to 2.7.
 - Made minor modifications to patch commit text.

 balloon.c                          | 253 ++++++++++++++++++++++++++++++++++++-
 exec.c                             |   3 +
 hw/virtio/virtio-balloon.c         |  35 ++++-
 include/hw/virtio/virtio-balloon.h |   1 +
 include/migration/migration.h      |   1 +
 include/sysemu/balloon.h           |  15 ++-
 migration/migration.c              |   9 ++
 migration/ram.c                    |  31 ++++-
 qapi-schema.json                   |   5 +-
 9 files changed, 341 insertions(+), 12 deletions(-)

Comments

Michael S. Tsirkin March 28, 2016, 6:59 a.m. UTC | #1
On Mon, Mar 28, 2016 at 09:46:05AM +0530, Jitendra Kolhe wrote:
> While measuring live migration performance for qemu/kvm guest, it
> was observed that the qemu doesn’t maintain any intelligence for the
> guest ram pages which are released by the guest balloon driver and
> treat such pages as any other normal guest ram pages. This has direct
> impact on overall migration time for the guest which has released
> (ballooned out) memory to the host.
> 
> In case of large systems, where we can configure large guests with 1TB
> and with considerable amount of memory release by balloon driver to the,
> host the migration time gets worse.
> 
> The solution proposed below is local only to qemu (and does not require
> any modification to Linux kernel or any guest driver). We have verified
> the fix for large guests =1TB on HPE Superdome X (which can support up
> to 240 cores and 12TB of memory) and in case where 90% of memory is
> released by balloon driver the migration time for an idle guests reduces
> to ~600 sec's from ~1200 sec’s.
> 
> Detail: During live migration, as part of 1st iteration in ram_save_iterate()
> -> ram_find_and_save_block () will try to migrate ram pages which are
> released by vitrio-balloon driver as part of dynamic memory delete.
> Even though the pages which are returned to the host by virtio-balloon
> driver are zero pages, the migration algorithm will still end up
> scanning the entire page ram_find_and_save_block() -> ram_save_page/
> ram_save_compressed_page -> save_zero_page() -> is_zero_range().  We
> also end-up sending some control information over network for these
> page during migration. This adds to total migration time.
> 
> The proposed fix, uses the existing bitmap infrastructure to create
> a virtio-balloon bitmap. The bits in the bitmap represent a guest ram
> page of size 1UL<< VIRTIO_BALLOON_PFN_SHIFT. The bitmap represents
> entire guest ram memory till max configured memory. Guest ram pages
> claimed by the virtio-balloon driver will be represented by 1 in the
> bitmap. During live migration, each guest ram page (host VA offset)
> is checked against the virtio-balloon bitmap, if the bit is set the
> corresponding ram page will be excluded from scanning and sending
> control information during migration. The bitmap is also migrated to
> the target as part of every ram_save_iterate loop and after the
> guest is stopped remaining balloon bitmap is migrated as part of
> balloon driver save / load interface.
> 
> With the proposed fix, the average migration time for an idle guest
> with 1TB maximum memory and 64vCpus
>  - reduces from ~1200 secs to ~600 sec, with guest memory ballooned
>    down to 128GB (~10% of 1TB).
>  - reduces from ~1300 to ~1200 sec (7%), with guest memory ballooned
>    down to 896GB (~90% of 1TB),
>  - with no ballooning configured, we don’t expect to see any impact
>    on total migration time.
> 
> The optimization gets temporarily disabled, if the balloon operation is
> in progress. Since the optimization skips scanning and migrating control
> information for ballooned out pages, we might skip guest ram pages in
> cases where the guest balloon driver has freed the ram page to the guest
> but not yet informed the host/qemu about the ram page
> (VIRTIO_BALLOON_F_MUST_TELL_HOST). In such case with optimization, we
> might skip migrating ram pages which the guest is using. Since this
> problem is specific to balloon leak, we can restrict balloon operation in
> progress check to only balloon leak operation in progress check.
> 
> The optimization also get permanently disabled (for all subsequent
> migrations) in case any of the migration uses postcopy capability. In case
> of postcopy the balloon bitmap would be required to send after vm_stop,
> which has significant impact on the downtime. Moreover, the applications
> in the guest space won’t be actually faulting on the ram pages which are
> already ballooned out, the proposed optimization will not show any
> improvement in migration time during postcopy.
> 
> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
> ---
> Changed in v2:
>  - Resolved compilation issue for qemu-user binaries in exec.c
>  - Localize balloon bitmap test to save_zero_page().
>  - Updated version string for newly added migration capability to 2.7.
>  - Made minor modifications to patch commit text.
> 
>  balloon.c                          | 253 ++++++++++++++++++++++++++++++++++++-
>  exec.c                             |   3 +
>  hw/virtio/virtio-balloon.c         |  35 ++++-
>  include/hw/virtio/virtio-balloon.h |   1 +
>  include/migration/migration.h      |   1 +
>  include/sysemu/balloon.h           |  15 ++-
>  migration/migration.c              |   9 ++
>  migration/ram.c                    |  31 ++++-
>  qapi-schema.json                   |   5 +-
>  9 files changed, 341 insertions(+), 12 deletions(-)
> 
> diff --git a/balloon.c b/balloon.c
> index f2ef50c..1c2d228 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -33,11 +33,34 @@
>  #include "qmp-commands.h"
>  #include "qapi/qmp/qerror.h"
>  #include "qapi/qmp/qjson.h"
> +#include "exec/ram_addr.h"
> +#include "migration/migration.h"
> +
> +#define BALLOON_BITMAP_DISABLE_FLAG -1UL
> +
> +typedef enum {
> +    BALLOON_BITMAP_DISABLE_NONE = 1, /* Enabled */
> +    BALLOON_BITMAP_DISABLE_CURRENT,
> +    BALLOON_BITMAP_DISABLE_PERMANENT,
> +} BalloonBitmapDisableState;
>  
>  static QEMUBalloonEvent *balloon_event_fn;
>  static QEMUBalloonStatus *balloon_stat_fn;
> +static QEMUBalloonInProgress *balloon_in_progress_fn;
>  static void *balloon_opaque;
>  static bool balloon_inhibited;
> +static unsigned long balloon_bitmap_pages;
> +static unsigned int  balloon_bitmap_pfn_shift;
> +static QemuMutex balloon_bitmap_mutex;
> +static bool balloon_bitmap_xfered;
> +static unsigned long balloon_min_bitmap_offset;
> +static unsigned long balloon_max_bitmap_offset;
> +static BalloonBitmapDisableState balloon_bitmap_disable_state;
> +
> +static struct BitmapRcu {
> +    struct rcu_head rcu;
> +    unsigned long *bmap;
> +} *balloon_bitmap_rcu;
>  
>  bool qemu_balloon_is_inhibited(void)
>  {
> @@ -49,6 +72,21 @@ void qemu_balloon_inhibit(bool state)
>      balloon_inhibited = state;
>  }
>  
> +void qemu_mutex_lock_balloon_bitmap(void)
> +{
> +    qemu_mutex_lock(&balloon_bitmap_mutex);
> +}
> +
> +void qemu_mutex_unlock_balloon_bitmap(void)
> +{
> +    qemu_mutex_unlock(&balloon_bitmap_mutex);
> +}
> +
> +void qemu_balloon_reset_bitmap_data(void)
> +{
> +    balloon_bitmap_xfered = false;
> +}
> +
>  static bool have_balloon(Error **errp)
>  {
>      if (kvm_enabled() && !kvm_has_sync_mmu()) {
> @@ -65,9 +103,12 @@ static bool have_balloon(Error **errp)
>  }
>  
>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> -                             QEMUBalloonStatus *stat_func, void *opaque)
> +                             QEMUBalloonStatus *stat_func,
> +                             QEMUBalloonInProgress *in_progress_func,
> +                             void *opaque, int pfn_shift)
>  {
> -    if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
> +    if (balloon_event_fn || balloon_stat_fn ||
> +        balloon_in_progress_fn || balloon_opaque) {
>          /* We're already registered one balloon handler.  How many can
>           * a guest really have?
>           */
> @@ -75,17 +116,39 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>      }
>      balloon_event_fn = event_func;
>      balloon_stat_fn = stat_func;
> +    balloon_in_progress_fn = in_progress_func;
>      balloon_opaque = opaque;
> +
> +    qemu_mutex_init(&balloon_bitmap_mutex);
> +    balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_NONE;
> +    balloon_bitmap_pfn_shift = pfn_shift;
> +    balloon_bitmap_pages = (last_ram_offset() >> balloon_bitmap_pfn_shift);
> +    balloon_bitmap_rcu = g_new0(struct BitmapRcu, 1);
> +    balloon_bitmap_rcu->bmap = bitmap_new(balloon_bitmap_pages);
> +    bitmap_clear(balloon_bitmap_rcu->bmap, 0, balloon_bitmap_pages);
> +
>      return 0;
>  }
>  
> +static void balloon_bitmap_free(struct BitmapRcu *bmap)
> +{
> +    g_free(bmap->bmap);
> +    g_free(bmap);
> +}
> +
>  void qemu_remove_balloon_handler(void *opaque)
>  {
> +    struct BitmapRcu *bitmap = balloon_bitmap_rcu;
>      if (balloon_opaque != opaque) {
>          return;
>      }
> +    atomic_rcu_set(&balloon_bitmap_rcu, NULL);
> +    if (bitmap) {
> +        call_rcu(bitmap, balloon_bitmap_free, rcu);
> +    }
>      balloon_event_fn = NULL;
>      balloon_stat_fn = NULL;
> +    balloon_in_progress_fn = NULL;
>      balloon_opaque = NULL;
>  }
>  
> @@ -116,3 +179,189 @@ void qmp_balloon(int64_t target, Error **errp)
>      trace_balloon_event(balloon_opaque, target);
>      balloon_event_fn(balloon_opaque, target);
>  }
> +
> +/* Handle Ram hotplug case, only called in case old < new */
> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new)
> +{
> +    struct BitmapRcu *old_bitmap = balloon_bitmap_rcu, *bitmap;
> +    unsigned long old_offset, new_offset;
> +
> +    if (!balloon_bitmap_rcu) {
> +        return -1;
> +    }
> +
> +    old_offset = (old >> balloon_bitmap_pfn_shift);
> +    new_offset = (new >> balloon_bitmap_pfn_shift);
> +
> +    bitmap = g_new(struct BitmapRcu, 1);
> +    bitmap->bmap = bitmap_new(new_offset);
> +
> +    qemu_mutex_lock_balloon_bitmap();
> +    bitmap_clear(bitmap->bmap, 0,
> +                 balloon_bitmap_pages + new_offset - old_offset);
> +    bitmap_copy(bitmap->bmap, old_bitmap->bmap, old_offset);
> +
> +    atomic_rcu_set(&balloon_bitmap_rcu, bitmap);
> +    balloon_bitmap_pages += new_offset - old_offset;
> +    qemu_mutex_unlock_balloon_bitmap();
> +    call_rcu(old_bitmap, balloon_bitmap_free, rcu);
> +
> +    return 0;
> +}
> +
> +/* Should be called with balloon bitmap mutex lock held */
> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate)
> +{
> +    unsigned long *bitmap;
> +    unsigned long offset = 0;
> +
> +    if (!balloon_bitmap_rcu) {
> +        return -1;
> +    }
> +    offset = (addr >> balloon_bitmap_pfn_shift);
> +    if (balloon_bitmap_xfered) {
> +        if (offset < balloon_min_bitmap_offset) {
> +            balloon_min_bitmap_offset = offset;
> +        }
> +        if (offset > balloon_max_bitmap_offset) {
> +            balloon_max_bitmap_offset = offset;
> +        }
> +    }
> +
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    if (deflate == 0) {
> +        set_bit(offset, bitmap);
> +    } else {
> +        clear_bit(offset, bitmap);
> +    }
> +    rcu_read_unlock();
> +    return 0;
> +}
> +
> +void qemu_balloon_bitmap_setup(void)
> +{
> +    if (migrate_postcopy_ram()) {
> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
> +    } else if ((!balloon_bitmap_rcu || !migrate_skip_balloon()) &&
> +               (balloon_bitmap_disable_state !=
> +                BALLOON_BITMAP_DISABLE_PERMANENT)) {
> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_CURRENT;
> +    }
> +}
> +
> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr)
> +{
> +    unsigned long *bitmap;
> +    ram_addr_t base;
> +    unsigned long nr = 0;
> +    int ret = 0;
> +
> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_CURRENT ||
> +        balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
> +        return 0;
> +    }
> +    balloon_in_progress_fn(balloon_opaque, &ret);
> +    if (ret == 1) {
> +        return 0;
> +    }
> +
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    base = rb->offset >> balloon_bitmap_pfn_shift;
> +    nr = base + (addr >> balloon_bitmap_pfn_shift);
> +    if (test_bit(nr, bitmap)) {
> +        ret = 1;
> +    }
> +    rcu_read_unlock();
> +    return ret;
> +}
> +
> +int qemu_balloon_bitmap_save(QEMUFile *f)
> +{
> +    unsigned long *bitmap;
> +    unsigned long offset = 0, next = 0, len = 0;
> +    unsigned long tmpoffset = 0, tmplimit = 0;
> +
> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
> +        qemu_put_be64(f, BALLOON_BITMAP_DISABLE_FLAG);
> +        return 0;
> +    }
> +
> +    qemu_mutex_lock_balloon_bitmap();
> +    if (balloon_bitmap_xfered) {
> +        tmpoffset = balloon_min_bitmap_offset;
> +        tmplimit  = balloon_max_bitmap_offset;
> +    } else {
> +        balloon_bitmap_xfered = true;
> +        tmpoffset = offset;
> +        tmplimit  = balloon_bitmap_pages;
> +    }
> +
> +    balloon_min_bitmap_offset = balloon_bitmap_pages;
> +    balloon_max_bitmap_offset = 0;
> +
> +    qemu_put_be64(f, balloon_bitmap_pages);
> +    qemu_put_be64(f, tmpoffset);
> +    qemu_put_be64(f, tmplimit);
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    while (tmpoffset < tmplimit) {
> +        unsigned long next_set_bit, start_set_bit;
> +        next_set_bit = find_next_bit(bitmap, balloon_bitmap_pages, tmpoffset);
> +        start_set_bit = next_set_bit;
> +        if (next_set_bit == balloon_bitmap_pages) {
> +            len = 0;
> +            next = start_set_bit;
> +            qemu_put_be64(f, next);
> +            qemu_put_be64(f, len);
> +            break;
> +        }
> +        next_set_bit = find_next_zero_bit(bitmap,
> +                                          balloon_bitmap_pages,
> +                                          ++next_set_bit);
> +        len = (next_set_bit - start_set_bit);
> +        next = start_set_bit;
> +        qemu_put_be64(f, next);
> +        qemu_put_be64(f, len);
> +        tmpoffset = next + len;
> +    }
> +    rcu_read_unlock();
> +    qemu_mutex_unlock_balloon_bitmap();
> +    return 0;
> +}
> +
> +int qemu_balloon_bitmap_load(QEMUFile *f)
> +{
> +    unsigned long *bitmap;
> +    unsigned long next = 0, len = 0;
> +    unsigned long tmpoffset = 0, tmplimit = 0;
> +
> +    if (!balloon_bitmap_rcu) {
> +        return -1;
> +    }
> +
> +    qemu_mutex_lock_balloon_bitmap();
> +    balloon_bitmap_pages = qemu_get_be64(f);
> +    if (balloon_bitmap_pages == BALLOON_BITMAP_DISABLE_FLAG) {
> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
> +        qemu_mutex_unlock_balloon_bitmap();
> +        return 0;
> +    }
> +    tmpoffset = qemu_get_be64(f);
> +    tmplimit  = qemu_get_be64(f);
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    while (tmpoffset < tmplimit) {
> +        next = qemu_get_be64(f);
> +        len  = qemu_get_be64(f);
> +        if (len == 0) {
> +            break;
> +        }
> +        bitmap_set(bitmap, next, len);
> +        tmpoffset = next + len;
> +    }
> +    rcu_read_unlock();
> +    qemu_mutex_unlock_balloon_bitmap();
> +    return 0;
> +}
> diff --git a/exec.c b/exec.c
> index f398d21..7a448e5 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -43,6 +43,7 @@
>  #else /* !CONFIG_USER_ONLY */
>  #include "sysemu/xen-mapcache.h"
>  #include "trace.h"
> +#include "sysemu/balloon.h"
>  #endif
>  #include "exec/cpu-all.h"
>  #include "qemu/rcu_queue.h"
> @@ -1610,6 +1611,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>      if (new_ram_size > old_ram_size) {
>          migration_bitmap_extend(old_ram_size, new_ram_size);
>          dirty_memory_extend(old_ram_size, new_ram_size);
> +        qemu_balloon_bitmap_extend(old_ram_size << TARGET_PAGE_BITS,
> +                                   new_ram_size << TARGET_PAGE_BITS);
>      }
>      /* Keep the list sorted from biggest to smallest block.  Unlike QTAILQ,
>       * QLIST (which has an RCU-friendly variant) does not have insertion at
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 22ad25c..9f3a4c8 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -27,6 +27,7 @@
>  #include "qapi/visitor.h"
>  #include "qapi-event.h"
>  #include "trace.h"
> +#include "migration/migration.h"
>  
>  #if defined(__linux__)
>  #include <sys/mman.h>
> @@ -214,11 +215,13 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>      VirtQueueElement *elem;
>      MemoryRegionSection section;
>  
> +    qemu_mutex_lock_balloon_bitmap();
>      for (;;) {
>          size_t offset = 0;
>          uint32_t pfn;
>          elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>          if (!elem) {
> +            qemu_mutex_unlock_balloon_bitmap();
>              return;
>          }
>  
> @@ -242,6 +245,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>              addr = section.offset_within_region;
>              balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
>                           !!(vq == s->dvq));
> +            qemu_balloon_bitmap_update(addr, !!(vq == s->dvq));
>              memory_region_unref(section.mr);
>          }
>  

So the assumption here is that offset_within_region equals
ram ptr if region is get_system_memory.

And I'm not sure that's always right.

Paolo?



> @@ -249,6 +253,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>          virtio_notify(vdev, vq);
>          g_free(elem);
>      }
> +    qemu_mutex_unlock_balloon_bitmap();
>  }
>  
>  static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
> @@ -303,6 +308,16 @@ out:
>      }
>  }
>  
> +static void virtio_balloon_migration_state_changed(Notifier *notifier,
> +                                                   void *data)
> +{
> +    MigrationState *mig = data;
> +
> +    if (migration_has_failed(mig)) {
> +        qemu_balloon_reset_bitmap_data();
> +    }
> +}
> +
>  static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
>  {
>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
> @@ -382,6 +397,16 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
>                                               VIRTIO_BALLOON_PFN_SHIFT);
>  }
>  
> +static void virtio_balloon_in_progress(void *opaque, int *status)
> +{
> +    VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
> +    if (cpu_to_le32(dev->actual) != cpu_to_le32(dev->num_pages)) {
> +        *status = 1;
> +        return;
> +    }
> +    *status = 0;
> +}
> +
>  static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
>  {
>      VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
> @@ -409,6 +434,7 @@ static void virtio_balloon_save_device(VirtIODevice *vdev, QEMUFile *f)
>  
>      qemu_put_be32(f, s->num_pages);
>      qemu_put_be32(f, s->actual);
> +    qemu_balloon_bitmap_save(f);
>  }
>  
>  static int virtio_balloon_load(QEMUFile *f, void *opaque, int version_id)
> @@ -426,6 +452,7 @@ static int virtio_balloon_load_device(VirtIODevice *vdev, QEMUFile *f,
>  
>      s->num_pages = qemu_get_be32(f);
>      s->actual = qemu_get_be32(f);
> +    qemu_balloon_bitmap_load(f);
>      return 0;
>  }
>  
> @@ -439,7 +466,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>                  sizeof(struct virtio_balloon_config));
>  
>      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
> -                                   virtio_balloon_stat, s);
> +                                   virtio_balloon_stat,
> +                                   virtio_balloon_in_progress, s,
> +                                   VIRTIO_BALLOON_PFN_SHIFT);
>  
>      if (ret < 0) {
>          error_setg(errp, "Only one balloon device is supported");
> @@ -453,6 +482,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>  
>      reset_stats(s);
>  
> +    s->migration_state_notifier.notify = virtio_balloon_migration_state_changed;
> +    add_migration_state_change_notifier(&s->migration_state_notifier);
> +
>      register_savevm(dev, "virtio-balloon", -1, 1,
>                      virtio_balloon_save, virtio_balloon_load, s);
>  }
> @@ -462,6 +494,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
>  
> +    remove_migration_state_change_notifier(&s->migration_state_notifier);
>      balloon_stats_destroy_timer(s);
>      qemu_remove_balloon_handler(s);
>      unregister_savevm(dev, "virtio-balloon", s);
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 35f62ac..1ded5a9 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -43,6 +43,7 @@ typedef struct VirtIOBalloon {
>      int64_t stats_last_update;
>      int64_t stats_poll_interval;
>      uint32_t host_features;
> +    Notifier migration_state_notifier;
>  } VirtIOBalloon;
>  
>  #endif
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index ac2c12c..6c1d1af 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -267,6 +267,7 @@ void migrate_del_blocker(Error *reason);
>  
>  bool migrate_postcopy_ram(void);
>  bool migrate_zero_blocks(void);
> +bool migrate_skip_balloon(void);
>  
>  bool migrate_auto_converge(void);
>  
> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
> index 3f976b4..5325c38 100644
> --- a/include/sysemu/balloon.h
> +++ b/include/sysemu/balloon.h
> @@ -15,14 +15,27 @@
>  #define _QEMU_BALLOON_H
>  
>  #include "qapi-types.h"
> +#include "migration/qemu-file.h"
>  
>  typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
>  typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
> +typedef void (QEMUBalloonInProgress) (void *opaque, int *status);
>  
>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> -			     QEMUBalloonStatus *stat_func, void *opaque);
> +                             QEMUBalloonStatus *stat_func,
> +                             QEMUBalloonInProgress *progress_func,
> +                             void *opaque, int pfn_shift);
>  void qemu_remove_balloon_handler(void *opaque);
>  bool qemu_balloon_is_inhibited(void);
>  void qemu_balloon_inhibit(bool state);
> +void qemu_mutex_lock_balloon_bitmap(void);
> +void qemu_mutex_unlock_balloon_bitmap(void);
> +void qemu_balloon_reset_bitmap_data(void);
> +void qemu_balloon_bitmap_setup(void);
> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new);
> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate);
> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr);
> +int qemu_balloon_bitmap_save(QEMUFile *f);
> +int qemu_balloon_bitmap_load(QEMUFile *f);
>  
>  #endif
> diff --git a/migration/migration.c b/migration/migration.c
> index 034a918..cb86307 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1200,6 +1200,15 @@ int migrate_use_xbzrle(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_XBZRLE];
>  }
>  
> +bool migrate_skip_balloon(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_SKIP_BALLOON];
> +}
> +
>  int64_t migrate_xbzrle_cache_size(void)
>  {
>      MigrationState *s;
> diff --git a/migration/ram.c b/migration/ram.c
> index 704f6a9..161ab73 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -40,6 +40,7 @@
>  #include "trace.h"
>  #include "exec/ram_addr.h"
>  #include "qemu/rcu_queue.h"
> +#include "sysemu/balloon.h"
>  
>  #ifdef DEBUG_MIGRATION_RAM
>  #define DPRINTF(fmt, ...) \
> @@ -65,6 +66,7 @@ static uint64_t bitmap_sync_count;
>  #define RAM_SAVE_FLAG_XBZRLE   0x40
>  /* 0x80 is reserved in migration.h start with 0x100 next */
>  #define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
> +#define RAM_SAVE_FLAG_BALLOON  0x200
>  
>  static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
>  
> @@ -702,13 +704,17 @@ static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>  {
>      int pages = -1;
>  
> -    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
> -        acct_info.dup_pages++;
> -        *bytes_transferred += save_page_header(f, block,
> +    if (qemu_balloon_bitmap_test(block, offset) != 1) {
> +        if (is_zero_range(p, TARGET_PAGE_SIZE)) {
> +            acct_info.dup_pages++;
> +            *bytes_transferred += save_page_header(f, block,
>                                                 offset | RAM_SAVE_FLAG_COMPRESS);
> -        qemu_put_byte(f, 0);
> -        *bytes_transferred += 1;
> -        pages = 1;
> +            qemu_put_byte(f, 0);
> +            *bytes_transferred += 1;
> +            pages = 1;
> +        }
> +    } else {
> +        pages = 0;
>      }
>  
>      return pages;
> @@ -773,7 +779,7 @@ static int ram_save_page(QEMUFile *f, PageSearchStatus *pss,
>               * page would be stale
>               */
>              xbzrle_cache_zero_page(current_addr);
> -        } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
> +        } else if (pages != 0 && !ram_bulk_stage && migrate_use_xbzrle()) {
>              pages = save_xbzrle_page(f, &p, current_addr, block,
>                                       offset, last_stage, bytes_transferred);
>              if (!last_stage) {
> @@ -1355,6 +1361,9 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
>          }
>  
>          if (found) {
> +            /* skip saving ram host page if the corresponding guest page
> +             * is ballooned out
> +             */
>              pages = ram_save_host_page(ms, f, &pss,
>                                         last_stage, bytes_transferred,
>                                         dirty_ram_abs);
> @@ -1959,6 +1968,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>  
>      rcu_read_unlock();
>  
> +    qemu_balloon_bitmap_setup();
>      ram_control_before_iterate(f, RAM_CONTROL_SETUP);
>      ram_control_after_iterate(f, RAM_CONTROL_SETUP);
>  
> @@ -1984,6 +1994,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>  
>      ram_control_before_iterate(f, RAM_CONTROL_ROUND);
>  
> +    qemu_put_be64(f, RAM_SAVE_FLAG_BALLOON);
> +    qemu_balloon_bitmap_save(f);
> +
>      t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>      i = 0;
>      while ((ret = qemu_file_rate_limit(f)) == 0) {
> @@ -2493,6 +2506,10 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>              }
>              break;
>  
> +        case RAM_SAVE_FLAG_BALLOON:
> +            qemu_balloon_bitmap_load(f);
> +            break;
> +
>          case RAM_SAVE_FLAG_COMPRESS:
>              ch = qemu_get_byte(f);
>              ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 7f8d799..38163ca 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -544,11 +544,14 @@
>  #          been migrated, pulling the remaining pages along as needed. NOTE: If
>  #          the migration fails during postcopy the VM will fail.  (since 2.6)
>  #
> +# @skip-balloon: Skip scanning ram pages released by virtio-balloon driver.
> +#          (since 2.7)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> -           'compress', 'events', 'postcopy-ram'] }
> +           'compress', 'events', 'postcopy-ram', 'skip-balloon'] }
>  
>  ##
>  # @MigrationCapabilityStatus
> -- 
> 1.8.3.1
Denis V. Lunev March 28, 2016, 10:36 a.m. UTC | #2
On 03/28/2016 07:16 AM, Jitendra Kolhe wrote:
> While measuring live migration performance for qemu/kvm guest, it
> was observed that the qemu doesn’t maintain any intelligence for the
> guest ram pages which are released by the guest balloon driver and
> treat such pages as any other normal guest ram pages. This has direct
> impact on overall migration time for the guest which has released
> (ballooned out) memory to the host.
>
> In case of large systems, where we can configure large guests with 1TB
> and with considerable amount of memory release by balloon driver to the,
> host the migration time gets worse.
>
> The solution proposed below is local only to qemu (and does not require
> any modification to Linux kernel or any guest driver). We have verified
> the fix for large guests =1TB on HPE Superdome X (which can support up
> to 240 cores and 12TB of memory) and in case where 90% of memory is
> released by balloon driver the migration time for an idle guests reduces
> to ~600 sec's from ~1200 sec’s.
>
> Detail: During live migration, as part of 1st iteration in ram_save_iterate()
> -> ram_find_and_save_block () will try to migrate ram pages which are
> released by vitrio-balloon driver as part of dynamic memory delete.
> Even though the pages which are returned to the host by virtio-balloon
> driver are zero pages, the migration algorithm will still end up
> scanning the entire page ram_find_and_save_block() -> ram_save_page/
> ram_save_compressed_page -> save_zero_page() -> is_zero_range().  We
> also end-up sending some control information over network for these
> page during migration. This adds to total migration time.
>
> The proposed fix, uses the existing bitmap infrastructure to create
> a virtio-balloon bitmap. The bits in the bitmap represent a guest ram
> page of size 1UL<< VIRTIO_BALLOON_PFN_SHIFT. The bitmap represents
> entire guest ram memory till max configured memory. Guest ram pages
> claimed by the virtio-balloon driver will be represented by 1 in the
> bitmap. During live migration, each guest ram page (host VA offset)
> is checked against the virtio-balloon bitmap, if the bit is set the
> corresponding ram page will be excluded from scanning and sending
> control information during migration. The bitmap is also migrated to
> the target as part of every ram_save_iterate loop and after the
> guest is stopped remaining balloon bitmap is migrated as part of
> balloon driver save / load interface.
>
> With the proposed fix, the average migration time for an idle guest
> with 1TB maximum memory and 64vCpus
>   - reduces from ~1200 secs to ~600 sec, with guest memory ballooned
>     down to 128GB (~10% of 1TB).
>   - reduces from ~1300 to ~1200 sec (7%), with guest memory ballooned
>     down to 896GB (~90% of 1TB),
>   - with no ballooning configured, we don’t expect to see any impact
>     on total migration time.
>
> The optimization gets temporarily disabled, if the balloon operation is
> in progress. Since the optimization skips scanning and migrating control
> information for ballooned out pages, we might skip guest ram pages in
> cases where the guest balloon driver has freed the ram page to the guest
> but not yet informed the host/qemu about the ram page
> (VIRTIO_BALLOON_F_MUST_TELL_HOST). In such case with optimization, we
> might skip migrating ram pages which the guest is using. Since this
> problem is specific to balloon leak, we can restrict balloon operation in
> progress check to only balloon leak operation in progress check.
>
> The optimization also get permanently disabled (for all subsequent
> migrations) in case any of the migration uses postcopy capability. In case
> of postcopy the balloon bitmap would be required to send after vm_stop,
> which has significant impact on the downtime. Moreover, the applications
> in the guest space won’t be actually faulting on the ram pages which are
> already ballooned out, the proposed optimization will not show any
> improvement in migration time during postcopy.
I think you could start the guest without knowledge of the ballooned
pages and that would be completely fine, as these pages will not be
touched at all by the guest. They only come into play when the host
deflates the balloon a bit, and in that case QEMU can safely give a
local zeroed page to the guest.

Thus you could send the bitmap of ballooned pages after the guest on
the destination side has been started, and in that case it will not
influence downtime.

Den
Eric Blake March 28, 2016, 2:11 p.m. UTC | #3
On 03/27/2016 10:16 PM, Jitendra Kolhe wrote:
> While measuring live migration performance for qemu/kvm guest, it
> was observed that the qemu doesn’t maintain any intelligence for the
> guest ram pages which are released by the guest balloon driver and
> treat such pages as any other normal guest ram pages. This has direct
> impact on overall migration time for the guest which has released
> (ballooned out) memory to the host.
> 
> In case of large systems, where we can configure large guests with 1TB
> and with considerable amount of memory release by balloon driver to the,
> host the migration time gets worse.

s/the, host/the host,/

> 
> The optimization gets temporarily disabled, if the balloon operation is

s/disabled,/disabled/

> in progress. Since the optimization skips scanning and migrating control
> information for ballooned out pages, we might skip guest ram pages in
> cases where the guest balloon driver has freed the ram page to the guest
> but not yet informed the host/qemu about the ram page
> (VIRTIO_BALLOON_F_MUST_TELL_HOST). In such case with optimization, we
> might skip migrating ram pages which the guest is using. Since this
> problem is specific to balloon leak, we can restrict balloon operation in
> progress check to only balloon leak operation in progress check.
> 
> The optimization also get permanently disabled (for all subsequent

s/get/gets/

> migrations) in case any of the migration uses postcopy capability. In case
> of postcopy the balloon bitmap would be required to send after vm_stop,
> which has significant impact on the downtime. Moreover, the applications
> in the guest space won’t be actually faulting on the ram pages which are
> already ballooned out, the proposed optimization will not show any
> improvement in migration time during postcopy.
> 
> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
> ---
> Changed in v2:
>  - Resolved compilation issue for qemu-user binaries in exec.c
>  - Localize balloon bitmap test to save_zero_page().
>  - Updated version string for newly added migration capability to 2.7.
>  - Made minor modifications to patch commit text.

I'll leave the technical review to others.

> +++ b/qapi-schema.json
> @@ -544,11 +544,14 @@
>  #          been migrated, pulling the remaining pages along as needed. NOTE: If
>  #          the migration fails during postcopy the VM will fail.  (since 2.6)
>  #
> +# @skip-balloon: Skip scanning ram pages released by virtio-balloon driver.
> +#          (since 2.7)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> -           'compress', 'events', 'postcopy-ram'] }
> +           'compress', 'events', 'postcopy-ram', 'skip-balloon'] }

Does this flag make sense to always have enabled (in which case we don't
need it as a flag), or are there cases where we'd explicitly want to
disable it?
Paolo Bonzini March 29, 2016, 10:05 a.m. UTC | #4
On 28/03/2016 08:59, Michael S. Tsirkin wrote:
>> > +    qemu_mutex_lock_balloon_bitmap();
>> >      for (;;) {
>> >          size_t offset = 0;
>> >          uint32_t pfn;
>> >          elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>> >          if (!elem) {
>> > +            qemu_mutex_unlock_balloon_bitmap();
>> >              return;
>> >          }
>> >  
>> > @@ -242,6 +245,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>> >              addr = section.offset_within_region;
>> >              balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
>> >                           !!(vq == s->dvq));
>> > +            qemu_balloon_bitmap_update(addr, !!(vq == s->dvq));
>> >              memory_region_unref(section.mr);
>> >          }
>> >  
> So the assumption here is that offset_within_region equals
> ram ptr if region is get_system_memory.
> 
> And I'm not sure that's always right.
> 
> Paolo?

Indeed.  It is correct for the main system RAM, but hot-plugged RAM
would also have a zero-based section.offset_within_region.  You need to
add memory_region_get_ram_addr(section.mr), just like the call to
balloon_page adds memory_region_get_ram_ptr(section.mr).

Paolo
Jitendra Kolhe March 29, 2016, 10:47 a.m. UTC | #5
On 3/29/2016 3:35 PM, Paolo Bonzini wrote:
> 
> 
> On 28/03/2016 08:59, Michael S. Tsirkin wrote:
>>>> +    qemu_mutex_lock_balloon_bitmap();
>>>>      for (;;) {
>>>>          size_t offset = 0;
>>>>          uint32_t pfn;
>>>>          elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>>>>          if (!elem) {
>>>> +            qemu_mutex_unlock_balloon_bitmap();
>>>>              return;
>>>>          }
>>>>  
>>>> @@ -242,6 +245,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>>>              addr = section.offset_within_region;
>>>>              balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
>>>>                           !!(vq == s->dvq));
>>>> +            qemu_balloon_bitmap_update(addr, !!(vq == s->dvq));
>>>>              memory_region_unref(section.mr);
>>>>          }
>>>>  
>> So the assumption here is that offset_within_region equals
>> ram ptr if region is get_system_memory.
>>
>> And I'm not sure that's always right.
>>
>> Paolo?
> 
> Indeed.  It is correct for the main system RAM, but hot-plugged RAM
> would also have a zero-based section.offset_within_region.  You need to
> add memory_region_get_ram_addr(section.mr), just like the call to
> balloon_page adds memory_region_get_ram_ptr(section.mr).
> 
> Paolo
> 

Thanks, that's useful.
I am only interested in the offset from the memory region base.
Would the below guest-PA-to-host-offset translation work, as we do in
address_space_translate_internal()?
(Guest PA - section.offset_within_address_space +
section.offset_within_region)

- Jitendra
Paolo Bonzini March 29, 2016, 10:48 a.m. UTC | #6
On 29/03/2016 12:47, Jitendra Kolhe wrote:
> > Indeed.  It is correct for the main system RAM, but hot-plugged RAM
> > would also have a zero-based section.offset_within_region.  You need to
> > add memory_region_get_ram_addr(section.mr), just like the call to
> > balloon_page adds memory_region_get_ram_ptr(section.mr).
> > 
> > Paolo
> 
> I am only interested in the offset from memory region base. 
> Would below guest PA to host offset work, as we do in
> address_space_translate_internal()?
> (Guest pa - section.offset_within_address_space + 
> section.offset_within_region)

Yes, that would work.  But I'm not sure why you're not interested in the
ram_addr_t.

Paolo
Jitendra Kolhe March 29, 2016, 11:34 a.m. UTC | #7
On 3/28/2016 4:06 PM, Denis V. Lunev wrote:
> On 03/28/2016 07:16 AM, Jitendra Kolhe wrote:
>> While measuring live migration performance for qemu/kvm guest, it
>> was observed that the qemu doesn’t maintain any intelligence for the
>> guest ram pages which are released by the guest balloon driver and
>> treat such pages as any other normal guest ram pages. This has direct
>> impact on overall migration time for the guest which has released
>> (ballooned out) memory to the host.
>>
>> In case of large systems, where we can configure large guests with 1TB
>> and with considerable amount of memory release by balloon driver to the,
>> host the migration time gets worse.
>>
>> The solution proposed below is local only to qemu (and does not require
>> any modification to Linux kernel or any guest driver). We have verified
>> the fix for large guests =1TB on HPE Superdome X (which can support up
>> to 240 cores and 12TB of memory) and in case where 90% of memory is
>> released by balloon driver the migration time for an idle guests reduces
>> to ~600 sec's from ~1200 sec’s.
>>
>> Detail: During live migration, as part of 1st iteration in ram_save_iterate()
>> -> ram_find_and_save_block () will try to migrate ram pages which are
>> released by vitrio-balloon driver as part of dynamic memory delete.
>> Even though the pages which are returned to the host by virtio-balloon
>> driver are zero pages, the migration algorithm will still end up
>> scanning the entire page ram_find_and_save_block() -> ram_save_page/
>> ram_save_compressed_page -> save_zero_page() -> is_zero_range().  We
>> also end-up sending some control information over network for these
>> page during migration. This adds to total migration time.
>>
>> The proposed fix, uses the existing bitmap infrastructure to create
>> a virtio-balloon bitmap. The bits in the bitmap represent a guest ram
>> page of size 1UL<< VIRTIO_BALLOON_PFN_SHIFT. The bitmap represents
>> entire guest ram memory till max configured memory. Guest ram pages
>> claimed by the virtio-balloon driver will be represented by 1 in the
>> bitmap. During live migration, each guest ram page (host VA offset)
>> is checked against the virtio-balloon bitmap, if the bit is set the
>> corresponding ram page will be excluded from scanning and sending
>> control information during migration. The bitmap is also migrated to
>> the target as part of every ram_save_iterate loop and after the
>> guest is stopped remaining balloon bitmap is migrated as part of
>> balloon driver save / load interface.
>>
>> With the proposed fix, the average migration time for an idle guest
>> with 1TB maximum memory and 64vCpus
>>   - reduces from ~1200 secs to ~600 sec, with guest memory ballooned
>>     down to 128GB (~10% of 1TB).
>>   - reduces from ~1300 to ~1200 sec (7%), with guest memory ballooned
>>     down to 896GB (~90% of 1TB),
>>   - with no ballooning configured, we don’t expect to see any impact
>>     on total migration time.
>>
>> The optimization gets temporarily disabled, if the balloon operation is
>> in progress. Since the optimization skips scanning and migrating control
>> information for ballooned out pages, we might skip guest ram pages in
>> cases where the guest balloon driver has freed the ram page to the guest
>> but not yet informed the host/qemu about the ram page
>> (VIRTIO_BALLOON_F_MUST_TELL_HOST). In such case with optimization, we
>> might skip migrating ram pages which the guest is using. Since this
>> problem is specific to balloon leak, we can restrict balloon operation in
>> progress check to only balloon leak operation in progress check.
>>
>> The optimization also get permanently disabled (for all subsequent
>> migrations) in case any of the migration uses postcopy capability. In case
>> of postcopy the balloon bitmap would be required to send after vm_stop,
>> which has significant impact on the downtime. Moreover, the applications
>> in the guest space won’t be actually faulting on the ram pages which are
>> already ballooned out, the proposed optimization will not show any
>> improvement in migration time during postcopy.
> I think that you could start the guest without the knowledge of the
> ballooned pages and that would be completely fine as these pages
> will not be touched at all by the guest. They will come into play
> when the host will deflate balloon a bit. In this case QEMU can safely
> give local zeroed page for the guest.
> 
> Thus you could send bitmap of ballooned pages when the guest
> on dest side will be started and in this case this will not influence
> downtime.
> 
> Den

We too were inclined toward the same approach, to have absolutely zero
impact on downtime, but had a concern (below) about how to approach it.

Would it be safe to let the guest start (without the balloon bitmap) on
the destination even when a balloon operation is in progress (especially
deflate) during migration? In that case we would either need to merge
the source and destination balloon bitmaps, which would in turn require
keeping track of offsets updated on the destination side, or block
balloon operations on the destination until the balloon bitmap is
completely merged.

Thanks,
- Jitendra
Jitendra Kolhe March 29, 2016, 12:13 p.m. UTC | #8
On 3/28/2016 7:41 PM, Eric Blake wrote:
> On 03/27/2016 10:16 PM, Jitendra Kolhe wrote:
>> While measuring live migration performance for qemu/kvm guest, it
>> was observed that the qemu doesn’t maintain any intelligence for the
>> guest ram pages which are released by the guest balloon driver and
>> treat such pages as any other normal guest ram pages. This has direct
>> impact on overall migration time for the guest which has released
>> (ballooned out) memory to the host.
>>
>> In case of large systems, where we can configure large guests with 1TB
>> and with considerable amount of memory release by balloon driver to the,
>> host the migration time gets worse.
> 
> s/the, host/the host,/
> 
>>
>> The optimization gets temporarily disabled, if the balloon operation is
> 
> s/disabled,/disabled/
> 
>> in progress. Since the optimization skips scanning and migrating control
>> information for ballooned out pages, we might skip guest ram pages in
>> cases where the guest balloon driver has freed the ram page to the guest
>> but not yet informed the host/qemu about the ram page
>> (VIRTIO_BALLOON_F_MUST_TELL_HOST). In such case with optimization, we
>> might skip migrating ram pages which the guest is using. Since this
>> problem is specific to balloon leak, we can restrict balloon operation in
>> progress check to only balloon leak operation in progress check.
>>
>> The optimization also get permanently disabled (for all subsequent
> 
> s/get/gets/
> 
>> migrations) in case any of the migration uses postcopy capability. In case
>> of postcopy the balloon bitmap would be required to send after vm_stop,
>> which has significant impact on the downtime. Moreover, the applications
>> in the guest space won’t be actually faulting on the ram pages which are
>> already ballooned out, the proposed optimization will not show any
>> improvement in migration time during postcopy.
>>
>> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
>> ---
>> Changed in v2:
>>  - Resolved compilation issue for qemu-user binaries in exec.c
>>  - Localize balloon bitmap test to save_zero_page().
>>  - Updated version string for newly added migration capability to 2.7.
>>  - Made minor modifications to patch commit text.
> 
> I'll leave the technical review to others.
> 
>> +++ b/qapi-schema.json
>> @@ -544,11 +544,14 @@
>>  #          been migrated, pulling the remaining pages along as needed. NOTE: If
>>  #          the migration fails during postcopy the VM will fail.  (since 2.6)
>>  #
>> +# @skip-balloon: Skip scanning ram pages released by virtio-balloon driver.
>> +#          (since 2.7)
>> +#
>>  # Since: 1.2
>>  ##
>>  { 'enum': 'MigrationCapability',
>>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>> -           'compress', 'events', 'postcopy-ram'] }
>> +           'compress', 'events', 'postcopy-ram', 'skip-balloon'] }
> 
> Does this flag make sense to always have enabled (in which case we don't
> need it as a flag), or are there cases where we'd explicitly want to
> disable it?
> 

Yes, the flag can be enabled most of the time, except in cases like
migration using postcopy-ram (the two are mutually exclusive) or where
the user is confident the optimization is of no benefit (e.g. little or
no balloon activity has happened on the VM, so the penalty outweighs
the gain).

Thanks,
- Jitendra
Michael S. Tsirkin March 29, 2016, 12:28 p.m. UTC | #9
On Mon, Mar 28, 2016 at 09:46:05AM +0530, Jitendra Kolhe wrote:
> While measuring live migration performance for qemu/kvm guest, it
> was observed that the qemu doesn’t maintain any intelligence for the
> guest ram pages which are released by the guest balloon driver and
> treat such pages as any other normal guest ram pages. This has direct
> impact on overall migration time for the guest which has released
> (ballooned out) memory to the host.
> 
> In case of large systems, where we can configure large guests with 1TB
> and with considerable amount of memory release by balloon driver to the,
> host the migration time gets worse.
> 
> The solution proposed below is local only to qemu (and does not require
> any modification to Linux kernel or any guest driver). We have verified
> the fix for large guests =1TB on HPE Superdome X (which can support up
> to 240 cores and 12TB of memory) and in case where 90% of memory is
> released by balloon driver the migration time for an idle guests reduces
> to ~600 sec's from ~1200 sec’s.
> 
> Detail: During live migration, as part of 1st iteration in ram_save_iterate()
> -> ram_find_and_save_block () will try to migrate ram pages which are
> released by vitrio-balloon driver as part of dynamic memory delete.
> Even though the pages which are returned to the host by virtio-balloon
> driver are zero pages, the migration algorithm will still end up
> scanning the entire page ram_find_and_save_block() -> ram_save_page/
> ram_save_compressed_page -> save_zero_page() -> is_zero_range().  We
> also end-up sending some control information over network for these
> page during migration. This adds to total migration time.
> 
> The proposed fix, uses the existing bitmap infrastructure to create
> a virtio-balloon bitmap. The bits in the bitmap represent a guest ram
> page of size 1UL<< VIRTIO_BALLOON_PFN_SHIFT. The bitmap represents
> entire guest ram memory till max configured memory. Guest ram pages
> claimed by the virtio-balloon driver will be represented by 1 in the
> bitmap. During live migration, each guest ram page (host VA offset)
> is checked against the virtio-balloon bitmap, if the bit is set the
> corresponding ram page will be excluded from scanning and sending
> control information during migration. The bitmap is also migrated to
> the target as part of every ram_save_iterate loop and after the
> guest is stopped remaining balloon bitmap is migrated as part of
> balloon driver save / load interface.

Migrating the bitmap might help a chained migration case
but will slow down the more typical case a bit.
Make it optional?


> 
> With the proposed fix, the average migration time for an idle guest
> with 1TB maximum memory and 64vCpus
>  - reduces from ~1200 secs to ~600 sec, with guest memory ballooned
>    down to 128GB (~10% of 1TB).
>  - reduces from ~1300 to ~1200 sec (7%), with guest memory ballooned
>    down to 896GB (~90% of 1TB),
>  - with no ballooning configured, we don’t expect to see any impact
>    on total migration time.
> 
> The optimization gets temporarily disabled, if the balloon operation is
> in progress. Since the optimization skips scanning and migrating control
> information for ballooned out pages, we might skip guest ram pages in
> cases where the guest balloon driver has freed the ram page to the guest
> but not yet informed the host/qemu about the ram page
> (VIRTIO_BALLOON_F_MUST_TELL_HOST). In such case with optimization, we
> might skip migrating ram pages which the guest is using. Since this
> problem is specific to balloon leak, we can restrict balloon operation in
> progress check to only balloon leak operation in progress check.
> 
> The optimization also get permanently disabled (for all subsequent
> migrations) in case any of the migration uses postcopy capability. In case
> of postcopy the balloon bitmap would be required to send after vm_stop,
> which has significant impact on the downtime. Moreover, the applications
> in the guest space won’t be actually faulting on the ram pages which are
> already ballooned out, the proposed optimization will not show any
> improvement in migration time during postcopy.


I think this optimization can work for post copy with some
modifications.


For post-copy, when the guest takes a page out of the balloon,
it notifies the host.

This only happens when the MUST_TELL_HOST feature has been
negotiated, but most guests do negotiate it nowadays.
(It also seems that your code assumes MUST_TELL_HOST - I think
you must check for it and disable the bitmap management otherwise.)

At that point we can mark the page as migrated to avoid
requesting it from the source.



> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
> ---
> Changed in v2:
>  - Resolved compilation issue for qemu-user binaries in exec.c
>  - Localize balloon bitmap test to save_zero_page().
>  - Updated version string for newly added migration capability to 2.7.
>  - Made minor modifications to patch commit text.
> 
>  balloon.c                          | 253 ++++++++++++++++++++++++++++++++++++-
>  exec.c                             |   3 +
>  hw/virtio/virtio-balloon.c         |  35 ++++-
>  include/hw/virtio/virtio-balloon.h |   1 +
>  include/migration/migration.h      |   1 +
>  include/sysemu/balloon.h           |  15 ++-
>  migration/migration.c              |   9 ++
>  migration/ram.c                    |  31 ++++-
>  qapi-schema.json                   |   5 +-
>  9 files changed, 341 insertions(+), 12 deletions(-)
> 
> diff --git a/balloon.c b/balloon.c
> index f2ef50c..1c2d228 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -33,11 +33,34 @@
>  #include "qmp-commands.h"
>  #include "qapi/qmp/qerror.h"
>  #include "qapi/qmp/qjson.h"
> +#include "exec/ram_addr.h"
> +#include "migration/migration.h"
> +
> +#define BALLOON_BITMAP_DISABLE_FLAG -1UL
> +
> +typedef enum {
> +    BALLOON_BITMAP_DISABLE_NONE = 1, /* Enabled */
> +    BALLOON_BITMAP_DISABLE_CURRENT,
> +    BALLOON_BITMAP_DISABLE_PERMANENT,
> +} BalloonBitmapDisableState;
>  
>  static QEMUBalloonEvent *balloon_event_fn;
>  static QEMUBalloonStatus *balloon_stat_fn;
> +static QEMUBalloonInProgress *balloon_in_progress_fn;
>  static void *balloon_opaque;
>  static bool balloon_inhibited;
> +static unsigned long balloon_bitmap_pages;
> +static unsigned int  balloon_bitmap_pfn_shift;
> +static QemuMutex balloon_bitmap_mutex;
> +static bool balloon_bitmap_xfered;
> +static unsigned long balloon_min_bitmap_offset;
> +static unsigned long balloon_max_bitmap_offset;
> +static BalloonBitmapDisableState balloon_bitmap_disable_state;
> +
> +static struct BitmapRcu {
> +    struct rcu_head rcu;
> +    unsigned long *bmap;
> +} *balloon_bitmap_rcu;
>  
>  bool qemu_balloon_is_inhibited(void)
>  {
> @@ -49,6 +72,21 @@ void qemu_balloon_inhibit(bool state)
>      balloon_inhibited = state;
>  }
>  
> +void qemu_mutex_lock_balloon_bitmap(void)
> +{
> +    qemu_mutex_lock(&balloon_bitmap_mutex);
> +}
> +
> +void qemu_mutex_unlock_balloon_bitmap(void)
> +{
> +    qemu_mutex_unlock(&balloon_bitmap_mutex);
> +}
> +
> +void qemu_balloon_reset_bitmap_data(void)
> +{
> +    balloon_bitmap_xfered = false;
> +}
> +
>  static bool have_balloon(Error **errp)
>  {
>      if (kvm_enabled() && !kvm_has_sync_mmu()) {
> @@ -65,9 +103,12 @@ static bool have_balloon(Error **errp)
>  }
>  
>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> -                             QEMUBalloonStatus *stat_func, void *opaque)
> +                             QEMUBalloonStatus *stat_func,
> +                             QEMUBalloonInProgress *in_progress_func,
> +                             void *opaque, int pfn_shift)
>  {
> -    if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
> +    if (balloon_event_fn || balloon_stat_fn ||
> +        balloon_in_progress_fn || balloon_opaque) {
>          /* We're already registered one balloon handler.  How many can
>           * a guest really have?
>           */
> @@ -75,17 +116,39 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>      }
>      balloon_event_fn = event_func;
>      balloon_stat_fn = stat_func;
> +    balloon_in_progress_fn = in_progress_func;
>      balloon_opaque = opaque;
> +
> +    qemu_mutex_init(&balloon_bitmap_mutex);
> +    balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_NONE;
> +    balloon_bitmap_pfn_shift = pfn_shift;
> +    balloon_bitmap_pages = (last_ram_offset() >> balloon_bitmap_pfn_shift);
> +    balloon_bitmap_rcu = g_new0(struct BitmapRcu, 1);
> +    balloon_bitmap_rcu->bmap = bitmap_new(balloon_bitmap_pages);
> +    bitmap_clear(balloon_bitmap_rcu->bmap, 0, balloon_bitmap_pages);
> +
>      return 0;
>  }
>  
> +static void balloon_bitmap_free(struct BitmapRcu *bmap)
> +{
> +    g_free(bmap->bmap);
> +    g_free(bmap);
> +}
> +
>  void qemu_remove_balloon_handler(void *opaque)
>  {
> +    struct BitmapRcu *bitmap = balloon_bitmap_rcu;
>      if (balloon_opaque != opaque) {
>          return;
>      }
> +    atomic_rcu_set(&balloon_bitmap_rcu, NULL);
> +    if (bitmap) {
> +        call_rcu(bitmap, balloon_bitmap_free, rcu);
> +    }
>      balloon_event_fn = NULL;
>      balloon_stat_fn = NULL;
> +    balloon_in_progress_fn = NULL;
>      balloon_opaque = NULL;
>  }
>  
> @@ -116,3 +179,189 @@ void qmp_balloon(int64_t target, Error **errp)
>      trace_balloon_event(balloon_opaque, target);
>      balloon_event_fn(balloon_opaque, target);
>  }
> +
> +/* Handle Ram hotplug case, only called in case old < new */
> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new)
> +{
> +    struct BitmapRcu *old_bitmap = balloon_bitmap_rcu, *bitmap;
> +    unsigned long old_offset, new_offset;
> +
> +    if (!balloon_bitmap_rcu) {
> +        return -1;
> +    }
> +
> +    old_offset = (old >> balloon_bitmap_pfn_shift);
> +    new_offset = (new >> balloon_bitmap_pfn_shift);
> +
> +    bitmap = g_new(struct BitmapRcu, 1);
> +    bitmap->bmap = bitmap_new(new_offset);
> +
> +    qemu_mutex_lock_balloon_bitmap();
> +    bitmap_clear(bitmap->bmap, 0,
> +                 balloon_bitmap_pages + new_offset - old_offset);
> +    bitmap_copy(bitmap->bmap, old_bitmap->bmap, old_offset);
> +
> +    atomic_rcu_set(&balloon_bitmap_rcu, bitmap);
> +    balloon_bitmap_pages += new_offset - old_offset;
> +    qemu_mutex_unlock_balloon_bitmap();
> +    call_rcu(old_bitmap, balloon_bitmap_free, rcu);
> +
> +    return 0;
> +}
> +
> +/* Should be called with balloon bitmap mutex lock held */
> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate)
> +{
> +    unsigned long *bitmap;
> +    unsigned long offset = 0;
> +
> +    if (!balloon_bitmap_rcu) {
> +        return -1;
> +    }
> +    offset = (addr >> balloon_bitmap_pfn_shift);
> +    if (balloon_bitmap_xfered) {
> +        if (offset < balloon_min_bitmap_offset) {
> +            balloon_min_bitmap_offset = offset;
> +        }
> +        if (offset > balloon_max_bitmap_offset) {
> +            balloon_max_bitmap_offset = offset;
> +        }
> +    }
> +
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    if (deflate == 0) {
> +        set_bit(offset, bitmap);
> +    } else {
> +        clear_bit(offset, bitmap);
> +    }
> +    rcu_read_unlock();
> +    return 0;
> +}
> +
> +void qemu_balloon_bitmap_setup(void)
> +{
> +    if (migrate_postcopy_ram()) {
> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
> +    } else if ((!balloon_bitmap_rcu || !migrate_skip_balloon()) &&
> +               (balloon_bitmap_disable_state !=
> +                BALLOON_BITMAP_DISABLE_PERMANENT)) {
> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_CURRENT;
> +    }
> +}
> +
> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr)
> +{
> +    unsigned long *bitmap;
> +    ram_addr_t base;
> +    unsigned long nr = 0;
> +    int ret = 0;
> +
> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_CURRENT ||
> +        balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
> +        return 0;
> +    }
> +    balloon_in_progress_fn(balloon_opaque, &ret);
> +    if (ret == 1) {
> +        return 0;
> +    }
> +
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    base = rb->offset >> balloon_bitmap_pfn_shift;
> +    nr = base + (addr >> balloon_bitmap_pfn_shift);
> +    if (test_bit(nr, bitmap)) {
> +        ret = 1;
> +    }
> +    rcu_read_unlock();
> +    return ret;
> +}
> +
> +int qemu_balloon_bitmap_save(QEMUFile *f)
> +{
> +    unsigned long *bitmap;
> +    unsigned long offset = 0, next = 0, len = 0;
> +    unsigned long tmpoffset = 0, tmplimit = 0;
> +
> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
> +        qemu_put_be64(f, BALLOON_BITMAP_DISABLE_FLAG);
> +        return 0;
> +    }
> +
> +    qemu_mutex_lock_balloon_bitmap();
> +    if (balloon_bitmap_xfered) {
> +        tmpoffset = balloon_min_bitmap_offset;
> +        tmplimit  = balloon_max_bitmap_offset;
> +    } else {
> +        balloon_bitmap_xfered = true;
> +        tmpoffset = offset;
> +        tmplimit  = balloon_bitmap_pages;
> +    }
> +
> +    balloon_min_bitmap_offset = balloon_bitmap_pages;
> +    balloon_max_bitmap_offset = 0;
> +
> +    qemu_put_be64(f, balloon_bitmap_pages);
> +    qemu_put_be64(f, tmpoffset);
> +    qemu_put_be64(f, tmplimit);
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    while (tmpoffset < tmplimit) {
> +        unsigned long next_set_bit, start_set_bit;
> +        next_set_bit = find_next_bit(bitmap, balloon_bitmap_pages, tmpoffset);
> +        start_set_bit = next_set_bit;
> +        if (next_set_bit == balloon_bitmap_pages) {
> +            len = 0;
> +            next = start_set_bit;
> +            qemu_put_be64(f, next);
> +            qemu_put_be64(f, len);
> +            break;
> +        }
> +        next_set_bit = find_next_zero_bit(bitmap,
> +                                          balloon_bitmap_pages,
> +                                          ++next_set_bit);
> +        len = (next_set_bit - start_set_bit);
> +        next = start_set_bit;
> +        qemu_put_be64(f, next);
> +        qemu_put_be64(f, len);
> +        tmpoffset = next + len;
> +    }
> +    rcu_read_unlock();
> +    qemu_mutex_unlock_balloon_bitmap();
> +    return 0;
> +}
> +
> +int qemu_balloon_bitmap_load(QEMUFile *f)
> +{
> +    unsigned long *bitmap;
> +    unsigned long next = 0, len = 0;
> +    unsigned long tmpoffset = 0, tmplimit = 0;
> +
> +    if (!balloon_bitmap_rcu) {
> +        return -1;
> +    }
> +
> +    qemu_mutex_lock_balloon_bitmap();
> +    balloon_bitmap_pages = qemu_get_be64(f);
> +    if (balloon_bitmap_pages == BALLOON_BITMAP_DISABLE_FLAG) {
> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
> +        qemu_mutex_unlock_balloon_bitmap();
> +        return 0;
> +    }
> +    tmpoffset = qemu_get_be64(f);
> +    tmplimit  = qemu_get_be64(f);
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    while (tmpoffset < tmplimit) {
> +        next = qemu_get_be64(f);
> +        len  = qemu_get_be64(f);
> +        if (len == 0) {
> +            break;
> +        }
> +        bitmap_set(bitmap, next, len);
> +        tmpoffset = next + len;
> +    }
> +    rcu_read_unlock();
> +    qemu_mutex_unlock_balloon_bitmap();
> +    return 0;
> +}
> diff --git a/exec.c b/exec.c
> index f398d21..7a448e5 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -43,6 +43,7 @@
>  #else /* !CONFIG_USER_ONLY */
>  #include "sysemu/xen-mapcache.h"
>  #include "trace.h"
> +#include "sysemu/balloon.h"
>  #endif
>  #include "exec/cpu-all.h"
>  #include "qemu/rcu_queue.h"
> @@ -1610,6 +1611,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>      if (new_ram_size > old_ram_size) {
>          migration_bitmap_extend(old_ram_size, new_ram_size);
>          dirty_memory_extend(old_ram_size, new_ram_size);
> +        qemu_balloon_bitmap_extend(old_ram_size << TARGET_PAGE_BITS,
> +                                   new_ram_size << TARGET_PAGE_BITS);
>      }
>      /* Keep the list sorted from biggest to smallest block.  Unlike QTAILQ,
>       * QLIST (which has an RCU-friendly variant) does not have insertion at
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 22ad25c..9f3a4c8 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -27,6 +27,7 @@
>  #include "qapi/visitor.h"
>  #include "qapi-event.h"
>  #include "trace.h"
> +#include "migration/migration.h"
>  
>  #if defined(__linux__)
>  #include <sys/mman.h>
> @@ -214,11 +215,13 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>      VirtQueueElement *elem;
>      MemoryRegionSection section;
>  
> +    qemu_mutex_lock_balloon_bitmap();
>      for (;;) {
>          size_t offset = 0;
>          uint32_t pfn;
>          elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>          if (!elem) {
> +            qemu_mutex_unlock_balloon_bitmap();
>              return;
>          }
>  
> @@ -242,6 +245,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>              addr = section.offset_within_region;
>              balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
>                           !!(vq == s->dvq));
> +            qemu_balloon_bitmap_update(addr, !!(vq == s->dvq));
>              memory_region_unref(section.mr);
>          }
>  
> @@ -249,6 +253,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>          virtio_notify(vdev, vq);
>          g_free(elem);
>      }
> +    qemu_mutex_unlock_balloon_bitmap();
>  }
>  
>  static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
> @@ -303,6 +308,16 @@ out:
>      }
>  }
>  
> +static void virtio_balloon_migration_state_changed(Notifier *notifier,
> +                                                   void *data)
> +{
> +    MigrationState *mig = data;
> +
> +    if (migration_has_failed(mig)) {
> +        qemu_balloon_reset_bitmap_data();
> +    }
> +}
> +
>  static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
>  {
>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
> @@ -382,6 +397,16 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
>                                               VIRTIO_BALLOON_PFN_SHIFT);
>  }
>  
> +static void virtio_balloon_in_progress(void *opaque, int *status)
> +{
> +    VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
> +    if (cpu_to_le32(dev->actual) != cpu_to_le32(dev->num_pages)) {
> +        *status = 1;
> +        return;
> +    }
> +    *status = 0;
> +}
> +
>  static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
>  {
>      VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
> @@ -409,6 +434,7 @@ static void virtio_balloon_save_device(VirtIODevice *vdev, QEMUFile *f)
>  
>      qemu_put_be32(f, s->num_pages);
>      qemu_put_be32(f, s->actual);
> +    qemu_balloon_bitmap_save(f);
>  }
>  
>  static int virtio_balloon_load(QEMUFile *f, void *opaque, int version_id)
> @@ -426,6 +452,7 @@ static int virtio_balloon_load_device(VirtIODevice *vdev, QEMUFile *f,
>  
>      s->num_pages = qemu_get_be32(f);
>      s->actual = qemu_get_be32(f);
> +    qemu_balloon_bitmap_load(f);
>      return 0;
>  }
>  
> @@ -439,7 +466,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>                  sizeof(struct virtio_balloon_config));
>  
>      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
> -                                   virtio_balloon_stat, s);
> +                                   virtio_balloon_stat,
> +                                   virtio_balloon_in_progress, s,
> +                                   VIRTIO_BALLOON_PFN_SHIFT);
>  
>      if (ret < 0) {
>          error_setg(errp, "Only one balloon device is supported");
> @@ -453,6 +482,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>  
>      reset_stats(s);
>  
> +    s->migration_state_notifier.notify = virtio_balloon_migration_state_changed;
> +    add_migration_state_change_notifier(&s->migration_state_notifier);
> +
>      register_savevm(dev, "virtio-balloon", -1, 1,
>                      virtio_balloon_save, virtio_balloon_load, s);
>  }
> @@ -462,6 +494,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
>  
> +    remove_migration_state_change_notifier(&s->migration_state_notifier);
>      balloon_stats_destroy_timer(s);
>      qemu_remove_balloon_handler(s);
>      unregister_savevm(dev, "virtio-balloon", s);
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 35f62ac..1ded5a9 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -43,6 +43,7 @@ typedef struct VirtIOBalloon {
>      int64_t stats_last_update;
>      int64_t stats_poll_interval;
>      uint32_t host_features;
> +    Notifier migration_state_notifier;
>  } VirtIOBalloon;
>  
>  #endif
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index ac2c12c..6c1d1af 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -267,6 +267,7 @@ void migrate_del_blocker(Error *reason);
>  
>  bool migrate_postcopy_ram(void);
>  bool migrate_zero_blocks(void);
> +bool migrate_skip_balloon(void);
>  
>  bool migrate_auto_converge(void);
>  
> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
> index 3f976b4..5325c38 100644
> --- a/include/sysemu/balloon.h
> +++ b/include/sysemu/balloon.h
> @@ -15,14 +15,27 @@
>  #define _QEMU_BALLOON_H
>  
>  #include "qapi-types.h"
> +#include "migration/qemu-file.h"
>  
>  typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
>  typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
> +typedef void (QEMUBalloonInProgress) (void *opaque, int *status);
>  
>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> -			     QEMUBalloonStatus *stat_func, void *opaque);
> +                             QEMUBalloonStatus *stat_func,
> +                             QEMUBalloonInProgress *progress_func,
> +                             void *opaque, int pfn_shift);
>  void qemu_remove_balloon_handler(void *opaque);
>  bool qemu_balloon_is_inhibited(void);
>  void qemu_balloon_inhibit(bool state);
> +void qemu_mutex_lock_balloon_bitmap(void);
> +void qemu_mutex_unlock_balloon_bitmap(void);
> +void qemu_balloon_reset_bitmap_data(void);
> +void qemu_balloon_bitmap_setup(void);
> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new);
> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate);
> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr);
> +int qemu_balloon_bitmap_save(QEMUFile *f);
> +int qemu_balloon_bitmap_load(QEMUFile *f);
>  
>  #endif
> diff --git a/migration/migration.c b/migration/migration.c
> index 034a918..cb86307 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1200,6 +1200,15 @@ int migrate_use_xbzrle(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_XBZRLE];
>  }
>  
> +bool migrate_skip_balloon(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_SKIP_BALLOON];
> +}
> +
>  int64_t migrate_xbzrle_cache_size(void)
>  {
>      MigrationState *s;
> diff --git a/migration/ram.c b/migration/ram.c
> index 704f6a9..161ab73 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -40,6 +40,7 @@
>  #include "trace.h"
>  #include "exec/ram_addr.h"
>  #include "qemu/rcu_queue.h"
> +#include "sysemu/balloon.h"
>  
>  #ifdef DEBUG_MIGRATION_RAM
>  #define DPRINTF(fmt, ...) \
> @@ -65,6 +66,7 @@ static uint64_t bitmap_sync_count;
>  #define RAM_SAVE_FLAG_XBZRLE   0x40
>  /* 0x80 is reserved in migration.h start with 0x100 next */
>  #define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
> +#define RAM_SAVE_FLAG_BALLOON  0x200
>  
>  static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
>  
> @@ -702,13 +704,17 @@ static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>  {
>      int pages = -1;
>  
> -    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
> -        acct_info.dup_pages++;
> -        *bytes_transferred += save_page_header(f, block,
> +    if (qemu_balloon_bitmap_test(block, offset) != 1) {
> +        if (is_zero_range(p, TARGET_PAGE_SIZE)) {
> +            acct_info.dup_pages++;
> +            *bytes_transferred += save_page_header(f, block,
>                                                 offset | RAM_SAVE_FLAG_COMPRESS);
> -        qemu_put_byte(f, 0);
> -        *bytes_transferred += 1;
> -        pages = 1;
> +            qemu_put_byte(f, 0);
> +            *bytes_transferred += 1;
> +            pages = 1;
> +        }
> +    } else {
> +        pages = 0;
>      }
>  
>      return pages;
> @@ -773,7 +779,7 @@ static int ram_save_page(QEMUFile *f, PageSearchStatus *pss,
>               * page would be stale
>               */
>              xbzrle_cache_zero_page(current_addr);
> -        } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
> +        } else if (pages != 0 && !ram_bulk_stage && migrate_use_xbzrle()) {
>              pages = save_xbzrle_page(f, &p, current_addr, block,
>                                       offset, last_stage, bytes_transferred);
>              if (!last_stage) {
> @@ -1355,6 +1361,9 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
>          }
>  
>          if (found) {
> +            /* skip saving ram host page if the corresponding guest page
> +             * is ballooned out
> +             */
>              pages = ram_save_host_page(ms, f, &pss,
>                                         last_stage, bytes_transferred,
>                                         dirty_ram_abs);
> @@ -1959,6 +1968,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>  
>      rcu_read_unlock();
>  
> +    qemu_balloon_bitmap_setup();
>      ram_control_before_iterate(f, RAM_CONTROL_SETUP);
>      ram_control_after_iterate(f, RAM_CONTROL_SETUP);
>  
> @@ -1984,6 +1994,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>  
>      ram_control_before_iterate(f, RAM_CONTROL_ROUND);
>  
> +    qemu_put_be64(f, RAM_SAVE_FLAG_BALLOON);
> +    qemu_balloon_bitmap_save(f);
> +
>      t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>      i = 0;
>      while ((ret = qemu_file_rate_limit(f)) == 0) {
> @@ -2493,6 +2506,10 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>              }
>              break;
>  
> +        case RAM_SAVE_FLAG_BALLOON:
> +            qemu_balloon_bitmap_load(f);
> +            break;
> +
>          case RAM_SAVE_FLAG_COMPRESS:
>              ch = qemu_get_byte(f);
>              ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 7f8d799..38163ca 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -544,11 +544,14 @@
>  #          been migrated, pulling the remaining pages along as needed. NOTE: If
>  #          the migration fails during postcopy the VM will fail.  (since 2.6)
>  #
> +# @skip-balloon: Skip scanning ram pages released by virtio-balloon driver.
> +#          (since 2.7)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> -           'compress', 'events', 'postcopy-ram'] }
> +           'compress', 'events', 'postcopy-ram', 'skip-balloon'] }
>  
>  ##
>  # @MigrationCapabilityStatus
> -- 
> 1.8.3.1
Dr. David Alan Gilbert March 31, 2016, 4:39 p.m. UTC | #10
* Jitendra Kolhe (jitendra.kolhe@hpe.com) wrote:
> While measuring live migration performance for a qemu/kvm guest, it
> was observed that qemu doesn’t maintain any intelligence for the
> guest ram pages which are released by the guest balloon driver and
> treats such pages as any other normal guest ram pages. This has a
> direct impact on overall migration time for a guest which has released
> (ballooned out) memory to the host.

Hi Jitendra,
  I've read over the patch and I've got a mix of comments; I've not read
it in full detail:

   a) It does need splitting up; it's a bit big to review in one go;
      I suggest you split it into the main code, and separately the bitmap
      save/load.  It might be worth splitting it up even more.

   b) In balloon_bitmap_load, check the next and len fields; since they're read
      over the wire we've got to treat them as hostile; so check they don't
      run over the length of the bitmap.

   c) The bitmap_load/save needs to be tied to the machine type or something
      that means that if you were migrating in a stream from an older qemu
      it wouldn't get upset when it tried to read the extra data you send.
      I'd prefer it to be tied to either the config setting or the new machine
      type (that way backwards migration works as well).

   d) I agree with the other comments that the stuff looking up the ram
      blocks' addressing looks wrong; you use last_ram_offset() to size the bitmap,
      so it makes me think it's the whole of the ram_addr_t; but I think you're
      saying you're not interested in all of it. However, remember that
      the order of ram_addr_t is not stable between two qemu's - even something
      like hotplugging an ethercard in one qemu vs having it on the command line on
      the other can change that order; so anything going over the wire has
      to be block+offset-into-block. Also remember that last_ram_offset() includes
      annoying things like firmware RAM, video memory and all those things;

   e) It should be possible to get it to work for postcopy if you just use
      it as a way to speed up the zero detection but still send the zero page
      messages.
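Point (b), validating the (next, len) pairs read off the wire, could look roughly like this. apply_run() and the byte-per-bit bitmap are hypothetical stand-ins for the real QEMU bitmap ops; the sketch only illustrates the overflow checks a hostile stream makes necessary:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reject any (next, len) run that would fall outside the bitmap.
 * Writing the length check as len > nr_pages - next (with next already
 * known to be < nr_pages) also avoids the next + len wraparound case. */
static int apply_run(uint8_t *bmap, uint64_t nr_pages,
                     uint64_t next, uint64_t len)
{
    if (next >= nr_pages || len > nr_pages - next) {
        return -1;              /* corrupt or hostile stream */
    }
    memset(bmap + next, 1, len);  /* byte-per-bit bitmap_set() stand-in */
    return 0;
}
```

A load loop doing this would bail out (and fail the migration) on the first invalid run instead of scribbling past the bitmap.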

Dave

> 
> In case of large systems, where we can configure large guests with 1TB
> and with a considerable amount of memory released by the balloon driver
> to the host, the migration time gets worse.
> 
> The solution proposed below is local only to qemu (and does not require
> any modification to the Linux kernel or any guest driver). We have verified
> the fix for large guests >= 1TB on HPE Superdome X (which can support up
> to 240 cores and 12TB of memory), and in the case where 90% of memory is
> released by the balloon driver the migration time for an idle guest reduces
> to ~600 secs from ~1200 secs.
> 
> Detail: During live migration, as part of the 1st iteration,
> ram_save_iterate() -> ram_find_and_save_block() will try to migrate ram
> pages which are released by the virtio-balloon driver as part of dynamic
> memory delete. Even though the pages which are returned to the host by
> the virtio-balloon driver are zero pages, the migration algorithm will
> still end up scanning the entire page: ram_find_and_save_block() ->
> ram_save_page/ram_save_compressed_page -> save_zero_page() ->
> is_zero_range(). We also end up sending some control information over
> the network for these pages during migration. This adds to the total
> migration time.
> 
> The proposed fix uses the existing bitmap infrastructure to create
> a virtio-balloon bitmap. The bits in the bitmap represent a guest ram
> page of size 1UL << VIRTIO_BALLOON_PFN_SHIFT. The bitmap covers the
> entire guest ram up to the maximum configured memory. Guest ram pages
> claimed by the virtio-balloon driver will be represented by 1 in the
> bitmap. During live migration, each guest ram page (host VA offset)
> is checked against the virtio-balloon bitmap; if the bit is set, the
> corresponding ram page will be excluded from scanning and from sending
> control information during migration. The bitmap is also migrated to
> the target as part of every ram_save_iterate loop, and after the
> guest is stopped the remaining balloon bitmap is migrated as part of
> the balloon driver save / load interface.
> 
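As a rough model of the run encoding described above (runs of set bits sent as (next, len) pairs, with a zero-length pair as terminator), here is a minimal sketch; a byte-per-bit bitmap and an output array stand in for the real find_next_bit()/find_next_zero_bit() bitmap ops and the QEMUFile stream:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For each run of set bits emit (start, len); when no set bit remains
 * before the limit, emit a zero-length pair, mirroring the len == 0
 * terminator the load side stops on.  Returns the number of u64 values
 * written to 'out'. */
static size_t encode_runs(const uint8_t *bmap, size_t nbits, uint64_t *out)
{
    size_t n = 0, i = 0;
    while (i < nbits) {
        while (i < nbits && !bmap[i]) {   /* find_next_bit() */
            i++;
        }
        size_t start = i;
        while (i < nbits && bmap[i]) {    /* find_next_zero_bit() */
            i++;
        }
        out[n++] = start;
        out[n++] = i - start;             /* 0 only for the terminator */
        if (i == start) {                 /* no set bit left */
            break;
        }
    }
    return n;
}
```

Note that, as in the patch, a run reaching the limit ends the stream without a terminator, since the receiver's tmpoffset then reaches tmplimit on its own.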
> With the proposed fix, the average migration time for an idle guest
> with 1TB maximum memory and 64 vCPUs:
>  - reduces from ~1200 secs to ~600 secs, with guest memory ballooned
>    down to 128GB (~10% of 1TB).
>  - reduces from ~1300 secs to ~1200 secs (7%), with guest memory ballooned
>    down to 896GB (~90% of 1TB).
>  - with no ballooning configured, we don’t expect to see any impact
>    on total migration time.
> 
> The optimization gets temporarily disabled if a balloon operation is
> in progress. Since the optimization skips scanning and migrating control
> information for ballooned-out pages, we might skip guest ram pages in
> cases where the guest balloon driver has freed the ram page back to the
> guest but not yet informed the host/qemu about it
> (VIRTIO_BALLOON_F_MUST_TELL_HOST). In such a case, with the optimization
> we might skip migrating ram pages which the guest is using. Since this
> problem is specific to balloon deflate (leak), the in-progress check can
> be restricted to deflate operations only.
> 
> The optimization also gets permanently disabled (for all subsequent
> migrations) in case any of the migrations uses the postcopy capability.
> In case of postcopy the balloon bitmap would need to be sent after
> vm_stop, which has a significant impact on the downtime. Moreover, since
> the applications in the guest won’t actually be faulting on the ram
> pages which are already ballooned out, the proposed optimization will
> not show any improvement in migration time during postcopy.
> 
> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
> ---
> Changed in v2:
>  - Resolved compilation issue for qemu-user binaries in exec.c
>  - Localize balloon bitmap test to save_zero_page().
>  - Updated version string for newly added migration capability to 2.7.
>  - Made minor modifications to patch commit text.
> 
>  balloon.c                          | 253 ++++++++++++++++++++++++++++++++++++-
>  exec.c                             |   3 +
>  hw/virtio/virtio-balloon.c         |  35 ++++-
>  include/hw/virtio/virtio-balloon.h |   1 +
>  include/migration/migration.h      |   1 +
>  include/sysemu/balloon.h           |  15 ++-
>  migration/migration.c              |   9 ++
>  migration/ram.c                    |  31 ++++-
>  qapi-schema.json                   |   5 +-
>  9 files changed, 341 insertions(+), 12 deletions(-)
> 
> diff --git a/balloon.c b/balloon.c
> index f2ef50c..1c2d228 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -33,11 +33,34 @@
>  #include "qmp-commands.h"
>  #include "qapi/qmp/qerror.h"
>  #include "qapi/qmp/qjson.h"
> +#include "exec/ram_addr.h"
> +#include "migration/migration.h"
> +
> +#define BALLOON_BITMAP_DISABLE_FLAG -1UL
> +
> +typedef enum {
> +    BALLOON_BITMAP_DISABLE_NONE = 1, /* Enabled */
> +    BALLOON_BITMAP_DISABLE_CURRENT,
> +    BALLOON_BITMAP_DISABLE_PERMANENT,
> +} BalloonBitmapDisableState;
>  
>  static QEMUBalloonEvent *balloon_event_fn;
>  static QEMUBalloonStatus *balloon_stat_fn;
> +static QEMUBalloonInProgress *balloon_in_progress_fn;
>  static void *balloon_opaque;
>  static bool balloon_inhibited;
> +static unsigned long balloon_bitmap_pages;
> +static unsigned int  balloon_bitmap_pfn_shift;
> +static QemuMutex balloon_bitmap_mutex;
> +static bool balloon_bitmap_xfered;
> +static unsigned long balloon_min_bitmap_offset;
> +static unsigned long balloon_max_bitmap_offset;
> +static BalloonBitmapDisableState balloon_bitmap_disable_state;
> +
> +static struct BitmapRcu {
> +    struct rcu_head rcu;
> +    unsigned long *bmap;
> +} *balloon_bitmap_rcu;
>  
>  bool qemu_balloon_is_inhibited(void)
>  {
> @@ -49,6 +72,21 @@ void qemu_balloon_inhibit(bool state)
>      balloon_inhibited = state;
>  }
>  
> +void qemu_mutex_lock_balloon_bitmap(void)
> +{
> +    qemu_mutex_lock(&balloon_bitmap_mutex);
> +}
> +
> +void qemu_mutex_unlock_balloon_bitmap(void)
> +{
> +    qemu_mutex_unlock(&balloon_bitmap_mutex);
> +}
> +
> +void qemu_balloon_reset_bitmap_data(void)
> +{
> +    balloon_bitmap_xfered = false;
> +}
> +
>  static bool have_balloon(Error **errp)
>  {
>      if (kvm_enabled() && !kvm_has_sync_mmu()) {
> @@ -65,9 +103,12 @@ static bool have_balloon(Error **errp)
>  }
>  
>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> -                             QEMUBalloonStatus *stat_func, void *opaque)
> +                             QEMUBalloonStatus *stat_func,
> +                             QEMUBalloonInProgress *in_progress_func,
> +                             void *opaque, int pfn_shift)
>  {
> -    if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
> +    if (balloon_event_fn || balloon_stat_fn ||
> +        balloon_in_progress_fn || balloon_opaque) {
>          /* We're already registered one balloon handler.  How many can
>           * a guest really have?
>           */
> @@ -75,17 +116,39 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>      }
>      balloon_event_fn = event_func;
>      balloon_stat_fn = stat_func;
> +    balloon_in_progress_fn = in_progress_func;
>      balloon_opaque = opaque;
> +
> +    qemu_mutex_init(&balloon_bitmap_mutex);
> +    balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_NONE;
> +    balloon_bitmap_pfn_shift = pfn_shift;
> +    balloon_bitmap_pages = (last_ram_offset() >> balloon_bitmap_pfn_shift);
> +    balloon_bitmap_rcu = g_new0(struct BitmapRcu, 1);
> +    balloon_bitmap_rcu->bmap = bitmap_new(balloon_bitmap_pages);
> +    bitmap_clear(balloon_bitmap_rcu->bmap, 0, balloon_bitmap_pages);
> +
>      return 0;
>  }
>  
> +static void balloon_bitmap_free(struct BitmapRcu *bmap)
> +{
> +    g_free(bmap->bmap);
> +    g_free(bmap);
> +}
> +
>  void qemu_remove_balloon_handler(void *opaque)
>  {
> +    struct BitmapRcu *bitmap = balloon_bitmap_rcu;
>      if (balloon_opaque != opaque) {
>          return;
>      }
> +    atomic_rcu_set(&balloon_bitmap_rcu, NULL);
> +    if (bitmap) {
> +        call_rcu(bitmap, balloon_bitmap_free, rcu);
> +    }
>      balloon_event_fn = NULL;
>      balloon_stat_fn = NULL;
> +    balloon_in_progress_fn = NULL;
>      balloon_opaque = NULL;
>  }
>  
> @@ -116,3 +179,189 @@ void qmp_balloon(int64_t target, Error **errp)
>      trace_balloon_event(balloon_opaque, target);
>      balloon_event_fn(balloon_opaque, target);
>  }
> +
> +/* Handle Ram hotplug case, only called in case old < new */
> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new)
> +{
> +    struct BitmapRcu *old_bitmap = balloon_bitmap_rcu, *bitmap;
> +    unsigned long old_offset, new_offset;
> +
> +    if (!balloon_bitmap_rcu) {
> +        return -1;
> +    }
> +
> +    old_offset = (old >> balloon_bitmap_pfn_shift);
> +    new_offset = (new >> balloon_bitmap_pfn_shift);
> +
> +    bitmap = g_new(struct BitmapRcu, 1);
> +    bitmap->bmap = bitmap_new(new_offset);
> +
> +    qemu_mutex_lock_balloon_bitmap();
> +    bitmap_clear(bitmap->bmap, 0,
> +                 balloon_bitmap_pages + new_offset - old_offset);
> +    bitmap_copy(bitmap->bmap, old_bitmap->bmap, old_offset);
> +
> +    atomic_rcu_set(&balloon_bitmap_rcu, bitmap);
> +    balloon_bitmap_pages += new_offset - old_offset;
> +    qemu_mutex_unlock_balloon_bitmap();
> +    call_rcu(old_bitmap, balloon_bitmap_free, rcu);
> +
> +    return 0;
> +}
> +
> +/* Should be called with balloon bitmap mutex lock held */
> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate)
> +{
> +    unsigned long *bitmap;
> +    unsigned long offset = 0;
> +
> +    if (!balloon_bitmap_rcu) {
> +        return -1;
> +    }
> +    offset = (addr >> balloon_bitmap_pfn_shift);
> +    if (balloon_bitmap_xfered) {
> +        if (offset < balloon_min_bitmap_offset) {
> +            balloon_min_bitmap_offset = offset;
> +        }
> +        if (offset > balloon_max_bitmap_offset) {
> +            balloon_max_bitmap_offset = offset;
> +        }
> +    }
> +
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    if (deflate == 0) {
> +        set_bit(offset, bitmap);
> +    } else {
> +        clear_bit(offset, bitmap);
> +    }
> +    rcu_read_unlock();
> +    return 0;
> +}
> +
> +void qemu_balloon_bitmap_setup(void)
> +{
> +    if (migrate_postcopy_ram()) {
> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
> +    } else if ((!balloon_bitmap_rcu || !migrate_skip_balloon()) &&
> +               (balloon_bitmap_disable_state !=
> +                BALLOON_BITMAP_DISABLE_PERMANENT)) {
> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_CURRENT;
> +    }
> +}
> +
> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr)
> +{
> +    unsigned long *bitmap;
> +    ram_addr_t base;
> +    unsigned long nr = 0;
> +    int ret = 0;
> +
> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_CURRENT ||
> +        balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
> +        return 0;
> +    }
> +    balloon_in_progress_fn(balloon_opaque, &ret);
> +    if (ret == 1) {
> +        return 0;
> +    }
> +
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    base = rb->offset >> balloon_bitmap_pfn_shift;
> +    nr = base + (addr >> balloon_bitmap_pfn_shift);
> +    if (test_bit(nr, bitmap)) {
> +        ret = 1;
> +    }
> +    rcu_read_unlock();
> +    return ret;
> +}
> +
> +int qemu_balloon_bitmap_save(QEMUFile *f)
> +{
> +    unsigned long *bitmap;
> +    unsigned long offset = 0, next = 0, len = 0;
> +    unsigned long tmpoffset = 0, tmplimit = 0;
> +
> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
> +        qemu_put_be64(f, BALLOON_BITMAP_DISABLE_FLAG);
> +        return 0;
> +    }
> +
> +    qemu_mutex_lock_balloon_bitmap();
> +    if (balloon_bitmap_xfered) {
> +        tmpoffset = balloon_min_bitmap_offset;
> +        tmplimit  = balloon_max_bitmap_offset;
> +    } else {
> +        balloon_bitmap_xfered = true;
> +        tmpoffset = offset;
> +        tmplimit  = balloon_bitmap_pages;
> +    }
> +
> +    balloon_min_bitmap_offset = balloon_bitmap_pages;
> +    balloon_max_bitmap_offset = 0;
> +
> +    qemu_put_be64(f, balloon_bitmap_pages);
> +    qemu_put_be64(f, tmpoffset);
> +    qemu_put_be64(f, tmplimit);
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    while (tmpoffset < tmplimit) {
> +        unsigned long next_set_bit, start_set_bit;
> +        next_set_bit = find_next_bit(bitmap, balloon_bitmap_pages, tmpoffset);
> +        start_set_bit = next_set_bit;
> +        if (next_set_bit == balloon_bitmap_pages) {
> +            len = 0;
> +            next = start_set_bit;
> +            qemu_put_be64(f, next);
> +            qemu_put_be64(f, len);
> +            break;
> +        }
> +        next_set_bit = find_next_zero_bit(bitmap,
> +                                          balloon_bitmap_pages,
> +                                          ++next_set_bit);
> +        len = (next_set_bit - start_set_bit);
> +        next = start_set_bit;
> +        qemu_put_be64(f, next);
> +        qemu_put_be64(f, len);
> +        tmpoffset = next + len;
> +    }
> +    rcu_read_unlock();
> +    qemu_mutex_unlock_balloon_bitmap();
> +    return 0;
> +}
> +
> +int qemu_balloon_bitmap_load(QEMUFile *f)
> +{
> +    unsigned long *bitmap;
> +    unsigned long next = 0, len = 0;
> +    unsigned long tmpoffset = 0, tmplimit = 0;
> +
> +    if (!balloon_bitmap_rcu) {
> +        return -1;
> +    }
> +
> +    qemu_mutex_lock_balloon_bitmap();
> +    balloon_bitmap_pages = qemu_get_be64(f);
> +    if (balloon_bitmap_pages == BALLOON_BITMAP_DISABLE_FLAG) {
> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
> +        qemu_mutex_unlock_balloon_bitmap();
> +        return 0;
> +    }
> +    tmpoffset = qemu_get_be64(f);
> +    tmplimit  = qemu_get_be64(f);
> +    rcu_read_lock();
> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> +    while (tmpoffset < tmplimit) {
> +        next = qemu_get_be64(f);
> +        len  = qemu_get_be64(f);
> +        if (len == 0) {
> +            break;
> +        }
> +        bitmap_set(bitmap, next, len);
> +        tmpoffset = next + len;
> +    }
> +    rcu_read_unlock();
> +    qemu_mutex_unlock_balloon_bitmap();
> +    return 0;
> +}
> diff --git a/exec.c b/exec.c
> index f398d21..7a448e5 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -43,6 +43,7 @@
>  #else /* !CONFIG_USER_ONLY */
>  #include "sysemu/xen-mapcache.h"
>  #include "trace.h"
> +#include "sysemu/balloon.h"
>  #endif
>  #include "exec/cpu-all.h"
>  #include "qemu/rcu_queue.h"
> @@ -1610,6 +1611,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>      if (new_ram_size > old_ram_size) {
>          migration_bitmap_extend(old_ram_size, new_ram_size);
>          dirty_memory_extend(old_ram_size, new_ram_size);
> +        qemu_balloon_bitmap_extend(old_ram_size << TARGET_PAGE_BITS,
> +                                   new_ram_size << TARGET_PAGE_BITS);
>      }
>      /* Keep the list sorted from biggest to smallest block.  Unlike QTAILQ,
>       * QLIST (which has an RCU-friendly variant) does not have insertion at
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 22ad25c..9f3a4c8 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -27,6 +27,7 @@
>  #include "qapi/visitor.h"
>  #include "qapi-event.h"
>  #include "trace.h"
> +#include "migration/migration.h"
>  
>  #if defined(__linux__)
>  #include <sys/mman.h>
> @@ -214,11 +215,13 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>      VirtQueueElement *elem;
>      MemoryRegionSection section;
>  
> +    qemu_mutex_lock_balloon_bitmap();
>      for (;;) {
>          size_t offset = 0;
>          uint32_t pfn;
>          elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>          if (!elem) {
> +            qemu_mutex_unlock_balloon_bitmap();
>              return;
>          }
>  
> @@ -242,6 +245,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>              addr = section.offset_within_region;
>              balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
>                           !!(vq == s->dvq));
> +            qemu_balloon_bitmap_update(addr, !!(vq == s->dvq));
>              memory_region_unref(section.mr);
>          }
>  
> @@ -249,6 +253,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>          virtio_notify(vdev, vq);
>          g_free(elem);
>      }
> +    qemu_mutex_unlock_balloon_bitmap();
>  }
>  
>  static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
> @@ -303,6 +308,16 @@ out:
>      }
>  }
>  
> +static void virtio_balloon_migration_state_changed(Notifier *notifier,
> +                                                   void *data)
> +{
> +    MigrationState *mig = data;
> +
> +    if (migration_has_failed(mig)) {
> +        qemu_balloon_reset_bitmap_data();
> +    }
> +}
> +
>  static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
>  {
>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
> @@ -382,6 +397,16 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
>                                               VIRTIO_BALLOON_PFN_SHIFT);
>  }
>  
> +static void virtio_balloon_in_progress(void *opaque, int *status)
> +{
> +    VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
> +    if (cpu_to_le32(dev->actual) != cpu_to_le32(dev->num_pages)) {
> +        *status = 1;
> +        return;
> +    }
> +    *status = 0;
> +}
> +
>  static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
>  {
>      VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
> @@ -409,6 +434,7 @@ static void virtio_balloon_save_device(VirtIODevice *vdev, QEMUFile *f)
>  
>      qemu_put_be32(f, s->num_pages);
>      qemu_put_be32(f, s->actual);
> +    qemu_balloon_bitmap_save(f);
>  }
>  
>  static int virtio_balloon_load(QEMUFile *f, void *opaque, int version_id)
> @@ -426,6 +452,7 @@ static int virtio_balloon_load_device(VirtIODevice *vdev, QEMUFile *f,
>  
>      s->num_pages = qemu_get_be32(f);
>      s->actual = qemu_get_be32(f);
> +    qemu_balloon_bitmap_load(f);
>      return 0;
>  }
>  
> @@ -439,7 +466,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>                  sizeof(struct virtio_balloon_config));
>  
>      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
> -                                   virtio_balloon_stat, s);
> +                                   virtio_balloon_stat,
> +                                   virtio_balloon_in_progress, s,
> +                                   VIRTIO_BALLOON_PFN_SHIFT);
>  
>      if (ret < 0) {
>          error_setg(errp, "Only one balloon device is supported");
> @@ -453,6 +482,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>  
>      reset_stats(s);
>  
> +    s->migration_state_notifier.notify = virtio_balloon_migration_state_changed;
> +    add_migration_state_change_notifier(&s->migration_state_notifier);
> +
>      register_savevm(dev, "virtio-balloon", -1, 1,
>                      virtio_balloon_save, virtio_balloon_load, s);
>  }
> @@ -462,6 +494,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
>  
> +    remove_migration_state_change_notifier(&s->migration_state_notifier);
>      balloon_stats_destroy_timer(s);
>      qemu_remove_balloon_handler(s);
>      unregister_savevm(dev, "virtio-balloon", s);
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 35f62ac..1ded5a9 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -43,6 +43,7 @@ typedef struct VirtIOBalloon {
>      int64_t stats_last_update;
>      int64_t stats_poll_interval;
>      uint32_t host_features;
> +    Notifier migration_state_notifier;
>  } VirtIOBalloon;
>  
>  #endif
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index ac2c12c..6c1d1af 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -267,6 +267,7 @@ void migrate_del_blocker(Error *reason);
>  
>  bool migrate_postcopy_ram(void);
>  bool migrate_zero_blocks(void);
> +bool migrate_skip_balloon(void);
>  
>  bool migrate_auto_converge(void);
>  
> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
> index 3f976b4..5325c38 100644
> --- a/include/sysemu/balloon.h
> +++ b/include/sysemu/balloon.h
> @@ -15,14 +15,27 @@
>  #define _QEMU_BALLOON_H
>  
>  #include "qapi-types.h"
> +#include "migration/qemu-file.h"
>  
>  typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
>  typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
> +typedef void (QEMUBalloonInProgress) (void *opaque, int *status);
>  
>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> -			     QEMUBalloonStatus *stat_func, void *opaque);
> +                             QEMUBalloonStatus *stat_func,
> +                             QEMUBalloonInProgress *progress_func,
> +                             void *opaque, int pfn_shift);
>  void qemu_remove_balloon_handler(void *opaque);
>  bool qemu_balloon_is_inhibited(void);
>  void qemu_balloon_inhibit(bool state);
> +void qemu_mutex_lock_balloon_bitmap(void);
> +void qemu_mutex_unlock_balloon_bitmap(void);
> +void qemu_balloon_reset_bitmap_data(void);
> +void qemu_balloon_bitmap_setup(void);
> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new);
> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate);
> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr);
> +int qemu_balloon_bitmap_save(QEMUFile *f);
> +int qemu_balloon_bitmap_load(QEMUFile *f);
>  
>  #endif
> diff --git a/migration/migration.c b/migration/migration.c
> index 034a918..cb86307 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1200,6 +1200,15 @@ int migrate_use_xbzrle(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_XBZRLE];
>  }
>  
> +bool migrate_skip_balloon(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_SKIP_BALLOON];
> +}
> +
>  int64_t migrate_xbzrle_cache_size(void)
>  {
>      MigrationState *s;
> diff --git a/migration/ram.c b/migration/ram.c
> index 704f6a9..161ab73 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -40,6 +40,7 @@
>  #include "trace.h"
>  #include "exec/ram_addr.h"
>  #include "qemu/rcu_queue.h"
> +#include "sysemu/balloon.h"
>  
>  #ifdef DEBUG_MIGRATION_RAM
>  #define DPRINTF(fmt, ...) \
> @@ -65,6 +66,7 @@ static uint64_t bitmap_sync_count;
>  #define RAM_SAVE_FLAG_XBZRLE   0x40
>  /* 0x80 is reserved in migration.h start with 0x100 next */
>  #define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
> +#define RAM_SAVE_FLAG_BALLOON  0x200
>  
>  static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
>  
> @@ -702,13 +704,17 @@ static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>  {
>      int pages = -1;
>  
> -    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
> -        acct_info.dup_pages++;
> -        *bytes_transferred += save_page_header(f, block,
> +    if (qemu_balloon_bitmap_test(block, offset) != 1) {
> +        if (is_zero_range(p, TARGET_PAGE_SIZE)) {
> +            acct_info.dup_pages++;
> +            *bytes_transferred += save_page_header(f, block,
>                                                 offset | RAM_SAVE_FLAG_COMPRESS);
> -        qemu_put_byte(f, 0);
> -        *bytes_transferred += 1;
> -        pages = 1;
> +            qemu_put_byte(f, 0);
> +            *bytes_transferred += 1;
> +            pages = 1;
> +        }
> +    } else {
> +        pages = 0;
>      }
>  
>      return pages;
> @@ -773,7 +779,7 @@ static int ram_save_page(QEMUFile *f, PageSearchStatus *pss,
>               * page would be stale
>               */
>              xbzrle_cache_zero_page(current_addr);
> -        } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
> +        } else if (pages != 0 && !ram_bulk_stage && migrate_use_xbzrle()) {

Is this test the right way around - don't you want to try xbzrle if you've NOT sent
a page?

>              pages = save_xbzrle_page(f, &p, current_addr, block,
>                                       offset, last_stage, bytes_transferred);
>              if (!last_stage) {
> @@ -1355,6 +1361,9 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
>          }
>  
>          if (found) {
> +            /* skip saving ram host page if the corresponding guest page
> +             * is ballooned out
> +             */
>              pages = ram_save_host_page(ms, f, &pss,
>                                         last_stage, bytes_transferred,
>                                         dirty_ram_abs);
> @@ -1959,6 +1968,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>  
>      rcu_read_unlock();
>  
> +    qemu_balloon_bitmap_setup();
>      ram_control_before_iterate(f, RAM_CONTROL_SETUP);
>      ram_control_after_iterate(f, RAM_CONTROL_SETUP);
>  
> @@ -1984,6 +1994,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>  
>      ram_control_before_iterate(f, RAM_CONTROL_ROUND);
>  
> +    qemu_put_be64(f, RAM_SAVE_FLAG_BALLOON);
> +    qemu_balloon_bitmap_save(f);
> +
>      t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>      i = 0;
>      while ((ret = qemu_file_rate_limit(f)) == 0) {
> @@ -2493,6 +2506,10 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>              }
>              break;
>  
> +        case RAM_SAVE_FLAG_BALLOON:
> +            qemu_balloon_bitmap_load(f);
> +            break;
> +
>          case RAM_SAVE_FLAG_COMPRESS:
>              ch = qemu_get_byte(f);
>              ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 7f8d799..38163ca 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -544,11 +544,14 @@
>  #          been migrated, pulling the remaining pages along as needed. NOTE: If
>  #          the migration fails during postcopy the VM will fail.  (since 2.6)
>  #
> +# @skip-balloon: Skip scanning ram pages released by virtio-balloon driver.
> +#          (since 2.7)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> -           'compress', 'events', 'postcopy-ram'] }
> +           'compress', 'events', 'postcopy-ram', 'skip-balloon'] }
>  
>  ##
>  # @MigrationCapabilityStatus
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Jitendra Kolhe April 1, 2016, 11:08 a.m. UTC | #11
On 3/29/2016 5:58 PM, Michael S. Tsirkin wrote:
> On Mon, Mar 28, 2016 at 09:46:05AM +0530, Jitendra Kolhe wrote:
>> While measuring live migration performance for qemu/kvm guest, it
>> was observed that qemu doesn't maintain any intelligence for the
>> guest ram pages which are released by the guest balloon driver and
>> treats such pages as any other normal guest ram pages. This has a direct
>> impact on the overall migration time for a guest which has released
>> (ballooned out) memory to the host.
>>
>> In the case of large systems, where we can configure large guests with
>> 1TB of memory and a considerable amount of memory is released by the
>> balloon driver to the host, the migration time gets worse.
>>
>> The solution proposed below is local only to qemu (and does not require
>> any modification to Linux kernel or any guest driver). We have verified
>> the fix for large guests =1TB on HPE Superdome X (which can support up
>> to 240 cores and 12TB of memory) and in case where 90% of memory is
>> released by balloon driver the migration time for an idle guests reduces
>> to ~600 sec's from ~1200 sec’s.
>>
>> Detail: During live migration, as part of the 1st iteration, ram_save_iterate()
>> -> ram_find_and_save_block() will try to migrate ram pages which are
>> released by the virtio-balloon driver as part of dynamic memory delete.
>> Even though the pages which are returned to the host by virtio-balloon
>> driver are zero pages, the migration algorithm will still end up
>> scanning the entire page ram_find_and_save_block() -> ram_save_page/
>> ram_save_compressed_page -> save_zero_page() -> is_zero_range().  We
>> also end-up sending some control information over network for these
>> page during migration. This adds to total migration time.
>>
>> The proposed fix, uses the existing bitmap infrastructure to create
>> a virtio-balloon bitmap. The bits in the bitmap represent a guest ram
>> page of size 1UL<< VIRTIO_BALLOON_PFN_SHIFT. The bitmap represents
>> entire guest ram memory till max configured memory. Guest ram pages
>> claimed by the virtio-balloon driver will be represented by 1 in the
>> bitmap. During live migration, each guest ram page (host VA offset)
>> is checked against the virtio-balloon bitmap, if the bit is set the
>> corresponding ram page will be excluded from scanning and sending
>> control information during migration. The bitmap is also migrated to
>> the target as part of every ram_save_iterate loop and after the
>> guest is stopped remaining balloon bitmap is migrated as part of
>> balloon driver save / load interface.
> 
> Migrating the bitmap might help a chained migration case
> but will slow down the more typical case a bit.
> Make it optional?
> 
> 

Disabling migration of the balloon bitmap would disable the
optimization permanently for all subsequent migrations.

The current implementation sends offsets and lengths of runs of 1s
in the bitmap. A highly fragmented bitmap may add to the bitmap
migration time, but that would still be small compared to the
total migration time.
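
To illustrate the encoding described above: the save path walks the
bitmap and emits (offset, length) pairs for each run of set bits, the
same idea qemu_balloon_bitmap_save() implements with find_next_bit()/
find_next_zero_bit(). Below is a minimal standalone sketch (not QEMU
code); for clarity it uses a plain array of 0/1 bytes instead of
QEMU's packed bitmap words, and the names are made up for the example:

```c
#include <stddef.h>

/* Encode the set bits of bmap[0..nbits) as (offset, length) run pairs.
 * Returns the number of runs written (at most max_runs). */
static size_t encode_runs(const unsigned char *bmap, size_t nbits,
                          size_t *offsets, size_t *lens, size_t max_runs)
{
    size_t i = 0, nruns = 0;

    while (i < nbits && nruns < max_runs) {
        while (i < nbits && !bmap[i]) {   /* skip the run of zeros */
            i++;
        }
        if (i == nbits) {                 /* no more set bits */
            break;
        }
        size_t start = i;
        while (i < nbits && bmap[i]) {    /* measure the run of ones */
            i++;
        }
        offsets[nruns] = start;           /* analogous to qemu_put_be64(f, next) */
        lens[nruns]    = i - start;       /* analogous to qemu_put_be64(f, len)  */
        nruns++;
    }
    return nruns;
}
```

A mostly empty (or mostly full) bitmap compresses into a handful of
pairs; a pathologically alternating bitmap degenerates to one pair per
set bit, which is the fragmentation cost discussed above.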

Here are some values from my test setup for an idle guest 
with 16GB memory ballooned down to 4GB memory (with no 
ballooning operation in progress during migration).
- Total migration time without proposed patch set - ~7600ms 
- Total migration time with proposed patch set - ~5700ms
  . Adds overhead of testing against bitmap - ~320ms
  . Adds overhead of sending the bitmap - ~1.6ms
- Saves zero page scan + protocol time - ~2348ms  
We may see a higher percentage of time spent migrating the bitmap
if a ballooning operation is in progress during migration.

The patch does introduce an option to enable/disable
testing ram_addr against the balloon bitmap during migration
(since that's the major overhead), but we still end up
migrating the bitmap by default.
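
For reference, the option mentioned above is the migration capability
added in this patch's qapi-schema.json hunk; assuming it lands under
the name 'skip-balloon', it would be toggled like any other migration
capability, e.g. from the HMP monitor on the source (sketch only):

```shell
# Enable the bitmap test for ballooned-out pages, then migrate.
# 'destination:4444' is a placeholder target.
migrate_set_capability skip-balloon on
info migrate_capabilities
migrate -d tcp:destination:4444
```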

>>
>> With the proposed fix, the average migration time for an idle guest
>> with 1TB maximum memory and 64vCpus
>>  - reduces from ~1200 secs to ~600 sec, with guest memory ballooned
>>    down to 128GB (~10% of 1TB).
>>  - reduces from ~1300 to ~1200 sec (7%), with guest memory ballooned
>>    down to 896GB (~90% of 1TB),
>>  - with no ballooning configured, we don’t expect to see any impact
>>    on total migration time.
>>
>> The optimization gets temporarily disabled, if the balloon operation is
>> in progress. Since the optimization skips scanning and migrating control
>> information for ballooned out pages, we might skip guest ram pages in
>> cases where the guest balloon driver has freed the ram page to the guest
>> but not yet informed the host/qemu about the ram page
>> (VIRTIO_BALLOON_F_MUST_TELL_HOST). In such case with optimization, we
>> might skip migrating ram pages which the guest is using. Since this
>> problem is specific to balloon leak, we can restrict balloon operation in
>> progress check to only balloon leak operation in progress check.
>>
>> The optimization also gets permanently disabled (for all subsequent
>> migrations) in case any of the migration uses postcopy capability. In case
>> of postcopy the balloon bitmap would be required to send after vm_stop,
>> which has significant impact on the downtime. Moreover, the applications
>> in the guest space won't actually be faulting on the ram pages which are
>> already ballooned out, so the proposed optimization will not show any
>> improvement in migration time during postcopy.
> 
> 
> I think this optimization can work for post copy with some
> modifications.
> 

So is there any way to migrate the bitmap, without impacting
downtime, before starting the vm on the destination in the post-copy
case? That way we could migrate the bitmap to the destination so
that subsequent migrations which do not use post-copy can use the
optimization. In the pre-copy case, most of the bitmap is migrated
as part of ram_save_iterate(), thus having a minimal impact on downtime.

I am still unclear on how the patch set will benefit the post-copy
case, as the guest on the destination should never fault on ballooned
out pages, so we won't be saving on zero page scan time?

> 
> For post-copy, when guest takes page out of the balloon,
> it notifies the host.
> 

Yes, this will help in keeping the balloon bitmap up to date.

> This only happens when MUST_TELL_HOST feature has been
> negotiated but most guests do negotiate it nowdays.
> (It also seems that your code assumes MUST_TELL_HOST - I think
> you must check and disable the bitmap management otherwise).
> 
> At that point we can mark page as migrated to avoid
> requesting it from source.
> 

qemu no longer seems to add the MUST_TELL_HOST feature
to dev->host_features. Unless qemu sets the feature,
how can it be negotiated with the guest?
virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST)
always returns false to me. Am I missing something here?

Recent Linux guest balloon drivers (>= 3.0) have stopped
checking for the MUST_TELL_HOST feature, but they implement
the functionality as if the feature were set.

To address cases where the guest may start using pages before
the host is aware of them, I am disabling the test against the
bitmap if ballooning is in progress (during migration). (I think
we can restrict the ballooning-in-progress check to the balloon
deflate condition only?) This will make sure that the pages
get migrated.
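
The in-progress check described above mirrors what the patch's
virtio_balloon_in_progress() does: the balloon is considered "in
flight" while the guest-acknowledged page count has not yet reached
the host's target, and in that window the bitmap must not be trusted.
A minimal standalone sketch of that decision logic (struct and
function names here are made up for illustration, not QEMU's):

```c
/* Host-requested target vs. guest-acknowledged count, both in
 * balloon-page units, mirroring dev->num_pages and dev->actual. */
struct balloon_dev {
    unsigned int num_pages;
    unsigned int actual;
};

/* In progress while the guest has not yet caught up to the target. */
static int balloon_in_progress(const struct balloon_dev *dev)
{
    return dev->actual != dev->num_pages;
}

/* Return 1 only when skipping the page is safe: the bitmap marks it
 * ballooned out AND no balloon operation is in flight.  Otherwise the
 * migration path must fall back to scanning the page normally. */
static int may_skip_page(const struct balloon_dev *dev, int bit_set)
{
    if (balloon_in_progress(dev)) {
        return 0;
    }
    return bit_set ? 1 : 0;
}
```

This is conservative: during any balloon activity every page is
migrated as usual, which is why the optimization costs nothing in
correctness and only saves time once the balloon has settled.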

Thanks,
- Jitendra

> 
> 
>> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
>> ---
>> Changed in v2:
>>  - Resolved compilation issue for qemu-user binaries in exec.c
>>  - Localize balloon bitmap test to save_zero_page().
>>  - Updated version string for newly added migration capability to 2.7.
>>  - Made minor modifications to patch commit text.
>>
>>  balloon.c                          | 253 ++++++++++++++++++++++++++++++++++++-
>>  exec.c                             |   3 +
>>  hw/virtio/virtio-balloon.c         |  35 ++++-
>>  include/hw/virtio/virtio-balloon.h |   1 +
>>  include/migration/migration.h      |   1 +
>>  include/sysemu/balloon.h           |  15 ++-
>>  migration/migration.c              |   9 ++
>>  migration/ram.c                    |  31 ++++-
>>  qapi-schema.json                   |   5 +-
>>  9 files changed, 341 insertions(+), 12 deletions(-)
>>
>> diff --git a/balloon.c b/balloon.c
>> index f2ef50c..1c2d228 100644
>> --- a/balloon.c
>> +++ b/balloon.c
>> @@ -33,11 +33,34 @@
>>  #include "qmp-commands.h"
>>  #include "qapi/qmp/qerror.h"
>>  #include "qapi/qmp/qjson.h"
>> +#include "exec/ram_addr.h"
>> +#include "migration/migration.h"
>> +
>> +#define BALLOON_BITMAP_DISABLE_FLAG -1UL
>> +
>> +typedef enum {
>> +    BALLOON_BITMAP_DISABLE_NONE = 1, /* Enabled */
>> +    BALLOON_BITMAP_DISABLE_CURRENT,
>> +    BALLOON_BITMAP_DISABLE_PERMANENT,
>> +} BalloonBitmapDisableState;
>>  
>>  static QEMUBalloonEvent *balloon_event_fn;
>>  static QEMUBalloonStatus *balloon_stat_fn;
>> +static QEMUBalloonInProgress *balloon_in_progress_fn;
>>  static void *balloon_opaque;
>>  static bool balloon_inhibited;
>> +static unsigned long balloon_bitmap_pages;
>> +static unsigned int  balloon_bitmap_pfn_shift;
>> +static QemuMutex balloon_bitmap_mutex;
>> +static bool balloon_bitmap_xfered;
>> +static unsigned long balloon_min_bitmap_offset;
>> +static unsigned long balloon_max_bitmap_offset;
>> +static BalloonBitmapDisableState balloon_bitmap_disable_state;
>> +
>> +static struct BitmapRcu {
>> +    struct rcu_head rcu;
>> +    unsigned long *bmap;
>> +} *balloon_bitmap_rcu;
>>  
>>  bool qemu_balloon_is_inhibited(void)
>>  {
>> @@ -49,6 +72,21 @@ void qemu_balloon_inhibit(bool state)
>>      balloon_inhibited = state;
>>  }
>>  
>> +void qemu_mutex_lock_balloon_bitmap(void)
>> +{
>> +    qemu_mutex_lock(&balloon_bitmap_mutex);
>> +}
>> +
>> +void qemu_mutex_unlock_balloon_bitmap(void)
>> +{
>> +    qemu_mutex_unlock(&balloon_bitmap_mutex);
>> +}
>> +
>> +void qemu_balloon_reset_bitmap_data(void)
>> +{
>> +    balloon_bitmap_xfered = false;
>> +}
>> +
>>  static bool have_balloon(Error **errp)
>>  {
>>      if (kvm_enabled() && !kvm_has_sync_mmu()) {
>> @@ -65,9 +103,12 @@ static bool have_balloon(Error **errp)
>>  }
>>  
>>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>> -                             QEMUBalloonStatus *stat_func, void *opaque)
>> +                             QEMUBalloonStatus *stat_func,
>> +                             QEMUBalloonInProgress *in_progress_func,
>> +                             void *opaque, int pfn_shift)
>>  {
>> -    if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
>> +    if (balloon_event_fn || balloon_stat_fn ||
>> +        balloon_in_progress_fn || balloon_opaque) {
>>          /* We're already registered one balloon handler.  How many can
>>           * a guest really have?
>>           */
>> @@ -75,17 +116,39 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>>      }
>>      balloon_event_fn = event_func;
>>      balloon_stat_fn = stat_func;
>> +    balloon_in_progress_fn = in_progress_func;
>>      balloon_opaque = opaque;
>> +
>> +    qemu_mutex_init(&balloon_bitmap_mutex);
>> +    balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_NONE;
>> +    balloon_bitmap_pfn_shift = pfn_shift;
>> +    balloon_bitmap_pages = (last_ram_offset() >> balloon_bitmap_pfn_shift);
>> +    balloon_bitmap_rcu = g_new0(struct BitmapRcu, 1);
>> +    balloon_bitmap_rcu->bmap = bitmap_new(balloon_bitmap_pages);
>> +    bitmap_clear(balloon_bitmap_rcu->bmap, 0, balloon_bitmap_pages);
>> +
>>      return 0;
>>  }
>>  
>> +static void balloon_bitmap_free(struct BitmapRcu *bmap)
>> +{
>> +    g_free(bmap->bmap);
>> +    g_free(bmap);
>> +}
>> +
>>  void qemu_remove_balloon_handler(void *opaque)
>>  {
>> +    struct BitmapRcu *bitmap = balloon_bitmap_rcu;
>>      if (balloon_opaque != opaque) {
>>          return;
>>      }
>> +    atomic_rcu_set(&balloon_bitmap_rcu, NULL);
>> +    if (bitmap) {
>> +        call_rcu(bitmap, balloon_bitmap_free, rcu);
>> +    }
>>      balloon_event_fn = NULL;
>>      balloon_stat_fn = NULL;
>> +    balloon_in_progress_fn = NULL;
>>      balloon_opaque = NULL;
>>  }
>>  
>> @@ -116,3 +179,189 @@ void qmp_balloon(int64_t target, Error **errp)
>>      trace_balloon_event(balloon_opaque, target);
>>      balloon_event_fn(balloon_opaque, target);
>>  }
>> +
>> +/* Handle Ram hotplug case, only called in case old < new */
>> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new)
>> +{
>> +    struct BitmapRcu *old_bitmap = balloon_bitmap_rcu, *bitmap;
>> +    unsigned long old_offset, new_offset;
>> +
>> +    if (!balloon_bitmap_rcu) {
>> +        return -1;
>> +    }
>> +
>> +    old_offset = (old >> balloon_bitmap_pfn_shift);
>> +    new_offset = (new >> balloon_bitmap_pfn_shift);
>> +
>> +    bitmap = g_new(struct BitmapRcu, 1);
>> +    bitmap->bmap = bitmap_new(new_offset);
>> +
>> +    qemu_mutex_lock_balloon_bitmap();
>> +    bitmap_clear(bitmap->bmap, 0,
>> +                 balloon_bitmap_pages + new_offset - old_offset);
>> +    bitmap_copy(bitmap->bmap, old_bitmap->bmap, old_offset);
>> +
>> +    atomic_rcu_set(&balloon_bitmap_rcu, bitmap);
>> +    balloon_bitmap_pages += new_offset - old_offset;
>> +    qemu_mutex_unlock_balloon_bitmap();
>> +    call_rcu(old_bitmap, balloon_bitmap_free, rcu);
>> +
>> +    return 0;
>> +}
>> +
>> +/* Should be called with balloon bitmap mutex lock held */
>> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate)
>> +{
>> +    unsigned long *bitmap;
>> +    unsigned long offset = 0;
>> +
>> +    if (!balloon_bitmap_rcu) {
>> +        return -1;
>> +    }
>> +    offset = (addr >> balloon_bitmap_pfn_shift);
>> +    if (balloon_bitmap_xfered) {
>> +        if (offset < balloon_min_bitmap_offset) {
>> +            balloon_min_bitmap_offset = offset;
>> +        }
>> +        if (offset > balloon_max_bitmap_offset) {
>> +            balloon_max_bitmap_offset = offset;
>> +        }
>> +    }
>> +
>> +    rcu_read_lock();
>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>> +    if (deflate == 0) {
>> +        set_bit(offset, bitmap);
>> +    } else {
>> +        clear_bit(offset, bitmap);
>> +    }
>> +    rcu_read_unlock();
>> +    return 0;
>> +}
>> +
>> +void qemu_balloon_bitmap_setup(void)
>> +{
>> +    if (migrate_postcopy_ram()) {
>> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
>> +    } else if ((!balloon_bitmap_rcu || !migrate_skip_balloon()) &&
>> +               (balloon_bitmap_disable_state !=
>> +                BALLOON_BITMAP_DISABLE_PERMANENT)) {
>> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_CURRENT;
>> +    }
>> +}
>> +
>> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr)
>> +{
>> +    unsigned long *bitmap;
>> +    ram_addr_t base;
>> +    unsigned long nr = 0;
>> +    int ret = 0;
>> +
>> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_CURRENT ||
>> +        balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
>> +        return 0;
>> +    }
>> +    balloon_in_progress_fn(balloon_opaque, &ret);
>> +    if (ret == 1) {
>> +        return 0;
>> +    }
>> +
>> +    rcu_read_lock();
>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>> +    base = rb->offset >> balloon_bitmap_pfn_shift;
>> +    nr = base + (addr >> balloon_bitmap_pfn_shift);
>> +    if (test_bit(nr, bitmap)) {
>> +        ret = 1;
>> +    }
>> +    rcu_read_unlock();
>> +    return ret;
>> +}
>> +
>> +int qemu_balloon_bitmap_save(QEMUFile *f)
>> +{
>> +    unsigned long *bitmap;
>> +    unsigned long offset = 0, next = 0, len = 0;
>> +    unsigned long tmpoffset = 0, tmplimit = 0;
>> +
>> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
>> +        qemu_put_be64(f, BALLOON_BITMAP_DISABLE_FLAG);
>> +        return 0;
>> +    }
>> +
>> +    qemu_mutex_lock_balloon_bitmap();
>> +    if (balloon_bitmap_xfered) {
>> +        tmpoffset = balloon_min_bitmap_offset;
>> +        tmplimit  = balloon_max_bitmap_offset;
>> +    } else {
>> +        balloon_bitmap_xfered = true;
>> +        tmpoffset = offset;
>> +        tmplimit  = balloon_bitmap_pages;
>> +    }
>> +
>> +    balloon_min_bitmap_offset = balloon_bitmap_pages;
>> +    balloon_max_bitmap_offset = 0;
>> +
>> +    qemu_put_be64(f, balloon_bitmap_pages);
>> +    qemu_put_be64(f, tmpoffset);
>> +    qemu_put_be64(f, tmplimit);
>> +    rcu_read_lock();
>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>> +    while (tmpoffset < tmplimit) {
>> +        unsigned long next_set_bit, start_set_bit;
>> +        next_set_bit = find_next_bit(bitmap, balloon_bitmap_pages, tmpoffset);
>> +        start_set_bit = next_set_bit;
>> +        if (next_set_bit == balloon_bitmap_pages) {
>> +            len = 0;
>> +            next = start_set_bit;
>> +            qemu_put_be64(f, next);
>> +            qemu_put_be64(f, len);
>> +            break;
>> +        }
>> +        next_set_bit = find_next_zero_bit(bitmap,
>> +                                          balloon_bitmap_pages,
>> +                                          ++next_set_bit);
>> +        len = (next_set_bit - start_set_bit);
>> +        next = start_set_bit;
>> +        qemu_put_be64(f, next);
>> +        qemu_put_be64(f, len);
>> +        tmpoffset = next + len;
>> +    }
>> +    rcu_read_unlock();
>> +    qemu_mutex_unlock_balloon_bitmap();
>> +    return 0;
>> +}
>> +
>> +int qemu_balloon_bitmap_load(QEMUFile *f)
>> +{
>> +    unsigned long *bitmap;
>> +    unsigned long next = 0, len = 0;
>> +    unsigned long tmpoffset = 0, tmplimit = 0;
>> +
>> +    if (!balloon_bitmap_rcu) {
>> +        return -1;
>> +    }
>> +
>> +    qemu_mutex_lock_balloon_bitmap();
>> +    balloon_bitmap_pages = qemu_get_be64(f);
>> +    if (balloon_bitmap_pages == BALLOON_BITMAP_DISABLE_FLAG) {
>> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
>> +        qemu_mutex_unlock_balloon_bitmap();
>> +        return 0;
>> +    }
>> +    tmpoffset = qemu_get_be64(f);
>> +    tmplimit  = qemu_get_be64(f);
>> +    rcu_read_lock();
>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>> +    while (tmpoffset < tmplimit) {
>> +        next = qemu_get_be64(f);
>> +        len  = qemu_get_be64(f);
>> +        if (len == 0) {
>> +            break;
>> +        }
>> +        bitmap_set(bitmap, next, len);
>> +        tmpoffset = next + len;
>> +    }
>> +    rcu_read_unlock();
>> +    qemu_mutex_unlock_balloon_bitmap();
>> +    return 0;
>> +}
>> diff --git a/exec.c b/exec.c
>> index f398d21..7a448e5 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -43,6 +43,7 @@
>>  #else /* !CONFIG_USER_ONLY */
>>  #include "sysemu/xen-mapcache.h"
>>  #include "trace.h"
>> +#include "sysemu/balloon.h"
>>  #endif
>>  #include "exec/cpu-all.h"
>>  #include "qemu/rcu_queue.h"
>> @@ -1610,6 +1611,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>      if (new_ram_size > old_ram_size) {
>>          migration_bitmap_extend(old_ram_size, new_ram_size);
>>          dirty_memory_extend(old_ram_size, new_ram_size);
>> +        qemu_balloon_bitmap_extend(old_ram_size << TARGET_PAGE_BITS,
>> +                                   new_ram_size << TARGET_PAGE_BITS);
>>      }
>>      /* Keep the list sorted from biggest to smallest block.  Unlike QTAILQ,
>>       * QLIST (which has an RCU-friendly variant) does not have insertion at
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index 22ad25c..9f3a4c8 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -27,6 +27,7 @@
>>  #include "qapi/visitor.h"
>>  #include "qapi-event.h"
>>  #include "trace.h"
>> +#include "migration/migration.h"
>>  
>>  #if defined(__linux__)
>>  #include <sys/mman.h>
>> @@ -214,11 +215,13 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>      VirtQueueElement *elem;
>>      MemoryRegionSection section;
>>  
>> +    qemu_mutex_lock_balloon_bitmap();
>>      for (;;) {
>>          size_t offset = 0;
>>          uint32_t pfn;
>>          elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>>          if (!elem) {
>> +            qemu_mutex_unlock_balloon_bitmap();
>>              return;
>>          }
>>  
>> @@ -242,6 +245,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>              addr = section.offset_within_region;
>>              balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
>>                           !!(vq == s->dvq));
>> +            qemu_balloon_bitmap_update(addr, !!(vq == s->dvq));
>>              memory_region_unref(section.mr);
>>          }
>>  
>> @@ -249,6 +253,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>          virtio_notify(vdev, vq);
>>          g_free(elem);
>>      }
>> +    qemu_mutex_unlock_balloon_bitmap();
>>  }
>>  
>>  static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
>> @@ -303,6 +308,16 @@ out:
>>      }
>>  }
>>  
>> +static void virtio_balloon_migration_state_changed(Notifier *notifier,
>> +                                                   void *data)
>> +{
>> +    MigrationState *mig = data;
>> +
>> +    if (migration_has_failed(mig)) {
>> +        qemu_balloon_reset_bitmap_data();
>> +    }
>> +}
>> +
>>  static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
>>  {
>>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>> @@ -382,6 +397,16 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
>>                                               VIRTIO_BALLOON_PFN_SHIFT);
>>  }
>>  
>> +static void virtio_balloon_in_progress(void *opaque, int *status)
>> +{
>> +    VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
>> +    if (cpu_to_le32(dev->actual) != cpu_to_le32(dev->num_pages)) {
>> +        *status = 1;
>> +        return;
>> +    }
>> +    *status = 0;
>> +}
>> +
>>  static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
>>  {
>>      VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
>> @@ -409,6 +434,7 @@ static void virtio_balloon_save_device(VirtIODevice *vdev, QEMUFile *f)
>>  
>>      qemu_put_be32(f, s->num_pages);
>>      qemu_put_be32(f, s->actual);
>> +    qemu_balloon_bitmap_save(f);
>>  }
>>  
>>  static int virtio_balloon_load(QEMUFile *f, void *opaque, int version_id)
>> @@ -426,6 +452,7 @@ static int virtio_balloon_load_device(VirtIODevice *vdev, QEMUFile *f,
>>  
>>      s->num_pages = qemu_get_be32(f);
>>      s->actual = qemu_get_be32(f);
>> +    qemu_balloon_bitmap_load(f);
>>      return 0;
>>  }
>>  
>> @@ -439,7 +466,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>                  sizeof(struct virtio_balloon_config));
>>  
>>      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
>> -                                   virtio_balloon_stat, s);
>> +                                   virtio_balloon_stat,
>> +                                   virtio_balloon_in_progress, s,
>> +                                   VIRTIO_BALLOON_PFN_SHIFT);
>>  
>>      if (ret < 0) {
>>          error_setg(errp, "Only one balloon device is supported");
>> @@ -453,6 +482,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>  
>>      reset_stats(s);
>>  
>> +    s->migration_state_notifier.notify = virtio_balloon_migration_state_changed;
>> +    add_migration_state_change_notifier(&s->migration_state_notifier);
>> +
>>      register_savevm(dev, "virtio-balloon", -1, 1,
>>                      virtio_balloon_save, virtio_balloon_load, s);
>>  }
>> @@ -462,6 +494,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
>>  
>> +    remove_migration_state_change_notifier(&s->migration_state_notifier);
>>      balloon_stats_destroy_timer(s);
>>      qemu_remove_balloon_handler(s);
>>      unregister_savevm(dev, "virtio-balloon", s);
>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>> index 35f62ac..1ded5a9 100644
>> --- a/include/hw/virtio/virtio-balloon.h
>> +++ b/include/hw/virtio/virtio-balloon.h
>> @@ -43,6 +43,7 @@ typedef struct VirtIOBalloon {
>>      int64_t stats_last_update;
>>      int64_t stats_poll_interval;
>>      uint32_t host_features;
>> +    Notifier migration_state_notifier;
>>  } VirtIOBalloon;
>>  
>>  #endif
>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>> index ac2c12c..6c1d1af 100644
>> --- a/include/migration/migration.h
>> +++ b/include/migration/migration.h
>> @@ -267,6 +267,7 @@ void migrate_del_blocker(Error *reason);
>>  
>>  bool migrate_postcopy_ram(void);
>>  bool migrate_zero_blocks(void);
>> +bool migrate_skip_balloon(void);
>>  
>>  bool migrate_auto_converge(void);
>>  
>> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
>> index 3f976b4..5325c38 100644
>> --- a/include/sysemu/balloon.h
>> +++ b/include/sysemu/balloon.h
>> @@ -15,14 +15,27 @@
>>  #define _QEMU_BALLOON_H
>>  
>>  #include "qapi-types.h"
>> +#include "migration/qemu-file.h"
>>  
>>  typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
>>  typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
>> +typedef void (QEMUBalloonInProgress) (void *opaque, int *status);
>>  
>>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>> -			     QEMUBalloonStatus *stat_func, void *opaque);
>> +                             QEMUBalloonStatus *stat_func,
>> +                             QEMUBalloonInProgress *progress_func,
>> +                             void *opaque, int pfn_shift);
>>  void qemu_remove_balloon_handler(void *opaque);
>>  bool qemu_balloon_is_inhibited(void);
>>  void qemu_balloon_inhibit(bool state);
>> +void qemu_mutex_lock_balloon_bitmap(void);
>> +void qemu_mutex_unlock_balloon_bitmap(void);
>> +void qemu_balloon_reset_bitmap_data(void);
>> +void qemu_balloon_bitmap_setup(void);
>> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new);
>> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate);
>> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr);
>> +int qemu_balloon_bitmap_save(QEMUFile *f);
>> +int qemu_balloon_bitmap_load(QEMUFile *f);
>>  
>>  #endif
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 034a918..cb86307 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1200,6 +1200,15 @@ int migrate_use_xbzrle(void)
>>      return s->enabled_capabilities[MIGRATION_CAPABILITY_XBZRLE];
>>  }
>>  
>> +bool migrate_skip_balloon(void)
>> +{
>> +    MigrationState *s;
>> +
>> +    s = migrate_get_current();
>> +
>> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_SKIP_BALLOON];
>> +}
>> +
>>  int64_t migrate_xbzrle_cache_size(void)
>>  {
>>      MigrationState *s;
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 704f6a9..161ab73 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -40,6 +40,7 @@
>>  #include "trace.h"
>>  #include "exec/ram_addr.h"
>>  #include "qemu/rcu_queue.h"
>> +#include "sysemu/balloon.h"
>>  
>>  #ifdef DEBUG_MIGRATION_RAM
>>  #define DPRINTF(fmt, ...) \
>> @@ -65,6 +66,7 @@ static uint64_t bitmap_sync_count;
>>  #define RAM_SAVE_FLAG_XBZRLE   0x40
>>  /* 0x80 is reserved in migration.h start with 0x100 next */
>>  #define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
>> +#define RAM_SAVE_FLAG_BALLOON  0x200
>>  
>>  static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
>>  
>> @@ -702,13 +704,17 @@ static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>>  {
>>      int pages = -1;
>>  
>> -    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>> -        acct_info.dup_pages++;
>> -        *bytes_transferred += save_page_header(f, block,
>> +    if (qemu_balloon_bitmap_test(block, offset) != 1) {
>> +        if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>> +            acct_info.dup_pages++;
>> +            *bytes_transferred += save_page_header(f, block,
>>                                                 offset | RAM_SAVE_FLAG_COMPRESS);
>> -        qemu_put_byte(f, 0);
>> -        *bytes_transferred += 1;
>> -        pages = 1;
>> +            qemu_put_byte(f, 0);
>> +            *bytes_transferred += 1;
>> +            pages = 1;
>> +        }
>> +    } else {
>> +        pages = 0;
>>      }
>>  
>>      return pages;
>> @@ -773,7 +779,7 @@ static int ram_save_page(QEMUFile *f, PageSearchStatus *pss,
>>               * page would be stale
>>               */
>>              xbzrle_cache_zero_page(current_addr);
>> -        } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
>> +        } else if (pages != 0 && !ram_bulk_stage && migrate_use_xbzrle()) {
>>              pages = save_xbzrle_page(f, &p, current_addr, block,
>>                                       offset, last_stage, bytes_transferred);
>>              if (!last_stage) {
>> @@ -1355,6 +1361,9 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
>>          }
>>  
>>          if (found) {
>> +            /* skip saving ram host page if the corresponding guest page
>> +             * is ballooned out
>> +             */
>>              pages = ram_save_host_page(ms, f, &pss,
>>                                         last_stage, bytes_transferred,
>>                                         dirty_ram_abs);
>> @@ -1959,6 +1968,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>  
>>      rcu_read_unlock();
>>  
>> +    qemu_balloon_bitmap_setup();
>>      ram_control_before_iterate(f, RAM_CONTROL_SETUP);
>>      ram_control_after_iterate(f, RAM_CONTROL_SETUP);
>>  
>> @@ -1984,6 +1994,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>  
>>      ram_control_before_iterate(f, RAM_CONTROL_ROUND);
>>  
>> +    qemu_put_be64(f, RAM_SAVE_FLAG_BALLOON);
>> +    qemu_balloon_bitmap_save(f);
>> +
>>      t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>      i = 0;
>>      while ((ret = qemu_file_rate_limit(f)) == 0) {
>> @@ -2493,6 +2506,10 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>>              }
>>              break;
>>  
>> +        case RAM_SAVE_FLAG_BALLOON:
>> +            qemu_balloon_bitmap_load(f);
>> +            break;
>> +
>>          case RAM_SAVE_FLAG_COMPRESS:
>>              ch = qemu_get_byte(f);
>>              ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 7f8d799..38163ca 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -544,11 +544,14 @@
>>  #          been migrated, pulling the remaining pages along as needed. NOTE: If
>>  #          the migration fails during postcopy the VM will fail.  (since 2.6)
>>  #
>> +# @skip-balloon: Skip scanning ram pages released by virtio-balloon driver.
>> +#          (since 2.7)
>> +#
>>  # Since: 1.2
>>  ##
>>  { 'enum': 'MigrationCapability',
>>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>> -           'compress', 'events', 'postcopy-ram'] }
>> +           'compress', 'events', 'postcopy-ram', 'skip-balloon'] }
>>  
>>  ##
>>  # @MigrationCapabilityStatus
>> -- 
>> 1.8.3.1
Jitendra Kolhe April 1, 2016, 11:19 a.m. UTC | #12
On 3/29/2016 4:18 PM, Paolo Bonzini wrote:
> 
> 
> On 29/03/2016 12:47, Jitendra Kolhe wrote:
>>> Indeed.  It is correct for the main system RAM, but hot-plugged RAM
>>> would also have a zero-based section.offset_within_region.  You need to
>>> add memory_region_get_ram_addr(section.mr), just like the call to
>>> balloon_page adds memory_region_get_ram_ptr(section.mr).
>>>
>>> Paolo
>>
>> I am only interested in the offset from memory region base. 
>> Would below guest PA to host offset work, as we do in
>> address_space_translate_internal()?
>> (Guest pa - section.offset_within_address_space + 
>> section.offset_within_region)
> 
> Yes, that would work.  But I'm not sure why you're not interested in the
> ram_addr_t.
> 

You are right, I was wrongly calculating offsets for hot-plugged
RAMBlocks. I will have to use qemu_ram_block_from_host() to get the
ram_addr_t and compute the correct offsets in the bitmap.

> Paolo
>
Jitendra Kolhe April 5, 2016, 10:04 a.m. UTC | #13
On 3/31/2016 10:09 PM, Dr. David Alan Gilbert wrote:
> * Jitendra Kolhe (jitendra.kolhe@hpe.com) wrote:
>> While measuring live migration performance for qemu/kvm guest, it
>> was observed that the qemu doesn’t maintain any intelligence for the
>> guest ram pages which are released by the guest balloon driver and
>> treat such pages as any other normal guest ram pages. This has direct
>> impact on overall migration time for the guest which has released
>> (ballooned out) memory to the host.

Sorry for the delayed response.

> 
> Hi Jitendra,
>   I've read over the patch and I've got a mix of comments; I've not read
> it in full detail:
> 
>    a) It does need splitting up; it's a bit big to review in one go;
>       I suggest you split it into the main code, and separately the bitmap
>       save/load.  It might be worth splitting it up even more.

Sure, will post a patch series in the next version.

> 
>    b) in balloon_bitmap_load check the next and len fields; since it's read
>       over the wire we've got to treat them as hostile; so check they don't
>       run over the length of the bitmap.

Will check the validity of the next and len fields. Would an assert
make sense in case the values are read wrongly from the wire?

> 
>    c) The bitmap_load/save needs to be tied to the machine type or something
>       that means that if you were migraitng in a stream from an older qemu
>       it wouldn't get upset when it tried to read the extra data you read.
>       I prefer if it's tied to either the config setting or the new machine
>       type (that way backwards migration works as well).

Thanks, migration from an older qemu to qemu+patch and back is indeed a
problem. We already have a configuration option, “skip-balloon”, to
disable the optimization (testing against the bitmap during migration),
but bitmap_load/save are independent of this option. Will extend the
configuration option to also cover bitmap_load/save.

> 
>    d) I agree with the other comments tha the stuff looking up the ram blocks
>       addressing looks wrong;  you use last_ram_offset() to size the bitmap,
>       so it makes me think it's the whole of the ram_addr_t; but I think you're
>       saying you're not interested in all of it.   However remember that
>       the order of ram_addr_t is not stable between two qemu's - even something
>       like hotplugging a ethercard in one qemu vs having it on the command line on
>       the other can change that order; so anything going over the wire has
>       to be block+offset-into-block. Also remember that last_ram_offset() includes
>       annoying things like firmware RAM, video memory and all those things;
> 

Yes, we are using last_ram_offset() to size the bitmap, but the
offsetting into the bitmap was done wrongly in the patch (especially
for hot-plugged RAMBlocks). Will change the code to calculate the
ram_addr_t using qemu_ram_block_from_host() rather than just using
section.offset.

>    e) It should be possible to get it to work for postcopy if you just use
>       it as a way to speed up the zero detection but still send the zero page
>       messages.
> 

In case of postcopy, bitmap_save will happen after vm_stop(), which
should have a direct impact on downtime. Would that be fine?
In case of precopy, the major chunk of the bitmap is copied as
part of ram_save_iterate itself.
Zero page detection can be enabled for postcopy. Can't ballooned-out
zero page messages be skipped in postcopy the same way as is done for
precopy, or am I missing something?

> Dave
> 
>>
>> In case of large systems, where we can configure large guests with 1TB
>> and with considerable amount of memory release by balloon driver to the,
>> host the migration time gets worse.
>>
>> The solution proposed below is local only to qemu (and does not require
>> any modification to Linux kernel or any guest driver). We have verified
>> the fix for large guests =1TB on HPE Superdome X (which can support up
>> to 240 cores and 12TB of memory) and in case where 90% of memory is
>> released by balloon driver the migration time for an idle guests reduces
>> to ~600 sec's from ~1200 sec’s.
>>
>> Detail: During live migration, as part of 1st iteration in ram_save_iterate()
>> -> ram_find_and_save_block () will try to migrate ram pages which are
>> released by vitrio-balloon driver as part of dynamic memory delete.
>> Even though the pages which are returned to the host by virtio-balloon
>> driver are zero pages, the migration algorithm will still end up
>> scanning the entire page ram_find_and_save_block() -> ram_save_page/
>> ram_save_compressed_page -> save_zero_page() -> is_zero_range().  We
>> also end-up sending some control information over network for these
>> page during migration. This adds to total migration time.
>>
>> The proposed fix, uses the existing bitmap infrastructure to create
>> a virtio-balloon bitmap. The bits in the bitmap represent a guest ram
>> page of size 1UL<< VIRTIO_BALLOON_PFN_SHIFT. The bitmap represents
>> entire guest ram memory till max configured memory. Guest ram pages
>> claimed by the virtio-balloon driver will be represented by 1 in the
>> bitmap. During live migration, each guest ram page (host VA offset)
>> is checked against the virtio-balloon bitmap, if the bit is set the
>> corresponding ram page will be excluded from scanning and sending
>> control information during migration. The bitmap is also migrated to
>> the target as part of every ram_save_iterate loop and after the
>> guest is stopped remaining balloon bitmap is migrated as part of
>> balloon driver save / load interface.
>>
>> With the proposed fix, the average migration time for an idle guest
>> with 1TB maximum memory and 64vCpus
>>  - reduces from ~1200 secs to ~600 sec, with guest memory ballooned
>>    down to 128GB (~10% of 1TB).
>>  - reduces from ~1300 to ~1200 sec (7%), with guest memory ballooned
>>    down to 896GB (~90% of 1TB),
>>  - with no ballooning configured, we don’t expect to see any impact
>>    on total migration time.
>>
>> The optimization gets temporarily disabled, if the balloon operation is
>> in progress. Since the optimization skips scanning and migrating control
>> information for ballooned out pages, we might skip guest ram pages in
>> cases where the guest balloon driver has freed the ram page to the guest
>> but not yet informed the host/qemu about the ram page
>> (VIRTIO_BALLOON_F_MUST_TELL_HOST). In such case with optimization, we
>> might skip migrating ram pages which the guest is using. Since this
>> problem is specific to balloon leak, we can restrict balloon operation in
>> progress check to only balloon leak operation in progress check.
>>
>> The optimization also get permanently disabled (for all subsequent
>> migrations) in case any of the migration uses postcopy capability. In case
>> of postcopy the balloon bitmap would be required to send after vm_stop,
>> which has significant impact on the downtime. Moreover, the applications
>> in the guest space won’t be actually faulting on the ram pages which are
>> already ballooned out, the proposed optimization will not show any
>> improvement in migration time during postcopy.
>>
>> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
>> ---
>> Changed in v2:
>>  - Resolved compilation issue for qemu-user binaries in exec.c
>>  - Localize balloon bitmap test to save_zero_page().
>>  - Updated version string for newly added migration capability to 2.7.
>>  - Made minor modifications to patch commit text.
>>
>>  balloon.c                          | 253 ++++++++++++++++++++++++++++++++++++-
>>  exec.c                             |   3 +
>>  hw/virtio/virtio-balloon.c         |  35 ++++-
>>  include/hw/virtio/virtio-balloon.h |   1 +
>>  include/migration/migration.h      |   1 +
>>  include/sysemu/balloon.h           |  15 ++-
>>  migration/migration.c              |   9 ++
>>  migration/ram.c                    |  31 ++++-
>>  qapi-schema.json                   |   5 +-
>>  9 files changed, 341 insertions(+), 12 deletions(-)
>>
>> diff --git a/balloon.c b/balloon.c
>> index f2ef50c..1c2d228 100644
>> --- a/balloon.c
>> +++ b/balloon.c
>> @@ -33,11 +33,34 @@
>>  #include "qmp-commands.h"
>>  #include "qapi/qmp/qerror.h"
>>  #include "qapi/qmp/qjson.h"
>> +#include "exec/ram_addr.h"
>> +#include "migration/migration.h"
>> +
>> +#define BALLOON_BITMAP_DISABLE_FLAG -1UL
>> +
>> +typedef enum {
>> +    BALLOON_BITMAP_DISABLE_NONE = 1, /* Enabled */
>> +    BALLOON_BITMAP_DISABLE_CURRENT,
>> +    BALLOON_BITMAP_DISABLE_PERMANENT,
>> +} BalloonBitmapDisableState;
>>  
>>  static QEMUBalloonEvent *balloon_event_fn;
>>  static QEMUBalloonStatus *balloon_stat_fn;
>> +static QEMUBalloonInProgress *balloon_in_progress_fn;
>>  static void *balloon_opaque;
>>  static bool balloon_inhibited;
>> +static unsigned long balloon_bitmap_pages;
>> +static unsigned int  balloon_bitmap_pfn_shift;
>> +static QemuMutex balloon_bitmap_mutex;
>> +static bool balloon_bitmap_xfered;
>> +static unsigned long balloon_min_bitmap_offset;
>> +static unsigned long balloon_max_bitmap_offset;
>> +static BalloonBitmapDisableState balloon_bitmap_disable_state;
>> +
>> +static struct BitmapRcu {
>> +    struct rcu_head rcu;
>> +    unsigned long *bmap;
>> +} *balloon_bitmap_rcu;
>>  
>>  bool qemu_balloon_is_inhibited(void)
>>  {
>> @@ -49,6 +72,21 @@ void qemu_balloon_inhibit(bool state)
>>      balloon_inhibited = state;
>>  }
>>  
>> +void qemu_mutex_lock_balloon_bitmap(void)
>> +{
>> +    qemu_mutex_lock(&balloon_bitmap_mutex);
>> +}
>> +
>> +void qemu_mutex_unlock_balloon_bitmap(void)
>> +{
>> +    qemu_mutex_unlock(&balloon_bitmap_mutex);
>> +}
>> +
>> +void qemu_balloon_reset_bitmap_data(void)
>> +{
>> +    balloon_bitmap_xfered = false;
>> +}
>> +
>>  static bool have_balloon(Error **errp)
>>  {
>>      if (kvm_enabled() && !kvm_has_sync_mmu()) {
>> @@ -65,9 +103,12 @@ static bool have_balloon(Error **errp)
>>  }
>>  
>>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>> -                             QEMUBalloonStatus *stat_func, void *opaque)
>> +                             QEMUBalloonStatus *stat_func,
>> +                             QEMUBalloonInProgress *in_progress_func,
>> +                             void *opaque, int pfn_shift)
>>  {
>> -    if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
>> +    if (balloon_event_fn || balloon_stat_fn ||
>> +        balloon_in_progress_fn || balloon_opaque) {
>>          /* We're already registered one balloon handler.  How many can
>>           * a guest really have?
>>           */
>> @@ -75,17 +116,39 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>>      }
>>      balloon_event_fn = event_func;
>>      balloon_stat_fn = stat_func;
>> +    balloon_in_progress_fn = in_progress_func;
>>      balloon_opaque = opaque;
>> +
>> +    qemu_mutex_init(&balloon_bitmap_mutex);
>> +    balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_NONE;
>> +    balloon_bitmap_pfn_shift = pfn_shift;
>> +    balloon_bitmap_pages = (last_ram_offset() >> balloon_bitmap_pfn_shift);
>> +    balloon_bitmap_rcu = g_new0(struct BitmapRcu, 1);
>> +    balloon_bitmap_rcu->bmap = bitmap_new(balloon_bitmap_pages);
>> +    bitmap_clear(balloon_bitmap_rcu->bmap, 0, balloon_bitmap_pages);
>> +
>>      return 0;
>>  }
>>  
>> +static void balloon_bitmap_free(struct BitmapRcu *bmap)
>> +{
>> +    g_free(bmap->bmap);
>> +    g_free(bmap);
>> +}
>> +
>>  void qemu_remove_balloon_handler(void *opaque)
>>  {
>> +    struct BitmapRcu *bitmap = balloon_bitmap_rcu;
>>      if (balloon_opaque != opaque) {
>>          return;
>>      }
>> +    atomic_rcu_set(&balloon_bitmap_rcu, NULL);
>> +    if (bitmap) {
>> +        call_rcu(bitmap, balloon_bitmap_free, rcu);
>> +    }
>>      balloon_event_fn = NULL;
>>      balloon_stat_fn = NULL;
>> +    balloon_in_progress_fn = NULL;
>>      balloon_opaque = NULL;
>>  }
>>  
>> @@ -116,3 +179,189 @@ void qmp_balloon(int64_t target, Error **errp)
>>      trace_balloon_event(balloon_opaque, target);
>>      balloon_event_fn(balloon_opaque, target);
>>  }
>> +
>> +/* Handle Ram hotplug case, only called in case old < new */
>> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new)
>> +{
>> +    struct BitmapRcu *old_bitmap = balloon_bitmap_rcu, *bitmap;
>> +    unsigned long old_offset, new_offset;
>> +
>> +    if (!balloon_bitmap_rcu) {
>> +        return -1;
>> +    }
>> +
>> +    old_offset = (old >> balloon_bitmap_pfn_shift);
>> +    new_offset = (new >> balloon_bitmap_pfn_shift);
>> +
>> +    bitmap = g_new(struct BitmapRcu, 1);
>> +    bitmap->bmap = bitmap_new(new_offset);
>> +
>> +    qemu_mutex_lock_balloon_bitmap();
>> +    bitmap_clear(bitmap->bmap, 0,
>> +                 balloon_bitmap_pages + new_offset - old_offset);
>> +    bitmap_copy(bitmap->bmap, old_bitmap->bmap, old_offset);
>> +
>> +    atomic_rcu_set(&balloon_bitmap_rcu, bitmap);
>> +    balloon_bitmap_pages += new_offset - old_offset;
>> +    qemu_mutex_unlock_balloon_bitmap();
>> +    call_rcu(old_bitmap, balloon_bitmap_free, rcu);
>> +
>> +    return 0;
>> +}
>> +
>> +/* Should be called with balloon bitmap mutex lock held */
>> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate)
>> +{
>> +    unsigned long *bitmap;
>> +    unsigned long offset = 0;
>> +
>> +    if (!balloon_bitmap_rcu) {
>> +        return -1;
>> +    }
>> +    offset = (addr >> balloon_bitmap_pfn_shift);
>> +    if (balloon_bitmap_xfered) {
>> +        if (offset < balloon_min_bitmap_offset) {
>> +            balloon_min_bitmap_offset = offset;
>> +        }
>> +        if (offset > balloon_max_bitmap_offset) {
>> +            balloon_max_bitmap_offset = offset;
>> +        }
>> +    }
>> +
>> +    rcu_read_lock();
>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>> +    if (deflate == 0) {
>> +        set_bit(offset, bitmap);
>> +    } else {
>> +        clear_bit(offset, bitmap);
>> +    }
>> +    rcu_read_unlock();
>> +    return 0;
>> +}
>> +
>> +void qemu_balloon_bitmap_setup(void)
>> +{
>> +    if (migrate_postcopy_ram()) {
>> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
>> +    } else if ((!balloon_bitmap_rcu || !migrate_skip_balloon()) &&
>> +               (balloon_bitmap_disable_state !=
>> +                BALLOON_BITMAP_DISABLE_PERMANENT)) {
>> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_CURRENT;
>> +    }
>> +}
>> +
>> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr)
>> +{
>> +    unsigned long *bitmap;
>> +    ram_addr_t base;
>> +    unsigned long nr = 0;
>> +    int ret = 0;
>> +
>> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_CURRENT ||
>> +        balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
>> +        return 0;
>> +    }
>> +    balloon_in_progress_fn(balloon_opaque, &ret);
>> +    if (ret == 1) {
>> +        return 0;
>> +    }
>> +
>> +    rcu_read_lock();
>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>> +    base = rb->offset >> balloon_bitmap_pfn_shift;
>> +    nr = base + (addr >> balloon_bitmap_pfn_shift);
>> +    if (test_bit(nr, bitmap)) {
>> +        ret = 1;
>> +    }
>> +    rcu_read_unlock();
>> +    return ret;
>> +}
>> +
>> +int qemu_balloon_bitmap_save(QEMUFile *f)
>> +{
>> +    unsigned long *bitmap;
>> +    unsigned long offset = 0, next = 0, len = 0;
>> +    unsigned long tmpoffset = 0, tmplimit = 0;
>> +
>> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
>> +        qemu_put_be64(f, BALLOON_BITMAP_DISABLE_FLAG);
>> +        return 0;
>> +    }
>> +
>> +    qemu_mutex_lock_balloon_bitmap();
>> +    if (balloon_bitmap_xfered) {
>> +        tmpoffset = balloon_min_bitmap_offset;
>> +        tmplimit  = balloon_max_bitmap_offset;
>> +    } else {
>> +        balloon_bitmap_xfered = true;
>> +        tmpoffset = offset;
>> +        tmplimit  = balloon_bitmap_pages;
>> +    }
>> +
>> +    balloon_min_bitmap_offset = balloon_bitmap_pages;
>> +    balloon_max_bitmap_offset = 0;
>> +
>> +    qemu_put_be64(f, balloon_bitmap_pages);
>> +    qemu_put_be64(f, tmpoffset);
>> +    qemu_put_be64(f, tmplimit);
>> +    rcu_read_lock();
>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>> +    while (tmpoffset < tmplimit) {
>> +        unsigned long next_set_bit, start_set_bit;
>> +        next_set_bit = find_next_bit(bitmap, balloon_bitmap_pages, tmpoffset);
>> +        start_set_bit = next_set_bit;
>> +        if (next_set_bit == balloon_bitmap_pages) {
>> +            len = 0;
>> +            next = start_set_bit;
>> +            qemu_put_be64(f, next);
>> +            qemu_put_be64(f, len);
>> +            break;
>> +        }
>> +        next_set_bit = find_next_zero_bit(bitmap,
>> +                                          balloon_bitmap_pages,
>> +                                          ++next_set_bit);
>> +        len = (next_set_bit - start_set_bit);
>> +        next = start_set_bit;
>> +        qemu_put_be64(f, next);
>> +        qemu_put_be64(f, len);
>> +        tmpoffset = next + len;
>> +    }
>> +    rcu_read_unlock();
>> +    qemu_mutex_unlock_balloon_bitmap();
>> +    return 0;
>> +}
>> +
>> +int qemu_balloon_bitmap_load(QEMUFile *f)
>> +{
>> +    unsigned long *bitmap;
>> +    unsigned long next = 0, len = 0;
>> +    unsigned long tmpoffset = 0, tmplimit = 0;
>> +
>> +    if (!balloon_bitmap_rcu) {
>> +        return -1;
>> +    }
>> +
>> +    qemu_mutex_lock_balloon_bitmap();
>> +    balloon_bitmap_pages = qemu_get_be64(f);
>> +    if (balloon_bitmap_pages == BALLOON_BITMAP_DISABLE_FLAG) {
>> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
>> +        qemu_mutex_unlock_balloon_bitmap();
>> +        return 0;
>> +    }
>> +    tmpoffset = qemu_get_be64(f);
>> +    tmplimit  = qemu_get_be64(f);
>> +    rcu_read_lock();
>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>> +    while (tmpoffset < tmplimit) {
>> +        next = qemu_get_be64(f);
>> +        len  = qemu_get_be64(f);
>> +        if (len == 0) {
>> +            break;
>> +        }
>> +        bitmap_set(bitmap, next, len);
>> +        tmpoffset = next + len;
>> +    }
>> +    rcu_read_unlock();
>> +    qemu_mutex_unlock_balloon_bitmap();
>> +    return 0;
>> +}
>> diff --git a/exec.c b/exec.c
>> index f398d21..7a448e5 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -43,6 +43,7 @@
>>  #else /* !CONFIG_USER_ONLY */
>>  #include "sysemu/xen-mapcache.h"
>>  #include "trace.h"
>> +#include "sysemu/balloon.h"
>>  #endif
>>  #include "exec/cpu-all.h"
>>  #include "qemu/rcu_queue.h"
>> @@ -1610,6 +1611,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>      if (new_ram_size > old_ram_size) {
>>          migration_bitmap_extend(old_ram_size, new_ram_size);
>>          dirty_memory_extend(old_ram_size, new_ram_size);
>> +        qemu_balloon_bitmap_extend(old_ram_size << TARGET_PAGE_BITS,
>> +                                   new_ram_size << TARGET_PAGE_BITS);
>>      }
>>      /* Keep the list sorted from biggest to smallest block.  Unlike QTAILQ,
>>       * QLIST (which has an RCU-friendly variant) does not have insertion at
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index 22ad25c..9f3a4c8 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -27,6 +27,7 @@
>>  #include "qapi/visitor.h"
>>  #include "qapi-event.h"
>>  #include "trace.h"
>> +#include "migration/migration.h"
>>  
>>  #if defined(__linux__)
>>  #include <sys/mman.h>
>> @@ -214,11 +215,13 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>      VirtQueueElement *elem;
>>      MemoryRegionSection section;
>>  
>> +    qemu_mutex_lock_balloon_bitmap();
>>      for (;;) {
>>          size_t offset = 0;
>>          uint32_t pfn;
>>          elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>>          if (!elem) {
>> +            qemu_mutex_unlock_balloon_bitmap();
>>              return;
>>          }
>>  
>> @@ -242,6 +245,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>              addr = section.offset_within_region;
>>              balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
>>                           !!(vq == s->dvq));
>> +            qemu_balloon_bitmap_update(addr, !!(vq == s->dvq));
>>              memory_region_unref(section.mr);
>>          }
>>  
>> @@ -249,6 +253,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>          virtio_notify(vdev, vq);
>>          g_free(elem);
>>      }
>> +    qemu_mutex_unlock_balloon_bitmap();
>>  }
>>  
>>  static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
>> @@ -303,6 +308,16 @@ out:
>>      }
>>  }
>>  
>> +static void virtio_balloon_migration_state_changed(Notifier *notifier,
>> +                                                   void *data)
>> +{
>> +    MigrationState *mig = data;
>> +
>> +    if (migration_has_failed(mig)) {
>> +        qemu_balloon_reset_bitmap_data();
>> +    }
>> +}
>> +
>>  static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
>>  {
>>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>> @@ -382,6 +397,16 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
>>                                               VIRTIO_BALLOON_PFN_SHIFT);
>>  }
>>  
>> +static void virtio_balloon_in_progress(void *opaque, int *status)
>> +{
>> +    VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
>> +    if (cpu_to_le32(dev->actual) != cpu_to_le32(dev->num_pages)) {
>> +        *status = 1;
>> +        return;
>> +    }
>> +    *status = 0;
>> +}
>> +
>>  static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
>>  {
>>      VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
>> @@ -409,6 +434,7 @@ static void virtio_balloon_save_device(VirtIODevice *vdev, QEMUFile *f)
>>  
>>      qemu_put_be32(f, s->num_pages);
>>      qemu_put_be32(f, s->actual);
>> +    qemu_balloon_bitmap_save(f);
>>  }
>>  
>>  static int virtio_balloon_load(QEMUFile *f, void *opaque, int version_id)
>> @@ -426,6 +452,7 @@ static int virtio_balloon_load_device(VirtIODevice *vdev, QEMUFile *f,
>>  
>>      s->num_pages = qemu_get_be32(f);
>>      s->actual = qemu_get_be32(f);
>> +    qemu_balloon_bitmap_load(f);
>>      return 0;
>>  }
>>  
>> @@ -439,7 +466,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>                  sizeof(struct virtio_balloon_config));
>>  
>>      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
>> -                                   virtio_balloon_stat, s);
>> +                                   virtio_balloon_stat,
>> +                                   virtio_balloon_in_progress, s,
>> +                                   VIRTIO_BALLOON_PFN_SHIFT);
>>  
>>      if (ret < 0) {
>>          error_setg(errp, "Only one balloon device is supported");
>> @@ -453,6 +482,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>  
>>      reset_stats(s);
>>  
>> +    s->migration_state_notifier.notify = virtio_balloon_migration_state_changed;
>> +    add_migration_state_change_notifier(&s->migration_state_notifier);
>> +
>>      register_savevm(dev, "virtio-balloon", -1, 1,
>>                      virtio_balloon_save, virtio_balloon_load, s);
>>  }
>> @@ -462,6 +494,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
>>  
>> +    remove_migration_state_change_notifier(&s->migration_state_notifier);
>>      balloon_stats_destroy_timer(s);
>>      qemu_remove_balloon_handler(s);
>>      unregister_savevm(dev, "virtio-balloon", s);
>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>> index 35f62ac..1ded5a9 100644
>> --- a/include/hw/virtio/virtio-balloon.h
>> +++ b/include/hw/virtio/virtio-balloon.h
>> @@ -43,6 +43,7 @@ typedef struct VirtIOBalloon {
>>      int64_t stats_last_update;
>>      int64_t stats_poll_interval;
>>      uint32_t host_features;
>> +    Notifier migration_state_notifier;
>>  } VirtIOBalloon;
>>  
>>  #endif
>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>> index ac2c12c..6c1d1af 100644
>> --- a/include/migration/migration.h
>> +++ b/include/migration/migration.h
>> @@ -267,6 +267,7 @@ void migrate_del_blocker(Error *reason);
>>  
>>  bool migrate_postcopy_ram(void);
>>  bool migrate_zero_blocks(void);
>> +bool migrate_skip_balloon(void);
>>  
>>  bool migrate_auto_converge(void);
>>  
>> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
>> index 3f976b4..5325c38 100644
>> --- a/include/sysemu/balloon.h
>> +++ b/include/sysemu/balloon.h
>> @@ -15,14 +15,27 @@
>>  #define _QEMU_BALLOON_H
>>  
>>  #include "qapi-types.h"
>> +#include "migration/qemu-file.h"
>>  
>>  typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
>>  typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
>> +typedef void (QEMUBalloonInProgress) (void *opaque, int *status);
>>  
>>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>> -			     QEMUBalloonStatus *stat_func, void *opaque);
>> +                             QEMUBalloonStatus *stat_func,
>> +                             QEMUBalloonInProgress *progress_func,
>> +                             void *opaque, int pfn_shift);
>>  void qemu_remove_balloon_handler(void *opaque);
>>  bool qemu_balloon_is_inhibited(void);
>>  void qemu_balloon_inhibit(bool state);
>> +void qemu_mutex_lock_balloon_bitmap(void);
>> +void qemu_mutex_unlock_balloon_bitmap(void);
>> +void qemu_balloon_reset_bitmap_data(void);
>> +void qemu_balloon_bitmap_setup(void);
>> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new);
>> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate);
>> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr);
>> +int qemu_balloon_bitmap_save(QEMUFile *f);
>> +int qemu_balloon_bitmap_load(QEMUFile *f);
>>  
>>  #endif
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 034a918..cb86307 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1200,6 +1200,15 @@ int migrate_use_xbzrle(void)
>>      return s->enabled_capabilities[MIGRATION_CAPABILITY_XBZRLE];
>>  }
>>  
>> +bool migrate_skip_balloon(void)
>> +{
>> +    MigrationState *s;
>> +
>> +    s = migrate_get_current();
>> +
>> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_SKIP_BALLOON];
>> +}
>> +
>>  int64_t migrate_xbzrle_cache_size(void)
>>  {
>>      MigrationState *s;
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 704f6a9..161ab73 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -40,6 +40,7 @@
>>  #include "trace.h"
>>  #include "exec/ram_addr.h"
>>  #include "qemu/rcu_queue.h"
>> +#include "sysemu/balloon.h"
>>  
>>  #ifdef DEBUG_MIGRATION_RAM
>>  #define DPRINTF(fmt, ...) \
>> @@ -65,6 +66,7 @@ static uint64_t bitmap_sync_count;
>>  #define RAM_SAVE_FLAG_XBZRLE   0x40
>>  /* 0x80 is reserved in migration.h start with 0x100 next */
>>  #define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
>> +#define RAM_SAVE_FLAG_BALLOON  0x200
>>  
>>  static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
>>  
>> @@ -702,13 +704,17 @@ static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>>  {
>>      int pages = -1;
>>  
>> -    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>> -        acct_info.dup_pages++;
>> -        *bytes_transferred += save_page_header(f, block,
>> +    if (qemu_balloon_bitmap_test(block, offset) != 1) {
>> +        if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>> +            acct_info.dup_pages++;
>> +            *bytes_transferred += save_page_header(f, block,
>>                                                 offset | RAM_SAVE_FLAG_COMPRESS);
>> -        qemu_put_byte(f, 0);
>> -        *bytes_transferred += 1;
>> -        pages = 1;
>> +            qemu_put_byte(f, 0);
>> +            *bytes_transferred += 1;
>> +            pages = 1;
>> +        }
>> +    } else {
>> +        pages = 0;
>>      }
>>  
>>      return pages;
>> @@ -773,7 +779,7 @@ static int ram_save_page(QEMUFile *f, PageSearchStatus *pss,
>>               * page would be stale
>>               */
>>              xbzrle_cache_zero_page(current_addr);
>> -        } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
>> +        } else if (pages != 0 && !ram_bulk_stage && migrate_use_xbzrle()) {
> 
> Is this test the right way around - don't you want to try xbzrle if you've NOT sent
> a page?
> 

The page is ballooned out on the source itself, so we need not xbzrle-encode and
send the page. But we will change the code to cache_insert() ZERO_TARGET_PAGE, so
that even when the page gets ballooned back in on the source (during migration),
get_cached_data() would return a zero page.

Thanks,
- Jitendra

>>              pages = save_xbzrle_page(f, &p, current_addr, block,
>>                                       offset, last_stage, bytes_transferred);
>>              if (!last_stage) {
>> @@ -1355,6 +1361,9 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
>>          }
>>  
>>          if (found) {
>> +            /* skip saving ram host page if the corresponding guest page
>> +             * is ballooned out
>> +             */
>>              pages = ram_save_host_page(ms, f, &pss,
>>                                         last_stage, bytes_transferred,
>>                                         dirty_ram_abs);
>> @@ -1959,6 +1968,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>  
>>      rcu_read_unlock();
>>  
>> +    qemu_balloon_bitmap_setup();
>>      ram_control_before_iterate(f, RAM_CONTROL_SETUP);
>>      ram_control_after_iterate(f, RAM_CONTROL_SETUP);
>>  
>> @@ -1984,6 +1994,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>  
>>      ram_control_before_iterate(f, RAM_CONTROL_ROUND);
>>  
>> +    qemu_put_be64(f, RAM_SAVE_FLAG_BALLOON);
>> +    qemu_balloon_bitmap_save(f);
>> +
>>      t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>      i = 0;
>>      while ((ret = qemu_file_rate_limit(f)) == 0) {
>> @@ -2493,6 +2506,10 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>>              }
>>              break;
>>  
>> +        case RAM_SAVE_FLAG_BALLOON:
>> +            qemu_balloon_bitmap_load(f);
>> +            break;
>> +
>>          case RAM_SAVE_FLAG_COMPRESS:
>>              ch = qemu_get_byte(f);
>>              ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 7f8d799..38163ca 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -544,11 +544,14 @@
>>  #          been migrated, pulling the remaining pages along as needed. NOTE: If
>>  #          the migration fails during postcopy the VM will fail.  (since 2.6)
>>  #
>> +# @skip-balloon: Skip scanning ram pages released by virtio-balloon driver.
>> +#          (since 2.7)
>> +#
>>  # Since: 1.2
>>  ##
>>  { 'enum': 'MigrationCapability',
>>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>> -           'compress', 'events', 'postcopy-ram'] }
>> +           'compress', 'events', 'postcopy-ram', 'skip-balloon'] }
>>  
>>  ##
>>  # @MigrationCapabilityStatus
>> -- 
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Michael S. Tsirkin April 10, 2016, 4:59 p.m. UTC | #14
On Fri, Apr 01, 2016 at 04:38:28PM +0530, Jitendra Kolhe wrote:
> On 3/29/2016 5:58 PM, Michael S. Tsirkin wrote:
> > On Mon, Mar 28, 2016 at 09:46:05AM +0530, Jitendra Kolhe wrote:
> >> While measuring live migration performance for qemu/kvm guest, it
> >> was observed that the qemu doesn’t maintain any intelligence for the
> >> guest ram pages which are released by the guest balloon driver and
> >> treats such pages as any other normal guest ram pages. This has a direct
> >> impact on overall migration time for the guest which has released
> >> (ballooned out) memory to the host.
> >>
> >> In case of large systems, where we can configure large guests with 1TB
> >> and with a considerable amount of memory released by the balloon driver
> >> to the host, the migration time gets worse.
> >>
> >> The solution proposed below is local only to qemu (and does not require
> >> any modification to Linux kernel or any guest driver). We have verified
> >> the fix for large guests (>= 1TB) on HPE Superdome X (which can support up
> >> to 240 cores and 12TB of memory), and in the case where 90% of memory is
> >> released by the balloon driver the migration time for an idle guest reduces
> >> to ~600 secs from ~1200 secs.
> >>
> >> Detail: During live migration, as part of 1st iteration in ram_save_iterate()
> >> -> ram_find_and_save_block () will try to migrate ram pages which are
> >> released by the virtio-balloon driver as part of dynamic memory delete.
> >> Even though the pages which are returned to the host by virtio-balloon
> >> driver are zero pages, the migration algorithm will still end up
> >> scanning the entire page: ram_find_and_save_block() -> ram_save_page/
> >> ram_save_compressed_page -> save_zero_page() -> is_zero_range().  We
> >> also end up sending some control information over the network for these
> >> pages during migration. This adds to total migration time.
> >>
> >> The proposed fix, uses the existing bitmap infrastructure to create
> >> a virtio-balloon bitmap. The bits in the bitmap represent a guest ram
> >> page of size 1UL<< VIRTIO_BALLOON_PFN_SHIFT. The bitmap represents
> >> entire guest ram memory till max configured memory. Guest ram pages
> >> claimed by the virtio-balloon driver will be represented by 1 in the
> >> bitmap. During live migration, each guest ram page (host VA offset)
> >> is checked against the virtio-balloon bitmap, if the bit is set the
> >> corresponding ram page will be excluded from scanning and sending
> >> control information during migration. The bitmap is also migrated to
> >> the target as part of every ram_save_iterate loop and after the
> >> guest is stopped remaining balloon bitmap is migrated as part of
> >> balloon driver save / load interface.
> > 
> > Migrating the bitmap might help a chained migration case
> > but will slow down the more typical case a bit.
> > Make it optional?
> > 
> > 
> 
> Disabling migration of balloon bitmap would disable the 
> optimization permanently for all other subsequent migrations.
> 
> The current implementation sends offsets and lengths of runs of 1s
> in the bitmap. A highly fragmented bitmap may add to the bitmap
> migration time, but it would still be small compared to the total
> migration time.
> 
> Here are some values from my test setup for an idle guest 
> with 16GB memory ballooned down to 4GB memory (with no 
> ballooning operation in progress during migration).
> - Total migration time without proposed patch set - ~7600ms 
> - Total migration time with proposed patch set - ~5700ms
>   . Adds overhead of testing against bitmap - ~320ms
>   . Adds overhead of sending the bitmap - ~1.6ms
> - Saves zero page scan + protocol time - ~2348ms  
> We may see a somewhat higher percentage of time taken to migrate the
> bitmap in case ballooning is in progress during migration.
> 
> The patch does introduce an option to enable/disable
> testing ram_addr against the balloon bitmap during migration
> (since that is the major overhead), but we still end up
> migrating the bitmap by default.
> 
> >>
> >> With the proposed fix, the average migration time for an idle guest
> >> with 1TB maximum memory and 64vCpus
> >>  - reduces from ~1200 secs to ~600 sec, with guest memory ballooned
> >>    down to 128GB (~10% of 1TB).
> >>  - reduces from ~1300 to ~1200 sec (7%), with guest memory ballooned
> >>    down to 896GB (~90% of 1TB),
> >>  - with no ballooning configured, we don’t expect to see any impact
> >>    on total migration time.
> >>
> >> The optimization gets temporarily disabled, if the balloon operation is
> >> in progress. Since the optimization skips scanning and migrating control
> >> information for ballooned out pages, we might skip guest ram pages in
> >> cases where the guest balloon driver has freed the ram page to the guest
> >> but not yet informed the host/qemu about the ram page
> >> (VIRTIO_BALLOON_F_MUST_TELL_HOST). In such case with optimization, we
> >> might skip migrating ram pages which the guest is using. Since this
> >> problem is specific to balloon leak, we can restrict balloon operation in
> >> progress check to only balloon leak operation in progress check.
> >>
> >> The optimization also get permanently disabled (for all subsequent
> >> migrations) in case any of the migration uses postcopy capability. In case
> >> of postcopy the balloon bitmap would be required to send after vm_stop,
> >> which has significant impact on the downtime. Moreover, the applications
> >> in the guest space won’t be actually faulting on the ram pages which are
> >> already ballooned out, the proposed optimization will not show any
> >> improvement in migration time during postcopy.
> > 
> > 
> > I think this optimization can work for post copy with some
> > modifications.
> > 
> 
> So is there any way to migrate the bitmap without impacting
> downtime before starting the vm on dest in case of post-copy?

With balloon inflated pre-migration?  Yes. Put bitmap in a ram block,
and use post-copy on it: whenever you detect a fault, and before
fetching it, look at bitmap. If the relevant page is not yet on
destination, migrate it.

But you can also inflate the balloon on destination, after migration.

> In which case we can migrate bitmap to dest so that subsequent 
> migrations which are not using post-copy can use the optimization.
> In pre-copy case, most of the bitmap is migrated as a part of 
> ram_save_iterate(), thus having a minimal impact on downtime.
> 
> I am still unclear on how the patch set will benefit the post-migration
> case, as the guest on the dest should never fault on ballooned-out
> pages, so we won't be saving on zero page scan time?

That too.  But you seem to assume balloon stays inflated forever.
It does not, and we currently fault when guest on dest takes page out
of the balloon.

Finally, if balloon is inflated after migration, not migrating
the pages in balloon will skip scanning non-zero pages
since pages on source are non-zero.

> > 
> > For post-copy, when guest takes page out of the balloon,
> > it notifies the host.
> > 
> 
> Yes, this will help in maintaining balloon bitmap up-to-date.
> 
> > This only happens when MUST_TELL_HOST feature has been
> > negotiated, but most guests do negotiate it nowadays.
> > (It also seems that your code assumes MUST_TELL_HOST - I think
> > you must check and disable the bitmap management otherwise).
> > 
> > At that point we can mark page as migrated to avoid
> > requesting it from source.
> > 
> 
> qemu no longer seems to add the MUST_TELL_HOST feature
> to dev->host_features.
> Unless qemu sets the feature,
> how can it be negotiated with the guest?
> virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST)
> always returns false to me. Am I missing something here?

It never did because it never cared. If you care, you
can set it.

> 
> Recent Linux guest balloon drivers (>= 3.0) have stopped
> checking for the MUST_TELL_HOST feature, but they implement
> the functionality as if the feature were set.
> 
> To address cases where the guest may start using pages before the
> host is aware of them, I am disabling the test against the bitmap
> if ballooning is in progress (during migration). (I think we can
> restrict the ballooning-in-progress check to the balloon deflate
> condition only?) This will make sure that the pages get migrated.
> 
> Thanks,
> - Jitendra

I don't know what "ballooning is in progress" means.  Pages can be
taken out of the balloon cause of compaction at any time.  I don't think
piling up special-cases like that is a good idea, just ask guest
to talk to you before using pages.

> > 
> > 
> >> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
> >> ---
> >> Changed in v2:
> >>  - Resolved compilation issue for qemu-user binaries in exec.c
> >>  - Localize balloon bitmap test to save_zero_page().
> >>  - Updated version string for newly added migration capability to 2.7.
> >>  - Made minor modifications to patch commit text.
> >>
> >>  balloon.c                          | 253 ++++++++++++++++++++++++++++++++++++-
> >>  exec.c                             |   3 +
> >>  hw/virtio/virtio-balloon.c         |  35 ++++-
> >>  include/hw/virtio/virtio-balloon.h |   1 +
> >>  include/migration/migration.h      |   1 +
> >>  include/sysemu/balloon.h           |  15 ++-
> >>  migration/migration.c              |   9 ++
> >>  migration/ram.c                    |  31 ++++-
> >>  qapi-schema.json                   |   5 +-
> >>  9 files changed, 341 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/balloon.c b/balloon.c
> >> index f2ef50c..1c2d228 100644
> >> --- a/balloon.c
> >> +++ b/balloon.c
> >> @@ -33,11 +33,34 @@
> >>  #include "qmp-commands.h"
> >>  #include "qapi/qmp/qerror.h"
> >>  #include "qapi/qmp/qjson.h"
> >> +#include "exec/ram_addr.h"
> >> +#include "migration/migration.h"
> >> +
> >> +#define BALLOON_BITMAP_DISABLE_FLAG -1UL
> >> +
> >> +typedef enum {
> >> +    BALLOON_BITMAP_DISABLE_NONE = 1, /* Enabled */
> >> +    BALLOON_BITMAP_DISABLE_CURRENT,
> >> +    BALLOON_BITMAP_DISABLE_PERMANENT,
> >> +} BalloonBitmapDisableState;
> >>  
> >>  static QEMUBalloonEvent *balloon_event_fn;
> >>  static QEMUBalloonStatus *balloon_stat_fn;
> >> +static QEMUBalloonInProgress *balloon_in_progress_fn;
> >>  static void *balloon_opaque;
> >>  static bool balloon_inhibited;
> >> +static unsigned long balloon_bitmap_pages;
> >> +static unsigned int  balloon_bitmap_pfn_shift;
> >> +static QemuMutex balloon_bitmap_mutex;
> >> +static bool balloon_bitmap_xfered;
> >> +static unsigned long balloon_min_bitmap_offset;
> >> +static unsigned long balloon_max_bitmap_offset;
> >> +static BalloonBitmapDisableState balloon_bitmap_disable_state;
> >> +
> >> +static struct BitmapRcu {
> >> +    struct rcu_head rcu;
> >> +    unsigned long *bmap;
> >> +} *balloon_bitmap_rcu;
> >>  
> >>  bool qemu_balloon_is_inhibited(void)
> >>  {
> >> @@ -49,6 +72,21 @@ void qemu_balloon_inhibit(bool state)
> >>      balloon_inhibited = state;
> >>  }
> >>  
> >> +void qemu_mutex_lock_balloon_bitmap(void)
> >> +{
> >> +    qemu_mutex_lock(&balloon_bitmap_mutex);
> >> +}
> >> +
> >> +void qemu_mutex_unlock_balloon_bitmap(void)
> >> +{
> >> +    qemu_mutex_unlock(&balloon_bitmap_mutex);
> >> +}
> >> +
> >> +void qemu_balloon_reset_bitmap_data(void)
> >> +{
> >> +    balloon_bitmap_xfered = false;
> >> +}
> >> +
> >>  static bool have_balloon(Error **errp)
> >>  {
> >>      if (kvm_enabled() && !kvm_has_sync_mmu()) {
> >> @@ -65,9 +103,12 @@ static bool have_balloon(Error **errp)
> >>  }
> >>  
> >>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> >> -                             QEMUBalloonStatus *stat_func, void *opaque)
> >> +                             QEMUBalloonStatus *stat_func,
> >> +                             QEMUBalloonInProgress *in_progress_func,
> >> +                             void *opaque, int pfn_shift)
> >>  {
> >> -    if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
> >> +    if (balloon_event_fn || balloon_stat_fn ||
> >> +        balloon_in_progress_fn || balloon_opaque) {
> >>          /* We're already registered one balloon handler.  How many can
> >>           * a guest really have?
> >>           */
> >> @@ -75,17 +116,39 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> >>      }
> >>      balloon_event_fn = event_func;
> >>      balloon_stat_fn = stat_func;
> >> +    balloon_in_progress_fn = in_progress_func;
> >>      balloon_opaque = opaque;
> >> +
> >> +    qemu_mutex_init(&balloon_bitmap_mutex);
> >> +    balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_NONE;
> >> +    balloon_bitmap_pfn_shift = pfn_shift;
> >> +    balloon_bitmap_pages = (last_ram_offset() >> balloon_bitmap_pfn_shift);
> >> +    balloon_bitmap_rcu = g_new0(struct BitmapRcu, 1);
> >> +    balloon_bitmap_rcu->bmap = bitmap_new(balloon_bitmap_pages);
> >> +    bitmap_clear(balloon_bitmap_rcu->bmap, 0, balloon_bitmap_pages);
> >> +
> >>      return 0;
> >>  }
> >>  
> >> +static void balloon_bitmap_free(struct BitmapRcu *bmap)
> >> +{
> >> +    g_free(bmap->bmap);
> >> +    g_free(bmap);
> >> +}
> >> +
> >>  void qemu_remove_balloon_handler(void *opaque)
> >>  {
> >> +    struct BitmapRcu *bitmap = balloon_bitmap_rcu;
> >>      if (balloon_opaque != opaque) {
> >>          return;
> >>      }
> >> +    atomic_rcu_set(&balloon_bitmap_rcu, NULL);
> >> +    if (bitmap) {
> >> +        call_rcu(bitmap, balloon_bitmap_free, rcu);
> >> +    }
> >>      balloon_event_fn = NULL;
> >>      balloon_stat_fn = NULL;
> >> +    balloon_in_progress_fn = NULL;
> >>      balloon_opaque = NULL;
> >>  }
> >>  
> >> @@ -116,3 +179,189 @@ void qmp_balloon(int64_t target, Error **errp)
> >>      trace_balloon_event(balloon_opaque, target);
> >>      balloon_event_fn(balloon_opaque, target);
> >>  }
> >> +
> >> +/* Handle Ram hotplug case, only called in case old < new */
> >> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new)
> >> +{
> >> +    struct BitmapRcu *old_bitmap = balloon_bitmap_rcu, *bitmap;
> >> +    unsigned long old_offset, new_offset;
> >> +
> >> +    if (!balloon_bitmap_rcu) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    old_offset = (old >> balloon_bitmap_pfn_shift);
> >> +    new_offset = (new >> balloon_bitmap_pfn_shift);
> >> +
> >> +    bitmap = g_new(struct BitmapRcu, 1);
> >> +    bitmap->bmap = bitmap_new(new_offset);
> >> +
> >> +    qemu_mutex_lock_balloon_bitmap();
> >> +    bitmap_clear(bitmap->bmap, 0,
> >> +                 balloon_bitmap_pages + new_offset - old_offset);
> >> +    bitmap_copy(bitmap->bmap, old_bitmap->bmap, old_offset);
> >> +
> >> +    atomic_rcu_set(&balloon_bitmap_rcu, bitmap);
> >> +    balloon_bitmap_pages += new_offset - old_offset;
> >> +    qemu_mutex_unlock_balloon_bitmap();
> >> +    call_rcu(old_bitmap, balloon_bitmap_free, rcu);
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +/* Should be called with balloon bitmap mutex lock held */
> >> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate)
> >> +{
> >> +    unsigned long *bitmap;
> >> +    unsigned long offset = 0;
> >> +
> >> +    if (!balloon_bitmap_rcu) {
> >> +        return -1;
> >> +    }
> >> +    offset = (addr >> balloon_bitmap_pfn_shift);
> >> +    if (balloon_bitmap_xfered) {
> >> +        if (offset < balloon_min_bitmap_offset) {
> >> +            balloon_min_bitmap_offset = offset;
> >> +        }
> >> +        if (offset > balloon_max_bitmap_offset) {
> >> +            balloon_max_bitmap_offset = offset;
> >> +        }
> >> +    }
> >> +
> >> +    rcu_read_lock();
> >> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> >> +    if (deflate == 0) {
> >> +        set_bit(offset, bitmap);
> >> +    } else {
> >> +        clear_bit(offset, bitmap);
> >> +    }
> >> +    rcu_read_unlock();
> >> +    return 0;
> >> +}
> >> +
> >> +void qemu_balloon_bitmap_setup(void)
> >> +{
> >> +    if (migrate_postcopy_ram()) {
> >> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
> >> +    } else if ((!balloon_bitmap_rcu || !migrate_skip_balloon()) &&
> >> +               (balloon_bitmap_disable_state !=
> >> +                BALLOON_BITMAP_DISABLE_PERMANENT)) {
> >> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_CURRENT;
> >> +    }
> >> +}
> >> +
> >> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr)
> >> +{
> >> +    unsigned long *bitmap;
> >> +    ram_addr_t base;
> >> +    unsigned long nr = 0;
> >> +    int ret = 0;
> >> +
> >> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_CURRENT ||
> >> +        balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
> >> +        return 0;
> >> +    }
> >> +    balloon_in_progress_fn(balloon_opaque, &ret);
> >> +    if (ret == 1) {
> >> +        return 0;
> >> +    }
> >> +
> >> +    rcu_read_lock();
> >> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> >> +    base = rb->offset >> balloon_bitmap_pfn_shift;
> >> +    nr = base + (addr >> balloon_bitmap_pfn_shift);
> >> +    if (test_bit(nr, bitmap)) {
> >> +        ret = 1;
> >> +    }
> >> +    rcu_read_unlock();
> >> +    return ret;
> >> +}
> >> +
> >> +int qemu_balloon_bitmap_save(QEMUFile *f)
> >> +{
> >> +    unsigned long *bitmap;
> >> +    unsigned long offset = 0, next = 0, len = 0;
> >> +    unsigned long tmpoffset = 0, tmplimit = 0;
> >> +
> >> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
> >> +        qemu_put_be64(f, BALLOON_BITMAP_DISABLE_FLAG);
> >> +        return 0;
> >> +    }
> >> +
> >> +    qemu_mutex_lock_balloon_bitmap();
> >> +    if (balloon_bitmap_xfered) {
> >> +        tmpoffset = balloon_min_bitmap_offset;
> >> +        tmplimit  = balloon_max_bitmap_offset;
> >> +    } else {
> >> +        balloon_bitmap_xfered = true;
> >> +        tmpoffset = offset;
> >> +        tmplimit  = balloon_bitmap_pages;
> >> +    }
> >> +
> >> +    balloon_min_bitmap_offset = balloon_bitmap_pages;
> >> +    balloon_max_bitmap_offset = 0;
> >> +
> >> +    qemu_put_be64(f, balloon_bitmap_pages);
> >> +    qemu_put_be64(f, tmpoffset);
> >> +    qemu_put_be64(f, tmplimit);
> >> +    rcu_read_lock();
> >> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> >> +    while (tmpoffset < tmplimit) {
> >> +        unsigned long next_set_bit, start_set_bit;
> >> +        next_set_bit = find_next_bit(bitmap, balloon_bitmap_pages, tmpoffset);
> >> +        start_set_bit = next_set_bit;
> >> +        if (next_set_bit == balloon_bitmap_pages) {
> >> +            len = 0;
> >> +            next = start_set_bit;
> >> +            qemu_put_be64(f, next);
> >> +            qemu_put_be64(f, len);
> >> +            break;
> >> +        }
> >> +        next_set_bit = find_next_zero_bit(bitmap,
> >> +                                          balloon_bitmap_pages,
> >> +                                          ++next_set_bit);
> >> +        len = (next_set_bit - start_set_bit);
> >> +        next = start_set_bit;
> >> +        qemu_put_be64(f, next);
> >> +        qemu_put_be64(f, len);
> >> +        tmpoffset = next + len;
> >> +    }
> >> +    rcu_read_unlock();
> >> +    qemu_mutex_unlock_balloon_bitmap();
> >> +    return 0;
> >> +}
> >> +
> >> +int qemu_balloon_bitmap_load(QEMUFile *f)
> >> +{
> >> +    unsigned long *bitmap;
> >> +    unsigned long next = 0, len = 0;
> >> +    unsigned long tmpoffset = 0, tmplimit = 0;
> >> +
> >> +    if (!balloon_bitmap_rcu) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    qemu_mutex_lock_balloon_bitmap();
> >> +    balloon_bitmap_pages = qemu_get_be64(f);
> >> +    if (balloon_bitmap_pages == BALLOON_BITMAP_DISABLE_FLAG) {
> >> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
> >> +        qemu_mutex_unlock_balloon_bitmap();
> >> +        return 0;
> >> +    }
> >> +    tmpoffset = qemu_get_be64(f);
> >> +    tmplimit  = qemu_get_be64(f);
> >> +    rcu_read_lock();
> >> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
> >> +    while (tmpoffset < tmplimit) {
> >> +        next = qemu_get_be64(f);
> >> +        len  = qemu_get_be64(f);
> >> +        if (len == 0) {
> >> +            break;
> >> +        }
> >> +        bitmap_set(bitmap, next, len);
> >> +        tmpoffset = next + len;
> >> +    }
> >> +    rcu_read_unlock();
> >> +    qemu_mutex_unlock_balloon_bitmap();
> >> +    return 0;
> >> +}
> >> diff --git a/exec.c b/exec.c
> >> index f398d21..7a448e5 100644
> >> --- a/exec.c
> >> +++ b/exec.c
> >> @@ -43,6 +43,7 @@
> >>  #else /* !CONFIG_USER_ONLY */
> >>  #include "sysemu/xen-mapcache.h"
> >>  #include "trace.h"
> >> +#include "sysemu/balloon.h"
> >>  #endif
> >>  #include "exec/cpu-all.h"
> >>  #include "qemu/rcu_queue.h"
> >> @@ -1610,6 +1611,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> >>      if (new_ram_size > old_ram_size) {
> >>          migration_bitmap_extend(old_ram_size, new_ram_size);
> >>          dirty_memory_extend(old_ram_size, new_ram_size);
> >> +        qemu_balloon_bitmap_extend(old_ram_size << TARGET_PAGE_BITS,
> >> +                                   new_ram_size << TARGET_PAGE_BITS);
> >>      }
> >>      /* Keep the list sorted from biggest to smallest block.  Unlike QTAILQ,
> >>       * QLIST (which has an RCU-friendly variant) does not have insertion at
> >> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> >> index 22ad25c..9f3a4c8 100644
> >> --- a/hw/virtio/virtio-balloon.c
> >> +++ b/hw/virtio/virtio-balloon.c
> >> @@ -27,6 +27,7 @@
> >>  #include "qapi/visitor.h"
> >>  #include "qapi-event.h"
> >>  #include "trace.h"
> >> +#include "migration/migration.h"
> >>  
> >>  #if defined(__linux__)
> >>  #include <sys/mman.h>
> >> @@ -214,11 +215,13 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> >>      VirtQueueElement *elem;
> >>      MemoryRegionSection section;
> >>  
> >> +    qemu_mutex_lock_balloon_bitmap();
> >>      for (;;) {
> >>          size_t offset = 0;
> >>          uint32_t pfn;
> >>          elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> >>          if (!elem) {
> >> +            qemu_mutex_unlock_balloon_bitmap();
> >>              return;
> >>          }
> >>  
> >> @@ -242,6 +245,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> >>              addr = section.offset_within_region;
> >>              balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> >>                           !!(vq == s->dvq));
> >> +            qemu_balloon_bitmap_update(addr, !!(vq == s->dvq));
> >>              memory_region_unref(section.mr);
> >>          }
> >>  
> >> @@ -249,6 +253,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> >>          virtio_notify(vdev, vq);
> >>          g_free(elem);
> >>      }
> >> +    qemu_mutex_unlock_balloon_bitmap();
> >>  }
> >>  
> >>  static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
> >> @@ -303,6 +308,16 @@ out:
> >>      }
> >>  }
> >>  
> >> +static void virtio_balloon_migration_state_changed(Notifier *notifier,
> >> +                                                   void *data)
> >> +{
> >> +    MigrationState *mig = data;
> >> +
> >> +    if (migration_has_failed(mig)) {
> >> +        qemu_balloon_reset_bitmap_data();
> >> +    }
> >> +}
> >> +
> >>  static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
> >>  {
> >>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
> >> @@ -382,6 +397,16 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
> >>                                               VIRTIO_BALLOON_PFN_SHIFT);
> >>  }
> >>  
> >> +static void virtio_balloon_in_progress(void *opaque, int *status)
> >> +{
> >> +    VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
> >> +    if (cpu_to_le32(dev->actual) != cpu_to_le32(dev->num_pages)) {
> >> +        *status = 1;
> >> +        return;
> >> +    }
> >> +    *status = 0;
> >> +}
> >> +
> >>  static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
> >>  {
> >>      VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
> >> @@ -409,6 +434,7 @@ static void virtio_balloon_save_device(VirtIODevice *vdev, QEMUFile *f)
> >>  
> >>      qemu_put_be32(f, s->num_pages);
> >>      qemu_put_be32(f, s->actual);
> >> +    qemu_balloon_bitmap_save(f);
> >>  }
> >>  
> >>  static int virtio_balloon_load(QEMUFile *f, void *opaque, int version_id)
> >> @@ -426,6 +452,7 @@ static int virtio_balloon_load_device(VirtIODevice *vdev, QEMUFile *f,
> >>  
> >>      s->num_pages = qemu_get_be32(f);
> >>      s->actual = qemu_get_be32(f);
> >> +    qemu_balloon_bitmap_load(f);
> >>      return 0;
> >>  }
> >>  
> >> @@ -439,7 +466,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> >>                  sizeof(struct virtio_balloon_config));
> >>  
> >>      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
> >> -                                   virtio_balloon_stat, s);
> >> +                                   virtio_balloon_stat,
> >> +                                   virtio_balloon_in_progress, s,
> >> +                                   VIRTIO_BALLOON_PFN_SHIFT);
> >>  
> >>      if (ret < 0) {
> >>          error_setg(errp, "Only one balloon device is supported");
> >> @@ -453,6 +482,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> >>  
> >>      reset_stats(s);
> >>  
> >> +    s->migration_state_notifier.notify = virtio_balloon_migration_state_changed;
> >> +    add_migration_state_change_notifier(&s->migration_state_notifier);
> >> +
> >>      register_savevm(dev, "virtio-balloon", -1, 1,
> >>                      virtio_balloon_save, virtio_balloon_load, s);
> >>  }
> >> @@ -462,6 +494,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
> >>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> >>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
> >>  
> >> +    remove_migration_state_change_notifier(&s->migration_state_notifier);
> >>      balloon_stats_destroy_timer(s);
> >>      qemu_remove_balloon_handler(s);
> >>      unregister_savevm(dev, "virtio-balloon", s);
> >> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> >> index 35f62ac..1ded5a9 100644
> >> --- a/include/hw/virtio/virtio-balloon.h
> >> +++ b/include/hw/virtio/virtio-balloon.h
> >> @@ -43,6 +43,7 @@ typedef struct VirtIOBalloon {
> >>      int64_t stats_last_update;
> >>      int64_t stats_poll_interval;
> >>      uint32_t host_features;
> >> +    Notifier migration_state_notifier;
> >>  } VirtIOBalloon;
> >>  
> >>  #endif
> >> diff --git a/include/migration/migration.h b/include/migration/migration.h
> >> index ac2c12c..6c1d1af 100644
> >> --- a/include/migration/migration.h
> >> +++ b/include/migration/migration.h
> >> @@ -267,6 +267,7 @@ void migrate_del_blocker(Error *reason);
> >>  
> >>  bool migrate_postcopy_ram(void);
> >>  bool migrate_zero_blocks(void);
> >> +bool migrate_skip_balloon(void);
> >>  
> >>  bool migrate_auto_converge(void);
> >>  
> >> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
> >> index 3f976b4..5325c38 100644
> >> --- a/include/sysemu/balloon.h
> >> +++ b/include/sysemu/balloon.h
> >> @@ -15,14 +15,27 @@
> >>  #define _QEMU_BALLOON_H
> >>  
> >>  #include "qapi-types.h"
> >> +#include "migration/qemu-file.h"
> >>  
> >>  typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
> >>  typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
> >> +typedef void (QEMUBalloonInProgress) (void *opaque, int *status);
> >>  
> >>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> >> -			     QEMUBalloonStatus *stat_func, void *opaque);
> >> +                             QEMUBalloonStatus *stat_func,
> >> +                             QEMUBalloonInProgress *progress_func,
> >> +                             void *opaque, int pfn_shift);
> >>  void qemu_remove_balloon_handler(void *opaque);
> >>  bool qemu_balloon_is_inhibited(void);
> >>  void qemu_balloon_inhibit(bool state);
> >> +void qemu_mutex_lock_balloon_bitmap(void);
> >> +void qemu_mutex_unlock_balloon_bitmap(void);
> >> +void qemu_balloon_reset_bitmap_data(void);
> >> +void qemu_balloon_bitmap_setup(void);
> >> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new);
> >> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate);
> >> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr);
> >> +int qemu_balloon_bitmap_save(QEMUFile *f);
> >> +int qemu_balloon_bitmap_load(QEMUFile *f);
> >>  
> >>  #endif
> >> diff --git a/migration/migration.c b/migration/migration.c
> >> index 034a918..cb86307 100644
> >> --- a/migration/migration.c
> >> +++ b/migration/migration.c
> >> @@ -1200,6 +1200,15 @@ int migrate_use_xbzrle(void)
> >>      return s->enabled_capabilities[MIGRATION_CAPABILITY_XBZRLE];
> >>  }
> >>  
> >> +bool migrate_skip_balloon(void)
> >> +{
> >> +    MigrationState *s;
> >> +
> >> +    s = migrate_get_current();
> >> +
> >> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_SKIP_BALLOON];
> >> +}
> >> +
> >>  int64_t migrate_xbzrle_cache_size(void)
> >>  {
> >>      MigrationState *s;
> >> diff --git a/migration/ram.c b/migration/ram.c
> >> index 704f6a9..161ab73 100644
> >> --- a/migration/ram.c
> >> +++ b/migration/ram.c
> >> @@ -40,6 +40,7 @@
> >>  #include "trace.h"
> >>  #include "exec/ram_addr.h"
> >>  #include "qemu/rcu_queue.h"
> >> +#include "sysemu/balloon.h"
> >>  
> >>  #ifdef DEBUG_MIGRATION_RAM
> >>  #define DPRINTF(fmt, ...) \
> >> @@ -65,6 +66,7 @@ static uint64_t bitmap_sync_count;
> >>  #define RAM_SAVE_FLAG_XBZRLE   0x40
> >>  /* 0x80 is reserved in migration.h start with 0x100 next */
> >>  #define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
> >> +#define RAM_SAVE_FLAG_BALLOON  0x200
> >>  
> >>  static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
> >>  
> >> @@ -702,13 +704,17 @@ static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
> >>  {
> >>      int pages = -1;
> >>  
> >> -    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
> >> -        acct_info.dup_pages++;
> >> -        *bytes_transferred += save_page_header(f, block,
> >> +    if (qemu_balloon_bitmap_test(block, offset) != 1) {
> >> +        if (is_zero_range(p, TARGET_PAGE_SIZE)) {
> >> +            acct_info.dup_pages++;
> >> +            *bytes_transferred += save_page_header(f, block,
> >>                                                 offset | RAM_SAVE_FLAG_COMPRESS);
> >> -        qemu_put_byte(f, 0);
> >> -        *bytes_transferred += 1;
> >> -        pages = 1;
> >> +            qemu_put_byte(f, 0);
> >> +            *bytes_transferred += 1;
> >> +            pages = 1;
> >> +        }
> >> +    } else {
> >> +        pages = 0;
> >>      }
> >>  
> >>      return pages;
> >> @@ -773,7 +779,7 @@ static int ram_save_page(QEMUFile *f, PageSearchStatus *pss,
> >>               * page would be stale
> >>               */
> >>              xbzrle_cache_zero_page(current_addr);
> >> -        } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
> >> +        } else if (pages != 0 && !ram_bulk_stage && migrate_use_xbzrle()) {
> >>              pages = save_xbzrle_page(f, &p, current_addr, block,
> >>                                       offset, last_stage, bytes_transferred);
> >>              if (!last_stage) {
> >> @@ -1355,6 +1361,9 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
> >>          }
> >>  
> >>          if (found) {
> >> +            /* skip saving ram host page if the corresponding guest page
> >> +             * is ballooned out
> >> +             */
> >>              pages = ram_save_host_page(ms, f, &pss,
> >>                                         last_stage, bytes_transferred,
> >>                                         dirty_ram_abs);
> >> @@ -1959,6 +1968,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> >>  
> >>      rcu_read_unlock();
> >>  
> >> +    qemu_balloon_bitmap_setup();
> >>      ram_control_before_iterate(f, RAM_CONTROL_SETUP);
> >>      ram_control_after_iterate(f, RAM_CONTROL_SETUP);
> >>  
> >> @@ -1984,6 +1994,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
> >>  
> >>      ram_control_before_iterate(f, RAM_CONTROL_ROUND);
> >>  
> >> +    qemu_put_be64(f, RAM_SAVE_FLAG_BALLOON);
> >> +    qemu_balloon_bitmap_save(f);
> >> +
> >>      t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> >>      i = 0;
> >>      while ((ret = qemu_file_rate_limit(f)) == 0) {
> >> @@ -2493,6 +2506,10 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
> >>              }
> >>              break;
> >>  
> >> +        case RAM_SAVE_FLAG_BALLOON:
> >> +            qemu_balloon_bitmap_load(f);
> >> +            break;
> >> +
> >>          case RAM_SAVE_FLAG_COMPRESS:
> >>              ch = qemu_get_byte(f);
> >>              ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
> >> diff --git a/qapi-schema.json b/qapi-schema.json
> >> index 7f8d799..38163ca 100644
> >> --- a/qapi-schema.json
> >> +++ b/qapi-schema.json
> >> @@ -544,11 +544,14 @@
> >>  #          been migrated, pulling the remaining pages along as needed. NOTE: If
> >>  #          the migration fails during postcopy the VM will fail.  (since 2.6)
> >>  #
> >> +# @skip-balloon: Skip scanning ram pages released by virtio-balloon driver.
> >> +#          (since 2.7)
> >> +#
> >>  # Since: 1.2
> >>  ##
> >>  { 'enum': 'MigrationCapability',
> >>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> >> -           'compress', 'events', 'postcopy-ram'] }
> >> +           'compress', 'events', 'postcopy-ram', 'skip-balloon'] }
> >>  
> >>  ##
> >>  # @MigrationCapabilityStatus
> >> -- 
> >> 1.8.3.1
Jitendra Kolhe April 13, 2016, 10:54 a.m. UTC | #15
On 4/10/2016 10:29 PM, Michael S. Tsirkin wrote:
> On Fri, Apr 01, 2016 at 04:38:28PM +0530, Jitendra Kolhe wrote:
>> On 3/29/2016 5:58 PM, Michael S. Tsirkin wrote:
>>> On Mon, Mar 28, 2016 at 09:46:05AM +0530, Jitendra Kolhe wrote:
>>>> While measuring live migration performance for qemu/kvm guest, it
>>>> was observed that qemu doesn’t maintain any intelligence for the
>>>> guest ram pages which are released by the guest balloon driver and
>>>> treats such pages as any other normal guest ram pages. This has a direct
>>>> impact on overall migration time for the guest which has released
>>>> (ballooned out) memory to the host.
>>>>
>>>> In case of large systems, where we can configure large guests with 1TB
>>>> and with considerable amount of memory release by balloon driver to the,
>>>> host the migration time gets worse.
>>>>
>>>> The solution proposed below is local only to qemu (and does not require
>>>> any modification to Linux kernel or any guest driver). We have verified
>>>> the fix for large guests =1TB on HPE Superdome X (which can support up
>>>> to 240 cores and 12TB of memory) and in case where 90% of memory is
>>>> released by the balloon driver, the migration time for an idle guest reduces
>>>> to ~600 secs from ~1200 secs.
>>>>
>>>> Detail: During live migration, as part of 1st iteration in ram_save_iterate()
>>>> -> ram_find_and_save_block () will try to migrate ram pages which are
>>>> released by the virtio-balloon driver as part of dynamic memory delete.
>>>> Even though the pages which are returned to the host by virtio-balloon
>>>> driver are zero pages, the migration algorithm will still end up
>>>> scanning the entire page ram_find_and_save_block() -> ram_save_page/
>>>> ram_save_compressed_page -> save_zero_page() -> is_zero_range().  We
>>>> also end up sending some control information over the network for these
>>>> pages during migration. This adds to the total migration time.
>>>>
>>>> The proposed fix, uses the existing bitmap infrastructure to create
>>>> a virtio-balloon bitmap. The bits in the bitmap represent a guest ram
>>>> page of size 1UL<< VIRTIO_BALLOON_PFN_SHIFT. The bitmap represents
>>>> entire guest ram memory till max configured memory. Guest ram pages
>>>> claimed by the virtio-balloon driver will be represented by 1 in the
>>>> bitmap. During live migration, each guest ram page (host VA offset)
>>>> is checked against the virtio-balloon bitmap; if the bit is set, the
>>>> corresponding ram page will be excluded from scanning and sending
>>>> control information during migration. The bitmap is also migrated to
>>>> the target as part of every ram_save_iterate loop and after the
>>>> guest is stopped remaining balloon bitmap is migrated as part of
>>>> balloon driver save / load interface.
>>>
>>> Migrating the bitmap might help a chained migration case
>>> but will slow down the more typical case a bit.
>>> Make it optional?
>>>
>>>
>>
>> Disabling migration of balloon bitmap would disable the 
>> optimization permanently for all other subsequent migrations.
>>
>> The current implementation sends offsets and lengths of runs
>> of 1s in the bitmap. A highly fragmented bitmap may add to the
>> bitmap migration time, but it would still be small compared to
>> the total migration time.
>>
>> Here are some values from my test setup for an idle guest 
>> with 16GB memory ballooned down to 4GB memory (with no 
>> ballooning operation in progress during migration).
>> - Total migration time without proposed patch set - ~7600ms 
>> - Total migration time with proposed patch set - ~5700ms
>>   . Adds overhead of testing against bitmap - ~320ms
>>   . Adds overhead of sending the bitmap - ~1.6ms
>> - Saves zero page scan + protocol time - ~2348ms  
>> We may see a higher percentage of time spent migrating the bitmap
>> in case ballooning is in progress during migration.
>>
>> The patch does introduce an option to enable/disable
>> testing ram_addr against the balloon bitmap during migration
>> (since that's the major overhead), but we still end up
>> migrating the bitmap by default.
>>
>>>>
>>>> With the proposed fix, the average migration time for an idle guest
>>>> with 1TB maximum memory and 64vCpus
>>>>  - reduces from ~1200 secs to ~600 sec, with guest memory ballooned
>>>>    down to 128GB (~10% of 1TB).
>>>>  - reduces from ~1300 to ~1200 sec (7%), with guest memory ballooned
>>>>    down to 896GB (~90% of 1TB),
>>>>  - with no ballooning configured, we don’t expect to see any impact
>>>>    on total migration time.
>>>>
>>>> The optimization gets temporarily disabled, if the balloon operation is
>>>> in progress. Since the optimization skips scanning and migrating control
>>>> information for ballooned out pages, we might skip guest ram pages in
>>>> cases where the guest balloon driver has freed the ram page to the guest
>>>> but not yet informed the host/qemu about the ram page
>>>> (VIRTIO_BALLOON_F_MUST_TELL_HOST). In such case with optimization, we
>>>> might skip migrating ram pages which the guest is using. Since this
>>>> problem is specific to balloon leak, we can restrict the check for a balloon
>>>> operation in progress to balloon leak operations only.
>>>>
>>>> The optimization also gets permanently disabled (for all subsequent
>>>> migrations) in case any of the migrations uses the postcopy capability. In case
>>>> of postcopy the balloon bitmap would need to be sent after vm_stop,
>>>> which has a significant impact on the downtime. Moreover, the applications
>>>> in the guest space won’t be actually faulting on the ram pages which are
>>>> already ballooned out, the proposed optimization will not show any
>>>> improvement in migration time during postcopy.
>>>
>>>
>>> I think this optimization can work for post copy with some
>>> modifications.
>>>
>>
>> So is there any way to migrate the bitmap without impacting
>> downtime before starting the vm on dest in case of post-copy?
> 
> With balloon inflated pre-migration?  Yes. Put bitmap in a ram block,
> and use post-copy on it: whenever you detect a fault, and before
> fetching it, look at bitmap. If the relevant page is not yet on
> destination, migrate it.
> 
> But you can also inflate the balloon on destination, after migration.
> 

Ok, I will explore post-copy migration to send the bitmap and make it
available before, or as part of, the userfault handling. I also need to
check whether there are any consequences in case a ballooning operation is
in progress on the destination.

Can we extend support for post-copy in a separate patch set, and use the
current patch set to support the remaining protocols?

>> In which case we can migrate bitmap to dest so that subsequent 
>> migrations which are not using post-copy can use the optimization.
>> In pre-copy case, most of the bitmap is migrated as a part of 
>> ram_save_iterate(), thus having a minimal impact on downtime.
>>
>> I am still unclear on how the patch set will benefit the post-migration
>> case, as the guest on the dest should never fault on ballooned-out pages,
>> so we won't be saving on zero page scan time?
> 
> That too.  But you seem to assume balloon stays inflated forever.
> It does not, and we currently fault when guest on dest takes page out
> of the balloon.
> 
> Finally, if balloon is inflated after migration, not migrating
> the pages in balloon will skip scanning non-zero pages
> since pages on source are non-zero.
> 
>>>
>>> For post-copy, when guest takes page out of the balloon,
>>> it notifies the host.
>>>
>>
>> Yes, this will help in maintaining balloon bitmap up-to-date.
>>
>>> This only happens when MUST_TELL_HOST feature has been
>>> negotiated but most guests do negotiate it nowdays.
>>> (It also seems that your code assumes MUST_TELL_HOST - I think
>>> you must check and disable the bitmap management otherwise).
>>>
>>> At that point we can mark page as migrated to avoid
>>> requesting it from source.
>>>
>>
>> qemu no longer seems to add the MUST_TELL_HOST feature
>> to dev->host_features.
>> Unless qemu sets the feature,
>> how can it be negotiated with the guest?
>> virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST)
>> always returns false to me. Am I missing something here?
> 
> It never did because it never cared. If you care, you
> can set it.
> 

Great, thanks. I suspected that VIRTIO_BALLOON_F_MUST_TELL_HOST
was not being set for some reason. I will set the flag in qemu and have the
guest report its feature list, based on which the optimization will be
enabled or disabled.

>>
>> The latest Linux guest balloon drivers (>= 3.0) have stopped
>> checking for the MUST_TELL_HOST feature, but they implement the
>> functionality as if the feature were set.
>>
>> To address cases where the guest may start using pages before the
>> host is aware of them, I am disabling the test against the bitmap
>> if ballooning is in progress (during migration). (I think
>> we can restrict the ballooning-in-progress check to the balloon
>> deflate condition only?) This will make sure that the pages
>> will be migrated.
>>
>> Thanks,
>> - Jitendra
> 
> I don't know what "ballooning is in progress" means.  Pages can be
> taken out of the balloon because of compaction at any time.  I don't think
> piling up special-cases like that is a good idea, just ask guest
> to talk to you before using pages.
> 

Sure, the special-case check will no longer be required after
introducing a check for the VIRTIO_BALLOON_F_MUST_TELL_HOST flag.
I will post a patch set with these changes.

Thanks,
- Jitendra

>>>
>>>
>>>> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
>>>> ---
>>>> Changed in v2:
>>>>  - Resolved compilation issue for qemu-user binaries in exec.c
>>>>  - Localize balloon bitmap test to save_zero_page().
>>>>  - Updated version string for newly added migration capability to 2.7.
>>>>  - Made minor modifications to patch commit text.
>>>>
>>>>  balloon.c                          | 253 ++++++++++++++++++++++++++++++++++++-
>>>>  exec.c                             |   3 +
>>>>  hw/virtio/virtio-balloon.c         |  35 ++++-
>>>>  include/hw/virtio/virtio-balloon.h |   1 +
>>>>  include/migration/migration.h      |   1 +
>>>>  include/sysemu/balloon.h           |  15 ++-
>>>>  migration/migration.c              |   9 ++
>>>>  migration/ram.c                    |  31 ++++-
>>>>  qapi-schema.json                   |   5 +-
>>>>  9 files changed, 341 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/balloon.c b/balloon.c
>>>> index f2ef50c..1c2d228 100644
>>>> --- a/balloon.c
>>>> +++ b/balloon.c
>>>> @@ -33,11 +33,34 @@
>>>>  #include "qmp-commands.h"
>>>>  #include "qapi/qmp/qerror.h"
>>>>  #include "qapi/qmp/qjson.h"
>>>> +#include "exec/ram_addr.h"
>>>> +#include "migration/migration.h"
>>>> +
>>>> +#define BALLOON_BITMAP_DISABLE_FLAG -1UL
>>>> +
>>>> +typedef enum {
>>>> +    BALLOON_BITMAP_DISABLE_NONE = 1, /* Enabled */
>>>> +    BALLOON_BITMAP_DISABLE_CURRENT,
>>>> +    BALLOON_BITMAP_DISABLE_PERMANENT,
>>>> +} BalloonBitmapDisableState;
>>>>  
>>>>  static QEMUBalloonEvent *balloon_event_fn;
>>>>  static QEMUBalloonStatus *balloon_stat_fn;
>>>> +static QEMUBalloonInProgress *balloon_in_progress_fn;
>>>>  static void *balloon_opaque;
>>>>  static bool balloon_inhibited;
>>>> +static unsigned long balloon_bitmap_pages;
>>>> +static unsigned int  balloon_bitmap_pfn_shift;
>>>> +static QemuMutex balloon_bitmap_mutex;
>>>> +static bool balloon_bitmap_xfered;
>>>> +static unsigned long balloon_min_bitmap_offset;
>>>> +static unsigned long balloon_max_bitmap_offset;
>>>> +static BalloonBitmapDisableState balloon_bitmap_disable_state;
>>>> +
>>>> +static struct BitmapRcu {
>>>> +    struct rcu_head rcu;
>>>> +    unsigned long *bmap;
>>>> +} *balloon_bitmap_rcu;
>>>>  
>>>>  bool qemu_balloon_is_inhibited(void)
>>>>  {
>>>> @@ -49,6 +72,21 @@ void qemu_balloon_inhibit(bool state)
>>>>      balloon_inhibited = state;
>>>>  }
>>>>  
>>>> +void qemu_mutex_lock_balloon_bitmap(void)
>>>> +{
>>>> +    qemu_mutex_lock(&balloon_bitmap_mutex);
>>>> +}
>>>> +
>>>> +void qemu_mutex_unlock_balloon_bitmap(void)
>>>> +{
>>>> +    qemu_mutex_unlock(&balloon_bitmap_mutex);
>>>> +}
>>>> +
>>>> +void qemu_balloon_reset_bitmap_data(void)
>>>> +{
>>>> +    balloon_bitmap_xfered = false;
>>>> +}
>>>> +
>>>>  static bool have_balloon(Error **errp)
>>>>  {
>>>>      if (kvm_enabled() && !kvm_has_sync_mmu()) {
>>>> @@ -65,9 +103,12 @@ static bool have_balloon(Error **errp)
>>>>  }
>>>>  
>>>>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>>>> -                             QEMUBalloonStatus *stat_func, void *opaque)
>>>> +                             QEMUBalloonStatus *stat_func,
>>>> +                             QEMUBalloonInProgress *in_progress_func,
>>>> +                             void *opaque, int pfn_shift)
>>>>  {
>>>> -    if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
>>>> +    if (balloon_event_fn || balloon_stat_fn ||
>>>> +        balloon_in_progress_fn || balloon_opaque) {
>>>>          /* We're already registered one balloon handler.  How many can
>>>>           * a guest really have?
>>>>           */
>>>> @@ -75,17 +116,39 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>>>>      }
>>>>      balloon_event_fn = event_func;
>>>>      balloon_stat_fn = stat_func;
>>>> +    balloon_in_progress_fn = in_progress_func;
>>>>      balloon_opaque = opaque;
>>>> +
>>>> +    qemu_mutex_init(&balloon_bitmap_mutex);
>>>> +    balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_NONE;
>>>> +    balloon_bitmap_pfn_shift = pfn_shift;
>>>> +    balloon_bitmap_pages = (last_ram_offset() >> balloon_bitmap_pfn_shift);
>>>> +    balloon_bitmap_rcu = g_new0(struct BitmapRcu, 1);
>>>> +    balloon_bitmap_rcu->bmap = bitmap_new(balloon_bitmap_pages);
>>>> +    bitmap_clear(balloon_bitmap_rcu->bmap, 0, balloon_bitmap_pages);
>>>> +
>>>>      return 0;
>>>>  }
>>>>  
>>>> +static void balloon_bitmap_free(struct BitmapRcu *bmap)
>>>> +{
>>>> +    g_free(bmap->bmap);
>>>> +    g_free(bmap);
>>>> +}
>>>> +
>>>>  void qemu_remove_balloon_handler(void *opaque)
>>>>  {
>>>> +    struct BitmapRcu *bitmap = balloon_bitmap_rcu;
>>>>      if (balloon_opaque != opaque) {
>>>>          return;
>>>>      }
>>>> +    atomic_rcu_set(&balloon_bitmap_rcu, NULL);
>>>> +    if (bitmap) {
>>>> +        call_rcu(bitmap, balloon_bitmap_free, rcu);
>>>> +    }
>>>>      balloon_event_fn = NULL;
>>>>      balloon_stat_fn = NULL;
>>>> +    balloon_in_progress_fn = NULL;
>>>>      balloon_opaque = NULL;
>>>>  }
>>>>  
>>>> @@ -116,3 +179,189 @@ void qmp_balloon(int64_t target, Error **errp)
>>>>      trace_balloon_event(balloon_opaque, target);
>>>>      balloon_event_fn(balloon_opaque, target);
>>>>  }
>>>> +
>>>> +/* Handle Ram hotplug case, only called in case old < new */
>>>> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new)
>>>> +{
>>>> +    struct BitmapRcu *old_bitmap = balloon_bitmap_rcu, *bitmap;
>>>> +    unsigned long old_offset, new_offset;
>>>> +
>>>> +    if (!balloon_bitmap_rcu) {
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    old_offset = (old >> balloon_bitmap_pfn_shift);
>>>> +    new_offset = (new >> balloon_bitmap_pfn_shift);
>>>> +
>>>> +    bitmap = g_new(struct BitmapRcu, 1);
>>>> +    bitmap->bmap = bitmap_new(new_offset);
>>>> +
>>>> +    qemu_mutex_lock_balloon_bitmap();
>>>> +    bitmap_clear(bitmap->bmap, 0,
>>>> +                 balloon_bitmap_pages + new_offset - old_offset);
>>>> +    bitmap_copy(bitmap->bmap, old_bitmap->bmap, old_offset);
>>>> +
>>>> +    atomic_rcu_set(&balloon_bitmap_rcu, bitmap);
>>>> +    balloon_bitmap_pages += new_offset - old_offset;
>>>> +    qemu_mutex_unlock_balloon_bitmap();
>>>> +    call_rcu(old_bitmap, balloon_bitmap_free, rcu);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +/* Should be called with balloon bitmap mutex lock held */
>>>> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate)
>>>> +{
>>>> +    unsigned long *bitmap;
>>>> +    unsigned long offset = 0;
>>>> +
>>>> +    if (!balloon_bitmap_rcu) {
>>>> +        return -1;
>>>> +    }
>>>> +    offset = (addr >> balloon_bitmap_pfn_shift);
>>>> +    if (balloon_bitmap_xfered) {
>>>> +        if (offset < balloon_min_bitmap_offset) {
>>>> +            balloon_min_bitmap_offset = offset;
>>>> +        }
>>>> +        if (offset > balloon_max_bitmap_offset) {
>>>> +            balloon_max_bitmap_offset = offset;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    rcu_read_lock();
>>>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>>>> +    if (deflate == 0) {
>>>> +        set_bit(offset, bitmap);
>>>> +    } else {
>>>> +        clear_bit(offset, bitmap);
>>>> +    }
>>>> +    rcu_read_unlock();
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +void qemu_balloon_bitmap_setup(void)
>>>> +{
>>>> +    if (migrate_postcopy_ram()) {
>>>> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
>>>> +    } else if ((!balloon_bitmap_rcu || !migrate_skip_balloon()) &&
>>>> +               (balloon_bitmap_disable_state !=
>>>> +                BALLOON_BITMAP_DISABLE_PERMANENT)) {
>>>> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_CURRENT;
>>>> +    }
>>>> +}
>>>> +
>>>> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr)
>>>> +{
>>>> +    unsigned long *bitmap;
>>>> +    ram_addr_t base;
>>>> +    unsigned long nr = 0;
>>>> +    int ret = 0;
>>>> +
>>>> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_CURRENT ||
>>>> +        balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
>>>> +        return 0;
>>>> +    }
>>>> +    balloon_in_progress_fn(balloon_opaque, &ret);
>>>> +    if (ret == 1) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    rcu_read_lock();
>>>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>>>> +    base = rb->offset >> balloon_bitmap_pfn_shift;
>>>> +    nr = base + (addr >> balloon_bitmap_pfn_shift);
>>>> +    if (test_bit(nr, bitmap)) {
>>>> +        ret = 1;
>>>> +    }
>>>> +    rcu_read_unlock();
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +int qemu_balloon_bitmap_save(QEMUFile *f)
>>>> +{
>>>> +    unsigned long *bitmap;
>>>> +    unsigned long offset = 0, next = 0, len = 0;
>>>> +    unsigned long tmpoffset = 0, tmplimit = 0;
>>>> +
>>>> +    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
>>>> +        qemu_put_be64(f, BALLOON_BITMAP_DISABLE_FLAG);
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    qemu_mutex_lock_balloon_bitmap();
>>>> +    if (balloon_bitmap_xfered) {
>>>> +        tmpoffset = balloon_min_bitmap_offset;
>>>> +        tmplimit  = balloon_max_bitmap_offset;
>>>> +    } else {
>>>> +        balloon_bitmap_xfered = true;
>>>> +        tmpoffset = offset;
>>>> +        tmplimit  = balloon_bitmap_pages;
>>>> +    }
>>>> +
>>>> +    balloon_min_bitmap_offset = balloon_bitmap_pages;
>>>> +    balloon_max_bitmap_offset = 0;
>>>> +
>>>> +    qemu_put_be64(f, balloon_bitmap_pages);
>>>> +    qemu_put_be64(f, tmpoffset);
>>>> +    qemu_put_be64(f, tmplimit);
>>>> +    rcu_read_lock();
>>>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>>>> +    while (tmpoffset < tmplimit) {
>>>> +        unsigned long next_set_bit, start_set_bit;
>>>> +        next_set_bit = find_next_bit(bitmap, balloon_bitmap_pages, tmpoffset);
>>>> +        start_set_bit = next_set_bit;
>>>> +        if (next_set_bit == balloon_bitmap_pages) {
>>>> +            len = 0;
>>>> +            next = start_set_bit;
>>>> +            qemu_put_be64(f, next);
>>>> +            qemu_put_be64(f, len);
>>>> +            break;
>>>> +        }
>>>> +        next_set_bit = find_next_zero_bit(bitmap,
>>>> +                                          balloon_bitmap_pages,
>>>> +                                          ++next_set_bit);
>>>> +        len = (next_set_bit - start_set_bit);
>>>> +        next = start_set_bit;
>>>> +        qemu_put_be64(f, next);
>>>> +        qemu_put_be64(f, len);
>>>> +        tmpoffset = next + len;
>>>> +    }
>>>> +    rcu_read_unlock();
>>>> +    qemu_mutex_unlock_balloon_bitmap();
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int qemu_balloon_bitmap_load(QEMUFile *f)
>>>> +{
>>>> +    unsigned long *bitmap;
>>>> +    unsigned long next = 0, len = 0;
>>>> +    unsigned long tmpoffset = 0, tmplimit = 0;
>>>> +
>>>> +    if (!balloon_bitmap_rcu) {
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    qemu_mutex_lock_balloon_bitmap();
>>>> +    balloon_bitmap_pages = qemu_get_be64(f);
>>>> +    if (balloon_bitmap_pages == BALLOON_BITMAP_DISABLE_FLAG) {
>>>> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
>>>> +        qemu_mutex_unlock_balloon_bitmap();
>>>> +        return 0;
>>>> +    }
>>>> +    tmpoffset = qemu_get_be64(f);
>>>> +    tmplimit  = qemu_get_be64(f);
>>>> +    rcu_read_lock();
>>>> +    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
>>>> +    while (tmpoffset < tmplimit) {
>>>> +        next = qemu_get_be64(f);
>>>> +        len  = qemu_get_be64(f);
>>>> +        if (len == 0) {
>>>> +            break;
>>>> +        }
>>>> +        bitmap_set(bitmap, next, len);
>>>> +        tmpoffset = next + len;
>>>> +    }
>>>> +    rcu_read_unlock();
>>>> +    qemu_mutex_unlock_balloon_bitmap();
>>>> +    return 0;
>>>> +}
>>>> diff --git a/exec.c b/exec.c
>>>> index f398d21..7a448e5 100644
>>>> --- a/exec.c
>>>> +++ b/exec.c
>>>> @@ -43,6 +43,7 @@
>>>>  #else /* !CONFIG_USER_ONLY */
>>>>  #include "sysemu/xen-mapcache.h"
>>>>  #include "trace.h"
>>>> +#include "sysemu/balloon.h"
>>>>  #endif
>>>>  #include "exec/cpu-all.h"
>>>>  #include "qemu/rcu_queue.h"
>>>> @@ -1610,6 +1611,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>>      if (new_ram_size > old_ram_size) {
>>>>          migration_bitmap_extend(old_ram_size, new_ram_size);
>>>>          dirty_memory_extend(old_ram_size, new_ram_size);
>>>> +        qemu_balloon_bitmap_extend(old_ram_size << TARGET_PAGE_BITS,
>>>> +                                   new_ram_size << TARGET_PAGE_BITS);
>>>>      }
>>>>      /* Keep the list sorted from biggest to smallest block.  Unlike QTAILQ,
>>>>       * QLIST (which has an RCU-friendly variant) does not have insertion at
>>>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>>>> index 22ad25c..9f3a4c8 100644
>>>> --- a/hw/virtio/virtio-balloon.c
>>>> +++ b/hw/virtio/virtio-balloon.c
>>>> @@ -27,6 +27,7 @@
>>>>  #include "qapi/visitor.h"
>>>>  #include "qapi-event.h"
>>>>  #include "trace.h"
>>>> +#include "migration/migration.h"
>>>>  
>>>>  #if defined(__linux__)
>>>>  #include <sys/mman.h>
>>>> @@ -214,11 +215,13 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>>>      VirtQueueElement *elem;
>>>>      MemoryRegionSection section;
>>>>  
>>>> +    qemu_mutex_lock_balloon_bitmap();
>>>>      for (;;) {
>>>>          size_t offset = 0;
>>>>          uint32_t pfn;
>>>>          elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>>>>          if (!elem) {
>>>> +            qemu_mutex_unlock_balloon_bitmap();
>>>>              return;
>>>>          }
>>>>  
>>>> @@ -242,6 +245,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>>>              addr = section.offset_within_region;
>>>>              balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
>>>>                           !!(vq == s->dvq));
>>>> +            qemu_balloon_bitmap_update(addr, !!(vq == s->dvq));
>>>>              memory_region_unref(section.mr);
>>>>          }
>>>>  
>>>> @@ -249,6 +253,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>>>>          virtio_notify(vdev, vq);
>>>>          g_free(elem);
>>>>      }
>>>> +    qemu_mutex_unlock_balloon_bitmap();
>>>>  }
>>>>  
>>>>  static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
>>>> @@ -303,6 +308,16 @@ out:
>>>>      }
>>>>  }
>>>>  
>>>> +static void virtio_balloon_migration_state_changed(Notifier *notifier,
>>>> +                                                   void *data)
>>>> +{
>>>> +    MigrationState *mig = data;
>>>> +
>>>> +    if (migration_has_failed(mig)) {
>>>> +        qemu_balloon_reset_bitmap_data();
>>>> +    }
>>>> +}
>>>> +
>>>>  static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
>>>>  {
>>>>      VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>>>> @@ -382,6 +397,16 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
>>>>                                               VIRTIO_BALLOON_PFN_SHIFT);
>>>>  }
>>>>  
>>>> +static void virtio_balloon_in_progress(void *opaque, int *status)
>>>> +{
>>>> +    VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
>>>> +    if (cpu_to_le32(dev->actual) != cpu_to_le32(dev->num_pages)) {
>>>> +        *status = 1;
>>>> +        return;
>>>> +    }
>>>> +    *status = 0;
>>>> +}
>>>> +
>>>>  static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
>>>>  {
>>>>      VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
>>>> @@ -409,6 +434,7 @@ static void virtio_balloon_save_device(VirtIODevice *vdev, QEMUFile *f)
>>>>  
>>>>      qemu_put_be32(f, s->num_pages);
>>>>      qemu_put_be32(f, s->actual);
>>>> +    qemu_balloon_bitmap_save(f);
>>>>  }
>>>>  
>>>>  static int virtio_balloon_load(QEMUFile *f, void *opaque, int version_id)
>>>> @@ -426,6 +452,7 @@ static int virtio_balloon_load_device(VirtIODevice *vdev, QEMUFile *f,
>>>>  
>>>>      s->num_pages = qemu_get_be32(f);
>>>>      s->actual = qemu_get_be32(f);
>>>> +    qemu_balloon_bitmap_load(f);
>>>>      return 0;
>>>>  }
>>>>  
>>>> @@ -439,7 +466,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>>>                  sizeof(struct virtio_balloon_config));
>>>>  
>>>>      ret = qemu_add_balloon_handler(virtio_balloon_to_target,
>>>> -                                   virtio_balloon_stat, s);
>>>> +                                   virtio_balloon_stat,
>>>> +                                   virtio_balloon_in_progress, s,
>>>> +                                   VIRTIO_BALLOON_PFN_SHIFT);
>>>>  
>>>>      if (ret < 0) {
>>>>          error_setg(errp, "Only one balloon device is supported");
>>>> @@ -453,6 +482,9 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>>>>  
>>>>      reset_stats(s);
>>>>  
>>>> +    s->migration_state_notifier.notify = virtio_balloon_migration_state_changed;
>>>> +    add_migration_state_change_notifier(&s->migration_state_notifier);
>>>> +
>>>>      register_savevm(dev, "virtio-balloon", -1, 1,
>>>>                      virtio_balloon_save, virtio_balloon_load, s);
>>>>  }
>>>> @@ -462,6 +494,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>>>>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>>>>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
>>>>  
>>>> +    remove_migration_state_change_notifier(&s->migration_state_notifier);
>>>>      balloon_stats_destroy_timer(s);
>>>>      qemu_remove_balloon_handler(s);
>>>>      unregister_savevm(dev, "virtio-balloon", s);
>>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>>>> index 35f62ac..1ded5a9 100644
>>>> --- a/include/hw/virtio/virtio-balloon.h
>>>> +++ b/include/hw/virtio/virtio-balloon.h
>>>> @@ -43,6 +43,7 @@ typedef struct VirtIOBalloon {
>>>>      int64_t stats_last_update;
>>>>      int64_t stats_poll_interval;
>>>>      uint32_t host_features;
>>>> +    Notifier migration_state_notifier;
>>>>  } VirtIOBalloon;
>>>>  
>>>>  #endif
>>>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>>>> index ac2c12c..6c1d1af 100644
>>>> --- a/include/migration/migration.h
>>>> +++ b/include/migration/migration.h
>>>> @@ -267,6 +267,7 @@ void migrate_del_blocker(Error *reason);
>>>>  
>>>>  bool migrate_postcopy_ram(void);
>>>>  bool migrate_zero_blocks(void);
>>>> +bool migrate_skip_balloon(void);
>>>>  
>>>>  bool migrate_auto_converge(void);
>>>>  
>>>> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
>>>> index 3f976b4..5325c38 100644
>>>> --- a/include/sysemu/balloon.h
>>>> +++ b/include/sysemu/balloon.h
>>>> @@ -15,14 +15,27 @@
>>>>  #define _QEMU_BALLOON_H
>>>>  
>>>>  #include "qapi-types.h"
>>>> +#include "migration/qemu-file.h"
>>>>  
>>>>  typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
>>>>  typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
>>>> +typedef void (QEMUBalloonInProgress) (void *opaque, int *status);
>>>>  
>>>>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>>>> -			     QEMUBalloonStatus *stat_func, void *opaque);
>>>> +                             QEMUBalloonStatus *stat_func,
>>>> +                             QEMUBalloonInProgress *progress_func,
>>>> +                             void *opaque, int pfn_shift);
>>>>  void qemu_remove_balloon_handler(void *opaque);
>>>>  bool qemu_balloon_is_inhibited(void);
>>>>  void qemu_balloon_inhibit(bool state);
>>>> +void qemu_mutex_lock_balloon_bitmap(void);
>>>> +void qemu_mutex_unlock_balloon_bitmap(void);
>>>> +void qemu_balloon_reset_bitmap_data(void);
>>>> +void qemu_balloon_bitmap_setup(void);
>>>> +int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new);
>>>> +int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate);
>>>> +int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr);
>>>> +int qemu_balloon_bitmap_save(QEMUFile *f);
>>>> +int qemu_balloon_bitmap_load(QEMUFile *f);
>>>>  
>>>>  #endif
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index 034a918..cb86307 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -1200,6 +1200,15 @@ int migrate_use_xbzrle(void)
>>>>      return s->enabled_capabilities[MIGRATION_CAPABILITY_XBZRLE];
>>>>  }
>>>>  
>>>> +bool migrate_skip_balloon(void)
>>>> +{
>>>> +    MigrationState *s;
>>>> +
>>>> +    s = migrate_get_current();
>>>> +
>>>> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_SKIP_BALLOON];
>>>> +}
>>>> +
>>>>  int64_t migrate_xbzrle_cache_size(void)
>>>>  {
>>>>      MigrationState *s;
>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>> index 704f6a9..161ab73 100644
>>>> --- a/migration/ram.c
>>>> +++ b/migration/ram.c
>>>> @@ -40,6 +40,7 @@
>>>>  #include "trace.h"
>>>>  #include "exec/ram_addr.h"
>>>>  #include "qemu/rcu_queue.h"
>>>> +#include "sysemu/balloon.h"
>>>>  
>>>>  #ifdef DEBUG_MIGRATION_RAM
>>>>  #define DPRINTF(fmt, ...) \
>>>> @@ -65,6 +66,7 @@ static uint64_t bitmap_sync_count;
>>>>  #define RAM_SAVE_FLAG_XBZRLE   0x40
>>>>  /* 0x80 is reserved in migration.h start with 0x100 next */
>>>>  #define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
>>>> +#define RAM_SAVE_FLAG_BALLOON  0x200
>>>>  
>>>>  static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
>>>>  
>>>> @@ -702,13 +704,17 @@ static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>>>>  {
>>>>      int pages = -1;
>>>>  
>>>> -    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>>>> -        acct_info.dup_pages++;
>>>> -        *bytes_transferred += save_page_header(f, block,
>>>> +    if (qemu_balloon_bitmap_test(block, offset) != 1) {
>>>> +        if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>>>> +            acct_info.dup_pages++;
>>>> +            *bytes_transferred += save_page_header(f, block,
>>>>                                                 offset | RAM_SAVE_FLAG_COMPRESS);
>>>> -        qemu_put_byte(f, 0);
>>>> -        *bytes_transferred += 1;
>>>> -        pages = 1;
>>>> +            qemu_put_byte(f, 0);
>>>> +            *bytes_transferred += 1;
>>>> +            pages = 1;
>>>> +        }
>>>> +    } else {
>>>> +        pages = 0;
>>>>      }
>>>>  
>>>>      return pages;
>>>> @@ -773,7 +779,7 @@ static int ram_save_page(QEMUFile *f, PageSearchStatus *pss,
>>>>               * page would be stale
>>>>               */
>>>>              xbzrle_cache_zero_page(current_addr);
>>>> -        } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
>>>> +        } else if (pages != 0 && !ram_bulk_stage && migrate_use_xbzrle()) {
>>>>              pages = save_xbzrle_page(f, &p, current_addr, block,
>>>>                                       offset, last_stage, bytes_transferred);
>>>>              if (!last_stage) {
>>>> @@ -1355,6 +1361,9 @@ static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
>>>>          }
>>>>  
>>>>          if (found) {
>>>> +            /* skip saving ram host page if the corresponding guest page
>>>> +             * is ballooned out
>>>> +             */
>>>>              pages = ram_save_host_page(ms, f, &pss,
>>>>                                         last_stage, bytes_transferred,
>>>>                                         dirty_ram_abs);
>>>> @@ -1959,6 +1968,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>>>  
>>>>      rcu_read_unlock();
>>>>  
>>>> +    qemu_balloon_bitmap_setup();
>>>>      ram_control_before_iterate(f, RAM_CONTROL_SETUP);
>>>>      ram_control_after_iterate(f, RAM_CONTROL_SETUP);
>>>>  
>>>> @@ -1984,6 +1994,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>>>  
>>>>      ram_control_before_iterate(f, RAM_CONTROL_ROUND);
>>>>  
>>>> +    qemu_put_be64(f, RAM_SAVE_FLAG_BALLOON);
>>>> +    qemu_balloon_bitmap_save(f);
>>>> +
>>>>      t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>>>      i = 0;
>>>>      while ((ret = qemu_file_rate_limit(f)) == 0) {
>>>> @@ -2493,6 +2506,10 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>>>>              }
>>>>              break;
>>>>  
>>>> +        case RAM_SAVE_FLAG_BALLOON:
>>>> +            qemu_balloon_bitmap_load(f);
>>>> +            break;
>>>> +
>>>>          case RAM_SAVE_FLAG_COMPRESS:
>>>>              ch = qemu_get_byte(f);
>>>>              ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
>>>> diff --git a/qapi-schema.json b/qapi-schema.json
>>>> index 7f8d799..38163ca 100644
>>>> --- a/qapi-schema.json
>>>> +++ b/qapi-schema.json
>>>> @@ -544,11 +544,14 @@
>>>>  #          been migrated, pulling the remaining pages along as needed. NOTE: If
>>>>  #          the migration fails during postcopy the VM will fail.  (since 2.6)
>>>>  #
>>>> +# @skip-balloon: Skip scanning ram pages released by virtio-balloon driver.
>>>> +#          (since 2.7)
>>>> +#
>>>>  # Since: 1.2
>>>>  ##
>>>>  { 'enum': 'MigrationCapability',
>>>>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>>>> -           'compress', 'events', 'postcopy-ram'] }
>>>> +           'compress', 'events', 'postcopy-ram', 'skip-balloon'] }
>>>>  
>>>>  ##
>>>>  # @MigrationCapabilityStatus
>>>> -- 
>>>> 1.8.3.1
Michael S. Tsirkin April 13, 2016, 11:10 a.m. UTC | #16
On Wed, Apr 13, 2016 at 04:24:55PM +0530, Jitendra Kolhe wrote:
> Can we extend support for post-copy in a different patch set?

If the optimization does not *help* on some paths,
that's fine. The issue is with adding extra code
special-casing protocols:

+    if (migrate_postcopy_ram()) {
+        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
+    }

Generally when one sees that patchset breaks XYZ...
the easy solution is "check for XYZ
and disable the optimization". But do this enough times
and the codebase becomes impossible to reason about.
	why did migration become slower? oh it enabled
	optimization A and that conflicts with optimization B ...


> and use 
> current patch set to support other remaining protocols?

Even disregarding postcopy, I think there were
comments that need to be addressed.
Dr. David Alan Gilbert April 13, 2016, 11:15 a.m. UTC | #17
* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Wed, Apr 13, 2016 at 04:24:55PM +0530, Jitendra Kolhe wrote:
> > Can we extend support for post-copy in a different patch set?
> 
> If the optimization does not *help* on some paths,
> that's fine. The issue is with adding extra code
> special-casing protocols:
> 
> +    if (migrate_postcopy_ram()) {
> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
> +    }
> 
> Generally when one sees that patchset breaks XYZ...
> the easy solution is "check for XYZ
> and disable the optimization". But do this enough times
> and the codebase becomes impossible to reason about.
> 	why did migration become slower? oh it enabled
> 	optimization A and that conflicts with optimization B ...

Hang on; this is getting all very complicated; I wouldn't start tying
this thing up with postcopy yet.  Let's try and keep this simple for starters.

Dave

> 
> 
> > and use 
> > current patch set to support other remaining protocols?
> 
> Even disregarding postcopy, I think there were
> comments that need to be addressed.
> 
> -- 
> MST
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Michael S. Tsirkin April 13, 2016, 11:36 a.m. UTC | #18
On Wed, Apr 13, 2016 at 12:15:38PM +0100, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > On Wed, Apr 13, 2016 at 04:24:55PM +0530, Jitendra Kolhe wrote:
> > > Can we extend support for post-copy in a different patch set?
> > 
> > If the optimization does not *help* on some paths,
> > that's fine. The issue is with adding extra code
> > special-casing protocols:
> > 
> > +    if (migrate_postcopy_ram()) {
> > +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
> > +    }
> > 
> > Generally when one sees that patchset breaks XYZ...
> > the easy solution is "check for XYZ
> > and disable the optimization". But do this enough times
> > and the codebase becomes impossible to reason about.
> > 	why did migration become slower? oh it enabled
> > 	optimization A and that conflicts with optimization B ...
> 
> Hang on; this is getting all very complicated; I wouldn't start tying
> this thing up with postcopy yet.  Let's try and keep this simple for starters.
> 
> Dave

Absolutely, but I don't understand why this is implemented in such
a complex way.
1. keep track of pages in balloon in a bitmap
2. put bitmap in a ram block
3. check that before sending page. if there - it's a zero page

All the complexity is to avoid migrating an extra bit per page,
and it seems like a premature optimization to me at this stage.

> > 
> > 
> > > and use 
> > > current patch set to support other remaining protocols?
> > 
> > Even disregarding postcopy, I think there were
> > comments that need to be addressed.
> > 
> > -- 
> > MST
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
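
The three-step scheme Michael sketches above can be illustrated outside QEMU.
The following is a minimal standalone sketch (not QEMU code; the names,
the flat bitmap representation, and the 4 KiB page size are illustrative
assumptions): a bitmap tracks ballooned pages, inflate/deflate updates it,
and the sender consults it before transmitting a page.

```c
/*
 * Standalone sketch of a balloon page bitmap:
 *   1. keep track of pages in the balloon in a bitmap
 *   2. update it on inflate (set) / deflate (clear)
 *   3. check it before sending a page; if set, treat as a zero page
 */
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define PFN_SHIFT 12  /* VIRTIO_BALLOON_PFN_SHIFT: 4 KiB balloon pages */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

typedef struct {
    unsigned long *bmap;
    unsigned long npages;
} BalloonBitmap;

static BalloonBitmap *bitmap_alloc(unsigned long ram_bytes)
{
    BalloonBitmap *b = malloc(sizeof(*b));
    b->npages = ram_bytes >> PFN_SHIFT;
    b->bmap = calloc((b->npages + BITS_PER_LONG - 1) / BITS_PER_LONG,
                     sizeof(unsigned long));
    return b;
}

/* Steps 1+2: inflate gives the page to the host, deflate reclaims it. */
static void bitmap_update(BalloonBitmap *b, unsigned long addr, bool deflate)
{
    unsigned long pfn = addr >> PFN_SHIFT;
    unsigned long mask = 1UL << (pfn % BITS_PER_LONG);

    if (deflate) {
        b->bmap[pfn / BITS_PER_LONG] &= ~mask;
    } else {
        b->bmap[pfn / BITS_PER_LONG] |= mask;
    }
}

/* Step 3: a set bit means the page can be sent as a zero page. */
static bool page_is_ballooned(const BalloonBitmap *b, unsigned long addr)
{
    unsigned long pfn = addr >> PFN_SHIFT;

    return b->bmap[pfn / BITS_PER_LONG] & (1UL << (pfn % BITS_PER_LONG));
}
```

The real patch additionally guards the test with the in-progress check and
RCU/mutex protection seen in `qemu_balloon_bitmap_test()`; this sketch only
shows the core bookkeeping.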
Jitendra Kolhe April 29, 2016, 10:31 a.m. UTC | #19
On 4/13/2016 5:06 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 13, 2016 at 12:15:38PM +0100, Dr. David Alan Gilbert wrote:
>> * Michael S. Tsirkin (mst@redhat.com) wrote:
>>> On Wed, Apr 13, 2016 at 04:24:55PM +0530, Jitendra Kolhe wrote:
>>>> Can we extend support for post-copy in a different patch set?
>>>
>>> If the optimization does not *help* on some paths,
>>> that's fine. The issue is with adding extra code
>>> special-casing protocols:
>>>
>>> +    if (migrate_postcopy_ram()) {
>>> +        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
>>> +    }
>>>
>>> Generally when one sees that patchset breaks XYZ...
>>> the easy solution is "check for XYZ
>>> and disable the optimization". But do this enough times
>>> and the codebase becomes impossible to reason about.
>>> 	why did migration become slower? oh it enabled
>>> 	optimization A and that conflicts with optimization B ...

Ok, I understand the confusion caused by such checks. I will remove
the migration_postcopy checks from the patch and make sure things
don't break in the postcopy path.

>>
>> Hang on; this is getting all very complicated; I wouldn't start tying
>> this thing up with postcopy yet.  Let's try and keep this simple for starters.
>>
>> Dave
> 
> Absolutely, but I don't understand why is this implemented in such
> a complex way.
> 1. keep track of pages in balloon in a bitmap
> 2. put bitmap in a ram block
> 3. check that before sending page. if there - it's a zero page
> 

Thanks for the suggestion; putting the bitmap in a RAMBlock does remove
the code for migrating the balloon bitmap and for dirty-page tracking of
the bitmap pages themselves. If the feature is turned off, the balloon
bitmap RAMBlock is resized to 0 so that the bitmap is not migrated,
avoiding the overhead.

> All the complexity is to avoid migrating an extra bit per page
> and it seems like a premature optimization to me at this stage.
> 

The major benefit of the optimization is avoiding the zero-page scan for
ballooned-out pages. Skipping migration of the extra bit per ballooned-out
page is just an added benefit, with no overhead as long as it doesn't
break any protocol.

Thanks,
- Jitendra

>>>
>>>
>>>> and use 
>>>> current patch set to support other remaining protocols?
>>>
>>> Even disregarding postcopy, I think there were
>>> comments that need to be addressed.
>>>
>>> -- 
>>> MST
>> --
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
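
The bitmap transfer in `qemu_balloon_bitmap_save()`/`qemu_balloon_bitmap_load()`
in the patch below streams the set-bit ranges of the bitmap as (start, len)
pairs, terminated by a pair with len == 0 when no further set bit exists.
The following is a standalone sketch of that wire format (not QEMU code;
an in-memory array of pairs stands in for QEMUFile, and a `bool` array
stands in for the real bitmap):

```c
/*
 * Run-length encoding of set-bit ranges, mirroring the patch's loop:
 * find_next_bit -> find_next_zero_bit -> emit (start, len), with a
 * trailing (npages, 0) pair when the tail of the bitmap is all clear.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Encode set-bit runs of in[0..npages); returns the number of pairs. */
static size_t encode_runs(const bool *in, size_t npages,
                          size_t *starts, size_t *lens)
{
    size_t n = 0, i = 0;

    while (i < npages) {
        while (i < npages && !in[i]) {        /* find_next_bit */
            i++;
        }
        if (i == npages) {                    /* no set bit left: terminator */
            starts[n] = i;
            lens[n] = 0;
            n++;
            break;
        }
        size_t start = i;
        while (i < npages && in[i]) {         /* find_next_zero_bit */
            i++;
        }
        starts[n] = start;
        lens[n] = i - start;
        n++;
    }
    return n;
}

/* Decode the pairs back into a bitmap, stopping at a len == 0 pair. */
static void decode_runs(const size_t *starts, const size_t *lens,
                        size_t npairs, bool *out)
{
    for (size_t p = 0; p < npairs; p++) {
        if (lens[p] == 0) {
            break;
        }
        for (size_t i = 0; i < lens[p]; i++) {
            out[starts[p] + i] = true;        /* bitmap_set(out, start, len) */
        }
    }
}
```

This also shows why subsequent iterations in the patch only resend the
range between the recorded minimum and maximum dirtied offsets: runs
outside that window cannot have changed since the last transfer.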

Patch

diff --git a/balloon.c b/balloon.c
index f2ef50c..1c2d228 100644
--- a/balloon.c
+++ b/balloon.c
@@ -33,11 +33,34 @@ 
 #include "qmp-commands.h"
 #include "qapi/qmp/qerror.h"
 #include "qapi/qmp/qjson.h"
+#include "exec/ram_addr.h"
+#include "migration/migration.h"
+
+#define BALLOON_BITMAP_DISABLE_FLAG -1UL
+
+typedef enum {
+    BALLOON_BITMAP_DISABLE_NONE = 1, /* Enabled */
+    BALLOON_BITMAP_DISABLE_CURRENT,
+    BALLOON_BITMAP_DISABLE_PERMANENT,
+} BalloonBitmapDisableState;
 
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
+static QEMUBalloonInProgress *balloon_in_progress_fn;
 static void *balloon_opaque;
 static bool balloon_inhibited;
+static unsigned long balloon_bitmap_pages;
+static unsigned int  balloon_bitmap_pfn_shift;
+static QemuMutex balloon_bitmap_mutex;
+static bool balloon_bitmap_xfered;
+static unsigned long balloon_min_bitmap_offset;
+static unsigned long balloon_max_bitmap_offset;
+static BalloonBitmapDisableState balloon_bitmap_disable_state;
+
+static struct BitmapRcu {
+    struct rcu_head rcu;
+    unsigned long *bmap;
+} *balloon_bitmap_rcu;
 
 bool qemu_balloon_is_inhibited(void)
 {
@@ -49,6 +72,21 @@  void qemu_balloon_inhibit(bool state)
     balloon_inhibited = state;
 }
 
+void qemu_mutex_lock_balloon_bitmap(void)
+{
+    qemu_mutex_lock(&balloon_bitmap_mutex);
+}
+
+void qemu_mutex_unlock_balloon_bitmap(void)
+{
+    qemu_mutex_unlock(&balloon_bitmap_mutex);
+}
+
+void qemu_balloon_reset_bitmap_data(void)
+{
+    balloon_bitmap_xfered = false;
+}
+
 static bool have_balloon(Error **errp)
 {
     if (kvm_enabled() && !kvm_has_sync_mmu()) {
@@ -65,9 +103,12 @@  static bool have_balloon(Error **errp)
 }
 
 int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
-                             QEMUBalloonStatus *stat_func, void *opaque)
+                             QEMUBalloonStatus *stat_func,
+                             QEMUBalloonInProgress *in_progress_func,
+                             void *opaque, int pfn_shift)
 {
-    if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
+    if (balloon_event_fn || balloon_stat_fn ||
+        balloon_in_progress_fn || balloon_opaque) {
         /* We're already registered one balloon handler.  How many can
          * a guest really have?
          */
@@ -75,17 +116,39 @@  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
     }
     balloon_event_fn = event_func;
     balloon_stat_fn = stat_func;
+    balloon_in_progress_fn = in_progress_func;
     balloon_opaque = opaque;
+
+    qemu_mutex_init(&balloon_bitmap_mutex);
+    balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_NONE;
+    balloon_bitmap_pfn_shift = pfn_shift;
+    balloon_bitmap_pages = (last_ram_offset() >> balloon_bitmap_pfn_shift);
+    balloon_bitmap_rcu = g_new0(struct BitmapRcu, 1);
+    balloon_bitmap_rcu->bmap = bitmap_new(balloon_bitmap_pages);
+    bitmap_clear(balloon_bitmap_rcu->bmap, 0, balloon_bitmap_pages);
+
     return 0;
 }
 
+static void balloon_bitmap_free(struct BitmapRcu *bmap)
+{
+    g_free(bmap->bmap);
+    g_free(bmap);
+}
+
 void qemu_remove_balloon_handler(void *opaque)
 {
+    struct BitmapRcu *bitmap = balloon_bitmap_rcu;
     if (balloon_opaque != opaque) {
         return;
     }
+    atomic_rcu_set(&balloon_bitmap_rcu, NULL);
+    if (bitmap) {
+        call_rcu(bitmap, balloon_bitmap_free, rcu);
+    }
     balloon_event_fn = NULL;
     balloon_stat_fn = NULL;
+    balloon_in_progress_fn = NULL;
     balloon_opaque = NULL;
 }
 
@@ -116,3 +179,189 @@  void qmp_balloon(int64_t target, Error **errp)
     trace_balloon_event(balloon_opaque, target);
     balloon_event_fn(balloon_opaque, target);
 }
+
+/* Handle Ram hotplug case, only called in case old < new */
+int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new)
+{
+    struct BitmapRcu *old_bitmap = balloon_bitmap_rcu, *bitmap;
+    unsigned long old_offset, new_offset;
+
+    if (!balloon_bitmap_rcu) {
+        return -1;
+    }
+
+    old_offset = (old >> balloon_bitmap_pfn_shift);
+    new_offset = (new >> balloon_bitmap_pfn_shift);
+
+    bitmap = g_new(struct BitmapRcu, 1);
+    bitmap->bmap = bitmap_new(new_offset);
+
+    qemu_mutex_lock_balloon_bitmap();
+    bitmap_clear(bitmap->bmap, 0,
+                 balloon_bitmap_pages + new_offset - old_offset);
+    bitmap_copy(bitmap->bmap, old_bitmap->bmap, old_offset);
+
+    atomic_rcu_set(&balloon_bitmap_rcu, bitmap);
+    balloon_bitmap_pages += new_offset - old_offset;
+    qemu_mutex_unlock_balloon_bitmap();
+    call_rcu(old_bitmap, balloon_bitmap_free, rcu);
+
+    return 0;
+}
+
+/* Should be called with balloon bitmap mutex lock held */
+int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate)
+{
+    unsigned long *bitmap;
+    unsigned long offset = 0;
+
+    if (!balloon_bitmap_rcu) {
+        return -1;
+    }
+    offset = (addr >> balloon_bitmap_pfn_shift);
+    if (balloon_bitmap_xfered) {
+        if (offset < balloon_min_bitmap_offset) {
+            balloon_min_bitmap_offset = offset;
+        }
+        if (offset > balloon_max_bitmap_offset) {
+            balloon_max_bitmap_offset = offset;
+        }
+    }
+
+    rcu_read_lock();
+    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
+    if (deflate == 0) {
+        set_bit(offset, bitmap);
+    } else {
+        clear_bit(offset, bitmap);
+    }
+    rcu_read_unlock();
+    return 0;
+}
+
+void qemu_balloon_bitmap_setup(void)
+{
+    if (migrate_postcopy_ram()) {
+        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
+    } else if ((!balloon_bitmap_rcu || !migrate_skip_balloon()) &&
+               (balloon_bitmap_disable_state !=
+                BALLOON_BITMAP_DISABLE_PERMANENT)) {
+        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_CURRENT;
+    }
+}
+
+int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr)
+{
+    unsigned long *bitmap;
+    ram_addr_t base;
+    unsigned long nr = 0;
+    int ret = 0;
+
+    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_CURRENT ||
+        balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
+        return 0;
+    }
+    balloon_in_progress_fn(balloon_opaque, &ret);
+    if (ret == 1) {
+        return 0;
+    }
+
+    rcu_read_lock();
+    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
+    base = rb->offset >> balloon_bitmap_pfn_shift;
+    nr = base + (addr >> balloon_bitmap_pfn_shift);
+    if (test_bit(nr, bitmap)) {
+        ret = 1;
+    }
+    rcu_read_unlock();
+    return ret;
+}
+
+int qemu_balloon_bitmap_save(QEMUFile *f)
+{
+    unsigned long *bitmap;
+    unsigned long offset = 0, next = 0, len = 0;
+    unsigned long tmpoffset = 0, tmplimit = 0;
+
+    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERMANENT) {
+        qemu_put_be64(f, BALLOON_BITMAP_DISABLE_FLAG);
+        return 0;
+    }
+
+    qemu_mutex_lock_balloon_bitmap();
+    if (balloon_bitmap_xfered) {
+        tmpoffset = balloon_min_bitmap_offset;
+        tmplimit  = balloon_max_bitmap_offset;
+    } else {
+        balloon_bitmap_xfered = true;
+        tmpoffset = offset;
+        tmplimit  = balloon_bitmap_pages;
+    }
+
+    balloon_min_bitmap_offset = balloon_bitmap_pages;
+    balloon_max_bitmap_offset = 0;
+
+    qemu_put_be64(f, balloon_bitmap_pages);
+    qemu_put_be64(f, tmpoffset);
+    qemu_put_be64(f, tmplimit);
+    rcu_read_lock();
+    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
+    while (tmpoffset < tmplimit) {
+        unsigned long next_set_bit, start_set_bit;
+        next_set_bit = find_next_bit(bitmap, balloon_bitmap_pages, tmpoffset);
+        start_set_bit = next_set_bit;
+        if (next_set_bit == balloon_bitmap_pages) {
+            len = 0;
+            next = start_set_bit;
+            qemu_put_be64(f, next);
+            qemu_put_be64(f, len);
+            break;
+        }
+        next_set_bit = find_next_zero_bit(bitmap,
+                                          balloon_bitmap_pages,
+                                          ++next_set_bit);
+        len = (next_set_bit - start_set_bit);
+        next = start_set_bit;
+        qemu_put_be64(f, next);
+        qemu_put_be64(f, len);
+        tmpoffset = next + len;
+    }
+    rcu_read_unlock();
+    qemu_mutex_unlock_balloon_bitmap();
+    return 0;
+}
+
+int qemu_balloon_bitmap_load(QEMUFile *f)
+{
+    unsigned long *bitmap;
+    unsigned long next = 0, len = 0;
+    unsigned long tmpoffset = 0, tmplimit = 0, tmppages = 0;
+
+    if (!balloon_bitmap_rcu) {
+        return -1;
+    }
+
+    qemu_mutex_lock_balloon_bitmap();
+    tmppages = qemu_get_be64(f);
+    if (tmppages == BALLOON_BITMAP_DISABLE_FLAG) {
+        /* The source has permanently disabled bitmap tracking; mirror
+         * that state here without clobbering balloon_bitmap_pages.
+         */
+        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERMANENT;
+        qemu_mutex_unlock_balloon_bitmap();
+        return 0;
+    }
+    balloon_bitmap_pages = tmppages;
+    tmpoffset = qemu_get_be64(f);
+    tmplimit  = qemu_get_be64(f);
+    rcu_read_lock();
+    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
+    while (tmpoffset < tmplimit) {
+        next = qemu_get_be64(f);
+        len  = qemu_get_be64(f);
+        if (len == 0) {
+            break;
+        }
+        bitmap_set(bitmap, next, len);
+        tmpoffset = next + len;
+    }
+    rcu_read_unlock();
+    qemu_mutex_unlock_balloon_bitmap();
+    return 0;
+}
diff --git a/exec.c b/exec.c
index f398d21..7a448e5 100644
--- a/exec.c
+++ b/exec.c
@@ -43,6 +43,7 @@ 
 #else /* !CONFIG_USER_ONLY */
 #include "sysemu/xen-mapcache.h"
 #include "trace.h"
+#include "sysemu/balloon.h"
 #endif
 #include "exec/cpu-all.h"
 #include "qemu/rcu_queue.h"
@@ -1610,6 +1611,8 @@  static void ram_block_add(RAMBlock *new_block, Error **errp)
     if (new_ram_size > old_ram_size) {
         migration_bitmap_extend(old_ram_size, new_ram_size);
         dirty_memory_extend(old_ram_size, new_ram_size);
+        qemu_balloon_bitmap_extend(old_ram_size << TARGET_PAGE_BITS,
+                                   new_ram_size << TARGET_PAGE_BITS);
     }
     /* Keep the list sorted from biggest to smallest block.  Unlike QTAILQ,
      * QLIST (which has an RCU-friendly variant) does not have insertion at
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 22ad25c..9f3a4c8 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -27,6 +27,7 @@ 
 #include "qapi/visitor.h"
 #include "qapi-event.h"
 #include "trace.h"
+#include "migration/migration.h"
 
 #if defined(__linux__)
 #include <sys/mman.h>
@@ -214,11 +215,13 @@  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     VirtQueueElement *elem;
     MemoryRegionSection section;
 
+    qemu_mutex_lock_balloon_bitmap();
     for (;;) {
         size_t offset = 0;
         uint32_t pfn;
         elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
         if (!elem) {
+            qemu_mutex_unlock_balloon_bitmap();
             return;
         }
 
@@ -242,6 +245,7 @@  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
             addr = section.offset_within_region;
             balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
                          !!(vq == s->dvq));
+            qemu_balloon_bitmap_update(addr, !!(vq == s->dvq));
             memory_region_unref(section.mr);
         }
 
@@ -249,6 +253,7 @@  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
         virtio_notify(vdev, vq);
         g_free(elem);
     }
+    qemu_mutex_unlock_balloon_bitmap();
 }
 
 static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
@@ -303,6 +308,16 @@  out:
     }
 }
 
+static void virtio_balloon_migration_state_changed(Notifier *notifier,
+                                                   void *data)
+{
+    MigrationState *mig = data;
+
+    if (migration_has_failed(mig)) {
+        qemu_balloon_reset_bitmap_data();
+    }
+}
+
 static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
 {
     VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
@@ -382,6 +397,16 @@  static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
                                              VIRTIO_BALLOON_PFN_SHIFT);
 }
 
+static void virtio_balloon_in_progress(void *opaque, int *status)
+{
+    VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
+
+    /* cpu_to_le32() on both sides of the comparison is a no-op, so
+     * compare the fields directly: the balloon is still inflating or
+     * deflating until the guest has reached the requested target.
+     */
+    *status = (dev->actual != dev->num_pages) ? 1 : 0;
+}
+
 static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
 {
     VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
@@ -409,6 +434,7 @@  static void virtio_balloon_save_device(VirtIODevice *vdev, QEMUFile *f)
 
     qemu_put_be32(f, s->num_pages);
     qemu_put_be32(f, s->actual);
+    qemu_balloon_bitmap_save(f);
 }
 
 static int virtio_balloon_load(QEMUFile *f, void *opaque, int version_id)
@@ -426,6 +452,7 @@  static int virtio_balloon_load_device(VirtIODevice *vdev, QEMUFile *f,
 
     s->num_pages = qemu_get_be32(f);
     s->actual = qemu_get_be32(f);
+    qemu_balloon_bitmap_load(f);
     return 0;
 }
 
@@ -439,7 +466,9 @@  static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
                 sizeof(struct virtio_balloon_config));
 
     ret = qemu_add_balloon_handler(virtio_balloon_to_target,
-                                   virtio_balloon_stat, s);
+                                   virtio_balloon_stat,
+                                   virtio_balloon_in_progress, s,
+                                   VIRTIO_BALLOON_PFN_SHIFT);
 
     if (ret < 0) {
         error_setg(errp, "Only one balloon device is supported");
@@ -453,6 +482,9 @@  static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
 
     reset_stats(s);
 
+    s->migration_state_notifier.notify = virtio_balloon_migration_state_changed;
+    add_migration_state_change_notifier(&s->migration_state_notifier);
+
     register_savevm(dev, "virtio-balloon", -1, 1,
                     virtio_balloon_save, virtio_balloon_load, s);
 }
@@ -462,6 +494,7 @@  static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOBalloon *s = VIRTIO_BALLOON(dev);
 
+    remove_migration_state_change_notifier(&s->migration_state_notifier);
     balloon_stats_destroy_timer(s);
     qemu_remove_balloon_handler(s);
     unregister_savevm(dev, "virtio-balloon", s);
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 35f62ac..1ded5a9 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -43,6 +43,7 @@  typedef struct VirtIOBalloon {
     int64_t stats_last_update;
     int64_t stats_poll_interval;
     uint32_t host_features;
+    Notifier migration_state_notifier;
 } VirtIOBalloon;
 
 #endif
diff --git a/include/migration/migration.h b/include/migration/migration.h
index ac2c12c..6c1d1af 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -267,6 +267,7 @@  void migrate_del_blocker(Error *reason);
 
 bool migrate_postcopy_ram(void);
 bool migrate_zero_blocks(void);
+bool migrate_skip_balloon(void);
 
 bool migrate_auto_converge(void);
 
diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
index 3f976b4..5325c38 100644
--- a/include/sysemu/balloon.h
+++ b/include/sysemu/balloon.h
@@ -15,14 +15,27 @@ 
 #define _QEMU_BALLOON_H
 
 #include "qapi-types.h"
+#include "migration/qemu-file.h"
 
 typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
 typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
+typedef void (QEMUBalloonInProgress) (void *opaque, int *status);
 
 int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
-			     QEMUBalloonStatus *stat_func, void *opaque);
+                             QEMUBalloonStatus *stat_func,
+                             QEMUBalloonInProgress *progress_func,
+                             void *opaque, int pfn_shift);
 void qemu_remove_balloon_handler(void *opaque);
 bool qemu_balloon_is_inhibited(void);
 void qemu_balloon_inhibit(bool state);
+void qemu_mutex_lock_balloon_bitmap(void);
+void qemu_mutex_unlock_balloon_bitmap(void);
+void qemu_balloon_reset_bitmap_data(void);
+void qemu_balloon_bitmap_setup(void);
+int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new);
+int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate);
+int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr);
+int qemu_balloon_bitmap_save(QEMUFile *f);
+int qemu_balloon_bitmap_load(QEMUFile *f);
 
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 034a918..cb86307 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1200,6 +1200,15 @@  int migrate_use_xbzrle(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_XBZRLE];
 }
 
+bool migrate_skip_balloon(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_SKIP_BALLOON];
+}
+
 int64_t migrate_xbzrle_cache_size(void)
 {
     MigrationState *s;
diff --git a/migration/ram.c b/migration/ram.c
index 704f6a9..161ab73 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -40,6 +40,7 @@ 
 #include "trace.h"
 #include "exec/ram_addr.h"
 #include "qemu/rcu_queue.h"
+#include "sysemu/balloon.h"
 
 #ifdef DEBUG_MIGRATION_RAM
 #define DPRINTF(fmt, ...) \
@@ -65,6 +66,7 @@  static uint64_t bitmap_sync_count;
 #define RAM_SAVE_FLAG_XBZRLE   0x40
 /* 0x80 is reserved in migration.h start with 0x100 next */
 #define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
+#define RAM_SAVE_FLAG_BALLOON  0x200
 
 static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
 
@@ -702,13 +704,17 @@  static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
 {
     int pages = -1;
 
-    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
-        acct_info.dup_pages++;
-        *bytes_transferred += save_page_header(f, block,
+    if (qemu_balloon_bitmap_test(block, offset) != 1) {
+        if (is_zero_range(p, TARGET_PAGE_SIZE)) {
+            acct_info.dup_pages++;
+            *bytes_transferred += save_page_header(f, block,
                                                offset | RAM_SAVE_FLAG_COMPRESS);
-        qemu_put_byte(f, 0);
-        *bytes_transferred += 1;
-        pages = 1;
+            qemu_put_byte(f, 0);
+            *bytes_transferred += 1;
+            pages = 1;
+        }
+    } else {
+        pages = 0;
     }
 
     return pages;
@@ -773,7 +779,7 @@  static int ram_save_page(QEMUFile *f, PageSearchStatus *pss,
              * page would be stale
              */
             xbzrle_cache_zero_page(current_addr);
-        } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
+        } else if (pages != 0 && !ram_bulk_stage && migrate_use_xbzrle()) {
             pages = save_xbzrle_page(f, &p, current_addr, block,
                                      offset, last_stage, bytes_transferred);
             if (!last_stage) {
@@ -1355,6 +1361,9 @@  static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
         }
 
         if (found) {
+            /* Host pages whose guest pages have been ballooned out are
+             * skipped inside save_zero_page(), via
+             * qemu_balloon_bitmap_test().
+             */
             pages = ram_save_host_page(ms, f, &pss,
                                        last_stage, bytes_transferred,
                                        dirty_ram_abs);
@@ -1959,6 +1968,7 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
 
     rcu_read_unlock();
 
+    qemu_balloon_bitmap_setup();
     ram_control_before_iterate(f, RAM_CONTROL_SETUP);
     ram_control_after_iterate(f, RAM_CONTROL_SETUP);
 
@@ -1984,6 +1994,9 @@  static int ram_save_iterate(QEMUFile *f, void *opaque)
 
     ram_control_before_iterate(f, RAM_CONTROL_ROUND);
 
+    qemu_put_be64(f, RAM_SAVE_FLAG_BALLOON);
+    qemu_balloon_bitmap_save(f);
+
     t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     i = 0;
     while ((ret = qemu_file_rate_limit(f)) == 0) {
@@ -2493,6 +2506,10 @@  static int ram_load(QEMUFile *f, void *opaque, int version_id)
             }
             break;
 
+        case RAM_SAVE_FLAG_BALLOON:
+            qemu_balloon_bitmap_load(f);
+            break;
+
         case RAM_SAVE_FLAG_COMPRESS:
             ch = qemu_get_byte(f);
             ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
diff --git a/qapi-schema.json b/qapi-schema.json
index 7f8d799..38163ca 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -544,11 +544,14 @@ 
 #          been migrated, pulling the remaining pages along as needed. NOTE: If
 #          the migration fails during postcopy the VM will fail.  (since 2.6)
 #
+# @skip-balloon: Skip scanning and sending guest RAM pages that have been
+#          released (ballooned out) by the virtio-balloon driver.
+#          (since 2.7)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
   'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
-           'compress', 'events', 'postcopy-ram'] }
+           'compress', 'events', 'postcopy-ram', 'skip-balloon'] }
 
 ##
 # @MigrationCapabilityStatus