
[v11,20/27] Support colo mode for qemu disk

Message ID 1457080891-26054-21-git-send-email-xiecl.fnst@cn.fujitsu.com (mailing list archive)
State New, archived

Commit Message

Changlong Xie March 4, 2016, 8:41 a.m. UTC
From: Wen Congyang <wency@cn.fujitsu.com>

Usage: disk = ['...,colo,colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
For QEMU block replication details:
http://wiki.qemu.org/Features/BlockReplication

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
---
 docs/man/xl.pod.1                   |  36 ++++++-
 docs/misc/xl-disk-configuration.txt |  53 +++++++++++
 tools/libxl/libxl.c                 |  64 ++++++++++++-
 tools/libxl/libxl_create.c          |  25 ++++-
 tools/libxl/libxl_device.c          |  54 +++++++++++
 tools/libxl/libxl_dm.c              | 184 ++++++++++++++++++++++++++++++++++--
 tools/libxl/libxl_types.idl         |   7 ++
 tools/libxl/libxlu_disk_l.l         |   7 ++
 8 files changed, 418 insertions(+), 12 deletions(-)

Comments

Ian Jackson March 4, 2016, 5:44 p.m. UTC | #1
Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> Usage: disk = ['...,colo,colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
> For QEMU block replication details:
> http://wiki.qemu.org/Features/BlockReplication

So now I am slightly confused by the design, I think.

When you replicate a VM with COLO using xl, its memory state is
transferred over ssh.  But its disk replication is done unencrypted
and unauthenticated ?

And the disk replication is out of band, and needs to be configured
separately ?  This is rather awkward, although maybe not a
showstopper.  (Maybe we can have a plan to fix it in the future...)

And, how does the disk replication, which doesn't depend on there
being xl running, relate to the vm state replication, which does ?  I
think at the very least I'd like to see some information about the
principles of operation - either explained, or referred to, in the
user manual.

Is it possible to use COLO with an existing full-service disk
replication service such as DRBD ?

> +(a) An example of a COLO replication configuration: disk = ['...,colo,
> +colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
> +
> +=item B<colo-host>      :Secondary host's IP address.
> +
> +=item B<colo-port>      :Secondary host's port. We will run an NBD server on
> +the secondary host, and the NBD server will listen on this port.
> +
> +=item B<colo-export>    :The NBD server's disk export name on the secondary
> +host.
> +
> +=item B<active-disk>    :The secondary guest's writes will be buffered in this
> +disk; it is used only by the secondary.
> +
> +=item B<hidden-disk>    :Content modified by the primary will be buffered in
> +this disk; it is used only by the secondary.

What would a typical configuration look like ?  I don't understand the
relationship between active-disk and hidden-disk, etc.

> +colo-host
> +---------
> +
> +Description:           Secondary host's address
> +Mandatory:             Yes when COLO enabled

Is it permitted to specify a host DNS name ?

> +    if (libxl_defbool_val(disk->colo_enable)) {
> +        tmp = xs_read(ctx->xsh, XBT_NULL,
> +                      GCSPRINTF("%s/colo-port", be_path), &len);
> +        if (!tmp) {
> +            LOG(ERROR, "Missing xenstore node %s/colo-port", be_path);
> +            goto cleanup;
> +        }
> +        disk->colo_port = tmp;

This is quite repetitive code.


> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> index 3610a39..bff08b0 100644
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -1800,12 +1800,29 @@ static void domain_create_cb(libxl__egc *egc,
...
> @@ -256,6 +280,36 @@ static int disk_try_backend(disk_try_backend_args *a,
>      LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...",
>          a->disk->vdev, libxl_disk_backend_to_string(backend));
>      return 0;
> +
> + bad_colo:
> +    LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with colo",
> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
> +    return 0;

This is correct here, I think.

> + bad_colo_host:
> +    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-host=... for colo",
> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
> +    return 0;

I think these options should be checked later.  disk_try_backend isn't
the place for general parameter checking; it is searching for which
backend to try.

>  int libxl__device_disk_set_backend(libxl__gc *gc, libxl_device_disk *disk) {
> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
> index 4aca38e..ba17251 100644
> --- a/tools/libxl/libxl_dm.c
> +++ b/tools/libxl/libxl_dm.c
> @@ -751,6 +751,139 @@ static int libxl__dm_runas_helper(libxl__gc *gc, const char *username)
...
> +static char *qemu_disk_scsi_drive_string(libxl__gc *gc, const char *pdev_path,
> +                                         int unit, const char *format,
> +                                         const libxl_device_disk *disk,
> +                                         int colo_mode)
> +{
> +    char *drive = NULL;
> +    const char *exportname = disk->colo_export;
> +    const char *active_disk = disk->active_disk;
> +    const char *hidden_disk = disk->hidden_disk;
> +
> +    switch (colo_mode) {
> +    case LIBXL__COLO_NONE:
> +        drive = libxl__sprintf
> +            (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
> +             pdev_path, unit, format);

I think this would be a lot clearer if the refactoring was split into
a separate patch.

>                  if (strncmp(disks[i].vdev, "sd", 2) == 0) {
> -                    drive = libxl__sprintf
> -                        (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,readonly=%s,cache=writeback",
> -                         pdev_path, disk, format, disks[i].readwrite ? "off" : "on");
> +                    if (colo_mode == LIBXL__COLO_SECONDARY) {
> +                        /*
> +                         * -drive if=none,driver=format,file=pdev_path,\
> +                         * id=exportname
> +                         */

I think this comment adds nothing to the code and could be profitably
omitted.

> +                        drive = libxl__sprintf
> +                            (gc, "if=none,driver=%s,file=%s,id=%s",
> +                             format, pdev_path, disks[i].colo_export);

I don't understand how this works here.  COLO_SECONDARY seems to
suppress the rest of the disk specification.

Also, this same logic seems to appear many times.  Maybe it could be
centralised.  Perhaps I would be able to advise more clearly if I
understood how this was put together.

> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> index 9b0a537..a2078d1 100644
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -575,6 +575,13 @@ libxl_device_disk = Struct("device_disk", [
>      ("is_cdrom", integer),
>      ("direct_io_safe", bool),
>      ("discard_enable", libxl_defbool),
> +    ("colo_enable", libxl_defbool),
> +    ("colo_restore_enable", libxl_defbool),
> +    ("colo_host", string),
> +    ("colo_port", string),
> +    ("colo_export", string),
> +    ("active_disk", string),
> +    ("hidden_disk", string)

In general, many of these should probably not be strings.  Certainly the
port should be an integer.  I don't quite understand the semantics of
the others.

Ian.
Ian Jackson March 4, 2016, 5:52 p.m. UTC | #2
Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
> +Enable COLO HA for the disk. For a better understanding of block
> +replication in QEMU, please refer to:
> +http://wiki.qemu.org/Features/BlockReplication

Sorry, I missed this link on my first pass.  I still think that at the
very least this needs something more user-facing (ie, how should one
set this up).

But, I'm kind of worried that qemu is the wrong place to be doing
this.

How can this be made to work with PV guests ?

What if an HVM guest has PV-on-HVM drivers ?  In this case there might
be two relevant qemus, one for the qdisk Xen PV block backend, and one
for the emulated IDE.

I don't understand how discrepant writes are detected.  Surely they
might occur and should trigger a resynch ?

Ian.
Konrad Rzeszutek Wilk March 4, 2016, 8:30 p.m. UTC | #3
On Fri, Mar 04, 2016 at 05:52:09PM +0000, Ian Jackson wrote:
> Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
> > +Enable COLO HA for the disk. For a better understanding of block
> > +replication in QEMU, please refer to:
> > +http://wiki.qemu.org/Features/BlockReplication
> 
> Sorry, I missed this link on my first pass.  I still think that at the
> very least this needs something more user-facing (ie, how should one
> set this up).
> 
> But, I'm kind of worried that qemu is the wrong place to be doing
> this.
> 
> How can this be made to work with PV guests ?

QEMU can also serve PV guests (qdisk).

I think your question is more of - what about making this work with
PV block backend?
> 
> What if an HVM guest has PV-on-HVM drivers ?  In this case there might
> be two relevant qemus, one for the qdisk Xen PV block backend, and one
> for the emulated IDE.

In both cases QEMU would use the same underlying API to actually write/read
out the blocks. That API would then use NBD, etc. to replicate writes.

Maybe a little ASCII art?

	qdisk  ide
	  \    /
           \  /
           block API
            |
           QCOW2
            |
           NBD

Or such?

> 
> I don't understand how discrepant writes are detected.  Surely they
> might occur and should trigger a resynch ?
> 
> Ian.
Wen Congyang March 7, 2016, 2:06 a.m. UTC | #4
On 03/05/2016 01:44 AM, Ian Jackson wrote:
> Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Usage: disk = ['...,colo,colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
>> For QEMU block replication details:
>> http://wiki.qemu.org/Features/BlockReplication
> 
> So now I am slightly confused by the design, I think.
> 
> When you replicate a VM with COLO using xl, its memory state is
> transferred over ssh.  But its disk replication is done unencrypted
> and unauthenticated ?

Yes, it is a problem. I will think about how to improve it.

> 
> And the disk replication is out of band, and needs to be configured
> separately ?  This is rather awkward, although maybe not a
> showstopper.  (Maybe we can have a plan to fix it in the future...)

colo-host and colo-port should be global configuration. colo-export,
active-disk and hidden-disk must be configured separately, because each
disk needs a different configuration.

> 
> And, how does the disk replication, which doesn't depend on there
> being xl running, relate to the vm state replication, which does ?  I
> think at the very least I'd like to see some information about the
> principles of operation - either explained, or referred to, in the
> user manual.

OK. The disk replication doesn't depend on xl; we can only operate it
via QEMU monitor commands:
1. stop the VM
2. do the checkpoint
3. start the VM
Steps 1 and 3 suspend/resume the guest. We only need to do step 2 when
both VMs are in a consistent state.
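
In monitor terms the shape of one checkpoint is roughly (a sketch only:
"stop" and "cont" are the ordinary monitor commands; the
replication-specific checkpoint step in the middle is elided here):

    (qemu) stop
    ... replication-specific checkpoint commands ...
    (qemu) cont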

> 
> Is it possible to use COLO with an existing full-service disk
> replication service such as DRBD ?

DRBD doesn't support a case like COLO's, because both the primary guest
and the secondary guest need to write to the disk.

> 
>> +(a) An example of a COLO replication configuration: disk = ['...,colo,
>> +colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
>> +
>> +=item B<colo-host>      :Secondary host's IP address.
>> +
>> +=item B<colo-port>      :Secondary host's port. We will run an NBD server on
>> +the secondary host, and the NBD server will listen on this port.
>> +
>> +=item B<colo-export>    :The NBD server's disk export name on the secondary
>> +host.
>> +
>> +=item B<active-disk>    :The secondary guest's writes will be buffered in this
>> +disk; it is used only by the secondary.
>> +
>> +=item B<hidden-disk>    :Content modified by the primary will be buffered in
>> +this disk; it is used only by the secondary.
> 
> What would a typical configuration look like ?  I don't understand the
> relationship between active-disk and hidden-disk, etc.

QEMU has a feature: backing files.
For example, suppose A's backing file is B:
1. If we read from A but the sector is not allocated in A, we would return a
   zero sector to the guest. If A has a backing file, we read the sector from
   B instead of returning a zero sector.
2. The backing file doesn't affect write operations.

QEMU has another feature: the backup block job.
A backup job has two files: one is the source and the other is the target. It
has several running modes. For block replication, we use the mode "sync=none".
In this mode, we read the data from the source disk before we modify it, and
write it to the target disk. We keep a bitmap to remember which sectors have
been backed up from the source disk to the target disk. If the target disk is
an empty disk, and the empty disk's backing file is the source disk, we can
read from the target disk to get the source disk's original data.


How does block replication work?
A. Primary qemu:
1. Use the block driver quorum: it reads from all children and writes to all
   children.
   child 0: real disk
   child 1: NBD client
   Reading from child 1 would fail, so we use the fifo mode. In this mode all
   reads go to child 0 (which succeeds), and child 1 is never read.
   Writes to child 1: because child 1 is an NBD client, it forwards the write
   request to the NBD server.

B. Secondary qemu:
We have 3 disks: the active disk (call it A), the hidden disk (call it H),
and the secondary disk (the real disk, call it S).
A's backing file is H, and H's backing file is S.
We also start a backup job: the source disk is S, and the target disk is H.
We run an NBD server in the secondary qemu, and the NBD server writes to S.

Before resuming both the primary VM and the secondary VM, the state is:
1. The primary disk and the secondary disk are in a consistent state (they
   contain the same data).
2. The active disk and the hidden disk are empty.
When the guest is running:
1. The NBD server receives the primary's write operations and writes the data
   to S.
2. Before we write data to S, the backup job reads the original data and backs
   it up to H.
3. The secondary VM writes data to A.
4. When the secondary VM reads data from A:
   I. If the sector is allocated in A, read it from A.
  II. Otherwise, the secondary VM has not modified this sector since it was
      last resumed.
 III. In this case, we read it from H. We can read S's original data from H
      (see the explanation of the backup job above).

If we have more than one real disk, we can use the export name to tag each
disk. Each pair of primary disk and secondary disk should have the same
export name.
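
To answer the "typical configuration" question concretely, a hypothetical
xl disk line using these options might look like this (the path, address,
port, export name and file names are all invented for the example, not
taken from the patch):

    disk = [ '/dev/vg/guest,raw,sda,w,colo,colo-host=192.168.1.2,colo-port=9000,colo-export=disk0,active-disk=/mnt/colo/active.qcow2,hidden-disk=/mnt/colo/hidden.qcow2' ]

The same disk specification is used on both hosts; whether it is treated as
primary or secondary is selected at restore time (colo_restore_enable in
this series), not by separate options.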

> 
>> +colo-host
>> +---------
>> +
>> +Description:           Secondary host's address
>> +Mandatory:             Yes when COLO enabled
> 
> Is it permitted to specify a host DNS name ?

IIRC, I think it is OK.

> 
>> +    if (libxl_defbool_val(disk->colo_enable)) {
>> +        tmp = xs_read(ctx->xsh, XBT_NULL,
>> +                      GCSPRINTF("%s/colo-port", be_path), &len);
>> +        if (!tmp) {
>> +            LOG(ERROR, "Missing xenstore node %s/colo-port", be_path);
>> +            goto cleanup;
>> +        }
>> +        disk->colo_port = tmp;
> 
> This is quite repetitive code.

Yes. Will introduce a new function to avoid it in the next version.
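
Something like the following, perhaps (a minimal sketch only; the helper
name colo_xs_read_checked is an assumption, not code from this series):

    static char *colo_xs_read_checked(libxl__gc *gc, libxl_ctx *ctx,
                                      const char *be_path, const char *node)
    {
        unsigned int len;
        /* Read <be_path>/<node>; log and return NULL if it is missing. */
        char *tmp = xs_read(ctx->xsh, XBT_NULL,
                            GCSPRINTF("%s/%s", be_path, node), &len);
        if (!tmp)
            LOG(ERROR, "Missing xenstore node %s/%s", be_path, node);
        return tmp;
    }

Each caller then collapses to:

    disk->colo_port = colo_xs_read_checked(gc, ctx, be_path, "colo-port");
    if (!disk->colo_port) goto cleanup;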

> 
> 
>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>> index 3610a39..bff08b0 100644
>> --- a/tools/libxl/libxl_create.c
>> +++ b/tools/libxl/libxl_create.c
>> @@ -1800,12 +1800,29 @@ static void domain_create_cb(libxl__egc *egc,
> ...
>> @@ -256,6 +280,36 @@ static int disk_try_backend(disk_try_backend_args *a,
>>      LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...",
>>          a->disk->vdev, libxl_disk_backend_to_string(backend));
>>      return 0;
>> +
>> + bad_colo:
>> +    LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with colo",
>> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
>> +    return 0;
> 
> This is correct here, I think.
> 
>> + bad_colo_host:
>> +    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-host=... for colo",
>> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
>> +    return 0;
> 
> I think these options should be checked later.  disk_try_backend isn't
> the place for general parameter checking; it is searching for which
> backend to try.

Hmm, do you mean we should check these options when we actually need to use COLO?

> 
>>  int libxl__device_disk_set_backend(libxl__gc *gc, libxl_device_disk *disk) {
>> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
>> index 4aca38e..ba17251 100644
>> --- a/tools/libxl/libxl_dm.c
>> +++ b/tools/libxl/libxl_dm.c
>> @@ -751,6 +751,139 @@ static int libxl__dm_runas_helper(libxl__gc *gc, const char *username)
> ...
>> +static char *qemu_disk_scsi_drive_string(libxl__gc *gc, const char *pdev_path,
>> +                                         int unit, const char *format,
>> +                                         const libxl_device_disk *disk,
>> +                                         int colo_mode)
>> +{
>> +    char *drive = NULL;
>> +    const char *exportname = disk->colo_export;
>> +    const char *active_disk = disk->active_disk;
>> +    const char *hidden_disk = disk->hidden_disk;
>> +
>> +    switch (colo_mode) {
>> +    case LIBXL__COLO_NONE:
>> +        drive = libxl__sprintf
>> +            (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
>> +             pdev_path, unit, format);
> 
> I think this would be a lot clearer if the refactoring was split into
> a separate patch.

OK.

> 
>>                  if (strncmp(disks[i].vdev, "sd", 2) == 0) {
>> -                    drive = libxl__sprintf
>> -                        (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,readonly=%s,cache=writeback",
>> -                         pdev_path, disk, format, disks[i].readwrite ? "off" : "on");
>> +                    if (colo_mode == LIBXL__COLO_SECONDARY) {
>> +                        /*
>> +                         * -drive if=none,driver=format,file=pdev_path,\
>> +                         * id=exportname
>> +                         */
> 
> I think this comment adds nothing to the code and could be profitably
> omitted.

OK.

> 
>> +                        drive = libxl__sprintf
>> +                            (gc, "if=none,driver=%s,file=%s,id=%s",
>> +                             format, pdev_path, disks[i].colo_export);
> 
> I don't understand how this works here.  COLO_SECONDARY seems to
> suppress the rest of the disk specification.

COLO_SECONDARY uses two -drive arguments: one for S and one for A (H is
specified inside A's specification). This line is for S; the code for A is
in the function qemu_disk_scsi_drive_string().
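
So for one "sd" disk the secondary ends up with a pair of -drive arguments
roughly like this (unit number, paths and export name invented for the
example; the strings follow the patch's comments):

    -drive if=none,driver=raw,file=/dev/vg/guest,id=disk0
    -drive if=scsi,bus=0,unit=0,cache=writeback,driver=replication,mode=secondary,file.driver=qcow2,file.file.filename=/mnt/colo/active.qcow2,file.backing.driver=qcow2,file.backing.file.filename=/mnt/colo/hidden.qcow2,file.backing.backing=disk0

The first line is S (the drive the NBD server writes to, named by its export
name); the second is A layered on H, layered via the export name on S.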

> 
> Also, this same logic seems to appear many times.  Maybe it could be
> centralised.  Perhaps I would be able to advise more clearly if I
> understood how this was put together.
> 
>> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
>> index 9b0a537..a2078d1 100644
>> --- a/tools/libxl/libxl_types.idl
>> +++ b/tools/libxl/libxl_types.idl
>> @@ -575,6 +575,13 @@ libxl_device_disk = Struct("device_disk", [
>>      ("is_cdrom", integer),
>>      ("direct_io_safe", bool),
>>      ("discard_enable", libxl_defbool),
>> +    ("colo_enable", libxl_defbool),
>> +    ("colo_restore_enable", libxl_defbool),
>> +    ("colo_host", string),
>> +    ("colo_port", string),
>> +    ("colo_export", string),
>> +    ("active_disk", string),
>> +    ("hidden_disk", string)
> 
> In general, many of these should probably not be strings.  Certainly the
> port should be an integer.  I don't quite understand the semantics of
> the others.

Yes, the port should be an integer. Will fix it in the next version.
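
I.e. in libxl_types.idl, something like (sketch):

    ("colo_port", integer),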

Thanks
Wen Congyang

> 
> Ian.
Wen Congyang March 7, 2016, 2:10 a.m. UTC | #5
On 03/05/2016 04:30 AM, Konrad Rzeszutek Wilk wrote:
> On Fri, Mar 04, 2016 at 05:52:09PM +0000, Ian Jackson wrote:
>> Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
>>> +Enable COLO HA for the disk. For a better understanding of block
>>> +replication in QEMU, please refer to:
>>> +http://wiki.qemu.org/Features/BlockReplication
>>
>> Sorry, I missed this link on my first pass.  I still think that at the
>> very least this needs something more user-facing (ie, how should one
>> set this up).
>>
>> But, I'm kind of worried that qemu is the wrong place to be doing
>> this.
>>
>> How can this be made to work with PV guests ?
> 
> QEMU can also serve PV guests (qdisk).
> 
> I think your question is more of - what about making this work with
> PV block backend?

I don't know how to make this work with the PV block backend. That is one
reason why we only support pure HVM now.
The PV block backend also has other problems; for example, resuming it on
the secondary side is very slow, because we need to disconnect and
reconnect.

Thanks
Wen Congyang

>>
>> What if an HVM guest has PV-on-HVM drivers ?  In this case there might
>> be two relevant qemus, one for the qdisk Xen PV block backend, and one
>> for the emulated IDE.
> 
> In both cases QEMU would use the same underlying API to actually write/read
> out the blocks. That API would then use NBD, etc. to replicate writes.
> 
> Maybe a little ASCII art?
> 
> 	qdisk  ide
> 	  \    /
>            \  /
>            block API
>             |
>            QCOW2
>             |
>            NBD
> 
> Or such?
> 
>>
>> I don't understand how discrepant writes are detected.  Surely they
>> might occur and should trigger a resynch ?
>>
>> Ian.
Wei Liu March 8, 2016, 5:22 p.m. UTC | #6
On Mon, Mar 07, 2016 at 10:10:07AM +0800, Wen Congyang wrote:
> On 03/05/2016 04:30 AM, Konrad Rzeszutek Wilk wrote:
> > On Fri, Mar 04, 2016 at 05:52:09PM +0000, Ian Jackson wrote:
> >> Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
> >>> +Enable COLO HA for the disk. For a better understanding of block
> >>> +replication in QEMU, please refer to:
> >>> +http://wiki.qemu.org/Features/BlockReplication
> >>
> >> Sorry, I missed this link on my first pass.  I still think that at the
> >> very least this needs something more user-facing (ie, how should one
> >> set this up).
> >>
> >> But, I'm kind of worried that qemu is the wrong place to be doing
> >> this.
> >>
> >> How can this be made to work with PV guests ?
> > 
> > QEMU can also serve PV guests (qdisk).
> > 
> > I think your question is more of - what about making this work with
> > PV block backend?
> 
> I don't know how to make this work with the PV block backend. That is one
> reason why we only support pure HVM now.
> The PV block backend also has other problems; for example, resuming it on
> the secondary side is very slow, because we need to disconnect and
> reconnect.
> 

Supporting PV guests is certainly going to be non-trivial. And I don't
think we would ever ask you to actually implement that.

The point is to have a story so that when other people want to implement
COLO for PV-aware guests (PVHVM, PV and PVH), they are not crippled by
existing interfaces.

Currently the disk spec seems to be designed exclusively for QEMU. This
is not very desirable, but at least it wouldn't stop people from either
reusing these parameters or inventing new ones.

Furthermore, I think coming up with a story for PV-aware guests (PVHVM,
PV and PVH) is also non-trivial. For one, the disk replication logic is
not implemented in the PV block backend; we're not sure how feasible it
is to replicate what QEMU does inside the kernel, but we're quite sure it
is not going to be trivial, technically or politically. The uncertainty
is too big to come up with a clear idea of what it would look like.

Wei.

> Thanks
> Wen Congyang
> 
> >>
> >> What if an HVM guest has PV-on-HVM drivers ?  In this case there might
> >> be two relevant qemus, one for the qdisk Xen PV block backend, and one
> >> for the emulated IDE.
> > 
> > In both cases QEMU would use the same underlying API to actually write/read
> > out the blocks. That API would then use NBD, etc. to replicate writes.
> > 
> > Maybe a little ASCII art?
> > 
> > 	qdisk  ide
> > 	  \    /
> >            \  /
> >            block API
> >             |
> >            QCOW2
> >             |
> >            NBD
> > 
> > Or such?
> > 
> >>
> >> I don't understand how discrepant writes are detected.  Surely they
> >> might occur and should trigger a resynch ?
> >>
> >> Ian.
Konrad Rzeszutek Wilk March 9, 2016, 2:09 a.m. UTC | #7
> Furthermore, I think coming up with a story for PV-aware guests (PVHVM,
> PV and PVH) is also non-trivial. For one, the disk replication logic is

Pls keep in mind that PV guests can use QEMU qdisk if they wish.

This is more of a 'phy' vs 'file' question.
> not implemented in the PV block backend; we're not sure how feasible it
> is to replicate what QEMU does inside the kernel, but we're quite sure it
> is not going to be trivial, technically or politically. The uncertainty

.. and I think doing it in the kernel is not a good idea. With PVH
coming as the initial domain - we will now have solved the @$@#@@
syscall bounce to hypervisor slowpath - which means that any libaio
operations QEMU does on a file - which end up going to the kernel
with a syscall - will happen without incurring a huge context switch!

In other words the speed difference between qdisk and xen-blkback
will now be just the syscall + aio code + fs overhead.

With proper batching this can be made faster*.

*: I am ignoring the other issues such as grant unmap in PVH guests
and qdisk being quite spartan in features compared to xen-blkback.

> is too big to come up with a clear idea of what it would look like.
Wei Liu March 9, 2016, 4:55 p.m. UTC | #8
On Tue, Mar 08, 2016 at 09:09:20PM -0500, Konrad Rzeszutek Wilk wrote:
> > Furthermore, I think coming up with a story for PV-aware guests (PVHVM,
> > PV and PVH) is also non-trivial. For one, the disk replication logic is
> 
> Pls keep in mind that PV guests can use QEMU qdisk if they wish.
> 

Oh right, I somehow talked myself into thinking blkback only.

> This is more of a 'phy' vs 'file' question.
> > not implemented in the PV block backend; we're not sure how feasible it
> > is to replicate what QEMU does inside the kernel, but we're quite sure it
> > is not going to be trivial, technically or politically. The uncertainty
> 
> .. and I think doing it in the kernel is not a good idea. With PVH
> coming as the initial domain - we will now have solved the @$@#@@
> syscall bounce to hypervisor slowpath - which means that any libaio
> operations QEMU does on a file - which end up going to the kernel
> with a syscall - will happen without incurring a huge context switch!
> 
> In other words the speed difference between qdisk and xen-blkback
> will now be just the syscall + aio code + fs overhead.
> 
> With proper batching this can be made faster*.
> 
> *: I am ignoring the other issues such as grant unmap in PVH guests
> and qdisk being quite spartan in features compared to xen-blkback.
> 

So now we do have a tangible story for PV-aware guests for disk
replication -- use the qdisk backend (note, other hindrances such as
checkpointing the kernel at a fast enough pace still need to be solved).
That means the current interfaces should be sufficient.

Wei.

> > is too big to come up with a clear idea of what it would look like.
Ian Jackson March 17, 2016, 5:09 p.m. UTC | #9
Wei Liu writes ("Re: [PATCH v11 20/27] Support colo mode for qemu disk"):
> Supporting PV guests is certainly going to be non-trivial. And I don't
> think we would ever ask you to actually implement that.

Indeed.

> The point is to have a story so that when other people want to implement
> COLO for PV-aware guests (PVHVM, PV and PVH), they are not crippled by
> existing interfaces.
> 
> Currently the disk spec seems to be designed exclusively for QEMU. This
> is not very desirable, but at least it wouldn't stop people from either
> reusing these parameters or inventing new ones.

I think in fact (following some in-person conversations) that I am
comfortable with this implementation requiring qemu right now.

Future PV[H] COLO arrangements might well use qdisk anyway.  I think
it is OK for libxl to decide that qemu is needed in this case.  All
that's needed now is to arrange that: if someone, in the future, wants
to make a version of COLO that works without qemu somehow then they
can do that without having to significantly change the libxl API.

So all that's needed is for the interface to libxl not to imply that
qemu is in use.  I don't think the interface proposed here implies
that qemu is in use.

The proposed interface seems to mostly speak about things which are
not qemu-specific, so it's probably OK.

Ian.
Ian Jackson March 17, 2016, 5:10 p.m. UTC | #10
Konrad Rzeszutek Wilk writes ("Re: [PATCH v11 20/27] Support colo mode for qemu disk"):
> On Fri, Mar 04, 2016 at 05:52:09PM +0000, Ian Jackson wrote:
> > How can this be made to work with PV guests ?
> 
> QEMU can also serve PV guests (qdisk).

Yes.

> I think your question is more of - what about making this work with
> PV block backend?

Yes.

> > What if an HVM guest has PV-on-HVM drivers ?  In this case there might
> > be two relevant qemus, one for the qdisk Xen PV block backend, and one
> > for the emulated IDE.
> 
> In both cases QEMU would use the same underlying API to actually write/read
> out the blocks. That API would then use NBD, etc. to replicate writes.
> 
> Maybe a little ASCII art?
> 
>       qdisk  ide
>         \    /
>          \  /
>          block API
>           |
>          QCOW2
>           |
>          NBD

Except that currently libxl may launch two qemus: one to serve
emulated ide requests, and one to service PV qdisk.

Ian.
Ian Jackson March 17, 2016, 5:18 p.m. UTC | #11
Wen Congyang writes ("Re: [PATCH v11 20/27] Support colo mode for qemu disk"):
> How does block replication work:

Thanks for this explanation, which is really helpful.

I would like to repeat back to you what I think I have understood:

Between resynchs, you allow each VM to run in parallel and to generate
possibly-divergent disk writes.

So on host B you retain both the A and B disk writes.  They are
stored as differences (qcow2) for performance reasons.

If A fails and it becomes necessary to resume the VM only on B, you
use B's version of the VM (both disk and memory).

If B fails then A can use the A version (disk and memory).

If the two are still up, but they diverge in network traffic, you
resynch the memory from A to B, and drop B's disk and replace it with
a copy of A's.

Have I understood correctly ?


If so, what software, where, arranges for the management of the
different qcow2 `layers' ?  Ie, what creates the layers; what resynchs
them, etc. ?

The reason I started asking all these questions is because of these
parameters in your disk config:

 +    ("colo_enable", libxl_defbool),
 +    ("colo_restore_enable", libxl_defbool),
 +    ("colo_host", string),
 +    ("colo_port", string),
 +    ("colo_export", string),
 +    ("active_disk", string),
 +    ("hidden_disk", string)

For COLO to work properly it is necessary that the `active_disk' and
`hidden_disk' relate in a specific way to the main disk: they must
be related snapshots (qcow2, currently).

Would it be possible for these disk names to have formulaic,
predictable names, so that they wouldn't need to be specified
separately ?

Is there any value in being able to specify them separately ?

Ian.
Wen Congyang March 18, 2016, 5:42 a.m. UTC | #12
On 03/18/2016 01:18 AM, Ian Jackson wrote:
> Wen Congyang writes ("Re: [PATCH v11 20/27] Support colo mode for qemu disk"):
>> How does block replication work:
> 
> Thanks for this explanation, which is really helpful.
> 
> I would like to repeat back to you what I think I have understood:
> 
> Between resynchs, you allow each VM to run in parallel and to generate
> possibly-divergent disk writes.
> 
> So on host B you retain both the A and B disk writes.  They are
> stored as differences (qcow2) for performance reasons.
> 
> If A fails and it becomes necessary to resume the VM only on B, you
> use B's version of the VM (both disk and memory).
> 
> If B fails then A can use the A version (disk and memory).
> 
> If the two are still up, but they diverge in network traffic, you
> resynch the memory from A to B, and drop B's disk and replace it with
> a copy of A's.
> 
> Have I understood correctly ?

Yes.

> 
> 
> If so, what software, where, arranges for the management of the
> different qcow2 `layers' ?  Ie, what creates the layers; what resynchs
> them, etc. ?

The active disk and the hidden disk are separate disks. The management
application can create them as empty qcow2 disks before running COLO; both
start out empty and have the same size as the secondary disk.
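
For example (a sketch; the paths and size are invented, and qemu-img is
just the obvious tool for the job):

    qemu-img create -f qcow2 /mnt/colo/active.qcow2 10G
    qemu-img create -f qcow2 /mnt/colo/hidden.qcow2 10G

The backing relationships (A on top of H on top of S) are then established
at runtime by the file.backing.* arguments of the -drive specification, so
the images themselves can start as plain empty qcow2 files.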

Thanks
Wen Congyang

> 
> The reason I started asking all these questions is because of these
> parameters in your disk config:
> 
>  +    ("colo_enable", libxl_defbool),
>  +    ("colo_restore_enable", libxl_defbool),
>  +    ("colo_host", string),
>  +    ("colo_port", string),
>  +    ("colo_export", string),
>  +    ("active_disk", string),
>  +    ("hidden_disk", string)
> 
> For COLO to work properly it is necessary that the `active_disk' and
> `hidden_disk' relate in a specific way to the main disk: they must
> be related snapshots (qcow2, currently).
> 
> Would it be possible for these disk names to have formulaic,
> predictable names, so that they wouldn't need to be specified
> separately ?
> 
> Is there any value in being able to specify them separately ?
> 
> Ian.

Patch

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index 1c6dd87..36b093b 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -450,11 +450,39 @@  Print huge (!) amount of debug during the migration process.
 Enable Remus HA or COLO HA for domain. By default B<xl> relies on ssh as a
 transport mechanism between the two hosts.
 
-N.B: Remus support in xl is still in experimental (proof-of-concept) phase.
-     Disk replication support is limited to DRBD disks.
+B<NOTES>
+
+=over 4
+
+Remus support in xl is still in experimental (proof-of-concept) phase.
+Disk replication support is limited to DRBD disks.
+
+COLO support in xl is still in experimental (proof-of-concept) phase.
+There is no support for network at the moment.
+
+=back
+
+B<EXAMPLE>
 
-     COLO support in xl is still in experimental (proof-of-concept) phase.
-     There is no support for network or disk at the moment.
+=over 4
+
+(a) An example of a COLO replication configuration: disk = ['...,colo,
+colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
+
+=item B<colo-host>      :Secondary host's IP address.
+
+=item B<colo-port>      :Secondary host's port. We will run an NBD server on
+the secondary host, and the NBD server will listen on this port.
+
+=item B<colo-export>    :The NBD server's disk export name on the secondary
+host.
+
+=item B<active-disk>    :The secondary guest's writes will be buffered in this
+disk; it is used only by the secondary.
+
+=item B<hidden-disk>    :Content modified by the primary will be buffered in
+this disk; it is used only by the secondary.
+
+=back
 
 B<OPTIONS>
 
diff --git a/docs/misc/xl-disk-configuration.txt b/docs/misc/xl-disk-configuration.txt
index 29f6ddb..6e73975 100644
--- a/docs/misc/xl-disk-configuration.txt
+++ b/docs/misc/xl-disk-configuration.txt
@@ -234,6 +234,59 @@  were intentionally created non-sparse to avoid fragmentation of the
 file.
 
 
+===============
+COLO PARAMETERS
+===============
+
+
+colo
+----
+
+Enable COLO HA for the disk. For a better understanding of block
+replication in QEMU, please refer to:
+http://wiki.qemu.org/Features/BlockReplication
+
+
+colo-host
+---------
+
+Description:           Secondary host's address
+Mandatory:             Yes when COLO enabled
+
+
+colo-port
+---------
+
+Description:           Secondary host's port
+                       We will run an NBD server on the secondary host,
+                       and the NBD server will listen on this port.
+Mandatory:             Yes when COLO enabled
+
+
+colo-export
+-----------
+
+Description:           We will run an NBD server on the secondary host;
+                       this is the NBD server's disk export name.
+Mandatory:             Yes when COLO enabled
+
+
+active-disk
+-----------
+
+Description:           This is used by the secondary. The secondary guest's
+                       writes will be buffered in this disk.
+Mandatory:             Yes when COLO enabled
+
+
+hidden-disk
+-----------
+
+Description:           This is used by the secondary. It buffers the
+                       original content that is modified by the primary VM.
+Mandatory:             Yes when COLO enabled
+
+
 ============================================
 DEPRECATED PARAMETERS, PREFIXES AND SYNTAXES
 ============================================
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 3689dfc..eb1b277 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -2307,6 +2307,8 @@  int libxl__device_disk_setdefault(libxl__gc *gc, libxl_device_disk *disk)
     int rc;
 
     libxl_defbool_setdefault(&disk->discard_enable, !!disk->readwrite);
+    libxl_defbool_setdefault(&disk->colo_enable, false);
+    libxl_defbool_setdefault(&disk->colo_restore_enable, false);
 
     rc = libxl__resolve_domid(gc, disk->backend_domname, &disk->backend_domid);
     if (rc < 0) return rc;
@@ -2505,6 +2507,18 @@  static void device_disk_add(libxl__egc *egc, uint32_t domid,
                 flexarray_append(back, "params");
                 flexarray_append(back, GCSPRINTF("%s:%s",
                               libxl__device_disk_string_of_format(disk->format), disk->pdev_path));
+                if (libxl_defbool_val(disk->colo_enable)) {
+                    flexarray_append(back, "colo-host");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->colo_host));
+                    flexarray_append(back, "colo-port");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->colo_port));
+                    flexarray_append(back, "colo-export");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->colo_export));
+                    flexarray_append(back, "active-disk");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->active_disk));
+                    flexarray_append(back, "hidden-disk");
+                    flexarray_append(back, libxl__sprintf(gc, "%s", disk->hidden_disk));
+                }
                 assert(device->backend_kind == LIBXL__DEVICE_KIND_QDISK);
                 break;
             default:
@@ -2620,7 +2634,12 @@  static int libxl__device_disk_from_xs_be(libxl__gc *gc,
         goto cleanup;
     }
 
-    /* "params" may not be present; but everything else must be. */
+    /*
+     * "params" may not be present; but everything else must be.
+     * COLO-related entries (colo-host, colo-port, colo-export,
+     * active-disk and hidden-disk) are present only if COLO is
+     * enabled.
+     */
     tmp = xs_read(ctx->xsh, XBT_NULL,
                   GCSPRINTF("%s/params", be_path), &len);
     if (tmp && strchr(tmp, ':')) {
@@ -2630,6 +2649,49 @@  static int libxl__device_disk_from_xs_be(libxl__gc *gc,
         disk->pdev_path = tmp;
     }
 
+    tmp = xs_read(ctx->xsh, XBT_NULL,
+                  GCSPRINTF("%s/colo-host", be_path), &len);
+    if (tmp) {
+        libxl_defbool_set(&disk->colo_enable, true);
+        disk->colo_host = tmp;
+    } else {
+        libxl_defbool_set(&disk->colo_enable, false);
+    }
+
+    if (libxl_defbool_val(disk->colo_enable)) {
+        tmp = xs_read(ctx->xsh, XBT_NULL,
+                      GCSPRINTF("%s/colo-port", be_path), &len);
+        if (!tmp) {
+            LOG(ERROR, "Missing xenstore node %s/colo-port", be_path);
+            goto cleanup;
+        }
+        disk->colo_port = tmp;
+
+        tmp = xs_read(ctx->xsh, XBT_NULL,
+                      GCSPRINTF("%s/colo-export", be_path), &len);
+        if (!tmp) {
+            LOG(ERROR, "Missing xenstore node %s/colo-export", be_path);
+            goto cleanup;
+        }
+        disk->colo_export = tmp;
+
+        tmp = xs_read(ctx->xsh, XBT_NULL,
+                      GCSPRINTF("%s/active-disk", be_path), &len);
+        if (!tmp) {
+            LOG(ERROR, "Missing xenstore node %s/active-disk", be_path);
+            goto cleanup;
+        }
+        disk->active_disk = tmp;
+
+        tmp = xs_read(ctx->xsh, XBT_NULL,
+                      GCSPRINTF("%s/hidden-disk", be_path), &len);
+        if (!tmp) {
+            LOG(ERROR, "Missing xenstore node %s/hidden-disk", be_path);
+            goto cleanup;
+        }
+        disk->hidden_disk = tmp;
+    }
+
 
     tmp = libxl__xs_read(gc, XBT_NULL,
                          GCSPRINTF("%s/type", be_path));
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 3610a39..bff08b0 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1800,12 +1800,29 @@  static void domain_create_cb(libxl__egc *egc,
 
     libxl__ao_complete(egc, ao, rc);
 }
-    
+
+static void set_disk_colo_restore(libxl_domain_config *d_config)
+{
+    int i;
+
+    for (i = 0; i < d_config->num_disks; i++)
+        libxl_defbool_set(&d_config->disks[i].colo_restore_enable, true);
+}
+
+static void unset_disk_colo_restore(libxl_domain_config *d_config)
+{
+    int i;
+
+    for (i = 0; i < d_config->num_disks; i++)
+        libxl_defbool_set(&d_config->disks[i].colo_restore_enable, false);
+}
+
 int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
                             uint32_t *domid,
                             const libxl_asyncop_how *ao_how,
                             const libxl_asyncprogress_how *aop_console_how)
 {
+    unset_disk_colo_restore(d_config);
     return do_domain_create(ctx, d_config, domid, -1, -1, NULL,
                             ao_how, aop_console_how);
 }
@@ -1817,6 +1834,12 @@  int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
                                 const libxl_asyncop_how *ao_how,
                                 const libxl_asyncprogress_how *aop_console_how)
 {
+    if (params->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
+        set_disk_colo_restore(d_config);
+    } else {
+        unset_disk_colo_restore(d_config);
+    }
+
     return do_domain_create(ctx, d_config, domid, restore_fd, send_back_fd,
                             params, ao_how, aop_console_how);
 }
diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
index 8bb5e93..039afc6 100644
--- a/tools/libxl/libxl_device.c
+++ b/tools/libxl/libxl_device.c
@@ -196,6 +196,10 @@  static int disk_try_backend(disk_try_backend_args *a,
             goto bad_format;
         }
 
+        if (libxl_defbool_val(a->disk->colo_enable) ||
+            a->disk->active_disk || a->disk->hidden_disk)
+            goto bad_colo;
+
         if (a->disk->backend_domid != LIBXL_TOOLSTACK_DOMID) {
             LOG(DEBUG, "Disk vdev=%s, is using a storage driver domain, "
                        "skipping physical device check", a->disk->vdev);
@@ -218,6 +222,10 @@  static int disk_try_backend(disk_try_backend_args *a,
     case LIBXL_DISK_BACKEND_TAP:
         if (a->disk->script) goto bad_script;
 
+        if (libxl_defbool_val(a->disk->colo_enable) ||
+            a->disk->active_disk || a->disk->hidden_disk)
+            goto bad_colo;
+
         if (a->disk->is_cdrom) {
             LOG(DEBUG, "Disk vdev=%s, backend tap unsuitable for cdroms",
                        a->disk->vdev);
@@ -236,6 +244,22 @@  static int disk_try_backend(disk_try_backend_args *a,
 
     case LIBXL_DISK_BACKEND_QDISK:
         if (a->disk->script) goto bad_script;
+        if (libxl_defbool_val(a->disk->colo_enable)) {
+            if (!a->disk->colo_host)
+                goto bad_colo_host;
+
+            if (!a->disk->colo_port)
+                goto bad_colo_port;
+
+            if (!a->disk->colo_export)
+                goto bad_colo_export;
+
+            if (!a->disk->active_disk)
+                goto bad_active_disk;
+
+            if (!a->disk->hidden_disk)
+                goto bad_hidden_disk;
+        }
         return backend;
 
     default:
@@ -256,6 +280,36 @@  static int disk_try_backend(disk_try_backend_args *a,
     LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...",
         a->disk->vdev, libxl_disk_backend_to_string(backend));
     return 0;
+
+ bad_colo:
+    LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_colo_host:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-host=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_colo_port:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-port=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_colo_export:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-export=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_active_disk:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs active-disk=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_hidden_disk:
+    LOG(DEBUG, "Disk vdev=%s, backend %s needs hidden-disk=... for colo",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
 }
 
 int libxl__device_disk_set_backend(libxl__gc *gc, libxl_device_disk *disk) {
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 4aca38e..ba17251 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -751,6 +751,139 @@  static int libxl__dm_runas_helper(libxl__gc *gc, const char *username)
     }
 }
 
+/* colo mode */
+enum {
+    LIBXL__COLO_NONE = 0,
+    LIBXL__COLO_PRIMARY,
+    LIBXL__COLO_SECONDARY,
+};
+
+static char *qemu_disk_scsi_drive_string(libxl__gc *gc, const char *pdev_path,
+                                         int unit, const char *format,
+                                         const libxl_device_disk *disk,
+                                         int colo_mode)
+{
+    char *drive = NULL;
+    const char *exportname = disk->colo_export;
+    const char *active_disk = disk->active_disk;
+    const char *hidden_disk = disk->hidden_disk;
+
+    switch (colo_mode) {
+    case LIBXL__COLO_NONE:
+        drive = libxl__sprintf
+            (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
+             pdev_path, unit, format);
+        break;
+    case LIBXL__COLO_PRIMARY:
+        /*
+         * primary:
+         *  -drive if=scsi,bus=0,unit=x,cache=writeback,driver=quorum,\
+         *  id=exportname,\
+         *  children.0.file.filename=pdev_path,\
+         *  children.0.driver=format,\
+         *  read-pattern=fifo,\
+         *  vote-threshold=1
+         */
+        drive = GCSPRINTF(
+            "if=scsi,bus=0,unit=%d,cache=writeback,driver=quorum,"
+            "id=%s,"
+            "children.0.file.filename=%s,"
+            "children.0.driver=%s,"
+            "read-pattern=fifo,"
+            "vote-threshold=1",
+            unit, exportname, pdev_path, format);
+        break;
+    case LIBXL__COLO_SECONDARY:
+        /*
+         * secondary:
+         *  -drive if=scsi,bus=0,unit=x,cache=writeback,driver=replication,\
+         *  mode=secondary,\
+         *  file.driver=qcow2,\
+         *  file.file.filename=active_disk,\
+         *  file.backing.driver=qcow2,\
+         *  file.backing.file.filename=hidden_disk,\
+         *  file.backing.backing=exportname
+         */
+        drive = GCSPRINTF(
+            "if=scsi,bus=0,unit=%d,cache=writeback,driver=replication,"
+            "mode=secondary,"
+            "file.driver=qcow2,"
+            "file.file.filename=%s,"
+            "file.backing.driver=qcow2,"
+            "file.backing.file.filename=%s,"
+            "file.backing.backing=%s",
+            unit, active_disk, hidden_disk, exportname);
+        break;
+    default:
+        abort();
+    }
+
+    return drive;
+}
+
+static char *qemu_disk_ide_drive_string(libxl__gc *gc, const char *pdev_path,
+                                        int unit, const char *format,
+                                        const libxl_device_disk *disk,
+                                        int colo_mode)
+{
+    char *drive = NULL;
+    const char *exportname = disk->colo_export;
+    const char *active_disk = disk->active_disk;
+    const char *hidden_disk = disk->hidden_disk;
+
+    switch (colo_mode) {
+    case LIBXL__COLO_NONE:
+        drive = GCSPRINTF
+            ("file=%s,if=ide,index=%d,media=disk,format=%s,cache=writeback",
+             pdev_path, unit, format);
+        break;
+    case LIBXL__COLO_PRIMARY:
+        /*
+         * primary:
+         *  -drive if=ide,index=x,media=disk,cache=writeback,driver=quorum,\
+         *  id=exportname,\
+         *  children.0.file.filename=pdev_path,\
+         *  children.0.driver=format,\
+         *  read-pattern=fifo,\
+         *  vote-threshold=1
+         */
+        drive = GCSPRINTF(
+            "if=ide,index=%d,media=disk,cache=writeback,driver=quorum,"
+            "id=%s,"
+            "children.0.file.filename=%s,"
+            "children.0.driver=%s,"
+            "read-pattern=fifo,"
+            "vote-threshold=1",
+             unit, exportname, pdev_path, format);
+        break;
+    case LIBXL__COLO_SECONDARY:
+        /*
+         * secondary:
+         *  -drive if=ide,index=x,media=disk,cache=writeback,driver=replication,\
+         *  mode=secondary,\
+         *  file.driver=qcow2,\
+         *  file.file.filename=active_disk,\
+         *  file.backing.driver=qcow2,\
+         *  file.backing.file.filename=hidden_disk,\
+         *  file.backing.backing=exportname
+         */
+        drive = GCSPRINTF(
+            "if=ide,index=%d,media=disk,cache=writeback,driver=replication,"
+            "mode=secondary,"
+            "file.driver=qcow2,"
+            "file.file.filename=%s,"
+            "file.backing.driver=qcow2,"
+            "file.backing.file.filename=%s,"
+            "file.backing.backing=%s",
+            unit, active_disk, hidden_disk, exportname);
+        break;
+    default:
+        abort();
+    }
+
+    return drive;
+}
+
 static int libxl__build_device_model_args_new(libxl__gc *gc,
                                         const char *dm, int guest_domid,
                                         const libxl_domain_config *guest_config,
@@ -1164,6 +1297,7 @@  static int libxl__build_device_model_args_new(libxl__gc *gc,
             const char *format = qemu_disk_format_string(disks[i].format);
             char *drive;
             const char *pdev_path;
+            int colo_mode;
 
             if (dev_number == -1) {
                 LOG(WARN, "unable to determine"" disk number for %s",
@@ -1208,10 +1342,32 @@  static int libxl__build_device_model_args_new(libxl__gc *gc,
                  * For other disks we translate devices 0..3 into
                  * hd[a-d] and ignore the rest.
                  */
+                if (libxl_defbool_val(disks[i].colo_enable)) {
+                    if (libxl_defbool_val(disks[i].colo_restore_enable))
+                        colo_mode = LIBXL__COLO_SECONDARY;
+                    else
+                        colo_mode = LIBXL__COLO_PRIMARY;
+                } else {
+                    colo_mode = LIBXL__COLO_NONE;
+                }
+
                 if (strncmp(disks[i].vdev, "sd", 2) == 0) {
-                    drive = libxl__sprintf
-                        (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,readonly=%s,cache=writeback",
-                         pdev_path, disk, format, disks[i].readwrite ? "off" : "on");
+                    if (colo_mode == LIBXL__COLO_SECONDARY) {
+                        /*
+                         * -drive if=none,driver=format,file=pdev_path,\
+                         * id=exportname
+                         */
+                        drive = libxl__sprintf
+                            (gc, "if=none,driver=%s,file=%s,id=%s",
+                             format, pdev_path, disks[i].colo_export);
+
+                        flexarray_append(dm_args, "-drive");
+                        flexarray_append(dm_args, drive);
+                    }
+                    drive = qemu_disk_scsi_drive_string(gc, pdev_path, disk,
+                                                        format,
+                                                        &disks[i],
+                                                        colo_mode);
                 } else if (strncmp(disks[i].vdev, "xvd", 3) == 0) {
                     /*
                      * Do not add any emulated disk when PV disk are
@@ -1234,12 +1390,28 @@  static int libxl__build_device_model_args_new(libxl__gc *gc,
                         LOG(ERROR, "qemu-xen doesn't support read-only IDE disk drivers");
                         return ERROR_INVAL;
                     }
-                    drive = libxl__sprintf
-                        (gc, "file=%s,if=ide,index=%d,media=disk,format=%s,cache=writeback",
-                         pdev_path, disk, format);
+                    if (colo_mode == LIBXL__COLO_SECONDARY) {
+                        /*
+                         * -drive if=none,driver=format,file=pdev_path,\
+                         * id=exportname
+                         */
+                        drive = libxl__sprintf
+                            (gc, "if=none,driver=%s,file=%s,id=%s",
+                             format, pdev_path, disks[i].colo_export);
+
+                        flexarray_append(dm_args, "-drive");
+                        flexarray_append(dm_args, drive);
+                    }
+                    drive = qemu_disk_ide_drive_string(gc, pdev_path, disk,
+                                                       format,
+                                                       &disks[i],
+                                                       colo_mode);
                 } else {
                     continue; /* Do not emulate this disk */
                 }
+
+                if (!drive)
+                    continue;
             }
 
             flexarray_append(dm_args, "-drive");
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 9b0a537..a2078d1 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -575,6 +575,13 @@  libxl_device_disk = Struct("device_disk", [
     ("is_cdrom", integer),
     ("direct_io_safe", bool),
     ("discard_enable", libxl_defbool),
+    ("colo_enable", libxl_defbool),
+    ("colo_restore_enable", libxl_defbool),
+    ("colo_host", string),
+    ("colo_port", string),
+    ("colo_export", string),
+    ("active_disk", string),
+    ("hidden_disk", string)
     ])
 
 libxl_device_nic = Struct("device_nic", [
diff --git a/tools/libxl/libxlu_disk_l.l b/tools/libxl/libxlu_disk_l.l
index 1a5deb5..58da943 100644
--- a/tools/libxl/libxlu_disk_l.l
+++ b/tools/libxl/libxlu_disk_l.l
@@ -176,6 +176,13 @@  script=[^,]*,?	{ STRIP(','); SAVESTRING("script", script, FROMEQUALS); }
 direct-io-safe,? { DPC->disk->direct_io_safe = 1; }
 discard,?	{ libxl_defbool_set(&DPC->disk->discard_enable, true); }
 no-discard,?	{ libxl_defbool_set(&DPC->disk->discard_enable, false); }
+colo,?		{ libxl_defbool_set(&DPC->disk->colo_enable, true); }
+no-colo,?	{ libxl_defbool_set(&DPC->disk->colo_enable, false); }
+colo-host=[^,]*,?	{ STRIP(','); SAVESTRING("colo-host", colo_host, FROMEQUALS); }
+colo-port=[^,]*,?	{ STRIP(','); SAVESTRING("colo-port", colo_port, FROMEQUALS); }
+colo-export=[^,]*,?	{ STRIP(','); SAVESTRING("colo-export", colo_export, FROMEQUALS); }
+active-disk=[^,]*,?	{ STRIP(','); SAVESTRING("active-disk", active_disk, FROMEQUALS); }
+hidden-disk=[^,]*,?	{ STRIP(','); SAVESTRING("hidden-disk", hidden_disk, FROMEQUALS); }
 
  /* the target magic parameter, eats the rest of the string */