[v11,20/27] Support colo mode for qemu disk

Message ID	1457080891-26054-21-git-send-email-xiecl.fnst@cn.fujitsu.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xen.org> From: Changlong Xie <xiecl.fnst@cn.fujitsu.com> To: xen devel <xen-devel@lists.xen.org>, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Ian Campbell <ian.campbell@citrix.com>, Ian Jackson <ian.jackson@eu.citrix.com>, Wei Liu <wei.liu2@citrix.com> Date: Fri, 4 Mar 2016 16:41:24 +0800 Message-ID: <1457080891-26054-21-git-send-email-xiecl.fnst@cn.fujitsu.com> In-Reply-To: <1457080891-26054-1-git-send-email-xiecl.fnst@cn.fujitsu.com> References: <1457080891-26054-1-git-send-email-xiecl.fnst@cn.fujitsu.com> MIME-Version: 1.0 Cc: Lars Kurth <lars.kurth@citrix.com>, Changlong Xie <xiecl.fnst@cn.fujitsu.com>, Wen Congyang <wency@cn.fujitsu.com>, Gui Jianfeng <guijianfeng@cn.fujitsu.com>, Jiang Yunhong <yunhong.jiang@intel.com>, Dong Eddie <eddie.dong@intel.com>, Anthony Perard <anthony.perard@citrix.com>, Shriram Rajagopalan <rshriram@cs.ubc.ca>, Yang Hongyang <hongyang.yang@easystack.cn> Subject: [Xen-devel] [PATCH v11 20/27] Support colo mode for qemu disk Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>

Changlong Xie March 4, 2016, 8:41 a.m. UTC

From: Wen Congyang <wency@cn.fujitsu.com>

Usage: disk = ['...,colo,colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
For QEMU block replication details:
http://wiki.qemu.org/Features/BlockReplication

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
---
 docs/man/xl.pod.1                   |  36 ++++++-
 docs/misc/xl-disk-configuration.txt |  53 +++++++++++
 tools/libxl/libxl.c                 |  64 ++++++++++++-
 tools/libxl/libxl_create.c          |  25 ++++-
 tools/libxl/libxl_device.c          |  54 +++++++++++
 tools/libxl/libxl_dm.c              | 184 ++++++++++++++++++++++++++++++++++--
 tools/libxl/libxl_types.idl         |   7 ++
 tools/libxl/libxlu_disk_l.l         |   7 ++
 8 files changed, 418 insertions(+), 12 deletions(-)

Ian Jackson March 4, 2016, 5:44 p.m. UTC | #1

Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> Usage: disk = ['...,colo,colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
> For QEMU block replication details:
> http://wiki.qemu.org/Features/BlockReplication

So now I am slightly confused by the design, I think.

When you replicate a VM with COLO using xl, its memory state is
transferred over ssh.  But its disk replication is done unencrypted
and unauthenticated ?

And the disk replication is, out of band, and needs to be configured
separately ?  This is rather awkward, although maybe not a
showstopper.  (Maybe we can have a plan to fix it in the future...)

And, how does the disk replication, which doesn't depend on there
being xl running, relate to the vm state replication, which does ?  I
think at the very least I'd like to see some information about the
principles of operation - either explained, or referred to, in the
user manual.

Is it possible to use COLO with an existing full-service disk
replication service such as DRBD ?

> +(a) An example for COLO replication's configuration: disk =['...,colo,colo-host
> +=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
> +
> +=item B<colo-host>      :Secondary host's ip address.
> +
> +=item B<colo-port>      :Secondary host's port, we will run a nbd server on
> +secondary host, and the nbd server will listen this port.
> +
> +=item B<colo-export>    :Nbd server's disk export name of secondary host.
> +
> +=item B<active-disk>    :Secondary's guest write will be buffered in this disk,
> +and it's used by secondary.
> +
> +=item B<hidden-disk>    :Primary's modified contents will be buffered in this
> +disk, and it's used by secondary.

What would a typical configuration look like ?  I don't understand the
relationship between active-disk and hidden-disk, etc.

> +colo-host
> +---------
> +
> +Description:           Secondary host's address
> +Mandatory:             Yes when COLO enabled

Is it permitted to specify a host DNS name ?

> +    if (libxl_defbool_val(disk->colo_enable)) {
> +        tmp = xs_read(ctx->xsh, XBT_NULL,
> +                      GCSPRINTF("%s/colo-port", be_path), &len);
> +        if (!tmp) {
> +            LOG(ERROR, "Missing xenstore node %s/colo-port", be_path);
> +            goto cleanup;
> +        }
> +        disk->colo_port = tmp;

This is quite repetitive code.


> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> index 3610a39..bff08b0 100644
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -1800,12 +1800,29 @@ static void domain_create_cb(libxl__egc *egc,
...
> @@ -256,6 +280,36 @@ static int disk_try_backend(disk_try_backend_args *a,
>      LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...",
>          a->disk->vdev, libxl_disk_backend_to_string(backend));
>      return 0;
> +
> + bad_colo:
> +    LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with colo",
> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
> +    return 0;

This is correct here, I think.

> + bad_colo_host:
> +    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-host=... for colo",
> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
> +    return 0;

I think these options should be checked later.  disk_try_backend isn't
the place for general parameter checking; it is searching for which
backend to try.

>  int libxl__device_disk_set_backend(libxl__gc *gc, libxl_device_disk *disk) {
> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
> index 4aca38e..ba17251 100644
> --- a/tools/libxl/libxl_dm.c
> +++ b/tools/libxl/libxl_dm.c
> @@ -751,6 +751,139 @@ static int libxl__dm_runas_helper(libxl__gc *gc, const char *username)
...
> +static char *qemu_disk_scsi_drive_string(libxl__gc *gc, const char *pdev_path,
> +                                         int unit, const char *format,
> +                                         const libxl_device_disk *disk,
> +                                         int colo_mode)
> +{
> +    char *drive = NULL;
> +    const char *exportname = disk->colo_export;
> +    const char *active_disk = disk->active_disk;
> +    const char *hidden_disk = disk->hidden_disk;
> +
> +    switch (colo_mode) {
> +    case LIBXL__COLO_NONE:
> +        drive = libxl__sprintf
> +            (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
> +             pdev_path, unit, format);

I think this would be a lot clearer if the refactoring was split into
a seperate patch.

>                  if (strncmp(disks[i].vdev, "sd", 2) == 0) {
> -                    drive = libxl__sprintf
> -                        (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,readonly=%s,cache=writeback",
> -                         pdev_path, disk, format, disks[i].readwrite ? "off" : "on");
> +                    if (colo_mode == LIBXL__COLO_SECONDARY) {
> +                        /*
> +                         * -drive if=none,driver=format,file=pdev_path,\
> +                         * id=exportname
> +                         */

I think this comment adds nothing to the code and could be profitably
omitted.

> +                        drive = libxl__sprintf
> +                            (gc, "if=none,driver=%s,file=%s,id=%s",
> +                             format, pdev_path, disks[i].colo_export);

I don't understand how this works here.  COLO_SECONDARY seems to
suppress the rest of the disk specification.

Also, this same logic seems to appear many times.  Maybe it could be
centralised.  Perhaps I would be able to advise more clearly if I
understood how this was put together.

> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> index 9b0a537..a2078d1 100644
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -575,6 +575,13 @@ libxl_device_disk = Struct("device_disk", [
>      ("is_cdrom", integer),
>      ("direct_io_safe", bool),
>      ("discard_enable", libxl_defbool),
> +    ("colo_enable", libxl_defbool),
> +    ("colo_restore_enable", libxl_defbool),
> +    ("colo_host", string),
> +    ("colo_port", string),
> +    ("colo_export", string),
> +    ("active_disk", string),
> +    ("hidden_disk", string)

In general, many these should probably not be strings.  Certainly the
port should be an integer.  I don't quite understand the semantics of
the others.

Ian.

Ian Jackson March 4, 2016, 5:52 p.m. UTC | #2

Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
> +Enable COLO HA for disk. For better understanding block replication on
> +QEMU, please refer to:
> +http://wiki.qemu.org/Features/BlockReplication

Sorry, I missed this link on my first pass.  I still think that at the
very least this needs something more user-facing (ie, how should one
set this up).

But, I'm kind of worried that qemu is the wrong place to be doing
this.

How can this be made to work with PV guests ?

What if an HVM guest has PV-on-HVM drivers ?  In this case there might
be two relevant qemus, one for the qdisk Xen PV block backend, and one
for the emulated IDE.

I don't understand how discrepant writes are detected.  Surely they
might occur and should trigger a resynch ?

Ian.

Konrad Rzeszutek Wilk March 4, 2016, 8:30 p.m. UTC | #3

On Fri, Mar 04, 2016 at 05:52:09PM +0000, Ian Jackson wrote:
> Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
> > +Enable COLO HA for disk. For better understanding block replication on
> > +QEMU, please refer to:
> > +http://wiki.qemu.org/Features/BlockReplication
> 
> Sorry, I missed this link on my first pass.  I still think that at the
> very least this needs something more user-facing (ie, how should one
> set this up).
> 
> But, I'm kind of worried that qemu is the wrong place to be doing
> this.
> 
> How can this be made to work with PV guests ?

QEMU can also serve PV guests (qdisk).

I think your question is more of - what about making this work with
PV block backend?
> 
> What if an HVM guest has PV-on-HVM drivers ?  In this case there might
> be two relevant qemus, one for the qdisk Xen PV block backend, and one
> for the emulated IDE.

In both cases QEMU would use the same underlaying API to actually write/read
out the blocks. That API would then use NBD, etc to replicate writes.

Maybe a little ASCII art?

	qdisk  ide
	  \    /
           \  /
           block API
            |
           QCOW2
            |
           NBD

Or such?

> 
> I don't understand how discrepant writes are detected.  Surely they
> might occur and should trigger a resynch ?
> 
> Ian.

Wen Congyang March 7, 2016, 2:06 a.m. UTC | #4

On 03/05/2016 01:44 AM, Ian Jackson wrote:
> Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Usage: disk = ['...,colo,colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
>> For QEMU block replication details:
>> http://wiki.qemu.org/Features/BlockReplication
> 
> So now I am slightly confused by the design, I think.
> 
> When you replicate a VM with COLO using xl, its memory state is
> transferred over ssh.  But its disk replication is done unencrypted
> and unauthenticated ?

Yes, it is a problem. I will think how to improve it.

> 
> And the disk replication is, out of band, and needs to be configured
> separately ?  This is rather awkward, although maybe not a
> showstopper.  (Maybe we can have a plan to fix it in the future...)

colo-host,colo-port should be the global configuration. And colo-export,
active-disk,hidden-disk must be configured separately, because each
disk should have a different configuration.

> 
> And, how does the disk replication, which doesn't depend on there
> being xl running, relate to the vm state replication, which does ?  I
> think at the very least I'd like to see some information about the
> principles of operation - either explained, or referred to, in the
> user manual.

OK. The disk replication doesn't depend on xl. We only can operate it
via qemu monitor command:
1. stop the vm
2. do the checkpoint
3. start the vm
1/3 is suspend/resume the guest. We only need to do 2 when both vm are
in the consistent state.

> 
> Is it possible to use COLO with an existing full-service disk
> replication service such as DRBD ?

DRBD doesn's support the case like COLO. Because both primary guest
and secondary guest need to write to the disk.

> 
>> +(a) An example for COLO replication's configuration: disk =['...,colo,colo-host
>> +=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
>> +
>> +=item B<colo-host>      :Secondary host's ip address.
>> +
>> +=item B<colo-port>      :Secondary host's port, we will run a nbd server on
>> +secondary host, and the nbd server will listen this port.
>> +
>> +=item B<colo-export>    :Nbd server's disk export name of secondary host.
>> +
>> +=item B<active-disk>    :Secondary's guest write will be buffered in this disk,
>> +and it's used by secondary.
>> +
>> +=item B<hidden-disk>    :Primary's modified contents will be buffered in this
>> +disk, and it's used by secondary.
> 
> What would a typical configuration look like ?  I don't understand the
> relationship between active-disk and hidden-disk, etc.

QEMU has a feature: backing file
For example: A's backing file is B
1. If we read from A, but the sector is not allocated in A. We wil return a zero
   sector to the guest. If A has a backing file, we will read the sector from B
   instead of returning a zero sector.
2. The backing file doesn't affect the write operation.

QEMU has another feature: backup block job
Backup job has two file: one is source and another is the target. It has some running
mode. For block replication, we use the mode "sync=none". In this mode, we will read
the data from the source disk before we modify it, and write it to the target disk.
We keep a bitmap to remeber which sector is backuped from the source disk to the
target disk. If the target disk is an empty disk, and empty disk's backing file is
the source disk, we can read from the target disk to get the source disk's originnal data.


How does block replication work:
A. primary qemu:
1. use the block driver quorum: it will read from all children and write to all children.
   child 0: real disk
   child 1: nbd client
   reading from child 1 will fail, but we use the fifo mode. In this mode, we read from
   child 0 will success and we don't read from child 0
   write to child 1: because child 1 is nbd client, it will forward the write request to
   nbd server

B. secondary qemu:
We have 3 disks: active disk(called it A), hidden disk(called it H), and secondary disk
(real disk, called it S).
A's backing file is H, and H's backing file is S.
We also start a backup job: the source disk is S, and the target disk is H.
we run nbd server in secondary qemu. And the nbd server will write to S.

Before resuming both primary vm and secondary vm: the state is:
1. primary disk and secondary disk are in the consistent state(contain the same data)
2. active disk and hidden disk are the empty disk
When the guest is running:
1. NBD server receives the primary write operation and writes the data to S
2. Before we write data to S, the backup job will read the original data and backup it
   to H
3. The secondary vm will write data to A.
4. If secondary vm will read data from A:
   I. If the sector is allocated in A, read it from A.
  II. Otherwise, the secondary vm doesn't modify this sector after the latest is resumed.
 III. In this case, we read it from H. We can read S's original data from H(See the explanation
      In backup job).

If we have more than 1 real disk, we can use exportname to tag each disk. Each pair of primary disk and
secondary disk should have the same export name.

> 
>> +colo-host
>> +---------
>> +
>> +Description:           Secondary host's address
>> +Mandatory:             Yes when COLO enabled
> 
> Is it permitted to specify a host DNS name ?

IIRC, I think it is OK.

> 
>> +    if (libxl_defbool_val(disk->colo_enable)) {
>> +        tmp = xs_read(ctx->xsh, XBT_NULL,
>> +                      GCSPRINTF("%s/colo-port", be_path), &len);
>> +        if (!tmp) {
>> +            LOG(ERROR, "Missing xenstore node %s/colo-port", be_path);
>> +            goto cleanup;
>> +        }
>> +        disk->colo_port = tmp;
> 
> This is quite repetitive code.

Yes. Will introduce a new function to avoid it in the next version.

> 
> 
>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>> index 3610a39..bff08b0 100644
>> --- a/tools/libxl/libxl_create.c
>> +++ b/tools/libxl/libxl_create.c
>> @@ -1800,12 +1800,29 @@ static void domain_create_cb(libxl__egc *egc,
> ...
>> @@ -256,6 +280,36 @@ static int disk_try_backend(disk_try_backend_args *a,
>>      LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...",
>>          a->disk->vdev, libxl_disk_backend_to_string(backend));
>>      return 0;
>> +
>> + bad_colo:
>> +    LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with colo",
>> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
>> +    return 0;
> 
> This is correct here, I think.
> 
>> + bad_colo_host:
>> +    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-host=... for colo",
>> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
>> +    return 0;
> 
> I think these options should be checked later.  disk_try_backend isn't
> the place for general parameter checking; it is searching for which
> backend to try.

Hmm, do you mean we check it when we need to use COLO?

> 
>>  int libxl__device_disk_set_backend(libxl__gc *gc, libxl_device_disk *disk) {
>> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
>> index 4aca38e..ba17251 100644
>> --- a/tools/libxl/libxl_dm.c
>> +++ b/tools/libxl/libxl_dm.c
>> @@ -751,6 +751,139 @@ static int libxl__dm_runas_helper(libxl__gc *gc, const char *username)
> ...
>> +static char *qemu_disk_scsi_drive_string(libxl__gc *gc, const char *pdev_path,
>> +                                         int unit, const char *format,
>> +                                         const libxl_device_disk *disk,
>> +                                         int colo_mode)
>> +{
>> +    char *drive = NULL;
>> +    const char *exportname = disk->colo_export;
>> +    const char *active_disk = disk->active_disk;
>> +    const char *hidden_disk = disk->hidden_disk;
>> +
>> +    switch (colo_mode) {
>> +    case LIBXL__COLO_NONE:
>> +        drive = libxl__sprintf
>> +            (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
>> +             pdev_path, unit, format);
> 
> I think this would be a lot clearer if the refactoring was split into
> a seperate patch.

OK.

> 
>>                  if (strncmp(disks[i].vdev, "sd", 2) == 0) {
>> -                    drive = libxl__sprintf
>> -                        (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,readonly=%s,cache=writeback",
>> -                         pdev_path, disk, format, disks[i].readwrite ? "off" : "on");
>> +                    if (colo_mode == LIBXL__COLO_SECONDARY) {
>> +                        /*
>> +                         * -drive if=none,driver=format,file=pdev_path,\
>> +                         * id=exportname
>> +                         */
> 
> I think this comment adds nothing to the code and could be profitably
> omitted.

OK.

> 
>> +                        drive = libxl__sprintf
>> +                            (gc, "if=none,driver=%s,file=%s,id=%s",
>> +                             format, pdev_path, disks[i].colo_export);
> 
> I don't understand how this works here.  COLO_SECONDARY seems to
> suppress the rest of the disk specification.

COLO_SECONDAY will use two disks: one is for S, and one is for A.
H is sepecified in A. This line is for S, and the codes for A is
in the function qemu_disk_scsi_drive_string().

> 
> Also, this same logic seems to appear many times.  Maybe it could be
> centralised.  Perhaps I would be able to advise more clearly if I
> understood how this was put together.
> 
>> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
>> index 9b0a537..a2078d1 100644
>> --- a/tools/libxl/libxl_types.idl
>> +++ b/tools/libxl/libxl_types.idl
>> @@ -575,6 +575,13 @@ libxl_device_disk = Struct("device_disk", [
>>      ("is_cdrom", integer),
>>      ("direct_io_safe", bool),
>>      ("discard_enable", libxl_defbool),
>> +    ("colo_enable", libxl_defbool),
>> +    ("colo_restore_enable", libxl_defbool),
>> +    ("colo_host", string),
>> +    ("colo_port", string),
>> +    ("colo_export", string),
>> +    ("active_disk", string),
>> +    ("hidden_disk", string)
> 
> In general, many these should probably not be strings.  Certainly the
> port should be an integer.  I don't quite understand the semantics of
> the others.

Yes, the port should be an integer. Will fix it in the next version.

Thanks
Wen Congyang

> 
> Ian.
> 
> 
> .
>

Wen Congyang March 7, 2016, 2:10 a.m. UTC | #5

On 03/05/2016 04:30 AM, Konrad Rzeszutek Wilk wrote:
> On Fri, Mar 04, 2016 at 05:52:09PM +0000, Ian Jackson wrote:
>> Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
>>> +Enable COLO HA for disk. For better understanding block replication on
>>> +QEMU, please refer to:
>>> +http://wiki.qemu.org/Features/BlockReplication
>>
>> Sorry, I missed this link on my first pass.  I still think that at the
>> very least this needs something more user-facing (ie, how should one
>> set this up).
>>
>> But, I'm kind of worried that qemu is the wrong place to be doing
>> this.
>>
>> How can this be made to work with PV guests ?
> 
> QEMU can also serve PV guests (qdisk).
> 
> I think your question is more of - what about making this work with
> PV block backend?

I don't know how to work with PV block backend. It is one reason that
why we only support pure HVM now.
For PV block backend, there is also other problem. For exampe resuming
it in the secondary side is very slow, because we need to disconnect and
reconnect.

Thanks
Wen Congyang

>>
>> What if an HVM guest has PV-on-HVM drivers ?  In this case there might
>> be two relevant qemus, one for the qdisk Xen PV block backend, and one
>> for the emulated IDE.
> 
> In both cases QEMU would use the same underlaying API to actually write/read
> out the blocks. That API would then use NBD, etc to replicate writes.
> 
> Maybe a little ASCII art?
> 
> 	qdisk  ide
> 	  \    /
>            \  /
>            block API
>             |
>            QCOW2
>             |
>            NBD
> 
> Or such?
> 
>>
>> I don't understand how discrepant writes are detected.  Surely they
>> might occur and should trigger a resynch ?
>>
>> Ian.
> 
> 
> .
>

Wei Liu March 8, 2016, 5:22 p.m. UTC | #6

On Mon, Mar 07, 2016 at 10:10:07AM +0800, Wen Congyang wrote:
> On 03/05/2016 04:30 AM, Konrad Rzeszutek Wilk wrote:
> > On Fri, Mar 04, 2016 at 05:52:09PM +0000, Ian Jackson wrote:
> >> Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
> >>> +Enable COLO HA for disk. For better understanding block replication on
> >>> +QEMU, please refer to:
> >>> +http://wiki.qemu.org/Features/BlockReplication
> >>
> >> Sorry, I missed this link on my first pass.  I still think that at the
> >> very least this needs something more user-facing (ie, how should one
> >> set this up).
> >>
> >> But, I'm kind of worried that qemu is the wrong place to be doing
> >> this.
> >>
> >> How can this be made to work with PV guests ?
> > 
> > QEMU can also serve PV guests (qdisk).
> > 
> > I think your question is more of - what about making this work with
> > PV block backend?
> 
> I don't know how to work with PV block backend. It is one reason that
> why we only support pure HVM now.
> For PV block backend, there is also other problem. For exampe resuming
> it in the secondary side is very slow, because we need to disconnect and
> reconnect.
> 

Supporting PV guest is certainly going to be non-trivial. And I don't
think we would ever ask you to actually implement that.

The point is to have a story that when other people want to implement
COLO for PV-aware guests (PVHVM, PV and PVH), they are not crippled by
existing interfaces.

Currently the disk spec seems to be designed exclusively for QEMU. This
is not very desirable, but at least it wouldn't stop people from either
reusing them or inventing new parameters.

Furthermore, I think coming up with a story for PV-aware guests (PVHVM,
PV and PVH) is also non-trivial. For one the disk replication logic is
not implemented in PV block backend,  we're not sure how feasible to
replicate thing in QEMU into kernel, but we're quite sure it is not
going to be trivial technically and politically. The uncertainty is too
big to come up with a clear idea what it would look like.

Wei.

> Thanks
> Wen Congyang
> 
> >>
> >> What if an HVM guest has PV-on-HVM drivers ?  In this case there might
> >> be two relevant qemus, one for the qdisk Xen PV block backend, and one
> >> for the emulated IDE.
> > 
> > In both cases QEMU would use the same underlaying API to actually write/read
> > out the blocks. That API would then use NBD, etc to replicate writes.
> > 
> > Maybe a little ASCII art?
> > 
> > 	qdisk  ide
> > 	  \    /
> >            \  /
> >            block API
> >             |
> >            QCOW2
> >             |
> >            NBD
> > 
> > Or such?
> > 
> >>
> >> I don't understand how discrepant writes are detected.  Surely they
> >> might occur and should trigger a resynch ?
> >>
> >> Ian.
> > 
> > 
> > .
> > 
> 
> 
>

Konrad Rzeszutek Wilk March 9, 2016, 2:09 a.m. UTC | #7

> Furthermore, I think coming up with a story for PV-aware guests (PVHVM,
> PV and PVH) is also non-trivial. For one the disk replication logic is

Pls keep in mind that PV guests can use QEMU qdisk if they wish.

This is more of 'phy' vs 'file' question.
> not implemented in PV block backend,  we're not sure how feasible to
> replicate thing in QEMU into kernel, but we're quite sure it is not
> going to be trivial technically and politically. The uncertainty is too

.. and I think doing it in the kernel is not a good idea. With PVH
coming as the initial domain - we will now have solved the @$@#@@
syscall bounce to hypervisor slowpath - which means that any libaio
operations QEMU does on a file - which ends up going to the kernel
with a syscall - will happen without incuring a huge context switch!

In other words the speed difference between qdisk and xen-blkback
will now be just the syscall + aio code + fs overhead.

With proper batching this can be made faster*.

*: I am ignoring the other issues such as grant unmap in PVH guests
and qdisk being quite spartan in features compared to xen-blkback.

> big to come up with a clear idea what it would look like.

Wei Liu March 9, 2016, 4:55 p.m. UTC | #8

On Tue, Mar 08, 2016 at 09:09:20PM -0500, Konrad Rzeszutek Wilk wrote:
> > Furthermore, I think coming up with a story for PV-aware guests (PVHVM,
> > PV and PVH) is also non-trivial. For one the disk replication logic is
> 
> Pls keep in mind that PV guests can use QEMU qdisk if they wish.
> 

Oh right, I somehow talked myself into thinking blkback only.

> This is more of 'phy' vs 'file' question.
> > not implemented in PV block backend,  we're not sure how feasible to
> > replicate thing in QEMU into kernel, but we're quite sure it is not
> > going to be trivial technically and politically. The uncertainty is too
> 
> .. and I think doing it in the kernel is not a good idea. With PVH
> coming as the initial domain - we will now have solved the @$@#@@
> syscall bounce to hypervisor slowpath - which means that any libaio
> operations QEMU does on a file - which ends up going to the kernel
> with a syscall - will happen without incuring a huge context switch!
> 
> In other words the speed difference between qdisk and xen-blkback
> will now be just the syscall + aio code + fs overhead.
> 
> With proper batching this can be made faster*.
> 
> *: I am ignoring the other issues such as grant unmap in PVH guests
> and qdisk being quite spartan in features compared to xen-blkback.
> 

So now we do have a tangible story for PV-aware guest for disk
replication (note, other hindrance such as ckeckpointing the kernel in a
fast enough pace still needs to be solved) -- to use qdisk backend. That
means the current interfaces should be sufficient.

Wei.

> > big to come up with a clear idea what it would look like.

Ian Jackson March 17, 2016, 5:09 p.m. UTC | #9

Wei Liu writes ("Re: [PATCH v11 20/27] Support colo mode for qemu disk"):
> Supporting PV guest is certainly going to be non-trivial. And I don't
> think we would ever ask you to actually implement that.

Indeed.

> The point is to have a story that when other people want to implement
> COLO for PV-aware guests (PVHVM, PV and PVH), they are not crippled by
> existing interfaces.
> 
> Currently the disk spec seems to be designed exclusively for QEMU. This
> is not very desirable, but at least it wouldn't stop people from either
> reusing them or inventing new parameters.

I think in fact (following some in-person conversations) that I am
comfortable with this implementation requiring qemu right now.

Future PV[H] COLO arrangements might well use qdisk anyway.  I think
it is OK for libxl to decide that qemu is needed in this case.  All
that's needed now is to arrange that: if someone, in the future, wants
to make a version of COLO that works without qemu somehow then they
can do that without having to significantly change the libxl API.

So all that's needed is for the interface to libxl not to imply that
qemu is in use.  I don't think the interface proposed here implies
that qemu is in use.

The proposed interface seems to mostly speak about things which are
not qemu-specific, so it's probably OK.

Ian.

Ian Jackson March 17, 2016, 5:10 p.m. UTC | #10

Konrad Rzeszutek Wilk writes ("Re: [PATCH v11 20/27] Support colo mode for qemu disk"):
> On Fri, Mar 04, 2016 at 05:52:09PM +0000, Ian Jackson wrote:
> > How can this be made to work with PV guests ?
> 
> QEMU can also serve PV guests (qdisk).

Yes.

> I think your question is more of - what about making this work with
> PV block backend?

Yes.

> > What if an HVM guest has PV-on-HVM drivers ?  In this case there might
> > be two relevant qemus, one for the qdisk Xen PV block backend, and one
> > for the emulated IDE.
> 
> In both cases QEMU would use the same underlaying API to actually write/read
> out the blocks. That API would then use NBD, etc to replicate writes.
> 
> Maybe a little ASCII art?
> 
>       qdisk  ide
>         \    /
>          \  /
>          block API
>           |
>          QCOW2
>           |
>          NBD

Except that currently libxl may launch two qemus: one to serve
emulated ide requests, and one to service PV qdisk.

Ian.

Ian Jackson March 17, 2016, 5:18 p.m. UTC | #11

Wen Congyang writes ("Re: [PATCH v11 20/27] Support colo mode for qemu disk"):
> How does block replication work:

Thanks for this explanation, which is reallt helpful.

I would like to repeat back to you what I think I have understood:

Between resynchs, you allow each VM to run in parallel and to generate
possibly-divergent disk writes.

So on host B you retain both the A and B disk writes.  They are
stored as differences (qcow2) for performance reasons.

If A fails and it becomes necessary to resume the VM only on B, you
use B's version of the VM (both disk and memory).

If B fails then A can use the A version (disk and memory).

If the two are still up, but they diverge in network traffic, you
resynch the memory from A to B, and drop B's disk and replace it with
a copy of A's.

Have I understood correctly ?


If so, what software, where, arranges for the management of the
different qcow2 `layers' ?  Ie, what creates the layers; what resynchs
them, etc. ?

The reason I started asking all these questions is because of these
parameters in your disk config:

 +    ("colo_enable", libxl_defbool),
 +    ("colo_restore_enable", libxl_defbool),
 +    ("colo_host", string),
 +    ("colo_port", string),
 +    ("colo_export", string),
 +    ("active_disk", string),
 +    ("hidden_disk", string)

For COLO to work properly it is necessary that the `active_disk' and
`hidden_disk' to relate in a specific way to the main disk: they must
be related snapshots (qcow2, currently).

Would it be possible for these disk names to have formulaic,
predicatable, names, so that they wouldn't need to be specified
separately ?

Is there any value in being able to specify them separately ?

Ian.

Wen Congyang March 18, 2016, 5:42 a.m. UTC | #12

On 03/18/2016 01:18 AM, Ian Jackson wrote:
> Wen Congyang writes ("Re: [PATCH v11 20/27] Support colo mode for qemu disk"):
>> How does block replication work:
> 
> Thanks for this explanation, which is reallt helpful.
> 
> I would like to repeat back to you what I think I have understood:
> 
> Between resynchs, you allow each VM to run in parallel and to generate
> possibly-divergent disk writes.
> 
> So on host B you retain both the A and B disk writes.  They are
> stored as differences (qcow2) for performance reasons.
> 
> If A fails and it becomes necessary to resume the VM only on B, you
> use B's version of the VM (both disk and memory).
> 
> If B fails then A can use the A version (disk and memory).
> 
> If the two are still up, but they diverge in network traffic, you
> resynch the memory from A to B, and drop B's disk and replace it with
> a copy of A's.
> 
> Have I understood correctly ?

Yes.

> 
> 
> If so, what software, where, arranges for the management of the
> different qcow2 `layers' ?  Ie, what creates the layers; what resynchs
> them, etc. ?

active disk and hidden disk are seperate disk. The management application
can create an empty qcow disk before running COLO. These two disks are
empty disk, and have the same size with the secondary disk.

Thanks
Wen Congyang

> 
> The reason I started asking all these questions is because of these
> parameters in your disk config:
> 
>  +    ("colo_enable", libxl_defbool),
>  +    ("colo_restore_enable", libxl_defbool),
>  +    ("colo_host", string),
>  +    ("colo_port", string),
>  +    ("colo_export", string),
>  +    ("active_disk", string),
>  +    ("hidden_disk", string)
> 
> For COLO to work properly it is necessary that the `active_disk' and
> `hidden_disk' to relate in a specific way to the main disk: they must
> be related snapshots (qcow2, currently).
> 
> Would it be possible for these disk names to have formulaic,
> predicatable, names, so that they wouldn't need to be specified
> separately ?
> 
> Is there any value in being able to specify them separately ?
> 
> Ian.
> 
> 
> .
>

[v11,20/27] Support colo mode for qemu disk

Commit Message

Comments

Patch