[v3,5/6] replication: Implement block replication for shared disk case

Message ID 1484884080-28836-6-git-send-email-zhang.zhanghailiang@huawei.com (mailing list archive)
State New, archived

Commit Message

Zhanghailiang Jan. 20, 2017, 3:47 a.m. UTC
As in the non-shared disk case, block replication for a shared disk
is built from basic building blocks that already exist in QEMU.
The architecture is:

         virtio-blk                     ||                               .----------
             /                          ||                               | Secondary
            /                           ||                               '----------
           /                            ||                                 virtio-blk
          /                             ||                                      |
          |                             ||                               replication(5)
          |                    NBD  -------->   NBD   (2)                       |
          |                  client     ||    server ---> hidden disk <-- active disk(4)
          |                     ^       ||                      |
          |              replication(1) ||                      |
          |                     |       ||                      |
          |   +-----------------'       ||                      |
         (3)  |drive-backup sync=none   ||                      |
--------. |   +-----------------+       ||                      |
Primary | |                     |       ||           backing    |
--------' |                     |       ||                      |
          V                     |                               |
       +-------------------------------------------+            |
       |               shared disk                 | <----------+
       +-------------------------------------------+

    1) On a Primary write, the original data is read first and forwarded to
       the Secondary QEMU.
    2) The hidden disk is created automatically. It buffers the original content
       of sectors that the Primary VM modifies. It must start out empty, and
       its driver must support bdrv_make_empty() and backing files.
    3) Primary write requests are written to the shared disk.
    4) Secondary write requests are buffered in the active disk, overwriting
       any existing sector content in that buffer.

Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Zhang Chen <zhangchen.fnst@cn.fujitsu.com>
---
 block/replication.c | 48 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 42 insertions(+), 6 deletions(-)
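
For readers following the diagram, the data flow in steps 1) to 4) can be
modelled with a toy C sketch. It is only an illustration under stated
assumptions, not code from block/replication.c: the Model and Overlay types
and every function name below are hypothetical, each sector is a single byte,
and the checkpoint step (emptying the active and hidden disks) is assumed to
behave as in the non-shared disk case.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NSECTORS 8

/* Hypothetical overlay: holds only the sectors written into it. */
typedef struct {
    char data[NSECTORS];
    bool present[NSECTORS];
} Overlay;

/* Hypothetical model of the three layers in the diagram. */
typedef struct {
    char shared[NSECTORS]; /* shared disk, written by the Primary   */
    Overlay hidden;        /* original content saved before a write */
    Overlay active;        /* writes issued by the Secondary guest  */
} Model;

void overlay_clear(Overlay *o)
{
    memset(o, 0, sizeof(*o));
}

/* initial must point to at least NSECTORS bytes. */
void model_init(Model *m, const char *initial)
{
    memcpy(m->shared, initial, NSECTORS);
    overlay_clear(&m->hidden);
    overlay_clear(&m->active);
}

/* Steps 1) and 3): save the original sector once, then write the shared disk. */
void primary_write(Model *m, int sector, char value)
{
    assert(sector >= 0 && sector < NSECTORS);
    if (!m->hidden.present[sector]) {
        m->hidden.data[sector] = m->shared[sector];
        m->hidden.present[sector] = true;
    }
    m->shared[sector] = value;
}

/* Step 4): Secondary writes only touch the active disk. */
void secondary_write(Model *m, int sector, char value)
{
    assert(sector >= 0 && sector < NSECTORS);
    m->active.data[sector] = value;
    m->active.present[sector] = true;
}

/* Secondary reads walk the backing chain: active, hidden, shared. */
char secondary_read(const Model *m, int sector)
{
    assert(sector >= 0 && sector < NSECTORS);
    if (m->active.present[sector]) {
        return m->active.data[sector];
    }
    if (m->hidden.present[sector]) {
        return m->hidden.data[sector];
    }
    return m->shared[sector];
}

/* Assumed checkpoint behaviour: drop both buffers. */
void checkpoint(Model *m)
{
    overlay_clear(&m->active);
    overlay_clear(&m->hidden);
}
```

With a disk initialised to all 'A', a Primary write of 'B' to sector 2 leaves
secondary_read() returning 'A' until checkpoint() is called, after which it
returns 'B'.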

Comments

Stefan Hajnoczi Feb. 27, 2017, 5:37 p.m. UTC | #1
On Fri, Jan 20, 2017 at 11:47:59AM +0800, zhanghailiang wrote:
> [...]

Are there any restrictions on the shared disk?  For example the -drive
cache= mode must be 'none'.  If the cache mode isn't 'none' the
secondary host might have old data in the host page cache.  The
Secondary QEMU would have an inconsistent view of the shared disk.

Are image file formats like qcow2 supported for the shared disk?  Extra
steps are required to achieve consistency, see bdrv_invalidate_cache().

Stefan
Zhanghailiang March 7, 2017, 2:30 p.m. UTC | #2
Hi Stefan,

Sorry for the delayed reply.

On 2017/2/28 1:37, Stefan Hajnoczi wrote:
> On Fri, Jan 20, 2017 at 11:47:59AM +0800, zhanghailiang wrote:
>> [...]
>
> Are there any restrictions on the shared disk?  For example the -drive
> cache= mode must be 'none'.  If the cache mode isn't 'none' the
> secondary host might have old data in the host page cache.  The

When doing a checkpoint, we call vm_stop(), which in turn calls
bdrv_flush_all(). Is that enough?

> Secondary QEMU would have an inconsistent view of the shared disk.
>
> Are image file formats like qcow2 supported for the shared disk?  Extra

In the above scenario, there is no restriction on the format of the shared disk.

> steps are required to achieve consistency, see bdrv_invalidate_cache().
>

Hmm, in that case, we should call bdrv_invalidate_cache_all() during a checkpoint.


Thanks,
Hailiang

> Stefan
>
Stefan Hajnoczi March 10, 2017, 4:17 a.m. UTC | #3
On Tue, Mar 07, 2017 at 10:30:30PM +0800, Hailiang Zhang wrote:
> Hi Stefan,
> 
> Sorry for the delayed reply.
> 
> On 2017/2/28 1:37, Stefan Hajnoczi wrote:
> > On Fri, Jan 20, 2017 at 11:47:59AM +0800, zhanghailiang wrote:
> > > [...]
> > 
> > Are there any restrictions on the shared disk?  For example the -drive
> > cache= mode must be 'none'.  If the cache mode isn't 'none' the
> > secondary host might have old data in the host page cache.  The
> 
> When doing a checkpoint, we call vm_stop(), which in turn calls
> bdrv_flush_all(). Is that enough?
> 
> > Secondary QEMU would have an inconsistent view of the shared disk.
> > 
> > Are image file formats like qcow2 supported for the shared disk?  Extra
> 
> In the above scenario, there is no restriction on the format of the shared disk.
> 
> > steps are required to achieve consistency, see bdrv_invalidate_cache().
> > 
> 
> Hmm, in that case, we should call bdrv_invalidate_cache_all() during a checkpoint.

Yes, it's not enough to just call bdrv_drain_all()/bdrv_flush_all().
The Secondary may need to reread metadata that is loaded in memory (e.g.
qcow2's L2 table cache) so bdrv_invalidate_cache() is needed.

Stefan
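
The point about flushing versus invalidating can be illustrated with a small,
self-contained C model. It is a sketch only: SharedImage, Writer, Reader and
all function names are hypothetical stand-ins, not QEMU APIs. writer_flush()
plays the role of bdrv_flush_all() on the Primary, reader_invalidate() plays
the role of bdrv_invalidate_cache() on the Secondary, and the mapping table
stands in for in-memory metadata such as qcow2's L2 table cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NCLUSTERS 4

/* Hypothetical on-disk metadata: guest cluster -> host cluster. */
typedef struct {
    int table[NCLUSTERS];
} SharedImage;

/* Primary side: updates metadata in memory, writes it out on flush. */
typedef struct {
    SharedImage *img;
    int table[NCLUSTERS];
} Writer;

/* Secondary side: caches metadata in memory after the first lookup. */
typedef struct {
    SharedImage *img;
    int cached[NCLUSTERS];
    bool cache_valid;
} Reader;

void writer_open(Writer *w, SharedImage *img)
{
    w->img = img;
    memcpy(w->table, img->table, sizeof(w->table));
}

/* Change a mapping in the writer's memory only. */
void writer_remap(Writer *w, int guest, int host)
{
    assert(guest >= 0 && guest < NCLUSTERS);
    w->table[guest] = host;
}

/* Stand-in for a flush: push the in-memory metadata to the image. */
void writer_flush(Writer *w)
{
    memcpy(w->img->table, w->table, sizeof(w->table));
}

void reader_open(Reader *r, SharedImage *img)
{
    r->img = img;
    r->cache_valid = false;
}

/* Loads the table from the image only when the cache is not valid. */
int reader_lookup(Reader *r, int guest)
{
    assert(guest >= 0 && guest < NCLUSTERS);
    if (!r->cache_valid) {
        memcpy(r->cached, r->img->table, sizeof(r->cached));
        r->cache_valid = true;
    }
    return r->cached[guest];
}

/* Stand-in for cache invalidation: force a reload on the next lookup. */
void reader_invalidate(Reader *r)
{
    r->cache_valid = false;
}
```

In this model, a lookup made by the reader after writer_flush() still returns
the old mapping, because the flush only updates the image; the new mapping
becomes visible only once reader_invalidate() has been called.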

Patch

diff --git a/block/replication.c b/block/replication.c
index 70ec08c..a0b3e41 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -233,7 +233,7 @@  static coroutine_fn int replication_co_readv(BlockDriverState *bs,
                                              QEMUIOVector *qiov)
 {
     BDRVReplicationState *s = bs->opaque;
-    BdrvChild *child = s->secondary_disk;
+    BdrvChild *child = s->is_shared_disk ? s->primary_disk : s->secondary_disk;
     BlockJob *job = NULL;
     CowRequest req;
     int ret;
@@ -415,7 +415,12 @@  static void backup_job_completed(void *opaque, int ret)
         s->error = -EIO;
     }
 
-    backup_job_cleanup(bs);
+    if (s->mode == REPLICATION_MODE_PRIMARY) {
+        s->replication_state = BLOCK_REPLICATION_DONE;
+        s->error = 0;
+    } else {
+        backup_job_cleanup(bs);
+    }
 }
 
 static bool check_top_bs(BlockDriverState *top_bs, BlockDriverState *bs)
@@ -467,6 +472,19 @@  static void replication_start(ReplicationState *rs, ReplicationMode mode,
 
     switch (s->mode) {
     case REPLICATION_MODE_PRIMARY:
+        if (s->is_shared_disk) {
+            job = backup_job_create(NULL, s->primary_disk->bs, bs, 0,
+                MIRROR_SYNC_MODE_NONE, NULL, false, BLOCKDEV_ON_ERROR_REPORT,
+                BLOCKDEV_ON_ERROR_REPORT, BLOCK_JOB_INTERNAL,
+                backup_job_completed, bs, NULL, &local_err);
+            if (local_err) {
+                error_propagate(errp, local_err);
+                backup_job_cleanup(bs);
+                aio_context_release(aio_context);
+                return;
+            }
+            block_job_start(job);
+        }
         break;
     case REPLICATION_MODE_SECONDARY:
         s->active_disk = bs->file;
@@ -485,7 +503,8 @@  static void replication_start(ReplicationState *rs, ReplicationMode mode,
         }
 
         s->secondary_disk = s->hidden_disk->bs->backing;
-        if (!s->secondary_disk->bs || !bdrv_has_blk(s->secondary_disk->bs)) {
+        if (!s->secondary_disk->bs ||
+            (!s->is_shared_disk && !bdrv_has_blk(s->secondary_disk->bs))) {
             error_setg(errp, "The secondary disk doesn't have block backend");
             aio_context_release(aio_context);
             return;
@@ -580,11 +599,24 @@  static void replication_do_checkpoint(ReplicationState *rs, Error **errp)
 
     switch (s->mode) {
     case REPLICATION_MODE_PRIMARY:
+        if (s->is_shared_disk) {
+            if (!s->primary_disk->bs->job) {
+                error_setg(errp, "Primary backup job was cancelled"
+                           " unexpectedly");
+                break;
+            }
+
+            backup_do_checkpoint(s->primary_disk->bs->job, &local_err);
+            if (local_err) {
+                error_propagate(errp, local_err);
+            }
+        }
         break;
     case REPLICATION_MODE_SECONDARY:
         if (!s->is_shared_disk) {
             if (!s->secondary_disk->bs->job) {
-                error_setg(errp, "Backup job was cancelled unexpectedly");
+                error_setg(errp, "Secondary backup job was cancelled"
+                           " unexpectedly");
                 break;
             }
             backup_do_checkpoint(s->secondary_disk->bs->job, &local_err);
@@ -663,8 +695,12 @@  static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
 
     switch (s->mode) {
     case REPLICATION_MODE_PRIMARY:
-        s->replication_state = BLOCK_REPLICATION_DONE;
-        s->error = 0;
+        if (s->is_shared_disk && s->primary_disk->bs->job) {
+            block_job_cancel(s->primary_disk->bs->job);
+        } else {
+            s->replication_state = BLOCK_REPLICATION_DONE;
+            s->error = 0;
+        }
         break;
     case REPLICATION_MODE_SECONDARY:
         /*