Message ID | 1480926904-17596-2-git-send-email-zhang.zhanghailiang@huawei.com (mailing list archive)
---|---
State | New, archived
On 12/05/2016 04:34 PM, zhanghailiang wrote:
> Introduce the scenario of shared-disk block replication
> and how to use it.
>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Zhang Chen <zhangchen.fnst@cn.fujitsu.com>
> ---
> v2:
>  - fix some problems found by Changlong
> ---
>  docs/block-replication.txt | 139 +++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 135 insertions(+), 4 deletions(-)
>
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> index 6bde673..fbfe005 100644
> --- a/docs/block-replication.txt
> +++ b/docs/block-replication.txt
> @@ -24,7 +24,7 @@ only dropped at next checkpoint time. To reduce the network transportation
>  effort during a vmstate checkpoint, the disk modification operations of
>  the Primary disk are asynchronously forwarded to the Secondary node.
>
> -== Workflow ==
> +== Non-shared disk workflow ==
>  The following is the image of block replication workflow:
>
>          +----------------------+            +------------------------+
> @@ -57,7 +57,7 @@ The following is the image of block replication workflow:
>      4) Secondary write requests will be buffered in the Disk buffer and it
>         will overwrite the existing sector content in the buffer.
>
> -== Architecture ==
> +== Non-shared disk architecture ==
>  We are going to implement block replication from many basic
>  blocks that are already in QEMU.
>
> @@ -106,6 +106,74 @@ any state that would otherwise be lost by the speculative write-through
>  of the NBD server into the secondary disk. So before block replication,
>  the primary disk and secondary disk should contain the same data.
>
> +== Shared Disk Mode Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                     |
> +                  |                                    (4)
> +                  |                                     V
> +                  |                              /-------------\
> +                  | (2)Forward and write through |             |
> +                  |          +-----------------> | Disk Buffer |
> +                  |          |                   |             |
> +                  |          |                   \-------------/
> +                  |          |(1)read                |
> +                  |          |                       |
> + (3)write         |          |                       | backing file
> +                  V          |                       |
> +        +-----------------------------+              |
> +        |         Shared Disk         | <------------+
> +        +-----------------------------+
> +
> +    1) Primary writes will read original data and forward it to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Shared disk, the
> +       original sector content will be read from Shared disk and
> +       forwarded and buffered in the Disk buffer on the secondary site,
> +       but it will not overwrite the existing sector content (it could be
> +       from either "Secondary Write Requests" or previous COW of "Primary
> +       Write Requests") in the Disk buffer.
> +    3) Primary write requests will be written to Shared disk.
> +    4) Secondary write requests will be buffered in the Disk buffer and it
> +       will overwrite the existing sector content in the buffer.
> +
> +== Shared Disk Mode Architecture ==
> +We are going to implement block replication from many basic
> +blocks that are already in QEMU.
> +         virtio-blk                    ||
> +               /                       ||                              .----------
> +              /                        ||                              | Secondary
> +             /                         ||                              '----------
> +            /                          ||                               virtio-blk
> +           |                           ||                                   |
> +           |                           ||                            replication(5)
> +           |      NBD  ------------------>  NBD    (2)                      |
> +           |    client                 ||  server ---> hidden disk <-- active disk(4)
> +           |       ^                   ||                   |
> +           | replication(1)            ||                   |
> +           |       |                   ||                   |
> +      (3)  |       | drive-backup sync=none ||              |
> +--------.  |       |                   ||                   |
> +Primary |  |       |                   ||          backing  |
> +--------'  |       |                   ||                   |
> +           V       |                                        |
> +  +-------------------------------------------+             |
> +  |                shared disk                | <-----------+
> +  +-------------------------------------------+
> +
> +
> +    1) Primary writes will read original data and forward it to Secondary
> +       QEMU.
> +    2) The hidden-disk buffers the original content that is modified by the
> +       primary VM. It should also be an empty disk, and the driver supports
> +       bdrv_make_empty() and backing file.
> +    3) Primary write requests will be written to Shared disk.
> +    4) Secondary write requests will be buffered in the active disk and it
> +       will overwrite the existing sector content in the buffer.
> +
>  == Failure Handling ==
>  There are 7 internal errors when block replication is running:
>  1. I/O error on primary disk
> @@ -145,7 +213,7 @@ d. replication_stop_all()
>     things except failover. The caller must hold the I/O mutex lock if it is
>     in migration/checkpoint thread.
>
> -== Usage ==
> +== Non-shared disk usage ==
>  Primary:
>    -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
>           children.0.file.filename=1.raw,\
> @@ -234,6 +302,69 @@ Secondary:
>  The primary host is down, so we should do the following thing:
>  { 'execute': 'nbd-server-stop' }
>
> +== Shared disk usage ==
> +Primary:
> +  -drive if=virtio,id=primary_disk0,file.filename=1.raw,driver=raw
> +
> +Issue qmp command:
> +  { 'execute': 'blockdev-add',
> +    'arguments': {
> +        'driver': 'replication',
> +        'node-name': 'rep',
> +        'mode': 'primary',
> +        'shared-disk-id': 'primary_disk0',
> +        'shared-disk': true,
> +        'file': {
> +            'driver': 'nbd',
> +            'export': 'hidden_disk0',
> +            'server': {
> +                'type': 'inet',
> +                'data': {
> +                    'host': 'xxx.xxx.xxx.xxx',
> +                    'port': 'yyy'
> +                }
> +            }
> +        }
> +    }
> +  }
> +
> +Secondary:
> +  -drive if=none,driver=qcow2,file.filename=/mnt/ramfs/hidden_disk.img,id=hidden_disk0,\
> +         backing.driver=raw,backing.file.filename=1.raw \
> +  -drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
> +         file.driver=qcow2,top-id=active-disk0,\
> +         file.file.filename=/mnt/ramfs/active_disk.img,\
> +         file.backing=hidden_disk0,shared-disk=on
> +
> +Issue qmp command:
> +1. { 'execute': 'nbd-server-start',
> +     'arguments': {
> +         'addr': {
> +             'type': 'inet',
> +             'data': {
> +                 'host': '0',
> +                 'port': 'yyy'
> +             }
> +         }
> +     }
> +   }
> +2. { 'execute': 'nbd-server-add',
> +     'arguments': {
> +         'device': 'hidden_disk0',
> +         'writable': true
> +     }
> +   }
> +
> +After Failover:
> +Primary:
> +  { 'execute': 'x-blockdev-del',
> +    'arguments': {
> +        'node-name': 'rep'
> +    }
> +  }
> +
> +Secondary:
> +  { 'execute': 'nbd-server-stop' }
> +
>  TODO:
>  1. Continuous block replication
> -2. Shared disk

Looks good to me.

Reviewed-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
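The ordering that the shared-disk workflow above relies on (read the old data, forward it, only then overwrite the shared disk) can be summarised in a short sketch. This is purely illustrative and not code from the patch; shared_disk and forward_to_secondary are hypothetical stand-ins for the real block-layer machinery.

    def primary_write(shared_disk, forward_to_secondary, offset, data):
        """Illustrative ordering of one primary write in shared-disk mode.

        Mirrors steps (1)-(3) of the workflow in the patch above; every
        name here is a hypothetical placeholder, not a QEMU API.
        """
        old = shared_disk.read(offset, len(data))  # (1) read the original content
        forward_to_secondary(offset, old)          # (2) forward it; the secondary buffers
                                                   #     it copy-on-write in its disk buffer
        shared_disk.write(offset, data)            # (3) only then overwrite the shared disk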
On Mon, Dec 05, 2016 at 04:34:59PM +0800, zhanghailiang wrote:
> +Issue qmp command:
> +  { 'execute': 'blockdev-add',
> +    'arguments': {
> +        'driver': 'replication',
> +        'node-name': 'rep',
> +        'mode': 'primary',
> +        'shared-disk-id': 'primary_disk0',
> +        'shared-disk': true,
> +        'file': {
> +            'driver': 'nbd',
> +            'export': 'hidden_disk0',
> +            'server': {
> +                'type': 'inet',
> +                'data': {
> +                    'host': 'xxx.xxx.xxx.xxx',
> +                    'port': 'yyy'
> +                }
> +            }

block/nbd.c does not have good error handling and recovery in case there
is a network issue.  There are no reconnection attempts or timeouts that
deal with a temporary loss of network connectivity.

This is a general problem with block/nbd.c and not something to solve in
this patch series.  I'm just mentioning it because it may affect COLO
replication.

I'm sure these limitations in block/nbd.c can be fixed but it will take
some effort.  Maybe block/sheepdog.c, net/socket.c, and other network
code could also benefit from generic network connection recovery.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
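The missing-timeout problem Stefan describes can be illustrated generically: a client that enables TCP keepalive and sets a per-request timeout will eventually see a dead link as an error instead of blocking forever. The sketch below is not QEMU or block/nbd.c code, and the host, port and timeout values are placeholders.

    import socket

    def connect_with_keepalive(host, port, io_timeout=30.0):
        """Open a TCP connection that notices an unresponsive peer.

        Generic illustration of the reconnect/timeout handling discussed
        above; not QEMU code, and all values are placeholders.
        """
        sock = socket.create_connection((host, port), timeout=10.0)
        # TCP keepalive lets the kernel detect a silently dropped link.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        if hasattr(socket, "TCP_KEEPIDLE"):  # Linux-specific tuning
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
        # Per-request timeout: recv()/send() raise socket.timeout instead
        # of blocking forever when the peer stops responding.
        sock.settimeout(io_timeout)
        return sock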
On 2017/1/13 21:41, Stefan Hajnoczi wrote:
> On Mon, Dec 05, 2016 at 04:34:59PM +0800, zhanghailiang wrote:
>> +Issue qmp command:
>> +  { 'execute': 'blockdev-add',
>> +    'arguments': {
>> +        'driver': 'replication',
>> +        'node-name': 'rep',
>> +        'mode': 'primary',
>> +        'shared-disk-id': 'primary_disk0',
>> +        'shared-disk': true,
>> +        'file': {
>> +            'driver': 'nbd',
>> +            'export': 'hidden_disk0',
>> +            'server': {
>> +                'type': 'inet',
>> +                'data': {
>> +                    'host': 'xxx.xxx.xxx.xxx',
>> +                    'port': 'yyy'
>> +                }
>> +            }
>
> block/nbd.c does not have good error handling and recovery in case there
> is a network issue.  There are no reconnection attempts or timeouts that
> deal with a temporary loss of network connectivity.
>
> This is a general problem with block/nbd.c and not something to solve in
> this patch series.  I'm just mentioning it because it may affect COLO
> replication.
>
> I'm sure these limitations in block/nbd.c can be fixed but it will take
> some effort.  Maybe block/sheepdog.c, net/socket.c, and other network
> code could also benefit from generic network connection recovery.
>

Hmm, good suggestion, but IMHO COLO is a little different from other
scenarios here: even if a reconnection method were implemented, it would
still need a mechanism to tell a temporary loss of network connectivity
from a real break in the network connection.

I did a simple test: just ifconfig down the network card that is used by
block replication. It seems that NBD in QEMU doesn't have the ability to
detect that the connection has been broken; there were no error reports,
and COLO just got stuck in vm_stop() where it called aio_poll().

Thanks,
Hailiang

> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>
On Thu, Jan 19, 2017 at 10:50:19AM +0800, Hailiang Zhang wrote:
> On 2017/1/13 21:41, Stefan Hajnoczi wrote:
> > On Mon, Dec 05, 2016 at 04:34:59PM +0800, zhanghailiang wrote:
> > > +Issue qmp command:
> > > +  { 'execute': 'blockdev-add',
> > > +    'arguments': {
> > > +        'driver': 'replication',
> > > +        'node-name': 'rep',
> > > +        'mode': 'primary',
> > > +        'shared-disk-id': 'primary_disk0',
> > > +        'shared-disk': true,
> > > +        'file': {
> > > +            'driver': 'nbd',
> > > +            'export': 'hidden_disk0',
> > > +            'server': {
> > > +                'type': 'inet',
> > > +                'data': {
> > > +                    'host': 'xxx.xxx.xxx.xxx',
> > > +                    'port': 'yyy'
> > > +                }
> > > +            }
> >
> > block/nbd.c does not have good error handling and recovery in case there
> > is a network issue.  There are no reconnection attempts or timeouts that
> > deal with a temporary loss of network connectivity.
> >
> > This is a general problem with block/nbd.c and not something to solve in
> > this patch series.  I'm just mentioning it because it may affect COLO
> > replication.
> >
> > I'm sure these limitations in block/nbd.c can be fixed but it will take
> > some effort.  Maybe block/sheepdog.c, net/socket.c, and other network
> > code could also benefit from generic network connection recovery.
> >
>
> Hmm, good suggestion, but IMHO COLO is a little different from other
> scenarios here: even if a reconnection method were implemented, it would
> still need a mechanism to tell a temporary loss of network connectivity
> from a real break in the network connection.
>
> I did a simple test: just ifconfig down the network card that is used by
> block replication. It seems that NBD in QEMU doesn't have the ability to
> detect that the connection has been broken; there were no error reports,
> and COLO just got stuck in vm_stop() where it called aio_poll().

Yes, this is the vm_stop() problem again.  There is no reliable way to
cancel I/O requests so instead QEMU waits...forever.  A solution is
needed so COLO doesn't hang on network failure.

I'm not sure how to solve the problem.  The secondary still has the last
successful checkpoint so it could resume instead of waiting for the
current checkpoint to commit.

There may still be NBD I/O in flight, so we would need to drain it or
fence storage to prevent interference once the secondary VM is running.

Stefan
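The "drain it or fence storage" idea can be sketched generically: wait for in-flight requests, but only up to a deadline, and let the caller fence the storage and resume from the last checkpoint when the deadline passes. This is an illustration of the approach being discussed, not an implementation; in_flight is a hypothetical callable standing in for real block-layer state.

    import time

    def drain_with_deadline(in_flight, deadline_s=5.0, poll_interval_s=0.1):
        """Wait for outstanding requests to finish, but never forever.

        `in_flight` is assumed to return the number of outstanding
        requests; returns False when the deadline passes, in which case
        the caller would fence storage and fail over.
        """
        deadline = time.monotonic() + deadline_s
        while in_flight() > 0:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval_s)
        return True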
On 2017/1/20 0:41, Stefan Hajnoczi wrote:
> On Thu, Jan 19, 2017 at 10:50:19AM +0800, Hailiang Zhang wrote:
>> On 2017/1/13 21:41, Stefan Hajnoczi wrote:
>>> On Mon, Dec 05, 2016 at 04:34:59PM +0800, zhanghailiang wrote:
>>>> +Issue qmp command:
>>>> +  { 'execute': 'blockdev-add',
>>>> +    'arguments': {
>>>> +        'driver': 'replication',
>>>> +        'node-name': 'rep',
>>>> +        'mode': 'primary',
>>>> +        'shared-disk-id': 'primary_disk0',
>>>> +        'shared-disk': true,
>>>> +        'file': {
>>>> +            'driver': 'nbd',
>>>> +            'export': 'hidden_disk0',
>>>> +            'server': {
>>>> +                'type': 'inet',
>>>> +                'data': {
>>>> +                    'host': 'xxx.xxx.xxx.xxx',
>>>> +                    'port': 'yyy'
>>>> +                }
>>>> +            }
>>>
>>> block/nbd.c does not have good error handling and recovery in case there
>>> is a network issue.  There are no reconnection attempts or timeouts that
>>> deal with a temporary loss of network connectivity.
>>>
>>> This is a general problem with block/nbd.c and not something to solve in
>>> this patch series.  I'm just mentioning it because it may affect COLO
>>> replication.
>>>
>>> I'm sure these limitations in block/nbd.c can be fixed but it will take
>>> some effort.  Maybe block/sheepdog.c, net/socket.c, and other network
>>> code could also benefit from generic network connection recovery.
>>>
>>
>> Hmm, good suggestion, but IMHO COLO is a little different from other
>> scenarios here: even if a reconnection method were implemented, it would
>> still need a mechanism to tell a temporary loss of network connectivity
>> from a real break in the network connection.
>>
>> I did a simple test: just ifconfig down the network card that is used by
>> block replication. It seems that NBD in QEMU doesn't have the ability to
>> detect that the connection has been broken; there were no error reports,
>> and COLO just got stuck in vm_stop() where it called aio_poll().
>
> Yes, this is the vm_stop() problem again.  There is no reliable way to
> cancel I/O requests so instead QEMU waits...forever.  A solution is
> needed so COLO doesn't hang on network failure.
>

Yes, COLO needs to detect this situation and cancel the requests in a
proper way.

> I'm not sure how to solve the problem.  The secondary still has the last
> successful checkpoint so it could resume instead of waiting for the
> current checkpoint to commit.
>
> There may still be NBD I/O in flight, so we would need to drain it or
> fence storage to prevent interference once the secondary VM is running.
>

Agreed, we need to think about this carefully. We'll work on these
reliability improvements later, after COLO's basic functionality is
complete.

Thanks,
Hailiang

> Stefan
>
diff --git a/docs/block-replication.txt b/docs/block-replication.txt
index 6bde673..fbfe005 100644
--- a/docs/block-replication.txt
+++ b/docs/block-replication.txt
@@ -24,7 +24,7 @@ only dropped at next checkpoint time. To reduce the network transportation
 effort during a vmstate checkpoint, the disk modification operations of
 the Primary disk are asynchronously forwarded to the Secondary node.
 
-== Workflow ==
+== Non-shared disk workflow ==
 The following is the image of block replication workflow:
 
         +----------------------+            +------------------------+
@@ -57,7 +57,7 @@ The following is the image of block replication workflow:
     4) Secondary write requests will be buffered in the Disk buffer and it
        will overwrite the existing sector content in the buffer.
 
-== Architecture ==
+== Non-shared disk architecture ==
 We are going to implement block replication from many basic
 blocks that are already in QEMU.
 
@@ -106,6 +106,74 @@ any state that would otherwise be lost by the speculative write-through
 of the NBD server into the secondary disk. So before block replication,
 the primary disk and secondary disk should contain the same data.
 
+== Shared Disk Mode Workflow ==
+The following is the image of block replication workflow:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                     |
+                  |                                    (4)
+                  |                                     V
+                  |                              /-------------\
+                  | (2)Forward and write through |             |
+                  |          +-----------------> | Disk Buffer |
+                  |          |                   |             |
+                  |          |                   \-------------/
+                  |          |(1)read                |
+                  |          |                       |
+ (3)write         |          |                       | backing file
+                  V          |                       |
+        +-----------------------------+              |
+        |         Shared Disk         | <------------+
+        +-----------------------------+
+
+    1) Primary writes will read original data and forward it to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Shared disk, the
+       original sector content will be read from Shared disk and
+       forwarded and buffered in the Disk buffer on the secondary site,
+       but it will not overwrite the existing sector content (it could be
+       from either "Secondary Write Requests" or previous COW of "Primary
+       Write Requests") in the Disk buffer.
+    3) Primary write requests will be written to Shared disk.
+    4) Secondary write requests will be buffered in the Disk buffer and it
+       will overwrite the existing sector content in the buffer.
+
+== Shared Disk Mode Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+         virtio-blk                    ||
+               /                       ||                              .----------
+              /                        ||                              | Secondary
+             /                         ||                              '----------
+            /                          ||                               virtio-blk
+           |                           ||                                   |
+           |                           ||                            replication(5)
+           |      NBD  ------------------>  NBD    (2)                      |
+           |    client                 ||  server ---> hidden disk <-- active disk(4)
+           |       ^                   ||                   |
+           | replication(1)            ||                   |
+           |       |                   ||                   |
+      (3)  |       | drive-backup sync=none ||              |
+--------.  |       |                   ||                   |
+Primary |  |       |                   ||          backing  |
+--------'  |       |                   ||                   |
+           V       |                                        |
+  +-------------------------------------------+             |
+  |                shared disk                | <-----------+
+  +-------------------------------------------+
+
+
+    1) Primary writes will read original data and forward it to Secondary
+       QEMU.
+    2) The hidden-disk buffers the original content that is modified by the
+       primary VM. It should also be an empty disk, and the driver supports
+       bdrv_make_empty() and backing file.
+    3) Primary write requests will be written to Shared disk.
+    4) Secondary write requests will be buffered in the active disk and it
+       will overwrite the existing sector content in the buffer.
+
 == Failure Handling ==
 There are 7 internal errors when block replication is running:
 1. I/O error on primary disk
@@ -145,7 +213,7 @@ d. replication_stop_all()
    things except failover. The caller must hold the I/O mutex lock if it is
    in migration/checkpoint thread.
 
-== Usage ==
+== Non-shared disk usage ==
 Primary:
   -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
          children.0.file.filename=1.raw,\
@@ -234,6 +302,69 @@ Secondary:
 The primary host is down, so we should do the following thing:
 { 'execute': 'nbd-server-stop' }
 
+== Shared disk usage ==
+Primary:
+  -drive if=virtio,id=primary_disk0,file.filename=1.raw,driver=raw
+
+Issue qmp command:
+  { 'execute': 'blockdev-add',
+    'arguments': {
+        'driver': 'replication',
+        'node-name': 'rep',
+        'mode': 'primary',
+        'shared-disk-id': 'primary_disk0',
+        'shared-disk': true,
+        'file': {
+            'driver': 'nbd',
+            'export': 'hidden_disk0',
+            'server': {
+                'type': 'inet',
+                'data': {
+                    'host': 'xxx.xxx.xxx.xxx',
+                    'port': 'yyy'
+                }
+            }
+        }
+    }
+  }
+
+Secondary:
+  -drive if=none,driver=qcow2,file.filename=/mnt/ramfs/hidden_disk.img,id=hidden_disk0,\
+         backing.driver=raw,backing.file.filename=1.raw \
+  -drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
+         file.driver=qcow2,top-id=active-disk0,\
+         file.file.filename=/mnt/ramfs/active_disk.img,\
+         file.backing=hidden_disk0,shared-disk=on
+
+Issue qmp command:
+1. { 'execute': 'nbd-server-start',
+     'arguments': {
+         'addr': {
+             'type': 'inet',
+             'data': {
+                 'host': '0',
+                 'port': 'yyy'
+             }
+         }
+     }
+   }
+2. { 'execute': 'nbd-server-add',
+     'arguments': {
+         'device': 'hidden_disk0',
+         'writable': true
+     }
+   }
+
+After Failover:
+Primary:
+  { 'execute': 'x-blockdev-del',
+    'arguments': {
+        'node-name': 'rep'
+    }
+  }
+
+Secondary:
+  { 'execute': 'nbd-server-stop' }
+
 TODO:
 1. Continuous block replication
-2. Shared disk
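For reference, the QMP commands in the usage sections above can be driven from a small script. The sketch below speaks the standard QMP handshake over a UNIX monitor socket (as started with '-qmp unix:/path,server,nowait'); the socket path and port are placeholders, and asynchronous events and error handling are ignored.

    import json
    import socket

    def qmp_command(sock_path, command, arguments=None):
        """Send one command to a QMP monitor socket and return the parsed reply.

        Minimal sketch: assumes one JSON message per line and skips event
        handling; the socket path is a placeholder.
        """
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(sock_path)
        f = sock.makefile("rw")
        json.loads(f.readline())                     # QMP greeting banner
        f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
        f.flush()
        json.loads(f.readline())                     # {"return": {}}
        msg = {"execute": command}
        if arguments is not None:
            msg["arguments"] = arguments
        f.write(json.dumps(msg) + "\n")
        f.flush()
        reply = json.loads(f.readline())
        sock.close()
        return reply

    # Example: start the NBD server on the secondary as in the usage section
    # (socket path and port are illustrative only).
    print(qmp_command("/tmp/qmp-secondary.sock", "nbd-server-start",
                      {"addr": {"type": "inet",
                                "data": {"host": "0", "port": "8889"}}}))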