Message ID | 1476971860-20860-2-git-send-email-zhang.zhanghailiang@huawei.com |
---|---|
State | New, archived |
On 10/20/2016 09:57 PM, zhanghailiang wrote:
> Introduce the scenario of shared-disk block replication
> and how to use it.
>
> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Zhang Chen <zhangchen.fnst@cn.fujitsu.com>
> ---
>  docs/block-replication.txt | 131 +++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 127 insertions(+), 4 deletions(-)
>
[...]
> -== Architecture ==
> +== None-shared disk architecture ==

s/None-shared/Non-shared/g

[...]
> +  2) Before Primary write requests are written to Shared disk, the
> +     original sector content will be read from Shared disk and
> +     forwarded and buffered in the Disk buffer on the secondary site,
> +     but it will not overwrite the existing

extra spaces at the end of line

> +     sector content(it could be from either "Secondary Write Requests" or

Need a space before "(" for better style.

> +     previous COW of "Primary Write Requests") in the Disk buffer.
[...]
> +  2) The hidden-disk buffers the original content that is modified by the
> +     primary VM. It should also be an empty disk, and

extra spaces at end of line

> +     the driver supports bdrv_make_empty() and backing file.
[...]
> +== Shared disk usage ==

It would be good to keep the same coding style as the
"== Non-shared disk usage ==" part.

> +Primary:
> +  -drive if=virtio,id=primary_disk0,file.filename=1.raw,driver=raw
> +
> +Issue qmp command:
> + {'execute': 'human-monitor-command',

Two-space indentation for the whole "{...}" part.

> +   'arguments': {
> +      'command-line': 'drive_add-nbuddydriver=replication,

missing spaces

> +                 mode=primary,
> +                 file.driver=nbd,
> +                 file.host=9.42.3.17,
> +                 file.port=9998,
> +                 file.export=hidden_disk0,
> +                 shared-disk-id=primary_disk0,
> +                 shared-disk=on,
> +                 node-name=rep'

Keep the whole command after "command-line" on one line, or it can't be
executed correctly, IIRC.

> +   }
> + }

Secondary:

> +  -drive if=none,driver=qcow2,file.filename=/mnt/ramfs/hidden_disk.img,id=hidden_disk0,\
> +         backing.driver=raw,backing.file.filename=1.raw \
> +  -drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
> +         file.driver=qcow2,top-id=active-disk0,\
> +         file.file.filename=/mnt/ramfs/active_disk.img,\
> +         file.backing=hidden_disk0,shared-disk=on
> +
> +Issue qmp command:
> +1. {'execute': 'nbd-server-start',
> +    'arguments': {
> +      'addr': {
> +        'type': 'inet',
> +        'data': {
> +          'host': '0',

s/0/9.42.3.17/g, since you use a designated IP address above.

[...]
> +After Failover:
> +Primary:
> +{'execute': 'human-monitor-command',
> +   'arguments': {
> +     'command-line': 'drive_delrep'

drive_del rep

[...]
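The review point about the `command-line` value can be made concrete. Below is a small Python sketch, not part of the patch, that joins the Primary-side options into a single-line HMP command and wraps it in a QMP `human-monitor-command` request. It assumes the garbled `drive_add-nbuddydriver=replication,` was meant to read `drive_add -n buddy driver=replication,` (the form the "missing spaces" comment points to); the helper names are hypothetical, and `shared-disk-id`/`shared-disk` are options introduced by this patch series.

```python
import json

# Primary-side options from the patch's "Shared disk usage" example.
# "shared-disk-id" and "shared-disk" come from this patch series and may not
# be accepted by other QEMU versions.
REPLICATION_OPTS = [
    ("driver", "replication"),
    ("mode", "primary"),
    ("file.driver", "nbd"),
    ("file.host", "9.42.3.17"),
    ("file.port", "9998"),
    ("file.export", "hidden_disk0"),
    ("shared-disk-id", "primary_disk0"),
    ("shared-disk", "on"),
    ("node-name", "rep"),
]


def drive_add_command_line(opts):
    """Build ONE single-line HMP drive_add command from (key, value) pairs."""
    # Join with bare commas: no newlines and no spaces inside the option string.
    option_string = ",".join(f"{key}={value}" for key, value in opts)
    # Assumed spelling: "-n" adds only the block node; "buddy" fills the
    # positional address argument, which we assume is ignored in that mode.
    return f"drive_add -n buddy {option_string}"


def human_monitor_command(command_line):
    """Wrap an HMP command line in a QMP 'human-monitor-command' request."""
    if "\n" in command_line:
        raise ValueError("HMP command line must be a single line")
    return {"execute": "human-monitor-command",
            "arguments": {"command-line": command_line}}


if __name__ == "__main__":
    request = human_monitor_command(drive_add_command_line(REPLICATION_OPTS))
    print(json.dumps(request))
    # After failover, with the space the review asks for:
    print(json.dumps(human_monitor_command("drive_del rep")))
```

Rejecting embedded newlines in `human_monitor_command` turns the reviewer's "keep it on one line" rule into an explicit check instead of a formatting convention.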
On 2016/10/25 17:03, Changlong Xie wrote:
> On 10/20/2016 09:57 PM, zhanghailiang wrote:
[...]
>> +Issue qmp command:
>> + {'execute': 'human-monitor-command',
>> +   'arguments': {
>> +      'command-line': 'drive_add-nbuddydriver=replication,
[...]
>> +                 node-name=rep'
>
> Keep the whole command after "command-line" on one line, or it can't be
> executed correctly, IIRC.
>

Hmm, I will change this HMP command to the QMP 'blockdev-add' command in
the next version, because it is supported now, though it is not ready for
production.

[...]
>> +After Failover:
>> +Primary:
>> +{'execute': 'human-monitor-command',
>> +   'arguments': {
>> +     'command-line': 'drive_delrep'
>
> drive_del rep
>

I'll use the QMP command instead here.

[...]

I will fix all the above problems in the next version, thanks.
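The Secondary-side QMP steps in the patch can also be scripted. The following Python sketch is an illustration, not part of the patch: it builds the `nbd-server-start` and `nbd-server-add` requests with the reviewer's address fix applied (binding to 9.42.3.17 instead of 0) and sends them over a minimal QMP client. The UNIX socket path and all helper names are hypothetical; the client follows the usual QMP flow of reading the greeting, sending `qmp_capabilities`, then issuing commands while skipping asynchronous events.

```python
import json
import socket

# Values from the patch; per the review, the NBD server binds to the same
# address the Primary connects to (file.host=9.42.3.17, file.port=9998).
NBD_HOST = "9.42.3.17"
NBD_PORT = "9998"  # the patch passes the port as a string


def secondary_setup_commands(host=NBD_HOST, port=NBD_PORT, export="hidden_disk0"):
    """QMP requests from the patch's Secondary section, in order."""
    return [
        {"execute": "nbd-server-start",
         "arguments": {"addr": {"type": "inet",
                                "data": {"host": host, "port": port}}}},
        {"execute": "nbd-server-add",
         "arguments": {"device": export, "writable": True}},
    ]


def _read_message(stream):
    """Return the next QMP message that is not an asynchronous event."""
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        message = json.loads(line)
        if "event" not in message:
            return message


def run_qmp(sock_path, commands):
    """Send commands over a QMP UNIX socket and return their replies.

    sock_path is hypothetical, e.g. a monitor started with
    -qmp unix:/tmp/qmp-secondary.sock,server,nowait
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        with sock.makefile("rw", encoding="utf-8") as stream:
            _read_message(stream)  # server greeting: {"QMP": {...}}
            replies = []
            # qmp_capabilities must come first to leave negotiation mode.
            for command in [{"execute": "qmp_capabilities"}, *commands]:
                stream.write(json.dumps(command) + "\n")
                stream.flush()
                reply = _read_message(stream)
                if "error" in reply:
                    raise RuntimeError(f"{command['execute']}: {reply['error']}")
                replies.append(reply)
            return replies[1:]  # drop the qmp_capabilities reply
```

For example, `run_qmp("/tmp/qmp-secondary.sock", secondary_setup_commands())` would perform both setup steps; after failover, the same helper could send `[{"execute": "nbd-server-stop"}]`.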
On 2016/11/28 14:00, Changlong Xie wrote:
> On 11/28/2016 01:13 PM, Hailiang Zhang wrote:
>>
>> On 2016/10/25 17:03, Changlong Xie wrote:
>>> On 10/20/2016 09:57 PM, zhanghailiang wrote:
[...]
>>> Keep the whole command after "command-line" on one line, or it can't be
>>> executed correctly, IIRC.
>>>
>>
>> Hmm, I will change this HMP command to the QMP 'blockdev-add' command in
>> the next version, because it is supported now, though it is not ready for
>> production.
>>
>
> It's a good start, but I'm not sure here.
>
> http://lists.nongnu.org/archive/html/qemu-devel/2016-11/msg01062.html
>

Yes, I noticed that, but for COLO, it is not ready for production either.
So I think it is OK to use it here ...

> Thanks
> -Xie
On 11/28/2016 01:13 PM, Hailiang Zhang wrote:
>
> On 2016/10/25 17:03, Changlong Xie wrote:
>> On 10/20/2016 09:57 PM, zhanghailiang wrote:
[...]
>> Keep the whole command after "command-line" on one line, or it can't be
>> executed correctly, IIRC.
>>
>
> Hmm, I will change this HMP command to the QMP 'blockdev-add' command in
> the next version, because it is supported now, though it is not ready for
> production.
>

It's a good start, but I'm not sure here.

http://lists.nongnu.org/archive/html/qemu-devel/2016-11/msg01062.html

Thanks
-Xie
diff --git a/docs/block-replication.txt b/docs/block-replication.txt
index 6bde673..97fcfc1 100644
--- a/docs/block-replication.txt
+++ b/docs/block-replication.txt
@@ -24,7 +24,7 @@ only dropped at next checkpoint time. To reduce the network transportation
 effort during a vmstate checkpoint, the disk modification operations of
 the Primary disk are asynchronously forwarded to the Secondary node.
 
-== Workflow ==
+== Non-shared disk workflow ==
 The following is the image of block replication workflow:
 
          +----------------------+            +------------------------+
@@ -57,7 +57,7 @@ The following is the image of block replication workflow:
   4) Secondary write requests will be buffered in the Disk buffer and it
      will overwrite the existing sector content in the buffer.
 
-== Architecture ==
+== None-shared disk architecture ==
 We are going to implement block replication from many basic
 blocks that are already in QEMU.
 
@@ -106,6 +106,74 @@ any state that would otherwise be lost by the speculative write-through
 of the NBD server into the secondary disk. So before block replication,
 the primary disk and secondary disk should contain the same data.
 
+== Shared Disk Mode Workflow ==
+The following is the image of block replication workflow:
+
+         +----------------------+            +------------------------+
+         |Primary Write Requests|            |Secondary Write Requests|
+         +----------------------+            +------------------------+
+                    |                                     |
+                    |                                    (4)
+                    |                                     V
+                    |                              /-------------\
+                    | (2)Forward and write through |             |
+                    | +--------------------------> | Disk Buffer |
+                    | |                            |             |
+                    | |                            \-------------/
+                    | |(1)read                            |
+                    | |                                   |
+           (3)write | |                                   | backing file
+                    V |                                   |
+            +-----------------------------+               |
+            |         Shared Disk         | <-------------+
+            +-----------------------------+
+
+  1) Primary writes will read original data and forward it to Secondary
+     QEMU.
+  2) Before Primary write requests are written to Shared disk, the
+     original sector content will be read from Shared disk and
+     forwarded and buffered in the Disk buffer on the secondary site,
+     but it will not overwrite the existing
+     sector content(it could be from either "Secondary Write Requests" or
+     previous COW of "Primary Write Requests") in the Disk buffer.
+  3) Primary write requests will be written to Shared disk.
+  4) Secondary write requests will be buffered in the Disk buffer and it
+     will overwrite the existing sector content in the buffer.
+
+== Shared Disk Mode Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+                virtio-blk                 ||               .----------
+                /                          ||               | Secondary
+               /                           ||               '----------
+              /                            ||                             virtio-blk
+             /                             ||                                  |
+            |                              ||                           replication(5)
+            |                   NBD -------->  NBD (2)                         |
+            |                 client       || server ---> hidden disk <-- active disk(4)
+            |                    ^         ||                  |
+            |             replication(1)   ||                  |
+            |                    |         ||                  |
+            |  +-----------------'         ||                  |
+           (3) |drive-backup sync=none     ||                  |
+--------.   |  +-----------------+         ||                  |
+Primary |   |                    |         ||          backing |
+--------'   |                    |         ||                  |
+            V                    |                             |
+      +-------------------------------------------+            |
+      |                shared disk                | <----------+
+      +-------------------------------------------+
+
+
+  1) Primary writes will read original data and forward it to Secondary
+     QEMU.
+  2) The hidden-disk buffers the original content that is modified by the
+     primary VM. It should also be an empty disk, and
+     the driver supports bdrv_make_empty() and backing file.
+  3) Primary write requests will be written to Shared disk.
+  4) Secondary write requests will be buffered in the active disk and it
+     will overwrite the existing sector content in the buffer.
+
 == Failure Handling ==
 There are 7 internal errors when block replication is running:
 1. I/O error on primary disk
@@ -145,7 +213,7 @@ d. replication_stop_all()
    things except failover. The caller must hold the I/O mutex lock if it is
    in migration/checkpoint thread.
 
-== Usage ==
+== Non-shared disk usage ==
 Primary:
   -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
          children.0.file.filename=1.raw,\
@@ -234,6 +302,61 @@ Secondary:
   The primary host is down, so we should do the following thing:
   { 'execute': 'nbd-server-stop' }
 
+== Shared disk usage ==
+Primary:
+  -drive if=virtio,id=primary_disk0,file.filename=1.raw,driver=raw
+
+Issue qmp command:
+ {'execute': 'human-monitor-command',
+   'arguments': {
+      'command-line': 'drive_add-nbuddydriver=replication,
+                 mode=primary,
+                 file.driver=nbd,
+                 file.host=9.42.3.17,
+                 file.port=9998,
+                 file.export=hidden_disk0,
+                 shared-disk-id=primary_disk0,
+                 shared-disk=on,
+                 node-name=rep'
+   }
+ }
+  -drive if=none,driver=qcow2,file.filename=/mnt/ramfs/hidden_disk.img,id=hidden_disk0,\
+         backing.driver=raw,backing.file.filename=1.raw \
+  -drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
+         file.driver=qcow2,top-id=active-disk0,\
+         file.file.filename=/mnt/ramfs/active_disk.img,\
+         file.backing=hidden_disk0,shared-disk=on
+
+Issue qmp command:
+1. {'execute': 'nbd-server-start',
+    'arguments': {
+      'addr': {
+        'type': 'inet',
+        'data': {
+          'host': '0',
+          'port': '9998'
+        }
+      }
+    }
+   }
+2. {
+    'execute': 'nbd-server-add',
+    'arguments': {
+      'device': 'hidden_disk0',
+      'writable': true
+    }
+   }
+
+After Failover:
+Primary:
+{'execute': 'human-monitor-command',
+   'arguments': {
+     'command-line': 'drive_delrep'
+   }
+}
+
+Secondary:
+  {'execute': 'nbd-server-stop' }
+
 TODO:
 1. Continuous block replication
-2. Shared disk
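The copy-on-write rules in the Shared Disk Mode Workflow (steps 1 to 4) can be checked with a toy model. The Python sketch below is an illustration only, not QEMU code: dictionaries mapping sector numbers to bytes stand in for the Shared Disk and the Secondary's Disk Buffer, and the class and method names are hypothetical. `checkpoint()` assumes the buffer is simply emptied at each checkpoint, in line with the doc's note that Secondary modifications are dropped at the next checkpoint and its requirement that the hidden disk support `bdrv_make_empty()`.

```python
from typing import Dict

ZERO = b"\x00"  # content of a sector that has never been written


class SharedDiskReplicationModel:
    """Toy model of the shared-disk workflow; dicts stand in for disks."""

    def __init__(self, shared_disk: Dict[int, bytes]):
        self.shared = dict(shared_disk)  # Shared Disk, written by the Primary
        self.buffer: Dict[int, bytes] = {}  # Disk Buffer on the Secondary

    def primary_write(self, sector: int, data: bytes) -> None:
        # (1) read the original content from the Shared Disk ...
        original = self.shared.get(sector, ZERO)
        # (2) ... forward it to the Disk Buffer, but never overwrite content
        # already buffered (a Secondary write or an earlier copy-on-write) ...
        self.buffer.setdefault(sector, original)
        # (3) ... then write the new data to the Shared Disk.
        self.shared[sector] = data

    def secondary_write(self, sector: int, data: bytes) -> None:
        # (4) Secondary writes always overwrite the buffered sector.
        self.buffer[sector] = data

    def secondary_read(self, sector: int) -> bytes:
        # The Shared Disk acts as the buffer's backing file.
        return self.buffer.get(sector, self.shared.get(sector, ZERO))

    def checkpoint(self) -> None:
        # Assumption: the buffer is emptied at each checkpoint.
        self.buffer.clear()


if __name__ == "__main__":
    model = SharedDiskReplicationModel({0: b"A"})
    model.primary_write(0, b"B")
    model.primary_write(0, b"C")
    print(model.secondary_read(0))  # b'A': the Secondary still sees the original
```

`dict.setdefault` expresses rule (2) directly: the first copy-on-write for a sector sticks, while a Secondary write (rule 4) replaces whatever is buffered.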