diff mbox

[v12,2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

Message ID 1457578181-27111-3-git-send-email-xiecl.fnst@cn.fujitsu.com (mailing list archive)
State New, archived
Headers show

Commit Message

Changlong Xie March 10, 2016, 2:49 a.m. UTC
From: Wen Congyang <wency@cn.fujitsu.com>

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block.c               |   8 ++--
 block/quorum.c        | 121 +++++++++++++++++++++++++++++++++++++++++++++++++-
 include/block/block.h |   4 ++
 3 files changed, 127 insertions(+), 6 deletions(-)

Comments

Alberto Garcia March 11, 2016, 12:21 p.m. UTC | #1
On Thu 10 Mar 2016 03:49:40 AM CET, Changlong Xie wrote:
> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>      bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>                              * block if Quorum is reached.
>                              */
> +    unsigned long *index_bitmap;
> +    int bsize;
  [...]
> +static int get_new_child_index(BDRVQuorumState *s)
  [...]
> +static void remove_child_index(BDRVQuorumState *s, int index)
  [...]

Sorry if I missed a previous discussion, but why is this necessary?

Berto
Changlong Xie March 14, 2016, 1:33 a.m. UTC | #2
On 03/11/2016 08:21 PM, Alberto Garcia wrote:
> On Thu 10 Mar 2016 03:49:40 AM CET, Changlong Xie wrote:
>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>                               * block if Quorum is reached.
>>                               */
>> +    unsigned long *index_bitmap;
>> +    int bsize;
>    [...]
>> +static int get_new_child_index(BDRVQuorumState *s)
>    [...]
>> +static void remove_child_index(BDRVQuorumState *s, int index)
>    [...]
>
> Sorry if I missed a previous discussion, but why is this necessary?

Hi Betro

Currently we implement this for COLO, we need the capability to hotplug 
NBD child in COLO mode.

More detail please reference 
http://wiki.qemu.org/Features/BlockReplication.

Thanks
	-Xie

>
> Berto
>
>
> .
>
Changlong Xie March 14, 2016, 6:02 a.m. UTC | #3
On 03/11/2016 08:21 PM, Alberto Garcia wrote:
> On Thu 10 Mar 2016 03:49:40 AM CET, Changlong Xie wrote:
>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>                               * block if Quorum is reached.
>>                               */
>> +    unsigned long *index_bitmap;

Hi Berto

*NOTE*, In the old version, we just used "bs->node_name", but in the 
lastest one, as Kevin suggested we introduce "child->child_name"(formart 
as "children.xxx"), this is the key cause why we need this two functions 
here.

Thanks
	-Xie
>> +    int bsize;
>    [...]
>> +static int get_new_child_index(BDRVQuorumState *s)
>    [...]
>> +static void remove_child_index(BDRVQuorumState *s, int index)
>    [...]
>
> Sorry if I missed a previous discussion, but why is this necessary?
>
> Berto
>
>
> .
>
Wen Congyang March 16, 2016, 2:10 a.m. UTC | #4
On 03/11/2016 08:21 PM, Alberto Garcia wrote:
> On Thu 10 Mar 2016 03:49:40 AM CET, Changlong Xie wrote:
>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>      bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>                              * block if Quorum is reached.
>>                              */
>> +    unsigned long *index_bitmap;
>> +    int bsize;
>   [...]
>> +static int get_new_child_index(BDRVQuorumState *s)
>   [...]
>> +static void remove_child_index(BDRVQuorumState *s, int index)
>   [...]
> 
> Sorry if I missed a previous discussion, but why is this necessary?

Hi, Alberto Garcia

Do you have any comments about this patch or give a R-B?

Thanks
Wen Congyang

> 
> Berto
> 
> 
> .
>
Alberto Garcia March 16, 2016, 12:38 p.m. UTC | #5
On Mon 14 Mar 2016 07:02:08 AM CET, Changlong Xie <xiecl.fnst@cn.fujitsu.com> wrote:

>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>>                               * block if Quorum is reached.
>>>                               */
>>> +    unsigned long *index_bitmap;
>
> Hi Berto
>
> *NOTE*, In the old version, we just used "bs->node_name", but in the
> lastest one, as Kevin suggested we introduce
> "child->child_name"(formart as "children.xxx"), this is the key cause
> why we need this two functions here.

I'm sorry I missed this discussion earlier. Your code seems technically
correct but I have several questions:

- I read that one of the reasons for this change is that "In theory, the
  same node could be attached twice to the same parent in different
  roles.". Is there any example of that? What's the use case?

- How do you obtain the child name?

- I see that if you have children.0 and children.1 (let's say hd0.qcow2
  and hd1.qcow2), then you remove children.0 and add it again, it will
  keep the 'children.0' name (that's what the bitmap is for if I'm
  understanding it correctly). However the position in the s->children
  array will change because you do memmove() when you remove children.0
  and then add it again to the end of the array.

  Initial status:

    s->children[0] <--> "children.0" (hd0.qcow2)
    s->children[1] <--> "children.1" (hd1.qcow2)

  children.0 (hd0.qcow2) is removed:

    s->children[0] <--> "children.1" (hd1.qcow2)

  children.0 (hd0.qcow2) is added again:

    s->children[0] <--> "children.1" (hd1.qcow2)
    s->children[1] <--> "children.0" (hd0.qcow2)

  Is this correct? Is this the indented behavior? Since you are reading
  in FIFO mode, now hd1.qcow2 will always be read first, so if
  children.1 was the secondary disk, it has just become the primary.

I also think that it would be great to have tests for this
functionality, but they can be added later.

Thanks,

Berto
Wen Congyang March 17, 2016, 1:22 a.m. UTC | #6
On 03/16/2016 08:38 PM, Alberto Garcia wrote:
> On Mon 14 Mar 2016 07:02:08 AM CET, Changlong Xie <xiecl.fnst@cn.fujitsu.com> wrote:
> 
>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>>>                               * block if Quorum is reached.
>>>>                               */
>>>> +    unsigned long *index_bitmap;
>>
>> Hi Berto
>>
>> *NOTE*, In the old version, we just used "bs->node_name", but in the
>> lastest one, as Kevin suggested we introduce
>> "child->child_name"(formart as "children.xxx"), this is the key cause
>> why we need this two functions here.
> 
> I'm sorry I missed this discussion earlier. Your code seems technically
> correct but I have several questions:
> 
> - I read that one of the reasons for this change is that "In theory, the
>   same node could be attached twice to the same parent in different
>   roles.". Is there any example of that? What's the use case?

Kevin may know the case.

> 
> - How do you obtain the child name?

IIRC, the answer is no now. I think we can improve 'info block' output

> 
> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
>   and hd1.qcow2), then you remove children.0 and add it again, it will
>   keep the 'children.0' name (that's what the bitmap is for if I'm
>   understanding it correctly). However the position in the s->children
>   array will change because you do memmove() when you remove children.0
>   and then add it again to the end of the array.
> 
>   Initial status:
> 
>     s->children[0] <--> "children.0" (hd0.qcow2)
>     s->children[1] <--> "children.1" (hd1.qcow2)
> 
>   children.0 (hd0.qcow2) is removed:
> 
>     s->children[0] <--> "children.1" (hd1.qcow2)
> 
>   children.0 (hd0.qcow2) is added again:
> 
>     s->children[0] <--> "children.1" (hd1.qcow2)
>     s->children[1] <--> "children.0" (hd0.qcow2)

Yes, it is correct.

> 
>   Is this correct? Is this the indented behavior? Since you are reading
>   in FIFO mode, now hd1.qcow2 will always be read first, so if
>   children.1 was the secondary disk, it has just become the primary.

Yes.

Thanks
Wen Congyang

> 
> I also think that it would be great to have tests for this
> functionality, but they can be added later.
> 
> Thanks,
> 
> Berto
> 
> 
> .
>
Alberto Garcia March 17, 2016, 9:10 a.m. UTC | #7
On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>>>>                               * block if Quorum is reached.
>>>>>                               */
>>>>> +    unsigned long *index_bitmap;
>>>
>>> Hi Berto
>>>
>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
>>> lastest one, as Kevin suggested we introduce
>>> "child->child_name"(formart as "children.xxx"), this is the key cause
>>> why we need this two functions here.
>> 
>> I'm sorry I missed this discussion earlier. Your code seems technically
>> correct but I have several questions:
>> 
>> - I read that one of the reasons for this change is that "In theory, the
>>   same node could be attached twice to the same parent in different
>>   roles.". Is there any example of that? What's the use case?
>
> Kevin may know the case.

Kevin, do you have an example?

>> - How do you obtain the child name?
>
> IIRC, the answer is no now. I think we can improve 'info block' output

Okay, but then we should extend that first, otherwise this API cannot be
used.

>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
>>   and hd1.qcow2), then you remove children.0 and add it again, it will
>>   keep the 'children.0' name (that's what the bitmap is for if I'm
>>   understanding it correctly). However the position in the s->children
>>   array will change because you do memmove() when you remove children.0
>>   and then add it again to the end of the array.
>> 
>>   Initial status:
>> 
>>     s->children[0] <--> "children.0" (hd0.qcow2)
>>     s->children[1] <--> "children.1" (hd1.qcow2)
>> 
>>   children.0 (hd0.qcow2) is removed:
>> 
>>     s->children[0] <--> "children.1" (hd1.qcow2)
>> 
>>   children.0 (hd0.qcow2) is added again:
>> 
>>     s->children[0] <--> "children.1" (hd1.qcow2)
>>     s->children[1] <--> "children.0" (hd0.qcow2)
>
> Yes, it is correct.
>
>> 
>>   Is this correct? Is this the indented behavior? Since you are reading
>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
>>   children.1 was the secondary disk, it has just become the primary.
>
> Yes.

And don't you need a way to control the order in which the disks must be
read for COLO?

Berto
Wen Congyang March 17, 2016, 9:44 a.m. UTC | #8
On 03/17/2016 05:10 PM, Alberto Garcia wrote:
> On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>>>>>                               * block if Quorum is reached.
>>>>>>                               */
>>>>>> +    unsigned long *index_bitmap;
>>>>
>>>> Hi Berto
>>>>
>>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
>>>> lastest one, as Kevin suggested we introduce
>>>> "child->child_name"(formart as "children.xxx"), this is the key cause
>>>> why we need this two functions here.
>>>
>>> I'm sorry I missed this discussion earlier. Your code seems technically
>>> correct but I have several questions:
>>>
>>> - I read that one of the reasons for this change is that "In theory, the
>>>   same node could be attached twice to the same parent in different
>>>   roles.". Is there any example of that? What's the use case?
>>
>> Kevin may know the case.
> 
> Kevin, do you have an example?
> 
>>> - How do you obtain the child name?
>>
>> IIRC, the answer is no now. I think we can improve 'info block' output
> 
> Okay, but then we should extend that first, otherwise this API cannot be
> used.
> 
>>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
>>>   and hd1.qcow2), then you remove children.0 and add it again, it will
>>>   keep the 'children.0' name (that's what the bitmap is for if I'm
>>>   understanding it correctly). However the position in the s->children
>>>   array will change because you do memmove() when you remove children.0
>>>   and then add it again to the end of the array.
>>>
>>>   Initial status:
>>>
>>>     s->children[0] <--> "children.0" (hd0.qcow2)
>>>     s->children[1] <--> "children.1" (hd1.qcow2)
>>>
>>>   children.0 (hd0.qcow2) is removed:
>>>
>>>     s->children[0] <--> "children.1" (hd1.qcow2)
>>>
>>>   children.0 (hd0.qcow2) is added again:
>>>
>>>     s->children[0] <--> "children.1" (hd1.qcow2)
>>>     s->children[1] <--> "children.0" (hd0.qcow2)
>>
>> Yes, it is correct.
>>
>>>
>>>   Is this correct? Is this the indented behavior? Since you are reading
>>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
>>>   children.1 was the secondary disk, it has just become the primary.
>>
>> Yes.
> 
> And don't you need a way to control the order in which the disks must be
> read for COLO?

I think in fifo mode, we should read the disk first that is added earlier.

We don't need a way to control the order now.

Thanks
Wen Congyang

> 
> Berto
> 
> 
> .
>
Dr. David Alan Gilbert March 17, 2016, 9:48 a.m. UTC | #9
* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 03/17/2016 05:10 PM, Alberto Garcia wrote:
> > On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
> >>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
> >>>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
> >>>>>>                               * block if Quorum is reached.
> >>>>>>                               */
> >>>>>> +    unsigned long *index_bitmap;
> >>>>
> >>>> Hi Berto
> >>>>
> >>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
> >>>> lastest one, as Kevin suggested we introduce
> >>>> "child->child_name"(formart as "children.xxx"), this is the key cause
> >>>> why we need this two functions here.
> >>>
> >>> I'm sorry I missed this discussion earlier. Your code seems technically
> >>> correct but I have several questions:
> >>>
> >>> - I read that one of the reasons for this change is that "In theory, the
> >>>   same node could be attached twice to the same parent in different
> >>>   roles.". Is there any example of that? What's the use case?
> >>
> >> Kevin may know the case.
> > 
> > Kevin, do you have an example?
> > 
> >>> - How do you obtain the child name?
> >>
> >> IIRC, the answer is no now. I think we can improve 'info block' output
> > 
> > Okay, but then we should extend that first, otherwise this API cannot be
> > used.
> > 
> >>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
> >>>   and hd1.qcow2), then you remove children.0 and add it again, it will
> >>>   keep the 'children.0' name (that's what the bitmap is for if I'm
> >>>   understanding it correctly). However the position in the s->children
> >>>   array will change because you do memmove() when you remove children.0
> >>>   and then add it again to the end of the array.
> >>>
> >>>   Initial status:
> >>>
> >>>     s->children[0] <--> "children.0" (hd0.qcow2)
> >>>     s->children[1] <--> "children.1" (hd1.qcow2)
> >>>
> >>>   children.0 (hd0.qcow2) is removed:
> >>>
> >>>     s->children[0] <--> "children.1" (hd1.qcow2)
> >>>
> >>>   children.0 (hd0.qcow2) is added again:
> >>>
> >>>     s->children[0] <--> "children.1" (hd1.qcow2)
> >>>     s->children[1] <--> "children.0" (hd0.qcow2)
> >>
> >> Yes, it is correct.
> >>
> >>>
> >>>   Is this correct? Is this the indented behavior? Since you are reading
> >>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
> >>>   children.1 was the secondary disk, it has just become the primary.
> >>
> >> Yes.
> > 
> > And don't you need a way to control the order in which the disks must be
> > read for COLO?
> 
> I think in fifo mode, we should read the disk first that is added earlier.
> 
> We don't need a way to control the order now.

Can you document fully how it's used in COLO then?
We should have the failure modes documented, and how you'll use
it after failover etc   Without that it's really difficult to tell
if this naming is right.
The children.0 notation is really confusing in the way that Berto
describes; I hit this a couple of months ago and it really doesn't
make sense.

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > Berto
> > 
> > 
> > .
> > 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Wen Congyang March 17, 2016, 9:56 a.m. UTC | #10
On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (wency@cn.fujitsu.com) wrote:
>> On 03/17/2016 05:10 PM, Alberto Garcia wrote:
>>> On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>>>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>>>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>>>>>>>>                               * block if Quorum is reached.
>>>>>>>>                               */
>>>>>>>> +    unsigned long *index_bitmap;
>>>>>>
>>>>>> Hi Berto
>>>>>>
>>>>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
>>>>>> lastest one, as Kevin suggested we introduce
>>>>>> "child->child_name"(formart as "children.xxx"), this is the key cause
>>>>>> why we need this two functions here.
>>>>>
>>>>> I'm sorry I missed this discussion earlier. Your code seems technically
>>>>> correct but I have several questions:
>>>>>
>>>>> - I read that one of the reasons for this change is that "In theory, the
>>>>>   same node could be attached twice to the same parent in different
>>>>>   roles.". Is there any example of that? What's the use case?
>>>>
>>>> Kevin may know the case.
>>>
>>> Kevin, do you have an example?
>>>
>>>>> - How do you obtain the child name?
>>>>
>>>> IIRC, the answer is no now. I think we can improve 'info block' output
>>>
>>> Okay, but then we should extend that first, otherwise this API cannot be
>>> used.
>>>
>>>>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
>>>>>   and hd1.qcow2), then you remove children.0 and add it again, it will
>>>>>   keep the 'children.0' name (that's what the bitmap is for if I'm
>>>>>   understanding it correctly). However the position in the s->children
>>>>>   array will change because you do memmove() when you remove children.0
>>>>>   and then add it again to the end of the array.
>>>>>
>>>>>   Initial status:
>>>>>
>>>>>     s->children[0] <--> "children.0" (hd0.qcow2)
>>>>>     s->children[1] <--> "children.1" (hd1.qcow2)
>>>>>
>>>>>   children.0 (hd0.qcow2) is removed:
>>>>>
>>>>>     s->children[0] <--> "children.1" (hd1.qcow2)
>>>>>
>>>>>   children.0 (hd0.qcow2) is added again:
>>>>>
>>>>>     s->children[0] <--> "children.1" (hd1.qcow2)
>>>>>     s->children[1] <--> "children.0" (hd0.qcow2)
>>>>
>>>> Yes, it is correct.
>>>>
>>>>>
>>>>>   Is this correct? Is this the indented behavior? Since you are reading
>>>>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
>>>>>   children.1 was the secondary disk, it has just become the primary.
>>>>
>>>> Yes.
>>>
>>> And don't you need a way to control the order in which the disks must be
>>> read for COLO?
>>
>> I think in fifo mode, we should read the disk first that is added earlier.
>>
>> We don't need a way to control the order now.
> 
> Can you document fully how it's used in COLO then?

Do you mean document it in docs/block-replication.txt?

> We should have the failure modes documented, and how you'll use
> it after failover etc   Without that it's really difficult to tell
> if this naming is right.

For COLO, children.0 is the real disk, children.1 is replication driver.
After failure, children.1 will be removed by the user. If we want to
continue do COLO, we need add a new children.1 again.

> The children.0 notation is really confusing in the way that Berto
> describes; I hit this a couple of months ago and it really doesn't
> make sense.

Do you mean: read from children.1 first, and then read from children.0 in
fifo mode? Yes, the behavior is very strange.

Thanks
Wen Congyang

> 
> Dave
> 
>>
>> Thanks
>> Wen Congyang
>>
>>>
>>> Berto
>>>
>>>
>>> .
>>>
>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
> 
> .
>
Dr. David Alan Gilbert March 17, 2016, 9:59 a.m. UTC | #11
* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> > * Wen Congyang (wency@cn.fujitsu.com) wrote:
> >> On 03/17/2016 05:10 PM, Alberto Garcia wrote:
> >>> On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
> >>>>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
> >>>>>>>>       bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
> >>>>>>>>                               * block if Quorum is reached.
> >>>>>>>>                               */
> >>>>>>>> +    unsigned long *index_bitmap;
> >>>>>>
> >>>>>> Hi Berto
> >>>>>>
> >>>>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
> >>>>>> lastest one, as Kevin suggested we introduce
> >>>>>> "child->child_name"(formart as "children.xxx"), this is the key cause
> >>>>>> why we need this two functions here.
> >>>>>
> >>>>> I'm sorry I missed this discussion earlier. Your code seems technically
> >>>>> correct but I have several questions:
> >>>>>
> >>>>> - I read that one of the reasons for this change is that "In theory, the
> >>>>>   same node could be attached twice to the same parent in different
> >>>>>   roles.". Is there any example of that? What's the use case?
> >>>>
> >>>> Kevin may know the case.
> >>>
> >>> Kevin, do you have an example?
> >>>
> >>>>> - How do you obtain the child name?
> >>>>
> >>>> IIRC, the answer is no now. I think we can improve 'info block' output
> >>>
> >>> Okay, but then we should extend that first, otherwise this API cannot be
> >>> used.
> >>>
> >>>>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
> >>>>>   and hd1.qcow2), then you remove children.0 and add it again, it will
> >>>>>   keep the 'children.0' name (that's what the bitmap is for if I'm
> >>>>>   understanding it correctly). However the position in the s->children
> >>>>>   array will change because you do memmove() when you remove children.0
> >>>>>   and then add it again to the end of the array.
> >>>>>
> >>>>>   Initial status:
> >>>>>
> >>>>>     s->children[0] <--> "children.0" (hd0.qcow2)
> >>>>>     s->children[1] <--> "children.1" (hd1.qcow2)
> >>>>>
> >>>>>   children.0 (hd0.qcow2) is removed:
> >>>>>
> >>>>>     s->children[0] <--> "children.1" (hd1.qcow2)
> >>>>>
> >>>>>   children.0 (hd0.qcow2) is added again:
> >>>>>
> >>>>>     s->children[0] <--> "children.1" (hd1.qcow2)
> >>>>>     s->children[1] <--> "children.0" (hd0.qcow2)
> >>>>
> >>>> Yes, it is correct.
> >>>>
> >>>>>
> >>>>>   Is this correct? Is this the indented behavior? Since you are reading
> >>>>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
> >>>>>   children.1 was the secondary disk, it has just become the primary.
> >>>>
> >>>> Yes.
> >>>
> >>> And don't you need a way to control the order in which the disks must be
> >>> read for COLO?
> >>
> >> I think in fifo mode, we should read the disk first that is added earlier.
> >>
> >> We don't need a way to control the order now.
> > 
> > Can you document fully how it's used in COLO then?
> 
> Do you mean document it in docs/block-replication.txt?

That would be OK.

> > We should have the failure modes documented, and how you'll use
> > it after failover etc   Without that it's really difficult to tell
> > if this naming is right.
> 
> For COLO, children.0 is the real disk, children.1 is replication driver.
> After failure, children.1 will be removed by the user. If we want to
> continue do COLO, we need add a new children.1 again.

So you need to document how to do that.

> > The children.0 notation is really confusing in the way that Berto
> > describes; I hit this a couple of months ago and it really doesn't
> > make sense.
> 
> Do you mean: read from children.1 first, and then read from children.0 in
> fifo mode? Yes, the behavior is very strange.

I mean the 'children.0' 'children.1' naming is just very confusing.
Also because the order in the array is important it's even more confusing
since the 'children.1' isn't necessarily the children[1].

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > Dave
> > 
> >>
> >> Thanks
> >> Wen Congyang
> >>
> >>>
> >>> Berto
> >>>
> >>>
> >>> .
> >>>
> >>
> >>
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> > 
> > .
> > 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Alberto Garcia March 17, 2016, 10:07 a.m. UTC | #12
On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
>> We should have the failure modes documented, and how you'll use it
>> after failover etc Without that it's really difficult to tell if this
>> naming is right.
>
> For COLO, children.0 is the real disk, children.1 is replication
> driver.  After failure, children.1 will be removed by the user. If we
> want to continue do COLO, we need add a new children.1 again.

What if children.0 fails ?

Berto
Wen Congyang March 17, 2016, 10:23 a.m. UTC | #13
On 03/17/2016 06:07 PM, Alberto Garcia wrote:
> On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
>>> We should have the failure modes documented, and how you'll use it
>>> after failover etc Without that it's really difficult to tell if this
>>> naming is right.
>>
>> For COLO, children.0 is the real disk, children.1 is replication
>> driver.  After failure, children.1 will be removed by the user. If we
>> want to continue do COLO, we need add a new children.1 again.
> 
> What if children.0 fails ?

For COLO, reading from children.1 always fails. if children.0 fails, it
means that reading from the disk fails. The guest vm will see the I/O error.

Thanks
Wen Congyang

> 
> Berto
> 
> 
> .
>
Dr. David Alan Gilbert March 17, 2016, 11:25 a.m. UTC | #14
* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 03/17/2016 06:07 PM, Alberto Garcia wrote:
> > On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
> >>> We should have the failure modes documented, and how you'll use it
> >>> after failover etc Without that it's really difficult to tell if this
> >>> naming is right.
> >>
> >> For COLO, children.0 is the real disk, children.1 is replication
> >> driver.  After failure, children.1 will be removed by the user. If we
> >> want to continue do COLO, we need add a new children.1 again.
> > 
> > What if children.0 fails ?
> 
> For COLO, reading from children.1 always fails. if children.0 fails, it
> means that reading from the disk fails. The guest vm will see the I/O error.

How do we get that to cause a fail over before the guest detects it?
If the primary's local disk (children.0) fails then if we can failover
at that point then the guest carries running on the secondary without
ever knowing about the failure.

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > Berto
> > 
> > 
> > .
> > 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Wen Congyang March 18, 2016, 2:56 a.m. UTC | #15
On 03/17/2016 07:25 PM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (wency@cn.fujitsu.com) wrote:
>> On 03/17/2016 06:07 PM, Alberto Garcia wrote:
>>> On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
>>>>> We should have the failure modes documented, and how you'll use it
>>>>> after failover etc Without that it's really difficult to tell if this
>>>>> naming is right.
>>>>
>>>> For COLO, children.0 is the real disk, children.1 is replication
>>>> driver.  After failure, children.1 will be removed by the user. If we
>>>> want to continue do COLO, we need add a new children.1 again.
>>>
>>> What if children.0 fails ?
>>
>> For COLO, reading from children.1 always fails. if children.0 fails, it
>> means that reading from the disk fails. The guest vm will see the I/O error.
> 
> How do we get that to cause a fail over before the guest detects it?
> If the primary's local disk (children.0) fails then if we can failover
> at that point then the guest carries running on the secondary without
> ever knowing about the failure.

COLO is not designed for such case. The children.0 can also be quorum, so
you can add more than one real disk, and get more reliability. Another
choice is that, the real disk is an external storage, and it has
its own replication solution.

COLO is designed for such case: the host is crashed, and the guest is still
alive after failover, the client doesn't know this event.

Thanks
Wen Congyang

> 
> Dave
> 
>>
>> Thanks
>> Wen Congyang
>>
>>>
>>> Berto
>>>
>>>
>>> .
>>>
>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
> 
> .
>
Dr. David Alan Gilbert March 18, 2016, 10:48 a.m. UTC | #16
* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 03/17/2016 07:25 PM, Dr. David Alan Gilbert wrote:
> > * Wen Congyang (wency@cn.fujitsu.com) wrote:
> >> On 03/17/2016 06:07 PM, Alberto Garcia wrote:
> >>> On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
> >>>>> We should have the failure modes documented, and how you'll use it
> >>>>> after failover etc Without that it's really difficult to tell if this
> >>>>> naming is right.
> >>>>
> >>>> For COLO, children.0 is the real disk, children.1 is replication
> >>>> driver.  After failure, children.1 will be removed by the user. If we
> >>>> want to continue do COLO, we need add a new children.1 again.
> >>>
> >>> What if children.0 fails ?
> >>
> >> For COLO, reading from children.1 always fails. if children.0 fails, it
> >> means that reading from the disk fails. The guest vm will see the I/O error.
> > 
> > How do we get that to cause a fail over before the guest detects it?
> > If the primary's local disk (children.0) fails then if we can failover
> > at that point then the guest carries running on the secondary without
> > ever knowing about the failure.
> 
> COLO is not designed for such case. The children.0 can also be quorum, so
> you can add more than one real disk, and get more reliability. Another
> choice is that, the real disk is an external storage, and it has
> its own replication solution.
> 
> COLO is designed for such case: the host is crashed, and the guest is still
> alive after failover, the client doesn't know this event.

That seems an odd limitation; the only thing needed for COLO to survive a disk
failure on the primary would be to ensure that the primary fails/triggers failover
if access to the local disk fails and to do it before the IO result is returned to
the guest.

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > Dave
> > 
> >>
> >> Thanks
> >> Wen Congyang
> >>
> >>>
> >>> Berto
> >>>
> >>>
> >>> .
> >>>
> >>
> >>
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> > 
> > .
> > 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Max Reitz March 29, 2016, 3:38 p.m. UTC | #17
On 17.03.2016 10:56, Wen Congyang wrote:
> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:

[...]

>> The children.0 notation is really confusing in the way that Berto
>> describes; I hit this a couple of months ago and it really doesn't
>> make sense.
> 
> Do you mean: read from children.1 first, and then read from children.0 in
> fifo mode? Yes, the behavior is very strange.

So is this intended or is it not? In
http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
you said that it is.

I myself would indeed say it is very strange. If I were a user, I would
not expect this behavior. And as I developer, I think that how a BDS's
child is used by its parent should solely depend on its role (e.g.
whether it is "children.0" or "children.1").

Max
Eric Blake March 29, 2016, 3:44 p.m. UTC | #18
On 03/29/2016 09:38 AM, Max Reitz wrote:
> On 17.03.2016 10:56, Wen Congyang wrote:
>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> 
> [...]
> 
>>> The children.0 notation is really confusing in the way that Berto
>>> describes; I hit this a couple of months ago and it really doesn't
>>> make sense.
>>
>> Do you mean: read from children.1 first, and then read from children.0 in
>> fifo mode? Yes, the behavior is very strange.
> 
> So is this intended or is it not? In
> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
> you said that it is.
> 
> I myself would indeed say it is very strange. If I were a user, I would
> not expect this behavior. And as I developer, I think that how a BDS's
> child is used by its parent should solely depend on its role (e.g.
> whether it is "children.0" or "children.1").

It sounds like the argument here, and in Max's thread on
query-block-node-tree, is that we DO have cases where order matters, and
so we need a way for the hot-add operation to explicitly specify where
in the list a child is inserted (whether it is being inserted as the new
primary image, or explicitly as the last resort, or somewhere in the
middle).  An optional parameter, that defaults to appending, may be ok,
but we definitely need to consider how the order of children is affected
by hot-add.
Dr. David Alan Gilbert March 29, 2016, 3:50 p.m. UTC | #19
* Eric Blake (eblake@redhat.com) wrote:
> On 03/29/2016 09:38 AM, Max Reitz wrote:
> > On 17.03.2016 10:56, Wen Congyang wrote:
> >> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> > 
> > [...]
> > 
> >>> The children.0 notation is really confusing in the way that Berto
> >>> describes; I hit this a couple of months ago and it really doesn't
> >>> make sense.
> >>
> >> Do you mean: read from children.1 first, and then read from children.0 in
> >> fifo mode? Yes, the behavior is very strange.
> > 
> > So is this intended or is it not? In
> > http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
> > you said that it is.
> > 
> > I myself would indeed say it is very strange. If I were a user, I would
> > not expect this behavior. And as I developer, I think that how a BDS's
> > child is used by its parent should solely depend on its role (e.g.
> > whether it is "children.0" or "children.1").
> 
> It sounds like the argument here, and in Max's thread on
> query-block-node-tree, is that we DO have cases where order matters, and
> so we need a way for the hot-add operation to explicitly specify where
> in the list a child is inserted (whether it is being inserted as the new
> primary image, or explicitly as the last resort, or somewhere in the
> middle).  An optional parameter, that defaults to appending, may be ok,
> but we definitely need to consider how the order of children is affected
> by hot-add.

Certainly in the COLO case the two children are not identical; and IMHO we need
to get away from thinking about ordering and start thinking about functional
namingd - children.0/children.1 doesn't suggest the fact they behave
differently.

Dave

> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Max Reitz March 29, 2016, 3:51 p.m. UTC | #20
On 29.03.2016 17:44, Eric Blake wrote:
> On 03/29/2016 09:38 AM, Max Reitz wrote:
>> On 17.03.2016 10:56, Wen Congyang wrote:
>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
>>
>> [...]
>>
>>>> The children.0 notation is really confusing in the way that Berto
>>>> describes; I hit this a couple of months ago and it really doesn't
>>>> make sense.
>>>
>>> Do you mean: read from children.1 first, and then read from children.0 in
>>> fifo mode? Yes, the behavior is very strange.
>>
>> So is this intended or is it not? In
>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
>> you said that it is.
>>
>> I myself would indeed say it is very strange. If I were a user, I would
>> not expect this behavior. And as I developer, I think that how a BDS's
>> child is used by its parent should solely depend on its role (e.g.
>> whether it is "children.0" or "children.1").
> 
> It sounds like the argument here, and in Max's thread on
> query-block-node-tree, is that we DO have cases where order matters, and
> so we need a way for the hot-add operation to explicitly specify where
> in the list a child is inserted (whether it is being inserted as the new
> primary image, or explicitly as the last resort, or somewhere in the
> middle).  An optional parameter, that defaults to appending, may be ok,
> but we definitely need to consider how the order of children is affected
> by hot-add.

However, the order should be queriable after the fact, and there are
three ways I see to accomplish this:

(1) Make this information queriable as driver-specific BDS information.
    I personally don't like it very much, but it would be fine.
(2) Implement query-block-node-tree, make the order of child nodes
    significant and thus represent the FIFO order there. I don't like
    this because it would mean returning two orders through that child
    node list: One is the numeric order (children.0, children.1, ...)
    and another is the FIFO order, which are not necessarily equal.
(3) Fix FIFO order to the child name (its role). I'm very much in favor
    of this.

While I don't have good arguments against (1), I think I have good
arguments for (3) instead: It just doesn't make sense to have a numeric
order of children if this order doesn't mean anything; especially if you
suddenly do need the list of child nodes to be ordered. To me, it
doesn't make any sense to introduce a new hidden order which takes
precedence over this obvious user-visible order.

Max
Max Reitz March 29, 2016, 3:52 p.m. UTC | #21
On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
> * Eric Blake (eblake@redhat.com) wrote:
>> On 03/29/2016 09:38 AM, Max Reitz wrote:
>>> On 17.03.2016 10:56, Wen Congyang wrote:
>>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
>>>
>>> [...]
>>>
>>>>> The children.0 notation is really confusing in the way that Berto
>>>>> describes; I hit this a couple of months ago and it really doesn't
>>>>> make sense.
>>>>
>>>> Do you mean: read from children.1 first, and then read from children.0 in
>>>> fifo mode? Yes, the behavior is very strange.
>>>
>>> So is this intended or is it not? In
>>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
>>> you said that it is.
>>>
>>> I myself would indeed say it is very strange. If I were a user, I would
>>> not expect this behavior. And as I developer, I think that how a BDS's
>>> child is used by its parent should solely depend on its role (e.g.
>>> whether it is "children.0" or "children.1").
>>
>> It sounds like the argument here, and in Max's thread on
>> query-block-node-tree, is that we DO have cases where order matters, and
>> so we need a way for the hot-add operation to explicitly specify where
>> in the list a child is inserted (whether it is being inserted as the new
>> primary image, or explicitly as the last resort, or somewhere in the
>> middle).  An optional parameter, that defaults to appending, may be ok,
>> but we definitely need to consider how the order of children is affected
>> by hot-add.
> 
> Certainly in the COLO case the two children are not identical; and IMHO we need
> to get away from thinking about ordering and start thinking about functional
> namingd - children.0/children.1 doesn't suggest the fact they behave
> differently.

To me it does. If quorum is operating in a mode call "FIFO" I would
expect some order on the child nodes, and if the child nodes are
actually numbered in an ascending order, that is an obvious order.

Max
Dr. David Alan Gilbert March 29, 2016, 3:54 p.m. UTC | #22
* Max Reitz (mreitz@redhat.com) wrote:
> On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
> > * Eric Blake (eblake@redhat.com) wrote:
> >> On 03/29/2016 09:38 AM, Max Reitz wrote:
> >>> On 17.03.2016 10:56, Wen Congyang wrote:
> >>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> >>>
> >>> [...]
> >>>
> >>>>> The children.0 notation is really confusing in the way that Berto
> >>>>> describes; I hit this a couple of months ago and it really doesn't
> >>>>> make sense.
> >>>>
> >>>> Do you mean: read from children.1 first, and then read from children.0 in
> >>>> fifo mode? Yes, the behavior is very strange.
> >>>
> >>> So is this intended or is it not? In
> >>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
> >>> you said that it is.
> >>>
> >>> I myself would indeed say it is very strange. If I were a user, I would
> >>> not expect this behavior. And as I developer, I think that how a BDS's
> >>> child is used by its parent should solely depend on its role (e.g.
> >>> whether it is "children.0" or "children.1").
> >>
> >> It sounds like the argument here, and in Max's thread on
> >> query-block-node-tree, is that we DO have cases where order matters, and
> >> so we need a way for the hot-add operation to explicitly specify where
> >> in the list a child is inserted (whether it is being inserted as the new
> >> primary image, or explicitly as the last resort, or somewhere in the
> >> middle).  An optional parameter, that defaults to appending, may be ok,
> >> but we definitely need to consider how the order of children is affected
> >> by hot-add.
> > 
> > Certainly in the COLO case the two children are not identical; and IMHO we need
> > to get away from thinking about ordering and start thinking about functional
> > namingd - children.0/children.1 doesn't suggest the fact they behave
> > differently.
> 
> To me it does. If quorum is operating in a mode call "FIFO" I would
> expect some order on the child nodes, and if the child nodes are
> actually numbered in an ascending order, that is an obvious order.

I don't understand why it's called 'FIFO'.

Dave

> Max
> 



--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Max Reitz March 29, 2016, 3:59 p.m. UTC | #23
On 29.03.2016 17:54, Dr. David Alan Gilbert wrote:
> * Max Reitz (mreitz@redhat.com) wrote:
>> On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
>>> * Eric Blake (eblake@redhat.com) wrote:
>>>> On 03/29/2016 09:38 AM, Max Reitz wrote:
>>>>> On 17.03.2016 10:56, Wen Congyang wrote:
>>>>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>> The children.0 notation is really confusing in the way that Berto
>>>>>>> describes; I hit this a couple of months ago and it really doesn't
>>>>>>> make sense.
>>>>>>
>>>>>> Do you mean: read from children.1 first, and then read from children.0 in
>>>>>> fifo mode? Yes, the behavior is very strange.
>>>>>
>>>>> So is this intended or is it not? In
>>>>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
>>>>> you said that it is.
>>>>>
>>>>> I myself would indeed say it is very strange. If I were a user, I would
>>>>> not expect this behavior. And as I developer, I think that how a BDS's
>>>>> child is used by its parent should solely depend on its role (e.g.
>>>>> whether it is "children.0" or "children.1").
>>>>
>>>> It sounds like the argument here, and in Max's thread on
>>>> query-block-node-tree, is that we DO have cases where order matters, and
>>>> so we need a way for the hot-add operation to explicitly specify where
>>>> in the list a child is inserted (whether it is being inserted as the new
>>>> primary image, or explicitly as the last resort, or somewhere in the
>>>> middle).  An optional parameter, that defaults to appending, may be ok,
>>>> but we definitely need to consider how the order of children is affected
>>>> by hot-add.
>>>
>>> Certainly in the COLO case the two children are not identical; and IMHO we need
>>> to get away from thinking about ordering and start thinking about functional
>>> namingd - children.0/children.1 doesn't suggest the fact they behave
>>> differently.
>>
>> To me it does. If quorum is operating in a mode call "FIFO" I would
>> expect some order on the child nodes, and if the child nodes are
>> actually numbered in an ascending order, that is an obvious order.
> 
> I don't understand why it's called 'FIFO'.

Because in that mode quorum successively reads from all of its children
and returns the first successful result. So the First successful Input
is the one that becomes quorum's Output (there isn't much of a
successive output, so it doesn't make much sense to call that the First
Output, though...).

I didn't name it, though. *waves hands defensively* :-)

Max
Dr. David Alan Gilbert March 29, 2016, 4:03 p.m. UTC | #24
* Max Reitz (mreitz@redhat.com) wrote:
> On 29.03.2016 17:54, Dr. David Alan Gilbert wrote:
> > * Max Reitz (mreitz@redhat.com) wrote:
> >> On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
> >>> * Eric Blake (eblake@redhat.com) wrote:
> >>>> On 03/29/2016 09:38 AM, Max Reitz wrote:
> >>>>> On 17.03.2016 10:56, Wen Congyang wrote:
> >>>>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>>>> The children.0 notation is really confusing in the way that Berto
> >>>>>>> describes; I hit this a couple of months ago and it really doesn't
> >>>>>>> make sense.
> >>>>>>
> >>>>>> Do you mean: read from children.1 first, and then read from children.0 in
> >>>>>> fifo mode? Yes, the behavior is very strange.
> >>>>>
> >>>>> So is this intended or is it not? In
> >>>>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
> >>>>> you said that it is.
> >>>>>
> >>>>> I myself would indeed say it is very strange. If I were a user, I would
> >>>>> not expect this behavior. And as I developer, I think that how a BDS's
> >>>>> child is used by its parent should solely depend on its role (e.g.
> >>>>> whether it is "children.0" or "children.1").
> >>>>
> >>>> It sounds like the argument here, and in Max's thread on
> >>>> query-block-node-tree, is that we DO have cases where order matters, and
> >>>> so we need a way for the hot-add operation to explicitly specify where
> >>>> in the list a child is inserted (whether it is being inserted as the new
> >>>> primary image, or explicitly as the last resort, or somewhere in the
> >>>> middle).  An optional parameter, that defaults to appending, may be ok,
> >>>> but we definitely need to consider how the order of children is affected
> >>>> by hot-add.
> >>>
> >>> Certainly in the COLO case the two children are not identical; and IMHO we need
> >>> to get away from thinking about ordering and start thinking about functional
> >>> namingd - children.0/children.1 doesn't suggest the fact they behave
> >>> differently.
> >>
> >> To me it does. If quorum is operating in a mode call "FIFO" I would
> >> expect some order on the child nodes, and if the child nodes are
> >> actually numbered in an ascending order, that is an obvious order.
> > 
> > I don't understand why it's called 'FIFO'.
> 
> Because in that mode quorum successively reads from all of its children
> and returns the first successful result. So the First successful Input
> is the one that becomes quorum's Output (there isn't much of a
> successive output, so it doesn't make much sense to call that the First
> Output, though...).
> 
> I didn't name it, though. *waves hands defensively* :-)

But that description doesn't make sense for what COLO uses it for.

They have, on the primary host:
   0) Local disk
   1) an NBD connection to the secondary

So in theory a read should always happen from (0) and writes should
go to both.

Dave

> Max
> 



--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Max Reitz March 29, 2016, 4:09 p.m. UTC | #25
On 29.03.2016 18:03, Dr. David Alan Gilbert wrote:
> * Max Reitz (mreitz@redhat.com) wrote:
>> On 29.03.2016 17:54, Dr. David Alan Gilbert wrote:
>>> * Max Reitz (mreitz@redhat.com) wrote:
>>>> On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
>>>>> * Eric Blake (eblake@redhat.com) wrote:
>>>>>> On 03/29/2016 09:38 AM, Max Reitz wrote:
>>>>>>> On 17.03.2016 10:56, Wen Congyang wrote:
>>>>>>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>>> The children.0 notation is really confusing in the way that Berto
>>>>>>>>> describes; I hit this a couple of months ago and it really doesn't
>>>>>>>>> make sense.
>>>>>>>>
>>>>>>>> Do you mean: read from children.1 first, and then read from children.0 in
>>>>>>>> fifo mode? Yes, the behavior is very strange.
>>>>>>>
>>>>>>> So is this intended or is it not? In
>>>>>>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
>>>>>>> you said that it is.
>>>>>>>
>>>>>>> I myself would indeed say it is very strange. If I were a user, I would
>>>>>>> not expect this behavior. And as I developer, I think that how a BDS's
>>>>>>> child is used by its parent should solely depend on its role (e.g.
>>>>>>> whether it is "children.0" or "children.1").
>>>>>>
>>>>>> It sounds like the argument here, and in Max's thread on
>>>>>> query-block-node-tree, is that we DO have cases where order matters, and
>>>>>> so we need a way for the hot-add operation to explicitly specify where
>>>>>> in the list a child is inserted (whether it is being inserted as the new
>>>>>> primary image, or explicitly as the last resort, or somewhere in the
>>>>>> middle).  An optional parameter, that defaults to appending, may be ok,
>>>>>> but we definitely need to consider how the order of children is affected
>>>>>> by hot-add.
>>>>>
>>>>> Certainly in the COLO case the two children are not identical; and IMHO we need
>>>>> to get away from thinking about ordering and start thinking about functional
>>>>> namingd - children.0/children.1 doesn't suggest the fact they behave
>>>>> differently.
>>>>
>>>> To me it does. If quorum is operating in a mode call "FIFO" I would
>>>> expect some order on the child nodes, and if the child nodes are
>>>> actually numbered in an ascending order, that is an obvious order.
>>>
>>> I don't understand why it's called 'FIFO'.
>>
>> Because in that mode quorum successively reads from all of its children
>> and returns the first successful result. So the First successful Input
>> is the one that becomes quorum's Output (there isn't much of a
>> successive output, so it doesn't make much sense to call that the First
>> Output, though...).
>>
>> I didn't name it, though. *waves hands defensively* :-)
> 
> But that description doesn't make sense for what COLO uses it for.
> 
> They have, on the primary host:
>    0) Local disk
>    1) an NBD connection to the secondary
> 
> So in theory a read should always happen from (0) and writes should
> go to both.

Well that's the way it works, isn't it?

I didn't mention what happens with writes, but those are indeed
distributed to all of quorum's children. And as long as the local disk
doesn't fail, data is always read from it alone.

All you need to do is make sure that the local disk is the first node in
whatever order FIFO is supposed to use.

Max
Dr. David Alan Gilbert March 29, 2016, 5:33 p.m. UTC | #26
* Max Reitz (mreitz@redhat.com) wrote:
> On 29.03.2016 18:03, Dr. David Alan Gilbert wrote:
> > * Max Reitz (mreitz@redhat.com) wrote:
> >> On 29.03.2016 17:54, Dr. David Alan Gilbert wrote:
> >>> * Max Reitz (mreitz@redhat.com) wrote:
> >>>> On 29.03.2016 17:50, Dr. David Alan Gilbert wrote:
> >>>>> * Eric Blake (eblake@redhat.com) wrote:
> >>>>>> On 03/29/2016 09:38 AM, Max Reitz wrote:
> >>>>>>> On 17.03.2016 10:56, Wen Congyang wrote:
> >>>>>>>> On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> >>>>>>>
> >>>>>>> [...]
> >>>>>>>
> >>>>>>>>> The children.0 notation is really confusing in the way that Berto
> >>>>>>>>> describes; I hit this a couple of months ago and it really doesn't
> >>>>>>>>> make sense.
> >>>>>>>>
> >>>>>>>> Do you mean: read from children.1 first, and then read from children.0 in
> >>>>>>>> fifo mode? Yes, the behavior is very strange.
> >>>>>>>
> >>>>>>> So is this intended or is it not? In
> >>>>>>> http://lists.nongnu.org/archive/html/qemu-block/2016-03/msg00526.html
> >>>>>>> you said that it is.
> >>>>>>>
> >>>>>>> I myself would indeed say it is very strange. If I were a user, I would
> >>>>>>> not expect this behavior. And as I developer, I think that how a BDS's
> >>>>>>> child is used by its parent should solely depend on its role (e.g.
> >>>>>>> whether it is "children.0" or "children.1").
> >>>>>>
> >>>>>> It sounds like the argument here, and in Max's thread on
> >>>>>> query-block-node-tree, is that we DO have cases where order matters, and
> >>>>>> so we need a way for the hot-add operation to explicitly specify where
> >>>>>> in the list a child is inserted (whether it is being inserted as the new
> >>>>>> primary image, or explicitly as the last resort, or somewhere in the
> >>>>>> middle).  An optional parameter, that defaults to appending, may be ok,
> >>>>>> but we definitely need to consider how the order of children is affected
> >>>>>> by hot-add.
> >>>>>
> >>>>> Certainly in the COLO case the two children are not identical; and IMHO we need
> >>>>> to get away from thinking about ordering and start thinking about functional
> >>>>> namingd - children.0/children.1 doesn't suggest the fact they behave
> >>>>> differently.
> >>>>
> >>>> To me it does. If quorum is operating in a mode call "FIFO" I would
> >>>> expect some order on the child nodes, and if the child nodes are
> >>>> actually numbered in an ascending order, that is an obvious order.
> >>>
> >>> I don't understand why it's called 'FIFO'.
> >>
> >> Because in that mode quorum successively reads from all of its children
> >> and returns the first successful result. So the First successful Input
> >> is the one that becomes quorum's Output (there isn't much of a
> >> successive output, so it doesn't make much sense to call that the First
> >> Output, though...).
> >>
> >> I didn't name it, though. *waves hands defensively* :-)
> > 
> > But that description doesn't make sense for what COLO uses it for.
> > 
> > They have, on the primary host:
> >    0) Local disk
> >    1) an NBD connection to the secondary
> > 
> > So in theory a read should always happen from (0) and writes should
> > go to both.
> 
> Well that's the way it works, isn't it?
> 
> I didn't mention what happens with writes, but those are indeed
> distributed to all of quorum's children. And as long as the local disk
> doesn't fail, data is always read from it alone.

I guess so, but it seems to be odd to name something after an ordering
when you never expect it to actually perform the read from anything other
than the first;  and certainly for fault tolerance stuff I think it's
important to define the failure modes.

Dave

> All you need to do is make sure that the local disk is the first node in
> whatever order FIFO is supposed to use.
> 
> Max
> 



--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Alberto Garcia March 30, 2016, 11:39 a.m. UTC | #27
On Tue 29 Mar 2016 05:51:22 PM CEST, Max Reitz wrote:
>> It sounds like the argument here, and in Max's thread on
>> query-block-node-tree, is that we DO have cases where order matters, and
>> so we need a way for the hot-add operation to explicitly specify where
>> in the list a child is inserted (whether it is being inserted as the new
>> primary image, or explicitly as the last resort, or somewhere in the
>> middle).  An optional parameter, that defaults to appending, may be ok,
>> but we definitely need to consider how the order of children is affected
>> by hot-add.
>
> However, the order should be queriable after the fact, and there are
> three ways I see to accomplish this:
>
> (1) Make this information queriable as driver-specific BDS information.
>     I personally don't like it very much, but it would be fine.
> (2) Implement query-block-node-tree, make the order of child nodes
>     significant and thus represent the FIFO order there. I don't like
>     this because it would mean returning two orders through that child
>     node list: One is the numeric order (children.0, children.1, ...)
>     and another is the FIFO order, which are not necessarily equal.
> (3) Fix FIFO order to the child name (its role). I'm very much in favor
>     of this.
>
> While I don't have good arguments against (1), I think I have good
> arguments for (3) instead: It just doesn't make sense to have a numeric
> order of children if this order doesn't mean anything; especially if you
> suddenly do need the list of child nodes to be ordered. To me, it
> doesn't make any sense to introduce a new hidden order which takes
> precedence over this obvious user-visible order.

I'm not sure if I understand correctly what you mean in (3). The
user-visible FIFO order is the one specified when the Quorum is created:

children.0.file.filename=hd0.qcow2,
children.1.file.filename=hd1.qcow2,
...

Would you then call those BDS children.0, children.1, etc and make those
names be the ones that actually define how they are ordered internally?

I also have another (not directly related) question: why not simply use
the node name when removing children? I understood that the idea was
that it's possible to have the same node attached twice to the same
Quorum, but can you actually do that? And what's the use case?

Berto
Max Reitz March 30, 2016, 3:07 p.m. UTC | #28
On 30.03.2016 13:39, Alberto Garcia wrote:
> On Tue 29 Mar 2016 05:51:22 PM CEST, Max Reitz wrote:
>>> It sounds like the argument here, and in Max's thread on
>>> query-block-node-tree, is that we DO have cases where order matters, and
>>> so we need a way for the hot-add operation to explicitly specify where
>>> in the list a child is inserted (whether it is being inserted as the new
>>> primary image, or explicitly as the last resort, or somewhere in the
>>> middle).  An optional parameter, that defaults to appending, may be ok,
>>> but we definitely need to consider how the order of children is affected
>>> by hot-add.
>>
>> However, the order should be queriable after the fact, and there are
>> three ways I see to accomplish this:
>>
>> (1) Make this information queriable as driver-specific BDS information.
>>     I personally don't like it very much, but it would be fine.
>> (2) Implement query-block-node-tree, make the order of child nodes
>>     significant and thus represent the FIFO order there. I don't like
>>     this because it would mean returning two orders through that child
>>     node list: One is the numeric order (children.0, children.1, ...)
>>     and another is the FIFO order, which are not necessarily equal.
>> (3) Fix FIFO order to the child name (its role). I'm very much in favor
>>     of this.
>>
>> While I don't have good arguments against (1), I think I have good
>> arguments for (3) instead: It just doesn't make sense to have a numeric
>> order of children if this order doesn't mean anything; especially if you
>> suddenly do need the list of child nodes to be ordered. To me, it
>> doesn't make any sense to introduce a new hidden order which takes
>> precedence over this obvious user-visible order.
> 
> I'm not sure if I understand correctly what you mean in (3). The
> user-visible FIFO order is the one specified when the Quorum is created:
> 
> children.0.file.filename=hd0.qcow2,
> children.1.file.filename=hd1.qcow2,
> ...
> 
> Would you then call those BDS children.0, children.1, etc

They are already called that way; it's not their node name but their
"child role" name.

>                                                           and make those
> names be the ones that actually define how they are ordered internally?

Yes, that's what I meant.

> I also have another (not directly related) question: why not simply use
> the node name when removing children? I understood that the idea was
> that it's possible to have the same node attached twice to the same
> Quorum, but can you actually do that? And what's the use case?

What I like about using the child role name is that it automatically
prevents you from specifying a node that is not a child of the given parent.

Which makes me notice that it might be a good idea to require the user
to specify the child's role when adding a new child. In this version of
this series (where only quorum is supported), the children are just
inserted in numerical order (first free slot is taken first), but maybe
the user wants to insert them in a different order.

For quorum, this is basically irrelevant if the order doesn't mean
anything (which I don't like), but it may be relevant for other block
drivers.

And the "filling up quorum's children" topic then makes me notice that
(x-)blockdev-change should probably be transaction-able (so you can
restructure the whole BDS graph in a single transaction), but that's
something we can care about later on.

Max
Alberto Garcia March 31, 2016, 11:42 a.m. UTC | #29
On Wed 30 Mar 2016 05:07:15 PM CEST, Max Reitz wrote:
>> I also have another (not directly related) question: why not simply
>> use the node name when removing children? I understood that the idea
>> was that it's possible to have the same node attached twice to the
>> same Quorum, but can you actually do that? And what's the use case?
>
> What I like about using the child role name is that it automatically
> prevents you from specifying a node that is not a child of the given
> parent.

Right, but checking if a node is not a child and returning an error is
very simple. And it doesn't require the user to keep track of the node
name *and* the child role name.

Unless I'm forgetting something this would be the first time we expose
the child role name in the API, that's why I'm wondering if it's
something worth doing.

> Which makes me notice that it might be a good idea to require the user
> to specify the child's role when adding a new child. In this version
> of this series (where only quorum is supported), the children are just
> inserted in numerical order (first free slot is taken first), but
> maybe the user wants to insert them in a different order.

For the Quorum case it totally makes sense to let the user choose the
position of the new child.

But for creating a Quorum array in the first place we don't require
that, the order is the one that the user provides, and the user does not
need to know about the child role names at that point.

> And the "filling up quorum's children" topic then makes me notice that
> (x-)blockdev-change should probably be transaction-able (so you can
> restructure the whole BDS graph in a single transaction), but that's
> something we can care about later on.

I agree.

Berto
Dr. David Alan Gilbert March 31, 2016, 12:31 p.m. UTC | #30
* Alberto Garcia (berto@igalia.com) wrote:
> On Wed 30 Mar 2016 05:07:15 PM CEST, Max Reitz wrote:
> >> I also have another (not directly related) question: why not simply
> >> use the node name when removing children? I understood that the idea
> >> was that it's possible to have the same node attached twice to the
> >> same Quorum, but can you actually do that? And what's the use case?
> >
> > What I like about using the child role name is that it automatically
> > prevents you from specifying a node that is not a child of the given
> > parent.
> 
> Right, but checking if a node is not a child and returning an error is
> very simple. And it doesn't require the user to keep track of the node
> name *and* the child role name.
> 
> Unless I'm forgetting something this would be the first time we expose
> the child role name in the API, that's why I'm wondering if it's
> something worth doing.
> 
> > Which makes me notice that it might be a good idea to require the user
> > to specify the child's role when adding a new child. In this version
> > of this series (where only quorum is supported), the children are just
> > inserted in numerical order (first free slot is taken first), but
> > maybe the user wants to insert them in a different order.
> 
> For the Quorum case it totally makes sense to let the user choose the
> position of the new child.
> 
> But for creating a Quorum array in the first place we don't require
> that, the order is the one that the user provides, and the user does not
> need to know about the child role names at that point.

Actually in my setup I only add the local block device using the normal command
line, and use drive_add even at startup.  It solves a lot of the ordering
problems with getting COLO booted.

Dave

> 
> > And the "filling up quorum's children" topic then makes me notice that
> > (x-)blockdev-change should probably be transaction-able (so you can
> > restructure the whole BDS graph in a single transaction), but that's
> > something we can care about later on.
> 
> I agree.
> 
> Berto
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Max Reitz April 1, 2016, 3:20 p.m. UTC | #31
On 31.03.2016 13:42, Alberto Garcia wrote:
> On Wed 30 Mar 2016 05:07:15 PM CEST, Max Reitz wrote:
>>> I also have another (not directly related) question: why not simply
>>> use the node name when removing children? I understood that the idea
>>> was that it's possible to have the same node attached twice to the
>>> same Quorum, but can you actually do that? And what's the use case?
>>
>> What I like about using the child role name is that it automatically
>> prevents you from specifying a node that is not a child of the given
>> parent.
> 
> Right, but checking if a node is not a child and returning an error is
> very simple. And it doesn't require the user to keep track of the node
> name *and* the child role name.

Yes. But I think that you need to know parent and child anyway if you
want to modify (delete) an edge in the graph.

Also, it may be possible to have multiple parents per node. Actually, it
is already possible because the BB-BDS relationship is modeled as a
parent-child relationship. Thus, I'm not sure whether it would be
sufficient to specify a single node if you want to delete a single edge.

> Unless I'm forgetting something this would be the first time we expose
> the child role name in the API, that's why I'm wondering if it's
> something worth doing.

Well, the roles are kind of exposed already. It's exactly what you
specify in -drive or blockdev-add.

>> Which makes me notice that it might be a good idea to require the user
>> to specify the child's role when adding a new child. In this version
>> of this series (where only quorum is supported), the children are just
>> inserted in numerical order (first free slot is taken first), but
>> maybe the user wants to insert them in a different order.
> 
> For the Quorum case it totally makes sense to let the user choose the
> position of the new child.
> 
> But for creating a Quorum array in the first place we don't require
> that, the order is the one that the user provides, and the user does not
> need to know about the child role names at that point.

Depends. If you create an empty quorum BDS and then add the children
using the QAPI command introduced in this series, you are right. But if
you add children along with creating the quorum BDS (be it via -drive or
via blockdev-add), one has to specify the child role names.

Max
Wen Congyang April 6, 2016, 7:48 a.m. UTC | #32
On 04/01/2016 11:20 PM, Max Reitz wrote:
> On 31.03.2016 13:42, Alberto Garcia wrote:
>> On Wed 30 Mar 2016 05:07:15 PM CEST, Max Reitz wrote:
>>>> I also have another (not directly related) question: why not simply
>>>> use the node name when removing children? I understood that the idea
>>>> was that it's possible to have the same node attached twice to the
>>>> same Quorum, but can you actually do that? And what's the use case?
>>>
>>> What I like about using the child role name is that it automatically
>>> prevents you from specifying a node that is not a child of the given
>>> parent.
>>
>> Right, but checking if a node is not a child and returning an error is
>> very simple. And it doesn't require the user to keep track of the node
>> name *and* the child role name.
> 
> Yes. But I think that you need to know parent and child anyway if you
> want to modify (delete) an edge in the graph.
> 
> Also, it may be possible to have multiple parents per node. Actually, it
> is already possible because the BB-BDS relationship is modeled as a
> parent-child relationship. Thus, I'm not sure whether it would be
> sufficient to specify a single node if you want to delete a single edge.
> 
>> Unless I'm forgetting something this would be the first time we expose
>> the child role name in the API, that's why I'm wondering if it's
>> something worth doing.
> 
> Well, the roles are kind of exposed already. It's exactly what you
> specify in -drive or blockdev-add.
> 
>>> Which makes me notice that it might be a good idea to require the user
>>> to specify the child's role when adding a new child. In this version
>>> of this series (where only quorum is supported), the children are just
>>> inserted in numerical order (first free slot is taken first), but
>>> maybe the user wants to insert them in a different order.
>>
>> For the Quorum case it totally makes sense to let the user choose the
>> position of the new child.
>>
>> But for creating a Quorum array in the first place we don't require
>> that, the order is the one that the user provides, and the user does not
>> need to know about the child role names at that point.
> 
> Depends. If you create an empty quorum BDS and then add the children
> using the QAPI command introduced in this series, you are right. But if
> you add children along with creating the quorum BDS (be it via -drive or
> via blockdev-add), one has to specify the child role names.

I think the problem is that: the child role name is wrong.

If we always attach the child in the tail, we can do it like this:
the child role name is children.XXX, and the XXX's value is larger than
any child role name's XXX.

For example:
Quorum has one child: children.1(children.0 is removed)
We add a new child, its role name is children.2, not children.0.

If we want to attach the child not in the tail, for example:
Quorum has two children: children.0, children.1. And the new child should
be before children.1. In this case, we should rename children.1 to children.2
and the new child role name can be children.1. If we allow such usage, we
should rename the other child role name when add/deleting a child. It means
that we should query the role name again after add/deleting a child.

Thanks
Wen Congyang

> 
> Max
>
Changlong Xie April 11, 2016, 5:18 a.m. UTC | #33
On 03/30/2016 11:07 PM, Max Reitz wrote:
> On 30.03.2016 13:39, Alberto Garcia wrote:
>> On Tue 29 Mar 2016 05:51:22 PM CEST, Max Reitz wrote:
>>>> It sounds like the argument here, and in Max's thread on
>>>> query-block-node-tree, is that we DO have cases where order matters, and
>>>> so we need a way for the hot-add operation to explicitly specify where
>>>> in the list a child is inserted (whether it is being inserted as the new
>>>> primary image, or explicitly as the last resort, or somewhere in the
>>>> middle).  An optional parameter, that defaults to appending, may be ok,
>>>> but we definitely need to consider how the order of children is affected
>>>> by hot-add.
>>>
>>> However, the order should be queriable after the fact, and there are
>>> three ways I see to accomplish this:
>>>
>>> (1) Make this information queriable as driver-specific BDS information.
>>>      I personally don't like it very much, but it would be fine.
>>> (2) Implement query-block-node-tree, make the order of child nodes
>>>      significant and thus represent the FIFO order there. I don't like
>>>      this because it would mean returning two orders through that child
>>>      node list: One is the numeric order (children.0, children.1, ...)
>>>      and another is the FIFO order, which are not necessarily equal.
>>> (3) Fix FIFO order to the child name (its role). I'm very much in favor
>>>      of this.
>>>
>>> While I don't have good arguments against (1), I think I have good
>>> arguments for (3) instead: It just doesn't make sense to have a numeric
>>> order of children if this order doesn't mean anything; especially if you
>>> suddenly do need the list of child nodes to be ordered. To me, it
>>> doesn't make any sense to introduce a new hidden order which takes
>>> precedence over this obvious user-visible order.
>>
>> I'm not sure if I understand correctly what you mean in (3). The
>> user-visible FIFO order is the one specified when the Quorum is created:
>>
>> children.0.file.filename=hd0.qcow2,
>> children.1.file.filename=hd1.qcow2,
>> ...
>>
>> Would you then call those BDS children.0, children.1, etc
>
> They are already called that way; it's not their node name but their
> "child role" name.
>
>>                                                            and make those
>> names be the ones that actually define how they are ordered internally?
>
> Yes, that's what I meant.
>
Hi Max

I think you just mean what i draw in below chart:

1) Insert 4 children.
0           1          2          3
+----------------------------------------------------
|children.0|children.1|children.2|children.3|
+----------------------------------------------------

2) Remove the 2th child (s->children[1])

{ "execute": "x-blockdev-change", 
 

   "arguments": { "parent": "xxx", 
 

                  "child": "children.1" } }

0           1          2          3
+----------------------------------------------------
|children.0|children.1|children.2|children.3|
+------------------------+------------+--------------
                          |            | 	
                   +------+   +--------+
0           1     |    2     |
+----------------v----------v------------------------
|children.0|children.1|children.2|
+----------------------------------------------------

Remove children.1 and shorten the array, then rename children.{2,3} to 
children.{1.2}

3) Insert a new child

0           1          2         3
+----------------------------------------------------
|children.0|children.1|children.2|children.3|
+----------------------------------------------------

But as Wen said: 
http://lists.nongnu.org/archive/html/qemu-devel/2016-04/msg00898.html

Everytime we try to remove a children.i (i < n-1, so it's not the last 
element of the array[n]), we have to rename children.{i+1, n-1} to 
children.{i, n-2}.

Thanks
	-Xie

>> I also have another (not directly related) question: why not simply use
>> the node name when removing children? I understood that the idea was
>> that it's possible to have the same node attached twice to the same
>> Quorum, but can you actually do that? And what's the use case?
>
> What I like about using the child role name is that it automatically
> prevents you from specifying a node that is not a child of the given parent.
>
> Which makes me notice that it might be a good idea to require the user
> to specify the child's role when adding a new child. In this version of
> this series (where only quorum is supported), the children are just
> inserted in numerical order (first free slot is taken first), but maybe
> the user wants to insert them in a different order.
>
> For quorum, this is basically irrelevant if the order doesn't mean
> anything (which I don't like), but it may be relevant for other block
> drivers.
>
> And the "filling up quorum's children" topic then makes me notice that
> (x-)blockdev-change should probably be transaction-able (so you can
> restructure the whole BDS graph in a single transaction), but that's
> something we can care about later on.
>
> Max
>
Max Reitz April 12, 2016, 4:21 p.m. UTC | #34
On 11.04.2016 07:18, Changlong Xie wrote:
> On 03/30/2016 11:07 PM, Max Reitz wrote:
>> On 30.03.2016 13:39, Alberto Garcia wrote:
>>> On Tue 29 Mar 2016 05:51:22 PM CEST, Max Reitz wrote:
>>>>> It sounds like the argument here, and in Max's thread on
>>>>> query-block-node-tree, is that we DO have cases where order
>>>>> matters, and
>>>>> so we need a way for the hot-add operation to explicitly specify where
>>>>> in the list a child is inserted (whether it is being inserted as
>>>>> the new
>>>>> primary image, or explicitly as the last resort, or somewhere in the
>>>>> middle).  An optional parameter, that defaults to appending, may be
>>>>> ok,
>>>>> but we definitely need to consider how the order of children is
>>>>> affected
>>>>> by hot-add.
>>>>
>>>> However, the order should be queriable after the fact, and there are
>>>> three ways I see to accomplish this:
>>>>
>>>> (1) Make this information queriable as driver-specific BDS information.
>>>>      I personally don't like it very much, but it would be fine.
>>>> (2) Implement query-block-node-tree, make the order of child nodes
>>>>      significant and thus represent the FIFO order there. I don't like
>>>>      this because it would mean returning two orders through that child
>>>>      node list: One is the numeric order (children.0, children.1, ...)
>>>>      and another is the FIFO order, which are not necessarily equal.
>>>> (3) Fix FIFO order to the child name (its role). I'm very much in favor
>>>>      of this.
>>>>
>>>> While I don't have good arguments against (1), I think I have good
>>>> arguments for (3) instead: It just doesn't make sense to have a numeric
>>>> order of children if this order doesn't mean anything; especially if
>>>> you
>>>> suddenly do need the list of child nodes to be ordered. To me, it
>>>> doesn't make any sense to introduce a new hidden order which takes
>>>> precedence over this obvious user-visible order.
>>>
>>> I'm not sure if I understand correctly what you mean in (3). The
>>> user-visible FIFO order is the one specified when the Quorum is created:
>>>
>>> children.0.file.filename=hd0.qcow2,
>>> children.1.file.filename=hd1.qcow2,
>>> ...
>>>
>>> Would you then call those BDS children.0, children.1, etc
>>
>> They are already called that way; it's not their node name but their
>> "child role" name.
>>
>>>                                                            and make
>>> those
>>> names be the ones that actually define how they are ordered internally?
>>
>> Yes, that's what I meant.
>>
> Hi Max
> 
> I think you just mean what i draw in below chart:
> 
> 1) Insert 4 children.
> 0           1          2          3
> +----------------------------------------------------
> |children.0|children.1|children.2|children.3|
> +----------------------------------------------------
> 
> 2) Remove the 2th child (s->children[1])
> 
> { "execute": "x-blockdev-change",
> 
>   "arguments": { "parent": "xxx",
> 
>                  "child": "children.1" } }
> 
> 0           1          2          3
> +----------------------------------------------------
> |children.0|children.1|children.2|children.3|
> +------------------------+------------+--------------
>                          |            |    
>                   +------+   +--------+
> 0           1     |    2     |
> +----------------v----------v------------------------
> |children.0|children.1|children.2|
> +----------------------------------------------------

No, what I meant, is:

 0          1          2          3
+----------+----------+----------+----------+
|children.0|children.1|children.2|children.3|
+----------+----------+----------+----------+

|
v

 0          1          2          3
+----------+----------+----------+----------+
|children.0|          |children.2|children.3|
+----------+----------+----------+----------+

I.e., children.1 simply ceases to exist.

Max

> Remove children.1 and shorten the array, then rename children.{2,3} to
> children.{1.2}
> 
> 3) Insert a new child
> 
> 0           1          2         3
> +----------------------------------------------------
> |children.0|children.1|children.2|children.3|
> +----------------------------------------------------
> 
> But as Wen said:
> http://lists.nongnu.org/archive/html/qemu-devel/2016-04/msg00898.html
> 
> Everytime we try to remove a children.i (i < n-1, so it's not the last
> element of the array[n]), we have to rename children.{i+1, n-1} to
> children.{i, n-2}.
> 
> Thanks
>     -Xie
diff mbox

Patch

diff --git a/block.c b/block.c
index d48f441..66d21af 100644
--- a/block.c
+++ b/block.c
@@ -1194,10 +1194,10 @@  static int bdrv_fill_options(QDict **options, const char *filename,
     return 0;
 }
 
-static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
-                                    BlockDriverState *child_bs,
-                                    const char *child_name,
-                                    const BdrvChildRole *child_role)
+BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+                             BlockDriverState *child_bs,
+                             const char *child_name,
+                             const BdrvChildRole *child_role)
 {
     BdrvChild *child = g_new(BdrvChild, 1);
     *child = (BdrvChild) {
diff --git a/block/quorum.c b/block/quorum.c
index 11cc60b..97f030b 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@ 
 #include "qapi/qmp/qstring.h"
 #include "qapi-event.h"
 #include "crypto/hash.h"
+#include "qemu/bitmap.h"
 
 #define HASH_LENGTH 32
 
@@ -81,6 +82,8 @@  typedef struct BDRVQuorumState {
     bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
                             * block if Quorum is reached.
                             */
+    unsigned long *index_bitmap;
+    int bsize;
 
     QuorumReadPattern read_pattern;
 } BDRVQuorumState;
@@ -877,9 +880,9 @@  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
         ret = -EINVAL;
         goto exit;
     }
-    if (s->num_children < 2) {
+    if (s->num_children < 1) {
         error_setg(&local_err,
-                   "Number of provided children must be greater than 1");
+                   "Number of provided children must be 1 or more");
         ret = -EINVAL;
         goto exit;
     }
@@ -928,6 +931,7 @@  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     /* allocate the children array */
     s->children = g_new0(BdrvChild *, s->num_children);
     opened = g_new0(bool, s->num_children);
+    s->index_bitmap = bitmap_new(s->num_children);
 
     for (i = 0; i < s->num_children; i++) {
         char indexstr[32];
@@ -943,6 +947,8 @@  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
 
         opened[i] = true;
     }
+    bitmap_set(s->index_bitmap, 0, s->num_children);
+    s->bsize = s->num_children;
 
     g_free(opened);
     goto exit;
@@ -999,6 +1005,114 @@  static void quorum_attach_aio_context(BlockDriverState *bs,
     }
 }
 
+static int get_new_child_index(BDRVQuorumState *s)
+{
+    int index;
+
+    index = find_next_zero_bit(s->index_bitmap, s->bsize, 0);
+    if (index < s->bsize) {
+        return index;
+    }
+
+    s->index_bitmap = bitmap_zero_extend(s->index_bitmap, s->bsize,
+                                         s->bsize + 1);
+    return s->bsize++;
+}
+
+static void remove_child_index(BDRVQuorumState *s, int index)
+{
+    int last_index, old_bsize;
+    size_t new_len;
+
+    assert(index < s->bsize);
+
+    clear_bit(index, s->index_bitmap);
+    if (index < s->bsize - 1) {
+        /* The last bit is always set */
+        return;
+    }
+
+    /* Clear last bit */
+    old_bsize = s->bsize;
+    last_index = find_last_bit(s->index_bitmap, s->bsize);
+    assert(last_index < old_bsize);
+    s->bsize = last_index + 1;
+
+    if (BITS_TO_LONGS(old_bsize) == BITS_TO_LONGS(s->bsize)) {
+        return;
+    }
+
+    new_len = BITS_TO_LONGS(s->bsize) * sizeof(unsigned long);
+    s->index_bitmap = g_realloc(s->index_bitmap, new_len);
+}
+
+static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
+                             Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+    BdrvChild *child;
+    char indexstr[32];
+    int index, ret;
+
+    index = get_new_child_index(s);
+    ret = snprintf(indexstr, 32, "children.%d", index);
+    if (ret < 0 || ret >= 32) {
+        error_setg(errp, "cannot generate child name");
+        return;
+    }
+
+    bdrv_drain(bs);
+
+    assert(s->num_children <= INT_MAX / sizeof(BdrvChild *));
+    if (s->num_children == INT_MAX / sizeof(BdrvChild *)) {
+        error_setg(errp, "Too many children");
+        return;
+    }
+    s->children = g_renew(BdrvChild *, s->children, s->num_children + 1);
+
+    bdrv_ref(child_bs);
+    child = bdrv_attach_child(bs, child_bs, indexstr, &child_format);
+    s->children[s->num_children++] = child;
+    set_bit(index, s->index_bitmap);
+}
+
+static void quorum_del_child(BlockDriverState *bs, BdrvChild *child,
+                             Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+    int i, rc;
+    unsigned long index;
+
+    for (i = 0; i < s->num_children; i++) {
+        if (s->children[i] == child) {
+            break;
+        }
+    }
+
+    /* we have checked it in bdrv_del_child() */
+    assert(i < s->num_children);
+
+    if (s->num_children <= s->threshold) {
+        error_setg(errp,
+            "The number of children cannot be lower than the vote threshold %d",
+            s->threshold);
+        return;
+    }
+
+    /* child->name is "children.%d" */
+    assert(!strncmp(child->name, "children.", 9));
+    rc = qemu_strtoul(child->name + 9, NULL, 10, &index);
+    assert(!rc && index < INT_MAX / sizeof(BdrvChild *));
+
+    bdrv_drain(bs);
+    /* We can safely remove this child now */
+    memmove(&s->children[i], &s->children[i + 1],
+            (s->num_children - i - 1) * sizeof(void *));
+    s->children = g_renew(BdrvChild *, s->children, --s->num_children);
+    remove_child_index(s, index);
+    bdrv_unref_child(bs, child);
+}
+
 static void quorum_refresh_filename(BlockDriverState *bs, QDict *options)
 {
     BDRVQuorumState *s = bs->opaque;
@@ -1054,6 +1168,9 @@  static BlockDriver bdrv_quorum = {
     .bdrv_detach_aio_context            = quorum_detach_aio_context,
     .bdrv_attach_aio_context            = quorum_attach_aio_context,
 
+    .bdrv_add_child                     = quorum_add_child,
+    .bdrv_del_child                     = quorum_del_child,
+
     .is_filter                          = true,
     .bdrv_recurse_is_first_non_filter   = quorum_recurse_is_first_non_filter,
 };
diff --git a/include/block/block.h b/include/block/block.h
index 7378e74..8a3966d 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -517,6 +517,10 @@  void bdrv_disable_copy_on_read(BlockDriverState *bs);
 void bdrv_ref(BlockDriverState *bs);
 void bdrv_unref(BlockDriverState *bs);
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
+BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+                             BlockDriverState *child_bs,
+                             const char *child_name,
+                             const BdrvChildRole *child_role);
 
 bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp);
 void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason);