[RFC,0/4] mirror: implement incremental and bitmap modes

Message ID 20240216105513.309901-1-f.ebner@proxmox.com

Message

Fiona Ebner Feb. 16, 2024, 10:55 a.m. UTC
Previous discussion from when this was sent upstream [0] (it's been a
while). I rebased the patches and re-ordered and squashed like
suggested back then [1].

This implements two new mirror modes:

- bitmap mirror mode with always/on-success/never bitmap sync mode
- incremental mirror mode as sugar for bitmap + on-success

Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror a drive that was previously
  mirrored via some external mechanism (e.g. ZFS replication).

We have been using the last one in production without any issues for
about four years now. In particular, as mentioned in [2]:

> - create bitmap(s)
> - (incrementally) replicate storage volume(s) out of band (using ZFS)
> - incrementally drive mirror as part of a live migration of VM
> - drop bitmap(s)
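
In QMP terms, the above workflow might look roughly like the following
iotests-style Python sketch (node and file names are made up; the
@bitmap and @bitmap-mode arguments to drive-mirror are the ones proposed
by this series, everything else is existing API):

    # create the bitmap before replicating out of band
    vm.qmp('block-dirty-bitmap-add', node='drive0', name='bitmap0')
    # ... replicate the storage volume out of band (e.g. via ZFS) ...
    # mirror only the delta recorded in the bitmap (vm.qmp() turns
    # bitmap_mode into the QMP argument bitmap-mode)
    vm.qmp('drive-mirror', device='drive0', target='/path/to/replica.img',
           sync='bitmap', bitmap='bitmap0', bitmap_mode='on-success',
           format='raw', mode='existing')
    # ... wait for BLOCK_JOB_READY, complete the job, finish migration ...
    vm.qmp('block-dirty-bitmap-remove', node='drive0', name='bitmap0')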


Now, the IO test added in patch 4/4 actually contains yet another use
case, namely doing incremental mirrors to stand-alone qcow2 "diff"
images that only contain the delta and can be rebased later. I had to
adapt the IO test, because its output expected the mirror bitmap to
still be dirty, but nowadays the mirror is apparently already done
when the bitmaps are queried. So I thought, I'll just use
'write-blocking' mode to avoid any potential timing issues.

But this exposed an issue with the diff image approach. If a write is
not aligned to the granularity of the mirror target, then rebasing the
diff image onto a backing image will not yield the desired result,
because the full cluster is considered to be allocated and will "hide"
some part of the base/backing image. The failure can be seen by either
using 'write-blocking' mode in the IO test or setting the (bitmap)
granularity to 32 KiB rather than the current 64 KiB.
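
To make the failure mode concrete, here is a toy model in plain Python
(not QEMU code; sizes and contents are made up):

    CLUSTER = 64 * 1024
    base = {0: b'B' * CLUSTER}  # cluster 0 is allocated in the backing image
    diff = {}                   # fresh diff image, nothing allocated yet

    # a 9-byte guest write mirrored in 'write-blocking' mode: only those
    # bytes reach the target, but qcow2 allocates the whole cluster,
    # zero-filled around them
    cluster0 = bytearray(CLUSTER)
    cluster0[0:9] = b'W' * 9
    diff[0] = bytes(cluster0)

    # after rebasing diff onto base, an allocated diff cluster hides the
    # backing image entirely
    def read(idx):
        return diff.get(idx, base.get(idx, b'\0' * CLUSTER))

    assert read(0)[9:] == b'\0' * (CLUSTER - 9)  # base's 'B's are gone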

The question is how to deal with these edge cases? Some possibilities
that would make sense to me:

For 'background' mode:
* prohibit if target's cluster size is larger than the bitmap
  granularity
* document the limitation

For 'write-blocking' mode:
* disallow in combination with bitmap mode (would not be happy about
  it, because I'd like to use this without diff images)
* for writes that are not aligned to the target's cluster size, read
  the relevant/missing parts from the source image to be able to write
  whole target clusters (seems rather complex)
* document the limitation


[0]: https://lore.kernel.org/qemu-devel/20200218100740.2228521-1-f.gruenbichler@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/d35a76de-78d5-af56-0b34-f7bd2bbd3733@redhat.com/
[2]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astroid@nora.none/


Fabian Grünbichler (2):
  mirror: move some checks to qmp
  iotests: add test for bitmap mirror

John Snow (2):
  drive-mirror: add support for sync=bitmap mode=never
  drive-mirror: add support for conditional and always bitmap sync modes

 block/mirror.c                                |   94 +-
 blockdev.c                                    |   70 +-
 include/block/block_int-global-state.h        |    4 +-
 qapi/block-core.json                          |   25 +-
 tests/qemu-iotests/tests/bitmap-sync-mirror   |  550 ++++
 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2810 +++++++++++++++++
 tests/unit/test-block-iothread.c              |    4 +-
 7 files changed, 3527 insertions(+), 30 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror
 create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out

Comments

Vladimir Sementsov-Ogievskiy Feb. 28, 2024, 4 p.m. UTC | #1
On 16.02.24 13:55, Fiona Ebner wrote:
> Previous discussion from when this was sent upstream [0] (it's been a
> while). I rebased the patches and re-ordered and squashed like
> suggested back then [1].
> 
> This implements two new mirror modes:
> 
> - bitmap mirror mode with always/on-success/never bitmap sync mode
> - incremental mirror mode as sugar for bitmap + on-success
> 
> Use cases:
> * Possibility to resume a failed mirror later.
> * Possibility to only mirror deltas to a previously mirrored volume.
> * Possibility to (efficiently) mirror a drive that was previously
>    mirrored via some external mechanism (e.g. ZFS replication).
> 
> We have been using the last one in production without any issues for
> about four years now. In particular, as mentioned in [2]:
> 
>> - create bitmap(s)
>> - (incrementally) replicate storage volume(s) out of band (using ZFS)
>> - incrementally drive mirror as part of a live migration of VM
>> - drop bitmap(s)
> 
> 
> Now, the IO test added in patch 4/4 actually contains yet another use
> case, namely doing incremental mirrors to stand-alone qcow2 "diff"
> images that only contain the delta and can be rebased later. I had to
> adapt the IO test, because its output expected the mirror bitmap to
> still be dirty, but nowadays the mirror is apparently already done
> when the bitmaps are queried. So I thought, I'll just use
> 'write-blocking' mode to avoid any potential timing issues.
> 
> But this exposed an issue with the diff image approach. If a write is
> not aligned to the granularity of the mirror target, then rebasing the
> diff image onto a backing image will not yield the desired result,
> because the full cluster is considered to be allocated and will "hide"
> some part of the base/backing image. The failure can be seen by either
> using 'write-blocking' mode in the IO test or setting the (bitmap)
> granularity to 32 KiB rather than the current 64 KiB.
> 
> The question is how to deal with these edge cases? Some possibilities
> that would make sense to me:
> 
> For 'background' mode:
> * prohibit if target's cluster size is larger than the bitmap
>    granularity
> * document the limitation
> 
> For 'write-blocking' mode:
> * disallow in combination with bitmap mode (would not be happy about
>    it, because I'd like to use this without diff images)

why not just require the same: bitmap granularity must be >= target granularity

> * for writes that are not aligned to the target's cluster size, read
>    the relevant/missing parts from the source image to be able to write
>    whole target clusters (seems rather complex)

There is another approach: consider an unaligned part of the request that fits in one cluster (we can always split any request into an "aligned" middle part and at most two small "unaligned" parts, each fitting into one cluster).

We have two possibilities:

1. the cluster is dirty (marked dirty in the bitmap used by background process)

We can simply ignore this part and rely on background process. This will not affect the convergence of the mirror job.

2. the cluster is clear (i.e. background process, or some previous write already copied it)

In this case, we are safe to do unaligned write, as target cluster must be allocated.

(for bitmap-mode, I don't consider here clusters that are clear from the start, which we shouldn't copy in any case)
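
A sketch of that splitting in plain Python pseudologic (not the actual
QEMU code; `dirty` stands for the bitmap driving the background process
and `write_to_target` is a placeholder):

    def rounddown(x, a): return x - x % a
    def roundup(x, a): return rounddown(x + a - 1, a)

    def handle_guest_write(offset, length, cluster_size, dirty):
        end = offset + length
        mid_start = min(roundup(offset, cluster_size), end)
        mid_end = max(rounddown(end, cluster_size), mid_start)

        if mid_end > mid_start:  # "aligned" middle part: copy as usual
            write_to_target(mid_start, mid_end - mid_start)

        # at most two small "unaligned" parts, each within one cluster
        for start, stop in ((offset, mid_start), (mid_end, end)):
            if stop <= start:
                continue
            if dirty.get(start // cluster_size):
                continue  # 1. dirty: leave it to the background process
            # 2. clear: the target cluster is already allocated, so the
            #    unaligned write is safe
            write_to_target(start, stop - start)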

> * document the limitation
> 
> 
> [0]: https://lore.kernel.org/qemu-devel/20200218100740.2228521-1-f.gruenbichler@proxmox.com/
> [1]: https://lore.kernel.org/qemu-devel/d35a76de-78d5-af56-0b34-f7bd2bbd3733@redhat.com/
> [2]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astroid@nora.none/
> 
> 
> Fabian Grünbichler (2):
>    mirror: move some checks to qmp
>    iotests: add test for bitmap mirror
> 
> John Snow (2):
>    drive-mirror: add support for sync=bitmap mode=never
>    drive-mirror: add support for conditional and always bitmap sync modes
> 
>   block/mirror.c                                |   94 +-
>   blockdev.c                                    |   70 +-
>   include/block/block_int-global-state.h        |    4 +-
>   qapi/block-core.json                          |   25 +-
>   tests/qemu-iotests/tests/bitmap-sync-mirror   |  550 ++++
>   .../qemu-iotests/tests/bitmap-sync-mirror.out | 2810 +++++++++++++++++
>   tests/unit/test-block-iothread.c              |    4 +-
>   7 files changed, 3527 insertions(+), 30 deletions(-)
>   create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror
>   create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out
>
Vladimir Sementsov-Ogievskiy Feb. 28, 2024, 4:06 p.m. UTC | #2
On 28.02.24 19:00, Vladimir Sementsov-Ogievskiy wrote:
> On 16.02.24 13:55, Fiona Ebner wrote:
>> Previous discussion from when this was sent upstream [0] (it's been a
>> while). I rebased the patches and re-ordered and squashed like
>> suggested back then [1].
>>
>> This implements two new mirror modes:
>>
>> - bitmap mirror mode with always/on-success/never bitmap sync mode
>> - incremental mirror mode as sugar for bitmap + on-success
>>
>> Use cases:
>> * Possibility to resume a failed mirror later.
>> * Possibility to only mirror deltas to a previously mirrored volume.
>> * Possibility to (efficiently) mirror a drive that was previously
>>    mirrored via some external mechanism (e.g. ZFS replication).
>>
>> We have been using the last one in production without any issues for
>> about four years now. In particular, as mentioned in [2]:
>>
>>> - create bitmap(s)
>>> - (incrementally) replicate storage volume(s) out of band (using ZFS)
>>> - incrementally drive mirror as part of a live migration of VM
>>> - drop bitmap(s)
>>
>>
>> Now, the IO test added in patch 4/4 actually contains yet another use
>> case, namely doing incremental mirrors to stand-alone qcow2 "diff"
>> images that only contain the delta and can be rebased later. I had to
>> adapt the IO test, because its output expected the mirror bitmap to
>> still be dirty, but nowadays the mirror is apparently already done
>> when the bitmaps are queried. So I thought, I'll just use
>> 'write-blocking' mode to avoid any potential timing issues.
>>
>> But this exposed an issue with the diff image approach. If a write is
>> not aligned to the granularity of the mirror target, then rebasing the
>> diff image onto a backing image will not yield the desired result,
>> because the full cluster is considered to be allocated and will "hide"
>> some part of the base/backing image. The failure can be seen by either
>> using 'write-blocking' mode in the IO test or setting the (bitmap)
>> granularity to 32 KiB rather than the current 64 KiB.
>>
>> The question is how to deal with these edge cases? Some possibilities
>> that would make sense to me:
>>
>> For 'background' mode:
>> * prohibit if target's cluster size is larger than the bitmap
>>    granularity
>> * document the limitation
>>
>> For 'write-blocking' mode:
>> * disallow in combination with bitmap mode (would not be happy about
>>    it, because I'd like to use this without diff images)
> 
> why not just require the same: bitmap granularity must be >= target granularity
> 
>> * for writes that are not aligned to the target's cluster size, read
>>    the relevant/missing parts from the source image to be able to write
>>    whole target clusters (seems rather complex)
> 
> There is another approach: consider an unaligned part of the request that fits in one cluster (we can always split any request into an "aligned" middle part and at most two small "unaligned" parts, each fitting into one cluster).
> 
> We have two possibilities:
> 
> 1. the cluster is dirty (marked dirty in the bitmap used by background process)
> 
> We can simply ignore this part and rely on background process. This will not affect the convergence of the mirror job.
> 
> 2. the cluster is clear (i.e. background process, or some previous write already copied it)
> 
> In this case, we are safe to do unaligned write, as target cluster must be allocated.
> 
> (for bitmap-mode, I don't consider here clusters that are clear from the start, which we shouldn't copy in any case)
> 

Hmm, right, and that's exactly the logic we already have in do_sync_target_write(). So that's enough just to require that bitmap_granularity >= target_granularity
Vladimir Sementsov-Ogievskiy Feb. 28, 2024, 4:24 p.m. UTC | #3
On 16.02.24 13:55, Fiona Ebner wrote:
> Previous discussion from when this was sent upstream [0] (it's been a
> while). I rebased the patches and re-ordered and squashed like
> suggested back then [1].
> 
> This implements two new mirror modes:
> 
> - bitmap mirror mode with always/on-success/never bitmap sync mode
> - incremental mirror mode as sugar for bitmap + on-success
> 
> Use cases:
> * Possibility to resume a failed mirror later.
> * Possibility to only mirror deltas to a previously mirrored volume.
> * Possibility to (efficiently) mirror an drive that was previously
>    mirrored via some external mechanism (e.g. ZFS replication).
> 
> We have been using the last one in production without any issues for
> about four years now. In particular, as mentioned in [2]:
> 
>> - create bitmap(s)
>> - (incrementally) replicate storage volume(s) out of band (using ZFS)
>> - incrementally drive mirror as part of a live migration of VM
>> - drop bitmap(s)

Actually, which mode do you use: "never", "always" or "conditional"? Or do you have a different approach downstream?

Why am I asking:

These modes (for backup) were developed prior to block-dirty-bitmap-merge command, which allows copying bitmaps as you want. With that API, we actually don't need all these modes, instead it's enough to pass a bitmap, which would be _actually_ used by mirror.

So, if you need "never" mode, you just copy your bitmap by block-dirty-bitmap-add + block-dirty-bitmap-merge, and pass a copy to mirror job.

Or, you pass your bitmap to mirror-job, and have an "always" mode.

And I don't see why we need a "conditional" mode, which actually just drops away the progress we actually made. (OK, we failed, but why drop the progress of successfully copied clusters?)


Using a user-given bitmap in the mirror job also has the additional advantage of live progress: up to visualizing the copying of the disk via the dirty bitmap contents.
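
For example, emulating "never" mode with the existing bitmap commands
could look like this (a sketch with made-up names, assuming mirror
accepted a bitmap to use directly):

    # copy the user's bitmap, so the original stays untouched ("never")
    vm.qmp('block-dirty-bitmap-add', node='drive0', name='scratch0')
    vm.qmp('block-dirty-bitmap-merge', node='drive0', target='scratch0',
           bitmaps=['bitmap0'])
    # hand the scratch copy to the mirror job as its working bitmap
    vm.qmp('drive-mirror', device='drive0', target='/path/to/target.img',
           sync='bitmap', bitmap='scratch0', format='raw', mode='existing')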
Fiona Ebner Feb. 29, 2024, 10:11 a.m. UTC | #4
On 28.02.24 at 17:06, Vladimir Sementsov-Ogievskiy wrote:
> On 28.02.24 19:00, Vladimir Sementsov-Ogievskiy wrote:
>> On 16.02.24 13:55, Fiona Ebner wrote:
>>> Now, the IO test added in patch 4/4 actually contains yet another use
>>> case, namely doing incremental mirrors to stand-alone qcow2 "diff"
>>> images that only contain the delta and can be rebased later. I had to
>>> adapt the IO test, because its output expected the mirror bitmap to
>>> still be dirty, but nowadays the mirror is apparently already done
>>> when the bitmaps are queried. So I thought, I'll just use
>>> 'write-blocking' mode to avoid any potential timing issues.
>>>
>>> But this exposed an issue with the diff image approach. If a write is
>>> not aligned to the granularity of the mirror target, then rebasing the
>>> diff image onto a backing image will not yield the desired result,
>>> because the full cluster is considered to be allocated and will "hide"
>>> some part of the base/backing image. The failure can be seen by either
>>> using 'write-blocking' mode in the IO test or setting the (bitmap)
>>> granularity to 32 KiB rather than the current 64 KiB.
>>>
>>> The question is how to deal with these edge cases? Some possibilities
>>> that would make sense to me:
>>>
>>> For 'background' mode:
>>> * prohibit if target's cluster size is larger than the bitmap
>>>    granularity
>>> * document the limitation
>>>
>>> For 'write-blocking' mode:
>>> * disallow in combination with bitmap mode (would not be happy about
>>>    it, because I'd like to use this without diff images)
>>
>> why not just require the same: bitmap granularity must be >= target
>> granularity
>>

For the iotest's use-case, that only works for background mode. I'll
explain below.

>>> * for writes that are not aligned to the target's cluster size, read
>>>    the relevant/missing parts from the source image to be able to write
>>>    whole target clusters (seems rather complex)
>>
>> There is another approach: consider an unaligned part of the request
>> that fits in one cluster (we can always split any request into an
>> "aligned" middle part and at most two small "unaligned" parts, each
>> fitting into one cluster).
>>
>> We have two possibilities:
>>
>> 1. the cluster is dirty (marked dirty in the bitmap used by background
>> process)
>>
>> We can simply ignore this part and rely on background process. This
>> will not affect the convergence of the mirror job.
>>

Agreed.

>> 2. the cluster is clear (i.e. background process, or some previous
>> write already copied it)
>>

The iotest creates a new target image for each incremental sync which
only records the diff relative to the previous mirror and those diff
images are later rebased onto each other to get the full picture.

Thus, it can be that a previous mirror job (not just background process
or previous write) already copied a cluster, and in particular, copied
it to a different target!

>> In this case, we are safe to do unaligned write, as target cluster
>> must be allocated.

Because the diff image is new, the target's cluster is not necessarily
allocated. When using write-blocking and a write of, e.g., 9 bytes to a
clear source cluster comes in, only those 9 bytes are written to the
target. Now the target's cluster is allocated but with only those 9
bytes of data. When rebasing, the previously copied cluster is "masked"
and when reading the rebased image, we only see the cluster with those 9
bytes (and IIRC, zeroes for the rest of the cluster rather than the
previously copied data).

>>
>> (for bitmap-mode, I don't consider here clusters that are clear from
>> the start, which we shouldn't copy in any case)
>>

We do need to copy new writes to any cluster, and with a clear cluster
and write-blocking, the issue can manifest.

> 
> Hmm, right, and that's exactly the logic we already have in
> do_sync_target_write(). So that's enough just to require that
> bitmap_granularity >= target_granularity
> 

Best Regards,
Fiona
Fiona Ebner Feb. 29, 2024, 10:41 a.m. UTC | #5
On 28.02.24 at 17:24, Vladimir Sementsov-Ogievskiy wrote:
> On 16.02.24 13:55, Fiona Ebner wrote:
>> Previous discussion from when this was sent upstream [0] (it's been a
>> while). I rebased the patches and re-ordered and squashed like
>> suggested back then [1].
>>
>> This implements two new mirror modes:
>>
>> - bitmap mirror mode with always/on-success/never bitmap sync mode
>> - incremental mirror mode as sugar for bitmap + on-success
>>
>> Use cases:
>> * Possibility to resume a failed mirror later.
>> * Possibility to only mirror deltas to a previously mirrored volume.
>> * Possibility to (efficiently) mirror a drive that was previously
>>    mirrored via some external mechanism (e.g. ZFS replication).
>>
>> We have been using the last one in production without any issues for
>> about four years now. In particular, as mentioned in [2]:
>>
>>> - create bitmap(s)
>>> - (incrementally) replicate storage volume(s) out of band (using ZFS)
>>> - incrementally drive mirror as part of a live migration of VM
>>> - drop bitmap(s)
> 
> Actually, which mode do you use: "never", "always" or "conditional"? Or
> do you have a different approach downstream?
> 

We are using "conditional", but I think we don't really require any
specific mode, because we drop the bitmaps after mirroring (even in
failure case). Fabian, please correct me if I'm wrong.

> Why am I asking:
> 
> These modes (for backup) were developed prior to
> block-dirty-bitmap-merge command, which allows copying bitmaps as you
> want. With that API, we actually don't need all these modes, instead
> it's enough to pass a bitmap, which would be _actually_ used by mirror.
> 
> So, if you need "never" mode, you just copy your bitmap by
> block-dirty-bitmap-add + block-dirty-bitmap-merge, and pass a copy to
> mirror job.
> 
> Or, you pass your bitmap to mirror-job, and have an "always" mode.
> 
> And I don't see why we need a "conditional" mode, which actually just
> drops away the progress we actually made. (OK, we failed, but why
> drop the progress of successfully copied clusters?)
> 

I'm not sure actually. Maybe John remembers?

I see, I'll drop the 'bitmap-mode' in the next version if nobody
complains :)

> 
> Using a user-given bitmap in the mirror job also has the additional
> advantage of live progress: up to visualizing the copying of the disk
> via the dirty bitmap contents.
> 

Best Regards,
Fiona
Vladimir Sementsov-Ogievskiy Feb. 29, 2024, 11:48 a.m. UTC | #6
On 29.02.24 13:11, Fiona Ebner wrote:
> On 28.02.24 at 17:06, Vladimir Sementsov-Ogievskiy wrote:
>> On 28.02.24 19:00, Vladimir Sementsov-Ogievskiy wrote:
>>> On 16.02.24 13:55, Fiona Ebner wrote:
>>>> Now, the IO test added in patch 4/4 actually contains yet another use
>>>> case, namely doing incremental mirrors to stand-alone qcow2 "diff"
>>>> images that only contain the delta and can be rebased later. I had to
>>>> adapt the IO test, because its output expected the mirror bitmap to
>>>> still be dirty, but nowadays the mirror is apparently already done
>>>> when the bitmaps are queried. So I thought, I'll just use
>>>> 'write-blocking' mode to avoid any potential timing issues.
>>>>
>>>> But this exposed an issue with the diff image approach. If a write is
>>>> not aligned to the granularity of the mirror target, then rebasing the
>>>> diff image onto a backing image will not yield the desired result,
>>>> because the full cluster is considered to be allocated and will "hide"
>>>> some part of the base/backing image. The failure can be seen by either
>>>> using 'write-blocking' mode in the IO test or setting the (bitmap)
>>>> granularity to 32 KiB rather than the current 64 KiB.
>>>>
>>>> The question is how to deal with these edge cases? Some possibilities
>>>> that would make sense to me:
>>>>
>>>> For 'background' mode:
>>>> * prohibit if target's cluster size is larger than the bitmap
>>>>     granularity
>>>> * document the limitation
>>>>
>>>> For 'write-blocking' mode:
>>>> * disallow in combination with bitmap mode (would not be happy about
>>>>     it, because I'd like to use this without diff images)
>>>
>>> why not just require the same: bitmap granularity must be >= target
>>> granularity
>>>
> 
> For the iotest's use-case, that only works for background mode. I'll
> explain below.
> 
>>>> * for writes that are not aligned to the target's cluster size, read
>>>>     the relevant/missing parts from the source image to be able to write
>>>>     whole target clusters (seems rather complex)
>>>
>>> There is another approach: consider an unaligned part of the request
>>> that fits in one cluster (we can always split any request into an
>>> "aligned" middle part and at most two small "unaligned" parts, each
>>> fitting into one cluster).
>>>
>>> We have two possibilities:
>>>
>>> 1. the cluster is dirty (marked dirty in the bitmap used by background
>>> process)
>>>
>>> We can simply ignore this part and rely on background process. This
>>> will not affect the convergence of the mirror job.
>>>
> 
> Agreed.
> 
>>> 2. the cluster is clear (i.e. background process, or some previous
>>> write already copied it)
>>>
> 
> The iotest creates a new target image for each incremental sync which
> only records the diff relative to the previous mirror and those diff
> images are later rebased onto each other to get the full picture.
> 
> Thus, it can be that a previous mirror job (not just background process
> or previous write) already copied a cluster, and in particular, copied
> it to a different target!

Aha, I understand.

For simplicity, let's consider the case where source "cluster size" = "job cluster size" = "bitmap granularity" = "target cluster size".

Which types of clusters should we consider when we want to handle a guest write?

1. Clusters that should be copied by the background process

These are dirty clusters from user-given bitmap, or if we do a full-disk mirror, all clusters, not yet copied by background process.

For such clusters we simply ignore the unaligned write. We can even ignore the aligned write too: less disturbing the guest by delays.

2. Clusters, already copied by background process during this mirror job and not dirtied by guest since this time.

For such clusters we are safe to do unaligned write, as target cluster must be allocated.

3. Clusters, not marked initially by dirty bitmap.

What to do with them? We can't do unaligned write. I see two variants:

- do additional read from source, to fill the whole cluster, which seems a bit too heavy

- just mark the cluster as dirty for background job. So we behave like in "background" mode. But why not? The maximum count of such "hacks" is limited to number of "clear" clusters at start of mirror job, which means that we don't seriously affect the convergence. Mirror is guaranteed to converge anyway. And the whole sense of "write-blocking" mode is to have a guaranteed convergence. What do you think?


----

Of course, we can't distinguish the 3 types by one dirty bitmap, so we need a second one, for example a "done_bitmap", where we can mark clusters that were successfully copied. That would be a kind of block-status of the target image. But using a bitmap is a lot better than querying block-status from the target.
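
In pseudologic, handling the unaligned part of a guest write would then
be (names made up, `dirty` and `done` being the two bitmaps):

    def handle_unaligned_part(cluster, dirty, done):
        if dirty.get(cluster):
            # 1. still to be copied by the background process: skip,
            #    the background copy will cover it
            return 'skip'
        if done.get(cluster):
            # 2. already copied during this job: the target cluster is
            #    allocated, so an unaligned write is safe
            return 'write'
        # 3. clear from the start: can't write unaligned; either read up
        #    to cluster boundaries from the source, or re-dirty the
        #    cluster and leave it to the background job
        return 'mark_dirty'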


> 
>>> In this case, we are safe to do unaligned write, as target cluster
>>> must be allocated.
> 
> Because the diff image is new, the target's cluster is not necessarily
> allocated. When using write-blocking and a write of, e.g., 9 bytes to a
> clear source cluster comes in, only those 9 bytes are written to the
> target. Now the target's cluster is allocated but with only those 9
> bytes of data. When rebasing, the previously copied cluster is "masked"
> and when reading the rebased image, we only see the cluster with those 9
> bytes (and IIRC, zeroes for the rest of the cluster rather than the
> previously copied data).
> 
>>>
>>> (for bitmap-mode, I don't consider here clusters that are clear from
>>> the start, which we shouldn't copy in any case)
>>>
> 
> We do need to copy new writes to any cluster, and with a clear cluster
> and write-blocking, the issue can manifest.

OK right, I was misunderstanding bitmap-mode for mirror. Now I understand.

> 
>>
>> Hmm, right, and that's exactly the logic we already have in
>> do_sync_target_write(). So that's enough just to require that
>> bitmap_granularity >= target_granularity
>>
> 
> Best Regards,
> Fiona
>
Vladimir Sementsov-Ogievskiy Feb. 29, 2024, noon UTC | #7
On 29.02.24 13:41, Fiona Ebner wrote:
> On 28.02.24 at 17:24, Vladimir Sementsov-Ogievskiy wrote:
>> On 16.02.24 13:55, Fiona Ebner wrote:
>>> Previous discussion from when this was sent upstream [0] (it's been a
>>> while). I rebased the patches and re-ordered and squashed like
>>> suggested back then [1].
>>>
>>> This implements two new mirror modes:
>>>
>>> - bitmap mirror mode with always/on-success/never bitmap sync mode
>>> - incremental mirror mode as sugar for bitmap + on-success
>>>
>>> Use cases:
>>> * Possibility to resume a failed mirror later.
>>> * Possibility to only mirror deltas to a previously mirrored volume.
>>> * Possibility to (efficiently) mirror a drive that was previously
>>>     mirrored via some external mechanism (e.g. ZFS replication).
>>>
>>> We have been using the last one in production without any issues for
>>> about four years now. In particular, as mentioned in [2]:
>>>
>>>> - create bitmap(s)
>>>> - (incrementally) replicate storage volume(s) out of band (using ZFS)
>>>> - incrementally drive mirror as part of a live migration of VM
>>>> - drop bitmap(s)
>>
>> Actually, which mode do you use: "never", "always" or "conditional"? Or
>> do you have a different approach downstream?
>>
> 
> We are using "conditional", but I think we don't really require any
> specific mode, because we drop the bitmaps after mirroring (even in
> failure case). Fabian, please correct me if I'm wrong.
> 
>> Why am I asking:
>>
>> These modes (for backup) were developed prior to
>> block-dirty-bitmap-merge command, which allows copying bitmaps as you
>> want. With that API, we actually don't need all these modes, instead
>> it's enough to pass a bitmap, which would be _actually_ used by mirror.
>>
>> So, if you need "never" mode, you just copy your bitmap by
>> block-dirty-bitmap-add + block-dirty-bitmap-merge, and pass a copy to
>> mirror job.
>>
>> Or, you pass your bitmap to mirror-job, and have an "always" mode.
>>
>> And I don't see why we need a "conditional" mode, which actually just
>> drops away the progress we actually made. (OK, we failed, but why
>> drop the progress of successfully copied clusters?)
>>
> 
> I'm not sure actually. Maybe John remembers?

Ah, I understand. "Conditional" just makes sense if you don't support "partial success" and you want to delete the target image in case of failure, and create a new one to restart the incremental job.

But anyway, this all could be simply achieved with bitmap-copying/merging API, if we allow passing a user-given bitmap to the mirror as the working bitmap.

> 
> I see, I'll drop the 'bitmap-mode' in the next version if nobody
> complains :)
> 

Good. It's a golden rule: never make public interfaces which you don't actually need for production. I myself sometimes violate it and spend extra time on developing features, which we later have to just drop as "not needed downstream, no sense in upstreaming".

>>
>> Using a user-given bitmap in the mirror job also has the additional
>> advantage of live progress: up to visualizing the copying of the disk
>> via the dirty bitmap contents.
>>
> 
> Best Regards,
> Fiona
>
Fiona Ebner Feb. 29, 2024, 12:47 p.m. UTC | #8
On 29.02.24 at 12:48, Vladimir Sementsov-Ogievskiy wrote:
> On 29.02.24 13:11, Fiona Ebner wrote:
>>
>> The iotest creates a new target image for each incremental sync which
>> only records the diff relative to the previous mirror and those diff
>> images are later rebased onto each other to get the full picture.
>>
>> Thus, it can be that a previous mirror job (not just background process
>> or previous write) already copied a cluster, and in particular, copied
>> it to a different target!
> 
> Aha, I understand.
> 
> For simplicity, let's consider the case where source "cluster size" =
> "job cluster size" = "bitmap granularity" = "target cluster size".
>
> Which types of clusters should we consider when we want to handle a
> guest write?
> 
> 1. Clusters that should be copied by the background process
> 
> These are dirty clusters from user-given bitmap, or if we do a full-disk
> mirror, all clusters, not yet copied by background process.
> 
> For such clusters we simply ignore the unaligned write. We can even
> ignore the aligned write too: less disturbing the guest by delays.
> 

Since do_sync_target_write() currently doesn't ignore aligned writes, I
wouldn't change it. Of course they can count towards the "done_bitmap"
you propose below.

> 2. Clusters, already copied by background process during this mirror job
> and not dirtied by guest since this time.
> 
> For such clusters we are safe to do unaligned write, as target cluster
> must be allocated.
> 

Right.

> 3. Clusters, not marked initially by dirty bitmap.
> 
> What to do with them? We can't do unaligned write. I see two variants:
> 
> - do additional read from source, to fill the whole cluster, which seems
> a bit too heavy
> 

Yes, I'd rather only do that as a last resort.

> - just mark the cluster as dirty for background job. So we behave like
> in "background" mode. But why not? The maximum count of such "hacks" is
> limited to number of "clear" clusters at start of mirror job, which
> means that we don't seriously affect the convergence. Mirror is
> guaranteed to converge anyway. And the whole sense of "write-blocking"
> mode is to have a guaranteed convergence. What do you think?
> 

It could lead to a lot of flips between job->actively_synced == true and
== false. AFAIU, currently, we only switch back from true to false when
an error happens. While I don't see a concrete issue with it, at least
it might be unexpected to users, so it better be documented.

I'll try going with this approach, thanks!

> 
> ----
> 
> Of course, we can't distinguish the 3 types by one dirty bitmap, so we
> need a second one, for example a "done_bitmap", where we can mark
> clusters that were successfully copied. That would be a kind of
> block-status of the target image. But using a bitmap is a lot better
> than querying block-status from the target.

Best Regards,
Fiona
Fiona Ebner Feb. 29, 2024, 2:50 p.m. UTC | #9
On 29.02.24 at 13:00, Vladimir Sementsov-Ogievskiy wrote:
> 
> But anyway, this all could be simply achieved with
> bitmap-copying/merging API, if we allow passing a user-given bitmap to
> the mirror as the working bitmap.
> 
>>
>> I see, I'll drop the 'bitmap-mode' in the next version if nobody
>> complains :)
>>
> 
> Good. It's a golden rule: never make public interfaces which you don't
> actually need for production. I myself sometimes violate it and spend
> extra time on developing features, which we later have to just drop as
> "not needed downstream, no sense in upstreaming".
> 

Just wondering which new mode I should allow for the @MirrorSyncMode
then? The documentation states:

> # @incremental: only copy data described by the dirty bitmap.
> #     (since: 2.4)
> #
> # @bitmap: only copy data described by the dirty bitmap.  (since: 4.2)
> #     Behavior on completion is determined by the BitmapSyncMode.

For backup, do_backup_common() just maps @incremental to @bitmap +
@bitmap-mode == @on-success.

Using @bitmap for mirror would lead to being at odds with the
documentation, because it mentions the BitmapSyncMode, which mirror
won't have.

Using @incremental for mirror would be consistent with the
documentation, but behave a bit differently from backup.

Opinions?

Best Regards,
Fiona
Vladimir Sementsov-Ogievskiy March 1, 2024, 2:14 p.m. UTC | #10
On 29.02.24 17:50, Fiona Ebner wrote:
> On 29.02.24 at 13:00, Vladimir Sementsov-Ogievskiy wrote:
>>
>> But anyway, this all could be simply achieved with
>> bitmap-copying/merging API, if we allow passing a user-given bitmap to
>> the mirror as the working bitmap.
>>
>>>
>>> I see, I'll drop the 'bitmap-mode' in the next version if nobody
>>> complains :)
>>>
>>
>> Good. It's a golden rule: never make public interfaces which you don't
>> actually need for production. I myself sometimes violate it and spend
>> extra time on developing features, which we later have to just drop as
>> "not needed downstream, no sense in upstreaming".
>>
> 
> Just wondering which new mode I should allow for the @MirrorSyncMode
> then? The documentation states:
> 
>> # @incremental: only copy data described by the dirty bitmap.
>> #     (since: 2.4)
>> #
>> # @bitmap: only copy data described by the dirty bitmap.  (since: 4.2)
>> #     Behavior on completion is determined by the BitmapSyncMode.
> 
> For backup, do_backup_common() just maps @incremental to @bitmap +
> @bitmap-mode == @on-success.
> 
> Using @bitmap for mirror would lead to being at odds with the
> documentation, because it mentions the BitmapSyncMode, which mirror
> won't have.
> 
> Using @incremental for mirror would be consistent with the
> documentation, but behave a bit differently from backup.
> 
> Opinions?
> 

Good question.

As we already understood, (block-)job-api needs some spring-cleaning. Unfortunately I don't have much time on it, but still I decided to start from finally deprecating block-job-* API and moving to job-*.. Probably bitmap/bitmap-mode/sync APIs also need some optimization, keeping in mind new block-dirty-bitmap-merge api.

So, what I could advise in this situation for new interfaces:

1. be minimalistic
2. add `x-` prefix when unsure

So, following these two rules, what about x-bitmap field, which may be combined only with 'full' mode, and do what you need?

About documentation: actually, I never liked that we use for backup job "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two values supported only by backup.

I'm also unsure how mode=full&bitmap=some_bitmap differs from mode=bitmap&bitmap=some_bitmap..

So, I'd suggest simply renaming MirrorSyncMode to BackupSyncMode, and adding a separate MirrorSyncMode with only "full", "top" and "none" values.
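
In qapi/block-core.json that might end up roughly as follows (just a
sketch, not something this series implements):

    # the current MirrorSyncMode, renamed (sketch)
    { 'enum': 'BackupSyncMode',
      'data': ['top', 'full', 'none', 'incremental', 'bitmap'] }

    # reduced to the values mirror actually supports (sketch)
    { 'enum': 'MirrorSyncMode',
      'data': ['top', 'full', 'none'] }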
Fiona Ebner March 1, 2024, 2:52 p.m. UTC | #11
On 01.03.24 at 15:14, Vladimir Sementsov-Ogievskiy wrote:
> 
> As we already understood, (block-)job-api needs some spring-cleaning.
> Unfortunately I don't have much time on it, but still I decided to start
> from finally deprecating block-job-* API and moving to job-*.. Probably
> bitmap/bitmap-mode/sync APIs also need some optimization, keeping in
> mind new block-dirty-bitmap-merge api.
> 
> So, what I could advise in this situation for new interfaces:
> 
> 1. be minimalistic
> 2. add `x-` prefix when unsure
> 
> So, following these two rules, what about x-bitmap field, which may be
> combined only with 'full' mode, and do what you need?
> 

AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it
doesn't need to be renamed if it becomes stable later.

> About documentation: actually, I never liked that we use for backup job
> "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two
> values supported only by backup.
> 
> I'm also unsure how mode=full&bitmap=some_bitmap differs from
> mode=bitmap&bitmap=some_bitmap..
> 

With the current patches, it was an error to specify @bitmap for other
modes than 'incremental' and 'bitmap'.

> So, I'd suggest simply renaming MirrorSyncMode to BackupSyncMode, and
> adding a separate MirrorSyncMode with only "full", "top" and "none" values.
> 

Sounds good to me!

[0]:
https://gitlab.com/qemu-project/qemu/-/commit/a3c45b3e62962f99338716b1347cfb0d427cea44

Best Regards,
Fiona
Vladimir Sementsov-Ogievskiy March 1, 2024, 3:02 p.m. UTC | #12
On 01.03.24 17:52, Fiona Ebner wrote:
> On 01.03.24 at 15:14, Vladimir Sementsov-Ogievskiy wrote:
>>
>> As we already understood, (block-)job-api needs some spring-cleaning.
>> Unfortunately I don't have much time on it, but still I decided to start
>> from finally deprecating block-job-* API and moving to job-*.. Probably
>> bitmap/bitmap-mode/sync APIs also need some optimization, keeping in
>> mind new block-dirty-bitmap-merge api.
>>
>> So, what I could advise in this situation for new interfaces:
>>
>> 1. be minimalistic
>> 2. add `x-` prefix when unsure
>>
>> So, following these two rules, what about x-bitmap field, which may be
>> combined only with 'full' mode, and do what you need?
>>
> 
> AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it
> doesn't need to be renamed if it becomes stable later.

Right, the unstable feature is needed; using "x-" is optional.

Recent discussion about it was in my "vhost-user-blk: live resize additional APIs" series:

https://patchew.org/QEMU/20231006202045.1161543-1-vsementsov@yandex-team.ru/20231006202045.1161543-5-vsementsov@yandex-team.ru/

Following it, I think it's OK to not care anymore about "x-" prefixes, and rely on the unstable feature.

> 
>> About documentation: actually, I never liked that we use for backup job
>> "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two
>> values supported only by backup.
>>
>> I'm also unsure how mode=full&bitmap=some_bitmap differs from
>> mode=bitmap&bitmap=some_bitmap..
>>
> 
> With the current patches, it was an error to specify @bitmap for other
> modes than 'incremental' and 'bitmap'.

Current documentation says:
   # @bitmap: The name of a dirty bitmap to use.  Must be present if sync
   #     is "bitmap" or "incremental". Can be present if sync is "full"
   #     or "top".  Must not be present otherwise.
   #     (Since 2.4 (drive-backup), 3.1 (blockdev-backup))


> 
>> So, I'd suggest simply renaming MirrorSyncMode to BackupSyncMode, and
>> adding a separate MirrorSyncMode with only "full", "top" and "none" values.
>>
> 
> Sounds good to me!
> 
> [0]:
> https://gitlab.com/qemu-project/qemu/-/commit/a3c45b3e62962f99338716b1347cfb0d427cea44
> 
> Best Regards,
> Fiona
>
Fiona Ebner March 1, 2024, 3:14 p.m. UTC | #13
On 01.03.24 at 16:02, Vladimir Sementsov-Ogievskiy wrote:
> On 01.03.24 17:52, Fiona Ebner wrote:
>> On 01.03.24 at 15:14, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>> As we already understood, (block-)job-api needs some spring-cleaning.
>>> Unfortunately I don't have much time on it, but still I decided to start
>>> from finally deprecating block-job-* API and moving to job-*.. Probably
>>> bitmap/bitmap-mode/sync APIs also need some optimization, keeping in
>>> mind new block-dirty-bitmap-merge api.
>>>
>>> So, what I could advise in this situation for new interfaces:
>>>
>>> 1. be minimalistic
>>> 2. add `x-` prefix when unsure
>>>
>>> So, following these two rules, what about x-bitmap field, which may be
>>> combined only with 'full' mode, and do what you need?
>>>
>>
>> AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it
>> doesn't need to be renamed if it becomes stable later.
> 
> Right, the unstable feature is needed; using "x-" is optional.
> 
> Recent discussion about it was in my "vhost-user-blk: live resize
> additional APIs" series:
> 
> https://patchew.org/QEMU/20231006202045.1161543-1-vsementsov@yandex-team.ru/20231006202045.1161543-5-vsementsov@yandex-team.ru/
> 
> Following it, I think it's OK to not care anymore about "x-" prefixes,
> and rely on the unstable feature.
> 

Thanks for the confirmation! I'll go without the prefix in the name then.

>>
>>> About documentation: actually, I never liked that we use for backup job
>>> "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two
>>> values supported only by backup.
>>>
>>> I'm also unsure how mode=full&bitmap=some_bitmap differs from
>>> mode=bitmap&bitmap=some_bitmap..
>>>
>>
>> With the current patches, it was an error to specify @bitmap for other
>> modes than 'incremental' and 'bitmap'.
> 
> Current documentation says:
>   # @bitmap: The name of a dirty bitmap to use.  Must be present if sync
>   #     is "bitmap" or "incremental". Can be present if sync is "full"
>   #     or "top".  Must not be present otherwise.
>   #     (Since 2.4 (drive-backup), 3.1 (blockdev-backup))
> 
> 

This is for backup. The documentation (and behavior) for @bitmap added
by these patches for mirror is different ;)

Best Regards,
Fiona
Vladimir Sementsov-Ogievskiy March 1, 2024, 3:46 p.m. UTC | #14
On 01.03.24 18:14, Fiona Ebner wrote:
> On 01.03.24 at 16:02, Vladimir Sementsov-Ogievskiy wrote:
>> On 01.03.24 17:52, Fiona Ebner wrote:
>>> On 01.03.24 at 15:14, Vladimir Sementsov-Ogievskiy wrote:
>>>>
>>>> As we already understood, (block-)job-api needs some spring-cleaning.
>>>> Unfortunately I don't have much time on it, but still I decided to start
>>>> from finally deprecating block-job-* API and moving to job-*.. Probably
>>>> bitmap/bitmap-mode/sync APIs also need some optimization, keeping in
>>>> mind new block-dirty-bitmap-merge api.
>>>>
>>>> So, what I could advise in this situation for new interfaces:
>>>>
>>>> 1. be minimalistic
>>>> 2. add `x-` prefix when unsure
>>>>
>>>> So, following these two rules, what about x-bitmap field, which may be
>>>> combined only with 'full' mode, and do what you need?
>>>>
>>>
>>> AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it
>>> doesn't need to be renamed if it becomes stable later.
>>
>> Right, the unstable feature is needed; using "x-" is optional.
>>
>> Recent discussion about it was in my "vhost-user-blk: live resize
>> additional APIs" series:
>>
>> https://patchew.org/QEMU/20231006202045.1161543-1-vsementsov@yandex-team.ru/20231006202045.1161543-5-vsementsov@yandex-team.ru/
>>
>> Following it, I think it's OK to not care anymore about "x-" prefixes,
>> and rely on the unstable feature.
>>
> 
> Thanks for the confirmation! I'll go without the prefix in the name then.
> 
>>>
>>>> About documentation: actually, I never liked that we use for backup job
>>>> "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two
>>>> values supported only by backup.
>>>>
>>>> I'm also unsure how mode=full&bitmap=some_bitmap differs from
>>>> mode=bitmap&bitmap=some_bitmap..
>>>>
>>>
>>> With the current patches, it was an error to specify @bitmap for other
>>> modes than 'incremental' and 'bitmap'.
>>
>> Current documentation says:
>>    # @bitmap: The name of a dirty bitmap to use.  Must be present if sync
>>    #     is "bitmap" or "incremental". Can be present if sync is "full"
>>    #     or "top".  Must not be present otherwise.
>>    #     (Since 2.4 (drive-backup), 3.1 (blockdev-backup))
>>
>>
> 
> This is for backup. The documentation (and behavior) for @bitmap added
> by these patches for mirror is different ;)

I meant backup in "I'm also unsure", just as one more point to not consider the backup bitmap API as a prototype for the mirror bitmap API.

> 
> Best Regards,
> Fiona
>
Fiona Ebner March 4, 2024, 7:50 a.m. UTC | #15
On 01.03.24 at 16:46, Vladimir Sementsov-Ogievskiy wrote:
> On 01.03.24 18:14, Fiona Ebner wrote:
>> On 01.03.24 at 16:02, Vladimir Sementsov-Ogievskiy wrote:
>>>>> About documentation: actually, I never liked that we use for backup
>>>>> job
>>>>> "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two
>>>>> values supported only by backup.
>>>>>
>>>>> I'm also unsure how mode=full&bitmap=some_bitmap differs from
>>>>> mode=bitmap&bitmap=some_bitmap..
>>>>>
>>>>
>>>> With the current patches, it was an error to specify @bitmap for other
>>>> modes than 'incremental' and 'bitmap'.
>>>
>>> Current documentation says:
>>>    # @bitmap: The name of a dirty bitmap to use.  Must be present if
>>> sync
>>>    #     is "bitmap" or "incremental". Can be present if sync is "full"
>>>    #     or "top".  Must not be present otherwise.
>>>    #     (Since 2.4 (drive-backup), 3.1 (blockdev-backup))
>>>
>>>
>>
>> This is for backup. The documentation (and behavior) for @bitmap added
>> by these patches for mirror is different ;)
> 
> I meant backup in "I'm also unsure", just as one more point to not
> consider the backup bitmap API as a prototype for the mirror bitmap API.
> 

Oh, I see. Sorry for the confusion!

Best Regards,
Fiona
Fabian Grünbichler March 4, 2024, 9:02 a.m. UTC | #16
On February 29, 2024 11:41 am, Fiona Ebner wrote:
> On 28.02.24 at 17:24, Vladimir Sementsov-Ogievskiy wrote:
>> On 16.02.24 13:55, Fiona Ebner wrote:
>>> Previous discussion from when this was sent upstream [0] (it's been a
>>> while). I rebased the patches and re-ordered and squashed like
>>> suggested back then [1].
>>>
>>> This implements two new mirror modes:
>>>
>>> - bitmap mirror mode with always/on-success/never bitmap sync mode
>>> - incremental mirror mode as sugar for bitmap + on-success
>>>
>>> Use cases:
>>> * Possibility to resume a failed mirror later.
>>> * Possibility to only mirror deltas to a previously mirrored volume.
>>> * Possibility to (efficiently) mirror a drive that was previously
>>>    mirrored via some external mechanism (e.g. ZFS replication).
>>>
>>> We have been using the last one in production without any issues for
>>> about four years now. In particular, as mentioned in [2]:
>>>
>>>> - create bitmap(s)
>>>> - (incrementally) replicate storage volume(s) out of band (using ZFS)
>>>> - incrementally drive mirror as part of a live migration of VM
>>>> - drop bitmap(s)
>> 
>> Actually, which mode do you use: "never", "always" or "conditional"? Or
>> do you have a different approach downstream?
>> 
> 
> We are using "conditional", but I think we don't really require any
> specific mode, because we drop the bitmaps after mirroring (even in
> failure case). Fabian, please correct me if I'm wrong.

indeed, we don't really care for our current use case (and don't have
any other planned either, AFAIK); the bitmap is used only for the
duration of a single mirror, and always discarded at the end.

>> Why am I asking:
>> 
>> These modes (for backup) were developed prior to
>> block-dirty-bitmap-merge command, which allows copying bitmaps as you
>> want. With that API, we actually don't need all these modes, instead
>> it's enough to pass a bitmap, which would be _actually_ used by mirror.
>> 
>> So, if you need "never" mode, you just copy your bitmap by
>> block-dirty-bitmap-add + block-dirty-bitmap-merge, and pass a copy to
>> mirror job.
>> 
>> Or, you pass your bitmap to mirror-job, and have an "always" mode.
>> 
>> And I don't see why we need a "conditional" mode, which actually just
>> drops away the progress we actually made. (OK, we failed, but why
>> drop the progress of successfully copied clusters?)
>> 
> 
> I'm not sure actually. Maybe John remembers?
> 
> I see, I'll drop the 'bitmap-mode' in the next version if nobody
> complains :)

it was probably just done to mimic the backup interface; if that is not
desired, dropping it is probably a good idea.
Fiona Ebner March 6, 2024, 1:44 p.m. UTC | #17
On 29.02.24 at 13:47, Fiona Ebner wrote:
> On 29.02.24 at 12:48, Vladimir Sementsov-Ogievskiy wrote:
>> On 29.02.24 13:11, Fiona Ebner wrote:
>>>
>>> The iotest creates a new target image for each incremental sync which
>>> only records the diff relative to the previous mirror and those diff
>>> images are later rebased onto each other to get the full picture.
>>>
>>> Thus, it can be that a previous mirror job (not just background process
>>> or previous write) already copied a cluster, and in particular, copied
>>> it to a different target!
>>
>> Aha, I understand.
>>
>> For simplicity, let's consider the case where source "cluster size" =
>> "job cluster size" = "bitmap granularity" = "target cluster size".
>>
>> Which types of clusters should we consider when we want to handle a
>> guest write?
>>
>> 1. Clusters that should be copied by the background process
>>
>> These are dirty clusters from user-given bitmap, or if we do a full-disk
>> mirror, all clusters, not yet copied by background process.
>>
>> For such clusters we simply ignore the unaligned write. We can even
>> ignore the aligned write too: less disturbing the guest by delays.
>>
> 
> Since do_sync_target_write() currently doesn't ignore aligned writes, I
> wouldn't change it. Of course they can count towards the "done_bitmap"
> you propose below.
> 
>> 2. Clusters, already copied by background process during this mirror job
>> and not dirtied by guest since this time.
>>
>> For such clusters we are safe to do unaligned write, as target cluster
>> must be allocated.
>>
> 
> Right.
> 
>> 3. Clusters, not marked initially by dirty bitmap.
>>
>> What to do with them? We can't do unaligned write. I see two variants:
>>
>> - do additional read from source, to fill the whole cluster, which seems
>> a bit too heavy
>>
> 
> Yes, I'd rather only do that as a last resort.
> 
>> - just mark the cluster as dirty for background job. So we behave like
>> in "background" mode. But why not? The maximum count of such "hacks" is
>> limited to number of "clear" clusters at start of mirror job, which
>> means that we don't seriously affect the convergence. Mirror is
>> guaranteed to converge anyway. And the whole sense of "write-blocking"
>> mode is to have a guaranteed convergence. What do you think?
>>
> 
> It could lead to a lot of flips between job->actively_synced == true and
> == false. AFAIU, currently, we only switch back from true to false when
> an error happens. While I don't see a concrete issue with it, at least
> it might be unexpected to users, so it better be documented.
> 
> I'll try going with this approach, thanks!
> 

These flips are actually a problem. When using live migration with disk
mirroring, it's good that an actively synced image stays actively
synced. Otherwise, migration could finish at an inconvenient time and
try to inactivate the block device while the mirror still has something
to do, which would lead to an assertion failure [0].

The IO test added by this series is what uses the possibility to sync to
"diff images" which contain only the delta. In production, we are only
syncing to a previously mirrored target image. Non-aligned writes are
not an issue there, unlike with a diff image. (Even if the initial
mirroring happened via ZFS replication outside of QEMU).

So copy-mode=write-blocking would work fine for our use case, but if I
go with the "mark clusters for unaligned writes dirty"-approach, it
would not work fine anymore.

Should I rather just document the limitation for the combination "target
is a diff image" and copy-mode=write-blocking?

I'd still add the check for the granularity and target cluster size.
While also only needed for diff images, it would allow using background
mode safely for those.

Best Regards,
Fiona

[0]:
https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060c3f@proxmox.com/
Vladimir Sementsov-Ogievskiy March 7, 2024, 9:20 a.m. UTC | #18
On 06.03.24 16:44, Fiona Ebner wrote:
> On 29.02.24 at 13:47, Fiona Ebner wrote:
>> On 29.02.24 at 12:48, Vladimir Sementsov-Ogievskiy wrote:
>>> On 29.02.24 13:11, Fiona Ebner wrote:
>>>>
>>>> The iotest creates a new target image for each incremental sync which
>>>> only records the diff relative to the previous mirror and those diff
>>>> images are later rebased onto each other to get the full picture.
>>>>
>>>> Thus, it can be that a previous mirror job (not just background process
>>>> or previous write) already copied a cluster, and in particular, copied
>>>> it to a different target!
>>>
>>> Aha, I understand.
>>>
>>> For simplicity, let's consider the case where source "cluster size" =
>>> "job cluster size" = "bitmap granularity" = "target cluster size".
>>>
>>> Which types of clusters should we consider when we want to handle a
>>> guest write?
>>>
>>> 1. Clusters that should be copied by the background process
>>>
>>> These are dirty clusters from user-given bitmap, or if we do a full-disk
>>> mirror, all clusters, not yet copied by background process.
>>>
>>> For such clusters we simply ignore the unaligned write. We can even
>>> ignore the aligned write too: less disturbing the guest by delays.
>>>
>>
>> Since do_sync_target_write() currently doesn't ignore aligned writes, I
>> wouldn't change it. Of course they can count towards the "done_bitmap"
>> you propose below.
>>
>>> 2. Clusters, already copied by background process during this mirror job
>>> and not dirtied by guest since this time.
>>>
>>> For such clusters we are safe to do unaligned write, as target cluster
>>> must be allocated.
>>>
>>
>> Right.
>>
>>> 3. Clusters, not marked initially by dirty bitmap.
>>>
>>> What to do with them? We can't do unaligned write. I see two variants:
>>>
>>> - do additional read from source, to fill the whole cluster, which seems
>>> a bit too heavy
>>>
>>
>> Yes, I'd rather only do that as a last resort.
>>
>>> - just mark the cluster as dirty for background job. So we behave like
>>> in "background" mode. But why not? The maximum count of such "hacks" is
>>> limited to number of "clear" clusters at start of mirror job, which
>>> means that we don't seriously affect the convergence. Mirror is
>>> guaranteed to converge anyway. And the whole sense of "write-blocking"
>>> mode is to have a guaranteed convergence. What do you think?
>>>
>>
>> It could lead to a lot of flips between job->actively_synced == true and
>> == false. AFAIU, currently, we only switch back from true to false when
>> an error happens. While I don't see a concrete issue with it, at least
>> it might be unexpected to users, so it better be documented.
>>
>> I'll try going with this approach, thanks!
>>
> 
> These flips are actually a problem. When using live migration with disk
> mirroring, it's good that an actively synced image stays actively
> synced. Otherwise, migration could finish at an inconvenient time and
> try to inactivate the block device while the mirror still has something
> to do, which would lead to an assertion failure [0].

Hmm, right. So, when the mirror is actively synced, we have to read the whole cluster from the source to make an aligned write on the target.

> 
> The IO test added by this series is what uses the possibility to sync to
> "diff images" which contain only the delta. In production, we are only
> syncing to a previously mirrored target image. Non-aligned writes are
> not an issue there, unlike with a diff image. (Even if the initial
> mirroring happened via ZFS replication outside of QEMU).
> 
> So copy-mode=write-blocking would work fine for our use case, but if I
> go with the "mark clusters for unaligned writes dirty"-approach, it
> would not work fine anymore.
> 
> Should I rather just document the limitation for the combination "target
> is a diff image" and copy-mode=write-blocking?

Of course, simply documenting the limitation is better than implementing a new feature, if you don't need the feature for production)

> 
> I'd still add the check for the granularity and target cluster size.

The check is good too.

> While also only needed for diff images, it would allow using background
> mode safely for those.
> 
> Best Regards,
> Fiona
> 
> [0]:
> https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060c3f@proxmox.com/
>