diff mbox series

[v3,09/19] block: introduce FleecingState class

Message ID 20211222174018.257550-10-vsementsov@virtuozzo.com (mailing list archive)
State New, archived
Headers show
Series Make image fleecing more usable | expand

Commit Message

Vladimir Sementsov-Ogievskiy Dec. 22, 2021, 5:40 p.m. UTC
FleecingState represents state shared between copy-before-write filter
and upcoming fleecing block driver.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/fleecing.h  | 135 ++++++++++++++++++++++++++++++++++
 block/fleecing.c  | 182 ++++++++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS       |   2 +
 block/meson.build |   1 +
 4 files changed, 320 insertions(+)
 create mode 100644 block/fleecing.h
 create mode 100644 block/fleecing.c

Comments

Hanna Czenczek Jan. 18, 2022, 4:37 p.m. UTC | #1
On 22.12.21 18:40, Vladimir Sementsov-Ogievskiy wrote:
> FleecingState represents state shared between copy-before-write filter
> and upcoming fleecing block driver.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/fleecing.h  | 135 ++++++++++++++++++++++++++++++++++
>   block/fleecing.c  | 182 ++++++++++++++++++++++++++++++++++++++++++++++
>   MAINTAINERS       |   2 +
>   block/meson.build |   1 +
>   4 files changed, 320 insertions(+)
>   create mode 100644 block/fleecing.h
>   create mode 100644 block/fleecing.c
>
> diff --git a/block/fleecing.h b/block/fleecing.h
> new file mode 100644
> index 0000000000..fb7b2f86c4
> --- /dev/null
> +++ b/block/fleecing.h
> @@ -0,0 +1,135 @@
> +/*
> + * FleecingState
> + *
> + * The common state of image fleecing, shared between copy-before-write filter
> + * and fleecing block driver.

 From this documentation, it’s unclear to me who owns the FleecingState 
object.  I would have assumed it’s the fleecing node, and if it is, I 
wonder why we even have this external interface instead of considering 
FleecingState a helper object for the fleecing block driver (or rather 
the block driver’s opaque state, which it basically is, as far as I can 
see from peeking into the next patch), and putting both into a single 
file with no external interface except for 
fleecing_mark_done_and_wait_readers().

> + *
> + * Copyright (c) 2021 Virtuozzo International GmbH.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + *
> + *
> + * Fleecing scheme looks as follows:
> + *
> + * [guest blk]                   [nbd export]
> + *    |                              |
> + *    |root                          |
> + *    v                              v
> + * [copy-before-write]--target-->[fleecing drv]
> + *    |                          /   |
> + *    |file                     /    |file
> + *    v                        /     v
> + * [active disk]<--source-----/  [temp disk]
> + *
> + * Note that "active disk" is also called just "source" and "temp disk" is also
> + * called "target".
> + *
> + * What happens here:
> + *
> + * copy-before-write filter performs copy-before-write operations: on guest
> + * write we should copy old data to target child before rewriting. Note that we
> + * write this data through fleecing driver: it saves a possibility to implement
> + * a kind of cache in fleecing driver in future.

I don’t understand why this explanation is the first one given (and the 
only one given explicitly as a reason) for why we want a fleecing block 
driver.

(1) If we implement caching later, I have a feeling that we’ll want new 
options for this.  So a management layer that wants caching will need to 
be updated at that point anyway (to specify these new options), so I 
don’t understand how adding a fleecing block driver now would make it 
easier later on to introduce caching.

(1b) It’s actually entirely possible that we will not want to use the 
fleecing driver for caching, because we decide that caching is much more 
useful as its own dedicated block driver.

(2) There are much better arguments below.  This FleecingState you 
introduce here makes it clear why we need a fleecing block driver; it 
helps with synchronization, and it provides the “I’m done with this bit, 
I don’t care about it anymore” discard interface.

> + *
> + * Fleecing user is nbd export: it can read from fleecing node, which guarantees
> + * a snapshot-view for fleecing user. Fleecing user may also do discard
> + * operations.
> + *
> + * FleecingState is responsible for most of the fleecing logic:
> + *
> + * 1. Fleecing read. Handle reads of fleecing user: we should decide where from
> + * to read, from source node or from copy-before-write target node. In former
> + * case we need to synchronize with guest writes. See fleecing_read_lock() and
> + * fleecing_read_unlock() functionality.
> + *
> + * 2. Guest write synchronization (part of [1] actually). See
> + * fleecing_mark_done_and_wait_readers()
> + *
> + * 3. Fleecing discard. Used by fleecing user when corresponding area is already
> + * copied. Fleecing user may discard the area which is not needed anymore, that
> + * should result in:
> + *   - discarding data to free disk space
> + *   - clear bits in copy-bitmap of block-copy, to avoid extra copy-before-write
> + *     operations
> + *   - clear bits in access-bitmap of FleecingState, to avoid further wrong
> + *     access
> + *
> + * Still, FleecingState doesn't own any block children, so all real io
> + * operations (reads, writes and discards) are done by copy-before-write filter
> + * and fleecing block driver.

I find this a bit confusing, because for me, it raised the question of 
“why would it own block childen?”, which led to me wanting to know even 
more where the place of FleecingState is.  This sentence makes it really 
sound as if FleecingState is its own independent object floating around 
somewhere, not owned by anything, and that feels very wrong.

(If FleecingState were owned by the fleecing driver, i.e. if it 
basically were just the fleecing driver’s opaque data itself, then the 
question of what the FleecingState is, and whether it could own block 
children wouldn’t even come up.)

> + */
> +
> +#ifndef FLEECING_H
> +#define FLEECING_H
> +
> +#include "block/block_int.h"
> +#include "block/block-copy.h"
> +#include "block/reqlist.h"
> +
> +typedef struct FleecingState FleecingState;
> +
> +/*
> + * Create FleecingState.
> + *
> + * @bcs: link to block-copy owned by copy-before-write filter.

s/block-copy/block-copy state/

> + *
> + * @fleecing_node: should be fleecing block driver node. Used to create some

I think the “should be” should be dropped.  This must be a fleecing 
block driver, right?

(But then again, I really don’t understand why the FleecingState is 
separate from BDRVFleecingState in the first place.)

> + * bitmaps in it.
> + */
> +FleecingState *fleecing_new(BlockCopyState *bcs,
> +                            BlockDriverState *fleecing_node,
> +                            Error **errp);
> +
> +/* Free the state. Doesn't free block-copy state (@bcs) */
> +void fleecing_free(FleecingState *s);
> +
> +/*
> + * Convenient function for thous who want to do fleecing read.

s/thous/those/

I kind of miss a quick summary here of what this function is for, i.e. to
(1) find out where to read the data from, and
(2) if it’s to be read from the source, to block the affected area from 
writes until the read is done.

But I don’t know how to phrase that concisely, so I’m also OK with not 
having such a summary.

> + *
> + * If requested region starts in "done" area, i.e. data is already copied to
> + * copy-before-write target node, req is set to NULL, pnum is set to available
> + * bytes to read from target. User is free to read @pnum bytes from target.
> + * Still, user is responsible for concurrent discards on target.
> + *
> + * If requests region starts in "not done" area, i.e. we have to read from

s/requests/requested/

> + * source node directly, than @pnum bytes of source node are frozen and

s/than/then/

> + * guaranteed not be rewritten until user calls cbw_snapshot_read_unlock().

s/guaranteed not be rewritten/guaranteed not to be rewritten/

(or perhaps also s/rewritten/modified/, but it’s probably just me to 
whom “rewritten” sounds a bit like “the same data is written again”)

> + *
> + * Returns 0 on success and -EACCES when try to read non-dirty area of
> + * access_bitmap.
> + */

This description doesn’t sufficiently describe the @req parameter. It 
only says that `*req == NULL` will be returned if the data is to be read 
from the target node, but other than that it doesn’t say whether *req is 
a pure “out” or “inout” parameter.  It doesn’t say whether the caller 
has to pre-fill it (and fleecing_read_lock() will set it to NULL if the 
caller should read from the target), or whether fleecing_read_lock() 
will always set it (depending on whether to read from the source 
(non-NULL) or the target (NULL)).

Since it’s the latter (fleecing_read_lock() will allocate a BlockReq and 
return it), we also need to explain what we expect the user to do with 
this, namely absolutely nothing except pass it again to 
fleecing_read_unlock().

> +int fleecing_read_lock(FleecingState *f, int64_t offset,
> +                       int64_t bytes, const BlockReq **req, int64_t *pnum);
> +/* Called as closing pair for fleecing_read_lock() */

It isn’t quite clear from this summary whether this function should also 
be called if fleecing_read_lock() returned success, but *req == NULL.  
It shouldn’t, but given this description, I’d do it.

> +void fleecing_read_unlock(FleecingState *f, const BlockReq *req);
> +
> +/*
> + * Called when fleecing user doesn't need the region anymore (for example the
> + * region is successfully read and backed up somewhere).
> + * This prevents extra copy-before-write operations in this area in future.
> + * Next fleecing read from this area will fail with -EACCES.
> + */
> +void fleecing_discard(FleecingState *f, int64_t offset, int64_t bytes);
> +
> +/*
> + * Called by copy-before-write filter after successful copy-before-write
> + * operation to synchronize with parallel fleecing reads.
> + */
> +void fleecing_mark_done_and_wait_readers(FleecingState *f, int64_t offset,
> +                                         int64_t bytes);
> +
> +#endif /* FLEECING_H */
> diff --git a/block/fleecing.c b/block/fleecing.c
> new file mode 100644
> index 0000000000..f75d11b892
> --- /dev/null
> +++ b/block/fleecing.c
> @@ -0,0 +1,182 @@
> +/*
> + * FleecingState
> + *
> + * The common state of image fleecing, shared between copy-before-write filter
> + * and fleecing block driver.
> + *
> + * Copyright (c) 2021 Virtuozzo International GmbH.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "sysemu/block-backend.h"
> +#include "qemu/cutils.h"
> +#include "qapi/error.h"
> +#include "block/block_int.h"
> +#include "block/coroutines.h"
> +#include "block/qdict.h"
> +#include "block/block-copy.h"
> +#include "block/reqlist.h"
> +
> +#include "block/fleecing.h"
> +
> +/*
> + * @bcs: link to block-copy state owned by copy-before-write filter which
> + * performs copy-before-write operations in context of fleecing scheme.
> + * FleecingState doesn't own the block-copy state and don't free it on cleanup.

s/don't/doesn't/

> + *
> + * @lock: protects access to @access_bitmap, @done_bitmap and @frozen_read_reqs
> + *
> + * @access_bitmap: represents areas allowed for reading by fleecing user.
> + * Reading from non-dirty areas leads to -EACCES. Discard operation among other

Since this is not really abort dirty or not, I’d prefer “clear 
(non-dirty)” instead of just “non-dirty”.

> + * things clears corresponding bits in this bitmaps.

It isn’t quite clear whether (A) the discard operation does various 
things, and one of them is to reset the corresponding area in the 
access_bitmap; or (B) the discard operation is only one of various ways 
to reset areas in the access_bitmap.  (It’s (A), and so I’d just say 
“fleecing_discard() clears areas in this bitmap.”)

> + *
> + * @done_bitmap: represents areas that was successfully copied by

s/that was/that were/

> + * copy-before-write operations. So, for dirty areas fleecing user should read
> + * from target node and for clear areas - from source node.

I’d prefer “from target node, and for clear areas from source node”.

> + *
> + * @frozen_read_reqs: current read requests for fleecing user in source node.

Hmm, perhaps “ongoing” would be clearer than “current”?

> + * corresponding areas must not be rewritten by guest.

Not necessarily just the guest, so something like “Writing to these 
areas must wait until the respective requests are settled.” would be 
more general.

> + */
> +typedef struct FleecingState {
> +    BlockCopyState *bcs;
> +
> +    CoMutex lock;
> +
> +    BdrvDirtyBitmap *access_bitmap;
> +    BdrvDirtyBitmap *done_bitmap;
> +
> +    BlockReqList frozen_read_reqs;
> +} FleecingState;
> +
> +FleecingState *fleecing_new(BlockCopyState *bcs,
> +                            BlockDriverState *fleecing_node,
> +                            Error **errp)
> +{
> +    BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(bcs),
> +                    *done_bitmap, *access_bitmap;

I don’t really understand why you didn’t just start a new declaration 
here, putting “BdrvDirtyBitmap” at the beginning of the line again.

> +    int64_t cluster_size = block_copy_cluster_size(bcs);
> +    FleecingState *s;
> +
> +    /* done_bitmap starts empty */
> +    done_bitmap = bdrv_create_dirty_bitmap(fleecing_node, cluster_size, NULL,
> +                                           errp);
> +    if (!done_bitmap) {
> +        return NULL;
> +    }
> +    bdrv_disable_dirty_bitmap(done_bitmap);
> +
> +    /* access_bitmap starts equal to bcs_bitmap */
> +    access_bitmap = bdrv_create_dirty_bitmap(fleecing_node, cluster_size, NULL,
> +                                             errp);
> +    if (!access_bitmap) {
> +        return NULL;
> +    }
> +    bdrv_disable_dirty_bitmap(access_bitmap);
> +    if (!bdrv_dirty_bitmap_merge_internal(access_bitmap, bcs_bitmap,
> +                                          NULL, true))
> +    {
> +        return NULL;
> +    }

This function lacks a proper on-error clean-up path to free the dirty 
bitmaps.

> +
> +    s = g_new(FleecingState, 1);
> +    *s = (FleecingState) {
> +        .bcs = bcs,
> +        .done_bitmap = done_bitmap,
> +        .access_bitmap = access_bitmap,
> +    };
> +    qemu_co_mutex_init(&s->lock);
> +    QLIST_INIT(&s->frozen_read_reqs);
> +
> +    return s;
> +}
> +
> +void fleecing_free(FleecingState *s)
> +{
> +    if (!s) {
> +        return;
> +    }
> +
> +    bdrv_release_dirty_bitmap(s->access_bitmap);
> +    bdrv_release_dirty_bitmap(s->done_bitmap);
> +    g_free(s);
> +}
> +
> +static BlockReq *add_read_req(FleecingState *s, uint64_t offset, uint64_t bytes)
> +{
> +    BlockReq *req = g_new(BlockReq, 1);
> +
> +    reqlist_init_req(&s->frozen_read_reqs, req, offset, bytes);
> +
> +    return req;
> +}
> +
> +static void drop_read_req(BlockReq *req)
> +{
> +    reqlist_remove_req(req);
> +    g_free(req);
> +}
> +
> +int fleecing_read_lock(FleecingState *s, int64_t offset,
> +                       int64_t bytes, const BlockReq **req,
> +                       int64_t *pnum)
> +{
> +    bool done;
> +
> +    QEMU_LOCK_GUARD(&s->lock);
> +
> +    if (bdrv_dirty_bitmap_next_zero(s->access_bitmap, offset, bytes) != -1) {
> +        return -EACCES;
> +    }
> +
> +    bdrv_dirty_bitmap_status(s->done_bitmap, offset, bytes, &done, pnum);
> +    if (!done) {
> +        *req = add_read_req(s, offset, *pnum);
> +    }
> +
> +    return 0;
> +}
> +
> +void fleecing_read_unlock(FleecingState *s, const BlockReq *req)
> +{
> +    QEMU_LOCK_GUARD(&s->lock);
> +
> +    drop_read_req((BlockReq *)req);

In my opinion, any cast removing a `const` must be accompanied by an 
explanatory comment.

As I understand it, this function takes a `const BlockReq *` pointer so 
because `fleecing_read_lock()` returns a `const BlockReq *` object, 
because we don’t want the user to modify that request, but still want 
them to be able to easily pass the object they’ve received to 
`fleecing_read_unlock()`, right?

The problem is of course that we need a mutable BlockReq object here, 
because QLIST_REMOVE() will modify it.  Taking a `const BlockReq *` is 
just not correct.

Perhaps instead of returning a `const BlockReq *` pointer to the caller, 
we should just return a `void *` opaque pointer that they are to pass to 
this function again, that should ensure they don’t touch it just the 
same, and we wouldn’t need this cast here.

> +}
> +
> +void fleecing_discard(FleecingState *s, int64_t offset, int64_t bytes)
> +{
> +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> +        bdrv_reset_dirty_bitmap(s->access_bitmap, offset, bytes);
> +    }
> +
> +    block_copy_reset(s->bcs, offset, bytes);
> +}
> +
> +void fleecing_mark_done_and_wait_readers(FleecingState *s, int64_t offset,
> +                                         int64_t bytes)
> +{
> +    assert(QEMU_IS_ALIGNED(offset, block_copy_cluster_size(s->bcs)));
> +    assert(QEMU_IS_ALIGNED(bytes, block_copy_cluster_size(s->bcs)));
> +
> +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> +        bdrv_set_dirty_bitmap(s->done_bitmap, offset, bytes);
> +        reqlist_wait_all(&s->frozen_read_reqs, offset, bytes, &s->lock);
> +    }
> +}
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7f24ee4b92..78ea04e292 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2423,6 +2423,8 @@ F: block/reqlist.c
>   F: include/block/reqlist.h
>   F: block/copy-before-write.h
>   F: block/copy-before-write.c
> +F: block/fleecing.h
> +F: block/fleecing.c
>   F: include/block/aio_task.h
>   F: block/aio_task.c
>   F: util/qemu-co-shared-resource.c
> diff --git a/block/meson.build b/block/meson.build
> index 5065cf33ba..d30da90a01 100644
> --- a/block/meson.build
> +++ b/block/meson.build
> @@ -18,6 +18,7 @@ block_ss.add(files(
>     'crypto.c',
>     'dirty-bitmap.c',
>     'filter-compress.c',
> +  'fleecing.c',
>     'io.c',
>     'mirror.c',
>     'nbd.c',
Vladimir Sementsov-Ogievskiy Jan. 18, 2022, 6:35 p.m. UTC | #2
18.01.2022 19:37, Hanna Reitz wrote:
> On 22.12.21 18:40, Vladimir Sementsov-Ogievskiy wrote:
>> FleecingState represents state shared between copy-before-write filter
>> and upcoming fleecing block driver.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/fleecing.h  | 135 ++++++++++++++++++++++++++++++++++
>>   block/fleecing.c  | 182 ++++++++++++++++++++++++++++++++++++++++++++++
>>   MAINTAINERS       |   2 +
>>   block/meson.build |   1 +
>>   4 files changed, 320 insertions(+)
>>   create mode 100644 block/fleecing.h
>>   create mode 100644 block/fleecing.c
>>
>> diff --git a/block/fleecing.h b/block/fleecing.h
>> new file mode 100644
>> index 0000000000..fb7b2f86c4
>> --- /dev/null
>> +++ b/block/fleecing.h
>> @@ -0,0 +1,135 @@
>> +/*
>> + * FleecingState
>> + *
>> + * The common state of image fleecing, shared between copy-before-write filter
>> + * and fleecing block driver.
> 
>  From this documentation, it’s unclear to me who owns the FleecingState object.  I would have assumed it’s the fleecing node, and if it is, I wonder why we even have this external interface instead of considering FleecingState a helper object for the fleecing block driver (or rather the block driver’s opaque state, which it basically is, as far as I can see from peeking into the next patch), and putting both into a single file with no external interface except for fleecing_mark_done_and_wait_readers().

FleecingState object is owned by copy-before-write node. copy-before-write has the whole information, and it owns BlockCopyState object, which is used to create FleecingState. copy-before-write node can easily detect that its target is fleecing filter, and initialize FleecingState in this case.

On the other hand, if we want to create FleecingState from fleecing filter (or even merge the state into its driver state), we'll have to search through parents to find copy-before-write, which may be not trivial. Moreover, at time of open() we may have no parents yet.


Hmm, but may be just pass bcs to fleecing-node by activate(), like we are going to do with fleecing state?  I'll give it a try.

> 
>> + *
>> + * Copyright (c) 2021 Virtuozzo International GmbH.
>> + *
>> + * Author:
>> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
>> + *
>> + *
>> + * Fleecing scheme looks as follows:
>> + *
>> + * [guest blk]                   [nbd export]
>> + *    |                              |
>> + *    |root                          |
>> + *    v                              v
>> + * [copy-before-write]--target-->[fleecing drv]
>> + *    |                          /   |
>> + *    |file                     /    |file
>> + *    v                        /     v
>> + * [active disk]<--source-----/  [temp disk]
>> + *
>> + * Note that "active disk" is also called just "source" and "temp disk" is also
>> + * called "target".
>> + *
>> + * What happens here:
>> + *
>> + * copy-before-write filter performs copy-before-write operations: on guest
>> + * write we should copy old data to target child before rewriting. Note that we
>> + * write this data through fleecing driver: it saves a possibility to implement
>> + * a kind of cache in fleecing driver in future.
> 
> I don’t understand why this explanation is the first one given (and the only one given explicitly as a reason) for why we want a fleecing block driver.

Actually, benefits are given in the next commit message.

> 
> (1) If we implement caching later, I have a feeling that we’ll want new options for this.  So a management layer that wants caching will need to be updated at that point anyway (to specify these new options), so I don’t understand how adding a fleecing block driver now would make it easier later on to introduce caching.
> 
> (1b) It’s actually entirely possible that we will not want to use the fleecing driver for caching, because we decide that caching is much more useful as its own dedicated block driver.
> 
> (2) There are much better arguments below.  This FleecingState you introduce here makes it clear why we need a fleecing block driver; it helps with synchronization, and it provides the “I’m done with this bit, I don’t care about it anymore” discard interface.
> 
>> + *
>> + * Fleecing user is nbd export: it can read from fleecing node, which guarantees
>> + * a snapshot-view for fleecing user. Fleecing user may also do discard
>> + * operations.
>> + *
>> + * FleecingState is responsible for most of the fleecing logic:
>> + *
>> + * 1. Fleecing read. Handle reads of fleecing user: we should decide where from
>> + * to read, from source node or from copy-before-write target node. In former
>> + * case we need to synchronize with guest writes. See fleecing_read_lock() and
>> + * fleecing_read_unlock() functionality.
>> + *
>> + * 2. Guest write synchronization (part of [1] actually). See
>> + * fleecing_mark_done_and_wait_readers()
>> + *
>> + * 3. Fleecing discard. Used by fleecing user when corresponding area is already
>> + * copied. Fleecing user may discard the area which is not needed anymore, that
>> + * should result in:
>> + *   - discarding data to free disk space
>> + *   - clear bits in copy-bitmap of block-copy, to avoid extra copy-before-write
>> + *     operations
>> + *   - clear bits in access-bitmap of FleecingState, to avoid further wrong
>> + *     access
>> + *
>> + * Still, FleecingState doesn't own any block children, so all real io
>> + * operations (reads, writes and discards) are done by copy-before-write filter
>> + * and fleecing block driver.
> 
> I find this a bit confusing, because for me, it raised the question of “why would it own block childen?”, which led to me wanting to know even more where the place of FleecingState is.  This sentence makes it really sound as if FleecingState is its own independent object floating around somewhere, not owned by anything, and that feels very wrong.

It's owned by copy-before-write node. Hmm, and seems doesn't operate directly on any block children, so this sentence may be removed.
diff mbox series

Patch

diff --git a/block/fleecing.h b/block/fleecing.h
new file mode 100644
index 0000000000..fb7b2f86c4
--- /dev/null
+++ b/block/fleecing.h
@@ -0,0 +1,135 @@ 
+/*
+ * FleecingState
+ *
+ * The common state of image fleecing, shared between copy-before-write filter
+ * and fleecing block driver.
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Author:
+ *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ *
+ *
+ * Fleecing scheme looks as follows:
+ *
+ * [guest blk]                   [nbd export]
+ *    |                              |
+ *    |root                          |
+ *    v                              v
+ * [copy-before-write]--target-->[fleecing drv]
+ *    |                          /   |
+ *    |file                     /    |file
+ *    v                        /     v
+ * [active disk]<--source-----/  [temp disk]
+ *
+ * Note that "active disk" is also called just "source" and "temp disk" is also
+ * called "target".
+ *
+ * What happens here:
+ *
+ * copy-before-write filter performs copy-before-write operations: on guest
+ * write we should copy old data to target child before rewriting. Note that we
+ * write this data through fleecing driver: it saves a possibility to implement
+ * a kind of cache in fleecing driver in future.
+ *
+ * Fleecing user is nbd export: it can read from fleecing node, which guarantees
+ * a snapshot-view for fleecing user. Fleecing user may also do discard
+ * operations.
+ *
+ * FleecingState is responsible for most of the fleecing logic:
+ *
+ * 1. Fleecing read. Handle reads of fleecing user: we should decide where from
+ * to read, from source node or from copy-before-write target node. In former
+ * case we need to synchronize with guest writes. See fleecing_read_lock() and
+ * fleecing_read_unlock() functionality.
+ *
+ * 2. Guest write synchronization (part of [1] actually). See
+ * fleecing_mark_done_and_wait_readers()
+ *
+ * 3. Fleecing discard. Used by fleecing user when corresponding area is already
+ * copied. Fleecing user may discard the area which is not needed anymore, that
+ * should result in:
+ *   - discarding data to free disk space
+ *   - clear bits in copy-bitmap of block-copy, to avoid extra copy-before-write
+ *     operations
+ *   - clear bits in access-bitmap of FleecingState, to avoid further wrong
+ *     access
+ *
+ * Still, FleecingState doesn't own any block children, so all real io
+ * operations (reads, writes and discards) are done by copy-before-write filter
+ * and fleecing block driver.
+ */
+
+#ifndef FLEECING_H
+#define FLEECING_H
+
+#include "block/block_int.h"
+#include "block/block-copy.h"
+#include "block/reqlist.h"
+
+typedef struct FleecingState FleecingState;
+
+/*
+ * Create FleecingState.
+ *
+ * @bcs: link to block-copy owned by copy-before-write filter.
+ *
+ * @fleecing_node: should be fleecing block driver node. Used to create some
+ * bitmaps in it.
+ */
+FleecingState *fleecing_new(BlockCopyState *bcs,
+                            BlockDriverState *fleecing_node,
+                            Error **errp);
+
+/* Free the state. Doesn't free block-copy state (@bcs) */
+void fleecing_free(FleecingState *s);
+
+/*
+ * Convenient function for thous who want to do fleecing read.
+ *
+ * If requested region starts in "done" area, i.e. data is already copied to
+ * copy-before-write target node, req is set to NULL, pnum is set to available
+ * bytes to read from target. User is free to read @pnum bytes from target.
+ * Still, user is responsible for concurrent discards on target.
+ *
+ * If requests region starts in "not done" area, i.e. we have to read from
+ * source node directly, than @pnum bytes of source node are frozen and
+ * guaranteed not be rewritten until user calls cbw_snapshot_read_unlock().
+ *
+ * Returns 0 on success and -EACCES when try to read non-dirty area of
+ * access_bitmap.
+ */
+int fleecing_read_lock(FleecingState *f, int64_t offset,
+                       int64_t bytes, const BlockReq **req, int64_t *pnum);
+/* Called as closing pair for fleecing_read_lock() */
+void fleecing_read_unlock(FleecingState *f, const BlockReq *req);
+
+/*
+ * Called when fleecing user doesn't need the region anymore (for example the
+ * region is successfully read and backed up somewhere).
+ * This prevents extra copy-before-write operations in this area in future.
+ * Next fleecing read from this area will fail with -EACCES.
+ */
+void fleecing_discard(FleecingState *f, int64_t offset, int64_t bytes);
+
+/*
+ * Called by copy-before-write filter after successful copy-before-write
+ * operation to synchronize with parallel fleecing reads.
+ */
+void fleecing_mark_done_and_wait_readers(FleecingState *f, int64_t offset,
+                                         int64_t bytes);
+
+#endif /* FLEECING_H */
diff --git a/block/fleecing.c b/block/fleecing.c
new file mode 100644
index 0000000000..f75d11b892
--- /dev/null
+++ b/block/fleecing.c
@@ -0,0 +1,182 @@ 
+/*
+ * FleecingState
+ *
+ * The common state of image fleecing, shared between copy-before-write filter
+ * and fleecing block driver.
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Author:
+ *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+
+#include "sysemu/block-backend.h"
+#include "qemu/cutils.h"
+#include "qapi/error.h"
+#include "block/block_int.h"
+#include "block/coroutines.h"
+#include "block/qdict.h"
+#include "block/block-copy.h"
+#include "block/reqlist.h"
+
+#include "block/fleecing.h"
+
+/*
+ * @bcs: link to block-copy state owned by copy-before-write filter which
+ * performs copy-before-write operations in context of fleecing scheme.
+ * FleecingState doesn't own the block-copy state and don't free it on cleanup.
+ *
+ * @lock: protects access to @access_bitmap, @done_bitmap and @frozen_read_reqs
+ *
+ * @access_bitmap: represents areas allowed for reading by fleecing user.
+ * Reading from non-dirty areas leads to -EACCES. Discard operation among other
+ * things clears corresponding bits in this bitmaps.
+ *
+ * @done_bitmap: represents areas that was successfully copied by
+ * copy-before-write operations. So, for dirty areas fleecing user should read
+ * from target node and for clear areas - from source node.
+ *
+ * @frozen_read_reqs: current read requests for fleecing user in source node.
+ * corresponding areas must not be rewritten by guest.
+ */
+typedef struct FleecingState {
+    BlockCopyState *bcs;
+
+    CoMutex lock;
+
+    BdrvDirtyBitmap *access_bitmap;
+    BdrvDirtyBitmap *done_bitmap;
+
+    BlockReqList frozen_read_reqs;
+} FleecingState;
+
+FleecingState *fleecing_new(BlockCopyState *bcs,
+                            BlockDriverState *fleecing_node,
+                            Error **errp)
+{
+    BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(bcs),
+                    *done_bitmap, *access_bitmap;
+    int64_t cluster_size = block_copy_cluster_size(bcs);
+    FleecingState *s;
+
+    /* done_bitmap starts empty */
+    done_bitmap = bdrv_create_dirty_bitmap(fleecing_node, cluster_size, NULL,
+                                           errp);
+    if (!done_bitmap) {
+        return NULL;
+    }
+    bdrv_disable_dirty_bitmap(done_bitmap);
+
+    /* access_bitmap starts equal to bcs_bitmap */
+    access_bitmap = bdrv_create_dirty_bitmap(fleecing_node, cluster_size, NULL,
+                                             errp);
+    if (!access_bitmap) {
+        return NULL;
+    }
+    bdrv_disable_dirty_bitmap(access_bitmap);
+    if (!bdrv_dirty_bitmap_merge_internal(access_bitmap, bcs_bitmap,
+                                          NULL, true))
+    {
+        return NULL;
+    }
+
+    s = g_new(FleecingState, 1);
+    *s = (FleecingState) {
+        .bcs = bcs,
+        .done_bitmap = done_bitmap,
+        .access_bitmap = access_bitmap,
+    };
+    qemu_co_mutex_init(&s->lock);
+    QLIST_INIT(&s->frozen_read_reqs);
+
+    return s;
+}
+
+void fleecing_free(FleecingState *s)
+{
+    if (!s) {
+        return;
+    }
+
+    bdrv_release_dirty_bitmap(s->access_bitmap);
+    bdrv_release_dirty_bitmap(s->done_bitmap);
+    g_free(s);
+}
+
+static BlockReq *add_read_req(FleecingState *s, uint64_t offset, uint64_t bytes)
+{
+    BlockReq *req = g_new(BlockReq, 1);
+
+    reqlist_init_req(&s->frozen_read_reqs, req, offset, bytes);
+
+    return req;
+}
+
+static void drop_read_req(BlockReq *req)
+{
+    reqlist_remove_req(req);
+    g_free(req);
+}
+
+int fleecing_read_lock(FleecingState *s, int64_t offset,
+                       int64_t bytes, const BlockReq **req,
+                       int64_t *pnum)
+{
+    bool done;
+
+    QEMU_LOCK_GUARD(&s->lock);
+
+    if (bdrv_dirty_bitmap_next_zero(s->access_bitmap, offset, bytes) != -1) {
+        return -EACCES;
+    }
+
+    bdrv_dirty_bitmap_status(s->done_bitmap, offset, bytes, &done, pnum);
+    if (!done) {
+        *req = add_read_req(s, offset, *pnum);
+    }
+
+    return 0;
+}
+
+void fleecing_read_unlock(FleecingState *s, const BlockReq *req)
+{
+    QEMU_LOCK_GUARD(&s->lock);
+
+    drop_read_req((BlockReq *)req);
+}
+
+void fleecing_discard(FleecingState *s, int64_t offset, int64_t bytes)
+{
+    WITH_QEMU_LOCK_GUARD(&s->lock) {
+        bdrv_reset_dirty_bitmap(s->access_bitmap, offset, bytes);
+    }
+
+    block_copy_reset(s->bcs, offset, bytes);
+}
+
+void fleecing_mark_done_and_wait_readers(FleecingState *s, int64_t offset,
+                                         int64_t bytes)
+{
+    assert(QEMU_IS_ALIGNED(offset, block_copy_cluster_size(s->bcs)));
+    assert(QEMU_IS_ALIGNED(bytes, block_copy_cluster_size(s->bcs)));
+
+    WITH_QEMU_LOCK_GUARD(&s->lock) {
+        bdrv_set_dirty_bitmap(s->done_bitmap, offset, bytes);
+        reqlist_wait_all(&s->frozen_read_reqs, offset, bytes, &s->lock);
+    }
+}
diff --git a/MAINTAINERS b/MAINTAINERS
index 7f24ee4b92..78ea04e292 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2423,6 +2423,8 @@  F: block/reqlist.c
 F: include/block/reqlist.h
 F: block/copy-before-write.h
 F: block/copy-before-write.c
+F: block/fleecing.h
+F: block/fleecing.c
 F: include/block/aio_task.h
 F: block/aio_task.c
 F: util/qemu-co-shared-resource.c
diff --git a/block/meson.build b/block/meson.build
index 5065cf33ba..d30da90a01 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -18,6 +18,7 @@  block_ss.add(files(
   'crypto.c',
   'dirty-bitmap.c',
   'filter-compress.c',
+  'fleecing.c',
   'io.c',
   'mirror.c',
   'nbd.c',