
[v5,19/31] qcow2: Add subcluster support to calculate_l2_meta()

Message ID 907ab6846b93b441a27eb6853ff3058f1c821bf9.1588699789.git.berto@igalia.com
State New, archived
Series Add subcluster allocation to qcow2

Commit Message

Alberto Garcia May 5, 2020, 5:38 p.m. UTC
If an image has subclusters then there are more copy-on-write
scenarios that we need to consider. Let's say we have a write request
from the middle of subcluster #3 until the end of the cluster:

1) If we are writing to a newly allocated cluster then we need
   copy-on-write. The previous contents of subclusters #0 to #3 must
   be copied to the new cluster. We can optimize this process by
   skipping all leading unallocated or zero subclusters (the status of
   those skipped subclusters will be reflected in the new L2 bitmap).

2) If we are overwriting an existing cluster:

   2.1) If subcluster #3 is unallocated or has the all-zeroes bit set
        then we need copy-on-write (on subcluster #3 only).

   2.2) If subcluster #3 was already allocated then there is no need
        for any copy-on-write. However, we still need to update the L2
        bitmap to reflect possible changes in the allocation status of
        subclusters #4 to #31. Because of this, the function checks
        whether all the overwritten subclusters are already allocated
        and, if so, returns without creating a new QCowL2Meta
        structure.
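
   Visually (scenario 1, where "x" marks the area written by the guest
   and the head copy-on-write region covers the rest of subclusters #0
   to #3):

      subcluster:  #0   #1   #2   #3   #4   #5    ...    #31
                  |----|----|----|-xxx|xxxx|xxxx| ...... |xxxx|
                   <---- COW ---->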

After all these changes, l2meta_cow_start() and l2meta_cow_end()
are no longer necessarily cluster-aligned. We need to update the
calculation of old_start and old_end in handle_dependencies() to
guarantee that no two requests try to write to the same cluster.

Signed-off-by: Alberto Garcia <berto@igalia.com>
---
 block/qcow2-cluster.c | 174 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 146 insertions(+), 28 deletions(-)

Comments

Eric Blake May 5, 2020, 9:59 p.m. UTC | #1
On 5/5/20 12:38 PM, Alberto Garcia wrote:
> If an image has subclusters then there are more copy-on-write
> scenarios that we need to consider. Let's say we have a write request
> from the middle of subcluster #3 until the end of the cluster:
> 
> 1) If we are writing to a newly allocated cluster then we need
>     copy-on-write. The previous contents of subclusters #0 to #3 must
>     be copied to the new cluster. We can optimize this process by
>     skipping all leading unallocated or zero subclusters (the status of
>     those skipped subclusters will be reflected in the new L2 bitmap).
> 
> 2) If we are overwriting an existing cluster:
> 
>     2.1) If subcluster #3 is unallocated or has the all-zeroes bit set
>          then we need copy-on-write (on subcluster #3 only).
> 
>     2.2) If subcluster #3 was already allocated then there is no need
>          for any copy-on-write. However, we still need to update the L2
>          bitmap to reflect possible changes in the allocation status of
>          subclusters #4 to #31. Because of this, the function checks
>          whether all the overwritten subclusters are already allocated
>          and, if so, returns without creating a new QCowL2Meta
>          structure.
> 
> After all these changes, l2meta_cow_start() and l2meta_cow_end()
> are no longer necessarily cluster-aligned. We need to update the
> calculation of old_start and old_end in handle_dependencies() to
> guarantee that no two requests try to write to the same cluster.
> 
> Signed-off-by: Alberto Garcia <berto@igalia.com>
> ---
>   block/qcow2-cluster.c | 174 +++++++++++++++++++++++++++++++++++-------
>   1 file changed, 146 insertions(+), 28 deletions(-)
> 

> -    /* Return if there's no COW (all clusters are normal and we keep them) */
> +    /* Return if there's no COW (all subclusters are normal and we are
> +     * keeping the clusters) */
>       if (keep_old) {
> +        unsigned first_sc = cow_start_to / s->subcluster_size;
> +        unsigned last_sc = (cow_end_from - 1) / s->subcluster_size;
>           int i;
> -        for (i = 0; i < nb_clusters; i++) {
> -            l2_entry = get_l2_entry(s, l2_slice, l2_index + i);
> -            if (qcow2_get_cluster_type(bs, l2_entry) != QCOW2_CLUSTER_NORMAL) {
> +        for (i = first_sc; i <= last_sc; i++) {
> +            unsigned c = i / s->subclusters_per_cluster;
> +            unsigned sc = i % s->subclusters_per_cluster;
> +            l2_entry = get_l2_entry(s, l2_slice, l2_index + c);
> +            l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + c);
> +            type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc);
> +            if (type == QCOW2_SUBCLUSTER_INVALID) {
> +                l2_index += c; /* Point to the invalid entry */
> +                goto fail;
> +            }
> +            if (type != QCOW2_SUBCLUSTER_NORMAL) {
>                   break;
>               }
>           }

This loop is now 32 times slower (for extended L2 entries).  Do you
really need to check for an invalid subcluster here, or can we just
blindly check that all 32 subclusters are NORMAL, and leave handling of
invalid subclusters for the rest of the function after we fail the
exit-early test?  For that matter, for all but the first and last
cluster, checking that all 32 subclusters are NORMAL is a simple 64-bit
comparison rather than 32 iterations of a loop; and even for the first
and last cluster, the _RANGE macros in 14/31 work well to mask out which
bits must be set/cleared.  My guess is that optimizing this loop is
worthwhile, since overwriting existing data is probably more common than
allocating new data.
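
A rough sketch of that check for one L2 entry, using the bitmap layout
and _RANGE macros from 14/31 (low 32 bits = "allocated", high 32 bits =
"all zeroes"); the helper name and parameters here are illustrative,
not something from the series:

    /* True if subclusters [first_sc, last_sc] of an L2 entry whose
     * cluster type is QCOW2_CLUSTER_NORMAL are all subcluster-NORMAL,
     * i.e. their "allocated" bits are set and their "all zeroes" bits
     * are clear.  For a whole cluster (first_sc == 0, last_sc == 31)
     * this collapses to the single 64-bit comparison
     * l2_bitmap == QCOW_L2_BITMAP_ALL_ALLOC. */
    static bool subclusters_all_normal(uint64_t l2_bitmap,
                                       unsigned first_sc, unsigned last_sc)
    {
        uint64_t alloc = QCOW_OFLAG_SUB_ALLOC_RANGE(first_sc, last_sc + 1);
        uint64_t zero = QCOW_OFLAG_SUB_ZERO_RANGE(first_sc, last_sc + 1);

        return (l2_bitmap & alloc) == alloc && !(l2_bitmap & zero);
    }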

> -        if (i == nb_clusters) {
> -            return;
> +        if (i == last_sc + 1) {
> +            return 0;
>           }
>       }

If we get here, then i is now the index of the first subcluster that
was not NORMAL, even if it is much smaller than the last subcluster
covered by nb_clusters for the overall request.  [1]

>   
>       /* Get the L2 entry of the first cluster */
>       l2_entry = get_l2_entry(s, l2_slice, l2_index);
> -    type = qcow2_get_cluster_type(bs, l2_entry);
> +    l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
> +    sc_index = offset_to_sc_index(s, guest_offset);
> +    type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_index);
>   
> -    if (type == QCOW2_CLUSTER_NORMAL && keep_old) {
> -        cow_start_from = cow_start_to;
> +    if (type == QCOW2_SUBCLUSTER_INVALID) {
> +        goto fail;
> +    }
> +
> +    if (!keep_old) {
> +        switch (type) {
> +        case QCOW2_SUBCLUSTER_COMPRESSED:
> +            cow_start_from = 0;
> +            break;
> +        case QCOW2_SUBCLUSTER_NORMAL:
> +        case QCOW2_SUBCLUSTER_ZERO_ALLOC:
> +        case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC: {
> +            int i;
> +            /* Skip all leading zero and unallocated subclusters */
> +            for (i = 0; i < sc_index; i++) {
> +                QCow2SubclusterType t;
> +                t = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, i);
> +                if (t == QCOW2_SUBCLUSTER_INVALID) {
> +                    goto fail;
> +                } else if (t == QCOW2_SUBCLUSTER_NORMAL) {
> +                    break;
> +                }
> +            }
> +            cow_start_from = i << s->subcluster_bits;
> +            break;

Note that you are only skipping until the first normal subcluster, even
if other zero/unallocated subclusters occur between the first normal
subcluster and the start of the action.  Or visually, suppose we have:

--0-NN-0_NNNNNNNN_NNNNNNNN_NNNNNNNN

as our 32 subclusters, with sc_index of 8.  You will end up skipping 
subclusters 0-3, but NOT 6 and 7.  Still, even though we spend time 
copying the allocated contents of those two subclusters, we also copy 
the subcluster status, and the guest still ends up reading the same data 
as before.  I don't know if it is worth trying to further minimize I/O 
to non-contiguous zero/unalloc subclusters in the head.

> +        }
> +        case QCOW2_SUBCLUSTER_ZERO_PLAIN:
> +        case QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN:
> +            cow_start_from = sc_index << s->subcluster_bits;
> +            break;
> +        default:
> +            g_assert_not_reached();
> +        }
>       } else {
> -        cow_start_from = 0;
> +        switch (type) {
> +        case QCOW2_SUBCLUSTER_NORMAL:
> +            cow_start_from = cow_start_to;
> +            break;
> +        case QCOW2_SUBCLUSTER_ZERO_ALLOC:
> +        case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC:
> +            cow_start_from = sc_index << s->subcluster_bits;
> +            break;
> +        default:
> +            g_assert_not_reached();
> +        }
>       }
>   
>       /* Get the L2 entry of the last cluster */
> -    l2_entry = get_l2_entry(s, l2_slice, l2_index + nb_clusters - 1);
> -    type = qcow2_get_cluster_type(bs, l2_entry);
> +    l2_index += nb_clusters - 1;
> +    l2_entry = get_l2_entry(s, l2_slice, l2_index);
> +    l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
> +    sc_index = offset_to_sc_index(s, guest_offset + bytes - 1);
> +    type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_index);

[1] but here, we are skipping any intermediate clusters, and worrying 
only about the state of the final cluster.  Is that always going to do 
the correct thing, or will there be cases where we need to update the L2 
entries of intermediate clusters?  If we don't check specifically for 
INVALID in the initial subcluster, but instead break the loop as soon as 
we find non-NORMAL, then it seems like the rest of the function should 
fragment the overall request to deal with just the clusters up to the 
point where we found a non-NORMAL, and leave the remaining portion of 
the original request for another round.

Thus, I'm not convinced that this patch is quite right.
Alberto Garcia May 6, 2020, 5:14 p.m. UTC | #2
On Tue 05 May 2020 11:59:18 PM CEST, Eric Blake wrote:
>> +        for (i = first_sc; i <= last_sc; i++) {
>> +            unsigned c = i / s->subclusters_per_cluster;
>> +            unsigned sc = i % s->subclusters_per_cluster;
>> +            l2_entry = get_l2_entry(s, l2_slice, l2_index + c);
>> +            l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + c);
>> +            type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc);
>> +            if (type == QCOW2_SUBCLUSTER_INVALID) {
>> +                l2_index += c; /* Point to the invalid entry */
>> +                goto fail;
>> +            }
>> +            if (type != QCOW2_SUBCLUSTER_NORMAL) {
>>                   break;
>>               }
>>           }
>
> This loop is now 32 times slower (for extended L2 entries).  Do you
> really need to check for an invalid subcluster here, or can we just
> blindly check that all 32 subclusters are NORMAL, and leave handling
> of invalid subclusters for the rest of the function after we fail the
> exit-early test?  For that matter, for all but the first and last
> cluster, checking that all 32 subclusters are NORMAL is a simple
> 64-bit comparison rather than 32 iterations of a loop; and even for
> the first and last cluster, the _RANGE macros in 14/31 work well to
> mask out which bits must be set/cleared.  My guess is that optimizing
> this loop is worthwhile, since overwriting existing data is probably
> more common than allocating new data.

I think you're right, and now we have the _RANGE macros so it should be
doable. I'll look into it.

>> -        if (i == nb_clusters) {
>> -            return;
>> +        if (i == last_sc + 1) {
>> +            return 0;
>>           }
>>       }
>
> If we get here, then i is now the index of the first subcluster that
> was not NORMAL, even if it is much smaller than the last subcluster
> covered by nb_clusters for the overall request.  [1]

I'm replying to this part later in [1]

>>       /* Get the L2 entry of the first cluster */
>>       l2_entry = get_l2_entry(s, l2_slice, l2_index);
>> -    type = qcow2_get_cluster_type(bs, l2_entry);
>> +    l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
>> +    sc_index = offset_to_sc_index(s, guest_offset);
>> +    type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_index);
>>   
>> -    if (type == QCOW2_CLUSTER_NORMAL && keep_old) {
>> -        cow_start_from = cow_start_to;
>> +    if (type == QCOW2_SUBCLUSTER_INVALID) {
>> +        goto fail;
>> +    }
>> +
>> +    if (!keep_old) {
>> +        switch (type) {
>> +        case QCOW2_SUBCLUSTER_COMPRESSED:
>> +            cow_start_from = 0;
>> +            break;
>> +        case QCOW2_SUBCLUSTER_NORMAL:
>> +        case QCOW2_SUBCLUSTER_ZERO_ALLOC:
>> +        case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC: {
>> +            int i;
>> +            /* Skip all leading zero and unallocated subclusters */
>> +            for (i = 0; i < sc_index; i++) {
>> +                QCow2SubclusterType t;
>> +                t = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, i);
>> +                if (t == QCOW2_SUBCLUSTER_INVALID) {
>> +                    goto fail;
>> +                } else if (t == QCOW2_SUBCLUSTER_NORMAL) {
>> +                    break;
>> +                }
>> +            }
>> +            cow_start_from = i << s->subcluster_bits;
>> +            break;
>
> Note that you are only skipping until the first normal subcluster, even
> if other zero/unallocated subclusters occur between the first normal
> subcluster and the start of the action.

That's correct.

> Or visually, suppose we have:
>
> --0-NN-0_NNNNNNNN_NNNNNNNN_NNNNNNNN
>
> as our 32 subclusters, with sc_index of 8.  You will end up skipping
> subclusters 0-3, but NOT 6 and 7.

That's correct. This function works with the assumption that the initial
COW region is located immediately before the data region, which is in
turn contiguous to the tail COW region.

I'm actually not sure that it necessarily has to be that way, but at
least it seems that functions like handle_alloc_space() rely on
that. Certainly before subclusters I don't see how there would be any
space between the COW regions and the actual data region.

While checking the documentation of QCowL2Meta I also realized that it
may need to be updated as well. It says "The COW Region between the
start of the first allocated cluster and the area the guest actually
writes to", but that region does not necessarily start at the beginning
of the cluster anymore, although the word "between" leaves some room
for interpretation.

Anyway, even if we could do COW of subclusters 4-5 only, there's no
general way to do that without touching QCowL2Meta or using more than
one structure (imagine we have -N-N-N-N_NNNN ...). I'm also not sure
that it's worth it.

> Still, even though we spend time copying the allocated contents of
> those two subclusters, we also copy the subcluster status, and the
> guest still ends up reading the same data as before.

No, the subcluster status is updated and those subclusters are now
marked as allocated. That's actually why we can use the _RANGE masks that
you proposed here:

   https://lists.gnu.org/archive/html/qemu-block/2020-04/msg01155.html

In other words, we have this bitmap:

   --0-NN-0_NNNNNNNN_NNNNNNNN_NNNNNNNN

If we write to subcluster 8 (which was already allocated), the resulting
bitmap is this one:

   --0-NNNN_NNNNNNNN_NNNNNNNN_NNNNNNNN

The last block in iotest 271 deals exactly with this kind of scenario.
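
(For example, with 64k clusters and extended L2 entries the subcluster
size is 2k, so a hypothetical "qemu-io -c 'write 16k 2k' img.qcow2"
would write exactly subcluster 8 of the first cluster.)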

>>       /* Get the L2 entry of the last cluster */
>> -    l2_entry = get_l2_entry(s, l2_slice, l2_index + nb_clusters - 1);
>> -    type = qcow2_get_cluster_type(bs, l2_entry);
>> +    l2_index += nb_clusters - 1;
>> +    l2_entry = get_l2_entry(s, l2_slice, l2_index);
>> +    l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
>> +    sc_index = offset_to_sc_index(s, guest_offset + bytes - 1);
>> +    type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_index);
>
> [1] but here, we are skipping any intermediate clusters, and worrying
> only about the state of the final cluster.  Is that always going to do
> the correct thing, or will there be cases where we need to update the
> L2 entries of intermediate clusters?

       Cluster 1             Cluster 2             Cluster 3
|---------------------|---------------------|---------------------|
   <---cow_start--><-------write request--------><--cow_end--->


All L2 entries from the beginning of cow_start until the end of cow_end
are always updated. That's again what the loop that I optimized using
the _RANGE masks (and that I linked above) was doing.

The code in calculate_l2_meta() is only concerned with determining the
actual start and end points. Everything between them will be written to
and marked as allocated. It's only the subclusters outside that range
that keep the previous values (unallocated, or zero).
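
As a sketch, the bitmap update for one L2 entry covering subclusters
[first_sc, last_sc] of that range would look roughly like this (using
the _RANGE masks; the variable names are illustrative):

    uint64_t l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
    /* Mark the written subclusters as allocated... */
    l2_bitmap |= QCOW_OFLAG_SUB_ALLOC_RANGE(first_sc, last_sc + 1);
    /* ...and clear their "all zeroes" bits. */
    l2_bitmap &= ~QCOW_OFLAG_SUB_ZERO_RANGE(first_sc, last_sc + 1);
    set_l2_bitmap(s, l2_slice, l2_index, l2_bitmap);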

Berto
Eric Blake May 6, 2020, 5:39 p.m. UTC | #3
On 5/6/20 12:14 PM, Alberto Garcia wrote:

>> Note that you are only skipping until the first normal subcluster, even
>> if other zero/unallocated subclusters occur between the first normal
>> subcluster and the start of the action.
> 
> That's correct.
> 
>> Or visually, suppose we have:
>>
>> --0-NN-0_NNNNNNNN_NNNNNNNN_NNNNNNNN
>>
>> as our 32 subclusters, with sc_index of 8.  You will end up skipping
>> subclusters 0-3, but NOT 6 and 7.
> 
> That's correct. This function works with the assumption that the initial
> COW region is located immediately before the data region, which is in
> turn contiguous to the tail COW region.
> 
...
>> Still, even though we spend time copying the allocated contents of
>> those two subclusters, we also copy the subcluster status, and the
>> guest still ends up reading the same data as before.
> 
> No, the subcluster status is updated and those subclusters are now
> marked as allocated. That's actually why we can use the _RANGE masks that
> you proposed here:
> 
>     https://lists.gnu.org/archive/html/qemu-block/2020-04/msg01155.html
> 
> In other words, we have this bitmap:
> 
>     --0-NN-0_NNNNNNNN_NNNNNNNN_NNNNNNNN
> 
> If we write to subcluster 8 (which was already allocated), the resulting
> bitmap is this one:
> 
>     --0-NNNN_NNNNNNNN_NNNNNNNN_NNNNNNNN
> 
> The last block in iotest 271 deals exactly with this kind of scenario.

Ah, that's what I was missing. So we indeed do more I/O than strictly 
necessary (by reading from the backing file or writing explicit zeroes 
rather than leaving unallocated or zero subclusters), but we are careful 
that the COW range of the operation puts the correct data into the head 
and tail (either from the backing file or explicitly zero) so that even 
though the status changes to N, the guest still sees the same contents.

I also agree that it is not worth trying to optimize for those later
subclusters - it adds a lot of complexity for something that does not
occur as frequently.  It is more important to optimize for the initial
sequence of unalloc/zero subclusters, but once we hit data, taking the
penalty of COWing a few extra subclusters isn't going to be much worse.

> 
>>>        /* Get the L2 entry of the last cluster */
>>> -    l2_entry = get_l2_entry(s, l2_slice, l2_index + nb_clusters - 1);
>>> -    type = qcow2_get_cluster_type(bs, l2_entry);
>>> +    l2_index += nb_clusters - 1;
>>> +    l2_entry = get_l2_entry(s, l2_slice, l2_index);
>>> +    l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
>>> +    sc_index = offset_to_sc_index(s, guest_offset + bytes - 1);
>>> +    type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_index);
>>
>> [1] but here, we are skipping any intermediate clusters, and worrying
>> only about the state of the final cluster.  Is that always going to do
>> the correct thing, or will there be cases where we need to update the
>> L2 entries of intermediate clusters?
> 
>         Cluster 1             Cluster 2             Cluster 3
> |---------------------|---------------------|---------------------|
>     <---cow_start--><-------write request--------><--cow_end--->
> 
> 
> All L2 entries from the beginning of cow_start until the end of cow_end
> are always updated. That's again what the loop that I optimized using
> the _RANGE masks (and that I linked above) was doing.

I guess the other thing that helps is that even if there are 
intermediate invalid subclusters, ideally the caller of this function is 
setting nb_clusters according to its own scan of how many contiguous 
clusters are even available for us to attempt to write into.  So patch 
20/31 catches the case where we don't have contiguous clusters if any of 
the subclusters are invalid.  And in _this_ patch, it shouldn't really 
matter if we don't detect that we are overwriting what used to be an 
invalid subcluster with something that is now valid data.

> 
> The code in calculate_l2_meta() is only concerned with determining the
> actual start and end points. Everything between them will be written to
> and marked as allocated. It's only the subclusters outside that range
> that keep the previous values (unallocated, or zero).

I think you've convinced me that we are safe on what this function does.
In fact, if we rely on 20/31 checking for invalid subclusters when
computing nb_clusters, we could probably assert that the start and end 
cluster in this function are not invalid, instead of adding the fail: 
label.  But even if you don't do that, I can live with:

Reviewed-by: Eric Blake <eblake@redhat.com>
Alberto Garcia May 7, 2020, 3:34 p.m. UTC | #4
On Wed 06 May 2020 07:39:48 PM CEST, Eric Blake wrote:
> In fact, if we rely on 20/31 checking for invalid subclusters when
> computing nb_clusters, we could probably assert that the start and end
> cluster in this function are not invalid, instead of adding the fail:
> label.

I think you're right about that, good catch! There's no need to return an
error code in this function.

Berto
Alberto Garcia May 7, 2020, 3:50 p.m. UTC | #5
On Thu 07 May 2020 05:34:18 PM CEST, Alberto Garcia wrote:
> On Wed 06 May 2020 07:39:48 PM CEST, Eric Blake wrote:
>> In fact, if we rely on 20/31 checking for invalid subclusters when
>> computing nb_clusters, we could probably assert that the start and end
>> cluster in this function are not invalid, instead of adding the fail:
>> label.
>
> I think you're right about that, good catch! There's no need to return an
> error code in this function.

No, wait, I got confused. 

Patch 20/31 updates qcow2_get_host_offset(), and that's used for
reading.

The caller of calculate_l2_meta() is qcow2_alloc_cluster_offset()
(indirectly), which is used for writing.

nb_clusters here is calculated with cluster_needs_new_alloc(), but that
doesn't check whether a cluster is valid or not, and certainly doesn't
check if every single subcluster is valid.

Thinking about it, there are probably faster ways to check for the
validity of extended L2 entries, for example:

- For QCOW2_CLUSTER_NORMAL, (l2_bitmap >> 32) & l2_bitmap should be 0.

- For QCOW2_CLUSTER_UNALLOCATED, l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC
  should be 0.

That's enough to cover all cases of QCOW2_SUBCLUSTER_INVALID that we
have at the moment.
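
A minimal sketch of those two checks combined (hypothetical helper,
untested):

    /* Fast validity check for an extended L2 entry: a NORMAL cluster
     * must not have any subcluster with both its "allocated" and its
     * "all zeroes" bit set, and an UNALLOCATED cluster must not have
     * any "allocated" bit set at all. */
    static bool l2_entry_is_valid(QCow2ClusterType ctype, uint64_t l2_bitmap)
    {
        switch (ctype) {
        case QCOW2_CLUSTER_NORMAL:
            return ((l2_bitmap >> 32) & l2_bitmap) == 0;
        case QCOW2_CLUSTER_UNALLOCATED:
            return (l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC) == 0;
        default:
            return true; /* other cluster types are not covered here */
        }
    }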

Berto

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 5595ce1404..ffcb11edda 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1059,56 +1059,156 @@  void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m)
  * If @keep_old is true it means that the clusters were already
  * allocated and will be overwritten. If false then the clusters are
  * new and we have to decrease the reference count of the old ones.
+ *
+ * Returns 0 on success, -errno on failure.
  */
-static void calculate_l2_meta(BlockDriverState *bs,
-                              uint64_t host_cluster_offset,
-                              uint64_t guest_offset, unsigned bytes,
-                              uint64_t *l2_slice, QCowL2Meta **m, bool keep_old)
+static int calculate_l2_meta(BlockDriverState *bs, uint64_t host_cluster_offset,
+                             uint64_t guest_offset, unsigned bytes,
+                             uint64_t *l2_slice, QCowL2Meta **m, bool keep_old)
 {
     BDRVQcow2State *s = bs->opaque;
-    int l2_index = offset_to_l2_slice_index(s, guest_offset);
-    uint64_t l2_entry;
+    int sc_index, l2_index = offset_to_l2_slice_index(s, guest_offset);
+    uint64_t l2_entry, l2_bitmap;
     unsigned cow_start_from, cow_end_to;
     unsigned cow_start_to = offset_into_cluster(s, guest_offset);
     unsigned cow_end_from = cow_start_to + bytes;
     unsigned nb_clusters = size_to_clusters(s, cow_end_from);
     QCowL2Meta *old_m = *m;
-    QCow2ClusterType type;
+    QCow2SubclusterType type;
 
     assert(nb_clusters <= s->l2_slice_size - l2_index);
 
-    /* Return if there's no COW (all clusters are normal and we keep them) */
+    /* Return if there's no COW (all subclusters are normal and we are
+     * keeping the clusters) */
     if (keep_old) {
+        unsigned first_sc = cow_start_to / s->subcluster_size;
+        unsigned last_sc = (cow_end_from - 1) / s->subcluster_size;
         int i;
-        for (i = 0; i < nb_clusters; i++) {
-            l2_entry = get_l2_entry(s, l2_slice, l2_index + i);
-            if (qcow2_get_cluster_type(bs, l2_entry) != QCOW2_CLUSTER_NORMAL) {
+        for (i = first_sc; i <= last_sc; i++) {
+            unsigned c = i / s->subclusters_per_cluster;
+            unsigned sc = i % s->subclusters_per_cluster;
+            l2_entry = get_l2_entry(s, l2_slice, l2_index + c);
+            l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + c);
+            type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc);
+            if (type == QCOW2_SUBCLUSTER_INVALID) {
+                l2_index += c; /* Point to the invalid entry */
+                goto fail;
+            }
+            if (type != QCOW2_SUBCLUSTER_NORMAL) {
                 break;
             }
         }
-        if (i == nb_clusters) {
-            return;
+        if (i == last_sc + 1) {
+            return 0;
         }
     }
 
     /* Get the L2 entry of the first cluster */
     l2_entry = get_l2_entry(s, l2_slice, l2_index);
-    type = qcow2_get_cluster_type(bs, l2_entry);
+    l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
+    sc_index = offset_to_sc_index(s, guest_offset);
+    type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_index);
 
-    if (type == QCOW2_CLUSTER_NORMAL && keep_old) {
-        cow_start_from = cow_start_to;
+    if (type == QCOW2_SUBCLUSTER_INVALID) {
+        goto fail;
+    }
+
+    if (!keep_old) {
+        switch (type) {
+        case QCOW2_SUBCLUSTER_COMPRESSED:
+            cow_start_from = 0;
+            break;
+        case QCOW2_SUBCLUSTER_NORMAL:
+        case QCOW2_SUBCLUSTER_ZERO_ALLOC:
+        case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC: {
+            int i;
+            /* Skip all leading zero and unallocated subclusters */
+            for (i = 0; i < sc_index; i++) {
+                QCow2SubclusterType t;
+                t = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, i);
+                if (t == QCOW2_SUBCLUSTER_INVALID) {
+                    goto fail;
+                } else if (t == QCOW2_SUBCLUSTER_NORMAL) {
+                    break;
+                }
+            }
+            cow_start_from = i << s->subcluster_bits;
+            break;
+        }
+        case QCOW2_SUBCLUSTER_ZERO_PLAIN:
+        case QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN:
+            cow_start_from = sc_index << s->subcluster_bits;
+            break;
+        default:
+            g_assert_not_reached();
+        }
     } else {
-        cow_start_from = 0;
+        switch (type) {
+        case QCOW2_SUBCLUSTER_NORMAL:
+            cow_start_from = cow_start_to;
+            break;
+        case QCOW2_SUBCLUSTER_ZERO_ALLOC:
+        case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC:
+            cow_start_from = sc_index << s->subcluster_bits;
+            break;
+        default:
+            g_assert_not_reached();
+        }
     }
 
     /* Get the L2 entry of the last cluster */
-    l2_entry = get_l2_entry(s, l2_slice, l2_index + nb_clusters - 1);
-    type = qcow2_get_cluster_type(bs, l2_entry);
+    l2_index += nb_clusters - 1;
+    l2_entry = get_l2_entry(s, l2_slice, l2_index);
+    l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
+    sc_index = offset_to_sc_index(s, guest_offset + bytes - 1);
+    type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_index);
 
-    if (type == QCOW2_CLUSTER_NORMAL && keep_old) {
-        cow_end_to = cow_end_from;
+    if (type == QCOW2_SUBCLUSTER_INVALID) {
+        goto fail;
+    }
+
+    if (!keep_old) {
+        switch (type) {
+        case QCOW2_SUBCLUSTER_COMPRESSED:
+            cow_end_to = ROUND_UP(cow_end_from, s->cluster_size);
+            break;
+        case QCOW2_SUBCLUSTER_NORMAL:
+        case QCOW2_SUBCLUSTER_ZERO_ALLOC:
+        case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC: {
+            int i;
+            cow_end_to = ROUND_UP(cow_end_from, s->cluster_size);
+            /* Skip all trailing zero and unallocated subclusters */
+            for (i = s->subclusters_per_cluster - 1; i > sc_index; i--) {
+                QCow2SubclusterType t;
+                t = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, i);
+                if (t == QCOW2_SUBCLUSTER_INVALID) {
+                    goto fail;
+                } else if (t == QCOW2_SUBCLUSTER_NORMAL) {
+                    break;
+                }
+                cow_end_to -= s->subcluster_size;
+            }
+            break;
+        }
+        case QCOW2_SUBCLUSTER_ZERO_PLAIN:
+        case QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN:
+            cow_end_to = ROUND_UP(cow_end_from, s->subcluster_size);
+            break;
+        default:
+            g_assert_not_reached();
+        }
     } else {
-        cow_end_to = ROUND_UP(cow_end_from, s->cluster_size);
+        switch (type) {
+        case QCOW2_SUBCLUSTER_NORMAL:
+            cow_end_to = cow_end_from;
+            break;
+        case QCOW2_SUBCLUSTER_ZERO_ALLOC:
+        case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC:
+            cow_end_to = ROUND_UP(cow_end_from, s->subcluster_size);
+            break;
+        default:
+            g_assert_not_reached();
+        }
     }
 
     *m = g_malloc0(sizeof(**m));
@@ -1133,6 +1233,18 @@  static void calculate_l2_meta(BlockDriverState *bs,
 
     qemu_co_queue_init(&(*m)->dependent_requests);
     QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
+
+fail:
+    if (type == QCOW2_SUBCLUSTER_INVALID) {
+        uint64_t l1_index = offset_to_l1_index(s, guest_offset);
+        uint64_t l2_offset = s->l1_table[l1_index] & L1E_OFFSET_MASK;
+        qcow2_signal_corruption(bs, true, -1, -1, "Invalid cluster entry found "
+                                " (L2 offset: %#" PRIx64 ", L2 index: %#x)",
+                                l2_offset, l2_index);
+        return -EIO;
+    }
+
+    return 0;
 }
 
 /*
@@ -1221,8 +1333,8 @@  static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
 
         uint64_t start = guest_offset;
         uint64_t end = start + bytes;
-        uint64_t old_start = l2meta_cow_start(old_alloc);
-        uint64_t old_end = l2meta_cow_end(old_alloc);
+        uint64_t old_start = start_of_cluster(s, l2meta_cow_start(old_alloc));
+        uint64_t old_end = ROUND_UP(l2meta_cow_end(old_alloc), s->cluster_size);
 
         if (end <= old_start || start >= old_end) {
             /* No intersection */
@@ -1347,8 +1459,11 @@  static int handle_copied(BlockDriverState *bs, uint64_t guest_offset,
                  - offset_into_cluster(s, guest_offset));
         assert(*bytes != 0);
 
-        calculate_l2_meta(bs, cluster_offset, guest_offset,
-                          *bytes, l2_slice, m, true);
+        ret = calculate_l2_meta(bs, cluster_offset, guest_offset,
+                                *bytes, l2_slice, m, true);
+        if (ret < 0) {
+            goto out;
+        }
 
         ret = 1;
     } else {
@@ -1524,8 +1639,11 @@  static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
     *bytes = MIN(*bytes, nb_bytes - offset_into_cluster(s, guest_offset));
     assert(*bytes != 0);
 
-    calculate_l2_meta(bs, alloc_cluster_offset, guest_offset, *bytes, l2_slice,
-                      m, false);
+    ret = calculate_l2_meta(bs, alloc_cluster_offset, guest_offset, *bytes,
+                            l2_slice, m, false);
+    if (ret < 0) {
+        goto out;
+    }
 
     ret = 1;