diff mbox series

[v9,30/34] qcow2: Add prealloc field to QCowL2Meta

Message ID 0dd88b3b4d3e90b28267267d7686cf5350d9dea0.1593342067.git.berto@igalia.com (mailing list archive)
State New, archived
Headers show
Series Add subcluster allocation to qcow2 | expand

Commit Message

Alberto Garcia June 28, 2020, 11:02 a.m. UTC
This field allows us to indicate that the L2 metadata update does not
come from a write request with actual data but from a preallocation
request.

For traditional images this does not make any difference, but for
images with extended L2 entries this means that the clusters are
allocated normally in the L2 table but individual subclusters are
marked as unallocated.

This will allow preallocating images that have a backing file.

There is one special case: when we resize an existing image we can
also request that the new clusters are preallocated. If the image
already had a backing file then we have to hide any possible stale
data and zero out the new clusters (see commit 955c7d6687 for more
details).

In this case the subclusters cannot be left as unallocated so the L2
bitmap must be updated.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 block/qcow2.h         | 8 ++++++++
 block/qcow2-cluster.c | 2 +-
 block/qcow2.c         | 6 ++++++
 3 files changed, 15 insertions(+), 1 deletion(-)

Comments

Max Reitz July 2, 2020, 2:50 p.m. UTC | #1
On 28.06.20 13:02, Alberto Garcia wrote:
> This field allows us to indicate that the L2 metadata update does not
> come from a write request with actual data but from a preallocation
> request.
> 
> For traditional images this does not make any difference, but for
> images with extended L2 entries this means that the clusters are
> allocated normally in the L2 table but individual subclusters are
> marked as unallocated.
> 
> This will allow preallocating images that have a backing file.
> 
> There is one special case: when we resize an existing image we can
> also request that the new clusters are preallocated. If the image
> already had a backing file then we have to hide any possible stale
> data and zero out the new clusters (see commit 955c7d6687 for more
> details).
> 
> In this case the subclusters cannot be left as unallocated so the L2
> bitmap must be updated.
> 
> Signed-off-by: Alberto Garcia <berto@igalia.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  block/qcow2.h         | 8 ++++++++
>  block/qcow2-cluster.c | 2 +-
>  block/qcow2.c         | 6 ++++++
>  3 files changed, 15 insertions(+), 1 deletion(-)

Sounds good, but I’m just not quite sure about the details on
falloc/full allocation: With .prealloc = true, writing to the
preallocated subclusters will require a COW operation.  That’s not
ideal, and avoiding those COWs may be a reason to do preallocation in
the first place.

Now, with backing files, it’s entirely correct.  You need a COW
operation, because that’s the point of having a backing file.

But without a backing file I wonder if it wouldn’t be better to set
.prealloc = false to avoid that COW.

Of course, if we did that, you couldn’t create the overlay separately
from the backing file, preallocate it, and only then attach the backing
file to the overlay.  But is that a problem?

(Or are there other problems to consider?)

Max
Eric Blake July 2, 2020, 2:58 p.m. UTC | #2
On 7/2/20 9:50 AM, Max Reitz wrote:
> On 28.06.20 13:02, Alberto Garcia wrote:
>> This field allows us to indicate that the L2 metadata update does not
>> come from a write request with actual data but from a preallocation
>> request.
>>
>> For traditional images this does not make any difference, but for
>> images with extended L2 entries this means that the clusters are
>> allocated normally in the L2 table but individual subclusters are
>> marked as unallocated.
>>
>> This will allow preallocating images that have a backing file.
>>
>> There is one special case: when we resize an existing image we can
>> also request that the new clusters are preallocated. If the image
>> already had a backing file then we have to hide any possible stale
>> data and zero out the new clusters (see commit 955c7d6687 for more
>> details).
>>
>> In this case the subclusters cannot be left as unallocated so the L2
>> bitmap must be updated.
>>
>> Signed-off-by: Alberto Garcia <berto@igalia.com>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
>> ---
>>   block/qcow2.h         | 8 ++++++++
>>   block/qcow2-cluster.c | 2 +-
>>   block/qcow2.c         | 6 ++++++
>>   3 files changed, 15 insertions(+), 1 deletion(-)
> 
> Sounds good, but I’m just not quite sure about the details on
> falloc/full allocation: With .prealloc = true, writing to the
> preallocated subclusters will require a COW operation.  That’s not
> ideal, and avoiding those COWs may be a reason to do preallocation in
> the first place.

I'm not sure I follow the complaint.  If a cluster is preallocated but 
the subcluster is marked unallocated, then doing a partial write to that 
subcluster must provide the correct contents for the rest of the 
subcluster (either filling with zero, or reading from a backing file) - 
but this COW can be limited to just the portion of the subcluster, and 
is no different than the COW you have to perform without subclusters 
when doing a write to a preallocated cluster in general.

> 
> Now, with backing files, it’s entirely correct.  You need a COW
> operation, because that’s the point of having a backing file.
> 
> But without a backing file I wonder if it wouldn’t be better to set
> .prealloc = false to avoid that COW.

Without a backing file, there is no read required - writing to an 
unallocated subcluster within a preallocated cluster merely has to 
provide zeros to the rest of the write.  And depending on whether we can 
intelligently guarantee that the underlying protocol already reads as 
zeroes when preallocated, we even have an optimization where even that 
is not necessary.  We can still lump it in the "COW" terminology, in 
that our write is more complex than merely writing in place, but it 
isn't a true copy-on-write operation as there is nothing to be copied.

> 
> Of course, if we did that, you couldn’t create the overlay separately
> from the backing file, preallocate it, and only then attach the backing
> file to the overlay.  But is that a problem?
> 
> (Or are there other problems to consider?)
> 
> Max
>
Max Reitz July 2, 2020, 3:09 p.m. UTC | #3
On 02.07.20 16:58, Eric Blake wrote:
> On 7/2/20 9:50 AM, Max Reitz wrote:
>> On 28.06.20 13:02, Alberto Garcia wrote:
>>> This field allows us to indicate that the L2 metadata update does not
>>> come from a write request with actual data but from a preallocation
>>> request.
>>>
>>> For traditional images this does not make any difference, but for
>>> images with extended L2 entries this means that the clusters are
>>> allocated normally in the L2 table but individual subclusters are
>>> marked as unallocated.
>>>
>>> This will allow preallocating images that have a backing file.
>>>
>>> There is one special case: when we resize an existing image we can
>>> also request that the new clusters are preallocated. If the image
>>> already had a backing file then we have to hide any possible stale
>>> data and zero out the new clusters (see commit 955c7d6687 for more
>>> details).
>>>
>>> In this case the subclusters cannot be left as unallocated so the L2
>>> bitmap must be updated.
>>>
>>> Signed-off-by: Alberto Garcia <berto@igalia.com>
>>> Reviewed-by: Eric Blake <eblake@redhat.com>
>>> ---
>>>   block/qcow2.h         | 8 ++++++++
>>>   block/qcow2-cluster.c | 2 +-
>>>   block/qcow2.c         | 6 ++++++
>>>   3 files changed, 15 insertions(+), 1 deletion(-)
>>
>> Sounds good, but I’m just not quite sure about the details on
>> falloc/full allocation: With .prealloc = true, writing to the
>> preallocated subclusters will require a COW operation.  That’s not
>> ideal, and avoiding those COWs may be a reason to do preallocation in
>> the first place.
> 
> I'm not sure I follow the complaint.  If a cluster is preallocated but
> the subcluster is marked unallocated, then doing a partial write to that
> subcluster must provide the correct contents for the rest of the
> subcluster (either filling with zero, or reading from a backing file) -
> but this COW can be limited to just the portion of the subcluster, and
> is no different than the COW you have to perform without subclusters
> when doing a write to a preallocated cluster in general.

It was my impression that falloc/full preallocation would create normal
data clusters, not zero clusters, so no COW was necessary when writing
to them.

>> Now, with backing files, it’s entirely correct.  You need a COW
>> operation, because that’s the point of having a backing file.
>>
>> But without a backing file I wonder if it wouldn’t be better to set
>> .prealloc = false to avoid that COW.
> 
> Without a backing file, there is no read required - writing to an
> unallocated subcluster within a preallocated cluster merely has to
> provide zeros to the rest of the write.  And depending on whether we can
> intelligently guarantee that the underlying protocol already reads as
> zeroes when preallocated, we even have an optimization where even that
> is not necessary.  We can still lump it in the "COW" terminology, in
> that our write is more complex than merely writing in place, but it
> isn't a true copy-on-write operation as there is nothing to be copied.

The term “COW” specifically in the qcow2 driver also refers to having to
write zeroes to an area that isn’t written to by the guest as part of
the process of having to allocate a (sub)cluster.

(Of course there is no COW from a backing file if there is no backing file.)

Max
Alberto Garcia July 2, 2020, 11:05 p.m. UTC | #4
On Thu 02 Jul 2020 05:09:47 PM CEST, Max Reitz wrote:
>> Without a backing file, there is no read required - writing to an
>> unallocated subcluster within a preallocated cluster merely has to
>> provide zeros to the rest of the write.  And depending on whether we
>> can intelligently guarantee that the underlying protocol already
>> reads as zeroes when preallocated, we even have an optimization where
>> even that is not necessary.  We can still lump it in the "COW"
>> terminology, in that our write is more complex than merely writing in
>> place, but it isn't a true copy-on-write operation as there is
>> nothing to be copied.
>
> The term “COW” specifically in the qcow2 driver also refers to having
> to write zeroes to an area that isn’t written to by the guest as part
> of the process of having to allocate a (sub)cluster.

The question is valid: if the space for the clusters is allocated but
the subclusters are not marked as such then any partial write request
will need to fill the rest with zeroes (in practice handle_alloc_space()
can do that efficiently but that's another question).

If there is a backing file then there's no other alternative because we
do need to copy the data from the backing file.

If there is no backing file perhaps we could allocate all subclusters as
well. I suppose we can detect that scenario at that point in the code (I
haven't checked) and I don't know what would happen if one later
attaches a backing file on runtime using the command-line options.

But what I would argue is that I don't see the benefit of using extended
L2 entries on an preallocated image with no backing file: other than
having twice as much L2 metadata what would be the use? The point of
subclusters is that they make allocation more efficient, but if the
image is already fully allocated then they give you nothing.

Berto
Max Reitz July 3, 2020, 7:22 a.m. UTC | #5
On 03.07.20 01:05, Alberto Garcia wrote:
> On Thu 02 Jul 2020 05:09:47 PM CEST, Max Reitz wrote:
>>> Without a backing file, there is no read required - writing to an
>>> unallocated subcluster within a preallocated cluster merely has to
>>> provide zeros to the rest of the write.  And depending on whether we
>>> can intelligently guarantee that the underlying protocol already
>>> reads as zeroes when preallocated, we even have an optimization where
>>> even that is not necessary.  We can still lump it in the "COW"
>>> terminology, in that our write is more complex than merely writing in
>>> place, but it isn't a true copy-on-write operation as there is
>>> nothing to be copied.
>>
>> The term “COW” specifically in the qcow2 driver also refers to having
>> to write zeroes to an area that isn’t written to by the guest as part
>> of the process of having to allocate a (sub)cluster.
> 
> The question is valid: if the space for the clusters is allocated but
> the subclusters are not marked as such then any partial write request
> will need to fill the rest with zeroes (in practice handle_alloc_space()
> can do that efficiently but that's another question).
> 
> If there is a backing file then there's no other alternative because we
> do need to copy the data from the backing file.
> 
> If there is no backing file perhaps we could allocate all subclusters as
> well. I suppose we can detect that scenario at that point in the code (I
> haven't checked) and I don't know what would happen if one later
> attaches a backing file on runtime using the command-line options.
> 
> But what I would argue is that I don't see the benefit of using extended
> L2 entries on an preallocated image with no backing file: other than
> having twice as much L2 metadata what would be the use? The point of
> subclusters is that they make allocation more efficient, but if the
> image is already fully allocated then they give you nothing.

That’s true.  I didn’t think about it this way.

Then indeed it doesn’t make sense to potentially break cases of later
adding a backing file:

Reviewed-by: Max Reitz <mreitz@redhat.com>
diff mbox series

Patch

diff --git a/block/qcow2.h b/block/qcow2.h
index 4ef4ae4ab0..f3499e53bf 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -463,6 +463,14 @@  typedef struct QCowL2Meta
      */
     bool skip_cow;
 
+    /**
+     * Indicates that this is not a normal write request but a preallocation.
+     * If the image has extended L2 entries this means that no new individual
+     * subclusters will be marked as allocated in the L2 bitmap (but any
+     * existing contents of that bitmap will be kept).
+     */
+    bool prealloc;
+
     /**
      * The I/O vector with the data from the actual guest write request.
      * If non-NULL, this is meant to be merged together with the data
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 1641976028..c8217081f2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1066,7 +1066,7 @@  int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
         set_l2_entry(s, l2_slice, l2_index + i, offset | QCOW_OFLAG_COPIED);
 
         /* Update bitmap with the subclusters that were just written */
-        if (has_subclusters(s)) {
+        if (has_subclusters(s) && !m->prealloc) {
             uint64_t l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + i);
             unsigned written_from = m->cow_start.offset;
             unsigned written_to = m->cow_end.offset + m->cow_end.nb_bytes ?:
diff --git a/block/qcow2.c b/block/qcow2.c
index 72bd25e774..003f166024 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2086,6 +2086,7 @@  static coroutine_fn int qcow2_handle_l2meta(BlockDriverState *bs,
         QCowL2Meta *next;
 
         if (link_l2) {
+            assert(!l2meta->prealloc);
             ret = qcow2_alloc_cluster_link_l2(bs, l2meta);
             if (ret) {
                 goto out;
@@ -3131,6 +3132,7 @@  static int coroutine_fn preallocate_co(BlockDriverState *bs, uint64_t offset,
 
         while (meta) {
             QCowL2Meta *next = meta->next;
+            meta->prealloc = true;
 
             ret = qcow2_alloc_cluster_link_l2(bs, meta);
             if (ret < 0) {
@@ -4224,6 +4226,7 @@  static int coroutine_fn qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
         int64_t clusters_allocated;
         int64_t old_file_size, last_cluster, new_file_size;
         uint64_t nb_new_data_clusters, nb_new_l2_tables;
+        bool subclusters_need_allocation = false;
 
         /* With a data file, preallocation means just allocating the metadata
          * and forwarding the truncate request to the data file */
@@ -4305,6 +4308,8 @@  static int coroutine_fn qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
                                    BDRV_REQ_ZERO_WRITE, NULL);
             if (ret >= 0) {
                 flags &= ~BDRV_REQ_ZERO_WRITE;
+                /* Ensure that we read zeroes and not backing file data */
+                subclusters_need_allocation = true;
             }
         } else {
             ret = -1;
@@ -4343,6 +4348,7 @@  static int coroutine_fn qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
                     .offset       = nb_clusters << s->cluster_bits,
                     .nb_bytes     = 0,
                 },
+                .prealloc     = !subclusters_need_allocation,
             };
             qemu_co_queue_init(&allocation.dependent_requests);