qcow2: add discard-no-unref option

Message ID 20230515073644.166677-1-jean-louis@dupond.be (mailing list archive)
State New, archived
Series: qcow2: add discard-no-unref option

Commit Message

Jean-Louis Dupond May 15, 2023, 7:36 a.m. UTC
When we for example have a sparse qcow2 image and discard: unmap is enabled,
there can be a lot of fragmentation in the image after some time. Surely on VM's
that do a lot of writes/deletes.
This causes the qcow2 image to grow even over 110% of its virtual size,
because the free gaps in the image get to small to allocate new
continuous clusters. So it allocates new space as the end of the image.

Disabling discard is not an option, as discard is needed to keep the
incremental backup size as low as possible. Without discard, the
incremental backups would become large, as qemu thinks it's just dirty
blocks but it doesn't know the blocks are empty/useless.
So we need to avoid fragmentation but also 'empty' the useless blocks in
the image to have a small incremental backup.

Next to that we also want to send the discards futher down the stack, so
the underlying blocks are still discarded.

Therefor we introduce a new qcow2 option "discard-no-unref". When
setting this option to true (defaults to false), the discard requests
will still be executed, but it will keep the offset of the cluster. And
it will also pass the discard request further down the stack (if
discard:unmap is enabled).
This will avoid fragmentation and for example on a fully preallocated
qcow2 image, this will make sure the image is perfectly continuous.

Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621
Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
---
 block/qcow2-cluster.c  |  16 ++++-
 block/qcow2-refcount.c | 136 ++++++++++++++++++++++++-----------------
 block/qcow2.c          |  12 ++++
 block/qcow2.h          |   3 +
 qapi/block-core.json   |   4 ++
 qemu-options.hx        |   6 ++
 6 files changed, 120 insertions(+), 57 deletions(-)

Comments

Hanna Czenczek May 26, 2023, 1:31 p.m. UTC | #1
On 15.05.23 09:36, Jean-Louis Dupond wrote:
> When we for example have a sparse qcow2 image and discard: unmap is enabled,
> there can be a lot of fragmentation in the image after some time. Surely on VM's
> that do a lot of writes/deletes.
> This causes the qcow2 image to grow even over 110% of its virtual size,
> because the free gaps in the image get to small to allocate new
> continuous clusters. So it allocates new space as the end of the image.
>
> Disabling discard is not an option, as discard is needed to keep the
> incremental backup size as low as possible. Without discard, the
> incremental backups would become large, as qemu thinks it's just dirty
> blocks but it doesn't know the blocks are empty/useless.
> So we need to avoid fragmentation but also 'empty' the useless blocks in
> the image to have a small incremental backup.
>
> Next to that we also want to send the discards futher down the stack, so
> the underlying blocks are still discarded.
>
> Therefor we introduce a new qcow2 option "discard-no-unref". When
> setting this option to true (defaults to false), the discard requests
> will still be executed, but it will keep the offset of the cluster. And
> it will also pass the discard request further down the stack (if
> discard:unmap is enabled).
> This will avoid fragmentation and for example on a fully preallocated
> qcow2 image, this will make sure the image is perfectly continuous.

I don’t follow how this patch is cleaner than the “block: Add zeroes 
discard option” patch.  That patch’s diff stat is +14/-5, this one’s is 
120+/57-.

As for better, I don’t want to discount that, but at the same time I 
don’t know the reasoning for it.  As far as I understand, this patch 
makes qcow2 retain the cluster mapping as-is, and only discards on the 
lower layer.  So effectively, instead of marking the cluster as zero on 
the qcow2 level, we do so on the filesystem level.  I’m not sure I see 
all the implications of that difference.

The advantage I see is that this frees up disk space, which the 
discard=zeroes wouldn’t do.  I wonder whether that couldn’t be solved 
with an orthogonal qcow2-only option, though, which would have the qcow2 
driver always discard zeroed clusters.

On the other hand, one thing to note is that we’ve had performance 
problems in the past with holes in the filesystem level (on tmpfs at 
least).  qemu generally can get the information on which clusters are 
zero quicker from qcow2 than from the filesystem, which suggests that 
even if we want to discard something on the filesystem layer, we 
probably also want to mark it as zero on the qcow2 layer.

I also have a small concern about fragmentation on the filesystem layer 
– if you use these options to prevent fragmentation, with this patch, 
you’re only doing so on the qcow2 layer.  Because the cluster is then 
discarded in the filesystem, it’s possible to get fragmentation there, 
which might not be desirable.  I don’t think that’s too important, but I 
think it’d just be nice to have a configuration in which the guest can 
tell qemu what areas it doesn’t care about, qemu marks these as zero, so 
that backups are more efficient; but at the same time, everything stays 
allocated, so no fragmentatoin is introduced.

Hanna
Jean-Louis Dupond May 26, 2023, 2:30 p.m. UTC | #2
On 26/05/2023 15:31, Hanna Czenczek wrote:
> On 15.05.23 09:36, Jean-Louis Dupond wrote:
>> When we for example have a sparse qcow2 image and discard: unmap is 
>> enabled,
>> there can be a lot of fragmentation in the image after some time. 
>> Surely on VM's
>> that do a lot of writes/deletes.
>> This causes the qcow2 image to grow even over 110% of its virtual size,
>> because the free gaps in the image get to small to allocate new
>> continuous clusters. So it allocates new space as the end of the image.
>>
>> Disabling discard is not an option, as discard is needed to keep the
>> incremental backup size as low as possible. Without discard, the
>> incremental backups would become large, as qemu thinks it's just dirty
>> blocks but it doesn't know the blocks are empty/useless.
>> So we need to avoid fragmentation but also 'empty' the useless blocks in
>> the image to have a small incremental backup.
>>
>> Next to that we also want to send the discards futher down the stack, so
>> the underlying blocks are still discarded.
>>
>> Therefor we introduce a new qcow2 option "discard-no-unref". When
>> setting this option to true (defaults to false), the discard requests
>> will still be executed, but it will keep the offset of the cluster. And
>> it will also pass the discard request further down the stack (if
>> discard:unmap is enabled).
>> This will avoid fragmentation and for example on a fully preallocated
>> qcow2 image, this will make sure the image is perfectly continuous.
>
> I don’t follow how this patch is cleaner than the “block: Add zeroes 
> discard option” patch.  That patch’s diff stat is +14/-5, this one’s 
> is 120+/57-.
Multiple reasons :)
- It's made for this use-case only, as there might be no other use-cases 
except this one, so the scope for discard=zeroes might be too big
- This one still handles the discards and passes it to the lower layer, 
which might be an important reason for the fact you enable discards
- The diffstat is mostly bigger because of indention changes in 
update_refcount
>
> As for better, I don’t want to discount that, but at the same time I 
> don’t know the reasoning for it.  As far as I understand, this patch 
> makes qcow2 retain the cluster mapping as-is, and only discards on the 
> lower layer.  So effectively, instead of marking the cluster as zero 
> on the qcow2 level, we do so on the filesystem level.  I’m not sure I 
> see all the implications of that difference.
We want to keep the cluster mapping to avoid creating holes in the qcow2 
cluster.
But we still want to do discards for 2 reasons:
- Mark the cluster as ZERO, so that incremental backups using dirty 
bitmaps can just skip this block
- Discard the data on the lower layer for efficiency reasons (for 
example if the lower layer has some dedup/compression/whatever), we 
still want the lower layer to know the block has been emptied
>
> The advantage I see is that this frees up disk space, which the 
> discard=zeroes wouldn’t do.  I wonder whether that couldn’t be solved 
> with an orthogonal qcow2-only option, though, which would have the 
> qcow2 driver always discard zeroed clusters.
This is an option also indeed. But we will end up with a similar patch 
(also in size).
>
> On the other hand, one thing to note is that we’ve had performance 
> problems in the past with holes in the filesystem level (on tmpfs at 
> least).  qemu generally can get the information on which clusters are 
> zero quicker from qcow2 than from the filesystem, which suggests that 
> even if we want to discard something on the filesystem layer, we 
> probably also want to mark it as zero on the qcow2 layer.
This is what we do in discard_in_l2_slice, we mark the entry as 
QCOW_OFLAG_ZERO when we receive a discard request.
>
> I also have a small concern about fragmentation on the filesystem 
> layer – if you use these options to prevent fragmentation, with this 
> patch, you’re only doing so on the qcow2 layer.  Because the cluster 
> is then discarded in the filesystem, it’s possible to get 
> fragmentation there, which might not be desirable.  I don’t think 
> that’s too important, but I think it’d just be nice to have a 
> configuration in which the guest can tell qemu what areas it doesn’t 
> care about, qemu marks these as zero, so that backups are more 
> efficient; but at the same time, everything stays allocated, so no 
> fragmentation is introduced.
That would be an additional option/improvement indeed. But it's not that 
this patch makes the fragmentation worse than it already is when you 
enable discard.
If you really want this you might even want your lower level storage to 
just ignore discards instead of fixing it in qcow2.
>
> Hanna
>

Thanks for your review
Jean-Louis
Hanna Czenczek May 31, 2023, 12:48 p.m. UTC | #3
On 26.05.23 16:30, Jean-Louis Dupond wrote:
> On 26/05/2023 15:31, Hanna Czenczek wrote:
>>
[...]

>> I don’t follow how this patch is cleaner than the “block: Add zeroes 
>> discard option” patch.  That patch’s diff stat is +14/-5, this one’s 
>> is 120+/57-.
> Multiple reasons :)
> - It's made for this use-case only, as there might be no other 
> use-cases except this one, so the scope for discard=zeroes might be too big

That sounds either like a disadvantage (only usable for qcow2), or like 
it makes no practical difference because even with the other approach, 
only qcow2 would support this kind of zeroing anyway.  In terms of 
complexity, I don’t really see the difference whether this is a global 
flag for all formats or just for qcow2.

> - This one still handles the discards and passes it to the lower 
> layer, which might be an important reason for the fact you enable 
> discards

Indeed this is an important benefit that probably conflicts with and 
outweighs point 1 above, though not really a reason why this patch would 
be “cleaner”.

> - The diffstat is mostly bigger because of indention changes in 
> update_refcount

The fact that update_refcount() is updated at all is reason for caution 
to me.  Changing how refcounts are updated is generally more on the 
dangerous side of things.

Furthermore, going from a net total of +9 to +63 can’t be explained 
solely through indentation changes.  I understand some of it is also 
comments and documentation, but, well.

Now that I understand this patch better, I can see that it probably 
better fulfills the intended purpose, but it still seems less clean to 
me.  Not that it really matters; I was just wondering because you 
advertised it as cleaner. O:)

>>
>> As for better, I don’t want to discount that, but at the same time I 
>> don’t know the reasoning for it.  As far as I understand, this patch 
>> makes qcow2 retain the cluster mapping as-is, and only discards on 
>> the lower layer.  So effectively, instead of marking the cluster as 
>> zero on the qcow2 level, we do so on the filesystem level.  I’m not 
>> sure I see all the implications of that difference.
> We want to keep the cluster mapping to avoid creating holes in the 
> qcow2 cluster.
> But we still want to do discards for 2 reasons:
> - Mark the cluster as ZERO, so that incremental backups using dirty 
> bitmaps can just skip this block
> - Discard the data on the lower layer for efficiency reasons (for 
> example if the lower layer has some dedup/compression/whatever), we 
> still want the lower layer to know the block has been emptied

Agreed, I missed that the cluster is stilled marked as ZERO.

>>
>> The advantage I see is that this frees up disk space, which the 
>> discard=zeroes wouldn’t do.  I wonder whether that couldn’t be solved 
>> with an orthogonal qcow2-only option, though, which would have the 
>> qcow2 driver always discard zeroed clusters.
> This is an option also indeed. But we will end up with a similar patch 
> (also in size).
>>
>> On the other hand, one thing to note is that we’ve had performance 
>> problems in the past with holes in the filesystem level (on tmpfs at 
>> least).  qemu generally can get the information on which clusters are 
>> zero quicker from qcow2 than from the filesystem, which suggests that 
>> even if we want to discard something on the filesystem layer, we 
>> probably also want to mark it as zero on the qcow2 layer.
> This is what we do in discard_in_l2_slice, we mark the entry as 
> QCOW_OFLAG_ZERO when we receive a discard request.

Thanks!  I missed that.

>>
>> I also have a small concern about fragmentation on the filesystem 
>> layer – if you use these options to prevent fragmentation, with this 
>> patch, you’re only doing so on the qcow2 layer.  Because the cluster 
>> is then discarded in the filesystem, it’s possible to get 
>> fragmentation there, which might not be desirable.  I don’t think 
>> that’s too important, but I think it’d just be nice to have a 
>> configuration in which the guest can tell qemu what areas it doesn’t 
>> care about, qemu marks these as zero, so that backups are more 
>> efficient; but at the same time, everything stays allocated, so no 
>> fragmentation is introduced.
> That would be an additional option/improvement indeed. But it's not 
> that this patch makes the fragmentation worse than it already is when 
> you enable discard.

It doesn’t, but then again, it is supposed to reduce fragmentation. Only 
qcow2 fragmentation, right, but the direction is there.  On the other 
hand...

> If you really want this you might even want your lower level storage 
> to just ignore discards instead of fixing it in qcow2.

...this makes much more sense!  Agreed.


I missed the fact that this patch still marks the cluster as zero on the 
qcow2 level.  I agree that it’s better to keep the discard on the lower 
layer, which the other patch variant can’t achieve, so there is merit to 
this approach.  I’ll go about reviewing the patch in detail!

Hanna
Hanna Czenczek May 31, 2023, 3:05 p.m. UTC | #4
On 15.05.23 09:36, Jean-Louis Dupond wrote:
> When we for example have a sparse qcow2 image and discard: unmap is enabled,
> there can be a lot of fragmentation in the image after some time. Surely on VM's

s/. Surely/, especially/

> that do a lot of writes/deletes.
> This causes the qcow2 image to grow even over 110% of its virtual size,
> because the free gaps in the image get to small to allocate new

s/to small/too small/

> continuous clusters. So it allocates new space as the end of the image.

s/as/at/

> Disabling discard is not an option, as discard is needed to keep the
> incremental backup size as low as possible. Without discard, the
> incremental backups would become large, as qemu thinks it's just dirty
> blocks but it doesn't know the blocks are empty/useless.
> So we need to avoid fragmentation but also 'empty' the useless blocks in

s/useless/unneeded/ in both lines?

> the image to have a small incremental backup.
>
> Next to that we also want to send the discards futher down the stack, so

s/Next to that/In addition/, s/futher/further/

> the underlying blocks are still discarded.
>
> Therefor we introduce a new qcow2 option "discard-no-unref". When
> setting this option to true (defaults to false), the discard requests
> will still be executed, but it will keep the offset of the cluster. And
> it will also pass the discard request further down the stack (if
> discard:unmap is enabled).

I think this could be more explicit, e.g. “When setting this option to 
true, discards will no longer have the qcow2 driver relinquish cluster 
allocations. Other than that, the request is handled as normal: All 
clusters in range are marked as zero, and, if pass-discard-request is 
true, it is passed further down the stack. The only difference is that 
the now-zero clusters are preallocated instead of being unallocated.”

> This will avoid fragmentation and for example on a fully preallocated
> qcow2 image, this will make sure the image is perfectly continuous.

Well, on the qcow2 layer, yes.

> Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621
> Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
> ---
>   block/qcow2-cluster.c  |  16 ++++-
>   block/qcow2-refcount.c | 136 ++++++++++++++++++++++++-----------------
>   block/qcow2.c          |  12 ++++
>   block/qcow2.h          |   3 +
>   qapi/block-core.json   |   4 ++
>   qemu-options.hx        |   6 ++
>   6 files changed, 120 insertions(+), 57 deletions(-)
>
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 39cda7f907..88da70db5e 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1943,10 +1943,22 @@ static int discard_in_l2_slice(BlockDriverState *bs, uint64_t offset,
>               new_l2_entry = new_l2_bitmap = 0;
>           } else if (bs->backing || qcow2_cluster_is_allocated(cluster_type)) {
>               if (has_subclusters(s)) {
> -                new_l2_entry = 0;
> +                if (s->discard_no_unref && (type & QCOW2_DISCARD_REQUEST)) {

As far as I understand the discard type is just a plain enum, not a bit 
field.  So I think this should be `type == QCOW2_DISCARD_REQUEST`, not 
an `&`.  (Same below.)
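
For illustration, a minimal sketch of the suggested equality check (an
editor's addition, not part of the posted patch), reusing the names from
the hunk above:

    /* "type" is a plain enum value, not a bit mask, so compare it for
     * equality instead of using a bitwise AND. */
    if (s->discard_no_unref && type == QCOW2_DISCARD_REQUEST) {
        /* Keep the host cluster offset; the entry is only marked as zero. */
        new_l2_entry = old_l2_entry;
    } else {
        new_l2_entry = 0;
    }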

> +                    new_l2_entry = old_l2_entry;
> +                } else {
> +                    new_l2_entry = 0;
> +                }
>                   new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
>               } else {
> -                new_l2_entry = s->qcow_version >= 3 ? QCOW_OFLAG_ZERO : 0;
> +                if (s->qcow_version >= 3) {
> +                    if (s->discard_no_unref && (type & QCOW2_DISCARD_REQUEST)) {
> +                        new_l2_entry |= QCOW_OFLAG_ZERO;
> +                    } else {
> +                        new_l2_entry = QCOW_OFLAG_ZERO;
> +                    }
> +                } else {
> +                    new_l2_entry = 0;
> +                }
>               }
>           }
>   

Context below:

        if (old_l2_entry == new_l2_entry && old_l2_bitmap == new_l2_bitmap) {
            continue;
        }

         /* First remove L2 entries */
         qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
         set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
         if (has_subclusters(s)) {
             set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap);
         }
         /* Then decrease the refcount */
         qcow2_free_any_cluster(bs, old_l2_entry, type);

If we keep the allocation, I don’t see why we would call 
qcow2_free_any_cluster().  If we simply skip the call (if 
`qcow2_is_allocated(qcow2_get_cluster_type(bs, new_l2_entry))`), I think 
you could drop the modification to update_refcount().
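
A rough sketch of the suggested skip (an editor's illustration only, using
the helpers the hunk above already relies on; the reply below explains why
the discard passthrough to the lower layer would then still need handling):

    /* Only drop the refcount if the updated entry no longer references a
     * host cluster; with discard-no-unref the mapping is kept, so nothing
     * is freed here. */
    if (!qcow2_cluster_is_allocated(qcow2_get_cluster_type(bs, new_l2_entry))) {
        qcow2_free_any_cluster(bs, old_l2_entry, type);
    }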

[...]

> diff --git a/block/qcow2.c b/block/qcow2.c
> index 5bde3b8401..9dde2ac1a5 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c

[...]

> @@ -725,6 +726,11 @@ static QemuOptsList qcow2_runtime_opts = {
>               .type = QEMU_OPT_BOOL,
>               .help = "Generate discard requests when other clusters are freed",
>           },
> +        {
> +            .name = QCOW2_OPT_DISCARD_NO_UNREF,
> +            .type = QEMU_OPT_BOOL,
> +            .help = "Do not dereference discarded clusters",

I wouldn’t call it “dereference” because of the overloaded meaning in C, 
but “unreference” instead.

> +        },
>           {
>               .name = QCOW2_OPT_OVERLAP,
>               .type = QEMU_OPT_STRING,

[...]

> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 187e35d473..63aa792e9c 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -3432,6 +3432,9 @@
>   # @pass-discard-other: whether discard requests for the data source
>   #     should be issued on other occasions where a cluster gets freed
>   #
> +# @discard-no-unref: don't dereference the cluster when we do a discard
> +#     this to avoid fragmentation of the qcow2 image (since 8.1)

Because this comment is used to build different documentation than 
qemu-options.hx is, I would duplicate the full comment you put into 
qemu-options.hx here (to provide the best documentation possible).

> +#
>   # @overlap-check: which overlap checks to perform for writes to the
>   #     image, defaults to 'cached' (since 2.2)
>   #
> @@ -3470,6 +3473,7 @@
>               '*pass-discard-request': 'bool',
>               '*pass-discard-snapshot': 'bool',
>               '*pass-discard-other': 'bool',
> +            '*discard-no-unref': 'bool',
>               '*overlap-check': 'Qcow2OverlapChecks',
>               '*cache-size': 'int',
>               '*l2-cache-size': 'int',
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 42b9094c10..17ac701d0d 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -1431,6 +1431,12 @@ SRST
>               issued on other occasions where a cluster gets freed
>               (on/off; default: off)
>   
> +        ``discard-no-unref``
> +            When enabled, a discard in the guest does not cause the
> +            cluster inside the qcow2 image to be dereferenced. This

Like above, I’d prefer “unreferenced”, or “the cluster’s allocation […] 
to be relinquished”.

> +            was added to avoid qcow2 fragmentation whithin the image.
> +            (on/off; default: off)

I wouldn’t describe history here, but instead what this is for. E.g.: 
“When enabled, discards from the guest will not cause cluster 
allocations to be relinquished. This prevents qcow2 fragmentation that 
would be caused by such discards. Besides potential performance 
degradation, such fragmentation can lead to increased allocation of 
clusters past the end of the image file, resulting in image files whose 
file length can grow much larger than their guest disk size would 
suggest. If image file length is of concern (e.g. when storing qcow2 
images directly on block devices), you should consider enabling this 
option.”

What do you think?

Hanna
Hanna Czenczek May 31, 2023, 3:16 p.m. UTC | #5
On 15.05.23 09:36, Jean-Louis Dupond wrote:
> When we for example have a sparse qcow2 image and discard: unmap is enabled,
> there can be a lot of fragmentation in the image after some time. Surely on VM's
> that do a lot of writes/deletes.
> This causes the qcow2 image to grow even over 110% of its virtual size,
> because the free gaps in the image get to small to allocate new
> continuous clusters. So it allocates new space as the end of the image.
>
> Disabling discard is not an option, as discard is needed to keep the
> incremental backup size as low as possible. Without discard, the
> incremental backups would become large, as qemu thinks it's just dirty
> blocks but it doesn't know the blocks are empty/useless.
> So we need to avoid fragmentation but also 'empty' the useless blocks in
> the image to have a small incremental backup.
>
> Next to that we also want to send the discards futher down the stack, so
> the underlying blocks are still discarded.
>
> Therefor we introduce a new qcow2 option "discard-no-unref". When
> setting this option to true (defaults to false), the discard requests
> will still be executed, but it will keep the offset of the cluster. And
> it will also pass the discard request further down the stack (if
> discard:unmap is enabled).
> This will avoid fragmentation and for example on a fully preallocated
> qcow2 image, this will make sure the image is perfectly continuous.
>
> Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621
> Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
> ---
>   block/qcow2-cluster.c  |  16 ++++-
>   block/qcow2-refcount.c | 136 ++++++++++++++++++++++++-----------------
>   block/qcow2.c          |  12 ++++
>   block/qcow2.h          |   3 +
>   qapi/block-core.json   |   4 ++
>   qemu-options.hx        |   6 ++
>   6 files changed, 120 insertions(+), 57 deletions(-)
>
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 39cda7f907..88da70db5e 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1943,10 +1943,22 @@ static int discard_in_l2_slice(BlockDriverState *bs, uint64_t offset,
>               new_l2_entry = new_l2_bitmap = 0;
>           } else if (bs->backing || qcow2_cluster_is_allocated(cluster_type)) {
>               if (has_subclusters(s)) {
> -                new_l2_entry = 0;
> +                if (s->discard_no_unref && (type & QCOW2_DISCARD_REQUEST)) {
> +                    new_l2_entry = old_l2_entry;
> +                } else {
> +                    new_l2_entry = 0;
> +                }
>                   new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
>               } else {
> -                new_l2_entry = s->qcow_version >= 3 ? QCOW_OFLAG_ZERO : 0;
> +                if (s->qcow_version >= 3) {
> +                    if (s->discard_no_unref && (type & QCOW2_DISCARD_REQUEST)) {
> +                        new_l2_entry |= QCOW_OFLAG_ZERO;
> +                    } else {
> +                        new_l2_entry = QCOW_OFLAG_ZERO;
> +                    }
> +                } else {
> +                    new_l2_entry = 0;
> +                }
>               }
>           }

Forgot to note something: So this option only works for qcow2 v3. Can we 
have qcow2_update_options_prepare() return an error if s->qcow_version < 
3 and discard_no_unref is set?
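
A minimal sketch of that validation (an editor's illustration; the exact
placement in qcow2_update_options_prepare() and the error wording are
assumptions, not the change Jean-Louis later made):

    r->discard_no_unref = qemu_opt_get_bool(opts, QCOW2_OPT_DISCARD_NO_UNREF,
                                            false);
    if (r->discard_no_unref && s->qcow_version < 3) {
        /* The option relies on zero clusters, which need qcow2 v3. */
        error_setg(errp,
                   "discard-no-unref is only supported since qcow2 version 3");
        ret = -EINVAL;
        goto fail;
    }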

Hanna
Jean-Louis Dupond June 1, 2023, 12:56 p.m. UTC | #6
On 31/05/2023 17:05, Hanna Czenczek wrote:
> On 15.05.23 09:36, Jean-Louis Dupond wrote:
>> When we for example have a sparse qcow2 image and discard: unmap is 
>> enabled,
>> there can be a lot of fragmentation in the image after some time. 
>> Surely on VM's
>
> s/. Surely/, especially/
>
>> that do a lot of writes/deletes.
>> This causes the qcow2 image to grow even over 110% of its virtual size,
>> because the free gaps in the image get to small to allocate new
>
> s/to small/too small/
>
>> continuous clusters. So it allocates new space as the end of the image.
>
> s/as/at/
>
>> Disabling discard is not an option, as discard is needed to keep the
>> incremental backup size as low as possible. Without discard, the
>> incremental backups would become large, as qemu thinks it's just dirty
>> blocks but it doesn't know the blocks are empty/useless.
>> So we need to avoid fragmentation but also 'empty' the useless blocks in
>
> s/useless/unneeded/ in both lines?
>
>> the image to have a small incremental backup.
>>
>> Next to that we also want to send the discards futher down the stack, so
>
> s/Next to that/In addition/, s/futher/further/
>
>> the underlying blocks are still discarded.
>>
>> Therefor we introduce a new qcow2 option "discard-no-unref". When
>> setting this option to true (defaults to false), the discard requests
>> will still be executed, but it will keep the offset of the cluster. And
>> it will also pass the discard request further down the stack (if
>> discard:unmap is enabled).
>
> I think this could be more explicit, e.g. “When setting this option to 
> true, discards will no longer have the qcow2 driver relinquish cluster 
> allocations. Other than that, the request is handled as normal: All 
> clusters in range are marked as zero, and, if pass-discard-request is 
> true, it is passed further down the stack. The only difference is that 
> the now-zero clusters are preallocated instead of being unallocated.”
>
>> This will avoid fragmentation and for example on a fully preallocated
>> qcow2 image, this will make sure the image is perfectly continuous.
>
> Well, on the qcow2 layer, yes.
All above -> Fixed :)
>
>> Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621
>> Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
>> ---
>>   block/qcow2-cluster.c  |  16 ++++-
>>   block/qcow2-refcount.c | 136 ++++++++++++++++++++++++-----------------
>>   block/qcow2.c          |  12 ++++
>>   block/qcow2.h          |   3 +
>>   qapi/block-core.json   |   4 ++
>>   qemu-options.hx        |   6 ++
>>   6 files changed, 120 insertions(+), 57 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index 39cda7f907..88da70db5e 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -1943,10 +1943,22 @@ static int 
>> discard_in_l2_slice(BlockDriverState *bs, uint64_t offset,
>>               new_l2_entry = new_l2_bitmap = 0;
>>           } else if (bs->backing || 
>> qcow2_cluster_is_allocated(cluster_type)) {
>>               if (has_subclusters(s)) {
>> -                new_l2_entry = 0;
>> +                if (s->discard_no_unref && (type & 
>> QCOW2_DISCARD_REQUEST)) {
>
> As far as I understand the discard type is just a plain enum, not a 
> bit field.  So I think this should be `type == QCOW2_DISCARD_REQUEST`, 
> not an `&`.  (Same below.)
>
Ack
>> +                    new_l2_entry = old_l2_entry;
>> +                } else {
>> +                    new_l2_entry = 0;
>> +                }
>>                   new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
>>               } else {
>> -                new_l2_entry = s->qcow_version >= 3 ? 
>> QCOW_OFLAG_ZERO : 0;
>> +                if (s->qcow_version >= 3) {
>> +                    if (s->discard_no_unref && (type & 
>> QCOW2_DISCARD_REQUEST)) {
>> +                        new_l2_entry |= QCOW_OFLAG_ZERO;
>> +                    } else {
>> +                        new_l2_entry = QCOW_OFLAG_ZERO;
>> +                    }
>> +                } else {
>> +                    new_l2_entry = 0;
>> +                }
>>               }
>>           }
>
> Context below:
>
>         if (old_l2_entry == new_l2_entry && old_l2_bitmap == 
> new_l2_bitmap) {
>             continue;
>         }
>
>         /* First remove L2 entries */
>         qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
>         set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
>         if (has_subclusters(s)) {
>             set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap);
>         }
>         /* Then decrease the refcount */
>         qcow2_free_any_cluster(bs, old_l2_entry, type);
>
> If we keep the allocation, I don’t see why we would call 
> qcow2_free_any_cluster().  If we simply skip the call (if 
> `qcow2_is_allocated(qcow2_get_cluster_type(bs, new_l2_entry))`), I 
> think you could drop the modification to update_refcount().
>
If we don't call qcow2_free_any_cluster, the discard will not get passed 
to the lower layer.
We also call it in zero_in_l2_slice, for example, to discard on the lower layer.
> [...]
>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 5bde3b8401..9dde2ac1a5 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>
> [...]
>
>> @@ -725,6 +726,11 @@ static QemuOptsList qcow2_runtime_opts = {
>>               .type = QEMU_OPT_BOOL,
>>               .help = "Generate discard requests when other clusters 
>> are freed",
>>           },
>> +        {
>> +            .name = QCOW2_OPT_DISCARD_NO_UNREF,
>> +            .type = QEMU_OPT_BOOL,
>> +            .help = "Do not dereference discarded clusters",
>
> I wouldn’t call it “dereference” because of the overloaded meaning in 
> C, but “unreference” instead.
>
ack
>> +        },
>>           {
>>               .name = QCOW2_OPT_OVERLAP,
>>               .type = QEMU_OPT_STRING,
>
> [...]
>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 187e35d473..63aa792e9c 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -3432,6 +3432,9 @@
>>   # @pass-discard-other: whether discard requests for the data source
>>   #     should be issued on other occasions where a cluster gets freed
>>   #
>> +# @discard-no-unref: don't dereference the cluster when we do a discard
>> +#     this to avoid fragmentation of the qcow2 image (since 8.1)
>
> Because this comment is used to build different documentation than 
> qemu-options.hx is, I would duplicate the full comment you put into 
> qemu-options.hx here (to provide the best documentation possible).
>
ack
>> +#
>>   # @overlap-check: which overlap checks to perform for writes to the
>>   #     image, defaults to 'cached' (since 2.2)
>>   #
>> @@ -3470,6 +3473,7 @@
>>               '*pass-discard-request': 'bool',
>>               '*pass-discard-snapshot': 'bool',
>>               '*pass-discard-other': 'bool',
>> +            '*discard-no-unref': 'bool',
>>               '*overlap-check': 'Qcow2OverlapChecks',
>>               '*cache-size': 'int',
>>               '*l2-cache-size': 'int',
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 42b9094c10..17ac701d0d 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -1431,6 +1431,12 @@ SRST
>>               issued on other occasions where a cluster gets freed
>>               (on/off; default: off)
>>   +        ``discard-no-unref``
>> +            When enabled, a discard in the guest does not cause the
>> +            cluster inside the qcow2 image to be dereferenced. This
>
> Like above, I’d prefer “unreferenced”, or “the cluster’s allocation 
> […] to be relinquished”.
>
ack
>> +            was added to avoid qcow2 fragmentation whithin the image.
>> +            (on/off; default: off)
>
> I wouldn’t describe history here, but instead what this is for. E.g.: 
> “When enabled, discards from the guest will not cause cluster 
> allocations to be relinquished. This prevents qcow2 fragmentation that 
> would be caused by such discards. Besides potential performance 
> degradation, such fragmentation can lead to increased allocation of 
> clusters past the end of the image file, resulting in image files 
> whose file length can grow much larger than their guest disk size 
> would suggest. If image file length is of concern (e.g. when storing 
> qcow2 images directly on block devices), you should consider enabling 
> this option.”
ack
>
> What do you think?

Let me know your opinion on the qcow2_free_any_cluster call, and then 
I'll post a new patch version.

>
> Hanna
>

Jean-Louis
Jean-Louis Dupond June 1, 2023, 1:13 p.m. UTC | #7
On 31/05/2023 17:16, Hanna Czenczek wrote:
> On 15.05.23 09:36, Jean-Louis Dupond wrote:
>> When we for example have a sparse qcow2 image and discard: unmap is 
>> enabled,
>> there can be a lot of fragmentation in the image after some time. 
>> Surely on VM's
>> that do a lot of writes/deletes.
>> This causes the qcow2 image to grow even over 110% of its virtual size,
>> because the free gaps in the image get to small to allocate new
>> continuous clusters. So it allocates new space as the end of the image.
>>
>> Disabling discard is not an option, as discard is needed to keep the
>> incremental backup size as low as possible. Without discard, the
>> incremental backups would become large, as qemu thinks it's just dirty
>> blocks but it doesn't know the blocks are empty/useless.
>> So we need to avoid fragmentation but also 'empty' the useless blocks in
>> the image to have a small incremental backup.
>>
>> Next to that we also want to send the discards futher down the stack, so
>> the underlying blocks are still discarded.
>>
>> Therefor we introduce a new qcow2 option "discard-no-unref". When
>> setting this option to true (defaults to false), the discard requests
>> will still be executed, but it will keep the offset of the cluster. And
>> it will also pass the discard request further down the stack (if
>> discard:unmap is enabled).
>> This will avoid fragmentation and for example on a fully preallocated
>> qcow2 image, this will make sure the image is perfectly continuous.
>>
>> Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621
>> Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
>> ---
>>   block/qcow2-cluster.c  |  16 ++++-
>>   block/qcow2-refcount.c | 136 ++++++++++++++++++++++++-----------------
>>   block/qcow2.c          |  12 ++++
>>   block/qcow2.h          |   3 +
>>   qapi/block-core.json   |   4 ++
>>   qemu-options.hx        |   6 ++
>>   6 files changed, 120 insertions(+), 57 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index 39cda7f907..88da70db5e 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -1943,10 +1943,22 @@ static int 
>> discard_in_l2_slice(BlockDriverState *bs, uint64_t offset,
>>               new_l2_entry = new_l2_bitmap = 0;
>>           } else if (bs->backing || 
>> qcow2_cluster_is_allocated(cluster_type)) {
>>               if (has_subclusters(s)) {
>> -                new_l2_entry = 0;
>> +                if (s->discard_no_unref && (type & 
>> QCOW2_DISCARD_REQUEST)) {
>> +                    new_l2_entry = old_l2_entry;
>> +                } else {
>> +                    new_l2_entry = 0;
>> +                }
>>                   new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
>>               } else {
>> -                new_l2_entry = s->qcow_version >= 3 ? 
>> QCOW_OFLAG_ZERO : 0;
>> +                if (s->qcow_version >= 3) {
>> +                    if (s->discard_no_unref && (type & 
>> QCOW2_DISCARD_REQUEST)) {
>> +                        new_l2_entry |= QCOW_OFLAG_ZERO;
>> +                    } else {
>> +                        new_l2_entry = QCOW_OFLAG_ZERO;
>> +                    }
>> +                } else {
>> +                    new_l2_entry = 0;
>> +                }
>>               }
>>           }
>
> Forgot to note something: So this option only works for qcow2 v3. Can 
> we have qcow2_update_options_prepare() return an error if 
> s->qcow_version < 3 and discard_no_unref is set?
Added :)
>
> Hanna
>
Hanna Czenczek June 2, 2023, 8:01 a.m. UTC | #8
On 01.06.23 14:56, Jean-Louis Dupond wrote:
> On 31/05/2023 17:05, Hanna Czenczek wrote:
>> On 15.05.23 09:36, Jean-Louis Dupond wrote:
>>> When we for example have a sparse qcow2 image and discard: unmap is 
>>> enabled,
>>> there can be a lot of fragmentation in the image after some time. 
>>> Surely on VM's
>>
>> s/. Surely/, especially/
>>
>>> that do a lot of writes/deletes.
>>> This causes the qcow2 image to grow even over 110% of its virtual size,
>>> because the free gaps in the image get to small to allocate new
>>
>> s/to small/too small/
>>
>>> continuous clusters. So it allocates new space as the end of the image.
>>
>> s/as/at/
>>
>>> Disabling discard is not an option, as discard is needed to keep the
>>> incremental backup size as low as possible. Without discard, the
>>> incremental backups would become large, as qemu thinks it's just dirty
>>> blocks but it doesn't know the blocks are empty/useless.
>>> So we need to avoid fragmentation but also 'empty' the useless 
>>> blocks in
>>
>> s/useless/unneeded/ in both lines?
>>
>>> the image to have a small incremental backup.
>>>
>>> Next to that we also want to send the discards futher down the 
>>> stack, so
>>
>> s/Next to that/In addition/, s/futher/further/
>>
>>> the underlying blocks are still discarded.
>>>
>>> Therefor we introduce a new qcow2 option "discard-no-unref". When
>>> setting this option to true (defaults to false), the discard requests
>>> will still be executed, but it will keep the offset of the cluster. And
>>> it will also pass the discard request further down the stack (if
>>> discard:unmap is enabled).
>>
>> I think this could be more explicit, e.g. “When setting this option 
>> to true, discards will no longer have the qcow2 driver relinquish 
>> cluster allocations. Other than that, the request is handled as 
>> normal: All clusters in range are marked as zero, and, if 
>> pass-discard-request is true, it is passed further down the stack. 
>> The only difference is that the now-zero clusters are preallocated 
>> instead of being unallocated.”
>>
>>> This will avoid fragmentation and for example on a fully preallocated
>>> qcow2 image, this will make sure the image is perfectly continuous.
>>
>> Well, on the qcow2 layer, yes.
> All above -> Fixed :)
>>
>>> Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621
>>> Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
>>> ---
>>>   block/qcow2-cluster.c  |  16 ++++-
>>>   block/qcow2-refcount.c | 136 
>>> ++++++++++++++++++++++++-----------------
>>>   block/qcow2.c          |  12 ++++
>>>   block/qcow2.h          |   3 +
>>>   qapi/block-core.json   |   4 ++
>>>   qemu-options.hx        |   6 ++
>>>   6 files changed, 120 insertions(+), 57 deletions(-)
>>>
>>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>>> index 39cda7f907..88da70db5e 100644
>>> --- a/block/qcow2-cluster.c
>>> +++ b/block/qcow2-cluster.c
>>> @@ -1943,10 +1943,22 @@ static int 
>>> discard_in_l2_slice(BlockDriverState *bs, uint64_t offset,
>>>               new_l2_entry = new_l2_bitmap = 0;
>>>           } else if (bs->backing || 
>>> qcow2_cluster_is_allocated(cluster_type)) {
>>>               if (has_subclusters(s)) {
>>> -                new_l2_entry = 0;
>>> +                if (s->discard_no_unref && (type & 
>>> QCOW2_DISCARD_REQUEST)) {
>>
>> As far as I understand the discard type is just a plain enum, not a 
>> bit field.  So I think this should be `type == 
>> QCOW2_DISCARD_REQUEST`, not an `&`.  (Same below.)
>>
> Ack
>>> +                    new_l2_entry = old_l2_entry;
>>> +                } else {
>>> +                    new_l2_entry = 0;
>>> +                }
>>>                   new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
>>>               } else {
>>> -                new_l2_entry = s->qcow_version >= 3 ? 
>>> QCOW_OFLAG_ZERO : 0;
>>> +                if (s->qcow_version >= 3) {
>>> +                    if (s->discard_no_unref && (type & 
>>> QCOW2_DISCARD_REQUEST)) {
>>> +                        new_l2_entry |= QCOW_OFLAG_ZERO;
>>> +                    } else {
>>> +                        new_l2_entry = QCOW_OFLAG_ZERO;
>>> +                    }
>>> +                } else {
>>> +                    new_l2_entry = 0;
>>> +                }
>>>               }
>>>           }
>>
>> Context below:
>>
>>         if (old_l2_entry == new_l2_entry && old_l2_bitmap == 
>> new_l2_bitmap) {
>>             continue;
>>         }
>>
>>         /* First remove L2 entries */
>>         qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
>>         set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
>>         if (has_subclusters(s)) {
>>             set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap);
>>         }
>>         /* Then decrease the refcount */
>>         qcow2_free_any_cluster(bs, old_l2_entry, type);
>>
>> If we keep the allocation, I don’t see why we would call 
>> qcow2_free_any_cluster().  If we simply skip the call (if 
>> `qcow2_is_allocated(qcow2_get_cluster_type(bs, new_l2_entry))`), I 
>> think you could drop the modification to update_refcount().
>>
> If we don't call qcow2_free_any_cluster, the discard will not get 
> passed to the lower layer.

That’s a pickle.

> We also call it in zero_in_l2_slice, for example, to discard on the lower layer.

We only call it there if the allocation is dropped.  (`new_l2_entry = 
unmap ? 0 : old_l2_entry`)

I’d either lift the discard to discard_in_l2_slice() (if dropping the 
reference, call qcow2_free_any_cluster(); otherwise, if the old cluster 
was a normal or zero allocated cluster, discard it); or add a bool 
parameter to `qcow2_free_any_cluster()` that tells it to only discard, 
not free, the cluster, which makes it take the existing `if 
(has_data_file(bs))` path there.

The latter is simpler, but I find it problematic still to call 
qcow2_free_any_cluster() when there’s no intention of actually freeing a 
cluster (i.e. releasing the reference to it).
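
For the first alternative, a rough sketch of how the caller side in
discard_in_l2_slice() could read (an editor's illustration;
"qcow2_discard_cluster_data" is a hypothetical helper, standing in for
whatever would forward the discard to the protocol layer without touching
refcounts):

    if (qcow2_cluster_is_allocated(qcow2_get_cluster_type(bs, new_l2_entry))) {
        /* Reference kept: only pass the discard down, no refcount change.
         * qcow2_discard_cluster_data() is hypothetical, see above. */
        qcow2_discard_cluster_data(bs, old_l2_entry & L2E_OFFSET_MASK,
                                   s->cluster_size, type);
    } else {
        /* Reference dropped: decrease the refcount as before. */
        qcow2_free_any_cluster(bs, old_l2_entry, type);
    }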

Hanna

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 39cda7f907..88da70db5e 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1943,10 +1943,22 @@  static int discard_in_l2_slice(BlockDriverState *bs, uint64_t offset,
             new_l2_entry = new_l2_bitmap = 0;
         } else if (bs->backing || qcow2_cluster_is_allocated(cluster_type)) {
             if (has_subclusters(s)) {
-                new_l2_entry = 0;
+                if (s->discard_no_unref && (type & QCOW2_DISCARD_REQUEST)) {
+                    new_l2_entry = old_l2_entry;
+                } else {
+                    new_l2_entry = 0;
+                }
                 new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
             } else {
-                new_l2_entry = s->qcow_version >= 3 ? QCOW_OFLAG_ZERO : 0;
+                if (s->qcow_version >= 3) {
+                    if (s->discard_no_unref && (type & QCOW2_DISCARD_REQUEST)) {
+                        new_l2_entry |= QCOW_OFLAG_ZERO;
+                    } else {
+                        new_l2_entry = QCOW_OFLAG_ZERO;
+                    }
+                } else {
+                    new_l2_entry = 0;
+                }
             }
         }
 
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 4cf91bd955..a81685ed72 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -830,7 +830,10 @@  static int update_refcount(BlockDriverState *bs,
         return 0;
     }
 
-    if (decrease) {
+    bool discard_no_unref = (s->discard_no_unref &&
+                             (type & QCOW2_DISCARD_REQUEST));
+
+    if (decrease && (!discard_no_unref)) {
         qcow2_cache_set_dependency(bs, s->refcount_block_cache,
             s->l2_table_cache);
     }
@@ -840,69 +843,92 @@  static int update_refcount(BlockDriverState *bs,
     for(cluster_offset = start; cluster_offset <= last;
         cluster_offset += s->cluster_size)
     {
-        int block_index;
-        uint64_t refcount;
-        int64_t cluster_index = cluster_offset >> s->cluster_bits;
-        int64_t table_index = cluster_index >> s->refcount_block_bits;
-
-        /* Load the refcount block and allocate it if needed */
-        if (table_index != old_table_index) {
-            if (refcount_block) {
-                qcow2_cache_put(s->refcount_block_cache, &refcount_block);
-            }
-            ret = alloc_refcount_block(bs, cluster_index, &refcount_block);
-            /* If the caller needs to restart the search for free clusters,
-             * try the same ones first to see if they're still free. */
-            if (ret == -EAGAIN) {
-                if (s->free_cluster_index > (start >> s->cluster_bits)) {
-                    s->free_cluster_index = (start >> s->cluster_bits);
+        /*
+         * If discard-no-unref is enabled and this is a DISCARD request
+         * then we can skip the reference update, as we want to keep
+         * the reference in place.
+         */
+        if (!discard_no_unref) {
+            int block_index;
+            uint64_t refcount;
+            int64_t cluster_index = cluster_offset >> s->cluster_bits;
+            int64_t table_index = cluster_index >> s->refcount_block_bits;
+
+            /*
+             * Load the refcount block and allocate it if needed
+             */
+            if (table_index != old_table_index) {
+                if (refcount_block) {
+                    qcow2_cache_put(s->refcount_block_cache, &refcount_block);
+                }
+                ret = alloc_refcount_block(bs, cluster_index, &refcount_block);
+                /*
+                 * If the caller needs to restart the search for free clusters,
+                 * try the same ones first to see if they're still free.
+                 */
+                if (ret == -EAGAIN) {
+                    if (s->free_cluster_index > (start >> s->cluster_bits)) {
+                        s->free_cluster_index = (start >> s->cluster_bits);
+                    }
+                }
+                if (ret < 0) {
+                    goto fail;
                 }
             }
-            if (ret < 0) {
-                goto fail;
-            }
-        }
-        old_table_index = table_index;
+            old_table_index = table_index;
 
-        qcow2_cache_entry_mark_dirty(s->refcount_block_cache, refcount_block);
+            qcow2_cache_entry_mark_dirty(s->refcount_block_cache,
+                                         refcount_block);
+
+            /*
+             * we can update the count and save it
+             */
+            block_index = cluster_index & (s->refcount_block_size - 1);
 
-        /* we can update the count and save it */
-        block_index = cluster_index & (s->refcount_block_size - 1);
+            refcount = s->get_refcount(refcount_block, block_index);
+            if (decrease ? (refcount - addend > refcount)
+                        : (refcount + addend < refcount ||
+                            refcount + addend > s->refcount_max))
+            {
+                ret = -EINVAL;
+                goto fail;
+            }
+            if (decrease) {
+                refcount -= addend;
+            } else {
+                refcount += addend;
+            }
+            if (refcount == 0 && cluster_index < s->free_cluster_index) {
+                s->free_cluster_index = cluster_index;
+            }
+            s->set_refcount(refcount_block, block_index, refcount);
 
-        refcount = s->get_refcount(refcount_block, block_index);
-        if (decrease ? (refcount - addend > refcount)
-                     : (refcount + addend < refcount ||
-                        refcount + addend > s->refcount_max))
-        {
-            ret = -EINVAL;
-            goto fail;
-        }
-        if (decrease) {
-            refcount -= addend;
-        } else {
-            refcount += addend;
-        }
-        if (refcount == 0 && cluster_index < s->free_cluster_index) {
-            s->free_cluster_index = cluster_index;
-        }
-        s->set_refcount(refcount_block, block_index, refcount);
+            if (refcount == 0) {
+                void *table;
 
-        if (refcount == 0) {
-            void *table;
+                table = qcow2_cache_is_table_offset(s->refcount_block_cache,
+                                                    offset);
+                if (table != NULL) {
+                    qcow2_cache_put(s->refcount_block_cache, &refcount_block);
+                    old_table_index = -1;
+                    qcow2_cache_discard(s->refcount_block_cache, table);
+                }
 
-            table = qcow2_cache_is_table_offset(s->refcount_block_cache,
-                                                offset);
-            if (table != NULL) {
-                qcow2_cache_put(s->refcount_block_cache, &refcount_block);
-                old_table_index = -1;
-                qcow2_cache_discard(s->refcount_block_cache, table);
-            }
+                table = qcow2_cache_is_table_offset(s->l2_table_cache, offset);
+                if (table != NULL) {
+                    qcow2_cache_discard(s->l2_table_cache, table);
+                }
 
-            table = qcow2_cache_is_table_offset(s->l2_table_cache, offset);
-            if (table != NULL) {
-                qcow2_cache_discard(s->l2_table_cache, table);
+                if (s->discard_passthrough[type]) {
+                    update_refcount_discard(bs, cluster_offset,
+                                            s->cluster_size);
+                }
             }
-
+        } else {
+            /*
+             * We had a discard request with discard-no-ref enabled, but we
+             * want to pass the discard to the backend to free the data there.
+             */
             if (s->discard_passthrough[type]) {
                 update_refcount_discard(bs, cluster_offset, s->cluster_size);
             }
diff --git a/block/qcow2.c b/block/qcow2.c
index 5bde3b8401..9dde2ac1a5 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -681,6 +681,7 @@  static const char *const mutable_opts[] = {
     QCOW2_OPT_DISCARD_REQUEST,
     QCOW2_OPT_DISCARD_SNAPSHOT,
     QCOW2_OPT_DISCARD_OTHER,
+    QCOW2_OPT_DISCARD_NO_UNREF,
     QCOW2_OPT_OVERLAP,
     QCOW2_OPT_OVERLAP_TEMPLATE,
     QCOW2_OPT_OVERLAP_MAIN_HEADER,
@@ -725,6 +726,11 @@  static QemuOptsList qcow2_runtime_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "Generate discard requests when other clusters are freed",
         },
+        {
+            .name = QCOW2_OPT_DISCARD_NO_UNREF,
+            .type = QEMU_OPT_BOOL,
+            .help = "Do not dereference discarded clusters",
+        },
         {
             .name = QCOW2_OPT_OVERLAP,
             .type = QEMU_OPT_STRING,
@@ -968,6 +974,7 @@  typedef struct Qcow2ReopenState {
     bool use_lazy_refcounts;
     int overlap_check;
     bool discard_passthrough[QCOW2_DISCARD_MAX];
+    bool discard_no_unref;
     uint64_t cache_clean_interval;
     QCryptoBlockOpenOptions *crypto_opts; /* Disk encryption runtime options */
 } Qcow2ReopenState;
@@ -1139,6 +1146,9 @@  static int qcow2_update_options_prepare(BlockDriverState *bs,
     r->discard_passthrough[QCOW2_DISCARD_OTHER] =
         qemu_opt_get_bool(opts, QCOW2_OPT_DISCARD_OTHER, false);
 
+    r->discard_no_unref = qemu_opt_get_bool(opts, QCOW2_OPT_DISCARD_NO_UNREF,
+                                            false);
+
     switch (s->crypt_method_header) {
     case QCOW_CRYPT_NONE:
         if (encryptfmt) {
@@ -1219,6 +1229,8 @@  static void qcow2_update_options_commit(BlockDriverState *bs,
         s->discard_passthrough[i] = r->discard_passthrough[i];
     }
 
+    s->discard_no_unref = r->discard_no_unref;
+
     if (s->cache_clean_interval != r->cache_clean_interval) {
         cache_clean_timer_del(bs);
         s->cache_clean_interval = r->cache_clean_interval;
diff --git a/block/qcow2.h b/block/qcow2.h
index 4f67eb912a..ea9adb5706 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -133,6 +133,7 @@ 
 #define QCOW2_OPT_DISCARD_REQUEST "pass-discard-request"
 #define QCOW2_OPT_DISCARD_SNAPSHOT "pass-discard-snapshot"
 #define QCOW2_OPT_DISCARD_OTHER "pass-discard-other"
+#define QCOW2_OPT_DISCARD_NO_UNREF "discard-no-unref"
 #define QCOW2_OPT_OVERLAP "overlap-check"
 #define QCOW2_OPT_OVERLAP_TEMPLATE "overlap-check.template"
 #define QCOW2_OPT_OVERLAP_MAIN_HEADER "overlap-check.main-header"
@@ -385,6 +386,8 @@  typedef struct BDRVQcow2State {
 
     bool discard_passthrough[QCOW2_DISCARD_MAX];
 
+    bool discard_no_unref;
+
     int overlap_check; /* bitmask of Qcow2MetadataOverlap values */
     bool signaled_corruption;
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 187e35d473..63aa792e9c 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3432,6 +3432,9 @@ 
 # @pass-discard-other: whether discard requests for the data source
 #     should be issued on other occasions where a cluster gets freed
 #
+# @discard-no-unref: don't dereference the cluster when we do a discard
+#     this to avoid fragmentation of the qcow2 image (since 8.1)
+#
 # @overlap-check: which overlap checks to perform for writes to the
 #     image, defaults to 'cached' (since 2.2)
 #
@@ -3470,6 +3473,7 @@ 
             '*pass-discard-request': 'bool',
             '*pass-discard-snapshot': 'bool',
             '*pass-discard-other': 'bool',
+            '*discard-no-unref': 'bool',
             '*overlap-check': 'Qcow2OverlapChecks',
             '*cache-size': 'int',
             '*l2-cache-size': 'int',
diff --git a/qemu-options.hx b/qemu-options.hx
index 42b9094c10..17ac701d0d 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1431,6 +1431,12 @@  SRST
             issued on other occasions where a cluster gets freed
             (on/off; default: off)
 
+        ``discard-no-unref``
+            When enabled, a discard in the guest does not cause the
+            cluster inside the qcow2 image to be dereferenced. This
+            was added to avoid qcow2 fragmentation whithin the image.
+            (on/off; default: off)
+
         ``overlap-check``
             Which overlap checks to perform for writes to the image
             (none/constant/cached/all; default: cached). For details or