
[1/1] Fix qcow2 corruption on discard

Message ID 20201123154929.330338-2-mlevitsk@redhat.com (mailing list archive)
State New, archived
Series Fix qcow2 corruption after addition of subcluster support

Commit Message

Maxim Levitsky Nov. 23, 2020, 3:49 p.m. UTC
Commit 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
introduced a subtle change to code in zero_in_l2_slice:

It swapped the order of

1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
2. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
3. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);

To

1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
2. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
3. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);

It seems harmless; however, the call to qcow2_free_any_clusters
can trigger a cache flush which can mark the L2 table as clean,
and if that was the last write to it,
a stale version of it will remain on the disk.

Now we have a valid L2 entry pointing to a freed cluster. Oops.
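
To illustrate, the loop body in zero_in_l2_slice() after that commit
looks roughly like this (simplified sketch, not verbatim code):

    qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);

    if (unmap) {
        /* Can trigger a cache flush that writes out the L2 slice and
         * marks it clean again ... */
        qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST);
    }

    /* ... so this update may stay only in memory: the cached slice is
     * already marked clean, and if nothing dirties it again, the
     * on-disk L2 table keeps pointing at the cluster just freed. */
    set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);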

Fixes: 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 block/qcow2-cluster.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Kevin Wolf Nov. 23, 2020, 4:09 p.m. UTC | #1
Am 23.11.2020 um 16:49 hat Maxim Levitsky geschrieben:
> Commit 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
> introduced a subtle change to code in zero_in_l2_slice:
> 
> It swapped the order of
> 
> 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> 2. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
> 3. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
> 
> To
> 
> 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> 2. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
> 3. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
> 
> It seems harmless, however the call to qcow2_free_any_clusters
> can trigger a cache flush which can mark the L2 table as clean,
> and assuming that this was the last write to it,
> a stale version of it will remain on the disk.
> 
> Now we have a valid L2 entry pointing to a freed cluster. Oops.
> 
> Fixes: 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  block/qcow2-cluster.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 485b4cb92e..267b46a4ca 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -2010,11 +2010,11 @@ static int zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
>              continue;
>          }
>  
> -        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
>          if (unmap) {
>              qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST);
>          }
>          set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
> +        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);

Good catch, but I think your order is wrong, too. We need the original
order from before 205fa50750:

1. qcow2_cache_entry_mark_dirty()
   set_l2_entry() + set_l2_bitmap()

2. qcow2_free_any_cluster()

The order between qcow2_cache_entry_mark_dirty() and set_l2_entry()
shouldn't matter, but it's important that we update the refcount table
only after the L2 table has been updated to not reference the cluster
any more.

Otherwise a crash could lead to a situation where the cluster gets
allocated again (because it has refcount 0) while it is still in use in
an L2 table. This is a classic corruption scenario.
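
That is, the loop body should look roughly like this (a sketch based on
the hunk quoted above, not a tested patch):

    /* 1. update the cached L2 slice and mark it dirty */
    qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
    set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
    if (has_subclusters(s)) {
        set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap);
    }

    /* 2. only then release the old cluster's refcount */
    if (unmap) {
        qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST);
    }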

Kevin
Kevin Wolf Nov. 23, 2020, 5:38 p.m. UTC | #2
Am 23.11.2020 um 16:49 hat Maxim Levitsky geschrieben:
> Commit 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
> introduced a subtle change to code in zero_in_l2_slice:
> 
> It swapped the order of
> 
> 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> 2. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
> 3. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
> 
> To
> 
> 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> 2. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
> 3. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
> 
> It seems harmless, however the call to qcow2_free_any_clusters
> can trigger a cache flush which can mark the L2 table as clean,
> and assuming that this was the last write to it,
> a stale version of it will remain on the disk.

Do you have more details on this last paragraph? I'm trying to come up
with a reproducer, but I don't see how qcow2_free_any_clusters() could
flush the L2 table cache. (It's easy to get it to flush the refcount
block cache, but that's useless for a reproducer.)

The only way I see to flush any cache with it is in update_refcount()
the qcow2_cache_set_dependency() call. This will always flush the cache
that the L2 cache depends on - which will never be the L2 cache itself,
but always either the refcount cache or nothing.

There are more options in alloc_refcount_block() if we're allocating a
new refcount block, but in the context of freeing clusters we'll never
need to do that.

Whatever I tried, at the end of zero_in_l2_slice(), I have a dirty L2
table and a dirty refcount block in the cache, with a dependency that
makes sure that the L2 table will be written out first.

If you don't have the information yet, can you try to debug your manual
reproducer a bit more to find out how this happens?

Kevin

> Now we have a valid L2 entry pointing to a freed cluster. Oops.
> 
> Fixes: 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  block/qcow2-cluster.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 485b4cb92e..267b46a4ca 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -2010,11 +2010,11 @@ static int zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
>              continue;
>          }
>  
> -        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
>          if (unmap) {
>              qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST);
>          }
>          set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
> +        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
>          if (has_subclusters(s)) {
>              set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap);
>          }
> -- 
> 2.26.2
>
Maxim Levitsky Nov. 23, 2020, 6:11 p.m. UTC | #3
On Mon, 2020-11-23 at 18:38 +0100, Kevin Wolf wrote:
> Am 23.11.2020 um 16:49 hat Maxim Levitsky geschrieben:
> > Commit 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
> > introduced a subtle change to code in zero_in_l2_slice:
> > 
> > It swapped the order of
> > 
> > 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> > 2. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
> > 3. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
> > 
> > To
> > 
> > 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> > 2. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
> > 3. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
> > 
> > It seems harmless, however the call to qcow2_free_any_clusters
> > can trigger a cache flush which can mark the L2 table as clean,
> > and assuming that this was the last write to it,
> > a stale version of it will remain on the disk.
> 
> Do you have more details on this last paragraph? I'm trying to come up
> with a reproducer, but I don't see how qcow2_free_any_clusters() could
> flush the L2 table cache. (It's easy to get it to flush the refcount
> block cache, but that's useless for a reproducer.)
> 
> The only way I see to flush any cache with it is in update_refcount()
> the qcow2_cache_set_dependency() call. This will always flush the cache
> that the L2 cache depends on - which will never be the L2 cache itself,
> but always either the refcount cache or nothing.
> 
> There are more options in alloc_refcount_block() if we're allocating a
> new refcount block, but in the context of freeing clusters we'll never
> need to do that.
> 
> Whatever I tried, at the end of zero_in_l2_slice(), I have a dirty L2
> table and a dirty refcount block in the cache, with a dependency that
> makes sure that the L2 table will be written out first.
> 
> If you don't have the information yet, can you try to debug your manual
> reproducer a bit more to find out how this happens?
I'll do this tomorrow.
Best regards,
	Maxim Levitsky

> 
> Kevin
> 
> > Now we have a valid L2 entry pointing to a freed cluster. Oops.
> > 
> > Fixes: 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  block/qcow2-cluster.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> > index 485b4cb92e..267b46a4ca 100644
> > --- a/block/qcow2-cluster.c
> > +++ b/block/qcow2-cluster.c
> > @@ -2010,11 +2010,11 @@ static int zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
> >              continue;
> >          }
> >  
> > -        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> >          if (unmap) {
> >              qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST);
> >          }
> >          set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
> > +        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> >          if (has_subclusters(s)) {
> >              set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap);
> >          }
> > -- 
> > 2.26.2
> >
Alberto Garcia Nov. 23, 2020, 6:20 p.m. UTC | #4
On Mon 23 Nov 2020 05:09:41 PM CET, Kevin Wolf wrote:
>> Commit 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
>> introduced a subtle change to code in zero_in_l2_slice:
>> 
>> It swapped the order of
>> 
>> 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
>> 2. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
>> 3. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
>> 
>> To
>> 
>> 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
>> 2. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
>> 3. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);

Ouch :( Good catch!

>> -        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
>>          if (unmap) {
>>              qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST);
>>          }
>>          set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
>> +        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
>
> Good catch, but I think your order is wrong, too. We need the original
> order from before 205fa50750:
>
> 1. qcow2_cache_entry_mark_dirty()
>    set_l2_entry() + set_l2_bitmap()
>
> 2. qcow2_free_any_cluster()

I agree with Kevin on this.

Berto
Maxim Levitsky Nov. 23, 2020, 6:23 p.m. UTC | #5
On Mon, 2020-11-23 at 19:20 +0100, Alberto Garcia wrote:
> On Mon 23 Nov 2020 05:09:41 PM CET, Kevin Wolf wrote:
> > > Commit 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
> > > introduced a subtle change to code in zero_in_l2_slice:
> > > 
> > > It swapped the order of
> > > 
> > > 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> > > 2. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
> > > 3. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
> > > 
> > > To
> > > 
> > > 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> > > 2. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
> > > 3. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
> 
> Ouch :( Good catch!
> 
> > > -        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> > >          if (unmap) {
> > >              qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST);
> > >          }
> > >          set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
> > > +        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> > 
> > Good catch, but I think your order is wrong, too. We need the original
> > order from before 205fa50750:
> > 
> > 1. qcow2_cache_entry_mark_dirty()
> >    set_l2_entry() + set_l2_bitmap()
> > 
> > 2. qcow2_free_any_cluster()
> 
> I agree with Kevin on this.

I also agree; I hadn't thought about this.

Best regards,
	Maxim Levitsky
> 
> Berto
>
Kevin Wolf Nov. 24, 2020, 9:17 a.m. UTC | #6
Am 23.11.2020 um 19:11 hat Maxim Levitsky geschrieben:
> On Mon, 2020-11-23 at 18:38 +0100, Kevin Wolf wrote:
> > Am 23.11.2020 um 16:49 hat Maxim Levitsky geschrieben:
> > > Commit 205fa50750 ("qcow2: Add subcluster support to zero_in_l2_slice()")
> > > introduced a subtle change to code in zero_in_l2_slice:
> > > 
> > > It swapped the order of
> > > 
> > > 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> > > 2. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
> > > 3. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
> > > 
> > > To
> > > 
> > > 1. qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
> > > 2. qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
> > > 3. set_l2_entry(s, l2_slice, l2_index + i, QCOW_OFLAG_ZERO);
> > > 
> > > It seems harmless, however the call to qcow2_free_any_clusters
> > > can trigger a cache flush which can mark the L2 table as clean,
> > > and assuming that this was the last write to it,
> > > a stale version of it will remain on the disk.
> > 
> > Do you have more details on this last paragraph? I'm trying to come up
> > with a reproducer, but I don't see how qcow2_free_any_clusters() could
> > flush the L2 table cache. (It's easy to get it to flush the refcount
> > block cache, but that's useless for a reproducer.)
> > 
> > The only way I see to flush any cache with it is in update_refcount()
> > the qcow2_cache_set_dependency() call. This will always flush the cache
> > that the L2 cache depends on - which will never be the L2 cache itself,
> > but always either the refcount cache or nothing.
> > 
> > There are more options in alloc_refcount_block() if we're allocating a
> > new refcount block, but in the context of freeing clusters we'll never
> > need to do that.
> > 
> > Whatever I tried, at the end of zero_in_l2_slice(), I have a dirty L2
> > table and a dirty refcount block in the cache, with a dependency that
> > makes sure that the L2 table will be written out first.
> > 
> > If you don't have the information yet, can you try to debug your manual
> > reproducer a bit more to find out how this happens?
> I'll do this tomorrow.

As the last RC for 5.2 is today, I will send a v2 that changes the fix
to restore the original order.

We can then continue work to find a minimal reproducer and merge the
test case in the early 6.0 cycle.

Kevin
Alberto Garcia Nov. 24, 2020, 6:59 p.m. UTC | #7
On Tue 24 Nov 2020 10:17:23 AM CET, Kevin Wolf wrote:
> We can then continue work to find a minimal reproducer and merge the
> test case in the early 6.0 cycle.

I haven't been able to reproduce the problem yet, do you have any
findings?

Berto
Maxim Levitsky Nov. 24, 2020, 6:59 p.m. UTC | #8
On Tue, 2020-11-24 at 19:59 +0100, Alberto Garcia wrote:
> On Tue 24 Nov 2020 10:17:23 AM CET, Kevin Wolf wrote:
> > We can then continue work to find a minimal reproducer and merge the
> > test case in the early 6.0 cycle.
> 
> I haven't been able to reproduce the problem yet, do you have any
> findings?
> 
> Berto
> 

I have a working reproducer script. I'll send it in an hour or so.
Best regards,
	Maxim Levitsky
Maxim Levitsky Nov. 24, 2020, 7:44 p.m. UTC | #9
On Tue, 2020-11-24 at 20:59 +0200, Maxim Levitsky wrote:
> On Tue, 2020-11-24 at 19:59 +0100, Alberto Garcia wrote:
> > On Tue 24 Nov 2020 10:17:23 AM CET, Kevin Wolf wrote:
> > > We can then continue work to find a minimal reproducer and merge the
> > > test case in the early 6.0 cycle.
> > 
> > I haven't been able to reproduce the problem yet, do you have any
> > findings?
> > 
> > Berto
> > 
> 
> I have a working reproducer script. I'll send it in a hour or so.
> Best regards,
> 	Maxim Levitsky

I have attached a minimal reproducer for this issue.
I can convert this to an iotest if you think that this is worth it.


So these are the exact conditions for the corruption to happen:

1. The image must have at least 5 refcount blocks
   (one more than the default refcount block cache size, which is 4 by default).


2. An IO pattern that fully populates the 4-entry refcount block cache:

   The easiest way to do this is to have 4 L2 entries populated in the base image,
   such that each entry references a physical cluster served by a different
   refcount block.

   Then discard these entries in the snapshot, triggering discards in the
   base file during the commit, which populates the refcount block cache.


3. A discard of a cluster that belongs to the 5th refcount block (done in
   exactly the same way as the discards above).
   It has to happen soon, before the L2 cache is flushed by some unrelated
   IO.

   This triggers the corruption:

The call stack is:

qcow2_free_any_cluster ->
    qcow2_free_clusters ->
        update_refcount():

            /* This sets a dependency between flushing the refcount
             * cache and the L2 cache. */
            if (decrease)
                qcow2_cache_set_dependency(bs, s->refcount_block_cache, s->l2_table_cache);

            ret = alloc_refcount_block(bs, cluster_index, &refcount_block);
                return load_refcount_block(bs, refcount_block_offset, refcount_block);
                    return qcow2_cache_get(...) ->
                        qcow2_cache_do_get():
                            /* because of a cache miss, we have to evict an entry */
                            ret = qcow2_cache_entry_flush(bs, c, i);
                            if (c->depends) {
                                /* this flushes the L2 cache */
                                ret = qcow2_cache_flush_dependency(bs, c);
                            }


I have attached a reproducer that works with almost any cluster size and
refcount width.
Cluster sizes below 4K don't work because the commit is done by the mirror
job, which works at 4K granularity, and that results in it not issuing any
discards due to various alignment restrictions.

If I patch qemu to make the mirror job work at 512B granularity, the test
reproduces for small clusters as well.

The reproducer creates a qcow2 image in the current directory; it needs
about 11G for the default parameters (64K cluster size, 16-bit refcounts).
For a 4K cluster size and 64-bit refcounts it needs only 11M
(this can be changed by editing the script).
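
(The sizes roughly follow from how much data one refcount block covers:
with 64K clusters and 16-bit refcounts one refcount block holds
65536 / 2 = 32768 entries, i.e. covers 32768 * 64K = 2G, so touching a
5th refcount block needs an image somewhat larger than 4 * 2G = 8G;
with 4K clusters and 64-bit refcounts one block covers
(4096 / 8) * 4K = 2M, so a bit over 8M is enough.)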

Best regards,
	Maxim Levitsky
Alberto Garcia Nov. 25, 2020, 4:49 p.m. UTC | #10
On Tue 24 Nov 2020 08:44:00 PM CET, Maxim Levitsky wrote:
> On Tue, 2020-11-24 at 20:59 +0200, Maxim Levitsky wrote:
>> On Tue, 2020-11-24 at 19:59 +0100, Alberto Garcia wrote:
>> > On Tue 24 Nov 2020 10:17:23 AM CET, Kevin Wolf wrote:
>> > > We can then continue work to find a minimal reproducer and merge the
>> > > test case in the early 6.0 cycle.
>> > 
>> > I haven't been able to reproduce the problem yet, do you have any
>> > findings?
>> > 
>> > Berto
>> > 
>> 
>> I have a working reproducer script. I'll send it in a hour or so.
>> Best regards,
>> 	Maxim Levitsky
>
> I have attached a minimal reproducer for this issue.
> I can convert this to an iotest if you think that this is worth it.

I think it would be a good idea to have an iotest; I can also prepare
it if you don't have time.

Thanks for the script!

Berto

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 485b4cb92e..267b46a4ca 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2010,11 +2010,11 @@  static int zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
             continue;
         }
 
-        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
         if (unmap) {
             qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST);
         }
         set_l2_entry(s, l2_slice, l2_index + i, new_l2_entry);
+        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
         if (has_subclusters(s)) {
             set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap);
         }