fstests: btrfs/125: do not use raid5 for metadata

Message ID	b39ba0c9a496b0ceaa940f4b1e3faf94a4256baf.1724454084.git.wqu@suse.com (mailing list archive)
State	New, archived
Headers	show Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.223.130]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7246E5695; Fri, 23 Aug 2024 23:02:15 +0000 (UTC) From: Qu Wenruo <wqu@suse.com> To: linux-btrfs@vger.kernel.org, fstests@vger.kernel.org Subject: [PATCH] fstests: btrfs/125: do not use raid5 for metadata Date: Sat, 24 Aug 2024 08:31:45 +0930 Message-ID: <b39ba0c9a496b0ceaa940f4b1e3faf94a4256baf.1724454084.git.wqu@suse.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit default: False [-5.01 / 50.00]; BAYES_HAM(-3.00)[100.00%]; DWL_DNSWL_MED(-2.00)[suse.com:dkim]; MID_CONTAINS_FROM(1.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; R_MISSING_CHARSET(0.50)[]; R_DKIM_ALLOW(-0.20)[suse.com:s=susede1]; NEURAL_HAM_SHORT(-0.20)[-1.000]; MIME_GOOD(-0.10)[text/plain]; MX_GOOD(-0.01)[]; RECEIVED_SPAMHAUS_BLOCKED_OPENRESOLVER(0.00)[2a07:de40:b281:106:10:150:64:167:received]; RBL_SPAMHAUS_BLOCKED_OPENRESOLVER(0.00)[2a07:de40:b281:104:10:150:64:97:from]; ARC_NA(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; FUZZY_BLOCKED(0.00)[rspamd.com]; DBL_BLOCKED_OPENRESOLVER(0.00)[imap1.dmz-prg2.suse.org:helo,imap1.dmz-prg2.suse.org:rdns,suse.com:email,suse.com:dkim,suse.com:mid]; RCPT_COUNT_TWO(0.00)[2]; DKIM_SIGNED(0.00)[suse.com:s=susede1]; FROM_EQ_ENVFROM(0.00)[]; FROM_HAS_DN(0.00)[]; MIME_TRACE(0.00)[0:+]; RCVD_TLS_ALL(0.00)[]; TO_DN_NONE(0.00)[]; RCVD_COUNT_TWO(0.00)[2]; TO_MATCH_ENVRCPT_ALL(0.00)[]; DKIM_TRACE(0.00)[suse.com:+]
Series	fstests: btrfs/125: do not use raid5 for metadata \| expand fstests: btrfs/125: do not use raid5 for metadata

Message ID

b39ba0c9a496b0ceaa940f4b1e3faf94a4256baf.1724454084.git.wqu@suse.com (mailing list archive)

State

New, archived

Headers

From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org,
	fstests@vger.kernel.org
Subject: [PATCH] fstests: btrfs/125: do not use raid5 for metadata
Date: Sat, 24 Aug 2024 08:31:45 +0930
Message-ID: 
 <b39ba0c9a496b0ceaa940f4b1e3faf94a4256baf.1724454084.git.wqu@suse.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

fstests: btrfs/125: do not use raid5 for metadata | expand

Commit Message

Qu Wenruo Aug. 23, 2024, 11:01 p.m. UTC

[BUG]
There are several bug reports of btrfs/125 failure recently, either
causing balance failure (-EIO), or even kernel crash.

The balance failure looks like this:

     Mount normal and balance
    +ERROR: error during balancing '/mnt/scratch': Input/output error
    +There may be more info in syslog - try dmesg | tail
    +md5sum: /mnt/scratch/tf2: Input/output error

[CAUSE]
There are several different factors involved.

1. RMW mix the old and new metadata, causing unrepairable corruption
   E.g. with the following layout:

   data 1   |<- Stale metadata ->| (from the out-of-date device)
   data 2   |     Unused         |
   parity   |PPPPPPPPPPPPPPPPPPPP|

   In above case, although metadata on data 1 is out-of-date, we can
   still rebuild the correct data from parity and data 2.

   But if we have new metadata writes into the data 2 stripe, an RMW
   will screw up the whole situation:

   data 1   |<- Stale metadata ->| (from the out-of-date device)
   data 2   |<-  New metadata  ->|
   parity   |XXXXXXXXXXXXXXXXXXXX|

   The RMW will use the stale metadata and new metadata to calculate new
   parity.
   The resulted new parity will no longer be able to recover the old
   data 1.

   This is a known bug, thus our documentation is already recommending
   to avoid RAID56 for metadata usage.

   > Metadata
   >    Do not use raid5 nor raid6 for metadata. Use raid1 or raid1c3
   >    respectively.

   Furthermore this is very hard to fix, unlike data we can fetch the
   data csum and verify during RMW, we can not do that during RMW.

   At the timing of RMW, we're holding the rbio lock for the full
   stripe.
   If the extent tree search requires a read-recover, it will generate
   another rbio, which may cover the same full stripe we're working on,
   leading to a deadlock.

   Furthermore the current RAID56 repair code is all based on veritical
   sectors, but metadata can cross several horizontal sectors.
   This will require multiple combinations to repair a metadata.

2. Crash caused by double freeing a bio
   By chance if the above RMW corrupted csum tree, then during
   btrfs_submit_chunk() we will hit an error path that leads to double
   freeing of a bio, leading to crash when hitting the race window.

   Thankfully the patch has been sent to the mailing list.

[WORKAROUND]
Since it's very hard to fix the RAID56 metadata problem without a
deadlock or a huge code rework, for now just use RAID1 for the metadata
of this particular test case.

There may be a chance to fix the situation by properly marking the
missing-then-reappear device as out-of-date, so no direct read from that
device.

But that will also be a huge new feature, not something can be done
immediately.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 tests/btrfs/125 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Filipe Manana Aug. 26, 2024, 12:32 p.m. UTC | #1

On Sat, Aug 24, 2024 at 12:02 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> There are several bug reports of btrfs/125 failure recently, either
> causing balance failure (-EIO), or even kernel crash.
>
> The balance failure looks like this:
>
>      Mount normal and balance
>     +ERROR: error during balancing '/mnt/scratch': Input/output error
>     +There may be more info in syslog - try dmesg | tail
>     +md5sum: /mnt/scratch/tf2: Input/output error

It's been like this forever. Historically there are periods where this
happens with some frequency, then for no reason it happens very, very,
very, sporadically, then after some time it comes back, and so on.

This has been pointed out and discussed several times in the past, for example:

https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/

https://lore.kernel.org/linux-btrfs/53f7bace2ac75d88ace42dd811d48b7912647301.1654672140.git.wqu@suse.com/#r

It would be good to add links to such discussions in the changelog.

>
> [CAUSE]
> There are several different factors involved.
>
> 1. RMW mix the old and new metadata, causing unrepairable corruption
>    E.g. with the following layout:
>
>    data 1   |<- Stale metadata ->| (from the out-of-date device)
>    data 2   |     Unused         |
>    parity   |PPPPPPPPPPPPPPPPPPPP|
>
>    In above case, although metadata on data 1 is out-of-date, we can
>    still rebuild the correct data from parity and data 2.
>
>    But if we have new metadata writes into the data 2 stripe, an RMW
>    will screw up the whole situation:
>
>    data 1   |<- Stale metadata ->| (from the out-of-date device)
>    data 2   |<-  New metadata  ->|
>    parity   |XXXXXXXXXXXXXXXXXXXX|
>
>    The RMW will use the stale metadata and new metadata to calculate new
>    parity.
>    The resulted new parity will no longer be able to recover the old
>    data 1.
>
>    This is a known bug, thus our documentation is already recommending
>    to avoid RAID56 for metadata usage.
>
>    > Metadata
>    >    Do not use raid5 nor raid6 for metadata. Use raid1 or raid1c3
>    >    respectively.
>
>    Furthermore this is very hard to fix, unlike data we can fetch the
>    data csum and verify during RMW, we can not do that during RMW.
>
>    At the timing of RMW, we're holding the rbio lock for the full
>    stripe.
>    If the extent tree search requires a read-recover, it will generate
>    another rbio, which may cover the same full stripe we're working on,
>    leading to a deadlock.
>
>    Furthermore the current RAID56 repair code is all based on veritical
>    sectors, but metadata can cross several horizontal sectors.
>    This will require multiple combinations to repair a metadata.
>
> 2. Crash caused by double freeing a bio
>    By chance if the above RMW corrupted csum tree, then during
>    btrfs_submit_chunk() we will hit an error path that leads to double
>    freeing of a bio, leading to crash when hitting the race window.
>
>    Thankfully the patch has been sent to the mailing list.

If the changelog is mentioning this bug, then it should also mention
which kernel patch fixes it.

With that update:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.


>
> [WORKAROUND]
> Since it's very hard to fix the RAID56 metadata problem without a
> deadlock or a huge code rework, for now just use RAID1 for the metadata
> of this particular test case.
>
> There may be a chance to fix the situation by properly marking the
> missing-then-reappear device as out-of-date, so no direct read from that
> device.
>
> But that will also be a huge new feature, not something can be done
> immediately.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  tests/btrfs/125 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tests/btrfs/125 b/tests/btrfs/125
> index 31379d81ef73..c8c0dd422f72 100755
> --- a/tests/btrfs/125
> +++ b/tests/btrfs/125
> @@ -70,7 +70,7 @@ count=$(($max_fs_sz / 1000000))
>  echo >> $seqres.full
>  echo "max_fs_sz=$max_fs_sz count=$count" >> $seqres.full
>  echo "-----Initialize -----" >> $seqres.full
> -_scratch_pool_mkfs "-mraid5 -draid5" >> $seqres.full 2>&1
> +_scratch_pool_mkfs "-mraid1 -draid5" >> $seqres.full 2>&1
>  _scratch_mount >> $seqres.full 2>&1
>  dd if=/dev/zero of="$SCRATCH_MNT"/tf1 bs=$bs count=1 \
>                                         >>$seqres.full 2>&1
> --
> 2.46.0
>
>

diff --git a/tests/btrfs/125 b/tests/btrfs/125
index 31379d81ef73..c8c0dd422f72 100755
--- a/tests/btrfs/125
+++ b/tests/btrfs/125
@@ -70,7 +70,7 @@  count=$(($max_fs_sz / 1000000))
 echo >> $seqres.full
 echo "max_fs_sz=$max_fs_sz count=$count" >> $seqres.full
 echo "-----Initialize -----" >> $seqres.full
-_scratch_pool_mkfs "-mraid5 -draid5" >> $seqres.full 2>&1
+_scratch_pool_mkfs "-mraid1 -draid5" >> $seqres.full 2>&1
 _scratch_mount >> $seqres.full 2>&1
 dd if=/dev/zero of="$SCRATCH_MNT"/tf1 bs=$bs count=1 \
 					>>$seqres.full 2>&1

fstests: btrfs/125: do not use raid5 for metadata

Commit Message

Comments

Patch