
btrfs: add a test case to make sure scrub can repair parity corruption

Message ID 20230630013614.56667-1-wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: add a test case to make sure scrub can repair parity corruption

Commit Message

Qu Wenruo June 30, 2023, 1:36 a.m. UTC
There is a kernel regression caused by commit 75b470332965 ("btrfs:
raid56: migrate recovery and scrub recovery path to use error_bitmap"),
which leads to scrub not repairing corrupted parity stripes.

So here we add a test case to verify the P/Q stripe scrub behavior by:

- Create a RAID5 or RAID6 btrfs with the minimal number of devices
  This means 2 devices for RAID5, and 3 devices for RAID6.
  This results in the parity stripe being a mirror of the only data
  stripe (see the sketch after this list).

  And since we control the content of the data stripe, the content
  of the P stripe is also fixed.

- Create a 64K file
  The file covers exactly one data stripe.

- Corrupt the P stripe

- Scrub the fs
  If scrub is working, the P stripe will be repaired.

  Unfortunately scrub cannot report any P/Q corruption, limited by its
  reporting structure.
  So we cannot use the return value of scrub to determine whether we
  repaired anything.

- Verify the content of the P stripe

- Use "btrfs check --check-data-csum" to double check

With the above steps, we can verify whether the P stripe is properly repaired.
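
As a minimal sketch of the reasoning above (not taken from the patch itself):
RAID5 parity is the bytewise XOR of all data stripes in a full stripe, so with
a single data stripe the P stripe is simply a copy of the data, and the same
holds for the P stripe of the minimal RAID6 layout.

  # Parity math assumed above, runnable in bash:
  d=0xaa                        # the byte pattern written by the test
  p=$(( d ^ 0 ))                # XOR over the single data stripe
  printf 'expected parity byte: 0x%02x\n' "$p"   # prints 0xaa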

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 tests/btrfs/294     | 86 +++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/294.out |  2 ++
 2 files changed, 88 insertions(+)
 create mode 100755 tests/btrfs/294
 create mode 100644 tests/btrfs/294.out

Comments

Anand Jain July 22, 2023, 1:24 p.m. UTC | #1
On 30/06/2023 09:36, Qu Wenruo wrote:
> There is a kernel regression caused by commit 75b470332965 ("btrfs:
> raid56: migrate recovery and scrub recovery path to use error_bitmap"),
> which leads to scrub not repairing corrupted parity stripes.
> 
> So here we add a test case to verify the P/Q stripe scrub behavior by:
> 
> - Create a RAID5 or RAID6 btrfs with minimal amount of devices
>    This means 2 devices for RAID5, and 3 devices for RAID6.
>    This would result the parity stripe to be a mirror of the only data
>    stripe.
> 
>    And since we have control of the content of data stripes, the content
>    of the P stripe is also fixed.
> 
> - Create an 64K file
>    The file would cover one data stripe.
> 
> - Corrupt the P stripe
> 
> - Scrub the fs
>    If scrub is working, the P stripe would be repaired.
> 
>    Unfortunately scrub can not report any P/Q corruption, limited by its
>    reporting structure.
>    So we can not use the return value of scrub to determine if we
>    repaired anything.
> 
> - Verify the content of the P stripe
> 
> - Use "btrfs check --check-data-csum" to double check
> 
> By above steps, we can verify if the P stripe is properly fixed.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   tests/btrfs/294     | 86 +++++++++++++++++++++++++++++++++++++++++++++
>   tests/btrfs/294.out |  2 ++
>   2 files changed, 88 insertions(+)
>   create mode 100755 tests/btrfs/294
>   create mode 100644 tests/btrfs/294.out
> 
> diff --git a/tests/btrfs/294 b/tests/btrfs/294
> new file mode 100755
> index 00000000..97b85988
> --- /dev/null
> +++ b/tests/btrfs/294
> @@ -0,0 +1,86 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2023 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FS QA Test 294
> +#
> +# Make sure btrfs scrub can fix parity stripe corruption
> +#
> +. ./common/preamble
> +_begin_fstest auto quick raid scrub
> +
> +. ./common/filter
> +
> +_supported_fs btrfs
> +_require_odirect
> +_require_non_zoned_device "${SCRATCH_DEV}"
> +_require_scratch_dev_pool 3
> +_fixed_by_kernel_commit xxxxxxxxxxxx \
> +	"btrfs: raid56: always verify the P/Q contents for scrub"
> +
> +workload()
> +{
> +	local profile=$1
> +	local nr_devs=$2
> +
> +	echo "=== Testing $nr_devs devices $profile ===" >> $seqres.full
> +	_scratch_dev_pool_get $nr_devs
> +
> +	_scratch_pool_mkfs -d $profile -m single >> $seqres.full 2>&1
> +	# Disable space cache to prevent v1 space cache affecting
> +	# the result.


> +	_scratch_mount -o nospace_cache

[ 1562.331370] BTRFS error (device sdb3): cannot disable free space tree


Mount fails with the -o nospace_cache option.

> +
> +	# Create one 64K extent which would cover one data stripe.
> +	$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 64K 0 64K" \
> +		"$SCRATCH_MNT/foobar" > /dev/null
> +	sync
> +
> +	# Corrupt the P/Q stripe
> +	local logical=$(_btrfs_get_first_logical $SCRATCH_MNT/foobar)
> +
> +	# The 2nd copy is pointed to P stripe directly.
> +	physical_p=$(_btrfs_get_physical ${logical} 2)
> +	devpath_p=$(_btrfs_get_device_path ${logical} 2)
> +
> +	_scratch_unmount
> +
> +	echo "Corrupt stripe P at devpath $devpath_p physical $physical_p" \
> +		>> $seqres.full
> +	$XFS_IO_PROG -d -c "pwrite -S 0xff -b 64K $physical_p 64K" $devpath_p \
> +		> /dev/null
> +
> +	# Do a scrub to try repair the P stripe.
> +	_scratch_mount -o nospace_cache
> +	$BTRFS_UTIL_PROG scrub start -BdR $SCRATCH_MNT >> $seqres.full 2>&1
> +	_scratch_unmount
> +
> +	# Verify the repaired content directly
> +	local output=$($XFS_IO_PROG -c "pread -qv $physical_p 16" $devpath_p | _filter_xfs_io_offset)
> +	local expect="XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................"
> +
> +	echo "The first 16 bytes of parity stripe after scrub:" >> $seqres.full

> +	echo $output >> $seqres.full
> +	

White space error here.

Thanks.

> +	if [ "$output" != "$expect" ]; then
> +		echo "Unexpected parity content"
> +		echo "has:"
> +		echo "$output"
> +		echo "expect"
> +		echo "$expect"
> +	fi
> +
> +	# Last safenet, let btrfs check --check-data-csum to do an offline scrub.
> +	$BTRFS_UTIL_PROG check --check-data-csum $SCRATCH_DEV >> $seqres.full 2>&1
> +	if [ $? -ne 0 ]; then
> +		echo "Error detected after the scrub"
> +	fi
> +	_scratch_dev_pool_put
> +}
> +
> +workload raid5 2
> +workload raid6 3
> +
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/btrfs/294.out b/tests/btrfs/294.out
> new file mode 100644
> index 00000000..c09531de
> --- /dev/null
> +++ b/tests/btrfs/294.out
> @@ -0,0 +1,2 @@
> +QA output created by 294
> +Silence is golden
Qu Wenruo July 23, 2023, 1:26 a.m. UTC | #2
On 2023/7/22 21:24, Anand Jain wrote:
>
>
> On 30/06/2023 09:36, Qu Wenruo wrote:
>> There is a kernel regression caused by commit 75b470332965 ("btrfs:
>> raid56: migrate recovery and scrub recovery path to use error_bitmap"),
>> which leads to scrub not repairing corrupted parity stripes.
>>
>> So here we add a test case to verify the P/Q stripe scrub behavior by:
>>
>> - Create a RAID5 or RAID6 btrfs with minimal amount of devices
>>    This means 2 devices for RAID5, and 3 devices for RAID6.
>>    This would result the parity stripe to be a mirror of the only data
>>    stripe.
>>
>>    And since we have control of the content of data stripes, the content
>>    of the P stripe is also fixed.
>>
>> - Create an 64K file
>>    The file would cover one data stripe.
>>
>> - Corrupt the P stripe
>>
>> - Scrub the fs
>>    If scrub is working, the P stripe would be repaired.
>>
>>    Unfortunately scrub can not report any P/Q corruption, limited by its
>>    reporting structure.
>>    So we can not use the return value of scrub to determine if we
>>    repaired anything.
>>
>> - Verify the content of the P stripe
>>
>> - Use "btrfs check --check-data-csum" to double check
>>
>> By above steps, we can verify if the P stripe is properly fixed.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   tests/btrfs/294     | 86 +++++++++++++++++++++++++++++++++++++++++++++
>>   tests/btrfs/294.out |  2 ++
>>   2 files changed, 88 insertions(+)
>>   create mode 100755 tests/btrfs/294
>>   create mode 100644 tests/btrfs/294.out
>>
>> diff --git a/tests/btrfs/294 b/tests/btrfs/294
>> new file mode 100755
>> index 00000000..97b85988
>> --- /dev/null
>> +++ b/tests/btrfs/294
>> @@ -0,0 +1,86 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (C) 2023 SUSE Linux Products GmbH. All Rights Reserved.
>> +#
>> +# FS QA Test 294
>> +#
>> +# Make sure btrfs scrub can fix parity stripe corruption
>> +#
>> +. ./common/preamble
>> +_begin_fstest auto quick raid scrub
>> +
>> +. ./common/filter
>> +
>> +_supported_fs btrfs
>> +_require_odirect
>> +_require_non_zoned_device "${SCRATCH_DEV}"
>> +_require_scratch_dev_pool 3
>> +_fixed_by_kernel_commit xxxxxxxxxxxx \
>> +    "btrfs: raid56: always verify the P/Q contents for scrub"
>> +
>> +workload()
>> +{
>> +    local profile=$1
>> +    local nr_devs=$2
>> +
>> +    echo "=== Testing $nr_devs devices $profile ===" >> $seqres.full
>> +    _scratch_dev_pool_get $nr_devs
>> +
>> +    _scratch_pool_mkfs -d $profile -m single >> $seqres.full 2>&1
>> +    # Disable space cache to prevent v1 space cache affecting
>> +    # the result.
>
>
>> +    _scratch_mount -o nospace_cache
>
> [ 1562.331370] BTRFS error (device sdb3): cannot disable free space tree
>
>
> Mount is failing for the -o nospace_cache option.

Indeed, I guess this is caused by the block group tree feature.

The original intention was just to prevent the v1 cache. Considering that v2
is required by several features, I'd switch to explicitly requiring the v2
cache instead.

Thanks,
Qu
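
A minimal sketch of that adjustment (assuming the standard space_cache=v2
mount option; this is not the actual v2 of the patch):

  # Request the v2 space cache (free space tree) explicitly instead of
  # disabling the v1 cache, so the mount also succeeds when the block
  # group tree feature is in use and the free space tree cannot be
  # disabled.
  _scratch_mount -o space_cache=v2
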
>
>> +
>> +    # Create one 64K extent which would cover one data stripe.
>> +    $XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 64K 0 64K" \
>> +        "$SCRATCH_MNT/foobar" > /dev/null
>> +    sync
>> +
>> +    # Corrupt the P/Q stripe
>> +    local logical=$(_btrfs_get_first_logical $SCRATCH_MNT/foobar)
>> +
>> +    # The 2nd copy is pointed to P stripe directly.
>> +    physical_p=$(_btrfs_get_physical ${logical} 2)
>> +    devpath_p=$(_btrfs_get_device_path ${logical} 2)
>> +
>> +    _scratch_unmount
>> +
>> +    echo "Corrupt stripe P at devpath $devpath_p physical $physical_p" \
>> +        >> $seqres.full
>> +    $XFS_IO_PROG -d -c "pwrite -S 0xff -b 64K $physical_p 64K"
>> $devpath_p \
>> +        > /dev/null
>> +
>> +    # Do a scrub to try repair the P stripe.
>> +    _scratch_mount -o nospace_cache
>> +    $BTRFS_UTIL_PROG scrub start -BdR $SCRATCH_MNT >> $seqres.full 2>&1
>> +    _scratch_unmount
>> +
>> +    # Verify the repaired content directly
>> +    local output=$($XFS_IO_PROG -c "pread -qv $physical_p 16"
>> $devpath_p | _filter_xfs_io_offset)
>> +    local expect="XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa
>> aa aa aa  ................"
>> +
>> +    echo "The first 16 bytes of parity stripe after scrub:" >>
>> $seqres.full
>
>> +    echo $output >> $seqres.full
>> +
>
> White space error here.
>
> Thanks.
>
>> +    if [ "$output" != "$expect" ]; then
>> +        echo "Unexpected parity content"
>> +        echo "has:"
>> +        echo "$output"
>> +        echo "expect"
>> +        echo "$expect"
>> +    fi
>> +
>> +    # Last safenet, let btrfs check --check-data-csum to do an
>> offline scrub.
>> +    $BTRFS_UTIL_PROG check --check-data-csum $SCRATCH_DEV >>
>> $seqres.full 2>&1
>> +    if [ $? -ne 0 ]; then
>> +        echo "Error detected after the scrub"
>> +    fi
>> +    _scratch_dev_pool_put
>> +}
>> +
>> +workload raid5 2
>> +workload raid6 3
>> +
>> +echo "Silence is golden"
>> +status=0
>> +exit
>> diff --git a/tests/btrfs/294.out b/tests/btrfs/294.out
>> new file mode 100644
>> index 00000000..c09531de
>> --- /dev/null
>> +++ b/tests/btrfs/294.out
>> @@ -0,0 +1,2 @@
>> +QA output created by 294
>> +Silence is golden
>

Patch

diff --git a/tests/btrfs/294 b/tests/btrfs/294
new file mode 100755
index 00000000..97b85988
--- /dev/null
+++ b/tests/btrfs/294
@@ -0,0 +1,86 @@ 
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2023 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# FS QA Test 294
+#
+# Make sure btrfs scrub can fix parity stripe corruption
+#
+. ./common/preamble
+_begin_fstest auto quick raid scrub
+
+. ./common/filter
+
+_supported_fs btrfs
+_require_odirect
+_require_non_zoned_device "${SCRATCH_DEV}"
+_require_scratch_dev_pool 3
+_fixed_by_kernel_commit xxxxxxxxxxxx \
+	"btrfs: raid56: always verify the P/Q contents for scrub"
+
+workload()
+{
+	local profile=$1
+	local nr_devs=$2
+
+	echo "=== Testing $nr_devs devices $profile ===" >> $seqres.full
+	_scratch_dev_pool_get $nr_devs
+
+	_scratch_pool_mkfs -d $profile -m single >> $seqres.full 2>&1
+	# Disable the space cache to prevent the v1 space cache from
+	# affecting the result.
+	_scratch_mount -o nospace_cache
+
+	# Create one 64K extent which would cover one data stripe.
+	$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 64K 0 64K" \
+		"$SCRATCH_MNT/foobar" > /dev/null
+	sync
+
+	# Corrupt the P/Q stripe
+	local logical=$(_btrfs_get_first_logical $SCRATCH_MNT/foobar)
+
+	# The 2nd copy points directly to the P stripe.
+	physical_p=$(_btrfs_get_physical ${logical} 2)
+	devpath_p=$(_btrfs_get_device_path ${logical} 2)
+
+	_scratch_unmount
+
+	echo "Corrupt stripe P at devpath $devpath_p physical $physical_p" \
+		>> $seqres.full
+	$XFS_IO_PROG -d -c "pwrite -S 0xff -b 64K $physical_p 64K" $devpath_p \
+		> /dev/null
+
+	# Do a scrub to try to repair the P stripe.
+	_scratch_mount -o nospace_cache
+	$BTRFS_UTIL_PROG scrub start -BdR $SCRATCH_MNT >> $seqres.full 2>&1
+	_scratch_unmount
+
+	# Verify the repaired content directly
+	local output=$($XFS_IO_PROG -c "pread -qv $physical_p 16" $devpath_p | _filter_xfs_io_offset)
+	local expect="XXXXXXXX:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................"
+
+	echo "The first 16 bytes of parity stripe after scrub:" >> $seqres.full
+	echo $output >> $seqres.full
+	
+	if [ "$output" != "$expect" ]; then
+		echo "Unexpected parity content"
+		echo "has:"
+		echo "$output"
+		echo "expect"
+		echo "$expect"
+	fi
+
+	# As a last safety net, let btrfs check --check-data-csum do an offline scrub.
+	$BTRFS_UTIL_PROG check --check-data-csum $SCRATCH_DEV >> $seqres.full 2>&1
+	if [ $? -ne 0 ]; then
+		echo "Error detected after the scrub"
+	fi
+	_scratch_dev_pool_put
+}
+
+workload raid5 2
+workload raid6 3
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/btrfs/294.out b/tests/btrfs/294.out
new file mode 100644
index 00000000..c09531de
--- /dev/null
+++ b/tests/btrfs/294.out
@@ -0,0 +1,2 @@ 
+QA output created by 294
+Silence is golden
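
For reference, a typical way to exercise the new test from an fstests checkout
(the device paths and mount points below are placeholders, and the scratch
pool needs at least three devices for the RAID6 case):

  # Example configuration, usually kept in local.config:
  export TEST_DEV=/dev/vdb
  export TEST_DIR=/mnt/test
  export SCRATCH_DEV_POOL="/dev/vdc /dev/vdd /dev/vde"
  export SCRATCH_MNT=/mnt/scratch

  # Run just this test case:
  ./check btrfs/294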