Message ID | 20180226073111.3066-1-wqu@suse.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
On Mon, Feb 26, 2018 at 9:31 AM, Qu Wenruo <wqu@suse.com> wrote:
> This test case is originally designed to expose unexpected corruption
> for btrfs, where there are several reports about btrfs serious metadata
> corruption after power loss.
>
> The test case itself will trigger heavy fsstress for the fs, and use
> dm-flakey to emulate power loss by dropping all later writes.
>

Come on... dm-flakey is so 2016
You should take Josef's fsstress+log-writes test and bring it to fstests:
https://github.com/josefbacik/log-writes

By doing that you will gain two very important features from the test:

1. Problems will be discovered much faster, because the test can run fsck
   after every single block write has been replayed instead of just at random
   times like in your test

2. Absolute guarantee of reproducing the problem by replaying the write log.
   Even though your fsstress could use a pre-defined random seed, the results
   will be far from reproducible, because of process and IO scheduling
   differences between subsequent test runs.
   When you catch an inconsistency with the log-writes test, you can send the
   write-log recording to the maintainer to analyze the problem, even if it is
   a hard problem to hit. I used that useful technique for ext4, btrfs and xfs
   when I ran tests with generic/455 and found problems.

Cheers,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

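For readers unfamiliar with dm-log-writes, here is a minimal sketch of how such a logging device is usually set up. It only prints the commands rather than running them, so it needs no root access. The table format and the `mark` message follow the kernel's dm-log-writes documentation; the device paths, the device name `log`, and the helper function names are placeholders invented for illustration.

```shell
#!/bin/bash
# Sketch only: prints commands instead of executing them.
# /dev/sdX (data device) and /dev/sdY (log device) are placeholder paths.

# Build the device-mapper table line for a log-writes target:
#   <start> <length in sectors> log-writes <dev> <logdev>
log_writes_table() {
	local dev=$1 logdev=$2 sectors=$3
	echo "0 $sectors log-writes $dev $logdev"
}

# Print the dmsetup invocation that would create the logging device.
log_writes_create_cmd() {
	local name=$1 dev=$2 logdev=$3 sectors=$4
	echo "dmsetup create $name --table \"$(log_writes_table "$dev" "$logdev" "$sectors")\""
}

# Print the command that would insert a named mark into the write log,
# for example right after mkfs or after an fsync in the workload.
log_writes_mark_cmd() {
	local name=$1 mark=$2
	echo "dmsetup message $name 0 mark $mark"
}

log_writes_create_cmd log /dev/sdX /dev/sdY 2097152
log_writes_mark_cmd log mkfs
```

On a real system the sector count would normally come from `blockdev --getsz` on the data device. Every write issued through the resulting mapper device is recorded on the log device, which is what makes the later replay deterministic.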
On 2018-02-26 16:15, Amir Goldstein wrote:
> On Mon, Feb 26, 2018 at 9:31 AM, Qu Wenruo <wqu@suse.com> wrote:
>> This test case is originally designed to expose unexpected corruption
>> for btrfs, where there are several reports about btrfs serious metadata
>> corruption after power loss.
>>
>> The test case itself will trigger heavy fsstress for the fs, and use
>> dm-flakey to emulate power loss by dropping all later writes.
>>
>
> Come on... dm-flakey is so 2016
> You should take Josef's fsstress+log-writes test and bring it to fstests:
> https://github.com/josefbacik/log-writes
>
> By doing that you will gain two very important features from the test:
>
> 1. Problems will be discovered much faster, because the test can run fsck
>    after every single block write has been replayed instead of just at random
>    times like in your test

That's exactly what I want!!!

Great thanks for this one! I would definitely look into this.
(Although the initial commit is even older than 2016)

But the test itself could already expose something on EXT4, so it still
makes some sense for ext4 developers as a verification test case.

Thanks,
Qu

>
> 2. Absolute guarantee of reproducing the problem by replaying the write log.
>    Even though your fsstress could use a pre-defined random seed, the results
>    will be far from reproducible, because of process and IO scheduling
>    differences between subsequent test runs.
>    When you catch an inconsistency with the log-writes test, you can send the
>    write-log recording to the maintainer to analyze the problem, even if it is
>    a hard problem to hit. I used that useful technique for ext4, btrfs and xfs
>    when I ran tests with generic/455 and found problems.
>
> Cheers,
> Amir.

On Mon, Feb 26, 2018 at 10:20 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> On 2018-02-26 16:15, Amir Goldstein wrote:
>> Come on... dm-flakey is so 2016
>> You should take Josef's fsstress+log-writes test and bring it to fstests:
>> https://github.com/josefbacik/log-writes
>>
>> By doing that you will gain two very important features from the test:
>>
>> 1. Problems will be discovered much faster, because the test can run fsck
>>    after every single block write has been replayed instead of just at random
>>    times like in your test
>
> That's exactly what I want!!!
>
> Great thanks for this one! I would definitely look into this.
> (Although the initial commit is even older than 2016)
>

Please note that Josef's replay-individual-faster.sh script runs fsck
every 1000 writes (i.e. --check 1000), so you can play with this argument
in your test. You can also run the --fsck command at every FUA or flush
(--check fua or --check flush), which may be more indicative of real-world
problems. Not sure.

>
> But the test itself could already expose something on EXT4, so it still
> makes some sense for ext4 developers as a verification test case.
>

Please take a look at generic/456.
When generic/455 found a reproducible problem in ext4,
I created a specific test without any randomness to pinpoint the
problem found (using dm-flakey).
If the problem you found is reproducible, then it will be easy for you
to create a similar "bisected" test.

Thanks,
Amir.

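The check modes mentioned in this message can be illustrated with a small helper that validates the `--check` argument and prints the resulting command line without running it. The `--check` and `--fsck` options are the ones named above; `--log` and `--replay` are taken from the replay-log tool in the linked repository as I understand it; the function name, device paths and fsck command are hypothetical placeholders.

```shell
#!/bin/bash
# Sketch only: builds a replay-log command line and prints it.
# /dev/sdX, /dev/sdY and the fsck command are placeholders.

# replay_check_cmd LOGDEV REPLAYDEV CHECK FSCK_CMD
# CHECK is either a write count (e.g. 1000), "flush" or "fua".
replay_check_cmd() {
	local logdev=$1 replaydev=$2 check=$3 fsck_cmd=$4

	case "$check" in
	flush|fua)
		;;	# check at every flush / FUA write
	''|*[!0-9]*)
		echo "invalid --check value: $check" >&2
		return 1
		;;
	esac

	echo "replay-log --log $logdev --replay $replaydev --check $check --fsck \"$fsck_cmd\""
}

replay_check_cmd /dev/sdY /dev/sdX 1000 "fsck -n /dev/sdX"
replay_check_cmd /dev/sdY /dev/sdX fua "fsck -n /dev/sdX"
```

A smaller count finds the first bad write more precisely at the cost of many more fsck runs, while `fua` or `flush` only checks at points where the filesystem has asked the device to make data durable.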
On 2018-02-26 16:33, Amir Goldstein wrote:
> Please take a look at generic/456.
> When generic/455 found a reproducible problem in ext4,
> I created a specific test without any randomness to pinpoint the
> problem found (using dm-flakey).
> If the problem you found is reproducible, then it will be easy for you
> to create a similar "bisected" test.

Yep, it's definitely needed for a pin-point test case, but I'm also
wondering if a random stress test could also help.

A test case with plain fsstress is already super helpful for exposing some
bugs, so such a stress test won't hurt.

Thanks,
Qu

On Mon, Feb 26, 2018 at 10:41 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> Yep, it's definitely needed for a pin-point test case, but I'm also
> wondering if a random stress test could also help.
>
> A test case with plain fsstress is already super helpful for exposing some
> bugs, so such a stress test won't hurt.
>

Yes, but the same stress test with dm-log-writes instead of dm-flakey
will be just as useful and much more, so there is no reason to merge the
less useful stress test.

Thanks,
Amir.

On 2018-02-26 16:45, Amir Goldstein wrote:
> Yes, but the same stress test with dm-log-writes instead of dm-flakey
> will be just as useful and much more, so there is no reason to merge the
> less useful stress test.

OK, I'll try to use dm-log-writes to enhance the test case.

Thanks,
Qu

diff --git a/tests/generic/479 b/tests/generic/479
new file mode 100755
index 00000000..ab530231
--- /dev/null
+++ b/tests/generic/479
@@ -0,0 +1,109 @@
+#! /bin/bash
+# FS QA Test 479
+#
+# Test if a filesystem can survive emulated powerloss.
+#
+# No matter what the solution a filesystem uses (journal or CoW),
+# it should survive unexpected powerloss, without major metadata
+# corruption.
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2018 SuSE. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	ps -e | grep fsstress > /dev/null 2>&1
+	while [ $? -eq 0 ]; do
+		$KILLALL_PROG -KILL fsstress > /dev/null 2>&1
+		wait > /dev/null 2>&1
+		ps -e | grep fsstress > /dev/null 2>&1
+	done
+	_unmount_flakey &> /dev/null
+	_cleanup_flakey
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+_require_command "$KILLALL_PROG" "killall"
+
+runtime=$(($TIME_FACTOR * 15))
+loops=$(($LOAD_FACTOR * 4))
+
+for i in $(seq -w $loops); do
+	echo "=== Loop $i: $(date) ===" >> $seqres.full
+
+	_scratch_mkfs >/dev/null 2>&1
+	_init_flakey
+	_mount_flakey
+
+	($FSSTRESS_PROG $FSSTRESS_AVOID -w -d $SCRATCH_MNT -n 1000000 \
+		-p 100 >> $seqres.full &) > /dev/null 2>&1
+
+	sleep $runtime
+
+	# Here we only want to drop all write, don't need to umount the fs
+	_load_flakey_table $FLAKEY_DROP_WRITES
+
+	ps -e | grep fsstress > /dev/null 2>&1
+	while [ $? -eq 0 ]; do
+		$KILLALL_PROG -KILL fsstress > /dev/null 2>&1
+		wait > /dev/null 2>&1
+		ps -e | grep fsstress > /dev/null 2>&1
+	done
+
+	_unmount_flakey
+	_cleanup_flakey
+
+	# Mount the fs to do proper log replay for journal based fs
+	# so later check won't report annoying dirty log and only
+	# report real problem.
+	_scratch_mount
+	_scratch_unmount
+
+	_check_scratch_fs
+done
+
+echo "Silence is golden"
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/479.out b/tests/generic/479.out
new file mode 100644
index 00000000..290f18b3
--- /dev/null
+++ b/tests/generic/479.out
@@ -0,0 +1,2 @@
+QA output created by 479
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 1e808865..5ce3db1d 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -481,3 +481,4 @@
 476 auto rw
 477 auto quick exportfs
 478 auto quick
+479 auto

This test case is originally designed to expose unexpected corruption
for btrfs, where there are several reports about btrfs serious metadata
corruption after power loss.

The test case itself will trigger heavy fsstress for the fs, and use
dm-flakey to emulate power loss by dropping all later writes.

For btrfs, it should be completely fine, as long as the superblock write
(FUA write) finishes atomically: with metadata CoW, the superblock
points to either the old trees or the new trees, so the fs should be as
atomic as the superblock.

For journal based filesystems, each metadata update should be journaled,
so a metadata operation is as atomic as the journal updates.

It does show that XFS is doing the best work among the tested
filesystems (Btrfs, XFS, ext4): no kernel or xfs_repair problems at all.

For btrfs, although btrfs check doesn't report any problem, the kernel
reports some data checksum errors, which is a little unexpected as data is
CoWed by default and should therefore be as atomic as the superblock.
(Unfortunately, this is still not the exact problem I'm chasing.)

For EXT4, the kernel is fine, but a later e2fsck reports problems, which may
indicate there is still something to be improved.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 tests/generic/479     | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/479.out |   2 +
 tests/generic/group   |   1 +
 3 files changed, 112 insertions(+)
 create mode 100755 tests/generic/479
 create mode 100644 tests/generic/479.out
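
As a rough illustration of why the patch adds the test to the `auto` group without `quick`, the following sketch computes the minimum time the test spends sleeping while fsstress runs, using the same two formulas as the patch (`runtime = TIME_FACTOR * 15`, `loops = LOAD_FACTOR * 4`). The function name is hypothetical, and treating an unset factor as 1 is an assumption of this sketch; mkfs, mount and fsck time come on top of the result.

```shell
#!/bin/bash
# Sketch: lower bound on the stress time of the test, in seconds.

# stress_seconds TIME_FACTOR LOAD_FACTOR
stress_seconds() {
	local time_factor=${1:-1} load_factor=${2:-1}
	local runtime=$((time_factor * 15))	# seconds of fsstress per loop
	local loops=$((load_factor * 4))	# number of mkfs/stress/check loops
	echo $((runtime * loops))
}

stress_seconds 1 1	# prints 60
stress_seconds 2 3	# prints 360
```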