Message ID | 20130104125059.GE20089@twin.jikos.cz (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Thanks Dave, On Fri, Jan 04, 2013 at 05:50:59AM -0700, David Sterba wrote: > Hi, > > I've noticed a few csum mismatch messages, and a few failed xfstests: > > - 3.8.0-rc1 > - defautl mkfs options > - MOUNT_OPTIONS -- -o space_cache,noatime,inode_cache > - test device: 40G > - scratch device: 10G Josef, are the problems you see with 083 coming on the scratch drive or the main disk? > > 091: > --- 091.out 2011-11-01 10:31:12.000000000 +0100 > +++ 091.out.bad 2013-01-03 21:07:29.000000000 +0100 > @@ -1,7 +1,45 @@ > QA output created by 091 > fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W > -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W > -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W > -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W > -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W > -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W > +fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W > +mapped writes DISABLED > +truncating to largest ever: 0x12a00 > +truncating to largest ever: 0x75400 > +doread: read: Input/output error > +LOG DUMP (35 total operations): > +1( 1 mod 256): SKIPPED (no operation) > +2( 2 mod 256): WRITE 0x62600 thru 0x6bdff (0x9800 bytes) HOLE > +3( 3 mod 256): FALLOC 0x2e0f2 thru 0x3134a (0x3258 bytes) INTERIOR > +4( 4 mod 256): TRUNCATE DOWN from 0x6be00 to 0x12a00 > +5( 5 mod 256): READ 0x0 thru 0xdfff (0xe000 bytes) > +6( 6 mod 256): FALLOC 0x7048 thru 0x9f54 (0x2f0c bytes) INTERIOR > +7( 7 mod 256): WRITE 0x5ea00 thru 0x6e7ff (0xfe00 bytes) HOLE > +8( 8 mod 256): READ 0x16000 thru 0x17fff (0x2000 bytes) > +9( 9 mod 256): FALLOC 0x4957f thru 0x5298e (0x940f bytes) INTERIOR > +10( 10 mod 256): SKIPPED (no operation) > +11( 11 mod 256): WRITE 0x10a00 thru 0x173ff (0x6a00 bytes) > +12( 12 mod 256): WRITE 0x53800 thru 0x5a7ff (0x7000 bytes) > +13( 13 mod 256): WRITE 0x5ae00 thru 0x5afff (0x200 bytes) > +14( 14 mod 256): READ 0x5d000 thru 0x66fff (0xa000 bytes) > +15( 15 mod 256): SKIPPED (no operation) > +16( 16 mod 256): READ 0x21000 thru 0x2bfff (0xb000 bytes) > +17( 17 mod 256): SKIPPED (no operation) > +18( 18 mod 256): READ 0x47000 thru 0x4ffff (0x9000 bytes) > +19( 19 mod 256): WRITE 0x17600 thru 0x25bff (0xe600 bytes) > +20( 20 mod 256): READ 0x3f000 thru 0x48fff (0xa000 bytes) > +21( 21 mod 256): FALLOC 0xea89 thru 0x19800 (0xad77 bytes) INTERIOR > +22( 22 mod 256): FALLOC 0x569aa thru 0x586ea (0x1d40 bytes) INTERIOR > +23( 23 mod 256): WRITE 0x35c00 thru 0x453ff (0xf800 bytes) > +24( 24 mod 256): SKIPPED (no operation) > +25( 25 mod 256): SKIPPED (no operation) > +26( 26 mod 256): READ 0x21000 thru 0x26fff (0x6000 bytes) > +27( 27 mod 256): READ 0x5e000 thru 0x61fff (0x4000 bytes) > +28( 28 mod 256): WRITE 0x6f600 thru 0x6f7ff (0x200 bytes) HOLE > +29( 29 mod 256): READ 0x13000 thru 0x19fff (0x7000 bytes) > +30( 30 mod 256): TRUNCATE UP from 0x6f800 to 0x75400 > +31( 31 mod 256): READ 0x4000 thru 0xafff (0x7000 bytes) > +32( 32 mod 256): SKIPPED (no operation) > +33( 33 mod 256): FALLOC 0x31d49 thru 0x3c520 (0xa7d7 bytes) INTERIOR > +34( 34 mod 256): FALLOC 0x2bbb3 thru 0x37ad8 (0xbf25 bytes) INTERIOR > +35( 35 mod 256): READ 0x68000 thru 0x73fff (0xc000 bytes) > +Correct content saved for comparison > +(maybe hexdump "/mnt/a1/junk" vs "/mnt/a1/junk.fsxgood") > > I'm not quite sure if the messages match the test (best guess, neighbouring > tests were fine): > > [102885.667444] btrfs csum failed ino 638 off 425984 csum 1842675109 private 2279232751 > [102885.676804] btrfs csum failed ino 638 off 430080 csum 1842675109 private 1192041375 > [102885.686094] btrfs csum failed ino 638 off 434176 csum 2297282744 private 1619428542 > [102885.686103] btrfs csum failed ino 638 off 438272 csum 3709984297 private 2868627320 > [102885.686112] btrfs csum failed ino 638 off 442368 csum 1504116677 private 1239355148 > [102885.686121] btrfs csum failed ino 638 off 446464 csum 1957839041 private 3848200057 > [102885.686129] btrfs csum failed ino 638 off 450560 csum 3836729483 private 2867416946 I think fsx leaves the bad file, you can test the inode number? > > > 113: > - it hung last evening and is still in that state, no disk or cpu activity, > there were only the tests running > - no process is in D state, no btrfs kernel thread is active > - the only interesting process is > > PID TTY STAT TIME COMMAND > 15585 pts/0 Sl+ 0:01 /root/xfstests/ltp/aio-stress -t 20 -s 10 -O -S -I \ > 1000 /mnt/a1/aiostress.15188.4 /mnt/a1/aiostress.15188.4.20 \ > /mnt/a1/aiostress.15188.4.19 /mnt/a1/aiostress.15188.4.18 \ > [<ffffffff810af447>] futex_wait_queue_me+0xc7/0x100 > [<ffffffff810affa1>] futex_wait+0x191/0x280 > [<ffffffff810b1cb6>] do_futex+0xd6/0xbd0 > [<ffffffff810b282b>] sys_futex+0x7b/0x180 > [<ffffffff8195fe99>] system_call_fastpath+0x16/0x1b Hmmm, I wonder if something else in rc1 is causing this? -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jan 04, 2013 at 08:01:44AM -0500, Chris Mason wrote: > > I'm not quite sure if the messages match the test (best guess, neighbouring > > tests were fine): > > > > [102885.667444] btrfs csum failed ino 638 off 425984 csum 1842675109 private 2279232751 > > [102885.676804] btrfs csum failed ino 638 off 430080 csum 1842675109 private 1192041375 > > [102885.686094] btrfs csum failed ino 638 off 434176 csum 2297282744 private 1619428542 > > [102885.686103] btrfs csum failed ino 638 off 438272 csum 3709984297 private 2868627320 > > [102885.686112] btrfs csum failed ino 638 off 442368 csum 1504116677 private 1239355148 > > [102885.686121] btrfs csum failed ino 638 off 446464 csum 1957839041 private 3848200057 > > [102885.686129] btrfs csum failed ino 638 off 450560 csum 3836729483 private 2867416946 > > I think fsx leaves the bad file, you can test the inode number? # btrfs inspect ino 638 /mnt/a1 /mnt/a1/junk # md5sum /mnt/a1/junk md5sum: /mnt/a1/junk: Input/output error dmesg says: [165711.839959] btrfs csum failed ino 638 off 262144 csum 3105135418 private 2393479556 [165711.899328] btrfs csum failed ino 638 off 262144 csum 2566472073 private 2393479556 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jan 04, 2013 at 01:50:59PM +0100, David Sterba wrote: > I've noticed a few csum mismatch messages, and a few failed xfstests: They're still there, we're on rc4, so I started looking for potential patches to revert, but tonight the test reproduced csums with these patches removed: Btrfs: do not call file_update_time in aio_write Btrfs: only unlock and relock if we have to Btrfs: use tokens where we can in the tree log Btrfs: only clear dirty on the buffer if it is marked as dirty Btrfs: log changed inodes based on the extent map tree Btrfs: do not mark ems as prealloc if we are writing to them Btrfs: keep track of the extents original block length Btrfs: inline csums if we're fsyncing Btrfs: don't bother copying if we're only logging the inode Btrfs: only log the inode item if we can get away with it whole branch (test-next-csum in my git repo), was created by rebasing btrfs-next/for-chris on top of linus/master commit 9a9284153d965a57edc7162a8e57c14c97f3a935. The patches were selected semi-randomly and some of them are just dependencies that made merging easier. > 113: > - it hung last evening and is still in that state, no disk or cpu activity, > there were only the tests running > - no process is in D state, no btrfs kernel thread is active > - the only interesting process is > > PID TTY STAT TIME COMMAND > 15585 pts/0 Sl+ 0:01 /root/xfstests/ltp/aio-stress -t 20 -s 10 -O -S -I \ > 1000 /mnt/a1/aiostress.15188.4 /mnt/a1/aiostress.15188.4.20 \ > /mnt/a1/aiostress.15188.4.19 /mnt/a1/aiostress.15188.4.18 \ > [<ffffffff810af447>] futex_wait_queue_me+0xc7/0x100 > [<ffffffff810affa1>] futex_wait+0x191/0x280 > [<ffffffff810b1cb6>] do_futex+0xd6/0xbd0 > [<ffffffff810b282b>] sys_futex+0x7b/0x180 > [<ffffffff8195fe99>] system_call_fastpath+0x16/0x1b also reproduced, this happens like every 3rd run of the whole test. david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jan 22, 2013 at 07:26:15AM -0700, David Sterba wrote: > On Fri, Jan 04, 2013 at 01:50:59PM +0100, David Sterba wrote: > > I've noticed a few csum mismatch messages, and a few failed xfstests: > > They're still there, we're on rc4, so I started looking for potential > patches to revert, but tonight the test reproduced csums with these > patches removed: I'm able to trigger crc errors with just 50 parallel fsx O_DIRECT procs hammering in parallel. Trying to nail down the test case a little better. #!/bin/bash num=$1 if [ "x$num" == "x" ]; then num=50 fi echo "using $num procs" for x in `seq 1 $num` ; do echo -n "$x " fsx -q xxxf$x -Z -R -W -r 4096 -w 4096 & done echo "waiting" wait -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jan 22, 2013 at 09:39:02AM -0500, Chris Mason wrote: > On Tue, Jan 22, 2013 at 07:26:15AM -0700, David Sterba wrote: > > On Fri, Jan 04, 2013 at 01:50:59PM +0100, David Sterba wrote: > > > I've noticed a few csum mismatch messages, and a few failed xfstests: > > > > They're still there, we're on rc4, so I started looking for potential > > patches to revert, but tonight the test reproduced csums with these > > patches removed: > > I'm able to trigger crc errors with just 50 parallel fsx O_DIRECT procs > hammering in parallel. Trying to nail down the test case a little > better. After this patch(Btrfs: put csums on the right ordered extent), I'm unable to get crc errors by running this script. thanks, liubo > > #!/bin/bash > > num=$1 > > if [ "x$num" == "x" ]; then > num=50 > fi > > echo "using $num procs" > > for x in `seq 1 $num` ; do > echo -n "$x " > fsx -q xxxf$x -Z -R -W -r 4096 -w 4096 & > done > echo "waiting" > wait > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--- 091.out 2011-11-01 10:31:12.000000000 +0100 +++ 091.out.bad 2013-01-03 21:07:29.000000000 +0100 @@ -1,7 +1,45 @@ QA output created by 091 fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W +fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W +mapped writes DISABLED +truncating to largest ever: 0x12a00 +truncating to largest ever: 0x75400 +doread: read: Input/output error +LOG DUMP (35 total operations): +1( 1 mod 256): SKIPPED (no operation) +2( 2 mod 256): WRITE 0x62600 thru 0x6bdff (0x9800 bytes) HOLE +3( 3 mod 256): FALLOC 0x2e0f2 thru 0x3134a (0x3258 bytes) INTERIOR +4( 4 mod 256): TRUNCATE DOWN from 0x6be00 to 0x12a00 +5( 5 mod 256): READ 0x0 thru 0xdfff (0xe000 bytes) +6( 6 mod 256): FALLOC 0x7048 thru 0x9f54 (0x2f0c bytes) INTERIOR +7( 7 mod 256): WRITE 0x5ea00 thru 0x6e7ff (0xfe00 bytes) HOLE +8( 8 mod 256): READ 0x16000 thru 0x17fff (0x2000 bytes) +9( 9 mod 256): FALLOC 0x4957f thru 0x5298e (0x940f bytes) INTERIOR +10( 10 mod 256): SKIPPED (no operation) +11( 11 mod 256): WRITE 0x10a00 thru 0x173ff (0x6a00 bytes) +12( 12 mod 256): WRITE 0x53800 thru 0x5a7ff (0x7000 bytes) +13( 13 mod 256): WRITE 0x5ae00 thru 0x5afff (0x200 bytes) +14( 14 mod 256): READ 0x5d000 thru 0x66fff (0xa000 bytes) +15( 15 mod 256): SKIPPED (no operation) +16( 16 mod 256): READ 0x21000 thru 0x2bfff (0xb000 bytes) +17( 17 mod 256): SKIPPED (no operation) +18( 18 mod 256): READ 0x47000 thru 0x4ffff (0x9000 bytes) +19( 19 mod 256): WRITE 0x17600 thru 0x25bff (0xe600 bytes) +20( 20 mod 256): READ 0x3f000 thru 0x48fff (0xa000 bytes) +21( 21 mod 256): FALLOC 0xea89 thru 0x19800 (0xad77 bytes) INTERIOR +22( 22 mod 256): FALLOC 0x569aa thru 0x586ea (0x1d40 bytes) INTERIOR +23( 23 mod 256): WRITE 0x35c00 thru 0x453ff (0xf800 bytes) +24( 24 mod 256): SKIPPED (no operation) +25( 25 mod 256): SKIPPED (no operation) +26( 26 mod 256): READ 0x21000 thru 0x26fff (0x6000 bytes) +27( 27 mod 256): READ 0x5e000 thru 0x61fff (0x4000 bytes) +28( 28 mod 256): WRITE 0x6f600 thru 0x6f7ff (0x200 bytes) HOLE +29( 29 mod 256): READ 0x13000 thru 0x19fff (0x7000 bytes) +30( 30 mod 256): TRUNCATE UP from 0x6f800 to 0x75400 +31( 31 mod 256): READ 0x4000 thru 0xafff (0x7000 bytes) +32( 32 mod 256): SKIPPED (no operation) +33( 33 mod 256): FALLOC 0x31d49 thru 0x3c520 (0xa7d7 bytes) INTERIOR +34( 34 mod 256): FALLOC 0x2bbb3 thru 0x37ad8 (0xbf25 bytes) INTERIOR +35( 35 mod 256): READ 0x68000 thru 0x73fff (0xc000 bytes) +Correct content saved for comparison +(maybe hexdump "/mnt/a1/junk" vs "/mnt/a1/junk.fsxgood")