Message ID | 1423831674-25678-1-git-send-email-fdmanana@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote: > This test is motivated by an fsync issue discovered in btrfs. > The issue was that we could lose file data, that was previously fsync'ed > successfully, if we end up shrinking (via truncate) the file, add a hard > link to our file and then persist the fsync log later via an fsync of > other inode for example. After a power loss our file content wouldn't > match what it had when we last fsync'ed, but instead it had the data > prior to that fsync. Prior to which fsync? XFS has strictly ordered metadata journalling, which means transactions committed prior to *any* fsync will be present on disk after the fsync. ext4 is not quite so strict, but has essentially the same behaviour. ext3 in ordered mode behaves like XFS. As such, ext3, ext4 and XFS return the state of the file as "0xaa for 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after the filesystem is remounted and the log replayed. So, as this test is written, it does not encapsulate the longstanding, expected behaviour of existing filesystems. btrfs should behave like XFS, ext3 and ext4 if possible - being different is only going to cause confusion and pain for application developers. > The btrfs issue was fixed by the following linux kernel patch: > > Btrfs: don't remove extents and xattrs when logging new names Sounds like you've just introduced an ordered mode behavioural journalling regression into btrfs... Cheers, Dave.
On Mon, Feb 16, 2015 at 11:33:37AM +1100, Dave Chinner wrote: > On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote: > > This test is motivated by an fsync issue discovered in btrfs. > > The issue was that we could lose file data, that was previously fsync'ed > > successfully, if we end up shrinking (via truncate) the file, add a hard > > link to our file and then persist the fsync log later via an fsync of > > other inode for example. After a power loss our file content wouldn't > > match what it had when we last fsync'ed, but instead it had the data > > prior to that fsync. > > Prior to which fsync? > > XFS has strictly ordered metadata journalling, which means > transactions committed prior to *any* fsync will be present on disk > after the fsync. ext4 is not quite so strict, but has essentially > the same behaviour. ext3 in ordered mode behaves like XFS. > > As such, ext3, ext4 and XFS return the state of the file as "0xaa for > 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after > the filesystem is remounted and the log replayed. But xfs doesn't work as above, something wrong? This my config file, ------------------------------------------ [btrfs] TEST_DEV=/dev/vdd TEST_DIR=/mnt/test SCRATCH_DEV=/dev/vdc SCRATCH_MNT=/mnt/scratch FSTYP=btrfs MKFS_OPTIONS="-f -b2G" RESULT_BASE="`pwd`/results/`date +%m%d%y_%H%M`" MOUNT_OPTIONS="" RECREATE_TEST_DEV=true [xfs] FSTYP=xfs MKFS_OPTIONS="-f" [ext4] FSTYP=ext4 MKFS_OPTIONS="" ------------------------------------------ Results: ------------------------------------- cat btrfs/generic/044.out.bad QA output created by 044 wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 4096/4096 bytes at offset 8192 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content before: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0011600 aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00 0011620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0100000 File content after: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 ------------------------------------- cat xfs/generic/044.out.bad QA output created by 044 wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 4096/4096 bytes at offset 8192 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content before: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0011600 aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00 0011620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0100000 File content after: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0100000 ------------------------------------- cat ext4/generic/044.out.bad QA output created by 044 wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 4096/4096 bytes at offset 8192 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content before: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0011600 aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00 0011620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0100000 File content after: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0011600 aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00 0011620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0100000 Thanks, -liubo > > So, as this test is written, it does not encapsulate the > longstanding, expected behaviour of existing filesystems. btrfs > should behave like XFS, ext3 and ext4 if possible - being different > is only going to cause confusion and pain for application > developers. > > > The btrfs issue was fixed by the following linux kernel patch: > > > > Btrfs: don't remove extents and xattrs when logging new names > > Sounds like you've just introduced an ordered mode behavioural > journalling regression into btrfs... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe fstests" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 16, 2015 at 12:33 AM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote: >> This test is motivated by an fsync issue discovered in btrfs. >> The issue was that we could lose file data, that was previously fsync'ed >> successfully, if we end up shrinking (via truncate) the file, add a hard >> link to our file and then persist the fsync log later via an fsync of >> other inode for example. After a power loss our file content wouldn't >> match what it had when we last fsync'ed, but instead it had the data >> prior to that fsync. > > Prior to which fsync? The fsync of file "foo" - this is the file we're testing for correctness and we only fsync it once. > > XFS has strictly ordered metadata journalling, which means > transactions committed prior to *any* fsync will be present on disk > after the fsync. ext4 is not quite so strict, but has essentially > the same behaviour. ext3 in ordered mode behaves like XFS. > > As such, ext3, ext4 and XFS return the state of the file as "0xaa for > 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after > the filesystem is remounted and the log replayed. Well, that's not I get with XFS, at least on a 3.19 kernel. After log replay, I get a file with a size of 32Kb, the first 8192 bytes have the value 0xaa and the remaining 24Kb have value 0x00 - this doesn't match the file content after any operation. That seems unexpected to me. I think only 3 results are acceptable after the log replay: 1) A file with a size of 12Kb, first 8Kb of data with value 0xaa, and the last 4Kb of data with the value 0xcc. This is the file content after it was explicitly fsync'ed; 2) A file with a size of 5000 bytes, all bytes having the value 0xaa. This is the result after the shrinking truncate; 3) A file with a size of 32Kb, where the first 5000 bytes all have the value 0xaa and the remaining bytes with value 0xaa. This is the result of the expanding truncate. Result 1) is the one I most expected. Results 2) and 3) are possible - the fs might decide to commit the current transaction after any of the truncate operations - it's still correct behaviour. Also, after adding the hard link to foo, it might decide update the fsync log, in this case we should get to 3). I'm ok with changing the test to consider valid any of those 3 cases. > > So, as this test is written, it does not encapsulate the > longstanding, expected behaviour of existing filesystems. btrfs > should behave like XFS, ext3 and ext4 if possible - being different > is only going to cause confusion and pain for application > developers. > >> The btrfs issue was fixed by the following linux kernel patch: >> >> Btrfs: don't remove extents and xattrs when logging new names > > Sounds like you've just introduced an ordered mode behavioural > journalling regression into btrfs... How can it be a regression? Before this change, after the fsync log replay, the file content matched what it had when "sync" was called - i.e. doesn't match the content when the fsync (on "foo") was performed nor the content after any of the subsequent truncate operations. thanks > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe fstests" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 16, 2015 at 9:45 AM, Filipe David Manana <fdmanana@gmail.com> wrote: > On Mon, Feb 16, 2015 at 12:33 AM, Dave Chinner <david@fromorbit.com> wrote: >> On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote: >>> This test is motivated by an fsync issue discovered in btrfs. >>> The issue was that we could lose file data, that was previously fsync'ed >>> successfully, if we end up shrinking (via truncate) the file, add a hard >>> link to our file and then persist the fsync log later via an fsync of >>> other inode for example. After a power loss our file content wouldn't >>> match what it had when we last fsync'ed, but instead it had the data >>> prior to that fsync. >> >> Prior to which fsync? > > The fsync of file "foo" - this is the file we're testing for > correctness and we only fsync it once. > >> >> XFS has strictly ordered metadata journalling, which means >> transactions committed prior to *any* fsync will be present on disk >> after the fsync. ext4 is not quite so strict, but has essentially >> the same behaviour. ext3 in ordered mode behaves like XFS. >> >> As such, ext3, ext4 and XFS return the state of the file as "0xaa for >> 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after >> the filesystem is remounted and the log replayed. > > Well, that's not I get with XFS, at least on a 3.19 kernel. > After log replay, I get a file with a size of 32Kb, the first 8192 > bytes have the value 0xaa and the remaining 24Kb have value 0x00 - > this doesn't match the file content after any operation. > > That seems unexpected to me. > I think only 3 results are acceptable after the log replay: > > 1) A file with a size of 12Kb, first 8Kb of data with value 0xaa, and > the last 4Kb of data with the value 0xcc. This is the file content > after it was explicitly fsync'ed; > > 2) A file with a size of 5000 bytes, all bytes having the value 0xaa. > This is the result after the shrinking truncate; > > 3) A file with a size of 32Kb, where the first 5000 bytes all have the > value 0xaa and the remaining bytes with value 0xaa. This is the result > of the expanding truncate. Sorry, typo. Obviously I meant to say the remaining bytes (5001 to 32Kb) having a value of 0x00, and not 0xaa. > > Result 1) is the one I most expected. Results 2) and 3) are possible - > the fs might decide to commit the current transaction after any of the > truncate operations - it's still correct behaviour. > Also, after adding the hard link to foo, it might decide update the > fsync log, in this case we should get to 3). > > I'm ok with changing the test to consider valid any of those 3 cases. > >> >> So, as this test is written, it does not encapsulate the >> longstanding, expected behaviour of existing filesystems. btrfs >> should behave like XFS, ext3 and ext4 if possible - being different >> is only going to cause confusion and pain for application >> developers. >> >>> The btrfs issue was fixed by the following linux kernel patch: >>> >>> Btrfs: don't remove extents and xattrs when logging new names >> >> Sounds like you've just introduced an ordered mode behavioural >> journalling regression into btrfs... > > How can it be a regression? > Before this change, after the fsync log replay, the file content > matched what it had when "sync" was called - i.e. doesn't match the > content when the fsync (on "foo") was performed nor the content after > any of the subsequent truncate operations. > > thanks > >> >> Cheers, >> >> Dave. >> -- >> Dave Chinner >> david@fromorbit.com >> -- >> To unsubscribe from this list: send the line "unsubscribe fstests" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > Filipe David Manana, > > "Reasonable men adapt themselves to the world. > Unreasonable men adapt the world to themselves. > That's why all progress depends on unreasonable men."
On Mon, Feb 16, 2015 at 12:19:58PM +0800, Liu Bo wrote: > On Mon, Feb 16, 2015 at 11:33:37AM +1100, Dave Chinner wrote: > > On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote: > > > This test is motivated by an fsync issue discovered in btrfs. > > > The issue was that we could lose file data, that was previously fsync'ed > > > successfully, if we end up shrinking (via truncate) the file, add a hard > > > link to our file and then persist the fsync log later via an fsync of > > > other inode for example. After a power loss our file content wouldn't > > > match what it had when we last fsync'ed, but instead it had the data > > > prior to that fsync. > > > > Prior to which fsync? > > > > XFS has strictly ordered metadata journalling, which means > > transactions committed prior to *any* fsync will be present on disk > > after the fsync. ext4 is not quite so strict, but has essentially > > the same behaviour. ext3 in ordered mode behaves like XFS. > > > > As such, ext3, ext4 and XFS return the state of the file as "0xaa for > > 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after > > the filesystem is remounted and the log replayed. > > But xfs doesn't work as above, something wrong? Maybe. I didn't check the exact output XFS or ext4 were giving, just that is was different to what the test expected. XFS and ext4 also behave differently w.r.t. data vs metadata ordering, so there's a good change what you see here is the XFS metadata being written, but not the corresponding block zeroing that occurred on extenstion.... > ------------------------------------- > cat xfs/generic/044.out.bad .... > File content after: > 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa > * > 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > * > 0100000 So the truncate down to 5k should have resulted in zeros past 5k, but the zeros only start at 8k. Yup, that looks like setattr didn't flush the zeroed tail page on truncate up. I'll have a further look at that. Thanks for the head's up! Cheers, Dave.
On Mon, Feb 16, 2015 at 09:45:22AM +0000, Filipe David Manana wrote: > On Mon, Feb 16, 2015 at 12:33 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote: > >> This test is motivated by an fsync issue discovered in btrfs. > >> The issue was that we could lose file data, that was previously fsync'ed > >> successfully, if we end up shrinking (via truncate) the file, add a hard > >> link to our file and then persist the fsync log later via an fsync of > >> other inode for example. After a power loss our file content wouldn't > >> match what it had when we last fsync'ed, but instead it had the data > >> prior to that fsync. > > > > Prior to which fsync? > > The fsync of file "foo" - this is the file we're testing for > correctness and we only fsync it once. Ok. But there's also an implied fsync because btrfs is supposed to have an ordered metadata transaction subsystem. > > XFS has strictly ordered metadata journalling, which means > > transactions committed prior to *any* fsync will be present on disk > > after the fsync. ext4 is not quite so strict, but has essentially > > the same behaviour. ext3 in ordered mode behaves like XFS. > > > > As such, ext3, ext4 and XFS return the state of the file as "0xaa for > > 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after > > the filesystem is remounted and the log replayed. > > Well, that's not I get with XFS, at least on a 3.19 kernel. > After log replay, I get a file with a size of 32Kb, the first 8192 > bytes have the value 0xaa and the remaining 24Kb have value 0x00 - > this doesn't match the file content after any operation. See my previous reply in the thread. In memory behaviour appears correct, which means fsx will pass but we won't have picked up a regression in power failure behaviour. > I'm ok with changing the test to consider valid any of those 3 cases. > > > > > So, as this test is written, it does not encapsulate the > > longstanding, expected behaviour of existing filesystems. btrfs > > should behave like XFS, ext3 and ext4 if possible - being different > > is only going to cause confusion and pain for application > > developers. > > > >> The btrfs issue was fixed by the following linux kernel patch: > >> > >> Btrfs: don't remove extents and xattrs when logging new names > > > > Sounds like you've just introduced an ordered mode behavioural > > journalling regression into btrfs... > > How can it be a regression? Maybe it's not a regression - I don't know what the old behaviour was - but it certainly seems like a bug to me. Btrfs is supposed to have a strongly ordered transaction model, similar to XFS and ext4 and the test as you've written it implies transactional change ordering is not preserved in btrfs (ouch!). Ordered mode on this test is tricky because you are running fsync on a newly created file. That implies fsync on the parent directory, because fsync means you are stabilising all the *transactions* that changed metadata on the new file. That means the dirent created to point to the new file should be fsync()d, too. So, order of modifications: file foo data written and synced file foo truncate/extent hard link foo_link to foo created new file bar created new file bar fsync In this case, it's the hard link of foo to foo_link that creates the ordered dependency between foo, the parent directory and bar. The directory is then modified in a future transaction by creating bar, and the fsync on bar stabilises the parent directory as transactional atomicity requires all objects in a transaction to be stabilised when one object is asked to be stabilised. And for the latest modification to an object to be stabilised, ordered metadata requires all previous modifications to that object also need to be stabilised at the same time. IOWs, the fsync on bar has this ordered change dependency resolution: fsync(bar) - implied parent directory fsync for dirent pointing to bar - stabilises parent directory, which - also stabilises new hard link dirent to foo_link, which - stabilises new link count in foo, which - stabilises inode core of foo, which - stabilises changed inode size from truncate/extent - stabilises extent list changes from truncate/extent IOWs, ordered metadata transactions mean the second fsync, through the chain of dependent modifications previously made, should also stabilise all the metadata changes made on foo since the fsync was called on it. > Before this change, after the fsync log replay, the file content > matched what it had when "sync" was called - i.e. doesn't match the > content when the fsync (on "foo") was performed nor the content after > any of the subsequent truncate operations. IOWs, btrfs did not follow ordered metadata behavioural rules before the change, nor does it follow them after the change, but it's now broken in a different, more subtle and surprising way. Cheers, Dave.
On Mon, Feb 16, 2015 at 10:02 PM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Feb 16, 2015 at 09:45:22AM +0000, Filipe David Manana wrote: >> On Mon, Feb 16, 2015 at 12:33 AM, Dave Chinner <david@fromorbit.com> wrote: >> > On Fri, Feb 13, 2015 at 12:47:54PM +0000, Filipe Manana wrote: >> >> This test is motivated by an fsync issue discovered in btrfs. >> >> The issue was that we could lose file data, that was previously fsync'ed >> >> successfully, if we end up shrinking (via truncate) the file, add a hard >> >> link to our file and then persist the fsync log later via an fsync of >> >> other inode for example. After a power loss our file content wouldn't >> >> match what it had when we last fsync'ed, but instead it had the data >> >> prior to that fsync. >> > >> > Prior to which fsync? >> >> The fsync of file "foo" - this is the file we're testing for >> correctness and we only fsync it once. > > Ok. But there's also an implied fsync because btrfs is supposed to > have an ordered metadata transaction subsystem. > >> > XFS has strictly ordered metadata journalling, which means >> > transactions committed prior to *any* fsync will be present on disk >> > after the fsync. ext4 is not quite so strict, but has essentially >> > the same behaviour. ext3 in ordered mode behaves like XFS. >> > >> > As such, ext3, ext4 and XFS return the state of the file as "0xaa for >> > 0 to 5k, 0x00 out to 32k" and the hardlink foo_link is present after >> > the filesystem is remounted and the log replayed. >> >> Well, that's not I get with XFS, at least on a 3.19 kernel. >> After log replay, I get a file with a size of 32Kb, the first 8192 >> bytes have the value 0xaa and the remaining 24Kb have value 0x00 - >> this doesn't match the file content after any operation. > > See my previous reply in the thread. In memory behaviour appears > correct, which means fsx will pass but we won't have picked up a > regression in power failure behaviour. > >> I'm ok with changing the test to consider valid any of those 3 cases. >> >> > >> > So, as this test is written, it does not encapsulate the >> > longstanding, expected behaviour of existing filesystems. btrfs >> > should behave like XFS, ext3 and ext4 if possible - being different >> > is only going to cause confusion and pain for application >> > developers. >> > >> >> The btrfs issue was fixed by the following linux kernel patch: >> >> >> >> Btrfs: don't remove extents and xattrs when logging new names >> > >> > Sounds like you've just introduced an ordered mode behavioural >> > journalling regression into btrfs... >> >> How can it be a regression? > > Maybe it's not a regression - I don't know what the old behaviour > was - but it certainly seems like a bug to me. Btrfs is supposed to > have a strongly ordered transaction model, similar to XFS and ext4 > and the test as you've written it implies transactional change > ordering is not preserved in btrfs (ouch!). > > Ordered mode on this test is tricky because you are running fsync on > a newly created file. That implies fsync on the parent directory, > because fsync means you are stabilising all the *transactions* that > changed metadata on the new file. That means the dirent created to > point to the new file should be fsync()d, too. > > So, order of modifications: > > file foo data written and synced > file foo truncate/extent > hard link foo_link to foo created > new file bar created > new file bar fsync > > In this case, it's the hard link of foo to foo_link that creates the > ordered dependency between foo, the parent directory and bar. The > directory is then modified in a future transaction by creating bar, > and the fsync on bar stabilises the parent directory as > transactional atomicity requires all objects in a transaction to be > stabilised when one object is asked to be stabilised. And for the > latest modification to an object to be stabilised, ordered metadata > requires all previous modifications to that object also need to be > stabilised at the same time. > > IOWs, the fsync on bar has this ordered change dependency > resolution: > > fsync(bar) > - implied parent directory fsync for dirent pointing to bar > - stabilises parent directory, which > - also stabilises new hard link dirent to foo_link, which > - stabilises new link count in foo, which > - stabilises inode core of foo, which > - stabilises changed inode size from truncate/extent > - stabilises extent list changes from truncate/extent > > IOWs, ordered metadata transactions mean the second fsync, through > the chain of dependent modifications previously made, should also > stabilise all the metadata changes made on foo since the fsync was > called on it. > >> Before this change, after the fsync log replay, the file content >> matched what it had when "sync" was called - i.e. doesn't match the >> content when the fsync (on "foo") was performed nor the content after >> any of the subsequent truncate operations. > > IOWs, btrfs did not follow ordered metadata behavioural rules before > the change, nor does it follow them after the change, but it's > now broken in a different, more subtle and surprising way. Dave, thanks for the detailed explanation. I'll update the test once I get btrfs to have the same behaviour as xfs and ext3/4. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
diff --git a/tests/generic/044 b/tests/generic/044 new file mode 100755 index 0000000..95a42cc --- /dev/null +++ b/tests/generic/044 @@ -0,0 +1,131 @@ +#! /bin/bash +# FS QA Test No. 044 +# +# This test is motivated by an fsync issue discovered in btrfs. +# The issue was that we could lose file data, that was previously fsync'ed +# successfully, if we end up shrinking (via truncate) the file, add a hard +# link to our file and then persist the fsync log later via an fsync of other +# inode for example. After a power loss our file content wouldn't match +# what it had when we last fsync'ed, but instead it had the data prior to +# that fsync. +# +# The btrfs issue was fixed by the following linux kernel patch: +# +# Btrfs: don't remove extents and xattrs when logging new names +# +#----------------------------------------------------------------------- +# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved. +# Author: Filipe Manana <fdmanana@suse.com> +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#----------------------------------------------------------------------- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! + +_cleanup() +{ + _cleanup_flakey + rm -f $tmp.* +} +trap "_cleanup; exit \$status" 0 1 2 3 15 + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/dmflakey + +# real QA test starts here +_supported_fs generic +_supported_os Linux +_need_to_be_root +_require_scratch +_require_dm_flakey + +rm -f $seqres.full + +_scratch_mkfs >> $seqres.full 2>&1 +_init_flakey +_mount_flakey + +# Create our test file with some data. +$XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \ + $SCRATCH_MNT/foo | _filter_xfs_io + +# Make sure the file is durably persisted. +sync + +# Append some data to our file, to increase its size. +$XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \ + $SCRATCH_MNT/foo | _filter_xfs_io + +# Fsync the file, so from this point on if a crash/power failure happens, our +# new data is guaranteed to be there next time the fs is mounted. +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo + +# Now shrink our file to 5000 bytes. +$XFS_IO_PROG -c "truncate 5000" $SCRATCH_MNT/foo + +# Now do an expanding truncate to a size larger than what we had when we last +# fsync'ed our file. This is just to verify that after power failure and +# replaying the fsync log, our file matches what it was when we last fsync'ed +# it - 12Kb size, first 8Kb of data had a value of 0xaa and the last 4Kb of +# data had a value of 0xcc. +$XFS_IO_PROG -c "truncate 32K" $SCRATCH_MNT/foo + +# Add one hard link to our file. This made btrfs drop all of our file's +# metadata from the fsync log, including the metadata relative to the +# extent we just wrote and fsync'ed. This change was made only to the fsync +# log in memory, so adding the hard link alone doesn't change the persisted +# fsync log. This happened because the previous truncates set the runtime +# flag BTRFS_INODE_NEEDS_FULL_SYNC in the btrfs inode structure. +ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link + +# Now make sure the in memory fsync log is durably persisted. +# Creating and fsync'ing another file will do it. +# After this our persisted fsync log will no longer have metadata for our file +# foo that points to the extent we wrote and fsync'ed before. +touch $SCRATCH_MNT/bar +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar + +# As expected, before the crash/power failure, we should be able to see a file +# with a size of 32Kb, with its first 5000 bytes having the value 0xaa and all +# the remaining bytes with value 0x00. +echo "File content before:" +od -t x1 $SCRATCH_MNT/foo + +# Simulate a crash/power loss. +_load_flakey_table $FLAKEY_DROP_WRITES +_unmount_flakey + +_load_flakey_table $FLAKEY_ALLOW_WRITES +_mount_flakey + +# After mounting the fs again, the fsync log was replayed. +# The expected result is to see a file with a size of 12Kb, with its first 8Kb +# of data having the value 0xaa and its last 4Kb of data having a value of 0xcc. +# The btrfs bug used to leave the file as it used te be as of the last +# transaction commit - that is, with a size of 8Kb with all bytes having a +# value of 0xaa. +echo "File content after:" +od -t x1 $SCRATCH_MNT/foo + +status=0 +exit diff --git a/tests/generic/044.out b/tests/generic/044.out new file mode 100644 index 0000000..397dbae --- /dev/null +++ b/tests/generic/044.out @@ -0,0 +1,18 @@ +QA output created by 044 +wrote 8192/8192 bytes at offset 0 +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +wrote 4096/4096 bytes at offset 8192 +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +File content before: +0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa +* +0011600 aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00 +0011620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 +* +0100000 +File content after: +0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa +* +0020000 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc +* +0030000 diff --git a/tests/generic/group b/tests/generic/group index d686f2f..dc16972 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -46,6 +46,7 @@ 041 metadata auto quick 042 metadata auto quick 043 metadata auto quick +044 metadata auto quick 053 acl repair auto quick 062 attr udf auto quick 068 other auto freeze dangerous stress
This test is motivated by an fsync issue discovered in btrfs. The issue was that we could lose file data, that was previously fsync'ed successfully, if we end up shrinking (via truncate) the file, add a hard link to our file and then persist the fsync log later via an fsync of other inode for example. After a power loss our file content wouldn't match what it had when we last fsync'ed, but instead it had the data prior to that fsync. The btrfs issue was fixed by the following linux kernel patch: Btrfs: don't remove extents and xattrs when logging new names Signed-off-by: Filipe Manana <fdmanana@suse.com> --- tests/generic/044 | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++ tests/generic/044.out | 18 +++++++ tests/generic/group | 1 + 3 files changed, 150 insertions(+) create mode 100755 tests/generic/044 create mode 100644 tests/generic/044.out