Message ID | 1425200918-16739-1-git-send-email-fdmanana@suse.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
On Sun, Mar 01, 2015 at 09:08:38AM +0000, Filipe Manana wrote: > When using the fast file fsync code path we can miss the fact that new > writes happened since the last file fsync and therefore return without > waiting for the IO to finish and write the new extents to the fsync log. > > Here's an example scenario where the fsync will miss the fact that new > file data exists that wasn't yet durably persisted: > > 1. fs_info->last_trans_committed == N - 1 and current transaction is > transaction N (fs_info->generation == N); > > 2. do a buffered write; > > 3. fsync our inode, this clears our inode's full sync flag, starts > an ordered extent and waits for it to complete - when it completes > at btrfs_finish_ordered_io(), the inode's last_trans is set to the > value N (via btrfs_update_inode_fallback -> btrfs_update_inode -> > btrfs_set_inode_last_trans); > > 4. transaction N is committed, so fs_info->last_trans_committed is now > set to the value N and fs_info->generation remains with the value N; > > 5. do another buffered write, when this happens btrfs_file_write_iter > sets our inode's last_trans to the value N + 1 (that is > fs_info->generation + 1 == N + 1); > > 6. transaction N + 1 is started and fs_info->generation now has the > value N + 1; > > 7. transaction N + 1 is committed, so fs_info->last_trans_committed > is set to the value N + 1; > > 8. fsync our inode - because it doesn't have the full sync flag set, > we only start the ordered extent, we don't wait for it to complete > (only in a later phase) therefore its last_trans field has the > value N + 1 set previously by btrfs_file_write_iter(), and so we > have: > > inode->last_trans <= fs_info->last_trans_committed > (N + 1) (N + 1) > > Which made us not log the last buffered write and exit the fsync > handler immediately, returning success (0) to user space and resulting > in data loss after a crash. > > This can actually be triggered deterministically and the following excerpt > from a testcase I made for xfstests triggers the issue. It moves a dummy > file across directories and then fsyncs the old parent directory - this > is just to trigger a transaction commit, so moving files around isn't > directly related to the issue but it was chosen because running 'sync' for > example does more than just committing the current transaction, as it > flushes/waits for all file data to be persisted. The issue can also happen > at random periods, since the transaction kthread periodicaly commits the > current transaction (about every 30 seconds by default). > The body of the test is: > > _scratch_mkfs >> $seqres.full 2>&1 > _init_flakey > _mount_flakey > > # Create our main test file 'foo', the one we check for data loss. > # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync' > # bit from its flags (btrfs inode specific flags). > $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \ > -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io > > # Now create one other file and 2 directories. We will move this second file > # from one directory to the other later because it forces btrfs to commit its > # currently open transaction if we fsync the old parent directory. This is > # necessary to trigger the data loss bug that affected btrfs. > mkdir $SCRATCH_MNT/testdir_1 > touch $SCRATCH_MNT/testdir_1/bar > mkdir $SCRATCH_MNT/testdir_2 > > # Make sure everything is durably persisted. > sync > > # Write more 8Kb of data to our file. > $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io > > # Move our 'bar' file into a new directory. > mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar > > # Fsync our first directory. Because it had a file moved into some other > # directory, this made btrfs commit the currently open transaction. This is > # a condition necessary to trigger the data loss bug. > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1 > > # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of > # data we wrote previously to be persisted and available if a crash happens. > # This did not happen with btrfs, because of the transaction commit that > # happened when we fsynced the parent directory. > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo > > # Simulate a crash/power loss. > _load_flakey_table $FLAKEY_DROP_WRITES > _unmount_flakey > > _load_flakey_table $FLAKEY_ALLOW_WRITES > _mount_flakey > > # Now check that all data we wrote before are available. > echo "File content after log replay:" > od -t x1 $SCRATCH_MNT/foo > > status=0 > exit > > The expected golden output for the test, which is what we get with this > fix applied (or when running against ext3/4 and xfs), is: > > wrote 8192/8192 bytes at offset 0 > XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > wrote 8192/8192 bytes at offset 8192 > XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > File content after log replay: > 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa > * > 0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb > * > 0040000 > > Without this fix applied, the output shows the test file does not have > the second 8Kb extent that we successfully fsynced: > > wrote 8192/8192 bytes at offset 0 > XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > wrote 8192/8192 bytes at offset 8192 > XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > File content after log replay: > 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa > * > 0020000 > > So fix this by skipping the fsync only if we're doing a full sync and > if the inode's last_trans is <= fs_info->last_trans_committed, or if > the inode is already in the log. Also remove setting the inode's > last_trans in btrfs_file_write_iter since it's useless/unreliable. > > A test case for xfstests will be sent soon. > > CC: <stable@vger.kernel.org> > Reviewed-by: Liu Bo <bo.li.liu@oracle.com> > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > > V2: Removed dead assignment of inode->last_trans in btrfs_file_write_iter > (and the respective comment) since it's useless now. Added stable to > cc because it's a data loss fix. > > fs/btrfs/file.c | 45 ++++++++++++++++++++++++++++----------------- > 1 file changed, 28 insertions(+), 17 deletions(-) > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > index 2bd72cd..b7334c9 100644 > --- a/fs/btrfs/file.c > +++ b/fs/btrfs/file.c > @@ -1811,22 +1811,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, > mutex_unlock(&inode->i_mutex); > > /* > - * we want to make sure fsync finds this change > - * but we haven't joined a transaction running right now. > - * > - * Later on, someone is sure to update the inode and get the > - * real transid recorded. > - * > - * We set last_trans now to the fs_info generation + 1, > - * this will either be one more than the running transaction > - * or the generation used for the next transaction if there isn't > - * one running right now. > - * > * We also have to set last_sub_trans to the current log transid, > * otherwise subsequent syncs to a file that's been synced in this > * transaction will appear to have already occured. > */ > - BTRFS_I(inode)->last_trans = root->fs_info->generation + 1; By thinking twice about it, how about setting ->last_trans with (-1ULL)? So the benefit is that if new writes have already finished its endio where calling btrfs_set_inode_last_trans() to set ->last_trans with a transid at that age, we may get a win for skipping log part if someone else has updated ->last_trans_committed. By limiting it to 'full_sync' case we lose the above opportunity. Thanks, -liubo > BTRFS_I(inode)->last_sub_trans = root->log_transid; > if (num_written > 0) { > err = generic_write_sync(file, pos, num_written); > @@ -1971,14 +1959,37 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > } > > /* > - * if the last transaction that changed this file was before > - * the current transaction, we can bail out now without any > - * syncing > + * If the last transaction that changed this file was before the current > + * transaction and we have the full sync flag set in our inode, we can > + * bail out now without any syncing. > + * > + * Note that we can't bail out if the full sync flag isn't set. This is > + * because when the full sync flag is set we start all ordered extents > + * and wait for them to fully complete - when they complete they update > + * the inode's last_trans field through: > + * > + * btrfs_finish_ordered_io() -> > + * btrfs_update_inode_fallback() -> > + * btrfs_update_inode() -> > + * btrfs_set_inode_last_trans() > + * > + * So we are sure that last_trans is up to date and can do this check to > + * bail out safely. For the fast path, when the full sync flag is not > + * set in our inode, we can not do it because we start only our ordered > + * extents and don't wait for them to complete (that is when > + * btrfs_finish_ordered_io runs), so here at this point their last_trans > + * value might be less than or equals to fs_info->last_trans_committed, > + * and setting a speculative last_trans for an inode when a buffered > + * write is made (such as fs_info->generation + 1 for example) would not > + * be reliable since after setting the value and before fsync is called > + * any number of transactions can start and commit (transaction kthread > + * commits the current transaction periodically), and a transaction > + * commit does not start nor waits for ordered extents to complete. > */ > smp_mb(); > if (btrfs_inode_in_log(inode, root->fs_info->generation) || > - BTRFS_I(inode)->last_trans <= > - root->fs_info->last_trans_committed) { > + (full_sync && BTRFS_I(inode)->last_trans <= > + root->fs_info->last_trans_committed)) { > BTRFS_I(inode)->last_trans = 0; > > /* > -- > 2.1.3 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Mar 3, 2015 at 12:41 AM, Liu Bo <bo.li.liu@oracle.com> wrote: > On Sun, Mar 01, 2015 at 09:08:38AM +0000, Filipe Manana wrote: >> When using the fast file fsync code path we can miss the fact that new >> writes happened since the last file fsync and therefore return without >> waiting for the IO to finish and write the new extents to the fsync log. >> >> Here's an example scenario where the fsync will miss the fact that new >> file data exists that wasn't yet durably persisted: >> >> 1. fs_info->last_trans_committed == N - 1 and current transaction is >> transaction N (fs_info->generation == N); >> >> 2. do a buffered write; >> >> 3. fsync our inode, this clears our inode's full sync flag, starts >> an ordered extent and waits for it to complete - when it completes >> at btrfs_finish_ordered_io(), the inode's last_trans is set to the >> value N (via btrfs_update_inode_fallback -> btrfs_update_inode -> >> btrfs_set_inode_last_trans); >> >> 4. transaction N is committed, so fs_info->last_trans_committed is now >> set to the value N and fs_info->generation remains with the value N; >> >> 5. do another buffered write, when this happens btrfs_file_write_iter >> sets our inode's last_trans to the value N + 1 (that is >> fs_info->generation + 1 == N + 1); >> >> 6. transaction N + 1 is started and fs_info->generation now has the >> value N + 1; >> >> 7. transaction N + 1 is committed, so fs_info->last_trans_committed >> is set to the value N + 1; >> >> 8. fsync our inode - because it doesn't have the full sync flag set, >> we only start the ordered extent, we don't wait for it to complete >> (only in a later phase) therefore its last_trans field has the >> value N + 1 set previously by btrfs_file_write_iter(), and so we >> have: >> >> inode->last_trans <= fs_info->last_trans_committed >> (N + 1) (N + 1) >> >> Which made us not log the last buffered write and exit the fsync >> handler immediately, returning success (0) to user space and resulting >> in data loss after a crash. >> >> This can actually be triggered deterministically and the following excerpt >> from a testcase I made for xfstests triggers the issue. It moves a dummy >> file across directories and then fsyncs the old parent directory - this >> is just to trigger a transaction commit, so moving files around isn't >> directly related to the issue but it was chosen because running 'sync' for >> example does more than just committing the current transaction, as it >> flushes/waits for all file data to be persisted. The issue can also happen >> at random periods, since the transaction kthread periodicaly commits the >> current transaction (about every 30 seconds by default). >> The body of the test is: >> >> _scratch_mkfs >> $seqres.full 2>&1 >> _init_flakey >> _mount_flakey >> >> # Create our main test file 'foo', the one we check for data loss. >> # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync' >> # bit from its flags (btrfs inode specific flags). >> $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \ >> -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io >> >> # Now create one other file and 2 directories. We will move this second file >> # from one directory to the other later because it forces btrfs to commit its >> # currently open transaction if we fsync the old parent directory. This is >> # necessary to trigger the data loss bug that affected btrfs. >> mkdir $SCRATCH_MNT/testdir_1 >> touch $SCRATCH_MNT/testdir_1/bar >> mkdir $SCRATCH_MNT/testdir_2 >> >> # Make sure everything is durably persisted. >> sync >> >> # Write more 8Kb of data to our file. >> $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io >> >> # Move our 'bar' file into a new directory. >> mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar >> >> # Fsync our first directory. Because it had a file moved into some other >> # directory, this made btrfs commit the currently open transaction. This is >> # a condition necessary to trigger the data loss bug. >> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1 >> >> # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of >> # data we wrote previously to be persisted and available if a crash happens. >> # This did not happen with btrfs, because of the transaction commit that >> # happened when we fsynced the parent directory. >> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo >> >> # Simulate a crash/power loss. >> _load_flakey_table $FLAKEY_DROP_WRITES >> _unmount_flakey >> >> _load_flakey_table $FLAKEY_ALLOW_WRITES >> _mount_flakey >> >> # Now check that all data we wrote before are available. >> echo "File content after log replay:" >> od -t x1 $SCRATCH_MNT/foo >> >> status=0 >> exit >> >> The expected golden output for the test, which is what we get with this >> fix applied (or when running against ext3/4 and xfs), is: >> >> wrote 8192/8192 bytes at offset 0 >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) >> wrote 8192/8192 bytes at offset 8192 >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) >> File content after log replay: >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa >> * >> 0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb >> * >> 0040000 >> >> Without this fix applied, the output shows the test file does not have >> the second 8Kb extent that we successfully fsynced: >> >> wrote 8192/8192 bytes at offset 0 >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) >> wrote 8192/8192 bytes at offset 8192 >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) >> File content after log replay: >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa >> * >> 0020000 >> >> So fix this by skipping the fsync only if we're doing a full sync and >> if the inode's last_trans is <= fs_info->last_trans_committed, or if >> the inode is already in the log. Also remove setting the inode's >> last_trans in btrfs_file_write_iter since it's useless/unreliable. >> >> A test case for xfstests will be sent soon. >> >> CC: <stable@vger.kernel.org> >> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> >> Signed-off-by: Filipe Manana <fdmanana@suse.com> >> --- >> >> V2: Removed dead assignment of inode->last_trans in btrfs_file_write_iter >> (and the respective comment) since it's useless now. Added stable to >> cc because it's a data loss fix. >> >> fs/btrfs/file.c | 45 ++++++++++++++++++++++++++++----------------- >> 1 file changed, 28 insertions(+), 17 deletions(-) >> >> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c >> index 2bd72cd..b7334c9 100644 >> --- a/fs/btrfs/file.c >> +++ b/fs/btrfs/file.c >> @@ -1811,22 +1811,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, >> mutex_unlock(&inode->i_mutex); >> >> /* >> - * we want to make sure fsync finds this change >> - * but we haven't joined a transaction running right now. >> - * >> - * Later on, someone is sure to update the inode and get the >> - * real transid recorded. >> - * >> - * We set last_trans now to the fs_info generation + 1, >> - * this will either be one more than the running transaction >> - * or the generation used for the next transaction if there isn't >> - * one running right now. >> - * >> * We also have to set last_sub_trans to the current log transid, >> * otherwise subsequent syncs to a file that's been synced in this >> * transaction will appear to have already occured. >> */ >> - BTRFS_I(inode)->last_trans = root->fs_info->generation + 1; > > By thinking twice about it, how about setting ->last_trans with (-1ULL)? > > So the benefit is that if new writes have already finished its endio where > calling btrfs_set_inode_last_trans() to set ->last_trans with a transid > at that age, we may get a win for skipping log part if someone else has > updated ->last_trans_committed. > > By limiting it to 'full_sync' case we lose the above opportunity. That still won't work. Imagine the following the scenario: 1) do 2 buffered writes to 2 different ranges of the inode - the inode's last_trans is set to (u64)-1; 2) writepages is called against the first range only (either the VM called it due to memory pressure or a ranged fsync like msync for example); 3) the ordered extent started by the previous writepages calls completes and sets inode->last_trans to N (N == current transaction id/generation); 4) transaction N commits; 5) fsync the file (either whole range or a range covering only the second dirty range) - this will bail out since last_trans == last_trans_committed, not logging the second dirty range. thanks > > Thanks, > > -liubo >> BTRFS_I(inode)->last_sub_trans = root->log_transid; >> if (num_written > 0) { >> err = generic_write_sync(file, pos, num_written); >> @@ -1971,14 +1959,37 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) >> } >> >> /* >> - * if the last transaction that changed this file was before >> - * the current transaction, we can bail out now without any >> - * syncing >> + * If the last transaction that changed this file was before the current >> + * transaction and we have the full sync flag set in our inode, we can >> + * bail out now without any syncing. >> + * >> + * Note that we can't bail out if the full sync flag isn't set. This is >> + * because when the full sync flag is set we start all ordered extents >> + * and wait for them to fully complete - when they complete they update >> + * the inode's last_trans field through: >> + * >> + * btrfs_finish_ordered_io() -> >> + * btrfs_update_inode_fallback() -> >> + * btrfs_update_inode() -> >> + * btrfs_set_inode_last_trans() >> + * >> + * So we are sure that last_trans is up to date and can do this check to >> + * bail out safely. For the fast path, when the full sync flag is not >> + * set in our inode, we can not do it because we start only our ordered >> + * extents and don't wait for them to complete (that is when >> + * btrfs_finish_ordered_io runs), so here at this point their last_trans >> + * value might be less than or equals to fs_info->last_trans_committed, >> + * and setting a speculative last_trans for an inode when a buffered >> + * write is made (such as fs_info->generation + 1 for example) would not >> + * be reliable since after setting the value and before fsync is called >> + * any number of transactions can start and commit (transaction kthread >> + * commits the current transaction periodically), and a transaction >> + * commit does not start nor waits for ordered extents to complete. >> */ >> smp_mb(); >> if (btrfs_inode_in_log(inode, root->fs_info->generation) || >> - BTRFS_I(inode)->last_trans <= >> - root->fs_info->last_trans_committed) { >> + (full_sync && BTRFS_I(inode)->last_trans <= >> + root->fs_info->last_trans_committed)) { >> BTRFS_I(inode)->last_trans = 0; >> >> /* >> -- >> 2.1.3 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Mar 03, 2015 at 11:15:17AM +0000, Filipe David Manana wrote: > On Tue, Mar 3, 2015 at 12:41 AM, Liu Bo <bo.li.liu@oracle.com> wrote: > > On Sun, Mar 01, 2015 at 09:08:38AM +0000, Filipe Manana wrote: > >> When using the fast file fsync code path we can miss the fact that new > >> writes happened since the last file fsync and therefore return without > >> waiting for the IO to finish and write the new extents to the fsync log. > >> > >> Here's an example scenario where the fsync will miss the fact that new > >> file data exists that wasn't yet durably persisted: > >> > >> 1. fs_info->last_trans_committed == N - 1 and current transaction is > >> transaction N (fs_info->generation == N); > >> > >> 2. do a buffered write; > >> > >> 3. fsync our inode, this clears our inode's full sync flag, starts > >> an ordered extent and waits for it to complete - when it completes > >> at btrfs_finish_ordered_io(), the inode's last_trans is set to the > >> value N (via btrfs_update_inode_fallback -> btrfs_update_inode -> > >> btrfs_set_inode_last_trans); > >> > >> 4. transaction N is committed, so fs_info->last_trans_committed is now > >> set to the value N and fs_info->generation remains with the value N; > >> > >> 5. do another buffered write, when this happens btrfs_file_write_iter > >> sets our inode's last_trans to the value N + 1 (that is > >> fs_info->generation + 1 == N + 1); > >> > >> 6. transaction N + 1 is started and fs_info->generation now has the > >> value N + 1; > >> > >> 7. transaction N + 1 is committed, so fs_info->last_trans_committed > >> is set to the value N + 1; > >> > >> 8. fsync our inode - because it doesn't have the full sync flag set, > >> we only start the ordered extent, we don't wait for it to complete > >> (only in a later phase) therefore its last_trans field has the > >> value N + 1 set previously by btrfs_file_write_iter(), and so we > >> have: > >> > >> inode->last_trans <= fs_info->last_trans_committed > >> (N + 1) (N + 1) > >> > >> Which made us not log the last buffered write and exit the fsync > >> handler immediately, returning success (0) to user space and resulting > >> in data loss after a crash. > >> > >> This can actually be triggered deterministically and the following excerpt > >> from a testcase I made for xfstests triggers the issue. It moves a dummy > >> file across directories and then fsyncs the old parent directory - this > >> is just to trigger a transaction commit, so moving files around isn't > >> directly related to the issue but it was chosen because running 'sync' for > >> example does more than just committing the current transaction, as it > >> flushes/waits for all file data to be persisted. The issue can also happen > >> at random periods, since the transaction kthread periodicaly commits the > >> current transaction (about every 30 seconds by default). > >> The body of the test is: > >> > >> _scratch_mkfs >> $seqres.full 2>&1 > >> _init_flakey > >> _mount_flakey > >> > >> # Create our main test file 'foo', the one we check for data loss. > >> # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync' > >> # bit from its flags (btrfs inode specific flags). > >> $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \ > >> -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io > >> > >> # Now create one other file and 2 directories. We will move this second file > >> # from one directory to the other later because it forces btrfs to commit its > >> # currently open transaction if we fsync the old parent directory. This is > >> # necessary to trigger the data loss bug that affected btrfs. > >> mkdir $SCRATCH_MNT/testdir_1 > >> touch $SCRATCH_MNT/testdir_1/bar > >> mkdir $SCRATCH_MNT/testdir_2 > >> > >> # Make sure everything is durably persisted. > >> sync > >> > >> # Write more 8Kb of data to our file. > >> $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io > >> > >> # Move our 'bar' file into a new directory. > >> mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar > >> > >> # Fsync our first directory. Because it had a file moved into some other > >> # directory, this made btrfs commit the currently open transaction. This is > >> # a condition necessary to trigger the data loss bug. > >> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1 > >> > >> # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of > >> # data we wrote previously to be persisted and available if a crash happens. > >> # This did not happen with btrfs, because of the transaction commit that > >> # happened when we fsynced the parent directory. > >> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo > >> > >> # Simulate a crash/power loss. > >> _load_flakey_table $FLAKEY_DROP_WRITES > >> _unmount_flakey > >> > >> _load_flakey_table $FLAKEY_ALLOW_WRITES > >> _mount_flakey > >> > >> # Now check that all data we wrote before are available. > >> echo "File content after log replay:" > >> od -t x1 $SCRATCH_MNT/foo > >> > >> status=0 > >> exit > >> > >> The expected golden output for the test, which is what we get with this > >> fix applied (or when running against ext3/4 and xfs), is: > >> > >> wrote 8192/8192 bytes at offset 0 > >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > >> wrote 8192/8192 bytes at offset 8192 > >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > >> File content after log replay: > >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa > >> * > >> 0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb > >> * > >> 0040000 > >> > >> Without this fix applied, the output shows the test file does not have > >> the second 8Kb extent that we successfully fsynced: > >> > >> wrote 8192/8192 bytes at offset 0 > >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > >> wrote 8192/8192 bytes at offset 8192 > >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > >> File content after log replay: > >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa > >> * > >> 0020000 > >> > >> So fix this by skipping the fsync only if we're doing a full sync and > >> if the inode's last_trans is <= fs_info->last_trans_committed, or if > >> the inode is already in the log. Also remove setting the inode's > >> last_trans in btrfs_file_write_iter since it's useless/unreliable. > >> > >> A test case for xfstests will be sent soon. > >> > >> CC: <stable@vger.kernel.org> > >> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> > >> Signed-off-by: Filipe Manana <fdmanana@suse.com> > >> --- > >> > >> V2: Removed dead assignment of inode->last_trans in btrfs_file_write_iter > >> (and the respective comment) since it's useless now. Added stable to > >> cc because it's a data loss fix. > >> > >> fs/btrfs/file.c | 45 ++++++++++++++++++++++++++++----------------- > >> 1 file changed, 28 insertions(+), 17 deletions(-) > >> > >> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > >> index 2bd72cd..b7334c9 100644 > >> --- a/fs/btrfs/file.c > >> +++ b/fs/btrfs/file.c > >> @@ -1811,22 +1811,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, > >> mutex_unlock(&inode->i_mutex); > >> > >> /* > >> - * we want to make sure fsync finds this change > >> - * but we haven't joined a transaction running right now. > >> - * > >> - * Later on, someone is sure to update the inode and get the > >> - * real transid recorded. > >> - * > >> - * We set last_trans now to the fs_info generation + 1, > >> - * this will either be one more than the running transaction > >> - * or the generation used for the next transaction if there isn't > >> - * one running right now. > >> - * > >> * We also have to set last_sub_trans to the current log transid, > >> * otherwise subsequent syncs to a file that's been synced in this > >> * transaction will appear to have already occured. > >> */ > >> - BTRFS_I(inode)->last_trans = root->fs_info->generation + 1; > > > > By thinking twice about it, how about setting ->last_trans with (-1ULL)? > > > > So the benefit is that if new writes have already finished its endio where > > calling btrfs_set_inode_last_trans() to set ->last_trans with a transid > > at that age, we may get a win for skipping log part if someone else has > > updated ->last_trans_committed. > > > > By limiting it to 'full_sync' case we lose the above opportunity. > > That still won't work. > > Imagine the following the scenario: > > 1) do 2 buffered writes to 2 different ranges of the inode - the > inode's last_trans is set to (u64)-1; > > 2) writepages is called against the first range only (either the VM > called it due to memory pressure or a ranged fsync like msync for > example); > > 3) the ordered extent started by the previous writepages calls > completes and sets inode->last_trans to N (N == current transaction > id/generation); > > 4) transaction N commits; > > 5) fsync the file (either whole range or a range covering only the > second dirty range) - this will bail out since last_trans == > last_trans_committed, not logging the second dirty range. Good explanation, it's true. Thanks, -liubo > > thanks > > > > > > Thanks, > > > > -liubo > >> BTRFS_I(inode)->last_sub_trans = root->log_transid; > >> if (num_written > 0) { > >> err = generic_write_sync(file, pos, num_written); > >> @@ -1971,14 +1959,37 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > >> } > >> > >> /* > >> - * if the last transaction that changed this file was before > >> - * the current transaction, we can bail out now without any > >> - * syncing > >> + * If the last transaction that changed this file was before the current > >> + * transaction and we have the full sync flag set in our inode, we can > >> + * bail out now without any syncing. > >> + * > >> + * Note that we can't bail out if the full sync flag isn't set. This is > >> + * because when the full sync flag is set we start all ordered extents > >> + * and wait for them to fully complete - when they complete they update > >> + * the inode's last_trans field through: > >> + * > >> + * btrfs_finish_ordered_io() -> > >> + * btrfs_update_inode_fallback() -> > >> + * btrfs_update_inode() -> > >> + * btrfs_set_inode_last_trans() > >> + * > >> + * So we are sure that last_trans is up to date and can do this check to > >> + * bail out safely. For the fast path, when the full sync flag is not > >> + * set in our inode, we can not do it because we start only our ordered > >> + * extents and don't wait for them to complete (that is when > >> + * btrfs_finish_ordered_io runs), so here at this point their last_trans > >> + * value might be less than or equals to fs_info->last_trans_committed, > >> + * and setting a speculative last_trans for an inode when a buffered > >> + * write is made (such as fs_info->generation + 1 for example) would not > >> + * be reliable since after setting the value and before fsync is called > >> + * any number of transactions can start and commit (transaction kthread > >> + * commits the current transaction periodically), and a transaction > >> + * commit does not start nor waits for ordered extents to complete. > >> */ > >> smp_mb(); > >> if (btrfs_inode_in_log(inode, root->fs_info->generation) || > >> - BTRFS_I(inode)->last_trans <= > >> - root->fs_info->last_trans_committed) { > >> + (full_sync && BTRFS_I(inode)->last_trans <= > >> + root->fs_info->last_trans_committed)) { > >> BTRFS_I(inode)->last_trans = 0; > >> > >> /* > >> -- > >> 2.1.3 > >> > >> -- > >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > >> the body of a message to majordomo@vger.kernel.org > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > Filipe David Manana, > > "Reasonable men adapt themselves to the world. > Unreasonable men adapt the world to themselves. > That's why all progress depends on unreasonable men." -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 2bd72cd..b7334c9 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1811,22 +1811,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, mutex_unlock(&inode->i_mutex); /* - * we want to make sure fsync finds this change - * but we haven't joined a transaction running right now. - * - * Later on, someone is sure to update the inode and get the - * real transid recorded. - * - * We set last_trans now to the fs_info generation + 1, - * this will either be one more than the running transaction - * or the generation used for the next transaction if there isn't - * one running right now. - * * We also have to set last_sub_trans to the current log transid, * otherwise subsequent syncs to a file that's been synced in this * transaction will appear to have already occured. */ - BTRFS_I(inode)->last_trans = root->fs_info->generation + 1; BTRFS_I(inode)->last_sub_trans = root->log_transid; if (num_written > 0) { err = generic_write_sync(file, pos, num_written); @@ -1971,14 +1959,37 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) } /* - * if the last transaction that changed this file was before - * the current transaction, we can bail out now without any - * syncing + * If the last transaction that changed this file was before the current + * transaction and we have the full sync flag set in our inode, we can + * bail out now without any syncing. + * + * Note that we can't bail out if the full sync flag isn't set. This is + * because when the full sync flag is set we start all ordered extents + * and wait for them to fully complete - when they complete they update + * the inode's last_trans field through: + * + * btrfs_finish_ordered_io() -> + * btrfs_update_inode_fallback() -> + * btrfs_update_inode() -> + * btrfs_set_inode_last_trans() + * + * So we are sure that last_trans is up to date and can do this check to + * bail out safely. For the fast path, when the full sync flag is not + * set in our inode, we can not do it because we start only our ordered + * extents and don't wait for them to complete (that is when + * btrfs_finish_ordered_io runs), so here at this point their last_trans + * value might be less than or equals to fs_info->last_trans_committed, + * and setting a speculative last_trans for an inode when a buffered + * write is made (such as fs_info->generation + 1 for example) would not + * be reliable since after setting the value and before fsync is called + * any number of transactions can start and commit (transaction kthread + * commits the current transaction periodically), and a transaction + * commit does not start nor waits for ordered extents to complete. */ smp_mb(); if (btrfs_inode_in_log(inode, root->fs_info->generation) || - BTRFS_I(inode)->last_trans <= - root->fs_info->last_trans_committed) { + (full_sync && BTRFS_I(inode)->last_trans <= + root->fs_info->last_trans_committed)) { BTRFS_I(inode)->last_trans = 0; /*