Message ID | 1436453447-27466-1-git-send-email-fdmanana@kernel.org (mailing list archive) |
---|---|
State | Accepted |
Headers | show |
wrote on 2015/07/09 15:50 +0100: > From: Filipe Manana <fdmanana@suse.com> > > When we have an extent that got N references removed and N new references > added in the same transaction, we must run the insertion of the references > first because otherwise the last removed reference will remove the extent > item from the extent tree, resulting in a failure for the insertions. > > This is a regression introduced in the 4.2-rc1 release and this fix just > brings back the behaviour of selecting reference additions before any > reference removals. > Thanks, Filipe, that's right, it's my fault to forgot such case. Acked-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Thanks, Qu > The following test case for fstests reproduces the issue: > > seq=`basename $0` > seqres=$RESULT_DIR/$seq > echo "QA output created by $seq" > tmp=/tmp/$$ > status=1 # failure is the default! > trap "_cleanup; exit \$status" 0 1 2 3 15 > > _cleanup() > { > _cleanup_flakey > rm -f $tmp.* > } > > # get standard environment, filters and checks > . ./common/rc > . ./common/filter > . ./common/dmflakey > > # real QA test starts here > _need_to_be_root > _supported_fs btrfs > _supported_os Linux > _require_scratch > _require_dm_flakey > _require_cloner > _require_metadata_journaling $SCRATCH_DEV > > rm -f $seqres.full > > _scratch_mkfs >>$seqres.full 2>&1 > _init_flakey > _mount_flakey > > # Create prealloc extent covering range [160K, 620K[ > $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo > > # Now write to the last 80K of the prealloc extent plus 40K to the unallocated > # space that immediately follows it. This creates a new extent of 40K that spans > # the range [620K, 660K[. > $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io > > # At this point, there are now 2 back references to the prealloc extent in our > # extent tree. Both are for our file offset 160K and one relates to a file > # extent item with a data offset of 0 and a length of 380K, while the other > # relates to a file extent item with a data offset of 380K and a length of 80K. > > # Make sure everything done so far is durably persisted (all back references are > # in the extent tree, etc). > sync > > # Now clone all extents of our file that cover the offset 160K up to its eof > # (660K at this point) into itself at offset 2M. This leaves a hole in the file > # covering the range [660K, 2M[. The prealloc extent will now be referenced by > # the file twice, once for offset 160K and once for offset 2M. The 40K extent > # that follows the prealloc extent will also be referenced twice by our file, > # once for offset 620K and once for offset 2M + 460K. > $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \ > $SCRATCH_MNT/foo > > # Now create one new extent in our file with a size of 100Kb. It will span the > # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the > # range [2M + 460K, 3M[. Our new file size is 3M + 100K. > $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io > > # At this point, there are now (in memory) 4 back references to the prealloc > # extent. > # > # Two of them are for file offset 160K, related to file extent items > # matching the file offsets 160K and 540K respectively, with data offsets of > # 0 and 380K respectively, and with lengths of 380K and 80K respectively. > # > # The other two references are for file offset 2M, related to file extent items > # matching the file offsets 2M and 2M + 380K respectively, with data offsets of > # 0 and 380K respectively, and with lengths of 389K and 80K respectively. > # > # The 40K extent has 2 back references, one for file offset 620K and the other > # for file offset 2M + 460K. > # > # The 100K extent has a single back reference and it relates to file offset 3M. > > # Now clone our 100K extent into offset 600K. That offset covers the last 20K > # of the prealloc extent, the whole 40K extent and 40K of the hole starting at > # offset 660K. > $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \ > $SCRATCH_MNT/foo $SCRATCH_MNT/foo > > # At this point there's only one reference to the 40K extent, at file offset > # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset > # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for > # file offset 3M and a new one for file offset 600K). > > # Now fsync our file to make all its new data and metadata updates are durably > # persisted and present if a power failure/crash happens after a successful > # fsync and before the next transaction commit. > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo > > echo "File digest before power failure:" > md5sum $SCRATCH_MNT/foo | _filter_scratch > > # Silently drop all writes and ummount to simulate a crash/power failure. > _load_flakey_table $FLAKEY_DROP_WRITES > _unmount_flakey > > # Allow writes again, mount to trigger log replay and validate file contents. > # During log replay, the btrfs delayed references implementation used to run the > # deletion of back references before the addition of new back references, which > # made the addition fail as it didn't find the key in the extent tree that it > # was looking for. The failure triggered by this test was related to the 40K > # extent, which got 1 reference dropped and 1 reference added during the fsync > # log replay - when running the delayed references at transaction commit time, > # btrfs was applying the deletion before the insertion, resulting in a failure > # of the insertion that ended up turning the fs into read-only mode. > _load_flakey_table $FLAKEY_ALLOW_WRITES > _mount_flakey > > echo "File digest after log replay:" > md5sum $SCRATCH_MNT/foo | _filter_scratch > > _unmount_flakey > > status=0 > exit > > This issue turned the filesystem into read-only mode (current transaction > aborted) and produced the following traces: > > [ 8247.578385] ------------[ cut here ]------------ > [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]() > (...) > [ 8247.601697] Call Trace: > [ 8247.602222] [<ffffffff8145f077>] dump_stack+0x4f/0x7b > [ 8247.604320] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb > [ 8247.605488] [<ffffffffa0506c8d>] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs] > [ 8247.608226] [<ffffffffa0506c8d>] lookup_inline_extent_backref+0x17d/0x45d [btrfs] > [ 8247.617061] [<ffffffffa0507957>] insert_inline_extent_backref+0x41/0xb2 [btrfs] > [ 8247.621856] [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs] > [ 8247.624366] [<ffffffffa050ee60>] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs] > [ 8247.626176] [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs] > [ 8247.627435] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 > [ 8247.628531] [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs] > (...) > [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]--- > > [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]() > [ 8247.728954] BTRFS: Transaction aborted (error -5) > (...) > [ 8247.760866] Call Trace: > [ 8247.761534] [<ffffffff8145f077>] dump_stack+0x4f/0x7b > [ 8247.764271] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb > [ 8247.767582] [<ffffffffa0510e04>] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs] > [ 8247.769373] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48 > [ 8247.770836] [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs] > [ 8247.772532] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 > [ 8247.773664] [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs] > [ 8247.775047] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf > [ 8247.776176] [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189 > [ 8247.777427] [<ffffffffa055a920>] btrfs_recover_log_trees+0x2da/0x33d [btrfs] > [ 8247.778575] [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc [btrfs] > [ 8247.779838] [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs] > [ 8247.781020] [<ffffffff81120f48>] ? register_shrinker+0x56/0x81 > [ 8247.782285] [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs] > (...) > [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]--- > [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure > [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree) > > Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > fs/btrfs/extent-tree.c | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 1c2bd17..bccceea5 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2296,9 +2296,22 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, > static inline struct btrfs_delayed_ref_node * > select_delayed_ref(struct btrfs_delayed_ref_head *head) > { > + struct btrfs_delayed_ref_node *ref; > + > if (list_empty(&head->ref_list)) > return NULL; > > + /* > + * Select a delayed ref of type BTRFS_ADD_DELAYED_REF first. > + * This is to prevent a ref count from going down to zero, which deletes > + * the extent item from the extent tree, when here still are references > + * to add, which would fail because they would not find the extent item. > + */ > + list_for_each_entry(ref, &head->ref_list, list) { > + if (ref->action == BTRFS_ADD_DELAYED_REF) > + return ref; > + } > + > return list_entry(head->ref_list.next, struct btrfs_delayed_ref_node, > list); > } > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Filipe, Sorry for the late reply after it is already merged, but I'm a little concerned about the extra loop to find the first inc delayed ref. It may take some extra time when there are a lot of delayed refs. What about allowing deleting the extent item at dec delayed ref time and then add it back for later inc delayed refs? Thanks, Qu wrote on 2015/07/09 15:50 +0100: > From: Filipe Manana <fdmanana@suse.com> > > When we have an extent that got N references removed and N new references > added in the same transaction, we must run the insertion of the references > first because otherwise the last removed reference will remove the extent > item from the extent tree, resulting in a failure for the insertions. > > This is a regression introduced in the 4.2-rc1 release and this fix just > brings back the behaviour of selecting reference additions before any > reference removals. > > The following test case for fstests reproduces the issue: > > seq=`basename $0` > seqres=$RESULT_DIR/$seq > echo "QA output created by $seq" > tmp=/tmp/$$ > status=1 # failure is the default! > trap "_cleanup; exit \$status" 0 1 2 3 15 > > _cleanup() > { > _cleanup_flakey > rm -f $tmp.* > } > > # get standard environment, filters and checks > . ./common/rc > . ./common/filter > . ./common/dmflakey > > # real QA test starts here > _need_to_be_root > _supported_fs btrfs > _supported_os Linux > _require_scratch > _require_dm_flakey > _require_cloner > _require_metadata_journaling $SCRATCH_DEV > > rm -f $seqres.full > > _scratch_mkfs >>$seqres.full 2>&1 > _init_flakey > _mount_flakey > > # Create prealloc extent covering range [160K, 620K[ > $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo > > # Now write to the last 80K of the prealloc extent plus 40K to the unallocated > # space that immediately follows it. This creates a new extent of 40K that spans > # the range [620K, 660K[. > $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io > > # At this point, there are now 2 back references to the prealloc extent in our > # extent tree. Both are for our file offset 160K and one relates to a file > # extent item with a data offset of 0 and a length of 380K, while the other > # relates to a file extent item with a data offset of 380K and a length of 80K. > > # Make sure everything done so far is durably persisted (all back references are > # in the extent tree, etc). > sync > > # Now clone all extents of our file that cover the offset 160K up to its eof > # (660K at this point) into itself at offset 2M. This leaves a hole in the file > # covering the range [660K, 2M[. The prealloc extent will now be referenced by > # the file twice, once for offset 160K and once for offset 2M. The 40K extent > # that follows the prealloc extent will also be referenced twice by our file, > # once for offset 620K and once for offset 2M + 460K. > $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \ > $SCRATCH_MNT/foo > > # Now create one new extent in our file with a size of 100Kb. It will span the > # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the > # range [2M + 460K, 3M[. Our new file size is 3M + 100K. > $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io > > # At this point, there are now (in memory) 4 back references to the prealloc > # extent. > # > # Two of them are for file offset 160K, related to file extent items > # matching the file offsets 160K and 540K respectively, with data offsets of > # 0 and 380K respectively, and with lengths of 380K and 80K respectively. > # > # The other two references are for file offset 2M, related to file extent items > # matching the file offsets 2M and 2M + 380K respectively, with data offsets of > # 0 and 380K respectively, and with lengths of 389K and 80K respectively. > # > # The 40K extent has 2 back references, one for file offset 620K and the other > # for file offset 2M + 460K. > # > # The 100K extent has a single back reference and it relates to file offset 3M. > > # Now clone our 100K extent into offset 600K. That offset covers the last 20K > # of the prealloc extent, the whole 40K extent and 40K of the hole starting at > # offset 660K. > $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \ > $SCRATCH_MNT/foo $SCRATCH_MNT/foo > > # At this point there's only one reference to the 40K extent, at file offset > # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset > # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for > # file offset 3M and a new one for file offset 600K). > > # Now fsync our file to make all its new data and metadata updates are durably > # persisted and present if a power failure/crash happens after a successful > # fsync and before the next transaction commit. > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo > > echo "File digest before power failure:" > md5sum $SCRATCH_MNT/foo | _filter_scratch > > # Silently drop all writes and ummount to simulate a crash/power failure. > _load_flakey_table $FLAKEY_DROP_WRITES > _unmount_flakey > > # Allow writes again, mount to trigger log replay and validate file contents. > # During log replay, the btrfs delayed references implementation used to run the > # deletion of back references before the addition of new back references, which > # made the addition fail as it didn't find the key in the extent tree that it > # was looking for. The failure triggered by this test was related to the 40K > # extent, which got 1 reference dropped and 1 reference added during the fsync > # log replay - when running the delayed references at transaction commit time, > # btrfs was applying the deletion before the insertion, resulting in a failure > # of the insertion that ended up turning the fs into read-only mode. > _load_flakey_table $FLAKEY_ALLOW_WRITES > _mount_flakey > > echo "File digest after log replay:" > md5sum $SCRATCH_MNT/foo | _filter_scratch > > _unmount_flakey > > status=0 > exit > > This issue turned the filesystem into read-only mode (current transaction > aborted) and produced the following traces: > > [ 8247.578385] ------------[ cut here ]------------ > [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]() > (...) > [ 8247.601697] Call Trace: > [ 8247.602222] [<ffffffff8145f077>] dump_stack+0x4f/0x7b > [ 8247.604320] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb > [ 8247.605488] [<ffffffffa0506c8d>] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs] > [ 8247.608226] [<ffffffffa0506c8d>] lookup_inline_extent_backref+0x17d/0x45d [btrfs] > [ 8247.617061] [<ffffffffa0507957>] insert_inline_extent_backref+0x41/0xb2 [btrfs] > [ 8247.621856] [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs] > [ 8247.624366] [<ffffffffa050ee60>] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs] > [ 8247.626176] [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs] > [ 8247.627435] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 > [ 8247.628531] [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs] > (...) > [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]--- > > [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]() > [ 8247.728954] BTRFS: Transaction aborted (error -5) > (...) > [ 8247.760866] Call Trace: > [ 8247.761534] [<ffffffff8145f077>] dump_stack+0x4f/0x7b > [ 8247.764271] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb > [ 8247.767582] [<ffffffffa0510e04>] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs] > [ 8247.769373] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48 > [ 8247.770836] [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs] > [ 8247.772532] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 > [ 8247.773664] [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs] > [ 8247.775047] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf > [ 8247.776176] [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189 > [ 8247.777427] [<ffffffffa055a920>] btrfs_recover_log_trees+0x2da/0x33d [btrfs] > [ 8247.778575] [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc [btrfs] > [ 8247.779838] [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs] > [ 8247.781020] [<ffffffff81120f48>] ? register_shrinker+0x56/0x81 > [ 8247.782285] [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs] > (...) > [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]--- > [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure > [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree) > > Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > fs/btrfs/extent-tree.c | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 1c2bd17..bccceea5 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2296,9 +2296,22 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, > static inline struct btrfs_delayed_ref_node * > select_delayed_ref(struct btrfs_delayed_ref_head *head) > { > + struct btrfs_delayed_ref_node *ref; > + > if (list_empty(&head->ref_list)) > return NULL; > > + /* > + * Select a delayed ref of type BTRFS_ADD_DELAYED_REF first. > + * This is to prevent a ref count from going down to zero, which deletes > + * the extent item from the extent tree, when here still are references > + * to add, which would fail because they would not find the extent item. > + */ > + list_for_each_entry(ref, &head->ref_list, list) { > + if (ref->action == BTRFS_ADD_DELAYED_REF) > + return ref; > + } > + > return list_entry(head->ref_list.next, struct btrfs_delayed_ref_node, > list); > } > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 27, 2015 at 7:53 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > Hi Filipe, Hi Qu, > > Sorry for the late reply after it is already merged, > but I'm a little concerned about the extra loop to find the first inc > delayed ref. > > It may take some extra time when there are a lot of delayed refs. > > What about allowing deleting the extent item at dec delayed ref time and > then add it back for later inc delayed refs? So, the reason I did it this way is simplicity - it's pretty much what the pre 4.2-rc1 [1] code did, but slightly more efficient because it iterates a linked list rather than a red black tree using rb_next(). So it seems somewhat odd that you're worrying about this after a functional fix when we pretty much always had this behavior in place... Given the complexity of what you propose, I would prefer if we have a benchmark that indeed shows this is a performance critical area (I don't think this list can get that huge, but I might be wrong). I also like the current approach of triggering a warning/bug_on/assert if the extent item isn't found in the extent tree - it helps detecting more quickly if a logic bug elsewhere exists. But anyway, I'm ok if you are willing to implement such approach if there's indeed a test/benchmark to justify more complex/optimized code. thanks [1] - https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/extent-tree.c?id=refs/tags/v4.1.3#n2327 > > Thanks, > Qu > > wrote on 2015/07/09 15:50 +0100: >> >> From: Filipe Manana <fdmanana@suse.com> >> >> >> When we have an extent that got N references removed and N new references >> added in the same transaction, we must run the insertion of the references >> first because otherwise the last removed reference will remove the extent >> item from the extent tree, resulting in a failure for the insertions. >> >> This is a regression introduced in the 4.2-rc1 release and this fix just >> brings back the behaviour of selecting reference additions before any >> reference removals. >> >> The following test case for fstests reproduces the issue: >> >> seq=`basename $0` >> seqres=$RESULT_DIR/$seq >> echo "QA output created by $seq" >> tmp=/tmp/$$ >> status=1 # failure is the default! >> trap "_cleanup; exit \$status" 0 1 2 3 15 >> >> _cleanup() >> { >> _cleanup_flakey >> rm -f $tmp.* >> } >> >> # get standard environment, filters and checks >> . ./common/rc >> . ./common/filter >> . ./common/dmflakey >> >> # real QA test starts here >> _need_to_be_root >> _supported_fs btrfs >> _supported_os Linux >> _require_scratch >> _require_dm_flakey >> _require_cloner >> _require_metadata_journaling $SCRATCH_DEV >> >> rm -f $seqres.full >> >> _scratch_mkfs >>$seqres.full 2>&1 >> _init_flakey >> _mount_flakey >> >> # Create prealloc extent covering range [160K, 620K[ >> $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo >> >> # Now write to the last 80K of the prealloc extent plus 40K to the >> unallocated >> # space that immediately follows it. This creates a new extent of 40K >> that spans >> # the range [620K, 660K[. >> $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | >> _filter_xfs_io >> >> # At this point, there are now 2 back references to the prealloc extent >> in our >> # extent tree. Both are for our file offset 160K and one relates to a >> file >> # extent item with a data offset of 0 and a length of 380K, while the >> other >> # relates to a file extent item with a data offset of 380K and a length >> of 80K. >> >> # Make sure everything done so far is durably persisted (all back >> references are >> # in the extent tree, etc). >> sync >> >> # Now clone all extents of our file that cover the offset 160K up to >> its eof >> # (660K at this point) into itself at offset 2M. This leaves a hole in >> the file >> # covering the range [660K, 2M[. The prealloc extent will now be >> referenced by >> # the file twice, once for offset 160K and once for offset 2M. The 40K >> extent >> # that follows the prealloc extent will also be referenced twice by our >> file, >> # once for offset 620K and once for offset 2M + 460K. >> $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 >> $SCRATCH_MNT/foo \ >> $SCRATCH_MNT/foo >> >> # Now create one new extent in our file with a size of 100Kb. It will >> span the >> # range [3M, 3M + 100K[. It also will cause creation of a hole spanning >> the >> # range [2M + 460K, 3M[. Our new file size is 3M + 100K. >> $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | >> _filter_xfs_io >> >> # At this point, there are now (in memory) 4 back references to the >> prealloc >> # extent. >> # >> # Two of them are for file offset 160K, related to file extent items >> # matching the file offsets 160K and 540K respectively, with data >> offsets of >> # 0 and 380K respectively, and with lengths of 380K and 80K >> respectively. >> # >> # The other two references are for file offset 2M, related to file >> extent items >> # matching the file offsets 2M and 2M + 380K respectively, with data >> offsets of >> # 0 and 380K respectively, and with lengths of 389K and 80K >> respectively. >> # >> # The 40K extent has 2 back references, one for file offset 620K and >> the other >> # for file offset 2M + 460K. >> # >> # The 100K extent has a single back reference and it relates to file >> offset 3M. >> >> # Now clone our 100K extent into offset 600K. That offset covers the >> last 20K >> # of the prealloc extent, the whole 40K extent and 40K of the hole >> starting at >> # offset 660K. >> $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * >> 1024)) \ >> $SCRATCH_MNT/foo $SCRATCH_MNT/foo >> >> # At this point there's only one reference to the 40K extent, at file >> offset >> # 2M + 460K, we have 4 references for the prealloc extent (2 for file >> offset >> # 160K and 2 for file offset 2M) and 2 references for the 100K extent >> (1 for >> # file offset 3M and a new one for file offset 600K). >> >> # Now fsync our file to make all its new data and metadata updates are >> durably >> # persisted and present if a power failure/crash happens after a >> successful >> # fsync and before the next transaction commit. >> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo >> >> echo "File digest before power failure:" >> md5sum $SCRATCH_MNT/foo | _filter_scratch >> >> # Silently drop all writes and ummount to simulate a crash/power >> failure. >> _load_flakey_table $FLAKEY_DROP_WRITES >> _unmount_flakey >> >> # Allow writes again, mount to trigger log replay and validate file >> contents. >> # During log replay, the btrfs delayed references implementation used >> to run the >> # deletion of back references before the addition of new back >> references, which >> # made the addition fail as it didn't find the key in the extent tree >> that it >> # was looking for. The failure triggered by this test was related to >> the 40K >> # extent, which got 1 reference dropped and 1 reference added during >> the fsync >> # log replay - when running the delayed references at transaction >> commit time, >> # btrfs was applying the deletion before the insertion, resulting in a >> failure >> # of the insertion that ended up turning the fs into read-only mode. >> _load_flakey_table $FLAKEY_ALLOW_WRITES >> _mount_flakey >> >> echo "File digest after log replay:" >> md5sum $SCRATCH_MNT/foo | _filter_scratch >> >> _unmount_flakey >> >> status=0 >> exit >> >> This issue turned the filesystem into read-only mode (current transaction >> aborted) and produced the following traces: >> >> [ 8247.578385] ------------[ cut here ]------------ >> [ 8247.579947] WARNING: CPU: 0 PID: 11341 at >> fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d >> [btrfs]() >> (...) >> [ 8247.601697] Call Trace: >> [ 8247.602222] [<ffffffff8145f077>] dump_stack+0x4f/0x7b >> [ 8247.604320] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb >> [ 8247.605488] [<ffffffffa0506c8d>] ? >> lookup_inline_extent_backref+0x17d/0x45d [btrfs] >> [ 8247.608226] [<ffffffffa0506c8d>] >> lookup_inline_extent_backref+0x17d/0x45d [btrfs] >> [ 8247.617061] [<ffffffffa0507957>] >> insert_inline_extent_backref+0x41/0xb2 [btrfs] >> [ 8247.621856] [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a >> [btrfs] >> [ 8247.624366] [<ffffffffa050ee60>] >> __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs] >> [ 8247.626176] [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 >> [btrfs] >> [ 8247.627435] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 >> [ 8247.628531] [<ffffffffa0520482>] >> btrfs_commit_transaction+0x4c/0xa20 [btrfs] >> (...) >> [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]--- >> >> [ 8247.727263] WARNING: CPU: 3 PID: 11341 at >> fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]() >> [ 8247.728954] BTRFS: Transaction aborted (error -5) >> (...) >> [ 8247.760866] Call Trace: >> [ 8247.761534] [<ffffffff8145f077>] dump_stack+0x4f/0x7b >> [ 8247.764271] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb >> [ 8247.767582] [<ffffffffa0510e04>] ? >> btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs] >> [ 8247.769373] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48 >> [ 8247.770836] [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 >> [btrfs] >> [ 8247.772532] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 >> [ 8247.773664] [<ffffffffa0520482>] >> btrfs_commit_transaction+0x4c/0xa20 [btrfs] >> [ 8247.775047] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf >> [ 8247.776176] [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189 >> [ 8247.777427] [<ffffffffa055a920>] >> btrfs_recover_log_trees+0x2da/0x33d [btrfs] >> [ 8247.778575] [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc >> [btrfs] >> [ 8247.779838] [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs] >> [ 8247.781020] [<ffffffff81120f48>] ? register_shrinker+0x56/0x81 >> [ 8247.782285] [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs] >> (...) >> [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]--- >> [ 8247.794276] BTRFS: error (device dm-0) in >> btrfs_run_delayed_refs:2771: errno=-5 IO failure >> [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: >> errno=-5 IO failure (Failed to recover log tree) >> >> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root >> in ref_head.") >> Signed-off-by: Filipe Manana <fdmanana@suse.com> >> >> --- >> fs/btrfs/extent-tree.c | 13 +++++++++++++ >> 1 file changed, 13 insertions(+) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 1c2bd17..bccceea5 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -2296,9 +2296,22 @@ static int run_one_delayed_ref(struct >> btrfs_trans_handle *trans, >> static inline struct btrfs_delayed_ref_node * >> select_delayed_ref(struct btrfs_delayed_ref_head *head) >> { >> + struct btrfs_delayed_ref_node *ref; >> + >> if (list_empty(&head->ref_list)) >> return NULL; >> >> + /* >> + * Select a delayed ref of type BTRFS_ADD_DELAYED_REF first. >> + * This is to prevent a ref count from going down to zero, which >> deletes >> + * the extent item from the extent tree, when here still are >> references >> + * to add, which would fail because they would not find the extent >> item. >> + */ >> + list_for_each_entry(ref, &head->ref_list, list) { >> + if (ref->action == BTRFS_ADD_DELAYED_REF) >> + return ref; >> + } >> + >> return list_entry(head->ref_list.next, struct >> btrfs_delayed_ref_node, >> list); >> } >> > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Filipe Manana wrote on 2015/07/27 09:26 +0100: > On Mon, Jul 27, 2015 at 7:53 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: >> Hi Filipe, > > Hi Qu, > >> >> Sorry for the late reply after it is already merged, >> but I'm a little concerned about the extra loop to find the first inc >> delayed ref. >> >> It may take some extra time when there are a lot of delayed refs. >> >> What about allowing deleting the extent item at dec delayed ref time and >> then add it back for later inc delayed refs? > > So, the reason I did it this way is simplicity - it's pretty much what > the pre 4.2-rc1 [1] code did, but slightly more efficient because it > iterates a linked list rather than a red black tree using rb_next(). > So it seems somewhat odd that you're worrying about this after a > functional fix when we pretty much always had this behavior in > place... > > Given the complexity of what you propose, I would prefer if we have a > benchmark that indeed shows this is a performance critical area (I > don't think this list can get that huge, but I might be wrong). I agree with you that this may not be a performance hotspot. And in fact, before the qgroup patchset, we tested the list implement of delayed ref for performance regression, and the difference is so little that can be easily ignored. So I think even we skip the loop, the performance will only change little. One of the main purpose to change the delayed ref to list, is to make it a little more straightforward, without the hard to understand/review rb tree compare. So I hope there is another more simple and straightforward one. But as you mentioned, your fix is already simple enough, and after a short code search, my idea to fix may not be as simple as I thought. So sorry for the disturb, and thanks for your fix again. Thanks, Qu > > I also like the current approach of triggering a warning/bug_on/assert > if the extent item isn't found in the extent tree - it helps detecting > more quickly if a logic bug elsewhere exists. > > But anyway, I'm ok if you are willing to implement such approach if > there's indeed a test/benchmark to justify more complex/optimized > code. > > thanks > > [1] - https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/extent-tree.c?id=refs/tags/v4.1.3#n2327 > >> >> Thanks, >> Qu >> >> wrote on 2015/07/09 15:50 +0100: >>> >>> From: Filipe Manana <fdmanana@suse.com> >>> >>> >>> When we have an extent that got N references removed and N new references >>> added in the same transaction, we must run the insertion of the references >>> first because otherwise the last removed reference will remove the extent >>> item from the extent tree, resulting in a failure for the insertions. >>> >>> This is a regression introduced in the 4.2-rc1 release and this fix just >>> brings back the behaviour of selecting reference additions before any >>> reference removals. >>> >>> The following test case for fstests reproduces the issue: >>> >>> seq=`basename $0` >>> seqres=$RESULT_DIR/$seq >>> echo "QA output created by $seq" >>> tmp=/tmp/$$ >>> status=1 # failure is the default! >>> trap "_cleanup; exit \$status" 0 1 2 3 15 >>> >>> _cleanup() >>> { >>> _cleanup_flakey >>> rm -f $tmp.* >>> } >>> >>> # get standard environment, filters and checks >>> . ./common/rc >>> . ./common/filter >>> . ./common/dmflakey >>> >>> # real QA test starts here >>> _need_to_be_root >>> _supported_fs btrfs >>> _supported_os Linux >>> _require_scratch >>> _require_dm_flakey >>> _require_cloner >>> _require_metadata_journaling $SCRATCH_DEV >>> >>> rm -f $seqres.full >>> >>> _scratch_mkfs >>$seqres.full 2>&1 >>> _init_flakey >>> _mount_flakey >>> >>> # Create prealloc extent covering range [160K, 620K[ >>> $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo >>> >>> # Now write to the last 80K of the prealloc extent plus 40K to the >>> unallocated >>> # space that immediately follows it. This creates a new extent of 40K >>> that spans >>> # the range [620K, 660K[. >>> $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | >>> _filter_xfs_io >>> >>> # At this point, there are now 2 back references to the prealloc extent >>> in our >>> # extent tree. Both are for our file offset 160K and one relates to a >>> file >>> # extent item with a data offset of 0 and a length of 380K, while the >>> other >>> # relates to a file extent item with a data offset of 380K and a length >>> of 80K. >>> >>> # Make sure everything done so far is durably persisted (all back >>> references are >>> # in the extent tree, etc). >>> sync >>> >>> # Now clone all extents of our file that cover the offset 160K up to >>> its eof >>> # (660K at this point) into itself at offset 2M. This leaves a hole in >>> the file >>> # covering the range [660K, 2M[. The prealloc extent will now be >>> referenced by >>> # the file twice, once for offset 160K and once for offset 2M. The 40K >>> extent >>> # that follows the prealloc extent will also be referenced twice by our >>> file, >>> # once for offset 620K and once for offset 2M + 460K. >>> $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 >>> $SCRATCH_MNT/foo \ >>> $SCRATCH_MNT/foo >>> >>> # Now create one new extent in our file with a size of 100Kb. It will >>> span the >>> # range [3M, 3M + 100K[. It also will cause creation of a hole spanning >>> the >>> # range [2M + 460K, 3M[. Our new file size is 3M + 100K. >>> $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | >>> _filter_xfs_io >>> >>> # At this point, there are now (in memory) 4 back references to the >>> prealloc >>> # extent. >>> # >>> # Two of them are for file offset 160K, related to file extent items >>> # matching the file offsets 160K and 540K respectively, with data >>> offsets of >>> # 0 and 380K respectively, and with lengths of 380K and 80K >>> respectively. >>> # >>> # The other two references are for file offset 2M, related to file >>> extent items >>> # matching the file offsets 2M and 2M + 380K respectively, with data >>> offsets of >>> # 0 and 380K respectively, and with lengths of 389K and 80K >>> respectively. >>> # >>> # The 40K extent has 2 back references, one for file offset 620K and >>> the other >>> # for file offset 2M + 460K. >>> # >>> # The 100K extent has a single back reference and it relates to file >>> offset 3M. >>> >>> # Now clone our 100K extent into offset 600K. That offset covers the >>> last 20K >>> # of the prealloc extent, the whole 40K extent and 40K of the hole >>> starting at >>> # offset 660K. >>> $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * >>> 1024)) \ >>> $SCRATCH_MNT/foo $SCRATCH_MNT/foo >>> >>> # At this point there's only one reference to the 40K extent, at file >>> offset >>> # 2M + 460K, we have 4 references for the prealloc extent (2 for file >>> offset >>> # 160K and 2 for file offset 2M) and 2 references for the 100K extent >>> (1 for >>> # file offset 3M and a new one for file offset 600K). >>> >>> # Now fsync our file to make all its new data and metadata updates are >>> durably >>> # persisted and present if a power failure/crash happens after a >>> successful >>> # fsync and before the next transaction commit. >>> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo >>> >>> echo "File digest before power failure:" >>> md5sum $SCRATCH_MNT/foo | _filter_scratch >>> >>> # Silently drop all writes and ummount to simulate a crash/power >>> failure. >>> _load_flakey_table $FLAKEY_DROP_WRITES >>> _unmount_flakey >>> >>> # Allow writes again, mount to trigger log replay and validate file >>> contents. >>> # During log replay, the btrfs delayed references implementation used >>> to run the >>> # deletion of back references before the addition of new back >>> references, which >>> # made the addition fail as it didn't find the key in the extent tree >>> that it >>> # was looking for. The failure triggered by this test was related to >>> the 40K >>> # extent, which got 1 reference dropped and 1 reference added during >>> the fsync >>> # log replay - when running the delayed references at transaction >>> commit time, >>> # btrfs was applying the deletion before the insertion, resulting in a >>> failure >>> # of the insertion that ended up turning the fs into read-only mode. >>> _load_flakey_table $FLAKEY_ALLOW_WRITES >>> _mount_flakey >>> >>> echo "File digest after log replay:" >>> md5sum $SCRATCH_MNT/foo | _filter_scratch >>> >>> _unmount_flakey >>> >>> status=0 >>> exit >>> >>> This issue turned the filesystem into read-only mode (current transaction >>> aborted) and produced the following traces: >>> >>> [ 8247.578385] ------------[ cut here ]------------ >>> [ 8247.579947] WARNING: CPU: 0 PID: 11341 at >>> fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d >>> [btrfs]() >>> (...) >>> [ 8247.601697] Call Trace: >>> [ 8247.602222] [<ffffffff8145f077>] dump_stack+0x4f/0x7b >>> [ 8247.604320] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb >>> [ 8247.605488] [<ffffffffa0506c8d>] ? >>> lookup_inline_extent_backref+0x17d/0x45d [btrfs] >>> [ 8247.608226] [<ffffffffa0506c8d>] >>> lookup_inline_extent_backref+0x17d/0x45d [btrfs] >>> [ 8247.617061] [<ffffffffa0507957>] >>> insert_inline_extent_backref+0x41/0xb2 [btrfs] >>> [ 8247.621856] [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a >>> [btrfs] >>> [ 8247.624366] [<ffffffffa050ee60>] >>> __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs] >>> [ 8247.626176] [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 >>> [btrfs] >>> [ 8247.627435] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 >>> [ 8247.628531] [<ffffffffa0520482>] >>> btrfs_commit_transaction+0x4c/0xa20 [btrfs] >>> (...) >>> [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]--- >>> >>> [ 8247.727263] WARNING: CPU: 3 PID: 11341 at >>> fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]() >>> [ 8247.728954] BTRFS: Transaction aborted (error -5) >>> (...) >>> [ 8247.760866] Call Trace: >>> [ 8247.761534] [<ffffffff8145f077>] dump_stack+0x4f/0x7b >>> [ 8247.764271] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb >>> [ 8247.767582] [<ffffffffa0510e04>] ? >>> btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs] >>> [ 8247.769373] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48 >>> [ 8247.770836] [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 >>> [btrfs] >>> [ 8247.772532] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 >>> [ 8247.773664] [<ffffffffa0520482>] >>> btrfs_commit_transaction+0x4c/0xa20 [btrfs] >>> [ 8247.775047] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf >>> [ 8247.776176] [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189 >>> [ 8247.777427] [<ffffffffa055a920>] >>> btrfs_recover_log_trees+0x2da/0x33d [btrfs] >>> [ 8247.778575] [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc >>> [btrfs] >>> [ 8247.779838] [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs] >>> [ 8247.781020] [<ffffffff81120f48>] ? register_shrinker+0x56/0x81 >>> [ 8247.782285] [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs] >>> (...) >>> [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]--- >>> [ 8247.794276] BTRFS: error (device dm-0) in >>> btrfs_run_delayed_refs:2771: errno=-5 IO failure >>> [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: >>> errno=-5 IO failure (Failed to recover log tree) >>> >>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root >>> in ref_head.") >>> Signed-off-by: Filipe Manana <fdmanana@suse.com> >>> >>> --- >>> fs/btrfs/extent-tree.c | 13 +++++++++++++ >>> 1 file changed, 13 insertions(+) >>> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index 1c2bd17..bccceea5 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -2296,9 +2296,22 @@ static int run_one_delayed_ref(struct >>> btrfs_trans_handle *trans, >>> static inline struct btrfs_delayed_ref_node * >>> select_delayed_ref(struct btrfs_delayed_ref_head *head) >>> { >>> + struct btrfs_delayed_ref_node *ref; >>> + >>> if (list_empty(&head->ref_list)) >>> return NULL; >>> >>> + /* >>> + * Select a delayed ref of type BTRFS_ADD_DELAYED_REF first. >>> + * This is to prevent a ref count from going down to zero, which >>> deletes >>> + * the extent item from the extent tree, when here still are >>> references >>> + * to add, which would fail because they would not find the extent >>> item. >>> + */ >>> + list_for_each_entry(ref, &head->ref_list, list) { >>> + if (ref->action == BTRFS_ADD_DELAYED_REF) >>> + return ref; >>> + } >>> + >>> return list_entry(head->ref_list.next, struct >>> btrfs_delayed_ref_node, >>> list); >>> } >>> >> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 27, 2015 at 05:22:28PM +0800, Qu Wenruo wrote: > > > Filipe Manana wrote on 2015/07/27 09:26 +0100: > >On Mon, Jul 27, 2015 at 7:53 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > >>Hi Filipe, > > > >Hi Qu, > > > >> > >>Sorry for the late reply after it is already merged, > >>but I'm a little concerned about the extra loop to find the first inc > >>delayed ref. > >> > >>It may take some extra time when there are a lot of delayed refs. > >> > >>What about allowing deleting the extent item at dec delayed ref time and > >>then add it back for later inc delayed refs? > > > >So, the reason I did it this way is simplicity - it's pretty much what > >the pre 4.2-rc1 [1] code did, but slightly more efficient because it > >iterates a linked list rather than a red black tree using rb_next(). > >So it seems somewhat odd that you're worrying about this after a > >functional fix when we pretty much always had this behavior in > >place... > > > >Given the complexity of what you propose, I would prefer if we have a > >benchmark that indeed shows this is a performance critical area (I > >don't think this list can get that huge, but I might be wrong). > I agree with you that this may not be a performance hotspot. > > And in fact, before the qgroup patchset, we tested the list implement of > delayed ref for performance regression, and the difference is so little that > can be easily ignored. > So I think even we skip the loop, the performance will only change little. > > One of the main purpose to change the delayed ref to list, is to make it a > little more straightforward, without the hard to understand/review rb tree > compare. > So I hope there is another more simple and straightforward one. > > But as you mentioned, your fix is already simple enough, and after a short > code search, my idea to fix may not be as simple as I thought. > > So sorry for the disturb, and thanks for your fix again. Thanks for giving this some extra review. For now I'm happy with the tradeoff, but if it shows up in profiles we can look harder. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 1c2bd17..bccceea5 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2296,9 +2296,22 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, static inline struct btrfs_delayed_ref_node * select_delayed_ref(struct btrfs_delayed_ref_head *head) { + struct btrfs_delayed_ref_node *ref; + if (list_empty(&head->ref_list)) return NULL; + /* + * Select a delayed ref of type BTRFS_ADD_DELAYED_REF first. + * This is to prevent a ref count from going down to zero, which deletes + * the extent item from the extent tree, when here still are references + * to add, which would fail because they would not find the extent item. + */ + list_for_each_entry(ref, &head->ref_list, list) { + if (ref->action == BTRFS_ADD_DELAYED_REF) + return ref; + } + return list_entry(head->ref_list.next, struct btrfs_delayed_ref_node, list); }