diff mbox

[RFC] Btrfs: fix full backref problem when inserting shared block reference

Message ID 50232A19.2010704@cn.fujitsu.com (mailing list archive)
State New, archived
Headers show

Commit Message

Miao Xie Aug. 9, 2012, 3:10 a.m. UTC
If we create several snapshots at the same time, the following BUG_ON() will be
triggered.

	kernel BUG at fs/btrfs/extent-tree.c:6047!

Steps to reproduce:
 # mkfs.btrfs <partition>
 # mount <partition> <mnt>
 # cd <mnt>
 # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
 # for ((i=0; i<4; i++))
 > do
 > mkdir $i
 > for ((j=0; j<200; j++))
 > do
 > btrfs sub snap . $i/$j
 > done &
 > done

The reason is:
Before transaction commit, some operations changed the fs tree and new tree
blocks were allocated because of COW. We used the implicit non-shared back
reference for those newly allocated tree blocks because they were not shared by
two or more trees.

And then we created the first snapshot for the fs tree, according to the back
reference rules, we also used implicit back refs for the child tree blocks of
the root node of the fs tree, now those child nodes/leaves were shared by two
trees.

Then We didn't deal with the delayed references, and continued to change the fs
tree(created the second snapshot and inserted the dir item of the new snapshot
into the fs tree). According to the rules of the back reference, we added full
back refs for those tree blocks whose parents have be shared by two trees.
Now some newly allocated tree blocks had two types of the references.

As we know, the delayed reference system handles these delayed references from
back to front, and the full delayed reference is inserted after the implicit
ones. So when we dealt with the back references of those newly allocated tree
blocks, the full references was dealt with at first. And if the first reference
is a shared back reference and the tree block that the reference points to is
newly allocated, It would be considered as a tree block which is shared by two
or more trees when it is allocated and should be a full back reference not a
implicit one, the flag of its reference also should be set to FULL_BACKREF.
But in fact, it was a non-shared tree block with a implicit reference at
beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
was triggered.

We have several methods to fix this bug:
1. deal with delayed references after the snapshot is created and before we
   change the source tree of the snapshot. This is the easiest and safest way.
2. modify the sort method of the delayed reference tree, make the full delayed
   references be inserted before the implicit ones. It is also very easy, but
   I don't know if it will introduce some problems or not.
3. modify select_delayed_ref() and make it select the implicit delayed reference
   at first. This way is not so good because it may wastes CPU time if we have
   lots of delayed references.
4. set the flags to FULL_BACKREF, this method is a little complex comparing with
   the 1st way. 

I chose the 1st way to fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/transaction.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

Comments

David Sterba Aug. 9, 2012, 6:48 a.m. UTC | #1
On Thu, Aug 09, 2012 at 11:10:17AM +0800, Miao Xie wrote:
> I chose the 1st way to fix it.

The least I can say now is that it fixed the crash! The approach is
minimalistic and I think we can take it for now. I didn't review it,
only tested with reproducer you described plus modified the numbers up
and down, no problems so far, and the "wikipedia" test-subvol stresstest
that caused trouble to one of your patches is also ok. I'll do some more
testing on other machines and will report problems eventually.

Thanks!

david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Aug. 9, 2012, 7:21 a.m. UTC | #2
On Thu, Aug 09, 2012 at 08:48:02AM +0200, David Sterba wrote:
> and down, no problems so far, and the "wikipedia" test-subvol stresstest
> that caused trouble to one of your patches is also ok. I'll do some more
> testing on other machines and will report problems eventually.

So it won't be so easy :)

The test generated 15+ G of data, ~500 snapshots, then umount and fsck:

lots of

ref mismatch on [9655283712 4096] extent item 1, found 0
Incorrect local backref count on 9655283712 root 5 owner 589776 offset 110592 found 0 wanted 1 back 0x86badf0
backpointer mismatch on [9655283712 4096]
owner ref check failed [9655283712 4096]

and then

checking fs roots
root 2854 inode 233882 errors 2500
root 2880 inode 271639 errors 2200

and it's not finished yet, other types of error may pop up.

david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miao Xie Aug. 9, 2012, 7:50 a.m. UTC | #3
On Thu, 9 Aug 2012 09:21:29 +0200, David Sterba wrote:
> On Thu, Aug 09, 2012 at 08:48:02AM +0200, David Sterba wrote:
>> and down, no problems so far, and the "wikipedia" test-subvol stresstest
>> that caused trouble to one of your patches is also ok. I'll do some more
>> testing on other machines and will report problems eventually.
> 
> So it won't be so easy :)
> 
> The test generated 15+ G of data, ~500 snapshots, then umount and fsck:
> 
> lots of
> 
> ref mismatch on [9655283712 4096] extent item 1, found 0
> Incorrect local backref count on 9655283712 root 5 owner 589776 offset 110592 found 0 wanted 1 back 0x86badf0
> backpointer mismatch on [9655283712 4096]
> owner ref check failed [9655283712 4096]
> 
> and then
> 
> checking fs roots
> root 2854 inode 233882 errors 2500
> root 2880 inode 271639 errors 2200
> 
> and it's not finished yet, other types of error may pop up.

Thanks for your test, I'll look into it.

Regards 
Miao

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Aug. 9, 2012, 12:23 p.m. UTC | #4
On Wed, Aug 08, 2012 at 09:10:17PM -0600, Miao Xie wrote:
> If we create several snapshots at the same time, the following BUG_ON() will be
> triggered.
> 
> 	kernel BUG at fs/btrfs/extent-tree.c:6047!
> 
> Steps to reproduce:
>  # mkfs.btrfs <partition>
>  # mount <partition> <mnt>
>  # cd <mnt>
>  # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
>  # for ((i=0; i<4; i++))
>  > do
>  > mkdir $i
>  > for ((j=0; j<200; j++))
>  > do
>  > btrfs sub snap . $i/$j
>  > done &
>  > done
> 
> The reason is:
> Before transaction commit, some operations changed the fs tree and new tree
> blocks were allocated because of COW. We used the implicit non-shared back
> reference for those newly allocated tree blocks because they were not shared by
> two or more trees.
> 
> And then we created the first snapshot for the fs tree, according to the back
> reference rules, we also used implicit back refs for the child tree blocks of
> the root node of the fs tree, now those child nodes/leaves were shared by two
> trees.
> 
> Then We didn't deal with the delayed references, and continued to change the fs
> tree(created the second snapshot and inserted the dir item of the new snapshot
> into the fs tree). According to the rules of the back reference, we added full
> back refs for those tree blocks whose parents have be shared by two trees.
> Now some newly allocated tree blocks had two types of the references.
> 
> As we know, the delayed reference system handles these delayed references from
> back to front, and the full delayed reference is inserted after the implicit
> ones. So when we dealt with the back references of those newly allocated tree
> blocks, the full references was dealt with at first. And if the first reference
> is a shared back reference and the tree block that the reference points to is
> newly allocated, It would be considered as a tree block which is shared by two
> or more trees when it is allocated and should be a full back reference not a
> implicit one, the flag of its reference also should be set to FULL_BACKREF.
> But in fact, it was a non-shared tree block with a implicit reference at
> beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
> was triggered.
> 
> We have several methods to fix this bug:
> 1. deal with delayed references after the snapshot is created and before we
>    change the source tree of the snapshot. This is the easiest and safest way.
> 2. modify the sort method of the delayed reference tree, make the full delayed
>    references be inserted before the implicit ones. It is also very easy, but
>    I don't know if it will introduce some problems or not.

Thanks for tracking this down, FWIW I like option 2 the most, it would be
intereseting to see if it does actually introduce new issues.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason Aug. 9, 2012, 1:11 p.m. UTC | #5
On Thu, Aug 09, 2012 at 06:23:19AM -0600, Josef Bacik wrote:
> On Wed, Aug 08, 2012 at 09:10:17PM -0600, Miao Xie wrote:
> > If we create several snapshots at the same time, the following BUG_ON() will be
> > triggered.
> > 
> > 	kernel BUG at fs/btrfs/extent-tree.c:6047!
> > 
> > Steps to reproduce:
> >  # mkfs.btrfs <partition>
> >  # mount <partition> <mnt>
> >  # cd <mnt>
> >  # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
> >  # for ((i=0; i<4; i++))
> >  > do
> >  > mkdir $i
> >  > for ((j=0; j<200; j++))
> >  > do
> >  > btrfs sub snap . $i/$j
> >  > done &
> >  > done
> > 
> > The reason is:
> > Before transaction commit, some operations changed the fs tree and new tree
> > blocks were allocated because of COW. We used the implicit non-shared back
> > reference for those newly allocated tree blocks because they were not shared by
> > two or more trees.
> > 
> > And then we created the first snapshot for the fs tree, according to the back
> > reference rules, we also used implicit back refs for the child tree blocks of
> > the root node of the fs tree, now those child nodes/leaves were shared by two
> > trees.
> > 
> > Then We didn't deal with the delayed references, and continued to change the fs
> > tree(created the second snapshot and inserted the dir item of the new snapshot
> > into the fs tree). According to the rules of the back reference, we added full
> > back refs for those tree blocks whose parents have be shared by two trees.
> > Now some newly allocated tree blocks had two types of the references.
> > 
> > As we know, the delayed reference system handles these delayed references from
> > back to front, and the full delayed reference is inserted after the implicit
> > ones. So when we dealt with the back references of those newly allocated tree
> > blocks, the full references was dealt with at first. And if the first reference
> > is a shared back reference and the tree block that the reference points to is
> > newly allocated, It would be considered as a tree block which is shared by two
> > or more trees when it is allocated and should be a full back reference not a
> > implicit one, the flag of its reference also should be set to FULL_BACKREF.
> > But in fact, it was a non-shared tree block with a implicit reference at
> > beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
> > was triggered.
> > 
> > We have several methods to fix this bug:
> > 1. deal with delayed references after the snapshot is created and before we
> >    change the source tree of the snapshot. This is the easiest and safest way.
> > 2. modify the sort method of the delayed reference tree, make the full delayed
> >    references be inserted before the implicit ones. It is also very easy, but
> >    I don't know if it will introduce some problems or not.
> 
> Thanks for tracking this down, FWIW I like option 2 the most, it would be
> intereseting to see if it does actually introduce new issues.  Thanks,

For this release, I like the current patch ;)  Great job tracking it
down Miao.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Aug. 9, 2012, 1:12 p.m. UTC | #6
On Thu, Aug 09, 2012 at 07:11:09AM -0600, Chris L. Mason wrote:
> On Thu, Aug 09, 2012 at 06:23:19AM -0600, Josef Bacik wrote:
> > On Wed, Aug 08, 2012 at 09:10:17PM -0600, Miao Xie wrote:
> > > If we create several snapshots at the same time, the following BUG_ON() will be
> > > triggered.
> > > 
> > > 	kernel BUG at fs/btrfs/extent-tree.c:6047!
> > > 
> > > Steps to reproduce:
> > >  # mkfs.btrfs <partition>
> > >  # mount <partition> <mnt>
> > >  # cd <mnt>
> > >  # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
> > >  # for ((i=0; i<4; i++))
> > >  > do
> > >  > mkdir $i
> > >  > for ((j=0; j<200; j++))
> > >  > do
> > >  > btrfs sub snap . $i/$j
> > >  > done &
> > >  > done
> > > 
> > > The reason is:
> > > Before transaction commit, some operations changed the fs tree and new tree
> > > blocks were allocated because of COW. We used the implicit non-shared back
> > > reference for those newly allocated tree blocks because they were not shared by
> > > two or more trees.
> > > 
> > > And then we created the first snapshot for the fs tree, according to the back
> > > reference rules, we also used implicit back refs for the child tree blocks of
> > > the root node of the fs tree, now those child nodes/leaves were shared by two
> > > trees.
> > > 
> > > Then We didn't deal with the delayed references, and continued to change the fs
> > > tree(created the second snapshot and inserted the dir item of the new snapshot
> > > into the fs tree). According to the rules of the back reference, we added full
> > > back refs for those tree blocks whose parents have be shared by two trees.
> > > Now some newly allocated tree blocks had two types of the references.
> > > 
> > > As we know, the delayed reference system handles these delayed references from
> > > back to front, and the full delayed reference is inserted after the implicit
> > > ones. So when we dealt with the back references of those newly allocated tree
> > > blocks, the full references was dealt with at first. And if the first reference
> > > is a shared back reference and the tree block that the reference points to is
> > > newly allocated, It would be considered as a tree block which is shared by two
> > > or more trees when it is allocated and should be a full back reference not a
> > > implicit one, the flag of its reference also should be set to FULL_BACKREF.
> > > But in fact, it was a non-shared tree block with a implicit reference at
> > > beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
> > > was triggered.
> > > 
> > > We have several methods to fix this bug:
> > > 1. deal with delayed references after the snapshot is created and before we
> > >    change the source tree of the snapshot. This is the easiest and safest way.
> > > 2. modify the sort method of the delayed reference tree, make the full delayed
> > >    references be inserted before the implicit ones. It is also very easy, but
> > >    I don't know if it will introduce some problems or not.
> > 
> > Thanks for tracking this down, FWIW I like option 2 the most, it would be
> > intereseting to see if it does actually introduce new issues.  Thanks,
> 
> For this release, I like the current patch ;)  Great job tracking it
> down Miao.
> 

Well sure it's much cleaner, but it appears to not work right?

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason Aug. 9, 2012, 1:16 p.m. UTC | #7
On Thu, Aug 09, 2012 at 07:12:47AM -0600, Josef Bacik wrote:
> On Thu, Aug 09, 2012 at 07:11:09AM -0600, Chris L. Mason wrote:
> > On Thu, Aug 09, 2012 at 06:23:19AM -0600, Josef Bacik wrote:
> > > On Wed, Aug 08, 2012 at 09:10:17PM -0600, Miao Xie wrote:
> > > > If we create several snapshots at the same time, the following BUG_ON() will be
> > > > triggered.
> > > > 
> > > > 	kernel BUG at fs/btrfs/extent-tree.c:6047!
> > > > 
> > > > Steps to reproduce:
> > > >  # mkfs.btrfs <partition>
> > > >  # mount <partition> <mnt>
> > > >  # cd <mnt>
> > > >  # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
> > > >  # for ((i=0; i<4; i++))
> > > >  > do
> > > >  > mkdir $i
> > > >  > for ((j=0; j<200; j++))
> > > >  > do
> > > >  > btrfs sub snap . $i/$j
> > > >  > done &
> > > >  > done
> > > > 
> > > > The reason is:
> > > > Before transaction commit, some operations changed the fs tree and new tree
> > > > blocks were allocated because of COW. We used the implicit non-shared back
> > > > reference for those newly allocated tree blocks because they were not shared by
> > > > two or more trees.
> > > > 
> > > > And then we created the first snapshot for the fs tree, according to the back
> > > > reference rules, we also used implicit back refs for the child tree blocks of
> > > > the root node of the fs tree, now those child nodes/leaves were shared by two
> > > > trees.
> > > > 
> > > > Then We didn't deal with the delayed references, and continued to change the fs
> > > > tree(created the second snapshot and inserted the dir item of the new snapshot
> > > > into the fs tree). According to the rules of the back reference, we added full
> > > > back refs for those tree blocks whose parents have be shared by two trees.
> > > > Now some newly allocated tree blocks had two types of the references.
> > > > 
> > > > As we know, the delayed reference system handles these delayed references from
> > > > back to front, and the full delayed reference is inserted after the implicit
> > > > ones. So when we dealt with the back references of those newly allocated tree
> > > > blocks, the full references was dealt with at first. And if the first reference
> > > > is a shared back reference and the tree block that the reference points to is
> > > > newly allocated, It would be considered as a tree block which is shared by two
> > > > or more trees when it is allocated and should be a full back reference not a
> > > > implicit one, the flag of its reference also should be set to FULL_BACKREF.
> > > > But in fact, it was a non-shared tree block with a implicit reference at
> > > > beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
> > > > was triggered.
> > > > 
> > > > We have several methods to fix this bug:
> > > > 1. deal with delayed references after the snapshot is created and before we
> > > >    change the source tree of the snapshot. This is the easiest and safest way.
> > > > 2. modify the sort method of the delayed reference tree, make the full delayed
> > > >    references be inserted before the implicit ones. It is also very easy, but
> > > >    I don't know if it will introduce some problems or not.
> > > 
> > > Thanks for tracking this down, FWIW I like option 2 the most, it would be
> > > intereseting to see if it does actually introduce new issues.  Thanks,
> > 
> > For this release, I like the current patch ;)  Great job tracking it
> > down Miao.
> > 
> 
> Well sure it's much cleaner, but it appears to not work right?

I'm not sure if those are from a different bug.  I'll try to reproduce
as well.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miao Xie Aug. 10, 2012, 10:38 a.m. UTC | #8
On Thu, 9 Aug 2012 09:21:29 +0200, David Sterba wrote:
> On Thu, Aug 09, 2012 at 08:48:02AM +0200, David Sterba wrote:
>> and down, no problems so far, and the "wikipedia" test-subvol stresstest
>> that caused trouble to one of your patches is also ok. I'll do some more
>> testing on other machines and will report problems eventually.
> 
> So it won't be so easy :)
> 
> The test generated 15+ G of data, ~500 snapshots, then umount and fsck:
> 
> lots of
> 
> ref mismatch on [9655283712 4096] extent item 1, found 0
> Incorrect local backref count on 9655283712 root 5 owner 589776 offset 110592 found 0 wanted 1 back 0x86badf0
> backpointer mismatch on [9655283712 4096]
> owner ref check failed [9655283712 4096]
> 
> and then
> 
> checking fs roots
> root 2854 inode 233882 errors 2500
> root 2880 inode 271639 errors 2200
> 
> and it's not finished yet, other types of error may pop up.


Could you try Arne's patch?

[PATCH v2] Btrfs: fix race in run_clustered_refs
http://marc.info/?l=linux-btrfs&m=134449329717830&w=2

I run test for several times with this patch, and the problem didn't happen.
So it seems this patch can fix the above problem.

Regards
Miao

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 7ac7cdc..2eafbd2 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1119,6 +1119,11 @@  static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	ret = btrfs_reloc_post_snapshot(trans, pending);
 	if (ret)
 		goto abort_trans;
+
+	ret = btrfs_run_delayed_refs(trans, root, (unsigned long)-1);
+	if (ret)
+		goto abort_trans;
+
 	ret = 0;
 fail:
 	kfree(new_root_item);