diff mbox series

btrfs: check delayed refs when we're checking if a ref exists

Message ID dc43a26265d37c71a53d6415ae2d352035fa6847.1712869574.git.josef@toxicpanda.com (mailing list archive)
State New, archived
Headers show
Series btrfs: check delayed refs when we're checking if a ref exists | expand

Commit Message

Josef Bacik April 11, 2024, 9:06 p.m. UTC
In the patch 78c52d9eb6b7 ("btrfs: check for refs on snapshot delete
resume") I added some code to handle file systems that had been
corrupted by a bug that incorrectly skipped updating the drop progress
key while dropping a snapshot.  This code would check to see if we had
already deleted our reference for a child block, and skip the deletion
if we had already.

Unfortunately there is a bug, as the check would only check the on-disk
references.  I made an incorrect assumption that blocks in an already
deleted snapshot that was having the deletion resume on mount wouldn't
be modified.

If we have 2 pending deleted snapshots that share blocks, we can easily
modify the rules for a block.  Take the following example

subvolume a exists, and subvolume b is a snapshot of subvolume a.  They
share references to block 1.  Block 1 will have 2 full references, one
for subvolume a and one for subvolume b, and it blocks to subvolume a.

When deleting subvolume a, we will drop our full reference for block 1,
and because we are the owner we will drop our full reference for all of
block 1's children, convert block 1 to FULL BACKREF, and add a shared
reference to all of block 1's children.

Then we will start the snapshot deletion of subvolume b.  We look up the
extent info for block 1, which checks delayed refs and tells us that
FULL BACKREF is set, so sets parent to the bytenr of block 1.  However
because this is a resumed snapshot deletion, we call into
check_ref_exists().  Because check_ref_exists() only looks at the disk,
it doesn't find the shared backref for the child of block 1, and thus
returns 0 and we skip deleting the reference for the child of block 1
and continue.  This orphans the child of block 1.

The fix is to lookup the delayed refs, similar to what we do in
btrfs_lookup_extent_info().  However we only care about whether the
reference exists or not.  If we fail to find our reference on disk, go
look up the bytenr in the delayed refs, and if it exists look for an
existing ref in the delayed ref head.  If that exists then we know we
can delete the reference safely and carry on.  If it doesn't exist we
know we have to skip over this block.

This bug has existed since I introduced this fix, however requires
having multiple deleted snapshots pending when we unmount.  Previously
this was difficult to do, because we committed the transaction on
snapshot delete.  However a recent change to that code removed the
delete, so it was much easier to reproduce this issue.  We noticed this
in production because our shutdown path stops the container on the
system, which deletes a bunch of subvolumes, and then reboots the box.
This gives us plenty of opportunities to hit this issue.  Looking at the
history we've seen this occasionally in production, but we had a big
spike recently thanks to faster machines getting put onto kernels with
the fix to no longer commit the transaction on subvolume delete.

Chris Mason wrote a reproducer which does the following

mount /dev/nvme4n1 /btrfs
btrfs subvol create /btrfs/s1
simoop -E -f 4k -n 200000 -z /btrfs/s1
while(true) ; do
	btrfs subvol snap /btrfs/s1 /btrfs/s2
	simoop -f 4k -n 200000 -r 10 -z /btrfs/s2
	btrfs subvol snap /btrfs/s2 /btrfs/s3
	btrfs balance start -dusage=80 /btrfs
	btrfs subvol del /btrfs/s2 /btrfs/s3
	umount /btrfs
	btrfsck /dev/nvme4n1 || exit 1
	mount /dev/nvme4n1 /btrfs
done

On the second loop this would fail consistently, with my patch it has
been running for hours and hasn't failed.

I also used dm-log-writes to capture the state of the failure so I could
debug the problem.  Using the existing failure case to test my patch
validated that it fixes the problem.

Fixes: 78c52d9eb6b7 ("btrfs: check for refs on snapshot delete resume")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/delayed-ref.c | 67 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/delayed-ref.h |  2 ++
 fs/btrfs/extent-tree.c | 51 ++++++++++++++++++++++++++++----
 3 files changed, 114 insertions(+), 6 deletions(-)

Comments

Filipe Manana April 12, 2024, 12:58 p.m. UTC | #1
On Thu, Apr 11, 2024 at 10:12 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> In the patch 78c52d9eb6b7 ("btrfs: check for refs on snapshot delete
> resume") I added some code to handle file systems that had been
> corrupted by a bug that incorrectly skipped updating the drop progress
> key while dropping a snapshot.  This code would check to see if we had
> already deleted our reference for a child block, and skip the deletion
> if we had already.
>
> Unfortunately there is a bug, as the check would only check the on-disk
> references.  I made an incorrect assumption that blocks in an already
> deleted snapshot that was having the deletion resume on mount wouldn't
> be modified.
>
> If we have 2 pending deleted snapshots that share blocks, we can easily
> modify the rules for a block.  Take the following example
>
> subvolume a exists, and subvolume b is a snapshot of subvolume a.  They
> share references to block 1.  Block 1 will have 2 full references, one
> for subvolume a and one for subvolume b, and it blocks to subvolume a.

Something's odd here with "and it blocks to subvolume a", blocks?
Belongs? Since it's owned by subvolume a.

>
> When deleting subvolume a, we will drop our full reference for block 1,
> and because we are the owner we will drop our full reference for all of
> block 1's children, convert block 1 to FULL BACKREF, and add a shared
> reference to all of block 1's children.

Ok, makes sense.

>
> Then we will start the snapshot deletion of subvolume b.  We look up the
> extent info for block 1, which checks delayed refs and tells us that
> FULL BACKREF is set, so sets parent to the bytenr of block 1.  However
> because this is a resumed snapshot deletion, we call into
> check_ref_exists().  Because check_ref_exists() only looks at the disk,
> it doesn't find the shared backref for the child of block 1, and thus
> returns 0 and we skip deleting the reference for the child of block 1
> and continue.  This orphans the child of block 1.
>
> The fix is to lookup the delayed refs, similar to what we do in
> btrfs_lookup_extent_info().  However we only care about whether the
> reference exists or not.  If we fail to find our reference on disk, go
> look up the bytenr in the delayed refs, and if it exists look for an
> existing ref in the delayed ref head.  If that exists then we know we
> can delete the reference safely and carry on.  If it doesn't exist we
> know we have to skip over this block.
>
> This bug has existed since I introduced this fix, however requires
> having multiple deleted snapshots pending when we unmount.  Previously
> this was difficult to do, because we committed the transaction on
> snapshot delete.  However a recent change to that code removed the
> delete, so it was much easier to reproduce this issue.

This last sentence, isn't clear because:

1) Recent change, which one? Can you mention the commit id and subject?

2) "removed the delete"? Shouldn't this be "removed the transaction
commit"? Otherwise it doesn't make sense.

> We noticed this
> in production because our shutdown path stops the container on the
> system, which deletes a bunch of subvolumes, and then reboots the box.
> This gives us plenty of opportunities to hit this issue.  Looking at the
> history we've seen this occasionally in production, but we had a big
> spike recently thanks to faster machines getting put onto kernels with
> the fix to no longer commit the transaction on subvolume delete.
>
> Chris Mason wrote a reproducer which does the following
>
> mount /dev/nvme4n1 /btrfs
> btrfs subvol create /btrfs/s1
> simoop -E -f 4k -n 200000 -z /btrfs/s1
> while(true) ; do
>         btrfs subvol snap /btrfs/s1 /btrfs/s2
>         simoop -f 4k -n 200000 -r 10 -z /btrfs/s2
>         btrfs subvol snap /btrfs/s2 /btrfs/s3
>         btrfs balance start -dusage=80 /btrfs
>         btrfs subvol del /btrfs/s2 /btrfs/s3
>         umount /btrfs
>         btrfsck /dev/nvme4n1 || exit 1
>         mount /dev/nvme4n1 /btrfs
> done
>
> On the second loop this would fail consistently, with my patch it has
> been running for hours and hasn't failed.

Is there a way to exercise this with fstests, using fsstress for example?

>
> I also used dm-log-writes to capture the state of the failure so I could
> debug the problem.  Using the existing failure case to test my patch
> validated that it fixes the problem.
>
> Fixes: 78c52d9eb6b7 ("btrfs: check for refs on snapshot delete resume")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/delayed-ref.c | 67 ++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/delayed-ref.h |  2 ++
>  fs/btrfs/extent-tree.c | 51 ++++++++++++++++++++++++++++----
>  3 files changed, 114 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index e44e62cf76bc..2d43cb4b2106 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -1297,6 +1297,73 @@ btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs, u64 byt
>         return find_ref_head(delayed_refs, bytenr, false);
>  }
>
> +static int find_comp(struct btrfs_delayed_ref_node *entry, u64 root, u64 parent)
> +{
> +       struct btrfs_delayed_tree_ref *ref;
> +       int type = parent ? BTRFS_SHARED_BLOCK_REF_KEY : BTRFS_TREE_BLOCK_REF_KEY;
> +
> +       if (type < entry->type)
> +               return -1;
> +       if (type > entry->type)
> +               return 1;
> +
> +       ref = btrfs_delayed_node_to_tree_ref(entry);
> +       if (type == BTRFS_TREE_BLOCK_REF_KEY) {
> +               if (root < ref->root)
> +                       return -1;
> +               if (root > ref->root)
> +                       return 1;
> +       } else {
> +               if (parent < ref->parent)
> +                       return -1;
> +               if (parent > ref->parent)
> +                       return 1;
> +       }
> +       return 0;
> +}
> +
> +/*
> + * Check to see if a given root/parent reference is attached to the head.  This
> + * only checks for BTRFS_ADD_DELAYED_REF references that match, as that
> + * indicates the reference exists for the given root or parent.  This is for
> + * tree blocks only.
> + *
> + * @head: the head of the bytenr we're searching.
> + * @root: the root objectid of the reference if it is a normal reference.
> + * @parent: the parent if this is a shared backref.
> + */
> +bool btrfs_find_delayed_tree_ref(struct btrfs_delayed_ref_head *head,
> +                                u64 root, u64 parent)
> +{
> +       struct rb_node *node;
> +       bool found = false;
> +

Shouldn't assert that head->mutex is held?

> +       spin_lock(&head->lock);
> +       node = head->ref_tree.rb_root.rb_node;
> +       while (node) {
> +               struct btrfs_delayed_ref_node *entry;
> +               int ret;
> +
> +               entry = rb_entry(node, struct btrfs_delayed_ref_node, ref_node);
> +               ret = find_comp(entry, root, parent);
> +               if (ret < 0)
> +                       node = node->rb_left;
> +               else if (ret > 0)
> +                       node = node->rb_right;
> +               else {
> +                       /*
> +                        * We only want to count ADD actions, as drops mean the
> +                        * ref doesn't exist.
> +                        */
> +                       if (entry->action == BTRFS_ADD_DELAYED_REF)
> +                               found = true;
> +                       break;
> +               }

All branches of the if statement should use { }, since the else is using them.
IIRC even chechpatch.pl used to complain about that.

> +       }
> +       spin_unlock(&head->lock);
> +       return found;
> +}
> +
>  void __cold btrfs_delayed_ref_exit(void)
>  {
>         kmem_cache_destroy(btrfs_delayed_ref_head_cachep);
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index b291147cb8ab..34c34db01cc6 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -397,6 +397,8 @@ int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
>  void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
>                                        u64 num_bytes);
>  bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
> +bool btrfs_find_delayed_tree_ref(struct btrfs_delayed_ref_head *head,
> +                                u64 root, u64 parent);
>
>  /*
>   * helper functions to cast a node into its container
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 42314604906a..4d98bd247342 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5428,23 +5428,62 @@ static int check_ref_exists(struct btrfs_trans_handle *trans,
>                             struct btrfs_root *root, u64 bytenr, u64 parent,
>                             int level)
>  {
> +       struct btrfs_delayed_ref_root *delayed_refs;
> +       struct btrfs_delayed_ref_head *head;
>         struct btrfs_path *path;
>         struct btrfs_extent_inline_ref *iref;
>         int ret;
> +       bool exists = false;
>
>         path = btrfs_alloc_path();
>         if (!path)
>                 return -ENOMEM;
> -
> +again:
>         ret = lookup_extent_backref(trans, path, &iref, bytenr,
>                                     root->fs_info->nodesize, parent,
>                                     root->root_key.objectid, level, 0);
> +       if (ret != -ENOENT) {
> +               /* If we get 0 then we found our reference, return 1, else

Another style nit, the comment should have the opening "/*" in a line by itself.

> +                * return the error if it's not -ENOENT;
> +                */
> +               btrfs_free_path(path);
> +               return (ret < 0 ) ? ret : 1;
> +       }
> +
> +       /*
> +        * We could have a delayed ref with this reference, so look it up while
> +        * we're holding the path open to make sure we don't race with the
> +        * delayed ref running.
> +        */
> +       delayed_refs = &trans->transaction->delayed_refs;
> +       spin_lock(&delayed_refs->lock);
> +       head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
> +       if (!head)
> +               goto out;
> +       if (!mutex_trylock(&head->mutex)) {
> +               /*
> +                * We're contended, means that the delayed ref is running, get a
> +                * reference and wait for the ref head to be complete and then
> +                * try again.
> +                */
> +               refcount_inc(&head->refs);
> +               spin_unlock(&delayed_refs->lock);
> +
> +               btrfs_release_path(path);
> +
> +               mutex_lock(&head->mutex);
> +               mutex_unlock(&head->mutex);
> +               btrfs_put_delayed_ref_head(head);
> +               goto again;
> +       }
> +
> +       exists = btrfs_find_delayed_tree_ref(head, root->root_key.objectid,
> +                                            parent);

Could all be placed in a single line, 84 characters is acceptable nowadays.

Anyway, apart from those clarifications/corrections in the changelog
and the code style nits, it looks correct to me.
So with those addressed:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> +       mutex_unlock(&head->mutex);
> +out:
> +       spin_unlock(&delayed_refs->lock);
>         btrfs_free_path(path);
> -       if (ret == -ENOENT)
> -               return 0;
> -       if (ret < 0)
> -               return ret;
> -       return 1;
> +       return exists ? 1 : 0;
>  }
>
>  /*
> --
> 2.43.0
>
>
Chris Mason April 12, 2024, 2:14 p.m. UTC | #2
On 4/12/24 8:58 AM, Filipe Manana wrote:
> On Thu, Apr 11, 2024 at 10:12 PM Josef Bacik <josef@toxicpanda.com> wrote:
[ ... ]
>> Chris Mason wrote a reproducer which does the following
>>
>> mount /dev/nvme4n1 /btrfs
>> btrfs subvol create /btrfs/s1
>> simoop -E -f 4k -n 200000 -z /btrfs/s1
>> while(true) ; do
>>         btrfs subvol snap /btrfs/s1 /btrfs/s2
>>         simoop -f 4k -n 200000 -r 10 -z /btrfs/s2
>>         btrfs subvol snap /btrfs/s2 /btrfs/s3
>>         btrfs balance start -dusage=80 /btrfs
>>         btrfs subvol del /btrfs/s2 /btrfs/s3
>>         umount /btrfs
>>         btrfsck /dev/nvme4n1 || exit 1
>>         mount /dev/nvme4n1 /btrfs
>> done
>>
>> On the second loop this would fail consistently, with my patch it has
>> been running for hours and hasn't failed.
> 
> Is there a way to exercise this with fstests, using fsstress for example?
> 

Probably?  In this case, simoop is making a directory tree with 200,000
files, 4K in size, with fallocate.  The fallocate part doesn't matter,
it's just faster.

The first simoop (with -E) exits immediately after creating the files.

The simoop inside the loop:
  - reuses the directory tree.
  - runs for 10 seconds.
  - has 16 threads doing 2MB reads and writes into the subvolume.
  - basically translates to: cow some things.

I experimented a little with smaller repros, but they didn't
consistently hit the corruption.  On all the machines I've tried, the
simoop version only needs two loops to trigger.

As far as I know, fs_mark is the only other similar utility focused on
quickly creating big directory trees, but I'm sure there are others.

-chris
diff mbox series

Patch

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index e44e62cf76bc..2d43cb4b2106 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -1297,6 +1297,73 @@  btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs, u64 byt
 	return find_ref_head(delayed_refs, bytenr, false);
 }
 
+static int find_comp(struct btrfs_delayed_ref_node *entry, u64 root, u64 parent)
+{
+	struct btrfs_delayed_tree_ref *ref;
+	int type = parent ? BTRFS_SHARED_BLOCK_REF_KEY : BTRFS_TREE_BLOCK_REF_KEY;
+
+	if (type < entry->type)
+		return -1;
+	if (type > entry->type)
+		return 1;
+
+	ref = btrfs_delayed_node_to_tree_ref(entry);
+	if (type == BTRFS_TREE_BLOCK_REF_KEY) {
+		if (root < ref->root)
+			return -1;
+		if (root > ref->root)
+			return 1;
+	} else {
+		if (parent < ref->parent)
+			return -1;
+		if (parent > ref->parent)
+			return 1;
+	}
+	return 0;
+}
+
+/*
+ * Check to see if a given root/parent reference is attached to the head.  This
+ * only checks for BTRFS_ADD_DELAYED_REF references that match, as that
+ * indicates the reference exists for the given root or parent.  This is for
+ * tree blocks only.
+ *
+ * @head: the head of the bytenr we're searching.
+ * @root: the root objectid of the reference if it is a normal reference.
+ * @parent: the parent if this is a shared backref.
+ */
+bool btrfs_find_delayed_tree_ref(struct btrfs_delayed_ref_head *head,
+				 u64 root, u64 parent)
+{
+	struct rb_node *node;
+	bool found = false;
+
+	spin_lock(&head->lock);
+	node = head->ref_tree.rb_root.rb_node;
+	while (node) {
+		struct btrfs_delayed_ref_node *entry;
+		int ret;
+
+		entry = rb_entry(node, struct btrfs_delayed_ref_node, ref_node);
+		ret = find_comp(entry, root, parent);
+		if (ret < 0)
+			node = node->rb_left;
+		else if (ret > 0)
+			node = node->rb_right;
+		else {
+			/*
+			 * We only want to count ADD actions, as drops mean the
+			 * ref doesn't exist.
+			 */
+			if (entry->action == BTRFS_ADD_DELAYED_REF)
+				found = true;
+			break;
+		}
+	}
+	spin_unlock(&head->lock);
+	return found;
+}
+
 void __cold btrfs_delayed_ref_exit(void)
 {
 	kmem_cache_destroy(btrfs_delayed_ref_head_cachep);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index b291147cb8ab..34c34db01cc6 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -397,6 +397,8 @@  int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
 void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
 				       u64 num_bytes);
 bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
+bool btrfs_find_delayed_tree_ref(struct btrfs_delayed_ref_head *head,
+				 u64 root, u64 parent);
 
 /*
  * helper functions to cast a node into its container
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 42314604906a..4d98bd247342 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5428,23 +5428,62 @@  static int check_ref_exists(struct btrfs_trans_handle *trans,
 			    struct btrfs_root *root, u64 bytenr, u64 parent,
 			    int level)
 {
+	struct btrfs_delayed_ref_root *delayed_refs;
+	struct btrfs_delayed_ref_head *head;
 	struct btrfs_path *path;
 	struct btrfs_extent_inline_ref *iref;
 	int ret;
+	bool exists = false;
 
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-
+again:
 	ret = lookup_extent_backref(trans, path, &iref, bytenr,
 				    root->fs_info->nodesize, parent,
 				    root->root_key.objectid, level, 0);
+	if (ret != -ENOENT) {
+		/* If we get 0 then we found our reference, return 1, else
+		 * return the error if it's not -ENOENT;
+		 */
+		btrfs_free_path(path);
+		return (ret < 0 ) ? ret : 1;
+	}
+
+	/*
+	 * We could have a delayed ref with this reference, so look it up while
+	 * we're holding the path open to make sure we don't race with the
+	 * delayed ref running.
+	 */
+	delayed_refs = &trans->transaction->delayed_refs;
+	spin_lock(&delayed_refs->lock);
+	head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
+	if (!head)
+		goto out;
+	if (!mutex_trylock(&head->mutex)) {
+		/*
+		 * We're contended, means that the delayed ref is running, get a
+		 * reference and wait for the ref head to be complete and then
+		 * try again.
+		 */
+		refcount_inc(&head->refs);
+		spin_unlock(&delayed_refs->lock);
+
+		btrfs_release_path(path);
+
+		mutex_lock(&head->mutex);
+		mutex_unlock(&head->mutex);
+		btrfs_put_delayed_ref_head(head);
+		goto again;
+	}
+
+	exists = btrfs_find_delayed_tree_ref(head, root->root_key.objectid,
+					     parent);
+	mutex_unlock(&head->mutex);
+out:
+	spin_unlock(&delayed_refs->lock);
 	btrfs_free_path(path);
-	if (ret == -ENOENT)
-		return 0;
-	if (ret < 0)
-		return ret;
-	return 1;
+	return exists ? 1 : 0;
 }
 
 /*