diff mbox series

btrfs: qgroup: run delayed iputs after ordered extent completion

Message ID 2efe2794ecbfbfab1545a9341d3a1fb7464dc048.1728315195.git.fdmanana@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: qgroup: run delayed iputs after ordered extent completion | expand

Commit Message

Filipe Manana Oct. 7, 2024, 3:34 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When trying to flush qgroups in order to release space we run delayed
iputs in order to release space from recently deleted files (their link
counted reached zero), and then we start delalloc and wait for any
existing ordered extents to complete.

However there's a time window here where we end up not doing the final
iput on a deleted file which could release necessary space:

1) An unlink operation starts;

2) During the unlink, or right before it completes, delalloc is flushed
   and an ordered extent is created;

3) When the ordered extent is created, the inode's ref count is
   incremented (with igrab() at alloc_ordered_extent());

4) When the unlink finishes it doesn't drop the last reference on the
   inode and so it doesn't trigger inode eviction to delete all of
   the inode's items in its root and drop all references on its data
   extents;

5) Another task enters try_flush_qgroup() to try to release space,
   it runs all delayed iputs, but there's no delayed iput yet for that
   deleted file because the ordered extent hasn't completed yet;

6) Then at try_flush_qgroup() we wait for the ordered extent to complete
   and that results in adding a delayed iput at btrfs_put_ordered_extent()
   when called from btrfs_finish_one_ordered();

7) Adding the delayed iput results in waking the cleaner kthread if it's
   not running already. However it may take some time for it to be
   scheduled, or it may be running but busy running auto defrag, dropping
   deleted snapshots or doing other work, so by the time we return from
   try_flush_qgroup() the space for deleted file isn't released.

Improve on this by running delayed iputs only after flushing delalloc
and waiting for ordered extent completion.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/qgroup.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Qu Wenruo Oct. 7, 2024, 9:34 p.m. UTC | #1
在 2024/10/8 02:04, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> When trying to flush qgroups in order to release space we run delayed
> iputs in order to release space from recently deleted files (their link
> counted reached zero), and then we start delalloc and wait for any
> existing ordered extents to complete.
>
> However there's a time window here where we end up not doing the final
> iput on a deleted file which could release necessary space:
>
> 1) An unlink operation starts;
>
> 2) During the unlink, or right before it completes, delalloc is flushed
>     and an ordered extent is created;
>
> 3) When the ordered extent is created, the inode's ref count is
>     incremented (with igrab() at alloc_ordered_extent());
>
> 4) When the unlink finishes it doesn't drop the last reference on the
>     inode and so it doesn't trigger inode eviction to delete all of
>     the inode's items in its root and drop all references on its data
>     extents;
>
> 5) Another task enters try_flush_qgroup() to try to release space,
>     it runs all delayed iputs, but there's no delayed iput yet for that
>     deleted file because the ordered extent hasn't completed yet;
>
> 6) Then at try_flush_qgroup() we wait for the ordered extent to complete
>     and that results in adding a delayed iput at btrfs_put_ordered_extent()
>     when called from btrfs_finish_one_ordered();
>
> 7) Adding the delayed iput results in waking the cleaner kthread if it's
>     not running already. However it may take some time for it to be
>     scheduled, or it may be running but busy running auto defrag, dropping
>     deleted snapshots or doing other work, so by the time we return from
>     try_flush_qgroup() the space for deleted file isn't released.
>
> Improve on this by running delayed iputs only after flushing delalloc
> and waiting for ordered extent completion.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/qgroup.c | 11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index f68a26390589..bbc54dd154ec 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -4195,13 +4195,20 @@ static int try_flush_qgroup(struct btrfs_root *root)
>   		return 0;
>   	}
>
> -	btrfs_run_delayed_iputs(root->fs_info);
> -	btrfs_wait_on_delayed_iputs(root->fs_info);
>   	ret = btrfs_start_delalloc_snapshot(root, true);
>   	if (ret < 0)
>   		goto out;
>   	btrfs_wait_ordered_extents(root, U64_MAX, NULL);
>
> +	/*
> +	 * After waiting for ordered extents run delayed iputs in order to free
> +	 * space from unlinked files before committing the current transaction,
> +	 * as ordered extents may have been holding the last reference of an
> +	 * inode and they add a delayed iput when they complete.
> +	 */
> +	btrfs_run_delayed_iputs(root->fs_info);
> +	btrfs_wait_on_delayed_iputs(root->fs_info);
> +
>   	ret = btrfs_commit_current_transaction(root);
>   out:
>   	clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state);
diff mbox series

Patch

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index f68a26390589..bbc54dd154ec 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -4195,13 +4195,20 @@  static int try_flush_qgroup(struct btrfs_root *root)
 		return 0;
 	}
 
-	btrfs_run_delayed_iputs(root->fs_info);
-	btrfs_wait_on_delayed_iputs(root->fs_info);
 	ret = btrfs_start_delalloc_snapshot(root, true);
 	if (ret < 0)
 		goto out;
 	btrfs_wait_ordered_extents(root, U64_MAX, NULL);
 
+	/*
+	 * After waiting for ordered extents run delayed iputs in order to free
+	 * space from unlinked files before committing the current transaction,
+	 * as ordered extents may have been holding the last reference of an
+	 * inode and they add a delayed iput when they complete.
+	 */
+	btrfs_run_delayed_iputs(root->fs_info);
+	btrfs_wait_on_delayed_iputs(root->fs_info);
+
 	ret = btrfs_commit_current_transaction(root);
 out:
 	clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state);