diff mbox series

btrfs: handle free space tree rebuild in multiple transactions

Message ID 58dac27acbab72124549718201bec971491b5b1a.1734420572.git.wqu@suse.com (mailing list archive)
State New
Headers show
Series btrfs: handle free space tree rebuild in multiple transactions | expand

Commit Message

Qu Wenruo Dec. 17, 2024, 7:30 a.m. UTC
During free space tree rebuild, we're holding on transaction handler for
the whole rebuild process.

This will cause blocked task warning for btrfs-transaction kernel
thread, as during the rebuild, btrfs-transaction kthread has to wait for
the running transaction we're holding for free space tree rebuild.

On a large enough btrfs, we have thousands of block groups to go
through, thus it will definitely take over 120s and trigger the blocked
task warning.

Fix the problem by handling 32 block groups in one transaction, and end
the transaction when we hit the 32 block groups threshold.

This will allow the btrfs-transaction kthread to commit the transaction
when needed.

And even if during the rebuild the system lost its power, we are still
fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount
we will still rebuild the tree, without utilizing the half built one.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/free-space-tree.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Filipe Manana Dec. 17, 2024, 8:45 a.m. UTC | #1
On Tue, Dec 17, 2024 at 7:31 AM Qu Wenruo <wqu@suse.com> wrote:
>
> During free space tree rebuild, we're holding on transaction handler for
> the whole rebuild process.
>
> This will cause blocked task warning for btrfs-transaction kernel
> thread, as during the rebuild, btrfs-transaction kthread has to wait for
> the running transaction we're holding for free space tree rebuild.

Do you have a stack trace?
Does that happen where exactly, in the btrfs_attach_transaction() call
chain, while waiting on a wait queue?

>
> On a large enough btrfs, we have thousands of block groups to go
> through, thus it will definitely take over 120s and trigger the blocked
> task warning.
>
> Fix the problem by handling 32 block groups in one transaction, and end
> the transaction when we hit the 32 block groups threshold.
>
> This will allow the btrfs-transaction kthread to commit the transaction
> when needed.
>
> And even if during the rebuild the system lost its power, we are still
> fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount
> we will still rebuild the tree, without utilizing the half built one.

What about if a crash happens and we already processed some block
groups and the transaction got committed?

On the next mount, when we call populate_free_space_tree() for the
same block groups, because we always start from the first one, won't
we get an -EEXIST and fail the mount?
For example, add_new_free_space_info() doesn't ignore an -EEXIST when
it calls btrfs_insert_empty_item(), so we will fail the mount when
trying to build the tree.

>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/free-space-tree.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
> index 7ba50e133921..d8f334724092 100644
> --- a/fs/btrfs/free-space-tree.c
> +++ b/fs/btrfs/free-space-tree.c
> @@ -1312,6 +1312,8 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info)
>         return btrfs_commit_transaction(trans);
>  }
>
> +/* How many block groups can be handled in one transaction. */
> +#define FREE_SPACE_TREE_REBUILD_BATCH  (32)
>  int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)

Please put a blank line after the macro declaration and place the
macro at the top of the file before the C code.

>  {
>         struct btrfs_trans_handle *trans;
> @@ -1322,6 +1324,7 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
>         };
>         struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key);
>         struct rb_node *node;
> +       unsigned int handled = 0;
>         int ret;
>
>         trans = btrfs_start_transaction(free_space_root, 1);
> @@ -1350,6 +1353,15 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
>                         btrfs_end_transaction(trans);
>                         return ret;
>                 }
> +               handled++;
> +               handled %= FREE_SPACE_TREE_REBUILD_BATCH;
> +               if (!handled) {
> +                       btrfs_end_transaction(trans);
> +                       trans = btrfs_start_transaction(free_space_root,
> +                                       FREE_SPACE_TREE_REBUILD_BATCH);

This is a fundamental change here, we are no longer reserving 1 unit
but 32 instead.
Plus the first call to btrfs_start_transaction(), before entering the
loop, uses 1.
For consistency, both places should reserve the same amount.

But I think that changing the amount of reserved space should be a
separate change with its own changelog,
providing a rationale for it.

It's odd indeed that only 1 item is being reserved, but on the other
hand the block reserve associated with the free space tree root is the
delayed refs reserve (see btrfs_init_root_block_rsv()),
so changing the number of units passed to btrfs_start_transaction()
shouldn't make much difference because the space reserved by a
transaction goes to the transaction block reserve
(fs_info->trans_block_rsv).

And given that free space tree build/rebuild only touches the free
space tree root, I don't see how that makes any difference, or at
least it's not trivial, hence the separate change only for the space
reservation
amount and an explanation about why we are doing it.

Thanks.


> +                       if (IS_ERR(trans))
> +                               return PTR_ERR(trans);
> +               }
>                 node = rb_next(node);
>         }
>
> --
> 2.47.1
>
>
Qu Wenruo Dec. 17, 2024, 9:02 a.m. UTC | #2
在 2024/12/17 19:15, Filipe Manana 写道:
> On Tue, Dec 17, 2024 at 7:31 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> During free space tree rebuild, we're holding on transaction handler for
>> the whole rebuild process.
>>
>> This will cause blocked task warning for btrfs-transaction kernel
>> thread, as during the rebuild, btrfs-transaction kthread has to wait for
>> the running transaction we're holding for free space tree rebuild.
> 
> Do you have a stack trace?
> Does that happen where exactly, in the btrfs_attach_transaction() call
> chain, while waiting on a wait queue?

Unfortunately no, thus it's only based on code analyze.

The original reporter's stack are not related to free space cache 
rebuild, but on metadata writeback wait:

[101497.950425] INFO: task btrfs-transacti:29385 blocked for more than
122 seconds.
[101497.950432]       Tainted: G        W  O       6.12.3-gentoo-x86_64 #1
[101497.950435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[101497.950437] task:btrfs-transacti state:D stack:0     pid:29385
tgid:29385 ppid:2      flags:0x00004000
[101497.950442] Call Trace:
[101497.950444]  <TASK>
[101497.950449]  __schedule+0x3f0/0xbd0
[101497.950458]  schedule+0x27/0xf0
[101497.950461]  io_schedule+0x46/0x70
[101497.950465]  folio_wait_bit_common+0x123/0x340
[101497.950472]  ? __pfx_wake_page_function+0x10/0x10
[101497.950477]  folio_wait_writeback+0x2b/0x80
[101497.950480]  __filemap_fdatawait_range+0x7d/0xd0
[101497.950489]  filemap_fdatawait_range+0x12/0x20
[101497.950493]  __btrfs_wait_marked_extents.isra.0+0xb8/0xf0 [btrfs]
[101497.950536]  btrfs_write_and_wait_transaction+0x5e/0xd0 [btrfs]
[101497.950572]  btrfs_commit_transaction+0x8d9/0xe80 [btrfs]
[101497.950606]  ? start_transaction+0xc0/0x820 [btrfs]
[101497.950640]  transaction_kthread+0x159/0x1c0 [btrfs]
[101497.950674]  ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
[101497.950705]  kthread+0xd2/0x100
[101497.950710]  ? __pfx_kthread+0x10/0x10
[101497.950714]  ret_from_fork+0x34/0x50
[101497.950718]  ? __pfx_kthread+0x10/0x10
[101497.950721]  ret_from_fork_asm+0x1a/0x30
[101497.950727]  </TASK>

Furthermore, the original reporter doesn't experience real hang.
The operation can finish (very large stream receive), sync and properly 
unmount.
I believe the original report is mostly caused by the following reasons:

- Too large physical RAM
   96GiB, causing too huge page cache

- HDD
   Low IOPS

- Mostly random metadata writes?

> 
>>
>> On a large enough btrfs, we have thousands of block groups to go
>> through, thus it will definitely take over 120s and trigger the blocked
>> task warning.
>>
>> Fix the problem by handling 32 block groups in one transaction, and end
>> the transaction when we hit the 32 block groups threshold.
>>
>> This will allow the btrfs-transaction kthread to commit the transaction
>> when needed.
>>
>> And even if during the rebuild the system lost its power, we are still
>> fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount
>> we will still rebuild the tree, without utilizing the half built one.
> 
> What about if a crash happens and we already processed some block
> groups and the transaction got committed?
> 
> On the next mount, when we call populate_free_space_tree() for the
> same block groups, because we always start from the first one, won't
> we get an -EEXIST and fail the mount?
> For example, add_new_free_space_info() doesn't ignore an -EEXIST when
> it calls btrfs_insert_empty_item(), so we will fail the mount when
> trying to build the tree.

We are still fine:

btrfs_start_pre_rw_mount()
|- rebuild_free_space_tree = true;
|  This is because our fs only has FREE_SPACE_TREE, but no
|  FREE_SPACE_TREE_VALID compat_ro flag.
|
|- btrfs_rebuild_free_space_tree()
    |- clear_free_space_tree()
       Which deletes all the free space tree items.

Thanks,
Qu

> 
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/free-space-tree.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
>> index 7ba50e133921..d8f334724092 100644
>> --- a/fs/btrfs/free-space-tree.c
>> +++ b/fs/btrfs/free-space-tree.c
>> @@ -1312,6 +1312,8 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info)
>>          return btrfs_commit_transaction(trans);
>>   }
>>
>> +/* How many block groups can be handled in one transaction. */
>> +#define FREE_SPACE_TREE_REBUILD_BATCH  (32)
>>   int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
> 
> Please put a blank line after the macro declaration and place the
> macro at the top of the file before the C code.
> 
>>   {
>>          struct btrfs_trans_handle *trans;
>> @@ -1322,6 +1324,7 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
>>          };
>>          struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key);
>>          struct rb_node *node;
>> +       unsigned int handled = 0;
>>          int ret;
>>
>>          trans = btrfs_start_transaction(free_space_root, 1);
>> @@ -1350,6 +1353,15 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
>>                          btrfs_end_transaction(trans);
>>                          return ret;
>>                  }
>> +               handled++;
>> +               handled %= FREE_SPACE_TREE_REBUILD_BATCH;
>> +               if (!handled) {
>> +                       btrfs_end_transaction(trans);
>> +                       trans = btrfs_start_transaction(free_space_root,
>> +                                       FREE_SPACE_TREE_REBUILD_BATCH);
> 
> This is a fundamental change here, we are no longer reserving 1 unit
> but 32 instead.
> Plus the first call to btrfs_start_transaction(), before entering the
> loop, uses 1.
> For consistency, both places should reserve the same amount.
> 
> But I think that changing the amount of reserved space should be a
> separate change with its own changelog,
> providing a rationale for it.
> 
> It's odd indeed that only 1 item is being reserved, but on the other
> hand the block reserve associated with the free space tree root is the
> delayed refs reserve (see btrfs_init_root_block_rsv()),
> so changing the number of units passed to btrfs_start_transaction()
> shouldn't make much difference because the space reserved by a
> transaction goes to the transaction block reserve
> (fs_info->trans_block_rsv).
> 
> And given that free space tree build/rebuild only touches the free
> space tree root, I don't see how that makes any difference, or at
> least it's not trivial, hence the separate change only for the space
> reservation
> amount and an explanation about why we are doing it.
> 
> Thanks.
> 
> 
>> +                       if (IS_ERR(trans))
>> +                               return PTR_ERR(trans);
>> +               }
>>                  node = rb_next(node);
>>          }
>>
>> --
>> 2.47.1
>>
>>
Filipe Manana Dec. 17, 2024, 10:36 a.m. UTC | #3
On Tue, Dec 17, 2024 at 9:02 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> 在 2024/12/17 19:15, Filipe Manana 写道:
> > On Tue, Dec 17, 2024 at 7:31 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> During free space tree rebuild, we're holding on transaction handler for
> >> the whole rebuild process.
> >>
> >> This will cause blocked task warning for btrfs-transaction kernel
> >> thread, as during the rebuild, btrfs-transaction kthread has to wait for
> >> the running transaction we're holding for free space tree rebuild.
> >
> > Do you have a stack trace?
> > Does that happen where exactly, in the btrfs_attach_transaction() call
> > chain, while waiting on a wait queue?
>
> Unfortunately no, thus it's only based on code analyze.
>
> The original reporter's stack are not related to free space cache
> rebuild, but on metadata writeback wait:
>
> [101497.950425] INFO: task btrfs-transacti:29385 blocked for more than
> 122 seconds.
> [101497.950432]       Tainted: G        W  O       6.12.3-gentoo-x86_64 #1
> [101497.950435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [101497.950437] task:btrfs-transacti state:D stack:0     pid:29385
> tgid:29385 ppid:2      flags:0x00004000
> [101497.950442] Call Trace:
> [101497.950444]  <TASK>
> [101497.950449]  __schedule+0x3f0/0xbd0
> [101497.950458]  schedule+0x27/0xf0
> [101497.950461]  io_schedule+0x46/0x70
> [101497.950465]  folio_wait_bit_common+0x123/0x340
> [101497.950472]  ? __pfx_wake_page_function+0x10/0x10
> [101497.950477]  folio_wait_writeback+0x2b/0x80
> [101497.950480]  __filemap_fdatawait_range+0x7d/0xd0
> [101497.950489]  filemap_fdatawait_range+0x12/0x20
> [101497.950493]  __btrfs_wait_marked_extents.isra.0+0xb8/0xf0 [btrfs]
> [101497.950536]  btrfs_write_and_wait_transaction+0x5e/0xd0 [btrfs]
> [101497.950572]  btrfs_commit_transaction+0x8d9/0xe80 [btrfs]
> [101497.950606]  ? start_transaction+0xc0/0x820 [btrfs]
> [101497.950640]  transaction_kthread+0x159/0x1c0 [btrfs]
> [101497.950674]  ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
> [101497.950705]  kthread+0xd2/0x100
> [101497.950710]  ? __pfx_kthread+0x10/0x10
> [101497.950714]  ret_from_fork+0x34/0x50
> [101497.950718]  ? __pfx_kthread+0x10/0x10
> [101497.950721]  ret_from_fork_asm+0x1a/0x30
> [101497.950727]  </TASK>
>
> Furthermore, the original reporter doesn't experience real hang.
> The operation can finish (very large stream receive), sync and properly
> unmount.
> I believe the original report is mostly caused by the following reasons:
>
> - Too large physical RAM
>    96GiB, causing too huge page cache
>
> - HDD
>    Low IOPS
>
> - Mostly random metadata writes?

Looking at that trace, I don't see how this change relates to it.
That is, how this batching helps prevent that.

That happens when committing a transaction and we're in the unblocked
transaction state, starting writeback of the metadata extents and
waiting for the writeback to finish.

Since the free space tree build code uses the same transaction handle
to build the whole free space tree,
the transaction kthread would block when calling
btrfs_commit_transaction(), waiting for the free space tree build task
to release its handle - but that's not what is happening here.

The transaction kthread is doing the commit, and already in the
unblocked transaction state, writing all dirty metadata extents.
So that means it's not getting blocked by free tree build task holding
a transaction handle - it may be holding a transaction handle open,
but it's a different transaction already, since when the current
transaction is in a state >= TRANS_STATE_UNBLOCKED, other tasks can
start a new transaction.

And each transaction has its own io tree to keep track of the metadata
extents it created/dirtied (trans->transaction->dirty_pages),
so even if the task building the free space tree keeps COWing and
creating free space tree nodes/leaves, it doesn't affect the io tree
of the
committing transaction - these extent buffers will be written by the
transaction used by the free space tree build task when it commits,
and not by the one being committed by the transaction kthread.

We should have the stack trace in the change log, a link to the
reporter's thread and an explanation of how this batching helps
anything - I don't see how, as explained above.

Thanks.


>
> >
> >>
> >> On a large enough btrfs, we have thousands of block groups to go
> >> through, thus it will definitely take over 120s and trigger the blocked
> >> task warning.
> >>
> >> Fix the problem by handling 32 block groups in one transaction, and end
> >> the transaction when we hit the 32 block groups threshold.
> >>
> >> This will allow the btrfs-transaction kthread to commit the transaction
> >> when needed.
> >>
> >> And even if during the rebuild the system lost its power, we are still
> >> fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount
> >> we will still rebuild the tree, without utilizing the half built one.
> >
> > What about if a crash happens and we already processed some block
> > groups and the transaction got committed?
> >
> > On the next mount, when we call populate_free_space_tree() for the
> > same block groups, because we always start from the first one, won't
> > we get an -EEXIST and fail the mount?
> > For example, add_new_free_space_info() doesn't ignore an -EEXIST when
> > it calls btrfs_insert_empty_item(), so we will fail the mount when
> > trying to build the tree.
>
> We are still fine:
>
> btrfs_start_pre_rw_mount()
> |- rebuild_free_space_tree = true;
> |  This is because our fs only has FREE_SPACE_TREE, but no
> |  FREE_SPACE_TREE_VALID compat_ro flag.
> |
> |- btrfs_rebuild_free_space_tree()
>     |- clear_free_space_tree()
>        Which deletes all the free space tree items.
>
> Thanks,
> Qu
>
> >
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >>   fs/btrfs/free-space-tree.c | 12 ++++++++++++
> >>   1 file changed, 12 insertions(+)
> >>
> >> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
> >> index 7ba50e133921..d8f334724092 100644
> >> --- a/fs/btrfs/free-space-tree.c
> >> +++ b/fs/btrfs/free-space-tree.c
> >> @@ -1312,6 +1312,8 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info)
> >>          return btrfs_commit_transaction(trans);
> >>   }
> >>
> >> +/* How many block groups can be handled in one transaction. */
> >> +#define FREE_SPACE_TREE_REBUILD_BATCH  (32)
> >>   int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
> >
> > Please put a blank line after the macro declaration and place the
> > macro at the top of the file before the C code.
> >
> >>   {
> >>          struct btrfs_trans_handle *trans;
> >> @@ -1322,6 +1324,7 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
> >>          };
> >>          struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key);
> >>          struct rb_node *node;
> >> +       unsigned int handled = 0;
> >>          int ret;
> >>
> >>          trans = btrfs_start_transaction(free_space_root, 1);
> >> @@ -1350,6 +1353,15 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
> >>                          btrfs_end_transaction(trans);
> >>                          return ret;
> >>                  }
> >> +               handled++;
> >> +               handled %= FREE_SPACE_TREE_REBUILD_BATCH;
> >> +               if (!handled) {
> >> +                       btrfs_end_transaction(trans);
> >> +                       trans = btrfs_start_transaction(free_space_root,
> >> +                                       FREE_SPACE_TREE_REBUILD_BATCH);
> >
> > This is a fundamental change here, we are no longer reserving 1 unit
> > but 32 instead.
> > Plus the first call to btrfs_start_transaction(), before entering the
> > loop, uses 1.
> > For consistency, both places should reserve the same amount.
> >
> > But I think that changing the amount of reserved space should be a
> > separate change with its own changelog,
> > providing a rationale for it.
> >
> > It's odd indeed that only 1 item is being reserved, but on the other
> > hand the block reserve associated with the free space tree root is the
> > delayed refs reserve (see btrfs_init_root_block_rsv()),
> > so changing the number of units passed to btrfs_start_transaction()
> > shouldn't make much difference because the space reserved by a
> > transaction goes to the transaction block reserve
> > (fs_info->trans_block_rsv).
> >
> > And given that free space tree build/rebuild only touches the free
> > space tree root, I don't see how that makes any difference, or at
> > least it's not trivial, hence the separate change only for the space
> > reservation
> > amount and an explanation about why we are doing it.
> >
> > Thanks.
> >
> >
> >> +                       if (IS_ERR(trans))
> >> +                               return PTR_ERR(trans);
> >> +               }
> >>                  node = rb_next(node);
> >>          }
> >>
> >> --
> >> 2.47.1
> >>
> >>
>
Qu Wenruo Dec. 17, 2024, 8:38 p.m. UTC | #4
在 2024/12/17 21:06, Filipe Manana 写道:
> On Tue, Dec 17, 2024 at 9:02 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>>
>>
>> 在 2024/12/17 19:15, Filipe Manana 写道:
>>> On Tue, Dec 17, 2024 at 7:31 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>
>>>> During free space tree rebuild, we're holding on transaction handler for
>>>> the whole rebuild process.
>>>>
>>>> This will cause blocked task warning for btrfs-transaction kernel
>>>> thread, as during the rebuild, btrfs-transaction kthread has to wait for
>>>> the running transaction we're holding for free space tree rebuild.
>>>
>>> Do you have a stack trace?
>>> Does that happen where exactly, in the btrfs_attach_transaction() call
>>> chain, while waiting on a wait queue?
>>
>> Unfortunately no, thus it's only based on code analyze.
>>
>> The original reporter's stack are not related to free space cache
>> rebuild, but on metadata writeback wait:
>>
>> [101497.950425] INFO: task btrfs-transacti:29385 blocked for more than
>> 122 seconds.
>> [101497.950432]       Tainted: G        W  O       6.12.3-gentoo-x86_64 #1
>> [101497.950435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [101497.950437] task:btrfs-transacti state:D stack:0     pid:29385
>> tgid:29385 ppid:2      flags:0x00004000
>> [101497.950442] Call Trace:
>> [101497.950444]  <TASK>
>> [101497.950449]  __schedule+0x3f0/0xbd0
>> [101497.950458]  schedule+0x27/0xf0
>> [101497.950461]  io_schedule+0x46/0x70
>> [101497.950465]  folio_wait_bit_common+0x123/0x340
>> [101497.950472]  ? __pfx_wake_page_function+0x10/0x10
>> [101497.950477]  folio_wait_writeback+0x2b/0x80
>> [101497.950480]  __filemap_fdatawait_range+0x7d/0xd0
>> [101497.950489]  filemap_fdatawait_range+0x12/0x20
>> [101497.950493]  __btrfs_wait_marked_extents.isra.0+0xb8/0xf0 [btrfs]
>> [101497.950536]  btrfs_write_and_wait_transaction+0x5e/0xd0 [btrfs]
>> [101497.950572]  btrfs_commit_transaction+0x8d9/0xe80 [btrfs]
>> [101497.950606]  ? start_transaction+0xc0/0x820 [btrfs]
>> [101497.950640]  transaction_kthread+0x159/0x1c0 [btrfs]
>> [101497.950674]  ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
>> [101497.950705]  kthread+0xd2/0x100
>> [101497.950710]  ? __pfx_kthread+0x10/0x10
>> [101497.950714]  ret_from_fork+0x34/0x50
>> [101497.950718]  ? __pfx_kthread+0x10/0x10
>> [101497.950721]  ret_from_fork_asm+0x1a/0x30
>> [101497.950727]  </TASK>
>>
>> Furthermore, the original reporter doesn't experience real hang.
>> The operation can finish (very large stream receive), sync and properly
>> unmount.
>> I believe the original report is mostly caused by the following reasons:
>>
>> - Too large physical RAM
>>     96GiB, causing too huge page cache
>>
>> - HDD
>>     Low IOPS
>>
>> - Mostly random metadata writes?
>
> Looking at that trace, I don't see how this change relates to it.
> That is, how this batching helps prevent that.

Exactly, this batching fix is unrelated to the report, just as I said.

Thanks,
Qu
>
> That happens when committing a transaction and we're in the unblocked
> transaction state, starting writeback of the metadata extents and
> waiting for the writeback to finish.
>
> Since the free space tree build code uses the same transaction handle
> to build the whole free space tree,
> the transaction kthread would block when calling
> btrfs_commit_transaction(), waiting for the free space tree build task
> to release its handle - but that's not what is happening here.
>
> The transaction kthread is doing the commit, and already in the
> unblocked transaction state, writing all dirty metadata extents.
> So that means it's not getting blocked by free tree build task holding
> a transaction handle - it may be holding a transaction handle open,
> but it's a different transaction already, since when the current
> transaction is in a state >= TRANS_STATE_UNBLOCKED, other tasks can
> start a new transaction.
>
> And each transaction has its own io tree to keep track of the metadata
> extents it created/dirtied (trans->transaction->dirty_pages),
> so even if the task building the free space tree keeps COWing and
> creating free space tree nodes/leaves, it doesn't affect the io tree
> of the
> committing transaction - these extent buffers will be written by the
> transaction used by the free space tree build task when it commits,
> and not by the one being committed by the transaction kthread.
>
> We should have the stack trace in the change log, a link to the
> reporter's thread and an explanation of how this batching helps
> anything - I don't see how, as explained above.
>
> Thanks.
>
>
>>
>>>
>>>>
>>>> On a large enough btrfs, we have thousands of block groups to go
>>>> through, thus it will definitely take over 120s and trigger the blocked
>>>> task warning.
>>>>
>>>> Fix the problem by handling 32 block groups in one transaction, and end
>>>> the transaction when we hit the 32 block groups threshold.
>>>>
>>>> This will allow the btrfs-transaction kthread to commit the transaction
>>>> when needed.
>>>>
>>>> And even if during the rebuild the system lost its power, we are still
>>>> fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount
>>>> we will still rebuild the tree, without utilizing the half built one.
>>>
>>> What about if a crash happens and we already processed some block
>>> groups and the transaction got committed?
>>>
>>> On the next mount, when we call populate_free_space_tree() for the
>>> same block groups, because we always start from the first one, won't
>>> we get an -EEXIST and fail the mount?
>>> For example, add_new_free_space_info() doesn't ignore an -EEXIST when
>>> it calls btrfs_insert_empty_item(), so we will fail the mount when
>>> trying to build the tree.
>>
>> We are still fine:
>>
>> btrfs_start_pre_rw_mount()
>> |- rebuild_free_space_tree = true;
>> |  This is because our fs only has FREE_SPACE_TREE, but no
>> |  FREE_SPACE_TREE_VALID compat_ro flag.
>> |
>> |- btrfs_rebuild_free_space_tree()
>>      |- clear_free_space_tree()
>>         Which deletes all the free space tree items.
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>>    fs/btrfs/free-space-tree.c | 12 ++++++++++++
>>>>    1 file changed, 12 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
>>>> index 7ba50e133921..d8f334724092 100644
>>>> --- a/fs/btrfs/free-space-tree.c
>>>> +++ b/fs/btrfs/free-space-tree.c
>>>> @@ -1312,6 +1312,8 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info)
>>>>           return btrfs_commit_transaction(trans);
>>>>    }
>>>>
>>>> +/* How many block groups can be handled in one transaction. */
>>>> +#define FREE_SPACE_TREE_REBUILD_BATCH  (32)
>>>>    int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
>>>
>>> Please put a blank line after the macro declaration and place the
>>> macro at the top of the file before the C code.
>>>
>>>>    {
>>>>           struct btrfs_trans_handle *trans;
>>>> @@ -1322,6 +1324,7 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
>>>>           };
>>>>           struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key);
>>>>           struct rb_node *node;
>>>> +       unsigned int handled = 0;
>>>>           int ret;
>>>>
>>>>           trans = btrfs_start_transaction(free_space_root, 1);
>>>> @@ -1350,6 +1353,15 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
>>>>                           btrfs_end_transaction(trans);
>>>>                           return ret;
>>>>                   }
>>>> +               handled++;
>>>> +               handled %= FREE_SPACE_TREE_REBUILD_BATCH;
>>>> +               if (!handled) {
>>>> +                       btrfs_end_transaction(trans);
>>>> +                       trans = btrfs_start_transaction(free_space_root,
>>>> +                                       FREE_SPACE_TREE_REBUILD_BATCH);
>>>
>>> This is a fundamental change here, we are no longer reserving 1 unit
>>> but 32 instead.
>>> Plus the first call to btrfs_start_transaction(), before entering the
>>> loop, uses 1.
>>> For consistency, both places should reserve the same amount.
>>>
>>> But I think that changing the amount of reserved space should be a
>>> separate change with its own changelog,
>>> providing a rationale for it.
>>>
>>> It's odd indeed that only 1 item is being reserved, but on the other
>>> hand the block reserve associated with the free space tree root is the
>>> delayed refs reserve (see btrfs_init_root_block_rsv()),
>>> so changing the number of units passed to btrfs_start_transaction()
>>> shouldn't make much difference because the space reserved by a
>>> transaction goes to the transaction block reserve
>>> (fs_info->trans_block_rsv).
>>>
>>> And given that free space tree build/rebuild only touches the free
>>> space tree root, I don't see how that makes any difference, or at
>>> least it's not trivial, hence the separate change only for the space
>>> reservation
>>> amount and an explanation about why we are doing it.
>>>
>>> Thanks.
>>>
>>>
>>>> +                       if (IS_ERR(trans))
>>>> +                               return PTR_ERR(trans);
>>>> +               }
>>>>                   node = rb_next(node);
>>>>           }
>>>>
>>>> --
>>>> 2.47.1
>>>>
>>>>
>>
>
diff mbox series

Patch

diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index 7ba50e133921..d8f334724092 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -1312,6 +1312,8 @@  int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info)
 	return btrfs_commit_transaction(trans);
 }
 
+/* How many block groups can be handled in one transaction. */
+#define FREE_SPACE_TREE_REBUILD_BATCH	(32)
 int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_trans_handle *trans;
@@ -1322,6 +1324,7 @@  int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
 	};
 	struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key);
 	struct rb_node *node;
+	unsigned int handled = 0;
 	int ret;
 
 	trans = btrfs_start_transaction(free_space_root, 1);
@@ -1350,6 +1353,15 @@  int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info)
 			btrfs_end_transaction(trans);
 			return ret;
 		}
+		handled++;
+		handled %= FREE_SPACE_TREE_REBUILD_BATCH;
+		if (!handled) {
+			btrfs_end_transaction(trans);
+			trans = btrfs_start_transaction(free_space_root,
+					FREE_SPACE_TREE_REBUILD_BATCH);
+			if (IS_ERR(trans))
+				return PTR_ERR(trans);
+		}
 		node = rb_next(node);
 	}