diff mbox series

[v3] btrfs: fix data race when accessing the inode's disk_i_size at btrfs_drop_extents()

Message ID 20241203075651.109899-1-zhenghaoran154@gmail.com (mailing list archive)
State New
Headers show
Series [v3] btrfs: fix data race when accessing the inode's disk_i_size at btrfs_drop_extents() | expand

Commit Message

Hao-ran Zheng Dec. 3, 2024, 7:56 a.m. UTC
A data race occurs when the function `insert_ordered_extent_file_extent()`
and the function `btrfs_inode_safe_disk_i_size_write()` are executed
concurrently. The function `insert_ordered_extent_file_extent()` is not
locked when reading inode->disk_i_size, causing
`btrfs_inode_safe_disk_i_size_write()` to cause data competition when
writing inode->disk_i_size, thus affecting the value of `modify_tree`.

The specific call stack that appears during testing is as follows:
============DATA_RACE============
 btrfs_drop_extents+0x89a/0xa060 [btrfs]
 insert_reserved_file_extent+0xb54/0x2960 [btrfs]
 insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs]
 btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs]
 btrfs_finish_ordered_io+0x37/0x60 [btrfs]
 finish_ordered_fn+0x3e/0x50 [btrfs]
 btrfs_work_helper+0x9c9/0x27a0 [btrfs]
 process_scheduled_works+0x716/0xf10
 worker_thread+0xb6a/0x1190
 kthread+0x292/0x330
 ret_from_fork+0x4d/0x80
 ret_from_fork_asm+0x1a/0x30
============OTHER_INFO============
 btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs]
 btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs]
 btrfs_finish_ordered_io+0x37/0x60 [btrfs]
 finish_ordered_fn+0x3e/0x50 [btrfs]
 btrfs_work_helper+0x9c9/0x27a0 [btrfs]
 process_scheduled_works+0x716/0xf10
 worker_thread+0xb6a/0x1190
 kthread+0x292/0x330
 ret_from_fork+0x4d/0x80
 ret_from_fork_asm+0x1a/0x30
=================================

The main purpose of the modify_tree variable is to optimize the
acquisition of read-write locks at or after the end of file (EOF).
When the value of modify_tree is modified due to concurrency, resulting
in unnecessary acquisition of write locks, the correctness of the
system will not be affected. When the system requires a write lock but
does not acquire the write lock because the value of modify_tree is
incorrect, the path will be subsequently released and the lock will
be reacquired to ensure the correctness of the system.

Since this data race does not affect the correctness of the function,
it is a harmless data race, and it is recommended to use `data_race`
to mark it.

Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com>
---
V2->V3: Added details on why this is a harmless data race
V1->V2: Modified patch description and format
---
 fs/btrfs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Filipe Manana Dec. 3, 2024, 12:28 p.m. UTC | #1
On Tue, Dec 3, 2024 at 7:59 AM Hao-ran Zheng <zhenghaoran154@gmail.com> wrote:
>
> A data race occurs when the function `insert_ordered_extent_file_extent()`
> and the function `btrfs_inode_safe_disk_i_size_write()` are executed
> concurrently. The function `insert_ordered_extent_file_extent()` is not
> locked when reading inode->disk_i_size, causing
> `btrfs_inode_safe_disk_i_size_write()` to cause data competition when
> writing inode->disk_i_size, thus affecting the value of `modify_tree`.
>
> The specific call stack that appears during testing is as follows:
> ============DATA_RACE============
>  btrfs_drop_extents+0x89a/0xa060 [btrfs]
>  insert_reserved_file_extent+0xb54/0x2960 [btrfs]
>  insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs]
>  btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs]
>  btrfs_finish_ordered_io+0x37/0x60 [btrfs]
>  finish_ordered_fn+0x3e/0x50 [btrfs]
>  btrfs_work_helper+0x9c9/0x27a0 [btrfs]
>  process_scheduled_works+0x716/0xf10
>  worker_thread+0xb6a/0x1190
>  kthread+0x292/0x330
>  ret_from_fork+0x4d/0x80
>  ret_from_fork_asm+0x1a/0x30
> ============OTHER_INFO============
>  btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs]
>  btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs]
>  btrfs_finish_ordered_io+0x37/0x60 [btrfs]
>  finish_ordered_fn+0x3e/0x50 [btrfs]
>  btrfs_work_helper+0x9c9/0x27a0 [btrfs]
>  process_scheduled_works+0x716/0xf10
>  worker_thread+0xb6a/0x1190
>  kthread+0x292/0x330
>  ret_from_fork+0x4d/0x80
>  ret_from_fork_asm+0x1a/0x30
> =================================
>
> The main purpose of the modify_tree variable is to optimize the
> acquisition of read-write locks at or after the end of file (EOF).
> When the value of modify_tree is modified due to concurrency, resulting
> in unnecessary acquisition of write locks, the correctness of the
> system will not be affected. When the system requires a write lock but
> does not acquire the write lock because the value of modify_tree is
> incorrect, the path will be subsequently released and the lock will
> be reacquired to ensure the correctness of the system.

Looks good, I'll slightly rephrase this on things like "modify_tree is
modified due to concurrency" because what gets modified concurrently
is inode->disk_i_size.
I'll push it to the github btrfs for-next branch soon with this
paragraph instead:

"The main purpose of the check of the inode's disk_i_size is to avoid
taking write locks on a btree path when we have a write at or beyond
eof, since in these cases we don't expect to find extent items in the
root to drop. However if we end up taking write locks due to a data
race on disk_i_size, everything is still correct, we only add extra
lock contention on the tree in case there's concurrency from other tasks.
If the race causes us to not take write locks when we actually need them,
then everything is functionally correct as well, since if we find out we
have extent items to drop and we took read locks (modify_tree set to 0),
we release the path and retry again with write locks."

Thanks.

Reviewed-by: Filipe Manana <fdmanana@suse.com>

>
> Since this data race does not affect the correctness of the function,
> it is a harmless data race, and it is recommended to use `data_race`
> to mark it.
>
> Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com>
> ---
> V2->V3: Added details on why this is a harmless data race
> V1->V2: Modified patch description and format
> ---
>  fs/btrfs/file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 4fb521d91b06..559c177456e6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -242,7 +242,7 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans,
>         if (args->drop_cache)
>                 btrfs_drop_extent_map_range(inode, args->start, args->end - 1, false);
>
> -       if (args->start >= inode->disk_i_size && !args->replace_extent)
> +       if (data_race(args->start >= inode->disk_i_size) && !args->replace_extent)
>                 modify_tree = 0;
>
>         update_refs = (btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID);
> --
> 2.34.1
>
>
diff mbox series

Patch

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4fb521d91b06..559c177456e6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -242,7 +242,7 @@  int btrfs_drop_extents(struct btrfs_trans_handle *trans,
 	if (args->drop_cache)
 		btrfs_drop_extent_map_range(inode, args->start, args->end - 1, false);
 
-	if (args->start >= inode->disk_i_size && !args->replace_extent)
+	if (data_race(args->start >= inode->disk_i_size) && !args->replace_extent)
 		modify_tree = 0;
 
 	update_refs = (btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID);