Message ID | 47ac92a6c8be53a5e10add9315255460c062b52d.1706817512.git.josef@toxicpanda.com (mailing list archive)
---|---
State | New, archived
Series | btrfs: fix deadlock with fiemap and extent locking
On Thu, Feb 1, 2024 at 7:59 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> While working on the patchset to remove extent locking I got a lockdep
> splat with fiemap and pagefaulting with my new extent lock replacement
> lock.
>
[...]
>
> With this patch applied we no longer deadlock with my testcase.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, but just some minor comments below.

> ---
>  fs/btrfs/extent_io.c | 49 +++++++++++++++++++++++++++++---------------
>  1 file changed, 33 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 8648ea9b5fb5..f8b68249d958 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2683,16 +2683,25 @@ static int fiemap_process_hole(struct btrfs_inode *inode,
>           * it beyond i_size.
>           */
>          while (cur_offset < end && cur_offset < i_size) {
> +                struct extent_state *cached_state = NULL;
>                  u64 delalloc_start;
>                  u64 delalloc_end;
>                  u64 prealloc_start;
> +                u64 lockstart, lockend;

I think our preferred style is to declare one variable per line.
It also makes it consistent with the rest of the existing local code.

>                  u64 prealloc_len = 0;
>                  bool delalloc;
>
> +                lockstart = round_down(cur_offset,
> +                                       inode->root->fs_info->sectorsize);

As a single line that would be 85 characters wide, still acceptable.

> +                lockend = round_up(end, inode->root->fs_info->sectorsize);
> +
> +                lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
>                  delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
>                                                          delalloc_cached_state,
>                                                          &delalloc_start,
>                                                          &delalloc_end);
> +                unlock_extent(&inode->io_tree, lockstart, lockend,
> +                              &cached_state);

This can be made into a single line, as it's only 82 characters wide.

>                  if (!delalloc)
>                          break;
>
[...]
>
>          if (cache.cached && cache.offset + cache.len >= last_extent_end) {
>                  const u64 i_size = i_size_read(&inode->vfs_inode);
>
>                  if (prev_extent_end < i_size) {
> +                        struct extent_state *cached_state = NULL;
>                          u64 delalloc_start;
>                          u64 delalloc_end;
> +                        u64 lockstart, lockend;

Same as before, one per line.

>                          bool delalloc;
>
> +                        lockstart = round_down(prev_extent_end,
> +                                               inode->root->fs_info->sectorsize);
> +                        lockend = round_up(i_size,
> +                                           inode->root->fs_info->sectorsize);
> +

With several places in the whole function now using
inode->root->fs_info->sectorsize, we could have a:

const u64 sectorsize = inode->root->fs_info->sectorsize;

At the top level of the function and then use it and avoid multiple lines.

> +                        lock_extent(&inode->io_tree, lockstart, lockend,
> +                                    &cached_state);
>                          delalloc = btrfs_find_delalloc_in_range(inode,
>                                                                  prev_extent_end,
>                                                                  i_size - 1,
>                                                                  &delalloc_cached_state,
>                                                                  &delalloc_start,
>                                                                  &delalloc_end);
> +                        unlock_extent(&inode->io_tree, lockstart, lockend,
> +                                      &cached_state);
>                          if (!delalloc)
>                                  cache.flags |= FIEMAP_EXTENT_LAST;
[...]
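As a rough illustration of how the fiemap_process_hole() hunk could look with the style suggestions above applied (one declaration per line, single-line lock/unlock calls, and a function-local sectorsize constant). This is only a sketch of the review feedback, not necessarily the form the patch finally took:

        const u64 sectorsize = inode->root->fs_info->sectorsize;  /* at the top of the function */
        ...
        while (cur_offset < end && cur_offset < i_size) {
                struct extent_state *cached_state = NULL;
                u64 delalloc_start;
                u64 delalloc_end;
                u64 prealloc_start;
                u64 lockstart;
                u64 lockend;
                u64 prealloc_len = 0;
                bool delalloc;

                lockstart = round_down(cur_offset, sectorsize);
                lockend = round_up(end, sectorsize);

                /* Lock only around the delalloc lookup, per the changelog. */
                lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
                delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
                                                        delalloc_cached_state,
                                                        &delalloc_start,
                                                        &delalloc_end);
                unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
                if (!delalloc)
                        break;
                ...
        }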
On Thu, Feb 01, 2024 at 02:58:54PM -0500, Josef Bacik wrote:
> While working on the patchset to remove extent locking I got a lockdep
> splat with fiemap and pagefaulting with my new extent lock replacement
> lock.
>
[...]
>
> @@ -2683,16 +2683,25 @@ static int fiemap_process_hole(struct btrfs_inode *inode,
[...]
> +                lockstart = round_down(cur_offset,
> +                                       inode->root->fs_info->sectorsize);
> +                lockend = round_up(end, inode->root->fs_info->sectorsize);
> +
> +                lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);

This could use a comment why the locking is ok only for delalloc. This is
quite a step from 'the whole range'.

>                  delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
>                                                          delalloc_cached_state,
>                                                          &delalloc_start,
>                                                          &delalloc_end);
> +                unlock_extent(&inode->io_tree, lockstart, lockend,
> +                              &cached_state);
>                  if (!delalloc)
>                          break;
>
> @@ -2860,15 +2869,14 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
>                    u64 start, u64 len)
>  {
>          const u64 ino = btrfs_ino(inode);
> -        struct extent_state *cached_state = NULL;
>          struct extent_state *delalloc_cached_state = NULL;
>          struct btrfs_path *path;
>          struct fiemap_cache cache = { 0 };
>          struct btrfs_backref_share_check_ctx *backref_ctx;
>          u64 last_extent_end;
>          u64 prev_extent_end;
> -        u64 lockstart;
> -        u64 lockend;
> +        u64 align_start;
> +        u64 align_end;

This could be range_start and range_end, in the context where it appears
it's IMHO more clear.

>          bool stopped = false;
>          int ret;
>
[...]
>
> +                        lockstart = round_down(prev_extent_end,
> +                                               inode->root->fs_info->sectorsize);
> +                        lockend = round_up(i_size,
> +                                           inode->root->fs_info->sectorsize);
> +
> +                        lock_extent(&inode->io_tree, lockstart, lockend,
> +                                    &cached_state);

And same comment about locking only delalloc range.

>                          delalloc = btrfs_find_delalloc_in_range(inode,
>                                                                  prev_extent_end,
>                                                                  i_size - 1,
>                                                                  &delalloc_cached_state,
>                                                                  &delalloc_start,
>                                                                  &delalloc_end);
[...]
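One possible shape for the comment requested above, based on the reasoning already given in the changelog (the exact wording is of course up to the author):

                /*
                 * Lock the extent range only while looking up delalloc.
                 * Delalloc can change during writeback, so the lock is needed
                 * to get a consistent view when deciding whether to report
                 * FIEMAP_EXTENT_DELALLOC for this range.  Holding the extent
                 * lock across the whole fiemap call is what allowed the copy
                 * to user space to deadlock against btrfs_page_mkwrite() when
                 * the destination buffer is an mmap'ed range of the same file.
                 */
                lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
                delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
                                                        delalloc_cached_state,
                                                        &delalloc_start,
                                                        &delalloc_end);
                unlock_extent(&inode->io_tree, lockstart, lockend,
                              &cached_state);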
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8648ea9b5fb5..f8b68249d958 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2683,16 +2683,25 @@ static int fiemap_process_hole(struct btrfs_inode *inode,
          * it beyond i_size.
          */
         while (cur_offset < end && cur_offset < i_size) {
+                struct extent_state *cached_state = NULL;
                 u64 delalloc_start;
                 u64 delalloc_end;
                 u64 prealloc_start;
+                u64 lockstart, lockend;
                 u64 prealloc_len = 0;
                 bool delalloc;
 
+                lockstart = round_down(cur_offset,
+                                       inode->root->fs_info->sectorsize);
+                lockend = round_up(end, inode->root->fs_info->sectorsize);
+
+                lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
                 delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
                                                         delalloc_cached_state,
                                                         &delalloc_start,
                                                         &delalloc_end);
+                unlock_extent(&inode->io_tree, lockstart, lockend,
+                              &cached_state);
                 if (!delalloc)
                         break;
 
@@ -2860,15 +2869,14 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
                   u64 start, u64 len)
 {
         const u64 ino = btrfs_ino(inode);
-        struct extent_state *cached_state = NULL;
         struct extent_state *delalloc_cached_state = NULL;
         struct btrfs_path *path;
         struct fiemap_cache cache = { 0 };
         struct btrfs_backref_share_check_ctx *backref_ctx;
         u64 last_extent_end;
         u64 prev_extent_end;
-        u64 lockstart;
-        u64 lockend;
+        u64 align_start;
+        u64 align_end;
         bool stopped = false;
         int ret;
 
@@ -2879,12 +2887,11 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
                 goto out;
         }
 
-        lockstart = round_down(start, inode->root->fs_info->sectorsize);
-        lockend = round_up(start + len, inode->root->fs_info->sectorsize);
-        prev_extent_end = lockstart;
+        align_start = round_down(start, inode->root->fs_info->sectorsize);
+        align_end = round_up(start + len, inode->root->fs_info->sectorsize);
+        prev_extent_end = align_start;
 
         btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
-        lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 
         ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
         if (ret < 0)
@@ -2892,7 +2899,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
         btrfs_release_path(path);
 
         path->reada = READA_FORWARD;
-        ret = fiemap_search_slot(inode, path, lockstart);
+        ret = fiemap_search_slot(inode, path, align_start);
         if (ret < 0) {
                 goto out_unlock;
         } else if (ret > 0) {
@@ -2904,7 +2911,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
                 goto check_eof_delalloc;
         }
 
-        while (prev_extent_end < lockend) {
+        while (prev_extent_end < align_end) {
                 struct extent_buffer *leaf = path->nodes[0];
                 struct btrfs_file_extent_item *ei;
                 struct btrfs_key key;
@@ -2927,14 +2934,14 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
                  * The first iteration can leave us at an extent item that ends
                  * before our range's start. Move to the next item.
                  */
-                if (extent_end <= lockstart)
+                if (extent_end <= align_start)
                         goto next_item;
 
                 backref_ctx->curr_leaf_bytenr = leaf->start;
 
                 /* We have in implicit hole (NO_HOLES feature enabled). */
                 if (prev_extent_end < key.offset) {
-                        const u64 range_end = min(key.offset, lockend) - 1;
+                        const u64 range_end = min(key.offset, align_end) - 1;
 
                         ret = fiemap_process_hole(inode, fieinfo, &cache,
                                                   &delalloc_cached_state,
@@ -2949,7 +2956,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
                 }
 
                 /* We've reached the end of the fiemap range, stop. */
-                if (key.offset >= lockend) {
+                if (key.offset >= align_end) {
                         stopped = true;
                         break;
                 }
@@ -3043,29 +3050,40 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
         btrfs_free_path(path);
         path = NULL;
 
-        if (!stopped && prev_extent_end < lockend) {
+        if (!stopped && prev_extent_end < align_end) {
                 ret = fiemap_process_hole(inode, fieinfo, &cache,
                                           &delalloc_cached_state, backref_ctx,
-                                          0, 0, 0, prev_extent_end, lockend - 1);
+                                          0, 0, 0, prev_extent_end, align_end - 1);
                 if (ret < 0)
                         goto out_unlock;
-                prev_extent_end = lockend;
+                prev_extent_end = align_end;
         }
 
         if (cache.cached && cache.offset + cache.len >= last_extent_end) {
                 const u64 i_size = i_size_read(&inode->vfs_inode);
 
                 if (prev_extent_end < i_size) {
+                        struct extent_state *cached_state = NULL;
                         u64 delalloc_start;
                         u64 delalloc_end;
+                        u64 lockstart, lockend;
                         bool delalloc;
 
+                        lockstart = round_down(prev_extent_end,
+                                               inode->root->fs_info->sectorsize);
+                        lockend = round_up(i_size,
+                                           inode->root->fs_info->sectorsize);
+
+                        lock_extent(&inode->io_tree, lockstart, lockend,
+                                    &cached_state);
                         delalloc = btrfs_find_delalloc_in_range(inode,
                                                                 prev_extent_end,
                                                                 i_size - 1,
                                                                 &delalloc_cached_state,
                                                                 &delalloc_start,
                                                                 &delalloc_end);
+                        unlock_extent(&inode->io_tree, lockstart, lockend,
+                                      &cached_state);
                         if (!delalloc)
                                 cache.flags |= FIEMAP_EXTENT_LAST;
                 } else {
@@ -3076,7 +3094,6 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
         ret = emit_last_fiemap_cache(fieinfo, &cache);
 
 out_unlock:
-        unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
         btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 out:
         free_extent_state(delalloc_cached_state);
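A small aside on the round_down()/round_up() pairs used for lockstart and lockend above: they just align the locked range to the file system sector size, the same way the removed whole-range lock was computed from start and start + len. With purely illustrative numbers and a 4K sector size:

        /* e.g. cur_offset = 5000, end = 9000, sectorsize = 4096 */
        lockstart = round_down(5000, 4096);     /* == 4096  */
        lockend = round_up(9000, 4096);         /* == 12288 */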
While working on the patchset to remove extent locking I got a lockdep
splat with fiemap and pagefaulting with my new extent lock replacement
lock.

This deadlock exists with our normal code, we just don't have lockdep
annotations with the extent locking so we've never noticed it.

Since we're copying the fiemap extent to user space on every iteration
we have the chance of pagefaulting. Because we hold the extent lock for
the entire range we could mkwrite into a range in the file that we have
mmap'ed. This would deadlock with the following stack trace

[<0>] lock_extent+0x28d/0x2f0
[<0>] btrfs_page_mkwrite+0x273/0x8a0
[<0>] do_page_mkwrite+0x50/0xb0
[<0>] do_fault+0xc1/0x7b0
[<0>] __handle_mm_fault+0x2fa/0x460
[<0>] handle_mm_fault+0xa4/0x330
[<0>] do_user_addr_fault+0x1f4/0x800
[<0>] exc_page_fault+0x7c/0x1e0
[<0>] asm_exc_page_fault+0x26/0x30
[<0>] rep_movs_alternative+0x33/0x70
[<0>] _copy_to_user+0x49/0x70
[<0>] fiemap_fill_next_extent+0xc8/0x120
[<0>] emit_fiemap_extent+0x4d/0xa0
[<0>] extent_fiemap+0x7f8/0xad0
[<0>] btrfs_fiemap+0x49/0x80
[<0>] __x64_sys_ioctl+0x3e1/0xb50
[<0>] do_syscall_64+0x94/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76

I wrote an fstest to reproduce this deadlock without my replacement lock
and verified that the deadlock exists with our existing locking.

To fix this simply don't take the extent lock for the entire duration of
the fiemap. This is safe in general because we keep track of where we
are when we're searching the tree, so if an ordered extent updates in
the middle of our fiemap call we'll still emit the correct extents
because we know what offset we were on before.

The only place we maintain the lock is searching delalloc. Since the
delalloc stuff can change during writeback we want to lock the extent
range so we have a consistent view of delalloc at the time we're
checking to see if we need to set the delalloc flag.

With this patch applied we no longer deadlock with my testcase.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 49 +++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 16 deletions(-)
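The fstest mentioned in the changelog is not part of this thread. As a purely illustrative sketch of the mechanism behind the stack trace above, a minimal user space reproducer could look roughly like the following: the fiemap output buffer is placed inside a writable shared mapping of the same file, so that copying results back to user space write-faults into btrfs_page_mkwrite() on a range the pre-patch code still had extent-locked. The file name, sizes, and extent count below are arbitrary assumptions, not taken from the actual test.

/* fiemap-vs-mkwrite.c: hypothetical reproducer sketch, not the actual fstest. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/fiemap.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        const size_t len = 1024 * 1024; /* arbitrary file size */
        struct fiemap req;
        void *map;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file on a btrfs mount>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;

        /*
         * Seed the fiemap request at offset 0 of the file itself.  Using
         * pwrite() (not the mapping) also creates delalloc, so there is at
         * least one extent for fiemap to copy back to user space.
         */
        memset(&req, 0, sizeof(req));
        req.fm_start = 0;
        req.fm_length = len;
        req.fm_extent_count = 64;       /* arbitrary */
        if (pwrite(fd, &req, sizeof(req), 0) != (ssize_t)sizeof(req))
                return 1;

        map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                return 1;

        /*
         * The ioctl result buffer lives inside a writable shared mapping of
         * the same file.  When the kernel copies extents back it write-faults,
         * which ends up in btrfs_page_mkwrite() trying to lock an extent
         * range that the old extent_fiemap() already held locked.
         */
        if (ioctl(fd, FS_IOC_FIEMAP, (struct fiemap *)map) < 0)
                perror("FS_IOC_FIEMAP");

        munmap(map, len);
        close(fd);
        return 0;
}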