
[v2] btrfs: fix deadlock with fiemap and extent locking

Message ID 49c34d4ede23d28d916eab4a22d4ec698f77f498.1707756893.git.josef@toxicpanda.com (mailing list archive)
State: New, archived
Series: [v2] btrfs: fix deadlock with fiemap and extent locking

Commit Message

Josef Bacik Feb. 12, 2024, 4:56 p.m. UTC
While working on the patchset to remove extent locking, I got a lockdep
splat with fiemap and pagefaulting with my new extent lock replacement.

This deadlock exists with our normal code; we just don't have lockdep
annotations for the extent locking, so we've never noticed it.

Since we're copying the fiemap extent to user space on every iteration,
we have the chance of pagefaulting.  Because we hold the extent lock for
the entire range, we could mkwrite into a range of the file that we have
mmap'ed.  This would deadlock with the following stack trace:

[<0>] lock_extent+0x28d/0x2f0
[<0>] btrfs_page_mkwrite+0x273/0x8a0
[<0>] do_page_mkwrite+0x50/0xb0
[<0>] do_fault+0xc1/0x7b0
[<0>] __handle_mm_fault+0x2fa/0x460
[<0>] handle_mm_fault+0xa4/0x330
[<0>] do_user_addr_fault+0x1f4/0x800
[<0>] exc_page_fault+0x7c/0x1e0
[<0>] asm_exc_page_fault+0x26/0x30
[<0>] rep_movs_alternative+0x33/0x70
[<0>] _copy_to_user+0x49/0x70
[<0>] fiemap_fill_next_extent+0xc8/0x120
[<0>] emit_fiemap_extent+0x4d/0xa0
[<0>] extent_fiemap+0x7f8/0xad0
[<0>] btrfs_fiemap+0x49/0x80
[<0>] __x64_sys_ioctl+0x3e1/0xb50
[<0>] do_syscall_64+0x94/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76

I wrote an fstest to reproduce this deadlock without my replacement lock
and verified that the deadlock exists with our existing locking.
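
A hypothetical sketch of such a reproducer (this is not the actual
fstest; the mount point, the 4KiB page-size assumption, and the lack of
error handling are all illustrative) puts the fiemap extent array in a
writable mmap of the same file, in a page that has never been faulted
in:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/fiemap.h>
#include <linux/fs.h>

#define FILE_LEN	(1024 * 1024)
#define PAGE_SZ		4096	/* assumed page size */

int main(void)
{
	/* Hypothetical path; any file on a btrfs mount would do. */
	int fd = open("/mnt/btrfs/testfile", O_RDWR | O_CREAT, 0644);
	char data[PAGE_SZ];
	struct fiemap *fm;
	char *map;

	/* Make sure there is at least one extent for fiemap to emit. */
	memset(data, 0xaa, sizeof(data));
	pwrite(fd, data, sizeof(data), 0);
	fsync(fd);
	ftruncate(fd, FILE_LEN);

	map = mmap(NULL, FILE_LEN, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, 0);

	/*
	 * Place the fiemap header at the end of page 0 so that
	 * fm_extents[] starts exactly on page 1, which is never
	 * touched before the ioctl.
	 */
	fm = (struct fiemap *)(map + PAGE_SZ - sizeof(*fm));
	memset(fm, 0, sizeof(*fm));	/* faults in page 0 only */
	fm->fm_start = 0;
	fm->fm_length = FILE_LEN;
	fm->fm_extent_count = 16;

	/*
	 * Copying the first extent record out pagefaults on page 1 of
	 * our own file.  With the old locking, btrfs_page_mkwrite()
	 * then blocks on the extent lock that extent_fiemap() holds
	 * across the whole call: deadlock.
	 */
	ioctl(fd, FS_IOC_FIEMAP, fm);

	munmap(map, FILE_LEN);
	close(fd);
	return 0;
}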

To fix this, simply don't take the extent lock for the entire duration
of the fiemap.  This is safe in general because we keep track of where
we are while searching the tree, so if an ordered extent updates in the
middle of our fiemap call, we'll still emit the correct extents because
we know what offset we were at before.

The only place we maintain the lock is while searching for delalloc.
Since delalloc can change during writeback, we want to lock the extent
range so we have a consistent view of delalloc at the time we're
checking whether we need to set the delalloc flag.
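
Condensed from the patch below (sectorsize here stands for
inode->root->fs_info->sectorsize, and the declarations are shortened),
the locking that remains inside the loop of fiemap_process_hole() is
just a bracket around the delalloc search:

	struct extent_state *cached_state = NULL;
	u64 lockstart = round_down(cur_offset, sectorsize);
	u64 lockend = round_up(end, sectorsize);

	/* Hold the extent lock only across the delalloc search itself. */
	lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
	delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
						delalloc_cached_state,
						&delalloc_start,
						&delalloc_end);
	unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);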

With this patch applied we no longer deadlock with my testcase.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
v1->v2:
- Fixed up the various formatting comments.
- Added a comment for the locking.

 fs/btrfs/extent_io.c | 62 ++++++++++++++++++++++++++++++++------------
 1 file changed, 45 insertions(+), 17 deletions(-)

Comments

David Sterba Feb. 13, 2024, 7:11 p.m. UTC | #1
On Mon, Feb 12, 2024 at 11:56:02AM -0500, Josef Bacik wrote:
> [...]

Reviewed-by: David Sterba <dsterba@suse.com>
Wang Yugui Feb. 23, 2024, 2:46 a.m. UTC | #2
Hi,

> [...]

After I applied this patch (upstream commit
b0ad381fa7690244802aed119b478b4bdafc31dd) to 6.6.18-rc1, fstests
generic/561 triggers a WARNING.

Reproduction frequency: about 1 in 5 runs.

Or do we need some other patches too?

[22281.809944] ------------[ cut here ]------------
[22281.816133] WARNING: CPU: 12 PID: 2995743 at fs/btrfs/extent_io.c:2457 emit_fiemap_extent+0x84/0x90 [btrfs]
[22281.827552] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_snapshot dm_bufio dm_dust dm_flakey ext4 mbcache jbd2 loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs nfnetlink overlay rfkill vfat fat drm_exec amdxcp drm_buddy gpu_sched intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio kvm_intel snd_hda_intel radeon kvm snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core i2c_algo_bit snd_hwdep irqbypass drm_suballoc_helper snd_seq drm_display_helper rapl snd_seq_device cec intel_cstate snd_pcm mei_wdt drm_ttm_helper dcdbas snd_timer iTCO_wdt ttm mei_me dell_smm_hwmon iTCO_vendor_support intel_uncore mei snd video pcspkr i2c_i801 soundcore lpc_ich i2c_smbus fuse xfs sd_mod sg nvme_tcp nvme_fabrics nvme_core ahci libahci nvme_common mpt3sas t10_pi libata e1000e crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel raid_class
[22281.827650]  scsi_transport_sas wmi dm_multipath btrfs blake2b_generic xor zstd_compress raid6_pq sunrpc dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi i2c_dev [last unloaded: scsi_debug]
[22281.958863] CPU: 12 PID: 2995743 Comm: pool Tainted: G        W          6.6.18-0.1.el7.x86_64 #1
[22281.969394] Hardware name: Dell Inc. Precision T7610/0NK70N, BIOS A18 09/11/2019
[22281.978405] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
[22281.985631] Code: 2b 4c 89 63 08 4c 89 73 10 44 89 6b 18 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc 45 39 c1 75 cc 4c 01 f1 31 c0 48 89 4b 10 eb e3 <0f> 0b b8 ea ff ff ff eb da 0f 1f 00 90 90 90 90 90 90 90 90 90 90
[22282.007771] RSP: 0018:ffffc90026767c28 EFLAGS: 00010206
[22282.014681] RAX: 00000000001d1000 RBX: ffffc90026767d08 RCX: 0000000000005000
[22282.023469] RDX: 00000000001cd000 RSI: 00000000001cc000 RDI: ffffc90026767d90
[22282.032256] RBP: 00000000001cd000 R08: 0000000000008000 R09: 0000000000000000
[22282.041016] R10: 0000000000000000 R11: ffff88a0feeb20e0 R12: 0000000043ec2000
[22282.049759] R13: 0000000000000000 R14: 0000000000008000 R15: ffff88a10e2ca8c0
[22282.058636] FS:  00007fbedbfff640(0000) GS:ffff88b08f500000(0000) knlGS:0000000000000000
[22282.068287] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22282.075624] CR2: 00007fbfb401a000 CR3: 0000000512642001 CR4: 00000000001706e0
[22282.084242] Call Trace:
[22282.088280]  <TASK>
[22282.091954]  ? __warn+0x85/0x140
[22282.096772]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
[22282.103359]  ? report_bug+0xfc/0x1e0
[22282.108505]  ? handle_bug+0x3f/0x70
[22282.113545]  ? exc_invalid_op+0x17/0x70
[22282.118902]  ? asm_exc_invalid_op+0x1a/0x20
[22282.124639]  ? emit_fiemap_extent+0x84/0x90 [btrfs]
[22282.131140]  extent_fiemap+0x7d9/0xae0 [btrfs]
[22282.137198]  btrfs_fiemap+0x49/0x80 [btrfs]
[22282.143001]  do_vfs_ioctl+0x177/0x900
[22282.148220]  __x64_sys_ioctl+0x6e/0xd0
[22282.153502]  do_syscall_64+0x5c/0x90
[22282.158616]  ? __might_fault+0x26/0x30
[22282.163911]  ? _copy_to_user+0x49/0x70
[22282.169209]  ? do_vfs_ioctl+0x197/0x900
[22282.174579]  ? __x64_sys_ioctl+0xab/0xd0
[22282.180035]  ? syscall_exit_to_user_mode+0x26/0x40
[22282.186375]  ? do_syscall_64+0x6b/0x90
[22282.191650]  ? exc_page_fault+0x6b/0x150
[22282.197128]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[22282.203714] RIP: 0033:0x7fbfde23ec6b
[22282.208882] Code: 73 01 c3 48 8b 0d b5 b1 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 b1 1b 00 f7 d8 64 89 01 48
[22282.231063] RSP: 002b:00007fbedbffeb98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[22282.240383] RAX: ffffffffffffffda RBX: 00007fbec4801378 RCX: 00007fbfde23ec6b
[22282.249283] RDX: 00007fbec4801378 RSI: 00000000c020660b RDI: 0000000000000003
[22282.258180] RBP: 00007fbec4801370 R08: 00007fbedbffec90 R09: 00007fbedbffec7c
[22282.266974] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fbedbffec7c
[22282.275703] R13: 00007fbedbffec88 R14: 00007fbedbffec80 R15: 00007fbedbffec90
[22282.284443]  </TASK>
[22282.288165] ---[ end trace 0000000000000000 ]---

static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
    if (cache->offset + cache->len > offset) {
L2457:  WARN_ON(1);
        return -EINVAL;
    }

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2024/02/23
Filipe Manana Feb. 23, 2024, 8:11 a.m. UTC | #3
On Fri, Feb 23, 2024 at 2:52 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> > [...]
>
> After I applied this patch (upstream commit
> b0ad381fa7690244802aed119b478b4bdafc31dd) to 6.6.18-rc1, fstests
> generic/561 triggers a WARNING.
>
> Reproduction frequency: about 1 in 5 runs.
>
> Or do we need some other patches too?
>
> [...]
>
> static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
>     if (cache->offset + cache->len > offset) {
> L2457:  WARN_ON(1);
>         return -EINVAL;
>     }

Yes, this patch triggers a race. I was just working on a fix for it
yesterday, and I've now sent it to the list as the first patch in this
series:

https://lore.kernel.org/linux-btrfs/cover.1708635463.git.fdmanana@suse.com/

Thanks.


Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8648ea9b5fb5..dfcde1610b0a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2683,16 +2683,34 @@  static int fiemap_process_hole(struct btrfs_inode *inode,
 	 * it beyond i_size.
 	 */
 	while (cur_offset < end && cur_offset < i_size) {
+		struct extent_state *cached_state = NULL;
 		u64 delalloc_start;
 		u64 delalloc_end;
 		u64 prealloc_start;
+		u64 lockstart;
+		u64 lockend;
 		u64 prealloc_len = 0;
 		bool delalloc;
 
+		lockstart = round_down(cur_offset, inode->root->fs_info->sectorsize);
+		lockend = round_up(end, inode->root->fs_info->sectorsize);
+
+		/*
+		 * We are only locking for the delalloc range because that's the
+		 * only thing that can change here.  With fiemap we have a lock
+		 * on the inode, so no buffered or direct writes can happen.
+		 *
+		 * However mmaps and normal page writeback will cause this to
+		 * change arbitrarily.  We have to lock the extent lock here to
+		 * make sure that nobody messes with the tree while we're doing
+		 * btrfs_find_delalloc_in_range.
+		 */
+		lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
 							delalloc_cached_state,
 							&delalloc_start,
 							&delalloc_end);
+		unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 		if (!delalloc)
 			break;
 
@@ -2860,15 +2878,15 @@  int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		  u64 start, u64 len)
 {
 	const u64 ino = btrfs_ino(inode);
-	struct extent_state *cached_state = NULL;
 	struct extent_state *delalloc_cached_state = NULL;
 	struct btrfs_path *path;
 	struct fiemap_cache cache = { 0 };
 	struct btrfs_backref_share_check_ctx *backref_ctx;
 	u64 last_extent_end;
 	u64 prev_extent_end;
-	u64 lockstart;
-	u64 lockend;
+	u64 range_start;
+	u64 range_end;
+	const u64 sectorsize = inode->root->fs_info->sectorsize;
 	bool stopped = false;
 	int ret;
 
@@ -2879,12 +2897,11 @@  int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		goto out;
 	}
 
-	lockstart = round_down(start, inode->root->fs_info->sectorsize);
-	lockend = round_up(start + len, inode->root->fs_info->sectorsize);
-	prev_extent_end = lockstart;
+	range_start = round_down(start, sectorsize);
+	range_end = round_up(start + len, sectorsize);
+	prev_extent_end = range_start;
 
 	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
-	lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 
 	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
 	if (ret < 0)
@@ -2892,7 +2909,7 @@  int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	btrfs_release_path(path);
 
 	path->reada = READA_FORWARD;
-	ret = fiemap_search_slot(inode, path, lockstart);
+	ret = fiemap_search_slot(inode, path, range_start);
 	if (ret < 0) {
 		goto out_unlock;
 	} else if (ret > 0) {
@@ -2904,7 +2921,7 @@  int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		goto check_eof_delalloc;
 	}
 
-	while (prev_extent_end < lockend) {
+	while (prev_extent_end < range_end) {
 		struct extent_buffer *leaf = path->nodes[0];
 		struct btrfs_file_extent_item *ei;
 		struct btrfs_key key;
@@ -2927,19 +2944,19 @@  int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		 * The first iteration can leave us at an extent item that ends
 		 * before our range's start. Move to the next item.
 		 */
-		if (extent_end <= lockstart)
+		if (extent_end <= range_start)
 			goto next_item;
 
 		backref_ctx->curr_leaf_bytenr = leaf->start;
 
 		/* We have in implicit hole (NO_HOLES feature enabled). */
 		if (prev_extent_end < key.offset) {
-			const u64 range_end = min(key.offset, lockend) - 1;
+			const u64 hole_end = min(key.offset, range_end) - 1;
 
 			ret = fiemap_process_hole(inode, fieinfo, &cache,
 						  &delalloc_cached_state,
 						  backref_ctx, 0, 0, 0,
-						  prev_extent_end, range_end);
+						  prev_extent_end, hole_end);
 			if (ret < 0) {
 				goto out_unlock;
 			} else if (ret > 0) {
@@ -2949,7 +2966,7 @@  int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			}
 
 			/* We've reached the end of the fiemap range, stop. */
-			if (key.offset >= lockend) {
+			if (key.offset >= range_end) {
 				stopped = true;
 				break;
 			}
@@ -3043,29 +3060,41 @@  int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	btrfs_free_path(path);
 	path = NULL;
 
-	if (!stopped && prev_extent_end < lockend) {
+	if (!stopped && prev_extent_end < range_end) {
 		ret = fiemap_process_hole(inode, fieinfo, &cache,
 					  &delalloc_cached_state, backref_ctx,
-					  0, 0, 0, prev_extent_end, lockend - 1);
+					  0, 0, 0, prev_extent_end, range_end - 1);
 		if (ret < 0)
 			goto out_unlock;
-		prev_extent_end = lockend;
+		prev_extent_end = range_end;
 	}
 
 	if (cache.cached && cache.offset + cache.len >= last_extent_end) {
 		const u64 i_size = i_size_read(&inode->vfs_inode);
 
 		if (prev_extent_end < i_size) {
+			struct extent_state *cached_state = NULL;
 			u64 delalloc_start;
 			u64 delalloc_end;
+			u64 lockstart;
+			u64 lockend;
 			bool delalloc;
 
+			lockstart = round_down(prev_extent_end, sectorsize);
+			lockend = round_up(i_size, sectorsize);
+
+			/*
+			 * See the comment in fiemap_process_hole as to why
+			 * we're doing the locking here.
+			 */
+			lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 			delalloc = btrfs_find_delalloc_in_range(inode,
 								prev_extent_end,
 								i_size - 1,
 								&delalloc_cached_state,
 								&delalloc_start,
 								&delalloc_end);
+			unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 			if (!delalloc)
 				cache.flags |= FIEMAP_EXTENT_LAST;
 		} else {
@@ -3076,7 +3105,6 @@  int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	ret = emit_last_fiemap_cache(fieinfo, &cache);
 
 out_unlock:
-	unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 out:
 	free_extent_state(delalloc_cached_state);