Message ID | 20181012090355.25755-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Btrfs: fix deadlock when writing out free space caches | expand |
On Fri, Oct 12, 2018 at 10:03:55AM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When writing out a block group free space cache we can end deadlocking > with ourseves on an extent buffer lock resulting in a warning like the > following: > > [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs] > [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G > W I 4.16.8 #1 > [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs] > [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246 > [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001 > [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20 > [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000 > [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630 > [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000 > [245043.404780] FS: 0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000 > [245043.406050] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0 > [245043.408670] Call Trace: > [245043.409977] btrfs_search_slot+0x761/0xa60 [btrfs] > [245043.411278] btrfs_insert_empty_items+0x62/0xb0 [btrfs] > [245043.412572] btrfs_insert_item+0x5b/0xc0 [btrfs] > [245043.413922] btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs] > [245043.415216] do_chunk_alloc+0x1e5/0x2a0 [btrfs] > [245043.416487] find_free_extent+0xcd0/0xf60 [btrfs] > [245043.417813] btrfs_reserve_extent+0x96/0x1e0 [btrfs] > [245043.419105] btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs] > [245043.420378] __btrfs_cow_block+0x127/0x550 [btrfs] > [245043.421652] btrfs_cow_block+0xee/0x190 [btrfs] > [245043.422979] btrfs_search_slot+0x227/0xa60 [btrfs] > [245043.424279] ? btrfs_update_inode_item+0x59/0x100 [btrfs] > [245043.425538] ? iput+0x72/0x1e0 > [245043.426798] write_one_cache_group.isra.49+0x20/0x90 [btrfs] > [245043.428131] btrfs_start_dirty_block_groups+0x102/0x420 [btrfs] > [245043.429419] btrfs_commit_transaction+0x11b/0x880 [btrfs] > [245043.430712] ? start_transaction+0x8e/0x410 [btrfs] > [245043.432006] transaction_kthread+0x184/0x1a0 [btrfs] > [245043.433341] kthread+0xf0/0x130 > [245043.434628] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs] > [245043.435928] ? kthread_create_worker_on_cpu+0x40/0x40 > [245043.437236] ret_from_fork+0x1f/0x30 > [245043.441054] ---[ end trace 15abaa2aaf36827f ]--- > > This is because at write_one_cache_group() when we are COWing a leaf from > the extent tree we end up allocating a new block group (chunk) and, > because we have hit a threshold on the number of bytes reserved for system > chunks, we attempt to finalize the creation of new block groups from the > current transaction, by calling btrfs_create_pending_block_groups(). > However here we also need to modify the extent tree in order to insert > a block group item, and if the location for this new block group item > happens to be in the same leaf that we were COWing earlier, we deadlock > since btrfs_search_slot() tries to write lock the extent buffer that we > locked before at write_one_cache_group(). > > We have already hit similar cases in the past and commit d9a0540a79f8 > ("Btrfs: fix deadlock when finalizing block group creation") fixed some > of those cases by delaying the creation of pending block groups at the > known specific spots that could lead to a deadlock. This change reworks > that commit to be more generic so that we don't have to add similar logic > to every possible path that can lead to a deadlock. This is done by > making __btrfs_cow_block() disallowing the creation of new block groups > (setting the transaction's can_flush_pending_bgs to false) before it > attempts to allocate a new extent buffer for either the extent, chunk or > device trees, since those are the trees that pending block creation > modifies. Once the new extent buffer is allocated, it allows creation of > pending block groups to happen again. > > This change depends on a recent patch from Josef which is not yet in > Linus' tree, named "btrfs: make sure we create all new block groups" in > order to avoid occasional warnings at btrfs_trans_release_chunk_metadata(). > > Fixes: d9a0540a79f8 ("Btrfs: fix deadlock when finalizing block group creation") > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753 > Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/ > Reported-by: E V <eliventer@gmail.com> > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Fri, Oct 12, 2018 at 10:03:55AM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When writing out a block group free space cache we can end deadlocking > with ourseves on an extent buffer lock resulting in a warning like the > following: > > [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs] > [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G > W I 4.16.8 #1 > [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs] > [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246 > [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001 > [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20 > [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000 > [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630 > [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000 > [245043.404780] FS: 0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000 > [245043.406050] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0 > [245043.408670] Call Trace: > [245043.409977] btrfs_search_slot+0x761/0xa60 [btrfs] > [245043.411278] btrfs_insert_empty_items+0x62/0xb0 [btrfs] > [245043.412572] btrfs_insert_item+0x5b/0xc0 [btrfs] > [245043.413922] btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs] > [245043.415216] do_chunk_alloc+0x1e5/0x2a0 [btrfs] > [245043.416487] find_free_extent+0xcd0/0xf60 [btrfs] > [245043.417813] btrfs_reserve_extent+0x96/0x1e0 [btrfs] > [245043.419105] btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs] > [245043.420378] __btrfs_cow_block+0x127/0x550 [btrfs] > [245043.421652] btrfs_cow_block+0xee/0x190 [btrfs] > [245043.422979] btrfs_search_slot+0x227/0xa60 [btrfs] > [245043.424279] ? btrfs_update_inode_item+0x59/0x100 [btrfs] > [245043.425538] ? iput+0x72/0x1e0 > [245043.426798] write_one_cache_group.isra.49+0x20/0x90 [btrfs] > [245043.428131] btrfs_start_dirty_block_groups+0x102/0x420 [btrfs] > [245043.429419] btrfs_commit_transaction+0x11b/0x880 [btrfs] > [245043.430712] ? start_transaction+0x8e/0x410 [btrfs] > [245043.432006] transaction_kthread+0x184/0x1a0 [btrfs] > [245043.433341] kthread+0xf0/0x130 > [245043.434628] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs] > [245043.435928] ? kthread_create_worker_on_cpu+0x40/0x40 > [245043.437236] ret_from_fork+0x1f/0x30 > [245043.441054] ---[ end trace 15abaa2aaf36827f ]--- > > This is because at write_one_cache_group() when we are COWing a leaf from > the extent tree we end up allocating a new block group (chunk) and, > because we have hit a threshold on the number of bytes reserved for system > chunks, we attempt to finalize the creation of new block groups from the > current transaction, by calling btrfs_create_pending_block_groups(). > However here we also need to modify the extent tree in order to insert > a block group item, and if the location for this new block group item > happens to be in the same leaf that we were COWing earlier, we deadlock > since btrfs_search_slot() tries to write lock the extent buffer that we > locked before at write_one_cache_group(). > > We have already hit similar cases in the past and commit d9a0540a79f8 > ("Btrfs: fix deadlock when finalizing block group creation") fixed some > of those cases by delaying the creation of pending block groups at the > known specific spots that could lead to a deadlock. This change reworks > that commit to be more generic so that we don't have to add similar logic > to every possible path that can lead to a deadlock. This is done by > making __btrfs_cow_block() disallowing the creation of new block groups > (setting the transaction's can_flush_pending_bgs to false) before it > attempts to allocate a new extent buffer for either the extent, chunk or > device trees, since those are the trees that pending block creation > modifies. Once the new extent buffer is allocated, it allows creation of > pending block groups to happen again. > > This change depends on a recent patch from Josef which is not yet in > Linus' tree, named "btrfs: make sure we create all new block groups" in > order to avoid occasional warnings at btrfs_trans_release_chunk_metadata(). Thanks for mentioning it, the referenced patch has been in misc-next and is scheduled for 4.20 so your patch will be added too.
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index d436fb4c002e..089b46c4d97f 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1050,9 +1050,26 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans, if ((root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) && parent) parent_start = parent->start; + /* + * If we are COWing a node/leaf from the extent, chunk or device trees, + * make sure that we do not finish block group creation of pending block + * groups. We do this to avoid a deadlock. + * COWing can result in allocation of a new chunk, and flushing pending + * block groups (btrfs_create_pending_block_groups()) can be triggered + * when finishing allocation of a new chunk. Creation of a pending block + * group modifies the extent, chunk and device trees, therefore we could + * deadlock with ourselves since we are holding a lock on an extent + * buffer that btrfs_create_pending_block_groups() may try to COW later. + */ + if (root == fs_info->extent_root || + root == fs_info->chunk_root || + root == fs_info->dev_root) + trans->can_flush_pending_bgs = false; + cow = btrfs_alloc_tree_block(trans, root, parent_start, root->root_key.objectid, &disk_key, level, search_start, empty_size); + trans->can_flush_pending_bgs = true; if (IS_ERR(cow)) return PTR_ERR(cow); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 2f9748599268..577878324799 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2911,7 +2911,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head; int ret; int run_all = count == (unsigned long)-1; - bool can_flush_pending_bgs = trans->can_flush_pending_bgs; /* We'll clean this up in btrfs_cleanup_transaction */ if (trans->aborted) @@ -2928,7 +2927,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, #ifdef SCRAMBLE_DELAYED_REFS delayed_refs->run_delayed_start = find_middle(&delayed_refs->root); #endif - trans->can_flush_pending_bgs = false; ret = __btrfs_run_delayed_refs(trans, count); if (ret < 0) { btrfs_abort_transaction(trans, ret); @@ -2959,7 +2957,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, goto again; } out: - trans->can_flush_pending_bgs = can_flush_pending_bgs; return 0; } @@ -4554,11 +4551,9 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags, * the block groups that were made dirty during the lifetime of the * transaction. */ - if (trans->can_flush_pending_bgs && - trans->chunk_bytes_reserved >= (u64)SZ_2M) { + if (trans->chunk_bytes_reserved >= (u64)SZ_2M) btrfs_create_pending_block_groups(trans); - btrfs_trans_release_chunk_metadata(trans); - } + return ret; } @@ -10079,9 +10074,10 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) struct btrfs_block_group_item item; struct btrfs_key key; int ret = 0; - bool can_flush_pending_bgs = trans->can_flush_pending_bgs; - trans->can_flush_pending_bgs = false; + if (!trans->can_flush_pending_bgs) + return; + while (!list_empty(&trans->new_bgs)) { block_group = list_first_entry(&trans->new_bgs, struct btrfs_block_group_cache, @@ -10106,7 +10102,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) next: list_del_init(&block_group->bg_list); } - trans->can_flush_pending_bgs = can_flush_pending_bgs; + btrfs_trans_release_chunk_metadata(trans); } int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,