From patchwork Fri Oct 12 09:03:55 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 10638127 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 76D7217E1 for ; Fri, 12 Oct 2018 09:03:58 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 651992BA11 for ; Fri, 12 Oct 2018 09:03:58 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 590A92BA22; Fri, 12 Oct 2018 09:03:58 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9B1072BA11 for ; Fri, 12 Oct 2018 09:03:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728028AbeJLQfW (ORCPT ); Fri, 12 Oct 2018 12:35:22 -0400 Received: from mail.kernel.org ([198.145.29.99]:37340 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727992AbeJLQfW (ORCPT ); Fri, 12 Oct 2018 12:35:22 -0400 Received: from localhost.localdomain (bl8-197-74.dsl.telepac.pt [85.241.197.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 27C602075B for ; Fri, 12 Oct 2018 09:03:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1539335035; bh=7oFXwbmetbEBcAtRMHybbDoDkHJn89LuybMMHjHZ1Hw=; h=From:To:Subject:Date:From; b=w+1u9WAVt+RuxJr/xV16sBAzNVo8cJ5Y7lJZYql2BqkYITlVT7pZRdnkNCbfPrSPt N8p7fx8n6xvaFe4rpaZolZ0ybv4aUbhWEy8n/932Y2eoLNciiwSad/bFLYUESzBZ5l JLUbuRLStzpUPFzjZwrbFrW5vTK2F3tEbnJbpPfU= From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH] Btrfs: fix deadlock when writing out free space caches Date: Fri, 12 Oct 2018 10:03:55 +0100 Message-Id: <20181012090355.25755-1-fdmanana@kernel.org> X-Mailer: git-send-email 2.11.0 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Filipe Manana When writing out a block group free space cache we can end deadlocking with ourseves on an extent buffer lock resulting in a warning like the following: [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs] [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G W I 4.16.8 #1 [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs] [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246 [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001 [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20 [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000 [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630 [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000 [245043.404780] FS: 0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000 [245043.406050] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0 [245043.408670] Call Trace: [245043.409977] btrfs_search_slot+0x761/0xa60 [btrfs] [245043.411278] btrfs_insert_empty_items+0x62/0xb0 [btrfs] [245043.412572] btrfs_insert_item+0x5b/0xc0 [btrfs] [245043.413922] btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs] [245043.415216] do_chunk_alloc+0x1e5/0x2a0 [btrfs] [245043.416487] find_free_extent+0xcd0/0xf60 [btrfs] [245043.417813] btrfs_reserve_extent+0x96/0x1e0 [btrfs] [245043.419105] btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs] [245043.420378] __btrfs_cow_block+0x127/0x550 [btrfs] [245043.421652] btrfs_cow_block+0xee/0x190 [btrfs] [245043.422979] btrfs_search_slot+0x227/0xa60 [btrfs] [245043.424279] ? btrfs_update_inode_item+0x59/0x100 [btrfs] [245043.425538] ? iput+0x72/0x1e0 [245043.426798] write_one_cache_group.isra.49+0x20/0x90 [btrfs] [245043.428131] btrfs_start_dirty_block_groups+0x102/0x420 [btrfs] [245043.429419] btrfs_commit_transaction+0x11b/0x880 [btrfs] [245043.430712] ? start_transaction+0x8e/0x410 [btrfs] [245043.432006] transaction_kthread+0x184/0x1a0 [btrfs] [245043.433341] kthread+0xf0/0x130 [245043.434628] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs] [245043.435928] ? kthread_create_worker_on_cpu+0x40/0x40 [245043.437236] ret_from_fork+0x1f/0x30 [245043.441054] ---[ end trace 15abaa2aaf36827f ]--- This is because at write_one_cache_group() when we are COWing a leaf from the extent tree we end up allocating a new block group (chunk) and, because we have hit a threshold on the number of bytes reserved for system chunks, we attempt to finalize the creation of new block groups from the current transaction, by calling btrfs_create_pending_block_groups(). However here we also need to modify the extent tree in order to insert a block group item, and if the location for this new block group item happens to be in the same leaf that we were COWing earlier, we deadlock since btrfs_search_slot() tries to write lock the extent buffer that we locked before at write_one_cache_group(). We have already hit similar cases in the past and commit d9a0540a79f8 ("Btrfs: fix deadlock when finalizing block group creation") fixed some of those cases by delaying the creation of pending block groups at the known specific spots that could lead to a deadlock. This change reworks that commit to be more generic so that we don't have to add similar logic to every possible path that can lead to a deadlock. This is done by making __btrfs_cow_block() disallowing the creation of new block groups (setting the transaction's can_flush_pending_bgs to false) before it attempts to allocate a new extent buffer for either the extent, chunk or device trees, since those are the trees that pending block creation modifies. Once the new extent buffer is allocated, it allows creation of pending block groups to happen again. This change depends on a recent patch from Josef which is not yet in Linus' tree, named "btrfs: make sure we create all new block groups" in order to avoid occasional warnings at btrfs_trans_release_chunk_metadata(). Fixes: d9a0540a79f8 ("Btrfs: fix deadlock when finalizing block group creation") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753 Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/ Reported-by: E V Signed-off-by: Filipe Manana Reviewed-by: Josef Bacik --- fs/btrfs/ctree.c | 17 +++++++++++++++++ fs/btrfs/extent-tree.c | 16 ++++++---------- 2 files changed, 23 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index d436fb4c002e..089b46c4d97f 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1050,9 +1050,26 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans, if ((root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) && parent) parent_start = parent->start; + /* + * If we are COWing a node/leaf from the extent, chunk or device trees, + * make sure that we do not finish block group creation of pending block + * groups. We do this to avoid a deadlock. + * COWing can result in allocation of a new chunk, and flushing pending + * block groups (btrfs_create_pending_block_groups()) can be triggered + * when finishing allocation of a new chunk. Creation of a pending block + * group modifies the extent, chunk and device trees, therefore we could + * deadlock with ourselves since we are holding a lock on an extent + * buffer that btrfs_create_pending_block_groups() may try to COW later. + */ + if (root == fs_info->extent_root || + root == fs_info->chunk_root || + root == fs_info->dev_root) + trans->can_flush_pending_bgs = false; + cow = btrfs_alloc_tree_block(trans, root, parent_start, root->root_key.objectid, &disk_key, level, search_start, empty_size); + trans->can_flush_pending_bgs = true; if (IS_ERR(cow)) return PTR_ERR(cow); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 2f9748599268..577878324799 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2911,7 +2911,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head; int ret; int run_all = count == (unsigned long)-1; - bool can_flush_pending_bgs = trans->can_flush_pending_bgs; /* We'll clean this up in btrfs_cleanup_transaction */ if (trans->aborted) @@ -2928,7 +2927,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, #ifdef SCRAMBLE_DELAYED_REFS delayed_refs->run_delayed_start = find_middle(&delayed_refs->root); #endif - trans->can_flush_pending_bgs = false; ret = __btrfs_run_delayed_refs(trans, count); if (ret < 0) { btrfs_abort_transaction(trans, ret); @@ -2959,7 +2957,6 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, goto again; } out: - trans->can_flush_pending_bgs = can_flush_pending_bgs; return 0; } @@ -4554,11 +4551,9 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags, * the block groups that were made dirty during the lifetime of the * transaction. */ - if (trans->can_flush_pending_bgs && - trans->chunk_bytes_reserved >= (u64)SZ_2M) { + if (trans->chunk_bytes_reserved >= (u64)SZ_2M) btrfs_create_pending_block_groups(trans); - btrfs_trans_release_chunk_metadata(trans); - } + return ret; } @@ -10079,9 +10074,10 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) struct btrfs_block_group_item item; struct btrfs_key key; int ret = 0; - bool can_flush_pending_bgs = trans->can_flush_pending_bgs; - trans->can_flush_pending_bgs = false; + if (!trans->can_flush_pending_bgs) + return; + while (!list_empty(&trans->new_bgs)) { block_group = list_first_entry(&trans->new_bgs, struct btrfs_block_group_cache, @@ -10106,7 +10102,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) next: list_del_init(&block_group->bg_list); } - trans->can_flush_pending_bgs = can_flush_pending_bgs; + btrfs_trans_release_chunk_metadata(trans); } int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,