From patchwork Wed May 6 15:15:09 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 6350281 Return-Path: X-Original-To: patchwork-linux-btrfs@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork2.web.kernel.org (Postfix) with ESMTP id C8DA8BEEE1 for ; Wed, 6 May 2015 15:19:21 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id BEB8B20279 for ; Wed, 6 May 2015 15:19:20 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 7D73E20149 for ; Wed, 6 May 2015 15:19:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751541AbbEFPTL (ORCPT ); Wed, 6 May 2015 11:19:11 -0400 Received: from victor.provo.novell.com ([137.65.250.26]:57203 "EHLO prv3-mh.provo.novell.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751127AbbEFPTJ (ORCPT ); Wed, 6 May 2015 11:19:09 -0400 Received: from debian3.lan (prv-ext-foundry1int.gns.novell.com [137.65.251.240]) by prv3-mh.provo.novell.com with ESMTP (NOT encrypted); Wed, 06 May 2015 09:19:02 -0600 From: Filipe Manana To: linux-btrfs@vger.kernel.org Cc: clm@fb.com, Filipe Manana Subject: [PATCH v2] Btrfs: fix race between block group creation and their cache writeout Date: Wed, 6 May 2015 16:15:09 +0100 Message-Id: <1430925309-20927-1-git-send-email-fdmanana@suse.com> X-Mailer: git-send-email 2.1.3 In-Reply-To: <1430907756-21192-1-git-send-email-fdmanana@suse.com> References: <1430907756-21192-1-git-send-email-fdmanana@suse.com> Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, T_RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP So creating a block group has 2 distinct phases: Phase 1 - creates the btrfs_block_group_cache item and adds it to the rbtree fs_info->block_group_cache_tree and to the corresponding list space_info->block_groups[]; Phase 2 - adds the block group item to the extent tree and corresponding items to the chunk tree. The first phase adds the block_group_cache_item to a list of pending block groups in the transaction handle, and phase 2 happens when btrfs_end_transaction() is called against the transaction handle. It happens that once phase 1 completes, other concurrent tasks that use their own transaction handle, but points to the same running transaction (struct btrfs_trans_handle->transaction), can use this block group for space allocations and therefore mark it dirty. Dirty block groups are tracked in a list belonging to the currently running transaction (struct btrfs_transaction) and not in the transaction handle (btrfs_trans_handle). This is a problem because once a task calls btrfs_commit_transaction(), it calls btrfs_start_dirty_block_groups() which will see all dirty block groups and attempt to start their writeout, including those that are still attached to the transaction handle of some concurrent task that hasn't called btrfs_end_transaction() yet - which means those block groups haven't gone through phase 2 yet and therefore when write_one_cache_group() is called, it won't find the block group items in the extent tree and abort the current transaction with -ENOENT, turning the fs into readonly mode and require a remount. Fix this by ignoring -ENOENT when looking for block group items in the extent tree when we attempt to start the writeout of the block group caches outside the critical section of the transaction commit. We will try again later during the critical section and if there we still don't find the block group item in the extent tree, we then abort the current transaction. This issue happened twice, once while running fstests btrfs/067 and once for btrfs/078, which produced the following trace: [ 3278.703014] WARNING: CPU: 7 PID: 18499 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() [ 3278.707329] BTRFS: Transaction aborted (error -2) (...) [ 3278.731555] Call Trace: [ 3278.732396] [] dump_stack+0x4f/0x7b [ 3278.733860] [] ? console_unlock+0x361/0x3ad [ 3278.735312] [] warn_slowpath_common+0xa1/0xbb [ 3278.736874] [] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [ 3278.738302] [] warn_slowpath_fmt+0x46/0x48 [ 3278.739520] [] __btrfs_abort_transaction+0x52/0x114 [btrfs] [ 3278.741222] [] write_one_cache_group+0xae/0xbf [btrfs] [ 3278.742797] [] btrfs_start_dirty_block_groups+0x170/0x2b2 [btrfs] [ 3278.744492] [] btrfs_commit_transaction+0x130/0x9c9 [btrfs] [ 3278.746084] [] ? trace_hardirqs_on+0xd/0xf [ 3278.747249] [] btrfs_sync_file+0x313/0x387 [btrfs] [ 3278.748744] [] vfs_fsync_range+0x95/0xa4 [ 3278.749958] [] ? ret_from_sys_call+0x1d/0x58 [ 3278.751218] [] vfs_fsync+0x1c/0x1e [ 3278.754197] [] do_fsync+0x34/0x4e [ 3278.755192] [] SyS_fsync+0x10/0x14 [ 3278.756236] [] system_call_fastpath+0x12/0x17 [ 3278.757366] ---[ end trace 9a4d4df4969709aa ]--- Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout outside critical section in commit") Signed-off-by: Filipe Manana --- V2: Change the approach. Former version inserted the block group items into the extent tree while were holding the chunks mutex, which can be a problem if the insertion into the extent tree requires allocating a new metadata block group (deadlock, attempting to acquire the chunks mutex again). So just make btrfs_start_dirty_block_groups() skip block groups for which the item is missing in the extent tree - we'll try it again in the critical section of the transaction commit. fs/btrfs/extent-tree.c | 31 +++++++++++++++++++++++++++---- 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0ec8e22..7effed6 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3180,8 +3180,6 @@ static int write_one_cache_group(struct btrfs_trans_handle *trans, btrfs_mark_buffer_dirty(leaf); fail: btrfs_release_path(path); - if (ret) - btrfs_abort_transaction(trans, root, ret); return ret; } @@ -3487,8 +3485,30 @@ again: ret = 0; } } - if (!ret) + if (!ret) { ret = write_one_cache_group(trans, root, path, cache); + /* + * Our block group might still be attached to the list + * of new block groups in the transaction handle of some + * other task (struct btrfs_trans_handle->new_bgs). This + * means its block group item isn't yet in the extent + * tree. If this happens ignore the error, as we will + * try again later in the critical section of the + * transaction commit. + */ + if (ret == -ENOENT) { + ret = 0; + spin_lock(&cur_trans->dirty_bgs_lock); + if (list_empty(&cache->dirty_list)) { + list_add_tail(&cache->dirty_list, + &cur_trans->dirty_bgs); + btrfs_get_block_group(cache); + } + spin_unlock(&cur_trans->dirty_bgs_lock); + } else if (ret) { + btrfs_abort_transaction(trans, root, ret); + } + } /* if its not on the io list, we need to put the block group */ if (should_put) @@ -3597,8 +3617,11 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans, ret = 0; } } - if (!ret) + if (!ret) { ret = write_one_cache_group(trans, root, path, cache); + if (ret) + btrfs_abort_transaction(trans, root, ret); + } /* if its not on the io list, we need to put the block group */ if (should_put)