From patchwork Thu May 14 09:46:03 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 6403761 Return-Path: X-Original-To: patchwork-linux-btrfs@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork1.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork1.web.kernel.org (Postfix) with ESMTP id 3936F9F32E for ; Thu, 14 May 2015 09:46:44 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 4376B20435 for ; Thu, 14 May 2015 09:46:43 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id D9223203F7 for ; Thu, 14 May 2015 09:46:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753509AbbENJqc (ORCPT ); Thu, 14 May 2015 05:46:32 -0400 Received: from victor.provo.novell.com ([137.65.250.26]:47562 "EHLO prv3-mh.provo.novell.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1753058AbbENJqb (ORCPT ); Thu, 14 May 2015 05:46:31 -0400 Received: from debian3.lan (prv-ext-foundry1int.gns.novell.com [137.65.251.240]) by prv3-mh.provo.novell.com with ESMTP (NOT encrypted); Thu, 14 May 2015 03:46:24 -0600 From: Filipe Manana To: linux-btrfs@vger.kernel.org Cc: jeffm@suse.com, clm@fb.com, Filipe Manana Subject: [PATCH] Btrfs: fix chunk allocation regression leading to transaction abort Date: Thu, 14 May 2015 10:46:03 +0100 Message-Id: <1431596763-13608-1-git-send-email-fdmanana@suse.com> X-Mailer: git-send-email 2.1.3 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, T_RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP With commit 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole") introduced in the kernel 4.1 merge window, we end up using part of a device hole for which there are already pending chunks or pinned chunks. Before that commit we didn't use the hole and would just move on to the next hole in the device. However when we adjust the start offset for the chunk allocation and we have pinned chunks, we set it blindly to the end offset of the pinned chunk we are currently processing, which is dangerous because we can have a pending chunk that has a start offset that matches the end offset of our pinned chunk - leading us to a case where we end up getting two pending chunks that start at the same physical device offset, which makes us later abort the current transaction with -EEXIST when finishing the chunk allocation at btrfs_create_pending_block_groups(): [194737.659017] ------------[ cut here ]------------ [194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]() [194737.662209] BTRFS: Transaction aborted (error -17) [194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse [194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [194737.682999] 0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2 [194737.684540] ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78 [194737.686017] ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40 [194737.687509] Call Trace: [194737.688068] [] dump_stack+0x4f/0x7b [194737.689027] [] ? console_unlock+0x361/0x3ad [194737.690095] [] warn_slowpath_common+0xa1/0xbb [194737.691198] [] ? __btrfs_abort_transaction+0x52/0x106 [btrfs] [194737.693789] [] warn_slowpath_fmt+0x46/0x48 [194737.695065] [] __btrfs_abort_transaction+0x52/0x106 [btrfs] [194737.696806] [] btrfs_create_pending_block_groups+0x101/0x130 [btrfs] [194737.698683] [] __btrfs_end_transaction+0x84/0x366 [btrfs] [194737.700329] [] btrfs_end_transaction+0x10/0x12 [btrfs] [194737.701924] [] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [194737.703675] [] __btrfs_buffered_write+0x16a/0x4c8 [btrfs] [194737.705417] [] ? btrfs_file_write_iter+0x19a/0x431 [btrfs] [194737.707058] [] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs] [194737.708560] [] btrfs_file_write_iter+0x325/0x431 [btrfs] [194737.710673] [] ? get_parent_ip+0xe/0x3e [194737.712076] [] new_sync_write+0x7c/0xa0 [194737.713293] [] vfs_write+0xb2/0x117 [194737.714443] [] SyS_pwrite64+0x64/0x82 [194737.715646] [] system_call_fastpath+0x12/0x17 [194737.717175] ---[ end trace f2d5dc04e56d7e48 ]--- [194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by btrfs_create_pending_block_groups(), when it attempts to insert a duplicated device extent item via btrfs_alloc_dev_extent(). This issue was reproducible with fstests generic/038 running in a loop for several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard". Applying Jeff's recent patch titled "btrfs: add missing discards when unpinning extents with -o discard" makes the issue much easier to reproduce (usually within 4 to 5 hours), since it pins chunks for longer periods of time when an unused block group is deleted by the cleaner kthread. Fix this by making sure that we never adjust the start offset to a lower value than it currently has. Fixes: 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole" Signed-off-by: Filipe Manana --- fs/btrfs/volumes.c | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 96aebf3..56b91ed 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1067,15 +1067,31 @@ again: map = (struct map_lookup *)em->bdev; for (i = 0; i < map->num_stripes; i++) { + u64 end; + if (map->stripes[i].dev != device) continue; if (map->stripes[i].physical >= physical_start + len || map->stripes[i].physical + em->orig_block_len <= physical_start) continue; - *start = map->stripes[i].physical + - em->orig_block_len; - ret = 1; + /* + * Make sure that while processing the pinned list we do + * not override our *start with a lower value, because + * we can have pinned chunks that fall within this + * device hole and that have lower physical addresses + * than the pending chunks we processed before. If we + * do not take this special care we can end up getting + * 2 pending chunks that start at the same physical + * device offsets because the end offset of a pinned + * chunk can be equal to the start offset of some + * pending chunk. + */ + end = map->stripes[i].physical + em->orig_block_len; + if (end > *start) { + *start = end; + ret = 1; + } } } if (search_list == &trans->transaction->pending_chunks) {