From patchwork Thu Feb 17 23:14:43 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12750717 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4CCB4C433F5 for ; Thu, 17 Feb 2022 23:17:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229889AbiBQXR0 (ORCPT ); Thu, 17 Feb 2022 18:17:26 -0500 Received: from gmail-smtp-in.l.google.com ([23.128.96.19]:53464 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229473AbiBQXRZ (ORCPT ); Thu, 17 Feb 2022 18:17:25 -0500 Received: from mail-pf1-x431.google.com (mail-pf1-x431.google.com [IPv6:2607:f8b0:4864:20::431]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5962A3526E for ; Thu, 17 Feb 2022 15:17:03 -0800 (PST) Received: by mail-pf1-x431.google.com with SMTP id p8so911965pfh.8 for ; Thu, 17 Feb 2022 15:17:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=XU8iIGar3QBec2AaEQxE3tFA1chwQxxmXuzZk4nbLLQ=; b=UUkpxPbxcRng6h0KqgbDohMBnbc+vaDLZqQCYKIoKd9XoG5keulvPWPLQ9OkhtUxvg aIBVOGhcI6s4eTU7ou2p/gt0GMhEvfSWZMhna+0POwscxwq6nkXi7y6kLiibS3sWG5Kx o3YUdtdR+jowvkYFJffUrc/wKcHZo3FLpjB54KZCoxJfMbSF3AmfWOSjde+ByNO9jUez oXlrDwCTR89ExiyQzrofIJzYasd8xxeF01DqS8Jj+JXfThVfqwcHeNJ2HcubN1u7JbiT vzi0gq7b7q/xk5W9PrhJ5YlB/684WFy1GSCgZoYmjbIga958wFTvHRvGjm5ikkpZ7kvq MRPg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=XU8iIGar3QBec2AaEQxE3tFA1chwQxxmXuzZk4nbLLQ=; b=eMUcxh5bbpsOA3xPYuHeIuTgD/VFI17JkCDOxvFURJDOLXh/3srL3qMeTIU5oVrUO1 E+e3UMsnYfzGgOkoNcEWHbg+VC0Gek/kDqkvu8O2ODPVF+/By4SCAcjtzvFfCyeTjWkm 877mI0aNX50+qUmAxyH/RZ7L00nCBDUzmN8qmtatgVLY5ZYI/+vJrQBYB5j2CVnvxgT8 qsr8QolQD7S77udw98zMfAOkAbgxaXihoRd/bKgT5UBB+xYEJysdg5oC0+6PzlxIA9dL L9+EcIOWC/LzTlF9MSHT8oEOyj/Gb5QXyRRMNexejUh69mQ7axnRWQmg44Y2d97Fuawn 51KQ== X-Gm-Message-State: AOAM532rsJSpDpNVJKWNP9bJs9pNK3QWsxDqisg9yizgxmn247TqsoBO 48R+GxTivBMwazy3IQD954jtOtO3DrptGQ== X-Google-Smtp-Source: ABdhPJxF2KVj39RR1xz3X+7IcrCUUh7dES7bRaxhbXZViIc2QOvunhzSaLODpz7X7FGJr1yupr3j8Q== X-Received: by 2002:a05:6a00:2402:b0:4e1:3df2:5373 with SMTP id z2-20020a056a00240200b004e13df25373mr4977042pfh.40.1645139690062; Thu, 17 Feb 2022 15:14:50 -0800 (PST) Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:9013]) by smtp.gmail.com with ESMTPSA id pg1sm2745326pjb.31.2022.02.17.15.14.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 17 Feb 2022 15:14:49 -0800 (PST) From: Omar Sandoval To: linux-btrfs@vger.kernel.org Cc: kernel-team@fb.com, Filipe Manana Subject: [PATCH] btrfs: fix relocation crash due to premature return from btrfs_commit_transaction() Date: Thu, 17 Feb 2022 15:14:43 -0800 Message-Id: <4ebf450a931e83b1d305d07fcc6db104b85c2627.1645139641.git.osandov@fb.com> X-Mailer: git-send-email 2.35.1 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Omar Sandoval We are seeing crashes similar to the following trace: [ 38.968587] ------------[ cut here ]------------ [ 38.969182] WARNING: CPU: 20 PID: 2105 at fs/btrfs/relocation.c:4070 btrfs_relocate_block_group+0x2dc/0x340 [btrfs] [ 38.970984] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common [ 38.973556] CPU: 20 PID: 2105 Comm: btrfs Not tainted 5.17.0-rc4 #54 [ 38.974580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 38.976539] RIP: 0010:btrfs_relocate_block_group+0x2dc/0x340 [btrfs] [ 38.977489] Code: fe ff ff ff e9 f0 fd ff ff 0f 0b e9 f1 fe ff ff 4c 89 e7 41 bd f4 ff ff ff e8 50 0e 03 00 e9 d6 fd ff ff 0f 0b e9 45 ff ff ff <0f> 0b e9 33 ff ff ff 48 8b 45 10 48 83 ca ff 31 f6 48 8b 78 30 e8 [ 38.980336] RSP: 0000:ffffb0dd42e03c20 EFLAGS: 00010206 [ 38.981218] RAX: ffff96cfc4ede800 RBX: ffff96cfc3ce0000 RCX: 000000000002ca14 [ 38.982560] RDX: 0000000000000000 RSI: 4cfd109a0bcb5d7f RDI: ffff96cfc3ce0360 [ 38.983619] RBP: ffff96cfc309c000 R08: 0000000000000000 R09: 0000000000000000 [ 38.984678] R10: ffff96cec0000001 R11: ffffe84c80000000 R12: ffff96cfc4ede800 [ 38.985735] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96cfc3ce0360 [ 38.987146] FS: 00007f11c15218c0(0000) GS:ffff96d6dfb00000(0000) knlGS:0000000000000000 [ 38.988662] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 38.989398] CR2: 00007ffc922c8e60 CR3: 00000001147a6001 CR4: 0000000000370ee0 [ 38.990279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 38.991219] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 38.992528] Call Trace: [ 38.992854] [ 38.993148] btrfs_relocate_chunk+0x27/0xe0 [btrfs] [ 38.993941] btrfs_balance+0x78e/0xea0 [btrfs] [ 38.994801] ? vsnprintf+0x33c/0x520 [ 38.995368] ? __kmalloc_track_caller+0x351/0x440 [ 38.996198] btrfs_ioctl_balance+0x2b9/0x3a0 [btrfs] [ 38.997084] btrfs_ioctl+0x11b0/0x2da0 [btrfs] [ 38.997867] ? mod_objcg_state+0xee/0x340 [ 38.998552] ? seq_release+0x24/0x30 [ 38.999184] ? proc_nr_files+0x30/0x30 [ 38.999654] ? call_rcu+0xc8/0x2f0 [ 39.000228] ? __x64_sys_ioctl+0x84/0xc0 [ 39.000872] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 39.001973] __x64_sys_ioctl+0x84/0xc0 [ 39.002566] do_syscall_64+0x3a/0x80 [ 39.003011] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 39.003735] RIP: 0033:0x7f11c166959b [ 39.004302] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a5 a8 0c 00 f7 d8 64 89 01 48 [ 39.007324] RSP: 002b:00007fff2543e998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 39.008521] RAX: ffffffffffffffda RBX: 00007f11c1521698 RCX: 00007f11c166959b [ 39.009833] RDX: 00007fff2543ea40 RSI: 00000000c4009420 RDI: 0000000000000003 [ 39.011270] RBP: 0000000000000003 R08: 0000000000000013 R09: 00007f11c16f94e0 [ 39.012581] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff25440df3 [ 39.014046] R13: 0000000000000000 R14: 00007fff2543ea40 R15: 0000000000000001 [ 39.015040] [ 39.015418] ---[ end trace 0000000000000000 ]--- [ 43.131559] ------------[ cut here ]------------ [ 43.132234] kernel BUG at fs/btrfs/extent-tree.c:2717! [ 43.133031] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 43.133702] CPU: 1 PID: 1839 Comm: btrfs Tainted: G W 5.17.0-rc4 #54 [ 43.134863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 43.136426] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] [ 43.137255] Code: fd ff ff 4d 8d b5 08 01 00 00 4c 89 f7 e8 de f2 49 ef 66 41 83 bd 0c 01 00 00 00 74 0f 4c 89 f7 e8 2b f3 49 ef e9 ed fe ff ff <0f> 0b 49 8b 85 f8 00 00 00 4d 8b 9d f0 00 00 00 49 29 c3 49 39 db [ 43.139913] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 [ 43.140629] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 [ 43.141604] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff [ 43.142645] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 [ 43.143669] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 [ 43.144657] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 [ 43.145686] FS: 00007f7657dd68c0(0000) GS:ffff96d6df640000(0000) knlGS:0000000000000000 [ 43.146808] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 43.147584] CR2: 00007f7fe81bf5b0 CR3: 00000001093ee004 CR4: 0000000000370ee0 [ 43.148589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 43.149581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 43.150559] Call Trace: [ 43.150904] [ 43.151253] btrfs_finish_extent_commit+0x88/0x290 [btrfs] [ 43.152127] btrfs_commit_transaction+0x74f/0xaa0 [btrfs] [ 43.152932] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] [ 43.153786] btrfs_ioctl+0x1edc/0x2da0 [btrfs] [ 43.154475] ? __check_object_size+0x150/0x170 [ 43.155170] ? preempt_count_add+0x49/0xa0 [ 43.155753] ? __x64_sys_ioctl+0x84/0xc0 [ 43.156437] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 43.157456] __x64_sys_ioctl+0x84/0xc0 [ 43.157980] do_syscall_64+0x3a/0x80 [ 43.158543] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 43.159231] RIP: 0033:0x7f7657f1e59b [ 43.159653] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a5 a8 0c 00 f7 d8 64 89 01 48 [ 43.161819] RSP: 002b:00007ffda5cd1658 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 43.162702] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7657f1e59b [ 43.163526] RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000003 [ 43.164358] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 [ 43.165208] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 43.166029] R13: 00005621b91c3232 R14: 00005621b91ba580 R15: 00007ffda5cd1800 [ 43.166863] [ 43.167125] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common [ 43.169552] ---[ end trace 0000000000000000 ]--- [ 43.171226] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] [ 43.172356] Code: fd ff ff 4d 8d b5 08 01 00 00 4c 89 f7 e8 de f2 49 ef 66 41 83 bd 0c 01 00 00 00 74 0f 4c 89 f7 e8 2b f3 49 ef e9 ed fe ff ff <0f> 0b 49 8b 85 f8 00 00 00 4d 8b 9d f0 00 00 00 49 29 c3 49 39 db [ 43.174767] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 [ 43.175600] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 [ 43.176468] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff [ 43.177357] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 [ 43.178271] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 [ 43.179178] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 [ 43.180071] FS: 00007f7657dd68c0(0000) GS:ffff96d6df800000(0000) knlGS:0000000000000000 [ 43.181073] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 43.181808] CR2: 00007fe09905f010 CR3: 00000001093ee004 CR4: 0000000000370ee0 [ 43.182706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 43.183591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 We first hit the WARN_ON(rc->block_group->pinned > 0) in btrfs_relocate_block_group() and then the BUG_ON(!cache) in unpin_extent_range(). This tells us that we are exiting relocation and removing the block group with bytes still pinned for that block group. This is supposed to be impossible: the last thing relocate_block_group() does is commit the transaction to get rid of pinned extents. Commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") introduced an optimization so that commits from fsync don't have to wait for the previous commit to unpin extents. This was only intended to affect fsync, but it inadvertently made it possible for any commit to skip waiting for the previous commit to unpin. This is because if a call to btrfs_commit_transaction() finds that another thread is already committing the transaction, it waits for the other thread to complete the commit and then returns. If that other thread was in fsync, then it completes the commit without completing the previous commit. This makes the following sequence of events possible: Thread 1____________________|Thread 2 (fsync)_____________________|Thread 3 (balance)___________________ btrfs_commit_transaction(N) | | btrfs_run_delayed_refs | | pin extents | | ... | | state = UNBLOCKED |btrfs_sync_file | | btrfs_start_transaction(N + 1) |relocate_block_group | | btrfs_join_transaction(N + 1) | btrfs_commit_transaction(N + 1) | ... | trans->state = COMMIT_START | | | btrfs_commit_transaction(N + 1) | | wait_for_commit(N + 1, COMPLETED) | wait_for_commit(N, SUPER_COMMITTED)| state = SUPER_COMMITTED | ... | btrfs_finish_extent_commit| | unpin_extent_range() | trans->state = COMPLETED | | | return | | ... | |Thread 1 isn't done, so pinned > 0 | |and we WARN | | | |btrfs_remove_block_group unpin_extent_range() | | Thread 3 removed the | | block group, so we BUG| | There are other sequences involving SUPER_COMMITTED transactions that can cause a similar outcome. We could fix this by making relocation explicitly wait for unpinning, but there may be other cases that need it. Josef mentioned ENOSPC flushing and the free space cache inode as other potential victims. Rather than playing whack-a-mole, this fix is conservative and makes all commits not in fsync wait for all previous transactions, which is what the optimization intended. Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") Signed-off-by: Omar Sandoval Reviewed-by: Filipe Manana --- fs/btrfs/transaction.c | 32 +++++++++++++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index c6e550fa4d55..9f6bb22403c3 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -854,7 +854,37 @@ btrfs_attach_transaction_barrier(struct btrfs_root *root) static noinline void wait_for_commit(struct btrfs_transaction *commit, const enum btrfs_trans_state min_state) { - wait_event(commit->commit_wait, commit->state >= min_state); + struct btrfs_fs_info *fs_info = commit->fs_info; + u64 transid = commit->transid; + bool put = false; + + while (1) { + wait_event(commit->commit_wait, commit->state >= min_state); + if (put) + btrfs_put_transaction(commit); + + if (min_state < TRANS_STATE_COMPLETED) + break; + + /* + * A transaction isn't really completed until all of the + * previous transactions are completed, but with fsync we can + * end up with SUPER_COMMITTED transactions before a COMPLETED + * transaction. Wait for those. + */ + + spin_lock(&fs_info->trans_lock); + commit = list_first_entry_or_null(&fs_info->trans_list, + struct btrfs_transaction, + list); + if (!commit || commit->transid > transid) { + spin_unlock(&fs_info->trans_lock); + break; + } + refcount_inc(&commit->use_count); + put = true; + spin_unlock(&fs_info->trans_lock); + } } int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid)