Btrfs: fix race when cleaning unused block groups

We have a race while deleting unused block groups that causes extents written
by past generations/transactions to be overwritten by the current transaction
before that transaction is committed. The steps that lead to this issue:

1) At transaction N one or more block groups became unused and we added them
   to the list fs_info->unused_bgs;

2) While still at transaction N we write btree extents to block group X and the
   transaction is committed;

3) The cleaner kthread is woken up and calls btrfs_delete_unused_bgs() to go
   through the list fs_info->unused_bgs and remove unused block groups;

4) Transaction N + 1 starts;

5) At transaction N + 1, block group X becomes unused and is added to the list
   fs_info->unused_bgs - this implies delayed refs were run, so we had the
   following function calls: btrfs_run_delayed_refs() -> __btrfs_free_extent()
   -> update_block_group(). The update_block_group() function grabs the lock
   fs_info->unused_bgs_lock, adds block group X to fs_info->unused_bgs and
   releases that lock;

6) The cleaner kthread, while at btrfs_delete_unused_bgs(), sees block group X
   added by transaction N + 1 because it's doing a loop that finishes only when
   the list fs_info->unused_bgs is empty and locks and unlocks the spinlock
   fs_info->unused_bgs_lock on each iteration. So it deletes the block group
   and its respective chunk is released. Even if it didn't do the lock/unlock
   per iteration, it could still see block group X in the list, because the
   cleaner kthread might call btrfs_delete_unused_bgs() multiple times (for
   example if there are several snapshots to delete);

7) A new block group X' is created for data, and it's associated with the same
   chunk that block group X was associated with;

8) Extents from block group X' are allocated for file data and, for example, an
   fsync causes the file data to be written to disk;

9) A crash/reboot happens before transaction N + 1 is committed;

10) On the next mount, we will read extents from block group/chunk X but they no
   longer contain valid btree nodes/leaves - they contain file data instead, and
   therefore all sorts of errors will happen.

So fix this by ensuring the cleaner kthread can never delete a block group that
became unused in the current transaction, that is, only delete block groups that
were added to the unused_bgs list by past transactions.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h       | 1 +
 fs/btrfs/disk-io.c     | 1 +
 fs/btrfs/extent-tree.c | 5 +++--
 fs/btrfs/transaction.c | 5 +++++
 4 files changed, 10 insertions(+), 2 deletions(-)
