Message ID | 20181022090946.1150-1-fdmanana@kernel.org (mailing list archive)
---|---
State | New, archived
Series | Btrfs: fix deadlock on tree root leaf when finding free extent
On Mon, Oct 22, 2018 at 10:10 AM <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
>
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
>
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
>
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are updating the inode
> of a free space cache. This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
>
> Reported-by: Andrew Nelson <andrew.s.nelson@gmail.com>
> Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> Tested-by: Andrew Nelson <andrew.s.nelson@gmail.com>
> ---
>  fs/btrfs/ctree.h       |  3 +++
>  fs/btrfs/disk-io.c     |  2 ++
>  fs/btrfs/extent-tree.c | 22 +++++++++++++++++++++-
>  3 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2cddfe7806a4..d23ee26eb17d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1121,6 +1121,9 @@ struct btrfs_fs_info {
>  	u32 sectorsize;
>  	u32 stripesize;
>
> +	/* The task currently updating a free space cache inode item. */
> +	struct task_struct *space_cache_updater;
> +
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	spinlock_t ref_verify_lock;
>  	struct rb_root block_tree;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 05dc3c17cb62..aa5e9a91e560 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2782,6 +2782,8 @@ int open_ctree(struct super_block *sb,
>  	fs_info->sectorsize = 4096;
>  	fs_info->stripesize = 4096;
>
> +	fs_info->space_cache_updater = NULL;
> +
>  	ret = btrfs_alloc_stripe_hash_table(fs_info);
>  	if (ret) {
>  		err = ret;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 577878324799..e93040449771 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3364,7 +3364,9 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
>  	 * time.
>  	 */
>  	BTRFS_I(inode)->generation = 0;
> +	fs_info->space_cache_updater = current;
>  	ret = btrfs_update_inode(trans, root, inode);
> +	fs_info->space_cache_updater = NULL;
>  	if (ret) {
>  		/*
>  		 * So theoretically we could recover from this, simply set the
> @@ -7366,7 +7368,25 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>
>  have_block_group:
>  		cached = block_group_cache_done(block_group);
> -		if (unlikely(!cached)) {
> +		/*
> +		 * If we are updating the inode of a free space cache, we can
> +		 * not start the caching of any block group because we could
> +		 * deadlock on an extent buffer of the root tree.
> +		 * At cache_save_setup() we update the inode item of a free
> +		 * space cache, so we may need to COW a leaf of the root tree,
> +		 * which implies finding a free metadata extent. So if when
> +		 * searching for such an extent we find a block group that was
> +		 * not yet cached (which is unlikely), we can not start loading
> +		 * or building its free space cache because that implies reading
> +		 * its inode from disk (load_free_space_cache()) which implies
> +		 * searching the root tree for its inode item, which can be
> +		 * located in the same leaf that we previously locked at
> +		 * cache_save_setup() for updating the inode item of the former
> +		 * free space cache, therefore leading to an attempt to lock the
> +		 * same leaf twice.
> +		 */
> +		if (unlikely(!cached) &&
> +		    fs_info->space_cache_updater != current) {
>  			have_caching_bg = true;
>  			ret = cache_block_group(block_group, 0);
>  			BUG_ON(ret < 0);
> --
> 2.11.0
>
On Mon, Oct 22, 2018 at 10:09:46AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
>
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
>
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
>
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are updating the inode
> of a free space cache. This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
>

Actually isn't this a problem for anything that tries to allocate an extent
while in the tree_root?  Like we snapshot or make a subvolume or anything?  We
should just disallow if root == tree_root.  But even then we only need to do
this if we're using SPACE_CACHE, using the ye-olde caching or the free space
tree are both ok.  Let's just limit it to those cases.  Thanks,

Josef
On Mon, Oct 22, 2018 at 7:07 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Mon, Oct 22, 2018 at 10:09:46AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When we are writing out a free space cache, during the transaction commit
> > phase, we can end up in a deadlock which results in a stack trace like the
> > following:
> >
> >  schedule+0x28/0x80
> >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> >  ? inode_insert5+0x119/0x190
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_iget+0x113/0x690 [btrfs]
> >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> >  load_free_space_cache+0x7c/0x170 [btrfs]
> >  ? cache_block_group+0x72/0x3b0 [btrfs]
> >  cache_block_group+0x1b3/0x3b0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  find_free_extent+0x799/0x1010 [btrfs]
> >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> >  btrfs_cow_block+0xdc/0x180 [btrfs]
> >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> >  cache_save_setup+0xe4/0x3a0 [btrfs]
> >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> >
> > At cache_save_setup() we need to update the inode item of a block group's
> > cache which is located in the tree root (fs_info->tree_root), which means
> > that it may result in COWing a leaf from that tree. If that happens we
> > need to find a free metadata extent and while looking for one, if we find
> > a block group which was not cached yet we attempt to load its cache by
> > calling cache_block_group(). However this function will try to load the
> > inode of the free space cache, which requires finding the matching inode
> > item in the tree root - if that inode item is located in the same leaf as
> > the inode item of the space cache we are updating at cache_save_setup(),
> > we end up in a deadlock, since we try to obtain a read lock on the same
> > extent buffer that we previously write locked.
> >
> > So fix this by skipping the loading of free space caches of any block
> > groups that are not yet cached (rare cases) if we are updating the inode
> > of a free space cache. This is a rare case and its downside is failure to
> > find a free extent (return -ENOSPC) when all the already cached block
> > groups have no free extents.
> >
>
> Actually isn't this a problem for anything that tries to allocate an extent
> while in the tree_root?  Like we snapshot or make a subvolume or anything?

Indeed. Initially I considered making it more generic (like the recent fix
for deadlock when cowing from extent/chunk/device tree) but I totally forgot
about the other cases like you mentioned.

> We should just disallow if root == tree_root.  But even then we only need to
> do this if we're using SPACE_CACHE, using the ye-olde caching or the free
> space tree are both ok.  Let's just limit it to those cases.  Thanks,

Yep, makes all sense. Thanks! V2 sent out.

>
> Josef
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..d23ee26eb17d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1121,6 +1121,9 @@ struct btrfs_fs_info {
 	u32 sectorsize;
 	u32 stripesize;
 
+	/* The task currently updating a free space cache inode item. */
+	struct task_struct *space_cache_updater;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 05dc3c17cb62..aa5e9a91e560 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2782,6 +2782,8 @@ int open_ctree(struct super_block *sb,
 	fs_info->sectorsize = 4096;
 	fs_info->stripesize = 4096;
 
+	fs_info->space_cache_updater = NULL;
+
 	ret = btrfs_alloc_stripe_hash_table(fs_info);
 	if (ret) {
 		err = ret;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 577878324799..e93040449771 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3364,7 +3364,9 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
 	 * time.
 	 */
 	BTRFS_I(inode)->generation = 0;
+	fs_info->space_cache_updater = current;
 	ret = btrfs_update_inode(trans, root, inode);
+	fs_info->space_cache_updater = NULL;
 	if (ret) {
 		/*
 		 * So theoretically we could recover from this, simply set the
@@ -7366,7 +7368,25 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 
 have_block_group:
 		cached = block_group_cache_done(block_group);
-		if (unlikely(!cached)) {
+		/*
+		 * If we are updating the inode of a free space cache, we can
+		 * not start the caching of any block group because we could
+		 * deadlock on an extent buffer of the root tree.
+		 * At cache_save_setup() we update the inode item of a free
+		 * space cache, so we may need to COW a leaf of the root tree,
+		 * which implies finding a free metadata extent. So if when
+		 * searching for such an extent we find a block group that was
+		 * not yet cached (which is unlikely), we can not start loading
+		 * or building its free space cache because that implies reading
+		 * its inode from disk (load_free_space_cache()) which implies
+		 * searching the root tree for its inode item, which can be
+		 * located in the same leaf that we previously locked at
+		 * cache_save_setup() for updating the inode item of the former
+		 * free space cache, therefore leading to an attempt to lock the
+		 * same leaf twice.
+		 */
+		if (unlikely(!cached) &&
+		    fs_info->space_cache_updater != current) {
 			have_caching_bg = true;
 			ret = cache_block_group(block_group, 0);
 			BUG_ON(ret < 0);