Message ID | 1446452986-6581-1-git-send-email-chandan@linux.vnet.ibm.com (mailing list archive)
---|---
State | Accepted
On 11/02/2015 03:29 AM, Chandan Rajendra wrote:
> When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
> and nodesize set to 64k), the following call trace is observed,
>
> [...]
>
> The reason for the above call trace is,
>
> btrfs_remove_chunk
>   check_system_chunk
>     Allocate chunk if required
>   For each physical stripe on underlying device,
>     btrfs_free_dev_extent
>       ...
>       Take lock on Device tree's root node
>       btrfs_cow_block("dev tree's root node");
>         btrfs_reserve_extent
>           find_free_extent
>             index = BTRFS_RAID_DUP;
>             have_caching_bg = false;
>
>             When in LOOP_CACHING_NOWAIT state, assume we find a block group
>             which is being cached; hence have_caching_bg is set to true.
>
>             When repeating the search for the next RAID index, we set
>             have_caching_bg to false.
>
> Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly
> skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we
> allocate a chunk and try to add entries corresponding to the chunk's physical
> stripe into the device tree. When doing so the task deadlocks itself waiting
> for the blocking lock on the root node of the device tree.
>
> This commit fixes the issue by introducing a new local variable to help
> indicate as to whether a block group of any RAID type is being cached.
>
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>

Ah good catch, thanks Chandan,

Reviewed-by: Josef Bacik <jbacik@fb.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 02, 2015 at 01:59:46PM +0530, Chandan Rajendra wrote:
> When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
> and nodesize set to 64k), the following call trace is observed,

Thanks Chandan, I hit this same trace on x86-64 with 16K nodes.

-chris
[Resending in plain text, apologies.]

Hi Chandan, Josef, Chris,

I am not sure I understand the fix to the problem.

It may happen that when updating the device tree, we need to allocate a new
chunk via do_chunk_alloc() (while we are holding the device tree root node
locked). This is a legitimate thing for find_free_extent() to do. And the
do_chunk_alloc() call may lead to a call to
btrfs_create_pending_block_groups(), which will try to update the device
tree. This may happen due to the direct call to
btrfs_create_pending_block_groups() that exists in do_chunk_alloc(), or
perhaps via the __btrfs_end_transaction() that find_free_extent() does after
it has completed chunk allocation (although in this case it will use the
transaction that already exists in current->journal_info).

So the deadlock may still happen?

Thanks,
Alex.

On Mon, Nov 2, 2015 at 6:52 PM, Chris Mason <clm@fb.com> wrote:
> On Mon, Nov 02, 2015 at 01:59:46PM +0530, Chandan Rajendra wrote:
> > When executing generic/001 in a loop on a ppc64 machine (with both
> > sectorsize and nodesize set to 64k), the following call trace is
> > observed,
>
> Thanks Chandan, I hit this same trace on x86-64 with 16K nodes.
>
> -chris
On Sunday 13 Dec 2015 12:18:55 Alex Lyakas wrote:
> [Resending in plain text, apologies.]
>
> Hi Chandan, Josef, Chris,
>
> I am not sure I understand the fix to the problem.
>
> It may happen that when updating the device tree, we need to allocate a new
> chunk via do_chunk_alloc (while we are holding the device tree root node
> locked). This is a legitimate thing for find_free_extent() to do. And
> do_chunk_alloc() call may lead to call to
> btrfs_create_pending_block_groups(), which will try to update the device
> tree. This may happen due to direct call to
> btrfs_create_pending_block_groups() that exists in do_chunk_alloc(), or
> perhaps by __btrfs_end_transaction() that find_free_extent() does after it
> completed chunk allocation (although in this case it will use the
> transaction that already exists in current->journal_info).
>
> So the deadlock still may happen?

Hello Alex,

The "global block reservation" (see btrfs_fs_info->global_block_rsv) aims to
solve this problem. I don't claim to have understood the behaviour of
global_block_rsv completely. However, the global block reservation makes sure
that we have enough free space reserved (see update_global_block_rsv()) for
future operations on,

- Extent tree
- Checksum tree
- Device tree
- Tree root tree and
- Quota tree.

Tasks changing the device tree should get their space requirements satisfied
from the global block reservation. Hence such changes to the device tree
should not end up forcing find_free_extent() to allocate a new chunk.
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f50c7c2..99a8e57 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7029,6 +7029,7 @@ static noinline int find_free_extent(struct btrfs_root *orig_root,
 	bool failed_alloc = false;
 	bool use_cluster = true;
 	bool have_caching_bg = false;
+	bool orig_have_caching_bg = false;
 	bool full_search = false;
 
 	WARN_ON(num_bytes < root->sectorsize);
@@ -7378,6 +7379,10 @@ loop:
 		}
 		up_read(&space_info->groups_sem);
 
+		if ((loop == LOOP_CACHING_NOWAIT) && have_caching_bg
+		    && !orig_have_caching_bg)
+			orig_have_caching_bg = true;
+
 		if (!ins->objectid && loop >= LOOP_CACHING_WAIT && have_caching_bg)
 			goto search;
 
@@ -7400,7 +7405,7 @@ loop:
 			 * don't have any unached bgs and we've alrelady done a
 			 * full search through.
 			 */
-			if (have_caching_bg || !full_search)
+			if (orig_have_caching_bg || !full_search)
 				loop = LOOP_CACHING_WAIT;
 			else
 				loop = LOOP_ALLOC_CHUNK;
When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
and nodesize set to 64k), the following call trace is observed,

WARNING: at /root/repos/linux/fs/btrfs/locking.c:253
Modules linked in:
CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d #54
task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000
NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000
REGS: c0000000f600a820 TRAP: 0700 Not tainted (4.3.0-rc5-13676-ga5e681d)
MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI> CR: 24444884 XER: 00000000
CFAR: c0000000004a3b30 SOFTE: 1
GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0
GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005
GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030
GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800
GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049
GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000
GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000
GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0
NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0
LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
Call Trace:
[c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable)
[c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
[c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00
[c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120
[c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620
[c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0
[c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420
[c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030
[c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250
[c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590
[c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780
[c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250
[c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00
[c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0
[c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570
[c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0
[c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30
[c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0
[c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30
[c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0
[c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0
[c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0
[c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100
[c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80
[c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68
Instruction dump:
fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040
812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c

The above call trace is seen even on x86_64; albeit very rarely and that too
with nodesize set to 64k and with nospace_cache mount option being used.

The reason for the above call trace is,

btrfs_remove_chunk
  check_system_chunk
    Allocate chunk if required
  For each physical stripe on underlying device,
    btrfs_free_dev_extent
      ...
      Take lock on Device tree's root node
      btrfs_cow_block("dev tree's root node");
        btrfs_reserve_extent
          find_free_extent
            index = BTRFS_RAID_DUP;
            have_caching_bg = false;

            When in LOOP_CACHING_NOWAIT state, assume we find a block group
            which is being cached; hence have_caching_bg is set to true.

            When repeating the search for the next RAID index, we set
            have_caching_bg to false.

Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly
skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we
allocate a chunk and try to add entries corresponding to the chunk's physical
stripe into the device tree. When doing so the task deadlocks itself waiting
for the blocking lock on the root node of the device tree.

This commit fixes the issue by introducing a new local variable to help
indicate as to whether a block group of any RAID type is being cached.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
Changelog:
v1->v2: Honor 80 column restriction.

 fs/btrfs/extent-tree.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)