Message ID | 20170308022552.14686-1-quwenruo@cn.fujitsu.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Mar 8, 2017 at 2:25 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote: > [BUG] > When btrfs_reloc_clone_csum() reports error, it can underflow metadata > and leads to kernel assertion on outstanding extents in > run_delalloc_nocow() and cow_file_range(). > > BTRFS info (device vdb5): relocating block group 12582912 flags data > BTRFS info (device vdb5): found 1 extents > assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 > > Currently, due to another bug blocking ordered extents, the bug is only > reproducible under certain block group layout and using error injection. > > a) Create one data block group with one 4K extent in it. > To avoid the bug that hangs btrfs due to ordered extent which never > finishes > b) Make btrfs_reloc_clone_csum() always fail > c) Relocate that block group > > [CAUSE] > run_delalloc_nocow() and cow_file_range() handles error from > btrfs_reloc_clone_csum() wrongly: > > (The ascii chart shows a more generic case of this bug other than the > bug mentioned above) > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<----------- cleanup range --------------->| > |<----------- ----------->| > \/ > btrfs_finish_ordered_io() range > > So error handler, which calls extent_clear_unlock_delalloc() with > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() > will both cover OE n, and free its metadata, causing metadata under flow. > > [Fix] > The fix is to ensure after calling btrfs_add_ordered_extent(), we only > call error handler after increasing the iteration offset, so that > cleanup range won't cover any created ordered extent. > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<----------- ----------->|<---------- cleanup range --------->| > \/ > btrfs_finish_ordered_io() range > > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Thanks > --- > v6: > New, split from v5 patch, as this is a separate bug. > v7: > Fix a wrong operator pointed out by Liu Bo. > Test case will follow soon. > --- > fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ > 1 file changed, 39 insertions(+), 12 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index c40060cc481f..fe588bf30d02 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, > BTRFS_DATA_RELOC_TREE_OBJECTID) { > ret = btrfs_reloc_clone_csums(inode, start, > cur_alloc_size); > + /* > + * Only drop cache here, and process as normal. > + * > + * We must not allow extent_clear_unlock_delalloc() > + * at out_unlock label to free meta of this ordered > + * extent, as its meta should be freed by > + * btrfs_finish_ordered_io(). > + * > + * So we must continue until @start is increased to > + * skip current ordered extent. > + */ > if (ret) > - goto out_drop_extent_cache; > + btrfs_drop_extent_cache(BTRFS_I(inode), start, > + start + ram_size - 1, 0); > } > > btrfs_dec_block_group_reservations(fs_info, ins.objectid); > > - if (disk_num_bytes < cur_alloc_size) > - break; > - > /* we're not doing compressed IO, don't unlock the first > * page (which the caller expects to stay locked), don't > * clear any dirty bits and don't set any writeback bits > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, > delalloc_end, locked_page, > EXTENT_LOCKED | EXTENT_DELALLOC, > op); > - disk_num_bytes -= cur_alloc_size; > + if (disk_num_bytes < cur_alloc_size) > + disk_num_bytes = 0; > + else > + disk_num_bytes -= cur_alloc_size; > num_bytes -= cur_alloc_size; > alloc_hint = ins.objectid + ins.offset; > start += cur_alloc_size; > + > + /* > + * btrfs_reloc_clone_csums() error, since start is increased > + * extent_clear_unlock_delalloc() at out_unlock label won't > + * free metadata of current ordered extent, we're OK to exit. > + */ > + if (ret) > + goto out_unlock; > } > out: > return ret; > @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, > BUG_ON(ret); /* -ENOMEM */ > > if (root->root_key.objectid == > - BTRFS_DATA_RELOC_TREE_OBJECTID) { > + BTRFS_DATA_RELOC_TREE_OBJECTID) > + /* > + * Error handled later, as we must prevent > + * extent_clear_unlock_delalloc() in error handler > + * from freeing metadata of created ordered extent. > + */ > ret = btrfs_reloc_clone_csums(inode, cur_offset, > num_bytes); > - if (ret) { > - if (!nolock && nocow) > - btrfs_end_write_no_snapshoting(root); > - goto error; > - } > - } > > extent_clear_unlock_delalloc(inode, cur_offset, > cur_offset + num_bytes - 1, end, > @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, > if (!nolock && nocow) > btrfs_end_write_no_snapshoting(root); > cur_offset = extent_end; > + > + /* > + * btrfs_reloc_clone_csums() error, now we're OK to call error > + * handler, as metadata for created ordered extent will only > + * be freed by btrfs_finish_ordered_io(). > + */ > + if (ret) > + goto error; > if (cur_offset > end) > break; > } > -- > 2.12.0 > > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 08, 2017 at 10:25:51AM +0800, Qu Wenruo wrote: > [BUG] > When btrfs_reloc_clone_csum() reports error, it can underflow metadata > and leads to kernel assertion on outstanding extents in > run_delalloc_nocow() and cow_file_range(). > > BTRFS info (device vdb5): relocating block group 12582912 flags data > BTRFS info (device vdb5): found 1 extents > assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 > > Currently, due to another bug blocking ordered extents, the bug is only > reproducible under certain block group layout and using error injection. > > a) Create one data block group with one 4K extent in it. > To avoid the bug that hangs btrfs due to ordered extent which never > finishes > b) Make btrfs_reloc_clone_csum() always fail > c) Relocate that block group > > [CAUSE] > run_delalloc_nocow() and cow_file_range() handles error from > btrfs_reloc_clone_csum() wrongly: > > (The ascii chart shows a more generic case of this bug other than the > bug mentioned above) > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<----------- cleanup range --------------->| > |<----------- ----------->| > \/ > btrfs_finish_ordered_io() range > > So error handler, which calls extent_clear_unlock_delalloc() with > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() > will both cover OE n, and free its metadata, causing metadata under flow. > > [Fix] > The fix is to ensure after calling btrfs_add_ordered_extent(), we only > call error handler after increasing the iteration offset, so that > cleanup range won't cover any created ordered extent. > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<----------- ----------->|<---------- cleanup range --------->| > \/ > btrfs_finish_ordered_io() range Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Thanks, -liubo > > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> > --- > v6: > New, split from v5 patch, as this is a separate bug. > v7: > Fix a wrong operator pointed out by Liu Bo. > Test case will follow soon. > --- > fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ > 1 file changed, 39 insertions(+), 12 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index c40060cc481f..fe588bf30d02 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, > BTRFS_DATA_RELOC_TREE_OBJECTID) { > ret = btrfs_reloc_clone_csums(inode, start, > cur_alloc_size); > + /* > + * Only drop cache here, and process as normal. > + * > + * We must not allow extent_clear_unlock_delalloc() > + * at out_unlock label to free meta of this ordered > + * extent, as its meta should be freed by > + * btrfs_finish_ordered_io(). > + * > + * So we must continue until @start is increased to > + * skip current ordered extent. > + */ > if (ret) > - goto out_drop_extent_cache; > + btrfs_drop_extent_cache(BTRFS_I(inode), start, > + start + ram_size - 1, 0); > } > > btrfs_dec_block_group_reservations(fs_info, ins.objectid); > > - if (disk_num_bytes < cur_alloc_size) > - break; > - > /* we're not doing compressed IO, don't unlock the first > * page (which the caller expects to stay locked), don't > * clear any dirty bits and don't set any writeback bits > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, > delalloc_end, locked_page, > EXTENT_LOCKED | EXTENT_DELALLOC, > op); > - disk_num_bytes -= cur_alloc_size; > + if (disk_num_bytes < cur_alloc_size) > + disk_num_bytes = 0; > + else > + disk_num_bytes -= cur_alloc_size; > num_bytes -= cur_alloc_size; > alloc_hint = ins.objectid + ins.offset; > start += cur_alloc_size; > + > + /* > + * btrfs_reloc_clone_csums() error, since start is increased > + * extent_clear_unlock_delalloc() at out_unlock label won't > + * free metadata of current ordered extent, we're OK to exit. > + */ > + if (ret) > + goto out_unlock; > } > out: > return ret; > @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, > BUG_ON(ret); /* -ENOMEM */ > > if (root->root_key.objectid == > - BTRFS_DATA_RELOC_TREE_OBJECTID) { > + BTRFS_DATA_RELOC_TREE_OBJECTID) > + /* > + * Error handled later, as we must prevent > + * extent_clear_unlock_delalloc() in error handler > + * from freeing metadata of created ordered extent. > + */ > ret = btrfs_reloc_clone_csums(inode, cur_offset, > num_bytes); > - if (ret) { > - if (!nolock && nocow) > - btrfs_end_write_no_snapshoting(root); > - goto error; > - } > - } > > extent_clear_unlock_delalloc(inode, cur_offset, > cur_offset + num_bytes - 1, end, > @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, > if (!nolock && nocow) > btrfs_end_write_no_snapshoting(root); > cur_offset = extent_end; > + > + /* > + * btrfs_reloc_clone_csums() error, now we're OK to call error > + * handler, as metadata for created ordered extent will only > + * be freed by btrfs_finish_ordered_io(). > + */ > + if (ret) > + goto error; > if (cur_offset > end) > break; > } > -- > 2.12.0 > > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Qu, while V5 was running fine against the openSUSE-42.2 kernel (based on v4.4). V7 results in OOPS to me: BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0 IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs] PGD 14e18d4067 PUD 14e1868067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4 xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci i2c_core usb_common ata_piix floppy CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140722_172050-sagunt 04/01/2014 task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000 RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs] RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086 RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001 RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0 RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0 R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80 FS: 0000000000000000(0000) GS:ffff8814eae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack: 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0 ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38 ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001 Call Trace: [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs] [<ffffffffb438f2f7>] bio_endio+0x57/0x60 [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs] [<ffffffffb438f2f7>] bio_endio+0x57/0x60 [<ffffffffb439763b>] blk_update_request+0x8b/0x330 [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70 [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk] [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0 [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20 [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk] [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90 [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0 [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60 [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60 [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150 [<ffffffffb4007cad>] handle_irq+0x1d/0x30 [<ffffffffb400750b>] do_IRQ+0x4b/0xd0 [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b Leftover inexact backtrace: 2017-03-12 20:33:08 <IRQ><EOI> 2017-03-12 20:33:08 [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10 [<ffffffffb400fa3e>] default_idle+0x1e/0xe0 [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20 [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40 [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370 [<ffffffffb46a358c>] rest_init+0x7c/0x80 [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120 [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84 RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs] RSP <ffff8814eae03cd8> CR2: 00000000000001f0 ---[ end trace 7529a0652fd7873e ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote: > Hi Qu, > > while V5 was running fine against the openSUSE-42.2 kernel (based on v4.4). Thanks for the test. > > V7 results in OOPS to me: > BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0 This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite nice clue. > IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs] IP points to: --- static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode) { struct btrfs_root *root = inode->root; << Either here if (root == root->fs_info->tree_root && << Or here btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID) --- Taking the above offset into consideration, it's only possible for later case. So here, we have a btrfs_inode whose @root is NULL. This can be fixed easily by checking @root inside btrfs_is_free_space_inode(), as the backtrace shows that it's only happening for DirectIO, and it won't happen for free space cache inode. But I'm more curious how this happened for a more accurate fix, or we could have other NULL pointer access. Did you have any reproducer for this? Thanks, Qu > PGD 14e18d4067 PUD 14e1868067 PMD 0 > Oops: 0000 [#1] SMP > Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4 > xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set > nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic > virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci > i2c_core usb_common ata_piix floppy > CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 1.7.5-20140722_172050-sagunt 04/01/2014 > task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000 > RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>] > __endio_write_update_ordered+0x33/0x140 [btrfs] > RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086 > RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001 > RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0 > RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a > R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0 > R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80 > FS: 0000000000000000(0000) GS:ffff8814eae00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack: > 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0 > ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38 > ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001 > Call Trace: > [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs] > [<ffffffffb438f2f7>] bio_endio+0x57/0x60 > [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs] > [<ffffffffb438f2f7>] bio_endio+0x57/0x60 > [<ffffffffb439763b>] blk_update_request+0x8b/0x330 > [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70 > [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk] > [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0 > [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20 > [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk] > [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90 > [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0 > [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60 > [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60 > [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150 > [<ffffffffb4007cad>] handle_irq+0x1d/0x30 > [<ffffffffb400750b>] do_IRQ+0x4b/0xd0 > [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c > DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b > Leftover inexact backtrace: > 2017-03-12 20:33:08 <IRQ><EOI> > 2017-03-12 20:33:08 [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10 > [<ffffffffb400fa3e>] default_idle+0x1e/0xe0 > [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20 > [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40 > [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370 > [<ffffffffb46a358c>] rest_init+0x7c/0x80 > [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d > [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120 > [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c > [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a > Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc > ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b > b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84 > RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs] > RSP <ffff8814eae03cd8> > CR2: 00000000000001f0 > ---[ end trace 7529a0652fd7873e ]--- > Kernel panic - not syncing: Fatal exception in interrupt > Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range: > 0xffffffff80000000-0xffffffffbfffffff) > > Greets, > Stefan > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Qu, Am 13.03.2017 um 02:16 schrieb Qu Wenruo: > > At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote: >> Hi Qu, >> >> while V5 was running fine against the openSUSE-42.2 kernel (based on >> v4.4). > > Thanks for the test. > >> V7 results in OOPS to me: >> BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0 > > This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite > nice clue. > >> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs] > > IP points to: > --- > static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode) > { > struct btrfs_root *root = inode->root; << Either here > > if (root == root->fs_info->tree_root && << Or here > btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID) > > --- > > Taking the above offset into consideration, it's only possible for later > case. > > So here, we have a btrfs_inode whose @root is NULL. But wasn't this part of the code identical in V5? Why does it only happen with V7? > This can be fixed easily by checking @root inside > btrfs_is_free_space_inode(), as the backtrace shows that it's only > happening for DirectIO, and it won't happen for free space cache inode. > > But I'm more curious how this happened for a more accurate fix, or we > could have other NULL pointer access. > > Did you have any reproducer for this? Sorry no - this is a production MariaDB Server running btrfs with compress-force=zlib. But if i could test anything i'll do. Greets, Stefan > > Thanks, > Qu > >> PGD 14e18d4067 PUD 14e1868067 PMD 0 >> Oops: 0000 [#1] SMP >> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4 >> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set >> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic >> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci >> i2c_core usb_common ata_piix floppy >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> 1.7.5-20140722_172050-sagunt 04/01/2014 >> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000 >> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>] >> __endio_write_update_ordered+0x33/0x140 [btrfs] >> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086 >> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001 >> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0 >> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a >> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0 >> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80 >> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000) >> knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack: >> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0 >> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38 >> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001 >> Call Trace: >> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs] >> [<ffffffffb438f2f7>] bio_endio+0x57/0x60 >> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs] >> [<ffffffffb438f2f7>] bio_endio+0x57/0x60 >> [<ffffffffb439763b>] blk_update_request+0x8b/0x330 >> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70 >> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk] >> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0 >> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20 >> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk] >> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90 >> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0 >> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60 >> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60 >> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150 >> [<ffffffffb4007cad>] handle_irq+0x1d/0x30 >> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0 >> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c >> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b >> Leftover inexact backtrace: >> 2017-03-12 20:33:08 <IRQ><EOI> >> 2017-03-12 20:33:08 [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10 >> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0 >> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20 >> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40 >> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370 >> [<ffffffffb46a358c>] rest_init+0x7c/0x80 >> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d >> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120 >> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c >> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a >> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc >> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b >> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84 >> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs] >> RSP <ffff8814eae03cd8> >> CR2: 00000000000001f0 >> ---[ end trace 7529a0652fd7873e ]--- >> Kernel panic - not syncing: Fatal exception in interrupt >> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range: >> 0xffffffff80000000-0xffffffffbfffffff) >> >> Greets, >> Stefan >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote: > Hi Qu, > > Am 13.03.2017 um 02:16 schrieb Qu Wenruo: >> >> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote: >>> Hi Qu, >>> >>> while V5 was running fine against the openSUSE-42.2 kernel (based on >>> v4.4). >> >> Thanks for the test. >> >>> V7 results in OOPS to me: >>> BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0 >> >> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite >> nice clue. >> >>> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs] >> >> IP points to: >> --- >> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode) >> { >> struct btrfs_root *root = inode->root; << Either here >> >> if (root == root->fs_info->tree_root && << Or here >> btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID) >> >> --- >> >> Taking the above offset into consideration, it's only possible for later >> case. >> >> So here, we have a btrfs_inode whose @root is NULL. > > But wasn't this part of the code identical in V5? Why does it only > happen with V7? There are still difference, but just as you said, the related part(checking if inode is free space cache inode) is identical across v5 and v7. I'm afraid that's a rare race leading to NULL btrfs_inode->root, which could happen in both v5 and v7. What's the difference between SUSE and mainline kernel? Maybe some mainline kernel commits have already fixed it? Thanks, Qu > >> This can be fixed easily by checking @root inside >> btrfs_is_free_space_inode(), as the backtrace shows that it's only >> happening for DirectIO, and it won't happen for free space cache inode. >> >> But I'm more curious how this happened for a more accurate fix, or we >> could have other NULL pointer access. >> >> Did you have any reproducer for this? > > Sorry no - this is a production MariaDB Server running btrfs with > compress-force=zlib. But if i could test anything i'll do. > > Greets, > Stefan > >> >> Thanks, >> Qu >> >>> PGD 14e18d4067 PUD 14e1868067 PMD 0 >>> Oops: 0000 [#1] SMP >>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4 >>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set >>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic >>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci >>> i2c_core usb_common ata_piix floppy >>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1 >>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >>> 1.7.5-20140722_172050-sagunt 04/01/2014 >>> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000 >>> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>] >>> __endio_write_update_ordered+0x33/0x140 [btrfs] >>> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086 >>> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001 >>> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0 >>> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a >>> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0 >>> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80 >>> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000) >>> knlGS:0000000000000000 >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack: >>> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0 >>> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38 >>> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001 >>> Call Trace: >>> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs] >>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60 >>> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs] >>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60 >>> [<ffffffffb439763b>] blk_update_request+0x8b/0x330 >>> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70 >>> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk] >>> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0 >>> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20 >>> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk] >>> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90 >>> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0 >>> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60 >>> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60 >>> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150 >>> [<ffffffffb4007cad>] handle_irq+0x1d/0x30 >>> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0 >>> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c >>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b >>> Leftover inexact backtrace: >>> 2017-03-12 20:33:08 <IRQ><EOI> >>> 2017-03-12 20:33:08 [<ffffffffb404ba46>] ? native_safe_halt+0x6/0x10 >>> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0 >>> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20 >>> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40 >>> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370 >>> [<ffffffffb46a358c>] rest_init+0x7c/0x80 >>> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d >>> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120 >>> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c >>> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a >>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc >>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b >>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84 >>> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 [btrfs] >>> RSP <ffff8814eae03cd8> >>> CR2: 00000000000001f0 >>> ---[ end trace 7529a0652fd7873e ]--- >>> Kernel panic - not syncing: Fatal exception in interrupt >>> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range: >>> 0xffffffff80000000-0xffffffffbfffffff) >>> >>> Greets, >>> Stefan >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> >> >> > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am 13.03.2017 um 08:39 schrieb Qu Wenruo: > > > At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote: >> Hi Qu, >> >> Am 13.03.2017 um 02:16 schrieb Qu Wenruo: >>> >>> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote: >>>> Hi Qu, >>>> >>>> while V5 was running fine against the openSUSE-42.2 kernel (based on >>>> v4.4). >>> >>> Thanks for the test. >>> >>>> V7 results in OOPS to me: >>>> BUG: unable to handle kernel NULL pointer dereference at >>>> 00000000000001f0 >>> >>> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite >>> nice clue. >>> >>>> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 >>>> [btrfs] >>> >>> IP points to: >>> --- >>> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode) >>> { >>> struct btrfs_root *root = inode->root; << Either here >>> >>> if (root == root->fs_info->tree_root && << Or here >>> btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID) >>> >>> --- >>> >>> Taking the above offset into consideration, it's only possible for later >>> case. >>> >>> So here, we have a btrfs_inode whose @root is NULL. >> >> But wasn't this part of the code identical in V5? Why does it only >> happen with V7? > > There are still difference, but just as you said, the related > part(checking if inode is free space cache inode) is identical across v5 > and v7. But if i boot v7 it always happens. If i boot v5 it always works. Have done 5 repeatet tests. > I'm afraid that's a rare race leading to NULL btrfs_inode->root, which > could happen in both v5 and v7. > > What's the difference between SUSE and mainline kernel? A lot ;-) But i don't think anything related. > Maybe some mainline kernel commits have already fixed it? May be no idea. But i haven't found any reason why v5 works. Stefan > > Thanks, > Qu >> >>> This can be fixed easily by checking @root inside >>> btrfs_is_free_space_inode(), as the backtrace shows that it's only >>> happening for DirectIO, and it won't happen for free space cache inode. >>> >>> But I'm more curious how this happened for a more accurate fix, or we >>> could have other NULL pointer access. >>> >>> Did you have any reproducer for this? >> >> Sorry no - this is a production MariaDB Server running btrfs with >> compress-force=zlib. But if i could test anything i'll do. >> >> Greets, >> Stefan >> >>> >>> Thanks, >>> Qu >>> >>>> PGD 14e18d4067 PUD 14e1868067 PMD 0 >>>> Oops: 0000 [#1] SMP >>>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4 >>>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set >>>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq >>>> ata_generic >>>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci >>>> i2c_core usb_common ata_piix floppy >>>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1 >>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >>>> 1.7.5-20140722_172050-sagunt 04/01/2014 >>>> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000 >>>> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>] >>>> __endio_write_update_ordered+0x33/0x140 [btrfs] >>>> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086 >>>> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001 >>>> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0 >>>> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a >>>> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0 >>>> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80 >>>> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000) >>>> knlGS:0000000000000000 >>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack: >>>> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0 >>>> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38 >>>> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001 >>>> Call Trace: >>>> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs] >>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60 >>>> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs] >>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60 >>>> [<ffffffffb439763b>] blk_update_request+0x8b/0x330 >>>> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70 >>>> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk] >>>> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0 >>>> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20 >>>> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk] >>>> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90 >>>> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0 >>>> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60 >>>> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60 >>>> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150 >>>> [<ffffffffb4007cad>] handle_irq+0x1d/0x30 >>>> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0 >>>> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c >>>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b >>>> Leftover inexact backtrace: >>>> 2017-03-12 20:33:08 <IRQ><EOI> >>>> 2017-03-12 20:33:08 [<ffffffffb404ba46>] ? >>>> native_safe_halt+0x6/0x10 >>>> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0 >>>> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20 >>>> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40 >>>> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370 >>>> [<ffffffffb46a358c>] rest_init+0x7c/0x80 >>>> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d >>>> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120 >>>> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c >>>> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a >>>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc >>>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b >>>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84 >>>> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 >>>> [btrfs] >>>> RSP <ffff8814eae03cd8> >>>> CR2: 00000000000001f0 >>>> ---[ end trace 7529a0652fd7873e ]--- >>>> Kernel panic - not syncing: Fatal exception in interrupt >>>> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range: >>>> 0xffffffff80000000-0xffffffffbfffffff) >>>> >>>> Greets, >>>> Stefan >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe >>>> linux-btrfs" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> >>>> >>> >>> >> >> > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote: > > Am 13.03.2017 um 08:39 schrieb Qu Wenruo: >> >> >> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote: >>> Hi Qu, >>> >>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo: >>>> >>>> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote: >>>>> Hi Qu, >>>>> >>>>> while V5 was running fine against the openSUSE-42.2 kernel (based on >>>>> v4.4). >>>> >>>> Thanks for the test. >>>> >>>>> V7 results in OOPS to me: >>>>> BUG: unable to handle kernel NULL pointer dereference at >>>>> 00000000000001f0 >>>> >>>> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite >>>> nice clue. >>>> >>>>> IP: [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 >>>>> [btrfs] >>>> >>>> IP points to: >>>> --- >>>> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode) >>>> { >>>> struct btrfs_root *root = inode->root; << Either here >>>> >>>> if (root == root->fs_info->tree_root && << Or here >>>> btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID) >>>> >>>> --- >>>> >>>> Taking the above offset into consideration, it's only possible for later >>>> case. >>>> >>>> So here, we have a btrfs_inode whose @root is NULL. >>> >>> But wasn't this part of the code identical in V5? Why does it only >>> happen with V7? >> >> There are still difference, but just as you said, the related >> part(checking if inode is free space cache inode) is identical across v5 >> and v7. > > But if i boot v7 it always happens. If i boot v5 it always works. Have > done 5 repeatet tests. That's a problem now, I'll dig it further to find out the difference. Thanks, Qu > >> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which >> could happen in both v5 and v7. >> >> What's the difference between SUSE and mainline kernel? > > A lot ;-) But i don't think anything related. > >> Maybe some mainline kernel commits have already fixed it? > > May be no idea. But i haven't found any reason why v5 works. > > Stefan > >> >> Thanks, >> Qu >>> >>>> This can be fixed easily by checking @root inside >>>> btrfs_is_free_space_inode(), as the backtrace shows that it's only >>>> happening for DirectIO, and it won't happen for free space cache inode. >>>> >>>> But I'm more curious how this happened for a more accurate fix, or we >>>> could have other NULL pointer access. >>>> >>>> Did you have any reproducer for this? >>> >>> Sorry no - this is a production MariaDB Server running btrfs with >>> compress-force=zlib. But if i could test anything i'll do. >>> >>> Greets, >>> Stefan >>> >>>> >>>> Thanks, >>>> Qu >>>> >>>>> PGD 14e18d4067 PUD 14e1868067 PMD 0 >>>>> Oops: 0000 [#1] SMP >>>>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4 >>>>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set >>>>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq >>>>> ata_generic >>>>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci >>>>> i2c_core usb_common ata_piix floppy >>>>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1 >>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >>>>> 1.7.5-20140722_172050-sagunt 04/01/2014 >>>>> task: ffffffffb4e0f500 ti: ffffffffb4e00000 task.ti: ffffffffb4e00000 >>>>> RIP: 0010:[<ffffffffc03dde23>] [<ffffffffc03dde23>] >>>>> __endio_write_update_ordered+0x33/0x140 [btrfs] >>>>> RSP: 0018:ffff8814eae03cd8 EFLAGS: 00010086 >>>>> RAX: 0000000000000000 RBX: ffff8814e8fd5aa8 RCX: 0000000000000001 >>>>> RDX: 0000000000100000 RSI: 0000000000100000 RDI: ffff8814e45885c0 >>>>> RBP: ffff8814eae03d10 R08: ffff8814e8334000 R09: 000000018040003a >>>>> R10: ffffea00507d8d00 R11: ffff88141f634080 R12: ffff8814e45885c0 >>>>> R13: ffff8814e125d700 R14: 0000000000100000 R15: ffff8800376c6a80 >>>>> FS: 0000000000000000(0000) GS:ffff8814eae00000(0000) >>>>> knlGS:0000000000000000 >>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> CR2: 00000000000001f0 CR3: 00000014e34c9000 CR4: 00000000001406f0Stack: >>>>> 0000000000000000 0000000000100000 ffff8814e8fd5aa8 ffff8814e953f3c0 >>>>> ffff8814e125d700 0000000000100000 ffff8800376c6a80 ffff8814eae03d38 >>>>> ffffffffc03ddf67 ffff8814e86b6a80 ffff8814e8fd5aa8 0000000000000001 >>>>> Call Trace: >>>>> [<ffffffffc03ddf67>] btrfs_endio_direct_write+0x37/0x60 [btrfs] >>>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60 >>>>> [<ffffffffc04082c1>] btrfs_end_bio+0xa1/0x140 [btrfs] >>>>> [<ffffffffb438f2f7>] bio_endio+0x57/0x60 >>>>> [<ffffffffb439763b>] blk_update_request+0x8b/0x330 >>>>> [<ffffffffb43a05ba>] blk_mq_end_request+0x1a/0x70 >>>>> [<ffffffffc039f30f>] virtblk_request_done+0x3f/0x70 [virtio_blk] >>>>> [<ffffffffb43a0688>] __blk_mq_complete_request+0x78/0xe0 >>>>> [<ffffffffb43a070c>] blk_mq_complete_request+0x1c/0x20 >>>>> [<ffffffffc039f184>] virtblk_done+0x64/0xe0 [virtio_blk] >>>>> [<ffffffffb446dd2a>] vring_interrupt+0x3a/0x90 >>>>> [<ffffffffb40d3fe9>] __handle_irq_event_percpu+0x89/0x1b0 >>>>> [<ffffffffb40d4133>] handle_irq_event_percpu+0x23/0x60 >>>>> [<ffffffffb40d41ab>] handle_irq_event+0x3b/0x60 >>>>> [<ffffffffb40d74ef>] handle_edge_irq+0x6f/0x150 >>>>> [<ffffffffb4007cad>] handle_irq+0x1d/0x30 >>>>> [<ffffffffb400750b>] do_IRQ+0x4b/0xd0 >>>>> [<ffffffffb46af8cc>] common_interrupt+0x8c/0x8c >>>>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b >>>>> Leftover inexact backtrace: >>>>> 2017-03-12 20:33:08 <IRQ><EOI> >>>>> 2017-03-12 20:33:08 [<ffffffffb404ba46>] ? >>>>> native_safe_halt+0x6/0x10 >>>>> [<ffffffffb400fa3e>] default_idle+0x1e/0xe0 >>>>> [<ffffffffb401021f>] arch_cpu_idle+0xf/0x20 >>>>> [<ffffffffb40c67eb>] default_idle_call+0x3b/0x40 >>>>> [<ffffffffb40c6a8a>] cpu_startup_entry+0x29a/0x370 >>>>> [<ffffffffb46a358c>] rest_init+0x7c/0x80 >>>>> [<ffffffffb4f67fa5>] start_kernel+0x490/0x49d >>>>> [<ffffffffb4f67120>] ? early_idt_handler_array+0x120/0x120 >>>>> [<ffffffffb4f674b3>] x86_64_start_reservations+0x2a/0x2c >>>>> [<ffffffffb4f675f0>] x86_64_start_kernel+0x13b/0x14a >>>>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc >>>>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b >>>>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84 >>>>> RIP [<ffffffffc03dde23>] __endio_write_update_ordered+0x33/0x140 >>>>> [btrfs] >>>>> RSP <ffff8814eae03cd8> >>>>> CR2: 00000000000001f0 >>>>> ---[ end trace 7529a0652fd7873e ]--- >>>>> Kernel panic - not syncing: Fatal exception in interrupt >>>>> Kernel Offset: 0x33000000 from 0xffffffff81000000 (relocation range: >>>>> 0xffffffff80000000-0xffffffffbfffffff) >>>>> >>>>> Greets, >>>>> Stefan >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe >>>>> linux-btrfs" in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>> >>>>> >>>> >>>> >>> >>> >> >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote: > > Am 13.03.2017 um 08:39 schrieb Qu Wenruo: >> >> >> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote: >>> Hi Qu, >>> >>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo: >>> >>> But wasn't this part of the code identical in V5? Why does it only >>> happen with V7? >> >> There are still difference, but just as you said, the related >> part(checking if inode is free space cache inode) is identical across v5 >> and v7. > > But if i boot v7 it always happens. If i boot v5 it always works. Have > done 5 repeatet tests. I rechecked the code change between v7 and v5. It turns out that, the code base may cause the problem. In v7, the base is v4.11-rc1, which introduced quite a lot of btrfs_inode cleanup. One of the difference is the parameter for btrfs_is_free_space_inode(). In v7, the parameter @inode changed from struct inode to struct btrfs_inode. So in v7, we're passing BTRFS_I(inode) to btrfs_is_free_space_inode(), other than plain inode. That's the most possible cause for me here. So would you please paste the final patch applied to your tree? Git diff or git format-patch can both handle it. Thanks, Qu > >> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which >> could happen in both v5 and v7. >> >> What's the difference between SUSE and mainline kernel? > > A lot ;-) But i don't think anything related. > >> Maybe some mainline kernel commits have already fixed it? > > May be no idea. But i haven't found any reason why v5 works. > > Stefan > >> >> Thanks, >> Qu >>> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Thanks Qu, removing BTRFS_I from the inode fixes this issue to me. Greets, Stefan Am 14.03.2017 um 03:50 schrieb Qu Wenruo: > > > At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote: >> >> Am 13.03.2017 um 08:39 schrieb Qu Wenruo: >>> >>> >>> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote: >>>> Hi Qu, >>>> >>>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo: >>>> >>>> But wasn't this part of the code identical in V5? Why does it only >>>> happen with V7? >>> >>> There are still difference, but just as you said, the related >>> part(checking if inode is free space cache inode) is identical across v5 >>> and v7. >> >> But if i boot v7 it always happens. If i boot v5 it always works. Have >> done 5 repeatet tests. > > I rechecked the code change between v7 and v5. > > It turns out that, the code base may cause the problem. > > In v7, the base is v4.11-rc1, which introduced quite a lot of > btrfs_inode cleanup. > > One of the difference is the parameter for btrfs_is_free_space_inode(). > > In v7, the parameter @inode changed from struct inode to struct > btrfs_inode. > > So in v7, we're passing BTRFS_I(inode) to btrfs_is_free_space_inode(), > other than plain inode. > > That's the most possible cause for me here. > > So would you please paste the final patch applied to your tree? > Git diff or git format-patch can both handle it. > > Thanks, > Qu > >> >>> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which >>> could happen in both v5 and v7. >>> >>> What's the difference between SUSE and mainline kernel? >> >> A lot ;-) But i don't think anything related. >> >>> Maybe some mainline kernel commits have already fixed it? >> >> May be no idea. But i haven't found any reason why v5 works. >> >> Stefan >> >>> >>> Thanks, >>> Qu >>>> > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
At 03/14/2017 05:06 PM, Stefan Priebe - Profihost AG wrote: > Thanks Qu, removing BTRFS_I from the inode fixes this issue to me. > > Greets, > Stefan > Glad to hear that. And a small tip is, when compiling kernel(at least btrfs module), any warning should be checked carefully. Such type mismatch will cause a warning and can help us to avoid such problem. Thanks, Qu > > Am 14.03.2017 um 03:50 schrieb Qu Wenruo: >> >> >> At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote: >>> >>> Am 13.03.2017 um 08:39 schrieb Qu Wenruo: >>>> >>>> >>>> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote: >>>>> Hi Qu, >>>>> >>>>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo: >>>>> >>>>> But wasn't this part of the code identical in V5? Why does it only >>>>> happen with V7? >>>> >>>> There are still difference, but just as you said, the related >>>> part(checking if inode is free space cache inode) is identical across v5 >>>> and v7. >>> >>> But if i boot v7 it always happens. If i boot v5 it always works. Have >>> done 5 repeatet tests. >> >> I rechecked the code change between v7 and v5. >> >> It turns out that, the code base may cause the problem. >> >> In v7, the base is v4.11-rc1, which introduced quite a lot of >> btrfs_inode cleanup. >> >> One of the difference is the parameter for btrfs_is_free_space_inode(). >> >> In v7, the parameter @inode changed from struct inode to struct >> btrfs_inode. >> >> So in v7, we're passing BTRFS_I(inode) to btrfs_is_free_space_inode(), >> other than plain inode. >> >> That's the most possible cause for me here. >> >> So would you please paste the final patch applied to your tree? >> Git diff or git format-patch can both handle it. >> >> Thanks, >> Qu >> >>> >>>> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which >>>> could happen in both v5 and v7. >>>> >>>> What's the difference between SUSE and mainline kernel? >>> >>> A lot ;-) But i don't think anything related. >>> >>>> Maybe some mainline kernel commits have already fixed it? >>> >>> May be no idea. But i haven't found any reason why v5 works. >>> >>> Stefan >>> >>>> >>>> Thanks, >>>> Qu >>>>> >> >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c40060cc481f..fe588bf30d02 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, BTRFS_DATA_RELOC_TREE_OBJECTID) { ret = btrfs_reloc_clone_csums(inode, start, cur_alloc_size); + /* + * Only drop cache here, and process as normal. + * + * We must not allow extent_clear_unlock_delalloc() + * at out_unlock label to free meta of this ordered + * extent, as its meta should be freed by + * btrfs_finish_ordered_io(). + * + * So we must continue until @start is increased to + * skip current ordered extent. + */ if (ret) - goto out_drop_extent_cache; + btrfs_drop_extent_cache(BTRFS_I(inode), start, + start + ram_size - 1, 0); } btrfs_dec_block_group_reservations(fs_info, ins.objectid); - if (disk_num_bytes < cur_alloc_size) - break; - /* we're not doing compressed IO, don't unlock the first * page (which the caller expects to stay locked), don't * clear any dirty bits and don't set any writeback bits @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, delalloc_end, locked_page, EXTENT_LOCKED | EXTENT_DELALLOC, op); - disk_num_bytes -= cur_alloc_size; + if (disk_num_bytes < cur_alloc_size) + disk_num_bytes = 0; + else + disk_num_bytes -= cur_alloc_size; num_bytes -= cur_alloc_size; alloc_hint = ins.objectid + ins.offset; start += cur_alloc_size; + + /* + * btrfs_reloc_clone_csums() error, since start is increased + * extent_clear_unlock_delalloc() at out_unlock label won't + * free metadata of current ordered extent, we're OK to exit. + */ + if (ret) + goto out_unlock; } out: return ret; @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, BUG_ON(ret); /* -ENOMEM */ if (root->root_key.objectid == - BTRFS_DATA_RELOC_TREE_OBJECTID) { + BTRFS_DATA_RELOC_TREE_OBJECTID) + /* + * Error handled later, as we must prevent + * extent_clear_unlock_delalloc() in error handler + * from freeing metadata of created ordered extent. + */ ret = btrfs_reloc_clone_csums(inode, cur_offset, num_bytes); - if (ret) { - if (!nolock && nocow) - btrfs_end_write_no_snapshoting(root); - goto error; - } - } extent_clear_unlock_delalloc(inode, cur_offset, cur_offset + num_bytes - 1, end, @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, if (!nolock && nocow) btrfs_end_write_no_snapshoting(root); cur_offset = extent_end; + + /* + * btrfs_reloc_clone_csums() error, now we're OK to call error + * handler, as metadata for created ordered extent will only + * be freed by btrfs_finish_ordered_io(). + */ + if (ret) + goto error; if (cur_offset > end) break; }
[BUG] When btrfs_reloc_clone_csum() reports error, it can underflow metadata and leads to kernel assertion on outstanding extents in run_delalloc_nocow() and cow_file_range(). BTRFS info (device vdb5): relocating block group 12582912 flags data BTRFS info (device vdb5): found 1 extents assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 Currently, due to another bug blocking ordered extents, the bug is only reproducible under certain block group layout and using error injection. a) Create one data block group with one 4K extent in it. To avoid the bug that hangs btrfs due to ordered extent which never finishes b) Make btrfs_reloc_clone_csum() always fail c) Relocate that block group [CAUSE] run_delalloc_nocow() and cow_file_range() handles error from btrfs_reloc_clone_csum() wrongly: (The ascii chart shows a more generic case of this bug other than the bug mentioned above) |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<----------- cleanup range --------------->| |<----------- ----------->| \/ btrfs_finish_ordered_io() range So error handler, which calls extent_clear_unlock_delalloc() with EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() will both cover OE n, and free its metadata, causing metadata under flow. [Fix] The fix is to ensure after calling btrfs_add_ordered_extent(), we only call error handler after increasing the iteration offset, so that cleanup range won't cover any created ordered extent. |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<----------- ----------->|<---------- cleanup range --------->| \/ btrfs_finish_ordered_io() range Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> --- v6: New, split from v5 patch, as this is a separate bug. v7: Fix a wrong operator pointed out by Liu Bo. Test case will follow soon. --- fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 39 insertions(+), 12 deletions(-)