Message ID | 4FF6C12A.10305@jan-o-sch.net (mailing list archive) |
---|---|
State | New, archived |
On Fri, Jul 06, 2012 at 04:42:50AM -0600, Jan Schmidt wrote:
> ... down here in the stack. The warning is printed from two levels above,
> __readahead_hook.
>
> Either I'm absolutely blind and there's code along the (rather short) road
> between those two that might do this I haven't seen. Or someone else messes
> with our extent buffers or the underlying pages. What really confuses me is
> that it happens so reproducibly.
>
> I've no good idea at the moment how to go on. It might help to get a feeling if
> it's shifting around at least a little bit or really constant in the timing of
> occurrence. So can you please apply the next patch on top of the other two and
> give it some more failure tries? The "checksum mismatch [1234]" line will be of
> most interest. I'm also curious what the additional debug variables will say in
> the extended version of the very first printk. You can leave out the stack
> traces if you like, they won't matter much anyway.

I would suggest turning on slab debug and CONFIG_DEBUG_PAGEALLOC.
Something really strange is happening here.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

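The numbered "checksum mismatch [1234]" messages Jan asks about follow a common localization technique: re-verify the same buffer at several points along a code path and note which checkpoint reports the first mismatch. Below is a minimal userspace sketch of that idea. The checksum is a toy stand-in and the helper names (`toy_checksum`, `verify_at`, `first_bad_checkpoint`) are hypothetical; none of this is btrfs code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Toy polynomial checksum; a stand-in for illustration, not btrfs's crc32c. */
uint32_t toy_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + data[i];
	return sum;
}

/* Re-verify the buffer at a numbered checkpoint; returns 1 on mismatch. */
int verify_at(int checkpoint, const uint8_t *data, size_t len,
	      uint32_t expected)
{
	if (toy_checksum(data, len) == expected)
		return 0;
	fprintf(stderr, "checksum mismatch %d\n", checkpoint);
	return 1;
}

/*
 * Hypothetical four-stage path. To simulate the bug, the buffer is
 * corrupted just before checkpoint `corrupt_before` (0 = never).
 * Returns the first checkpoint that sees a mismatch, or 0 if none does.
 */
int first_bad_checkpoint(uint8_t *data, size_t len, uint32_t expected,
			 int corrupt_before)
{
	for (int cp = 1; cp <= 4; cp++) {
		if (cp == corrupt_before)
			data[0] ^= 0xff;	/* simulated stray write */
		if (verify_at(cp, data, len, expected))
			return cp;
	}
	return 0;
}
```

The first checkpoint that complains bounds the window in which the buffer changed: everything before it still saw good data.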
On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote:
> I've no good idea at the moment how to go on. It might help to get a feeling if
> it's shifting around at least a little bit or really constant in the timing of
> occurrence. So can you please apply the next patch on top of the other two and
> give it some more failure tries? The "checksum mismatch [1234]" line will be of
> most interest. I'm also curious what the additional debug variables will say in
> the extended version of the very first printk. You can leave out the stack
> traces if you like, they won't matter much anyway.

Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as
suggested by Chris Mason.

With those and the latest patch, there's an oops already at boot. I
don't have netconsole yet at that point, but here's the important
parts (sure I can capture it fully if you need).

By the way, something seems to be untabifying your patches. I don't
know if it's on my side or yours, but at least some other patches I
receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
mostly makes them apply cleanly for me.

Sami

------------------------------------------------------------
btrfs: disk space caching is enabled
BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
IP: [<ffffffffa0223568>] check_node+0x138/0x250 [btrfs]
PGD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU 6
Modules linked in: <omitted> [last unloaded: scsi_wait_scan]
Pid: 1176, comm: btrfs-endio-met Tainted: G W 3.4.4+btrfsdebug2 #2 System Product Name/P8P67 EVO
RIP: 0010:[<ffffffffa0223568>] [<ffffffffa0223568>] check_node+0x138/0x250 [btrfs]
[...]
Process btrfs-endio-met (pid: 1176, [...])
Call trace:
 [...] btree_readpage_end_io_hook+0x1e5/0x2d0 [btrfs]
 [...] end_bio_extent_readpage+0xcb/0xa30 [btrfs]
 [...] ? end_workqueue_fn+0x31/0x50 [btrfs]
 [...] bio_endio+0x18/0x30
 [...] end_workqueue_fn+0x3c/0x50 [btrfs]
 [...] worker_loop+0x157/0x560 [btrfs]
 [...] ? btrfs_queue_worker+0x310/0x310 [btrfs]
 [...] kthread+0x8e/0xa0
 [...] kernel_thread_helper+0x4/0x10
 [...] ? flush_kthread_worker+0x70/0x70
 [...] ? gs_change+0x13/0x13
Code: [...]
RIP [<ffffffffa0223568>] check_node+0x138/0x250 [btrfs]
RSP <ffff8801f3843cb0>
------------------------------------------------------------

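A fault address as small as 0x150 is what one typically sees when code reads a struct member through a NULL pointer: the CPU touches "NULL + offset of the member". Which pointer was NULL inside check_node is not established at this point in the thread. The sketch below only illustrates the arithmetic, using a made-up layout (`struct example_buffer` is hypothetical and is not the real `struct extent_buffer`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical layout, NOT a real kernel structure: the padding is
 * chosen only so that one member lands at offset 0x150.
 */
struct example_buffer {
	uint8_t  padding[0x150];
	uint64_t field_at_0x150;
};

/* Address that would be accessed for base->field_at_0x150. */
uintptr_t fault_address(uintptr_t base)
{
	return base + offsetof(struct example_buffer, field_at_0x150);
}

/* Print the address a NULL base pointer would fault on. */
void print_fault_address(void)
{
	printf("NULL base -> fault at %#lx\n",
	       (unsigned long)fault_address(0));
}
```

Working backwards from a real oops means finding which structure reachable from the faulting instruction has a member at that offset in the kernel build in question.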
On Fri, Jul 06, 2012 at 08:33:51AM -0600, Sami Liedes wrote:
> On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote:
> > I've no good idea at the moment how to go on. It might help to get a feeling if
> > it's shifting around at least a little bit or really constant in the timing of
> > occurrence. So can you please apply the next patch on top of the other two and
> > give it some more failure tries? The "checksum mismatch [1234]" line will be of
> > most interest. I'm also curious what the additional debug variables will say in
> > the extended version of the very first printk. You can leave out the stack
> > traces if you like, they won't matter much anyway.
>
> Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as
> suggested by Chris Mason.
>
> With those and the latest patch, there's an oops already at boot. I
> don't have netconsole yet at that point, but here's the important
> parts (sure I can capture it fully if you need).
>
> By the way, something seems to be untabifying your patches. I don't
> know if it's on my side or yours, but at least some other patches I
> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
> mostly makes them apply cleanly for me.
>
> Sami
>
>
> ------------------------------------------------------------
> btrfs: disk space caching is enabled
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
> IP: [<ffffffffa0223568>] check_node+0x138/0x250 [btrfs]

This isn't from any of the new debugging. Can you please try it on an
unpatched kernel?

-chris

On Fri, July 06, 2012 at 16:40 (+0200), Chris Mason wrote:
> On Fri, Jul 06, 2012 at 08:33:51AM -0600, Sami Liedes wrote:
>> On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote:
>>> I've no good idea at the moment how to go on. It might help to get a feeling if
>>> it's shifting around at least a little bit or really constant in the timing of
>>> occurrence. So can you please apply the next patch on top of the other two and
>>> give it some more failure tries? The "checksum mismatch [1234]" line will be of
>>> most interest. I'm also curious what the additional debug variables will say in
>>> the extended version of the very first printk. You can leave out the stack
>>> traces if you like, they won't matter much anyway.
>>
>> Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as
>> suggested by Chris Mason.
>>
>> With those and the latest patch, there's an oops already at boot. I
>> don't have netconsole yet at that point, but here's the important
>> parts (sure I can capture it fully if you need).
>>
>> By the way, something seems to be untabifying your patches. I don't
>> know if it's on my side or yours, but at least some other patches I
>> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
>> mostly makes them apply cleanly for me.
>>
>> Sami
>>
>>
>> ------------------------------------------------------------
>> btrfs: disk space caching is enabled
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
>> IP: [<ffffffffa0223568>] check_node+0x138/0x250 [btrfs]
>
> This isn't from any of the new debugging. Can you please try it on an
> unpatched kernel?

You're confusing that with check_leaf. I added check_node along the way, see my
mail from Thu, July 05, 2012 at 15:41 (+0200). I'd really like to add something
similar for the 3.6 series.

Checking for the null pointer dereference.

-Jan

On Fri, Jul 06, 2012 at 09:02:34AM -0600, Jan Schmidt wrote:
> On Fri, July 06, 2012 at 16:40 (+0200), Chris Mason wrote:
> > On Fri, Jul 06, 2012 at 08:33:51AM -0600, Sami Liedes wrote:
> >> On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote:
> >>> I've no good idea at the moment how to go on. It might help to get a feeling if
> >>> it's shifting around at least a little bit or really constant in the timing of
> >>> occurrence. So can you please apply the next patch on top of the other two and
> >>> give it some more failure tries? The "checksum mismatch [1234]" line will be of
> >>> most interest. I'm also curious what the additional debug variables will say in
> >>> the extended version of the very first printk. You can leave out the stack
> >>> traces if you like, they won't matter much anyway.
> >>
> >> Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as
> >> suggested by Chris Mason.
> >>
> >> With those and the latest patch, there's an oops already at boot. I
> >> don't have netconsole yet at that point, but here's the important
> >> parts (sure I can capture it fully if you need).
> >>
> >> By the way, something seems to be untabifying your patches. I don't
> >> know if it's on my side or yours, but at least some other patches I
> >> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
> >> mostly makes them apply cleanly for me.
> >>
> >> Sami
> >>
> >>
> >> ------------------------------------------------------------
> >> btrfs: disk space caching is enabled
> >> BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
> >> IP: [<ffffffffa0223568>] check_node+0x138/0x250 [btrfs]
> >
> > This isn't from any of the new debugging. Can you please try it on an
> > unpatched kernel?
>
> You're confusing that with check_leaf. I added check_node along the way, see my
> mail from Thu, July 05, 2012 at 15:41 (+0200). I'd really like to add something
> similar for the 3.6 series.
>
> Checking for the null pointer dereference.

Sorry, I wasn't clear. I meant it wasn't from slab debug or
DEBUG_PAGEALLOC, so it must be new in your patches ;)

-chris

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 34122c2..df0b347 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -550,6 +550,7 @@ static noinline int check_node(struct btrfs_root *root,
 	u32 nritems = btrfs_header_nritems(node);
 	u64 generation;
+	node->debug[4] = 0xb77f50b77f5;
 	if (nritems == 0)
 		return 0;
@@ -575,6 +576,10 @@ static noinline int check_node(struct btrfs_root *root,
 			return -EIO;
 		}
 	}
+	node->debug[5] = node->start;
+	node->debug[6] = btrfs_header_level(node);
+	node->debug[6] |= btrfs_header_level(root->node) << 16;
+	node->debug[7] = 0xb22f50b22f5;
 	return 0;
 }
@@ -686,10 +691,17 @@ static int btree_readpage_end_io_hook(struct page *page, u64 start, u64 end,
 		ret = -EIO;
 	}
+	if (btrfs_csum_tree_block(root, eb))
+		printk(KERN_ERR "btrfs: checksum mismatch 1 on %llu\n",
+			eb->start);
+
 	if (!ret)
 		set_extent_buffer_uptodate(eb);
 err:
 	if (test_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)) {
+		if (btrfs_csum_tree_block(root, eb))
+			printk(KERN_ERR "btrfs: checksum mismatch 2 on %llu\n",
+				eb->start);
 		clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags);
 		btree_readahead_hook(root, eb, eb->start, ret);
 	}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 099ce6e..7452ecb 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4521,11 +4521,12 @@ void read_extent_buffer(struct extent_buffer *eb, void *dstv,
 	unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
 	if (start > eb->len) {
-		printk(KERN_ERR "btrfs: invalid parameters for read_extent_buffer: start (%lu) > eb->len (%lu). eb start is %llu, level %d, generation %llu, nritems %d. len param %lu. debug %llu/%llu/%llu/%llu\n",
+		printk(KERN_ERR "btrfs: invalid parameters for read_extent_buffer: start (%lu) > eb->len (%lu). eb start is %llu, level %d, generation %llu, nritems %d. len param %lu. debug %llu/%llu/%llu/%llu/%#llx/%llu/%#llx/%#llx\n",
 			start, eb->len, eb->start, btrfs_header_level(eb),
 			btrfs_header_generation(eb), btrfs_header_nritems(eb), len,
-			eb->debug[0], eb->debug[1], eb->debug[2], eb->debug[3]);
+			eb->debug[0], eb->debug[1], eb->debug[2], eb->debug[3],
+			eb->debug[4], eb->debug[5], eb->debug[6], eb->debug[7]);
 		WARN_ON(1);
 	}
 	WARN_ON(start + len > eb->start + eb->len);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 1bbf823..51c42f1 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -165,7 +165,7 @@ struct extent_buffer {
 	struct page *inline_pages[INLINE_EXTENT_BUFFER_PAGES];
 	struct page **pages;
-	u64 debug[4];
+	u64 debug[8];
 };
 static inline void extent_set_compress_type(unsigned long *bio_flags,
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index b659c8d..ea81bd4 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -130,6 +130,10 @@ static int __readahead_hook(struct btrfs_root *root, struct extent_buffer *eb,
 		kref_get(&re->refcnt);
 	spin_unlock(&fs_info->reada_lock);
+	if (!err && btrfs_csum_tree_block(root, eb))
+		printk(KERN_ERR "btrfs: checksum mismatch 4 on %llu\n",
+			eb->start);
+
 	if (!re)
 		return -1;
@@ -248,6 +252,9 @@ int btree_readahead_hook(struct btrfs_root *root, struct extent_buffer *eb,
 {
 	int ret;
+	if (!err && btrfs_csum_tree_block(root, eb))
+		printk(KERN_ERR "btrfs: checksum mismatch 3 on %llu\n",
+			eb->start);
 	ret = __readahead_hook(root, eb, start, err);
 	reada_start_machine(root->fs_info);
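When reading the extended printk output, the new debug slots can be decoded by hand. Per the patch above, debug[4] is set on entry to check_node, debug[7] only when the final `return 0` is reached, and debug[6] holds the node's level in its low 16 bits with the root node's level shifted left by 16. The userspace sketch below mirrors that packing. The helper names are hypothetical, and it assumes the slots held zero before check_node ran, which the patch itself does not guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Marker values taken from the patch. */
#define CHECK_NODE_ENTERED  0xb77f50b77f5ULL	/* debug[4] */
#define CHECK_NODE_FINISHED 0xb22f50b22f5ULL	/* debug[7] */

/* Same packing as debug[6] in the patch: node level | root level << 16. */
uint64_t pack_levels(unsigned int node_level, unsigned int root_level)
{
	return (uint64_t)node_level | ((uint64_t)root_level << 16);
}

unsigned int node_level_of(uint64_t packed)
{
	return (unsigned int)(packed & 0xffff);
}

unsigned int root_level_of(uint64_t packed)
{
	return (unsigned int)(packed >> 16);
}

/*
 * 0: check_node never ran on this buffer (assuming zeroed slots),
 * 1: entered but left early (nritems == 0 or an -EIO return),
 * 2: ran to the final return.
 */
int check_node_progress(uint64_t debug4, uint64_t debug7)
{
	if (debug4 != CHECK_NODE_ENTERED)
		return 0;
	return debug7 == CHECK_NODE_FINISHED ? 2 : 1;
}

/* Print a decoded view of the three slots. */
void print_decoded(uint64_t debug4, uint64_t debug6, uint64_t debug7)
{
	printf("progress %d, node level %u, root level %u\n",
	       check_node_progress(debug4, debug7),
	       node_level_of(debug6), root_level_of(debug6));
}
```

For example, a printed debug[6] of 0x20001 would decode to node level 1 under a root at level 2.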