btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

Message ID 4FF6FF8C.4010007@jan-o-sch.net (mailing list archive)
State New, archived
Commit Message

Jan Schmidt July 6, 2012, 3:09 p.m. UTC
On Fri, July 06, 2012 at 16:33 (+0200), Sami Liedes wrote:
> On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote:
>> I've no good idea at the moment how to go on. It might help to get a feeling if
>> it's shifting around at least a little bit or really constant in the timing of
>> occurrence. So can you please apply the next patch on top of the other two and
>> give it some more failure tries? The "checksum mismatch [1234]" line will be of
>> most interest. I'm also curious what the additional debug variables will say in
>> the extended version of the very first printk. You can leave out the stack
>> traces if you like, they won't matter much anyway.
> 
> Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as
> suggested by Chris Mason.
> 
> With those and the latest patch, there's an oops already at boot. I
> don't have netconsole yet at that point, but here's the important
> parts (sure I can capture it fully if you need).

Oh I see. root->node can be NULL during mount. Please add this on top:

--
--


> By the way, something seems to be untabifying your patches. I don't
> know if it's on my side or yours, but at least some other patches I
> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
> mostly makes them apply cleanly for me.

Oh, I'm sorry. Should have been on my side. I hope it's better with the current
diff?

-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Sami Liedes July 6, 2012, 9:41 p.m. UTC | #1
On Fri, Jul 06, 2012 at 10:59:24PM +0300, Sami Liedes wrote:
> I think I might try running it overnight with KMEMCHECK to see if it
> reports something. But for now, here's what's in the log:

My KMEMCHECK kernel didn't even boot (due to some weird KMEMCHECK/ACPI
interaction), so I won't pursue this idea further at the moment...

> * lots of checksum mismatch [234], no 1s

One thing to notice from the logs, too, is that the device seems to
always be dm-6, the second device of the filesystem. This never seems
to happen to dm-5. There are 1583 lines of "btrfs: dm-6 checksum
verify failed".

	Sami
Sami Liedes July 6, 2012, 11:44 p.m. UTC | #2
[Retry: I think this mail didn't make it to the list, probably because
of the 73 kilobyte attached log. Here's a URL to the file:]

   http://www.niksula.hut.fi/~sliedes/btrfs-scrub-debug.log.gz

	Sami


------------------------------------------------------------
On Fri, Jul 06, 2012 at 05:09:00PM +0200, Jan Schmidt wrote:
> Oh I see. root->node can be NULL during mount. Please add this on top:

Ok. So, ran it with DEBUG_PAGEALLOC and slub debugging on. This time
it took half an hour to crash, and there's _lots_ of checksum mismatch
[234] messages even before the crash. gzipped dmesg attached.

At 781 seconds there's an "irq 17: nobody cared". That's a known bug
with this (and other Asus) motherboards and happens every now and
then. I doubt it has anything to do with this.

I think I might try running it overnight with KMEMCHECK to see if it
reports something. But for now, here's what's in the log:

* lots of checksum mismatch [234], no 1s

* a fair number of "csum_tree_block: [0-9]+ callbacks suppressed"
  lines

* two "btrfs: node seems invalid now. checksum ok = 1" messages, one
  at 1499 seconds and another just before the crash at 1973

* Just before the crash:
  btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len (32768). eb start is 2261163409408, level 100, generation 4412718571037421157, nritems 538968254. len param 17. debug 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0

* the oopses

> > By the way, something seems to be untabifying your patches. I don't
> > know if it's on my side or yours, but at least some other patches I
> > receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
> > mostly makes them apply cleanly for me.
> 
> Oh, I'm sorry. Should have been on my side. I hope it's better with the current
> diff?

Yes. No problem :)

[See attachment for dmesg log.]

	Sami
Arne Jansen July 9, 2012, 9:05 a.m. UTC | #3
On 07.07.2012 01:44, Sami Liedes wrote:
> [Retry: I think this mail didn't make it to the list, probably because
> of the 73 kilobyte attached log. Here's a URL to the file:]
> 
>    http://www.niksula.hut.fi/~sliedes/btrfs-scrub-debug.log.gz
> 
> 	Sami
> 
> 
> ------------------------------------------------------------
> On Fri, Jul 06, 2012 at 05:09:00PM +0200, Jan Schmidt wrote:
>> Oh I see. root->node can be NULL during mount. Please add this on top:
> 
> Ok. So, ran it with DEBUG_PAGEALLOC and slub debugging on. This time
> it took half an hour to crash, and there's _lots_ of checksum mismatch
> [234] messages even before the crash. gzipped dmesg attached.
> 
> At 781 seconds there's an "irq 17: nobody cared". That's a known bug
> with this (and other Asus) motherboards and happens every now and
> then. I doubt it has anything to do with this.
> 
> I think I might try running it overnight with KMEMCHECK to see if it
> reports something. But for now, here's what's in the log:
> 
> * lots of checksum mismatch [234], no 1s
> 
> * a fair number of "csum_tree_block: [0-9]+ callbacks suppressed"
>   lines
> 
> * two "btrfs: node seems invalid now. checksum ok = 1" messages, one
>   at 1499 seconds and another just before the crash at 1973
> 
> * Just before the crash:
>   btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len (32768). eb start is 2261163409408, level 100, generation 4412718571037421157, nritems 538968254. len param 17. debug 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0
> 

At first glance: the generation, converted to ASCII, is "ent() ==",
so something is overwriting the memory with ASCII text, possibly C source.
It might be interesting to dump the full contents of the eb to get
a clue about the source of the data.
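
For anyone who wants to reproduce the conversion: btrfs header fields are
stored little-endian, so the logged generation can be read one byte at a
time, least significant byte first. A minimal sketch, assuming a helper
name (u64_le_to_ascii) that is made up for illustration and is not part
of btrfs:

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a btrfs function.
 * Writes the 8 bytes of v, least significant byte first (the order a
 * little-endian field has in memory), into out as a NUL-terminated
 * string. Bytes outside printable ASCII are shown as '.'.
 */
static void u64_le_to_ascii(uint64_t v, char out[9])
{
	for (int i = 0; i < 8; i++) {
		unsigned char c = (unsigned char)(v >> (8 * i));

		out[i] = (c >= 0x20 && c <= 0x7e) ? (char)c : '.';
	}
	out[8] = '\0';
}
```

Called with 4412718571037421157ULL, the buffer ends up holding "ent() ==".
The level of 100 in the same log line is 0x64, which is ASCII 'd', so it
would fit the same picture.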


> * the oopses
> 
>>> By the way, something seems to be untabifying your patches. I don't
>>> know if it's on my side or yours, but at least some other patches I
>>> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
>>> mostly makes them apply cleanly for me.
>>
>> Oh, I'm sorry. Should have been on my side. I hope it's better with the current
>> diff?
> 
> Yes. No problem :)
> 
> [See attachment for dmesg log.]
> 
> 	Sami


Patch

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index df0b347..22838a3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -578,7 +578,8 @@  static noinline int check_node(struct btrfs_root *root,
	}
	node->debug[5] = node->start;
	node->debug[6] = btrfs_header_level(node);
-	node->debug[6] |= btrfs_header_level(root->node) << 16;
+	if (root->node)
+		node->debug[6] |= btrfs_header_level(root->node) << 16;
	node->debug[7] = 0xb22f50b22f5;

	return 0;
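
For reading the resulting value back out of a log line, the packing done
by the hunk above can be sketched in isolation. This is only an
illustration: the helper names are invented here, and the 64-bit type of
the debug slot is an assumption (the declaration of node->debug[] is not
shown in this patch).

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not part of the patch. They mirror how debug[6]
 * is built: the node's own level in the low 16 bits, and the root node's
 * level shifted left by 16, but only when a root node exists (root->node
 * can be NULL during mount).
 */
static uint64_t pack_levels(int node_level, int have_root_node, int root_level)
{
	uint64_t v = (uint64_t)node_level;

	if (have_root_node)
		v |= (uint64_t)root_level << 16;
	return v;
}

/* Level of the node that was checked. */
static int node_level_of(uint64_t v)
{
	return (int)(v & 0xffff);
}

/* Level of the root node, or 0 if none was recorded. */
static int root_level_of(uint64_t v)
{
	return (int)(v >> 16);
}
```

With this layout, a root level of 0 and a missing root node are
indistinguishable in the value itself, which is fine for a debugging aid
but worth keeping in mind when interpreting the output.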