Message ID: 874nllxi7e.fsf_-_@spindle.srvr.nix (mailing list archive)
State: New, archived
On Tue, Oct 23, 2012 at 05:32:21PM +0100, Nix wrote:
> On 23 Oct 2012, J. Bruce Fields uttered the following:
> > nfs-utils shouldn't be capable of oopsing the kernel, so from my
> > (selfish) point of view I'd actually rather you stick with whatever you
> > have and try to reproduce the oops.
>
> Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was
> rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss
> in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I
> have runs of NULs in my logs now, which keep eating the oopses):
>
> [while in 3.6.1]
> [   88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> [   88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> [   89.648152] ------------[ cut here ]------------
> [   89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42()
> [   89.648614] Hardware name: empty
> [   89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
> [   89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1
> [   89.649610] Call Trace:
> [   89.649832]  [<ffffffff810608c0>] warn_slowpath_common+0x83/0x9b
> [   89.650063]  [<ffffffff810608f2>] warn_slowpath_null+0x1a/0x1c
> [   89.650292]  [<ffffffff8112efbf>] drop_nlink+0x25/0x42
> [   89.650533]  [<ffffffff81187112>] ext4_dec_count+0x26/0x28
> [   89.650763]  [<ffffffff8118abb8>] ext4_rename+0x4ec/0x7b4
> [   89.650993]  [<ffffffff81125d81>] ? vfs_rename+0xbe/0x3b7
> [   89.651224]  [<ffffffff81125f3f>] vfs_rename+0x27c/0x3b7
> [   89.651454]  [<ffffffff81127a53>] sys_renameat+0x1b1/0x228
> [   89.651682]  [<ffffffff8114bda1>] ? fsnotify+0x226/0x249
> [   89.651911]  [<ffffffff81239b75>] ? security_inode_permission+0x1e/0x20
> [   89.652145]  [<ffffffff8111a240>] ? vfs_write+0x116/0x142
> [   89.652372]  [<ffffffff81127ae5>] sys_rename+0x1b/0x1e
> [   89.652601]  [<ffffffff814fffa2>] system_call_fastpath+0x16/0x1b
> [...]
> [while having just booted into 3.6.1 after some time in 3.6.3: the FS
> was clean, and fscked on the previous boot into 3.6.3 after a previous
> instance of this bug]
> Oct 23 17:18:26 spindle crit: [   67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828)
>
> This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to
> tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now.
>
>
> Anyway, here's the NFSv4 oops (not a panic: it helps if I remember to
> turn off panic_on_oops when I come home from the holidays).
>
> It's a lockd problem, and probably happens during delivery of mail over
> NFS (my mailserver load soars when it happens):
>
> [  813.110354] ------------[ cut here ]------------
> [  813.110585] kernel BUG at fs/lockd/mon.c:150!

So nsm_mon_unmon() is being passed a NULL client.

There are three container patches between 3.6.1 and 3.6.3:

	lockd: per-net NSM client creation and destruction helpers introduced
	lockd: use rpc client's cl_nodename for id encoding
	lockd: create and use per-net NSM RPC clients on MON/UNMON requests

and that last does change nsm_monitor's call to nsm_mon_unmon, so that's
almost certainly it....

Looks like there's some confusion about whether nsm_client_get() returns
NULL or an error?

--b.

> [  813.110878] invalid opcode: 0000 [#1] SMP
> [  813.111173] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode]
> [  813.111727] CPU 1
> [  813.111772] Pid: 1040, comm: lockd Not tainted 3.6.3-dirty #1 empty empty/S7010
> [  813.112388] RIP: 0010:[<ffffffff8120fbbc>]  [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98
> [  813.112840] RSP: 0018:ffff88062163dcf0  EFLAGS: 00010246
> [  813.113069] RAX: ffff88062163dcf8 RBX: 0000000000000000 RCX: 0000000000000000
> [  813.113303] RDX: ffff88062163dd68 RSI: 0000000000000002 RDI: ffff88062163dd40
> [  813.113537] RBP: ffff88062163dd50 R08: 0000000000000000 R09: ffffffff8120c463
> [  813.113771] R10: ffffffff8120c463 R11: 0000000000000001 R12: 0000000000000000
> [  813.114006] R13: ffff88061f067e40 R14: ffff88059c763a00 R15: ffff88062163de28
> [  813.114241] FS:  0000000000000000(0000) GS:ffff88063fc40000(0000) knlGS:0000000000000000
> [  813.114651] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  813.114882] CR2: ffffffffff600400 CR3: 0000000001a0b000 CR4: 00000000000007e0
> [  813.115116] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  813.115350] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  813.115584] Process lockd (pid: 1040, threadinfo ffff88062163c000, task ffff880621611640)
> [  813.115994] Stack:
> [  813.116210]  ffff88062163dd90 ffff8805f4e044b1 00000003000186b5 ffffffff00000010
> [  813.116761]  ffff8805f4e044c1 000000000000008c 0000000000000000 ffff88062163dcf8
> [  813.117310]  ffff88062163dd68 0000000000000000 ffff8806200ce088 ffff8805f4e04400
> [  813.117861] Call Trace:
> [  813.118083]  [<ffffffff8120fee4>] nsm_monitor+0x100/0x157
> [  813.118315]  [<ffffffff81211b4a>] nlm4svc_retrieve_args+0x62/0xd7
> [  813.118547]  [<ffffffff81211f49>] nlm4svc_proc_lock+0x3c/0xb5
> [  813.118779]  [<ffffffff8121167b>] ? nlm4svc_decode_lockargs+0x47/0xb2
> [  813.119016]  [<ffffffff814d89ca>] svc_process+0x3bf/0x6a1
> [  813.119246]  [<ffffffff8120d5f0>] lockd+0x127/0x164
> [  813.119474]  [<ffffffff8120d4c9>] ? set_grace_period+0x8a/0x8a
> [  813.119708]  [<ffffffff8107bcbc>] kthread+0x8b/0x93
> [  813.119938]  [<ffffffff815012f4>] kernel_thread_helper+0x4/0x10
> [  813.120170]  [<ffffffff8107bc31>] ? kthread_worker_fn+0xe1/0xe1
> [  813.120401]  [<ffffffff815012f0>] ? gs_change+0xb/0xb
> [  813.120628] Code: b8 10 00 00 00 48 89 45 c0 48 8d 81 8c 00 00 00 b9 08 00 00 00 48 89 45 c8 89 d8 f3 ab 48 8d 45 a8 48 89 55 e0 48 89 45 d8 75 02 <0f> 0b 89 f6 48 c7 02 00 00 00 00 4c 89 c7 48 6b f6 38 ba 00 04
> [  813.123627] RIP  [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98
> [  813.123894]  RSP <ffff88062163dcf0>
> [  813.124129] ---[ end trace 11eb11a091ffd910 ]---
>
> Aside: I am running with this patch, to prevent mounting of one
> NFS-mounted directory under another from failing all accesses with a
> spurious -EBUSY, a 3.6 regression -- but I can't see how this could
> cause it. I'm running with this in 3.6.1 as well:
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 8086636..649a112 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -2404,6 +2404,10 @@ out_unalias:
> 	if (likely(!d_mountpoint(alias))) {
> 		__d_move(alias, dentry);
> 		ret = alias;
> +	} else {
> +		printk(KERN_WARNING "VFS: __d_move()ing a d_mountpoint(), uh oh\n");
> +		__d_move(alias, dentry);
> +		ret = alias;
> 	}
>  out_err:
> 	spin_unlock(&inode->i_lock);

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Oct 23, 2012 at 12:46:21PM -0400, J. Bruce Fields wrote:
> On Tue, Oct 23, 2012 at 05:32:21PM +0100, Nix wrote:
> > On 23 Oct 2012, J. Bruce Fields uttered the following:
[...]
> So nsm_mon_unmon() is being passed a NULL client.
>
> There are three container patches between 3.6.1 and 3.6.3:
>
> 	lockd: per-net NSM client creation and destruction helpers introduced
> 	lockd: use rpc client's cl_nodename for id encoding
> 	lockd: create and use per-net NSM RPC clients on MON/UNMON requests
>
> and that last does change nsm_monitor's call to nsm_mon_unmon, so that's
> almost certainly it....
>
> Looks like there's some confusion about whether nsm_client_get() returns
> NULL or an error?

The return from nsm_client_get() is either from nsm_create() or from
ln->nsm_clnt. nsm_create's return is from rpc_create(), and it doesn't
look possible for rpc_create() to return NULL.

So probably we have some case where, while holding ln->nsm_clnt_lock,
you can see ln->nsm_users nonzero, but ln->nsm_clnt NULL?

--b.
On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote:
> On Tue, Oct 23, 2012 at 05:32:21PM +0100, Nix wrote:
[...]
> Looks like there's some confusion about whether nsm_client_get() returns
> NULL or an error?

nsm_client_get() looks extremely racy in the case where ln->nsm_users ==
0. Since we never recheck the value of ln->nsm_users after taking
nsm_create_mutex, what is stopping 2 different threads from both setting
ln->nsm_clnt and re-initialising ln->nsm_users?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
On 23 Oct 2012, Trond Myklebust spake thusly:
> On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote:
>> Looks like there's some confusion about whether nsm_client_get() returns
>> NULL or an error?
>
> nsm_client_get() looks extremely racy in the case where ln->nsm_users ==
> 0. Since we never recheck the value of ln->nsm_users after taking
> nsm_create_mutex, what is stopping 2 different threads from both setting
> ln->nsm_clnt and re-initialising ln->nsm_users?

Yep. At the worst possible time:

	spin_lock(&ln->nsm_clnt_lock);
	if (ln->nsm_users) {
		if (--ln->nsm_users)
			ln->nsm_clnt = NULL;		(1)
		shutdown = !ln->nsm_users;
	}
	spin_unlock(&ln->nsm_clnt_lock);

If a thread reinitializes nsm_users at point (1), after the assignment,
we could well end up with ln->nsm_clnt NULL and shutdown false. A bit
later, nsm_mon_unmon gets called with a NULL clnt, and boom. This seems
particularly likely if there is only one nsm_user (which is true in my
case, since I have only one active network namespace).
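For illustration only (hypothetical names, not the eventual lockd patch): the race described above goes away if the creation path rechecks the cached pointer after taking the mutex, and the release path clears it only when the count genuinely reaches zero. A userspace sketch with pthreads, where clnt_lock plays the role of nsm_clnt_lock and create_mutex the role of nsm_create_mutex:

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-net NSM client state. */
struct client { int dummy; };

static pthread_mutex_t create_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t clnt_lock = PTHREAD_MUTEX_INITIALIZER;
static struct client *clnt;
static unsigned int users;

struct client *client_get(void)
{
	struct client *c;

	/* Fast path: a client already exists; just bump the count. */
	pthread_mutex_lock(&clnt_lock);
	if (users) {
		users++;
		c = clnt;
		pthread_mutex_unlock(&clnt_lock);
		return c;
	}
	pthread_mutex_unlock(&clnt_lock);

	/* Slow path: serialize creation, then RECHECK under the lock --
	 * another thread may have created the client while we waited. */
	pthread_mutex_lock(&create_mutex);
	pthread_mutex_lock(&clnt_lock);
	if (!clnt)
		clnt = calloc(1, sizeof(*clnt));
	users++;
	c = clnt;
	pthread_mutex_unlock(&clnt_lock);
	pthread_mutex_unlock(&create_mutex);
	return c;
}

void client_put(void)
{
	struct client *doomed = NULL;

	pthread_mutex_lock(&clnt_lock);
	if (--users == 0) {	/* clear only when the count truly hits 0 */
		doomed = clnt;
		clnt = NULL;
	}
	pthread_mutex_unlock(&clnt_lock);
	free(doomed);
}
```

Note the release path tests `--users == 0` (the equivalent of `!--nsm_users`), where the snippet quoted above tests the truthiness of `--ln->nsm_users` directly and so clears the client while users remain.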
[Bruce, Trond, I fear it may be hard for me to continue chasing this NFS
lockd crash as long as ext4 on 3.6.3 is hosing my filesystems like
this. Apologies.]

On 23 Oct 2012, nix@esperi.org.uk uttered the following:
> Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was
> rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss
> in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I
> have runs of NULs in my logs now, which keep eating the oopses):
[...]
> [while having just booted into 3.6.1 after some time in 3.6.3: the FS
> was clean, and fscked on the previous boot into 3.6.3 after a previous
> instance of this bug]
> Oct 23 17:18:26 spindle crit: [   67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828)
>
> This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to
> tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now.

It is now quite clear that this is a bug introduced by one or more of
the post-3.6.1 ext4 patches (which have all been backported at least to
3.5, so the problem is probably there too).

Rebooting from 3.6.3 back into 3.6.1, I saw this within seconds of boot:

[   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd
[   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[  116.508621] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 5, 17749 clusters in bitmap, 17700 in gd
[  116.509626] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 8, 15814 clusters in bitmap, 16073 in gd
[  116.510103] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 10, 3349 clusters in bitmap, 3493 in gd
[  116.510571] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 12, 1792 clusters in bitmap, 1648 in gd
[  116.511691] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[  116.512736] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 25, 14463 clusters in bitmap, 14462 in gd
[  116.513624] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[  359.538550] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10024
[  359.559220] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10572
[  366.113780] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10024
[  366.114837] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10572
[  456.013682] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303
[  456.384454] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303
[  457.508943] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303
[  457.509422] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303
[  457.509897] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303
[  478.779574] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  478.780047] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  478.780516] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  478.780983] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  478.782010] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  478.782480] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303
[  479.826974] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm lesspipe.sh: deleted inode referenced: 10303
[  479.834091] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm lesspipe.sh: deleted inode referenced: 10303
[  479.835966] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm file: deleted inode referenced: 10303
[  479.856946] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm less: deleted inode referenced: 10303
[  479.857431] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm less: deleted inode referenced: 10303
[  679.812704] EXT4-fs error (device dm-3): mb_free_blocks:1300: group 27, block 916489:freeing already freed block (bit 31753)

(I'd provide more sample errors, but this bug has been eating
newly-written logs in /var all day, so not much has survived.)

I rebooted into 3.6.1 rescue mode and fscked everything: lots of
orphans, block group corruption and cross-linked files. The problems did
not recur upon booting from 3.6.1 into 3.6.1 again. It is quite clear
that metadata changes made in 3.6.3 are not making it to disk reliably,
thus leading to corrupted filesystems marked clean on reboot into other
kernels: pretty much every file appended to in 3.6.3 loses some or all
of its appended data, and newly allocated blocks often end up
cross-linked between multiple files.

The curious thing is this doesn't affect every filesystem: for a while
it affected only /var, and now it's affecting only /var and /home. The
massive writes to the ext4 filesystem mounted on /usr/src seem to have
gone off without incident: fsck reports no problems.
The only unusual thing about the filesystems on this machine are that
they have hardware RAID-5 (using the Areca driver), so I'm mounting with
'nobarrier': the full set of options for all my ext4 filesystems are:

rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

If there's anything I can do to help, I'm happy to do it, once I've
restored my home directory from backup :(

tune2fs output for one of the afflicted filesystems (after fscking):

tune2fs 1.42.2 (9-Apr-2012)
Filesystem volume name:   home
Last mounted on:          /home
Filesystem UUID:          95bd22c2-253c-456f-8e36-b6cfb9ecd4ef
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              3276800
Block count:              13107200
Reserved block count:     655360
Free blocks:              5134852
Free inodes:              3174777
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      20
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stripe width:        16
Flex block group size:    64
Filesystem created:       Tue May 26 21:29:41 2009
Last mount time:          Tue Oct 23 21:32:07 2012
Last write time:          Tue Oct 23 21:32:07 2012
Mount count:              2
Maximum mount count:      20
Last checked:             Tue Oct 23 21:22:16 2012
Check interval:           15552000 (6 months)
Next check after:         Sun Apr 21 21:22:16 2013
Lifetime writes:          1092 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       1572907
Default directory hash:   half_md4
Directory Hash Seed:      a201983d-d8a3-460b-93ca-eb7804b62c23
Journal backup:           inode blocks
On 10/23/12 3:57 PM, Nix wrote:
<snip>
> The only unusual thing about the filesystems on this machine are that
> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
> 'nobarrier':

I should have read more. :(  More questions follow:

* Does the Areca have a battery backed write cache?

* Are you crashing or rebooting cleanly?

* Do you see log recovery messages in the logs for this filesystem?

> the full set of options for all my ext4 filesystems are:
>
> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

ok journal_async_commit is off the reservation a bit; that's really not
tested, and Jan had serious reservations about its safety.

* Can you reproduce this w/o journal_async_commit?

-Eric
On 24 Oct 2012, Eric Sandeen uttered the following:
> On 10/23/12 3:57 PM, Nix wrote:
>> The only unusual thing about the filesystems on this machine are that
>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
>> 'nobarrier':
>
> I should have read more. :(  More questions follow:
>
> * Does the Areca have a battery backed write cache?

Yes (though I'm not powering off, just rebooting). Battery at 100% and
happy, though the lack of power-off means it's not actually getting
used, since the cache is obviously mains-backed as well.

> * Are you crashing or rebooting cleanly?

Rebooting cleanly, everything umounted happily including /home and /var.

> * Do you see log recovery messages in the logs for this filesystem?

My memory says yes, but nothing seems to be logged when this happens
(though with my logs on the first filesystem damaged by this, this is
rather hard to tell, they're all quite full of NULs by now).

I'll double-reboot tomorrow via the faulty kernel and check, unless I
get asked not to in the interim. (And then double-reboot again to fsck
everything...)

>> the full set of options for all my ext4 filesystems are:
>>
>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>
> ok journal_async_commit is off the reservation a bit; that's really not
> tested, and Jan had serious reservations about its safety.

OK, well, I've been 'testing' it for years :) No problems until now. (If
anything, I was more concerned about journal_checksum. I thought that
had actually been implicated in corruption before now...)

> * Can you reproduce this w/o journal_async_commit?

I can try!
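Concretely, trying that amounts to dropping the one option from the mount line. A hypothetical fstab entry for one of the affected filesystems (the device name and mount point are assumptions; only journal_async_commit is removed, with journal_checksum deliberately left in place so the test isolates the async-commit behaviour):

```
# /etc/fstab -- illustrative sketch only, not from the thread
/dev/dm-5  /var  ext4  rw,nosuid,nodev,relatime,journal_checksum,nobarrier,quota,usrquota,grpquota,commit=30,stripe=16,data=ordered  0  2
```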
On 10/23/12 11:15 PM, Nix wrote: > On 24 Oct 2012, Eric Sandeen uttered the following: > >> On 10/23/12 3:57 PM, Nix wrote: >>> The only unusual thing about the filesystems on this machine are that >>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with >>> 'nobarrier': >> >> I should have read more. :( More questions follow: >> >> * Does the Areca have a battery backed write cache? > > Yes (though I'm not powering off, just rebooting). Battery at 100% and > happy, though the lack of power-off means it's not actually getting > used, since the cache is obviously mains-backed as well. > >> * Are you crashing or rebooting cleanly? > > Rebooting cleanly, everything umounted happily including /home and /var. > >> * Do you see log recovery messages in the logs for this filesystem? > > My memory says yes, but nothing seems to be logged when this happens > (though with my logs on the first filesystem damaged by this, this is > rather hard to tell, they're all quite full of NULs by now). > > I'll double-reboot tomorrow via the faulty kernel and check, unless I > get asked not to in the interim. (And then double-reboot again to fsck > everything...) > >>> the full set of options for all my ext4 filesystems are: >>> >>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota, >>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota >> >> ok journal_async_commit is off the reservation a bit; that's really not >> tested, and Jan had serious reservations about its safety. > > OK, well, I've been 'testing' it for years :) No problems until now. (If > anything, I was more concerned about journal_checksum. I thought that > had actually been implicated in corruption before now...) It had, but I fixed it AFAIK; OTOH, we turned it off by default after that episode. >> * Can you reproduce this w/o journal_async_commit? > > I can try! Ok, fair enough. 
If the BBU is working, nobarrier is ok; I don't trust
journal_async_commit, but that doesn't mean this isn't a regression.

Thanks for the answers... onward. :)

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/23/12 3:57 PM, Nix wrote:
> [Bruce, Trond, I fear it may be hard for me to continue chasing this NFS
> lockd crash as long as ext4 on 3.6.3 is hosing my filesystems like
> this. Apologies.]

<big snip>

> The only unusual thing about the filesystems on this machine are that
> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
> 'nobarrier': the full set of options for all my ext4 filesystems are:
>
> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

Out of curiosity, when I test log replay with the journal_checksum
option, I almost always get something like:

[  999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt.
[  999.923904] EXT4-fs (dm-1): error loading journal

after a simulated crash & log replay.

Do you see anything like that in your logs?

<big snip>

Thanks,
-Eric
On 26 Oct 2012, Eric Sandeen outgrape:

> On 10/23/12 3:57 PM, Nix wrote:
>> The only unusual thing about the filesystems on this machine are that
>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
>> 'nobarrier': the full set of options for all my ext4 filesystems are:
>>
>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>
> Out of curiosity, when I test log replay with the journal_checksum
> option, I almost always get something like:
>
> [  999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt.
> [  999.923904] EXT4-fs (dm-1): error loading journal
>
> after a simulated crash & log replay.
>
> Do you see anything like that in your logs?

I'm not seeing any corrupt journals or abort messages at all. The
journal claims to be fine, but plainly isn't.

I can reproduce this on a small filesystem and stick the image somewhere
if that would be of any use to anyone. (If I'm very lucky, merely making
this offer will make the problem go away. :} )
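[The kind of log check being discussed here can be sketched as a grep over captured kernel output. This is an illustrative, self-contained sketch only: the here-document stands in for real `dmesg`/syslog data, the first two sample lines are the ones Eric quoted, and the third (non-matching) line is invented for contrast.]

```shell
# Look for the JBD2/ext4 journal-corruption messages Eric described.
# In practice you would feed this from `dmesg` or your kernel log file
# (path varies by distro) rather than a here-document.
grep -E 'JBD2:.*corrupt|error loading journal' <<'EOF'
[  999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt.
[  999.923904] EXT4-fs (dm-1): error loading journal
[ 1000.000000] EXT4-fs (dm-1): mounted filesystem with ordered data mode
EOF
```

Against a real log, zero matches after a crash-and-replay cycle is itself informative: it is exactly the "journal claims to be fine, but plainly isn't" situation Nix reports.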
On Fri, Oct 26, 2012 at 09:37:08PM +0100, Nix wrote:
>
> I can reproduce this on a small filesystem and stick the image somewhere
> if that would be of any use to anyone. (If I'm very lucky, merely making
> this offer will make the problem go away. :} )

I'm not sure the image is going to be that useful. What we really need
to do is to get a reliable reproduction of what _you_ are seeing.

It's clear from Eric's experiments that journal_checksum is dangerous.
In fact, I will likely put it under an #ifdef EXT4_EXPERIMENTAL to try
to discourage people from using it in the future. There are things I've
been planning on doing to make it be safer, but there's a very good
*reason* that both journal_checksum and journal_async_commit are not on
by default.

That's why one of the things I asked you to do when you had time was to
see if you could reproduce the problem you are seeing w/o
nobarrier,journal_checksum,journal_async_commit.

The other experiment that would be really useful if you could do is to
try to apply these two patches which I sent earlier this week:

[PATCH 1/2] ext4: revert "jbd2: don't write superblock when if its empty
[PATCH 2/2] ext4: fix I/O error when unmounting an ro file system

... and see if they make a difference. If they don't make a difference,
I don't want to apply patches just for placebo/PR reasons. And for Eric
at least, he can reproduce the journal checksum error followed by fairly
significant corruption reported by e2fsck with journal_checksum, and the
presence or absence of these patches makes no difference for him. So I
really don't want to push these patches to Linus until I get
confirmation that they make a difference to *somebody*.

Regards,

- Ted
On 26 Oct 2012, Theodore Ts'o stated:

> On Fri, Oct 26, 2012 at 09:37:08PM +0100, Nix wrote:
>>
>> I can reproduce this on a small filesystem and stick the image somewhere
>> if that would be of any use to anyone. (If I'm very lucky, merely making
>> this offer will make the problem go away. :} )
>
> I'm not sure the image is going to be that useful. What we really
> need to do is to get a reliable reproduction of what _you_ are seeing.
>
> It's clear from Eric's experiments that journal_checksum is dangerous.
>
> That's why one of the things I asked you to do when you had time was
> to see if you could reproduce the problem you are seeing w/o
> nobarrier,journal_checksum,journal_async_commit.

OK. Will do tomorrow.

> The other experiment that would be really useful if you could do is to
> try to apply these two patches which I sent earlier this week:
>
> [PATCH 1/2] ext4: revert "jbd2: don't write superblock when if its empty
> [PATCH 2/2] ext4: fix I/O error when unmounting an ro file system
>
> ... and see if they make a difference.

As of tomorrow I'll be able to reboot without causing a riot: I'll test
it then. (Sorry for the delay :( )

> So I really don't want to push these patches to Linus until I get
> confirmation that they make a difference to *somebody*.

Agreed.

This isn't the first time that journal_checksum has proven problematic.
It's a shame that we're stuck between two error-inducing stools here...
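[For reference, the reduced configuration Ted asks for amounts to dropping the three suspect options from Nix's mount line. The fragment below is a hypothetical illustration, not Nix's actual configuration: the device and mount point are placeholders, and the duplicated usrquota/grpquota entries from the /proc/mounts output are collapsed.]

```
# Hypothetical /etc/fstab entry for the test: the same options as before,
# minus nobarrier, journal_checksum and journal_async_commit.
/dev/mapper/vg-var  /var  ext4  rw,nosuid,nodev,relatime,quota,usrquota,grpquota,commit=30,stripe=16,data=ordered  0  2
```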
> This isn't the first time that journal_checksum has proven problematic.
> It's a shame that we're stuck between two error-inducing stools here...

The problem is that it currently bails out by aborting the entire
journal replay, and the file system will get left in a mess when it does
that. It's actually safer today to just be blissfully ignorant of a
corrupted block in the journal, than to have the journal getting aborted
mid-replay when we detect a corrupted commit.

The plan is that eventually, we will have checksums on a per-journalled
block basis, instead of a per-commit basis, and when we get a failed
checksum, we skip the replay of that block, but we keep going and replay
all of the other blocks and commits. We'll then set the "file system
corrupted" bit and force an e2fsck check.

The problem is this code isn't done yet, and journal_checksum is really
not ready for prime time. When it is ready, my plan is to wire it up so
it is enabled by default; at the moment, it was intended for developer
experimentation only. As I said, it's my fault for not clearly labelling
it "Not for you!", or putting it under an #ifdef to prevent unwary
civilians from coming across the feature and saying, "oooh, shiny!" and
turning it on. :-(

- Ted
On 26 Oct 2012, Theodore Ts'o uttered the following:

> The plan is that eventually, we will have checksums on a
> per-journalled block basis, instead of a per-commit basis, and when we
> get a failed checksum, we skip the replay of that block,

But not of everything it implies, since that's quite tricky to track
down (it's basically the same work needed for softupdates, but in
reverse). Hence the e2fsck check, I suppose.

> prevent unwary civilians from coming across the feature and saying,
> "oooh, shiny!" and turning it on. :-(

Or having it turned on by default either, which seems to be the case
now.
On Fri, Oct 26, 2012 at 10:19:21PM +0100, Nix wrote:
>> prevent unwary civilians from coming across the feature and saying,
>> "oooh, shiny!" and turning it on. :-(
>
> Or having it turned on by default either, which seems to be the case
> now.

Huh? It's not turned on by default. If you mount with no mount options,
journal checksums are *not* turned on.

- Ted
Theodore Ts'o wrote:
> The problem is this code isn't done yet, and journal_checksum is
> really not ready for prime time. When it is ready, my plan is to wire
> it up so it is enabled by default; at the moment, it was intended for
> developer experimentation only. As I said, it's my fault for not
> clearly labelling it "Not for you!", or putting it under an #ifdef to
> prevent unwary civilians from coming across the feature and saying,
> "oooh, shiny!" and turning it on. :-(

Perhaps a word or two in the mount man page would be appropriate?
diff --git a/fs/dcache.c b/fs/dcache.c
index 8086636..649a112 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2404,6 +2404,10 @@ out_unalias:
 	if (likely(!d_mountpoint(alias))) {
 		__d_move(alias, dentry);
 		ret = alias;
+	} else {
+		printk(KERN_WARNING "VFS: __d_move()ing a d_mountpoint(), uh oh\n");
+		__d_move(alias, dentry);
+		ret = alias;
 	}
 out_err:
 	spin_unlock(&inode->i_lock);