Message ID | 20150507125241.4da739ac@gandalf.local.home (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, May 07, 2015 at 12:52:41PM -0400, Steven Rostedt wrote: > > Commit 698934df8b45 "VFS: Combine inode checks with d_is_negative() and > d_is_positive() in pathwalk" removed a check for inode being NULL in > walk_component() where the type is tested. Stressing my tracefs create > and remove instances while reading the files now triggers this: So you get NULL ->d_inode with stale flags? The thing is, ->d_inode becoming NULL should happen via d_delete(), which goes throug this: unsigned flags = READ_ONCE(dentry->d_flags); flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); WRITE_ONCE(dentry->d_flags, flags); smp_wmb(); dentry->d_inode = NULL; and after that assignment to ->d_flags you'll see d_is_negative() being true. OTOH, we have *inode = dentry->d_inode; if (read_seqcount_retry(&dentry->d_seq, seq)) in lookup_fast(), and read_seqcount_retry() is { smp_rmb(); return __read_seqcount_retry(s, start); } IOW, we have smp_rmb() between fetching ->d_inode and checking ->d_flags. If you can reproduce that at will, could you make it dump nd->flags along with dentry involved? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 7 May 2015 18:28:34 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote: > On Thu, May 07, 2015 at 12:52:41PM -0400, Steven Rostedt wrote: > > > > Commit 698934df8b45 "VFS: Combine inode checks with d_is_negative() and > > d_is_positive() in pathwalk" removed a check for inode being NULL in > > walk_component() where the type is tested. Stressing my tracefs create > > and remove instances while reading the files now triggers this: > > So you get NULL ->d_inode with stale flags? The thing is, ->d_inode > becoming NULL should happen via d_delete(), which goes throug this: But it's not the delete, it's the creation of a new d_entry. Pehaps it gets linked premature? The tracing I had happened in tracefs_create_file(). > unsigned flags = READ_ONCE(dentry->d_flags); > > flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); > WRITE_ONCE(dentry->d_flags, flags); > smp_wmb(); > dentry->d_inode = NULL; > > and after that assignment to ->d_flags you'll see d_is_negative() being > true. OTOH, we have > *inode = dentry->d_inode; > if (read_seqcount_retry(&dentry->d_seq, seq)) > in lookup_fast(), and read_seqcount_retry() is > { > smp_rmb(); > return __read_seqcount_retry(s, start); > } > > IOW, we have smp_rmb() between fetching ->d_inode and checking ->d_flags. > > If you can reproduce that at will, could you make it dump nd->flags along with > dentry involved? I had them printed in my previous traces. The flags were 0x200088, and they were 0 just before the call. ftrace-test-mki-3201 [000] 85.306761: bprint: walk_component: inode=(nil) dentry=0xee3c0240 neg:1 ftrace-test-mki-3201 [000] 85.306761: bprint: walk_component: inode=(nil) dentry=0xee3c0240 neg:0 flags:200088 That was with this: trace_printk("inode=%p dentry=%p neg:%d\n", inode, path->dentry, d_is_negative(path->dentry)); err = -ENOENT; if (d_is_negative(path->dentry)) goto out_path_put; trace_printk("inode=%p dentry=%p neg:%d flags:%lx\n", inode, path->dentry, d_is_negative(path->dentry), (long)path->dentry->d_flags); -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 7 May 2015 13:39:35 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > On Thu, 7 May 2015 18:28:34 +0100 > Al Viro <viro@ZenIV.linux.org.uk> wrote: > > > On Thu, May 07, 2015 at 12:52:41PM -0400, Steven Rostedt wrote: > > > > > > Commit 698934df8b45 "VFS: Combine inode checks with d_is_negative() and > > > d_is_positive() in pathwalk" removed a check for inode being NULL in > > > walk_component() where the type is tested. Stressing my tracefs create > > > and remove instances while reading the files now triggers this: > > > > So you get NULL ->d_inode with stale flags? The thing is, ->d_inode > > becoming NULL should happen via d_delete(), which goes throug this: > > But it's not the delete, it's the creation of a new d_entry. Pehaps it > gets linked premature? The tracing I had happened in > tracefs_create_file(). > Note, this could be because of my special hack to have mkdir create files too. We never came up with a clean solution for that. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 07, 2015 at 01:39:35PM -0400, Steven Rostedt wrote: > I had them printed in my previous traces. The flags were 0x200088, and > they were 0 just before the call. Not dentry->d_flags, nd->flags. Most interesting part is bit 6 in those (LOOKUP_RCU, 0x40). As for creation... I think I see what might be going on: A: finds a negative dentry, picks NULL ->d_inode from it and whatever ->d_seq it had. B: d_instantiate(): sets ->d_inode non-NULL, ->d_flags accordingly and bumps ->d_seq. A: fetches ->d_flags, sees non-negative, assumes ->d_inode is non-NULL. In reality, the last assumption should've been "->d_inode is non-NULL or we have a stale ->d_seq and will end up discarding that fscker anyway". Hmm... Smells like we ought to a) in lookup_fast() turn if (read_seqcount_retry(&dentry->d_seq, seq)) return -ECHILD; into if (unlikely(d_is_negative(dentry))) { if (read_seqcount_retry(&dentry->d_seq, seq)) return -ECHILD; else return -ENOENT; } if (read_seqcount_retry(&dentry->d_seq, seq)) return -ECHILD; and if (likely(!err)) *inode = path->dentry->d_inode; into if (likely(!err)) { *inode = path->dentry->d_inode; if (unlikely(d_is_negative(dentry))) { path_to_nameidata(path, nd); err = -ENOENT; } } b) in walk_component() and do_last():finish_lookup move the d_is_negative() checks a bit up - into the body of preceding if () in the former and just prior to the finish_lookup: in the latter. AFAICS, the rest of d_is_negative() in fs/namei.c doesn't suffer that kind of problem... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 7 May 2015 19:13:35 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote: > On Thu, May 07, 2015 at 01:39:35PM -0400, Steven Rostedt wrote: > > I had them printed in my previous traces. The flags were 0x200088, and > > they were 0 just before the call. > > Not dentry->d_flags, nd->flags. Most interesting part is bit 6 in those > (LOOKUP_RCU, 0x40). nd->flags == 0x51 -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/namei.c b/fs/namei.c index 4a8d998b7274..cdd066680de9 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1585,7 +1585,7 @@ static inline int walk_component(struct nameidata *nd, struct path *path, inode = path->dentry->d_inode; } err = -ENOENT; - if (d_is_negative(path->dentry)) + if (!inode || d_is_negative(path->dentry)) goto out_path_put; if (should_follow_link(path->dentry, follow)) {
Commit 698934df8b45 "VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk" removed a check for inode being NULL in walk_component() where the type is tested. Stressing my tracefs create and remove instances while reading the files now triggers this: BUG: unable to handle kernel NULL pointer dereference at 0000001c IP: [<c05383a6>] inode_permission+0x2d/0x6c *pdpt = 0000000030d9a001 *pde = 0000000000000000 Oops: 0000 [#1] SMP Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 microcode ppdev parport_pc parport r8169 CPU: 0 PID: 3201 Comm: ftrace-test-mki Not tainted 4.1.0-rc2-test+ #94 Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014 task: efa2ac20 ti: ed7a8000 task.ti: ed7a8000 EIP: 0060:[<c05383a6>] EFLAGS: 00010282 CPU: 0 EIP is at inode_permission+0x2d/0x6c EAX: 00000001 EBX: 00000000 ECX: 00000006 EDX: 00000081 ESI: 00000000 EDI: ef92201b EBP: ed7a9e0c ESP: ed7a9df8 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 80050033 CR2: 0000001c CR3: 31cb4c20 CR4: 001407f0 Stack: c0538379 c101c228 00000000 00000081 ed7a9ef8 ed7a9e64 c0538c7e c0538c33 c1012e98 00000000 ed7a9ef8 00000000 00000006 efa2ac20 efa2ac20 e1919d01 00000000 c04767e1 ed7a9e64 c05374fb f1392210 ee3c0240 00000000 c05391b5 Call Trace: [<c0538379>] ? __inode_permission+0x91/0x91 [<c0538c7e>] link_path_walk+0x7a/0x3db [<c0538c33>] ? link_path_walk+0x2f/0x3db [<c04767e1>] ? trace_hardirqs_on+0xb/0xd [<c05374fb>] ? read_seqcount_begin+0x6a/0x77 [<c05391b5>] ? path_init+0x1d6/0x326 [<c05392f9>] path_init+0x31a/0x326 [<c0538fdf>] ? link_path_walk+0x3db/0x3db [<c05312a3>] ? get_empty_filp+0x128/0x190 [<c053ad56>] path_openat+0x1a3/0x3da [<c0408c1a>] ? native_sched_clock+0x46/0x4b [<c053bcdf>] do_filp_open+0x2e/0x6f [<c052f591>] do_sys_open+0x7c/0x108 [<c052f558>] ? do_sys_open+0x43/0x108 [<c0cc4f87>] ? sysenter_exit+0xf/0x16 [<c052f63d>] SyS_open+0x20/0x22 [<c0cc4f58>] sysenter_do_call+0x12/0x12 Code: e5 53 83 ec 10 3e 8d 74 26 00 89 c3 89 44 24 08 a1 d8 7b 25 c1 89 55 f8 c7 04 24 79 83 53 c0 89 44 24 04 e8 0e b0 f9 ff 8b 55 f8 <8b> 4b 1c f6 c2 02 74 2e f6 41 30 01 66 8b 03 74 25 25 00 f0 00 EIP: [<c05383a6>] inode_permission+0x2d/0x6c SS:ESP 0068:ed7a9df8 CR2: 000000000000001c ---[ end trace 54b6a3ccfbef84c6 ]--- Adding a bunch of debug I found that the race is the following: CPU1 CPU2 ---- ---- mkdir(foo) d_instantiate(dentry, inode); spin_lock(inode->i_lock); spin_lock(dentry->d_lock); __d_set_inode_and_type(); link_path_walk() walk_component() lookup_fast(nd, path, &inode); *inode = path->d_entry->d_inode; (inode == NULL) dentry->d_inode = inode; smp_wmb(); dentry->d_flags = flags; if (d_is_negative(path->d_entry)) [ fails ] [...] nd->inode = inode; (where inode is NULL); Then in the next loop of link_path_walk() err = may_lookup(nd); inode_permission(nd->inode...) reference to nd->inode->i_sb (BOOM!) Without this patch I can easily cause the bug, with this patch, I have yet to trigger it. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> --- fs/namei.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)