Message ID | 20240806144628.874350-1-mjguzik@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | vfs: avoid spurious dentry ref/unref cycle on open | expand |
On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote: > The flag thing is optional and can be dropped, but I think the general > direction should be to add *more* asserts and whatnot (even if they are > to land separately). A debug-only variant would not hurt. Asserts do *not* clarify anything; if you want your optional flag, come up with clear description of its semantics. In terms of state, _not_ control flow. > @@ -3683,6 +3685,7 @@ static const char *open_last_lookups(struct nameidata *nd, > static int do_open(struct nameidata *nd, > struct file *file, const struct open_flags *op) > { > + struct vfsmount *mnt; > struct mnt_idmap *idmap; > int open_flag = op->open_flag; > bool do_truncate; > @@ -3720,11 +3723,22 @@ static int do_open(struct nameidata *nd, > error = mnt_want_write(nd->path.mnt); > if (error) > return error; > + /* > + * We grab an additional reference here because vfs_open_consume() > + * may error out and free the mount from under us, while we need > + * to undo write access below. > + */ > + mnt = mntget(nd->path.mnt); It's "after vfs_open_consume() we no longer own the reference in nd->path.mnt", error or no error... > do_truncate = true; > } > error = may_open(idmap, &nd->path, acc_mode, open_flag); > - if (!error && !(file->f_mode & FMODE_OPENED)) > - error = vfs_open(&nd->path, file); > + if (!error && !(file->f_mode & FMODE_OPENED)) { > + BUG_ON(nd->state & ND_PATH_CONSUMED); > + error = vfs_open_consume(&nd->path, file); > + nd->state |= ND_PATH_CONSUMED; > + nd->path.mnt = NULL; > + nd->path.dentry = NULL; Umm... The thing that feels wrong here is that you get an extra asymmetry with ->atomic_open() ;-/ We obviously can't do that kind of thing there (if nothing else, we have the parent directory's inode to unlock, error or no error). I don't hate that patch, but it really feels like the calling conventions are not right. Let me try to tweak it a bit and see if anything falls out...
On Tue, Aug 6, 2024 at 5:53 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote: > > > The flag thing is optional and can be dropped, but I think the general > > direction should be to add *more* asserts and whatnot (even if they are > > to land separately). A debug-only variant would not hurt. > > Asserts do *not* clarify anything; if you want your optional flag, > come up with clear description of its semantics. In terms of state, > _not_ control flow. > It is supposed to indicate that both nd->path.mnt and nd->path.dentry are no longer usable and must not even be looked at. Ideally code which *does* look at them despite the flag (== there is a bug) traps. However, I did not find a handy macro or anything of the sort to "poison" these pointers. Instead I found tons of NULL checks all over, including in lookup clean up. So as is I agree the flag adds next to nothing as is, but the intent was to catch any use of the above pointers after what they point to got consumed. Perhaps I should just drop the flag for the time being and only propose it with a more fleshed out scheme later. > > @@ -3683,6 +3685,7 @@ static const char *open_last_lookups(struct nameidata *nd, > > static int do_open(struct nameidata *nd, > > struct file *file, const struct open_flags *op) > > { > > + struct vfsmount *mnt; > > struct mnt_idmap *idmap; > > int open_flag = op->open_flag; > > bool do_truncate; > > @@ -3720,11 +3723,22 @@ static int do_open(struct nameidata *nd, > > error = mnt_want_write(nd->path.mnt); > > if (error) > > return error; > > + /* > > + * We grab an additional reference here because vfs_open_consume() > > + * may error out and free the mount from under us, while we need > > + * to undo write access below. > > + */ > > + mnt = mntget(nd->path.mnt); > > It's "after vfs_open_consume() we no longer own the reference in nd->path.mnt", > error or no error... > ok > > do_truncate = true; > > > > } > > error = may_open(idmap, &nd->path, acc_mode, open_flag); > > - if (!error && !(file->f_mode & FMODE_OPENED)) > > - error = vfs_open(&nd->path, file); > > + if (!error && !(file->f_mode & FMODE_OPENED)) { > > + BUG_ON(nd->state & ND_PATH_CONSUMED); > > + error = vfs_open_consume(&nd->path, file); > > + nd->state |= ND_PATH_CONSUMED; > > + nd->path.mnt = NULL; > > + nd->path.dentry = NULL; > > Umm... The thing that feels wrong here is that you get an extra > asymmetry with ->atomic_open() ;-/ We obviously can't do that > kind of thing there (if nothing else, we have the parent directory's > inode to unlock, error or no error). > > I don't hate that patch, but it really feels like the calling > conventions are not right. Let me try to tweak it a bit and > see if anything falls out... Should you write your own patch I'm happy to give it a spin to validate the win is about the same. I forgot to note some stuff when sending the patch, so here it is: perf is highly unstable between re-runs of the benchmark because struct inode itself is not aligned to anything and at least ext4 plops it in in a rather arbitrary place and uses a kmem cache without any magic alignment guarantees. This results in false sharing showing up and disappearing depending on how (un)lucky one gets. For reported results I picked the worst case. I had to patch one sucker for constantly getting in: https://lore.kernel.org/linux-security-module/20240806133607.869394-1-mjguzik@gmail.com/T/#u
On Tue, Aug 6, 2024 at 6:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Tue, Aug 6, 2024 at 5:53 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote: > > > > > The flag thing is optional and can be dropped, but I think the general > > > direction should be to add *more* asserts and whatnot (even if they are > > > to land separately). A debug-only variant would not hurt. > > > > Asserts do *not* clarify anything; if you want your optional flag, > > come up with clear description of its semantics. In terms of state, > > _not_ control flow. > > > > It is supposed to indicate that both nd->path.mnt and nd->path.dentry > are no longer usable and must not even be looked at. Ideally code > which *does* look at them despite the flag (== there is a bug) traps. > > However, I did not find a handy macro or anything of the sort to > "poison" these pointers. Instead I found tons of NULL checks all over, > including in lookup clean up. > > So as is I agree the flag adds next to nothing as is, but the intent > was to catch any use of the above pointers after what they point to > got consumed. Perhaps I should just drop the flag for the time being > and only propose it with a more fleshed out scheme later. > > > > @@ -3683,6 +3685,7 @@ static const char *open_last_lookups(struct nameidata *nd, > > > static int do_open(struct nameidata *nd, > > > struct file *file, const struct open_flags *op) > > > { > > > + struct vfsmount *mnt; > > > struct mnt_idmap *idmap; > > > int open_flag = op->open_flag; > > > bool do_truncate; > > > @@ -3720,11 +3723,22 @@ static int do_open(struct nameidata *nd, > > > error = mnt_want_write(nd->path.mnt); > > > if (error) > > > return error; > > > + /* > > > + * We grab an additional reference here because vfs_open_consume() > > > + * may error out and free the mount from under us, while we need > > > + * to undo write access below. > > > + */ > > > + mnt = mntget(nd->path.mnt); > > > > It's "after vfs_open_consume() we no longer own the reference in nd->path.mnt", > > error or no error... > > > > ok > > > > do_truncate = true; > > > > > > > } > > > error = may_open(idmap, &nd->path, acc_mode, open_flag); > > > - if (!error && !(file->f_mode & FMODE_OPENED)) > > > - error = vfs_open(&nd->path, file); > > > + if (!error && !(file->f_mode & FMODE_OPENED)) { > > > + BUG_ON(nd->state & ND_PATH_CONSUMED); > > > + error = vfs_open_consume(&nd->path, file); > > > + nd->state |= ND_PATH_CONSUMED; > > > + nd->path.mnt = NULL; > > > + nd->path.dentry = NULL; > > > > Umm... The thing that feels wrong here is that you get an extra > > asymmetry with ->atomic_open() ;-/ We obviously can't do that > > kind of thing there (if nothing else, we have the parent directory's > > inode to unlock, error or no error). > > > > I don't hate that patch, but it really feels like the calling > > conventions are not right. Let me try to tweak it a bit and > > see if anything falls out... > fwiw if you are looking to have vfs_open_${keyword} which accepts nameidata and "consumes" it, I did not go for it because vfs_open is in another file and I did not want to "leak" the struct there. > Should you write your own patch I'm happy to give it a spin to > validate the win is about the same. > > I forgot to note some stuff when sending the patch, so here it is: > perf is highly unstable between re-runs of the benchmark because > struct inode itself is not aligned to anything and at least ext4 plops > it in in a rather arbitrary place and uses a kmem cache without any > magic alignment guarantees. This results in false sharing showing up > and disappearing depending on how (un)lucky one gets. For reported > results I picked the worst case. > > I had to patch one sucker for constantly getting in: > https://lore.kernel.org/linux-security-module/20240806133607.869394-1-mjguzik@gmail.com/T/#u > > -- > Mateusz Guzik <mjguzik gmail.com>
On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote: > error = may_open(idmap, &nd->path, acc_mode, open_flag); > - if (!error && !(file->f_mode & FMODE_OPENED)) > - error = vfs_open(&nd->path, file); > + if (!error && !(file->f_mode & FMODE_OPENED)) { > + BUG_ON(nd->state & ND_PATH_CONSUMED); Please don't litter new code with random BUG_ON() checks. If this every happens, it will panic a production kernel and the fix will generate a CVE. Given that these checks should never fire in a production kernel unless something is corrupting memory (i.e. the end is already near), these should be considered debug assertions and we should treat them that way from the start. i.e. we really should have a VFS_ASSERT() or VFS_BUG_ON() (following the VM_BUG_ON() pattern) masked by a CONFIG_VFS_DEBUG option so they are only included into debug builds where there is a developer watching to debug the system when one of these things fires. This is a common pattern for subsystem specific assertions. We do this in all the major filesystems, the MM subsystem does this (VM_BUG_ON), etc. Perhaps it is time to do this in the VFS code as well.... -Dave.
On Wed, Aug 7, 2024 at 12:51 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote: > > error = may_open(idmap, &nd->path, acc_mode, open_flag); > > - if (!error && !(file->f_mode & FMODE_OPENED)) > > - error = vfs_open(&nd->path, file); > > + if (!error && !(file->f_mode & FMODE_OPENED)) { > > + BUG_ON(nd->state & ND_PATH_CONSUMED); > > Please don't litter new code with random BUG_ON() checks. If this > every happens, it will panic a production kernel and the fix will > generate a CVE. > > Given that these checks should never fire in a production kernel > unless something is corrupting memory (i.e. the end is already > near), these should be considered debug assertions and we should > treat them that way from the start. > > i.e. we really should have a VFS_ASSERT() or VFS_BUG_ON() (following > the VM_BUG_ON() pattern) masked by a CONFIG_VFS_DEBUG option so they > are only included into debug builds where there is a developer > watching to debug the system when one of these things fires. > > This is a common pattern for subsystem specific assertions. We do > this in all the major filesystems, the MM subsystem does this > (VM_BUG_ON), etc. Perhaps it is time to do this in the VFS code as > well.... I agree, I have this at the bottom of my todo list. The only reason I BUG_ON'ed here is because proper debug macros are not present. fwiw v2 does not have any of this, so...
On Wed, Aug 07, 2024 at 12:55:07AM +0200, Mateusz Guzik wrote: > On Wed, Aug 7, 2024 at 12:51 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote: > > > error = may_open(idmap, &nd->path, acc_mode, open_flag); > > > - if (!error && !(file->f_mode & FMODE_OPENED)) > > > - error = vfs_open(&nd->path, file); > > > + if (!error && !(file->f_mode & FMODE_OPENED)) { > > > + BUG_ON(nd->state & ND_PATH_CONSUMED); > > > > Please don't litter new code with random BUG_ON() checks. If this > > every happens, it will panic a production kernel and the fix will > > generate a CVE. > > > > Given that these checks should never fire in a production kernel > > unless something is corrupting memory (i.e. the end is already > > near), these should be considered debug assertions and we should > > treat them that way from the start. > > > > i.e. we really should have a VFS_ASSERT() or VFS_BUG_ON() (following > > the VM_BUG_ON() pattern) masked by a CONFIG_VFS_DEBUG option so they > > are only included into debug builds where there is a developer > > watching to debug the system when one of these things fires. > > > > This is a common pattern for subsystem specific assertions. We do > > this in all the major filesystems, the MM subsystem does this > > (VM_BUG_ON), etc. Perhaps it is time to do this in the VFS code as > > well.... > > I agree, I have this at the bottom of my todo list. Good to know. > The only reason I BUG_ON'ed here is because proper debug macros are not present. > > fwiw v2 does not have any of this, so... Yeah, I didn't notice there was a v2 patch before I wrote this. -Dave.
On Tue, Aug 06, 2024 at 06:09:43PM +0200, Mateusz Guzik wrote: > It is supposed to indicate that both nd->path.mnt and nd->path.dentry > are no longer usable and must not even be looked at. Ideally code > which *does* look at them despite the flag (== there is a bug) traps. > > However, I did not find a handy macro or anything of the sort to > "poison" these pointers. Instead I found tons of NULL checks all over, > including in lookup clean up. Unless I'm misreading you, those existing NULLs have nothing to do with poisoning of any sort. Or any kind of defensive programming, while we are at it. Those are about the cleanups on failed transition from lazy mode; if we have already legitimized some of the references (i.e. bumped the refcounts there) by the time we'd run into a stale one, we need to drop the ones we'd grabbed on the way out. And the easiest way to do that is to leave that until terminate_walk(), when we'll be out of RCU mode. The references that were *NOT* grabbed obviously should be left alone rather than dropped. Which is where those NULL assignments come from.
On Wed, Aug 7, 2024 at 5:38 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Tue, Aug 06, 2024 at 06:09:43PM +0200, Mateusz Guzik wrote: > > > It is supposed to indicate that both nd->path.mnt and nd->path.dentry > > are no longer usable and must not even be looked at. Ideally code > > which *does* look at them despite the flag (== there is a bug) traps. > > > > However, I did not find a handy macro or anything of the sort to > > "poison" these pointers. Instead I found tons of NULL checks all over, > > including in lookup clean up. > > Unless I'm misreading you, those existing NULLs have nothing to do with > poisoning of any sort. Or any kind of defensive programming, while we are > at it. Those are about the cleanups on failed transition from lazy mode; > if we have already legitimized some of the references (i.e. bumped the > refcounts there) by the time we'd run into a stale one, we need to drop > the ones we'd grabbed on the way out. And the easiest way to do that > is to leave that until terminate_walk(), when we'll be out of RCU mode. > The references that were *NOT* grabbed obviously should be left alone > rather than dropped. Which is where those NULL assignments come from. Yes, this is my understanding of the code and part of my compliant. :) Things just work(tm) as is with NULLified pointers, but this is error-prone. I was looking for an equivalent of the following feature from $elsewhere: /* * Trap accesses going through a pointer. Moreover if kasan is available trap * reading the pointer itself. * * Sample usage: you have a struct with numerous fields and by API contract * only some of them get populated, even if the implementation temporary writes * to them. You can use DEBUG_POISON_POINTER so that the consumer which should * no be looking at the field gets caught. * * DEBUG_POISON_POINTER(obj->ptr); * .... * if (obj->ptr != NULL) // traps with kasan, does not trap otherwise * .... * if (obj->ptr->field) // traps with and without kasan */ extern caddr_t poisoned_buf; #define DEBUG_POISON_POINTER_VALUE poisoned_buf #define DEBUG_POISON_POINTER(x) ({ \ x = (void *)(DEBUG_POISON_POINTER_VALUE); \ kasan_mark(&x, 0, sizeof(x), KASAN_GENERIC_REDZONE); \ }) As a hypothetical suppose there is code executing some time after vfs_open which looks at nd->path.dentry and by finding the pointer is NULL it concludes the lookup did not work out. If such code exists *and* the pointer is poisoned in the above sense (notably merely branching on it with kasan already traps), then the consumer will be caught immediately during coverage testing by syzkaller. If such code exists but the pointer is only nullified, one is only going to find out the hard way when some functionality weirdly breaks. Anyhow, this is really beyond the scope of the patch and I should not have done the half-assed thing abandoned mid-effort. I'm going to get back to this later(tm). See the v2 which just gets to the point concerning eliding the extra ref trip.
On Wed, Aug 07, 2024 at 05:57:07AM +0200, Mateusz Guzik wrote: [there'll be a separate reply with what I hope might be a usable approach] > Yes, this is my understanding of the code and part of my compliant. :) > > Things just work(tm) as is with NULLified pointers, but this is error-prone. And carrying the arseloads of information (which ones do and which do not need to be dropped) is *less* error-prone? Are you serious? > As a hypothetical suppose there is code executing some time after > vfs_open which looks at nd->path.dentry and by finding the pointer is > NULL it concludes the lookup did not work out. > > If such code exists *and* the pointer is poisoned in the above sense > (notably merely branching on it with kasan already traps), then the > consumer will be caught immediately during coverage testing by > syzkaller. You are much too optimistic about the quality of test coverage in this particular area. > If such code exists but the pointer is only nullified, one is only > going to find out the hard way when some functionality weirdly breaks. To do _useful_ asserts, one needs invariants to check. And "we got to this check after having passed through that assignment at some earlier point" is not it. That's why I'm asking questions about the state. The thing is, suppose I (or you, or somebody else) is trying to modify the whole thing. There's a magical mystery assert in the way; what should be done with it? Move it/split it/remove it/do something random and hope syzkaller won't catch anything? If I can reason about the predicate being checked, I can at least start figuring out what should be done. If not, it's bloody guaranteed to rot. This particular area (pathwalk machinery) has a nasty history of growing complexity once in a while, with following cleanups and massage to get it back into more or less tolerable shape. And refactoring that had been _painful_ - I'd done more than a few there. As far as I can tell, at the moment this flag (and yes, I've seen its removal in the next version) is "we'd called vfs_open_consume() at some point, then found ourselves still in RCU mode or we'd called vfs_open_consume() more than once". This is *NOT* a property of state; it's a property of execution history. The first part is checked in the wrong place - one of the invariants (trivially verified by code examination) is that LOOKUP_RCU is never regained after it had been dropped. The only place where it can be set is path_init() and calling _that_ between path_init() and terminate_walk() would be a) a hard and very visible bug b) would've wiped your flag anyway. So that part of the check is basically "we are not calling vfs_open_consume() under rcu_read_lock()". Which is definitely a desirable property, since ->open() can block. So can mnt_want_write() several lines prior. Invariant here is "the places where we set FMODE_OPENED or FMODE_CREATED may not have LOOKUP_RCU". Having if (!(file->f_mode & (FMODE_OPENED | FMODE_CREATED))) { error = complete_walk(nd); if (error) return error; } in the beginning of do_open() guarantees that for vfs_open() call there. All other places where that can happen are in lookup_open() or called from it (via ->atomic_open() to finish_open()). And *that* definitely should not be done in RCU mode, due to if (open_flag & O_CREAT) inode_lock(dir->d_inode); else inode_lock_shared(dir->d_inode); dentry = lookup_open(nd, file, op, got_write); in the sole caller of that thing. Again, can't grab a blocking lock under rcu_read_lock(). Which is why we have this if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU)) return ERR_PTR(-ECHILD); } else { /* create side of things */ if (nd->flags & LOOKUP_RCU) { if (!try_to_unlazy(nd)) return ERR_PTR(-ECHILD); } slightly prior to that call. WARN_ON_ONCE is basically "lookup_fast() has returned NULL and stayed in RCU mode", which should never happen. try_to_unlazy() is straight "either switch to non-RCU mode or return an error" - that's what this function is for. No WARN_ON after that - it would only obfuscate things. *IF* you want to add debugging checks for that kind of stuff, just call that assert_nonrcu(nd), make it check and whine and feel free to slap them in reasonable amount of places (anything that makes a reader go "for fuck sake, hadn't we (a) done that on the entry to this function and (b) done IO since then, anyway?" is obviously not reasonable, etc. - no more than common sense limitations). Another common sense thing: extra asserts won't confuse syzkaller, but they very much can confuse a human reader. And any rewrites are done by humans... As for the double call of vfs_open_consume()... You do realize that the damn thing wouldn't have reached that check if it would ever have cause to be triggered, right? Seeing that we call static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt) { /* Pairs with smp_store_release() in do_idmap_mount(). */ return smp_load_acquire(&mnt->mnt_idmap); } near the beginning of do_open(), ~20 lines before the place where you added that check... I'm not sure it makes sense to defend against a weird loop appearing out of nowhere near the top of call chain, but if you want to do that, this is not the right place for that.
On Wed, Aug 7, 2024 at 7:32 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Wed, Aug 07, 2024 at 05:57:07AM +0200, Mateusz Guzik wrote: > > [there'll be a separate reply with what I hope might be a usable > approach] > At the moment I'm trying to get the spurious lockref cycle out of the way because it is interfering with the actual thing I'm trying to do. This entire pointer poisoning business is optional and the v2 I posted does not do any of it. With this in mind can you review v2? Here is the link for reference: https://lore.kernel.org/linux-fsdevel/20240806163256.882140-1-mjguzik@gmail.com/T/#u As for the feedback below, I'm going to circle back to it when I take a stab at adding debug macros, but no ETA. > > Yes, this is my understanding of the code and part of my compliant. :) > > > > Things just work(tm) as is with NULLified pointers, but this is error-prone. > > And carrying the arseloads of information (which ones do and which do not > need to be dropped) is *less* error-prone? Are you serious? > > > As a hypothetical suppose there is code executing some time after > > vfs_open which looks at nd->path.dentry and by finding the pointer is > > NULL it concludes the lookup did not work out. > > > > If such code exists *and* the pointer is poisoned in the above sense > > (notably merely branching on it with kasan already traps), then the > > consumer will be caught immediately during coverage testing by > > syzkaller. > > You are much too optimistic about the quality of test coverage in this > particular area. > > > If such code exists but the pointer is only nullified, one is only > > going to find out the hard way when some functionality weirdly breaks. > > To do _useful_ asserts, one needs invariants to check. And "we got > to this check after having passed through that assignment at some > earlier point" is not it. That's why I'm asking questions about > the state. > > The thing is, suppose I (or you, or somebody else) is trying to modify > the whole thing. There's a magical mystery assert in the way; what > should be done with it? Move it/split it/remove it/do something > random and hope syzkaller won't catch anything? If I can reason > about the predicate being checked, I can at least start figuring out > what should be done. If not, it's bloody guaranteed to rot. > > This particular area (pathwalk machinery) has a nasty history of > growing complexity once in a while, with following cleanups and > massage to get it back into more or less tolerable shape. > And refactoring that had been _painful_ - I'd done more than > a few there. > > As far as I can tell, at the moment this flag (and yes, I've seen its > removal in the next version) is "we'd called vfs_open_consume() at > some point, then found ourselves still in RCU mode or we'd called > vfs_open_consume() more than once". > > This is *NOT* a property of state; it's a property of execution > history. The first part is checked in the wrong place - one of > the invariants (trivially verified by code examination) is that > LOOKUP_RCU is never regained after it had been dropped. The > only place where it can be set is path_init() and calling _that_ > between path_init() and terminate_walk() would be > a) a hard and very visible bug > b) would've wiped your flag anyway. > So that part of the check is basically "we are not calling > vfs_open_consume() under rcu_read_lock()". Which is definitely > a desirable property, since ->open() can block. So can > mnt_want_write() several lines prior. Invariant here is > "the places where we set FMODE_OPENED or FMODE_CREATED may > not have LOOKUP_RCU". Having > if (!(file->f_mode & (FMODE_OPENED | FMODE_CREATED))) { > error = complete_walk(nd); > if (error) > return error; > } > in the beginning of do_open() guarantees that for vfs_open() > call there. All other places where that can happen are in > lookup_open() or called from it (via ->atomic_open() to > finish_open()). And *that* definitely should not be done > in RCU mode, due to > if (open_flag & O_CREAT) > inode_lock(dir->d_inode); > else > inode_lock_shared(dir->d_inode); > dentry = lookup_open(nd, file, op, got_write); > in the sole caller of that thing. Again, can't grab a blocking > lock under rcu_read_lock(). Which is why we have this > if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU)) > return ERR_PTR(-ECHILD); > } else { > /* create side of things */ > if (nd->flags & LOOKUP_RCU) { > if (!try_to_unlazy(nd)) > return ERR_PTR(-ECHILD); > } > slightly prior to that call. WARN_ON_ONCE is basically "lookup_fast() > has returned NULL and stayed in RCU mode", which should never happen. > try_to_unlazy() is straight "either switch to non-RCU mode or return an > error" - that's what this function is for. No WARN_ON after that - it > would only obfuscate things. > > *IF* you want to add debugging checks for that kind of stuff, just call > that assert_nonrcu(nd), make it check and whine and feel free to slap > them in reasonable amount of places (anything that makes a reader go > "for fuck sake, hadn't we (a) done that on the entry to this function > and (b) done IO since then, anyway?" is obviously not reasonable, etc. - > no more than common sense limitations). > > Another common sense thing: extra asserts won't confuse syzkaller, but > they very much can confuse a human reader. And any rewrites are done > by humans... > > As for the double call of vfs_open_consume()... You do realize that > the damn thing wouldn't have reached that check if it would ever have > cause to be triggered, right? Seeing that we call > static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt) > { > /* Pairs with smp_store_release() in do_idmap_mount(). */ > return smp_load_acquire(&mnt->mnt_idmap); > } > near the beginning of do_open(), ~20 lines before the place where > you added that check... > > I'm not sure it makes sense to defend against a weird loop appearing > out of nowhere near the top of call chain, but if you want to do that, > this is not the right place for that.
After having looked at the problem, how about the following series: 1/5) lift path_get() *AND* path_put() out of do_dentry_open() into the callers. The latter - conditional upon "do_dentry_open() has not set FMODE_OPENED". Equivalent transformation. 2/5) move path_get() we'd lifted into the callers past the call of do_dentry_open(), conditionally collapse it with path_put(). You'd get e.g. int vfs_open(const struct path *path, struct file *file) { int ret; file->f_path = *path; ret = do_dentry_open(file, NULL); if (!ret) { /* * Once we return a file with FMODE_OPENED, __fput() will call * fsnotify_close(), so we need fsnotify_open() here for * symmetry. */ fsnotify_open(file); } if (file->f_mode & FMODE_OPENED) path_get(path); return ret; } Equivalent transformation, provided that nobody is playing silly buggers with reassigning ->f_path in their ->open() instances. They *really* should not - if anyone does, we'd better catch them and fix them^Wtheir code. Incidentally, if we find any such, we have a damn good reason to add asserts in the callers. As in, "if do_dentry_open() has set FMODE_OPENED, it would bloody better *not* modify ->f_path". <greps> Nope, nobody is that insane. 3/5) split vfs_open_consume() out of vfs_open() (possibly named vfs_open_borrow()), replace the call in do_open() with calling the new function. Trivially equivalent transformation. 4/5) Remove conditional path_get() from vfs_open_consume() and finish_open(). Add if (file->f_mode & FMODE_OPENED) path_get(&nd->path); before terminate_walk(nd); in path_openat(). Equivalent transformation - see if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) { dput(nd->path.dentry); nd->path.dentry = dentry; return NULL; } in lookup_open() (which is where nd->path gets in sync with what had been given to do_dentry_open() in finish_open()); in case of vfs_open_consume() in do_open() it's in sync from the very beginning. And we never modify nd->path after those points. So we can move grabbing it downstream, keeping it under the same condition (which also happens to be true only if we'd called do_dentry_open(), so for all other paths through the whole thing it's a no-op. 5/5) replace if (file->f_mode & FMODE_OPENED) path_get(&nd->path); terminate_walk(nd); with if (file->f_mode & FMODE_OPENED) { nd->path.mnt = NULL; nd->path.dentry = NULL; } terminate_walk(nd); Again, an obvious equivalent transformation.
On Wed, Aug 07, 2024 at 07:23:00AM +0100, Al Viro wrote: > After having looked at the problem, how about the following > series: > > 1/5) lift path_get() *AND* path_put() out of do_dentry_open() > into the callers. The latter - conditional upon "do_dentry_open() > has not set FMODE_OPENED". Equivalent transformation. > > 2/5) move path_get() we'd lifted into the callers past the > call of do_dentry_open(), conditionally collapse it with path_put(). > You'd get e.g. > int vfs_open(const struct path *path, struct file *file) > { > int ret; > > file->f_path = *path; > ret = do_dentry_open(file, NULL); > if (!ret) { > /* > * Once we return a file with FMODE_OPENED, __fput() will call > * fsnotify_close(), so we need fsnotify_open() here for > * symmetry. > */ > fsnotify_open(file); > } > if (file->f_mode & FMODE_OPENED) > path_get(path); > return ret; > } > > Equivalent transformation, provided that nobody is playing silly > buggers with reassigning ->f_path in their ->open() instances. > They *really* should not - if anyone does, we'd better catch them > and fix them^Wtheir code. Incidentally, if we find any such, > we have a damn good reason to add asserts in the callers. As > in, "if do_dentry_open() has set FMODE_OPENED, it would bloody > better *not* modify ->f_path". <greps> Nope, nobody is that > insane. > > 3/5) split vfs_open_consume() out of vfs_open() (possibly > named vfs_open_borrow()), replace the call in do_open() with > calling the new function. > > Trivially equivalent transformation. > > 4/5) Remove conditional path_get() from vfs_open_consume() > and finish_open(). Add > if (file->f_mode & FMODE_OPENED) > path_get(&nd->path); > before terminate_walk(nd); in path_openat(). > > Equivalent transformation - see > if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) { > dput(nd->path.dentry); > nd->path.dentry = dentry; > return NULL; > } > in lookup_open() (which is where nd->path gets in sync with what > had been given to do_dentry_open() in finish_open()); in case > of vfs_open_consume() in do_open() it's in sync from the very > beginning. And we never modify nd->path after those points. > So we can move grabbing it downstream, keeping it under the > same condition (which also happens to be true only if we'd > called do_dentry_open(), so for all other paths through the > whole thing it's a no-op. > > 5/5) replace > if (file->f_mode & FMODE_OPENED) > path_get(&nd->path); > terminate_walk(nd); > with > if (file->f_mode & FMODE_OPENED) { > nd->path.mnt = NULL; > nd->path.dentry = NULL; > } > terminate_walk(nd); > Again, an obvious equivalent transformation. BTW, similar to that, with that we could turn do_o_path() into struct path path; int error = path_lookupat(nd, flags, &path); if (!error) { audit_inode(nd->name, path.dentry, 0); error = vfs_open_borrow(&path, file); if (!(file->f_mode & FMODE_OPENED)) path_put(&path); } return error; } and perhaps do something similar in the vicinity of vfs_tmpfile() / do_o_tmpfile().
On Wed, Aug 7, 2024 at 8:33 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Wed, Aug 07, 2024 at 07:23:00AM +0100, Al Viro wrote: > > After having looked at the problem, how about the following > > series: > > > > 1/5) lift path_get() *AND* path_put() out of do_dentry_open() > > into the callers. The latter - conditional upon "do_dentry_open() > > has not set FMODE_OPENED". Equivalent transformation. > > > > 2/5) move path_get() we'd lifted into the callers past the > > call of do_dentry_open(), conditionally collapse it with path_put(). > > You'd get e.g. > > int vfs_open(const struct path *path, struct file *file) > > { > > int ret; > > > > file->f_path = *path; > > ret = do_dentry_open(file, NULL); > > if (!ret) { > > /* > > * Once we return a file with FMODE_OPENED, __fput() will call > > * fsnotify_close(), so we need fsnotify_open() here for > > * symmetry. > > */ > > fsnotify_open(file); > > } > > if (file->f_mode & FMODE_OPENED) > > path_get(path); > > return ret; > > } > > > > Equivalent transformation, provided that nobody is playing silly > > buggers with reassigning ->f_path in their ->open() instances. > > They *really* should not - if anyone does, we'd better catch them > > and fix them^Wtheir code. Incidentally, if we find any such, > > we have a damn good reason to add asserts in the callers. As > > in, "if do_dentry_open() has set FMODE_OPENED, it would bloody > > better *not* modify ->f_path". <greps> Nope, nobody is that > > insane. > > > > 3/5) split vfs_open_consume() out of vfs_open() (possibly > > named vfs_open_borrow()), replace the call in do_open() with > > calling the new function. > > > > Trivially equivalent transformation. > > > > 4/5) Remove conditional path_get() from vfs_open_consume() > > and finish_open(). Add > > if (file->f_mode & FMODE_OPENED) > > path_get(&nd->path); > > before terminate_walk(nd); in path_openat(). > > > > Equivalent transformation - see > > if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) { > > dput(nd->path.dentry); > > nd->path.dentry = dentry; > > return NULL; > > } > > in lookup_open() (which is where nd->path gets in sync with what > > had been given to do_dentry_open() in finish_open()); in case > > of vfs_open_consume() in do_open() it's in sync from the very > > beginning. And we never modify nd->path after those points. > > So we can move grabbing it downstream, keeping it under the > > same condition (which also happens to be true only if we'd > > called do_dentry_open(), so for all other paths through the > > whole thing it's a no-op. > > > > 5/5) replace > > if (file->f_mode & FMODE_OPENED) > > path_get(&nd->path); > > terminate_walk(nd); > > with > > if (file->f_mode & FMODE_OPENED) { > > nd->path.mnt = NULL; > > nd->path.dentry = NULL; > > } > > terminate_walk(nd); > > Again, an obvious equivalent transformation. > > BTW, similar to that, with that we could turn do_o_path() > into > > struct path path; > int error = path_lookupat(nd, flags, &path); > if (!error) { > audit_inode(nd->name, path.dentry, 0); > error = vfs_open_borrow(&path, file); > if (!(file->f_mode & FMODE_OPENED)) > path_put(&path); > } > return error; > } > > and perhaps do something similar in the vicinity of > vfs_tmpfile() / do_o_tmpfile(). That's quite a bit of churn, but if you insist I can take a stab. fwiw I do think the weird error condition in do_dentry_open can be used to simplify stuff, which I still do in my v2. with my approach there is never a path_put needed to backpedal (it was already done by do_dentry_open *or* it is going to be done by whoever doing last fput) then do_o_path would be: struct path path; int error = path_lookupat(nd, flags, &path); if (!error) { audit_inode(nd->name, path.dentry, 0); error = vfs_open_consume(&path, file); } return error;
On Wed, Aug 07, 2024 at 08:40:28AM +0200, Mateusz Guzik wrote: > On Wed, Aug 7, 2024 at 8:33 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > On Wed, Aug 07, 2024 at 07:23:00AM +0100, Al Viro wrote: > > > After having looked at the problem, how about the following > > > series: > > > > > > 1/5) lift path_get() *AND* path_put() out of do_dentry_open() > > > into the callers. The latter - conditional upon "do_dentry_open() > > > has not set FMODE_OPENED". Equivalent transformation. > > > > > > 2/5) move path_get() we'd lifted into the callers past the > > > call of do_dentry_open(), conditionally collapse it with path_put(). > > > You'd get e.g. > > > int vfs_open(const struct path *path, struct file *file) > > > { > > > int ret; > > > > > > file->f_path = *path; > > > ret = do_dentry_open(file, NULL); > > > if (!ret) { > > > /* > > > * Once we return a file with FMODE_OPENED, __fput() will call > > > * fsnotify_close(), so we need fsnotify_open() here for > > > * symmetry. > > > */ > > > fsnotify_open(file); > > > } > > > if (file->f_mode & FMODE_OPENED) > > > path_get(path); > > > return ret; > > > } > > > > > > Equivalent transformation, provided that nobody is playing silly > > > buggers with reassigning ->f_path in their ->open() instances. > > > They *really* should not - if anyone does, we'd better catch them > > > and fix them^Wtheir code. Incidentally, if we find any such, > > > we have a damn good reason to add asserts in the callers. As > > > in, "if do_dentry_open() has set FMODE_OPENED, it would bloody > > > better *not* modify ->f_path". <greps> Nope, nobody is that > > > insane. > > > > > > 3/5) split vfs_open_consume() out of vfs_open() (possibly > > > named vfs_open_borrow()), replace the call in do_open() with > > > calling the new function. > > > > > > Trivially equivalent transformation. > > > > > > 4/5) Remove conditional path_get() from vfs_open_consume() > > > and finish_open(). Add > > > if (file->f_mode & FMODE_OPENED) > > > path_get(&nd->path); > > > before terminate_walk(nd); in path_openat(). > > > > > > Equivalent transformation - see > > > if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) { > > > dput(nd->path.dentry); > > > nd->path.dentry = dentry; > > > return NULL; > > > } > > > in lookup_open() (which is where nd->path gets in sync with what > > > had been given to do_dentry_open() in finish_open()); in case > > > of vfs_open_consume() in do_open() it's in sync from the very > > > beginning. And we never modify nd->path after those points. > > > So we can move grabbing it downstream, keeping it under the > > > same condition (which also happens to be true only if we'd > > > called do_dentry_open(), so for all other paths through the > > > whole thing it's a no-op. > > > > > > 5/5) replace > > > if (file->f_mode & FMODE_OPENED) > > > path_get(&nd->path); > > > terminate_walk(nd); > > > with > > > if (file->f_mode & FMODE_OPENED) { > > > nd->path.mnt = NULL; > > > nd->path.dentry = NULL; > > > } > > > terminate_walk(nd); > > > Again, an obvious equivalent transformation. > > > > BTW, similar to that, with that we could turn do_o_path() > > into > > > > struct path path; > > int error = path_lookupat(nd, flags, &path); > > if (!error) { > > audit_inode(nd->name, path.dentry, 0); > > error = vfs_open_borrow(&path, file); > > if (!(file->f_mode & FMODE_OPENED)) > > path_put(&path); > > } > > return error; > > } > > > > and perhaps do something similar in the vicinity of > > vfs_tmpfile() / do_o_tmpfile(). > > That's quite a bit of churn, but if you insist I can take a stab. What I have in mind is something along the lines of COMPLETELY UNTESTED git.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git #experimental-for-mateusz It needs saner commit messages, references to your analysis of the overhead, quite possibly a finer carve-up, etc. And it's really completely untested - it builds, but I hadn't even tried to boot the sucker, let alone give it any kind of beating, so consider that as a quick illustration (slapped together at 3am, on top of 5 hours of sleep yesterday) to what I'd been talking about and no more than that.
On Wed, Aug 7, 2024 at 9:05 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Wed, Aug 07, 2024 at 08:40:28AM +0200, Mateusz Guzik wrote: > > On Wed, Aug 7, 2024 at 8:33 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > On Wed, Aug 07, 2024 at 07:23:00AM +0100, Al Viro wrote: > > > > After having looked at the problem, how about the following > > > > series: > > > > > > > > 1/5) lift path_get() *AND* path_put() out of do_dentry_open() > > > > into the callers. The latter - conditional upon "do_dentry_open() > > > > has not set FMODE_OPENED". Equivalent transformation. > > > > > > > > 2/5) move path_get() we'd lifted into the callers past the > > > > call of do_dentry_open(), conditionally collapse it with path_put(). > > > > You'd get e.g. > > > > int vfs_open(const struct path *path, struct file *file) > > > > { > > > > int ret; > > > > > > > > file->f_path = *path; > > > > ret = do_dentry_open(file, NULL); > > > > if (!ret) { > > > > /* > > > > * Once we return a file with FMODE_OPENED, __fput() will call > > > > * fsnotify_close(), so we need fsnotify_open() here for > > > > * symmetry. > > > > */ > > > > fsnotify_open(file); > > > > } > > > > if (file->f_mode & FMODE_OPENED) > > > > path_get(path); > > > > return ret; > > > > } > > > > > > > > Equivalent transformation, provided that nobody is playing silly > > > > buggers with reassigning ->f_path in their ->open() instances. > > > > They *really* should not - if anyone does, we'd better catch them > > > > and fix them^Wtheir code. Incidentally, if we find any such, > > > > we have a damn good reason to add asserts in the callers. As > > > > in, "if do_dentry_open() has set FMODE_OPENED, it would bloody > > > > better *not* modify ->f_path". <greps> Nope, nobody is that > > > > insane. > > > > > > > > 3/5) split vfs_open_consume() out of vfs_open() (possibly > > > > named vfs_open_borrow()), replace the call in do_open() with > > > > calling the new function. > > > > > > > > Trivially equivalent transformation. > > > > > > > > 4/5) Remove conditional path_get() from vfs_open_consume() > > > > and finish_open(). Add > > > > if (file->f_mode & FMODE_OPENED) > > > > path_get(&nd->path); > > > > before terminate_walk(nd); in path_openat(). > > > > > > > > Equivalent transformation - see > > > > if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) { > > > > dput(nd->path.dentry); > > > > nd->path.dentry = dentry; > > > > return NULL; > > > > } > > > > in lookup_open() (which is where nd->path gets in sync with what > > > > had been given to do_dentry_open() in finish_open()); in case > > > > of vfs_open_consume() in do_open() it's in sync from the very > > > > beginning. And we never modify nd->path after those points. > > > > So we can move grabbing it downstream, keeping it under the > > > > same condition (which also happens to be true only if we'd > > > > called do_dentry_open(), so for all other paths through the > > > > whole thing it's a no-op. > > > > > > > > 5/5) replace > > > > if (file->f_mode & FMODE_OPENED) > > > > path_get(&nd->path); > > > > terminate_walk(nd); > > > > with > > > > if (file->f_mode & FMODE_OPENED) { > > > > nd->path.mnt = NULL; > > > > nd->path.dentry = NULL; > > > > } > > > > terminate_walk(nd); > > > > Again, an obvious equivalent transformation. > > > > > > BTW, similar to that, with that we could turn do_o_path() > > > into > > > > > > struct path path; > > > int error = path_lookupat(nd, flags, &path); > > > if (!error) { > > > audit_inode(nd->name, path.dentry, 0); > > > error = vfs_open_borrow(&path, file); > > > if (!(file->f_mode & FMODE_OPENED)) > > > path_put(&path); > > > } > > > return error; > > > } > > > > > > and perhaps do something similar in the vicinity of > > > vfs_tmpfile() / do_o_tmpfile(). > > > > That's quite a bit of churn, but if you insist I can take a stab. > > What I have in mind is something along the lines of COMPLETELY UNTESTED > git.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git #experimental-for-mateusz > > It needs saner commit messages, references to your analysis of the > overhead, quite possibly a finer carve-up, etc. And it's really > completely untested - it builds, but I hadn't even tried to boot > the sucker, let alone give it any kind of beating, so consider that > as a quick illustration (slapped together at 3am, on top of 5 hours of > sleep yesterday) to what I'd been talking about and no more than that. Well it's your call, you wrote the thing and I need the problem out of the way, so I'm not going to argue about the patchset. I verified it boots and provides the expected perf win [I have to repeat it is highly variable between re-runs because of ever-changing offsets between different inode allocations resulting in different false-sharing problems; i'm going to separately mail about that] I think it will be fine to copy the result from my commit message and denote it's from a different variant achieving the same goal. That said feel free to use my commit message in whatever capacity, there is no need to mention me. Thanks for sorting this out.
On Wed, Aug 07, 2024 at 09:22:59AM +0200, Mateusz Guzik wrote: > Well it's your call, you wrote the thing and I need the problem out of > the way, so I'm not going to argue about the patchset. > > I verified it boots and provides the expected perf win [I have to > repeat it is highly variable between re-runs because of ever-changing > offsets between different inode allocations resulting in different > false-sharing problems; i'm going to separately mail about that] > > I think it will be fine to copy the result from my commit message and > denote it's from a different variant achieving the same goal. > > That said feel free to use my commit message in whatever capacity, > there is no need to mention me. Original analysis had been yours, same for "let's change the calling conventions for do_dentry_open() wrt path refcounting", same for the version I'd transformed into that... FWIW, my approach to that had been along the lines of "how do we get it consistently, whether we go through vfs_open() or finish_open()", which pretty much required keeping hold on the path until just before terminate_walk(). This "transfer from nd->path to whatever borrowed it" was copied from path_openat() (BTW, might be worth an inlined helper next to terminate_walk(), just to document that it's not an accidental property of terminate_walk()) and that was pretty much it. Co-developed-by: seems to be the usual notation these days for such situations - that really had been incremental changes. Anyway, I really need to get some sleep before writing something usable as commit messages...
On Wed, Aug 7, 2024 at 9:52 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Wed, Aug 07, 2024 at 09:22:59AM +0200, Mateusz Guzik wrote: > > > Well it's your call, you wrote the thing and I need the problem out of > > the way, so I'm not going to argue about the patchset. > > > > I verified it boots and provides the expected perf win [I have to > > repeat it is highly variable between re-runs because of ever-changing > > offsets between different inode allocations resulting in different > > false-sharing problems; i'm going to separately mail about that] > > > > I think it will be fine to copy the result from my commit message and > > denote it's from a different variant achieving the same goal. > > > > That said feel free to use my commit message in whatever capacity, > > there is no need to mention me. > > Original analysis had been yours, same for "let's change the calling > conventions for do_dentry_open() wrt path refcounting", same for > the version I'd transformed into that... FWIW, my approach to > that had been along the lines of "how do we get it consistently, > whether we go through vfs_open() or finish_open()", which pretty > much required keeping hold on the path until just before > terminate_walk(). This "transfer from nd->path to whatever borrowed > it" was copied from path_openat() (BTW, might be worth an inlined helper > next to terminate_walk(), just to document that it's not an accidental > property of terminate_walk()) and that was pretty much it. > > Co-developed-by: seems to be the usual notation these days for > such situations - that really had been incremental changes. > > Anyway, I really need to get some sleep before writing something > usable as commit messages... Nobody is getting a Turing award for noticing the extra ref trip and eliding it. Co-developed-by is fine with me if you insist on sharing credit. My only objective here is to expedite the fix so that I can get on with speeding up refcount management. :)
This eventually broke: [ 1297.949437] BUG: kernel NULL pointer dereference, address: 0000000000000028 [ 1297.951840] #PF: supervisor read access in kernel mode [ 1297.953548] #PF: error_code(0x0000) - not-present page [ 1297.954984] PGD 109590067 P4D 0 [ 1297.955533] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI [ 1297.956340] CPU: 18 UID: 0 PID: 1113 Comm: cron Not tainted 6.11.0-rc1-00003-g80efb96e5cd4 #291 [ 1297.958090] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 [ 1297.959926] RIP: 0010:vfs_tmpfile+0x162/0x230 [ 1297.960666] Code: b2 8b 55 48 83 e2 20 83 fa 01 19 ff 81 e7 00 f0 ff ff 81 c7 20 10 00 00 a9 00 40 00 04 75 90 48 8b 85 a0 00 00 00 4c 8b 48 30 <49> 8b 51 28 48 8b 92 b8 03 00 00 48 85 d2 0f 84 71 ff ff ff 48 8b [ 1297.963682] RSP: 0018:ff38d126c19c3cc0 EFLAGS: 00010246 [ 1297.964549] RAX: ff2c2c8c89a3b980 RBX: 0000000000000000 RCX: 0000000000000002 [ 1297.965714] RDX: 0000000000000000 RSI: ff2c2c8c89a3ba30 RDI: 0000000000000020 [ 1297.966872] RBP: ff2c2c8c91917b00 R08: 0000000000000000 R09: 0000000000000000 [ 1297.968046] R10: 0000000000000000 R11: ff2c2c8c91cd23f0 R12: ffffffffb7b91e20 [ 1297.969202] R13: ff38d126c19c3d48 R14: ff2c2c8c89a3b980 R15: ff2c2c8c89a00a58 [ 1297.970364] FS: 00007ff81cb63840(0000) GS:ff2c2c91e7a80000(0000) knlGS:0000000000000000 [ 1297.971694] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1297.972645] CR2: 0000000000000028 CR3: 000000010cf58001 CR4: 0000000000371ef0 [ 1297.973825] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1297.974996] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 1297.976175] Call Trace: [ 1297.976609] <TASK> [ 1297.976978] ? __die+0x23/0x70 [ 1297.977501] ? page_fault_oops+0x15c/0x480 [ 1297.978191] ? jbd2_journal_stop+0x13c/0x2d0 [ 1297.978903] ? exc_page_fault+0x78/0x170 [ 1297.979564] ? asm_exc_page_fault+0x26/0x30 [ 1297.980262] ? vfs_tmpfile+0x162/0x230 [ 1297.980891] path_openat+0xc70/0x12b0 [ 1297.981507] ? __count_memcg_events+0x58/0xf0 [ 1297.982233] ? handle_mm_fault+0xb9/0x260 [ 1297.982903] ? do_user_addr_fault+0x2ed/0x6e0 [ 1297.983636] do_filp_open+0xb0/0x150 [ 1297.984241] do_sys_openat2+0x91/0xc0 [ 1297.984855] __x64_sys_openat+0x57/0xa0 [ 1297.985504] do_syscall_64+0x52/0x150 [ 1297.986128] entry_SYSCALL_64_after_hwframe+0x76/0x7e tripping ip: vfs_tmpfile+0x162/0x230: fsnotify_parent at include/linux/fsnotify.h:81 (inlined by) fsnotify_file at include/linux/fsnotify.h:131 (inlined by) fsnotify_open at include/linux/fsnotify.h:401 (inlined by) vfs_tmpfile at fs/namei.c:3781
On Wed, Aug 07, 2024 at 11:50:50AM +0200, Mateusz Guzik wrote: > tripping ip: > vfs_tmpfile+0x162/0x230: > fsnotify_parent at include/linux/fsnotify.h:81 > (inlined by) fsnotify_file at include/linux/fsnotify.h:131 > (inlined by) fsnotify_open at include/linux/fsnotify.h:401 > (inlined by) vfs_tmpfile at fs/namei.c:3781 Try this for incremental; missed the fact that finish_open() is used by ->tmpfile() instances, not just ->atomic_open(). Al, crawling back to sleep... diff --git a/fs/namei.c b/fs/namei.c index 95345a5beb3a..0536907e8e79 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3776,7 +3776,10 @@ int vfs_tmpfile(struct mnt_idmap *idmap, file->f_path.dentry = child; mode = vfs_prepare_mode(idmap, dir, mode, mode, mode); error = dir->i_op->tmpfile(idmap, dir, file, mode); - dput(child); + if (file->f_mode & FMODE_OPENED) + mntget(parentpath->mnt); + else + dput(child); if (file->f_mode & FMODE_OPENED) fsnotify_open(file); if (error)
On Wed, Aug 07, 2024 at 01:43:48PM +0100, Al Viro wrote: > On Wed, Aug 07, 2024 at 11:50:50AM +0200, Mateusz Guzik wrote: > > > tripping ip: > > vfs_tmpfile+0x162/0x230: > > fsnotify_parent at include/linux/fsnotify.h:81 > > (inlined by) fsnotify_file at include/linux/fsnotify.h:131 > > (inlined by) fsnotify_open at include/linux/fsnotify.h:401 > > (inlined by) vfs_tmpfile at fs/namei.c:3781 > > Try this for incremental; missed the fact that finish_open() is > used by ->tmpfile() instances, not just ->atomic_open(). > > Al, crawling back to sleep... I _really_ hate ->atomic_open() calling conventions; FWIW, I suspect that in the current form this series is OK, but only because none of the existing instances call finish_open() on a preexisting aliases found by d_splice_alias(). And control flow in the instances (especially the cleanup paths) is bloody awful... We never got it quite right, and while the previous iterations of the calling conventions for that methods had been worse, it's still nasty in the current form ;-/ Oh, well - review of those has been long overdue.
On Wed, Aug 7, 2024 at 2:43 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Wed, Aug 07, 2024 at 11:50:50AM +0200, Mateusz Guzik wrote: > > > tripping ip: > > vfs_tmpfile+0x162/0x230: > > fsnotify_parent at include/linux/fsnotify.h:81 > > (inlined by) fsnotify_file at include/linux/fsnotify.h:131 > > (inlined by) fsnotify_open at include/linux/fsnotify.h:401 > > (inlined by) vfs_tmpfile at fs/namei.c:3781 > > Try this for incremental; missed the fact that finish_open() is > used by ->tmpfile() instances, not just ->atomic_open(). > > Al, crawling back to sleep... > > diff --git a/fs/namei.c b/fs/namei.c > index 95345a5beb3a..0536907e8e79 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3776,7 +3776,10 @@ int vfs_tmpfile(struct mnt_idmap *idmap, > file->f_path.dentry = child; > mode = vfs_prepare_mode(idmap, dir, mode, mode, mode); > error = dir->i_op->tmpfile(idmap, dir, file, mode); > - dput(child); > + if (file->f_mode & FMODE_OPENED) > + mntget(parentpath->mnt); > + else > + dput(child); > if (file->f_mode & FMODE_OPENED) > fsnotify_open(file); > if (error) That seems to have worked, but my test rig went down due to surprise PDU maintenance (and it is still down) and I was unable to give the patch a more serious beating.
do you plan to submit this to next? anything this is waiting for? my quick skim suggests this only needs more testing (and maybe a review) On Wed, Aug 7, 2024 at 10:38 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Wed, Aug 07, 2024 at 01:43:48PM +0100, Al Viro wrote: > > On Wed, Aug 07, 2024 at 11:50:50AM +0200, Mateusz Guzik wrote: > > > > > tripping ip: > > > vfs_tmpfile+0x162/0x230: > > > fsnotify_parent at include/linux/fsnotify.h:81 > > > (inlined by) fsnotify_file at include/linux/fsnotify.h:131 > > > (inlined by) fsnotify_open at include/linux/fsnotify.h:401 > > > (inlined by) vfs_tmpfile at fs/namei.c:3781 > > > > Try this for incremental; missed the fact that finish_open() is > > used by ->tmpfile() instances, not just ->atomic_open(). > > > > Al, crawling back to sleep... > > I _really_ hate ->atomic_open() calling conventions; FWIW, I suspect > that in the current form this series is OK, but only because none > of the existing instances call finish_open() on a preexisting > aliases found by d_splice_alias(). And control flow in the > instances (especially the cleanup paths) is bloody awful... > > We never got it quite right, and while the previous iterations of > the calling conventions for that methods had been worse, it's still > nasty in the current form ;-/ > > Oh, well - review of those has been long overdue.
On Tue, Aug 20, 2024 at 01:38:05PM +0200, Mateusz Guzik wrote: > do you plan to submit this to next? > > anything this is waiting for? > > my quick skim suggests this only needs more testing (and maybe a review) OK, let me post the current variant of that series, then we'll see if anyone screams...
diff --git a/fs/internal.h b/fs/internal.h index cdd73209eecb..9eeb7e03f81d 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -193,7 +193,8 @@ int chmod_common(const struct path *path, umode_t mode); int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group, int flag); int chown_common(const struct path *path, uid_t user, gid_t group); -extern int vfs_open(const struct path *, struct file *); +int vfs_open_consume(const struct path *, struct file *); +int vfs_open(const struct path *, struct file *); /* * inode.c diff --git a/fs/namei.c b/fs/namei.c index 1bf081959066..20c5823d34dc 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -595,9 +595,10 @@ struct nameidata { umode_t dir_mode; } __randomize_layout; -#define ND_ROOT_PRESET 1 -#define ND_ROOT_GRABBED 2 -#define ND_JUMPED 4 +#define ND_ROOT_PRESET 0x00000001 +#define ND_ROOT_GRABBED 0x00000002 +#define ND_JUMPED 0x00000004 +#define ND_PATH_CONSUMED 0x00000008 static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name) { @@ -697,6 +698,7 @@ static void terminate_walk(struct nameidata *nd) nd->state &= ~ND_ROOT_GRABBED; } } else { + BUG_ON(nd->state & ND_PATH_CONSUMED); leave_rcu(nd); } nd->depth = 0; @@ -3683,6 +3685,7 @@ static const char *open_last_lookups(struct nameidata *nd, static int do_open(struct nameidata *nd, struct file *file, const struct open_flags *op) { + struct vfsmount *mnt; struct mnt_idmap *idmap; int open_flag = op->open_flag; bool do_truncate; @@ -3720,11 +3723,22 @@ static int do_open(struct nameidata *nd, error = mnt_want_write(nd->path.mnt); if (error) return error; + /* + * We grab an additional reference here because vfs_open_consume() + * may error out and free the mount from under us, while we need + * to undo write access below. + */ + mnt = mntget(nd->path.mnt); do_truncate = true; } error = may_open(idmap, &nd->path, acc_mode, open_flag); - if (!error && !(file->f_mode & FMODE_OPENED)) - error = vfs_open(&nd->path, file); + if (!error && !(file->f_mode & FMODE_OPENED)) { + BUG_ON(nd->state & ND_PATH_CONSUMED); + error = vfs_open_consume(&nd->path, file); + nd->state |= ND_PATH_CONSUMED; + nd->path.mnt = NULL; + nd->path.dentry = NULL; + } if (!error) error = security_file_post_open(file, op->acc_mode); if (!error && do_truncate) @@ -3733,8 +3747,10 @@ static int do_open(struct nameidata *nd, WARN_ON(1); error = -EINVAL; } - if (do_truncate) - mnt_drop_write(nd->path.mnt); + if (do_truncate) { + mnt_drop_write(mnt); + mntput(mnt); + } return error; } diff --git a/fs/open.c b/fs/open.c index 22adbef7ecc2..eb69af3676e3 100644 --- a/fs/open.c +++ b/fs/open.c @@ -905,6 +905,15 @@ static inline int file_get_write_access(struct file *f) return error; } +/* + * Populate struct file + * + * NOTE: it assumes f_path is populated and consumes the caller's reference. + * + * The caller must not path_put on it regardless of the error code -- the + * routine will either clean it up on its own or rely on fput, which must + * be issued anyway. + */ static int do_dentry_open(struct file *f, int (*open)(struct inode *, struct file *)) { @@ -912,7 +921,6 @@ static int do_dentry_open(struct file *f, struct inode *inode = f->f_path.dentry->d_inode; int error; - path_get(&f->f_path); f->f_inode = inode; f->f_mapping = inode->i_mapping; f->f_wb_err = filemap_sample_wb_err(f->f_mapping); @@ -1045,6 +1053,7 @@ int finish_open(struct file *file, struct dentry *dentry, BUG_ON(file->f_mode & FMODE_OPENED); /* once it's opened, it's opened */ file->f_path.dentry = dentry; + path_get(&file->f_path); return do_dentry_open(file, open); } EXPORT_SYMBOL(finish_open); @@ -1077,15 +1086,19 @@ char *file_path(struct file *filp, char *buf, int buflen) EXPORT_SYMBOL(file_path); /** - * vfs_open - open the file at the given path + * vfs_open_consume - open the file at the given path and consume the reference * @path: path to open * @file: newly allocated file with f_flag initialized */ -int vfs_open(const struct path *path, struct file *file) +int vfs_open_consume(const struct path *path, struct file *file) { int ret; file->f_path = *path; + /* + * do_dentry_open() does the referene consuming regardless of its return + * value + */ ret = do_dentry_open(file, NULL); if (!ret) { /* @@ -1098,6 +1111,17 @@ int vfs_open(const struct path *path, struct file *file) return ret; } +/** + * vfs_open - open the file at the given path + * @path: path to open + * @file: newly allocated file with f_flag initialized + */ +int vfs_open(const struct path *path, struct file *file) +{ + path_get(path); + return vfs_open_consume(path, file); +} + struct file *dentry_open(const struct path *path, int flags, const struct cred *cred) { @@ -1183,6 +1207,7 @@ struct file *kernel_file_open(const struct path *path, int flags, return f; f->f_path = *path; + path_get(&f->f_path); error = do_dentry_open(f, NULL); if (error) { fput(f);
Opening a file grabs a reference on the terminal dentry in __legitimize_path(), then another one in do_dentry_open() and finally drops the initial reference in terminate_walk(). That's 2 modifications which don't need to be there -- do_dentry_open() can consume the already held reference instead. In order to facilitate some debugging a dedicated namei state flag was added to denote this happened. When benchmarking on a 20-core vm using will-it-scale's open3_processes ("Same file open/close"), the results are (ops/s): before: 3087010 after: 4173977 (+35%) Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> --- The flag thing is optional and can be dropped, but I think the general direction should be to add *more* asserts and whatnot (even if they are to land separately). A debug-only variant would not hurt. The entire "always consume regardless of error" ordeal stems from the corner case where open succeeds with a fully populated file object and then returns an error anyway. While odd, it does mean error handling does not get more complicated at least. fs/internal.h | 3 ++- fs/namei.c | 30 +++++++++++++++++++++++------- fs/open.c | 31 ++++++++++++++++++++++++++++--- 3 files changed, 53 insertions(+), 11 deletions(-)