vfs: avoid spurious dentry ref/unref cycle on open

Message ID	20240806144628.874350-1-mjguzik@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-ej1-f44.google.com (mail-ej1-f44.google.com [209.85.218.44]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2B74017ADFD; Tue, 6 Aug 2024 14:46:36 +0000 (UTC) From: Mateusz Guzik <mjguzik@gmail.com> To: brauner@kernel.org Cc: viro@zeniv.linux.org.uk, jack@suse.cz, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Mateusz Guzik <mjguzik@gmail.com> Subject: [PATCH] vfs: avoid spurious dentry ref/unref cycle on open Date: Tue, 6 Aug 2024 16:46:28 +0200 Message-ID: <20240806144628.874350-1-mjguzik@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	vfs: avoid spurious dentry ref/unref cycle on open \| expand vfs: avoid spurious dentry ref/unref cycle on open

Mateusz Guzik Aug. 6, 2024, 2:46 p.m. UTC

Opening a file grabs a reference on the terminal dentry in
__legitimize_path(), then another one in do_dentry_open() and finally
drops the initial reference in terminate_walk().

That's 2 modifications which don't need to be there -- do_dentry_open()
can consume the already held reference instead.

In order to facilitate some debugging a dedicated namei state flag was
added to denote this happened.

When benchmarking on a 20-core vm using will-it-scale's open3_processes
("Same file open/close"), the results are (ops/s):
before:	3087010
after:	4173977 (+35%)

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

The flag thing is optional and can be dropped, but I think the general
direction should be to add *more* asserts and whatnot (even if they are
to land separately). A debug-only variant would not hurt.

The entire "always consume regardless of error" ordeal stems from the
corner case where open succeeds with a fully populated file object and
then returns an error anyway. While odd, it does mean error handling
does not get more complicated at least.

 fs/internal.h |  3 ++-
 fs/namei.c    | 30 +++++++++++++++++++++++-------
 fs/open.c     | 31 ++++++++++++++++++++++++++++---
 3 files changed, 53 insertions(+), 11 deletions(-)

Al Viro Aug. 6, 2024, 3:53 p.m. UTC | #1

On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote:

> The flag thing is optional and can be dropped, but I think the general
> direction should be to add *more* asserts and whatnot (even if they are
> to land separately). A debug-only variant would not hurt.

Asserts do *not* clarify anything; if you want your optional flag,
come up with clear description of its semantics.  In terms of state,
_not_ control flow.

> @@ -3683,6 +3685,7 @@ static const char *open_last_lookups(struct nameidata *nd,
>  static int do_open(struct nameidata *nd,
>  		   struct file *file, const struct open_flags *op)
>  {
> +	struct vfsmount *mnt;
>  	struct mnt_idmap *idmap;
>  	int open_flag = op->open_flag;
>  	bool do_truncate;
> @@ -3720,11 +3723,22 @@ static int do_open(struct nameidata *nd,
>  		error = mnt_want_write(nd->path.mnt);
>  		if (error)
>  			return error;
> +		/*
> +		 * We grab an additional reference here because vfs_open_consume()
> +		 * may error out and free the mount from under us, while we need
> +		 * to undo write access below.
> +		 */
> +		mnt = mntget(nd->path.mnt);

It's "after vfs_open_consume() we no longer own the reference in nd->path.mnt",
error or no error...

>  		do_truncate = true;

>  	}
>  	error = may_open(idmap, &nd->path, acc_mode, open_flag);
> -	if (!error && !(file->f_mode & FMODE_OPENED))
> -		error = vfs_open(&nd->path, file);
> +	if (!error && !(file->f_mode & FMODE_OPENED)) {
> +		BUG_ON(nd->state & ND_PATH_CONSUMED);
> +		error = vfs_open_consume(&nd->path, file);
> +		nd->state |= ND_PATH_CONSUMED;
> +		nd->path.mnt = NULL;
> +		nd->path.dentry = NULL;

Umm...  The thing that feels wrong here is that you get an extra
asymmetry with ->atomic_open() ;-/  We obviously can't do that
kind of thing there (if nothing else, we have the parent directory's
inode to unlock, error or no error).

I don't hate that patch, but it really feels like the calling
conventions are not right.  Let me try to tweak it a bit and
see if anything falls out...

Mateusz Guzik Aug. 6, 2024, 4:09 p.m. UTC | #2

On Tue, Aug 6, 2024 at 5:53 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote:
>
> > The flag thing is optional and can be dropped, but I think the general
> > direction should be to add *more* asserts and whatnot (even if they are
> > to land separately). A debug-only variant would not hurt.
>
> Asserts do *not* clarify anything; if you want your optional flag,
> come up with clear description of its semantics.  In terms of state,
> _not_ control flow.
>

It is supposed to indicate that both nd->path.mnt and nd->path.dentry
are no longer usable and must not even be looked at. Ideally code
which *does* look at them despite the flag (== there is a bug) traps.

However, I did not find a handy macro or anything of the sort to
"poison" these pointers. Instead I found tons of NULL checks all over,
including in lookup clean up.

So as is I agree the flag adds next to nothing as is, but the intent
was to catch any use of the above pointers after what they point to
got consumed. Perhaps I should just drop the flag for the time being
and only propose it with a more fleshed out scheme later.

> > @@ -3683,6 +3685,7 @@ static const char *open_last_lookups(struct nameidata *nd,
> >  static int do_open(struct nameidata *nd,
> >                  struct file *file, const struct open_flags *op)
> >  {
> > +     struct vfsmount *mnt;
> >       struct mnt_idmap *idmap;
> >       int open_flag = op->open_flag;
> >       bool do_truncate;
> > @@ -3720,11 +3723,22 @@ static int do_open(struct nameidata *nd,
> >               error = mnt_want_write(nd->path.mnt);
> >               if (error)
> >                       return error;
> > +             /*
> > +              * We grab an additional reference here because vfs_open_consume()
> > +              * may error out and free the mount from under us, while we need
> > +              * to undo write access below.
> > +              */
> > +             mnt = mntget(nd->path.mnt);
>
> It's "after vfs_open_consume() we no longer own the reference in nd->path.mnt",
> error or no error...
>

ok

> >               do_truncate = true;
>
>
> >       }
> >       error = may_open(idmap, &nd->path, acc_mode, open_flag);
> > -     if (!error && !(file->f_mode & FMODE_OPENED))
> > -             error = vfs_open(&nd->path, file);
> > +     if (!error && !(file->f_mode & FMODE_OPENED)) {
> > +             BUG_ON(nd->state & ND_PATH_CONSUMED);
> > +             error = vfs_open_consume(&nd->path, file);
> > +             nd->state |= ND_PATH_CONSUMED;
> > +             nd->path.mnt = NULL;
> > +             nd->path.dentry = NULL;
>
> Umm...  The thing that feels wrong here is that you get an extra
> asymmetry with ->atomic_open() ;-/  We obviously can't do that
> kind of thing there (if nothing else, we have the parent directory's
> inode to unlock, error or no error).
>
> I don't hate that patch, but it really feels like the calling
> conventions are not right.  Let me try to tweak it a bit and
> see if anything falls out...

Should you write your own patch I'm happy to give it a spin to
validate the win is about the same.

I forgot to note some stuff when sending the patch, so here it is:
perf is highly unstable between re-runs of the benchmark because
struct inode itself is not aligned to anything and at least ext4 plops
it in in a rather arbitrary place and uses a kmem cache without any
magic alignment guarantees. This results in false sharing showing up
and disappearing depending on how (un)lucky one gets. For reported
results I picked the worst case.

I had to patch one sucker for constantly getting in:
https://lore.kernel.org/linux-security-module/20240806133607.869394-1-mjguzik@gmail.com/T/#u

Mateusz Guzik Aug. 6, 2024, 4:14 p.m. UTC | #3

On Tue, Aug 6, 2024 at 6:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Tue, Aug 6, 2024 at 5:53 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote:
> >
> > > The flag thing is optional and can be dropped, but I think the general
> > > direction should be to add *more* asserts and whatnot (even if they are
> > > to land separately). A debug-only variant would not hurt.
> >
> > Asserts do *not* clarify anything; if you want your optional flag,
> > come up with clear description of its semantics.  In terms of state,
> > _not_ control flow.
> >
>
> It is supposed to indicate that both nd->path.mnt and nd->path.dentry
> are no longer usable and must not even be looked at. Ideally code
> which *does* look at them despite the flag (== there is a bug) traps.
>
> However, I did not find a handy macro or anything of the sort to
> "poison" these pointers. Instead I found tons of NULL checks all over,
> including in lookup clean up.
>
> So as is I agree the flag adds next to nothing as is, but the intent
> was to catch any use of the above pointers after what they point to
> got consumed. Perhaps I should just drop the flag for the time being
> and only propose it with a more fleshed out scheme later.
>
> > > @@ -3683,6 +3685,7 @@ static const char *open_last_lookups(struct nameidata *nd,
> > >  static int do_open(struct nameidata *nd,
> > >                  struct file *file, const struct open_flags *op)
> > >  {
> > > +     struct vfsmount *mnt;
> > >       struct mnt_idmap *idmap;
> > >       int open_flag = op->open_flag;
> > >       bool do_truncate;
> > > @@ -3720,11 +3723,22 @@ static int do_open(struct nameidata *nd,
> > >               error = mnt_want_write(nd->path.mnt);
> > >               if (error)
> > >                       return error;
> > > +             /*
> > > +              * We grab an additional reference here because vfs_open_consume()
> > > +              * may error out and free the mount from under us, while we need
> > > +              * to undo write access below.
> > > +              */
> > > +             mnt = mntget(nd->path.mnt);
> >
> > It's "after vfs_open_consume() we no longer own the reference in nd->path.mnt",
> > error or no error...
> >
>
> ok
>
> > >               do_truncate = true;
> >
> >
> > >       }
> > >       error = may_open(idmap, &nd->path, acc_mode, open_flag);
> > > -     if (!error && !(file->f_mode & FMODE_OPENED))
> > > -             error = vfs_open(&nd->path, file);
> > > +     if (!error && !(file->f_mode & FMODE_OPENED)) {
> > > +             BUG_ON(nd->state & ND_PATH_CONSUMED);
> > > +             error = vfs_open_consume(&nd->path, file);
> > > +             nd->state |= ND_PATH_CONSUMED;
> > > +             nd->path.mnt = NULL;
> > > +             nd->path.dentry = NULL;
> >
> > Umm...  The thing that feels wrong here is that you get an extra
> > asymmetry with ->atomic_open() ;-/  We obviously can't do that
> > kind of thing there (if nothing else, we have the parent directory's
> > inode to unlock, error or no error).
> >
> > I don't hate that patch, but it really feels like the calling
> > conventions are not right.  Let me try to tweak it a bit and
> > see if anything falls out...
>

fwiw if you are looking to have vfs_open_${keyword} which accepts
nameidata and "consumes" it, I did not go for it because vfs_open is
in another file and I did not want to "leak" the struct there.

> Should you write your own patch I'm happy to give it a spin to
> validate the win is about the same.
>
> I forgot to note some stuff when sending the patch, so here it is:
> perf is highly unstable between re-runs of the benchmark because
> struct inode itself is not aligned to anything and at least ext4 plops
> it in in a rather arbitrary place and uses a kmem cache without any
> magic alignment guarantees. This results in false sharing showing up
> and disappearing depending on how (un)lucky one gets. For reported
> results I picked the worst case.
>
> I had to patch one sucker for constantly getting in:
> https://lore.kernel.org/linux-security-module/20240806133607.869394-1-mjguzik@gmail.com/T/#u
>
> --
> Mateusz Guzik <mjguzik gmail.com>

Dave Chinner Aug. 6, 2024, 10:51 p.m. UTC | #4

On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote:
>  	error = may_open(idmap, &nd->path, acc_mode, open_flag);
> -	if (!error && !(file->f_mode & FMODE_OPENED))
> -		error = vfs_open(&nd->path, file);
> +	if (!error && !(file->f_mode & FMODE_OPENED)) {
> +		BUG_ON(nd->state & ND_PATH_CONSUMED);

Please don't litter new code with random BUG_ON() checks. If this
every happens, it will panic a production kernel and the fix will
generate a CVE.

Given that these checks should never fire in a production kernel
unless something is corrupting memory (i.e. the end is already
near), these should be considered debug assertions and we should
treat them that way from the start.

i.e. we really should have a VFS_ASSERT() or VFS_BUG_ON() (following
the VM_BUG_ON() pattern) masked by a CONFIG_VFS_DEBUG option so they
are only included into debug builds where there is a developer
watching to debug the system when one of these things fires.

This is a common pattern for subsystem specific assertions.  We do
this in all the major filesystems, the MM subsystem does this
(VM_BUG_ON), etc.  Perhaps it is time to do this in the VFS code as
well....

-Dave.

Mateusz Guzik Aug. 6, 2024, 10:55 p.m. UTC | #5

On Wed, Aug 7, 2024 at 12:51 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote:
> >       error = may_open(idmap, &nd->path, acc_mode, open_flag);
> > -     if (!error && !(file->f_mode & FMODE_OPENED))
> > -             error = vfs_open(&nd->path, file);
> > +     if (!error && !(file->f_mode & FMODE_OPENED)) {
> > +             BUG_ON(nd->state & ND_PATH_CONSUMED);
>
> Please don't litter new code with random BUG_ON() checks. If this
> every happens, it will panic a production kernel and the fix will
> generate a CVE.
>
> Given that these checks should never fire in a production kernel
> unless something is corrupting memory (i.e. the end is already
> near), these should be considered debug assertions and we should
> treat them that way from the start.
>
> i.e. we really should have a VFS_ASSERT() or VFS_BUG_ON() (following
> the VM_BUG_ON() pattern) masked by a CONFIG_VFS_DEBUG option so they
> are only included into debug builds where there is a developer
> watching to debug the system when one of these things fires.
>
> This is a common pattern for subsystem specific assertions.  We do
> this in all the major filesystems, the MM subsystem does this
> (VM_BUG_ON), etc.  Perhaps it is time to do this in the VFS code as
> well....

I agree, I have this at the bottom of my todo list.

The only reason I BUG_ON'ed here is because proper debug macros are not present.

fwiw v2 does not have any of this, so...

Dave Chinner Aug. 7, 2024, 2:56 a.m. UTC | #6

On Wed, Aug 07, 2024 at 12:55:07AM +0200, Mateusz Guzik wrote:
> On Wed, Aug 7, 2024 at 12:51 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote:
> > >       error = may_open(idmap, &nd->path, acc_mode, open_flag);
> > > -     if (!error && !(file->f_mode & FMODE_OPENED))
> > > -             error = vfs_open(&nd->path, file);
> > > +     if (!error && !(file->f_mode & FMODE_OPENED)) {
> > > +             BUG_ON(nd->state & ND_PATH_CONSUMED);
> >
> > Please don't litter new code with random BUG_ON() checks. If this
> > every happens, it will panic a production kernel and the fix will
> > generate a CVE.
> >
> > Given that these checks should never fire in a production kernel
> > unless something is corrupting memory (i.e. the end is already
> > near), these should be considered debug assertions and we should
> > treat them that way from the start.
> >
> > i.e. we really should have a VFS_ASSERT() or VFS_BUG_ON() (following
> > the VM_BUG_ON() pattern) masked by a CONFIG_VFS_DEBUG option so they
> > are only included into debug builds where there is a developer
> > watching to debug the system when one of these things fires.
> >
> > This is a common pattern for subsystem specific assertions.  We do
> > this in all the major filesystems, the MM subsystem does this
> > (VM_BUG_ON), etc.  Perhaps it is time to do this in the VFS code as
> > well....
> 
> I agree, I have this at the bottom of my todo list.

Good to know.

> The only reason I BUG_ON'ed here is because proper debug macros are not present.
> 
> fwiw v2 does not have any of this, so...

Yeah, I didn't notice there was a v2 patch before I wrote this.

-Dave.

Al Viro Aug. 7, 2024, 3:38 a.m. UTC | #7

On Tue, Aug 06, 2024 at 06:09:43PM +0200, Mateusz Guzik wrote:

> It is supposed to indicate that both nd->path.mnt and nd->path.dentry
> are no longer usable and must not even be looked at. Ideally code
> which *does* look at them despite the flag (== there is a bug) traps.
> 
> However, I did not find a handy macro or anything of the sort to
> "poison" these pointers. Instead I found tons of NULL checks all over,
> including in lookup clean up.

Unless I'm misreading you, those existing NULLs have nothing to do with
poisoning of any sort.  Or any kind of defensive programming, while we are
at it.  Those are about the cleanups on failed transition from lazy mode;
if we have already legitimized some of the references (i.e. bumped the
refcounts there) by the time we'd run into a stale one, we need to drop
the ones we'd grabbed on the way out.  And the easiest way to do that
is to leave that until terminate_walk(), when we'll be out of RCU mode.
The references that were *NOT* grabbed obviously should be left alone
rather than dropped.  Which is where those NULL assignments come from.

Mateusz Guzik Aug. 7, 2024, 3:57 a.m. UTC | #8

On Wed, Aug 7, 2024 at 5:38 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Aug 06, 2024 at 06:09:43PM +0200, Mateusz Guzik wrote:
>
> > It is supposed to indicate that both nd->path.mnt and nd->path.dentry
> > are no longer usable and must not even be looked at. Ideally code
> > which *does* look at them despite the flag (== there is a bug) traps.
> >
> > However, I did not find a handy macro or anything of the sort to
> > "poison" these pointers. Instead I found tons of NULL checks all over,
> > including in lookup clean up.
>
> Unless I'm misreading you, those existing NULLs have nothing to do with
> poisoning of any sort.  Or any kind of defensive programming, while we are
> at it.  Those are about the cleanups on failed transition from lazy mode;
> if we have already legitimized some of the references (i.e. bumped the
> refcounts there) by the time we'd run into a stale one, we need to drop
> the ones we'd grabbed on the way out.  And the easiest way to do that
> is to leave that until terminate_walk(), when we'll be out of RCU mode.
> The references that were *NOT* grabbed obviously should be left alone
> rather than dropped.  Which is where those NULL assignments come from.

Yes, this is my understanding of the code and part of my compliant. :)

Things just work(tm) as is with NULLified pointers, but this is error-prone.

I was looking for an equivalent of the following feature from $elsewhere:
/*
 * Trap accesses going through a pointer. Moreover if kasan is available trap
 * reading the pointer itself.
 *
 * Sample usage: you have a struct with numerous fields and by API contract
 * only some of them get populated, even if the implementation temporary writes
 * to them. You can use DEBUG_POISON_POINTER so that the consumer which should
 * no be looking at the field gets caught.
 *
 * DEBUG_POISON_POINTER(obj->ptr);
 * ....
 * if (obj->ptr != NULL) // traps with kasan, does not trap otherwise
 * ....
 * if (obj->ptr->field) // traps with and without kasan
 */
extern caddr_t poisoned_buf;
#define DEBUG_POISON_POINTER_VALUE poisoned_buf

#define DEBUG_POISON_POINTER(x) ({                              \
        x = (void *)(DEBUG_POISON_POINTER_VALUE);               \
        kasan_mark(&x, 0, sizeof(x), KASAN_GENERIC_REDZONE);    \
})

As a hypothetical suppose there is code executing some time after
vfs_open which looks at nd->path.dentry and by finding the pointer is
NULL it concludes the lookup did not work out.

If such code exists *and* the pointer is poisoned in the above sense
(notably merely branching on it with kasan already traps), then the
consumer will be caught immediately during coverage testing by
syzkaller.
If such code exists but the pointer is only nullified, one is only
going to find out the hard way when some functionality weirdly breaks.

Anyhow, this is really beyond the scope of the patch and I should not
have done the half-assed thing abandoned mid-effort. I'm going to get
back to this later(tm).

See the v2 which just gets to the point concerning eliding the extra ref trip.

Al Viro Aug. 7, 2024, 5:32 a.m. UTC | #9

On Wed, Aug 07, 2024 at 05:57:07AM +0200, Mateusz Guzik wrote:

[there'll be a separate reply with what I hope might be a usable
approach]

> Yes, this is my understanding of the code and part of my compliant. :)
> 
> Things just work(tm) as is with NULLified pointers, but this is error-prone.

And carrying the arseloads of information (which ones do and which do not
need to be dropped) is *less* error-prone?  Are you serious?

> As a hypothetical suppose there is code executing some time after
> vfs_open which looks at nd->path.dentry and by finding the pointer is
> NULL it concludes the lookup did not work out.
> 
> If such code exists *and* the pointer is poisoned in the above sense
> (notably merely branching on it with kasan already traps), then the
> consumer will be caught immediately during coverage testing by
> syzkaller.

You are much too optimistic about the quality of test coverage in this
particular area.

> If such code exists but the pointer is only nullified, one is only
> going to find out the hard way when some functionality weirdly breaks.

To do _useful_ asserts, one needs invariants to check.  And "we got
to this check after having passed through that assignment at some
earlier point" is not it.  That's why I'm asking questions about
the state.

The thing is, suppose I (or you, or somebody else) is trying to modify
the whole thing.  There's a magical mystery assert in the way; what
should be done with it?  Move it/split it/remove it/do something
random and hope syzkaller won't catch anything?  If I can reason
about the predicate being checked, I can at least start figuring out
what should be done.  If not, it's bloody guaranteed to rot.

This particular area (pathwalk machinery) has a nasty history of
growing complexity once in a while, with following cleanups and
massage to get it back into more or less tolerable shape.
And refactoring that had been _painful_ - I'd done more than
a few there.

As far as I can tell, at the moment this flag (and yes, I've seen its
removal in the next version) is "we'd called vfs_open_consume() at
some point, then found ourselves still in RCU mode or we'd called
vfs_open_consume() more than once".

This is *NOT* a property of state; it's a property of execution
history.  The first part is checked in the wrong place - one of
the invariants (trivially verified by code examination) is that
LOOKUP_RCU is never regained after it had been dropped.  The
only place where it can be set is path_init() and calling _that_
between path_init() and terminate_walk() would be
	a) a hard and very visible bug
	b) would've wiped your flag anyway.
So that part of the check is basically "we are not calling
vfs_open_consume() under rcu_read_lock()".  Which is definitely
a desirable property, since ->open() can block.  So can
mnt_want_write() several lines prior.  Invariant here is
"the places where we set FMODE_OPENED or FMODE_CREATED may
not have LOOKUP_RCU".  Having
        if (!(file->f_mode & (FMODE_OPENED | FMODE_CREATED))) {
		error = complete_walk(nd);
		if (error)
			return error;
	}
in the beginning of do_open() guarantees that for vfs_open()
call there.  All other places where that can happen are in
lookup_open() or called from it (via ->atomic_open() to
finish_open()).  And *that* definitely should not be done
in RCU mode, due to
        if (open_flag & O_CREAT)
                inode_lock(dir->d_inode);
        else
                inode_lock_shared(dir->d_inode);
        dentry = lookup_open(nd, file, op, got_write);
in the sole caller of that thing.  Again, can't grab a blocking
lock under rcu_read_lock().  Which is why we have this
                if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
                        return ERR_PTR(-ECHILD);
        } else {
                /* create side of things */
                if (nd->flags & LOOKUP_RCU) {
                        if (!try_to_unlazy(nd))
                                return ERR_PTR(-ECHILD);
                }
slightly prior to that call.  WARN_ON_ONCE is basically "lookup_fast()
has returned NULL and stayed in RCU mode", which should never happen.
try_to_unlazy() is straight "either switch to non-RCU mode or return an
error" - that's what this function is for.  No WARN_ON after that - it
would only obfuscate things.

*IF* you want to add debugging checks for that kind of stuff, just call
that assert_nonrcu(nd), make it check and whine and feel free to slap
them in reasonable amount of places (anything that makes a reader go
"for fuck sake, hadn't we (a) done that on the entry to this function
and (b) done IO since then, anyway?" is obviously not reasonable, etc. -
no more than common sense limitations).

Another common sense thing: extra asserts won't confuse syzkaller, but
they very much can confuse a human reader.  And any rewrites are done
by humans...

As for the double call of vfs_open_consume()...  You do realize that
the damn thing wouldn't have reached that check if it would ever have
cause to be triggered, right?  Seeing that we call
static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
{
        /* Pairs with smp_store_release() in do_idmap_mount(). */
        return smp_load_acquire(&mnt->mnt_idmap);
}
near the beginning of do_open(), ~20 lines before the place where
you added that check...

I'm not sure it makes sense to defend against a weird loop appearing
out of nowhere near the top of call chain, but if you want to do that,
this is not the right place for that.

Mateusz Guzik Aug. 7, 2024, 5:46 a.m. UTC | #10

On Wed, Aug 7, 2024 at 7:32 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Aug 07, 2024 at 05:57:07AM +0200, Mateusz Guzik wrote:
>
> [there'll be a separate reply with what I hope might be a usable
> approach]
>

At the moment I'm trying to get the spurious lockref cycle out of the
way because it is interfering with the actual thing I'm trying to do.

This entire pointer poisoning business is optional and the v2 I posted
does not do any of it.

With this in mind can you review v2? Here is the link for reference:
https://lore.kernel.org/linux-fsdevel/20240806163256.882140-1-mjguzik@gmail.com/T/#u

As for the feedback below, I'm going to circle back to it when I take
a stab at adding debug macros, but no ETA.

> > Yes, this is my understanding of the code and part of my compliant. :)
> >
> > Things just work(tm) as is with NULLified pointers, but this is error-prone.
>
> And carrying the arseloads of information (which ones do and which do not
> need to be dropped) is *less* error-prone?  Are you serious?
>
> > As a hypothetical suppose there is code executing some time after
> > vfs_open which looks at nd->path.dentry and by finding the pointer is
> > NULL it concludes the lookup did not work out.
> >
> > If such code exists *and* the pointer is poisoned in the above sense
> > (notably merely branching on it with kasan already traps), then the
> > consumer will be caught immediately during coverage testing by
> > syzkaller.
>
> You are much too optimistic about the quality of test coverage in this
> particular area.
>
> > If such code exists but the pointer is only nullified, one is only
> > going to find out the hard way when some functionality weirdly breaks.
>
> To do _useful_ asserts, one needs invariants to check.  And "we got
> to this check after having passed through that assignment at some
> earlier point" is not it.  That's why I'm asking questions about
> the state.
>
> The thing is, suppose I (or you, or somebody else) is trying to modify
> the whole thing.  There's a magical mystery assert in the way; what
> should be done with it?  Move it/split it/remove it/do something
> random and hope syzkaller won't catch anything?  If I can reason
> about the predicate being checked, I can at least start figuring out
> what should be done.  If not, it's bloody guaranteed to rot.
>
> This particular area (pathwalk machinery) has a nasty history of
> growing complexity once in a while, with following cleanups and
> massage to get it back into more or less tolerable shape.
> And refactoring that had been _painful_ - I'd done more than
> a few there.
>
> As far as I can tell, at the moment this flag (and yes, I've seen its
> removal in the next version) is "we'd called vfs_open_consume() at
> some point, then found ourselves still in RCU mode or we'd called
> vfs_open_consume() more than once".
>
> This is *NOT* a property of state; it's a property of execution
> history.  The first part is checked in the wrong place - one of
> the invariants (trivially verified by code examination) is that
> LOOKUP_RCU is never regained after it had been dropped.  The
> only place where it can be set is path_init() and calling _that_
> between path_init() and terminate_walk() would be
>         a) a hard and very visible bug
>         b) would've wiped your flag anyway.
> So that part of the check is basically "we are not calling
> vfs_open_consume() under rcu_read_lock()".  Which is definitely
> a desirable property, since ->open() can block.  So can
> mnt_want_write() several lines prior.  Invariant here is
> "the places where we set FMODE_OPENED or FMODE_CREATED may
> not have LOOKUP_RCU".  Having
>         if (!(file->f_mode & (FMODE_OPENED | FMODE_CREATED))) {
>                 error = complete_walk(nd);
>                 if (error)
>                         return error;
>         }
> in the beginning of do_open() guarantees that for vfs_open()
> call there.  All other places where that can happen are in
> lookup_open() or called from it (via ->atomic_open() to
> finish_open()).  And *that* definitely should not be done
> in RCU mode, due to
>         if (open_flag & O_CREAT)
>                 inode_lock(dir->d_inode);
>         else
>                 inode_lock_shared(dir->d_inode);
>         dentry = lookup_open(nd, file, op, got_write);
> in the sole caller of that thing.  Again, can't grab a blocking
> lock under rcu_read_lock().  Which is why we have this
>                 if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
>                         return ERR_PTR(-ECHILD);
>         } else {
>                 /* create side of things */
>                 if (nd->flags & LOOKUP_RCU) {
>                         if (!try_to_unlazy(nd))
>                                 return ERR_PTR(-ECHILD);
>                 }
> slightly prior to that call.  WARN_ON_ONCE is basically "lookup_fast()
> has returned NULL and stayed in RCU mode", which should never happen.
> try_to_unlazy() is straight "either switch to non-RCU mode or return an
> error" - that's what this function is for.  No WARN_ON after that - it
> would only obfuscate things.
>
> *IF* you want to add debugging checks for that kind of stuff, just call
> that assert_nonrcu(nd), make it check and whine and feel free to slap
> them in reasonable amount of places (anything that makes a reader go
> "for fuck sake, hadn't we (a) done that on the entry to this function
> and (b) done IO since then, anyway?" is obviously not reasonable, etc. -
> no more than common sense limitations).
>
> Another common sense thing: extra asserts won't confuse syzkaller, but
> they very much can confuse a human reader.  And any rewrites are done
> by humans...
>
> As for the double call of vfs_open_consume()...  You do realize that
> the damn thing wouldn't have reached that check if it would ever have
> cause to be triggered, right?  Seeing that we call
> static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
> {
>         /* Pairs with smp_store_release() in do_idmap_mount(). */
>         return smp_load_acquire(&mnt->mnt_idmap);
> }
> near the beginning of do_open(), ~20 lines before the place where
> you added that check...
>
> I'm not sure it makes sense to defend against a weird loop appearing
> out of nowhere near the top of call chain, but if you want to do that,
> this is not the right place for that.

Al Viro Aug. 7, 2024, 6:23 a.m. UTC | #11

After having looked at the problem, how about the following
series:

1/5) lift path_get() *AND* path_put() out of do_dentry_open()
into the callers.  The latter - conditional upon "do_dentry_open()
has not set FMODE_OPENED".  Equivalent transformation.

2/5) move path_get() we'd lifted into the callers past the
call of do_dentry_open(), conditionally collapse it with path_put().
You'd get e.g.
int vfs_open(const struct path *path, struct file *file)
{
        int ret;

        file->f_path = *path;
        ret = do_dentry_open(file, NULL);
        if (!ret) {
                /*
                 * Once we return a file with FMODE_OPENED, __fput() will call
                 * fsnotify_close(), so we need fsnotify_open() here for
                 * symmetry.
                 */
                fsnotify_open(file);
        }
	if (file->f_mode & FMODE_OPENED)
		path_get(path);
        return ret;
}

Equivalent transformation, provided that nobody is playing silly
buggers with reassigning ->f_path in their ->open() instances.
They *really* should not - if anyone does, we'd better catch them
and fix them^Wtheir code.  Incidentally, if we find any such,
we have a damn good reason to add asserts in the callers.  As
in, "if do_dentry_open() has set FMODE_OPENED, it would bloody
better *not* modify ->f_path".  <greps> Nope, nobody is that
insane.

3/5) split vfs_open_consume() out of vfs_open() (possibly
named vfs_open_borrow()), replace the call in do_open() with
calling the new function.

Trivially equivalent transformation.

4/5) Remove conditional path_get() from vfs_open_consume()
and finish_open().  Add
		if (file->f_mode & FMODE_OPENED)
			path_get(&nd->path);
before terminate_walk(nd); in path_openat().

Equivalent transformation - see
        if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
                dput(nd->path.dentry);
                nd->path.dentry = dentry;
                return NULL;
        }
in lookup_open() (which is where nd->path gets in sync with what
had been given to do_dentry_open() in finish_open()); in case
of vfs_open_consume() in do_open() it's in sync from the very
beginning.  And we never modify nd->path after those points.
So we can move grabbing it downstream, keeping it under the
same condition (which also happens to be true only if we'd
called do_dentry_open(), so for all other paths through the
whole thing it's a no-op.

5/5) replace
		if (file->f_mode & FMODE_OPENED)
			path_get(&nd->path);
		terminate_walk(nd);
with
		if (file->f_mode & FMODE_OPENED) {
			nd->path.mnt = NULL;
			nd->path.dentry = NULL;
		}
		terminate_walk(nd);
Again, an obvious equivalent transformation.

Al Viro Aug. 7, 2024, 6:33 a.m. UTC | #12

On Wed, Aug 07, 2024 at 07:23:00AM +0100, Al Viro wrote:
> 	After having looked at the problem, how about the following
> series:
> 
> 1/5) lift path_get() *AND* path_put() out of do_dentry_open()
> into the callers.  The latter - conditional upon "do_dentry_open()
> has not set FMODE_OPENED".  Equivalent transformation.
> 
> 2/5) move path_get() we'd lifted into the callers past the
> call of do_dentry_open(), conditionally collapse it with path_put().
> You'd get e.g.
> int vfs_open(const struct path *path, struct file *file)
> {
>         int ret;
> 
>         file->f_path = *path;
>         ret = do_dentry_open(file, NULL);
>         if (!ret) {
>                 /*
>                  * Once we return a file with FMODE_OPENED, __fput() will call
>                  * fsnotify_close(), so we need fsnotify_open() here for
>                  * symmetry.
>                  */
>                 fsnotify_open(file);
>         }
> 	if (file->f_mode & FMODE_OPENED)
> 		path_get(path);
>         return ret;
> }
> 
> Equivalent transformation, provided that nobody is playing silly
> buggers with reassigning ->f_path in their ->open() instances.
> They *really* should not - if anyone does, we'd better catch them
> and fix them^Wtheir code.  Incidentally, if we find any such,
> we have a damn good reason to add asserts in the callers.  As
> in, "if do_dentry_open() has set FMODE_OPENED, it would bloody
> better *not* modify ->f_path".  <greps> Nope, nobody is that
> insane.
> 
> 3/5) split vfs_open_consume() out of vfs_open() (possibly
> named vfs_open_borrow()), replace the call in do_open() with
> calling the new function.
> 
> Trivially equivalent transformation.
> 
> 4/5) Remove conditional path_get() from vfs_open_consume()
> and finish_open().  Add
> 		if (file->f_mode & FMODE_OPENED)
> 			path_get(&nd->path);
> before terminate_walk(nd); in path_openat().
> 
> Equivalent transformation - see
>         if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
>                 dput(nd->path.dentry);
>                 nd->path.dentry = dentry;
>                 return NULL;
>         }
> in lookup_open() (which is where nd->path gets in sync with what
> had been given to do_dentry_open() in finish_open()); in case
> of vfs_open_consume() in do_open() it's in sync from the very
> beginning.  And we never modify nd->path after those points.
> So we can move grabbing it downstream, keeping it under the
> same condition (which also happens to be true only if we'd
> called do_dentry_open(), so for all other paths through the
> whole thing it's a no-op.
> 
> 5/5) replace
> 		if (file->f_mode & FMODE_OPENED)
> 			path_get(&nd->path);
> 		terminate_walk(nd);
> with
> 		if (file->f_mode & FMODE_OPENED) {
> 			nd->path.mnt = NULL;
> 			nd->path.dentry = NULL;
> 		}
> 		terminate_walk(nd);
> Again, an obvious equivalent transformation.

BTW, similar to that, with that we could turn do_o_path()
into

        struct path path;
        int error = path_lookupat(nd, flags, &path);
        if (!error) {
                audit_inode(nd->name, path.dentry, 0);
                error = vfs_open_borrow(&path, file);
		if (!(file->f_mode & FMODE_OPENED))
			path_put(&path);
        }
        return error;
}

and perhaps do something similar in the vicinity of
vfs_tmpfile() / do_o_tmpfile().

Mateusz Guzik Aug. 7, 2024, 6:40 a.m. UTC | #13

On Wed, Aug 7, 2024 at 8:33 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Aug 07, 2024 at 07:23:00AM +0100, Al Viro wrote:
> >       After having looked at the problem, how about the following
> > series:
> >
> > 1/5) lift path_get() *AND* path_put() out of do_dentry_open()
> > into the callers.  The latter - conditional upon "do_dentry_open()
> > has not set FMODE_OPENED".  Equivalent transformation.
> >
> > 2/5) move path_get() we'd lifted into the callers past the
> > call of do_dentry_open(), conditionally collapse it with path_put().
> > You'd get e.g.
> > int vfs_open(const struct path *path, struct file *file)
> > {
> >         int ret;
> >
> >         file->f_path = *path;
> >         ret = do_dentry_open(file, NULL);
> >         if (!ret) {
> >                 /*
> >                  * Once we return a file with FMODE_OPENED, __fput() will call
> >                  * fsnotify_close(), so we need fsnotify_open() here for
> >                  * symmetry.
> >                  */
> >                 fsnotify_open(file);
> >         }
> >       if (file->f_mode & FMODE_OPENED)
> >               path_get(path);
> >         return ret;
> > }
> >
> > Equivalent transformation, provided that nobody is playing silly
> > buggers with reassigning ->f_path in their ->open() instances.
> > They *really* should not - if anyone does, we'd better catch them
> > and fix them^Wtheir code.  Incidentally, if we find any such,
> > we have a damn good reason to add asserts in the callers.  As
> > in, "if do_dentry_open() has set FMODE_OPENED, it would bloody
> > better *not* modify ->f_path".  <greps> Nope, nobody is that
> > insane.
> >
> > 3/5) split vfs_open_consume() out of vfs_open() (possibly
> > named vfs_open_borrow()), replace the call in do_open() with
> > calling the new function.
> >
> > Trivially equivalent transformation.
> >
> > 4/5) Remove conditional path_get() from vfs_open_consume()
> > and finish_open().  Add
> >               if (file->f_mode & FMODE_OPENED)
> >                       path_get(&nd->path);
> > before terminate_walk(nd); in path_openat().
> >
> > Equivalent transformation - see
> >         if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
> >                 dput(nd->path.dentry);
> >                 nd->path.dentry = dentry;
> >                 return NULL;
> >         }
> > in lookup_open() (which is where nd->path gets in sync with what
> > had been given to do_dentry_open() in finish_open()); in case
> > of vfs_open_consume() in do_open() it's in sync from the very
> > beginning.  And we never modify nd->path after those points.
> > So we can move grabbing it downstream, keeping it under the
> > same condition (which also happens to be true only if we'd
> > called do_dentry_open(), so for all other paths through the
> > whole thing it's a no-op.
> >
> > 5/5) replace
> >               if (file->f_mode & FMODE_OPENED)
> >                       path_get(&nd->path);
> >               terminate_walk(nd);
> > with
> >               if (file->f_mode & FMODE_OPENED) {
> >                       nd->path.mnt = NULL;
> >                       nd->path.dentry = NULL;
> >               }
> >               terminate_walk(nd);
> > Again, an obvious equivalent transformation.
>
> BTW, similar to that, with that we could turn do_o_path()
> into
>
>         struct path path;
>         int error = path_lookupat(nd, flags, &path);
>         if (!error) {
>                 audit_inode(nd->name, path.dentry, 0);
>                 error = vfs_open_borrow(&path, file);
>                 if (!(file->f_mode & FMODE_OPENED))
>                         path_put(&path);
>         }
>         return error;
> }
>
> and perhaps do something similar in the vicinity of
> vfs_tmpfile() / do_o_tmpfile().

That's quite a bit of churn, but if you insist I can take a stab.

fwiw I do think the weird error condition in do_dentry_open can be
used to simplify stuff, which I still do in my v2.

with my approach there is never a path_put needed to backpedal (it was
already done by do_dentry_open *or* it is going to be done by whoever
doing last fput)

then do_o_path would be:
        struct path path;
        int error = path_lookupat(nd, flags, &path);
        if (!error) {
                audit_inode(nd->name, path.dentry, 0);
                error = vfs_open_consume(&path, file);
        }
        return error;

Al Viro Aug. 7, 2024, 7:05 a.m. UTC | #14

On Wed, Aug 07, 2024 at 08:40:28AM +0200, Mateusz Guzik wrote:
> On Wed, Aug 7, 2024 at 8:33 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Wed, Aug 07, 2024 at 07:23:00AM +0100, Al Viro wrote:
> > >       After having looked at the problem, how about the following
> > > series:
> > >
> > > 1/5) lift path_get() *AND* path_put() out of do_dentry_open()
> > > into the callers.  The latter - conditional upon "do_dentry_open()
> > > has not set FMODE_OPENED".  Equivalent transformation.
> > >
> > > 2/5) move path_get() we'd lifted into the callers past the
> > > call of do_dentry_open(), conditionally collapse it with path_put().
> > > You'd get e.g.
> > > int vfs_open(const struct path *path, struct file *file)
> > > {
> > >         int ret;
> > >
> > >         file->f_path = *path;
> > >         ret = do_dentry_open(file, NULL);
> > >         if (!ret) {
> > >                 /*
> > >                  * Once we return a file with FMODE_OPENED, __fput() will call
> > >                  * fsnotify_close(), so we need fsnotify_open() here for
> > >                  * symmetry.
> > >                  */
> > >                 fsnotify_open(file);
> > >         }
> > >       if (file->f_mode & FMODE_OPENED)
> > >               path_get(path);
> > >         return ret;
> > > }
> > >
> > > Equivalent transformation, provided that nobody is playing silly
> > > buggers with reassigning ->f_path in their ->open() instances.
> > > They *really* should not - if anyone does, we'd better catch them
> > > and fix them^Wtheir code.  Incidentally, if we find any such,
> > > we have a damn good reason to add asserts in the callers.  As
> > > in, "if do_dentry_open() has set FMODE_OPENED, it would bloody
> > > better *not* modify ->f_path".  <greps> Nope, nobody is that
> > > insane.
> > >
> > > 3/5) split vfs_open_consume() out of vfs_open() (possibly
> > > named vfs_open_borrow()), replace the call in do_open() with
> > > calling the new function.
> > >
> > > Trivially equivalent transformation.
> > >
> > > 4/5) Remove conditional path_get() from vfs_open_consume()
> > > and finish_open().  Add
> > >               if (file->f_mode & FMODE_OPENED)
> > >                       path_get(&nd->path);
> > > before terminate_walk(nd); in path_openat().
> > >
> > > Equivalent transformation - see
> > >         if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
> > >                 dput(nd->path.dentry);
> > >                 nd->path.dentry = dentry;
> > >                 return NULL;
> > >         }
> > > in lookup_open() (which is where nd->path gets in sync with what
> > > had been given to do_dentry_open() in finish_open()); in case
> > > of vfs_open_consume() in do_open() it's in sync from the very
> > > beginning.  And we never modify nd->path after those points.
> > > So we can move grabbing it downstream, keeping it under the
> > > same condition (which also happens to be true only if we'd
> > > called do_dentry_open(), so for all other paths through the
> > > whole thing it's a no-op.
> > >
> > > 5/5) replace
> > >               if (file->f_mode & FMODE_OPENED)
> > >                       path_get(&nd->path);
> > >               terminate_walk(nd);
> > > with
> > >               if (file->f_mode & FMODE_OPENED) {
> > >                       nd->path.mnt = NULL;
> > >                       nd->path.dentry = NULL;
> > >               }
> > >               terminate_walk(nd);
> > > Again, an obvious equivalent transformation.
> >
> > BTW, similar to that, with that we could turn do_o_path()
> > into
> >
> >         struct path path;
> >         int error = path_lookupat(nd, flags, &path);
> >         if (!error) {
> >                 audit_inode(nd->name, path.dentry, 0);
> >                 error = vfs_open_borrow(&path, file);
> >                 if (!(file->f_mode & FMODE_OPENED))
> >                         path_put(&path);
> >         }
> >         return error;
> > }
> >
> > and perhaps do something similar in the vicinity of
> > vfs_tmpfile() / do_o_tmpfile().
> 
> That's quite a bit of churn, but if you insist I can take a stab.

What I have in mind is something along the lines of COMPLETELY UNTESTED
git.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git #experimental-for-mateusz

It needs saner commit messages, references to your analysis of the
overhead, quite possibly a finer carve-up, etc.  And it's really
completely untested - it builds, but I hadn't even tried to boot
the sucker, let alone give it any kind of beating, so consider that
as a quick illustration (slapped together at 3am, on top of 5 hours of
sleep yesterday) to what I'd been talking about and no more than that.

Mateusz Guzik Aug. 7, 2024, 7:22 a.m. UTC | #15

On Wed, Aug 7, 2024 at 9:05 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Aug 07, 2024 at 08:40:28AM +0200, Mateusz Guzik wrote:
> > On Wed, Aug 7, 2024 at 8:33 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Wed, Aug 07, 2024 at 07:23:00AM +0100, Al Viro wrote:
> > > >       After having looked at the problem, how about the following
> > > > series:
> > > >
> > > > 1/5) lift path_get() *AND* path_put() out of do_dentry_open()
> > > > into the callers.  The latter - conditional upon "do_dentry_open()
> > > > has not set FMODE_OPENED".  Equivalent transformation.
> > > >
> > > > 2/5) move path_get() we'd lifted into the callers past the
> > > > call of do_dentry_open(), conditionally collapse it with path_put().
> > > > You'd get e.g.
> > > > int vfs_open(const struct path *path, struct file *file)
> > > > {
> > > >         int ret;
> > > >
> > > >         file->f_path = *path;
> > > >         ret = do_dentry_open(file, NULL);
> > > >         if (!ret) {
> > > >                 /*
> > > >                  * Once we return a file with FMODE_OPENED, __fput() will call
> > > >                  * fsnotify_close(), so we need fsnotify_open() here for
> > > >                  * symmetry.
> > > >                  */
> > > >                 fsnotify_open(file);
> > > >         }
> > > >       if (file->f_mode & FMODE_OPENED)
> > > >               path_get(path);
> > > >         return ret;
> > > > }
> > > >
> > > > Equivalent transformation, provided that nobody is playing silly
> > > > buggers with reassigning ->f_path in their ->open() instances.
> > > > They *really* should not - if anyone does, we'd better catch them
> > > > and fix them^Wtheir code.  Incidentally, if we find any such,
> > > > we have a damn good reason to add asserts in the callers.  As
> > > > in, "if do_dentry_open() has set FMODE_OPENED, it would bloody
> > > > better *not* modify ->f_path".  <greps> Nope, nobody is that
> > > > insane.
> > > >
> > > > 3/5) split vfs_open_consume() out of vfs_open() (possibly
> > > > named vfs_open_borrow()), replace the call in do_open() with
> > > > calling the new function.
> > > >
> > > > Trivially equivalent transformation.
> > > >
> > > > 4/5) Remove conditional path_get() from vfs_open_consume()
> > > > and finish_open().  Add
> > > >               if (file->f_mode & FMODE_OPENED)
> > > >                       path_get(&nd->path);
> > > > before terminate_walk(nd); in path_openat().
> > > >
> > > > Equivalent transformation - see
> > > >         if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
> > > >                 dput(nd->path.dentry);
> > > >                 nd->path.dentry = dentry;
> > > >                 return NULL;
> > > >         }
> > > > in lookup_open() (which is where nd->path gets in sync with what
> > > > had been given to do_dentry_open() in finish_open()); in case
> > > > of vfs_open_consume() in do_open() it's in sync from the very
> > > > beginning.  And we never modify nd->path after those points.
> > > > So we can move grabbing it downstream, keeping it under the
> > > > same condition (which also happens to be true only if we'd
> > > > called do_dentry_open(), so for all other paths through the
> > > > whole thing it's a no-op.
> > > >
> > > > 5/5) replace
> > > >               if (file->f_mode & FMODE_OPENED)
> > > >                       path_get(&nd->path);
> > > >               terminate_walk(nd);
> > > > with
> > > >               if (file->f_mode & FMODE_OPENED) {
> > > >                       nd->path.mnt = NULL;
> > > >                       nd->path.dentry = NULL;
> > > >               }
> > > >               terminate_walk(nd);
> > > > Again, an obvious equivalent transformation.
> > >
> > > BTW, similar to that, with that we could turn do_o_path()
> > > into
> > >
> > >         struct path path;
> > >         int error = path_lookupat(nd, flags, &path);
> > >         if (!error) {
> > >                 audit_inode(nd->name, path.dentry, 0);
> > >                 error = vfs_open_borrow(&path, file);
> > >                 if (!(file->f_mode & FMODE_OPENED))
> > >                         path_put(&path);
> > >         }
> > >         return error;
> > > }
> > >
> > > and perhaps do something similar in the vicinity of
> > > vfs_tmpfile() / do_o_tmpfile().
> >
> > That's quite a bit of churn, but if you insist I can take a stab.
>
> What I have in mind is something along the lines of COMPLETELY UNTESTED
> git.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git #experimental-for-mateusz
>
> It needs saner commit messages, references to your analysis of the
> overhead, quite possibly a finer carve-up, etc.  And it's really
> completely untested - it builds, but I hadn't even tried to boot
> the sucker, let alone give it any kind of beating, so consider that
> as a quick illustration (slapped together at 3am, on top of 5 hours of
> sleep yesterday) to what I'd been talking about and no more than that.

Well it's your call, you wrote the thing and I need the problem out of
the way, so I'm not going to argue about the patchset.

I verified it boots and provides the expected perf win [I have to
repeat it is highly variable between re-runs because of ever-changing
offsets between different inode allocations resulting in different
false-sharing problems; i'm going to separately mail about that]

I think it will be fine to copy the result from my commit message and
denote it's from a different variant achieving the same goal.

That said feel free to use my commit message in whatever capacity,
there is no need to mention me.

Thanks for sorting this out.

Al Viro Aug. 7, 2024, 7:52 a.m. UTC | #16

On Wed, Aug 07, 2024 at 09:22:59AM +0200, Mateusz Guzik wrote:

> Well it's your call, you wrote the thing and I need the problem out of
> the way, so I'm not going to argue about the patchset.
> 
> I verified it boots and provides the expected perf win [I have to
> repeat it is highly variable between re-runs because of ever-changing
> offsets between different inode allocations resulting in different
> false-sharing problems; i'm going to separately mail about that]
> 
> I think it will be fine to copy the result from my commit message and
> denote it's from a different variant achieving the same goal.
> 
> That said feel free to use my commit message in whatever capacity,
> there is no need to mention me.

Original analysis had been yours, same for "let's change the calling
conventions for do_dentry_open() wrt path refcounting", same for
the version I'd transformed into that...  FWIW, my approach to
that had been along the lines of "how do we get it consistently,
whether we go through vfs_open() or finish_open()", which pretty
much required keeping hold on the path until just before
terminate_walk().  This "transfer from nd->path to whatever borrowed
it" was copied from path_openat() (BTW, might be worth an inlined helper
next to terminate_walk(), just to document that it's not an accidental
property of terminate_walk()) and that was pretty much it.

Co-developed-by: seems to be the usual notation these days for
such situations - that really had been incremental changes.

Anyway, I really need to get some sleep before writing something
usable as commit messages...

Mateusz Guzik Aug. 7, 2024, 7:59 a.m. UTC | #17

On Wed, Aug 7, 2024 at 9:52 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Aug 07, 2024 at 09:22:59AM +0200, Mateusz Guzik wrote:
>
> > Well it's your call, you wrote the thing and I need the problem out of
> > the way, so I'm not going to argue about the patchset.
> >
> > I verified it boots and provides the expected perf win [I have to
> > repeat it is highly variable between re-runs because of ever-changing
> > offsets between different inode allocations resulting in different
> > false-sharing problems; i'm going to separately mail about that]
> >
> > I think it will be fine to copy the result from my commit message and
> > denote it's from a different variant achieving the same goal.
> >
> > That said feel free to use my commit message in whatever capacity,
> > there is no need to mention me.
>
> Original analysis had been yours, same for "let's change the calling
> conventions for do_dentry_open() wrt path refcounting", same for
> the version I'd transformed into that...  FWIW, my approach to
> that had been along the lines of "how do we get it consistently,
> whether we go through vfs_open() or finish_open()", which pretty
> much required keeping hold on the path until just before
> terminate_walk().  This "transfer from nd->path to whatever borrowed
> it" was copied from path_openat() (BTW, might be worth an inlined helper
> next to terminate_walk(), just to document that it's not an accidental
> property of terminate_walk()) and that was pretty much it.
>
> Co-developed-by: seems to be the usual notation these days for
> such situations - that really had been incremental changes.
>
> Anyway, I really need to get some sleep before writing something
> usable as commit messages...

Nobody is getting a Turing award for noticing the extra ref trip and eliding it.

Co-developed-by is fine with me if you insist on sharing credit.

My only objective here is to expedite the fix so that I can get on
with speeding up refcount management. :)

Mateusz Guzik Aug. 7, 2024, 9:50 a.m. UTC | #18

This eventually broke:

[ 1297.949437] BUG: kernel NULL pointer dereference, address: 0000000000000028
[ 1297.951840] #PF: supervisor read access in kernel mode
[ 1297.953548] #PF: error_code(0x0000) - not-present page
[ 1297.954984] PGD 109590067 P4D 0
[ 1297.955533] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1297.956340] CPU: 18 UID: 0 PID: 1113 Comm: cron Not tainted
6.11.0-rc1-00003-g80efb96e5cd4 #291
[ 1297.958090] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[ 1297.959926] RIP: 0010:vfs_tmpfile+0x162/0x230
[ 1297.960666] Code: b2 8b 55 48 83 e2 20 83 fa 01 19 ff 81 e7 00 f0
ff ff 81 c7 20 10 00 00 a9 00 40 00
04 75 90 48 8b 85 a0 00 00 00 4c 8b 48 30 <49> 8b 51 28 48 8b 92 b8 03
00 00 48 85 d2 0f 84 71 ff ff ff 48 8b
[ 1297.963682] RSP: 0018:ff38d126c19c3cc0 EFLAGS: 00010246
[ 1297.964549] RAX: ff2c2c8c89a3b980 RBX: 0000000000000000 RCX: 0000000000000002
[ 1297.965714] RDX: 0000000000000000 RSI: ff2c2c8c89a3ba30 RDI: 0000000000000020
[ 1297.966872] RBP: ff2c2c8c91917b00 R08: 0000000000000000 R09: 0000000000000000
[ 1297.968046] R10: 0000000000000000 R11: ff2c2c8c91cd23f0 R12: ffffffffb7b91e20
[ 1297.969202] R13: ff38d126c19c3d48 R14: ff2c2c8c89a3b980 R15: ff2c2c8c89a00a58
[ 1297.970364] FS:  00007ff81cb63840(0000) GS:ff2c2c91e7a80000(0000)
knlGS:0000000000000000
[ 1297.971694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1297.972645] CR2: 0000000000000028 CR3: 000000010cf58001 CR4: 0000000000371ef0
[ 1297.973825] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1297.974996] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1297.976175] Call Trace:
[ 1297.976609]  <TASK>
[ 1297.976978]  ? __die+0x23/0x70
[ 1297.977501]  ? page_fault_oops+0x15c/0x480
[ 1297.978191]  ? jbd2_journal_stop+0x13c/0x2d0
[ 1297.978903]  ? exc_page_fault+0x78/0x170
[ 1297.979564]  ? asm_exc_page_fault+0x26/0x30
[ 1297.980262]  ? vfs_tmpfile+0x162/0x230
[ 1297.980891]  path_openat+0xc70/0x12b0
[ 1297.981507]  ? __count_memcg_events+0x58/0xf0
[ 1297.982233]  ? handle_mm_fault+0xb9/0x260
[ 1297.982903]  ? do_user_addr_fault+0x2ed/0x6e0
[ 1297.983636]  do_filp_open+0xb0/0x150
[ 1297.984241]  do_sys_openat2+0x91/0xc0
[ 1297.984855]  __x64_sys_openat+0x57/0xa0
[ 1297.985504]  do_syscall_64+0x52/0x150
[ 1297.986128]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

tripping ip:
vfs_tmpfile+0x162/0x230:
fsnotify_parent at include/linux/fsnotify.h:81
(inlined by) fsnotify_file at include/linux/fsnotify.h:131
(inlined by) fsnotify_open at include/linux/fsnotify.h:401
(inlined by) vfs_tmpfile at fs/namei.c:3781

Al Viro Aug. 7, 2024, 12:43 p.m. UTC | #19

On Wed, Aug 07, 2024 at 11:50:50AM +0200, Mateusz Guzik wrote:

> tripping ip:
> vfs_tmpfile+0x162/0x230:
> fsnotify_parent at include/linux/fsnotify.h:81
> (inlined by) fsnotify_file at include/linux/fsnotify.h:131
> (inlined by) fsnotify_open at include/linux/fsnotify.h:401
> (inlined by) vfs_tmpfile at fs/namei.c:3781

Try this for incremental; missed the fact that finish_open() is
used by ->tmpfile() instances, not just ->atomic_open().

Al, crawling back to sleep...

diff --git a/fs/namei.c b/fs/namei.c
index 95345a5beb3a..0536907e8e79 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3776,7 +3776,10 @@ int vfs_tmpfile(struct mnt_idmap *idmap,
 	file->f_path.dentry = child;
 	mode = vfs_prepare_mode(idmap, dir, mode, mode, mode);
 	error = dir->i_op->tmpfile(idmap, dir, file, mode);
-	dput(child);
+	if (file->f_mode & FMODE_OPENED)
+		mntget(parentpath->mnt);
+	else
+		dput(child);
 	if (file->f_mode & FMODE_OPENED)
 		fsnotify_open(file);
 	if (error)

Al Viro Aug. 7, 2024, 8:38 p.m. UTC | #20

On Wed, Aug 07, 2024 at 01:43:48PM +0100, Al Viro wrote:
> On Wed, Aug 07, 2024 at 11:50:50AM +0200, Mateusz Guzik wrote:
> 
> > tripping ip:
> > vfs_tmpfile+0x162/0x230:
> > fsnotify_parent at include/linux/fsnotify.h:81
> > (inlined by) fsnotify_file at include/linux/fsnotify.h:131
> > (inlined by) fsnotify_open at include/linux/fsnotify.h:401
> > (inlined by) vfs_tmpfile at fs/namei.c:3781
> 
> Try this for incremental; missed the fact that finish_open() is
> used by ->tmpfile() instances, not just ->atomic_open().
> 
> Al, crawling back to sleep...

I _really_ hate ->atomic_open() calling conventions; FWIW, I suspect
that in the current form this series is OK, but only because none
of the existing instances call finish_open() on a preexisting
aliases found by d_splice_alias().  And control flow in the
instances (especially the cleanup paths) is bloody awful...

We never got it quite right, and while the previous iterations of
the calling conventions for that methods had been worse, it's still
nasty in the current form ;-/

Oh, well - review of those has been long overdue.

Mateusz Guzik Aug. 8, 2024, 6:26 a.m. UTC | #21

On Wed, Aug 7, 2024 at 2:43 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Aug 07, 2024 at 11:50:50AM +0200, Mateusz Guzik wrote:
>
> > tripping ip:
> > vfs_tmpfile+0x162/0x230:
> > fsnotify_parent at include/linux/fsnotify.h:81
> > (inlined by) fsnotify_file at include/linux/fsnotify.h:131
> > (inlined by) fsnotify_open at include/linux/fsnotify.h:401
> > (inlined by) vfs_tmpfile at fs/namei.c:3781
>
> Try this for incremental; missed the fact that finish_open() is
> used by ->tmpfile() instances, not just ->atomic_open().
>
> Al, crawling back to sleep...
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 95345a5beb3a..0536907e8e79 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3776,7 +3776,10 @@ int vfs_tmpfile(struct mnt_idmap *idmap,
>         file->f_path.dentry = child;
>         mode = vfs_prepare_mode(idmap, dir, mode, mode, mode);
>         error = dir->i_op->tmpfile(idmap, dir, file, mode);
> -       dput(child);
> +       if (file->f_mode & FMODE_OPENED)
> +               mntget(parentpath->mnt);
> +       else
> +               dput(child);
>         if (file->f_mode & FMODE_OPENED)
>                 fsnotify_open(file);
>         if (error)

That seems to have worked, but my test rig went down due to surprise
PDU maintenance (and it is still down) and I was unable to give the
patch a more serious beating.

Mateusz Guzik Aug. 20, 2024, 11:38 a.m. UTC | #22

do you plan to submit this to next?

anything this is waiting for?

my quick skim suggests this only needs more testing (and maybe a review)

On Wed, Aug 7, 2024 at 10:38 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Aug 07, 2024 at 01:43:48PM +0100, Al Viro wrote:
> > On Wed, Aug 07, 2024 at 11:50:50AM +0200, Mateusz Guzik wrote:
> >
> > > tripping ip:
> > > vfs_tmpfile+0x162/0x230:
> > > fsnotify_parent at include/linux/fsnotify.h:81
> > > (inlined by) fsnotify_file at include/linux/fsnotify.h:131
> > > (inlined by) fsnotify_open at include/linux/fsnotify.h:401
> > > (inlined by) vfs_tmpfile at fs/namei.c:3781
> >
> > Try this for incremental; missed the fact that finish_open() is
> > used by ->tmpfile() instances, not just ->atomic_open().
> >
> > Al, crawling back to sleep...
>
> I _really_ hate ->atomic_open() calling conventions; FWIW, I suspect
> that in the current form this series is OK, but only because none
> of the existing instances call finish_open() on a preexisting
> aliases found by d_splice_alias().  And control flow in the
> instances (especially the cleanup paths) is bloody awful...
>
> We never got it quite right, and while the previous iterations of
> the calling conventions for that methods had been worse, it's still
> nasty in the current form ;-/
>
> Oh, well - review of those has been long overdue.

Al Viro Aug. 22, 2024, 12:33 a.m. UTC | #23

On Tue, Aug 20, 2024 at 01:38:05PM +0200, Mateusz Guzik wrote:
> do you plan to submit this to next?
> 
> anything this is waiting for?
> 
> my quick skim suggests this only needs more testing (and maybe a review)

OK, let me post the current variant of that series, then we'll see
if anyone screams...

vfs: avoid spurious dentry ref/unref cycle on open

Commit Message

Comments

Patch