diff mbox series

[4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open

Message ID 20201212165105.902688-5-axboe@kernel.dk (mailing list archive)
State New, archived
Headers show
Series fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK | expand

Commit Message

Jens Axboe Dec. 12, 2020, 4:51 p.m. UTC
We handle it for the path resolution itself, but we should also factor
it in for open_last_lookups() and tmpfile open.

We don't allow RESOLVE_NONBLOCK with O_TRUNC, so that case we can safely
ignore.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/namei.c | 39 +++++++++++++++++++++++++++++++++------
 1 file changed, 33 insertions(+), 6 deletions(-)

Comments

Al Viro Dec. 12, 2020, 5:25 p.m. UTC | #1
On Sat, Dec 12, 2020 at 09:51:04AM -0700, Jens Axboe wrote:

>  	struct dentry *dentry;
> @@ -3164,17 +3165,38 @@ static const char *open_last_lookups(struct nameidata *nd,
>  	}
>  
>  	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
> -		got_write = !mnt_want_write(nd->path.mnt);
> +		if (nonblock) {
> +			got_write = !mnt_want_write_trylock(nd->path.mnt);
> +			if (!got_write)
> +				return ERR_PTR(-EAGAIN);
> +		} else {
> +			got_write = !mnt_want_write(nd->path.mnt);
> +		}
>  		/*
>  		 * do _not_ fail yet - we might not need that or fail with
>  		 * a different error; let lookup_open() decide; we'll be
>  		 * dropping this one anyway.
>  		 */

Read the comment immediately after the place you are modifying.  Really.
To elaborate: consider e.g. the case when /mnt/foo is a symlink to /tmp/bar,
/mnt is mounted r/o and you are asking to open /mnt/foo for write.  We get
to /mnt, call open_last_lookups() to resolve the last component ("foo") in
it.  And find that the sucker happens to be an absolute symlink, so we need
to jump into root and resolve "tmp/bar" staring from there.  Which is what
the loop in the caller is about.  Eventually we'll get to /tmp and call
open_last_lookups() to resolve "bar" there.  /tmp needs to be mounted
writable; /mnt does not.

Sure, you bail out only in nonblock case, so normally the next time around
it'll go sanely.  But you are making the damn thing (and it's still too
convoluted, even after a lot of massage towards sanity) harder to reason
about.

> +	if (open_flag & O_CREAT) {
> +		if (nonblock) {
> +			if (!inode_trylock(dir->d_inode)) {
> +				dentry = ERR_PTR(-EAGAIN);
> +				goto drop_write;
> +			}
> +		} else {
> +			inode_lock(dir->d_inode);
> +		}
> +	} else {
> +		if (nonblock) {
> +			if (!inode_trylock_shared(dir->d_inode)) {
> +				dentry = ERR_PTR(-EAGAIN);
> +				goto drop_write;
> +			}
> +		} else {
> +			inode_lock_shared(dir->d_inode);
> +		}
> +	}
>  	dentry = lookup_open(nd, file, op, got_write);

... as well as more bloated, with no obvious benefits (take a look
at lookup_open()).
Jens Axboe Dec. 12, 2020, 5:47 p.m. UTC | #2
On 12/12/20 10:25 AM, Al Viro wrote:
> On Sat, Dec 12, 2020 at 09:51:04AM -0700, Jens Axboe wrote:
> 
>>  	struct dentry *dentry;
>> @@ -3164,17 +3165,38 @@ static const char *open_last_lookups(struct nameidata *nd,
>>  	}
>>  
>>  	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
>> -		got_write = !mnt_want_write(nd->path.mnt);
>> +		if (nonblock) {
>> +			got_write = !mnt_want_write_trylock(nd->path.mnt);
>> +			if (!got_write)
>> +				return ERR_PTR(-EAGAIN);
>> +		} else {
>> +			got_write = !mnt_want_write(nd->path.mnt);
>> +		}
>>  		/*
>>  		 * do _not_ fail yet - we might not need that or fail with
>>  		 * a different error; let lookup_open() decide; we'll be
>>  		 * dropping this one anyway.
>>  		 */
> 
> Read the comment immediately after the place you are modifying.  Really.
> To elaborate: consider e.g. the case when /mnt/foo is a symlink to /tmp/bar,
> /mnt is mounted r/o and you are asking to open /mnt/foo for write.  We get
> to /mnt, call open_last_lookups() to resolve the last component ("foo") in
> it.  And find that the sucker happens to be an absolute symlink, so we need
> to jump into root and resolve "tmp/bar" staring from there.  Which is what
> the loop in the caller is about.  Eventually we'll get to /tmp and call
> open_last_lookups() to resolve "bar" there.  /tmp needs to be mounted
> writable; /mnt does not.
> 
> Sure, you bail out only in nonblock case, so normally the next time around
> it'll go sanely.  But you are making the damn thing (and it's still too
> convoluted, even after a lot of massage towards sanity) harder to reason
> about.

Thanks - I did read that comment, but reading your full explanation it's
starting to make sense. I'll have it follow the same paradigm for the
nonblocking side.

>> +	if (open_flag & O_CREAT) {
>> +		if (nonblock) {
>> +			if (!inode_trylock(dir->d_inode)) {
>> +				dentry = ERR_PTR(-EAGAIN);
>> +				goto drop_write;
>> +			}
>> +		} else {
>> +			inode_lock(dir->d_inode);
>> +		}
>> +	} else {
>> +		if (nonblock) {
>> +			if (!inode_trylock_shared(dir->d_inode)) {
>> +				dentry = ERR_PTR(-EAGAIN);
>> +				goto drop_write;
>> +			}
>> +		} else {
>> +			inode_lock_shared(dir->d_inode);
>> +		}
>> +	}
>>  	dentry = lookup_open(nd, file, op, got_write);
> 
> ... as well as more bloated, with no obvious benefits (take a look
> at lookup_open()).

Can you elaborate? It's hard not to make

if (nonblock) {
	trylock;
} else
	lock;
}

not look/feel bloated, as it tends to explode it like the above. Can
hide it in helpers, but really, the logic needs to look like that...
Linus Torvalds Dec. 12, 2020, 6:57 p.m. UTC | #3
On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> We handle it for the path resolution itself, but we should also factor
> it in for open_last_lookups() and tmpfile open.

So I think this one is fundamentally wrong, for two reasons.

One is that "nonblock" shouldn't necessarily mean "take no locks at
all". That directory inode lock is very very different from "go down
to the filesystem to do IO". No other NONBLOCK thing has ever been "no
locks at all", they have all been about possibly long-term blocking.

The other this is that the mnt_want_write() check really smells no
different at all from any of the "some ->open functions might block".
Which you don't handle, and which again is entirely different from the
pathname resolution itself blocking.

So I don't think either of these cases are about LOOKUP_NONBLOCK, the
same way they aren't about LOOKUP_RCU.

The  mnt_want_write() in particular is much more about the kinds of
things we already check for in do_open().

In fact, looking at this patch, I think mnt_want_write() itself is
actually conceptually buggy, although it doesn't really matter: think
about device nodes etc. Opening a device node for writing doesn't
write to the filesystem that the device node is on.

Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to
do with the open() wanting to write to the filesystem. We don't even
hold that lock after the open - we'll always drop it even for a
successful open.

Only O_CREAT | O_TRUNC should matter, since those are the ones that
cause writes as part of the *open*.

Again - I don't think that this is really a problem. I mention it more
as a "this shows how the code is _conceptually_ wrong", and how it's
not really about the pathname resolution any more.

In fact, I think that how we pass on that "got_write" to lookup_open()
is just another sign of how we do that mnt_want_write() in the wrong
place and at trhe wrong level. We shouldn't have been doing that
mnt_want_write() for writable oipens (that's a different thing), and
looking at lookup_open(), I think we should have waited with doing it
at all until we've checked if we even need it (ie O_CREAT on a file
that already exists does *not* need the mnt_want_write() at all, and
we'll see that when we get to that

        /* Negative dentry, just create the file */
        if (!dentry->d_inode && (open_flag & O_CREAT)) {

thing.

So I think this patch actually shows that we do things wrong in this
area, and I think the kinds of things it does are questionable as a
result. In particular, I'm not convinced that the directory semaphore
thing should be tied to LOOKUP_NONBLOCK, and I don't think the
mnt_want_write() logic is right at all.

The first one would be fixed by a separate flag.

The second one I think is more about "we are doing mnt_want_write() at
the wrong point, and if we move mnt_want_write() to the right place
we'd just fail it explicitly for the LOOKUP_NONBLOCK case" - the same
way a truncate should just be an explicit fail, not a "trylock"
failure.

I also think we need to deal with the O_NONBLOCK kind of situation at
the actual ->open() time (ie the whole "oh, NFS wants to revalidate
caches due to open/close consistency, named pipes want to wait for
pairing, device nodes want to wait for device". Again, I think that's
separate from LOOKUP_NONBLOCK, though.

               Linus
Jens Axboe Dec. 12, 2020, 9:25 p.m. UTC | #4
On 12/12/20 11:57 AM, Linus Torvalds wrote:
> On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> We handle it for the path resolution itself, but we should also factor
>> it in for open_last_lookups() and tmpfile open.
> 
> So I think this one is fundamentally wrong, for two reasons.
> 
> One is that "nonblock" shouldn't necessarily mean "take no locks at
> all". That directory inode lock is very very different from "go down
> to the filesystem to do IO". No other NONBLOCK thing has ever been "no
> locks at all", they have all been about possibly long-term blocking.

Do we ever do long term IO _while_ holding the direcoty inode lock? If
we don't, then we can probably just ignore that side alltogether.

> The other this is that the mnt_want_write() check really smells no
> different at all from any of the "some ->open functions might block".
> Which you don't handle, and which again is entirely different from the
> pathname resolution itself blocking.
> 
> So I don't think either of these cases are about LOOKUP_NONBLOCK, the
> same way they aren't about LOOKUP_RCU.
> 
> The  mnt_want_write() in particular is much more about the kinds of
> things we already check for in do_open().
> 
> In fact, looking at this patch, I think mnt_want_write() itself is
> actually conceptually buggy, although it doesn't really matter: think
> about device nodes etc. Opening a device node for writing doesn't
> write to the filesystem that the device node is on.
> 
> Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to
> do with the open() wanting to write to the filesystem. We don't even
> hold that lock after the open - we'll always drop it even for a
> successful open.
> 
> Only O_CREAT | O_TRUNC should matter, since those are the ones that
> cause writes as part of the *open*.
> 
> Again - I don't think that this is really a problem. I mention it more
> as a "this shows how the code is _conceptually_ wrong", and how it's
> not really about the pathname resolution any more.
> 
> In fact, I think that how we pass on that "got_write" to lookup_open()
> is just another sign of how we do that mnt_want_write() in the wrong
> place and at trhe wrong level. We shouldn't have been doing that
> mnt_want_write() for writable oipens (that's a different thing), and
> looking at lookup_open(), I think we should have waited with doing it
> at all until we've checked if we even need it (ie O_CREAT on a file
> that already exists does *not* need the mnt_want_write() at all, and
> we'll see that when we get to that
> 
>         /* Negative dentry, just create the file */
>         if (!dentry->d_inode && (open_flag & O_CREAT)) {
> 
> thing.
> 
> So I think this patch actually shows that we do things wrong in this
> area, and I think the kinds of things it does are questionable as a
> result. In particular, I'm not convinced that the directory semaphore
> thing should be tied to LOOKUP_NONBLOCK, and I don't think the
> mnt_want_write() logic is right at all.
> 
> The first one would be fixed by a separate flag.
> 
> The second one I think is more about "we are doing mnt_want_write() at
> the wrong point, and if we move mnt_want_write() to the right place
> we'd just fail it explicitly for the LOOKUP_NONBLOCK case" - the same
> way a truncate should just be an explicit fail, not a "trylock"
> failure.

I'm going to let Al comment on the mnt_want_write() logic. It did feel
iffy to me, and the error handling (and passing of it) around makes it
hard to reason about. Would likely make this change more straight
forward if we sort out that first.

I do agree that we should keep the two things separate - and potentially
just have the RESOLVE_NONBLOCK be RESOLVE_CACHE (or something like that)
and not deal with the locking / want write side of things at all. For
io_uring, we really do want to ensure that we don't stall the submission
pipeline by potentially being stuck on the directory lock or waiting for
a frozen file system.

And I don't think we can get around having two flags at that point -
probably by renaming the initial LOOKUP_NONBLOCK to LOOKUP_CACHE, and by
having a LOOKUP_NONBLOCK which has a slightly wider scope.

> I also think we need to deal with the O_NONBLOCK kind of situation at
> the actual ->open() time (ie the whole "oh, NFS wants to revalidate
> caches due to open/close consistency, named pipes want to wait for
> pairing, device nodes want to wait for device". Again, I think that's
> separate from LOOKUP_NONBLOCK, though.

Agree, that's still not done in this patch. I did think about it, and
the only potential idea I had around that was to wrap the actual open in
setting/clearing O_NDELAY/O_NONBLOCK around the open. Which _should_
work as long as it happens before fd_install() is done, but doesn't feel
that architecturally clean.
Linus Torvalds Dec. 12, 2020, 10:03 p.m. UTC | #5
On Sat, Dec 12, 2020 at 1:25 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Do we ever do long term IO _while_ holding the direcoty inode lock? If
> we don't, then we can probably just ignore that side alltogether.

The inode lock is all kinds of messy. Part of it is that we have these
helper functions for taking it ("inode_lock_shared()" and friends).
Part of it is that some codepaths do *not* use those helpers and use
"inode->i_rwsem" directly. And part of it is that our comments
sometimes talk about the old name ("i_mutex").

The inode lock *can* be a problem. The historical problem point is
actually readdir(), which takes the lock for reading, but does so over
not just IO but also the user space accesses.

That used to be a huge problem when it was a mutex, not an rwlock. But
I think it can still be a problem for (a) filesystems that haven't
been converted to 'iterate_shared' or (b) if a slow readdir has the
lock, and a O_CREAT comes in, then new readers will block too.

Honestly, the inode lock is nasty and broken. It's made worse by the
fact that it really doesn't have great semantics: filesystems use it
randomly for internal "lock this inode" too.

A lot of inode lock users don't actually do any IO at all. The
messiness of that lock comes literally from the fact that it was this
random per-inode lock that just grew a lot of random uses. Many of
them aren't particularly relevant for directories, though.

It's one of my least favorite locks in the kernel, but practically
speaking it seldom causes problems.

But if you haven't figured out the pattern by now, let's just say that
"it's completely random".

It would be interesting to see if it causes actual problems. Because
maybe that could push us towards fixing some of them.

               Linus
Dave Chinner Dec. 13, 2020, 10:50 p.m. UTC | #6
On Sat, Dec 12, 2020 at 02:25:00PM -0700, Jens Axboe wrote:
> On 12/12/20 11:57 AM, Linus Torvalds wrote:
> > On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> We handle it for the path resolution itself, but we should also factor
> >> it in for open_last_lookups() and tmpfile open.
> > 
> > So I think this one is fundamentally wrong, for two reasons.
> > 
> > One is that "nonblock" shouldn't necessarily mean "take no locks at
> > all". That directory inode lock is very very different from "go down
> > to the filesystem to do IO". No other NONBLOCK thing has ever been "no
> > locks at all", they have all been about possibly long-term blocking.
> 
> Do we ever do long term IO _while_ holding the direcoty inode lock? If
> we don't, then we can probably just ignore that side alltogether.

Yes - "all the time" is the simple answer.

Readdir is a simple example, but that is just extent mapping tree
and directory block IO you'd have to wait for, just like regular
file IO.

The big problem is that modifications to directories are atomic and
transactional in most filesystems, which means we might block a
create/unlink/attr/etc in a transaction start for an indefinite
amount of time while we wait for metadata writeback to free up
journal/reservation space. And while we are doing this, nothing else
can access the directory because the VFS holds the directory inode
lock....

We also have metadata IO within transactions, but in most
journalling filesystems once we've started the transaction we can't
back out and return -EAGAIN. So once we are in a transaction
context, the filesystem will block as necessary to run the operation
to completion.

So, really, at the filesystem level I don't see much value in trying
to push non-blocking directory modifications down to the filesystem.
The commonly used filesystems will mostly have to return -EAGAIN
immediately without being able to do anything at all because they
simply aren't architected with the modification rollback
capabilities needed to run fully non-blocking transactional
modification operations.

> > Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to
> > do with the open() wanting to write to the filesystem. We don't even
> > hold that lock after the open - we'll always drop it even for a
> > successful open.
> > 
> > Only O_CREAT | O_TRUNC should matter, since those are the ones that
> > cause writes as part of the *open*.

And __O_TMPFILE, which is the same as O_CREAT.

Cheers,

Dave.
Linus Torvalds Dec. 14, 2020, 12:45 a.m. UTC | #7
On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > Only O_CREAT | O_TRUNC should matter, since those are the ones that
> > > cause writes as part of the *open*.
>
> And __O_TMPFILE, which is the same as O_CREAT.

This made me go look at the code, but we seem to be ok here -
__O_TMPFILE should never get to the do_open() logic at all, because it
gets caught before that and does off to do_tmpfile() and then
vfs_tmpfile() instead.

And then it's up to the filesystem to do the inode locking if it needs
to - it has a separate i_io->tempfile function for that.

From a LOOKUP_NONBLOCK standpoint, I think we should just disallow
O_TMPFILE the same way Jens disallowed O_TRUNCATE.

Otherwise we'd have to teach filesystems about it.

Which might be an option long-term of course, but I don't think it
makes sense for any initial patch-set: the real "we can do this with
no locks and quickly" case is just opening an existing file entirely
using the dcache.

Creating new files isn't exactly _uncommon_, of course, but it's
probably still two orders of magnitude less common than just opening
an existing file. Wild handwaving.

So I suspect O_CREAT simply isn't really interesting either. Sure, the
"it already exists" case could potentially be done without any locking
or anything like that, but again, I don't think that case is worth
optimizing for: O_CREAT probably just isn't common enough.

[ I don't have tons of hard data to back that handwaving argument
based on my gut feel up, but I did do a trace of a kernel build just
because that's my "default load" and out of 250 thousand openat()
calls, only 560 had O_CREAT and/or O_TRUNC. But while that's _my_
default load, it's obviously not necessarily very representative of
anything else. The "almost three orders of magnitude more regular
opens" doesn't _surprise_ me though ]

So I really think the right model is not to worry about trylock for
the inode lock at all, but to simply just always fail with EAGAIN in
that case - and not do LOOKUP_NONBLOCK for O_CREAT at all.

The normal fast-path case in open_last_lookups() is the "goto
finish_lookup" case in this sequence:

        if (!(open_flag & O_CREAT)) {
                if (nd->last.name[nd->last.len])
                        nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
                /* we _can_ be in RCU mode here */
                dentry = lookup_fast(nd, &inode, &seq);
                if (IS_ERR(dentry))
                        return ERR_CAST(dentry);
                if (likely(dentry))
                        goto finish_lookup;

and we never get to the inode locking sequence at all.

So just adding a

        if (nd->flags & LOOKUP_NONBLOCK)
                return -EAGAIN;

there is probably the right thing - rather than worry about trylock on
the inode lock.

Because if you've missed in the dcache, you've by definition lost the fast-path.

That said, numbers talk, BS walks, and maybe somebody has better loads
with better numbers. I assume Jens has some internal FB io_uring loads
and could do stats on what kinds of pathname opens matter there...

          Linus
Dave Chinner Dec. 14, 2020, 1:52 a.m. UTC | #8
On Sun, Dec 13, 2020 at 04:45:39PM -0800, Linus Torvalds wrote:
> On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > Only O_CREAT | O_TRUNC should matter, since those are the ones that
> > > > cause writes as part of the *open*.
> >
> > And __O_TMPFILE, which is the same as O_CREAT.
> 
> This made me go look at the code, but we seem to be ok here -
> __O_TMPFILE should never get to the do_open() logic at all, because it
> gets caught before that and does off to do_tmpfile() and then
> vfs_tmpfile() instead.
> 
> And then it's up to the filesystem to do the inode locking if it needs
> to - it has a separate i_io->tempfile function for that.

Sure, and then it blocks. Guaranteed, for the same reasons that
O_CREAT will block when calling ->create() after the path lookup:
the filesystem runs a transaction to allocate an inode and track it
on the orphan list so that it gets cleaned up by recovery after a
crash while the tmpfile is still open.

So it doesn't matter if the lookup is non-blocking, the tmpfile
creation is guaranteed to block for the same reason O_CREAT and
O_TRUNCATE will block....

> From a LOOKUP_NONBLOCK standpoint, I think we should just disallow
> O_TMPFILE the same way Jens disallowed O_TRUNCATE.

*nod*

I just don't think it makes sense to try to make any of the
filesystem level stuff open() might do non-blocking. The moment we
start a filesystem modification, we have to be able to block because
it is the only way to guarantee forwards progress. So if we know we
are going to call into the filesystem to make a modification if the
pathwalk is successful, why even bother starting the pathwalk?

Cheers,

Dave.
Jens Axboe Dec. 14, 2020, 5:43 p.m. UTC | #9
On 12/13/20 5:45 PM, Linus Torvalds wrote:
> On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote:
>>>>
>>>> Only O_CREAT | O_TRUNC should matter, since those are the ones that
>>>> cause writes as part of the *open*.
>>
>> And __O_TMPFILE, which is the same as O_CREAT.
> 
> This made me go look at the code, but we seem to be ok here -
> __O_TMPFILE should never get to the do_open() logic at all, because it
> gets caught before that and does off to do_tmpfile() and then
> vfs_tmpfile() instead.
> 
> And then it's up to the filesystem to do the inode locking if it needs
> to - it has a separate i_io->tempfile function for that.
> 
> From a LOOKUP_NONBLOCK standpoint, I think we should just disallow
> O_TMPFILE the same way Jens disallowed O_TRUNCATE.
> 
> Otherwise we'd have to teach filesystems about it.

Good point, let's avoid that for now... More below.

> Which might be an option long-term of course, but I don't think it
> makes sense for any initial patch-set: the real "we can do this with
> no locks and quickly" case is just opening an existing file entirely
> using the dcache.
> 
> Creating new files isn't exactly _uncommon_, of course, but it's
> probably still two orders of magnitude less common than just opening
> an existing file. Wild handwaving.

Obviously depends on the workload, but there are plenty of interesting
workloads that are not create intensive at all and just want a fast way
to open an existing file.

> So I suspect O_CREAT simply isn't really interesting either. Sure, the
> "it already exists" case could potentially be done without any locking
> or anything like that, but again, I don't think that case is worth
> optimizing for: O_CREAT probably just isn't common enough.
> 
> [ I don't have tons of hard data to back that handwaving argument
> based on my gut feel up, but I did do a trace of a kernel build just
> because that's my "default load" and out of 250 thousand openat()
> calls, only 560 had O_CREAT and/or O_TRUNC. But while that's _my_
> default load, it's obviously not necessarily very representative of
> anything else. The "almost three orders of magnitude more regular
> opens" doesn't _surprise_ me though ]
> 
> So I really think the right model is not to worry about trylock for
> the inode lock at all, but to simply just always fail with EAGAIN in
> that case - and not do LOOKUP_NONBLOCK for O_CREAT at all.
> 
> The normal fast-path case in open_last_lookups() is the "goto
> finish_lookup" case in this sequence:
> 
>         if (!(open_flag & O_CREAT)) {
>                 if (nd->last.name[nd->last.len])
>                         nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
>                 /* we _can_ be in RCU mode here */
>                 dentry = lookup_fast(nd, &inode, &seq);
>                 if (IS_ERR(dentry))
>                         return ERR_CAST(dentry);
>                 if (likely(dentry))
>                         goto finish_lookup;
> 
> and we never get to the inode locking sequence at all.
> 
> So just adding a
> 
>         if (nd->flags & LOOKUP_NONBLOCK)
>                 return -EAGAIN;
> 
> there is probably the right thing - rather than worry about trylock on
> the inode lock.

I really like that, as it also means we can safely drop the
mnt_want_write() path and just safe that for a separate cleanup that's
not related to this at all. With that, I'd suggest folding the two
patches into one as we no longer need followup work. It also means we
can drop the concept of having two flags for this, LOOKUP_NONBLOCK
covers the entire case of the existing fast path.

Might still make sense to name it LOOKUP_CACHE instead, but I'll leave
that for you/Al to decide...

> Because if you've missed in the dcache, you've by definition lost the fast-path.
> 
> That said, numbers talk, BS walks, and maybe somebody has better loads
> with better numbers. I assume Jens has some internal FB io_uring loads
> and could do stats on what kinds of pathname opens matter there...

Definitely, was already planning on checking that. And just as
importantly, once we have this, it's pretty trivial to do the same thing
on the stat side as well, which makes me excited for that change too.
io_uring currently punts that one too, and that's arguably even more
interesting than open if we can avoid that for the fast case.
Linus Torvalds Dec. 14, 2020, 6:06 p.m. UTC | #10
On Sun, Dec 13, 2020 at 5:52 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Dec 13, 2020 at 04:45:39PM -0800, Linus Torvalds wrote:
> > On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > Only O_CREAT | O_TRUNC should matter, since those are the ones that
> > > > > cause writes as part of the *open*.
> > >
> > > And __O_TMPFILE, which is the same as O_CREAT.
> >
> > This made me go look at the code, but we seem to be ok here -
> > __O_TMPFILE should never get to the do_open() logic at all, because it
> > gets caught before that and does off to do_tmpfile() and then
> > vfs_tmpfile() instead.
> >
> > And then it's up to the filesystem to do the inode locking if it needs
> > to - it has a separate i_io->tempfile function for that.
>
> Sure, and then it blocks.

Yes. I was more just double-checking that currently really odd

        if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {

condition that didn't make sense to me (it basically does two
different kinds of writablity checks). So it was more that you pointed
out that __O_TMPFILE was also missing from that odd condition, and
that turns out to be because it was handled separately.

So no disagreement about __O_TMPFILE being a "not a cached operation"
- purely a "that condition is odd".

It was just that O_WRONLY | O_RDWR didn't make tons of sense to me,
since we then get the write count only to then drop it immediately
immediately without having actually done any writes.

But I guess they are there only as a "even if we don't write to the
filesystem right now, we do want to get the EROFS error return from
open(), rather than at write() time".

I think technically it shouldn't need to do the pointless "synchronize
and increment writers only to decrement them again" dance, and could
just do a "mnt_is_readonly()" test for the plain "open writably, but
without O_CREAT/O_TRUNC" case.

But I guess there's no real downside to doing it the way we're doing
it - it just looked odd to me when I was looking at it in the context
of just pathname lookup.

In do_faccessat() we have that bare __mnt_is_readonly() for the "I
want EROFS, but I'm not actually writing now" case.

            Linus
diff mbox series

Patch

diff --git a/fs/namei.c b/fs/namei.c
index 07a1aa874f65..1f976a213eef 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3127,6 +3127,7 @@  static const char *open_last_lookups(struct nameidata *nd,
 	struct dentry *dir = nd->path.dentry;
 	int open_flag = op->open_flag;
 	bool got_write = false;
+	bool nonblock = nd->flags & LOOKUP_NONBLOCK;
 	unsigned seq;
 	struct inode *inode;
 	struct dentry *dentry;
@@ -3164,17 +3165,38 @@  static const char *open_last_lookups(struct nameidata *nd,
 	}
 
 	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
-		got_write = !mnt_want_write(nd->path.mnt);
+		if (nonblock) {
+			got_write = !mnt_want_write_trylock(nd->path.mnt);
+			if (!got_write)
+				return ERR_PTR(-EAGAIN);
+		} else {
+			got_write = !mnt_want_write(nd->path.mnt);
+		}
 		/*
 		 * do _not_ fail yet - we might not need that or fail with
 		 * a different error; let lookup_open() decide; we'll be
 		 * dropping this one anyway.
 		 */
 	}
-	if (open_flag & O_CREAT)
-		inode_lock(dir->d_inode);
-	else
-		inode_lock_shared(dir->d_inode);
+	if (open_flag & O_CREAT) {
+		if (nonblock) {
+			if (!inode_trylock(dir->d_inode)) {
+				dentry = ERR_PTR(-EAGAIN);
+				goto drop_write;
+			}
+		} else {
+			inode_lock(dir->d_inode);
+		}
+	} else {
+		if (nonblock) {
+			if (!inode_trylock_shared(dir->d_inode)) {
+				dentry = ERR_PTR(-EAGAIN);
+				goto drop_write;
+			}
+		} else {
+			inode_lock_shared(dir->d_inode);
+		}
+	}
 	dentry = lookup_open(nd, file, op, got_write);
 	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
 		fsnotify_create(dir->d_inode, dentry);
@@ -3183,6 +3205,7 @@  static const char *open_last_lookups(struct nameidata *nd,
 	else
 		inode_unlock_shared(dir->d_inode);
 
+drop_write:
 	if (got_write)
 		mnt_drop_write(nd->path.mnt);
 
@@ -3242,6 +3265,7 @@  static int do_open(struct nameidata *nd,
 		open_flag &= ~O_TRUNC;
 		acc_mode = 0;
 	} else if (d_is_reg(nd->path.dentry) && open_flag & O_TRUNC) {
+		WARN_ON_ONCE(nd->flags & LOOKUP_NONBLOCK);
 		error = mnt_want_write(nd->path.mnt);
 		if (error)
 			return error;
@@ -3311,7 +3335,10 @@  static int do_tmpfile(struct nameidata *nd, unsigned flags,
 	int error = path_lookupat(nd, flags | LOOKUP_DIRECTORY, &path);
 	if (unlikely(error))
 		return error;
-	error = mnt_want_write(path.mnt);
+	if (flags & LOOKUP_NONBLOCK)
+		error = mnt_want_write_trylock(path.mnt);
+	else
+		error = mnt_want_write(path.mnt);
 	if (unlikely(error))
 		goto out;
 	child = vfs_tmpfile(path.dentry, op->mode, op->open_flag);