Message ID | 20201212165105.902688-5-axboe@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK | expand |
On Sat, Dec 12, 2020 at 09:51:04AM -0700, Jens Axboe wrote: > struct dentry *dentry; > @@ -3164,17 +3165,38 @@ static const char *open_last_lookups(struct nameidata *nd, > } > > if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { > - got_write = !mnt_want_write(nd->path.mnt); > + if (nonblock) { > + got_write = !mnt_want_write_trylock(nd->path.mnt); > + if (!got_write) > + return ERR_PTR(-EAGAIN); > + } else { > + got_write = !mnt_want_write(nd->path.mnt); > + } > /* > * do _not_ fail yet - we might not need that or fail with > * a different error; let lookup_open() decide; we'll be > * dropping this one anyway. > */ Read the comment immediately after the place you are modifying. Really. To elaborate: consider e.g. the case when /mnt/foo is a symlink to /tmp/bar, /mnt is mounted r/o and you are asking to open /mnt/foo for write. We get to /mnt, call open_last_lookups() to resolve the last component ("foo") in it. And find that the sucker happens to be an absolute symlink, so we need to jump into root and resolve "tmp/bar" staring from there. Which is what the loop in the caller is about. Eventually we'll get to /tmp and call open_last_lookups() to resolve "bar" there. /tmp needs to be mounted writable; /mnt does not. Sure, you bail out only in nonblock case, so normally the next time around it'll go sanely. But you are making the damn thing (and it's still too convoluted, even after a lot of massage towards sanity) harder to reason about. > + if (open_flag & O_CREAT) { > + if (nonblock) { > + if (!inode_trylock(dir->d_inode)) { > + dentry = ERR_PTR(-EAGAIN); > + goto drop_write; > + } > + } else { > + inode_lock(dir->d_inode); > + } > + } else { > + if (nonblock) { > + if (!inode_trylock_shared(dir->d_inode)) { > + dentry = ERR_PTR(-EAGAIN); > + goto drop_write; > + } > + } else { > + inode_lock_shared(dir->d_inode); > + } > + } > dentry = lookup_open(nd, file, op, got_write); ... as well as more bloated, with no obvious benefits (take a look at lookup_open()).
On 12/12/20 10:25 AM, Al Viro wrote: > On Sat, Dec 12, 2020 at 09:51:04AM -0700, Jens Axboe wrote: > >> struct dentry *dentry; >> @@ -3164,17 +3165,38 @@ static const char *open_last_lookups(struct nameidata *nd, >> } >> >> if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { >> - got_write = !mnt_want_write(nd->path.mnt); >> + if (nonblock) { >> + got_write = !mnt_want_write_trylock(nd->path.mnt); >> + if (!got_write) >> + return ERR_PTR(-EAGAIN); >> + } else { >> + got_write = !mnt_want_write(nd->path.mnt); >> + } >> /* >> * do _not_ fail yet - we might not need that or fail with >> * a different error; let lookup_open() decide; we'll be >> * dropping this one anyway. >> */ > > Read the comment immediately after the place you are modifying. Really. > To elaborate: consider e.g. the case when /mnt/foo is a symlink to /tmp/bar, > /mnt is mounted r/o and you are asking to open /mnt/foo for write. We get > to /mnt, call open_last_lookups() to resolve the last component ("foo") in > it. And find that the sucker happens to be an absolute symlink, so we need > to jump into root and resolve "tmp/bar" staring from there. Which is what > the loop in the caller is about. Eventually we'll get to /tmp and call > open_last_lookups() to resolve "bar" there. /tmp needs to be mounted > writable; /mnt does not. > > Sure, you bail out only in nonblock case, so normally the next time around > it'll go sanely. But you are making the damn thing (and it's still too > convoluted, even after a lot of massage towards sanity) harder to reason > about. Thanks - I did read that comment, but reading your full explanation it's starting to make sense. I'll have it follow the same paradigm for the nonblocking side. >> + if (open_flag & O_CREAT) { >> + if (nonblock) { >> + if (!inode_trylock(dir->d_inode)) { >> + dentry = ERR_PTR(-EAGAIN); >> + goto drop_write; >> + } >> + } else { >> + inode_lock(dir->d_inode); >> + } >> + } else { >> + if (nonblock) { >> + if (!inode_trylock_shared(dir->d_inode)) { >> + dentry = ERR_PTR(-EAGAIN); >> + goto drop_write; >> + } >> + } else { >> + inode_lock_shared(dir->d_inode); >> + } >> + } >> dentry = lookup_open(nd, file, op, got_write); > > ... as well as more bloated, with no obvious benefits (take a look > at lookup_open()). Can you elaborate? It's hard not to make if (nonblock) { trylock; } else lock; } not look/feel bloated, as it tends to explode it like the above. Can hide it in helpers, but really, the logic needs to look like that...
On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@kernel.dk> wrote: > > We handle it for the path resolution itself, but we should also factor > it in for open_last_lookups() and tmpfile open. So I think this one is fundamentally wrong, for two reasons. One is that "nonblock" shouldn't necessarily mean "take no locks at all". That directory inode lock is very very different from "go down to the filesystem to do IO". No other NONBLOCK thing has ever been "no locks at all", they have all been about possibly long-term blocking. The other this is that the mnt_want_write() check really smells no different at all from any of the "some ->open functions might block". Which you don't handle, and which again is entirely different from the pathname resolution itself blocking. So I don't think either of these cases are about LOOKUP_NONBLOCK, the same way they aren't about LOOKUP_RCU. The mnt_want_write() in particular is much more about the kinds of things we already check for in do_open(). In fact, looking at this patch, I think mnt_want_write() itself is actually conceptually buggy, although it doesn't really matter: think about device nodes etc. Opening a device node for writing doesn't write to the filesystem that the device node is on. Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to do with the open() wanting to write to the filesystem. We don't even hold that lock after the open - we'll always drop it even for a successful open. Only O_CREAT | O_TRUNC should matter, since those are the ones that cause writes as part of the *open*. Again - I don't think that this is really a problem. I mention it more as a "this shows how the code is _conceptually_ wrong", and how it's not really about the pathname resolution any more. In fact, I think that how we pass on that "got_write" to lookup_open() is just another sign of how we do that mnt_want_write() in the wrong place and at trhe wrong level. We shouldn't have been doing that mnt_want_write() for writable oipens (that's a different thing), and looking at lookup_open(), I think we should have waited with doing it at all until we've checked if we even need it (ie O_CREAT on a file that already exists does *not* need the mnt_want_write() at all, and we'll see that when we get to that /* Negative dentry, just create the file */ if (!dentry->d_inode && (open_flag & O_CREAT)) { thing. So I think this patch actually shows that we do things wrong in this area, and I think the kinds of things it does are questionable as a result. In particular, I'm not convinced that the directory semaphore thing should be tied to LOOKUP_NONBLOCK, and I don't think the mnt_want_write() logic is right at all. The first one would be fixed by a separate flag. The second one I think is more about "we are doing mnt_want_write() at the wrong point, and if we move mnt_want_write() to the right place we'd just fail it explicitly for the LOOKUP_NONBLOCK case" - the same way a truncate should just be an explicit fail, not a "trylock" failure. I also think we need to deal with the O_NONBLOCK kind of situation at the actual ->open() time (ie the whole "oh, NFS wants to revalidate caches due to open/close consistency, named pipes want to wait for pairing, device nodes want to wait for device". Again, I think that's separate from LOOKUP_NONBLOCK, though. Linus
On 12/12/20 11:57 AM, Linus Torvalds wrote: > On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@kernel.dk> wrote: >> >> We handle it for the path resolution itself, but we should also factor >> it in for open_last_lookups() and tmpfile open. > > So I think this one is fundamentally wrong, for two reasons. > > One is that "nonblock" shouldn't necessarily mean "take no locks at > all". That directory inode lock is very very different from "go down > to the filesystem to do IO". No other NONBLOCK thing has ever been "no > locks at all", they have all been about possibly long-term blocking. Do we ever do long term IO _while_ holding the direcoty inode lock? If we don't, then we can probably just ignore that side alltogether. > The other this is that the mnt_want_write() check really smells no > different at all from any of the "some ->open functions might block". > Which you don't handle, and which again is entirely different from the > pathname resolution itself blocking. > > So I don't think either of these cases are about LOOKUP_NONBLOCK, the > same way they aren't about LOOKUP_RCU. > > The mnt_want_write() in particular is much more about the kinds of > things we already check for in do_open(). > > In fact, looking at this patch, I think mnt_want_write() itself is > actually conceptually buggy, although it doesn't really matter: think > about device nodes etc. Opening a device node for writing doesn't > write to the filesystem that the device node is on. > > Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to > do with the open() wanting to write to the filesystem. We don't even > hold that lock after the open - we'll always drop it even for a > successful open. > > Only O_CREAT | O_TRUNC should matter, since those are the ones that > cause writes as part of the *open*. > > Again - I don't think that this is really a problem. I mention it more > as a "this shows how the code is _conceptually_ wrong", and how it's > not really about the pathname resolution any more. > > In fact, I think that how we pass on that "got_write" to lookup_open() > is just another sign of how we do that mnt_want_write() in the wrong > place and at trhe wrong level. We shouldn't have been doing that > mnt_want_write() for writable oipens (that's a different thing), and > looking at lookup_open(), I think we should have waited with doing it > at all until we've checked if we even need it (ie O_CREAT on a file > that already exists does *not* need the mnt_want_write() at all, and > we'll see that when we get to that > > /* Negative dentry, just create the file */ > if (!dentry->d_inode && (open_flag & O_CREAT)) { > > thing. > > So I think this patch actually shows that we do things wrong in this > area, and I think the kinds of things it does are questionable as a > result. In particular, I'm not convinced that the directory semaphore > thing should be tied to LOOKUP_NONBLOCK, and I don't think the > mnt_want_write() logic is right at all. > > The first one would be fixed by a separate flag. > > The second one I think is more about "we are doing mnt_want_write() at > the wrong point, and if we move mnt_want_write() to the right place > we'd just fail it explicitly for the LOOKUP_NONBLOCK case" - the same > way a truncate should just be an explicit fail, not a "trylock" > failure. I'm going to let Al comment on the mnt_want_write() logic. It did feel iffy to me, and the error handling (and passing of it) around makes it hard to reason about. Would likely make this change more straight forward if we sort out that first. I do agree that we should keep the two things separate - and potentially just have the RESOLVE_NONBLOCK be RESOLVE_CACHE (or something like that) and not deal with the locking / want write side of things at all. For io_uring, we really do want to ensure that we don't stall the submission pipeline by potentially being stuck on the directory lock or waiting for a frozen file system. And I don't think we can get around having two flags at that point - probably by renaming the initial LOOKUP_NONBLOCK to LOOKUP_CACHE, and by having a LOOKUP_NONBLOCK which has a slightly wider scope. > I also think we need to deal with the O_NONBLOCK kind of situation at > the actual ->open() time (ie the whole "oh, NFS wants to revalidate > caches due to open/close consistency, named pipes want to wait for > pairing, device nodes want to wait for device". Again, I think that's > separate from LOOKUP_NONBLOCK, though. Agree, that's still not done in this patch. I did think about it, and the only potential idea I had around that was to wrap the actual open in setting/clearing O_NDELAY/O_NONBLOCK around the open. Which _should_ work as long as it happens before fd_install() is done, but doesn't feel that architecturally clean.
On Sat, Dec 12, 2020 at 1:25 PM Jens Axboe <axboe@kernel.dk> wrote: > > Do we ever do long term IO _while_ holding the direcoty inode lock? If > we don't, then we can probably just ignore that side alltogether. The inode lock is all kinds of messy. Part of it is that we have these helper functions for taking it ("inode_lock_shared()" and friends). Part of it is that some codepaths do *not* use those helpers and use "inode->i_rwsem" directly. And part of it is that our comments sometimes talk about the old name ("i_mutex"). The inode lock *can* be a problem. The historical problem point is actually readdir(), which takes the lock for reading, but does so over not just IO but also the user space accesses. That used to be a huge problem when it was a mutex, not an rwlock. But I think it can still be a problem for (a) filesystems that haven't been converted to 'iterate_shared' or (b) if a slow readdir has the lock, and a O_CREAT comes in, then new readers will block too. Honestly, the inode lock is nasty and broken. It's made worse by the fact that it really doesn't have great semantics: filesystems use it randomly for internal "lock this inode" too. A lot of inode lock users don't actually do any IO at all. The messiness of that lock comes literally from the fact that it was this random per-inode lock that just grew a lot of random uses. Many of them aren't particularly relevant for directories, though. It's one of my least favorite locks in the kernel, but practically speaking it seldom causes problems. But if you haven't figured out the pattern by now, let's just say that "it's completely random". It would be interesting to see if it causes actual problems. Because maybe that could push us towards fixing some of them. Linus
On Sat, Dec 12, 2020 at 02:25:00PM -0700, Jens Axboe wrote: > On 12/12/20 11:57 AM, Linus Torvalds wrote: > > On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@kernel.dk> wrote: > >> > >> We handle it for the path resolution itself, but we should also factor > >> it in for open_last_lookups() and tmpfile open. > > > > So I think this one is fundamentally wrong, for two reasons. > > > > One is that "nonblock" shouldn't necessarily mean "take no locks at > > all". That directory inode lock is very very different from "go down > > to the filesystem to do IO". No other NONBLOCK thing has ever been "no > > locks at all", they have all been about possibly long-term blocking. > > Do we ever do long term IO _while_ holding the direcoty inode lock? If > we don't, then we can probably just ignore that side alltogether. Yes - "all the time" is the simple answer. Readdir is a simple example, but that is just extent mapping tree and directory block IO you'd have to wait for, just like regular file IO. The big problem is that modifications to directories are atomic and transactional in most filesystems, which means we might block a create/unlink/attr/etc in a transaction start for an indefinite amount of time while we wait for metadata writeback to free up journal/reservation space. And while we are doing this, nothing else can access the directory because the VFS holds the directory inode lock.... We also have metadata IO within transactions, but in most journalling filesystems once we've started the transaction we can't back out and return -EAGAIN. So once we are in a transaction context, the filesystem will block as necessary to run the operation to completion. So, really, at the filesystem level I don't see much value in trying to push non-blocking directory modifications down to the filesystem. The commonly used filesystems will mostly have to return -EAGAIN immediately without being able to do anything at all because they simply aren't architected with the modification rollback capabilities needed to run fully non-blocking transactional modification operations. > > Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to > > do with the open() wanting to write to the filesystem. We don't even > > hold that lock after the open - we'll always drop it even for a > > successful open. > > > > Only O_CREAT | O_TRUNC should matter, since those are the ones that > > cause writes as part of the *open*. And __O_TMPFILE, which is the same as O_CREAT. Cheers, Dave.
On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > Only O_CREAT | O_TRUNC should matter, since those are the ones that > > > cause writes as part of the *open*. > > And __O_TMPFILE, which is the same as O_CREAT. This made me go look at the code, but we seem to be ok here - __O_TMPFILE should never get to the do_open() logic at all, because it gets caught before that and does off to do_tmpfile() and then vfs_tmpfile() instead. And then it's up to the filesystem to do the inode locking if it needs to - it has a separate i_io->tempfile function for that. From a LOOKUP_NONBLOCK standpoint, I think we should just disallow O_TMPFILE the same way Jens disallowed O_TRUNCATE. Otherwise we'd have to teach filesystems about it. Which might be an option long-term of course, but I don't think it makes sense for any initial patch-set: the real "we can do this with no locks and quickly" case is just opening an existing file entirely using the dcache. Creating new files isn't exactly _uncommon_, of course, but it's probably still two orders of magnitude less common than just opening an existing file. Wild handwaving. So I suspect O_CREAT simply isn't really interesting either. Sure, the "it already exists" case could potentially be done without any locking or anything like that, but again, I don't think that case is worth optimizing for: O_CREAT probably just isn't common enough. [ I don't have tons of hard data to back that handwaving argument based on my gut feel up, but I did do a trace of a kernel build just because that's my "default load" and out of 250 thousand openat() calls, only 560 had O_CREAT and/or O_TRUNC. But while that's _my_ default load, it's obviously not necessarily very representative of anything else. The "almost three orders of magnitude more regular opens" doesn't _surprise_ me though ] So I really think the right model is not to worry about trylock for the inode lock at all, but to simply just always fail with EAGAIN in that case - and not do LOOKUP_NONBLOCK for O_CREAT at all. The normal fast-path case in open_last_lookups() is the "goto finish_lookup" case in this sequence: if (!(open_flag & O_CREAT)) { if (nd->last.name[nd->last.len]) nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; /* we _can_ be in RCU mode here */ dentry = lookup_fast(nd, &inode, &seq); if (IS_ERR(dentry)) return ERR_CAST(dentry); if (likely(dentry)) goto finish_lookup; and we never get to the inode locking sequence at all. So just adding a if (nd->flags & LOOKUP_NONBLOCK) return -EAGAIN; there is probably the right thing - rather than worry about trylock on the inode lock. Because if you've missed in the dcache, you've by definition lost the fast-path. That said, numbers talk, BS walks, and maybe somebody has better loads with better numbers. I assume Jens has some internal FB io_uring loads and could do stats on what kinds of pathname opens matter there... Linus
On Sun, Dec 13, 2020 at 04:45:39PM -0800, Linus Torvalds wrote: > On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > Only O_CREAT | O_TRUNC should matter, since those are the ones that > > > > cause writes as part of the *open*. > > > > And __O_TMPFILE, which is the same as O_CREAT. > > This made me go look at the code, but we seem to be ok here - > __O_TMPFILE should never get to the do_open() logic at all, because it > gets caught before that and does off to do_tmpfile() and then > vfs_tmpfile() instead. > > And then it's up to the filesystem to do the inode locking if it needs > to - it has a separate i_io->tempfile function for that. Sure, and then it blocks. Guaranteed, for the same reasons that O_CREAT will block when calling ->create() after the path lookup: the filesystem runs a transaction to allocate an inode and track it on the orphan list so that it gets cleaned up by recovery after a crash while the tmpfile is still open. So it doesn't matter if the lookup is non-blocking, the tmpfile creation is guaranteed to block for the same reason O_CREAT and O_TRUNCATE will block.... > From a LOOKUP_NONBLOCK standpoint, I think we should just disallow > O_TMPFILE the same way Jens disallowed O_TRUNCATE. *nod* I just don't think it makes sense to try to make any of the filesystem level stuff open() might do non-blocking. The moment we start a filesystem modification, we have to be able to block because it is the only way to guarantee forwards progress. So if we know we are going to call into the filesystem to make a modification if the pathwalk is successful, why even bother starting the pathwalk? Cheers, Dave.
On 12/13/20 5:45 PM, Linus Torvalds wrote: > On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote: >>>> >>>> Only O_CREAT | O_TRUNC should matter, since those are the ones that >>>> cause writes as part of the *open*. >> >> And __O_TMPFILE, which is the same as O_CREAT. > > This made me go look at the code, but we seem to be ok here - > __O_TMPFILE should never get to the do_open() logic at all, because it > gets caught before that and does off to do_tmpfile() and then > vfs_tmpfile() instead. > > And then it's up to the filesystem to do the inode locking if it needs > to - it has a separate i_io->tempfile function for that. > > From a LOOKUP_NONBLOCK standpoint, I think we should just disallow > O_TMPFILE the same way Jens disallowed O_TRUNCATE. > > Otherwise we'd have to teach filesystems about it. Good point, let's avoid that for now... More below. > Which might be an option long-term of course, but I don't think it > makes sense for any initial patch-set: the real "we can do this with > no locks and quickly" case is just opening an existing file entirely > using the dcache. > > Creating new files isn't exactly _uncommon_, of course, but it's > probably still two orders of magnitude less common than just opening > an existing file. Wild handwaving. Obviously depends on the workload, but there are plenty of interesting workloads that are not create intensive at all and just want a fast way to open an existing file. > So I suspect O_CREAT simply isn't really interesting either. Sure, the > "it already exists" case could potentially be done without any locking > or anything like that, but again, I don't think that case is worth > optimizing for: O_CREAT probably just isn't common enough. > > [ I don't have tons of hard data to back that handwaving argument > based on my gut feel up, but I did do a trace of a kernel build just > because that's my "default load" and out of 250 thousand openat() > calls, only 560 had O_CREAT and/or O_TRUNC. But while that's _my_ > default load, it's obviously not necessarily very representative of > anything else. The "almost three orders of magnitude more regular > opens" doesn't _surprise_ me though ] > > So I really think the right model is not to worry about trylock for > the inode lock at all, but to simply just always fail with EAGAIN in > that case - and not do LOOKUP_NONBLOCK for O_CREAT at all. > > The normal fast-path case in open_last_lookups() is the "goto > finish_lookup" case in this sequence: > > if (!(open_flag & O_CREAT)) { > if (nd->last.name[nd->last.len]) > nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; > /* we _can_ be in RCU mode here */ > dentry = lookup_fast(nd, &inode, &seq); > if (IS_ERR(dentry)) > return ERR_CAST(dentry); > if (likely(dentry)) > goto finish_lookup; > > and we never get to the inode locking sequence at all. > > So just adding a > > if (nd->flags & LOOKUP_NONBLOCK) > return -EAGAIN; > > there is probably the right thing - rather than worry about trylock on > the inode lock. I really like that, as it also means we can safely drop the mnt_want_write() path and just safe that for a separate cleanup that's not related to this at all. With that, I'd suggest folding the two patches into one as we no longer need followup work. It also means we can drop the concept of having two flags for this, LOOKUP_NONBLOCK covers the entire case of the existing fast path. Might still make sense to name it LOOKUP_CACHE instead, but I'll leave that for you/Al to decide... > Because if you've missed in the dcache, you've by definition lost the fast-path. > > That said, numbers talk, BS walks, and maybe somebody has better loads > with better numbers. I assume Jens has some internal FB io_uring loads > and could do stats on what kinds of pathname opens matter there... Definitely, was already planning on checking that. And just as importantly, once we have this, it's pretty trivial to do the same thing on the stat side as well, which makes me excited for that change too. io_uring currently punts that one too, and that's arguably even more interesting than open if we can avoid that for the fast case.
On Sun, Dec 13, 2020 at 5:52 PM Dave Chinner <david@fromorbit.com> wrote: > > On Sun, Dec 13, 2020 at 04:45:39PM -0800, Linus Torvalds wrote: > > On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > Only O_CREAT | O_TRUNC should matter, since those are the ones that > > > > > cause writes as part of the *open*. > > > > > > And __O_TMPFILE, which is the same as O_CREAT. > > > > This made me go look at the code, but we seem to be ok here - > > __O_TMPFILE should never get to the do_open() logic at all, because it > > gets caught before that and does off to do_tmpfile() and then > > vfs_tmpfile() instead. > > > > And then it's up to the filesystem to do the inode locking if it needs > > to - it has a separate i_io->tempfile function for that. > > Sure, and then it blocks. Yes. I was more just double-checking that currently really odd if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { condition that didn't make sense to me (it basically does two different kinds of writablity checks). So it was more that you pointed out that __O_TMPFILE was also missing from that odd condition, and that turns out to be because it was handled separately. So no disagreement about __O_TMPFILE being a "not a cached operation" - purely a "that condition is odd". It was just that O_WRONLY | O_RDWR didn't make tons of sense to me, since we then get the write count only to then drop it immediately immediately without having actually done any writes. But I guess they are there only as a "even if we don't write to the filesystem right now, we do want to get the EROFS error return from open(), rather than at write() time". I think technically it shouldn't need to do the pointless "synchronize and increment writers only to decrement them again" dance, and could just do a "mnt_is_readonly()" test for the plain "open writably, but without O_CREAT/O_TRUNC" case. But I guess there's no real downside to doing it the way we're doing it - it just looked odd to me when I was looking at it in the context of just pathname lookup. In do_faccessat() we have that bare __mnt_is_readonly() for the "I want EROFS, but I'm not actually writing now" case. Linus
diff --git a/fs/namei.c b/fs/namei.c index 07a1aa874f65..1f976a213eef 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3127,6 +3127,7 @@ static const char *open_last_lookups(struct nameidata *nd, struct dentry *dir = nd->path.dentry; int open_flag = op->open_flag; bool got_write = false; + bool nonblock = nd->flags & LOOKUP_NONBLOCK; unsigned seq; struct inode *inode; struct dentry *dentry; @@ -3164,17 +3165,38 @@ static const char *open_last_lookups(struct nameidata *nd, } if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { - got_write = !mnt_want_write(nd->path.mnt); + if (nonblock) { + got_write = !mnt_want_write_trylock(nd->path.mnt); + if (!got_write) + return ERR_PTR(-EAGAIN); + } else { + got_write = !mnt_want_write(nd->path.mnt); + } /* * do _not_ fail yet - we might not need that or fail with * a different error; let lookup_open() decide; we'll be * dropping this one anyway. */ } - if (open_flag & O_CREAT) - inode_lock(dir->d_inode); - else - inode_lock_shared(dir->d_inode); + if (open_flag & O_CREAT) { + if (nonblock) { + if (!inode_trylock(dir->d_inode)) { + dentry = ERR_PTR(-EAGAIN); + goto drop_write; + } + } else { + inode_lock(dir->d_inode); + } + } else { + if (nonblock) { + if (!inode_trylock_shared(dir->d_inode)) { + dentry = ERR_PTR(-EAGAIN); + goto drop_write; + } + } else { + inode_lock_shared(dir->d_inode); + } + } dentry = lookup_open(nd, file, op, got_write); if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED)) fsnotify_create(dir->d_inode, dentry); @@ -3183,6 +3205,7 @@ static const char *open_last_lookups(struct nameidata *nd, else inode_unlock_shared(dir->d_inode); +drop_write: if (got_write) mnt_drop_write(nd->path.mnt); @@ -3242,6 +3265,7 @@ static int do_open(struct nameidata *nd, open_flag &= ~O_TRUNC; acc_mode = 0; } else if (d_is_reg(nd->path.dentry) && open_flag & O_TRUNC) { + WARN_ON_ONCE(nd->flags & LOOKUP_NONBLOCK); error = mnt_want_write(nd->path.mnt); if (error) return error; @@ -3311,7 +3335,10 @@ static int do_tmpfile(struct nameidata *nd, unsigned flags, int error = path_lookupat(nd, flags | LOOKUP_DIRECTORY, &path); if (unlikely(error)) return error; - error = mnt_want_write(path.mnt); + if (flags & LOOKUP_NONBLOCK) + error = mnt_want_write_trylock(path.mnt); + else + error = mnt_want_write(path.mnt); if (unlikely(error)) goto out; child = vfs_tmpfile(path.dentry, op->mode, op->open_flag);
We handle it for the path resolution itself, but we should also factor it in for open_last_lookups() and tmpfile open. We don't allow RESOLVE_NONBLOCK with O_TRUNC, so that case we can safely ignore. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/namei.c | 39 +++++++++++++++++++++++++++++++++------ 1 file changed, 33 insertions(+), 6 deletions(-)