Message ID | 20201210200114.525026-3-axboe@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK | expand |
On Thu, Dec 10, 2020 at 01:01:14PM -0700, Jens Axboe wrote: > Now that we support non-blocking path resolution internally, expose it > via openat2() in the struct open_how ->resolve flags. This allows > applications using openat2() to limit path resolution to the extent that > it is already cached. > > If the lookup cannot be satisfied in a non-blocking manner, openat2(2) > will return -1/-EAGAIN. > > Cc: Al Viro <viro@zeniv.linux.org.uk> > Signed-off-by: Jens Axboe <axboe@kernel.dk> > --- > fs/open.c | 2 ++ > include/linux/fcntl.h | 2 +- > include/uapi/linux/openat2.h | 2 ++ > 3 files changed, 5 insertions(+), 1 deletion(-) > > diff --git a/fs/open.c b/fs/open.c > index 9af548fb841b..07dc9f3d1628 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -1087,6 +1087,8 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) > lookup_flags |= LOOKUP_BENEATH; > if (how->resolve & RESOLVE_IN_ROOT) > lookup_flags |= LOOKUP_IN_ROOT; > + if (how->resolve & RESOLVE_NONBLOCK) > + lookup_flags |= LOOKUP_NONBLOCK; > > op->lookup_flags = lookup_flags; > return 0; > diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h > index 921e750843e6..919a13c9317c 100644 > --- a/include/linux/fcntl.h > +++ b/include/linux/fcntl.h > @@ -19,7 +19,7 @@ > /* List of all valid flags for the how->resolve argument: */ > #define VALID_RESOLVE_FLAGS \ > (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \ > - RESOLVE_BENEATH | RESOLVE_IN_ROOT) > + RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_NONBLOCK) > > /* List of all open_how "versions". */ > #define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */ > diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h > index 58b1eb711360..ddbf0796841a 100644 > --- a/include/uapi/linux/openat2.h > +++ b/include/uapi/linux/openat2.h > @@ -35,5 +35,7 @@ struct open_how { > #define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".." > be scoped inside the dirfd > (similar to chroot(2)). */ > +#define RESOLVE_NONBLOCK 0x20 /* Only complete if resolution can be > + done without IO */ I don't think this describes the implementation correctly - it has nothing to actually do with whether IO is needed, just whether the lookup can be done without taking blocking locks. The slow path can complete without doing IO - it might miss the dentry cache but hit the filesystem buffer cache on lookup and the inode cache when retrieving the inode. And it may not even block anywhere doing this. So, really, this isn't avoiding IO at all - it's avoiding the possibility of running a lookup path that might blocking on something. This also needs a openat2(2) man page update explaining exactly what behaviour/semantics this flag provides and that userspace can rely on when this flag is set... We've been failing to define the behaviour of our interfaces clearly, especially around non-blocking IO behaviour in recent times. We need to fix that, not make matters worse by adding new, poorly defined non-blocking behaviours... I'd also like to know how we actually test this is working- a reliable regression test for fstests would be very useful for ensuring that the behaviour as defined by the man page is not broken accidentally by future changes... Cheers, Dave.
On 12/10/20 3:29 PM, Dave Chinner wrote: > On Thu, Dec 10, 2020 at 01:01:14PM -0700, Jens Axboe wrote: >> Now that we support non-blocking path resolution internally, expose it >> via openat2() in the struct open_how ->resolve flags. This allows >> applications using openat2() to limit path resolution to the extent that >> it is already cached. >> >> If the lookup cannot be satisfied in a non-blocking manner, openat2(2) >> will return -1/-EAGAIN. >> >> Cc: Al Viro <viro@zeniv.linux.org.uk> >> Signed-off-by: Jens Axboe <axboe@kernel.dk> >> --- >> fs/open.c | 2 ++ >> include/linux/fcntl.h | 2 +- >> include/uapi/linux/openat2.h | 2 ++ >> 3 files changed, 5 insertions(+), 1 deletion(-) >> >> diff --git a/fs/open.c b/fs/open.c >> index 9af548fb841b..07dc9f3d1628 100644 >> --- a/fs/open.c >> +++ b/fs/open.c >> @@ -1087,6 +1087,8 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) >> lookup_flags |= LOOKUP_BENEATH; >> if (how->resolve & RESOLVE_IN_ROOT) >> lookup_flags |= LOOKUP_IN_ROOT; >> + if (how->resolve & RESOLVE_NONBLOCK) >> + lookup_flags |= LOOKUP_NONBLOCK; >> >> op->lookup_flags = lookup_flags; >> return 0; >> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h >> index 921e750843e6..919a13c9317c 100644 >> --- a/include/linux/fcntl.h >> +++ b/include/linux/fcntl.h >> @@ -19,7 +19,7 @@ >> /* List of all valid flags for the how->resolve argument: */ >> #define VALID_RESOLVE_FLAGS \ >> (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \ >> - RESOLVE_BENEATH | RESOLVE_IN_ROOT) >> + RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_NONBLOCK) >> >> /* List of all open_how "versions". */ >> #define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */ >> diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h >> index 58b1eb711360..ddbf0796841a 100644 >> --- a/include/uapi/linux/openat2.h >> +++ b/include/uapi/linux/openat2.h >> @@ -35,5 +35,7 @@ struct open_how { >> #define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".." >> be scoped inside the dirfd >> (similar to chroot(2)). */ >> +#define RESOLVE_NONBLOCK 0x20 /* Only complete if resolution can be >> + done without IO */ > > I don't think this describes the implementation correctly - it has > nothing to actually do with whether IO is needed, just whether the > lookup can be done without taking blocking locks. The slow path can > complete without doing IO - it might miss the dentry cache but hit > the filesystem buffer cache on lookup and the inode cache when > retrieving the inode. And it may not even block anywhere doing this. > > So, really, this isn't avoiding IO at all - it's avoiding the > possibility of running a lookup path that might blocking on > something. Right, it's about not blocking, as the commit message says. That could be IO, either directly or indirectly, or it could be just locking. I'll update this comment. > This also needs a openat2(2) man page update explaining exactly what > behaviour/semantics this flag provides and that userspace can rely > on when this flag is set... Agree, I'll add that. > We've been failing to define the behaviour of our interfaces clearly, > especially around non-blocking IO behaviour in recent times. We need > to fix that, not make matters worse by adding new, poorly defined > non-blocking behaviours... Also agree on that! It doesn't help that different folks have different criteria of what nowait/nonblock means... > I'd also like to know how we actually test this is working- a > reliable regression test for fstests would be very useful for > ensuring that the behaviour as defined by the man page is not broken > accidentally by future changes... Definitely. On the io_uring side, I generally run with a debug patch that triggers if anything blocks off the submission path. That won't really work for this one, however. FWIW, I'm quite fine deferring this patch, I obviously care more about the io_uring side of things. It seems like a no-brainer to support for openat2 though, as it would allow applications to decide when to punt open to another thread. Since this one ties in very closely with LOOKUP_RCU, I'm not _too_ worried about this one in particular. But it would be great to have something that we could use with eg RWF_NOWAIT as well, and similar constructs. I'll give it some thought.
On Thu, Dec 10, 2020 at 2:29 PM Dave Chinner <david@fromorbit.com> wrote: > > So, really, this isn't avoiding IO at all - it's avoiding the > possibility of running a lookup path that might blocking on > something. For pathname lookup, the only case that matters is the RCU lockless lookup. That cache hits basically 100% of the time except for the first lookup, or under memory pressure. And honestly, from a performance perspective, it's the lockless path lookup that matters most. By the time you have to go to the filesystem, take the directory locks etc, you've already lost. So we're never going to bother with some kind of "lockless lookup for actual filesystems", because it's only extra work for no actual gain. End result: LOOKUP_NONBLOCK is about not just avoiding IO, but about avoiding the filesystem and the inevitable locking that causes. Linus
On Thu, Dec 10, 2020 at 03:29:23PM -0800, Linus Torvalds wrote: > On Thu, Dec 10, 2020 at 2:29 PM Dave Chinner <david@fromorbit.com> wrote: > > > > So, really, this isn't avoiding IO at all - it's avoiding the > > possibility of running a lookup path that might blocking on > > something. > > For pathname lookup, the only case that matters is the RCU lockless lookup. > > That cache hits basically 100% of the time except for the first > lookup, or under memory pressure. > > And honestly, from a performance perspective, it's the lockless path > lookup that matters most. By the time you have to go to the > filesystem, take the directory locks etc, you've already lost. > > So we're never going to bother with some kind of "lockless lookup for > actual filesystems", because it's only extra work for no actual gain. > > End result: LOOKUP_NONBLOCK is about not just avoiding IO, but about > avoiding the filesystem and the inevitable locking that causes. Umm, yes, that is _exactly_ what I just said. :/ Cheers, Dave.
On Thu, Dec 10, 2020 at 4:58 PM Dave Chinner <david@fromorbit.com> wrote: > > Umm, yes, that is _exactly_ what I just said. :/ .,. but it _sounded_ like you would actually want to do the whole filesystem thing, since why would you have piped up otherwise. I just wanted to clarify that the onle sane model is the one that patch actually implements. Otherwise, your email was just nit-picking about a single word in a comment in a header file. Was that really what you wanted to do? Linus
On Thu, Dec 10, 2020 at 05:01:44PM -0800, Linus Torvalds wrote: > On Thu, Dec 10, 2020 at 4:58 PM Dave Chinner <david@fromorbit.com> wrote: > > > > Umm, yes, that is _exactly_ what I just said. :/ > > .,. but it _sounded_ like you would actually want to do the whole > filesystem thing, since why would you have piped up otherwise. I just > wanted to clarify that the onle sane model is the one that patch > actually implements. <sigh> Is that really what you think motivates me, Linus? It sounds like you've created an Evil Dave strawman and you simply cannot see past the taint Evil Dave has created in your head. :/ I commented because Jens has recently found several issues with inconsistencies in "non-blocking" APIs that we have in the IO path. He's triggered bugs in the non-blocking behaviour in filesystem code through io_uring that we've had to fix (e.g. see commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across extent boundaries"). Then there are the active discussions about the limited ability to use IOCB_NOWAIT for full stack non-blocking IO behaviour w/ REQ_NOWAIT in the block layer because the semantics of IOCB_NOWAIT are directly tied to the requirements of the RWF_NOWAIT preadv2/pwritev2 flag and using REQ_NOWAIT in the block layer will break them. Part of the problem we have with the non-blocking behaviour is that the user interfaces have been under specified, poorly reviewed and targetted a single specific use case on a single filesystem rather than generic behaviour. And mostly they lack the necessary test coverage to ensure all filesystems behave the same way and to inform us of a regression that *would break userspace applications*. Yes, I recognise and accept that some of the problems are partially my fault. I also have a habit of trying to learn from the mistakes I've made and then take steps to ensure that *we do not make those same mistakes again*. > Otherwise, your email was just nit-picking about a single word in a > comment in a header file. > > Was that really what you wanted to do? So for all your talk about "don't break userspace", you think that actively taking steps during reviews to avoid a poor userspace API is "nit-picking"? FYI, having a reviewer ask for a userspace API modification to: - have clearly specified and documented behaviour, - be provided with user documentation, and - be submitted with regression tests is not at all unusual or unexpected. Asking for these things during review on -fsdevel and various filesystem lists is a normal part of the process for getting changes to user APIs reviewed and merged. The fact that Jens replied with "yep, no problems, let's make sure we nail down the semantics" and Al has replied "what does RESOLVE_NONBLOCK actually mean for all the blocking stuff that open does /after/ the pathwalk?" shows that these semantics really do matter Hence they need to be defined, specified, documented and carefully exercised by regression tests. i.e. the patch that introduces the RESOLVE_NONBLOCK flag is the -easy part-, filling in the rest of the blanks is where all the hard work is... Hence calling these requests "nit picking" sets entirely the wrong tone for the wider community. You may not care about things like properly documenting interfaces, but application developers and users sure do and hence it's something we need to pay attention to and encourage. Leaders are supposed to encourage and support good development practices, not be arseholes to the people who ask for good practices to be followed. Please start behaving more like a leader should when I'm around, Linus. -Dave. (Not really Evil.)
On Thu, Dec 10, 2020 at 7:45 PM Dave Chinner <david@fromorbit.com> wrote: > > Part of the problem we have with the non-blocking behaviour is that > the user interfaces have been under specified, poorly reviewed and > targetted a single specific use case on a single filesystem rather > than generic behaviour. And mostly they lack the necessary test > coverage to ensure all filesystems behave the same way and to inform > us of a regression that *would break userspace applications*. Fair enough. I just didn't really see much a concern here, exactly because this ends up not being a hard guarantee in the first place (but the reason I suggested then adding that RESOLVE_NONBLOCK is so that it could be tested without having to rely on timing and io_uring that _should_ get the same result regardless in the end). But the second reason I don't see much concern is exactly because it wouldn't affect individual filesystems. There's nothing new going on as far as filesystem is concerned. > Yes, I recognise and accept that some of the problems are partially > my fault. I also have a habit of trying to learn from the mistakes > I've made and then take steps to ensure that *we do not make those > same mistakes again*. So the third reason I reacted was because we have a history, and you have traditionally not ever really cared unless it's about xfs and IO. Which this thing would very explicitly not be about. The low-level filesystem would never see the semantics at all, and could never get it wrong. Well, a filesystem could "get it wrong" in the same sense that it can get the current LOOKUP_RCU wrong, of course. But that would be either an outright bug and a correctness problem - sleeping in RCU context - or be a performance problem - returning ECHILD very easily due to other reasons. And it would be entirely unrelated to the nonblocking path open, because it would be a performance issue even _normally_, just not visible as semantics. And that's the second reason I like this, and would like to see this, and see RESOLVE_NONBLOCK: exactly because we have _had_ those subtle issues that aren't actually correctness issues, but only "oh, this situation always takes us out of RCU lookup, and it's a very subtle performance problem". For example, it used to be that whenever we saw a symlink, we'd throw up our hands and say "we can't do this in RCU lookup" and give up. That wasn't a low-level filesystem problem, it was literally at the VFS layer, because the RCU lookup was fairly limited. Or some of the security models (well, _all_ of them) used to just say "oh, I can't do this check in RCU mode" and forced the slow path. It was very noticeable from a performance angle under certain loads, because RCU lookup is just _so_ much faster when you otherwise get locking and reference counting cache conflicts. Yes, yes, Al fixed the symlinks long ago, but we know RCU lookup still gives up fairly easily in other circumstances. And RCU lookup giving up eagerly is fine, of course. The whole point is that it's an optimistic "let's see if we can do this really quickly" interface that just works _most_ of the time. But that also makes it easy to miss things, because it's so hard to test when it's purely about latency and all the operations will retry and get the right result in the end. The "noticeable performance issues" are generally not very noticeable at the level of an individual operation - you need to do a lot of them, and often in parallel, to see the _big_ benefits. So RESOLVE_NONBLOCK would be a nice way to perhaps add automated testing for "did we screw up RCU pathname lookup", in addition to perhaps making it easier to find the cases where we currently give up too quickly just because it was _fairly_ rare and nobody had much visibility into that case. And we have had that "oh, RCU lookup broke" a couple of times by mistake - and then it takes a while to notice, because it's purely visible as a performance bug and not necessarily all _that_ obvious - exactly because it's purely about performance, and the semantic end result is the same. Linus
diff --git a/fs/open.c b/fs/open.c index 9af548fb841b..07dc9f3d1628 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1087,6 +1087,8 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) lookup_flags |= LOOKUP_BENEATH; if (how->resolve & RESOLVE_IN_ROOT) lookup_flags |= LOOKUP_IN_ROOT; + if (how->resolve & RESOLVE_NONBLOCK) + lookup_flags |= LOOKUP_NONBLOCK; op->lookup_flags = lookup_flags; return 0; diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index 921e750843e6..919a13c9317c 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -19,7 +19,7 @@ /* List of all valid flags for the how->resolve argument: */ #define VALID_RESOLVE_FLAGS \ (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \ - RESOLVE_BENEATH | RESOLVE_IN_ROOT) + RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_NONBLOCK) /* List of all open_how "versions". */ #define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */ diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h index 58b1eb711360..ddbf0796841a 100644 --- a/include/uapi/linux/openat2.h +++ b/include/uapi/linux/openat2.h @@ -35,5 +35,7 @@ struct open_how { #define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".." be scoped inside the dirfd (similar to chroot(2)). */ +#define RESOLVE_NONBLOCK 0x20 /* Only complete if resolution can be + done without IO */ #endif /* _UAPI_LINUX_OPENAT2_H */
Now that we support non-blocking path resolution internally, expose it via openat2() in the struct open_how ->resolve flags. This allows applications using openat2() to limit path resolution to the extent that it is already cached. If the lookup cannot be satisfied in a non-blocking manner, openat2(2) will return -1/-EAGAIN. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/open.c | 2 ++ include/linux/fcntl.h | 2 +- include/uapi/linux/openat2.h | 2 ++ 3 files changed, 5 insertions(+), 1 deletion(-)