Message ID | 20200406110702.GA13469@nautica (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [GIT,PULL] 9p update for 5.7 | expand |
On Mon, Apr 6, 2020 at 4:07 AM Dominique Martinet <asmadeus@codewreck.org> wrote: > > - Fix read with O_NONBLOCK to allow incomplete read and return > immediately Hmm. This is kind of special semantics (normally a POSIX filesystem ignores O_NONBLOCK), but I guess it makes sense for a network filesystem. It might be worth a bti more documentation/commenting because of the special semantics. For example, since you don't have 'poll()', O_NONBLOCK doesn't really mean "nonblocking", it means "stop earlier" if I read that patch right. You can't just return -EAGAIN because there's no way to then avoid busy looping.. Linus
Linus Torvalds wrote on Mon, Apr 06, 2020: > On Mon, Apr 6, 2020 at 4:07 AM Dominique Martinet > <asmadeus@codewreck.org> wrote: > > - Fix read with O_NONBLOCK to allow incomplete read and return > > immediately > > Hmm. This is kind of special semantics (normally a POSIX filesystem > ignores O_NONBLOCK), but I guess it makes sense for a network > filesystem. > > It might be worth a bti more documentation/commenting because of the > special semantics. For example, since you don't have 'poll()', > O_NONBLOCK doesn't really mean "nonblocking", it means "stop earlier" > if I read that patch right. You can't just return -EAGAIN because > there's no way to then avoid busy looping.. Yes, I think you got this right. Basically there is no way to tell if the server will return immediately or not, so even with O_NONBLOCK the read() will still be 'blocking' if the server decides to wait for something before sending a reply. This patch will just make the read stop after a single round-trip with the server instead of looping to fill the buffer if O_NONBLOCK is set. The use-case here is stuff like reading from synthetic files (think fake pipes) where data comes in like a pipe and one would want read to return as soon as data is available. Just thinking out loud it might be possible to make pipes go through the server and somewhat work, but this might bring its own share of other problems and existing programs would need to be changed (e.g. wmii's synthetic filesystem exposes this kind of files as well as regular files, which works fine for their userspace client (wmiir) but can't really be used with a linux client) (Added Sergey in Cc for opinion) Anyway, I agree looking at O_NONBLOCK for that isn't obvious. I agree with the usecase here and posix allows short reads regardless of the flag so the behaviour is legal either way ; the filesystem is allowed to return whenever it wants on a whim - let's just add some docs as you suggest unless Sergey has something to add. I will add a few lines on that in Documentation/filesystems/9p.txt and send another pull request in a couple of days to give whoever wants to review time to comment on wording. Cheers,
The pull request you sent on Mon, 6 Apr 2020 13:07:02 +0200:
> https://github.com/martinetd/linux tags/9p-for-5.7
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/e14679b62d84b8ab9136189fc069d389da43fe71
Thank you!
On Mon, Apr 06, 2020 at 06:40:57PM +0200, Dominique Martinet wrote: > Anyway, I agree looking at O_NONBLOCK for that isn't obvious. > I agree with the usecase here and posix allows short reads regardless of > the flag so the behaviour is legal either way ; the filesystem is > allowed to return whenever it wants on a whim - let's just add some docs > as you suggest unless Sergey has something to add. Ahahahahhahahahahaha. POSIX may well "allow" short reads, but userspace programmers basically never check the return value from read(). Short reads aren't actually allowed. That's why signals are only allowed to interrupt syscalls if they're fatal (and the application will never see the returned value because it's already dead).
Matthew Wilcox wrote on Mon, Apr 06, 2020: > POSIX may well "allow" short reads, but userspace programmers basically > never check the return value from read(). Short reads aren't actually > allowed. That's why signals are only allowed to interrupt syscalls if > they're fatal (and the application will never see the returned value > because it's already dead). I've seen tons of programs not check read return value yes but these also have no idea what O_NONBLOCK is so I'm not sure how realistic a use-case that is? The alternative I see would be making pipes go through the server as I said, but that would probably mean another mount option for this; pipes work as local pipes like they do in nfs currently.
On Mon, Apr 6, 2020 at 9:46 AM Matthew Wilcox <willy@infradead.org> wrote: > > POSIX may well "allow" short reads, but userspace programmers basically > never check the return value from read(). Short reads aren't actually > allowed. That's why signals are only allowed to interrupt syscalls if > they're fatal (and the application will never see the returned value > because it's already dead). Well, that's true for some applications. But look at anybody who ever worked more with NFS mounts, and they got used to having the 'intr' mount flag set and incomplete reads and -EAGAIN as a result. So a lot of normal applications are actually used to partial reads even from file reads. Are there apps that react badly? I'm sure - but they also wouldn't have O_NONBLOCK set on a regular file. The only reason to set O_NONBLOCK is because you think the fd might be a pipe or something, and you _are_ ready to get partial reads. So the 9p behavior certainly isn't outrageously out of line for a network filesystem. In fact, because of O_NONBLOCK rather than a mount option, I think it's a lot safer than a fairly standard NFS option. Linus
On Mon, Apr 06, 2020 at 10:04:11AM -0700, Linus Torvalds wrote: > On Mon, Apr 6, 2020 at 9:46 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > POSIX may well "allow" short reads, but userspace programmers basically > > never check the return value from read(). Short reads aren't actually > > allowed. That's why signals are only allowed to interrupt syscalls if > > they're fatal (and the application will never see the returned value > > because it's already dead). > > Well, that's true for some applications. > > But look at anybody who ever worked more with NFS mounts, and they got > used to having the 'intr' mount flag set and incomplete reads and > -EAGAIN as a result. That's why you had me implement TASK_KILLABLE ;-) > Are there apps that react badly? I'm sure - but they also wouldn't > have O_NONBLOCK set on a regular file. The only reason to set > O_NONBLOCK is because you think the fd might be a pipe or something, > and you _are_ ready to get partial reads. > > So the 9p behavior certainly isn't outrageously out of line for a > network filesystem. In fact, because of O_NONBLOCK rather than a mount > option, I think it's a lot safer than a fairly standard NFS option. The NFS option has been a no-op for over a decade ;-) I agree with you that O_NONBLOCK is a good indicator the application is willing to handle short reads (or indeed writes).
On Mon, Apr 6, 2020 at 10:40 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > But look at anybody who ever worked more with NFS mounts, and they got > > used to having the 'intr' mount flag set and incomplete reads and > > -EAGAIN as a result. > > That's why you had me implement TASK_KILLABLE ;-) Oh, absolutely. We can *NOT* do this in general. Applications _will_ break if you end up just randomly breaking POSIX behavior. But network filesystems are almost never fully POSIX anyway. And yes, they do break some apps. 'intr' may not be a thing any more, but other differences wrt various atomicity guarantees (or file locking) etc still exist. So the whole "network filesystems do odd things in corner cases" isn't exactly unusual. I think the O_NONBLOCK difference is one of the more benign ones. I just think it should be documented more. Linus
Linus Torvalds wrote on Mon, Apr 06, 2020: > I think the O_NONBLOCK difference is one of the more benign ones. > > I just think it should be documented more. Hadn't explicitely added you in Ccs, but I did post something to document it: http://lkml.kernel.org/r/1586193572-1375-1-git-send-email-asmadeus@codewreck.org (small enough to just quote:) -------------- diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt index fec7144e817c..3fb780ffdf23 100644 --- a/Documentation/filesystems/9p.txt +++ b/Documentation/filesystems/9p.txt @@ -133,6 +133,16 @@ OPTIONS cache tags for existing cache sessions can be listed at /sys/fs/9p/caches. (applies only to cache=fscache) +BEHAVIOR +======== + +This section aims at describing 9p 'quirks' that can be different +from a local filesystem behaviors. + + - Setting O_NONBLOCK on a file will make client reads return as early + as the server returns some data instead of trying to fill the read + buffer with the requested amount of bytes or end of file is reached. + RESOURCES ========= -------------- Will submit the pull request again with that in a couple of days unless anything bad happens. Thanks,
On Mon, Apr 06, 2020 at 06:40:57PM +0200, Dominique Martinet wrote: > The use-case here is stuff like reading from synthetic files (think fake > pipes) where data comes in like a pipe and one would want read to return > as soon as data is available. > Just thinking out loud it might be possible to make pipes go through the > server and somewhat work, but this might bring its own share of other > problems and existing programs would need to be changed (e.g. wmii's > synthetic filesystem exposes this kind of files as well as regular > files, which works fine for their userspace client (wmiir) but can't > really be used with a linux client) > Anyway, I agree looking at O_NONBLOCK for that isn't obvious. > I agree with the usecase here and posix allows short reads regardless of > the flag so the behaviour is legal either way ; the filesystem is > allowed to return whenever it wants on a whim - let's just add some docs > as you suggest unless Sergey has something to add. In fact i would prefer disabling the full reads unconditionally, but AFAIR some userspace programs might interpret a short read as EOF (and also would need to check the logic that motivated the kernel-side looping).
L29Ah wrote on Tue, Apr 07, 2020: > In fact i would prefer disabling the full reads unconditionally, but > AFAIR some userspace programs might interpret a short read as EOF (and > also would need to check the logic that motivated the kernel-side > looping). Willy is correct there we can't just do that, way too many applications would break. I think O_NONBLOCK on regular files is a good compromise, let's not go overboard :)
On Mon, Apr 6, 2020 at 7:16 PM L29Ah <l29ah@cock.li> wrote: > > In fact i would prefer disabling the full reads unconditionally, but AFAIR some userspace programs might interpret a short read as EOF (and also would need to check the logic that motivated the kernel-side looping). Oh, it's even worse than "might interpret a short read as EOF". Lots of ad-hoc small tools will basically do something like fd = open(name, O_RDONLY); fstat(fd, &st); buf = malloc(st.st_size); read(fd, buf, st.st_size); and be done with it. Obviously they may have some error handling (ie imagine the above being written with proper tests for buf beign NULL and 'fstat()' returning an error), but if they check the return value of "read()" at all, it might be just to verify that it matches st.st_size. I've written stuff like that myself. Sure, the "real" programs I write would have loops with EAGAIN and partial reads, and maybe I'd have a helper function called "xread()" that does that. And most major applications will do things like that, exactly because they've seen years of development, they're trying to be portable, and they might even have hit other network filesystems that do partial reads or return EAGAIN - or they might have more complex functionality anyway which allows you to pipe things in from a buffer etc. But the above kind of "assume read() gets the whole thing" is not unusual for quick hacks. After all, it's a _valid_ assumption for a proper POSIX filesystem, although it obviously _also_ assumes that nobody else is writing to that file at the same time. And some of those quick hacks may end up existing for years in major code-bases, who knows.. [ Honesty in advertising: the Linux VFS layer itself says "screw POSIX" for some things. Particularly, if somebody tries to do a read larger than 2GB in size, the VFS layer will just say "POSIX is garbage in this situation, we _will_ truncate this read". So if you deal with huge files, you _have_ to do the proper "loop until EOF" even for regular files, and POSIX be damned. The kernel refuses to do crazy things, and no amount of standard paperwork matters. ] But basically honoring full reads for any _reasonable_ situation is pretty much required for a lot of reasons. Yes, lots of apps will deal gracefully with partial reads - maybe even most. But "lots" is not "all". Linus