Message ID | 20230519074047.1739879-4-dhowells@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | splice, block: Use page pinning and kill ITER_PIPE | expand |
On Fri, May 19, 2023 at 08:40:18AM +0100, David Howells wrote: > Make direct_read_splice() limit the read to the end of the file for regular > files and block devices, thereby reducing the amount of allocation it will > do in such a case. > > This means that the blockdev code doesn't require any special handling as > filemap_read_splice() also limits to i_size. I'm really not sure if this is a good idea. Right now direct_read_splice (which also appears a little misnamed) really is a splice by calling ->read_iter helper. I we don't do any of this validtion we can just call it directly from splice.c instead of calling into ->splice_read for direct I/O and DAX and remove a ton of boilerplate code. How often do we even call into splice beyond i_size and for how much? > + if (S_ISREG(file_inode(in)->i_mode) || > + S_ISBLK(file_inode(in)->i_mode)) { Btw, having these kinds of checks in supposedly generic code is always a marked for a bit of a layering problem.
Christoph Hellwig <hch@infradead.org> wrote: > I'm really not sure if this is a good idea. Right now > direct_read_splice (which also appears a little misnamed) really is > a splice by calling ->read_iter helper. It can be renamed if you want a different name. copy_splice_read() or copy_splice_read_iter()? > I we don't do any of this validtion we can just call it directly from > splice.c instead of calling into ->splice_read for direct I/O and DAX and > remove a ton of boilerplate code. There's a couple of places where we might not want to do that - at least for non-DAX. shmem and f2fs for example. f2fs calls back to buffered reading under some circumstances. shmem ignores O_DIRECT and always splices from the pagecache. David
On Fri, May 19, 2023 at 09:43:34AM +0100, David Howells wrote: > > direct_read_splice (which also appears a little misnamed) really is > > a splice by calling ->read_iter helper. > > It can be renamed if you want a different name. copy_splice_read() or > copy_splice_read_iter()? Maybe something like that, yes. > > > I we don't do any of this validtion we can just call it directly from > > splice.c instead of calling into ->splice_read for direct I/O and DAX and > > remove a ton of boilerplate code. > > There's a couple of places where we might not want to do that - at least for > non-DAX. shmem and f2fs for example. f2fs calls back to buffered reading > under some circumstances. shmem ignores O_DIRECT and always splices from the > pagecache. So? even if ->read_iter does buffered I/O for O_DIRECT it will still work. This can in fact happen for many other file systems due when they fall back to buffeed I/O due to various reasons.
On Fri, May 19, 2023 at 12:41 AM David Howells <dhowells@redhat.com> wrote: > > + > + if (S_ISREG(file_inode(in)->i_mode) || > + S_ISBLK(file_inode(in)->i_mode)) { This really feels fundamentally wrong to me. If block and regular files have this limit, they should have their own splice_read() function that implements that limit. Not make everybody else check it. IOW, this should be a separate function ("block_splice_read()" or whatever), not inside a generic function that other users use. The zero size checking looks fine, although I wondered about that too. Some special files do traditionally have special meanings for zero-sized reads (as in "packet boundary"). But I suspect that isn't an issue for splice, and perhaps more importantly, I think the same rule should be in place: special files that want special rules shouldn't be using this generic function directly then. Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > + if (S_ISREG(file_inode(in)->i_mode) || > > + S_ISBLK(file_inode(in)->i_mode)) { > > This really feels fundamentally wrong to me. > > If block and regular files have this limit, they should have their own > splice_read() function that implements that limit. > > Not make everybody else check it. > > IOW, this should be a separate function ("block_splice_read()" or > whatever), not inside a generic function that other users use. This is just an optimisation to cut down the amount of bufferage allocated, so I could just drop it and leave it to userspace for now as the filesystem/block layer will stop anyway if it hits the EOF. Christoph would prefer that I call direct_splice_read() from generic_file_splice_read() in all O_DIRECT cases, if that's fine with you. David
On Fri, May 19, 2023 at 9:48 AM David Howells <dhowells@redhat.com> wrote: > > This is just an optimisation to cut down the amount of bufferage allocated So the thing is, it's actually very very wrong for some files. Now, admittedly, those files have other issues too, and it's a design mistake to begin with, but look at a number of files in /proc. In particular, look at the regular files that have a size of '0'. It's quite common indeed. Things like /proc/cpuinfo /proc/stat ... you can find a ton of them with find /proc -type f -size 0 Is it horribly wrong and bad? Yes. I hate it. It means that some really basic user space tools refuse to work on them, and the tools are 100% right - this is a kernel misfeature. Trying to do things like less -S /proc/cpuinfo may or may not work depending on your version of 'less', for example, because it's entirely reasonable to do something like fd = open(..); if (!fstat(fd, &st)) len = st.st_size; and limit your reads to the size of the file - exactly like your patch does. Except it fails horribly on those broken /proc files. I hate it, and I blame myself for the above horror, but it's pretty much unfixable. We could make them look like named pipes or something, but that's really ugly and probably would break other things anyway. And we simply don't know the size ahead of time. Now, *most* things work, because they just do the whole "read until EOF". In fact, my current version of 'less' has no problem at all doing the above thing, and gives the "expected" output. Also, honestly, I really don't think that it's necessarily a good idea to splice /proc files, but we actually do have splice wired up to these because people asked for it: fe33850ff798 ("proc: wire up generic_file_splice_read for iter ops") 4bd6a7353ee1 ("sysctl: Convert to iter interfaces") so I suspect those things do exist. > I could just drop it and leave it to userspace for now as the filesystem/block > layer will stop anyway if it hits the EOF. Christoph would prefer that I call > direct_splice_read() from generic_file_splice_read() in all O_DIRECT cases, if > that's fine with you. I guess that's fine, and for O_DIRECT itself it might even make sense to do the size test. That said, I doubt it matters: if you use O_DIRECT on a small file, you only have yourself to blame for doing something stupid. And if it isn't a small file, then who cares about some small EOF-time optimization? Nobody. So I would suggest not doing that optimization at all, because as-is, it's either pointless or actively broken. That said, I would *not* hate some kind of special FMODE_SIZELIMIT flag that allows filesystems to opt in to "limit reads to size". We already have flags like that: FMODE_UNSIGNED_OFFSET and 'sb->s_maxbytes' are both basically variations on that same theme, and having another flag to say "limit reads to i_size" wouldn't be wrong. It's only wrong when it is done mindlessly with S_ISREG(). Linus
Okay. Let's go with that. So I have to put the handling in vfs_splice_read(): long vfs_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags) { ... if (unlikely(!in->f_op->splice_read)) return warn_unsupported(in, "read"); /* * O_DIRECT and DAX don't deal with the pagecache, so we * allocate a buffer, copy into it and splice that into the pipe. */ if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host)) return copy_splice_read(in, ppos, pipe, len, flags); return in->f_op->splice_read(in, ppos, pipe, len, flags); } which leaves very little in generic_file_splice_read: ssize_t generic_file_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags) { if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes)) return 0; if (unlikely(!len)) return 0; return filemap_splice_read(in, ppos, pipe, len, flags); } so I wonder if the tests in generic_file_splice_read() can be folded into vfs_splice_read(), pointers to generic_file_splice_read() be replaced with pointers to filemap_splice_read() and generic_file_splice_read() just be removed. I suspect we can't quite do this because of the *ppos check - but I wonder if that's actually necessary since filemap_splice_read() checks against i_size... or if the check can be moved there if we definitely want to do it. Certainly, the zero-length check can be done in vfs_splice_read(). David
On Fri, May 19, 2023 at 11:27:51PM +0100, David Howells wrote: > ssize_t generic_file_splice_read(struct file *in, loff_t *ppos, > struct pipe_inode_info *pipe, size_t len, > unsigned int flags) > { > if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes)) > return 0; > if (unlikely(!len)) > return 0; > return filemap_splice_read(in, ppos, pipe, len, flags); > } > > so I wonder if the tests in generic_file_splice_read() can be folded into > vfs_splice_read(), pointers to generic_file_splice_read() be replaced with > pointers to filemap_splice_read() and generic_file_splice_read() just be > removed. > > I suspect we can't quite do this because of the *ppos check - but I wonder if > that's actually necessary since filemap_splice_read() checks against > i_size... or if the check can be moved there if we definitely want to do it. > > Certainly, the zero-length check can be done in vfs_splice_read(). The zero length check makes sense in vfs_splice_read. The ppos check I think makes sense for filemap_splice_read - after all we're dealing with the page cache here, and the page cache needs a working maxbytes and i_size. What callers of filemap_splice_read that don't have the checks do you have in your tree right now and how did you end up with them?
diff --git a/fs/splice.c b/fs/splice.c index 4db3eee49423..89c8516554d1 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -315,6 +315,19 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos, size_t used, npages, chunk, remain, keep = 0; int i; + if (!len) + return 0; + + if (S_ISREG(file_inode(in)->i_mode) || + S_ISBLK(file_inode(in)->i_mode)) { + loff_t i_size = i_size_read(in->f_mapping->host); + + if (*ppos >= i_size) + return 0; + if (len > i_size - *ppos) + len = i_size - *ppos; + } + /* Work out how much data we can actually add into the pipe */ used = pipe_occupancy(pipe->head, pipe->tail); npages = max_t(ssize_t, pipe->max_usage - used, 0);
Make direct_read_splice() limit the read to the end of the file for regular files and block devices, thereby reducing the amount of allocation it will do in such a case. This means that the blockdev code doesn't require any special handling as filemap_read_splice() also limits to i_size. Signed-off-by: David Howells <dhowells@redhat.com> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org --- fs/splice.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)