diff mbox series

[1/3] readfile: implement readfile syscall

Message ID 20200704140250.423345-2-gregkh@linuxfoundation.org (mailing list archive)
State New, archived
Headers show
Series readfile(2): a new syscall to make open/read/close faster | expand

Commit Message

Greg KH July 4, 2020, 2:02 p.m. UTC
It's a tiny syscall, meant to allow a user to do a single "open this
file, read into this buffer, and close the file" all in a single shot.

Should be good for reading "tiny" files like sysfs, procfs, and other
"small" files.

There is no restarting the syscall, this is a "simple" syscall, with the
attempt to make reading "simple" files easier with less syscall
overhead.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/open.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

Comments

Geert Uytterhoeven July 4, 2020, 6:35 p.m. UTC | #1
Hi Greg,

On Sat, Jul 4, 2020 at 4:05 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> It's a tiny syscall, meant to allow a user to do a single "open this
> file, read into this buffer, and close the file" all in a single shot.
>
> Should be good for reading "tiny" files like sysfs, procfs, and other
> "small" files.
>
> There is no restarting the syscall, this is a "simple" syscall, with the
> attempt to make reading "simple" files easier with less syscall
> overhead.
>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Thanks for your patch!

> --- a/fs/open.c
> +++ b/fs/open.c

> +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
> +               char __user *, buffer, size_t, bufsize, int, flags)
> +{
> +       struct open_flags op;
> +       struct open_how how;
> +       struct file *file;
> +       loff_t pos = 0;
> +       int retval;
> +
> +       /* only accept a small subset of O_ flags that make sense */
> +       if ((flags & (O_NOFOLLOW | O_NOATIME)) != flags)
> +               return -EINVAL;
> +
> +       /* add some needed flags to be able to open the file properly */
> +       flags |= O_RDONLY | O_LARGEFILE;
> +
> +       how = build_open_how(flags, 0000);
> +       retval = build_open_flags(&how, &op);
> +       if (retval)
> +               return retval;
> +
> +       file = readfile_open(dfd, filename, &op);
> +       if (IS_ERR(file))
> +               return PTR_ERR(file);
> +
> +       retval = vfs_read(file, buffer, bufsize, &pos);

Should there be a way for the user to be informed that the file doesn't
fit in the provided buffer (.e.g. -EFBIG)?

> +
> +       filp_close(file, NULL);
> +
> +       return retval;
> +}

Gr{oetje,eeting}s,

                        Geert
Matthew Wilcox July 4, 2020, 7:14 p.m. UTC | #2
On Sat, Jul 04, 2020 at 04:02:47PM +0200, Greg Kroah-Hartman wrote:
> +	/* only accept a small subset of O_ flags that make sense */
> +	if ((flags & (O_NOFOLLOW | O_NOATIME)) != flags)
> +		return -EINVAL;
> +
> +	/* add some needed flags to be able to open the file properly */
> +	flags |= O_RDONLY | O_LARGEFILE;

Should we also add O_NOCTTY to prevent any problems if this is called on
a tty?
Miklos Szeredi July 4, 2020, 7:41 p.m. UTC | #3
On Sat, Jul 4, 2020 at 4:03 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> It's a tiny syscall, meant to allow a user to do a single "open this
> file, read into this buffer, and close the file" all in a single shot.
>
> Should be good for reading "tiny" files like sysfs, procfs, and other
> "small" files.
>
> There is no restarting the syscall, this is a "simple" syscall, with the
> attempt to make reading "simple" files easier with less syscall
> overhead.
>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>  fs/open.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
>
> diff --git a/fs/open.c b/fs/open.c
> index 6cd48a61cda3..4469faa9379c 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1370,3 +1370,53 @@ int stream_open(struct inode *inode, struct file *filp)
>  }
>
>  EXPORT_SYMBOL(stream_open);
> +
> +static struct file *readfile_open(int dfd, const char __user *filename,
> +                                 struct open_flags *op)
> +{
> +       struct filename *tmp;
> +       struct file *f;
> +
> +       tmp = getname(filename);
> +       if (IS_ERR(tmp))
> +               return (struct file *)tmp;
> +
> +       f = do_filp_open(dfd, tmp, op);
> +       if (!IS_ERR(f))
> +               fsnotify_open(f);
> +
> +       putname(tmp);
> +       return f;
> +}
> +
> +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
> +               char __user *, buffer, size_t, bufsize, int, flags)
> +{
> +       struct open_flags op;
> +       struct open_how how;
> +       struct file *file;
> +       loff_t pos = 0;
> +       int retval;
> +
> +       /* only accept a small subset of O_ flags that make sense */
> +       if ((flags & (O_NOFOLLOW | O_NOATIME)) != flags)
> +               return -EINVAL;
> +
> +       /* add some needed flags to be able to open the file properly */
> +       flags |= O_RDONLY | O_LARGEFILE;
> +
> +       how = build_open_how(flags, 0000);
> +       retval = build_open_flags(&how, &op);
> +       if (retval)
> +               return retval;
> +
> +       file = readfile_open(dfd, filename, &op);
> +       if (IS_ERR(file))
> +               return PTR_ERR(file);
> +
> +       retval = vfs_read(file, buffer, bufsize, &pos);
> +
> +       filp_close(file, NULL);
> +
> +       return retval;

Manpage says: "doing the sequence of open() and then read() and then
close()", which is exactly what it does.

But then it goes on to say: "If the file is larger than the value
provided in count then only count number of bytes will be copied into
buf", which is only half true, it should be: "If the file is larger
than the value provided in count then at most count number of bytes
will be copied into buf", which is not a lot of information.

And "If the size of file is smaller than the value provided in count
then the whole file will be copied into buf", which is simply a lie;
for example seq_file will happily return a smaller-than-PAGE_SIZE
chunk if at least one record fits in there.  You'll have a very hard
time explaining that in the man page.  So I think there are two
possible ways forward:

 1) just leave the first explanation (it's an open + read + close
equivalent) and leave out the rest

 2) add a loop around the vfs_read() in the code.

I'd strongly prefer #2 because with the non-looping read it's
impossible to detect whether the file was completely read or not, and
that's just going to lead to surprises and bugs in userspace code.

Thanks,
Miklos
Al Viro July 4, 2020, 8:12 p.m. UTC | #4
On Sat, Jul 04, 2020 at 09:41:09PM +0200, Miklos Szeredi wrote:
> And "If the size of file is smaller than the value provided in count
> then the whole file will be copied into buf", which is simply a lie;
> for example seq_file will happily return a smaller-than-PAGE_SIZE
> chunk if at least one record fits in there.  You'll have a very hard
> time explaining that in the man page.  So I think there are two
> possible ways forward:
> 
>  1) just leave the first explanation (it's an open + read + close
> equivalent) and leave out the rest
> 
>  2) add a loop around the vfs_read() in the code.

3) don't bother with the entire thing, until somebody manages to demonstrate
a setup where it does make a real difference (compared to than the obvious
sequence of syscalls, that is).  At which point we'll need to figure out
what's going on and deal with the underlying problem of that setup.
Al Viro July 4, 2020, 8:16 p.m. UTC | #5
On Sat, Jul 04, 2020 at 09:12:06PM +0100, Al Viro wrote:
> On Sat, Jul 04, 2020 at 09:41:09PM +0200, Miklos Szeredi wrote:
> > And "If the size of file is smaller than the value provided in count
> > then the whole file will be copied into buf", which is simply a lie;
> > for example seq_file will happily return a smaller-than-PAGE_SIZE
> > chunk if at least one record fits in there.  You'll have a very hard
> > time explaining that in the man page.  So I think there are two
> > possible ways forward:
> > 
> >  1) just leave the first explanation (it's an open + read + close
> > equivalent) and leave out the rest
> > 
> >  2) add a loop around the vfs_read() in the code.
> 
> 3) don't bother with the entire thing, until somebody manages to demonstrate
> a setup where it does make a real difference (compared to than the obvious
> sequence of syscalls, that is).  At which point we'll need to figure out
> what's going on and deal with the underlying problem of that setup.

Incidentally, if that's intended for use on _sysfs_, I would like to see the
effects of that sucker being called by many processes in parallel, seeing that
sysfs has, er, certain scalability problems in its lookups.  And I would be
very surprised if they were not heavier than said overhead of two extra syscalls.
diff mbox series

Patch

diff --git a/fs/open.c b/fs/open.c
index 6cd48a61cda3..4469faa9379c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1370,3 +1370,53 @@  int stream_open(struct inode *inode, struct file *filp)
 }
 
 EXPORT_SYMBOL(stream_open);
+
+static struct file *readfile_open(int dfd, const char __user *filename,
+				  struct open_flags *op)
+{
+	struct filename *tmp;
+	struct file *f;
+
+	tmp = getname(filename);
+	if (IS_ERR(tmp))
+		return (struct file *)tmp;
+
+	f = do_filp_open(dfd, tmp, op);
+	if (!IS_ERR(f))
+		fsnotify_open(f);
+
+	putname(tmp);
+	return f;
+}
+
+SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
+		char __user *, buffer, size_t, bufsize, int, flags)
+{
+	struct open_flags op;
+	struct open_how how;
+	struct file *file;
+	loff_t pos = 0;
+	int retval;
+
+	/* only accept a small subset of O_ flags that make sense */
+	if ((flags & (O_NOFOLLOW | O_NOATIME)) != flags)
+		return -EINVAL;
+
+	/* add some needed flags to be able to open the file properly */
+	flags |= O_RDONLY | O_LARGEFILE;
+
+	how = build_open_how(flags, 0000);
+	retval = build_open_flags(&how, &op);
+	if (retval)
+		return retval;
+
+	file = readfile_open(dfd, filename, &op);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	retval = vfs_read(file, buffer, bufsize, &pos);
+
+	filp_close(file, NULL);
+
+	return retval;
+}