[v20,23/32] splice: Convert trace/seq to use direct_splice_read()

Message ID: 20230519074047.1739879-24-dhowells@redhat.com
State: New
Series: splice, block: Use page pinning and kill ITER_PIPE

Commit Message

David Howells May 19, 2023, 7:40 a.m. UTC
For the splice from the trace seq buffer, just use direct_splice_read().

In the future, something better can probably be done by gifting pages from
seq->buf into the pipe, but that would require changing seq->buf into a
vmap over an array of pages.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steven Rostedt <rostedt@goodmis.org>
cc: Masami Hiramatsu <mhiramat@kernel.org>
cc: linux-kernel@vger.kernel.org
cc: linux-trace-kernel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Steven Rostedt May 22, 2023, 2:29 p.m. UTC | #1
On Fri, 19 May 2023 08:40:38 +0100
David Howells <dhowells@redhat.com> wrote:

> For the splice from the trace seq buffer, just use direct_splice_read().
> 
> In the future, something better can probably be done by gifting pages from
> seq->buf into the pipe, but that would require changing seq->buf into a
> vmap over an array of pages.

If you can give me a POC of what needs to be done, I could possibly
implement it.

> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Christoph Hellwig <hch@lst.de>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Steven Rostedt <rostedt@goodmis.org>
> cc: Masami Hiramatsu <mhiramat@kernel.org>
> cc: linux-kernel@vger.kernel.org
> cc: linux-trace-kernel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-block@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>  kernel/trace/trace.c | 2 +-

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index ebc59781456a..b664020efcb7 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -5171,7 +5171,7 @@ static const struct file_operations tracing_fops = {
>  	.open		= tracing_open,
>  	.read		= seq_read,
>  	.read_iter	= seq_read_iter,
> -	.splice_read	= generic_file_splice_read,
> +	.splice_read	= direct_splice_read,
>  	.write		= tracing_write_stub,
>  	.llseek		= tracing_lseek,
>  	.release	= tracing_release,

David Howells May 22, 2023, 2:50 p.m. UTC | #2
Steven Rostedt <rostedt@goodmis.org> wrote:

> > In the future, something better can probably be done by gifting pages from
> > seq->buf into the pipe, but that would require changing seq->buf into a
> > vmap over an array of pages.
> 
> If you can give me a POC of what needs to be done, I could possibly
> implement it.

I wrote my idea up here for Masami[*]:

We could implement seq_splice_read().  What we would need to do is to change
how the seq buffer is allocated: bulk allocate a bunch of arbitrary pages
which we then vmap().  When we need to splice, we read into the buffer, do a
vunmap() and then splice the pages holding the data we used into the pipe.

If we don't manage to splice all the data, we can continue splicing from the
pages we have left next time.  If a read() comes along to view partially
spliced data, we would need to copy from the individual pages.

When we use up all the data, we discard all the pages we might have spliced
from and shuffle down the other pages, call the bulk allocator to replenish
the buffer and then vmap() it again.

Any pages we've spliced from must be discarded and replaced and not rewritten.

If a read() comes without the buffer having been spliced from, it can do as it
does now.

David
---

[*] https://lore.kernel.org/linux-fsdevel/20230522-pfund-ferngeblieben-53fad9c0e527@brauner/T/#mc03959454c76cc3f29024b092c62d88c90f7c071

Linus Torvalds May 22, 2023, 5:42 p.m. UTC | #3
On Mon, May 22, 2023 at 7:50 AM David Howells <dhowells@redhat.com> wrote:
>
> We could implement seq_splice_read().  What we would need to do is to change
> how the seq buffer is allocated: bulk allocate a bunch of arbitrary pages
> which we then vmap().  When we need to splice, we read into the buffer, do a
> vunmap() and then splice the pages holding the data we used into the pipe.

Please don't use vmap as a way to do zero-copy.

The virtual mapping games are more expensive than a small copy from
some random seq file.

Yes, yes, seq_file currently uses "kvmalloc()", which does fall back
to vmalloc too. But the keyword there is "falls back". Most of the
time it's just a regular boring kmalloc, and most of the time a
seq-file is tiny.

                      Linus

Steven Rostedt May 22, 2023, 6:38 p.m. UTC | #4
On Mon, 22 May 2023 10:42:12 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, May 22, 2023 at 7:50 AM David Howells <dhowells@redhat.com> wrote:
> >
> > We could implement seq_splice_read().  What we would need to do is to change
> > how the seq buffer is allocated: bulk allocate a bunch of arbitrary pages
> > which we then vmap().  When we need to splice, we read into the buffer, do a
> > vunmap() and then splice the pages holding the data we used into the pipe.  
> 
> Please don't use vmap as a way to do zero-copy.
> 
> The virtual mapping games are more expensive than a small copy from
> some random seq file.
> 
> Yes, yes, seq_file currently uses "kvmalloc()", which does fall back
> to vmalloc too. But the keyword there is "falls back". Most of the
> time it's just a regular boring kmalloc, and most of the time a
> seq-file is tiny.

I was thinking this change had to do with the splice callback for
trace_pipe_raw (which is a hot path that does zero copy of the ftrace ring
buffer into files). But looking at this further, I see that it's for just
the "trace" file, which is a textual conversion of the tracing data (slow
path, although some user space uses this and parses the text, which IMHO is
wrong).

In other words, I don't really care much about this code being "efficient".

-- Steve

Patch

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ebc59781456a..b664020efcb7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5171,7 +5171,7 @@  static const struct file_operations tracing_fops = {
 	.open		= tracing_open,
 	.read		= seq_read,
 	.read_iter	= seq_read_iter,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= direct_splice_read,
 	.write		= tracing_write_stub,
 	.llseek		= tracing_lseek,
 	.release	= tracing_release,