
[v5,3/4] tracing: Use vmap_page_range() to map memmap ring buffer

Message ID 20250401225842.597899085@goodmis.org (mailing list archive)
State Superseded
Series: tracing: Clean up persistent ring buffer code

Commit Message

Steven Rostedt April 1, 2025, 10:58 p.m. UTC
From: Steven Rostedt <rostedt@goodmis.org>

The code to map the physical memory retrieved by memmap currently
allocates an array of pages to cover the physical memory and then calls
vmap() to map it to a virtual address. Instead of using this temporary
array of struct page descriptors, simply use vmap_page_range() that can
directly map the contiguous physical memory to a virtual address.

Link: https://lore.kernel.org/all/CAHk-=whUOfVucfJRt7E0AH+GV41ELmS4wJqxHDnui6Giddfkzw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

Comments

Linus Torvalds April 2, 2025, 4:42 p.m. UTC | #1
On Tue, 1 Apr 2025 at 15:57, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The code to map the physical memory retrieved by memmap currently
> allocates an array of pages to cover the physical memory and then calls
> vmap() to map it to a virtual address. Instead of using this temporary
> array of struct page descriptors, simply use vmap_page_range() that can
> directly map the contiguous physical memory to a virtual address.

Wait, what?

Didn't you just say that you can just use page_address() for the kernel mapping?

So the whole vmap thing is entirely unnecessary in the first place.

Including the simpler vmap_page_range().

             Linus
Steven Rostedt April 2, 2025, 4:55 p.m. UTC | #2
On Wed, 2 Apr 2025 09:42:00 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Wait, what?
> 
> Didn't you just say that you can just use page_address() for the kernel mapping?

Correct. I misunderstood Mike when he created the reserve_mem and had it
return physical addresses. I didn't realize he had it already mapped via
the normal virtual mappings.

> 
> So the whole vmap thing is entirely unnecessary in the first place.
> 
> Including the simpler vmap_page_range().

Not entirely. The original code only used memmap=, which does require the
vmap_page_range(). The reserve_mem came later, and, thinking it was just
physical address space (not a mapping), I had it use the memmap code.

That is, the persistent memory buffer was originally created on top of the
memmap feature (and only for x86). I talked with Mike about making it more
generic, and he and I worked out the reserve_mem option. Since the memmap
code was already doing the vmap(), I just had the reserve_mem version do the
same, not knowing that it could simply use phys_to_virt().

This patch series fixes that miscommunication and separates out a memmap'ed
buffer from reserve_mem buffer and simplifies everything.

-- Steve
Steven Rostedt April 2, 2025, 5:03 p.m. UTC | #3
On Wed, 2 Apr 2025 12:55:48 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> This patch series fixes that miscommunication and separates out a memmap'ed
> buffer from reserve_mem buffer and simplifies everything.

Since I only care about memory mapping a buffer from reserve_mem to user
space, this series makes sure that a buffer created from memmap (or any
other physical memory area) is never used for mapping to user space. It
also prevents that buffer from being freed, as it is not part of the buddy
allocator and cannot be returned to it later.

That also means all the tricks to determine where the page in the ring
buffer came from can go away.

-- Steve
Steven Rostedt April 2, 2025, 5:14 p.m. UTC | #4
On Wed, 2 Apr 2025 13:03:37 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> That also means all the tricks to determine where the page in the ring
> buffer came from can go away.

I wonder, just to be safe, if I should add in the ring buffer mapping code:

	/* If the buffer is not part of the normal memory, do not allow mapping */
	if (!virt_addr_valid(cpu_buffer->reader_page->page))
		return -ENODEV;

?

The buffer can come from anywhere, but we only allow a memory mapping to
user space if the buffer came from the normal memory allocator.

Currently, this patch series has the caller (the tracing code) prevent
memmap buffers from being user mapped. Perhaps the above should include a
WARN_ON_ONCE()?

-- Steve
Linus Torvalds April 2, 2025, 5:20 p.m. UTC | #5
On Wed, 2 Apr 2025 at 10:13, Steven Rostedt <rostedt@goodmis.org> wrote:
>
>         /* If the buffer is not part of the normal memory, do not allow mapping */
>         if (!virt_addr_valid(cpu_buffer->reader_page->page))
>                 return -ENODEV;

Honestly, virt_addr_valid() is pretty much random.

It's not really a valid function outside of random debugging code. It
has no real semantics. The comment saying that it validates some kind
of 'virt_to_page()' thing is pure garbage.

> As the buffer can come from anywhere, but we only allow a memory mapping to
> user space if the buffer is from the normal memory allocator.

You should damn well keep track of where the memory comes from.

You can't just say "I'll take random shit, and then I'll ask the VM what it is".

Stop it.

If the address came from something you consider trustworthy, then just
trust it. If some admin person gave you garbage, it's better to just
get a random oops than to have random bogus code.

                 Linus
Steven Rostedt April 2, 2025, 5:40 p.m. UTC | #6
On Wed, 2 Apr 2025 10:20:58 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> You should damn well keep track of where the memory comes from.

And it does.

> 
> You can't just say "I'll take random shit, and then I'll ask the VM what it is".
> 
> Stop it.
> 
> If the address came from something you consider trustworthy, then just
> trust it. If some admin person gave you garbage, it's better to just
> get a random oops than to have random bogus code.

This has nothing to do with admins. This would only occur if the kernel
itself created a buffer from some random physical address and then tried to
mmap it to user space (which would be a bug).

My early career came from the military industry (I worked on the C17 engine
control systems in the early '90s). It was drilled into my head to put
safety checks throughout the code so that, if something buggy happened, it
would be caught quickly later on.

The virt_addr_valid() would only be a debugging feature, hence the
added "WARN_ON()" for it.

But I'm fine to not do that.

-- Steve
Linus Torvalds April 2, 2025, 5:46 p.m. UTC | #7
On Wed, 2 Apr 2025 at 10:39, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> This has nothing to do with admins. This would only occur if the kernel
> itself created a buffer from some random physical address and then tried to
> mmap it to user space (which would be a bug).

Do *not* try to check for bugs like that with virt_addr_valid().

It literally makes debugging harder.

You're much better off getting an oops, and then you have stack
traces, distro bug trackers, and various other automated tools that
give you information.

Trying to "validate" buggy data is crazy. It's absolutely the opposite
of safety. It's going to cause more bugs, it's going to only work for
the validation scenarios you thought about, and it's going to make it
harder to debug the cases it actually catches.

And if you are trying to catch kernel bugs, *any* data could be that
buggy data. So the whole concept is insane.

Yes, you could make every single line be a "WARN_ON()" with some
random check for the particular data you are using.

Or you could just write good solid code that is actually readable and
maintainable, and doesn't have random pointless checks in it.

          Linus

Patch

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2f9c91f26d5b..e30d1e058fea 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -50,6 +50,7 @@ 
 #include <linux/irq_work.h>
 #include <linux/workqueue.h>
 #include <linux/sort.h>
+#include <linux/io.h> /* vmap_page_range() */
 
 #include <asm/setup.h> /* COMMAND_LINE_SIZE */
 
@@ -9810,29 +9811,27 @@  static int instance_mkdir(const char *name)
 	return ret;
 }
 
-static u64 map_pages(u64 start, u64 size)
+static u64 map_pages(unsigned long start, unsigned long size)
 {
-	struct page **pages;
-	phys_addr_t page_start;
-	unsigned int page_count;
-	unsigned int i;
-	void *vaddr;
-
-	page_count = DIV_ROUND_UP(size, PAGE_SIZE);
+	unsigned long vmap_start, vmap_end;
+	struct vm_struct *area;
+	int ret;
 
-	page_start = start;
-	pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);
-	if (!pages)
+	area = get_vm_area(size, VM_IOREMAP);
+	if (!area)
 		return 0;
 
-	for (i = 0; i < page_count; i++) {
-		phys_addr_t addr = page_start + i * PAGE_SIZE;
-		pages[i] = pfn_to_page(addr >> PAGE_SHIFT);
+	vmap_start = (unsigned long) area->addr;
+	vmap_end = vmap_start + size;
+
+	ret = vmap_page_range(vmap_start, vmap_end,
+			      start, pgprot_nx(PAGE_KERNEL));
+	if (ret < 0) {
+		free_vm_area(area);
+		return 0;
 	}
-	vaddr = vmap(pages, page_count, VM_MAP, PAGE_KERNEL);
-	kfree(pages);
 
-	return (u64)(unsigned long)vaddr;
+	return (u64)vmap_start;
 }
 
 /**