
[v2,1/2] tracing: ring-buffer: Have the ring buffer code do the vmap of physical memory

Message ID 20250331143532.459810712@goodmis.org (mailing list archive)
State Superseded
Series ring-buffer: Allow persistent memory to be user space mmapped

Commit Message

Steven Rostedt March 31, 2025, 2:34 p.m. UTC
From: Steven Rostedt <rostedt@goodmis.org>

Currently, when reserve_mem is used on the kernel command line to reserve
"persistent" memory to map the ring buffer on, the tracing code does the
vmap() on the physical memory provided by reserve_mem and passes that to
ring_buffer_alloc_range(), which maps the ring buffer on top of the given
memory. It also looks at the current content of that memory and, if it
already contains valid content, uses that content for the ring buffer.

But this method leaves the ring buffer code with no idea where that memory
came from. Here the tracing code used vmap(), but it could just as well
have used vmalloc() or some other method, and many of those methods may
not be supported by the ring buffer code.

Instead, rename ring_buffer_alloc_range() to ring_buffer_alloc_physical(),
where contiguous physical memory is passed to the ring buffer code, which
becomes responsible for mapping it as well as freeing it. This relieves
the callers from having to keep track of whether the memory is mapped or
not.

The ring buffer can also take control of whether the buffer may be memory
mapped to user space or not. Currently it does not allow this physical
memory to be mapped to user space, but now that it has control over the
struct pages of this memory, it can easily do so in the future.

As the user space mapping of a physical-memory-backed buffer will now fail
inside the ring buffer code, the tracing code no longer keeps track of
whether the buffer is the "last boot" buffer or not and simply tries to
map it anyway. It will still fail, but once the ring buffer code is
modified to allow it, it will then succeed.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/ring_buffer.h | 19 ++++----
 kernel/trace/ring_buffer.c  | 86 ++++++++++++++++++++++++++++++++-----
 kernel/trace/trace.c        | 65 +++++++---------------------
 3 files changed, 101 insertions(+), 69 deletions(-)

Comments

Linus Torvalds March 31, 2025, 4:55 p.m. UTC | #1
On Mon, 31 Mar 2025 at 07:34, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Instead, rename ring_buffer_alloc_range() to ring_buffer_alloc_physical()
> where contiguous physical memory is passed to the ring buffer code,

The concept looks like an improvement, but:

> +static unsigned long map_pages(unsigned long *start, unsigned long *end)
> +{
> +       struct page **pages;
> +       phys_addr_t page_start;
> +       unsigned long size;
> +       unsigned long page_count;
> +       unsigned long i;
> +       void *vaddr;
> +
> +       /* Make sure the mappings are page aligned */
> +       *start = ALIGN(*start, PAGE_SIZE);

The above is *completely* unacceptable.

There is no way in hell that ALIGN() can ever be right.

You don't even fix up the low bits of the returned virtual address, so
you literally return the virtual address of something that doesn't
match what was passed in.

So if you pass it a starting area that isn't page-aligned, it now
randomly gives you complete crap back, and includes some random
unrelated part in the mapping.

So no. That needs to be a

        if (*start & PAGE_MASK)
                return NULL;

or whatever. Because just randomly corrupting the base address by
ignoring the low bits is not ok.

> +       /* The size must fit full pages */
> +       page_count = size >> PAGE_SHIFT;

This part works, simply because truncating the size is fine. It won't
all get mapped, but that's the caller's problem, at least the code
isn't returning random crap that has random other data in it.

That said, I don't see the point. If you want to virtually map
physical pages, they need to be full pages, otherwise the end result
gets randomly truncated. So I think that while this is much better
than the "return random crap that doesn't make any sense", it should
be the same rule: just don't allow mapping partial pages.

So make it be

        if (size & PAGE_MASK)
                return NULL;

instead, and just enforce the fact that allocations have to be sanely
aligned for vmap.
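
Spelled out as a sketch against the map_pages() variables quoted above
(PAGE_ALIGNED() is the existing helper that tests the offset bits, i.e.
addr & ~PAGE_MASK):

        /* Refuse unaligned input instead of silently rounding it */
        if (!PAGE_ALIGNED(*start) || !PAGE_ALIGNED(size))
                return NULL;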

Anyway, that takes care of the horrific interface. However, there's
another issue:

> +       pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);

you create this pointless array of pages. Why? It's a physically
contiguous area.

You do that just because you want to use vmap() to map that contiguous
area one page at a time.

But this is NOT a new thing. It's exactly what every single PCI device
with a random physical memory region BAR needs to do. And no, they
don't create arrays of 'struct page *', because they use memory that
doesn't even have page backing.

So we actually have interfaces to do linear virtual mappings of
physical pages that *predate* vmap(), and do the right thing without
any of these games.

Yes, the legacy versions of interfaces are all for IO memory, but we
do have things like vmap_page_range() which should JustWork(tm).

Yeah, you'll need to do something like

        unsigned long vmap_start, vmap_end;

        area = get_vm_area(size, VM_IOREMAP);
        if (!area)
                return NULL;

        vmap_start = (unsigned long) area->addr;
        vmap_end = vmap_start + size;

        ret = vmap_page_range(vmap_start, vmap_end,
                *start, prot_nx(PAGE_KERNEL));

        if (ret < 0) {
                free_vm_area(area);
                return NULL;
        }

and the above is *entirely* untested and maybe there's something wrong
there, but the concept should work, and when you don't do it a page at
a time, you not only don't need the kmalloc_array(), it should even do
things like be able to use large page mappings if the alignment and
size work out.
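
Pulled together, an untested, self-contained sketch of that approach could
look like the following (pgprot_nx() being the in-tree helper for stripping
exec permission, and the page-alignment checks from above folded in):

        static void *map_phys_range(phys_addr_t start, unsigned long size)
        {
                struct vm_struct *area;
                unsigned long vmap_start, vmap_end;
                int ret;

                /* Only map full, page-aligned ranges */
                if (!size || !PAGE_ALIGNED(start) || !PAGE_ALIGNED(size))
                        return NULL;

                area = get_vm_area(size, VM_IOREMAP);
                if (!area)
                        return NULL;

                vmap_start = (unsigned long) area->addr;
                vmap_end = vmap_start + size;

                /* One linear mapping of the whole contiguous range, non-executable */
                ret = vmap_page_range(vmap_start, vmap_end, start,
                                      pgprot_nx(PAGE_KERNEL));
                if (ret < 0) {
                        free_vm_area(area);
                        return NULL;
                }

                return area->addr;
        }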

That said, the old code is *really* broken to begin with. I don't
understand why you want to vmap() a contiguous physical range. Either
it's real pages to begin with, and you can just use "page_address()"
to get a virtual address, it's *not* real pages, and doing
"pfn_to_page()" is actively wrong, because it creates a fake 'struct
page *' pointer that isn't valid.

Is this all just for some disgusting HIGHMEM use (in which case you
need the virtual mapping because of HIGHMEM)? Is there any reason to
support HIGHMEM in this area at all?

So I'm not sure why this code does all this horror in the first place.
Either it's all just confused code that just didn't know what it was
doing and just happened to work (very possible..) or there is
something odd going on.

          Linus
Steven Rostedt March 31, 2025, 5:39 p.m. UTC | #2
On Mon, 31 Mar 2025 09:55:28 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:


> > +       /* Make sure the mappings are page aligned */
> > +       *start = ALIGN(*start, PAGE_SIZE);  
> 
> The above is *completely* unacceptable.
> 
> There is no way in hell that ALIGN() can ever be right.

I just did this to be robust in case what was passed in was not aligned. In
all my use cases, it is.

> 
> You don't even fix up the low bits of the returned virtual address, so
> you literally return the virtual address of something that doesn't
> match what was passed in.
> 
> So if you pass it a starting area that isn't page-aligned, it now
> randomly gives you complete crap back, and includes some random
> unrelated part in the mapping.
> 
> So no. That needs to be a
> 
>         if (*start & PAGE_MASK)
>                 return NULL;

No problem, will update. As I said, I just added that to not map something
that wasn't part of what was passed in. But returning error if it's not
page aligned works for me too.

> 
> or whatever. Because just randomly corrupting the base address by
> ignoring the low bits is not ok.
> 
> > +       /* The size must fit full pages */
> > +       page_count = size >> PAGE_SHIFT;  
> 
> This part works, simply because truncating the size is fine. It won't
> all get mapped, but that's the caller's problem, at least the code
> isn't returning random crap that has random other data in it.
> 
> That said, I don't see the point. If you want to virtually map
> physical pages, they need to be full pages, otherwise the end result
> gets randomly truncated. So I think that while this is much better
> than the "return random crap that doesn't make any sense", it should
> be the same rule: just don't allow mapping partial pages.
> 
> So make it be
> 
>         if (size & PAGE_MASK)
>                 return NULL;
> 
> instead, and just enforce the fact that allocations have to be sanely
> aligned for vmap.

Again, I'm happy to error out on non-alignment. I'll just update the
documentation to say it must be page-size aligned. Currently it shows an
example of using 4096 for alignment, but that should be changed to
explicitly say the region must be page aligned, as some archs (ppc) have
64K pages.

> 
> Anyway, that takes care of the horrific interface. However, there's
> another issue:
> 
> > +       pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);  
> 
> you create this pointless array of pages. Why? It's a physically
> contiguous area.
> 
> You do that just because you want to use vmap() to map that contiguous
> area one page at a time.
> 
> But this is NOT a new thing. It's exactly what every single PCI device
> with a random physical memory region BAR needs to do. And no, they
> don't create arrays of 'struct page *', because they use memory that
> doesn't even have page backing.
> 
> So we actually have interfaces to do linear virtual mappings of
> physical pages that *predate* vmap(), and do the right thing without
> any of these games.

[ Added the pstore folks ]

OK, so I did copy this from fs/pstore/ram_core.c as this does basically the
same thing as pstore. And it looks like pstore should be updated too.

> 
> Yes, the legacy versions of interfaces are all for IO memory, but we
> do have things like vmap_page_range() which should JustWork(tm).
> 
> Yeah, you'll need to do something like
> 
>         unsigned long vmap_start, vmap_end;
> 
>         area = get_vm_area(size, VM_IOREMAP);
>         if (!area)
>                 return NULL;
> 
>         vmap_start = (unsigned long) area->addr;
>         vmap_end = vmap_start + size;
> 
>         ret = vmap_page_range(vmap_start, vmap_end,
>                 *start, prot_nx(PAGE_KERNEL));
> 
>         if (ret < 0) {
>                 free_vm_area(area);
>                 return NULL;
>         }
> 
> and the above is *entirely* untested and maybe there's something wrong
> there, but the concept should work, and when you don't do it a page at
> a time, you not only don't need the kmalloc_array(), it should even do
> things like be able to use large page mappings if the alignment and
> size work out.
> 
> That said, the old code is *really* broken to begin with. I don't
> understand why you want to vmap() a contiguous physical range. Either
> it's real pages to begin with, and you can just use "page_address()"
> to get a virtual address, it's *not* real pages, and doing
> "pfn_to_page()" is actively wrong, because it creates a fake 'struct
> page *' pointer that isn't valid.
> 
> Is this all just for some disgusting HIGHMEM use (in which case you
> need the virtual mapping because of HIGHMEM)? Is there any reason to
> support HIGHMEM in this area at all?
> 
> So I'm not sure why this code does all this horror in the first place.
> Either it's all just confused code that just didn't know what it was
> doing and just happened to work (very possible..) or there is
> something odd going on.

[ Adding Mike Rapoport ]

This is due to what the "reserve_mem" kernel command line feature gives
back. It reserves physical memory that is not part of the kernel memory
allocator (it also works with memmap ie. memmap=12M$0x284500000, which also
requires physical memory addresses that are not part of the memory
allocator).

It's up to the user to map it to virtual memory. The same is true for
ramoops in pstore.
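
For context, the tracing side gets hold of that range roughly like this
(a sketch only; "trace" is an arbitrary label, and reserve_mem_find_by_name()
is the lookup helper that accompanies reserve_mem):

        phys_addr_t start, size;

        /* e.g. booted with: reserve_mem=20M:4M:trace */
        if (!reserve_mem_find_by_name("trace", &start, &size))
                return -ENODEV;

        /*
         * start/size describe raw physical memory that the page allocator
         * knows nothing about; mapping it (kernel or user side) is up to
         * the caller.
         */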

-- Steve
Linus Torvalds March 31, 2025, 7:12 p.m. UTC | #3
On Mon, 31 Mar 2025 at 10:38, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I just did this to be robust in case what was passed in was not aligned. In
> all my use cases, it is.

I really think that ALIGN() is absolutely the opposite of robust.

It silently just makes incorrect values generate incorrect code.

Robust is dealing correctly with incorrect values, not letting incorrect
data just create even more incorrect data.

> OK, so I did copy this from fs/pstore/ram_core.c as this does basically the
> same thing as pstore. And it looks like pstore should be updated too.

Please just stop copying code from random drivers or filesystems.

Christ.

I've said this before: core kernel code has higher quality
requirements than random drivers (or even random filesystems).

You can't copy crazy incorrect snippets from random sources and put it
in core kernel code.

> This is due to what the "reserve_mem" kernel command line feature gives
> back. It reserves physical memory that is not part of the kernel memory
> allocator (it also works with memmap ie. memmap=12M$0x284500000, which also
> requires physical memory addresses that are not part of the memory
> allocator).

Yeah, so in that case turning those into "struct page *" is
horrendously wrong, because you've literally taken it away from the
memory management.

If it's not part of the kernel memory allocator, it does not
necessarily have a "struct page *" associated with it. Using a pointer
to 'struct page' and passing it off is just fundamentally more than a
bug: it's seriously and deeply broken.

It's probably much more obviously broken if the physical addresses
came from a PCI device, and this all just happens to work because it's
actually real RAM and we ended up having a 'struct page []' for it for
some random reason.

But the general rule is that if all you have is physical RAM addresses,
you can use them as phys_addr_t (or turn them into pfn's, which is
just the page index of the physical address).

I think that *completely* bogus code worked only entirely due to luck
- the 'struct page *' was never actually dereferenced, and it got
turned back into a pfn and then a page table entry by the vmap code
using purely address arithmetic (page_to_phys() and page_to_pfn()).

So it probably only ever used pointer calculations, although some of
those actually can end up depending on dereferencing 'struct page' (to
find the base zone of the page, to find the mapping).

And even with just pointer games, that should actually have triggered
a warning in vmap_pages_pte_range() from this:

                if (WARN_ON(!pfn_valid(page_to_pfn(page))))

which I suspect means that we actually *do* end up having 'struct
page' backing for those..

Looking closer...

I think what happened is that reserve_mem() actually does end up
giving you pages that have the 'struct page' backing store even if
they aren't mapped. Which makes the 'struct page' pointer stuff work,
and those pages end up working as 'reserved pages'.

And I suspect that happens exactly because we had users that mis-used
these 'struct page *' things, so it might even be intentional.

Or it might just be because that memory *has* been in the E280 memory
map before it was reserved, and the 'strict page' arrays may have been
sized for the original E280 information, not the "after stealing"
information.

I didn't go look into that early memory reservation code, but I would
not be surprised if there have been fixups to work around bugs like
this. I haven't had to look at it in ages, but maybe the bootmem code
knows that bootmem allocations then might be given back to the VM and
need to have those 'struct page' backing store things

Or - and this is also entirely possible - maybe you were just very
lucky indeed because the code to allocate the 'struct page' regions
ends up intentionally over-allocating and rounding things up.

Because of how the buddy allocator works, each 'struct page' array
needs to be rounded up to the maximum page allocation, so I think we
always allocate the page arrays of a memory zone up to at least
PAGE_SIZE << NR_PAGE_ORDERS boundaries (so 8MB-aligned, I think). So
there's fundamentally some slop in there.

Anyway, if you allocated them as physical addresses outside of the
regular MM code, you should not use 'struct page' in *any* form.  You
really can't just turn any random physical address into a page, unless
it came from the page allocator.

And yes, I'm not at all surprised that we'd have other mis-users of
this. We have some very historical code that depends on reserved pages
going back all the way to linux-0.01 I suspect, because things like
the original VGA code knew that the physical addresses were in the
BIOS hole region that was backed by 'struct page'.

                Linus
Steven Rostedt March 31, 2025, 8:58 p.m. UTC | #4
On Mon, 31 Mar 2025 12:12:08 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:


> > OK, so I did copy this from fs/pstore/ram_core.c as this does basically the
> > same thing as pstore. And it looks like pstore should be updated too.  
> 
> Please just stop copying code from random drivers or filesystems.

Note, I did not blindly copy it. I knew ramoops did exactly what I wanted
to do, so I looked at how it did it. I read the code, it looked reasonable,
and I mapped it.

I needed a way to map physical memory to vmap memory. How am I supposed to
know it was not doing it the proper way?

> 
> Christ.
> 
> I've said this before: core kernel code has higher quality
> requirements than random drivers (or even random filesystems).

I did not believe pstore was a random file system. It does similar code
that tracing does.

> 
> You can't copy crazy incorrect snippets from random sources and put it
> in core kernel code.

Note, even though this code lives in kernel/, it is not much different than
pstore code. It's a way to trace / debug the kernel, very much like pstore
in that it's used to debug. I saw that code was written in 2012 and thought
it was mature. It made sense, and yes I could have looked for a range, but
I trusted the people who wrote that code. This wasn't just a random looking
at something and copying it. I really wanted to understand how it worked.

And I did talk with some of the mm folks, and they seemed OK with this.


> If it's not part of the kernel memory allocator, it does not
> necessarily have a "struct page *" associated with it. Using a pointer
> to 'struct page' and passing it off is just fundamentally more than a
> bug: it's seriously and deeply broken.
> 
> It's probably much more obviously broken if the physical addresses
> came from a PCI device, and this all just happens to work because it's
> actually real RAM and we ended up having a 'struct page []' for it for
> some random reason.

Note, this only works with RAM.


> Or it might just be because that memory *has* been in the E280 memory
> map before it was reserved, and the 'strict page' arrays may have been
> sized for the original E280 information, not the "after stealing"
> information.

I guess you mean E820.

Mike can correct me if I'm wrong, but the memory that was stolen was actual
memory returned by the system (E820 in x86). It reserves the memory before
the memory allocation reserves this memory. So what reserve_mem returns is
valid memory that can be used by memory allocator, but is currently just
"reserved" which means it wants to prevent the allocator from using it.


> And yes, I'm not at all surprised that we'd have other mis-users of
> this. We have some very historical code that depends on reserved pages
> going back all the way to linux-0.01 I suspect, because things like
> the original VGA code knew that the physical addresses were in the
> BIOS hole region that was backed by 'struct page'.

I believe this is what Mike set up. A way to have normal RAM reserved
before it gets added to the memory allocator. This was by design.

In fact, if you do pull my v2[*] pull request of the ring buffer code (that
removes this user space mapping of the persistent ring buffer logic) it
actually adds the ability to free the memory and add it back into the memory
allocator.

 https://lore.kernel.org/linux-trace-kernel/173989134814.230693.18199312930337815629.stgit@devnote2/

-- Steve

[*] https://lore.kernel.org/all/20250328173533.7fa7810d@gandalf.local.home/
Linus Torvalds March 31, 2025, 9:42 p.m. UTC | #5
On Mon, 31 Mar 2025 at 13:57, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Note, this only works with RAM.

That's not the issue, actually.

The issue is whether there's a 'struct page'. And "being RAM" does not
mean that a struct page exists.

For example, using "mem=" on the kernel command line will literally
limit the amount of RAM the kernel will use, and in doing so will
limit the page allocations too.

IOW, all of these kernel command line things are *subtle*.

Don't mis-use them by then making assumptions about how they work
today (or how they will work tomorrow).

> I guess you mean E820.

Yes. Thankfully it's been years since I had to use the BIOS
interfaces, so I no longer remember the exact line noise numbers...

> Mike can correct me if I'm wrong, but the memory that was stolen was actual
> memory returned by the system (E820 in x86). It reserves the memory before
> the memory allocation reserves this memory. So what reserve_mem returns is
> valid memory that can be used by memory allocator, but is currently just
> "reserved" which means it wants to prevent the allocator from using it.

That may indeed be true of reserve_mem.

That doesn't make it any better to then use it for 'struct page' when
the MM layer has explicitly been told to ignore it.

Particularly since my first reaction wasn't from "that's wrong", but
from "that's stupid". Generating that temporary array of pages is just
bogus overhead.

> In fact, if you do pull my v2[*] pull request of the ring buffer code (that
> removes this user space mapping of the persistent ring buffer logic) it
> actually adds the ability to free the memory and add it back into the memory
> allocator.

.. and *after* you've given it back to the memory allocator, and it
gets allocated using the page allocators, at that point go ahead and use
'struct page' as much as you want.

Before that, don't. Even if it might work. Because you didn't allocate
it as a struct page, and for all you know it might be treated as a
different hotplug memory zone or whatever when given back.

               Linus
Steven Rostedt March 31, 2025, 11:42 p.m. UTC | #6
On Mon, 31 Mar 2025 14:42:38 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> .. and *after* you've given it back to the memory allocator, and it
> gets allocated using the page allocators, at that point go ahead and use
> 'struct page' as much as you want.
> 
> Before that, don't. Even if it might work. Because you didn't allocate
> it as a struct page, and for all you know it might be treated as a
> different hotplug memory zone or whatever when given back.

Hmm, so if we need to map this memory to user space memory, then I can't
use the method from this patch series, if I have to avoid struct page.

Should I then be using vm_iomap_memory() passing in the physical address?

As for architectures that do not have user/kernel data cache coherency, how
does one flush the page when there's an update on the kernel side so that
the user side doesn't see stale data?

As the code currently uses flush_dcache_folio(), I'm guessing there's an
easy way to create a folio that points to physical memory that's not part
of the memory allocator?

Matthew?

-- Steve
Jann Horn April 1, 2025, 12:09 a.m. UTC | #7
On Tue, Apr 1, 2025 at 1:41 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 31 Mar 2025 14:42:38 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > .. and *after* you've given it back to the memory allocator, and it
> > gets allocated using the page allocators, at that point go ahead and use
> > 'struct page' as much as you want.
> >
> > Before that, don't. Even if it might work. Because you didn't allocate
> > it as a struct page, and for all you know it might be treated as a
> > different hotplug memory zone or whatever when given back.
>
> Hmm, so if we need to map this memory to user space memory, then I can't
> use the method from this patch series, if I have to avoid struct page.
>
> Should I then be using vm_iomap_memory() passing in the physical address?

For mapping random physical memory ranges into userspace, we have
helpers like remap_pfn_range() (the easy option, for use in an mmap
handler, in case you want to want to map one contiguous physical
region into userspace) and vmf_insert_pfn() (for use in a page fault
handler, in case you want to map random physical pages into userspace
on demand).
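
A minimal sketch of the first option, as an ->mmap handler for one
contiguous region (phys_start/phys_size are placeholders for wherever the
reserved range is kept):

        static int buf_phys_mmap(struct file *file, struct vm_area_struct *vma)
        {
                unsigned long len = vma->vm_end - vma->vm_start;

                if (len > phys_size)
                        return -EINVAL;

                /* No struct page involved, just PFNs */
                return remap_pfn_range(vma, vma->vm_start,
                                       phys_start >> PAGE_SHIFT,
                                       len, vma->vm_page_prot);
        }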

> As for architectures that do not have user/kernel data cache coherency, how
> does one flush the page when there's an update on the kernel side so that
> the user side doesn't see stale data?

flush_kernel_vmap_range() (and invalidate_kernel_vmap_range() for the
other direction) might be what you want... I found those by going
backwards from an arch-specific cache-flushing implementation.

> As the code currently uses flush_dcache_folio(), I'm guessing there's an
> easy way to create a folio that points to physical memory that's not part
> of the memory allocator?

Creating your own folio structs sounds like a bad idea; folio structs
are supposed to be in specific kernel memory regions. For example,
conversions from folio* to physical address can involve pointer
arithmetic on the folio*, or they can involve reading members of the
pointed-to folio.
Linus Torvalds April 1, 2025, 12:11 a.m. UTC | #8
On Mon, 31 Mar 2025 at 16:41, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Hmm, so if we need to map this memory to user space memory, then I can't
> use the method from this patch series, if I have to avoid struct page.
>
> Should I then be using vm_iomap_memory() passing in the physical address?

I actually think that would be the best option in general - it works
*regardless* of the source of the pages (ie it works for pages that
don't have 'struct page' backing, but it works for regular RAM too).

So it avoids any question of how the page was allocated, and it also
avoids the page reference counting overhead.

I thought you did that already for the user mappings - don't use you
remap_pfn_range()?

That's basically the equivalent of vmap_page_range() - you're mapping
a whole range based on physical addresses, not mapping individual
pages.

But I didn't go look, this is from dim and possibly confused memories
from past patches.

> As for architectures that do not have user/kernel data cache coherency, how
> does one flush the page when there's an update on the kernel side so that
> the user side doesn't see stale data?

So if you don't treat this as some kind of 'page' or 'folio' thing,
then the proper function is actually flush_cache_range().

I actually suspect that if you treat things just as an arbitrary range
of memory, it might simplify things in general.

For example, the whole flush_cache_page() thing obviously just flushes
one page. So then you have to artificially iterate over pages rather
than just use the natural range.

HOWEVER.

At this point I have to also admit that you will likely find various
holes in various architecture implementations.

Why?

Because sane architectures don't care (there's basically no testing of
any of this on x86, because x86 is typically always cache coherent
outside of some GPU oddities that are handled by the DRM layer
explicitly, so all of these functions are just no-ops on x86).

And so almost none of this gets any testing in practice. A missed
cache flush doesn't matter on x86 or arm64, and very seldom elsewhere
too.

We do have "flush_cache_range()" calls in the generic MM code, and so
it *should* all work, but honestly, I'd expect there to be bugs in
this area.

Of course, I would expect the same to be true of the page/folio cases,
so I don't think using flush_cache_range() should be any worse, but I
*could* imagine that it's bad in a different way ;)

              Linus
Linus Torvalds April 1, 2025, 12:27 a.m. UTC | #9
On Mon, 31 Mar 2025 at 17:11, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I thought you did that already for the user mappings - don't use you
> remap_pfn_range()?
>
> That's basically the equivalent of vmap_page_range() - you're mapping
> a whole range based on physical addresses, not mapping individual
> pages.

Btw, if you are wondering why this isn't consistent: for user space
mapping 'remap_pfn_range()' takes a starting physical 'pfn' and a
size, while for kernel space 'vmap_page_range()' takes a 'phys_addr_t'
and a size, then I have no answers for you.

Except just "random historical accidents".

There are a lot more "map into user space" cases, and it's the much
more common case, and it's the one that a lot of drivers have used for
a long time.

The 'vmap' code in contrast really started out as just "vmalloc()"
which allocated the pages while mapping and "ioremap()" which mapped
in IO areas into kernel virtual memory.

And then the interface has very occasionally gotten updated over the
years and we *have* had some unification, but it's been driven not by
"let's make this a kernel version of user mappings", but by "random
odd use cases".

You can see this just in the statistics: remap_pfn_range() has hundreds of uses.

In contrast, vmap_page_range() has something like five uses (when you
grep for it, half of the hits are in the mm/vmalloc.c implementation and
the header file declaration).

So I do think vmap_page_range() is the right thing to do here, but it
is *so* rare that maybe it has issues. There simply generally isn't
any reason to map contiguous pages into kernel space.

The *typical* use for kernel virtual mappings is either

 (a) turn physically discontiguous allocations into a contiguous
virtual mapping (ie "vmalloc" like things, sometimes with
pre-allocated arrays of pages)

or

 (b) map device pages into kernel space (ie "ioremap()" kinds of things)

so vmap_page_range() is kind of an odd and unusual duck, which is why
people may not be all that aware of it.

And being an odd and unusual duck might mean it has bugs, but the good
news is that "vmap_page_range()" is actually the underlying
implementation of "ioremap()", so it *does* actually get tested that
way.

           Linus
Steven Rostedt April 1, 2025, 12:30 a.m. UTC | #10
On Mon, 31 Mar 2025 17:11:46 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I thought you did that already for the user mappings - don't use you
> remap_pfn_range()?

No, that's not what is done. The normal buffer is split among several
sub-buffers (usually one page in size each, but can also be a power of two
pages), and those pages are allocated via alloc_page() and are not
contiguous. Which is why the mapping to user space creates an array of
struct page pointers and then calls vm_insert_pages().

For the contiguous physical memory, then yeah, we can simply use
vm_iomap_memory().
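
Condensed, the scheme just described looks roughly like this (illustrative
names, error handling trimmed):

        struct page **pages;
        unsigned long nr = nr_subbuf_pages;     /* number of sub-buffer pages */
        int ret;

        pages = kcalloc(nr, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* ... fill 'pages' with the individually allocated sub-buffer pages ... */

        /* Insert the whole array into the user VMA in one call */
        ret = vm_insert_pages(vma, vma->vm_start, pages, &nr);
        kfree(pages);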


> So if you don't treat this as some kind of 'page' or 'folio' thing,
> then the proper function is actually flush_cache_range().
> 
> I actually suspect that if you treat things just as an arbitrary range
> of memory, it might simplify things in general.

Ah, yeah. That's the function I was looking for.


> Of course, I would expect the same to be true of the page/folio cases,
> so I don't think using flush_cache_range() should be any worse, but I
> *could* imagine that it's bad in a different way ;)

At least we can say we covered those other archs, and if a bug is reported,
then all that would need to be fixed is the flush_cache_range()
implementation ;-)

-- Steve
Linus Torvalds April 1, 2025, 12:38 a.m. UTC | #11
On Mon, 31 Mar 2025 at 17:29, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Of course, I would expect the same to be true of the page/folio cases,
> > so I don't think using flush_cache_range() should be any worse, but I
> > *could* imagine that it's bad in a different way ;)
>
> At least we can say we covered those other archs, and if a bug is reported,
> then all that would need to be fixed is the flush_cache_range()
> implementation ;-)

Well, there's also the whole "is it I$ or D$" question.

I think flush_cache_range() is basically expected to do both, and you
don't care about the I$ side (and I$ coherency issues are a *lot* more
common than D$ coherency issues are - while D$ coherency is happily
the common situation, the I$ not being coherent is actually the
default, and x86 - and s390? - is unusual in this area).

So maybe powerpc people will be unhappy about flush_cache_range()
because it does the "look out for I$" thing too.

There's a flush_dcache_range() thing too, but I don't see a single use
of that function in generic code, so I suspect that one is currently
purely for architecture-specific drivers and doesn't work in general.

So *this* is the kind of "bad in a different way" I could imagine:
things that probably should be cleaned up and be available to generic
code, but very few people have cared, and so they are used in ad-hoc
places and haven't been sufficiently generalized and cleaned up.

           Linus
Linus Torvalds April 1, 2025, 12:49 a.m. UTC | #12
On Mon, 31 Mar 2025 at 17:38, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So *this* is the kind of "bad in a different way" I could imagine:
> things that probably should be cleaned up and be available to generic
> code, but very few people have cared, and so they are used in ad-hoc
> places and haven't been sufficiently generalized and cleaned up.

Of course, the good news is that (a) few people care - which is why we
are in this situation - and (b) the people who *do* care about these
issues tend to be both responsible and responsive.

So not some unfixable morass, more of a "one or two architectures may
care, and we might have to extend interfaces that haven't been used
this way before".

For example, just adding flush_dcache_range() into
include/asm-generic/cacheflush.h should be more-or-less trivial, with
the same kinds of "architectures can make clear they have their own
optimized version".

              Linus
Steven Rostedt April 1, 2025, 1:02 a.m. UTC | #13
On Tue, 1 Apr 2025 02:09:10 +0200
Jann Horn <jannh@google.com> wrote:

> On Tue, Apr 1, 2025 at 1:41 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Mon, 31 Mar 2025 14:42:38 -0700
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >  
> > > .. and *after* you've given it back to the memory allocator, and it
> > > gets allocated using the page allocators, at that point go ahead and use
> > > 'struct page' as much as you want.
> > >
> > > Before that, don't. Even if it might work. Because you didn't allocate
> > > it as a struct page, and for all you know it might be treated as a
> > > different hotplug memory zone or whatever when given back.  
> >
> > Hmm, so if we need to map this memory to user space memory, then I can't
> > use the method from this patch series, if I have to avoid struct page.
> >
> > Should I then be using vm_iomap_memory() passing in the physical address?  
> 
> For mapping random physical memory ranges into userspace, we have
> helpers like remap_pfn_range() (the easy option, for use in an mmap
> handler, in case you want to want to map one contiguous physical
> region into userspace) and vmf_insert_pfn() (for use in a page fault
> handler, in case you want to map random physical pages into userspace
> on demand).

Note, I believe that Linus brought up the issue that because this physical
memory is not currently part of the memory allocator (it's not aware of it
yet), that the getting struct page or a "pfn" for it may not be reliable.

> 
> > As for architectures that do not have user/kernel data cache coherency, how
> > does one flush the page when there's an update on the kernel side so that
> > the user side doesn't see stale data?  
> 
> flush_kernel_vmap_range() (and invalidate_kernel_vmap_range() for the
> other direction) might be what you want... I found those by going
> backwards from an arch-specific cache-flushing implementation.
> 
> > As the code currently uses flush_dcache_folio(), I'm guessing there's an
> > easy way to create a folio that points to physical memory that's not part
> > of the memory allocator?  
> 
> Creating your own folio structs sounds like a bad idea; folio structs
> are supposed to be in specific kernel memory regions. For example,
> conversions from folio* to physical address can involve pointer
> arithmetic on the folio*, or they can involve reading members of the
> pointed-to folio.

Linus already mentioned flush_cache_range() which looks to be the thing to
use.

-- Steve
Jann Horn April 1, 2025, 1:28 a.m. UTC | #14
On Tue, Apr 1, 2025 at 3:01 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 1 Apr 2025 02:09:10 +0200
> Jann Horn <jannh@google.com> wrote:
>
> > On Tue, Apr 1, 2025 at 1:41 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > On Mon, 31 Mar 2025 14:42:38 -0700
> > > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > > > .. and *after* you've given it back to the memory allocator, and it
> > > > gets allocated using the page allocators, at that point go ahead and use
> > > > 'struct page' as much as you want.
> > > >
> > > > Before that, don't. Even if it might work. Because you didn't allocate
> > > > it as a struct page, and for all you know it might be treated as a
> > > > different hotplug memory zone or whatever when given back.
> > >
> > > Hmm, so if we need to map this memory to user space memory, then I can't
> > > use the method from this patch series, if I have to avoid struct page.
> > >
> > > Should I then be using vm_iomap_memory() passing in the physical address?
> >
> > For mapping random physical memory ranges into userspace, we have
> > helpers like remap_pfn_range() (the easy option, for use in an mmap
> > handler, in case you want to want to map one contiguous physical
> > region into userspace) and vmf_insert_pfn() (for use in a page fault
> > handler, in case you want to map random physical pages into userspace
> > on demand).
>
> Note, I believe that Linus brought up the issue that because this physical
> memory is not currently part of the memory allocator (it's not aware of it
> yet), that the getting struct page or a "pfn" for it may not be reliable.

PFN mappings are specifically designed to work with memory that does
not have "struct page":

#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct
page", just pure PFN */

> > > As for architectures that do not have user/kernel data cache coherency, how
> > > does one flush the page when there's an update on the kernel side so that
> > > the user side doesn't see stale data?
> >
> > flush_kernel_vmap_range() (and invalidate_kernel_vmap_range() for the
> > other direction) might be what you want... I found those by going
> > backwards from an arch-specific cache-flushing implementation.
> >
> > > As the code currently uses flush_dcache_folio(), I'm guessing there's an
> > > easy way to create a folio that points to physical memory that's not part
> > > of the memory allocator?
> >
> > Creating your own folio structs sounds like a bad idea; folio structs
> > are supposed to be in specific kernel memory regions. For example,
> > conversions from folio* to physical address can involve pointer
> > arithmetic on the folio*, or they can involve reading members of the
> > pointed-to folio.
>
> Linus already mentioned flush_cache_range() which looks to be the thing to
> use.

It looks like flush_kernel_vmap_range() is used for flushing dcache
for the kernel mapping, while flush_cache_range() is for flushing
dcache/icache for the userspace mapping?

For example, on 32-bit arm, you might go down these paths, ending up
in arch-specific functions that make it clear whether they're for the
user side or the kernel side:

flush_cache_range -> __cpuc_flush_user_range

flush_kernel_vmap_range -> __cpuc_flush_dcache_area ->
cpu_cache.flush_kern_dcache_area

I think you probably need flushes on both sides, since you might have
to first flush out the dirty cacheline you wrote through the kernel
mapping, then discard the stale clean cacheline for the user mapping,
or something like that? (Unless these VIVT cache architectures provide
stronger guarantees on cache state than I thought.) But when you're
adding data to the tracing buffers, I guess maybe you only want to
flush the kernel mapping from the kernel, and leave flushing of the
user mapping to userspace? I think if you're running in some random
kernel context, you probably can't even reliably flush the right
userspace context - see how for example vivt_flush_cache_range() does
nothing if the MM being flushed is not running on the current CPU.
Linus Torvalds April 1, 2025, 1:30 a.m. UTC | #15
On Mon, 31 Mar 2025 at 18:01, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Note, I believe that Linus brought up the issue that because this physical
> memory is not currently part of the memory allocator (it's not aware of it
> yet), that the getting struct page or a "pfn" for it may not be reliable.

'pfn' is always reliable.

The pfn ('page frame number') is literally just the physical address
expressed in 'page units', ie just shifted down by the page shift.

So pfn and phys_addr_t are interchangeable when it comes to mapping
pages. The pfn is in fact often the preferred form, because on 32-bit
architectures a pfn is 32-bit, but a phys_addr_t is often 64-bit and
generates extra code.
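
In code the conversion is nothing more than a shift; 'phys' below stands
in for any physical address:

        unsigned long pfn  = PHYS_PFN(phys);    /* phys >> PAGE_SHIFT */
        phys_addr_t   back = PFN_PHYS(pfn);     /* pfn << PAGE_SHIFT  */

        /* pfn_to_page(pfn) is only meaningful if that pfn has struct page backing */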

I think 'pfn' was introduced as a name ong long ago because it was
what the alpha architecture used in the VM documentation. It probably
predates that too, but it's where I got it from, iirc.

              Linus
Steven Rostedt April 1, 2025, 1:36 a.m. UTC | #16
On Mon, 31 Mar 2025 17:49:07 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> For example, just adding flush_dcache_range() into
> include/asm-generic/cacheflush.h should be more-or-less trivial, with
> the same kinds of "architectures can make clear they have their own
> optimized version".

Just so I'm clear. Are you suggesting to add flush_dcache_range() to
cacheflush.h and using that?

Looks like some of the code would be able to remove the ifdefs:

 #ifdef CONFIG_PPC
	flush_dcache_range(...);
 #endif

if that's the case.

-- Steve
Steven Rostedt April 1, 2025, 1:41 a.m. UTC | #17
On Mon, 31 Mar 2025 18:30:33 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 31 Mar 2025 at 18:01, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Note, I believe that Linus brought up the issue that because this physical
> > memory is not currently part of the memory allocator (it's not aware of it
> > yet), that the getting struct page or a "pfn" for it may not be reliable.  
> 
> 'pfn' is always reliable.
> 
> The pfn ('page frame number') is literally just the physical address
> expressed in 'page units', ie just shifted down by the page shift.

Ah, for some reason I thought the pfn mapped directly to struct pages.

> 
> So pfn and phys_addr_t are interchangeable when it comes to mapping
> pages. The pfn is in fact often the preferred form, because on 32-bit
> architectures a pfn is 32-bit, but a phys_addr_t is often 64-bit and
> generates extra code.
> 
> I think 'pfn' was introduced as a name ong long ago because it was
> what the alpha architecture used in the VM documentation. It probably
> predates that too, but it's where I got it from, iirc.
> 

It is old, as I remember using it when I first started Linux kernel
development back in 1998. But my memory of it was that it was also used as
an index into a struct page array, which is why I was thinking it was
somewhat interchangeable with struct page. That was a long time ago, back
when I was an embedded developer; I've only been using struct page for my
needs in the last couple of decades.

-- Steve
Linus Torvalds April 1, 2025, 1:44 a.m. UTC | #18
On Mon, 31 Mar 2025 at 18:35, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Just so I'm clear. Are you suggesting to add flush_dcache_range() to
> cacheflush.h and using that?

No, let's wait for it to be actually considered a problem.

And I think Jann is right that for kernel ranges you should use
flush_kernel_vmap_range() anyway.

Oh how I hate incoherent data caches.

           Linus
Steven Rostedt April 1, 2025, 1:50 a.m. UTC | #19
On Tue, 1 Apr 2025 03:28:20 +0200
Jann Horn <jannh@google.com> wrote:

> I think you probably need flushes on both sides, since you might have
> to first flush out the dirty cacheline you wrote through the kernel
> mapping, then discard the stale clean cacheline for the user mapping,
> or something like that? (Unless these VIVT cache architectures provide
> stronger guarantees on cache state than I thought.) But when you're
> adding data to the tracing buffers, I guess maybe you only want to
> flush the kernel mapping from the kernel, and leave flushing of the
> user mapping to userspace? I think if you're running in some random
> kernel context, you probably can't even reliably flush the right
> userspace context - see how for example vivt_flush_cache_range() does
> nothing if the MM being flushed is not running on the current CPU.

I'm assuming I need to flush both the kernel (get the updates out to
memory) and user space (so it can read those updates).

The paths are all done via system calls from user space, so it should be on
the same CPU. User space will do an ioctl() on the buffer file descriptor
asking for an update, the kernel will populate the page with that update,
and then user space will read the update after the ioctl() returns. All
very synchronous. Thus, we don't need to worry about updates from one CPU
happening on another CPU.

Even when it wants to read the buffer. The ioctl() will swap out the old
reader page with one of the write pages making it the new "reader" page,
where no more updates will happen on that page. The flush happens after
that and before going back to user space.
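
A sketch of that two-sided flush in the ioctl path (reader_addr/reader_len
and 'vma' are placeholders; whether the user-side flush can be done
reliably from kernel context is exactly the question raised above):

        /* Push the kernel-side writes out of the (possibly aliasing) d-cache */
        flush_kernel_vmap_range(reader_addr, reader_len);

        /* Drop stale lines in the user-space alias before returning to user space */
        flush_cache_range(vma, vma->vm_start, vma->vm_start + reader_len);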

-- Steve
Linus Torvalds April 1, 2025, 1:55 a.m. UTC | #20
On Mon, 31 Mar 2025 at 18:40, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I think 'pfn' was introduced as a name ong long ago because it was
> > what the alpha architecture used in the VM documentation. It probably
> > predates that too, but it's where I got it from, iirc.
> >
>
> It is old, as I remember using it when I first started Linux kernel
> development back in 1998.

So I did the alpha port in '95, but I meant that the 'page frame
number' as a name may well have roots that go much further back. I
only know it from the alpha architecture docs.

Google only seems to find the modern Linux use, but I wouldn't be
surprised if the alpha architects got it from some much older use (I
would suspect VMS).

> But my memory of it was it was also used as an
> index into a struct page array.

So typically, the pfn is what is used to translate different kinds of addresses.

And yes, one such translation is "turn virtual address into physical
address, then turn that into the pfn, and then use that to index the
array of 'struct page'.

Few places use the pfn itself as the primary thing, but one of those
few things is the low-level page table code. Because page table
entries are literally a combination of the pfn and various per-page
information (protection bits, cacheability, stuff like that).

So to the page table code, 'pfn' is kind of fundamental.

Example: the actual low-level "create a page table entry" is
"pfn_pte()" which takes a pfn and the protection bits.

That's the basis on which all the other page table entry functions are
built, even if you often have wrapper helpers to hide the 'pfn' part from
users, and you might have seen mk_pte(), which takes a 'struct page' and
the protection bits. That one just turns the page into a pfn and then uses
pfn_pte().
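
On most architectures that wrapper is (roughly) just:

        #define mk_pte(page, pgprot)    pfn_pte(page_to_pfn(page), (pgprot))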

            Linus
Mathieu Desnoyers April 1, 2025, 2:23 a.m. UTC | #21
On 2025-03-31 21:50, Steven Rostedt wrote:
> On Tue, 1 Apr 2025 03:28:20 +0200
> Jann Horn <jannh@google.com> wrote:
> 
>> I think you probably need flushes on both sides, since you might have
>> to first flush out the dirty cacheline you wrote through the kernel
>> mapping, then discard the stale clean cacheline for the user mapping,
>> or something like that? (Unless these VIVT cache architectures provide
>> stronger guarantees on cache state than I thought.) But when you're
>> adding data to the tracing buffers, I guess maybe you only want to
>> flush the kernel mapping from the kernel, and leave flushing of the
>> user mapping to userspace? I think if you're running in some random
>> kernel context, you probably can't even reliably flush the right
>> userspace context - see how for example vivt_flush_cache_range() does
>> nothing if the MM being flushed is not running on the current CPU.
> 
> I'm assuming I need to flush both the kernel (get the updates out to
> memory) and user space (so it can read those updates).
> 
> The paths are all done via system calls from user space, so it should be on
> the same CPU. User space will do an ioctl() on the buffer file descriptor
> asking for an update, the kernel will populate the page with that update,
> and then user space will read the update after the ioctl() returns. All
> very synchronous. Thus, we don't need to worry about updates from one CPU
> happening on another CPU.
> 
> Even when it wants to read the buffer. The ioctl() will swap out the old
> reader page with one of the write pages making it the new "reader" page,
> where no more updates will happen on that page. The flush happens after
> that and before going back to user space.
FWIW, I have the following in the LTTng kernel tracer to cover this.
LTTng writes to ring buffers through the linear mapping, and reads
the buffers from userspace either through mmap or splice.

When userspace wants to get read access to a sub-buffer (an abstraction
that generalizes your ftrace ring buffer "pages") through mmap,
it does the following through a "get subbuffer" ioctl:

- Use "cpu_dcache_is_aliasing()" to check whether explicit flushing is
   needed between the kernel linear mapping and userspace mappings.

- Use flush_dcache_page() to make sure all mappings for a given page
   are flushed (both the kernel linear mapping and the userspace virtual
   mappings).

I suspect that if you go down the route of the explicit
"flush_cache_range()", then you'll need to issue it on all
mappings that alias your memory.

AFAIU, using flush_dcache_page() saves you the trouble of issuing
flush_cache_range() on all mapping aliases manually.
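
As a sketch of that approach, assuming the sub-buffer is tracked as an
array of pages:

        /* Only pay the flushing cost on architectures with aliasing d-caches */
        if (cpu_dcache_is_aliasing()) {
                unsigned long i;

                for (i = 0; i < nr_pages; i++)
                        flush_dcache_page(pages[i]);    /* hits all aliases of the page */
        }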

Thanks,

Mathieu
Ingo Molnar April 1, 2025, 9:53 a.m. UTC | #22
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 31 Mar 2025 at 18:40, Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > I think 'pfn' was introduced as a name ong long ago because it was
> > > what the alpha architecture used in the VM documentation. It probably
> > > predates that too, but it's where I got it from, iirc.
> > >
> >
> > It is old, as I remember using it when I first started Linux kernel
> > development back in 1998.
> 
> So I did the alpha port in '95, but I meant that the 'page frame
> number' as a name may well have roots that go much further back. I
> only know it from the alpha architecture docs.
> 
> Google only seems to find the modern Linux use, but I wouldn't be
> surprised if the alpha architects got it from some much older use (I
> would suspect VMS).

*Technically*, for those of us with a weakness for Git archeology, the 
'PFN' term was first introduced in Linux with the MIPS port, which went 
upstream via Linux 1.1.45, released on 1994/04/06, and had this in 
include/asm-mips/mipsregs.h:

  +/*
  + * Compute a vpn/pfn entry for EntryHi register
  + */
  +#define VPN(addr,pagesizeshift) ((addr) & ~((1 << (pagesizeshift))-1))
  +#define PFN(addr,pagesizeshift) (((addr) & ((1 << (pagesizeshift))-1)) << 6)

... while your Alpha port went upstream via 1.1.57, released on 
1994/10/24, and the first mention of 'PFN' was in arch/alpha/mm/init.c, 
introduced as part of 1.1.82, released 1995/01/16, almost a year later:

  +unsigned long paging_init(unsigned long start_mem, unsigned long end_mem)
  ..
  +               unsigned long pfn, nr;
  +               if (cluster->usage & 1)
  +                       continue;
  +               pfn = cluster->start_pfn;
  +               nr = cluster->numpages;

So you don't appear to have dibs on the PFN name within Linux, although 
I suspect MIPS and Alpha had it from a similar background. :-)

By the way, some quick review feedback on that patch: there should be 
extra newlines after variable declarations - clearly the author is not 
a good kernel programm... oh, never mind.

Thanks,

	Ingo
Mike Rapoport April 1, 2025, 9:56 a.m. UTC | #23
On Mon, Mar 31, 2025 at 02:42:38PM -0700, Linus Torvalds wrote:
> On Mon, 31 Mar 2025 at 13:57, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Note, this only works with RAM.
> 
> That's not the issue, actually.
> 
> The issue is whether there's a 'struct page'. And "being RAM" does not
> mean that a struct page exists.
> 
> For example, using "mem=" on the kernel command line will literally
> limit the amount of RAM the kernel will use, and in doing so will
> limit the page allocations too.

And using memmap=m$n on x86 creates a hole in System RAM that has neither
struct page nor kernel mappings and is never considered RAM anywhere in mm.
 
> IOW, all of these kernel command line things are *subtle*.
> 
> Don't mis-use them by then making assumptions about how they work
> today (or how they will work tomorrow).

I'd say it's better not to use them at all. They cause weirdness in memory
layout and also they are inconsistent in how architectures implement them.
 
> > Mike can correct me if I'm wrong, but the memory that was stolen was actual
> > memory returned by the system (E820 in x86). It reserves the memory before
> > the memory allocation reserves this memory. So what reserve_mem returns is
> > valid memory that can be used by memory allocator, but is currently just
> > "reserved" which means it wants to prevent the allocator from using it.
> 
> That may indeed be true of reserve_mem.

The reserve_mem memory behaves like any other early allocation: it has
proper struct pages (PG_Reserved) and it is mapped in the direct map, so
phys_to_virt() will work on it.

As for mapping it to userspace, vm_iomap_memory() seems the best API to
use. It has all the alignment checks and will refuse to map ranges that are
not properly aligned and it will use vma information to create the right
mappings.
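
As a sketch, an ->mmap handler built on that (phys_start/phys_size again
standing in for the reserve_mem range):

        static int boot_buf_mmap(struct file *file, struct vm_area_struct *vma)
        {
                /* vm_iomap_memory() does the alignment and size checks itself */
                return vm_iomap_memory(vma, phys_start, phys_size);
        }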
 
>                Linus
Steven Rostedt April 1, 2025, 3:11 p.m. UTC | #24
On Tue, 1 Apr 2025 12:56:31 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> > For example, using "mem=" on the kernel command line will literally
> > limit the amount of RAM the kernel will use, and in doing so will
> > limit the page allocations too.  
> 
> > And using memmap=m$n on x86 creates a hole in System RAM that has neither
> > struct page nor kernel mappings and is never considered RAM anywhere in mm.

Hmm, when that is used, then we had better not "free" the buffer.

>  
> > IOW, all of these kernel command line things are *subtle*.
> > 
> > Don't mis-use them by then making assumptions about how they work
> > today (or how they will work tomorrow).  
> 
> I'd say it's better not to use them at all. They cause weirdness in memory
> layout and also they are inconsistent in how architectures implement them.
>  
> > > Mike can correct me if I'm wrong, but the memory that was stolen was actual
> > > memory returned by the system (E820 in x86). It reserves the memory before
> > > the memory allocation reserves this memory. So what reserve_mem returns is
> > > valid memory that can be used by memory allocator, but is currently just
> > > "reserved" which means it wants to prevent the allocator from using it.  
> > 
> > That may indeed be true of reserve_mem.  
> 
> The reserve_mem behaves like any other early allocation, it has proper
> struct pages (PG_Reserved) and it is mapped in the direct map so
> phys_to_virt() will work on it.
> 
> As for mapping it to userspace, vm_iomap_memory() seems the best API to
> use. It has all the alignment checks and will refuse to map ranges that are
> not properly aligned and it will use vma information to create the right
> mappings.
>  

When using vmap() to get the virtual addresses (via the kmalloc_array() of
struct pages), the vunmap() gives the memory back to the memory allocator:

~# free
               total        used        free      shared  buff/cache   available
Mem:         8185928      296676     7840576         920      148280     7889252
Swap:        7812092           0     7812092
~# rmdir /sys/kernel/tracing/instances/boot_mapped/
~# free
               total        used        free      shared  buff/cache   available
Mem:         8206404      290868     7866772         920      148384     7915536
Swap:        7812092           0     7812092

With no issues.

But if I use vmap_page_range(), how do I give that back to the memory allocator?

Calling vunmap() on that memory gives me:

[ 1779.832484] ------------[ cut here ]------------
[ 1779.834076] Trying to vunmap() nonexistent vm area (000000027c000000)
[ 1779.835941] WARNING: CPU: 6 PID: 956 at mm/vmalloc.c:3413 vunmap+0x5a/0x60
[ 1779.837587] Modules linked in:
[ 1779.838455] CPU: 6 UID: 0 PID: 956 Comm: rmdir Not tainted 6.14.0-rc4-test-00019-ga9c509c0c8e7-dirty #379
[ 1779.840597] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 1779.842576] RIP: 0010:vunmap+0x5a/0x60
[ 1779.843553] Code: 89 c7 48 85 c0 74 12 e8 94 e1 01 00 48 8b 5d f8 c9 c3 cc cc cc cc 90 0f 0b 90 48 c7 c7 78 b6 c4 9d 48 89 de e8 57 b4 cd ff 90 <0f> 0b 90 90 eb dc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 1779.847159] RSP: 0018:ffffbfdb7ec93da0 EFLAGS: 00010282
[ 1779.848256] RAX: 0000000000000000 RBX: 000000027c000000 RCX: 0000000000000000
[ 1779.849623] RDX: ffff9f0efdfab108 RSI: ffff9f0efdf9cbc0 RDI: 0000000000000001
[ 1779.851079] RBP: ffffbfdb7ec93da8 R08: 00000000ffffdfff R09: ffffffff9e7652c8
[ 1779.852447] R10: ffffffff9e6b5320 R11: 0000000000000000 R12: ffff9f0d80226e00
[ 1779.853746] R13: 0000000000000001 R14: ffff9f0d806294c0 R15: ffff9f0d80629190
[ 1779.855121] FS:  00007f414dea6740(0000) GS:ffff9f0efdf80000(0000) knlGS:0000000000000000
[ 1779.856524] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1779.857548] CR2: 00007f28a0a4d350 CR3: 000000010f950002 CR4: 0000000000172ef0
[ 1779.858916] Call Trace:
[ 1779.859467]  <TASK>
[ 1779.859986]  ? show_regs.cold+0x19/0x24
[ 1779.860708]  ? vunmap+0x5a/0x60
[ 1779.861387]  ? __warn.cold+0xc2/0x157
[ 1779.862141]  ? vunmap+0x5a/0x60
[ 1779.862752]  ? report_bug+0x10a/0x150
[ 1779.865787]  ? handle_bug+0x5c/0xa0
[ 1779.866684]  ? exc_invalid_op+0x1c/0x80
[ 1779.867455]  ? asm_exc_invalid_op+0x1f/0x30
[ 1779.868256]  ? vunmap+0x5a/0x60
[ 1779.868908]  ring_buffer_free+0xac/0xc0
[ 1779.869595]  __remove_instance.part.0.constprop.0+0xeb/0x1f0
[ 1779.870590]  instance_rmdir+0xe1/0xf0
[ 1779.871342]  tracefs_syscall_rmdir+0x5c/0xa0
[ 1779.872198]  vfs_rmdir+0xa0/0x220
[ 1779.872806]  do_rmdir+0x146/0x190
[ 1779.873477]  __x64_sys_rmdir+0x43/0x70
[ 1779.874188]  x64_sys_call+0x114f/0x1d70
[ 1779.874944]  do_syscall_64+0xbb/0x1d0
[ 1779.875605]  entry_SYSCALL_64_after_hwframe+0x77/0x7f

What's the proper way to say: "I no longer need this physical memory I
reserved, the kernel can now use it"?

-- Steve
Mike Rapoport April 1, 2025, 3:26 p.m. UTC | #25
On Tue, Apr 01, 2025 at 11:11:59AM -0400, Steven Rostedt wrote:
> On Tue, 1 Apr 2025 12:56:31 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > > For example, using "mem=" on the kernel command line will literally
> > > limit the amount of RAM the kernel will use, and in doing so will
> > > limit the page allocations too.  
> > 
> > And using memmap=m$n on x86 creates a hole in System RAM that does not have
> > neither struct page nor kernel mappings and it is never considered RAM
> > anywhere in mm.
> 
> Hmm, when that is used, then we had better not "free" the buffer.
> 
> >  
> > > IOW, all of these kernel command line things are *subtle*.
> > > 
> > > Don't mis-use them by then making assumptions about how they work
> > > today (or how they will work tomorrow).  
> > 
> > I'd say it's better not to use them at all. They cause weirdness in memory
> > layout and also they are inconsistent in how architectures implement them.
> >  
> > > > Mike can correct me if I'm wrong, but the memory that was stolen was actual
> > > > memory returned by the system (E820 in x86). It reserves the memory before
> > > > the memory allocation reserves this memory. So what reserve_mem returns is
> > > > valid memory that can be used by memory allocator, but is currently just
> > > > "reserved" which means it wants to prevent the allocator from using it.  
> > > 
> > > That may indeed be true of reserve_mem.  
> > 
> > The reserve_mem behaves like any other early allocation, it has proper
> > struct pages (PG_Reserved) and it is mapped in the direct map so
> > phys_to_virt() will work on it.
> > 
> > As for mapping it to userspace, vm_iomap_memory() seems the best API to
> > use. It has all the alignment checks and will refuse to map ranges that are
> > not properly aligned and it will use vma information to create the right
> > mappings.
> >  
> 
> When using vmap() to get the virtual addresses (via the kmalloc_array() of
> struct pages), the vunmap() gives the memory back to the memory allocator:
> 
> ~# free
>                total        used        free      shared  buff/cache   available
> Mem:         8185928      296676     7840576         920      148280     7889252
> Swap:        7812092           0     7812092
> ~# rmdir /sys/kernel/tracing/instances/boot_mapped/
> ~# free
>                total        used        free      shared  buff/cache   available
> Mem:         8206404      290868     7866772         920      148384     7915536
> Swap:        7812092           0     7812092
> 
> With no issues.
> 
> But if I use vmap_page_range(), how do I give that back to the memory allocator?

But you don't need either vmap() or vmap_page_range() to have kernel page
tables for memory that you get from reserve_mem. It's already mapped and
plain phys_to_virt() gives you the virtual address you can use.
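
A minimal sketch, assuming 'start' holds the physical address handed back
by reserve_mem:

	void *vaddr = phys_to_virt(start);		/* direct map, no vmap needed */
	struct page *page = virt_to_page(vaddr);	/* struct pages exist as well */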
 
> Calling vunmap() on that memory gives me:
> 
>  1779.832484] ------------[ cut here ]------------
> [ 1779.834076] Trying to vunmap() nonexistent vm area (000000027c000000)
> [ 1779.835941] WARNING: CPU: 6 PID: 956 at mm/vmalloc.c:3413 vunmap+0x5a/0x60
> 
> What's the proper way to say: "I no longer need this physical memory I
> reserved, the kernel can now use it"?
 
free_reserved_area()
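
A minimal sketch of the call, assuming 'start' and 'size' are the physical
base and length of the reserve_mem range (the names are made up here):

	free_reserved_area(phys_to_virt(start), phys_to_virt(start) + size,
			   -1, "ring buffer");	/* -1: don't poison the freed pages */

The last argument is only a label for the "Freeing ... memory" log line.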

> -- Steve
Steven Rostedt April 1, 2025, 3:54 p.m. UTC | #26
On Tue, 1 Apr 2025 18:26:43 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> > But if I use vmap_page_range(), how do I give that back to the memory allocator?  
> 
> But you don't need neither vmap() nor vmap_page_range() to have kernel page
> tables for memory that you get from reserve_mem. It's already mapped and
> plain phys_to_virt() gives you the virtual address you can use.

Oh! That makes things so much easier! Especially since that means it should
work like the normal buffer where virt_to_page() should also work. Right?

Now I do support mapping the persistent ring buffer via memmap, but I can
just give up on allowing that to be memory mapped to user space, or even
freed.

>  
> > Calling vunmap() on that memory gives me:
> > 
> >  1779.832484] ------------[ cut here ]------------
> > [ 1779.834076] Trying to vunmap() nonexistent vm area (000000027c000000)
> > [ 1779.835941] WARNING: CPU: 6 PID: 956 at mm/vmalloc.c:3413 vunmap+0x5a/0x60
> > 
> > What's the proper way to say: "I no longer need this physical memory I
> > reserved, the kernel can now use it"?  
>  
> free_reserved_area()

Awesome!

Thanks,

-- Steve
Mike Rapoport April 1, 2025, 5:58 p.m. UTC | #27
On Tue, Apr 01, 2025 at 11:54:23AM -0400, Steven Rostedt wrote:
> On Tue, 1 Apr 2025 18:26:43 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > > But if I use vmap_page_range(), how do I give that back to the memory allocator?  
> > 
> > But you don't need neither vmap() nor vmap_page_range() to have kernel page
> > tables for memory that you get from reserve_mem. It's already mapped and
> > plain phys_to_virt() gives you the virtual address you can use.
> 
> Oh! That makes things so much easier! Especially since that means it should
> work like the normal buffer where virt_to_page() should also work. Right?

Right.
 
> Now I do support mapping the persistent ring buffer via memmap, but I can
> just give up on allowing that to be memory mapped to user space, or even
> freed.

It can't be freed without jumping through some hoops, and mapping it to
user space is a hassle as well.
Herbert Xu April 3, 2025, 5:59 a.m. UTC | #28
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> There's a flush_dcache_range() thing too, but I don't see a single use
> of that function in generic code, so I suspect that one is currently
> purely for architecture-specific drivers and doesn't work in egneral.
> 
> So *this* is the kind of "bad in a different way" I could imagine:
> things that probably should be cleaned up and be available to generic
> code, but very few people have cared, and so they are used in ad-hoc
> places and haven't been sufficiently generalized and cleaned up.

There is no fallback implementation of flush_dcache_range for the
architectures that don't need it.

It would be nice to have that, as doing it by hand is pretty silly:

static inline void scatterwalk_done_dst(struct scatter_walk *walk,
					unsigned int nbytes)
{
	scatterwalk_unmap(walk);
	/*
	 * Explicitly check ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE instead of just
	 * relying on flush_dcache_page() being a no-op when not implemented,
	 * since otherwise the BUG_ON in sg_page() does not get optimized out.
	 * This also avoids having to consider whether the loop would get
	 * reliably optimized out or not.
	 */
	if (ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE) {
		struct page *base_page;
		unsigned int offset;
		int start, end, i;

		base_page = sg_page(walk->sg);
		offset = walk->offset;
		start = offset >> PAGE_SHIFT;
		end = start + (nbytes >> PAGE_SHIFT);
		end += (offset_in_page(offset) + offset_in_page(nbytes) +
			PAGE_SIZE - 1) >> PAGE_SHIFT;
		for (i = start; i < end; i++)
			flush_dcache_page(nth_page(base_page, i));
	}

Cheers,
Kees Cook April 3, 2025, 4:45 p.m. UTC | #29
On Mon, Mar 31, 2025 at 01:39:06PM -0400, Steven Rostedt wrote:
> On Mon, 31 Mar 2025 09:55:28 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Anyway, that takes care of the horrific interface. However, there's
> > another issue:
> > 
> > > +       pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);  
> > 
> > you create this pointless array of pages. Why? It's a physically
> > contiguous area.
> > 
> > You do that just because you want to use vmap() to map that contiguous
> > area one page at a time.
> > 
> > But this is NOT a new thing. It's exactly what every single PCI device
> > with a random physical memory region BAR needs to do. And no, they
> > don't create arrays of 'struct page *', because they use memory that
> > doesn't even have page backing.
> > 
> > So we actually have interfaces to do linear virtual mappings of
> > physical pages that *predate* vmap(), and do the right thing without
> > any of these games.
> 
> [ Added the pstore folks ]
> 
> OK, so I did copy this from fs/pstore/ram_core.c as this does basically the
> same thing as pstore. And it looks like pstore should be updated too.

I think we're talking about persistent_ram_vmap()? That code predates my
maintainership, but I'm happy to update it to use better APIs.

> > Yes, the legacy versions of interfaces are all for IO memory, but we
> > do have things like vmap_page_range() which should JustWork(tm).
> > 
> > Yeah, you'll need to do something like
> > 
> >         unsigned long vmap_start, vmap_end;
> > 
> >         area = get_vm_area(size, VM_IOREMAP);
> >         if (!area)
> >                 return NULL;
> > 
> >         vmap_start = (unsigned long) area->addr;
> >         vmap_end = vmap_start + size;
> > 
> >         ret = vmap_page_range(vmap_start, vmap_end,
> >                 *start, prot_nx(PAGE_KERNEL));
> > 
> >         if (ret < 0) {
> >                 free_vm_area(area);
> >                 return NULL;
> >         }
> > 
> > and the above is *entirely* untested and maybe there's something wrong
> > there, but the concept should work, and when you don't do it a page at
> > a time, you not only don't need the kmalloc_array(), it should even do
> > things like be able to use large page mappings if the alignment and
> > size work out.
> > 
> > That said, the old code is *really* broken to begin with. I don't
> > understand why you want to vmap() a contiguous physical range. Either
> > it's real pages to begin with, and you can just use "page_address()"
> > to get a virtual address, it's *not* real pages, and doing
> > "pfn_to_page()" is actively wrong, because it creates a fake 'struct
> > page *' pointer that isn't valid.
> > 
> > Is this all just for some disgusting HIGHMEM use (in which case you
> > need the virtual mapping because of HIGHMEM)? Is there any reason to
> > support HIGHMEM in this area at all?
> > 
> > So I'm not sure why this code does all this horror in the first place.
> > Either it's all just confused code that just didn't know what it was
> > doing and just happened to work (very possible..) or there is
> > something odd going on.

pstore tries to work with either real RAM or iomem things. What
is there now Currently Works Fine, but should this be using
vmap_page_range()?
Linus Torvalds April 3, 2025, 4:47 p.m. UTC | #30
On Wed, 2 Apr 2025 at 23:00, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> There is no fallback implementation of flush_dcache_range for the
> architectures that don't need it.

Yeah, that literally looks like an oversight.

Probably because nobody needed it - or people didn't realize it was an
option and did the "loop over each page by hand" as in your example.

So yeah, I think it would be good to just expose flush_dcache_range()
as a dummy empty function in <asm-generic/cacheflush.h>, with the
usual "if the architecture defines this, that will override it" rule.

However, part of the problem is likely just unclear semantics. Because
the existing flush_cache_range() is about pure virtual caches and how
they need to be flushed before the TLB is flushed.

Also about D$ aliasing, but there aren't necessarily any TLB issues.

Honestly, I'd personally be perfectly fine saying "nobody sane cares,
and if your D$ isn't coherent, your perf mmap won't work".

                Linus
Linus Torvalds April 3, 2025, 4:51 p.m. UTC | #31
On Thu, 3 Apr 2025 at 09:45, Kees Cook <kees@kernel.org> wrote:
>
> pstore tries to work with either real RAM or with iomem things. What
> is there now Currently Works Fine, but should this be using
> vmap_page_range()?

Yes, I don't see the point of using vmap() on something that is
contiguous but is made to look like a collection of random pfns.

IOW, that whole 'pages[]' array (and the kmalloc/kfree) seems pointless.

I *suspect* the history is simply that 'vmap()' predates 'vmap_page_range()'.

But maybe I'm missing something, and there is some limitation to
vmap_page_range() that I'm not seeing.

             Linus
Steven Rostedt April 3, 2025, 5:15 p.m. UTC | #32
On Thu, 3 Apr 2025 09:51:51 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I *suspect* the history is simply that 'vmap()' predates 'vmap_page_range()'.
> 

Doing some git archeology, it appears vmap_page_range() was added last
year by Alexei, to replace some ioremap_page_range() use cases.

In 2021, Christoph moved ioremap_page_range() to vmalloc.c replacing
vmap_range().

Also in 2021, Nick changed the ioremap_*_range() helpers to vmap_*_range(),
basically bringing this full circle.

The ioremap_page_range() was introduced in 2006.

vmap() looks to have been there before you started git.

The vmap() usage in pstore was introduced in 2012.

-- Steve
diff mbox series

Patch

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 56e27263acf8..a31dff0a9f09 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -89,11 +89,11 @@  void ring_buffer_discard_commit(struct trace_buffer *buffer,
 struct trace_buffer *
 __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key);
 
-struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flags,
-					       int order, unsigned long start,
-					       unsigned long range_size,
-					       unsigned long scratch_size,
-					       struct lock_class_key *key);
+struct trace_buffer *__ring_buffer_alloc_physical(unsigned long size, unsigned flags,
+						  int order, unsigned long start,
+						  unsigned long range_size,
+						  unsigned long scratch_size,
+						  struct lock_class_key *key);
 
 void *ring_buffer_meta_scratch(struct trace_buffer *buffer, unsigned int *size);
 
@@ -113,11 +113,11 @@  void *ring_buffer_meta_scratch(struct trace_buffer *buffer, unsigned int *size);
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
  * ring buffer's lock class separate.
  */
-#define ring_buffer_alloc_range(size, flags, order, start, range_size, s_size)	\
+#define ring_buffer_alloc_physical(size, flags, order, start, range_size, s_size) \
 ({									\
 	static struct lock_class_key __key;				\
-	__ring_buffer_alloc_range((size), (flags), (order), (start),	\
-				  (range_size), (s_size), &__key);	\
+	__ring_buffer_alloc_physical((size), (flags), (order), (start),	\
+				     (range_size), (s_size), &__key);	\
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
@@ -235,7 +235,8 @@  int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order);
 int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
 
 enum ring_buffer_flags {
-	RB_FL_OVERWRITE		= 1 << 0,
+	RB_FL_OVERWRITE		= BIT(0),
+	RB_FL_PHYSICAL		= BIT(1),
 };
 
 #ifdef CONFIG_RING_BUFFER
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f25966b3a1fc..36665754340f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -17,6 +17,7 @@ 
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
 #include <linux/kthread.h>	/* for self test */
+#include <linux/vmalloc.h>	/* for vmap */
 #include <linux/module.h>
 #include <linux/percpu.h>
 #include <linux/mutex.h>
@@ -556,6 +557,8 @@  struct trace_buffer {
 
 	struct ring_buffer_meta		*meta;
 
+	unsigned long			phys_start;
+
 	unsigned int			subbuf_size;
 	unsigned int			subbuf_order;
 	unsigned int			max_data_size;
@@ -2351,6 +2354,43 @@  static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
 	kfree(cpu_buffer);
 }
 
+static unsigned long map_pages(unsigned long *start, unsigned long *end)
+{
+	struct page **pages;
+	phys_addr_t page_start;
+	unsigned long size;
+	unsigned long page_count;
+	unsigned long i;
+	void *vaddr;
+
+	/* Make sure the mappings are page aligned */
+	*start = ALIGN(*start, PAGE_SIZE);
+
+	size = *end - *start;
+
+	/* The size must fit full pages */
+	page_count = size >> PAGE_SHIFT;
+
+	if (!page_count)
+		return 0;
+
+	page_start = *start;
+	pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		return 0;
+
+	for (i = 0; i < page_count; i++) {
+		phys_addr_t addr = page_start + i * PAGE_SIZE;
+		pages[i] = pfn_to_page(addr >> PAGE_SHIFT);
+	}
+	vaddr = vmap(pages, page_count, VM_MAP, PAGE_KERNEL);
+	kfree(pages);
+
+	*end = *start + page_count * PAGE_SIZE;
+
+	return (unsigned long)vaddr;
+}
+
 static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 					 int order, unsigned long start,
 					 unsigned long end,
@@ -2395,14 +2435,26 @@  static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 	if (!buffer->buffers)
 		goto fail_free_cpumask;
 
-	/* If start/end are specified, then that overrides size */
+	/* If start/end are specified, then this is a physical mapping */
 	if (start && end) {
 		unsigned long buffers_start;
+		unsigned long addr;
 		unsigned long ptr;
+		u64 size;
 		int n;
 
-		/* Make sure that start is word aligned */
-		start = ALIGN(start, sizeof(long));
+		addr = map_pages(&start, &end);
+		if (!addr)
+			goto fail_free_cpumask;
+
+		/* end and start have been updated for alignment */
+		size = end - start;
+
+		buffer->phys_start = start;
+		buffer->flags |= RB_FL_PHYSICAL;
+
+		start = addr;
+		end = start + size;
 
 		/* scratch_size needs to be aligned too */
 		scratch_size = ALIGN(scratch_size, sizeof(long));
@@ -2479,6 +2531,9 @@  static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 	}
 	kfree(buffer->buffers);
 
+	if (buffer->phys_start)
+		vunmap((void *)buffer->phys_start);
+
  fail_free_cpumask:
 	free_cpumask_var(buffer->cpumask);
 
@@ -2508,11 +2563,11 @@  struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
 EXPORT_SYMBOL_GPL(__ring_buffer_alloc);
 
 /**
- * __ring_buffer_alloc_range - allocate a new ring_buffer from existing memory
+ * __ring_buffer_alloc_physical - allocate a new ring_buffer from physical memory
  * @size: the size in bytes per cpu that is needed.
  * @flags: attributes to set for the ring buffer.
  * @order: sub-buffer order
- * @start: start of allocated range
+ * @start: start of the physical memory range
  * @range_size: size of allocated range
  * @scratch_size: size of scratch area (for preallocated memory buffers)
  * @key: ring buffer reader_lock_key.
@@ -2522,11 +2577,11 @@  EXPORT_SYMBOL_GPL(__ring_buffer_alloc);
  * when the buffer wraps. If this flag is not set, the buffer will
  * drop data when the tail hits the head.
  */
-struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flags,
-					       int order, unsigned long start,
-					       unsigned long range_size,
-					       unsigned long scratch_size,
-					       struct lock_class_key *key)
+struct trace_buffer *__ring_buffer_alloc_physical(unsigned long size, unsigned flags,
+						  int order, unsigned long start,
+						  unsigned long range_size,
+						  unsigned long scratch_size,
+						  struct lock_class_key *key)
 {
 	return alloc_buffer(size, flags, order, start, start + range_size,
 			    scratch_size, key);
@@ -2569,6 +2624,9 @@  ring_buffer_free(struct trace_buffer *buffer)
 	kfree(buffer->buffers);
 	free_cpumask_var(buffer->cpumask);
 
+	if (buffer->flags & RB_FL_PHYSICAL)
+		vunmap((void *)buffer->phys_start);
+
 	kfree(buffer);
 }
 EXPORT_SYMBOL_GPL(ring_buffer_free);
@@ -7138,6 +7196,14 @@  int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 	unsigned long flags, *subbuf_ids;
 	int err = 0;
 
+	/*
+	 * Currently, this does not support vmap()'d buffers.
+	 * Return -ENODEV as that is what is returned when a file
+	 * does not support memory mapping.
+	 */
+	if (buffer->flags & RB_FL_PHYSICAL)
+		return -ENODEV;
+
 	if (!cpumask_test_cpu(cpu, buffer->cpumask))
 		return -EINVAL;
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 14c38fcd6f9e..20724d64e02e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8492,10 +8492,6 @@  static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
 	struct trace_iterator *iter = &info->iter;
 	int ret = 0;
 
-	/* Currently the boot mapped buffer is not supported for mmap */
-	if (iter->tr->flags & TRACE_ARRAY_FL_BOOT)
-		return -ENODEV;
-
 	ret = get_snapshot_map(iter->tr);
 	if (ret)
 		return ret;
@@ -8503,8 +8499,8 @@  static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
 	ret = ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file, vma);
 	if (ret)
 		put_snapshot_map(iter->tr);
-
-	vma->vm_ops = &tracing_buffers_vmops;
+	else
+		vma->vm_ops = &tracing_buffers_vmops;
 
 	return ret;
 }
@@ -9520,10 +9516,17 @@  allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size
 
 	if (tr->range_addr_start && tr->range_addr_size) {
 		/* Add scratch buffer to handle 128 modules */
-		buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
-						      tr->range_addr_start,
-						      tr->range_addr_size,
-						      struct_size(tscratch, entries, 128));
+		buf->buffer = ring_buffer_alloc_physical(size, rb_flags, 0,
+							 tr->range_addr_start,
+							 tr->range_addr_size,
+							 struct_size(tscratch, entries, 128));
+		if (!buf->buffer) {
+			pr_warn("Tracing: Failed to map boot instance %s\n", tr->name);
+			return -1;
+		}
+
+		pr_info("Tracing: mapped boot instance %s at physical memory %lx of size 0x%lx\n",
+			tr->name, tr->range_addr_start, tr->range_addr_size);
 
 		tscratch = ring_buffer_meta_scratch(buf->buffer, &scratch_size);
 		setup_trace_scratch(tr, tscratch, scratch_size);
@@ -9600,9 +9603,6 @@  static void free_trace_buffers(struct trace_array *tr)
 #ifdef CONFIG_TRACER_MAX_TRACE
 	free_trace_buffer(&tr->max_buffer);
 #endif
-
-	if (tr->range_addr_start)
-		vunmap((void *)tr->range_addr_start);
 }
 
 static void init_trace_flags_index(struct trace_array *tr)
@@ -9795,31 +9795,6 @@  static int instance_mkdir(const char *name)
 	return ret;
 }
 
-static u64 map_pages(u64 start, u64 size)
-{
-	struct page **pages;
-	phys_addr_t page_start;
-	unsigned int page_count;
-	unsigned int i;
-	void *vaddr;
-
-	page_count = DIV_ROUND_UP(size, PAGE_SIZE);
-
-	page_start = start;
-	pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);
-	if (!pages)
-		return 0;
-
-	for (i = 0; i < page_count; i++) {
-		phys_addr_t addr = page_start + i * PAGE_SIZE;
-		pages[i] = pfn_to_page(addr >> PAGE_SHIFT);
-	}
-	vaddr = vmap(pages, page_count, VM_MAP, PAGE_KERNEL);
-	kfree(pages);
-
-	return (u64)(unsigned long)vaddr;
-}
-
 /**
  * trace_array_get_by_name - Create/Lookup a trace array, given its name.
  * @name: The name of the trace array to be looked up/created.
@@ -10708,7 +10683,6 @@  __init static void enable_instances(void)
 	while ((curr_str = strsep(&str, "\t"))) {
 		phys_addr_t start = 0;
 		phys_addr_t size = 0;
-		unsigned long addr = 0;
 		bool traceprintk = false;
 		bool traceoff = false;
 		char *flag_delim;
@@ -10773,22 +10747,13 @@  __init static void enable_instances(void)
 			rname = kstrdup(tok, GFP_KERNEL);
 		}
 
-		if (start) {
-			addr = map_pages(start, size);
-			if (addr) {
-				pr_info("Tracing: mapped boot instance %s at physical memory %pa of size 0x%lx\n",
-					name, &start, (unsigned long)size);
-			} else {
-				pr_warn("Tracing: Failed to map boot instance %s\n", name);
-				continue;
-			}
-		} else {
+		if (!start) {
 			/* Only non mapped buffers have snapshot buffers */
 			if (IS_ENABLED(CONFIG_TRACER_MAX_TRACE))
 				do_allocate_snapshot(name);
 		}
 
-		tr = trace_array_create_systems(name, NULL, addr, size);
+		tr = trace_array_create_systems(name, NULL, (unsigned long)start, size);
 		if (IS_ERR(tr)) {
 			pr_warn("Tracing: Failed to create instance buffer %s\n", curr_str);
 			continue;