[v6,0/4] tracing: Clean up persistent ring buffer code

Message ID: 20250402144903.993276623@goodmis.org

Message

Steven Rostedt April 2, 2025, 2:49 p.m. UTC
Now that I have learned that the memory passed back from reserve_mem is
part of the memory allocator, merely "reserved", and already virtually
mapped, the code can simply use phys_to_virt() on the physical address
that is returned to get the virtual mapping for that memory!
  (Thanks Mike!)

That makes things much easier, especially since it means that the memory
returned by reserve_mem is no different from the memory retrieved via
alloc_pages(). This allows that memory to be memory mapped to user space
no differently than the normal buffer is.
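
Concretely, since reserve_mem memory sits in the kernel's direct map, the
lookup boils down to something like the sketch below.
reserve_mem_find_by_name() and phys_to_virt() are the real interfaces;
the helper name is made up for illustration:

  #include <linux/mm.h>   /* reserve_mem_find_by_name() */
  #include <linux/io.h>   /* phys_to_virt() */

  /*
   * Sketch: resolve a reserve_mem region to its kernel virtual address.
   * The memory came from memblock_alloc(), so it is already part of the
   * direct map -- no vmap() needed.
   */
  static void *trace_reserve_mem_virt(const char *name, phys_addr_t *size)
  {
          phys_addr_t start;

          if (!reserve_mem_find_by_name(name, &start, size))
                  return NULL;

          return phys_to_virt(start);
  }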

This new series does the following:

- Enforce that the memory mapping is page aligned (both the address and
  the size). If it is not, error out.

- Use phys_to_virt() to get the virtual address from the addresses
  returned by reserve_mem. As the memory is freed via
  reserve_mem_release_by_name() and is no longer mapped by vmap(), the
  ring buffer freeing code doesn't need to do anything special for this
  mapping.

- Treat the buffer allocated via memmap differently. It still needs to
  be virtually mapped (phys_to_virt() cannot be used) and it must be
  neither freed nor memory mapped to user space. A new flag is set when a
  buffer is created this way to prevent it from ever being memory mapped
  to user space, and its ref count is incremented so that it can never be
  freed.

- Use vmap_page_range() instead of kmalloc_array() to create an array of
  struct pages for vmap() (see the sketch after this list).

- Use flush_kernel_vmap_range() instead of flush_dcache_folio().
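
For the memmap= case, the vmap_page_range() change plus the alignment
check look roughly like this; a sketch only, not the exact code from the
series:

  #include <linux/mm.h>           /* IS_ALIGNED(), PAGE_SIZE */
  #include <linux/vmalloc.h>      /* get_vm_area(), vmap_page_range() */
  #include <linux/pgtable.h>      /* pgprot_nx(), PAGE_KERNEL */

  /*
   * Sketch: map a physical range (e.g. from memmap=) without building
   * a struct page array for vmap().  Names and layout are illustrative.
   */
  static void *map_phys_range(phys_addr_t start, unsigned long size)
  {
          struct vm_struct *area;
          unsigned long vstart;

          /* The series enforces this for the persistent buffer: */
          if (!IS_ALIGNED(start, PAGE_SIZE) || !IS_ALIGNED(size, PAGE_SIZE))
                  return NULL;

          area = get_vm_area(size, VM_IOREMAP);
          if (!area)
                  return NULL;

          vstart = (unsigned long)area->addr;
          if (vmap_page_range(vstart, vstart + size, start,
                              pgprot_nx(PAGE_KERNEL)) < 0) {
                  free_vm_area(area);
                  return NULL;
          }

          return area->addr;
  }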

Changes since v5: https://lore.kernel.org/linux-trace-kernel/20250401225811.008143218@goodmis.org/

- Use %pa instead of %lx for the start and size values (Mike Rapoport)

- Updated change log to use memblock_alloc() instead of memblock_reserve()
  (Mike Rapoport)

Steven Rostedt (4):
      tracing: Enforce the persistent ring buffer to be page aligned
      tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
      tracing: Use vmap_page_range() to map memmap ring buffer
      ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()

----
 Documentation/admin-guide/kernel-parameters.txt |  2 +
 Documentation/trace/debugging.rst               |  2 +
 kernel/trace/ring_buffer.c                      |  5 +-
 kernel/trace/trace.c                            | 66 ++++++++++++++++---------
 kernel/trace/trace.h                            |  1 +
 5 files changed, 50 insertions(+), 26 deletions(-)

Comments

Mathieu Desnoyers April 2, 2025, 2:56 p.m. UTC | #1
On 2025-04-02 10:49, Steven Rostedt wrote:
> 
> Now that I have learned that the memory passed back from reserve_mem is
> part of the memory allocator, merely "reserved", and already virtually
> mapped, the code can simply use phys_to_virt() on the physical address
> that is returned to get the virtual mapping for that memory!
>    (Thanks Mike!)
> 
> That makes things much easier, especially since it means that the memory
> returned by reserve_mem is no different from the memory retrieved via
> alloc_pages(). This allows that memory to be memory mapped to user space
> no differently than the normal buffer is.
> 
> This new series does the following:

So I've been loosely following this patch series, and I'm confused about
one thing.

AFAIU one goal is to have the ftrace persistent ring buffer written to
through a memory range returned by vmap_page_range(), and userspace maps
the buffer with virtual mappings.

With respect to architectures with aliasing dcache, is the plan:

A) To make sure all persistent ring buffer mappings are aligned on
    SHMLBA:

Quoting "Documentation/core-api/cachetlb.rst":

   Is your port susceptible to virtual aliasing in its D-cache?
   Well, if your D-cache is virtually indexed, is larger in size than
   PAGE_SIZE, and does not prevent multiple cache lines for the same
   physical address from existing at once, you have this problem.

   If your D-cache has this problem, first define asm/shmparam.h SHMLBA
   properly, it should essentially be the size of your virtually
   addressed D-cache (or if the size is variable, the largest possible
   size).  This setting will force the SYSv IPC layer to only allow user
   processes to mmap shared memory at address which are a multiple of
   this value.

or

B) to flush both the kernel and userspace mappings when a ring buffer
    page is handed over from writer to reader ?

I've seen both approaches being discussed in the recent threads, with
some consensus aiming towards (A), but then the code that follows takes
approach (B).

AFAIU, if we are aiming for approach (A), then I'm missing where
vmap_page_range() guarantees that the _kernel_ virtual mapping is
SHMLBA aligned. AFAIU, only user mappings are aligned on SHMLBA.

And if we are aiming towards approach (A), then the explicit flushing
is not needed when handing over pages from writer to reader, but
SHMLBA alignment should be introduced.

Please let me know if I'm missing something,

Thanks,

Mathieu

Steven Rostedt April 2, 2025, 3:49 p.m. UTC | #2
On Wed, 2 Apr 2025 10:56:51 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> So I've been loosely following this patch series, and I'm confused about
> one thing.
> 
> AFAIU one goal is to have the ftrace persistent ring buffer written to
> through a memory range returned by vmap_page_range(), and userspace maps
> the buffer with virtual mappings.

Note, this series does not have that goal. The buffer mapped via
vmap_page_range() is never going to be mapped to user space. The only
time the buffer will be mapped via vmap_page_range() (after this patch
series) is when it is allocated via memmap=..., which is a major hack and
cannot be trusted to be used with memory mappings.

The persistent buffer, when mapped with reserve_mem=..., is no different
from the normal buffer, which is allocated via alloc_pages() with
page_address() used to get the virtual address. (Note, this series still
doesn't make it available to be mapped to user space.)
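
Put in code terms, the equivalence is roughly this (phys_start standing
in for the address reserve_mem handed back; illustrative only):

  /* normal ring buffer sub-buffer: */
  struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);
  void *vaddr = page_address(page);

  /* reserve_mem sub-buffer -- same kind of memory, different source: */
  void *vaddr2 = phys_to_virt(phys_start);
  struct page *page2 = virt_to_page(vaddr2);  /* valid: direct-mapped */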

> 
> With respect to architectures with aliasing dcache, is the plan:
> 
> A) To make sure all persistent ring buffer mappings are aligned on
>     SHMLBA:
> 
> Quoting "Documentation/core-api/cachetlb.rst":
> 
>    Is your port susceptible to virtual aliasing in its D-cache?
>    Well, if your D-cache is virtually indexed, is larger in size than
>    PAGE_SIZE, and does not prevent multiple cache lines for the same
>    physical address from existing at once, you have this problem.
> 
>    If your D-cache has this problem, first define asm/shmparam.h SHMLBA
>    properly, it should essentially be the size of your virtually
>    addressed D-cache (or if the size is variable, the largest possible
>    size).  This setting will force the SYSv IPC layer to only allow user
>    processes to mmap shared memory at address which are a multiple of
>    this value.
> 
> or
> 
> B) to flush both the kernel and userspace mappings when a ring buffer
>     page is handed over from writer to reader ?
> 
> I've seen both approaches being discussed in the recent threads, with
> some consensus aiming towards (A), but then the code that follows takes
> approach (B).

The alignment done is not for cache issues but for memory mapping in
general. Having the persistent buffer memory be page aligned, with all of
it fitting nicely in pages, is a requirement so that nothing gets
truncated.

For the cache issue, (B) is used. Note, all mappings to user space are
from the page allocator, as reserve_mem memory is allocated but simply
never freed for general use.
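
For reference, the hand-off flush from the fourth patch looks roughly
like this (field names as in kernel/trace/ring_buffer.c; the exact call
site is simplified here):

  /* flush the kernel alias of the sub-buffer before the reader sees it */
  flush_kernel_vmap_range(cpu_buffer->reader_page->page,
                          buffer->subbuf_size + BUF_PAGE_HDR_SIZE);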

> 
> AFAIU, it we are aiming for approach (A), then I'm missing where
> vmap_page_range() guarantees that the _kernel_ virtual mapping is
> SHMLBA aligned. AFAIU, only user mappings are aligned on SHMLBA.

The code in this patch series makes vmap_page_range() only happen when
memmap= is used to allocate the buffer. In that case it sets the
TRACE_ARRAY_FL_MEMMAP flag, which tells the tracing code: "Do not ever
memory map this buffer to user space".
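
So user-space mapping is refused outright for such buffers; a sketch of
the guard (the flag name is from the series, the surrounding mmap path
is illustrative):

  if (tr->flags & TRACE_ARRAY_FL_MEMMAP)
          return -ENODEV;  /* memmap= buffers never reach user space */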

> 
> And if we aiming towards approach (A), then the explicit flushing
> is not needed when handing over pages from writer to reader, but
> an SHMLBA alignment should be introduced.
> 
> Please let me know if I'm missing something,

I guess what you are missing is that the code is no longer aimed at
having a vmap_page_range() mapping ever be mapped to user space ;-)

-- Steve
Mathieu Desnoyers April 2, 2025, 4 p.m. UTC | #3
On 2025-04-02 11:49, Steven Rostedt wrote:
> On Wed, 2 Apr 2025 10:56:51 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
[...]
>> Please let me know if I'm missing something,
> 
> I guess what you are missing is that the code is no longer aimed at
> having a vmap_page_range() mapping ever be mapped to user space ;-)

Thanks for the clarification!

Mathieu