Message ID | 20250331143532.459810712@goodmis.org (mailing list archive)
--- | ---
State | Superseded
Series | ring-buffer: Allow persistent memory to be user space mmapped
On Mon, 31 Mar 2025 at 07:34, Steven Rostedt <rostedt@goodmis.org> wrote: > > Instead, rename ring_buffer_alloc_range() to ring_buffer_alloc_physical() > where contiguous physical memory is passed to the ring buffer code, The concept looks like an improvement, but: > +static unsigned long map_pages(unsigned long *start, unsigned long *end) > +{ > + struct page **pages; > + phys_addr_t page_start; > + unsigned long size; > + unsigned long page_count; > + unsigned long i; > + void *vaddr; > + > + /* Make sure the mappings are page aligned */ > + *start = ALIGN(*start, PAGE_SIZE); The above is *completely* unacceptable. There is no way in hell that ALIGN() can ever be right. You don't even fix up the low bits of the returned virtual address, so you literally return the virtual address of something that doesn't match what was passed in. So if you pass it a starting area that isn't page-aligned, it now randomly gives you complete crap back, and includes some random unrelated part in the mapping. So no. That needs to be a if (*start & PAGE_MASK) return NULL; or whatever. Because just randomly corrupting the base address by ignoring the low bits is not ok. > + /* The size must fit full pages */ > + page_count = size >> PAGE_SHIFT; This part works, simply because truncating the size is fine. It won't all get mapped, but that's the caller's problem, at least the code isn't returning random crap that has random other data in it. That said, I don't see the point. If you want to virtually map physical pages, they need to be full pages, otherwise the end result gets randomly truncated. So I think that while this is much better than the "return random crap that doesn't make any sense", it should be the same rule: just don't allow mapping partial pages. So make it be if (size & PAGE_MASK) return NULL; instead, and just enforce the fact that allocations have to be sanely aligned for vmap. Anyway, that takes care of the horrific interface. However, there's another issue: > + pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL); you create this pointless array of pages. Why? It's a physically contiguous area. You do that just because you want to use vmap() to map that contiguous area one page at a time. But this is NOT a new thing. It's exactly what every single PCI device with a random physical memory region BAR needs to do. And no, they don't create arrays of 'struct page *', because they use memory that doesn't even have page backing. So we actually have interfaces to do linear virtual mappings of physical pages that *predate* vmap(), and do the right thing without any of these games. Yes, the legacy versions of interfaces are all for IO memory, but we do have things like vmap_page_range() which should JustWork(tm). Yeah, you'll need to do something like unsigned long vmap_start, vmap_end; area = get_vm_area(size, VM_IOREMAP); if (!area) return NULL; vmap_start = (unsigned long) area->addr; vmap_end = vmap_start + size; ret = vmap_page_range(vmap_start, vmap_end, *start, prot_nx(PAGE_KERNEL)); if (ret < 0) { free_vm_area(area); return NULL; } and the above is *entirely* untested and maybe there's something wrong there, but the concept should work, and when you don't do it a page at a time, you not only don't need the kmalloc_array(), it should even do things like be able to use large page mappings if the alignment and size work out. That said, the old code is *really* broken to begin with. I don't understand why you want to vmap() a contiguous physical range. 
Either it's real pages to begin with, and you can just use "page_address()" to get a virtual address, it's *not* real pages, and doing "pfn_to_page()" is actively wrong, because it creates a fake 'struct page *' pointer that isn't valid. Is this all just for some disgusting HIGHMEM use (in which case you need the virtual mapping because of HIGHMEM)? Is there any reason to support HIGHMEM in this area at all? So I'm not sure why this code does all this horror in the first place. Either it's all just confused code that just didn't know what it was doing and just happened to work (very possible..) or there is something odd going on. Linus
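For reference, the approach Linus sketches above, written out as a complete helper, could look roughly like the sketch below. It is untested and illustrative only: the function name, the PAGE_ALIGNED() checks standing in for the strict alignment rule he asks for, and pgprot_nx() (the spelled-out form of the prot_nx() in his example) are assumptions here, not code from the actual patch.

    /*
     * Illustrative sketch: map a physically contiguous, page-aligned range
     * into kernel virtual space without building a struct page array.
     */
    static void *map_phys_range(phys_addr_t start, unsigned long size)
    {
            struct vm_struct *area;
            unsigned long vmap_start, vmap_end;
            int ret;

            /* Refuse unaligned input instead of silently rounding it */
            if (!PAGE_ALIGNED(start) || !PAGE_ALIGNED(size))
                    return NULL;

            area = get_vm_area(size, VM_IOREMAP);
            if (!area)
                    return NULL;

            vmap_start = (unsigned long)area->addr;
            vmap_end = vmap_start + size;

            /* Map the whole contiguous physical range in one call */
            ret = vmap_page_range(vmap_start, vmap_end, start,
                                  pgprot_nx(PAGE_KERNEL));
            if (ret < 0) {
                    free_vm_area(area);
                    return NULL;
            }
            return area->addr;
    }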
On Mon, 31 Mar 2025 09:55:28 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > > + /* Make sure the mappings are page aligned */ > > + *start = ALIGN(*start, PAGE_SIZE); > > The above is *completely* unacceptable. > > There is no way in hell that ALIGN() can ever be right. I just did this to be robust in case what was passed in was not aligned. In all my use cases, it is. > > You don't even fix up the low bits of the returned virtual address, so > you literally return the virtual address of something that doesn't > match what was passed in. > > So if you pass it a starting area that isn't page-aligned, it now > randomly gives you complete crap back, and includes some random > unrelated part in the mapping. > > So no. That needs to be a > > if (*start & PAGE_MASK) > return NULL; No problem, will update. As I said, I just added that to not map something that wasn't part of what was passed in. But returning error if it's not page aligned works for me too. > > or whatever. Because just randomly corrupting the base address by > ignoring the low bits is not ok. > > > + /* The size must fit full pages */ > > + page_count = size >> PAGE_SHIFT; > > This part works, simply because truncating the size is fine. It won't > all get mapped, but that's the caller's problem, at least the code > isn't returning random crap that has random other data in it. > > That said, I don't see the point. If you want to virtually map > physical pages, they need to be full pages, otherwise the end result > gets randomly truncated. So I think that while this is much better > than the "return random crap that doesn't make any sense", it should > be the same rule: just don't allow mapping partial pages. > > So make it be > > if (size & PAGE_MASK) > return NULL; > > instead, and just enforce the fact that allocations have to be sanely > aligned for vmap. Again, I'm happy to error out on non alignment. I'll just update the documentation to say it must be page size aligned. Currently it shows an example of using 4096 for alignment, but that should be changed to explicitly say to have it page aligned, as some archs (ppc) have 64K pages. > > Anyway, that takes care of the horrific interface. However, there's > another issue: > > > + pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL); > > you create this pointless array of pages. Why? It's a physically > contiguous area. > > You do that just because you want to use vmap() to map that contiguous > area one page at a time. > > But this is NOT a new thing. It's exactly what every single PCI device > with a random physical memory region BAR needs to do. And no, they > don't create arrays of 'struct page *', because they use memory that > doesn't even have page backing. > > So we actually have interfaces to do linear virtual mappings of > physical pages that *predate* vmap(), and do the right thing without > any of these games. [ Added the pstore folks ] OK, so I did copy this from fs/pstore/ram_core.c as this does basically the same thing as pstore. And it looks like pstore should be updated too. > > Yes, the legacy versions of interfaces are all for IO memory, but we > do have things like vmap_page_range() which should JustWork(tm). 
> > Yeah, you'll need to do something like > > unsigned long vmap_start, vmap_end; > > area = get_vm_area(size, VM_IOREMAP); > if (!area) > return NULL; > > vmap_start = (unsigned long) area->addr; > vmap_end = vmap_start + size; > > ret = vmap_page_range(vmap_start, vmap_end, > *start, prot_nx(PAGE_KERNEL)); > > if (ret < 0) { > free_vm_area(area); > return NULL; > } > > and the above is *entirely* untested and maybe there's something wrong > there, but the concept should work, and when you don't do it a page at > a time, you not only don't need the kmalloc_array(), it should even do > things like be able to use large page mappings if the alignment and > size work out. > > That said, the old code is *really* broken to begin with. I don't > understand why you want to vmap() a contiguous physical range. Either > it's real pages to begin with, and you can just use "page_address()" > to get a virtual address, it's *not* real pages, and doing > "pfn_to_page()" is actively wrong, because it creates a fake 'struct > page *' pointer that isn't valid. > > Is this all just for some disgusting HIGHMEM use (in which case you > need the virtual mapping because of HIGHMEM)? Is there any reason to > support HIGHMEM in this area at all? > > So I'm not sure why this code does all this horror in the first place. > Either it's all just confused code that just didn't know what it was > doing and just happened to work (very possible..) or there is > something odd going on. [ Adding Mike Rapoport ] This is due to what the "reserve_mem" kernel command line feature gives back. It reserves physical memory that is not part of the kernel memory allocator (it also works with memmap ie. memmap=12M$0x284500000, which also requires physical memory addresses that are not part of the memory allocator). It's up to the user to map it to virtual memory. The same is true for ramoops in pstore. -- Steve
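For context, the reserve_mem usage being discussed looks roughly like the boot parameters below; the label and instance names are made up, and the authoritative syntax is in Documentation/admin-guide/kernel-parameters.txt and the tracing documentation, not here.

    # Reserve 12M of RAM early in boot under the label "trace_buffer",
    # then attach a persistent boot-time trace instance to it:
    reserve_mem=12M:4096:trace_buffer trace_instance=boot_mapped@trace_buffer

    # The memmap variant mentioned above instead punches a hole in
    # System RAM at a fixed physical address:
    memmap=12M$0x284500000 trace_instance=boot_mapped@0x284500000:12M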
On Mon, 31 Mar 2025 at 10:38, Steven Rostedt <rostedt@goodmis.org> wrote: > > I just did this to be robust in case what was passed in was not aligned. In > all my use cases, it is. I really think that ALIGN() is absolutely the opposite of robust. It silently just makes incorrect values generate incorrect code. Robust is dealing correctly with incorrect values, not making it incorrect data just create even more incorrect data. > OK, so I did copy this from fs/pstore/ram_core.c as this does basically the > same thing as pstore. And it looks like pstore should be updated too. Please just stop copying code from random drivers or filesystems. Christ. I've said this before: core kernel code has higher quality requirements than random drivers (or even random filesystems). You can't copy crazy incorrect snippets from random sources and put it in core kernel code. > This is due to what the "reserve_mem" kernel command line feature gives > back. It reserves physical memory that is not part of the kernel memory > allocator (it also works with memmap ie. memmap=12M$0x284500000, which also > requires physical memory addresses that are not part of the memory > allocator). Yeah, so in that case turning those into "struct page *" is horrendously wrong, because you've literally taken it away from the memory management. If it's not part of the kernel memory allocator, it does not necessarily have a "struct page *" associated with it. Using a pointer to 'struct page' and passing it off is just fundamentally more than a bug: it's seriously and deeply broken. It's probably much more obviously broken if the physical addresses came from a PCI device, and this all just happens to work because it's actually real RAM and we ended up having a 'struct page []' for it for some random reason. But the general rule is that if all you have physical RAM addresses, you can use them as phys_addr_t (or turn them into pfn's, which is just the page index of the physical address). I think that *completely* bogus code worked only entirely due to luck - the 'struct page *' was never actually dereferenced, and it got turned back into a pfn and then a page table entry by the vmap code using purely address arithmetic (page_to_phys() and page_to_pfn()). So it probably only ever used pointer calculations, although some of those actually can end up depending on dereferencing 'struct page' (to find the base zone of the page, to find the mapping). And even with just pointer games, that should actually have triggered a warning in vmap_pages_pte_range() from this: if (WARN_ON(!pfn_valid(page_to_pfn(page)))) which I suspect means that we actually *do* end up having 'struct page' backing for those.. Looking closer... I think what happened is that reserve_mem() actually does end up giving you pages that have the 'struct page' backing store even if they aren't mapped. Which makes the 'struct page' pointer stuff work, and those pages end up working as 'reserved pages'. And I suspect that happens exactly because we had users that mis-used these 'struct page *' things, so it might even be intentional. Or it might just be because that memory *has* been in the E280 memory map before it was reserved, and the 'strict page' arrays may have been sized for the original E280 information, not the "after stealing" information. I didn't go look into that early memory reservation code, but I would not be surprised if there have been fixups to work around bugs like this.
I haven't had to look at it in ages, but maybe the bootmem code knows that bootmem allocations then might be given back to the VM and need to have those 'struct page' backing store things. Or - and this is also entirely possible - maybe you were just very lucky indeed because the code to allocate the 'struct page' regions ends up intentionally over-allocating and rounding things up. Because of how the buddy allocator works, each 'struct page' array needs to be rounded up to the maximum page allocation, so I think we always allocate the page arrays of a memory zone up to at least PAGE_SIZE << NR_PAGE_ORDERS boundaries (so 8MB-aligned, I think). So there's fundamentally some slop in there. Anyway, if you allocated them as physical addresses outside of the regular MM code, you should not use 'struct page' in *any* form. You really can't just turn any random physical address into a page, unless it came from the page allocator. And yes, I'm not at all surprised that we'd have other mis-users of this. We have some very historical code that depends on reserved pages going back all the way to linux-0.01 I suspect, because things like the original VGA code knew that the physical addresses were in the BIOS hole region that was backed by 'struct page'. Linus
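To make the "only worked by luck" point concrete: whether a reserved physical range has struct page backing at all is a per-pfn question, something like the untested, illustrative check below - and, as the discussion further down shows, the answer differs between a reserve_mem range and a memmap hole.

    /* Illustrative only: does every pfn in the range have a struct page? */
    static bool phys_range_has_struct_page(phys_addr_t start, unsigned long size)
    {
            unsigned long pfn = PHYS_PFN(start);
            unsigned long last = PHYS_PFN(start + size - 1);

            for (; pfn <= last; pfn++)
                    if (!pfn_valid(pfn))
                            return false;
            return true;
    }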
On Mon, 31 Mar 2025 12:12:08 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > > OK, so I did copy this from fs/pstore/ram_core.c as this does basically the > > same thing as pstore. And it looks like pstore should be updated too. > > Please just stop copying code from random drivers or filesystems. Note, I did not blindly copy it. I knew ramoops did exactly what I wanted to do, so I looked at how it did it. I read the code, it looked reasonable, and I mapped it. I needed a way to map physical memory to vmap memory. How am I supposed to know it was not doing it the proper way? > > Christ. > > I've said this before: core kernel code has higher quality > requirements than random drivers (or even random filesystems). I did not believe pstore was a random file system. It does similar code that tracing does. > > You can't copy crazy incorrect snippets from random sources and put it > in core kernel code. Note, even though this code lives in kernel/, it is not much different than pstore code. It's a way to trace / debug the kernel, very much like pstore in that it's used to debug. I saw that code was written in 2012 and thought it was mature. It made sense, and yes I could have looked for a range, but I trusted the people who wrote that code. This wasn't just a random looking at something and copying it. I really wanted to understand how it worked. And I did talk with some of the mm folks, and they seemed OK with this. > If it's not part of the kernel memory allocator, it does not > necessarily have a "struct page *" associated with it. Using a pointer > to 'struct page' and passing it off is just fundamentally more than a > bug: it's seriously and deeply broken. > > It's probably much more obviously broken if the physical addresses > came from a PCI device, and this all just happens to work because it's > actually real RAM and we ended up having a 'struct page []' for it for > some random reason. Note, this only works with RAM. > Or it might just be because that memory *has* been in the E280 memory > map before it was reserved, and the 'strict page' arrays may have been > sized for the original E280 information, not the "after stealing" > information. I guess you mean E820. Mike can correct me if I'm wrong, but the memory that was stolen was actual memory returned by the system (E820 in x86). It reserves the memory before the memory allocation reserves this memory. So what reserve_mem returns is valid memory that can be used by memory allocator, but is currently just "reserved" which means it wants to prevent the allocator from using it. > And yes, I'm not at all surprised that we'd have other mis-users of > this. We have some very historical code that depends on reserved pages > going back all the way to linux-0.01 I suspect, because things like > the original VGA code knew that the physical addresses were in the > BIOS hole region that was backed by 'struct page'. I believe this is what Mike set up. A way to have normal RAM reserved before it gets added to the memory allocator. This was by design. In fact, if you do pull my v2[*] pull request of the ring buffer code (that removes this user space mapping of the persistent ring buffer logic) it actually adds the ability to free the memory and add it back into the memory allocator. https://lore.kernel.org/linux-trace-kernel/173989134814.230693.18199312930337815629.stgit@devnote2/ -- Steve [*] https://lore.kernel.org/all/20250328173533.7fa7810d@gandalf.local.home/
On Mon, 31 Mar 2025 at 13:57, Steven Rostedt <rostedt@goodmis.org> wrote: > > Note, this only works with RAM. That's not the issue, actually. The issue is whether there's a 'struct page'. And "being RAM" does not mean that a struct page exists. For example, using "mem=" on the kernel command line will literally limit the amount of RAM the kernel will use, and in doing so will limit the page allocations too. IOW, all of these kernel command line things are *subtle*. Don't mis-use them by then making assumptions about how they work today (or how they will work tomorrow). > I guess you mean E820. Yes. Thankfully it's been years since I had to use the BIOS interfaces, so I no longer remember the exact line noise numbers... > Mike can correct me if I'm wrong, but the memory that was stolen was actual > memory returned by the system (E820 in x86). It reserves the memory before > the memory allocation reserves this memory. So what reserve_mem returns is > valid memory that can be used by memory allocator, but is currently just > "reserved" which means it wants to prevent the allocator from using it. That may indeed be true of reserve_mem. That doesn't make it any better to then use it for 'struct page' when the MM layer has explicitly been told to ignore it. Particularly since my first reaction wasn't from "that's wrong", but from "that's stupid". Generating that temporary array of pages is just bogus overhead. > In fact, if you do pull my v2[*] pull request of the ring buffer code (that > removes this user space mapping of the persistent ring buffer logic) it > actually adds the ability to free the memory and add it back into the memory > allocator. .. and *after* you've given it back to the memory allocator, and it gets allocated using the page allocators, at that point ahead and use 'struct page' as much as you want. Before that, don't. Even if it might work. Because you didn't allocate it as a struct page, and for all you know it might be treated as a different hotplug memory zone or whatever when given back. Linus
On Mon, 31 Mar 2025 14:42:38 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > .. and *after* you've given it back to the memory allocator, and it > gets allocated using the page allocators, at that point ahead and use > 'struct page' as much as you want. > > Before that, don't. Even if it might work. Because you didn't allocate > it as a struct page, and for all you know it might be treated as a > different hotplug memory zone or whatever when given back. Hmm, so if we need to map this memory to user space memory, then I can't use the method from this patch series, if I have to avoid struct page. Should I then be using vm_iomap_memory() passing in the physical address? As for architectures that do not have user/kernel data cache coherency, how does one flush the page when there's an update on the kernel side so that the user side doesn't see stale data? As the code currently uses flush_dcache_folio(), I'm guessing there's an easy way to create a folio that points to physical memory that's not part of the memory allocator? Matthew? -- Steve
On Tue, Apr 1, 2025 at 1:41 AM Steven Rostedt <rostedt@goodmis.org> wrote: > On Mon, 31 Mar 2025 14:42:38 -0700 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > .. and *after* you've given it back to the memory allocator, and it > > gets allocated using the page allocators, at that point ahead and use > > 'struct page' as much as you want. > > > > Before that, don't. Even if it might work. Because you didn't allocate > > it as a struct page, and for all you know it might be treated as a > > different hotplug memory zone or whatever when given back. > > Hmm, so if we need to map this memory to user space memory, then I can't > use the method from this patch series, if I have to avoid struct page. > > Should I then be using vm_iomap_memory() passing in the physical address? For mapping random physical memory ranges into userspace, we have helpers like remap_pfn_range() (the easy option, for use in an mmap handler, in case you want to want to map one contiguous physical region into userspace) and vmf_insert_pfn() (for use in a page fault handler, in case you want to map random physical pages into userspace on demand). > As for architectures that do not have user/kernel data cache coherency, how > does one flush the page when there's an update on the kernel side so that > the user side doesn't see stale data? flush_kernel_vmap_range() (and invalidate_kernel_vmap_range() for the other direction) might be what you want... I found those by going backwards from an arch-specific cache-flushing implementation. > As the code currently uses flush_dcache_folio(), I'm guessing there's an > easy way to create a folio that points to physical memory that's not part > of the memory allocator? Creating your own folio structs sounds like a bad idea; folio structs are supposed to be in specific kernel memory regions. For example, conversions from folio* to physical address can involve pointer arithmetic on the folio*, or they can involve reading members of the pointed-to folio.
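A minimal sketch of the first option Jann mentions (one contiguous physical region, mapped from an mmap handler) might look like this; buf_phys and buf_size are assumed to describe the reserved range, and the handler name is made up.

    static phys_addr_t buf_phys;      /* assumed: start of the reserved range */
    static unsigned long buf_size;    /* assumed: its size in bytes */

    static int tracebuf_mmap(struct file *file, struct vm_area_struct *vma)
    {
            unsigned long len = vma->vm_end - vma->vm_start;

            /* keep the sketch simple: no vm_pgoff handling, whole range only */
            if (len > buf_size)
                    return -EINVAL;

            /* sets up a VM_PFNMAP mapping: no struct page refcounting */
            return remap_pfn_range(vma, vma->vm_start, PHYS_PFN(buf_phys),
                                   len, vma->vm_page_prot);
    }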
On Mon, 31 Mar 2025 at 16:41, Steven Rostedt <rostedt@goodmis.org> wrote: > > Hmm, so if we need to map this memory to user space memory, then I can't > use the method from this patch series, if I have to avoid struct page. > > Should I then be using vm_iomap_memory() passing in the physical address? I actually think that would be the best option in general - it works *regardless* of the source of the pages (ie it works for pages that don't have 'struct page' backing, but it works for regular RAM too). So it avoids any question of how the page was allocated, and it also avoids the page reference counting overhead. I thought you did that already for the user mappings - don't use you remap_pfn_range()? That's basically the equivalent of vmap_page_range() - you're mapping a whole range based on physical addresses, not mapping individual pages. But I didn't go look, this is from dim and possibly confused memories from past patches. > As for architectures that do not have user/kernel data cache coherency, how > does one flush the page when there's an update on the kernel side so that > the user side doesn't see stale data? So if you don't treat this as some kind of 'page' or 'folio' thing, then the proper function is actually flush_cache_range(). I actually suspect that if you treat things just as an arbitrary range of memory, it might simplify things in general. For example, the whole flush_cache_page() thing obviously just flushes one page. So then you have to artificially iterate over pages rather than just use the natural range. HOWEVER. At this point I have to also admit that you will likely find various holes in various architecture implementations. Why? Because sane architectures don't care (there's basically no testing of any of this on x86, because x86 is typically always cache coherent outside of some GPU oddities that are handled by the DRM layer explicitly, so all of these functions are just no-ops on x86). And so almost none of this gets any testing in practice. A missed cache flush doesn't matter on x86 or arm64, and very seldom elsewhere too. We do have "flush_cache_range()" calls in the generic MM code, and so it *should* all work, but honestly, I'd expect there to be bugs in this area. Of course, I would expect the same to be true of the page/folio cases, so I don't think using flush_cache_range() should be any worse, but I *could* imagine that it's bad in a different way ;) Linus
On Mon, 31 Mar 2025 at 17:11, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I thought you did that already for the user mappings - don't use you > remap_pfn_range()? > > That's basically the equivalent of vmap_page_range() - you're mapping > a whole range based on physical addresses, not mapping individual > pages. Btw, if you are wondering why this isn't consistent: for user space mapping 'remap_pfn_range()' takes a starting physical 'pfn' and a size, while for kernel space 'vmap_page_range()' takes a 'phys_addr_t' and a size, then I have no answers for you. Except just "random historical accidents". There are a lot more "map into user space" cases, and it's the much more common case, and it's the one that a lot of drivers have used for a long time. The 'vmap' code in contrast really started out as just "vmalloc()" which allocated the pages while mapping and "ioremap()" which mapped in IO areas into kernel virtual memory. And then the interface has very occasionally gotten updated over the years and we *have* had some unification, but it's been driven not by "let's make this a kernel version of user mappings", but by "random odd use cases". You can see this just in the statistics: remap_pfn_range() has hundreds of uses. In contrast, vmap_page_range() has something like five uses - when you grep for it, half of the hits for it are in the mm/vmalloc.c implementation and the header file declaration). So I do think vmap_page_range() is the right thing to do here, but it is *so* rare that maybe it has issues. There simply generally isn't any reason to map contiguous pages into kernel space. The *typical* use for kernel virtual mappings is either (a) turn physically discontiguous allocations into a contiguous virtual mapping (ie "vmalloc" like things, sometimes with pre-allocated arrays of pages) or (b) map device pages into kernel space (ie "ioremap()" kinds of things) so vmap_page_range() is kind of an odd and unusual duck, which is why people may not be all that aware of it. And being an odd and unusual duck might mean it has bugs, but the good news is that "vmap_page_range()" is actually the underlying implementation of "ioremap()", so it *does* actually get tested that way. Linus
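For reference, the asymmetry being described is visible in the two prototypes, roughly as they appear in current kernels:

    /* user-space side: physical start expressed as a pfn, plus a size */
    int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
                        unsigned long pfn, unsigned long size, pgprot_t prot);

    /* kernel side: a virtual range, plus the physical start as a phys_addr_t */
    int vmap_page_range(unsigned long addr, unsigned long end,
                        phys_addr_t phys_addr, pgprot_t prot);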
On Mon, 31 Mar 2025 17:11:46 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > I thought you did that already for the user mappings - don't use you > remap_pfn_range()? No, that's not what is done. The normal buffer is split among several sub-buffers (usually one page in size each, but can also be a power of two pages), and those pages are allocated via alloc_page() and are not contiguous. Which is why the mapping to user space creates an array of struct page pointers and then calls vm_insert_pages(). For the contigous physical memory, then yeah, we can simply use vm_iomap_memory(). > So if you don't treat this as some kind of 'page' or 'folio' thing, > then the proper function is actually flush_cache_range(). > > I actually suspect that if you treat things just as an arbitrary range > of memory, it might simplify things in general. Ah, yeah. That's the function I was looking for. > Of course, I would expect the same to be true of the page/folio cases, > so I don't think using flush_cache_range() should be any worse, but I > *could* imagine that it's bad in a different way ;) At least we can say we covered those other archs, and if a bug is reported, then all that would need to be fixed is the flush_cache_range() implementation ;-) -- Steve
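The discontiguous case Steven describes reduces to something like the sketch below (names are illustrative); the point is that alloc_page()-backed sub-buffers genuinely need the struct page array, unlike the contiguous reserve_mem range.

    /*
     * Illustrative: map an array of individually allocated sub-buffer pages
     * into a user vma.  vm_insert_pages() batches the PTE setup per PMD.
     */
    static int map_subbuf_pages(struct vm_area_struct *vma,
                                struct page **pages, unsigned long nr_pages)
    {
            unsigned long nr = nr_pages;

            return vm_insert_pages(vma, vma->vm_start, pages, &nr);
    }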
On Mon, 31 Mar 2025 at 17:29, Steven Rostedt <rostedt@goodmis.org> wrote: > > > Of course, I would expect the same to be true of the page/folio cases, > > so I don't think using flush_cache_range() should be any worse, but I > > *could* imagine that it's bad in a different way ;) > > At least we can say we covered those other archs, and if a bug is reported, > then all that would need to be fixed is the flush_cache_range() > implementation ;-) Well, there's also the whole "is it I$ or D$" question. I think flush_cache_range() is basically expected to do both, and you don't care about the I$ side (and I$ coherency issues are a *lot* more common than D$ coherency issues are - while D$ coherency is happily the common situation, the I$ not being coherent is actually the default, and x86 - and s390? - is unusual in this area). So maybe powerpc people will be unhappy about flush_cache_range() because it does the "look out for I$" thing too. There's a flush_dcache_range() thing too, but I don't see a single use of that function in generic code, so I suspect that one is currently purely for architecture-specific drivers and doesn't work in egneral. So *this* is the kind of "bad in a different way" I could imagine: things that probably should be cleaned up and be available to generic code, but very few people have cared, and so they are used in ad-hoc places and haven't been sufficiently generalized and cleaned up. Linus
On Mon, 31 Mar 2025 at 17:38, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So *this* is the kind of "bad in a different way" I could imagine: > things that probably should be cleaned up and be available to generic > code, but very few people have cared, and so they are used in ad-hoc > places and haven't been sufficiently generalized and cleaned up. Of course, the good news is that (a) few people care - which is why we are in this situation - and (b) the people who *do* care about these issues tend to be both responsible and responsive. So not some unfixable morass, more of a "one or two architectures may care, and we might have to extend interfaces that haven't been used this way before". For example, just adding flush_dcache_range() into include/asm-generic/cacheflush.h should be more-or-less trivial, with the same kinds of "architectures can make clear they have their own optimized version". Linus
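A minimal sketch of that addition to include/asm-generic/cacheflush.h, following the existing pattern there where an architecture skips the no-op fallback by defining a macro of the same name next to its own implementation (an illustration, not an actual patch):

    #ifndef flush_dcache_range
    static inline void flush_dcache_range(unsigned long start, unsigned long end)
    {
            /* no-op on cache-coherent architectures */
    }
    #endif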
On Tue, 1 Apr 2025 02:09:10 +0200 Jann Horn <jannh@google.com> wrote: > On Tue, Apr 1, 2025 at 1:41 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Mon, 31 Mar 2025 14:42:38 -0700 > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > .. and *after* you've given it back to the memory allocator, and it > > > gets allocated using the page allocators, at that point ahead and use > > > 'struct page' as much as you want. > > > > > > Before that, don't. Even if it might work. Because you didn't allocate > > > it as a struct page, and for all you know it might be treated as a > > > different hotplug memory zone or whatever when given back. > > > > Hmm, so if we need to map this memory to user space memory, then I can't > > use the method from this patch series, if I have to avoid struct page. > > > > Should I then be using vm_iomap_memory() passing in the physical address? > > For mapping random physical memory ranges into userspace, we have > helpers like remap_pfn_range() (the easy option, for use in an mmap > handler, in case you want to want to map one contiguous physical > region into userspace) and vmf_insert_pfn() (for use in a page fault > handler, in case you want to map random physical pages into userspace > on demand). Note, I believe that Linus brought up the issue that because this physical memory is not currently part of the memory allocator (it's not aware of it yet), that the getting struct page or a "pfn" for it may not be reliable. > > > As for architectures that do not have user/kernel data cache coherency, how > > does one flush the page when there's an update on the kernel side so that > > the user side doesn't see stale data? > > flush_kernel_vmap_range() (and invalidate_kernel_vmap_range() for the > other direction) might be what you want... I found those by going > backwards from an arch-specific cache-flushing implementation. > > > As the code currently uses flush_dcache_folio(), I'm guessing there's an > > easy way to create a folio that points to physical memory that's not part > > of the memory allocator? > > Creating your own folio structs sounds like a bad idea; folio structs > are supposed to be in specific kernel memory regions. For example, > conversions from folio* to physical address can involve pointer > arithmetic on the folio*, or they can involve reading members of the > pointed-to folio. Linus already mentioned flush_cache_range() which looks to be the thing to use. -- Steve
On Tue, Apr 1, 2025 at 3:01 AM Steven Rostedt <rostedt@goodmis.org> wrote: > On Tue, 1 Apr 2025 02:09:10 +0200 > Jann Horn <jannh@google.com> wrote: > > > On Tue, Apr 1, 2025 at 1:41 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > > On Mon, 31 Mar 2025 14:42:38 -0700 > > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > > .. and *after* you've given it back to the memory allocator, and it > > > > gets allocated using the page allocators, at that point ahead and use > > > > 'struct page' as much as you want. > > > > > > > > Before that, don't. Even if it might work. Because you didn't allocate > > > > it as a struct page, and for all you know it might be treated as a > > > > different hotplug memory zone or whatever when given back. > > > > > > Hmm, so if we need to map this memory to user space memory, then I can't > > > use the method from this patch series, if I have to avoid struct page. > > > > > > Should I then be using vm_iomap_memory() passing in the physical address? > > > > For mapping random physical memory ranges into userspace, we have > > helpers like remap_pfn_range() (the easy option, for use in an mmap > > handler, in case you want to want to map one contiguous physical > > region into userspace) and vmf_insert_pfn() (for use in a page fault > > handler, in case you want to map random physical pages into userspace > > on demand). > > Note, I believe that Linus brought up the issue that because this physical > memory is not currently part of the memory allocator (it's not aware of it > yet), that the getting struct page or a "pfn" for it may not be reliable. PFN mappings are specifically designed to work with memory that does not have "struct page": #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ > > > As for architectures that do not have user/kernel data cache coherency, how > > > does one flush the page when there's an update on the kernel side so that > > > the user side doesn't see stale data? > > > > flush_kernel_vmap_range() (and invalidate_kernel_vmap_range() for the > > other direction) might be what you want... I found those by going > > backwards from an arch-specific cache-flushing implementation. > > > > > As the code currently uses flush_dcache_folio(), I'm guessing there's an > > > easy way to create a folio that points to physical memory that's not part > > > of the memory allocator? > > > > Creating your own folio structs sounds like a bad idea; folio structs > > are supposed to be in specific kernel memory regions. For example, > > conversions from folio* to physical address can involve pointer > > arithmetic on the folio*, or they can involve reading members of the > > pointed-to folio. > > Linus already mentioned flush_cache_range() which looks to be the thing to > use. It looks like flush_kernel_vmap_range() is used for flushing dcache for the kernel mapping, while flush_cache_range() is for flushing dcache/icache for the userspace mapping? For example, on 32-bit arm, you might go down these paths, ending up in arch-specific functions that make it clear whether they're for the user side or the kernel side: flush_cache_range -> __cpuc_flush_user_range flush_kernel_vmap_range -> __cpuc_flush_dcache_area -> cpu_cache.flush_kern_dcache_area I think you probably need flushes on both sides, since you might have to first flush out the dirty cacheline you wrote through the kernel mapping, then discard the stale clean cacheline for the user mapping, or something like that? 
(Unless these VIVT cache architectures provide stronger guarantees on cache state than I thought.) But when you're adding data to the tracing buffers, I guess maybe you only want to flush the kernel mapping from the kernel, and leave flushing of the user mapping to userspace? I think if you're running in some random kernel context, you probably can't even reliably flush the right userspace context - see how for example vivt_flush_cache_range() does nothing if the MM being flushed is not running on the current CPU.
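Putting Jann's two halves together, the synchronous ioctl path being discussed would do something like the sketch below; the function and parameter names are assumptions, and on cache-coherent architectures both calls are no-ops.

    static void publish_subbuf(void *kaddr, int size,
                               struct vm_area_struct *vma,
                               unsigned long ustart, unsigned long uend)
    {
            /* push the kernel-side writes out of the kernel mapping's cache */
            flush_kernel_vmap_range(kaddr, size);

            /* drop stale lines for the user-space alias of the same memory */
            flush_cache_range(vma, ustart, uend);
    }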
On Mon, 31 Mar 2025 at 18:01, Steven Rostedt <rostedt@goodmis.org> wrote: > > Note, I believe that Linus brought up the issue that because this physical > memory is not currently part of the memory allocator (it's not aware of it > yet), that the getting struct page or a "pfn" for it may not be reliable. 'pfn' is always reliable. The pfn ('page frame number') is literally just the physical address expressed in 'page units', ie just shifted down by the page shift. So pfn and phys_addr_t are interchangeable when it comes to mapping pages. The pfn is in fact often the preferred form, because on 32-bit architectures a pfn is 32-bit, but a phys_addr_t is often 64-bit and generates extra code. I think 'pfn' was introduced as a name ong long ago because it was what the alpha architecture used in the VM documentation. It probably predates that too, but it's where I got it from, iirc. Linus
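Concretely, the relationship is just a shift; the helpers in <linux/pfn.h> are roughly:

    #define PFN_DOWN(x)   ((x) >> PAGE_SHIFT)                   /* addr -> pfn */
    #define PHYS_PFN(x)   ((unsigned long)((x) >> PAGE_SHIFT))  /* phys_addr_t -> pfn */
    #define PFN_PHYS(x)   ((phys_addr_t)(x) << PAGE_SHIFT)      /* pfn -> phys_addr_t */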
On Mon, 31 Mar 2025 17:49:07 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > For example, just adding flush_dcache_range() into > include/asm-generic/cacheflush.h should be more-or-less trivial, with > the same kinds of "architectures can make clear they have their own > optimized version". Just so I'm clear. Are you suggesting to add flush_dcache_range() to cacheflush.h and using that? Looks like some of the code would be able to remove the ifdefs: #ifdef CONFIG_PPC flush_dcache_range(...); #endif if that's the case. -- Steve
On Mon, 31 Mar 2025 18:30:33 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 31 Mar 2025 at 18:01, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > Note, I believe that Linus brought up the issue that because this physical > > memory is not currently part of the memory allocator (it's not aware of it > > yet), that the getting struct page or a "pfn" for it may not be reliable. > > 'pfn' is always reliable. > > The pfn ('page frame number') is literally just the physical address > expressed in 'page units', ie just shifted down by the page shift. Ah, for some reason I thought the pfn mapped directly to struct pages. > > So pfn and phys_addr_t are interchangeable when it comes to mapping > pages. The pfn is in fact often the preferred form, because on 32-bit > architectures a pfn is 32-bit, but a phys_addr_t is often 64-bit and > generates extra code. > > I think 'pfn' was introduced as a name ong long ago because it was > what the alpha architecture used in the VM documentation. It probably > predates that too, but it's where I got it from, iirc. > It is old, as I remember using it when I first started Linux kernel development back in 1998. But my memory of it was it was also used as an index into a struct page array. Which is why I was thinking it was somewhat interchangeable with struct page. But that was a long time ago when I was an embedded developer, but I've only been using struct page for my needs in the last couple of decades. -- Steve
On Mon, 31 Mar 2025 at 18:35, Steven Rostedt <rostedt@goodmis.org> wrote: > > Just so I'm clear. Are you suggesting to add flush_dcache_range() to > cacheflush.h and using that? No, let's wait for it to be actually considered a problem. And I think Jann is right that for kernel ranges you should use flush_kernel_vmap_range() anyway. Oh how I hate incoherent data caches. Linus
On Tue, 1 Apr 2025 03:28:20 +0200 Jann Horn <jannh@google.com> wrote: > I think you probably need flushes on both sides, since you might have > to first flush out the dirty cacheline you wrote through the kernel > mapping, then discard the stale clean cacheline for the user mapping, > or something like that? (Unless these VIVT cache architectures provide > stronger guarantees on cache state than I thought.) But when you're > adding data to the tracing buffers, I guess maybe you only want to > flush the kernel mapping from the kernel, and leave flushing of the > user mapping to userspace? I think if you're running in some random > kernel context, you probably can't even reliably flush the right > userspace context - see how for example vivt_flush_cache_range() does > nothing if the MM being flushed is not running on the current CPU. I'm assuming I need to flush both the kernel (get the updates out to memory) and user space (so it can read those updates). The paths are all done via system calls from user space, so it should be on the same CPU. User space will do an ioctl() on the buffer file descriptor asking for an update, the kernel will populate the page with that update, and then user space will read the update after the ioctl() returns. All very synchronous. Thus, we don't need to worry about updates from one CPU happening on another CPU. Even when it wants to read the buffer. The ioctl() will swap out the old reader page with one of the write pages making it the new "reader" page, where no more updates will happen on that page. The flush happens after that and before going back to user space. -- Steve
On Mon, 31 Mar 2025 at 18:40, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > I think 'pfn' was introduced as a name ong long ago because it was > > what the alpha architecture used in the VM documentation. It probably > > predates that too, but it's where I got it from, iirc. > > > > It is old, as I remember using it when I first started Linux kernel > development back in 1998. So I did the alpha port in '95, but I meant that the 'page frame number' as a name may well have roots that go much further back. I only know it from the alpha architecture docs. Google only seems to find the modern Linux use, but I wouldn't be surprised if the alpha architects got it from some much older use (I would suspect VMS). > But my memory of it was it was also used as an > index into a struct page array. So typically, the pfn is what is used to translate different kinds of addresses. And yes, one such translation is "turn virtual address into physical address, then turn that into the pfn, and then use that to index the array of 'struct page'. Few places use the pfn itself as the primary thing, but one of those few things is the low-level page table code. Because page table entries are literally a combination of the pfn and various per-page information (protection bits, cacheability, stuff like that). So to the page table code, 'pfn' is kind of fundamental. Example: the actual low-level "create a page table entry" is "pfn_pte()" which takes a pfn and the protection bits. That's the basis from which all the other page table entry functions are based on, even if you often have wrapper helpers to hide the 'pfn' part from users, and you might have seen mk_pte() which takes a 'struct page' and the protection bits. That one just turns the page into a pfn and then used pfn_pte(). Linus
On 2025-03-31 21:50, Steven Rostedt wrote: > On Tue, 1 Apr 2025 03:28:20 +0200 > Jann Horn <jannh@google.com> wrote: > >> I think you probably need flushes on both sides, since you might have >> to first flush out the dirty cacheline you wrote through the kernel >> mapping, then discard the stale clean cacheline for the user mapping, >> or something like that? (Unless these VIVT cache architectures provide >> stronger guarantees on cache state than I thought.) But when you're >> adding data to the tracing buffers, I guess maybe you only want to >> flush the kernel mapping from the kernel, and leave flushing of the >> user mapping to userspace? I think if you're running in some random >> kernel context, you probably can't even reliably flush the right >> userspace context - see how for example vivt_flush_cache_range() does >> nothing if the MM being flushed is not running on the current CPU. > > I'm assuming I need to flush both the kernel (get the updates out to > memory) and user space (so it can read those updates). > > The paths are all done via system calls from user space, so it should be on > the same CPU. User space will do an ioctl() on the buffer file descriptor > asking for an update, the kernel will populate the page with that update, > and then user space will read the update after the ioctl() returns. All > very synchronous. Thus, we don't need to worry about updates from one CPU > happening on another CPU. > > Even when it wants to read the buffer. The ioctl() will swap out the old > reader page with one of the write pages making it the new "reader" page, > where no more updates will happen on that page. The flush happens after > that and before going back to user space. FWIW, I have the following in the LTTng kernel tracer to cover this. LTTng writes to ring buffers through the linear mapping, and reads the buffers from userspace either through mmap or splice. When userspace wants to get read access to a sub-buffer (an abstraction that generalizes your ftrace ring buffer "pages") through mmap, it does the following through a "get subbuffer" ioctl: - Use "cpu_dcache_is_aliasing()" to check whether explicit flushing is needed between the kernel linear mapping and userspace mappings. - Use flush_dcache_page() to make sure all mappings for a given page are flushed (both the kernel linear mapping and the userspace virtual mappings). I suspect that if you go down the route of the explicit "flush_cache_range()", then you'll need to issue it on all mappings that alias your memory. AFAIU, using flush_dcache_page() saves you the trouble of issuing flush_cache_range() on all mapping aliases manually. Thanks, Mathieu
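A sketch of the LTTng-style read-side flush Mathieu describes, applied to an array of sub-buffer pages; the names are illustrative, and this assumes the pages do have struct page backing, which is what flush_dcache_page() operates on.

    static void flush_subbuf_for_read(struct page **subbuf_pages,
                                      unsigned long nr_pages)
    {
            unsigned long i;

            /* only needed when D-cache aliasing between mappings is possible */
            if (!cpu_dcache_is_aliasing())
                    return;

            /* flush_dcache_page() covers every mapping of each page */
            for (i = 0; i < nr_pages; i++)
                    flush_dcache_page(subbuf_pages[i]);
    }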
* Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 31 Mar 2025 at 18:40, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > I think 'pfn' was introduced as a name ong long ago because it was > > > what the alpha architecture used in the VM documentation. It probably > > > predates that too, but it's where I got it from, iirc. > > > > > > > It is old, as I remember using it when I first started Linux kernel > > development back in 1998. > > So I did the alpha port in '95, but I meant that the 'page frame > number' as a name may well have roots that go much further back. I > only know it from the alpha architecture docs. > > Google only seems to find the modern Linux use, but I wouldn't be > surprised if the alpha architects got it from some much older use (I > would suspect VMS). *Technically*, for those of us with a weakness for Git archeology, the 'PFN' term was first introduced in Linux with the MIPS port, which went upstream via Linux 1.1.45, released on 1994/04/06, and had this in include/asm-mips/mipsregs.h: +/* + * Compute a vpn/pfn entry for EntryHi register + */ +#define VPN(addr,pagesizeshift) ((addr) & ~((1 << (pagesizeshift))-1)) +#define PFN(addr,pagesizeshift) (((addr) & ((1 << (pagesizeshift))-1)) << 6) ... while your Alpha port went upstream via 1.1.57, released on 1994/10/24, and the first mention of 'PFN' was in arch/alpha/mm/init.c, introduced as part of 1.1.82, released 1995/01/16, almost a year later: +unsigned long paging_init(unsigned long start_mem, unsigned long end_mem) .. + unsigned long pfn, nr; + if (cluster->usage & 1) + continue; + pfn = cluster->start_pfn; + nr = cluster->numpages; So you don't appear to have dibs on the PFN name within Linux, although I suspect MIPS and Alpha had it from a similar background. :-) By the way, some quick review feedback on that patch: there should be extra newlines after variable declarations - clearly the author is not a good kernel programm... oh, never mind. Thanks, Ingo
On Mon, Mar 31, 2025 at 02:42:38PM -0700, Linus Torvalds wrote: > On Mon, 31 Mar 2025 at 13:57, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > Note, this only works with RAM. > > That's not the issue, actually. > > The issue is whether there's a 'struct page'. And "being RAM" does not > mean that a struct page exists. > > For example, using "mem=" on the kernel command line will literally > limit the amount of RAM the kernel will use, and in doing so will > limit the page allocations too. And using memmap=m$n on x86 creates a hole in System RAM that does not have neither struct page nor kernel mappings and it is never considered RAM anywhere in mm. > IOW, all of these kernel command line things are *subtle*. > > Don't mis-use them by then making assumptions about how they work > today (or how they will work tomorrow). I'd say it's better not to use them at all. They cause weirdness in memory layout and also they are inconsistent in how architectures implement them. > > Mike can correct me if I'm wrong, but the memory that was stolen was actual > > memory returned by the system (E820 in x86). It reserves the memory before > > the memory allocation reserves this memory. So what reserve_mem returns is > > valid memory that can be used by memory allocator, but is currently just > > "reserved" which means it wants to prevent the allocator from using it. > > That may indeed be true of reserve_mem. The reserve_mem behaves like any other early allocation, it has proper struct pages (PG_Reserved) and it is mapped in the direct map so phys_to_virt() will work on it. As for mapping it to userspace, vm_iomap_memory() seems the best API to use. It has all the alignment checks and will refuse to map ranges that are not properly aligned and it will use vma information to create the right mappings. > Linus
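In other words, for a reserve_mem range no vmap step is needed at all. A sketch of both sides, with tb_start/tb_size standing in for the physical range handed back by reserve_mem (names are illustrative):

    static phys_addr_t tb_start;     /* assumed: reserve_mem range start */
    static unsigned long tb_size;    /* assumed: reserve_mem range size  */

    /* kernel side: the range is already in the direct map */
    static void *persist_kaddr(void)
    {
            return phys_to_virt(tb_start);
    }

    /* user side: vm_iomap_memory() does the alignment and size checks */
    static int persist_mmap(struct file *file, struct vm_area_struct *vma)
    {
            return vm_iomap_memory(vma, tb_start, tb_size);
    }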
On Tue, 1 Apr 2025 12:56:31 +0300 Mike Rapoport <rppt@kernel.org> wrote: > > For example, using "mem=" on the kernel command line will literally > > limit the amount of RAM the kernel will use, and in doing so will > > limit the page allocations too. > > And using memmap=m$n on x86 creates a hole in System RAM that does not have > neither struct page nor kernel mappings and it is never considered RAM > anywhere in mm. Hmm, when that is used, then we had better not "free" the buffer. > > > IOW, all of these kernel command line things are *subtle*. > > > > Don't mis-use them by then making assumptions about how they work > > today (or how they will work tomorrow). > > I'd say it's better not to use them at all. They cause weirdness in memory > layout and also they are inconsistent in how architectures implement them. > > > > Mike can correct me if I'm wrong, but the memory that was stolen was actual > > > memory returned by the system (E820 in x86). It reserves the memory before > > > the memory allocation reserves this memory. So what reserve_mem returns is > > > valid memory that can be used by memory allocator, but is currently just > > > "reserved" which means it wants to prevent the allocator from using it. > > > > That may indeed be true of reserve_mem. > > The reserve_mem behaves like any other early allocation, it has proper > struct pages (PG_Reserved) and it is mapped in the direct map so > phys_to_virt() will work on it. > > As for mapping it to userspace, vm_iomap_memory() seems the best API to > use. It has all the alignment checks and will refuse to map ranges that are > not properly aligned and it will use vma information to create the right > mappings. > When using vmap() to get the virtual addresses (via the kmalloc_array() of struct pages), the vunmap() gives the memory back to the memory allocator: ~# free total used free shared buff/cache available Mem: 8185928 296676 7840576 920 148280 7889252 Swap: 7812092 0 7812092 ~# rmdir /sys/kernel/tracing/instances/boot_mapped/ ~# free total used free shared buff/cache available Mem: 8206404 290868 7866772 920 148384 7915536 Swap: 7812092 0 7812092 With no issues. But if I use vmap_page_range(), how do I give that back to the memory allocator? 
Calling vunmap() on that memory gives me: 1779.832484] ------------[ cut here ]------------ [ 1779.834076] Trying to vunmap() nonexistent vm area (000000027c000000) [ 1779.835941] WARNING: CPU: 6 PID: 956 at mm/vmalloc.c:3413 vunmap+0x5a/0x60 [ 1779.837587] Modules linked in: [ 1779.838455] CPU: 6 UID: 0 PID: 956 Comm: rmdir Not tainted 6.14.0-rc4-test-00019-ga9c509c0c8e7-dirty #379 [ 1779.840597] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 1779.842576] RIP: 0010:vunmap+0x5a/0x60 [ 1779.843553] Code: 89 c7 48 85 c0 74 12 e8 94 e1 01 00 48 8b 5d f8 c9 c3 cc cc cc cc 90 0f 0b 90 48 c7 c7 78 b6 c4 9d 48 89 de e8 57 b4 cd ff 90 <0f> 0b 90 90 eb dc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 [ 1779.847159] RSP: 0018:ffffbfdb7ec93da0 EFLAGS: 00010282 [ 1779.848256] RAX: 0000000000000000 RBX: 000000027c000000 RCX: 0000000000000000 [ 1779.849623] RDX: ffff9f0efdfab108 RSI: ffff9f0efdf9cbc0 RDI: 0000000000000001 [ 1779.851079] RBP: ffffbfdb7ec93da8 R08: 00000000ffffdfff R09: ffffffff9e7652c8 [ 1779.852447] R10: ffffffff9e6b5320 R11: 0000000000000000 R12: ffff9f0d80226e00 [ 1779.853746] R13: 0000000000000001 R14: ffff9f0d806294c0 R15: ffff9f0d80629190 [ 1779.855121] FS: 00007f414dea6740(0000) GS:ffff9f0efdf80000(0000) knlGS:0000000000000000 [ 1779.856524] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1779.857548] CR2: 00007f28a0a4d350 CR3: 000000010f950002 CR4: 0000000000172ef0 [ 1779.858916] Call Trace: [ 1779.859467] <TASK> [ 1779.859986] ? show_regs.cold+0x19/0x24 [ 1779.860708] ? vunmap+0x5a/0x60 [ 1779.861387] ? __warn.cold+0xc2/0x157 [ 1779.862141] ? vunmap+0x5a/0x60 [ 1779.862752] ? report_bug+0x10a/0x150 [ 1779.865787] ? handle_bug+0x5c/0xa0 [ 1779.866684] ? exc_invalid_op+0x1c/0x80 [ 1779.867455] ? asm_exc_invalid_op+0x1f/0x30 [ 1779.868256] ? vunmap+0x5a/0x60 [ 1779.868908] ring_buffer_free+0xac/0xc0 [ 1779.869595] __remove_instance.part.0.constprop.0+0xeb/0x1f0 [ 1779.870590] instance_rmdir+0xe1/0xf0 [ 1779.871342] tracefs_syscall_rmdir+0x5c/0xa0 [ 1779.872198] vfs_rmdir+0xa0/0x220 [ 1779.872806] do_rmdir+0x146/0x190 [ 1779.873477] __x64_sys_rmdir+0x43/0x70 [ 1779.874188] x64_sys_call+0x114f/0x1d70 [ 1779.874944] do_syscall_64+0xbb/0x1d0 [ 1779.875605] entry_SYSCALL_64_after_hwframe+0x77/0x7f What's the proper way to say: "I no longer need this physical memory I reserved, the kernel can now use it"? -- Steve
On Tue, Apr 01, 2025 at 11:11:59AM -0400, Steven Rostedt wrote: > On Tue, 1 Apr 2025 12:56:31 +0300 > Mike Rapoport <rppt@kernel.org> wrote: > > > > For example, using "mem=" on the kernel command line will literally > > > limit the amount of RAM the kernel will use, and in doing so will > > > limit the page allocations too. > > > > And using memmap=m$n on x86 creates a hole in System RAM that does not have > > neither struct page nor kernel mappings and it is never considered RAM > > anywhere in mm. > > Hmm, when that is used, then we had better not "free" the buffer. > > > > > > IOW, all of these kernel command line things are *subtle*. > > > > > > Don't mis-use them by then making assumptions about how they work > > > today (or how they will work tomorrow). > > > > I'd say it's better not to use them at all. They cause weirdness in memory > > layout and also they are inconsistent in how architectures implement them. > > > > > > Mike can correct me if I'm wrong, but the memory that was stolen was actual > > > > memory returned by the system (E820 in x86). It reserves the memory before > > > > the memory allocation reserves this memory. So what reserve_mem returns is > > > > valid memory that can be used by memory allocator, but is currently just > > > > "reserved" which means it wants to prevent the allocator from using it. > > > > > > That may indeed be true of reserve_mem. > > > > The reserve_mem behaves like any other early allocation, it has proper > > struct pages (PG_Reserved) and it is mapped in the direct map so > > phys_to_virt() will work on it. > > > > As for mapping it to userspace, vm_iomap_memory() seems the best API to > > use. It has all the alignment checks and will refuse to map ranges that are > > not properly aligned and it will use vma information to create the right > > mappings. > > > > When using vmap() to get the virtual addresses (via the kmalloc_array() of > struct pages), the vunmap() gives the memory back to the memory allocator: > > ~# free > total used free shared buff/cache available > Mem: 8185928 296676 7840576 920 148280 7889252 > Swap: 7812092 0 7812092 > ~# rmdir /sys/kernel/tracing/instances/boot_mapped/ > ~# free > total used free shared buff/cache available > Mem: 8206404 290868 7866772 920 148384 7915536 > Swap: 7812092 0 7812092 > > With no issues. > > But if I use vmap_page_range(), how do I give that back to the memory allocator? But you don't need neither vmap() nor vmap_page_range() to have kernel page tables for memory that you get from reserve_mem. It's already mapped and plain phys_to_virt() gives you the virtual address you can use. > Calling vunmap() on that memory gives me: > > 1779.832484] ------------[ cut here ]------------ > [ 1779.834076] Trying to vunmap() nonexistent vm area (000000027c000000) > [ 1779.835941] WARNING: CPU: 6 PID: 956 at mm/vmalloc.c:3413 vunmap+0x5a/0x60 > > What's the proper way to say: "I no longer need this physical memory I > reserved, the kernel can now use it"? free_reserved_area() > -- Steve
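For completeness, a sketch of the release path Mike points at; start/size describe the reserve_mem range as in the earlier sketches, the label string is only used for the log line, and -1 means the freed pages are not poisoned.

    static void persist_release(phys_addr_t start, unsigned long size)
    {
            unsigned long freed;

            /* hand the reserved pages back to the buddy allocator */
            freed = free_reserved_area(phys_to_virt(start),
                                       phys_to_virt(start + size),
                                       -1, "persistent ring buffer");
            pr_info("released %lu pages\n", freed);
    }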
On Tue, 1 Apr 2025 18:26:43 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> > But if I use vmap_page_range(), how do I give that back to the memory allocator?
>
> But you need neither vmap() nor vmap_page_range() to have kernel page
> tables for memory that you get from reserve_mem. It's already mapped, and
> plain phys_to_virt() gives you the virtual address you can use.

Oh! That makes things so much easier! Especially since that means it should
work like the normal buffer, where virt_to_page() should also work. Right?

Now, I do support mapping the persistent ring buffer via memmap, but I can
just give up on allowing that to be memory mapped to user space, or even
freed.

> > Calling vunmap() on that memory gives me:
> >
> > [ 1779.832484] ------------[ cut here ]------------
> > [ 1779.834076] Trying to vunmap() nonexistent vm area (000000027c000000)
> > [ 1779.835941] WARNING: CPU: 6 PID: 956 at mm/vmalloc.c:3413 vunmap+0x5a/0x60
> >
> > What's the proper way to say: "I no longer need this physical memory I
> > reserved, the kernel can now use it"?
>
> free_reserved_area()

Awesome!

Thanks,

-- Steve
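[For the user-space side being weighed here, a minimal sketch of the
vm_iomap_memory() route suggested earlier might look like the following.
The file_operations hook and the rb_phys_region structure are hypothetical
stand-ins; only vm_iomap_memory() itself is the real helper, and it does the
alignment and size validation against the vma.]

#include <linux/fs.h>
#include <linux/mm.h>

struct rb_phys_region {			/* hypothetical bookkeeping */
	phys_addr_t	start;
	unsigned long	size;
};

/* Hypothetical mmap handler for a physically contiguous reserved range. */
static int rb_phys_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct rb_phys_region *r = file->private_data;

	/*
	 * vm_iomap_memory() checks that the vma fits inside the region
	 * and sets up the PFN mapping with the vma's page protection.
	 */
	return vm_iomap_memory(vma, r->start, r->size);
}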
On Tue, Apr 01, 2025 at 11:54:23AM -0400, Steven Rostedt wrote:
> On Tue, 1 Apr 2025 18:26:43 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
>
> > > But if I use vmap_page_range(), how do I give that back to the memory allocator?
> >
> > But you need neither vmap() nor vmap_page_range() to have kernel page
> > tables for memory that you get from reserve_mem. It's already mapped, and
> > plain phys_to_virt() gives you the virtual address you can use.
>
> Oh! That makes things so much easier! Especially since that means it should
> work like the normal buffer, where virt_to_page() should also work. Right?

Right.

> Now, I do support mapping the persistent ring buffer via memmap, but I can
> just give up on allowing that to be memory mapped to user space, or even
> freed.

It can't be freed without jumping through some hoops, and mapping it to
user space is a hassle as well.
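[A minimal sketch of the simplification being agreed on here, assuming a
page-aligned reserve_mem range; the helper name is made up:]

#include <linux/io.h>
#include <linux/mm.h>

/* Hypothetical helper: get usable addresses for a reserve_mem range. */
static struct page *rb_reserved_first_page(phys_addr_t start)
{
	void *vaddr = phys_to_virt(start);	/* direct map, no vmap() needed */

	/*
	 * reserve_mem memory has proper (PG_reserved) struct pages, so
	 * virt_to_page() works on it just like it does on a normally
	 * allocated buffer page.
	 */
	return virt_to_page(vaddr);
}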
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> There's a flush_dcache_range() thing too, but I don't see a single use
> of that function in generic code, so I suspect that one is currently
> purely for architecture-specific drivers and doesn't work in general.
>
> So *this* is the kind of "bad in a different way" I could imagine:
> things that probably should be cleaned up and be available to generic
> code, but very few people have cared, and so they are used in ad-hoc
> places and haven't been sufficiently generalized and cleaned up.

There is no fallback implementation of flush_dcache_range for the
architectures that don't need it. It would be nice to have one, as
doing it by hand is pretty silly:

static inline void scatterwalk_done_dst(struct scatter_walk *walk,
					unsigned int nbytes)
{
	scatterwalk_unmap(walk);
	/*
	 * Explicitly check ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE instead of just
	 * relying on flush_dcache_page() being a no-op when not implemented,
	 * since otherwise the BUG_ON in sg_page() does not get optimized out.
	 * This also avoids having to consider whether the loop would get
	 * reliably optimized out or not.
	 */
	if (ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE) {
		struct page *base_page;
		unsigned int offset;
		int start, end, i;

		base_page = sg_page(walk->sg);
		offset = walk->offset;
		start = offset >> PAGE_SHIFT;
		end = start + (nbytes >> PAGE_SHIFT);
		end += (offset_in_page(offset) + offset_in_page(nbytes) +
			PAGE_SIZE - 1) >> PAGE_SHIFT;
		for (i = start; i < end; i++)
			flush_dcache_page(nth_page(base_page, i));
	}

Cheers,
On Mon, Mar 31, 2025 at 01:39:06PM -0400, Steven Rostedt wrote: > On Mon, 31 Mar 2025 09:55:28 -0700 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Anyway, that takes care of the horrific interface. However, there's > > another issue: > > > > > + pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL); > > > > you create this pointless array of pages. Why? It's a physically > > contiguous area. > > > > You do that just because you want to use vmap() to map that contiguous > > area one page at a time. > > > > But this is NOT a new thing. It's exactly what every single PCI device > > with a random physical memory region BAR needs to do. And no, they > > don't create arrays of 'struct page *', because they use memory that > > doesn't even have page backing. > > > > So we actually have interfaces to do linear virtual mappings of > > physical pages that *predate* vmap(), and do the right thing without > > any of these games. > > [ Added the pstore folks ] > > OK, so I did copy this from fs/pstore/ram_core.c as this does basically the > same thing as pstore. And it looks like pstore should be updated too. I think we're talking about persistent_ram_vmap()? That code predates my maintainership, but I'm happy to update it to use better APIs. > > Yes, the legacy versions of interfaces are all for IO memory, but we > > do have things like vmap_page_range() which should JustWork(tm). > > > > Yeah, you'll need to do something like > > > > unsigned long vmap_start, vmap_end; > > > > area = get_vm_area(size, VM_IOREMAP); > > if (!area) > > return NULL; > > > > vmap_start = (unsigned long) area->addr; > > vmap_end = vmap_start + size; > > > > ret = vmap_page_range(vmap_start, vmap_end, > > *start, prot_nx(PAGE_KERNEL)); > > > > if (ret < 0) { > > free_vm_area(area); > > return NULL; > > } > > > > and the above is *entirely* untested and maybe there's something wrong > > there, but the concept should work, and when you don't do it a page at > > a time, you not only don't need the kmalloc_array(), it should even do > > things like be able to use large page mappings if the alignment and > > size work out. > > > > That said, the old code is *really* broken to begin with. I don't > > understand why you want to vmap() a contiguous physical range. Either > > it's real pages to begin with, and you can just use "page_address()" > > to get a virtual address, it's *not* real pages, and doing > > "pfn_to_page()" is actively wrong, because it creates a fake 'struct > > page *' pointer that isn't valid. > > > > Is this all just for some disgusting HIGHMEM use (in which case you > > need the virtual mapping because of HIGHMEM)? Is there any reason to > > support HIGHMEM in this area at all? > > > > So I'm not sure why this code does all this horror in the first place. > > Either it's all just confused code that just didn't know what it was > > doing and just happened to work (very possible..) or there is > > something odd going on. pstore tries to work with either real RAM or with iomem things. What is there now Currently Works Fine, but should this be using vmap_page_range()?
On Wed, 2 Apr 2025 at 23:00, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> There is no fallback implementation of flush_dcache_range for the
> architectures that don't need it.

Yeah, that literally looks like an oversight. Probably because nobody
needed it - or people didn't realize it was an option and did the
"loop over each page by hand" as in your example.

So yeah, I think it would be good to just expose flush_dcache_range()
as a dummy empty function in <asm-generic/cacheflush.h>, with the
usual "if the architecture defines this, that will override it" rule.

However, part of the problem is likely just unclear semantics. Because
the existing flush_cache_range() is about pure virtual caches and how
they need to be flushed before the TLB is flushed. It's also about D$
aliasing, but there aren't necessarily any TLB issues.

Honestly, I'd personally be perfectly fine saying "nobody sane cares,
and if your D$ isn't coherent, your perf mmap won't work".

            Linus
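[A sketch of the generic fallback described above, in the style
<asm-generic/cacheflush.h> already uses for other cache helpers. The exact
prototype varies between the architectures that implement
flush_dcache_range() today, so this is illustrative only, not a merged
patch:]

/* <asm-generic/cacheflush.h> -- hypothetical addition */
#ifndef flush_dcache_range
static inline void flush_dcache_range(unsigned long start, unsigned long end)
{
	/*
	 * Coherent D$ on this architecture: nothing to flush.  An
	 * architecture that needs real work here defines its own
	 * flush_dcache_range() (and the matching macro) before this
	 * generic header is included, which overrides this stub.
	 */
}
#endif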
On Thu, 3 Apr 2025 at 09:45, Kees Cook <kees@kernel.org> wrote:
>
> pstore tries to work with either real RAM or with iomem things. What
> is there now Currently Works Fine, but should this be using
> vmap_page_range()?

Yes, I don't see the point of using vmap() on something that is
contiguous but is made to look like a collection of random pfns.

IOW, that whole 'pages[]' array (and the kmalloc/kfree) seems pointless.

I *suspect* the history is simply that 'vmap()' predates 'vmap_page_range()'.

But maybe I'm missing something, and there is some limitation to
vmap_page_range() that I'm not seeing.

            Linus
On Thu, 3 Apr 2025 09:51:51 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I *suspect* the history is simply that 'vmap()' predates 'vmap_page_range()'.

Doing some git archaeology, it appears vmap_page_range() was added last
year by Alexei, to replace some ioremap_page_range() use cases.

In 2021, Christoph moved ioremap_page_range() to vmalloc.c, replacing
vmap_range(). Also in 2021, Nick changed the ioremap_*_range() functions
to vmap_*_range(), basically coming full circle.

ioremap_page_range() itself was introduced in 2006, and vmap() looks to
have been there before you started git.

The vmap() usage in pstore was introduced in 2012.

-- Steve
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index 56e27263acf8..a31dff0a9f09 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -89,11 +89,11 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer, struct trace_buffer * __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key); -struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flags, - int order, unsigned long start, - unsigned long range_size, - unsigned long scratch_size, - struct lock_class_key *key); +struct trace_buffer *__ring_buffer_alloc_physical(unsigned long size, unsigned flags, + int order, unsigned long start, + unsigned long range_size, + unsigned long scratch_size, + struct lock_class_key *key); void *ring_buffer_meta_scratch(struct trace_buffer *buffer, unsigned int *size); @@ -113,11 +113,11 @@ void *ring_buffer_meta_scratch(struct trace_buffer *buffer, unsigned int *size); * traced by ftrace, it can produce lockdep warnings. We need to keep each * ring buffer's lock class separate. */ -#define ring_buffer_alloc_range(size, flags, order, start, range_size, s_size) \ +#define ring_buffer_alloc_physical(size, flags, order, start, range_size, s_size) \ ({ \ static struct lock_class_key __key; \ - __ring_buffer_alloc_range((size), (flags), (order), (start), \ - (range_size), (s_size), &__key); \ + __ring_buffer_alloc_physical((size), (flags), (order), (start), \ + (range_size), (s_size), &__key); \ }) typedef bool (*ring_buffer_cond_fn)(void *data); @@ -235,7 +235,8 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order); int ring_buffer_subbuf_size_get(struct trace_buffer *buffer); enum ring_buffer_flags { - RB_FL_OVERWRITE = 1 << 0, + RB_FL_OVERWRITE = BIT(0), + RB_FL_PHYSICAL = BIT(1), }; #ifdef CONFIG_RING_BUFFER diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index f25966b3a1fc..36665754340f 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -17,6 +17,7 @@ #include <linux/uaccess.h> #include <linux/hardirq.h> #include <linux/kthread.h> /* for self test */ +#include <linux/vmalloc.h> /* for vmap */ #include <linux/module.h> #include <linux/percpu.h> #include <linux/mutex.h> @@ -556,6 +557,8 @@ struct trace_buffer { struct ring_buffer_meta *meta; + unsigned long phys_start; + unsigned int subbuf_size; unsigned int subbuf_order; unsigned int max_data_size; @@ -2351,6 +2354,43 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer) kfree(cpu_buffer); } +static unsigned long map_pages(unsigned long *start, unsigned long *end) +{ + struct page **pages; + phys_addr_t page_start; + unsigned long size; + unsigned long page_count; + unsigned long i; + void *vaddr; + + /* Make sure the mappings are page aligned */ + *start = ALIGN(*start, PAGE_SIZE); + + size = *end - *start; + + /* The size must fit full pages */ + page_count = size >> PAGE_SHIFT; + + if (!page_count) + return 0; + + page_start = *start; + pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL); + if (!pages) + return 0; + + for (i = 0; i < page_count; i++) { + phys_addr_t addr = page_start + i * PAGE_SIZE; + pages[i] = pfn_to_page(addr >> PAGE_SHIFT); + } + vaddr = vmap(pages, page_count, VM_MAP, PAGE_KERNEL); + kfree(pages); + + *end = *start + page_count * PAGE_SIZE; + + return (unsigned long)vaddr; +} + static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags, int order, unsigned long start, unsigned long end, @@ -2395,14 +2435,26 @@ 
static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags, if (!buffer->buffers) goto fail_free_cpumask; - /* If start/end are specified, then that overrides size */ + /* If start/end are specified, then this is a physical mapping */ if (start && end) { unsigned long buffers_start; + unsigned long addr; unsigned long ptr; + u64 size; int n; - /* Make sure that start is word aligned */ - start = ALIGN(start, sizeof(long)); + addr = map_pages(&start, &end); + if (!addr) + goto fail_free_cpumask; + + /* end and start have have been updated for alignment */ + size = end - start; + + buffer->phys_start = start; + buffer->flags |= RB_FL_PHYSICAL; + + start = addr; + end = start + size; /* scratch_size needs to be aligned too */ scratch_size = ALIGN(scratch_size, sizeof(long)); @@ -2479,6 +2531,9 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags, } kfree(buffer->buffers); + if (buffer->phys_start) + vunmap((void *)buffer->phys_start); + fail_free_cpumask: free_cpumask_var(buffer->cpumask); @@ -2508,11 +2563,11 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags, EXPORT_SYMBOL_GPL(__ring_buffer_alloc); /** - * __ring_buffer_alloc_range - allocate a new ring_buffer from existing memory + * __ring_buffer_alloc_physical - allocate a new ring_buffer from physical memory * @size: the size in bytes per cpu that is needed. * @flags: attributes to set for the ring buffer. * @order: sub-buffer order - * @start: start of allocated range + * @start: start of the physical memory range * @range_size: size of allocated range * @scratch_size: size of scratch area (for preallocated memory buffers) * @key: ring buffer reader_lock_key. @@ -2522,11 +2577,11 @@ EXPORT_SYMBOL_GPL(__ring_buffer_alloc); * when the buffer wraps. If this flag is not set, the buffer will * drop data when the tail hits the head. */ -struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flags, - int order, unsigned long start, - unsigned long range_size, - unsigned long scratch_size, - struct lock_class_key *key) +struct trace_buffer *__ring_buffer_alloc_physical(unsigned long size, unsigned flags, + int order, unsigned long start, + unsigned long range_size, + unsigned long scratch_size, + struct lock_class_key *key) { return alloc_buffer(size, flags, order, start, start + range_size, scratch_size, key); @@ -2569,6 +2624,9 @@ ring_buffer_free(struct trace_buffer *buffer) kfree(buffer->buffers); free_cpumask_var(buffer->cpumask); + if (buffer->flags & RB_FL_PHYSICAL) + vunmap((void *)buffer->phys_start); + kfree(buffer); } EXPORT_SYMBOL_GPL(ring_buffer_free); @@ -7138,6 +7196,14 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu, unsigned long flags, *subbuf_ids; int err = 0; + /* + * Currently, this does not support vmap()'d buffers. + * Return -ENODEV as that is what is returned when a file + * does not support memory mapping. 
+ */ + if (buffer->flags & RB_FL_PHYSICAL) + return -ENODEV; + if (!cpumask_test_cpu(cpu, buffer->cpumask)) return -EINVAL; diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 14c38fcd6f9e..20724d64e02e 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8492,10 +8492,6 @@ static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma) struct trace_iterator *iter = &info->iter; int ret = 0; - /* Currently the boot mapped buffer is not supported for mmap */ - if (iter->tr->flags & TRACE_ARRAY_FL_BOOT) - return -ENODEV; - ret = get_snapshot_map(iter->tr); if (ret) return ret; @@ -8503,8 +8499,8 @@ static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma) ret = ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file, vma); if (ret) put_snapshot_map(iter->tr); - - vma->vm_ops = &tracing_buffers_vmops; + else + vma->vm_ops = &tracing_buffers_vmops; return ret; } @@ -9520,10 +9516,17 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size if (tr->range_addr_start && tr->range_addr_size) { /* Add scratch buffer to handle 128 modules */ - buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0, - tr->range_addr_start, - tr->range_addr_size, - struct_size(tscratch, entries, 128)); + buf->buffer = ring_buffer_alloc_physical(size, rb_flags, 0, + tr->range_addr_start, + tr->range_addr_size, + struct_size(tscratch, entries, 128)); + if (!buf->buffer) { + pr_warn("Tracing: Failed to map boot instance %s\n", tr->name); + return -1; + } + + pr_info("Tracing: mapped boot instance %s at physical memory %lx of size 0x%lx\n", + tr->name, tr->range_addr_start, tr->range_addr_size); tscratch = ring_buffer_meta_scratch(buf->buffer, &scratch_size); setup_trace_scratch(tr, tscratch, scratch_size); @@ -9600,9 +9603,6 @@ static void free_trace_buffers(struct trace_array *tr) #ifdef CONFIG_TRACER_MAX_TRACE free_trace_buffer(&tr->max_buffer); #endif - - if (tr->range_addr_start) - vunmap((void *)tr->range_addr_start); } static void init_trace_flags_index(struct trace_array *tr) @@ -9795,31 +9795,6 @@ static int instance_mkdir(const char *name) return ret; } -static u64 map_pages(u64 start, u64 size) -{ - struct page **pages; - phys_addr_t page_start; - unsigned int page_count; - unsigned int i; - void *vaddr; - - page_count = DIV_ROUND_UP(size, PAGE_SIZE); - - page_start = start; - pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL); - if (!pages) - return 0; - - for (i = 0; i < page_count; i++) { - phys_addr_t addr = page_start + i * PAGE_SIZE; - pages[i] = pfn_to_page(addr >> PAGE_SHIFT); - } - vaddr = vmap(pages, page_count, VM_MAP, PAGE_KERNEL); - kfree(pages); - - return (u64)(unsigned long)vaddr; -} - /** * trace_array_get_by_name - Create/Lookup a trace array, given its name. * @name: The name of the trace array to be looked up/created. 
@@ -10708,7 +10683,6 @@ __init static void enable_instances(void) while ((curr_str = strsep(&str, "\t"))) { phys_addr_t start = 0; phys_addr_t size = 0; - unsigned long addr = 0; bool traceprintk = false; bool traceoff = false; char *flag_delim; @@ -10773,22 +10747,13 @@ __init static void enable_instances(void) rname = kstrdup(tok, GFP_KERNEL); } - if (start) { - addr = map_pages(start, size); - if (addr) { - pr_info("Tracing: mapped boot instance %s at physical memory %pa of size 0x%lx\n", - name, &start, (unsigned long)size); - } else { - pr_warn("Tracing: Failed to map boot instance %s\n", name); - continue; - } - } else { + if (!start) { /* Only non mapped buffers have snapshot buffers */ if (IS_ENABLED(CONFIG_TRACER_MAX_TRACE)) do_allocate_snapshot(name); } - tr = trace_array_create_systems(name, NULL, addr, size); + tr = trace_array_create_systems(name, NULL, (unsigned long)start, size); if (IS_ERR(tr)) { pr_warn("Tracing: Failed to create instance buffer %s\n", curr_str); continue;