Message ID: 20221107223921.3451913-1-song@kernel.org (mailing list archive)
Series: execmem_alloc for BPF programs
On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> This patchset tries to address the following issues:
>
> 1. Direct map fragmentation
>
> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> RO+X. These set_memory_* calls cause 1GB page table entries to be split
> into 2MB and 4kB ones. This fragmentation in the direct map results in
> bigger and slower page tables, and pressure for both instruction and data
> TLB.
>
> Our previous work in bpf_prog_pack tries to address this issue from the
> BPF program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack
> has greatly reduced direct map fragmentation from BPF programs.

You should be able to paste those results into the respective commit,
comparing non-bpf_prog_pack to the new generalized solution here.

> 2. iTLB pressure from BPF program
>
> Dynamic kernel text such as modules and BPF programs (even with the
> current bpf_prog_pack) use 4kB pages on x86. When the total size of
> modules and BPF programs is big, we can see a visible performance drop
> caused by a high iTLB miss rate.

This is arbitrary, please provide some real stats in the commit with
some reproducible benchmark.

> 3. TLB shootdown for short-lived BPF programs
>
> Before bpf_prog_pack, loading and unloading BPF programs required a
> global TLB shootdown. This patchset (and bpf_prog_pack) replaces it with
> a local TLB flush.
>
> 4. Reduce memory usage by BPF programs (in some cases)
>
> Most BPF programs and various trampolines are small, and they often
> occupy a whole page. From a random server in our fleet, 50% of the
> loaded BPF programs are less than 500 bytes in size, and 75% of them are
> less than 2kB in size. Allowing these BPF programs to share 2MB pages
> would yield some memory saving for systems with many BPF programs. For
> systems with only a small number of BPF programs, this patch may waste a
> little memory by allocating one 2MB page, but using only part of it.

Should be easy to provide some real numbers with at least selftests,
and put those in the commit as well.

> Based on our experiments [5], we measured 0.5% performance improvement
> from bpf_prog_pack. This patchset further boosts the improvement to 0.7%.
> The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> 1x 2MB page; bpf_prog_pack as-is doesn't resolve #2 above.
>
> This patchset replaces bpf_prog_pack with a better API and makes it
> available for other dynamic kernel text, such as modules, ftrace, kprobe.

And likewise here, please no arbitrary internal benchmark, real numbers.

  Luis
Hi Luis,

On Mon, Nov 7, 2022 at 2:55 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:

[...]

> > This patchset replaces bpf_prog_pack with a better API and makes it
> > available for other dynamic kernel text, such as modules, ftrace,
> > kprobe.
>
> And likewise here, please no arbitrary internal benchmark, real numbers.

The benchmark used here is identical to our web service workload, which
runs on many, many servers, so it represents a workload that we care a lot
about. Unfortunately, it is not possible to run it outside of our data
centers.

We can build some artificial workloads and probably get much higher
performance improvements. But these workloads may not represent real
world use cases.

Thanks,
Song
On Mon, Nov 07, 2022 at 03:13:59PM -0800, Song Liu wrote:
> The benchmark used here is identical to our web service workload, which
> runs on many, many servers, so it represents a workload that we care a
> lot about. Unfortunately, it is not possible to run it outside of our
> data centers.

I am not asking for that, I am asking for you to pick any similar
benchmark which can run in parallel and which may yield similar results.

> We can build some artificial workloads and probably get much higher
> performance improvements. But these workloads may not represent real
> world use cases.

You can very likely use some existing benchmark.

The direct map fragmentation stuff doesn't require much effort; as was
demonstrated by Aaron, you can easily do that or more by running all
selftests or just test_bpf. This I buy.

I'm not buying the iTLB gains, as I can't even reproduce them myself for
eBPF JIT, but I tested against iTLB when using eBPF JIT. Perhaps you
mean iTLB gains for other memory-intensive applications running in
tandem?

And none of your patches mentions the gains of this effort helping
with the long-term advantage of centralizing the semantics for
permissions on memory.

  Luis
On Mon, 2022-11-07 at 15:39 -0800, Luis Chamberlain wrote:
> On Mon, Nov 07, 2022 at 03:13:59PM -0800, Song Liu wrote:
> > The benchmark used here is identical to our web service workload,
> > which runs on many, many servers, so it represents a workload that
> > we care a lot about. Unfortunately, it is not possible to run it
> > outside of our data centers.
>
> I am not asking for that, I am asking for you to pick any similar
> benchmark which can run in parallel and which may yield similar results.
>
> > We can build some artificial workloads and probably get much higher
> > performance improvements. But these workloads may not represent real
> > world use cases.
>
> You can very likely use some existing benchmark.
>
> The direct map fragmentation stuff doesn't require much effort; as was
> demonstrated by Aaron, you can easily do that or more by running all
> selftests or just test_bpf. This I buy.
>
> I'm not buying the iTLB gains, as I can't even reproduce them myself
> for eBPF JIT, but I tested against iTLB when using eBPF JIT. Perhaps
> you mean iTLB gains for other memory-intensive applications running
> in tandem?

Song, didn't you find that there weren't iTLB gains (or that they were in
the noise)? What is this about a visible performance drop from iTLB
misses?

IIRC there was a test done where bpf_prog_pack mapped things at 4k, but
in 2MB chunks, so it would re-use pages like the 2MB-mapped version. And
it didn't see much improvement over the 2MB-mapped version. Did I
remember that wrong?

> And none of your patches mentions the gains of this effort helping
> with the long-term advantage of centralizing the semantics for
> permissions on memory.

Another good point. Although this brings up that this interface
"execmem" handles just one type of permission.
On Tue, Nov 08, 2022 at 12:13:09AM +0000, Edgecombe, Rick P wrote:
> On Mon, 2022-11-07 at 15:39 -0800, Luis Chamberlain wrote:
> > And none of your patches mentions the gains of this effort helping
> > with the long-term advantage of centralizing the semantics for
> > permissions on memory.
>
> Another good point. Although this brings up that this interface
> "execmem" handles just one type of permission.

Yes, but the long-term benefit of at least sharing this for eBPF +
kprobes + ftrace + modules under one execmem would be huge.

  Luis
Hi Song,

On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> This patchset tries to address the following issues:
>
> 1. Direct map fragmentation
>
> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> RO+X. These set_memory_* calls cause 1GB page table entries to be split
> into 2MB and 4kB ones. This fragmentation in the direct map results in
> bigger and slower page tables, and pressure for both instruction and data
> TLB.
>
> Our previous work in bpf_prog_pack tries to address this issue from the
> BPF program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack
> has greatly reduced direct map fragmentation from BPF programs.

Usage of set_memory_* APIs with memory allocated from the vmalloc/modules
virtual range does not change the direct map, but only updates the
permissions in the vmalloc range. The direct map splits occur in
vm_remove_mappings() when the memory is *freed*.

That said, both bpf_prog_pack and these patches do reduce the
fragmentation, but this happens because the memory is freed to the system
in 2M chunks and there are no splits of 2M pages. Besides, since the same
2M page is used for many BPF programs, there should be way fewer vfree()
calls.

> 2. iTLB pressure from BPF program
>
> Dynamic kernel text such as modules and BPF programs (even with the
> current bpf_prog_pack) use 4kB pages on x86. When the total size of
> modules and BPF programs is big, we can see a visible performance drop
> caused by a high iTLB miss rate.

Like Luis mentioned several times already, it would be nice to see numbers.

> 3. TLB shootdown for short-lived BPF programs
>
> Before bpf_prog_pack, loading and unloading BPF programs required a
> global TLB shootdown. This patchset (and bpf_prog_pack) replaces it with
> a local TLB flush.
>
> 4. Reduce memory usage by BPF programs (in some cases)
>
> Most BPF programs and various trampolines are small, and they often
> occupy a whole page. From a random server in our fleet, 50% of the
> loaded BPF programs are less than 500 bytes in size, and 75% of them are
> less than 2kB in size. Allowing these BPF programs to share 2MB pages
> would yield some memory saving for systems with many BPF programs. For
> systems with only a small number of BPF programs, this patch may waste a
> little memory by allocating one 2MB page, but using only part of it.

I'm not convinced there are memory savings here. Unless you have hundreds
of BPF programs, most of the 2M page will be wasted, won't it? So for
systems that have moderate use of BPF, most of the 2M page will be unused,
right?

> Based on our experiments [5], we measured 0.5% performance improvement
> from bpf_prog_pack. This patchset further boosts the improvement to 0.7%.
> The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> 1x 2MB page; bpf_prog_pack as-is doesn't resolve #2 above.
>
> This patchset replaces bpf_prog_pack with a better API and makes it
> available for other dynamic kernel text, such as modules, ftrace, kprobe.

The proposed execmem_alloc() looks to me very much tailored for x86, to be
used as a replacement for module_alloc(). Some architectures have a
module_alloc() that is quite different from the default or x86 version, so
I'd expect at least some explanation of how modules etc. can use execmem_
APIs without breaking !x86 architectures.

> This set enables bpf programs and bpf dispatchers to share huge pages
> with the new API:
>   execmem_alloc()
>   execmem_free()
>   execmem_fill()
>
> The idea is similar to Peter's suggestion in [1].
>
> execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates this
> memory to its users. execmem_free() is used to free memory allocated by
> execmem_alloc(). execmem_fill() is used to update memory allocated by
> execmem_alloc().
>
> Memory allocated by execmem_alloc() is RO+X, so this does not violate
> W^X. The caller has to update the content with a text_poke-like
> mechanism. Specifically, execmem_fill() is provided to update memory
> allocated by execmem_alloc(). execmem_fill() also makes sure the update
> stays within the boundary of one chunk allocated by execmem_alloc().
> Please refer to patch 1/5 for more details.

Unless I'm mistaken, a failure to allocate a PMD_SIZE page will fail the
text allocation altogether. That means that if somebody tries to load a
BPF program on a busy, long-lived system, they are quite likely to fail
because the high-order free lists might already be exhausted, although
there is still plenty of free memory. Did you consider a fallback to
small pages if the high-order allocation fails?
On 07/11/2022 at 23:39, Song Liu wrote:
> This patchset tries to address the following issues:
>
[...]
>
> execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates
> this memory to its users.
>
[...]
>
> Patch 4/5 and 5/5 allow static kernel text (_stext to _etext) to share
> PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
> allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE), and then
> using _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.

Would it be possible to have something more generic than being stuck with
PMD_SIZE? On powerpc 8xx, PMD_SIZE is 4MB and hugepages are 512kB and 8MB.

Christophe
(added linux-arch list)

On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> This patchset tries to address the following issues:
>
> 1. Direct map fragmentation
>
[...]
>
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
> [2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/
> [3] v1: https://lore.kernel.org/bpf/20221031222541.1773452-1-song@kernel.org/
> [4] https://lore.kernel.org/bpf/Y2ioTodn+mBXdIqp@ziqianlu-desk2/
> [5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/
>
> Changes PATCH v1 => v2:
> 1. Rename the APIs as execmem_* (Christoph Hellwig)
> 2. Add more information about the motivation of this work (and follow-up
>    work for kernel modules, various trampolines, etc.)
>    (Luis Chamberlain, Rick Edgecombe, Mike Rapoport, Aaron Lu)
> 3. Include experimental results from previous bpf_prog_pack and the
>    community. (Aaron Lu, Luis Chamberlain, Rick Edgecombe)
>
> Changes RFC v2 => PATCH v1:
> 1. Add vcopy_exec(), which updates memory allocated by vmalloc_exec(). It
>    also ensures vcopy_exec() is only used to update memory from one
>    single vmalloc_exec() call. (Christoph Hellwig)
> 2. Add arch_vcopy_exec() and arch_invalidate_exec() as wrappers for the
>    text_poke() like logic.
> 3. Drop changes for kernel modules and focus on BPF side changes.
>
> Changes RFC v1 => RFC v2:
> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
>    work fine with BPF programs (patches 1, 2, 4). But the module side
>    (patch 3) still needs some work.
>
> Song Liu (5):
>   vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
>   x86/alternative: support execmem_alloc() and execmem_free()
>   bpf: use execmem_alloc for bpf program and bpf dispatcher
>   vmalloc: introduce register_text_tail_vm()
>   x86: use register_text_tail_vm
>
>  arch/x86/include/asm/pgtable_64_types.h |   1 +
>  arch/x86/kernel/alternative.c           |  12 +
>  arch/x86/mm/init_64.c                   |   4 +-
>  arch/x86/net/bpf_jit_comp.c             |  23 +-
>  include/linux/bpf.h                     |   3 -
>  include/linux/filter.h                  |   5 -
>  include/linux/vmalloc.h                 |   9 +
>  kernel/bpf/core.c                       | 180 +----------
>  kernel/bpf/dispatcher.c                 |  11 +-
>  mm/nommu.c                              |  12 +
>  mm/vmalloc.c                            | 354 ++++++++++++++++++++++++
>  11 files changed, 412 insertions(+), 202 deletions(-)
>
> --
> 2.30.2
Hi Mike,

On Tue, Nov 08, 2022 at 01:27:31PM +0200, Mike Rapoport wrote:
> Hi Song,
>
> On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> > This patchset tries to address the following issues:
> >
> > 1. Direct map fragmentation
> >
> > On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be
> > also RO+X. These set_memory_* calls cause 1GB page table entries to be
> > split into 2MB and 4kB ones. This fragmentation in the direct map
> > results in bigger and slower page tables, and pressure for both
> > instruction and data TLB.
> >
> > Our previous work in bpf_prog_pack tries to address this issue from
> > the BPF program side. Based on the experiments by Aaron Lu [4],
> > bpf_prog_pack has greatly reduced direct map fragmentation from BPF
> > programs.
>
> Usage of set_memory_* APIs with memory allocated from the vmalloc/modules
> virtual range does not change the direct map, but only updates the
> permissions in the vmalloc range. The direct map splits occur in
> vm_remove_mappings() when the memory is *freed*.

set_memory_nx/x() on a vmalloced range will not affect the direct map, but
set_memory_ro/rw() will. set_memory_ro/rw() cares about other alias
mappings and will do the same permission change for those aliases, e.g.
the direct mapping. For this reason, a bpf prog load can trigger a direct
map split. A sample call stack on an x86_64 VM looks like this:

[   40.602450] address=0xffffffffc01e2000 numpages=1 set=0x0 clr=0x2 alias=1
[   40.614566] address=0xffff88816ee1e000 numpages=1 set=0x0 clr=0x2 alias=0
[   40.627641] split: address=0xffff88816ee1e000, level=2
[   40.627981] CPU: 15 PID: 534 Comm: sockex1 Not tainted 5.17.0-dirty #28
[   40.628421] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[   40.628996] Call Trace:
[   40.629161]  <TASK>
[   40.629304]  dump_stack_lvl+0x45/0x59
[   40.629550]  __change_page_attr_set_clr+0x718/0x8d4
[   40.629872]  ? static_protections+0x1c8/0x1fd
[   40.630160]  ? dump_stack_lvl+0x54/0x59
[   40.630418]  __change_page_attr_set_clr+0x7ff/0x8d4
[   40.630739]  ? _raw_spin_unlock+0x14/0x30
[   40.631004]  ? __purge_vmap_area_lazy+0x323/0x720
[   40.631316]  ? _raw_spin_unlock_irqrestore+0x18/0x40
[   40.631646]  change_page_attr_set_clr.cold+0x2f/0x164
[   40.631979]  set_memory_ro+0x26/0x30
[   40.632215]  bpf_int_jit_compile+0x4a1/0x4e0
[   40.632502]  bpf_prog_select_runtime+0xad/0xf0
[   40.632794]  bpf_prog_load+0x6a1/0xa20
[   40.633044]  ? _raw_spin_trylock_bh+0x1/0x50
[   40.633328]  ? _raw_spin_unlock_irqrestore+0x23/0x40
[   40.633656]  ? free_debug_processing+0x1f8/0x2c0
[   40.633964]  ? __slab_free+0x2f0/0x4f0
[   40.634214]  ? trace_hardirqs_on+0x2b/0xf0
[   40.634492]  __sys_bpf+0xb20/0x2750
[   40.634726]  ? __might_fault+0x1e/0x20
[   40.634978]  __x64_sys_bpf+0x1c/0x20
[   40.635216]  do_syscall_64+0x38/0x90
[   40.635457]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   40.635792] RIP: 0033:0x7fd4f2cacfbd
[   40.636030] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 33 ce 0e 00 f7 d8 64 89 01 48
[   40.637253] RSP: 002b:00007ffddf20b2d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
[   40.637752] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fd4f2cacfbd
[   40.638220] RDX: 0000000000000080 RSI: 00007ffddf20b360 RDI: 0000000000000005
[   40.638689] RBP: 0000000000436c48 R08: 0000000000000000 R09: 0000000000000000
[   40.639156] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000080
[   40.639627] R13: 00007ffddf20b360 R14: 0000000000000000 R15: 0000000000000000
[   40.640096]  </TASK>

Regards,
Aaron
On Tue, 2022-11-08 at 13:27 +0200, Mike Rapoport wrote:
> > Based on our experiments [5], we measured 0.5% performance improvement
> > from bpf_prog_pack. This patchset further boosts the improvement to
> > 0.7%. The difference is because bpf_prog_pack uses 512x 4kB pages
> > instead of 1x 2MB page; bpf_prog_pack as-is doesn't resolve #2 above.
> >
> > This patchset replaces bpf_prog_pack with a better API and makes it
> > available for other dynamic kernel text, such as modules, ftrace,
> > kprobe.
>
> The proposed execmem_alloc() looks to me very much tailored for x86, to
> be used as a replacement for module_alloc(). Some architectures have a
> module_alloc() that is quite different from the default or x86 version,
> so I'd expect at least some explanation of how modules etc. can use
> execmem_ APIs without breaking !x86 architectures.

I think this is fair, but I think we should ask ourselves: how much
should we do in one step?

For non-text_poke() architectures, the way you can make it work is to
have the API look like:

  execmem_alloc()  <- Does the allocation, but not necessarily usable yet
  execmem_write()  <- Loads the mapping, doesn't work after finish()
  execmem_finish() <- Makes the mapping live (loaded, executable, ready)

So for text_poke():

  execmem_alloc()  <- reserves the mapping
  execmem_write()  <- text_poke()s to the mapping
  execmem_finish() <- does nothing

And for non-text_poke():

  execmem_alloc()  <- Allocates a regular RW vmalloc allocation
  execmem_write()  <- Writes normally to it
  execmem_finish() <- does set_memory_ro()/set_memory_x() on it

Non-text_poke() only gets the benefits of centralized logic, but the
interface works for both. This is pretty much what the perm_alloc() RFC
did to make it work with other archs and modules. But to fit with the
existing modules code (which is actually spread all over) and also handle
RO sections, it also needed some additional bells and whistles.

So the question I'm trying to ask is: how much should we target for the
next step? I first thought that this functionality was so intertwined it
would be too hard to do iteratively. So if we want to try iteratively,
I'm ok if it doesn't solve everything.
On Mon, Nov 7, 2022 at 3:39 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
[...]
> I'm not buying the iTLB gains, as I can't even reproduce them myself for
> eBPF JIT, but I tested against iTLB when using eBPF JIT. Perhaps you
> mean iTLB gains for other memory-intensive applications running in
> tandem?

Right. In most cases, BPF programs are not the main workload. The main
workloads are user space services with big .text sections. We measure
performance in terms of main service throughput under some latency
requirements. For benchmarks with small .text sections, the benefit from
iTLB gains is not significant.

> And none of your patches mentions the gains of this effort helping
> with the long-term advantage of centralizing the semantics for
> permissions on memory.

Right. I will add that.

Thanks,
Song
On Mon, Nov 7, 2022 at 4:13 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
[...]
> Song, didn't you find that there weren't iTLB gains (or that they were
> in the noise)? What is this about a visible performance drop from iTLB
> misses?
>
> IIRC there was a test done where bpf_prog_pack mapped things at 4k, but
> in 2MB chunks, so it would re-use pages like the 2MB-mapped version. And
> it didn't see much improvement over the 2MB-mapped version. Did I
> remember that wrong?

There is still a small gain (~0.2%) for this benchmark.

Thanks,
Song
On Tue, Nov 8, 2022 at 3:27 AM Mike Rapoport <rppt@kernel.org> wrote: > > Hi Song, > > On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote: > > This patchset tries to address the following issues: > > > > 1. Direct map fragmentation > > > > On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also > > RO+X. These set_memory_* calls cause 1GB page table entries to be split > > into 2MB and 4kB ones. This fragmentation in direct map results in bigger > > and slower page table, and pressure for both instruction and data TLB. > > > > Our previous work in bpf_prog_pack tries to address this issue from BPF > > program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has > > greatly reduced direct map fragmentation from BPF programs. > > Usage of set_memory_* APIs with memory allocated from vmalloc/modules > virtual range does not change the direct map, but only updates the > permissions in vmalloc range. The direct map splits occur in > vm_remove_mappings() when the memory is *freed*. > > That said, both bpf_prog_pack and these patches do reduce the > fragmentation, but this happens because the memory is freed to the system > in 2M chunks and there are no splits of 2M pages. Besides, since the same > 2M page used for many BPF programs there should be way less vfree() calls. > > > 2. iTLB pressure from BPF program > > > > Dynamic kernel text such as modules and BPF programs (even with current > > bpf_prog_pack) use 4kB pages on x86, when the total size of modules and > > BPF program is big, we can see visible performance drop caused by high > > iTLB miss rate. > > Like Luis mentioned several times already, it would be nice to see numbers. > > > 3. TLB shootdown for short-living BPF programs > > > > Before bpf_prog_pack loading and unloading BPF programs requires global > > TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local > > TLB flush. > > > > 4. 
Reduce memory usage by BPF programs (in some cases) > > > > Most BPF programs and various trampolines are small, and they often > > occupy a whole page. From a random server in our fleet, 50% of the > > loaded BPF programs are less than 500 bytes in size, and 75% of them are > > less than 2kB in size. Allowing these BPF programs to share 2MB pages > > would yield some memory saving for systems with many BPF programs. For > > systems with only a small number of BPF programs, this patch may waste a > > little memory by allocating one 2MB page, but using only part of it. > > I'm not convinced there are memory savings here. Unless you have hundreds > of BPF programs, most of the 2M page will be wasted, won't it? > So for systems that have moderate use of BPF most of the 2M page will be > unused, right? There will be some memory waste in such cases. But it will get better with: 1) With 4/5 and 5/5, BPF programs will share this 2MB page with kernel .text section (_stext to _etext); 2) modules, ftrace, kprobe will also share this 2MB page; 3) There are bigger BPF programs in many use cases. 
> > > This set enables bpf programs and bpf dispatchers to share huge pages with > > new API: > > execmem_alloc() > > execmem_free() > > execmem_fill() > > > > The idea is similar to Peter's suggestion in [1]. > > > > execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates this > > memory to its users. execmem_free() is used to free memory allocated by > > execmem_alloc(). execmem_fill() is used to update memory allocated by > > execmem_alloc(). > > > > Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X. > > The caller has to update the content with text_poke like mechanism. > > Specifically, execmem_fill() is provided to update memory allocated by > > execmem_alloc(). execmem_fill() also makes sure the update stays in the > > boundary of one chunk allocated by execmem_alloc(). Please refer to patch > > 1/5 for more details of these APIs. > > Unless I'm mistaken, a failure to allocate PMD_SIZE page will fail text > allocation altogether. That means that if somebody tries to load a BPF > program on a busy long lived system, they are quite likely to fail because > high order free lists might be already exhausted although there is still > plenty of free memory. > > Did you consider a fallback for small pages if the high order allocation > fails? I think __vmalloc_node_range() already has the fallback mechanism (at the end of the function). Thanks, Song
On Tue, Nov 8, 2022 at 3:44 AM Christophe Leroy <christophe.leroy@csgroup.eu> wrote: > [...] > > > > execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates this > > memory to its users. execmem_free() is used to free memory allocated by > > execmem_alloc(). execmem_fill() is used to update memory allocated by > > execmem_alloc(). > > > > Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X. > > The caller has to update the content with text_poke like mechanism. > > Specifically, execmem_fill() is provided to update memory allocated by > > execmem_alloc(). execmem_fill() also makes sure the update stays in the > > boundary of one chunk allocated by execmem_alloc(). Please refer to patch > > 1/5 for more details of these APIs. > > > > Patch 3/5 uses these new APIs in bpf program and bpf dispatcher. > > > > Patch 4/5 and 5/5 allow static kernel text (_stext to _etext) to share > > PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by > > allocating PMD_SIZE pages to roundup(_etext, PMD_SIZE), and then use > > _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text. > > Would it be possible to have something more generic than being stuck to > PMD_SIZE ? > > On powerpc 8xx, PMD_SIZE is 4MB and hugepages are 512kB and 8MB. Currently, __vmalloc_node_range() tries to allocate huge pages when size_per_node >= PMD_SIZE. How do we handle this in powerpc 8xx? I guess we can use the same logic here? Thanks, Song
On Tue, Nov 8, 2022 at 8:51 AM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Tue, 2022-11-08 at 13:27 +0200, Mike Rapoport wrote: > > > Based on our experiments [5], we measured 0.5% performance > > > improvement > > > from bpf_prog_pack. This patchset further boosts the improvement to > > > 0.7%. > > > The difference is because bpf_prog_pack uses 512x 4kB pages instead > > > of > > > 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above. > > > > > > This patchset replaces bpf_prog_pack with a better API and makes it > > > available for other dynamic kernel text, such as modules, ftrace, > > > kprobe. > > > > > > The proposed execmem_alloc() looks to me very much tailored for x86 > > to be > > used as a replacement for module_alloc(). Some architectures have > > module_alloc() that is quite different from the default or x86 > > version, so > > I'd expect at least some explanation how modules etc can use execmem_ > > APIs > > without breaking !x86 architectures. > > I think this is fair, but I think we should ask ourselves - how > much should we do in one step? > > For non-text_poke() architectures, the way you can make it work is have > the API look like: > execmem_alloc() <- Does the allocation, but not necessarily usable yet > execmem_write() <- Loads the mapping, doesn't work after finish() > execmem_finish() <- Makes the mapping live (loaded, executable, ready) > > So for text_poke(): > execmem_alloc() <- reserves the mapping > execmem_write() <- text_pokes() to the mapping > execmem_finish() <- does nothing > > And non-text_poke(): > execmem_alloc() <- Allocates a regular RW vmalloc allocation > execmem_write() <- Writes normally to it > execmem_finish() <- does set_memory_ro()/set_memory_x() on it Yeah, some fallback mechanism like this is missing in the current version. It is not a problem for BPF programs, as we call it from arch code. But we do need better APIs for modules. 
Thanks, Song > > Non-text_poke() only gets the benefits of centralized logic, but the > interface works for both. This is pretty much what the perm_alloc() RFC > did to make it work with other arch's and modules. But to fit with the > existing modules code (which is actually spread all over) and also > handle RO sections, it also needed some additional bells and whistles. > > So the question I'm trying to ask is, how much should we target for the > next step? I first thought that this functionality was so intertwined, > it would be too hard to do iteratively. So if we want to try > iteratively, I'm ok if it doesn't solve everything. > >
Le 08/11/2022 à 19:47, Song Liu a écrit : > On Tue, Nov 8, 2022 at 3:44 AM Christophe Leroy > <christophe.leroy@csgroup.eu> wrote: >> > [...] >>> >>> execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates this >>> memory to its users. execmem_free() is used to free memory allocated by >>> execmem_alloc(). execmem_fill() is used to update memory allocated by >>> execmem_alloc(). >>> >>> Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X. >>> The caller has to update the content with text_poke like mechanism. >>> Specifically, execmem_fill() is provided to update memory allocated by >>> execmem_alloc(). execmem_fill() also makes sure the update stays in the >>> boundary of one chunk allocated by execmem_alloc(). Please refer to patch >>> 1/5 for more details of these APIs. >>> >>> Patch 3/5 uses these new APIs in bpf program and bpf dispatcher. >>> >>> Patch 4/5 and 5/5 allow static kernel text (_stext to _etext) to share >>> PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by >>> allocating PMD_SIZE pages to roundup(_etext, PMD_SIZE), and then use >>> _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text. >> >> Would it be possible to have something more generic than being stuck to >> PMD_SIZE ? >> >> On powerpc 8xx, PMD_SIZE is 4MB and hugepages are 512kB and 8MB. > > Currently, __vmalloc_node_range() tries to allocate huge pages when > size_per_node >= PMD_SIZE. Ah right, that reminds me that for powerpc 8xx, 8MB vmalloc mapping is not implemented yet and arch_vmap_pmd_supported() returns false. However, 512kB mapping is implemented, through arch_vmap_pte_supported_shift(). In __vmalloc_node_range() that's the part below:

	if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
		shift = PMD_SHIFT;
	else
		shift = arch_vmap_pte_supported_shift(size_per_node);

> > How do we handle this in powerpc 8xx? I guess we can use the same logic > here?
arch/powerpc/include/asm/nohash/32/mmu-8xx.h has:

static inline int arch_vmap_pte_supported_shift(unsigned long size)
{
	if (size >= SZ_512K)
		return 19;
	else if (size >= SZ_16K)
		return 14;
	else
		return PAGE_SHIFT;
}

Christophe
Le 08/11/2022 à 19:41, Song Liu a écrit : > On Tue, Nov 8, 2022 at 3:27 AM Mike Rapoport <rppt@kernel.org> wrote: >> >> Hi Song, >> >> On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote: >>> This patchset tries to address the following issues: >>> >>> 1. Direct map fragmentation >>> >>> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also >>> RO+X. These set_memory_* calls cause 1GB page table entries to be split >>> into 2MB and 4kB ones. This fragmentation in direct map results in bigger >>> and slower page table, and pressure for both instruction and data TLB. >>> >>> Our previous work in bpf_prog_pack tries to address this issue from BPF >>> program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has >>> greatly reduced direct map fragmentation from BPF programs. >> >> Usage of set_memory_* APIs with memory allocated from vmalloc/modules >> virtual range does not change the direct map, but only updates the >> permissions in vmalloc range. The direct map splits occur in >> vm_remove_mappings() when the memory is *freed*. >> >> That said, both bpf_prog_pack and these patches do reduce the >> fragmentation, but this happens because the memory is freed to the system >> in 2M chunks and there are no splits of 2M pages. Besides, since the same >> 2M page used for many BPF programs there should be way less vfree() calls. >> >>> 2. iTLB pressure from BPF program >>> >>> Dynamic kernel text such as modules and BPF programs (even with current >>> bpf_prog_pack) use 4kB pages on x86, when the total size of modules and >>> BPF program is big, we can see visible performance drop caused by high >>> iTLB miss rate. >> >> Like Luis mentioned several times already, it would be nice to see numbers. >> >>> 3. TLB shootdown for short-living BPF programs >>> >>> Before bpf_prog_pack loading and unloading BPF programs requires global >>> TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local >>> TLB flush. >>> >>> 4. 
Reduce memory usage by BPF programs (in some cases) >>> >>> Most BPF programs and various trampolines are small, and they often >>> occupy a whole page. From a random server in our fleet, 50% of the >>> loaded BPF programs are less than 500 bytes in size, and 75% of them are >>> less than 2kB in size. Allowing these BPF programs to share 2MB pages >>> would yield some memory saving for systems with many BPF programs. For >>> systems with only a small number of BPF programs, this patch may waste a >>> little memory by allocating one 2MB page, but using only part of it. >> >> I'm not convinced there are memory savings here. Unless you have hundreds >> of BPF programs, most of the 2M page will be wasted, won't it? >> So for systems that have moderate use of BPF most of the 2M page will be >> unused, right? > > There will be some memory waste in such cases. But it will get better with: > 1) With 4/5 and 5/5, BPF programs will share this 2MB page with kernel .text > section (_stext to _etext); > 2) modules, ftrace, kprobe will also share this 2MB page; > 3) There are bigger BPF programs in many use cases. And what I love with this series (for powerpc/32) is that we will likely now be able to have bpf, ftrace, kprobe without the performance cost of CONFIG_MODULES. Today, CONFIG_MODULES means page mapping, which means handling of kernel pages in ITLB miss handlers. By using some of the space between end of rodata and start of inittext, we are able to use ROX linear memory which is mapped by blocks. It means there is no need to handle kernel text in the ITLB handler (You can look at https://elixir.bootlin.com/linux/v6.1-rc3/source/arch/powerpc/kernel/head_8xx.S#L191 to better understand what I'm talking about). Thanks Christophe
On Tue, Nov 8, 2022 at 11:43 AM Christophe Leroy <christophe.leroy@csgroup.eu> wrote: > > > > Le 08/11/2022 à 19:41, Song Liu a écrit : > > On Tue, Nov 8, 2022 at 3:27 AM Mike Rapoport <rppt@kernel.org> wrote: > >> > >> Hi Song, > >> > >> On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote: > >>> This patchset tries to address the following issues: > >>> > >>> 1. Direct map fragmentation > >>> > >>> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also > >>> RO+X. These set_memory_* calls cause 1GB page table entries to be split > >>> into 2MB and 4kB ones. This fragmentation in direct map results in bigger > >>> and slower page table, and pressure for both instruction and data TLB. > >>> > >>> Our previous work in bpf_prog_pack tries to address this issue from BPF > >>> program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has > >>> greatly reduced direct map fragmentation from BPF programs. > >> > >> Usage of set_memory_* APIs with memory allocated from vmalloc/modules > >> virtual range does not change the direct map, but only updates the > >> permissions in vmalloc range. The direct map splits occur in > >> vm_remove_mappings() when the memory is *freed*. > >> > >> That said, both bpf_prog_pack and these patches do reduce the > >> fragmentation, but this happens because the memory is freed to the system > >> in 2M chunks and there are no splits of 2M pages. Besides, since the same > >> 2M page used for many BPF programs there should be way less vfree() calls. > >> > >>> 2. iTLB pressure from BPF program > >>> > >>> Dynamic kernel text such as modules and BPF programs (even with current > >>> bpf_prog_pack) use 4kB pages on x86, when the total size of modules and > >>> BPF program is big, we can see visible performance drop caused by high > >>> iTLB miss rate. > >> > >> Like Luis mentioned several times already, it would be nice to see numbers. > >> > >>> 3. 
TLB shootdown for short-living BPF programs > >>> > >>> Before bpf_prog_pack loading and unloading BPF programs requires global > >>> TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local > >>> TLB flush. > >>> > >>> 4. Reduce memory usage by BPF programs (in some cases) > >>> > >>> Most BPF programs and various trampolines are small, and they often > >>> occupy a whole page. From a random server in our fleet, 50% of the > >>> loaded BPF programs are less than 500 bytes in size, and 75% of them are > >>> less than 2kB in size. Allowing these BPF programs to share 2MB pages > >>> would yield some memory saving for systems with many BPF programs. For > >>> systems with only a small number of BPF programs, this patch may waste a > >>> little memory by allocating one 2MB page, but using only part of it. > >> > >> I'm not convinced there are memory savings here. Unless you have hundreds > >> of BPF programs, most of the 2M page will be wasted, won't it? > >> So for systems that have moderate use of BPF most of the 2M page will be > >> unused, right? > > > > There will be some memory waste in such cases. But it will get better with: > > 1) With 4/5 and 5/5, BPF programs will share this 2MB page with kernel .text > > section (_stext to _etext); > > 2) modules, ftrace, kprobe will also share this 2MB page; > > 3) There are bigger BPF programs in many use cases. > > And what I love with this series (for powerpc/32) is that we will likely > now be able to have bpf, ftrace, kprobe without the performance cost of > CONFIG_MODULES. Yeah, I remember reading emails about using tracing tools without CONFIG_MODULES. We still need more work (beyond this set) to make it happen for powerpc/32. For example, current powerpc bpf_jit doesn't support jitting into ROX memory. Song > > Today, CONFIG_MODULES means page mapping, which means handling of kernel > pages in ITLB miss handlers. 
> > By using some of the space between end of rodata and start of inittext, > we are able to use ROX linear memory which is mapped by blocks. It means > there is no need to handle kernel text in ITLB handler (You can look at > https://elixir.bootlin.com/linux/v6.1-rc3/source/arch/powerpc/kernel/head_8xx.S#L191 > to better understand what I'm talking about). > > Thanks > Christophe
On Tue, Nov 08, 2022 at 08:38:32PM +0800, Aaron Lu wrote: > set_memory_nx/x() on a vmalloced range will not affect direct map but > set_memory_ro/rw() will. Which seems a little odd. Is there any good reason to not also propagate the NX bit?
On Wed, Nov 09, 2022 at 07:55:12AM +0100, Christoph Hellwig wrote: > On Tue, Nov 08, 2022 at 08:38:32PM +0800, Aaron Lu wrote: > > set_memory_nx/x() on a vmalloced range will not affect direct map but > > set_memory_ro/rw() will. > > Which seems a little odd. Is there any good reason to not also propagate > the NX bit? Less executable ranges more better. That is, the direct map is *always* NX.
On Tue, Nov 08, 2022 at 04:51:12PM +0000, Edgecombe, Rick P wrote: > On Tue, 2022-11-08 at 13:27 +0200, Mike Rapoport wrote: > > > Based on our experiments [5], we measured 0.5% performance > > > improvement > > > from bpf_prog_pack. This patchset further boosts the improvement to > > > 0.7%. > > > The difference is because bpf_prog_pack uses 512x 4kB pages instead > > > of > > > 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above. > > > > > > This patchset replaces bpf_prog_pack with a better API and makes it > > > available for other dynamic kernel text, such as modules, ftrace, > > > kprobe. > > > > > > The proposed execmem_alloc() looks to me very much tailored for x86 > > to be > > used as a replacement for module_alloc(). Some architectures have > > module_alloc() that is quite different from the default or x86 > > version, so > > I'd expect at least some explanation how modules etc can use execmem_ > > APIs > > without breaking !x86 architectures. > > I think this is fair, but I think we should ask ourselves - how > much should we do in one step? I think that at least we need evidence that execmem_alloc() etc can actually be used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work for him at all, so having a core MM API for code allocation that only works with BPF on x86 seems not right to me. 
> For non-text_poke() architectures, the way you can make it work is have > the API look like: > execmem_alloc() <- Does the allocation, but not necessarily usable yet > execmem_write() <- Loads the mapping, doesn't work after finish() > execmem_finish() <- Makes the mapping live (loaded, executable, ready) > > So for text_poke(): > execmem_alloc() <- reserves the mapping > execmem_write() <- text_pokes() to the mapping > execmem_finish() <- does nothing > > And non-text_poke(): > execmem_alloc() <- Allocates a regular RW vmalloc allocation > execmem_write() <- Writes normally to it > execmem_finish() <- does set_memory_ro()/set_memory_x() on it > > Non-text_poke() only gets the benefits of centralized logic, but the > interface works for both. This is pretty much what the perm_alloc() RFC > did to make it work with other arch's and modules. But to fit with the > existing modules code (which is actually spread all over) and also > handle RO sections, it also needed some additional bells and whistles. I'm less concerned about the non-text_poke() part, but rather about restrictions where code and data can live on different architectures and whether these restrictions won't lead to inability to use the centralized logic on, say, arm64 and powerpc. For instance, if we use execmem_alloc() for modules, it means that data sections should be allocated separately with plain vmalloc(). Will this work universally? Or will this require special care with additional complexity in the modules code? 
And how can the caching of large pages in vmalloc be made useful for use cases like secretmem and PKS?
On Wed, 2022-11-09 at 13:17 +0200, Mike Rapoport wrote: > On Tue, Nov 08, 2022 at 04:51:12PM +0000, Edgecombe, Rick P wrote: > > On Tue, 2022-11-08 at 13:27 +0200, Mike Rapoport wrote: > > > > Based on our experiments [5], we measured 0.5% performance > > > > improvement > > > > from bpf_prog_pack. This patchset further boosts the > > > > improvement to > > > > 0.7%. > > > > The difference is because bpf_prog_pack uses 512x 4kB pages > > > > instead > > > > of > > > > 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above. > > > > > > > > This patchset replaces bpf_prog_pack with a better API and > > > > makes it > > > > available for other dynamic kernel text, such as modules, > > > > ftrace, > > > > kprobe. > > > > > > > > > The proposed execmem_alloc() looks to me very much tailored for > > > x86 > > > to be > > > used as a replacement for module_alloc(). Some architectures have > > > module_alloc() that is quite different from the default or x86 > > > version, so > > > I'd expect at least some explanation how modules etc can use > > > execmem_ > > > APIs > > > without breaking !x86 architectures. > > > > I think this is fair, but I think we should ask ourselves - how > > much should we do in one step? > > I think that at least we need evidence that execmem_alloc() etc > can be > actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't > work > for him at all, so having a core MM API for code allocation that only > works > with BPF on x86 seems not right to me. Those module changes wouldn't work on non-x86 either. Most of the modules code is cross-arch, so this kind of has to work for non-text_poke() or modules need to be refactored. 
> > > For non-text_poke() architectures, the way you can make it work is > > have > > the API look like: > > execmem_alloc() <- Does the allocation, but not necessarily usable yet > > execmem_write() <- Loads the mapping, doesn't work after finish() > > execmem_finish() <- Makes the mapping live (loaded, executable, > > ready) > > > > So for text_poke(): > > execmem_alloc() <- reserves the mapping > > execmem_write() <- text_pokes() to the mapping > > execmem_finish() <- does nothing > > > > And non-text_poke(): > > execmem_alloc() <- Allocates a regular RW vmalloc allocation > > execmem_write() <- Writes normally to it > > execmem_finish() <- does set_memory_ro()/set_memory_x() on it > > > > Non-text_poke() only gets the benefits of centralized logic, but > > the > > interface works for both. This is pretty much what the perm_alloc() > > RFC > > did to make it work with other arch's and modules. But to fit with > > the > > existing modules code (which is actually spread all over) and also > > handle RO sections, it also needed some additional bells and > > whistles. > > I'm less concerned about non-text_poke() part, but rather about > restrictions where code and data can live on different architectures > and > whether these restrictions won't lead to inability to use the > centralized > logic on, say, arm64 and powerpc. > > For instance, if we use execmem_alloc() for modules, it means that > data > sections should be allocated separately with plain vmalloc(). Will > this > work universally? Or this will require special care with additional > complexity in the modules code? Good point. If the module data was still in the modules range, I would hope it would still work, but there are a lot of architectures to check. Some might care if the data is really close to the text. I'm not sure. The perm_alloc() stuff did some hacks to force the allocations close to each other out of paranoia about this. 
Basically started with one allocation, but then tracked the pieces separately so arch's could separate them if they wanted. But I wondered if it was really needed. > > > So the question I'm trying to ask is, how much should we target for > > the > > next step? I first thought that this functionality was so > > intertwined, > > it would be too hard to do iteratively. So if we want to try > > iteratively, I'm ok if it doesn't solve everything. > > > With execmem_alloc() as the first step I'm failing to see the large > picture. If we want to use it for modules, how will we allocate RO > data? Similar to the perm_alloc() hacks? > with similar rodata_alloc() that uses yet another tree in vmalloc? It would have to group them together at least. Not sure if it needs a separate tree or not. I would think permission flags would be better than a new function for each memory type. > How the caching of large pages in vmalloc can be made useful for use > cases > like secretmem and PKS? This part is easy I think. If we had an unmapped page allocator it could just feed this. Do you have any idea when you might pick up that stuff again? To answer my own question, I think a good first step would be to make the interface also work for non-text_poke() so it could really be cross arch, then use it for everything except modules. The benefit to the other arch's at that point is centralized handling of loading text.
On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote: > [...] > > > > > > The proposed execmem_alloc() looks to me very much tailored for x86 > > > to be > > > used as a replacement for module_alloc(). Some architectures have > > > module_alloc() that is quite different from the default or x86 > > > version, so > > > I'd expect at least some explanation how modules etc can use execmem_ > > > APIs > > > without breaking !x86 architectures. > > > > I think this is fair, but I think we should ask ourselves - how > > much should we do in one step? > > I think that at least we need evidence that execmem_alloc() etc can be > actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work > for him at all, so having a core MM API for code allocation that only works > with BPF on x86 seems not right to me. While using execmem_alloc() et al. in module support is difficult, folks are making progress with it. For example, the prototype would be more difficult before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC (introduced by Christophe). We also have other users that we can onboard soon: BPF trampoline on x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and s390. 
> > > For non-text_poke() architectures, the way you can make it work is have > > the API look like: > > execmem_alloc() <- Does the allocation, but not necessarily usable yet > > execmem_write() <- Loads the mapping, doesn't work after finish() > > execmem_finish() <- Makes the mapping live (loaded, executable, ready) > > > > So for text_poke(): > > execmem_alloc() <- reserves the mapping > > execmem_write() <- text_pokes() to the mapping > > execmem_finish() <- does nothing > > > > And non-text_poke(): > > execmem_alloc() <- Allocates a regular RW vmalloc allocation > > execmem_write() <- Writes normally to it > > execmem_finish() <- does set_memory_ro()/set_memory_x() on it > > > > Non-text_poke() only gets the benefits of centralized logic, but the > > interface works for both. This is pretty much what the perm_alloc() RFC > > did to make it work with other arch's and modules. But to fit with the > > existing modules code (which is actually spread all over) and also > > handle RO sections, it also needed some additional bells and whistles. > > I'm less concerned about non-text_poke() part, but rather about > restrictions where code and data can live on different architectures and > whether these restrictions won't lead to inability to use the centralized > logic on, say, arm64 and powerpc. > > For instance, if we use execmem_alloc() for modules, it means that data > sections should be allocated separately with plain vmalloc(). Will this > work universally? Or this will require special care with additional > complexity in the modules code? > > > So the question I'm trying to ask is, how much should we target for the > > next step? I first thought that this functionality was so intertwined, > > it would be too hard to do iteratively. So if we want to try > > iteratively, I'm ok if it doesn't solve everything. > > With execmem_alloc() as the first step I'm failing to see the large > picture. If we want to use it for modules, how will we allocate RO data? 
> with similar rodata_alloc() that uses yet another tree in vmalloc? > How the caching of large pages in vmalloc can be made useful for use cases > like secretmem and PKS? If RO data causes problems with direct map fragmentation, we can use similar logic. I think we will need another tree in vmalloc for this case. Since the logic will be mostly identical, I personally don't think adding another tree is a big overhead. Thanks, Song
On Wed, Nov 9, 2022 at 9:04 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> [...]
>
> Similar to the perm_alloc() hacks?
>
> > with similar rodata_alloc() that uses yet another tree in vmalloc?
>
> It would have to group them together at least. Not sure if it needs a
> separate tree or not. I would think permission flags would be better
> than a new function for each memory type.
>
> > How the caching of large pages in vmalloc can be made useful for use
> > cases
> > like secretmem and PKS?
>
> This part is easy I think. If we had an unmapped page allocator it
> could just feed this. Do you have any idea when you might pick up that
> stuff again?
>
> To answer my own question, I think a good first step would be to make
> the interface also work for non-text_poke() so it could really be cross
> arch, then use it for everything except modules. The benefit to the
> other arch's at that point is centralized handling of loading text.

AFAICT, most major archs are introducing text_poke or similar support.
What would be a good non-text_poke major arch to look into?

Thanks,
Song
+ linuxppc-dev list as we start mentioning powerpc.

Le 09/11/2022 à 18:43, Song Liu a écrit :
> On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote:
>>
> [...]
>
>>>>
>>>> The proposed execmem_alloc() looks to me very much tailored for x86
>>>> to be
>>>> used as a replacement for module_alloc(). Some architectures have
>>>> module_alloc() that is quite different from the default or x86
>>>> version, so
>>>> I'd expect at least some explanation how modules etc can use execmem_
>>>> APIs
>>>> without breaking !x86 architectures.
>>>
>>> I think this is fair, but I think we should ask ourselves - how
>>> much should we do in one step?
>>
>> I think that at least we need an evidence that execmem_alloc() etc can be
>> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
>> for him at all, so having a core MM API for code allocation that only works
>> with BPF on x86 seems not right to me.
>
> While using execmem_alloc() et al. in module support is difficult, folks are
> making progress with it. For example, the prototype would have been more
> difficult before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> (introduced by Christophe).

By the way, the motivation for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
was completely different: This was because on powerpc book3s/32, no-exec
flagging is per segment of size 256 Mbytes, so in order to provide
STRICT_MODULES_RWX it was necessary to put data outside of the segment
that holds module text in order to be able to flag RW data as no-exec.

But I'm happy if it can also serve other purposes.

>
> We also have other users that we can onboard soon: BPF trampoline on
> x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
> s390.
> >> >>> For non-text_poke() architectures, the way you can make it work is have >>> the API look like: >>> execmem_alloc() <- Does the allocation, but necessarily usable yet >>> execmem_write() <- Loads the mapping, doesn't work after finish() >>> execmem_finish() <- Makes the mapping live (loaded, executable, ready) >>> >>> So for text_poke(): >>> execmem_alloc() <- reserves the mapping >>> execmem_write() <- text_pokes() to the mapping >>> execmem_finish() <- does nothing >>> >>> And non-text_poke(): >>> execmem_alloc() <- Allocates a regular RW vmalloc allocation >>> execmem_write() <- Writes normally to it >>> execmem_finish() <- does set_memory_ro()/set_memory_x() on it >>> >>> Non-text_poke() only gets the benefits of centralized logic, but the >>> interface works for both. This is pretty much what the perm_alloc() RFC >>> did to make it work with other arch's and modules. But to fit with the >>> existing modules code (which is actually spread all over) and also >>> handle RO sections, it also needed some additional bells and whistles. >> >> I'm less concerned about non-text_poke() part, but rather about >> restrictions where code and data can live on different architectures and >> whether these restrictions won't lead to inability to use the centralized >> logic on, say, arm64 and powerpc. Until recently, powerpc CPU didn't implement PC-relative data access. Only very recent powerpc CPUs (power10 only ?) have capability to do PC-relative accesses, but the kernel doesn't use it yet. So there's no constraint about distance between text and data. What matters is the distance between core kernel text and module text to avoid trampolines. >> >> For instance, if we use execmem_alloc() for modules, it means that data >> sections should be allocated separately with plain vmalloc(). Will this >> work universally? Or this will require special care with additional >> complexity in the modules code? 
>> >>> So the question I'm trying to ask is, how much should we target for the >>> next step? I first thought that this functionality was so intertwined, >>> it would be too hard to do iteratively. So if we want to try >>> iteratively, I'm ok if it doesn't solve everything. >> >> With execmem_alloc() as the first step I'm failing to see the large >> picture. If we want to use it for modules, how will we allocate RO data? >> with similar rodata_alloc() that uses yet another tree in vmalloc? >> How the caching of large pages in vmalloc can be made useful for use cases >> like secretmem and PKS? > > If RO data causes problems with direct map fragmentation, we can use > similar logic. I think we will need another tree in vmalloc for this case. > Since the logic will be mostly identical, I personally don't think adding > another tree is a big overhead. On powerpc, kernel core RAM is not mapped by pages but is mapped by blocks. There are only two blocks: One ROX block which contains both text and rodata, and one RW block that contains everything else. Maybe the same can be done for modules. What matters is to be sure you never have WX memory. Having ROX rodata is not an issue. Christophe
On Wed, Nov 9, 2022 at 1:24 PM Christophe Leroy <christophe.leroy@csgroup.eu> wrote: > > + linuxppc-dev list as we start mentioning powerpc. > > Le 09/11/2022 à 18:43, Song Liu a écrit : > > On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote: > >> > > [...] > > > >>>> > >>>> The proposed execmem_alloc() looks to me very much tailored for x86 > >>>> to be > >>>> used as a replacement for module_alloc(). Some architectures have > >>>> module_alloc() that is quite different from the default or x86 > >>>> version, so > >>>> I'd expect at least some explanation how modules etc can use execmem_ > >>>> APIs > >>>> without breaking !x86 architectures. > >>> > >>> I think this is fair, but I think we should ask ask ourselves - how > >>> much should we do in one step? > >> > >> I think that at least we need an evidence that execmem_alloc() etc can be > >> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work > >> for him at all, so having a core MM API for code allocation that only works > >> with BPF on x86 seems not right to me. > > > > While using execmem_alloc() et. al. in module support is difficult, folks are > > making progress with it. For example, the prototype would be more difficult > > before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > > (introduced by Christophe). > > By the way, the motivation for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > was completely different: This was because on powerpc book3s/32, no-exec > flaggin is per segment of size 256 Mbytes, so in order to provide > STRICT_MODULES_RWX it was necessary to put data outside of the segment > that holds module text in order to be able to flag RW data as no-exec. Yeah, I only noticed the actual motivation of this work earlier today. :) > > But I'm happy if it can also serve other purposes. > > > > > We also have other users that we can onboard soon: BPF trampoline on > > x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and > > s390. 
> > > >> > >>> For non-text_poke() architectures, the way you can make it work is have > >>> the API look like: > >>> execmem_alloc() <- Does the allocation, but necessarily usable yet > >>> execmem_write() <- Loads the mapping, doesn't work after finish() > >>> execmem_finish() <- Makes the mapping live (loaded, executable, ready) > >>> > >>> So for text_poke(): > >>> execmem_alloc() <- reserves the mapping > >>> execmem_write() <- text_pokes() to the mapping > >>> execmem_finish() <- does nothing > >>> > >>> And non-text_poke(): > >>> execmem_alloc() <- Allocates a regular RW vmalloc allocation > >>> execmem_write() <- Writes normally to it > >>> execmem_finish() <- does set_memory_ro()/set_memory_x() on it > >>> > >>> Non-text_poke() only gets the benefits of centralized logic, but the > >>> interface works for both. This is pretty much what the perm_alloc() RFC > >>> did to make it work with other arch's and modules. But to fit with the > >>> existing modules code (which is actually spread all over) and also > >>> handle RO sections, it also needed some additional bells and whistles. > >> > >> I'm less concerned about non-text_poke() part, but rather about > >> restrictions where code and data can live on different architectures and > >> whether these restrictions won't lead to inability to use the centralized > >> logic on, say, arm64 and powerpc. > > Until recently, powerpc CPU didn't implement PC-relative data access. > Only very recent powerpc CPUs (power10 only ?) have capability to do > PC-relative accesses, but the kernel doesn't use it yet. So there's no > constraint about distance between text and data. What matters is the > distance between core kernel text and module text to avoid trampolines. Ah, this is great. I guess this means powerpc can benefit from this work with much less effort than x86_64. > > >> > >> For instance, if we use execmem_alloc() for modules, it means that data > >> sections should be allocated separately with plain vmalloc(). 
Will this > >> work universally? Or this will require special care with additional > >> complexity in the modules code? > >> > >>> So the question I'm trying to ask is, how much should we target for the > >>> next step? I first thought that this functionality was so intertwined, > >>> it would be too hard to do iteratively. So if we want to try > >>> iteratively, I'm ok if it doesn't solve everything. > >> > >> With execmem_alloc() as the first step I'm failing to see the large > >> picture. If we want to use it for modules, how will we allocate RO data? > >> with similar rodata_alloc() that uses yet another tree in vmalloc? > >> How the caching of large pages in vmalloc can be made useful for use cases > >> like secretmem and PKS? > > > > If RO data causes problems with direct map fragmentation, we can use > > similar logic. I think we will need another tree in vmalloc for this case. > > Since the logic will be mostly identical, I personally don't think adding > > another tree is a big overhead. > > On powerpc, kernel core RAM is not mapped by pages but is mapped by > blocks. There are only two blocks: One ROX block which contains both > text and rodata, and one RW block that contains everything else. Maybe > the same can be done for modules. What matters is to be sure you never > have WX memory. Having ROX rodata is not an issue. Got it. Thanks! Song
On Tue, Nov 08, 2022 at 10:41:53AM -0800, Song Liu wrote: > On Tue, Nov 8, 2022 at 3:27 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > Hi Song, > > > > On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote: > > > This patchset tries to address the following issues: > > > > > > 1. Direct map fragmentation > > > > > > On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also > > > RO+X. These set_memory_* calls cause 1GB page table entries to be split > > > into 2MB and 4kB ones. This fragmentation in direct map results in bigger > > > and slower page table, and pressure for both instruction and data TLB. > > > > > > Our previous work in bpf_prog_pack tries to address this issue from BPF > > > program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has > > > greatly reduced direct map fragmentation from BPF programs. > > > > Usage of set_memory_* APIs with memory allocated from vmalloc/modules > > virtual range does not change the direct map, but only updates the > > permissions in vmalloc range. The direct map splits occur in > > vm_remove_mappings() when the memory is *freed*. > > > > That said, both bpf_prog_pack and these patches do reduce the > > fragmentation, but this happens because the memory is freed to the system > > in 2M chunks and there are no splits of 2M pages. Besides, since the same > > 2M page used for many BPF programs there should be way less vfree() calls. > > > > > 2. iTLB pressure from BPF program > > > > > > Dynamic kernel text such as modules and BPF programs (even with current > > > bpf_prog_pack) use 4kB pages on x86, when the total size of modules and > > > BPF program is big, we can see visible performance drop caused by high > > > iTLB miss rate. > > > > Like Luis mentioned several times already, it would be nice to see numbers. > > > > > 3. TLB shootdown for short-living BPF programs > > > > > > Before bpf_prog_pack loading and unloading BPF programs requires global > > > TLB shootdown. 
This patchset (and bpf_prog_pack) replaces it with a local
> > > TLB flush.
> > >
> > > 4. Reduce memory usage by BPF programs (in some cases)
> > >
> > > Most BPF programs and various trampolines are small, and they often
> > > occupy a whole page. From a random server in our fleet, 50% of the
> > > loaded BPF programs are less than 500 bytes in size, and 75% of them are
> > > less than 2kB in size. Allowing these BPF programs to share 2MB pages
> > > would yield some memory saving for systems with many BPF programs. For
> > > systems with only a small number of BPF programs, this patch may waste a
> > > little memory by allocating one 2MB page, but using only part of it.
> >
> > I'm not convinced there are memory savings here. Unless you have hundreds
> > of BPF programs, most of the 2M page will be wasted, won't it?
> > So for systems that have moderate use of BPF, most of the 2M page will be
> > unused, right?
>
> There will be some memory waste in such cases. But it will get better with:
> 1) With 4/5 and 5/5, BPF programs will share this 2MB page with kernel .text
> section (_stext to _etext);
> 2) modules, ftrace, kprobe will also share this 2MB page;

Unless I'm missing something, what will be shared is the virtual space; the
actual physical pages will still be allocated the same way as any vmalloc()
allocation.

> 3) There are bigger BPF programs in many use cases.

With the statistics you provided above, one will need hundreds if not
thousands of BPF programs to fill a 2M page. I didn't do the math, but it
seems that to see memory savings there should be several hundreds of BPF
programs.

> Thanks,
> Song
On Wed, Nov 09, 2022 at 05:04:25PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2022-11-09 at 13:17 +0200, Mike Rapoport wrote:
> > On Tue, Nov 08, 2022 at 04:51:12PM +0000, Edgecombe, Rick P wrote:
>
> > How the caching of large pages in vmalloc can be made useful for use
> > cases like secretmem and PKS?
>
> This part is easy I think. If we had an unmapped page allocator it
> could just feed this.

The unmapped page allocator could be used by anything that needs
non-default permissions in the direct map and knows how to map the pages
elsewhere. E.g. it would have been a one-liner to switch x86::module_alloc()
to use unmapped allocations. But ...

> Do you have any idea when you might pick up that stuff again?

... unfortunately I don't see it happening anytime soon.

> To answer my own question, I think a good first step would be to make
> the interface also work for non-text_poke() so it could really be cross
> arch, then use it for everything except modules. The benefit to the
> other arch's at that point is centralized handling of loading text.

My concern is that the proposed execmem_alloc() cannot be used for
centralized handling of loading text. I'm not familiar enough with
modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but
my gut feeling is that the proposed execmem_alloc() won't be an improvement
but rather a hindrance for moving to centralized handling of loading text.

It feels to me that a lot of ground work is needed to get to the point
where we can use centralized handling of loading text.
On Wed, Nov 09, 2022 at 09:43:50AM -0800, Song Liu wrote: > On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote: > > > [...] > > > > > > > > > The proposed execmem_alloc() looks to me very much tailored for x86 > > > > to be > > > > used as a replacement for module_alloc(). Some architectures have > > > > module_alloc() that is quite different from the default or x86 > > > > version, so > > > > I'd expect at least some explanation how modules etc can use execmem_ > > > > APIs > > > > without breaking !x86 architectures. > > > > > > I think this is fair, but I think we should ask ask ourselves - how > > > much should we do in one step? > > > > I think that at least we need an evidence that execmem_alloc() etc can be > > actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work > > for him at all, so having a core MM API for code allocation that only works > > with BPF on x86 seems not right to me. > > While using execmem_alloc() et. al. in module support is difficult, folks are > making progress with it. For example, the prototype would be more difficult > before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > (introduced by Christophe). > > We also have other users that we can onboard soon: BPF trampoline on > x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and > s390. Caching of large pages won't make any difference on arm64 and powerpc because they do not support splitting of the direct map, so the only potential benefit there is a centralized handling of text loading and I'm not convinced execmem_alloc() will get us there. > > With execmem_alloc() as the first step I'm failing to see the large > > picture. If we want to use it for modules, how will we allocate RO data? > > with similar rodata_alloc() that uses yet another tree in vmalloc? > > How the caching of large pages in vmalloc can be made useful for use cases > > like secretmem and PKS? 
> > If RO data causes problems with direct map fragmentation, we can use > similar logic. I think we will need another tree in vmalloc for this case. > Since the logic will be mostly identical, I personally don't think adding > another tree is a big overhead. Actually, it would be interesting to quantify memory savings/waste as the result of using execmem_alloc() > Thanks, > Song
On Sun, Nov 13, 2022 at 1:59 AM Mike Rapoport <rppt@kernel.org> wrote:
[...]
> >
> > There will be some memory waste in such cases. But it will get better with:
> > 1) With 4/5 and 5/5, BPF programs will share this 2MB page with kernel .text
> > section (_stext to _etext);
> > 2) modules, ftrace, kprobe will also share this 2MB page;
>
> Unless I'm missing something, what will be shared is the virtual space; the
> actual physical pages will still be allocated the same way as any vmalloc()
> allocation.

What do you mean by the virtual space being shared while the actual
physical pages stay the same? This is a 2MB page shared by BPF programs,
modules, etc., so it is 2MB of virtual address space, and it is also 1x 2MB
physical huge page. For example, we will allocate one 2MB page, and put
1MB of module text, 512kB of BPF programs, and some ftrace trampolines in
this page.

>
> > 3) There are bigger BPF programs in many use cases.
>
> With the statistics you provided above, one will need hundreds if not
> thousands of BPF programs to fill a 2M page. I didn't do the math, but it
> seems that to see memory savings there should be several hundreds of BPF
> programs.

powerpc is trying to use bpf_prog_pack [1]. IIUC, execmem_alloc() should
allocate 512kB pages for powerpc. This is already yielding memory savings:
on a random system in our fleet (x86_64), BPF programs use 140x 4kB pages
(or 560kB) without execmem_alloc(). They will fit in 200kB with
execmem_alloc(), and we can use the other 300kB+ for modules, ftrace, etc.

OTOH, 512kB or even 2MB is really small for module systems, but iTLB is
always a limited resource.

Thanks,
Song

[1] https://lore.kernel.org/bpf/20221110184303.393179-1-hbathini@linux.ibm.com/
On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote: > > On Wed, Nov 09, 2022 at 05:04:25PM +0000, Edgecombe, Rick P wrote: > > On Wed, 2022-11-09 at 13:17 +0200, Mike Rapoport wrote: > > > On Tue, Nov 08, 2022 at 04:51:12PM +0000, Edgecombe, Rick P wrote: > > > > > How the caching of large pages in vmalloc can be made useful for use > > > cases like secretmem and PKS? > > > > This part is easy I think. If we had an unmapped page allocator it > > could just feed this. > > The unmapped page allocator could be used by anything that needs > non-default permissions in the direct map and knows how to map the pages > elsewhere. E.g it would have been a oneliner to switch x86::module_alloc() > to use unmapped allocations. But ... > > > Do you have any idea when you might pick up that stuff again? > > ... unfortunately I don't see it happening anytime soon. > > > To answer my own question, I think a good first step would be to make > > the interface also work for non-text_poke() so it could really be cross > > arch, then use it for everything except modules. The benefit to the > > other arch's at that point is centralized handling of loading text. > > My concern is that the proposed execmem_alloc() cannot be used for > centralized handling of loading text. I'm not familiar enough with > modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but > my gut feeling is that the proposed execmem_alloc() won't be an improvement > but rather a hindrance for moving to centralized handling of loading text. I don't follow why this could ever be a hindrance. Luis is very excited about this, and I am very sure it works for ftrace, kprobe, and BPF. If there is a better API, it shouldn't be too hard to do the migration. See the example in 3/5 of the set, where we move x86 BPF jit AND BPF dispatcher from bpf_prog_pack to execmem_alloc(): 5 files changed, 21 insertions(+), 201 deletions(-) It is not a one liner, but it is definitely very easy. 
> > It feels to me that a lot of ground work is needed to get to the point > where we can use centralized handling of loading text. Could you please be more specific on what is needed? Thanks, Song
On Sun, Nov 13, 2022 at 2:43 AM Mike Rapoport <rppt@kernel.org> wrote: > > On Wed, Nov 09, 2022 at 09:43:50AM -0800, Song Liu wrote: > > On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > > [...] > > > > > > > > > > > > The proposed execmem_alloc() looks to me very much tailored for x86 > > > > > to be > > > > > used as a replacement for module_alloc(). Some architectures have > > > > > module_alloc() that is quite different from the default or x86 > > > > > version, so > > > > > I'd expect at least some explanation how modules etc can use execmem_ > > > > > APIs > > > > > without breaking !x86 architectures. > > > > > > > > I think this is fair, but I think we should ask ask ourselves - how > > > > much should we do in one step? > > > > > > I think that at least we need an evidence that execmem_alloc() etc can be > > > actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work > > > for him at all, so having a core MM API for code allocation that only works > > > with BPF on x86 seems not right to me. > > > > While using execmem_alloc() et. al. in module support is difficult, folks are > > making progress with it. For example, the prototype would be more difficult > > before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > > (introduced by Christophe). > > > > We also have other users that we can onboard soon: BPF trampoline on > > x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and > > s390. > > Caching of large pages won't make any difference on arm64 and powerpc > because they do not support splitting of the direct map, so the only > potential benefit there is a centralized handling of text loading and I'm > not convinced execmem_alloc() will get us there. Sharing large pages helps reduce iTLB pressure, which is the second motivation here (after reducing direct map fragmentation). > > > > With execmem_alloc() as the first step I'm failing to see the large > > > picture. 
If we want to use it for modules, how will we allocate RO data?
> > > with similar rodata_alloc() that uses yet another tree in vmalloc?
> > > How the caching of large pages in vmalloc can be made useful for use cases
> > > like secretmem and PKS?
> >
> > If RO data causes problems with direct map fragmentation, we can use
> > similar logic. I think we will need another tree in vmalloc for this case.
> > Since the logic will be mostly identical, I personally don't think adding
> > another tree is a big overhead.
>
> Actually, it would be interesting to quantify memory savings/waste as the
> result of using execmem_alloc()

From a random system in our fleet, execmem_alloc() saves:

139 iTLB entries (1x 2MB entry vs. 140x 4kB entries), which is more than
100% of L1 iTLB and about 10% of L2 TLB.

It wastes 1.5MB of memory, which is 0.0023% of system memory (64GB). I
believe this is clearly a good trade-off.

Thanks,
Song
On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote:
>
[...]
>
>
> This set enables bpf programs and bpf dispatchers to share huge pages with
> new API:
> execmem_alloc()
> execmem_free()
> execmem_fill()
>
> The idea is similar to Peter's suggestion in [1].
>
> execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates this
> memory to its users. execmem_free() is used to free memory allocated by
> execmem_alloc(). execmem_fill() is used to update memory allocated by
> execmem_alloc().

Sigh, I just realized this thread made it through linux-mm@kvack.org, but got
dropped by bpf@vger.kernel.org, so I guess I will have to resend v3.

Currently, I have got the following action items for v3:
1. Add unified API to allocate text memory to the motivation;
2. Update Documentation/x86/x86_64/mm.rst;
3. Allow non-PMD_SIZE allocation for powerpc.

1 and 2 are relatively simple. We can probably do 3 in a follow up patch
(as I don't have powerpc environments for testing).

Did I miss anything? Besides these, does this set look reasonable?

Andrew and Peter, could you please share your comments on this?

Thanks,
Song

> Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X.
> The caller has to update the content with a text_poke like mechanism.
> Specifically, execmem_fill() is provided to update memory allocated by
> execmem_alloc(). execmem_fill() also makes sure the update stays in the
> boundary of one chunk allocated by execmem_alloc(). Please refer to patch
> 1/5 for more details.
>
> Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
>
> Patch 4/5 and 5/5 allow static kernel text (_stext to _etext) to share
> PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
> allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE), and then using
> the range from _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.
>
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
> [2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/
> [3] v1: https://lore.kernel.org/bpf/20221031222541.1773452-1-song@kernel.org/
> [4] https://lore.kernel.org/bpf/Y2ioTodn+mBXdIqp@ziqianlu-desk2/
> [5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/
>
> Changes PATCH v1 => v2:
> 1. Rename the APIs as execmem_* (Christoph Hellwig)
> 2. Add more information about the motivation of this work (and follow up
> works for kernel modules, various trampolines, etc).
> (Luis Chamberlain, Rick Edgecombe, Mike Rapoport, Aaron Lu)
> 3. Include experimental results from previous bpf_prog_pack and the
> community. (Aaron Lu, Luis Chamberlain, Rick Edgecombe)
>
> Changes RFC v2 => PATCH v1:
> 1. Add vcopy_exec(), which updates memory allocated by vmalloc_exec(). It
> also ensures vcopy_exec() is only used to update memory from one single
> vmalloc_exec() call. (Christoph Hellwig)
> 2. Add arch_vcopy_exec() and arch_invalidate_exec() as wrappers for the
> text_poke() like logic.
> 3. Drop changes for kernel modules and focus on BPF side changes.
>
> Changes RFC v1 => RFC v2:
> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
> work fine with BPF programs (patch 1, 2, 4). But the module side (patch 3)
> still needs some work.
> > Song Liu (5): > vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill > x86/alternative: support execmem_alloc() and execmem_free() > bpf: use execmem_alloc for bpf program and bpf dispatcher > vmalloc: introduce register_text_tail_vm() > x86: use register_text_tail_vm > > arch/x86/include/asm/pgtable_64_types.h | 1 + > arch/x86/kernel/alternative.c | 12 + > arch/x86/mm/init_64.c | 4 +- > arch/x86/net/bpf_jit_comp.c | 23 +- > include/linux/bpf.h | 3 - > include/linux/filter.h | 5 - > include/linux/vmalloc.h | 9 + > kernel/bpf/core.c | 180 +----------- > kernel/bpf/dispatcher.c | 11 +- > mm/nommu.c | 12 + > mm/vmalloc.c | 354 ++++++++++++++++++++++++ > 11 files changed, 412 insertions(+), 202 deletions(-) > > -- > 2.30.2
On Mon, 2022-11-14 at 17:30 -0800, Song Liu wrote:
> Currently, I have got the following action items for v3:
> 1. Add unified API to allocate text memory to the motivation;
> 2. Update Documentation/x86/x86_64/mm.rst;
> 3. Allow non-PMD_SIZE allocation for powerpc.

So what do we think about supporting the fallback mechanism for the
first version, like:

https://lore.kernel.org/all/9e59a4e8b6f071cf380b9843cdf1e9160f798255.camel@intel.com/

It helps the (1) story by actually being usable by non-text_poke()
architectures.
On Mon, Nov 14, 2022 at 12:45:16PM -0800, Song Liu wrote:
> On Sun, Nov 13, 2022 at 2:43 AM Mike Rapoport <rppt@kernel.org> wrote:
> > Actually, it would be interesting to quantify memory savings/waste as the
> > result of using execmem_alloc()
>
> From a random system in our fleet, execmem_alloc() saves:
>
> 139 iTLB entries (1x 2MB entry vs. 140x 4kB entries), which is more than
> 100% of L1 iTLB and about 10% of L2 TLB.

That should then be easily reflected using perf with a real benchmark,
however we don't have such data yet. I have hinted I suspect the reason
might be that the actual loading of eBPF JIT / whatever may need to be
done in parallel a la lib/test_kmod.c, or having the existing ebpf
selftests run JIT tests in parallel, something along those lines, but I
see no effort to showcase that. That's a missed opportunity.

  Luis
On Mon, Nov 14, 2022 at 05:30:39PM -0800, Song Liu wrote:
> On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote:
> >
> > [...]
> >
> >
> > This set enables bpf programs and bpf dispatchers to share huge pages with
> > new API:
> > execmem_alloc()
> > execmem_free()
> > execmem_fill()
> >
> > The idea is similar to Peter's suggestion in [1].
> >
> > execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates this
> > memory to its users. execmem_free() is used to free memory allocated by
> > execmem_alloc(). execmem_fill() is used to update memory allocated by
> > execmem_alloc().
>
> Sigh, I just realized this thread made it through linux-mm@kvack.org, but got
> dropped by bpf@vger.kernel.org, so I guess I will have to resend v3.

I don't know what is going on with the bpf list but whatever it is, is
silly. You should Cc the right folks to ensure proper review if the bpf
list is the issue.

> Currently, I have got the following action items for v3:
> 1. Add unified API to allocate text memory to the motivation;
> 2. Update Documentation/x86/x86_64/mm.rst;
> 3. Allow non-PMD_SIZE allocation for powerpc.

- I am really exhausted of asking again for real performance tests, you
  keep saying you can't and I keep saying you can, you are not trying
  hard enough. Stop thinking about your internal benchmark which you
  cannot publish. There should be enough crap out there which you can use.

- A new selftest or set of selftests which demonstrates gain in performance

- Extensions maybe of lib/test_vmalloc.c or whatever is appropriate to
  test correctness

- Enhance commit logs to justify the *why*, one of which to highlight is
  providing an API for memory semantics for special memory pages

  Luis
On Mon, Nov 14, 2022 at 12:30:49PM -0800, Song Liu wrote: > On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > On Wed, Nov 09, 2022 at 05:04:25PM +0000, Edgecombe, Rick P wrote: > > > On Wed, 2022-11-09 at 13:17 +0200, Mike Rapoport wrote: > > > > On Tue, Nov 08, 2022 at 04:51:12PM +0000, Edgecombe, Rick P wrote: > > > > > > > How the caching of large pages in vmalloc can be made useful for use > > > > cases like secretmem and PKS? > > > > > > This part is easy I think. If we had an unmapped page allocator it > > > could just feed this. > > > > The unmapped page allocator could be used by anything that needs > > non-default permissions in the direct map and knows how to map the pages > > elsewhere. E.g it would have been a oneliner to switch x86::module_alloc() > > to use unmapped allocations. But ... > > > > > Do you have any idea when you might pick up that stuff again? > > > > ... unfortunately I don't see it happening anytime soon. > > > > > To answer my own question, I think a good first step would be to make > > > the interface also work for non-text_poke() so it could really be cross > > > arch, then use it for everything except modules. The benefit to the > > > other arch's at that point is centralized handling of loading text. > > > > My concern is that the proposed execmem_alloc() cannot be used for > > centralized handling of loading text. I'm not familiar enough with > > modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but > > my gut feeling is that the proposed execmem_alloc() won't be an improvement > > but rather a hindrance for moving to centralized handling of loading text. > > I don't follow why this could ever be a hindrance. Luis is very excited about > this, and I am very sure it works for ftrace, kprobe, and BPF. 
The main hurdles for modules are: * x86 needs support for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC to use this properly * in light of lack of support for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC we need a fallback * today module_alloc() follows special hanky panky open coded semantics for special page permissions, a unified way to handle this would be ideal instead of expecting everyone to get it right. Other than this there are probably odd corner cases which would likely only come up during testing. I see Song's efforts striving towards these objectives, and because of the new ARCH_WANTS_MODULES_DATA_IN_VMALLOC, it should be possible to get us there. Luis
On Tue, Nov 15, 2022 at 01:09:08PM -0800, Luis Chamberlain wrote: > - I am really exausted of asking again for real performance tests, > you keep saying you can't and I keep saying you can, you are not > trying hard enough. Stop thinking about your internal benchmark which > you cannot publish. There should be enough crap out which you can use. and if results defy expectations, please don't be coy about it, just be direct and mention it. Luis
On Tue, 2022-11-15 at 13:18 -0800, Luis Chamberlain wrote: > The main hurdles for modules are: > > * x86 needs support for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > to use this properly > * in light of lack of support for > CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > we need a fallback > * today module_alloc() follows special hanky panky open coded > semantics for > special page permissions, a unified way to handle this would be > ideal instead of expecting everyone to get it right. How were you thinking non-text_poke() architectures load their text into the text region without the fallback method? Split logic in cross arch modules.c? Sorry if this is changed, but last time I was in there I remember there was quite a bit of cross-arch code loading text.
On Tue, Nov 15, 2022 at 9:34 AM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Mon, 2022-11-14 at 17:30 -0800, Song Liu wrote: > > Currently, I have got the following action items for v3: > > 1. Add unify API to allocate text memory to motivation; > > 2. Update Documentation/x86/x86_64/mm.rst; > > 3. Allow none PMD_SIZE allocation for powerpc. > > So what do we think about supporting the fallback mechanism for the > first version, like: > > https://lore.kernel.org/all/9e59a4e8b6f071cf380b9843cdf1e9160f798255.camel@intel.com/ > > It helps the (1) story by actually being usable by non-text_poke() > architectures. I personally think this might be a good idea. We will need this when we use execmem_alloc for modules. But I haven't got a chance to look at module code in great detail. I was thinking of adding this logic with changes in module code. Thanks, Song
On Tue, 2022-11-15 at 13:54 -0800, Song Liu wrote: > On Tue, Nov 15, 2022 at 9:34 AM Edgecombe, Rick P > <rick.p.edgecombe@intel.com> wrote: > > > > On Mon, 2022-11-14 at 17:30 -0800, Song Liu wrote: > > > Currently, I have got the following action items for v3: > > > 1. Add unify API to allocate text memory to motivation; > > > 2. Update Documentation/x86/x86_64/mm.rst; > > > 3. Allow none PMD_SIZE allocation for powerpc. > > > > So what do we think about supporting the fallback mechanism for the > > first version, like: > > > > https://lore.kernel.org/all/9e59a4e8b6f071cf380b9843cdf1e9160f798255.camel@intel.com/ > > > > It helps the (1) story by actually being usable by non-text_poke() > > architectures. > > I personally think this might be a good idea. We will need this when > we use > execmem_alloc for modules. But I haven't got a chance to look at > module code in > great detail. I was thinking of adding this logic with changes in > module code. BPF used to have a generic allocator that just called module_alloc(). If you had a fallback method could you unify all of BPF to use execmem()?
On Tue, Nov 15, 2022 at 2:14 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Tue, 2022-11-15 at 13:54 -0800, Song Liu wrote: > > On Tue, Nov 15, 2022 at 9:34 AM Edgecombe, Rick P > > <rick.p.edgecombe@intel.com> wrote: > > > > > > On Mon, 2022-11-14 at 17:30 -0800, Song Liu wrote: > > > > Currently, I have got the following action items for v3: > > > > 1. Add unify API to allocate text memory to motivation; > > > > 2. Update Documentation/x86/x86_64/mm.rst; > > > > 3. Allow none PMD_SIZE allocation for powerpc. > > > > > > So what do we think about supporting the fallback mechanism for the > > > first version, like: > > > > > > > https://lore.kernel.org/all/9e59a4e8b6f071cf380b9843cdf1e9160f798255.camel@intel.com/ > > > > > > It helps the (1) story by actually being usable by non-text_poke() > > > architectures. > > > > I personally think this might be a good idea. We will need this when > > we use > > execmem_alloc for modules. But I haven't got a chance to look at > > module code in > > great detail. I was thinking of adding this logic with changes in > > module code. > > BPF used to have a generic allocator that just called module_alloc(). > If you had a fallback method could you unify all of BPF to use > execmem()? Yes, we can use execmem as the unified API for all BPF JIT engines. Song
On Tue, Nov 15, 2022 at 1:09 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > On Mon, Nov 14, 2022 at 05:30:39PM -0800, Song Liu wrote: > > On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote: > > > > > > > [...] > > > > > > > > > > > This set enables bpf programs and bpf dispatchers to share huge pages with > > > new API: > > > execmem_alloc() > > > execmem_alloc() > > > execmem_fill() > > > > > > The idea is similar to Peter's suggestion in [1]. > > > > > > execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates these > > > memory to its users. execmem_alloc() is used to free memory allocated by > > > execmem_alloc(). execmem_fill() is used to update memory allocated by > > > execmem_alloc(). > > > > Sigh, I just realized this thread made through linux-mm@kvack.org, but got > > dropped by bpf@vger.kernel.org, so I guess I will have to resend v3. > > I don't know what is going on with the bpf list but whatever it is, is silly. > You should Cc the right folks to ensure proper review if the bpf list is > the issue. > > > Currently, I have got the following action items for v3: > > 1. Add unify API to allocate text memory to motivation; > > 2. Update Documentation/x86/x86_64/mm.rst; > > 3. Allow none PMD_SIZE allocation for powerpc. > > - I am really exausted of asking again for real performance tests, > you keep saying you can't and I keep saying you can, you are not > trying hard enough. Stop thinking about your internal benchmark which > you cannot publish. There should be enough crap out which you can use. > > - A new selftest or set of selftests which demonstrates gain in > performance I didn't mean to avoid showing results from publicly available benchmarks. I just thought the actual production workload was more convincing (and we do use that to demonstrate the benefit of a lot of kernel improvements). 
For something publicly available, how about the following: Run 100 instances of the following benchmark from bpf selftests: tools/testing/selftests/bpf/bench -w2 -d100 -a trig-kprobe which loads 7 BPF programs, and triggers one of them. Then use perf to monitor TLB-related counters: perf stat -e iTLB-load-misses,itlb_misses.walk_completed_4k, \ itlb_misses.walk_completed_2m_4m -a The following results are from a qemu VM with 32 cores. Before bpf_prog_pack: iTLB-load-misses: 350k/s itlb_misses.walk_completed_4k: 90k/s itlb_misses.walk_completed_2m_4m: 0.1/s With bpf_prog_pack (current upstream): iTLB-load-misses: 220k/s itlb_misses.walk_completed_4k: 68k/s itlb_misses.walk_completed_2m_4m: 0.2/s With execmem_alloc (with this set): iTLB-load-misses: 185k/s itlb_misses.walk_completed_4k: 58k/s itlb_misses.walk_completed_2m_4m: 1/s Does this address your questions? > > - Extensions maybe of lib/test_vmalloc.c or whatever is appropriate to > test correctness I will look into this. > > - Enhance commit logs to justify the *why*, one of which to hightight is > providing an API for memory semantics for special memory pages And this. Thanks, Song
On Tue, Nov 15, 2022 at 2:14 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Tue, 2022-11-15 at 13:54 -0800, Song Liu wrote: > > On Tue, Nov 15, 2022 at 9:34 AM Edgecombe, Rick P > > <rick.p.edgecombe@intel.com> wrote: > > > > > > On Mon, 2022-11-14 at 17:30 -0800, Song Liu wrote: > > > > Currently, I have got the following action items for v3: > > > > 1. Add unify API to allocate text memory to motivation; > > > > 2. Update Documentation/x86/x86_64/mm.rst; > > > > 3. Allow none PMD_SIZE allocation for powerpc. > > > > > > So what do we think about supporting the fallback mechanism for the > > > first version, like: > > > > > > > https://lore.kernel.org/all/9e59a4e8b6f071cf380b9843cdf1e9160f798255.camel@intel.com/ > > > > > > It helps the (1) story by actually being usable by non-text_poke() > > > architectures. > > > > I personally think this might be a good idea. We will need this when > > we use > > execmem_alloc for modules. But I haven't got a chance to look at > > module code in > > great detail. I was thinking of adding this logic with changes in > > module code. > > BPF used to have a generic allocator that just called module_alloc(). > If you had a fallback method could you unify all of BPF to use > execmem()? To clarify, are you suggesting we need this logic in this set? I would rather wait until we handle module code. This is because BPF JIT uses module_alloc() for archs other than x86_64. So the fall back of execmem_alloc() for these archs would be module_alloc() or something similar. I think it is really weird to do something like void *execmem_alloc(size_t size) { #ifdef CONFIG_SUPPORT_EXECMEM_ALLOC ... #else return module_alloc(size); #endif } WDYT? Thanks, Song
On Tue, 2022-11-15 at 17:20 -0800, Song Liu wrote: > To clarify, are you suggesting we need this logic in this set? I > would > rather wait until we handle module code. This is because BPF JIT uses > module_alloc() for archs other than x86_64. So the fall back of > execmem_alloc() for these archs would be module_alloc() or > something similar. I think it is really weird to do something like > > void *execmem_alloc(size_t size) > { > #ifdef CONFIG_SUPPORT_EXECMEM_ALLOC > ... > #else > return module_alloc(size); > #endif > } > > WDYT? Hmm, that is a good point. It looks like it's plugged in backwards. Several people in the past have expressed that having all the text users call into *module*_alloc() is also a little wrong. So I think in some finished future, each architecture would have an execmem_alloc() arch breakout of some sort that modules could use instead of its module_alloc() logic. So basically all the module_alloc() arch-specific details of location and PAGE_FOO would move to execmem. I guess the question is how to get there. Calling into module_alloc() does the job but looks wrong. There are a lot of module_alloc()s, but what about implementing an execmem_alloc() for each bpf jit architecture that doesn't match the existing default version? It shouldn't be too much code. I think some of them will work with just the EXEC_MEM_START/END heuristic and won't need a breakout. But if this thing just works for x86 BPF JITs, I'm not sure we can say we have unified anything...
On Wed, Nov 16, 2022 at 1:22 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Tue, 2022-11-15 at 17:20 -0800, Song Liu wrote: > > To clarify, are you suggesting we need this logic in this set? I > > would > > rather wait until we handle module code. This is because BPF JIT uses > > module_alloc() for archs other than x86_64. So the fall back of > > execmem_alloc() for these archs would be module_alloc() or > > something similar. I think it is really weird to do something like > > > > void *execmem_alloc(size_t size) > > { > > #ifdef CONFIG_SUPPORT_EXECMEM_ALLOC > > ... > > #else > > return module_alloc(size); > > #endif > > } > > > > WDYT? > > Hmm, that is a good point. It looks like it's plugged in backwards. > > Several people in the past have expressed that all the text users > calling into *module*_alloc() also is a little wrong. So I think in > some finished future, each architecture would have an execmem_alloc() > arch breakout of some sort that modules could use instead of it's > module_alloc() logic. So basically all the module_alloc() arch > specifics details of location and PAGE_FOO would move to execmem. > > I guess the question is how to get there. Calling into module_alloc() > does the job but looks wrong. There are a lot of module_alloc()s, but > what about implementing an execmem_alloc() for each bpf jit > architecture that doesn't match the existing default version. It > shouldn't be too much code. I think some of them will work with just > the EXEC_MEM_START/END heuristic and wont need a breakout. > > But if this thing just works for x86 BPF JITs, I'm not sure we can say > we have unified anything... powerpc BPF JIT is getting bpf_prog_pack soon. [1] And we should be able to make ftrace and BPF trampoline to use execmem_alloc soon after this set is merged. AFAICT, we don't have to finalize the API until we handle module text. 
I personally think current API is good enough for ftrace and BPF trampoline, which already use something similar to JIT. Thanks, Song [1] https://lore.kernel.org/bpf/20221110184303.393179-1-hbathini@linux.ibm.com/
On Tue, Nov 15, 2022 at 02:48:05PM -0800, Song Liu wrote: > On Tue, Nov 15, 2022 at 1:09 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > On Mon, Nov 14, 2022 at 05:30:39PM -0800, Song Liu wrote: > > > On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote: > > > > > > > > > > [...] > > > > > > > > > > > > > > > This set enables bpf programs and bpf dispatchers to share huge pages with > > > > new API: > > > > execmem_alloc() > > > > execmem_alloc() > > > > execmem_fill() > > > > > > > > The idea is similar to Peter's suggestion in [1]. > > > > > > > > execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates these > > > > memory to its users. execmem_alloc() is used to free memory allocated by > > > > execmem_alloc(). execmem_fill() is used to update memory allocated by > > > > execmem_alloc(). > > > > > > Sigh, I just realized this thread made through linux-mm@kvack.org, but got > > > dropped by bpf@vger.kernel.org, so I guess I will have to resend v3. > > > > I don't know what is going on with the bpf list but whatever it is, is silly. > > You should Cc the right folks to ensure proper review if the bpf list is > > the issue. > > > > > Currently, I have got the following action items for v3: > > > 1. Add unify API to allocate text memory to motivation; > > > 2. Update Documentation/x86/x86_64/mm.rst; > > > 3. Allow none PMD_SIZE allocation for powerpc. > > > > - I am really exausted of asking again for real performance tests, > > you keep saying you can't and I keep saying you can, you are not > > trying hard enough. Stop thinking about your internal benchmark which > > you cannot publish. There should be enough crap out which you can use. > > > > - A new selftest or set of selftests which demonstrates gain in > > performance > > I didn't mean to not show the result with publically available. I just > thought the actual benchmark was better (and we do use that to > demonstrate the benefit of a lot of kernel improvement). 
> > For something publically available, how about the following: > > Run 100 instances of the following benchmark from bpf selftests: > tools/testing/selftests/bpf/bench -w2 -d100 -a trig-kprobe > which loads 7 BPF programs, and triggers one of them. > > Then use perf to monitor TLB related counters: > perf stat -e iTLB-load-misses,itlb_misses.walk_completed_4k, \ > itlb_misses.walk_completed_2m_4m -a > > The following results are from a qemu VM with 32 cores. > > Before bpf_prog_pack: > iTLB-load-misses: 350k/s > itlb_misses.walk_completed_4k: 90k/s > itlb_misses.walk_completed_2m_4m: 0.1/s > > With bpf_prog_pack (current upstream): > iTLB-load-misses: 220k/s > itlb_misses.walk_completed_4k: 68k/s > itlb_misses.walk_completed_2m_4m: 0.2/s > > With execmem_alloc (with this set): > iTLB-load-misses: 185k/s > itlb_misses.walk_completed_4k: 58k/s > itlb_misses.walk_completed_2m_4m: 1/s > > Do these address your questions with this? More in line with what I was hoping for. Can something just do the parallelization for you in one shot? Can bench alone do it for you? Is there no interest to have something which generically showcases multithreading / hammering a system with tons of eBPF JITs? It may prove useful. And also, it begs the question, what if you had another generic iTLB benchmark or general memory pressure workload running *as* you run the above? I ask, as it was my understanding that one of the issues was the long term slowdown caused by the direct map fragmentation without bpf_prog_pack, and so such an application should slow to a crawl over time, and there should be numbers you could show to prove that too, before and after. Luis
On Tue, Nov 15, 2022 at 09:39:08PM +0000, Edgecombe, Rick P wrote: > On Tue, 2022-11-15 at 13:18 -0800, Luis Chamberlain wrote: > > The main hurdles for modules are: > > > > * x86 needs support for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > > to use this properly > > * in light of lack of support for > > CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > > we need a fallback > > * today module_alloc() follows special hanky panky open coded > > semantics for > > special page permissions, a unified way to handle this would be > > ideal instead of expecting everyone to get it right. > > How were you thinking non-text_poke() architectures load their text > into the text region without the fallback method? Fallbacks are needed for sure. I think you spelled out well what would be needed. Luis
On Wed, 2022-11-16 at 14:33 -0800, Luis Chamberlain wrote: > More in lines with what I was hoping for. Can something just do > the parallelization for you in one shot? Can bench alone do it for > you? > Is there no interest to have soemthing which generically showcases > multithreading / hammering a system with tons of eBPF JITs? It may > prove useful. > > And also, it begs the question, what if you had another iTLB generic > benchmark or genearl memory pressure workload running *as* you run > the > above? I as, as it was my understanding that one of the issues was > the > long term slowdown caused by the directmap fragmentation without > bpf_prog_pack, and so such an application should crawl to its knees > over time, and there should be numbers you could show to prove that > too, before and after. We did have some benchmarks that showed if your direct map was totally fragmented (started from boot at 4k page size) what the regression was: https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
On Wed, Nov 16, 2022 at 10:47:04PM +0000, Edgecombe, Rick P wrote: > On Wed, 2022-11-16 at 14:33 -0800, Luis Chamberlain wrote: > > More in lines with what I was hoping for. Can something just do > > the parallelization for you in one shot? Can bench alone do it for > > you? > > Is there no interest to have soemthing which generically showcases > > multithreading / hammering a system with tons of eBPF JITs? It may > > prove useful. > > > > And also, it begs the question, what if you had another iTLB generic > > benchmark or genearl memory pressure workload running *as* you run > > the > > above? I as, as it was my understanding that one of the issues was > > the > > long term slowdown caused by the directmap fragmentation without > > bpf_prog_pack, and so such an application should crawl to its knees > > over time, and there should be numbers you could show to prove that > > too, before and after. > > We did have some benchmarks that showed if your direct map was totally > fragmented (started from boot at 4k page size) what the regression was: > > > https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ Oh yes, that is a good example of effort, but I'm suggesting taking, for instance, will-it-scale and running it in tandem with bpf_prog_pack, measuring the iTLB differences *both* before and after, *and* doing this again after a period of expected deterioration of the direct map (say, after the non-bpf_prog_pack case shows high direct map fragmentation). This is the sort of thing which could easily go into a commit log. Luis
On Wed, Nov 16, 2022 at 3:53 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > On Wed, Nov 16, 2022 at 10:47:04PM +0000, Edgecombe, Rick P wrote: > > On Wed, 2022-11-16 at 14:33 -0800, Luis Chamberlain wrote: > > > More in lines with what I was hoping for. Can something just do > > > the parallelization for you in one shot? Can bench alone do it for > > > you? > > > Is there no interest to have soemthing which generically showcases > > > multithreading / hammering a system with tons of eBPF JITs? It may > > > prove useful. > > > > > > And also, it begs the question, what if you had another iTLB generic > > > benchmark or genearl memory pressure workload running *as* you run > > > the > > > above? I as, as it was my understanding that one of the issues was > > > the > > > long term slowdown caused by the directmap fragmentation without > > > bpf_prog_pack, and so such an application should crawl to its knees > > > over time, and there should be numbers you could show to prove that > > > too, before and after. > > > > We did have some benchmarks that showed if your direct map was totally > > fragmented (started from boot at 4k page size) what the regression was: > > > > > > https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ > > Oh yes that is a good example of effort, but I'm suggesting taking for > instance will-it-scale and run it in tandem with bpg prog pack > and measure on *both* iTLB differences, before / after, *and* doing > this again after a period of expected deterioation of the direct > map fragmentation (say after non-bpf-prog-pack shows high direct > map fragmetnation). > > This is the sort of thing which easily go into a commit log. To be honest, I don't see how experimental results with artificial benchmarks would help this set. I don't think a real workload would see a 10% speedup from this set (though we can see large percentage improvements in TLB miss rate). 
However, a 1% or even 0.5% improvement matters a lot for large-scale workloads. Thanks, Song
On Mon, Nov 14, 2022 at 12:30:49PM -0800, Song Liu wrote: > On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > My concern is that the proposed execmem_alloc() cannot be used for > > centralized handling of loading text. I'm not familiar enough with > > modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but > > my gut feeling is that the proposed execmem_alloc() won't be an improvement > > but rather a hindrance for moving to centralized handling of loading text. > > I don't follow why this could ever be a hindrance. Luis is very excited about > this, and I am very sure it works for ftrace, kprobe, and BPF. Again, it's a gut feeling. But for execmem_alloc() to be a unified place of code allocation, it has to work for all architectures. If architectures have to override it, then where is the unification? The implementation you propose is great for x86, but to see it as a unified solution it should be good at least for the major architectures. > > It feels to me that a lot of ground work is needed to get to the point > > where we can use centralized handling of loading text. > > Could you please be more specific on what is needed? The most obvious one is to implement Peter's suggestion of VM_TOPDOWN_VMAP so that execmem_alloc() can actually be used by modules. > Thanks, > Song
On Wed, Nov 16, 2022 at 02:33:37PM -0800, Luis Chamberlain wrote: > On Tue, Nov 15, 2022 at 02:48:05PM -0800, Song Liu wrote: > > On Tue, Nov 15, 2022 at 1:09 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > > > On Mon, Nov 14, 2022 at 05:30:39PM -0800, Song Liu wrote: > > > > On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote: > > > > > > > Currently, I have got the following action items for v3: > > > > 1. Add unify API to allocate text memory to motivation; > > > > 2. Update Documentation/x86/x86_64/mm.rst; > > > > 3. Allow none PMD_SIZE allocation for powerpc. > > > > > > - I am really exausted of asking again for real performance tests, > > > you keep saying you can't and I keep saying you can, you are not > > > trying hard enough. Stop thinking about your internal benchmark which > > > you cannot publish. There should be enough crap out which you can use. > > > > > > - A new selftest or set of selftests which demonstrates gain in > > > performance > > > > I didn't mean to not show the result with publically available. I just > > thought the actual benchmark was better (and we do use that to > > demonstrate the benefit of a lot of kernel improvement). > > > > For something publically available, how about the following: > > > > Run 100 instances of the following benchmark from bpf selftests: > > tools/testing/selftests/bpf/bench -w2 -d100 -a trig-kprobe > > which loads 7 BPF programs, and triggers one of them. > > > > Then use perf to monitor TLB related counters: > > perf stat -e iTLB-load-misses,itlb_misses.walk_completed_4k, \ > > itlb_misses.walk_completed_2m_4m -a > > > > The following results are from a qemu VM with 32 cores. 
> > > > Before bpf_prog_pack: > > iTLB-load-misses: 350k/s > > itlb_misses.walk_completed_4k: 90k/s > > itlb_misses.walk_completed_2m_4m: 0.1/s > > > > With bpf_prog_pack (current upstream): > > iTLB-load-misses: 220k/s > > itlb_misses.walk_completed_4k: 68k/s > > itlb_misses.walk_completed_2m_4m: 0.2/s > > > > With execmem_alloc (with this set): > > iTLB-load-misses: 185k/s > > itlb_misses.walk_completed_4k: 58k/s > > itlb_misses.walk_completed_2m_4m: 1/s > > > > Do these address your questions with this? > > More in lines with what I was hoping for. Can something just do > the parallelization for you in one shot? Can bench alone do it for you? > Is there no interest to have soemthing which generically showcases > multithreading / hammering a system with tons of eBPF JITs? It may > prove useful. > > And also, it begs the question, what if you had another iTLB generic > benchmark or genearl memory pressure workload running *as* you run the > above? I as, as it was my understanding that one of the issues was the > long term slowdown caused by the directmap fragmentation without > bpf_prog_pack, and so such an application should crawl to its knees > over time, and there should be numbers you could show to prove that > too, before and after. I'd add to that that benchmarking iTLB performance on an idle system is not very representative. TLB is a scarce resource, so it'd be interesting to see this benchmark on a loaded system. > Luis
On Thu, Nov 17, 2022 at 12:50 AM Mike Rapoport <rppt@kernel.org> wrote: > > On Mon, Nov 14, 2022 at 12:30:49PM -0800, Song Liu wrote: > > On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > > > My concern is that the proposed execmem_alloc() cannot be used for > > > centralized handling of loading text. I'm not familiar enough with > > > modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but > > > my gut feeling is that the proposed execmem_alloc() won't be an improvement > > > but rather a hindrance for moving to centralized handling of loading text. > > > > I don't follow why this could ever be a hindrance. Luis is very excited about > > this, and I am very sure it works for ftrace, kprobe, and BPF. > > Again, it's a gut feeling. But for execmem_alloc() to be a unified place of > code allocation, it has to work for all architectures. If architectures > have to override it, then where is the unification? > > The implementation you propose if great for x86, but to see it as unified > solution it should be good at least for the major architectures. As I mentioned earlier, folks are working on using bpf_prog_pack for BPF JIT on powerpc. We will also work on something similar for ARM. I guess these are good enough for major architectures? > > > > It feels to me that a lot of ground work is needed to get to the point > > > where we can use centralized handling of loading text. > > > > Could you please be more specific on what is needed? > > The most obvious one to implement Peter's suggestion with VM_TOPDOWN_VMAP > so that execmem_alloc() can be actually used by modules. Current implementation is an alternative to VM_TOPDOWN_VMAP. I am very sure it works for modules just like VM_TOPDOWN_VMAP solution. Thanks, Song
On Thu, Nov 17, 2022 at 10:36:43AM -0800, Song Liu wrote: > On Thu, Nov 17, 2022 at 12:50 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > On Mon, Nov 14, 2022 at 12:30:49PM -0800, Song Liu wrote: > > > On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > > > > > My concern is that the proposed execmem_alloc() cannot be used for > > > > centralized handling of loading text. I'm not familiar enough with > > > > modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but > > > > my gut feeling is that the proposed execmem_alloc() won't be an improvement > > > > but rather a hindrance for moving to centralized handling of loading text. > > > > > > I don't follow why this could ever be a hindrance. Luis is very excited about > > > this, and I am very sure it works for ftrace, kprobe, and BPF. > > > > Again, it's a gut feeling. But for execmem_alloc() to be a unified place of > > code allocation, it has to work for all architectures. If architectures > > have to override it, then where is the unification? > > > > The implementation you propose if great for x86, but to see it as unified > > solution it should be good at least for the major architectures. > > As I mentioned earlier, folks are working on using bpf_prog_pack for BPF > JIT on powerpc. We will also work on something similar for ARM. Does "something similar" mean that it won't use execmem_alloc() as is? > I guess these are good enough for major architectures? Sorry if I wasn't clear, I referred to a unified solution for all code allocations, not only BPF, so that execmem_alloc() will eventually replace module_alloc(). And that means it has to be able to deal with architecture-specific requirements at least on ARM and powerpc, and probably others as well. > > > > It feels to me that a lot of ground work is needed to get to the point > > > > where we can use centralized handling of loading text. > > > > > > Could you please be more specific on what is needed? 
> > > > The most obvious one to implement Peter's suggestion with VM_TOPDOWN_VMAP > > so that execmem_alloc() can be actually used by modules. > > Current implementation is an alternative to VM_TOPDOWN_VMAP. I am > very sure it works for modules just like VM_TOPDOWN_VMAP solution. It might, but it still does not. And until they do I consider these patches as an optimization for BPF rather than unification of code allocations. > Thanks, > Song
On Mon, Nov 14, 2022 at 12:45:16PM -0800, Song Liu wrote: > On Sun, Nov 13, 2022 at 2:43 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > > > If RO data causes problems with direct map fragmentation, we can use > > > similar logic. I think we will need another tree in vmalloc for this case. > > > Since the logic will be mostly identical, I personally don't think adding > > > another tree is a big overhead. > > > > Actually, it would be interesting to quantify memory savings/waste as the > > result of using execmem_alloc() > > From a random system in our fleet, execmem_alloc() saves: > > 139 iTLB entries (1x 2MB entry vs. 140x 4kB entries), which is more than > 100% of L1 iTLB and about 10% of L2 TLB. Using 2M pages saves page table entries. They might or might not be cached in the iTLB, because on a loaded system a large part of the iTLB will cache userspace mappings. > It wastes 1.5MB memory, which is 0.0023% of system memory (64GB). > > I believe this is clearly a good trade-off. The actual trade-off would be 1.5MB of memory for a yet unknown, or at least unpublished, performance improvement on a loaded system. > Thanks, > Song
On Sun, Nov 20, 2022 at 3:41 AM Mike Rapoport <rppt@kernel.org> wrote: > > On Thu, Nov 17, 2022 at 10:36:43AM -0800, Song Liu wrote: > > On Thu, Nov 17, 2022 at 12:50 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > > > On Mon, Nov 14, 2022 at 12:30:49PM -0800, Song Liu wrote: > > > > On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > > > > > > > My concern is that the proposed execmem_alloc() cannot be used for > > > > > centralized handling of loading text. I'm not familiar enough with > > > > > modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but > > > > > my gut feeling is that the proposed execmem_alloc() won't be an improvement > > > > > but rather a hindrance for moving to centralized handling of loading text. > > > > > > > > I don't follow why this could ever be a hindrance. Luis is very excited about > > > > this, and I am very sure it works for ftrace, kprobe, and BPF. > > > > > > Again, it's a gut feeling. But for execmem_alloc() to be a unified place of > > > code allocation, it has to work for all architectures. If architectures > > > have to override it, then where is the unification? > > > > > > The implementation you propose is great for x86, but to see it as unified > > > solution it should be good at least for the major architectures. > > > > As I mentioned earlier, folks are working on using bpf_prog_pack for BPF > > JIT on powerpc. We will also work on something similar for ARM. > > Does "something similar" mean that it won't use execmem_alloc() as is? "Something similar" means it will use execmem_alloc as is. We still need changes to the ARM JIT code, just like we need it for powerpc and x86. > > > I guess these are good enough for major architectures? > > Sorry if I wasn't clear, I referred to a unified solution for all code > allocations, not only BPF, so that execmem_alloc() will eventually replace > module_alloc().
> And that means it has to be able to deal with > architecture-specific requirements at least on ARM and powerpc, probably > others as well. > > > > > It feels to me that a lot of ground work is needed to get to the point > > > > > where we can use centralized handling of loading text. > > > > > > > > Could you please be more specific on what is needed? > > > > > > The most obvious one to implement Peter's suggestion with VM_TOPDOWN_VMAP > > > so that execmem_alloc() can be actually used by modules. > > > > Current implementation is an alternative to VM_TOPDOWN_VMAP. I am > > very sure it works for modules just like VM_TOPDOWN_VMAP solution. > > It might, but it still does not. And until they do I consider these > patches as an optimization for BPF rather than unification of code > allocations. We haven't got modules to use execmem_alloc yet, that's true. But this has nothing to do with VM_TOPDOWN_VMAP at all. Thanks, Song
On Mon, Nov 14 2022 at 17:30, Song Liu wrote: > On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote: > Currently, I have got the following action items for v3: > 1. Add unify API to allocate text memory to motivation; > 2. Update Documentation/x86/x86_64/mm.rst; > 3. Allow none PMD_SIZE allocation for powerpc. > > 1 and 2 are relatively simple. We can probably do 3 in a follow up patch > (as I don't have powerpc environments for testing). Did I miss anything? > > Besides these, does this set look reasonable? Andrew and Peter, could > you please share your comments on this? This is a step in the right direction, but is it a solution which has a broader benefit? I don't think so. While you are so concerned about (i)TLB usage for BPF, I'm way more concerned about modules. Just from a random server class workstation:

Total module memory: 12.4609 MB
Number of 4k PTEs: 3190

The above would nicely fit into 7 or 8 2M mappings. Guess how much memory is occupied by BPF on that machine and how much BPF contributes to iTLB pressure? In comparison to the above very close to zero. Modules have exactly the same problem as BPF, just an order of magnitude larger. So we don't need a "works" for BPF solution which comes with the handwaving assumption that it could be used for modules too. We need something which demonstrates that it solves the entire problem class. Modules are the obvious starting point. Once that is solved pretty much everything else falls into place including BPF. Without modules support this whole exercise is pointless and not going anywhere near x86. Thanks, tglx
Hi Thomas, Thanks for your comments. On Tue, Nov 29, 2022 at 2:23 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > On Mon, Nov 14 2022 at 17:30, Song Liu wrote: > > On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote: > > Currently, I have got the following action items for v3: > > 1. Add unify API to allocate text memory to motivation; > > 2. Update Documentation/x86/x86_64/mm.rst; > > 3. Allow none PMD_SIZE allocation for powerpc. > > > > 1 and 2 are relatively simple. We can probably do 3 in a follow up patch > > (as I don't have powerpc environments for testing). Did I miss anything? > > > > Besides these, does this set look reasonable? Andrew and Peter, could > > you please share your comments on this? > > This is a step into the right direction, but is it a solution which has > a broader benefit? I don't think so. > > While you are so concerned about (i)TLB usage for BPF, I'm way more > concerned about modules. Just from a random server class workstation: > > Total module memory: 12.4609 MB > Number of 4k PTEs: 3190 > > The above would nicely fit into 7 or 8 2M mappings. > > Guess how much memory is occupied by BPF on that machine and how much > BPF contributes to iTLB pressure? In comparison to the above very close > to zero. > > Modules have exactly the same problem as BPF, just an order of magnitude > larger. > > So we don't need a "works" for BPF solution which comes with the > handwaving assumption that it could be used for modules too. We need > something which demonstrates that it solves the entire problem class. > > Modules are the obvious starting point. Once that is solved pretty much > everything else falls into place including BPF. > > Without modules support this whole exercise is pointless and not going > anywhere near x86. I am not sure I fully understand your point here. 
Do you mean 1) there is something wrong with this solution that makes it not suitable for modules; or 2) the solution is in the right direction and will very likely work for modules, but we haven't finished module support? If it is 1), I would like to understand what the issues are that make it not suitable for modules. If it is 2), I think a solid, most likely working small step in the right direction is the better way, as it makes code reviews a lot easier and carries much lower risk. Does this make sense? I would also highlight that part of the benefit of this work comes from reducing direct map fragmentation. While BPF programs consume less memory, they are more dynamic and can cause more direct map fragmentation. bpf_prog_pack in the upstream kernel already covers this part, but this set is a better solution than bpf_prog_pack. Finally, I would like to point out that patches 5/6 and 6/6 of the set (v5) let BPF programs share a 2MB page with static kernel text. Therefore, even for systems without many BPF programs, we should already see some reduction in iTLB misses. Thanks, Song
Song, On Tue, Nov 29 2022 at 09:26, Song Liu wrote: > On Tue, Nov 29, 2022 at 2:23 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> Modules are the obvious starting point. Once that is solved pretty much >> everything else falls into place including BPF. >> >> Without modules support this whole exercise is pointless and not going >> anywhere near x86. > > I am not sure I fully understand your point here. Do you mean > > 1) There is something wrong with this solution, that makes it not suitable > for modules; > or > 2) The solution is in the right direction and it will very likely work > for modules. > But we haven't finished module support. ? As I'm obviously unable to express myself coherently, let me try again: A solution which solves the BPF problem, but does not solve the underlying problem of module_alloc() is not acceptable. Is that clear enough? > If it is 1), I would like to understand what are the issues that make it not > suitable for modules. If it is 2), I think a solid, mostly like working small > step toward the right direction is the better way as it makes code reviews > a lot easier and has much lower risks. Does this make sense? No. Because all you are interested in is to get your BPF itch scratched instead of actually sitting down and solving the underlying problem and thereby creating a benefit for everyone. You are not making anything easier. You are violating the basic engineering principle of "Fix the root cause, not the symptom". By doing that you are actually creating more problems than you solve. Why? Clearly your "solution" does not cover the full requirements of the module space because you solely focus on executable memory allocations which somehow magically go into the module address space. Can you coherently explain how this results in a consistent solution for the rest of the module requirements? Can you coherently explain why this wont create problems down the road for anyone who actually would be willing to solve the root cause? 
No, you can't answer any of these questions simply because you never explored the problem space sufficiently. I'm not the first one to point this out. Quite some people in the various threads regarding this issue have been pointing that out to you before. They even provided you hints on how this can be solved properly once and forever and for everyone's benefit. > I would also highlight that part of the benefit of this work comes from > reducing direct map fragmentation. While BPF programs consume less > memory, they are more dynamic and can cause more direct map > fragmentation. bpf_prog_pack in the upstream kernel already covers this > part, but this set is a better solution than bpf_prog_pack. > > Finally, I would like to point out that patches 5/6 and 6/6 of the set (v5) let BPF > programs share a 2MB page with static kernel text. Therefore, even > for systems without many BPF programs, we should already see some > reduction in iTLB misses. Can you please stop this marketing nonsense? As I pointed out to you in the very mail which you are replying to, the influence of BPF on the system I picked randomly out of the pool is pretty close to ZERO. Ergo, the reduction of iTLB misses is going to be equally close to ZERO. What is the benefit you are trying to sell me? I'm happy to run perf on this machine and provide the numbers which put your 'we should already see some reduction' handwaving into perspective. But the above is just a distraction. The real point is: You can highlight and point out the benefits of your BPF specific solution as much as you want, it does not make the fact that you are "fixing" the symptom instead of the root cause magically go away. Again for the record: The iTLB pressure problem, which affects modules, kprobes, tracing and BPF, is caused by the way module_alloc() is implemented.
That's the root cause and this needs to be solved for _ALL_ of the users of this infrastructure and not worked around by adding something which makes BPF shiny and handwaves that it solves the underlying problem. Thanks, tglx
On Mon, Nov 21, 2022 at 07:52:12AM -0700, Song Liu wrote: > On Sun, Nov 20, 2022 at 3:41 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > > > > > The most obvious one to implement Peter's suggestion with VM_TOPDOWN_VMAP > > > > so that execmem_alloc() can be actually used by modules. > > > > > > Current implementation is an alternative to VM_TOPDOWN_VMAP. I am > > > very sure it works for modules just like VM_TOPDOWN_VMAP solution. > > > > It might, but it still does not. And until they do I consider these > > patches as an optimization for BPF rather than unification of code > > allocations. > > We haven't got modules to use execmem_alloc yet, that's true. But > this has nothing to do with VM_TOPDOWN_VMAP at all. The point is that Peter suggested a way to make module_alloc() use 2M pages, so that all code allocations could benefit from it. Your execmem_alloc() isn't anywhere near this. > Thanks, > Song
Hi Thomas, On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > Song, > > On Tue, Nov 29 2022 at 09:26, Song Liu wrote: > > On Tue, Nov 29, 2022 at 2:23 AM Thomas Gleixner <tglx@linutronix.de> wrote: > >> Modules are the obvious starting point. Once that is solved pretty much > >> everything else falls into place including BPF. > >> > >> Without modules support this whole exercise is pointless and not going > >> anywhere near x86. > > > > I am not sure I fully understand your point here. Do you mean > > > > 1) There is something wrong with this solution, that makes it not suitable > > for modules; > > or > > 2) The solution is in the right direction and it will very likely work > > for modules. > > But we haven't finished module support. ? > > As I'm obviously unable to express myself coherently, let me try again: > > A solution which solves the BPF problem, but does not solve the > underlying problem of module_alloc() is not acceptable. > > Is that clear enough? While I sincerely want to provide a solution not just for BPF but also for modules and others, I don't think I fully understand the underlying problem of module_alloc(). I would like to learn more about it. > > > If it is 1), I would like to understand what are the issues that make it not > > suitable for modules. If it is 2), I think a solid, mostly like working small > > step toward the right direction is the better way as it makes code reviews > > a lot easier and has much lower risks. Does this make sense? > > No. Because all you are interested in is to get your BPF itch scratched > instead of actually sitting down and solving the underlying problem and > thereby creating a benefit for everyone. TBH, until your reply, I thought I was working on something that would benefit everyone. It is indeed not just a BPF itch, as bpf_prog_pack already scratched that for BPF. > > You are not making anything easier.
> You are violating the basic > engineering principle of "Fix the root cause, not the symptom". > I am not sure what the root cause and the symptom are here. I understand the ideas referred to in this LWN article: https://lwn.net/Articles/894557/ But I don't know which one of them (if any) would fix the root cause. > By doing that you are actually creating more problems than you > solve. Why? > > Clearly your "solution" does not cover the full requirements of the > module space because you solely focus on executable memory allocations > which somehow magically go into the module address space. > > Can you coherently explain how this results in a consistent solution > for the rest of the module requirements? > > Can you coherently explain why this wont create problems down the road > for anyone who actually would be willing to solve the root cause? > > No, you can't answer any of these questions simply because you never > explored the problem space sufficiently. I was thinking, for modules, we only need something new for module text, and module data will just use vmalloc(). I guess this is probably not the right solution? > > I'm not the first one to point this out. Quite some people in the > various threads regarding this issue have been pointing that out to you > before. They even provided you hints on how this can be solved properly > once and forever and for everyone's benefit. I tried to review various threads. Unfortunately, I am not able to identify the proper hints and construct a solution. > > > I would also highlight that part of the benefit of this work comes from > > reducing direct map fragmentation. While BPF programs consume less > > memory, they are more dynamic and can cause more direct map > > fragmentation. bpf_prog_pack in the upstream kernel already covers this > > part, but this set is a better solution than bpf_prog_pack.
> > > > Finally, I would like to point out that 5/6 and 6/6 of (v5) the set let BPF > > programs share a 2MB page with static kernel text. Therefore, even > > for systems without many BPF programs, we should already see some > > reduction in iTLB misses. > > Can you please stop this marketing nonsense? As I pointed out to you in > the very mail which your are replying to, the influence of BPF on the > system I picked randomly out of the pool is pretty close to ZERO. > > Ergo, the reduction of iTLB misses is going to be equally close to > ZERO. What is the benefit you are trying to sell me? > > I'm happy to run perf on this machine and provide the numbers which put > your 'we should already see some reduction' handwaving into perspective. > > But the above is just a distraction. The real point is: > > You can highlight and point out the benefits of your BPF specific > solution as much as you want, it does not make the fact that you are > "fixing" the symptom instead of the root cause magically go away. > > Again for the record: > > The iTLB pressure problem, which affects modules, kprobes, tracing and > BPF, is caused by the way how module_alloc() is implemented. TBH, I don't think I understand this... Do you mean the problem with module_alloc() is that it is not aware of desired permissions (W or X or neither)? If so, is permission vmalloc [1] the right direction for this? [1] https://lwn.net/ml/linux-mm/20201120202426.18009-1-rick.p.edgecombe@intel.com/ > > That's the root cause and this needs to be solved for _ALL_ of the users > of this infrastructure and not worked around by adding something which > makes BPF shiny and handwaves about that it solves the underlying > problem. While I did plan to enable 2MB pages for module text, I didn't plan to solve it in the first set. However, since you think it is possible and would like to provide directions, I am up for the challenge and will give it a try. Please share more details about the right direction. 
Otherwise, I am still lost... Thanks, Song
Song! On Wed, Nov 30 2022 at 08:18, Song Liu wrote: > On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> You are not making anything easier. You are violating the basic >> engineering principle of "Fix the root cause, not the symptom". >> > > I am not sure what is the root cause and the symptom here. The symptom is iTLB pressure. The root cause is the way module memory is allocated, which in turn causes the fragmentation into 4k PTEs. That's the same problem for anything which uses module_alloc() to get space for text allocated, e.g. kprobes, tracing.... A module consists of:

- text sections
- data sections

Except for PPC32, which has the module data in vmalloc space, all others allocate text and data sections in one lump. This en-bloc allocation is one reason for the 4k splits:

- text is RX
- data is RW or RO

Truly vmalloc'ed module data is not an option for 64bit architectures which use PC-relative addressing, as vmalloc does not guarantee that the data ends up within the limited displacement range (s32 on x86_64). This made me look at your allocator again:

> +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
> +#define EXEC_MEM_START MODULES_VADDR
> +#define EXEC_MEM_END MODULES_END
> +#else
> +#define EXEC_MEM_START VMALLOC_START
> +#define EXEC_MEM_END VMALLOC_END
> +#endif

The #else part is completely broken on x86/64 and any other architecture which has PC-relative restricted displacement. Even if modules are disabled in Kconfig, the only safe place to allocate executable kernel text from (on these architectures) is the modules address space. The ISA restrictions do not go magically away when modules are disabled. In the early version of the SKX retbleed mitigation work I had https://lore.kernel.org/all/20220716230953.442937066@linutronix.de exactly to handle this correctly for the !MODULE case. It went nowhere as we did not need the trampolines in the final version.
This is why Peter suggested to 'split' the module address range into a top down and bottom up part: https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/ That obviously separates text and data, but keeps everything within the defined working range. It immediately solves the text problem for _all_ module_alloc() users and still leaves the data split into 4k pages due to RO/RW sections. But after staring at it for a while I think this top down and bottom up dance is too much effort for not much gain. The module address space is sized generously, so the straightforward solution is to split that space into two blocks and use them to allocate text and data separately. The rest of Peter's suggestions on how to migrate there still apply. The init sections of a module are obviously separate as they are freed after the module is initialized, but they are not really special either. Today they leave holes in the address range. With the new scheme these holes will be in the memory-backed large mapping, but I don't see a real issue with that, especially as those holes, at least in text, can be reused for small allocations (kprobes, trace, bpf). As a logical next step we make that three blocks and allocate text, data and rodata separately, which will preserve the large mappings for text and data. rodata still needs to be split because we need a space to accommodate ro_after_init data. Alternatively, instead of splitting the module address space, the allocation mechanism can keep track of the types (text, data, rodata) and manage large mapping blocks per type. There are pros and cons for both approaches, so that needs some thought. But at the end we want an allocation mechanism which:

- preserves large mappings
- handles a distinct address range
- is mapping type aware

That solves _all_ the issues of modules, kprobes, tracing, bpf in one go. See? Thanks, tglx
Hi Thomas, Thanks for these insights! They are really helpful! On Thu, Dec 1, 2022 at 1:08 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > Song! > > On Wed, Nov 30 2022 at 08:18, Song Liu wrote: > > On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner <tglx@linutronix.de> wrote: > >> You are not making anything easier. You are violating the basic > >> engineering principle of "Fix the root cause, not the symptom". > >> > > > > I am not sure what is the root cause and the symptom here. > [...] > > This made me look at your allocator again: > > > +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR) > > +#define EXEC_MEM_START MODULES_VADDR > > +#define EXEC_MEM_END MODULES_END > > +#else > > +#define EXEC_MEM_START VMALLOC_START > > +#define EXEC_MEM_END VMALLOC_END > > +#endif > > The #else part is completely broken on x86/64 and any other > architecture, which has PC relative restricted displacement. Yeah, the #else part is just to make it build. It is not really usable. > > Even if modules are disabled in Kconfig the only safe place to allocate > executable kernel text from (on these architectures) is the modules > address space. The ISA restrictions do not go magically away when > modules are disabled. > > In the early version of the SKX retbleed mitigation work I had > > https://lore.kernel.org/all/20220716230953.442937066@linutronix.de > > exactly to handle this correctly for the !MODULE case. It went nowhere > as we did not need the trampolines in the final version. I remember there was some other work to use module_alloc for ftrace, etc. without CONFIG_MODULES. One of these versions would work here. > > This is why Peter suggested to 'split' the module address range into a > top down and bottom up part: > > https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/ > > That obviously separates text and data, but keeps everything within the > defined working range. 
> > It immediately solves the text problem for _all_ module_alloc() users > and still leaves the data split into 4k pages due to RO/RW sections. > > But after staring at it for a while I think this top down and bottom up > dance is too much effort for not much gain. The module address space is > sized generously, so the straight forward solution is to split that > space into two blocks and use them to allocate text and data separately. > > The rest of Peter's suggestions how to migrate there still apply. > > The init sections of a module are obviously separate as they are freed > after the module is initialized, but they are not really special either. > Today they leave holes in the address range. With the new scheme these > holes will be in the memory backed large mapping, but I don't see a real > issue with that, especially as those holes at least in text can be > reused for small allocations (kprobes, trace, bpf). > > As a logical next step we make that three blocks and allocate text, > data and rodata separately, which will preserve the large mappings for > text and data. rodata still needs to be split because we need a space to > accomodate ro_after_init data. > > Alternatively, instead of splitting the module address space, the > allocation mechanism can keep track of the types (text, data, rodata) > and manage large mapping blocks per type. There are pros and cons for > both approaches, so that needs some thought. AFAICT, the new allocator (let's call it module_alloc_new here) requires quite some different logic than the existing vmalloc logic (or module_alloc logic): 1. vmalloc is at least PAGE_SIZE granularity; while ftrace, bpf etc would benefit from a much smaller granularity. 2. vmalloc maintains 1-to-1 mapping between virtual address range (vmap in vmap_area_root) and physical pages (vm_struct); while module_alloc_new allocates physical pages in 2MB chunks, and maintains multiple vmap within a single 2MB chunk. 
To solve this, I introduced a new tree, free_text_area_root; address ranges in this tree are backed with ROX physical pages but not yet used by any user. I think some logic like this is always needed. With this logic in place, I think we don't really need to split the module address space. Instead, we can have 3 trees:

free_module_text_area_root;
free_module_data_area_root;
free_module_ro_data_area_root;

Similar to free_text_area_root, we add virtual addresses and physical pages to these trees in 2MB chunks, and hand virtual address ranges out to users at smaller granularity. What do you think about this idea? > > But at the end we want an allocation mechanism which: > > - preserves large mappings > - handles a distinct address range > - is mapping type aware > > That solves _all_ the issues of modules, kprobes, tracing, bpf in one > go. See? I think the user still needs to use module_alloc_new() differently. At the moment, the user does something like:

my_text = module_alloc(size);
set_vm_flush_reset_perms(my_text);
update_my_text(my_text);
set_memory_ro(my_text);
set_memory_x(my_text);
/* use my_text */

With module_alloc_new(), the my_text buffer is RX right out of the allocator, so some text_poke mechanism is needed. In some cases, the user also needs some logic to handle relative calls/jumps. It is something like:

my_text = module_alloc_new(size, MODULE_MEM_TEXT);
my_tmp_buf = vmalloc(size);
update_my_text(my_tmp_buf);
adjust_rela_calls(my_tmp_buf, my_text);
text_poke_copy(my_text, my_tmp_buf, size);
vfree(my_tmp_buf);
/* use my_text */

There are also archs that do not support text_poke, so we need some logic, especially for modules, to handle them properly.
For example, Rick suggested something like: For non-text_poke() architectures, the way you can make it work is have the API look like:

execmem_alloc() <- Does the allocation, but not necessarily usable yet
execmem_write() <- Loads the mapping, doesn't work after finish()
execmem_finish() <- Makes the mapping live (loaded, executable, ready)

So for text_poke():

execmem_alloc() <- reserves the mapping
execmem_write() <- text_pokes() to the mapping
execmem_finish() <- does nothing

And non-text_poke():

execmem_alloc() <- Allocates a regular RW vmalloc allocation
execmem_write() <- Writes normally to it
execmem_finish() <- does set_memory_ro()/set_memory_x() on it

Does this sound like the best path forward to you? Also, do you have suggestions on the name of the API? Maybe something like:

enum module_mem_type {
    MODULE_MEM_TEXT,
    MODULE_MEM_DATA,
    MODULE_MEM_RODATA,
};

module_alloc_type(size_t len, enum module_mem_type type);
module_free_type(ptr); /* I guess we may or may not need the type here */

Thanks, Song
On Thu, Dec 01, 2022 at 10:08:18AM +0100, Thomas Gleixner wrote: > Song! > > On Wed, Nov 30 2022 at 08:18, Song Liu wrote: > > On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner <tglx@linutronix.de> wrote: > >> You are not making anything easier. You are violating the basic > >> engineering principle of "Fix the root cause, not the symptom". > >> > > > > I am not sure what is the root cause and the symptom here. > > The symptom is iTLB pressure. The root cause is the way module > memory is allocated, which in turn causes the fragmentation into > 4k PTEs. That's the same problem for anything which uses module_alloc() > to get space for text allocated, e.g. kprobes, tracing.... There's also dTLB pressure caused by the fragmentation of the direct map. The memory allocated with module_alloc() is a priori mapped with 4k PTEs, but setting RO in the vmalloc address space also updates the direct map alias and this causes splits of large pages. It's not clear what causes more performance improvement: avoiding splits of large pages in the direct map or reducing iTLB pressure by backing text memory with 2M pages. If the major improvement comes from keeping the direct map intact, it might be possible to mix data and text in the same 2M page. > A module consists of: > > - text sections > - data sections > > Except for PPC32, which has the module data in vmalloc space, all others > allocate text and data sections in one lump.
> > This en-bloc allocation is one reason for the 4k splits: > > - text is RX > - data is RW or RO > > Truly vmalloc'ed module data is not an option for 64bit architectures > which use PC relative addressing as vmalloc does not guarantee that the > data ends up within the limited displacement range (s32 on x8664) > > This made me look at your allocator again: > > > +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR) > > +#define EXEC_MEM_START MODULES_VADDR > > +#define EXEC_MEM_END MODULES_END > > +#else > > +#define EXEC_MEM_START VMALLOC_START > > +#define EXEC_MEM_END VMALLOC_END > > +#endif > > The #else part is completely broken on x86/64 and any other > architecture, which has PC relative restricted displacement. > > Even if modules are disabled in Kconfig the only safe place to allocate > executable kernel text from (on these architectures) is the modules > address space. The ISA restrictions do not go magically away when > modules are disabled. > > In the early version of the SKX retbleed mitigation work I had > > https://lore.kernel.org/all/20220716230953.442937066@linutronix.de > > exactly to handle this correctly for the !MODULE case. It went nowhere > as we did not need the trampolines in the final version. > > This is why Peter suggested to 'split' the module address range into a > top down and bottom up part: > > https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/ > > That obviously separates text and data, but keeps everything within the > defined working range. > > It immediately solves the text problem for _all_ module_alloc() users > and still leaves the data split into 4k pages due to RO/RW sections. > > But after staring at it for a while I think this top down and bottom up > dance is too much effort for not much gain. The module address space is > sized generously, so the straight forward solution is to split that > space into two blocks and use them to allocate text and data separately. 
> The rest of Peter's suggestions on how to migrate there still apply.
>
> The init sections of a module are obviously separate as they are freed
> after the module is initialized, but they are not really special either.
> Today they leave holes in the address range. With the new scheme these
> holes will be in the memory backed large mapping, but I don't see a real
> issue with that, especially as those holes at least in text can be
> reused for small allocations (kprobes, trace, bpf).
>
> As a logical next step we make that three blocks and allocate text,
> data and rodata separately, which will preserve the large mappings for
> text and data. rodata still needs to be split because we need a space to
> accommodate ro_after_init data.
>
> Alternatively, instead of splitting the module address space, the
> allocation mechanism can keep track of the types (text, data, rodata)
> and manage large mapping blocks per type. There are pros and cons for
> both approaches, so that needs some thought.
>
> But at the end we want an allocation mechanism which:
>
> - preserves large mappings
> - handles a distinct address range
> - is mapping type aware
>
> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
> go. See?

There is also

- handles kaslr

and at least for arm and powerpc we'd also need

- handles architecture specific range restrictions and fallbacks

> Thanks,
>
> tglx
Mike! On Thu, Dec 01 2022 at 22:23, Mike Rapoport wrote: > On Thu, Dec 01, 2022 at 10:08:18AM +0100, Thomas Gleixner wrote: >> On Wed, Nov 30 2022 at 08:18, Song Liu wrote: >> The symptom is iTLB pressure. The root cause is the way how module >> memory is allocated, which in turn causes the fragmentation into >> 4k PTEs. That's the same problem for anything which uses module_alloc() >> to get space for text allocated, e.g. kprobes, tracing.... > > There's also dTLB pressure caused by the fragmentation of the direct map. > The memory allocated with module_alloc() is a priori mapped with 4k PTEs, > but setting RO in the malloc address space also updates the direct map > alias and this causes splits of large pages. > > It's not clear what causes more performance improvement: avoiding splits of > large pages in the direct map or reducing iTLB pressure by backing text > memory with 2M pages. From our experiments when doing the first version of the SKX retbleed mitigation, the main improvement came from reducing iTLB pressure simply because the iTLB cache is really small. The kernel text placement is way beyond suboptimal. If you really do a hotpath analysis and (manually) place all hot code into one or two 2M pages, then you can achieve massive performance improvements way above the 10% range. We currently have a master student investigating this, but it will take some time until usable results materialize. > If the major improvement comes from keeping direct map intact, it's > might be possible to mix data and text in the same 2M page. No. That can't work. text = RX data = RW or RO If you mix this, then you end up with RWX for the whole 2M page. Not an option really as you lose _all_ protections in one go. That's why I said: >> As a logical next step we make that three blocks and allocate text, >> data and rodata separately, which will preserve the large mappings for >> text and data. 
>> rodata still needs to be split because we need a space to
>> accommodate ro_after_init data.

The point is, that rodata and ro_after_init_data is a pretty small
portion of modules as far as my limited analysis of a distro build
shows.

The bulk is in text and data. So if we preserve 2M pages for text and
for RW data and bite the bullet to split one 2M page for
ro[_after_init_]data, we get the maximum benefit for the least
complexity.

>> But at the end we want an allocation mechanism which:
>>
>> - preserves large mappings
>> - handles a distinct address range
>> - is mapping type aware
>>
>> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
>> go. See?
>
> There is also
>
> - handles kaslr
>
> and at least for arm and powerpc we'd also need
>
> - handles architecture specific range restrictions and fallbacks

Good points. kaslr should be fairly trivial. The architecture specific
restrictions and fallbacks are not really hard to solve either. If done
right then the allocator just falls back to 4k maps during
initialization in early boot, which brings it back to the status quo.

But we can provide consistent semantics for the three types which are
required for modules and the text only usage for kprobes, tracing,
bpf...

Thanks,

tglx
Song! On Thu, Dec 01 2022 at 11:31, Song Liu wrote: > Thanks for these insights! They are really helpful! I'm glad it helped and did not create more confusion :) > On Thu, Dec 1, 2022 at 1:08 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> This made me look at your allocator again: >> >> > +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR) >> > +#define EXEC_MEM_START MODULES_VADDR >> > +#define EXEC_MEM_END MODULES_END >> > +#else >> > +#define EXEC_MEM_START VMALLOC_START >> > +#define EXEC_MEM_END VMALLOC_END >> > +#endif >> >> The #else part is completely broken on x86/64 and any other >> architecture, which has PC relative restricted displacement. > > Yeah, the #else part is just to make it build. It is not really > usable. That's not an option for code which is submitted for inclusion. It can be used and will be used. You know that already, right? So why are you posting such broken crap? >> Even if modules are disabled in Kconfig the only safe place to allocate >> executable kernel text from (on these architectures) is the modules >> address space. The ISA restrictions do not go magically away when >> modules are disabled. >> >> In the early version of the SKX retbleed mitigation work I had >> >> https://lore.kernel.org/all/20220716230953.442937066@linutronix.de >> >> exactly to handle this correctly for the !MODULE case. It went nowhere >> as we did not need the trampolines in the final version. > > I remember there was some other work to use module_alloc for ftrace, etc. > without CONFIG_MODULES. One of these versions would work here. We don't want any of these. The new infrastructure which is set out to replace module_alloc() has to be designed upfront to handle the CONFIG_MODULES=n case properly. >> Alternatively, instead of splitting the module address space, the >> allocation mechanism can keep track of the types (text, data, rodata) >> and manage large mapping blocks per type. 
>> There are pros and cons for
>> both approaches, so that needs some thought.
>
> AFAICT, the new allocator (let's call it module_alloc_new here)
> requires quite some different logic than the existing vmalloc logic (or
> module_alloc logic):

Obviously.

> 1. vmalloc is at least PAGE_SIZE granularity; while ftrace, bpf etc would
> benefit from a much smaller granularity.

Even modules can benefit from that. The fact that modules have all
sections (text, data, rodata) page aligned and page granular is not due
to a requirement of modules, it's so because that's how module_alloc()
works and the module layout has been adopted to it.

text has an architecture/compile time specific alignment requirement for
function entries, data obviously has its own alignment requirements, but
none of those are page granular.

> 2. vmalloc maintains 1-to-1 mapping between virtual address range (vmap
> in vmap_area_root) and physical pages (vm_struct); while
> module_alloc_new allocates physical pages in 2MB chunks, and
> maintains multiple vmap within a single 2MB chunk.
>
> To solve this, I introduced a new tree free_text_area_root, address spaces
> in this tree is backed with ROX physical pages, but not used by any
> user.

Can you please refrain from discussing implementation details _before_
the conceptual details are sorted? Implementation details are completely
irrelevant at this point, really. Quite the contrary, they just add
confusion to the discussion.

The basic concept is:

	module_alloc_type(type, size)
	{
		if (!size || invalid_type(type))
			return -ENOPONIES;

		mutex_lock(&m);
		if (!space_available(type, size))
			alloc_new_blocks(type, size);

		p = alloc_space(type, size);
		mutex_unlock(&m);
		return p;
	}

and the counterpart:

	module_free_type(p)
	{
		if (!p)
			return;

		mutex_lock(&m);
		x = lookup(p);
		type = x->type;
		free_space(x) {
			...
			if (unused_blocks(type))
				free_unused_blocks(type);
		}
		mutex_unlock(&m);
	}

> With this logic in place, I think we don't really need to split the module
> address space. Instead, we can have 3 trees:
>
>	free_module_text_area_root;
>	free_module_data_area_root;
>	free_module_ro_data_area_root;
>
> Similar to free_text_area_root, we add virtual address and physical pages
> to these trees in 2MB chunks, and hand virtual address ranges out to
> users in smaller granularity.

Whatever the actual tracking mechanism is, does not matter. This is
pretty much at the conceptual level the same as what I said before:

>> Alternatively, instead of splitting the module address space, the
>> allocation mechanism can keep track of the types (text, data, rodata)
>> and manage large mapping blocks per type. There are pros and cons for
>> both approaches, so that needs some thought.

Right?

You have to be aware, that the rodata space needs to be page granular
while text and data can really aggregate below the page alignment, but
again might have different alignment requirements.

So you need a configuration mechanism which allows to specify per type:

	- Initial mapping type (RX, RWX, RW)
	- Alignment
	- Granularity
	- Address space restrictions

With those 4 parameters you can map all architecture requirements
including kaslr to each type.

>> But at the end we want an allocation mechanism which:
>>
>> - preserves large mappings
>> - handles a distinct address range
>> - is mapping type aware
>>
>> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
>> go. See?
>
> I think the user still needs to use module_alloc_new() differently. At
> the moment, the user does something like.

Neither me nor Peter said, that this ends up as a 1:1 replacement for
the actual module_alloc() users, but that's obvious and not the point.

The point is that the new infrastructure provides a solution which
solves _all_ issues of those users in one go.
The fact that we need to adjust the call sites is more than obvious, but this can be done with the well known methods of code refactoring. Step 1: module_alloc_text(size) { if (!size) return -ENOPONIES; mutex_lock(&m); p = module_alloc(size); mutex_unlock(&m); if (p) set_memory_ro+x(p, size); return p; } and module_write_text(p, src, size) { // Fill in the architecture specific magic depending on // the mapping type for text: if (ARCH_TEXT_MAP_TYPE == RX) { text_poke(); } else { memcpy(); set_memory_ro+x(); } } Step 2: Convert the usage sites outside of the core module code which really only care about text allocations one by one: - p = module_alloc(size); - memcpy(p, src, size); + p = module_alloc_text(size); + module_write_text(p, src, size); After that the only user of module_alloc() left is the module core code. Step 3: Provide functions to prepare for converting the module core. You need to have module_alloc_data() and module_alloc_rodata() where both implement in the first step: module_alloc_[ro]data(size) { if (!size) return -ENOPONIES; mutex_lock(&m); p = module_alloc(size); mutex_unlock(&m); if (p) set_memory_rw-nox(p, size); return p; } Step 4: Fixup the module code. Break up the en-bloc allocation and allocate text, data, and rodata separately and adjust the methods to write into them. For text that's obviously module_write_text(), but for the [ro]data mappings memcpy() is still fine. For the rodata mapping you need set_memory_ro() right in the module prepare stage and for the ro_after_init_data() you do that after the module init function returns success, which is pretty much what the code does today. Step 5: Now once this is sorted, you add the new allocation magic. 
Step 6: Then you do: module_alloc_text(size) { if (!size) return -ENOPONIES; mutex_lock(&m); - p = module_alloc(size); + if (IS_ENABLED(CONFIG_MODULE_ALLOC_NEWFANGLED)) + p = module_alloc_type(TYPE_TEXT, size); + else + p = module_alloc(size); mutex_unlock(&m); - if (p) + if (!IS_ENABLED(CONFIG_MODULE_ALLOC_NEWFANGLED) && p) set_memory_ro+x(p, size); return p; } and the corresponding changes to module_alloc_[ro]data() Step 7: Then you go and enable it per architecture: + select CONFIG_MODULE_ALLOC_NEWFANGLED + module_alloc_newfangled_init(&type_parameters); Step 8: Once everything is converted over you can remove CONFIG_MODULE_ALLOC_NEWFANGLED and the associated conversion cruft. > my_text = module_alloc(size); <SNIP .... /> > Also, do you have suggestions on the name of the API? Again, API and implementation details are pointless bikeshedding material at this point. I really have a hard time to understand why you are so focussed on implementation details instead of establishing a common understanding of the problem, the goals and the concepts in the first place. Linus once said: "Bad programmers worry about the code. Good programmers worry about data structures and their relationships." He's absolutely right. Here is my version of it: The order of things to worry about: 1) Problem analysis 2) Concepts 3) Data structures and their relationships 4) Code #1 You need to understand the problem fully to come up with concepts #2 Once you understand the problem fully you can talk about concepts to solve it #3 Maps the concept to data structures and forms relationships #4 Is the logical consequence of #1 + #2 + #3 and because your concept makes sense, the data structures and their relationships are understandable, the code becomes understandable too. If any of the steps finds a gap in the previous ones, then you have to go back and solve those first. Any attempt to reorder the above is putting the cart before the horse and a guarantee for failure. 
Now go back and carefully read up on what I wrote above and in my
previous mail.

The previous mail was mostly about #1 to explain the problem as broadly
as possible and an initial stab at #2 suggesting concepts to solve it.

This one is still covering some aspects of #1, but it is mostly about #2
and more focussed on particular aspects of the concept. If you look at
it carefully then you find some bits which map to #3 but still at the
conceptual level.

Did I talk about code or implementation details?

Not at all and I'm not going to do so before #1 and #2 are agreed
on. The above pseudo code snippets are just for illustration and I used
them because I was too lazy to write a novel, but they all are still at
the conceptual level.

Now you can rightfully argue that if you stitch those snippets together
then they form a picture which outlines the implementation, but that's
the whole purpose of this exercise, right?

They still do not worry about any of the implementation details of root
trees, APIs or whatever simply because they are completely irrelevant at
this point of the discussion.

See?

Thanks,

tglx
Hi Thomas, Thanks for all these suggestions! On Thu, Dec 1, 2022 at 5:38 PM Thomas Gleixner <tglx@linutronix.de> wrote: > [...] (everything snipped here makes perfect sense). > > You have to be aware, that the rodata space needs to be page granular > while text and data can really aggregate below the page alignment, but > again might have different alignment requirements. I don't quite follow why rodata space needs to be page granular. If text can go below page granular, rodata should also do that, no? > > So you need a configuration mechanism which allows to specify per type: > > - Initial mapping type (RX, RWX, RW) > - Alignment > - Granularity > - Address space restrictions [...] > > Step 1: > These steps are really helpful. Thanks! [...] > > For text that's obviously module_write_text(), but for the [ro]data > mappings memcpy() is still fine. For the rodata mapping you need > set_memory_ro() right in the module prepare stage and for the > ro_after_init_data() you do that after the module init function returns > success, which is pretty much what the code does today. I guess this is related to rodata needs to be page granular? But I don't think I got the idea. Do we allow rodata and rwdata share the same 2MB page? ro_after_init_data seems trickier. > > Step 5: > [...] > > Linus once said: > > "Bad programmers worry about the code. Good programmers worry about > data structures and their relationships." > > He's absolutely right. 
Here is my version of it: > > The order of things to worry about: > > 1) Problem analysis > 2) Concepts > 3) Data structures and their relationships > 4) Code > > #1 You need to understand the problem fully to come up with > concepts > > #2 Once you understand the problem fully you can talk about > concepts to solve it > > #3 Maps the concept to data structures and forms relationships > > #4 Is the logical consequence of #1 + #2 + #3 and because your > concept makes sense, the data structures and their > relationships are understandable, the code becomes > understandable too. > > If any of the steps finds a gap in the previous ones, then you > have to go back and solve those first. > > Any attempt to reorder the above is putting the cart before the horse > and a guarantee for failure. Thanks for these advices! They would help me for many years. > > Now go back and carefully read up on what I wrote above and in my > previous mail. > > The previous mail was mostly about #1 to explain the problem as broad as > possible and an initial stab at #2 suggesting concepts to solve it. > > This one is still covering some aspects of #1, but it is mostly about #2 > and more focussed on particular aspects of the concept. If you look at > it carefully then you find some bits which map to #3 but still at the > conceptual level. > > Did I talk about code or implementation details? > > Not at all and I'm not going to do so before #1 and #2 are agreed > on. The above pseudo code snippets are just for illustration and I used > them because I was too lazy to write a novel, but they all are still at > the conceptual level. > > Now you can rightfully argue that if you stich those snippets together > then they form a picture which outlines the implementation, but that's > the whole purpose of this exercise, right? I guess I will do my homework, and come back with as much information as possible for #1 + #2 + #3. Then, we can discuss whether it makes sense at all. 
Does this sound like the right approach? Thanks, Song
Song! On Fri, Dec 02 2022 at 00:38, Song Liu wrote: > Thanks for all these suggestions! Welcome. > On Thu, Dec 1, 2022 at 5:38 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> You have to be aware, that the rodata space needs to be page granular >> while text and data can really aggregate below the page alignment, but >> again might have different alignment requirements. > > I don't quite follow why rodata space needs to be page granular. If text can > go below page granular, rodata should also do that, no? Of course it can, except for the case of ro_after_init_data, because that needs to be RW during module_init() and is then switched to RO when module_init() returns success. So for that you need page granular maps per module, right? Sure you can have a separate space for rodata and ro_after_init_data, but as I said to Mike: "The point is, that rodata and ro_after_init_data is a pretty small portion of modules as far as my limited analysis of a distro build shows. The bulk is in text and data. So if we preserve 2M pages for text and for RW data and bite the bullet to split one 2M page for ro[_after_init_]data, we get the maximum benefit for the least complexity." So under the assumption that rodata is small, it's questionable whether the split of rodata and ro_after_init_data makes a lot of difference. It might, but that needs to be investigated. That's not a fundamental conceptual problem because adding a 4th type to the concept we outlined so far is straight forward, right? > I guess I will do my homework, and come back with as much information > as possible for #1 + #2 + #3. Then, we can discuss whether it makes > sense at all. Correct. Please have a close look at the 11 architecture specific module_alloc() variants so you can see what kind of tweaks and magic they need, which lets you better specify the needs for the initialization parameter set required. > Does this sound like the right approach? Very much so. Thanks, tglx
On 02/12/2022 at 02:38, Thomas Gleixner wrote:
>
>>> Alternatively, instead of splitting the module address space, the
>>> allocation mechanism can keep track of the types (text, data, rodata)
>>> and manage large mapping blocks per type. There are pros and cons for
>>> both approaches, so that needs some thought.
>>
>> AFAICT, the new allocator (let's call it module_alloc_new here)
>> requires quite some different logic than the existing vmalloc logic (or
>> module_alloc logic):
>
> Obviously.
>
>> 1. vmalloc is at least PAGE_SIZE granularity; while ftrace, bpf etc would
>> benefit from a much smaller granularity.
>
> Even modules can benefit from that. The fact that modules have all
> sections (text, data, rodata) page aligned and page granular is not due
> to a requirement of modules, it's so because that's how module_alloc()
> works and the module layout has been adopted to it.

Sections are page aligned only when STRICT_MODULE_RWX is selected. See
commit 3b5be16c7e90 ("modules: page-align module section allocations
only for arches supporting strict module rwx")

Today it is implemented as follows:

	static inline unsigned int strict_align(unsigned int size)
	{
		if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
			return PAGE_ALIGN(size);
		else
			return size;
	}

Christophe
On Fri, Dec 02 2022 at 10:46, Christophe Leroy wrote: > Le 02/12/2022 à 02:38, Thomas Gleixner a écrit : >> Even modules can benefit from that. The fact that modules have all >> sections (text, data, rodata) page aligned and page granular is not due >> to an requirement of modules, it's so because that's how module_alloc() >> works and the module layout has been adopted to it. > > Sections are page aligned only when STRICT_MODULE_RWX is selected. Correct, but without strict permission separation we would not debate this at all. Everything would be RWX and fine. For separation my point still stands that the problem is that module_alloc() is just doing an en-bloc allocation, which needs to be split into RX, RW, RO afterwards and that consequently splits the large mappings apart. Which in turn means text, data, rodata have to be page aligned and page granular. The typed approach and having a mechanism to preserve the underlying large page mappings is the broadest scope we have to cover. An RWX only architecture is just the most trivial case of such an infrastructure. The types become all the same, the underlying page size does not matter, but it's just a configuration variant. Architectures which support strict separation, but only small pages are a configuration variant too. All of them can use the same infrastructure and the same API to alloc/free and write/update the allocated memory. The configuration will take care to pick the appropriate mechanisms. No? Thanks, tglx
Hi Thomas, On Thu, Dec 01, 2022 at 11:34:57PM +0100, Thomas Gleixner wrote: > Mike! > > On Thu, Dec 01 2022 at 22:23, Mike Rapoport wrote: > > On Thu, Dec 01, 2022 at 10:08:18AM +0100, Thomas Gleixner wrote: > >> On Wed, Nov 30 2022 at 08:18, Song Liu wrote: > >> The symptom is iTLB pressure. The root cause is the way how module > >> memory is allocated, which in turn causes the fragmentation into > >> 4k PTEs. That's the same problem for anything which uses module_alloc() > >> to get space for text allocated, e.g. kprobes, tracing.... > > > > There's also dTLB pressure caused by the fragmentation of the direct map. > > The memory allocated with module_alloc() is a priori mapped with 4k PTEs, > > but setting RO in the malloc address space also updates the direct map > > alias and this causes splits of large pages. > > > > It's not clear what causes more performance improvement: avoiding splits of > > large pages in the direct map or reducing iTLB pressure by backing text > > memory with 2M pages. > > From our experiments when doing the first version of the SKX retbleed > mitigation, the main improvement came from reducing iTLB pressure simply > because the iTLB cache is really small. > > The kernel text placement is way beyond suboptimal. If you really do a > hotpath analysis and (manually) place all hot code into one or two 2M > pages, then you can achieve massive performance improvements way above > the 10% range. > > We currently have a master student investigating this, but it will take > some time until usable results materialize. > > > If the major improvement comes from keeping direct map intact, it's > > might be possible to mix data and text in the same 2M page. > > No. That can't work. > > text = RX > data = RW or RO > > If you mix this, then you end up with RWX for the whole 2M page. Not an > option really as you lose _all_ protections in one go. I meant to take one 2M page from the direct map and split it to 4K in the module address space. 
Then the protection could be done at PTE level after relocations etc and it would save the dance with text poking. But if mapping the code with 2M pages gives massive performance improvements, it's surely better to keep 2M pages in the modules space. > Thanks, > > tglx
On Sat, Dec 03 2022 at 16:46, Mike Rapoport wrote: > On Thu, Dec 01, 2022 at 11:34:57PM +0100, Thomas Gleixner wrote: >> If you mix this, then you end up with RWX for the whole 2M page. Not an >> option really as you lose _all_ protections in one go. > > I meant to take one 2M page from the direct map and split it to 4K in the > module address space. Then the protection could be done at PTE level after > relocations etc and it would save the dance with text poking. I see what you meant. > But if mapping the code with 2M pages gives massive performance > improvements, it's surely better to keep 2M pages in the modules > space. Yes.
Hi Thomas, Thanks again for your suggestions. Here is my homework so far. On Fri, Dec 2, 2022 at 1:22 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > Song! > > On Fri, Dec 02 2022 at 00:38, Song Liu wrote: > > Thanks for all these suggestions! > > Welcome. > > > On Thu, Dec 1, 2022 at 5:38 PM Thomas Gleixner <tglx@linutronix.de> wrote: > >> You have to be aware, that the rodata space needs to be page granular > >> while text and data can really aggregate below the page alignment, but > >> again might have different alignment requirements. > > > > I don't quite follow why rodata space needs to be page granular. If text can > > go below page granular, rodata should also do that, no? > > Of course it can, except for the case of ro_after_init_data, because > that needs to be RW during module_init() and is then switched to RO when > module_init() returns success. So for that you need page granular maps > per module, right? > > Sure you can have a separate space for rodata and ro_after_init_data, > but as I said to Mike: > > "The point is, that rodata and ro_after_init_data is a pretty small > portion of modules as far as my limited analysis of a distro build > shows. > > The bulk is in text and data. So if we preserve 2M pages for text and > for RW data and bite the bullet to split one 2M page for > ro[_after_init_]data, we get the maximum benefit for the least > complexity." > > So under the assumption that rodata is small, it's questionable whether > the split of rodata and ro_after_init_data makes a lot of difference. It > might, but that needs to be investigated. > > That's not a fundamental conceptual problem because adding a 4th type to > the concept we outlined so far is straight forward, right? > > > I guess I will do my homework, and come back with as much information > > as possible for #1 + #2 + #3. Then, we can discuss whether it makes > > sense at all. > > Correct. 
> Please have a close look at the 11 architecture specific
> module_alloc() variants so you can see what kind of tweaks and magic
> they need, which lets you better specify the needs for the
> initialization parameter set required.

Survey of the 11 architecture specific module_alloc() variants. They
basically do the following magic:

1. Modify MODULES_VADDR and/or MODULES_END. There are multiple
   reasons behind this, some arch does this for KASLR, some other archs
   have different MODULES_[VADDR|END] for different processors (32b vs.
   64b for example), some archs use some module address space for other
   things (i.e. _exiprom on arm).

   Archs need 1: x86, arm64, arm, mips, ppc, riscv, s390, loongarch,
   sparc

2. Use kasan_alloc_module_shadow()

   Archs need 2: x86, arm64, s390

3. A secondary module address space. There is a smaller preferred
   address space for modules. Once the preferred space runs out,
   allocate memory from a secondary address space.

   Archs need 3: some ppc, arm, arm64 (PLTs on arm and arm64)

4. Use different pgprot_t (PAGE_KERNEL, PAGE_KERNEL_EXEC, etc.)

5. sparc does memset(ptr, 0, size) in module_alloc()

6. nios2 uses kmalloc() for modules. Based on the comment, this is
   probably only because it needs different MODULES_[VADDR|END].

I think we can handle all these with a single module_alloc() and a few
module_arch_* functions():

	unsigned long module_arch_vaddr(void);
	unsigned long module_arch_end(void);
	unsigned long module_arch_secondary_vaddr(void);
	unsigned long module_arch_secondary_end(void);
	pgprot_t module_arch_pgprot(alloc_type type);
	void *module_arch_initialize(void *s, size_t n);
	bool module_arch_do_kasan_shadow(void);

So module_alloc() would look like:

	void *module_alloc(unsigned long size, pgprot_t prot,
			   unsigned long align, unsigned long granularity,
			   alloc_type type)
	{
		unsigned long vm_flags = VM_FLUSH_RESET_PERMS |
			(module_arch_do_kasan_shadow() ? VM_DEFER_KMEMLEAK : 0);
		void *ptr;

		ptr = __vmalloc_node_range(size, align, module_arch_vaddr(),
					   module_arch_end(), GFP_KERNEL,
					   module_arch_pgprot(type), vm_flags,
					   NUMA_NO_NODE,
					   __builtin_return_address(0));
		if (!ptr && module_arch_secondary_vaddr() != module_arch_secondary_end())
			ptr = __vmalloc_node_range(size, align,
						   module_arch_secondary_vaddr(),
						   module_arch_secondary_end(),
						   GFP_KERNEL,
						   module_arch_pgprot(type),
						   vm_flags, NUMA_NO_NODE,
						   __builtin_return_address(0));
		if (!ptr)
			return NULL;

		if (module_arch_do_kasan_shadow() &&
		    kasan_alloc_module_shadow(ptr, size, GFP_KERNEL) < 0) {
			vfree(ptr);
			return NULL;
		}

		module_arch_initialize(ptr, size);
		return ptr;
	}

This is not really pretty, but I don't have a better idea at the moment.

For the allocation type, there are technically 5 of them:

	ALLOC_TYPE_RX,            /* text */
	ALLOC_TYPE_RW,            /* rw data */
	ALLOC_TYPE_RO,            /* ro data */
	ALLOC_TYPE_RO_AFTER_INIT,
	ALLOC_TYPE_RWX,           /* legacy, existing module_alloc behavior */

Given that RO and RO_AFTER_INIT require PAGE alignment and are
relatively small, I think we can merge them with RWX.

For RX and RW, we can allocate huge pages, and cut subpage chunks out
for users (something similar to 1/6 of the set). For RWX, we have 2
options:

1. Use similar logic as RX and RW, but use PAGE granularity, and do
   set_memory_ro on it.
2. Keep current module_alloc behavior.

1 is better at protecting the direct map (less fragmentation); while 2
is probably a little simpler. Given module load/unload are rare events
in most systems, I personally think we can start with option 2.

We also need to redesign module_layout. Right now, we have up to 3
layouts: core, init, and data. We will need 6 allocations:

	core text, core rw data, core ro and ro_after_init data (one allocation)
	init text, init rw data, init ro data

PS: how much do we benefit with separate core and init? Maybe it is time
to merge the two? (keep the init part around until the module unloads)

The above is my Problem analysis and Concepts.
For data structures, I propose we use two extra trees for RX and RW
allocation (similar to 1/6 of the current version, but 2x trees). For
RWX, we keep current module_alloc() behavior, so no new data structure
is needed.

The new module_layout will be something like:

	struct module_layout {
		void *ptr;
		unsigned int size;    /* text size, rw data size, ro + ro_after_init size */
		unsigned int ro_size; /* ro_size for ro + ro_after_init allocation */
	};

So that's all I have so far. Please share your comments and suggestions
on it.

One more question: shall we make module sections page aligned without
STRICT_MODULE_RWX? It appears to be a good way to simplify the logic,
but it may cause too much memory waste for smaller processors.

Thanks,
Song
Song!

On Tue, Dec 06 2022 at 12:25, Song Liu wrote:
> On Fri, Dec 2, 2022 at 1:22 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Correct. Please have a close look at the 11 architecture specific
>> module_alloc() variants so you can see what kind of tweaks and magic
>> they need, which lets you better specify the needs for the
>> initialization parameter set required.
>
> Survey of the 11 architecture specific module_alloc(). They basically do
> the following magic:
>
> 1. Modify MODULES_VADDR and/or MODULES_END. There are multiple
> reasons behind this: some archs do this for KASLR, some other archs
> have different MODULES_[VADDR|END] for different processors (32b vs.
> 64b for example), and some archs use part of the module address space
> for other things (i.e. _exiprom on arm).
>
> Archs need 1: x86, arm64, arm, mips, ppc, riscv, s390, loongarch, sparc

All of this is pretty much a boot time init decision, right?

> 2. Use kasan_alloc_module_shadow()
>
> Archs need 2: x86, arm64, s390

There is nothing really architecture specific there, so that can be part
of the core code, right?

> 3. A secondary module address space. There is a smaller preferred
> address space for modules. Once the preferred space runs out, allocate
> memory from a secondary address space.
>
> Archs need 3: some ppc, arm, arm64 (PLTs on arm and arm64)

Right.

> 4. Use different pgprot_t (PAGE_KERNEL, PAGE_KERNEL_EXEC, etc.)
>
> 5. sparc does memset(ptr, 0, size) in module_alloc()

which is pointless if you do a __GFP_ZERO allocation, but sure.

> 6. nios2 uses kmalloc() for modules. Based on the comment, this is
> probably only because it needs different MODULES_[VADDR|END].

It's a horrible hack because they decided to have this layout:

    VMALLOC_SPACE    0x80000000
    KERNEL_SPACE     0xC0000000

and they use kmalloc() because CALL26/PCREL26 cannot reach from
0x80000000 to 0xC0000000. That's true, but broken beyond repair.
Making the layout:

    VMALLOC_SPACE    0x80000000
    MODULE_SPACE     0xBE000000 == 0xC0000000 - (1 << 25)  (32M)
  or
    MODULE_SPACE     0xBF000000 == 0xC0000000 - (1 << 24)  (16M)
    KERNEL_SPACE     0xC0000000

would have been too obvious...

> I think we can handle all these with a single module_alloc() and a few
> module_arch_* functions().
>
>    unsigned long module_arch_vaddr(void);
>    unsigned long module_arch_end(void);
>    unsigned long module_arch_secondary_vaddr(void);
>    unsigned long module_arch_secondary_end(void);
>    pgprot_t module_arch_pgprot(alloc_type type);
>    void *module_arch_initialize(void *s, size_t n);
>    bool module_arch_do_kasan_shadow(void);

Why? None of these functions is required at all. Go back to one of my
previous replies:

>> +    select CONFIG_MODULE_ALLOC_NEWFANGLED
>>
>> +    module_alloc_newfangled_init(&type_parameters);

/**
 * struct mod_alloc_type - Parameters for module allocation type
 * @mapto_type:   The type to merge this type into, if different
 *                from the actual type which is configured here.
 * @flags:        Properties
 * @granularity:  The allocation granularity (PTE/PMD)
 * @alignment:    The allocation alignment requirement
 * @start:        Array of address space range starts (inclusive)
 * @end:          Array of address space range ends (inclusive)
 * @pgprot:       The page protection for this type
 * @fill:         Function to fill allocated space. If NULL, use memcpy()
 * @invalidate:   Function to invalidate allocated space. If NULL, use memset()
 *
 * If @granularity > @alignment the allocation can reuse free space in
 * previously allocated pages. If they are the same, then fresh pages
 * have to be allocated.
 */
struct mod_alloc_type {
    unsigned int    mapto_type;
    unsigned int    flags;
    unsigned int    granularity;
    unsigned int    alignment;
    unsigned long   start[MOD_MAX_ADDR_SPACES];
    unsigned long   end[MOD_MAX_ADDR_SPACES];
    pgprot_t        pgprot;
    void            (*fill)(void *dst, void *src, unsigned int size);
    void            (*invalidate)(void *dst, unsigned int size);
};

struct mod_alloc_type_params {
    struct mod_alloc_type    types[MOD_MAX_TYPES];
};

or something like that.

> So module_alloc() would look like:

<SNIP>horror</SNIP>

No. It would not contain a single arch_foo() function at all. Everything
can be expressed with the type descriptors above.

> For the allocation type, there are technically 5 of them:
>
>    ALLOC_TYPE_RX,            /* text */
>    ALLOC_TYPE_RW,            /* rw data */
>    ALLOC_TYPE_RO,            /* ro data */
>    ALLOC_TYPE_RO_AFTER_INIT,
>    ALLOC_TYPE_RWX,           /* legacy, existing module_alloc behavior */

You are mixing page protections and section types as seen from the
module core code. We need exactly four abstract allocation types:

    MOD_ALLOC_TYPE_TEXT
    MOD_ALLOC_TYPE_DATA
    MOD_ALLOC_TYPE_RODATA
    MOD_ALLOC_TYPE_RODATA_AFTER_INIT

These allocation types represent the section types and are mapped to
module_alloc_type->pgprot by the allocator. Those protections can be
different or identical, e.g. all RWX.

The module core neither cares about the resulting page protection, nor
about whether the allocation can reuse free space or needs to allocate
fresh pages, nor about whether an allocation type maps to some other
type.

> Given that RO and RO_AFTER_INIT require PAGE alignment and are
> relatively small, I think we can merge them with RWX.

There is no RWX, really. See above.

> We also need to redesign module_layout. Right now, we have
> up to 3 layouts: core, init, and data. We will need 6 allocations:
>    core text,
>    core rw data,
>    core ro and ro_after_init data (one allocation),
>    init text,
>    init rw data,
>    init ro data.
>
> PS: How much do we benefit from separate core and init?
> Maybe it is time to merge the two? (Keep the init part around until
> the module unloads.)

That needs some investigation into how much memory is really made
available. OTOH, if you look at the above then we can just have:

    MOD_ALLOC_TYPE_INITTEXT
    MOD_ALLOC_TYPE_INITDATA
    MOD_ALLOC_TYPE_INITRODATA

as extra allocation types. All it does is add some initconst memory and
a slightly larger static data structure in the allocator itself. See
below.

> For data structures, I propose we use two extra trees for RX
> and RW allocation (similar to 1/6 of current version, but 2x
> trees).

The number of trees does not matter. You can make them part of the type
scheme, and then the number of active trees depends on the number of
types and the properties of the types. Which is the right thing to do,
because then you can e.g. trivially split RODATA and RODATA_AFTER_INIT.

> For RWX, we keep current module_alloc() behavior, so no new data
> structure is needed.

Seriously, no. We switch the whole logic over to the new scheme for
consistency, simplicity and mental sanity reasons.

Again: Module code cares about section types, not about the resulting
page protections. The page protections are a matter of the underlying
allocator and architecture specific properties.

> The new module_layout will be something like:
>
>    struct module_layout {
>        void *ptr;
>        unsigned int size;    /* text size, rw data size, ro + ro_after_init size */
>        unsigned int ro_size; /* ro_size for ro + ro_after_init allocation */
>    };

*SHUDDER*

    struct module_layout {
        unsigned int    type;
        unsigned int    size;
        void            *ptr;
    };

or

    struct module_layout {
        unsigned int    size;
        void            *ptr;
    };

    struct module {
        ...
        struct module_layout    layouts[MOD_ALLOC_TYPE_MAX];
        ...
    };

> One more question: shall we make module sections page
> aligned without STRICT_MODULE_RWX? It appears to be
> a good way to simplify the logic. But it may cause too much
> memory waste for smaller processors?

Yes, we want that again for simplicity.
That wastes some space on architectures which do not support large
pages, but that's not the end of the world. Deep embedded is not really
module heavy.

So let's look at some examples under the assumption that we have the
init sections separate.

Legacy default (all RWX):

    struct mod_alloc_type_params {
        [MOD_ALLOC_TYPE_TEXT ... MOD_ALLOC_TYPE_RODATA_AFTER_INIT] = {
            .mapto_type  = MOD_ALLOC_TYPE_TEXT,
            .granularity = PAGE_SIZE,
            .alignment   = PAGE_SIZE,
            .start[0]    = MODULES_VADDR,
            .end[0]      = MODULES_END,
            .pgprot      = PAGE_KERNEL_EXEC,
        },
        [MOD_ALLOC_TYPE_INITTEXT ... MOD_ALLOC_TYPE_INITRODATA] = {
            .mapto_type  = MOD_ALLOC_TYPE_INITTEXT,
            .granularity = PAGE_SIZE,
            .alignment   = PAGE_SIZE,
            .start[0]    = MODULES_VADDR,
            .end[0]      = MODULES_END,
            .pgprot      = PAGE_KERNEL_EXEC,
        },
    };

So this is the waste space version, but we can also do:

    struct mod_alloc_type_params {
        [MOD_ALLOC_TYPE_TEXT ... MOD_ALLOC_TYPE_INITRODATA] = {
            .mapto_type  = MOD_ALLOC_TYPE_TEXT,
            .granularity = PAGE_SIZE,
            .alignment   = MOD_ARCH_ALIGNMENT,
            .start[0]    = MODULES_VADDR,
            .end[0]      = MODULES_END,
            .pgprot      = PAGE_KERNEL_EXEC,
        },
    };

The "use free space in existing mappings" mechanism is not required to
be PMD_SIZE based, right?
Large page size, strict separation:

    struct mod_alloc_type_params {
        [MOD_ALLOC_TYPE_TEXT] = {
            .mapto_type  = MOD_ALLOC_TYPE_TEXT,
            .flags       = FLAG_SHARED_PMD | FLAG_SECOND_ADDRESS_SPACE,
            .granularity = PMD_SIZE,
            .alignment   = MOD_ARCH_ALIGNMENT,
            .start[0]    = MODULES_VADDR,
            .end[0]      = MODULES_END,
            .start[1]    = MODULES_VADDR_2ND,
            .end[1]      = MODULES_END_2ND,
            .pgprot      = PAGE_KERNEL_EXEC,
            .fill        = text_poke,
            .invalidate  = text_poke_invalidate,
        },
        [MOD_ALLOC_TYPE_DATA] = {
            .mapto_type  = MOD_ALLOC_TYPE_DATA,
            .flags       = FLAG_SHARED_PMD,
            .granularity = PMD_SIZE,
            .alignment   = MOD_ARCH_ALIGNMENT,
            .start[0]    = MODULES_VADDR,
            .end[0]      = MODULES_END,
            .pgprot      = PAGE_KERNEL,
        },
        [MOD_ALLOC_TYPE_RODATA] = {
            .mapto_type  = MOD_ALLOC_TYPE_RODATA,
            .granularity = PAGE_SIZE,
            .alignment   = PAGE_SIZE,
            .start[0]    = MODULES_VADDR,
            .end[0]      = MODULES_END,
            .pgprot      = PAGE_KERNEL_RO,
        },
        [MOD_ALLOC_TYPE_RODATA_AFTER_INIT] = {
            .mapto_type  = MOD_ALLOC_TYPE_RODATA_AFTER_INIT,
            .flags       = FLAG_SET_RO_AFTER_INIT,
            .granularity = PAGE_SIZE,
            .alignment   = PAGE_SIZE,
            .start[0]    = MODULES_VADDR,
            .end[0]      = MODULES_END,
            .pgprot      = PAGE_KERNEL,
        },
        [MOD_ALLOC_TYPE_INITTEXT]   = { .mapto_type = MOD_ALLOC_TYPE_TEXT,   },
        [MOD_ALLOC_TYPE_INITDATA]   = { .mapto_type = MOD_ALLOC_TYPE_DATA,   },
        [MOD_ALLOC_TYPE_INITRODATA] = { .mapto_type = MOD_ALLOC_TYPE_RODATA, },
    };

Large page size, strict separation and RODATA uses PMD:

    struct mod_alloc_type_params {
        [MOD_ALLOC_TYPE_TEXT] = {
            .mapto_type  = MOD_ALLOC_TYPE_TEXT,
            .flags       = FLAG_SHARED_PMD,
            .granularity = PMD_SIZE,
            .alignment   = MOD_ARCH_ALIGNMENT,
            .start[0]    = MODULES_VADDR,
            .end[0]      = MODULES_END,
            .pgprot      = PAGE_KERNEL_EXEC,
            .fill        = text_poke,
            .invalidate  = text_poke_invalidate,
        },
        [MOD_ALLOC_TYPE_DATA] = {
            .mapto_type  = MOD_ALLOC_TYPE_DATA,
            .flags       = FLAG_SHARED_PMD,
            .granularity = PMD_SIZE,
            .alignment   = MOD_ARCH_ALIGNMENT,
            .start[0]    = MODULES_VADDR,
            .end[0]      = MODULES_END,
            .pgprot      = PAGE_KERNEL,
        },
        [MOD_ALLOC_TYPE_RODATA] = {
            .mapto_type  = MOD_ALLOC_TYPE_RODATA,
            .flags       = FLAG_SHARED_PMD,
            .granularity = PMD_SIZE,
            .alignment   = MOD_ARCH_ALIGNMENT,
            .start[0]    = MODULES_VADDR,
            .end[0]      = MODULES_END,
            .pgprot      = PAGE_KERNEL_RO,
            .fill        = rodata_poke,
            .invalidate  = rodata_poke_invalidate,
        },
        [MOD_ALLOC_TYPE_RODATA_AFTER_INIT] = {
            .mapto_type  = MOD_ALLOC_TYPE_RODATA_AFTER_INIT,
            .flags       = FLAG_SET_RO_AFTER_INIT,
            .granularity = PAGE_SIZE,
            .alignment   = PAGE_SIZE,
            .start[0]    = MODULES_VADDR,
            .end[0]      = MODULES_END,
            .pgprot      = PAGE_KERNEL,
        },
        [MOD_ALLOC_TYPE_INITTEXT]   = { .mapto_type = MOD_ALLOC_TYPE_TEXT,   },
        [MOD_ALLOC_TYPE_INITDATA]   = { .mapto_type = MOD_ALLOC_TYPE_DATA,   },
        [MOD_ALLOC_TYPE_INITRODATA] = { .mapto_type = MOD_ALLOC_TYPE_RODATA, },
    };

Then this data structure is handed to the module allocation init
function, which in turn sets up the resulting type magic:

    struct mod_type_allocator {
        struct mutex           mutex;
        struct mod_alloc_type  params;
        struct rb_root         free_area;
        struct list_head       list;
        struct vm_struct       *vm;
    };

    struct mod_allocator {
        struct mod_type_allocator  base_types[MAX_TYPES];
        struct mod_type_allocator  *real_types[MAX_TYPES];
    };

    static struct mod_allocator mod_allocator;

So the init function does:

    mod_alloc_init(struct mod_alloc_type_params *params)
    {
        for (int i = 0; i < MAX_TYPES; i++)
            mod_alloc_init_type(i, &params->types[i]);
    }

    mod_alloc_init_type(int i, struct mod_alloc_type *params)
    {
        struct mod_type_allocator *ta = &mod_allocator.base_types[i];

        mutex_init(&ta->mutex);
        memcpy(&ta->params, params, sizeof(*params));
        ta->free_area = RB_ROOT;
        INIT_LIST_HEAD(&ta->list);

        mod_allocator.real_types[i] = &mod_allocator.base_types[params->mapto_type];
    }

and then you have:

    module_alloc_type(size, type)
    {
        return __module_alloc_type(mod_allocator.real_types[type], size);
    }

and everything just works and operates from the relevant allocator.

See: No question about how many trees are required, no question about
legacy RWX, ...

Hmm?

Thanks,

        tglx
Le 07/12/2022 à 16:36, Thomas Gleixner a écrit :
>
> The "use free space in existing mappings" mechanism is not required to
> be PMD_SIZE based, right?
>
> Large page size, strict separation:
>
>    struct mod_alloc_type_params {
>        [MOD_ALLOC_TYPE_TEXT] = {
>            .mapto_type  = MOD_ALLOC_TYPE_TEXT,
>            .flags       = FLAG_SHARED_PMD | FLAG_SECOND_ADDRESS_SPACE,
>            .granularity = PMD_SIZE,
>            .alignment   = MOD_ARCH_ALIGNMENT,
>            .start[0]    = MODULES_VADDR,
>            .end[0]      = MODULES_END,
>            .start[1]    = MODULES_VADDR_2ND,
>            .end[1]      = MODULES_END_2ND,
>            .pgprot      = PAGE_KERNEL_EXEC,
>            .fill        = text_poke,
>            .invalidate  = text_poke_invalidate,
>        },

Don't restrict the implementation to PMD_SIZE only.

On powerpc 8xx:
- PMD_SIZE is 4 Mbytes
- Large pages are 512 kbytes and 8 Mbytes.

It even has large pages of size 16 kbytes when built for a 4k normal
page size.

Christophe
Hi Thomas,

On Wed, Dec 7, 2022 at 7:36 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> [...]
> > Survey of the 11 architecture specific module_alloc(). They basically do
> > the following magic:
> >
> > 1. Modify MODULES_VADDR and/or MODULES_END. There are multiple
> > reasons behind this, some arch does this for KASLR, some other archs
> > have different MODULES_[VADDR|END] for different processors (32b vs.
> > 64b for example), some archs use some module address space for other
> > things (i.e. _exiprom on arm).
> >
> > Archs need 1: x86, arm64, arm, mips, ppc, riscv, s390, loongarch, sparc
>
> All of this is pretty much a boot time init decision, right?

Yeah, all of these are boot time or compile time decisions.

> > 2. Use kasan_alloc_module_shadow()
> >
> > Archs need 2: x86, arm64, s390
>
> There is nothing really architecture specific, so that can be part of
> the core code, right?

Right, kasan_free_module_shadow() is called from vmalloc.c, so the alloc
one can do the same.

> > 3. A secondary module address space. There is a smaller preferred
> > address space for modules. Once the preferred space runs out, allocate
> > memory from a secondary address space.

[...]

> > 6. nios2 uses kmalloc() for modules. Based on the comment, this is
> > probably only because it needs different MODULES_[VADDR|END].
>
> It's a horrible hack because they decided to have their layout:
>
>    VMALLOC_SPACE    0x80000000
>    KERNEL_SPACE     0xC0000000
>
> and they use kmalloc because CALL26/PCREL26 cannot reach from 0x80000000
> to 0xC0000000. That's true, but broken beyond repair.
>
> Making the layout:
>
>    VMALLOC_SPACE    0x80000000
>    MODULE_SPACE     0xBE000000 == 0xC0000000 - (1 << 25)  (32M)
>  or
>    MODULE_SPACE     0xBF000000 == 0xC0000000 - (1 << 24)  (16M)
>    KERNEL_SPACE     0xC0000000
>
> would have been too obvious...

Yeah, I was thinking about something like this.

> > I think we can handle all these with a single module_alloc() and a few
> > module_arch_* functions().

[...]
> /**
>  * struct mod_alloc_type - Parameters for module allocation type
>  * @mapto_type:   The type to merge this type into, if different
>  *                from the actual type which is configured here.
>  * @flags:        Properties
>  * @granularity:  The allocation granularity (PTE/PMD)
>  * @alignment:    The allocation alignment requirement
>  * @start:        Array of address space range start (inclusive)
>  * @end:          Array of address space range end (inclusive)
>  * @pgprot:       The page protection for this type
>  * @fill:         Function to fill allocated space. If NULL, use memcpy()
>  * @invalidate:   Function to invalidate allocated space. If NULL, use memset()
>  *
>  * If @granularity > @alignment the allocation can reuse free space in
>  * previously allocated pages. If they are the same, then fresh pages
>  * have to be allocated.
>  */
> struct mod_alloc_type {
>    unsigned int    mapto_type;
>    unsigned int    flags;
>    unsigned int    granularity;
>    unsigned int    alignment;
>    unsigned long   start[MOD_MAX_ADDR_SPACES];
>    unsigned long   end[MOD_MAX_ADDR_SPACES];
>    pgprot_t        pgprot;
>    void            (*fill)(void *dst, void *src, unsigned int size);
>    void            (*invalidate)(void *dst, unsigned int size);
> };

Yeah, this is a lot better than arch_ functions.

We probably want two more function pointers here:

    int (*protect)(unsigned long addr, int numpages);
    int (*unprotect)(unsigned long addr, int numpages);

These two functions will be NULL for archs that support text_poke; while
legacy archs use them for set_memory_[ro|x|rw|nx]. Then, I think we can
get rid of VM_FLUSH_RESET_PERMS.

[...]

Everything else makes perfect sense. Thanks!

I think I am ready to dive into the code and prepare the first
RFC/PATCH. Please let me know if there is anything we should
discuss/clarify before that.

Best,
Song
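With Song's suggested hooks added, a descriptor for a legacy (non-text_poke) architecture might look like the fragment below. This is purely illustrative: the `.protect`/`.unprotect` fields and the wrapper names are hypothetical, following the proposal rather than any merged code:

```c
/* Hypothetical legacy-arch descriptor with the proposed hooks.
 * set_memory_ro_x() / set_memory_rw_nx() are invented wrapper names
 * around the existing set_memory_*() calls.
 */
[MOD_ALLOC_TYPE_TEXT] = {
    .mapto_type  = MOD_ALLOC_TYPE_TEXT,
    .granularity = PAGE_SIZE,        /* legacy archs stay PAGE granular */
    .alignment   = PAGE_SIZE,
    .start[0]    = MODULES_VADDR,
    .end[0]      = MODULES_END,
    .pgprot      = PAGE_KERNEL_EXEC,
    .fill        = NULL,             /* NULL: plain memcpy() */
    .invalidate  = NULL,             /* NULL: plain memset() */
    .protect     = set_memory_ro_x,  /* proposed hook, hypothetical name */
    .unprotect   = set_memory_rw_nx, /* proposed hook, hypothetical name */
},
```

As the next message argues, the same effect can also be had without new hooks by folding the permission flips into fill()/invalidate() helpers.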
On Wed, Dec 7, 2022 at 8:54 AM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> Le 07/12/2022 à 16:36, Thomas Gleixner a écrit :
> >
> > The "use free space in existing mappings" mechanism is not required to
> > be PMD_SIZE based, right?
> >
> > Large page size, strict separation:
> >
> >    struct mod_alloc_type_params {
> >        [MOD_ALLOC_TYPE_TEXT] = {
> >            .mapto_type  = MOD_ALLOC_TYPE_TEXT,
> >            .flags       = FLAG_SHARED_PMD | FLAG_SECOND_ADDRESS_SPACE,
> >            .granularity = PMD_SIZE,
> >            .alignment   = MOD_ARCH_ALIGNMENT,
> >            .start[0]    = MODULES_VADDR,
> >            .end[0]      = MODULES_END,
> >            .start[1]    = MODULES_VADDR_2ND,
> >            .end[1]      = MODULES_END_2ND,
> >            .pgprot      = PAGE_KERNEL_EXEC,
> >            .fill        = text_poke,
> >            .invalidate  = text_poke_invalidate,
> >        },
>
> Don't restrict the implementation to PMD_SIZE only.
>
> On powerpc 8xx:
> - PMD_SIZE is 4 Mbytes
> - Large pages are 512 kbytes and 8 Mbytes.
>
> It even has large pages of size 16 kbytes when built for a 4k normal
> page size.

IIUC, these customizations would fit nicely with this design. I will see
how much I can prototype for powerpc without the actual hardware.

Thanks,
Song
Song!

On Wed, Dec 07 2022 at 11:26, Song Liu wrote:
> On Wed, Dec 7, 2022 at 7:36 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > I think we can handle all these with a single module_alloc() and a few
>> > module_arch_* functions().
>> struct mod_alloc_type {
>>    unsigned int    mapto_type;
>>    unsigned int    flags;
>>    unsigned int    granularity;
>>    unsigned int    alignment;
>>    unsigned long   start[MOD_MAX_ADDR_SPACES];
>>    unsigned long   end[MOD_MAX_ADDR_SPACES];
>>    pgprot_t        pgprot;
>>    void            (*fill)(void *dst, void *src, unsigned int size);
>>    void            (*invalidate)(void *dst, unsigned int size);
>> };
>
> Yeah, this is a lot better than arch_ functions.

Remember the order of things to worry about: #3 :)

> We probably want two more function pointers here:
>
>    int (*protect)(unsigned long addr, int numpages);
>    int (*unprotect)(unsigned long addr, int numpages);
>
> These two functions will be NULL for archs that support text_poke;
> while legacy archs use them for set_memory_[ro|x|rw|nx]. Then, I
> think we can get rid of VM_FLUSH_RESET_PERMS.

Depends. You can implement

    fill()
        memcpy(...);
        set_memory_ro();

and

    invalidate()
        set_memory_rw();
        memset();

as global helpers which can be used by the architecture for the init
struct or used as defaults for certain types.

> I think I am ready to dive into the code and prepare the first RFC/PATCH.
> Please let me know if there is anything we should discuss/clarify before that.

I think we covered most of it by now, so sure, a POC is probably due,
but please yell when you find a gap in steps #1 - #3 which we did not
cover yet.

Thanks,

        tglx
Christophe,

On Wed, Dec 07 2022 at 16:53, Christophe Leroy wrote:
> Le 07/12/2022 à 16:36, Thomas Gleixner a écrit :
>> The "use free space in existing mappings" mechanism is not required to
>> be PMD_SIZE based, right?

Correct. I just used it for the example.

>> Large page size, strict separation:
>>
>>    struct mod_alloc_type_params {
>>        [MOD_ALLOC_TYPE_TEXT] = {
>>            .mapto_type  = MOD_ALLOC_TYPE_TEXT,
>>            .flags       = FLAG_SHARED_PMD | FLAG_SECOND_ADDRESS_SPACE,
>>            .granularity = PMD_SIZE,
>>            .alignment   = MOD_ARCH_ALIGNMENT,
>>            .start[0]    = MODULES_VADDR,
>>            .end[0]      = MODULES_END,
>>            .start[1]    = MODULES_VADDR_2ND,
>>            .end[1]      = MODULES_END_2ND,
>>            .pgprot      = PAGE_KERNEL_EXEC,
>>            .fill        = text_poke,
>>            .invalidate  = text_poke_invalidate,
>>        },
>
> Don't restrict implementation to PMD_SIZE only.
>
> On powerpc 8xx:
> - PMD_SIZE is 4 Mbytes
> - Large pages are 512 kbytes and 8 Mbytes.
>
> It even has large pages of size 16 kbytes when build for 4k normal page
> size.

@granularity takes any size which is valid from the architecture side
and can handle the @pgprot distinctions.

That's why I separated functionality and configuration.

Note, it's not strict compile time configuration. You can either build
it completely dynamic at boot or have a static configuration structure
as compile time default.

That static default might be __initconst for architectures where there
is no boot time detection and change required, but can be __initdata for
those which need to adjust it to the needs of the detected CPU/platform
before feeding it into the module allocator init function.

Does that answer your question?

Thanks,

        tglx
Le 07/12/2022 à 22:04, Thomas Gleixner a écrit :
> Christophe,
>
> On Wed, Dec 07 2022 at 16:53, Christophe Leroy wrote:
>> Le 07/12/2022 à 16:36, Thomas Gleixner a écrit :
>>> The "use free space in existing mappings" mechanism is not required to
>>> be PMD_SIZE based, right?
>
> Correct. I just used it for the example.
>
>>> Large page size, strict separation:
>>>
>>>    struct mod_alloc_type_params {
>>>        [MOD_ALLOC_TYPE_TEXT] = {
>>>            .mapto_type  = MOD_ALLOC_TYPE_TEXT,
>>>            .flags       = FLAG_SHARED_PMD | FLAG_SECOND_ADDRESS_SPACE,
>>>            .granularity = PMD_SIZE,
>>>            .alignment   = MOD_ARCH_ALIGNMENT,
>>>            .start[0]    = MODULES_VADDR,
>>>            .end[0]      = MODULES_END,
>>>            .start[1]    = MODULES_VADDR_2ND,
>>>            .end[1]      = MODULES_END_2ND,
>>>            .pgprot      = PAGE_KERNEL_EXEC,
>>>            .fill        = text_poke,
>>>            .invalidate  = text_poke_invalidate,
>>>        },
>>
>> Don't restrict implementation to PMD_SIZE only.
>>
>> On powerpc 8xx:
>> - PMD_SIZE is 4 Mbytes
>> - Large pages are 512 kbytes and 8 Mbytes.
>>
>> It even has large pages of size 16 kbytes when build for 4k normal page
>> size.
>
> @granularity takes any size which is valid from the architecture side
> and can handle the @pgprot distinctions.
>
> That's why I separated functionality and configuration.
>
> Note, it's not strict compile time configuration. You can either build
> it completely dynamic at boot or have a static configuration structure
> as compile time default.
>
> That static default might be __initconst for architectures where there
> is no boot time detection and change required, but can be __initdata for
> those which need to adjust it to the needs of the detected CPU/platform
> before feeding it into the module allocator init function.
>
> Does that answer your question?

Yes it does, thanks.

Christophe
On Wed, Dec 7, 2022 at 12:57 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Song!
>
> On Wed, Dec 07 2022 at 11:26, Song Liu wrote:
> > On Wed, Dec 7, 2022 at 7:36 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > I think we can handle all these with a single module_alloc() and a few
> >> > module_arch_* functions().
> >> struct mod_alloc_type {
> >>    unsigned int    mapto_type;
> >>    unsigned int    flags;
> >>    unsigned int    granularity;
> >>    unsigned int    alignment;
> >>    unsigned long   start[MOD_MAX_ADDR_SPACES];
> >>    unsigned long   end[MOD_MAX_ADDR_SPACES];
> >>    pgprot_t        pgprot;
> >>    void            (*fill)(void *dst, void *src, unsigned int size);
> >>    void            (*invalidate)(void *dst, unsigned int size);
> >> };
> >
> > Yeah, this is a lot better than arch_ functions.
>
> Remember the order of things to worry about: #3 :)
>
> > We probably want two more function pointers here:
> >
> >    int (*protect)(unsigned long addr, int numpages);
> >    int (*unprotect)(unsigned long addr, int numpages);
> >
> > These two functions will be NULL for archs that support text_poke;
> > while legacy archs use them for set_memory_[ro|x|rw|nx]. Then, I
> > think we can get rid of VM_FLUSH_RESET_PERMS.
>
> Depends. You can implement
>
>    fill()
>        memcpy(...);
>        set_memory_ro();
>
> and
>
>    invalidate()
>        set_memory_rw();
>        memset();
>
> as global helpers which can be used by the architecture for the init
> struct or used as default for certain types.

Ah, that's right. Legacy archs should always use PAGE_SIZE granularity,
so fill() and invalidate() are sufficient.

> > I think I am ready to dive into the code and prepare the first RFC/PATCH.
> > Please let me know if there is anything we should discuss/clarify before that.
>
> I think we covered most of it by now, so sure a POC is probably due, but
> please yell when you find a gap in step #1 - #3 which we did not cover
> yet.

Thanks! Let me see what I can get.

Song