Message ID | 20211227145903.187152-4-wangkefeng.wang@huawei.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: support huge vmalloc mapping on arm64/x86 | expand |
On 12/27/21 6:59 AM, Kefeng Wang wrote: > This patch select HAVE_ARCH_HUGE_VMALLOC to let X86_64 and X86_PAE > support huge vmalloc mappings. In general, this seems interesting and the diff is simple. But, I don't see _any_ x86-specific data. I think the bare minimum here would be a few kernel compiles and some 'perf stat' data for some TLB events. > diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c > index 95fa745e310a..6bf5cb7d876a 100644 > --- a/arch/x86/kernel/module.c > +++ b/arch/x86/kernel/module.c > @@ -75,8 +75,8 @@ void *module_alloc(unsigned long size) > > p = __vmalloc_node_range(size, MODULE_ALIGN, > MODULES_VADDR + get_module_load_offset(), > - MODULES_END, gfp_mask, > - PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE, > + MODULES_END, gfp_mask, PAGE_KERNEL, > + VM_DEFER_KMEMLEAK | VM_NO_HUGE_VMAP, NUMA_NO_NODE, > __builtin_return_address(0)); > if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) { > vfree(p); To figure out what's going on in this hunk, I had to look at the cover letter (which I wasn't cc'd on). That's not great and it means that somebody who stumbles upon this in the code is going to have a really hard time figuring out what is going on. Cover letters don't make it into git history. This desperately needs a comment and some changelog material in *this* patch. But, even the description from the cover letter is sparse: > There are some disadvantages about this feature[2], one of the main > concerns is the possible memory fragmentation/waste in some scenarios, > also archs must ensure that any arch specific vmalloc allocations that > require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX) > use the VM_NO_HUGE_VMAP flag to inhibit larger mappings. That just says that x86 *needs* PAGE_SIZE allocations. But, what happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)? Will the subsequent permission changes just fragment the 2M mapping?
On 2021/12/27 23:56, Dave Hansen wrote: > On 12/27/21 6:59 AM, Kefeng Wang wrote: >> This patch select HAVE_ARCH_HUGE_VMALLOC to let X86_64 and X86_PAE >> support huge vmalloc mappings. > In general, this seems interesting and the diff is simple. But, I don't > see _any_ x86-specific data. I think the bare minimum here would be a > few kernel compiles and some 'perf stat' data for some TLB events. When the feature supported on ppc, commit 8abddd968a303db75e4debe77a3df484164f1f33 Author: Nicholas Piggin <npiggin@gmail.com> Date: Mon May 3 19:17:55 2021 +1000 powerpc/64s/radix: Enable huge vmalloc mappings This reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%, due to vfs hashes being allocated with 2MB pages. But the data could be different on different machine/arch. >> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c >> index 95fa745e310a..6bf5cb7d876a 100644 >> --- a/arch/x86/kernel/module.c >> +++ b/arch/x86/kernel/module.c >> @@ -75,8 +75,8 @@ void *module_alloc(unsigned long size) >> >> p = __vmalloc_node_range(size, MODULE_ALIGN, >> MODULES_VADDR + get_module_load_offset(), >> - MODULES_END, gfp_mask, >> - PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE, >> + MODULES_END, gfp_mask, PAGE_KERNEL, >> + VM_DEFER_KMEMLEAK | VM_NO_HUGE_VMAP, NUMA_NO_NODE, >> __builtin_return_address(0)); >> if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) { >> vfree(p); > To figure out what's going on in this hunk, I had to look at the cover > letter (which I wasn't cc'd on). That's not great and it means that > somebody who stumbles upon this in the code is going to have a really > hard time figuring out what is going on. Cover letters don't make it > into git history. Sorry for that, will add more into arch's patch changelog. > This desperately needs a comment and some changelog material in *this* > patch. > > But, even the description from the cover letter is sparse: > >> There are some disadvantages about this feature[2], one of the main >> concerns is the possible memory fragmentation/waste in some scenarios, >> also archs must ensure that any arch specific vmalloc allocations that >> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX) >> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings. > That just says that x86 *needs* PAGE_SIZE allocations. But, what > happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)? Will the > subsequent permission changes just fragment the 2M mapping? > . Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping. When module alloc with STRICT_MODULE_RWX on x86, it calls __change_page_attr() from set_memory_ro/rw/nx which will split large page, so there is no need to make module alloc with HUGE_VMALLOC. > > > >
On 12/28/21 2:26 AM, Kefeng Wang wrote: >>> There are some disadvantages about this feature[2], one of the main >>> concerns is the possible memory fragmentation/waste in some scenarios, >>> also archs must ensure that any arch specific vmalloc allocations that >>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX) >>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings. >> That just says that x86 *needs* PAGE_SIZE allocations. But, what >> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)? Will the >> subsequent permission changes just fragment the 2M mapping? > > Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping. > > When module alloc with STRICT_MODULE_RWX on x86, it calls > __change_page_attr() > > from set_memory_ro/rw/nx which will split large page, so there is no > need to make > > module alloc with HUGE_VMALLOC. This all sounds very fragile to me. Every time a new architecture would get added for huge vmalloc() support, the developer needs to know to go find that architecture's module_alloc() and add this flag. They next guy is going to forget, just like you did. Considering that this is not a hot path, a weak function would be a nice choice: /* vmalloc() flags used for all module allocations. */ unsigned long __weak arch_module_vm_flags() { /* * Modules use a single, large vmalloc(). Different * permissions are applied later and will fragment * huge mappings. Avoid using huge pages for modules. */ return VM_NO_HUGE_VMAP; } Stick that in some the common module code, next to: > void * __weak module_alloc(unsigned long size) > { > return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END, ... Then, put arch_module_vm_flags() in *all* of the module_alloc() implementations, including the generic one. That way (even with a new architecture) whoever copies-and-pastes their module_alloc() implementation is likely to get it right. The next guy who just does a "select HAVE_ARCH_HUGE_VMALLOC" will hopefully just work. VM_FLUSH_RESET_PERMS could probably be dealt with in the same way.
On 2021/12/29 0:14, Dave Hansen wrote: > On 12/28/21 2:26 AM, Kefeng Wang wrote: >>>> There are some disadvantages about this feature[2], one of the main >>>> concerns is the possible memory fragmentation/waste in some scenarios, >>>> also archs must ensure that any arch specific vmalloc allocations that >>>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX) >>>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings. >>> That just says that x86 *needs* PAGE_SIZE allocations. But, what >>> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)? Will the >>> subsequent permission changes just fragment the 2M mapping? >> Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping. >> >> When module alloc with STRICT_MODULE_RWX on x86, it calls >> __change_page_attr() >> >> from set_memory_ro/rw/nx which will split large page, so there is no >> need to make >> >> module alloc with HUGE_VMALLOC. > This all sounds very fragile to me. Every time a new architecture would > get added for huge vmalloc() support, the developer needs to know to go > find that architecture's module_alloc() and add this flag. They next > guy is going to forget, just like you did. > > Considering that this is not a hot path, a weak function would be a nice > choice: > > /* vmalloc() flags used for all module allocations. */ > unsigned long __weak arch_module_vm_flags() > { > /* > * Modules use a single, large vmalloc(). Different > * permissions are applied later and will fragment > * huge mappings. Avoid using huge pages for modules. > */ > return VM_NO_HUGE_VMAP; For x86, it only fragment, but for arm64, due to apply_to_page_range() in set_memory_*, without this flag maybe crash. Whatever, we need this flag for module. > } > > Stick that in some the common module code, next to: > >> void * __weak module_alloc(unsigned long size) >> { >> return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END, > ... > > Then, put arch_module_vm_flags() in *all* of the module_alloc() > implementations, including the generic one. That way (even with a new > architecture) whoever copies-and-pastes their module_alloc() > implementation is likely to get it right. The next guy who just does a > "select HAVE_ARCH_HUGE_VMALLOC" will hopefully just work. OK, Let me check the VM_FLUSH_RESET_PERMS and try about this way. Thanks. > > VM_FLUSH_RESET_PERMS could probably be dealt with in the same way. > .
Le 27/12/2021 à 15:59, Kefeng Wang a écrit : > This patch select HAVE_ARCH_HUGE_VMALLOC to let X86_64 and X86_PAE > support huge vmalloc mappings. > > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Borislav Petkov <bp@alien8.de> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: "H. Peter Anvin" <hpa@zytor.com> > Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> > --- > Documentation/admin-guide/kernel-parameters.txt | 4 ++-- > arch/x86/Kconfig | 1 + > arch/x86/kernel/module.c | 4 ++-- > 3 files changed, 5 insertions(+), 4 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index e3f9fd7ec106..ffce6591ae64 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -1639,7 +1639,7 @@ > precedence over memory_hotplug.memmap_on_memory. > > > - hugevmalloc= [KNL,PPC,ARM64] Reguires CONFIG_HAVE_ARCH_HUGE_VMALLOC > + hugevmalloc= [KNL,PPC,ARM64,X86] Reguires CONFIG_HAVE_ARCH_HUGE_VMALLOC > Format: { on | off } > Default set by CONFIG_HUGE_VMALLOC_DEFAULT_ENABLED. > > @@ -3424,7 +3424,7 @@ > > nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings. > > - nohugevmalloc [KNL,PPC,ARM64] Disable kernel huge vmalloc mappings. > + nohugevmalloc [KNL,PPC,ARM64,X86] Disable kernel huge vmalloc mappings. > > nosmt [KNL,S390] Disable symmetric multithreading (SMT). > Equivalent to smt=1. > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index ebe8fc76949a..f6bf6675bbe7 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -157,6 +157,7 @@ config X86 > select HAVE_ACPI_APEI_NMI if ACPI > select HAVE_ALIGNED_STRUCT_PAGE if SLUB > select HAVE_ARCH_AUDITSYSCALL > + select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP > select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE > select HAVE_ARCH_JUMP_LABEL > select HAVE_ARCH_JUMP_LABEL_RELATIVE > diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c > index 95fa745e310a..6bf5cb7d876a 100644 > --- a/arch/x86/kernel/module.c > +++ b/arch/x86/kernel/module.c > @@ -75,8 +75,8 @@ void *module_alloc(unsigned long size) > > p = __vmalloc_node_range(size, MODULE_ALIGN, > MODULES_VADDR + get_module_load_offset(), > - MODULES_END, gfp_mask, > - PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE, > + MODULES_END, gfp_mask, PAGE_KERNEL, > + VM_DEFER_KMEMLEAK | VM_NO_HUGE_VMAP, NUMA_NO_NODE, you should add a comment like powerpc (commit 8abddd968a30 ("powerpc/64s/radix: Enable huge vmalloc mappings")) to explain why this requires VM_NO_HUGE_VMAP > __builtin_return_address(0)); > if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) { > vfree(p);
Le 28/12/2021 à 11:26, Kefeng Wang a écrit : > > On 2021/12/27 23:56, Dave Hansen wrote: >> On 12/27/21 6:59 AM, Kefeng Wang wrote: >>> This patch select HAVE_ARCH_HUGE_VMALLOC to let X86_64 and X86_PAE >>> support huge vmalloc mappings. >> In general, this seems interesting and the diff is simple. But, I don't >> see _any_ x86-specific data. I think the bare minimum here would be a >> few kernel compiles and some 'perf stat' data for some TLB events. > > When the feature supported on ppc, > > commit 8abddd968a303db75e4debe77a3df484164f1f33 > Author: Nicholas Piggin <npiggin@gmail.com> > Date: Mon May 3 19:17:55 2021 +1000 > > powerpc/64s/radix: Enable huge vmalloc mappings > > This reduces TLB misses by nearly 30x on a `git diff` workload on a > 2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%, due > to vfs hashes being allocated with 2MB pages. > > But the data could be different on different machine/arch. > >>> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c >>> index 95fa745e310a..6bf5cb7d876a 100644 >>> --- a/arch/x86/kernel/module.c >>> +++ b/arch/x86/kernel/module.c >>> @@ -75,8 +75,8 @@ void *module_alloc(unsigned long size) >>> p = __vmalloc_node_range(size, MODULE_ALIGN, >>> MODULES_VADDR + get_module_load_offset(), >>> - MODULES_END, gfp_mask, >>> - PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE, >>> + MODULES_END, gfp_mask, PAGE_KERNEL, >>> + VM_DEFER_KMEMLEAK | VM_NO_HUGE_VMAP, NUMA_NO_NODE, >>> __builtin_return_address(0)); >>> if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) { >>> vfree(p); >> To figure out what's going on in this hunk, I had to look at the cover >> letter (which I wasn't cc'd on). That's not great and it means that >> somebody who stumbles upon this in the code is going to have a really >> hard time figuring out what is going on. Cover letters don't make it >> into git history. > Sorry for that, will add more into arch's patch changelog. >> This desperately needs a comment and some changelog material in *this* >> patch. >> >> But, even the description from the cover letter is sparse: >> >>> There are some disadvantages about this feature[2], one of the main >>> concerns is the possible memory fragmentation/waste in some scenarios, >>> also archs must ensure that any arch specific vmalloc allocations that >>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX) >>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings. >> That just says that x86 *needs* PAGE_SIZE allocations. But, what >> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)? Will the >> subsequent permission changes just fragment the 2M mapping? >> . > > Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping. > > When module alloc with STRICT_MODULE_RWX on x86, it calls > __change_page_attr() > > from set_memory_ro/rw/nx which will split large page, so there is no > need to make > > module alloc with HUGE_VMALLOC. > Maybe there is no need to perform the module alloc with HUGE_VMALLOC, but it least it would still work if you do so. Powerpc did add VM_NO_HUGE_VMAP temporarily and for some reason which is explained in a comment. If x86 already has the necessary logic to handle it, why add VM_NO_HUGE_VMAP ? Christophe
Le 28/12/2021 à 17:14, Dave Hansen a écrit : > On 12/28/21 2:26 AM, Kefeng Wang wrote: >>>> There are some disadvantages about this feature[2], one of the main >>>> concerns is the possible memory fragmentation/waste in some scenarios, >>>> also archs must ensure that any arch specific vmalloc allocations that >>>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX) >>>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings. >>> That just says that x86 *needs* PAGE_SIZE allocations. But, what >>> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)? Will the >>> subsequent permission changes just fragment the 2M mapping? >> >> Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping. >> >> When module alloc with STRICT_MODULE_RWX on x86, it calls >> __change_page_attr() >> >> from set_memory_ro/rw/nx which will split large page, so there is no >> need to make >> >> module alloc with HUGE_VMALLOC. > > This all sounds very fragile to me. Every time a new architecture would > get added for huge vmalloc() support, the developer needs to know to go > find that architecture's module_alloc() and add this flag. They next > guy is going to forget, just like you did. That's not correct from my point of view. When powerpc added that, a clear comment explains why: + /* + * Don't do huge page allocations for modules yet until more testing + * is done. STRICT_MODULE_RWX may require extra work to support this + * too. + */ So as you can see, this is something specific to powerpc and temporary. > > Considering that this is not a hot path, a weak function would be a nice > choice: > > /* vmalloc() flags used for all module allocations. */ > unsigned long __weak arch_module_vm_flags() > { > /* > * Modules use a single, large vmalloc(). Different > * permissions are applied later and will fragment > * huge mappings. Avoid using huge pages for modules. > */ Why ? Not everybody use STRICT_MODULES_RWX. Even if you do so, you can still benefit from huge pages for modules. Why make what was initially a temporary precaution for powerpc become a definitive default limitation for all ? > return VM_NO_HUGE_VMAP; > } > > Stick that in some the common module code, next to: > >> void * __weak module_alloc(unsigned long size) >> { >> return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END, > ... > > Then, put arch_module_vm_flags() in *all* of the module_alloc() > implementations, including the generic one. That way (even with a new > architecture) whoever copies-and-pastes their module_alloc() > implementation is likely to get it right. The next guy who just does a > "select HAVE_ARCH_HUGE_VMALLOC" will hopefully just work. > > VM_FLUSH_RESET_PERMS could probably be dealt with in the same way.
Le 29/12/2021 à 12:01, Kefeng Wang a écrit : > > On 2021/12/29 0:14, Dave Hansen wrote: >> On 12/28/21 2:26 AM, Kefeng Wang wrote: >>>>> There are some disadvantages about this feature[2], one of the main >>>>> concerns is the possible memory fragmentation/waste in some scenarios, >>>>> also archs must ensure that any arch specific vmalloc allocations that >>>>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX) >>>>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings. >>>> That just says that x86 *needs* PAGE_SIZE allocations. But, what >>>> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)? Will the >>>> subsequent permission changes just fragment the 2M mapping? >>> Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping. >>> >>> When module alloc with STRICT_MODULE_RWX on x86, it calls >>> __change_page_attr() >>> >>> from set_memory_ro/rw/nx which will split large page, so there is no >>> need to make >>> >>> module alloc with HUGE_VMALLOC. >> This all sounds very fragile to me. Every time a new architecture would >> get added for huge vmalloc() support, the developer needs to know to go >> find that architecture's module_alloc() and add this flag. They next >> guy is going to forget, just like you did. >> >> Considering that this is not a hot path, a weak function would be a nice >> choice: >> >> /* vmalloc() flags used for all module allocations. */ >> unsigned long __weak arch_module_vm_flags() >> { >> /* >> * Modules use a single, large vmalloc(). Different >> * permissions are applied later and will fragment >> * huge mappings. Avoid using huge pages for modules. >> */ >> return VM_NO_HUGE_VMAP; > > For x86, it only fragment, but for arm64, due to apply_to_page_range() in > > set_memory_*, without this flag maybe crash. Whatever, we need this > > flag for module. I see no reason to have this flag by default. Only ARM should have it if necessary, with a comment explaining why just like powerpc. And maybe the flag should be there only when STRICT_MODULE_RWX is selected. > >> } >> >> Stick that in some the common module code, next to: >> >>> void * __weak module_alloc(unsigned long size) >>> { >>> return __vmalloc_node_range(size, 1, VMALLOC_START, >>> VMALLOC_END, >> ... >> >> Then, put arch_module_vm_flags() in *all* of the module_alloc() >> implementations, including the generic one. That way (even with a new >> architecture) whoever copies-and-pastes their module_alloc() >> implementation is likely to get it right. The next guy who just does a >> "select HAVE_ARCH_HUGE_VMALLOC" will hopefully just work. > > OK, Let me check the VM_FLUSH_RESET_PERMS and try about this way. > > Thanks. > >> >> VM_FLUSH_RESET_PERMS could probably be dealt with in the same way. >> .
On 1/17/22 6:46 PM, Nicholas Piggin wrote: >> This all sounds very fragile to me. Every time a new architecture would >> get added for huge vmalloc() support, the developer needs to know to go >> find that architecture's module_alloc() and add this flag. > This is documented in the Kconfig. > > # > # Archs that select this would be capable of PMD-sized vmaps (i.e., > # arch_vmap_pmd_supported() returns true), and they must make no assumptions > # that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag > # can be used to prohibit arch-specific allocations from using hugepages to > # help with this (e.g., modules may require it). > # > config HAVE_ARCH_HUGE_VMALLOC > depends on HAVE_ARCH_HUGE_VMAP > bool > > Is it really fair to say it's *very* fragile? Surely it's reasonable to > read the (not very long) documentation ad understand the consequences for > the arch code before enabling it. Very fragile or not, I think folks are likely to get it wrong. It would be nice to have it default *everyone* to safe and slow and make *sure* they go look at the architecture modules code itself before enabling this for modules. Just from that Kconfig text, I don't think I'd know off the top of my head what do do for x86, or what code I needed to go touch.
Excerpts from Dave Hansen's message of January 19, 2022 3:28 am: > On 1/17/22 6:46 PM, Nicholas Piggin wrote: >>> This all sounds very fragile to me. Every time a new architecture would >>> get added for huge vmalloc() support, the developer needs to know to go >>> find that architecture's module_alloc() and add this flag. >> This is documented in the Kconfig. >> >> # >> # Archs that select this would be capable of PMD-sized vmaps (i.e., >> # arch_vmap_pmd_supported() returns true), and they must make no assumptions >> # that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag >> # can be used to prohibit arch-specific allocations from using hugepages to >> # help with this (e.g., modules may require it). >> # >> config HAVE_ARCH_HUGE_VMALLOC >> depends on HAVE_ARCH_HUGE_VMAP >> bool >> >> Is it really fair to say it's *very* fragile? Surely it's reasonable to >> read the (not very long) documentation ad understand the consequences for >> the arch code before enabling it. > > Very fragile or not, I think folks are likely to get it wrong. It would > be nice to have it default *everyone* to safe and slow and make *sure* It's not safe to enable though. That's the problem. If it was just modules then you'd have a point but it could be anything. > they go look at the architecture modules code itself before enabling > this for modules. This is required not just for modules for the whole arch code, it has to be looked at and decided this will work. > Just from that Kconfig text, I don't think I'd know off the top of my > head what do do for x86, or what code I needed to go touch. You have to make sure arch/x86 makes no assumptions that vmalloc memory is backed by PAGE_SIZE ptes. If you can't do that then you shouldn't enable the option. The option can not explain it any more because any arch could do anything with its mappings. The module code is an example, not the recipe. Thanks, Nick
On 2022/1/19 12:17, Nicholas Piggin wrote: > Excerpts from Dave Hansen's message of January 19, 2022 3:28 am: >> On 1/17/22 6:46 PM, Nicholas Piggin wrote: >>>> This all sounds very fragile to me. Every time a new architecture would >>>> get added for huge vmalloc() support, the developer needs to know to go >>>> find that architecture's module_alloc() and add this flag. >>> This is documented in the Kconfig. >>> >>> # >>> # Archs that select this would be capable of PMD-sized vmaps (i.e., >>> # arch_vmap_pmd_supported() returns true), and they must make no assumptions >>> # that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag >>> # can be used to prohibit arch-specific allocations from using hugepages to >>> # help with this (e.g., modules may require it). >>> # >>> config HAVE_ARCH_HUGE_VMALLOC >>> depends on HAVE_ARCH_HUGE_VMAP >>> bool >>> >>> Is it really fair to say it's *very* fragile? Surely it's reasonable to >>> read the (not very long) documentation ad understand the consequences for >>> the arch code before enabling it. >> Very fragile or not, I think folks are likely to get it wrong. It would >> be nice to have it default *everyone* to safe and slow and make *sure* > It's not safe to enable though. That's the problem. If it was just > modules then you'd have a point but it could be anything. > >> they go look at the architecture modules code itself before enabling >> this for modules. > This is required not just for modules for the whole arch code, it > has to be looked at and decided this will work. > >> Just from that Kconfig text, I don't think I'd know off the top of my >> head what do do for x86, or what code I needed to go touch. > You have to make sure arch/x86 makes no assumptions that vmalloc memory > is backed by PAGE_SIZE ptes. If you can't do that then you shouldn't > enable the option. The option can not explain it any more because any > arch could do anything with its mappings. The module code is an example, > not the recipe. Hi Nick, Dave and Christophe,thanks for your review, a little confused, I think, 1) for ppc/arm64 module_alloc(), it must set VM_NO_HUGE_VMAP because the arch's set_memory_* funcitons can only support PAGE_SIZE mapping, due to the limit of apply_to_page_range(). 2) but for x86's module_alloc(), add VM_NO_HUGE_VMAP is to avoid fragmentation, x86's __change_page_attr functions will split the huge mapping. this flags is not a must. and the behavior above occurred when STRICT_MODULE_RWX enabled, so 1) add a unified function to set vm flags(suggested by Dave ) or 2) add vm flags with some comments to per-arch's module_alloc() are both acceptable, for the way of unified function , we could make this a default recipe with STRICT_MODULE_RWX, also make two more vm flags into it, eg, +unsigned long module_alloc_vm_flags(bool need_flush_reset_perms) +{ + unsigned long vm_flags = VM_DEFER_KMEMLEAK; + + if (need_flush_reset_perms) + vm_flags |= VM_FLUSH_RESET_PERMS; + /* + * Modules use a single, large vmalloc(). Different permissions + * are applied later and will fragment huge mappings or even + * fails in set_memory_* on some architectures. Avoid using + * huge pages for modules. + */ + if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX)) + vm_flags |= VM_NO_HUGE_VMAP; + + return vm_flags; +} then called each arch's module_alloc(). Any suggestion, many thanks. > > Thanks, > Nick > .
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index e3f9fd7ec106..ffce6591ae64 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1639,7 +1639,7 @@ precedence over memory_hotplug.memmap_on_memory. - hugevmalloc= [KNL,PPC,ARM64] Reguires CONFIG_HAVE_ARCH_HUGE_VMALLOC + hugevmalloc= [KNL,PPC,ARM64,X86] Reguires CONFIG_HAVE_ARCH_HUGE_VMALLOC Format: { on | off } Default set by CONFIG_HUGE_VMALLOC_DEFAULT_ENABLED. @@ -3424,7 +3424,7 @@ nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings. - nohugevmalloc [KNL,PPC,ARM64] Disable kernel huge vmalloc mappings. + nohugevmalloc [KNL,PPC,ARM64,X86] Disable kernel huge vmalloc mappings. nosmt [KNL,S390] Disable symmetric multithreading (SMT). Equivalent to smt=1. diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index ebe8fc76949a..f6bf6675bbe7 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -157,6 +157,7 @@ config X86 select HAVE_ACPI_APEI_NMI if ACPI select HAVE_ALIGNED_STRUCT_PAGE if SLUB select HAVE_ARCH_AUDITSYSCALL + select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE select HAVE_ARCH_JUMP_LABEL select HAVE_ARCH_JUMP_LABEL_RELATIVE diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index 95fa745e310a..6bf5cb7d876a 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -75,8 +75,8 @@ void *module_alloc(unsigned long size) p = __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR + get_module_load_offset(), - MODULES_END, gfp_mask, - PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE, + MODULES_END, gfp_mask, PAGE_KERNEL, + VM_DEFER_KMEMLEAK | VM_NO_HUGE_VMAP, NUMA_NO_NODE, __builtin_return_address(0)); if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) { vfree(p);
This patch select HAVE_ARCH_HUGE_VMALLOC to let X86_64 and X86_PAE support huge vmalloc mappings. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> --- Documentation/admin-guide/kernel-parameters.txt | 4 ++-- arch/x86/Kconfig | 1 + arch/x86/kernel/module.c | 4 ++-- 3 files changed, 5 insertions(+), 4 deletions(-)