[v2,3/3] x86: Support huge vmalloc mappings

Message ID	20211227145903.187152-4-wangkefeng.wang@huawei.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Kefeng Wang <wangkefeng.wang@huawei.com> To: Jonathan Corbet <corbet@lwn.net>, Andrew Morton <akpm@linux-foundation.org>, <linuxppc-dev@lists.ozlabs.org>, <linux-doc@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>, <x86@kernel.org>, <linux-arm-kernel@lists.infradead.org> CC: Nicholas Piggin <npiggin@gmail.com>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, "H. Peter Anvin" <hpa@zytor.com>, Michael Ellerman <mpe@ellerman.id.au>, "Benjamin Herrenschmidt" <benh@kernel.crashing.org>, Paul Mackerras <paulus@samba.org>, Christophe Leroy <christophe.leroy@csgroup.eu>, Matthew Wilcox <willy@infradead.org>, Kefeng Wang <wangkefeng.wang@huawei.com> Subject: [PATCH v2 3/3] x86: Support huge vmalloc mappings Date: Mon, 27 Dec 2021 22:59:03 +0800 Message-ID: <20211227145903.187152-4-wangkefeng.wang@huawei.com> In-Reply-To: <20211227145903.187152-1-wangkefeng.wang@huawei.com> References: <20211227145903.187152-1-wangkefeng.wang@huawei.com> MIME-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: support huge vmalloc mapping on arm64/x86 \| expand [v2,0/3] mm: support huge vmalloc mapping on arm64/x86 [v2,1/3] mm: vmalloc: Let user to control huge vmalloc default behavior [v2,2/3] arm64: Support huge vmalloc mappings [v2,3/3] x86: Support huge vmalloc mappings

Kefeng Wang Dec. 27, 2021, 2:59 p.m. UTC

This patch select HAVE_ARCH_HUGE_VMALLOC to let X86_64 and X86_PAE
support huge vmalloc mappings.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 4 ++--
 arch/x86/Kconfig                                | 1 +
 arch/x86/kernel/module.c                        | 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

Dave Hansen Dec. 27, 2021, 3:56 p.m. UTC | #1

On 12/27/21 6:59 AM, Kefeng Wang wrote:
> This patch select HAVE_ARCH_HUGE_VMALLOC to let X86_64 and X86_PAE
> support huge vmalloc mappings.

In general, this seems interesting and the diff is simple.  But, I don't
see _any_ x86-specific data.  I think the bare minimum here would be a
few kernel compiles and some 'perf stat' data for some TLB events.

> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index 95fa745e310a..6bf5cb7d876a 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -75,8 +75,8 @@ void *module_alloc(unsigned long size)
>  
>  	p = __vmalloc_node_range(size, MODULE_ALIGN,
>  				    MODULES_VADDR + get_module_load_offset(),
> -				    MODULES_END, gfp_mask,
> -				    PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
> +				    MODULES_END, gfp_mask, PAGE_KERNEL,
> +				    VM_DEFER_KMEMLEAK | VM_NO_HUGE_VMAP, NUMA_NO_NODE,
>  				    __builtin_return_address(0));
>  	if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) {
>  		vfree(p);

To figure out what's going on in this hunk, I had to look at the cover
letter (which I wasn't cc'd on).  That's not great and it means that
somebody who stumbles upon this in the code is going to have a really
hard time figuring out what is going on.  Cover letters don't make it
into git history.

This desperately needs a comment and some changelog material in *this*
patch.

But, even the description from the cover letter is sparse:

> There are some disadvantages about this feature[2], one of the main
> concerns is the possible memory fragmentation/waste in some scenarios,
> also archs must ensure that any arch specific vmalloc allocations that
> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX)
> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.

That just says that x86 *needs* PAGE_SIZE allocations.  But, what
happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)?  Will the
subsequent permission changes just fragment the 2M mapping?

Kefeng Wang Dec. 28, 2021, 10:26 a.m. UTC | #2

On 2021/12/27 23:56, Dave Hansen wrote:
> On 12/27/21 6:59 AM, Kefeng Wang wrote:
>> This patch select HAVE_ARCH_HUGE_VMALLOC to let X86_64 and X86_PAE
>> support huge vmalloc mappings.
> In general, this seems interesting and the diff is simple.  But, I don't
> see _any_ x86-specific data.  I think the bare minimum here would be a
> few kernel compiles and some 'perf stat' data for some TLB events.

When the feature supported on ppc,

commit 8abddd968a303db75e4debe77a3df484164f1f33
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Mon May 3 19:17:55 2021 +1000

     powerpc/64s/radix: Enable huge vmalloc mappings

     This reduces TLB misses by nearly 30x on a `git diff` workload on a
     2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%, due
     to vfs hashes being allocated with 2MB pages.

But the data could be different on different machine/arch.

>> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
>> index 95fa745e310a..6bf5cb7d876a 100644
>> --- a/arch/x86/kernel/module.c
>> +++ b/arch/x86/kernel/module.c
>> @@ -75,8 +75,8 @@ void *module_alloc(unsigned long size)
>>   
>>   	p = __vmalloc_node_range(size, MODULE_ALIGN,
>>   				    MODULES_VADDR + get_module_load_offset(),
>> -				    MODULES_END, gfp_mask,
>> -				    PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
>> +				    MODULES_END, gfp_mask, PAGE_KERNEL,
>> +				    VM_DEFER_KMEMLEAK | VM_NO_HUGE_VMAP, NUMA_NO_NODE,
>>   				    __builtin_return_address(0));
>>   	if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) {
>>   		vfree(p);
> To figure out what's going on in this hunk, I had to look at the cover
> letter (which I wasn't cc'd on).  That's not great and it means that
> somebody who stumbles upon this in the code is going to have a really
> hard time figuring out what is going on.  Cover letters don't make it
> into git history.
Sorry for that, will add more into arch's patch changelog.
> This desperately needs a comment and some changelog material in *this*
> patch.
>
> But, even the description from the cover letter is sparse:
>
>> There are some disadvantages about this feature[2], one of the main
>> concerns is the possible memory fragmentation/waste in some scenarios,
>> also archs must ensure that any arch specific vmalloc allocations that
>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX)
>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
> That just says that x86 *needs* PAGE_SIZE allocations.  But, what
> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)?  Will the
> subsequent permission changes just fragment the 2M mapping?
> .

Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping.

When module alloc with STRICT_MODULE_RWX on x86, it calls 
__change_page_attr()

from set_memory_ro/rw/nx which will split large page, so there is no 
need to make

module alloc with HUGE_VMALLOC.

>                                   
>
>      
>

Dave Hansen Dec. 28, 2021, 4:14 p.m. UTC | #3

On 12/28/21 2:26 AM, Kefeng Wang wrote:
>>> There are some disadvantages about this feature[2], one of the main
>>> concerns is the possible memory fragmentation/waste in some scenarios,
>>> also archs must ensure that any arch specific vmalloc allocations that
>>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX)
>>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
>> That just says that x86 *needs* PAGE_SIZE allocations.  But, what
>> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)?  Will the
>> subsequent permission changes just fragment the 2M mapping?
> 
> Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping.
> 
> When module alloc with STRICT_MODULE_RWX on x86, it calls
> __change_page_attr()
> 
> from set_memory_ro/rw/nx which will split large page, so there is no
> need to make
> 
> module alloc with HUGE_VMALLOC.

This all sounds very fragile to me.  Every time a new architecture would
get added for huge vmalloc() support, the developer needs to know to go
find that architecture's module_alloc() and add this flag.  They next
guy is going to forget, just like you did.

Considering that this is not a hot path, a weak function would be a nice
choice:

/* vmalloc() flags used for all module allocations. */
unsigned long __weak arch_module_vm_flags()
{
	/*
	 * Modules use a single, large vmalloc().  Different
	 * permissions are applied later and will fragment
	 * huge mappings.  Avoid using huge pages for modules.
	 */
	return VM_NO_HUGE_VMAP;
}

Stick that in some the common module code, next to:

> void * __weak module_alloc(unsigned long size)
> {
>         return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
...

Then, put arch_module_vm_flags() in *all* of the module_alloc()
implementations, including the generic one.  That way (even with a new
architecture) whoever copies-and-pastes their module_alloc()
implementation is likely to get it right.  The next guy who just does a
"select HAVE_ARCH_HUGE_VMALLOC" will hopefully just work.

VM_FLUSH_RESET_PERMS could probably be dealt with in the same way.

Kefeng Wang Dec. 29, 2021, 11:01 a.m. UTC | #4

On 2021/12/29 0:14, Dave Hansen wrote:
> On 12/28/21 2:26 AM, Kefeng Wang wrote:
>>>> There are some disadvantages about this feature[2], one of the main
>>>> concerns is the possible memory fragmentation/waste in some scenarios,
>>>> also archs must ensure that any arch specific vmalloc allocations that
>>>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX)
>>>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
>>> That just says that x86 *needs* PAGE_SIZE allocations.  But, what
>>> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)?  Will the
>>> subsequent permission changes just fragment the 2M mapping?
>> Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping.
>>
>> When module alloc with STRICT_MODULE_RWX on x86, it calls
>> __change_page_attr()
>>
>> from set_memory_ro/rw/nx which will split large page, so there is no
>> need to make
>>
>> module alloc with HUGE_VMALLOC.
> This all sounds very fragile to me.  Every time a new architecture would
> get added for huge vmalloc() support, the developer needs to know to go
> find that architecture's module_alloc() and add this flag.  They next
> guy is going to forget, just like you did.
>
> Considering that this is not a hot path, a weak function would be a nice
> choice:
>
> /* vmalloc() flags used for all module allocations. */
> unsigned long __weak arch_module_vm_flags()
> {
> 	/*
> 	 * Modules use a single, large vmalloc().  Different
> 	 * permissions are applied later and will fragment
> 	 * huge mappings.  Avoid using huge pages for modules.
> 	 */
> 	return VM_NO_HUGE_VMAP;

For x86, it only fragment, but for arm64, due to apply_to_page_range() in

set_memory_*, without this flag maybe crash. Whatever, we need this

flag for module.

> }
>
> Stick that in some the common module code, next to:
>
>> void * __weak module_alloc(unsigned long size)
>> {
>>          return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> ...
>
> Then, put arch_module_vm_flags() in *all* of the module_alloc()
> implementations, including the generic one.  That way (even with a new
> architecture) whoever copies-and-pastes their module_alloc()
> implementation is likely to get it right.  The next guy who just does a
> "select HAVE_ARCH_HUGE_VMALLOC" will hopefully just work.

OK， Let me check the VM_FLUSH_RESET_PERMS and try about this way.

Thanks.

>
> VM_FLUSH_RESET_PERMS could probably be dealt with in the same way.
> .

Christophe Leroy Jan. 15, 2022, 10:06 a.m. UTC | #5

Le 27/12/2021 à 15:59, Kefeng Wang a écrit :
> This patch select HAVE_ARCH_HUGE_VMALLOC to let X86_64 and X86_PAE
> support huge vmalloc mappings.
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>   Documentation/admin-guide/kernel-parameters.txt | 4 ++--
>   arch/x86/Kconfig                                | 1 +
>   arch/x86/kernel/module.c                        | 4 ++--
>   3 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index e3f9fd7ec106..ffce6591ae64 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1639,7 +1639,7 @@
>   			precedence over memory_hotplug.memmap_on_memory.
>   
>   
> -	hugevmalloc=	[KNL,PPC,ARM64] Reguires CONFIG_HAVE_ARCH_HUGE_VMALLOC
> +	hugevmalloc=	[KNL,PPC,ARM64,X86] Reguires CONFIG_HAVE_ARCH_HUGE_VMALLOC
>   			Format: { on | off }
>   			Default set by CONFIG_HUGE_VMALLOC_DEFAULT_ENABLED.
>   
> @@ -3424,7 +3424,7 @@
>   
>   	nohugeiomap	[KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings.
>   
> -	nohugevmalloc	[KNL,PPC,ARM64] Disable kernel huge vmalloc mappings.
> +	nohugevmalloc	[KNL,PPC,ARM64,X86] Disable kernel huge vmalloc mappings.
>   
>   	nosmt		[KNL,S390] Disable symmetric multithreading (SMT).
>   			Equivalent to smt=1.
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index ebe8fc76949a..f6bf6675bbe7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -157,6 +157,7 @@ config X86
>   	select HAVE_ACPI_APEI_NMI		if ACPI
>   	select HAVE_ALIGNED_STRUCT_PAGE		if SLUB
>   	select HAVE_ARCH_AUDITSYSCALL
> +	select HAVE_ARCH_HUGE_VMALLOC		if HAVE_ARCH_HUGE_VMAP
>   	select HAVE_ARCH_HUGE_VMAP		if X86_64 || X86_PAE
>   	select HAVE_ARCH_JUMP_LABEL
>   	select HAVE_ARCH_JUMP_LABEL_RELATIVE
> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index 95fa745e310a..6bf5cb7d876a 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -75,8 +75,8 @@ void *module_alloc(unsigned long size)
>   
>   	p = __vmalloc_node_range(size, MODULE_ALIGN,
>   				    MODULES_VADDR + get_module_load_offset(),
> -				    MODULES_END, gfp_mask,
> -				    PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
> +				    MODULES_END, gfp_mask, PAGE_KERNEL,
> +				    VM_DEFER_KMEMLEAK | VM_NO_HUGE_VMAP, NUMA_NO_NODE,

you should add a comment like powerpc (commit 8abddd968a30 
("powerpc/64s/radix: Enable huge vmalloc mappings")) to explain why this 
requires VM_NO_HUGE_VMAP

>   				    __builtin_return_address(0));
>   	if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) {
>   		vfree(p);

Christophe Leroy Jan. 15, 2022, 10:11 a.m. UTC | #6

Le 28/12/2021 à 11:26, Kefeng Wang a écrit :
> 
> On 2021/12/27 23:56, Dave Hansen wrote:
>> On 12/27/21 6:59 AM, Kefeng Wang wrote:
>>> This patch select HAVE_ARCH_HUGE_VMALLOC to let X86_64 and X86_PAE
>>> support huge vmalloc mappings.
>> In general, this seems interesting and the diff is simple.  But, I don't
>> see _any_ x86-specific data.  I think the bare minimum here would be a
>> few kernel compiles and some 'perf stat' data for some TLB events.
> 
> When the feature supported on ppc,
> 
> commit 8abddd968a303db75e4debe77a3df484164f1f33
> Author: Nicholas Piggin <npiggin@gmail.com>
> Date:   Mon May 3 19:17:55 2021 +1000
> 
>      powerpc/64s/radix: Enable huge vmalloc mappings
> 
>      This reduces TLB misses by nearly 30x on a `git diff` workload on a
>      2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%, due
>      to vfs hashes being allocated with 2MB pages.
> 
> But the data could be different on different machine/arch.
> 
>>> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
>>> index 95fa745e310a..6bf5cb7d876a 100644
>>> --- a/arch/x86/kernel/module.c
>>> +++ b/arch/x86/kernel/module.c
>>> @@ -75,8 +75,8 @@ void *module_alloc(unsigned long size)
>>>       p = __vmalloc_node_range(size, MODULE_ALIGN,
>>>                       MODULES_VADDR + get_module_load_offset(),
>>> -                    MODULES_END, gfp_mask,
>>> -                    PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
>>> +                    MODULES_END, gfp_mask, PAGE_KERNEL,
>>> +                    VM_DEFER_KMEMLEAK | VM_NO_HUGE_VMAP, NUMA_NO_NODE,
>>>                       __builtin_return_address(0));
>>>       if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) {
>>>           vfree(p);
>> To figure out what's going on in this hunk, I had to look at the cover
>> letter (which I wasn't cc'd on).  That's not great and it means that
>> somebody who stumbles upon this in the code is going to have a really
>> hard time figuring out what is going on.  Cover letters don't make it
>> into git history.
> Sorry for that, will add more into arch's patch changelog.
>> This desperately needs a comment and some changelog material in *this*
>> patch.
>>
>> But, even the description from the cover letter is sparse:
>>
>>> There are some disadvantages about this feature[2], one of the main
>>> concerns is the possible memory fragmentation/waste in some scenarios,
>>> also archs must ensure that any arch specific vmalloc allocations that
>>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX)
>>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
>> That just says that x86 *needs* PAGE_SIZE allocations.  But, what
>> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)?  Will the
>> subsequent permission changes just fragment the 2M mapping?
>> .
> 
> Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping.
> 
> When module alloc with STRICT_MODULE_RWX on x86, it calls 
> __change_page_attr()
> 
> from set_memory_ro/rw/nx which will split large page, so there is no 
> need to make
> 
> module alloc with HUGE_VMALLOC.
> 

Maybe there is no need to perform the module alloc with HUGE_VMALLOC, 
but it least it would still work if you do so.

Powerpc did add VM_NO_HUGE_VMAP temporarily and for some reason which is 
  explained in a comment.

If x86 already has the necessary logic to handle it, why add 
VM_NO_HUGE_VMAP ?

Christophe

Christophe Leroy Jan. 15, 2022, 10:15 a.m. UTC | #7

Le 28/12/2021 à 17:14, Dave Hansen a écrit :
> On 12/28/21 2:26 AM, Kefeng Wang wrote:
>>>> There are some disadvantages about this feature[2], one of the main
>>>> concerns is the possible memory fragmentation/waste in some scenarios,
>>>> also archs must ensure that any arch specific vmalloc allocations that
>>>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX)
>>>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
>>> That just says that x86 *needs* PAGE_SIZE allocations.  But, what
>>> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)?  Will the
>>> subsequent permission changes just fragment the 2M mapping?
>>
>> Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping.
>>
>> When module alloc with STRICT_MODULE_RWX on x86, it calls
>> __change_page_attr()
>>
>> from set_memory_ro/rw/nx which will split large page, so there is no
>> need to make
>>
>> module alloc with HUGE_VMALLOC.
> 
> This all sounds very fragile to me.  Every time a new architecture would
> get added for huge vmalloc() support, the developer needs to know to go
> find that architecture's module_alloc() and add this flag.  They next
> guy is going to forget, just like you did.

That's not correct from my point of view.

When powerpc added that, a clear comment explains why:


+	/*
+	 * Don't do huge page allocations for modules yet until more testing
+	 * is done. STRICT_MODULE_RWX may require extra work to support this
+	 * too.
+	 */

So as you can see, this is something specific to powerpc and temporary.

> 
> Considering that this is not a hot path, a weak function would be a nice
> choice:
> 
> /* vmalloc() flags used for all module allocations. */
> unsigned long __weak arch_module_vm_flags()
> {
> 	/*
> 	 * Modules use a single, large vmalloc().  Different
> 	 * permissions are applied later and will fragment
> 	 * huge mappings.  Avoid using huge pages for modules.
> 	 */

Why ? Not everybody use STRICT_MODULES_RWX.
Even if you do so, you can still benefit from huge pages for modules.

Why make what was initially a temporary precaution for powerpc become a 
definitive default limitation for all ?

> 	return VM_NO_HUGE_VMAP;
> }
> 
> Stick that in some the common module code, next to:
> 
>> void * __weak module_alloc(unsigned long size)
>> {
>>          return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> ...
> 
> Then, put arch_module_vm_flags() in *all* of the module_alloc()
> implementations, including the generic one.  That way (even with a new
> architecture) whoever copies-and-pastes their module_alloc()
> implementation is likely to get it right.  The next guy who just does a
> "select HAVE_ARCH_HUGE_VMALLOC" will hopefully just work.
> 
> VM_FLUSH_RESET_PERMS could probably be dealt with in the same way.

Christophe Leroy Jan. 15, 2022, 10:17 a.m. UTC | #8

Le 29/12/2021 à 12:01, Kefeng Wang a écrit :
> 
> On 2021/12/29 0:14, Dave Hansen wrote:
>> On 12/28/21 2:26 AM, Kefeng Wang wrote:
>>>>> There are some disadvantages about this feature[2], one of the main
>>>>> concerns is the possible memory fragmentation/waste in some scenarios,
>>>>> also archs must ensure that any arch specific vmalloc allocations that
>>>>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX)
>>>>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
>>>> That just says that x86 *needs* PAGE_SIZE allocations.  But, what
>>>> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)?  Will the
>>>> subsequent permission changes just fragment the 2M mapping?
>>> Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping.
>>>
>>> When module alloc with STRICT_MODULE_RWX on x86, it calls
>>> __change_page_attr()
>>>
>>> from set_memory_ro/rw/nx which will split large page, so there is no
>>> need to make
>>>
>>> module alloc with HUGE_VMALLOC.
>> This all sounds very fragile to me.  Every time a new architecture would
>> get added for huge vmalloc() support, the developer needs to know to go
>> find that architecture's module_alloc() and add this flag.  They next
>> guy is going to forget, just like you did.
>>
>> Considering that this is not a hot path, a weak function would be a nice
>> choice:
>>
>> /* vmalloc() flags used for all module allocations. */
>> unsigned long __weak arch_module_vm_flags()
>> {
>>     /*
>>      * Modules use a single, large vmalloc().  Different
>>      * permissions are applied later and will fragment
>>      * huge mappings.  Avoid using huge pages for modules.
>>      */
>>     return VM_NO_HUGE_VMAP;
> 
> For x86, it only fragment, but for arm64, due to apply_to_page_range() in
> 
> set_memory_*, without this flag maybe crash. Whatever, we need this
> 
> flag for module.

I see no reason to have this flag by default.

Only ARM should have it if necessary, with a comment explaining why just 
like powerpc.

And maybe the flag should be there only when STRICT_MODULE_RWX is selected.

> 
>> }
>>
>> Stick that in some the common module code, next to:
>>
>>> void * __weak module_alloc(unsigned long size)
>>> {
>>>          return __vmalloc_node_range(size, 1, VMALLOC_START, 
>>> VMALLOC_END,
>> ...
>>
>> Then, put arch_module_vm_flags() in *all* of the module_alloc()
>> implementations, including the generic one.  That way (even with a new
>> architecture) whoever copies-and-pastes their module_alloc()
>> implementation is likely to get it right.  The next guy who just does a
>> "select HAVE_ARCH_HUGE_VMALLOC" will hopefully just work.
> 
> OK， Let me check the VM_FLUSH_RESET_PERMS and try about this way.
> 
> Thanks.
> 
>>
>> VM_FLUSH_RESET_PERMS could probably be dealt with in the same way.
>> .

Dave Hansen Jan. 18, 2022, 5:28 p.m. UTC | #9

On 1/17/22 6:46 PM, Nicholas Piggin wrote:
>> This all sounds very fragile to me.  Every time a new architecture would
>> get added for huge vmalloc() support, the developer needs to know to go
>> find that architecture's module_alloc() and add this flag.
> This is documented in the Kconfig.
> 
>  #
>  #  Archs that select this would be capable of PMD-sized vmaps (i.e.,
>  #  arch_vmap_pmd_supported() returns true), and they must make no assumptions
>  #  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag
>  #  can be used to prohibit arch-specific allocations from using hugepages to
>  #  help with this (e.g., modules may require it).
>  #
>  config HAVE_ARCH_HUGE_VMALLOC
>          depends on HAVE_ARCH_HUGE_VMAP
>          bool
> 
> Is it really fair to say it's *very* fragile? Surely it's reasonable to 
> read the (not very long) documentation ad understand the consequences for
> the arch code before enabling it.

Very fragile or not, I think folks are likely to get it wrong.  It would
be nice to have it default *everyone* to safe and slow and make *sure*
they go look at the architecture modules code itself before enabling
this for modules.

Just from that Kconfig text, I don't think I'd know off the top of my
head what do do for x86, or what code I needed to go touch.

Nicholas Piggin Jan. 19, 2022, 4:17 a.m. UTC | #10

Excerpts from Dave Hansen's message of January 19, 2022 3:28 am:
> On 1/17/22 6:46 PM, Nicholas Piggin wrote:
>>> This all sounds very fragile to me.  Every time a new architecture would
>>> get added for huge vmalloc() support, the developer needs to know to go
>>> find that architecture's module_alloc() and add this flag.
>> This is documented in the Kconfig.
>> 
>>  #
>>  #  Archs that select this would be capable of PMD-sized vmaps (i.e.,
>>  #  arch_vmap_pmd_supported() returns true), and they must make no assumptions
>>  #  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag
>>  #  can be used to prohibit arch-specific allocations from using hugepages to
>>  #  help with this (e.g., modules may require it).
>>  #
>>  config HAVE_ARCH_HUGE_VMALLOC
>>          depends on HAVE_ARCH_HUGE_VMAP
>>          bool
>> 
>> Is it really fair to say it's *very* fragile? Surely it's reasonable to 
>> read the (not very long) documentation ad understand the consequences for
>> the arch code before enabling it.
> 
> Very fragile or not, I think folks are likely to get it wrong.  It would
> be nice to have it default *everyone* to safe and slow and make *sure*

It's not safe to enable though. That's the problem. If it was just 
modules then you'd have a point but it could be anything.

> they go look at the architecture modules code itself before enabling
> this for modules.

This is required not just for modules for the whole arch code, it
has to be looked at and decided this will work.

> Just from that Kconfig text, I don't think I'd know off the top of my
> head what do do for x86, or what code I needed to go touch.

You have to make sure arch/x86 makes no assumptions that vmalloc memory
is backed by PAGE_SIZE ptes. If you can't do that then you shouldn't 
enable the option. The option can not explain it any more because any
arch could do anything with its mappings. The module code is an example,
not the recipe.

Thanks,
Nick

Kefeng Wang Jan. 19, 2022, 1:32 p.m. UTC | #11

On 2022/1/19 12:17, Nicholas Piggin wrote:
> Excerpts from Dave Hansen's message of January 19, 2022 3:28 am:
>> On 1/17/22 6:46 PM, Nicholas Piggin wrote:
>>>> This all sounds very fragile to me.  Every time a new architecture would
>>>> get added for huge vmalloc() support, the developer needs to know to go
>>>> find that architecture's module_alloc() and add this flag.
>>> This is documented in the Kconfig.
>>>
>>>   #
>>>   #  Archs that select this would be capable of PMD-sized vmaps (i.e.,
>>>   #  arch_vmap_pmd_supported() returns true), and they must make no assumptions
>>>   #  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag
>>>   #  can be used to prohibit arch-specific allocations from using hugepages to
>>>   #  help with this (e.g., modules may require it).
>>>   #
>>>   config HAVE_ARCH_HUGE_VMALLOC
>>>           depends on HAVE_ARCH_HUGE_VMAP
>>>           bool
>>>
>>> Is it really fair to say it's *very* fragile? Surely it's reasonable to
>>> read the (not very long) documentation ad understand the consequences for
>>> the arch code before enabling it.
>> Very fragile or not, I think folks are likely to get it wrong.  It would
>> be nice to have it default *everyone* to safe and slow and make *sure*
> It's not safe to enable though. That's the problem. If it was just
> modules then you'd have a point but it could be anything.
>
>> they go look at the architecture modules code itself before enabling
>> this for modules.
> This is required not just for modules for the whole arch code, it
> has to be looked at and decided this will work.
>
>> Just from that Kconfig text, I don't think I'd know off the top of my
>> head what do do for x86, or what code I needed to go touch.
> You have to make sure arch/x86 makes no assumptions that vmalloc memory
> is backed by PAGE_SIZE ptes. If you can't do that then you shouldn't
> enable the option. The option can not explain it any more because any
> arch could do anything with its mappings. The module code is an example,
> not the recipe.

Hi Nick, Dave and Christophe，thanks for your review,  a little 
confused,   I think,

1) for ppc/arm64 module_alloc(),  it must set VM_NO_HUGE_VMAP because the

arch's set_memory_* funcitons can only support PAGE_SIZE mapping, due to the

limit of apply_to_page_range().

2) but for x86's module_alloc(), add VM_NO_HUGE_VMAP is to avoid 
fragmentation,

x86's __change_page_attr functions will split the huge mapping. this 
flags is not a must.


and the behavior above occurred when STRICT_MODULE_RWX enabled, so

1) add a unified function to set vm flags(suggested by Dave ) or

2) add vm flags with some comments to per-arch's module_alloc()

are both acceptable, for the way of unified function ,  we could make 
this a default recipe

with STRICT_MODULE_RWX, also make two more vm flags into it, eg,

+unsigned long module_alloc_vm_flags(bool need_flush_reset_perms)
+{
+       unsigned long vm_flags = VM_DEFER_KMEMLEAK;
+
+       if (need_flush_reset_perms)
+               vm_flags |= VM_FLUSH_RESET_PERMS;
+       /*
+        * Modules use a single, large vmalloc(). Different permissions
+        * are applied later and will fragment huge mappings or even
+        * fails in set_memory_* on some architectures. Avoid using
+        * huge pages for modules.
+        */
+       if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
+               vm_flags |= VM_NO_HUGE_VMAP;
+
+       return vm_flags;
+}

then called each arch's module_alloc().

Any suggestion, many thanks.


>
> Thanks,
> Nick
> .

[v2,3/3] x86: Support huge vmalloc mappings

Commit Message

Comments

Patch