Message ID | 87pm6qm5wo.ffs@tglx (mailing list archive)
---|---
State | New, archived
Series | ARM: tlb: Prevent flushing insane large ranges one by one
On 2023-05-24 10:32, Thomas Gleixner wrote:
> vmalloc uses lazy TLB flushes for unmapped ranges to avoid excessive TLB
> flushing on every unmap. The lazy flushing coalesces unmapped ranges and
> invokes flush_tlb_kernel_range() with the combined range.
>
> The coalescing can result in ranges which span the full vmalloc address
> range. In the case of flushing an executable mapping in the module address
> space this range is extended to also flush the direct map alias.
>
> flush_tlb_kernel_range() then walks insanely large ranges; the worst case
> observed was ~1.5GB.
>
> The range is flushed page by page, which takes several milliseconds to
> complete in the worst case and obviously affects all processes in the
> system. In the worst case observed this causes the runtime of a realtime
> task on an isolated CPU to be almost doubled over the normal worst
> case, which makes it miss the deadline.
>
> Cure this by sanity checking the range against a threshold and falling
> back to flush_tlb_all() when the range is too large.
>
> The default threshold is 32 pages, but for CPUs with CP15 this is evaluated
> at boot time via read_cpuid(CPUID_TLBTYPE) and set to half of the TLB
> size.
>
> The vmalloc range coalescing could be improved to provide a list or
> array of ranges to flush, which would allow avoiding overbroad flushing,
> but that's major surgery and does not solve the problem of actually
> justified large range flushes which can happen due to the lazy flush
> mechanics in vmalloc. The lazy flush results in batching which is biased
> towards large range flushes by design.
>
> Fixes: db64fe02258f ("mm: rewrite vmap layer")
> Reported-by: John Ogness <john.ogness@linutronix.de>
> Debugged-by: John Ogness <john.ogness@linutronix.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Tested-by: John Ogness <john.ogness@linutronix.de>
> Link: https://lore.kernel.org/all/87a5y5a6kj.ffs@tglx
> ---
>  arch/arm/include/asm/cputype.h  |  5 +++++
>  arch/arm/include/asm/tlbflush.h |  2 ++
>  arch/arm/kernel/setup.c         | 10 ++++++++++
>  arch/arm/kernel/smp_tlb.c       |  4 ++++
>  4 files changed, 21 insertions(+)
>
> --- a/arch/arm/include/asm/cputype.h
> +++ b/arch/arm/include/asm/cputype.h
> @@ -196,6 +196,11 @@ static inline unsigned int __attribute_c
>  	return read_cpuid(CPUID_MPUIR);
>  }
>
> +static inline unsigned int __attribute_const__ read_cpuid_tlbsize(void)
> +{
> +	return 64 << ((read_cpuid(CPUID_TLBTYPE) >> 1) & 0x03);
> +}

This appears to be specific to Cortex-A9 - these bits are
implementation-defined, and it looks like on most other Arm Ltd. CPUs
they have no meaning at all, e.g. [1][2][3], but they could still hold
some wildly unrelated value on other implementations.

Thanks,
Robin.
[1] https://developer.arm.com/documentation/ddi0344/k/system-control-coprocessor/system-control-coprocessor-registers/c0--tlb-type-register
[2] https://developer.arm.com/documentation/ddi0464/f/System-Control/Register-descriptions/TLB-Type-Register
[3] https://developer.arm.com/documentation/ddi0500/j/System-Control/AArch32-register-descriptions/TLB-Type-Register

> +
>  #elif defined(CONFIG_CPU_V7M)
>
>  static inline unsigned int __attribute_const__ read_cpuid_id(void)
[...]
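For reference, the decode implied by the proposed read_cpuid_tlbsize() maps the two-bit field at bits [2:1] of the TLB Type value to 64, 128, 256 or 512 entries, with the flush-all threshold set to half of that. A minimal stand-alone sketch of just that arithmetic (the field values are iterated synthetically here, not read from hardware, and per Robin's point the bit positions only mean this on Cortex-A9):

#include <stdio.h>

/* Same arithmetic as the proposed read_cpuid_tlbsize(): bits [2:1] of the
 * (Cortex-A9 specific) TLB Type register select the TLB size. */
static unsigned int tlbsize_from_tlbtype(unsigned int tlbtype)
{
	return 64u << ((tlbtype >> 1) & 0x03);
}

int main(void)
{
	/* Walk all four possible field values: 64, 128, 256, 512 entries,
	 * giving thresholds of 32, 64, 128, 256 pages respectively. */
	for (unsigned int field = 0; field < 4; field++) {
		unsigned int entries = tlbsize_from_tlbtype(field << 1);

		printf("field %u -> %u entries -> threshold %u pages\n",
		       field, entries, entries / 2);
	}
	return 0;
}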
On Wed, May 24, 2023 at 11:18:12AM +0100, Robin Murphy wrote:
> On 2023-05-24 10:32, Thomas Gleixner wrote:
> [...]
> > +static inline unsigned int __attribute_const__ read_cpuid_tlbsize(void)
> > +{
> > +	return 64 << ((read_cpuid(CPUID_TLBTYPE) >> 1) & 0x03);
> > +}
>
> This appears to be specific to Cortex-A9 - these bits are
> implementation-defined, and it looks like on most other Arm Ltd. CPUs
> they have no meaning at all, e.g. [1][2][3], but they could still hold
> some wildly unrelated value on other implementations.

That sucks. I guess we'll need to decode the main CPU ID register and
have a table, except for Cortex-A9 where we can read the TLB size.

If that's not going to work either, then the MM layer needs to get
fixed not to be so utterly stupid to request a TLB flush over an
insanely large range - or people will just have to put up with
latency sucking on 32-bit ARM platforms.
On 2023-05-24 11:23, Russell King (Oracle) wrote:
> On Wed, May 24, 2023 at 11:18:12AM +0100, Robin Murphy wrote:
>> On 2023-05-24 10:32, Thomas Gleixner wrote:
>> [...]
>>> +static inline unsigned int __attribute_const__ read_cpuid_tlbsize(void)
>>> +{
>>> +	return 64 << ((read_cpuid(CPUID_TLBTYPE) >> 1) & 0x03);
>>> +}
>>
>> This appears to be specific to Cortex-A9 - these bits are
>> implementation-defined, and it looks like on most other Arm Ltd. CPUs
>> they have no meaning at all, e.g. [1][2][3], but they could still hold
>> some wildly unrelated value on other implementations.
>
> That sucks. I guess we'll need to decode the main CPU ID register and
> have a table, except for Cortex-A9 where we can read the TLB size.

Yes, it seems like Cortex-A9 is the odd one out for having
configurability here; otherwise the sizes seem to range from 32 entries
on Cortex-A8 to 1024 entries for Cortex-A17's main TLB, so having just
one single default value would seem less than optimal.

Thanks,
Robin.
> If that's not going to work either, then the MM layer needs to get
> fixed not to be so utterly stupid to request a TLB flush over an
> insanely large range - or people will just have to put up with
> latency sucking on 32-bit ARM platforms.
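For illustration only, the MIDR-keyed table Russell suggests could look roughly like the sketch below. This is a guess, not a tested implementation: it assumes read_cpuid_part() and the ARM_CPU_PART_* macros exist in arch/arm/include/asm/cputype.h as in current kernels, uses the Cortex-A8/A17 entry counts Robin quotes above, and keeps the patch's 32-page default for anything it does not recognise.

/* Illustrative sketch: pick a per-part TLB size estimate, falling back to
 * the patch's conservative default for unknown implementations. */
static void __init tlbsize_init(void)
{
#ifdef CONFIG_CPU_CP15
	switch (read_cpuid_part()) {
	case ARM_CPU_PART_CORTEX_A9:
		/* Only A9 has a readable, configurable TLB size. */
		tlb_flush_all_threshold = read_cpuid_tlbsize() / 2;
		break;
	case ARM_CPU_PART_CORTEX_A8:
		tlb_flush_all_threshold = 32 / 2;	/* 32-entry TLB */
		break;
	case ARM_CPU_PART_CORTEX_A17:
		tlb_flush_all_threshold = 1024 / 2;	/* 1024-entry main TLB */
		break;
	default:
		tlb_flush_all_threshold = 32;		/* patch default */
		break;
	}
#endif
}

Unknown parts deliberately keep the small default rather than guessing large, since an unnecessary flush_tlb_all() is cheaper than walking a huge range page by page.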
On Wed, May 24 2023 at 11:23, Russell King wrote:
> On Wed, May 24, 2023 at 11:18:12AM +0100, Robin Murphy wrote:
>> > +static inline unsigned int __attribute_const__ read_cpuid_tlbsize(void)
>> > +{
>> > +	return 64 << ((read_cpuid(CPUID_TLBTYPE) >> 1) & 0x03);
>> > +}
>>
>> This appears to be specific to Cortex-A9 - these bits are
>> implementation-defined, and it looks like on most other Arm Ltd. CPUs
>> they have no meaning at all, e.g. [1][2][3], but they could still hold
>> some wildly unrelated value on other implementations.

Bah.

> That sucks. I guess we'll need to decode the main CPU ID register and
> have a table, except for Cortex-A9 where we can read the TLB size.
>
> If that's not going to work either, then the MM layer needs to get
> fixed not to be so utterly stupid to request a TLB flush over an
> insanely large range - or people will just have to put up with
> latency sucking on 32-bit ARM platforms.

The problem is that there are legitimate cases for large ranges. Even if
we can provide a list or an array of ranges, then due to the batching in
the vmap layer it can still end up with a large number of pages to flush.

There is an obviously CPU-specific crossover point where

     N * t(single) > t(all) + t(refill)

That needs some perf analysis, but I'd be truly surprised if $N would be
large. On x86 the crossover point is around 32 pages.

Let's just assume 32 pages for that A9 too. That's 128k, which is not an
unreasonable size for large buffers. Though it's way under the batching
threshold of the vmalloc code, which scales logarithmically with the
number of online CPUs:

     fls(num_online_cpus()) * (32UL * 1024 * 1024);

That's 64M, i.e. 16k pages, for 2 CPUs...

The reasoning there is that a single flush all is in sum way cheaper than
16k individual flushes right at the point of each *free() operation.

Where the vmalloc layer could be less silly is the immediate flush case,
which only flushes 3 pages, but that won't happen overnight.

For now (and for eventual backporting) the occasional flush all is
definitely a better choice than the current situation.

Thanks,

        tglx
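To put numbers on the arithmetic above, here is a small self-contained sketch (it assumes 4 KiB pages, the 32-page threshold, and 2 online CPUs; fls() is reimplemented locally so the snippet builds outside the kernel):

#include <stdio.h>

#define PAGE_SIZE	4096UL	/* assumption: 4 KiB pages */

/* Minimal fls() equivalent so the sketch is self-contained. */
static int fls_ulong(unsigned long x)
{
	int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

int main(void)
{
	unsigned long threshold_pages = 32;	/* patch default / x86 crossover */
	unsigned long online_cpus = 2;

	/* Range above which the patch falls back to flush_tlb_all(). */
	unsigned long flush_all_bytes = threshold_pages * PAGE_SIZE;

	/* vmalloc lazy-flush batching limit quoted above. */
	unsigned long lazy_max_bytes =
		fls_ulong(online_cpus) * (32UL * 1024 * 1024);

	printf("flush_tlb_all() fallback above %lu KiB (%lu pages)\n",
	       flush_all_bytes / 1024, threshold_pages);
	printf("vmalloc lazy flush limit for %lu CPUs: %lu MiB (%lu pages)\n",
	       online_cpus, lazy_max_bytes / (1024 * 1024),
	       lazy_max_bytes / PAGE_SIZE);
	return 0;
}

This prints 128 KiB (32 pages) for the fallback point versus 64 MiB (16384 pages) for the lazy-flush batching limit, i.e. the numbers in the mail above.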
--- a/arch/arm/include/asm/cputype.h
+++ b/arch/arm/include/asm/cputype.h
@@ -196,6 +196,11 @@ static inline unsigned int __attribute_c
 	return read_cpuid(CPUID_MPUIR);
 }
 
+static inline unsigned int __attribute_const__ read_cpuid_tlbsize(void)
+{
+	return 64 << ((read_cpuid(CPUID_TLBTYPE) >> 1) & 0x03);
+}
+
 #elif defined(CONFIG_CPU_V7M)
 
 static inline unsigned int __attribute_const__ read_cpuid_id(void)
--- a/arch/arm/include/asm/tlbflush.h
+++ b/arch/arm/include/asm/tlbflush.h
@@ -210,6 +210,8 @@ struct cpu_tlb_fns {
 	unsigned long tlb_flags;
 };
 
+extern unsigned int tlb_flush_all_threshold;
+
 /*
  * Select the calling method
  */
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -90,6 +90,8 @@ EXPORT_SYMBOL(__machine_arch_type);
 unsigned int cacheid __read_mostly;
 EXPORT_SYMBOL(cacheid);
 
+unsigned int tlb_flush_all_threshold __ro_after_init = 32;
+
 unsigned int __atags_pointer __initdata;
 
 unsigned int system_rev;
@@ -356,6 +358,13 @@ static void __init cacheid_init(void)
 		cache_is_vipt_nonaliasing() ? "VIPT nonaliasing" : "unknown");
 }
 
+static void __init tlbsize_init(void)
+{
+#ifdef CONFIG_CPU_CP15
+	tlb_flush_all_threshold = read_cpuid_tlbsize() / 2;
+#endif
+}
+
 /*
  * These functions re-use the assembly code in head.S, which
  * already provide the required functionality.
@@ -747,6 +756,7 @@ static void __init setup_processor(void)
 	elf_hwcap_fixup();
 
 	cacheid_init();
+	tlbsize_init();
 	cpu_init();
 }
 
--- a/arch/arm/kernel/smp_tlb.c
+++ b/arch/arm/kernel/smp_tlb.c
@@ -234,6 +234,10 @@ void flush_tlb_range(struct vm_area_stru
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
+	if ((end - start) > (tlb_flush_all_threshold << PAGE_SHIFT)) {
+		flush_tlb_all();
+		return;
+	}
 	if (tlb_ops_need_broadcast()) {
 		struct tlb_args ta;
 		ta.ta_start = start;
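As a closing illustration of the size check the patch adds to flush_tlb_kernel_range(), the following user-space sketch mirrors the comparison (assuming PAGE_SHIFT = 12, i.e. 4 KiB pages, and the default 32-page threshold; would_flush_all() is a hypothetical helper, not a kernel function):

#include <stdio.h>

#define PAGE_SHIFT	12	/* assumption: 4 KiB pages */

/* Mirror of the patch's size check, reproduced outside the kernel. */
static int would_flush_all(unsigned long start, unsigned long end,
			   unsigned long threshold_pages)
{
	return (end - start) > (threshold_pages << PAGE_SHIFT);
}

int main(void)
{
	/* A 128 KB range sits exactly at the threshold and is still flushed
	 * page by page; the ~1.5 GB worst case becomes one flush_tlb_all(). */
	printf("128 KB range -> flush all? %d\n",
	       would_flush_all(0, 128UL * 1024, 32));
	printf("1.5 GB range -> flush all? %d\n",
	       would_flush_all(0, 1536UL * 1024 * 1024, 32));
	return 0;
}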