Message ID | 87pm6qm5wo.ffs@tglx (mailing list archive)
---|---
State | New, archived
Series | ARM: tlb: Prevent flushing insane large ranges one by one
On 2023-05-24 10:32, Thomas Gleixner wrote:
> vmalloc uses lazy TLB flushes for unmapped ranges to avoid excessive TLB
> flushing on every unmap. The lazy flushing coalesces unmapped ranges and
> invokes flush_tlb_kernel_range() with the combined range.
>
> The coalescing can result in ranges which span the full vmalloc address
> range. In the case of flushing an executable mapping in the module address
> space this range is extended to also flush the direct map alias.
>
> flush_tlb_kernel_range() then walks insanely large ranges; the worst case
> observed was ~1.5GB.
>
> The range is flushed page by page, which takes several milliseconds to
> complete in the worst case and obviously affects all processes in the
> system. In the worst case observed this causes the runtime of a realtime
> task on an isolated CPU to be almost doubled over the normal worst
> case, which makes it miss the deadline.
>
> Cure this by sanity checking the range against a threshold and falling
> back to flush_tlb_all() when the range is too large.
>
> The default threshold is 32 pages, but for CPUs with CP15 this is evaluated
> at boot time via read_cpuid(CPUID_TLBTYPE) and set to half of the TLB
> size.
>
> The vmalloc range coalescing could be improved to provide a list or
> array of ranges to flush, which would allow avoiding overbroad flushing,
> but that's major surgery and does not solve the problem of actually
> justified large range flushes which can happen due to the lazy flush
> mechanics in vmalloc. The lazy flush results in batching which is biased
> towards large range flushes by design.
>
> Fixes: db64fe02258f ("mm: rewrite vmap layer")
> Reported-by: John Ogness <john.ogness@linutronix.de>
> Debugged-by: John Ogness <john.ogness@linutronix.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Tested-by: John Ogness <john.ogness@linutronix.de>
> Link: https://lore.kernel.org/all/87a5y5a6kj.ffs@tglx
> ---
>  arch/arm/include/asm/cputype.h  |  5 +++++
>  arch/arm/include/asm/tlbflush.h |  2 ++
>  arch/arm/kernel/setup.c         | 10 ++++++++++
>  arch/arm/kernel/smp_tlb.c       |  4 ++++
>  4 files changed, 21 insertions(+)
>
> --- a/arch/arm/include/asm/cputype.h
> +++ b/arch/arm/include/asm/cputype.h
> @@ -196,6 +196,11 @@ static inline unsigned int __attribute_c
>  	return read_cpuid(CPUID_MPUIR);
>  }
>
> +static inline unsigned int __attribute_const__ read_cpuid_tlbsize(void)
> +{
> +	return 64 << ((read_cpuid(CPUID_TLBTYPE) >> 1) & 0x03);
> +}

This appears to be specific to Cortex-A9 - these bits are
implementation-defined, and it looks like on most other Arm Ltd. CPUs
they have no meaning at all, e.g. [1][2][3], but they could still hold
some wildly unrelated value on other implementations.

Thanks,
Robin.
[1] https://developer.arm.com/documentation/ddi0344/k/system-control-coprocessor/system-control-coprocessor-registers/c0--tlb-type-register
[2] https://developer.arm.com/documentation/ddi0464/f/System-Control/Register-descriptions/TLB-Type-Register
[3] https://developer.arm.com/documentation/ddi0500/j/System-Control/AArch32-register-descriptions/TLB-Type-Register

> +
>  #elif defined(CONFIG_CPU_V7M)
>
>  static inline unsigned int __attribute_const__ read_cpuid_id(void)
[...]
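For reference, the decode implied by the proposed read_cpuid_tlbsize() maps the two-bit field at bits [2:1] of the TLB Type value to 64, 128, 256 or 512 entries, with the flush-all threshold set to half of that. A minimal stand-alone sketch of just that arithmetic (the field values are iterated synthetically here, not read from hardware, and per Robin's point the bit positions only mean this on Cortex-A9):

#include <stdio.h>

/* Same arithmetic as the proposed read_cpuid_tlbsize(): bits [2:1] of the
 * (Cortex-A9 specific) TLB Type register select the TLB size. */
static unsigned int tlbsize_from_tlbtype(unsigned int tlbtype)
{
	return 64u << ((tlbtype >> 1) & 0x03);
}

int main(void)
{
	/* Walk all four possible field values: 64, 128, 256, 512 entries,
	 * giving thresholds of 32, 64, 128, 256 pages respectively. */
	for (unsigned int field = 0; field < 4; field++) {
		unsigned int entries = tlbsize_from_tlbtype(field << 1);

		printf("field %u -> %u entries -> threshold %u pages\n",
		       field, entries, entries / 2);
	}
	return 0;
}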
On Wed, May 24, 2023 at 11:18:12AM +0100, Robin Murphy wrote:
> On 2023-05-24 10:32, Thomas Gleixner wrote:
> [...]
> > +static inline unsigned int __attribute_const__ read_cpuid_tlbsize(void)
> > +{
> > +	return 64 << ((read_cpuid(CPUID_TLBTYPE) >> 1) & 0x03);
> > +}
>
> This appears to be specific to Cortex-A9 - these bits are
> implementation-defined, and it looks like on most other Arm Ltd. CPUs
> they have no meaning at all, e.g. [1][2][3], but they could still hold
> some wildly unrelated value on other implementations.

That sucks. I guess we'll need to decode the main CPU ID register and
have a table, except for Cortex-A9 where we can read the TLB size.

If that's not going to work either, then the MM layer needs to get
fixed not to be so utterly stupid to request a TLB flush over an
insanely large range - or people will just have to put up with
latency sucking on 32-bit ARM platforms.
On 2023-05-24 11:23, Russell King (Oracle) wrote:
> On Wed, May 24, 2023 at 11:18:12AM +0100, Robin Murphy wrote:
>> On 2023-05-24 10:32, Thomas Gleixner wrote:
>> [...]
>>> +static inline unsigned int __attribute_const__ read_cpuid_tlbsize(void)
>>> +{
>>> +	return 64 << ((read_cpuid(CPUID_TLBTYPE) >> 1) & 0x03);
>>> +}
>>
>> This appears to be specific to Cortex-A9 - these bits are
>> implementation-defined, and it looks like on most other Arm Ltd. CPUs
>> they have no meaning at all, e.g. [1][2][3], but they could still hold
>> some wildly unrelated value on other implementations.
>
> That sucks. I guess we'll need to decode the main CPU ID register and
> have a table, except for Cortex-A9 where we can read the TLB size.

Yes, it seems like Cortex-A9 is the odd one out for having
configurability here; otherwise the sizes seem to range from 32 entries
on Cortex-A8 to 1024 entries for Cortex-A17's main TLB, so having just
one single default value would seem less than optimal.

Thanks,
Robin.
> If that's not going to work either, then the MM layer needs to get
> fixed not to be so utterly stupid to request a TLB flush over an
> insanely large range - or people will just have to put up with
> latency sucking on 32-bit ARM platforms.
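For illustration only, the MIDR-keyed table Russell suggests could look roughly like the sketch below. This is a guess, not a tested implementation: it assumes read_cpuid_part() and the ARM_CPU_PART_* macros exist in arch/arm/include/asm/cputype.h as in current kernels, uses the Cortex-A8/A17 entry counts Robin quotes above, and keeps the patch's 32-page default for anything it does not recognise.

/* Illustrative sketch: pick a per-part TLB size estimate, falling back to
 * the patch's conservative default for unknown implementations. */
static void __init tlbsize_init(void)
{
#ifdef CONFIG_CPU_CP15
	switch (read_cpuid_part()) {
	case ARM_CPU_PART_CORTEX_A9:
		/* Only A9 has a readable, configurable TLB size. */
		tlb_flush_all_threshold = read_cpuid_tlbsize() / 2;
		break;
	case ARM_CPU_PART_CORTEX_A8:
		tlb_flush_all_threshold = 32 / 2;	/* 32-entry TLB */
		break;
	case ARM_CPU_PART_CORTEX_A17:
		tlb_flush_all_threshold = 1024 / 2;	/* 1024-entry main TLB */
		break;
	default:
		tlb_flush_all_threshold = 32;		/* patch default */
		break;
	}
#endif
}

Unknown parts deliberately keep the small default rather than guessing large, since an unnecessary flush_tlb_all() is cheaper than walking a huge range page by page.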
On Wed, May 24 2023 at 11:23, Russell King wrote:
> On Wed, May 24, 2023 at 11:18:12AM +0100, Robin Murphy wrote:
>> > +static inline unsigned int __attribute_const__ read_cpuid_tlbsize(void)
>> > +{
>> > +	return 64 << ((read_cpuid(CPUID_TLBTYPE) >> 1) & 0x03);
>> > +}
>>
>> This appears to be specific to Cortex-A9 - these bits are
>> implementation-defined, and it looks like on most other Arm Ltd. CPUs
>> they have no meaning at all, e.g. [1][2][3], but they could still hold
>> some wildly unrelated value on other implementations.

Bah.

> That sucks. I guess we'll need to decode the main CPU ID register and
> have a table, except for Cortex-A9 where we can read the TLB size.
>
> If that's not going to work either, then the MM layer needs to get
> fixed not to be so utterly stupid to request a TLB flush over an
> insanely large range - or people will just have to put up with
> latency sucking on 32-bit ARM platforms.

The problem is that there are legitimate cases for large ranges. Even if
we can provide a list or an array of ranges, then due to the batching in
the vmap layer it can still end up with a large number of pages to flush.

There is an obviously CPU-specific crossover point where

     N * t(single) > t(all) + t(refill)

That needs some perf analysis, but I'd be truly surprised if $N would be
large. On x86 the crossover point is around 32 pages.

Let's just assume 32 pages for that A9 too. That's 128k, which is not an
unreasonable size for large buffers. Though it's way under the batching
threshold of the vmalloc code, which scales logarithmically with the
number of online CPUs:

     fls(num_online_cpus()) * (32UL * 1024 * 1024);

That's 64M, i.e. 16k pages, for 2 CPUs...

The reasoning there is that a single flush all is in sum way cheaper than
16k individual flushes right at the point of each *free() operation.

Where the vmalloc layer could be less silly is the immediate flush case,
which only flushes 3 pages, but that won't happen overnight.

For now (and for eventual backporting) the occasional flush all is
definitely a better choice than the current situation.

Thanks,

        tglx
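To put numbers on the arithmetic above, here is a small self-contained sketch (it assumes 4 KiB pages, the 32-page threshold, and 2 online CPUs; fls() is reimplemented locally so the snippet builds outside the kernel):

#include <stdio.h>

#define PAGE_SIZE	4096UL	/* assumption: 4 KiB pages */

/* Minimal fls() equivalent so the sketch is self-contained. */
static int fls_ulong(unsigned long x)
{
	int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

int main(void)
{
	unsigned long threshold_pages = 32;	/* patch default / x86 crossover */
	unsigned long online_cpus = 2;

	/* Range above which the patch falls back to flush_tlb_all(). */
	unsigned long flush_all_bytes = threshold_pages * PAGE_SIZE;

	/* vmalloc lazy-flush batching limit quoted above. */
	unsigned long lazy_max_bytes =
		fls_ulong(online_cpus) * (32UL * 1024 * 1024);

	printf("flush_tlb_all() fallback above %lu KiB (%lu pages)\n",
	       flush_all_bytes / 1024, threshold_pages);
	printf("vmalloc lazy flush limit for %lu CPUs: %lu MiB (%lu pages)\n",
	       online_cpus, lazy_max_bytes / (1024 * 1024),
	       lazy_max_bytes / PAGE_SIZE);
	return 0;
}

This prints 128 KiB (32 pages) for the fallback point versus 64 MiB (16384 pages) for the lazy-flush batching limit, i.e. the numbers in the mail above.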
--- a/arch/arm/include/asm/cputype.h
+++ b/arch/arm/include/asm/cputype.h
@@ -196,6 +196,11 @@ static inline unsigned int __attribute_c
 	return read_cpuid(CPUID_MPUIR);
 }
 
+static inline unsigned int __attribute_const__ read_cpuid_tlbsize(void)
+{
+	return 64 << ((read_cpuid(CPUID_TLBTYPE) >> 1) & 0x03);
+}
+
 #elif defined(CONFIG_CPU_V7M)
 
 static inline unsigned int __attribute_const__ read_cpuid_id(void)
--- a/arch/arm/include/asm/tlbflush.h
+++ b/arch/arm/include/asm/tlbflush.h
@@ -210,6 +210,8 @@ struct cpu_tlb_fns {
 	unsigned long tlb_flags;
 };
 
+extern unsigned int tlb_flush_all_threshold;
+
 /*
  * Select the calling method
  */
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -90,6 +90,8 @@ EXPORT_SYMBOL(__machine_arch_type);
 unsigned int cacheid __read_mostly;
 EXPORT_SYMBOL(cacheid);
 
+unsigned int tlb_flush_all_threshold __ro_after_init = 32;
+
 unsigned int __atags_pointer __initdata;
 
 unsigned int system_rev;
@@ -356,6 +358,13 @@ static void __init cacheid_init(void)
 		cache_is_vipt_nonaliasing() ? "VIPT nonaliasing" : "unknown");
 }
 
+static void __init tlbsize_init(void)
+{
+#ifdef CONFIG_CPU_CP15
+	tlb_flush_all_threshold = read_cpuid_tlbsize() / 2;
+#endif
+}
+
 /*
  * These functions re-use the assembly code in head.S, which
  * already provide the required functionality.
@@ -747,6 +756,7 @@ static void __init setup_processor(void)
 	elf_hwcap_fixup();
 
 	cacheid_init();
+	tlbsize_init();
 	cpu_init();
 }
 
--- a/arch/arm/kernel/smp_tlb.c
+++ b/arch/arm/kernel/smp_tlb.c
@@ -234,6 +234,10 @@ void flush_tlb_range(struct vm_area_stru
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
+	if ((end - start) > (tlb_flush_all_threshold << PAGE_SHIFT)) {
+		flush_tlb_all();
+		return;
+	}
 	if (tlb_ops_need_broadcast()) {
 		struct tlb_args ta;
 		ta.ta_start = start;
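As a closing illustration of the size check the patch adds to flush_tlb_kernel_range(), the following user-space sketch mirrors the comparison (assuming PAGE_SHIFT = 12, i.e. 4 KiB pages, and the default 32-page threshold; would_flush_all() is a hypothetical helper, not a kernel function):

#include <stdio.h>

#define PAGE_SHIFT	12	/* assumption: 4 KiB pages */

/* Mirror of the patch's size check, reproduced outside the kernel. */
static int would_flush_all(unsigned long start, unsigned long end,
			   unsigned long threshold_pages)
{
	return (end - start) > (threshold_pages << PAGE_SHIFT);
}

int main(void)
{
	/* A 128 KB range sits exactly at the threshold and is still flushed
	 * page by page; the ~1.5 GB worst case becomes one flush_tlb_all(). */
	printf("128 KB range -> flush all? %d\n",
	       would_flush_all(0, 128UL * 1024, 32));
	printf("1.5 GB range -> flush all? %d\n",
	       would_flush_all(0, 1536UL * 1024 * 1024, 32));
	return 0;
}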