Message ID | 20250213161423.449435-7-riel@surriel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | AMD broadcast TLB invalidation | expand |
On 2/13/25 08:13, Rik van Riel wrote: > - if (info->end == TLB_FLUSH_ALL) > + if (broadcast_kernel_range_flush(info)) > + ; /* Fall through. */ > + else if (info->end == TLB_FLUSH_ALL) > on_each_cpu(do_flush_tlb_all, NULL, 1); > else > on_each_cpu(do_kernel_range_flush, info, 1); We've got to find a better name for broadcast_kernel_range_flush(). Because IPIs are broadcast too. The naming makes it confusing. Why would be broadcast, and then start trying IPIs that are also broadcast? This hunk really the crux of the patch and we need to make sure the semantics are crystal clear. What do folks think about something like this: /* * Try to use a hardware assist for flushing the TLB. * This is expected to be cheaper than using an IPI * to do flushes. * * Returns True if the assisted method succeeded. */ static bool assisted_kernel_range_flush(struct flush_tlb_info *info) { ... } static do_ipi_tlb_flush(...info) { if (info->end == TLB_FLUSH_ALL) on_each_cpu(do_flush_tlb_all, NULL, 1); else on_each_cpu(do_kernel_range_flush, info, 1); } Then we end up with: /* * Assisted flushes are assumed to be faster. Try * them first and fall back to IPIs if not available. */ flush_success = assisted_kernel_range_flush(info); if (!flush_success) do_ipi_tlb_flush(info); I think that's a *LOT* more clear: 1. Try assisted flush 2. If it fails fall back to an IPI-based flush
On Fri, Feb 14, 2025 at 10:35:40AM -0800, Dave Hansen wrote: > On 2/13/25 08:13, Rik van Riel wrote: > > - if (info->end == TLB_FLUSH_ALL) > > + if (broadcast_kernel_range_flush(info)) > > + ; /* Fall through. */ > > + else if (info->end == TLB_FLUSH_ALL) > > on_each_cpu(do_flush_tlb_all, NULL, 1); > > else > > on_each_cpu(do_kernel_range_flush, info, 1); > > We've got to find a better name for broadcast_kernel_range_flush(). > Because IPIs are broadcast too. The naming makes it confusing. Why would > be broadcast, and then start trying IPIs that are also broadcast? IIRC the more general name is indeed broadcast tlbi; as in other architectures use this naming to mean this very thing too. But yes, I see the confusion, but I don't think changing the naming really helps a lot here :-/
On 2/14/25 11:40, Peter Zijlstra wrote: > On Fri, Feb 14, 2025 at 10:35:40AM -0800, Dave Hansen wrote: >> On 2/13/25 08:13, Rik van Riel wrote: >>> - if (info->end == TLB_FLUSH_ALL) >>> + if (broadcast_kernel_range_flush(info)) >>> + ; /* Fall through. */ >>> + else if (info->end == TLB_FLUSH_ALL) >>> on_each_cpu(do_flush_tlb_all, NULL, 1); >>> else >>> on_each_cpu(do_kernel_range_flush, info, 1); >> We've got to find a better name for broadcast_kernel_range_flush(). >> Because IPIs are broadcast too. The naming makes it confusing. Why would >> be broadcast, and then start trying IPIs that are also broadcast? > IIRC the more general name is indeed broadcast tlbi; as in other > architectures use this naming to mean this very thing too. > > But yes, I see the confusion, but I don't think changing the naming > really helps a lot here :-/ Fair enough. If we don't have a better name, we can at least do: if (new_bad_name()) { new_thing(); } else { old_thing(); } My real heartburn is with: if (new_bad_name()) { new_thing(); } else if (need_thing_1()) { old_thing1(); } else { old_thing2(); } Where new and old are logically squished together.
On Fri, 2025-02-14 at 11:55 -0800, Dave Hansen wrote: > > Fair enough. If we don't have a better name, we can at least do: > > if (new_bad_name()) { > new_thing(); > } else { > old_thing(); > } > > My real heartburn is with: > > if (new_bad_name()) { > new_thing(); > } else if (need_thing_1()) { > old_thing1(); > } else { > old_thing2(); > } > > Where new and old are logically squished together. > Do we want to group this code by history, or by function? I would argue that new_thing() and old_thing1() are closer to each other functionally (they both do remote TLB invalidation) than they are to old_thing2(), which does local-only invalidation. I can organize the code however people want, but I would like a second opinion on this idea :)
On Fri, Feb 14, 2025 at 08:25:51PM -0500, Rik van Riel wrote: > On Fri, 2025-02-14 at 11:55 -0800, Dave Hansen wrote: > > > > Fair enough. If we don't have a better name, we can at least do: > > > > if (new_bad_name()) { > > new_thing(); > > } else { > > old_thing(); > > } > > > > My real heartburn is with: > > > > if (new_bad_name()) { > > new_thing(); > > } else if (need_thing_1()) { > > old_thing1(); > > } else { > > old_thing2(); > > } > > > > Where new and old are logically squished together. > > > Do we want to group this code by history, or > by function? > > I would argue that new_thing() and old_thing1() > are closer to each other functionally (they both > do remote TLB invalidation) than they are to > old_thing2(), which does local-only invalidation. > > I can organize the code however people want, > but I would like a second opinion on this idea :) IIUC the discussion is about: if (broadcast_kernel_range_flush(info)) ; /* Fall through. */ else if (info->end == TLB_FLUSH_ALL) on_each_cpu(do_flush_tlb_all, NULL, 1); else on_each_cpu(do_kernel_range_flush, info, 1); In this case I agree with Dave. old_thing1() and old_thing2() are both sending IPIs, the difference is that old_thing1() is doing a full flush while old_thing2() is doing a range flush. Not sure why you mentioned that old_thing2() does local invalidation. broadcast_kernel_range_flush() also decides between full and range flushes internally. So the main difference between 'new' and 'old' here is using the broadcast flush vs the IPI flush. So I think what Dave wants (and I agree) is: if (!broadcast_kernel_range_flush(info)) ipi_kernel_range_flush(info) Where ipi_kernel_range_flush() contains old_thing1() and oldthing2(). > -- > All Rights Reversed. >
On Sat, 2025-02-15 at 02:08 +0000, Yosry Ahmed wrote: > > So I think what Dave wants (and I agree) is: > if (!broadcast_kernel_range_flush(info)) > ipi_kernel_range_flush(info) > > Where ipi_kernel_range_flush() contains old_thing1() and oldthing2(). I like this idea. I changed the code to add that for the next version! Thank you.
On 2/18/25 10:00, Rik van Riel wrote: > On Sat, 2025-02-15 at 02:08 +0000, Yosry Ahmed wrote: >> So I think what Dave wants (and I agree) is: >> if (!broadcast_kernel_range_flush(info)) >> ipi_kernel_range_flush(info) >> >> Where ipi_kernel_range_flush() contains old_thing1() and oldthing2(). That's OK-ish. But it still smells of hacking in the new concept without refactoring things properly. Let's logically inline the code that we've got. I think it actually ends up looking something like this: if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) { if (info->end == TLB_FLUSH_ALL) { invlpgb_flush_all(); } else { for_each(addr) invlpgb_flush_addr_nosync(addr, nr); } } else { if (info->end == TLB_FLUSH_ALL) on_each_cpu(do_flush_tlb_all, NULL, 1); else on_each_cpu(do_kernel_range_flush, info, 1); } Where we've got two inputs: 1. INVLPGB support (or not) 2. TLB_FLUSH_ALL (basically ranged or full flush) So I think we should group by *one* of those. The above groups by INVLPGB support and this groups by TLB_FLUSH_ALL: if (info->end == TLB_FLUSH_ALL) { if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) { invlpgb_flush_all(); } else { on_each_cpu(do_flush_tlb_all, NULL, 1); } } else { if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) for_each(addr) invlpgb_flush_addr_nosync(addr, nr); else on_each_cpu(do_kernel_range_flush, info, 1); } So, if we create some helpers that give some consistent naming: static tlb_flush_all_ipi(...) { on_each_cpu(do_flush_tlb_all, NULL, 1); } static tlb_flush_all(...) { if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) invlpgb_flush_all(...); else tlb_flush_all_ipi(...); } and then also create the ranged equivalents (which internally have the same cpu_feature_enabled() check): tlb_flush_range_ipi(...) invlpgb_flush_range(...) Then we can have the top-level code be: if (info->end == TLB_FLUSH_ALL) tlb_flush_all(info); else tlb_flush_range(info); That actually looks way nicer than what we have today. For bonus points, if a third way of flushing the TLB showed up, it would slot right in: static tlb_flush_all(...) { if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) invlpgb_flush_all(...); + else if cpu_feature_enabled(X86_FEATURE_RAR)) + rar_flush_all(...); else tlb_flush_all_ipi(...); } That's *exactly* the way we want the code to read. At the higher level, it's deciding based on the generic thing that *everybody* cares about: ranged or full flush. Then, at the lower level, it's deciding how to implement that high-level flush concept.
On Tue, Feb 18, 2025 at 02:27:31PM -0800, Dave Hansen wrote: > On 2/18/25 10:00, Rik van Riel wrote: > > On Sat, 2025-02-15 at 02:08 +0000, Yosry Ahmed wrote: > >> So I think what Dave wants (and I agree) is: > >> if (!broadcast_kernel_range_flush(info)) > >> ipi_kernel_range_flush(info) > >> > >> Where ipi_kernel_range_flush() contains old_thing1() and oldthing2(). > > That's OK-ish. But it still smells of hacking in the new concept without > refactoring things properly. > > Let's logically inline the code that we've got. I think it actually > ends up looking something like this: > > if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) { > if (info->end == TLB_FLUSH_ALL) { > invlpgb_flush_all(); > } else { > for_each(addr) > invlpgb_flush_addr_nosync(addr, nr); > } > } else { > if (info->end == TLB_FLUSH_ALL) > on_each_cpu(do_flush_tlb_all, NULL, 1); > else > on_each_cpu(do_kernel_range_flush, info, 1); > } > > Where we've got two inputs: > > 1. INVLPGB support (or not) > 2. TLB_FLUSH_ALL (basically ranged or full flush) > > So I think we should group by *one* of those. The above groups by > INVLPGB support and this groups by TLB_FLUSH_ALL: > > if (info->end == TLB_FLUSH_ALL) { > if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) { > invlpgb_flush_all(); > } else { > on_each_cpu(do_flush_tlb_all, NULL, 1); > } > } else { > if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) > for_each(addr) > invlpgb_flush_addr_nosync(addr, nr); > else > on_each_cpu(do_kernel_range_flush, info, 1); > } Yeah an if/else structure is better than using the invlpgb helper and falling back to IPIs if it returns false, and I also prefer grouping by the flush scope (range/flush). Thanks for the illustrations :) > > So, if we create some helpers that give some consistent naming: > > static tlb_flush_all_ipi(...) > { > on_each_cpu(do_flush_tlb_all, NULL, 1); > } > > static tlb_flush_all(...) > { > if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) > invlpgb_flush_all(...); > else > tlb_flush_all_ipi(...); > } > > and then also create the ranged equivalents (which internally have the > same cpu_feature_enabled() check): > > tlb_flush_range_ipi(...) > invlpgb_flush_range(...) > > Then we can have the top-level code be: > > if (info->end == TLB_FLUSH_ALL) > tlb_flush_all(info); > else > tlb_flush_range(info); > > That actually looks way nicer than what we have today. For bonus points, > if a third way of flushing the TLB showed up, it would slot right in: > > static tlb_flush_all(...) > { > if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) > invlpgb_flush_all(...); > + else if cpu_feature_enabled(X86_FEATURE_RAR)) > + rar_flush_all(...); > else > tlb_flush_all_ipi(...); > } > > That's *exactly* the way we want the code to read. At the higher level, > it's deciding based on the generic thing that *everybody* cares about: > ranged or full flush. Then, at the lower level, it's deciding how to > implement that high-level flush concept. >
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 924ac2263725..ce9df82754ce 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -1077,6 +1077,28 @@ void flush_tlb_all(void) on_each_cpu(do_flush_tlb_all, NULL, 1); } +static bool broadcast_kernel_range_flush(struct flush_tlb_info *info) +{ + unsigned long addr; + unsigned long nr; + + if (!cpu_feature_enabled(X86_FEATURE_INVLPGB)) + return false; + + if (info->end == TLB_FLUSH_ALL) { + invlpgb_flush_all(); + return true; + } + + for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) { + nr = (info->end - addr) >> PAGE_SHIFT; + nr = clamp_val(nr, 1, invlpgb_count_max); + invlpgb_flush_addr_nosync(addr, nr); + } + tlbsync(); + return true; +} + static void do_kernel_range_flush(void *info) { struct flush_tlb_info *f = info; @@ -1096,7 +1118,9 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false, TLB_GENERATION_INVALID); - if (info->end == TLB_FLUSH_ALL) + if (broadcast_kernel_range_flush(info)) + ; /* Fall through. */ + else if (info->end == TLB_FLUSH_ALL) on_each_cpu(do_flush_tlb_all, NULL, 1); else on_each_cpu(do_kernel_range_flush, info, 1);