diff mbox series

[v11,06/12] x86/mm: use INVLPGB for kernel TLB flushes

Message ID 20250213161423.449435-7-riel@surriel.com (mailing list archive)
State New
Headers show
Series AMD broadcast TLB invalidation | expand

Commit Message

Rik van Riel Feb. 13, 2025, 4:13 p.m. UTC
Use broadcast TLB invalidation for kernel addresses when available.

Remove the need to send IPIs for kernel TLB flushes.

Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/mm/tlb.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

Comments

Dave Hansen Feb. 14, 2025, 6:35 p.m. UTC | #1
On 2/13/25 08:13, Rik van Riel wrote:
> -	if (info->end == TLB_FLUSH_ALL)
> +	if (broadcast_kernel_range_flush(info))
> +		; /* Fall through. */
> +	else if (info->end == TLB_FLUSH_ALL)
>  		on_each_cpu(do_flush_tlb_all, NULL, 1);
>  	else
>  		on_each_cpu(do_kernel_range_flush, info, 1);

We've got to find a better name for broadcast_kernel_range_flush().
Because IPIs are broadcast too. The naming makes it confusing. Why would
be broadcast, and then start trying IPIs that are also broadcast?

This hunk really the crux of the patch and we need to make sure the
semantics are crystal clear. What do folks think about something like this:

/*
 * Try to use a hardware assist for flushing the TLB.
 * This is expected to be cheaper than using an IPI
 * to do flushes.
 *
 * Returns True if the assisted method succeeded.
 */
static bool assisted_kernel_range_flush(struct flush_tlb_info *info)
{
	...
}

static do_ipi_tlb_flush(...info)
{
	if (info->end == TLB_FLUSH_ALL)
		on_each_cpu(do_flush_tlb_all, NULL, 1);
 	else
		on_each_cpu(do_kernel_range_flush, info, 1);
}

Then we end up with:

	/*
	 * Assisted flushes are assumed to be faster. Try
	 * them first and fall back to IPIs if not available.
	 */
	flush_success = assisted_kernel_range_flush(info);
	if (!flush_success)
		do_ipi_tlb_flush(info);

I think that's a *LOT* more clear:

 1. Try assisted flush
 2. If it fails fall back to an IPI-based flush
Peter Zijlstra Feb. 14, 2025, 7:40 p.m. UTC | #2
On Fri, Feb 14, 2025 at 10:35:40AM -0800, Dave Hansen wrote:
> On 2/13/25 08:13, Rik van Riel wrote:
> > -	if (info->end == TLB_FLUSH_ALL)
> > +	if (broadcast_kernel_range_flush(info))
> > +		; /* Fall through. */
> > +	else if (info->end == TLB_FLUSH_ALL)
> >  		on_each_cpu(do_flush_tlb_all, NULL, 1);
> >  	else
> >  		on_each_cpu(do_kernel_range_flush, info, 1);
> 
> We've got to find a better name for broadcast_kernel_range_flush().
> Because IPIs are broadcast too. The naming makes it confusing. Why would
> be broadcast, and then start trying IPIs that are also broadcast?

IIRC the more general name is indeed broadcast tlbi; as in other
architectures use this naming to mean this very thing too.

But yes, I see the confusion, but I don't think changing the naming
really helps a lot here :-/
Dave Hansen Feb. 14, 2025, 7:55 p.m. UTC | #3
On 2/14/25 11:40, Peter Zijlstra wrote:
> On Fri, Feb 14, 2025 at 10:35:40AM -0800, Dave Hansen wrote:
>> On 2/13/25 08:13, Rik van Riel wrote:
>>> -	if (info->end == TLB_FLUSH_ALL)
>>> +	if (broadcast_kernel_range_flush(info))
>>> +		; /* Fall through. */
>>> +	else if (info->end == TLB_FLUSH_ALL)
>>>  		on_each_cpu(do_flush_tlb_all, NULL, 1);
>>>  	else
>>>  		on_each_cpu(do_kernel_range_flush, info, 1);
>> We've got to find a better name for broadcast_kernel_range_flush().
>> Because IPIs are broadcast too. The naming makes it confusing. Why would
>> be broadcast, and then start trying IPIs that are also broadcast?
> IIRC the more general name is indeed broadcast tlbi; as in other
> architectures use this naming to mean this very thing too.
> 
> But yes, I see the confusion, but I don't think changing the naming
> really helps a lot here :-/

Fair enough. If we don't have a better name, we can at least do:

	if (new_bad_name()) {
		new_thing();
	} else {
		old_thing();
	}

My real heartburn is with:

	if (new_bad_name()) {
		new_thing();
	} else if (need_thing_1()) {
		old_thing1();
	} else {
		old_thing2();
	}

Where new and old are logically squished together.
Rik van Riel Feb. 15, 2025, 1:25 a.m. UTC | #4
On Fri, 2025-02-14 at 11:55 -0800, Dave Hansen wrote:
> 
> Fair enough. If we don't have a better name, we can at least do:
> 
> 	if (new_bad_name()) {
> 		new_thing();
> 	} else {
> 		old_thing();
> 	}
> 
> My real heartburn is with:
> 
> 	if (new_bad_name()) {
> 		new_thing();
> 	} else if (need_thing_1()) {
> 		old_thing1();
> 	} else {
> 		old_thing2();
> 	}
> 
> Where new and old are logically squished together.
> 
Do we want to group this code by history, or
by function?

I would argue that new_thing() and old_thing1()
are closer to each other functionally (they both
do remote TLB invalidation) than they are to 
old_thing2(), which does local-only invalidation.

I can organize the code however people want,
but I would like a second opinion on this idea :)
Yosry Ahmed Feb. 15, 2025, 2:08 a.m. UTC | #5
On Fri, Feb 14, 2025 at 08:25:51PM -0500, Rik van Riel wrote:
> On Fri, 2025-02-14 at 11:55 -0800, Dave Hansen wrote:
> > 
> > Fair enough. If we don't have a better name, we can at least do:
> > 
> > 	if (new_bad_name()) {
> > 		new_thing();
> > 	} else {
> > 		old_thing();
> > 	}
> > 
> > My real heartburn is with:
> > 
> > 	if (new_bad_name()) {
> > 		new_thing();
> > 	} else if (need_thing_1()) {
> > 		old_thing1();
> > 	} else {
> > 		old_thing2();
> > 	}
> > 
> > Where new and old are logically squished together.
> > 
> Do we want to group this code by history, or
> by function?
> 
> I would argue that new_thing() and old_thing1()
> are closer to each other functionally (they both
> do remote TLB invalidation) than they are to 
> old_thing2(), which does local-only invalidation.
> 
> I can organize the code however people want,
> but I would like a second opinion on this idea :)

IIUC the discussion is about:

	if (broadcast_kernel_range_flush(info))
       		; /* Fall through. */
	else if (info->end == TLB_FLUSH_ALL)
		on_each_cpu(do_flush_tlb_all, NULL, 1);
	else
		on_each_cpu(do_kernel_range_flush, info, 1);

In this case I agree with Dave. old_thing1() and old_thing2() are both
sending IPIs, the difference is that old_thing1() is doing a full flush
while old_thing2() is doing a range flush. Not sure why you mentioned
that old_thing2() does local invalidation.

broadcast_kernel_range_flush() also decides between full and range
flushes internally. So the main difference between 'new' and 'old' here
is using the broadcast flush vs the IPI flush.

So I think what Dave wants (and I agree) is:
	if (!broadcast_kernel_range_flush(info))
		ipi_kernel_range_flush(info)

Where ipi_kernel_range_flush() contains old_thing1() and oldthing2().

> -- 
> All Rights Reversed.
>
Rik van Riel Feb. 18, 2025, 6 p.m. UTC | #6
On Sat, 2025-02-15 at 02:08 +0000, Yosry Ahmed wrote:
> 
> So I think what Dave wants (and I agree) is:
> 	if (!broadcast_kernel_range_flush(info))
> 		ipi_kernel_range_flush(info)
> 
> Where ipi_kernel_range_flush() contains old_thing1() and oldthing2().

I like this idea. I changed the code to add that
for the next version! Thank you.
Dave Hansen Feb. 18, 2025, 10:27 p.m. UTC | #7
On 2/18/25 10:00, Rik van Riel wrote:
> On Sat, 2025-02-15 at 02:08 +0000, Yosry Ahmed wrote:
>> So I think what Dave wants (and I agree) is:
>> 	if (!broadcast_kernel_range_flush(info))
>> 		ipi_kernel_range_flush(info)
>>
>> Where ipi_kernel_range_flush() contains old_thing1() and oldthing2().

That's OK-ish. But it still smells of hacking in the new concept without
refactoring things properly.

Let's logically inline the code that we've got.  I think it actually
ends up looking something like this:

	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
		if (info->end == TLB_FLUSH_ALL) {
			invlpgb_flush_all();
		} else {
			for_each(addr)
				invlpgb_flush_addr_nosync(addr, nr);
		}
	} else {
		if (info->end == TLB_FLUSH_ALL)
 			on_each_cpu(do_flush_tlb_all, NULL, 1);
	 	else
 			on_each_cpu(do_kernel_range_flush, info, 1);
	}

Where we've got two inputs:

	1. INVLPGB support (or not)
	2. TLB_FLUSH_ALL (basically ranged or full flush)

So I think we should group by *one* of those. The above groups by
INVLPGB support and this groups by TLB_FLUSH_ALL:

	if (info->end == TLB_FLUSH_ALL) {
		if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
			invlpgb_flush_all();
		} else {
			on_each_cpu(do_flush_tlb_all, NULL, 1);
		}
	} else {
		if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
 			for_each(addr)
				invlpgb_flush_addr_nosync(addr, nr);
	 	else
 			on_each_cpu(do_kernel_range_flush, info, 1);
	}

So, if we create some helpers that give some consistent naming:

static tlb_flush_all_ipi(...)
{
	on_each_cpu(do_flush_tlb_all, NULL, 1);
}

static tlb_flush_all(...)
{
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		invlpgb_flush_all(...);
	else
		tlb_flush_all_ipi(...);
}

and then also create the ranged equivalents (which internally have the
same cpu_feature_enabled() check):

	tlb_flush_range_ipi(...)
	invlpgb_flush_range(...)

Then we can have the top-level code be:

	if (info->end == TLB_FLUSH_ALL)
		tlb_flush_all(info);
	else
		tlb_flush_range(info);

That actually looks way nicer than what we have today. For bonus points,
if a third way of flushing the TLB showed up, it would slot right in:

 static tlb_flush_all(...)
 {
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		invlpgb_flush_all(...);
+	else if cpu_feature_enabled(X86_FEATURE_RAR))
+		rar_flush_all(...);
	else
		tlb_flush_all_ipi(...);
 }

That's *exactly* the way we want the code to read. At the higher level,
it's deciding based on the generic thing that *everybody* cares about:
ranged or full flush. Then, at the lower level, it's deciding how to
implement that high-level flush concept.
Yosry Ahmed Feb. 19, 2025, 1:46 a.m. UTC | #8
On Tue, Feb 18, 2025 at 02:27:31PM -0800, Dave Hansen wrote:
> On 2/18/25 10:00, Rik van Riel wrote:
> > On Sat, 2025-02-15 at 02:08 +0000, Yosry Ahmed wrote:
> >> So I think what Dave wants (and I agree) is:
> >> 	if (!broadcast_kernel_range_flush(info))
> >> 		ipi_kernel_range_flush(info)
> >>
> >> Where ipi_kernel_range_flush() contains old_thing1() and oldthing2().
> 
> That's OK-ish. But it still smells of hacking in the new concept without
> refactoring things properly.
> 
> Let's logically inline the code that we've got.  I think it actually
> ends up looking something like this:
> 
> 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> 		if (info->end == TLB_FLUSH_ALL) {
> 			invlpgb_flush_all();
> 		} else {
> 			for_each(addr)
> 				invlpgb_flush_addr_nosync(addr, nr);
> 		}
> 	} else {
> 		if (info->end == TLB_FLUSH_ALL)
>  			on_each_cpu(do_flush_tlb_all, NULL, 1);
> 	 	else
>  			on_each_cpu(do_kernel_range_flush, info, 1);
> 	}
> 
> Where we've got two inputs:
> 
> 	1. INVLPGB support (or not)
> 	2. TLB_FLUSH_ALL (basically ranged or full flush)
> 
> So I think we should group by *one* of those. The above groups by
> INVLPGB support and this groups by TLB_FLUSH_ALL:
> 
> 	if (info->end == TLB_FLUSH_ALL) {
> 		if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> 			invlpgb_flush_all();
> 		} else {
> 			on_each_cpu(do_flush_tlb_all, NULL, 1);
> 		}
> 	} else {
> 		if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
>  			for_each(addr)
> 				invlpgb_flush_addr_nosync(addr, nr);
> 	 	else
>  			on_each_cpu(do_kernel_range_flush, info, 1);
> 	}

Yeah an if/else structure is better than using the invlpgb helper and
falling back to IPIs if it returns false, and I also prefer grouping by
the flush scope (range/flush).

Thanks for the illustrations :)

> 
> So, if we create some helpers that give some consistent naming:
> 
> static tlb_flush_all_ipi(...)
> {
> 	on_each_cpu(do_flush_tlb_all, NULL, 1);
> }
> 
> static tlb_flush_all(...)
> {
> 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> 		invlpgb_flush_all(...);
> 	else
> 		tlb_flush_all_ipi(...);
> }
> 
> and then also create the ranged equivalents (which internally have the
> same cpu_feature_enabled() check):
> 
> 	tlb_flush_range_ipi(...)
> 	invlpgb_flush_range(...)
> 
> Then we can have the top-level code be:
> 
> 	if (info->end == TLB_FLUSH_ALL)
> 		tlb_flush_all(info);
> 	else
> 		tlb_flush_range(info);
> 
> That actually looks way nicer than what we have today. For bonus points,
> if a third way of flushing the TLB showed up, it would slot right in:
> 
>  static tlb_flush_all(...)
>  {
> 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> 		invlpgb_flush_all(...);
> +	else if cpu_feature_enabled(X86_FEATURE_RAR))
> +		rar_flush_all(...);
> 	else
> 		tlb_flush_all_ipi(...);
>  }
> 
> That's *exactly* the way we want the code to read. At the higher level,
> it's deciding based on the generic thing that *everybody* cares about:
> ranged or full flush. Then, at the lower level, it's deciding how to
> implement that high-level flush concept.
>
diff mbox series

Patch

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 924ac2263725..ce9df82754ce 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1077,6 +1077,28 @@  void flush_tlb_all(void)
 	on_each_cpu(do_flush_tlb_all, NULL, 1);
 }
 
+static bool broadcast_kernel_range_flush(struct flush_tlb_info *info)
+{
+	unsigned long addr;
+	unsigned long nr;
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_all();
+		return true;
+	}
+
+	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
+		nr = (info->end - addr) >> PAGE_SHIFT;
+		nr = clamp_val(nr, 1, invlpgb_count_max);
+		invlpgb_flush_addr_nosync(addr, nr);
+	}
+	tlbsync();
+	return true;
+}
+
 static void do_kernel_range_flush(void *info)
 {
 	struct flush_tlb_info *f = info;
@@ -1096,7 +1118,9 @@  void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
 				  TLB_GENERATION_INVALID);
 
-	if (info->end == TLB_FLUSH_ALL)
+	if (broadcast_kernel_range_flush(info))
+		; /* Fall through. */
+	else if (info->end == TLB_FLUSH_ALL)
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
 	else
 		on_each_cpu(do_kernel_range_flush, info, 1);