[v11,06/12] x86/mm: use INVLPGB for kernel TLB flushes

Message ID	20250213161423.449435-7-riel@surriel.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Rik van Riel <riel@surriel.com> To: x86@kernel.org Cc: linux-kernel@vger.kernel.org, bp@alien8.de, peterz@infradead.org, dave.hansen@linux.intel.com, zhengqi.arch@bytedance.com, nadav.amit@gmail.com, thomas.lendacky@amd.com, kernel-team@meta.com, linux-mm@kvack.org, akpm@linux-foundation.org, jackmanb@google.com, jannh@google.com, mhklinux@outlook.com, andrew.cooper3@citrix.com, Rik van Riel <riel@surriel.com>, Manali Shukla <Manali.Shukla@amd.com> Subject: [PATCH v11 06/12] x86/mm: use INVLPGB for kernel TLB flushes Date: Thu, 13 Feb 2025 11:13:57 -0500 Message-ID: <20250213161423.449435-7-riel@surriel.com> In-Reply-To: <20250213161423.449435-1-riel@surriel.com> References: <20250213161423.449435-1-riel@surriel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	AMD broadcast TLB invalidation \| expand [v11,00/12] AMD broadcast TLB invalidation [v11,01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional [v11,02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call [v11,03/12] x86/mm: consolidate full flush threshold decision [v11,04/12] x86/mm: get INVLPGB count max from CPUID [v11,05/12] x86/mm: add INVLPGB support code [v11,06/12] x86/mm: use INVLPGB for kernel TLB flushes [v11,07/12] x86/mm: use INVLPGB in flush_tlb_all [v11,08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing [v11,09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes [v11,10/12] x86/mm: do targeted broadcast flushing from tlbbatch code [v11,11/12] x86/mm: enable AMD translation cache extensions [v11,12/12] x86/mm: only invalidate final translations with INVLPGB

Rik van Riel Feb. 13, 2025, 4:13 p.m. UTC

Use broadcast TLB invalidation for kernel addresses when available.

Remove the need to send IPIs for kernel TLB flushes.

Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/mm/tlb.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

Dave Hansen Feb. 14, 2025, 6:35 p.m. UTC | #1

On 2/13/25 08:13, Rik van Riel wrote:
> -	if (info->end == TLB_FLUSH_ALL)
> +	if (broadcast_kernel_range_flush(info))
> +		; /* Fall through. */
> +	else if (info->end == TLB_FLUSH_ALL)
>  		on_each_cpu(do_flush_tlb_all, NULL, 1);
>  	else
>  		on_each_cpu(do_kernel_range_flush, info, 1);

We've got to find a better name for broadcast_kernel_range_flush().
Because IPIs are broadcast too. The naming makes it confusing. Why would
be broadcast, and then start trying IPIs that are also broadcast?

This hunk really the crux of the patch and we need to make sure the
semantics are crystal clear. What do folks think about something like this:

/*
 * Try to use a hardware assist for flushing the TLB.
 * This is expected to be cheaper than using an IPI
 * to do flushes.
 *
 * Returns True if the assisted method succeeded.
 */
static bool assisted_kernel_range_flush(struct flush_tlb_info *info)
{
	...
}

static do_ipi_tlb_flush(...info)
{
	if (info->end == TLB_FLUSH_ALL)
		on_each_cpu(do_flush_tlb_all, NULL, 1);
 	else
		on_each_cpu(do_kernel_range_flush, info, 1);
}

Then we end up with:

	/*
	 * Assisted flushes are assumed to be faster. Try
	 * them first and fall back to IPIs if not available.
	 */
	flush_success = assisted_kernel_range_flush(info);
	if (!flush_success)
		do_ipi_tlb_flush(info);

I think that's a *LOT* more clear:

 1. Try assisted flush
 2. If it fails fall back to an IPI-based flush

Peter Zijlstra Feb. 14, 2025, 7:40 p.m. UTC | #2

On Fri, Feb 14, 2025 at 10:35:40AM -0800, Dave Hansen wrote:
> On 2/13/25 08:13, Rik van Riel wrote:
> > -	if (info->end == TLB_FLUSH_ALL)
> > +	if (broadcast_kernel_range_flush(info))
> > +		; /* Fall through. */
> > +	else if (info->end == TLB_FLUSH_ALL)
> >  		on_each_cpu(do_flush_tlb_all, NULL, 1);
> >  	else
> >  		on_each_cpu(do_kernel_range_flush, info, 1);
> 
> We've got to find a better name for broadcast_kernel_range_flush().
> Because IPIs are broadcast too. The naming makes it confusing. Why would
> be broadcast, and then start trying IPIs that are also broadcast?

IIRC the more general name is indeed broadcast tlbi; as in other
architectures use this naming to mean this very thing too.

But yes, I see the confusion, but I don't think changing the naming
really helps a lot here :-/

Dave Hansen Feb. 14, 2025, 7:55 p.m. UTC | #3

On 2/14/25 11:40, Peter Zijlstra wrote:
> On Fri, Feb 14, 2025 at 10:35:40AM -0800, Dave Hansen wrote:
>> On 2/13/25 08:13, Rik van Riel wrote:
>>> -	if (info->end == TLB_FLUSH_ALL)
>>> +	if (broadcast_kernel_range_flush(info))
>>> +		; /* Fall through. */
>>> +	else if (info->end == TLB_FLUSH_ALL)
>>>  		on_each_cpu(do_flush_tlb_all, NULL, 1);
>>>  	else
>>>  		on_each_cpu(do_kernel_range_flush, info, 1);
>> We've got to find a better name for broadcast_kernel_range_flush().
>> Because IPIs are broadcast too. The naming makes it confusing. Why would
>> be broadcast, and then start trying IPIs that are also broadcast?
> IIRC the more general name is indeed broadcast tlbi; as in other
> architectures use this naming to mean this very thing too.
> 
> But yes, I see the confusion, but I don't think changing the naming
> really helps a lot here :-/

Fair enough. If we don't have a better name, we can at least do:

	if (new_bad_name()) {
		new_thing();
	} else {
		old_thing();
	}

My real heartburn is with:

	if (new_bad_name()) {
		new_thing();
	} else if (need_thing_1()) {
		old_thing1();
	} else {
		old_thing2();
	}

Where new and old are logically squished together.

Rik van Riel Feb. 15, 2025, 1:25 a.m. UTC | #4

On Fri, 2025-02-14 at 11:55 -0800, Dave Hansen wrote:
> 
> Fair enough. If we don't have a better name, we can at least do:
> 
> 	if (new_bad_name()) {
> 		new_thing();
> 	} else {
> 		old_thing();
> 	}
> 
> My real heartburn is with:
> 
> 	if (new_bad_name()) {
> 		new_thing();
> 	} else if (need_thing_1()) {
> 		old_thing1();
> 	} else {
> 		old_thing2();
> 	}
> 
> Where new and old are logically squished together.
> 
Do we want to group this code by history, or
by function?

I would argue that new_thing() and old_thing1()
are closer to each other functionally (they both
do remote TLB invalidation) than they are to 
old_thing2(), which does local-only invalidation.

I can organize the code however people want,
but I would like a second opinion on this idea :)

Yosry Ahmed Feb. 15, 2025, 2:08 a.m. UTC | #5

On Fri, Feb 14, 2025 at 08:25:51PM -0500, Rik van Riel wrote:
> On Fri, 2025-02-14 at 11:55 -0800, Dave Hansen wrote:
> > 
> > Fair enough. If we don't have a better name, we can at least do:
> > 
> > 	if (new_bad_name()) {
> > 		new_thing();
> > 	} else {
> > 		old_thing();
> > 	}
> > 
> > My real heartburn is with:
> > 
> > 	if (new_bad_name()) {
> > 		new_thing();
> > 	} else if (need_thing_1()) {
> > 		old_thing1();
> > 	} else {
> > 		old_thing2();
> > 	}
> > 
> > Where new and old are logically squished together.
> > 
> Do we want to group this code by history, or
> by function?
> 
> I would argue that new_thing() and old_thing1()
> are closer to each other functionally (they both
> do remote TLB invalidation) than they are to 
> old_thing2(), which does local-only invalidation.
> 
> I can organize the code however people want,
> but I would like a second opinion on this idea :)

IIUC the discussion is about:

	if (broadcast_kernel_range_flush(info))
       		; /* Fall through. */
	else if (info->end == TLB_FLUSH_ALL)
		on_each_cpu(do_flush_tlb_all, NULL, 1);
	else
		on_each_cpu(do_kernel_range_flush, info, 1);

In this case I agree with Dave. old_thing1() and old_thing2() are both
sending IPIs, the difference is that old_thing1() is doing a full flush
while old_thing2() is doing a range flush. Not sure why you mentioned
that old_thing2() does local invalidation.

broadcast_kernel_range_flush() also decides between full and range
flushes internally. So the main difference between 'new' and 'old' here
is using the broadcast flush vs the IPI flush.

So I think what Dave wants (and I agree) is:
	if (!broadcast_kernel_range_flush(info))
		ipi_kernel_range_flush(info)

Where ipi_kernel_range_flush() contains old_thing1() and oldthing2().

> -- 
> All Rights Reversed.
>

Rik van Riel Feb. 18, 2025, 6 p.m. UTC | #6

On Sat, 2025-02-15 at 02:08 +0000, Yosry Ahmed wrote:
> 
> So I think what Dave wants (and I agree) is:
> 	if (!broadcast_kernel_range_flush(info))
> 		ipi_kernel_range_flush(info)
> 
> Where ipi_kernel_range_flush() contains old_thing1() and oldthing2().

I like this idea. I changed the code to add that
for the next version! Thank you.

Dave Hansen Feb. 18, 2025, 10:27 p.m. UTC | #7

On 2/18/25 10:00, Rik van Riel wrote:
> On Sat, 2025-02-15 at 02:08 +0000, Yosry Ahmed wrote:
>> So I think what Dave wants (and I agree) is:
>> 	if (!broadcast_kernel_range_flush(info))
>> 		ipi_kernel_range_flush(info)
>>
>> Where ipi_kernel_range_flush() contains old_thing1() and oldthing2().

That's OK-ish. But it still smells of hacking in the new concept without
refactoring things properly.

Let's logically inline the code that we've got.  I think it actually
ends up looking something like this:

	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
		if (info->end == TLB_FLUSH_ALL) {
			invlpgb_flush_all();
		} else {
			for_each(addr)
				invlpgb_flush_addr_nosync(addr, nr);
		}
	} else {
		if (info->end == TLB_FLUSH_ALL)
 			on_each_cpu(do_flush_tlb_all, NULL, 1);
	 	else
 			on_each_cpu(do_kernel_range_flush, info, 1);
	}

Where we've got two inputs:

	1. INVLPGB support (or not)
	2. TLB_FLUSH_ALL (basically ranged or full flush)

So I think we should group by *one* of those. The above groups by
INVLPGB support and this groups by TLB_FLUSH_ALL:

	if (info->end == TLB_FLUSH_ALL) {
		if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
			invlpgb_flush_all();
		} else {
			on_each_cpu(do_flush_tlb_all, NULL, 1);
		}
	} else {
		if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
 			for_each(addr)
				invlpgb_flush_addr_nosync(addr, nr);
	 	else
 			on_each_cpu(do_kernel_range_flush, info, 1);
	}

So, if we create some helpers that give some consistent naming:

static tlb_flush_all_ipi(...)
{
	on_each_cpu(do_flush_tlb_all, NULL, 1);
}

static tlb_flush_all(...)
{
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		invlpgb_flush_all(...);
	else
		tlb_flush_all_ipi(...);
}

and then also create the ranged equivalents (which internally have the
same cpu_feature_enabled() check):

	tlb_flush_range_ipi(...)
	invlpgb_flush_range(...)

Then we can have the top-level code be:

	if (info->end == TLB_FLUSH_ALL)
		tlb_flush_all(info);
	else
		tlb_flush_range(info);

That actually looks way nicer than what we have today. For bonus points,
if a third way of flushing the TLB showed up, it would slot right in:

 static tlb_flush_all(...)
 {
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		invlpgb_flush_all(...);
+	else if cpu_feature_enabled(X86_FEATURE_RAR))
+		rar_flush_all(...);
	else
		tlb_flush_all_ipi(...);
 }

That's *exactly* the way we want the code to read. At the higher level,
it's deciding based on the generic thing that *everybody* cares about:
ranged or full flush. Then, at the lower level, it's deciding how to
implement that high-level flush concept.

Yosry Ahmed Feb. 19, 2025, 1:46 a.m. UTC | #8

On Tue, Feb 18, 2025 at 02:27:31PM -0800, Dave Hansen wrote:
> On 2/18/25 10:00, Rik van Riel wrote:
> > On Sat, 2025-02-15 at 02:08 +0000, Yosry Ahmed wrote:
> >> So I think what Dave wants (and I agree) is:
> >> 	if (!broadcast_kernel_range_flush(info))
> >> 		ipi_kernel_range_flush(info)
> >>
> >> Where ipi_kernel_range_flush() contains old_thing1() and oldthing2().
> 
> That's OK-ish. But it still smells of hacking in the new concept without
> refactoring things properly.
> 
> Let's logically inline the code that we've got.  I think it actually
> ends up looking something like this:
> 
> 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> 		if (info->end == TLB_FLUSH_ALL) {
> 			invlpgb_flush_all();
> 		} else {
> 			for_each(addr)
> 				invlpgb_flush_addr_nosync(addr, nr);
> 		}
> 	} else {
> 		if (info->end == TLB_FLUSH_ALL)
>  			on_each_cpu(do_flush_tlb_all, NULL, 1);
> 	 	else
>  			on_each_cpu(do_kernel_range_flush, info, 1);
> 	}
> 
> Where we've got two inputs:
> 
> 	1. INVLPGB support (or not)
> 	2. TLB_FLUSH_ALL (basically ranged or full flush)
> 
> So I think we should group by *one* of those. The above groups by
> INVLPGB support and this groups by TLB_FLUSH_ALL:
> 
> 	if (info->end == TLB_FLUSH_ALL) {
> 		if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> 			invlpgb_flush_all();
> 		} else {
> 			on_each_cpu(do_flush_tlb_all, NULL, 1);
> 		}
> 	} else {
> 		if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
>  			for_each(addr)
> 				invlpgb_flush_addr_nosync(addr, nr);
> 	 	else
>  			on_each_cpu(do_kernel_range_flush, info, 1);
> 	}

Yeah an if/else structure is better than using the invlpgb helper and
falling back to IPIs if it returns false, and I also prefer grouping by
the flush scope (range/flush).

Thanks for the illustrations :)

> 
> So, if we create some helpers that give some consistent naming:
> 
> static tlb_flush_all_ipi(...)
> {
> 	on_each_cpu(do_flush_tlb_all, NULL, 1);
> }
> 
> static tlb_flush_all(...)
> {
> 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> 		invlpgb_flush_all(...);
> 	else
> 		tlb_flush_all_ipi(...);
> }
> 
> and then also create the ranged equivalents (which internally have the
> same cpu_feature_enabled() check):
> 
> 	tlb_flush_range_ipi(...)
> 	invlpgb_flush_range(...)
> 
> Then we can have the top-level code be:
> 
> 	if (info->end == TLB_FLUSH_ALL)
> 		tlb_flush_all(info);
> 	else
> 		tlb_flush_range(info);
> 
> That actually looks way nicer than what we have today. For bonus points,
> if a third way of flushing the TLB showed up, it would slot right in:
> 
>  static tlb_flush_all(...)
>  {
> 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> 		invlpgb_flush_all(...);
> +	else if cpu_feature_enabled(X86_FEATURE_RAR))
> +		rar_flush_all(...);
> 	else
> 		tlb_flush_all_ipi(...);
>  }
> 
> That's *exactly* the way we want the code to read. At the higher level,
> it's deciding based on the generic thing that *everybody* cares about:
> ranged or full flush. Then, at the lower level, it's deciding how to
> implement that high-level flush concept.
>

[v11,06/12] x86/mm: use INVLPGB for kernel TLB flushes

Commit Message

Comments

Patch