[v10,00/12] AMD broadcast TLB invalidation

Message-ID: <20250211210823.242681-1-riel@surriel.com>
From: Rik van Riel <riel@surriel.com>
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, bp@alien8.de, peterz@infradead.org,
    dave.hansen@linux.intel.com, zhengqi.arch@bytedance.com,
    nadav.amit@gmail.com, thomas.lendacky@amd.com, kernel-team@meta.com,
    linux-mm@kvack.org, akpm@linux-foundation.org, jackmanb@google.com,
    jannh@google.com, mhklinux@outlook.com, andrew.cooper3@citrix.com
Subject: [PATCH v10 00/12] AMD broadcast TLB invalidation
Date: Tue, 11 Feb 2025 16:07:55 -0500

Series: AMD broadcast TLB invalidation
 [v10,00/12] AMD broadcast TLB invalidation
 [v10,01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
 [v10,02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call
 [v10,03/12] x86/mm: consolidate full flush threshold decision
 [v10,04/12] x86/mm: get INVLPGB count max from CPUID
 [v10,05/12] x86/mm: add INVLPGB support code
 [v10,06/12] x86/mm: use INVLPGB for kernel TLB flushes
 [v10,07/12] x86/mm: use INVLPGB in flush_tlb_all
 [v10,08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
 [v10,09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
 [v10,10/12] x86/mm: do targeted broadcast flushing from tlbbatch code
 [v10,11/12] x86/mm: enable AMD translation cache extensions
 [v10,12/12] x86/mm: only invalidate final translations with INVLPGB

Rik van Riel Feb. 11, 2025, 9:07 p.m. UTC

Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.

Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
being gradually increased the more the PCID space gets exhausted.
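
As a rough illustration of the policy described above, here is a toy
model in Python. Only the starting threshold of 3 CPUs comes from this
cover letter. The linear ramp, the function names, and the parameters
are assumptions made for illustration; they are not taken from the
patches and should not be read as the kernel's actual heuristic.

```python
MIN_ACTIVE_CPUS = 3  # from the cover letter: "active on 3 or more CPUs"


def broadcast_threshold(free_ids: int, total_ids: int, max_cpus: int) -> int:
    """Hypothetical ramp: require more active CPUs as free IDs run out."""
    if total_ids <= 0 or not 0 <= free_ids <= total_ids:
        raise ValueError("free_ids must be within 0..total_ids")
    used_ids = total_ids - free_ids
    headroom = max(0, max_cpus - MIN_ACTIVE_CPUS)
    # Integer math: the threshold rises linearly from 3 towards max_cpus
    # as the (assumed) pool of broadcast-capable IDs fills up.
    return MIN_ACTIVE_CPUS + (used_ids * headroom) // total_ids


def should_use_broadcast(active_cpus: int, free_ids: int,
                         total_ids: int, max_cpus: int) -> bool:
    """Decide whether a process gets broadcast flushing in this toy model."""
    if free_ids == 0:
        return False  # nothing left to hand out
    return active_cpus >= broadcast_threshold(free_ids, total_ids, max_cpus)
```

With an empty pool the model admits any process active on 3 CPUs; as
the pool fills, only processes spread across more CPUs qualify, which
keeps the remaining IDs for the workloads that benefit most.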

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while
TLB invalidation went down from 40% of the total CPU
time to only around 4% of CPU time, the contention
simply moved to the LRU lock.

Fixing both at the same time roughly doubles the
number of iterations per second for this case.
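
The relative gains can be checked with a few lines of Python. The
throughput figures are copied from the table above; the dictionary and
function names are only illustrative.

```python
# Throughput in thousands of loops/second, copied from the cover letter.
results_kloops = {
    "vanilla kernel": 527,
    "lru_add_drain removal": 731,
    "only INVLPGB": 527,
    "lru_add_drain + INVLPGB": 1157,
}


def speedups(results, baseline="vanilla kernel"):
    """Return each configuration's throughput relative to the baseline."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


for name, ratio in speedups(results_kloops).items():
    print(f"{name:<26} {ratio:.2f}x")
```

This gives about 1.39x for the lru_add_drain removal alone, 1.00x for
INVLPGB alone, and about 2.20x for the two combined.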

Some numbers closer to real world performance
can be found at Phoronix, thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

My current plan is to implement support for Intel's RAR
(Remote Action Request) TLB flushing in a follow-up series,
after this thing has been merged into -tip. Making things
any larger would just be unwieldy for reviewers.

v10:
 - simplify partial pages with min(nr, 1) in the invlpgb loop (Peter)
 - document x86 paravirt, AMD invlpgb, and ARM64 flush without IPI (Brendan)
 - remove IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) (Brendan)
 - various cleanups (Brendan)
v9:
 - print warning when start or end address was rounded (Peter)
 - in the reclaim code, tlbsync at context switch time (Peter)
 - fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
v8:
 - round start & end to handle non-page-aligned callers (Steven & Jan)
 - fix up changelog & add tested-by tags (Manali)
v7:
 - a few small code cleanups (Nadav)
 - fix spurious VM_WARN_ON_ONCE in mm_global_asid
 - code simplifications & better barriers (Peter & Dave)
v6:
 - fix info->end check in flush_tlb_kernel_range (Michael)
 - disable broadcast TLB flushing on 32 bit x86
v5:
 - use byte assembly for compatibility with older toolchains (Borislav, Michael)
 - ensure a panic on an invalid number of extra pages (Dave, Tom)
 - add cant_migrate() assertion to tlbsync (Jann)
 - a bunch more cleanups (Nadav)
 - key TCE enabling off X86_FEATURE_TCE (Andrew)
 - fix a race between reclaim and ASID transition (Jann)
v4:
 - Use only bitmaps to track free global ASIDs (Nadav)
 - Improved AMD initialization (Borislav & Tom)
 - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
 - Fixes for subtle race conditions (Jann)
v3:
 - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
 - More suggested cleanups and changelog fixes by Peter and Nadav
v2:
 - Apply suggestions by Peter and Borislav (thank you!)
 - Fix bug in arch_tlbbatch_flush, where we need to do both
   the TLBSYNC, and flush the CPUs that are in the cpumask.
 - Some updates to comments and changelogs based on questions.

Peter Zijlstra Feb. 12, 2025, 10:23 a.m. UTC | #1

On Tue, Feb 11, 2025 at 04:07:55PM -0500, Rik van Riel wrote:
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

What tree are these patches against? I can't seem to cleanly apply them
to anything much :/

Brendan Jackman Feb. 12, 2025, 10:44 a.m. UTC | #2

They apply to 60675d4ca1ef0 ("Merge branch 'linus' into x86/mm, to
pick up fixes").

Rik, can I refer you to the BASE TREE INFORMATION section of man
git-format-patch. I haven't used that feature lately (b4 takes care of
this) but it looks like --base=auto will add the necessary info, or
IIRC there's a way to make that behaviour the default.

On Wed, 12 Feb 2025 at 11:23, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Feb 11, 2025 at 04:07:55PM -0500, Rik van Riel wrote:
> > Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
>
> What tree are these patches against? I can't seem to cleanly apply them
> to anything much :/

Peter Zijlstra Feb. 12, 2025, 10:59 a.m. UTC | #3

On Wed, Feb 12, 2025 at 11:44:01AM +0100, Brendan Jackman wrote:
> They apply to 60675d4ca1ef0 ("Merge branch 'linus' into x86/mm, to
> pick up fixes").

Why some random commit? Shouldn't this be against a sensible branch or
something, like perhaps tip/x86/mm ?

Rik van Riel Feb. 12, 2025, 3:39 p.m. UTC | #4

On Wed, 2025-02-12 at 11:59 +0100, Peter Zijlstra wrote:
> On Wed, Feb 12, 2025 at 11:44:01AM +0100, Brendan Jackman wrote:
> > They apply to 60675d4ca1ef0 ("Merge branch 'linus' into x86/mm, to
> > pick up fixes").
> 
> Why some random commit? Shouldn't this be against a sensible branch or
> something, like perhaps tip/x86/mm ?
> 
Let me rebase these against current tip/x86/mm
and apply the latest suggested cleanups!

Sean Christopherson Feb. 12, 2025, 4:30 p.m. UTC | #5

On Wed, Feb 12, 2025, Brendan Jackman wrote:
> They apply to 60675d4ca1ef0 ("Merge branch 'linus' into x86/mm, to
> pick up fixes").
> 
> Rik, can I refer you to the BASE TREE INFORMATION section of man
> git-format-patch. I haven't used that feature lately (b4 takes care of
> this) but it looks like --base=auto will add the necessary info, or
> IIRC there's a way to make that behaviour the default.

IMO, --base=auto is too easy to unintentionally misuse, e.g. it will do the wrong
thing if your upstream branch is set to a personal repository.  --base itself is
fantastic though.  I personally do:

  git format-patch --base=HEAD~$nr <bunch of other stuff> -$nr

where $nr is the number of patches in the series.  I.e. advertise the base as
whatever the series of patches is based on, not what the branch is based on.  The
only time it doesn't work is if your local branch has a commit that is not in the
series, and is not publicly visible.  E.g. if the series depends on another in-flight
series that you've applied locally.  But in that case, you should be explaining
what's up in your cover letter no matter what.

Michael Kelley Feb. 12, 2025, 8:35 p.m. UTC | #6

From: riel@surriel.com <riel@surriel.com> Sent: Tuesday, February 11, 2025 1:08 PM
> 
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
> 
> [...]
> 

Tested this series in an Azure Confidential VM based on SEV-SNP,
which is running on Hyper-V and exposes INVLPGB in the guest VM.
I applied the patches to 6.13.0 with one minor fixup, but did not
include the patch to remove unnecessary lru_add_drain calls.
I also added some custom telemetry to see when INVLPGB is
being used vs. Hyper-V's paravirt hypercalls for TLB flushing.

I did not see any problems. The custom telemetry looked about
as I expected, showing a mix of INVLPGB and the PV hypercalls.
So for the series:

Tested-by: Michael Kelley <mhklinux@outlook.com>

Borislav Petkov Feb. 13, 2025, 1:03 p.m. UTC | #7

On Wed, Feb 12, 2025 at 10:39:30AM -0500, Rik van Riel wrote:
> Let me rebase these against current tip/x86/mm
> and apply the latest suggested cleanups!

Please use tip/master - this is and has always been the branch tip patches
should be applied on.

Thx.
