[v5,00/12] AMD broadcast TLB invalidation

Message ID: 20250116023127.1531583-1-riel@surriel.com

Message

Rik van Riel Jan. 16, 2025, 2:30 a.m. UTC
Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.
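
For reference, the instructions are emitted as raw byte sequences so
older toolchains can still assemble them (see the v5 notes below). A
minimal sketch of the low-level wrappers, with the operand layout taken
from my reading of the AMD APM (rAX = address plus flag bits, ECX =
extra page count plus stride bit, EDX = PCID and ASID) and not
necessarily matching the patch code exactly:

/* INVLPGB: broadcast-invalidate the TLB entries described by the operands. */
static inline void __invlpgb(unsigned long asid, unsigned long pcid,
			     unsigned long addr, u16 extra_count,
			     bool pmd_stride, u8 flags)
{
	u32 edx = (pcid << 16) | asid;
	u32 ecx = ((u32)pmd_stride << 31) | extra_count;
	u64 rax = addr | flags;

	/* 0f 01 fe = INVLPGB */
	asm volatile(".byte 0x0f, 0x01, 0xfe"
		     : : "a" (rax), "c" (ecx), "d" (edx) : "memory");
}

/* TLBSYNC: wait for the INVLPGBs issued by this CPU to finish everywhere. */
static inline void __tlbsync(void)
{
	/* TLBSYNC must run on the CPU that issued the INVLPGBs. */
	cant_migrate();

	/* 0f 01 ff = TLBSYNC */
	asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
}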

Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
being gradually increased the more the PCID space gets exhausted.
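
To make that heuristic concrete, here is an illustrative sketch (not
the literal patch code; mm_active_cpus(), global_asids_available() and
MAX_GLOBAL_ASIDS are hypothetical stand-ins for the real bookkeeping):

/*
 * Decide whether a process should get a global ASID and have its TLB
 * flushes done with INVLPGB instead of IPIs.
 */
static bool mm_wants_broadcast_flush(struct mm_struct *mm)
{
	/* Broadcast invalidation only pays off for multi-CPU processes. */
	int threshold = 3;

	/*
	 * Raise the bar as free global ASIDs become scarce, so a very
	 * large system cannot exhaust the limited PCID space.
	 */
	if (global_asids_available() < MAX_GLOBAL_ASIDS / 2)
		threshold += MAX_GLOBAL_ASIDS / (global_asids_available() + 1);

	return mm_active_cpus(mm) >= threshold;
}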

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that, while
TLB invalidation went down from 40% of the total CPU time
to only around 4%, the contention simply moved to the LRU lock.

Fixing both at the same time roughly doubles the number of
iterations per second in this test.

Some numbers closer to real-world performance
can be found at Phoronix, thanks to Michael Larabel:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

The code is now in a state where I am not sure what else needs to
be done before it can be merged. If you can think of something,
please let me know ;)

v5:
 - use byte assembly for compatibility with older toolchains (Borislav, Michael)
 - ensure a panic on an invalid number of extra pages (Dave, Tom)
 - add cant_migrate() assertion to tlbsync (Jann)
 - a bunch more cleanups (Nadav)
 - key TCE enabling off X86_FEATURE_TCE (Andrew)
 - fix a race between reclaim and ASID transition (Jann)
v4:
 - Use only bitmaps to track free global ASIDs (Nadav)
 - Improved AMD initialization (Borislav & Tom)
 - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
 - Fixes for subtle race conditions (Jann)
v3:
 - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
 - More suggested cleanups and changelog fixes by Peter and Nadav
v2:
 - Apply suggestions by Peter and Borislav (thank you!)
 - Fix a bug in arch_tlbbatch_flush, where we need to both do the
   TLBSYNC and flush the CPUs that are in the cpumask (sketched below).
 - Some updates to comments and changelogs based on questions.
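
The arch_tlbbatch_flush fix mentioned under v2 has roughly the shape
below. This is a sketch against the current upstream helpers rather
than the literal patch, with X86_FEATURE_INVLPGB and __tlbsync() being
the names used here for the series' new feature bit and helper:
processes with a global ASID had their INVLPGBs queued while the pages
were being unmapped and only need the TLBSYNC here, while everything
else still gets the IPI-based flush via the cpumask.

void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	struct flush_tlb_info *info;
	int cpu = get_cpu();

	/* Wait for the already-issued broadcast invalidations to finish. */
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		__tlbsync();

	/*
	 * CPUs running processes without a global ASID were recorded in
	 * batch->cpumask and still need the conventional flush.
	 */
	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
				  TLB_GENERATION_INVALID);
	if (!cpumask_empty(&batch->cpumask))
		flush_tlb_multi(&batch->cpumask, info);

	cpumask_clear(&batch->cpumask);

	put_flush_tlb_info();
	put_cpu();
}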

Comments

Michael Kelley Jan. 16, 2025, 6:14 p.m. UTC | #1
From: riel@surriel.com <riel@surriel.com> Sent: Wednesday, January 15, 2025 6:30 PM
> 
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
> 
> This allows the kernel to invalidate TLB entries on remote CPUs without
> needing to send IPIs, without having to wait for remote CPUs to handle
> those interrupts, and with less interruption to what was running on
> those CPUs.
> 
> Because x86 PCID space is limited, and there are some very large
> systems out there, broadcast TLB invalidation is only used for
> processes that are active on 3 or more CPUs, with the threshold
> being gradually increased the more the PCID space gets exhausted.
> 
> Combined with the removal of unnecessary lru_add_drain calls
> (see https://lkml.org/lkml/2024/12/19/1388) this results in a
> nice performance boost for the will-it-scale tlb_flush2_threads
> test on an AMD Milan system with 36 cores:
> 
> - vanilla kernel:           527k loops/second
> - lru_add_drain removal:    731k loops/second
> - only INVLPGB:             527k loops/second
> - lru_add_drain + INVLPGB: 1157k loops/second
> 
> Profiling with only the INVLPGB changes showed that, while
> TLB invalidation went down from 40% of the total CPU time
> to only around 4%, the contention simply moved to the LRU lock.
> 
> Fixing both at the same time roughly doubles the number of
> iterations per second in this test.
> 
> Some numbers closer to real-world performance
> can be found at Phoronix, thanks to Michael Larabel:
> 
> https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
> 
> The code is now in a state where I am not sure what else needs to
> be done before it can be merged. If you can think of something,
> please let me know ;)

Rik --

We had an earlier thread about INVLPGB/TLBSYNC in a VM [1]. It
turns out that Hyper-V in the Azure public cloud enables
INVLPGB/TLBSYNC in Confidential VMs (CVMs, which conform to the
Linux concept of a CoCo VM) running on AMD processors using SEV-SNP.
The CPUID instruction in such a VM reports the enablement as
expected. The instructions are *not* enabled in general purpose VMs
running on the same AMD processors. The enablement is a natural
outgrowth of CoCo VMs wanting to avoid a dependency on
the untrusted hypervisor to perform TLB flushes. Of course, Linux hasn't
been updated to make use of the instructions in this scenario, and your
patch set doesn't use the instructions in all situations. So CoCo
VMs may still use the paravirtualization that makes hypercalls to do
TLB flushes. It's future work to *always* use INVLPGB (if available)
in a CoCo VM.
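
For anyone who wants to double-check what their VM reports, a small
user-space test along these lines works; it assumes INVLPGB/TLBSYNC
enumeration lives in CPUID leaf 0x80000008, EBX bit 3, with EDX[15:0]
giving the maximum number of extra pages per INVLPGB:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
		puts("CPUID leaf 0x80000008 not available");
		return 1;
	}

	printf("INVLPGB/TLBSYNC advertised: %s\n",
	       (ebx & (1u << 3)) ? "yes" : "no");
	printf("Max extra pages per INVLPGB: %u\n", edx & 0xffff);
	return 0;
}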

For a couple of days, I've been running your v4 patch set in an Azure
CVM, just to make sure nothing bad happens. From a basic testing
standpoint, no issues. As expected, INVLPGB is used in some cases,
and the existing paravirt hypercalls are used in other cases. But I have
not fully reviewed your code looking for potential VM-only issues.

All of this is to say "Don't exclude the VM scenario from your
thinking." The scenario exists in real life today. I don't have
any specific code changes in mind for the scenario.

Michael

[1] https://lore.kernel.org/lkml/00294e7e-dcd8-f940-372e-070b8d174582@amd.com/

Peter Zijlstra Jan. 16, 2025, 10:37 p.m. UTC | #2
On Thu, Jan 16, 2025 at 06:14:00PM +0000, Michael Kelley wrote:
> So CoCo
> VMs may still use the paravirtualization that makes hypercalls to do
> TLB flushes. It's future work to *always* use INVLPGB (if available)
> in a CoCo VM.

That would place a limit on the number of CPUs, to be no larger than the
number of available ASIDs.
Andrew Cooper Jan. 17, 2025, midnight UTC | #3
On 16/01/2025 10:37 pm, Peter Zijlstra wrote:
> On Thu, Jan 16, 2025 at 06:14:00PM +0000, Michael Kelley wrote:
>> So CoCo
>> VMs may still use the paravirtualization that makes hypercalls to do
>> TLB flushes. It's future work to *always* use INVLPGB (if available)
>> in a CoCo VM.
> That would place a limit on the number of CPUs, to be no larger than the
> number of available ASIDs.

Can you please be specific about whether you mean PCID (the x86
architectural thing commonly called ASID) or ASID (the thing named by
the AMD architecture)?

The INVLPGB instruction under virt can use PCIDs to its heart's
content, but ASIDs are rewritten behind the scenes because the VM does
not usually know the ASID the VMM assigned to it.

~Andrew