[v3,00/12] AMD broadcast TLB invalidation

Message ID 20241230175550.4046587-1-riel@surriel.com
Message

Rik van Riel Dec. 30, 2024, 5:53 p.m. UTC
Subject: [RFC PATCH 00/10] AMD broadcast TLB invalidation

Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.
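
For those not familiar with the instructions: INVLPGB posts a broadcast
invalidation described by its register operands, and TLBSYNC waits until
all INVLPGBs previously issued by this CPU have been processed. A minimal
sketch of wrappers, assuming the opcodes and operand layout documented in
the AMD APM (the helper names are illustrative, not necessarily what the
series uses):

#include <linux/types.h>

/*
 * Hypothetical helpers. Per the AMD APM, rAX carries the virtual
 * address plus flag bits, ECX the extra-page count and stride, and
 * EDX the PCID and ASID. The .byte encodings avoid needing assembler
 * support for the new mnemonics.
 */
static inline void invlpgb(u64 rax, u32 ecx, u32 edx)
{
	/* INVLPGB: broadcast the invalidation described by rax/ecx/edx. */
	asm volatile(".byte 0x0f, 0x01, 0xfe" : : "a" (rax), "c" (ecx), "d" (edx));
}

static inline void tlbsync(void)
{
	/* TLBSYNC: wait for this CPU's earlier INVLPGBs to be processed. */
	asm volatile(".byte 0x0f, 0x01, 0xff" : : : "memory");
}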

Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
being gradually increased the more the PCID space gets exhausted.
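
As a rough illustration of that policy (the helper name and the exact
scaling are assumptions for the sake of example, not the series' code):

#include <linux/cpumask.h>
#include <linux/mm_types.h>

/*
 * Illustrative sketch only: use broadcast invalidation once a process
 * is active on enough CPUs that IPI-based flushing gets expensive, and
 * raise that bar as the shared ASID space fills up.
 */
static bool mm_wants_broadcast_flush(struct mm_struct *mm,
				     int global_asids_used,
				     int global_asids_max)
{
	/* Base threshold of 3 CPUs, raised as the global ASID space is consumed. */
	int threshold = 3 + (4 * global_asids_used) / global_asids_max;

	return cpumask_weight(mm_cpumask(mm)) >= threshold;
}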

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388), this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that,
while TLB invalidation went down from 40% of total
CPU time to only around 4%, the contention simply
moved to the LRU lock.

Fixing both at the same time roughly doubles the
number of iterations per second for this case.

v3:
 - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
 - More suggested cleanups and changelog fixes by Peter and Nadav
v2:
 - Apply suggestions by Peter and Borislav (thank you!)
 - Fix bug in arch_tlbbatch_flush, where we need to do both
   the TLBSYNC, and flush the CPUs that are in the cpumask.
 - Some updates to comments and changelogs based on questions.

Comments

Dave Hansen Jan. 6, 2025, 7:03 p.m. UTC | #1
A couple of high level things we need to address:

First, I'm OK calling this approach "broadcast TLB invalidation". But I
don't think the ASIDs should be called "broadcast ASIDs". I'd much
rather that they are called something which makes it clear that they are
from a different namespace than the existing ASIDs.

After this series there will be three classes:

 0: Special ASID used for the kernel, basically
 1->TLB_NR_DYN_ASIDS: Allocated from private, per-cpu space. Meaningless
		      when compared between CPUs.
 >TLB_NR_DYN_ASIDS:   Allocated from shared, kernel-wide space. All CPUs
		      share this space and must all agree on what the
		      values mean.

The fact that the "shared" ones are system-wide obviously allows INVLPGB
to be used. The hardware feature also obviously "broadcasts" things more
than plain old INVLPG did. But I don't think that makes the ASIDs
"broadcast" ASIDs.

It's much more important to know that they are shared across the system
instead of per-cpu than the fact that the deep implementation manages
them with an instruction that is "broadcast" by hardware.

So can we call them "global", "shared" or "system" ASIDs, please?
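
(A minimal sketch of that split, purely for illustration; the cutoff
matches today's TLB_NR_DYN_ASIDS, but the helper is hypothetical:)

#include <linux/types.h>

#define TLB_NR_DYN_ASIDS	6	/* per-CPU dynamic ASIDs, as today */

/*
 * Hypothetical helper: ASID 0 is the kernel's, 1..TLB_NR_DYN_ASIDS are
 * per-CPU, and anything above comes from the shared, kernel-wide space
 * and means the same thing on every CPU.
 */
static inline bool is_global_asid(u16 asid)
{
	return asid > TLB_NR_DYN_ASIDS;
}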

Second, the TLB_NR_DYN_ASIDS was picked because it's roughly the number
of distinct PCIDs that the CPU can keep in the TLB at once (at least on
Intel). Let's say a CPU has 6 mm's in the per-cpu ASID space and another
6 in the shared/broadcast space. At that point, PCIDs might not be doing
much good because the TLB can't store entries for 12 PCIDs.

Is there any comprehension of this in the series? Should we be indexing
cpu_tlbstate.ctxs[] by a *context* number rather than by the ASID that
it's running as?

Last, I'm not 100% convinced we want to do this whole thing. The
will-it-scale numbers are nice. But given the complexity of this, I
think we need some actual, real end users to stand up and say exactly
how this is important in *PRODUCTION* to them.
Yosry Ahmed Jan. 6, 2025, 10:49 p.m. UTC | #2
On Mon, Dec 30, 2024 at 9:57 AM Rik van Riel <riel@surriel.com> wrote:
>
> Subject: [RFC PATCH 00/10] AMD broadcast TLB invalidation
>
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
>
> This allows the kernel to invalidate TLB entries on remote CPUs without
> needing to send IPIs, without having to wait for remote CPUs to handle
> those interrupts, and with less interruption to what was running on
> those CPUs.
>
> Because x86 PCID space is limited, and there are some very large
> systems out there, broadcast TLB invalidation is only used for
> processes that are active on 3 or more CPUs, with the threshold
> being gradually increased the more the PCID space gets exhausted.
>
> Combined with the removal of unnecessary lru_add_drain calls
> (see https://lkml.org/lkml/2024/12/19/1388) this results in a
> nice performance boost for the will-it-scale tlb_flush2_threads
> test on an AMD Milan system with 36 cores:
>
> - vanilla kernel:           527k loops/second
> - lru_add_drain removal:    731k loops/second
> - only INVLPGB:             527k loops/second
> - lru_add_drain + INVLPGB: 1157k loops/second
>
> Profiling with only the INVLPGB changes showed while
> TLB invalidation went down from 40% of the total CPU
> time to only around 4% of CPU time, the contention
> simply moved to the LRU lock.

We briefly looked at using INVLPGB/TLBSYNC as part of the ASI work to
optimize away the async freeing logic which sends TLB flush IPIs.

I have a high-level question about INVLPGB/TLBSYNC that I could not
immediately find the answer to in the AMD manual. Sorry if I missed
the answer or if I missed something obvious.

Do we know what the underlying mechanism for delivering the TLB
flushes is? If a CPU has interrupts disabled, does it still receive
the broadcast TLB flush request and handle it?

My main concern is that TLBSYNC is a single instruction that seems
like it will wait for an arbitrary amount of time, and IIUC interrupts
(and NMIs) will not be delivered to the running CPU until after the
instruction completes execution (only at an instruction boundary).

Are there any guarantees about other CPUs handling the broadcast TLB
flush in a timely manner, or an explanation of how CPUs handle the
incoming requests in general?

>
> Fixing both at the same time about doubles the
> number of iterations per second from this case.
>
> v3:
>  - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
>  - More suggested cleanups and changelog fixes by Peter and Nadav
> v2:
>  - Apply suggestions by Peter and Borislav (thank you!)
>  - Fix bug in arch_tlbbatch_flush, where we need to do both
>    the TLBSYNC, and flush the CPUs that are in the cpumask.
>  - Some updates to comments and changelogs based on questions.
>
>
Rik van Riel Jan. 7, 2025, 3:25 a.m. UTC | #3
On Mon, 2025-01-06 at 14:49 -0800, Yosry Ahmed wrote:
> 
> We briefly looked at using INVLPGB/TLBSYNC as part of the ASI work to
> optimize away the async freeing logic which sends TLB flush IPIs.
> 
> I have a high-level question about INVLPGB/TLBSYNC that I could not
> immediately find the answer to in the AMD manual. Sorry if I missed
> the answer or if I missed something obvious.
> 
> Do we know what the underlying mechanism for delivering the TLB
> flushes is? If a CPU has interrupts disabled, does it still receive
> the broadcast TLB flush request and handle it?

I assume TLB invalidation is probably handled similarly
to how cache coherency is handled between CPUs.

However, it probably does not need to be quite as fast,
since cache coherency traffic is probably 2-6 orders of
magnitude more common than TLB invalidation traffic.

> 
> My main concern is that TLBSYNC is a single instruction that seems
> like it will wait for an arbitrary amount of time, and IIUC
> interrupts
> (and NMIs) will not be delivered to the running CPU until after the
> instruction completes execution (only at an instruction boundary).
> 
> Are there any guarantees about other CPUs handling the broadcast TLB
> flush in a timely manner, or an explanation of how CPUs handle the
> incoming requests in general?

The performance numbers I got with the tlb_flush2_threads
microbenchmark strongly suggest that INVLPGB flushes are
handled by the receiving CPUs even while interrupts are
disabled.

CPU time spent in flush_tlb_mm_range goes down with
INVLPGB, compared with IPI based TLB flushing, even when
the IPIs only go to a subset of CPUs.

I have no idea whether the invalidation is handled by
something like microcode in the CPU, by the (more
external?) logic that handles cache coherency, or
something else entirely.

I suspect AMD wouldn't tell us exactly ;)
Yosry Ahmed Jan. 8, 2025, 1:36 a.m. UTC | #4
On Mon, Jan 6, 2025 at 7:25 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Mon, 2025-01-06 at 14:49 -0800, Yosry Ahmed wrote:
> >
> > We briefly looked at using INVLPGB/TLBSYNC as part of the ASI work to
> > optimize away the async freeing logic which sends TLB flush IPIs.
> >
> > I have a high-level question about INVLPGB/TLBSYNC that I could not
> > immediately find the answer to in the AMD manual. Sorry if I missed
> > the answer or if I missed something obvious.
> >
> > Do we know what the underlying mechanism for delivering the TLB
> > flushes is? If a CPU has interrupts disabled, does it still receive
> > the broadcast TLB flush request and handle it?
>
> I assume TLB invalidation is probably handled similarly
> to how cache coherency is handled between CPUs.
>
> However, it probably does not need to be quite as fast,
> since cache coherency traffic is probably 2-6 orders of
> magnitude more common than TLB invalidation traffic.
>
> >
> > My main concern is that TLBSYNC is a single instruction that seems
> > like it will wait for an arbitrary amount of time, and IIUC
> > interrupts
> > (and NMIs) will not be delivered to the running CPU until after the
> > instruction completes execution (only at an instruction boundary).
> >
> > Are there any guarantees about other CPUs handling the broadcast TLB
> > flush in a timely manner, or an explanation of how CPUs handle the
> > incoming requests in general?
>
> The performance numbers I got with the tlb_flush2_threads
> microbenchmark strongly suggest that INVLPGB flushes are
> handled by the receiving CPUs even while interrupts are
> disabled.
>
> CPU time spent in flush_tlb_mm_range goes down with
> INVLPGB, compared with IPI based TLB flushing, even when
> the IPIs only go to a subset of CPUs.
>
> I have no idea whether the invalidation is handled by
> something like microcode in the CPU, by the (more
> external?) logic that handles cache coherency, or
> something else entirely.
>
> I suspect AMD wouldn't tell us exactly ;)

Well, ideally they would just tell us the conditions under which CPUs
respond to the broadcast TLB flush or the expectations around latency.
I am also wondering if a CPU can respond to an INVLPGB while running
TLBSYNC, specifically if it's possible for two CPUs to send broadcasts
to one another and then execute TLBSYNC to wait for each other. Could
this lead to a deadlock? I think the answer is no but we have little
understanding about what's going on under the hood to know for sure
(or at least I do).

>
> --
> All Rights Reversed.
Andrew Cooper Jan. 9, 2025, 2:47 a.m. UTC | #6
>> I suspect AMD wouldn't tell us exactly ;)
>
> Well, ideally they would just tell us the conditions under which CPUs
> respond to the broadcast TLB flush or the expectations around latency.

[Resend, complete this time]

Disclaimer.  I'm not at AMD; I don't know how they implement it; I'm
just a random person on the internet.  But, here are a few things that
might be relevant to know.

AMD's SEV-SNP whitepaper [1] states that RMP permissions "are cached in
the CPU TLB and related structures" and also "When required, hardware
automatically performs TLB invalidations to ensure that all processors
in the system see the updated RMP entry information."

That sentence doesn't use "broadcast" or "remote", but "all processors"
is a pretty clear clue.  Broadcast TLB invalidations are a building
block of all the RMP-manipulation instructions.

Furthermore, to be useful in this context, they need to be ordered with
memory.  Specifically, a new pagewalk mustn't start after an
invalidation, yet observe the stale RMP entry.


x86 CPUs do have reasonable forward-progress guarantees, but in order to
achieve forward progress, they need to e.g. guarantee that one memory
access doesn't displace the TLB entry backing a different memory access
from the same instruction, or you could livelock while trying to
complete a single instruction.

A consequence is that you can't safely invalidate a TLB entry of an
in-progress instruction (although this means only the oldest instruction
in the pipeline, because everything else is speculative and potentially
transient).


INVLPGB invalidations are interrupt-like from the point of view of the
remote core, but are microarchitectural and can be taken irrespective of
the architectural Interrupt and Global Interrupt Flags.  As a
consequence, they'll need to wait until an instruction boundary to be
processed.  While not AMD, the Intel RAR whitepaper [2] discusses the
handling of RARs on the remote processor, and they share a number of
constraints in common with INVLPGB.


Overall, I'd expect the INVLPGB instructions to be pretty quick in and
of themselves; interestingly, they're not identified as architecturally
serialising.  The broadcast is probably posted, and will be dealt with
by remote processors on the subsequent instruction boundary.  TLBSYNC is
the barrier to wait until the invalidations have been processed, and
this will block for an unspecified length of time, probably bounded by
the "longest" instruction in progress on a remote CPU.  e.g. I expect it
probably will suck if you have to wait for a WBINVD instruction to
complete on a remote CPU.

That said, architectural IPIs have the same conditions too, except on
top of that you've got to run a whole interrupt handler.  So, with
reasonable confidence, however slow TLBSYNC might be in the worst case,
it's got absolutely nothing on the overhead of doing invalidations the
old fashioned way.
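
(To make that concrete, the expected usage pattern is presumably along
these lines; this reuses the hypothetical invlpgb()/tlbsync() helpers
sketched earlier in the thread and is not the series' actual code:)

/*
 * Illustrative flush pattern: the INVLPGBs are posted broadcasts, and
 * the single TLBSYNC at the end is the only point that waits for the
 * remote CPUs to have processed them. Real code would use the ECX page
 * count rather than looping one page at a time.
 */
static void broadcast_flush_range(u64 flags, u32 pcid_asid_edx,
				  u64 start, u64 end)
{
	u64 addr;

	for (addr = start; addr < end; addr += 4096)
		invlpgb(addr | flags, 0, pcid_asid_edx);

	tlbsync();
}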


~Andrew

[1]
https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/white-papers/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf
[2]
https://www.intel.com/content/dam/develop/external/us/en/documents/341431-remote-action-request-white-paper.pdf
Yosry Ahmed Jan. 9, 2025, 9:32 p.m. UTC | #7
On Wed, Jan 8, 2025 at 6:47 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> >> I suspect AMD wouldn't tell us exactly ;)
> >
> > Well, ideally they would just tell us the conditions under which CPUs
> > respond to the broadcast TLB flush or the expectations around latency.
>
> [Resend, complete this time]
>
> Disclaimer.  I'm not at AMD; I don't know how they implement it; I'm
> just a random person on the internet.  But, here are a few things that
> might be relevant to know.
>
> AMD's SEV-SNP whitepaper [1] states that RMP permissions "are cached in
> the CPU TLB and related structures" and also "When required, hardware
> automatically performs TLB invalidations to ensure that all processors
> in the system see the updated RMP entry information."
>
> That sentence doesn't use "broadcast" or "remote", but "all processors"
> is a pretty clear clue.  Broadcast TLB invalidations are a building
> block of all the RMP-manipulation instructions.
>
> Furthermore, to be useful in this context, they need to be ordered with
> memory.  Specifically, a new pagewalk mustn't start after an
> invalidation, yet observe the stale RMP entry.
>
>
> x86 CPUs do have reasonable forward-progress guarantees, but in order to
> achieve forward progress, they need to e.g. guarantee that one memory
> access doesn't displace the TLB entry backing a different memory access
> from the same instruction, or you could livelock while trying to
> complete a single instruction.
>
> A consequence is that you can't safely invalidate a TLB entry of an
> in-progress instruction (although this means only the oldest instruction
> in the pipeline, because everything else is speculative and potentially
> transient).
>
>
> INVLPGB invalidations are interrupt-like from the point of view of the
> remote core, but are microarchitectural and can be taken irrespective of
> the architectural Interrupt and Global Interrupt Flags.  As a
> consequence, they'll need wait until an instruction boundary to be
> processed.  While not AMD, the Intel RAR whitepaper [2] discusses the
> handling of RARs on the remote processor, and they share a number of
> constraints in common with INVLPGB.
>
>
> Overall, I'd expect the INVLPGB instructions to be pretty quick in and
> of themselves; interestingly, they're not identified as architecturally
> serialising.  The broadcast is probably posted, and will be dealt with
> by remote processors on the subsequent instruction boundary.  TLBSYNC is
> the barrier to wait until the invalidations have been processed, and
> this will block for an unspecified length of time, probably bounded by
> the "longest" instruction in progress on a remote CPU.  e.g. I expect it
> probably will suck if you have to wait for a WBINVD instruction to
> complete on a remote CPU.
>
> That said, architectural IPIs have the same conditions too, except on
> top of that you've got to run a whole interrupt handler.  So, with
> reasonable confidence, however slow TLBSYNC might be in the worst case,
> it's got absolutely nothing on the overhead of doing invalidations the
> old fashioned way.

Generally speaking, I am not arguing that TLB flush IPIs are worse
than INVLPGB/TLBSYNC; I think we should expect the latter to perform
better in most cases.

But there is a difference here because the processor executing TLBSYNC
cannot serve interrupts or NMIs while waiting for remote CPUs, because
they have to be served at an instruction boundary, right? Unless
TLBSYNC is an exception to that rule, or its execution is considered
completed before remote CPUs respond (i.e. the CPU executes it quickly
then enters into a wait doing "nothing").

There are also intriguing corner cases that are not documented. For
example, you mention that it's reasonable to expect that a remote CPU
does not serve TLBSYNC except at the instruction boundary. What if
that CPU is executing TLBSYNC? Do we have to wait for its execution to
complete? Is it possible to end up in a deadlock? This goes back to my
previous point about whether TLBSYNC is a special case or when it's
considered to have finished executing.

I am sure people thought about that and I am probably worried over
nothing, but there is little detail here, so one has to speculate.

Again, sorry if I am making a fuss over nothing and it's all in my head.

>
>
> ~Andrew
>
> [1]
> https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/white-papers/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf
> [2]
> https://www.intel.com/content/dam/develop/external/us/en/documents/341431-remote-action-request-white-paper.pdf
Andrew Cooper Jan. 9, 2025, 11 p.m. UTC | #8
On 09/01/2025 9:32 pm, Yosry Ahmed wrote:
> On Wed, Jan 8, 2025 at 6:47 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>> I suspect AMD wouldn't tell us exactly ;)
>>> Well, ideally they would just tell us the conditions under which CPUs
>>> respond to the broadcast TLB flush or the expectations around latency.
>> [Resend, complete this time]
>>
>> Disclaimer.  I'm not at AMD; I don't know how they implement it; I'm
>> just a random person on the internet.  But, here are a few things that
>> might be relevant to know.
>>
>> AMD's SEV-SNP whitepaper [1] states that RMP permissions "are cached in
>> the CPU TLB and related structures" and also "When required, hardware
>> automatically performs TLB invalidations to ensure that all processors
>> in the system see the updated RMP entry information."
>>
>> That sentence doesn't use "broadcast" or "remote", but "all processors"
>> is a pretty clear clue.  Broadcast TLB invalidations are a building
>> block of all the RMP-manipulation instructions.
>>
>> Furthermore, to be useful in this context, they need to be ordered with
>> memory.  Specifically, a new pagewalk mustn't start after an
>> invalidation, yet observe the stale RMP entry.
>>
>>
>> x86 CPUs do have reasonable forward-progress guarantees, but in order to
>> achieve forward progress, they need to e.g. guarantee that one memory
>> access doesn't displace the TLB entry backing a different memory access
>> from the same instruction, or you could livelock while trying to
>> complete a single instruction.
>>
>> A consequence is that you can't safely invalidate a TLB entry of an
>> in-progress instruction (although this means only the oldest instruction
>> in the pipeline, because everything else is speculative and potentially
>> transient).
>>
>>
>> INVLPGB invalidations are interrupt-like from the point of view of the
>> remote core, but are microarchitectural and can be taken irrespective of
>> the architectural Interrupt and Global Interrupt Flags.  As a
>> consequence, they'll need wait until an instruction boundary to be
>> processed.  While not AMD, the Intel RAR whitepaper [2] discusses the
>> handling of RARs on the remote processor, and they share a number of
>> constraints in common with INVLPGB.
>>
>>
>> Overall, I'd expect the INVLPGB instructions to be pretty quick in and
>> of themselves; interestingly, they're not identified as architecturally
>> serialising.  The broadcast is probably posted, and will be dealt with
>> by remote processors on the subsequent instruction boundary.  TLBSYNC is
>> the barrier to wait until the invalidations have been processed, and
>> this will block for an unspecified length of time, probably bounded by
>> the "longest" instruction in progress on a remote CPU.  e.g. I expect it
>> probably will suck if you have to wait for a WBINVD instruction to
>> complete on a remote CPU.
>>
>> That said, architectural IPIs have the same conditions too, except on
>> top of that you've got to run a whole interrupt handler.  So, with
>> reasonable confidence, however slow TLBSYNC might be in the worst case,
>> it's got absolutely nothing on the overhead of doing invalidations the
>> old fashioned way.
> Generally speaking, I am not arguing that TLB flush IPIs are worse
> than INLPGB/TLBSYNC, I think we should expect the latter to perform
> better in most cases.
>
> But there is a difference here because the processor executing TLBSYNC
> cannot serve interrupts or NMIs while waiting for remote CPUs, because
> they have to be served at an instruction boundary, right?

That's as per the architecture, yes.  NMIs do have to be served on
instruction boundaries.  An NMI that becomes pending while a TLBSYNC is
in progress will have to wait until the TLBSYNC completes.

(Probably.  REP string instructions and AVX scatter/gather have explicit
behaviours that let them be interrupted, and to continue from where
they left off when the interrupt handler returns.  Depending on how
TLBSYNC is implemented, it's just possible it has this property too.)

> Unless
> TLBSYNC is an exception to that rule, or its execution is considered
> completed before remote CPUs respond (i.e. the CPU executes it quickly
> then enters into a wait doing "nothing").
>
> There are also intriguing corner cases that are not documented. For
> example, you mention that it's reasonable to expect that a remote CPU
> does not serve TLBSYNC except at the instruction boundary.

INVLPGB needs to wait for an instruction boundary in order to be processed.

All TLBSYNC needs to do is wait until it's certain that all the prior
INVLPGBs issued by this CPU have been serviced.

>  What if
> that CPU is executing TLBSYNC? Do we have to wait for its execution to
> complete? Is it possible to end up in a deadlock? This goes back to my
> previous point about whether TLBSYNC is a special case or when it's
> considered to have finished executing.

Remember that the SEV-SNP instructions (PSMASH, PVALIDATE,
RMP{ADJUST,UPDATE,QUERY,READ}) have an INVLPGB/TLBSYNC pair under the
hood.  You can execute these instructions on different CPUs in parallel.

It's certainly possible AMD missed something and there's a
deadlock case in there.  But Google do offer SEV-SNP VMs and have the
data and scale to know whether such a deadlock is happening in practice.

>
> I am sure people thought about that and I am probably worried over
> nothing, but there's little details here so one has to speculate.
>
> Again, sorry if I am making a fuss over nothing and it's all in my head.

It's absolutely a valid question to ask.

But x86 is full of longer delays than this.  The GIF for example can
block NMIs until the hypervisor is complete with the world switch, and
it's left as an exercise to software not to abuse this.  Taking an SMI
will be orders of magnitude more expensive than anything discussed here.

~Andrew
Yosry Ahmed Jan. 9, 2025, 11:26 p.m. UTC | #9
On Thu, Jan 9, 2025 at 3:00 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 09/01/2025 9:32 pm, Yosry Ahmed wrote:
> > On Wed, Jan 8, 2025 at 6:47 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>>> I suspect AMD wouldn't tell us exactly ;)
> >>> Well, ideally they would just tell us the conditions under which CPUs
> >>> respond to the broadcast TLB flush or the expectations around latency.
> >> [Resend, complete this time]
> >>
> >> Disclaimer.  I'm not at AMD; I don't know how they implement it; I'm
> >> just a random person on the internet.  But, here are a few things that
> >> might be relevant to know.
> >>
> >> AMD's SEV-SNP whitepaper [1] states that RMP permissions "are cached in
> >> the CPU TLB and related structures" and also "When required, hardware
> >> automatically performs TLB invalidations to ensure that all processors
> >> in the system see the updated RMP entry information."
> >>
> >> That sentence doesn't use "broadcast" or "remote", but "all processors"
> >> is a pretty clear clue.  Broadcast TLB invalidations are a building
> >> block of all the RMP-manipulation instructions.
> >>
> >> Furthermore, to be useful in this context, they need to be ordered with
> >> memory.  Specifically, a new pagewalk mustn't start after an
> >> invalidation, yet observe the stale RMP entry.
> >>
> >>
> >> x86 CPUs do have reasonable forward-progress guarantees, but in order to
> >> achieve forward progress, they need to e.g. guarantee that one memory
> >> access doesn't displace the TLB entry backing a different memory access
> >> from the same instruction, or you could livelock while trying to
> >> complete a single instruction.
> >>
> >> A consequence is that you can't safely invalidate a TLB entry of an
> >> in-progress instruction (although this means only the oldest instruction
> >> in the pipeline, because everything else is speculative and potentially
> >> transient).
> >>
> >>
> >> INVLPGB invalidations are interrupt-like from the point of view of the
> >> remote core, but are microarchitectural and can be taken irrespective of
> >> the architectural Interrupt and Global Interrupt Flags.  As a
> >> consequence, they'll need wait until an instruction boundary to be
> >> processed.  While not AMD, the Intel RAR whitepaper [2] discusses the
> >> handling of RARs on the remote processor, and they share a number of
> >> constraints in common with INVLPGB.
> >>
> >>
> >> Overall, I'd expect the INVLPGB instructions to be pretty quick in and
> >> of themselves; interestingly, they're not identified as architecturally
> >> serialising.  The broadcast is probably posted, and will be dealt with
> >> by remote processors on the subsequent instruction boundary.  TLBSYNC is
> >> the barrier to wait until the invalidations have been processed, and
> >> this will block for an unspecified length of time, probably bounded by
> >> the "longest" instruction in progress on a remote CPU.  e.g. I expect it
> >> probably will suck if you have to wait for a WBINVD instruction to
> >> complete on a remote CPU.
> >>
> >> That said, architectural IPIs have the same conditions too, except on
> >> top of that you've got to run a whole interrupt handler.  So, with
> >> reasonable confidence, however slow TLBSYNC might be in the worst case,
> >> it's got absolutely nothing on the overhead of doing invalidations the
> >> old fashioned way.
> > Generally speaking, I am not arguing that TLB flush IPIs are worse
> > than INLPGB/TLBSYNC, I think we should expect the latter to perform
> > better in most cases.
> >
> > But there is a difference here because the processor executing TLBSYNC
> > cannot serve interrupts or NMIs while waiting for remote CPUs, because
> > they have to be served at an instruction boundary, right?
>
> That's as per the architecture, yes.  NMIs do have to be served on
> instruction boundaries.  An NMI that becomes pending while a TLBSYNC is
> in progress will have to wait until the TLBSYNC completes.
>
> (Probably.  REP string instructions and AVX scatter/gather have explicit
> behaviours that them them be interrupted, and to continue from where
> they left off when the interrupt handler returns.  Depending on how
> TLBSYNC is implemented, it's just possible it has this property too.)

That would be great actually; if that's the case, all my concerns go away.

>
> > Unless
> > TLBSYNC is an exception to that rule, or its execution is considered
> > completed before remote CPUs respond (i.e. the CPU executes it quickly
> > then enters into a wait doing "nothing").
> >
> > There are also intriguing corner cases that are not documented. For
> > example, you mention that it's reasonable to expect that a remote CPU
> > does not serve TLBSYNC except at the instruction boundary.
>
> INVLPGB needs to wait for an instruction boundary in order to be processed.
>
> All TLBSYNC needs to do is wait until it's certain that all the prior
> INVLPGBs issued by this CPU have been serviced.
>
> >  What if
> > that CPU is executing TLBSYNC? Do we have to wait for its execution to
> > complete? Is it possible to end up in a deadlock? This goes back to my
> > previous point about whether TLBSYNC is a special case or when it's
> > considered to have finished executing.
>
> Remember that the SEV-SNP instruction (PSMASH, PVALIDATE,
> RMP{ADJUST,UPDATE,QUERY,READ}) have an INVLPGB/TLBSYNC pair under the
> hood.  You can execute these instructions on different CPUs in parallel.
>
> It's certainly possible AMD missed something and there's and there's a
> deadlock case in there.  But Google do offer SEV-SNP VMs and have the
> data and scale to know whether such a deadlock is happening in practice.

I am not familiar with SEV-SNP so excuse my ignorance. I am also
pretty sure that the percentage of SEV-SNP workloads is very low
compared to the workloads that would start using INVLPGB/TLBSYNC after
this series. So if there's a dormant bug or a rare scenario where the
TLBSYNC latency is massive, it may very well be newly uncovered now.

>
> >
> > I am sure people thought about that and I am probably worried over
> > nothing, but there's little details here so one has to speculate.
> >
> > Again, sorry if I am making a fuss over nothing and it's all in my head.
>
> It's absolutely a valid question to ask.
>
> But x86 is full of longer delays than this.  The GIF for example can
> block NMIs until the hypervisor is complete with the world switch, and
> it's left as an exercise to software not to abuse this.  Taking an SMI
> will be orders of magnitude more expensive than anything discussed here.

Right. What is happening here just seems like something that happens
more frequently and therefore is more likely to run into cases with
absurd delays.

It would be great if someone from AMD could shed some light on what is
to be reasonably expected from TLBSYNC here.

Anyway, thanks a lot for all your (very informative) responses :)
Rik van Riel Jan. 12, 2025, 2:46 a.m. UTC | #10
On Mon, 2025-01-06 at 11:03 -0800, Dave Hansen wrote:
> 
> So can we call them "global", "shared" or "system" ASIDs, please?
> 
I have renamed them to global ASIDs.

> Second, the TLB_NR_DYN_ASIDS was picked because it's roughly the
> number
> of distinct PCIDs that the CPU can keep in the TLB at once (at least
> on
> Intel). Let's say a CPU has 6 mm's in the per-cpu ASID space and
> another
> 6 in the shared/broadcast space. At that point, PCIDs might not be
> doing
> much good because the TLB can't store entries for 12 PCIDs.
> 
If the CPU has 12 runnable processes, we may have
various other performance issues, too, like the
system simply not having enough CPU power to run
all the runnable tasks.

Most of the systems I have looked at seem to average
between 0.2 and 2 runnable tasks per CPU, depending on
whether the workload is CPU bound, or memory/IO bound.

> Is there any comprehension in this series? Should we be indexing
> cpu_tlbstate.ctxs[] by a *context* number rather than by the ASID
> that
> it's running as?
> 
We only need the cpu_tlbstate.ctxs[] for the per-CPU
ASID space, in order to look up what process is
assigned which slot.

We do not need it for global ASID numbers, which are
always the same everywhere.

> Last, I'm not 100% convinced we want to do this whole thing. The
> will-it-scale numbers are nice. But given the complexity of this, I
> think we need some actual, real end users to stand up and say exactly
> how this is important in *PRODUCTION* to them.
> 
Do any of these count? :)

https://www.phoronix.com/review/amd-invlpgb-linux
I am hoping to gather some real-world numbers as well,
and will work with some workload owners to get them.