Message ID | 20201128160141.1003903-3-npiggin@gmail.com (mailing list archive)
---|---
State | New, archived
Series | shoot lazy tlbs
On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <npiggin@gmail.com> wrote: > > And get rid of the generic sync_core_before_usermode facility. This is > functionally a no-op in the core scheduler code, but it also catches > > This helper is the wrong way around I think. The idea that membarrier > state requires a core sync before returning to user is the easy one > that does not need hiding behind membarrier calls. The gap in core > synchronization due to x86's sysret/sysexit and lazy tlb mode, is the > tricky detail that is better put in x86 lazy tlb code. > > Consider if an arch did not synchronize core in switch_mm either, then > membarrier_mm_sync_core_before_usermode would be in the wrong place > but arch specific mmu context functions would still be the right place. > There is also a exit_lazy_tlb case that is not covered by this call, which > could be a bugs (kthread use mm the membarrier process's mm then context > switch back to the process without switching mm or lazy mm switch). > > This makes lazy tlb code a bit more modular. I have a couple of membarrier fixes that I want to send out today or tomorrow, and they might eliminate the need for this patch. Let me think about this a little bit. I'll cc you. The existing code is way too subtle and the comments are far too confusing for me to be quickly confident about any of my conclusions :)
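For context on the facility being debated here: the generic sync_core_before_usermode() helper is a thin wrapper around an arch hook, and only x86 does anything in it. A rough paraphrase of the x86 side is below (treat the details as approximate; the authoritative version lives in arch/x86/include/asm/sync_core.h):

    /* Approximate paraphrase of x86's sync_core_before_usermode(). */
    static inline void sync_core_before_usermode(void)
    {
        /*
         * With page table isolation, the return to usermode goes through
         * a CR3 write, which is already core serializing.
         */
        if (static_cpu_has(X86_FEATURE_PTI))
            return;
        /*
         * Otherwise the kernel may return through SYSRET/SYSEXIT, which
         * do not serialize, so force a core sync now rather than trying
         * to track whether this context will need one later.
         */
        sync_core();
    }

The disagreement in the thread is not about this function but about where it gets called from: a membarrier-named helper in the core scheduler, or the architecture's own lazy-TLB exit path.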
----- On Nov 28, 2020, at 11:01 AM, Nicholas Piggin npiggin@gmail.com wrote: > And get rid of the generic sync_core_before_usermode facility. This is > functionally a no-op in the core scheduler code, but it also catches This sentence is incomplete. > > This helper is the wrong way around I think. The idea that membarrier > state requires a core sync before returning to user is the easy one > that does not need hiding behind membarrier calls. The gap in core > synchronization due to x86's sysret/sysexit and lazy tlb mode, is the > tricky detail that is better put in x86 lazy tlb code. Ideally yes this complexity should sit within the x86 architecture code if only that architecture requires it. Thanks, Mathieu
Excerpts from Andy Lutomirski's message of November 29, 2020 3:55 am: > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> And get rid of the generic sync_core_before_usermode facility. This is >> functionally a no-op in the core scheduler code, but it also catches >> >> This helper is the wrong way around I think. The idea that membarrier >> state requires a core sync before returning to user is the easy one >> that does not need hiding behind membarrier calls. The gap in core >> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the >> tricky detail that is better put in x86 lazy tlb code. >> >> Consider if an arch did not synchronize core in switch_mm either, then >> membarrier_mm_sync_core_before_usermode would be in the wrong place >> but arch specific mmu context functions would still be the right place. >> There is also a exit_lazy_tlb case that is not covered by this call, which >> could be a bugs (kthread use mm the membarrier process's mm then context >> switch back to the process without switching mm or lazy mm switch). >> >> This makes lazy tlb code a bit more modular. > > I have a couple of membarrier fixes that I want to send out today or > tomorrow, and they might eliminate the need for this patch. Let me > think about this a little bit. I'll cc you. The existing code is way > to subtle and the comments are far too confusing for me to be quickly > confident about any of my conclusions :) > Thanks for the head's up. I'll have to have a better look through them but I don't know that it eliminates the need for this entirely although it might close some gaps and make this not a bug fix. The problem here is x86 code wanted something to be called when a lazy mm is unlazied, but it missed some spots and also the core scheduler doesn't need to know about those x86 details if it has this generic call that annotates the lazy handling better. I'll go through the wording again and look at your patches a bit better but I think they are somewhat orthogonal. Thanks, Nick
On Tue, Dec 1, 2020 at 6:50 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Andy Lutomirski's message of November 29, 2020 3:55 am: > > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <npiggin@gmail.com> wrote: > >> > >> And get rid of the generic sync_core_before_usermode facility. This is > >> functionally a no-op in the core scheduler code, but it also catches > >> > >> This helper is the wrong way around I think. The idea that membarrier > >> state requires a core sync before returning to user is the easy one > >> that does not need hiding behind membarrier calls. The gap in core > >> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the > >> tricky detail that is better put in x86 lazy tlb code. > >> > >> Consider if an arch did not synchronize core in switch_mm either, then > >> membarrier_mm_sync_core_before_usermode would be in the wrong place > >> but arch specific mmu context functions would still be the right place. > >> There is also a exit_lazy_tlb case that is not covered by this call, which > >> could be a bugs (kthread use mm the membarrier process's mm then context > >> switch back to the process without switching mm or lazy mm switch). > >> > >> This makes lazy tlb code a bit more modular. > > > > I have a couple of membarrier fixes that I want to send out today or > > tomorrow, and they might eliminate the need for this patch. Let me > > think about this a little bit. I'll cc you. The existing code is way > > to subtle and the comments are far too confusing for me to be quickly > > confident about any of my conclusions :) > > > > Thanks for the head's up. I'll have to have a better look through them > but I don't know that it eliminates the need for this entirely although > it might close some gaps and make this not a bug fix. The problem here > is x86 code wanted something to be called when a lazy mm is unlazied, > but it missed some spots and also the core scheduler doesn't need to > know about those x86 details if it has this generic call that annotates > the lazy handling better. I'll send v3 tomorrow. They add more sync_core_before_usermode() callers. Having looked at your patches a bunch and the membarrier code a bunch, I don't think I like the approach of pushing this logic out into new core functions called by arch code. Right now, even with my membarrier patches applied, understanding how (for example) the x86 switch_mm_irqs_off() plus the scheduler code provides the barriers that membarrier needs is quite complicated, and it's not clear to me that the code is even correct. At the very least I'm pretty sure that the x86 comments are misleading. With your patches, someone trying to audit the code would have to follow core code calling into arch code and back out into core code to figure out what's going on. I think the result is worse. I wrote this incomplete patch which takes the opposite approach (sorry for whitespace damage): commit 928b5c60e93f475934892d6e0b357ebf0a2bf87d Author: Andy Lutomirski <luto@kernel.org> Date: Wed Dec 2 17:24:02 2020 -0800 [WIP] x86/mm: Handle unlazying membarrier core sync in the arch code The core scheduler isn't a great place for membarrier_mm_sync_core_before_usermode() -- the core scheduler doesn't actually know whether we are lazy. With the old code, if a CPU is running a membarrier-registered task, goes idle, gets unlazied via a TLB shootdown IPI, and switches back to the membarrier-registered task, it will do an unnecessary core sync. 
Conveniently, x86 is the only architecture that does anything in this hook, so we can just move the code. XXX: actually delete the old code. Signed-off-by: Andy Lutomirski <luto@kernel.org> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 3338a1feccf9..e27300fc865b 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -496,6 +496,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, * from one thread in a process to another thread in the same * process. No TLB flush required. */ + + // XXX: why is this okay wrt membarrier? if (!was_lazy) return; @@ -508,12 +510,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, smp_mb(); next_tlb_gen = atomic64_read(&next->context.tlb_gen); if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == - next_tlb_gen) + next_tlb_gen) { + /* + * We're reactivating an mm, and membarrier might + * need to serialize. Tell membarrier. + */ + + // XXX: I can't understand the logic in + // membarrier_mm_sync_core_before_usermode(). What's + // the mm check for? + membarrier_mm_sync_core_before_usermode(next); return; + } /* * TLB contents went out of date while we were in lazy * mode. Fall through to the TLB switching code below. + * No need for an explicit membarrier invocation -- the CR3 + * write will serialize. */ new_asid = prev_asid; need_flush = true;
Excerpts from Andy Lutomirski's message of December 3, 2020 3:09 pm: > On Tue, Dec 1, 2020 at 6:50 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Excerpts from Andy Lutomirski's message of November 29, 2020 3:55 am: >> > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> >> >> And get rid of the generic sync_core_before_usermode facility. This is >> >> functionally a no-op in the core scheduler code, but it also catches >> >> >> >> This helper is the wrong way around I think. The idea that membarrier >> >> state requires a core sync before returning to user is the easy one >> >> that does not need hiding behind membarrier calls. The gap in core >> >> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the >> >> tricky detail that is better put in x86 lazy tlb code. >> >> >> >> Consider if an arch did not synchronize core in switch_mm either, then >> >> membarrier_mm_sync_core_before_usermode would be in the wrong place >> >> but arch specific mmu context functions would still be the right place. >> >> There is also a exit_lazy_tlb case that is not covered by this call, which >> >> could be a bugs (kthread use mm the membarrier process's mm then context >> >> switch back to the process without switching mm or lazy mm switch). >> >> >> >> This makes lazy tlb code a bit more modular. >> > >> > I have a couple of membarrier fixes that I want to send out today or >> > tomorrow, and they might eliminate the need for this patch. Let me >> > think about this a little bit. I'll cc you. The existing code is way >> > to subtle and the comments are far too confusing for me to be quickly >> > confident about any of my conclusions :) >> > >> >> Thanks for the head's up. I'll have to have a better look through them >> but I don't know that it eliminates the need for this entirely although >> it might close some gaps and make this not a bug fix. The problem here >> is x86 code wanted something to be called when a lazy mm is unlazied, >> but it missed some spots and also the core scheduler doesn't need to >> know about those x86 details if it has this generic call that annotates >> the lazy handling better. > > I'll send v3 tomorrow. They add more sync_core_before_usermode() callers. > > Having looked at your patches a bunch and the membarrier code a bunch, > I don't think I like the approach of pushing this logic out into new > core functions called by arch code. Right now, even with my > membarrier patches applied, understanding how (for example) the x86 > switch_mm_irqs_off() plus the scheduler code provides the barriers > that membarrier needs is quite complicated, and it's not clear to me > that the code is even correct. At the very least I'm pretty sure that > the x86 comments are misleading. > > With your patches, someone trying to > audit the code would have to follow core code calling into arch code > and back out into core code to figure out what's going on. I think > the result is worse. Sorry I missed this and rather than reply to the later version you have a bigger comment here. I disagree. Until now nobody following it noticed that the mm gets un-lazied in other cases, because that was not too clear from the code (only indirectly using non-standard terminology in the arch support document). In other words, membarrier needs a special sync to deal with the case when a kthread takes the mm. exit_lazy_tlb gives membarrier code that exact hook that it wants from the core scheduler code. 
> > I wrote this incomplete patch which takes the opposite approach (sorry > for whitespace damage): That said, if you want to move the code entirely in the x86 arch from exit_lazy_tlb to switch_mm_irqs_off, it's trivial and touches no core code after my series :) and I would have no problem with doing that. I suspect it might actually be more readable to go the other way and pull most of the real_prev == next membarrier code into exit_lazy_tlb instead, but I could be wrong I don't know exactly how the x86 lazy state correlates with core lazy tlb state. Thanks, Nick > > commit 928b5c60e93f475934892d6e0b357ebf0a2bf87d > Author: Andy Lutomirski <luto@kernel.org> > Date: Wed Dec 2 17:24:02 2020 -0800 > > [WIP] x86/mm: Handle unlazying membarrier core sync in the arch code > > The core scheduler isn't a great place for > membarrier_mm_sync_core_before_usermode() -- the core scheduler > doesn't actually know whether we are lazy. With the old code, if a > CPU is running a membarrier-registered task, goes idle, gets unlazied > via a TLB shootdown IPI, and switches back to the > membarrier-registered task, it will do an unnecessary core sync. > > Conveniently, x86 is the only architecture that does anything in this > hook, so we can just move the code. > > XXX: actually delete the old code. > > Signed-off-by: Andy Lutomirski <luto@kernel.org> > > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c > index 3338a1feccf9..e27300fc865b 100644 > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -496,6 +496,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, > struct mm_struct *next, > * from one thread in a process to another thread in the same > * process. No TLB flush required. > */ > + > + // XXX: why is this okay wrt membarrier? > if (!was_lazy) > return; > > @@ -508,12 +510,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, > struct mm_struct *next, > smp_mb(); > next_tlb_gen = atomic64_read(&next->context.tlb_gen); > if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == > - next_tlb_gen) > + next_tlb_gen) { > + /* > + * We're reactivating an mm, and membarrier might > + * need to serialize. Tell membarrier. > + */ > + > + // XXX: I can't understand the logic in > + // membarrier_mm_sync_core_before_usermode(). What's > + // the mm check for? > + membarrier_mm_sync_core_before_usermode(next); > return; > + } > > /* > * TLB contents went out of date while we were in lazy > * mode. Fall through to the TLB switching code below. > + * No need for an explicit membarrier invocation -- the CR3 > + * write will serialize. > */ > new_asid = prev_asid; > need_flush = true; >
> On Dec 5, 2020, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote: > > > I disagree. Until now nobody following it noticed that the mm gets > un-lazied in other cases, because that was not too clear from the > code (only indirectly using non-standard terminology in the arch > support document). > In other words, membarrier needs a special sync to deal with the case > when a kthread takes the mm. I don’t think this is actually true. Somehow the x86 oddities about CR3 writes leaked too much into the membarrier core code and comments. (I doubt this is x86 specific. The actual x86 specific part seems to be that we can return to user mode without syncing the instruction stream.) As far as I can tell, membarrier doesn’t care at all about laziness. Membarrier cares about rq->curr->mm. The fact that a cpu can switch its actual loaded mm without scheduling at all (on x86 at least) is entirely beside the point except insofar as it has an effect on whether a subsequent switch_mm() call serializes. If we notify membarrier about x86’s asynchronous CR3 writes, then membarrier needs to understand what to do with them, which results in an unmaintainable mess in membarrier *and* in the x86 code. I’m currently trying to document how membarrier actually works, and hopefully this will result in untangling membarrier from mmdrop() and such. A silly part of this is that x86 already has a high quality implementation of most of membarrier(): flush_tlb_mm(). If you flush an mm’s TLB, we carefully propagate the flush to all threads, with attention to memory ordering. We can’t use this directly as an arch-specific implementation of membarrier because it has the annoying side effect of flushing the TLB and because upcoming hardware might be able to flush without guaranteeing a core sync. (Upcoming means Zen 3, but the Zen 3 implementation is sadly not usable by Linux.)
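The "membarrier cares about rq->curr->mm" point is visible in the shape of the expedited path. A simplified sketch follows; cpu_rq() is scheduler-internal, and the real code in kernel/sched/membarrier.c carries registration checks and memory-ordering commentary that are omitted here:

    /* IPI callback: the barrier membarrier wants on each target CPU. */
    static void ipi_mb(void *info)
    {
        smp_mb();
    }

    /* Simplified sketch of the private-expedited CPU walk. */
    static void membarrier_private_expedited_sketch(void)
    {
        cpumask_var_t tmpmask;
        int cpu;

        if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
            return;

        cpus_read_lock();
        rcu_read_lock();
        for_each_online_cpu(cpu) {
            struct task_struct *p = rcu_dereference(cpu_rq(cpu)->curr);

            /*
             * Only CPUs whose *logical* current task shares the caller's
             * mm are targeted; whether the CPU's loaded mm is lazy is
             * invisible at this level.
             */
            if (p && p->mm == current->mm)
                __cpumask_set_cpu(cpu, tmpmask);
        }
        rcu_read_unlock();

        smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
        cpus_read_unlock();
        free_cpumask_var(tmpmask);
    }

The SYNC_CORE flavour additionally needs the return to userspace on those CPUs (and on any CPU that picks the mm up later) to be core serializing, which is exactly the piece whose placement is being argued about.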
Excerpts from Andy Lutomirski's message of December 6, 2020 2:11 am: > >> On Dec 5, 2020, at 12:00 AM, Nicholas Piggin <npiggin@gmail.com> wrote: >> >> >> I disagree. Until now nobody following it noticed that the mm gets >> un-lazied in other cases, because that was not too clear from the >> code (only indirectly using non-standard terminology in the arch >> support document). > >> In other words, membarrier needs a special sync to deal with the case >> when a kthread takes the mm. > > I don’t think this is actually true. Somehow the x86 oddities about > CR3 writes leaked too much into the membarrier core code and comments. > (I doubt this is x86 specific. The actual x86 specific part seems to > be that we can return to user mode without syncing the instruction > stream.) > > As far as I can tell, membarrier doesn’t care at all about laziness. > Membarrier cares about rq->curr->mm. The fact that a cpu can switch > its actual loaded mm without scheduling at all (on x86 at least) is > entirely beside the point except insofar as it has an effect on > whether a subsequent switch_mm() call serializes. Core membarrier itself doesn't care about laziness, which is why the membarrier flush should go in exit_lazy_tlb() or other x86 specific code (at least until more architectures did the same thing and we moved it into generic code). I just meant this non-serialising return as documented in the membarrier arch enablement doc specifies the lazy tlb requirement. If an mm was lazy tlb for a kernel thread and then it becomes unlazy, and if switch_mm is serialising but return to user is not, then you need a serialising instruction somewhere before return to user. unlazy is the logical place to add that, because the lazy tlb mm (i.e., switching to a kernel thread and back without switching mm) is what opens the hole. > If we notify > membarrier about x86’s asynchronous CR3 writes, then membarrier needs > to understand what to do with them, which results in an unmaintainable > mess in membarrier *and* in the x86 code. How do you mean? exit_lazy_tlb is the opposite, core scheduler notifying arch code about when an mm becomes not-lazy, and nothing to do with membarrier at all even. It's a convenient hook to do your un-lazying. I guess you can do it also checking things in switch_mm and keeping state in arch code, I don't think that's necessarily the best place to put it. So membarrier code is unchanged (it cares that the serialise is done at un-lazy time), core code is simpler (no knowledge of this membarrier quirk and it already knows about lazy-tlb so the calls actually improve the documentation), and x86 code I would argue becomes nicer (or no real difference at worst) because you can move some exit lazy tlb handling to that specific call rather than decipher it from switch_mm. > > I’m currently trying to document how membarrier actually works, and > hopefully this will result in untangling membarrier from mmdrop() and > such. That would be nice. > > A silly part of this is that x86 already has a high quality > implementation of most of membarrier(): flush_tlb_mm(). If you flush > an mm’s TLB, we carefully propagate the flush to all threads, with > attention to memory ordering. We can’t use this directly as an > arch-specific implementation of membarrier because it has the annoying > side affect of flushing the TLB and because upcoming hardware might be > able to flush without guaranteeing a core sync. (Upcoming means Zen > 3, but the Zen 3 implementation is sadly not usable by Linux.) 
> A hardware broadcast TLB flush, you mean? What makes it unusable by Linux out of curiosity?
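To make the hole being described concrete, here is a schematic of the problematic sequence, followed by the rough shape of the generic hook; the exact call sites live in earlier patches of the series and are assumed here, not quoted:

    /*
     * CPU0                                     another CPU
     * ----                                     -----------
     * user thread A of mm M runs
     * schedule() -> kernel thread
     *   kthread has no mm of its own, so it
     *   borrows M as active_mm (lazy TLB,
     *   no switch_mm())
     *                                          thread B of M calls
     *                                          membarrier(SYNC_CORE):
     *                                          CPU0's rq->curr is a kthread
     *                                          (->mm == NULL), so no IPI
     * schedule() -> back to thread A
     *   prev->active_mm == next->mm, so again
     *   no switch_mm(); x86 may then return to
     *   userspace via SYSRET with no core
     *   serializing instruction in between
     */

    /* Generic fallback: a no-op unless the architecture overrides it. */
    #ifndef exit_lazy_tlb
    static inline void exit_lazy_tlb(struct mm_struct *mm,
                                     struct task_struct *tsk)
    {
    }
    #endif

exit_lazy_tlb() runs at the "back to thread A" step, when the borrowed mm stops being lazy, which is why it is presented as the natural place for x86 to issue sync_core_before_usermode().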
On Sat, Dec 5, 2020 at 3:15 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Andy Lutomirski's message of December 6, 2020 2:11 am: > > > If an mm was lazy tlb for a kernel thread and then it becomes unlazy, > and if switch_mm is serialising but return to user is not, then you > need a serialising instruction somewhere before return to user. unlazy > is the logical place to add that, because the lazy tlb mm (i.e., > switching to a kernel thread and back without switching mm) is what > opens the hole. The issue here is that unlazying on x86 sometimes serializes and sometimes doesn't. It's straightforward to add logic to the x86 code to serialize specifically in the case where membarrier core sync is registered and unlazying would otherwise not serialize, but trying to define sensible semantics for this in a call to core code seems complicated. (Specifically, the x86 code only sometimes sends IPIs to lazy CPUs for TLB flushes. If an IPI is skipped, then unlazying will flush the TLB, and that operation is serializing.) The whole lazy thing is IMO a red herring for membarrier(). The membarrier() logic requires that switching *logical* mms (rq->curr->mm) serializes before user mode if the new mm is registered for core sync. AFAICT the only architecture on which this isn't automatic is x86, and somehow the logic turned into "actually changing rq->curr->mm serializes, but unlazying only sometimes serializes, so we need to do an extra serialization step on unlazying operations" instead of "tell x86 to make sure it always serializes when it switches logical mms". The latter is easy to specify and easy to implement. > > How do you mean? exit_lazy_tlb is the opposite, core scheduler notifying > arch code about when an mm becomes not-lazy, and nothing to do with > membarrier at all even. It's a convenient hook to do your un-lazying. > I guess you can do it also checking things in switch_mm and keeping state > in arch code, I don't think that's necessarily the best place to put it. I'm confused. I just re-read your patches, and it looks like you have arch code calling exit_lazy_tlb(). On x86, if we do a TLB shootdown IPI to a lazy CPU, the IPI handler will unlazy that CPU (by switching to init_mm for real), and we have no way to notify the core scheduler about this, so we don't. The result is that the core scheduler state and the x86 state get out of sync. If the core scheduler subsequently switches us back to the mm that it thinks we were still using lazily, then from the x86 code's perspective, we're not unlazying -- we're just doing a regular switch from init_mm to some other mm. This is why x86's switch_mm_irqs_off() totally ignores its 'prev' argument. I'm honestly a bit surprised that other architectures don't do the same thing. I suppose that some architectures don't use shootdown IPIs at all, in which case there doesn't seem to be any good reason to aggressively unlazy. (Oddly, despite the fact that, since Ivy Bridge, x86 has a "just flush the TLB" instruction, that instruction is considerably slower than the old "switch mm and flush" operation. So the operation "switch to init_mm" is only ever any slower than "flush and stay lazy" if we get lucky and unlazy to the same mm before we get a second TLB shootdown *and* if unlazying to the same mm would not have needed to flush. I spend quite a bit of time tuning this stuff and being quite surprised at the bizarre performance properties of Intel's TLB management instructions.)
> > So membarrier code is unchanged (it cares that the serialise is done at > un-lazy time), core code is simpler (no knowledge of this membarrier > quirk and it already knows about lazy-tlb so the calls actually improve > the documentation), and x86 code I would argue becomes nicer (or no real > difference at worst) because you can move some exit lazy tlb handling to > that specific call rather than decipher it from switch_mm. As above, I can't move the exit-lazy handling because the scheduler doesn't know when I'm unlazying. > > > > > I’m currently trying to document how membarrier actually works, and > > hopefully this will result in untangling membarrier from mmdrop() and > > such. > > That would be nice. It's still a work in progress. I haven't actually convinced myself that the non-IPI case in membarrier() is correct, nor have I convinced myself that it's incorrect. Anyway, I think that my patch is a bit incorrect and I either need a barrier somewhere (which may already exist) or a store-release to lazy_mm to make sure that all accesses to the lazy mm are done before lazy_mm is freed. On x86, even aside from the fact that all stores are releases, this isn't needed -- stopping using an mm is itself a full barrier. Will this be a performance issue on power? > > > > > A silly part of this is that x86 already has a high quality > > implementation of most of membarrier(): flush_tlb_mm(). If you flush > > an mm’s TLB, we carefully propagate the flush to all threads, with > > attention to memory ordering. We can’t use this directly as an > > arch-specific implementation of membarrier because it has the annoying > > side affect of flushing the TLB and because upcoming hardware might be > > able to flush without guaranteeing a core sync. (Upcoming means Zen > > 3, but the Zen 3 implementation is sadly not usable by Linux.) > > > > A hardware broadcast TLB flush, you mean? What makes it unusable by > Linux out of curiosity? The new instruction is INVLPGB. Unfortunately, x86's ASID field is very narrow, and there's no way we can give each mm the same ASID across all CPUs, which means we can't accurately target the flush at the correct set of TLB entries. I've asked engineers at both Intel and AMD to widen the ASID field, but that will end up being complicated -- x86 has run out of bits in its absurdly overloaded CR3 encoding, and widening the ASID to any reasonable size would require adding a new way to switch mms. There are lots of reasons that x86 should do that anyway [0], but it would be a big project and I'm not sure that either company is interested in big projects like that. [0] On x86, you can't switch between (64-bit execution, 48-bit virtual address space) and (64-bit execution, 57-bit address space) without exiting 64-bit mode in the middle. This is because the way that the addressing mode is split among multiple registers prevents a single instruction from switching between the two states. This is absolutely delightful for anyone trying to boot an OS on a system with a very, very large amount of memory.
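The unlazying-behind-the-scheduler's-back behaviour described above looks roughly like the following on x86; this is a trimmed paraphrase of leave_mm() in arch/x86/mm/tlb.c, with field names as they were in kernels of that era:

    /* Trimmed paraphrase of x86's leave_mm(), called from TLB flush IPIs. */
    void leave_mm_sketch(int cpu)
    {
        struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);

        /* A CPU already running on init_mm has nothing to drop. */
        if (loaded_mm == &init_mm)
            return;

        /* Only a lazy CPU may drop its user mm here. */
        WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));

        /*
         * Switch to init_mm for real. rq->curr and active_mm in the core
         * scheduler are deliberately left untouched, which is the state
         * divergence described above.
         */
        switch_mm(NULL, &init_mm, NULL);
    }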
Excerpts from Andy Lutomirski's message of December 6, 2020 10:36 am: > On Sat, Dec 5, 2020 at 3:15 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Excerpts from Andy Lutomirski's message of December 6, 2020 2:11 am: >> > > >> If an mm was lazy tlb for a kernel thread and then it becomes unlazy, >> and if switch_mm is serialising but return to user is not, then you >> need a serialising instruction somewhere before return to user. unlazy >> is the logical place to add that, because the lazy tlb mm (i.e., >> switching to a kernel thread and back without switching mm) is what >> opens the hole. > > The issue here is that unlazying on x86 sometimes serializes and > sometimes doesn't. That's additional state that x86 keeps around though, which is fine. It can optimise that case if it knows it's already serialised. > It's straightforward to add logic to the x86 code > to serialize specifically in the case where membarrier core sync is > registered and unlazying would otherwise not serialize, but trying to > define sensible semantics for this in a call to core code seems > complicated. It's not though, it's a call from core code (to arch code). > (Specifically, the x86 code only sometimes sends IPIs to > lazy CPUs for TLB flushes. If an IPI is skipped, then unlazying will > flush the TLB, and that operation is serializing. > > The whole lazy thing is IMO a red herring for membarrier(). The > membarrier() logic requires that switching *logical* mms > (rq->curr->mm) serializes before user mode if the new mm is registered > for core sync. It's not a red herring, the reason the IPI gets skipped is because we go to a kernel thread -- that's all core code and core lazy tlb handling. That x86 might do some additional ops serialise during un-lazy in some cases doesn't make that a red herring, it just means that you can take advantage of it and avoid doing an extra serialising op. > AFAICT the only architecture on which this isn't > automatic is x86, and somehow the logic turned into "actually changing > rq->curr->mm serializes, but unlazying only sometimes serializes, so > we need to do an extra serialization step on unlazying operations" > instead of "tell x86 to make sure it always serializes when it > switches logical mms". The latter is easy to specify and easy to > implement. > >> >> How do you mean? exit_lazy_tlb is the opposite, core scheduler notifying >> arch code about when an mm becomes not-lazy, and nothing to do with >> membarrier at all even. It's a convenient hook to do your un-lazying. >> I guess you can do it also checking things in switch_mm and keeping state >> in arch code, I don't think that's necessarily the best place to put it. > > I'm confused. I just re-read your patches, and it looks like you have > arch code calling exit_lazy_tlb(). More for code-comment / consistency than anything else. They are entirely arch hooks. > On x86, if we do a TLB shootdown > IPI to a lazy CPU, the IPI handler will unlazy that CPU (by switching > to init_mm for real), and we have no way to notify the core scheduler > about this, so we don't. The result is that the core scheduler state > and the x86 state gets out of sync. If the core scheduler > subsequently switches us back to the mm that it thinks we were still > using lazily them, from the x86 code's perspective, we're not > unlazying -- we're just doing a regular switch from init_mm to some > other mm. This is why x86's switch_mm_irqs_off() totally ignores its > 'prev' argument. 
You actually do now have such a way to do that now that we've (hopefully) closed races, and I think should use it, which might make things simpler for you. See patch 6 do_shoot_lazy_tlb(). > I'm honestly a bit surprised that other architectures don't do the > same thing. I suppose that some architectures don't use shootdown > IPIs at all, in which case there doesn't seem to be any good reason to > aggressively unlazy. powerpc/radix does (in some cases) since a few years ago. It just doesn't fully exploit that for the final TLB shootdown to always clean them all up and avoid the subsequent shoot-lazies IPI, but it could be more aggressive there. The powerpc virtualised hash architecture is the traditional one and isn't conducive to this (translation management is done via hcalls, and the hypervisor maintains the TLB) so I suspect that's why it wasn't done earlier there. That will continue to rely on shoot-lazies. > (Oddly, despite the fact that, since Ivy Bridge, x86 has a "just flush > the TLB" instruction, that instruction is considerably slower than the > old "switch mm and flush" operation. So the operation "switch to > init_mm" is only ever any slower than "flush and stay lazy" if we get > lucky and unlazy to the same mm before we get a second TLB shootdown > *and* if unlazying to the same mm would not have needed to flush. I > spend quite a bit of time tuning this stuff and being quite surprised > at the bizarre performance properties of Intel's TLB management > instructions.) Well, you also casue an extra mm switch in case you returned to the same mm. Which probably isn't uncommon (app<->idle). >> >> So membarrier code is unchanged (it cares that the serialise is done at >> un-lazy time), core code is simpler (no knowledge of this membarrier >> quirk and it already knows about lazy-tlb so the calls actually improve >> the documentation), and x86 code I would argue becomes nicer (or no real >> difference at worst) because you can move some exit lazy tlb handling to >> that specific call rather than decipher it from switch_mm. > > As above, I can't move the exit-lazy handling because the scheduler > doesn't know when I'm unlazying. As above, you can actually tell it. But even if you don't do that, in the current scheme it's still telling you a superset of what you need, so you'd just put move your extra checks there. > >> >> > >> > I’m currently trying to document how membarrier actually works, and >> > hopefully this will result in untangling membarrier from mmdrop() and >> > such. >> >> That would be nice. > > It's still a work in progress. I haven't actually convinced myself > that the non-IPI case in membarrier() is correct, nor have I convinced > myself that it's incorrect. > > Anyway, I think that my patch is a bit incorrect and I either need a > barrier somewhere (which may already exist) or a store-release to > lazy_mm to make sure that all accesses to the lazy mm are done before > lazy_mm is freed. On x86, even aside from the fact that all stores > are releases, this isn't needed -- stopping using an mm is itself a > full barrier. Will this be a performance issue on power? store-release is lwsync on power. Not so bad as a full barrier, but probably not wonderful. The fast path would be worse than shoot-lazies of course, but may not be prohibitive. I'm still going to persue shoot-lazies for the merge window. As you see it's about a dozen lines and a if (IS_ENABLED(... in core code. 
Your change is common code, but a significant complexity (which affects all archs) so needs a lot more review and testing at this point. If x86 is already shooting lazies in its final TLB flush, I don't know why you're putting so much effort in though, surely it's more complexity and (even slightly) more cost there too. > >> >> > >> > A silly part of this is that x86 already has a high quality >> > implementation of most of membarrier(): flush_tlb_mm(). If you flush >> > an mm’s TLB, we carefully propagate the flush to all threads, with >> > attention to memory ordering. We can’t use this directly as an >> > arch-specific implementation of membarrier because it has the annoying >> > side affect of flushing the TLB and because upcoming hardware might be >> > able to flush without guaranteeing a core sync. (Upcoming means Zen >> > 3, but the Zen 3 implementation is sadly not usable by Linux.) >> > >> >> A hardware broadcast TLB flush, you mean? What makes it unusable by >> Linux out of curiosity? > > The new instruction is INVLPGB. Unfortunately, x86's ASID field is > very narrow, and there's no way we can give each mm the same ASID > across all CPUs, which means we can't accurately target the flush at > the correct set of TLB entries. I've asked engineers at both Intel > and AMD to widen the ASID field, but that will end up being > complicated -- x86 has run out of bits in its absurdly overloaded CR3 > encoding, and widening the ASID to any reasonable size would require > adding a new way to switch mms. There are lots of reasons that x86 > should do that anyway [0], but it would be a big project and I'm not > sure that either company is interested in big projects like that. Interesting, thanks. powerpc has a PID register for guest ASIDs that implements about 20 bits. The IPI is very flexible though, it allows more complex/fine grained flushes and also software state to be updated, so we've started using it a bit. I haven't seen much software where performance of IPIs is prohibitive these days. Maybe improvements to threaded malloc, JVMs databases etc reduce the amount of flushes. > [0] On x86, you can't switch between (64-bit execution, 48-bit virtual > address space) and (64-bit execution, 57-bit address space) without > exiting 64-bit mode in the middle. This is because the way that the > addressing mode is split among multiple registers prevents a single > instruction from switching between the two states. This is absolutely > delightful for anyone trying to boot an OS on a system with a very, > very large amount of memory. > powerpc has some issues like that with context switching guest / host state there are several MMU registers involved that can't be switched with a single instruction. It doesn't require such a big hammer, but a careful sequence to switch things. Thanks, Nick
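The store-release being discussed is about publishing "this CPU no longer references the lazy mm" so that the mm can be freed safely. A minimal sketch of the idea, using a hypothetical per-cpu pointer rather than any existing kernel field:

    /* Hypothetical per-cpu record of which mm a CPU is holding lazily. */
    static DEFINE_PER_CPU(struct mm_struct *, lazy_mm);

    static void stop_using_lazy_mm(void)
    {
        /*
         * Release ordering: every earlier access through the lazy mm is
         * ordered before the pointer is cleared. On x86 this is a plain
         * store; on powerpc it is lwsync plus a store.
         */
        smp_store_release(this_cpu_ptr(&lazy_mm), NULL);
    }

    static bool cpu_still_holds_mm(int cpu, struct mm_struct *mm)
    {
        /* Pairs with the release above. */
        return smp_load_acquire(per_cpu_ptr(&lazy_mm, cpu)) == mm;
    }

A freer that observes the pointer cleared on every CPU knows no lazy user can still be touching the mm, without each lazy CPU having to hold a reference count.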
> On Dec 5, 2020, at 7:59 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > > I'm still going to persue shoot-lazies for the merge window. As you > see it's about a dozen lines and a if (IS_ENABLED(... in core code. > Your change is common code, but a significant complexity (which > affects all archs) so needs a lot more review and testing at this > point. I don't think it's ready for this merge window. I read the early patches again, and I think they make the membarrier code worse, not better. I'm not fundamentally opposed to the shoot-lazies concept, but it needs more thought and it needs a cleaner foundation.
Excerpts from Andy Lutomirski's message of December 11, 2020 10:11 am: >> On Dec 5, 2020, at 7:59 PM, Nicholas Piggin <npiggin@gmail.com> wrote: >> > >> I'm still going to persue shoot-lazies for the merge window. As you >> see it's about a dozen lines and a if (IS_ENABLED(... in core code. >> Your change is common code, but a significant complexity (which >> affects all archs) so needs a lot more review and testing at this >> point. > > I don't think it's ready for this merge window. Yes next one I meant (aka this one for development perspective :)). > I read the early > patches again, and I think they make the membarrier code worse, not > better. Mathieu and I disagree, so we are at an impasse. I addressed your comment about not being able to do the additional core sync avoidance from the exit tlb call (you can indeed do so in your arch code) and about exit_lazy_tlb being a call into the scheduler (it's not) and about the arch code not being able to reconcile lazy tlb mm with the core scheduler code (you can). I fundamentally think the core sync is an issue with what the membarrier / arch specifics are doing with lazy tlb mm switching, and not something the core scheduler needs to know about at all. I don't see the big problem with essentially moving it from an explicit call to exit_lazy_tlb (which from scheduler POV describes better what it is doing, not how). > I'm not fundamentally opposed to the shoot-lazies concept, > but it needs more thought and it needs a cleaner foundation. Well shoot lazies actually doesn't really rely on that membarrier change at all, it just came as a nice looking cleanup so that part can be dropped from the series. It's not really foundational. Thanks, Nick
Excerpts from Nicholas Piggin's message of December 14, 2020 2:07 pm: > Excerpts from Andy Lutomirski's message of December 11, 2020 10:11 am: >>> On Dec 5, 2020, at 7:59 PM, Nicholas Piggin <npiggin@gmail.com> wrote: >>> >> >>> I'm still going to persue shoot-lazies for the merge window. As you >>> see it's about a dozen lines and a if (IS_ENABLED(... in core code. >>> Your change is common code, but a significant complexity (which >>> affects all archs) so needs a lot more review and testing at this >>> point. >> >> I don't think it's ready for this merge window. > > Yes next one I meant (aka this one for development perspective :)). > >> I read the early >> patches again, and I think they make the membarrier code worse, not >> better. > > Mathieu and I disagree, so we are at an impasse. Well actually not really, I went and cut out the exit_lazy_tlb stuff from the patch series, those are better to be untangled anyway. I think an earlier version had something in exit_lazy_tlb for the mm refcounting change but it's not required now anyway. I'll split them out and just work on the shoot lazies series for now, I might revisit exit_lazy_tlb after the dust settles from that and the current membarrier changes. I'll test and repost shortly. Thanks, Nick
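For readers trying to picture the shoot-lazies mechanism the thread keeps referring to, a rough reconstruction follows; the names echo the do_shoot_lazy_tlb() mentioned above, but this is an illustration, not the series' actual patch:

    /* IPI handler: runs on each CPU that may hold the dying mm lazily. */
    static void do_shoot_lazy_tlb_sketch(void *arg)
    {
        struct mm_struct *mm = arg;

        if (current->active_mm == mm) {
            WARN_ON_ONCE(current->mm);      /* must be a lazy kernel thread */
            current->active_mm = &init_mm;
            switch_mm(mm, &init_mm, current);
        }
    }

    /* Called once the last real user of @mm has gone away. */
    static void shoot_lazy_tlbs_sketch(struct mm_struct *mm)
    {
        /*
         * Which CPUs need checking is arch-specific; mm_cpumask() is used
         * here only as a plausible over-approximation.
         */
        on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb_sketch, mm, 1);
    }

After this returns, no CPU can still be using the mm as a lazy active_mm, so it can be freed without lazy users pinning it with mmgrab()/mmdrop().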
diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt index 47e6903f47a5..0763a63a7097 100644 --- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt +++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt @@ -5,6 +5,10 @@ # # Architecture requirements # +# If your architecture returns to user-space through non-core-serializing +# instructions, you need to ensure these are done in switch_mm and exit_lazy_tlb +# (if lazy tlb switching is implemented). +# # * arm/arm64/powerpc # # Rely on implicit context synchronization as a result of exception return @@ -24,7 +28,7 @@ # instead on write_cr3() performed by switch_mm() to provide core serialization # after changing the current mm, and deal with the special case of kthread -> # uthread (temporarily keeping current mm into active_mm) by issuing a -# sync_core_before_usermode() in that specific case. +# serializing instruction in exit_lazy_mm() in that specific case. # ----------------------- | arch |status| diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 36afcbea6a9f..8094893254f1 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -6,12 +6,14 @@ #include <linux/atomic.h> #include <linux/mm_types.h> #include <linux/pkeys.h> +#include <linux/sched/mm.h> #include <trace/events/tlb.h> #include <asm/tlbflush.h> #include <asm/paravirt.h> #include <asm/debugreg.h> +#include <asm/sync_core.h> extern atomic64_t last_mm_ctx_id; @@ -94,6 +96,31 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next) #define enter_lazy_tlb enter_lazy_tlb extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk); +#ifdef CONFIG_MEMBARRIER +/* + * Ensure that a core serializing instruction is issued before returning + * to user-mode, if a SYNC_CORE was requested. x86 implements return to + * user-space through sysexit, sysrel, and sysretq, which are not core + * serializing. + * + * See the membarrier comment in finish_task_switch as to why this is done + * in exit_lazy_tlb. + */ +#define exit_lazy_tlb exit_lazy_tlb +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) +{ + /* Switching mm is serializing with write_cr3 */ + if (tsk->mm != mm) + return; + + if (likely(!(atomic_read(&mm->membarrier_state) & + MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))) + return; + + sync_core_before_usermode(); +} +#endif + /* * Init a new mm. Used on mm copies, like at fork() * and on mm's that are brand-new, like at execve(). 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index d5ece7a9a403..2c6bcdf76d99 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -7,7 +7,6 @@ #include <linux/sched.h> #include <linux/mm_types.h> #include <linux/gfp.h> -#include <linux/sync_core.h> /* * Routines for handling mm_structs @@ -335,16 +334,6 @@ enum { #include <asm/membarrier.h> #endif -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) -{ - if (current->mm != mm) - return; - if (likely(!(atomic_read(&mm->membarrier_state) & - MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))) - return; - sync_core_before_usermode(); -} - extern void membarrier_exec_mmap(struct mm_struct *mm); #else @@ -358,9 +347,6 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev, static inline void membarrier_exec_mmap(struct mm_struct *mm) { } -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) -{ -} #endif #endif /* _LINUX_SCHED_MM_H */ diff --git a/kernel/cpu.c b/kernel/cpu.c index 6ff2578ecf17..134688d79589 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -572,7 +572,9 @@ static int finish_cpu(unsigned int cpu) /* * idle_task_exit() will have switched to &init_mm, now - * clean up any remaining active_mm state. + * clean up any remaining active_mm state. exit_lazy_tlb + * is not done, if an arch did any accounting in these + * functions it would have to be added. */ if (mm != &init_mm) idle->active_mm = &init_mm; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dcc46039ade5..e4e8cebd82e2 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3620,22 +3620,19 @@ static struct rq *finish_task_switch(struct task_struct *prev) kcov_finish_switch(current); fire_sched_in_preempt_notifiers(current); + /* * When switching through a kernel thread, the loop in * membarrier_{private,global}_expedited() may have observed that * kernel thread and not issued an IPI. It is therefore possible to * schedule between user->kernel->user threads without passing though - * switch_mm(). Membarrier requires a barrier after storing to - * rq->curr, before returning to userspace, so provide them here: - * - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly - * provided by mmdrop(), - * - a sync_core for SYNC_CORE. + * switch_mm(). Membarrier requires a full barrier after storing to + * rq->curr, before returning to userspace, for + * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by mmdrop(). */ - if (mm) { - membarrier_mm_sync_core_before_usermode(mm); + if (mm) mmdrop(mm); - } + if (unlikely(prev_state == TASK_DEAD)) { if (prev->sched_class->task_dead) prev->sched_class->task_dead(prev); @@ -6689,6 +6686,7 @@ void idle_task_exit(void) BUG_ON(current != this_rq()->idle); if (mm != &init_mm) { + /* enter_lazy_tlb is not done because we're about to go down */ switch_mm(mm, &init_mm, current); finish_arch_post_lock_switch(); }
And get rid of the generic sync_core_before_usermode facility. This is functionally a no-op in the core scheduler code, but it also catches This helper is the wrong way around I think. The idea that membarrier state requires a core sync before returning to user is the easy one that does not need hiding behind membarrier calls. The gap in core synchronization due to x86's sysret/sysexit and lazy tlb mode, is the tricky detail that is better put in x86 lazy tlb code. Consider if an arch did not synchronize core in switch_mm either, then membarrier_mm_sync_core_before_usermode would be in the wrong place but arch specific mmu context functions would still be the right place. There is also a exit_lazy_tlb case that is not covered by this call, which could be a bugs (kthread use mm the membarrier process's mm then context switch back to the process without switching mm or lazy mm switch). This makes lazy tlb code a bit more modular. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- .../membarrier-sync-core/arch-support.txt | 6 ++++- arch/x86/include/asm/mmu_context.h | 27 +++++++++++++++++++ include/linux/sched/mm.h | 14 ---------- kernel/cpu.c | 4 ++- kernel/sched/core.c | 16 +++++------ 5 files changed, 42 insertions(+), 25 deletions(-)