
[16/23] sched: Use lightweight hazard pointers to grab lazy mms

Message ID 7c9c388c388df8e88bb5d14828053ac0cb11cf69.1641659630.git.luto@kernel.org (mailing list archive)
State New
Series mm, sched: Rework lazy mm handling

Commit Message

Andy Lutomirski Jan. 8, 2022, 4:44 p.m. UTC
Currently, switching between a real user mm and a kernel context (including
idle) performs an atomic operation on a per-mm counter via mmgrab() and
mmdrop().  For a single-threaded program, this isn't a big problem: a pair
of atomic operations when entering and returning from idle isn't free, but
it's not very expensive in the grand scheme of things.  For a heavily
multithreaded program on a large system, however, the overhead can be very
large -- all CPUs can end up hammering the same cacheline with atomic
operations, and scalability suffers.

The purpose of mmgrab() and mmdrop() is to make "lazy tlb" mode safe.  When
Linux switches from user to kernel mm context, instead of immediately
reprogramming the MMU to use init_mm, the kernel continues to use the most
recent set of user page tables.  This is safe as long as those page tables
aren't freed.

RCU can't be used to keep the pagetables alive, since RCU read locks can't
be held when idle.

To improve scalability, this patch adds a percpu hazard pointer scheme to
keep lazily-used mms alive.  Each CPU has a single pointer to an mm that
must not be freed, and __mmput() checks the pointers belonging to all CPUs
that might be lazily using the mm in question.

By default, this means walking all online CPUs, but arch code can override
the set of CPUs to check if they can do something more clever.  For
architectures that have accurate mm_cpumask(), mm_cpumask() is a reasonable
choice.  For architectures that can guarantee that *no* remote CPUs are
lazily using an mm after the user portion of the pagetables is torn down
(any architecture that uses IPI shootdowns in exit_mmap() and unlazies the
MMU in the IPI handler, e.g. x86 on bare metal), the set of CPUs to check
could be empty.
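
A rough sketch of the scheme (the helper names below are illustrative;
they are not necessarily the symbols the patch uses):

  /* One hazard pointer per CPU: the mm this CPU may be lazily using. */
  static DEFINE_PER_CPU(struct mm_struct *, lazy_mm_hazard);

  /* Scheduler side: publish the mm we are about to borrow lazily. */
  static void start_lazy_mm(struct mm_struct *mm)
  {
          /* Pairs with the smp_load_acquire() in the __mmput() scan. */
          smp_store_release(this_cpu_ptr(&lazy_mm_hazard), mm);
  }

  /* __mmput() side: make sure no CPU still holds a hazard pointer to @mm. */
  static void unlazy_mm_everywhere(struct mm_struct *mm)
  {
          int cpu;

          for_each_possible_lazymm_cpu(cpu, mm) {
                  if (smp_load_acquire(per_cpu_ptr(&lazy_mm_hazard, cpu)) != mm)
                          continue;
                  /*
                   * That CPU may still be lazily using @mm: kick it over to
                   * init_mm.  (The real patch also has to close the race
                   * with a concurrent context switch, e.g. via cmpxchg;
                   * elided here.)
                   */
          }
  }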

XXX: I *think* this is correct when hot-unplugging a CPU, but this needs
double-checking and maybe even a WARN to make sure the ordering is correct.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/sched/mm.h |   3 +
 kernel/fork.c            |  11 ++
 kernel/sched/core.c      | 230 +++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h     |  10 +-
 4 files changed, 221 insertions(+), 33 deletions(-)

Comments

Linus Torvalds Jan. 8, 2022, 7:22 p.m. UTC | #1
On Sat, Jan 8, 2022 at 8:44 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> To improve scalability, this patch adds a percpu hazard pointer scheme to
> keep lazily-used mms alive.  Each CPU has a single pointer to an mm that
> must not be freed, and __mmput() checks the pointers belonging to all CPUs
> that might be lazily using the mm in question.

Ugh. This feels horribly fragile to me, and also looks like it makes
some common cases potentially quite expensive for machines with large
CPU counts if they don't do that mm_cpumask optimization - which in
turn feels quite fragile as well.

IOW, this just feels *complicated*.

And I think it's overly so. I get the strong feeling that we could
make the rules much simpler and more straightforward.

For example, how about we make the rules be

 - a lazy TLB mm reference requires that there's an actual active user
of that mm (ie "mm_users > 0")

 - the last mm_users decrement (ie __mmput) forces a TLB flush, and
that TLB flush must make sure that no lazy users exist (which I think
it does already anyway).

Doesn't that seem like a really simple set of rules?

And the nice thing about it is that we *already* do that required TLB
flush in all normal circumstances. __mmput() already calls
exit_mmap(), and exit_mm() already forces that TLB flush in every
normal situation.
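
For reference, that existing path looks roughly like this (heavily
simplified from kernel/fork.c; not the literal code):

  void mmput(struct mm_struct *mm)
  {
          if (atomic_dec_and_test(&mm->mm_users))
                  __mmput(mm);            /* last real user is gone */
  }

  static void __mmput(struct mm_struct *mm)
  {
          /*
           * exit_mmap() unmaps the VMAs and frees the page tables, and
           * the TLB flush it does on the way out is the point where any
           * remaining lazy users would have to be kicked off this mm.
           */
          exit_mmap(mm);
          /* ... other teardown ... */
          mmdrop(mm);
  }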

So we might have to make sure that every architecture really does that
"drop lazy mms on TLB flush", and maybe add a flag to the existing
'struct mmu_gather tlb' to make sure that flush actually always
happens (even if the process somehow managed to unmap all vma's even
before exiting).

Is there something silly I'm missing? Somebody pat me on the head, and
say "There, there, Linus, don't try to get involved with things you
don't understand.." and explain to me in small words.

                  Linus
Andy Lutomirski Jan. 8, 2022, 10:04 p.m. UTC | #2
On Sat, Jan 8, 2022, at 12:22 PM, Linus Torvalds wrote:
> On Sat, Jan 8, 2022 at 8:44 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> To improve scalability, this patch adds a percpu hazard pointer scheme to
>> keep lazily-used mms alive.  Each CPU has a single pointer to an mm that
>> must not be freed, and __mmput() checks the pointers belonging to all CPUs
>> that might be lazily using the mm in question.
>
> Ugh. This feels horribly fragile to me, and also looks like it makes
> some common cases potentially quite expensive for machines with large
> CPU counts if they don't do that mm_cpumask optimization - which in
> turn feels quite fragile as well.
>
> IOW, this just feels *complicated*.
>
> And I think it's overly so. I get the strong feeling that we could
> make the rules much simpler and more straightforward.
>
> For example, how about we make the rules be

There there, Linus, not everything is as simple^Wincapable as x86 bare metal, and mm_cpumask does not have useful cross-arch semantics.  Is that good?

>
>  - a lazy TLB mm reference requires that there's an actual active user
> of that mm (ie "mm_users > 0")
>
>  - the last mm_users decrement (ie __mmput) forces a TLB flush, and
> that TLB flush must make sure that no lazy users exist (which I think
> it does already anyway).

It does, on x86 bare metal, in exit_mmap().  It’s implicit, but it could be made explicit, as below.

>
> Doesn't that seem like a really simple set of rules?
>
> And the nice thing about it is that we *already* do that required TLB
> flush in all normal circumstances. __mmput() already calls
> exit_mmap(), and exit_mm() already forces that TLB flush in every
> normal situation.

Exactly. On x86 bare metal and similar architectures, this flush is done by IPI, which involves a loop over all CPUs that might be using the mm.  And other patches in this series add the core ability to shoot down the lazy TLB cleanly, so the core drops its reference, and wire it up for x86.

>
> So we might have to make sure that every architecture really does that
> "drop lazy mms on TLB flush", and maybe add a flag to the existing
> 'struct mmu_gather tlb' to make sure that flush actually always
> happens (even if the process somehow managed to unmap all vma's even
> before exiting).

So this requires that all architectures actually walk all relevant CPUs to see if an IPI is needed and send that IPI.  On architectures that actually need an IPI anyway (x86 bare metal, powerpc (I think), and others), fine. But on architectures with a broadcast-to-all-CPUs flush (ARM64 IIUC), the extra IPI will be much much slower than a simple load-acquire in a loop.

In fact, arm64 doesn’t even track mm_cpumask at all last time I checked, so even an IPI lazy shoot down would require looping *all* CPUs, doing a load-acquire, and possibly doing an IPI. I much prefer doing a load-acquire and possibly a cmpxchg.

(And x86 PV can do hypercall flushes.  If a bunch of vCPUs are not running, an IPI shootdown will end up sleeping until they run, whereas this patch will allow the hypervisor to leave them asleep and thus to finish __mmput without waking them. This only matters on a CPU-oversubscribed host, but still.  And it kind of looks like hardware remote flushes are coming in AMD land eventually.)

But yes, I fully agree that this patch is complicated and subtle.

>
> Is there something silly I'm missing? Somebody pat me on the head, and
> say "There, there, Linus, don't try to get involved with things you
> don't understand.." and explain to me in small words.
>
>                   Linus
Linus Torvalds Jan. 9, 2022, 12:53 a.m. UTC | #4
[ Let's try this again, without the html crud this time. Apologies to
the people who see this reply twice ]

On Sat, Jan 8, 2022 at 2:04 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> So this requires that all architectures actually walk all relevant
> CPUs to see if an IPI is needed and send that IPI. On architectures
> that actually need an IPI anyway (x86 bare metal, powerpc (I think)
> and others, fine. But on architectures with a broadcast-to-all-CPUs
> flush (ARM64 IIUC), then the extra IPI will be much much slower than a
> simple load-acquire in a loop.

... hmm. How about a hybrid scheme?

 (a) architectures that already require that IPI anyway for TLB
invalidation (ie x86, but others too), just make the rule be that the
TLB flush by exit_mmap() gets rid of any lazy TLB mm references. Which
they already do.

 (b) architectures like arm64 that do hw-assisted TLB shootdown will
have an ASID allocator model, and what you do is to use that to either
    (b') increment/decrement the mm_count at mm ASID allocation/freeing time
    (b'') use the existing ASID tracking data to find the CPU's that
have that ASID

 (c) can you really imagine hardware TLB shootdown without ASID
allocation? That doesn't seem to make sense. But if it exists, maybe
that kind of crazy case would do the percpu array walking.

(Honesty in advertising: I don't know the arm64 ASID code - I used to
know the old alpha version I wrote in a previous lifetime - but afaik
any ASID allocator has to be able to track CPU's that have a
particular ASID in use and be able to invalidate it).

Hmm. The x86 maintainers are on this thread, but they aren't even the
problem. Adding Catalin and Will to this, I think they should know
if/how this would fit with the arm64 ASID allocator.

Will/Catalin, background here:

   https://lore.kernel.org/all/CAHk-=wj4LZaFB5HjZmzf7xLFSCcQri-WWqOEJHwQg0QmPRSdQA@mail.gmail.com/

for my objection to that special "keep non-refcounted magic per-cpu
pointer to lazy tlb mm".

           Linus
Andy Lutomirski Jan. 9, 2022, 3:58 a.m. UTC | #5
On Sat, Jan 8, 2022, at 4:53 PM, Linus Torvalds wrote:
> [ Let's try this again, without the html crud this time. Apologies to
> the people who see this reply twice ]
>
> On Sat, Jan 8, 2022 at 2:04 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> So this requires that all architectures actually walk all relevant
>> CPUs to see if an IPI is needed and send that IPI. On architectures
>> that actually need an IPI anyway (x86 bare metal, powerpc (I think)
>> and others, fine. But on architectures with a broadcast-to-all-CPUs
>> flush (ARM64 IIUC), then the extra IPI will be much much slower than a
>> simple load-acquire in a loop.
>
> ... hmm. How about a hybrid scheme?
>
>  (a) architectures that already require that IPI anyway for TLB
> invalidation (ie x86, but others too), just make the rule be that the
> TLB flush by exit_mmap() get rid of any lazy TLB mm references. Which
> they already do.
>
>  (b) architectures like arm64 that do hw-assisted TLB shootdown will
> have an ASID allocator model, and what you do is to use that to either
>     (b') increment/decrement the mm_count at mm ASID allocation/freeing time
>     (b'') use the existing ASID tracking data to find the CPU's that
> have that ASID
>
>  (c) can you really imagine hardware TLB shootdown without ASID
> allocation? That doesn't seem to make sense. But if it exists, maybe
> that kind of crazy case would do the percpu array walking.
>

So I can go over a handful of TLB flush schemes:

1. x86 bare metal.  As noted, just plain shootdown would work.  (Unless we switch to inexact mm_cpumask() tracking, which might be enough of a win that it's worth it.)  Right now, "ASID" (i.e. PCID, thanks Intel) is allocated per cpu.  They are never explicitly freed -- they just expire off a percpu LRU.  The data structures have no idea whether an mm still exists -- instead they track mm->context.ctx_id, which is 64 bits and never reused.

2. x86 paravirt.  This is just like bare metal except there's a hypercall to flush a specific target cpu.  (I think this is mutually exclusive with PCID, but I'm not sure.  I haven't looked that hard.  I'm not sure exactly what is implemented right now.  It could be an operation to flush (cpu, pcid), but that gets awkward for reasons that aren't too relevant to this discussion.)  In this model, the exit_mmap() shootdown would either need to switch to a non-paravirt flush or we need a fancy mm_count solution of some sort.

3. Hypothetical better x86.  AMD has INVLPGB, which is almost useless right now.  But it's *so* close to being very useful, and I've asked engineers at AMD and Intel to improve this.  Specifically, I want PCID to be widened to 64 bits.  (This would, as I understand it, not affect the TLB hardware at all.  It would affect the tiny table that sits in front of the real PCID and maintains the illusion that PCID is 12 bits, and it would affect the MOV CR3 instruction.  The latter makes it complicated.)  And INVLPGB would invalidate a given 64-bit PCID system-wide.  In this model, there would be no such thing as freeing an ASID.  So I think we would want something very much like this patch.

4. ARM64.  I only barely understand it, but I think it's an intermediate scheme with ASIDs that are wide enough to be useful but narrow enough to run out on occasion.  I don't think they're tracked -- I think the whole world just gets invalidated when they overflow.  I could be wrong.

In any event, ASID lifetimes aren't a magic solution -- how do we know when to expire an ASID?  Presumably it would be when an mm is fully freed (__mmdrop), which gets us right back to square one.

In any case, what I particularly like about my patch is that, while it's subtle, it's subtle just once.  I think it can handle all the interesting arch cases by merely redefining for_each_possible_lazymm_cpu() to do the right thing.
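
As a sketch, the per-arch hook could end up looking something like this
(the CONFIG_* symbols are made up for illustration; the generic fallback
corresponds to the "walk all online CPUs" default in the commit message):

  /*
   * Each architecture picks one definition; __mmput() uses it to bound
   * the hazard-pointer scan.
   */
  #if defined(CONFIG_ARCH_NO_REMOTE_LAZY_MM)
  /* e.g. bare-metal x86: exit_mmap()'s IPI already unlazied everyone. */
  #define for_each_possible_lazymm_cpu(cpu, mm) \
          for ((cpu) = nr_cpu_ids; (cpu) < nr_cpu_ids; )
  #elif defined(CONFIG_ARCH_HAS_ACCURATE_MM_CPUMASK)
  /* Only CPUs that ever ran @mm can be lazily holding it. */
  #define for_each_possible_lazymm_cpu(cpu, mm) \
          for_each_cpu((cpu), mm_cpumask(mm))
  #else
  /* Generic fallback: check every online CPU. */
  #define for_each_possible_lazymm_cpu(cpu, mm) \
          for_each_online_cpu(cpu)
  #endif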

> (Honesty in advertising: I don't know the arm64 ASID code - I used to
> know the old alpha version I wrote in a previous lifetime - but afaik
> any ASID allocator has to be able to track CPU's that have a
> particular ASID in use and be able to invalidate it).
>
> Hmm. The x86 maintainers are on this thread, but they aren't even the
> problem. Adding Catalin and Will to this, I think they should know
> if/how this would fit with the arm64 ASID allocator.
>

Well, I am an x86 mm maintainer, and there is definitely a performance problem on large x86 systems right now. :)

> Will/Catalin, background here:
>
>    
> https://lore.kernel.org/all/CAHk-=wj4LZaFB5HjZmzf7xLFSCcQri-WWqOEJHwQg0QmPRSdQA@mail.gmail.com/
>
> for my objection to that special "keep non-refcounted magic per-cpu
> pointer to lazy tlb mm".
>
>            Linus

--Andy
Linus Torvalds Jan. 9, 2022, 4:38 a.m. UTC | #6
On Sat, Jan 8, 2022 at 7:59 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> > Hmm. The x86 maintainers are on this thread, but they aren't even the
> > problem. Adding Catalin and Will to this, I think they should know
> > if/how this would fit with the arm64 ASID allocator.
> >
>
> Well, I am an x86 mm maintainer, and there is definitely a performance problem on large x86 systems right now. :)

Well, my point was that on x86, the complexities of the patch you
posted are completely pointless.

So on x86, you can just remove the mmgrab/mmdrop reference counts from
the lazy mm use entirely, and voila, that performance problem is gone.
We don't _need_ reference counting on x86 at all, if we just say that
the rule is that a lazy mm is always associated with a
honest-to-goodness live mm.

So on x86 - and any platform with the IPI model - there is no need for
hundreds of lines of complexity at all.

THAT is my point. Your patch adds complexity that buys you ABSOLUTELY NOTHING.

You then saying that the mmgrab/mmdrop is a performance problem is
just trying to muddy the water. You can just remove it entirely.

Now, I do agree that that depends on the whole "TLB IPI will get rid
of any lazy mm users on other cpus". So I agree that if you have
hardware TLB invalidation that then doesn't have that software
component to it, you need something else.

But my argument _then_ was that hardware TLB invalidation then needs
the hardware ASID thing to be useful, and the ASID management code
already effectively keeps track of "this ASID is used on other CPU's".
And that's exactly the same kind of information that your patch
basically added a separate percpu array for.

So I think that even for that hardware TLB shootdown case, your patch
only adds overhead.

And it potentially adds a *LOT* of overhead, if you replace an atomic
refcount with a "for_each_possible_cpu()" loop that has to do cmpxchg
things too.

Now, on x86 we maintain that mm_cpumask, and as a result that
overhead is much lower - but we maintain that mm_cpumask exactly
*because* we do that IPI thing, so I don't think you can use that
argument in favor of your patch. When we do the IPI thing, your patch
is worthless overhead.

See?

Btw, you don't even need to really solve the arm64 TLB invalidate
thing - we could make the rule be that we only do the mmgrab/mmput at
all on platforms that don't do that IPI flush.

I think that's basically exactly what Nick Piggin wanted to do on powerpc, no?

But you hated that patch, for non-obvious reasons, and are now
introducing this new patch that is clearly non-optimal on x86.

So I think there's some intellectual dishonesty on your part here.

                Linus
Nadav Amit Jan. 9, 2022, 5:56 a.m. UTC | #7
> On Jan 8, 2022, at 8:44 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> The purpose of mmgrab() and mmdrop() is to make "lazy tlb" mode safe.

Just wondering: In a world of ASID/PCID - does the “lazy TLB” really
have a worthy advantage?

Considering the fact that with PTI anyhow address spaces are switched
all the time, can’t we just get rid of it?
Linus Torvalds Jan. 9, 2022, 6:48 a.m. UTC | #8
On Sat, Jan 8, 2022 at 9:56 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Just wondering: In a world of ASID/PCID - does the “lazy TLB” really
> have a worthy advantage?
>
> Considering the fact that with PTI anyhow address spaces are switched
> all the time, can’t we just get rid of it?

Hmm.. That may indeed be the right thing to do.

I think arm64 already hardcodes ASID 0 to init_mm, and that kernel
threads (and the idle threads in particular) might as well just use
that. In that kind of situation, there's likely little advantage to
reusing a user address space ID, and quite possibly any potential
advantage is overshadowed by the costs.

The lazy tlb thing goes back a *looong* way, and lots of things have
changed since. Maybe it's not worth it any more.

Or maybe it's only worth it on platforms where it's free (UP, possibly
other situations - like if you have IPI and it's "free").

If I read the history correctly, it looks like PF_LAZY_TLB was
introduced in 2.3.11-pre4 or something. Back in summer of 1999. The
"active_mm" vs "mm" model came later.

                  Linus
Nadav Amit Jan. 9, 2022, 8:49 a.m. UTC | #9
> On Jan 8, 2022, at 10:48 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Sat, Jan 8, 2022 at 9:56 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> Just wondering: In a world of ASID/PCID - does the “lazy TLB” really
>> have a worthy advantage?
>> 
>> Considering the fact that with PTI anyhow address spaces are switched
>> all the time, can’t we just get rid of it?
> 

[snip]

> 
> Or maybe it's only worth it on platforms where it's free (UP, possibly
> other situations - like if you have IPI and it's "free").

On UP it might be free, but on x86+IPIs there is a tradeoff.

When you decide which CPUs you want to send the IPI to, in the
common flow (no tables freed) you check whether they use
“lazy TLB” or not in order to filter out the lazy ones. In the
past this was on a cacheline with other frequently-dirtied data so
many times the cacheline bounced from cache to cache. Worse, the
test used an indirect branch so was expensive with Spectre v2
mitigations. I fixed it some time ago, so things are better and
today the cacheline of is_lazy should bounce less between caches,
but there is a tradeoff in maintaining and checking both cpumask
and then is_lazy for each CPU in cpumask.

It is possible for instance to get rid of is_lazy, keep the CPU
on mm_cpumask when switching to kernel thread, and then if/when
an IPI is received remove it from cpumask to avoid further
unnecessary TLB shootdown IPIs.

I do not know whether it is a pure win, but there is a tradeoff.
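
The filtering I'm referring to is roughly this (simplified from the
current arch/x86/mm/tlb.c):

  static bool tlb_is_not_lazy(int cpu, void *data)
  {
          return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
  }

  static void native_flush_tlb_multi(const struct cpumask *cpumask,
                                     const struct flush_tlb_info *info)
  {
          /*
           * Lazy CPUs can skip the IPI (they flush or unlazy when they
           * switch back to a real mm), unless page tables are being
           * freed, in which case everyone must be interrupted.
           */
          if (info->freed_tables)
                  on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
          else
                  on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
                                        (void *)info, true, cpumask);
  }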
Linus Torvalds Jan. 9, 2022, 7:10 p.m. UTC | #10
On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> I do not know whether it is a pure win, but there is a tradeoff.

Hmm. I guess only some serious testing would tell.

On x86, I'd be a bit worried about removing lazy TLB simply because
even with ASID support there (called PCIDs by Intel for NIH reasons),
the actual ASID space on x86 was at least originally very very
limited.

Architecturally, x86 may expose 12 bits of ASID space, but iirc at
least the first few implementations actually only internally had one
or two bits, and hashed the 12 bits down to that internal very limited
hardware TLB ID space.

We only use a handful of ASIDs per CPU on x86 partly for this reason
(but also since there's no remote hardware TLB shootdown, there's no
reason to have a bigger global ASID space, so ASIDs aren't _that_
common).

And I don't know how many non-PCID x86 systems (perhaps virtualized?)
there might be out there.

But it would be very interesting to test some "disable lazy tlb"
patch. The main problem workloads tend to be IO, and I'm not sure how
many of the automated performance tests would catch issues. I guess
some threaded pipe ping-pong test (with each thread pinned to
different cores) would show it.

And I guess there is some load that triggered the original powerpc
patch by Nick&co, and that Andy has been using..

Anybody willing to cook up a patch and run some benchmarks? Perhaps
one that basically just replaces "set ->mm to NULL" with "set ->mm to
&init_mm" - so that the lazy TLB code is still *there*, but it never
triggers..

I think it's mainly 'copy_thread()' in kernel/fork.c and the 'init_mm'
initializer in mm/init-mm.c, but there's probably other things too
that have that knowledge of the special "tsk->mm = NULL" situation.

                  Linus
Rik van Riel Jan. 9, 2022, 7:22 p.m. UTC | #11
On Sun, 2022-01-09 at 00:49 -0800, Nadav Amit wrote:
> 
> It is possible for instance to get rid of is_lazy, keep the CPU
> on mm_cpumask when switching to kernel thread, and then if/when
> an IPI is received remove it from cpumask to avoid further
> unnecessary TLB shootdown IPIs.
> 
> I do not know whether it is a pure win, but there is a tradeoff.

That's not a win at all. It is what we used to have before
the lazy mm stuff was re-introduced, and it led to quite a
few unnecessary IPIs, which are especially slow on virtual
machines, where the target CPU may not be running.
Nadav Amit Jan. 9, 2022, 7:34 p.m. UTC | #12
> On Jan 9, 2022, at 11:22 AM, Rik van Riel <riel@surriel.com> wrote:
> 
> On Sun, 2022-01-09 at 00:49 -0800, Nadav Amit wrote:
>> 
>> It is possible for instance to get rid of is_lazy, keep the CPU
>> on mm_cpumask when switching to kernel thread, and then if/when
>> an IPI is received remove it from cpumask to avoid further
>> unnecessary TLB shootdown IPIs.
>> 
>> I do not know whether it is a pure win, but there is a tradeoff.
> 
> That's not a win at all. It is what we used to have before
> the lazy mm stuff was re-introduced, and it led to quite a
> few unnecessary IPIs, which are especially slow on virtual
> machines, where the target CPU may not be running.

You make a good point about VMs.

IIUC Lazy-TLB serves several goals:

1. Avoid arch address-space switch to save switching time and
   TLB misses.
2. Prevent unnecessary IPIs while kernel threads run.
3. Avoid cache-contention on mm_cpumask when switching to a kernel
   thread.

Your concern is with (2), and this one is easy to keep regardless
of the rest.

I am not sure that (3) is actually helpful, since it might lead
to more cache activity than without lazy-TLB, but that is somewhat
orthogonal to everything else.

As for (1), which is the most fragile aspect, unless you use
shadow page-tables, I am not sure there is a significant benefit.
Rik van Riel Jan. 9, 2022, 7:37 p.m. UTC | #13
On Sun, 2022-01-09 at 11:34 -0800, Nadav Amit wrote:
> 
> 
> > On Jan 9, 2022, at 11:22 AM, Rik van Riel <riel@surriel.com> wrote:
> > 
> > On Sun, 2022-01-09 at 00:49 -0800, Nadav Amit wrote:
> > > 
> > > It is possible for instance to get rid of is_lazy, keep the CPU
> > > on mm_cpumask when switching to kernel thread, and then if/when
> > > an IPI is received remove it from cpumask to avoid further
> > > unnecessary TLB shootdown IPIs.
> > > 
> > > I do not know whether it is a pure win, but there is a tradeoff.
> > 
> > That's not a win at all. It is what we used to have before
> > the lazy mm stuff was re-introduced, and it led to quite a
> > few unnecessary IPIs, which are especially slow on virtual
> > machines, where the target CPU may not be running.
> 
> You make a good point about VMs.
> 
> IIUC Lazy-TLB serves several goals:
> 
> 1. Avoid arch address-space switch to save switching time and
>    TLB misses.
> 2. Prevent unnecessary IPIs while kernel threads run.
> 3. Avoid cache-contention on mm_cpumask when switching to a kernel
>    thread.
> 
> Your concern is with (2), and this one is easy to keep regardless
> of the rest.
> 
> I am not sure that (3) is actually helpful, since it might lead
> to more cache activity than without lazy-TLB, but that is somewhat
> orthogonal to everything else.

I have seen problems with (3) in practice, too.

For many workloads, context switching is much, much more
of a hot path than TLB shootdowns, which are relatively
infrequent by comparison.

Basically ASID took away only the first concern from your
list above, not the other two.
Nadav Amit Jan. 9, 2022, 7:51 p.m. UTC | #14
> On Jan 9, 2022, at 11:37 AM, Rik van Riel <riel@surriel.com> wrote:
> 
> On Sun, 2022-01-09 at 11:34 -0800, Nadav Amit wrote:
>> 
>> 
>>> On Jan 9, 2022, at 11:22 AM, Rik van Riel <riel@surriel.com> wrote:
>>> 
>>> On Sun, 2022-01-09 at 00:49 -0800, Nadav Amit wrote:
>>>> 
>>>> It is possible for instance to get rid of is_lazy, keep the CPU
>>>> on mm_cpumask when switching to kernel thread, and then if/when
>>>> an IPI is received remove it from cpumask to avoid further
>>>> unnecessary TLB shootdown IPIs.
>>>> 
>>>> I do not know whether it is a pure win, but there is a tradeoff.
>>> 
>>> That's not a win at all. It is what we used to have before
>>> the lazy mm stuff was re-introduced, and it led to quite a
>>> few unnecessary IPIs, which are especially slow on virtual
>>> machines, where the target CPU may not be running.
>> 
>> You make a good point about VMs.
>> 
>> IIUC Lazy-TLB serves several goals:
>> 
>> 1. Avoid arch address-space switch to save switching time and
>>    TLB misses.
>> 2. Prevent unnecessary IPIs while kernel threads run.
>> 3. Avoid cache-contention on mm_cpumask when switching to a kernel
>>    thread.
>> 
>> Your concern is with (2), and this one is easy to keep regardless
>> of the rest.
>> 
>> I am not sure that (3) is actually helpful, since it might lead
>> to more cache activity than without lazy-TLB, but that is somewhat
>> orthogonal to everything else.
> 
> I have seen problems with (3) in practice, too.
> 
> For many workloads, context switching is much, much more
> of a hot path than TLB shootdowns, which are relatively
> infrequent by comparison.
> 
> Basically ASID took away only the first concern from your
> list above, not the other two.

I agree, but the point I was trying to make is that you can keep lazy
TLB for (2) and (3), but still switch the address-space. If you
already accept PTI, then the 600 cycles or so of switching the
address space back and forth, which should occur more infrequently
than those on syscalls/exceptions, are not that painful.

You can also make a case that it is “safer” to switch the address
space, although SMAP/SMEP protection provides similar properties.
Andy Lutomirski Jan. 9, 2022, 7:52 p.m. UTC | #15
On Sun, Jan 9, 2022, at 11:10 AM, Linus Torvalds wrote:
> On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>>
>> I do not know whether it is a pure win, but there is a tradeoff.
>
> Hmm. I guess only some serious testing would tell.
>
> On x86, I'd be a bit worried about removing lazy TLB simply because
> even with ASID support there (called PCIDs by Intel for NIH reasons),
> the actual ASID space on x86 was at least originally very very
> limited.
>
> Architecturally, x86 may expose 12 bits of ASID space, but iirc at
> least the first few implementations actually only internally had one
> or two bits, and hashed the 12 bits down to that internal very limited
> hardware TLB ID space.
>
> We only use a handful of ASIDs per CPU on x86 partly for this reason
> (but also since there's no remote hardware TLB shootdown, there's no
> reason to have a bigger global ASID space, so ASIDs aren't _that_
> common).
>
> And I don't know how many non-PCID x86 systems (perhaps virtualized?)
> there might be out there.
>
> But it would be very interesting to test some "disable lazy tlb"
> patch. The main problem workloads tend to be IO, and I'm not sure how
> many of the automated performance tests would catch issues. I guess
> some threaded pipe ping-pong test (with each thread pinned to
> different cores) would show it.

My original PCID series actually did remove lazy TLB on x86.  I don't remember why, but people objected.  The issue isn't the limited PCID space -- IIRC it's just that MOV CR3 is slooooow.  If we get rid of lazy TLB on x86, then we are writing CR3 twice on even a very short idle.  That adds maybe 1k cycles, which isn't great.

>
> And I guess there is some load that triggered the original powerpc
> patch by Nick&co, and that Andy has been using..

I don't own a big enough machine.  The workloads I'm aware of with the problem have massively multithreaded programs using many CPUs, and transitions into and out of lazy mode ping-pong the cacheline.

>
> Anybody willing to cook up a patch and run some benchmarks? Perhaps
> one that basically just replaces "set ->mm to NULL" with "set ->mm to
> &init_mm" - so that the lazy TLB code is still *there*, but it never
> triggers..

It would 

>
> I think it's mainly 'copy_thread()' in kernel/fork.c and the 'init_mm'
> initializer in mm/init-mm.c, but there's probably other things too
> that have that knowledge of the special "tsk->mm = NULL" situation.

I think, for a little test, we would leave all the mm == NULL code alone and just change the enter-lazy logic.  On top of all the cleanups in this series, that would be trivial.

>
>                   Linus
Linus Torvalds Jan. 9, 2022, 7:54 p.m. UTC | #16
On Sun, Jan 9, 2022 at 11:51 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> If you already accept PTI, [..]

I really don't think anybody should ever "accept PTI".

It's an absolutely enormous cost, and it should be seen as a
last-choice thing. Only really meant for completely broken hardware
(ie meltdown), and for people who have some very serious issues and
think it's "reasonable" to have the TLB isolation.

No real normal sane setup should ever "accept PTI", and it shouldn't
be used as an argument.

              Linus
Linus Torvalds Jan. 9, 2022, 8 p.m. UTC | #17
On Sun, Jan 9, 2022 at 11:53 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> My original PCID series actually did remove lazy TLB on x86. I don't
> remember why, but people objected. The issue isn't the limited PCID
> space -- IIRC it's just that MOV CR3 is slooooow. If we get rid of
> lazy TLB on x86, then we are writing CR3 twice on even a very short
> idle. That adds maybe 1k cycles, which isn't great.

Yeah, my gut feel is that lazy-TLB almost certainly makes sense on x86.

And the grab/mmput overhead and associated cacheline ping-pong is (I
think) something we could just get rid of on x86 due to the IPI model.
There are probably other costs to lazy TLB, and I can imagine that
there are other maintenance costs, but yes, cr3 moves have always been
expensive on x86 even aside from the actual TLB switch.

But I could easily imagine the situation being different on arm64, for example.

But numbers beat "gut feel" and "easily imagine" every time. So it
would be kind of nice to have that ...

          Linus
Andy Lutomirski Jan. 9, 2022, 8:19 p.m. UTC | #18
On Sat, Jan 8, 2022, at 8:38 PM, Linus Torvalds wrote:
> On Sat, Jan 8, 2022 at 7:59 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> > Hmm. The x86 maintainers are on this thread, but they aren't even the
>> > problem. Adding Catalin and Will to this, I think they should know
>> > if/how this would fit with the arm64 ASID allocator.
>> >
>>
>> Well, I am an x86 mm maintainer, and there is definitely a performance problem on large x86 systems right now. :)
>
> Well, my point was that on x86, the complexities of the patch you
> posted are completely pointless.
>
> So on x86, you can just remove the mmgrab/mmdrop reference counts from
> the lazy mm use entirely, and voila, that performance problem is gone.
> We don't _need_ reference counting on x86 at all, if we just say that
> the rule is that a lazy mm is always associated with a
> honest-to-goodness live mm.
>
> So on x86 - and any platform with the IPI model - there is no need for
> hundreds of lines of complexity at all.
>
> THAT is my point. Your patch adds complexity that buys you ABSOLUTELY NOTHING.
>
> You then saying that the mmgrab/mmdrop is a performance problem is
> just trying to muddy the water. You can just remove it entirely.
>
> Now, I do agree that that depends on the whole "TLB IPI will get rid
> of any lazy mm users on other cpus". So I agree that if you have
> hardware TLB invalidation that then doesn't have that software
> component to it, you need something else.
>
> But my argument _then_ was that hardware TLB invalidation then needs
> the hardware ASID thing to be useful, and the ASID management code
> already effectively keeps track of "this ASID is used on other CPU's".
> And that's exactly the same kind of information that your patch
> basically added a separate percpu array for.
>

Are you *sure*?  The ASID management code on x86 is (as mentioned before) completely unaware of whether an ASID is actually in use anywhere.  The x86 ASID code is a per-cpu LRU -- it tracks whether an mm has been recently used on a cpu, not whether the mm exists.  If an mm hasn't been used recently, the ASID gets recycled.  If we had more bits, we wouldn't even recycle it.  An ASID can and does get recycled while the mm still exists.

> So I think that even for that hardware TLB shootdown case, your patch
> only adds overhead.

The overhead is literally:

exit_mmap();
for each cpu still in mm_cpumask:
  smp_load_acquire

That's it, unless the mm is actually in use, in which 

>
> And it potentially adds a *LOT* of overhead, if you replace an atomic
> refcount with a "for_each_possible_cpu()" loop that has to do cmpxchg
> things too.

The cmpxchg is only in the case in which the mm is actually in use on that CPU.  I'm having trouble imagining a workload in which the loop is even measurable unless the bit scan itself is somehow problematic.

On a very large arm64 system, I would believe there could be real overhead.  But these very large systems are exactly the systems that currently ping-pong mm_count.

>
> Btw, you don't even need to really solve the arm64 TLB invalidate
> thing - we could make the rule be that we only do the mmgrab/mmput at
> all on platforms that don't do that IPI flush.
>
> I think that's basically exactly what Nick Piggin wanted to do on powerpc, no?

More or less, but...

>
> But you hated that patch, for non-obvious reasons, and are now
> introducing this new patch that is clearly non-optimal on x86.

I hated that patch because it's messy and it leaves the core lazy handling in an IMO quite regrettable state, not because I'm particularly opposed to shooting down lazies on platforms where it makes sense (powerpc and mostly x86).

As just the most obvious issue, note the kasan_check_byte() in this patch that verifies that ->active_mm doesn't point to freed memory when the scheduler is entered.  If we flipped shoot-lazies on on x86, then KASAN would blow up with that.

For perspective, this whole series is 23 patches.  Exactly two of them are directly related to my hazard pointer scheme: patches 16 and 22.  The rest of them are, in my opinion, cleanups and some straight-up bugfixes that are worthwhile no matter what we do with lazy mm handling per se.

>
> So I think there's some intellectual dishonesty on your part here.

I never said I hated shoot-lazies.  I didn't like the *code*.  I thought I could do better, and I still think my hazard pointer scheme is nifty and, aside from some complexity, quite nice, and it even reduces to shoot-lazies if for_each_possible_lazymm_cpu() is defined to do nothing, but I mainly wanted the code to be better.  So I went and did it.  I could respin this series without the hazard pointers quite easily.

--Andy
Nadav Amit Jan. 9, 2022, 8:34 p.m. UTC | #19
> On Jan 9, 2022, at 11:52 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sun, Jan 9, 2022, at 11:10 AM, Linus Torvalds wrote:
>> On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>>> 
>>> I do not know whether it is a pure win, but there is a tradeoff.
>> 
>> Hmm. I guess only some serious testing would tell.
>> 
>> On x86, I'd be a bit worried about removing lazy TLB simply because
>> even with ASID support there (called PCIDs by Intel for NIH reasons),
>> the actual ASID space on x86 was at least originally very very
>> limited.
>> 
>> Architecturally, x86 may expose 12 bits of ASID space, but iirc at
>> least the first few implementations actually only internally had one
>> or two bits, and hashed the 12 bits down to that internal very limited
>> hardware TLB ID space.
>> 
>> We only use a handful of ASIDs per CPU on x86 partly for this reason
>> (but also since there's no remote hardware TLB shootdown, there's no
>> reason to have a bigger global ASID space, so ASIDs aren't _that_
>> common).
>> 
>> And I don't know how many non-PCID x86 systems (perhaps virtualized?)
>> there might be out there.
>> 
>> But it would be very interesting to test some "disable lazy tlb"
>> patch. The main problem workloads tend to be IO, and I'm not sure how
>> many of the automated performance tests would catch issues. I guess
>> some threaded pipe ping-pong test (with each thread pinned to
>> different cores) would show it.
> 
> My original PCID series actually did remove lazy TLB on x86.  I don't remember why, but people objected.  The issue isn't the limited PCID space -- IIRC it's just that MOV CR3 is slooooow.  If we get rid of lazy TLB on x86, then we are writing CR3 twice on even a very short idle.  That adds maybe 1k cycles, which isn't great.

Just for the record: I just ran a short test when CPUs are on max freq
on my skylake. MOV-CR3 without flush is 250-300 cycles. One can argue
that you mostly only care for one of the switches for the idle thread
(once you wake up). And waking up by itself has its overheads.

But you are the master of micro optimizations, and as Rik said, I
mostly think of TLB shootdowns and might miss the big picture. Just
trying to make your life easier by less coding and my life simpler
in understanding your super-smart code. ;-)
Andy Lutomirski Jan. 9, 2022, 8:48 p.m. UTC | #20
On Sun, Jan 9, 2022, at 1:34 PM, Nadav Amit wrote:
>> On Jan 9, 2022, at 11:52 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> 
>> On Sun, Jan 9, 2022, at 11:10 AM, Linus Torvalds wrote:
>>> On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>>>> 
>>>> I do not know whether it is a pure win, but there is a tradeoff.
>>> 
>>> Hmm. I guess only some serious testing would tell.
>>> 
>>> On x86, I'd be a bit worried about removing lazy TLB simply because
>>> even with ASID support there (called PCIDs by Intel for NIH reasons),
>>> the actual ASID space on x86 was at least originally very very
>>> limited.
>>> 
>>> Architecturally, x86 may expose 12 bits of ASID space, but iirc at
>>> least the first few implementations actually only internally had one
>>> or two bits, and hashed the 12 bits down to that internal very limited
>>> hardware TLB ID space.
>>> 
>>> We only use a handful of ASIDs per CPU on x86 partly for this reason
>>> (but also since there's no remote hardware TLB shootdown, there's no
>>> reason to have a bigger global ASID space, so ASIDs aren't _that_
>>> common).
>>> 
>>> And I don't know how many non-PCID x86 systems (perhaps virtualized?)
>>> there might be out there.
>>> 
>>> But it would be very interesting to test some "disable lazy tlb"
>>> patch. The main problem workloads tend to be IO, and I'm not sure how
>>> many of the automated performance tests would catch issues. I guess
>>> some threaded pipe ping-pong test (with each thread pinned to
>>> different cores) would show it.
>> 
>> My original PCID series actually did remove lazy TLB on x86.  I don't remember why, but people objected.  The issue isn't the limited PCID space -- IIRC it's just that MOV CR3 is slooooow.  If we get rid of lazy TLB on x86, then we are writing CR3 twice on even a very short idle.  That adds maybe 1k cycles, which isn't great.
>
> Just for the record: I just ran a short test when CPUs are on max freq
> on my skylake. MOV-CR3 without flush is 250-300 cycles. One can argue
> that you mostly only care for one of the switches for the idle thread
> (once you wake up). And waking up by itself has its overheads.
>
> But you are the master of micro optimizations, and as Rik said, I
> mostly think of TLB shootdowns and might miss the big picture. Just
> trying to make your life easier by less coding and my life simpler
> in understanding your super-smart code. ;-)

As Rik pointed out, the mm_cpumask manipulation is also expensive if we get rid of lazy. Let me ponder how to do this nicely.
Linus Torvalds Jan. 9, 2022, 8:48 p.m. UTC | #21
On Sun, Jan 9, 2022 at 12:20 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> Are you *sure*? The ASID management code on x86 is (as mentioned
> before) completely unaware of whether an ASID is actually in use
> anywhere.

Right.

But the ASID situation on x86 is very very different, exactly because
x86 doesn't have cross-CPU TLB invalidates.

Put another way: x86 TLB hardware is fundamentally per-cpu. As such,
any ASID management is also per-cpu.

That's fundamentally not true on arm64.  And that's not some "arm64
implementation detail". That's fundamental to doing cross-CPU TLB
invalidates in hardware.

If your TLB invalidates act across CPU's, then the state they act on
are also obviously across CPU's.

So the ASID situation is fundamentally different depending on the
hardware usage. On x86, TLB's are per-core, and on arm64 they are not,
and that's reflected in our code too.

As a result, on x86, each mm has a per-cpu ASID, and there's a small
array of per-cpu "mm->asid" mappings.

On arm, each mm has an asid, and it's allocated from a global asid
space - so there is no need for that "mm->asid" mapping, because the
asid is there in the mm, and it's shared across cpus.

That said, I still don't actually know the arm64 ASID management code.

The thing about TLB flushes is that it's ok to do them spuriously (as
long as you don't do _too_ many of them and tank performance), so two
different mm's can have the same hw ASID on two different cores and
that just makes cross-CPU TLB invalidates too aggressive. You can't
share an ASID on the _same_ core without flushing in between context
switches, because then the TLB on that core might be re-used for a
different mm. So the flushing rules aren't necessarily 100% 1:1 with
the "in use" rules, and who knows if the arm64 ASID management
actually ends up just matching that whole "this lazy TLB is still
in use on another CPU" tracking.

So I don't really know the arm64 situation. And i's possible that lazy
TLB isn't even worth it on arm64 in the first place.

> > So I think that even for that hardware TLB shootdown case, your patch
> > only adds overhead.
>
> The overhead is literally:
>
> exit_mmap();
> for each cpu still in mm_cpumask:
>   smp_load_acquire
>
> That's it, unless the mm is actually in use

Ok, now do this for a machine with 1024 CPU's.

And tell me it is "scalable".

> On a very large arm64 system, I would believe there could be real overhead.  But these very large systems are exactly the systems that currently ping-pong mm_count.

Right.

But I think your arguments against mm_count are questionable.

I'd much rather have a *much* smaller patch that says "on x86 and
powerpc, we don't need this overhead at all".

And then the arm64 people can look at it and say "Yeah, we'll still do
the mm_count thing", or maybe say "Yeah, we can solve it another way".

And maybe the arm64 people actually say "Yeah, this hazard pointer
thing is perfect for us". That still doesn't necessarily argue for it
on an architecture that ends up serializing with an IPI anyway.

                Linus
Linus Torvalds Jan. 9, 2022, 9:51 p.m. UTC | #22
[ Ugh, I actually went back and looked at Nick's patches again, to
just verify my memory, and they weren't as pretty as I thought they
were ]

On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'd much rather have a *much* smaller patch that says "on x86 and
> powerpc, we don't need this overhead at all".

For some reason I thought Nick's patch worked at "last mmput" time and
the TLB flush IPIs that happen at that point anyway would then make
sure any lazy TLB is cleaned up.

But that's not actually what it does. It ties the
MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
last mmdrop() instead. Because it really tied the whole logic to the
mm_count logic (and made lazy tlb to not do mm_count) rather than the
mm_users thing I mis-remembered it doing.

So at least some of my arguments were based on me just mis-remembering
what Nick's patch actually did (mainly because I mentally recreated
the patch from "Nick did something like this" and what I thought would
be the way to do it on x86).

So I guess I have to recant my arguments.

I still think my "get rid of lazy at last mmput" model should work,
and would be a perfect match for x86, but I can't really point to Nick
having done that.

So I was full of BS.

Hmm. I'd love to try to actually create a patch that does that "Nick
thing", but on last mmput() (ie when __mmput triggers). Because I
think this is interesting. But then I look at my schedule for the
upcoming week, and I go "I don't have a leg to stand on in this
discussion, and I'm just all hot air".

Because code talks, BS walks.

                Linus
Andy Lutomirski Jan. 10, 2022, 12:52 a.m. UTC | #23
On Sun, Jan 9, 2022, at 1:51 PM, Linus Torvalds wrote:
> [ Ugh, I actually went back and looked at Nick's patches again, to
> just verify my memory, and they weren't as pretty as I thought they
> were ]
>
> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I'd much rather have a *much* smaller patch that says "on x86 and
>> powerpc, we don't need this overhead at all".

I can whip this up.  It won’t be much smaller — we still need the entire remainder of the series as prep, and we’ll end up with an if (arch needs mmcount) somewhere, but it’s straightforward.

But I reserve the right to spam you with an even bigger patch, because mm_cpumask also pingpongs, and I have some ideas for helping that out too.

Also:

>> exit_mmap();
>> for each cpu still in mm_cpumask:
>>   smp_load_acquire
>>
>> That's it, unless the mm is actually in use
>
> Ok, now do this for a machine with 1024 CPU's.
>
> And tell me it is "scalable".
>

Do you mean a machine with 1024 CPUs and 2 bits set in mm_cpumask or 1024 CPU with 800 bits set in mm_cpumask?  In the former case, this is fine.  In the latter case, *on x86*, sure it does 800 loads, but we're about to do 800 CR3 writes to tear the whole mess down, so the 800 loads should be in the noise.  (And this series won't actually do this anyway on bare metal, since exit_mmap does the shootdown.)
Rik van Riel Jan. 10, 2022, 2:36 a.m. UTC | #24
On Sun, 2022-01-09 at 17:52 -0700, Andy Lutomirski wrote:
> On Sun, Jan 9, 2022, at 1:51 PM, Linus Torvalds wrote:
> 
> 
> 
> > > exit_mmap();
> > > for each cpu still in mm_cpumask:
> > >   smp_load_acquire
> > > 
> > > That's it, unless the mm is actually in use
> > 
> > Ok, now do this for a machine with 1024 CPU's.
> > 
> > And tell me it is "scalable".
> > 
> 
> Do you mean a machine with 1024 CPUs and 2 bits set in mm_cpumask or
> 1024 CPU with 800 bits set in mm_cpumask?  In the former case, this
> is fine.  In the latter case, *on x86*, sure it does 800 loads, but
> we're about to do 800 CR3 writes to tear the whole mess down, so the
> 800 loads should be in the noise.  (And this series won't actually do
> this anyway on bare metal, since exit_mmap does the shootdown.)

Also, while 800 loads is kinda expensive, it is a heck of
a lot less expensive than 800 IPIs.
Linus Torvalds Jan. 10, 2022, 3:51 a.m. UTC | #25
On Sun, Jan 9, 2022 at 6:40 PM Rik van Riel <riel@surriel.com> wrote:
>
> Also, while 800 loads is kinda expensive, it is a heck of
> a lot less expensive than 800 IPIs.

Rik, the IPI's you have to do *anyway*. So there are exactly zero extra IPI's.

Go take a look. It's part of the whole "flush TLB's" thing in __mmput().

So let me explain one more time what I think we should have done, at
least on x86:

 (1) stop refcounting active_mm entries entirely on x86

Why can we do that? Because instead of worrying about doing those
mm_count games for the active_mm reference, we realize that any
active_mm has to have a _regular_ mm associated with it, and it has a
'mm_users' count.

And when that mm_users count go to zero, we have:

 (2) mmput -> __mmput -> exit_mmap(), which already has to flush all
TLB's because it's tearing down the page tables

And since it has to flush those TLB's as part of tearing down the page
tables, we on x86 then have:

 (3) that TLB flush will have to do the IPI's to anybody who has that
mm active already

and that IPI has to be done *regardless*. And the TLB flushing done by
that IPI? That code already clears the lazy status (and not doing so
would be pointless and in fact wrong).
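
The relevant x86 behavior, roughly (simplified from the flush IPI
handler in arch/x86/mm/tlb.c):

  static void flush_tlb_func(void *info)
  {
          if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
                  /*
                   * We are only lazily holding this mm: instead of
                   * flushing, switch to init_mm, which drops the lazy
                   * reference to the old mm.
                   */
                  switch_mm_irqs_off(NULL, &init_mm, NULL);
                  return;
          }
          /* ... otherwise do the requested flush of the loaded mm ... */
  }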

Notice? There isn't some "800 loads". There isn't some "800 IPI's".
And there isn't any refcounting cost of the lazy TLB.

Well, right now there *is* that refcounting cost, but my point is that
I don't think it should exist.

It shouldn't exist as an atomic access to mm_count (with those cache
ping-pongs when you have a lot of threads across a lot of CPUs), but
it *also* shouldn't exist as a "lightweight hazard pointer".

See my point? I think the lazy-tlb refcounting we do is pointless if
you have to do IPI's for TLB flushes.

Note: the above is for x86, which has to do the IPI's anyway (and it's
very possible that if you don't have to do IPI's because you have HW
TLB coherency, maybe lazy TLB's aren't what you should be using, but I
think that should be a separate discussion).

And yes, right now we do that pointless reference counting, because it
was simple and straightforward, and people historically didn't see it
as a problem.

Plus we had to have that whole secondary 'mm_count' anyway for other
reasons, since we use it for things that need to keep a ref to 'struct
mm_struct' around regardless of page table counts (eg things like a
lot of /proc references to 'struct mm_struct' do not want to keep
forced references to user page tables alive).

But I think conceptually mm_count (ie mmgrab/mmdrop) was always really
dodgy for lazy TLB. Lazy TLB really cares about the page tables still
being there, and that's not what mm_count is ostensibly about. That's
really what mm_users is about.

Yet mmgrab/mmdrop is exactly what the lazy TLB code uses, even if it's
technically odd (ie mmgrab really only keeps the 'struct mm' around,
but not about the vma's and page tables).

Side note: you can see the effects of this mis-use of mmgrab/mmdrop in
 how we tear down _almost_ all the page table state in __mmput(). But
look at what we keep until the final __mmdrop, even though there are
no page tables left:

        mm_free_pgd(mm);
        destroy_context(mm);

exactly because even though we've torn down all the page tables
earlier, we had to keep the page table *root* around for the lazy
case.

It's kind of a layering violation, but it comes from that lazy-tlb
mm_count use, and so we have that odd situation where the page table
directory lifetime is very different from the rest of the page table
lifetimes.

(You can easily make excuses for it by just saying that "mm_users" is
the user-space page table user count, and that the page directory has
a different lifetime because it's also about the kernel page tables,
so it's all a bit of a gray area, but I do think it's also a bit of a
sign of how our refcounting for lazy-tlb is a bit dodgy).

                Linus
Linus Torvalds Jan. 10, 2022, 5:19 p.m. UTC | #26
On Sun, Jan 9, 2022 at 9:18 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> This is the patch that goes on top of the series I posted. It's not
> very clean at the moment it was just a proof of concept.

Yeah, this looks like what x86 basically already effectively does.

x86 obviously doesn't have that TLBIE option, and already has that
"exit lazy mode" logic (although it does so differently, using
switch_mm_irqs_off(), and guards it with the 'info->freed_tables'
check).

But there are so many different possible ways to flush TLB's (the
whole "paravirt vs native") that it would still require some
double-checking that there isn't some case that does it differently..

               Linus
Andy Lutomirski Jan. 10, 2022, 8:52 p.m. UTC | #27
On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>> [ Ugh, I actually went back and looked at Nick's patches again, to
>> just verify my memory, and they weren't as pretty as I thought they
>> were ]
>> 
>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>> powerpc, we don't need this overhead at all".
>> 
>> For some reason I thought Nick's patch worked at "last mmput" time and
>> the TLB flush IPIs that happen at that point anyway would then make
>> sure any lazy TLB is cleaned up.
>> 
>> But that's not actually what it does. It ties the
>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>> last mmdrop() instead. Because it really tied the whole logic to the
>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>> mm_users thing I mis-remembered it doing.
>
> It does this because on powerpc with hash MMU, we can't use IPIs for
> TLB shootdowns.

I know nothing about powerpc’s mmu. If you can’t do IPI shootdowns, it sounds like the hazard pointer scheme might actually be pretty good.

>
>> So at least some of my arguments were based on me just mis-remembering
>> what Nick's patch actually did (mainly because I mentally recreated
>> the patch from "Nick did something like this" and what I thought would
>> be the way to do it on x86).
>
> With powerpc with the radix MMU using IPI based shootdowns, we can 
> actually do the switch-away-from-lazy on the final TLB flush and the
> final broadcast shootdown thing becomes a no-op. I didn't post that
> additional patch because it's powerpc-specific and I didn't want to
> post more code so widely.
>
>> So I guess I have to recant my arguments.
>> 
>> I still think my "get rid of lazy at last mmput" model should work,
>> and would be a perfect match for x86, but I can't really point to Nick
>> having done that.
>> 
>> So I was full of BS.
>> 
>> Hmm. I'd love to try to actually create a patch that does that "Nick
>> thing", but on last mmput() (ie when __mmput triggers). Because I
>> think this is interesting. But then I look at my schedule for the
>> upcoming week, and I go "I don't have a leg to stand on in this
>> discussion, and I'm just all hot air".
>
> I agree Andy's approach is very complicated and adds more overhead than
> necessary for powerpc, which is why I don't want to use it. I'm still
> not entirely sure what the big problem would be to convert x86 to use
> it, I admit I haven't kept up with the exact details of its lazy tlb
> mm handling recently though.

The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:

- x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.

- mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.

- There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.

- If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.

- If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).

My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture), but that can be saved for later, I think.
Nicholas Piggin Jan. 11, 2022, 3:10 a.m. UTC | #28
Excerpts from Andy Lutomirski's message of January 11, 2022 6:52 am:
> 
> 
> On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
>> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>>> [ Ugh, I actually went back and looked at Nick's patches again, to
>>> just verify my memory, and they weren't as pretty as I thought they
>>> were ]
>>> 
>>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>>> powerpc, we don't need this overhead at all".
>>> 
>>> For some reason I thought Nick's patch worked at "last mmput" time and
>>> the TLB flush IPIs that happen at that point anyway would then make
>>> sure any lazy TLB is cleaned up.
>>> 
>>> But that's not actually what it does. It ties the
>>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>>> last mmdrop() instead. Because it really tied the whole logic to the
>>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>>> mm_users thing I mis-remembered it doing.
>>
>> It does this because on powerpc with hash MMU, we can't use IPIs for
>> TLB shootdowns.
> 
> I know nothing about powerpc’s mmu. If you can’t do IPI shootdowns,

The paravirtualised hash MMU environment doesn't because it has a single 
level translation and the guest uses hypercalls to insert and remove 
translations and the hypervisor flushes TLBs. The HV could flush TLBs
with IPIs but obviously the guest can't use those to execute the lazy
switch. In radix guests (and all bare metal) the OS flushes its own
TLBs.

We are moving over to radix, but powerpc also has a hardware broadcast 
flush instruction which can be a bit faster than IPIs and is usable by 
bare metal and radix guests, so those can also avoid the IPIs if they 
want. Part of the powerpc patch I just sent to combine the lazy switch 
with the final TLB flush is to force it to always take the IPI path and 
not use the TLBIE instruction on the final exit.

So hazard pointers could avoid some IPIs there too.

> it sounds like the hazard pointer scheme might actually be pretty good.

Some IPIs in the exit path just aren't that big a concern. I measured,
got numbers, tried to irritate it; it just wasn't really a problem. Some
archs use IPIs for all threaded TLB shootdowns, and exits aren't that
frequent. Very fast short-lived processes that do a lot of exits just
don't tend to spread across a lot of CPUs leaving lazy tlb mms to shoot,
and long-lived, multi-threaded ones that do spread don't exit at high rates.

So from what I can see it's premature optimization. Actually maybe not
even an optimization, because IIRC it adds complexity and even a barrier
on powerpc in the context switch path, which is a lot more critical than
exit() for us; we don't want slowdowns there.

It's a pretty high-complexity, boutique kind of synchronization. Which,
don't get me wrong, is the kind of thing I like; it is clever and may be
perfectly bug-free, but it needs to prove itself over the simple, dumb
shoot-lazies approach.

>>> So at least some of my arguments were based on me just mis-remembering
>>> what Nick's patch actually did (mainly because I mentally recreated
>>> the patch from "Nick did something like this" and what I thought would
>>> be the way to do it on x86).
>>
>> With powerpc with the radix MMU using IPI based shootdowns, we can 
>> actually do the switch-away-from-lazy on the final TLB flush and the
>> final broadcast shootdown thing becomes a no-op. I didn't post that
>> additional patch because it's powerpc-specific and I didn't want to
>> post more code so widely.
>>
>>> So I guess I have to recant my arguments.
>>> 
>>> I still think my "get rid of lazy at last mmput" model should work,
>>> and would be a perfect match for x86, but I can't really point to Nick
>>> having done that.
>>> 
>>> So I was full of BS.
>>> 
>>> Hmm. I'd love to try to actually create a patch that does that "Nick
>>> thing", but on last mmput() (ie when __mmput triggers). Because I
>>> think this is interesting. But then I look at my schedule for the
>>> upcoming week, and I go "I don't have a leg to stand on in this
>>> discussion, and I'm just all hot air".
>>
>> I agree Andy's approach is very complicated and adds more overhead than
>> necessary for powerpc, which is why I don't want to use it. I'm still
>> not entirely sure what the big problem would be to convert x86 to use
>> it, I admit I haven't kept up with the exact details of its lazy tlb
>> mm handling recently though.
> 
> The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:
> 
> - x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.

Definitely should have been done at the time x86 deviated, but better 
late than never.

> 
> - mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.

membarrier relied on the barrier that mmdrop was providing. Adding an
smp_mb() instead, if mmdrop is a no-op, is fine. Patches changing
membarrier's ordering requirements can be concurrent and are not
fundamentally tied to lazy tlb mm switching; it just reuses an existing
ordering point.

> - There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.

Seems like a decent cleanup. Again, not really lazy tlb specific; it's
just general switch code that should be factored and better contained in
kernel/sched/, which is fine, but it can go concurrently with the lazy
tlb improvements.

> - If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.

My shoot lazies patch has no such issues with that AFAIKS. What exact 
issue is it and where did you fix it?

> 
> - If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).
> 
> My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture),  it that can be saved for later, I think.

I disagree; the lazy mm code is very clean and simple. And I can't see
how you would propose to remove active_mm from core code. I'm skeptical
but would be very interested to see it. But that's nothing to do with my
shoot-lazies patch and it can also go concurrently, except for mechanical
merge issues.

Thanks,
Nick
Will Deacon Jan. 11, 2022, 10:39 a.m. UTC | #29
Hi Andy, Linus,

On Sun, Jan 09, 2022 at 12:48:42PM -0800, Linus Torvalds wrote:
> On Sun, Jan 9, 2022 at 12:20 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > Are you *sure*? The ASID management code on x86 is (as mentioned
> > before) completely unaware of whether an ASID is actually in use
> > anywhere.
> 
> Right.
> 
> But the ASID situation on x86 is very very different, exactly because
> x86 doesn't have cross-CPU TLB invalidates.
> 
> Put another way: x86 TLB hardware is fundamentally per-cpu. As such,
> any ASID management is also per-cpu.
> 
> That's fundamentally not true on arm64.  And that's not some "arm64
> implementation detail". That's fundamental to doing cross-CPU TLB
> invalidates in hardware.
> 
> If your TLB invalidates act across CPU's, then the state they act on
> are also obviously across CPU's.
> 
> So the ASID situation is fundamentally different depending on the
> hardware usage. On x86, TLB's are per-core, and on arm64 they are not,
> and that's reflected in our code too.
> 
> As a result, on x86, each mm has a per-cpu ASID, and there's a small
> array of per-cpu "mm->asid" mappings.
> 
> On arm, each mm has an asid, and it's allocated from a global asid
> space - so there is no need for that "mm->asid" mapping, because the
> asid is there in the mm, and it's shared across cpus.
> 
> That said, I still don't actually know the arm64 ASID management code.

That appears to be a common theme in this thread, so hopefully I can shed
some light on the arm64 side of things:

The CPU supports either 8-bit or 16-bit ASIDs and we require that we don't
have more CPUs than we can represent in the ASID space (well, we WARN in
this case but it's likely to go badly wrong). We reserve ASID 0 for things
like the idmap, so as far as the allocator is concerned ASID 0 is "invalid"
and we rely on this.

As Linus says above, the ASID is per-'mm' and we require that all threads
of an 'mm' use the same ASID at the same time, otherwise the hardware TLB
broadcasting isn't going to work properly because the invalidations are
typically tagged by ASID.

As Andy points out later, this means that we have to reuse ASIDs for
different 'mm's once we have enough of them. We do this using a 64-bit
context ID in mm_context_t, where the lower bits are the ASID for the 'mm'
and the upper bits are a generation count. The ASID allocator keeps an
internal generation count which is incremented whenever we fail to allocate
an ASID and are forced to invalidate them all and start re-allocating. We
assume that the generation count doesn't overflow.

When switching to an 'mm', we check if the generation count held in the
'mm' is behind the allocator's current generation count. If it is, then
we know that the 'mm' needs to be allocated a new ASID. Allocation is
performed with a spinlock held and basically involves setting a new bit
in the bitmap and updating the 'mm' with the new ASID and current
generation. We don't reclaim ASIDs greedily on 'mm' teardown -- this was
pretty slow when I looked at it in the past.

So far so good, but it gets more complicated when we look at the details of
the overflow handling. Overflow is always detected on the allocation path
with the spinlock held but other CPUs could happily be running other code
(inc. user code) at this point. Therefore, we can't simply invalidate the
TLBs, clear the bitmap and start re-allocating ASIDs because we could end up
with an ASID shared between two running 'mm's, leading to both invalidation
interference and the potential to hit stale TLB entries allocated after
the invalidation on rollover. We handle this with a couple of per-cpu
variables, 'active_asids' and 'reserved_asids'.

'active_asids' is set to the current ASID in switch_mm() just before
writing the actual TTBR register. On a rollover, the CPU holding the lock
goes through each CPU's 'active_asids' entry, atomic xchg()s it to 0 and
writes the result into the corresponding 'reserved_asids' entry. These
'reserved_asids' are then immediately marked as allocated and a flag is
set for each CPU to indicate that its TLBs are dirty. This allows the
CPU handling the rollover to continue with its allocation without stopping
the world and without broadcasting TLB invalidation; other CPUs will
hit a generation mismatch on their next switch_mm(), notice that they are
running a reserved ASID from an older generation, upgrade the generation
(i.e. keep the same ASID) and then invalidate their local TLB.

So we do have some tracking of which ASIDs are where, but we can't generally
say "is this ASID dirty in the TLBs of this CPU". That also gets more
complicated on some systems where a TLB can be shared between some of the
CPUs (I haven't covered that case above, since I think that this is enough
detail already.)
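
A heavily simplified pseudocode sketch of the switch-in path described
above (the real code is arch/arm64/mm/context.c, which also closes the
rollover races with a cmpxchg on the per-cpu 'active_asids'):

        void check_and_switch_context(struct mm_struct *mm)
        {
                u64 asid = atomic64_read(&mm->context.id);

                if (!asid_gen_match(asid)) {    /* behind the allocator's generation? */
                        raw_spin_lock(&cpu_asid_lock);
                        asid = new_context(mm);         /* may trigger a rollover */
                        atomic64_set(&mm->context.id, asid);
                        raw_spin_unlock(&cpu_asid_lock);
                }

                /* TLBs marked dirty by a rollover on another CPU? */
                if (cpumask_test_and_clear_cpu(smp_processor_id(),
                                               &tlb_flush_pending))
                        local_flush_tlb_all();

                atomic64_set(this_cpu_ptr(&active_asids), asid);
                cpu_switch_mm(mm->pgd, mm);     /* program TTBR with the new ASID */
        }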

FWIW, we have a TLA+ model of some of this, which may (or may not) be easier
to follow than my text:

https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/asidalloc.tla

although the syntax is pretty hard going :(

> The thing about TLB flushes is that it's ok to do them spuriously (as
> long as you don't do _too_ many of them and tank performance), so two
> different mm's can have the same hw ASID on two different cores and
> that just makes cross-CPU TLB invalidates too aggressive. You can't
> share an ASID on the _same_ core without flushing in between context
> switches, because then the TLB on that core might be re-used for a
> different mm. So the flushing rules aren't necessarily 100% 1:1 with
> the "in use" rules, and who knows if the arm64 ASID management
> actually ends up just matching what that whole "this lazy TLB is still
> in use on another CPU".

The shared TLBs (Arm calls this "Common-not-private") make this problematic,
as the TLB is no longer necessarily per-core.

> So I don't really know the arm64 situation. And i's possible that lazy
> TLB isn't even worth it on arm64 in the first place.

ASID allocation aside, I think there are a few useful things to point out
for arm64:

	- We only have "local" or "all" TLB invalidation; nothing targeted
	  (and for KVM guests this is always "all").

	- Most mms end up running on more than one CPU (at least, when I
	  last looked at this a fork+exec would end up with the mm having
	  been installed on two CPUs)

	- We don't track mm_cpumask as it showed up as a bottleneck in the
	  past and, because of the earlier points, it wasn't very useful
	  anyway

	- mmgrab() should be fast for us (it's a posted atomic add),
	  although mmdrop() will be slower as it has to return data to
	  check against the count going to zero.

So it doesn't feel like an obvious win to me for us to scan these new hazard
pointers on arm64. At least, I would love to see some numbers if we're going
to make changes here.

Will
Andy Lutomirski Jan. 11, 2022, 3:22 p.m. UTC | #30
On Tue, Jan 11, 2022, at 2:39 AM, Will Deacon wrote:
> Hi Andy, Linus,
>
> On Sun, Jan 09, 2022 at 12:48:42PM -0800, Linus Torvalds wrote:
>> On Sun, Jan 9, 2022 at 12:20 PM Andy Lutomirski <luto@kernel.org> wrote:

>> That said, I still don't actually know the arm64 ASID management code.
>
> That appears to be a common theme in this thread, so hopefully I can shed
> some light on the arm64 side of things:
>

Thanks!

>
> FWIW, we have a TLA+ model of some of this, which may (or may not) be easier
> to follow than my text:

Yikes. Your fine hardware engineers should consider 64-bit ASIDs :)

I suppose x86-on-AMD could copy this, but eww.  OTOH x86 can easily have more CPUs than ASIDs, so maybe not.

>
> https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/asidalloc.tla
>
> although the syntax is pretty hard going :(
>
>> The thing about TLB flushes is that it's ok to do them spuriously (as
>> long as you don't do _too_ many of them and tank performance), so two
>> different mm's can have the same hw ASID on two different cores and
>> that just makes cross-CPU TLB invalidates too aggressive. You can't
>> share an ASID on the _same_ core without flushing in between context
>> switches, because then the TLB on that core might be re-used for a
>> different mm. So the flushing rules aren't necessarily 100% 1:1 with
>> the "in use" rules, and who knows if the arm64 ASID management
>> actually ends up just matching what that whole "this lazy TLB is still
>> in use on another CPU".
>
> The shared TLBs (Arm calls this "Common-not-private") make this problematic,
> as the TLB is no longer necessarily per-core.
>
>> So I don't really know the arm64 situation. And i's possible that lazy
>> TLB isn't even worth it on arm64 in the first place.
>
> ASID allocation aside, I think there are a few useful things to point out
> for arm64:
>
> 	- We only have "local" or "all" TLB invalidation; nothing targetted
> 	  (and for KVM guests this is always "all").
>
> 	- Most mms end up running on more than one CPU (at least, when I
> 	  last looked at this a fork+exec would end up with the mm having
> 	  been installed on two CPUs)
>
> 	- We don't track mm_cpumask as it showed up as a bottleneck in the
> 	  past and, because of the earlier points, it wasn't very useful
> 	  anyway
>
> 	- mmgrab() should be fast for us (it's a posted atomic add),
> 	  although mmdrop() will be slower as it has to return data to
> 	  check against the count going to zero.
>
> So it doesn't feel like an obvious win to me for us to scan these new hazard
> pointers on arm64. At least, I would love to see some numbers if we're going
> to make changes here.

I will table the hazard pointer scheme, then, and adjust the series to do shootdowns.

I would guess that once arm64 hits a few hundred CPUs, you'll start finding workloads where mmdrop() at least starts to hurt.  But we can cross that bridge when we get to it.
Andy Lutomirski Jan. 11, 2022, 3:39 p.m. UTC | #31
On Mon, Jan 10, 2022, at 7:10 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of January 11, 2022 6:52 am:
>> 
>> 
>> On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
>>> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>>>> [ Ugh, I actually went back and looked at Nick's patches again, to
>>>> just verify my memory, and they weren't as pretty as I thought they
>>>> were ]
>>>> 
>>>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>>>> <torvalds@linux-foundation.org> wrote:
>>>>>
>>>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>>>> powerpc, we don't need this overhead at all".
>>>> 
>>>> For some reason I thought Nick's patch worked at "last mmput" time and
>>>> the TLB flush IPIs that happen at that point anyway would then make
>>>> sure any lazy TLB is cleaned up.
>>>> 
>>>> But that's not actually what it does. It ties the
>>>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>>>> last mmdrop() instead. Because it really tied the whole logic to the
>>>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>>>> mm_users thing I mis-remembered it doing.
>>>
>>> It does this because on powerpc with hash MMU, we can't use IPIs for
>>> TLB shootdowns.
>> 
>> I know nothing about powerpc’s mmu. If you can’t do IPI shootdowns,
>
> The paravirtualised hash MMU environment doesn't because it has a single 
> level translation and the guest uses hypercalls to insert and remove 
> translations and the hypervisor flushes TLBs. The HV could flush TLBs
> with IPIs but obviously the guest can't use those to execute the lazy
> switch. In radix guests (and all bare metal) the OS flushes its own
> TLBs.
>
> We are moving over to radix, but powerpc also has a hardware broadcast 
> flush instruction which can be a bit faster than IPIs and is usable by 
> bare metal and radix guests so those can also avoid the IPIs if they 
> want. Part of the powerpc patch I just sent to combine the lazy switch 
> with the final TLB flush is to force it to always take the IPI path and 
> not use TLBIE instruction on the final exit.
>
> So hazard points could avoid some IPIs there too.
>
>> it sounds like the hazard pointer scheme might actually be pretty good.
>
> Some IPIs in the exit path just aren't that big a concern. I measured,
> got numbers, tried to irritate it, just wasn't really a problem. Some
> archs use IPIs for all threaded TLB shootdowns and exits not that
> frequent. Very fast short lived processes that do a lot of exits just
> don't tend to spread across a lot of CPUs leaving lazy tlb mms to shoot,
> and long lived and multi threaded ones that do don't exit at high rates.
>
> So from what I can see it's premature optimization. Actually maybe not
> even optimization because IIRC it adds complexity and even a barrier on
> powerpc in the context switch path which is a lot more critical than
> exit() for us we don't want slowdowns there.
>
> It's a pretty high complexity boutique kind of synchronization. Which
> don't get me wrong is the kind of thing I like, it is clever and may be
> perfectly bug free but it needs to prove itself over the simple dumb
> shoot lazies approach.
>
>>>> So at least some of my arguments were based on me just mis-remembering
>>>> what Nick's patch actually did (mainly because I mentally recreated
>>>> the patch from "Nick did something like this" and what I thought would
>>>> be the way to do it on x86).
>>>
>>> With powerpc with the radix MMU using IPI based shootdowns, we can 
>>> actually do the switch-away-from-lazy on the final TLB flush and the
>>> final broadcast shootdown thing becomes a no-op. I didn't post that
>>> additional patch because it's powerpc-specific and I didn't want to
>>> post more code so widely.
>>>
>>>> So I guess I have to recant my arguments.
>>>> 
>>>> I still think my "get rid of lazy at last mmput" model should work,
>>>> and would be a perfect match for x86, but I can't really point to Nick
>>>> having done that.
>>>> 
>>>> So I was full of BS.
>>>> 
>>>> Hmm. I'd love to try to actually create a patch that does that "Nick
>>>> thing", but on last mmput() (ie when __mmput triggers). Because I
>>>> think this is interesting. But then I look at my schedule for the
>>>> upcoming week, and I go "I don't have a leg to stand on in this
>>>> discussion, and I'm just all hot air".
>>>
>>> I agree Andy's approach is very complicated and adds more overhead than
>>> necessary for powerpc, which is why I don't want to use it. I'm still
>>> not entirely sure what the big problem would be to convert x86 to use
>>> it, I admit I haven't kept up with the exact details of its lazy tlb
>>> mm handling recently though.
>> 
>> The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:
>> 
>> - x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.
>
> Definitely should have been done at the time x86 deviated, but better 
> late than never.
>

I suspect that this code may predate there being anything for x86 to deviate from.

>> 
>> - mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.
>
> membarrier relied on a call that mmdrop was providing. Adding a smp_mb()
> instead if mmdrop is a no-op is fine. Patches changing membarrier's 
> ordering requirements can be concurrent and are not fundmentally tied
> to lazy tlb mm switching, it just reuses an existing ordering point.

smp_mb() is rather expensive on x86 at least, and I think on powerpc too.  Let's not.

Also, IMO my cleanups here generally make sense and make the code better, so I think we should just go with them.

>
>> - There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.
>
> Seems like a decent cleanup. Again lazy tlb specific, just general switch
> code should be factored and better contained in kernel/sched/ which is
> fine, but concurrent to lazy tlb improvements.

I personally rather dislike the model of:

...
mmgrab_lazy();
...
mmdrop_lazy();
...
mmgrab_lazy();
...
mmdrop_lazy();
...

where the different calls have incompatible logic at the call sites and a magic config option NOPs them all away and adds barriers.  I'm personally a big fan of cleaning up code before making it even more complicated.

>
>> - If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.
>
> My shoot lazies patch has no such issues with that AFAIKS. What exact 
> issue is it and where did you fix it?

Without my unlazy_mm_irqs_off() (or something similar), x86's very sensible (and very old!) code to unlazy a lazy CPU when flushing leaves active_mm pointing to the old lazy mm.  That's not a problem at all on current kernels, but in a shootdown-lazy world, those active_mm pointers will stick around.  Even with that fixed, without some care, an NMI on the CPU doing the shootdown could dereference ->active_mm at a time when it's not being actively kept alive.

Fixed by unlazy_mm_irqs_off(), the patches that use it, and the patches that make x86 stop inappropriately using ->active_mm.

>
>> 
>> - If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).
>> 
>> My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture),  it that can be saved for later, I think.
>
> I disagree, the lazy mm code is very clean and simple. And I can't see 
> how you would propose to remove active_mm from core code I'm skeptical
> but would be very interested to see, but that's nothing to do with my
> shoot lazies patch and can also be concurrent except for mechanical
> merge issues.

I think we may just have to agree to disagree.  The more I looked at the lazy code, the more problems I found.  So I fixed them.  That work is done now (as far as I'm aware) except for rebasing and review.

--Andy
Nicholas Piggin Jan. 11, 2022, 10:48 p.m. UTC | #32
Excerpts from Andy Lutomirski's message of January 12, 2022 1:39 am:
> 
> 
> On Mon, Jan 10, 2022, at 7:10 PM, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of January 11, 2022 6:52 am:
>>> 
>>> 
>>> On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
>>>> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>>>>> [ Ugh, I actually went back and looked at Nick's patches again, to
>>>>> just verify my memory, and they weren't as pretty as I thought they
>>>>> were ]
>>>>> 
>>>>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>>>>> <torvalds@linux-foundation.org> wrote:
>>>>>>
>>>>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>>>>> powerpc, we don't need this overhead at all".
>>>>> 
>>>>> For some reason I thought Nick's patch worked at "last mmput" time and
>>>>> the TLB flush IPIs that happen at that point anyway would then make
>>>>> sure any lazy TLB is cleaned up.
>>>>> 
>>>>> But that's not actually what it does. It ties the
>>>>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>>>>> last mmdrop() instead. Because it really tied the whole logic to the
>>>>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>>>>> mm_users thing I mis-remembered it doing.
>>>>
>>>> It does this because on powerpc with hash MMU, we can't use IPIs for
>>>> TLB shootdowns.
>>> 
>>> I know nothing about powerpc’s mmu. If you can’t do IPI shootdowns,
>>
>> The paravirtualised hash MMU environment doesn't because it has a single 
>> level translation and the guest uses hypercalls to insert and remove 
>> translations and the hypervisor flushes TLBs. The HV could flush TLBs
>> with IPIs but obviously the guest can't use those to execute the lazy
>> switch. In radix guests (and all bare metal) the OS flushes its own
>> TLBs.
>>
>> We are moving over to radix, but powerpc also has a hardware broadcast 
>> flush instruction which can be a bit faster than IPIs and is usable by 
>> bare metal and radix guests so those can also avoid the IPIs if they 
>> want. Part of the powerpc patch I just sent to combine the lazy switch 
>> with the final TLB flush is to force it to always take the IPI path and 
>> not use TLBIE instruction on the final exit.
>>
>> So hazard points could avoid some IPIs there too.
>>
>>> it sounds like the hazard pointer scheme might actually be pretty good.
>>
>> Some IPIs in the exit path just aren't that big a concern. I measured,
>> got numbers, tried to irritate it, just wasn't really a problem. Some
>> archs use IPIs for all threaded TLB shootdowns and exits not that
>> frequent. Very fast short lived processes that do a lot of exits just
>> don't tend to spread across a lot of CPUs leaving lazy tlb mms to shoot,
>> and long lived and multi threaded ones that do don't exit at high rates.
>>
>> So from what I can see it's premature optimization. Actually maybe not
>> even optimization because IIRC it adds complexity and even a barrier on
>> powerpc in the context switch path which is a lot more critical than
>> exit() for us we don't want slowdowns there.
>>
>> It's a pretty high complexity boutique kind of synchronization. Which
>> don't get me wrong is the kind of thing I like, it is clever and may be
>> perfectly bug free but it needs to prove itself over the simple dumb
>> shoot lazies approach.
>>
>>>>> So at least some of my arguments were based on me just mis-remembering
>>>>> what Nick's patch actually did (mainly because I mentally recreated
>>>>> the patch from "Nick did something like this" and what I thought would
>>>>> be the way to do it on x86).
>>>>
>>>> With powerpc with the radix MMU using IPI based shootdowns, we can 
>>>> actually do the switch-away-from-lazy on the final TLB flush and the
>>>> final broadcast shootdown thing becomes a no-op. I didn't post that
>>>> additional patch because it's powerpc-specific and I didn't want to
>>>> post more code so widely.
>>>>
>>>>> So I guess I have to recant my arguments.
>>>>> 
>>>>> I still think my "get rid of lazy at last mmput" model should work,
>>>>> and would be a perfect match for x86, but I can't really point to Nick
>>>>> having done that.
>>>>> 
>>>>> So I was full of BS.
>>>>> 
>>>>> Hmm. I'd love to try to actually create a patch that does that "Nick
>>>>> thing", but on last mmput() (ie when __mmput triggers). Because I
>>>>> think this is interesting. But then I look at my schedule for the
>>>>> upcoming week, and I go "I don't have a leg to stand on in this
>>>>> discussion, and I'm just all hot air".
>>>>
>>>> I agree Andy's approach is very complicated and adds more overhead than
>>>> necessary for powerpc, which is why I don't want to use it. I'm still
>>>> not entirely sure what the big problem would be to convert x86 to use
>>>> it, I admit I haven't kept up with the exact details of its lazy tlb
>>>> mm handling recently though.
>>> 
>>> The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:
>>> 
>>> - x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.
>>
>> Definitely should have been done at the time x86 deviated, but better 
>> late than never.
>>
> 
> I suspect that this code may predate there being anything for x86 to deviate from.

Interesting, active_mm came in 2.3.11 and x86's cpu_tlbstate[].active_mm 
in 2.3.43. Longer than I thought.

>>> 
>>> - mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.
>>
>> membarrier relied on a call that mmdrop was providing. Adding a smp_mb()
>> instead if mmdrop is a no-op is fine. Patches changing membarrier's 
>> ordering requirements can be concurrent and are not fundmentally tied
>> to lazy tlb mm switching, it just reuses an existing ordering point.
> 
> smp_mb() is rather expensive on x86 at least, and I think on powerpc to.  Let's not.

You misunderstand me. If mmdrop is not there to provide the required 
ordering that membarrier needs, then the membarrier code does its own
smp_mb(). It's not _more_ expensive than before because the full barrier
from the mmdrop is gone.
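
Illustrative only (not the actual patch): at the spot where mmdrop() used
to supply that full barrier, something along the lines of

        if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
                mmdrop(mm);     /* atomic_dec_and_test() is a full barrier */
        else
                smp_mb();       /* keep the ordering membarrier relies on */

so the ordering point is paid for exactly once either way.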

> Also, IMO my cleanups here generally make sense and make the code better, so I think we should just go with them.

Sure, if you can make the membarrier code better and avoid this ordering
requirement, that's nice. This is orthogonal to the lazy tlb code though,
and they can go in parallel (again, aside from mechanical merge issues).

I'm not sure what you don't understand about this: that ordering is a
membarrier requirement, and it happens to be using an existing barrier
that the lazy mm code had anyway, which is perfectly fine and something
that is done all over the kernel in performance-critical code.

>>> - There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.
>>
>> Seems like a decent cleanup. Again lazy tlb specific, just general switch
>> code should be factored and better contained in kernel/sched/ which is
>> fine, but concurrent to lazy tlb improvements.
> 
> I personally rather dislike the model of:
> 
> ...
> mmgrab_lazy();
> ...
> mmdrop_lazy();
> ...
> mmgrab_lazy();
> ...
> mmdrop_lazy();
> ...
> 
> where the different calls have incompatible logic at the call sites and a magic config option NOPs them all away and adds barriers.  I'm personally a big fan of cleaning up code before making it even more complicated.

Not sure what model that is though. Call sites don't have to know 
anything at all about the options or barriers. The rule is simply
that the lazy tlb mm reference is manipulated with the _lazy variants.
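
For illustration, hypothetical helpers along these lines (the real ones in
the series may differ in detail):

        static inline void mmgrab_lazy(struct mm_struct *mm)
        {
                if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
                        mmgrab(mm);
        }

        static inline void mmdrop_lazy(struct mm_struct *mm)
        {
                if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
                        mmdrop(mm);
        }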

It was just mostly duplicated code. Again, it can go in parallel, with no
dependencies other than mechanical merge.

> 
>>
>>> - If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.
>>
>> My shoot lazies patch has no such issues with that AFAIKS. What exact 
>> issue is it and where did you fix it?
> 
> Without my unlazy_mm_irqs_off() (or something similar), x86's very sensible (and very old!) code to unlazy a lazy CPU when flushing leaves active_mm pointing to the old lazy mm.  That's not a problem at all on current kernels, but in a shootdown-lazy world, those active_mm pointers will stick around.  Even with that fixed, without some care, an NMI during the shootdown CPU could dereference ->active_mm at a time when it's not being actively kept alive.
> 
> Fixed by unlazy_mm_irqs_off(), the patches that use it, and the patches that make x86 stop inappropriately using ->active_mm.

Oh you're talking about some x86 specific issue where you would have
problems if you didn't do the port properly? Don't tell me this is why
you've been nacking my patches for 15 months.

>>> - If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).
>>> 
>>> My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture),  it that can be saved for later, I think.
>>
>> I disagree, the lazy mm code is very clean and simple. And I can't see 
>> how you would propose to remove active_mm from core code I'm skeptical
>> but would be very interested to see, but that's nothing to do with my
>> shoot lazies patch and can also be concurrent except for mechanical
>> merge issues.
> 
> I think we may just have to agree to disagree.  The more I looked at the lazy code, the more problems I found.  So I fixed them.  That work is done now (as far as I'm aware) except for rebasing and review.

I don't see any problems with the lazy tlb mm code outside arch/x86, or 
with anything your series fixed in it, aside from a bit of code duplication.

Anyway, I will try to look over it and review the bits I can before the
5.18 merge window. For 5.17, my series has been ready to go for a year or
so and is very small, so let's merge that first, since Linus wants to try
going with that approach rather than the refcount one.

Thanks,
Nick
diff mbox series

Patch

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 7509b2b2e99d..3ceba11c049c 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -76,6 +76,9 @@  static inline bool mmget_not_zero(struct mm_struct *mm)
 
 /* mmput gets rid of the mappings and all user-space */
 extern void mmput(struct mm_struct *);
+
+extern void mm_unlazy_mm_count(struct mm_struct *mm);
+
 #ifdef CONFIG_MMU
 /* same as above but performs the slow path from the async context. Can
  * be called from the atomic context as well
diff --git a/kernel/fork.c b/kernel/fork.c
index 38681ad44c76..2df72cf3c0d2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1122,6 +1122,17 @@  static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+
+	/*
+	 * We hold one mm_count reference.  Convert all remaining lazy_mm
+	 * references to mm_count references so that the mm will be genuinely
+	 * unused when mm_count goes to zero.  Do this after exit_mmap() so
+	 * that, if the architecture shoots down remote TLB entries via IPI in
+	 * exit_mmap() and calls unlazy_mm_irqs_off() when doing so, most or
+	 * all lazy_mm references can be removed without
+	 * mm_unlazy_mm_count()'s help.
+	 */
+	mm_unlazy_mm_count(mm);
 	mmdrop(mm);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95eb0e78f74c..64e4058b3c61 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -20,6 +20,7 @@ 
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
+#include <asm/mmu.h>
 
 #include "../workqueue_internal.h"
 #include "../../fs/io-wq.h"
@@ -4750,6 +4751,144 @@  prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	prepare_arch_switch(next);
 }
 
+/*
+ * Called after each context switch.
+ *
+ * Strictly speaking, no action at all is required here.  This rq
+ * can hold an extra reference to at most one mm, so the memory
+ * wasted by deferring the mmdrop() forever is bounded.  That being
+ * said, it's straightforward to safely drop spare references
+ * in the common case.
+ */
+static void mmdrop_lazy(struct rq *rq)
+{
+	struct mm_struct *old_mm;
+
+	old_mm = READ_ONCE(rq->drop_mm);
+
+	do {
+		/*
+		 * If there is nothing to drop or if we are still using old_mm,
+		 * then don't call mmdrop().
+		 */
+		if (likely(!old_mm || old_mm == rq->lazy_mm))
+			return;
+	} while (!try_cmpxchg_relaxed(&rq->drop_mm, &old_mm, NULL));
+
+	mmdrop(old_mm);
+}
+
+#ifndef for_each_possible_lazymm_cpu
+#define for_each_possible_lazymm_cpu(cpu, mm) for_each_online_cpu((cpu))
+#endif
+
+static bool __try_mm_drop_rq_ref(struct rq *rq, struct mm_struct *mm)
+{
+	struct mm_struct *old_drop_mm = smp_load_acquire(&rq->drop_mm);
+
+	/*
+	 * We know that old_mm != mm: this is the only function that
+	 * might set drop_mm to mm, and we haven't set it yet.
+	 */
+	WARN_ON_ONCE(old_drop_mm == mm);
+
+	if (!old_drop_mm) {
+		/*
+		 * Just set rq->drop_mm to mm and our reference will
+		 * get dropped eventually after rq is done with it.
+		 */
+		return try_cmpxchg(&rq->drop_mm, &old_drop_mm, mm);
+	}
+
+	/*
+	 * The target cpu could still be using old_drop_mm.  We know that, if
+	 * old_drop_mm still exists, then old_drop_mm->mm_users == 0.  Can we
+	 * drop it?
+	 *
+	 * NB: it is critical that we load rq->lazy_mm again after loading
+	 * drop_mm.  If we looked at a prior value of lazy_mm (which we
+	 * already know to be mm), then we would be subject to a race:
+	 *
+	 * Us:
+	 *     Load rq->lazy_mm.
+	 * Remote CPU:
+	 *     Switch to old_drop_mm (with mm_users > 0)
+	 *     Become lazy and set rq->lazy_mm = old_drop_mm
+	 * Third CPU:
+	 *     Set old_drop_mm->mm_users to 0.
+	 *     Set rq->drop_mm = old_drop_mm
+	 * Us:
+	 *     Incorrectly believe that old_drop_mm is unused
+	 *     because rq->lazy_mm != old_drop_mm
+	 *
+	 * In other words, to verify that rq->lazy_mm is not keeping a given
+	 * mm alive, we must load rq->lazy_mm _after_ we know that mm_users ==
+	 * 0 and therefore that rq will not switch to that mm.
+	 */
+	if (smp_load_acquire(&rq->lazy_mm) != mm) {
+		/*
+		 * We got lucky!  rq _was_ using mm, but it stopped.
+		 * Just drop our reference.
+		 */
+		mmdrop(mm);
+		return true;
+	}
+
+	/*
+	 * If we got here, rq->lazy_mm != old_drop_mm, and we ruled
+	 * out the race described above.  rq is done with old_drop_mm,
+	 * so we can steal the reference held by rq and replace it with
+	 * our reference to mm.
+	 */
+	if (cmpxchg(&rq->drop_mm, old_drop_mm, mm) != old_drop_mm)
+		return false;
+
+	mmdrop(old_drop_mm);
+	return true;
+}
+
+/*
+ * This converts all lazy_mm references to mm to mm_count refcounts.  Our
+ * caller holds an mm_count reference, so we don't need to worry about mm
+ * being freed out from under us.
+ */
+void mm_unlazy_mm_count(struct mm_struct *mm)
+{
+	unsigned int drop_count = 0;
+	int cpu;
+
+	/*
+	 * mm_users is zero, so no cpu will set its rq->lazy_mm to mm.
+	 */
+	WARN_ON_ONCE(atomic_read(&mm->mm_users) != 0);
+
+	for_each_possible_lazymm_cpu(cpu, mm) {
+		struct rq *rq = cpu_rq(cpu);
+
+		if (smp_load_acquire(&rq->lazy_mm) != mm)
+			continue;
+
+		/*
+		 * Grab one reference.  Do it as a batch so we do a maximum
+		 * of two atomic operations instead of one per lazy reference.
+		 */
+		if (!drop_count) {
+			/*
+			 * Collect lots of references.  We'll drop the ones we
+			 * don't use.
+			 */
+			drop_count = num_possible_cpus();
+			atomic_add(drop_count, &mm->mm_count);
+		}
+		drop_count--;
+
+		while (!__try_mm_drop_rq_ref(rq, mm))
+			;
+	}
+
+	atomic_sub(drop_count, &mm->mm_count);
+}
+
 /**
  * finish_task_switch - clean up after a task-switch
  * @prev: the thread we just switched away from.
@@ -4773,7 +4912,6 @@  static struct rq *finish_task_switch(struct task_struct *prev)
 	__releases(rq->lock)
 {
 	struct rq *rq = this_rq();
-	struct mm_struct *mm = rq->prev_mm;
 	long prev_state;
 
 	/*
@@ -4792,8 +4930,6 @@  static struct rq *finish_task_switch(struct task_struct *prev)
 		      current->comm, current->pid, preempt_count()))
 		preempt_count_set(FORK_PREEMPT_COUNT);
 
-	rq->prev_mm = NULL;
-
 	/*
 	 * A task struct has one reference for the use as "current".
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
@@ -4824,12 +4960,7 @@  static struct rq *finish_task_switch(struct task_struct *prev)
 
 	fire_sched_in_preempt_notifiers(current);
 
-	/*
-	 * If an architecture needs to take a specific action for
-	 * SYNC_CORE, it can do so in switch_mm_irqs_off().
-	 */
-	if (mm)
-		mmdrop(mm);
+	mmdrop_lazy(rq);
 
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
@@ -4891,36 +5022,55 @@  context_switch(struct rq *rq, struct task_struct *prev,
 	 */
 	arch_start_context_switch(prev);
 
+	/*
+	 * Sanity check: if something went wrong and the previous mm was
+	 * freed while we were still using it, KASAN might not notice
+	 * without help.
+	 */
+	kasan_check_byte(prev->active_mm);
+
 	/*
 	 * kernel -> kernel   lazy + transfer active
-	 *   user -> kernel   lazy + mmgrab() active
+	 *   user -> kernel   lazy + lazy_mm grab active
 	 *
-	 * kernel ->   user   switch + mmdrop() active
+	 * kernel ->   user   switch + lazy_mm release active
 	 *   user ->   user   switch
 	 */
 	if (!next->mm) {                                // to kernel
 		enter_lazy_tlb(prev->active_mm, next);
 
 		next->active_mm = prev->active_mm;
-		if (prev->mm)                           // from user
-			mmgrab(prev->active_mm);
-		else
+		if (prev->mm) {                         // from user
+			SCHED_WARN_ON(rq->lazy_mm);
+
+			/*
+			 * Acquire a lazy_mm reference to the active
+			 * (lazy) mm.  No explicit barrier needed: we still
+			 * hold an explicit (mm_users) reference.  __mmput()
+			 * can't be called until we call mmput() to drop
+			 * our reference, and __mmput() is a release barrier.
+			 */
+			WRITE_ONCE(rq->lazy_mm, next->active_mm);
+		} else {
 			prev->active_mm = NULL;
+		}
 	} else {                                        // to user
 		membarrier_switch_mm(rq, prev->active_mm, next->mm);
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
 		/*
-		 * sys_membarrier() requires an smp_mb() between setting
-		 * rq->curr->mm to a membarrier-enabled mm and returning
-		 * to userspace.
+		 * An arch implementation of for_each_possible_lazymm_cpu()
+		 * may skip this CPU now that we have switched away from
+		 * prev->active_mm, so we must not reference it again.
 		 */
+
 		membarrier_finish_switch_mm(next->mm);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop() in finish_task_switch(). */
-			rq->prev_mm = prev->active_mm;
 			prev->active_mm = NULL;
+
+			/* Drop our lazy_mm reference to the old lazy mm. */
+			smp_store_release(&rq->lazy_mm, NULL);
 		}
 	}
 
@@ -4938,7 +5088,8 @@  context_switch(struct rq *rq, struct task_struct *prev,
 void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new)
 {
 	struct task_struct *tsk = current;
-	struct mm_struct *old_active_mm, *mm_to_drop = NULL;
+	struct mm_struct *old_active_mm;
+	bool was_kernel;
 
 	BUG_ON(!mm);	/* likely to cause corruption if we continue */
 
@@ -4958,12 +5109,9 @@  void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new)
 	if (tsk->mm) {
 		/* We're detaching from an old mm.  Sync stats. */
 		sync_mm_rss(tsk->mm);
+		was_kernel = false;
 	} else {
-		/*
-		 * Switching from kernel mm to user.  Drop the old lazy
-		 * mm reference.
-		 */
-		mm_to_drop = tsk->active_mm;
+		was_kernel = true;
 	}
 
 	old_active_mm = tsk->active_mm;
@@ -4992,6 +5140,10 @@  void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new)
 
 	membarrier_finish_switch_mm(mm);
 	vmacache_flush(tsk);
+
+	if (was_kernel)
+		smp_store_release(&this_rq()->lazy_mm, NULL);
+
 	task_unlock(tsk);
 
 #ifdef finish_arch_post_lock_switch
@@ -5009,9 +5161,6 @@  void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new)
 		finish_arch_post_lock_switch();
 	}
 #endif
-
-	if (mm_to_drop)
-		mmdrop(mm_to_drop);
 }
 
 void __change_current_mm_to_kernel(void)
@@ -5044,8 +5193,17 @@  void __change_current_mm_to_kernel(void)
 	membarrier_update_current_mm(NULL);
 	vmacache_flush(tsk);
 
-	/* active_mm is still 'old_mm' */
-	mmgrab(old_mm);
+	/*
+	 * active_mm is still 'old_mm'
+	 *
+	 * Acquire a lazy_mm reference to the active (lazy) mm.  As in
+	 * context_switch(), no explicit barrier needed: we still hold an
+	 * explicit (mm_users) reference.  __mmput() can't be called until we
+	 * call mmput() to drop our reference, and __mmput() is a release
+	 * barrier.
+	 */
+	WRITE_ONCE(this_rq()->lazy_mm, old_mm);
+
 	enter_lazy_tlb(old_mm, tsk);
 
 	local_irq_enable();
@@ -8805,6 +8963,7 @@  void __init init_idle(struct task_struct *idle, int cpu)
 void unlazy_mm_irqs_off(void)
 {
 	struct mm_struct *mm = current->active_mm;
+	struct rq *rq = this_rq();
 
 	lockdep_assert_irqs_disabled();
 
@@ -8815,10 +8974,17 @@  void unlazy_mm_irqs_off(void)
 		return;
 
 	switch_mm_irqs_off(mm, &init_mm, current);
-	mmgrab(&init_mm);
 	current->active_mm = &init_mm;
+
+	/*
+	 * We don't need a lazy reference to init_mm -- it's not about
+	 * to go away.
+	 */
+	smp_store_release(&rq->lazy_mm, NULL);
+
 	finish_arch_post_lock_switch();
-	mmdrop(mm);
+
+	mmdrop_lazy(rq);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b496a9ee9aec..1010e63962d9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -977,7 +977,15 @@  struct rq {
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
-	struct mm_struct	*prev_mm;
+
+	/*
+	 * Fast refcounting scheme for lazy mm.  lazy_mm is a hazard pointer:
+	 * setting it to point to a lazily used mm keeps that mm from being
+	 * freed.  drop_mm points to an mm that needs an mmdrop() call
+	 * after the CPU owning the rq is done with it.
+	 */
+	struct mm_struct	*lazy_mm;
+	struct mm_struct	*drop_mm;
 
 	unsigned int		clock_update_flags;
 	u64			clock;