[00/23] mm, sched: Rework lazy mm handling

Message ID cover.1641659630.git.luto@kernel.org (mailing list archive)

Message

Andy Lutomirski Jan. 8, 2022, 4:43 p.m. UTC
Hi all-

Sorry I've been sitting on this for so long.  I think it's in decent
shape: it has no *known* bugs, and I think it's time to get the show on
the road.
This series needs more eyeballs, too.

The overall point of this series is to get rid of the scalability
problem with mm_count, and my goal is to solve it once and for all,
for all architectures, in a way that doesn't leave any gotchas for
unwary users of ->active_mm.

Most of this series is just cleanup, though: mmgrab(), mmdrop(), and
->active_mm are a mess.  A number of ->active_mm users are simply
wrong.  kthread lazy mm handling is inconsistent with user thread lazy
mm handling (by accident, as far as I can tell).  membarrier() relies
on the barrier semantics of mmdrop() and mmgrab(), so anything that
gets rid of those barriers risks breaking membarrier().  And x86 is
sometimes non-lazy while the core scheduler thinks it's lazy, because
the core mm code offers no mechanism for x86 to tell it that a CPU has
exited lazy mode.
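
For anyone who hasn't stared at this code recently, here is a rough,
heavily abridged sketch of the traditional lazy-mm pattern in the
context-switch path.  It is illustrative only, not the real
kernel/sched/core.c code, but the helpers (mmgrab(), mmdrop(),
enter_lazy_tlb(), switch_mm_irqs_off()) are the real ones, and it shows
where the mm_count atomics that cause the scalability problem come from:

/*
 * Illustrative sketch, not actual kernel code: the classic lazy-mm
 * refcounting.  Every lazy borrow bumps mm_count with mmgrab(); every
 * lazy release drops it with mmdrop().  On big machines all of those
 * atomics land on the same cacheline of the mm_struct.
 */
static void sketch_context_switch_mm(struct task_struct *prev,
                                     struct task_struct *next)
{
        if (!next->mm) {
                /* kernel thread: borrow the previous mm lazily */
                next->active_mm = prev->active_mm;
                mmgrab(prev->active_mm);        /* atomic_inc(&mm->mm_count) */
                enter_lazy_tlb(prev->active_mm, next);
        } else {
                /* user thread: switch to its real mm */
                switch_mm_irqs_off(prev->active_mm, next->mm, next);
        }

        if (!prev->mm) {
                /* prev was lazy; drop the borrowed reference */
                struct mm_struct *old_mm = prev->active_mm;

                prev->active_mm = NULL;
                mmdrop(old_mm);         /* atomic_dec_and_test(), maybe free */
        }
}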

So most of the patches here do that cleanup: bogus users of ->active_mm
are fixed, and membarrier() is reworked so that its barriers are
explicit instead of depending on mmdrop() and mmgrab().  x86 lazy mm
handling is extensively tidied up, and x86's EFI mm code gets some
attention too.  I think I've done all of this in a way that introduces
little or no overhead.


Additionally, all the code paths that change current->mm are consolidated
so that there is only one path to start using an mm and only one path
to stop using it.
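
To make that concrete: patch 15 below switches kthreads to a helper
called __change_current_mm().  Purely as an illustration of the shape
of such a single entry point (the signature and body here are my
guesses, loosely mirroring the kthread_use_mm() hunk quoted further
down, not the series' actual code):

/*
 * Illustration only -- not the real __change_current_mm() from this
 * series.  It just shows the idea that exec and kthread_use_mm() can
 * funnel through one place that installs a new current->mm.
 */
static void sketch_change_current_mm(struct mm_struct *mm)
{
        struct task_struct *tsk = current;
        struct mm_struct *old_active_mm = tsk->active_mm;

        local_irq_disable();
        WRITE_ONCE(tsk->mm, mm);        /* membarrier reads this without locks */
        tsk->active_mm = mm;
        membarrier_update_current_mm(mm);
        switch_mm_irqs_off(old_active_mm, mm, tsk);
        membarrier_finish_switch_mm(mm);
        local_irq_enable();
}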

Once that's done, the actual meat (the hazard pointers) isn't so bad,
and the x86 optimization on top of it, which should eliminate scanning
of remote CPUs in __mmput(), is about two lines of code.  Other
architectures with sufficiently accurate mm_cpumask() tracking should
be able to do the same thing.
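
For a concrete picture of what "hazard pointers" means here, before
reading the patch itself: roughly, each CPU publishes the mm it is
using lazily instead of pinning it with mm_count, and the freeing side
scans for those published pointers before the mm can go away.  The
sketch below is my own simplified model -- the per-CPU variable and
helpers are invented names, not the series' code.  The real series has
a for_each_possible_lazymm_cpu() helper (see the patch list), and the
x86 patch restricts that walk to mm_cpumask(mm), which is the two-line
optimization mentioned above:

/* Illustration only; the names below are invented, not from the series. */
DEFINE_PER_CPU(struct mm_struct *, lazy_mm_hazard);

/* Scheduler side: publish the lazily borrowed mm instead of mmgrab()ing it. */
static void sketch_start_lazy(struct mm_struct *mm)
{
        this_cpu_write(lazy_mm_hazard, mm);
}

/*
 * Freeing side (conceptually, __mmput()/exit): the mm cannot be freed
 * while any CPU still advertises a lazy reference to it.  On x86 the
 * loop could walk mm_cpumask(mm) instead of every possible CPU.
 */
static bool sketch_mm_has_lazy_users(struct mm_struct *mm)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                if (READ_ONCE(per_cpu(lazy_mm_hazard, cpu)) == mm)
                        return true;    /* someone still holds a lazy reference */
        }
        return false;
}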

akpm, this is intended to mostly replace Nick Piggin's lazy shootdown
series.  This series implements lazy shootdown on x86 implicitly, and
powerpc should be able to do the same thing in just a couple of lines
of code if it wants to.  The result is IMO much cleaner and more
maintainable.

Once this is all reviewed, I'm hoping it can go in -tip (and -next) after
the merge window or go in -mm.  This is not intended for v5.16.  I suspect
-tip is easier in case other arch maintainers want to optimize their
code in the same release.

Andy Lutomirski (23):
  membarrier: Document why membarrier() works
  x86/mm: Handle unlazying membarrier core sync in the arch code
  membarrier: Remove membarrier_arch_switch_mm() prototype in core code
  membarrier: Make the post-switch-mm barrier explicit
  membarrier, kthread: Use _ONCE accessors for task->mm
  powerpc/membarrier: Remove special barrier on mm switch
  membarrier: Rewrite sync_core_before_usermode() and improve
    documentation
  membarrier: Remove redundant clear of mm->membarrier_state in
    exec_mmap()
  membarrier: Fix incorrect barrier positions during exec and
    kthread_use_mm()
  x86/events, x86/insn-eval: Remove incorrect active_mm references
  sched/scs: Initialize shadow stack on idle thread bringup, not
    shutdown
  Rework "sched/core: Fix illegal RCU from offline CPUs"
  exec: Remove unnecessary vmacache_seqnum clear in exec_mmap()
  sched, exec: Factor current mm changes out from exec
  kthread: Switch to __change_current_mm()
  sched: Use lightweight hazard pointers to grab lazy mms
  x86/mm: Make use/unuse_temporary_mm() non-static
  x86/mm: Allow temporary mms when IRQs are on
  x86/efi: Make efi_enter/leave_mm use the temporary_mm machinery
  x86/mm: Remove leave_mm() in favor of unlazy_mm_irqs_off()
  x86/mm: Use unlazy_mm_irqs_off() in TLB flush IPIs
  x86/mm: Optimize for_each_possible_lazymm_cpu()
  x86/mm: Opt in to IRQs-off activate_mm()

 .../membarrier-sync-core/arch-support.txt     |  69 +--
 arch/arm/include/asm/membarrier.h             |  21 +
 arch/arm/kernel/smp.c                         |   2 -
 arch/arm64/include/asm/membarrier.h           |  19 +
 arch/arm64/kernel/smp.c                       |   2 -
 arch/csky/kernel/smp.c                        |   2 -
 arch/ia64/kernel/process.c                    |   1 -
 arch/mips/cavium-octeon/smp.c                 |   1 -
 arch/mips/kernel/smp-bmips.c                  |   2 -
 arch/mips/kernel/smp-cps.c                    |   1 -
 arch/mips/loongson64/smp.c                    |   2 -
 arch/powerpc/include/asm/membarrier.h         |  28 +-
 arch/powerpc/mm/mmu_context.c                 |   1 -
 arch/powerpc/platforms/85xx/smp.c             |   2 -
 arch/powerpc/platforms/powermac/smp.c         |   2 -
 arch/powerpc/platforms/powernv/smp.c          |   1 -
 arch/powerpc/platforms/pseries/hotplug-cpu.c  |   2 -
 arch/powerpc/platforms/pseries/pmem.c         |   1 -
 arch/riscv/kernel/cpu-hotplug.c               |   2 -
 arch/s390/kernel/smp.c                        |   1 -
 arch/sh/kernel/smp.c                          |   1 -
 arch/sparc/kernel/smp_64.c                    |   2 -
 arch/x86/Kconfig                              |   2 +-
 arch/x86/events/core.c                        |   9 +-
 arch/x86/include/asm/membarrier.h             |  25 ++
 arch/x86/include/asm/mmu.h                    |   6 +-
 arch/x86/include/asm/mmu_context.h            |  15 +-
 arch/x86/include/asm/sync_core.h              |  20 -
 arch/x86/kernel/alternative.c                 |  67 +--
 arch/x86/kernel/cpu/mce/core.c                |   2 +-
 arch/x86/kernel/smpboot.c                     |   2 -
 arch/x86/lib/insn-eval.c                      |  13 +-
 arch/x86/mm/tlb.c                             | 155 +++++--
 arch/x86/platform/efi/efi_64.c                |   9 +-
 arch/x86/xen/mmu_pv.c                         |   2 +-
 arch/xtensa/kernel/smp.c                      |   1 -
 drivers/cpuidle/cpuidle.c                     |   2 +-
 drivers/idle/intel_idle.c                     |   4 +-
 drivers/misc/sgi-gru/grufault.c               |   2 +-
 drivers/misc/sgi-gru/gruhandles.c             |   2 +-
 drivers/misc/sgi-gru/grukservices.c           |   2 +-
 fs/exec.c                                     |  28 +-
 include/linux/mmu_context.h                   |   4 +-
 include/linux/sched/hotplug.h                 |   6 -
 include/linux/sched/mm.h                      |  58 ++-
 include/linux/sync_core.h                     |  21 -
 init/Kconfig                                  |   3 -
 kernel/cpu.c                                  |  21 +-
 kernel/exit.c                                 |   2 +-
 kernel/fork.c                                 |  11 +
 kernel/kthread.c                              |  50 +--
 kernel/sched/core.c                           | 409 +++++++++++++++---
 kernel/sched/idle.c                           |   1 +
 kernel/sched/membarrier.c                     |  97 ++++-
 kernel/sched/sched.h                          |  11 +-
 55 files changed, 745 insertions(+), 482 deletions(-)
 create mode 100644 arch/arm/include/asm/membarrier.h
 create mode 100644 arch/arm64/include/asm/membarrier.h
 create mode 100644 arch/x86/include/asm/membarrier.h
 delete mode 100644 include/linux/sync_core.h

Comments

Mathieu Desnoyers Jan. 12, 2022, 3:55 p.m. UTC | #1
----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> membarrier reads cpu_rq(remote cpu)->curr->mm without locking.  Use
> READ_ONCE() and WRITE_ONCE() to remove the data races.
> 

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Acked-by: Nicholas Piggin <npiggin@gmail.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> fs/exec.c                 | 2 +-
> kernel/exit.c             | 2 +-
> kernel/kthread.c          | 4 ++--
> kernel/sched/membarrier.c | 7 ++++---
> 4 files changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 3abbd0294e73..38b05e01c5bd 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1006,7 +1006,7 @@ static int exec_mmap(struct mm_struct *mm)
> 	local_irq_disable();
> 	active_mm = tsk->active_mm;
> 	tsk->active_mm = mm;
> -	tsk->mm = mm;
> +	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
> 	/*
> 	 * This prevents preemption while active_mm is being loaded and
> 	 * it and mm are being updated, which could cause problems for
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 91a43e57a32e..70f2cbc42015 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -491,7 +491,7 @@ static void exit_mm(void)
> 	 */
> 	smp_mb__after_spinlock();
> 	local_irq_disable();
> -	current->mm = NULL;
> +	WRITE_ONCE(current->mm, NULL);
> 	membarrier_update_current_mm(NULL);
> 	enter_lazy_tlb(mm, current);
> 	local_irq_enable();
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 396ae78a1a34..3b18329f885c 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1358,7 +1358,7 @@ void kthread_use_mm(struct mm_struct *mm)
> 		mmgrab(mm);
> 		tsk->active_mm = mm;
> 	}
> -	tsk->mm = mm;
> +	WRITE_ONCE(tsk->mm, mm);  /* membarrier reads this without locks */
> 	membarrier_update_current_mm(mm);
> 	switch_mm_irqs_off(active_mm, mm, tsk);
> 	membarrier_finish_switch_mm(mm);
> @@ -1399,7 +1399,7 @@ void kthread_unuse_mm(struct mm_struct *mm)
> 	smp_mb__after_spinlock();
> 	sync_mm_rss(mm);
> 	local_irq_disable();
> -	tsk->mm = NULL;
> +	WRITE_ONCE(tsk->mm, NULL);  /* membarrier reads this without locks */
> 	membarrier_update_current_mm(NULL);
> 	/* active_mm is still 'mm' */
> 	enter_lazy_tlb(mm, tsk);
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 30e964b9689d..327830f89c37 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -411,7 +411,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
> 			goto out;
> 		rcu_read_lock();
> 		p = rcu_dereference(cpu_rq(cpu_id)->curr);
> -		if (!p || p->mm != mm) {
> +		if (!p || READ_ONCE(p->mm) != mm) {
> 			rcu_read_unlock();
> 			goto out;
> 		}
> @@ -424,7 +424,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
> 			struct task_struct *p;
> 
> 			p = rcu_dereference(cpu_rq(cpu)->curr);
> -			if (p && p->mm == mm)
> +			if (p && READ_ONCE(p->mm) == mm)
> 				__cpumask_set_cpu(cpu, tmpmask);
> 		}
> 		rcu_read_unlock();
> @@ -522,7 +522,8 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm)
> 		struct task_struct *p;
> 
> 		p = rcu_dereference(rq->curr);
> -		if (p && p->mm == mm)
> +		/* exec and kthread_use_mm() write ->mm without locks */
> +		if (p && READ_ONCE(p->mm) == mm)
> 			__cpumask_set_cpu(cpu, tmpmask);
> 	}
> 	rcu_read_unlock();
> --
> 2.33.1