| Message ID | c1bc25b895213921cf36f4ee4ba07c1791fb631e.1641659630.git.luto@kernel.org |
| --- | --- |
| State | New |
| Series | mm, sched: Rework lazy mm handling |
----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> membarrier() needs a barrier after any CPU changes mm. There is currently
> a comment explaining why this barrier probably exists in all cases. The
> logic is based on ensuring that the barrier exists on every control flow
> path through the scheduler. It also relies on mmgrab() and mmdrop() being
> full barriers.
>
> mmgrab() and mmdrop() would be better if they were not full barriers. As a
> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> could use a release on architectures that have these operations. Larger
> optimizations are also in the works. Doing any of these optimizations
> while preserving an unnecessary barrier will complicate the code and
> penalize non-membarrier-using tasks.
>
> Simplify the logic by adding an explicit barrier, and allow architectures
> to override it as an optimization if they want to.
>
> One of the deleted comments in this patch said "It is therefore
> possible to schedule between user->kernel->user threads without
> passing through switch_mm()". It is possible to do this without, say,
> writing to CR3 on x86, but the core scheduler indeed calls
> switch_mm_irqs_off() to tell the arch code to go back from lazy mode
> to no-lazy mode.
>
> The membarrier_finish_switch_mm() call in exec_mmap() is a no-op so long as
> there is no way for a newly execed program to register for membarrier prior
> to running user code. Subsequent patches will merge the exec_mmap() code
> with the kthread_use_mm() code, though, and keeping the paths consistent
> will make the result more comprehensible.

I don't agree with the approach here. Adding additional memory barrier
overhead for the sake of possible future optimization work is IMHO not an
appropriate justification for a performance regression.

However, I think we can manage to allow forward progress on optimization of
mmgrab/mmdrop without hurting performance. One possible approach would be to
introduce smp_mb__{before,after}_{mmgrab,mmdrop}, which would initially be
no-ops. Those could be used by membarrier without introducing any overhead,
and would allow the implicit barriers to be moved gradually from
mmgrab/mmdrop to smp_mb__{before,after}_{mmgrab,mmdrop} on a
per-architecture basis.

Thanks,

Mathieu

> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  fs/exec.c                |  1 +
>  include/linux/sched/mm.h | 18 ++++++++++++++++++
>  kernel/kthread.c         | 12 +-----------
>  kernel/sched/core.c      | 34 +++++++++-------------------------
>  4 files changed, 29 insertions(+), 36 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index a098c133d8d7..3abbd0294e73 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1019,6 +1019,7 @@ static int exec_mmap(struct mm_struct *mm)
>          activate_mm(active_mm, mm);
>          if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
>                  local_irq_enable();
> +        membarrier_finish_switch_mm(mm);
>          tsk->mm->vmacache_seqnum = 0;
>          vmacache_flush(tsk);
>          task_unlock(tsk);
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 0df706c099e5..e8919995d8dd 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -349,6 +349,20 @@ extern void membarrier_exec_mmap(struct mm_struct *mm);
>
>  extern void membarrier_update_current_mm(struct mm_struct *next_mm);
>
> +/*
> + * Called by the core scheduler after calling switch_mm_irqs_off().
> + * Architectures that have implicit barriers when switching mms can
> + * override this as an optimization.
> + */
> +#ifndef membarrier_finish_switch_mm
> +static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
> +{
> +        if (atomic_read(&mm->membarrier_state) &
> +            (MEMBARRIER_STATE_GLOBAL_EXPEDITED | MEMBARRIER_STATE_PRIVATE_EXPEDITED))
> +                smp_mb();
> +}
> +#endif
> +
>  #else
>  static inline void membarrier_exec_mmap(struct mm_struct *mm)
>  {
> @@ -356,6 +370,10 @@ static inline void membarrier_exec_mmap(struct mm_struct *mm)
>  static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
>  {
>  }
> +static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
> +{
> +}
> +
>  #endif
>
>  #endif /* _LINUX_SCHED_MM_H */
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 5b37a8567168..396ae78a1a34 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1361,25 +1361,15 @@ void kthread_use_mm(struct mm_struct *mm)
>          tsk->mm = mm;
>          membarrier_update_current_mm(mm);
>          switch_mm_irqs_off(active_mm, mm, tsk);
> +        membarrier_finish_switch_mm(mm);
>          local_irq_enable();
>          task_unlock(tsk);
>  #ifdef finish_arch_post_lock_switch
>          finish_arch_post_lock_switch();
>  #endif
>
> -        /*
> -         * When a kthread starts operating on an address space, the loop
> -         * in membarrier_{private,global}_expedited() may not observe
> -         * that tsk->mm, and not issue an IPI. Membarrier requires a
> -         * memory barrier after storing to tsk->mm, before accessing
> -         * user-space memory. A full memory barrier for membarrier
> -         * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> -         * mmdrop(), or explicitly with smp_mb().
> -         */
>          if (active_mm != mm)
>                  mmdrop(active_mm);
> -        else
> -                smp_mb();
>
>          to_kthread(tsk)->oldfs = force_uaccess_begin();
>  }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6a1db8264c7b..917068b0a145 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4824,14 +4824,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>          fire_sched_in_preempt_notifiers(current);
>
>          /*
> -         * When switching through a kernel thread, the loop in
> -         * membarrier_{private,global}_expedited() may have observed that
> -         * kernel thread and not issued an IPI. It is therefore possible to
> -         * schedule between user->kernel->user threads without passing though
> -         * switch_mm(). Membarrier requires a barrier after storing to
> -         * rq->curr, before returning to userspace, and mmdrop() provides
> -         * this barrier.
> -         *
>           * If an architecture needs to take a specific action for
>           * SYNC_CORE, it can do so in switch_mm_irqs_off().
>           */
> @@ -4915,15 +4907,14 @@ context_switch(struct rq *rq, struct task_struct *prev,
>                          prev->active_mm = NULL;
>          } else {                                        // to user
>                  membarrier_switch_mm(rq, prev->active_mm, next->mm);
> +                switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
>                  /*
>                   * sys_membarrier() requires an smp_mb() between setting
> -                 * rq->curr / membarrier_switch_mm() and returning to userspace.
> -                 *
> -                 * The below provides this either through switch_mm(), or in
> -                 * case 'prev->active_mm == next->mm' through
> -                 * finish_task_switch()'s mmdrop().
> +                 * rq->curr->mm to a membarrier-enabled mm and returning
> +                 * to userspace.
>                   */
> -                switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +                membarrier_finish_switch_mm(next->mm);
>
>                  if (!prev->mm) {                        // from kernel
>                          /* will mmdrop() in finish_task_switch(). */
> @@ -6264,17 +6255,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>                  RCU_INIT_POINTER(rq->curr, next);
>                  /*
>                   * The membarrier system call requires each architecture
> -                 * to have a full memory barrier after updating
> -                 * rq->curr, before returning to user-space.
> -                 *
> -                 * Here are the schemes providing that barrier on the
> -                 * various architectures:
> -                 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
> -                 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
> -                 * - finish_lock_switch() for weakly-ordered
> -                 *   architectures where spin_unlock is a full barrier,
> -                 * - switch_to() for arm64 (weakly-ordered, spin_unlock
> -                 *   is a RELEASE barrier),
> +                 * to have a full memory barrier before and after updating
> +                 * rq->curr->mm, before returning to userspace. This
> +                 * is provided by membarrier_finish_switch_mm(). Architectures
> +                 * that want to optimize this can override that function.
>                   */
>                  ++*switch_count;
>
> --
> 2.33.1
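For readers who want to picture the alternative suggested in the reply above, here is a rough sketch of what no-op smp_mb__{before,after}_{mmgrab,mmdrop} wrappers and one possible call site could look like. The wrapper names come from the reply itself; the definitions and the kthread_use_mm() usage shown here are illustrative assumptions, not code from this thread.

```c
/*
 * Hypothetical sketch only -- not part of the posted series.
 *
 * The wrappers start out as no-ops because mmgrab()/mmdrop() are full
 * barriers on every architecture today.  An architecture that later
 * relaxes its mmgrab()/mmdrop() would redefine the matching wrapper as
 * smp_mb(), preserving the ordering that membarrier relies on without
 * penalizing every other mmgrab()/mmdrop() caller.
 */
#ifndef smp_mb__after_mmgrab
# define smp_mb__after_mmgrab()         do { } while (0)
#endif
#ifndef smp_mb__after_mmdrop
# define smp_mb__after_mmdrop()         do { } while (0)
#endif

/*
 * Possible use in kthread_use_mm(): membarrier needs a full barrier
 * after the store to tsk->mm and before touching user memory.  Today
 * that barrier comes from mmdrop(); the wrapper makes the dependency
 * explicit and gives each architecture a place to move it later.
 */
        if (active_mm != mm) {
                mmdrop(active_mm);
                smp_mb__after_mmdrop(); /* no-op while mmdrop() is a full barrier */
        } else {
                smp_mb();
        }
```

The point of the no-op form is that the membarrier paths can document exactly where they depend on mmgrab()/mmdrop() ordering today, so a later per-architecture relaxation only has to flip the wrapper definitions rather than re-audit the scheduler.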