Message ID | 20201020134715.13909-2-mathieu.desnoyers@efficios.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/3] sched: fix exit_mm vs membarrier (v4) | expand |
On Tue, Oct 20, 2020 at 09:47:13AM -0400, Mathieu Desnoyers wrote: > +void membarrier_update_current_mm(struct mm_struct *next_mm) > +{ > + struct rq *rq = this_rq(); > + int membarrier_state = 0; > + > + if (next_mm) > + membarrier_state = atomic_read(&next_mm->membarrier_state); > + if (READ_ONCE(rq->membarrier_state) == membarrier_state) > + return; > + WRITE_ONCE(rq->membarrier_state, membarrier_state); > +} This is suspisioucly similar to membarrier_switch_mm(). Would something like so make sense? --- --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -206,14 +206,7 @@ void membarrier_exec_mmap(struct mm_stru void membarrier_update_current_mm(struct mm_struct *next_mm) { - struct rq *rq = this_rq(); - int membarrier_state = 0; - - if (next_mm) - membarrier_state = atomic_read(&next_mm->membarrier_state); - if (READ_ONCE(rq->membarrier_state) == membarrier_state) - return; - WRITE_ONCE(rq->membarrier_state, membarrier_state); + membarrier_switch_mm(this_rq(), NULL, next_mm); } static int membarrier_global_expedited(void) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d2621155393c..3d589c2ffd28 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2645,12 +2645,14 @@ static inline void membarrier_switch_mm(struct rq *rq, struct mm_struct *prev_mm, struct mm_struct *next_mm) { - int membarrier_state; + int membarrier_state = 0; if (prev_mm == next_mm) return; - membarrier_state = atomic_read(&next_mm->membarrier_state); + if (next_mm) + membarrier_state = atomic_read(&next_mm->membarrier_state); + if (READ_ONCE(rq->membarrier_state) == membarrier_state) return;
----- On Oct 20, 2020, at 10:36 AM, Peter Zijlstra peterz@infradead.org wrote: > On Tue, Oct 20, 2020 at 09:47:13AM -0400, Mathieu Desnoyers wrote: >> +void membarrier_update_current_mm(struct mm_struct *next_mm) >> +{ >> + struct rq *rq = this_rq(); >> + int membarrier_state = 0; >> + >> + if (next_mm) >> + membarrier_state = atomic_read(&next_mm->membarrier_state); >> + if (READ_ONCE(rq->membarrier_state) == membarrier_state) >> + return; >> + WRITE_ONCE(rq->membarrier_state, membarrier_state); >> +} > > This is suspisioucly similar to membarrier_switch_mm(). > > Would something like so make sense? Very much yes. Do you want me to re-send the series, or you want to fold this in as you merge it ? Thanks, Mathieu > > --- > --- a/kernel/sched/membarrier.c > +++ b/kernel/sched/membarrier.c > @@ -206,14 +206,7 @@ void membarrier_exec_mmap(struct mm_stru > > void membarrier_update_current_mm(struct mm_struct *next_mm) > { > - struct rq *rq = this_rq(); > - int membarrier_state = 0; > - > - if (next_mm) > - membarrier_state = atomic_read(&next_mm->membarrier_state); > - if (READ_ONCE(rq->membarrier_state) == membarrier_state) > - return; > - WRITE_ONCE(rq->membarrier_state, membarrier_state); > + membarrier_switch_mm(this_rq(), NULL, next_mm); > } > > static int membarrier_global_expedited(void) > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index d2621155393c..3d589c2ffd28 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -2645,12 +2645,14 @@ static inline void membarrier_switch_mm(struct rq *rq, > struct mm_struct *prev_mm, > struct mm_struct *next_mm) > { > - int membarrier_state; > + int membarrier_state = 0; > > if (prev_mm == next_mm) > return; > > - membarrier_state = atomic_read(&next_mm->membarrier_state); > + if (next_mm) > + membarrier_state = atomic_read(&next_mm->membarrier_state); > + > if (READ_ONCE(rq->membarrier_state) == membarrier_state) > return;
Hi, On Tue, Oct 20, 2020 at 10:59:58AM -0400, Mathieu Desnoyers wrote: > ----- On Oct 20, 2020, at 10:36 AM, Peter Zijlstra peterz@infradead.org wrote: > > > On Tue, Oct 20, 2020 at 09:47:13AM -0400, Mathieu Desnoyers wrote: > >> +void membarrier_update_current_mm(struct mm_struct *next_mm) > >> +{ > >> + struct rq *rq = this_rq(); > >> + int membarrier_state = 0; > >> + > >> + if (next_mm) > >> + membarrier_state = atomic_read(&next_mm->membarrier_state); > >> + if (READ_ONCE(rq->membarrier_state) == membarrier_state) > >> + return; > >> + WRITE_ONCE(rq->membarrier_state, membarrier_state); > >> +} > > > > This is suspisioucly similar to membarrier_switch_mm(). > > > > Would something like so make sense? > > Very much yes. Do you want me to re-send the series, or you > want to fold this in as you merge it ? > > Thanks, > > Mathieu > > > > > --- > > --- a/kernel/sched/membarrier.c > > +++ b/kernel/sched/membarrier.c > > @@ -206,14 +206,7 @@ void membarrier_exec_mmap(struct mm_stru > > > > void membarrier_update_current_mm(struct mm_struct *next_mm) > > { > > - struct rq *rq = this_rq(); > > - int membarrier_state = 0; > > - > > - if (next_mm) > > - membarrier_state = atomic_read(&next_mm->membarrier_state); > > - if (READ_ONCE(rq->membarrier_state) == membarrier_state) > > - return; > > - WRITE_ONCE(rq->membarrier_state, membarrier_state); > > + membarrier_switch_mm(this_rq(), NULL, next_mm); > > } > > > > static int membarrier_global_expedited(void) > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > index d2621155393c..3d589c2ffd28 100644 > > --- a/kernel/sched/sched.h > > +++ b/kernel/sched/sched.h > > @@ -2645,12 +2645,14 @@ static inline void membarrier_switch_mm(struct rq *rq, > > struct mm_struct *prev_mm, > > struct mm_struct *next_mm) > > { > > - int membarrier_state; > > + int membarrier_state = 0; > > > > if (prev_mm == next_mm) Unless I'm missing something subtle, in exit_mm(), membarrier_update_current_mm() is called with @next_mm == NULL, and inside membarrier_update_current_mm(), membarrier_switch_mm() is called wiht @prev_mm == NULL. As a result, the branch above is taken, so membarrier_update_current_mm() becomes a nop. I think we should use the previous value of current->mm as the @prev_mm, something like below maybe? void update_current_mm(struct mm_struct *next_mm) { struct mm_struct *prev_mm; unsigned long flags; local_irq_save(flags); prev_mm = current->mm; current->mm = next_mm; membarrier_switch_mm(this_rq(), prev_mm, next_mm); local_irq_restore(flags); } , and replace all settings for "current->mm" in kernel with update_current_mm(). Thoughts? Regards, Boqun > > return; > > > > - membarrier_state = atomic_read(&next_mm->membarrier_state); > > + if (next_mm) > > + membarrier_state = atomic_read(&next_mm->membarrier_state); > > + > > if (READ_ONCE(rq->membarrier_state) == membarrier_state) > > return; > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index f889e332912f..5dd7f56baaba 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -370,6 +370,8 @@ static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) extern void membarrier_exec_mmap(struct mm_struct *mm); +extern void membarrier_update_current_mm(struct mm_struct *next_mm); + #else #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS static inline void membarrier_arch_switch_mm(struct mm_struct *prev, @@ -384,6 +386,9 @@ static inline void membarrier_exec_mmap(struct mm_struct *mm) static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { } +static inline void membarrier_update_current_mm(struct mm_struct *next_mm) +{ +} #endif #endif /* _LINUX_SCHED_MM_H */ diff --git a/kernel/exit.c b/kernel/exit.c index 733e80f334e7..18ca74c07085 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -475,10 +475,24 @@ static void exit_mm(void) BUG_ON(mm != current->active_mm); /* more a memory barrier than a real lock */ task_lock(current); + /* + * When a thread stops operating on an address space, the loop + * in membarrier_private_expedited() may not observe that + * tsk->mm, and the loop in membarrier_global_expedited() may + * not observe a MEMBARRIER_STATE_GLOBAL_EXPEDITED + * rq->membarrier_state, so those would not issue an IPI. + * Membarrier requires a memory barrier after accessing + * user-space memory, before clearing tsk->mm or the + * rq->membarrier_state. + */ + smp_mb__after_spinlock(); + local_irq_disable(); current->mm = NULL; - mmap_read_unlock(mm); + membarrier_update_current_mm(NULL); enter_lazy_tlb(mm, current); + local_irq_enable(); task_unlock(current); + mmap_read_unlock(mm); mm_update_next_owner(mm); mmput(mm); if (test_thread_flag(TIF_MEMDIE)) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 168479a7d61b..8bc8b8a888b7 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -63,6 +63,18 @@ void membarrier_exec_mmap(struct mm_struct *mm) this_cpu_write(runqueues.membarrier_state, 0); } +void membarrier_update_current_mm(struct mm_struct *next_mm) +{ + struct rq *rq = this_rq(); + int membarrier_state = 0; + + if (next_mm) + membarrier_state = atomic_read(&next_mm->membarrier_state); + if (READ_ONCE(rq->membarrier_state) == membarrier_state) + return; + WRITE_ONCE(rq->membarrier_state, membarrier_state); +} + static int membarrier_global_expedited(void) { int cpu;
exit_mm should issue memory barriers after user-space memory accesses, before clearing current->mm, to order user-space memory accesses performed prior to exit_mm before clearing tsk->mm, which has the effect of skipping the membarrier private expedited IPIs. exit_mm should also update the runqueue's membarrier_state so membarrier global expedited IPIs are not sent when they are not needed. The membarrier system call can be issued concurrently with do_exit if we have thread groups created with CLONE_VM but not CLONE_THREAD. Here is the scenario I have in mind: Two thread groups are created, A and B. Thread group B is created by issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD. Let's assume we have a single thread within each thread group (Thread A and Thread B). The AFAIU we can have: Userspace variables: int x = 0, y = 0; CPU 0 CPU 1 Thread A Thread B (in thread group A) (in thread group B) x = 1 barrier() y = 1 exit() exit_mm() current->mm = NULL; r1 = load y membarrier() skips CPU 0 (no IPI) because its current mm is NULL r2 = load x BUG_ON(r1 == 1 && r2 == 0) Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: linux-mm@kvack.org --- Changes since v1: - Use smp_mb__after_spinlock rather than smp_mb. - Document race scenario in commit message. Changes since v2: - Introduce membarrier_update_current_mm, - Use membarrier_update_current_mm to update rq's membarrier_state from exit_mm. Changes since v3: - Disable interrupts around call to membarrier_update_current_mm, which is required to access the runqueue's fields. --- include/linux/sched/mm.h | 5 +++++ kernel/exit.c | 16 +++++++++++++++- kernel/sched/membarrier.c | 12 ++++++++++++ 3 files changed, 32 insertions(+), 1 deletion(-)