Message ID | 20190729142450.GE31425@hirez.programming.kicks-ass.net (mailing list archive)
---|---
State | New, archived
Series | sched: Clean up active_mm reference counting
----- On Jul 29, 2019, at 10:24 AM, Peter Zijlstra peterz@infradead.org wrote:

[...]

> ---
> Subject: sched: Clean up active_mm reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Jul 29 16:05:15 CEST 2019
>
> The current active_mm reference counting is confusing and sub-optimal.
>
> Rewrite the code to explicitly consider the 4 separate cases:
>
>   user -> user
>
>     When switching between two user tasks, all we need to consider
>     is switch_mm().
>
>   user -> kernel
>
>     When switching from a user task to a kernel task (which
>     doesn't have an associated mm) we retain the last mm in our
>     active_mm. Increment a reference count on active_mm.
>
>   kernel -> kernel
>
>     When switching between kernel threads, all we need to do is
>     pass along the active_mm reference.
>
>   kernel -> user
>
>     When switching between a kernel and user task, we must switch
>     from the last active_mm to the next mm, hoping of course that
>     these are the same. Decrement a reference on the active_mm.
>
> The code keeps a different order, because as you'll note, both 'to
> user' cases require switch_mm().
>
> And where the old code would increment/decrement for the 'kernel ->
> kernel' case, the new code observes this is a neutral operation and
> avoids touching the reference count.

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> Cc: riel@surriel.com
> Cc: luto@kernel.org
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c | 49 ++++++++++++++++++++++++++++++-------------------
>  1 file changed, 30 insertions(+), 19 deletions(-)
>
> [full diff quoted; see the patch at the end of this page]
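A point the changelog leans on but does not spell out: the reference being incremented and decremented here is mm->mm_count, taken with mmgrab() and released with mmdrop(), not the mm->mm_users count managed by mmget()/mmput(). The fragment below is only an illustrative sketch of that pairing (kernel context, using the declarations from <linux/sched/mm.h>), not code from this thread:

#include <linux/sched/mm.h>     /* mmgrab(), mmdrop() */

/*
 * Illustrative only: the lifetime rule behind active_mm borrowing.
 * mmgrab() pins the mm_struct itself (mm_count); the user address space
 * (mm_users) may already have been torn down by the time a kernel
 * thread runs lazily on this mm.
 */
static void borrow_mm_sketch(struct mm_struct *mm)
{
        mmgrab(mm);     /* mm_count++: mm_struct stays allocated */
        /* ... run with mm as active_mm, e.g. while a kthread is on-CPU ... */
        mmdrop(mm);     /* mm_count--: frees the mm_struct on the last drop */
}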
On 7/29/19 10:24 AM, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
>
> ---
> Subject: sched: Clean up active_mm reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Jul 29 16:05:15 CEST 2019
>
> [...]
>
> And where the old code would increment/decrement for the 'kernel ->
> kernel' case, the new code observes this is a neutral operation and
> avoids touching the reference count.

I am aware of that behavior which is indeed redundant, but it is not
what I am trying to fix, and so I kind of leave it alone in my patch.

> Cc: riel@surriel.com
> Cc: luto@kernel.org
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c | 49 ++++++++++++++++++++++++++++++-------------------
>  1 file changed, 30 insertions(+), 19 deletions(-)
>
> [full diff quoted; see the patch at the end of this page]

This patch looks fine to me; I don't see any problem in its logic.

Acked-by: Waiman Long <longman@redhat.com>

The problem that I am trying to fix is in the kernel->kernel case where
the active_mm just gets passed along. I would like to just bump the
active_mm off if it is dying. I will see what I can do to make it work
even with !CONFIG_MEMCG.

Cheers,
Longman
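To make Longman's idea concrete, here is a rough sketch (not code from this thread, and not his actual patch) of how the "to kernel" branch of the rewritten context_switch() could drop a dying mm instead of passing it along. mm_dying() is a hypothetical predicate; finding a test that also works without CONFIG_MEMCG (where mm->owner does not exist) is exactly the open question he mentions:

        if (!next->mm) {                                // to kernel
                struct mm_struct *active = prev->active_mm;

                /* Hypothetical: mm_dying() stands in for "all users gone". */
                if (!prev->mm && unlikely(mm_dying(active))) {
                        /*
                         * kernel -> kernel with a dying mm: stop borrowing
                         * it, run the kthread on init_mm, and let
                         * finish_task_switch() mmdrop() the old reference.
                         */
                        mmgrab(&init_mm);
                        switch_mm_irqs_off(active, &init_mm, next);
                        next->active_mm = &init_mm;
                        prev->active_mm = NULL;
                        rq->prev_mm = active;
                } else {
                        /* Unchanged: lazy TLB plus transfer/grab as in the patch. */
                        enter_lazy_tlb(active, next);
                        next->active_mm = active;
                        if (prev->mm)                   // from user
                                mmgrab(active);
                        else
                                prev->active_mm = NULL;
                }
        }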
On Mon, Jul 29, 2019 at 11:16:55AM -0400, Waiman Long wrote:
> On 7/29/19 10:24 AM, Peter Zijlstra wrote:
> > On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
> >
> > ---
> > Subject: sched: Clean up active_mm reference counting
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Mon Jul 29 16:05:15 CEST 2019
> >
> > [...]
> >
> > And where the old code would increment/decrement for the 'kernel ->
> > kernel' case, the new code observes this is a neutral operation and
> > avoids touching the reference count.
>
> I am aware of that behavior which is indeed redundant, but it is not
> what I am trying to fix, and so I kind of leave it alone in my patch.

Oh sure; and it's not all that important either. It is just that every
time I look at that code I get confused. On top of that, the new code is
easier to rip the active_mm stuff out of, which is where it came from.
On Mon, 2019-07-29 at 16:24 +0200, Peter Zijlstra wrote:
> Subject: sched: Clean up active_mm reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Jul 29 16:05:15 CEST 2019
>
> The current active_mm reference counting is confusing and sub-optimal.
>
> Rewrite the code to explicitly consider the 4 separate cases:
>

Reviewed-by: Rik van Riel <riel@surriel.com>
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3214,12 +3214,8 @@ static __always_inline struct rq *
 context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next, struct rq_flags *rf)
 {
-        struct mm_struct *mm, *oldmm;
-
         prepare_task_switch(rq, prev, next);
 
-        mm = next->mm;
-        oldmm = prev->active_mm;
         /*
          * For paravirt, this is coupled with an exit in switch_to to
          * combine the page table reload and the switch backend into
@@ -3228,22 +3224,37 @@ context_switch(struct rq *rq, struct tas
         arch_start_context_switch(prev);
 
         /*
-         * If mm is non-NULL, we pass through switch_mm(). If mm is
-         * NULL, we will pass through mmdrop() in finish_task_switch().
-         * Both of these contain the full memory barrier required by
-         * membarrier after storing to rq->curr, before returning to
-         * user-space.
+         * kernel -> kernel   lazy + transfer active
+         *   user -> kernel   lazy + mmgrab() active
+         *
+         * kernel ->   user   switch + mmdrop() active
+         *   user ->   user   switch
          */
-        if (!mm) {
-                next->active_mm = oldmm;
-                mmgrab(oldmm);
-                enter_lazy_tlb(oldmm, next);
-        } else
-                switch_mm_irqs_off(oldmm, mm, next);
-
-        if (!prev->mm) {
-                prev->active_mm = NULL;
-                rq->prev_mm = oldmm;
+        if (!next->mm) {                                // to kernel
+                enter_lazy_tlb(prev->active_mm, next);
+
+                next->active_mm = prev->active_mm;
+                if (prev->mm)                           // from user
+                        mmgrab(prev->active_mm);
+                else
+                        prev->active_mm = NULL;
+        } else {                                        // to user
+                /*
+                 * sys_membarrier() requires an smp_mb() between setting
+                 * rq->curr and returning to userspace.
+                 *
+                 * The below provides this either through switch_mm(), or in
+                 * case 'prev->active_mm == next->mm' through
+                 * finish_task_switch()'s mmdrop().
+                 */
+
+                switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
+                if (!prev->mm) {                        // from kernel
+                        /* will mmdrop() in finish_task_switch(). */
+                        rq->prev_mm = prev->active_mm;
+                        prev->active_mm = NULL;
+                }
         }
 
         rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
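As an aside on the sys_membarrier() comment in the "to user" branch: the full barrier between storing rq->curr and returning to userspace is part of the ordering membarrier(2) relies on when it scans rq->curr on every CPU. Below is a minimal userspace illustration of the call that depends on that guarantee; it is not part of the patch, and MEMBARRIER_CMD_GLOBAL is simply the current name of the original MEMBARRIER_CMD_SHARED command:

/*
 * Userspace illustration only, not part of the patch: the system call
 * whose ordering requirement the comment above refers to.  The kernel
 * promises a full barrier after updating rq->curr and before the task
 * returns to user code, which the rewritten context_switch() provides
 * via switch_mm() or the deferred mmdrop().
 */
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
        /* Order memory accesses against every thread currently running. */
        if (syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0) == -1) {
                perror("membarrier");
                return 1;
        }
        puts("membarrier: global barrier completed");
        return 0;
}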