
sched: Clean up active_mm reference counting

Message ID 20190729142450.GE31425@hirez.programming.kicks-ass.net
State New, archived
Series sched: Clean up active_mm reference counting

Commit Message

Peter Zijlstra July 29, 2019, 2:24 p.m. UTC
On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2b037f195473..923a63262dfd 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >  	 * Both of these contain the full memory barrier required by
> >  	 * membarrier after storing to rq->curr, before returning to
> >  	 * user-space.
> > +	 *
> > +	 * If mm is NULL and oldmm is dying (!owner), we switch to
> > +	 * init_mm instead to make sure that oldmm can be freed ASAP.
> >  	 */
> > -	if (!mm) {
> > +	if (!mm && !mm_dying(oldmm)) {
> >  		next->active_mm = oldmm;
> >  		mmgrab(oldmm);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> > +		if (!mm) {
> > +			mm = &init_mm;
> > +			next->active_mm = mm;
> > +			mmgrab(mm);
> > +		}
> >  		switch_mm_irqs_off(oldmm, mm, next);
> > +	}
> >  
> >  	if (!prev->mm) {
> >  		prev->active_mm = NULL;
> 
> Bah, I see we _still_ haven't 'fixed' that code. And you're making an
> even bigger mess of it.
> 
> Let me go find where that cleanup went.

---
Subject: sched: Clean up active_mm reference counting
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Jul 29 16:05:15 CEST 2019

The current active_mm reference counting is confusing and sub-optimal.

Rewrite the code to explicitly consider the 4 separate cases:

    user -> user

	When switching between two user tasks, all we need to consider
	is switch_mm().

    user -> kernel

	When switching from a user task to a kernel task (which
	doesn't have an associated mm) we retain the last mm in our
	active_mm. Increment a reference count on active_mm.

  kernel -> kernel

	When switching between kernel threads, all we need to do is
	pass along the active_mm reference.

  kernel -> user

	When switching between a kernel and user task, we must switch
	from the last active_mm to the next mm, hoping of course that
	these are the same. Decrement a reference on the active_mm.

The code keeps a different order, because as you'll note, both 'to
user' cases require switch_mm().

And where the old code would increment/decrement for the 'kernel ->
kernel' case, the new code observes this is a neutral operation and
avoids touching the reference count.

Cc: riel@surriel.com
Cc: luto@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   49 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 19 deletions(-)
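
For reference, the other half of the pairing is the mmdrop() that
releases the reference stashed in rq->prev_mm. A trimmed sketch of how
that side roughly looks in finish_task_switch() (unrelated teardown and
accounting work elided):

static struct rq *finish_task_switch(struct task_struct *prev)
{
	struct rq *rq = this_rq();
	struct mm_struct *mm = rq->prev_mm;

	rq->prev_mm = NULL;

	/* ... lock/state finishing work elided ... */

	/*
	 * rq->prev_mm is set by context_switch() in the 'kernel -> user'
	 * case; dropping it here releases the lazy active_mm reference.
	 * mmdrop() also contains the full memory barrier that membarrier
	 * requires before returning to user-space.
	 */
	if (mm)
		mmdrop(mm);

	return rq;
}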

Comments

Mathieu Desnoyers July 29, 2019, 3:01 p.m. UTC | #1
----- On Jul 29, 2019, at 10:24 AM, Peter Zijlstra peterz@infradead.org wrote:
[...]
> ---
> Subject: sched: Clean up active_mm reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Jul 29 16:05:15 CEST 2019
> 
> The current active_mm reference counting is confusing and sub-optimal.
> 
> Rewrite the code to explicitly consider the 4 separate cases:
> 
>    user -> user
> 
>	When switching between two user tasks, all we need to consider
>	is switch_mm().
> 
>    user -> kernel
> 
>	When switching from a user task to a kernel task (which
>	doesn't have an associated mm) we retain the last mm in our
>	active_mm. Increment a reference count on active_mm.
> 
>  kernel -> kernel
> 
>	When switching between kernel threads, all we need to do is
>	pass along the active_mm reference.
> 
>  kernel -> user
> 
>	When switching between a kernel and user task, we must switch
>	from the last active_mm to the next mm, hoping of course that
>	these are the same. Decrement a reference on the active_mm.
> 
> The code keeps a different order, because as you'll note, both 'to
> user' cases require switch_mm().
> 
> And where the old code would increment/decrement for the 'kernel ->
> kernel' case, the new code observes this is a neutral operation and
> avoids touching the reference count.

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Cc: riel@surriel.com
> Cc: luto@kernel.org
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/core.c |   49 ++++++++++++++++++++++++++++++-------------------
> 1 file changed, 30 insertions(+), 19 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3214,12 +3214,8 @@ static __always_inline struct rq *
> context_switch(struct rq *rq, struct task_struct *prev,
> 	       struct task_struct *next, struct rq_flags *rf)
> {
> -	struct mm_struct *mm, *oldmm;
> -
> 	prepare_task_switch(rq, prev, next);
> 
> -	mm = next->mm;
> -	oldmm = prev->active_mm;
> 	/*
> 	 * For paravirt, this is coupled with an exit in switch_to to
> 	 * combine the page table reload and the switch backend into
> @@ -3228,22 +3224,37 @@ context_switch(struct rq *rq, struct tas
> 	arch_start_context_switch(prev);
> 
> 	/*
> -	 * If mm is non-NULL, we pass through switch_mm(). If mm is
> -	 * NULL, we will pass through mmdrop() in finish_task_switch().
> -	 * Both of these contain the full memory barrier required by
> -	 * membarrier after storing to rq->curr, before returning to
> -	 * user-space.
> +	 * kernel -> kernel   lazy + transfer active
> +	 *   user -> kernel   lazy + mmgrab() active
> +	 *
> +	 * kernel ->   user   switch + mmdrop() active
> +	 *   user ->   user   switch
> 	 */
> -	if (!mm) {
> -		next->active_mm = oldmm;
> -		mmgrab(oldmm);
> -		enter_lazy_tlb(oldmm, next);
> -	} else
> -		switch_mm_irqs_off(oldmm, mm, next);
> -
> -	if (!prev->mm) {
> -		prev->active_mm = NULL;
> -		rq->prev_mm = oldmm;
> +	if (!next->mm) {                                // to kernel
> +		enter_lazy_tlb(prev->active_mm, next);
> +
> +		next->active_mm = prev->active_mm;
> +		if (prev->mm)                           // from user
> +			mmgrab(prev->active_mm);
> +		else
> +			prev->active_mm = NULL;
> +	} else {                                        // to user
> +		/*
> +		 * sys_membarrier() requires an smp_mb() between setting
> +		 * rq->curr and returning to userspace.
> +		 *
> +		 * The below provides this either through switch_mm(), or in
> +		 * case 'prev->active_mm == next->mm' through
> +		 * finish_task_switch()'s mmdrop().
> +		 */
> +
> +		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
> +		if (!prev->mm) {                        // from kernel
> +			/* will mmdrop() in finish_task_switch(). */
> +			rq->prev_mm = prev->active_mm;
> +			prev->active_mm = NULL;
> +		}
> 	}
> 
>  	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
Waiman Long July 29, 2019, 3:16 p.m. UTC | #2
On 7/29/19 10:24 AM, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
>
> ---
> Subject: sched: Clean up active_mm reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Jul 29 16:05:15 CEST 2019
>
> The current active_mm reference counting is confusing and sub-optimal.
>
> Rewrite the code to explicitly consider the 4 separate cases:
>
>     user -> user
>
> 	When switching between two user tasks, all we need to consider
> 	is switch_mm().
>
>     user -> kernel
>
> 	When switching from a user task to a kernel task (which
> 	doesn't have an associated mm) we retain the last mm in our
> 	active_mm. Increment a reference count on active_mm.
>
>   kernel -> kernel
>
> 	When switching between kernel threads, all we need to do is
> 	pass along the active_mm reference.
>
>   kernel -> user
>
> 	When switching between a kernel and user task, we must switch
> 	from the last active_mm to the next mm, hoping of course that
> 	these are the same. Decrement a reference on the active_mm.
>
> The code keeps a different order, because as you'll note, both 'to
> user' cases require switch_mm().
>
> And where the old code would increment/decrement for the 'kernel ->
> kernel' case, the new code observes this is a neutral operation and
> avoids touching the reference count.

I am aware of that behavior, which is indeed redundant, but it is not
what I am trying to fix, so I left it alone in my patch.


>
> Cc: riel@surriel.com
> Cc: luto@kernel.org
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c |   49 ++++++++++++++++++++++++++++++-------------------
>  1 file changed, 30 insertions(+), 19 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3214,12 +3214,8 @@ static __always_inline struct rq *
>  context_switch(struct rq *rq, struct task_struct *prev,
>  	       struct task_struct *next, struct rq_flags *rf)
>  {
> -	struct mm_struct *mm, *oldmm;
> -
>  	prepare_task_switch(rq, prev, next);
>  
> -	mm = next->mm;
> -	oldmm = prev->active_mm;
>  	/*
>  	 * For paravirt, this is coupled with an exit in switch_to to
>  	 * combine the page table reload and the switch backend into
> @@ -3228,22 +3224,37 @@ context_switch(struct rq *rq, struct tas
>  	arch_start_context_switch(prev);
>  
>  	/*
> -	 * If mm is non-NULL, we pass through switch_mm(). If mm is
> -	 * NULL, we will pass through mmdrop() in finish_task_switch().
> -	 * Both of these contain the full memory barrier required by
> -	 * membarrier after storing to rq->curr, before returning to
> -	 * user-space.
> +	 * kernel -> kernel   lazy + transfer active
> +	 *   user -> kernel   lazy + mmgrab() active
> +	 *
> +	 * kernel ->   user   switch + mmdrop() active
> +	 *   user ->   user   switch
>  	 */
> -	if (!mm) {
> -		next->active_mm = oldmm;
> -		mmgrab(oldmm);
> -		enter_lazy_tlb(oldmm, next);
> -	} else
> -		switch_mm_irqs_off(oldmm, mm, next);
> -
> -	if (!prev->mm) {
> -		prev->active_mm = NULL;
> -		rq->prev_mm = oldmm;
> +	if (!next->mm) {                                // to kernel
> +		enter_lazy_tlb(prev->active_mm, next);
> +
> +		next->active_mm = prev->active_mm;
> +		if (prev->mm)                           // from user
> +			mmgrab(prev->active_mm);
> +		else
> +			prev->active_mm = NULL;
> +	} else {                                        // to user
> +		/*
> +		 * sys_membarrier() requires an smp_mb() between setting
> +		 * rq->curr and returning to userspace.
> +		 *
> +		 * The below provides this either through switch_mm(), or in
> +		 * case 'prev->active_mm == next->mm' through
> +		 * finish_task_switch()'s mmdrop().
> +		 */
> +
> +		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
> +		if (!prev->mm) {                        // from kernel
> +			/* will mmdrop() in finish_task_switch(). */
> +			rq->prev_mm = prev->active_mm;
> +			prev->active_mm = NULL;
> +		}
>  	}
>  
>  	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

This patch looks fine to me; I don't see any problem in its logic.

Acked-by: Waiman Long <longman@redhat.com>

The problem that I am trying to fix is in the kernel->kernel case, where
the active_mm just gets passed along. I would like to bump the active_mm
off if it is dying. I will see what I can do to make it work even with
!CONFIG_MEMCG.
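
Roughly, what I have in mind on top of your restructured version is the
sketch below. This is illustrative only: mm_dying() is the helper from
my earlier patch (not mainline), and the membarrier barrier details are
glossed over.

	if (!next->mm) {                                // to kernel
		struct mm_struct *mm = prev->active_mm;

		if (!mm_dying(mm)) {
			enter_lazy_tlb(mm, next);

			next->active_mm = mm;
			if (prev->mm)                   // from user
				mmgrab(mm);
			else
				prev->active_mm = NULL;
		} else {
			/*
			 * Don't keep a dying mm alive as the lazy mm;
			 * run the kernel thread on init_mm instead so
			 * the old mm can be torn down right away.
			 */
			next->active_mm = &init_mm;
			mmgrab(&init_mm);
			switch_mm_irqs_off(mm, &init_mm, next);

			if (!prev->mm) {                // from kernel
				/* will mmdrop() in finish_task_switch() */
				rq->prev_mm = mm;
				prev->active_mm = NULL;
			}
		}
	} else {                                        // to user
		/* unchanged from your patch */
	}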

Cheers,
Longman
Peter Zijlstra July 29, 2019, 3:22 p.m. UTC | #3
On Mon, Jul 29, 2019 at 11:16:55AM -0400, Waiman Long wrote:
> On 7/29/19 10:24 AM, Peter Zijlstra wrote:
> > On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
> >
> > ---
> > Subject: sched: Clean up active_mm reference counting
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Mon Jul 29 16:05:15 CEST 2019
> >
> > The current active_mm reference counting is confusing and sub-optimal.
> >
> > Rewrite the code to explicitly consider the 4 separate cases:
> >
> >     user -> user
> >
> > 	When switching between two user tasks, all we need to consider
> > 	is switch_mm().
> >
> >     user -> kernel
> >
> > 	When switching from a user task to a kernel task (which
> > 	doesn't have an associated mm) we retain the last mm in our
> > 	active_mm. Increment a reference count on active_mm.
> >
> >   kernel -> kernel
> >
> > 	When switching between kernel threads, all we need to do is
> > 	pass along the active_mm reference.
> >
> >   kernel -> user
> >
> > 	When switching between a kernel and user task, we must switch
> > 	from the last active_mm to the next mm, hoping of course that
> > 	these are the same. Decrement a reference on the active_mm.
> >
> > The code keeps a different order, because as you'll note, both 'to
> > user' cases require switch_mm().
> >
> > And where the old code would increment/decrement for the 'kernel ->
> > kernel' case, the new code observes this is a neutral operation and
> > avoids touching the reference count.
> 
> I am aware of that behavior which is indeed redundant, but it is not
> what I am trying to fix and so I kind of leave it alone in my patch.

Oh sure; and it's not all that important either. It is just that every
time I look at that code I get confused.

On top of that, the new code is easier to rip the active_mm stuff out
of, which is where this cleanup came from.
Rik van Riel July 29, 2019, 3:29 p.m. UTC | #4
On Mon, 2019-07-29 at 16:24 +0200, Peter Zijlstra wrote:

> Subject: sched: Clean up active_mm reference counting
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Jul 29 16:05:15 CEST 2019
> 
> The current active_mm reference counting is confusing and sub-
> optimal.
> 
> Rewrite the code to explicitly consider the 4 separate cases:
> 

Reviewed-by: Rik van Riel <riel@surriel.com>

Patch

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3214,12 +3214,8 @@  static __always_inline struct rq *
 context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
-	struct mm_struct *mm, *oldmm;
-
 	prepare_task_switch(rq, prev, next);
 
-	mm = next->mm;
-	oldmm = prev->active_mm;
 	/*
 	 * For paravirt, this is coupled with an exit in switch_to to
 	 * combine the page table reload and the switch backend into
@@ -3228,22 +3224,37 @@  context_switch(struct rq *rq, struct tas
 	arch_start_context_switch(prev);
 
 	/*
-	 * If mm is non-NULL, we pass through switch_mm(). If mm is
-	 * NULL, we will pass through mmdrop() in finish_task_switch().
-	 * Both of these contain the full memory barrier required by
-	 * membarrier after storing to rq->curr, before returning to
-	 * user-space.
+	 * kernel -> kernel   lazy + transfer active
+	 *   user -> kernel   lazy + mmgrab() active
+	 *
+	 * kernel ->   user   switch + mmdrop() active
+	 *   user ->   user   switch
 	 */
-	if (!mm) {
-		next->active_mm = oldmm;
-		mmgrab(oldmm);
-		enter_lazy_tlb(oldmm, next);
-	} else
-		switch_mm_irqs_off(oldmm, mm, next);
-
-	if (!prev->mm) {
-		prev->active_mm = NULL;
-		rq->prev_mm = oldmm;
+	if (!next->mm) {                                // to kernel
+		enter_lazy_tlb(prev->active_mm, next);
+
+		next->active_mm = prev->active_mm;
+		if (prev->mm)                           // from user
+			mmgrab(prev->active_mm);
+		else
+			prev->active_mm = NULL;
+	} else {                                        // to user
+		/*
+		 * sys_membarrier() requires an smp_mb() between setting
+		 * rq->curr and returning to userspace.
+		 *
+		 * The below provides this either through switch_mm(), or in
+		 * case 'prev->active_mm == next->mm' through
+		 * finish_task_switch()'s mmdrop().
+		 */
+
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
+		if (!prev->mm) {                        // from kernel
+			/* will mmdrop() in finish_task_switch(). */
+			rq->prev_mm = prev->active_mm;
+			prev->active_mm = NULL;
+		}
 	}
 
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);