diff mbox

arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch

Message ID 20180214150739.GH2992@arm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Will Deacon Feb. 14, 2018, 3:07 p.m. UTC
Hi Mark,

Cheers for the report. These things tend to be a pain to debug, but I've had
a go.

On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote:
> As a heads-up, I hit the splat below when fuzzing v4.16-rc1 on arm64.
> 
> Evidently, we get to finish_task_switch() with rq->prev_mm != NULL,
> despite rq->prev_mm having been freed. KASAN spots the dereference of
> mm->membarrier_state in membarrier_mm_sync_core_before_usermode(mm), but
> AFAICT the underlying issue is independent of the membarrier code, and
> we could get a splat on the subsequent mmdrop(mm).
> 
> I've seen this once in ~2500 CPU hours of fuzzing, so it looks pretty
> difficult to hit, and I have no reproducer so far.
> 
> Syzkaller report below, mirrored with Syzkaller log at [1]. If I hit
> this again, I'll upload new info there.

The interesting thing here is on the exit path:

> Freed by task 10882:
>  save_stack mm/kasan/kasan.c:447 [inline]
>  set_track mm/kasan/kasan.c:459 [inline]
>  __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520
>  kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527
>  slab_free_hook mm/slub.c:1393 [inline]
>  slab_free_freelist_hook mm/slub.c:1414 [inline]
>  slab_free mm/slub.c:2968 [inline]
>  kmem_cache_free+0x88/0x270 mm/slub.c:2990
>  __mmdrop+0x164/0x248 kernel/fork.c:604

^^ This should never run, because there's an mmgrab() about 8 lines above
the mmput() in exit_mm.

>  mmdrop+0x50/0x60 kernel/fork.c:615
>  __mmput kernel/fork.c:981 [inline]
>  mmput+0x270/0x338 kernel/fork.c:992
>  exit_mm kernel/exit.c:544 [inline]

Looking at exit_mm:

        mmgrab(mm);
        BUG_ON(mm != current->active_mm);
        /* more a memory barrier than a real lock */
        task_lock(current);
        current->mm = NULL;
        up_read(&mm->mmap_sem);
        enter_lazy_tlb(mm, current);
        task_unlock(current);
        mm_update_next_owner(mm);
        mmput(mm);

Then the comment already rings some alarm bells: our spin_lock (as used
by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered
due to being an atomic_inc) can be reordered with respect to the assignment
of NULL to current->mm.

If the exit()ing task had recently migrated from another CPU, then that
CPU could concurrently run context_switch() and take this path:

	if (!prev->mm) {
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}

which then means finish_task_switch will call mmdrop():

	struct mm_struct *mm = rq->prev_mm;
	[...]
	if (mm) {
		membarrier_mm_sync_core_before_usermode(mm);
		mmdrop(mm);
	}

note that KASAN will be ok at this point, but it explains how the exit_mm
path ends up freeing the mm. Then, when the exit()ing CPU calls
context_switch, *it* will explode accessing the freed mm.

Easiest way to fix this is by guaranteeing the barrier semantics in the
exit path. Patch below. I guess we'll have to wait another 2500 hours to
see if it works :)

Will

--->8

Comments

Mark Rutland Feb. 14, 2018, 4:51 p.m. UTC | #1
On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote:
> Hi Mark,

Hi Will,

> Cheers for the report. These things tend to be a pain to debug, but I've had
> a go.

Thanks for taking a look!

> On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote:
> The interesting thing here is on the exit path:
> 
> > Freed by task 10882:
> >  save_stack mm/kasan/kasan.c:447 [inline]
> >  set_track mm/kasan/kasan.c:459 [inline]
> >  __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520
> >  kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527
> >  slab_free_hook mm/slub.c:1393 [inline]
> >  slab_free_freelist_hook mm/slub.c:1414 [inline]
> >  slab_free mm/slub.c:2968 [inline]
> >  kmem_cache_free+0x88/0x270 mm/slub.c:2990
> >  __mmdrop+0x164/0x248 kernel/fork.c:604
> 
> ^^ This should never run, because there's an mmgrab() about 8 lines above
> the mmput() in exit_mm.
> 
> >  mmdrop+0x50/0x60 kernel/fork.c:615
> >  __mmput kernel/fork.c:981 [inline]
> >  mmput+0x270/0x338 kernel/fork.c:992
> >  exit_mm kernel/exit.c:544 [inline]
> 
> Looking at exit_mm:
> 
>         mmgrab(mm);
>         BUG_ON(mm != current->active_mm);
>         /* more a memory barrier than a real lock */
>         task_lock(current);
>         current->mm = NULL;
>         up_read(&mm->mmap_sem);
>         enter_lazy_tlb(mm, current);
>         task_unlock(current);
>         mm_update_next_owner(mm);
>         mmput(mm);
> 
> Then the comment already rings some alarm bells: our spin_lock (as used
> by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered
> due to being an atomic_inc) can be reordered with respect to the assignment
> of NULL to current->mm.
> 
> If the exit()ing task had recently migrated from another CPU, then that
> CPU could concurrently run context_switch() and take this path:
> 
> 	if (!prev->mm) {
> 		prev->active_mm = NULL;
> 		rq->prev_mm = oldmm;
> 	}

IIUC, on the prior context_switch, next->mm == NULL, so we set
next->active_mm to prev->mm.

Then, in this context_switch we set oldmm = prev->active_mm (where prev
is next from the prior context switch).

... right?

> which then means finish_task_switch will call mmdrop():
> 
> 	struct mm_struct *mm = rq->prev_mm;
> 	[...]
> 	if (mm) {
> 		membarrier_mm_sync_core_before_usermode(mm);
> 		mmdrop(mm);
> 	}

... then here we use what was prev->active_mm in the most recent context
switch.

So AFAICT, we're never concurrently accessing a task_struct::mm field
here, only prev::{mm,active_mm} while prev is current...

[...]

> diff --git a/kernel/exit.c b/kernel/exit.c
> index 995453d9fb55..f91e8d56b03f 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -534,8 +534,9 @@ static void exit_mm(void)
>         }
>         mmgrab(mm);
>         BUG_ON(mm != current->active_mm);
> -       /* more a memory barrier than a real lock */
>         task_lock(current);
> +       /* Ensure we've grabbed the mm before setting current->mm to NULL */
> +       smp_mb__after_spin_lock();
>         current->mm = NULL;

... and thus I don't follow why we would need to order these with
anything more than a compiler barrier (if we're preemptible here).

What have I completely misunderstood? ;)

Thanks,
Mark.
Mathieu Desnoyers Feb. 14, 2018, 6:53 p.m. UTC | #2
----- On Feb 14, 2018, at 11:51 AM, Mark Rutland mark.rutland@arm.com wrote:

> On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote:
>> Hi Mark,
> 
> Hi Will,
> 
>> Cheers for the report. These things tend to be a pain to debug, but I've had
>> a go.
> 
> Thanks for taking a look!
> 
>> On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote:
>> The interesting thing here is on the exit path:
>> 
>> > Freed by task 10882:
>> >  save_stack mm/kasan/kasan.c:447 [inline]
>> >  set_track mm/kasan/kasan.c:459 [inline]
>> >  __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520
>> >  kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527
>> >  slab_free_hook mm/slub.c:1393 [inline]
>> >  slab_free_freelist_hook mm/slub.c:1414 [inline]
>> >  slab_free mm/slub.c:2968 [inline]
>> >  kmem_cache_free+0x88/0x270 mm/slub.c:2990
>> >  __mmdrop+0x164/0x248 kernel/fork.c:604
>> 
>> ^^ This should never run, because there's an mmgrab() about 8 lines above
>> the mmput() in exit_mm.
>> 
>> >  mmdrop+0x50/0x60 kernel/fork.c:615
>> >  __mmput kernel/fork.c:981 [inline]
>> >  mmput+0x270/0x338 kernel/fork.c:992
>> >  exit_mm kernel/exit.c:544 [inline]
>> 
>> Looking at exit_mm:
>> 
>>         mmgrab(mm);
>>         BUG_ON(mm != current->active_mm);
>>         /* more a memory barrier than a real lock */
>>         task_lock(current);
>>         current->mm = NULL;
>>         up_read(&mm->mmap_sem);
>>         enter_lazy_tlb(mm, current);
>>         task_unlock(current);
>>         mm_update_next_owner(mm);
>>         mmput(mm);
>> 
>> Then the comment already rings some alarm bells: our spin_lock (as used
>> by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered
>> due to being an atomic_inc) can be reordered with respect to the assignment
>> of NULL to current->mm.
>> 
>> If the exit()ing task had recently migrated from another CPU, then that
>> CPU could concurrently run context_switch() and take this path:
>> 
>> 	if (!prev->mm) {
>> 		prev->active_mm = NULL;
>> 		rq->prev_mm = oldmm;
>> 	}
> 
> IIUC, on the prior context_switch, next->mm == NULL, so we set
> next->active_mm to prev->mm.
> 
> Then, in this context_switch we set oldmm = prev->active_mm (where prev
> is next from the prior context switch).
> 
> ... right?
> 
>> which then means finish_task_switch will call mmdrop():
>> 
>> 	struct mm_struct *mm = rq->prev_mm;
>> 	[...]
>> 	if (mm) {
>> 		membarrier_mm_sync_core_before_usermode(mm);
>> 		mmdrop(mm);
>> 	}
> 
> ... then here we use what was prev->active_mm in the most recent context
> switch.
> 
> So AFAICT, we're never concurrently accessing a task_struct::mm field
> here, only prev::{mm,active_mm} while prev is current...
> 
> [...]
> 
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 995453d9fb55..f91e8d56b03f 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -534,8 +534,9 @@ static void exit_mm(void)
>>         }
>>         mmgrab(mm);
>>         BUG_ON(mm != current->active_mm);
>> -       /* more a memory barrier than a real lock */
>>         task_lock(current);
>> +       /* Ensure we've grabbed the mm before setting current->mm to NULL */
>> +       smp_mb__after_spin_lock();
>>         current->mm = NULL;
> 
> ... and thus I don't follow why we would need to order these with
> anything more than a compiler barrier (if we're preemptible here).
> 
> What have I completely misunderstood? ;)

The compiler barrier would not change anything, because task_lock()
already implies a compiler barrier (provided by the arch spin lock
inline asm memory clobber). So compiler-wise, it cannot move the
mmgrab(mm) after the store "current->mm = NULL".

However, given the scenario involves multiples CPUs (one doing exit_mm(),
the other doing context switch), the actual order of perceived load/store
can be shuffled. And AFAIU nothing prevents the CPU from ordering the
atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.

I wonder if we should not simply add a smp_mb__after_atomic() into
mmgrab() instead ? I see that e.g. futex.c does:

static inline void futex_get_mm(union futex_key *key)
{
        mmgrab(key->private.mm);
        /*
         * Ensure futex_get_mm() implies a full barrier such that
         * get_futex_key() implies a full barrier. This is relied upon
         * as smp_mb(); (B), see the ordering comment above.
         */
        smp_mb__after_atomic();
}

It could prevent nasty subtle bugs in other mmgrab() users.

Thoughts ?

Thanks,

Mathieu


> 
> Thanks,
> Mark.
Peter Zijlstra Feb. 15, 2018, 11:49 a.m. UTC | #3
On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote:
> However, given the scenario involves multiples CPUs (one doing exit_mm(),
> the other doing context switch), the actual order of perceived load/store
> can be shuffled. And AFAIU nothing prevents the CPU from ordering the
> atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.
> 
> I wonder if we should not simply add a smp_mb__after_atomic() into
> mmgrab() instead ? I see that e.g. futex.c does:

Don't think so, the futex case is really rather special and I suspect
this one is too. I would much rather have explicit comments rather than
implicit works by magic.

As per the rationale used for refcount_t, increments should be
unordered, because you ACQUIRE your object _before_ you can do the
increment.

The futex thing is simply abusing a bunch of implied barriers and
patching up the holes in paths that didn't already imply a barrier in
order to avoid having to add explicit barriers (which had measurable
performance issues).

And here we have explicit ordering outside of the reference counting
too, we want to ensure the reference is incremented before we modify
a second object.

This ordering is not at all related to acquiring the reference, so
bunding it seems odd.
Mathieu Desnoyers Feb. 15, 2018, 1:13 p.m. UTC | #4
----- On Feb 15, 2018, at 6:49 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote:
>> However, given the scenario involves multiples CPUs (one doing exit_mm(),
>> the other doing context switch), the actual order of perceived load/store
>> can be shuffled. And AFAIU nothing prevents the CPU from ordering the
>> atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.
>> 
>> I wonder if we should not simply add a smp_mb__after_atomic() into
>> mmgrab() instead ? I see that e.g. futex.c does:
> 
> Don't think so, the futex case is really rather special and I suspect
> this one is too. I would much rather have explicit comments rather than
> implicit works by magic.
> 
> As per the rationale used for refcount_t, increments should be
> unordered, because you ACQUIRE your object _before_ you can do the
> increment.
> 
> The futex thing is simply abusing a bunch of implied barriers and
> patching up the holes in paths that didn't already imply a barrier in
> order to avoid having to add explicit barriers (which had measurable
> performance issues).
> 
> And here we have explicit ordering outside of the reference counting
> too, we want to ensure the reference is incremented before we modify
> a second object.
> 
> This ordering is not at all related to acquiring the reference, so
> bunding it seems odd.

I understand your point. Will's added barrier and comment is fine.

Thanks,

Mathieu
Will Deacon Feb. 15, 2018, 2:22 p.m. UTC | #5
On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote:
> ----- On Feb 14, 2018, at 11:51 AM, Mark Rutland mark.rutland@arm.com wrote:
> > On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote:
> >> If the exit()ing task had recently migrated from another CPU, then that
> >> CPU could concurrently run context_switch() and take this path:
> >> 
> >> 	if (!prev->mm) {
> >> 		prev->active_mm = NULL;
> >> 		rq->prev_mm = oldmm;
> >> 	}
> > 
> > IIUC, on the prior context_switch, next->mm == NULL, so we set
> > next->active_mm to prev->mm.
> > 
> > Then, in this context_switch we set oldmm = prev->active_mm (where prev
> > is next from the prior context switch).
> > 
> > ... right?
> > 
> >> which then means finish_task_switch will call mmdrop():
> >> 
> >> 	struct mm_struct *mm = rq->prev_mm;
> >> 	[...]
> >> 	if (mm) {
> >> 		membarrier_mm_sync_core_before_usermode(mm);
> >> 		mmdrop(mm);
> >> 	}
> > 
> > ... then here we use what was prev->active_mm in the most recent context
> > switch.
> > 
> > So AFAICT, we're never concurrently accessing a task_struct::mm field
> > here, only prev::{mm,active_mm} while prev is current...
> > 
> > [...]
> > 
> >> diff --git a/kernel/exit.c b/kernel/exit.c
> >> index 995453d9fb55..f91e8d56b03f 100644
> >> --- a/kernel/exit.c
> >> +++ b/kernel/exit.c
> >> @@ -534,8 +534,9 @@ static void exit_mm(void)
> >>         }
> >>         mmgrab(mm);
> >>         BUG_ON(mm != current->active_mm);
> >> -       /* more a memory barrier than a real lock */
> >>         task_lock(current);
> >> +       /* Ensure we've grabbed the mm before setting current->mm to NULL */
> >> +       smp_mb__after_spin_lock();
> >>         current->mm = NULL;
> > 
> > ... and thus I don't follow why we would need to order these with
> > anything more than a compiler barrier (if we're preemptible here).
> > 
> > What have I completely misunderstood? ;)
> 
> The compiler barrier would not change anything, because task_lock()
> already implies a compiler barrier (provided by the arch spin lock
> inline asm memory clobber). So compiler-wise, it cannot move the
> mmgrab(mm) after the store "current->mm = NULL".
> 
> However, given the scenario involves multiples CPUs (one doing exit_mm(),
> the other doing context switch), the actual order of perceived load/store
> can be shuffled. And AFAIU nothing prevents the CPU from ordering the
> atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.

Mark and I have spent most of the morning looking at this and realised I
made a mistake in my original guesswork: prev can't migrate until half way
down finish_task_switch when on_cpu = 0, so the access of prev->mm in
context_switch can't race with exit_mm() for that task.

Furthermore, although the mmgrab() could in theory be reordered with
current->mm = NULL (and the ARMv8 architecture allows this too), it's
pretty unlikely with LL/SC atomics and the backwards branch, where the
CPU would have to pull off quite a few tricks for this to happen.

Instead, we've come up with a more plausible sequence that can in theory
happen on a single CPU:

<task foo calls exit()>

do_exit
	exit_mm
		mmgrab(mm);			// foo's mm has count +1
		BUG_ON(mm != current->active_mm);
		task_lock(current);
		current->mm = NULL;
		task_unlock(current);

<irq and ctxsw to kthread>

context_switch(prev=foo, next=kthread)
	mm = next->mm;
	oldmm = prev->active_mm;

	if (!mm) {				// True for kthread
		next->active_mm = oldmm;
		mmgrab(oldmm);			// foo's mm has count +2
	}

	if (!prev->mm) {			// True for foo
		rq->prev_mm = oldmm;
	}

	finish_task_switch
		mm = rq->prev_mm;
		if (mm) {			// True (foo's mm)
			mmdrop(mm);		// foo's mm has count +1
		}

	[...]

<ctxsw to task bar>

context_switch(prev=kthread, next=bar)
	mm = next->mm;
	oldmm = prev->active_mm;		// foo's mm!

	if (!prev->mm) {			// True for kthread
		rq->prev_mm = oldmm;
	}

	finish_task_switch
		mm = rq->prev_mm;
		if (mm) {			// True (foo's mm)
			mmdrop(mm);		// foo's mm has count +0
		}

	[...]

<ctxsw back to task foo>

context_switch(prev=bar, next=foo)
	mm = next->mm;
	oldmm = prev->active_mm;

	if (!mm) {				// True for foo
		next->active_mm = oldmm;	// This is bar's mm
		mmgrab(oldmm);			// bar's mm has count +1
	}


	[return back to exit_mm]

mmdrop(mm);					// foo's mm has count -1

At this point, we've got an imbalanced count on the mm and could free it
prematurely as seen in the KASAN log. A subsequent context-switch away
from foo would therefore result in a use-after-free.

Assuming others agree with this diagnosis, I'm not sure how to fix it.
It's basically not safe to set current->mm = NULL with preemption enabled.

Will
Will Deacon Feb. 15, 2018, 3:33 p.m. UTC | #6
On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Assuming others agree with this diagnosis, I'm not sure how to fix it.
> It's basically not safe to set current->mm = NULL with preemption enabled.

One thing we could try would be to leave current->mm alone and just do the
mmdrop in finish_task_switch at the point where we put the task_struct for
DEAD tasks. mm_update_next_owner might need updating so that it doesn't
re-assign current as the owner and run into a BUG_ON.


Will
Peter Zijlstra Feb. 15, 2018, 4:47 p.m. UTC | #7
On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Instead, we've come up with a more plausible sequence that can in theory
> happen on a single CPU:
> 
> <task foo calls exit()>
> 
> do_exit
> 	exit_mm

If this is the last task of the process, we would expect:

  mm_count == 1
  mm_users == 1

at this point.

> 		mmgrab(mm);			// foo's mm has count +1
> 		BUG_ON(mm != current->active_mm);
> 		task_lock(current);
> 		current->mm = NULL;
> 		task_unlock(current);

So the whole active_mm is basically the last 'real' mm, and its purpose
is to avoid switch_mm() between user tasks and kernel tasks.

A kernel task has !->mm. We do this by incrementing mm_count when
switching from user to kernel task and decrementing when switching from
kernel to user.

What exit_mm() does is change a user task into a 'kernel' task. So it
should increment mm_count to mirror the context switch. I suspect this
is what the mmgrab() in exit_mm() is for.

> <irq and ctxsw to kthread>
> 
> context_switch(prev=foo, next=kthread)
> 	mm = next->mm;
> 	oldmm = prev->active_mm;
> 
> 	if (!mm) {				// True for kthread
> 		next->active_mm = oldmm;
> 		mmgrab(oldmm);			// foo's mm has count +2
> 	}
> 
> 	if (!prev->mm) {			// True for foo
> 		rq->prev_mm = oldmm;
> 	}
> 
> 	finish_task_switch
> 		mm = rq->prev_mm;
> 		if (mm) {			// True (foo's mm)
> 			mmdrop(mm);		// foo's mm has count +1
> 		}
> 
> 	[...]
> 
> <ctxsw to task bar>
> 
> context_switch(prev=kthread, next=bar)
> 	mm = next->mm;
> 	oldmm = prev->active_mm;		// foo's mm!
> 
> 	if (!prev->mm) {			// True for kthread
> 		rq->prev_mm = oldmm;
> 	}
> 
> 	finish_task_switch
> 		mm = rq->prev_mm;
> 		if (mm) {			// True (foo's mm)
> 			mmdrop(mm);		// foo's mm has count +0

The context switch into the next user task will then decrement. At this
point foo no longer has a reference to its mm, except on the stack.

> 		}
> 
> 	[...]
> 
> <ctxsw back to task foo>
> 
> context_switch(prev=bar, next=foo)
> 	mm = next->mm;
> 	oldmm = prev->active_mm;
> 
> 	if (!mm) {				// True for foo
> 		next->active_mm = oldmm;	// This is bar's mm
> 		mmgrab(oldmm);			// bar's mm has count +1
> 	}
> 
> 
> 	[return back to exit_mm]

Enter mm_users, this counts the number of tasks associated with the mm.
We start with 1 in mm_init(), and when it drops to 0, we decrement
mm_count. Since we also start with mm_count == 1, this would appear
consistent.

  mmput() // --mm_users == 0, which then results in:

> mmdrop(mm);					// foo's mm has count -1

In the above case, that's the very last reference to the mm, and since
we started out with mm_count == 1, this -1 makes 0 and we do the actual
free.

> At this point, we've got an imbalanced count on the mm and could free it
> prematurely as seen in the KASAN log.

I'm not sure I see premature. At this point mm_users==0, mm_count==0 and
we freed mm and there is no further use of the on-stack mm pointer and
foo no longer has a pointer to it in either ->mm or ->active_mm. It's
well and proper dead.

> A subsequent context-switch away from foo would therefore result in a
> use-after-free.

At the above point, foo no longer has a reference to mm, we cleared ->mm
early, and the context switch to bar cleared ->active_mm. The switch
back into foo then results with foo->active_mm == bar->mm, which is
fine.
Will Deacon Feb. 15, 2018, 6:21 p.m. UTC | #8
On Thu, Feb 15, 2018 at 05:47:54PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> > Instead, we've come up with a more plausible sequence that can in theory
> > happen on a single CPU:
> > 
> > <task foo calls exit()>
> > 
> > do_exit
> > 	exit_mm
> 
> If this is the last task of the process, we would expect:
> 
>   mm_count == 1
>   mm_users == 1
> 
> at this point.
> 
> > 		mmgrab(mm);			// foo's mm has count +1
> > 		BUG_ON(mm != current->active_mm);
> > 		task_lock(current);
> > 		current->mm = NULL;
> > 		task_unlock(current);
> 
> So the whole active_mm is basically the last 'real' mm, and its purpose
> is to avoid switch_mm() between user tasks and kernel tasks.
> 
> A kernel task has !->mm. We do this by incrementing mm_count when
> switching from user to kernel task and decrementing when switching from
> kernel to user.
> 
> What exit_mm() does is change a user task into a 'kernel' task. So it
> should increment mm_count to mirror the context switch. I suspect this
> is what the mmgrab() in exit_mm() is for.
> 
> > <irq and ctxsw to kthread>
> > 
> > context_switch(prev=foo, next=kthread)
> > 	mm = next->mm;
> > 	oldmm = prev->active_mm;
> > 
> > 	if (!mm) {				// True for kthread
> > 		next->active_mm = oldmm;
> > 		mmgrab(oldmm);			// foo's mm has count +2
> > 	}
> > 
> > 	if (!prev->mm) {			// True for foo
> > 		rq->prev_mm = oldmm;
> > 	}
> > 
> > 	finish_task_switch
> > 		mm = rq->prev_mm;
> > 		if (mm) {			// True (foo's mm)
> > 			mmdrop(mm);		// foo's mm has count +1
> > 		}
> > 
> > 	[...]
> > 
> > <ctxsw to task bar>
> > 
> > context_switch(prev=kthread, next=bar)
> > 	mm = next->mm;
> > 	oldmm = prev->active_mm;		// foo's mm!
> > 
> > 	if (!prev->mm) {			// True for kthread
> > 		rq->prev_mm = oldmm;
> > 	}
> > 
> > 	finish_task_switch
> > 		mm = rq->prev_mm;
> > 		if (mm) {			// True (foo's mm)
> > 			mmdrop(mm);		// foo's mm has count +0
> 
> The context switch into the next user task will then decrement. At this
> point foo no longer has a reference to its mm, except on the stack.
> 
> > 		}
> > 
> > 	[...]
> > 
> > <ctxsw back to task foo>
> > 
> > context_switch(prev=bar, next=foo)
> > 	mm = next->mm;
> > 	oldmm = prev->active_mm;
> > 
> > 	if (!mm) {				// True for foo
> > 		next->active_mm = oldmm;	// This is bar's mm
> > 		mmgrab(oldmm);			// bar's mm has count +1
> > 	}
> > 
> > 
> > 	[return back to exit_mm]
> 
> Enter mm_users, this counts the number of tasks associated with the mm.
> We start with 1 in mm_init(), and when it drops to 0, we decrement
> mm_count. Since we also start with mm_count == 1, this would appear
> consistent.
> 
>   mmput() // --mm_users == 0, which then results in:
> 
> > mmdrop(mm);					// foo's mm has count -1
> 
> In the above case, that's the very last reference to the mm, and since
> we started out with mm_count == 1, this -1 makes 0 and we do the actual
> free.
> 
> > At this point, we've got an imbalanced count on the mm and could free it
> > prematurely as seen in the KASAN log.
> 
> I'm not sure I see premature. At this point mm_users==0, mm_count==0 and
> we freed mm and there is no further use of the on-stack mm pointer and
> foo no longer has a pointer to it in either ->mm or ->active_mm. It's
> well and proper dead.
> 
> > A subsequent context-switch away from foo would therefore result in a
> > use-after-free.
> 
> At the above point, foo no longer has a reference to mm, we cleared ->mm
> early, and the context switch to bar cleared ->active_mm. The switch
> back into foo then results with foo->active_mm == bar->mm, which is
> fine.

Bugger, you're right. When we switch off foo after freeing the mm, we'll
actually access it's active mm which points to bar's mm. So whilst this
can explain part of the kasan splat, it doesn't explain the actual
use-after-free.

More head-scratching required :(

Will
Catalin Marinas Feb. 19, 2018, 11:26 a.m. UTC | #9
On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Instead, we've come up with a more plausible sequence that can in theory
> happen on a single CPU:
> 
> <task foo calls exit()>
> 
> do_exit
> 	exit_mm
> 		mmgrab(mm);			// foo's mm has count +1
> 		BUG_ON(mm != current->active_mm);
> 		task_lock(current);
> 		current->mm = NULL;
> 		task_unlock(current);
> 
> <irq and ctxsw to kthread>

[...]

> mmdrop(mm);					// foo's mm has count -1
> 
> At this point, we've got an imbalanced count on the mm and could free it
> prematurely as seen in the KASAN log. A subsequent context-switch away
> from foo would therefore result in a use-after-free.

Peter already dismissed an algorithm issue here but I thought I'd give
model checking a go (it only deals with mm_users/mm_count; also added a
heavily simplified exec_mmap() in the loop):

https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/ctxsw.tla

As expected, it didn't show any problems (though it does not take memory
ordering into account).

Now, there are lots of other mmget/mmput and mmgrab/mmdrop throughout
the kernel and finding an imbalanced call needs more work.
diff mbox

Patch

diff --git a/kernel/exit.c b/kernel/exit.c
index 995453d9fb55..f91e8d56b03f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -534,8 +534,9 @@  static void exit_mm(void)
        }
        mmgrab(mm);
        BUG_ON(mm != current->active_mm);
-       /* more a memory barrier than a real lock */
        task_lock(current);
+       /* Ensure we've grabbed the mm before setting current->mm to NULL */
+       smp_mb__after_spin_lock();
        current->mm = NULL;
        up_read(&mm->mmap_sem);
        enter_lazy_tlb(mm, current);