Message ID | 20180214150739.GH2992@arm.com (mailing list archive)
State | New, archived
On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote:
> Hi Mark,

Hi Will,

> Cheers for the report. These things tend to be a pain to debug, but I've had
> a go.

Thanks for taking a look!

> On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote:
> The interesting thing here is on the exit path:
>
> > Freed by task 10882:
> >  save_stack mm/kasan/kasan.c:447 [inline]
> >  set_track mm/kasan/kasan.c:459 [inline]
> >  __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520
> >  kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527
> >  slab_free_hook mm/slub.c:1393 [inline]
> >  slab_free_freelist_hook mm/slub.c:1414 [inline]
> >  slab_free mm/slub.c:2968 [inline]
> >  kmem_cache_free+0x88/0x270 mm/slub.c:2990
> >  __mmdrop+0x164/0x248 kernel/fork.c:604
>
> ^^ This should never run, because there's an mmgrab() about 8 lines above
> the mmput() in exit_mm.
>
> >  mmdrop+0x50/0x60 kernel/fork.c:615
> >  __mmput kernel/fork.c:981 [inline]
> >  mmput+0x270/0x338 kernel/fork.c:992
> >  exit_mm kernel/exit.c:544 [inline]
>
> Looking at exit_mm:
>
> 	mmgrab(mm);
> 	BUG_ON(mm != current->active_mm);
> 	/* more a memory barrier than a real lock */
> 	task_lock(current);
> 	current->mm = NULL;
> 	up_read(&mm->mmap_sem);
> 	enter_lazy_tlb(mm, current);
> 	task_unlock(current);
> 	mm_update_next_owner(mm);
> 	mmput(mm);
>
> Then the comment already rings some alarm bells: our spin_lock (as used
> by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered
> due to being an atomic_inc) can be reordered with respect to the assignment
> of NULL to current->mm.
>
> If the exit()ing task had recently migrated from another CPU, then that
> CPU could concurrently run context_switch() and take this path:
>
> 	if (!prev->mm) {
> 		prev->active_mm = NULL;
> 		rq->prev_mm = oldmm;
> 	}

IIUC, on the prior context_switch, next->mm == NULL, so we set
next->active_mm to prev->mm.

Then, in this context_switch we set oldmm = prev->active_mm (where prev
is next from the prior context switch).

... right?

> which then means finish_task_switch will call mmdrop():
>
> 	struct mm_struct *mm = rq->prev_mm;
> 	[...]
> 	if (mm) {
> 		membarrier_mm_sync_core_before_usermode(mm);
> 		mmdrop(mm);
> 	}

... then here we use what was prev->active_mm in the most recent context
switch.

So AFAICT, we're never concurrently accessing a task_struct::mm field
here, only prev::{mm,active_mm} while prev is current...

[...]

> diff --git a/kernel/exit.c b/kernel/exit.c
> index 995453d9fb55..f91e8d56b03f 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -534,8 +534,9 @@ static void exit_mm(void)
>  	}
>  	mmgrab(mm);
>  	BUG_ON(mm != current->active_mm);
> -	/* more a memory barrier than a real lock */
>  	task_lock(current);
> +	/* Ensure we've grabbed the mm before setting current->mm to NULL */
> +	smp_mb__after_spin_lock();
>  	current->mm = NULL;

... and thus I don't follow why we would need to order these with
anything more than a compiler barrier (if we're preemptible here).

What have I completely misunderstood? ;)

Thanks,
Mark.
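For reference, the mmgrab() being discussed is, around the kernel version in
question, nothing more than an unordered atomic increment of mm_count
(roughly as defined in include/linux/sched/mm.h; shown from memory, so treat
it as approximate):

static inline void mmgrab(struct mm_struct *mm)
{
	/* Unordered reference count increment; no implied memory barrier. */
	atomic_inc(&mm->mm_count);
}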
----- On Feb 14, 2018, at 11:51 AM, Mark Rutland mark.rutland@arm.com wrote:

> On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote:
>> Hi Mark,
>
> Hi Will,
>
>> Cheers for the report. These things tend to be a pain to debug, but I've had
>> a go.
>
> Thanks for taking a look!
>
>> On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote:
>> The interesting thing here is on the exit path:
>>
>> > Freed by task 10882:
>> >  save_stack mm/kasan/kasan.c:447 [inline]
>> >  set_track mm/kasan/kasan.c:459 [inline]
>> >  __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520
>> >  kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527
>> >  slab_free_hook mm/slub.c:1393 [inline]
>> >  slab_free_freelist_hook mm/slub.c:1414 [inline]
>> >  slab_free mm/slub.c:2968 [inline]
>> >  kmem_cache_free+0x88/0x270 mm/slub.c:2990
>> >  __mmdrop+0x164/0x248 kernel/fork.c:604
>>
>> ^^ This should never run, because there's an mmgrab() about 8 lines above
>> the mmput() in exit_mm.
>>
>> >  mmdrop+0x50/0x60 kernel/fork.c:615
>> >  __mmput kernel/fork.c:981 [inline]
>> >  mmput+0x270/0x338 kernel/fork.c:992
>> >  exit_mm kernel/exit.c:544 [inline]
>>
>> Looking at exit_mm:
>>
>> 	mmgrab(mm);
>> 	BUG_ON(mm != current->active_mm);
>> 	/* more a memory barrier than a real lock */
>> 	task_lock(current);
>> 	current->mm = NULL;
>> 	up_read(&mm->mmap_sem);
>> 	enter_lazy_tlb(mm, current);
>> 	task_unlock(current);
>> 	mm_update_next_owner(mm);
>> 	mmput(mm);
>>
>> Then the comment already rings some alarm bells: our spin_lock (as used
>> by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered
>> due to being an atomic_inc) can be reordered with respect to the assignment
>> of NULL to current->mm.
>>
>> If the exit()ing task had recently migrated from another CPU, then that
>> CPU could concurrently run context_switch() and take this path:
>>
>> 	if (!prev->mm) {
>> 		prev->active_mm = NULL;
>> 		rq->prev_mm = oldmm;
>> 	}
>
> IIUC, on the prior context_switch, next->mm == NULL, so we set
> next->active_mm to prev->mm.
>
> Then, in this context_switch we set oldmm = prev->active_mm (where prev
> is next from the prior context switch).
>
> ... right?
>
>> which then means finish_task_switch will call mmdrop():
>>
>> 	struct mm_struct *mm = rq->prev_mm;
>> 	[...]
>> 	if (mm) {
>> 		membarrier_mm_sync_core_before_usermode(mm);
>> 		mmdrop(mm);
>> 	}
>
> ... then here we use what was prev->active_mm in the most recent context
> switch.
>
> So AFAICT, we're never concurrently accessing a task_struct::mm field
> here, only prev::{mm,active_mm} while prev is current...
>
> [...]
>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 995453d9fb55..f91e8d56b03f 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -534,8 +534,9 @@ static void exit_mm(void)
>>  	}
>>  	mmgrab(mm);
>>  	BUG_ON(mm != current->active_mm);
>> -	/* more a memory barrier than a real lock */
>>  	task_lock(current);
>> +	/* Ensure we've grabbed the mm before setting current->mm to NULL */
>> +	smp_mb__after_spin_lock();
>>  	current->mm = NULL;
>
> ... and thus I don't follow why we would need to order these with
> anything more than a compiler barrier (if we're preemptible here).
>
> What have I completely misunderstood? ;)

The compiler barrier would not change anything, because task_lock()
already implies a compiler barrier (provided by the arch spin lock
inline asm memory clobber). So compiler-wise, it cannot move the
mmgrab(mm) after the store "current->mm = NULL".

However, given the scenario involves multiple CPUs (one doing exit_mm(),
the other doing context switch), the actual order of perceived load/store
can be shuffled. And AFAIU nothing prevents the CPU from ordering the
atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.

I wonder if we should not simply add a smp_mb__after_atomic() into
mmgrab() instead ? I see that e.g. futex.c does:

static inline void futex_get_mm(union futex_key *key)
{
	mmgrab(key->private.mm);
	/*
	 * Ensure futex_get_mm() implies a full barrier such that
	 * get_futex_key() implies a full barrier. This is relied upon
	 * as smp_mb(); (B), see the ordering comment above.
	 */
	smp_mb__after_atomic();
}

It could prevent nasty subtle bugs in other mmgrab() users.

Thoughts ?

Thanks,

Mathieu

>
> Thanks,
> Mark.
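Folding the barrier into mmgrab() as suggested above would look roughly like
the sketch below, mirroring the futex_get_mm() pattern just quoted. This is
an illustration only, not a patch that was posted in this thread:

static inline void mmgrab(struct mm_struct *mm)
{
	atomic_inc(&mm->mm_count);
	/*
	 * Hypothetical: promote the increment to a full barrier so that any
	 * CPU observing a later store from this CPU (e.g. current->mm = NULL
	 * in exit_mm()) also observes the elevated mm_count.
	 */
	smp_mb__after_atomic();
}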
On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote:
> However, given the scenario involves multiple CPUs (one doing exit_mm(),
> the other doing context switch), the actual order of perceived load/store
> can be shuffled. And AFAIU nothing prevents the CPU from ordering the
> atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.
>
> I wonder if we should not simply add a smp_mb__after_atomic() into
> mmgrab() instead ? I see that e.g. futex.c does:

Don't think so, the futex case is really rather special and I suspect
this one is too. I would much rather have explicit comments rather than
implicit works by magic.

As per the rationale used for refcount_t, increments should be
unordered, because you ACQUIRE your object _before_ you can do the
increment.

The futex thing is simply abusing a bunch of implied barriers and
patching up the holes in paths that didn't already imply a barrier in
order to avoid having to add explicit barriers (which had measurable
performance issues).

And here we have explicit ordering outside of the reference counting
too, we want to ensure the reference is incremented before we modify
a second object.

This ordering is not at all related to acquiring the reference, so
bundling it seems odd.
----- On Feb 15, 2018, at 6:49 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote:
>> However, given the scenario involves multiple CPUs (one doing exit_mm(),
>> the other doing context switch), the actual order of perceived load/store
>> can be shuffled. And AFAIU nothing prevents the CPU from ordering the
>> atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.
>>
>> I wonder if we should not simply add a smp_mb__after_atomic() into
>> mmgrab() instead ? I see that e.g. futex.c does:
>
> Don't think so, the futex case is really rather special and I suspect
> this one is too. I would much rather have explicit comments rather than
> implicit works by magic.
>
> As per the rationale used for refcount_t, increments should be
> unordered, because you ACQUIRE your object _before_ you can do the
> increment.
>
> The futex thing is simply abusing a bunch of implied barriers and
> patching up the holes in paths that didn't already imply a barrier in
> order to avoid having to add explicit barriers (which had measurable
> performance issues).
>
> And here we have explicit ordering outside of the reference counting
> too, we want to ensure the reference is incremented before we modify
> a second object.
>
> This ordering is not at all related to acquiring the reference, so
> bundling it seems odd.

I understand your point. Will's added barrier and comment is fine.

Thanks,

Mathieu
On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote:
> ----- On Feb 14, 2018, at 11:51 AM, Mark Rutland mark.rutland@arm.com wrote:
> > On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote:
> >> If the exit()ing task had recently migrated from another CPU, then that
> >> CPU could concurrently run context_switch() and take this path:
> >>
> >> 	if (!prev->mm) {
> >> 		prev->active_mm = NULL;
> >> 		rq->prev_mm = oldmm;
> >> 	}
> >
> > IIUC, on the prior context_switch, next->mm == NULL, so we set
> > next->active_mm to prev->mm.
> >
> > Then, in this context_switch we set oldmm = prev->active_mm (where prev
> > is next from the prior context switch).
> >
> > ... right?
> >
> >> which then means finish_task_switch will call mmdrop():
> >>
> >> 	struct mm_struct *mm = rq->prev_mm;
> >> 	[...]
> >> 	if (mm) {
> >> 		membarrier_mm_sync_core_before_usermode(mm);
> >> 		mmdrop(mm);
> >> 	}
> >
> > ... then here we use what was prev->active_mm in the most recent context
> > switch.
> >
> > So AFAICT, we're never concurrently accessing a task_struct::mm field
> > here, only prev::{mm,active_mm} while prev is current...
> >
> > [...]
> >
> >> diff --git a/kernel/exit.c b/kernel/exit.c
> >> index 995453d9fb55..f91e8d56b03f 100644
> >> --- a/kernel/exit.c
> >> +++ b/kernel/exit.c
> >> @@ -534,8 +534,9 @@ static void exit_mm(void)
> >>  	}
> >>  	mmgrab(mm);
> >>  	BUG_ON(mm != current->active_mm);
> >> -	/* more a memory barrier than a real lock */
> >>  	task_lock(current);
> >> +	/* Ensure we've grabbed the mm before setting current->mm to NULL */
> >> +	smp_mb__after_spin_lock();
> >>  	current->mm = NULL;
> >
> > ... and thus I don't follow why we would need to order these with
> > anything more than a compiler barrier (if we're preemptible here).
> >
> > What have I completely misunderstood? ;)
>
> The compiler barrier would not change anything, because task_lock()
> already implies a compiler barrier (provided by the arch spin lock
> inline asm memory clobber). So compiler-wise, it cannot move the
> mmgrab(mm) after the store "current->mm = NULL".
>
> However, given the scenario involves multiple CPUs (one doing exit_mm(),
> the other doing context switch), the actual order of perceived load/store
> can be shuffled. And AFAIU nothing prevents the CPU from ordering the
> atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.

Mark and I have spent most of the morning looking at this and realised I
made a mistake in my original guesswork: prev can't migrate until halfway
down finish_task_switch when on_cpu = 0, so the access of prev->mm in
context_switch can't race with exit_mm() for that task. Furthermore,
although the mmgrab() could in theory be reordered with current->mm = NULL
(and the ARMv8 architecture allows this too), it's pretty unlikely with
LL/SC atomics and the backwards branch, where the CPU would have to pull
off quite a few tricks for this to happen.
Instead, we've come up with a more plausible sequence that can in theory
happen on a single CPU:

<task foo calls exit()>

do_exit
  exit_mm
    mmgrab(mm);			// foo's mm has count +1
    BUG_ON(mm != current->active_mm);
    task_lock(current);
    current->mm = NULL;
    task_unlock(current);

<irq and ctxsw to kthread>

context_switch(prev=foo, next=kthread)
  mm = next->mm;
  oldmm = prev->active_mm;

  if (!mm) {			// True for kthread
    next->active_mm = oldmm;
    mmgrab(oldmm);		// foo's mm has count +2
  }

  if (!prev->mm) {		// True for foo
    rq->prev_mm = oldmm;
  }

  finish_task_switch
    mm = rq->prev_mm;
    if (mm) {			// True (foo's mm)
      mmdrop(mm);		// foo's mm has count +1
    }

  [...]

<ctxsw to task bar>

context_switch(prev=kthread, next=bar)
  mm = next->mm;
  oldmm = prev->active_mm;	// foo's mm!

  if (!prev->mm) {		// True for kthread
    rq->prev_mm = oldmm;
  }

  finish_task_switch
    mm = rq->prev_mm;
    if (mm) {			// True (foo's mm)
      mmdrop(mm);		// foo's mm has count +0
    }

  [...]

<ctxsw back to task foo>

context_switch(prev=bar, next=foo)
  mm = next->mm;
  oldmm = prev->active_mm;

  if (!mm) {			// True for foo
    next->active_mm = oldmm;	// This is bar's mm
    mmgrab(oldmm);		// bar's mm has count +1
  }


[return back to exit_mm]

mmdrop(mm);			// foo's mm has count -1

At this point, we've got an imbalanced count on the mm and could free it
prematurely as seen in the KASAN log. A subsequent context-switch away
from foo would therefore result in a use-after-free.

Assuming others agree with this diagnosis, I'm not sure how to fix it.
It's basically not safe to set current->mm = NULL with preemption enabled.

Will
On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Assuming others agree with this diagnosis, I'm not sure how to fix it.
> It's basically not safe to set current->mm = NULL with preemption enabled.

One thing we could try would be to leave current->mm alone and just do the
mmdrop in finish_task_switch at the point where we put the task_struct for
DEAD tasks. mm_update_next_owner might need updating so that it doesn't
re-assign current as the owner and run into a BUG_ON.

Will
On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Instead, we've come up with a more plausible sequence that can in theory
> happen on a single CPU:
>
> <task foo calls exit()>
>
> do_exit
>   exit_mm

If this is the last task of the process, we would expect:

	mm_count == 1
	mm_users == 1

at this point.

>     mmgrab(mm);			// foo's mm has count +1
>     BUG_ON(mm != current->active_mm);
>     task_lock(current);
>     current->mm = NULL;
>     task_unlock(current);

So the whole active_mm is basically the last 'real' mm, and its purpose
is to avoid switch_mm() between user tasks and kernel tasks.

A kernel task has !->mm. We do this by incrementing mm_count when
switching from user to kernel task and decrementing when switching from
kernel to user.

What exit_mm() does is change a user task into a 'kernel' task. So it
should increment mm_count to mirror the context switch. I suspect this
is what the mmgrab() in exit_mm() is for.

> <irq and ctxsw to kthread>
>
> context_switch(prev=foo, next=kthread)
>   mm = next->mm;
>   oldmm = prev->active_mm;
>
>   if (!mm) {			// True for kthread
>     next->active_mm = oldmm;
>     mmgrab(oldmm);		// foo's mm has count +2
>   }
>
>   if (!prev->mm) {		// True for foo
>     rq->prev_mm = oldmm;
>   }
>
>   finish_task_switch
>     mm = rq->prev_mm;
>     if (mm) {			// True (foo's mm)
>       mmdrop(mm);		// foo's mm has count +1
>     }
>
>   [...]
>
> <ctxsw to task bar>
>
> context_switch(prev=kthread, next=bar)
>   mm = next->mm;
>   oldmm = prev->active_mm;	// foo's mm!
>
>   if (!prev->mm) {		// True for kthread
>     rq->prev_mm = oldmm;
>   }
>
>   finish_task_switch
>     mm = rq->prev_mm;
>     if (mm) {			// True (foo's mm)
>       mmdrop(mm);		// foo's mm has count +0

The context switch into the next user task will then decrement. At this
point foo no longer has a reference to its mm, except on the stack.

>     }
>
>   [...]
>
> <ctxsw back to task foo>
>
> context_switch(prev=bar, next=foo)
>   mm = next->mm;
>   oldmm = prev->active_mm;
>
>   if (!mm) {			// True for foo
>     next->active_mm = oldmm;	// This is bar's mm
>     mmgrab(oldmm);		// bar's mm has count +1
>   }
>
>
> [return back to exit_mm]

Enter mm_users, this counts the number of tasks associated with the mm.
We start with 1 in mm_init(), and when it drops to 0, we decrement
mm_count. Since we also start with mm_count == 1, this would appear
consistent.

	mmput()	// --mm_users == 0, which then results in:

> mmdrop(mm);			// foo's mm has count -1

In the above case, that's the very last reference to the mm, and since
we started out with mm_count == 1, this -1 makes 0 and we do the actual
free.

> At this point, we've got an imbalanced count on the mm and could free it
> prematurely as seen in the KASAN log.

I'm not sure I see premature. At this point mm_users==0, mm_count==0 and
we freed mm and there is no further use of the on-stack mm pointer and
foo no longer has a pointer to it in either ->mm or ->active_mm. It's
well and proper dead.

> A subsequent context-switch away from foo would therefore result in a
> use-after-free.

At the above point, foo no longer has a reference to mm, we cleared ->mm
early, and the context switch to bar cleared ->active_mm. The switch
back into foo then results with foo->active_mm == bar->mm, which is
fine.
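Peter's bookkeeping can be sanity-checked with a tiny user-space model of the
sequence above (a sketch only: plain ints stand in for the kernel's atomics,
and the helpers merely mirror the kernel names):

#include <assert.h>
#include <stdio.h>

/*
 * Toy model: mm_init() starts with mm_users == 1 and mm_count == 1;
 * mmput() drops mm_users and calls mmdrop() once it reaches zero;
 * mmdrop() "frees" the mm once mm_count reaches zero.
 */
static int mm_users = 1, mm_count = 1, freed;

static void mmgrab(void) { mm_count++; }
static void mmdrop(void) { if (--mm_count == 0) freed = 1; }
static void mmput(void)  { if (--mm_users == 0) mmdrop(); }

int main(void)
{
	mmgrab();	/* exit_mm(): foo turns into a 'kernel' task        */
	mmgrab();	/* ctxsw foo -> kthread: kthread borrows foo's mm   */
	mmdrop();	/* finish_task_switch(): rq->prev_mm was foo's mm   */
	mmdrop();	/* ctxsw kthread -> bar: the borrowed mm is dropped */
	mmput();	/* back in exit_mm(): last user, performs the free  */

	printf("mm_users=%d mm_count=%d freed=%d\n", mm_users, mm_count, freed);
	assert(mm_users == 0 && mm_count == 0 && freed == 1);
	return 0;
}

The counts come out balanced, with the free happening exactly at the last
reference, which matches Peter's conclusion that nothing here is premature.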
On Thu, Feb 15, 2018 at 05:47:54PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> > Instead, we've come up with a more plausible sequence that can in theory
> > happen on a single CPU:
> >
> > <task foo calls exit()>
> >
> > do_exit
> >   exit_mm
>
> If this is the last task of the process, we would expect:
>
> 	mm_count == 1
> 	mm_users == 1
>
> at this point.
>
> >     mmgrab(mm);			// foo's mm has count +1
> >     BUG_ON(mm != current->active_mm);
> >     task_lock(current);
> >     current->mm = NULL;
> >     task_unlock(current);
>
> So the whole active_mm is basically the last 'real' mm, and its purpose
> is to avoid switch_mm() between user tasks and kernel tasks.
>
> A kernel task has !->mm. We do this by incrementing mm_count when
> switching from user to kernel task and decrementing when switching from
> kernel to user.
>
> What exit_mm() does is change a user task into a 'kernel' task. So it
> should increment mm_count to mirror the context switch. I suspect this
> is what the mmgrab() in exit_mm() is for.
>
> > <irq and ctxsw to kthread>
> >
> > context_switch(prev=foo, next=kthread)
> >   mm = next->mm;
> >   oldmm = prev->active_mm;
> >
> >   if (!mm) {			// True for kthread
> >     next->active_mm = oldmm;
> >     mmgrab(oldmm);		// foo's mm has count +2
> >   }
> >
> >   if (!prev->mm) {		// True for foo
> >     rq->prev_mm = oldmm;
> >   }
> >
> >   finish_task_switch
> >     mm = rq->prev_mm;
> >     if (mm) {			// True (foo's mm)
> >       mmdrop(mm);		// foo's mm has count +1
> >     }
> >
> >   [...]
> >
> > <ctxsw to task bar>
> >
> > context_switch(prev=kthread, next=bar)
> >   mm = next->mm;
> >   oldmm = prev->active_mm;	// foo's mm!
> >
> >   if (!prev->mm) {		// True for kthread
> >     rq->prev_mm = oldmm;
> >   }
> >
> >   finish_task_switch
> >     mm = rq->prev_mm;
> >     if (mm) {			// True (foo's mm)
> >       mmdrop(mm);		// foo's mm has count +0
>
> The context switch into the next user task will then decrement. At this
> point foo no longer has a reference to its mm, except on the stack.
>
> >     }
> >
> >   [...]
> >
> > <ctxsw back to task foo>
> >
> > context_switch(prev=bar, next=foo)
> >   mm = next->mm;
> >   oldmm = prev->active_mm;
> >
> >   if (!mm) {			// True for foo
> >     next->active_mm = oldmm;	// This is bar's mm
> >     mmgrab(oldmm);		// bar's mm has count +1
> >   }
> >
> >
> > [return back to exit_mm]
>
> Enter mm_users, this counts the number of tasks associated with the mm.
> We start with 1 in mm_init(), and when it drops to 0, we decrement
> mm_count. Since we also start with mm_count == 1, this would appear
> consistent.
>
> 	mmput()	// --mm_users == 0, which then results in:
>
> > mmdrop(mm);			// foo's mm has count -1
>
> In the above case, that's the very last reference to the mm, and since
> we started out with mm_count == 1, this -1 makes 0 and we do the actual
> free.
>
> > At this point, we've got an imbalanced count on the mm and could free it
> > prematurely as seen in the KASAN log.
>
> I'm not sure I see premature. At this point mm_users==0, mm_count==0 and
> we freed mm and there is no further use of the on-stack mm pointer and
> foo no longer has a pointer to it in either ->mm or ->active_mm. It's
> well and proper dead.
>
> > A subsequent context-switch away from foo would therefore result in a
> > use-after-free.
>
> At the above point, foo no longer has a reference to mm, we cleared ->mm
> early, and the context switch to bar cleared ->active_mm. The switch
> back into foo then results with foo->active_mm == bar->mm, which is
> fine.

Bugger, you're right.
When we switch off foo after freeing the mm, we'll actually access its
active mm which points to bar's mm. So whilst this can explain part of
the kasan splat, it doesn't explain the actual use-after-free.

More head-scratching required :(

Will
On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Instead, we've come up with a more plausible sequence that can in theory
> happen on a single CPU:
>
> <task foo calls exit()>
>
> do_exit
>   exit_mm
>     mmgrab(mm);			// foo's mm has count +1
>     BUG_ON(mm != current->active_mm);
>     task_lock(current);
>     current->mm = NULL;
>     task_unlock(current);
>
> <irq and ctxsw to kthread>

[...]

> mmdrop(mm);			// foo's mm has count -1
>
> At this point, we've got an imbalanced count on the mm and could free it
> prematurely as seen in the KASAN log. A subsequent context-switch away
> from foo would therefore result in a use-after-free.

Peter already dismissed an algorithm issue here but I thought I'd give
model checking a go (it only deals with mm_users/mm_count; also added a
heavily simplified exec_mmap() in the loop):

https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/ctxsw.tla

As expected, it didn't show any problems (though it does not take memory
ordering into account). Now, there are lots of other mmget/mmput and
mmgrab/mmdrop throughout the kernel and finding an imbalanced call needs
more work.
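One way to chase the imbalanced caller mentioned above would be to log every
grab and drop together with its callsite and pair them up offline; a rough
debugging sketch (the _traced helpers are made-up names, not existing kernel
API, and callers would have to be converted by hand):

static inline void mmgrab_traced(struct mm_struct *mm)
{
	/* Hypothetical helper: log the new count and the callsite. */
	trace_printk("mmgrab %p -> %d from %pS\n", mm,
		     atomic_inc_return(&mm->mm_count), (void *)_RET_IP_);
}

static inline void mmdrop_traced(struct mm_struct *mm)
{
	/* Hypothetical helper: log the callsite before dropping. */
	trace_printk("mmdrop %p from %pS\n", mm, (void *)_RET_IP_);
	mmdrop(mm);
}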
diff --git a/kernel/exit.c b/kernel/exit.c
index 995453d9fb55..f91e8d56b03f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -534,8 +534,9 @@ static void exit_mm(void)
 	}
 	mmgrab(mm);
 	BUG_ON(mm != current->active_mm);
-	/* more a memory barrier than a real lock */
 	task_lock(current);
+	/* Ensure we've grabbed the mm before setting current->mm to NULL */
+	smp_mb__after_spin_lock();
 	current->mm = NULL;
 	up_read(&mm->mmap_sem);
 	enter_lazy_tlb(mm, current);