diff mbox series

fork: avoid inappropriate uprobe access to invalid mm

Message ID 20241210163104.55181-1-lorenzo.stoakes@oracle.com (mailing list archive)
State New
Series fork: avoid inappropriate uprobe access to invalid mm

Commit Message

Lorenzo Stoakes Dec. 10, 2024, 4:31 p.m. UTC
If dup_mmap() encounters an issue, the uprobe code is currently able to
access the relevant mm via the reverse mapping (in build_map_info()) and,
given a very unlucky race window, observe the invalid XA_ZERO_ENTRY state
which we establish as part of the fork error path.

This occurs because uprobe_write_opcode() invokes anon_vma_prepare(), which
in turn invokes find_mergeable_anon_vma(), which uses a VMA iterator. That
iterator's vma_iter_load() uses the advanced maple tree API and is thus
able to observe the XA_ZERO_ENTRY entries added by dup_mmap() since commit
d24062914837 ("fork: use __mt_dup() to duplicate maple tree in
dup_mmap()").

That change was made on the assumption that only process tear-down code
would actually observe (and make use of) these values. However this very
unlikely but still possible uprobe edge case exists and unfortunately does
make them observable.

The uprobe operation prevents races against the dup_mmap() operation via
the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap()
and dropped via uprobe_end_dup_mmap(), and held across
register_for_each_vma() prior to invoking build_map_info() which does the
reverse mapping lookup.

Currently these are acquired and dropped within dup_mmap(), which opens a
race window between dup_mmap() failing and the invoking dup_mm() tearing
down the mm.

We can avoid all this by just moving the invocation of
uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm()
and only release this lock once the dup_mmap() operation succeeds or
cleanup is done.

This means that the uprobe code can never observe an incompletely
constructed mm and resolves the issue in this case.

Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 kernel/fork.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

--
2.47.1

Comments

Oleg Nesterov Dec. 10, 2024, 4:53 p.m. UTC | #1
I must have missed something, but...

On 12/10, Lorenzo Stoakes wrote:
>
> @@ -1746,9 +1741,11 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
>  	if (!mm_init(mm, tsk, mm->user_ns))
>  		goto fail_nomem;
> 
> +	uprobe_start_dup_mmap();
>  	err = dup_mmap(mm, oldmm);
>  	if (err)
> -		goto free_pt;
> +		goto free_pt_end_uprobe;
> +	uprobe_end_dup_mmap();
> 
>  	mm->hiwater_rss = get_mm_rss(mm);
>  	mm->hiwater_vm = mm->total_vm;
> @@ -1758,6 +1755,8 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
> 
>  	return mm;
> 
> +free_pt_end_uprobe:
> +	uprobe_end_dup_mmap();

if dup_mmap() fails and "mm" is incomplete, then with this version dup_mmap_sem
is dropped before __mmput/exit_mmap/etc. How can this help?

Oleg.
Lorenzo Stoakes Dec. 10, 2024, 5:01 p.m. UTC | #2
On Tue, Dec 10, 2024 at 05:53:34PM +0100, Oleg Nesterov wrote:
> I must have missed something, but...
>
> On 12/10, Lorenzo Stoakes wrote:
> >
> > @@ -1746,9 +1741,11 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
> >  	if (!mm_init(mm, tsk, mm->user_ns))
> >  		goto fail_nomem;
> >
> > +	uprobe_start_dup_mmap();
> >  	err = dup_mmap(mm, oldmm);
> >  	if (err)
> > -		goto free_pt;
> > +		goto free_pt_end_uprobe;
> > +	uprobe_end_dup_mmap();
> >
> >  	mm->hiwater_rss = get_mm_rss(mm);
> >  	mm->hiwater_vm = mm->total_vm;
> > @@ -1758,6 +1755,8 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
> >
> >  	return mm;
> >
> > +free_pt_end_uprobe:
> > +	uprobe_end_dup_mmap();
>
> if dup_mmap() fails and "mm" is incomplete, then with this version dup_mmap_sem
> is dropped before __mmput/exit_mmap/etc. How can this help?

Doh! Sorry, this is me rather rushing through this aspect of it here, my
mistake... I may risk an immediate v2 with apologies to all for the noise... :P

>
> Oleg.
>
Oleg Nesterov Dec. 10, 2024, 5:23 p.m. UTC | #3
On 12/10, Oleg Nesterov wrote:
>
> I must have missed something, but...
>
> On 12/10, Lorenzo Stoakes wrote:
> >
> > @@ -1746,9 +1741,11 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
> >  	if (!mm_init(mm, tsk, mm->user_ns))
> >  		goto fail_nomem;
> >
> > +	uprobe_start_dup_mmap();
> >  	err = dup_mmap(mm, oldmm);
> >  	if (err)
> > -		goto free_pt;
> > +		goto free_pt_end_uprobe;
> > +	uprobe_end_dup_mmap();
> >
> >  	mm->hiwater_rss = get_mm_rss(mm);
> >  	mm->hiwater_vm = mm->total_vm;
> > @@ -1758,6 +1755,8 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
> >
> >  	return mm;
> >
> > +free_pt_end_uprobe:
> > +	uprobe_end_dup_mmap();
>
> if dup_mmap() fails and "mm" is incomplete, then with this version dup_mmap_sem
> is dropped before __mmput/exit_mmap/etc. How can this help?

Easy to fix, but what ensures that another mmget(mm) is not possible until
dup_mm() calls mmput() and drops dup_mmap_sem? Sorry, my understanding of mm/
is very limited...

Oleg.
Lorenzo Stoakes Dec. 10, 2024, 5:34 p.m. UTC | #4
On Tue, Dec 10, 2024 at 06:23:19PM +0100, Oleg Nesterov wrote:
> On 12/10, Oleg Nesterov wrote:
> >
> > I must have missed something, but...
> >
> > On 12/10, Lorenzo Stoakes wrote:
> > >
> > > @@ -1746,9 +1741,11 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
> > >  	if (!mm_init(mm, tsk, mm->user_ns))
> > >  		goto fail_nomem;
> > >
> > > +	uprobe_start_dup_mmap();
> > >  	err = dup_mmap(mm, oldmm);
> > >  	if (err)
> > > -		goto free_pt;
> > > +		goto free_pt_end_uprobe;
> > > +	uprobe_end_dup_mmap();
> > >
> > >  	mm->hiwater_rss = get_mm_rss(mm);
> > >  	mm->hiwater_vm = mm->total_vm;
> > > @@ -1758,6 +1755,8 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
> > >
> > >  	return mm;
> > >
> > > +free_pt_end_uprobe:
> > > +	uprobe_end_dup_mmap();
> >
> > if dup_mmap() fails and "mm" is incomplete, then with this version dup_mmap_sem
> > is dropped before __mmput/exit_mmap/etc. How can this help?
>
> Easy to fix, but what ensures that another mmget(mm) is not possible until
> dup_mm() calls mmput() and drops dup_mmap_sem? Sorry, my understanding of mm/
> is very limited...

Yeah that's a really good point - it's not impossible, though
unlikely. This lives alongside the 'what if the rmap provides access to
something else' consideration.

I believe Liam's going to look into something more solid on the maple tree
side.

So this change doesn't purport to entirely shut down all possible routes to
this issue arising but rather is a very small and direct way of addressing
the proximal case.

The thing with this situation is that it's borderline impossible to happen
in practice, so we should probably be cautious about going out of our way
to preclude it if it involves too much change.

Somehow marking the mm as 'don't touch' would be good, but then we'd have
to audit every single corner of the kernel that touches it to be really
sure that works, and we don't have any MMF_xxx bits left to play with
really. MMF_UNSTABLE does seem suited but could it somehow interact with
the OOM killer in some broken way...?

Not sure the RoI on that is good.

Probably the better thing to do here is to attack the XA_ZERO_ENTRY stuff
from the maple tree side - see if we can't audit/rule out inappropriate
access to these (perhaps adding filters in the vma_xxx() helpers?) which
eliminates this particular issue from this side.

But I think this patch as-is is relatively low risk to scoop up the most
likely version of the nearly-impossible thing :)

>
> Oleg.
>

Patch

diff --git a/kernel/fork.c b/kernel/fork.c
index d532f893e977..c2591063dfc1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -696,11 +696,8 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	LIST_HEAD(uf);
 	VMA_ITERATOR(vmi, mm, 0);

-	uprobe_start_dup_mmap();
-	if (mmap_write_lock_killable(oldmm)) {
-		retval = -EINTR;
-		goto fail_uprobe_end;
-	}
+	if (mmap_write_lock_killable(oldmm))
+		return -EINTR;
 	flush_cache_dup_mm(oldmm);
 	uprobe_dup_mmap(oldmm, mm);
 	/*
@@ -839,8 +836,6 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		dup_userfaultfd_complete(&uf);
 	else
 		dup_userfaultfd_fail(&uf);
-fail_uprobe_end:
-	uprobe_end_dup_mmap();
 	return retval;

 fail_nomem_anon_vma_fork:
@@ -1746,9 +1741,11 @@  static struct mm_struct *dup_mm(struct task_struct *tsk,
 	if (!mm_init(mm, tsk, mm->user_ns))
 		goto fail_nomem;

+	uprobe_start_dup_mmap();
 	err = dup_mmap(mm, oldmm);
 	if (err)
-		goto free_pt;
+		goto free_pt_end_uprobe;
+	uprobe_end_dup_mmap();

 	mm->hiwater_rss = get_mm_rss(mm);
 	mm->hiwater_vm = mm->total_vm;
@@ -1758,6 +1755,8 @@  static struct mm_struct *dup_mm(struct task_struct *tsk,

 	return mm;

+free_pt_end_uprobe:
+	uprobe_end_dup_mmap();
 free_pt:
 	/* don't put binfmt in mmput, we haven't got module yet */
 	mm->binfmt = NULL;