
[v2] fork: avoid inappropriate uprobe access to invalid mm

Message ID 20241210172412.52995-1-lorenzo.stoakes@oracle.com (mailing list archive)
State New

Commit Message

Lorenzo Stoakes Dec. 10, 2024, 5:24 p.m. UTC
If dup_mmap() encounters an issue, uprobe is currently able to access the
relevant mm via the reverse mapping (in build_map_info()) and, if we are
very unlucky with a race window, observe the invalid XA_ZERO_ENTRY state
which we establish as part of the fork error path.

This occurs because uprobe_write_opcode() invokes anon_vma_prepare(), which
in turn invokes find_mergeable_anon_vma(); that uses a VMA iterator, and
vma_iter_load() uses the advanced maple tree API and is thus able to
observe the XA_ZERO_ENTRY entries added in dup_mmap() by commit
d24062914837 ("fork: use __mt_dup() to duplicate maple tree in
dup_mmap()").

That change was made on the assumption that only process tear-down code
would actually observe (and make use of) these values. However, this very
unlikely but still possible edge case with uprobes exists and unfortunately
makes them observable.
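
For illustration, the failure handling that leaves these entries behind
looks roughly like the below (a simplified sketch of the logic introduced
by d24062914837, not the verbatim code):

	/*
	 * dup_mmap() failure path: the maple tree was duplicated wholesale
	 * up front via __mt_dup(), so on failure the failure point is
	 * marked with XA_ZERO_ENTRY, letting exit_mmap() know where to
	 * stop releasing VMAs. Anything else walking the tree in this
	 * window can observe these marker entries.
	 */
	if (retval && mpnt) {
		mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
		mas_store(&vmi.mas, XA_ZERO_ENTRY);
	}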

The uprobe operation prevents races against dup_mmap() via the
dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap(),
dropped via uprobe_end_dup_mmap(), and held across
register_for_each_vma() before it invokes build_map_info(), which performs
the reverse mapping lookup.

Currently the semaphore is acquired and dropped within dup_mmap() itself,
which exposes a race window before the invoking dup_mm() has performed its
error handling and torn down the mm.
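
In other words, prior to this patch the ordering is roughly as follows
(a simplified sketch, not the literal code):

	/* dup_mmap(), before this patch: */
	uprobe_start_dup_mmap();
	...
	/* on failure, XA_ZERO_ENTRY markers are left in the maple tree */
	uprobe_end_dup_mmap();	/* dropped on failure too */
	return retval;

	/* dup_mm(), the caller: */
	err = dup_mmap(mm, oldmm);
	if (err)
		goto free_pt;	/*
				 * Race window: dup_mmap_sem is no longer
				 * held, but the mm remains reachable via
				 * the rmap until mmput() tears it down.
				 */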

We can avoid all this by simply moving the invocations of
uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level into dup_mm(),
releasing the lock only once the dup_mmap() operation succeeds or cleanup
is done.

This means the uprobe code can never observe an incompletely constructed
mm, which resolves the issue in this case.
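
The resulting dup_mm() flow then becomes, in sketch form (simplified from
the patch below, with unrelated details elided):

	uprobe_start_dup_mmap();
	err = dup_mmap(mm, oldmm);
	if (err)
		goto free_pt;
	uprobe_end_dup_mmap();		/* success: mm is fully constructed */
	...
	return mm;

free_pt:
	...
	mmput(mm);			/* teardown completes first... */
	if (err)
		uprobe_end_dup_mmap();	/* ...only then can uprobe observe mm */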

Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
v2:
* Quick fix for silly mistake in error handling in dup_mm() as pointed out by
  Oleg.

v1:
https://lore.kernel.org/linux-mm/20241210163104.55181-1-lorenzo.stoakes@oracle.com/

 kernel/fork.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

--
2.47.0

Comments

David Hildenbrand Dec. 10, 2024, 7:35 p.m. UTC | #1
On 10.12.24 18:24, Lorenzo Stoakes wrote:
> [...]

What I understand is: we need to perform the uprobe_end_dup_mmap() after 
the mmput().

I assume/hope that we cannot see another mmget() before we return here. 
In that case, this LGTM.
Lorenzo Stoakes Dec. 10, 2024, 8:59 p.m. UTC | #2
On Tue, Dec 10, 2024 at 08:35:30PM +0100, David Hildenbrand wrote:
> On 10.12.24 18:24, Lorenzo Stoakes wrote:
> > [...]
>
> What I understand is: we need to perform the uprobe_end_dup_mmap() after the
> mmput().

Ack yes.

>
> I assume/hope that we cannot see another mmget() before we return here. In
> that case, this LGTM.

We are dealing with a tiny time window and brief rmap availability, so it's
hard to say that's impossible. You'd also have to have failed to allocate
some very small amounts of memory, so we are talking lottery odds for this
to even happen in the first instance :)

I mean the syzkaller report took a year or so to hit it, and had to do
fault injection to do so.

Also, it's not impossible that the mm containing XA_ZERO_ENTRY items could
be accessed through other means (I believe Liam was looking into this).

However, this patch is intended to at least eliminate the most proximate,
obvious case with as simple a code change as possible.

Ideally we'd mark the mm as being inaccessible somehow, but MMF_ flags are
out, and the obvious one to extend to mean this here, MMF_UNSTABLE, may
interact with the OOM killer logic in some horrid way.

>
> --
> Cheers,
>
> David / dhildenb
>

So overall this patch is a relatively benign attempt to deal with the most
obvious issue with no apparent cost, but doesn't really rule out the need
to do more going forward...
David Hildenbrand Dec. 10, 2024, 9:43 p.m. UTC | #3
On 10.12.24 21:59, Lorenzo Stoakes wrote:
> On Tue, Dec 10, 2024 at 08:35:30PM +0100, David Hildenbrand wrote:
>> On 10.12.24 18:24, Lorenzo Stoakes wrote:
>>> [...]
>>
>> What I understand is: we need to perform the uprobe_end_dup_mmap() after the
>> mmput().
> 
> Ack yes.
> 
>>
>> I assume/hope that we cannot see another mmget() before we return here. In
>> that case, this LGTM.
> 
> We are dealing with a tiny time window and brief rmap availability, so it's
> hard to say that's impossible. You'd also have to have failed to allocate
> some very small amounts of memory, so we are talking lottery odds for this
> to even happen in the first instance :)

Yes, likely the error injection framework is one of the only reliable 
ways to trigger that :)

> 
> I mean the syzkaller report took a year or so to hit it, and had to do
> fault injection to do so.

Ah, there it is: "fault injection" :D

> 
> Also, it's not impossible that the mm containing XA_ZERO_ENTRY items could
> be accessed through other means (I believe Liam was looking into this).
> 
> However, this patch is intended to at least eliminate the most proximate,
> obvious case with as simple a code change as possible.
> 
> Ideally we'd mark the mm as being inaccessible somehow, but MMF_ flags are
> out, and the obvious one to extend to mean this here, MMF_UNSTABLE, may
> interact with the OOM killer logic in some horrid way.
>
> So overall this patch is a relatively benign attempt to deal with the most
> obvious issue with no apparent cost, but doesn't really rule out the need
> to do more going forward...

Maybe add a bit of that to the patch description. Like "Fixes the 
reproducer, but likely there is more we'll tackle separately", your call.

Thanks for the details!

Patch

diff --git a/kernel/fork.c b/kernel/fork.c
index d532f893e977..0bf377e2892b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -696,11 +696,8 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	LIST_HEAD(uf);
 	VMA_ITERATOR(vmi, mm, 0);

-	uprobe_start_dup_mmap();
-	if (mmap_write_lock_killable(oldmm)) {
-		retval = -EINTR;
-		goto fail_uprobe_end;
-	}
+	if (mmap_write_lock_killable(oldmm))
+		return -EINTR;
 	flush_cache_dup_mm(oldmm);
 	uprobe_dup_mmap(oldmm, mm);
 	/*
@@ -839,8 +836,6 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		dup_userfaultfd_complete(&uf);
 	else
 		dup_userfaultfd_fail(&uf);
-fail_uprobe_end:
-	uprobe_end_dup_mmap();
 	return retval;

 fail_nomem_anon_vma_fork:
@@ -1746,9 +1741,11 @@  static struct mm_struct *dup_mm(struct task_struct *tsk,
 	if (!mm_init(mm, tsk, mm->user_ns))
 		goto fail_nomem;

+	uprobe_start_dup_mmap();
 	err = dup_mmap(mm, oldmm);
 	if (err)
 		goto free_pt;
+	uprobe_end_dup_mmap();

 	mm->hiwater_rss = get_mm_rss(mm);
 	mm->hiwater_vm = mm->total_vm;
@@ -1763,6 +1760,8 @@  static struct mm_struct *dup_mm(struct task_struct *tsk,
 	mm->binfmt = NULL;
 	mm_init_owner(mm, NULL);
 	mmput(mm);
+	if (err)
+		uprobe_end_dup_mmap();

 fail_nomem:
 	return NULL;