diff mbox series

exec: fix the racy usage of fs_struct->in_exec

Message ID 20250324160003.GA8878@redhat.com (mailing list archive)
State New
Headers show
Series exec: fix the racy usage of fs_struct->in_exec | expand

Commit Message

Oleg Nesterov March 24, 2025, 4 p.m. UTC
check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve()
paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it
fails we have the following race:

	T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex

	T2 sets fs->in_exec = 1

	T1 clears fs->in_exec

	T2 continues with fs->in_exec == 0

Change fs/exec.c to clear fs->in_exec with cred_guard_mutex held.

Reported-by: syzbot+1c486d0b62032c82a968@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/
Cc: stable@vger.kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/exec.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

Comments

Mateusz Guzik March 24, 2025, 5:01 p.m. UTC | #1
On Mon, Mar 24, 2025 at 5:00 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve()
> paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it
> fails we have the following race:
>
>         T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex
>
>         T2 sets fs->in_exec = 1
>
>         T1 clears fs->in_exec
>
>         T2 continues with fs->in_exec == 0
>
> Change fs/exec.c to clear fs->in_exec with cred_guard_mutex held.
>

I had taken cursory glances at this code earlier, and the more I try to
understand it the more confused I am.

The mutex at hand hides in ->signal and fs->in_exec remains treated as a flag.

The loop in check_unsafe_exec() tries to guard against a task which
shares ->fs but does not share ->mm? To my reading, this implies an
unshared ->signal, so the mutex protection does not apply.

I think this ends up being harmless, as in this case nobody is going to
set ->in_exec (instead they are going to get LSM_UNSAFE_SHARE), so
clearing it in these spots becomes a no-op.
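
[Editor's sketch: a condensed paraphrase of the check_unsafe_exec() logic
under discussion, to make the thread-count vs ->fs->users distinction
concrete; this is not a verbatim copy of fs/exec.c.]

	static void check_unsafe_exec(struct linux_binprm *bprm)
	{
		struct task_struct *p = current, *t;
		unsigned n_fs = 1;

		/* count live sub-threads that share p->fs */
		spin_lock(&p->fs->lock);
		rcu_read_lock();
		for_other_threads(p, t) {
			if (t->fs == p->fs)
				n_fs++;
		}
		rcu_read_unlock();

		/*
		 * An outside (non-thread) ->fs user exists: degrade to
		 * LSM_UNSAFE_SHARE instead of claiming ->in_exec.
		 */
		if (p->fs->users > n_fs)
			bprm->unsafe |= LSM_UNSAFE_SHARE;
		else
			p->fs->in_exec = 1;
		spin_unlock(&p->fs->lock);
	}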

At the same time the check in copy_fs() no longer blocks clones, as
check_unsafe_exec() has already opted for LSM_UNSAFE_SHARE?
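
[Editor's sketch: the corresponding gate on the clone() side, paraphrased
from kernel/fork.c; a CLONE_FS clone is refused with a transient -EAGAIN
while ->in_exec is set.]

	static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
	{
		struct fs_struct *fs = current->fs;

		if (clone_flags & CLONE_FS) {
			/* refuse to share ->fs with a task that is in exec */
			spin_lock(&fs->lock);
			if (fs->in_exec) {
				spin_unlock(&fs->lock);
				return -EAGAIN;
			}
			fs->users++;
			spin_unlock(&fs->lock);
			return 0;
		}
		tsk->fs = copy_fs_struct(fs);
		return tsk->fs ? 0 : -ENOMEM;
	}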

Even if this all works with the patch, this is an incredibly odd set
of dependencies and I don't see a good reason for it to still be here.

Per my other e-mail the obvious scheme would serialize all execs
sharing ->fs and make copy_fs do a killable wait for execs to finish.
Arguably this would also improve userspace-visible behavior as a
transient -EAGAIN would be eliminated.

No matter what the specific solution, imo treating ->in_exec as a flag
needs to die.

Is there a problem getting this done even for stable kernels? I
understand it would be harder to backport churn-wise, but it should be
much easier to reason about.

Just my $0.03
Oleg Nesterov March 24, 2025, 6:27 p.m. UTC | #2
On 03/24, Mateusz Guzik wrote:
>
> I had taken cursory glances at this code earlier, and the more I try to
> understand it the more confused I am.

You are not alone ;)

> Per my other e-mail the obvious scheme would serialize all execs
> sharing ->fs and make copy_fs do a killable wait for execs to finish.
> Arguably this would also improve userspace-visible behavior as a
> transient -EAGAIN would be eliminated.

I had the same feeling years ago. Why didn't I do it? I can't recall.
Perhaps because I found some problem with the idea, but most probably
because I failed to come up with a patch that was both correct and simple.

> Is there a problem getting this done even for stable kernels? I
> understand it would be harder to backport churn-wise, but it should be
> much easier to reason about.

I won't argue with another solution. But this problem is quite old,
unless I am totally confused this logic was wrong from the very
beginning when fs->in_exec was introduced by 498052bba55ec.

So to me it would be better to have the trivial fix for stable,
exactly because it is trivially backportable. Then cleanup/simplify
this logic on top of it.

Oleg.
Oleg Nesterov March 24, 2025, 6:37 p.m. UTC | #3
On 03/24, Oleg Nesterov wrote:
>
> I won't argue with another solution. But this problem is quite old,

Yes, but...

> unless I am totally confused this logic was wrong from the very
> beginning when fs->in_exec was introduced by 498052bba55ec.

OK, I was wrong. According to git show 498052bba55ec:fs/exec.c
this patch was correct in this respect. Sorry for the noise.

> So to me it would be better to have the trivial fix for stable,
> exactly because it is trivially backportable. Then cleanup/simplify
> this logic on top of it.

Yes...

Oleg.
Mateusz Guzik March 24, 2025, 10:24 p.m. UTC | #4
On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote:
> I won't argue with another solution. But this problem is quite old,
> unless I am totally confused this logic was wrong from the very
> beginning when fs->in_exec was introduced by 498052bba55ec.
>
> So to me it would be better to have the trivial fix for stable,
> exactly because it is trivially backportable. Then cleanup/simplify
> this logic on top of it.
>

So I got myself a crap testcase with a CLONE_FS'ed task which can
execve and sanity-checked that suid is indeed not honored as expected.
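
[Editor's sketch: a minimal userspace test of the kind described above,
assuming a hypothetical set-uid helper at /tmp/suid-probe that reports its
euid. With ->fs shared by a non-thread task, check_unsafe_exec() should
take the LSM_UNSAFE_SHARE path and the suid bit should not be honored.]

	#define _GNU_SOURCE
	#include <sched.h>
	#include <signal.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/wait.h>

	static char stack[64 * 1024];

	static int child(void *arg)
	{
		/* shares ->fs (but not ->mm or ->signal) with the parent */
		execl("/tmp/suid-probe", "suid-probe", (char *)NULL);
		_exit(1);
	}

	int main(void)
	{
		/* the stack grows down on the usual architectures */
		if (clone(child, stack + sizeof(stack), CLONE_FS | SIGCHLD, NULL) < 0)
			exit(1);
		/* race our own execve() against the clone's */
		execl("/tmp/suid-probe", "suid-probe", (char *)NULL);
		wait(NULL);
		return 0;
	}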

The most convenient way out (planting a mutex in there) does not work
because of the size: fs_struct is already at 56 bytes and I'm not going
to push it past 64.
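
[Editor's note: struct fs_struct, lightly paraphrased from
include/linux/fs_struct.h, as background for the size argument above.]

	struct fs_struct {
		int users;
		spinlock_t lock;
		seqcount_spinlock_t seq;
		int umask;
		int in_exec;		/* set while an execve() is in flight */
		struct path root, pwd;	/* two struct path = four pointers */
	};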

However, it looks like the "wait on bit" machinery can be employed here,
based on what I have seen with inodes (see __wait_on_freeing_inode), and
that should avoid growing the struct, unless I'm grossly misreading
something.

Anyhow, the plan would be to serialize on the bit, synchronized with
the current spin lock. copy_fs would call a helper to wait for it to
clear, and would still bump ->users under the spin lock.

This would decouple the handling from cred_mutex and avoid weirdness
like clearing the ->in_exec flag when we never set it.
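
[Editor's sketch of the plan above: entirely hypothetical, the flags word,
FS_IN_EXEC_BIT and both helpers are made up and nothing like this is
merged; it only illustrates how the stock wait_on_bit()/wake_up_bit()
machinery would avoid growing fs_struct.]

	#define FS_IN_EXEC_BIT	0	/* bit in a hypothetical fs->flags word */

	/* copy_fs() would wait here before bumping ->users under fs->lock */
	static int fs_wait_for_exec(struct fs_struct *fs)
	{
		/* sleeps until the bit clears; killable as proposed */
		return wait_on_bit(&fs->flags, FS_IN_EXEC_BIT, TASK_KILLABLE);
	}

	/* the exec paths would call this instead of storing ->in_exec = 0 */
	static void fs_end_exec(struct fs_struct *fs)
	{
		clear_bit(FS_IN_EXEC_BIT, &fs->flags);
		smp_mb__after_atomic();	/* order the clear vs the waiter's check */
		wake_up_bit(&fs->flags, FS_IN_EXEC_BIT);
	}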

Should be easy enough and viable for backports; I'll try to hack it up
tomorrow unless the idea is NAKed. The crapper mentioned above will be
used to validate that exec vs clone works as expected.

--
Mateusz Guzik <mjguzik gmail.com>
Oleg Nesterov March 25, 2025, 10:09 a.m. UTC | #5
On 03/24, Mateusz Guzik wrote:
>
> On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > So to me it would be better to have the trivial fix for stable,
> > exactly because it is trivially backportable. Then cleanup/simplify
> > this logic on top of it.
>
> So I got myself a crap testcase with a CLONE_FS'ed task which can
> execve and sanity-checked that suid is indeed not honored as expected.

So you mean my patch can't fix the problem?

> Anyhow, the plan would be to serialize on the bit, synchronized with
> the current spin lock. copy_fs would call a helper to wait for it to
> clear, and would still bump ->users under the spin lock.
>
> This would decouple the handling from cred_mutex and avoid weirdness
> like clearing the ->in_exec flag when we never set it.

I don't really understand the idea, but as I said I won't argue with
another solution.

Oleg.
Mateusz Guzik March 25, 2025, 11:01 a.m. UTC | #6
On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> > [...]
> > So I got myself a crap testcase with a CLONE_FS'ed task which can
> > execve and sanity-checked that suid is indeed not honored as expected.
>
> So you mean my patch can't fix the problem?

No, I think the patch works.

I am saying the current scheme is avoidably hard to reason about.

>
> > Anyhow, the plan would be to serialize on the bit, synchronized with
> > the current spin lock. copy_fs would call a helper to wait for it to
> > clear, and would still bump ->users under the spin lock.
> >
> > This would decouple the handling from cred_mutex and avoid weirdness
> > like clearing the ->in_exec flag when we never set it.
>
> I don't really understand the idea, but as I said I won't argue with
> another solution.
>

I'll try to ship later today so that there will be something concrete
to comment on.
Oleg Nesterov March 25, 2025, 1:21 p.m. UTC | #7
On 03/25, Mateusz Guzik wrote:
>
> > [...]
> > So you mean my patch can't fix the problem?
>
> No, I think the patch works.
>
> I am saying the current scheme is avoidably hard to reason about.

Ah, OK, thanks. Then I still think it makes more sense to do the
cleanups you propose on top of this fix.

But I leave this to you and other fs/ maintainers.

Oleg.
Christian Brauner March 25, 2025, 1:30 p.m. UTC | #8
On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote:
> > [...]
> > No, I think the patch works.
> >
> > I am saying the current scheme is avoidably hard to reason about.
> 
> Ah, OK, thanks. Then I still think it makes more sense to do the
> cleanups you propose on top of this fix.

I agree. We should go with Oleg's fix that works within the old scheme.
And then @Mateusz your cleanup should please go on top!
Mateusz Guzik March 25, 2025, 2:15 p.m. UTC | #9
On Tue, Mar 25, 2025 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote:
>
> [...]
> I agree. We should go with Oleg's fix that works within the old scheme.
> And then @Mateusz your cleanup should please go on top!

Ok, in that case I'm gonna ship when I'm gonna ship(tm), maybe later this week.
Christian Brauner March 25, 2025, 2:46 p.m. UTC | #10
On Tue, Mar 25, 2025 at 03:15:06PM +0100, Mateusz Guzik wrote:
> [...]
> Ok, in that case I'm gonna ship when I'm gonna ship(tm), maybe later this week.

Ok, I've taken the patch as I've got a first round of fixes to send
already.
Kees Cook March 25, 2025, 6:40 p.m. UTC | #11
On March 25, 2025 7:46:15 AM PDT, Christian Brauner <brauner@kernel.org> wrote:
>[...]
>Ok, I've taken the patch as I've got a first round of fixes to send
>already.

Thanks! 

Acked-by: Kees Cook <kees@kernel.org>
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 506cd411f4ac..17047210be46 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1229,13 +1229,12 @@  int begin_new_exec(struct linux_binprm * bprm)
 	 */
 	bprm->point_of_no_return = true;
 
-	/*
-	 * Make this the only thread in the thread group.
-	 */
+	/* Make this the only thread in the thread group */
 	retval = de_thread(me);
 	if (retval)
 		goto out;
-
+	/* see the comment in check_unsafe_exec() */
+	current->fs->in_exec = 0;
 	/*
 	 * Cancel any io_uring activity across execve
 	 */
@@ -1497,6 +1496,8 @@  static void free_bprm(struct linux_binprm *bprm)
 	}
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		/* in case exec fails before de_thread() succeeds */
+		current->fs->in_exec = 0;
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1618,6 +1619,10 @@  static void check_unsafe_exec(struct linux_binprm *bprm)
 	 * suid exec because the differently privileged task
 	 * will be able to manipulate the current directory, etc.
 	 * It would be nice to force an unshare instead...
+	 *
+	 * Otherwise we set fs->in_exec = 1 to deny clone(CLONE_FS)
+	 * from another sub-thread until de_thread() succeeds, this
+	 * state is protected by cred_guard_mutex we hold.
 	 */
 	n_fs = 1;
 	spin_lock(&p->fs->lock);
@@ -1862,7 +1867,6 @@  static int bprm_execve(struct linux_binprm *bprm)
 
 	sched_mm_cid_after_execve(current);
 	/* execve succeeded */
-	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	rseq_execve(current);
 	user_events_execve(current);
@@ -1881,7 +1885,6 @@  static int bprm_execve(struct linux_binprm *bprm)
 		force_fatal_sig(SIGSEGV);
 
 	sched_mm_cid_after_execve(current);
-	current->fs->in_exec = 0;
 	current->in_execve = 0;
 
 	return retval;
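
[Editor's summary: the fs->in_exec lifecycle that results from the hunks
above, written out as a timeline. Function names are from fs/exec.c; the
point of the patch is that both stores now happen with cred_guard_mutex
held.]

	/*
	 * prepare_bprm_creds()  mutex_lock(&current->signal->cred_guard_mutex)
	 * check_unsafe_exec()   fs->in_exec = 1 under fs->lock, unless an
	 *                       outside ->fs user forces LSM_UNSAFE_SHARE
	 *
	 * success path:
	 * begin_new_exec()      fs->in_exec = 0 right after de_thread(),
	 *                       with cred_guard_mutex still held
	 *
	 * failure path:
	 * free_bprm()           fs->in_exec = 0 just before
	 *                       mutex_unlock(&cred_guard_mutex)
	 */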