Message ID | 20250324160003.GA8878@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | exec: fix the racy usage of fs_struct->in_exec | expand |
On Mon, Mar 24, 2025 at 5:00 PM Oleg Nesterov <oleg@redhat.com> wrote: > > check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() > paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it > fails we have the following race: > > T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex > > T2 sets fs->in_exec = 1 > > T1 clears fs->in_exec > > T2 continues with fs->in_exec == 0 > > Change fs/exec.c to clear fs->in_exec with cred_guard_mutex held. > I had cursory glances at this code earlier and the more I try to understand it the more confused I am. The mutex at hand hides in ->signal and fs->in_exec remains treated as a flag. The loop in check_unsafe_exec() tries to guard against a task which shares ->fs, but does not share ->mm? To my reading this implies unshared ->signal, so the mutex protection does not apply. I think this ends up being harmless as in this case nobody is going to set ->in_exec (instead they are going to share LSM_UNSAFE_SHARE), so clearing it in these spots becomes a nop. At the same time the check in copy_fs() no longer blocks clones as check_unsafe_exec() already opts to LSM_UNSAFE_SHARE? Even if this all works with the patch, this is an incredibly odd set of dependencies and I don't see a good reason for it to still be here. Per my other e-mail the obvious scheme would serialize all execs sharing ->fs and make copy_fs do a killable wait for execs to finish. Arguably this would also improve userspace-visible behavior as a transient -EBUSY would be eliminated. No matter what the specific solution, imo treating ->in_exec as a flag needs to die. is there a problem getting this done even for stable kernels? I understand it would be harder to backport churn-wise, but should be much easier to reason about? Just my $0,03
On 03/24, Mateusz Guzik wrote: > > I had cursory glances at this code earlier and the more I try to > understand it the more confused I am. You are not alone ;) > Per my other e-mail the obvious scheme would serialize all execs > sharing ->fs and make copy_fs do a killable wait for execs to finish. > Arguably this would also improve userspace-visible behavior as a > transient -EBUSY would be eliminated. I had the same feeling years ago. Why didn't I do it? I can't recall. Perhaps because I found some problem in this idea, but most probably because I failed to make the correct and simple patch. > is there a problem getting this done even for stable kernels? I > understand it would be harder to backport churn-wise, but should be > much easier to reason about? I won't argue with another solution. But this problem is quite old, unless I am totally confused this logic was wrong from the very beginning when fs->in_exec was introduced by 498052bba55ec. So to me it would be better to have the trivial fix for stable, exactly because it is trivially backportable. Then cleanup/simplify this logic on top of it. Oleg.
On 03/24, Oleg Nesterov wrote: > > I won't argue with another solution. But this problem is quite old, Yes, but... > unless I am totally confused this logic was wrong from the very > beginning when fs->in_exec was introduced by 498052bba55ec. OK, I was wrong. According to git show 498052bba55ec:fs/exec.c this patch was correct in this respect. Sorry for the noise. > So to me it would be better to have the trivial fix for stable, > exactly because it is trivially backportable. Then cleanup/simplify > this logic on top of it. Yes... Oleg.
On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > I won't argue with another solution. But this problem is quite old, > unless I am totally confused this logic was wrong from the very > beginning when fs->in_exec was introduced by 498052bba55ec. > > So to me it would be better to have the trivial fix for stable, > exactly because it is trivially backportable. Then cleanup/simplify > this logic on top of it. > So I got myself a crap testcase with a CLONE_FS'ed task which can execve and sanity-checked that suid is indeed not honored as expected. The most convenient way out of planting a mutex in there does not work because of the size -- fs_struct is already at 56 bytes and I'm not going to push it past 64. However, looks like "wait on bit" machinery can be employed here, based on what I had seen with inodes (see __wait_on_freeing_inode) and that should avoid growing the struct, unless I'm grossly misreading something. Anyhow, the plan would be to serialize on the bit, synchronized with the current spin lock. copy_fs would call a helper to wait for it to clear, would still bump ->users under the spin lock. This would decouple the handling from cred_mutex and avoid weirdness like clearing the ->in_exec flag when we never set it. Should be easy enough and viable for backports, I'll try to hack it up tomorrow unless the idea is NAKed. The crapper mentioned above will be used to validate exec vs clone work as expected. -- Mateusz Guzik <mjguzik gmail.com>
On 03/24, Mateusz Guzik wrote: > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > So to me it would be better to have the trivial fix for stable, > > exactly because it is trivially backportable. Then cleanup/simplify > > this logic on top of it. > > So I got myself a crap testcase with a CLONE_FS'ed task which can > execve and sanity-checked that suid is indeed not honored as expected. So you mean my patch can't fix the problem? > Anyhow, the plan would be to serialize on the bit, synchronized with > the current spin lock. copy_fs would call a helper to wait for it to > clear, would still bump ->users under the spin lock. > > This would decouple the handling from cred_mutex and avoid weirdness > like clearing the ->in_exec flag when we never set it. I don't really understand the idea, but as I said I won't argue with another solution. Oleg.
On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > On 03/24, Mateusz Guzik wrote: > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > So to me it would be better to have the trivial fix for stable, > > > exactly because it is trivially backportable. Then cleanup/simplify > > > this logic on top of it. > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > execve and sanity-checked that suid is indeed not honored as expected. > > So you mean my patch can't fix the problem? No, I think the patch works. I am saying the current scheme is avoidably hard to reason about. > > > Anyhow, the plan would be to serialize on the bit, synchronized with > > the current spin lock. copy_fs would call a helper to wait for it to > > clear, would still bump ->users under the spin lock. > > > > This would decouple the handling from cred_mutex and avoid weirdness > > like clearing the ->in_exec flag when we never set it. > > I don't really understand the idea, but as I said I won't argue with > another solution. > I'll try to ship later today so that there will be something concrete to comment on.
On 03/25, Mateusz Guzik wrote: > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > On 03/24, Mateusz Guzik wrote: > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > this logic on top of it. > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > So you mean my patch can't fix the problem? > > No, I think the patch works. > > I am saying the current scheme is avoidably hard to reason about. Ah, OK, thanks. Then I still think it makes more sense to do the cleanups you propose on top of this fix. But I leave this to you and other fs/ maintainers. Oleg.
On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: > On 03/25, Mateusz Guzik wrote: > > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > On 03/24, Mateusz Guzik wrote: > > > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > > this logic on top of it. > > > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > > > So you mean my patch can't fix the problem? > > > > No, I think the patch works. > > > > I am saying the current scheme is avoidably hard to reason about. > > Ah, OK, thanks. Then I still think it makes more sense to do the > cleanups you propose on top of this fix. I agree. We should go with Oleg's fix that in the old scheme and use that. And then @Mateusz your cleanup should please go on top!
On Tue, Mar 25, 2025 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote: > > On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: > > On 03/25, Mateusz Guzik wrote: > > > > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > On 03/24, Mateusz Guzik wrote: > > > > > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > > > this logic on top of it. > > > > > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > > > > > So you mean my patch can't fix the problem? > > > > > > No, I think the patch works. > > > > > > I am saying the current scheme is avoidably hard to reason about. > > > > Ah, OK, thanks. Then I still think it makes more sense to do the > > cleanups you propose on top of this fix. > > I agree. We should go with Oleg's fix that in the old scheme and use > that. And then @Mateusz your cleanup should please go on top! Ok, in that case I'm gonna ship when I'm gonna ship(tm), maybe later this week.
On Tue, Mar 25, 2025 at 03:15:06PM +0100, Mateusz Guzik wrote: > On Tue, Mar 25, 2025 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote: > > > > On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: > > > On 03/25, Mateusz Guzik wrote: > > > > > > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > On 03/24, Mateusz Guzik wrote: > > > > > > > > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > > > > > So to me it would be better to have the trivial fix for stable, > > > > > > > exactly because it is trivially backportable. Then cleanup/simplify > > > > > > > this logic on top of it. > > > > > > > > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can > > > > > > execve and sanity-checked that suid is indeed not honored as expected. > > > > > > > > > > So you mean my patch can't fix the problem? > > > > > > > > No, I think the patch works. > > > > > > > > I am saying the current scheme is avoidably hard to reason about. > > > > > > Ah, OK, thanks. Then I still think it makes more sense to do the > > > cleanups you propose on top of this fix. > > > > I agree. We should go with Oleg's fix that in the old scheme and use > > that. And then @Mateusz your cleanup should please go on top! > > Ok, in that case I'm gonna ship when I'm gonna ship(tm), maybe later this week. Ok, I've taken the patch as I've got a first round of fixes to send already.
On March 25, 2025 7:46:15 AM PDT, Christian Brauner <brauner@kernel.org> wrote: >On Tue, Mar 25, 2025 at 03:15:06PM +0100, Mateusz Guzik wrote: >> On Tue, Mar 25, 2025 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote: >> > >> > On Tue, Mar 25, 2025 at 02:21:36PM +0100, Oleg Nesterov wrote: >> > > On 03/25, Mateusz Guzik wrote: >> > > > >> > > > On Tue, Mar 25, 2025 at 11:10 AM Oleg Nesterov <oleg@redhat.com> wrote: >> > > > > >> > > > > On 03/24, Mateusz Guzik wrote: >> > > > > > >> > > > > > On Mon, Mar 24, 2025 at 7:28 PM Oleg Nesterov <oleg@redhat.com> wrote: >> > > > > > > >> > > > > > > So to me it would be better to have the trivial fix for stable, >> > > > > > > exactly because it is trivially backportable. Then cleanup/simplify >> > > > > > > this logic on top of it. >> > > > > > >> > > > > > So I got myself a crap testcase with a CLONE_FS'ed task which can >> > > > > > execve and sanity-checked that suid is indeed not honored as expected. >> > > > > >> > > > > So you mean my patch can't fix the problem? >> > > > >> > > > No, I think the patch works. >> > > > >> > > > I am saying the current scheme is avoidably hard to reason about. >> > > >> > > Ah, OK, thanks. Then I still think it makes more sense to do the >> > > cleanups you propose on top of this fix. >> > >> > I agree. We should go with Oleg's fix that in the old scheme and use >> > that. And then @Mateusz your cleanup should please go on top! >> >> Ok, in that case I'm gonna ship when I'm gonna ship(tm), maybe later this week. > >Ok, I've taken the patch as I've got a first round of fixes to send >already. Thanks! Acked-by: Kees Cook <kees@kernel.org>
diff --git a/fs/exec.c b/fs/exec.c index 506cd411f4ac..17047210be46 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1229,13 +1229,12 @@ int begin_new_exec(struct linux_binprm * bprm) */ bprm->point_of_no_return = true; - /* - * Make this the only thread in the thread group. - */ + /* Make this the only thread in the thread group */ retval = de_thread(me); if (retval) goto out; - + /* see the comment in check_unsafe_exec() */ + current->fs->in_exec = 0; /* * Cancel any io_uring activity across execve */ @@ -1497,6 +1496,8 @@ static void free_bprm(struct linux_binprm *bprm) } free_arg_pages(bprm); if (bprm->cred) { + /* in case exec fails before de_thread() succeeds */ + current->fs->in_exec = 0; mutex_unlock(¤t->signal->cred_guard_mutex); abort_creds(bprm->cred); } @@ -1618,6 +1619,10 @@ static void check_unsafe_exec(struct linux_binprm *bprm) * suid exec because the differently privileged task * will be able to manipulate the current directory, etc. * It would be nice to force an unshare instead... + * + * Otherwise we set fs->in_exec = 1 to deny clone(CLONE_FS) + * from another sub-thread until de_thread() succeeds, this + * state is protected by cred_guard_mutex we hold. */ n_fs = 1; spin_lock(&p->fs->lock); @@ -1862,7 +1867,6 @@ static int bprm_execve(struct linux_binprm *bprm) sched_mm_cid_after_execve(current); /* execve succeeded */ - current->fs->in_exec = 0; current->in_execve = 0; rseq_execve(current); user_events_execve(current); @@ -1881,7 +1885,6 @@ static int bprm_execve(struct linux_binprm *bprm) force_fatal_sig(SIGSEGV); sched_mm_cid_after_execve(current); - current->fs->in_exec = 0; current->in_execve = 0; return retval;
check_unsafe_exec() sets fs->in_exec under cred_guard_mutex, then execve() paths clear fs->in_exec lockless. This is fine if exec succeeds, but if it fails we have the following race: T1 sets fs->in_exec = 1, fails, drops cred_guard_mutex T2 sets fs->in_exec = 1 T1 clears fs->in_exec T2 continues with fs->in_exec == 0 Change fs/exec.c to clear fs->in_exec with cred_guard_mutex held. Reported-by: syzbot+1c486d0b62032c82a968@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001f.GAE@google.com/ Cc: stable@vger.kernel.org Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- fs/exec.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-)