Message ID | 20221006082735.1321612-2-keescook@chromium.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] fs/exec: Explicitly unshare fs_struct on exec | expand |
On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > The check_unsafe_exec() counting of n_fs would not add up under a heavily > threaded process trying to perform a suid exec, causing the suid portion > to fail. This counting error appears to be unneeded, but to catch any > possible conditions, explicitly unshare fs_struct on exec, if it ends up Isn't this a potential uapi break? Afaict, before this change a call to clone{3}(CLONE_FS) followed by an exec in the child would have the parent and child share fs information. So if the child e.g., changes the working directory post exec it would also affect the parent. But after this change here this would no longer be true. So a child changing a workding directoro would not affect the parent anymore. IOW, an exec is accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but it seems like a non-trivial uapi change but there might be few users that do clone{3}(CLONE_FS) followed by an exec. > +/* > + * Unshare the filesystem structure if it is being shared > + */ > +int unshare_fs(void) > +{ > + struct fs_struct *new_fs = NULL; > + int error; > + > + error = unshare_fs_alloc(CLONE_FS, &new_fs); > + if (error || !new_fs) > + return error; You should just check for error here, not new_fs.
From: Christian Brauner > Sent: 06 October 2022 10:05 > > On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > > The check_unsafe_exec() counting of n_fs would not add up under a heavily > > threaded process trying to perform a suid exec, causing the suid portion > > to fail. This counting error appears to be unneeded, but to catch any > > possible conditions, explicitly unshare fs_struct on exec, if it ends up > > Isn't this a potential uapi break? Afaict, before this change a call to > clone{3}(CLONE_FS) followed by an exec in the child would have the > parent and child share fs information. So if the child e.g., changes the > working directory post exec it would also affect the parent. But after > this change here this would no longer be true. So a child changing a > workding directoro would not affect the parent anymore. IOW, an exec is > accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > it seems like a non-trivial uapi change but there might be few users > that do clone{3}(CLONE_FS) followed by an exec. The thought of that is entirely horrid... I presume a suid exec will fail in that case? If the old code is trying to compare the number of threads with the number of users of the fs table isn't is just buggy? If a thread unshares the fs table there can be another reference somewhere else - which is what (I presume) is being tested for. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: > On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > > The check_unsafe_exec() counting of n_fs would not add up under a heavily > > threaded process trying to perform a suid exec, causing the suid portion > > to fail. This counting error appears to be unneeded, but to catch any > > possible conditions, explicitly unshare fs_struct on exec, if it ends up > > Isn't this a potential uapi break? Afaict, before this change a call to > clone{3}(CLONE_FS) followed by an exec in the child would have the > parent and child share fs information. So if the child e.g., changes the > working directory post exec it would also affect the parent. But after > this change here this would no longer be true. So a child changing a > workding directoro would not affect the parent anymore. IOW, an exec is > accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > it seems like a non-trivial uapi change but there might be few users > that do clone{3}(CLONE_FS) followed by an exec. I believe the following code in Chromium explicitly relies on this behavior, but I'm not sure whether this code is in active use anymore: https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote: >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily >> > threaded process trying to perform a suid exec, causing the suid portion >> > to fail. This counting error appears to be unneeded, but to catch any >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up >> >> Isn't this a potential uapi break? Afaict, before this change a call to >> clone{3}(CLONE_FS) followed by an exec in the child would have the >> parent and child share fs information. So if the child e.g., changes the >> working directory post exec it would also affect the parent. But after >> this change here this would no longer be true. So a child changing a >> workding directoro would not affect the parent anymore. IOW, an exec is >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but >> it seems like a non-trivial uapi change but there might be few users >> that do clone{3}(CLONE_FS) followed by an exec. > >I believe the following code in Chromium explicitly relies on this >behavior, but I'm not sure whether this code is in active use anymore: > >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed... It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition...
On Thu, Oct 6, 2022 at 5:25 PM Kees Cook <keescook@chromium.org> wrote: > On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote: > >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily > >> > threaded process trying to perform a suid exec, causing the suid portion > >> > to fail. This counting error appears to be unneeded, but to catch any > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up > >> > >> Isn't this a potential uapi break? Afaict, before this change a call to > >> clone{3}(CLONE_FS) followed by an exec in the child would have the > >> parent and child share fs information. So if the child e.g., changes the > >> working directory post exec it would also affect the parent. But after > >> this change here this would no longer be true. So a child changing a > >> workding directoro would not affect the parent anymore. IOW, an exec is > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > >> it seems like a non-trivial uapi change but there might be few users > >> that do clone{3}(CLONE_FS) followed by an exec. > > > >I believe the following code in Chromium explicitly relies on this > >behavior, but I'm not sure whether this code is in active use anymore: > > > >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium > > Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed... > > It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition... Random idea that I haven't thought about a lot: One approach might be to not do it by counting, but instead have a flag on the fs_struct that we set when someone does a clone() with CLONE_FS but without CLONE_THREAD? Then we'd end up with the following possible states for fs_struct: - single-process, normal - single-process, pending execve past check_unsafe_exec() (prevent concurrent CLONE_FS) - shared between processes The slight difference from the old semantics would be that once you've used CLONE_FS without CLONE_THREAD, you can never do setuid execve() from your current process again (without calling unshare()), even if the child disappears in the meantime. I think that might be an acceptably small UAPI break.
On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote: > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily >> > threaded process trying to perform a suid exec, causing the suid portion >> > to fail. This counting error appears to be unneeded, but to catch any >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up >> >> Isn't this a potential uapi break? Afaict, before this change a call to >> clone{3}(CLONE_FS) followed by an exec in the child would have the >> parent and child share fs information. So if the child e.g., changes the >> working directory post exec it would also affect the parent. But after >> this change here this would no longer be true. So a child changing a >> workding directoro would not affect the parent anymore. IOW, an exec is >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but >> it seems like a non-trivial uapi change but there might be few users >> that do clone{3}(CLONE_FS) followed by an exec. > > I believe the following code in Chromium explicitly relies on this > behavior, but I'm not sure whether this code is in active use anymore: > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium Wait, this is absolutely nucking futs. On a very quick inspection, the sharable things like this are fs, files, sighand, and io. files and sighand get unshared, which makes sense. fs supposedly checks for extra refs and prevents gaining privilege. io is... ignored! At least it's not immediately obvious that io is a problem. But seriously, this makes no sense at all. It should not be possible to exec a program and then, without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? --Andy
On Thu, Oct 13, 2022 at 08:18:04PM -0700, Andy Lutomirski wrote:
> But seriously, this makes no sense at all. It should not be possible to exec a program and then, without ptrace, change its cwd out from under it. Do we really need to preserve this behavior?
Yup, already abandoned:
https://lore.kernel.org/lkml/202210061301.207A20C8E5@keescook/
On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote: > On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote: > > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily > >> > threaded process trying to perform a suid exec, causing the suid portion > >> > to fail. This counting error appears to be unneeded, but to catch any > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up > >> > >> Isn't this a potential uapi break? Afaict, before this change a call to > >> clone{3}(CLONE_FS) followed by an exec in the child would have the > >> parent and child share fs information. So if the child e.g., changes the > >> working directory post exec it would also affect the parent. But after > >> this change here this would no longer be true. So a child changing a > >> workding directoro would not affect the parent anymore. IOW, an exec is > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > >> it seems like a non-trivial uapi change but there might be few users > >> that do clone{3}(CLONE_FS) followed by an exec. > > > > I believe the following code in Chromium explicitly relies on this > > behavior, but I'm not sure whether this code is in active use anymore: > > > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium > > Wait, this is absolutely nucking futs. On a very quick inspection, the sharable things like this are fs, files, sighand, and io. files and sighand get unshared, which makes sense. fs supposedly checks for extra refs and prevents gaining privilege. io is... ignored! At least it's not immediately obvious that io is a problem. > > But seriously, this makes no sense at all. It should not be possible to exec a program and then, without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? I agree that this is pretty wild. The single user I'm aware of is Chrome, and as far as I know, they use it for establishing their sandbox on systems where unprivileged user namespaces are disabled - see <https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>. They also have seccomp-based sandboxing, but IIRC there are some small holes that mean it's still useful for them to be able to set up namespaces, like how sendmsg() on a unix domain socket can specify a file path as the destination address. (By the way, I think maybe Chrome wouldn't need this wacky trick with the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had ever landed that you (https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/) and Mickaël Salaün proposed in the past... or alternatively, if there was a way to properly filter all the syscalls that Chrome has to permit for renderers.) (But also, to be clear, I don't speak for Chrome, this is just my understanding of how their stuff works.)
From: Andy Lutomirski > Sent: 14 October 2022 04:18 ... > But seriously, this makes no sense at all. It should not be possible to exec a program and then, > without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? it maybe ok if the exec'ed program also 'bought-in' to the fact that its cwd and open files might get changed. But imagine someone doing it to a login shell! David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Oct 14, 2022 at 05:35:26PM +0200, Jann Horn wrote: > On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote: > > On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote: > > > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: > > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily > > >> > threaded process trying to perform a suid exec, causing the suid portion > > >> > to fail. This counting error appears to be unneeded, but to catch any > > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up > > >> > > >> Isn't this a potential uapi break? Afaict, before this change a call to > > >> clone{3}(CLONE_FS) followed by an exec in the child would have the > > >> parent and child share fs information. So if the child e.g., changes the > > >> working directory post exec it would also affect the parent. But after > > >> this change here this would no longer be true. So a child changing a > > >> workding directoro would not affect the parent anymore. IOW, an exec is > > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > > >> it seems like a non-trivial uapi change but there might be few users > > >> that do clone{3}(CLONE_FS) followed by an exec. > > > > > > I believe the following code in Chromium explicitly relies on this > > > behavior, but I'm not sure whether this code is in active use anymore: > > > > > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium > > > > Wait, this is absolutely nucking futs. On a very quick inspection, the sharable things like this are fs, files, sighand, and io. files and sighand get unshared, which makes sense. fs supposedly checks for extra refs and prevents gaining privilege. io is... ignored! At least it's not immediately obvious that io is a problem. > > > > But seriously, this makes no sense at all. It should not be possible to exec a program and then, without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? > > I agree that this is pretty wild. > > The single user I'm aware of is Chrome, and as far as I know, they use > it for establishing their sandbox on systems where unprivileged user > namespaces are disabled - see > <https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>. > They also have seccomp-based sandboxing, but IIRC there are some small > holes that mean it's still useful for them to be able to set up > namespaces, like how sendmsg() on a unix domain socket can specify a > file path as the destination address. > > (By the way, I think maybe Chrome wouldn't need this wacky trick with > the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had > ever landed that you > (https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/) > and Mickaël Salaün proposed in the past... or alternatively, if there > was a way to properly filter all the syscalls that Chrome has to > permit for renderers.) > > (But also, to be clear, I don't speak for Chrome, this is just my > understanding of how their stuff works.) Chrome seems to just want a totally empty filesystem view, yes? Let's land the nnp+chroot change. :P Only 10 years late! Then we can have Chrome use this and we can unshare fs on exec...
On Tue, Oct 18, 2022 at 9:09 AM Kees Cook <keescook@chromium.org> wrote: > On Fri, Oct 14, 2022 at 05:35:26PM +0200, Jann Horn wrote: > > On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote: > > > On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote: > > > > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote: > > > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote: > > > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily > > > >> > threaded process trying to perform a suid exec, causing the suid portion > > > >> > to fail. This counting error appears to be unneeded, but to catch any > > > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up > > > >> > > > >> Isn't this a potential uapi break? Afaict, before this change a call to > > > >> clone{3}(CLONE_FS) followed by an exec in the child would have the > > > >> parent and child share fs information. So if the child e.g., changes the > > > >> working directory post exec it would also affect the parent. But after > > > >> this change here this would no longer be true. So a child changing a > > > >> workding directoro would not affect the parent anymore. IOW, an exec is > > > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but > > > >> it seems like a non-trivial uapi change but there might be few users > > > >> that do clone{3}(CLONE_FS) followed by an exec. > > > > > > > > I believe the following code in Chromium explicitly relies on this > > > > behavior, but I'm not sure whether this code is in active use anymore: > > > > > > > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium > > > > > > Wait, this is absolutely nucking futs. On a very quick inspection, the sharable things like this are fs, files, sighand, and io. files and sighand get unshared, which makes sense. fs supposedly checks for extra refs and prevents gaining privilege. io is... ignored! At least it's not immediately obvious that io is a problem. > > > > > > But seriously, this makes no sense at all. It should not be possible to exec a program and then, without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? > > > > I agree that this is pretty wild. > > > > The single user I'm aware of is Chrome, and as far as I know, they use > > it for establishing their sandbox on systems where unprivileged user > > namespaces are disabled - see > > <https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>. > > They also have seccomp-based sandboxing, but IIRC there are some small > > holes that mean it's still useful for them to be able to set up > > namespaces, like how sendmsg() on a unix domain socket can specify a > > file path as the destination address. > > > > (By the way, I think maybe Chrome wouldn't need this wacky trick with > > the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had > > ever landed that you > > (https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/) > > and Mickaël Salaün proposed in the past... or alternatively, if there > > was a way to properly filter all the syscalls that Chrome has to > > permit for renderers.) > > > > (But also, to be clear, I don't speak for Chrome, this is just my > > understanding of how their stuff works.) > > Chrome seems to just want a totally empty filesystem view, yes? > Let's land the nnp+chroot change. :P Only 10 years late! Then we can > have Chrome use this and we can unshare fs on exec... Someone should check with Chrome first though to make sure what I said accurately represents what they think...
David Laight <David.Laight@ACULAB.COM> writes: > From: Andy Lutomirski >> Sent: 14 October 2022 04:18 > ... >> But seriously, this makes no sense at all. It should not be possible to exec a program and then, >> without ptrace, change its cwd out from under it. Do we really need to preserve this behavior? > > it maybe ok if the exec'ed program also 'bought-in' to the > fact that its cwd and open files might get changed. > But imagine someone doing it to a login shell! I am slowly catching up on my email and I saw this conversation. When I initially saw this thread I was confused and thought this might run into an issue with fs/locks.c. I was close but wrong. fs/locks.c uses current->files as a sort of process identifier and so is very sensitive to when it is unshared. Making unsharing current->files unconditionally a bug. Not relevant to this conversation. There are several clone options that were only relevant for the old LinuxThreads implementation including CLONE_FS and CLONE_SIGHAND. The LinuxThreads implementation has not been needed since the introduction of CLONE_THREAD in linux-2.6.0 in 17 Dec 2003. Almost 20 years ago. I suggest we introduce CONFIG_CLONE_FS and CONFIG_SIGHAND to allow disabling support of these clone options. No known user space will care. The are both getting in the way of kernel maintenance so there is a reason to start pushing them out. Further simply not worrying about UNSHARE_FS during exec fixes the race so it essentially a bug fix by code removal. I believe something like the patch below should get the job done. diff --git a/fs/exec.c b/fs/exec.c index a0b1f0337a62..7ff13c77ad04 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1186,7 +1186,8 @@ static int unshare_sighand(struct task_struct *me) { struct sighand_struct *oldsighand = me->sighand; - if (refcount_read(&oldsighand->count) != 1) { + if (IS_ENABLED(CONFIG_CLONE_SIGHAND) && + refcount_read(&oldsighand->count) != 1) { struct sighand_struct *newsighand; /* * This ->sighand is shared with the CLONE_SIGHAND @@ -1568,6 +1569,9 @@ static void check_unsafe_exec(struct linux_binprm *bprm) if (task_no_new_privs(current)) bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS; + if (!IS_ENABLED(CONFIG_CLONE_FS)) + return; + t = p; n_fs = 1; spin_lock(&p->fs->lock); diff --git a/init/Kconfig b/init/Kconfig index 94125d3b6893..8660a6bcc1cf 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1764,6 +1764,23 @@ config KALLSYMS_BASE_RELATIVE time constants, and no relocation pass is required at runtime to fix up the entries based on the runtime load address of the kernel. +config CLONE_FS + bool + default y + help + Support CLONE_FS being passed to clone. The only known user + is the old LinuxThreads package so it should be safe to disable + this option. + +config CLONE_SIGHAND + bool + default y + help + Support CLONE_SIGHAND being passed to clone. The only known user + is the old LinuxThreads package so it should be safe to disable + this option. + + # end of the "standard kernel features (expert users)" menu # syscall, maps, verifier diff --git a/kernel/fork.c b/kernel/fork.c index 08969f5aa38d..da9017b51da4 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2023,6 +2023,16 @@ static __latent_entropy struct task_struct *copy_process( if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM)) return ERR_PTR(-EINVAL); + /* Don't allow CLONE_FS if not enabled */ + if (!IS_ENABLED(CONFIG_CLONE_FS) && + ((clone_flags & (CLONE_THREAD | CLONE_FS)) == CLONE_FS)) + return ERR_PTR(-EINVAL); + + /* Don't allow CLONE_SIGHAND if not enabled */ + if (!IS_ENABLED(CONFIG_CLONE_SIGHAND) && + ((clone_flags & (CLONE_THREAD | CLONE_SIGHAND)) == CLONE_SIGHAND)) + return ERR_PTR(-EINVAL); + /* * Siblings of global init remain as zombies on exit since they are * not reaped by their parent (swapper). To solve this and to avoid @@ -3101,6 +3111,9 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp) if (fs->users == 1) return 0; + if (!IS_ENABLED(CONFIG_CLONE_FS)) + return -EINVAL; + *new_fsp = copy_fs_struct(fs); if (!*new_fsp) return -ENOMEM; Eric
diff --git a/fs/exec.c b/fs/exec.c index 902bce45b116..7d5f63f03c58 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1272,6 +1272,11 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) goto out; + /* Ensure the fs_struct is not shared. */ + retval = unshare_fs(); + if (retval) + goto out; + /* * Must be called _before_ exec_mmap() as bprm->mm is * not visible until then. This also enables the update @@ -1583,8 +1588,6 @@ static void check_unsafe_exec(struct linux_binprm *bprm) if (p->fs->users > n_fs) bprm->unsafe |= LSM_UNSAFE_SHARE; - else - p->fs->in_exec = 1; spin_unlock(&p->fs->lock); } @@ -1843,7 +1846,6 @@ static int bprm_execve(struct linux_binprm *bprm, goto out; /* execve succeeded */ - current->fs->in_exec = 0; current->in_execve = 0; rseq_execve(current); acct_update_integrals(current); @@ -1861,7 +1863,6 @@ static int bprm_execve(struct linux_binprm *bprm, force_fatal_sig(SIGSEGV); out_unmark: - current->fs->in_exec = 0; current->in_execve = 0; return retval; diff --git a/fs/fs_struct.c b/fs/fs_struct.c index 04b3f5b9c629..c27c67023d01 100644 --- a/fs/fs_struct.c +++ b/fs/fs_struct.c @@ -115,7 +115,6 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old) /* We don't need to lock fs - think why ;-) */ if (fs) { fs->users = 1; - fs->in_exec = 0; spin_lock_init(&fs->lock); seqcount_spinlock_init(&fs->seq, &fs->lock); fs->umask = old->umask; diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index e066816f3519..fbfb89ee3bda 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -117,6 +117,7 @@ struct task_struct; void put_files_struct(struct files_struct *fs); int unshare_files(void); +int unshare_fs(void); struct files_struct *dup_fd(struct files_struct *, unsigned, int *) __latent_entropy; void do_close_on_exec(struct files_struct *); int iterate_fd(struct files_struct *, unsigned, diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h index 783b48dedb72..08b82a90ceae 100644 --- a/include/linux/fs_struct.h +++ b/include/linux/fs_struct.h @@ -11,7 +11,6 @@ struct fs_struct { spinlock_t lock; seqcount_spinlock_t seq; int umask; - int in_exec; struct path root, pwd; } __randomize_layout; diff --git a/kernel/fork.c b/kernel/fork.c index b4a799d9c50f..53b7248f7a4b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1589,10 +1589,6 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) if (clone_flags & CLONE_FS) { /* tsk->fs is already what we want */ spin_lock(&fs->lock); - if (fs->in_exec) { - spin_unlock(&fs->lock); - return -EAGAIN; - } fs->users++; spin_unlock(&fs->lock); return 0; @@ -3070,10 +3066,9 @@ static int check_unshare_flags(unsigned long unshare_flags) return 0; } -/* - * Unshare the filesystem structure if it is being shared - */ -static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp) +/* Allocate an fs_struct if it is currently shared and CLONE_FS requested. */ +static int unshare_fs_alloc(unsigned long unshare_flags, + struct fs_struct **new_fsp) { struct fs_struct *fs = current->fs; @@ -3091,6 +3086,41 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp) return 0; } +/* Swap out fs_struct, returning old fs if it needs freeing. */ +static void unshare_fs_finalize(struct fs_struct **new_fsp) +{ + struct task_struct *task = current; + struct fs_struct *fs = task->fs; + + spin_lock(&fs->lock); + task->fs = *new_fsp; + if (--fs->users) + *new_fsp = NULL; + else + *new_fsp = fs; + spin_unlock(&fs->lock); +} + +/* + * Unshare the filesystem structure if it is being shared + */ +int unshare_fs(void) +{ + struct fs_struct *new_fs = NULL; + int error; + + error = unshare_fs_alloc(CLONE_FS, &new_fs); + if (error || !new_fs) + return error; + + unshare_fs_finalize(&new_fs); + + if (new_fs) + free_fs_struct(new_fs); + + return 0; +} + /* * Unshare file descriptor table if it is being shared */ @@ -3120,7 +3150,7 @@ int unshare_fd(unsigned long unshare_flags, unsigned int max_fds, */ int ksys_unshare(unsigned long unshare_flags) { - struct fs_struct *fs, *new_fs = NULL; + struct fs_struct *new_fs = NULL; struct files_struct *new_fd = NULL; struct cred *new_cred = NULL; struct nsproxy *new_nsproxy = NULL; @@ -3159,7 +3189,7 @@ int ksys_unshare(unsigned long unshare_flags) */ if (unshare_flags & (CLONE_NEWIPC|CLONE_SYSVSEM)) do_sysvsem = 1; - err = unshare_fs(unshare_flags, &new_fs); + err = unshare_fs_alloc(unshare_flags, &new_fs); if (err) goto bad_unshare_out; err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd); @@ -3197,16 +3227,8 @@ int ksys_unshare(unsigned long unshare_flags) task_lock(current); - if (new_fs) { - fs = current->fs; - spin_lock(&fs->lock); - current->fs = new_fs; - if (--fs->users) - new_fs = NULL; - else - new_fs = fs; - spin_unlock(&fs->lock); - } + if (new_fs) + unshare_fs_finalize(&new_fs); if (new_fd) swap(current->files, new_fd);
The check_unsafe_exec() counting of n_fs would not add up under a heavily threaded process trying to perform a suid exec, causing the suid portion to fail. This counting error appears to be unneeded, but to catch any possible conditions, explicitly unshare fs_struct on exec, if it ends up needing to happen. This means fs->in_exec must be removed (since it would never get cleared in the old copy), and is specifically no longer needed. See also commit 498052bba55e ("New locking/refcounting for fs_struct"). This additionally allows the "in_exec" member to be dropped, showing an 8 bytes savings per fs_struct on 64-bit. Before: struct fs_struct { int users; /* 0 4 */ spinlock_t lock; /* 4 4 */ seqcount_spinlock_t seq; /* 8 4 */ int umask; /* 12 4 */ int in_exec; /* 16 4 */ /* XXX 4 bytes hole, try to pack */ struct path root; /* 24 16 */ struct path pwd; /* 40 16 */ /* size: 56, cachelines: 1, members: 7 */ /* sum members: 52, holes: 1, sum holes: 4 */ /* last cacheline: 56 bytes */ }; After: struct fs_struct { int users; /* 0 4 */ spinlock_t lock; /* 4 4 */ seqcount_spinlock_t seq; /* 8 4 */ int umask; /* 12 4 */ struct path root; /* 16 16 */ struct path pwd; /* 32 16 */ /* size: 48, cachelines: 1, members: 6 */ /* last cacheline: 48 bytes */ }; Reported-by: Jorge Merlino <jorge.merlino@canonical.com> Link: https://lore.kernel.org/lkml/20220910211215.140270-1-jorge.merlino@canonical.com/ Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> --- fs/exec.c | 9 +++--- fs/fs_struct.c | 1 - include/linux/fdtable.h | 1 + include/linux/fs_struct.h | 1 - kernel/fork.c | 62 ++++++++++++++++++++++++++------------- 5 files changed, 48 insertions(+), 26 deletions(-)