diff mbox series

[1/2] fs/exec: Explicitly unshare fs_struct on exec

Message ID 20221006082735.1321612-2-keescook@chromium.org (mailing list archive)
State New, archived
Headers show
Series [1/2] fs/exec: Explicitly unshare fs_struct on exec | expand

Commit Message

Kees Cook Oct. 6, 2022, 8:27 a.m. UTC
The check_unsafe_exec() counting of n_fs would not add up under a heavily
threaded process trying to perform a suid exec, causing the suid portion
to fail. This counting error appears to be unneeded, but to catch any
possible conditions, explicitly unshare fs_struct on exec, if it ends up
needing to happen. This means fs->in_exec must be removed (since it would
never get cleared in the old copy), and is specifically no longer needed.

See also commit 498052bba55e ("New locking/refcounting for fs_struct").

This additionally allows the "in_exec" member to be dropped, showing an
8 bytes savings per fs_struct on 64-bit. Before:

struct fs_struct {
        int                        users;                /*     0     4 */
        spinlock_t                 lock;                 /*     4     4 */
        seqcount_spinlock_t        seq;                  /*     8     4 */
        int                        umask;                /*    12     4 */
        int                        in_exec;              /*    16     4 */

        /* XXX 4 bytes hole, try to pack */

        struct path                root;                 /*    24    16 */
        struct path                pwd;                  /*    40    16 */

        /* size: 56, cachelines: 1, members: 7 */
        /* sum members: 52, holes: 1, sum holes: 4 */
        /* last cacheline: 56 bytes */
};

After:

struct fs_struct {
        int                        users;                /*     0     4 */
        spinlock_t                 lock;                 /*     4     4 */
        seqcount_spinlock_t        seq;                  /*     8     4 */
        int                        umask;                /*    12     4 */
        struct path                root;                 /*    16    16 */
        struct path                pwd;                  /*    32    16 */

        /* size: 48, cachelines: 1, members: 6 */
        /* last cacheline: 48 bytes */
};

Reported-by: Jorge Merlino <jorge.merlino@canonical.com>
Link: https://lore.kernel.org/lkml/20220910211215.140270-1-jorge.merlino@canonical.com/
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/exec.c                 |  9 +++---
 fs/fs_struct.c            |  1 -
 include/linux/fdtable.h   |  1 +
 include/linux/fs_struct.h |  1 -
 kernel/fork.c             | 62 ++++++++++++++++++++++++++-------------
 5 files changed, 48 insertions(+), 26 deletions(-)

Comments

Christian Brauner Oct. 6, 2022, 9:05 a.m. UTC | #1
On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> The check_unsafe_exec() counting of n_fs would not add up under a heavily
> threaded process trying to perform a suid exec, causing the suid portion
> to fail. This counting error appears to be unneeded, but to catch any
> possible conditions, explicitly unshare fs_struct on exec, if it ends up

Isn't this a potential uapi break? Afaict, before this change a call to
clone{3}(CLONE_FS) followed by an exec in the child would have the
parent and child share fs information. So if the child e.g., changes the
working directory post exec it would also affect the parent. But after
this change here this would no longer be true. So a child changing a
workding directoro would not affect the parent anymore. IOW, an exec is
accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
it seems like a non-trivial uapi change but there might be few users
that do clone{3}(CLONE_FS) followed by an exec.

> +/*
> + * Unshare the filesystem structure if it is being shared
> + */
> +int unshare_fs(void)
> +{
> +	struct fs_struct *new_fs = NULL;
> +	int error;
> +
> +	error = unshare_fs_alloc(CLONE_FS, &new_fs);
> +	if (error || !new_fs)
> +		return error;

You should just check for error here, not new_fs.
David Laight Oct. 6, 2022, 10:48 a.m. UTC | #2
From: Christian Brauner
> Sent: 06 October 2022 10:05
> 
> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> > threaded process trying to perform a suid exec, causing the suid portion
> > to fail. This counting error appears to be unneeded, but to catch any
> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
> 
> Isn't this a potential uapi break? Afaict, before this change a call to
> clone{3}(CLONE_FS) followed by an exec in the child would have the
> parent and child share fs information. So if the child e.g., changes the
> working directory post exec it would also affect the parent. But after
> this change here this would no longer be true. So a child changing a
> workding directoro would not affect the parent anymore. IOW, an exec is
> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> it seems like a non-trivial uapi change but there might be few users
> that do clone{3}(CLONE_FS) followed by an exec.

The thought of that is entirely horrid...

I presume a suid exec will fail in that case?

If the old code is trying to compare the number of threads
with the number of users of the fs table isn't is just buggy?
If a thread unshares the fs table there can be another
reference somewhere else - which is what (I presume) is being
tested for.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jann Horn Oct. 6, 2022, 2:13 p.m. UTC | #3
On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> > threaded process trying to perform a suid exec, causing the suid portion
> > to fail. This counting error appears to be unneeded, but to catch any
> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
>
> Isn't this a potential uapi break? Afaict, before this change a call to
> clone{3}(CLONE_FS) followed by an exec in the child would have the
> parent and child share fs information. So if the child e.g., changes the
> working directory post exec it would also affect the parent. But after
> this change here this would no longer be true. So a child changing a
> workding directoro would not affect the parent anymore. IOW, an exec is
> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> it seems like a non-trivial uapi change but there might be few users
> that do clone{3}(CLONE_FS) followed by an exec.

I believe the following code in Chromium explicitly relies on this
behavior, but I'm not sure whether this code is in active use anymore:

https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
Kees Cook Oct. 6, 2022, 3:25 p.m. UTC | #4
On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote:
>On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
>> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
>> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
>> > threaded process trying to perform a suid exec, causing the suid portion
>> > to fail. This counting error appears to be unneeded, but to catch any
>> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
>>
>> Isn't this a potential uapi break? Afaict, before this change a call to
>> clone{3}(CLONE_FS) followed by an exec in the child would have the
>> parent and child share fs information. So if the child e.g., changes the
>> working directory post exec it would also affect the parent. But after
>> this change here this would no longer be true. So a child changing a
>> workding directoro would not affect the parent anymore. IOW, an exec is
>> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
>> it seems like a non-trivial uapi change but there might be few users
>> that do clone{3}(CLONE_FS) followed by an exec.
>
>I believe the following code in Chromium explicitly relies on this
>behavior, but I'm not sure whether this code is in active use anymore:
>
>https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium

Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed...

It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition...
Jann Horn Oct. 6, 2022, 3:35 p.m. UTC | #5
On Thu, Oct 6, 2022 at 5:25 PM Kees Cook <keescook@chromium.org> wrote:
> On October 6, 2022 7:13:37 AM PDT, Jann Horn <jannh@google.com> wrote:
> >On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
> >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> >> > threaded process trying to perform a suid exec, causing the suid portion
> >> > to fail. This counting error appears to be unneeded, but to catch any
> >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
> >>
> >> Isn't this a potential uapi break? Afaict, before this change a call to
> >> clone{3}(CLONE_FS) followed by an exec in the child would have the
> >> parent and child share fs information. So if the child e.g., changes the
> >> working directory post exec it would also affect the parent. But after
> >> this change here this would no longer be true. So a child changing a
> >> workding directoro would not affect the parent anymore. IOW, an exec is
> >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> >> it seems like a non-trivial uapi change but there might be few users
> >> that do clone{3}(CLONE_FS) followed by an exec.
> >
> >I believe the following code in Chromium explicitly relies on this
> >behavior, but I'm not sure whether this code is in active use anymore:
> >
> >https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
>
> Oh yes. I think I had tried to forget this existed. Ugh. Okay, so back to the drawing board, I guess. The counting will need to be fixed...
>
> It's possible we can move the counting after dethread -- it seems the early count was just to avoid setting flags after the point of no return, but it's not an error condition...

Random idea that I haven't thought about a lot:

One approach might be to not do it by counting, but instead have a
flag on the fs_struct that we set when someone does a clone() with
CLONE_FS but without CLONE_THREAD? Then we'd end up with the following
possible states for fs_struct:

 - single-process, normal
 - single-process, pending execve past check_unsafe_exec() (prevent
concurrent CLONE_FS)
 - shared between processes

The slight difference from the old semantics would be that once you've
used CLONE_FS without CLONE_THREAD, you can never do setuid execve()
from your current process again (without calling unshare()), even if
the child disappears in the meantime. I think that might be an
acceptably small UAPI break.
Andy Lutomirski Oct. 14, 2022, 3:18 a.m. UTC | #6
On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote:
> On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
>> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
>> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
>> > threaded process trying to perform a suid exec, causing the suid portion
>> > to fail. This counting error appears to be unneeded, but to catch any
>> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
>>
>> Isn't this a potential uapi break? Afaict, before this change a call to
>> clone{3}(CLONE_FS) followed by an exec in the child would have the
>> parent and child share fs information. So if the child e.g., changes the
>> working directory post exec it would also affect the parent. But after
>> this change here this would no longer be true. So a child changing a
>> workding directoro would not affect the parent anymore. IOW, an exec is
>> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
>> it seems like a non-trivial uapi change but there might be few users
>> that do clone{3}(CLONE_FS) followed by an exec.
>
> I believe the following code in Chromium explicitly relies on this
> behavior, but I'm not sure whether this code is in active use anymore:
>
> https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium

Wait, this is absolutely nucking futs.  On a very quick inspection, the sharable things like this are fs, files, sighand, and io.    files and sighand get unshared, which makes sense.  fs supposedly checks for extra refs and prevents gaining privilege.  io is... ignored!  At least it's not immediately obvious that io is a problem.

But seriously, this makes no sense at all.  It should not be possible to exec a program and then, without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?

--Andy
Kees Cook Oct. 14, 2022, 3:54 a.m. UTC | #7
On Thu, Oct 13, 2022 at 08:18:04PM -0700, Andy Lutomirski wrote:
> But seriously, this makes no sense at all.  It should not be possible to exec a program and then, without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?

Yup, already abandoned:
https://lore.kernel.org/lkml/202210061301.207A20C8E5@keescook/
Jann Horn Oct. 14, 2022, 3:35 p.m. UTC | #8
On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote:
> On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote:
> > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
> >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> >> > threaded process trying to perform a suid exec, causing the suid portion
> >> > to fail. This counting error appears to be unneeded, but to catch any
> >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
> >>
> >> Isn't this a potential uapi break? Afaict, before this change a call to
> >> clone{3}(CLONE_FS) followed by an exec in the child would have the
> >> parent and child share fs information. So if the child e.g., changes the
> >> working directory post exec it would also affect the parent. But after
> >> this change here this would no longer be true. So a child changing a
> >> workding directoro would not affect the parent anymore. IOW, an exec is
> >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> >> it seems like a non-trivial uapi change but there might be few users
> >> that do clone{3}(CLONE_FS) followed by an exec.
> >
> > I believe the following code in Chromium explicitly relies on this
> > behavior, but I'm not sure whether this code is in active use anymore:
> >
> > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
>
> Wait, this is absolutely nucking futs.  On a very quick inspection, the sharable things like this are fs, files, sighand, and io.    files and sighand get unshared, which makes sense.  fs supposedly checks for extra refs and prevents gaining privilege.  io is... ignored!  At least it's not immediately obvious that io is a problem.
>
> But seriously, this makes no sense at all.  It should not be possible to exec a program and then, without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?

I agree that this is pretty wild.

The single user I'm aware of is Chrome, and as far as I know, they use
it for establishing their sandbox on systems where unprivileged user
namespaces are disabled - see
<https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>.
They also have seccomp-based sandboxing, but IIRC there are some small
holes that mean it's still useful for them to be able to set up
namespaces, like how sendmsg() on a unix domain socket can specify a
file path as the destination address.

(By the way, I think maybe Chrome wouldn't need this wacky trick with
the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had
ever landed that you
(https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/)
and Mickaël Salaün proposed in the past... or alternatively, if there
was a way to properly filter all the syscalls that Chrome has to
permit for renderers.)

(But also, to be clear, I don't speak for Chrome, this is just my
understanding of how their stuff works.)
David Laight Oct. 14, 2022, 10:03 p.m. UTC | #9
From: Andy Lutomirski
> Sent: 14 October 2022 04:18
...
> But seriously, this makes no sense at all.  It should not be possible to exec a program and then,
> without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?

it maybe ok if the exec'ed program also 'bought-in' to the
fact that its cwd and open files might get changed.
But imagine someone doing it to a login shell!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Kees Cook Oct. 18, 2022, 7:09 a.m. UTC | #10
On Fri, Oct 14, 2022 at 05:35:26PM +0200, Jann Horn wrote:
> On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote:
> > On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote:
> > > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
> > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> > >> > threaded process trying to perform a suid exec, causing the suid portion
> > >> > to fail. This counting error appears to be unneeded, but to catch any
> > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
> > >>
> > >> Isn't this a potential uapi break? Afaict, before this change a call to
> > >> clone{3}(CLONE_FS) followed by an exec in the child would have the
> > >> parent and child share fs information. So if the child e.g., changes the
> > >> working directory post exec it would also affect the parent. But after
> > >> this change here this would no longer be true. So a child changing a
> > >> workding directoro would not affect the parent anymore. IOW, an exec is
> > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> > >> it seems like a non-trivial uapi change but there might be few users
> > >> that do clone{3}(CLONE_FS) followed by an exec.
> > >
> > > I believe the following code in Chromium explicitly relies on this
> > > behavior, but I'm not sure whether this code is in active use anymore:
> > >
> > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
> >
> > Wait, this is absolutely nucking futs.  On a very quick inspection, the sharable things like this are fs, files, sighand, and io.    files and sighand get unshared, which makes sense.  fs supposedly checks for extra refs and prevents gaining privilege.  io is... ignored!  At least it's not immediately obvious that io is a problem.
> >
> > But seriously, this makes no sense at all.  It should not be possible to exec a program and then, without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?
> 
> I agree that this is pretty wild.
> 
> The single user I'm aware of is Chrome, and as far as I know, they use
> it for establishing their sandbox on systems where unprivileged user
> namespaces are disabled - see
> <https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>.
> They also have seccomp-based sandboxing, but IIRC there are some small
> holes that mean it's still useful for them to be able to set up
> namespaces, like how sendmsg() on a unix domain socket can specify a
> file path as the destination address.
> 
> (By the way, I think maybe Chrome wouldn't need this wacky trick with
> the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had
> ever landed that you
> (https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/)
> and Mickaël Salaün proposed in the past... or alternatively, if there
> was a way to properly filter all the syscalls that Chrome has to
> permit for renderers.)
> 
> (But also, to be clear, I don't speak for Chrome, this is just my
> understanding of how their stuff works.)

Chrome seems to just want a totally empty filesystem view, yes?
Let's land the nnp+chroot change. :P Only 10 years late! Then we can
have Chrome use this and we can unshare fs on exec...
Jann Horn Oct. 18, 2022, 11:19 a.m. UTC | #11
On Tue, Oct 18, 2022 at 9:09 AM Kees Cook <keescook@chromium.org> wrote:
> On Fri, Oct 14, 2022 at 05:35:26PM +0200, Jann Horn wrote:
> > On Fri, Oct 14, 2022 at 5:18 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > On Thu, Oct 6, 2022, at 7:13 AM, Jann Horn wrote:
> > > > On Thu, Oct 6, 2022 at 11:05 AM Christian Brauner <brauner@kernel.org> wrote:
> > > >> On Thu, Oct 06, 2022 at 01:27:34AM -0700, Kees Cook wrote:
> > > >> > The check_unsafe_exec() counting of n_fs would not add up under a heavily
> > > >> > threaded process trying to perform a suid exec, causing the suid portion
> > > >> > to fail. This counting error appears to be unneeded, but to catch any
> > > >> > possible conditions, explicitly unshare fs_struct on exec, if it ends up
> > > >>
> > > >> Isn't this a potential uapi break? Afaict, before this change a call to
> > > >> clone{3}(CLONE_FS) followed by an exec in the child would have the
> > > >> parent and child share fs information. So if the child e.g., changes the
> > > >> working directory post exec it would also affect the parent. But after
> > > >> this change here this would no longer be true. So a child changing a
> > > >> workding directoro would not affect the parent anymore. IOW, an exec is
> > > >> accompanied by an unshare(CLONE_FS). Might still be worth trying ofc but
> > > >> it seems like a non-trivial uapi change but there might be few users
> > > >> that do clone{3}(CLONE_FS) followed by an exec.
> > > >
> > > > I believe the following code in Chromium explicitly relies on this
> > > > behavior, but I'm not sure whether this code is in active use anymore:
> > > >
> > > > https://source.chromium.org/chromium/chromium/src/+/main:sandbox/linux/suid/sandbox.c;l=101?q=CLONE_FS&sq=&ss=chromium
> > >
> > > Wait, this is absolutely nucking futs.  On a very quick inspection, the sharable things like this are fs, files, sighand, and io.    files and sighand get unshared, which makes sense.  fs supposedly checks for extra refs and prevents gaining privilege.  io is... ignored!  At least it's not immediately obvious that io is a problem.
> > >
> > > But seriously, this makes no sense at all.  It should not be possible to exec a program and then, without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?
> >
> > I agree that this is pretty wild.
> >
> > The single user I'm aware of is Chrome, and as far as I know, they use
> > it for establishing their sandbox on systems where unprivileged user
> > namespaces are disabled - see
> > <https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox.md>.
> > They also have seccomp-based sandboxing, but IIRC there are some small
> > holes that mean it's still useful for them to be able to set up
> > namespaces, like how sendmsg() on a unix domain socket can specify a
> > file path as the destination address.
> >
> > (By the way, I think maybe Chrome wouldn't need this wacky trick with
> > the shared fs_struct if the "NO_NEW_PRIVS permits chroot()" thing had
> > ever landed that you
> > (https://lore.kernel.org/lkml/0e2f0f54e19bff53a3739ecfddb4ffa9a6dbde4d.1327858005.git.luto@amacapital.net/)
> > and Mickaël Salaün proposed in the past... or alternatively, if there
> > was a way to properly filter all the syscalls that Chrome has to
> > permit for renderers.)
> >
> > (But also, to be clear, I don't speak for Chrome, this is just my
> > understanding of how their stuff works.)
>
> Chrome seems to just want a totally empty filesystem view, yes?
> Let's land the nnp+chroot change. :P Only 10 years late! Then we can
> have Chrome use this and we can unshare fs on exec...

Someone should check with Chrome first though to make sure what I said
accurately represents what they think...
Eric W. Biederman Nov. 28, 2022, 5:49 p.m. UTC | #12
David Laight <David.Laight@ACULAB.COM> writes:

> From: Andy Lutomirski
>> Sent: 14 October 2022 04:18
> ...
>> But seriously, this makes no sense at all.  It should not be possible to exec a program and then,
>> without ptrace, change its cwd out from under it.  Do we really need to preserve this behavior?
>
> it maybe ok if the exec'ed program also 'bought-in' to the
> fact that its cwd and open files might get changed.
> But imagine someone doing it to a login shell!


I am slowly catching up on my email and I saw this conversation.

When I initially saw this thread I was confused and thought this
might run into an issue with fs/locks.c.  I was close but wrong.
fs/locks.c uses current->files as a sort of process identifier
and so is very sensitive to when it is unshared.  Making
unsharing current->files unconditionally a bug.  Not relevant to
this conversation.


There are several clone options that were only relevant for the old
LinuxThreads implementation including CLONE_FS and CLONE_SIGHAND.
The LinuxThreads implementation has not been needed since
the introduction of CLONE_THREAD in linux-2.6.0 in 17 Dec 2003.
Almost 20 years ago.

I suggest we introduce CONFIG_CLONE_FS and CONFIG_SIGHAND to allow
disabling support of these clone options.  No known user space will
care.  The are both getting in the way of kernel maintenance so there
is a reason to start pushing them out.

Further simply not worrying about UNSHARE_FS during exec fixes the
race so it essentially a bug fix by code removal.

I believe something like the patch below should get the job done.

diff --git a/fs/exec.c b/fs/exec.c
index a0b1f0337a62..7ff13c77ad04 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1186,7 +1186,8 @@ static int unshare_sighand(struct task_struct *me)
 {
 	struct sighand_struct *oldsighand = me->sighand;
 
-	if (refcount_read(&oldsighand->count) != 1) {
+	if (IS_ENABLED(CONFIG_CLONE_SIGHAND) &&
+	    refcount_read(&oldsighand->count) != 1) {
 		struct sighand_struct *newsighand;
 		/*
 		 * This ->sighand is shared with the CLONE_SIGHAND
@@ -1568,6 +1569,9 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 	if (task_no_new_privs(current))
 		bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;
 
+	if (!IS_ENABLED(CONFIG_CLONE_FS))
+		return;
+
 	t = p;
 	n_fs = 1;
 	spin_lock(&p->fs->lock);
diff --git a/init/Kconfig b/init/Kconfig
index 94125d3b6893..8660a6bcc1cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1764,6 +1764,23 @@ config KALLSYMS_BASE_RELATIVE
 	  time constants, and no relocation pass is required at runtime to fix
 	  up the entries based on the runtime load address of the kernel.
 
+config CLONE_FS
+	bool
+	default y
+	help
+	  Support CLONE_FS being passed to clone.  The only known user
+	  is the old LinuxThreads package so it should be safe to disable
+	  this option.
+
+config CLONE_SIGHAND
+	bool
+	default y
+	help
+	  Support CLONE_SIGHAND being passed to clone.  The only known user
+	  is the old LinuxThreads package so it should be safe to disable
+	  this option.
+
+
 # end of the "standard kernel features (expert users)" menu
 
 # syscall, maps, verifier
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..da9017b51da4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2023,6 +2023,16 @@ static __latent_entropy struct task_struct *copy_process(
 	if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
 		return ERR_PTR(-EINVAL);
 
+	/* Don't allow CLONE_FS if not enabled */
+	if (!IS_ENABLED(CONFIG_CLONE_FS) &&
+	    ((clone_flags & (CLONE_THREAD | CLONE_FS)) == CLONE_FS))
+		return ERR_PTR(-EINVAL);
+
+	/* Don't allow CLONE_SIGHAND if not enabled */
+	if (!IS_ENABLED(CONFIG_CLONE_SIGHAND) &&
+	    ((clone_flags & (CLONE_THREAD | CLONE_SIGHAND)) == CLONE_SIGHAND))
+		return ERR_PTR(-EINVAL);
+
 	/*
 	 * Siblings of global init remain as zombies on exit since they are
 	 * not reaped by their parent (swapper). To solve this and to avoid
@@ -3101,6 +3111,9 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
 	if (fs->users == 1)
 		return 0;
 
+	if (!IS_ENABLED(CONFIG_CLONE_FS))
+		return -EINVAL;
+
 	*new_fsp = copy_fs_struct(fs);
 	if (!*new_fsp)
 		return -ENOMEM;

Eric
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 902bce45b116..7d5f63f03c58 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1272,6 +1272,11 @@  int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	/* Ensure the fs_struct is not shared. */
+	retval = unshare_fs();
+	if (retval)
+		goto out;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visible until then. This also enables the update
@@ -1583,8 +1588,6 @@  static void check_unsafe_exec(struct linux_binprm *bprm)
 
 	if (p->fs->users > n_fs)
 		bprm->unsafe |= LSM_UNSAFE_SHARE;
-	else
-		p->fs->in_exec = 1;
 	spin_unlock(&p->fs->lock);
 }
 
@@ -1843,7 +1846,6 @@  static int bprm_execve(struct linux_binprm *bprm,
 		goto out;
 
 	/* execve succeeded */
-	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	rseq_execve(current);
 	acct_update_integrals(current);
@@ -1861,7 +1863,6 @@  static int bprm_execve(struct linux_binprm *bprm,
 		force_fatal_sig(SIGSEGV);
 
 out_unmark:
-	current->fs->in_exec = 0;
 	current->in_execve = 0;
 
 	return retval;
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 04b3f5b9c629..c27c67023d01 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -115,7 +115,6 @@  struct fs_struct *copy_fs_struct(struct fs_struct *old)
 	/* We don't need to lock fs - think why ;-) */
 	if (fs) {
 		fs->users = 1;
-		fs->in_exec = 0;
 		spin_lock_init(&fs->lock);
 		seqcount_spinlock_init(&fs->seq, &fs->lock);
 		fs->umask = old->umask;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index e066816f3519..fbfb89ee3bda 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -117,6 +117,7 @@  struct task_struct;
 
 void put_files_struct(struct files_struct *fs);
 int unshare_files(void);
+int unshare_fs(void);
 struct files_struct *dup_fd(struct files_struct *, unsigned, int *) __latent_entropy;
 void do_close_on_exec(struct files_struct *);
 int iterate_fd(struct files_struct *, unsigned,
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 783b48dedb72..08b82a90ceae 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -11,7 +11,6 @@  struct fs_struct {
 	spinlock_t lock;
 	seqcount_spinlock_t seq;
 	int umask;
-	int in_exec;
 	struct path root, pwd;
 } __randomize_layout;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index b4a799d9c50f..53b7248f7a4b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1589,10 +1589,6 @@  static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 	if (clone_flags & CLONE_FS) {
 		/* tsk->fs is already what we want */
 		spin_lock(&fs->lock);
-		if (fs->in_exec) {
-			spin_unlock(&fs->lock);
-			return -EAGAIN;
-		}
 		fs->users++;
 		spin_unlock(&fs->lock);
 		return 0;
@@ -3070,10 +3066,9 @@  static int check_unshare_flags(unsigned long unshare_flags)
 	return 0;
 }
 
-/*
- * Unshare the filesystem structure if it is being shared
- */
-static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
+/* Allocate an fs_struct if it is currently shared and CLONE_FS requested. */
+static int unshare_fs_alloc(unsigned long unshare_flags,
+			    struct fs_struct **new_fsp)
 {
 	struct fs_struct *fs = current->fs;
 
@@ -3091,6 +3086,41 @@  static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
 	return 0;
 }
 
+/* Swap out fs_struct, returning old fs if it needs freeing. */
+static void unshare_fs_finalize(struct fs_struct **new_fsp)
+{
+	struct task_struct *task = current;
+	struct fs_struct *fs = task->fs;
+
+	spin_lock(&fs->lock);
+	task->fs = *new_fsp;
+	if (--fs->users)
+		*new_fsp = NULL;
+	else
+		*new_fsp = fs;
+	spin_unlock(&fs->lock);
+}
+
+/*
+ * Unshare the filesystem structure if it is being shared
+ */
+int unshare_fs(void)
+{
+	struct fs_struct *new_fs = NULL;
+	int error;
+
+	error = unshare_fs_alloc(CLONE_FS, &new_fs);
+	if (error || !new_fs)
+		return error;
+
+	unshare_fs_finalize(&new_fs);
+
+	if (new_fs)
+		free_fs_struct(new_fs);
+
+	return 0;
+}
+
 /*
  * Unshare file descriptor table if it is being shared
  */
@@ -3120,7 +3150,7 @@  int unshare_fd(unsigned long unshare_flags, unsigned int max_fds,
  */
 int ksys_unshare(unsigned long unshare_flags)
 {
-	struct fs_struct *fs, *new_fs = NULL;
+	struct fs_struct *new_fs = NULL;
 	struct files_struct *new_fd = NULL;
 	struct cred *new_cred = NULL;
 	struct nsproxy *new_nsproxy = NULL;
@@ -3159,7 +3189,7 @@  int ksys_unshare(unsigned long unshare_flags)
 	 */
 	if (unshare_flags & (CLONE_NEWIPC|CLONE_SYSVSEM))
 		do_sysvsem = 1;
-	err = unshare_fs(unshare_flags, &new_fs);
+	err = unshare_fs_alloc(unshare_flags, &new_fs);
 	if (err)
 		goto bad_unshare_out;
 	err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd);
@@ -3197,16 +3227,8 @@  int ksys_unshare(unsigned long unshare_flags)
 
 		task_lock(current);
 
-		if (new_fs) {
-			fs = current->fs;
-			spin_lock(&fs->lock);
-			current->fs = new_fs;
-			if (--fs->users)
-				new_fs = NULL;
-			else
-				new_fs = fs;
-			spin_unlock(&fs->lock);
-		}
+		if (new_fs)
+			unshare_fs_finalize(&new_fs);
 
 		if (new_fd)
 			swap(current->files, new_fd);