Message ID | 20240422084505.3465238-2-stsp2@yandex.ru (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] fs: reorganize path_openat() | expand |
Am 22.04.24 um 10:45 schrieb Stas Sergeev: > This flag performs the open operation with the fsuid/fsgid that > were in effect when dir_fd was opened. > This allows the process to pre-open some directories and then > change eUID (and all other UIDs/GIDs) to a less-privileged user, > retaining the ability to open/create files within these directories. > > Design goal: > The idea is to provide a very light-weight sandboxing, where the > process, without the use of any heavy-weight techniques like chroot > within namespaces, can restrict the access to the set of pre-opened > directories. > This patch is just a first step to such sandboxing. If things go > well, in the future the same extension can be added to more syscalls. > These should include at least unlinkat(), renameat2() and the > not-yet-upstreamed setxattrat(). > > Security considerations: > To avoid sandboxing escape, this patch makes sure the restricted > lookup modes are used. Namely, RESOLVE_BENEATH or RESOLVE_IN_ROOT. > To avoid leaking creds across exec, this patch requires O_CLOEXEC > flag on a directory. > > Use cases: > Virtual machines that deal with untrusted code, can use that > instead of a more heavy-weighted approaches. > Currently the approach is being tested on a dosemu2 VM. > > Signed-off-by: Stas Sergeev <stsp2@yandex.ru> > > CC: Eric Biederman <ebiederm@xmission.com> > CC: Alexander Viro <viro@zeniv.linux.org.uk> > CC: Andy Lutomirski <luto@kernel.org> > CC: Christian Brauner <brauner@kernel.org> > CC: Jan Kara <jack@suse.cz> > CC: Jeff Layton <jlayton@kernel.org> > CC: Chuck Lever <chuck.lever@oracle.com> > CC: Alexander Aring <alex.aring@gmail.com> > CC: linux-fsdevel@vger.kernel.org > CC: linux-kernel@vger.kernel.org > CC: Paolo Bonzini <pbonzini@redhat.com> > CC: Christian Göttsche <cgzones@googlemail.com> > --- > fs/file_table.c | 2 ++ > fs/internal.h | 2 +- > fs/namei.c | 54 ++++++++++++++++++++++++++++++++++-- > fs/open.c | 2 +- > include/linux/fcntl.h | 2 ++ > include/linux/fs.h | 2 ++ > include/uapi/linux/openat2.h | 3 ++ > 7 files changed, 63 insertions(+), 4 deletions(-) > > diff --git a/fs/file_table.c b/fs/file_table.c > index 4f03beed4737..9991bdd538e9 100644 > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -160,6 +160,8 @@ static int init_file(struct file *f, int flags, const struct cred *cred) > mutex_init(&f->f_pos_lock); > f->f_flags = flags; > f->f_mode = OPEN_FMODE(flags); > + f->f_fsuid = cred->fsuid; > + f->f_fsgid = cred->fsgid; > /* f->f_version: 0 */ > > /* > diff --git a/fs/internal.h b/fs/internal.h > index 7ca738904e34..692b53b19aad 100644 > --- a/fs/internal.h > +++ b/fs/internal.h > @@ -169,7 +169,7 @@ static inline void sb_end_ro_state_change(struct super_block *sb) > * open.c > */ > struct open_flags { > - int open_flag; > + u64 open_flag; > umode_t mode; > int acc_mode; > int intent; > diff --git a/fs/namei.c b/fs/namei.c > index 2fde2c320ae9..d1db6ceee4bd 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -586,6 +586,8 @@ struct nameidata { > int dfd; > vfsuid_t dir_vfsuid; > umode_t dir_mode; > + kuid_t dir_open_fsuid; > + kgid_t dir_open_fsgid; > } __randomize_layout; > > #define ND_ROOT_PRESET 1 > @@ -2414,6 +2416,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > get_fs_pwd(current->fs, &nd->path); > nd->inode = nd->path.dentry->d_inode; > } > + nd->dir_open_fsuid = current_cred()->fsuid; > + nd->dir_open_fsgid = current_cred()->fsgid; I'm wondering if it would be better to capture the whole cred structure. Similar to io_register_personality(), which uses get_current_cred(). Only using uid and gid, won't reflect any group memberships or capabilities... metze
22.04.2024 22:53, Stefan Metzmacher пишет: > I'm wondering if it would be better to capture the whole cred structure. > > Similar to io_register_personality(), which uses get_current_cred(). > > Only using uid and gid, won't reflect any group memberships or > capabilities... Hmm, I thought about that, but was under an impression that with get_current_cred() you only increment a refcount. But I guess the trick here is that due to an RCU machinery, you can actually get your local copy if someone else changes it? I'll try what you say, thanks.
22.04.2024 22:53, Stefan Metzmacher пишет: > I'm wondering if it would be better to capture the whole cred structure. > > Similar to io_register_personality(), which uses get_current_cred(). > > Only using uid and gid, won't reflect any group memberships or > capabilities... I ended up posting v3 where the group memberships are added but the rest, including capabilities, is omitted to avoid security risks. Does adding just a groupinfo to the set of overridden members (which is now: fsuid, fsgid and group_info) address your concern? I really think that raising caps is far out of the scope for my approach, which aims to be safe and simple. Someone else can do that later, if need be.
diff --git a/fs/file_table.c b/fs/file_table.c index 4f03beed4737..9991bdd538e9 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -160,6 +160,8 @@ static int init_file(struct file *f, int flags, const struct cred *cred) mutex_init(&f->f_pos_lock); f->f_flags = flags; f->f_mode = OPEN_FMODE(flags); + f->f_fsuid = cred->fsuid; + f->f_fsgid = cred->fsgid; /* f->f_version: 0 */ /* diff --git a/fs/internal.h b/fs/internal.h index 7ca738904e34..692b53b19aad 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -169,7 +169,7 @@ static inline void sb_end_ro_state_change(struct super_block *sb) * open.c */ struct open_flags { - int open_flag; + u64 open_flag; umode_t mode; int acc_mode; int intent; diff --git a/fs/namei.c b/fs/namei.c index 2fde2c320ae9..d1db6ceee4bd 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -586,6 +586,8 @@ struct nameidata { int dfd; vfsuid_t dir_vfsuid; umode_t dir_mode; + kuid_t dir_open_fsuid; + kgid_t dir_open_fsgid; } __randomize_layout; #define ND_ROOT_PRESET 1 @@ -2414,6 +2416,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags) get_fs_pwd(current->fs, &nd->path); nd->inode = nd->path.dentry->d_inode; } + nd->dir_open_fsuid = current_cred()->fsuid; + nd->dir_open_fsgid = current_cred()->fsgid; } else { /* Caller must check execute permissions on the starting path component */ struct fd f = fdget_raw(nd->dfd); @@ -2437,6 +2441,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags) path_get(&nd->path); nd->inode = nd->path.dentry->d_inode; } + nd->dir_open_fsuid = f.file->f_fsuid; + nd->dir_open_fsgid = f.file->f_fsgid; fdput(f); } @@ -3653,6 +3659,28 @@ static int do_open(struct nameidata *nd, return error; } +static const struct cred *openat2_override_creds(struct nameidata *nd) +{ + const struct cred *old_cred; + struct cred *override_cred; + + override_cred = prepare_creds(); + if (!override_cred) + return NULL; + + override_cred->fsuid = nd->dir_open_fsuid; + override_cred->fsgid = nd->dir_open_fsgid; + + override_cred->non_rcu = 1; + + old_cred = override_creds(override_cred); + + /* override_cred() gets its own ref */ + put_cred(override_cred); + + return old_cred; +} + /** * vfs_tmpfile - create tmpfile * @idmap: idmap of the mount the inode was found from @@ -3794,8 +3822,28 @@ static struct file *path_openat(struct nameidata *nd, error = do_o_path(nd, flags, file); } else { const char *s = path_init(nd, flags); - file = alloc_empty_file(op->open_flag, current_cred()); - error = PTR_ERR_OR_ZERO(file); + const struct cred *old_cred = NULL; + + error = 0; + if (op->open_flag & OA2_INHERIT_CRED) { + /* Make sure to work only with restricted + * look-up modes. + */ + if (!(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))) + error = -EPERM; + /* Only work with O_CLOEXEC dirs. */ + if (!get_close_on_exec(nd->dfd)) + error = -EPERM; + + if (!error) + old_cred = openat2_override_creds(nd); + } + if (!error) { + file = alloc_empty_file(op->open_flag, current_cred()); + error = PTR_ERR_OR_ZERO(file); + } else { + file = ERR_PTR(error); + } if (!error) { while (!(error = link_path_walk(s, nd)) && (s = open_last_lookups(nd, file, op)) != NULL) @@ -3803,6 +3851,8 @@ static struct file *path_openat(struct nameidata *nd, } if (!error) error = do_open(nd, file, op); + if (old_cred) + revert_creds(old_cred); terminate_walk(nd); if (IS_ERR(file)) return file; diff --git a/fs/open.c b/fs/open.c index ee8460c83c77..6be013182a35 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1225,7 +1225,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op) * values before calling build_open_flags(), but openat2(2) checks all * of its arguments. */ - if (flags & ~VALID_OPEN_FLAGS) + if (flags & ~VALID_OPENAT2_FLAGS) return -EINVAL; if (how->resolve & ~VALID_RESOLVE_FLAGS) return -EINVAL; diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index a332e79b3207..b71f8b162102 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -12,6 +12,8 @@ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE) +#define VALID_OPENAT2_FLAGS (VALID_OPEN_FLAGS | OA2_INHERIT_CRED) + /* List of all valid flags for the how->resolve argument: */ #define VALID_RESOLVE_FLAGS \ (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \ diff --git a/include/linux/fs.h b/include/linux/fs.h index 8dfd53b52744..ea954b6945d0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1028,6 +1028,8 @@ struct file { struct address_space *f_mapping; errseq_t f_wb_err; errseq_t f_sb_err; /* for syncfs */ + kuid_t f_fsuid; /* fsUID of the opener */ + kgid_t f_fsgid; /* fsGID of the opener */ } __randomize_layout __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */ diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h index a5feb7604948..cdd676a10b62 100644 --- a/include/uapi/linux/openat2.h +++ b/include/uapi/linux/openat2.h @@ -40,4 +40,7 @@ struct open_how { return -EAGAIN if that's not possible. */ +/* openat2-specific flags go to upper 4 bytes. */ +#define OA2_INHERIT_CRED (1ULL << 32) + #endif /* _UAPI_LINUX_OPENAT2_H */
This flag performs the open operation with the fsuid/fsgid that were in effect when dir_fd was opened. This allows the process to pre-open some directories and then change eUID (and all other UIDs/GIDs) to a less-privileged user, retaining the ability to open/create files within these directories. Design goal: The idea is to provide a very light-weight sandboxing, where the process, without the use of any heavy-weight techniques like chroot within namespaces, can restrict the access to the set of pre-opened directories. This patch is just a first step to such sandboxing. If things go well, in the future the same extension can be added to more syscalls. These should include at least unlinkat(), renameat2() and the not-yet-upstreamed setxattrat(). Security considerations: To avoid sandboxing escape, this patch makes sure the restricted lookup modes are used. Namely, RESOLVE_BENEATH or RESOLVE_IN_ROOT. To avoid leaking creds across exec, this patch requires O_CLOEXEC flag on a directory. Use cases: Virtual machines that deal with untrusted code, can use that instead of a more heavy-weighted approaches. Currently the approach is being tested on a dosemu2 VM. Signed-off-by: Stas Sergeev <stsp2@yandex.ru> CC: Eric Biederman <ebiederm@xmission.com> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Andy Lutomirski <luto@kernel.org> CC: Christian Brauner <brauner@kernel.org> CC: Jan Kara <jack@suse.cz> CC: Jeff Layton <jlayton@kernel.org> CC: Chuck Lever <chuck.lever@oracle.com> CC: Alexander Aring <alex.aring@gmail.com> CC: linux-fsdevel@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: Paolo Bonzini <pbonzini@redhat.com> CC: Christian Göttsche <cgzones@googlemail.com> --- fs/file_table.c | 2 ++ fs/internal.h | 2 +- fs/namei.c | 54 ++++++++++++++++++++++++++++++++++-- fs/open.c | 2 +- include/linux/fcntl.h | 2 ++ include/linux/fs.h | 2 ++ include/uapi/linux/openat2.h | 3 ++ 7 files changed, 63 insertions(+), 4 deletions(-)