Message ID | 20180929131534.24472-1-cyphar@cyphar.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | namei: implement various scoping AT_* flags | expand |
+cc linux-api; please keep them in CC for future versions of the patch On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > The primary motivation for the need for this flag is container runtimes > which have to interact with malicious root filesystems in the host > namespaces. One of the first requirements for a container runtime to be > secure against a malicious rootfs is that they correctly scope symlinks > (that is, they should be scoped as though they are chroot(2)ed into the > container's rootfs) and ".."-style paths. The already-existing AT_XDEV > and AT_NO_PROCLINKS help defend against other potential attacks in a > malicious rootfs scenario. So, I really like the concept for patch 1 of this series (but haven't read the code yet); but I dislike this patch because of its footgun potential. If this patch landed as-is, the manpage would need some big warning labels. chroot() basically provides no security guarantees at all; and yes, that includes that if you do `chroot(...); chdir("/"); open(attacker_controlled_path, ...);`, you can potentially end up opening a file outside the chroot. See https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-014-2015.txt for an example, where Qubes OS did pretty much that, and ended up with a potentially exploitable security bug because of that, where one VM, while performing a file transfer into another VM, could write outside of the transfer target directory. The problem is what happens if a folder you are walking through is concurrently moved out of the chroot. Consider the following scenario: You attempt to open "C/../../etc/passwd" under the root "/A/B". Something else concurrently moves /A/B/C to /A/C. This can result in the following: 1. You start the path walk and reach /A/B/C. 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. 3. Your path walk follows the first ".." up into /A. This is outside the process root, but you never actually encountered the process root, so you don't notice. 4. Your path walk follows the second ".." up to /. Again, this is outside the process root, but you don't notice. 5. Your path walk walks down to /etc/passwd, and the open completes successfully. You now have an fd pointing outside your chroot. If the root of your walk is below an attacker-controlled directory, this of course means that you lose instantly. If you point the root of the walk at a directory out of which a process in the container wouldn't be able to move the file, you're probably kinda mostly fine - as long as you know, for certain, that nothing else on the system would ever do that. But I still wouldn't feel good about that. (Yes, this means that if you run an SFTP server with OpenSSH's ChrootDirectory directive, you have to be very careful about these things.) I believe that the only way to robustly use this would be to point the dirfd at a mount point, such that you know that being moved out of the chroot is impossible because the mount point limits movement of directories under it. (Well, technically, it doesn't, but it ensures that if a directory does dangerously move away, the syscall fails.) It might make sense to hardcode this constraint in the implementation of AT_THIS_ROOT, to keep people from shooting themselves in the foot. > Currently most container runtimes try to do this resolution in > userspace[1], causing many potential race conditions. In addition, the > "obvious" alternative (actually performing a {ch,pivot_}root(2)) > requires a fork+exec which is *very* costly if necessary for every > filesystem operation involving a container. Wait. fork() I understand, but why exec? And actually, you don't need a full fork() either, clone() lets you do this with some process parts shared. And then you also shouldn't need to use SCM_RIGHTS, just keep the file descriptor table shared. And why chroot()/pivot_root(), wouldn't you want to use setns()? I think something like this should work (except that you should add some error handling - omitted here because I'm lazy), assuming that the container runtime does NOT have CAP_SYS_ADMIN in the init namespace (otherwise it's easier). Of course, this is entirely untested, and probably won't compile because I screwed something up. :P But you should get the idea... // Ensure that we are non-dumpable. Together with // commit bfedb589252c, this ensures that container root // can't trace our child once it enters the container. // My patch // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net/ // would make this unnecessary, but that patch didn't // land because Eric nacked it (for political reasons, // because people incorrectly claimed that this was a // security fix): // https://lore.kernel.org/lkml/8760z7fope.fsf@x220.int.ebiederm.org/ // Note that dumpability is per-mm, not per-process, // so this hack has the unfortunate side effect of preventing // unprivileged debugging of the container runtime. // Oh well. prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE); // Inform gcc that this particular syscall will effectively // return twice, just like vfork() or setjmp(). __attribute__((returns_twice)) long syscall_(long sysno, ...) = (void*)syscall; int result_fd = -1; // CLONE_FILES means we don't need to do fd passing, // we share the file descriptor table. // CLONE_VM means we don't have the cost of duplicating // our VMAs and page tables, and we don't have to mark // all our pagetable entries as readonly for copy-on-write. // CLONE_VFORK is a dirty hack to avoid having to // allocate a child stack. // Lack of SIGCHLD means we don't want to have to wait() // for the child. int child_pid = syscall_(__NR_clone, CLONE_FILES|CLONE_VM|CLONE_VFORK, 0, 0, 0, 0); if (child_pid == 0) { // Enter the container's user namespace; this allows us // to afterwards join its mount namespace even if we're // not capable in the init namespace. // (I believe that it should be possible to change the kernel // such that this is not required if you have set the // no-new-privs flag.) setns(container_user_ns_fd, CLONE_NEWUSER); // Entering the filesystem namespace automatically // moves us to that namespace's filesystem root. setns(container_fs_ns_fd, CLONE_NEWNS); result_fd = open(untrusted_container_path, ...); syscall(__NR_exit, 0); } > The most significant change in semantics with AT_THIS_ROOT is that > *at(2) syscalls now no longer have the property that an absolute > pathname causes the dirfd to be ignored completely (if LOOKUP_CHROOT is > specified). The reasoning behind this is that AT_THIS_ROOT necessarily > has to chroot-scope symlinks with absolute paths to dirfd, and so doing > it for the base path seems to be the most consistent behaviour (and also > avoids foot-gunning users who want to chroot-scope paths that might be > absolute). > > Currently this is only enabled for the stat(2) and openat(2) family (the > latter has its own flag O_THISROOT with the same semantics). Ideally > this flag would be supported by all *at(2) syscalls, but this will > require adding flags arguments to many of them (and will be done in a > separate patchset). > > [1]: https://github.com/cyphar/filepath-securejoin > > Cc: Eric Biederman <ebiederm@xmission.com> > Cc: Christian Brauner <christian@brauner.io> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > --- > fs/fcntl.c | 2 +- > fs/namei.c | 121 +++++++++++++++++-------------- > fs/open.c | 2 + > fs/stat.c | 4 +- > include/linux/fcntl.h | 2 +- > include/linux/namei.h | 1 + > include/uapi/asm-generic/fcntl.h | 3 + > include/uapi/linux/fcntl.h | 2 + > 8 files changed, 81 insertions(+), 56 deletions(-) > > diff --git a/fs/fcntl.c b/fs/fcntl.c > index e343618736f7..4c36c5b9fdb9 100644 > --- a/fs/fcntl.c > +++ b/fs/fcntl.c > @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void) > * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY > * is defined as O_NONBLOCK on some platforms and not on others. > */ > - BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ != > + BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ != > HWEIGHT32( > (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | > __FMODE_EXEC | __FMODE_NONOTIFY)); > diff --git a/fs/namei.c b/fs/namei.c > index 757dd783771c..1b984f0dbbb4 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -2193,9 +2193,64 @@ static int link_path_walk(const char *name, struct nameidata *nd) > } > } > > +/* > + * Configure nd->path based on the nd->dfd. This is only used as part of > + * path_init(). > + */ > +static inline int dirfd_path_init(struct nameidata *nd) > +{ > + if (nd->dfd == AT_FDCWD) { > + if (nd->flags & LOOKUP_RCU) { > + struct fs_struct *fs = current->fs; > + unsigned seq; > + > + do { > + seq = read_seqcount_begin(&fs->seq); > + nd->path = fs->pwd; > + nd->inode = nd->path.dentry->d_inode; > + nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); > + } while (read_seqcount_retry(&fs->seq, seq)); > + } else { > + get_fs_pwd(current->fs, &nd->path); > + nd->inode = nd->path.dentry->d_inode; > + } > + } else { > + /* Caller must check execute permissions on the starting path component */ > + struct fd f = fdget_raw(nd->dfd); > + struct dentry *dentry; > + > + if (!f.file) > + return -EBADF; > + > + dentry = f.file->f_path.dentry; > + > + if (*nd->name->name && unlikely(!d_can_lookup(dentry))) { > + fdput(f); > + return -ENOTDIR; > + } > + > + nd->path = f.file->f_path; > + if (nd->flags & LOOKUP_RCU) { > + nd->inode = nd->path.dentry->d_inode; > + nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); > + } else { > + path_get(&nd->path); > + nd->inode = nd->path.dentry->d_inode; > + } > + fdput(f); > + } > + if (unlikely(nd->flags & (LOOKUP_CHROOT | LOOKUP_BENEATH))) { > + nd->root = nd->path; > + if (!(nd->flags & LOOKUP_RCU)) > + path_get(&nd->root); > + } > + return 0; > +} > + > /* must be paired with terminate_walk() */ > static const char *path_init(struct nameidata *nd, unsigned flags) > { > + int error; > const char *s = nd->name->name; > > if (!*s) > @@ -2230,65 +2285,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > nd->path.dentry = NULL; > > nd->m_seq = read_seqbegin(&mount_lock); > + if (unlikely(flags & LOOKUP_CHROOT)) { > + error = dirfd_path_init(nd); > + if (unlikely(error)) > + return ERR_PTR(error); > + } > if (*s == '/') { > - int error; > - set_root(nd); > + if (likely(!nd->root.mnt)) > + set_root(nd); > error = nd_jump_root(nd); > if (unlikely(error)) > s = ERR_PTR(error); > return s; > - } else if (nd->dfd == AT_FDCWD) { > - if (flags & LOOKUP_RCU) { > - struct fs_struct *fs = current->fs; > - unsigned seq; > - > - do { > - seq = read_seqcount_begin(&fs->seq); > - nd->path = fs->pwd; > - nd->inode = nd->path.dentry->d_inode; > - nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); > - } while (read_seqcount_retry(&fs->seq, seq)); > - } else { > - get_fs_pwd(current->fs, &nd->path); > - nd->inode = nd->path.dentry->d_inode; > - } > - if (unlikely(flags & LOOKUP_BENEATH)) { > - nd->root = nd->path; > - if (!(flags & LOOKUP_RCU)) > - path_get(&nd->root); > - } > - return s; > - } else { > - /* Caller must check execute permissions on the starting path component */ > - struct fd f = fdget_raw(nd->dfd); > - struct dentry *dentry; > - > - if (!f.file) > - return ERR_PTR(-EBADF); > - > - dentry = f.file->f_path.dentry; > - > - if (*s && unlikely(!d_can_lookup(dentry))) { > - fdput(f); > - return ERR_PTR(-ENOTDIR); > - } > - > - nd->path = f.file->f_path; > - if (flags & LOOKUP_RCU) { > - nd->inode = nd->path.dentry->d_inode; > - nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); > - } else { > - path_get(&nd->path); > - nd->inode = nd->path.dentry->d_inode; > - } > - if (unlikely(flags & LOOKUP_BENEATH)) { > - nd->root = nd->path; > - if (!(flags & LOOKUP_RCU)) > - path_get(&nd->root); > - } > - fdput(f); > - return s; > } > + if (likely(!nd->path.mnt)) { > + error = dirfd_path_init(nd); > + if (unlikely(error)) > + return ERR_PTR(error); > + } > + return s; > } > > static const char *trailing_symlink(struct nameidata *nd) > diff --git a/fs/open.c b/fs/open.c > index 80f5f566a5ff..81d148f626cd 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o > lookup_flags |= LOOKUP_NO_PROCLINKS; > if (flags & O_NOSYMLINKS) > lookup_flags |= LOOKUP_NO_SYMLINKS; > + if (flags & O_THISROOT) > + lookup_flags |= LOOKUP_CHROOT; > op->lookup_flags = lookup_flags; > return 0; > } > diff --git a/fs/stat.c b/fs/stat.c > index 791e61b916ae..e8366e4812c3 100644 > --- a/fs/stat.c > +++ b/fs/stat.c > @@ -172,7 +172,7 @@ int vfs_statx(int dfd, const char __user *filename, int flags, > > if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH | > KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV | > - AT_NO_PROCLINKS | AT_NO_SYMLINKS)) > + AT_NO_PROCLINKS | AT_NO_SYMLINKS | AT_THIS_ROOT)) > return -EINVAL; > > if (flags & AT_SYMLINK_NOFOLLOW) > @@ -189,6 +189,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags, > lookup_flags |= LOOKUP_NO_PROCLINKS; > if (flags & AT_NO_SYMLINKS) > lookup_flags |= LOOKUP_NO_SYMLINKS; > + if (flags & AT_THIS_ROOT) > + lookup_flags |= LOOKUP_CHROOT; > > retry: > error = user_path_at(dfd, filename, lookup_flags, &path); > diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h > index ad5bba4b5b12..95480cd4c09d 100644 > --- a/include/linux/fcntl.h > +++ b/include/linux/fcntl.h > @@ -10,7 +10,7 @@ > O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ > FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ > O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \ > - O_NOPROCLINKS | O_NOSYMLINKS) > + O_NOPROCLINKS | O_NOSYMLINKS | O_THISROOT) > > #ifndef force_o_largefile > #define force_o_largefile() (BITS_PER_LONG != 32) > diff --git a/include/linux/namei.h b/include/linux/namei.h > index 5ff7f3362d1b..7ec9e2d84649 100644 > --- a/include/linux/namei.h > +++ b/include/linux/namei.h > @@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; > #define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ > #define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. > Implies LOOKUP_NO_PROCLINKS. */ > +#define LOOKUP_CHROOT 0x100000 /* Treat dirfd as %current->fs->root. */ > > extern int path_pts(struct path *path); > > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h > index c2bf5983e46a..11206b0e927c 100644 > --- a/include/uapi/asm-generic/fcntl.h > +++ b/include/uapi/asm-generic/fcntl.h > @@ -113,6 +113,9 @@ > #ifndef O_NOSYMLINKS > #define O_NOSYMLINKS 01000000000 > #endif > +#ifndef O_THISROOT > +#define O_THISROOT 02000000000 > +#endif > > #define F_DUPFD 0 /* dup */ > #define F_GETFD 1 /* get close_on_exec */ > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > index 551a9e2166a8..ea978457b68f 100644 > --- a/include/uapi/linux/fcntl.h > +++ b/include/uapi/linux/fcntl.h > @@ -99,6 +99,8 @@ > #define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */ > #define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*. > Implies AT_NO_PROCLINKS. */ > +#define AT_THIS_ROOT 0x100000 /* Path resolution acts as though > + it is chroot-ed into dirfd. */ > > > #endif /* _UAPI_LINUX_FCNTL_H */ > -- > 2.19.0 > >
> On Sep 29, 2018, at 9:35 AM, Jann Horn <jannh@google.com> wrote: > > +cc linux-api; please keep them in CC for future versions of the patch > >> On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai <cyphar@cyphar.com> wrote: >> The primary motivation for the need for this flag is container runtimes >> which have to interact with malicious root filesystems in the host >> namespaces. One of the first requirements for a container runtime to be >> secure against a malicious rootfs is that they correctly scope symlinks >> (that is, they should be scoped as though they are chroot(2)ed into the >> container's rootfs) and ".."-style paths. The already-existing AT_XDEV >> and AT_NO_PROCLINKS help defend against other potential attacks in a >> malicious rootfs scenario. > > So, I really like the concept for patch 1 of this series (but haven't > read the code yet); but I dislike this patch because of its footgun > potential. > The code could do it differently: do the path walk and then, before accepting the result, walk back up and make sure the result is under the starting point. This is *not* a full solution, though, since a walk above the root gas side effects on timing, various caches, and possibly network traffic, so it’s open to Spectre-like attacks in which a malicious container could use a runtime-initiated AT_THIS_ROOT to infer the existence of directories outside the container. But what’s the container usecase? Any sane container is based on pivot_root or similar, so the runtime can just do the walk in the container context. IOW I’m a bit confused as to the exact intended use of the whole series. Can you elaborate?
On 2018-09-29, Jann Horn <jannh@google.com> wrote: > The problem is what happens if a folder you are walking through is > concurrently moved out of the chroot. Consider the following scenario: > > You attempt to open "C/../../etc/passwd" under the root "/A/B". > Something else concurrently moves /A/B/C to /A/C. This can result in > the following: > > 1. You start the path walk and reach /A/B/C. > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > 3. Your path walk follows the first ".." up into /A. This is outside > the process root, but you never actually encountered the process root, > so you don't notice. > 4. Your path walk follows the second ".." up to /. Again, this is > outside the process root, but you don't notice. > 5. Your path walk walks down to /etc/passwd, and the open completes > successfully. You now have an fd pointing outside your chroot. > > If the root of your walk is below an attacker-controlled directory, > this of course means that you lose instantly. If you point the root of > the walk at a directory out of which a process in the container > wouldn't be able to move the file, you're probably kinda mostly fine - > as long as you know, for certain, that nothing else on the system > would ever do that. But I still wouldn't feel good about that. Please correct me if I'm wrong here (this is the first patch I've written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this -- or does that only handle if a particular path component changes *while* it's being walked through? Is it possible for a path walk to succeed after a path component was unmounted (obviously you can't delete a directory path component since you'd get -ENOTEMPTY)? If this is an issue for AT_THIS_ROOT, I believe this might also be an issue for AT_BENEATH since they are effectively both using the same nd->root trick (so you could similarly trick AT_BENEATH to not error out). So we'd need to figure out how to solve this problem in order for AT_BENEATH to be safe. Speaking naively, doesn't it make sense to invalidate the walk if a path component was modified? Or is this something that would be far too costly with little benefit? What if we do more aggressive nd->root checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != current->mnt_ns->root)? Regarding chroot attacks, I was aware of the trivial chroot-open-chroot-fchdir attack but I was not aware that there was a rename attack for chroot. Thanks for bringing this up! > I believe that the only way to robustly use this would be to point the > dirfd at a mount point, such that you know that being moved out of the > chroot is impossible because the mount point limits movement of > directories under it. (Well, technically, it doesn't, but it ensures > that if a directory does dangerously move away, the syscall fails.) It > might make sense to hardcode this constraint in the implementation of > AT_THIS_ROOT, to keep people from shooting themselves in the foot. Unless I'm missing something, would this not also affect using a mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through path component) -- or does MS_MOVE cause a rewalk in a way that rename does not? I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I thought that bind-mounts would be an issue but you also get -EXDEV when trying to rename across bind-mounts even if they are on the same underlying filesystem). But AT_BENEATH might be a more bitter pill to swallow. I'm not sure. In the usecase of container runtimes, we wouldn't generally be doing resolution of attacker-controlled paths but it still definitely doesn't hurt to consider this part of the threat model -- to avoid foot-gunning as you've said. (There also might be some nested-container cases where you might want to do that.) > > Currently most container runtimes try to do this resolution in > > userspace[1], causing many potential race conditions. In addition, the > > "obvious" alternative (actually performing a {ch,pivot_}root(2)) > > requires a fork+exec which is *very* costly if necessary for every > > filesystem operation involving a container. > > Wait. fork() I understand, but why exec? And actually, you don't need > a full fork() either, clone() lets you do this with some process parts > shared. And then you also shouldn't need to use SCM_RIGHTS, just keep > the file descriptor table shared. And why chroot()/pivot_root(), > wouldn't you want to use setns()? You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK. (It should be noted that multi-threaded C runtimes have somewhat similar issues -- AFAIK you can technically only use AS-Safe glibc functions after a fork() but that's more of a theoretical concern here. If you just use raw syscalls there isn't an issue.) As for why use setns() rather than pivot_root(), there are cases where you're operating on a container's image without a running container (think image extraction or snapshotting tools). In those cases, you would need to set up a dummy container process in order to setns() into its namespaces. You are right that setns() would be a better option if you want the truthful state of what mounts the container sees. [I also don't like the idea of joining the user namespace of a malicious container unless it's necessary but that's probably just needless paranoia more than anything -- since you're not joining the pidns you aren't trivially addressable by a malicious container.] > // Ensure that we are non-dumpable. Together with > // commit bfedb589252c, this ensures that container root > // can't trace our child once it enters the container. > // My patch > // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net/ > // would make this unnecessary, but that patch didn't > // land because Eric nacked it (for political reasons, > // because people incorrectly claimed that this was a > // security fix): Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks"). If you join a user namespace then processes within that user namespace won't have ptrace_may_access() permissions because your mm is owned by an ancestor user namespace -- only after exec() will you be traceable. We still use PR_SET_DUMPABLE in runc but that's because we support older kernels (and people don't use user namespaces under Docker) but with user namespaces this should not be required anymore.
On 2018-09-29, Andy Lutomirski <luto@amacapital.net> wrote: > >> On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > >> The primary motivation for the need for this flag is container runtimes > >> which have to interact with malicious root filesystems in the host > >> namespaces. One of the first requirements for a container runtime to be > >> secure against a malicious rootfs is that they correctly scope symlinks > >> (that is, they should be scoped as though they are chroot(2)ed into the > >> container's rootfs) and ".."-style paths. The already-existing AT_XDEV > >> and AT_NO_PROCLINKS help defend against other potential attacks in a > >> malicious rootfs scenario. > > > > So, I really like the concept for patch 1 of this series (but haven't > > read the code yet); but I dislike this patch because of its footgun > > potential. > > > > The code could do it differently: do the path walk and then, before > accepting the result, walk back up and make sure the result is under > the starting point. > > This is *not* a full solution, though, since a walk above the root gas > side effects on timing, various caches, and possibly network traffic, > so it’s open to Spectre-like attacks in which a malicious container > could use a runtime-initiated AT_THIS_ROOT to infer the existence of > directories outside the container. I think that one way to solve this problem might be to have more strict checks on nd->root in follow_dotdot(). The problem here (as far as I can tell) is that ".." could end up skipping past the root because of a rename, however walking *down* into a path shouldn't be a problem (even absolute symlinks shouldn't be a problem because they will nd_jump_root and will land back in the root). However, I'm not entirely sure what happens to nd->root if it gets renamed -- can you still safely do checks against it (we'd need to do some sort of is_descendant() check on the current path before we handle ".." in follow_dotdot). That way, we wouldn't shouldn't have the spectre-like attack problem (since the attack would be halted at the ".." stage -- before the path walk can proceed into host paths). Would this be sufficient or is there a more serious issue I'm missing? > But what’s the container usecase? Any sane container is based on > pivot_root or similar, so the runtime can just do the walk in the > container context. IOW I’m a bit confused as to the exact intended use > of the whole series. Can you elaborate? I went into this in my response to Jann.
On Mon, Oct 1, 2018 at 7:44 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2018-09-29, Jann Horn <jannh@google.com> wrote: > > The problem is what happens if a folder you are walking through is > > concurrently moved out of the chroot. Consider the following scenario: > > > > You attempt to open "C/../../etc/passwd" under the root "/A/B". > > Something else concurrently moves /A/B/C to /A/C. This can result in > > the following: > > > > 1. You start the path walk and reach /A/B/C. > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > > 3. Your path walk follows the first ".." up into /A. This is outside > > the process root, but you never actually encountered the process root, > > so you don't notice. > > 4. Your path walk follows the second ".." up to /. Again, this is > > outside the process root, but you don't notice. > > 5. Your path walk walks down to /etc/passwd, and the open completes > > successfully. You now have an fd pointing outside your chroot. > > > > If the root of your walk is below an attacker-controlled directory, > > this of course means that you lose instantly. If you point the root of > > the walk at a directory out of which a process in the container > > wouldn't be able to move the file, you're probably kinda mostly fine - > > as long as you know, for certain, that nothing else on the system > > would ever do that. But I still wouldn't feel good about that. > > Please correct me if I'm wrong here (this is the first patch I've > written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this > -- or does that only handle if a particular path component changes > *while* it's being walked through? Eric Biederman should be able to talk about all this much better than me, but as far as I know, the LOOKUP_REVAL path is only for dealing with some special filesystems like procfs. > Is it possible for a path walk to > succeed after a path component was unmounted (obviously you can't delete > a directory path component since you'd get -ENOTEMPTY)? I don't think so, but I'm not exactly a VFS expert. > If this is an issue for AT_THIS_ROOT, I believe this might also be an > issue for AT_BENEATH since they are effectively both using the same > nd->root trick (so you could similarly trick AT_BENEATH to not error > out). So we'd need to figure out how to solve this problem in order for > AT_BENEATH to be safe. Oh, wait, what? I think I didn't notice that the semantics of AT_BENEATH changed like that since the original posting of David Drysdale's O_BENEATH_ONLY patch (https://lore.kernel.org/lkml/1439458366-8223-2-git-send-email-drysdale@google.com/). David's patch had nice, straightforward semantics, blocking any form of upwards traversal. Why was that changed? Does anyone actually want to use paths that contain ".." with AT_BENEATH? I would strongly prefer something that blocks any use of "..". @Al: It looks like this already changed back when you posted https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/ ? > Speaking naively, doesn't it make sense to invalidate the walk if a path > component was modified? Or is this something that would be far too > costly with little benefit? What if we do more aggressive nd->root > checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != > current->mnt_ns->root)? It seems to me like doing that would basically require looking at each node in the path walk twice? And it'd be difficult to guarantee forward progress unless you're willing to do some fairly heavy locking. > Regarding chroot attacks, I was aware of the trivial > chroot-open-chroot-fchdir attack but I was not aware that there was a > rename attack for chroot. Thanks for bringing this up! > > > I believe that the only way to robustly use this would be to point the > > dirfd at a mount point, such that you know that being moved out of the > > chroot is impossible because the mount point limits movement of > > directories under it. (Well, technically, it doesn't, but it ensures > > that if a directory does dangerously move away, the syscall fails.) It > > might make sense to hardcode this constraint in the implementation of > > AT_THIS_ROOT, to keep people from shooting themselves in the foot. > > Unless I'm missing something, would this not also affect using a > mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through > path component) -- or does MS_MOVE cause a rewalk in a way that rename > does not? Hmm. Good point. It looks to me like you probably wouldn't be able to walk up through a mountpoint in RCU mode after the mount hierarchy has changed (see follow_dotdot_rcu()), but it might work in refwalk mode. Eric? > I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I > thought that bind-mounts would be an issue but you also get -EXDEV when > trying to rename across bind-mounts even if they are on the same > underlying filesystem). But AT_BENEATH might be a more bitter pill to > swallow. I'm not sure. Which is part of why I strongly prefer the semantics from David Drysdale's O_BENEATH_ONLY. > In the usecase of container runtimes, we wouldn't generally be doing > resolution of attacker-controlled paths but it still definitely doesn't > hurt to consider this part of the threat model -- to avoid foot-gunning > as you've said. (There also might be some nested-container cases where > you might want to do that.) > > > > Currently most container runtimes try to do this resolution in > > > userspace[1], causing many potential race conditions. In addition, the > > > "obvious" alternative (actually performing a {ch,pivot_}root(2)) > > > requires a fork+exec which is *very* costly if necessary for every > > > filesystem operation involving a container. > > > > Wait. fork() I understand, but why exec? And actually, you don't need > > a full fork() either, clone() lets you do this with some process parts > > shared. And then you also shouldn't need to use SCM_RIGHTS, just keep > > the file descriptor table shared. And why chroot()/pivot_root(), > > wouldn't you want to use setns()? > > You're right about this -- for C runtimes. In Go we cannot do a raw > clone() or fork() (if you do it manually with RawSyscall you'll end with > broken runtime state). So you're forced to do fork+exec (which then > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes > for CLONE_VFORK. If you insist on implementing every last bit of your code in Go, I guess. > (It should be noted that multi-threaded C runtimes have somewhat similar > issues -- AFAIK you can technically only use AS-Safe glibc functions > after a fork() but that's more of a theoretical concern here. If you > just use raw syscalls there isn't an issue.) > > As for why use setns() rather than pivot_root(), there are cases where > you're operating on a container's image without a running container > (think image extraction or snapshotting tools). In those cases, you > would need to set up a dummy container process in order to setns() into > its namespaces. You are right that setns() would be a better option if > you want the truthful state of what mounts the container sees. > > [I also don't like the idea of joining the user namespace of a malicious > container unless it's necessary but that's probably just needless > paranoia more than anything -- since you're not joining the pidns you > aren't trivially addressable by a malicious container.] > > > // Ensure that we are non-dumpable. Together with > > // commit bfedb589252c, this ensures that container root > > // can't trace our child once it enters the container. > > // My patch > > // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net/ > > // would make this unnecessary, but that patch didn't > > // land because Eric nacked it (for political reasons, > > // because people incorrectly claimed that this was a > > // security fix): > > Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a > user_ns owner to mm_struct and fix ptrace permission checks"). If you > join a user namespace then processes within that user namespace won't > have ptrace_may_access() permissions because your mm is owned by an > ancestor user namespace -- only after exec() will you be traceable. Nope. The code added in bfedb589252c only applies if `get_dumpable(mm) != SUID_DUMP_USER`. Looking at the current version, you can see that `ptrace_has_cap(tcred->user_ns, mode)` is enough to ptrace a process that is not nondumpable. > We still use PR_SET_DUMPABLE in runc but that's because we support older > kernels (and people don't use user namespaces under Docker) but with > user namespaces this should not be required anymore.
On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote: > On 2018-09-29, Jann Horn <jannh@google.com> wrote: > > The problem is what happens if a folder you are walking through is > > concurrently moved out of the chroot. Consider the following scenario: > > > > You attempt to open "C/../../etc/passwd" under the root "/A/B". > > Something else concurrently moves /A/B/C to /A/C. This can result in > > the following: > > > > 1. You start the path walk and reach /A/B/C. > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > > 3. Your path walk follows the first ".." up into /A. This is outside > > the process root, but you never actually encountered the process root, > > so you don't notice. > > 4. Your path walk follows the second ".." up to /. Again, this is > > outside the process root, but you don't notice. > > 5. Your path walk walks down to /etc/passwd, and the open completes > > successfully. You now have an fd pointing outside your chroot. > > > > If the root of your walk is below an attacker-controlled directory, > > this of course means that you lose instantly. If you point the root of > > the walk at a directory out of which a process in the container > > wouldn't be able to move the file, you're probably kinda mostly fine - > > as long as you know, for certain, that nothing else on the system > > would ever do that. But I still wouldn't feel good about that. > > Please correct me if I'm wrong here (this is the first patch I've > written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this > -- or does that only handle if a particular path component changes > *while* it's being walked through? Is it possible for a path walk to > succeed after a path component was unmounted (obviously you can't delete > a directory path component since you'd get -ENOTEMPTY)? > > If this is an issue for AT_THIS_ROOT, I believe this might also be an > issue for AT_BENEATH since they are effectively both using the same > nd->root trick (so you could similarly trick AT_BENEATH to not error > out). So we'd need to figure out how to solve this problem in order for > AT_BENEATH to be safe. > > Speaking naively, doesn't it make sense to invalidate the walk if a path > component was modified? Or is this something that would be far too > costly with little benefit? What if we do more aggressive nd->root > checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != > current->mnt_ns->root)? > > Regarding chroot attacks, I was aware of the trivial > chroot-open-chroot-fchdir attack but I was not aware that there was a > rename attack for chroot. Thanks for bringing this up! > > > I believe that the only way to robustly use this would be to point the > > dirfd at a mount point, such that you know that being moved out of the > > chroot is impossible because the mount point limits movement of > > directories under it. (Well, technically, it doesn't, but it ensures > > that if a directory does dangerously move away, the syscall fails.) It > > might make sense to hardcode this constraint in the implementation of > > AT_THIS_ROOT, to keep people from shooting themselves in the foot. > > Unless I'm missing something, would this not also affect using a > mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through > path component) -- or does MS_MOVE cause a rewalk in a way that rename > does not? > > I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I > thought that bind-mounts would be an issue but you also get -EXDEV when > trying to rename across bind-mounts even if they are on the same > underlying filesystem). But AT_BENEATH might be a more bitter pill to > swallow. I'm not sure. > > In the usecase of container runtimes, we wouldn't generally be doing > resolution of attacker-controlled paths but it still definitely doesn't > hurt to consider this part of the threat model -- to avoid foot-gunning > as you've said. (There also might be some nested-container cases where > you might want to do that.) > > > > Currently most container runtimes try to do this resolution in > > > userspace[1], causing many potential race conditions. In addition, the > > > "obvious" alternative (actually performing a {ch,pivot_}root(2)) > > > requires a fork+exec which is *very* costly if necessary for every > > > filesystem operation involving a container. > > > > Wait. fork() I understand, but why exec? And actually, you don't need > > a full fork() either, clone() lets you do this with some process parts > > shared. And then you also shouldn't need to use SCM_RIGHTS, just keep > > the file descriptor table shared. And why chroot()/pivot_root(), > > wouldn't you want to use setns()? > > You're right about this -- for C runtimes. In Go we cannot do a raw > clone() or fork() (if you do it manually with RawSyscall you'll end with > broken runtime state). So you're forced to do fork+exec (which then > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes > for CLONE_VFORK. > > (It should be noted that multi-threaded C runtimes have somewhat similar > issues -- AFAIK you can technically only use AS-Safe glibc functions > after a fork() but that's more of a theoretical concern here. If you > just use raw syscalls there isn't an issue.) > > As for why use setns() rather than pivot_root(), there are cases where > you're operating on a container's image without a running container > (think image extraction or snapshotting tools). In those cases, you > would need to set up a dummy container process in order to setns() into > its namespaces. You are right that setns() would be a better option if > you want the truthful state of what mounts the container sees. > > [I also don't like the idea of joining the user namespace of a malicious > container unless it's necessary but that's probably just needless > paranoia more than anything -- since you're not joining the pidns you > aren't trivially addressable by a malicious container.] > > > // Ensure that we are non-dumpable. Together with > > // commit bfedb589252c, this ensures that container root > > // can't trace our child once it enters the container. > > // My patch > > // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net/ > > // would make this unnecessary, but that patch didn't > > // land because Eric nacked it (for political reasons, > > // because people incorrectly claimed that this was a > > // security fix): > > Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a > user_ns owner to mm_struct and fix ptrace permission checks"). If you > join a user namespace then processes within that user namespace won't > have ptrace_may_access() permissions because your mm is owned by an > ancestor user namespace -- only after exec() will you be traceable. That is not _completely_ true. Iirc (Please someone do yell at me if I'm wrong!), this is as follows. You will in fact be dumpable as long as you don't set{g,u}id() to an effective uid that is different from the effective uid of the process that created the task. For example, if you clone(CLONE_NEWUSER) as an unprivileged user with uid and euid 1000 you are in fact dumpable and thus traceable *but* if you do a setuid(0) in the new task then you will end up with old->euid = 1000 and new->euid = 0 at which point the kernel will remove the dumpable flag and the creating process cannot trace you anymore (which has funny consequences for lsm isolation and sending fds around). Iiuc, The same logic applies when you do a setns() to another user namespace. /* dumpability changes */ if (!uid_eq(old->euid, new->euid) || !gid_eq(old->egid, new->egid) || !uid_eq(old->fsuid, new->fsuid) || !gid_eq(old->fsgid, new->fsgid) || !cred_cap_issubset(old, new)) { if (task->mm) set_dumpable(task->mm, suid_dumpable); <<< suid_dumpable == 0 at this point task->pdeath_signal = 0; smp_wmb(); } > > We still use PR_SET_DUMPABLE in runc but that's because we support older > kernels (and people don't use user namespaces under Docker) but with > user namespaces this should not be required anymore. > > -- > Aleksa Sarai > Senior Software Engineer (Containers) > SUSE Linux GmbH > <https://www.cyphar.com/>
On Mon, Oct 1, 2018 at 12:42 PM Christian Brauner <christian@brauner.io> wrote: > On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote: > > On 2018-09-29, Jann Horn <jannh@google.com> wrote: > > > The problem is what happens if a folder you are walking through is > > > concurrently moved out of the chroot. Consider the following scenario: > > > > > > You attempt to open "C/../../etc/passwd" under the root "/A/B". > > > Something else concurrently moves /A/B/C to /A/C. This can result in > > > the following: > > > > > > 1. You start the path walk and reach /A/B/C. > > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > > > 3. Your path walk follows the first ".." up into /A. This is outside > > > the process root, but you never actually encountered the process root, > > > so you don't notice. > > > 4. Your path walk follows the second ".." up to /. Again, this is > > > outside the process root, but you don't notice. > > > 5. Your path walk walks down to /etc/passwd, and the open completes > > > successfully. You now have an fd pointing outside your chroot. > > > > > > If the root of your walk is below an attacker-controlled directory, > > > this of course means that you lose instantly. If you point the root of > > > the walk at a directory out of which a process in the container > > > wouldn't be able to move the file, you're probably kinda mostly fine - > > > as long as you know, for certain, that nothing else on the system > > > would ever do that. But I still wouldn't feel good about that. > > > > Please correct me if I'm wrong here (this is the first patch I've > > written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this > > -- or does that only handle if a particular path component changes > > *while* it's being walked through? Is it possible for a path walk to > > succeed after a path component was unmounted (obviously you can't delete > > a directory path component since you'd get -ENOTEMPTY)? > > > > If this is an issue for AT_THIS_ROOT, I believe this might also be an > > issue for AT_BENEATH since they are effectively both using the same > > nd->root trick (so you could similarly trick AT_BENEATH to not error > > out). So we'd need to figure out how to solve this problem in order for > > AT_BENEATH to be safe. > > > > Speaking naively, doesn't it make sense to invalidate the walk if a path > > component was modified? Or is this something that would be far too > > costly with little benefit? What if we do more aggressive nd->root > > checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != > > current->mnt_ns->root)? > > > > Regarding chroot attacks, I was aware of the trivial > > chroot-open-chroot-fchdir attack but I was not aware that there was a > > rename attack for chroot. Thanks for bringing this up! > > > > > I believe that the only way to robustly use this would be to point the > > > dirfd at a mount point, such that you know that being moved out of the > > > chroot is impossible because the mount point limits movement of > > > directories under it. (Well, technically, it doesn't, but it ensures > > > that if a directory does dangerously move away, the syscall fails.) It > > > might make sense to hardcode this constraint in the implementation of > > > AT_THIS_ROOT, to keep people from shooting themselves in the foot. > > > > Unless I'm missing something, would this not also affect using a > > mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through > > path component) -- or does MS_MOVE cause a rewalk in a way that rename > > does not? > > > > I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I > > thought that bind-mounts would be an issue but you also get -EXDEV when > > trying to rename across bind-mounts even if they are on the same > > underlying filesystem). But AT_BENEATH might be a more bitter pill to > > swallow. I'm not sure. > > > > In the usecase of container runtimes, we wouldn't generally be doing > > resolution of attacker-controlled paths but it still definitely doesn't > > hurt to consider this part of the threat model -- to avoid foot-gunning > > as you've said. (There also might be some nested-container cases where > > you might want to do that.) > > > > > > Currently most container runtimes try to do this resolution in > > > > userspace[1], causing many potential race conditions. In addition, the > > > > "obvious" alternative (actually performing a {ch,pivot_}root(2)) > > > > requires a fork+exec which is *very* costly if necessary for every > > > > filesystem operation involving a container. > > > > > > Wait. fork() I understand, but why exec? And actually, you don't need > > > a full fork() either, clone() lets you do this with some process parts > > > shared. And then you also shouldn't need to use SCM_RIGHTS, just keep > > > the file descriptor table shared. And why chroot()/pivot_root(), > > > wouldn't you want to use setns()? > > > > You're right about this -- for C runtimes. In Go we cannot do a raw > > clone() or fork() (if you do it manually with RawSyscall you'll end with > > broken runtime state). So you're forced to do fork+exec (which then > > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes > > for CLONE_VFORK. > > > > (It should be noted that multi-threaded C runtimes have somewhat similar > > issues -- AFAIK you can technically only use AS-Safe glibc functions > > after a fork() but that's more of a theoretical concern here. If you > > just use raw syscalls there isn't an issue.) > > > > As for why use setns() rather than pivot_root(), there are cases where > > you're operating on a container's image without a running container > > (think image extraction or snapshotting tools). In those cases, you > > would need to set up a dummy container process in order to setns() into > > its namespaces. You are right that setns() would be a better option if > > you want the truthful state of what mounts the container sees. > > > > [I also don't like the idea of joining the user namespace of a malicious > > container unless it's necessary but that's probably just needless > > paranoia more than anything -- since you're not joining the pidns you > > aren't trivially addressable by a malicious container.] > > > > > // Ensure that we are non-dumpable. Together with > > > // commit bfedb589252c, this ensures that container root > > > // can't trace our child once it enters the container. > > > // My patch > > > // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net/ > > > // would make this unnecessary, but that patch didn't > > > // land because Eric nacked it (for political reasons, > > > // because people incorrectly claimed that this was a > > > // security fix): > > > > Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a > > user_ns owner to mm_struct and fix ptrace permission checks"). If you > > join a user namespace then processes within that user namespace won't > > have ptrace_may_access() permissions because your mm is owned by an > > ancestor user namespace -- only after exec() will you be traceable. > > That is not _completely_ true. > Iirc (Please someone do yell at me if I'm wrong!), this is as follows. > You will in fact be dumpable as long as you don't set{g,u}id() to an > effective uid that is different from the effective uid of the process > that created the task. For example, if you clone(CLONE_NEWUSER) as an > unprivileged user with uid and euid 1000 you are in fact dumpable and > thus traceable *but* if you do a setuid(0) in the new task then you will > end up with old->euid = 1000 and new->euid = 0 at which point the kernel > will remove the dumpable flag and the creating process cannot trace you > anymore (which has funny consequences for lsm isolation and sending fds > around). Iiuc, The same logic applies when you do a setns() to another > user namespace. (Note that this is only true if your un-namespaced UID actually changes. If you create a user namespace and then write to its uid_map such that your namespaced UID is zero, that won't trigger this logic.)
On Mon, Oct 01, 2018 at 01:29:16PM +0200, Jann Horn wrote: > On Mon, Oct 1, 2018 at 12:42 PM Christian Brauner <christian@brauner.io> wrote: > > On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote: > > > On 2018-09-29, Jann Horn <jannh@google.com> wrote: > > > > The problem is what happens if a folder you are walking through is > > > > concurrently moved out of the chroot. Consider the following scenario: > > > > > > > > You attempt to open "C/../../etc/passwd" under the root "/A/B". > > > > Something else concurrently moves /A/B/C to /A/C. This can result in > > > > the following: > > > > > > > > 1. You start the path walk and reach /A/B/C. > > > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > > > > 3. Your path walk follows the first ".." up into /A. This is outside > > > > the process root, but you never actually encountered the process root, > > > > so you don't notice. > > > > 4. Your path walk follows the second ".." up to /. Again, this is > > > > outside the process root, but you don't notice. > > > > 5. Your path walk walks down to /etc/passwd, and the open completes > > > > successfully. You now have an fd pointing outside your chroot. > > > > > > > > If the root of your walk is below an attacker-controlled directory, > > > > this of course means that you lose instantly. If you point the root of > > > > the walk at a directory out of which a process in the container > > > > wouldn't be able to move the file, you're probably kinda mostly fine - > > > > as long as you know, for certain, that nothing else on the system > > > > would ever do that. But I still wouldn't feel good about that. > > > > > > Please correct me if I'm wrong here (this is the first patch I've > > > written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this > > > -- or does that only handle if a particular path component changes > > > *while* it's being walked through? Is it possible for a path walk to > > > succeed after a path component was unmounted (obviously you can't delete > > > a directory path component since you'd get -ENOTEMPTY)? > > > > > > If this is an issue for AT_THIS_ROOT, I believe this might also be an > > > issue for AT_BENEATH since they are effectively both using the same > > > nd->root trick (so you could similarly trick AT_BENEATH to not error > > > out). So we'd need to figure out how to solve this problem in order for > > > AT_BENEATH to be safe. > > > > > > Speaking naively, doesn't it make sense to invalidate the walk if a path > > > component was modified? Or is this something that would be far too > > > costly with little benefit? What if we do more aggressive nd->root > > > checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != > > > current->mnt_ns->root)? > > > > > > Regarding chroot attacks, I was aware of the trivial > > > chroot-open-chroot-fchdir attack but I was not aware that there was a > > > rename attack for chroot. Thanks for bringing this up! > > > > > > > I believe that the only way to robustly use this would be to point the > > > > dirfd at a mount point, such that you know that being moved out of the > > > > chroot is impossible because the mount point limits movement of > > > > directories under it. (Well, technically, it doesn't, but it ensures > > > > that if a directory does dangerously move away, the syscall fails.) It > > > > might make sense to hardcode this constraint in the implementation of > > > > AT_THIS_ROOT, to keep people from shooting themselves in the foot. > > > > > > Unless I'm missing something, would this not also affect using a > > > mountpoint as a dirfd-root (with MS_MOVE of an already-walked-through > > > path component) -- or does MS_MOVE cause a rewalk in a way that rename > > > does not? > > > > > > I wouldn't mind tying AT_THIS_ROOT to only work on mountpoints (I > > > thought that bind-mounts would be an issue but you also get -EXDEV when > > > trying to rename across bind-mounts even if they are on the same > > > underlying filesystem). But AT_BENEATH might be a more bitter pill to > > > swallow. I'm not sure. > > > > > > In the usecase of container runtimes, we wouldn't generally be doing > > > resolution of attacker-controlled paths but it still definitely doesn't > > > hurt to consider this part of the threat model -- to avoid foot-gunning > > > as you've said. (There also might be some nested-container cases where > > > you might want to do that.) > > > > > > > > Currently most container runtimes try to do this resolution in > > > > > userspace[1], causing many potential race conditions. In addition, the > > > > > "obvious" alternative (actually performing a {ch,pivot_}root(2)) > > > > > requires a fork+exec which is *very* costly if necessary for every > > > > > filesystem operation involving a container. > > > > > > > > Wait. fork() I understand, but why exec? And actually, you don't need > > > > a full fork() either, clone() lets you do this with some process parts > > > > shared. And then you also shouldn't need to use SCM_RIGHTS, just keep > > > > the file descriptor table shared. And why chroot()/pivot_root(), > > > > wouldn't you want to use setns()? > > > > > > You're right about this -- for C runtimes. In Go we cannot do a raw > > > clone() or fork() (if you do it manually with RawSyscall you'll end with > > > broken runtime state). So you're forced to do fork+exec (which then > > > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes > > > for CLONE_VFORK. > > > > > > (It should be noted that multi-threaded C runtimes have somewhat similar > > > issues -- AFAIK you can technically only use AS-Safe glibc functions > > > after a fork() but that's more of a theoretical concern here. If you > > > just use raw syscalls there isn't an issue.) > > > > > > As for why use setns() rather than pivot_root(), there are cases where > > > you're operating on a container's image without a running container > > > (think image extraction or snapshotting tools). In those cases, you > > > would need to set up a dummy container process in order to setns() into > > > its namespaces. You are right that setns() would be a better option if > > > you want the truthful state of what mounts the container sees. > > > > > > [I also don't like the idea of joining the user namespace of a malicious > > > container unless it's necessary but that's probably just needless > > > paranoia more than anything -- since you're not joining the pidns you > > > aren't trivially addressable by a malicious container.] > > > > > > > // Ensure that we are non-dumpable. Together with > > > > // commit bfedb589252c, this ensures that container root > > > > // can't trace our child once it enters the container. > > > > // My patch > > > > // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net/ > > > > // would make this unnecessary, but that patch didn't > > > > // land because Eric nacked it (for political reasons, > > > > // because people incorrectly claimed that this was a > > > > // security fix): > > > > > > Unless I'm very much mistaken this was fixed by bfedb589252c ("mm: Add a > > > user_ns owner to mm_struct and fix ptrace permission checks"). If you > > > join a user namespace then processes within that user namespace won't > > > have ptrace_may_access() permissions because your mm is owned by an > > > ancestor user namespace -- only after exec() will you be traceable. > > > > That is not _completely_ true. > > Iirc (Please someone do yell at me if I'm wrong!), this is as follows. > > You will in fact be dumpable as long as you don't set{g,u}id() to an > > effective uid that is different from the effective uid of the process > > that created the task. For example, if you clone(CLONE_NEWUSER) as an > > unprivileged user with uid and euid 1000 you are in fact dumpable and > > thus traceable *but* if you do a setuid(0) in the new task then you will > > end up with old->euid = 1000 and new->euid = 0 at which point the kernel > > will remove the dumpable flag and the creating process cannot trace you > > anymore (which has funny consequences for lsm isolation and sending fds > > around). Iiuc, The same logic applies when you do a setns() to another > > user namespace. > > (Note that this is only true if your un-namespaced UID actually > changes. If you create a user namespace and then write to its uid_map > such that your namespaced UID is zero, that won't trigger this logic.) The way I figured this works is that it actually only applies to the case where you're creating a user namespace as an unprivileged user. And even in that case you will retain dumpability in two cases: 1. mapping userns id 0 to <my-unpriv-host-uid> and any identity mapping, i.e. 2. mapping <my-unpriv-host-uid> to <my-unpriv-host-uid> If you're creating a user namespace as root and set up mappings you should always retain the dumpable flag (I might not be remembering all corner-cases atm.) if you don't muck with capability sets and so on.
On Mon, Oct 01, 2018 at 03:44:28PM +1000, Aleksa Sarai wrote: > On 2018-09-29, Jann Horn <jannh@google.com> wrote: > > The problem is what happens if a folder you are walking through is > > concurrently moved out of the chroot. Consider the following scenario: > > > > You attempt to open "C/../../etc/passwd" under the root "/A/B". > > Something else concurrently moves /A/B/C to /A/C. This can result in > > the following: > > > > 1. You start the path walk and reach /A/B/C. > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > > 3. Your path walk follows the first ".." up into /A. This is outside > > the process root, but you never actually encountered the process root, > > so you don't notice. > > 4. Your path walk follows the second ".." up to /. Again, this is > > outside the process root, but you don't notice. > > 5. Your path walk walks down to /etc/passwd, and the open completes > > successfully. You now have an fd pointing outside your chroot. > > > > If the root of your walk is below an attacker-controlled directory, > > this of course means that you lose instantly. If you point the root of > > the walk at a directory out of which a process in the container > > wouldn't be able to move the file, you're probably kinda mostly fine - > > as long as you know, for certain, that nothing else on the system > > would ever do that. But I still wouldn't feel good about that. > > Please correct me if I'm wrong here (this is the first patch I've > written for VFS). Isn't the retry/LOOKUP_REVAL code meant to handle this No. ... > Speaking naively, doesn't it make sense to invalidate the walk if a path > component was modified? Or is this something that would be far too > costly with little benefit? Lookups and renames can definitely proceed in parallel, and yes I suspect it would be difficult to get good performance and guaranteed forward progress if you required lookup of the full path to be atomic with respect to renames. --b.
On Sat, Sep 29, 2018 at 06:35:17PM +0200, Jann Horn wrote: > +cc linux-api; please keep them in CC for future versions of the patch > > On Sat, Sep 29, 2018 at 4:29 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > The primary motivation for the need for this flag is container runtimes > > which have to interact with malicious root filesystems in the host > > namespaces. One of the first requirements for a container runtime to be > > secure against a malicious rootfs is that they correctly scope symlinks > > (that is, they should be scoped as though they are chroot(2)ed into the > > container's rootfs) and ".."-style paths. The already-existing AT_XDEV > > and AT_NO_PROCLINKS help defend against other potential attacks in a > > malicious rootfs scenario. > > So, I really like the concept for patch 1 of this series (but haven't > read the code yet); but I dislike this patch because of its footgun > potential. > > If this patch landed as-is, the manpage would need some big warning > labels. chroot() basically provides no security guarantees at all; and > yes, that includes that if you do `chroot(...); chdir("/"); > open(attacker_controlled_path, ...);`, you can potentially end up > opening a file outside the chroot. See > https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-014-2015.txt > for an example, where Qubes OS did pretty much that, and ended up with > a potentially exploitable security bug because of that, where one VM, > while performing a file transfer into another VM, could write outside > of the transfer target directory. > The problem is what happens if a folder you are walking through is > concurrently moved out of the chroot. Consider the following scenario: > > You attempt to open "C/../../etc/passwd" under the root "/A/B". > Something else concurrently moves /A/B/C to /A/C. This can result in > the following: > > 1. You start the path walk and reach /A/B/C. > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > 3. Your path walk follows the first ".." up into /A. This is outside > the process root, but you never actually encountered the process root, > so you don't notice. > 4. Your path walk follows the second ".." up to /. Again, this is > outside the process root, but you don't notice. > 5. Your path walk walks down to /etc/passwd, and the open completes > successfully. You now have an fd pointing outside your chroot. > > If the root of your walk is below an attacker-controlled directory, > this of course means that you lose instantly. If you point the root of > the walk at a directory out of which a process in the container > wouldn't be able to move the file, you're probably kinda mostly fine - > as long as you know, for certain, that nothing else on the system > would ever do that. But I still wouldn't feel good about that. > > (Yes, this means that if you run an SFTP server with OpenSSH's > ChrootDirectory directive, you have to be very careful about these > things.) > > I believe that the only way to robustly use this would be to point the > dirfd at a mount point, such that you know that being moved out of the > chroot is impossible because the mount point limits movement of > directories under it. (Well, technically, it doesn't, but it ensures > that if a directory does dangerously move away, the syscall fails.) It > might make sense to hardcode this constraint in the implementation of > AT_THIS_ROOT, to keep people from shooting themselves in the foot. I'm very much in favor of dropping AT_THIS_ROOT from this patch series at least for now and only land the first patch. The first patch is something that we really want and that it seems we can find a good design for. If AT_THIS_ROOT is a feature that still makes sense we can revisit it. > > > Currently most container runtimes try to do this resolution in > > userspace[1], causing many potential race conditions. In addition, the > > "obvious" alternative (actually performing a {ch,pivot_}root(2)) > > requires a fork+exec which is *very* costly if necessary for every > > filesystem operation involving a container. > > Wait. fork() I understand, but why exec? And actually, you don't need > a full fork() either, clone() lets you do this with some process parts > shared. And then you also shouldn't need to use SCM_RIGHTS, just keep > the file descriptor table shared. And why chroot()/pivot_root(), > wouldn't you want to use setns()? I think something like this should > work (except that you should add some error handling - omitted here > because I'm lazy), assuming that the container runtime does NOT have > CAP_SYS_ADMIN in the init namespace (otherwise it's easier). Of > course, this is entirely untested, and probably won't compile because > I screwed something up. :P But you should get the idea... > > // Ensure that we are non-dumpable. Together with > // commit bfedb589252c, this ensures that container root > // can't trace our child once it enters the container. > // My patch > // https://lore.kernel.org/lkml/1451098351-8917-1-git-send-email-jann@thejh.net/ > // would make this unnecessary, but that patch didn't > // land because Eric nacked it (for political reasons, > // because people incorrectly claimed that this was a > // security fix): > // https://lore.kernel.org/lkml/8760z7fope.fsf@x220.int.ebiederm.org/ > // Note that dumpability is per-mm, not per-process, > // so this hack has the unfortunate side effect of preventing > // unprivileged debugging of the container runtime. > // Oh well. > prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE); > // Inform gcc that this particular syscall will effectively > // return twice, just like vfork() or setjmp(). > __attribute__((returns_twice)) long syscall_(long sysno, ...) = (void*)syscall; > int result_fd = -1; > // CLONE_FILES means we don't need to do fd passing, > // we share the file descriptor table. > // CLONE_VM means we don't have the cost of duplicating > // our VMAs and page tables, and we don't have to mark > // all our pagetable entries as readonly for copy-on-write. > // CLONE_VFORK is a dirty hack to avoid having to > // allocate a child stack. > // Lack of SIGCHLD means we don't want to have to wait() > // for the child. > int child_pid = syscall_(__NR_clone, CLONE_FILES|CLONE_VM|CLONE_VFORK, > 0, 0, 0, 0); > if (child_pid == 0) { > // Enter the container's user namespace; this allows us > // to afterwards join its mount namespace even if we're > // not capable in the init namespace. > // (I believe that it should be possible to change the kernel > // such that this is not required if you have set the > // no-new-privs flag.) > setns(container_user_ns_fd, CLONE_NEWUSER); > // Entering the filesystem namespace automatically > // moves us to that namespace's filesystem root. > setns(container_fs_ns_fd, CLONE_NEWNS); > result_fd = open(untrusted_container_path, ...); > syscall(__NR_exit, 0); > } > > > The most significant change in semantics with AT_THIS_ROOT is that > > *at(2) syscalls now no longer have the property that an absolute > > pathname causes the dirfd to be ignored completely (if LOOKUP_CHROOT is > > specified). The reasoning behind this is that AT_THIS_ROOT necessarily > > has to chroot-scope symlinks with absolute paths to dirfd, and so doing > > it for the base path seems to be the most consistent behaviour (and also > > avoids foot-gunning users who want to chroot-scope paths that might be > > absolute). > > > > Currently this is only enabled for the stat(2) and openat(2) family (the > > latter has its own flag O_THISROOT with the same semantics). Ideally > > this flag would be supported by all *at(2) syscalls, but this will > > require adding flags arguments to many of them (and will be done in a > > separate patchset). > > > > [1]: https://github.com/cyphar/filepath-securejoin > > > > Cc: Eric Biederman <ebiederm@xmission.com> > > Cc: Christian Brauner <christian@brauner.io> > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > > --- > > fs/fcntl.c | 2 +- > > fs/namei.c | 121 +++++++++++++++++-------------- > > fs/open.c | 2 + > > fs/stat.c | 4 +- > > include/linux/fcntl.h | 2 +- > > include/linux/namei.h | 1 + > > include/uapi/asm-generic/fcntl.h | 3 + > > include/uapi/linux/fcntl.h | 2 + > > 8 files changed, 81 insertions(+), 56 deletions(-) > > > > diff --git a/fs/fcntl.c b/fs/fcntl.c > > index e343618736f7..4c36c5b9fdb9 100644 > > --- a/fs/fcntl.c > > +++ b/fs/fcntl.c > > @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void) > > * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY > > * is defined as O_NONBLOCK on some platforms and not on others. > > */ > > - BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ != > > + BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ != > > HWEIGHT32( > > (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | > > __FMODE_EXEC | __FMODE_NONOTIFY)); > > diff --git a/fs/namei.c b/fs/namei.c > > index 757dd783771c..1b984f0dbbb4 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -2193,9 +2193,64 @@ static int link_path_walk(const char *name, struct nameidata *nd) > > } > > } > > > > +/* > > + * Configure nd->path based on the nd->dfd. This is only used as part of > > + * path_init(). > > + */ > > +static inline int dirfd_path_init(struct nameidata *nd) > > +{ > > + if (nd->dfd == AT_FDCWD) { > > + if (nd->flags & LOOKUP_RCU) { > > + struct fs_struct *fs = current->fs; > > + unsigned seq; > > + > > + do { > > + seq = read_seqcount_begin(&fs->seq); > > + nd->path = fs->pwd; > > + nd->inode = nd->path.dentry->d_inode; > > + nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); > > + } while (read_seqcount_retry(&fs->seq, seq)); > > + } else { > > + get_fs_pwd(current->fs, &nd->path); > > + nd->inode = nd->path.dentry->d_inode; > > + } > > + } else { > > + /* Caller must check execute permissions on the starting path component */ > > + struct fd f = fdget_raw(nd->dfd); > > + struct dentry *dentry; > > + > > + if (!f.file) > > + return -EBADF; > > + > > + dentry = f.file->f_path.dentry; > > + > > + if (*nd->name->name && unlikely(!d_can_lookup(dentry))) { > > + fdput(f); > > + return -ENOTDIR; > > + } > > + > > + nd->path = f.file->f_path; > > + if (nd->flags & LOOKUP_RCU) { > > + nd->inode = nd->path.dentry->d_inode; > > + nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); > > + } else { > > + path_get(&nd->path); > > + nd->inode = nd->path.dentry->d_inode; > > + } > > + fdput(f); > > + } > > + if (unlikely(nd->flags & (LOOKUP_CHROOT | LOOKUP_BENEATH))) { > > + nd->root = nd->path; > > + if (!(nd->flags & LOOKUP_RCU)) > > + path_get(&nd->root); > > + } > > + return 0; > > +} > > + > > /* must be paired with terminate_walk() */ > > static const char *path_init(struct nameidata *nd, unsigned flags) > > { > > + int error; > > const char *s = nd->name->name; > > > > if (!*s) > > @@ -2230,65 +2285,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > > nd->path.dentry = NULL; > > > > nd->m_seq = read_seqbegin(&mount_lock); > > + if (unlikely(flags & LOOKUP_CHROOT)) { > > + error = dirfd_path_init(nd); > > + if (unlikely(error)) > > + return ERR_PTR(error); > > + } > > if (*s == '/') { > > - int error; > > - set_root(nd); > > + if (likely(!nd->root.mnt)) > > + set_root(nd); > > error = nd_jump_root(nd); > > if (unlikely(error)) > > s = ERR_PTR(error); > > return s; > > - } else if (nd->dfd == AT_FDCWD) { > > - if (flags & LOOKUP_RCU) { > > - struct fs_struct *fs = current->fs; > > - unsigned seq; > > - > > - do { > > - seq = read_seqcount_begin(&fs->seq); > > - nd->path = fs->pwd; > > - nd->inode = nd->path.dentry->d_inode; > > - nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); > > - } while (read_seqcount_retry(&fs->seq, seq)); > > - } else { > > - get_fs_pwd(current->fs, &nd->path); > > - nd->inode = nd->path.dentry->d_inode; > > - } > > - if (unlikely(flags & LOOKUP_BENEATH)) { > > - nd->root = nd->path; > > - if (!(flags & LOOKUP_RCU)) > > - path_get(&nd->root); > > - } > > - return s; > > - } else { > > - /* Caller must check execute permissions on the starting path component */ > > - struct fd f = fdget_raw(nd->dfd); > > - struct dentry *dentry; > > - > > - if (!f.file) > > - return ERR_PTR(-EBADF); > > - > > - dentry = f.file->f_path.dentry; > > - > > - if (*s && unlikely(!d_can_lookup(dentry))) { > > - fdput(f); > > - return ERR_PTR(-ENOTDIR); > > - } > > - > > - nd->path = f.file->f_path; > > - if (flags & LOOKUP_RCU) { > > - nd->inode = nd->path.dentry->d_inode; > > - nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); > > - } else { > > - path_get(&nd->path); > > - nd->inode = nd->path.dentry->d_inode; > > - } > > - if (unlikely(flags & LOOKUP_BENEATH)) { > > - nd->root = nd->path; > > - if (!(flags & LOOKUP_RCU)) > > - path_get(&nd->root); > > - } > > - fdput(f); > > - return s; > > } > > + if (likely(!nd->path.mnt)) { > > + error = dirfd_path_init(nd); > > + if (unlikely(error)) > > + return ERR_PTR(error); > > + } > > + return s; > > } > > > > static const char *trailing_symlink(struct nameidata *nd) > > diff --git a/fs/open.c b/fs/open.c > > index 80f5f566a5ff..81d148f626cd 100644 > > --- a/fs/open.c > > +++ b/fs/open.c > > @@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o > > lookup_flags |= LOOKUP_NO_PROCLINKS; > > if (flags & O_NOSYMLINKS) > > lookup_flags |= LOOKUP_NO_SYMLINKS; > > + if (flags & O_THISROOT) > > + lookup_flags |= LOOKUP_CHROOT; > > op->lookup_flags = lookup_flags; > > return 0; > > } > > diff --git a/fs/stat.c b/fs/stat.c > > index 791e61b916ae..e8366e4812c3 100644 > > --- a/fs/stat.c > > +++ b/fs/stat.c > > @@ -172,7 +172,7 @@ int vfs_statx(int dfd, const char __user *filename, int flags, > > > > if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH | > > KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV | > > - AT_NO_PROCLINKS | AT_NO_SYMLINKS)) > > + AT_NO_PROCLINKS | AT_NO_SYMLINKS | AT_THIS_ROOT)) > > return -EINVAL; > > > > if (flags & AT_SYMLINK_NOFOLLOW) > > @@ -189,6 +189,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags, > > lookup_flags |= LOOKUP_NO_PROCLINKS; > > if (flags & AT_NO_SYMLINKS) > > lookup_flags |= LOOKUP_NO_SYMLINKS; > > + if (flags & AT_THIS_ROOT) > > + lookup_flags |= LOOKUP_CHROOT; > > > > retry: > > error = user_path_at(dfd, filename, lookup_flags, &path); > > diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h > > index ad5bba4b5b12..95480cd4c09d 100644 > > --- a/include/linux/fcntl.h > > +++ b/include/linux/fcntl.h > > @@ -10,7 +10,7 @@ > > O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ > > FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ > > O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \ > > - O_NOPROCLINKS | O_NOSYMLINKS) > > + O_NOPROCLINKS | O_NOSYMLINKS | O_THISROOT) > > > > #ifndef force_o_largefile > > #define force_o_largefile() (BITS_PER_LONG != 32) > > diff --git a/include/linux/namei.h b/include/linux/namei.h > > index 5ff7f3362d1b..7ec9e2d84649 100644 > > --- a/include/linux/namei.h > > +++ b/include/linux/namei.h > > @@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; > > #define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ > > #define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. > > Implies LOOKUP_NO_PROCLINKS. */ > > +#define LOOKUP_CHROOT 0x100000 /* Treat dirfd as %current->fs->root. */ > > > > extern int path_pts(struct path *path); > > > > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h > > index c2bf5983e46a..11206b0e927c 100644 > > --- a/include/uapi/asm-generic/fcntl.h > > +++ b/include/uapi/asm-generic/fcntl.h > > @@ -113,6 +113,9 @@ > > #ifndef O_NOSYMLINKS > > #define O_NOSYMLINKS 01000000000 > > #endif > > +#ifndef O_THISROOT > > +#define O_THISROOT 02000000000 > > +#endif > > > > #define F_DUPFD 0 /* dup */ > > #define F_GETFD 1 /* get close_on_exec */ > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > > index 551a9e2166a8..ea978457b68f 100644 > > --- a/include/uapi/linux/fcntl.h > > +++ b/include/uapi/linux/fcntl.h > > @@ -99,6 +99,8 @@ > > #define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */ > > #define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*. > > Implies AT_NO_PROCLINKS. */ > > +#define AT_THIS_ROOT 0x100000 /* Path resolution acts as though > > + it is chroot-ed into dirfd. */ > > > > > > #endif /* _UAPI_LINUX_FCNTL_H */ > > -- > > 2.19.0 > > > >
>>> Currently most container runtimes try to do this resolution in >>> userspace[1], causing many potential race conditions. In addition, the >>> "obvious" alternative (actually performing a {ch,pivot_}root(2)) >>> requires a fork+exec which is *very* costly if necessary for every >>> filesystem operation involving a container. >> >> Wait. fork() I understand, but why exec? And actually, you don't need >> a full fork() either, clone() lets you do this with some process parts >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep >> the file descriptor table shared. And why chroot()/pivot_root(), >> wouldn't you want to use setns()? > > You're right about this -- for C runtimes. In Go we cannot do a raw > clone() or fork() (if you do it manually with RawSyscall you'll end with > broken runtime state). So you're forced to do fork+exec (which then > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes > for CLONE_VFORK. I must admit that I’m not very sympathetic to the argument that “Go’s runtime model is incompatible with the simpler solution.” Anyway, it occurs to me that the real problem is that setns() and chroot() are both overkill for this use case. What’s needed is to start your walk from /proc/pid-in-container/root, with two twists: 1. Do the walk as though rooted at a directory. This is basically just your AT_THIS_ROOT, but the footgun is avoided because the dirfd you use is from a foreign namespace, and, except for symlinks to absolute paths, no amount of .. racing is going to escape the *namespace*. 2. Avoid /proc. It’s not just the *links* — you really don’t want to walk into /proc/self. *Maybe* procfs is already careful enough when mounted in a namespace?
On 2018-10-01, Jann Horn <jannh@google.com> wrote: > > If this is an issue for AT_THIS_ROOT, I believe this might also be an > > issue for AT_BENEATH since they are effectively both using the same > > nd->root trick (so you could similarly trick AT_BENEATH to not error > > out). So we'd need to figure out how to solve this problem in order for > > AT_BENEATH to be safe. > > Oh, wait, what? I think I didn't notice that the semantics of > AT_BENEATH changed like that since the original posting of David > Drysdale's O_BENEATH_ONLY patch > (https://lore.kernel.org/lkml/1439458366-8223-2-git-send-email-drysdale@google.com/). > David's patch had nice, straightforward semantics, blocking any form > of upwards traversal. Why was that changed? Does anyone actually want > to use paths that contain ".." with AT_BENEATH? I would strongly > prefer something that blocks any use of "..". > > @Al: It looks like this already changed back when you posted > https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/ > ? Yes, I copied the semantics from Al's patchset. I don't know why he felt strongly that this was the best idea, but in my opinion allowing paths like "a/../b/../c" seems like it's quite useful because otherwise you wouldn't be able to operate on most distribution root filesystems (many have symlinks that have ".." components). Looking at my own (openSUSE) machine there are something like 100k such symlinks (~37% of symlinks on my machine). While I do understand that the easiest way of solving this problem is to disallow ".." entirely with AT_BENEATH, given that support ".." has utility, I would like to know whether it's actually not possible to have this work safely. > > Speaking naively, doesn't it make sense to invalidate the walk if a path > > component was modified? Or is this something that would be far too > > costly with little benefit? What if we do more aggressive nd->root > > checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != > > current->mnt_ns->root)? > > It seems to me like doing that would basically require looking at each > node in the path walk twice? And it'd be difficult to guarantee > forward progress unless you're willing to do some fairly heavy > locking. I had another idea since I wrote my previous mail -- since the issue (at least the way I understand it) is that ".." can "skip" over nd->root because of the rename, what if we had some sort of is_descendant() check within follow_dotdot()? (FWIW, ".." already has some pretty interesting "hand-over-hand" locking semantics.) This should be effectively similar to how prepend_path() deals with a path that is not reachable from @root (I'm not sure if the locking is acceptable for the namei path though). Since ".." with AT_THIS_ROOT (or AT_BENEATH) is not going to be the most common component type (and we only need to do these checks for those flags), I would think that the extra cost would not be _that_ awful. Would this work? > > You're right about this -- for C runtimes. In Go we cannot do a raw > > clone() or fork() (if you do it manually with RawSyscall you'll end with > > broken runtime state). So you're forced to do fork+exec (which then > > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes > > for CLONE_VFORK. > > If you insist on implementing every last bit of your code in Go, I guess. Fair enough, though I believe this would affect most multi-threaded programs as well (regardless of the language they're written in).
On 2018-10-01, Andy Lutomirski <luto@amacapital.net> wrote: > >>> Currently most container runtimes try to do this resolution in > >>> userspace[1], causing many potential race conditions. In addition, the > >>> "obvious" alternative (actually performing a {ch,pivot_}root(2)) > >>> requires a fork+exec which is *very* costly if necessary for every > >>> filesystem operation involving a container. > >> > >> Wait. fork() I understand, but why exec? And actually, you don't need > >> a full fork() either, clone() lets you do this with some process parts > >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep > >> the file descriptor table shared. And why chroot()/pivot_root(), > >> wouldn't you want to use setns()? > > > > You're right about this -- for C runtimes. In Go we cannot do a raw > > clone() or fork() (if you do it manually with RawSyscall you'll end with > > broken runtime state). So you're forced to do fork+exec (which then > > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes > > for CLONE_VFORK. > > I must admit that I’m not very sympathetic to the argument that “Go’s > runtime model is incompatible with the simpler solution.” Multi-threaded programs have a similar issue (though with Go it's much worse). If you fork a multi-threaded C program then you can only safely use AS-Safe glibc functions (those that are safe within a signal handler). But if you're just doing three syscalls this shouldn't be as big of a problem as Go where you can't even do said syscalls. > Anyway, it occurs to me that the real problem is that setns() and > chroot() are both overkill for this use case. I agree. My diversion to Go was to explain why it was particularly bad for cri-o/rkt/runc/Docker/etc. > What’s needed is to start your walk from /proc/pid-in-container/root, > with two twists: > > 1. Do the walk as though rooted at a directory. This is basically just > your AT_THIS_ROOT, but the footgun is avoided because the dirfd you > use is from a foreign namespace, and, except for symlinks to absolute > paths, no amount of .. racing is going to escape the *namespace*. This is quite clever and I'll admit I hadn't thought of this. This definitely fixes the ".." issue, but as you've said it won't handle absolute symlinks (which means userspace has the same races that we currently have even if you assume that you have a container process already running -- CVE-2018-15664 is one of many examples of this). (AT_THIS_ROOT using /proc/$container/root would in principle fix all of the mentioned issues -- but as I said before I'd like to see whether hardening ".." would be enough to solve the escape problem.) > 2. Avoid /proc. It’s not just the *links* — you really don’t want to > walk into /proc/self. *Maybe* procfs is already careful enough when > mounted in a namespace? I just tried it and /proc/self gives you -ENOENT. I believe this is because it does a check against the pid namespace that the procfs mount has pinned.
On Tue, Oct 2, 2018 at 12:32 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2018-10-01, Andy Lutomirski <luto@amacapital.net> wrote: > > >>> Currently most container runtimes try to do this resolution in > > >>> userspace[1], causing many potential race conditions. In addition, the > > >>> "obvious" alternative (actually performing a {ch,pivot_}root(2)) > > >>> requires a fork+exec which is *very* costly if necessary for every > > >>> filesystem operation involving a container. > > >> > > >> Wait. fork() I understand, but why exec? And actually, you don't need > > >> a full fork() either, clone() lets you do this with some process parts > > >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep > > >> the file descriptor table shared. And why chroot()/pivot_root(), > > >> wouldn't you want to use setns()? > > > > > > You're right about this -- for C runtimes. In Go we cannot do a raw > > > clone() or fork() (if you do it manually with RawSyscall you'll end with > > > broken runtime state). So you're forced to do fork+exec (which then > > > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes > > > for CLONE_VFORK. > > > > I must admit that I’m not very sympathetic to the argument that “Go’s > > runtime model is incompatible with the simpler solution.” > > Multi-threaded programs have a similar issue (though with Go it's much > worse). If you fork a multi-threaded C program then you can only safely > use AS-Safe glibc functions (those that are safe within a signal > handler). But if you're just doing three syscalls this shouldn't be as > big of a problem as Go where you can't even do said syscalls. > > > Anyway, it occurs to me that the real problem is that setns() and > > chroot() are both overkill for this use case. > > I agree. My diversion to Go was to explain why it was particularly bad > for cri-o/rkt/runc/Docker/etc. > > > What’s needed is to start your walk from /proc/pid-in-container/root, > > with two twists: > > > > 1. Do the walk as though rooted at a directory. This is basically just > > your AT_THIS_ROOT, but the footgun is avoided because the dirfd you > > use is from a foreign namespace, and, except for symlinks to absolute > > paths, no amount of .. racing is going to escape the *namespace*. > > This is quite clever and I'll admit I hadn't thought of this. This > definitely fixes the ".." issue, but as you've said it won't handle > absolute symlinks (which means userspace has the same races that we > currently have even if you assume that you have a container process > already running -- CVE-2018-15664 is one of many examples of this). > > (AT_THIS_ROOT using /proc/$container/root would in principle fix all of > the mentioned issues -- but as I said before I'd like to see whether > hardening ".." would be enough to solve the escape problem.) Hmm. Good point.
On 2018-09-29, Jann Horn <jannh@google.com> wrote: > You attempt to open "C/../../etc/passwd" under the root "/A/B". > Something else concurrently moves /A/B/C to /A/C. This can result in > the following: > > 1. You start the path walk and reach /A/B/C. > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > 3. Your path walk follows the first ".." up into /A. This is outside > the process root, but you never actually encountered the process root, > so you don't notice. > 4. Your path walk follows the second ".." up to /. Again, this is > outside the process root, but you don't notice. > 5. Your path walk walks down to /etc/passwd, and the open completes > successfully. You now have an fd pointing outside your chroot. I've been playing with this and I have the following patch, which according to my testing protects against attacks where ".." skips over nd->root. It abuses __d_path to figure out if nd->path can be resolved from nd->root (obviously a proper version of this patch would refactor __d_path so it could be used like this -- and would not return -EMULTIHOP). I've also attached my reproducer. With it, I was seeing fairly constant breakouts before this patch and after it I didn't see a single breakout after running it overnight. Obviously this is not conclusive, but I'm hoping that it can show what my idea for protecting against ".." was. Does this patch make sense? Or is there something wrong with it that I'm not seeing? --8<------------------------------------------------------------------- There is a fairly easy-to-exploit race condition with chroot(2) (and thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a path can be used to "skip over" nd->root and thus escape to the filesystem above nd->root. thread1 [attacker]: for (;;) renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); thread2 [victim]: for (;;) openat(dirb, "b/c/../../etc/shadow", O_THISROOT); With fairly significant regularity, thread2 will resolve to "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases will be detected during ".." resolution (which is the weak point of chroot(2) -- since walking *into* a subdirectory tautologically cannot result in you walking *outside* nd->root). The use of __d_path here might seem suspect, however we don't mind if a path is moved from within the chroot to outside the chroot and we incorrectly decide it is safe (because at that point we are still within the set of files which were accessible at the beginning of resolution). However, we can fail resolution on the next path component if it remains outside of the root. A path which has always been outside nd->root during resolution will never be resolveable from nd->root and thus will always be blocked. DO NOT MERGE: Currently this code returns -EMULTIHOP in this case, purely as a debugging measure (so that you can see that the protection actually does something). Obviously in the proper patch this will return -EXDEV. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- fs/namei.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 6f995e6de6b1..c8349693d47b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -53,8 +53,8 @@ * The new code replaces the old recursive symlink resolution with * an iterative one (in case of non-nested symlink chains). It does * this with calls to <fs>_follow_link(). - * As a side effect, dir_namei(), _namei() and follow_link() are now - * replaced with a single function lookup_dentry() that can handle all + * As a side effect, dir_namei(), _namei() and follow_link() are now + * replaced with a single function lookup_dentry() that can handle all * the special cases of the former code. * * With the new dcache, the pathname is stored at each inode, at least as @@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -EXDEV; break; } + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) { + char *pathbuf, *pathptr; + + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC); + if (!pathbuf) + return -ECHILD; + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); + kfree(pathbuf); + if (IS_ERR_OR_NULL(pathptr)) { + if (!pathptr) + pathptr = ERR_PTR(-EMULTIHOP); + return PTR_ERR(pathptr); + } + } if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent; @@ -1510,6 +1524,20 @@ static int follow_dotdot(struct nameidata *nd) return -EXDEV; break; } + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) { + char *pathbuf, *pathptr; + + pathbuf = kmalloc(PATH_MAX, GFP_KERNEL); + if (!pathbuf) + return -ENOMEM; + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); + kfree(pathbuf); + if (IS_ERR_OR_NULL(pathptr)) { + if (!pathptr) + pathptr = ERR_PTR(-EMULTIHOP); + return PTR_ERR(pathptr); + } + } if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret)
On Tue, Oct 02, 2018 at 02:18:33AM +1000, Aleksa Sarai wrote: > On 2018-10-01, Jann Horn <jannh@google.com> wrote: > > > If this is an issue for AT_THIS_ROOT, I believe this might also be an > > > issue for AT_BENEATH since they are effectively both using the same > > > nd->root trick (so you could similarly trick AT_BENEATH to not error > > > out). So we'd need to figure out how to solve this problem in order for > > > AT_BENEATH to be safe. > > > > Oh, wait, what? I think I didn't notice that the semantics of > > AT_BENEATH changed like that since the original posting of David > > Drysdale's O_BENEATH_ONLY patch > > (https://lore.kernel.org/lkml/1439458366-8223-2-git-send-email-drysdale@google.com/). > > David's patch had nice, straightforward semantics, blocking any form > > of upwards traversal. Why was that changed? Does anyone actually want > > to use paths that contain ".." with AT_BENEATH? I would strongly > > prefer something that blocks any use of "..". > > > > @Al: It looks like this already changed back when you posted > > https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/ > > ? > > Yes, I copied the semantics from Al's patchset. I don't know why he felt > strongly that this was the best idea, but in my opinion allowing paths > like "a/../b/../c" seems like it's quite useful because otherwise you > wouldn't be able to operate on most distribution root filesystems (many > have symlinks that have ".." components). Looking at my own (openSUSE) > machine there are something like 100k such symlinks (~37% of symlinks on > my machine). > > While I do understand that the easiest way of solving this problem is to > disallow ".." entirely with AT_BENEATH, given that support ".." has > utility, I would like to know whether it's actually not possible to have > this work safely. > > > > Speaking naively, doesn't it make sense to invalidate the walk if a path > > > component was modified? Or is this something that would be far too > > > costly with little benefit? What if we do more aggressive nd->root > > > checks when resolving with AT_BENEATH or AT_THIS_ROOT (or if nd->root != > > > current->mnt_ns->root)? > > > > It seems to me like doing that would basically require looking at each > > node in the path walk twice? And it'd be difficult to guarantee > > forward progress unless you're willing to do some fairly heavy > > locking. > > I had another idea since I wrote my previous mail -- since the issue (at > least the way I understand it) is that ".." can "skip" over nd->root > because of the rename, what if we had some sort of is_descendant() check > within follow_dotdot()? (FWIW, ".." already has some pretty interesting > "hand-over-hand" locking semantics.) This should be effectively similar > to how prepend_path() deals with a path that is not reachable from @root > (I'm not sure if the locking is acceptable for the namei path though). > > Since ".." with AT_THIS_ROOT (or AT_BENEATH) is not going to be the most > common component type (and we only need to do these checks for those > flags), I would think that the extra cost would not be _that_ awful. > > Would this work? > > > > You're right about this -- for C runtimes. In Go we cannot do a raw > > > clone() or fork() (if you do it manually with RawSyscall you'll end with > > > broken runtime state). So you're forced to do fork+exec (which then > > > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes > > > for CLONE_VFORK. > > > > If you insist on implementing every last bit of your code in Go, I guess. > > Fair enough, though I believe this would affect most multi-threaded > programs as well (regardless of the language they're written in). (Depends on whether you do any explicit locking and have atfork handlers for your locks and so on. If you do a clone syscall directly to avoid having libc running any additional atfork handlers (flushing streams etc.) it's doable though not ideal.)
On Fri, Oct 05, 2018 at 02:26:11AM +1000, Aleksa Sarai wrote: > On 2018-09-29, Jann Horn <jannh@google.com> wrote: > > You attempt to open "C/../../etc/passwd" under the root "/A/B". > > Something else concurrently moves /A/B/C to /A/C. This can result in > > the following: > > > > 1. You start the path walk and reach /A/B/C. > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > > 3. Your path walk follows the first ".." up into /A. This is outside > > the process root, but you never actually encountered the process root, > > so you don't notice. > > 4. Your path walk follows the second ".." up to /. Again, this is > > outside the process root, but you don't notice. > > 5. Your path walk walks down to /etc/passwd, and the open completes > > successfully. You now have an fd pointing outside your chroot. > > I've been playing with this and I have the following patch, which > according to my testing protects against attacks where ".." skips over > nd->root. It abuses __d_path to figure out if nd->path can be resolved > from nd->root (obviously a proper version of this patch would refactor > __d_path so it could be used like this -- and would not return > -EMULTIHOP). > > I've also attached my reproducer. With it, I was seeing fairly constant > breakouts before this patch and after it I didn't see a single breakout > after running it overnight. Obviously this is not conclusive, but I'm > hoping that it can show what my idea for protecting against ".." was. > > Does this patch make sense? Or is there something wrong with it that I'm > not seeing? Interesting. Apart from the abuse of __d_path() :) the question I'd have is whether this just minimizes the race window or if you can provide a sound argument that this actually can't happen anymore with this patch. > > --8<------------------------------------------------------------------- > > There is a fairly easy-to-exploit race condition with chroot(2) (and > thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a > path can be used to "skip over" nd->root and thus escape to the > filesystem above nd->root. > > thread1 [attacker]: > for (;;) > renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); > thread2 [victim]: > for (;;) > openat(dirb, "b/c/../../etc/shadow", O_THISROOT); > > With fairly significant regularity, thread2 will resolve to > "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases > will be detected during ".." resolution (which is the weak point of > chroot(2) -- since walking *into* a subdirectory tautologically cannot > result in you walking *outside* nd->root). > > The use of __d_path here might seem suspect, however we don't mind if a > path is moved from within the chroot to outside the chroot and we > incorrectly decide it is safe (because at that point we are still within > the set of files which were accessible at the beginning of resolution). > However, we can fail resolution on the next path component if it remains > outside of the root. A path which has always been outside nd->root > during resolution will never be resolveable from nd->root and thus will > always be blocked. > > DO NOT MERGE: Currently this code returns -EMULTIHOP in this case, > purely as a debugging measure (so that you can see that > the protection actually does something). Obviously in the > proper patch this will return -EXDEV. > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > --- > fs/namei.c | 32 ++++++++++++++++++++++++++++++-- > 1 file changed, 30 insertions(+), 2 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index 6f995e6de6b1..c8349693d47b 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -53,8 +53,8 @@ > * The new code replaces the old recursive symlink resolution with > * an iterative one (in case of non-nested symlink chains). It does > * this with calls to <fs>_follow_link(). > - * As a side effect, dir_namei(), _namei() and follow_link() are now > - * replaced with a single function lookup_dentry() that can handle all > + * As a side effect, dir_namei(), _namei() and follow_link() are now > + * replaced with a single function lookup_dentry() that can handle all > * the special cases of the former code. > * > * With the new dcache, the pathname is stored at each inode, at least as > @@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd) > return -EXDEV; > break; > } > + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) { > + char *pathbuf, *pathptr; > + > + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC); > + if (!pathbuf) > + return -ECHILD; > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); > + kfree(pathbuf); > + if (IS_ERR_OR_NULL(pathptr)) { > + if (!pathptr) > + pathptr = ERR_PTR(-EMULTIHOP); > + return PTR_ERR(pathptr); > + } > + } > if (nd->path.dentry != nd->path.mnt->mnt_root) { > struct dentry *old = nd->path.dentry; > struct dentry *parent = old->d_parent; > @@ -1510,6 +1524,20 @@ static int follow_dotdot(struct nameidata *nd) > return -EXDEV; > break; > } > + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) { > + char *pathbuf, *pathptr; > + > + pathbuf = kmalloc(PATH_MAX, GFP_KERNEL); > + if (!pathbuf) > + return -ENOMEM; > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); > + kfree(pathbuf); > + if (IS_ERR_OR_NULL(pathptr)) { > + if (!pathptr) > + pathptr = ERR_PTR(-EMULTIHOP); > + return PTR_ERR(pathptr); > + } > + } > if (nd->path.dentry != nd->path.mnt->mnt_root) { > int ret = path_parent_directory(&nd->path); > if (ret) > -- > 2.19.0 > > -- > Aleksa Sarai > Senior Software Engineer (Containers) > SUSE Linux GmbH > <https://www.cyphar.com/>
On Thu, Oct 4, 2018 at 6:26 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2018-09-29, Jann Horn <jannh@google.com> wrote: > > You attempt to open "C/../../etc/passwd" under the root "/A/B". > > Something else concurrently moves /A/B/C to /A/C. This can result in > > the following: > > > > 1. You start the path walk and reach /A/B/C. > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > > 3. Your path walk follows the first ".." up into /A. This is outside > > the process root, but you never actually encountered the process root, > > so you don't notice. > > 4. Your path walk follows the second ".." up to /. Again, this is > > outside the process root, but you don't notice. > > 5. Your path walk walks down to /etc/passwd, and the open completes > > successfully. You now have an fd pointing outside your chroot. > > I've been playing with this and I have the following patch, which > according to my testing protects against attacks where ".." skips over > nd->root. It abuses __d_path to figure out if nd->path can be resolved > from nd->root (obviously a proper version of this patch would refactor > __d_path so it could be used like this -- and would not return > -EMULTIHOP). > > I've also attached my reproducer. With it, I was seeing fairly constant > breakouts before this patch and after it I didn't see a single breakout > after running it overnight. Obviously this is not conclusive, but I'm > hoping that it can show what my idea for protecting against ".." was. > > Does this patch make sense? Or is there something wrong with it that I'm > not seeing? > > --8<------------------------------------------------------------------- > > There is a fairly easy-to-exploit race condition with chroot(2) (and > thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a > path can be used to "skip over" nd->root and thus escape to the > filesystem above nd->root. > > thread1 [attacker]: > for (;;) > renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); > thread2 [victim]: > for (;;) > openat(dirb, "b/c/../../etc/shadow", O_THISROOT); > > With fairly significant regularity, thread2 will resolve to > "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases > will be detected during ".." resolution (which is the weak point of > chroot(2) -- since walking *into* a subdirectory tautologically cannot > result in you walking *outside* nd->root). > > The use of __d_path here might seem suspect, however we don't mind if a > path is moved from within the chroot to outside the chroot and we > incorrectly decide it is safe (because at that point we are still within > the set of files which were accessible at the beginning of resolution). > However, we can fail resolution on the next path component if it remains > outside of the root. A path which has always been outside nd->root > during resolution will never be resolveable from nd->root and thus will > always be blocked. > > DO NOT MERGE: Currently this code returns -EMULTIHOP in this case, > purely as a debugging measure (so that you can see that > the protection actually does something). Obviously in the > proper patch this will return -EXDEV. > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > --- > fs/namei.c | 32 ++++++++++++++++++++++++++++++-- > 1 file changed, 30 insertions(+), 2 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index 6f995e6de6b1..c8349693d47b 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -53,8 +53,8 @@ > * The new code replaces the old recursive symlink resolution with > * an iterative one (in case of non-nested symlink chains). It does > * this with calls to <fs>_follow_link(). > - * As a side effect, dir_namei(), _namei() and follow_link() are now > - * replaced with a single function lookup_dentry() that can handle all > + * As a side effect, dir_namei(), _namei() and follow_link() are now > + * replaced with a single function lookup_dentry() that can handle all > * the special cases of the former code. > * > * With the new dcache, the pathname is stored at each inode, at least as > @@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd) > return -EXDEV; > break; > } > + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) { > + char *pathbuf, *pathptr; > + > + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC); > + if (!pathbuf) > + return -ECHILD; > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); > + kfree(pathbuf); > + if (IS_ERR_OR_NULL(pathptr)) { > + if (!pathptr) > + pathptr = ERR_PTR(-EMULTIHOP); > + return PTR_ERR(pathptr); > + } > + } One somewhat problematic thing about this approach is that if someone tries to lookup "a/a/a/a/a/a/a/a/a/a/[...]/../../../../../../../../../.." for some reason, you'll have quadratic runtime: For each "..", you'll have to walk up to the root.
On 2018-10-04, Jann Horn <jannh@google.com> wrote: > On Thu, Oct 4, 2018 at 6:26 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2018-09-29, Jann Horn <jannh@google.com> wrote: > > > You attempt to open "C/../../etc/passwd" under the root "/A/B". > > > Something else concurrently moves /A/B/C to /A/C. This can result in > > > the following: > > > > > > 1. You start the path walk and reach /A/B/C. > > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > > > 3. Your path walk follows the first ".." up into /A. This is outside > > > the process root, but you never actually encountered the process root, > > > so you don't notice. > > > 4. Your path walk follows the second ".." up to /. Again, this is > > > outside the process root, but you don't notice. > > > 5. Your path walk walks down to /etc/passwd, and the open completes > > > successfully. You now have an fd pointing outside your chroot. > > > > I've been playing with this and I have the following patch, which > > according to my testing protects against attacks where ".." skips over > > nd->root. It abuses __d_path to figure out if nd->path can be resolved > > from nd->root (obviously a proper version of this patch would refactor > > __d_path so it could be used like this -- and would not return > > -EMULTIHOP). > > > > I've also attached my reproducer. With it, I was seeing fairly constant > > breakouts before this patch and after it I didn't see a single breakout > > after running it overnight. Obviously this is not conclusive, but I'm > > hoping that it can show what my idea for protecting against ".." was. > > > > Does this patch make sense? Or is there something wrong with it that I'm > > not seeing? > > > > --8<------------------------------------------------------------------- > > > > There is a fairly easy-to-exploit race condition with chroot(2) (and > > thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a > > path can be used to "skip over" nd->root and thus escape to the > > filesystem above nd->root. > > > > thread1 [attacker]: > > for (;;) > > renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); > > thread2 [victim]: > > for (;;) > > openat(dirb, "b/c/../../etc/shadow", O_THISROOT); > > > > With fairly significant regularity, thread2 will resolve to > > "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases > > will be detected during ".." resolution (which is the weak point of > > chroot(2) -- since walking *into* a subdirectory tautologically cannot > > result in you walking *outside* nd->root). > > > > The use of __d_path here might seem suspect, however we don't mind if a > > path is moved from within the chroot to outside the chroot and we > > incorrectly decide it is safe (because at that point we are still within > > the set of files which were accessible at the beginning of resolution). > > However, we can fail resolution on the next path component if it remains > > outside of the root. A path which has always been outside nd->root > > during resolution will never be resolveable from nd->root and thus will > > always be blocked. > > > > DO NOT MERGE: Currently this code returns -EMULTIHOP in this case, > > purely as a debugging measure (so that you can see that > > the protection actually does something). Obviously in the > > proper patch this will return -EXDEV. > > > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > > --- > > fs/namei.c | 32 ++++++++++++++++++++++++++++++-- > > 1 file changed, 30 insertions(+), 2 deletions(-) > > > > diff --git a/fs/namei.c b/fs/namei.c > > index 6f995e6de6b1..c8349693d47b 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -53,8 +53,8 @@ > > * The new code replaces the old recursive symlink resolution with > > * an iterative one (in case of non-nested symlink chains). It does > > * this with calls to <fs>_follow_link(). > > - * As a side effect, dir_namei(), _namei() and follow_link() are now > > - * replaced with a single function lookup_dentry() that can handle all > > + * As a side effect, dir_namei(), _namei() and follow_link() are now > > + * replaced with a single function lookup_dentry() that can handle all > > * the special cases of the former code. > > * > > * With the new dcache, the pathname is stored at each inode, at least as > > @@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd) > > return -EXDEV; > > break; > > } > > + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) { > > + char *pathbuf, *pathptr; > > + > > + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC); > > + if (!pathbuf) > > + return -ECHILD; > > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); > > + kfree(pathbuf); > > + if (IS_ERR_OR_NULL(pathptr)) { > > + if (!pathptr) > > + pathptr = ERR_PTR(-EMULTIHOP); > > + return PTR_ERR(pathptr); > > + } > > + } > > One somewhat problematic thing about this approach is that if someone > tries to lookup > "a/a/a/a/a/a/a/a/a/a/[...]/../../../../../../../../../.." for some > reason, you'll have quadratic runtime: For each "..", you'll have to > walk up to the root. What if we took rename_lock (call it nd->r_seq) at the start of the resolution, and then only tried the __d_path-style check if (read_seqretry(&rename_lock, nd->r_seq) || read_seqretry(&mount_lock, nd->m_seq)) /* do the __d_path lookup. */ That way you would only hit the slow path if there were concurrent renames or mounts *and* you are doing a path resolution with AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does this (and after some testing it also appears to work). I'm not sure if there's a way to always avoid the quadratic lookup without (significantly and probably unreasonably) changing how dcache invalidation works. And obviously using this slow path if there was _any_ rename on the _entire_ system is suboptimal, but I think it is a significant improvement. Another possibility is to expand on Andy's suggestion to use /proc/$pid/root, and instead require AT_THIS_ROOT to use the root of a namespace as its dirfd (I'm not sure if there's a trivial way to detect this though). This wouldn't help with AT_BENEATH, but it should protect against ".." shenanigans without any ".." handling changes. (This is less ideal because it requires a container process, but it is another way of dealing with the issue.) --- fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 46 insertions(+), 3 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 6f995e6de6b1..12c9be175cb4 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -493,7 +493,7 @@ struct nameidata { struct path root; struct inode *inode; /* path.dentry.d_inode */ unsigned int flags; - unsigned seq, m_seq; + unsigned seq, m_seq, r_seq; int last_type; unsigned depth; int total_link_count; @@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -EXDEV; break; } + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) && + (read_seqretry(&rename_lock, nd->r_seq) || + read_seqretry(&mount_lock, nd->m_seq)))) { + char *pathbuf, *pathptr; + + nd->r_seq = read_seqbegin(&rename_lock); + /* Cannot take m_seq here. */ + + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC); + if (!pathbuf) + return -ECHILD; + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); + kfree(pathbuf); + if (IS_ERR_OR_NULL(pathptr)) { + int error = PTR_ERR_OR_ZERO(pathptr); + + if (!error) + error = nd_jump_root(nd); + return error; + } + } if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent; @@ -1510,6 +1531,27 @@ static int follow_dotdot(struct nameidata *nd) return -EXDEV; break; } + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) && + (read_seqretry(&rename_lock, nd->r_seq) || + read_seqretry(&mount_lock, nd->m_seq)))) { + char *pathbuf, *pathptr; + + nd->r_seq = read_seqbegin(&rename_lock); + /* Cannot take m_seq here. */ + + pathbuf = kmalloc(PATH_MAX, GFP_KERNEL); + if (!pathbuf) + return -ENOMEM; + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); + kfree(pathbuf); + if (IS_ERR_OR_NULL(pathptr)) { + int error = PTR_ERR_OR_ZERO(pathptr); + + if (!error) + error = nd_jump_root(nd); + return error; + } + } if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret) @@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->last_type = LAST_ROOT; /* if there are only slashes... */ nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; nd->depth = 0; + nd->m_seq = read_seqbegin(&mount_lock); + nd->r_seq = read_seqbegin(&rename_lock); + if (flags & LOOKUP_ROOT) { struct dentry *root = nd->root.dentry; struct inode *inode = root->d_inode; @@ -2279,7 +2324,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags) if (flags & LOOKUP_RCU) { nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); nd->root_seq = nd->seq; - nd->m_seq = read_seqbegin(&mount_lock); } else { path_get(&nd->path); } @@ -2290,7 +2334,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.mnt = NULL; nd->path.dentry = NULL; - nd->m_seq = read_seqbegin(&mount_lock); if (unlikely(flags & (LOOKUP_CHROOT | LOOKUP_XDEV))) { error = dirfd_path_init(nd); if (unlikely(error))
On Fri, Oct 5, 2018 at 5:07 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2018-10-04, Jann Horn <jannh@google.com> wrote: > > On Thu, Oct 4, 2018 at 6:26 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > On 2018-09-29, Jann Horn <jannh@google.com> wrote: > > > > You attempt to open "C/../../etc/passwd" under the root "/A/B". > > > > Something else concurrently moves /A/B/C to /A/C. This can result in > > > > the following: > > > > > > > > 1. You start the path walk and reach /A/B/C. > > > > 2. The other process moves /A/B/C to /A/C. Your path walk is now at /A/C. > > > > 3. Your path walk follows the first ".." up into /A. This is outside > > > > the process root, but you never actually encountered the process root, > > > > so you don't notice. > > > > 4. Your path walk follows the second ".." up to /. Again, this is > > > > outside the process root, but you don't notice. > > > > 5. Your path walk walks down to /etc/passwd, and the open completes > > > > successfully. You now have an fd pointing outside your chroot. > > > > > > I've been playing with this and I have the following patch, which > > > according to my testing protects against attacks where ".." skips over > > > nd->root. It abuses __d_path to figure out if nd->path can be resolved > > > from nd->root (obviously a proper version of this patch would refactor > > > __d_path so it could be used like this -- and would not return > > > -EMULTIHOP). > > > > > > I've also attached my reproducer. With it, I was seeing fairly constant > > > breakouts before this patch and after it I didn't see a single breakout > > > after running it overnight. Obviously this is not conclusive, but I'm > > > hoping that it can show what my idea for protecting against ".." was. > > > > > > Does this patch make sense? Or is there something wrong with it that I'm > > > not seeing? > > > > > > --8<------------------------------------------------------------------- > > > > > > There is a fairly easy-to-exploit race condition with chroot(2) (and > > > thus by extension AT_THIS_ROOT and AT_BENEATH) where a rename(2) of a > > > path can be used to "skip over" nd->root and thus escape to the > > > filesystem above nd->root. > > > > > > thread1 [attacker]: > > > for (;;) > > > renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE); > > > thread2 [victim]: > > > for (;;) > > > openat(dirb, "b/c/../../etc/shadow", O_THISROOT); > > > > > > With fairly significant regularity, thread2 will resolve to > > > "/etc/shadow" rather than "/a/b/etc/shadow". With this patch, such cases > > > will be detected during ".." resolution (which is the weak point of > > > chroot(2) -- since walking *into* a subdirectory tautologically cannot > > > result in you walking *outside* nd->root). > > > > > > The use of __d_path here might seem suspect, however we don't mind if a > > > path is moved from within the chroot to outside the chroot and we > > > incorrectly decide it is safe (because at that point we are still within > > > the set of files which were accessible at the beginning of resolution). > > > However, we can fail resolution on the next path component if it remains > > > outside of the root. A path which has always been outside nd->root > > > during resolution will never be resolveable from nd->root and thus will > > > always be blocked. > > > > > > DO NOT MERGE: Currently this code returns -EMULTIHOP in this case, > > > purely as a debugging measure (so that you can see that > > > the protection actually does something). Obviously in the > > > proper patch this will return -EXDEV. > > > > > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > > > --- > > > fs/namei.c | 32 ++++++++++++++++++++++++++++++-- > > > 1 file changed, 30 insertions(+), 2 deletions(-) > > > > > > diff --git a/fs/namei.c b/fs/namei.c > > > index 6f995e6de6b1..c8349693d47b 100644 > > > --- a/fs/namei.c > > > +++ b/fs/namei.c > > > @@ -53,8 +53,8 @@ > > > * The new code replaces the old recursive symlink resolution with > > > * an iterative one (in case of non-nested symlink chains). It does > > > * this with calls to <fs>_follow_link(). > > > - * As a side effect, dir_namei(), _namei() and follow_link() are now > > > - * replaced with a single function lookup_dentry() that can handle all > > > + * As a side effect, dir_namei(), _namei() and follow_link() are now > > > + * replaced with a single function lookup_dentry() that can handle all > > > * the special cases of the former code. > > > * > > > * With the new dcache, the pathname is stored at each inode, at least as > > > @@ -1375,6 +1375,20 @@ static int follow_dotdot_rcu(struct nameidata *nd) > > > return -EXDEV; > > > break; > > > } > > > + if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT))) { > > > + char *pathbuf, *pathptr; > > > + > > > + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC); > > > + if (!pathbuf) > > > + return -ECHILD; > > > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); > > > + kfree(pathbuf); > > > + if (IS_ERR_OR_NULL(pathptr)) { > > > + if (!pathptr) > > > + pathptr = ERR_PTR(-EMULTIHOP); > > > + return PTR_ERR(pathptr); > > > + } > > > + } > > > > One somewhat problematic thing about this approach is that if someone > > tries to lookup > > "a/a/a/a/a/a/a/a/a/a/[...]/../../../../../../../../../.." for some > > reason, you'll have quadratic runtime: For each "..", you'll have to > > walk up to the root. > > What if we took rename_lock (call it nd->r_seq) at the start of the > resolution, and then only tried the __d_path-style check > > if (read_seqretry(&rename_lock, nd->r_seq) || > read_seqretry(&mount_lock, nd->m_seq)) > /* do the __d_path lookup. */ > > That way you would only hit the slow path if there were concurrent > renames or mounts *and* you are doing a path resolution with > AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does > this (and after some testing it also appears to work). Yeah, I think that might do the job. > I'm not sure if there's a way to always avoid the quadratic lookup > without (significantly and probably unreasonably) changing how dcache > invalidation works. And obviously using this slow path if there was > _any_ rename on the _entire_ system is suboptimal, but I think it is a > significant improvement. Yeah, I think this is much better. > Another possibility is to expand on Andy's suggestion to use > /proc/$pid/root, and instead require AT_THIS_ROOT to use the root of a > namespace as its dirfd (I'm not sure if there's a trivial way to detect > this though). This wouldn't help with AT_BENEATH, but it should protect > against ".." shenanigans without any ".." handling changes. (This is > less ideal because it requires a container process, but it is another > way of dealing with the issue.) (For container usecases, but not for a web server that uses AT_BENEATH.) > --- > fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 46 insertions(+), 3 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index 6f995e6de6b1..12c9be175cb4 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -493,7 +493,7 @@ struct nameidata { > struct path root; > struct inode *inode; /* path.dentry.d_inode */ > unsigned int flags; > - unsigned seq, m_seq; > + unsigned seq, m_seq, r_seq; > int last_type; > unsigned depth; > int total_link_count; > @@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd) > return -EXDEV; > break; > } > + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) && > + (read_seqretry(&rename_lock, nd->r_seq) || > + read_seqretry(&mount_lock, nd->m_seq)))) { > + char *pathbuf, *pathptr; > + > + nd->r_seq = read_seqbegin(&rename_lock); > + /* Cannot take m_seq here. */ > + > + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC); > + if (!pathbuf) > + return -ECHILD; > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); > + kfree(pathbuf); You're doing this check before actually looking up the parent, right? So as long as I don't trigger the "path_equal(&nd->path, &nd->root)" check that you do for O_BENEATH, escaping up by one level is possible, right? You should probably move this check so that it happens after following "..". (Also: I assume that you're going to get rid of that memory allocation in a future version.) > + if (IS_ERR_OR_NULL(pathptr)) { > + int error = PTR_ERR_OR_ZERO(pathptr); > + > + if (!error) > + error = nd_jump_root(nd); > + return error; > + } > + } > if (nd->path.dentry != nd->path.mnt->mnt_root) { > struct dentry *old = nd->path.dentry; > struct dentry *parent = old->d_parent; > @@ -1510,6 +1531,27 @@ static int follow_dotdot(struct nameidata *nd) > return -EXDEV; > break; > } > + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) && > + (read_seqretry(&rename_lock, nd->r_seq) || > + read_seqretry(&mount_lock, nd->m_seq)))) { > + char *pathbuf, *pathptr; > + > + nd->r_seq = read_seqbegin(&rename_lock); > + /* Cannot take m_seq here. */ > + > + pathbuf = kmalloc(PATH_MAX, GFP_KERNEL); > + if (!pathbuf) > + return -ENOMEM; > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); > + kfree(pathbuf); > + if (IS_ERR_OR_NULL(pathptr)) { > + int error = PTR_ERR_OR_ZERO(pathptr); > + > + if (!error) > + error = nd_jump_root(nd); > + return error; > + } > + } Same problem as in the RCU case above. > if (nd->path.dentry != nd->path.mnt->mnt_root) { > int ret = path_parent_directory(&nd->path); > if (ret) > @@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > nd->last_type = LAST_ROOT; /* if there are only slashes... */ > nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; > nd->depth = 0; > + nd->m_seq = read_seqbegin(&mount_lock); > + nd->r_seq = read_seqbegin(&rename_lock); This means that now, attempting to perform a lookup while something is holding the rename_lock will spin on the lock. I don't know whether that's a problem in practice though. Does anyone on this thread know whether this is problematic?
On 2018-10-05, Jann Horn <jannh@google.com> wrote: > > What if we took rename_lock (call it nd->r_seq) at the start of the > > resolution, and then only tried the __d_path-style check > > > > if (read_seqretry(&rename_lock, nd->r_seq) || > > read_seqretry(&mount_lock, nd->m_seq)) > > /* do the __d_path lookup. */ > > > > That way you would only hit the slow path if there were concurrent > > renames or mounts *and* you are doing a path resolution with > > AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does > > this (and after some testing it also appears to work). > > Yeah, I think that might do the job. *phew* I was all out of other ideas. :P > > --- > > fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- > > 1 file changed, 46 insertions(+), 3 deletions(-) > > > > diff --git a/fs/namei.c b/fs/namei.c > > index 6f995e6de6b1..12c9be175cb4 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -493,7 +493,7 @@ struct nameidata { > > struct path root; > > struct inode *inode; /* path.dentry.d_inode */ > > unsigned int flags; > > - unsigned seq, m_seq; > > + unsigned seq, m_seq, r_seq; > > int last_type; > > unsigned depth; > > int total_link_count; > > @@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd) > > return -EXDEV; > > break; > > } > > + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) && > > + (read_seqretry(&rename_lock, nd->r_seq) || > > + read_seqretry(&mount_lock, nd->m_seq)))) { > > + char *pathbuf, *pathptr; > > + > > + nd->r_seq = read_seqbegin(&rename_lock); > > + /* Cannot take m_seq here. */ > > + > > + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC); > > + if (!pathbuf) > > + return -ECHILD; > > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); > > + kfree(pathbuf); > > You're doing this check before actually looking up the parent, right? > So as long as I don't trigger the "path_equal(&nd->path, &nd->root)" > check that you do for O_BENEATH, escaping up by one level is possible, > right? You should probably move this check so that it happens after > following "..". Yup, you're right. I'll do that. > (Also: I assume that you're going to get rid of that memory allocation > in a future version.) Sure. Would you prefer adding some scratch space in nameidata, or that I change __d_path so it accepts NULL as the buffer (and thus it doesn't actually do any string operations)? > > if (nd->path.dentry != nd->path.mnt->mnt_root) { > > int ret = path_parent_directory(&nd->path); > > if (ret) > > @@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > > nd->last_type = LAST_ROOT; /* if there are only slashes... */ > > nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; > > nd->depth = 0; > > + nd->m_seq = read_seqbegin(&mount_lock); > > + nd->r_seq = read_seqbegin(&rename_lock); > > This means that now, attempting to perform a lookup while something is > holding the rename_lock will spin on the lock. I don't know whether > that's a problem in practice though. Does anyone on this thread know > whether this is problematic? I could make it so that we only take &rename_lock if (nd->flags & (FOLLOW_BENEATH | FOLLOW_CHROOT)), since it's not used outside of that path.
* Aleksa Sarai: > On 2018-10-01, Andy Lutomirski <luto@amacapital.net> wrote: >> >>> Currently most container runtimes try to do this resolution in >> >>> userspace[1], causing many potential race conditions. In addition, the >> >>> "obvious" alternative (actually performing a {ch,pivot_}root(2)) >> >>> requires a fork+exec which is *very* costly if necessary for every >> >>> filesystem operation involving a container. >> >> >> >> Wait. fork() I understand, but why exec? And actually, you don't need >> >> a full fork() either, clone() lets you do this with some process parts >> >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep >> >> the file descriptor table shared. And why chroot()/pivot_root(), >> >> wouldn't you want to use setns()? >> > >> > You're right about this -- for C runtimes. In Go we cannot do a raw >> > clone() or fork() (if you do it manually with RawSyscall you'll end with >> > broken runtime state). So you're forced to do fork+exec (which then >> > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes >> > for CLONE_VFORK. >> >> I must admit that I’m not very sympathetic to the argument that “Go’s >> runtime model is incompatible with the simpler solution.” > > Multi-threaded programs have a similar issue (though with Go it's much > worse). If you fork a multi-threaded C program then you can only safely > use AS-Safe glibc functions (those that are safe within a signal > handler). But if you're just doing three syscalls this shouldn't be as > big of a problem as Go where you can't even do said syscalls. The situation is a bit more complicated. There are many programs out there which use malloc and free (at least indirectly) after a fork, and we cannot break them. In glibc, we have a couple of subsystems which are put into a known state before calling the fork/clone system call if the application calls fork. The price we pay for that is a fork which is not POSIX-compliant because it is not async-signal-safe. Admittedly, other libcs chose different trade-offs. However, what is the same across libcs is this: You cannot call the clone system call directly and get a fully working new process. Some things break. For example, for recursive mutexes, we need to know the TID of the current thread, and we cannot perform a system call to get it for performance reasons. So everyone has a TID cache for that. But the TID cache does not get reset when you bypass the fork implementation in libc, so you end up with subtle corruption bugs on TID reuse. So I'd say that in most cases, the C situation is pretty much the same as the Go situation. If I recall correctly, the problem for Go is that it cannot call setns from Go code because it fails in the kernel for multi-threaded processes, and Go processes are already multi-threaded when user Go code runs.
On Sat, Oct 6, 2018 at 10:56 PM Florian Weimer <fw@deneb.enyo.de> wrote: > > * Aleksa Sarai: > > > On 2018-10-01, Andy Lutomirski <luto@amacapital.net> wrote: > >> >>> Currently most container runtimes try to do this resolution in > >> >>> userspace[1], causing many potential race conditions. In addition, the > >> >>> "obvious" alternative (actually performing a {ch,pivot_}root(2)) > >> >>> requires a fork+exec which is *very* costly if necessary for every > >> >>> filesystem operation involving a container. > >> >> > >> >> Wait. fork() I understand, but why exec? And actually, you don't need > >> >> a full fork() either, clone() lets you do this with some process parts > >> >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep > >> >> the file descriptor table shared. And why chroot()/pivot_root(), > >> >> wouldn't you want to use setns()? > >> > > >> > You're right about this -- for C runtimes. In Go we cannot do a raw > >> > clone() or fork() (if you do it manually with RawSyscall you'll end with > >> > broken runtime state). So you're forced to do fork+exec (which then > >> > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes > >> > for CLONE_VFORK. > >> > >> I must admit that I’m not very sympathetic to the argument that “Go’s > >> runtime model is incompatible with the simpler solution.” > > > > Multi-threaded programs have a similar issue (though with Go it's much > > worse). If you fork a multi-threaded C program then you can only safely > > use AS-Safe glibc functions (those that are safe within a signal > > handler). But if you're just doing three syscalls this shouldn't be as > > big of a problem as Go where you can't even do said syscalls. > > The situation is a bit more complicated. There are many programs out > there which use malloc and free (at least indirectly) after a fork, > and we cannot break them. In glibc, we have a couple of subsystems > which are put into a known state before calling the fork/clone system > call if the application calls fork. The price we pay for that is a > fork which is not POSIX-compliant because it is not async-signal-safe. > Admittedly, other libcs chose different trade-offs. > > However, what is the same across libcs is this: You cannot call the > clone system call directly and get a fully working new process. Some > things break. For example, for recursive mutexes, we need to know the > TID of the current thread, and we cannot perform a system call to get > it for performance reasons. So everyone has a TID cache for that. > But the TID cache does not get reset when you bypass the fork > implementation in libc, so you end up with subtle corruption bugs on > TID reuse. Sure, but recursive mutexes etc. are very specific use-case. I'd even go so far to say that if you use mutexes + threads and then also fork in those threads you're hosed anyway. If you don't things get a little cleaner assuming you don't call library functions that use mutexes internally. Event then you might (sometimes at least) still get around most problems with atfork handlers (thought I really don't like him). But you know more about this then I do. :) > > So I'd say that in most cases, the C situation is pretty much the same > as the Go situation. If I recall correctly, the problem for Go is > that it cannot call setns from Go code because it fails in the kernel > for multi-threaded processes, and Go processes are already > multi-threaded when user Go code runs. That is true for *some* namespaces (user, mount) but not for all. For example, setns(CLONE_NEWNET) would be fine from go. But the go runtime thinks it's clever to clone a new thread in between entry and exit of a syscall. If you switch namespaces you might end up with a new thread that belongs to the wrong namespace which is very problematic. So you can either rely on calling some go magic that locks you to a specific os thread but that does only work in later go versions or you go the constructor route, i.e. you e.g. implement a (dummy) subcommand that you can call and that triggers the execution of a C function that is marked with __attribute__((constructor)) that runs before the go runtime and in which you can do setns(), fork() and friends (somewhat) safely. This has very bad performance and is a nasty hack but it's really unavoidable.
On Sat, Oct 6, 2018 at 4:10 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > On 2018-10-05, Jann Horn <jannh@google.com> wrote: > > > What if we took rename_lock (call it nd->r_seq) at the start of the > > > resolution, and then only tried the __d_path-style check > > > > > > if (read_seqretry(&rename_lock, nd->r_seq) || > > > read_seqretry(&mount_lock, nd->m_seq)) > > > /* do the __d_path lookup. */ > > > > > > That way you would only hit the slow path if there were concurrent > > > renames or mounts *and* you are doing a path resolution with > > > AT_THIS_ROOT or AT_BENEATH. I've attached a modified patch that does > > > this (and after some testing it also appears to work). > > > > Yeah, I think that might do the job. > > *phew* I was all out of other ideas. :P > > > > --- > > > fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- > > > 1 file changed, 46 insertions(+), 3 deletions(-) > > > > > > diff --git a/fs/namei.c b/fs/namei.c > > > index 6f995e6de6b1..12c9be175cb4 100644 > > > --- a/fs/namei.c > > > +++ b/fs/namei.c > > > @@ -493,7 +493,7 @@ struct nameidata { > > > struct path root; > > > struct inode *inode; /* path.dentry.d_inode */ > > > unsigned int flags; > > > - unsigned seq, m_seq; > > > + unsigned seq, m_seq, r_seq; > > > int last_type; > > > unsigned depth; > > > int total_link_count; > > > @@ -1375,6 +1375,27 @@ static int follow_dotdot_rcu(struct nameidata *nd) > > > return -EXDEV; > > > break; > > > } > > > + if (unlikely((nd->flags & (LOOKUP_BENEATH | LOOKUP_CHROOT)) && > > > + (read_seqretry(&rename_lock, nd->r_seq) || > > > + read_seqretry(&mount_lock, nd->m_seq)))) { > > > + char *pathbuf, *pathptr; > > > + > > > + nd->r_seq = read_seqbegin(&rename_lock); > > > + /* Cannot take m_seq here. */ > > > + > > > + pathbuf = kmalloc(PATH_MAX, GFP_ATOMIC); > > > + if (!pathbuf) > > > + return -ECHILD; > > > + pathptr = __d_path(&nd->path, &nd->root, pathbuf, PATH_MAX); > > > + kfree(pathbuf); > > > > You're doing this check before actually looking up the parent, right? > > So as long as I don't trigger the "path_equal(&nd->path, &nd->root)" > > check that you do for O_BENEATH, escaping up by one level is possible, > > right? You should probably move this check so that it happens after > > following "..". > > Yup, you're right. I'll do that. > > > (Also: I assume that you're going to get rid of that memory allocation > > in a future version.) > > Sure. Would you prefer adding some scratch space in nameidata, or that I > change __d_path so it accepts NULL as the buffer (and thus it doesn't > actually do any string operations)? Well, I think accepting a NULL buffer would be much cleaner; but keep in mind that I'm just someone making suggestions, Al Viro is the one who has to like your code. :P > > > if (nd->path.dentry != nd->path.mnt->mnt_root) { > > > int ret = path_parent_directory(&nd->path); > > > if (ret) > > > @@ -2269,6 +2311,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > > > nd->last_type = LAST_ROOT; /* if there are only slashes... */ > > > nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT; > > > nd->depth = 0; > > > + nd->m_seq = read_seqbegin(&mount_lock); > > > + nd->r_seq = read_seqbegin(&rename_lock); > > > > This means that now, attempting to perform a lookup while something is > > holding the rename_lock will spin on the lock. I don't know whether > > that's a problem in practice though. Does anyone on this thread know > > whether this is problematic? > > I could make it so that we only take &rename_lock > if (nd->flags & (FOLLOW_BENEATH | FOLLOW_CHROOT)), > since it's not used outside of that path. I think that might be a sensible change; but as I said, I don't actually know whether it's necessary, and it would be very helpful if someone who actually knows commented on this.
diff --git a/fs/fcntl.c b/fs/fcntl.c index e343618736f7..4c36c5b9fdb9 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -1031,7 +1031,7 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. */ - BUILD_BUG_ON(25 - 1 /* for O_RDONLY being 0 */ != + BUILD_BUG_ON(26 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC | __FMODE_NONOTIFY)); diff --git a/fs/namei.c b/fs/namei.c index 757dd783771c..1b984f0dbbb4 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2193,9 +2193,64 @@ static int link_path_walk(const char *name, struct nameidata *nd) } } +/* + * Configure nd->path based on the nd->dfd. This is only used as part of + * path_init(). + */ +static inline int dirfd_path_init(struct nameidata *nd) +{ + if (nd->dfd == AT_FDCWD) { + if (nd->flags & LOOKUP_RCU) { + struct fs_struct *fs = current->fs; + unsigned seq; + + do { + seq = read_seqcount_begin(&fs->seq); + nd->path = fs->pwd; + nd->inode = nd->path.dentry->d_inode; + nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); + } while (read_seqcount_retry(&fs->seq, seq)); + } else { + get_fs_pwd(current->fs, &nd->path); + nd->inode = nd->path.dentry->d_inode; + } + } else { + /* Caller must check execute permissions on the starting path component */ + struct fd f = fdget_raw(nd->dfd); + struct dentry *dentry; + + if (!f.file) + return -EBADF; + + dentry = f.file->f_path.dentry; + + if (*nd->name->name && unlikely(!d_can_lookup(dentry))) { + fdput(f); + return -ENOTDIR; + } + + nd->path = f.file->f_path; + if (nd->flags & LOOKUP_RCU) { + nd->inode = nd->path.dentry->d_inode; + nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); + } else { + path_get(&nd->path); + nd->inode = nd->path.dentry->d_inode; + } + fdput(f); + } + if (unlikely(nd->flags & (LOOKUP_CHROOT | LOOKUP_BENEATH))) { + nd->root = nd->path; + if (!(nd->flags & LOOKUP_RCU)) + path_get(&nd->root); + } + return 0; +} + /* must be paired with terminate_walk() */ static const char *path_init(struct nameidata *nd, unsigned flags) { + int error; const char *s = nd->name->name; if (!*s) @@ -2230,65 +2285,25 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.dentry = NULL; nd->m_seq = read_seqbegin(&mount_lock); + if (unlikely(flags & LOOKUP_CHROOT)) { + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + } if (*s == '/') { - int error; - set_root(nd); + if (likely(!nd->root.mnt)) + set_root(nd); error = nd_jump_root(nd); if (unlikely(error)) s = ERR_PTR(error); return s; - } else if (nd->dfd == AT_FDCWD) { - if (flags & LOOKUP_RCU) { - struct fs_struct *fs = current->fs; - unsigned seq; - - do { - seq = read_seqcount_begin(&fs->seq); - nd->path = fs->pwd; - nd->inode = nd->path.dentry->d_inode; - nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); - } while (read_seqcount_retry(&fs->seq, seq)); - } else { - get_fs_pwd(current->fs, &nd->path); - nd->inode = nd->path.dentry->d_inode; - } - if (unlikely(flags & LOOKUP_BENEATH)) { - nd->root = nd->path; - if (!(flags & LOOKUP_RCU)) - path_get(&nd->root); - } - return s; - } else { - /* Caller must check execute permissions on the starting path component */ - struct fd f = fdget_raw(nd->dfd); - struct dentry *dentry; - - if (!f.file) - return ERR_PTR(-EBADF); - - dentry = f.file->f_path.dentry; - - if (*s && unlikely(!d_can_lookup(dentry))) { - fdput(f); - return ERR_PTR(-ENOTDIR); - } - - nd->path = f.file->f_path; - if (flags & LOOKUP_RCU) { - nd->inode = nd->path.dentry->d_inode; - nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); - } else { - path_get(&nd->path); - nd->inode = nd->path.dentry->d_inode; - } - if (unlikely(flags & LOOKUP_BENEATH)) { - nd->root = nd->path; - if (!(flags & LOOKUP_RCU)) - path_get(&nd->root); - } - fdput(f); - return s; } + if (likely(!nd->path.mnt)) { + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + } + return s; } static const char *trailing_symlink(struct nameidata *nd) diff --git a/fs/open.c b/fs/open.c index 80f5f566a5ff..81d148f626cd 100644 --- a/fs/open.c +++ b/fs/open.c @@ -996,6 +996,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o lookup_flags |= LOOKUP_NO_PROCLINKS; if (flags & O_NOSYMLINKS) lookup_flags |= LOOKUP_NO_SYMLINKS; + if (flags & O_THISROOT) + lookup_flags |= LOOKUP_CHROOT; op->lookup_flags = lookup_flags; return 0; } diff --git a/fs/stat.c b/fs/stat.c index 791e61b916ae..e8366e4812c3 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -172,7 +172,7 @@ int vfs_statx(int dfd, const char __user *filename, int flags, if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH | KSTAT_QUERY_FLAGS | AT_BENEATH | AT_XDEV | - AT_NO_PROCLINKS | AT_NO_SYMLINKS)) + AT_NO_PROCLINKS | AT_NO_SYMLINKS | AT_THIS_ROOT)) return -EINVAL; if (flags & AT_SYMLINK_NOFOLLOW) @@ -189,6 +189,8 @@ int vfs_statx(int dfd, const char __user *filename, int flags, lookup_flags |= LOOKUP_NO_PROCLINKS; if (flags & AT_NO_SYMLINKS) lookup_flags |= LOOKUP_NO_SYMLINKS; + if (flags & AT_THIS_ROOT) + lookup_flags |= LOOKUP_CHROOT; retry: error = user_path_at(dfd, filename, lookup_flags, &path); diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index ad5bba4b5b12..95480cd4c09d 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -10,7 +10,7 @@ O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_BENEATH | O_XDEV | \ - O_NOPROCLINKS | O_NOSYMLINKS) + O_NOPROCLINKS | O_NOSYMLINKS | O_THISROOT) #ifndef force_o_largefile #define force_o_largefile() (BITS_PER_LONG != 32) diff --git a/include/linux/namei.h b/include/linux/namei.h index 5ff7f3362d1b..7ec9e2d84649 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -53,6 +53,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_NO_PROCLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ #define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. Implies LOOKUP_NO_PROCLINKS. */ +#define LOOKUP_CHROOT 0x100000 /* Treat dirfd as %current->fs->root. */ extern int path_pts(struct path *path); diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index c2bf5983e46a..11206b0e927c 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -113,6 +113,9 @@ #ifndef O_NOSYMLINKS #define O_NOSYMLINKS 01000000000 #endif +#ifndef O_THISROOT +#define O_THISROOT 02000000000 +#endif #define F_DUPFD 0 /* dup */ #define F_GETFD 1 /* get close_on_exec */ diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 551a9e2166a8..ea978457b68f 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -99,6 +99,8 @@ #define AT_NO_PROCLINKS 0x40000 /* No /proc/$pid/fd/... "symlinks". */ #define AT_NO_SYMLINKS 0x80000 /* No symlinks *at all*. Implies AT_NO_PROCLINKS. */ +#define AT_THIS_ROOT 0x100000 /* Path resolution acts as though + it is chroot-ed into dirfd. */ #endif /* _UAPI_LINUX_FCNTL_H */
The primary motivation for the need for this flag is container runtimes which have to interact with malicious root filesystems in the host namespaces. One of the first requirements for a container runtime to be secure against a malicious rootfs is that they correctly scope symlinks (that is, they should be scoped as though they are chroot(2)ed into the container's rootfs) and ".."-style paths. The already-existing AT_XDEV and AT_NO_PROCLINKS help defend against other potential attacks in a malicious rootfs scenario. Currently most container runtimes try to do this resolution in userspace[1], causing many potential race conditions. In addition, the "obvious" alternative (actually performing a {ch,pivot_}root(2)) requires a fork+exec which is *very* costly if necessary for every filesystem operation involving a container. The most significant change in semantics with AT_THIS_ROOT is that *at(2) syscalls now no longer have the property that an absolute pathname causes the dirfd to be ignored completely (if LOOKUP_CHROOT is specified). The reasoning behind this is that AT_THIS_ROOT necessarily has to chroot-scope symlinks with absolute paths to dirfd, and so doing it for the base path seems to be the most consistent behaviour (and also avoids foot-gunning users who want to chroot-scope paths that might be absolute). Currently this is only enabled for the stat(2) and openat(2) family (the latter has its own flag O_THISROOT with the same semantics). Ideally this flag would be supported by all *at(2) syscalls, but this will require adding flags arguments to many of them (and will be done in a separate patchset). [1]: https://github.com/cyphar/filepath-securejoin Cc: Eric Biederman <ebiederm@xmission.com> Cc: Christian Brauner <christian@brauner.io> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- fs/fcntl.c | 2 +- fs/namei.c | 121 +++++++++++++++++-------------- fs/open.c | 2 + fs/stat.c | 4 +- include/linux/fcntl.h | 2 +- include/linux/namei.h | 1 + include/uapi/asm-generic/fcntl.h | 3 + include/uapi/linux/fcntl.h | 2 + 8 files changed, 81 insertions(+), 56 deletions(-)