Message ID | 20190706145737.5299-6-cyphar@cyphar.com (mailing list archive) |
---|---|
State | Awaiting Upstream |
Headers | show |
Series | namei: openat2(2) path resolution restrictions | expand |
On Sun, Jul 07, 2019 at 12:57:32AM +1000, Aleksa Sarai wrote: > @@ -1442,8 +1464,11 @@ static int follow_dotdot_rcu(struct nameidata *nd) > struct inode *inode = nd->inode; > > while (1) { > - if (path_equal(&nd->path, &nd->root)) > + if (path_equal(&nd->path, &nd->root)) { > + if (unlikely(nd->flags & LOOKUP_BENEATH)) > + return -EXDEV; > @@ -1468,6 +1493,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) > return -ECHILD; > if (&mparent->mnt == nd->path.mnt) > break; > + if (unlikely(nd->flags & LOOKUP_XDEV)) > + return -EXDEV; > /* we know that mountpoint was pinned */ > nd->path.dentry = mountpoint; > nd->path.mnt = &mparent->mnt; > @@ -1482,6 +1509,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) > return -ECHILD; > if (!mounted) > break; > + if (unlikely(nd->flags & LOOKUP_XDEV)) > + return -EXDEV; Are you sure these failure exits in follow_dotdot_rcu() won't give suprious hard errors? > + if (unlikely(nd->flags & LOOKUP_BENEATH)) { > + error = dirfd_path_init(nd); > + if (unlikely(error)) > + return ERR_PTR(error); > + nd->root = nd->path; > + if (!(nd->flags & LOOKUP_RCU)) > + path_get(&nd->root); > + } > if (*s == '/') { > if (likely(!nd->root.mnt)) > set_root(nd); > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > s = ERR_PTR(error); > return s; > } > - error = dirfd_path_init(nd); > - if (unlikely(error)) > - return ERR_PTR(error); > + if (likely(!nd->path.mnt)) { Is that a weird way of saying "if we hadn't already called dirfd_path_init()"?
On 2019-07-12, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Sun, Jul 07, 2019 at 12:57:32AM +1000, Aleksa Sarai wrote: > > @@ -1442,8 +1464,11 @@ static int follow_dotdot_rcu(struct nameidata *nd) > > struct inode *inode = nd->inode; > > > > while (1) { > > - if (path_equal(&nd->path, &nd->root)) > > + if (path_equal(&nd->path, &nd->root)) { > > + if (unlikely(nd->flags & LOOKUP_BENEATH)) > > + return -EXDEV; > > > @@ -1468,6 +1493,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) > > return -ECHILD; > > if (&mparent->mnt == nd->path.mnt) > > break; > > + if (unlikely(nd->flags & LOOKUP_XDEV)) > > + return -EXDEV; > > /* we know that mountpoint was pinned */ > > nd->path.dentry = mountpoint; > > nd->path.mnt = &mparent->mnt; > > @@ -1482,6 +1509,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) > > return -ECHILD; > > if (!mounted) > > break; > > + if (unlikely(nd->flags & LOOKUP_XDEV)) > > + return -EXDEV; > > Are you sure these failure exits in follow_dotdot_rcu() won't give > suprious hard errors? I could switch to -ECHILD for the *_rcu() checks if you'd prefer that. Though, I'd have (probably naively) thought that you'd have already gotten -ECHILD from the seqlock checks if there was a race during ".." handling. > > + if (unlikely(nd->flags & LOOKUP_BENEATH)) { > > + error = dirfd_path_init(nd); > > + if (unlikely(error)) > > + return ERR_PTR(error); > > + nd->root = nd->path; > > + if (!(nd->flags & LOOKUP_RCU)) > > + path_get(&nd->root); > > + } > > if (*s == '/') { > > if (likely(!nd->root.mnt)) > > set_root(nd); > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > > s = ERR_PTR(error); > > return s; > > } > > - error = dirfd_path_init(nd); > > - if (unlikely(error)) > > - return ERR_PTR(error); > > + if (likely(!nd->path.mnt)) { > > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"? Yes. I did it to be more consistent with the other "have we got the root" checks elsewhere. Is there another way you'd prefer I do it?
On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote: > > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > > > s = ERR_PTR(error); > > > return s; > > > } > > > - error = dirfd_path_init(nd); > > > - if (unlikely(error)) > > > - return ERR_PTR(error); > > > + if (likely(!nd->path.mnt)) { > > > > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"? > > Yes. I did it to be more consistent with the other "have we got the > root" checks elsewhere. Is there another way you'd prefer I do it? "Have we got the root" checks are inevitable evil; here you are making the control flow in a single function hard to follow. I *think* what you are doing is absolute pathname, no LOOKUP_BENEATH: set_root error = nd_jump_root(nd) else error = dirfd_path_init(nd) return unlikely(error) ? ERR_PTR(error) : s; which should be a lot easier to follow (not to mention shorter), but I might be missing something in all of that.
On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote: > On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote: > > > > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > > > > s = ERR_PTR(error); > > > > return s; > > > > } > > > > - error = dirfd_path_init(nd); > > > > - if (unlikely(error)) > > > > - return ERR_PTR(error); > > > > + if (likely(!nd->path.mnt)) { > > > > > > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"? > > > > Yes. I did it to be more consistent with the other "have we got the > > root" checks elsewhere. Is there another way you'd prefer I do it? > > "Have we got the root" checks are inevitable evil; here you are making the > control flow in a single function hard to follow. > > I *think* what you are doing is > absolute pathname, no LOOKUP_BENEATH: > set_root > error = nd_jump_root(nd) > else > error = dirfd_path_init(nd) > return unlikely(error) ? ERR_PTR(error) : s; > which should be a lot easier to follow (not to mention shorter), but I might > be missing something in all of that. PS: if that's what's going on, I would be tempted to turn the entire path_init() part into this: if (flags & LOOKUP_BENEATH) while (*s == '/') s++; in the very beginning (plus the handling of nd_jump_root() prototype change, but that belongs with nd_jump_root() change itself, obviously). Again, I might be missing something here...
On Fri, Jul 12, 2019 at 01:55:52PM +0100, Al Viro wrote: > On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote: > > On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote: > > > > > > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > > > > > s = ERR_PTR(error); > > > > > return s; > > > > > } > > > > > - error = dirfd_path_init(nd); > > > > > - if (unlikely(error)) > > > > > - return ERR_PTR(error); > > > > > + if (likely(!nd->path.mnt)) { > > > > > > > > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"? > > > > > > Yes. I did it to be more consistent with the other "have we got the > > > root" checks elsewhere. Is there another way you'd prefer I do it? > > > > "Have we got the root" checks are inevitable evil; here you are making the > > control flow in a single function hard to follow. > > > > I *think* what you are doing is > > absolute pathname, no LOOKUP_BENEATH: > > set_root > > error = nd_jump_root(nd) > > else > > error = dirfd_path_init(nd) > > return unlikely(error) ? ERR_PTR(error) : s; > > which should be a lot easier to follow (not to mention shorter), but I might > > be missing something in all of that. > > PS: if that's what's going on, I would be tempted to turn the entire > path_init() part into this: > if (flags & LOOKUP_BENEATH) > while (*s == '/') > s++; > in the very beginning (plus the handling of nd_jump_root() prototype > change, but that belongs with nd_jump_root() change itself, obviously). > Again, I might be missing something here... Argh... I am, at that - you have setting path->root (and grabbing it) in LOOKUP_BENEATH cases and you do it after dirfd_path_init(). So how about if (flags & LOOKUP_BENEATH) while (*s == '/') s++; before the whole thing and if (*s == '/') { /* can happen only without LOOKUP_BENEATH */ set_root(nd); error = nd_jump_root(nd); if (unlikely(error)) return ERR_PTR(error); } else if (nd->dfd == AT_FDCWD) { if (flags & LOOKUP_RCU) { struct fs_struct *fs = current->fs; unsigned seq; do { seq = read_seqcount_begin(&fs->seq); nd->path = fs->pwd; nd->inode = nd->path.dentry->d_inode; nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); } while (read_seqcount_retry(&fs->seq, seq)); } else { get_fs_pwd(current->fs, &nd->path); nd->inode = nd->path.dentry->d_inode; } } else { /* Caller must check execute permissions on the starting path component */ struct fd f = fdget_raw(nd->dfd); struct dentry *dentry; if (!f.file) return ERR_PTR(-EBADF); dentry = f.file->f_path.dentry; if (*s && unlikely(!d_can_lookup(dentry))) { fdput(f); return ERR_PTR(-ENOTDIR); } nd->path = f.file->f_path; if (flags & LOOKUP_RCU) { nd->inode = nd->path.dentry->d_inode; nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq); } else { path_get(&nd->path); nd->inode = nd->path.dentry->d_inode; } fdput(f); } if (flags & LOOKUP_BENEATH) { nd->root = nd->path; if (!(flags & LOOKUP_RCU)) path_get(&nd->root); else nd->root_seq = nd->seq; } return s; replacing the part in the end? Makes for much smaller change; it might very well still make sense to add dirfd_path_init() as a separate cleanup (perhaps with the *s == '/' case included), though.
On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote: > if (flags & LOOKUP_BENEATH) { > nd->root = nd->path; > if (!(flags & LOOKUP_RCU)) > path_get(&nd->root); > else > nd->root_seq = nd->seq; BTW, this assignment is needed for LOOKUP_RCU case. Without it you are pretty much guaranteed that lazy pathwalk will fail, when it comes to complete_walk(). Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH combination would someday get passed?
On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote: > On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote: > > > if (flags & LOOKUP_BENEATH) { > > nd->root = nd->path; > > if (!(flags & LOOKUP_RCU)) > > path_get(&nd->root); > > else > > nd->root_seq = nd->seq; > > BTW, this assignment is needed for LOOKUP_RCU case. Without it > you are pretty much guaranteed that lazy pathwalk will fail, > when it comes to complete_walk(). > > Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH > combination would someday get passed? I don't understand what's going on with ->r_seq in there - your call of path_is_under() is after having (re-)sampled rename_lock, but if that was the only .. in there, who's going to recheck the value? For that matter, what's to guarantee that the thing won't get moved just as you are returning from handle_dots()? IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?
On Sat, Jul 13, 2019 at 03:41:53AM +0100, Al Viro wrote: > On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote: > > On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote: > > > > > if (flags & LOOKUP_BENEATH) { > > > nd->root = nd->path; > > > if (!(flags & LOOKUP_RCU)) > > > path_get(&nd->root); > > > else > > > nd->root_seq = nd->seq; > > > > BTW, this assignment is needed for LOOKUP_RCU case. Without it > > you are pretty much guaranteed that lazy pathwalk will fail, > > when it comes to complete_walk(). > > > > Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH > > combination would someday get passed? > > I don't understand what's going on with ->r_seq in there - your > call of path_is_under() is after having (re-)sampled rename_lock, > but if that was the only .. in there, who's going to recheck > the value? For that matter, what's to guarantee that the thing > won't get moved just as you are returning from handle_dots()? > > IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())? Sigh... Usual effects of trying to document things: 1) LOOKUP_NO_EVAL looks bogus. It had been introduced by commit 57d4657716ac (audit: ignore fcaps on umount) and AFAICS it's crap. It is set in ksys_umount() and nowhere else. It's ignored by everything except filename_mountpoint(). The thing is, call graph for filename_mountpoint() is filename_mountpoint() <- user_path_mountpoint_at() <- ksys_umount() <- kern_path_mountpoint() <- autofs_dev_ioctl_ismountpoint() <- find_autofs_mount() <- autofs_dev_ioctl_open_mountpoint() <- autofs_dev_ioctl_requester() <- autofs_dev_ioctl_ismountpoint() In other words, that flag is basically "was filename_mountpoint() been called by umount(2) or has it come from an autofs ioctl?". And looking at the rationale in that commit, autofs ioctls need it just as much as umount(2) does. Why is it not set for those as well? And why is it conditional at all? 1b) ... because audit_inode() wants LOOKUP_... as the last argument, only to remap it into AUDIT_..., that's why. So audit needs something guaranteed not to conflict with LOOKUP_PARENT (another flag getting remapped). So why do we bother with remapping those, anyway? Let's look at the callers: fs/namei.c:933: audit_inode(nd->name, nd->stack[0].link.dentry, 0); fs/namei.c:2353: audit_inode(name, path->dentry, flags & LOOKUP_PARENT); fs/namei.c:2394: audit_inode(name, parent->dentry, LOOKUP_PARENT); fs/namei.c:2721: audit_inode(name, path->dentry, flags & LOOKUP_NO_EVAL); fs/namei.c:3302: audit_inode(nd->name, dir, LOOKUP_PARENT); fs/namei.c:3336: audit_inode(nd->name, file->f_path.dentry, 0); fs/namei.c:3371: audit_inode(nd->name, path.dentry, 0); fs/namei.c:3389: audit_inode(nd->name, nd->path.dentry, 0); fs/namei.c:3490: audit_inode(nd->name, child, 0); fs/namei.c:3509: audit_inode(nd->name, path.dentry, 0); ipc/mqueue.c:788: audit_inode(name, dentry, 0); In all but two of those we have a nice constant value - 0 or AUDIT_INODE_PARENT. One of two exceptions is in filename_mountpoint(), and there we want unconditional AUDIT_INODE_NOEVAL (see above). What of the other? It's if (likely(!retval)) audit_inode(name, path->dentry, flags & LOOKUP_PARENT); in filename_lookup(). And that is bogus as well. filename_lookupat() would better *NOT* get LOOKUP_PARENT in flags. And it doesn't - not since commit 8bcb77fabd7c (namei: split off filename_lookupat() with LOOKUP_PARENT) back in 2015. In filename_parentat() introduced there we have audit_inode(name, parent->dentry, LOOKUP_PARENT); and at the same point the call in filename_lookupat() should've become audit_inode(name, path->dentry, 0); It hadn't; my fault. And after fixing that everything becomes nice and unconditional - the last argument of audit_inode() is always an AUDIT_... constant or zero. Moving AUDIT_... definitions outside of ifdef on CONFIG_AUDITSYSCALL, getting rid of remapping in audit_inode() and passing the right values in 3 callers that don't pass 0 and LOOKUP_NO_EVAL can go to hell. Any objections from audit folks? 2) comment in namei.h is seriously out of sync with reality. To quote: * - follow links at the end OK, that's LOOKUP_FOLLOW (1) * - require a directory ... and LOOKUP_DIRECTORY (2) * - ending slashes ok even for nonexistent files ... used to be about LOOKUP_CONTINUE (eight years dead now) * - internal "there are more path components" flag ... LOOKUP_PARENT (16) * - dentry cache is untrusted; force a real lookup ... LOOKUP_REVAL (32) * - suppress terminal automount ... used to be LOOKUP_NO_AUTOMOUNT (128), except that it's been replaced with LOOKUP_AUTOMOUNT (at 4) almost eight years ago. And the meaning of LOOKUP_AUTOMOUNT is opposite to the comment, of course. * - skip revalidation ... LOOKUP_NO_REVAL (128) * - don't fetch xattrs on audit_inode ... and that's about soon-to-be dead LOOKUP_NO_EVAL (256) Note that LOOKUP_RCU (at 64) is quietly skipped and so's the tail of the list. If not for "suppress terminal automount" bit, I wouldn't really care, but that one makes for a really nasty trap for readers. I'm going to convert that to (accurate) comments next to actual defines... 3) while looking through LOOKUP_AUTOMOUNT users, in aa_bind_mount() we have error = kern_path(dev_name, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT, &old_path); matching do_loopback(), while tomoyo_mount_acl() has if (!dev_name || kern_path(dev_name, LOOKUP_FOLLOW, &path)) { And yes, that *is* hit on mount --bind. As well as on new mounts, where apparmor (and bdev_lookup()) has plain LOOKUP_FOLLOW. ->sb_mount() is garbage by design (not the least because of the need to have pathname lookups in the first place, as well as having to duplicate the demultiplexing parts of do_mount() without fucking it up)...
On 2019-07-13, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote: > > On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote: > > > > > if (flags & LOOKUP_BENEATH) { > > > nd->root = nd->path; > > > if (!(flags & LOOKUP_RCU)) > > > path_get(&nd->root); > > > else > > > nd->root_seq = nd->seq; > > > > BTW, this assignment is needed for LOOKUP_RCU case. Without it > > you are pretty much guaranteed that lazy pathwalk will fail, > > when it comes to complete_walk(). > > > > Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH > > combination would someday get passed? > > I don't understand what's going on with ->r_seq in there - your > call of path_is_under() is after having (re-)sampled rename_lock, > but if that was the only .. in there, who's going to recheck > the value? For that matter, what's to guarantee that the thing > won't get moved just as you are returning from handle_dots()? > > IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())? I tried to explain this in the commit message for "namei: aggressively check for nd->root escape on ".." resolution", but I probably could've explained it better. The basic property being guaranteed by LOOKUP_IN_ROOT is that it will not result in resolution of a path component which was not inside the root of the dirfd tree at some point during resolution (and that all absolute symlink and ".." resolution will be done relative to the dirfd). This may smell slightly of chroot(2), because unfortunately it is a similar concept -- the reason for this is to allow for a more efficient way to safely resolve paths inside a rootfs than spawning a separate process to then pass back the fd to the caller. We don't want to do a path_is_under() check for every ".." (otherwise lookups have a quadratic slowdown when doing many ".."s), so I instead only do a check after a rename or a mount (which are the only operations which could change what ".." points to). And since we do the path_is_under() check if m_seq or r_seq need a retry, we can re-take them[+]. The main reason why I don't re-check path_is_under() after handle_dots() is that there is no way to be sure that a racing rename didn't happen after your last path_is_under() check. The rename could happen after the syscall returns, after all. So, the main purpose of the check is to make sure that a ".."s after a rename doesn't result in an escape. If the rename happens after we've traversed through a ".." that means that the ".." was inside the root in the first place (a root ".." is handled by follow_dotdot). If the rename happens after we've gone through handle_dots() and there is no subsequent ".." then to userspace it looks identical to the rename occurring after the syscall has returned. If there is a subsequent ".." after a racing rename then we may have moved into a path that wasn't path_is_under() and so we have to check it. The only way I could see you could solve the race completely is if you had a way for userspace to lock things from being able to be renamed (or MS_MOVE'd). And that feels like a really bad idea to me. [+]: You asked why don't I re-take m_seq. The reason is that I don't fully understand all the other m_seq checks being done during resolution, and we aren't definitely doing them all in handle_dots(). So I assumed re-taking it could result in me breaking RCU-walk which obviously would be bad. Since I am the only thing using nd->r_seq, I can re-take it without issue.
On 2019-07-12, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Jul 12, 2019 at 01:55:52PM +0100, Al Viro wrote: > > On Fri, Jul 12, 2019 at 01:39:24PM +0100, Al Viro wrote: > > > On Fri, Jul 12, 2019 at 08:57:45PM +1000, Aleksa Sarai wrote: > > > > > > > > > @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) > > > > > > s = ERR_PTR(error); > > > > > > return s; > > > > > > } > > > > > > - error = dirfd_path_init(nd); > > > > > > - if (unlikely(error)) > > > > > > - return ERR_PTR(error); > > > > > > + if (likely(!nd->path.mnt)) { > > > > > > > > > > Is that a weird way of saying "if we hadn't already called dirfd_path_init()"? > > > > > > > > Yes. I did it to be more consistent with the other "have we got the > > > > root" checks elsewhere. Is there another way you'd prefer I do it? > > > > > > "Have we got the root" checks are inevitable evil; here you are making the > > > control flow in a single function hard to follow. > > > > > > I *think* what you are doing is > > > absolute pathname, no LOOKUP_BENEATH: > > > set_root > > > error = nd_jump_root(nd) > > > else > > > error = dirfd_path_init(nd) > > > return unlikely(error) ? ERR_PTR(error) : s; > > > which should be a lot easier to follow (not to mention shorter), but I might > > > be missing something in all of that. > > > > PS: if that's what's going on, I would be tempted to turn the entire > > path_init() part into this: > > if (flags & LOOKUP_BENEATH) > > while (*s == '/') > > s++; > > in the very beginning (plus the handling of nd_jump_root() prototype > > change, but that belongs with nd_jump_root() change itself, obviously). > > Again, I might be missing something here... > > Argh... I am, at that - you have setting path->root (and grabbing it) > in LOOKUP_BENEATH cases and you do it after dirfd_path_init(). So > how about > if (flags & LOOKUP_BENEATH) > while (*s == '/') > s++; I can do this for LOOKUP_IN_ROOT, but currently the semantics for LOOKUP_BENEATH is that absolute paths will return -EXDEV indiscriminately (nd_jump_root() errors out with LOOKUP_BENEATH). To be honest, the check could actually just be: if (flags & LOOKUP_BENEATH) if (*s == '/') return ERR_PTR(-EXDEV); (Though we'd still need -EXDEV in nd_jump_root() for obvious reasons.) The logic being that an absolute path means that the resolution starts out without being "beneath" the starting point -- thus violating the contract of LOOKUP_BENEATH. And since the "handle absolute paths like they're scoped to the root" is only implemented for LOOKUP_IN_ROOT, I'd think it's a bit odd to have LOOKUP_BENEATH do it too for absolute paths. I'll be honest, this patchset is more confusing to both of us because of LOOKUP_BENEATH -- I've only kept it since it was part of the original patchset (O_BENEATH). Personally I think more people will be far more interested in LOOKUP_IN_ROOT. Does anyone mind if I drop the LOOKUP_BENEATH parts of this series, and only keep LOOKUP_NO_* and LOOKUP_IN_ROOT? I make a change as you outlined for LOOKUP_IN_ROOT, though.
On Sun, Jul 14, 2019 at 05:00:29PM +1000, Aleksa Sarai wrote: > The basic property being guaranteed by LOOKUP_IN_ROOT is that it will > not result in resolution of a path component which was not inside the > root of the dirfd tree at some point during resolution (and that all > absolute symlink and ".." resolution will be done relative to the > dirfd). This may smell slightly of chroot(2), because unfortunately it > is a similar concept -- the reason for this is to allow for a more > efficient way to safely resolve paths inside a rootfs than spawning a > separate process to then pass back the fd to the caller. IDGI... If attacker can modify your subtree, you have already lost - after all, they can make anything appear inside that tree just before your syscall is made and bring it back out immediately afterwards. And if they can't, what is the race you are trying to protect against? Confused...
On 2019-07-14, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Sat, Jul 13, 2019 at 03:41:53AM +0100, Al Viro wrote: > > On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote: > > > On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote: > > > > > > > if (flags & LOOKUP_BENEATH) { > > > > nd->root = nd->path; > > > > if (!(flags & LOOKUP_RCU)) > > > > path_get(&nd->root); > > > > else > > > > nd->root_seq = nd->seq; > > > > > > BTW, this assignment is needed for LOOKUP_RCU case. Without it > > > you are pretty much guaranteed that lazy pathwalk will fail, > > > when it comes to complete_walk(). > > > > > > Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH > > > combination would someday get passed? > > > > I don't understand what's going on with ->r_seq in there - your > > call of path_is_under() is after having (re-)sampled rename_lock, > > but if that was the only .. in there, who's going to recheck > > the value? For that matter, what's to guarantee that the thing > > won't get moved just as you are returning from handle_dots()? > > > > IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())? > > Sigh... Usual effects of trying to document things: > > 1) LOOKUP_NO_EVAL looks bogus. It had been introduced by commit 57d4657716ac > (audit: ignore fcaps on umount) and AFAICS it's crap. It is set in > ksys_umount() and nowhere else. It's ignored by everything except > filename_mountpoint(). The thing is, call graph for filename_mountpoint() > is > filename_mountpoint() > <- user_path_mountpoint_at() > <- ksys_umount() > <- kern_path_mountpoint() > <- autofs_dev_ioctl_ismountpoint() > <- find_autofs_mount() > <- autofs_dev_ioctl_open_mountpoint() > <- autofs_dev_ioctl_requester() > <- autofs_dev_ioctl_ismountpoint() > In other words, that flag is basically "was filename_mountpoint() > been called by umount(2) or has it come from an autofs ioctl?". > And looking at the rationale in that commit, autofs ioctls need > it just as much as umount(2) does. Why is it not set for those > as well? And why is it conditional at all? In addition, LOOKUP_NO_EVAL == LOOKUP_OPEN (0x100). Is that meant to be the case? Also I just saw you have a patch in work.namei that fixes this up -- do you want me to rebase on top of that?
On 2019-07-14, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Sun, Jul 14, 2019 at 05:00:29PM +1000, Aleksa Sarai wrote: > > The basic property being guaranteed by LOOKUP_IN_ROOT is that it will > > not result in resolution of a path component which was not inside the > > root of the dirfd tree at some point during resolution (and that all > > absolute symlink and ".." resolution will be done relative to the > > dirfd). This may smell slightly of chroot(2), because unfortunately it > > is a similar concept -- the reason for this is to allow for a more > > efficient way to safely resolve paths inside a rootfs than spawning a > > separate process to then pass back the fd to the caller. > > IDGI... If attacker can modify your subtree, you have already lost - > after all, they can make anything appear inside that tree just before > your syscall is made and bring it back out immediately afterwards. > And if they can't, what is the race you are trying to protect against? > Confused... I'll be honest, this code mostly exists because Jann Horn said that it was necessary in order for this interface to be safe against those kinds of attacks. Though, it's also entirely possible I just am mis-remembering the attack scenario he described when I posted v1 of this series last year. The use-case I need this functionality for (as do other container runtimes) is one where you are trying to safely interact with a directory tree that is a (malicious) container's root filesystem -- so the container won't be able to move the directory tree root, nor can they move things outside the rootfs into it (or the reverse). Users dealing with FTP, web, or file servers probably have similar requirements. There is an obvious race condition if you allow the attacker to move the root (I give an example and test-case of it in the last patch in the series), and given that it is fairly trivial to defend against I don't see the downside in including it? But it's obviously your call -- and maybe Jann Horn can explain the reasoning behind this much better than I can.
diff --git a/fs/namei.c b/fs/namei.c index b490bcf855f8..9c3ed597466b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -852,6 +852,13 @@ static inline void path_to_nameidata(const struct path *path, static int nd_jump_root(struct nameidata *nd) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; + if (unlikely(nd->flags & LOOKUP_XDEV)) { + /* Absolute path arguments to path_init() are allowed. */ + if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt) + return -EXDEV; + } if (nd->flags & LOOKUP_RCU) { struct dentry *d; nd->path = nd->root; @@ -1104,6 +1111,9 @@ const char *get_link(struct nameidata *nd, bool trailing) int error; const char *res; + if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS)) + return ERR_PTR(-ELOOP); + if (!(nd->flags & LOOKUP_RCU)) { touch_atime(&last->link); cond_resched(); @@ -1136,6 +1146,11 @@ const char *get_link(struct nameidata *nd, bool trailing) } /* If we just jumped it was because of a magic-link. */ if (unlikely(nd->flags & LOOKUP_JUMPED)) { + if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS)) + return ERR_PTR(-ELOOP); + /* Not currently safe. */ + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return ERR_PTR(-EXDEV); /* * For trailing_symlink we check whether the symlink's * mode allows us to do what we want through acc_mode. @@ -1178,8 +1193,9 @@ const char *get_link(struct nameidata *nd, bool trailing) if (*res == '/') { if (!nd->root.mnt) set_root(nd); - if (unlikely(nd_jump_root(nd))) - return ERR_PTR(-ECHILD); + error = nd_jump_root(nd); + if (unlikely(error)) + return ERR_PTR(error); while (unlikely(*++res == '/')) ; } @@ -1360,12 +1376,16 @@ static int follow_managed(struct path *path, struct nameidata *nd) break; } - if (need_mntput && path->mnt == mnt) - mntput(path->mnt); + if (need_mntput) { + if (path->mnt == mnt) + mntput(path->mnt); + if (unlikely(nd->flags & LOOKUP_XDEV)) + ret = -EXDEV; + else + nd->flags |= LOOKUP_JUMPED; + } if (ret == -EISDIR || !ret) ret = 1; - if (need_mntput) - nd->flags |= LOOKUP_JUMPED; if (unlikely(ret < 0)) path_put_conditional(path, nd); return ret; @@ -1422,6 +1442,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path, mounted = __lookup_mnt(path->mnt, path->dentry); if (!mounted) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return false; path->mnt = &mounted->mnt; path->dentry = mounted->mnt.mnt_root; nd->flags |= LOOKUP_JUMPED; @@ -1442,8 +1464,11 @@ static int follow_dotdot_rcu(struct nameidata *nd) struct inode *inode = nd->inode; while (1) { - if (path_equal(&nd->path, &nd->root)) + if (path_equal(&nd->path, &nd->root)) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; break; + } if (nd->path.dentry != nd->path.mnt->mnt_root) { struct dentry *old = nd->path.dentry; struct dentry *parent = old->d_parent; @@ -1468,6 +1493,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (&mparent->mnt == nd->path.mnt) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return -EXDEV; /* we know that mountpoint was pinned */ nd->path.dentry = mountpoint; nd->path.mnt = &mparent->mnt; @@ -1482,6 +1509,8 @@ static int follow_dotdot_rcu(struct nameidata *nd) return -ECHILD; if (!mounted) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return -EXDEV; nd->path.mnt = &mounted->mnt; nd->path.dentry = mounted->mnt.mnt_root; inode = nd->path.dentry->d_inode; @@ -1570,8 +1599,11 @@ static int path_parent_directory(struct path *path) static int follow_dotdot(struct nameidata *nd) { while(1) { - if (path_equal(&nd->path, &nd->root)) + if (path_equal(&nd->path, &nd->root)) { + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; break; + } if (nd->path.dentry != nd->path.mnt->mnt_root) { int ret = path_parent_directory(&nd->path); if (ret) @@ -1580,6 +1612,8 @@ static int follow_dotdot(struct nameidata *nd) } if (!follow_up(&nd->path)) break; + if (unlikely(nd->flags & LOOKUP_XDEV)) + return -EXDEV; } follow_mount(&nd->path); nd->inode = nd->path.dentry->d_inode; @@ -1794,6 +1828,13 @@ static inline int may_lookup(struct nameidata *nd) static inline int handle_dots(struct nameidata *nd, int type) { if (type == LAST_DOTDOT) { + /* + * LOOKUP_BENEATH resolving ".." is not currently safe -- races can + * cause our parent to have moved outside of the root and us to skip + * over it. + */ + if (unlikely(nd->flags & LOOKUP_BENEATH)) + return -EXDEV; if (!nd->root.mnt) set_root(nd); if (nd->flags & LOOKUP_RCU) { @@ -2342,6 +2383,15 @@ static const char *path_init(struct nameidata *nd, unsigned flags) nd->path.dentry = NULL; nd->m_seq = read_seqbegin(&mount_lock); + + if (unlikely(nd->flags & LOOKUP_BENEATH)) { + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + nd->root = nd->path; + if (!(nd->flags & LOOKUP_RCU)) + path_get(&nd->root); + } if (*s == '/') { if (likely(!nd->root.mnt)) set_root(nd); @@ -2350,9 +2400,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags) s = ERR_PTR(error); return s; } - error = dirfd_path_init(nd); - if (unlikely(error)) - return ERR_PTR(error); + if (likely(!nd->path.mnt)) { + error = dirfd_path_init(nd); + if (unlikely(error)) + return ERR_PTR(error); + } return s; } diff --git a/include/linux/namei.h b/include/linux/namei.h index 9138b4471dbf..7bc819ad0cd3 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -50,6 +50,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; #define LOOKUP_EMPTY 0x4000 #define LOOKUP_DOWN 0x8000 +/* Scoping flags for lookup. */ +#define LOOKUP_BENEATH 0x010000 /* No escaping from starting point. */ +#define LOOKUP_XDEV 0x020000 /* No mountpoint crossing. */ +#define LOOKUP_NO_MAGICLINKS 0x040000 /* No /proc/$pid/fd/ "symlink" crossing. */ +#define LOOKUP_NO_SYMLINKS 0x080000 /* No symlink crossing *at all*. + Implies LOOKUP_NO_MAGICLINKS. */ + extern int path_pts(struct path *path); extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
Add the following flags to allow various restrictions on path resolution (these affect the *entire* resolution, rather than just the final path component -- as is the case with most other AT_* flags). The primary justification for these flags is to allow for programs to be far more strict about how they want path resolution to handle symlinks, mountpoint crossings, and paths that escape the dirfd (through an absolute path or ".." shenanigans). This is of particular concern to container runtimes that want to be very careful about malicious root filesystems that a container's init might have screwed around with (and there is no real way to protect against this in userspace if you consider potential races against a malicious container's init). More classical applications (which have their own potentially buggy userspace path sanitisation code) include web servers, archive extraction tools, network file servers, and so on. These flags are exposed to userspace in a later patchset. * LOOKUP_XDEV: Disallow mount-point crossing (both *down* into one, or *up* from one). The primary "scoping" use is to blocking resolution that crosses a bind-mount, which has a similar property to a symlink (in the way that it allows for escape from the starting-point). Since it is not possible to differentiate bind-mounts However since bind-mounting requires privileges (in ways symlinks don't) this has been split from LOOKUP_BENEATH. The naming is based on "find -xdev" as well as -EXDEV (though find(1) doesn't walk upwards, the semantics seem obvious). * LOOKUP_NO_MAGICLINKS: Disallows ->get_link "symlink" jumping. This is a very specific restriction, and it exists because /proc/$pid/fd/... "symlinks" allow for access outside nd->root and pose risk to container runtimes that don't want to be tricked into accessing a host path (but do want to allow no-funny-business symlink resolution). * LOOKUP_NO_SYMLINKS: Disallows symlink jumping *of any kind*. Implies LOOKUP_NO_MAGICLINKS (obviously). * LOOKUP_BENEATH: Disallow "escapes" from the starting point of the filesystem tree during resolution (you must stay "beneath" the starting point at all times). Currently this is done by disallowing ".." and absolute paths (either in the given path or found during symlink resolution) entirely, as well as all magic-link jumping. The wholesale banning of ".." is because it is currently not safe to allow ".." resolution (races can cause the path to be moved outside of the root -- this is conceptually similar to historical chroot(2) escape attacks). Future patches in this series will address this, and will re-enable ".." resolution once it is safe. With those patches, ".." resolution will only be allowed if it remains in the root throughout resolution (such as "a/../b" not "a/../../outside/b"). The banning of magic-link jumping is done because it is not clear whether semantically they should be allowed -- while some magic-links are safe there are many that can cause escapes (and once a resolution is outside of the root, O_BENEATH will no longer detect it). Future patches may re-enable magic-link jumping when such jumps would remain inside the root. The LOOKUP_NO_*LINK flags return -ELOOP if path resolution would violates their requirement, while the others all return -EXDEV. This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation on David Drysdale's O_BENEATH patchset[2], which in turn was based on the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS thread[4] determined most of the API changes made in this refresh. [1]: https://lwn.net/Articles/721443/ [2]: https://lwn.net/Articles/619151/ [3]: https://lwn.net/Articles/603929/ [4]: https://lwn.net/Articles/723057/ Cc: Christian Brauner <christian@brauner.io> Suggested-by: David Drysdale <drysdale@google.com> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Andy Lutomirski <luto@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- fs/namei.c | 74 ++++++++++++++++++++++++++++++++++++------- include/linux/namei.h | 7 ++++ 2 files changed, 70 insertions(+), 11 deletions(-)