Message ID | 20210121131959.646623-1-christian.brauner@ubuntu.com (mailing list archive) |
---|---|
Series | idmapped mounts |
On Thu, Jan 21, 2021 at 02:19:19PM +0100, Christian Brauner wrote:
> Hey everyone,
>
> The only major change is the updated version of hch's patch to port xfs
> to support idmapped mounts. Thanks again to Christoph for doing that
> work.
> (Otherwise Acked-bys and Reviewed-bys were added and the tree reordered
> to decouple filesystem specific conversion from the vfs work so they
> can proceed independently.
> For a full list of major changes between versions see the end of this
> cover letter. Please also note the large xfstests testsuite in patch 42
> that has been kept as part of this series. It verifies correct vfs
> behavior with and without idmapped mounts including covering newer vfs
> features such as io_uring.
> I currently still plan to target the v5.12 merge window.)
>
> With this patchset we make it possible to attach idmappings to mounts,
> i.e. simply put different bind mounts can expose the same file or
> directory with different ownership.
> Shifting of ownership on a per-mount basis handles a wide range of
> long standing use-cases. Here are just a few:
> - Shifting of a subset of ownership-less filesystems (vfat) for use by
>   multiple users, effectively allowing for DAC on such devices
>   (systemd, Android, ...)
> - Allow remapping uid/gid on external filesystems or paths (USB sticks,
>   network filesystem, ...) to match the local system's user and groups.
>   (David Howells intends to port AFS as a first candidate.)
> - Shifting of a container rootfs or base image without having to mangle
>   every file (runc, Docker, containerd, k8s, LXD, systemd ...)
> - Sharing of data between host or privileged containers with
>   unprivileged containers (runC, Docker, containerd, k8s, LXD, ...)
> - Data sharing between multiple user namespaces with incompatible maps
>   (LXD, k8s, ...)
>
> There has been significant interest in this patchset as evidenced by
> users commenting on previous versions of this patchset.
> They include containerd, ChromeOS, systemd, LXD and a range of others.
> There is already a patchset up for containerd, the default Kubernetes
> container runtime https://github.com/containerd/containerd/pull/4734
> to make use of this. systemd intends to use it in their systemd-homed
> implementation for portable home directories. ChromeOS wants to make use
> of it to share data between the host and the Linux containers they run
> on Chrome- and Pixelbooks. There are also a few talks by people who
> are going to make use of this. The most recent one was a CNCF webinar
> https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
> and an upcoming talk during FOSDEM.
> (Fwiw, for fun and since I wanted to do this for a long time I've ported
> my home directory to be completely portable with a simple service file
> that now mounts my home directory on an ext4 formatted usb stick with
> an id mapping mapping all files to the random uid I'm assigned at
> login.)
>
> Making it possible to share directories and mounts between users with
> different uids and gids is itself quite an important use-case in
> distributed systems environments. It's of course especially useful in
> general for portable usb sticks, sharing data between multiple users,
> and sharing home directories between multiple users. The last example is
> now elegantly expressed in systemd's homed concept for portable home
> directories. As mentioned above, idmapped mounts also allow data from
> the host to be shared with unprivileged containers, between privileged
> and unprivileged containers simultaneously and in addition also between
> unprivileged containers with different idmappings whenever they are used
> to isolate one container completely from another container.
>
> We have implemented and proposed multiple solutions to this before.
> This included the introduction of fsid mappings, a tiny filesystem I've
> authored with Seth Forshee that is currently carried in Ubuntu that has
> been shown to be the wrong approach, and the conceptual hack of calling
> override creds directly in the vfs. In addition to some of these
> solutions being hacky, none of these solutions have covered all of the
> above use-cases.
>
> Idmappings become a property of struct vfsmount instead of tying it to a
> process being inside of a user namespace which has been the case for all
> other proposed approaches. It also allows passing the user namespace
> down into the filesystems which is a clean way instead of violating
> calling conventions by strapping the user namespace information that is
> a property of the mount to the caller's credentials or similar hacks.
> Each mount can have a separate idmapping and idmapped mounts can even be
> created in the initial user namespace unblocking a range of use-cases.
>
> To this end the vfsmount struct gains a new struct user_namespace
> member. The idmapping of the user namespace becomes the idmapping of the
> mount. A caller that is privileged with respect to the user namespace of
> the superblock of the underlying filesystem can create an idmapped
> mount. In the future, we can enable unprivileged use-cases by checking
> whether the caller is privileged wrt the user namespace that an
> already idmapped mount has been marked with, allowing them to change the
> idmapping. For now, keep things simple until the need arises.
> Note that with syscall interception it is already possible to intercept
> idmapped mount requests from unprivileged containers and handle them in
> a sufficiently privileged container manager. Support for this is already
> available in LXD and will be available in runC where syscall
> interception is currently in the process of becoming part of the runtime
> spec: https://github.com/opencontainers/runtime-spec/pull/1074.
>
> The user namespace the mount will be marked with can be specified by
> passing a file descriptor referring to the user namespace as an argument
> to the new mount_setattr() syscall together with the new
> MOUNT_ATTR_IDMAP flag. By default vfsmounts are marked with the initial
> user namespace and no behavioral or performance changes are observed.
> All mapping operations are nops for the initial user namespace. When a
> file/inode is accessed through an idmapped mount the i_uid and i_gid of
> the inode will be remapped according to the user namespace the mount has
> been marked with.
>
> In order to support idmapped mounts, filesystems need to be changed and
> mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The initial
> version contains fat, ext4, and xfs including a list of examples.
> But patches for other filesystems are actively worked on and will be
> sent out separately. We are here to see this through and there are
> multiple people involved in converting filesystems. So filesystem
> developers are not left alone with this and are provided with a large
> testsuite to verify that their port is correct.
>
> There is a simple tool available at
> https://github.com/brauner/mount-idmapped that allows creating idmapped
> mounts so people can play with this patch series. Here are a few
> illustrations:
>
> 1. Create a simple idmapped mount of another user's home directory
>
> u1001@f2-vm:/$ sudo ./mount-idmapped --map-mount b:1000:1001:1 /home/ubuntu/ /mnt
> u1001@f2-vm:/$ ls -al /home/ubuntu/
> total 28
> drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
> drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
> -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
> -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
> -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
> -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
> -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
> -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

So I assume this falls under the buyer beware warning, but it's probably
important to warn people loudly of the fact that, at this point, the user
with uid 1001 can chmod u+s any binary under /mnt and then run it from
/home/ubuntu with euid=1000. In other words, while this has excellent
uses, if you *can* use shared group membership, you should :)

Very cool though.
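
To make the warning above concrete, here is a minimal userspace sketch of the translation that a mapping like b:1000:1001:1 implies. It assumes the triple means "on-disk id 1000 appears as id 1001 through the mount, for a range of one id", which is how the reply above reads it. The struct and function names (id_extent, disk_to_mount, mount_to_disk, ID_UNMAPPED) are made up for illustration and are not kernel or mount-idmapped APIs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of a single idmapping extent.
 * Illustration only; not the kernel's data structure or API.
 */
struct id_extent {
	uint32_t disk_first;   /* first id as stored in the inode */
	uint32_t mount_first;  /* first id as seen through the mount */
	uint32_t count;        /* number of consecutive ids covered */
};

/* Sentinel for "no mapping exists" in this sketch. */
#define ID_UNMAPPED UINT32_MAX

/* Reading: translate the on-disk owner to what the mount reports. */
uint32_t disk_to_mount(const struct id_extent *e, uint32_t id)
{
	if (id >= e->disk_first && id - e->disk_first < e->count)
		return e->mount_first + (id - e->disk_first);
	return ID_UNMAPPED;
}

/* Writing: translate a caller's id on the mount back to the on-disk id. */
uint32_t mount_to_disk(const struct id_extent *e, uint32_t id)
{
	if (id >= e->mount_first && id - e->mount_first < e->count)
		return e->disk_first + (id - e->mount_first);
	return ID_UNMAPPED;
}

/* Assumed reading of "b:1000:1001:1": on-disk 1000 appears as 1001. */
const struct id_extent example_map = { 1000, 1001, 1 };

/* Walks through the situation described in the reply above. */
void idmap_example(void)
{
	/* Through /mnt, files owned by 1000 on disk look owned by 1001 ... */
	assert(disk_to_mount(&example_map, 1000) == 1001);
	/* ... so uid 1001 acts as owner there, and its changes land on 1000. */
	assert(mount_to_disk(&example_map, 1001) == 1000);
}
```

Both paths refer to the same inodes, so a mode change made by uid 1001 through /mnt (such as setting the setuid bit) is visible on the file that /home/ubuntu still shows as owned by uid 1000.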
Christian Brauner <christian.brauner@ubuntu.com> wrote:

> In order to answer this more confidently I need to know a bit more about
> how cachefiles are supposed to work.
>
> From what I gather here it seems that what this code is trying to set
> is an internal "CacheFiles.cache" extended attribute on the inode. This
> extended attribute doesn't store any uids and gids or filesystem
> capabilities so the user namespace isn't relevant for that since there
> doesn't need to be any conversion.
>
> What I need to know is what information do you use for cachefiles to
> determine whether someone can set that "CacheFiles.cache" extended
> attribute on the inode:
> - Is it the mnt_userns of a/the mount of the filesystem you're caching for?
> - The mnt_userns of the mnt of struct cachefiles_cache?
> - Or the stashed or current creds of the caller?

Mostly it's about permission checking. The cache driver wants to do accesses
onto the files in cache using the context of whatever process writes the
"bind" command to /dev/cachefiles, not the context of whichever process issued
a read or write, say, on an NFS file that is being cached.

This causes standard UNIX perm checking, SELinux checking, etc. all to be
switched to the appropriate context. It also controls what appears in the
audit logs.

There is an exception to this: It also governs the ownership of new files and
directories created in the cache and what security labels will be set on them.

Quite possibly this doesn't matter for the xattr stuff. It's hard to tell
since we use user namespaces to convey so many different things at different
times.

David
On Wed, Mar 03, 2021 at 02:45:07PM +0000, David Howells wrote:
> Christian Brauner <christian.brauner@ubuntu.com> wrote:
>
> > In order to answer this more confidently I need to know a bit more about
> > how cachefiles are supposed to work.
> >
> > From what I gather here it seems that what this code is trying to set
> > is an internal "CacheFiles.cache" extended attribute on the inode. This
> > extended attribute doesn't store any uids and gids or filesystem
> > capabilities so the user namespace isn't relevant for that since there
> > doesn't need to be any conversion.
> >
> > What I need to know is what information do you use for cachefiles to
> > determine whether someone can set that "CacheFiles.cache" extended
> > attribute on the inode:
> > - Is it the mnt_userns of a/the mount of the filesystem you're caching for?
> > - The mnt_userns of the mnt of struct cachefiles_cache?
> > - Or the stashed or current creds of the caller?
>
> Mostly it's about permission checking. The cache driver wants to do accesses
> onto the files in cache using the context of whatever process writes the
> "bind" command to /dev/cachefiles, not the context of whichever process issued
> a read or write, say, on an NFS file that is being cached.
>
> This causes standard UNIX perm checking, SELinux checking, etc. all to be
> switched to the appropriate context. It also controls what appears in the
> audit logs.

(Audit always translates from and to init_user_ns. The changes to make it
aware of user namespaces proper are delayed until the audit id thing is
merged as Paul pointed out to me.)

>
> There is an exception to this: It also governs the ownership of new files and
> directories created in the cache and what security labels will be set on them.

So from our offline discussion I gather that cachefilesd creates a cache on a
local filesystem (ext4, xfs etc.) for a network filesystem.
The way this is done is by writing "bind" to /dev/cachefiles and pointing it
to a directory to use as the cache.

This directory can currently also be an idmapped mount, say:

mount --bind --idmap /mnt /mnt

and then pointing cachefilesd via a "bind" operation to

/mnt

What I would expect is for cachefilesd to now take that idmapping into account
when creating files in /mnt but as it stands now, it doesn't. This could leave
users confused as the ownership of the files wouldn't match what they
expressed in the idmapping. Since you're reworking cachefilesd currently
anyway, I would suggest we port cachefilesd to support idmapped mounts once as
part of your rework. I can help there and until then we do:

diff --git a/fs/cachefiles/bind.c b/fs/cachefiles/bind.c
index dfb14dbddf51..51f21beafad9 100644
--- a/fs/cachefiles/bind.c
+++ b/fs/cachefiles/bind.c
@@ -115,6 +115,12 @@ static int cachefiles_daemon_add_cache(struct cachefiles_cache *cache)
 	if (ret < 0)
 		goto error_open_root;
 
+	if (mnt_user_ns(path.mnt) != &init_user_ns) {
+		ret = -EPERM;
+		pr_err("Caches on idmapped mounts are currently not supported\n");
+		goto error_open_root;
+	}
+
 	cache->mnt = path.mnt;
 	root = path.dentry;
 

This is safe to do because if a mount is visible in the filesystem it can't
change its idmapping.

(Might even be worth it if you add a helper at this point:

static inline bool mnt_is_idmapped(struct vfsmount *mnt)
{
	return mnt_user_ns(mnt) != &init_user_ns;
}
)

Christian