Message ID | 20200218143411.2389182-1-christian.brauner@ubuntu.com (mailing list archive) |
---|---|
Series | user_namespace: introduce fsid mappings |
On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
> In the usual case of running an unprivileged container we will have
> set up an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> correspond to this id mapping, i.e. all files which we want to appear
> as 0:0 inside the user namespace will be chowned to 100000:100000 on
> the host. This works, because whenever the kernel needs to do a
> filesystem access it will look up the corresponding uid and gid in the
> idmapping tables of the container. Now think about the case where we
> want to have an id mapping of 0 100000 100000 but an on-disk mapping
> of 0 300000 100000 which is needed to e.g. share a single on-disk
> mapping with multiple containers that all have different id mappings.
> This will be problematic. Whenever a filesystem access is requested,
> the kernel will now try to look up a mapping for 300000 in the id
> mapping tables of the user namespace but since there is none the
> files will appear to be owned by the overflow id, i.e. usually
> 65534:65534 or nobody:nogroup.
>
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now look up the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping
> exists, the corresponding files will have correct ownership.

So I compiled this in order to run the shiftfs tests over it to see how
it coped with the various corner cases. However, what I find is that it
simply fails the fsid reverse mapping in the setup.

Trying to use a simple uid map of 0 100000 1000 and an fsid map of
100000 0 1000 fails the entry setuid(0) call because of this code:

    long __sys_setuid(uid_t uid)
    {
            struct user_namespace *ns = current_user_ns();
            const struct cred *old;
            struct cred *new;
            int retval;
            kuid_t kuid;
            kuid_t kfsuid;

            kuid = make_kuid(ns, uid);
            if (!uid_valid(kuid))
                    return -EINVAL;

            kfsuid = make_kfsuid(ns, uid);
            if (!uid_valid(kfsuid))
                    return -EINVAL;

which means you can't have an fsid mapping that doesn't have the same
domain as the uid mapping, meaning a reverse mapping isn't possible
because the range and domain have to be inverse and disjoint.

James

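The failure described above follows from how an ID map lookup works. The sketch below is a user-space model, not the kernel's code: `struct id_extent`, `map_id_sketch()` and `setuid_check_sketch()` are hypothetical names, and only the two validity checks from the quoted excerpt are modeled. It assumes each map entry has the usual three columns (first ID inside the namespace, first ID outside, length).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One map entry, in the column order used when writing an ID map. */
struct id_extent {
    uint32_t first;       /* first ID inside the namespace */
    uint32_t lower_first; /* first ID outside the namespace */
    uint32_t count;       /* number of IDs covered */
};

#define INVALID_ID UINT32_MAX

/* Translate a namespace ID to an outside ID, or INVALID_ID if unmapped. */
static uint32_t map_id_sketch(const struct id_extent *map, size_t n,
                              uint32_t id)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (id >= map[i].first && id - map[i].first < map[i].count)
            return map[i].lower_first + (id - map[i].first);
    }
    return INVALID_ID;
}

/*
 * Model of the two checks in the quoted excerpt: the uid must be
 * translatable through BOTH the uid map and the fsid map.
 */
static int setuid_check_sketch(const struct id_extent *uid_map, size_t nu,
                               const struct id_extent *fsid_map, size_t nf,
                               uint32_t uid)
{
    if (map_id_sketch(uid_map, nu, uid) == INVALID_ID)
        return -EINVAL;
    if (map_id_sketch(fsid_map, nf, uid) == INVALID_ID)
        return -EINVAL;
    return 0;
}
```

With a uid map of 0 100000 1000 and an fsid map of 100000 0 1000, uid 0 is outside the fsid map's first column range (100000 to 100999), so the second check fails. An fsid map whose first column also starts at 0, such as 0 300000 1000, passes both checks.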
On Tue, Feb 18, 2020 at 03:50:56PM -0800, James Bottomley wrote:
> On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
> > In the usual case of running an unprivileged container we will have
> > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> > correspond to this id mapping, i.e. all files which we want to appear
> > as 0:0 inside the user namespace will be chowned to 100000:100000 on
> > the host. This works, because whenever the kernel needs to do a
> > filesystem access it will lookup the corresponding uid and gid in the
> > idmapping tables of the container. Now think about the case where we
> > want to have an id mapping of 0 100000 100000 but an on-disk mapping
> > of 0 300000 100000 which is needed to e.g. share a single on-disk
> > mapping with multiple containers that all have different id mappings.
> > This will be problematic. Whenever a filesystem access is requested,
> > the kernel will now try to lookup a mapping for 300000 in the id
> > mapping tables of the user namespace but since there is none the
> > files will appear to be owned by the overflow id, i.e. usually
> > 65534:65534 or nobody:nogroup.
> >
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping
> > exists, the corresponding files will have correct ownership.
>
> So I did compile this up in order to run the shiftfs tests over it to
> see how it coped with the various corner cases. However, what I find
> is it simply fails the fsid reverse mapping in the setup. Trying to
> use a simple uid of 0 100000 1000 and a fsid of 100000 0 1000 fails the
> entry setuid(0) call because of this code:

This is easy to fix. But what's the exact use-case?

On Tue, Feb 18, 2020 at 3:35 PM Christian Brauner <christian.brauner@ubuntu.com> wrote:
[...]
> - Let the keyctl infrastructure only operate on kfsid which are always
>   mapped/looked up in the id mappings similar to what we do for
>   filesystems that have the same superblock visible in multiple user
>   namespaces.
>
> This version also comes with minimal tests which I intend to expand in
> the future.
>
> From pings and off-list questions and discussions at Google Container
> Security Summit there seems to be quite a lot of interest in this
> patchset with use-cases ranging from layer sharing for app containers
> and k8s, as well as data sharing between containers with different id
> mappings. I haven't Cced all people because I don't have all the email
> addresses at hand but I've at least added Phil now. :)
>
> This is the implementation of shiftfs which was cooked up during lunch at
> Linux Plumbers 2019 the day after the containers microconference. The
> idea is a design-stew from Stéphane, Aleksa, Eric, and myself (and by
> now also Jann).
> Back then we all were quite busy with other work and couldn't really sit
> down and implement it. But I took a few days last week to do this work,
> including demos and performance testing.
> This implementation does not require us to touch the VFS substantially
> at all. Instead, we implement shiftfs via fsid mappings.
> With this patch, it took me 20 mins to port both LXD and LXC to support
> shiftfs via fsid mappings.
[...]

Can you please grep through the kernel for all uses of ->fsuid and
->fsgid and fix them up appropriately? Some cases I still see:

The SafeSetID LSM wants to enforce that you can only use CAP_SETUID to
gain the privileges of a specific set of IDs:

    static int safesetid_task_fix_setuid(struct cred *new,
                                         const struct cred *old,
                                         int flags)
    {
            /* Do nothing if there are no setuid restrictions for our old RUID. */
            if (setuid_policy_lookup(old->uid, INVALID_UID) == SIDPOL_DEFAULT)
                    return 0;

            if (uid_permitted_for_cred(old, new->uid) &&
                uid_permitted_for_cred(old, new->euid) &&
                uid_permitted_for_cred(old, new->suid) &&
                uid_permitted_for_cred(old, new->fsuid))
                    return 0;

            /*
             * Kill this process to avoid potential security vulnerabilities
             * that could arise from a missing whitelist entry preventing a
             * privileged process from dropping to a lesser-privileged one.
             */
            force_sig(SIGKILL);
            return -EACCES;
    }

This could theoretically be bypassed through setfsuid() if the kuid
based on the fsuid mappings is permitted but the kuid based on the
normal mappings is not.

fs/coredump.c in suid dump mode uses "cred->fsuid = GLOBAL_ROOT_UID";
this should probably also fix up the other uid, even if there is no
scenario in which it would actually be used at the moment?

The netfilter xt_owner stuff makes packet filtering decisions based on
the ->fsuid; it might be better to filter on the ->kfsuid so that you
can filter traffic from different user namespaces differently?

audit_log_task_info() is doing "from_kuid(&init_user_ns, cred->fsuid)".

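The SafeSetID concern can be made concrete with a small user-space model. This is only a sketch under stated assumptions: `translate_sketch()`, `fsuid_permitted_weak()` and `fsuid_permitted_strict()` are hypothetical names that do not exist in the LSM, each map is reduced to a single extent, and the policy is modeled as a flat list of permitted kernel-side IDs. It shows how one namespace uid can yield two different kernel IDs, and how a check that looks at only one of them can pass where a check on both would not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A single map extent: namespace start, outside start, length. */
struct id_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

#define INVALID_ID UINT32_MAX

/* Translate a namespace ID through one extent. */
static uint32_t translate_sketch(const struct id_extent *e, uint32_t id)
{
    assert(e != NULL);
    if (id >= e->first && id - e->first < e->count)
        return e->lower_first + (id - e->first);
    return INVALID_ID;
}

/* Hypothetical policy: a flat list of permitted kernel-side IDs. */
static bool in_allowlist(const uint32_t *allowed, size_t n, uint32_t kid)
{
    for (size_t i = 0; i < n; i++)
        if (allowed[i] == kid)
            return true;
    return false;
}

/* Looks only at the ID produced by the fsid map. */
static bool fsuid_permitted_weak(const struct id_extent *fsid_map,
                                 const uint32_t *allowed, size_t n,
                                 uint32_t uid)
{
    return in_allowlist(allowed, n, translate_sketch(fsid_map, uid));
}

/* Requires the IDs produced by BOTH maps to be permitted. */
static bool fsuid_permitted_strict(const struct id_extent *uid_map,
                                   const struct id_extent *fsid_map,
                                   const uint32_t *allowed, size_t n,
                                   uint32_t uid)
{
    return in_allowlist(allowed, n, translate_sketch(uid_map, uid)) &&
           in_allowlist(allowed, n, translate_sketch(fsid_map, uid));
}
```

For example, with a uid map of 0 100000 1000, an fsid map of 0 300000 1000 and a policy that permits only 300000, the weak check accepts namespace uid 0 while the strict check rejects it, because the uid-map translation (100000) is not permitted.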
On Wed, 2020-02-19 at 13:27 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 03:50:56PM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
[...]
> > > With fsid mappings we can solve this by writing an id mapping of
> > > 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > filesystem access the kernel will now lookup the mapping for
> > > 300000 in the fsid mapping tables of the user namespace. And
> > > since such a mapping exists, the corresponding files will have
> > > correct ownership.
> >
> > So I did compile this up in order to run the shiftfs tests over it
> > to see how it coped with the various corner cases. However, what I
> > find is it simply fails the fsid reverse mapping in the
> > setup. Trying to use a simple uid of 0 100000 1000 and a fsid of
> > 100000 0 1000 fails the entry setuid(0) call because of this code:
>
> This is easy to fix. But what's the exact use-case?

Well, the use case I'm looking to solve is the same one it's always
been: getting a deprivileged fake root in a user_ns to be able to write
an image at fsuid 0.

I don't think it's solvable in your current framework, although
allowing the domain to be disjoint might possibly hack around it. The
problem with the proposed framework is that there are no backshifts
from the filesystem view; there are only forward shifts to the
filesystem view. This means that to get your framework to write a
filesystem at fsuid 0 you have to have an identity map for fsuid. Which
I can do: I tested uid shift 0 100000 1000 and fsuid shift 0 0 1000. It
does all work, as you'd expect, because the container has real fs root,
not a fake root.

And that's the whole problem. Firstly, I'm fs root for any filesystem
my userns can see, so any imprecision in setting up the mount namespace
of the container and I own your host. Secondly, any containment break
and I'm privileged with respect to the fs uid wherever I escape to, so
I will likewise own your host.

The only way to keep containment is to have a zero fsuid inside the
container corresponding to a non-zero one outside. And the only way to
solve the imprecision in mount namespace issue is to strictly control
the entry point at which the writing at fsuid 0 becomes active.

James

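One way to read the identity-map concern above is to ask which namespace ID, if any, lands on on-disk ID 0 under a given fsid map. The following is a hedged user-space sketch, not code from the patchset: `map_id_up_sketch()` is a hypothetical helper that performs the reverse lookup (outside ID to namespace ID) over the three-column map format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A map extent: namespace start, outside start, length. */
struct id_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

#define INVALID_ID UINT32_MAX

/*
 * Reverse lookup: which namespace ID corresponds to the given outside
 * ID? Returns INVALID_ID when no extent covers it.
 */
static uint32_t map_id_up_sketch(const struct id_extent *map, size_t n,
                                 uint32_t outside_id)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (outside_id >= map[i].lower_first &&
            outside_id - map[i].lower_first < map[i].count)
            return map[i].first + (outside_id - map[i].lower_first);
    }
    return INVALID_ID;
}
```

Under the identity fsid map 0 0 1000, on-disk ID 0 corresponds to namespace ID 0, so the container's root is the real filesystem root. Under a shifted map such as 0 300000 1000, no namespace ID reaches on-disk ID 0. Under the reverse map 100000 0 1000, only namespace ID 100000 does, which is the case the setuid(0) check rejected earlier in the thread.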
On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now lookup the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping exists,
> the corresponding files will have correct ownership.

So if I have

/proc/self/uid_map: 0 100000 100000
/proc/self/fsid_map: 1000 1000 1

1. If I read files from the rootfs which have host uid 101000, they
   will appear as uid 100 to me?

2. If I read host files with uid 1000, they will appear as uid 1000 to
   me?

3. If I create a new file, as uid 1000, what will be the inode owning
   uid?

On Wed, Feb 19, 2020 at 01:35:58PM -0600, Serge E. Hallyn wrote:
> On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping exists,
> > the corresponding files will have correct ownership.
>
> So if I have
>
> /proc/self/uid_map: 0 100000 100000
> /proc/self/fsid_map: 1000 1000 1

Oh, sorry. Your explanation in 20/25 I think set me straight, though I
need to think through a few more examples.

...

> 3. If I create a new file, as nsuid 1000, what will be the inode owning kuid?

(Note - I edited the quoted text above to be more precise)

I'm still not quite clear on this. I believe the fsid mapping will take
precedence so it'll be uid 1000? Per-mount behavior would be nice
there, but perhaps unwieldy.

On Wed, Feb 19, 2020 at 03:48:37PM -0600, Serge E. Hallyn wrote:
> On Wed, Feb 19, 2020 at 01:35:58PM -0600, Serge E. Hallyn wrote:
> > On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> > > With fsid mappings we can solve this by writing an id mapping of 0
> > > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > > access the kernel will now lookup the mapping for 300000 in the fsid
> > > mapping tables of the user namespace. And since such a mapping exists,
> > > the corresponding files will have correct ownership.
> >
> > So if I have
> >
> > /proc/self/uid_map: 0 100000 100000
> > /proc/self/fsid_map: 1000 1000 1
>
> Oh, sorry. Your explanation in 20/25 i think set me straight, though I need
> to think through a few more examples.
>
> ...
>
> > 3. If I create a new file, as nsuid 1000, what will be the inode owning kuid?
>
> (Note - I edited the quoted txt above to be more precise)
>
> I'm still not quite clear on this. I believe the fsid mapping will take
> precedence so it'll be uid 1000 ? Per mount behavior would be nice there,
> but perhaps unwieldy.

The is_userns_visible() bits seem to be an attempt at understanding
what people would want per-mount, with a policy hard coded in the
kernel.

But maybe per-mount behavior can be solved more elegantly with shifted
bind mounts, so we can drop all that from this series, and ignore
per-mount settings here?

Tycho

On 2/18/20 9:33 AM, Christian Brauner wrote:
> Hey everyone,
>
> This is v3. After (off- and online) discussions with Jann the following
> changes were made:
> - To handle nested user namespaces cleanly, efficiently, and with full
>   backwards compatibility for non fsid-mapping aware workloads we only
>   allow writing fsid mappings as long as the corresponding id mapping
>   type has not been written.
> - Split the patch which adds the internal ability in
>   kernel/user_namespace to verify and write fsid mappings into three
>   patches:
>   1. [PATCH v3 04/25] fsuidgid: add fsid mapping helpers
>      patch to implement core helpers for fsid translations (i.e.
>      make_kfs*id(), from_kfs*id{_munged}(), kfs*id_to_k*id(),
>      k*id_to_kfs*id())
>   2. [PATCH v3 05/25] user_namespace: refactor map_write()
>      patch to refactor map_write() in order to prepare for actual fsid
>      mappings changes in the following patch. (This should make it
>      easier to review.)
>   3. [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
>      patch to implement actual fsid mappings support in map_write()
> - Let the keyctl infrastructure only operate on kfsid which are always
>   mapped/looked up in the id mappings similar to what we do for
>   filesystems that have the same superblock visible in multiple user
>   namespaces.
>
> This version also comes with minimal tests which I intend to expand in
> the future.
>
> From pings and off-list questions and discussions at Google Container
> Security Summit there seems to be quite a lot of interest in this
> patchset with use-cases ranging from layer sharing for app containers
> and k8s, as well as data sharing between containers with different id
> mappings. I haven't Cced all people because I don't have all the email
> addresses at hand but I've at least added Phil now. :)
>

I put this into a kernel for our container guys to mess with in order
to validate it would actually be useful for real world uses. I've cc'ed
the guy who did all of the work in case you have specific questions.

Good news is the interface is acceptable, albeit apparently the whole
user ns interface sucks in general. But you haven't made it worse, so
success!

But in testing it there appears to be a problem with tmpfs? Our
applications will use shared memory segments for certain things and it
apparently breaks this in interesting ways: it appears to not shift the
UID appropriately on tmpfs.

This seems to be relatively straightforward to reproduce, but if you
have trouble let me know and I'll come up with a shell script that
reproduces the problem.

We are happy to continue testing these patches to make sure they're
working in our container setup. If you want to CC me on future
submissions I can build them for our internal testing and validate them
as well. Thanks,

Josef

On Thu, Feb 27, 2020 at 02:33:04PM -0500, Josef Bacik wrote:
> On 2/18/20 9:33 AM, Christian Brauner wrote:
> > Hey everyone,
> >
> > This is v3 after (off- and online) discussions with Jann the following
> > changes were made:
> > - To handle nested user namespaces cleanly, efficiently, and with full
> >   backwards compatibility for non fsid-mapping aware workloads we only
> >   allow writing fsid mappings as long as the corresponding id mapping
> >   type has not been written.
> > - Split the patch which adds the internal ability in
> >   kernel/user_namespace to verify and write fsid mappings into three
> >   patches:
> >   1. [PATCH v3 04/25] fsuidgid: add fsid mapping helpers
> >      patch to implement core helpers for fsid translations (i.e.
> >      make_kfs*id(), from_kfs*id{_munged}(), kfs*id_to_k*id(),
> >      k*id_to_kfs*id())
> >   2. [PATCH v3 05/25] user_namespace: refactor map_write()
> >      patch to refactor map_write() in order to prepare for actual fsid
> >      mappings changes in the following patch. (This should make it
> >      easier to review.)
> >   3. [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
> >      patch to implement actual fsid mappings support in map_write()
> > - Let the keyctl infrastructure only operate on kfsid which are always
> >   mapped/looked up in the id mappings similar to what we do for
> >   filesystems that have the same superblock visible in multiple user
> >   namespaces.
> >
> > This version also comes with minimal tests which I intend to expand in
> > the future.
> >
> > From pings and off-list questions and discussions at Google Container
> > Security Summit there seems to be quite a lot of interest in this
> > patchset with use-cases ranging from layer sharing for app containers
> > and k8s, as well as data sharing between containers with different id
> > mappings. I haven't Cced all people because I don't have all the email
> > addresses at hand but I've at least added Phil now. :)
> >
>
> I put this into a kernel for our container guys to mess with in order to
> validate it would actually be useful for real world uses. I've cc'ed the
> guy who did all of the work in case you have specific questions.
>
> Good news is the interface is acceptable, albeit apparently the whole user
> ns interface sucks in general. But you haven't made it worse, so success!

Well I very much disagree here :) With the first part! But I do
understand the shortcomings.

Anyway, I still hope we get to talk about this in person, but IMO this
is the right approach (this being - thinking about how to make the uid
mappings more flexible without making them too complicated to be safe
to use), but a bit too static in terms of target. There are at least
two ways that I could see usefully generalizing it.

From a user space pov, the following goal is indispensable (for my use
cases): that the fsuid be selectable based on fs, mountpoint, or file
context (as in selinux).

From a userns pov, one way to look at it is this: when task t1 signals
task t2, it's not only t1's namespace that's considered when filling in
the sender uid, but also t2's. Likewise, when writing a file, we should
consider both t1's fsuid+userns, and the file's, mount's, or
filesystem's userns.

From that POV, your patch is a step in the right direction and could be
taken as is (modulo any tmpfs fix Josef needs :).

From there I would propose adding a 'userns=<uidnsfd>' bind mount
option, so we could create an empty userns with the desired mapping
(subject to permissions granted by subuids), get an fd to the uidns,
and say

    mount --bind -o uidns=5 /shared /containers/c1/mnt/shared

So now when I write a file /etc/hosts as container fsuid 0, it'll be
subject to the container rootfs mount's uid mapping, presumably 100000.
When I write /mnt/shared/hello, it'll be subject to the mount's uid
mapping, which might be 1000.

-serge
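The per-mount idea proposed above can be sketched as a small user-space model. Everything here is hypothetical and only illustrates the proposal: `struct mount_sketch`, `path_under()` and `owner_for_write()` are invented names, each mount carries a single map extent, and the mount for a path is chosen by longest matching mountpoint. The numbers follow the example in the message (rootfs shifted to 100000, the shared mount to 1000).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INVALID_ID UINT32_MAX

/* Hypothetical mount record carrying its own single-extent uid map. */
struct mount_sketch {
    const char *mountpoint; /* path as seen inside the container */
    uint32_t first;         /* first fsuid inside the container */
    uint32_t lower_first;   /* first on-disk uid for this mount */
    uint32_t count;         /* number of IDs covered */
};

/* True if path is the mountpoint itself or lies beneath it. */
static bool path_under(const char *mnt, const char *path)
{
    size_t n = strlen(mnt);

    if (n == 0 || strncmp(mnt, path, n) != 0)
        return false;
    /* "/" covers everything; otherwise require a component boundary. */
    return mnt[n - 1] == '/' || path[n] == '\0' || path[n] == '/';
}

/* On-disk owner for a write to path by fsuid, using the deepest mount. */
static uint32_t owner_for_write(const struct mount_sketch *mnts, size_t n,
                                const char *path, uint32_t fsuid)
{
    const struct mount_sketch *best = NULL;
    size_t best_len = 0;

    assert(path != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(mnts[i].mountpoint);

        if (path_under(mnts[i].mountpoint, path) && len >= best_len) {
            best = &mnts[i];
            best_len = len;
        }
    }
    if (best == NULL || fsuid < best->first ||
        fsuid - best->first >= best->count)
        return INVALID_ID;
    return best->lower_first + (fsuid - best->first);
}
```

With a root mount mapping 0 to 100000 and a `/mnt/shared` mount mapping 0 to 1000, a write to `/etc/hosts` as fsuid 0 resolves to on-disk uid 100000, while a write to `/mnt/shared/hello` resolves to 1000. Selecting the mapping at the mount keeps the task's own uid map unchanged, which is the property the proposal is after.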