Message ID | 20200214183554.1133805-1-christian.brauner@ubuntu.com (mailing list archive) |
---|---|
Series | user_namespace: introduce fsid mappings |

* Christian Brauner:

> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now look up the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping exists,
> the corresponding files will have correct ownership.

I'm worried that this is a bit of a management nightmare because the
data about the mapping does not live within the file system (it's
externally determined, static, but crucial to the interpretation of
file system content). I expect that many organizations have
centralized allocation of user IDs, but centralized allocation of the
static mapping does not appear feasible.

Have you considered a more complex design, where untranslated nested
user IDs are stored in a file attribute (or something like that)? This
way, any existing user ID infrastructure can be carried over largely
unchanged.

On Sun, Feb 16, 2020 at 04:55:49PM +0100, Florian Weimer wrote:
> * Christian Brauner:
>
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now look up the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping exists,
> > the corresponding files will have correct ownership.
>
> I'm worried that this is a bit of a management nightmare because the
> data about the mapping does not live within the file system (it's
> externally determined, static, but crucial to the interpretation of
> file system content). I expect that many organizations have

IIUC, that's already the case with user namespaces right now, e.g. when
you have an on-disk mapping that doesn't match your user namespace
mapping.

> centralized allocation of user IDs, but centralized allocation of the
> static mapping does not appear feasible.

I thought we're working on this right now with the new nss
infrastructure to register id mappings, aka the shadow discussion we've
been having.

>
> Have you considered a more complex design, where untranslated nested
> user IDs are stored in a file attribute (or something like that)? This

That doesn't sound like it would be feasible, especially in the nesting
case, wrt. performance.

Christian

On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
[...]
> People not as familiar with user namespaces might not be aware that
> fsid mappings already exist. Right now, fsid mappings are always
> identical to id mappings. Specifically, the kernel will look up fsuids
> in the uid mappings and fsgids in the gid mappings of the relevant
> user namespace.

This isn't actually entirely true: today we have the superblock user
namespace, which can be used for fsid remapping on filesystems that
support it (currently f2fs and fuse). Since this is a single shift,
how is it going to play with s_user_ns? Do you have to understand the
superblock mapping to use this shift, or are we simply using this to
replace s_user_ns?

James

On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
[...]
> With this patch series we simply introduce the ability to create fsid
> mappings that are different from the id mappings of a user namespace.
> The whole feature set is placed under a config option that defaults
> to false.
>
> In the usual case of running an unprivileged container we will have
> set up an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> correspond to this id mapping, i.e. all files which we want to appear
> as 0:0 inside the user namespace will be chowned to 100000:100000 on
> the host. This works, because whenever the kernel needs to do a
> filesystem access it will look up the corresponding uid and gid in the
> idmapping tables of the container.
> Now think about the case where we want to have an id mapping of 0
> 100000 100000 but an on-disk mapping of 0 300000 100000 which is
> needed to e.g. share a single on-disk mapping with multiple
> containers that all have different id mappings.
> This will be problematic. Whenever a filesystem access is requested,
> the kernel will now try to look up a mapping for 300000 in the id
> mapping tables of the user namespace but since there is none the
> files will appear to be owned by the overflow id, i.e. usually
> 65534:65534 or nobody:nogroup.
>
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now look up the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping
> exists, the corresponding files will have correct ownership.

How do we parametrise this new fsid shift for the unprivileged use
case? For newuidmap/newgidmap, it's easy because each user gets a
dedicated range and everything "just works (tm)". However, for the
fsid mapping, assuming some newfsuid/newfsgid tool to help, that tool
has to know not only your allocated uid/gid chunk, but also the offset
map of the image. The former is easy, but the latter is going to vary
by the actual image ... well unless we standardise some accepted shift
for images and it simply becomes a known static offset.

James
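
The lookup described in the quoted cover letter can be sketched in user space. This is a minimal model and not kernel code: `struct id_extent`, `map_up()`, and the two example tables are hypothetical names chosen for illustration. Only the mapping values (0 100000 100000 and 0 300000 100000) and the overflow id 65534 come from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The usual overflow id (nobody/nogroup) mentioned in the cover letter. */
#define OVERFLOW_ID 65534u

/* One mapping line "first lower_first count", e.g. "0 100000 100000". */
struct id_extent {
    uint32_t first;       /* first id as seen inside the namespace */
    uint32_t lower_first; /* first id on the host / on-disk side */
    uint32_t count;       /* number of ids covered by the extent */
};

/* Example tables from the thread (names are illustrative only). */
static const struct id_extent example_id_map[]   = { { 0, 100000, 100000 } };
static const struct id_extent example_fsid_map[] = { { 0, 300000, 100000 } };

/*
 * Translate a host-side id into the namespace view.
 * An id that no extent covers is reported as the overflow id.
 */
static uint32_t map_up(const struct id_extent *map, size_t nr, uint32_t host_id)
{
    assert(map != NULL);
    for (size_t i = 0; i < nr; i++) {
        const struct id_extent *e = &map[i];

        if (host_id >= e->lower_first &&
            host_id - e->lower_first < e->count)
            return e->first + (host_id - e->lower_first);
    }
    return OVERFLOW_ID;
}
```

With these tables, `map_up(example_id_map, 1, 300000)` yields the overflow id, while `map_up(example_fsid_map, 1, 300000)` yields 0. That is the difference between consulting the id mapping and consulting a separate fsid mapping for a file owned by 300000 on disk.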

On Mon, Feb 17, 2020 at 01:06:08PM -0800, James Bottomley wrote:
> On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> [...]
> > People not as familiar with user namespaces might not be aware that
> > fsid mappings already exist. Right now, fsid mappings are always
> > identical to id mappings. Specifically, the kernel will look up fsuids
> > in the uid mappings and fsgids in the gid mappings of the relevant
> > user namespace.
>
> This isn't actually entirely true: today we have the superblock user
> namespace, which can be used for fsid remapping on filesystems that
> support it (currently f2fs and fuse). Since this is a single shift,

Note that this states "the relevant" user namespace not the caller's
user namespace. And the point is true even for such filesystems. fuse
does call make_kuid(fc->user_ns, attr->uid) and hence looks up the
mapping in the id mappings. This would be replaced by make_kfsuid().

> how is it going to play with s_user_ns? Do you have to understand the
> superblock mapping to use this shift, or are we simply using this to
> replace s_user_ns?

I'm not sure what you mean by understand the superblock mapping. The
case is not different from the devpts patch in this series.

Fuse needs to be changed to call make_kfsuid() since it is mountable
inside user namespaces at which point everything just works.

And re-sending, this time hopefully actually in plain text mode.
Sorry about that, my e-mail client isn't behaving today...

Stéphane

On Mon, Feb 17, 2020 at 4:57 PM Stéphane Graber <stgraber@ubuntu.com> wrote:
>
> On Mon, Feb 17, 2020 at 4:12 PM James Bottomley <James.Bottomley@hansenpartnership.com> wrote:
>>
>> On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
>> [...]
>> > With this patch series we simply introduce the ability to create fsid
>> > mappings that are different from the id mappings of a user namespace.
>> > The whole feature set is placed under a config option that defaults
>> > to false.
>> >
>> > In the usual case of running an unprivileged container we will have
>> > set up an id mapping, e.g. 0 100000 100000. The on-disk mapping will
>> > correspond to this id mapping, i.e. all files which we want to appear
>> > as 0:0 inside the user namespace will be chowned to 100000:100000 on
>> > the host. This works, because whenever the kernel needs to do a
>> > filesystem access it will look up the corresponding uid and gid in the
>> > idmapping tables of the container.
>> > Now think about the case where we want to have an id mapping of 0
>> > 100000 100000 but an on-disk mapping of 0 300000 100000 which is
>> > needed to e.g. share a single on-disk mapping with multiple
>> > containers that all have different id mappings.
>> > This will be problematic. Whenever a filesystem access is requested,
>> > the kernel will now try to look up a mapping for 300000 in the id
>> > mapping tables of the user namespace but since there is none the
>> > files will appear to be owned by the overflow id, i.e. usually
>> > 65534:65534 or nobody:nogroup.
>> >
>> > With fsid mappings we can solve this by writing an id mapping of 0
>> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
>> > access the kernel will now look up the mapping for 300000 in the fsid
>> > mapping tables of the user namespace. And since such a mapping
>> > exists, the corresponding files will have correct ownership.
>>
>> How do we parametrise this new fsid shift for the unprivileged use
>> case? For newuidmap/newgidmap, it's easy because each user gets a
>> dedicated range and everything "just works (tm)". However, for the
>> fsid mapping, assuming some newfsuid/newfsgid tool to help, that tool
>> has to know not only your allocated uid/gid chunk, but also the offset
>> map of the image. The former is easy, but the latter is going to vary
>> by the actual image ... well unless we standardise some accepted shift
>> for images and it simply becomes a known static offset.
>
>
> For unprivileged runtimes, I would expect images to be unshifted and be
> unpacked from within a userns. So your unprivileged user would be allowed
> a uid/gid range through /etc/subuid and /etc/subgid and allowed to use
> them through newuidmap/newgidmap. In that namespace, you can then pull
> and unpack any images/layers you may want and the resulting fs tree will
> look correct from within that namespace.
>
> All that is possible today and is how for example unprivileged LXC works
> right now.
>
> What this patchset then allows is for containers to have differing
> uid/gid maps while still being based off the same image or layers.
> In this scenario, you would carve a subset of your main uid/gid map for
> each container you run and run them in a child user namespace while
> setting up a fsuid/fsgid map such that their filesystem accesses do not
> follow their uid/gid map. This then results in proper isolation for
> processes, networks, ... as everything runs as different kuid/kgid but
> the VFS view will be the same in all containers.
>
> Shared storage between those otherwise isolated containers would also
> work just fine by simply bind-mounting the same path into two or more
> containers.
>
>
> Now one additional thing that would be safe for a setuid wrapper to
> allow would be for arbitrary mapping of any of the uid/gid that the user
> owns to be used within the fsuid/fsgid map. One potential use for this
> would be to create any number of user namespaces, each with their own
> mapping for uid 0 while still having all VFS access be mapped to the
> user that spawned them (say uid=1000, gid=1000).
>
>
> Note that in our case, the intended use for this is from a privileged runtime
> where our images would be unshifted as would be the container storage
> and any shared storage for containers. The security model effectively relies
> on properly configured filesystem permissions and mount namespaces such
> that the content of those paths can never be seen by anyone but root outside
> of those containers (and therefore avoids all the issues around setuid/setgid/fscaps).
>
> We will then be able to allocate distinct, random, ranges of 65536 uids/gids (or more)
> for each container without ever having to do any uid/gid shifting at the filesystem layer
> or run into issues when having to set up shared storage between containers or attaching
> external storage volumes to those containers.
>
>> James
>
>
> Stéphane

On Mon, 2020-02-17 at 22:20 +0100, Christian Brauner wrote:
> On Mon, Feb 17, 2020 at 01:06:08PM -0800, James Bottomley wrote:
> > On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> > [...]
> > > People not as familiar with user namespaces might not be aware
> > > that fsid mappings already exist. Right now, fsid mappings are
> > > always identical to id mappings. Specifically, the kernel will
> > > look up fsuids in the uid mappings and fsgids in the gid mappings
> > > of the relevant user namespace.
> >
> > This isn't actually entirely true: today we have the superblock
> > user namespace, which can be used for fsid remapping on filesystems
> > that support it (currently f2fs and fuse). Since this is a single
> > shift,
>
> Note that this states "the relevant" user namespace not the caller's
> user namespace. And the point is true even for such filesystems. fuse
> does call make_kuid(fc->user_ns, attr->uid) and hence looks up the
> mapping in the id mappings. This would be replaced by make_kfsuid().
>
> > how is it going to play with s_user_ns? Do you have to understand
> > the superblock mapping to use this shift, or are we simply using
> > this to replace s_user_ns?
>
> I'm not sure what you mean by understand the superblock mapping. The
> case is not different from the devpts patch in this series.

So since devpts wasn't originally an s_user_ns consumer, I assume you're
thinking that this patch series just replaces the whole of s_user_ns
for fuse and f2fs and we can remove it?

> Fuse needs to be changed to call make_kfsuid() since it is mountable
> inside user namespaces at which point everything just works.

The fuse case is slightly more complicated because there are sound
reasons to run the daemon in a separate user namespace regardless of
where the end fuse mount is.

James

On Mon, 2020-02-17 at 16:57 -0500, Stéphane Graber wrote:
> On Mon, Feb 17, 2020 at 4:12 PM James Bottomley <
> James.Bottomley@hansenpartnership.com> wrote:
>
> > On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> > [...]
> > > With this patch series we simply introduce the ability to create
> > > fsid mappings that are different from the id mappings of a user
> > > namespace. The whole feature set is placed under a config option
> > > that defaults to false.
> > >
> > > In the usual case of running an unprivileged container we will
> > > have set up an id mapping, e.g. 0 100000 100000. The on-disk
> > > mapping will correspond to this id mapping, i.e. all files which
> > > we want to appear as 0:0 inside the user namespace will be
> > > chowned to 100000:100000 on the host. This works, because
> > > whenever the kernel needs to do a filesystem access it will
> > > look up the corresponding uid and gid in the idmapping tables of
> > > the container.
> > > Now think about the case where we want to have an id mapping of 0
> > > 100000 100000 but an on-disk mapping of 0 300000 100000 which is
> > > needed to e.g. share a single on-disk mapping with multiple
> > > containers that all have different id mappings.
> > > This will be problematic. Whenever a filesystem access is
> > > requested, the kernel will now try to look up a mapping for 300000
> > > in the id mapping tables of the user namespace but since there is
> > > none the files will appear to be owned by the overflow id, i.e.
> > > usually 65534:65534 or nobody:nogroup.
> > >
> > > With fsid mappings we can solve this by writing an id mapping of
> > > 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > filesystem access the kernel will now look up the mapping for
> > > 300000 in the fsid mapping tables of the user namespace. And
> > > since such a mapping exists, the corresponding files will have
> > > correct ownership.
> >
> > How do we parametrise this new fsid shift for the unprivileged use
> > case? For newuidmap/newgidmap, it's easy because each user gets a
> > dedicated range and everything "just works (tm)". However, for the
> > fsid mapping, assuming some newfsuid/newfsgid tool to help, that
> > tool has to know not only your allocated uid/gid chunk, but also
> > the offset map of the image. The former is easy, but the latter is
> > going to vary by the actual image ... well unless we standardise
> > some accepted shift for images and it simply becomes a known static
> > offset.
> >
>
> For unprivileged runtimes, I would expect images to be unshifted and
> be unpacked from within a userns.

For images whose resting format is an archive like tar, I concur.

> So your unprivileged user would be allowed a uid/gid range through
> /etc/subuid and /etc/subgid and allowed to use them through
> newuidmap/newgidmap. In that namespace, you can then pull
> and unpack any images/layers you may want and the resulting fs tree
> will look correct from within that namespace.
>
> All that is possible today and is how for example unprivileged LXC
> works right now.

I do have a counter example, but it might be more esoteric: I do use
unprivileged architecture emulation containers to maintain actual
physical system boot environments. These are stored as mountable disk
images, not as archives, so I do need a simple remapping ... however, I
think this use case is simple: it's a back shift along my owned uid/gid
range, so tools for allowing unprivileged use can easily cope with this
use case, so the use is either fsid identity or fsid back along
existing user_ns mapping.

> What this patchset then allows is for containers to have differing
> uid/gid maps while still being based off the same image or layers.
> In this scenario, you would carve a subset of your main uid/gid map
> for each container you run and run them in a child user namespace
> while setting up a fsuid/fsgid map such that their filesystem accesses
> do not follow their uid/gid map. This then results in proper
> isolation for processes, networks, ... as everything runs as
> different kuid/kgid but the VFS view will be the same in all
> containers.

Who owns the shifted range of the image ... all tenants or none?

> Shared storage between those otherwise isolated containers would also
> work just fine by simply bind-mounting the same path into two or more
> containers.
>
>
> Now one additional thing that would be safe for a setuid wrapper to
> allow would be for arbitrary mapping of any of the uid/gid that the
> user owns to be used within the fsuid/fsgid map. One potential use
> for this would be to create any number of user namespaces, each with
> their own mapping for uid 0 while still having all VFS access be
> mapped to the user that spawned them (say uid=1000, gid=1000).
>
>
> Note that in our case, the intended use for this is from a privileged
> runtime where our images would be unshifted as would be the container
> storage and any shared storage for containers. The security model
> effectively relies on properly configured filesystem permissions and
> mount namespaces such that the content of those paths can never be
> seen by anyone but root outside of those containers (and therefore
> avoids all the issues around setuid/setgid/fscaps).

Yes, I understand ... all orchestration systems are currently hugely
privileged. However, there is interest in getting them down to only
"slightly privileged".

James

> We will then be able to allocate distinct, random, ranges of 65536
> uids/gids (or more) for each container without ever having to do any
> uid/gid shifting at the filesystem layer or run into issues when
> having to set up shared storage between containers or attaching
> external storage volumes to those containers.

On Mon, Feb 17, 2020 at 02:35:38PM -0800, James Bottomley wrote:
> On Mon, 2020-02-17 at 22:20 +0100, Christian Brauner wrote:
> > On Mon, Feb 17, 2020 at 01:06:08PM -0800, James Bottomley wrote:
> > > On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> > > [...]
> > > > People not as familiar with user namespaces might not be aware
> > > > that fsid mappings already exist. Right now, fsid mappings are
> > > > always identical to id mappings. Specifically, the kernel will
> > > > look up fsuids in the uid mappings and fsgids in the gid mappings
> > > > of the relevant user namespace.
> > >
> > > This isn't actually entirely true: today we have the superblock
> > > user namespace, which can be used for fsid remapping on filesystems
> > > that support it (currently f2fs and fuse). Since this is a single
> > > shift,
> >
> > Note that this states "the relevant" user namespace not the caller's
> > user namespace. And the point is true even for such filesystems. fuse
> > does call make_kuid(fc->user_ns, attr->uid) and hence looks up the
> > mapping in the id mappings. This would be replaced by make_kfsuid().
> >
> > > how is it going to play with s_user_ns? Do you have to understand
> > > the superblock mapping to use this shift, or are we simply using
> > > this to replace s_user_ns?
> >
> > I'm not sure what you mean by understand the superblock mapping. The
> > case is not different from the devpts patch in this series.
>
> So since devpts wasn't originally an s_user_ns consumer, I assume you're
> thinking that this patch series just replaces the whole of s_user_ns
> for fuse and f2fs and we can remove it?

No, as I said it's just about replacing make_kuid() with make_kfsuid().
This doesn't change anything for all cases where id mappings equal fsid
mappings, and if there are separate id mappings it will look at the fsid
mappings for the user namespace in struct fuse_conn.

>
> > Fuse needs to be changed to call make_kfsuid() since it is mountable
> > inside user namespaces at which point everything just works.
>
> The fuse case is slightly more complicated because there are sound
> reasons to run the daemon in a separate user namespace regardless of
> where the end fuse mount is.

I'm curious how you're doing that today as it's usually tricky to mount
across mount namespaces? In any case, this patchset doesn't change any
of that fuse logic, so things will keep working as they do today.
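
The make_kuid() to make_kfsuid() swap discussed above can be modelled in user space. This is only a sketch under stated assumptions: `toy_make_kuid()` and `toy_make_kfsuid()` are hypothetical stand-ins with simplified signatures, not the kernel's make_kuid() or the series' make_kfsuid(), and `struct toy_userns` is not `struct user_namespace`. The fallback to the id mapping when no fsid mapping has been written is an assumption, made to match the statement that nothing changes when id mappings equal fsid mappings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stands in for an invalid kernel id; the value is an assumption. */
#define INVALID_ID UINT32_MAX

struct id_extent {
    uint32_t first;       /* first id inside the namespace */
    uint32_t lower_first; /* first id on the kernel/host side */
    uint32_t count;       /* number of ids covered */
};

struct id_map {
    const struct id_extent *extents;
    size_t nr;            /* nr == 0 means "nothing written" */
};

/* Toy namespace: only the two tables relevant to this thread. */
struct toy_userns {
    struct id_map uid_map;
    struct id_map fsuid_map;
};

static const struct id_extent ext_id[]   = { { 0, 100000, 100000 } };
static const struct id_extent ext_fsid[] = { { 0, 300000, 100000 } };

/* Namespace with identical id and fsid mappings (no fsid map written). */
static const struct toy_userns ns_plain = {
    .uid_map   = { ext_id, 1 },
    .fsuid_map = { NULL, 0 },
};

/* Namespace with a separate fsid mapping, as in the cover letter. */
static const struct toy_userns ns_split = {
    .uid_map   = { ext_id, 1 },
    .fsuid_map = { ext_fsid, 1 },
};

/* Translate a namespace id into a kernel-side id. */
static uint32_t map_down(const struct id_map *m, uint32_t id)
{
    for (size_t i = 0; i < m->nr; i++) {
        const struct id_extent *e = &m->extents[i];

        if (id >= e->first && id - e->first < e->count)
            return e->lower_first + (id - e->first);
    }
    return INVALID_ID;
}

/* Model of today's behaviour: always consult the id mapping. */
static uint32_t toy_make_kuid(const struct toy_userns *ns, uint32_t uid)
{
    assert(ns != NULL);
    return map_down(&ns->uid_map, uid);
}

/* Model of the proposed call: prefer the fsid mapping if one exists. */
static uint32_t toy_make_kfsuid(const struct toy_userns *ns, uint32_t uid)
{
    assert(ns != NULL);
    return map_down(ns->fsuid_map.nr ? &ns->fsuid_map : &ns->uid_map, uid);
}
```

In this model a filesystem that switches from `toy_make_kuid()` to `toy_make_kfsuid()` sees no difference for `ns_plain`, and maps uid 0 to 300000 instead of 100000 for `ns_split`.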

On Mon, Feb 17, 2020 at 6:03 PM James Bottomley <James.Bottomley@hansenpartnership.com> wrote:
>
> On Mon, 2020-02-17 at 16:57 -0500, Stéphane Graber wrote:
> > On Mon, Feb 17, 2020 at 4:12 PM James Bottomley <
> > James.Bottomley@hansenpartnership.com> wrote:
> >
> > > On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> > > [...]
> > > > With this patch series we simply introduce the ability to create
> > > > fsid mappings that are different from the id mappings of a user
> > > > namespace. The whole feature set is placed under a config option
> > > > that defaults to false.
> > > >
> > > > In the usual case of running an unprivileged container we will
> > > > have set up an id mapping, e.g. 0 100000 100000. The on-disk
> > > > mapping will correspond to this id mapping, i.e. all files which
> > > > we want to appear as 0:0 inside the user namespace will be
> > > > chowned to 100000:100000 on the host. This works, because
> > > > whenever the kernel needs to do a filesystem access it will
> > > > look up the corresponding uid and gid in the idmapping tables of
> > > > the container.
> > > > Now think about the case where we want to have an id mapping of 0
> > > > 100000 100000 but an on-disk mapping of 0 300000 100000 which is
> > > > needed to e.g. share a single on-disk mapping with multiple
> > > > containers that all have different id mappings.
> > > > This will be problematic. Whenever a filesystem access is
> > > > requested, the kernel will now try to look up a mapping for 300000
> > > > in the id mapping tables of the user namespace but since there is
> > > > none the files will appear to be owned by the overflow id, i.e.
> > > > usually 65534:65534 or nobody:nogroup.
> > > >
> > > > With fsid mappings we can solve this by writing an id mapping of
> > > > 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > > filesystem access the kernel will now look up the mapping for
> > > > 300000 in the fsid mapping tables of the user namespace. And
> > > > since such a mapping exists, the corresponding files will have
> > > > correct ownership.
> > >
> > > How do we parametrise this new fsid shift for the unprivileged use
> > > case? For newuidmap/newgidmap, it's easy because each user gets a
> > > dedicated range and everything "just works (tm)". However, for the
> > > fsid mapping, assuming some newfsuid/newfsgid tool to help, that
> > > tool has to know not only your allocated uid/gid chunk, but also
> > > the offset map of the image. The former is easy, but the latter is
> > > going to vary by the actual image ... well unless we standardise
> > > some accepted shift for images and it simply becomes a known static
> > > offset.
> > >
> >
> > For unprivileged runtimes, I would expect images to be unshifted and
> > be unpacked from within a userns.
>
> For images whose resting format is an archive like tar, I concur.
>
> > So your unprivileged user would be allowed a uid/gid range through
> > /etc/subuid and /etc/subgid and allowed to use them through
> > newuidmap/newgidmap. In that namespace, you can then pull
> > and unpack any images/layers you may want and the resulting fs tree
> > will look correct from within that namespace.
> >
> > All that is possible today and is how for example unprivileged LXC
> > works right now.
>
> I do have a counter example, but it might be more esoteric: I do use
> unprivileged architecture emulation containers to maintain actual
> physical system boot environments. These are stored as mountable disk
> images, not as archives, so I do need a simple remapping ... however, I
> think this use case is simple: it's a back shift along my owned uid/gid
> range, so tools for allowing unprivileged use can easily cope with this
> use case, so the use is either fsid identity or fsid back along
> existing user_ns mapping.
>
> > What this patchset then allows is for containers to have differing
> > uid/gid maps while still being based off the same image or layers.
> > In this scenario, you would carve a subset of your main uid/gid map
> > for each container you run and run them in a child user namespace
> > while setting up a fsuid/fsgid map such that their filesystem accesses
> > do not follow their uid/gid map. This then results in proper
> > isolation for processes, networks, ... as everything runs as
> > different kuid/kgid but the VFS view will be the same in all
> > containers.
>
> Who owns the shifted range of the image ... all tenants or none?

I would expect the most common case to be none of them.

So you'd have a uid/gid range carved out of your own allocation which is
used to unpack images; let's call that the image map.

Your containers would then use a uid/gid map which is distinct from that
map and distinct from each other, but all using the image map as their
fsuid/fsgid map. This will make the VFS behave in a normal way and would
also allow for shared paths between those containers by using a shared
directory through bind-mount which is also owned by a uid/gid in that
image range.

> > Shared storage between those otherwise isolated containers would also
> > work just fine by simply bind-mounting the same path into two or more
> > containers.
> >
> >
> > Now one additional thing that would be safe for a setuid wrapper to
> > allow would be for arbitrary mapping of any of the uid/gid that the
> > user owns to be used within the fsuid/fsgid map. One potential use
> > for this would be to create any number of user namespaces, each with
> > their own mapping for uid 0 while still having all VFS access be
> > mapped to the user that spawned them (say uid=1000, gid=1000).
> >
> >
> > Note that in our case, the intended use for this is from a privileged
> > runtime where our images would be unshifted as would be the container
> > storage and any shared storage for containers. The security model
> > effectively relies on properly configured filesystem permissions and
> > mount namespaces such that the content of those paths can never be
> > seen by anyone but root outside of those containers (and therefore
> > avoids all the issues around setuid/setgid/fscaps).
>
> Yes, I understand ... all orchestration systems are currently hugely
> privileged. However, there is interest in getting them down to only
> "slightly privileged".
>
> James
>
>
> > We will then be able to allocate distinct, random, ranges of 65536
> > uids/gids (or more) for each container without ever having to do any
> > uid/gid shifting at the filesystem layer or run into issues when
> > having to set up shared storage between containers or attaching
> > external storage volumes to those containers.
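
The image-map layout described in the reply above can be checked with a small user-space model. Everything here is a sketch: `struct id_range`, `struct toy_container`, `to_host()`, `from_host()`, and `host_ranges_overlap()` are hypothetical names, and the concrete ranges (image map at 100000, containers at 200000 and 300000, each 65536 ids long) are assumed values picked for illustration rather than taken from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OVERFLOW_ID 65534u      /* nobody/nogroup */
#define INVALID_ID  UINT32_MAX  /* assumed marker for "no mapping" */

/* One mapping line "first lower_first count". */
struct id_range {
    uint32_t first;       /* first id inside the container */
    uint32_t lower_first; /* first id on the host side */
    uint32_t count;
};

struct toy_container {
    struct id_range id_map;   /* decides which host ids processes run as */
    struct id_range fsid_map; /* decides how on-disk ownership is viewed */
};

/* Assumed image map: the range used to unpack images. */
#define IMAGE_MAP { 0, 100000, 65536 }

static const struct toy_container ct_a = {
    .id_map = { 0, 200000, 65536 }, .fsid_map = IMAGE_MAP,
};
static const struct toy_container ct_b = {
    .id_map = { 0, 300000, 65536 }, .fsid_map = IMAGE_MAP,
};

/* Container id -> host id. */
static uint32_t to_host(const struct id_range *r, uint32_t id)
{
    assert(r != NULL);
    if (id >= r->first && id - r->first < r->count)
        return r->lower_first + (id - r->first);
    return INVALID_ID;
}

/* Host id -> container id; unmapped ids show up as the overflow id. */
static uint32_t from_host(const struct id_range *r, uint32_t host_id)
{
    assert(r != NULL);
    if (host_id >= r->lower_first && host_id - r->lower_first < r->count)
        return r->first + (host_id - r->lower_first);
    return OVERFLOW_ID;
}

/* True if two mappings claim any common host id. */
static bool host_ranges_overlap(const struct id_range *a,
                                const struct id_range *b)
{
    uint64_t a_end = (uint64_t)a->lower_first + a->count;
    uint64_t b_end = (uint64_t)b->lower_first + b->count;

    return a->lower_first < b_end && b->lower_first < a_end;
}
```

In this model a file owned by host id 101000 appears as uid 1000 in both containers through the shared fsid map, while root in each container runs as a different host id (200000 and 300000), and none of the three host ranges overlap.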