Message ID | 20200217205307.32256-1-James.Bottomley@HansenPartnership.com (mailing list archive) |
---|---|
Headers | show |
Series | introduce a uid/gid shifting bind mount | expand |
On Mon, Feb 17, 2020 at 10:56 PM James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > The object of this series is to replace shiftfs with a proper uid/gid > shifting bind mount instead of the shiftfs hack of introducing > something that looks similar to an overlay filesystem to do it. > > The VFS still has the problem that in order to tell what vfsmount a > dentry belongs to, struct path would have to be threaded everywhere > struct dentry currently is. However, this patch is structured only to > require a rethreading of notify_change. The rest of the knowledge > that a shift is in operation is carried in the task structure by > caching the unshifted credentials. > > Note that although it is currently dependent on the new configfd > interface for bind mounts, only patch 3/3 relies on this, and the > whole thing could be redone as a syscall or any other mechanism > (depending on how people eventually want to fix the problem with the > new fsconfig mechanism being unable to reconfigure bind mounts). > > The changes from v2 are I've added Amir's reviewed-by for the > notify_change rethreading and I've implemented Serge's request for a > base offset shift for the image. It turned out to be much harder to > implement a simple linear shift than simply to do it through a > different userns, so that's how I've done it. The userns you need to > set up for the offset shifted image is one where the interior uid > would see the shifted image as fake root. I've introduced an > additional "ns" config parameter, which must be specified when > building the allow shift mount point (so it's done by the admin, not > by the unprivileged user). I've also taken care that the image > shifted to zero (real root) is never visible in the filesystem. Patch > 3/3 explains how to use the additional "ns" parameter. > > James, To us common people who do not breath containers, your proposal seems like a competing implementation to Christian's proposal [1]. If it were a competing implementation, I think Christian's proposal would have won by points for being less intrusive to VFS. But it is not really a competing implementation, is it? Your proposals meet two different, but very overlapping, set of requirements. IMHO, none of you did a really good job of explaining that in the cover latter, let alone, refer to each others proposals (I am referring to your v3 posting of course). IIUC, Christian's proposal deals with single shared image per non-overlapping groups of containers. And it deals with this use case very elegantly IMO. From your comments on Christian's post, it does not seem that you oppose to his proposal, except that it does not meet the requirements for all of your use cases. IIUC, your proposal can deal with multiple shared images per overlapping groups of containers and it adds an element of "auto-reverse-mapping", which reduces the administration overhead of this to be nightmare of orchestration. It seems to me, that you should look into working your patch set on top of fsid mapping and try to make use of it as much as possible. And to make things a bit more clear to the rest of us, you should probably market your feature as "auto back shifting mount" or something like that and explain the added value of the feature on top of plain fsid mapping. Thanks, Amir. [1] https://lore.kernel.org/linux-fsdevel/20200214183554.1133805-1-christian.brauner@ubuntu.com/
On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote: > On Mon, Feb 17, 2020 at 10:56 PM James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > > The object of this series is to replace shiftfs with a proper > > uid/gid shifting bind mount instead of the shiftfs hack of > > introducing something that looks similar to an overlay filesystem > > to do it. > > > > The VFS still has the problem that in order to tell what vfsmount a > > dentry belongs to, struct path would have to be threaded everywhere > > struct dentry currently is. However, this patch is structured only > > to require a rethreading of notify_change. The rest of the > > knowledge that a shift is in operation is carried in the task > > structure by caching the unshifted credentials. > > > > Note that although it is currently dependent on the new configfd > > interface for bind mounts, only patch 3/3 relies on this, and the > > whole thing could be redone as a syscall or any other mechanism > > (depending on how people eventually want to fix the problem with > > the new fsconfig mechanism being unable to reconfigure bind > > mounts). > > > > The changes from v2 are I've added Amir's reviewed-by for the > > notify_change rethreading and I've implemented Serge's request for > > a base offset shift for the image. It turned out to be much harder > > to implement a simple linear shift than simply to do it through a > > different userns, so that's how I've done it. The userns you need > > to set up for the offset shifted image is one where the interior > > uid would see the shifted image as fake root. I've introduced an > > additional "ns" config parameter, which must be specified when > > building the allow shift mount point (so it's done by the admin, > > not by the unprivileged user). I've also taken care that the image > > shifted to zero (real root) is never visible in the > > filesystem. Patch 3/3 explains how to use the additional "ns" > > parameter. > > > > > > James, > > To us common people who do not breath containers, your proposal seems > like a competing implementation to Christian's proposal [1]. I think we have three things that swirl around this space and aren't quite direct copies of each other's use cases but aren't entirely disjoint either: the superblock user namespace, this and the user namespace fsid mapping. > If it were a competing implementation, I think Christian's proposal > would have won by points for being less intrusive to VFS. Heh, that one's subjective. I think the current fsid code is missing quite a few use cases in the stat/attr/in_group_p cases. I'm just building the code now to run it through the shiftfs tests and see how it fares. I think once those cases are added, the VFS changes in fsid will be the same as I have in patch 2/3 ... primarily because we all have to shift the same thing at the same point. If you include the notify_change rethreading then, yes, you're correct, but that patch does stand on its own and is consonant with a long term vfs goal of using path instead of dentry. > But it is not really a competing implementation, is it? Your > proposals meet two different, but very overlapping, set of > requirements. IMHO, none of you did a really good job of explaining > that in the cover latter, let alone, refer to each others proposals > (I am referring to your v3 posting of course). Yes, I know, but the fsid one is only a few days old so I haven't had time to absorb all of it yet. > IIUC, Christian's proposal deals with single shared image per > non-overlapping groups of containers. And it deals with this use case > very elegantly IMO. From your comments on Christian's post, it does > not seem that you oppose to his proposal, except that it does not > meet the requirements for all of your use cases. No, but I think it could. It's one of these perennial problems of more generic vs more specific to use case. I'm a bit lost in really what we need for containers. In the original shiftfs I made it superblock based precisely because that was the only way I could integrate s_user_ns into it ... and I thought that was a good idea. > IIUC, your proposal can deal with multiple shared images per > overlapping groups of containers That's right ... it does the shift based on path and user namespace. In theory it allows an image with shifted and unshifted pieces ... however, I'm not sure there's even a use case for that granularity because all my current image shifting use cases are all or nothing. The granularity is an accident of the bind mount implementation. > and it adds an element of "auto-reverse-mapping", which reduces the > administration overhead of this to be nightmare of orchestration. Right, but the same thing could be an option to the fsid proposal: the default use could shift forward on kuid and back on the same map for fsuid ... then it would do what shiftfs does. Likewise, shiftfs could pick up the shift from an existing namespace and thus look more like what fsuid does. > It seems to me, that you should look into working your patch set on > top of fsid mapping and try to make use of it as much as possible. > And to make things a bit more clear to the rest of us, you should > probably market your feature as "auto back shifting mount" or > something like that and explain the added value of the feature on top > of plain fsid mapping Well we both need the same shift points, so we could definitely both work off a generic vfs encapsulation of "shift needed here". Once that's done it does become a question of use and activation. I can't help feeling that now that we've been around the houses a few times, s_user_ns is actually in the wrong place and it should be in the mount struct not the superblock. I get the impression we've got the what we need to expose (the use cases) well established (at least in our heads). The big question your asking is implementation (the how) and also whether there isn't a combination of the exposures that works for everyone. I think this might make a very good session at LSF/MM. The how piece is completely within the purview of kernel developers. The use case is more problematic because that does involve the container orchestration community. However, I think at LSF/MM if we could get agreement on a unified implementation that covers the three use cases we're in a much better position to have the container orchestration conversation because it's simply a case of tweaking the activation mechanisms. I'll propose it as a topic. James
On Tue, Feb 18, 2020 at 08:11:00AM -0800, James Bottomley wrote: > On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote: > > On Mon, Feb 17, 2020 at 10:56 PM James Bottomley > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > The object of this series is to replace shiftfs with a proper > > > uid/gid shifting bind mount instead of the shiftfs hack of > > > introducing something that looks similar to an overlay filesystem > > > to do it. > > > > > > The VFS still has the problem that in order to tell what vfsmount a > > > dentry belongs to, struct path would have to be threaded everywhere > > > struct dentry currently is. However, this patch is structured only > > > to require a rethreading of notify_change. The rest of the > > > knowledge that a shift is in operation is carried in the task > > > structure by caching the unshifted credentials. > > > > > > Note that although it is currently dependent on the new configfd > > > interface for bind mounts, only patch 3/3 relies on this, and the > > > whole thing could be redone as a syscall or any other mechanism > > > (depending on how people eventually want to fix the problem with > > > the new fsconfig mechanism being unable to reconfigure bind > > > mounts). > > > > > > The changes from v2 are I've added Amir's reviewed-by for the > > > notify_change rethreading and I've implemented Serge's request for > > > a base offset shift for the image. It turned out to be much harder > > > to implement a simple linear shift than simply to do it through a > > > different userns, so that's how I've done it. The userns you need > > > to set up for the offset shifted image is one where the interior > > > uid would see the shifted image as fake root. I've introduced an > > > additional "ns" config parameter, which must be specified when > > > building the allow shift mount point (so it's done by the admin, > > > not by the unprivileged user). I've also taken care that the image > > > shifted to zero (real root) is never visible in the > > > filesystem. Patch 3/3 explains how to use the additional "ns" > > > parameter. > > > > > > > > > > James, > > > > To us common people who do not breath containers, your proposal seems > > like a competing implementation to Christian's proposal [1]. > > I think we have three things that swirl around this space and aren't > quite direct copies of each other's use cases but aren't entirely > disjoint either: the superblock user namespace, this and the user > namespace fsid mapping. > > > If it were a competing implementation, I think Christian's proposal > > would have won by points for being less intrusive to VFS. > > Heh, that one's subjective. I think the current fsid code is missing I honestly don't think it is subjective. Leaving aside that the patch here is more invasive to the vfs just in how it needs to alter it, you currently require a whole new set of syscalls for this feature. The syscalls themselves even introduce a whole new API and complicated semantics in the kernel and it's completely unclear whether they will make it or not. Even if not, this feature will be tied to the new mount api making it way harder and a longer process to adopt for userspace given that not even all bits and pieces userspace needs are currently in any released kernel. But way more important: what Amir got right is that your approach and fsid mappings don't stand in each others way at all. Shiftfed bind-mounts can be implemented completely independent of fsid mappings after the fact on top of it. Your example, does this: nsfd = open("/proc/567/ns/user", O_RDONLY); /* note: not O_PATH */ configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd); as the ultimate step. Essentially marking a mountpoint as shifted relative to that user namespace. Once fsid mappings are in all that you need to do is replace your make_kuid()/from_kuid()/from_kuid_munged() calls and so on in your patchset with make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done. So I honestly don't currently see any need to block the patchsets on each other. Christian
On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote: > On Tue, Feb 18, 2020 at 08:11:00AM -0800, James Bottomley wrote: > > On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote: > > > On Mon, Feb 17, 2020 at 10:56 PM James Bottomley > > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > > > The object of this series is to replace shiftfs with a proper > > > > uid/gid shifting bind mount instead of the shiftfs hack of > > > > introducing something that looks similar to an overlay > > > > filesystem to do it. > > > > > > > > The VFS still has the problem that in order to tell what > > > > vfsmount a dentry belongs to, struct path would have to be > > > > threaded everywhere struct dentry currently is. However, this > > > > patch is structured only to require a rethreading of > > > > notify_change. The rest of the knowledge that a shift is in > > > > operation is carried in the task structure by caching the > > > > unshifted credentials. > > > > > > > > Note that although it is currently dependent on the new > > > > configfd interface for bind mounts, only patch 3/3 relies on > > > > this, and the whole thing could be redone as a syscall or any > > > > other mechanism (depending on how people eventually want to fix > > > > the problem with the new fsconfig mechanism being unable to > > > > reconfigure bind mounts). > > > > > > > > The changes from v2 are I've added Amir's reviewed-by for the > > > > notify_change rethreading and I've implemented Serge's request > > > > for a base offset shift for the image. It turned out to be > > > > much harder to implement a simple linear shift than simply to > > > > do it through a different userns, so that's how I've done > > > > it. The userns you need to set up for the offset shifted image > > > > is one where the interior uid would see the shifted image as > > > > fake root. I've introduced an additional "ns" config > > > > parameter, which must be specified when building the allow > > > > shift mount point (so it's done by the admin, not by the > > > > unprivileged user). I've also taken care that the image > > > > shifted to zero (real root) is never visible in the > > > > filesystem. Patch 3/3 explains how to use the additional "ns" > > > > parameter. > > > > > > > > > > > > > > James, > > > > > > To us common people who do not breath containers, your proposal > > > seems like a competing implementation to Christian's proposal > > > [1]. > > > > I think we have three things that swirl around this space and > > aren't quite direct copies of each other's use cases but aren't > > entirely disjoint either: the superblock user namespace, this and > > the user namespace fsid mapping. > > > > > If it were a competing implementation, I think Christian's > > > proposal would have won by points for being less intrusive to > > > VFS. > > > > Heh, that one's subjective. I think the current fsid code is > > missing > > I honestly don't think it is subjective. Leaving aside that the patch > here is more invasive to the vfs just in how it needs to alter it, > you currently require a whole new set of syscalls for this feature. Well I really don't ... I'd like to replace about four existing syscalls (fspick/fsopen/fsconfig/fsmount) with two more general ones (configfd_open/configfd_action), but I kept the original four to demonstrate the two new ones could subsume their work. However, I really think we should leave aside the activation mechanism discussion and concentrate on the internal implementation. > The syscalls themselves even introduce a whole new API and > complicated semantics in the kernel and it's completely unclear > whether they will make it or not. As I said in the cover letter. I'm entirely mechanism agnostic. Whatever solves our current rebind reconfigure problem will be usable for shifting bind mounts. I like the idea of using the same fsconfig mechanism, but this patch isn't wedded to it, that's why 2/3 is implementation and 3/3 is activation. The activation patch can be completely replaced without affecting the implementation mechanism. > Even if not, this feature will be tied to the new mount api making > it way harder and a longer process to adopt for userspace given that > not even all bits and pieces userspace needs are currently in > any released kernel. Well, given that this problem and various proposed solutions have been around for years, I really don't think we're suddenly in any rush to get it out of the door and into user hands. > But way more important: what Amir got right is that your approach and > fsid mappings don't stand in each others way at all. Shiftfed > bind-mounts can be implemented completely independent of fsid > mappings after the fact on top of it. > > Your example, does this: > > nsfd = open("/proc/567/ns/user", O_RDONLY); /* note: not O_PATH */ > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd); > > as the ultimate step. Essentially marking a mountpoint as shifted > relative to that user namespace. Once fsid mappings are in all that > you need to do is replace your > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in your > patchset with make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and > you're done. So I honestly don't currently see any need to block the > patchsets on each other. Can I repeat: there's no rush to get upstream on this. Let's pause to get the kernel implementation (the thing we have to maintain) right. I realise we could each work around the other and get our implementations bent around each other so they all work independently thus making our disjoint user cabals happy but I don't think that would lead to the best outcome for kernel maintainability. I already think that history shows us that s_user_ns went upstream too fast and the fact that unprivileged fuse has yet to make it (and the reception the original shiftfs got) indicates there may be an abstraction problem with s_user_ns (I have to confess here that while I think I understand the unprivileged fuse issues for both the daemon and the consumer, I don't have a clear insight into the f2fs use case ... it's something androidy that no-one has fully explained). So I think the implementors of all three features need to come up with a generic VFS shifting abstraction that covers all the use cases. We also need to decide which user_ns we care about: the one above us in current, the one captured when the mnt_ns got unshared, the one captured in s_user_ns when the mount was done or the new one I introduced which captures the user_ns at struct mount creation/clone time ... I really don't think we need them all so we should work out what do they all mean and which should we care about and start ruthlessly eliminating any one we believe we don't care about. We're really like people building a castle by each working on our own turret and my fear is the weight of all the turrets, even if it doesn't cause the whole castle simply to collapse, is weakening our security posture. James
On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote: > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote: > > On Tue, Feb 18, 2020 at 08:11:00AM -0800, James Bottomley wrote: > > > On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote: > > > > On Mon, Feb 17, 2020 at 10:56 PM James Bottomley > > > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > > > > > The object of this series is to replace shiftfs with a proper > > > > > uid/gid shifting bind mount instead of the shiftfs hack of > > > > > introducing something that looks similar to an overlay > > > > > filesystem to do it. > > > > > > > > > > The VFS still has the problem that in order to tell what > > > > > vfsmount a dentry belongs to, struct path would have to be > > > > > threaded everywhere struct dentry currently is. However, this > > > > > patch is structured only to require a rethreading of > > > > > notify_change. The rest of the knowledge that a shift is in > > > > > operation is carried in the task structure by caching the > > > > > unshifted credentials. > > > > > > > > > > Note that although it is currently dependent on the new > > > > > configfd interface for bind mounts, only patch 3/3 relies on > > > > > this, and the whole thing could be redone as a syscall or any > > > > > other mechanism (depending on how people eventually want to fix > > > > > the problem with the new fsconfig mechanism being unable to > > > > > reconfigure bind mounts). > > > > > > > > > > The changes from v2 are I've added Amir's reviewed-by for the > > > > > notify_change rethreading and I've implemented Serge's request > > > > > for a base offset shift for the image. It turned out to be > > > > > much harder to implement a simple linear shift than simply to > > > > > do it through a different userns, so that's how I've done > > > > > it. The userns you need to set up for the offset shifted image > > > > > is one where the interior uid would see the shifted image as > > > > > fake root. I've introduced an additional "ns" config > > > > > parameter, which must be specified when building the allow > > > > > shift mount point (so it's done by the admin, not by the > > > > > unprivileged user). I've also taken care that the image > > > > > shifted to zero (real root) is never visible in the > > > > > filesystem. Patch 3/3 explains how to use the additional "ns" > > > > > parameter. > > > > > > > > > > > > > > > > > > James, > > > > > > > > To us common people who do not breath containers, your proposal > > > > seems like a competing implementation to Christian's proposal > > > > [1]. > > > > > > I think we have three things that swirl around this space and > > > aren't quite direct copies of each other's use cases but aren't > > > entirely disjoint either: the superblock user namespace, this and > > > the user namespace fsid mapping. > > > > > > > If it were a competing implementation, I think Christian's > > > > proposal would have won by points for being less intrusive to > > > > VFS. > > > > > > Heh, that one's subjective. I think the current fsid code is > > > missing > > > > I honestly don't think it is subjective. Leaving aside that the patch > > here is more invasive to the vfs just in how it needs to alter it, > > you currently require a whole new set of syscalls for this feature. > > Well I really don't ... I'd like to replace about four existing > syscalls (fspick/fsopen/fsconfig/fsmount) with two more general ones > (configfd_open/configfd_action), but I kept the original four to > demonstrate the two new ones could subsume their work. However, I > really think we should leave aside the activation mechanism discussion > and concentrate on the internal implementation. > > > The syscalls themselves even introduce a whole new API and > > complicated semantics in the kernel and it's completely unclear > > whether they will make it or not. > > As I said in the cover letter. I'm entirely mechanism agnostic. > Whatever solves our current rebind reconfigure problem will be usable > for shifting bind mounts. I like the idea of using the same fsconfig > mechanism, but this patch isn't wedded to it, that's why 2/3 is > implementation and 3/3 is activation. The activation patch can be > completely replaced without affecting the implementation mechanism. > > > Even if not, this feature will be tied to the new mount api making > > it way harder and a longer process to adopt for userspace given that > > not even all bits and pieces userspace needs are currently in > > any released kernel. > > Well, given that this problem and various proposed solutions have been > around for years, I really don't think we're suddenly in any rush to > get it out of the door and into user hands. > > > But way more important: what Amir got right is that your approach and > > fsid mappings don't stand in each others way at all. Shiftfed > > bind-mounts can be implemented completely independent of fsid > > mappings after the fact on top of it. > > > > Your example, does this: > > > > nsfd = open("/proc/567/ns/user", O_RDONLY); /* note: not O_PATH */ > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd); > > > > as the ultimate step. Essentially marking a mountpoint as shifted > > relative to that user namespace. Once fsid mappings are in all that > > you need to do is replace your > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in your > > patchset with make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and > > you're done. So I honestly don't currently see any need to block the > > patchsets on each other. > > Can I repeat: there's no rush to get upstream on this. Let's pause to > get the kernel implementation (the thing we have to maintain) right. I > realise we could each work around the other and get our implementations > bent around each other so they all work independently thus making our > disjoint user cabals happy but I don't think that would lead to the > best outcome for kernel maintainability. We have had the discussion with all major stakeholders in a single room on what we need at LPC 2019. We agreed on what we need and fsids are a concrete proposal for an implementation that appears to solve all discussed major use-cases in a simple and elegant manner, which can also be cleanly extended to cover your approach later. Imho, there is no need to have the same discussion again at an invite-only event focussed on kernel developers where most of the major stakeholders are unlikely to be able to participate. The patch proposals are here on all relevant list where everyone can participate and we can discuss them right here. I have not yet heard a concrete technical reason why the patch proposal is inadequate and I see no reason to stall this. > > I already think that history shows us that s_user_ns went upstream too > fast and the fact that unprivileged fuse has yet to make it (and the We've established on the other patchset that fsid mappings in no way interfere nor care about s_user_ns so I'm not going to go into this again here. But for the record, unprivileged fuse mounts are supported since: commit 4ad769f3c346ec3d458e255548dec26ca5284cf6 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Tue May 29 09:04:46 2018 -0500 fuse: Allow fully unprivileged mounts Now that the fuse and the vfs work is complete. Allow the fuse filesystem to be mounted by the root user in a user namespace. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Christian
On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote: > On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote: > > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote: [...] > > > But way more important: what Amir got right is that your approach > > > and fsid mappings don't stand in each others way at all. Shiftfed > > > bind-mounts can be implemented completely independent of fsid > > > mappings after the fact on top of it. > > > > > > Your example, does this: > > > > > > nsfd = open("/proc/567/ns/user", O_RDONLY); /* note: not O_PATH > > > */ > > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd); > > > > > > as the ultimate step. Essentially marking a mountpoint as shifted > > > relative to that user namespace. Once fsid mappings are in all > > > that you need to do is replace your > > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in > > > your patchset with > > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done. > > > So I honestly don't currently see any need to block the patchsets > > > on each other. > > > > Can I repeat: there's no rush to get upstream on this. Let's pause > > to get the kernel implementation (the thing we have to maintain) > > right. I realise we could each work around the other and get our > > implementations bent around each other so they all work > > independently thus making our disjoint user cabals happy but I > > don't think that would lead to the best outcome for kernel > > maintainability. > > We have had the discussion with all major stakeholders in a single > room on what we need at LPC 2019. Well, you didn't invite me, so I think "stakeholders" means people we selected because we like their use case. More importantly: "stakeholders" traditionally means not only people who want to consume the feature but also people who have to maintain it ... how many VFS stakeholders were present? > We agreed on what we need and fsids are a concrete proposal for an > implementation that appears to solve all discussed major use-cases in > a simple and elegant manner, which can also be cleanly extended to > cover your approach later. Imho, there is no need to have the same > discussion again at an invite-only event focussed on kernel > developers where most of the major stakeholders are unlikely to be > able to participate. The patch proposals are here on all relevant > list where everyone can participate and we can discuss them right > here. I have not yet heard a concrete technical reason why the patch > proposal is inadequate and I see no reason to stall this. You cut the actual justification I gave: tacking together ad hoc solutions for particular interests has already lead to a proliferation of similar but not quite user_ns captures spreading through the vfs. I didn't say don't do it this way ... all I said was let's get clear what we are doing and lets put together a shifting infrastructure that's clean, easy to understand and reason about in security terms and which can be used to implement all our use cases ... including s_user_ns. And when we've done this, lets eject any of the ad hoc stuff we find we don't need to make the whole thing simpler. > > I already think that history shows us that s_user_ns went upstream > > too fast and the fact that unprivileged fuse has yet to make it > > (and the > > We've established on the other patchset that fsid mappings in no way > interfere nor care about s_user_ns so I'm not going to go into this > again here. But for the record, unprivileged fuse mounts are > supported since: I know, but I'm taking the opposite view: not caring about the other uses and working around them has lead to the ad hoc userns creep we see today and I think we need to roll it back to a consistent and easy to reason about implementation. > commit 4ad769f3c346ec3d458e255548dec26ca5284cf6 > Author: Eric W. Biederman <ebiederm@xmission.com> > Date: Tue May 29 09:04:46 2018 -0500 > > fuse: Allow fully unprivileged mounts I know the patch is there ... I just haven't found any users yet, so I think there's still something else missing. This is really Seth's baby so I was hoping he'd have ideas about what. James
On Tue, Feb 18, 2020 at 03:43:18PM -0800, James Bottomley wrote: > On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote: > > On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote: > > > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote: > [...] > > > > But way more important: what Amir got right is that your approach > > > > and fsid mappings don't stand in each others way at all. Shiftfed > > > > bind-mounts can be implemented completely independent of fsid > > > > mappings after the fact on top of it. > > > > > > > > Your example, does this: > > > > > > > > nsfd = open("/proc/567/ns/user", O_RDONLY); /* note: not O_PATH > > > > */ > > > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd); > > > > > > > > as the ultimate step. Essentially marking a mountpoint as shifted > > > > relative to that user namespace. Once fsid mappings are in all > > > > that you need to do is replace your > > > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in > > > > your patchset with > > > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done. > > > > So I honestly don't currently see any need to block the patchsets > > > > on each other. > > > > > > Can I repeat: there's no rush to get upstream on this. Let's pause > > > to get the kernel implementation (the thing we have to maintain) > > > right. I realise we could each work around the other and get our > > > implementations bent around each other so they all work > > > independently thus making our disjoint user cabals happy but I > > > don't think that would lead to the best outcome for kernel > > > maintainability. > > > > We have had the discussion with all major stakeholders in a single > > room on what we need at LPC 2019. > > Well, you didn't invite me, so I think "stakeholders" means people we I'm confused as you were in the room with everyone. It's even on the schedule under your name: https://linuxplumbersconf.org/event/4/contributions/474/ > selected because we like their use case. More importantly: > "stakeholders" traditionally means not only people who want to consume > the feature but also people who have to maintain it ... how many VFS > stakeholders were present? Again, I'm confused you were in the room with at least David and Eric. In any case, the patches are on the relevant lists. They are actively being discussed and visible to everyone and we'll have time for proper review which is already happening.
On Tue, Feb 18, 2020 at 6:43 PM James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote: > > On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote: > > > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote: > [...] > > > > But way more important: what Amir got right is that your approach > > > > and fsid mappings don't stand in each others way at all. Shiftfed > > > > bind-mounts can be implemented completely independent of fsid > > > > mappings after the fact on top of it. > > > > > > > > Your example, does this: > > > > > > > > nsfd = open("/proc/567/ns/user", O_RDONLY); /* note: not O_PATH > > > > */ > > > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd); > > > > > > > > as the ultimate step. Essentially marking a mountpoint as shifted > > > > relative to that user namespace. Once fsid mappings are in all > > > > that you need to do is replace your > > > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in > > > > your patchset with > > > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done. > > > > So I honestly don't currently see any need to block the patchsets > > > > on each other. > > > > > > Can I repeat: there's no rush to get upstream on this. Let's pause > > > to get the kernel implementation (the thing we have to maintain) > > > right. I realise we could each work around the other and get our > > > implementations bent around each other so they all work > > > independently thus making our disjoint user cabals happy but I > > > don't think that would lead to the best outcome for kernel > > > maintainability. > > > > We have had the discussion with all major stakeholders in a single > > room on what we need at LPC 2019. > > Well, you didn't invite me, so I think "stakeholders" means people we > selected because we like their use case. More importantly: > "stakeholders" traditionally means not only people who want to consume > the feature but also people who have to maintain it ... how many VFS > stakeholders were present? > > > We agreed on what we need and fsids are a concrete proposal for an > > implementation that appears to solve all discussed major use-cases in > > a simple and elegant manner, which can also be cleanly extended to > > cover your approach later. Imho, there is no need to have the same > > discussion again at an invite-only event focussed on kernel > > developers where most of the major stakeholders are unlikely to be > > able to participate. The patch proposals are here on all relevant > > list where everyone can participate and we can discuss them right > > here. I have not yet heard a concrete technical reason why the patch > > proposal is inadequate and I see no reason to stall this. > > You cut the actual justification I gave: tacking together ad hoc > solutions for particular interests has already lead to a proliferation > of similar but not quite user_ns captures spreading through the vfs. I > didn't say don't do it this way ... all I said was let's get clear what > we are doing and lets put together a shifting infrastructure that's > clean, easy to understand and reason about in security terms and which > can be used to implement all our use cases ... including s_user_ns. > And when we've done this, lets eject any of the ad hoc stuff we find we > don't need to make the whole thing simpler. > > > > I already think that history shows us that s_user_ns went upstream > > > too fast and the fact that unprivileged fuse has yet to make it > > > (and the > > > > We've established on the other patchset that fsid mappings in no way > > interfere nor care about s_user_ns so I'm not going to go into this > > again here. But for the record, unprivileged fuse mounts are > > supported since: > > I know, but I'm taking the opposite view: not caring about the other > uses and working around them has lead to the ad hoc userns creep we see > today and I think we need to roll it back to a consistent and easy to > reason about implementation. > > > commit 4ad769f3c346ec3d458e255548dec26ca5284cf6 > > Author: Eric W. Biederman <ebiederm@xmission.com> > > Date: Tue May 29 09:04:46 2018 -0500 > > > > fuse: Allow fully unprivileged mounts > > I know the patch is there ... I just haven't found any users yet, so I > think there's still something else missing. This is really Seth's > baby so I was hoping he'd have ideas about what. I'm confused by that part, we have hundreds of thousands of users of this feature and it's been backported to older kernels on a number of platforms due to its usefulness. It's what's used for installing/running snap packages inside Ubuntu containers, it's also quite widely used inside Terminal/Crostini on Chromebooks, performs transparent unprivileged mounts on Travis-CI and is also used by some of the rootless runtimes to allow for mounting overlayfs, squashfs or even ext* inside unprivileged containers. It's certainly not something that's super user visible, it also doesn't need any particular support for it to work, just use fuse as usual inside an unprivileged container. So the lack of noise about it doesn't necessarily indicate that it's not used, it may just indicate that it's been working as intended :) > James >
On Wed, 2020-02-19 at 14:26 +0100, Christian Brauner wrote: > On Tue, Feb 18, 2020 at 03:43:18PM -0800, James Bottomley wrote: > > On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote: > > > On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote: > > > > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote: > > > > [...] > > > > > But way more important: what Amir got right is that your > > > > > approach and fsid mappings don't stand in each others way at > > > > > all. Shiftfed bind-mounts can be implemented completely > > > > > independent of fsid mappings after the fact on top of it. > > > > > > > > > > Your example, does this: > > > > > > > > > > nsfd = open("/proc/567/ns/user", O_RDONLY); /* note: not > > > > > O_PATH > > > > > */ > > > > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd); > > > > > > > > > > as the ultimate step. Essentially marking a mountpoint as > > > > > shifted relative to that user namespace. Once fsid mappings > > > > > are in all that you need to do is replace your > > > > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in > > > > > your patchset with > > > > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're > > > > > done. So I honestly don't currently see any need to block the > > > > > patchsets on each other. > > > > > > > > Can I repeat: there's no rush to get upstream on this. Let's > > > > pause to get the kernel implementation (the thing we have to > > > > maintain) right. I realise we could each work around the other > > > > and get our implementations bent around each other so they all > > > > work independently thus making our disjoint user cabals happy > > > > but I don't think that would lead to the best outcome for > > > > kernel maintainability. > > > > > > We have had the discussion with all major stakeholders in a > > > single room on what we need at LPC 2019. > > > > Well, you didn't invite me, so I think "stakeholders" means people > > we > > I'm confused as you were in the room with everyone. It's even on the > schedule under your name: > https://linuxplumbersconf.org/event/4/contributions/474/ > > > selected because we like their use case. More importantly: > > "stakeholders" traditionally means not only people who want to > > consume the feature but also people who have to maintain it ... how > > many VFS stakeholders were present? > > Again, I'm confused you were in the room with at least David and > Eric. OK, I think we both got different things out of this, but I've now put the videos for this on the LPC site so others can judge for themselves. The one for this session is here: https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2h12m52s Although it does make more sense if you watch the Dave Howells session on the new mount API first: https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h45m54s There is lots of discussion of the shared image use case, but nowhere is there any discussion of separating uid from fsuid in the user namespace, unless I missed it again in my second viewing? > In any case, the patches are on the relevant lists. They are actively > being discussed and visible to everyone and we'll have time for > proper review which is already happening. The big issue, though, is that while you've produced a patch set that covers your use case (shared unprivileged images) it doesn't cover mine (privileged images). However, the patch set I produced does cover both use cases. Safely handling privileged images is a much bigger security problem than unprivileged ones, which is why I have more VFS code than you do. While I could, in theory, remove your use case from my patch, doing so would really only reduce it by 38 lines (this is the diffstat going from v2 to v3). However, I still need all the code in the v2 patch to handle the backshifts needed for safe privileged images and I pretty much don't use any of the fsid code you add because I need an unprivileged fsid and uid, so the shifts are always identical for both ... which is the default before your patch. This is what I mean about us getting the VFS implementation correct: I think I can sweep up s_user_ns into my patch (unproven because I've yet to write the patch), effectively making it use the mnt_userns I introduced and eliminating the superblock additions and, for 38 additional VFS lines, I also pick up your use case ... although there may be subtleties I've missed. I also think the way I coded it is much easier to use for an orchestration system. Basically you create a user_ns for the unpacker, rendering it unprivileged, and it unpacks your images into the filesystem so they also become unprivileged and globally shifted. Then whenever you use these images for a container with a separate uid shift, you pass the unpacker userns with the "ns" argument and bind mount the root into the container's userns: the container gets fake root and any writes by fake root are shifted correctly into the unprivileged image. Everything just works and there's no need to bother with fsid's. However, what's still not completely done by any of us is convincing the VFS community that we've got the correct and maintainable way of adding all this to their VFS layer. James