[v3,0/3] introduce a uid/gid shifting bind mount

Message ID	20200217205307.32256-1-James.Bottomley@HansenPartnership.com (mailing list archive)
Headers	show Return-Path: <SRS0=6ymt=4F=vger.kernel.org=linux-fsdevel-owner@kernel.org> From: James Bottomley <James.Bottomley@HansenPartnership.com> To: linux-fsdevel@vger.kernel.org Cc: David Howells <dhowells@redhat.com>, Christian Brauner <christian@brauner.io>, Al Viro <viro@ZenIV.linux.org.uk>, Miklos Szeredi <miklos@szeredi.hu>, Seth Forshee <seth.forshee@canonical.com>, linux-unionfs@vger.kernel.org, Amir Goldstein <amir73il@gmail.com>, =?utf-8?q?St=C3=A9phane_Graber?= <stgraber@ubuntu.com>, Eric Biederman <ebiederm@xmission.com>, Aleksa Sarai <cyphar@cyphar.com>, "Serge E . Hallyn" <serge@hallyn.com>, Tycho Andersen <tycho@tycho.ws>, containers@lists.linux-foundation.org Subject: [PATCH v3 0/3] introduce a uid/gid shifting bind mount Date: Mon, 17 Feb 2020 12:53:04 -0800 Message-Id: <20200217205307.32256-1-James.Bottomley@HansenPartnership.com> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	introduce a uid/gid shifting bind mount \| expand [v3,0/3] introduce a uid/gid shifting bind mount [v3,1/3] fs: rethread notify_change to take a path instead of a dentry [v3,2/3] fs: introduce uid/gid shifting bind mount [v3,3/3] fs: expose shifting bind mount to userspace

James Bottomley Feb. 17, 2020, 8:53 p.m. UTC

The object of this series is to replace shiftfs with a proper uid/gid
shifting bind mount instead of the shiftfs hack of introducing
something that looks similar to an overlay filesystem to do it.

The VFS still has the problem that in order to tell what vfsmount a
dentry belongs to, struct path would have to be threaded everywhere
struct dentry currently is.  However, this patch is structured only to
require a rethreading of notify_change.  The rest of the knowledge
that a shift is in operation is carried in the task structure by
caching the unshifted credentials.

Note that although it is currently dependent on the new configfd
interface for bind mounts, only patch 3/3 relies on this, and the
whole thing could be redone as a syscall or any other mechanism
(depending on how people eventually want to fix the problem with the
new fsconfig mechanism being unable to reconfigure bind mounts).

The changes from v2 are I've added Amir's reviewed-by for the
notify_change rethreading and I've implemented Serge's request for a
base offset shift for the image.  It turned out to be much harder to
implement a simple linear shift than simply to do it through a
different userns, so that's how I've done it.  The userns you need to
set up for the offset shifted image is one where the interior uid
would see the shifted image as fake root.  I've introduced an
additional "ns" config parameter, which must be specified when
building the allow shift mount point (so it's done by the admin, not
by the unprivileged user).  I've also taken care that the image
shifted to zero (real root) is never visible in the filesystem.  Patch
3/3 explains how to use the additional "ns" parameter.

James

---

James Bottomley (3):
  fs: rethread notify_change to take a path instead of a dentry
  fs: introduce uid/gid shifting bind mount
  fs: expose shifting bind mount to userspace

 drivers/base/devtmpfs.c   |   8 ++-
 fs/attr.c                 | 131 ++++++++++++++++++++++++++++++++++++++--------
 fs/bind.c                 | 105 +++++++++++++++++++++++++++++++++----
 fs/cachefiles/interface.c |   6 ++-
 fs/coredump.c             |   4 +-
 fs/ecryptfs/inode.c       |   9 ++--
 fs/exec.c                 |   3 +-
 fs/inode.c                |  17 +++---
 fs/internal.h             |   2 +
 fs/mount.h                |   3 ++
 fs/namei.c                | 114 +++++++++++++++++++++++++++++++++-------
 fs/namespace.c            |   6 +++
 fs/nfsd/vfs.c             |  13 +++--
 fs/open.c                 |  44 ++++++++++++----
 fs/overlayfs/copy_up.c    |  40 ++++++++------
 fs/overlayfs/dir.c        |  10 +++-
 fs/overlayfs/inode.c      |   6 ++-
 fs/overlayfs/overlayfs.h  |   2 +-
 fs/overlayfs/super.c      |   3 +-
 fs/posix_acl.c            |   4 +-
 fs/proc_namespace.c       |   4 ++
 fs/stat.c                 |  32 +++++++++--
 fs/utimes.c               |   2 +-
 include/linux/cred.h      |  12 +++++
 include/linux/fs.h        |   7 ++-
 include/linux/mount.h     |   4 +-
 include/linux/sched.h     |   5 ++
 kernel/capability.c       |   9 +++-
 kernel/cred.c             |  20 +++++++
 29 files changed, 507 insertions(+), 118 deletions(-)

Amir Goldstein Feb. 18, 2020, 7:18 a.m. UTC | #1

On Mon, Feb 17, 2020 at 10:56 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> The object of this series is to replace shiftfs with a proper uid/gid
> shifting bind mount instead of the shiftfs hack of introducing
> something that looks similar to an overlay filesystem to do it.
>
> The VFS still has the problem that in order to tell what vfsmount a
> dentry belongs to, struct path would have to be threaded everywhere
> struct dentry currently is.  However, this patch is structured only to
> require a rethreading of notify_change.  The rest of the knowledge
> that a shift is in operation is carried in the task structure by
> caching the unshifted credentials.
>
> Note that although it is currently dependent on the new configfd
> interface for bind mounts, only patch 3/3 relies on this, and the
> whole thing could be redone as a syscall or any other mechanism
> (depending on how people eventually want to fix the problem with the
> new fsconfig mechanism being unable to reconfigure bind mounts).
>
> The changes from v2 are I've added Amir's reviewed-by for the
> notify_change rethreading and I've implemented Serge's request for a
> base offset shift for the image.  It turned out to be much harder to
> implement a simple linear shift than simply to do it through a
> different userns, so that's how I've done it.  The userns you need to
> set up for the offset shifted image is one where the interior uid
> would see the shifted image as fake root.  I've introduced an
> additional "ns" config parameter, which must be specified when
> building the allow shift mount point (so it's done by the admin, not
> by the unprivileged user).  I've also taken care that the image
> shifted to zero (real root) is never visible in the filesystem.  Patch
> 3/3 explains how to use the additional "ns" parameter.
>
>

James,

To us common people who do not breath containers, your proposal seems
like a competing implementation to Christian's proposal [1]. If it were a
competing implementation, I think Christian's proposal would have won by
points for being less intrusive to VFS.

But it is not really a competing implementation, is it? Your proposals meet
two different, but very overlapping, set of requirements. IMHO, none of you
did a really good job of explaining that in the cover latter, let
alone, refer to
each others proposals (I am referring to your v3 posting of course).

IIUC, Christian's proposal deals with single shared image per
non-overlapping groups of containers. And it deals with this use case very
elegantly IMO. From your comments on Christian's post, it does not
seem that you oppose to his proposal, except that it does not meet the
requirements for all of your use cases.

IIUC, your proposal can deal with multiple shared images per overlapping
groups of containers and it adds an element of "auto-reverse-mapping",
which reduces the administration overhead of this to be nightmare
of orchestration.

It seems to me, that you should look into working your patch set on
top of fsid mapping and try to make use of it as much as possible.
And to make things a bit more clear to the rest of us, you should probably
market your feature as "auto back shifting mount" or something like that
and explain the added value of the feature on top of plain fsid mapping.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20200214183554.1133805-1-christian.brauner@ubuntu.com/

James Bottomley Feb. 18, 2020, 4:11 p.m. UTC | #2

On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote:
> On Mon, Feb 17, 2020 at 10:56 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > 
> > The object of this series is to replace shiftfs with a proper
> > uid/gid shifting bind mount instead of the shiftfs hack of
> > introducing something that looks similar to an overlay filesystem
> > to do it.
> > 
> > The VFS still has the problem that in order to tell what vfsmount a
> > dentry belongs to, struct path would have to be threaded everywhere
> > struct dentry currently is.  However, this patch is structured only
> > to require a rethreading of notify_change.  The rest of the
> > knowledge that a shift is in operation is carried in the task
> > structure by caching the unshifted credentials.
> > 
> > Note that although it is currently dependent on the new configfd
> > interface for bind mounts, only patch 3/3 relies on this, and the
> > whole thing could be redone as a syscall or any other mechanism
> > (depending on how people eventually want to fix the problem with
> > the new fsconfig mechanism being unable to reconfigure bind
> > mounts).
> > 
> > The changes from v2 are I've added Amir's reviewed-by for the
> > notify_change rethreading and I've implemented Serge's request for
> > a base offset shift for the image.  It turned out to be much harder
> > to implement a simple linear shift than simply to do it through a
> > different userns, so that's how I've done it.  The userns you need
> > to set up for the offset shifted image is one where the interior
> > uid would see the shifted image as fake root.  I've introduced an
> > additional "ns" config parameter, which must be specified when
> > building the allow shift mount point (so it's done by the admin,
> > not by the unprivileged user).  I've also taken care that the image
> > shifted to zero (real root) is never visible in the
> > filesystem.  Patch 3/3 explains how to use the additional "ns"
> > parameter.
> > 
> > 
> 
> James,
> 
> To us common people who do not breath containers, your proposal seems
> like a competing implementation to Christian's proposal [1].

I think we have three things that swirl around this space and aren't
quite direct copies of each other's use cases but aren't entirely
disjoint either: the superblock user namespace, this and the user
namespace fsid mapping.

>  If it were a competing implementation, I think Christian's proposal
> would have won by points for being less intrusive to VFS.

Heh, that one's subjective.  I think the current fsid code is missing
quite a few use cases in the stat/attr/in_group_p cases.  I'm just
building the code now to run it through the shiftfs tests and see how
it fares.  I think once those cases are added, the VFS changes in fsid
will be the same as I have in patch 2/3 ... primarily because we all
have to shift the same thing at the same point.  If you include the
notify_change rethreading then, yes, you're correct, but that patch
does stand on its own and is consonant with a long term vfs goal of
using path instead of dentry.

> But it is not really a competing implementation, is it? Your
> proposals meet two different, but very overlapping, set of
> requirements. IMHO, none of you did a really good job of explaining
> that in the cover latter, let alone, refer to each others proposals
> (I am referring to your v3 posting of course).

Yes, I know, but the fsid one is only a few days old so I haven't had
time to absorb all of it yet.

> IIUC, Christian's proposal deals with single shared image per
> non-overlapping groups of containers. And it deals with this use case
> very elegantly IMO. From your comments on Christian's post, it does
> not seem that you oppose to his proposal, except that it does not
> meet the requirements for all of your use cases.

No, but I think it could.  It's one of these perennial problems of more
generic vs more specific to use case.  I'm a bit lost in really what we
need for containers.  In the original shiftfs I made it superblock
based precisely because that was the only way I could integrate
s_user_ns into it ... and I thought that was a good idea.

> IIUC, your proposal can deal with multiple shared images per
> overlapping groups of containers

That's right ... it does the shift based on path and user namespace. 
In theory it allows an image with shifted and unshifted pieces ...
however, I'm not sure there's even a use case for that granularity
because all my current image shifting use cases are all or nothing. 
The granularity is an accident of the bind mount implementation.

>  and it adds an element of "auto-reverse-mapping", which reduces the
> administration overhead of this to be nightmare of orchestration.

Right, but the same thing could be an option to the fsid proposal: the
default use could shift forward on kuid and back on the same map for
fsuid ... then it would do what shiftfs does.  Likewise, shiftfs could
pick up the shift from an existing namespace and thus look more like
what fsuid does.

> It seems to me, that you should look into working your patch set on
> top of fsid mapping and try to make use of it as much as possible.
> And to make things a bit more clear to the rest of us, you should
> probably market your feature as "auto back shifting mount" or
> something like that and explain the added value of the feature on top
> of plain fsid mapping

Well we both need the same shift points, so we could definitely both
work off a generic vfs encapsulation of "shift needed here".  Once
that's done it does become a question of use and activation.

I can't help feeling that now that we've been around the houses a few
times, s_user_ns is actually in the wrong place and it should be in the
mount struct not the superblock.  I get the impression we've got the
what we need to expose (the use cases) well established (at least in
our heads).  The big question your asking is implementation (the how)
and also whether there isn't a combination of the exposures that works
for everyone.  I think this might make a very good session at LSF/MM. 
The how piece is completely within the purview of kernel developers. 
The use case is more problematic because that does involve the
container orchestration community.  However, I think at LSF/MM if we
could get agreement on a unified implementation that covers the three
use cases we're in a much better position to have the container
orchestration conversation  because it's simply a case of tweaking the
activation mechanisms.  I'll propose it as a topic.

James

Christian Brauner Feb. 18, 2020, 5:26 p.m. UTC | #3

On Tue, Feb 18, 2020 at 08:11:00AM -0800, James Bottomley wrote:
> On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote:
> > On Mon, Feb 17, 2020 at 10:56 PM James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > > 
> > > The object of this series is to replace shiftfs with a proper
> > > uid/gid shifting bind mount instead of the shiftfs hack of
> > > introducing something that looks similar to an overlay filesystem
> > > to do it.
> > > 
> > > The VFS still has the problem that in order to tell what vfsmount a
> > > dentry belongs to, struct path would have to be threaded everywhere
> > > struct dentry currently is.  However, this patch is structured only
> > > to require a rethreading of notify_change.  The rest of the
> > > knowledge that a shift is in operation is carried in the task
> > > structure by caching the unshifted credentials.
> > > 
> > > Note that although it is currently dependent on the new configfd
> > > interface for bind mounts, only patch 3/3 relies on this, and the
> > > whole thing could be redone as a syscall or any other mechanism
> > > (depending on how people eventually want to fix the problem with
> > > the new fsconfig mechanism being unable to reconfigure bind
> > > mounts).
> > > 
> > > The changes from v2 are I've added Amir's reviewed-by for the
> > > notify_change rethreading and I've implemented Serge's request for
> > > a base offset shift for the image.  It turned out to be much harder
> > > to implement a simple linear shift than simply to do it through a
> > > different userns, so that's how I've done it.  The userns you need
> > > to set up for the offset shifted image is one where the interior
> > > uid would see the shifted image as fake root.  I've introduced an
> > > additional "ns" config parameter, which must be specified when
> > > building the allow shift mount point (so it's done by the admin,
> > > not by the unprivileged user).  I've also taken care that the image
> > > shifted to zero (real root) is never visible in the
> > > filesystem.  Patch 3/3 explains how to use the additional "ns"
> > > parameter.
> > > 
> > > 
> > 
> > James,
> > 
> > To us common people who do not breath containers, your proposal seems
> > like a competing implementation to Christian's proposal [1].
> 
> I think we have three things that swirl around this space and aren't
> quite direct copies of each other's use cases but aren't entirely
> disjoint either: the superblock user namespace, this and the user
> namespace fsid mapping.
> 
> >  If it were a competing implementation, I think Christian's proposal
> > would have won by points for being less intrusive to VFS.
> 
> Heh, that one's subjective.  I think the current fsid code is missing

I honestly don't think it is subjective. Leaving aside that the patch
here is more invasive to the vfs just in how it needs to alter it, you
currently require a whole new set of syscalls for this feature. The
syscalls themselves even introduce a whole new API and complicated
semantics in the kernel and it's completely unclear whether they will
make it or not. Even if not, this feature will be tied to the new mount
api making it way harder and a longer process to adopt for userspace
given that not even all bits and pieces userspace needs are currently in
any released kernel.

But way more important: what Amir got right is that your approach and
fsid mappings don't stand in each others way at all. Shiftfed
bind-mounts can be implemented completely independent of fsid mappings
after the fact on top of it.

Your example, does this:

nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH */
configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);

as the ultimate step. Essentially marking a mountpoint as shifted
relative to that user namespace. Once fsid mappings are in all that you
need to do is replace your make_kuid()/from_kuid()/from_kuid_munged()
calls and so on in your patchset with
make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done. So I
honestly don't currently see any need to block the patchsets on each
other. 

Christian

James Bottomley Feb. 18, 2020, 7:05 p.m. UTC | #4

On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 08:11:00AM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote:
> > > On Mon, Feb 17, 2020 at 10:56 PM James Bottomley
> > > <James.Bottomley@hansenpartnership.com> wrote:
> > > > 
> > > > The object of this series is to replace shiftfs with a proper
> > > > uid/gid shifting bind mount instead of the shiftfs hack of
> > > > introducing something that looks similar to an overlay
> > > > filesystem to do it.
> > > > 
> > > > The VFS still has the problem that in order to tell what
> > > > vfsmount a dentry belongs to, struct path would have to be
> > > > threaded everywhere struct dentry currently is.  However, this
> > > > patch is structured only to require a rethreading of
> > > > notify_change.  The rest of the knowledge that a shift is in
> > > > operation is carried in the task structure by caching the
> > > > unshifted credentials.
> > > > 
> > > > Note that although it is currently dependent on the new
> > > > configfd interface for bind mounts, only patch 3/3 relies on
> > > > this, and the whole thing could be redone as a syscall or any
> > > > other mechanism (depending on how people eventually want to fix
> > > > the problem with the new fsconfig mechanism being unable to
> > > > reconfigure bind mounts).
> > > > 
> > > > The changes from v2 are I've added Amir's reviewed-by for the
> > > > notify_change rethreading and I've implemented Serge's request
> > > > for a base offset shift for the image.  It turned out to be
> > > > much harder to implement a simple linear shift than simply to
> > > > do it through a different userns, so that's how I've done
> > > > it.  The userns you need to set up for the offset shifted image
> > > > is one where the interior uid would see the shifted image as
> > > > fake root.  I've introduced an additional "ns" config
> > > > parameter, which must be specified when building the allow
> > > > shift mount point (so it's done by the admin, not by the
> > > > unprivileged user).  I've also taken care that the image
> > > > shifted to zero (real root) is never visible in the
> > > > filesystem.  Patch 3/3 explains how to use the additional "ns"
> > > > parameter.
> > > > 
> > > > 
> > > 
> > > James,
> > > 
> > > To us common people who do not breath containers, your proposal
> > > seems like a competing implementation to Christian's proposal
> > > [1].
> > 
> > I think we have three things that swirl around this space and
> > aren't quite direct copies of each other's use cases but aren't
> > entirely disjoint either: the superblock user namespace, this and
> > the user namespace fsid mapping.
> > 
> > >  If it were a competing implementation, I think Christian's
> > > proposal would have won by points for being less intrusive to
> > > VFS.
> > 
> > Heh, that one's subjective.  I think the current fsid code is
> > missing
> 
> I honestly don't think it is subjective. Leaving aside that the patch
> here is more invasive to the vfs just in how it needs to alter it,
> you currently require a whole new set of syscalls for this feature. 

Well I really don't ... I'd like to replace about four existing
syscalls (fspick/fsopen/fsconfig/fsmount) with two more general ones
(configfd_open/configfd_action), but I kept the original four to
demonstrate the two new ones could subsume their work.  However, I
really think we should leave aside the activation mechanism discussion
and concentrate on the internal implementation.

> The syscalls themselves even introduce a whole new API and
> complicated semantics in the kernel and it's completely unclear
> whether they will make it or not.

As I said in the cover letter.  I'm entirely mechanism agnostic. 
Whatever solves our current rebind reconfigure problem will be usable
for shifting bind mounts.  I like the idea of using the same fsconfig
mechanism, but this patch isn't wedded to it, that's why 2/3 is
implementation and 3/3 is activation.  The activation patch can be
completely replaced without affecting the implementation mechanism.

>  Even if not, this feature will be tied to the new mount api making
> it way harder and a longer process to adopt for userspace given that
> not even all bits and pieces userspace needs are currently in
> any released kernel.

Well, given that this problem and various proposed solutions have been
around for years, I really don't think we're suddenly in any rush to
get it out of the door and into user hands.

> But way more important: what Amir got right is that your approach and
> fsid mappings don't stand in each others way at all. Shiftfed
> bind-mounts can be implemented completely independent of fsid
> mappings after the fact on top of it.
> 
> Your example, does this:
> 
> nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH */
> configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> 
> as the ultimate step. Essentially marking a mountpoint as shifted
> relative to that user namespace. Once fsid mappings are in all that
> you need to do is replace your
> make_kuid()/from_kuid()/from_kuid_munged() calls and so on in your
> patchset with make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and
> you're done. So I honestly don't currently see any need to block the
> patchsets on each other. 

Can I repeat: there's no rush to get upstream on this.  Let's pause to
get the kernel implementation (the thing we have to maintain) right.  I
realise we could each work around the other and get our implementations
bent around each other so they all work independently thus making our
disjoint user cabals happy but I don't think that would lead to the
best outcome for kernel maintainability.

I already think that history shows us that s_user_ns went upstream too
fast and the fact that unprivileged fuse has yet to make it (and the
reception the original shiftfs got) indicates there may be an
abstraction problem with s_user_ns (I have to confess here that while I
think I understand the unprivileged fuse issues for both the daemon and
the consumer, I don't have a clear insight into the f2fs use case ...
it's something androidy that no-one has fully explained).  So I think
the implementors of all three features need to come up with a generic
VFS shifting abstraction that covers all the use cases.  We also need
to decide which user_ns we care about: the one above us in current, the
one captured when the mnt_ns got unshared, the one captured in
s_user_ns when the mount was done or the new one I introduced which
captures the user_ns at struct mount creation/clone time ... I really
don't think we need them all so we should work out what do they all
mean and which should we care about and start ruthlessly eliminating
any one we believe we don't care about.

We're really like people building a castle by each working on our own
turret and my fear is the weight of all the turrets, even if it doesn't
cause the whole castle simply to collapse, is weakening our security
posture.

James

Christian Brauner Feb. 18, 2020, 8:03 p.m. UTC | #5

On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote:
> On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
> > On Tue, Feb 18, 2020 at 08:11:00AM -0800, James Bottomley wrote:
> > > On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote:
> > > > On Mon, Feb 17, 2020 at 10:56 PM James Bottomley
> > > > <James.Bottomley@hansenpartnership.com> wrote:
> > > > > 
> > > > > The object of this series is to replace shiftfs with a proper
> > > > > uid/gid shifting bind mount instead of the shiftfs hack of
> > > > > introducing something that looks similar to an overlay
> > > > > filesystem to do it.
> > > > > 
> > > > > The VFS still has the problem that in order to tell what
> > > > > vfsmount a dentry belongs to, struct path would have to be
> > > > > threaded everywhere struct dentry currently is.  However, this
> > > > > patch is structured only to require a rethreading of
> > > > > notify_change.  The rest of the knowledge that a shift is in
> > > > > operation is carried in the task structure by caching the
> > > > > unshifted credentials.
> > > > > 
> > > > > Note that although it is currently dependent on the new
> > > > > configfd interface for bind mounts, only patch 3/3 relies on
> > > > > this, and the whole thing could be redone as a syscall or any
> > > > > other mechanism (depending on how people eventually want to fix
> > > > > the problem with the new fsconfig mechanism being unable to
> > > > > reconfigure bind mounts).
> > > > > 
> > > > > The changes from v2 are I've added Amir's reviewed-by for the
> > > > > notify_change rethreading and I've implemented Serge's request
> > > > > for a base offset shift for the image.  It turned out to be
> > > > > much harder to implement a simple linear shift than simply to
> > > > > do it through a different userns, so that's how I've done
> > > > > it.  The userns you need to set up for the offset shifted image
> > > > > is one where the interior uid would see the shifted image as
> > > > > fake root.  I've introduced an additional "ns" config
> > > > > parameter, which must be specified when building the allow
> > > > > shift mount point (so it's done by the admin, not by the
> > > > > unprivileged user).  I've also taken care that the image
> > > > > shifted to zero (real root) is never visible in the
> > > > > filesystem.  Patch 3/3 explains how to use the additional "ns"
> > > > > parameter.
> > > > > 
> > > > > 
> > > > 
> > > > James,
> > > > 
> > > > To us common people who do not breath containers, your proposal
> > > > seems like a competing implementation to Christian's proposal
> > > > [1].
> > > 
> > > I think we have three things that swirl around this space and
> > > aren't quite direct copies of each other's use cases but aren't
> > > entirely disjoint either: the superblock user namespace, this and
> > > the user namespace fsid mapping.
> > > 
> > > >  If it were a competing implementation, I think Christian's
> > > > proposal would have won by points for being less intrusive to
> > > > VFS.
> > > 
> > > Heh, that one's subjective.  I think the current fsid code is
> > > missing
> > 
> > I honestly don't think it is subjective. Leaving aside that the patch
> > here is more invasive to the vfs just in how it needs to alter it,
> > you currently require a whole new set of syscalls for this feature. 
> 
> Well I really don't ... I'd like to replace about four existing
> syscalls (fspick/fsopen/fsconfig/fsmount) with two more general ones
> (configfd_open/configfd_action), but I kept the original four to
> demonstrate the two new ones could subsume their work.  However, I
> really think we should leave aside the activation mechanism discussion
> and concentrate on the internal implementation.
> 
> > The syscalls themselves even introduce a whole new API and
> > complicated semantics in the kernel and it's completely unclear
> > whether they will make it or not.
> 
> As I said in the cover letter.  I'm entirely mechanism agnostic. 
> Whatever solves our current rebind reconfigure problem will be usable
> for shifting bind mounts.  I like the idea of using the same fsconfig
> mechanism, but this patch isn't wedded to it, that's why 2/3 is
> implementation and 3/3 is activation.  The activation patch can be
> completely replaced without affecting the implementation mechanism.
> 
> >  Even if not, this feature will be tied to the new mount api making
> > it way harder and a longer process to adopt for userspace given that
> > not even all bits and pieces userspace needs are currently in
> > any released kernel.
> 
> Well, given that this problem and various proposed solutions have been
> around for years, I really don't think we're suddenly in any rush to
> get it out of the door and into user hands.
> 
> > But way more important: what Amir got right is that your approach and
> > fsid mappings don't stand in each others way at all. Shiftfed
> > bind-mounts can be implemented completely independent of fsid
> > mappings after the fact on top of it.
> > 
> > Your example, does this:
> > 
> > nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH */
> > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> > 
> > as the ultimate step. Essentially marking a mountpoint as shifted
> > relative to that user namespace. Once fsid mappings are in all that
> > you need to do is replace your
> > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in your
> > patchset with make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and
> > you're done. So I honestly don't currently see any need to block the
> > patchsets on each other. 
> 
> Can I repeat: there's no rush to get upstream on this.  Let's pause to
> get the kernel implementation (the thing we have to maintain) right.  I
> realise we could each work around the other and get our implementations
> bent around each other so they all work independently thus making our
> disjoint user cabals happy but I don't think that would lead to the
> best outcome for kernel maintainability.

We have had the discussion with all major stakeholders in a single room
on what we need at LPC 2019. We agreed on what we need and fsids are a
concrete proposal for an implementation that appears to solve all discussed
major use-cases in a simple and elegant manner, which can also be cleanly
extended to cover your approach later.
Imho, there is no need to have the same discussion again at an
invite-only event focussed on kernel developers where most of the major
stakeholders are unlikely to be able to participate. The patch proposals
are here on all relevant list where everyone can participate and we can discuss
them right here.
I have not yet heard a concrete technical reason why the patch proposal is
inadequate and I see no reason to stall this.

> 
> I already think that history shows us that s_user_ns went upstream too
> fast and the fact that unprivileged fuse has yet to make it (and the

We've established on the other patchset that fsid mappings in no way interfere
nor care about s_user_ns so I'm not going to go into this again here. But for
the record, unprivileged fuse mounts are supported since:

commit 4ad769f3c346ec3d458e255548dec26ca5284cf6
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue May 29 09:04:46 2018 -0500

    fuse: Allow fully unprivileged mounts

    Now that the fuse and the vfs work is complete.  Allow the fuse filesystem
    to be mounted by the root user in a user namespace.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

Christian

James Bottomley Feb. 18, 2020, 11:43 p.m. UTC | #6

On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
[...]
> > > But way more important: what Amir got right is that your approach
> > > and fsid mappings don't stand in each others way at all. Shiftfed
> > > bind-mounts can be implemented completely independent of fsid
> > > mappings after the fact on top of it.
> > > 
> > > Your example, does this:
> > > 
> > > nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH
> > > */
> > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> > > 
> > > as the ultimate step. Essentially marking a mountpoint as shifted
> > > relative to that user namespace. Once fsid mappings are in all
> > > that you need to do is replace your
> > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in
> > > your patchset with
> > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done.
> > > So I honestly don't currently see any need to block the patchsets
> > > on each other. 
> > 
> > Can I repeat: there's no rush to get upstream on this.  Let's pause
> > to get the kernel implementation (the thing we have to maintain)
> > right.  I realise we could each work around the other and get our
> > implementations bent around each other so they all work
> > independently thus making our disjoint user cabals happy but I
> > don't think that would lead to the best outcome for kernel
> > maintainability.
> 
> We have had the discussion with all major stakeholders in a single
> room on what we need at LPC 2019.

Well, you didn't invite me, so I think "stakeholders" means people we
selected because we like their use case.  More importantly:
"stakeholders" traditionally means not only people who want to consume
the feature but also people who have to maintain it ... how many VFS
stakeholders were present?

>  We agreed on what we need and fsids are a concrete proposal for an
> implementation that appears to solve all discussed major use-cases in
> a simple and elegant manner, which can also be cleanly extended to
> cover your approach later.  Imho, there is no need to have the same
> discussion again at an invite-only event focussed on kernel
> developers where most of the major stakeholders are unlikely to be
> able to participate. The patch proposals are here on all relevant
> list where everyone can participate and we can discuss them right
> here. I have not yet heard a concrete technical reason why the patch
> proposal is inadequate and I see no reason to stall this.

You cut the actual justification I gave: tacking together ad hoc
solutions for particular interests has already lead to a proliferation
of similar but not quite user_ns captures spreading through the vfs.  I
didn't say don't do it this way ... all I said was let's get clear what
we are doing and lets put together a shifting infrastructure that's
clean, easy to understand and reason about in security terms and which
can be used to implement all our use cases ... including s_user_ns. 
And when we've done this, lets eject any of the ad hoc stuff we find we
don't need to make the whole thing simpler.

> > I already think that history shows us that s_user_ns went upstream
> > too fast and the fact that unprivileged fuse has yet to make it
> > (and the
> 
> We've established on the other patchset that fsid mappings in no way
> interfere nor care about s_user_ns so I'm not going to go into this
> again here. But for the record, unprivileged fuse mounts are
> supported since:

I know, but I'm taking the opposite view: not caring about the other
uses and working around them has lead to the ad hoc userns creep we see
today and I think we need to roll it back to a consistent and easy to
reason about implementation.

> commit 4ad769f3c346ec3d458e255548dec26ca5284cf6
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Tue May 29 09:04:46 2018 -0500
> 
>     fuse: Allow fully unprivileged mounts

I know the patch is there ... I just haven't found any users yet, so I
think there's still something else missing.   This is really Seth's
baby so I was hoping he'd have ideas about what.

James

Christian Brauner Feb. 19, 2020, 1:26 p.m. UTC | #7

On Tue, Feb 18, 2020 at 03:43:18PM -0800, James Bottomley wrote:
> On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote:
> > On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote:
> > > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
> [...]
> > > > But way more important: what Amir got right is that your approach
> > > > and fsid mappings don't stand in each others way at all. Shiftfed
> > > > bind-mounts can be implemented completely independent of fsid
> > > > mappings after the fact on top of it.
> > > > 
> > > > Your example, does this:
> > > > 
> > > > nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH
> > > > */
> > > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> > > > 
> > > > as the ultimate step. Essentially marking a mountpoint as shifted
> > > > relative to that user namespace. Once fsid mappings are in all
> > > > that you need to do is replace your
> > > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in
> > > > your patchset with
> > > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done.
> > > > So I honestly don't currently see any need to block the patchsets
> > > > on each other. 
> > > 
> > > Can I repeat: there's no rush to get upstream on this.  Let's pause
> > > to get the kernel implementation (the thing we have to maintain)
> > > right.  I realise we could each work around the other and get our
> > > implementations bent around each other so they all work
> > > independently thus making our disjoint user cabals happy but I
> > > don't think that would lead to the best outcome for kernel
> > > maintainability.
> > 
> > We have had the discussion with all major stakeholders in a single
> > room on what we need at LPC 2019.
> 
> Well, you didn't invite me, so I think "stakeholders" means people we

I'm confused as you were in the room with everyone. It's even on the
schedule under your name:
https://linuxplumbersconf.org/event/4/contributions/474/

> selected because we like their use case.  More importantly:
> "stakeholders" traditionally means not only people who want to consume
> the feature but also people who have to maintain it ... how many VFS
> stakeholders were present?

Again, I'm confused you were in the room with at least David and Eric.

In any case, the patches are on the relevant lists. They are actively
being discussed and visible to everyone and we'll have time for proper
review which is already happening.

Stéphane Graber Feb. 19, 2020, 4:01 p.m. UTC | #8

On Tue, Feb 18, 2020 at 6:43 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote:
> > On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote:
> > > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
> [...]
> > > > But way more important: what Amir got right is that your approach
> > > > and fsid mappings don't stand in each others way at all. Shiftfed
> > > > bind-mounts can be implemented completely independent of fsid
> > > > mappings after the fact on top of it.
> > > >
> > > > Your example, does this:
> > > >
> > > > nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH
> > > > */
> > > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> > > >
> > > > as the ultimate step. Essentially marking a mountpoint as shifted
> > > > relative to that user namespace. Once fsid mappings are in all
> > > > that you need to do is replace your
> > > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in
> > > > your patchset with
> > > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're done.
> > > > So I honestly don't currently see any need to block the patchsets
> > > > on each other.
> > >
> > > Can I repeat: there's no rush to get upstream on this.  Let's pause
> > > to get the kernel implementation (the thing we have to maintain)
> > > right.  I realise we could each work around the other and get our
> > > implementations bent around each other so they all work
> > > independently thus making our disjoint user cabals happy but I
> > > don't think that would lead to the best outcome for kernel
> > > maintainability.
> >
> > We have had the discussion with all major stakeholders in a single
> > room on what we need at LPC 2019.
>
> Well, you didn't invite me, so I think "stakeholders" means people we
> selected because we like their use case.  More importantly:
> "stakeholders" traditionally means not only people who want to consume
> the feature but also people who have to maintain it ... how many VFS
> stakeholders were present?
>
> >  We agreed on what we need and fsids are a concrete proposal for an
> > implementation that appears to solve all discussed major use-cases in
> > a simple and elegant manner, which can also be cleanly extended to
> > cover your approach later.  Imho, there is no need to have the same
> > discussion again at an invite-only event focussed on kernel
> > developers where most of the major stakeholders are unlikely to be
> > able to participate. The patch proposals are here on all relevant
> > list where everyone can participate and we can discuss them right
> > here. I have not yet heard a concrete technical reason why the patch
> > proposal is inadequate and I see no reason to stall this.
>
> You cut the actual justification I gave: tacking together ad hoc
> solutions for particular interests has already lead to a proliferation
> of similar but not quite user_ns captures spreading through the vfs.  I
> didn't say don't do it this way ... all I said was let's get clear what
> we are doing and lets put together a shifting infrastructure that's
> clean, easy to understand and reason about in security terms and which
> can be used to implement all our use cases ... including s_user_ns.
> And when we've done this, lets eject any of the ad hoc stuff we find we
> don't need to make the whole thing simpler.
>
> > > I already think that history shows us that s_user_ns went upstream
> > > too fast and the fact that unprivileged fuse has yet to make it
> > > (and the
> >
> > We've established on the other patchset that fsid mappings in no way
> > interfere nor care about s_user_ns so I'm not going to go into this
> > again here. But for the record, unprivileged fuse mounts are
> > supported since:
>
> I know, but I'm taking the opposite view: not caring about the other
> uses and working around them has lead to the ad hoc userns creep we see
> today and I think we need to roll it back to a consistent and easy to
> reason about implementation.
>
> > commit 4ad769f3c346ec3d458e255548dec26ca5284cf6
> > Author: Eric W. Biederman <ebiederm@xmission.com>
> > Date:   Tue May 29 09:04:46 2018 -0500
> >
> >     fuse: Allow fully unprivileged mounts
>
> I know the patch is there ... I just haven't found any users yet, so I
> think there's still something else missing.   This is really Seth's
> baby so I was hoping he'd have ideas about what.

I'm confused by that part, we have hundreds of thousands of users of
this feature
and it's been backported to older kernels on a number of platforms due
to its usefulness.

It's what's used for installing/running snap packages inside Ubuntu containers,
it's also quite widely used inside Terminal/Crostini on Chromebooks,
performs transparent unprivileged mounts on Travis-CI and is also used by
some of the rootless runtimes to allow for mounting overlayfs, squashfs or
even ext* inside unprivileged containers.

It's certainly not something that's super user visible, it also
doesn't need any particular
support for it to work, just use fuse as usual inside an unprivileged container.
So the lack of noise about it doesn't necessarily indicate that it's not used,
it may just indicate that it's been working as intended :)

> James
>

James Bottomley Feb. 19, 2020, 10:26 p.m. UTC | #9

On Wed, 2020-02-19 at 14:26 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 03:43:18PM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 21:03 +0100, Christian Brauner wrote:
> > > On Tue, Feb 18, 2020 at 11:05:48AM -0800, James Bottomley wrote:
> > > > On Tue, 2020-02-18 at 18:26 +0100, Christian Brauner wrote:
> > 
> > [...]
> > > > > But way more important: what Amir got right is that your
> > > > > approach and fsid mappings don't stand in each others way at
> > > > > all. Shiftfed bind-mounts can be implemented completely
> > > > > independent of fsid mappings after the fact on top of it.
> > > > > 
> > > > > Your example, does this:
> > > > > 
> > > > > nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not
> > > > > O_PATH
> > > > > */
> > > > > configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
> > > > > 
> > > > > as the ultimate step. Essentially marking a mountpoint as
> > > > > shifted relative to that user namespace. Once fsid mappings
> > > > > are in all that you need to do is replace your
> > > > > make_kuid()/from_kuid()/from_kuid_munged() calls and so on in
> > > > > your patchset with
> > > > > make_kfsuid()/from_kfsuid()/from_kfsuid_munged() and you're
> > > > > done. So I honestly don't currently see any need to block the
> > > > > patchsets on each other. 
> > > > 
> > > > Can I repeat: there's no rush to get upstream on this.  Let's
> > > > pause to get the kernel implementation (the thing we have to
> > > > maintain) right.  I realise we could each work around the other
> > > > and get our implementations bent around each other so they all
> > > > work independently thus making our disjoint user cabals happy
> > > > but I don't think that would lead to the best outcome for
> > > > kernel maintainability.
> > > 
> > > We have had the discussion with all major stakeholders in a
> > > single room on what we need at LPC 2019.
> > 
> > Well, you didn't invite me, so I think "stakeholders" means people
> > we
> 
> I'm confused as you were in the room with everyone. It's even on the
> schedule under your name:
> https://linuxplumbersconf.org/event/4/contributions/474/
> 
> > selected because we like their use case.  More importantly:
> > "stakeholders" traditionally means not only people who want to
> > consume the feature but also people who have to maintain it ... how
> > many VFS stakeholders were present?
> 
> Again, I'm confused you were in the room with at least David and
> Eric.

OK, I think we both got different things out of this, but I've now put
the videos for this on the LPC site so others can judge for themselves.
 The one for this session is here:

https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2h12m52s

Although it does make more sense if you watch the Dave Howells session
on the new mount API first:

https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h45m54s

There is lots of discussion of the shared image use case, but nowhere
is there any discussion of separating uid from fsuid in the user
namespace, unless I missed it again in my second viewing?

> In any case, the patches are on the relevant lists. They are actively
> being discussed and visible to everyone and we'll have time for
> proper review which is already happening.

The big issue, though, is that while you've produced a patch set that
covers your use case (shared unprivileged images) it doesn't cover mine
(privileged images).  However, the patch set I produced does cover both
use cases.  Safely handling privileged images is a much bigger security
problem than unprivileged ones, which is why I have more VFS code than
you do.  While I could, in theory, remove your use case from my patch,
doing so would really only reduce it by 38 lines (this is the diffstat
going from v2 to v3).  However, I still need all the code in the v2
patch to handle the backshifts needed for safe privileged images and I
pretty much don't use any of the fsid code you add because I need an
unprivileged fsid and uid, so the shifts are always identical for both
... which is the default before your patch.

This is what I mean about us getting the VFS implementation correct:  I
think I can sweep up s_user_ns into my patch (unproven because I've yet
to write the patch), effectively making it use the mnt_userns I
introduced and eliminating the superblock additions and, for 38
additional VFS lines, I also pick up your use case ... although there
may be subtleties I've missed.

I also think the way I coded it is much easier to use for an
orchestration system.  Basically you create a user_ns for the unpacker,
rendering it unprivileged, and it unpacks your images into the
filesystem so they also become unprivileged and globally shifted.  Then
whenever you use these images for a container with a separate uid
shift, you pass the unpacker userns with the "ns" argument and bind
mount the root into the container's userns: the container gets fake
root and any writes by fake root are shifted correctly into the
unprivileged image.  Everything just works and there's no need to
bother with fsid's.

However, what's still not completely done by any of us is convincing
the VFS community that we've got the correct and maintainable way of
adding all this to their VFS layer.

James

[v3,0/3] introduce a uid/gid shifting bind mount

Message

Comments