[v3,00/25] user_namespace: introduce fsid mappings

Message ID: 20200218143411.2389182-1-christian.brauner@ubuntu.com

Message

Christian Brauner Feb. 18, 2020, 2:33 p.m. UTC
Hey everyone,

This is v3. After (off- and online) discussions with Jann the following
changes were made:
- To handle nested user namespaces cleanly, efficiently, and with full
  backwards compatibility for non-fsid-mapping-aware workloads, we only
  allow writing fsid mappings as long as the corresponding id mapping
  type has not been written.
- Split the patch which adds the internal ability in
  kernel/user_namespace to verify and write fsid mappings into three
  patches:
  1. [PATCH v3 04/25] fsuidgid: add fsid mapping helpers
     patch to implement core helpers for fsid translations (i.e.
     make_kfs*id(), from_kfs*id{_munged}(), kfs*id_to_k*id(),
     k*id_to_kfs*id())
  2. [PATCH v3 05/25] user_namespace: refactor map_write()
     patch to refactor map_write() in order to prepare for actual fsid
     mappings changes in the following patch. (This should make it
     easier to review.)
  3. [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
     patch to implement actual fsid mappings support in map_write()
- Let the keyctl infrastructure only operate on kfsids, which are always
  mapped/looked up in the id mappings, similar to what we do for
  filesystems that have the same superblock visible in multiple user
  namespaces.

This version also comes with minimal tests which I intend to expand in
the future.

From pings, off-list questions, and discussions at the Google Container
Security Summit there seems to be quite a lot of interest in this
patchset, with use-cases ranging from layer sharing for app containers
and k8s to data sharing between containers with different id mappings.
I haven't Cced everyone because I don't have all the email addresses at
hand, but I've at least added Phil now. :)

This is the implementation of shiftfs which was cooked up during lunch at
Linux Plumbers 2019 the day after the containers microconference. The
idea is a design-stew from Stéphane, Aleksa, Eric, and myself (and by
now also Jann).
Back then we all were quite busy with other work and couldn't really sit
down and implement it. But I took a few days last week to do this work,
including demos and performance testing.
This implementation does not require us to touch the VFS substantially
at all. Instead, we implement shiftfs via fsid mappings.
With this patch, it took me 20 mins to port both LXD and LXC to support
shiftfs via fsid mappings.

For anyone wanting to play with this the branch can be pulled from:
https://github.com/brauner/linux/tree/fsid_mappings
https://gitlab.com/brauner/linux/-/tree/fsid_mappings
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings

The main use case of shiftfs for us is allowing shared writable storage
to be used by multiple containers with non-overlapping id mappings.
In such a scenario you want the fsids to be valid and identical in both
containers for the shared mount. A demo for this exists in [3].
If you don't want to read on, go straight to the other demos below in
[1] and [2].

People not as familiar with user namespaces might not be aware that fsid
mappings already exist. Right now, fsid mappings are always identical to
id mappings. Specifically, the kernel will look up fsuids in the uid
mappings and fsgids in the gid mappings of the relevant user namespace.

With this patch series we simply introduce the ability to create fsid
mappings that are different from the id mappings of a user namespace.
The whole feature set is placed under a config option that defaults to
false.

In the usual case of running an unprivileged container we will have set
up an id mapping, e.g. 0 100000 100000. The on-disk mapping will
correspond to this id mapping, i.e. all files which we want to appear as
0:0 inside the user namespace will be chowned to 100000:100000 on the
host. This works because whenever the kernel needs to do a filesystem
access it will look up the corresponding uid and gid in the id mapping
tables of the container.
Now think about the case where we want to have an id mapping of 0 100000
100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
share a single on-disk mapping with multiple containers that all have
different id mappings.
This will be problematic. Whenever a filesystem access is requested, the
kernel will now try to look up a mapping for 300000 in the id mapping
tables of the user namespace, but since there is none the files will
appear to be owned by the overflow id, i.e. usually 65534:65534 or
nobody:nogroup.

With fsid mappings we can solve this by writing an id mapping of 0
100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
access the kernel will now look up the mapping for 300000 in the fsid
mapping tables of the user namespace. And since such a mapping exists,
the corresponding files will have correct ownership.
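
To make this concrete, here is a tiny standalone sketch, not part of the
series, that just does the reverse-lookup arithmetic from the example
above on single-extent "inner outer count" tables: on-disk id 300000
resolves to 0 through the fsid mapping but falls back to the overflow id
through the id mapping.

#include <stdio.h>

/* One "inner outer count" extent, same shape as a line in uid_map. */
struct extent { unsigned int inner, outer, count; };

/* Reverse lookup: map an on-disk (outer) id back to an in-namespace id,
 * returning 65534 (the usual overflow id) when no extent matches. */
static unsigned int lookup(const struct extent *e, unsigned int on_disk)
{
	if (on_disk >= e->outer && on_disk - e->outer < e->count)
		return e->inner + (on_disk - e->outer);
	return 65534;
}

int main(void)
{
	struct extent id_map   = { 0, 100000, 100000 };
	struct extent fsid_map = { 0, 300000, 100000 };

	/* Files chowned to 300000 on disk: */
	printf("via id mapping:   %u\n", lookup(&id_map, 300000));   /* 65534 */
	printf("via fsid mapping: %u\n", lookup(&fsid_map, 300000)); /* 0 */
	return 0;
}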

A note on proc (and sys): the proc filesystem is special insofar as it
only has a single superblock that is (currently, but this might be about
to change) visible in all user namespaces (the same goes for sys). This
means it has special semantics in many ways, including how file
ownership and access work. The fsid mapping implementation does not
alter how proc (and sys) ownership works. proc and sys will both
continue to resolve filesystem access through the id mapping tables.

When writing fsid mappings the same rules apply as when writing id
mappings, so I won't reiterate them here. The limit on fsid mappings is
the same as for id mappings, i.e. 340 lines.
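
As a rough illustration only, here is a userspace sketch of writing such
mappings for a target process. The fsuid_map and fsgid_map proc files
are the ones added by this series, the line format is assumed to be the
same "inner outer count" format as uid_map/gid_map, and the fsid
mappings are written first because of the v3 rule above that they can
only be written while the corresponding id mapping has not been written
yet (the usual setgroups handling for unprivileged writers is left out):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one mapping file of the target process, e.g. "fsuid_map".
 * The caller needs the usual privileges over the target user namespace. */
static int write_map(pid_t pid, const char *file, const char *mapping)
{
	char path[64];
	int fd, ret;

	snprintf(path, sizeof(path), "/proc/%d/%s", (int)pid, file);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = write(fd, mapping, strlen(mapping)) < 0 ? -1 : 0;
	close(fd);
	return ret;
}

int main(int argc, char *argv[])
{
	pid_t pid = argc > 1 ? atoi(argv[1]) : getpid();

	/* fsid mappings first: in v3 they can no longer be written once the
	 * corresponding id mapping has been set. */
	if (write_map(pid, "fsuid_map", "0 300000 100000\n") ||
	    write_map(pid, "fsgid_map", "0 300000 100000\n") ||
	    write_map(pid, "uid_map",   "0 100000 100000\n") ||
	    write_map(pid, "gid_map",   "0 100000 100000\n")) {
		perror("writing mapping");
		return 1;
	}
	return 0;
}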

# Performance
Back when I extended the range of possible id mappings to 340 I did
performance testing by booting into single user mode, creating 1,000,000
files, fstat()ing them, and calculating the mean fstat() time per file.
(Back when Linux was still fast. I won't mention that the stat
 numbers have (thanks microcode!) doubled since then...)
I did the same test for this patchset: one vanilla kernel, one kernel
with my fsid mapping patches but CONFIG_USER_NS_FSID set to n, and one
with the fsid mapping patches enabled. I then ran the same test on all
three kernels and compared the numbers. The implementation does not
introduce overhead. That's all I can say. Here are the numbers:

             | vanilla v5.5 | fsid mappings       | fsid mappings      | fsid mappings      |
             |              | disabled in Kconfig | enabled in Kconfig | enabled in Kconfig |
             |              |                     | and unset for all  | and set for all    |
             |              |                     | test cases         | test cases         |
-------------|--------------|---------------------|--------------------|--------------------|
 0  mappings |       367 ns |              365 ns |             365 ns |             N/A    |
 1  mappings |       362 ns |              367 ns |             363 ns |             363 ns |
 2  mappings |       361 ns |              369 ns |             363 ns |             364 ns |
 3  mappings |       361 ns |              368 ns |             366 ns |             365 ns |
 5  mappings |       365 ns |              368 ns |             363 ns |             365 ns |
 10 mappings |       391 ns |              388 ns |             387 ns |             389 ns |
 50 mappings |       395 ns |              398 ns |             401 ns |             397 ns |
100 mappings |       400 ns |              405 ns |             399 ns |             399 ns |
200 mappings |       404 ns |              407 ns |             430 ns |             404 ns |
300 mappings |       492 ns |              494 ns |             432 ns |             413 ns |
340 mappings |       495 ns |              497 ns |             500 ns |             484 ns |
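
For reference, a rough sketch of the timing loop described above; it
approximates the methodology (create the files, fstat() each one, report
the mean per-call time) rather than reproducing the exact harness:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define NFILES 1000000

int main(void)
{
	struct timespec a, b;
	long long total_ns = 0;
	struct stat st;
	char name[32];
	int i, fd;

	/* Create NFILES files in the current directory and time a single
	 * fstat() on each one. */
	for (i = 0; i < NFILES; i++) {
		snprintf(name, sizeof(name), "f%d", i);
		fd = open(name, O_CREAT | O_RDONLY, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		clock_gettime(CLOCK_MONOTONIC, &a);
		fstat(fd, &st);
		clock_gettime(CLOCK_MONOTONIC, &b);
		total_ns += (b.tv_sec - a.tv_sec) * 1000000000LL +
			    (b.tv_nsec - a.tv_nsec);
		close(fd);
	}

	printf("mean fstat(): %lld ns\n", total_ns / NFILES);
	return 0;
}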

# Demos
[1]: Create a container with different id and fsid mappings.
     https://asciinema.org/a/300233 
[2]: Create a container with id mappings but without fsid mappings.
     https://asciinema.org/a/300234
[3]: Share storage between multiple containers with non-overlapping id
     mappings.
     https://asciinema.org/a/300235

Thanks!
Christian

Christian Brauner (25):
  user_namespace: introduce fsid mappings infrastructure
  proc: add /proc/<pid>/fsuid_map
  proc: add /proc/<pid>/fsgid_map
  fsuidgid: add fsid mapping helpers
  user_namespace: refactor map_write()
  user_namespace: make map_write() support fsid mappings
  proc: task_state(): use from_kfs{g,u}id_munged
  cred: add kfs{g,u}id
  fs: add is_userns_visible() helper
  namei: may_{o_}create(): handle fsid mappings
  inode: inode_owner_or_capable(): handle fsid mappings
  capability: privileged_wrt_inode_uidgid(): handle fsid mappings
  stat: handle fsid mappings
  open: handle fsid mappings
  posix_acl: handle fsid mappings
  attr: notify_change(): handle fsid mappings
  commoncap: cap_bprm_set_creds(): handle fsid mappings
  commoncap: cap_task_fix_setuid(): handle fsid mappings
  commoncap: handle fsid mappings with vfs caps
  exec: bprm_fill_uid(): handle fsid mappings
  ptrace: adapt ptrace_may_access() to always use unmapped fsids
  devpts: handle fsid mappings
  keys: handle fsid mappings
  sys: handle fsid mappings in set*id() calls
  selftests: add simple fsid mapping selftests

 fs/attr.c                                     |  23 +-
 fs/devpts/inode.c                             |   7 +-
 fs/exec.c                                     |  25 +-
 fs/inode.c                                    |   7 +-
 fs/namei.c                                    |  36 +-
 fs/open.c                                     |  16 +-
 fs/posix_acl.c                                |  17 +-
 fs/proc/array.c                               |   5 +-
 fs/proc/base.c                                |  34 ++
 fs/stat.c                                     |  48 +-
 include/linux/cred.h                          |   4 +
 include/linux/fs.h                            |   5 +
 include/linux/fsuidgid.h                      | 122 +++++
 include/linux/stat.h                          |   1 +
 include/linux/user_namespace.h                |  10 +
 init/Kconfig                                  |  11 +
 kernel/capability.c                           |  10 +-
 kernel/ptrace.c                               |   4 +-
 kernel/sys.c                                  | 106 +++-
 kernel/user.c                                 |  22 +
 kernel/user_namespace.c                       | 517 ++++++++++++++++--
 security/commoncap.c                          |  35 +-
 security/keys/key.c                           |   2 +-
 security/keys/permission.c                    |   4 +-
 security/keys/process_keys.c                  |   6 +-
 security/keys/request_key.c                   |  10 +-
 security/keys/request_key_auth.c              |   2 +-
 tools/testing/selftests/Makefile              |   1 +
 .../testing/selftests/user_namespace/Makefile |  11 +
 .../selftests/user_namespace/test_fsid_map.c  | 511 +++++++++++++++++
 30 files changed, 1461 insertions(+), 151 deletions(-)
 create mode 100644 include/linux/fsuidgid.h
 create mode 100644 tools/testing/selftests/user_namespace/Makefile
 create mode 100644 tools/testing/selftests/user_namespace/test_fsid_map.c


base-commit: bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9

Comments

James Bottomley Feb. 18, 2020, 11:50 p.m. UTC | #1
On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
> In the usual case of running an unprivileged container we will have
> setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> correspond to this id mapping, i.e. all files which we want to appear
> as 0:0 inside the user namespace will be chowned to 100000:100000 on
> the host. This works, because whenever the kernel needs to do a
> filesystem access it will lookup the corresponding uid and gid in the
> idmapping tables of the container. Now think about the case where we
> want to have an id mapping of 0 100000 100000 but an on-disk mapping
> of 0 300000 100000 which is needed to e.g. share a single on-disk
> mapping with multiple containers that all have different id mappings.
> This will be problematic. Whenever a filesystem access is requested,
> the kernel will now try to lookup a mapping for 300000 in the id
> mapping tables of the user namespace but since there is none the
> files will appear to be owned by the overflow id, i.e. usually
> 65534:65534 or nobody:nogroup.
> 
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now lookup the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping
> exists, the corresponding files will have correct ownership.

So I did compile this up in order to run the shiftfs tests over it to
see how it coped with the various corner cases.  However, what I find
is it simply fails the fsid reverse mapping in the setup.  Trying to
use a simple uid mapping of 0 100000 1000 and an fsid mapping of
100000 0 1000 fails the entry setuid(0) call because of this code:

long __sys_setuid(uid_t uid)
{
	struct user_namespace *ns = current_user_ns();
	const struct cred *old;
	struct cred *new;
	int retval;
	kuid_t kuid;
	kuid_t kfsuid;

	kuid = make_kuid(ns, uid);
	if (!uid_valid(kuid))
		return -EINVAL;

	kfsuid = make_kfsuid(ns, uid);
	if (!uid_valid(kfsuid))
		return -EINVAL;

which means you can't have an fsid mapping that doesn't have the same
domain as the uid mapping, meaning a reverse mapping isn't possible
because the range and domain have to be inverse and disjoint.

James
Christian Brauner Feb. 19, 2020, 12:27 p.m. UTC | #2
On Tue, Feb 18, 2020 at 03:50:56PM -0800, James Bottomley wrote:
> On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
> > In the usual case of running an unprivileged container we will have
> > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> > correspond to this id mapping, i.e. all files which we want to appear
> > as 0:0 inside the user namespace will be chowned to 100000:100000 on
> > the host. This works, because whenever the kernel needs to do a
> > filesystem access it will lookup the corresponding uid and gid in the
> > idmapping tables of the container. Now think about the case where we
> > want to have an id mapping of 0 100000 100000 but an on-disk mapping
> > of 0 300000 100000 which is needed to e.g. share a single on-disk
> > mapping with multiple containers that all have different id mappings.
> > This will be problematic. Whenever a filesystem access is requested,
> > the kernel will now try to lookup a mapping for 300000 in the id
> > mapping tables of the user namespace but since there is none the
> > files will appear to be owned by the overflow id, i.e. usually
> > 65534:65534 or nobody:nogroup.
> > 
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping
> > exists, the corresponding files will have correct ownership.
> 
> So I did compile this up in order to run the shiftfs tests over it to
> see how it coped with the various corner cases.  However, what I find
> is it simply fails the fsid reverse mapping in the setup.  Trying to
> use a simple uid of 0 100000 1000 and a fsid of 100000 0 1000 fails the
> entry setuid(0) call because of this code:

This is easy to fix. But what's the exact use-case?
Jann Horn Feb. 19, 2020, 3:33 p.m. UTC | #3
On Tue, Feb 18, 2020 at 3:35 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
[...]
> - Let the keyctl infrastructure only operate on kfsid which are always
>   mapped/looked up in the id mappings similar to what we do for
>   filesystems that have the same superblock visible in multiple user
>   namespaces.
>
> This version also comes with minimal tests which I intend to expand in
> the future.
>
> From pings and off-list questions and discussions at Google Container
> Security Summit there seems to be quite a lot of interest in this
> patchset with use-cases ranging from layer sharing for app containers
> and k8s, as well as data sharing between containers with different id
> mappings. I haven't Cced all people because I don't have all the email
> adresses at hand but I've at least added Phil now. :)
>
> This is the implementation of shiftfs which was cooked up during lunch at
> Linux Plumbers 2019 the day after the container's microconference. The
> idea is a design-stew from Stéphane, Aleksa, Eric, and myself (and by
> now also Jann.
> Back then we all were quite busy with other work and couldn't really sit
> down and implement it. But I took a few days last week to do this work,
> including demos and performance testing.
> This implementation does not require us to touch the VFS substantially
> at all. Instead, we implement shiftfs via fsid mappings.
> With this patch, it took me 20 mins to port both LXD and LXC to support
> shiftfs via fsid mappings.
[...]

Can you please grep through the kernel for all uses of ->fsuid and
->fsgid and fix them up appropriately? Some cases I still see:


The SafeSetID LSM wants to enforce that you can only use CAP_SETUID to
gain the privileges of a specific set of IDs:

static int safesetid_task_fix_setuid(struct cred *new,
                                     const struct cred *old,
                                     int flags)
{

        /* Do nothing if there are no setuid restrictions for our old RUID. */
        if (setuid_policy_lookup(old->uid, INVALID_UID) == SIDPOL_DEFAULT)
                return 0;

        if (uid_permitted_for_cred(old, new->uid) &&
            uid_permitted_for_cred(old, new->euid) &&
            uid_permitted_for_cred(old, new->suid) &&
            uid_permitted_for_cred(old, new->fsuid))
                return 0;

        /*
         * Kill this process to avoid potential security vulnerabilities
         * that could arise from a missing whitelist entry preventing a
         * privileged process from dropping to a lesser-privileged one.
         */
        force_sig(SIGKILL);
        return -EACCES;
}

This could theoretically be bypassed through setfsuid() if the kuid
based on the fsuid mappings is permitted but the kuid based on the
normal mappings is not.


fs/coredump.c in suid dump mode uses "cred->fsuid = GLOBAL_ROOT_UID";
this should probably also fix up the other uid, even if there is no
scenario in which it would actually be used at the moment?


The netfilter xt_owner stuff makes packet filtering decisions based on
the ->fsuid; it might be better to filter on the ->kfsuid so that you
can filter traffic from different user namespaces differently?


audit_log_task_info() is doing "from_kuid(&init_user_ns, cred->fsuid)".
James Bottomley Feb. 19, 2020, 3:36 p.m. UTC | #4
On Wed, 2020-02-19 at 13:27 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 03:50:56PM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
[...]
> > > With fsid mappings we can solve this by writing an id mapping of
> > > 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > filesystem access the kernel will now lookup the mapping for
> > > 300000 in the fsid mapping tables of the user namespace. And
> > > since such a mapping exists, the corresponding files will have
> > > correct ownership.
> > 
> > So I did compile this up in order to run the shiftfs tests over it
> > to see how it coped with the various corner cases.  However, what I
> > find is it simply fails the fsid reverse mapping in the
> > setup.  Trying to use a simple uid of 0 100000 1000 and a fsid of
> > 100000 0 1000 fails the entry setuid(0) call because of this code:
> 
> This is easy to fix. But what's the exact use-case?

Well, the use case I'm looking to solve is the same one it's always
been: getting a deprivileged fake root in a user_ns to be able to write
an image at fsuid 0.

I don't think it's solvable in your current framework, although
allowing the domain to be disjoint might possibly hack around it.  The
problem with the proposed framework is that there are no backshifts
from the filesystem view, there are only forward shifts to the
filesystem view.  This means that to get your framework to write a
filesystem at fsuid 0 you have to have an identity map for fsuid. Which
I can do: I tested uid shift 0 100000 1000 and fsuid shift 0 0 1000. 
It does all work, as you'd expect, because the container has real fs
root, not a fake root.  And that's the whole problem: firstly, I'm fs
root for any filesystem my userns can see, so any imprecision in
setting up the mount namespace of the container and I own your host;
and secondly, any containment break and I'm privileged with respect to
the fs uid wherever I escape to, so I will likewise own your host.

The only way to keep containment is to have a zero fsuid inside the
container corresponding to a non-zero one outside.  And the only way to
solve the imprecision in mount namespace issue is to strictly control
the entry point at which the writing at fsuid 0 becomes active.

James
Serge E. Hallyn Feb. 19, 2020, 7:35 p.m. UTC | #5
On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now lookup the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping exists,
> the corresponding files will have correct ownership.

So if I have

/proc/self/uid_map: 0 100000 100000
/proc/self/fsid_map: 1000 1000 1

1. If I read files from the rootfs which have host uid 101000, they
will appear as uid 100 to me?

2. If I read host files with uid 1000, they will appear as uid 1000 to me?

3. If I create a new file, as uid 1000, what will be the inode owning uid?
Serge E. Hallyn Feb. 19, 2020, 9:48 p.m. UTC | #6
On Wed, Feb 19, 2020 at 01:35:58PM -0600, Serge E. Hallyn wrote:
> On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping exists,
> > the corresponding files will have correct ownership.
> 
> So if I have
> 
> /proc/self/uid_map: 0 100000 100000
> /proc/self/fsid_map: 1000 1000 1

Oh, sorry.  Your explanation in 20/25 I think set me straight, though I need
to think through a few more examples.

...

> 3. If I create a new file, as nsuid 1000, what will be the inode owning kuid?

(Note - I edited the quoted txt above to be more precise)

I'm still not quite clear on this.  I believe the fsid mapping will take
precedence, so it'll be uid 1000?  Per-mount behavior would be nice there,
but perhaps unwieldy.
Tycho Andersen Feb. 19, 2020, 9:56 p.m. UTC | #7
On Wed, Feb 19, 2020 at 03:48:37PM -0600, Serge E. Hallyn wrote:
> On Wed, Feb 19, 2020 at 01:35:58PM -0600, Serge E. Hallyn wrote:
> > On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> > > With fsid mappings we can solve this by writing an id mapping of 0
> > > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > > access the kernel will now lookup the mapping for 300000 in the fsid
> > > mapping tables of the user namespace. And since such a mapping exists,
> > > the corresponding files will have correct ownership.
> > 
> > So if I have
> > 
> > /proc/self/uid_map: 0 100000 100000
> > /proc/self/fsid_map: 1000 1000 1
> 
> Oh, sorry.  Your explanation in 20/25 i think set me straight, though I need
> to think through a few more examples.
> 
> ...
> 
> > 3. If I create a new file, as nsuid 1000, what will be the inode owning kuid?
> 
> (Note - I edited the quoted txt above to be more precise)
> 
> I'm still not quite clear on this.  I believe the fsid mapping will take
> precedence so it'll be uid 1000 ?  Per mount behavior would be nice there,
> but perhaps unwieldy.

The is_userns_visible() bits seem to be an attempt at understanding
what people would want per-mount, with a policy hard-coded in the
kernel.

But maybe per-mount behavior can be solved more elegantly with shifted
bind mounts, so we can drop all that from this series, and ignore
per-mount settings here?

Tycho
Josef Bacik Feb. 27, 2020, 7:33 p.m. UTC | #8
On 2/18/20 9:33 AM, Christian Brauner wrote:
> Hey everyone,
> 
> This is v3 after (off- and online) discussions with Jann the following
> changes were made:
> - To handle nested user namespaces cleanly, efficiently, and with full
>    backwards compatibility for non fsid-mapping aware workloads we only
>    allow writing fsid mappings as long as the corresponding id mapping
>    type has not been written.
> - Split the patch which adds the internal ability in
>    kernel/user_namespace to verify and write fsid mappings into tree
>    patches:
>    1. [PATCH v3 04/25] fsuidgid: add fsid mapping helpers
>       patch to implement core helpers for fsid translations (i.e.
>       make_kfs*id(), from_kfs*id{_munged}(), kfs*id_to_k*id(),
>       k*id_to_kfs*id()
>    2. [PATCH v3 05/25] user_namespace: refactor map_write()
>       patch to refactor map_write() in order to prepare for actual fsid
>       mappings changes in the following patch. (This should make it
>       easier to review.)
>    3. [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
>       patch to implement actual fsid mappings support in mape_write()
> - Let the keyctl infrastructure only operate on kfsid which are always
>    mapped/looked up in the id mappings similar to what we do for
>    filesystems that have the same superblock visible in multiple user
>    namespaces.
> 
> This version also comes with minimal tests which I intend to expand in
> the future.
> 
>  From pings and off-list questions and discussions at Google Container
> Security Summit there seems to be quite a lot of interest in this
> patchset with use-cases ranging from layer sharing for app containers
> and k8s, as well as data sharing between containers with different id
> mappings. I haven't Cced all people because I don't have all the email
> adresses at hand but I've at least added Phil now. :)
> 
I put this into a kernel for our container guys to mess with in order to 
validate it would actually be useful for real world uses.  I've cc'ed the guy 
who did all of the work in case you have specific questions.

Good news is the interface is acceptable, albeit apparently the whole user ns 
interface sucks in general.  But you haven't made it worse, so success!

But in testing it there appears to be a problem with tmpfs?  Our applications
will use shared memory segments for certain things and it apparently breaks this
in interesting ways; it appears to not shift the UID appropriately on tmpfs.
This seems to be relatively straightforward to reproduce, but if you have 
trouble let me know and I'll come up with a shell script that reproduces the 
problem.

We are happy to continue testing these patches to make sure they're working in 
our container setup; if you want to CC me on future submissions I can build them
for our internal testing and validate them as well.  Thanks,

Josef
Serge E. Hallyn March 2, 2020, 2:34 p.m. UTC | #9
On Thu, Feb 27, 2020 at 02:33:04PM -0500, Josef Bacik wrote:
> On 2/18/20 9:33 AM, Christian Brauner wrote:
> > Hey everyone,
> > 
> > This is v3 after (off- and online) discussions with Jann the following
> > changes were made:
> > - To handle nested user namespaces cleanly, efficiently, and with full
> >    backwards compatibility for non fsid-mapping aware workloads we only
> >    allow writing fsid mappings as long as the corresponding id mapping
> >    type has not been written.
> > - Split the patch which adds the internal ability in
> >    kernel/user_namespace to verify and write fsid mappings into tree
> >    patches:
> >    1. [PATCH v3 04/25] fsuidgid: add fsid mapping helpers
> >       patch to implement core helpers for fsid translations (i.e.
> >       make_kfs*id(), from_kfs*id{_munged}(), kfs*id_to_k*id(),
> >       k*id_to_kfs*id()
> >    2. [PATCH v3 05/25] user_namespace: refactor map_write()
> >       patch to refactor map_write() in order to prepare for actual fsid
> >       mappings changes in the following patch. (This should make it
> >       easier to review.)
> >    3. [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
> >       patch to implement actual fsid mappings support in mape_write()
> > - Let the keyctl infrastructure only operate on kfsid which are always
> >    mapped/looked up in the id mappings similar to what we do for
> >    filesystems that have the same superblock visible in multiple user
> >    namespaces.
> > 
> > This version also comes with minimal tests which I intend to expand in
> > the future.
> > 
> >  From pings and off-list questions and discussions at Google Container
> > Security Summit there seems to be quite a lot of interest in this
> > patchset with use-cases ranging from layer sharing for app containers
> > and k8s, as well as data sharing between containers with different id
> > mappings. I haven't Cced all people because I don't have all the email
> > adresses at hand but I've at least added Phil now. :)
> > 
> I put this into a kernel for our container guys to mess with in order to
> validate it would actually be useful for real world uses.  I've cc'ed the
> guy who did all of the work in case you have specific questions.
> 
> Good news is the interface is acceptable, albeit apparently the whole user
> ns interface sucks in general.  But you haven't made it worse, so success!

Well I very much disagree here :)  With the first part!  But I do
understand the shortcomings.  Anyway,

I still hope we get to talk about this in person, but IMO this is the
right approach (this being: thinking about how to make the uid mappings
more flexible without making them too complicated to be safe to use),
but a bit too static in terms of target.  There are at least two ways
that I could see usefully generalizing it:

From a user space pov, the following goal is indispensable (for my use
cases):  that the fsuid be selectable based on fs, mountpoint, or file
context (as in selinux).

From a userns pov, one way to look at it is this:  when task t1 signals
task t2, it's not only t1's namespace that's considered when filling in
the sender uid, but also t2's.  Likewise, when writing a file, we should
consider both t1's fsuid+userns, and the file's, mount's, or filesystem's
userns.

From that POV, your patch is a step in the right direction and could be
taken as is (modulo any tmpfs fix Josef needs :)  From there I would
propose adding a 'userns=<uidnsfd>' bind mount option, so we could create
an empty userns with the desired mapping (subject to permissions granted
by subuids), get an fd to the uidns, and say

	mount --bind -o uidns=5 /shared /containers/c1/mnt/shared

So now when I write a file /etc/hosts as container fsuid 0, it'll be
subject to the container rootfs mount's uid mapping, presumably
100000.  When I write /mnt/shared/hello, it'll be subject to the mount's
uid mapping, which might be 1000.

-serge