Message ID | 155024685321.21651.1504201877881622756.stgit@warthog.procyon.org.uk (mailing list archive) |
---|---|
State | New, archived |
Series | Containers and using authenticated filesystems |
Hi David,

On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> Implement a kernel container object such that it contains the following
> things:
>
> (1) Namespaces.
>
> (2) A root directory.
>
> (3) A set of processes, including one designated as the 'init' process.
>
> A container is created and attached to a file descriptor by:
>
>     int cfd = container_create(const char *name, unsigned int flags);
>
> this inherits all the namespaces of the parent container unless otherwise
> the mask calls for new namespaces.
>
>     CONTAINER_NEW_FS_NS
>     CONTAINER_NEW_EMPTY_FS_NS
>     CONTAINER_NEW_CGROUP_NS [root only]
>     CONTAINER_NEW_UTS_NS
>     CONTAINER_NEW_IPC_NS
>     CONTAINER_NEW_USER_NS
>     CONTAINER_NEW_PID_NS
>     CONTAINER_NEW_NET_NS
>
> Other flags include:
>
>     CONTAINER_KILL_ON_CLOSE
>     CONTAINER_CLOSE_ON_EXEC
>
> Note that I've added a pointer to the current container to task_struct.
> This doesn't make the nsproxy pointer redundant as you can still make new
> namespaces with clone().
>
> I've also added a list_head to task_struct to form a list in the container
> of its member processes.  This is convenient, but redundant since the code
> could iterate over all the tasks looking for ones that have a matching
> task->container.
>
> It might make sense to use fsconfig() to configure the container:
>
>     fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "user", NULL, userns_fd);
>     fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "mnt", NULL, mntns_fd);
>     fsconfig(cfd, FSCONFIG_SET_FD, "rootfs", NULL, root_fd);
>     fsconfig(cfd, FSCONFIG_CMD_CREATE_CONTAINER, NULL, NULL, 0);
>
>
> ==================
> FUTURE DEVELOPMENT
> ==================
>
> (1) Setting up the container.
>
>     A container would be created with, say:
>
>         int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);
>
>     Once created, it should then be possible for the supervising process
>     to modify the new container.
>     Mounts can be created inside of the container's namespaces:
>
>         fsfd = fsopen("ext4", 0);
>         fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
>         fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
>         fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
>         fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
>         mfd = fsmount(fsfd, 0, 0);
>
>     and then mounted into the namespace:
>
>         move_mount(mfd, "", cfd, "/",
>                    MOVE_MOUNT_F_EMPTY_PATH |
>                    MOVE_MOUNT_T_CONTAINER_ROOT);
>
>     Further mounts can be added by:
>
>         move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);
>
>     Files and devices can be created by supplying the container fd as the
>     dirfd argument:
>
>         mkdirat(int cfd, const char *path, mode_t mode);
>         mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
>         int fd = openat(int cfd, const char *path,
>                         unsigned int flags, mode_t mode);
>
>     [*] Note that when using cfd as dirfd, the path must not contain a '/'
>         at the front.
>
>     Sockets, such as netlink, can be opened inside of the container's
>     namespaces:
>
>         int fd = container_socket(int cfd, int domain, int type,
>                                   int protocol);
>
>     This should allow management of the container's network namespace from
>     outside.
>
> (2) Starting the container.
>
>     Once all modifications are complete, the container's 'init' process
>     can be started by:
>
>         fork_into_container(int cfd);
>
>     This precludes further external modification of the mount tree within
>     the container.  Before this point, the container is simply destroyed
>     if the container fd is closed.
>
> (3) Waiting for the container to complete.
>
>     The container fd can then be polled to wait for init process therein
>     to complete and the exit code collected by:
>
>         container_wait(int container_fd, int *_wstatus, unsigned int wait,
>                        struct rusage *rusage);
>
>     The container and everything in it can be terminated or killed off:
>
>         container_kill(int container_fd, int initonly, int signal);
>
>     If 'init' dies, all other processes in the container are preemptively
>     SIGKILL'd by the kernel.
>
>     By default, if the container is active and its fd is closed, the
>     container is left running and will be cleaned up when its 'init' exits.
>     The default can be changed with the CONTAINER_KILL_ON_CLOSE flag.
>
> (4) Supervising the container.
>
>     Given that we have an fd attached to the container, we could make it
>     such that the supervising process could monitor and override EPERM
>     returns for mount and other privileged operations within the
>     container.
>
> (5) Per-container keyring.
>
>     Each container can point to a per-container keyring for the holding of
>     integrity keys and filesystem keys for use inside the container.  This
>     would be attached:
>
>         keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring)
>
>     This keyring would be searched by request_key() after it has searched
>     the thread, process and session keyrings.
>
> (6) Running different LSM policies by container.  This might particularly
>     make sense with something like Apparmor where different path-based
>     rules might be required inside a container to inside the parent.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---

Do we really need a new system call to set up containers? That would
force changes to all existing orchestration software. Given that the
main thing we want to achieve is to direct messages from the kernel to
an appropriate handler, why not focus on adding functionality to do just
that?

Is there any reason why a syscall that lets an appropriately privileged
process add a keyring-specific message queue to its own user_namespace,
and obtain a file descriptor for that message queue, might not work?
That forces the container to use a daemon if it cares to intercept
keyring traffic, rather than worrying about the kernel running
request_key (in fact, a trivial implementation of the daemon could just
read the messages, parse them and run request_key). With such an
implementation, the fallback mechanism could be to walk back up the
hierarchy of user_namespaces until a message queue is found, and to
invoke the existing request_key mechanism if none is found.
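
The fallback walk suggested above can be sketched in userspace C. This
is only a model of the lookup, under stated assumptions: `struct
user_ns_model`, its `keyring_mq_fd` field, `struct key_route` and
`route_key_request()` are hypothetical names invented for illustration.
None of them is an existing kernel structure or API, and the proposed
syscall that would register the queue is not shown.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a user namespace. This is NOT the kernel's
 * struct user_namespace; it keeps only what the lookup needs.
 */
struct user_ns_model {
	const struct user_ns_model *parent; /* NULL for the initial namespace */
	int keyring_mq_fd;                  /* assumed field: -1 if no queue is registered */
};

/* Where a key request ends up being handled. */
enum key_dispatch {
	DISPATCH_TO_QUEUE,       /* a daemon reads it from a message queue */
	DISPATCH_TO_REQUEST_KEY  /* no queue found: use the existing upcall */
};

struct key_route {
	enum key_dispatch how;
	const struct user_ns_model *handler_ns; /* namespace owning the queue, or NULL */
	int mq_fd;                              /* queue descriptor, or -1 */
};

/*
 * Walk from the requesting namespace up through its ancestors and pick
 * the nearest registered message queue. If the walk reaches the top
 * without finding one, fall back to the existing request_key path.
 */
struct key_route route_key_request(const struct user_ns_model *ns)
{
	struct key_route route = { DISPATCH_TO_REQUEST_KEY, NULL, -1 };

	assert(ns != NULL); /* a requester always lives in some namespace */

	for (; ns != NULL; ns = ns->parent) {
		if (ns->keyring_mq_fd >= 0) {
			route.how = DISPATCH_TO_QUEUE;
			route.handler_ns = ns;
			route.mq_fd = ns->keyring_mq_fd;
			break;
		}
	}
	return route;
}
```

In this model the nearest ancestor with a queue wins, so a nested
container that registers nothing is served by whatever its parent set
up, and a system with no queues at all behaves as it does today.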
Added containers and cgroups list, which somehow got lost since they
might have a slight interest in a complete rewrite of the container API.

On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> Implement a kernel container object such that it contains the
> following things:
>
> (1) Namespaces.
>
> (2) A root directory.

Doesn't this conflict with how the mount namespace works today? It
contains the notion of unescapable root and we shouldn't have two of
those in different locations.

> (3) A set of processes, including one designated as the 'init'
>     process.

This is a violation of a fundamental tenet: I can create a "container"
as simply a set of unoccupied namespaces and bind them into the
filesystem with a mount. This mechanism is what I use for architectural
emulation containers and how network namespaces currently work. For all
of these cases, the container is empty of processes when it is created
and is selectively filled and emptied of processes as you use it. If I
create a container without a PID namespace, I definitely wouldn't want
the notion of an "init" process because I'm deliberately avoiding that.

> A container is created and attached to a file descriptor by:
>
>     int cfd = container_create(const char *name, unsigned int flags);

I thought we got agreement years ago that containers don't exist in
Linux as a single entity: they're currently a collection of cgroups and
namespaces some of which may and some of which may not be local to the
entity the orchestration system thinks of as a "container".

> this inherits all the namespaces of the parent container unless
> otherwise the mask calls for new namespaces.
>
>     CONTAINER_NEW_FS_NS
>     CONTAINER_NEW_EMPTY_FS_NS
>     CONTAINER_NEW_CGROUP_NS [root only]
>     CONTAINER_NEW_UTS_NS
>     CONTAINER_NEW_IPC_NS
>     CONTAINER_NEW_USER_NS
>     CONTAINER_NEW_PID_NS
>     CONTAINER_NEW_NET_NS
>
> Other flags include:
>
>     CONTAINER_KILL_ON_CLOSE
>     CONTAINER_CLOSE_ON_EXEC
>
> Note that I've added a pointer to the current container to
> task_struct.  This doesn't make the nsproxy pointer redundant as you
> can still make new namespaces with clone().
>
> I've also added a list_head to task_struct to form a list in the
> container of its member processes.  This is convenient, but redundant
> since the code could iterate over all the tasks looking for ones that
> have a matching task->container.
>
> It might make sense to use fsconfig() to configure the container:
>
>     fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "user", NULL, userns_fd);
>     fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "mnt", NULL, mntns_fd);
>     fsconfig(cfd, FSCONFIG_SET_FD, "rootfs", NULL, root_fd);
>     fsconfig(cfd, FSCONFIG_CMD_CREATE_CONTAINER, NULL, NULL, 0);

You're trying to introduce a new set of container APIs that don't quite
align with how containers work today. If I look at the justification
below the whole thing seems to require the notion of a container as an
atomic entity with an exclusive process list. You can argue that's how
you want it to work, but it looks like this notion would have difficulty
working with the standard Kubernetes pod/container notion, let alone all
of the other esoteric ways we use containers today.

James

>
> ==================
> FUTURE DEVELOPMENT
> ==================
>
> (1) Setting up the container.
>
>     A container would be created with, say:
>
>         int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);
>
>     Once created, it should then be possible for the supervising process
>     to modify the new container.
>     Mounts can be created inside of the container's namespaces:
>
>         fsfd = fsopen("ext4", 0);
>         fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
>         fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
>         fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
>         fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
>         mfd = fsmount(fsfd, 0, 0);
>
>     and then mounted into the namespace:
>
>         move_mount(mfd, "", cfd, "/",
>                    MOVE_MOUNT_F_EMPTY_PATH |
>                    MOVE_MOUNT_T_CONTAINER_ROOT);
>
>     Further mounts can be added by:
>
>         move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);
>
>     Files and devices can be created by supplying the container fd as the
>     dirfd argument:
>
>         mkdirat(int cfd, const char *path, mode_t mode);
>         mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
>         int fd = openat(int cfd, const char *path,
>                         unsigned int flags, mode_t mode);
>
>     [*] Note that when using cfd as dirfd, the path must not contain a '/'
>         at the front.
>
>     Sockets, such as netlink, can be opened inside of the container's
>     namespaces:
>
>         int fd = container_socket(int cfd, int domain, int type,
>                                   int protocol);
>
>     This should allow management of the container's network namespace from
>     outside.
>
> (2) Starting the container.
>
>     Once all modifications are complete, the container's 'init' process
>     can be started by:
>
>         fork_into_container(int cfd);
>
>     This precludes further external modification of the mount tree within
>     the container.  Before this point, the container is simply destroyed
>     if the container fd is closed.
>
> (3) Waiting for the container to complete.
>
>     The container fd can then be polled to wait for init process therein
>     to complete and the exit code collected by:
>
>         container_wait(int container_fd, int *_wstatus, unsigned int wait,
>                        struct rusage *rusage);
>
>     The container and everything in it can be terminated or killed off:
>
>         container_kill(int container_fd, int initonly, int signal);
>
>     If 'init' dies, all other processes in the container are preemptively
>     SIGKILL'd by the kernel.
>
>     By default, if the container is active and its fd is closed, the
>     container is left running and will be cleaned up when its 'init' exits.
>     The default can be changed with the CONTAINER_KILL_ON_CLOSE flag.
>
> (4) Supervising the container.
>
>     Given that we have an fd attached to the container, we could make it
>     such that the supervising process could monitor and override EPERM
>     returns for mount and other privileged operations within the
>     container.
>
> (5) Per-container keyring.
>
>     Each container can point to a per-container keyring for the holding of
>     integrity keys and filesystem keys for use inside the container.  This
>     would be attached:
>
>         keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring)
>
>     This keyring would be searched by request_key() after it has searched
>     the thread, process and session keyrings.
>
> (6) Running different LSM policies by container.  This might particularly
>     make sense with something like Apparmor where different path-based
>     rules might be required inside a container to inside the parent.
> > Signed-off-by: David Howells <dhowells@redhat.com> > --- > > arch/x86/entry/syscalls/syscall_32.tbl | 1 > arch/x86/entry/syscalls/syscall_64.tbl | 1 > fs/namespace.c | 5 > include/linux/container.h | 86 ++++++++ > include/linux/init_task.h | 1 > include/linux/lsm_hooks.h | 20 ++ > include/linux/sched.h | 3 > include/linux/security.h | 15 + > include/linux/syscalls.h | 3 > include/uapi/linux/container.h | 28 +++ > init/Kconfig | 7 + > init/init_task.c | 3 > kernel/Makefile | 2 > kernel/container.c | 348 > ++++++++++++++++++++++++++++++++ > kernel/exit.c | 1 > kernel/fork.c | 7 + > kernel/namespaces.h | 15 + > kernel/nsproxy.c | 23 +- > kernel/sys_ni.c | 3 > security/security.c | 12 + > 20 files changed, 571 insertions(+), 13 deletions(-) > create mode 100644 include/linux/container.h > create mode 100644 include/uapi/linux/container.h > create mode 100644 kernel/container.c > create mode 100644 kernel/namespaces.h > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl > b/arch/x86/entry/syscalls/syscall_32.tbl > index c9db9d51a7df..3564814a5d21 100644 > --- a/arch/x86/entry/syscalls/syscall_32.tbl > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > @@ -407,3 +407,4 @@ > 393 i386 fsinfo sys_fsinfo > __ia32_sys_fsinfo > 394 i386 mount_notify sys_mount_notify > __ia32_sys_mount_notify > 395 i386 sb_notify sys_sb_notify > __ia32_sys_sb_notify > +396 i386 container_create sys_container_create > __ia32_sys_container_create > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > b/arch/x86/entry/syscalls/syscall_64.tbl > index 17869bf7788a..aa6cccbe5271 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -352,6 +352,7 @@ > 341 common fsinfo __x64_sys_fsi > nfo > 342 common mount_notify __x64_sys_mount > _notify > 343 common sb_notify __x64_sys_sb_notif > y > +344 common container_create __x64_sys_container > _create > > # > # x32-specific system call numbers start at 512 to avoid cache > impact > diff --git 
a/fs/namespace.c b/fs/namespace.c > index f378cfc63043..ea005f55ec4c 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -30,6 +30,7 @@ > #include <uapi/linux/mount.h> > #include <linux/fs_context.h> > #include <linux/fsinfo.h> > +#include <linux/container.h> > > #include "pnode.h" > #include "internal.h" > @@ -3742,6 +3743,10 @@ static void __init init_mount_tree(void) > > set_fs_pwd(current->fs, &root); > set_fs_root(current->fs, &root); > +#ifdef CONFIG_CONTAINERS > + path_get(&root); > + init_container.root = root; > +#endif > } > > void __init mnt_init(void) > diff --git a/include/linux/container.h b/include/linux/container.h > new file mode 100644 > index 000000000000..0a8918435097 > --- /dev/null > +++ b/include/linux/container.h > @@ -0,0 +1,86 @@ > +/* Container objects > + * > + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved. > + * Written by David Howells (dhowells@redhat.com) > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public Licence > + * as published by the Free Software Foundation; either version > + * 2 of the Licence, or (at your option) any later version. > + */ > + > +#ifndef _LINUX_CONTAINER_H > +#define _LINUX_CONTAINER_H > + > +#include <uapi/linux/container.h> > +#include <linux/refcount.h> > +#include <linux/list.h> > +#include <linux/spinlock.h> > +#include <linux/wait.h> > +#include <linux/path.h> > +#include <linux/seqlock.h> > + > +struct fs_struct; > +struct nsproxy; > +struct task_struct; > + > +/* > + * The container object. 
> + */ > +struct container { > + char name[24]; > + u64 id; /* Container > ID */ > + refcount_t usage; > + int exit_code; /* The exit > code of 'init' */ > + const struct cred *cred; /* Creds for > this container, including userns */ > + struct nsproxy *ns; /* This > container's namespaces */ > + struct path root; /* The root > of the container's fs namespace */ > + struct task_struct *init; /* The > 'init' task for this container */ > + struct container *parent; /* Parent of this > container. */ > + void *security; /* LSM data */ > + struct list_head members; /* Member processes, > guarded with ->lock */ > + struct list_head child_link; /* Link in > parent->children */ > + struct list_head children; /* Child containers > */ > + wait_queue_head_t waitq; /* Someone > waiting for init to exit waits here */ > + unsigned long flags; > +#define CONTAINER_FLAG_INIT_STARTED 0 /* Init is > started - certain ops now prohibited */ > +#define CONTAINER_FLAG_DEAD 1 /* Init has died > */ > +#define CONTAINER_FLAG_KILL_ON_CLOSE 2 /* Kill init if > container handle closed */ > + spinlock_t lock; > + seqcount_t seq; /* Track > changes in ->root */ > +}; > + > +extern struct container init_container; > + > +#ifdef CONFIG_CONTAINERS > +extern const struct file_operations container_fops; > + > +extern int copy_container(unsigned long flags, struct task_struct > *tsk, > + struct container *container); > +extern void exit_container(struct task_struct *tsk); > +extern void put_container(struct container *c); > + > +static inline struct container *get_container(struct container *c) > +{ > + refcount_inc(&c->usage); > + return c; > +} > + > +static inline bool is_container_file(struct file *file) > +{ > + return file->f_op == &container_fops; > +} > + > +#else > + > +static inline int copy_container(unsigned long flags, struct > task_struct *tsk, > + struct container *container) > +{ return 0; } > +static inline void exit_container(struct task_struct *tsk) { } > +static inline void 
put_container(struct container *c) {} > +static inline struct container *get_container(struct container *c) { > return NULL; } > +static inline bool is_container_file(struct file *file) { return > false; } > + > +#endif /* CONFIG_CONTAINERS */ > + > +#endif /* _LINUX_CONTAINER_H */ > diff --git a/include/linux/init_task.h b/include/linux/init_task.h > index a7083a45a26c..f016cadece24 100644 > --- a/include/linux/init_task.h > +++ b/include/linux/init_task.h > @@ -10,6 +10,7 @@ > #include <linux/ipc.h> > #include <linux/pid_namespace.h> > #include <linux/user_namespace.h> > +#include <linux/container.h> > #include <linux/securebits.h> > #include <linux/seqlock.h> > #include <linux/rbtree.h> > diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h > index 52d0f3f4c786..0f310d911815 100644 > --- a/include/linux/lsm_hooks.h > +++ b/include/linux/lsm_hooks.h > @@ -1460,6 +1460,16 @@ > * @bpf_prog_free_security: > * Clean up the security information stored inside bpf prog. > * > + * Security hooks for containers: > + * > + * @container_alloc: > + * Permit creation of a new container and assign security > data. > + * @container: The new container. > + * > + * @container_free: > + * Free security data attached to a container. > + * @container: The container. 
> + * > */ > union security_list_options { > int (*binder_set_context_mgr)(struct task_struct *mgr); > @@ -1825,6 +1835,12 @@ union security_list_options { > int (*bpf_prog_alloc_security)(struct bpf_prog_aux *aux); > void (*bpf_prog_free_security)(struct bpf_prog_aux *aux); > #endif /* CONFIG_BPF_SYSCALL */ > + > + /* Container management security hooks */ > +#ifdef CONFIG_CONTAINERS > + int (*container_alloc)(struct container *container, unsigned > int flags); > + void (*container_free)(struct container *container); > +#endif > }; > > struct security_hook_heads { > @@ -2069,6 +2085,10 @@ struct security_hook_heads { > struct hlist_head bpf_prog_alloc_security; > struct hlist_head bpf_prog_free_security; > #endif /* CONFIG_BPF_SYSCALL */ > +#ifdef CONFIG_CONTAINERS > + struct hlist_head container_alloc; > + struct hlist_head container_free; > +#endif /* CONFIG_CONTAINERS */ > } __randomize_layout; > > /* > diff --git a/include/linux/sched.h b/include/linux/sched.h > index d2f90fa92468..073a3a930514 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -36,6 +36,7 @@ struct backing_dev_info; > struct bio_list; > struct blk_plug; > struct cfs_rq; > +struct container; > struct fs_struct; > struct futex_pi_state; > struct io_context; > @@ -870,6 +871,8 @@ struct task_struct { > > /* Namespaces: */ > struct nsproxy *nsproxy; > + struct container *container; > + struct list_head container_link; > > /* Signal handlers: */ > struct signal_struct *signal; > diff --git a/include/linux/security.h b/include/linux/security.h > index da538c06766f..acd0c14c6e95 100644 > --- a/include/linux/security.h > +++ b/include/linux/security.h > @@ -70,6 +70,7 @@ struct ctl_table; > struct audit_krule; > struct user_namespace; > struct timezone; > +struct container; > > enum lsm_event { > LSM_POLICY_CHANGE, > @@ -1751,6 +1752,20 @@ static inline void > security_audit_rule_free(void *lsmrule) > #endif /* CONFIG_SECURITY */ > #endif /* CONFIG_AUDIT */ > > +#ifdef 
CONFIG_CONTAINERS > +#ifdef CONFIG_SECURITY > +int security_container_alloc(struct container *container, unsigned > int flags); > +void security_container_free(struct container *container); > +#else > +static inline int security_container_alloc(struct container > *container, > + unsigned int flags) > +{ > + return 0; > +} > +static inline void security_container_free(struct container > *container) {} > +#endif > +#endif /* CONFIG_CONTAINERS */ > + > #ifdef CONFIG_SECURITYFS > > extern struct dentry *securityfs_create_file(const char *name, > umode_t mode, > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index 10127b1d923b..dac42098c2dd 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -943,6 +943,9 @@ asmlinkage long sys_mount_notify(int dfd, const > char __user *path, > unsigned int at_flags, int > watch_fd, int watch_id); > asmlinkage long sys_sb_notify(int dfd, const char __user *path, > unsigned int at_flags, int watch_fd, > int watch_id); > +asmlinkage long sys_container_create(const char __user *name, > unsigned int flags, > + unsigned long spare3, unsigned > long spare4, > + unsigned long spare5); > > /* > * Architecture-specific system calls > diff --git a/include/uapi/linux/container.h > b/include/uapi/linux/container.h > new file mode 100644 > index 000000000000..43748099b28d > --- /dev/null > +++ b/include/uapi/linux/container.h > @@ -0,0 +1,28 @@ > +/* Container UAPI > + * > + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved. > + * Written by David Howells (dhowells@redhat.com) > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public Licence > + * as published by the Free Software Foundation; either version > + * 2 of the Licence, or (at your option) any later version. 
> + */ > + > +#ifndef _UAPI_LINUX_CONTAINER_H > +#define _UAPI_LINUX_CONTAINER_H > + > + > +#define CONTAINER_NEW_FS_NS 0x00000001 /* Dup current > fs namespace */ > +#define CONTAINER_NEW_EMPTY_FS_NS 0x00000002 /* Provide new > empty fs namespace */ > +#define CONTAINER_NEW_CGROUP_NS 0x00000004 /* Dup > current cgroup namespace */ > +#define CONTAINER_NEW_UTS_NS 0x00000008 /* Dup > current uts namespace */ > +#define CONTAINER_NEW_IPC_NS 0x00000010 /* Dup > current ipc namespace */ > +#define CONTAINER_NEW_USER_NS 0x00000020 /* Dup > current user namespace */ > +#define CONTAINER_NEW_PID_NS 0x00000040 /* Dup > current pid namespace */ > +#define CONTAINER_NEW_NET_NS 0x00000080 /* Dup > current net namespace */ > +#define CONTAINER_KILL_ON_CLOSE 0x00000100 /* Kill > all member processes when fd closed */ > +#define CONTAINER_FD_CLOEXEC 0x00000200 /* Close the > fd on exec */ > +#define CONTAINER__FLAG_MASK 0x000003ff > + > +#endif /* _UAPI_LINUX_CONTAINER_H */ > diff --git a/init/Kconfig b/init/Kconfig > index 5984dd7f2156..ab37c3a55aa1 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -992,6 +992,13 @@ config NET_NS > Allow user space to create what appear to be multiple > instances > of the network stack. > > +config CONTAINERS > + bool "Container support" > + default y > + help > + Allow userspace to create and manipulate containers as > objects that > + have namespaces and hold a set of processes. 
> + > endif # NAMESPACES > > config CHECKPOINT_RESTORE > diff --git a/init/init_task.c b/init/init_task.c > index 5aebe3be4d7c..90c7439a195b 100644 > --- a/init/init_task.c > +++ b/init/init_task.c > @@ -108,6 +108,9 @@ struct task_struct init_task > .signal = &init_signals, > .sighand = &init_sighand, > .nsproxy = &init_nsproxy, > + .container = &init_container, > + .container_link.next = &init_container.members, > + .container_link.prev = &init_container.members, > .pending = { > .list = LIST_HEAD_INIT(init_task.pending.list), > .signal = {{0}} > diff --git a/kernel/Makefile b/kernel/Makefile > index 6aa7543bcdb2..98cdd18cecef 100644 > --- a/kernel/Makefile > +++ b/kernel/Makefile > @@ -8,7 +8,7 @@ obj-y = fork.o exec_domain.o panic.o \ > sysctl.o sysctl_binary.o capability.o ptrace.o user.o \ > signal.o sys.o umh.o workqueue.o pid.o task_work.o \ > extable.o params.o \ > - kthread.o sys_ni.o nsproxy.o \ > + kthread.o sys_ni.o nsproxy.o container.o \ > notifier.o ksysfs.o cred.o reboot.o \ > async.o range.o smpboot.o ucount.o > > diff --git a/kernel/container.c b/kernel/container.c > new file mode 100644 > index 000000000000..ca4012632cfa > --- /dev/null > +++ b/kernel/container.c > @@ -0,0 +1,348 @@ > +/* Implement container objects. > + * > + * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved. > + * Written by David Howells (dhowells@redhat.com) > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public Licence > + * as published by the Free Software Foundation; either version > + * 2 of the Licence, or (at your option) any later version. 
> + */ > + > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > +#include <linux/poll.h> > +#include <linux/wait.h> > +#include <linux/init_task.h> > +#include <linux/fs.h> > +#include <linux/fs_struct.h> > +#include <linux/anon_inodes.h> > +#include <linux/container.h> > +#include <linux/syscalls.h> > +#include <linux/printk.h> > +#include <linux/security.h> > +#include "namespaces.h" > + > +struct container init_container = { > + .name = ".init", > + .id = 1, > + .usage = REFCOUNT_INIT(2), > + .cred = &init_cred, > + .ns = &init_nsproxy, > + .init = &init_task, > + .members.next = &init_task.container_link, > + .members.prev = &init_task.container_link, > + .children = LIST_HEAD_INIT(init_container.children), > + .flags = (1 << CONTAINER_FLAG_INIT_STARTED), > + .lock = > __SPIN_LOCK_UNLOCKED(init_container.lock), > + .seq = SEQCNT_ZERO(init_fs.seq), > +}; > + > +#ifdef CONFIG_CONTAINERS > + > +static atomic64_t container_id_counter = ATOMIC_INIT(1); > + > +/* > + * Drop a ref on a container and clear it if no longer in use. > + */ > +void put_container(struct container *c) > +{ > + struct container *parent; > + > + while (c && refcount_dec_and_test(&c->usage)) { > + BUG_ON(!list_empty(&c->members)); > + if (c->ns) > + put_nsproxy(c->ns); > + path_put(&c->root); > + > + parent = c->parent; > + if (parent) { > + spin_lock(&parent->lock); > + list_del(&c->child_link); > + spin_unlock(&parent->lock); > + } > + > + if (c->cred) > + put_cred(c->cred); > + security_container_free(c); > + kfree(c); > + c = parent; > + } > +} > + > +/* > + * Allow the user to poll for the container dying. 
> + */ > +static unsigned int container_poll(struct file *file, poll_table > *wait) > +{ > + struct container *container = file->private_data; > + unsigned int mask = 0; > + > + poll_wait(file, &container->waitq, wait); > + > + if (test_bit(CONTAINER_FLAG_DEAD, &container->flags)) > + mask |= POLLHUP; > + > + return mask; > +} > + > +static int container_release(struct inode *inode, struct file *file) > +{ > + struct container *container = file->private_data; > + > + put_container(container); > + return 0; > +} > + > +const struct file_operations container_fops = { > + .poll = container_poll, > + .release = container_release, > +}; > + > +/* > + * Handle fork/clone. > + * > + * A process inherits its parent's container. The first process > into the > + * container is its 'init' process and the life of everything else > in there is > + * dependent upon that. > + */ > +int copy_container(unsigned long flags, struct task_struct *tsk, > + struct container *container) > +{ > + struct container *c = container ?: tsk->container; > + int ret = -ECANCELED; > + > + spin_lock(&c->lock); > + > + if (!test_bit(CONTAINER_FLAG_DEAD, &c->flags)) { > + list_add_tail(&tsk->container_link, &c->members); > + get_container(c); > + tsk->container = c; > + if (!c->init) { > + set_bit(CONTAINER_FLAG_INIT_STARTED, &c- > >flags); > + c->init = tsk; > + } > + ret = 0; > + } > + > + spin_unlock(&c->lock); > + return ret; > +} > + > +/* > + * Remove a dead process from a container. > + * > + * If the 'init' process in a container dies, we kill off all the > other > + * processes in the container. 
> + */ > +void exit_container(struct task_struct *tsk) > +{ > + struct task_struct *p; > + struct container *c = tsk->container; > + struct kernel_siginfo si = { > + .si_signo = SIGKILL, > + .si_code = SI_KERNEL, > + }; > + > + spin_lock(&c->lock); > + > + list_del(&tsk->container_link); > + > + if (c->init == tsk) { > + c->init = NULL; > + c->exit_code = tsk->exit_code; > + smp_wmb(); /* Order exit_code vs CONTAINER_DEAD. */ > + set_bit(CONTAINER_FLAG_DEAD, &c->flags); > + wake_up_bit(&c->flags, CONTAINER_FLAG_DEAD); > + > + list_for_each_entry(p, &c->members, container_link) > { > + si.si_pid = task_tgid_vnr(p); > + send_sig_info(SIGKILL, &si, p); > + } > + } > + > + spin_unlock(&c->lock); > + put_container(c); > +} > + > +/* > + * Allocate a container. > + */ > +static struct container *alloc_container(const char __user *name) > +{ > + struct container *c; > + long len; > + int ret; > + > + c = kzalloc(sizeof(struct container), GFP_KERNEL); > + if (!c) > + return ERR_PTR(-ENOMEM); > + > + INIT_LIST_HEAD(&c->members); > + INIT_LIST_HEAD(&c->children); > + init_waitqueue_head(&c->waitq); > + spin_lock_init(&c->lock); > + refcount_set(&c->usage, 1); > + > + ret = -EFAULT; > + len = strncpy_from_user(c->name, name, sizeof(c->name)); > + if (len < 0) > + goto err; > + ret = -ENAMETOOLONG; > + if (len >= sizeof(c->name)) > + goto err; > + ret = -EINVAL; > + if (strchr(c->name, '/')) > + goto err; > + > + c->name[len] = 0; > + return c; > + > +err: > + kfree(c); > + return ERR_PTR(ret); > +} > + > +/* > + * Create some creds for the container. We don't want to pin things > we don't > + * have to, so drop all keyrings from the new cred. The LSM gets to > audit the > + * cred struct when security_container_alloc() is invoked. 
> + */ > +static const struct cred *create_container_creds(unsigned int flags) > +{ > + struct cred *new; > + int ret; > + > + new = prepare_creds(); > + if (!new) > + return ERR_PTR(-ENOMEM); > + > +#ifdef CONFIG_KEYS > + key_put(new->thread_keyring); > + new->thread_keyring = NULL; > + key_put(new->process_keyring); > + new->process_keyring = NULL; > + key_put(new->session_keyring); > + new->session_keyring = NULL; > + key_put(new->request_key_auth); > + new->request_key_auth = NULL; > +#endif > + > + if (flags & CONTAINER_NEW_USER_NS) { > + ret = create_user_ns(new); > + if (ret < 0) > + goto err; > + new->euid = new->user_ns->owner; > + new->egid = new->user_ns->group; > + } > + > + new->fsuid = new->suid = new->uid = new->euid; > + new->fsgid = new->sgid = new->gid = new->egid; > + return new; > + > +err: > + abort_creds(new); > + return ERR_PTR(ret); > +} > + > +/* > + * Create a new container. > + */ > +static struct container *create_container(const char __user *name, > unsigned int flags) > +{ > + struct container *parent, *c; > + struct fs_struct *fs; > + struct nsproxy *ns; > + const struct cred *cred; > + int ret; > + > + c = alloc_container(name); > + if (IS_ERR(c)) > + return c; > + > + if (flags & CONTAINER_KILL_ON_CLOSE) > + __set_bit(CONTAINER_FLAG_KILL_ON_CLOSE, &c->flags); > + > + cred = create_container_creds(flags); > + if (IS_ERR(cred)) { > + ret = PTR_ERR(cred); > + goto err_cont; > + } > + c->cred = cred; > + > + ret = -ENOMEM; > + fs = copy_fs_struct(current->fs); > + if (!fs) > + goto err_cont; > + > + ns = create_new_namespaces( > + (flags & CONTAINER_NEW_FS_NS ? CLONE_NEWNS : > 0) | > + (flags & CONTAINER_NEW_CGROUP_NS ? CLONE_NEWCGROUP : > 0) | > + (flags & CONTAINER_NEW_UTS_NS ? CLONE_NEWUTS > : 0) | > + (flags & CONTAINER_NEW_IPC_NS ? CLONE_NEWIPC > : 0) | > + (flags & CONTAINER_NEW_PID_NS ? CLONE_NEWPID > : 0) | > + (flags & CONTAINER_NEW_NET_NS ? 
CLONE_NEWNET > : 0), > + current->nsproxy, cred->user_ns, fs); > + if (IS_ERR(ns)) { > + ret = PTR_ERR(ns); > + goto err_fs; > + } > + > + c->ns = ns; > + c->root = fs->root; > + c->seq = fs->seq; > + fs->root.mnt = NULL; > + fs->root.dentry = NULL; > + > + ret = security_container_alloc(c, flags); > + if (ret < 0) > + goto err_fs; > + > + parent = current->container; > + get_container(parent); > + c->parent = parent; > + c->id = atomic64_inc_return(&container_id_counter); > + spin_lock(&parent->lock); > + list_add_tail(&c->child_link, &parent->children); > + spin_unlock(&parent->lock); > + return c; > + > +err_fs: > + free_fs_struct(fs); > +err_cont: > + put_container(c); > + return ERR_PTR(ret); > +} > + > +/* > + * Create a new container object. > + */ > +SYSCALL_DEFINE5(container_create, > + const char __user *, name, > + unsigned int, flags, > + unsigned long, spare3, > + unsigned long, spare4, > + unsigned long, spare5) > +{ > + struct container *c; > + int fd; > + > + if (!name || > + flags & ~CONTAINER__FLAG_MASK || > + spare3 != 0 || spare4 != 0 || spare5 != 0) > + return -EINVAL; > + if ((flags & (CONTAINER_NEW_FS_NS | > CONTAINER_NEW_EMPTY_FS_NS)) == > + (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS)) > + return -EINVAL; > + > + c = create_container(name, flags); > + if (IS_ERR(c)) > + return PTR_ERR(c); > + > + fd = anon_inode_getfd("container", &container_fops, c, > + O_RDWR | (flags & CONTAINER_FD_CLOEXEC > ? 
O_CLOEXEC : 0)); > + if (fd < 0) > + put_container(c); > + return fd; > +} > + > +#endif /* CONFIG_CONTAINERS */ > diff --git a/kernel/exit.c b/kernel/exit.c > index 284f2fe9a293..78f6065ad799 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -864,6 +864,7 @@ void __noreturn do_exit(long code) > if (group_dead) > disassociate_ctty(1); > exit_task_namespaces(tsk); > + exit_container(tsk); > exit_task_work(tsk); > exit_thread(tsk); > exit_umh(tsk); > diff --git a/kernel/fork.c b/kernel/fork.c > index b69248e6f0e0..009cf7e63894 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1920,9 +1920,12 @@ static __latent_entropy struct task_struct > *copy_process( > retval = copy_namespaces(clone_flags, p); > if (retval) > goto bad_fork_cleanup_mm; > - retval = copy_io(clone_flags, p); > + retval = copy_container(clone_flags, p, NULL); > if (retval) > goto bad_fork_cleanup_namespaces; > + retval = copy_io(clone_flags, p); > + if (retval) > + goto bad_fork_cleanup_container; > retval = copy_thread_tls(clone_flags, stack_start, > stack_size, p, tls); > if (retval) > goto bad_fork_cleanup_io; > @@ -2121,6 +2124,8 @@ static __latent_entropy struct task_struct > *copy_process( > bad_fork_cleanup_io: > if (p->io_context) > exit_io_context(p); > +bad_fork_cleanup_container: > + exit_container(p); > bad_fork_cleanup_namespaces: > exit_task_namespaces(p); > bad_fork_cleanup_mm: > diff --git a/kernel/namespaces.h b/kernel/namespaces.h > new file mode 100644 > index 000000000000..c44e3cf0e254 > --- /dev/null > +++ b/kernel/namespaces.h > @@ -0,0 +1,15 @@ > +/* Local namespaces defs > + * > + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved. > + * Written by David Howells (dhowells@redhat.com) > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public Licence > + * as published by the Free Software Foundation; either version > + * 2 of the Licence, or (at your option) any later version. 
> + */ > + > +extern struct nsproxy *create_new_namespaces(unsigned long flags, > + struct nsproxy > *nsproxy, > + struct user_namespace > *user_ns, > + struct fs_struct > *new_fs); > diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c > index f6c5d330059a..4bb5184b3a80 100644 > --- a/kernel/nsproxy.c > +++ b/kernel/nsproxy.c > @@ -27,6 +27,7 @@ > #include <linux/syscalls.h> > #include <linux/cgroup.h> > #include <linux/perf_event.h> > +#include "namespaces.h" > > static struct kmem_cache *nsproxy_cachep; > > @@ -61,8 +62,8 @@ static inline struct nsproxy *create_nsproxy(void) > * Return the newly created nsproxy. Do not attach this to the > task, > * leave it to the caller to do proper locking and attach it to > task. > */ > -static struct nsproxy *create_new_namespaces(unsigned long flags, > - struct task_struct *tsk, struct user_namespace *user_ns, > +struct nsproxy *create_new_namespaces(unsigned long flags, > + struct nsproxy *nsproxy, struct user_namespace *user_ns, > struct fs_struct *new_fs) > { > struct nsproxy *new_nsp; > @@ -72,39 +73,39 @@ static struct nsproxy > *create_new_namespaces(unsigned long flags, > if (!new_nsp) > return ERR_PTR(-ENOMEM); > > - new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, > user_ns, new_fs); > + new_nsp->mnt_ns = copy_mnt_ns(flags, nsproxy->mnt_ns, > user_ns, new_fs); > if (IS_ERR(new_nsp->mnt_ns)) { > err = PTR_ERR(new_nsp->mnt_ns); > goto out_ns; > } > > - new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy- > >uts_ns); > + new_nsp->uts_ns = copy_utsname(flags, user_ns, nsproxy- > >uts_ns); > if (IS_ERR(new_nsp->uts_ns)) { > err = PTR_ERR(new_nsp->uts_ns); > goto out_uts; > } > > - new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy- > >ipc_ns); > + new_nsp->ipc_ns = copy_ipcs(flags, user_ns, nsproxy- > >ipc_ns); > if (IS_ERR(new_nsp->ipc_ns)) { > err = PTR_ERR(new_nsp->ipc_ns); > goto out_ipc; > } > > new_nsp->pid_ns_for_children = > - copy_pid_ns(flags, user_ns, tsk->nsproxy- > 
>pid_ns_for_children); > + copy_pid_ns(flags, user_ns, nsproxy- > >pid_ns_for_children); > if (IS_ERR(new_nsp->pid_ns_for_children)) { > err = PTR_ERR(new_nsp->pid_ns_for_children); > goto out_pid; > } > > new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns, > - tsk->nsproxy- > >cgroup_ns); > + nsproxy->cgroup_ns); > if (IS_ERR(new_nsp->cgroup_ns)) { > err = PTR_ERR(new_nsp->cgroup_ns); > goto out_cgroup; > } > > - new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy- > >net_ns); > + new_nsp->net_ns = copy_net_ns(flags, user_ns, nsproxy- > >net_ns); > if (IS_ERR(new_nsp->net_ns)) { > err = PTR_ERR(new_nsp->net_ns); > goto out_net; > @@ -162,7 +163,7 @@ int copy_namespaces(unsigned long flags, struct > task_struct *tsk) > (CLONE_NEWIPC | CLONE_SYSVSEM)) > return -EINVAL; > > - new_ns = create_new_namespaces(flags, tsk, user_ns, tsk- > >fs); > + new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, > tsk->fs); > if (IS_ERR(new_ns)) > return PTR_ERR(new_ns); > > @@ -203,7 +204,7 @@ int unshare_nsproxy_namespaces(unsigned long > unshare_flags, > if (!ns_capable(user_ns, CAP_SYS_ADMIN)) > return -EPERM; > > - *new_nsp = create_new_namespaces(unshare_flags, current, > user_ns, > + *new_nsp = create_new_namespaces(unshare_flags, current- > >nsproxy, user_ns, > new_fs ? 
new_fs : current- > >fs); > if (IS_ERR(*new_nsp)) { > err = PTR_ERR(*new_nsp); > @@ -251,7 +252,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype) > if (nstype && (ns->ops->type != nstype)) > goto out; > > - new_nsproxy = create_new_namespaces(0, tsk, > current_user_ns(), tsk->fs); > + new_nsproxy = create_new_namespaces(0, tsk->nsproxy, > current_user_ns(), tsk->fs); > if (IS_ERR(new_nsproxy)) { > err = PTR_ERR(new_nsproxy); > goto out; > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index a4e7131b2509..f0455cbb91cf 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -136,6 +136,9 @@ COND_SYSCALL(acct); > COND_SYSCALL(capget); > COND_SYSCALL(capset); > > +/* kernel/container.c */ > +COND_SYSCALL(container_create); > + > /* kernel/exec_domain.c */ > > /* kernel/exit.c */ > diff --git a/security/security.c b/security/security.c > index b49732c02e21..259be9a1746c 100644 > --- a/security/security.c > +++ b/security/security.c > @@ -1864,3 +1864,15 @@ void security_bpf_prog_free(struct > bpf_prog_aux *aux) > call_void_hook(bpf_prog_free_security, aux); > } > #endif /* CONFIG_BPF_SYSCALL */ > + > +#ifdef CONFIG_CONTAINERS > +int security_container_alloc(struct container *container, unsigned > int flags) > +{ > + return call_int_hook(container_alloc, 0, container, flags); > +} > + > +void security_container_free(struct container *container) > +{ > + call_void_hook(container_free, container); > +} > +#endif /* CONFIG_CONTAINERS */ >
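The argument checks at the top of container_create() in the quoted patch can be sketched in userspace as follows. This is a minimal sketch, not the patch's code: the flag values and the helper name container_check_args() are assumptions made for illustration, since the uapi header that defines the real values is not quoted in this thread.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define CONTAINER_NEW_FS_NS        0x01u
#define CONTAINER_NEW_EMPTY_FS_NS  0x02u
#define CONTAINER_NEW_USER_NS      0x04u
#define CONTAINER_KILL_ON_CLOSE    0x08u
#define CONTAINER__FLAG_MASK       0x0fu

/*
 * Hypothetical helper mirroring the checks in the quoted syscall:
 * a name is required, unknown flag bits and non-zero spare arguments
 * are rejected, and the two filesystem-namespace flags are mutually
 * exclusive.  Returns 0 if the arguments are acceptable, -EINVAL if not.
 */
int container_check_args(const char *name, unsigned int flags,
                         unsigned long spare3, unsigned long spare4,
                         unsigned long spare5)
{
    const unsigned int both_fs =
        CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS;

    if (!name || (flags & ~CONTAINER__FLAG_MASK) ||
        spare3 != 0 || spare4 != 0 || spare5 != 0)
        return -EINVAL;

    /* A new mount namespace is either a copy or empty, never both. */
    if ((flags & both_fs) == both_fs)
        return -EINVAL;

    return 0;
}
```

Rejecting non-zero spare arguments up front is what lets those slots be given a meaning later without breaking callers that already pass zero.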
David Howells <dhowells@redhat.com> writes: The container id details are ludicrous and will break practically every use case. This is completely unacceptable. Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com> > diff --git a/include/linux/container.h b/include/linux/container.h > new file mode 100644 > index 000000000000..0a8918435097 > --- /dev/null > +++ b/include/linux/container.h > +/* > + * The container object. > + */ > +struct container { > + u64 id; /* Container ID */ ... No. This is absolutely unacceptable, as this breaks nested containers and process migration. > +}; > + > diff --git a/include/linux/sched.h b/include/linux/sched.h > index d2f90fa92468..073a3a930514 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -36,6 +36,7 @@ struct backing_dev_info; > struct bio_list; > struct blk_plug; > struct cfs_rq; > +struct container; > struct fs_struct; > struct futex_pi_state; > struct io_context; > @@ -870,6 +871,8 @@ struct task_struct { > > /* Namespaces: */ > struct nsproxy *nsproxy; > + struct container *container; > + struct list_head container_link; Why? nsproxy would be a much cheaper location to put this. Less space and less foobar. > /* Signal handlers: */ > struct signal_struct *signal; > diff --git a/kernel/container.c b/kernel/container.c > new file mode 100644 > index 000000000000..ca4012632cfa > --- /dev/null > +++ b/kernel/container.c > @@ -0,0 +1,348 @@ [...] > + > + c->id = atomic64_inc_return(&container_id_counter); This id is not in a namespace, and it doesn't have enough bits of entropy to be globally unique. Not that 64 bits is enough to have a chance at being globally unique. Eric
Trond Myklebust <trondmy@hammerspace.com> wrote: > Do we really need a new system call to set up containers? That would > force changes to all existing orchestration software. No, it wouldn't. Nothing in my patches forces existing orchestration software to change, unless it wants to use the new facilities - then it would have to be changed anyway, right? I will grant, though, that the extent of the change might vary. > Given that the main thing we want to achieve is to direct messages from > the kernel to an appropriate handler, why not focus on adding > functionality to do just that? Because it's *not* just that that is added here. There are a number of things this patchset (and one it depends on) provides: (1) The ability to intercept request_key() upcalls that happen inside a container, filtered by operative namespace. (2) The ability to provide a per-container keyring that can hold keys that can be used inside the container without any action on behalf of the denizens of the container. (3) The ability to grant permissions to a *container* as a subject, allowing it and its denizens to use, but not necessarily read, modify, link or invalidate a key. (4) The ability to create superblocks inside a container with a separate mount namespace from outside, such that they can use the container keys, thereby allowing the root of a container to be on an authenticated filesystem. > Is there any reason why a syscall to allow an appropriately privileged > process to add a keyring-specific message queue to its own > user_namespace and obtain a file descriptor to that message queue might > not work? Yes. That forces the use of a new user_namespace for every container in which you want to use any of the above features. The user_namespace is already way too big and intrusive a hammer as it is. 
> With such an implementation, the fallback mechanism could be to walk > back up the hierarchy of user_namespaces until a message queue is > found, and to invoke the existing request_key mechanism if not. That's definitely wrong. /sbin/request-key should *not* be spawned if the key to be instantiated is not in all the init namespaces. I went with a container object with namespaces for a reason: initially, it was so that the upcall could take place inside of the container's namespaces, but now it's so that any request that doesn't match the namespaces on the container gets rejected at the boundary - so that some daemon up the chain doesn't try servicing a request for which it can't access the config data or would end up talking out of the wrong NIC. I can drop the container object part of it for the moment. I could instead create 1-3 new namespaces: (1) A namespace with an upcall-interception point. (2) A namespace with a container keyring. (3) A namespace with a subject ID for use in key ACLs. I think I should also consider adding: (4) A namespace with keyring names in it. I'm leaning towards this not being part of user_namespace because these probably should not be visible between containers. David
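The "rejected at the boundary" rule described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct ns_ids and request_matches_container() are hypothetical names, each namespace is reduced to an opaque 64-bit identifier, and the sketch requires every namespace to match, whereas the patchset speaks of filtering by the operative namespace, which may be narrower.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: one opaque identifier per namespace type. */
struct ns_ids {
    uint64_t mnt;
    uint64_t net;
    uint64_t uts;
    uint64_t ipc;
    uint64_t pid;
    uint64_t user;
};

/*
 * Hypothetical boundary check: a request is passed to the container's
 * handler only if it was raised in exactly the namespaces the container
 * was set up with.  Anything else is rejected rather than handed to a
 * daemon further up the chain, which might lack the right configuration
 * data or talk out of the wrong NIC.
 */
bool request_matches_container(const struct ns_ids *req,
                               const struct ns_ids *cont)
{
    return req->mnt  == cont->mnt  &&
           req->net  == cont->net  &&
           req->uts  == cont->uts  &&
           req->ipc  == cont->ipc  &&
           req->pid  == cont->pid  &&
           req->user == cont->user;
}
```

A request raised with a different network namespace from the container's would fail this check even if every other namespace matched, which is the "wrong NIC" case mentioned above.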
James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > I thought we got agreement years ago that containers don't exist in > Linux as a single entity: they're currently a collection of cgroups and > namespaces some of which may and some of which may not be local to the > entity the orchestration system thinks of as a "container". I wasn't party to that agreement and don't feel particularly bound by it. David
Eric W. Biederman <ebiederm@xmission.com> wrote: > > + c->id = atomic64_inc_return(&container_id_counter); > > This id is not in a namespace, and it doesn't have enough bits > of entropy to be globally unique. Not that 64 bits is enough > to have a chance at being globally unique. It's in a container, so it doesn't need to be in a namespace. The intended purpose is for annotating audit messages. Globally unique wasn't particularly in mind. It could be turned into, say, a uuid, so that isn't really a problem at this point. You are right, though, it really should be globally unique as far as possible - even the one in init_container should be. Ideally, it would look the same inside the root container as any subcontainer. David
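To make the "turn it into a uuid" idea concrete, here is a minimal userspace sketch that produces a random (version 4) UUID string in place of a monotonically increasing counter. The function name make_container_uuid() is hypothetical, and the sketch uses the glibc getrandom() wrapper; it says nothing about how the kernel patch would actually generate or store such an identifier.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/random.h>
#include <sys/types.h>

/*
 * Hypothetical helper: fill out[37] with a random version-4 UUID in the
 * usual 8-4-4-4-12 text form.  Returns 0 on success, -1 on failure.
 */
int make_container_uuid(char out[37])
{
    unsigned char b[16];
    size_t got = 0;

    /* Collect 16 random bytes, retrying if interrupted by a signal. */
    while (got < sizeof(b)) {
        ssize_t n = getrandom(b + got, sizeof(b) - got, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        got += (size_t)n;
    }

    b[6] = (unsigned char)((b[6] & 0x0f) | 0x40); /* version 4 */
    b[8] = (unsigned char)((b[8] & 0x3f) | 0x80); /* RFC 4122 variant */

    snprintf(out, 37,
             "%02x%02x%02x%02x-%02x%02x-%02x%02x-"
             "%02x%02x-%02x%02x%02x%02x%02x%02x",
             b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7],
             b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
    return 0;
}
```

A random identifier of this kind carries 122 bits of entropy, so two hosts are very unlikely to hand out the same value, which a per-boot counter starting from a fixed value cannot offer.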
On Fri, Feb 15, 2019 at 04:07:33PM +0000, David Howells wrote: > ================== > FUTURE DEVELOPMENT > ================== > > (1) Setting up the container. > > A container would be created with, say: > > int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS); > ... > Further mounts can be added by: > > move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH); > ... > (2) Starting the container. > > Once all modifications are complete, the container's 'init' process > can be started by: > > fork_into_container(int cfd); > > This precludes further external modification of the mount tree within > the container. Is there a technical reason for this? In particular, there are some container runtimes that do this today via clever use of bind mounts and MS_MOVE, for things like dynamically attaching volumes. It would be useful to be able to mount things into the container after the fact. > (3) Waiting for the container to complete. > > The container fd can then be polled to wait for init process therein > to complete and the exit code collected by: > > container_wait(int container_fd, int *_wstatus, unsigned int wait, > struct rusage *rusage); > > The container and everything in it can be terminated or killed off: > > container_kill(int container_fd, int initonly, int signal); > > If 'init' dies, all other processes in the container are preemptively > SIGKILL'd by the kernel. Isn't this essentially how the pid ns works today? I'm not sure what the container fd offers here (of course if it lands, then having the same semantics makes sense). > (6) Running different LSM policies by container. This might particularly > make sense with something like Apparmor where different path-based > rules might be required inside a container to inside the parent. Apparmor supports this today, as long as the host is also running Apparmor. For the more general case, Casey (and others) have been working on LSM stacking for a long time. Tycho
On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote: > James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > > I thought we got agreement years ago that containers don't exist in > > Linux as a single entity: they're currently a collection of cgroups > > and namespaces some of which may and some of which may not be local > > to the entity the orchestration system thinks of as a "container". > > I wasn't party to that agreement and don't feel particularly bound by > it. That's not at all relevant, is it? The point is we have widespread uses of namespaces and cgroups that span containers today meaning that a "container id" becomes a problematic concept. What we finally got to with the audit people was an unmodifiable label which the orchestration system can set ... can't you just use that? James
On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote: > Implement a kernel container object such that it contains the following > things: > > (1) Namespaces. > > (2) A root directory. > > (3) A set of processes, including one designated as the 'init' process. Yeah, I think a name other than init needs to be used for this process. The problem being that there is no requirement for container process 1 to behave in any way like an "init" process is expected to behave and that leads to confusion (at least it certainly did for me). Admittedly I haven't yet worked through the series but in the light of the comments from James I wanted to chime in (probably too early to be useful not having read the series but ...). I believe what you're trying to do here is so badly needed it would be great if the needs of James could be met to some (as yet undefined) satisfactory extent. Would there be any possibility of introducing a concept of inactive and active containers where the creation is a two (maybe more) step procedure, first the creation of (if you like a "true") container that's essentially empty, basically a shell (not the program "shell" of course), inert wrt. events and such and implement the ability to make the container active by adding various things, like processes, to it? Clearly the concepts of inactive and active require a definition of what they mean and I don't have that; perhaps a starting point could be that a container that has a process 1 (which should also require a root fs and namespaces) is active, and otherwise it's considered inactive. Ian
On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote: > On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote: > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > > > > I thought we got agreement years ago that containers don't exist in > > > Linux as a single entity: they're currently a collection of cgroups > > > and namespaces some of which may and some of which may not be local > > > to the entity the orchestration system thinks of as a "container". > > > > I wasn't party to that agreement and don't feel particularly bound by > > it. > > That's not at all relevant, is it? The point is we have widespread > uses of namespaces and cgroups that span containers today meaning that > a "container id" becomes a problematic concept. What we finally got to > with the audit people was an unmodifiable label which the orchestration > system can set ... can't you just use that? Sorry James, I fail to see how assigning an id to a collection of objects constitutes a problem or how that could restrict the way a container is used. Isn't the only problem here the current restrictions on the way objects need to be combined as a set and the ability to be able to add or subtract from that set? Then again the notion of active vs. inactive might not be sufficient to allow for the needed flexibility ... Ian
On Wed, 2019-02-20 at 11:04 +0800, Ian Kent wrote: > On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote: > > On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote: > > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > > > > > > I thought we got agreement years ago that containers don't > > > > exist in Linux as a single entity: they're currently a > > > > collection of cgroups and namespaces some of which may and some > > > > of which may not be local to the entity the orchestration > > > > system thinks of as a "container". > > > > > > I wasn't party to that agreement and don't feel particularly > > > bound by it. > > > > That's not at all relevant, is it? The point is we have widespread > > uses of namespaces and cgroups that span containers today meaning > > that a "container id" becomes a problematic concept. What we > > finally got to with the audit people was an unmodifiable label > > which the orchestration system can set ... can't you just use that? > > Sorry James, I fail to see how assigning an id to a collection of > objects constitutes a problem or how that could restrict the way a > container is used. Rather than rehash the whole argument again, what's the reason you can't use the audit label? It seems to do what you want in a way that doesn't cause problems. If you can just use it there's little point arguing over what is effectively a moot issue. James > Isn't the only problem here the current restrictions on the way > objects need to be combined as a set and the ability to be able to add > or subtract from that set? > > Then again the notion of active vs. inactive might not be sufficient > to allow for the needed flexibility ... > > Ian >
On Tue, 2019-02-19 at 19:46 -0800, James Bottomley wrote: > On Wed, 2019-02-20 at 11:04 +0800, Ian Kent wrote: > > On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote: > > > On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote: > > > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > > > > > > > > I thought we got agreement years ago that containers don't > > > > > exist in Linux as a single entity: they're currently a > > > > > collection of cgroups and namespaces some of which may and some > > > > > of which may not be local to the entity the orchestration > > > > > system thinks of as a "container". > > > > > > > > I wasn't party to that agreement and don't feel particularly > > > > bound by it. > > > > > > That's not at all relevant, is it? The point is we have widespread > > > uses of namespaces and cgroups that span containers today meaning > > > that a "container id" becomes a problematic concept. What we > > > finally got to with the audit people was an unmodifiable label > > > which the orchestration system can set ... can't you just use that? > > > > Sorry James, I fail to see how assigning an id to a collection of > > objects constitutes a problem or how that could restrict the way a > > container is used. > > Rather than rehash the whole argument again, what's the reason you > can't use the audit label? It seems to do what you want in a way that > doesn't cause problems. If you can just use it there's little point > arguing over what is effectively a moot issue. David might want to use the audit label for this, I don't know. And maybe that's a good choice initially. But going way off topic. 
There is a need to keep logging from cluttering kernel space by leaving it to user space to handle, but user space is not currently given sufficient information to do so. There will therefore need to be some sort of globally unique (sub-system) identifiers for the kernel objects about which user space needs logging information, so that if or when that kernel to user space information flow is implemented, the consistent identifiers that will be needed will at least exist for some kernel objects. Yes, that's way off topic for this series, but I think it's something that needs at least some consideration for new implementation work. Unfortunately, properly implementing such an encoding scheme probably warrants a completely separate project, so, as you say, it is moot wrt. this series. > > James > > > > Isn't the only problem here the current restrictions on the way > > objects need to be combined as a set and the ability to be able to add > > or subtract from that set? > > > > Then again the notion of active vs. inactive might not be sufficient > > to allow for the needed flexibility ... > > > > Ian > > > >
On Tue, Feb 19, 2019 at 10:46 PM James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > On Wed, 2019-02-20 at 11:04 +0800, Ian Kent wrote: > > On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote: > > > On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote: > > > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > > > > > > > > I thought we got agreement years ago that containers don't > > > > > exist in Linux as a single entity: they're currently a > > > > > collection of cgroups and namespaces some of which may and some > > > > > of which may not be local to the entity the orchestration > > > > > system thinks of as a "container". > > > > > > > > I wasn't party to that agreement and don't feel particularly > > > > bound by it. > > > > > > That's not at all relevant, is it? The point is we have widespread > > > uses of namespaces and cgroups that span containers today meaning > > > that a "container id" becomes a problematic concept. What we > > > finally got to with the audit people was an unmodifiable label > > > which the orchestration system can set ... can't you just use that? > > > > Sorry James, I fail to see how assigning an id to a collection of > > objects constitutes a problem or how that could restrict the way a > > container is used. > > Rather than rehash the whole argument again, what's the reason you > can't use the audit label? It seems to do what you want in a way that > doesn't cause problems. If you can just use it there's little point > arguing over what is effectively a moot issue. Ignoring for a moment whether or not the audit container ID is applicable here, one of the things I've been focused on with the audit container ID work is trying to make it difficult for other subsystems to use it. 
I've taken this stance not because I don't think something like a container ID would be useful outside the audit subsystem, but rather because I'm afraid of how it might be abused by other subsystems and that abuse might threaten the existence of the audit container ID. If there is a willingness to implement a general kernel container ID that behaves similarly to how the audit container ID is envisioned, I'd much rather do that than implement something which is audit-specific.
On Wed, Feb 20, 2019 at 10:46:24AM +0800, Ian Kent wrote: > On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote: > > Implement a kernel container object such that it contains the following > > things: > > > > (1) Namespaces. > > > > (2) A root directory. > > > > (3) A set of processes, including one designated as the 'init' process. > > Yeah, I think a name other than init needs to be used for this > process. > > The problem being that there is no requirement for container > process 1 to behave in any way like an "init" process is > expected to behave and that leads to confusion (at least > it certainly did for me). If you look at the documentation for pid_namespaces(7) you can see that the pid 1 inside a pid namespace is expected to behave like an init process: - "The first process created in a new namespace [...] has the PID 1, and is the "init" process for the namespace (see init(1))." - "[...] child process that is orphaned within the namespace will be reparented to this process rather than init(1) [...]" - "If the "init" process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal. This behavior reflects the fact that the "init" process is essential for the correct operation of a PID namespace." - "Only signals for which the "init" process has established a signal handler can be sent to the "init" process by other members of the PID namespace." - "[...] the reboot(2) system call causes a signal to be sent to the namespace "init" process." This is one of the reasons why all major current container runtimes, after years of failing to realize this, finally run a stub init process that mimics a dumb init. Sure, you can get away with not having an init that behaves like an init, but this is inherently broken or at least against the way pid namespaces were designed.
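The core job of the "stub init" mentioned above can be sketched in a few lines of C: start the real workload as a child, then keep reaping children until none are left. This is a minimal sketch under stated assumptions, not any runtime's actual implementation; stub_init() and exit_with_7() are hypothetical names, orphans are only reparented to the caller if it really is PID 1 of a PID namespace (or a child subreaper), and a real stub init would also forward signals to the workload.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical example workload: exits with status 7. */
int exit_with_7(void)
{
    return 7;
}

/*
 * Hypothetical stub init: run `payload` in a child process, then reap
 * every child that terminates (including any orphans reparented to us)
 * until there are none left.  Returns the payload's exit status, or -1
 * if it could not be started or did not exit normally.
 */
int stub_init(int (*payload)(void))
{
    int main_status = -1;
    pid_t main_child = fork();

    if (main_child < 0)
        return -1;
    if (main_child == 0)
        _exit(payload());       /* child: run the workload and exit */

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, 0);

        if (pid < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            break;              /* ECHILD: no children remain */
        }
        if (pid == main_child && WIFEXITED(status))
            main_status = WEXITSTATUS(status);
    }
    return main_status;
}
```

Calling stub_init(exit_with_7) from a process with no other children forks the workload, waits for it, and returns its exit status once the reaping loop runs out of children.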
On Tue, 2019-02-19 at 23:03 +0000, David Howells wrote: > Trond Myklebust <trondmy@hammerspace.com> wrote: > > > Do we really need a new system call to set up containers? That > > would > > force changes to all existing orchestration software. > > No, it wouldn't. Nothing in my patches forces existing orchestration > software > to change, unless it wants to use the new facilities - then it would > have to > be changed anyway, right? I will grant, though, that the extent of > the change > might vary. Right. It depends on what you want the orchestrator to do. If you want it to manage authenticated storage for you, then I grant that you may need to change the existing orchestrator. However, if you just want the containerised software to be able to manage AFS/CIFS/... keys for its own processes, then it's not obvious to me why you would need a new orchestrator. > > Given that the main thing we want to achieve is to direct messages > > from > > the kernel to an appropriate handler, why not focus on adding > > functionality to do just that? > > Because it's *not* just that that is added here. There are a number > of things > this patchset (and one it depends on) provides: > > (1) The ability to intercept request_key() upcalls that happen > inside a > container, filtered by operative namespace. The requirement that you need to filter derives from the fact that the kernel is being forced to run an untrusted executable in user space. That may be acceptable when running in an uncontainerised environment, where the executable can be vetted by the sysadmin, but it clearly isn't in an environment where containers can be set up by untrusted users. If we replace the executable with a daemon that is started from inside the container (presumably by the init process running there), then there should be no requirement for the orchestrator to filter. 
> (2) The ability to provide a per-container keyring that can hold > keys that > can be used inside the container without any action on behalf of > the > denizens of the container. Keyrings already define some inheritance semantics for child processes. Why can't we tweak those semantics to do what is needed? IOW: instead of adding a container syscall and a new keyring type, why can't we just define the required keyring type and let it be inherited through the existing clone() mechanism? > (3) The ability to grant permissions to a *container* as a subject, > allowing > it and its denizens to use, but not necessarily read, modify, > link or > invalidate a key. Again, this sounds like a child process keyring inheritance issue. Right now, the session keyring does not appear to match the semantics that you describe, but why couldn't we set up a new keyring type that can provide them? > (4) The ability to create superblocks inside a container with a > separate > mount namespace from outside, such that they can use the > container keys, > thereby allowing the root of a container to be on an > authenticated > filesystem. > I'm not sure that I understand the premise. If the orchestrator is setting up and managing that authenticated root filesystem, then why do the containerised processes need to be involved at all? If, OTOH, the intention is to allow the containerised processes to manage the filesystems without knowledge of the keyring contents, then again isn't that really the same issue as (3)? > > Is there any reason why a syscall to allow an appropriately > > privileged > > process to add a keyring-specific message queue to its own > > user_namespace and obtain a file descriptor to that message queue > > might > > not work? > > Yes. That forces the use of a new user_namespace for every container > in which > you want to use any of the above features. The user_namespace is > already way > too big and intrusive a hammer as it is. No. 
I would need a user_namespace if I want to allow child processes to handle request upcalls. Is that unreasonable? > > With such an implementation, the fallback mechanism could be to > > walk > > back up the hierarchy of user_namespaces until a message queue is > > found, and to invoke the existing request_key mechanism if not. > > That's definitely wrong. /sbin/request-key should *not* be spawned > if the key > to be instantiated is not in all the init namespaces. > > I went with a container object with namespaces for a reason: > initially, it was > so that the upcall could take place inside of the container's > namespaces, but > now it's do that any request that doesn't match the namespaces on the > container gets rejected at the boundary - so that some daemon up the > chain > doesn't try servicing a request for which it can't access the config > data or > would end up talking out of the wrong NIC. > > I can drop the container object part of it for the moment. > > I could instead create 1-3 new namespaces: > > (1) A namespace with an upcall-interception point. > > (2) A namespace with a container keyring. > > (3) A namespace with a subject ID for use in key ACLs. > > I think I should also consider adding: > > (4) A namespace with keyring names in it. I'm leaning towards this > not being > part of user_namespace because these probably should not be > visible > between containers. > > David
On Wed, 2019-02-20 at 14:26 +0100, Christian Brauner wrote:
> On Wed, Feb 20, 2019 at 10:46:24AM +0800, Ian Kent wrote:
> > On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> > > Implement a kernel container object such that it contains the
> > > following things:
> > >
> > >  (1) Namespaces.
> > >
> > >  (2) A root directory.
> > >
> > >  (3) A set of processes, including one designated as the 'init'
> > >      process.
> >
> > Yeah, I think a name other than init needs to be used for this
> > process.
> >
> > The problem being that there is no requirement for container
> > process 1 to behave in any way like an "init" process is
> > expected to behave and that leads to confusion (at least
> > it certainly did for me).
>
> If you look at the documentation for pid namespaces(7) you can see that
> the pid 1 inside a pid namespace is expected to behave like an init
> process:
> - "The first process created in a new namespace [...] has the PID 1,
>   and is the "init" process for the namespace (see init(1))."
> - "[...] child process that is orphaned within the namespace will be
>   reparented to this process rather than init(1) [...]"
> - "If the "init" process of a PID namespace terminates, the kernel
>   terminates all of the processes in the namespace via a SIGKILL
>   signal.  This behavior reflects the fact that the "init" process is
>   essential for the correct operation of a PID namespace."
> - "Only signals for which the "init" process has established a signal
>   handler can be sent to the "init" process by other members of the
>   PID namespace."
> - "[...] the reboot(2) system call causes a signal to be sent to the
>   namespace "init" process."
>
> This is one of the reasons why all major current container runtimes
> finally after years of failing to realize this run a stub init process
> that mimicks a dumb init. Sure, you get away with not having an init
> that behaves like an init but this is inherently broken or at least
> against the way pid namespaces were designed.

TBH I wasn't sure why the signal I sent didn't arrive; AFAICS it should
have regardless of what signals the container init process was
accepting.

But it could have been due to a different problem in my kernel code
(that's very likely).

In any case it wasn't worth pursuing because even if I did work it out
I had already found that the request_key sub-system wasn't playing
well with others when trying to run something within a container's
namespaces, so no point in going further ...

Ian
==================
FUTURE DEVELOPMENT
==================

 (1) Setting up the container.

     A container would be created with, say:

        int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);

     Once created, it should then be possible for the supervising process
     to modify the new container.  Mounts can be created inside of the
     container's namespaces:

        fsfd = fsopen("ext4", 0);
        fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
        fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
        fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
        fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
        mfd = fsmount(fsfd, 0, 0);

     and then mounted into the namespace:

        move_mount(mfd, "", cfd, "/",
                   MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_CONTAINER_ROOT);

     Further mounts can be added by:

        move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);

     Files and devices can be created by supplying the container fd as the
     dirfd argument:

        mkdirat(int cfd, const char *path, mode_t mode);
        mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
        int fd = openat(int cfd, const char *path,
                        unsigned int flags, mode_t mode);

     [*] Note that when using cfd as dirfd, the path must not contain a '/'
         at the front.

     Sockets, such as netlink, can be opened inside of the container's
     namespaces:

        int fd = container_socket(int cfd, int domain, int type,
                                  int protocol);

     This should allow management of the container's network namespace
     from outside.

 (2) Starting the container.

     Once all modifications are complete, the container's 'init' process
     can be started by:

        fork_into_container(int cfd);

     This precludes further external modification of the mount tree within
     the container.  Before this point, the container is simply destroyed
     if the container fd is closed.

 (3) Waiting for the container to complete.

     The container fd can then be polled to wait for the init process
     therein to complete and the exit code collected by:

        container_wait(int container_fd, int *_wstatus, unsigned int wait,
                       struct rusage *rusage);

     The container and everything in it can be terminated or killed off:

        container_kill(int container_fd, int initonly, int signal);

     If 'init' dies, all other processes in the container are preemptively
     SIGKILL'd by the kernel.

     By default, if the container is active and its fd is closed, the
     container is left running and will be cleaned up when its 'init'
     exits.  The default can be changed with the CONTAINER_KILL_ON_CLOSE
     flag.

 (4) Supervising the container.

     Given that we have an fd attached to the container, we could make it
     such that the supervising process could monitor and override EPERM
     returns for mount and other privileged operations within the
     container.

 (5) Per-container keyring.

     Each container can point to a per-container keyring for the holding of
     integrity keys and filesystem keys for use inside the container.  This
     would be attached:

        keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring)

     This keyring would be searched by request_key() after it has searched
     the thread, process and session keyrings.

 (6) Running different LSM policies by container.

     This might particularly make sense with something like AppArmor, where
     the path-based rules required inside a container might differ from
     those used in the parent.
Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1
 arch/x86/entry/syscalls/syscall_64.tbl |    1
 fs/namespace.c                         |    5
 include/linux/container.h              |   86 ++++++++
 include/linux/init_task.h              |    1
 include/linux/lsm_hooks.h              |   20 ++
 include/linux/sched.h                  |    3
 include/linux/security.h               |   15 +
 include/linux/syscalls.h               |    3
 include/uapi/linux/container.h         |   28 +++
 init/Kconfig                           |    7 +
 init/init_task.c                       |    3
 kernel/Makefile                        |    2
 kernel/container.c                     |  348 ++++++++++++++++++++++++++++++++
 kernel/exit.c                          |    1
 kernel/fork.c                          |    7 +
 kernel/namespaces.h                    |   15 +
 kernel/nsproxy.c                       |   23 +-
 kernel/sys_ni.c                        |    3
 security/security.c                    |   12 +
 20 files changed, 571 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/container.h
 create mode 100644 include/uapi/linux/container.h
 create mode 100644 kernel/container.c
 create mode 100644 kernel/namespaces.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c9db9d51a7df..3564814a5d21 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -407,3 +407,4 @@
 393	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
 394	i386	mount_notify		sys_mount_notify		__ia32_sys_mount_notify
 395	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
+396	i386	container_create	sys_container_create		__ia32_sys_container_create
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 17869bf7788a..aa6cccbe5271 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -352,6 +352,7 @@
 341	common	fsinfo			__x64_sys_fsinfo
 342	common	mount_notify		__x64_sys_mount_notify
 343	common	sb_notify		__x64_sys_sb_notify
+344	common	container_create	__x64_sys_container_create

 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index f378cfc63043..ea005f55ec4c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -30,6 +30,7 @@
 #include <uapi/linux/mount.h>
 #include <linux/fs_context.h>
 #include <linux/fsinfo.h>
+#include <linux/container.h>

 #include "pnode.h"
 #include "internal.h"
@@ -3742,6 +3743,10 @@ static void __init init_mount_tree(void)

 	set_fs_pwd(current->fs, &root);
 	set_fs_root(current->fs, &root);
+#ifdef CONFIG_CONTAINERS
+	path_get(&root);
+	init_container.root = root;
+#endif
 }

 void __init mnt_init(void)
diff --git a/include/linux/container.h b/include/linux/container.h
new file mode 100644
index 000000000000..0a8918435097
--- /dev/null
+++ b/include/linux/container.h
@@ -0,0 +1,86 @@
+/* Container objects
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_CONTAINER_H
+#define _LINUX_CONTAINER_H
+
+#include <uapi/linux/container.h>
+#include <linux/refcount.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/path.h>
+#include <linux/seqlock.h>
+
+struct fs_struct;
+struct nsproxy;
+struct task_struct;
+
+/*
+ * The container object.
+ */
+struct container {
+	char			name[24];
+	u64			id;		/* Container ID */
+	refcount_t		usage;
+	int			exit_code;	/* The exit code of 'init' */
+	const struct cred	*cred;		/* Creds for this container, including userns */
+	struct nsproxy		*ns;		/* This container's namespaces */
+	struct path		root;		/* The root of the container's fs namespace */
+	struct task_struct	*init;		/* The 'init' task for this container */
+	struct container	*parent;	/* Parent of this container. */
+	void			*security;	/* LSM data */
+	struct list_head	members;	/* Member processes, guarded with ->lock */
+	struct list_head	child_link;	/* Link in parent->children */
+	struct list_head	children;	/* Child containers */
+	wait_queue_head_t	waitq;		/* Someone waiting for init to exit waits here */
+	unsigned long		flags;
+#define CONTAINER_FLAG_INIT_STARTED	0	/* Init is started - certain ops now prohibited */
+#define CONTAINER_FLAG_DEAD		1	/* Init has died */
+#define CONTAINER_FLAG_KILL_ON_CLOSE	2	/* Kill init if container handle closed */
+	spinlock_t		lock;
+	seqcount_t		seq;		/* Track changes in ->root */
+};
+
+extern struct container init_container;
+
+#ifdef CONFIG_CONTAINERS
+extern const struct file_operations container_fops;
+
+extern int copy_container(unsigned long flags, struct task_struct *tsk,
+			  struct container *container);
+extern void exit_container(struct task_struct *tsk);
+extern void put_container(struct container *c);
+
+static inline struct container *get_container(struct container *c)
+{
+	refcount_inc(&c->usage);
+	return c;
+}
+
+static inline bool is_container_file(struct file *file)
+{
+	return file->f_op == &container_fops;
+}
+
+#else
+
+static inline int copy_container(unsigned long flags, struct task_struct *tsk,
+				 struct container *container)
+{ return 0; }
+static inline void exit_container(struct task_struct *tsk) { }
+static inline void put_container(struct container *c) {}
+static inline struct container *get_container(struct container *c) { return NULL; }
+static inline bool is_container_file(struct file *file) { return false; }
+
+#endif /* CONFIG_CONTAINERS */
+
+#endif /* _LINUX_CONTAINER_H */
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index a7083a45a26c..f016cadece24 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -10,6 +10,7 @@
 #include <linux/ipc.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/container.h>
 #include <linux/securebits.h>
 #include <linux/seqlock.h>
 #include <linux/rbtree.h>
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 52d0f3f4c786..0f310d911815 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1460,6 +1460,16 @@
  * @bpf_prog_free_security:
  *	Clean up the security information stored inside bpf prog.
  *
+ * Security hooks for containers:
+ *
+ * @container_alloc:
+ *	Permit creation of a new container and assign security data.
+ *	@container: The new container.
+ *
+ * @container_free:
+ *	Free security data attached to a container.
+ *	@container: The container.
+ *
  */
 union security_list_options {
 	int (*binder_set_context_mgr)(struct task_struct *mgr);
@@ -1825,6 +1835,12 @@ union security_list_options {
 	int (*bpf_prog_alloc_security)(struct bpf_prog_aux *aux);
 	void (*bpf_prog_free_security)(struct bpf_prog_aux *aux);
 #endif /* CONFIG_BPF_SYSCALL */
+
+	/* Container management security hooks */
+#ifdef CONFIG_CONTAINERS
+	int (*container_alloc)(struct container *container, unsigned int flags);
+	void (*container_free)(struct container *container);
+#endif
 };

 struct security_hook_heads {
@@ -2069,6 +2085,10 @@ struct security_hook_heads {
 	struct hlist_head bpf_prog_alloc_security;
 	struct hlist_head bpf_prog_free_security;
 #endif /* CONFIG_BPF_SYSCALL */
+#ifdef CONFIG_CONTAINERS
+	struct hlist_head container_alloc;
+	struct hlist_head container_free;
+#endif /* CONFIG_CONTAINERS */
 } __randomize_layout;

 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2f90fa92468..073a3a930514 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -36,6 +36,7 @@ struct backing_dev_info;
 struct bio_list;
 struct blk_plug;
 struct cfs_rq;
+struct container;
 struct fs_struct;
 struct futex_pi_state;
 struct io_context;
@@ -870,6 +871,8 @@ struct task_struct {

 	/* Namespaces: */
 	struct nsproxy			*nsproxy;
+	struct container		*container;
+	struct list_head		container_link;

 	/* Signal handlers: */
 	struct signal_struct		*signal;
diff --git a/include/linux/security.h b/include/linux/security.h
index da538c06766f..acd0c14c6e95 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -70,6 +70,7 @@ struct ctl_table;
 struct audit_krule;
 struct user_namespace;
 struct timezone;
+struct container;

 enum lsm_event {
 	LSM_POLICY_CHANGE,
@@ -1751,6 +1752,20 @@ static inline void security_audit_rule_free(void *lsmrule)
 #endif /* CONFIG_SECURITY */
 #endif /* CONFIG_AUDIT */

+#ifdef CONFIG_CONTAINERS
+#ifdef CONFIG_SECURITY
+int security_container_alloc(struct container *container, unsigned int flags);
+void security_container_free(struct container *container);
+#else
+static inline int security_container_alloc(struct container *container,
+					   unsigned int flags)
+{
+	return 0;
+}
+static inline void security_container_free(struct container *container) {}
+#endif
+#endif /* CONFIG_CONTAINERS */
+
 #ifdef CONFIG_SECURITYFS

 extern struct dentry *securityfs_create_file(const char *name, umode_t mode,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 10127b1d923b..dac42098c2dd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -943,6 +943,9 @@ asmlinkage long sys_mount_notify(int dfd, const char __user *path,
 				 unsigned int at_flags, int watch_fd, int watch_id);
 asmlinkage long sys_sb_notify(int dfd, const char __user *path,
 			      unsigned int at_flags, int watch_fd, int watch_id);
+asmlinkage long sys_container_create(const char __user *name, unsigned int flags,
+				     unsigned long spare3, unsigned long spare4,
+				     unsigned long spare5);

 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/container.h b/include/uapi/linux/container.h
new file mode 100644
index 000000000000..43748099b28d
--- /dev/null
+++ b/include/uapi/linux/container.h
@@ -0,0 +1,28 @@
+/* Container UAPI
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _UAPI_LINUX_CONTAINER_H
+#define _UAPI_LINUX_CONTAINER_H
+
+
+#define CONTAINER_NEW_FS_NS		0x00000001 /* Dup current fs namespace */
+#define CONTAINER_NEW_EMPTY_FS_NS	0x00000002 /* Provide new empty fs namespace */
+#define CONTAINER_NEW_CGROUP_NS		0x00000004 /* Dup current cgroup namespace */
+#define CONTAINER_NEW_UTS_NS		0x00000008 /* Dup current uts namespace */
+#define CONTAINER_NEW_IPC_NS		0x00000010 /* Dup current ipc namespace */
+#define CONTAINER_NEW_USER_NS		0x00000020 /* Dup current user namespace */
+#define CONTAINER_NEW_PID_NS		0x00000040 /* Dup current pid namespace */
+#define CONTAINER_NEW_NET_NS		0x00000080 /* Dup current net namespace */
+#define CONTAINER_KILL_ON_CLOSE		0x00000100 /* Kill all member processes when fd closed */
+#define CONTAINER_FD_CLOEXEC		0x00000200 /* Close the fd on exec */
+#define CONTAINER__FLAG_MASK		0x000003ff
+
+#endif /* _UAPI_LINUX_CONTAINER_H */
diff --git a/init/Kconfig b/init/Kconfig
index 5984dd7f2156..ab37c3a55aa1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -992,6 +992,13 @@ config NET_NS
 	  Allow user space to create what appear to be multiple instances
 	  of the network stack.

+config CONTAINERS
+	bool "Container support"
+	default y
+	help
+	  Allow userspace to create and manipulate containers as objects that
+	  have namespaces and hold a set of processes.
+
 endif # NAMESPACES

 config CHECKPOINT_RESTORE
diff --git a/init/init_task.c b/init/init_task.c
index 5aebe3be4d7c..90c7439a195b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -108,6 +108,9 @@ struct task_struct init_task
 	.signal		= &init_signals,
 	.sighand	= &init_sighand,
 	.nsproxy	= &init_nsproxy,
+	.container	= &init_container,
+	.container_link.next = &init_container.members,
+	.container_link.prev = &init_container.members,
 	.pending	= {
 		.list = LIST_HEAD_INIT(init_task.pending.list),
 		.signal = {{0}}
diff --git a/kernel/Makefile b/kernel/Makefile
index 6aa7543bcdb2..98cdd18cecef 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -8,7 +8,7 @@ obj-y = fork.o exec_domain.o panic.o \
 	    sysctl.o sysctl_binary.o capability.o ptrace.o user.o \
 	    signal.o sys.o umh.o workqueue.o pid.o task_work.o \
 	    extable.o params.o \
-	    kthread.o sys_ni.o nsproxy.o \
+	    kthread.o sys_ni.o nsproxy.o container.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
 	    async.o range.o smpboot.o ucount.o

diff --git a/kernel/container.c b/kernel/container.c
new file mode 100644
index 000000000000..ca4012632cfa
--- /dev/null
+++ b/kernel/container.c
@@ -0,0 +1,348 @@
+/* Implement container objects.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <linux/init_task.h>
+#include <linux/fs.h>
+#include <linux/fs_struct.h>
+#include <linux/anon_inodes.h>
+#include <linux/container.h>
+#include <linux/syscalls.h>
+#include <linux/printk.h>
+#include <linux/security.h>
+#include "namespaces.h"
+
+struct container init_container = {
+	.name		= ".init",
+	.id		= 1,
+	.usage		= REFCOUNT_INIT(2),
+	.cred		= &init_cred,
+	.ns		= &init_nsproxy,
+	.init		= &init_task,
+	.members.next	= &init_task.container_link,
+	.members.prev	= &init_task.container_link,
+	.children	= LIST_HEAD_INIT(init_container.children),
+	.flags		= (1 << CONTAINER_FLAG_INIT_STARTED),
+	.lock		= __SPIN_LOCK_UNLOCKED(init_container.lock),
+	.seq		= SEQCNT_ZERO(init_fs.seq),
+};
+
+#ifdef CONFIG_CONTAINERS
+
+static atomic64_t container_id_counter = ATOMIC_INIT(1);
+
+/*
+ * Drop a ref on a container and clear it if no longer in use.
+ */
+void put_container(struct container *c)
+{
+	struct container *parent;
+
+	while (c && refcount_dec_and_test(&c->usage)) {
+		BUG_ON(!list_empty(&c->members));
+		if (c->ns)
+			put_nsproxy(c->ns);
+		path_put(&c->root);
+
+		parent = c->parent;
+		if (parent) {
+			spin_lock(&parent->lock);
+			list_del(&c->child_link);
+			spin_unlock(&parent->lock);
+		}
+
+		if (c->cred)
+			put_cred(c->cred);
+		security_container_free(c);
+		kfree(c);
+		c = parent;
+	}
+}
+
+/*
+ * Allow the user to poll for the container dying.
+ */
+static unsigned int container_poll(struct file *file, poll_table *wait)
+{
+	struct container *container = file->private_data;
+	unsigned int mask = 0;
+
+	poll_wait(file, &container->waitq, wait);
+
+	if (test_bit(CONTAINER_FLAG_DEAD, &container->flags))
+		mask |= POLLHUP;
+
+	return mask;
+}
+
+static int container_release(struct inode *inode, struct file *file)
+{
+	struct container *container = file->private_data;
+
+	put_container(container);
+	return 0;
+}
+
+const struct file_operations container_fops = {
+	.poll		= container_poll,
+	.release	= container_release,
+};
+
+/*
+ * Handle fork/clone.
+ *
+ * A process inherits its parent's container.  The first process into the
+ * container is its 'init' process and the life of everything else in there is
+ * dependent upon that.
+ */
+int copy_container(unsigned long flags, struct task_struct *tsk,
+		   struct container *container)
+{
+	struct container *c = container ?: tsk->container;
+	int ret = -ECANCELED;
+
+	spin_lock(&c->lock);
+
+	if (!test_bit(CONTAINER_FLAG_DEAD, &c->flags)) {
+		list_add_tail(&tsk->container_link, &c->members);
+		get_container(c);
+		tsk->container = c;
+		if (!c->init) {
+			set_bit(CONTAINER_FLAG_INIT_STARTED, &c->flags);
+			c->init = tsk;
+		}
+		ret = 0;
+	}
+
+	spin_unlock(&c->lock);
+	return ret;
+}
+
+/*
+ * Remove a dead process from a container.
+ *
+ * If the 'init' process in a container dies, we kill off all the other
+ * processes in the container.
+ */
+void exit_container(struct task_struct *tsk)
+{
+	struct task_struct *p;
+	struct container *c = tsk->container;
+	struct kernel_siginfo si = {
+		.si_signo = SIGKILL,
+		.si_code  = SI_KERNEL,
+	};
+
+	spin_lock(&c->lock);
+
+	list_del(&tsk->container_link);
+
+	if (c->init == tsk) {
+		c->init = NULL;
+		c->exit_code = tsk->exit_code;
+		smp_wmb(); /* Order exit_code vs CONTAINER_DEAD. */
+		set_bit(CONTAINER_FLAG_DEAD, &c->flags);
+		wake_up_bit(&c->flags, CONTAINER_FLAG_DEAD);
+
+		list_for_each_entry(p, &c->members, container_link) {
+			si.si_pid = task_tgid_vnr(p);
+			send_sig_info(SIGKILL, &si, p);
+		}
+	}
+
+	spin_unlock(&c->lock);
+	put_container(c);
+}
+
+/*
+ * Allocate a container.
+ */
+static struct container *alloc_container(const char __user *name)
+{
+	struct container *c;
+	long len;
+	int ret;
+
+	c = kzalloc(sizeof(struct container), GFP_KERNEL);
+	if (!c)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&c->members);
+	INIT_LIST_HEAD(&c->children);
+	init_waitqueue_head(&c->waitq);
+	spin_lock_init(&c->lock);
+	refcount_set(&c->usage, 1);
+
+	ret = -EFAULT;
+	len = strncpy_from_user(c->name, name, sizeof(c->name));
+	if (len < 0)
+		goto err;
+	ret = -ENAMETOOLONG;
+	if (len >= sizeof(c->name))
+		goto err;
+	ret = -EINVAL;
+	if (strchr(c->name, '/'))
+		goto err;
+
+	c->name[len] = 0;
+	return c;
+
+err:
+	kfree(c);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Create some creds for the container.  We don't want to pin things we don't
+ * have to, so drop all keyrings from the new cred.  The LSM gets to audit the
+ * cred struct when security_container_alloc() is invoked.
+ */
+static const struct cred *create_container_creds(unsigned int flags)
+{
+	struct cred *new;
+	int ret;
+
+	new = prepare_creds();
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+#ifdef CONFIG_KEYS
+	key_put(new->thread_keyring);
+	new->thread_keyring = NULL;
+	key_put(new->process_keyring);
+	new->process_keyring = NULL;
+	key_put(new->session_keyring);
+	new->session_keyring = NULL;
+	key_put(new->request_key_auth);
+	new->request_key_auth = NULL;
+#endif
+
+	if (flags & CONTAINER_NEW_USER_NS) {
+		ret = create_user_ns(new);
+		if (ret < 0)
+			goto err;
+		new->euid = new->user_ns->owner;
+		new->egid = new->user_ns->group;
+	}
+
+	new->fsuid = new->suid = new->uid = new->euid;
+	new->fsgid = new->sgid = new->gid = new->egid;
+	return new;
+
+err:
+	abort_creds(new);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Create a new container.
+ */
+static struct container *create_container(const char __user *name, unsigned int flags)
+{
+	struct container *parent, *c;
+	struct fs_struct *fs;
+	struct nsproxy *ns;
+	const struct cred *cred;
+	int ret;
+
+	c = alloc_container(name);
+	if (IS_ERR(c))
+		return c;
+
+	if (flags & CONTAINER_KILL_ON_CLOSE)
+		__set_bit(CONTAINER_FLAG_KILL_ON_CLOSE, &c->flags);
+
+	cred = create_container_creds(flags);
+	if (IS_ERR(cred)) {
+		ret = PTR_ERR(cred);
+		goto err_cont;
+	}
+	c->cred = cred;
+
+	ret = -ENOMEM;
+	fs = copy_fs_struct(current->fs);
+	if (!fs)
+		goto err_cont;
+
+	ns = create_new_namespaces(
+		(flags & CONTAINER_NEW_FS_NS	 ? CLONE_NEWNS : 0) |
+		(flags & CONTAINER_NEW_CGROUP_NS ? CLONE_NEWCGROUP : 0) |
+		(flags & CONTAINER_NEW_UTS_NS	 ? CLONE_NEWUTS : 0) |
+		(flags & CONTAINER_NEW_IPC_NS	 ? CLONE_NEWIPC : 0) |
+		(flags & CONTAINER_NEW_PID_NS	 ? CLONE_NEWPID : 0) |
+		(flags & CONTAINER_NEW_NET_NS	 ? CLONE_NEWNET : 0),
+		current->nsproxy, cred->user_ns, fs);
+	if (IS_ERR(ns)) {
+		ret = PTR_ERR(ns);
+		goto err_fs;
+	}
+
+	c->ns = ns;
+	c->root = fs->root;
+	c->seq = fs->seq;
+	fs->root.mnt = NULL;
+	fs->root.dentry = NULL;
+
+	ret = security_container_alloc(c, flags);
+	if (ret < 0)
+		goto err_fs;
+
+	parent = current->container;
+	get_container(parent);
+	c->parent = parent;
+	c->id = atomic64_inc_return(&container_id_counter);
+	spin_lock(&parent->lock);
+	list_add_tail(&c->child_link, &parent->children);
+	spin_unlock(&parent->lock);
+	return c;
+
+err_fs:
+	free_fs_struct(fs);
+err_cont:
+	put_container(c);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Create a new container object.
+ */
+SYSCALL_DEFINE5(container_create,
+		const char __user *, name,
+		unsigned int, flags,
+		unsigned long, spare3,
+		unsigned long, spare4,
+		unsigned long, spare5)
+{
+	struct container *c;
+	int fd;
+
+	if (!name ||
+	    flags & ~CONTAINER__FLAG_MASK ||
+	    spare3 != 0 || spare4 != 0 || spare5 != 0)
+		return -EINVAL;
+	if ((flags & (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS)) ==
+	    (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS))
+		return -EINVAL;
+
+	c = create_container(name, flags);
+	if (IS_ERR(c))
+		return PTR_ERR(c);
+
+	fd = anon_inode_getfd("container", &container_fops, c,
+			      O_RDWR | (flags & CONTAINER_FD_CLOEXEC ? O_CLOEXEC : 0));
+	if (fd < 0)
+		put_container(c);
+	return fd;
+}
+
+#endif /* CONFIG_CONTAINERS */
diff --git a/kernel/exit.c b/kernel/exit.c
index 284f2fe9a293..78f6065ad799 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -864,6 +864,7 @@ void __noreturn do_exit(long code)
 	if (group_dead)
 		disassociate_ctty(1);
 	exit_task_namespaces(tsk);
+	exit_container(tsk);
 	exit_task_work(tsk);
 	exit_thread(tsk);
 	exit_umh(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index b69248e6f0e0..009cf7e63894 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1920,9 +1920,12 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_namespaces(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_mm;
-	retval = copy_io(clone_flags, p);
+	retval = copy_container(clone_flags, p, NULL);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
+	retval = copy_io(clone_flags, p);
+	if (retval)
+		goto bad_fork_cleanup_container;
 	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
@@ -2121,6 +2124,8 @@ static __latent_entropy struct task_struct *copy_process(
 bad_fork_cleanup_io:
 	if (p->io_context)
 		exit_io_context(p);
+bad_fork_cleanup_container:
+	exit_container(p);
 bad_fork_cleanup_namespaces:
 	exit_task_namespaces(p);
 bad_fork_cleanup_mm:
diff --git a/kernel/namespaces.h b/kernel/namespaces.h
new file mode 100644
index 000000000000..c44e3cf0e254
--- /dev/null
+++ b/kernel/namespaces.h
@@ -0,0 +1,15 @@
+/* Local namespaces defs
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+extern struct nsproxy *create_new_namespaces(unsigned long flags,
+					      struct nsproxy *nsproxy,
+					      struct user_namespace *user_ns,
+					      struct fs_struct *new_fs);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f6c5d330059a..4bb5184b3a80 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -27,6 +27,7 @@
 #include <linux/syscalls.h>
 #include <linux/cgroup.h>
 #include <linux/perf_event.h>
+#include "namespaces.h"

 static struct kmem_cache *nsproxy_cachep;

@@ -61,8 +62,8 @@ static inline struct nsproxy *create_nsproxy(void)
  * Return the newly created nsproxy. Do not attach this to the task,
  * leave it to the caller to do proper locking and attach it to task.
  */
-static struct nsproxy *create_new_namespaces(unsigned long flags,
-	struct task_struct *tsk, struct user_namespace *user_ns,
+struct nsproxy *create_new_namespaces(unsigned long flags,
+	struct nsproxy *nsproxy, struct user_namespace *user_ns,
 	struct fs_struct *new_fs)
 {
 	struct nsproxy *new_nsp;
@@ -72,39 +73,39 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	if (!new_nsp)
 		return ERR_PTR(-ENOMEM);

-	new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
+	new_nsp->mnt_ns = copy_mnt_ns(flags, nsproxy->mnt_ns, user_ns, new_fs);
 	if (IS_ERR(new_nsp->mnt_ns)) {
 		err = PTR_ERR(new_nsp->mnt_ns);
 		goto out_ns;
 	}

-	new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
+	new_nsp->uts_ns = copy_utsname(flags, user_ns, nsproxy->uts_ns);
 	if (IS_ERR(new_nsp->uts_ns)) {
 		err = PTR_ERR(new_nsp->uts_ns);
 		goto out_uts;
 	}

-	new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
+	new_nsp->ipc_ns = copy_ipcs(flags, user_ns, nsproxy->ipc_ns);
 	if (IS_ERR(new_nsp->ipc_ns)) {
 		err = PTR_ERR(new_nsp->ipc_ns);
 		goto out_ipc;
 	}

 	new_nsp->pid_ns_for_children =
-		copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children);
+		copy_pid_ns(flags, user_ns, nsproxy->pid_ns_for_children);
 	if (IS_ERR(new_nsp->pid_ns_for_children)) {
 		err = PTR_ERR(new_nsp->pid_ns_for_children);
 		goto out_pid;
 	}

 	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
-					    tsk->nsproxy->cgroup_ns);
+					    nsproxy->cgroup_ns);
 	if (IS_ERR(new_nsp->cgroup_ns)) {
 		err = PTR_ERR(new_nsp->cgroup_ns);
 		goto out_cgroup;
 	}

-	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
+	new_nsp->net_ns = copy_net_ns(flags, user_ns, nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
 		goto out_net;
@@ -162,7 +163,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 		(CLONE_NEWIPC | CLONE_SYSVSEM))
 		return -EINVAL;

-	new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
+	new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, tsk->fs);
 	if (IS_ERR(new_ns))
 		return PTR_ERR(new_ns);

@@ -203,7 +204,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
 		return -EPERM;

-	*new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
+	*new_nsp = create_new_namespaces(unshare_flags, current->nsproxy, user_ns,
 					 new_fs ? new_fs : current->fs);
 	if (IS_ERR(*new_nsp)) {
 		err = PTR_ERR(*new_nsp);
@@ -251,7 +252,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
 	if (nstype && (ns->ops->type != nstype))
 		goto out;

-	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
+	new_nsproxy = create_new_namespaces(0, tsk->nsproxy, current_user_ns(), tsk->fs);
 	if (IS_ERR(new_nsproxy)) {
 		err = PTR_ERR(new_nsproxy);
 		goto out;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a4e7131b2509..f0455cbb91cf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -136,6 +136,9 @@ COND_SYSCALL(acct);
 COND_SYSCALL(capget);
 COND_SYSCALL(capset);

+/* kernel/container.c */
+COND_SYSCALL(container_create);
+
 /* kernel/exec_domain.c */

 /* kernel/exit.c */
diff --git a/security/security.c b/security/security.c
index b49732c02e21..259be9a1746c 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1864,3 +1864,15 @@ void security_bpf_prog_free(struct bpf_prog_aux *aux)
 	call_void_hook(bpf_prog_free_security, aux);
 }
 #endif /* CONFIG_BPF_SYSCALL */
+
+#ifdef CONFIG_CONTAINERS
+int security_container_alloc(struct container *container, unsigned int flags)
+{
+	return call_int_hook(container_alloc, 0, container, flags);
+}
+
+void security_container_free(struct container *container)
+{
+	call_void_hook(container_free, container);
+}
+#endif /* CONFIG_CONTAINERS */