Message ID | 20240516092213.6799-2-jcalmels@3xx0.net (mailing list archive) |
---|---|
State | New, archived |
Series | Introduce user namespace capabilities |
Some quick remarks, no bandwidth to fully understand this.

First of all, the short summary does not define any action, so it should
rather be e.g.

"capabilities: Add userns capabilities"

Much more understandable.

On Thu May 16, 2024 at 12:22 PM EEST, Jonathan Calmels wrote:
> Attackers often rely on user namespaces to get elevated (yet confined)
> privileges in order to target specific subsystems (e.g. [1]). Distributions
> have been pretty adamant that they need a way to configure these, most of
> them carry out-of-tree patches to do so, or plainly refuse to enable them.
> As a result, there have been multiple efforts over the years to introduce
> various knobs to control and/or disable user namespaces (e.g. [2][3][4]).
>
> While we acknowledge that there are already ways to control the creation of
> such namespaces (the most recent being a LSM hook), there are inherent
> issues with these approaches. Preventing the user namespace creation is not
> fine-grained enough, and in some cases, incompatible with various userspace
> expectations (e.g. container runtimes, browser sandboxing, service
> isolation)
>
> This patch addresses these limitations by introducing an additional
> capability set used to restrict the permissions granted when creating user
> namespaces. This way, processes can apply the principle of least privilege
> by configuring only the capabilities they need for their namespaces.
>
> For compatibility reasons, processes always start with a full userns
> capability set.
>
> On namespace creation, the userns capability set (pU) is assigned to the
> new effective (pE), permitted (pP) and bounding set (X) of the task:
>
>     pU = pE = pP = X
>
> The userns capability set obeys the invariant that no bit can ever be set
> if it is not already part of the task’s bounding set. This ensures that no
> namespace can ever gain more privileges than its predecessors.
> Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
> in the userns set requires its corresponding bit to be set in the permitted
> set. This effectively mimics the inheritable set rules and means that, by
> default, only root in the initial user namespace can gain userns
> capabilities:
>
>     p’U = (pE & CAP_SETPCAP) ? X : (X & pP)
>
> Note that since userns capabilities are strictly hierarchical, policies can
> be enforced at various levels (e.g. init, pam_cap) and inherited by every
> child namespace.
>
> Here is a sample program that can be used to verify the functionality:
>
> /*
>  * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
>  *
>  * ./cap_userns_test unshare -r grep Cap /proc/self/status
>  * CapInh: 0000000000000000
>  * CapPrm: 000001fffffdffff
>  * CapEff: 000001fffffdffff
>  * CapBnd: 000001fffffdffff
>  * CapAmb: 0000000000000000
>  * CapUNs: 000001fffffdffff
>  */
>
> int main(int argc, char *argv[])
> {
> 	if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0)
> 		err(1, "cannot drop userns cap");
>
> 	execvp(argv[1], argv + 1);
> 	err(1, "cannot exec");
> }
>
> Link: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
> Link: https://lore.kernel.org/lkml/1453502345-30416-1-git-send-email-keescook@chromium.org
> Link: https://lore.kernel.org/lkml/20220815162028.926858-1-fred@cloudflare.com
> Link: https://lore.kernel.org/containers/168547265011.24337.4306067683997517082-0@git.sr.ht
>
> Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net>
> ---
>  fs/proc/array.c              |  9 ++++++
>  include/linux/cred.h         |  3 ++
>  include/uapi/linux/prctl.h   |  7 +++++
>  kernel/cred.c                |  3 ++
>  kernel/umh.c                 | 16 ++++++++++
>  kernel/user_namespace.c      | 12 +++-----
>  security/commoncap.c         | 59 ++++++++++++++++++++++++++++++++++++
>  security/keys/process_keys.c |  3 ++
>  8 files changed, 105 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 34a47fb0c57f..364e8bb19f9d 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	const struct cred *cred;
>  	kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
>  			cap_bset, cap_ambient;
> +#ifdef CONFIG_USER_NS
> +	kernel_cap_t cap_userns;
> +#endif
>
>  	rcu_read_lock();
>  	cred = __task_cred(p);
> @@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	cap_effective = cred->cap_effective;
>  	cap_bset = cred->cap_bset;
>  	cap_ambient = cred->cap_ambient;
> +#ifdef CONFIG_USER_NS
> +	cap_userns = cred->cap_userns;
> +#endif
>  	rcu_read_unlock();
>
>  	render_cap_t(m, "CapInh:\t", &cap_inheritable);
> @@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	render_cap_t(m, "CapEff:\t", &cap_effective);
>  	render_cap_t(m, "CapBnd:\t", &cap_bset);
>  	render_cap_t(m, "CapAmb:\t", &cap_ambient);
> +#ifdef CONFIG_USER_NS
> +	render_cap_t(m, "CapUNs:\t", &cap_userns);
> +#endif
>  }
>
>  static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 2976f534a7a3..adab0031443e 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -124,6 +124,9 @@ struct cred {
>  	kernel_cap_t cap_effective; /* caps we can actually use */
>  	kernel_cap_t cap_bset; /* capability bounding set */
>  	kernel_cap_t cap_ambient; /* Ambient capability set */
> +#ifdef CONFIG_USER_NS
> +	kernel_cap_t cap_userns; /* User namespace capability set */
> +#endif
>  #ifdef CONFIG_KEYS
>  	unsigned char jit_keyring; /* default keyring to attach requested
>  				    * keys to */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 370ed14b1ae0..e09475171f62 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -198,6 +198,13 @@ struct prctl_mm_map {
>  # define PR_CAP_AMBIENT_LOWER 3
>  # define PR_CAP_AMBIENT_CLEAR_ALL 4
>
> +/* Control the userns capability set */
> +#define PR_CAP_USERNS 48
> +# define PR_CAP_USERNS_IS_SET 1
> +# define PR_CAP_USERNS_RAISE 2
> +# define PR_CAP_USERNS_LOWER 3
> +# define PR_CAP_USERNS_CLEAR_ALL 4

Kernel coding style does not support this and recommends an enum instead,
but apparently the whole header is not following that, so I guess it is
fine ;-)

> +
>  /* arm64 Scalable Vector Extension controls */
>  /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
>  #define PR_SVE_SET_VL 50 /* set task vector length */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 075cfa7c896f..9912c6f3bc6b 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -56,6 +56,9 @@ struct cred init_cred = {
>  	.cap_permitted = CAP_FULL_SET,
>  	.cap_effective = CAP_FULL_SET,
>  	.cap_bset = CAP_FULL_SET,
> +#ifdef CONFIG_USER_NS
> +	.cap_userns = CAP_FULL_SET,
> +#endif
>  	.user = INIT_USER,
>  	.user_ns = &init_user_ns,
>  	.group_info = &init_groups,
> diff --git a/kernel/umh.c b/kernel/umh.c
> index 1b13c5d34624..51f1e1d25d49 100644
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -32,6 +32,9 @@
>
>  #include <trace/events/module.h>
>
> +#ifdef CONFIG_USER_NS
> +static kernel_cap_t usermodehelper_userns = CAP_FULL_SET;
> +#endif
>  static kernel_cap_t usermodehelper_bset = CAP_FULL_SET;
>  static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET;
>  static DEFINE_SPINLOCK(umh_sysctl_lock);
> @@ -94,6 +97,10 @@ static int call_usermodehelper_exec_async(void *data)
>  	new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset);
>  	new->cap_inheritable = cap_intersect(usermodehelper_inheritable,
>  					     new->cap_inheritable);
> +#ifdef CONFIG_USER_NS
> +	new->cap_userns = cap_intersect(usermodehelper_userns,
> +					new->cap_userns);

Could also be a single line (checkpatch.pl does not complain).

> +#endif
>  	spin_unlock(&umh_sysctl_lock);
>
>  	if (sub_info->init) {
> @@ -560,6 +567,15 @@ static struct ctl_table usermodehelper_table[] = {
>  		.mode = 0600,
>  		.proc_handler = proc_cap_handler,
>  	},
> +#ifdef CONFIG_USER_NS
> +	{
> +		.procname = "userns",
> +		.data = &usermodehelper_userns,
> +		.maxlen = 2 * sizeof(unsigned long),
> +		.mode = 0600,
> +		.proc_handler = proc_cap_handler,
> +	},
> +#endif
>  	{ }
>  };
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 0b0b95418b16..7e624607330b 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts)
>
>  static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>  {
> -	/* Start with the same capabilities as init but useless for doing
> -	 * anything as the capabilities are bound to the new user namespace.
> -	 */
> -	cred->securebits = SECUREBITS_DEFAULT;
> +	/* Start with the capabilities defined in the userns set. */
> +	cred->cap_bset = cred->cap_userns;
> +	cred->cap_permitted = cred->cap_userns;
> +	cred->cap_effective = cred->cap_userns;
>  	cred->cap_inheritable = CAP_EMPTY_SET;
> -	cred->cap_permitted = CAP_FULL_SET;
> -	cred->cap_effective = CAP_FULL_SET;
>  	cred->cap_ambient = CAP_EMPTY_SET;
> -	cred->cap_bset = CAP_FULL_SET;
> +	cred->securebits = SECUREBITS_DEFAULT;
>  #ifdef CONFIG_KEYS
>  	key_put(cred->request_key_auth);
>  	cred->request_key_auth = NULL;
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 162d96b3a676..b3d3372bf910 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -228,6 +228,28 @@ static inline int cap_inh_is_capped(void)
>  	return 1;
>  }
>
> +/*
> + * Determine whether a userns capability can be raised.
> + * Returns 1 if it can, 0 otherwise.
> + */
> +#ifdef CONFIG_USER_NS
> +static inline int cap_uns_is_raiseable(unsigned long cap)
> +{
> +	if (!!cap_raised(current_cred()->cap_userns, cap))
> +		return 1;

Empty line.

> +	/* a capability cannot be raised unless the current task has it in

Incorrectly formatted comment:

/*
 * Foo

That is the correct format.

> +	 * its bounding set and, without CAP_SETPCAP, its permitted set.
> +	 */
> +	if (!cap_raised(current_cred()->cap_bset, cap))
> +		return 0;

Empty line might be appropriate here.

> +	if (cap_capable(current_cred(), current_cred()->user_ns,
> +			CAP_SETPCAP, CAP_OPT_NONE) != 0 &&
> +	    !cap_raised(current_cred()->cap_permitted, cap))

I'd consider being dumb here to make this easier to verify, also in the
future: create two bools and use them for the final comparison. My head
hurts reading that.

> +		return 0;
> +	return 1;
> +}
> +#endif
> +
>  /**
>   * cap_capset - Validate and apply proposed changes to current's capabilities
>   * @new: The proposed new credentials; alterations should be made here
> @@ -1382,6 +1404,43 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
>  		return commit_creds(new);
>  		}
>
> +#ifdef CONFIG_USER_NS
> +	case PR_CAP_USERNS:
> +		if (arg2 == PR_CAP_USERNS_CLEAR_ALL) {
> +			if (arg3 | arg4 | arg5)
> +				return -EINVAL;
> +
> +			new = prepare_creds();
> +			if (!new)
> +				return -ENOMEM;
> +			cap_clear(new->cap_userns);
> +			return commit_creds(new);
> +		}
> +
> +		if (((!cap_valid(arg3)) | arg4 | arg5))
> +			return -EINVAL;
> +
> +		if (arg2 == PR_CAP_USERNS_IS_SET) {
> +			return !!cap_raised(current_cred()->cap_userns, arg3);
> +		} else if (arg2 != PR_CAP_USERNS_RAISE &&
> +			   arg2 != PR_CAP_USERNS_LOWER) {
> +			return -EINVAL;
> +		} else {
> +			if (arg2 == PR_CAP_USERNS_RAISE &&
> +			    !cap_uns_is_raiseable(arg3))
> +				return -EPERM;
> +
> +			new = prepare_creds();
> +			if (!new)
> +				return -ENOMEM;
> +			if (arg2 == PR_CAP_USERNS_RAISE)
> +				cap_raise(new->cap_userns, arg3);
> +			else
> +				cap_lower(new->cap_userns, arg3);
> +			return commit_creds(new);
> +		}
> +#endif
> +
>  	default:
>  		/* No functionality available - continue with default */
>  		return -ENOSYS;
> diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
> index b5d5333ab330..e3670d815435 100644
> --- a/security/keys/process_keys.c
> +++ b/security/keys/process_keys.c
> @@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork)
>  	new->cap_effective = old->cap_effective;
>  	new->cap_ambient = old->cap_ambient;
>  	new->cap_bset = old->cap_bset;
> +#ifdef CONFIG_USER_NS
> +	new->cap_userns = old->cap_userns;
> +#endif
>
>  	new->jit_keyring = old->jit_keyring;
>  	new->thread_keyring = key_get(old->thread_keyring);

BR, Jarkko
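
For illustration, the "two bools" suggestion for cap_uns_is_raiseable() could look roughly like the sketch below. This is a userspace model rather than kernel code: capability sets are plain 64-bit masks, and struct cred_model, cap_is_set(), userns_cap_is_raiseable() and the has_setpcap flag (standing in for the cap_capable(..., CAP_SETPCAP, ...) check) are hypothetical names that do not appear in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the relevant fields of struct cred. */
struct cred_model {
	uint64_t cap_userns;
	uint64_t cap_bset;
	uint64_t cap_permitted;
	bool has_setpcap;	/* models cap_capable(..., CAP_SETPCAP, ...) == 0 */
};

static bool cap_is_set(uint64_t set, unsigned int cap)
{
	return (set >> cap) & 1;
}

/*
 * Same decision as cap_uns_is_raiseable() in the quoted patch, with the
 * combined condition split into two named booleans.
 */
static bool userns_cap_is_raiseable(const struct cred_model *c, unsigned int cap)
{
	bool in_bset = cap_is_set(c->cap_bset, cap);
	bool permitted_or_privileged =
		c->has_setpcap || cap_is_set(c->cap_permitted, cap);

	/* A capability that is already raised needs no further checks. */
	if (cap_is_set(c->cap_userns, cap))
		return true;

	/*
	 * Otherwise it must be in the bounding set and, without
	 * CAP_SETPCAP, in the permitted set as well.
	 */
	return in_bset && permitted_or_privileged;
}
```

Whether the kernel version would keep the early return or fold it into a third boolean is a style choice left open here.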
On 5/16/24 02:22, Jonathan Calmels wrote:
> Attackers often rely on user namespaces to get elevated (yet confined)
> privileges in order to target specific subsystems (e.g. [1]). Distributions
> have been pretty adamant that they need a way to configure these, most of
> them carry out-of-tree patches to do so, or plainly refuse to enable them.
> As a result, there have been multiple efforts over the years to introduce
> various knobs to control and/or disable user namespaces (e.g. [2][3][4]).
>
> While we acknowledge that there are already ways to control the creation of
> such namespaces (the most recent being a LSM hook), there are inherent
> issues with these approaches. Preventing the user namespace creation is not
> fine-grained enough, and in some cases, incompatible with various userspace

Agreed, though it really is application dependent. Some applications handle
the denial at userns creation better than the denial of the capability
afterwards. Others, like anything based on QTWebEngine, will crash on denial
of userns creation but handle denial of the capability within the userns just
fine, and some applications just crash regardless.

The userns cred from the LSM hook can be modified. Yes, it is currently
specified as const, but it is still under construction, so it can be safely
modified; the LSM hook just needs a small update.

The advantage of doing it under the LSM is that an LSM can have a richer
policy around what can use them and tracking of what is allowed. That is to
say, the LSM has the capability of being finer grained than doing it via
capabilities.

I am not opposed to adding another mechanism to control user namespaces,
I am just not currently convinced that capabilities are the right
mechanism.

> expectations (e.g. container runtimes, browser sandboxing, service
> isolation)
>
> This patch addresses these limitations by introducing an additional
> capability set used to restrict the permissions granted when creating user
> namespaces. This way, processes can apply the principle of least privilege
> by configuring only the capabilities they need for their namespaces.
>
> For compatibility reasons, processes always start with a full userns
> capability set.
>
> On namespace creation, the userns capability set (pU) is assigned to the
> new effective (pE), permitted (pP) and bounding set (X) of the task:
>
>     pU = pE = pP = X
>

This should be bounded by the creating task's bounding set, otherwise
the capability model's bounding invariant will be broken. But having the
capabilities that the userns wants to access in the task's bounding set is
a problem for all the unprivileged processes wanting access to user
namespaces.

Simply setting the userns fcap on the programs that want access to user
namespaces certainly does reduce the attack surface, but it really is
insufficient for utilities like unshare, bwrap, lxd etc. They can be used
to trivially bypass the restriction.

> The userns capability set obeys the invariant that no bit can ever be set
> if it is not already part of the task’s bounding set. This ensures that no
> namespace can ever gain more privileges than its predecessors.
> Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
> in the userns set requires its corresponding bit to be set in the permitted
> set. This effectively mimics the inheritable set rules and means that, by
> default, only root in the initial user namespace can gain userns
> capabilities:
>
>     p’U = (pE & CAP_SETPCAP) ? X : (X & pP)
>

If I am reading this right, for unprivileged processes the capabilities in
the userns are bounded by the process's permitted set before the userns is
created?

This is only being respected in PR_CTL; the user mode helper is straight
setting the caps.

> Note that since userns capabilities are strictly hierarchical, policies can
> be enforced at various levels (e.g. init, pam_cap) and inherited by every
> child namespace.
On Thu, May 16, 2024 at 03:07:28PM GMT, John Johansen wrote:
> agreed, though it really is application dependent. Some applications handle
> the denial at userns creation better, than the capability after. Others
> like anything based on QTWebEngine will crash on denial of userns creation
> but handle denial of the capability within the userns just fine, and some
> applications just crash regardless.

Yes, this is application specific, but I would argue that the latter is
much more preferable. For example, having one application crash in a
container is probably ok, but not being able to start the container in
the first place is probably not. Similarly, preventing the network
namespace creation breaks services which rely on systemd’s
PrivateNetwork, even though they most likely use it to prevent any
networking from being done.

> The userns cred from the LSM hook can be modified, yes it is currently
> specified as const but is still under construction so it can be safely
> modified the LSM hook just needs a small update.
>
> The advantage of doing it under the LSM is an LSM can have a richer policy
> around what can use them and tracking of what is allowed. That is to say the
> LSM has the capability of being finer grained than doing it via capabilities.

Sure, we could modify the LSM hook to do all sorts of things, but
leveraging it would be quite cumbersome, and it would take time to show up
in userspace, or simply never be adopted.
We’re already seeing it in Ubuntu, which started requiring AppArmor profiles.

This new capability set would be a universal thing that could be
leveraged today without modification to userspace. Moreover, it’s a
simple framework that can be extended.
As you mentioned, LSMs are even finer grained, and that’s the idea:
those could be used hand in hand eventually. You could envision LSM
hooks controlling the userns capability set, and thus enforcing policies
on the creation of nested namespaces without limiting the other tasks’
capabilities.

> I am not opposed to adding another mechanism to control user namespaces,
> I am just not currently convinced that capabilities are the right
> mechanism.

Well, that’s the thing: from past conversations, there is a lot of
disagreement about restricting namespaces. By restricting the
capabilities granted by namespaces instead, we’re actually treating the
root cause of most concerns.

Today user namespaces are "special" and always grant full caps. Adding a
new capability set to limit this behavior is logical; it is the same way
it's done for usual process transitions.
Essentially, this set is to namespaces what the inheritable set is to
root.

> this should be bounded by the creating task's bounding set, other wise
> the capability model's bounding invariant will be broken, but having the
> capabilities that the userns want to access in the task's bounding set is
> a problem for all the unprivileged processes wanting access to user
> namespaces.

This is possible with the security bit introduced in the second patch.
The idea of having those separate is that a service which has dropped
its capabilities can still create a fully privileged user namespace.
For example, systemd’s machined drops capabilities from its bounding set,
yet it should be able to create unprivileged containers.
The invariant is sound because a child userns can never regain what it
doesn’t have in its bounding set. If it helps, you can view the userns
set as a “namespace bounding set”, since it defines the future bounding
sets of namespaced tasks.

> If I am reading this right for unprivileged processes the capabilities in
> the userns are bounded by the processes permitted set before the userns is
> created?

Yes, unprivileged processes that want to raise a capability in their
userns set need it in their permitted set (as well as their bounding
set). This is similar to inheritable capabilities.
Recall that processes start with a full set of userns capabilities, so
if you drop a userns capability (or something else did, e.g.
init/pam/sysctl/parent) you will never be able to regain it, and
namespaces you create won't have it included.
Now, if you’re root (or cap privileged) you can always regain it.

> This is only being respected in PR_CTL, the user mode helper is straight
> setting the caps.

The usermode helper requires CAP_SYS_MODULE and CAP_SETPCAP in the initns,
so the permitted set is irrelevant there. It starts with a full set, but
from there you can only lower caps, so the invariant holds.
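
The rules discussed in this reply can be modelled in userspace as a rough sketch. It only mirrors the two formulas from the commit message (pE = pP = X = pU on namespace creation, and p’U = (pE & CAP_SETPCAP) ? X : (X & pP) for raising); struct task_caps and the helper names are hypothetical, the privilege check is simplified to a bit test on the effective set, and the 41-bit full mask is inferred from the 000001ffffffffff-style values in the sample output.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAP_SETPCAP_BIT   8	/* CAP_SETPCAP in linux/capability.h */
#define CAP_SYS_RAWIO_BIT 17	/* CAP_SYS_RAWIO in linux/capability.h */
#define CAP_FULL_MASK     ((UINT64_C(1) << 41) - 1)	/* capabilities 0..40 */

/* Hypothetical model of a task's capability sets as 64-bit masks. */
struct task_caps {
	uint64_t effective;	/* pE */
	uint64_t permitted;	/* pP */
	uint64_t bounding;	/* X  */
	uint64_t userns;	/* pU */
};

/* Lowering a userns capability is always allowed. */
static void userns_lower(struct task_caps *t, unsigned int cap)
{
	t->userns &= ~(UINT64_C(1) << cap);
}

/* Raise rule: p'U = (pE & CAP_SETPCAP) ? X : (X & pP). */
static bool userns_raise(struct task_caps *t, unsigned int cap)
{
	uint64_t bit = UINT64_C(1) << cap;
	bool privileged = (t->effective & (UINT64_C(1) << CAP_SETPCAP_BIT)) != 0;
	uint64_t allowed = privileged ? t->bounding
				      : (t->bounding & t->permitted);

	if (!(t->userns & bit) && !(allowed & bit))
		return false;	/* the patch returns -EPERM here */
	t->userns |= bit;
	return true;
}

/* Credentials inside a newly created user namespace: pE = pP = X = pU. */
static struct task_caps enter_new_userns(const struct task_caps *parent)
{
	struct task_caps child = {
		.effective = parent->userns,
		.permitted = parent->userns,
		.bounding  = parent->userns,
		.userns    = parent->userns,
	};
	return child;
}
```

In this model, an unprivileged task that lowers CAP_SYS_RAWIO ends up with a userns set of 0x000001fffffdffff, cannot raise the bit again, and neither can the task it becomes inside a new user namespace, because the bit is then missing from that task's bounding set. A task holding CAP_SETPCAP with a full bounding set can regain it.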
Jonathan Calmels <jcalmels@3xx0.net> writes:

> Attackers often rely on user namespaces to get elevated (yet confined)
> privileges in order to target specific subsystems (e.g. [1]). Distributions
> have been pretty adamant that they need a way to configure these, most of
> them carry out-of-tree patches to do so, or plainly refuse to enable
> them.

Pointers please?

That sentence sounds about 5 years out of date.

Eric
On Fri, May 17, 2024 at 06:32:46AM GMT, Eric W. Biederman wrote:
>
> Pointers please?
>
> That sentence sounds about 5 years out of date.

The link referenced is from last year.
Here are some others often cited by distributions:

https://nvd.nist.gov/vuln/detail/CVE-2022-0185
https://nvd.nist.gov/vuln/detail/CVE-2022-1015
https://nvd.nist.gov/vuln/detail/CVE-2022-2078
https://nvd.nist.gov/vuln/detail/CVE-2022-24122
https://nvd.nist.gov/vuln/detail/CVE-2022-25636

Recent thread discussing this too:

https://seclists.org/oss-sec/2024/q2/128
On 5/17/24 03:51, Jonathan Calmels wrote: > On Thu, May 16, 2024 at 03:07:28PM GMT, John Johansen wrote: >> agreed, though it really is application dependent. Some applications handle >> the denial at userns creation better, than the capability after. Others >> like anything based on QTWebEngine will crash on denial of userns creation >> but handle denial of the capability within the userns just fine, and some >> applications just crash regardless. > > Yes this is application specific, but I would argue that the latter is > much more preferable. For example, having one application crash in a > container is probably ok, but not being able to start the container in > the first place is probably not. Similarly, preventing the network > namespace creation breaks services which rely on systemd’s > PrivateNetwork, even though they most likely use it to prevent any > networking from being done. > Agred the solution has to be application/usage model specific. Some of them are easy, and others not so much. >> The userns cred from the LSM hook can be modified, yes it is currently >> specified as const but is still under construction so it can be safely >> modified the LSM hook just needs a small update. >> >> The advantage of doing it under the LSM is an LSM can have a richer policy >> around what can use them and tracking of what is allowed. That is to say the >> LSM has the capability of being finer grained than doing it via capabilities. > > Sure, we could modify the LSM hook to do all sorts of things, but > leveraging it would be quite cumbersome, will take time to show up in > userspace, or simply never be adopted. > We’re already seeing it in Ubuntu which started requiring Apparmor profiles. > yes, I would argue that is a metric of adoption. > This new capability set would be a universal thing that could be > leveraged today without modification to userspace. Moreover, it’s a > simple framework that can be extended. I would argue that is a problem. 
Userspace has to change for this to be secure. Is it an improvement over the current state yes. > As you mentioned, LSMs are even finer grained, and that’s the idea, > those could be used hand in hand eventually. You could envision LSM > hooks controlling the userns capability set, and thus enforce policies > on the creation of nested namespaces without limiting the other tasks’ > capabilities. > >> I am not opposed to adding another mechanism to control user namespaces, >> I am just not currently convinced that capabilities are the right >> mechanism. > > Well that’s the thing, from past conversations, there is a lot of > disagreement about restricting namespaces. By restricting the > capabilities granted by namespaces instead, we’re actually treating the > root cause of most concerns. > no disagreement there. This is actually Ubuntu's posture with user namespaces atm. Where the user namespace is allowed but the capabilities within it are denied. It does however when not handled correctly result in some very odd failures and would be easier to debug if the use of user namespaces were just cleanly denied. > Today user namespaces are "special" and always grant full caps. Adding a > new capability set to limit this behavior is logical; same way it's done > for usual process transitions. > Essentially this set is to namespaces what the inheritable set is to > root. > its not so much the capabilities set as the inheritable part that is problematic. Yes I am well aware of where that is required but I question that capabilities provides the needed controls here. >> this should be bounded by the creating task's bounding set, other wise >> the capability model's bounding invariant will be broken, but having the >> capabilities that the userns want to access in the task's bounding set is >> a problem for all the unprivileged processes wanting access to user >> namespaces. > > This is possible with the security bit introduced in the second patch. 
> The idea of having those separate is that a service which has dropped
> its capabilities can still create a fully privileged user namespace.

Yes, which is the problem. Not that we don't do that with, say, setuid
applications, but the difference is that they were known to be doing
something dangerous and took measures around that.

We are starting from a different posture here, where applications have
assumed that user namespaces were safe and no measures were needed.
Tools like unshare and bwrap, if set to allow user namespaces in their
fcaps, will give exploits a trivial by-pass.

> For example, systemd’s machined drops capabilities from its bounding set,
> yet it should be able to create unprivileged containers.
> The invariant is sound because a child userns can never regain what it
> doesn’t have in its bounding set. If it helps you can view the userns
> set as a “namespace bounding set” since it defines the future bounding
> sets of namespaced tasks.
>

Sure, I get it. Some of the use cases work, some not so well.

>> If I am reading this right for unprivileged processes the capabilities in
>> the userns are bounded by the processes permitted set before the userns is
>> created?
>
> Yes, unprivileged processes that want to raise a capability in their
> userns set need it in their permitted set (as well as their bounding
> set). This is similar to inheritable capabilities.

Right.

> Recall that processes start with a full set of userns capabilities, so
> if you drop a userns capability (or something else did, e.g.
> init/pam/sysctl/parent) you will never be able to regain it, and
> namespaces you create won't have it included.

Sure, that part of the behavior is fine.

> Now, if you’re root (or cap privileged) you can always regain it.
>

Yes.

What I was trying to get at is two points.
1. The written description wasn't clear enough, leaving room for
   ambiguity.
2. I question whether the behavior should be allowed given the
   current set of tools that use user namespaces.
It reduces exploit code's ability to directly use unprivileged user
namespaces but makes it all too easy to by-pass the restriction because
of the behavior of the current tool set, i.e. user space has to change.

>> This is only being respected in PR_CTL, the user mode helper is straight
>> setting the caps.
>
> Usermod helper requires CAP_SYS_MODULE and CAP_SETPCAP in the initns so
> the permitted set is irrelevant there. It starts with a full set but from
> there you can only lower caps, so the invariant holds.
>

Sure, I get what is happening. Again, the description needs work. It was
ambiguous as to whether it was applying to the fcaps or only the pcaps.

But again, I believe the fcaps behavior is wrong because of the state of
current software. If this had been a proposal where there was no existing
software infrastructure, I would be starting from a different stance.
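To make the "you can only lower caps" argument for the usermode helper concrete, here is a small userspace model in C. It is only a sketch and not kernel code: the names `model_umh_userns`, `model_sysctl_write` and `model_helper_userns` are hypothetical, the 41-bit full set assumes a kernel whose last capability is number 40, and the write rule (a write can only clear bits) is an assumption taken from the statement quoted above. The launch-time step mirrors the `cap_intersect()` call in the quoted `kernel/umh.c` hunk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the proposed usermodehelper "userns" sysctl.
 * Assumption: 41 capabilities (bits 0..40), so the full set is 0x1ffffffffff. */
static uint64_t model_umh_userns = UINT64_C(0x1ffffffffff);

/* Assumed write semantics: a write is intersected with the current value,
 * so bits can be cleared but never set again. */
static void model_sysctl_write(uint64_t requested)
{
    model_umh_userns &= requested;
}

/* Helper launch: the helper's userns set is the intersection of the sysctl
 * value and the set it would otherwise inherit. */
static uint64_t model_helper_userns(uint64_t inherited)
{
    return model_umh_userns & inherited;
}

#ifdef DEMO_MAIN
int main(void)
{
    model_sysctl_write(~(UINT64_C(1) << 17));      /* clear capability 17 */
    model_sysctl_write(UINT64_C(0x1ffffffffff));   /* attempt to restore it */
    printf("0x%llx\n",
           (unsigned long long)model_helper_userns(UINT64_C(0x1ffffffffff)));
    return 0;
}
#endif
```

Under these assumptions, a bit that has been cleared once stays cleared for every helper launched afterwards, which is the invariant being discussed. Build with `-DDEMO_MAIN` to get a standalone program.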
On 5/17/24 04:55, Jonathan Calmels wrote:
> On Fri, May 17, 2024 at 06:32:46AM GMT, Eric W. Biederman wrote:
>>
>> Pointers please?
>>
>> That sentence sounds about 5 years out of date.
>
> The link referenced is from last year.
> Here are some others often cited by distributions:
>
> https://nvd.nist.gov/vuln/detail/CVE-2022-0185
> https://nvd.nist.gov/vuln/detail/CVE-2022-1015
> https://nvd.nist.gov/vuln/detail/CVE-2022-2078
> https://nvd.nist.gov/vuln/detail/CVE-2022-24122
> https://nvd.nist.gov/vuln/detail/CVE-2022-25636
>
> Recent thread discussing this too:
> https://seclists.org/oss-sec/2024/q2/128
>

They were used in 2020, 2021, and 2022 pwn2own exploits. Sorry, I don't
remember the exact numbers and will have to dig.

pwn2own 2023: 4/5 hacks used them
https://www.zerodayinitiative.com/blog/2023/3/23/pwn2own-vancouver-2023-day-two-results

I will need to dig to find the CVEs associated with them.

pwn2own 2024 I cannot discuss at the moment.

But it's not just pwn2own: the actual list of kernel CVEs that
unprivileged user namespaces make exploitable is much larger.
Jonathan Calmels <jcalmels@3xx0.net> writes:

> On Fri, May 17, 2024 at 06:32:46AM GMT, Eric W. Biederman wrote:
>>
>> Pointers please?
>>
>> That sentence sounds about 5 years out of date.
>
> The link referenced is from last year.
> Here are some others often cited by distributions:
>
> https://nvd.nist.gov/vuln/detail/CVE-2022-0185
> https://nvd.nist.gov/vuln/detail/CVE-2022-1015
> https://nvd.nist.gov/vuln/detail/CVE-2022-2078
> https://nvd.nist.gov/vuln/detail/CVE-2022-24122
> https://nvd.nist.gov/vuln/detail/CVE-2022-25636
>
> Recent thread discussing this too:
> https://seclists.org/oss-sec/2024/q2/128

My apologies perhaps I trimmed too much.

I know that user namespaces enlarge the attack surface.
How much and how serious could be debated but for unprivileged
users the attack surface is undoubtedly enlarged.

As I read your introduction you were justifying the introduction
of a new security mechanism with the observation that distributions
were carrying distribution specific patches.

To the best of my knowledge distribution specific patches and
distributions disabling user namespaces have been gone for quite a
while. So if that has changed recently I would like to know.

Thank you,
Eric
On Fri, May 17, 2024 at 06:32:46AM GMT, Eric W. Biederman wrote:
> As I read your introduction you were justifying the introduction
> of a new security mechanism with the observation that distributions
> were carrying distribution specific patches.
>
> To the best of my knowledge distribution specific patches and
> distributions disabling user namespaces have been gone for quite a
> while. So if that has changed recently I would like to know.

Off the top of my head:

- RHEL based:
  namespace.unpriv_enable
  user_namespace.enable

- Arch/Debian based:
  kernel.unprivileged_userns_clone

- Ubuntu based:
  kernel.apparmor_restrict_unprivileged_userns

I'm not sure which exact versions those apply to, but they are
definitely still out there.

The observation is that while you can disable namespaces today, in
practice it breaks userspace in various ways. Hence, being able to
control capabilities is a better way to approach it.

For example, today's big hammer to prevent CAP_NET_ADMIN in userns:

  # sysctl -qw user.max_net_namespaces=0

  $ unshare -U -r -n ip tuntap add mode tap tap0 && echo OK
  unshare: unshare failed: No space left on device

With the patch, this becomes manageable:

  # capsh --drop=cap_net_admin --secbits=$((1 << 8)) --user=$USER -- \
      -c 'unshare -U -r -n ip tuntap add mode tap tap0 && echo OK'
  ioctl(TUNSETIFF): Operation not permitted
On Fri, May 17, 2024 at 04:59:41AM GMT, John Johansen wrote: > On 5/17/24 03:51, Jonathan Calmels wrote: > > This new capability set would be a universal thing that could be > > leveraged today without modification to userspace. Moreover, it’s a > > simple framework that can be extended. > > I would argue that is a problem. Userspace has to change for this to be > secure. Is it an improvement over the current state yes. Well, yes and no. With those patches, I can lock down things today on my system and I don't need to change anything. For example I can decide that none of my rootless containers started under SSH will get CAP_NET_ADMIN: # echo "auth optional pam_cap.so" >> /etc/pam.d/sshd # echo "!cap_net_admin $USER" >> /etc/security/capability.conf # capsh --secbits=$((1 << 8)) -- -c /usr/sbin/sshd $ ssh localhost 'unshare -r capsh --current' Current: =ep cap_net_admin-ep Current IAB: !cap_net_admin Or I can decide than I don't ever want CAP_SYS_RAWIO in my namespaces: # sysctl -w cap_bound_userns_mask=0x1fffffdffff This doesn't require changes to userspace. Now, granted if you want to have finer-grained controls, it will require *some* changes in *some* places (e.g. adding new systemd property like UserNSSet=). > > Well that’s the thing, from past conversations, there is a lot of > > disagreement about restricting namespaces. By restricting the > > capabilities granted by namespaces instead, we’re actually treating the > > root cause of most concerns. > > > no disagreement there. This is actually Ubuntu's posture with user namespaces > atm. Where the user namespace is allowed but the capabilities within it > are denied. > > It does however when not handled correctly result in some very odd failures > and would be easier to debug if the use of user namespaces were just > cleanly denied. Yes but as we established it depends on the use case, both are not mutually exclusive. > its not so much the capabilities set as the inheritable part that is > problematic. 
Yes I am well aware of where that is required but I question > that capabilities provides the needed controls here. Again, I'm not opposed to doing this with LSMs. I just think both could work well together. We already do that with standard capabilities vs LSMs, both have their strength and weaknesses. It's always a tradeoff, do you want a setting that's universal and coarse, or do you want one that's tailored to specific things but less ubiquitous. It's also a tradeoff on usability. If this doesn't get used in practice, then there is no point. I would argue that even though capabilities are complicated, they are more widely understood than LSMs. Are capabilities insufficient in certain scenarios, absolutely, and that's usually where LSMs come in. > > This is possible with the security bit introduced in the second patch. > > The idea of having those separate is that a service which has dropped > > its capabilities can still create a fully privileged user namespace. > > yes, which is the problem. Not that we don't do that with say setuid > applications, but the difference is that they were known to be doing > something dangerous and took measures around that. > > We are starting from a different posture here. Where applications have > assumed that user namespaces where safe and no measures were needed. > Tools like unshare and bwrap if set to allow user namespaces in their > fcaps will allow exploits a trivial by-pass. Agreed, but we can't really walk back this decision unfortunately. At least with this patch series system administrators have the ability to limit such tools. > What I was trying to get at is two points. > 1. The written description wasn't clear enough, leaving room for > ambiguity. > 2. That I quest that the behavior should be allowed given the > current set of tools that use user namespaces. 
It reduces exploit > codes ability to directly use unprivileged user namespaces but > makes it all to easy to by-pass the restriction because of the > behavior of the current tool set. ie. user space has to change. > But again, I believe the fcaps behavior is wrong, because of the state of > current software. If this had been a proposal where there was no existing > software infrastructure I would be starting from a different stance. As mentioned above, userspace doesn't necessarily have to change. I'm also not sure what you mean by easy to by-pass? If I mask off some capabilities system wide or in a given process tree, I know for a fact that no namespace will ever get those capabilities.
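For readers wondering where the mask 0x1fffffdffff in the sysctl example above comes from, the arithmetic can be sketched in C as follows. This is only an illustration: the helper names are hypothetical, the capability numbers are redefined locally (CAP_SYS_RAWIO is 17 in `<linux/capability.h>`), and the sketch assumes a kernel with 41 capabilities, i.e. the last capability is number 40, which matches the `CapPrm: 000001fffffdffff` output in the patch description.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Local copies of kernel capability numbers so the sketch builds anywhere. */
#define SKETCH_CAP_SYS_RAWIO 17
#define SKETCH_CAP_LAST_CAP  40 /* assumption: capabilities 0..40 exist */

/* Mask with every capability bit from 0 to last_cap set. */
static uint64_t cap_full_mask(unsigned int last_cap)
{
    return (UINT64_C(1) << (last_cap + 1)) - 1;
}

/* Full mask with one capability cleared. */
static uint64_t cap_mask_without(unsigned int last_cap, unsigned int cap)
{
    return cap_full_mask(last_cap) & ~(UINT64_C(1) << cap);
}

#ifdef DEMO_MAIN
int main(void)
{
    /* Value to feed to the proposed sysctl to drop CAP_SYS_RAWIO. */
    printf("0x%llx\n", (unsigned long long)
           cap_mask_without(SKETCH_CAP_LAST_CAP, SKETCH_CAP_SYS_RAWIO));
    return 0;
}
#endif
```

Clearing a different capability only changes the second argument; the full mask depends on how many capabilities the running kernel defines.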
On 5/17/24 20:50, Jonathan Calmels wrote: > On Fri, May 17, 2024 at 04:59:41AM GMT, John Johansen wrote: >> On 5/17/24 03:51, Jonathan Calmels wrote: >>> This new capability set would be a universal thing that could be >>> leveraged today without modification to userspace. Moreover, it’s a >>> simple framework that can be extended. >> >> I would argue that is a problem. Userspace has to change for this to be >> secure. Is it an improvement over the current state yes. > > Well, yes and no. With those patches, I can lock down things today on my > system and I don't need to change anything. > sure, same as with the big no user ns toggle. This is finer and allows selectively enabling on a per application basis. > For example I can decide that none of my rootless containers started > under SSH will get CAP_NET_ADMIN: > > # echo "auth optional pam_cap.so" >> /etc/pam.d/sshd > # echo "!cap_net_admin $USER" >> /etc/security/capability.conf > # capsh --secbits=$((1 << 8)) -- -c /usr/sbin/sshd > > $ ssh localhost 'unshare -r capsh --current' > Current: =ep cap_net_admin-ep > Current IAB: !cap_net_admin > > Or I can decide than I don't ever want CAP_SYS_RAWIO in my namespaces: > > # sysctl -w cap_bound_userns_mask=0x1fffffdffff > > This doesn't require changes to userspace. > Now, granted if you want to have finer-grained controls, it will require > *some* changes in *some* places (e.g. adding new systemd property like > UserNSSet=). > yep >>> Well that’s the thing, from past conversations, there is a lot of >>> disagreement about restricting namespaces. By restricting the >>> capabilities granted by namespaces instead, we’re actually treating the >>> root cause of most concerns. >>> >> no disagreement there. This is actually Ubuntu's posture with user namespaces >> atm. Where the user namespace is allowed but the capabilities within it >> are denied. 
>> >> It does however when not handled correctly result in some very odd failures >> and would be easier to debug if the use of user namespaces were just >> cleanly denied. > > Yes but as we established it depends on the use case, both are not > mutually exclusive. > yep >> its not so much the capabilities set as the inheritable part that is >> problematic. Yes I am well aware of where that is required but I question >> that capabilities provides the needed controls here. > > Again, I'm not opposed to doing this with LSMs. I just think both could > work well together. We already do that with standard capabilities vs > LSMs, both have their strength and weaknesses. > yes, don't get me wrong I am not necessarily advocating an LSM solution as being necessary. I just want to make sure the trade-offs of the capabilities solution get discussed to help evaluate whether extending the current capability model is worth it. > It's always a tradeoff, do you want a setting that's universal and > coarse, or do you want one that's tailored to specific things but less > ubiquitous. > yep > It's also a tradeoff on usability. If this doesn't get used in practice, > then there is no point. agreed > I would argue that even though capabilities are complicated, they are > more widely understood than LSMs. Are capabilities insufficient in > certain scenarios, absolutely, and that's usually where LSMs come in. > hrmmm, I am not sure I would agree with capabilities are better understood than LSMs, At the base level of capability(X) to get permission yes, but the whole permitting, bounding, ... Really I think most people are just confused by both >>> This is possible with the security bit introduced in the second patch. >>> The idea of having those separate is that a service which has dropped >>> its capabilities can still create a fully privileged user namespace. >> >> yes, which is the problem. 
Not that we don't do that with say setuid >> applications, but the difference is that they were known to be doing >> something dangerous and took measures around that. >> >> We are starting from a different posture here. Where applications have >> assumed that user namespaces where safe and no measures were needed. >> Tools like unshare and bwrap if set to allow user namespaces in their >> fcaps will allow exploits a trivial by-pass. > > Agreed, but we can't really walk back this decision unfortunately. And that is partly the crux of the issue, if we can't walk back the decision then the solution becomes more complex > At least with this patch series system administrators have the ability > to limit such tools. > agreed >> What I was trying to get at is two points. >> 1. The written description wasn't clear enough, leaving room for >> ambiguity. >> 2. That I quest that the behavior should be allowed given the >> current set of tools that use user namespaces. It reduces exploit >> codes ability to directly use unprivileged user namespaces but >> makes it all to easy to by-pass the restriction because of the >> behavior of the current tool set. ie. user space has to change. > >> But again, I believe the fcaps behavior is wrong, because of the state of >> current software. If this had been a proposal where there was no existing >> software infrastructure I would be starting from a different stance. > > As mentioned above, userspace doesn't necessarily have to change. I'm > also not sure what you mean by easy to by-pass? If I mask off some > capabilities system wide or in a given process tree, I know for a fact > that no namespace will ever get those capabilities. so by-pass will very much depend on the system but from a distro pov we pretty much have to have bwrap enabled if users want to use flatpaks (and they do), same story for several other tools. 
Since this basically means said tools need to be available by default,
most systems the distro is installed on are vulnerable by default. The
trivial by-pass then becomes the exploit running its payload through one
of these tools, and yes, I have tested it.

Could a distro disable these tools by default and require the user/admin
to enable them? Yes, though there would be a lot of friction and push
back, and in the end most systems would still end up with them enabled.

With the capabilities approach, can a user/admin make their system more
secure than the current situation? Absolutely.

Note that regardless of what happens with patches 1 and 2, I think we
either need the big sysctl toggle or a version of your patch 3.
On Sat, May 18, 2024 at 05:27:27AM GMT, John Johansen wrote:
> On 5/17/24 20:50, Jonathan Calmels wrote:
> > As mentioned above, userspace doesn't necessarily have to change. I'm
> > also not sure what you mean by easy to by-pass? If I mask off some
> > capabilities system wide or in a given process tree, I know for a fact
> > that no namespace will ever get those capabilities.
>
> so by-pass will very much depend on the system but from a distro pov
> we pretty much have to have bwrap enabled if users want to use flatpaks
> (and they do), same story for several other tools. Since this basically
> means said tools need to be available by default, most systems the
> distro is installed on are vulnerable by default. The trivial by-pass
> then becomes the exploit running its payload through one of these tools,
> and yes I have tested it.
>
> Could a distro disable these tools by default, and require the user/admin
> to enable them, yes though there would be a lot of friction, push back,
> and in the end most systems would still end up with them enabled.
>
> With the capibilities approach can a user/admin make their system
> more secure than the current situation, absolutely.
>
> Note, that regardless of what happens with patch 1, and 2. I think we
> either need the big sysctl toggle, or a version of your patch 3

Ah ok, I get your concerns. Unfortunately, I can't really speak for
distros or tooling about how this gets leveraged.
I've never claimed this was going to be bulletproof on day 1.
All I'm saying is that they now have the option to do so.

As you pointed out, we're coming from a model where today it's open-bar.
Only now they can put a bouncer in front of it, so to speak :)

Regarding distros:

- Maybe they ship with an empty userns mask by default and admins have
  to tweak it, understanding full well the consequences of doing so.
- Maybe they ship with a conservative mask and use pam rules to adjust
  it.
- Maybe they introduce something like a wheel/sudo group that you need
  to be part of to gain extra privileges in your userns.
- Maybe only some system services (e.g. dockerd, lxd/incusd, machined)
  get confined.
- Maybe they need highly specific policies, and this is where you would
  want LSM support. Say an Apparmor profile targeting unshare(1)
  specifically.

Regarding tools:

- Maybe bwrap has its own group you need to be part of to get full caps.
- Maybe docker uses this set behind `--cap-add` `--cap-drop`.
- Maybe lxd/incusd implement ACLs restricting who can do what.
- Maybe steam always drops everything it doesn't need.

I'm sure this won't cover every single corner case, but as stated in the
headline, this is a start, a simple framework we can always extend if
needed in the future.
On Thu, May 16, 2024 at 02:22:03AM -0700, Jonathan Calmels wrote: > Attackers often rely on user namespaces to get elevated (yet confined) > privileges in order to target specific subsystems (e.g. [1]). Distributions > have been pretty adamant that they need a way to configure these, most of > them carry out-of-tree patches to do so, or plainly refuse to enable them. > As a result, there have been multiple efforts over the years to introduce > various knobs to control and/or disable user namespaces (e.g. [2][3][4]). > > While we acknowledge that there are already ways to control the creation of > such namespaces (the most recent being a LSM hook), there are inherent > issues with these approaches. Preventing the user namespace creation is not > fine-grained enough, and in some cases, incompatible with various userspace > expectations (e.g. container runtimes, browser sandboxing, service > isolation) > > This patch addresses these limitations by introducing an additional > capability set used to restrict the permissions granted when creating user > namespaces. This way, processes can apply the principle of least privilege > by configuring only the capabilities they need for their namespaces. > > For compatibility reasons, processes always start with a full userns > capability set. > > On namespace creation, the userns capability set (pU) is assigned to the > new effective (pE), permitted (pP) and bounding set (X) of the task: > > pU = pE = pP = X > > The userns capability set obeys the invariant that no bit can ever be set > if it is not already part of the task’s bounding set. This ensures that no > namespace can ever gain more privileges than its predecessors. > Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit > in the userns set requires its corresponding bit to be set in the permitted > set. 
This effectively mimics the inheritable set rules and means that, by > default, only root in the initial user namespace can gain userns > capabilities: > > p’U = (pE & CAP_SETPCAP) ? X : (X & pP) > > Note that since userns capabilities are strictly hierarchical, policies can > be enforced at various levels (e.g. init, pam_cap) and inherited by every > child namespace. > > Here is a sample program that can be used to verify the functionality: > > /* > * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces. > * > * ./cap_userns_test unshare -r grep Cap /proc/self/status > * CapInh: 0000000000000000 > * CapPrm: 000001fffffdffff > * CapEff: 000001fffffdffff > * CapBnd: 000001fffffdffff > * CapAmb: 0000000000000000 > * CapUNs: 000001fffffdffff > */ > > int main(int argc, char *argv[]) > { > if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0) > err(1, "cannot drop userns cap"); > > execvp(argv[1], argv + 1); > err(1, "cannot exec"); > } > > Link: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html > Link: https://lore.kernel.org/lkml/1453502345-30416-1-git-send-email-keescook@chromium.org > Link: https://lore.kernel.org/lkml/20220815162028.926858-1-fred@cloudflare.com > Link: https://lore.kernel.org/containers/168547265011.24337.4306067683997517082-0@git.sr.ht > > Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net> Thanks! 
Of course we'llnsee how the conversations fall out, but Reviewed-by: Serge Hallyn <serge@hallyn.com> > --- > fs/proc/array.c | 9 ++++++ > include/linux/cred.h | 3 ++ > include/uapi/linux/prctl.h | 7 +++++ > kernel/cred.c | 3 ++ > kernel/umh.c | 16 ++++++++++ > kernel/user_namespace.c | 12 +++----- > security/commoncap.c | 59 ++++++++++++++++++++++++++++++++++++ > security/keys/process_keys.c | 3 ++ > 8 files changed, 105 insertions(+), 7 deletions(-) > > diff --git a/fs/proc/array.c b/fs/proc/array.c > index 34a47fb0c57f..364e8bb19f9d 100644 > --- a/fs/proc/array.c > +++ b/fs/proc/array.c > @@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) > const struct cred *cred; > kernel_cap_t cap_inheritable, cap_permitted, cap_effective, > cap_bset, cap_ambient; > +#ifdef CONFIG_USER_NS > + kernel_cap_t cap_userns; > +#endif > > rcu_read_lock(); > cred = __task_cred(p); > @@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) > cap_effective = cred->cap_effective; > cap_bset = cred->cap_bset; > cap_ambient = cred->cap_ambient; > +#ifdef CONFIG_USER_NS > + cap_userns = cred->cap_userns; > +#endif > rcu_read_unlock(); > > render_cap_t(m, "CapInh:\t", &cap_inheritable); > @@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) > render_cap_t(m, "CapEff:\t", &cap_effective); > render_cap_t(m, "CapBnd:\t", &cap_bset); > render_cap_t(m, "CapAmb:\t", &cap_ambient); > +#ifdef CONFIG_USER_NS > + render_cap_t(m, "CapUNs:\t", &cap_userns); > +#endif > } > > static inline void task_seccomp(struct seq_file *m, struct task_struct *p) > diff --git a/include/linux/cred.h b/include/linux/cred.h > index 2976f534a7a3..adab0031443e 100644 > --- a/include/linux/cred.h > +++ b/include/linux/cred.h > @@ -124,6 +124,9 @@ struct cred { > kernel_cap_t cap_effective; /* caps we can actually use */ > kernel_cap_t cap_bset; /* capability bounding set */ > kernel_cap_t cap_ambient; /* Ambient 
capability set */ > +#ifdef CONFIG_USER_NS > + kernel_cap_t cap_userns; /* User namespace capability set */ > +#endif > #ifdef CONFIG_KEYS > unsigned char jit_keyring; /* default keyring to attach requested > * keys to */ > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index 370ed14b1ae0..e09475171f62 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -198,6 +198,13 @@ struct prctl_mm_map { > # define PR_CAP_AMBIENT_LOWER 3 > # define PR_CAP_AMBIENT_CLEAR_ALL 4 > > +/* Control the userns capability set */ > +#define PR_CAP_USERNS 48 > +# define PR_CAP_USERNS_IS_SET 1 > +# define PR_CAP_USERNS_RAISE 2 > +# define PR_CAP_USERNS_LOWER 3 > +# define PR_CAP_USERNS_CLEAR_ALL 4 > + > /* arm64 Scalable Vector Extension controls */ > /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */ > #define PR_SVE_SET_VL 50 /* set task vector length */ > diff --git a/kernel/cred.c b/kernel/cred.c > index 075cfa7c896f..9912c6f3bc6b 100644 > --- a/kernel/cred.c > +++ b/kernel/cred.c > @@ -56,6 +56,9 @@ struct cred init_cred = { > .cap_permitted = CAP_FULL_SET, > .cap_effective = CAP_FULL_SET, > .cap_bset = CAP_FULL_SET, > +#ifdef CONFIG_USER_NS > + .cap_userns = CAP_FULL_SET, > +#endif > .user = INIT_USER, > .user_ns = &init_user_ns, > .group_info = &init_groups, > diff --git a/kernel/umh.c b/kernel/umh.c > index 1b13c5d34624..51f1e1d25d49 100644 > --- a/kernel/umh.c > +++ b/kernel/umh.c > @@ -32,6 +32,9 @@ > > #include <trace/events/module.h> > > +#ifdef CONFIG_USER_NS > +static kernel_cap_t usermodehelper_userns = CAP_FULL_SET; > +#endif > static kernel_cap_t usermodehelper_bset = CAP_FULL_SET; > static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET; > static DEFINE_SPINLOCK(umh_sysctl_lock); > @@ -94,6 +97,10 @@ static int call_usermodehelper_exec_async(void *data) > new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset); > new->cap_inheritable = cap_intersect(usermodehelper_inheritable, > 
new->cap_inheritable); > +#ifdef CONFIG_USER_NS > + new->cap_userns = cap_intersect(usermodehelper_userns, > + new->cap_userns); > +#endif > spin_unlock(&umh_sysctl_lock); > > if (sub_info->init) { > @@ -560,6 +567,15 @@ static struct ctl_table usermodehelper_table[] = { > .mode = 0600, > .proc_handler = proc_cap_handler, > }, > +#ifdef CONFIG_USER_NS > + { > + .procname = "userns", > + .data = &usermodehelper_userns, > + .maxlen = 2 * sizeof(unsigned long), > + .mode = 0600, > + .proc_handler = proc_cap_handler, > + }, > +#endif > { } > }; > > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index 0b0b95418b16..7e624607330b 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts) > > static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) > { > - /* Start with the same capabilities as init but useless for doing > - * anything as the capabilities are bound to the new user namespace. > - */ > - cred->securebits = SECUREBITS_DEFAULT; > + /* Start with the capabilities defined in the userns set. */ > + cred->cap_bset = cred->cap_userns; > + cred->cap_permitted = cred->cap_userns; > + cred->cap_effective = cred->cap_userns; > cred->cap_inheritable = CAP_EMPTY_SET; > - cred->cap_permitted = CAP_FULL_SET; > - cred->cap_effective = CAP_FULL_SET; > cred->cap_ambient = CAP_EMPTY_SET; > - cred->cap_bset = CAP_FULL_SET; > + cred->securebits = SECUREBITS_DEFAULT; > #ifdef CONFIG_KEYS > key_put(cred->request_key_auth); > cred->request_key_auth = NULL; > diff --git a/security/commoncap.c b/security/commoncap.c > index 162d96b3a676..b3d3372bf910 100644 > --- a/security/commoncap.c > +++ b/security/commoncap.c > @@ -228,6 +228,28 @@ static inline int cap_inh_is_capped(void) > return 1; > } > > +/* > + * Determine whether a userns capability can be raised. > + * Returns 1 if it can, 0 otherwise. 
> + */ > +#ifdef CONFIG_USER_NS > +static inline int cap_uns_is_raiseable(unsigned long cap) > +{ > + if (!!cap_raised(current_cred()->cap_userns, cap)) > + return 1; > + /* a capability cannot be raised unless the current task has it in > + * its bounding set and, without CAP_SETPCAP, its permitted set. > + */ > + if (!cap_raised(current_cred()->cap_bset, cap)) > + return 0; > + if (cap_capable(current_cred(), current_cred()->user_ns, > + CAP_SETPCAP, CAP_OPT_NONE) != 0 && > + !cap_raised(current_cred()->cap_permitted, cap)) > + return 0; > + return 1; > +} > +#endif > + > /** > * cap_capset - Validate and apply proposed changes to current's capabilities > * @new: The proposed new credentials; alterations should be made here > @@ -1382,6 +1404,43 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3, > return commit_creds(new); > } > > +#ifdef CONFIG_USER_NS > + case PR_CAP_USERNS: > + if (arg2 == PR_CAP_USERNS_CLEAR_ALL) { > + if (arg3 | arg4 | arg5) > + return -EINVAL; > + > + new = prepare_creds(); > + if (!new) > + return -ENOMEM; > + cap_clear(new->cap_userns); > + return commit_creds(new); > + } > + > + if (((!cap_valid(arg3)) | arg4 | arg5)) > + return -EINVAL; > + > + if (arg2 == PR_CAP_USERNS_IS_SET) { > + return !!cap_raised(current_cred()->cap_userns, arg3); > + } else if (arg2 != PR_CAP_USERNS_RAISE && > + arg2 != PR_CAP_USERNS_LOWER) { > + return -EINVAL; > + } else { > + if (arg2 == PR_CAP_USERNS_RAISE && > + !cap_uns_is_raiseable(arg3)) > + return -EPERM; > + > + new = prepare_creds(); > + if (!new) > + return -ENOMEM; > + if (arg2 == PR_CAP_USERNS_RAISE) > + cap_raise(new->cap_userns, arg3); > + else > + cap_lower(new->cap_userns, arg3); > + return commit_creds(new); > + } > +#endif > + > default: > /* No functionality available - continue with default */ > return -ENOSYS; > diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c > index b5d5333ab330..e3670d815435 100644 > --- a/security/keys/process_keys.c > +++ 
b/security/keys/process_keys.c > @@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork) > new->cap_effective = old->cap_effective; > new->cap_ambient = old->cap_ambient; > new->cap_bset = old->cap_bset; > +#ifdef CONFIG_USER_NS > + new->cap_userns = old->cap_userns; > +#endif > > new->jit_keyring = old->jit_keyring; > new->thread_keyring = key_get(old->thread_keyring); > -- > 2.45.0 >
On Thu, May 16, 2024 at 02:22:03AM -0700, Jonathan Calmels wrote: > Attackers often rely on user namespaces to get elevated (yet confined) > privileges in order to target specific subsystems (e.g. [1]). Distributions > have been pretty adamant that they need a way to configure these, most of > them carry out-of-tree patches to do so, or plainly refuse to enable them. > As a result, there have been multiple efforts over the years to introduce > various knobs to control and/or disable user namespaces (e.g. [2][3][4]). > > While we acknowledge that there are already ways to control the creation of > such namespaces (the most recent being a LSM hook), there are inherent > issues with these approaches. Preventing the user namespace creation is not > fine-grained enough, and in some cases, incompatible with various userspace > expectations (e.g. container runtimes, browser sandboxing, service > isolation) > > This patch addresses these limitations by introducing an additional > capability set used to restrict the permissions granted when creating user > namespaces. This way, processes can apply the principle of least privilege > by configuring only the capabilities they need for their namespaces. > > For compatibility reasons, processes always start with a full userns > capability set. > > On namespace creation, the userns capability set (pU) is assigned to the > new effective (pE), permitted (pP) and bounding set (X) of the task: > > pU = pE = pP = X > > The userns capability set obeys the invariant that no bit can ever be set > if it is not already part of the task’s bounding set. This ensures that no > namespace can ever gain more privileges than its predecessors. > Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit > in the userns set requires its corresponding bit to be set in the permitted > set. 
This effectively mimics the inheritable set rules and means that, by > default, only root in the initial user namespace can gain userns > capabilities: > > p’U = (pE & CAP_SETPCAP) ? X : (X & pP) > > Note that since userns capabilities are strictly hierarchical, policies can > be enforced at various levels (e.g. init, pam_cap) and inherited by every > child namespace. > > Here is a sample program that can be used to verify the functionality: > > /* > * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces. > * > * ./cap_userns_test unshare -r grep Cap /proc/self/status > * CapInh: 0000000000000000 > * CapPrm: 000001fffffdffff > * CapEff: 000001fffffdffff > * CapBnd: 000001fffffdffff > * CapAmb: 0000000000000000 > * CapUNs: 000001fffffdffff > */ > > int main(int argc, char *argv[]) > { > if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0) > err(1, "cannot drop userns cap"); > > execvp(argv[1], argv + 1); > err(1, "cannot exec"); > } > > Link: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html > Link: https://lore.kernel.org/lkml/1453502345-30416-1-git-send-email-keescook@chromium.org > Link: https://lore.kernel.org/lkml/20220815162028.926858-1-fred@cloudflare.com > Link: https://lore.kernel.org/containers/168547265011.24337.4306067683997517082-0@git.sr.ht > > Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net> > --- > fs/proc/array.c | 9 ++++++ > include/linux/cred.h | 3 ++ > include/uapi/linux/prctl.h | 7 +++++ > kernel/cred.c | 3 ++ > kernel/umh.c | 16 ++++++++++ > kernel/user_namespace.c | 12 +++----- > security/commoncap.c | 59 ++++++++++++++++++++++++++++++++++++ > security/keys/process_keys.c | 3 ++ > 8 files changed, 105 insertions(+), 7 deletions(-) > > diff --git a/fs/proc/array.c b/fs/proc/array.c > index 34a47fb0c57f..364e8bb19f9d 100644 > --- a/fs/proc/array.c > +++ b/fs/proc/array.c > @@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) > const struct 
cred *cred; > kernel_cap_t cap_inheritable, cap_permitted, cap_effective, > cap_bset, cap_ambient; > +#ifdef CONFIG_USER_NS > + kernel_cap_t cap_userns; > +#endif > > rcu_read_lock(); > cred = __task_cred(p); > @@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) > cap_effective = cred->cap_effective; > cap_bset = cred->cap_bset; > cap_ambient = cred->cap_ambient; > +#ifdef CONFIG_USER_NS > + cap_userns = cred->cap_userns; > +#endif > rcu_read_unlock(); > > render_cap_t(m, "CapInh:\t", &cap_inheritable); > @@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) > render_cap_t(m, "CapEff:\t", &cap_effective); > render_cap_t(m, "CapBnd:\t", &cap_bset); > render_cap_t(m, "CapAmb:\t", &cap_ambient); > +#ifdef CONFIG_USER_NS > + render_cap_t(m, "CapUNs:\t", &cap_userns); > +#endif > } > > static inline void task_seccomp(struct seq_file *m, struct task_struct *p) > diff --git a/include/linux/cred.h b/include/linux/cred.h > index 2976f534a7a3..adab0031443e 100644 > --- a/include/linux/cred.h > +++ b/include/linux/cred.h > @@ -124,6 +124,9 @@ struct cred { > kernel_cap_t cap_effective; /* caps we can actually use */ > kernel_cap_t cap_bset; /* capability bounding set */ > kernel_cap_t cap_ambient; /* Ambient capability set */ > +#ifdef CONFIG_USER_NS > + kernel_cap_t cap_userns; /* User namespace capability set */ > +#endif > #ifdef CONFIG_KEYS > unsigned char jit_keyring; /* default keyring to attach requested > * keys to */ > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index 370ed14b1ae0..e09475171f62 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -198,6 +198,13 @@ struct prctl_mm_map { > # define PR_CAP_AMBIENT_LOWER 3 > # define PR_CAP_AMBIENT_CLEAR_ALL 4 > > +/* Control the userns capability set */ > +#define PR_CAP_USERNS 48 > +# define PR_CAP_USERNS_IS_SET 1 > +# define PR_CAP_USERNS_RAISE 2 > +# define PR_CAP_USERNS_LOWER 3 > +# 
define PR_CAP_USERNS_CLEAR_ALL 4 > + > /* arm64 Scalable Vector Extension controls */ > /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */ > #define PR_SVE_SET_VL 50 /* set task vector length */ > diff --git a/kernel/cred.c b/kernel/cred.c > index 075cfa7c896f..9912c6f3bc6b 100644 > --- a/kernel/cred.c > +++ b/kernel/cred.c > @@ -56,6 +56,9 @@ struct cred init_cred = { > .cap_permitted = CAP_FULL_SET, > .cap_effective = CAP_FULL_SET, > .cap_bset = CAP_FULL_SET, > +#ifdef CONFIG_USER_NS > + .cap_userns = CAP_FULL_SET, > +#endif > .user = INIT_USER, > .user_ns = &init_user_ns, > .group_info = &init_groups, > diff --git a/kernel/umh.c b/kernel/umh.c > index 1b13c5d34624..51f1e1d25d49 100644 > --- a/kernel/umh.c > +++ b/kernel/umh.c > @@ -32,6 +32,9 @@ > > #include <trace/events/module.h> > > +#ifdef CONFIG_USER_NS > +static kernel_cap_t usermodehelper_userns = CAP_FULL_SET; > +#endif > static kernel_cap_t usermodehelper_bset = CAP_FULL_SET; > static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET; > static DEFINE_SPINLOCK(umh_sysctl_lock); > @@ -94,6 +97,10 @@ static int call_usermodehelper_exec_async(void *data) > new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset); > new->cap_inheritable = cap_intersect(usermodehelper_inheritable, > new->cap_inheritable); > +#ifdef CONFIG_USER_NS > + new->cap_userns = cap_intersect(usermodehelper_userns, > + new->cap_userns); > +#endif > spin_unlock(&umh_sysctl_lock); > > if (sub_info->init) { > @@ -560,6 +567,15 @@ static struct ctl_table usermodehelper_table[] = { > .mode = 0600, > .proc_handler = proc_cap_handler, > }, > +#ifdef CONFIG_USER_NS > + { > + .procname = "userns", > + .data = &usermodehelper_userns, > + .maxlen = 2 * sizeof(unsigned long), > + .mode = 0600, > + .proc_handler = proc_cap_handler, > + }, > +#endif > { } > }; > > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index 0b0b95418b16..7e624607330b 100644 > --- a/kernel/user_namespace.c > +++ 
b/kernel/user_namespace.c > @@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts) > > static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) > { > - /* Start with the same capabilities as init but useless for doing > - * anything as the capabilities are bound to the new user namespace. > - */ > - cred->securebits = SECUREBITS_DEFAULT; > + /* Start with the capabilities defined in the userns set. */ > + cred->cap_bset = cred->cap_userns; > + cred->cap_permitted = cred->cap_userns; > + cred->cap_effective = cred->cap_userns; > cred->cap_inheritable = CAP_EMPTY_SET; > - cred->cap_permitted = CAP_FULL_SET; > - cred->cap_effective = CAP_FULL_SET; > cred->cap_ambient = CAP_EMPTY_SET; > - cred->cap_bset = CAP_FULL_SET; > + cred->securebits = SECUREBITS_DEFAULT; > #ifdef CONFIG_KEYS > key_put(cred->request_key_auth); > cred->request_key_auth = NULL; > diff --git a/security/commoncap.c b/security/commoncap.c > index 162d96b3a676..b3d3372bf910 100644 > --- a/security/commoncap.c > +++ b/security/commoncap.c > @@ -228,6 +228,28 @@ static inline int cap_inh_is_capped(void) > return 1; > } > > +/* > + * Determine whether a userns capability can be raised. > + * Returns 1 if it can, 0 otherwise. > + */ > +#ifdef CONFIG_USER_NS > +static inline int cap_uns_is_raiseable(unsigned long cap) > +{ > + if (!!cap_raised(current_cred()->cap_userns, cap)) > + return 1; > + /* a capability cannot be raised unless the current task has it in > + * its bounding set and, without CAP_SETPCAP, its permitted set. 
> + */ > + if (!cap_raised(current_cred()->cap_bset, cap)) > + return 0; > + if (cap_capable(current_cred(), current_cred()->user_ns, > + CAP_SETPCAP, CAP_OPT_NONE) != 0 && > + !cap_raised(current_cred()->cap_permitted, cap)) > + return 0; > + return 1; > +} > +#endif > + > /** > * cap_capset - Validate and apply proposed changes to current's capabilities > * @new: The proposed new credentials; alterations should be made here > @@ -1382,6 +1404,43 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3, > return commit_creds(new); > } > > +#ifdef CONFIG_USER_NS > + case PR_CAP_USERNS: > + if (arg2 == PR_CAP_USERNS_CLEAR_ALL) { > + if (arg3 | arg4 | arg5) > + return -EINVAL; > + > + new = prepare_creds(); > + if (!new) > + return -ENOMEM; > + cap_clear(new->cap_userns); > + return commit_creds(new); > + } > + > + if (((!cap_valid(arg3)) | arg4 | arg5)) > + return -EINVAL; > + > + if (arg2 == PR_CAP_USERNS_IS_SET) { > + return !!cap_raised(current_cred()->cap_userns, arg3); > + } else if (arg2 != PR_CAP_USERNS_RAISE && > + arg2 != PR_CAP_USERNS_LOWER) { > + return -EINVAL; > + } else {

Sorry, I meant to say, one nit would be that this next block does not need to be in an else, since every other condition returns.

> + if (arg2 == PR_CAP_USERNS_RAISE && > + !cap_uns_is_raiseable(arg3)) > + return -EPERM; > + > + new = prepare_creds(); > + if (!new) > + return -ENOMEM; > + if (arg2 == PR_CAP_USERNS_RAISE) > + cap_raise(new->cap_userns, arg3); > + else > + cap_lower(new->cap_userns, arg3); > + return commit_creds(new); > + } > +#endif > + > default: > /* No functionality available - continue with default */ > return -ENOSYS; > diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c > index b5d5333ab330..e3670d815435 100644 > --- a/security/keys/process_keys.c > +++ b/security/keys/process_keys.c > @@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork) > new->cap_effective = old->cap_effective; > new->cap_ambient = old->cap_ambient; > new->cap_bset = old->cap_bset; > +#ifdef CONFIG_USER_NS > + new->cap_userns = old->cap_userns; > +#endif > > new->jit_keyring = old->jit_keyring; > new->thread_keyring = key_get(old->thread_keyring); > -- > 2.45.0 >
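To illustrate the nit above about the trailing else, here is a rough userspace sketch of the same dispatch written with early returns only. It is not the kernel code: model_cap_userns(), model_is_raiseable() and the USERNS_* enum are hypothetical names made up for this illustration, the 64-bit mask stands in for kernel_cap_t, and the "cap >= 64" check is only a placeholder for cap_valid().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Operation codes with the same values the patch adds to prctl.h. */
enum { USERNS_IS_SET = 1, USERNS_RAISE = 2, USERNS_LOWER = 3 };

/*
 * Hypothetical stand-in for cap_uns_is_raiseable(): in this model a bit
 * may be raised whenever it is present in the bounding set.
 */
static bool model_is_raiseable(uint64_t bset, unsigned int cap)
{
	return (bset >> cap) & 1;
}

/*
 * Model of the IS_SET/RAISE/LOWER dispatch using early returns.
 * Returns 0 or 1 for IS_SET, 0 on success, or a negative errno.
 */
static int model_cap_userns(int op, unsigned int cap, uint64_t bset,
			    uint64_t *userns)
{
	if (cap >= 64)			/* placeholder for !cap_valid() */
		return -EINVAL;

	if (op == USERNS_IS_SET)
		return (*userns >> cap) & 1;

	if (op != USERNS_RAISE && op != USERNS_LOWER)
		return -EINVAL;

	/* Every branch above returned, so this part needs no else. */
	if (op == USERNS_RAISE && !model_is_raiseable(bset, cap))
		return -EPERM;

	if (op == USERNS_RAISE)
		*userns |= UINT64_C(1) << cap;
	else
		*userns &= ~(UINT64_C(1) << cap);
	return 0;
}
```

The behaviour is meant to be identical to the if/else-if/else form; only the nesting of the final block changes.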
On 5/17/24 07:22, Eric W. Biederman wrote:
> Jonathan Calmels <jcalmels@3xx0.net> writes:
>
>> On Fri, May 17, 2024 at 06:32:46AM GMT, Eric W. Biederman wrote:
>>>
>>> Pointers please?
>>>
>>> That sentence sounds about 5 years out of date.
>>
>> The link referenced is from last year.
>> Here are some others often cited by distributions:
>>
>> https://nvd.nist.gov/vuln/detail/CVE-2022-0185
>> https://nvd.nist.gov/vuln/detail/CVE-2022-1015
>> https://nvd.nist.gov/vuln/detail/CVE-2022-2078
>> https://nvd.nist.gov/vuln/detail/CVE-2022-24122
>> https://nvd.nist.gov/vuln/detail/CVE-2022-25636
>>
>> Recent thread discussing this too:
>> https://seclists.org/oss-sec/2024/q2/128
>
> My apologies perhaps I trimmed too much.
>
> I know that user namespaces enlarge the attack surface.
> How much and how serious could be debated but for unprivileged
> users the attack surface is undoubtedly enlarged.
>
> As I read your introduction you were justifying the introduction
> of a new security mechanism with the observation that distributions
> were carrying distribution specific patches.
>
> To the best of my knowledge distribution specific patches and
> distributions disabling user namespaces have been gone for quite a
> while. So if that has changed recently I would like to know.
>

Almost all the distros are carrying the out-of-tree sysctl to disable user
namespaces. It's disabled by default but is available. Ubuntu in its 24.04
release is now limiting unprivileged use of user namespaces to known code.
At a generic code level they are allowed but with no capabilities within
the user namespace.

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 34a47fb0c57f..364e8bb19f9d 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
 	const struct cred *cred;
 	kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
 			cap_bset, cap_ambient;
+#ifdef CONFIG_USER_NS
+	kernel_cap_t cap_userns;
+#endif

 	rcu_read_lock();
 	cred = __task_cred(p);
@@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
 	cap_effective = cred->cap_effective;
 	cap_bset = cred->cap_bset;
 	cap_ambient = cred->cap_ambient;
+#ifdef CONFIG_USER_NS
+	cap_userns = cred->cap_userns;
+#endif
 	rcu_read_unlock();

 	render_cap_t(m, "CapInh:\t", &cap_inheritable);
@@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
 	render_cap_t(m, "CapEff:\t", &cap_effective);
 	render_cap_t(m, "CapBnd:\t", &cap_bset);
 	render_cap_t(m, "CapAmb:\t", &cap_ambient);
+#ifdef CONFIG_USER_NS
+	render_cap_t(m, "CapUNs:\t", &cap_userns);
+#endif
 }

 static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 2976f534a7a3..adab0031443e 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -124,6 +124,9 @@ struct cred {
 	kernel_cap_t cap_effective; /* caps we can actually use */
 	kernel_cap_t cap_bset; /* capability bounding set */
 	kernel_cap_t cap_ambient; /* Ambient capability set */
+#ifdef CONFIG_USER_NS
+	kernel_cap_t cap_userns; /* User namespace capability set */
+#endif
 #ifdef CONFIG_KEYS
 	unsigned char jit_keyring; /* default keyring to attach requested
 				    * keys to */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 370ed14b1ae0..e09475171f62 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -198,6 +198,13 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER 3
 # define PR_CAP_AMBIENT_CLEAR_ALL 4

+/* Control the userns capability set */
+#define PR_CAP_USERNS 48
+# define PR_CAP_USERNS_IS_SET 1
+# define PR_CAP_USERNS_RAISE 2
+# define PR_CAP_USERNS_LOWER 3
+# define PR_CAP_USERNS_CLEAR_ALL 4
+
 /* arm64 Scalable Vector Extension controls */
 /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
 #define PR_SVE_SET_VL 50 /* set task vector length */
diff --git a/kernel/cred.c b/kernel/cred.c
index 075cfa7c896f..9912c6f3bc6b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -56,6 +56,9 @@ struct cred init_cred = {
 	.cap_permitted = CAP_FULL_SET,
 	.cap_effective = CAP_FULL_SET,
 	.cap_bset = CAP_FULL_SET,
+#ifdef CONFIG_USER_NS
+	.cap_userns = CAP_FULL_SET,
+#endif
 	.user = INIT_USER,
 	.user_ns = &init_user_ns,
 	.group_info = &init_groups,
diff --git a/kernel/umh.c b/kernel/umh.c
index 1b13c5d34624..51f1e1d25d49 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -32,6 +32,9 @@

 #include <trace/events/module.h>

+#ifdef CONFIG_USER_NS
+static kernel_cap_t usermodehelper_userns = CAP_FULL_SET;
+#endif
 static kernel_cap_t usermodehelper_bset = CAP_FULL_SET;
 static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET;
 static DEFINE_SPINLOCK(umh_sysctl_lock);
@@ -94,6 +97,10 @@ static int call_usermodehelper_exec_async(void *data)
 	new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset);
 	new->cap_inheritable = cap_intersect(usermodehelper_inheritable,
 					     new->cap_inheritable);
+#ifdef CONFIG_USER_NS
+	new->cap_userns = cap_intersect(usermodehelper_userns,
+					new->cap_userns);
+#endif
 	spin_unlock(&umh_sysctl_lock);

 	if (sub_info->init) {
@@ -560,6 +567,15 @@ static struct ctl_table usermodehelper_table[] = {
 		.mode = 0600,
 		.proc_handler = proc_cap_handler,
 	},
+#ifdef CONFIG_USER_NS
+	{
+		.procname = "userns",
+		.data = &usermodehelper_userns,
+		.maxlen = 2 * sizeof(unsigned long),
+		.mode = 0600,
+		.proc_handler = proc_cap_handler,
+	},
+#endif
 	{ }
 };

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 0b0b95418b16..7e624607330b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts)

 static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
 {
-	/* Start with the same capabilities as init but useless for doing
-	 * anything as the capabilities are bound to the new user namespace.
-	 */
-	cred->securebits = SECUREBITS_DEFAULT;
+	/* Start with the capabilities defined in the userns set. */
+	cred->cap_bset = cred->cap_userns;
+	cred->cap_permitted = cred->cap_userns;
+	cred->cap_effective = cred->cap_userns;
 	cred->cap_inheritable = CAP_EMPTY_SET;
-	cred->cap_permitted = CAP_FULL_SET;
-	cred->cap_effective = CAP_FULL_SET;
 	cred->cap_ambient = CAP_EMPTY_SET;
-	cred->cap_bset = CAP_FULL_SET;
+	cred->securebits = SECUREBITS_DEFAULT;
 #ifdef CONFIG_KEYS
 	key_put(cred->request_key_auth);
 	cred->request_key_auth = NULL;
diff --git a/security/commoncap.c b/security/commoncap.c
index 162d96b3a676..b3d3372bf910 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -228,6 +228,28 @@ static inline int cap_inh_is_capped(void)
 	return 1;
 }

+/*
+ * Determine whether a userns capability can be raised.
+ * Returns 1 if it can, 0 otherwise.
+ */
+#ifdef CONFIG_USER_NS
+static inline int cap_uns_is_raiseable(unsigned long cap)
+{
+	if (!!cap_raised(current_cred()->cap_userns, cap))
+		return 1;
+	/* a capability cannot be raised unless the current task has it in
+	 * its bounding set and, without CAP_SETPCAP, its permitted set.
+	 */
+	if (!cap_raised(current_cred()->cap_bset, cap))
+		return 0;
+	if (cap_capable(current_cred(), current_cred()->user_ns,
+			CAP_SETPCAP, CAP_OPT_NONE) != 0 &&
+	    !cap_raised(current_cred()->cap_permitted, cap))
+		return 0;
+	return 1;
+}
+#endif
+
 /**
  * cap_capset - Validate and apply proposed changes to current's capabilities
  * @new: The proposed new credentials; alterations should be made here
@@ -1382,6 +1404,43 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 			return commit_creds(new);
 		}

+#ifdef CONFIG_USER_NS
+	case PR_CAP_USERNS:
+		if (arg2 == PR_CAP_USERNS_CLEAR_ALL) {
+			if (arg3 | arg4 | arg5)
+				return -EINVAL;
+
+			new = prepare_creds();
+			if (!new)
+				return -ENOMEM;
+			cap_clear(new->cap_userns);
+			return commit_creds(new);
+		}
+
+		if (((!cap_valid(arg3)) | arg4 | arg5))
+			return -EINVAL;
+
+		if (arg2 == PR_CAP_USERNS_IS_SET) {
+			return !!cap_raised(current_cred()->cap_userns, arg3);
+		} else if (arg2 != PR_CAP_USERNS_RAISE &&
+			   arg2 != PR_CAP_USERNS_LOWER) {
+			return -EINVAL;
+		} else {
+			if (arg2 == PR_CAP_USERNS_RAISE &&
+			    !cap_uns_is_raiseable(arg3))
+				return -EPERM;
+
+			new = prepare_creds();
+			if (!new)
+				return -ENOMEM;
+			if (arg2 == PR_CAP_USERNS_RAISE)
+				cap_raise(new->cap_userns, arg3);
+			else
+				cap_lower(new->cap_userns, arg3);
+			return commit_creds(new);
+		}
+#endif
+
 	default:
 		/* No functionality available - continue with default */
 		return -ENOSYS;
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index b5d5333ab330..e3670d815435 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork)
 	new->cap_effective = old->cap_effective;
 	new->cap_ambient = old->cap_ambient;
 	new->cap_bset = old->cap_bset;
+#ifdef CONFIG_USER_NS
+	new->cap_userns = old->cap_userns;
+#endif

 	new->jit_keyring = old->jit_keyring;
 	new->thread_keyring = key_get(old->thread_keyring);

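For readers following the diff, the interaction between cap_uns_is_raiseable() and set_cred_user_ns() can be modelled in plain userspace C. This is only a sketch under simplifying assumptions: struct model_cred and the model_* helpers are hypothetical names, capability sets are reduced to 64-bit masks, and the cap_capable() check for CAP_SETPCAP is approximated by testing the bit in the effective set.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_CAP_SETPCAP 8	/* same number as CAP_SETPCAP */

/* Hypothetical reduction of struct cred to the four sets that matter here. */
struct model_cred {
	uint64_t permitted, effective, bset, userns;
};

static bool has_cap(uint64_t set, unsigned int cap)
{
	return (set >> cap) & 1;
}

/* Mirrors the order of checks in cap_uns_is_raiseable(). */
static bool model_uns_is_raiseable(const struct model_cred *c, unsigned int cap)
{
	if (has_cap(c->userns, cap))
		return true;		/* already raised */
	if (!has_cap(c->bset, cap))
		return false;		/* never beyond the bounding set */
	/* Assumption: cap_capable(CAP_SETPCAP) modelled as an effective bit. */
	if (!has_cap(c->effective, MODEL_CAP_SETPCAP) &&
	    !has_cap(c->permitted, cap))
		return false;
	return true;
}

/* Mirrors the assignments the patch makes in set_cred_user_ns(). */
static void model_enter_userns(struct model_cred *c)
{
	c->bset = c->userns;
	c->permitted = c->userns;
	c->effective = c->userns;
}
```

In this model, once a bit has been lowered in the userns set and a namespace has been entered, the bit is also gone from the bounding set, so it cannot be raised again further down the hierarchy.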
Attackers often rely on user namespaces to get elevated (yet confined)
privileges in order to target specific subsystems (e.g. [1]). Distributions
have been pretty adamant that they need a way to configure these; most of
them carry out-of-tree patches to do so, or plainly refuse to enable them.
As a result, there have been multiple efforts over the years to introduce
various knobs to control and/or disable user namespaces (e.g. [2][3][4]).

While we acknowledge that there are already ways to control the creation of
such namespaces (the most recent being an LSM hook), there are inherent
issues with these approaches. Preventing the user namespace creation is not
fine-grained enough, and in some cases, incompatible with various userspace
expectations (e.g. container runtimes, browser sandboxing, service
isolation).

This patch addresses these limitations by introducing an additional
capability set used to restrict the permissions granted when creating user
namespaces. This way, processes can apply the principle of least privilege
by configuring only the capabilities they need for their namespaces.

For compatibility reasons, processes always start with a full userns
capability set.

On namespace creation, the userns capability set (pU) is assigned to the
new effective (pE), permitted (pP) and bounding set (X) of the task:

    pU = pE = pP = X

The userns capability set obeys the invariant that no bit can ever be set
if it is not already part of the task’s bounding set. This ensures that no
namespace can ever gain more privileges than its predecessors.
Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
in the userns set requires its corresponding bit to be set in the permitted
set. This effectively mimics the inheritable set rules and means that, by
default, only root in the initial user namespace can gain userns
capabilities:

    p’U = (pE & CAP_SETPCAP) ? X : (X & pP)

Note that since userns capabilities are strictly hierarchical, policies can
be enforced at various levels (e.g. init, pam_cap) and inherited by every
child namespace.

Here is a sample program that can be used to verify the functionality:

    /*
     * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
     *
     * ./cap_userns_test unshare -r grep Cap /proc/self/status
     * CapInh: 0000000000000000
     * CapPrm: 000001fffffdffff
     * CapEff: 000001fffffdffff
     * CapBnd: 000001fffffdffff
     * CapAmb: 0000000000000000
     * CapUNs: 000001fffffdffff
     */

    #include <err.h>
    #include <linux/capability.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0)
            err(1, "cannot drop userns cap");

        execvp(argv[1], argv + 1);
        err(1, "cannot exec");
    }

Link: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
Link: https://lore.kernel.org/lkml/1453502345-30416-1-git-send-email-keescook@chromium.org
Link: https://lore.kernel.org/lkml/20220815162028.926858-1-fred@cloudflare.com
Link: https://lore.kernel.org/containers/168547265011.24337.4306067683997517082-0@git.sr.ht

Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net>
---
 fs/proc/array.c              |  9 ++++++
 include/linux/cred.h         |  3 ++
 include/uapi/linux/prctl.h   |  7 +++++
 kernel/cred.c                |  3 ++
 kernel/umh.c                 | 16 ++++++++++
 kernel/user_namespace.c      | 12 +++-----
 security/commoncap.c         | 59 ++++++++++++++++++++++++++++++++++++
 security/keys/process_keys.c |  3 ++
 8 files changed, 105 insertions(+), 7 deletions(-)
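As a small aid for reading the sample output in the commit message, the following sketch decodes a capability mask line and checks a single bit. It is not part of the patch: parse_cap_line() and cap_in_mask() are hypothetical helpers written for this note, and the only facts assumed are the "Key:\t<hex>" layout used by /proc/<pid>/status and that CAP_SYS_RAWIO is capability number 17.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_CAP_SYS_RAWIO 17	/* same number as CAP_SYS_RAWIO */

/*
 * Parse a status-style line such as "CapUNs:\t000001fffffdffff".
 * Returns false if the line does not start with key or has no hex digits.
 */
static bool parse_cap_line(const char *line, const char *key, uint64_t *mask)
{
	size_t klen = strlen(key);
	char *end;

	if (strncmp(line, key, klen) != 0)
		return false;
	/* strtoull() skips the tab that separates key and value. */
	*mask = strtoull(line + klen, &end, 16);
	return end != line + klen;
}

/* True if capability number cap is set in mask. */
static bool cap_in_mask(uint64_t mask, unsigned int cap)
{
	return (mask >> cap) & 1;
}
```

Applied to the mask from the sample output, 000001fffffdffff, bit 17 comes out cleared, which is what dropping CAP_SYS_RAWIO should produce.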