Message ID | 20230912212906.3975866-3-andrii@kernel.org (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Paul Moore |
Headers | show |
Series | BPF token and BPF FS-based delegation | expand |
On Tue, Sep 12, 2023 at 2:30 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > Add new kind of BPF kernel object, BPF token. BPF token is meant to > allow delegating privileged BPF functionality, like loading a BPF > program or creating a BPF map, from privileged process to a *trusted* > unprivileged process, all while have a good amount of control over which > privileged operations could be performed using provided BPF token. > > This is achieved through mounting BPF FS instance with extra delegation > mount options, which determine what operations are delegatable, and also > constraining it to the owning user namespace (as mentioned in the > previous patch). > > BPF token itself is just a derivative from BPF FS and can be created > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts > a path specification (using the usual fd + string path combo) to a BPF > FS mount. Currently, BPF token "inherits" delegated command, map types, > prog type, and attach type bit sets from BPF FS as is. In the future, > having an BPF token as a separate object with its own FD, we can allow > to further restrict BPF token's allowable set of things either at the creation > time or after the fact, allowing the process to guard itself further > from, e.g., unintentionally trying to load undesired kind of BPF > programs. But for now we keep things simple and just copy bit sets as is. > > When BPF token is created from BPF FS mount, we take reference to the > BPF super block's owning user namespace, and then use that namespace for > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} > capabilities that are normally only checked against init userns (using > capable()), but now we check them using ns_capable() instead (if BPF > token is provided). See bpf_token_capable() for details. > > Such setup means that BPF token in itself is not sufficient to grant BPF > functionality. User namespaced process has to *also* have necessary > combination of capabilities inside that user namespace. So while > previously CAP_BPF was useless when granted within user namespace, now > it gains a meaning and allows container managers and sys admins to have > a flexible control over which processes can and need to use BPF > functionality within the user namespace (i.e., container in practice). > And BPF FS delegation mount options and derived BPF tokens serve as > a per-container "flag" to grant overall ability to use bpf() (plus further > restrict on which parts of bpf() syscalls are treated as namespaced). > > The alternative to creating BPF token object was: > a) not having any extra object and just pasing BPF FS path to each > relevant bpf() command. This seems suboptimal as it's racy (mount > under the same path might change in between checking it and using it > for bpf() command). And also less flexible if we'd like to further > restrict ourselves compared to all the delegated functionality > allowed on BPF FS. > b) use non-bpf() interface, e.g., ioctl(), but otherwise also create > a dedicated FD that would represent a token-like functionality. This > doesn't seem superior to having a proper bpf() command, so > BPF_TOKEN_CREATE was chosen. > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > include/linux/bpf.h | 36 +++++++ > include/uapi/linux/bpf.h | 39 +++++++ > kernel/bpf/Makefile | 2 +- > kernel/bpf/inode.c | 4 +- > kernel/bpf/syscall.c | 17 +++ > kernel/bpf/token.c | 189 +++++++++++++++++++++++++++++++++ > tools/include/uapi/linux/bpf.h | 39 +++++++ > 7 files changed, 324 insertions(+), 2 deletions(-) > create mode 100644 kernel/bpf/token.c > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index e9a3ab390844..6abd2b96e096 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -51,6 +51,8 @@ struct module; > struct bpf_func_state; > struct ftrace_ops; > struct cgroup; > +struct bpf_token; > +struct user_namespace; > > extern struct idr btf_idr; > extern spinlock_t btf_idr_lock; > @@ -1568,6 +1570,13 @@ struct bpf_mount_opts { > u64 delegate_attachs; > }; > > +struct bpf_token { > + struct work_struct work; > + atomic64_t refcnt; > + struct user_namespace *userns; > + u64 allowed_cmds; > +}; > + > struct bpf_struct_ops_value; > struct btf_member; > > @@ -2192,6 +2201,15 @@ int bpf_link_new_fd(struct bpf_link *link); > struct bpf_link *bpf_link_get_from_fd(u32 ufd); > struct bpf_link *bpf_link_get_curr_or_next(u32 *id); > > +void bpf_token_inc(struct bpf_token *token); > +void bpf_token_put(struct bpf_token *token); > +int bpf_token_create(union bpf_attr *attr); > +int bpf_token_new_fd(struct bpf_token *token); > +struct bpf_token *bpf_token_get_from_fd(u32 ufd); > + > +bool bpf_token_capable(const struct bpf_token *token, int cap); > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd); > + > int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname); > int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags); > > @@ -2551,6 +2569,24 @@ static inline int bpf_obj_get_user(const char __user *pathname, int flags) > return -EOPNOTSUPP; > } > > +static inline void bpf_token_inc(struct bpf_token *token) > +{ > +} > + > +static inline void bpf_token_put(struct bpf_token *token) > +{ > +} > + > +static inline int bpf_token_new_fd(struct bpf_token *token) > +{ > + return -EOPNOTSUPP; > +} > + > +static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd) > +{ > + return ERR_PTR(-EOPNOTSUPP); > +} > + > static inline void __dev_flush(void) > { > } > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 73b155e52204..36e98c6f8944 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -847,6 +847,37 @@ union bpf_iter_link_info { > * Returns zero on success. On error, -1 is returned and *errno* > * is set appropriately. > * > + * BPF_TOKEN_CREATE > + * Description > + * Create BPF token with embedded information about what > + * BPF-related functionality it allows: > + * - a set of allowed bpf() syscall commands; > + * - a set of allowed BPF map types to be created with > + * BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed; > + * - a set of allowed BPF program types and BPF program attach > + * types to be loaded with BPF_PROG_LOAD command, if > + * BPF_PROG_LOAD itself is allowed. > + * > + * BPF token is created (derived) from an instance of BPF FS, > + * assuming it has necessary delegation mount options specified. > + * BPF FS mount is specified with openat()-style path FD + string. > + * This BPF token can be passed as an extra parameter to various > + * bpf() syscall commands to grant BPF subsystem functionality to > + * unprivileged processes. > + * > + * When created, BPF token is "associated" with the owning > + * user namespace of BPF FS instance (super block) that it was > + * derived from, and subsequent BPF operations performed with > + * BPF token would be performing capabilities checks (i.e., > + * CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within > + * that user namespace. Without BPF token, such capabilities > + * have to be granted in init user namespace, making bpf() > + * syscall incompatible with user namespace, for the most part. > + * > + * Return > + * A new file descriptor (a nonnegative integer), or -1 if an > + * error occurred (in which case, *errno* is set appropriately). > + * > * NOTES > * eBPF objects (maps and programs) can be shared between processes. > * > @@ -901,6 +932,8 @@ enum bpf_cmd { > BPF_ITER_CREATE, > BPF_LINK_DETACH, > BPF_PROG_BIND_MAP, > + BPF_TOKEN_CREATE, > + __MAX_BPF_CMD, > }; > > enum bpf_map_type { > @@ -1694,6 +1727,12 @@ union bpf_attr { > __u32 flags; /* extra flags */ > } prog_bind_map; > > + struct { /* struct used by BPF_TOKEN_CREATE command */ > + __u32 flags; > + __u32 bpffs_path_fd; > + __u64 bpffs_pathname; > + } token_create; > + > } __attribute__((aligned(8))); > > /* The description below is an attempt at providing documentation to eBPF > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > index f526b7573e97..4ce95acfcaa7 100644 > --- a/kernel/bpf/Makefile > +++ b/kernel/bpf/Makefile > @@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse > endif > CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy) > > -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o > +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o > obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o > obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o > obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c > index 8f66b57d3546..82f11fbffd3e 100644 > --- a/kernel/bpf/inode.c > +++ b/kernel/bpf/inode.c > @@ -603,11 +603,13 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root) > { > struct bpf_mount_opts *opts = root->d_sb->s_fs_info; > umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX; > + u64 mask; > > if (mode != S_IRWXUGO) > seq_printf(m, ",mode=%o", mode); > > - if (opts->delegate_cmds == ~0ULL) > + mask = (1ULL << __MAX_BPF_CMD) - 1; > + if ((opts->delegate_cmds & mask) == mask) > seq_printf(m, ",delegate_cmds=any"); > else if (opts->delegate_cmds) > seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds); > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 6a692f3bea15..4fae678c1f48 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -5297,6 +5297,20 @@ static int bpf_prog_bind_map(union bpf_attr *attr) > return ret; > } > > +#define BPF_TOKEN_CREATE_LAST_FIELD token_create.bpffs_pathname > + > +static int token_create(union bpf_attr *attr) > +{ > + if (CHECK_ATTR(BPF_TOKEN_CREATE)) > + return -EINVAL; > + > + /* no flags are supported yet */ > + if (attr->token_create.flags) > + return -EINVAL; A question to people looking at this: should BPF_TOKEN_CREATE be guarded with ns_capable(CAP_BPF) or it's fine to rely on FS permissions alone for this? It can't be capable(CAP_BPF), obviously, but having it guarded by ns_capable(CAP_BPF) would make it impossible to even construct a BPF token without having namespaced CAP_BPF inside the container. > + > + return bpf_token_create(attr); > +} > + > static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size) > { > union bpf_attr attr; > @@ -5430,6 +5444,9 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size) > case BPF_PROG_BIND_MAP: > err = bpf_prog_bind_map(&attr); > break; > + case BPF_TOKEN_CREATE: > + err = token_create(&attr); > + break; > default: > err = -EINVAL; > break; > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c > new file mode 100644 > index 000000000000..f6ea3eddbee6 > --- /dev/null > +++ b/kernel/bpf/token.c > @@ -0,0 +1,189 @@ > +#include <linux/bpf.h> > +#include <linux/vmalloc.h> > +#include <linux/anon_inodes.h> > +#include <linux/fdtable.h> > +#include <linux/file.h> > +#include <linux/fs.h> > +#include <linux/kernel.h> > +#include <linux/idr.h> > +#include <linux/namei.h> > +#include <linux/user_namespace.h> > + > +bool bpf_token_capable(const struct bpf_token *token, int cap) > +{ > + /* BPF token allows ns_capable() level of capabilities */ > + if (token) { > + if (ns_capable(token->userns, cap)) > + return true; > + if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN)) > + return true; > + } > + /* otherwise fallback to capable() checks */ > + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN)); > +} > + > +void bpf_token_inc(struct bpf_token *token) > +{ > + atomic64_inc(&token->refcnt); > +} > + > +static void bpf_token_free(struct bpf_token *token) > +{ > + put_user_ns(token->userns); > + kvfree(token); > +} > + > +static void bpf_token_put_deferred(struct work_struct *work) > +{ > + struct bpf_token *token = container_of(work, struct bpf_token, work); > + > + bpf_token_free(token); > +} > + > +void bpf_token_put(struct bpf_token *token) > +{ > + if (!token) > + return; > + > + if (!atomic64_dec_and_test(&token->refcnt)) > + return; > + > + INIT_WORK(&token->work, bpf_token_put_deferred); > + schedule_work(&token->work); > +} > + > +static int bpf_token_release(struct inode *inode, struct file *filp) > +{ > + struct bpf_token *token = filp->private_data; > + > + bpf_token_put(token); > + return 0; > +} > + > +static ssize_t bpf_dummy_read(struct file *filp, char __user *buf, size_t siz, > + loff_t *ppos) > +{ > + /* We need this handler such that alloc_file() enables > + * f_mode with FMODE_CAN_READ. > + */ > + return -EINVAL; > +} > + > +static ssize_t bpf_dummy_write(struct file *filp, const char __user *buf, > + size_t siz, loff_t *ppos) > +{ > + /* We need this handler such that alloc_file() enables > + * f_mode with FMODE_CAN_WRITE. > + */ > + return -EINVAL; > +} > + > +static void bpf_token_show_fdinfo(struct seq_file *m, struct file *filp) > +{ > + struct bpf_token *token = filp->private_data; > + u64 mask; > + > + mask = (1ULL << __MAX_BPF_CMD) - 1; > + if ((token->allowed_cmds & mask) == mask) > + seq_printf(m, "allowed_cmds:\tany\n"); > + else > + seq_printf(m, "allowed_cmds:\t0x%llx\n", token->allowed_cmds); > +} > + > +static const struct file_operations bpf_token_fops = { > + .release = bpf_token_release, > + .read = bpf_dummy_read, > + .write = bpf_dummy_write, > + .show_fdinfo = bpf_token_show_fdinfo, > +}; > + > +static struct bpf_token *bpf_token_alloc(void) > +{ > + struct bpf_token *token; > + > + token = kvzalloc(sizeof(*token), GFP_USER); > + if (!token) > + return NULL; > + > + atomic64_set(&token->refcnt, 1); > + > + return token; > +} > + > +int bpf_token_create(union bpf_attr *attr) > +{ > + struct path path; > + struct bpf_mount_opts *mnt_opts; > + struct bpf_token *token; > + int ret; > + > + ret = user_path_at(attr->token_create.bpffs_path_fd, > + u64_to_user_ptr(attr->token_create.bpffs_pathname), > + LOOKUP_FOLLOW | LOOKUP_EMPTY, &path); > + if (ret) > + return ret; > + > + if (path.mnt->mnt_root != path.dentry) { > + ret = -EINVAL; > + goto out; > + } > + ret = path_permission(&path, MAY_ACCESS); > + if (ret) > + goto out; > + > + token = bpf_token_alloc(); > + if (!token) { > + ret = -ENOMEM; > + goto out; > + } > + > + /* remember bpffs owning userns for future ns_capable() checks */ > + token->userns = get_user_ns(path.dentry->d_sb->s_user_ns); > + > + mnt_opts = path.dentry->d_sb->s_fs_info; > + token->allowed_cmds = mnt_opts->delegate_cmds; > + > + ret = bpf_token_new_fd(token); > + if (ret < 0) > + bpf_token_free(token); > +out: > + path_put(&path); > + return ret; > +} > + > +#define BPF_TOKEN_INODE_NAME "bpf-token" > + > +/* Alloc anon_inode and FD for prepared token. > + * Returns fd >= 0 on success; negative error, otherwise. > + */ > +int bpf_token_new_fd(struct bpf_token *token) > +{ > + return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC); > +} > + > +struct bpf_token *bpf_token_get_from_fd(u32 ufd) > +{ > + struct fd f = fdget(ufd); > + struct bpf_token *token; > + > + if (!f.file) > + return ERR_PTR(-EBADF); > + if (f.file->f_op != &bpf_token_fops) { > + fdput(f); > + return ERR_PTR(-EINVAL); > + } > + > + token = f.file->private_data; > + bpf_token_inc(token); > + fdput(f); > + > + return token; > +} > + > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd) > +{ > + if (!token) > + return false; > + > + return token->allowed_cmds & (1ULL << cmd); > +} > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > index 73b155e52204..36e98c6f8944 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -847,6 +847,37 @@ union bpf_iter_link_info { > * Returns zero on success. On error, -1 is returned and *errno* > * is set appropriately. > * > + * BPF_TOKEN_CREATE > + * Description > + * Create BPF token with embedded information about what > + * BPF-related functionality it allows: > + * - a set of allowed bpf() syscall commands; > + * - a set of allowed BPF map types to be created with > + * BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed; > + * - a set of allowed BPF program types and BPF program attach > + * types to be loaded with BPF_PROG_LOAD command, if > + * BPF_PROG_LOAD itself is allowed. > + * > + * BPF token is created (derived) from an instance of BPF FS, > + * assuming it has necessary delegation mount options specified. > + * BPF FS mount is specified with openat()-style path FD + string. > + * This BPF token can be passed as an extra parameter to various > + * bpf() syscall commands to grant BPF subsystem functionality to > + * unprivileged processes. > + * > + * When created, BPF token is "associated" with the owning > + * user namespace of BPF FS instance (super block) that it was > + * derived from, and subsequent BPF operations performed with > + * BPF token would be performing capabilities checks (i.e., > + * CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within > + * that user namespace. Without BPF token, such capabilities > + * have to be granted in init user namespace, making bpf() > + * syscall incompatible with user namespace, for the most part. > + * > + * Return > + * A new file descriptor (a nonnegative integer), or -1 if an > + * error occurred (in which case, *errno* is set appropriately). > + * > * NOTES > * eBPF objects (maps and programs) can be shared between processes. > * > @@ -901,6 +932,8 @@ enum bpf_cmd { > BPF_ITER_CREATE, > BPF_LINK_DETACH, > BPF_PROG_BIND_MAP, > + BPF_TOKEN_CREATE, > + __MAX_BPF_CMD, > }; > > enum bpf_map_type { > @@ -1694,6 +1727,12 @@ union bpf_attr { > __u32 flags; /* extra flags */ > } prog_bind_map; > > + struct { /* struct used by BPF_TOKEN_CREATE command */ > + __u32 flags; > + __u32 bpffs_path_fd; > + __u64 bpffs_pathname; > + } token_create; > + > } __attribute__((aligned(8))); > > /* The description below is an attempt at providing documentation to eBPF > -- > 2.34.1 > >
On Sep 12, 2023 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > Add new kind of BPF kernel object, BPF token. BPF token is meant to > allow delegating privileged BPF functionality, like loading a BPF > program or creating a BPF map, from privileged process to a *trusted* > unprivileged process, all while have a good amount of control over which > privileged operations could be performed using provided BPF token. > > This is achieved through mounting BPF FS instance with extra delegation > mount options, which determine what operations are delegatable, and also > constraining it to the owning user namespace (as mentioned in the > previous patch). > > BPF token itself is just a derivative from BPF FS and can be created > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts > a path specification (using the usual fd + string path combo) to a BPF > FS mount. Currently, BPF token "inherits" delegated command, map types, > prog type, and attach type bit sets from BPF FS as is. In the future, > having an BPF token as a separate object with its own FD, we can allow > to further restrict BPF token's allowable set of things either at the creation > time or after the fact, allowing the process to guard itself further > from, e.g., unintentionally trying to load undesired kind of BPF > programs. But for now we keep things simple and just copy bit sets as is. > > When BPF token is created from BPF FS mount, we take reference to the > BPF super block's owning user namespace, and then use that namespace for > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} > capabilities that are normally only checked against init userns (using > capable()), but now we check them using ns_capable() instead (if BPF > token is provided). See bpf_token_capable() for details. > > Such setup means that BPF token in itself is not sufficient to grant BPF > functionality. User namespaced process has to *also* have necessary > combination of capabilities inside that user namespace. So while > previously CAP_BPF was useless when granted within user namespace, now > it gains a meaning and allows container managers and sys admins to have > a flexible control over which processes can and need to use BPF > functionality within the user namespace (i.e., container in practice). > And BPF FS delegation mount options and derived BPF tokens serve as > a per-container "flag" to grant overall ability to use bpf() (plus further > restrict on which parts of bpf() syscalls are treated as namespaced). > > The alternative to creating BPF token object was: > a) not having any extra object and just pasing BPF FS path to each > relevant bpf() command. This seems suboptimal as it's racy (mount > under the same path might change in between checking it and using it > for bpf() command). And also less flexible if we'd like to further > restrict ourselves compared to all the delegated functionality > allowed on BPF FS. > b) use non-bpf() interface, e.g., ioctl(), but otherwise also create > a dedicated FD that would represent a token-like functionality. This > doesn't seem superior to having a proper bpf() command, so > BPF_TOKEN_CREATE was chosen. > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > include/linux/bpf.h | 36 +++++++ > include/uapi/linux/bpf.h | 39 +++++++ > kernel/bpf/Makefile | 2 +- > kernel/bpf/inode.c | 4 +- > kernel/bpf/syscall.c | 17 +++ > kernel/bpf/token.c | 189 +++++++++++++++++++++++++++++++++ > tools/include/uapi/linux/bpf.h | 39 +++++++ > 7 files changed, 324 insertions(+), 2 deletions(-) > create mode 100644 kernel/bpf/token.c ... > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c > new file mode 100644 > index 000000000000..f6ea3eddbee6 > --- /dev/null > +++ b/kernel/bpf/token.c > @@ -0,0 +1,189 @@ > +#include <linux/bpf.h> > +#include <linux/vmalloc.h> > +#include <linux/anon_inodes.h> > +#include <linux/fdtable.h> > +#include <linux/file.h> > +#include <linux/fs.h> > +#include <linux/kernel.h> > +#include <linux/idr.h> > +#include <linux/namei.h> > +#include <linux/user_namespace.h> > + > +bool bpf_token_capable(const struct bpf_token *token, int cap) > +{ > + /* BPF token allows ns_capable() level of capabilities */ > + if (token) { > + if (ns_capable(token->userns, cap)) > + return true; > + if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN)) > + return true; > + } > + /* otherwise fallback to capable() checks */ > + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN)); > +} While the above looks to be equivalent to the bpf_capable() function it replaces, for callers checking CAP_BPF and CAP_SYS_ADMIN, I'm looking quickly at patch 3/12 and this is also being used to replace a capable(CAP_NET_ADMIN) call which results in a change in behavior. The current code which performs a capable(CAP_NET_ADMIN) check cannot be satisfied by CAP_SYS_ADMIN, but this patchset using bpf_token_capable(token, CAP_NET_ADMIN) can be satisfied by either CAP_NET_ADMIN or CAP_SYS_ADMIN. It seems that while bpf_token_capable() can be used as a replacement for bpf_capable(), it is not currently a suitable replacement for a generic capable() call. Perhaps this is intentional, but I didn't see it mentioned in the commit description, or in the comments, and I wanted to make sure it wasn't an oversight. > +void bpf_token_inc(struct bpf_token *token) > +{ > + atomic64_inc(&token->refcnt); > +} > + > +static void bpf_token_free(struct bpf_token *token) > +{ > + put_user_ns(token->userns); > + kvfree(token); > +} > + > +static void bpf_token_put_deferred(struct work_struct *work) > +{ > + struct bpf_token *token = container_of(work, struct bpf_token, work); > + > + bpf_token_free(token); > +} > + > +void bpf_token_put(struct bpf_token *token) > +{ > + if (!token) > + return; > + > + if (!atomic64_dec_and_test(&token->refcnt)) > + return; > + > + INIT_WORK(&token->work, bpf_token_put_deferred); > + schedule_work(&token->work); > +} > + > +static int bpf_token_release(struct inode *inode, struct file *filp) > +{ > + struct bpf_token *token = filp->private_data; > + > + bpf_token_put(token); > + return 0; > +} > + > +static ssize_t bpf_dummy_read(struct file *filp, char __user *buf, size_t siz, > + loff_t *ppos) > +{ > + /* We need this handler such that alloc_file() enables > + * f_mode with FMODE_CAN_READ. > + */ > + return -EINVAL; > +} > + > +static ssize_t bpf_dummy_write(struct file *filp, const char __user *buf, > + size_t siz, loff_t *ppos) > +{ > + /* We need this handler such that alloc_file() enables > + * f_mode with FMODE_CAN_WRITE. > + */ > + return -EINVAL; > +} > + > +static void bpf_token_show_fdinfo(struct seq_file *m, struct file *filp) > +{ > + struct bpf_token *token = filp->private_data; > + u64 mask; > + > + mask = (1ULL << __MAX_BPF_CMD) - 1; > + if ((token->allowed_cmds & mask) == mask) > + seq_printf(m, "allowed_cmds:\tany\n"); > + else > + seq_printf(m, "allowed_cmds:\t0x%llx\n", token->allowed_cmds); > +} > + > +static const struct file_operations bpf_token_fops = { > + .release = bpf_token_release, > + .read = bpf_dummy_read, > + .write = bpf_dummy_write, > + .show_fdinfo = bpf_token_show_fdinfo, > +}; > + > +static struct bpf_token *bpf_token_alloc(void) > +{ > + struct bpf_token *token; > + > + token = kvzalloc(sizeof(*token), GFP_USER); > + if (!token) > + return NULL; > + > + atomic64_set(&token->refcnt, 1); > + > + return token; > +} > + > +int bpf_token_create(union bpf_attr *attr) > +{ > + struct path path; > + struct bpf_mount_opts *mnt_opts; > + struct bpf_token *token; > + int ret; > + > + ret = user_path_at(attr->token_create.bpffs_path_fd, > + u64_to_user_ptr(attr->token_create.bpffs_pathname), > + LOOKUP_FOLLOW | LOOKUP_EMPTY, &path); > + if (ret) > + return ret; > + > + if (path.mnt->mnt_root != path.dentry) { > + ret = -EINVAL; > + goto out; > + } > + ret = path_permission(&path, MAY_ACCESS); > + if (ret) > + goto out; > + > + token = bpf_token_alloc(); > + if (!token) { > + ret = -ENOMEM; > + goto out; > + } > + > + /* remember bpffs owning userns for future ns_capable() checks */ > + token->userns = get_user_ns(path.dentry->d_sb->s_user_ns); > + > + mnt_opts = path.dentry->d_sb->s_fs_info; > + token->allowed_cmds = mnt_opts->delegate_cmds; > + > + ret = bpf_token_new_fd(token); > + if (ret < 0) > + bpf_token_free(token); > +out: > + path_put(&path); > + return ret; > +} > + > +#define BPF_TOKEN_INODE_NAME "bpf-token" > + > +/* Alloc anon_inode and FD for prepared token. > + * Returns fd >= 0 on success; negative error, otherwise. > + */ > +int bpf_token_new_fd(struct bpf_token *token) > +{ > + return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC); > +} > + > +struct bpf_token *bpf_token_get_from_fd(u32 ufd) > +{ > + struct fd f = fdget(ufd); > + struct bpf_token *token; > + > + if (!f.file) > + return ERR_PTR(-EBADF); > + if (f.file->f_op != &bpf_token_fops) { > + fdput(f); > + return ERR_PTR(-EINVAL); > + } > + > + token = f.file->private_data; > + bpf_token_inc(token); > + fdput(f); > + > + return token; > +} > + > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd) > +{ > + if (!token) > + return false; > + > + return token->allowed_cmds & (1ULL << cmd); > +} I mentioned this a while back, likely in the other threads where this token-based approach was only being discussed in general terms, but I think we want to have a LSM hook at the point of initial token delegation for this and a hook when the token is used. My initial thinking is that we should be able to address the former with a hook in bpf_fill_super() and the latter either in bpf_token_get_from_fd() or bpf_token_allow_XXX(); bpf_token_get_from_fd() would be simpler, but it doesn't allow for much in the way of granularity. Inserting the LSM hooks in bpf_token_allow_XXX() would also allow the BPF code to fall gracefully fallback to the system-wide checks if the LSM denied the requested access whereas an access denial in bpf_token_get_from_fd() denial would cause the operation to error out. -- paul-moore.com
On Wed, Sep 13, 2023 at 2:46 PM Paul Moore <paul@paul-moore.com> wrote: > > On Sep 12, 2023 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to > > allow delegating privileged BPF functionality, like loading a BPF > > program or creating a BPF map, from privileged process to a *trusted* > > unprivileged process, all while have a good amount of control over which > > privileged operations could be performed using provided BPF token. > > > > This is achieved through mounting BPF FS instance with extra delegation > > mount options, which determine what operations are delegatable, and also > > constraining it to the owning user namespace (as mentioned in the > > previous patch). > > > > BPF token itself is just a derivative from BPF FS and can be created > > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts > > a path specification (using the usual fd + string path combo) to a BPF > > FS mount. Currently, BPF token "inherits" delegated command, map types, > > prog type, and attach type bit sets from BPF FS as is. In the future, > > having an BPF token as a separate object with its own FD, we can allow > > to further restrict BPF token's allowable set of things either at the creation > > time or after the fact, allowing the process to guard itself further > > from, e.g., unintentionally trying to load undesired kind of BPF > > programs. But for now we keep things simple and just copy bit sets as is. > > > > When BPF token is created from BPF FS mount, we take reference to the > > BPF super block's owning user namespace, and then use that namespace for > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} > > capabilities that are normally only checked against init userns (using > > capable()), but now we check them using ns_capable() instead (if BPF > > token is provided). See bpf_token_capable() for details. > > > > Such setup means that BPF token in itself is not sufficient to grant BPF > > functionality. User namespaced process has to *also* have necessary > > combination of capabilities inside that user namespace. So while > > previously CAP_BPF was useless when granted within user namespace, now > > it gains a meaning and allows container managers and sys admins to have > > a flexible control over which processes can and need to use BPF > > functionality within the user namespace (i.e., container in practice). > > And BPF FS delegation mount options and derived BPF tokens serve as > > a per-container "flag" to grant overall ability to use bpf() (plus further > > restrict on which parts of bpf() syscalls are treated as namespaced). > > > > The alternative to creating BPF token object was: > > a) not having any extra object and just pasing BPF FS path to each > > relevant bpf() command. This seems suboptimal as it's racy (mount > > under the same path might change in between checking it and using it > > for bpf() command). And also less flexible if we'd like to further > > restrict ourselves compared to all the delegated functionality > > allowed on BPF FS. > > b) use non-bpf() interface, e.g., ioctl(), but otherwise also create > > a dedicated FD that would represent a token-like functionality. This > > doesn't seem superior to having a proper bpf() command, so > > BPF_TOKEN_CREATE was chosen. > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > --- > > include/linux/bpf.h | 36 +++++++ > > include/uapi/linux/bpf.h | 39 +++++++ > > kernel/bpf/Makefile | 2 +- > > kernel/bpf/inode.c | 4 +- > > kernel/bpf/syscall.c | 17 +++ > > kernel/bpf/token.c | 189 +++++++++++++++++++++++++++++++++ > > tools/include/uapi/linux/bpf.h | 39 +++++++ > > 7 files changed, 324 insertions(+), 2 deletions(-) > > create mode 100644 kernel/bpf/token.c > > ... > > > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c > > new file mode 100644 > > index 000000000000..f6ea3eddbee6 > > --- /dev/null > > +++ b/kernel/bpf/token.c > > @@ -0,0 +1,189 @@ > > +#include <linux/bpf.h> > > +#include <linux/vmalloc.h> > > +#include <linux/anon_inodes.h> > > +#include <linux/fdtable.h> > > +#include <linux/file.h> > > +#include <linux/fs.h> > > +#include <linux/kernel.h> > > +#include <linux/idr.h> > > +#include <linux/namei.h> > > +#include <linux/user_namespace.h> > > + > > +bool bpf_token_capable(const struct bpf_token *token, int cap) > > +{ > > + /* BPF token allows ns_capable() level of capabilities */ > > + if (token) { > > + if (ns_capable(token->userns, cap)) > > + return true; > > + if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN)) > > + return true; > > + } > > + /* otherwise fallback to capable() checks */ > > + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN)); > > +} > > While the above looks to be equivalent to the bpf_capable() function it > replaces, for callers checking CAP_BPF and CAP_SYS_ADMIN, I'm looking > quickly at patch 3/12 and this is also being used to replace a > capable(CAP_NET_ADMIN) call which results in a change in behavior. > The current code which performs a capable(CAP_NET_ADMIN) check cannot > be satisfied by CAP_SYS_ADMIN, but this patchset using > bpf_token_capable(token, CAP_NET_ADMIN) can be satisfied by either > CAP_NET_ADMIN or CAP_SYS_ADMIN. > > It seems that while bpf_token_capable() can be used as a replacement > for bpf_capable(), it is not currently a suitable replacement for a > generic capable() call. Perhaps this is intentional, but I didn't see > it mentioned in the commit description, or in the comments, and I > wanted to make sure it wasn't an oversight. You are right. It is an intentional attempt to unify all such checks. If you look at bpf_prog_load(), we have this: if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && !capable(CAP_SYS_ADMIN)) return -EPERM; So seeing that, I realized that we did have an intent to always use CAP_SYS_ADMIN as a "fallback" cap, even for CAP_NET_ADMIN when it comes to using network-enabled BPF programs. So I decided that unifying all this makes sense. I'll add a comment mentioning this, I should have been more explicit from the get go. > > > +void bpf_token_inc(struct bpf_token *token) > > +{ > > + atomic64_inc(&token->refcnt); > > +} > > + [...] > > +#define BPF_TOKEN_INODE_NAME "bpf-token" > > + > > +/* Alloc anon_inode and FD for prepared token. > > + * Returns fd >= 0 on success; negative error, otherwise. > > + */ > > +int bpf_token_new_fd(struct bpf_token *token) > > +{ > > + return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC); > > +} > > + > > +struct bpf_token *bpf_token_get_from_fd(u32 ufd) > > +{ > > + struct fd f = fdget(ufd); > > + struct bpf_token *token; > > + > > + if (!f.file) > > + return ERR_PTR(-EBADF); > > + if (f.file->f_op != &bpf_token_fops) { > > + fdput(f); > > + return ERR_PTR(-EINVAL); > > + } > > + > > + token = f.file->private_data; > > + bpf_token_inc(token); > > + fdput(f); > > + > > + return token; > > +} > > + > > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd) > > +{ > > + if (!token) > > + return false; > > + > > + return token->allowed_cmds & (1ULL << cmd); > > +} > > I mentioned this a while back, likely in the other threads where this > token-based approach was only being discussed in general terms, but I > think we want to have a LSM hook at the point of initial token > delegation for this and a hook when the token is used. My initial > thinking is that we should be able to address the former with a hook > in bpf_fill_super() and the latter either in bpf_token_get_from_fd() > or bpf_token_allow_XXX(); bpf_token_get_from_fd() would be simpler, > but it doesn't allow for much in the way of granularity. Inserting the > LSM hooks in bpf_token_allow_XXX() would also allow the BPF code to fall > gracefully fallback to the system-wide checks if the LSM denied the > requested access whereas an access denial in bpf_token_get_from_fd() > denial would cause the operation to error out. I think the bpf_fill_super() LSM hook makes sense, but I thought someone mentioned that we already have some generic LSM hook for validating mounts? If we don't, I can certainly add one for BPF FS specifically. As for the bpf_token_allow_xxx(). This feels a bit too specific and narrow-focused. What if we later add yet another dimension for BPF FS and token? Do we need to introduce yet another LSM for each such case? But also see bpf_prog_load(). There are two checks, allow_prog_type and allow_attach_type, which are really only meaningful in combination. And yet you'd have to have two separate LSM hooks for that. So I feel like the better approach is less mechanistically concentrating on BPF token operations themselves, but rather on more semantically meaningful operations that are token-enabled. E.g., protect BPF program loading, BPF map creation, BTF loading, etc. And we do have such LSM hooks right now, though they might not be the most convenient. So perhaps the right move is to add new ones that would provide a bit more context (e.g., we can pass in the BPF token that was used for the operation, attributes with which map/prog was created, etc). Low-level token LSMs seem hard to use cohesively in practice, though. WDYT? > > -- > paul-moore.com
On Thu, Sep 14, 2023 at 1:31 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > On Wed, Sep 13, 2023 at 2:46 PM Paul Moore <paul@paul-moore.com> wrote: > > > > On Sep 12, 2023 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > > > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to > > > allow delegating privileged BPF functionality, like loading a BPF > > > program or creating a BPF map, from privileged process to a *trusted* > > > unprivileged process, all while have a good amount of control over which > > > privileged operations could be performed using provided BPF token. > > > > > > This is achieved through mounting BPF FS instance with extra delegation > > > mount options, which determine what operations are delegatable, and also > > > constraining it to the owning user namespace (as mentioned in the > > > previous patch). > > > > > > BPF token itself is just a derivative from BPF FS and can be created > > > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts > > > a path specification (using the usual fd + string path combo) to a BPF > > > FS mount. Currently, BPF token "inherits" delegated command, map types, > > > prog type, and attach type bit sets from BPF FS as is. In the future, > > > having an BPF token as a separate object with its own FD, we can allow > > > to further restrict BPF token's allowable set of things either at the creation > > > time or after the fact, allowing the process to guard itself further > > > from, e.g., unintentionally trying to load undesired kind of BPF > > > programs. But for now we keep things simple and just copy bit sets as is. > > > > > > When BPF token is created from BPF FS mount, we take reference to the > > > BPF super block's owning user namespace, and then use that namespace for > > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} > > > capabilities that are normally only checked against init userns (using > > > capable()), but now we check them using ns_capable() instead (if BPF > > > token is provided). See bpf_token_capable() for details. > > > > > > Such setup means that BPF token in itself is not sufficient to grant BPF > > > functionality. User namespaced process has to *also* have necessary > > > combination of capabilities inside that user namespace. So while > > > previously CAP_BPF was useless when granted within user namespace, now > > > it gains a meaning and allows container managers and sys admins to have > > > a flexible control over which processes can and need to use BPF > > > functionality within the user namespace (i.e., container in practice). > > > And BPF FS delegation mount options and derived BPF tokens serve as > > > a per-container "flag" to grant overall ability to use bpf() (plus further > > > restrict on which parts of bpf() syscalls are treated as namespaced). > > > > > > The alternative to creating BPF token object was: > > > a) not having any extra object and just pasing BPF FS path to each > > > relevant bpf() command. This seems suboptimal as it's racy (mount > > > under the same path might change in between checking it and using it > > > for bpf() command). And also less flexible if we'd like to further > > > restrict ourselves compared to all the delegated functionality > > > allowed on BPF FS. > > > b) use non-bpf() interface, e.g., ioctl(), but otherwise also create > > > a dedicated FD that would represent a token-like functionality. This > > > doesn't seem superior to having a proper bpf() command, so > > > BPF_TOKEN_CREATE was chosen. > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > --- > > > include/linux/bpf.h | 36 +++++++ > > > include/uapi/linux/bpf.h | 39 +++++++ > > > kernel/bpf/Makefile | 2 +- > > > kernel/bpf/inode.c | 4 +- > > > kernel/bpf/syscall.c | 17 +++ > > > kernel/bpf/token.c | 189 +++++++++++++++++++++++++++++++++ > > > tools/include/uapi/linux/bpf.h | 39 +++++++ > > > 7 files changed, 324 insertions(+), 2 deletions(-) > > > create mode 100644 kernel/bpf/token.c > > > > ... > > > > > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c > > > new file mode 100644 > > > index 000000000000..f6ea3eddbee6 > > > --- /dev/null > > > +++ b/kernel/bpf/token.c > > > @@ -0,0 +1,189 @@ > > > +#include <linux/bpf.h> > > > +#include <linux/vmalloc.h> > > > +#include <linux/anon_inodes.h> > > > +#include <linux/fdtable.h> > > > +#include <linux/file.h> > > > +#include <linux/fs.h> > > > +#include <linux/kernel.h> > > > +#include <linux/idr.h> > > > +#include <linux/namei.h> > > > +#include <linux/user_namespace.h> > > > + > > > +bool bpf_token_capable(const struct bpf_token *token, int cap) > > > +{ > > > + /* BPF token allows ns_capable() level of capabilities */ > > > + if (token) { > > > + if (ns_capable(token->userns, cap)) > > > + return true; > > > + if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN)) > > > + return true; > > > + } > > > + /* otherwise fallback to capable() checks */ > > > + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN)); > > > +} > > > > While the above looks to be equivalent to the bpf_capable() function it > > replaces, for callers checking CAP_BPF and CAP_SYS_ADMIN, I'm looking > > quickly at patch 3/12 and this is also being used to replace a > > capable(CAP_NET_ADMIN) call which results in a change in behavior. > > The current code which performs a capable(CAP_NET_ADMIN) check cannot > > be satisfied by CAP_SYS_ADMIN, but this patchset using > > bpf_token_capable(token, CAP_NET_ADMIN) can be satisfied by either > > CAP_NET_ADMIN or CAP_SYS_ADMIN. > > > > It seems that while bpf_token_capable() can be used as a replacement > > for bpf_capable(), it is not currently a suitable replacement for a > > generic capable() call. Perhaps this is intentional, but I didn't see > > it mentioned in the commit description, or in the comments, and I > > wanted to make sure it wasn't an oversight. > > You are right. It is an intentional attempt to unify all such checks. > If you look at bpf_prog_load(), we have this: > > if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && > !capable(CAP_SYS_ADMIN)) > return -EPERM; > > So seeing that, I realized that we did have an intent to always use > CAP_SYS_ADMIN as a "fallback" cap, even for CAP_NET_ADMIN when it > comes to using network-enabled BPF programs. So I decided that > unifying all this makes sense. > > I'll add a comment mentioning this, I should have been more explicit > from the get go. Thanks for the clarification. I'm not to worried about checking CAP_SYS_ADMIN as a fallback, but I always get a little twitchy when I see capability changes in the code without any mention. A mention in the commit description is good, and you could also draft up a standalone patch that adds the CAP_SYS_ADMIN fallback to the current in-tree code. That would be a good way to really highlight the capability changes and deal with any issues that might arise (review, odd corner cases?, etc.) prior to the BPF capability delegation patcheset we are discussing here. > > > +#define BPF_TOKEN_INODE_NAME "bpf-token" > > > + > > > +/* Alloc anon_inode and FD for prepared token. > > > + * Returns fd >= 0 on success; negative error, otherwise. > > > + */ > > > +int bpf_token_new_fd(struct bpf_token *token) > > > +{ > > > + return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC); > > > +} > > > + > > > +struct bpf_token *bpf_token_get_from_fd(u32 ufd) > > > +{ > > > + struct fd f = fdget(ufd); > > > + struct bpf_token *token; > > > + > > > + if (!f.file) > > > + return ERR_PTR(-EBADF); > > > + if (f.file->f_op != &bpf_token_fops) { > > > + fdput(f); > > > + return ERR_PTR(-EINVAL); > > > + } > > > + > > > + token = f.file->private_data; > > > + bpf_token_inc(token); > > > + fdput(f); > > > + > > > + return token; > > > +} > > > + > > > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd) > > > +{ > > > + if (!token) > > > + return false; > > > + > > > + return token->allowed_cmds & (1ULL << cmd); > > > +} > > > > I mentioned this a while back, likely in the other threads where this > > token-based approach was only being discussed in general terms, but I > > think we want to have a LSM hook at the point of initial token > > delegation for this and a hook when the token is used. My initial > > thinking is that we should be able to address the former with a hook > > in bpf_fill_super() and the latter either in bpf_token_get_from_fd() > > or bpf_token_allow_XXX(); bpf_token_get_from_fd() would be simpler, > > but it doesn't allow for much in the way of granularity. Inserting the > > LSM hooks in bpf_token_allow_XXX() would also allow the BPF code to fall > > gracefully fallback to the system-wide checks if the LSM denied the > > requested access whereas an access denial in bpf_token_get_from_fd() > > denial would cause the operation to error out. > > I think the bpf_fill_super() LSM hook makes sense, but I thought > someone mentioned that we already have some generic LSM hook for > validating mounts? If we don't, I can certainly add one for BPF FS > specifically. We do have security_sb_mount(), but that is a generic mount operation access control and not well suited for controlling the mount-based capability delegation that you are proposing here. However, if you or someone else has a clever way to make security_sb_mount() work for this purpose I would be very happy to review that code. > As for the bpf_token_allow_xxx(). This feels a bit too specific and > narrow-focused. What if we later add yet another dimension for BPF FS > and token? Do we need to introduce yet another LSM for each such case? [I'm assuming you meant new LSM *hook*] Possibly. There are also some other issues which I've been thinking about along these lines, specifically the fact that the capability/command delegation happens after the existing security_bpf() hook is called which makes things rather awkward from a LSM perspective: the LSM would first need to allow the process access to the desired BPF op using it's current LSM specific security attributes (e.g. SELinux security domain, etc.) and then later consider the op in the context of the delegated access control rights (if the LSM decides to support those hooks). I suspect that if we want to make this practical we would need to either move some of the token code up into __sys_bpf() so we could have a better interaction with security_bpf(), or we need to consider moving the security_bpf() call into the op specific functions. I'm still thinking on this (lots of reviews to get through this week), but I'm hoping there is a better way because I'm not sure I like either option very much. > But also see bpf_prog_load(). There are two checks, allow_prog_type > and allow_attach_type, which are really only meaningful in > combination. And yet you'd have to have two separate LSM hooks for > that. > > So I feel like the better approach is less mechanistically > concentrating on BPF token operations themselves, but rather on more > semantically meaningful operations that are token-enabled. E.g., > protect BPF program loading, BPF map creation, BTF loading, etc. And > we do have such LSM hooks right now, though they might not be the most > convenient. So perhaps the right move is to add new ones that would > provide a bit more context (e.g., we can pass in the BPF token that > was used for the operation, attributes with which map/prog was > created, etc). Low-level token LSMs seem hard to use cohesively in > practice, though. Can you elaborate a bit more? It's hard to judge the comments above without some more specifics about hook location, parameters, etc.
On Thu, Sep 14, 2023 at 5:55 PM Paul Moore <paul@paul-moore.com> wrote: > > On Thu, Sep 14, 2023 at 1:31 PM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > On Wed, Sep 13, 2023 at 2:46 PM Paul Moore <paul@paul-moore.com> wrote: > > > > > > On Sep 12, 2023 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > > > > > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to > > > > allow delegating privileged BPF functionality, like loading a BPF > > > > program or creating a BPF map, from privileged process to a *trusted* > > > > unprivileged process, all while have a good amount of control over which > > > > privileged operations could be performed using provided BPF token. > > > > > > > > This is achieved through mounting BPF FS instance with extra delegation > > > > mount options, which determine what operations are delegatable, and also > > > > constraining it to the owning user namespace (as mentioned in the > > > > previous patch). > > > > > > > > BPF token itself is just a derivative from BPF FS and can be created > > > > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts > > > > a path specification (using the usual fd + string path combo) to a BPF > > > > FS mount. Currently, BPF token "inherits" delegated command, map types, > > > > prog type, and attach type bit sets from BPF FS as is. In the future, > > > > having an BPF token as a separate object with its own FD, we can allow > > > > to further restrict BPF token's allowable set of things either at the creation > > > > time or after the fact, allowing the process to guard itself further > > > > from, e.g., unintentionally trying to load undesired kind of BPF > > > > programs. But for now we keep things simple and just copy bit sets as is. > > > > > > > > When BPF token is created from BPF FS mount, we take reference to the > > > > BPF super block's owning user namespace, and then use that namespace for > > > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} > > > > capabilities that are normally only checked against init userns (using > > > > capable()), but now we check them using ns_capable() instead (if BPF > > > > token is provided). See bpf_token_capable() for details. > > > > > > > > Such setup means that BPF token in itself is not sufficient to grant BPF > > > > functionality. User namespaced process has to *also* have necessary > > > > combination of capabilities inside that user namespace. So while > > > > previously CAP_BPF was useless when granted within user namespace, now > > > > it gains a meaning and allows container managers and sys admins to have > > > > a flexible control over which processes can and need to use BPF > > > > functionality within the user namespace (i.e., container in practice). > > > > And BPF FS delegation mount options and derived BPF tokens serve as > > > > a per-container "flag" to grant overall ability to use bpf() (plus further > > > > restrict on which parts of bpf() syscalls are treated as namespaced). > > > > > > > > The alternative to creating BPF token object was: > > > > a) not having any extra object and just pasing BPF FS path to each > > > > relevant bpf() command. This seems suboptimal as it's racy (mount > > > > under the same path might change in between checking it and using it > > > > for bpf() command). And also less flexible if we'd like to further > > > > restrict ourselves compared to all the delegated functionality > > > > allowed on BPF FS. > > > > b) use non-bpf() interface, e.g., ioctl(), but otherwise also create > > > > a dedicated FD that would represent a token-like functionality. This > > > > doesn't seem superior to having a proper bpf() command, so > > > > BPF_TOKEN_CREATE was chosen. > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > --- > > > > include/linux/bpf.h | 36 +++++++ > > > > include/uapi/linux/bpf.h | 39 +++++++ > > > > kernel/bpf/Makefile | 2 +- > > > > kernel/bpf/inode.c | 4 +- > > > > kernel/bpf/syscall.c | 17 +++ > > > > kernel/bpf/token.c | 189 +++++++++++++++++++++++++++++++++ > > > > tools/include/uapi/linux/bpf.h | 39 +++++++ > > > > 7 files changed, 324 insertions(+), 2 deletions(-) > > > > create mode 100644 kernel/bpf/token.c > > > > > > ... > > > > > > > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c > > > > new file mode 100644 > > > > index 000000000000..f6ea3eddbee6 > > > > --- /dev/null > > > > +++ b/kernel/bpf/token.c > > > > @@ -0,0 +1,189 @@ > > > > +#include <linux/bpf.h> > > > > +#include <linux/vmalloc.h> > > > > +#include <linux/anon_inodes.h> > > > > +#include <linux/fdtable.h> > > > > +#include <linux/file.h> > > > > +#include <linux/fs.h> > > > > +#include <linux/kernel.h> > > > > +#include <linux/idr.h> > > > > +#include <linux/namei.h> > > > > +#include <linux/user_namespace.h> > > > > + > > > > +bool bpf_token_capable(const struct bpf_token *token, int cap) > > > > +{ > > > > + /* BPF token allows ns_capable() level of capabilities */ > > > > + if (token) { > > > > + if (ns_capable(token->userns, cap)) > > > > + return true; > > > > + if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN)) > > > > + return true; > > > > + } > > > > + /* otherwise fallback to capable() checks */ > > > > + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN)); > > > > +} > > > > > > While the above looks to be equivalent to the bpf_capable() function it > > > replaces, for callers checking CAP_BPF and CAP_SYS_ADMIN, I'm looking > > > quickly at patch 3/12 and this is also being used to replace a > > > capable(CAP_NET_ADMIN) call which results in a change in behavior. > > > The current code which performs a capable(CAP_NET_ADMIN) check cannot > > > be satisfied by CAP_SYS_ADMIN, but this patchset using > > > bpf_token_capable(token, CAP_NET_ADMIN) can be satisfied by either > > > CAP_NET_ADMIN or CAP_SYS_ADMIN. > > > > > > It seems that while bpf_token_capable() can be used as a replacement > > > for bpf_capable(), it is not currently a suitable replacement for a > > > generic capable() call. Perhaps this is intentional, but I didn't see > > > it mentioned in the commit description, or in the comments, and I > > > wanted to make sure it wasn't an oversight. > > > > You are right. It is an intentional attempt to unify all such checks. > > If you look at bpf_prog_load(), we have this: > > > > if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && > > !capable(CAP_SYS_ADMIN)) > > return -EPERM; > > > > So seeing that, I realized that we did have an intent to always use > > CAP_SYS_ADMIN as a "fallback" cap, even for CAP_NET_ADMIN when it > > comes to using network-enabled BPF programs. So I decided that > > unifying all this makes sense. > > > > I'll add a comment mentioning this, I should have been more explicit > > from the get go. > > Thanks for the clarification. I'm not to worried about checking > CAP_SYS_ADMIN as a fallback, but I always get a little twitchy when I > see capability changes in the code without any mention. > > A mention in the commit description is good, and you could also draft > up a standalone patch that adds the CAP_SYS_ADMIN fallback to the > current in-tree code. That would be a good way to really highlight > the capability changes and deal with any issues that might arise > (review, odd corner cases?, etc.) prior to the BPF capability > delegation patcheset we are discussing here. Sure, sounds good, I'll add this as a pre-patch for next revision. > > > > > +#define BPF_TOKEN_INODE_NAME "bpf-token" > > > > + > > > > +/* Alloc anon_inode and FD for prepared token. > > > > + * Returns fd >= 0 on success; negative error, otherwise. > > > > + */ > > > > +int bpf_token_new_fd(struct bpf_token *token) > > > > +{ > > > > + return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC); > > > > +} > > > > + > > > > +struct bpf_token *bpf_token_get_from_fd(u32 ufd) > > > > +{ > > > > + struct fd f = fdget(ufd); > > > > + struct bpf_token *token; > > > > + > > > > + if (!f.file) > > > > + return ERR_PTR(-EBADF); > > > > + if (f.file->f_op != &bpf_token_fops) { > > > > + fdput(f); > > > > + return ERR_PTR(-EINVAL); > > > > + } > > > > + > > > > + token = f.file->private_data; > > > > + bpf_token_inc(token); > > > > + fdput(f); > > > > + > > > > + return token; > > > > +} > > > > + > > > > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd) > > > > +{ > > > > + if (!token) > > > > + return false; > > > > + > > > > + return token->allowed_cmds & (1ULL << cmd); > > > > +} > > > > > > I mentioned this a while back, likely in the other threads where this > > > token-based approach was only being discussed in general terms, but I > > > think we want to have a LSM hook at the point of initial token > > > delegation for this and a hook when the token is used. My initial > > > thinking is that we should be able to address the former with a hook > > > in bpf_fill_super() and the latter either in bpf_token_get_from_fd() > > > or bpf_token_allow_XXX(); bpf_token_get_from_fd() would be simpler, > > > but it doesn't allow for much in the way of granularity. Inserting the > > > LSM hooks in bpf_token_allow_XXX() would also allow the BPF code to fall > > > gracefully fallback to the system-wide checks if the LSM denied the > > > requested access whereas an access denial in bpf_token_get_from_fd() > > > denial would cause the operation to error out. > > > > I think the bpf_fill_super() LSM hook makes sense, but I thought > > someone mentioned that we already have some generic LSM hook for > > validating mounts? If we don't, I can certainly add one for BPF FS > > specifically. > > We do have security_sb_mount(), but that is a generic mount operation > access control and not well suited for controlling the mount-based > capability delegation that you are proposing here. However, if you or > someone else has a clever way to make security_sb_mount() work for > this purpose I would be very happy to review that code. To be honest, I'm a bit out of my depth here, as I don't know the mounting parts well. Perhaps someone from VFS side can advise. But regardless, I have no problem adding a new LSM hook as well, ideally not very BPF-specific. If you have a specific form of it in mind, I'd be curious to see it and implement it. > > > As for the bpf_token_allow_xxx(). This feels a bit too specific and > > narrow-focused. What if we later add yet another dimension for BPF FS > > and token? Do we need to introduce yet another LSM for each such case? > > [I'm assuming you meant new LSM *hook*] yep, of course, sorry about using terminology sloppily > > Possibly. There are also some other issues which I've been thinking > about along these lines, specifically the fact that the > capability/command delegation happens after the existing > security_bpf() hook is called which makes things rather awkward from a > LSM perspective: the LSM would first need to allow the process access > to the desired BPF op using it's current LSM specific security > attributes (e.g. SELinux security domain, etc.) and then later > consider the op in the context of the delegated access control rights > (if the LSM decides to support those hooks). > > I suspect that if we want to make this practical we would need to > either move some of the token code up into __sys_bpf() so we could > have a better interaction with security_bpf(), or we need to consider > moving the security_bpf() call into the op specific functions. I'm > still thinking on this (lots of reviews to get through this week), but > I'm hoping there is a better way because I'm not sure I like either > option very much. Yes, security_bpf() is happening extremely early and is lacking a lot of context. I'm not sure if moving it around is a good idea as it basically changes its semantics. But adding a new set of coherent LSM hooks per each appropriate BPF operation with good context to make decisions sounds like a good improvement. E.g., for BPF_PROG_LOAD, we can have LSM hook after struct bpf_prog is allocated, bpf_token is available, attributes are sanity checked. All that together is a very useful and powerful context that can be used both by more fixed LSM policies (like SELinux), and very dynamic user-defined BPF LSM programs. But I'd like to keep all that outside of the BPF token feature itself, as it's already pretty hard to get a consensus just on those bits, so complicating this with simultaneously designing a new set of LSM hooks is something that we should avoid. Let's keep discussing this, but not block that on BPF token. > > > But also see bpf_prog_load(). There are two checks, allow_prog_type > > and allow_attach_type, which are really only meaningful in > > combination. And yet you'd have to have two separate LSM hooks for > > that. > > > > So I feel like the better approach is less mechanistically > > concentrating on BPF token operations themselves, but rather on more > > semantically meaningful operations that are token-enabled. E.g., > > protect BPF program loading, BPF map creation, BTF loading, etc. And > > we do have such LSM hooks right now, though they might not be the most > > convenient. So perhaps the right move is to add new ones that would > > provide a bit more context (e.g., we can pass in the BPF token that > > was used for the operation, attributes with which map/prog was > > created, etc). Low-level token LSMs seem hard to use cohesively in > > practice, though. > > Can you elaborate a bit more? It's hard to judge the comments above > without some more specifics about hook location, parameters, etc. So something like my above proposal for a new LSM hook in BPF_PROG_LOAD command. Just right before passing bpf_prog to BPF verifier, we can have err = security_bpf_prog_load(prog, attr, token) if (err) return -EPERM; Program, attributes, and token give a lot of inputs for security policy logic to make a decision about allowing that specific BPF program to be verified and loaded or not. I know how this could be used from BPF LSM side, but I assume that SELinux and others can take advantage of that provided additional context as well. Similarly we can have a BPF_MAP_CREATE-specific LSM hook with context relevant to creating a BPF map. And so on. > > -- > paul-moore.com
On Fri, Sep 15, 2023 at 4:59 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > On Thu, Sep 14, 2023 at 5:55 PM Paul Moore <paul@paul-moore.com> wrote: > > On Thu, Sep 14, 2023 at 1:31 PM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > > > On Wed, Sep 13, 2023 at 2:46 PM Paul Moore <paul@paul-moore.com> wrote: > > > > > > > > On Sep 12, 2023 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > > > > > > > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to > > > > > allow delegating privileged BPF functionality, like loading a BPF > > > > > program or creating a BPF map, from privileged process to a *trusted* > > > > > unprivileged process, all while have a good amount of control over which > > > > > privileged operations could be performed using provided BPF token. > > > > > > > > > > This is achieved through mounting BPF FS instance with extra delegation > > > > > mount options, which determine what operations are delegatable, and also > > > > > constraining it to the owning user namespace (as mentioned in the > > > > > previous patch). > > > > > > > > > > BPF token itself is just a derivative from BPF FS and can be created > > > > > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts > > > > > a path specification (using the usual fd + string path combo) to a BPF > > > > > FS mount. Currently, BPF token "inherits" delegated command, map types, > > > > > prog type, and attach type bit sets from BPF FS as is. In the future, > > > > > having an BPF token as a separate object with its own FD, we can allow > > > > > to further restrict BPF token's allowable set of things either at the creation > > > > > time or after the fact, allowing the process to guard itself further > > > > > from, e.g., unintentionally trying to load undesired kind of BPF > > > > > programs. But for now we keep things simple and just copy bit sets as is. > > > > > > > > > > When BPF token is created from BPF FS mount, we take reference to the > > > > > BPF super block's owning user namespace, and then use that namespace for > > > > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} > > > > > capabilities that are normally only checked against init userns (using > > > > > capable()), but now we check them using ns_capable() instead (if BPF > > > > > token is provided). See bpf_token_capable() for details. > > > > > > > > > > Such setup means that BPF token in itself is not sufficient to grant BPF > > > > > functionality. User namespaced process has to *also* have necessary > > > > > combination of capabilities inside that user namespace. So while > > > > > previously CAP_BPF was useless when granted within user namespace, now > > > > > it gains a meaning and allows container managers and sys admins to have > > > > > a flexible control over which processes can and need to use BPF > > > > > functionality within the user namespace (i.e., container in practice). > > > > > And BPF FS delegation mount options and derived BPF tokens serve as > > > > > a per-container "flag" to grant overall ability to use bpf() (plus further > > > > > restrict on which parts of bpf() syscalls are treated as namespaced). > > > > > > > > > > The alternative to creating BPF token object was: > > > > > a) not having any extra object and just pasing BPF FS path to each > > > > > relevant bpf() command. This seems suboptimal as it's racy (mount > > > > > under the same path might change in between checking it and using it > > > > > for bpf() command). And also less flexible if we'd like to further > > > > > restrict ourselves compared to all the delegated functionality > > > > > allowed on BPF FS. > > > > > b) use non-bpf() interface, e.g., ioctl(), but otherwise also create > > > > > a dedicated FD that would represent a token-like functionality. This > > > > > doesn't seem superior to having a proper bpf() command, so > > > > > BPF_TOKEN_CREATE was chosen. > > > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > > --- > > > > > include/linux/bpf.h | 36 +++++++ > > > > > include/uapi/linux/bpf.h | 39 +++++++ > > > > > kernel/bpf/Makefile | 2 +- > > > > > kernel/bpf/inode.c | 4 +- > > > > > kernel/bpf/syscall.c | 17 +++ > > > > > kernel/bpf/token.c | 189 +++++++++++++++++++++++++++++++++ > > > > > tools/include/uapi/linux/bpf.h | 39 +++++++ > > > > > 7 files changed, 324 insertions(+), 2 deletions(-) > > > > > create mode 100644 kernel/bpf/token.c > > > > > > > > ... > > > > > > > > > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c > > > > > new file mode 100644 > > > > > index 000000000000..f6ea3eddbee6 > > > > > --- /dev/null > > > > > +++ b/kernel/bpf/token.c > > > > > @@ -0,0 +1,189 @@ > > > > > +#include <linux/bpf.h> > > > > > +#include <linux/vmalloc.h> > > > > > +#include <linux/anon_inodes.h> > > > > > +#include <linux/fdtable.h> > > > > > +#include <linux/file.h> > > > > > +#include <linux/fs.h> > > > > > +#include <linux/kernel.h> > > > > > +#include <linux/idr.h> > > > > > +#include <linux/namei.h> > > > > > +#include <linux/user_namespace.h> > > > > > + > > > > > +bool bpf_token_capable(const struct bpf_token *token, int cap) > > > > > +{ > > > > > + /* BPF token allows ns_capable() level of capabilities */ > > > > > + if (token) { > > > > > + if (ns_capable(token->userns, cap)) > > > > > + return true; > > > > > + if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN)) > > > > > + return true; > > > > > + } > > > > > + /* otherwise fallback to capable() checks */ > > > > > + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN)); > > > > > +} > > > > > > > > While the above looks to be equivalent to the bpf_capable() function it > > > > replaces, for callers checking CAP_BPF and CAP_SYS_ADMIN, I'm looking > > > > quickly at patch 3/12 and this is also being used to replace a > > > > capable(CAP_NET_ADMIN) call which results in a change in behavior. > > > > The current code which performs a capable(CAP_NET_ADMIN) check cannot > > > > be satisfied by CAP_SYS_ADMIN, but this patchset using > > > > bpf_token_capable(token, CAP_NET_ADMIN) can be satisfied by either > > > > CAP_NET_ADMIN or CAP_SYS_ADMIN. > > > > > > > > It seems that while bpf_token_capable() can be used as a replacement > > > > for bpf_capable(), it is not currently a suitable replacement for a > > > > generic capable() call. Perhaps this is intentional, but I didn't see > > > > it mentioned in the commit description, or in the comments, and I > > > > wanted to make sure it wasn't an oversight. > > > > > > You are right. It is an intentional attempt to unify all such checks. > > > If you look at bpf_prog_load(), we have this: > > > > > > if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && > > > !capable(CAP_SYS_ADMIN)) > > > return -EPERM; > > > > > > So seeing that, I realized that we did have an intent to always use > > > CAP_SYS_ADMIN as a "fallback" cap, even for CAP_NET_ADMIN when it > > > comes to using network-enabled BPF programs. So I decided that > > > unifying all this makes sense. > > > > > > I'll add a comment mentioning this, I should have been more explicit > > > from the get go. > > > > Thanks for the clarification. I'm not to worried about checking > > CAP_SYS_ADMIN as a fallback, but I always get a little twitchy when I > > see capability changes in the code without any mention. > > > > A mention in the commit description is good, and you could also draft > > up a standalone patch that adds the CAP_SYS_ADMIN fallback to the > > current in-tree code. That would be a good way to really highlight > > the capability changes and deal with any issues that might arise > > (review, odd corner cases?, etc.) prior to the BPF capability > > delegation patcheset we are discussing here. > > Sure, sounds good, I'll add this as a pre-patch for next revision. My apologies on the delay, I've been traveling this week and haven't had the time to dig back into this. I do see that you've posted another revision of this patchset with the capability pre-patch, thanks for doing that. > > > > > +#define BPF_TOKEN_INODE_NAME "bpf-token" > > > > > + > > > > > +/* Alloc anon_inode and FD for prepared token. > > > > > + * Returns fd >= 0 on success; negative error, otherwise. > > > > > + */ > > > > > +int bpf_token_new_fd(struct bpf_token *token) > > > > > +{ > > > > > + return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC); > > > > > +} > > > > > + > > > > > +struct bpf_token *bpf_token_get_from_fd(u32 ufd) > > > > > +{ > > > > > + struct fd f = fdget(ufd); > > > > > + struct bpf_token *token; > > > > > + > > > > > + if (!f.file) > > > > > + return ERR_PTR(-EBADF); > > > > > + if (f.file->f_op != &bpf_token_fops) { > > > > > + fdput(f); > > > > > + return ERR_PTR(-EINVAL); > > > > > + } > > > > > + > > > > > + token = f.file->private_data; > > > > > + bpf_token_inc(token); > > > > > + fdput(f); > > > > > + > > > > > + return token; > > > > > +} > > > > > + > > > > > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd) > > > > > +{ > > > > > + if (!token) > > > > > + return false; > > > > > + > > > > > + return token->allowed_cmds & (1ULL << cmd); > > > > > +} > > > > > > > > I mentioned this a while back, likely in the other threads where this > > > > token-based approach was only being discussed in general terms, but I > > > > think we want to have a LSM hook at the point of initial token > > > > delegation for this and a hook when the token is used. My initial > > > > thinking is that we should be able to address the former with a hook > > > > in bpf_fill_super() and the latter either in bpf_token_get_from_fd() > > > > or bpf_token_allow_XXX(); bpf_token_get_from_fd() would be simpler, > > > > but it doesn't allow for much in the way of granularity. Inserting the > > > > LSM hooks in bpf_token_allow_XXX() would also allow the BPF code to fall > > > > gracefully fallback to the system-wide checks if the LSM denied the > > > > requested access whereas an access denial in bpf_token_get_from_fd() > > > > denial would cause the operation to error out. > > > > > > I think the bpf_fill_super() LSM hook makes sense, but I thought > > > someone mentioned that we already have some generic LSM hook for > > > validating mounts? If we don't, I can certainly add one for BPF FS > > > specifically. > > > > We do have security_sb_mount(), but that is a generic mount operation > > access control and not well suited for controlling the mount-based > > capability delegation that you are proposing here. However, if you or > > someone else has a clever way to make security_sb_mount() work for > > this purpose I would be very happy to review that code. > > To be honest, I'm a bit out of my depth here, as I don't know the > mounting parts well. Perhaps someone from VFS side can advise. But > regardless, I have no problem adding a new LSM hook as well, ideally > not very BPF-specific. If you have a specific form of it in mind, I'd > be curious to see it and implement it. I agree that there can be benefits to generalized LSM hooks, but in this hook I think it may need to be BPF specific simply because the hook would be dealing with the specific concept of delegating BPF permissions. I haven't taken the time to write up any hook patches yet as I wanted to discuss it with you and the others on the To/CC line, but it seems like we are roughly on the same page, at least with the initial delegation hook, so I can put something together if you aren't comfortable working on this (more on this below) ... > > > As for the bpf_token_allow_xxx(). This feels a bit too specific and > > > narrow-focused. What if we later add yet another dimension for BPF FS > > > and token? Do we need to introduce yet another LSM for each such case? > > > > [I'm assuming you meant new LSM *hook*] > > yep, of course, sorry about using terminology sloppily > > > > > Possibly. There are also some other issues which I've been thinking > > about along these lines, specifically the fact that the > > capability/command delegation happens after the existing > > security_bpf() hook is called which makes things rather awkward from a > > LSM perspective: the LSM would first need to allow the process access > > to the desired BPF op using it's current LSM specific security > > attributes (e.g. SELinux security domain, etc.) and then later > > consider the op in the context of the delegated access control rights > > (if the LSM decides to support those hooks). > > > > I suspect that if we want to make this practical we would need to > > either move some of the token code up into __sys_bpf() so we could > > have a better interaction with security_bpf(), or we need to consider > > moving the security_bpf() call into the op specific functions. I'm > > still thinking on this (lots of reviews to get through this week), but > > I'm hoping there is a better way because I'm not sure I like either > > option very much. > > Yes, security_bpf() is happening extremely early and is lacking a lot > of context. I'm not sure if moving it around is a good idea as it > basically changes its semantics. There are a couple of things that make this not quite as scary as it may seem. The first is that currently only SELinux implements a security_bpf() hook and the implementation is rather simplistic in terms of what information it requires to perform the existing access controls; decomposing the single security_bpf() call site into multiple op specific calls, perhaps with some op specific hooks, should be doable without causing major semantic changes. The second thing is that we could augment the existing security_bpf() hook and call site with a new LSM hook(s) that are called from the op specific call sites; this would allow those LSMs that desire the current semantics to use the existing security_bpf() hook and those that wish to use the new semantics could implement the new hook(s). This is very similar to the pathname-based and inode-based hooks in the VFS layer, some LSMs choose to implement pathname-based security and use one set of hooks, while others implement a label-based security mechanism and use a different set of hooks. > But adding a new set of coherent LSM > hooks per each appropriate BPF operation with good context to make > decisions sounds like a good improvement. E.g., for BPF_PROG_LOAD, we > can have LSM hook after struct bpf_prog is allocated, bpf_token is > available, attributes are sanity checked. All that together is a very > useful and powerful context that can be used both by more fixed LSM > policies (like SELinux), and very dynamic user-defined BPF LSM > programs. This is where it is my turn to mention that I'm getting a bit out of my depth, but I'm hopeful that the two of us can keep each other from drowning :) Typically the LSM hook call sites end up being in the same general area as the capability checks, usually just after (we want the normal Linux discretionary access controls to always come first for the sake of consistency). Sticking with that approach it looks like we would end up with a LSM call in bpf_prog_load() right after bpf_capable() call, the only gotcha with that is the bpf_prog struct isn't populated yet, but how important is that when we have the bpf_attr info (honest question, I don't know the answer to this)? Ignoring the bpf_prog struct, do you think something like this would work for a hook call site (please forgive the pseudo code)? int bpf_prog_load(...) { ... bpf_cap = bpf_token_capable(token, CAP_BPF); err = security_bpf_token(BPF_PROG_LOAD, attr, uattr_size, token); if (err) return err; ... } Assuming this type of hook configuration, and an empty/passthrough security_bpf() hook, a LSM would first see the various capable()/ns_capable() checks present in bpf_token_capable() followed by a BPF op check, complete with token, in the security_bpf_token() hook. Further assuming that we convert the bpf_token_new_fd() to use anon_inode_getfd_secure() instead of anon_inode_getfd() and the security_bpf_token() could still access the token fd via the bpf_attr struct I think we could do something like this for the SELinux case (more rough pseudo code): int selinux_bpf_token(...) { ssid = current_sid(); if (token) { /* this could be simplified with better integration * in bpf_token_get_from_fd() */ fd = fdget(attr->prog_token_fd); inode = file_inode(fd.file); isec = selinux_inode(inode); tsid = isec->sid; fdput(fd); } else tsid = ssid; switch(cmd) { ... case BPF_PROG_LOAD: rc = avc_has_perm(ssid, tsid, SECCLAS_BPF, BPF__PROG_LOAD); break; default: rc = 0; } return rc; } This would preserve the current behaviour when a token was not present: allow @current @current : bpf { prog_load } ... but this would change to the following if a token was present: allow @current @DELEGATED_ANON_INODE : bpf { prog_load } That seems reasonable to me, but I've CC'd the SELinux list on this so others can sanity check the above :) > But I'd like to keep all that outside of the BPF token feature itself, > as it's already pretty hard to get a consensus just on those bits, so > complicating this with simultaneously designing a new set of LSM hooks > is something that we should avoid. Let's keep discussing this, but not > block that on BPF token. The unfortunate aspect of disconnecting new functionality from the associated access controls is that it introduces a gap where the new functionality is not secured in a manner that users expect. There are billions of systems/users that rely on LSM-based access controls for a large part of their security story, and I think we are doing them a disservice by not including the LSM controls with new security significant features. We (the LSM folks) are happy to work with you to get this sorted out, and I would hope my comments in this thread (as well as prior iterations) and the rough design above is a good faith indicator of that. > > > But also see bpf_prog_load(). There are two checks, allow_prog_type > > > and allow_attach_type, which are really only meaningful in > > > combination. And yet you'd have to have two separate LSM hooks for > > > that. > > > > > > So I feel like the better approach is less mechanistically > > > concentrating on BPF token operations themselves, but rather on more > > > semantically meaningful operations that are token-enabled. E.g., > > > protect BPF program loading, BPF map creation, BTF loading, etc. And > > > we do have such LSM hooks right now, though they might not be the most > > > convenient. So perhaps the right move is to add new ones that would > > > provide a bit more context (e.g., we can pass in the BPF token that > > > was used for the operation, attributes with which map/prog was > > > created, etc). Low-level token LSMs seem hard to use cohesively in > > > practice, though. > > > > Can you elaborate a bit more? It's hard to judge the comments above > > without some more specifics about hook location, parameters, etc. > > So something like my above proposal for a new LSM hook in > BPF_PROG_LOAD command. Just right before passing bpf_prog to BPF > verifier, we can have > > err = security_bpf_prog_load(prog, attr, token) > if (err) > return -EPERM; > > Program, attributes, and token give a lot of inputs for security > policy logic to make a decision about allowing that specific BPF > program to be verified and loaded or not. I know how this could be > used from BPF LSM side, but I assume that SELinux and others can take > advantage of that provided additional context as well. If you think a populated bpf_prog struct is important for BPF LSM programs then I have no problem with that hook placement. It's a lot later in the process than we might normally want to place the hook, but we can still safely error out here so that should be okay. From a LSM perspective I think we can make either work, I think the big question is which would you rather have in the BPF code: the security_bpf_prog_load() hook you've suggested here or the security_bpf_token() hook I suggested above? > Similarly we can have a BPF_MAP_CREATE-specific LSM hook with context > relevant to creating a BPF map. And so on. Of course. I've been operating under the assumption that whatever we do for one op we should be able to apply the same idea to the others that need it.
On Thu, Sep 21, 2023 at 6:18 PM Paul Moore <paul@paul-moore.com> wrote: > ... > Typically the LSM hook call sites end up being in the same general > area as the capability checks, usually just after (we want the normal > Linux discretionary access controls to always come first for the sake > of consistency). Sticking with that approach it looks like we would > end up with a LSM call in bpf_prog_load() right after bpf_capable() > call, the only gotcha with that is the bpf_prog struct isn't populated > yet, but how important is that when we have the bpf_attr info (honest > question, I don't know the answer to this)? > > Ignoring the bpf_prog struct, do you think something like this would > work for a hook call site (please forgive the pseudo code)? > > int bpf_prog_load(...) > { > ... > bpf_cap = bpf_token_capable(token, CAP_BPF); > err = security_bpf_token(BPF_PROG_LOAD, attr, uattr_size, token); > if (err) > return err; > ... > } > > Assuming this type of hook configuration, and an empty/passthrough > security_bpf() hook, a LSM would first see the various > capable()/ns_capable() checks present in bpf_token_capable() followed > by a BPF op check, complete with token, in the security_bpf_token() > hook. Further assuming that we convert the bpf_token_new_fd() to use > anon_inode_getfd_secure() instead of anon_inode_getfd() and the > security_bpf_token() could still access the token fd via the bpf_attr > struct I think we could do something like this for the SELinux case > (more rough pseudo code): > > int selinux_bpf_token(...) > { > ssid = current_sid(); > if (token) { > /* this could be simplified with better integration > * in bpf_token_get_from_fd() */ > fd = fdget(attr->prog_token_fd); > inode = file_inode(fd.file); > isec = selinux_inode(inode); > tsid = isec->sid; > fdput(fd); > } else > tsid = ssid; > switch(cmd) { > ... > case BPF_PROG_LOAD: > rc = avc_has_perm(ssid, tsid, SECCLAS_BPF, BPF__PROG_LOAD); > break; > default: > rc = 0; > } > return rc; > } > > This would preserve the current behaviour when a token was not present: > > allow @current @current : bpf { prog_load } > > ... but this would change to the following if a token was present: > > allow @current @DELEGATED_ANON_INODE : bpf { prog_load } > > That seems reasonable to me, but I've CC'd the SELinux list on this so > others can sanity check the above :) I thought it might be helpful to add a bit more background on my thinking for the SELinux folks, especially since the object label used in the example above is a bit unusual. As a reminder, the object label in the delegated case is not the current domain as it is now for standard BPF program loads, it is the label of the BPF delegation token (anonymous inode) that is created by the process/orchestrator that manages the namespace and explicitly enabled the BPF privilege delegation. The BPF token can be labeled using the existing anonymous inode transition rules. First off I decided to reuse the existing permission so as not to break current policies. We can always separate the PROG_LOAD permission into a standard and delegated permission if desired, but I believe we would need to gate that with a policy capability and preserve some form of access control for the legacy PROG_LOAD-only case. Preserving the PROG_LOAD permission does present a challenge with respect to differentiating the delegated program load from a normal program load while ensuring that existing policies continue to work and delegated operations require explicit policy adjustments. Changing the object label in the delegated case was the only approach I could think of that would satisfy all of these constraints, but I'm open to other ideas, tweaks, etc. and I would love to get some other opinions on this. Thoughts?
On Thu, Sep 21, 2023 at 3:18 PM Paul Moore <paul@paul-moore.com> wrote: > > On Fri, Sep 15, 2023 at 4:59 PM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > On Thu, Sep 14, 2023 at 5:55 PM Paul Moore <paul@paul-moore.com> wrote: > > > On Thu, Sep 14, 2023 at 1:31 PM Andrii Nakryiko > > > <andrii.nakryiko@gmail.com> wrote: > > > > On Wed, Sep 13, 2023 at 2:46 PM Paul Moore <paul@paul-moore.com> wrote: > > > > > > > > > > On Sep 12, 2023 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > > > > > > > > > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to > > > > > > allow delegating privileged BPF functionality, like loading a BPF > > > > > > program or creating a BPF map, from privileged process to a *trusted* > > > > > > unprivileged process, all while have a good amount of control over which > > > > > > privileged operations could be performed using provided BPF token. > > > > > > > > > > > > This is achieved through mounting BPF FS instance with extra delegation > > > > > > mount options, which determine what operations are delegatable, and also > > > > > > constraining it to the owning user namespace (as mentioned in the > > > > > > previous patch). > > > > > > > > > > > > BPF token itself is just a derivative from BPF FS and can be created > > > > > > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts > > > > > > a path specification (using the usual fd + string path combo) to a BPF > > > > > > FS mount. Currently, BPF token "inherits" delegated command, map types, > > > > > > prog type, and attach type bit sets from BPF FS as is. In the future, > > > > > > having an BPF token as a separate object with its own FD, we can allow > > > > > > to further restrict BPF token's allowable set of things either at the creation > > > > > > time or after the fact, allowing the process to guard itself further > > > > > > from, e.g., unintentionally trying to load undesired kind of BPF > > > > > > programs. But for now we keep things simple and just copy bit sets as is. > > > > > > > > > > > > When BPF token is created from BPF FS mount, we take reference to the > > > > > > BPF super block's owning user namespace, and then use that namespace for > > > > > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} > > > > > > capabilities that are normally only checked against init userns (using > > > > > > capable()), but now we check them using ns_capable() instead (if BPF > > > > > > token is provided). See bpf_token_capable() for details. > > > > > > > > > > > > Such setup means that BPF token in itself is not sufficient to grant BPF > > > > > > functionality. User namespaced process has to *also* have necessary > > > > > > combination of capabilities inside that user namespace. So while > > > > > > previously CAP_BPF was useless when granted within user namespace, now > > > > > > it gains a meaning and allows container managers and sys admins to have > > > > > > a flexible control over which processes can and need to use BPF > > > > > > functionality within the user namespace (i.e., container in practice). > > > > > > And BPF FS delegation mount options and derived BPF tokens serve as > > > > > > a per-container "flag" to grant overall ability to use bpf() (plus further > > > > > > restrict on which parts of bpf() syscalls are treated as namespaced). > > > > > > > > > > > > The alternative to creating BPF token object was: > > > > > > a) not having any extra object and just pasing BPF FS path to each > > > > > > relevant bpf() command. This seems suboptimal as it's racy (mount > > > > > > under the same path might change in between checking it and using it > > > > > > for bpf() command). And also less flexible if we'd like to further > > > > > > restrict ourselves compared to all the delegated functionality > > > > > > allowed on BPF FS. > > > > > > b) use non-bpf() interface, e.g., ioctl(), but otherwise also create > > > > > > a dedicated FD that would represent a token-like functionality. This > > > > > > doesn't seem superior to having a proper bpf() command, so > > > > > > BPF_TOKEN_CREATE was chosen. > > > > > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > > > --- > > > > > > include/linux/bpf.h | 36 +++++++ > > > > > > include/uapi/linux/bpf.h | 39 +++++++ > > > > > > kernel/bpf/Makefile | 2 +- > > > > > > kernel/bpf/inode.c | 4 +- > > > > > > kernel/bpf/syscall.c | 17 +++ > > > > > > kernel/bpf/token.c | 189 +++++++++++++++++++++++++++++++++ > > > > > > tools/include/uapi/linux/bpf.h | 39 +++++++ > > > > > > 7 files changed, 324 insertions(+), 2 deletions(-) > > > > > > create mode 100644 kernel/bpf/token.c > > > > > > > > > > ... > > > > > > > > > > > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c > > > > > > new file mode 100644 > > > > > > index 000000000000..f6ea3eddbee6 > > > > > > --- /dev/null > > > > > > +++ b/kernel/bpf/token.c > > > > > > @@ -0,0 +1,189 @@ > > > > > > +#include <linux/bpf.h> > > > > > > +#include <linux/vmalloc.h> > > > > > > +#include <linux/anon_inodes.h> > > > > > > +#include <linux/fdtable.h> > > > > > > +#include <linux/file.h> > > > > > > +#include <linux/fs.h> > > > > > > +#include <linux/kernel.h> > > > > > > +#include <linux/idr.h> > > > > > > +#include <linux/namei.h> > > > > > > +#include <linux/user_namespace.h> > > > > > > + > > > > > > +bool bpf_token_capable(const struct bpf_token *token, int cap) > > > > > > +{ > > > > > > + /* BPF token allows ns_capable() level of capabilities */ > > > > > > + if (token) { > > > > > > + if (ns_capable(token->userns, cap)) > > > > > > + return true; > > > > > > + if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN)) > > > > > > + return true; > > > > > > + } > > > > > > + /* otherwise fallback to capable() checks */ > > > > > > + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN)); > > > > > > +} > > > > > > > > > > While the above looks to be equivalent to the bpf_capable() function it > > > > > replaces, for callers checking CAP_BPF and CAP_SYS_ADMIN, I'm looking > > > > > quickly at patch 3/12 and this is also being used to replace a > > > > > capable(CAP_NET_ADMIN) call which results in a change in behavior. > > > > > The current code which performs a capable(CAP_NET_ADMIN) check cannot > > > > > be satisfied by CAP_SYS_ADMIN, but this patchset using > > > > > bpf_token_capable(token, CAP_NET_ADMIN) can be satisfied by either > > > > > CAP_NET_ADMIN or CAP_SYS_ADMIN. > > > > > > > > > > It seems that while bpf_token_capable() can be used as a replacement > > > > > for bpf_capable(), it is not currently a suitable replacement for a > > > > > generic capable() call. Perhaps this is intentional, but I didn't see > > > > > it mentioned in the commit description, or in the comments, and I > > > > > wanted to make sure it wasn't an oversight. > > > > > > > > You are right. It is an intentional attempt to unify all such checks. > > > > If you look at bpf_prog_load(), we have this: > > > > > > > > if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && > > > > !capable(CAP_SYS_ADMIN)) > > > > return -EPERM; > > > > > > > > So seeing that, I realized that we did have an intent to always use > > > > CAP_SYS_ADMIN as a "fallback" cap, even for CAP_NET_ADMIN when it > > > > comes to using network-enabled BPF programs. So I decided that > > > > unifying all this makes sense. > > > > > > > > I'll add a comment mentioning this, I should have been more explicit > > > > from the get go. > > > > > > Thanks for the clarification. I'm not to worried about checking > > > CAP_SYS_ADMIN as a fallback, but I always get a little twitchy when I > > > see capability changes in the code without any mention. > > > > > > A mention in the commit description is good, and you could also draft > > > up a standalone patch that adds the CAP_SYS_ADMIN fallback to the > > > current in-tree code. That would be a good way to really highlight > > > the capability changes and deal with any issues that might arise > > > (review, odd corner cases?, etc.) prior to the BPF capability > > > delegation patcheset we are discussing here. > > > > Sure, sounds good, I'll add this as a pre-patch for next revision. > > My apologies on the delay, I've been traveling this week and haven't > had the time to dig back into this. > No worries, lots of conferences are happening right now, so I expected people to be unavailable. > I do see that you've posted another revision of this patchset with the > capability pre-patch, thanks for doing that. Yep, hopefully networking folks won't be opposed and we can streamline all that a bit. > > > > > > > +#define BPF_TOKEN_INODE_NAME "bpf-token" > > > > > > + > > > > > > +/* Alloc anon_inode and FD for prepared token. > > > > > > + * Returns fd >= 0 on success; negative error, otherwise. > > > > > > + */ > > > > > > +int bpf_token_new_fd(struct bpf_token *token) > > > > > > +{ > > > > > > + return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC); > > > > > > +} > > > > > > + > > > > > > +struct bpf_token *bpf_token_get_from_fd(u32 ufd) > > > > > > +{ > > > > > > + struct fd f = fdget(ufd); > > > > > > + struct bpf_token *token; > > > > > > + > > > > > > + if (!f.file) > > > > > > + return ERR_PTR(-EBADF); > > > > > > + if (f.file->f_op != &bpf_token_fops) { > > > > > > + fdput(f); > > > > > > + return ERR_PTR(-EINVAL); > > > > > > + } > > > > > > + > > > > > > + token = f.file->private_data; > > > > > > + bpf_token_inc(token); > > > > > > + fdput(f); > > > > > > + > > > > > > + return token; > > > > > > +} > > > > > > + > > > > > > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd) > > > > > > +{ > > > > > > + if (!token) > > > > > > + return false; > > > > > > + > > > > > > + return token->allowed_cmds & (1ULL << cmd); > > > > > > +} > > > > > > > > > > I mentioned this a while back, likely in the other threads where this > > > > > token-based approach was only being discussed in general terms, but I > > > > > think we want to have a LSM hook at the point of initial token > > > > > delegation for this and a hook when the token is used. My initial > > > > > thinking is that we should be able to address the former with a hook > > > > > in bpf_fill_super() and the latter either in bpf_token_get_from_fd() > > > > > or bpf_token_allow_XXX(); bpf_token_get_from_fd() would be simpler, > > > > > but it doesn't allow for much in the way of granularity. Inserting the > > > > > LSM hooks in bpf_token_allow_XXX() would also allow the BPF code to fall > > > > > gracefully fallback to the system-wide checks if the LSM denied the > > > > > requested access whereas an access denial in bpf_token_get_from_fd() > > > > > denial would cause the operation to error out. > > > > > > > > I think the bpf_fill_super() LSM hook makes sense, but I thought > > > > someone mentioned that we already have some generic LSM hook for > > > > validating mounts? If we don't, I can certainly add one for BPF FS > > > > specifically. > > > > > > We do have security_sb_mount(), but that is a generic mount operation > > > access control and not well suited for controlling the mount-based > > > capability delegation that you are proposing here. However, if you or > > > someone else has a clever way to make security_sb_mount() work for > > > this purpose I would be very happy to review that code. > > > > To be honest, I'm a bit out of my depth here, as I don't know the > > mounting parts well. Perhaps someone from VFS side can advise. But > > regardless, I have no problem adding a new LSM hook as well, ideally > > not very BPF-specific. If you have a specific form of it in mind, I'd > > be curious to see it and implement it. > > I agree that there can be benefits to generalized LSM hooks, but in > this hook I think it may need to be BPF specific simply because the > hook would be dealing with the specific concept of delegating BPF > permissions. Sure. As an alternative, if this is about controlling BPF delegation, instead of doing mount-time checks and LSM hook, perhaps we can add a new LSM hook to BPF_CREATE_TOKEN, just like we have ones for BPF_MAP_CREATE and BPF_PROG_LOAD. That will enable controlling delegation more directly when it is actually attempted to be used. > > I haven't taken the time to write up any hook patches yet as I wanted > to discuss it with you and the others on the To/CC line, but it seems > like we are roughly on the same page, at least with the initial > delegation hook, so I can put something together if you aren't > comfortable working on this (more on this below) ... I'd appreciate the help from the SELinux side specifically, yes. I'm absolutely OK to add a few new LSM hooks, though. > > > > > As for the bpf_token_allow_xxx(). This feels a bit too specific and > > > > narrow-focused. What if we later add yet another dimension for BPF FS > > > > and token? Do we need to introduce yet another LSM for each such case? > > > > > > [I'm assuming you meant new LSM *hook*] > > > > yep, of course, sorry about using terminology sloppily > > > > > > > > Possibly. There are also some other issues which I've been thinking > > > about along these lines, specifically the fact that the > > > capability/command delegation happens after the existing > > > security_bpf() hook is called which makes things rather awkward from a > > > LSM perspective: the LSM would first need to allow the process access > > > to the desired BPF op using it's current LSM specific security > > > attributes (e.g. SELinux security domain, etc.) and then later > > > consider the op in the context of the delegated access control rights > > > (if the LSM decides to support those hooks). > > > > > > I suspect that if we want to make this practical we would need to > > > either move some of the token code up into __sys_bpf() so we could > > > have a better interaction with security_bpf(), or we need to consider > > > moving the security_bpf() call into the op specific functions. I'm > > > still thinking on this (lots of reviews to get through this week), but > > > I'm hoping there is a better way because I'm not sure I like either > > > option very much. > > > > Yes, security_bpf() is happening extremely early and is lacking a lot > > of context. I'm not sure if moving it around is a good idea as it > > basically changes its semantics. > > There are a couple of things that make this not quite as scary as it > may seem. The first is that currently only SELinux implements a > security_bpf() hook and the implementation is rather simplistic in > terms of what information it requires to perform the existing access > controls; decomposing the single security_bpf() call site into > multiple op specific calls, perhaps with some op specific hooks, > should be doable without causing major semantic changes. The second > thing is that we could augment the existing security_bpf() hook and > call site with a new LSM hook(s) that are called from the op specific > call sites; this would allow those LSMs that desire the current > semantics to use the existing security_bpf() hook and those that wish > to use the new semantics could implement the new hook(s). This is > very similar to the pathname-based and inode-based hooks in the VFS > layer, some LSMs choose to implement pathname-based security and use > one set of hooks, while others implement a label-based security > mechanism and use a different set of hooks. > Agreed. I think new LSM hooks that are operation-specific make a lot of sense. I'd probably not touch existing security_bpf(), it's an early-entry LSM hook for anything bpf() syscall-specific. This might be very useful in some cases, probably. > > But adding a new set of coherent LSM > > hooks per each appropriate BPF operation with good context to make > > decisions sounds like a good improvement. E.g., for BPF_PROG_LOAD, we > > can have LSM hook after struct bpf_prog is allocated, bpf_token is > > available, attributes are sanity checked. All that together is a very > > useful and powerful context that can be used both by more fixed LSM > > policies (like SELinux), and very dynamic user-defined BPF LSM > > programs. > > This is where it is my turn to mention that I'm getting a bit out of > my depth, but I'm hopeful that the two of us can keep each other from > drowning :) > > Typically the LSM hook call sites end up being in the same general > area as the capability checks, usually just after (we want the normal > Linux discretionary access controls to always come first for the sake > of consistency). Sticking with that approach it looks like we would > end up with a LSM call in bpf_prog_load() right after bpf_capable() > call, the only gotcha with that is the bpf_prog struct isn't populated > yet, but how important is that when we have the bpf_attr info (honest > question, I don't know the answer to this)? Ok, so I agree in general about having LSM hooks close to capability checks, but at least specifically for BPF_PROG_CREATE, it won't work. This bpf_capable() check you mention. This is just one check. If you look into bpf_prog_load() in kernel/bpf/syscall.c, you'll see that we can also check CAP_PERFMON, CAP_NET_ADMIN, and CAP_SYS_ADMIN, in addition to CAP_BPF, based on various aspects (like program type + subtype). So for such a complex BPF_PROG_CREATE operation I think we should deviate a bit and place LSM in a logical place that would enable doing LSM enforcement with lots of relevant information, but before doing anything dangerous or expensive. For BPF_PROG_LOAD that place seems to be right before bpf_check(), which is BPF verification. By that time we did a bunch of different sanity checks, resolved various things (like found another bpf_program to attach to, if requested), copied user-provided strings (e.g., license), etc, etc. That's tons of good stuff to be used for either audit or enforcement. This also answers your question about why bpf_attr isn't enough. bpf_attr has various FDs, data pointers (program instructions), strings, etc, etc. All of that might be a) inconvenient to fetch (at least from BPF LSM) and/or b) racy (e.g., FD can get changed between the check and actual usage). So while bpf_attr is useful as an input, ideally we'd augment it with bpf_prog and bpf_token. Right now we have `security_bpf_prog_alloc(prog->aux);`, which is almost in the ideal place, but provides prog->aux instead of program itself (not sure why), and doesn't provide bpf_attr and bpf_token. So I'm thinking that maybe we get rid of bpf_prog_alloc() in favor of new security_bpf_prog_load(prog, &attr, token)? > > Ignoring the bpf_prog struct, do you think something like this would > work for a hook call site (please forgive the pseudo code)? > > int bpf_prog_load(...) > { > ... > bpf_cap = bpf_token_capable(token, CAP_BPF); > err = security_bpf_token(BPF_PROG_LOAD, attr, uattr_size, token); > if (err) > return err; > ... > } > See above, I think this should be program-centric, not token-centric. > Assuming this type of hook configuration, and an empty/passthrough > security_bpf() hook, a LSM would first see the various > capable()/ns_capable() checks present in bpf_token_capable() followed > by a BPF op check, complete with token, in the security_bpf_token() > hook. Further assuming that we convert the bpf_token_new_fd() to use > anon_inode_getfd_secure() instead of anon_inode_getfd() and the > security_bpf_token() could still access the token fd via the bpf_attr wouldn't this be a race to read FD from bpf_attr for LSM, and then separately read it again in bpf_prog_load()? That seems like TOCTOU to me? As I mentioned above, I think bpf_token should be provided as an argument after it was "extracted" from FD in one place. > struct I think we could do something like this for the SELinux case > (more rough pseudo code): > > int selinux_bpf_token(...) > { > ssid = current_sid(); > if (token) { > /* this could be simplified with better integration > * in bpf_token_get_from_fd() */ > fd = fdget(attr->prog_token_fd); > inode = file_inode(fd.file); > isec = selinux_inode(inode); > tsid = isec->sid; > fdput(fd); > } else > tsid = ssid; > switch(cmd) { > ... > case BPF_PROG_LOAD: > rc = avc_has_perm(ssid, tsid, SECCLAS_BPF, BPF__PROG_LOAD); > break; > default: > rc = 0; > } > return rc; > } > > This would preserve the current behaviour when a token was not present: > > allow @current @current : bpf { prog_load } > > ... but this would change to the following if a token was present: > > allow @current @DELEGATED_ANON_INODE : bpf { prog_load } > > That seems reasonable to me, but I've CC'd the SELinux list on this so > others can sanity check the above :) doesn't seem like using anon_inode_getfd_secure() should be a big deal > > > But I'd like to keep all that outside of the BPF token feature itself, > > as it's already pretty hard to get a consensus just on those bits, so > > complicating this with simultaneously designing a new set of LSM hooks > > is something that we should avoid. Let's keep discussing this, but not > > block that on BPF token. > > The unfortunate aspect of disconnecting new functionality from the > associated access controls is that it introduces a gap where the new > functionality is not secured in a manner that users expect. There are > billions of systems/users that rely on LSM-based access controls for a > large part of their security story, and I think we are doing them a > disservice by not including the LSM controls with new security > significant features. > > We (the LSM folks) are happy to work with you to get this sorted out, > and I would hope my comments in this thread (as well as prior > iterations) and the rough design above is a good faith indicator of > that. I'd be happy to collaborate on designing proper LSM hooks around all this (which is what we are doing right now, I believe). I'm just trying to think pragmatically how this should all work logistically. This patch set gets the BPF token concept into the kernel. But there is more work to do in libbpf and other supporting infrastructure to make proper use of it. So I'm just trying to avoid going too broad with this patch set. But if you'd be ok to converge on the design of BPF token-enabled LSM hooks for bpf() syscall here, I'm happy to implement them. I'm not feeling confident enough to do SELinux work on top, though, so that's where I'd appreciate help. If LSM folks would be willing to add SELinux interface on top of LSM hooks, we'd be able to parallelize this work with me finishing libbpf and user-space parts, while you or someone else finalizes the SELinux details. How does that sound to you? > > > > > But also see bpf_prog_load(). There are two checks, allow_prog_type > > > > and allow_attach_type, which are really only meaningful in > > > > combination. And yet you'd have to have two separate LSM hooks for > > > > that. > > > > > > > > So I feel like the better approach is less mechanistically > > > > concentrating on BPF token operations themselves, but rather on more > > > > semantically meaningful operations that are token-enabled. E.g., > > > > protect BPF program loading, BPF map creation, BTF loading, etc. And > > > > we do have such LSM hooks right now, though they might not be the most > > > > convenient. So perhaps the right move is to add new ones that would > > > > provide a bit more context (e.g., we can pass in the BPF token that > > > > was used for the operation, attributes with which map/prog was > > > > created, etc). Low-level token LSMs seem hard to use cohesively in > > > > practice, though. > > > > > > Can you elaborate a bit more? It's hard to judge the comments above > > > without some more specifics about hook location, parameters, etc. > > > > So something like my above proposal for a new LSM hook in > > BPF_PROG_LOAD command. Just right before passing bpf_prog to BPF > > verifier, we can have > > > > err = security_bpf_prog_load(prog, attr, token) > > if (err) > > return -EPERM; > > > > Program, attributes, and token give a lot of inputs for security > > policy logic to make a decision about allowing that specific BPF > > program to be verified and loaded or not. I know how this could be > > used from BPF LSM side, but I assume that SELinux and others can take > > advantage of that provided additional context as well. > > If you think a populated bpf_prog struct is important for BPF LSM > programs then I have no problem with that hook placement. It's a lot > later in the process than we might normally want to place the hook, > but we can still safely error out here so that should be okay. > > From a LSM perspective I think we can make either work, I think the > big question is which would you rather have in the BPF code: the > security_bpf_prog_load() hook you've suggested here or the > security_bpf_token() hook I suggested above? I think security_bpf_prog_load() makes more sense, as I tried to lay out above. But you know, it's not set in stone yet, so let's try to converge. I tried to elaborate on why security_bpf_prog_load() and then separately security_bpf_map_create(), etc make most sense. For BPF LSM, having pointer to struct bpf_prog* and struct bpf_token* is a huge benefit compared to trying to somehow get them out of union bpf_attr. Same for map creation or any other BPF object that bpf() syscall operates on. > > > Similarly we can have a BPF_MAP_CREATE-specific LSM hook with context > > relevant to creating a BPF map. And so on. > > Of course. I've been operating under the assumption that whatever we > do for one op we should be able to apply the same idea to the others > that need it. > Awesome, yep. > -- > paul-moore.com
On Fri, Sep 22, 2023 at 6:35 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > No worries, lots of conferences are happening right now, so I expected > people to be unavailable. Just a quick note to let you know that my network access is still limited, but I appreciate the understanding and the detail in your reply; I'll get you a proper response next week.
On Fri, Sep 22, 2023 at 6:35 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > On Thu, Sep 21, 2023 at 3:18 PM Paul Moore <paul@paul-moore.com> wrote: > > On Fri, Sep 15, 2023 at 4:59 PM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > > > On Thu, Sep 14, 2023 at 5:55 PM Paul Moore <paul@paul-moore.com> wrote: > > > > On Thu, Sep 14, 2023 at 1:31 PM Andrii Nakryiko > > > > <andrii.nakryiko@gmail.com> wrote: > > > > > On Wed, Sep 13, 2023 at 2:46 PM Paul Moore <paul@paul-moore.com> wrote: > > > > > > > > > > > > On Sep 12, 2023 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: ... > > > > > > I mentioned this a while back, likely in the other threads where this > > > > > > token-based approach was only being discussed in general terms, but I > > > > > > think we want to have a LSM hook at the point of initial token > > > > > > delegation for this and a hook when the token is used. My initial > > > > > > thinking is that we should be able to address the former with a hook > > > > > > in bpf_fill_super() and the latter either in bpf_token_get_from_fd() > > > > > > or bpf_token_allow_XXX(); bpf_token_get_from_fd() would be simpler, > > > > > > but it doesn't allow for much in the way of granularity. Inserting the > > > > > > LSM hooks in bpf_token_allow_XXX() would also allow the BPF code to fall > > > > > > gracefully fallback to the system-wide checks if the LSM denied the > > > > > > requested access whereas an access denial in bpf_token_get_from_fd() > > > > > > denial would cause the operation to error out. > > > > > > > > > > I think the bpf_fill_super() LSM hook makes sense, but I thought > > > > > someone mentioned that we already have some generic LSM hook for > > > > > validating mounts? If we don't, I can certainly add one for BPF FS > > > > > specifically. > > > > > > > > We do have security_sb_mount(), but that is a generic mount operation > > > > access control and not well suited for controlling the mount-based > > > > capability delegation that you are proposing here. However, if you or > > > > someone else has a clever way to make security_sb_mount() work for > > > > this purpose I would be very happy to review that code. > > > > > > To be honest, I'm a bit out of my depth here, as I don't know the > > > mounting parts well. Perhaps someone from VFS side can advise. But > > > regardless, I have no problem adding a new LSM hook as well, ideally > > > not very BPF-specific. If you have a specific form of it in mind, I'd > > > be curious to see it and implement it. > > > > I agree that there can be benefits to generalized LSM hooks, but in > > this hook I think it may need to be BPF specific simply because the > > hook would be dealing with the specific concept of delegating BPF > > permissions. > > Sure. As an alternative, if this is about controlling BPF delegation, > instead of doing mount-time checks and LSM hook, perhaps we can add a > new LSM hook to BPF_CREATE_TOKEN, just like we have ones for > BPF_MAP_CREATE and BPF_PROG_LOAD. That will enable controlling > delegation more directly when it is actually attempted to be used. I'm also going to reply to the v6 patchset, but I thought there were some important points in this thread that were worth responding to here so that it would have the context of our previous discussion. So yes, from an LSM perspective we are concerned with who grants the delegation (creates the token) and who leverages that token to do work. When this patchset was still using anonymous inodes, marking and controlling token creation was relatively easy as we have existing hooks/control-points for anonymous inodes which take into account the anonymous inode class/type, e.g. bpffs. Now that this patchset is using a regular bpffs inode we may need to do some additional work so that we can mark the bpffs token inode as a "token" so that we can later distinguish it from an ordinary bpffs inode; it might also serve as a convenient place to control creation of the token, but as you have already mentioned we could also control this from the existing security_bpf(BPF_CREATE_TOKEN, ...) hook at the top of __sys_bpf(). Anyway, more on this in the v6 patchset. > > I haven't taken the time to write up any hook patches yet as I wanted > > to discuss it with you and the others on the To/CC line, but it seems > > like we are roughly on the same page, at least with the initial > > delegation hook, so I can put something together if you aren't > > comfortable working on this (more on this below) ... > > I'd appreciate the help from the SELinux side specifically, yes. I'm > absolutely OK to add a few new LSM hooks, though. I just want to say again that I'm very happy we can work together to make sure everything is covered :) > > > > > As for the bpf_token_allow_xxx(). This feels a bit too specific and > > > > > narrow-focused. What if we later add yet another dimension for BPF FS > > > > > and token? Do we need to introduce yet another LSM for each such case? > > > > > > > > [I'm assuming you meant new LSM *hook*] > > > > > > yep, of course, sorry about using terminology sloppily > > > > > > > Possibly. There are also some other issues which I've been thinking > > > > about along these lines, specifically the fact that the > > > > capability/command delegation happens after the existing > > > > security_bpf() hook is called which makes things rather awkward from a > > > > LSM perspective: the LSM would first need to allow the process access > > > > to the desired BPF op using it's current LSM specific security > > > > attributes (e.g. SELinux security domain, etc.) and then later > > > > consider the op in the context of the delegated access control rights > > > > (if the LSM decides to support those hooks). > > > > > > > > I suspect that if we want to make this practical we would need to > > > > either move some of the token code up into __sys_bpf() so we could > > > > have a better interaction with security_bpf(), or we need to consider > > > > moving the security_bpf() call into the op specific functions. I'm > > > > still thinking on this (lots of reviews to get through this week), but > > > > I'm hoping there is a better way because I'm not sure I like either > > > > option very much. > > > > > > Yes, security_bpf() is happening extremely early and is lacking a lot > > > of context. I'm not sure if moving it around is a good idea as it > > > basically changes its semantics. > > > > There are a couple of things that make this not quite as scary as it > > may seem. The first is that currently only SELinux implements a > > security_bpf() hook and the implementation is rather simplistic in > > terms of what information it requires to perform the existing access > > controls; decomposing the single security_bpf() call site into > > multiple op specific calls, perhaps with some op specific hooks, > > should be doable without causing major semantic changes. The second > > thing is that we could augment the existing security_bpf() hook and > > call site with a new LSM hook(s) that are called from the op specific > > call sites; this would allow those LSMs that desire the current > > semantics to use the existing security_bpf() hook and those that wish > > to use the new semantics could implement the new hook(s). This is > > very similar to the pathname-based and inode-based hooks in the VFS > > layer, some LSMs choose to implement pathname-based security and use > > one set of hooks, while others implement a label-based security > > mechanism and use a different set of hooks. > > Agreed. I think new LSM hooks that are operation-specific make a lot > of sense. I'd probably not touch existing security_bpf(), it's an > early-entry LSM hook for anything bpf() syscall-specific. This might > be very useful in some cases, probably. > > > > But adding a new set of coherent LSM > > > hooks per each appropriate BPF operation with good context to make > > > decisions sounds like a good improvement. E.g., for BPF_PROG_LOAD, we > > > can have LSM hook after struct bpf_prog is allocated, bpf_token is > > > available, attributes are sanity checked. All that together is a very > > > useful and powerful context that can be used both by more fixed LSM > > > policies (like SELinux), and very dynamic user-defined BPF LSM > > > programs. > > > > This is where it is my turn to mention that I'm getting a bit out of > > my depth, but I'm hopeful that the two of us can keep each other from > > drowning :) > > > > Typically the LSM hook call sites end up being in the same general > > area as the capability checks, usually just after (we want the normal > > Linux discretionary access controls to always come first for the sake > > of consistency). Sticking with that approach it looks like we would > > end up with a LSM call in bpf_prog_load() right after bpf_capable() > > call, the only gotcha with that is the bpf_prog struct isn't populated > > yet, but how important is that when we have the bpf_attr info (honest > > question, I don't know the answer to this)? > > Ok, so I agree in general about having LSM hooks close to capability > checks, but at least specifically for BPF_PROG_CREATE, it won't work. > This bpf_capable() check you mention. This is just one check. If you > look into bpf_prog_load() in kernel/bpf/syscall.c, you'll see that we > can also check CAP_PERFMON, CAP_NET_ADMIN, and CAP_SYS_ADMIN, in > addition to CAP_BPF, based on various aspects (like program type + > subtype). That's a fair point. > So for such a complex BPF_PROG_CREATE operation I think we > should deviate a bit and place LSM in a logical place that would > enable doing LSM enforcement with lots of relevant information, but > before doing anything dangerous or expensive. > > For BPF_PROG_LOAD that place seems to be right before bpf_check(), > which is BPF verification ... > ... Right now we have `security_bpf_prog_alloc(prog->aux);`, which is > almost in the ideal place, but provides prog->aux instead of program > itself (not sure why), and doesn't provide bpf_attr and bpf_token. > > So I'm thinking that maybe we get rid of bpf_prog_alloc() in favor of > new security_bpf_prog_load(prog, &attr, token)? That sounds reasonable. We'll need to make sure we update the docs for that LSM hook to indicate that it performs both allocation of the LSM's BPF program state (it's current behavior), as well as access control for BPF program loads both with and without delegation. I think those are the big points worth wrapping up here in this thread, I'll move the rest over to the v6 patchset.
On Tue, Oct 10, 2023 at 2:20 PM Paul Moore <paul@paul-moore.com> wrote: > > On Fri, Sep 22, 2023 at 6:35 PM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > On Thu, Sep 21, 2023 at 3:18 PM Paul Moore <paul@paul-moore.com> wrote: > > > On Fri, Sep 15, 2023 at 4:59 PM Andrii Nakryiko > > > <andrii.nakryiko@gmail.com> wrote: > > > > On Thu, Sep 14, 2023 at 5:55 PM Paul Moore <paul@paul-moore.com> wrote: > > > > > On Thu, Sep 14, 2023 at 1:31 PM Andrii Nakryiko > > > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > On Wed, Sep 13, 2023 at 2:46 PM Paul Moore <paul@paul-moore.com> wrote: > > > > > > > > > > > > > > On Sep 12, 2023 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > ... > > > > > > > > I mentioned this a while back, likely in the other threads where this > > > > > > > token-based approach was only being discussed in general terms, but I > > > > > > > think we want to have a LSM hook at the point of initial token > > > > > > > delegation for this and a hook when the token is used. My initial > > > > > > > thinking is that we should be able to address the former with a hook > > > > > > > in bpf_fill_super() and the latter either in bpf_token_get_from_fd() > > > > > > > or bpf_token_allow_XXX(); bpf_token_get_from_fd() would be simpler, > > > > > > > but it doesn't allow for much in the way of granularity. Inserting the > > > > > > > LSM hooks in bpf_token_allow_XXX() would also allow the BPF code to fall > > > > > > > gracefully fallback to the system-wide checks if the LSM denied the > > > > > > > requested access whereas an access denial in bpf_token_get_from_fd() > > > > > > > denial would cause the operation to error out. > > > > > > > > > > > > I think the bpf_fill_super() LSM hook makes sense, but I thought > > > > > > someone mentioned that we already have some generic LSM hook for > > > > > > validating mounts? If we don't, I can certainly add one for BPF FS > > > > > > specifically. > > > > > > > > > > We do have security_sb_mount(), but that is a generic mount operation > > > > > access control and not well suited for controlling the mount-based > > > > > capability delegation that you are proposing here. However, if you or > > > > > someone else has a clever way to make security_sb_mount() work for > > > > > this purpose I would be very happy to review that code. > > > > > > > > To be honest, I'm a bit out of my depth here, as I don't know the > > > > mounting parts well. Perhaps someone from VFS side can advise. But > > > > regardless, I have no problem adding a new LSM hook as well, ideally > > > > not very BPF-specific. If you have a specific form of it in mind, I'd > > > > be curious to see it and implement it. > > > > > > I agree that there can be benefits to generalized LSM hooks, but in > > > this hook I think it may need to be BPF specific simply because the > > > hook would be dealing with the specific concept of delegating BPF > > > permissions. > > > > Sure. As an alternative, if this is about controlling BPF delegation, > > instead of doing mount-time checks and LSM hook, perhaps we can add a > > new LSM hook to BPF_CREATE_TOKEN, just like we have ones for > > BPF_MAP_CREATE and BPF_PROG_LOAD. That will enable controlling > > delegation more directly when it is actually attempted to be used. > > I'm also going to reply to the v6 patchset, but I thought there were > some important points in this thread that were worth responding to > here so that it would have the context of our previous discussion. > > So yes, from an LSM perspective we are concerned with who grants the > delegation (creates the token) and who leverages that token to do > work. When this patchset was still using anonymous inodes, marking > and controlling token creation was relatively easy as we have existing > hooks/control-points for anonymous inodes which take into account the > anonymous inode class/type, e.g. bpffs. Now that this patchset is > using a regular bpffs inode we may need to do some additional work so > that we can mark the bpffs token inode as a "token" so that we can > later distinguish it from an ordinary bpffs inode; it might also serve > as a convenient place to control creation of the token, but as you > have already mentioned we could also control this from the existing > security_bpf(BPF_CREATE_TOKEN, ...) hook at the top of __sys_bpf(). > > Anyway, more on this in the v6 patchset. > > > > I haven't taken the time to write up any hook patches yet as I wanted > > > to discuss it with you and the others on the To/CC line, but it seems > > > like we are roughly on the same page, at least with the initial > > > delegation hook, so I can put something together if you aren't > > > comfortable working on this (more on this below) ... > > > > I'd appreciate the help from the SELinux side specifically, yes. I'm > > absolutely OK to add a few new LSM hooks, though. > > I just want to say again that I'm very happy we can work together to > make sure everything is covered :) Likewise! I ran out of time today to finish all the requested changes, but hopefully I will be able to post a new version tomorrow with all the feedback applied. > > > > > > > As for the bpf_token_allow_xxx(). This feels a bit too specific and > > > > > > narrow-focused. What if we later add yet another dimension for BPF FS > > > > > > and token? Do we need to introduce yet another LSM for each such case? > > > > > > > > > > [I'm assuming you meant new LSM *hook*] > > > > > > > > yep, of course, sorry about using terminology sloppily > > > > > > > > > Possibly. There are also some other issues which I've been thinking > > > > > about along these lines, specifically the fact that the > > > > > capability/command delegation happens after the existing > > > > > security_bpf() hook is called which makes things rather awkward from a > > > > > LSM perspective: the LSM would first need to allow the process access > > > > > to the desired BPF op using it's current LSM specific security > > > > > attributes (e.g. SELinux security domain, etc.) and then later > > > > > consider the op in the context of the delegated access control rights > > > > > (if the LSM decides to support those hooks). > > > > > > > > > > I suspect that if we want to make this practical we would need to > > > > > either move some of the token code up into __sys_bpf() so we could > > > > > have a better interaction with security_bpf(), or we need to consider > > > > > moving the security_bpf() call into the op specific functions. I'm > > > > > still thinking on this (lots of reviews to get through this week), but > > > > > I'm hoping there is a better way because I'm not sure I like either > > > > > option very much. > > > > > > > > Yes, security_bpf() is happening extremely early and is lacking a lot > > > > of context. I'm not sure if moving it around is a good idea as it > > > > basically changes its semantics. > > > > > > There are a couple of things that make this not quite as scary as it > > > may seem. The first is that currently only SELinux implements a > > > security_bpf() hook and the implementation is rather simplistic in > > > terms of what information it requires to perform the existing access > > > controls; decomposing the single security_bpf() call site into > > > multiple op specific calls, perhaps with some op specific hooks, > > > should be doable without causing major semantic changes. The second > > > thing is that we could augment the existing security_bpf() hook and > > > call site with a new LSM hook(s) that are called from the op specific > > > call sites; this would allow those LSMs that desire the current > > > semantics to use the existing security_bpf() hook and those that wish > > > to use the new semantics could implement the new hook(s). This is > > > very similar to the pathname-based and inode-based hooks in the VFS > > > layer, some LSMs choose to implement pathname-based security and use > > > one set of hooks, while others implement a label-based security > > > mechanism and use a different set of hooks. > > > > Agreed. I think new LSM hooks that are operation-specific make a lot > > of sense. I'd probably not touch existing security_bpf(), it's an > > early-entry LSM hook for anything bpf() syscall-specific. This might > > be very useful in some cases, probably. > > > > > > But adding a new set of coherent LSM > > > > hooks per each appropriate BPF operation with good context to make > > > > decisions sounds like a good improvement. E.g., for BPF_PROG_LOAD, we > > > > can have LSM hook after struct bpf_prog is allocated, bpf_token is > > > > available, attributes are sanity checked. All that together is a very > > > > useful and powerful context that can be used both by more fixed LSM > > > > policies (like SELinux), and very dynamic user-defined BPF LSM > > > > programs. > > > > > > This is where it is my turn to mention that I'm getting a bit out of > > > my depth, but I'm hopeful that the two of us can keep each other from > > > drowning :) > > > > > > Typically the LSM hook call sites end up being in the same general > > > area as the capability checks, usually just after (we want the normal > > > Linux discretionary access controls to always come first for the sake > > > of consistency). Sticking with that approach it looks like we would > > > end up with a LSM call in bpf_prog_load() right after bpf_capable() > > > call, the only gotcha with that is the bpf_prog struct isn't populated > > > yet, but how important is that when we have the bpf_attr info (honest > > > question, I don't know the answer to this)? > > > > Ok, so I agree in general about having LSM hooks close to capability > > checks, but at least specifically for BPF_PROG_CREATE, it won't work. > > This bpf_capable() check you mention. This is just one check. If you > > look into bpf_prog_load() in kernel/bpf/syscall.c, you'll see that we > > can also check CAP_PERFMON, CAP_NET_ADMIN, and CAP_SYS_ADMIN, in > > addition to CAP_BPF, based on various aspects (like program type + > > subtype). > > That's a fair point. > > > So for such a complex BPF_PROG_CREATE operation I think we > > should deviate a bit and place LSM in a logical place that would > > enable doing LSM enforcement with lots of relevant information, but > > before doing anything dangerous or expensive. > > > > For BPF_PROG_LOAD that place seems to be right before bpf_check(), > > which is BPF verification ... > > > ... Right now we have `security_bpf_prog_alloc(prog->aux);`, which is > > almost in the ideal place, but provides prog->aux instead of program > > itself (not sure why), and doesn't provide bpf_attr and bpf_token. > > > > So I'm thinking that maybe we get rid of bpf_prog_alloc() in favor of > > new security_bpf_prog_load(prog, &attr, token)? > > That sounds reasonable. We'll need to make sure we update the docs > for that LSM hook to indicate that it performs both allocation of the > LSM's BPF program state (it's current behavior), as well as access > control for BPF program loads both with and without delegation. Heh, I asked about this distinction between "allocation of LSM state" and "access control" in another email thread, but based on this it seems like it's purely convention based and it's OK to do both with the same hook, right? If that's the case, then I think we should combine what you proposed as two hooks, security_bpf_token_alloc() and security_bpf_token_create(), into one hook to minimize mental overhead. But if I'm missing something, please point it out on the patch. > > I think those are the big points worth wrapping up here in this > thread, I'll move the rest over to the v6 patchset. It all makes sense and I intend to add all those hooks and do refactoring of existing map/prog ones. I will put that into a separate patch in the series for a more focused review. > > -- > paul-moore.com
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index e9a3ab390844..6abd2b96e096 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -51,6 +51,8 @@ struct module; struct bpf_func_state; struct ftrace_ops; struct cgroup; +struct bpf_token; +struct user_namespace; extern struct idr btf_idr; extern spinlock_t btf_idr_lock; @@ -1568,6 +1570,13 @@ struct bpf_mount_opts { u64 delegate_attachs; }; +struct bpf_token { + struct work_struct work; + atomic64_t refcnt; + struct user_namespace *userns; + u64 allowed_cmds; +}; + struct bpf_struct_ops_value; struct btf_member; @@ -2192,6 +2201,15 @@ int bpf_link_new_fd(struct bpf_link *link); struct bpf_link *bpf_link_get_from_fd(u32 ufd); struct bpf_link *bpf_link_get_curr_or_next(u32 *id); +void bpf_token_inc(struct bpf_token *token); +void bpf_token_put(struct bpf_token *token); +int bpf_token_create(union bpf_attr *attr); +int bpf_token_new_fd(struct bpf_token *token); +struct bpf_token *bpf_token_get_from_fd(u32 ufd); + +bool bpf_token_capable(const struct bpf_token *token, int cap); +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd); + int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname); int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags); @@ -2551,6 +2569,24 @@ static inline int bpf_obj_get_user(const char __user *pathname, int flags) return -EOPNOTSUPP; } +static inline void bpf_token_inc(struct bpf_token *token) +{ +} + +static inline void bpf_token_put(struct bpf_token *token) +{ +} + +static inline int bpf_token_new_fd(struct bpf_token *token) +{ + return -EOPNOTSUPP; +} + +static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd) +{ + return ERR_PTR(-EOPNOTSUPP); +} + static inline void __dev_flush(void) { } diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 73b155e52204..36e98c6f8944 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -847,6 +847,37 @@ union bpf_iter_link_info { * Returns zero on success. On error, -1 is returned and *errno* * is set appropriately. * + * BPF_TOKEN_CREATE + * Description + * Create BPF token with embedded information about what + * BPF-related functionality it allows: + * - a set of allowed bpf() syscall commands; + * - a set of allowed BPF map types to be created with + * BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed; + * - a set of allowed BPF program types and BPF program attach + * types to be loaded with BPF_PROG_LOAD command, if + * BPF_PROG_LOAD itself is allowed. + * + * BPF token is created (derived) from an instance of BPF FS, + * assuming it has necessary delegation mount options specified. + * BPF FS mount is specified with openat()-style path FD + string. + * This BPF token can be passed as an extra parameter to various + * bpf() syscall commands to grant BPF subsystem functionality to + * unprivileged processes. + * + * When created, BPF token is "associated" with the owning + * user namespace of BPF FS instance (super block) that it was + * derived from, and subsequent BPF operations performed with + * BPF token would be performing capabilities checks (i.e., + * CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within + * that user namespace. Without BPF token, such capabilities + * have to be granted in init user namespace, making bpf() + * syscall incompatible with user namespace, for the most part. + * + * Return + * A new file descriptor (a nonnegative integer), or -1 if an + * error occurred (in which case, *errno* is set appropriately). + * * NOTES * eBPF objects (maps and programs) can be shared between processes. * @@ -901,6 +932,8 @@ enum bpf_cmd { BPF_ITER_CREATE, BPF_LINK_DETACH, BPF_PROG_BIND_MAP, + BPF_TOKEN_CREATE, + __MAX_BPF_CMD, }; enum bpf_map_type { @@ -1694,6 +1727,12 @@ union bpf_attr { __u32 flags; /* extra flags */ } prog_bind_map; + struct { /* struct used by BPF_TOKEN_CREATE command */ + __u32 flags; + __u32 bpffs_path_fd; + __u64 bpffs_pathname; + } token_create; + } __attribute__((aligned(8))); /* The description below is an attempt at providing documentation to eBPF diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index f526b7573e97..4ce95acfcaa7 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse endif CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy) -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 8f66b57d3546..82f11fbffd3e 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -603,11 +603,13 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root) { struct bpf_mount_opts *opts = root->d_sb->s_fs_info; umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX; + u64 mask; if (mode != S_IRWXUGO) seq_printf(m, ",mode=%o", mode); - if (opts->delegate_cmds == ~0ULL) + mask = (1ULL << __MAX_BPF_CMD) - 1; + if ((opts->delegate_cmds & mask) == mask) seq_printf(m, ",delegate_cmds=any"); else if (opts->delegate_cmds) seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 6a692f3bea15..4fae678c1f48 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -5297,6 +5297,20 @@ static int bpf_prog_bind_map(union bpf_attr *attr) return ret; } +#define BPF_TOKEN_CREATE_LAST_FIELD token_create.bpffs_pathname + +static int token_create(union bpf_attr *attr) +{ + if (CHECK_ATTR(BPF_TOKEN_CREATE)) + return -EINVAL; + + /* no flags are supported yet */ + if (attr->token_create.flags) + return -EINVAL; + + return bpf_token_create(attr); +} + static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size) { union bpf_attr attr; @@ -5430,6 +5444,9 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size) case BPF_PROG_BIND_MAP: err = bpf_prog_bind_map(&attr); break; + case BPF_TOKEN_CREATE: + err = token_create(&attr); + break; default: err = -EINVAL; break; diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c new file mode 100644 index 000000000000..f6ea3eddbee6 --- /dev/null +++ b/kernel/bpf/token.c @@ -0,0 +1,189 @@ +#include <linux/bpf.h> +#include <linux/vmalloc.h> +#include <linux/anon_inodes.h> +#include <linux/fdtable.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/kernel.h> +#include <linux/idr.h> +#include <linux/namei.h> +#include <linux/user_namespace.h> + +bool bpf_token_capable(const struct bpf_token *token, int cap) +{ + /* BPF token allows ns_capable() level of capabilities */ + if (token) { + if (ns_capable(token->userns, cap)) + return true; + if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN)) + return true; + } + /* otherwise fallback to capable() checks */ + return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN)); +} + +void bpf_token_inc(struct bpf_token *token) +{ + atomic64_inc(&token->refcnt); +} + +static void bpf_token_free(struct bpf_token *token) +{ + put_user_ns(token->userns); + kvfree(token); +} + +static void bpf_token_put_deferred(struct work_struct *work) +{ + struct bpf_token *token = container_of(work, struct bpf_token, work); + + bpf_token_free(token); +} + +void bpf_token_put(struct bpf_token *token) +{ + if (!token) + return; + + if (!atomic64_dec_and_test(&token->refcnt)) + return; + + INIT_WORK(&token->work, bpf_token_put_deferred); + schedule_work(&token->work); +} + +static int bpf_token_release(struct inode *inode, struct file *filp) +{ + struct bpf_token *token = filp->private_data; + + bpf_token_put(token); + return 0; +} + +static ssize_t bpf_dummy_read(struct file *filp, char __user *buf, size_t siz, + loff_t *ppos) +{ + /* We need this handler such that alloc_file() enables + * f_mode with FMODE_CAN_READ. + */ + return -EINVAL; +} + +static ssize_t bpf_dummy_write(struct file *filp, const char __user *buf, + size_t siz, loff_t *ppos) +{ + /* We need this handler such that alloc_file() enables + * f_mode with FMODE_CAN_WRITE. + */ + return -EINVAL; +} + +static void bpf_token_show_fdinfo(struct seq_file *m, struct file *filp) +{ + struct bpf_token *token = filp->private_data; + u64 mask; + + mask = (1ULL << __MAX_BPF_CMD) - 1; + if ((token->allowed_cmds & mask) == mask) + seq_printf(m, "allowed_cmds:\tany\n"); + else + seq_printf(m, "allowed_cmds:\t0x%llx\n", token->allowed_cmds); +} + +static const struct file_operations bpf_token_fops = { + .release = bpf_token_release, + .read = bpf_dummy_read, + .write = bpf_dummy_write, + .show_fdinfo = bpf_token_show_fdinfo, +}; + +static struct bpf_token *bpf_token_alloc(void) +{ + struct bpf_token *token; + + token = kvzalloc(sizeof(*token), GFP_USER); + if (!token) + return NULL; + + atomic64_set(&token->refcnt, 1); + + return token; +} + +int bpf_token_create(union bpf_attr *attr) +{ + struct path path; + struct bpf_mount_opts *mnt_opts; + struct bpf_token *token; + int ret; + + ret = user_path_at(attr->token_create.bpffs_path_fd, + u64_to_user_ptr(attr->token_create.bpffs_pathname), + LOOKUP_FOLLOW | LOOKUP_EMPTY, &path); + if (ret) + return ret; + + if (path.mnt->mnt_root != path.dentry) { + ret = -EINVAL; + goto out; + } + ret = path_permission(&path, MAY_ACCESS); + if (ret) + goto out; + + token = bpf_token_alloc(); + if (!token) { + ret = -ENOMEM; + goto out; + } + + /* remember bpffs owning userns for future ns_capable() checks */ + token->userns = get_user_ns(path.dentry->d_sb->s_user_ns); + + mnt_opts = path.dentry->d_sb->s_fs_info; + token->allowed_cmds = mnt_opts->delegate_cmds; + + ret = bpf_token_new_fd(token); + if (ret < 0) + bpf_token_free(token); +out: + path_put(&path); + return ret; +} + +#define BPF_TOKEN_INODE_NAME "bpf-token" + +/* Alloc anon_inode and FD for prepared token. + * Returns fd >= 0 on success; negative error, otherwise. + */ +int bpf_token_new_fd(struct bpf_token *token) +{ + return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC); +} + +struct bpf_token *bpf_token_get_from_fd(u32 ufd) +{ + struct fd f = fdget(ufd); + struct bpf_token *token; + + if (!f.file) + return ERR_PTR(-EBADF); + if (f.file->f_op != &bpf_token_fops) { + fdput(f); + return ERR_PTR(-EINVAL); + } + + token = f.file->private_data; + bpf_token_inc(token); + fdput(f); + + return token; +} + +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd) +{ + if (!token) + return false; + + return token->allowed_cmds & (1ULL << cmd); +} diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 73b155e52204..36e98c6f8944 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -847,6 +847,37 @@ union bpf_iter_link_info { * Returns zero on success. On error, -1 is returned and *errno* * is set appropriately. * + * BPF_TOKEN_CREATE + * Description + * Create BPF token with embedded information about what + * BPF-related functionality it allows: + * - a set of allowed bpf() syscall commands; + * - a set of allowed BPF map types to be created with + * BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed; + * - a set of allowed BPF program types and BPF program attach + * types to be loaded with BPF_PROG_LOAD command, if + * BPF_PROG_LOAD itself is allowed. + * + * BPF token is created (derived) from an instance of BPF FS, + * assuming it has necessary delegation mount options specified. + * BPF FS mount is specified with openat()-style path FD + string. + * This BPF token can be passed as an extra parameter to various + * bpf() syscall commands to grant BPF subsystem functionality to + * unprivileged processes. + * + * When created, BPF token is "associated" with the owning + * user namespace of BPF FS instance (super block) that it was + * derived from, and subsequent BPF operations performed with + * BPF token would be performing capabilities checks (i.e., + * CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within + * that user namespace. Without BPF token, such capabilities + * have to be granted in init user namespace, making bpf() + * syscall incompatible with user namespace, for the most part. + * + * Return + * A new file descriptor (a nonnegative integer), or -1 if an + * error occurred (in which case, *errno* is set appropriately). + * * NOTES * eBPF objects (maps and programs) can be shared between processes. * @@ -901,6 +932,8 @@ enum bpf_cmd { BPF_ITER_CREATE, BPF_LINK_DETACH, BPF_PROG_BIND_MAP, + BPF_TOKEN_CREATE, + __MAX_BPF_CMD, }; enum bpf_map_type { @@ -1694,6 +1727,12 @@ union bpf_attr { __u32 flags; /* extra flags */ } prog_bind_map; + struct { /* struct used by BPF_TOKEN_CREATE command */ + __u32 flags; + __u32 bpffs_path_fd; + __u64 bpffs_pathname; + } token_create; + } __attribute__((aligned(8))); /* The description below is an attempt at providing documentation to eBPF
Add new kind of BPF kernel object, BPF token. BPF token is meant to allow delegating privileged BPF functionality, like loading a BPF program or creating a BPF map, from privileged process to a *trusted* unprivileged process, all while have a good amount of control over which privileged operations could be performed using provided BPF token. This is achieved through mounting BPF FS instance with extra delegation mount options, which determine what operations are delegatable, and also constraining it to the owning user namespace (as mentioned in the previous patch). BPF token itself is just a derivative from BPF FS and can be created through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts a path specification (using the usual fd + string path combo) to a BPF FS mount. Currently, BPF token "inherits" delegated command, map types, prog type, and attach type bit sets from BPF FS as is. In the future, having an BPF token as a separate object with its own FD, we can allow to further restrict BPF token's allowable set of things either at the creation time or after the fact, allowing the process to guard itself further from, e.g., unintentionally trying to load undesired kind of BPF programs. But for now we keep things simple and just copy bit sets as is. When BPF token is created from BPF FS mount, we take reference to the BPF super block's owning user namespace, and then use that namespace for checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} capabilities that are normally only checked against init userns (using capable()), but now we check them using ns_capable() instead (if BPF token is provided). See bpf_token_capable() for details. Such setup means that BPF token in itself is not sufficient to grant BPF functionality. User namespaced process has to *also* have necessary combination of capabilities inside that user namespace. So while previously CAP_BPF was useless when granted within user namespace, now it gains a meaning and allows container managers and sys admins to have a flexible control over which processes can and need to use BPF functionality within the user namespace (i.e., container in practice). And BPF FS delegation mount options and derived BPF tokens serve as a per-container "flag" to grant overall ability to use bpf() (plus further restrict on which parts of bpf() syscalls are treated as namespaced). The alternative to creating BPF token object was: a) not having any extra object and just pasing BPF FS path to each relevant bpf() command. This seems suboptimal as it's racy (mount under the same path might change in between checking it and using it for bpf() command). And also less flexible if we'd like to further restrict ourselves compared to all the delegated functionality allowed on BPF FS. b) use non-bpf() interface, e.g., ioctl(), but otherwise also create a dedicated FD that would represent a token-like functionality. This doesn't seem superior to having a proper bpf() command, so BPF_TOKEN_CREATE was chosen. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- include/linux/bpf.h | 36 +++++++ include/uapi/linux/bpf.h | 39 +++++++ kernel/bpf/Makefile | 2 +- kernel/bpf/inode.c | 4 +- kernel/bpf/syscall.c | 17 +++ kernel/bpf/token.c | 189 +++++++++++++++++++++++++++++++++ tools/include/uapi/linux/bpf.h | 39 +++++++ 7 files changed, 324 insertions(+), 2 deletions(-) create mode 100644 kernel/bpf/token.c