Message ID | b16456e8dbc378c41b73c00c56854a3c30580833.1601478774.git.yifeifz2@illinois.edu (mailing list archive)
---|---
State | Not Applicable
Series | seccomp: Add bitmap cache of constant allow filter results
On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > access any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > Each common BPF instruction are emulated. Any weirdness or loading > from a syscall argument will cause the emulator to bail. > > The emulation is also halted if it reaches a return. In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Emulator structure and comments are from Kees [1] and Jann [2]. > > Emulation is done at attach time. If a filter depends on more > filters, and if the dependee does not guarantee to allow the > syscall, then we skip the emulation of this syscall. > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ [...] > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 1ab22869a765..ff5289228ea5 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -150,6 +150,7 @@ config X86 > select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT > select HAVE_ARCH_PREL32_RELOCATIONS > select HAVE_ARCH_SECCOMP_FILTER > + select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_STACKLEAK > select HAVE_ARCH_TRACEHOOK If you did the architecture enablement for X86 later in the series, you could move this part over into that patch, that'd be cleaner. > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index ae6b40cc39f4..f09c9e74ae05 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -143,6 +143,37 @@ struct notification { > struct list_head notifications; > }; > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_cache_filter_data - container for cache's per-filter data > + * > + * Tis struct is ordered to minimize padding holes. I think this comment can probably go away, there isn't really much trickery around padding holes in the struct as it is now. > + * @syscall_allow_default: A bitmap where each bit represents whether the > + * filter willalways allow the syscall, for the nit: s/willalways/will always/ [...] > +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, > + void *bitmap, const void *bitmap_prev, > + size_t bitmap_size, int arch) > +{ > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > + struct seccomp_data sd; > + int nr; > + > + for (nr = 0; nr < bitmap_size; nr++) { > + if (bitmap_prev && !test_bit(nr, bitmap_prev)) > + continue; > + > + sd.nr = nr; > + sd.arch = arch; > + > + if (seccomp_emu_is_const_allow(fprog, &sd)) > + set_bit(nr, bitmap); set_bit() is atomic, but since we only do this at filter setup, before the filter becomes globally visible, we don't need atomicity here. So this should probably use __set_bit() instead.
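To make the quoted notion of a "constant allow" result concrete, consider two classic BPF programs (an illustration, not code from this series; the __NR_getpid choice is arbitrary). The first branches only on the syscall number, so the pseudo-emulator can prove its verdict for each (arch, nr) pair; the second loads a syscall argument, which makes the emulator bail and leaves that syscall uncached:

    #include <stddef.h>
    #include <errno.h>
    #include <sys/syscall.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>

    /* Cacheable: the verdict depends only on seccomp_data->nr
     * (a real filter would also validate ->arch first). */
    struct sock_filter const_result[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    };

    /* Not cacheable: loading args[0] forces the emulator to bail,
     * so this filter always runs at syscall time. */
    struct sock_filter arg_dependent[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, args[0])),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
    };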
On Wed, Sep 30, 2020 at 10:19:13AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <yifeifz2@illinois.edu> > > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > access any syscall arguments or instruction pointer. To facilitate > this we need a static analyser to know whether a filter will > return allow regardless of syscall arguments for a given > architecture number / syscall number pair. This is implemented > here with a pseudo-emulator, and stored in a per-filter bitmap. > > Each common BPF instruction are emulated. Any weirdness or loading > from a syscall argument will cause the emulator to bail. > > The emulation is also halted if it reaches a return. In that case, > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > Emulator structure and comments are from Kees [1] and Jann [2]. > > Emulation is done at attach time. If a filter depends on more > filters, and if the dependee does not guarantee to allow the > syscall, then we skip the emulation of this syscall. > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ > > Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> See comments on patch 3 for reorganizing this a bit for the next version. For the infrastructure patch, I'd like to see much of the cover letter in the commit log (otherwise those details are harder for people to find). That will describe the _why_ for preparing this change, etc. For the emulator patch, I'd like to see the discussion about how the subset of BPF instructions was selected, what libraries Jann and I examined, etc. (For all of these commit logs, I try to pretend that whoever is reading it has not followed any lkml thread of discussion, etc.) > --- > arch/Kconfig | 34 ++++++++++ > arch/x86/Kconfig | 1 + > kernel/seccomp.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++- > 3 files changed, 201 insertions(+), 1 deletion(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 21a3675a7a3a..ca867b2a5d71 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -471,6 +471,14 @@ config HAVE_ARCH_SECCOMP_FILTER > results in the system call being skipped immediately. > - seccomp syscall wired up > > +config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > + bool > + help > + An arch should select this symbol if it provides all of these things: > + - all the requirements for HAVE_ARCH_SECCOMP_FILTER > + - SECCOMP_ARCH_DEFAULT > + - SECCOMP_ARCH_DEFAULT_NR > + There's no need for this config and the per-arch Kconfig clutter: SECCOMP_ARCH_NATIVE will be a sufficient gate. > config SECCOMP > prompt "Enable seccomp to safely execute untrusted bytecode" > def_bool y > @@ -498,6 +506,32 @@ config SECCOMP_FILTER > > See Documentation/userspace-api/seccomp_filter.rst for details. > > +choice > + prompt "Seccomp filter cache" > + default SECCOMP_CACHE_NONE > + depends on SECCOMP_FILTER > + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > + help > + Seccomp filters can potentially incur large overhead for each > + system call. This can alleviate some of the overhead. > + > + If in doubt, select 'syscall numbers only'. > + > +config SECCOMP_CACHE_NONE > + bool "None" > + help > + No caching is done. Seccomp filters will be called each time > + a system call occurs in a seccomp-guarded task. 
> + > +config SECCOMP_CACHE_NR_ONLY > + bool "Syscall number only" > + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > + help > + For each syscall number, if the seccomp filter has a fixed > + result, store that result in a bitmap to speed up system calls. > + > +endchoice I don't want this config: there is only 1 caching mechanism happening in this series and I do not want to have it buildable as "off": it should be available for all supported architectures. When further caching methods happen, the config can be introduced then (though I'll likely argue it should then be a boot param to allow distro kernels to make it selectable). > + > config HAVE_ARCH_STACKLEAK > bool > help > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 1ab22869a765..ff5289228ea5 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -150,6 +150,7 @@ config X86 > select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT > select HAVE_ARCH_PREL32_RELOCATIONS > select HAVE_ARCH_SECCOMP_FILTER > + select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY > select HAVE_ARCH_THREAD_STRUCT_WHITELIST > select HAVE_ARCH_STACKLEAK > select HAVE_ARCH_TRACEHOOK > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index ae6b40cc39f4..f09c9e74ae05 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -143,6 +143,37 @@ struct notification { > struct list_head notifications; > }; > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * struct seccomp_cache_filter_data - container for cache's per-filter data naming nits: "data" doesn't tell me anything. "seccomp_action_cache" might be better. Or since it's an internal struct, maybe just "action_cache". And let's not use the word "container" for the kerndoc. ;) How about "per-filter cache of seccomp actions per arch/syscall pair" > + * > + * Tis struct is ordered to minimize padding holes. typo: This > + * > + * @syscall_allow_default: A bitmap where each bit represents whether the > + * filter willalways allow the syscall, for the typo: missing space > + * default architecture. default -> native > + * @syscall_allow_compat: A bitmap where each bit represents whether the > + * filter will always allow the syscall, for the > + * compat architecture. > + */ > +struct seccomp_cache_filter_data { > +#ifdef SECCOMP_ARCH_DEFAULT > + DECLARE_BITMAP(syscall_allow_default, SECCOMP_ARCH_DEFAULT_NR); naming nit: "syscall" is redundant here, IMO. "allow_native" should be fine. > +#endif > +#ifdef SECCOMP_ARCH_COMPAT > + DECLARE_BITMAP(syscall_allow_compat, SECCOMP_ARCH_COMPAT_NR); > +#endif > +}; > + > +#define SECCOMP_EMU_MAX_PENDING_STATES 64 > +#else > +struct seccomp_cache_filter_data { }; > + > +static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * struct seccomp_filter - container for seccomp BPF programs > * > @@ -159,6 +190,7 @@ struct notification { > * this filter after reaching 0. The @users count is always smaller > * or equal to @refs. Hence, reaching 0 for @users does not mean > * the filter can be freed. > + * @cache: container for cache-related data. 
more descriptive: "cache of arch/syscall mappings to actions" > * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged > * @prev: points to a previously installed, or inherited, filter > * @prog: the BPF program to evaluate > @@ -180,6 +212,7 @@ struct seccomp_filter { > refcount_t refs; > refcount_t users; > bool log; > + struct seccomp_cache_filter_data cache; > struct seccomp_filter *prev; > struct bpf_prog *prog; > struct notification *notif; > @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > { > struct seccomp_filter *sfilter; > int ret; > - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); > + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || > + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); > > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) > return ERR_PTR(-EINVAL); > @@ -610,6 +644,136 @@ seccomp_prepare_user_filter(const char __user *user_filter) > return filter; > } > > +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY > +/** > + * seccomp_emu_is_const_allow - check if filter is constant allow with given data > + * @fprog: The BPF programs > + * @sd: The seccomp data to check against, only syscall number are arch > + * number are considered constant. > + */ > +static bool seccomp_emu_is_const_allow(struct sock_fprog_kern *fprog, > + struct seccomp_data *sd) naming: I would drop "emu" from here. The caller doesn't care how it is determined. ;) > +{ > + unsigned int insns; > + unsigned int reg_value = 0; > + unsigned int pc; > + bool op_res; > + > + if (WARN_ON_ONCE(!fprog)) > + return false; > + > + insns = bpf_classic_proglen(fprog); > + for (pc = 0; pc < insns; pc++) { > + struct sock_filter *insn = &fprog->filter[pc]; > + u16 code = insn->code; > + u32 k = insn->k; > + > + switch (code) { > + case BPF_LD | BPF_W | BPF_ABS: > + switch (k) { > + case offsetof(struct seccomp_data, nr): > + reg_value = sd->nr; > + break; > + case offsetof(struct seccomp_data, arch): > + reg_value = sd->arch; > + break; > + default: > + /* can't optimize (non-constant value load) */ > + return false; > + } > + break; > + case BPF_RET | BPF_K: > + /* reached return with constant values only, check allow */ > + return k == SECCOMP_RET_ALLOW; > + case BPF_JMP | BPF_JA: > + pc += insn->k; > + break; > + case BPF_JMP | BPF_JEQ | BPF_K: > + case BPF_JMP | BPF_JGE | BPF_K: > + case BPF_JMP | BPF_JGT | BPF_K: > + case BPF_JMP | BPF_JSET | BPF_K: > + switch (BPF_OP(code)) { > + case BPF_JEQ: > + op_res = reg_value == k; > + break; > + case BPF_JGE: > + op_res = reg_value >= k; > + break; > + case BPF_JGT: > + op_res = reg_value > k; > + break; > + case BPF_JSET: > + op_res = !!(reg_value & k); > + break; > + default: > + /* can't optimize (unknown jump) */ > + return false; > + } > + > + pc += op_res ? insn->jt : insn->jf; > + break; > + case BPF_ALU | BPF_AND | BPF_K: > + reg_value &= k; > + break; > + default: > + /* can't optimize (unknown insn) */ > + return false; > + } > + } > + > + /* ran off the end of the filter?! 
*/ > + WARN_ON(1); > + return false; > +} For the emulator patch, you'll want to include these tags in the commit log: Suggested-by: Jann Horn <jannh@google.com> Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> > + > +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, > + void *bitmap, const void *bitmap_prev, > + size_t bitmap_size, int arch) > +{ > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > + struct seccomp_data sd; > + int nr; > + > + for (nr = 0; nr < bitmap_size; nr++) { > + if (bitmap_prev && !test_bit(nr, bitmap_prev)) > + continue; > + > + sd.nr = nr; > + sd.arch = arch; > + > + if (seccomp_emu_is_const_allow(fprog, &sd)) > + set_bit(nr, bitmap); The guiding principle with seccomp's designs is to always make things _more_ restrictive, never less. While we can never escape the consequences of having seccomp_is_const_allow() report the wrong answer, we can at least follow the basic principles, hopefully minimizing the impact. When the bitmap starts with "always allowed" and we only flip it towards "run full filters", we're only ever making things more restrictive. If we instead go from "run full filters" towards "always allowed", we run the risk of making things less restrictive. For example: a process that maliciously adds a filter that the emulator mistakenly evaluates to "always allow" doesn't suddenly cause all the prior filters to stop running. (i.e. this isolates the flaw outcome, and doesn't depend on the early "do not emulate if we already know we have to run filters" case before the emulation call: there is no code path that allows the cache to weaken: it can only maintain it being wrong). Without any seccomp filter installed, all syscalls are "always allowed" (from the perspective of the seccomp boundary), so the default of the cache needs to be "always allowed". if (bitmap_prev) { /* The new filter must be as restrictive as the last. */ bitmap_copy(bitmap, bitmap_prev, bitmap_size); } else { /* Before any filters, all syscalls are always allowed. */ bitmap_fill(bitmap, bitmap_size); } for (nr = 0; nr < bitmap_size; nr++) { /* No bitmap change: not a cacheable action. */ if (!test_bit(nr, bitmap_prev) || continue; /* No bitmap change: continue to always allow. */ if (seccomp_is_const_allow(fprog, &sd)) continue; /* Not a cacheable action: always run filters. */ clear_bit(nr, bitmap); > + } > +} > + > +/** > + * seccomp_cache_prepare - emulate the filter to find cachable syscalls > + * @sfilter: The seccomp filter > + * > + * Returns 0 if successful or -errno if error occurred. > + */ > +static void seccomp_cache_prepare(struct seccomp_filter *sfilter) > +{ > + struct seccomp_cache_filter_data *cache = &sfilter->cache; > + const struct seccomp_cache_filter_data *cache_prev = > + sfilter->prev ? &sfilter->prev->cache : NULL; > + > +#ifdef SECCOMP_ARCH_DEFAULT > + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_default, > + cache_prev ? cache_prev->syscall_allow_default : NULL, > + SECCOMP_ARCH_DEFAULT_NR, > + SECCOMP_ARCH_DEFAULT); > +#endif /* SECCOMP_ARCH_DEFAULT */ > + > +#ifdef SECCOMP_ARCH_COMPAT > + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_compat, > + cache_prev ? 
cache_prev->syscall_allow_compat : NULL, > + SECCOMP_ARCH_COMPAT_NR, > + SECCOMP_ARCH_COMPAT); > +#endif /* SECCOMP_ARCH_COMPAT */ > +} > +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ > + > /** > * seccomp_attach_filter: validate and attach filter > * @flags: flags to change filter behavior > @@ -659,6 +823,7 @@ static long seccomp_attach_filter(unsigned int flags, > * task reference. > */ > filter->prev = current->seccomp.filter; > + seccomp_cache_prepare(filter); > current->seccomp.filter = filter; Jann, do we need a WRITE_ONCE() or something when writing current->seccomp.filter here? I think the rmb() in __seccomp_filter() will cover the cache bitmap writes having finished before the filter pointer is followed in the TSYNC case. > atomic_inc(¤t->seccomp.filter_count); > > -- > 2.28.0 > Otherwise, yes, I'm looking forward to having this for everyone to use! :)
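For reference, one conservative shape for the publication step being asked about here (a sketch, not a claim about what the final code must do; TSYNC already publishes a filter cross-thread with smp_store_release() in seccomp_sync_threads(), and readers fetch the pointer with READ_ONCE()):

    filter->prev = current->seccomp.filter;
    seccomp_cache_prepare(filter);
    /*
     * A release store orders the bitmap writes done in
     * seccomp_cache_prepare() before the store that makes the
     * filter reachable; readers pair with it through READ_ONCE()
     * plus the address dependency when dereferencing filter->cache.
     */
    smp_store_release(&current->seccomp.filter, filter);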
On Thu, Oct 01, 2020 at 12:24:32AM +0200, Jann Horn wrote: > On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not > > access any syscall arguments or instruction pointer. To facilitate > > this we need a static analyser to know whether a filter will > > return allow regardless of syscall arguments for a given > > architecture number / syscall number pair. This is implemented > > here with a pseudo-emulator, and stored in a per-filter bitmap. > > > > Each common BPF instruction are emulated. Any weirdness or loading > > from a syscall argument will cause the emulator to bail. > > > > The emulation is also halted if it reaches a return. In that case, > > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. > > > > Emulator structure and comments are from Kees [1] and Jann [2]. > > > > Emulation is done at attach time. If a filter depends on more > > filters, and if the dependee does not guarantee to allow the > > syscall, then we skip the emulation of this syscall. > > > > [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ > > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ > [...] > > +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, > > + void *bitmap, const void *bitmap_prev, > > + size_t bitmap_size, int arch) > > +{ > > + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; > > + struct seccomp_data sd; > > + int nr; > > + > > + for (nr = 0; nr < bitmap_size; nr++) { > > + if (bitmap_prev && !test_bit(nr, bitmap_prev)) > > + continue; > > + > > + sd.nr = nr; > > + sd.arch = arch; > > + > > + if (seccomp_emu_is_const_allow(fprog, &sd)) > > + set_bit(nr, bitmap); > > set_bit() is atomic, but since we only do this at filter setup, before > the filter becomes globally visible, we don't need atomicity here. So > this should probably use __set_bit() instead. Oh yes, excellent point! That will speed this up a bit. When you do this, please include a comment here describing why it's safe to do it non-atomic. :)
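A sketch of what that might look like against the v3 loop above (also picking up Kees's earlier suggestion to drop "emu" from the helper name):

    static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
                                             void *bitmap, const void *bitmap_prev,
                                             size_t bitmap_size, int arch)
    {
            struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
            struct seccomp_data sd;
            int nr;

            for (nr = 0; nr < bitmap_size; nr++) {
                    if (bitmap_prev && !test_bit(nr, bitmap_prev))
                            continue;

                    sd.nr = nr;
                    sd.arch = arch;

                    /*
                     * Non-atomic __set_bit() is safe: the bitmap is only
                     * written here at attach time, before the new filter
                     * is installed and visible to any other thread.
                     */
                    if (seccomp_is_const_allow(fprog, &sd))
                            __set_bit(nr, bitmap);
            }
    }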
On Wed, Sep 30, 2020 at 5:24 PM Jann Horn <jannh@google.com> wrote: > If you did the architecture enablement for X86 later in the series, > you could move this part over into that patch, that'd be cleaner. As in, patch 1: bitmap check logic. patch 2: emulator. patch 3: enable for x86? > > + * Tis struct is ordered to minimize padding holes. > > I think this comment can probably go away, there isn't really much > trickery around padding holes in the struct as it is now. Oh right, I was trying the locks and adding bits to indicate if certain arches are primed, then I undid that. > > + set_bit(nr, bitmap); > > set_bit() is atomic, but since we only do this at filter setup, before > the filter becomes globally visible, we don't need atomicity here. So > this should probably use __set_bit() instead. Right YiFei Zhu
On Wed, Sep 30, 2020 at 5:40 PM Kees Cook <keescook@chromium.org> wrote: > I don't want this config: there is only 1 caching mechanism happening > in this series and I do not want to have it buildable as "off": it > should be available for all supported architectures. When further caching > methods happen, the config can be introduced then (though I'll likely > argue it should then be a boot param to allow distro kernels to make it > selectable). Alright, we can think about configuration (or boot param) when more methods happen then. > The guiding principle with seccomp's designs is to always make things > _more_ restrictive, never less. While we can never escape the > consequences of having seccomp_is_const_allow() report the wrong > answer, we can at least follow the basic principles, hopefully > minimizing the impact. > > When the bitmap starts with "always allowed" and we only flip it towards > "run full filters", we're only ever making things more restrictive. If > we instead go from "run full filters" towards "always allowed", we run > the risk of making things less restrictive. For example: a process that > maliciously adds a filter that the emulator mistakenly evaluates to > "always allow" doesn't suddenly cause all the prior filters to stop running. > (i.e. this isolates the flaw outcome, and doesn't depend on the early > "do not emulate if we already know we have to run filters" case before > the emulation call: there is no code path that allows the cache to > weaken: it can only maintain it being wrong). > > Without any seccomp filter installed, all syscalls are "always allowed" > (from the perspective of the seccomp boundary), so the default of the > cache needs to be "always allowed". I cannot follow this. If a 'process that maliciously adds a filter that the emulator mistakenly evaluates to "always allow" doesn't suddenly cause all the prior filters to stop running', hence, you want, by default, the cache to be as transparent as possible. You would lift the restriction if and only if you are absolutely sure it does not cause an impact. In this patch, if there are prior filters, it goes through this logic: if (bitmap_prev && !test_bit(nr, bitmap_prev)) continue; Hence, if the malicious filter were to happen, and prior filters were supposed to run, then seccomp_is_const_allow is simply not invoked -- what it returns cannot be used maliciously by an adversary. > if (bitmap_prev) { > /* The new filter must be as restrictive as the last. */ > bitmap_copy(bitmap, bitmap_prev, bitmap_size); > } else { > /* Before any filters, all syscalls are always allowed. */ > bitmap_fill(bitmap, bitmap_size); > } > > for (nr = 0; nr < bitmap_size; nr++) { > /* No bitmap change: not a cacheable action. */ > if (!test_bit(nr, bitmap_prev) || > continue; > > /* No bitmap change: continue to always allow. */ > if (seccomp_is_const_allow(fprog, &sd)) > continue; > > /* Not a cacheable action: always run filters. */ > clear_bit(nr, bitmap); I'm not strongly against this logic. I just feel unconvinced that this is any different with a slightly increased complexity. YiFei Zhu
On Thu, Oct 01, 2020 at 06:52:50AM -0500, YiFei Zhu wrote: > On Wed, Sep 30, 2020 at 5:40 PM Kees Cook <keescook@chromium.org> wrote: > > The guiding principle with seccomp's designs is to always make things > > _more_ restrictive, never less. While we can never escape the > > consequences of having seccomp_is_const_allow() report the wrong > > answer, we can at least follow the basic principles, hopefully > > minimizing the impact. > > > > When the bitmap starts with "always allowed" and we only flip it towards > > "run full filters", we're only ever making things more restrictive. If > > we instead go from "run full filters" towards "always allowed", we run > > the risk of making things less restrictive. For example: a process that > > maliciously adds a filter that the emulator mistakenly evaluates to > > "always allow" doesn't suddenly cause all the prior filters to stop running. > > (i.e. this isolates the flaw outcome, and doesn't depend on the early > > "do not emulate if we already know we have to run filters" case before > > the emulation call: there is no code path that allows the cache to > > weaken: it can only maintain it being wrong). > > > > Without any seccomp filter installed, all syscalls are "always allowed" > > (from the perspective of the seccomp boundary), so the default of the > > cache needs to be "always allowed". > > I cannot follow this. If a 'process that maliciously adds a filter > that the emulator mistakenly evaluates to "always allow" doesn't > suddenly cause all the prior filters to stop running', hence, you > want, by default, the cache to be as transparent as possible. You > would lift the restriction if and only if you are absolutely sure it > does not cause an impact. Yes, right now, the v3 code pattern is entirely safe. > > In this patch, if there are prior filters, it goes through this logic: > > if (bitmap_prev && !test_bit(nr, bitmap_prev)) > continue; > > Hence, if the malicious filter were to happen, and prior filters were > supposed to run, then seccomp_is_const_allow is simply not invoked -- > what it returns cannot be used maliciously by an adversary. Right, but we depend on that test always doing the correct thing (and continuing to do so into the future). I'm looking at this from the perspective of future changes, maintenance, etc. I want the actions to match the design principles as closely as possible so that future evolutions of the code have lower risk to bugs causing security failures. Right now, the code is simple. I want to design this so that when it is complex, it will still fail toward safety in the face of bugs. > > if (bitmap_prev) { > > /* The new filter must be as restrictive as the last. */ > > bitmap_copy(bitmap, bitmap_prev, bitmap_size); > > } else { > > /* Before any filters, all syscalls are always allowed. */ > > bitmap_fill(bitmap, bitmap_size); > > } > > > > for (nr = 0; nr < bitmap_size; nr++) { > > /* No bitmap change: not a cacheable action. */ > > if (!test_bit(nr, bitmap_prev) || > > continue; > > > > /* No bitmap change: continue to always allow. */ > > if (seccomp_is_const_allow(fprog, &sd)) > > continue; > > > > /* Not a cacheable action: always run filters. */ > > clear_bit(nr, bitmap); > > I'm not strongly against this logic. I just feel unconvinced that this > is any different with a slightly increased complexity. I'd prefer this way because for the loop, the tests, and the results only make the bitmap more restrictive. 
The worst thing a bug in here can do is leave the bitmap unchanged (which is certainly bad), but it can't _undo_ an earlier restriction. The proposed loop's leading test_bit() becomes only an optimization, rather than being required for policy enforcement. In other words, I prefer:

	inherit all prior bitmap restrictions
	for all syscalls
		if this filter not restricted
			continue
		set bitmap restricted

within this loop (where the bulk of future logic may get added), the worst-case future bug-induced failure mode for the syscall bitmap is "skip *this* filter".

Instead of:

	set bitmap all restricted
	for all syscalls
		if previous bitmap not restricted and
		   filter not restricted
			set bitmap unrestricted

within this loop the worst-case future bug-induced failure mode for the syscall bitmap is "skip *all* filters".

Or, to reword again, this:

	retain restrictions from previous caching decisions
	for all syscalls
		[evaluate this filter, maybe continue]
		set restricted

instead of:

	set new cache to all restricted
	for all syscalls
		[evaluate prior cache and this filter, maybe continue]
		set unrestricted

I expect the future code changes for caching to be in the "evaluate" step, so I'd like the code designed to make things MORE restrictive not less from the start, and remove any prior cache state tests from the loop. At the end of the day I believe changing the design like this now lays the groundwork for the caching mechanism being more robust against having future bugs introduce security flaws.
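Put together, Kees's preferred shape might look like the following sketch (note the quoted snippet's "if (!test_bit(nr, bitmap_prev) ||" line is truncated; testing the freshly initialized bitmap instead resolves that and also avoids dereferencing a NULL bitmap_prev for the first filter):

    static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
                                             void *bitmap, const void *bitmap_prev,
                                             size_t bitmap_size, int arch)
    {
            struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
            struct seccomp_data sd;
            int nr;

            if (bitmap_prev) {
                    /* The new filter must be as restrictive as the last. */
                    bitmap_copy(bitmap, bitmap_prev, bitmap_size);
            } else {
                    /* Before any filters, all syscalls are always allowed. */
                    bitmap_fill(bitmap, bitmap_size);
            }

            for (nr = 0; nr < bitmap_size; nr++) {
                    /* Already known to run the full filters: nothing to do. */
                    if (!test_bit(nr, bitmap))
                            continue;

                    sd.nr = nr;
                    sd.arch = arch;

                    /* No bitmap change: continue to always allow. */
                    if (seccomp_is_const_allow(fprog, &sd))
                            continue;

                    /* Not a cacheable action: always run filters. */
                    __clear_bit(nr, bitmap);
            }
    }

With this shape, the worst a bug inside the loop can do is leave a bit set that this filter should have cleared, i.e. skip this filter's contribution, never an earlier filter's.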
On Thu, Oct 1, 2020 at 1:28 PM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > On Wed, Sep 30, 2020 at 5:24 PM Jann Horn <jannh@google.com> wrote: > > If you did the architecture enablement for X86 later in the series, > > you could move this part over into that patch, that'd be cleaner. > > As in, patch 1: bitmap check logic. patch 2: emulator. patch 3: enable for x86? Yeah.
On Thu, Oct 1, 2020 at 4:05 PM Kees Cook <keescook@chromium.org> wrote: > Right, but we depend on that test always doing the correct thing (and > continuing to do so into the future). I'm looking at this from the > perspective of future changes, maintenance, etc. I want the actions to > match the design principles as closely as possible so that future > evolutions of the code have lower risk to bugs causing security > failures. Right now, the code is simple. I want to design this so that > when it is complex, it will still fail toward safety in the face of > bugs. > > I'd prefer this way because for the loop, the tests, and the results only > make the bitmap more restrictive. The worst thing a bug in here can do is > leave the bitmap unchanged (which is certainly bad), but it can't _undo_ > an earlier restriction. > > The proposed loop's leading test_bit() becomes only an optimization, > rather than being required for policy enforcement. > > In other words, I prefer: > > inherit all prior prior bitmap restrictions > for all syscalls > if this filter not restricted > continue > set bitmap restricted > > within this loop (where the bulk of future logic may get added), > the worse-case future bug-induced failure mode for the syscall > bitmap is "skip *this* filter". > > > Instead of: > > set bitmap all restricted > for all syscalls > if previous bitmap not restricted and > filter not restricted > set bitmap unrestricted > > within this loop the worst-case future bug-induced failure mode > for the syscall bitmap is "skip *all* filters". > > > > > Or, to reword again, this: > > retain restrictions from previous caching decisions > for all syscalls > [evaluate this filter, maybe continue] > set restricted > > instead of: > > set new cache to all restricted > for all syscalls > [evaluate prior cache and this filter, maybe continue] > set unrestricted > > I expect the future code changes for caching to be in the "evaluate" > step, so I'd like the code designed to make things MORE restrictive not > less from the start, and remove any prior cache state tests from the > loop. > > At the end of the day I believe changing the design like this now lays > the groundwork to the caching mechanism being more robust against having > future bugs introduce security flaws. > I see. Makes sense. Thanks. Will do that in the v4 YiFei Zhu
On Wed, Sep 30, 2020 at 10:20 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > { > struct seccomp_filter *sfilter; > int ret; > - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); > + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || > + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); > > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) > return ERR_PTR(-EINVAL); I'm trying to use __is_defined(SECCOMP_ARCH_NATIVE) here, and got this message: kernel/seccomp.c: In function ‘seccomp_prepare_filter’: ././include/linux/kconfig.h:44:44: error: pasting "__ARG_PLACEHOLDER_" and "(" does not give a valid preprocessing token 44 | #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) | ^~~~~~~~~~~~~~~~~~ ././include/linux/kconfig.h:43:27: note: in expansion of macro ‘___is_defined’ 43 | #define __is_defined(x) ___is_defined(x) | ^~~~~~~~~~~~~ kernel/seccomp.c:629:11: note: in expansion of macro ‘__is_defined’ 629 | __is_defined(SECCOMP_ARCH_NATIVE); | ^~~~~~~~~~~~ Looking at the implementation of __is_defined, it is: #define __ARG_PLACEHOLDER_1 0, #define __take_second_arg(__ignored, val, ...) val #define __is_defined(x) ___is_defined(x) #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) #define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0) Hence, when FOO is defined to be 1, then the expansion would be __is_defined(FOO) -> ___is_defined(1) -> ____is_defined(__ARG_PLACEHOLDER_1) -> __take_second_arg(0, 1, 0) -> 1, and when FOO is not defined, the expansion would be __is_defined(FOO) -> ___is_defined(FOO) -> ____is_defined(__ARG_PLACEHOLDER_FOO) -> __take_second_arg(__ARG_PLACEHOLDER_FOO 1, 0) -> 0 However, here SECCOMP_ARCH_NATIVE is an expression from an OR of some bits, and __is_defined(SECCOMP_ARCH_NATIVE) would not expand to __ARG_PLACEHOLDER_1 during any stage in the preprocessing. Is there any better way to do this? I'm thinking of just doing #if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE) like in Kee's patch. YiFei Zhu
On Thu, Oct 08, 2020 at 11:47:17PM -0500, YiFei Zhu wrote: > On Wed, Sep 30, 2020 at 10:20 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote: > > @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > > { > > struct seccomp_filter *sfilter; > > int ret; > > - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); > > + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || > > + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); > > > > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) > > return ERR_PTR(-EINVAL); > > I'm trying to use __is_defined(SECCOMP_ARCH_NATIVE) here, and got this message: > > kernel/seccomp.c: In function ‘seccomp_prepare_filter’: > ././include/linux/kconfig.h:44:44: error: pasting "__ARG_PLACEHOLDER_" > and "(" does not give a valid preprocessing token > 44 | #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) > | ^~~~~~~~~~~~~~~~~~ > ././include/linux/kconfig.h:43:27: note: in expansion of macro ‘___is_defined’ > 43 | #define __is_defined(x) ___is_defined(x) > | ^~~~~~~~~~~~~ > kernel/seccomp.c:629:11: note: in expansion of macro ‘__is_defined’ > 629 | __is_defined(SECCOMP_ARCH_NATIVE); > | ^~~~~~~~~~~~ > > Looking at the implementation of __is_defined, it is: > > #define __ARG_PLACEHOLDER_1 0, > #define __take_second_arg(__ignored, val, ...) val > #define __is_defined(x) ___is_defined(x) > #define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) > #define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0) > > Hence, when FOO is defined to be 1, then the expansion would be > __is_defined(FOO) -> ___is_defined(1) -> > ____is_defined(__ARG_PLACEHOLDER_1) -> __take_second_arg(0, 1, 0) -> > 1, > and when FOO is not defined, the expansion would be __is_defined(FOO) > -> ___is_defined(FOO) -> ____is_defined(__ARG_PLACEHOLDER_FOO) -> > __take_second_arg(__ARG_PLACEHOLDER_FOO 1, 0) -> 0 > > However, here SECCOMP_ARCH_NATIVE is an expression from an OR of some > bits, and __is_defined(SECCOMP_ARCH_NATIVE) would not expand to > __ARG_PLACEHOLDER_1 during any stage in the preprocessing. > > Is there any better way to do this? I'm thinking of just doing #if > defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE) > like in Kee's patch. Yeah, I think that's simplest.
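Concretely, since SECCOMP_ARCH_NATIVE expands to an AUDIT_ARCH_* expression rather than the literal 1 that the __ARG_PLACEHOLDER_ trick pattern-matches on, the plain preprocessor form settled on here might look like:

    /*
     * IS_ENABLED()/__is_defined() only work for macros defined to the
     * literal 1; SECCOMP_ARCH_NATIVE is an OR of AUDIT_ARCH_* bits, so
     * fall back to plain preprocessor tests.
     */
    #if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)
            const bool save_orig = true;
    #else
            const bool save_orig = false;
    #endif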
diff --git a/arch/Kconfig b/arch/Kconfig index 21a3675a7a3a..ca867b2a5d71 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -471,6 +471,14 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + bool + help + An arch should select this symbol if it provides all of these things: + - all the requirements for HAVE_ARCH_SECCOMP_FILTER + - SECCOMP_ARCH_DEFAULT + - SECCOMP_ARCH_DEFAULT_NR + config SECCOMP prompt "Enable seccomp to safely execute untrusted bytecode" def_bool y @@ -498,6 +506,32 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +choice + prompt "Seccomp filter cache" + default SECCOMP_CACHE_NONE + depends on SECCOMP_FILTER + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + help + Seccomp filters can potentially incur large overhead for each + system call. This can alleviate some of the overhead. + + If in doubt, select 'syscall numbers only'. + +config SECCOMP_CACHE_NONE + bool "None" + help + No caching is done. Seccomp filters will be called each time + a system call occurs in a seccomp-guarded task. + +config SECCOMP_CACHE_NR_ONLY + bool "Syscall number only" + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + help + For each syscall number, if the seccomp filter has a fixed + result, store that result in a bitmap to speed up system calls. + +endchoice + config HAVE_ARCH_STACKLEAK bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1ab22869a765..ff5289228ea5 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -150,6 +150,7 @@ config X86 select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER + select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_STACKLEAK select HAVE_ARCH_TRACEHOOK diff --git a/kernel/seccomp.c b/kernel/seccomp.c index ae6b40cc39f4..f09c9e74ae05 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,37 @@ struct notification { struct list_head notifications; }; +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_filter_data - container for cache's per-filter data + * + * Tis struct is ordered to minimize padding holes. + * + * @syscall_allow_default: A bitmap where each bit represents whether the + * filter willalways allow the syscall, for the + * default architecture. + * @syscall_allow_compat: A bitmap where each bit represents whether the + * filter will always allow the syscall, for the + * compat architecture. + */ +struct seccomp_cache_filter_data { +#ifdef SECCOMP_ARCH_DEFAULT + DECLARE_BITMAP(syscall_allow_default, SECCOMP_ARCH_DEFAULT_NR); +#endif +#ifdef SECCOMP_ARCH_COMPAT + DECLARE_BITMAP(syscall_allow_compat, SECCOMP_ARCH_COMPAT_NR); +#endif +}; + +#define SECCOMP_EMU_MAX_PENDING_STATES 64 +#else +struct seccomp_cache_filter_data { }; + +static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -159,6 +190,7 @@ struct notification { * this filter after reaching 0. The @users count is always smaller * or equal to @refs. Hence, reaching 0 for @users does not mean * the filter can be freed. + * @cache: container for cache-related data. 
* @log: true if all actions except for SECCOMP_RET_ALLOW should be logged * @prev: points to a previously installed, or inherited, filter * @prog: the BPF program to evaluate @@ -180,6 +212,7 @@ struct seccomp_filter { refcount_t refs; refcount_t users; bool log; + struct seccomp_cache_filter_data cache; struct seccomp_filter *prev; struct bpf_prog *prog; struct notification *notif; @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -610,6 +644,136 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_emu_is_const_allow - check if filter is constant allow with given data + * @fprog: The BPF programs + * @sd: The seccomp data to check against, only syscall number are arch + * number are considered constant. + */ +static bool seccomp_emu_is_const_allow(struct sock_fprog_kern *fprog, + struct seccomp_data *sd) +{ + unsigned int insns; + unsigned int reg_value = 0; + unsigned int pc; + bool op_res; + + if (WARN_ON_ONCE(!fprog)) + return false; + + insns = bpf_classic_proglen(fprog); + for (pc = 0; pc < insns; pc++) { + struct sock_filter *insn = &fprog->filter[pc]; + u16 code = insn->code; + u32 k = insn->k; + + switch (code) { + case BPF_LD | BPF_W | BPF_ABS: + switch (k) { + case offsetof(struct seccomp_data, nr): + reg_value = sd->nr; + break; + case offsetof(struct seccomp_data, arch): + reg_value = sd->arch; + break; + default: + /* can't optimize (non-constant value load) */ + return false; + } + break; + case BPF_RET | BPF_K: + /* reached return with constant values only, check allow */ + return k == SECCOMP_RET_ALLOW; + case BPF_JMP | BPF_JA: + pc += insn->k; + break; + case BPF_JMP | BPF_JEQ | BPF_K: + case BPF_JMP | BPF_JGE | BPF_K: + case BPF_JMP | BPF_JGT | BPF_K: + case BPF_JMP | BPF_JSET | BPF_K: + switch (BPF_OP(code)) { + case BPF_JEQ: + op_res = reg_value == k; + break; + case BPF_JGE: + op_res = reg_value >= k; + break; + case BPF_JGT: + op_res = reg_value > k; + break; + case BPF_JSET: + op_res = !!(reg_value & k); + break; + default: + /* can't optimize (unknown jump) */ + return false; + } + + pc += op_res ? insn->jt : insn->jf; + break; + case BPF_ALU | BPF_AND | BPF_K: + reg_value &= k; + break; + default: + /* can't optimize (unknown insn) */ + return false; + } + } + + /* ran off the end of the filter?! */ + WARN_ON(1); + return false; +} + +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, + void *bitmap, const void *bitmap_prev, + size_t bitmap_size, int arch) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct seccomp_data sd; + int nr; + + for (nr = 0; nr < bitmap_size; nr++) { + if (bitmap_prev && !test_bit(nr, bitmap_prev)) + continue; + + sd.nr = nr; + sd.arch = arch; + + if (seccomp_emu_is_const_allow(fprog, &sd)) + set_bit(nr, bitmap); + } +} + +/** + * seccomp_cache_prepare - emulate the filter to find cachable syscalls + * @sfilter: The seccomp filter + * + * Returns 0 if successful or -errno if error occurred. 
+ */ +static void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct seccomp_cache_filter_data *cache = &sfilter->cache; + const struct seccomp_cache_filter_data *cache_prev = + sfilter->prev ? &sfilter->prev->cache : NULL; + +#ifdef SECCOMP_ARCH_DEFAULT + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_default, + cache_prev ? cache_prev->syscall_allow_default : NULL, + SECCOMP_ARCH_DEFAULT_NR, + SECCOMP_ARCH_DEFAULT); +#endif /* SECCOMP_ARCH_DEFAULT */ + +#ifdef SECCOMP_ARCH_COMPAT + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_compat, + cache_prev ? cache_prev->syscall_allow_compat : NULL, + SECCOMP_ARCH_COMPAT_NR, + SECCOMP_ARCH_COMPAT); +#endif /* SECCOMP_ARCH_COMPAT */ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -659,6 +823,7 @@ static long seccomp_attach_filter(unsigned int flags, * task reference. */ filter->prev = current->seccomp.filter; + seccomp_cache_prepare(filter); current->seccomp.filter = filter; atomic_inc(¤t->seccomp.filter_count);
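For context on where these bitmaps pay off: a later patch in the series consults the cache on the syscall fast path before running any BPF. A sketch of what that lookup can look like (the helper name is assumed for illustration; array_index_nospec() from <linux/nospec.h> clamps the attacker-influenced syscall number before it indexes the bitmap):

    static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap,
                                                        size_t bitmap_size,
                                                        int syscall_nr)
    {
            if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))
                    return false;
            syscall_nr = array_index_nospec(syscall_nr, bitmap_size);

            /* Bit set: every filter in the stack constant-allows this nr. */
            return test_bit(syscall_nr, bitmap);
    }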