[v4,seccomp,0/5] seccomp: Add bitmap cache of constant allow filter results

Message ID cover.1602263422.git.yifeifz2@illinois.edu (mailing list archive)

YiFei Zhu Oct. 9, 2020, 5:14 p.m. UTC
From: YiFei Zhu <yifeifz2@illinois.edu>

Alternative: https://lore.kernel.org/lkml/20200923232923.3142503-1-keescook@chromium.org/T/

Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.

This series adds a bitmap that caches a seccomp filter's result when
the result allows a syscall and is independent of the syscall
arguments. This measurably decreases seccomp overhead for the most
common filters, with a very small memory footprint.

The overhead of running seccomp filters has come up in several past
discussions [1][2][3]. Oftentimes, filters contain a large number of
instructions that check syscall numbers one by one and branch
accordingly. Some users chain BPF filters, which enlarges the overhead
further. A recent work [6] comprehensively measures seccomp overhead
and shows that it is non-negligible and has a non-trivial impact on
application performance.

We observed that some common filters, such as docker's [4] or
systemd's [5], make most of their decisions based only on the syscall
number. As past discussions have noted, a bitmap with one bit per
syscall is the natural fit for such filters.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any access to struct seccomp_data members other than
"arch" and "nr" (the syscall number). If only "arch" and "nr" are
examined and the program returns allow, then we can be sure that the
filter must return allow independent of the syscall arguments.

When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.

Ongoing work aims to further support argument-dependent filters with
fast hash table lookups. We are investigating the performance of doing
so [6], and how best to integrate it with the existing seccomp
infrastructure.

Some benchmarks are performed with results in patch 5, copied below:
  Current BPF sysctl settings:
  net.core.bpf_jit_enable = 1
  net.core.bpf_jit_harden = 0
  Benchmarking 200000000 syscalls...
  129.359381409 - 0.008724424 = 129350656985 (129.4s)
  getpid native: 646 ns
  264.385890006 - 129.360453229 = 135025436777 (135.0s)
  getpid RET_ALLOW 1 filter (bitmap): 675 ns
  399.400511893 - 264.387045901 = 135013465992 (135.0s)
  getpid RET_ALLOW 2 filters (bitmap): 675 ns
  545.872866260 - 399.401718327 = 146471147933 (146.5s)
  getpid RET_ALLOW 3 filters (full): 732 ns
  696.337101319 - 545.874097681 = 150463003638 (150.5s)
  getpid RET_ALLOW 4 filters (full): 752 ns
  Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
  Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
  Estimated total seccomp overhead for 3 full filters: 86 ns
  Estimated total seccomp overhead for 4 full filters: 106 ns
  Estimated seccomp entry overhead: 29 ns
  Estimated seccomp per-filter overhead (last 2 diff): 20 ns
  Estimated seccomp per-filter overhead (filters / 4): 19 ns
  Expectations:
  	native ≤ 1 bitmap (646 ≤ 675): ✔️
  	native ≤ 1 filter (646 ≤ 732): ✔️
  	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
  	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
  	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
  	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
  	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

v3 -> v4:
* Reordered patches
* Naming changes
* Fixed a race in /proc/pid/seccomp_cache against the filter being
  released from the task, using Jann's suggestion of the sighand
  spinlock.
* Cache no longer configurable.
* Copied some description from cover letter to commit messages.
* Used Kees's logic to set clear bits from bitmap, rather than set bits.

v2 -> v3:
* Added array_index_nospec guards
* Removed the syscall_arches[] array and the reliance on loop
  unrolling. Arches are now configured via per-arch seccomp.h.
* Moved filter emulation to attach time (from prepare time).
* Further simplified emulator, basing on Kees's code.
* Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN.

v1 -> v2:
* Corrected one outdated function documentation.

RFC -> v1:
* Config enabled by default on all arches that can support it.
* Added arch numbers array and emulate filter for each arch number, and
  have a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
  instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
  during prepare).
* Stole the selftest from Kees.
* Added a /proc/pid/seccomp_cache by Jann's suggestion.

Patch 1 implements the test_bit lookup against the bitmaps.

Patch 2 implements the emulator that determines if a filter must
return allow.

Patch 3 adds the arch macros for x86.

Patch 4 updates the selftest to better show the new semantics.

Patch 5 implements /proc/pid/seccomp_cache.

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020

Kees Cook (2):
  x86: Enable seccomp architecture tracking
  selftests/seccomp: Compare bitmap vs filter overhead

YiFei Zhu (3):
  seccomp/cache: Lookup syscall allowlist bitmap for fast path
  seccomp/cache: Add "emulator" to check if filter is constant allow
  seccomp/cache: Report cache data through /proc/pid/seccomp_cache

 arch/Kconfig                                  |  24 ++
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/seccomp.h                |  15 +
 fs/proc/base.c                                |   6 +
 include/linux/seccomp.h                       |   5 +
 kernel/seccomp.c                              | 289 +++++++++++++++++-
 .../selftests/seccomp/seccomp_benchmark.c     | 151 +++++++--
 tools/testing/selftests/seccomp/settings      |   2 +-
 8 files changed, 469 insertions(+), 24 deletions(-)

--
2.28.0

Comments

Andy Lutomirski Oct. 9, 2020, 5:25 p.m. UTC | #1
On Fri, Oct 9, 2020 at 10:15 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> From: Kees Cook <keescook@chromium.org>
>
> Provide seccomp internals with the details to calculate which syscall
> table the running kernel is expecting to deal with. This allows for
> efficient architecture pinning and paves the way for constant-action
> bitmaps.
>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
> ---
>  arch/x86/include/asm/seccomp.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
> index 2bd1338de236..03365af6165d 100644
> --- a/arch/x86/include/asm/seccomp.h
> +++ b/arch/x86/include/asm/seccomp.h
> @@ -16,6 +16,18 @@
>  #define __NR_seccomp_sigreturn_32      __NR_ia32_sigreturn
>  #endif
>
> +#ifdef CONFIG_X86_64
> +# define SECCOMP_ARCH_NATIVE           AUDIT_ARCH_X86_64
> +# define SECCOMP_ARCH_NATIVE_NR                NR_syscalls
> +# ifdef CONFIG_COMPAT
> +#  define SECCOMP_ARCH_COMPAT          AUDIT_ARCH_I386
> +#  define SECCOMP_ARCH_COMPAT_NR       IA32_NR_syscalls
> +# endif
> +#else /* !CONFIG_X86_64 */
> +# define SECCOMP_ARCH_NATIVE           AUDIT_ARCH_I386
> +# define SECCOMP_ARCH_NATIVE_NR                NR_syscalls
> +#endif

Is the idea that any syscall that's out of range for this (e.g. all of
the x32 syscalls) is unoptimized?  I'm okay with this, but I think it
could use a comment.

> +
>  #include <asm-generic/seccomp.h>
>
>  #endif /* _ASM_X86_SECCOMP_H */
> --
> 2.28.0
>
YiFei Zhu Oct. 9, 2020, 6:32 p.m. UTC | #2
On Fri, Oct 9, 2020 at 12:25 PM Andy Lutomirski <luto@amacapital.net> wrote:
> Is the idea that any syscall that's out of range for this (e.g. all of
> the x32 syscalls) is unoptimized?  I'm okay with this, but I think it
> could use a comment.

Yes, any syscall number that is out of range is unoptimized. Where do
you think I should put a comment? seccomp_cache_check_allow_bitmap
above `if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))`,
with something like "any syscall number out of range is unoptimized"?

YiFei Zhu
Andy Lutomirski Oct. 9, 2020, 8:59 p.m. UTC | #3
On Fri, Oct 9, 2020 at 11:32 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 12:25 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > Is the idea that any syscall that's out of range for this (e.g. all of
> > the x32 syscalls) is unoptimized?  I'm okay with this, but I think it
> > could use a comment.
>
> Yes, any syscall number that is out of range is unoptimized. Where do
> you think I should put a comment? seccomp_cache_check_allow_bitmap
> above `if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))`,
> with something like "any syscall number out of range is unoptimized"?
>

I was imagining a comment near the new macros explaining that this is
the range of syscalls that seccomp will optimize, that behavior is
still correct (albeit slower) for out of range syscalls, and that x32
is intentionally not optimized.

This avoids people like future me reading this code, not remembering
the context, and thinking it looks buggy.