
[RFC,0/8] seccomp: Implement constant action bitmaps

Message ID 20200616074934.1600036-1-keescook@chromium.org (mailing list archive)

Message

Kees Cook June 16, 2020, 7:49 a.m. UTC
Hi,

This is my initial stab at making constant-time seccomp bitmaps that
are automatically generated from filters as they are added. This version
is x86 only (and not x86_x32), but it should be easy to expand this to
other architectures. I'd like to get arm64 working, but it has some
NR_syscalls shenanigans I haven't sorted out yet.

The first two patches are small clean-ups that I intend to land in
for-next/seccomp unless there are objections. Patch 3 is another
experimental feature to perform architecture-pinning. Patch 4 is the
bulk of the bitmap code. Patch 5 is benchmark updates. Patches 6 and
7 perform the x86 enablement. Patch 8 is just a debugging example,
in case anyone wants to play with this and would find it helpful.

Repeating the commit log from patch 4:


One of the most common pain points with seccomp filters has been dealing
with the overhead of processing the filters, especially for "always allow"
or "always reject" cases. While BPF is extremely fast[1], it will always
have overhead associated with it. Additionally, due to seccomp's design,
filters are layered, which means processing time goes up as the number
of filters attached goes up.

In the past, efforts have been focused on making filter execution complete
in a shorter amount of time. For example, filters were rewritten from
using linear if/then/else syscall search to using balanced binary trees,
or moving tests for syscalls common to the process's workload to the
front of the filter. However, there are limits to this, especially when
some processes are dealing with tens of filters[2], or when some
architectures have a less efficient BPF engine[3].

The most common use of seccomp, constructing syscall block/allow-lists
(where syscalls are always allowed or always rejected without regard
to their arguments), also tends to produce the most pathological runtime
problems, in that a large number of syscall checks in the filter need
to be performed to come to a determination.

In order to optimize these cases from O(n) to O(1), seccomp can
use bitmaps to immediately determine the desired action. A critical
observation in the prior paragraph bears repeating: the common case of
syscall tests does not check arguments. For any given filter, there is a
constant mapping from the combination of architecture and syscall to the
seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
there is a single architecture.) As such, it is possible to construct
a mapping of arch/syscall to action, which can be updated as new filters
are attached to a process.
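
The fast path this enables can be sketched in a few lines of userspace
C (the names, table size, and single-bitmap shape here are illustrative,
not the kernel's actual data structures):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SYSCALLS 450
#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* One bit per syscall: set means "every attached filter always allows it". */
struct action_bitmap {
	unsigned long allow[(NR_SYSCALLS + BITS_PER_LONG - 1) / BITS_PER_LONG];
};

static void bitmap_set_allow(struct action_bitmap *b, int nr)
{
	b->allow[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* O(1) fast path: consult the bitmap instead of running any BPF filter. */
static bool bitmap_allows(const struct action_bitmap *b, int nr)
{
	if (nr < 0 || nr >= NR_SYSCALLS)
		return false;
	return b->allow[nr / BITS_PER_LONG] & (1UL << (nr % BITS_PER_LONG));
}
```

A syscall whose bit is clear simply falls back to full filter execution,
so the bitmap can only ever be a conservative cache of the filter result.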

In order to build this mapping at filter attach time, each filter is
executed for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, then
there is a constant mapping for that syscall, and bitmaps can be updated
accordingly. If any accesses happen outside of those struct members,
seccomp must not bypass filter execution for that syscall, since program
state will be used to determine filter action result.
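
The "only arch and nr" test can be illustrated with a toy static
analysis. Note the series itself detects accesses at runtime (via the
page-accessed trick described in the next paragraph), and the one-field
instruction format below is a deliberate simplification of classic BPF;
only the offset comparison is the point:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout of struct seccomp_data from <linux/seccomp.h>. */
struct seccomp_data {
	int32_t nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/* Simplified load instruction: absolute 32-bit load at offset "k". */
struct toy_insn {
	uint16_t code;		/* BPF_LD | BPF_W | BPF_ABS */
	uint32_t k;		/* byte offset into struct seccomp_data */
};

#define TOY_LD_ABS 0x20

/*
 * A filter can have a constant action for a syscall only if every load
 * it performs reads "nr" or "arch"; any other offset means program state
 * may influence the result.
 */
static bool loads_only_nr_and_arch(const struct toy_insn *prog, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (prog[i].code != TOY_LD_ABS)
			continue;
		if (prog[i].k != offsetof(struct seccomp_data, nr) &&
		    prog[i].k != offsetof(struct seccomp_data, arch))
			return false;
	}
	return true;
}
```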

During syscall action probing, in order to determine whether other members
of struct seccomp_data are being accessed during a filter execution,
the struct is placed across a page boundary with the "arch" and "nr"
members in the first page, and everything else in the second page. The
"page accessed" flag is cleared in the second page's PTE, and the filter
is run. If the "page accessed" flag appears as set after running the
filter, we can determine that the filter looked beyond the "arch" and
"nr" members, and exclude that syscall from the constant action bitmaps.
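
The straddling placement amounts to simple pointer arithmetic, sketched
here in userspace (the struct layout mirrors the kernel's; the PTE
accessed-bit clearing and checking itself is omitted):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout of struct seccomp_data from <linux/seccomp.h>. */
struct seccomp_data {
	int32_t nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

#define PAGE_SIZE 4096UL

/*
 * Place the struct so that "nr" and "arch" (the first 8 bytes) end the
 * first page, and everything from "instruction_pointer" onward begins
 * the second page.  "pages" must point at two contiguous, page-aligned
 * pages.
 */
static struct seccomp_data *straddle_pages(void *pages)
{
	uintptr_t boundary = (uintptr_t)pages + PAGE_SIZE;

	return (struct seccomp_data *)(boundary -
			offsetof(struct seccomp_data, instruction_pointer));
}
```

Any filter load past "arch" then touches the second page and leaves the
accessed bit set there, which is exactly the signal being probed for.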

For architectures to support this optimization, they must declare
their architectures for seccomp to see (via SECCOMP_ARCH and
SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient
CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()),
and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig.
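
For a rough picture, the per-architecture hook-up named above might look
like the following sketch; the macro values are illustrative guesses (the
real definitions are in the x86 patches of this series):

```c
/* arch/x86/include/asm/syscall.h (illustrative sketch) */
#define SECCOMP_ARCH		AUDIT_ARCH_X86_64
#ifdef CONFIG_COMPAT
#define SECCOMP_ARCH_COMPAT	AUDIT_ARCH_I386
#endif

/* arch/x86/Kconfig (illustrative sketch):
 *	config X86
 *		select HAVE_ARCH_SECCOMP_BITMAP
 */
```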

Areas needing more attention:

On x86, this currently adds 168 bytes (or 336 bytes under CONFIG_COMPAT)
to the size of task_struct. Allocating these on demand may be a better
use of memory, but may not result in good cache locality.

For architectures with "synthetic" sub-architectures, like x86_x32,
additional work is needed. It should be possible to define a simple
mechanism based on the masking done in the x86 syscall entry path to
create another set of bitmaps for seccomp to key off of. I am, however,
considering just leaving HAVE_ARCH_SECCOMP_BITMAP depend on !X86_X32.
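
The masking idea could look roughly like this sketch (on x86, x32
syscalls reuse the x86_64 entry path but set a marker bit in the syscall
number, __X32_SYSCALL_BIT in the real kernel headers; the helper name
here is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

#define X32_SYSCALL_BIT 0x40000000

/*
 * Strip the x32 marker bit and report which syscall table (and thus
 * which set of bitmaps) the number belongs to.
 */
static int syscall_table_index(int nr, bool *is_x32)
{
	*is_x32 = (nr & X32_SYSCALL_BIT) != 0;
	return nr & ~X32_SYSCALL_BIT;
}
```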

[1] https://lore.kernel.org/bpf/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/
[2] https://lore.kernel.org/bpf/20200601101137.GA121847@gardel-login/
[3] https://lore.kernel.org/bpf/717a06e7f35740ccb4c70470ec70fb2f@huawei.com/



Thanks!

-Kees

Kees Cook (8):
  selftests/seccomp: Improve calibration loop
  seccomp: Use pr_fmt
  seccomp: Introduce SECCOMP_PIN_ARCHITECTURE
  seccomp: Implement constant action bitmaps
  selftests/seccomp: Compare bitmap vs filter overhead
  x86: Provide API for local kernel TLB flushing
  x86: Enable seccomp constant action bitmaps
  [DEBUG] seccomp: Report bitmap coverage ranges

 arch/Kconfig                                  |   7 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/syscall.h                |   5 +
 arch/x86/include/asm/tlbflush.h               |   2 +
 arch/x86/mm/tlb.c                             |  12 +-
 include/linux/seccomp.h                       |  18 +
 include/uapi/linux/seccomp.h                  |   1 +
 kernel/seccomp.c                              | 374 +++++++++++++++++-
 .../selftests/seccomp/seccomp_benchmark.c     | 197 +++++++--
 tools/testing/selftests/seccomp/settings      |   2 +-
 10 files changed, 571 insertions(+), 48 deletions(-)

Comments

Andy Lutomirski June 16, 2020, 5:01 p.m. UTC | #1
On Tue, Jun 16, 2020 at 12:49 AM Kees Cook <keescook@chromium.org> wrote:
>
> Hi,
>

> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly. If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result.

>
> During syscall action probing, in order to determine whether other members
> of struct seccomp_data are being accessed during a filter execution,
> the struct is placed across a page boundary with the "arch" and "nr"
> members in the first page, and everything else in the second page. The
> "page accessed" flag is cleared in the second page's PTE, and the filter
> is run. If the "page accessed" flag appears as set after running the
> filter, we can determine that the filter looked beyond the "arch" and
> "nr" members, and exclude that syscall from the constant action bitmaps.

This is... evil.  I don't know how I feel about it.  It's also
potentially quite slow.

I don't suppose you could, instead, instrument the BPF code to get at
this without TLB hackery?  Or maybe try to do some real symbolic
execution of the BPF code?

--Andy
Kees Cook June 16, 2020, 6:35 p.m. UTC | #2
On Tue, Jun 16, 2020 at 10:01:43AM -0700, Andy Lutomirski wrote:
> On Tue, Jun 16, 2020 at 12:49 AM Kees Cook <keescook@chromium.org> wrote:
> >
> > Hi,
> >
> 
> > In order to build this mapping at filter attach time, each filter is
> > executed for every syscall (under each possible architecture), and
> > checked for any accesses of struct seccomp_data that are not the "arch"
> > nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> > there is a constant mapping for that syscall, and bitmaps can be updated
> > accordingly. If any accesses happen outside of those struct members,
> > seccomp must not bypass filter execution for that syscall, since program
> > state will be used to determine filter action result.
> 
> >
> > During syscall action probing, in order to determine whether other members
> > of struct seccomp_data are being accessed during a filter execution,
> > the struct is placed across a page boundary with the "arch" and "nr"
> > members in the first page, and everything else in the second page. The
> > "page accessed" flag is cleared in the second page's PTE, and the filter
> > is run. If the "page accessed" flag appears as set after running the
> > filter, we can determine that the filter looked beyond the "arch" and
> > "nr" members, and exclude that syscall from the constant action bitmaps.
> 
> This is... evil.  I don't know how I feel about it.  It's also

Thank you! ;)

> potentially quite slow.

I got the impression that (worst-case: a "full" filter for every
arch/syscall combo) ~900 _local_ TLB flushes per filter attach wouldn't be
very slow at all. (And the code is optimized to avoid needless flushes.)
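
(Back-of-the-envelope version of that worst case; the per-table entry
count here is an assumption, not a number from the patch:)

```c
#include <assert.h>

/* One probe pass per arch/syscall combination: roughly 450 syscall-table
 * entries times two audited architectures (native + compat). */
enum { APPROX_SYSCALLS = 450, APPROX_ARCHES = 2 };

static int worst_case_probe_runs(void)
{
	return APPROX_SYSCALLS * APPROX_ARCHES;	/* ~900 */
}
```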

> I don't suppose you could, instead, instrument the BPF code to get at
> this without TLB hackery?  Or maybe try to do some real symbolic
> execution of the BPF code?

I think the "simple emulator" path[1] might get us a realistically large
coverage. I'm going to try it out, and see what it looks like.

-Kees

[1] https://lore.kernel.org/lkml/202006160757.99FD9B785@keescook/