[RFC,0/2] seccomp: Split set filter into two steps

Message ID

20231003083836.100706-1-hengqi.chen@gmail.com (mailing list archive)

Headers

From: Hengqi Chen <hengqi.chen@gmail.com>
To: linux-kernel@vger.kernel.org,
	bpf@vger.kernel.org
Cc: keescook@chromium.org,
	luto@amacapital.net,
	wad@chromium.org,
	alexyonghe@tencent.com,
	hengqi.chen@gmail.com
Subject: [RFC PATCH 0/2] seccomp: Split set filter into two steps
Date: Tue,  3 Oct 2023 08:38:34 +0000
Message-Id: <20231003083836.100706-1-hengqi.chen@gmail.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

seccomp: Split set filter into two steps

Message

Hengqi Chen Oct. 3, 2023, 8:38 a.m. UTC

This patchset introduces two new operations that essentially
split the SECCOMP_SET_MODE_FILTER process into two steps:
SECCOMP_LOAD_FILTER and SECCOMP_ATTACH_FILTER.

SECCOMP_LOAD_FILTER loads the filter and returns an fd
that can be pinned to bpffs. This extends the lifetime of the
filter so that it can be reused by different processes.
With this new operation, we can eliminate a hot path of JITing
the BPF program (the filter) when we apply the same seccomp filter
to thousands of micro VMs on a bare metal instance.

SECCOMP_ATTACH_FILTER is used to attach a loaded filter.
The filter is represented by an fd that is either returned
from SECCOMP_LOAD_FILTER or obtained from bpffs using the bpf syscall.
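
As a rough illustration of the flow described above, here is a minimal
userspace sketch in C. It is an outline under stated assumptions, not the
interface from the patches: the op numbers (4 and 5), the argument
conventions of the two new operations, and all helper names are
hypothetical. Only BPF_OBJ_PIN, BPF_OBJ_GET and PR_SET_NO_NEW_PRIVS are
existing interfaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* ASSUMPTION: placeholder op numbers. The real values are whatever the
 * series adds to include/uapi/linux/seccomp.h. */
#ifndef SECCOMP_LOAD_FILTER
#define SECCOMP_LOAD_FILTER 4
#endif
#ifndef SECCOMP_ATTACH_FILTER
#define SECCOMP_ATTACH_FILTER 5
#endif

/* Build a one-instruction classic BPF filter that allows every syscall. */
void make_allow_all(struct sock_filter *insn, struct sock_fprog *prog)
{
	assert(insn != NULL && prog != NULL);
	insn[0] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
	prog->len = 1;
	prog->filter = insn;
}

/* Step 1 (hypothetical op): load the filter once. Returns an fd or -1.
 * ASSUMPTION: takes a struct sock_fprog like SECCOMP_SET_MODE_FILTER. */
int seccomp_load_filter(const struct sock_fprog *prog)
{
	return (int)syscall(SYS_seccomp, SECCOMP_LOAD_FILTER, 0, prog);
}

/* Pin the fd under bpffs with the existing BPF_OBJ_PIN command. */
int pin_fd(int fd, const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	attr.bpf_fd = (uint32_t)fd;
	return (int)syscall(SYS_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
}

/* Re-open a pinned object with the existing BPF_OBJ_GET command. */
int get_pinned_fd(const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	return (int)syscall(SYS_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
}

/* Step 2 (hypothetical op): attach a loaded filter to the calling task.
 * ASSUMPTION: the third argument is a pointer to the fd, and
 * no_new_privs is needed as for SECCOMP_SET_MODE_FILTER. */
int seccomp_attach_filter(int fd)
{
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return (int)syscall(SYS_seccomp, SECCOMP_ATTACH_FILTER, 0, &fd);
}
```

In this sketch a loader would call make_allow_all(), seccomp_load_filter()
and pin_fd() once, and each micro VM launcher would then call
get_pinned_fd() followed by seccomp_attach_filter(), so the filter is
JITed only at load time. On a kernel without the series, the two seccomp
calls are expected to fail with EINVAL.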

Hengqi Chen (2):
  seccomp: Introduce SECCOMP_LOAD_FILTER operation
  seccomp: Introduce SECCOMP_ATTACH_FILTER operation

 include/uapi/linux/seccomp.h |   2 +
 kernel/seccomp.c             | 138 ++++++++++++++++++++++++++++++++++-
 2 files changed, 136 insertions(+), 4 deletions(-)

Comments

Kees Cook Oct. 3, 2023, 6:01 p.m. UTC | #1

On Tue, Oct 03, 2023 at 08:38:34AM +0000, Hengqi Chen wrote:
> This patchset introduces two new operations that essentially
> split the SECCOMP_SET_MODE_FILTER process into two steps:
> SECCOMP_LOAD_FILTER and SECCOMP_ATTACH_FILTER.
> 
> SECCOMP_LOAD_FILTER loads the filter and returns an fd
> that can be pinned to bpffs. This extends the lifetime of the
> filter so that it can be reused by different processes.
> With this new operation, we can eliminate a hot path of JITing
> the BPF program (the filter) when we apply the same seccomp filter
> to thousands of micro VMs on a bare metal instance.
> 
> SECCOMP_ATTACH_FILTER is used to attach a loaded filter.
> The filter is represented by an fd that is either returned
> from SECCOMP_LOAD_FILTER or obtained from bpffs using the bpf syscall.

Interesting! I like this idea, thanks for writing it up.

Two design notes:

- Can you reuse/refactor seccomp_prepare_filter() instead of duplicating
  the logic into two new functions?

- Is there a way to make sure the BPF program coming from the fd is one
  that was built via SECCOMP_LOAD_FILTER? (I want to make sure we can
  never confuse a non-seccomp program into getting loaded into seccomp.)

-Kees

Rodrigo Campos Oct. 4, 2023, 2:03 p.m. UTC | #2

On 10/3/23 10:38, Hengqi Chen wrote:
> This patchset introduces two new operations that essentially
> split the SECCOMP_SET_MODE_FILTER process into two steps:
> SECCOMP_LOAD_FILTER and SECCOMP_ATTACH_FILTER.
> 
> SECCOMP_LOAD_FILTER loads the filter and returns an fd
> that can be pinned to bpffs. This extends the lifetime of the
> filter so that it can be reused by different processes.

A quick question, to see whether handling one more thing is
possible/reasonable here.

Let me explain our use case first.

For us (Alban in cc) it would be great if we could extend the lifetime of
the returned fd, so that the process handling seccomp notifications in
userspace can easily crash or be updated. Today, if the agent that got
the fd crashes, all the "notify-syscalls" return ENOSYS in the target
process.
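
For readers less familiar with the notification mechanism, the sketch
below shows where that fd comes from today. It builds a small filter that
sends one syscall to userspace and installs it with the existing
SECCOMP_FILTER_FLAG_NEW_LISTENER flag, which returns the listener fd. The
helper names are made up for illustration, and the filter skips the
seccomp_data.arch check that a real filter should perform.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Build a 4-instruction filter: syscall `nr` is forwarded to the
 * userspace agent, everything else is allowed.
 * NOTE: the arch check is omitted for brevity. */
void make_notify_filter(struct sock_filter insn[4], struct sock_fprog *prog,
			unsigned int nr)
{
	struct sock_filter f[4] = {
		/* A = seccomp_data.nr */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
		/* if (A == nr) fall through, else skip one instruction */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};

	assert(insn != NULL && prog != NULL);
	for (int i = 0; i < 4; i++)
		insn[i] = f[i];
	prog->len = 4;
	prog->filter = insn;
}

/* Install the filter on the calling task and return the listener fd
 * (or -1 on error). The fd is created by this attach step. */
int install_with_listener(const struct sock_fprog *prog)
{
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
			    SECCOMP_FILTER_FLAG_NEW_LISTENER, prog);
}
```

The listener fd is typically passed to the agent (for example over a Unix
socket). Because it is tied to this particular attachment, keeping only
the loaded filter alive would not keep the listener alive.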

Our use case: we created a seccomp agent for use in Kubernetes
(github.com/kinvolk/seccompagent) and we need to handle the agent either
crashing or being upgraded. We have thought about tricks such as having
another container that just stores the fds and never crashes, but that is
not ideal (we also looked at using systemd to store our fds, but that is
no simpler to use from containers).

If the agent crashes today, all the syscalls return ENOSYS. It would be
great if we could make the process doing the syscall just wait until a new
process handling the notifications is up, with the syscalls made in the
meantime simply queued. In other words, a mode that says "if the agent
crashes, just queue notifications; an agent will come back soon to pick
them up" (we can of course put a reasonable limit on the notification
queue).

It seems the split here would not work for that use case as is. I think
we would need to pin the attachment.

Do you think handling that is something reasonable to do in this series too?

I'll be afk until the end of next week. I'll catch up as soon as I'm back
with internet :)

Best,
Rodrigo

Hengqi Chen Oct. 6, 2023, 7:58 a.m. UTC | #3

+ BPF maintainers

On Wed, Oct 4, 2023 at 2:02 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Tue, Oct 03, 2023 at 08:38:34AM +0000, Hengqi Chen wrote:
> > This patchset introduces two new operations that essentially
> > split the SECCOMP_SET_MODE_FILTER process into two steps:
> > SECCOMP_LOAD_FILTER and SECCOMP_ATTACH_FILTER.
> >
> > SECCOMP_LOAD_FILTER loads the filter and returns an fd
> > that can be pinned to bpffs. This extends the lifetime of the
> > filter so that it can be reused by different processes.
> > With this new operation, we can eliminate a hot path of JITing
> > the BPF program (the filter) when we apply the same seccomp filter
> > to thousands of micro VMs on a bare metal instance.
> >
> > SECCOMP_ATTACH_FILTER is used to attach a loaded filter.
> > The filter is represented by an fd that is either returned
> > from SECCOMP_LOAD_FILTER or obtained from bpffs using the bpf syscall.
>
> Interesting! I like this idea, thanks for writing it up.
>
> Two design notes:
>
> - Can you reuse/refactor seccomp_prepare_filter() instead of duplicating
>   the logic into two new functions?
>

Sure, will do.

> - Is there a way to make sure the BPF program coming from the fd is one
>   that was built via SECCOMP_LOAD_FILTER? (I want to make sure we can
>   never confuse a non-seccomp program into getting loaded into seccomp.)
>

Maybe we can add a new prog type enum value, like BPF_PROG_TYPE_SECCOMP,
for seccomp filters.
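
One possible reading of this suggestion is "tag at load time, check at
attach time". The toy userspace model below only illustrates that idea.
It is not kernel code, every name in it is hypothetical, and it ignores
details such as filter stacking.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for kernel types; only the type tag matters here. */
enum toy_prog_type {
	TOY_PROG_TYPE_SOCKET_FILTER,
	TOY_PROG_TYPE_SECCOMP,	/* models the proposed BPF_PROG_TYPE_SECCOMP */
};

struct toy_prog {
	enum toy_prog_type type;
};

struct toy_task {
	const struct toy_prog *filter;	/* attached seccomp filter, if any */
};

/* Models the load step: the only place a program gets the seccomp tag. */
struct toy_prog toy_seccomp_load(void)
{
	return (struct toy_prog){ .type = TOY_PROG_TYPE_SECCOMP };
}

/* Models the attach step: refuse any program not tagged as seccomp. */
int toy_seccomp_attach(struct toy_task *task, const struct toy_prog *prog)
{
	assert(task != NULL);
	if (prog == NULL || prog->type != TOY_PROG_TYPE_SECCOMP)
		return -EINVAL;
	task->filter = prog;
	return 0;
}
```

With such a check, an fd that refers to some other kind of BPF program
(obtained from bpffs, say) would be rejected before it could ever run as
a seccomp filter.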

> -Kees
>
> --
> Kees Cook
>

Cheers,
--
Hengqi

Hengqi Chen Oct. 6, 2023, 8:12 a.m. UTC | #4

On Wed, Oct 4, 2023 at 10:03 PM Rodrigo Campos <rodrigo@sdfg.com.ar> wrote:
>
> On 10/3/23 10:38, Hengqi Chen wrote:
> > This patchset introduces two new operations that essentially
> > split the SECCOMP_SET_MODE_FILTER process into two steps:
> > SECCOMP_LOAD_FILTER and SECCOMP_ATTACH_FILTER.
> >
> > SECCOMP_LOAD_FILTER loads the filter and returns an fd
> > that can be pinned to bpffs. This extends the lifetime of the
> > filter so that it can be reused by different processes.
>
> A quick question, to see whether handling one more thing is
> possible/reasonable here.
>
> Let me explain our use case first.
>
> For us (Alban in cc) it would be great if we could extend the lifetime of
> the returned fd, so that the process handling seccomp notifications in
> userspace can easily crash or be updated. Today, if the agent that got
> the fd crashes, all the "notify-syscalls" return ENOSYS in the target
> process.
>
> Our use case: we created a seccomp agent for use in Kubernetes
> (github.com/kinvolk/seccompagent) and we need to handle the agent either
> crashing or being upgraded. We have thought about tricks such as having
> another container that just stores the fds and never crashes, but that is
> not ideal (we also looked at using systemd to store our fds, but that is
> no simpler to use from containers).
>
> If the agent crashes today, all the syscalls return ENOSYS. It would be
> great if we could make the process doing the syscall just wait until a new
> process handling the notifications is up, with the syscalls made in the
> meantime simply queued. In other words, a mode that says "if the agent
> crashes, just queue notifications; an agent will come back soon to pick
> them up" (we can of course put a reasonable limit on the notification
> queue).
>
> It seems the split here would not work for that use case as is. I think
> we would need to pin the attachment.
>
> Do you think handling that is something reasonable to do in this series too?
>

I am not familiar with this notification mechanism, but it seems unrelated.
This patchset is trying to reuse the seccomp filter itself.

> I'll be afk until the end of next week. I'll catch up as soon as I'm back
> with internet :)
>
>
>
> Best,
> Rodrigo

--
Hengqi