seccomp: passthrough uretprobe systemcall without filtering

Message ID	20250117005539.325887-1-eyal.birger@gmail.com (mailing list archive)
State	Superseded
Headers	show Received: from mail-pl1-f180.google.com (mail-pl1-f180.google.com [209.85.214.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E3A3F4964F; Fri, 17 Jan 2025 00:56:05 +0000 (UTC) From: Eyal Birger <eyal.birger@gmail.com> To: kees@kernel.org, luto@amacapital.net, wad@chromium.org, oleg@redhat.com, mhiramat@kernel.org, andrii@kernel.org, jolsa@kernel.org Cc: alexei.starovoitov@gmail.com, olsajiri@gmail.com, cyphar@cyphar.com, songliubraving@fb.com, yhs@fb.com, john.fastabend@gmail.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, daniel@iogearbox.net, ast@kernel.org, andrii.nakryiko@gmail.com, rostedt@goodmis.org, rafi@rbk.io, shmulik.ladkani@gmail.com, bpf@vger.kernel.org, linux-api@vger.kernel.org, linux-trace-kernel@vger.kernel.org, x86@kernel.org, linux-kernel@vger.kernel.org, Eyal Birger <eyal.birger@gmail.com>, stable@vger.kernel.org Subject: [PATCH] seccomp: passthrough uretprobe systemcall without filtering Date: Thu, 16 Jan 2025 16:55:39 -0800 Message-ID: <20250117005539.325887-1-eyal.birger@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	seccomp: passthrough uretprobe systemcall without filtering \| expand seccomp: passthrough uretprobe systemcall without filtering

Eyal Birger Jan. 17, 2025, 12:55 a.m. UTC

When attaching uretprobes to processes running inside docker, the attached
process is segfaulted when encountering the retprobe.

The reason is that now that uretprobe is a system call the default seccomp
filters in docker block it as they only allow a specific set of known
syscalls. This is true for other userspace applications which use seccomp
to control their syscall surface.

Since uretprobe is a "kernel implementation detail" system call which is
not used by userspace application code directly, it is impractical and
there's very little point in forcing all userspace applications to
explicitly allow it in order to avoid crashing tracked processes.

Pass this systemcall through seccomp without depending on configuration.

Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
Reported-by: Rafael Buchbinder <rafi@rbk.io>
Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
Cc: stable@vger.kernel.org
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
---

The following reproduction script synthetically demonstrates the problem:

cat > /tmp/x.c << EOF

char *syscalls[] = {
	"write",
	"exit_group",
	"fstat",
};

__attribute__((noinline)) int probed(void)
{
	printf("Probed\n");
	return 1;
}

void apply_seccomp_filter(char **syscalls, int num_syscalls)
{
	scmp_filter_ctx ctx;

	ctx = seccomp_init(SCMP_ACT_KILL);
	for (int i = 0; i < num_syscalls; i++) {
		seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
				 seccomp_syscall_resolve_name(syscalls[i]), 0);
	}
	seccomp_load(ctx);
	seccomp_release(ctx);
}

int main(int argc, char *argv[])
{
	int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);

	apply_seccomp_filter(syscalls, num_syscalls);

	probed();

	return 0;
}
EOF

cat > /tmp/trace.bt << EOF
uretprobe:/tmp/x:probed
{
    printf("ret=%d\n", retval);
}
EOF

gcc -o /tmp/x /tmp/x.c -lseccomp

/usr/bin/bpftrace /tmp/trace.bt &

sleep 5 # wait for uretprobe attach
/tmp/x

pkill bpftrace

rm /tmp/x /tmp/x.c /tmp/trace.bt
---
 kernel/seccomp.c | 5 +++++
 1 file changed, 5 insertions(+)

Oleg Nesterov Jan. 17, 2025, 1:39 a.m. UTC | #1

On 01/16, Eyal Birger wrote:
>
> Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> Reported-by: Rafael Buchbinder <rafi@rbk.io>
> Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> Cc: stable@vger.kernel.org
...
> @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
>  	this_syscall = sd ? sd->nr :
>  		syscall_get_nr(current, current_pt_regs());
>
> +#ifdef CONFIG_X86_64
> +	if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> +		return 0;
> +#endif

Acked-by: Oleg Nesterov <oleg@redhat.com>

A note for the seccomp maintainers...

I don't know what do you think, but I agree in advance that the very fact this
patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.

The problem is that we need a simple patch for -stable which fixes the real
problem. We can cleanup this logic later, I think.

Oleg.

Masami Hiramatsu (Google) Jan. 17, 2025, 8:02 a.m. UTC | #2

On Fri, 17 Jan 2025 02:39:28 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> On 01/16, Eyal Birger wrote:
> >
> > Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> > Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> > Cc: stable@vger.kernel.org
> ...
> > @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
> >  	this_syscall = sd ? sd->nr :
> >  		syscall_get_nr(current, current_pt_regs());
> >
> > +#ifdef CONFIG_X86_64
> > +	if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> > +		return 0;
> > +#endif
> 
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> 
> 
> A note for the seccomp maintainers...
> 
> I don't know what do you think, but I agree in advance that the very fact this
> patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> 

Indeed. in_ia32_syscall() depends arch/x86 too.
We can add an inline function like;

``` uprobes.h
static inline bool is_uprobe_syscall(int syscall)
{
	// arch_is_uprobe_syscall check can be replaced by Kconfig,
	// something like CONFIG_ARCH_URETPROBE_SYSCALL.
#ifdef arch_is_uprobe_syscall
	return arch_is_uprobe_syscall(syscall)
#else
	return false;
#endif
}
```
and 
``` arch/x86/include/asm/uprobes.h
#define arch_is_uprobe_syscall(syscall) \
	(IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())
```

> The problem is that we need a simple patch for -stable which fixes the real
> problem. We can cleanup this logic later, I think.

Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that
do not pollute the seccomp subsystem with #ifdef.

Thank you,



> 
> Oleg.
>

Eyal Birger Jan. 17, 2025, 1:36 p.m. UTC | #3

On Fri, Jan 17, 2025 at 12:02 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Fri, 17 Jan 2025 02:39:28 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 01/16, Eyal Birger wrote:
> > >
> > > Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> > > Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> > > Cc: stable@vger.kernel.org
> > ...
> > > @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
> > >     this_syscall = sd ? sd->nr :
> > >             syscall_get_nr(current, current_pt_regs());
> > >
> > > +#ifdef CONFIG_X86_64
> > > +   if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> > > +           return 0;
> > > +#endif
> >
> > Acked-by: Oleg Nesterov <oleg@redhat.com>
> >
> >
> > A note for the seccomp maintainers...
> >
> > I don't know what do you think, but I agree in advance that the very fact this
> > patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> >
>
> Indeed. in_ia32_syscall() depends arch/x86 too.
> We can add an inline function like;
>
> ``` uprobes.h
> static inline bool is_uprobe_syscall(int syscall)
> {
>         // arch_is_uprobe_syscall check can be replaced by Kconfig,
>         // something like CONFIG_ARCH_URETPROBE_SYSCALL.
> #ifdef arch_is_uprobe_syscall
>         return arch_is_uprobe_syscall(syscall)
> #else
>         return false;
> #endif
> }
> ```
> and
> ``` arch/x86/include/asm/uprobes.h
> #define arch_is_uprobe_syscall(syscall) \
>         (IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())
> ```

Notice it'll need to be enclosed in ifdef CONFIG_X86_64 as __NR_uretprobe
isn't defined otherwise so the IS_ENABLED isn't needed.

>
> > The problem is that we need a simple patch for -stable which fixes the real
> > problem. We can cleanup this logic later, I think.
>
> Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that
> do not pollute the seccomp subsystem with #ifdef.

I like this approach.

Notice it does add a couple of includes that weren't there before:
- arch/x86/include/asm/uprobes.h would include asm/unistd.h
- seccomp.c would include linux/uprobes.h

So it's a less "trivial" patch... If that's ok I can repost with these
changes.

Eyal.

Oleg Nesterov Jan. 17, 2025, 2:09 p.m. UTC | #4

On 01/17, Masami Hiramatsu wrote:
>
> On Fri, 17 Jan 2025 02:39:28 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > A note for the seccomp maintainers...
> >
> > I don't know what do you think, but I agree in advance that the very fact this
> > patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> >
>
> Indeed. in_ia32_syscall() depends arch/x86 too.
> We can add an inline function like;
>
> ``` uprobes.h
> static inline bool is_uprobe_syscall(int syscall)
> {

We can, and this is what I tried to suggest from the very beginning.
But I agree with Eyal who decided to send the most trivial fix for
-stable, we can add the helper later.

I don't think it should live in uprobes.h and I'd prefer something
like arch_seccomp_ignored(int) but I won't insist.

> 	// arch_is_uprobe_syscall check can be replaced by Kconfig,
> 	// something like CONFIG_ARCH_URETPROBE_SYSCALL.

Or sysctl or both. This is another issue.

> ``` arch/x86/include/asm/uprobes.h
> #define arch_is_uprobe_syscall(syscall) \
> 	(IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())
> ```

This won't compile if IS_ENABLED(CONFIG_X86_64) == false, __NR_uretprobe
will be undefined.

> > The problem is that we need a simple patch for -stable which fixes the real
> > problem. We can cleanup this logic later, I think.
>
> Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that
> do not pollute the seccomp subsystem with #ifdef.

See above. But I won't insist.

Oleg.

Andrii Nakryiko Jan. 17, 2025, 5:51 p.m. UTC | #5

On Fri, Jan 17, 2025 at 6:10 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 01/17, Masami Hiramatsu wrote:
> >
> > On Fri, 17 Jan 2025 02:39:28 +0100
> > Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > A note for the seccomp maintainers...
> > >
> > > I don't know what do you think, but I agree in advance that the very fact this
> > > patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> > >
> >
> > Indeed. in_ia32_syscall() depends arch/x86 too.
> > We can add an inline function like;
> >
> > ``` uprobes.h
> > static inline bool is_uprobe_syscall(int syscall)
> > {
>
> We can, and this is what I tried to suggest from the very beginning.
> But I agree with Eyal who decided to send the most trivial fix for
> -stable, we can add the helper later.
>
> I don't think it should live in uprobes.h and I'd prefer something
> like arch_seccomp_ignored(int) but I won't insist.

yep, I think this is the way, keeping it as a general category. Should
we also put rt_sigreturn there explicitly as well? Also, wouldn't it
be better to have it as a non-arch-specific function for something
like rt_sigreturn where defining it per each arch is cumbersome, and
have the default implementation also call into an arch-specific
function?

>
> >       // arch_is_uprobe_syscall check can be replaced by Kconfig,
> >       // something like CONFIG_ARCH_URETPROBE_SYSCALL.
>
> Or sysctl or both. This is another issue.
>
> > ``` arch/x86/include/asm/uprobes.h
> > #define arch_is_uprobe_syscall(syscall) \
> >       (IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())
> > ```
>
> This won't compile if IS_ENABLED(CONFIG_X86_64) == false, __NR_uretprobe
> will be undefined.
>
> > > The problem is that we need a simple patch for -stable which fixes the real
> > > problem. We can cleanup this logic later, I think.
> >
> > Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that
> > do not pollute the seccomp subsystem with #ifdef.
>
> See above. But I won't insist.
>
> Oleg.
>

Dmitry V. Levin Jan. 17, 2025, 6:34 p.m. UTC | #6

On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:
> When attaching uretprobes to processes running inside docker, the attached
> process is segfaulted when encountering the retprobe.
> 
> The reason is that now that uretprobe is a system call the default seccomp
> filters in docker block it as they only allow a specific set of known
> syscalls. This is true for other userspace applications which use seccomp
> to control their syscall surface.
> 
> Since uretprobe is a "kernel implementation detail" system call which is
> not used by userspace application code directly, it is impractical and
> there's very little point in forcing all userspace applications to
> explicitly allow it in order to avoid crashing tracked processes.
> 
> Pass this systemcall through seccomp without depending on configuration.
> 
> Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> Reported-by: Rafael Buchbinder <rafi@rbk.io>
> Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> Cc: stable@vger.kernel.org
> Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
> ---
> 
> The following reproduction script synthetically demonstrates the problem:
> 
> cat > /tmp/x.c << EOF
> 
> char *syscalls[] = {
> 	"write",
> 	"exit_group",
> 	"fstat",
> };
> 
> __attribute__((noinline)) int probed(void)
> {
> 	printf("Probed\n");
> 	return 1;
> }
> 
> void apply_seccomp_filter(char **syscalls, int num_syscalls)
> {
> 	scmp_filter_ctx ctx;
> 
> 	ctx = seccomp_init(SCMP_ACT_KILL);
> 	for (int i = 0; i < num_syscalls; i++) {
> 		seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
> 				 seccomp_syscall_resolve_name(syscalls[i]), 0);
> 	}
> 	seccomp_load(ctx);
> 	seccomp_release(ctx);
> }
> 
> int main(int argc, char *argv[])
> {
> 	int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);
> 
> 	apply_seccomp_filter(syscalls, num_syscalls);
> 
> 	probed();
> 
> 	return 0;
> }
> EOF
> 
> cat > /tmp/trace.bt << EOF
> uretprobe:/tmp/x:probed
> {
>     printf("ret=%d\n", retval);
> }
> EOF
> 
> gcc -o /tmp/x /tmp/x.c -lseccomp
> 
> /usr/bin/bpftrace /tmp/trace.bt &
> 
> sleep 5 # wait for uretprobe attach
> /tmp/x
> 
> pkill bpftrace
> 
> rm /tmp/x /tmp/x.c /tmp/trace.bt
> ---
>  kernel/seccomp.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 385d48293a5f..10a55c9b5c18 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
>  	this_syscall = sd ? sd->nr :
>  		syscall_get_nr(current, current_pt_regs());
>  
> +#ifdef CONFIG_X86_64
> +	if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> +		return 0;
> +#endif
> +
>  	switch (mode) {
>  	case SECCOMP_MODE_STRICT:
>  		__secure_computing_strict(this_syscall);  /* may call do_exit */

This seems to be a hot fix to bypass some SECCOMP_RET_ERRNO filters.
However, this way it bypasses seccomp completely, including
SECCOMP_RET_TRACE, making it invisible to strace --seccomp,
and I wonder why do you want that.

Eyal Birger Jan. 17, 2025, 6:52 p.m. UTC | #7

On Fri, Jan 17, 2025 at 10:34 AM Dmitry V. Levin <ldv@strace.io> wrote:
>
> On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:
> > When attaching uretprobes to processes running inside docker, the attached
> > process is segfaulted when encountering the retprobe.
> >
> > The reason is that now that uretprobe is a system call the default seccomp
> > filters in docker block it as they only allow a specific set of known
> > syscalls. This is true for other userspace applications which use seccomp
> > to control their syscall surface.
> >
> > Since uretprobe is a "kernel implementation detail" system call which is
> > not used by userspace application code directly, it is impractical and
> > there's very little point in forcing all userspace applications to
> > explicitly allow it in order to avoid crashing tracked processes.
> >
> > Pass this systemcall through seccomp without depending on configuration.
> >
> > Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> > Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
> > ---
> >
> > The following reproduction script synthetically demonstrates the problem:
> >
> > cat > /tmp/x.c << EOF
> >
> > char *syscalls[] = {
> >       "write",
> >       "exit_group",
> >       "fstat",
> > };
> >
> > __attribute__((noinline)) int probed(void)
> > {
> >       printf("Probed\n");
> >       return 1;
> > }
> >
> > void apply_seccomp_filter(char **syscalls, int num_syscalls)
> > {
> >       scmp_filter_ctx ctx;
> >
> >       ctx = seccomp_init(SCMP_ACT_KILL);
> >       for (int i = 0; i < num_syscalls; i++) {
> >               seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
> >                                seccomp_syscall_resolve_name(syscalls[i]), 0);
> >       }
> >       seccomp_load(ctx);
> >       seccomp_release(ctx);
> > }
> >
> > int main(int argc, char *argv[])
> > {
> >       int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);
> >
> >       apply_seccomp_filter(syscalls, num_syscalls);
> >
> >       probed();
> >
> >       return 0;
> > }
> > EOF
> >
> > cat > /tmp/trace.bt << EOF
> > uretprobe:/tmp/x:probed
> > {
> >     printf("ret=%d\n", retval);
> > }
> > EOF
> >
> > gcc -o /tmp/x /tmp/x.c -lseccomp
> >
> > /usr/bin/bpftrace /tmp/trace.bt &
> >
> > sleep 5 # wait for uretprobe attach
> > /tmp/x
> >
> > pkill bpftrace
> >
> > rm /tmp/x /tmp/x.c /tmp/trace.bt
> > ---
> >  kernel/seccomp.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 385d48293a5f..10a55c9b5c18 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd)
> >       this_syscall = sd ? sd->nr :
> >               syscall_get_nr(current, current_pt_regs());
> >
> > +#ifdef CONFIG_X86_64
> > +     if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
> > +             return 0;
> > +#endif
> > +
> >       switch (mode) {
> >       case SECCOMP_MODE_STRICT:
> >               __secure_computing_strict(this_syscall);  /* may call do_exit */
>
> This seems to be a hot fix to bypass some SECCOMP_RET_ERRNO filters.

It's a little broader than just SECCOMP_RET_ERRNO, but yes, this is a
hotfix to avoid filtering this system call in seccomp.

The rationale is that this is not a userspace created system call - the
kernel uses it to instrument the function - and the fact that it's a
system call is just an implementation detail. Ideally, userspace wouldn't
need to know or care about it.

> However, this way it bypasses seccomp completely, including
> SECCOMP_RET_TRACE, making it invisible to strace --seccomp,
> and I wonder why do you want that.

It's a good question. I could move this check to both "strict" seccomp and
after the BPF verdict is received, but before it's applied, but I fear this
would make the fix more error prone, and way harder to backmerge. So I'm
wondering whether supporting strace --seccomp-bpf for this particular
syscall is a priority.

Eyal.

Eyal Birger Jan. 17, 2025, 7:24 p.m. UTC | #8

On Fri, Jan 17, 2025 at 9:51 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jan 17, 2025 at 6:10 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 01/17, Masami Hiramatsu wrote:
> > >
> > > On Fri, 17 Jan 2025 02:39:28 +0100
> > > Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > > A note for the seccomp maintainers...
> > > >
> > > > I don't know what do you think, but I agree in advance that the very fact this
> > > > patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> > > >
> > >
> > > Indeed. in_ia32_syscall() depends arch/x86 too.
> > > We can add an inline function like;
> > >
> > > ``` uprobes.h
> > > static inline bool is_uprobe_syscall(int syscall)
> > > {
> >
> > We can, and this is what I tried to suggest from the very beginning.
> > But I agree with Eyal who decided to send the most trivial fix for
> > -stable, we can add the helper later.
> >
> > I don't think it should live in uprobes.h and I'd prefer something
> > like arch_seccomp_ignored(int) but I won't insist.
>
> yep, I think this is the way, keeping it as a general category. Should
> we also put rt_sigreturn there explicitly as well? Also, wouldn't it
> be better to have it as a non-arch-specific function for something
> like rt_sigreturn where defining it per each arch is cumbersome, and
> have the default implementation also call into an arch-specific
> function?

I like the more generic approach and keeping CONFIG_X86 out of seccomp,
and more generic than uprobes, however, I'm not sure where a common part
to place it which includes arch/x86/include/asm/syscall.h would be. And
as mentioned before, this would make this bugfix more complex to backport.

For that reason I wouldn't refactor handling rt_sigreturn as part of
this fix.

Thanks!
Eyal.

Andrii Nakryiko Jan. 17, 2025, 7:34 p.m. UTC | #9

On Fri, Jan 17, 2025 at 11:24 AM Eyal Birger <eyal.birger@gmail.com> wrote:
>
> On Fri, Jan 17, 2025 at 9:51 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Jan 17, 2025 at 6:10 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > On 01/17, Masami Hiramatsu wrote:
> > > >
> > > > On Fri, 17 Jan 2025 02:39:28 +0100
> > > > Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > > A note for the seccomp maintainers...
> > > > >
> > > > > I don't know what do you think, but I agree in advance that the very fact this
> > > > > patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
> > > > >
> > > >
> > > > Indeed. in_ia32_syscall() depends arch/x86 too.
> > > > We can add an inline function like;
> > > >
> > > > ``` uprobes.h
> > > > static inline bool is_uprobe_syscall(int syscall)
> > > > {
> > >
> > > We can, and this is what I tried to suggest from the very beginning.
> > > But I agree with Eyal who decided to send the most trivial fix for
> > > -stable, we can add the helper later.
> > >
> > > I don't think it should live in uprobes.h and I'd prefer something
> > > like arch_seccomp_ignored(int) but I won't insist.
> >
> > yep, I think this is the way, keeping it as a general category. Should
> > we also put rt_sigreturn there explicitly as well? Also, wouldn't it
> > be better to have it as a non-arch-specific function for something
> > like rt_sigreturn where defining it per each arch is cumbersome, and
> > have the default implementation also call into an arch-specific
> > function?
>
> I like the more generic approach and keeping CONFIG_X86 out of seccomp,
> and more generic than uprobes, however, I'm not sure where a common part
> to place it which includes arch/x86/include/asm/syscall.h would be. And
> as mentioned before, this would make this bugfix more complex to backport.
>
> For that reason I wouldn't refactor handling rt_sigreturn as part of
> this fix.
>

SGTM, it can always be improved later, if necessary

> Thanks!
> Eyal.

Oleg Nesterov Jan. 18, 2025, 3:05 p.m. UTC | #10

On 01/17, Andrii Nakryiko wrote:
>
> On Fri, Jan 17, 2025 at 6:10 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > We can, and this is what I tried to suggest from the very beginning.
> > But I agree with Eyal who decided to send the most trivial fix for
> > -stable, we can add the helper later.
> >
> > I don't think it should live in uprobes.h and I'd prefer something
> > like arch_seccomp_ignored(int) but I won't insist.
>
> yep, I think this is the way, keeping it as a general category. Should
> we also put rt_sigreturn there explicitly as well? Also, wouldn't it
> be better to have it as a non-arch-specific function for something
> like rt_sigreturn where defining it per each arch is cumbersome, and
> have the default implementation also call into an arch-specific
> function?

I personally don't think we should exclude rt_sigreturn. and I guess
we can't do it in a arch-agnostic way, think of __NR_ia32_sigreturn.

However. These are all good questions that need a separate discussion.
Plus the SECCOMP_RET_TRACE/strace issue raised by Dmitry. And probably
even more.

But IMO it would be better to push the trivial (and urgent) fix to
-stable first, then discuss the possible cleanups/improvements.

What do you think?

Oleg.

Kees Cook Jan. 18, 2025, 8:21 p.m. UTC | #11

On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:
> Since uretprobe is a "kernel implementation detail" system call which is
> not used by userspace application code directly, it is impractical and
> there's very little point in forcing all userspace applications to
> explicitly allow it in order to avoid crashing tracked processes.

How is this any different from sigreturn, rt_sigreturn, or
restart_syscall? These are all handled explicitly by userspace filters
already, and I don't see why uretprobe should be any different. Docker
has had plenty of experience with fixing their seccomp filters for new
syscalls. For example, many times already a given libc will suddenly
start using a new syscall when it sees its available, etc.

Basically, this is a Docker issue, not a kernel issue. Seccomp is
behaving correctly. I don't want to start making syscalls invisible
without an extremely good reason. If _anything_ should be invisible, it
is restart_syscall (which actually IS invisible under certain
architectures).

-Kees

Darrick J. Wong Jan. 18, 2025, 8:31 p.m. UTC | #12

On Sat, Jan 18, 2025 at 12:21:51PM -0800, Kees Cook wrote:
> On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:
> > Since uretprobe is a "kernel implementation detail" system call which is
> > not used by userspace application code directly, it is impractical and
> > there's very little point in forcing all userspace applications to
> > explicitly allow it in order to avoid crashing tracked processes.
> 
> How is this any different from sigreturn, rt_sigreturn, or
> restart_syscall? These are all handled explicitly by userspace filters
> already, and I don't see why uretprobe should be any different. Docker
> has had plenty of experience with fixing their seccomp filters for new
> syscalls. For example, many times already a given libc will suddenly
> start using a new syscall when it sees its available, etc.
> 
> Basically, this is a Docker issue, not a kernel issue. Seccomp is
> behaving correctly. I don't want to start making syscalls invisible
> without an extremely good reason. If _anything_ should be invisible, it
> is restart_syscall (which actually IS invisible under certain
> architectures).

I was wondering that too -- if ______'s security policy is to disallow
by default, then fix the security policy.  Don't blow a hole in seccomp
for all users.  Maybe someone *wants* to block uretprobe.  Maybe doing
so will be needed some day as a crude mitigation for a zeroday.

--D

> -Kees
> 
> -- 
> Kees Cook
>

Eyal Birger Jan. 18, 2025, 8:45 p.m. UTC | #13

Hןת

On Sat, Jan 18, 2025 at 12:21 PM Kees Cook <kees@kernel.org> wrote:
>
> On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:
> > Since uretprobe is a "kernel implementation detail" system call which is
> > not used by userspace application code directly, it is impractical and
> > there's very little point in forcing all userspace applications to
> > explicitly allow it in order to avoid crashing tracked processes.
>
> How is this any different from sigreturn, rt_sigreturn, or
> restart_syscall? These are all handled explicitly by userspace filters
> already, and I don't see why uretprobe should be any different. Docker
> has had plenty of experience with fixing their seccomp filters for new
> syscalls. For example, many times already a given libc will suddenly
> start using a new syscall when it sees its available, etc.

I think the difference is that this syscall is not part of the process's
code - it is inserted there by another process tracing it.
So this is different than desiring to deploy a new version of a binary
that uses a new libc or a new syscall. Here the case is that there are
three players - the tracer running out of docker, the tracee running in docker,
and docker itself. All three were running fine in a specific kernel version,
but upgrading the kernel now crashes the traced process.

>
> Basically, this is a Docker issue, not a kernel issue.

As mentione above, for all three given binaries, nothing changed - only the
kernel version.

> Seccomp is behaving correctly. I don't want to start making syscalls invisible
> without an extremely good reason. If _anything_ should be invisible, it
> is restart_syscall (which actually IS invisible under certain
> architectures).

I think this syscall is different in that respect for the reasons described.
I don't know if seccomp is behaving correctly when it blocks a kernel
implementation detail that isn't user created. IMHO the fact that this
implementation detail is implemented as a syscall is unfortunate, and I'm
trying to mitigate the result.

Eyal.
>
> -Kees
>
> --
> Kees Cook

Kees Cook Jan. 19, 2025, 2:24 a.m. UTC | #14

On January 18, 2025 12:45:47 PM PST, Eyal Birger <eyal.birger@gmail.com> wrote:
>I think the difference is that this syscall is not part of the process's
>code - it is inserted there by another process tracing it.

Well that's nothing like syscall_restart, and now I'm convinced seccomp must never ignore uretprobe -- a process might want to block uretprobe!

So, no, sorry, this needs to be handled by the seccomp policy that is applied to the process.

>So this is different than desiring to deploy a new version of a binary
>that uses a new libc or a new syscall.

Uh, no, the case I used as an example was no changes to anything except the kernel. Libc noticed the available syscall, uses it, and is instantly killed by the Docker seccomp policy which didn't know about that syscall.

> Here the case is that there are
>three players - the tracer running out of docker, the tracee running in docker,
>and docker itself. All three were running fine in a specific kernel version,
>but upgrading the kernel now crashes the traced process.

If uretprobe used to work without a syscall, then that seems to be the problem. But I think easiest is just fixing the Docker policy. (Which is a text file configuration change; no new binaries, no rebuilds!).

>I think this syscall is different in that respect for the reasons described.

I don't agree, sorry. Seccomp has a really singular and specific purpose, which is explicitly *externalizing* policy. I do not want to have policy within seccomp itself.

>I don't know if seccomp is behaving correctly when it blocks a kernel
>implementation detail that isn't user created.

But it is user created? Something added a uretprobe to a process who's seccomp policy is not expecting it. This seems precisely by design.

-Kees

Eyal Birger Jan. 19, 2025, 3:39 a.m. UTC | #15

Hi,

Thank you for the detailed response.

On Sat, Jan 18, 2025 at 6:25 PM Kees Cook <kees@kernel.org> wrote:

>
> On January 18, 2025 12:45:47 PM PST, Eyal Birger <eyal.birger@gmail.com> wrote:
> >I think the difference is that this syscall is not part of the process's
> >code - it is inserted there by another process tracing it.
>
> Well that's nothing like syscall_restart, and now I'm convinced seccomp must never ignore uretprobe -- a process might want to block uretprobe!
>

I think I understand your point. But do you think this is intentional?
i.e. seccomp couldn't have been used to block uretprobes before this
syscall implementation afaict.

> So, no, sorry, this needs to be handled by the seccomp policy that is applied to the process.
>

The problem we're facing is that existing workloads are breaking, and
as mentioned I'm not sure how practical it is to demand replacing a
working docker environment because of a new syscall that was added for
performance reasons.

> >So this is different than desiring to deploy a new version of a binary
> >that uses a new libc or a new syscall.
>
> Uh, no, the case I used as an example was no changes to anything except the kernel. Libc noticed the available syscall, uses it, and is instantly killed by the Docker seccomp policy which didn't know about that syscall.
>

That's an interesting situation and quite unexpected :) I'm glad I didn't
have to face that one in production.

> > Here the case is that there are
> >three players - the tracer running out of docker, the tracee running in docker,
> >and docker itself. All three were running fine in a specific kernel version,
> >but upgrading the kernel now crashes the traced process.
>
> If uretprobe used to work without a syscall, then that seems to be the problem.

I agree.

> But I think easiest is just fixing the Docker policy. (Which is a text file configuration change; no new binaries, no rebuilds!).

As far as I can tell libseccomp needs to provide support for this new
syscall and a new docker version would need to be deployed, so It's not
just a configuration change. Also the default policy which comes packed in
docker would probably need to be changed to avoid having to explicitly
provide a seccomp configuration for each deployment.

>
> >I think this syscall is different in that respect for the reasons described.
>
> I don't agree, sorry. Seccomp has a really singular and specific purpose, which is explicitly *externalizing* policy. I do not want to have policy within seccomp itself.
>

Understood.

> >I don't know if seccomp is behaving correctly when it blocks a kernel
> >implementation detail that isn't user created.
>
> But it is user created? Something added a uretprobe to a process who's seccomp policy is not expecting it. This seems precisely by design.

I think I wasn't accurate in my wording.
The uretprobe syscall is added to the tracee by the kernel.
The tracer itself is merely requesting to attach a uretprobe bpf
function. In previous versions, this was implemented by the kernel
installing an int3 instruction, and in the new implementation the kernel
is installing a uretprobe syscall.
The "user" in this case - the tracer program - didn't deliberately install
the syscall, but anyway this is semantics.

I think I understand your point that it is regarded as "policy", only that
it creates a problem in actual deployments, where in order to be able to
run the tracer software which has been working on newer kernels a new docker
has to be deployed.

I'm trying to find a pragmatic solution to this problem, and I understand
the motivation to avoid policy in seccomp.

Alternatively, maybe this syscall implementation should be reverted?

Thanks again,
Eyal.

Jiri Olsa Jan. 19, 2025, 10:44 a.m. UTC | #16

On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:
> Hi,
> 
> Thank you for the detailed response.
> 
> On Sat, Jan 18, 2025 at 6:25 PM Kees Cook <kees@kernel.org> wrote:
> 
> >
> > On January 18, 2025 12:45:47 PM PST, Eyal Birger <eyal.birger@gmail.com> wrote:
> > >I think the difference is that this syscall is not part of the process's
> > >code - it is inserted there by another process tracing it.
> >
> > Well that's nothing like syscall_restart, and now I'm convinced seccomp must never ignore uretprobe -- a process might want to block uretprobe!
> >
> 
> I think I understand your point. But do you think this is intentional?
> i.e. seccomp couldn't have been used to block uretprobes before this
> syscall implementation afaict.
> 
> > So, no, sorry, this needs to be handled by the seccomp policy that is applied to the process.
> >
> 
> The problem we're facing is that existing workloads are breaking, and
> as mentioned I'm not sure how practical it is to demand replacing a
> working docker environment because of a new syscall that was added for
> performance reasons.
> 
> > >So this is different than desiring to deploy a new version of a binary
> > >that uses a new libc or a new syscall.
> >
> > Uh, no, the case I used as an example was no changes to anything except the kernel. Libc noticed the available syscall, uses it, and is instantly killed by the Docker seccomp policy which didn't know about that syscall.
> >
> 
> That's an interesting situation and quite unexpected :) I'm glad I didn't
> have to face that one in production.
> 
> > > Here the case is that there are
> > >three players - the tracer running out of docker, the tracee running in docker,
> > >and docker itself. All three were running fine in a specific kernel version,
> > >but upgrading the kernel now crashes the traced process.
> >
> > If uretprobe used to work without a syscall, then that seems to be the problem.
> 
> I agree.
> 
> > But I think easiest is just fixing the Docker policy. (Which is a text file configuration change; no new binaries, no rebuilds!).
> 
> As far as I can tell libseccomp needs to provide support for this new
> syscall and a new docker version would need to be deployed, so It's not
> just a configuration change. Also the default policy which comes packed in
> docker would probably need to be changed to avoid having to explicitly
> provide a seccomp configuration for each deployment.
> 
> >
> > >I think this syscall is different in that respect for the reasons described.
> >
> > I don't agree, sorry. Seccomp has a really singular and specific purpose, which is explicitly *externalizing* policy. I do not want to have policy within seccomp itself.
> >
> 
> Understood.
> 
> > >I don't know if seccomp is behaving correctly when it blocks a kernel
> > >implementation detail that isn't user created.
> >
> > But it is user created? Something added a uretprobe to a process who's seccomp policy is not expecting it. This seems precisely by design.
> 
> I think I wasn't accurate in my wording.
> The uretprobe syscall is added to the tracee by the kernel.
> The tracer itself is merely requesting to attach a uretprobe bpf
> function. In previous versions, this was implemented by the kernel
> installing an int3 instruction, and in the new implementation the kernel
> is installing a uretprobe syscall.
> The "user" in this case - the tracer program - didn't deliberately install
> the syscall, but anyway this is semantics.

that's correct, uretprobe syscall is installed by kernel to special user
memory map and it can be executed only from there and if process calls it
from another place it receives sigill

so at the end the process executes the uretprobe syscall, but it's up to
kernel to decide that and set it up..  but I don't know if that's strong
enough reason for seccomp to ignore the syscall

> 
> I think I understand your point that it is regarded as "policy", only that
> it creates a problem in actual deployments, where in order to be able to
> run the tracer software which has been working on newer kernels a new docker
> has to be deployed.
> 
> I'm trying to find a pragmatic solution to this problem, and I understand
> the motivation to avoid policy in seccomp.

I could think of sysctl for that.. you complained earlier about weird
semantics for that [1], but I think it's better than to remove it

jirka

> 
> Alternatively, maybe this syscall implementation should be reverted?
> 
> Thanks again,
> Eyal.

[1] https://lore.kernel.org/bpf/CAHsH6Gs03iJt-ziWt5Bye_DuqCbk3TpMmgPbkYh64XBvpGaDtw@mail.gmail.com/

Oleg Nesterov Jan. 19, 2025, 12:40 p.m. UTC | #17

On 01/18, Kees Cook wrote:
>
> On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:
> > Since uretprobe is a "kernel implementation detail" system call which is
> > not used by userspace application code directly, it is impractical and
> > there's very little point in forcing all userspace applications to
> > explicitly allow it in order to avoid crashing tracked processes.
>
> How is this any different from sigreturn, rt_sigreturn, or
> restart_syscall? These are all handled explicitly by userspace filters
> already, and I don't see why uretprobe should be any different.

The only difference is that sys_uretprobe() is new and existing setups
doesn't know about it. Suppose you have

	int func(void)
	{
		return 123;
	}

	int main(void)
	{
		seccomp(SECCOMP_SET_MODE_STRICT, 0,0);
		for (;;)
			func();
	}

and it runs with func() uretprobed.

If you install the new kernel, this application will crash immediately.

I understand your objections, but what do you think we can do instead?
I don't think a new "try_to_speedup_uretprobes_at_your_own_risk" sysctl
makes sense, it will be almost never enabled...

Oleg.

Andy Lutomirski Jan. 19, 2025, 6:36 p.m. UTC | #18

On Sat, Jan 18, 2025 at 6:25 PM Kees Cook <kees@kernel.org> wrote:
>
>
>
> On January 18, 2025 12:45:47 PM PST, Eyal Birger <eyal.birger@gmail.com> wrote:
> >I think the difference is that this syscall is not part of the process's
> >code - it is inserted there by another process tracing it.
>
> Well that's nothing like syscall_restart, and now I'm convinced seccomp must never ignore uretprobe -- a process might want to block uretprobe!
>

I've been contemplating this.  uretprobe is a very odd syscall: it's a
syscall that emulates a breakpoint.  So, before uretprobe-the-syscall
was added, the process would breakpoint via a non-syscall vector, and
the tracing code would do its thing, and seccomp would be none the
wiser.

There's a distinction between different types of operations that
seccomp is entirely unaware of right now: is the task trying to:

a) do something *to itself*

b) do something that doesn't have meaningful side effects on the rest
of the world, at least in a non-buggy kernel, but where the process is
actually trying to restrict its own actions

c) interacting with something outside the process, that has privilege
over the process, where the interaction in question should absolutely
be subject to security policy, but that security policy really ought
to apply to the outside process.

uretprobe is very much in category c, and it's kind of unique in this
sense *as a syscall*.  But there are plenty of other examples that
just happen to not be syscalls.  For example, ptrace breakpoints use
the #DB vector, which isn't a syscall.

Here are few factors that may be vaguely relevant:

 - uretprobe is conceptually a bit like sigreturn in the sense that
both of them are having the kernel help with something that process
can kind-of-sort-of do all by itself.

 - BUT: sigreturn is not supposed to have side effects reaching
outside the calling task.  uretprobe does, and that's the whole point.

 - uretprobe-the-syscall is, in a rather optimistic sense, obsolete.
Once FRED becomes common (if ever...), it won't really serve much
purpose any more.  FRED, for those not watching, at least in theory,
makes "event delivery" and "return" fast, for all (hah!) events.
Including breakpoints.  And returns to usermode where rcx != rip, etc.

So I don't know what the right answer is.  There's a real argument to
be made that seccomp ought to decide that its policy permits whomever
installed the uretprobe to do so, and that this decision means that
the uretprobe machinery is therefore permissible.

Kees Cook Jan. 20, 2025, 9:32 p.m. UTC | #19

On Sun, Jan 19, 2025 at 01:40:22PM +0100, Oleg Nesterov wrote:
> On 01/18, Kees Cook wrote:
> >
> > On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:
> > > Since uretprobe is a "kernel implementation detail" system call which is
> > > not used by userspace application code directly, it is impractical and
> > > there's very little point in forcing all userspace applications to
> > > explicitly allow it in order to avoid crashing tracked processes.
> >
> > How is this any different from sigreturn, rt_sigreturn, or
> > restart_syscall? These are all handled explicitly by userspace filters
> > already, and I don't see why uretprobe should be any different.
> 
> The only difference is that sys_uretprobe() is new and existing setups
> doesn't know about it. Suppose you have
> 
> 	int func(void)
> 	{
> 		return 123;
> 	}
> 
> 	int main(void)
> 	{
> 		seccomp(SECCOMP_SET_MODE_STRICT, 0,0);
> 		for (;;)
> 			func();
> 	}
> 
> and it runs with func() uretprobed.
> 
> If you install the new kernel, this application will crash immediately.
> 
> I understand your objections, but what do you think we can do instead?
> I don't think a new "try_to_speedup_uretprobes_at_your_own_risk" sysctl
> makes sense, it will be almost never enabled...

This seems like a uretprobes design problem. If it's going to use
syscalls, it must take things like seccomp into account.
SECCOMP_SET_MODE_STRICT will also crash in the face of syscall_restart...

Kees Cook Jan. 20, 2025, 9:34 p.m. UTC | #20

On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:
> Alternatively, maybe this syscall implementation should be reverted?

Honestly, that seems the best choice. I don't think any thought was
given to how it would interact with syscall interposers (including
ptrace, strict mode seccomp, etc).

Jiri Olsa Jan. 21, 2025, 2:38 p.m. UTC | #21

On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:

SNIP

> I think I wasn't accurate in my wording.
> The uretprobe syscall is added to the tracee by the kernel.
> The tracer itself is merely requesting to attach a uretprobe bpf
> function. In previous versions, this was implemented by the kernel
> installing an int3 instruction, and in the new implementation the kernel
> is installing a uretprobe syscall.
> The "user" in this case - the tracer program - didn't deliberately install
> the syscall, but anyway this is semantics.
> 
> I think I understand your point that it is regarded as "policy", only that
> it creates a problem in actual deployments, where in order to be able to
> run the tracer software which has been working on newer kernels a new docker
> has to be deployed.
> 
> I'm trying to find a pragmatic solution to this problem, and I understand
> the motivation to avoid policy in seccomp.
> 
> Alternatively, maybe this syscall implementation should be reverted?

you mentioned in the previous reply:

> > As far as I can tell libseccomp needs to provide support for this new
> >  syscall and a new docker version would need to be deployed, so It's not
> > just a configuration change. Also the default policy which comes packed in
> > docker would probably need to be changed to avoid having to explicitly
> > provide a seccomp configuration for each deployment.

please disregard if this is too stupid.. but could another way out be just
to disable it (easy to do) and meanwhile teach libseccomp to allow uretprobe
(or whatever mechanism needs to be added to libseccomp) plus the needed
docker change ... to minimize the impact ? 

or there's just too many other seccomp user space libraries

I'm still trying to come up with some other solution but wanted
to exhaust all the options I could think of

thanks,
jirka

Eyal Birger Jan. 21, 2025, 2:47 p.m. UTC | #22

Hi,

On Tue, Jan 21, 2025 at 6:38 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:
>
> SNIP
>
> > I think I wasn't accurate in my wording.
> > The uretprobe syscall is added to the tracee by the kernel.
> > The tracer itself is merely requesting to attach a uretprobe bpf
> > function. In previous versions, this was implemented by the kernel
> > installing an int3 instruction, and in the new implementation the kernel
> > is installing a uretprobe syscall.
> > The "user" in this case - the tracer program - didn't deliberately install
> > the syscall, but anyway this is semantics.
> >
> > I think I understand your point that it is regarded as "policy", only that
> > it creates a problem in actual deployments, where in order to be able to
> > run the tracer software which has been working on newer kernels a new docker
> > has to be deployed.
> >
> > I'm trying to find a pragmatic solution to this problem, and I understand
> > the motivation to avoid policy in seccomp.
> >
> > Alternatively, maybe this syscall implementation should be reverted?
>
> you mentioned in the previous reply:
>
> > > As far as I can tell libseccomp needs to provide support for this new
> > >  syscall and a new docker version would need to be deployed, so It's not
> > > just a configuration change. Also the default policy which comes packed in
> > > docker would probably need to be changed to avoid having to explicitly
> > > provide a seccomp configuration for each deployment.
>
> please disregard if this is too stupid.. but could another way out be just
> to disable it (easy to do) and meanwhile teach libseccomp to allow uretprobe
> (or whatever mechanism needs to be added to libseccomp) plus the needed
> docker change ... to minimize the impact ?

Right. the patch I was thinking to suggest wouldn't revert the entire
thing, but instead disable its use for now and allow a careful
reconsideration of the available options.

If that makes sense, I'll post it.

>
> or there's just too many other seccomp user space libraries

I think in theory, the example of a simple binary using "restrict" mode
makes it problematic to assume that this can be fixed solely from userspace
i.e. for such binary, uretprobes would still work in one kernel version and
break on another. It's hard to tell how common this is.

>
> I'm still trying to come up with some other solution but wanted
> to exhaust all the options I could think of
>
> thanks,
> jirka

Oleg Nesterov Jan. 21, 2025, 3:28 p.m. UTC | #23

On 01/20, Kees Cook wrote:
>
> > The only difference is that sys_uretprobe() is new and existing setups
> > doesn't know about it. Suppose you have
> >
> > 	int func(void)
> > 	{
> > 		return 123;
> > 	}
> >
> > 	int main(void)
> > 	{
> > 		seccomp(SECCOMP_SET_MODE_STRICT, 0,0);
> > 		for (;;)
> > 			func();
> > 	}
> >
> > and it runs with func() uretprobed.
> >
> > If you install the new kernel, this application will crash immediately.
> >
> > I understand your objections, but what do you think we can do instead?
> > I don't think a new "try_to_speedup_uretprobes_at_your_own_risk" sysctl
> > makes sense, it will be almost never enabled...
>
> This seems like a uretprobes design problem. If it's going to use
> syscalls, it must take things like seccomp into account.

True. I reviewed that patch, and I forgot about seccomp too.

> SECCOMP_SET_MODE_STRICT will also crash in the face of syscall_restart...

Yes, I guess SECCOMP_SET_MODE_STRICT assumes that read/write can't return
ERESTART_RESTARTBLOCK.

But again, what can we do right now?

I do not like the idea to revert the patch which adds sys_uretprobe().
Don't get me wrong, I do not use uprobes, so personally I don't really
care about the performance improvements it adds. Not to mention FRED,
although I have no idea when it will be available.

Lets forget about sys_uretprobe(). Lets suppose the kernel doesn't have
ERESTART_RESTARTBLOCK/sys_restart_syscall and we want to add this feature
today.

How do you think we can do this without breaking the existing setups which
use seccomp ?

Oleg.

Steven Rostedt Jan. 21, 2025, 4:16 p.m. UTC | #24

[ Watching this with popcorn from the sidelines, but I'll chime in anyway ]

On Tue, 21 Jan 2025 15:38:48 +0100
Jiri Olsa <olsajiri@gmail.com> wrote:

> I'm still trying to come up with some other solution but wanted
> to exhaust all the options I could think of

I think this may have been mentioned, but is there a way that the kernel
could know that this system call is being monitored by seccomp, and if so,
just stick with the interrupt version? If not, enable the system call?

-- Steve

Oleg Nesterov Jan. 21, 2025, 4:44 p.m. UTC | #25

On 01/21, Steven Rostedt wrote:
>
> I think this may have been mentioned, but is there a way that the kernel
> could know that this system call is being monitored by seccomp, and if so,
> just stick with the interrupt version? If not, enable the system call?

Consider

	int func_to_uretprobe()
	{
		seccomp(SECCOMP_SET_MODE_STRICT/whatever);
		return 123;
	}

by the time it is called, the kernel can't know that this function will
call seccomp/install-the-filters/etc, so prepare_uretprobe() can't know
if it is safe to use uretprobe or not.

Oleg.

Jiri Olsa Jan. 21, 2025, 4:55 p.m. UTC | #26

On Tue, Jan 21, 2025 at 11:16:31AM -0500, Steven Rostedt wrote:
> 
> [ Watching this with popcorn from the sidelines, but I'll chime in anyway ]
> 
> On Tue, 21 Jan 2025 15:38:48 +0100
> Jiri Olsa <olsajiri@gmail.com> wrote:
> 
> > I'm still trying to come up with some other solution but wanted
> > to exhaust all the options I could think of
> 
> I think this may have been mentioned, but is there a way that the kernel
> could know that this system call is being monitored by seccomp, and if so,
> just stick with the interrupt version? If not, enable the system call?

yes [1], the problem with that solution is that we install uretprobe
trampoline at function's uprobe entry probe, so we won't catch case
where seccomp is enabled in this probed function, like:

  foo
    uprobe -> install uretprobe trampoline
    ...
    seccomp(SECCOMP_MODE_STRICT..
    ...
    ret -> execute uretprobe trampoline with sys_uretprobe


I thought we could perhaps switch existing uretprobe trampoline to
int3 when we are in sys_seccomp, but another user thread might be
already executing the existing uretprobe trampoline, so I don't
think we can do that 

jirka


[1] https://lore.kernel.org/bpf/20250114123257.GD19816@redhat.com/

Andrii Nakryiko Jan. 21, 2025, 10:38 p.m. UTC | #27

On Tue, Jan 21, 2025 at 8:55 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Jan 21, 2025 at 11:16:31AM -0500, Steven Rostedt wrote:
> >
> > [ Watching this with popcorn from the sidelines, but I'll chime in anyway ]
> >
> > On Tue, 21 Jan 2025 15:38:48 +0100
> > Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > > I'm still trying to come up with some other solution but wanted
> > > to exhaust all the options I could think of
> >
> > I think this may have been mentioned, but is there a way that the kernel
> > could know that this system call is being monitored by seccomp, and if so,
> > just stick with the interrupt version? If not, enable the system call?
>
> yes [1], the problem with that solution is that we install uretprobe
> trampoline at function's uprobe entry probe, so we won't catch case
> where seccomp is enabled in this probed function, like:
>
>   foo
>     uprobe -> install uretprobe trampoline
>     ...
>     seccomp(SECCOMP_MODE_STRICT..
>     ...
>     ret -> execute uretprobe trampoline with sys_uretprobe
>
>
> I thought we could perhaps switch existing uretprobe trampoline to
> int3 when we are in sys_seccomp, but another user thread might be
> already executing the existing uretprobe trampoline, so I don't
> think we can do that

Jiri,

We should abandon the vector of "let's try to detect whether someone
is blocking sys_uretprobe" as a solution, I don't believe it's
possible. Blocking sys_uretprobe is too dynamic of a thing. There is
an arbitrary periods of time between adding uretprobe trampoline
(i.e., sys_uretprobe) and actually disabling sys_uretprobe through
seccomp (or even BPF: LSM or even kprobes can do that, why not?), and
userspace can flip this decision many times over.

And as Oleg said, sysctl
"please-make-my-uretprobe-2x-faster-assuming-i-know-about-this-option"
makes no sense either, this will basically almost never get enabled.

Kees,

You said yourself that sys_uretprobe is no different from rt_sigreturn
and restart_syscall, so why would we rollback sys_uretprobe if we
wouldn't rollback rt_sigreturn/restart_syscall? Given it's impossible,
generally speaking, to know if userspace is blocking the syscall (and
that can change dynamically and very frequently), any improvement or
optimization that kernel would do with the help of special syscall is
now prohibited, effectively. That doesn't seem wise to restrict the
kernel development so much just because libseccomp blocks any unknown
syscall by default.

I'm OK either asking libseccomp to learn about sys_uretprobe and not
block it (like systemd is doing), or if we want to bend over
backwards, prevent user policy from filtering theses special syscalls
which are meant to be used by kernel only. We can't single out
sys_uretprobe just because it's the newest of this special cohort.

You also asked "what if userspace wants to block uprobes"? If that's
really the goal, that would be done at uprobe attachment time, not
when uprobe is (conceptually) attached, new process is forked, and
kernel installs uretprobe trampoline with uretprobe syscall. Or just
control that through (lack of) capabilities. Using seccomp to block
*second part of uretprobe handling* doesn't make much sense. It's just
the wrong place for that.

P.S. Also using FRED as an excuse for not doing sys_uretprobe is
manipulative. When we get FRED-enabled CPUs widely available and
deployed *and* all (or at least majority of) the currently used CPUs
are decommissioned, only then we can realistically talk about
sys_uretprobe being unnecessary. That's years and years. sys_uretprobe
is necessary and important *right now* and will be for the foreseeable
future.

>
> jirka
>
>
> [1] https://lore.kernel.org/bpf/20250114123257.GD19816@redhat.com/

Steven Rostedt Jan. 21, 2025, 10:46 p.m. UTC | #28

On Tue, 21 Jan 2025 14:38:09 -0800
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> You said yourself that sys_uretprobe is no different from rt_sigreturn
> and restart_syscall, so why would we rollback sys_uretprobe if we
> wouldn't rollback rt_sigreturn/restart_syscall? Given it's impossible,
> generally speaking, to know if userspace is blocking the syscall (and
> that can change dynamically and very frequently), any improvement or
> optimization that kernel would do with the help of special syscall is
> now prohibited, effectively. That doesn't seem wise to restrict the
> kernel development so much just because libseccomp blocks any unknown
> syscall by default.

What happens if the system call is hit when there was no uprobe attached to
it? Perhaps it should segfault? That is, this system call is only used when
the kernel attaches it, if the kernel did not attach it, perhaps there's
some malicious code that is trying to use it for some CVE corner case. But
if it always crashes when added, the only thing the malicious code can do
by adding this system call is to crash the application. That shouldn't be a
problem, as if malicious code can add a system call, it can also change the
code to crash as well.

Perhaps the security folks would feel better if there were other
protections against this system call when not used as expected?

-- Steve

Eyal Birger Jan. 21, 2025, 11:13 p.m. UTC | #29

On Tue, Jan 21, 2025 at 2:46 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 21 Jan 2025 14:38:09 -0800
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > You said yourself that sys_uretprobe is no different from rt_sigreturn
> > and restart_syscall, so why would we rollback sys_uretprobe if we
> > wouldn't rollback rt_sigreturn/restart_syscall? Given it's impossible,
> > generally speaking, to know if userspace is blocking the syscall (and
> > that can change dynamically and very frequently), any improvement or
> > optimization that kernel would do with the help of special syscall is
> > now prohibited, effectively. That doesn't seem wise to restrict the
> > kernel development so much just because libseccomp blocks any unknown
> > syscall by default.
>
> What happens if the system call is hit when there was no uprobe attached to
> it? Perhaps it should segfault? That is, this system call is only used when
> the kernel attaches it, if the kernel did not attach it, perhaps there's
> some malicious code that is trying to use it for some CVE corner case. But
> if it always crashes when added, the only thing the malicious code can do
> by adding this system call is to crash the application. That shouldn't be a
> problem, as if malicious code can add a system call, it can also change the
> code to crash as well.
>
> Perhaps the security folks would feel better if there were other
> protections against this system call when not used as expected?

Isn't that the case already, or maybe I misunderstood what Jiri wrote [1]:

> On Sun, Jan 19, 2025 at 2:44 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> that's correct, uretprobe syscall is installed by kernel to special user
> memory map and it can be executed only from there and if process calls it
> from another place it receives sigill

Eyal.

[1] https://lore.kernel.org/lkml/Z4zXlaEMPbiYYlQ8@krava/

Steven Rostedt Jan. 21, 2025, 11:29 p.m. UTC | #30

On Tue, 21 Jan 2025 15:13:52 -0800
Eyal Birger <eyal.birger@gmail.com> wrote:

> Isn't that the case already, or maybe I misunderstood what Jiri wrote [1]:
> 
> > On Sun, Jan 19, 2025 at 2:44 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > that's correct, uretprobe syscall is installed by kernel to special user
> > memory map and it can be executed only from there and if process calls it
> > from another place it receives sigill  
> 
> Eyal.
> 
> [1] https://lore.kernel.org/lkml/Z4zXlaEMPbiYYlQ8@krava/

Ah, he did. Thanks I missed that:

> that's correct, uretprobe syscall is installed by kernel to special user
> memory map and it can be executed only from there and if process calls it
> from another place it receives sigill

-- Steve

Eyal Birger Jan. 27, 2025, 7:24 p.m. UTC | #31

Hi Kees,

On Mon, Jan 20, 2025 at 1:34 PM Kees Cook <kees@kernel.org> wrote:
>
> On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:
> > Alternatively, maybe this syscall implementation should be reverted?
>
> Honestly, that seems the best choice. I don't think any thought was
> given to how it would interact with syscall interposers (including
> ptrace, strict mode seccomp, etc).

I don't know if you noticed Andrii's and others' comments on this [1].

Given that:
- this issue requires immediate remediation
- there seems to be pushback for reverting the syscall implementation
- filtering uretprobe is not within the capabilities of seccomp without this
  syscall (so reverting the syscall is equivalent to just passing it through
  seccomp)

is it possible to consider applying this current fix, with the possibility of
extending seccomp in the future to support filtering uretprobe if deemed
necessary (for example by allowing userspace to define a stricter policy)?

Thanks,
Eyal.

[1] https://lore.kernel.org/lkml/20250121182939.33d05470@gandalf.local.home/T/#me2676c378eff2d6a33f3054fed4a5f3afa64e65b

Kees Cook Jan. 27, 2025, 7:33 p.m. UTC | #32

On Mon, Jan 27, 2025 at 11:24:02AM -0800, Eyal Birger wrote:
> Hi Kees,
> 
> On Mon, Jan 20, 2025 at 1:34 PM Kees Cook <kees@kernel.org> wrote:
> >
> > On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:
> > > Alternatively, maybe this syscall implementation should be reverted?
> >
> > Honestly, that seems the best choice. I don't think any thought was
> > given to how it would interact with syscall interposers (including
> > ptrace, strict mode seccomp, etc).
> 
> I don't know if you noticed Andrii's and others' comments on this [1].
> 
> Given that:
> - this issue requires immediate remediation
> - there seems to be pushback for reverting the syscall implementation
> - filtering uretprobe is not within the capabilities of seccomp without this
>   syscall (so reverting the syscall is equivalent to just passing it through
>   seccomp)
> 
> is it possible to consider applying this current fix, with the possibility of
> extending seccomp in the future to support filtering uretprobe if deemed
> necessary (for example by allowing userspace to define a stricter policy)?

I still think this is a Docker problem, but I agree that uretprobe
without syscall is just as unfilterable as seccomp ignoring the syscall.

Can you please update the patch to use the existing action_cache bitmaps
instead of adding an open-coded check? We can consider adding
syscall_restart to this as well in the future...

Eyal Birger Jan. 27, 2025, 7:39 p.m. UTC | #33

On Mon, Jan 27, 2025 at 11:33 AM Kees Cook <kees@kernel.org> wrote:
>
> On Mon, Jan 27, 2025 at 11:24:02AM -0800, Eyal Birger wrote:
> > Hi Kees,
> >
> > On Mon, Jan 20, 2025 at 1:34 PM Kees Cook <kees@kernel.org> wrote:
> > >
> > > On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:
> > > > Alternatively, maybe this syscall implementation should be reverted?
> > >
> > > Honestly, that seems the best choice. I don't think any thought was
> > > given to how it would interact with syscall interposers (including
> > > ptrace, strict mode seccomp, etc).
> >
> > I don't know if you noticed Andrii's and others' comments on this [1].
> >
> > Given that:
> > - this issue requires immediate remediation
> > - there seems to be pushback for reverting the syscall implementation
> > - filtering uretprobe is not within the capabilities of seccomp without this
> >   syscall (so reverting the syscall is equivalent to just passing it through
> >   seccomp)
> >
> > is it possible to consider applying this current fix, with the possibility of
> > extending seccomp in the future to support filtering uretprobe if deemed
> > necessary (for example by allowing userspace to define a stricter policy)?
>
> I still think this is a Docker problem, but I agree that uretprobe
> without syscall is just as unfilterable as seccomp ignoring the syscall.
>
> Can you please update the patch to use the existing action_cache bitmaps
> instead of adding an open-coded check? We can consider adding
> syscall_restart to this as well in the future...

I can. The main difference as far as I can tell is that it would not
apply to strict mode. Is that OK? it means that existing binaries using
strict mode would still crash if uretprobe is attached to them.

Eyal.

Kees Cook Jan. 27, 2025, 7:43 p.m. UTC | #34

On Mon, Jan 27, 2025 at 11:39:44AM -0800, Eyal Birger wrote:
> On Mon, Jan 27, 2025 at 11:33 AM Kees Cook <kees@kernel.org> wrote:
> >
> > On Mon, Jan 27, 2025 at 11:24:02AM -0800, Eyal Birger wrote:
> > > Hi Kees,
> > >
> > > On Mon, Jan 20, 2025 at 1:34 PM Kees Cook <kees@kernel.org> wrote:
> > > >
> > > > On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:
> > > > > Alternatively, maybe this syscall implementation should be reverted?
> > > >
> > > > Honestly, that seems the best choice. I don't think any thought was
> > > > given to how it would interact with syscall interposers (including
> > > > ptrace, strict mode seccomp, etc).
> > >
> > > I don't know if you noticed Andrii's and others' comments on this [1].
> > >
> > > Given that:
> > > - this issue requires immediate remediation
> > > - there seems to be pushback for reverting the syscall implementation
> > > - filtering uretprobe is not within the capabilities of seccomp without this
> > >   syscall (so reverting the syscall is equivalent to just passing it through
> > >   seccomp)
> > >
> > > is it possible to consider applying this current fix, with the possibility of
> > > extending seccomp in the future to support filtering uretprobe if deemed
> > > necessary (for example by allowing userspace to define a stricter policy)?
> >
> > I still think this is a Docker problem, but I agree that uretprobe
> > without syscall is just as unfilterable as seccomp ignoring the syscall.
> >
> > Can you please update the patch to use the existing action_cache bitmaps
> > instead of adding an open-coded check? We can consider adding
> > syscall_restart to this as well in the future...
> 
> I can. The main difference as far as I can tell is that it would not
> apply to strict mode. Is that OK? it means that existing binaries using
> strict mode would still crash if uretprobe is attached to them.

Ah, good point. Please also add it to mode1_syscalls for strict. :)

seccomp: passthrough uretprobe systemcall without filtering

Commit Message

Comments

Patch